Method description (v1)
This version of the JavaScript API is considered outdated; we recommend using API version 2.
Note that some methods require the yield keyword.
yield this.request()
yield this.request(method, url, queryParams, opts)
Getting an HTTP response for a request, with the following arguments:
- method - request method (GET, POST, ...)
- url - request URL
- queryParams - hash with GET parameters or a hash with the POST request body
- opts - hash with request options
If the POST method is used, the request body can be passed in two ways:
- simply by listing the variable names and their values in queryParams. For example:
{
key: set.query,
id: 1234,
type: 'text'
}
- through the body variable in opts. For example:
body: 'key=' + set.query + '&id=1234&type=text'
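The two forms produce the same request body; serializing a queryParams-style hash into a body string can be sketched in plain JavaScript (an illustration assuming standard URL encoding; the toBody name is hypothetical):

```javascript
// Serialize a hash of variables into a POST body string (illustrative).
function toBody(params) {
  return Object.entries(params)
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join('&');
}

// { key: 'test', id: 1234, type: 'text' } → 'key=test&id=1234&type=text'
```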
opts.check_content
check_content: [ condition1, condition2, ...]
- an array of conditions for checking the received content; if a check fails, the request will be retried with a different proxy.
Capabilities:
- using strings as conditions (search by substring)
- using regular expressions as conditions
- using your own verification functions, which receive data and response headers
- you can specify several different types of conditions at once
- for logical negation, put the condition in an array, i.e.
check_content: ['xxxx', [/yyyy/]]
means that the request will be considered successful only if the received data contains the substring xxxx and, at the same time, the regular expression /yyyy/ finds no matches on the page
All specified checks must pass for a successful request.
Example (the comments indicate what is needed for the request to be considered successful):
let response = yield this.request('GET', set.query, {}, {
check_content: [
/<\/html>|<\/body>/, //this regular expression must match on the received page
['XXXX'], //the received page must not contain this substring
'</html>', //the received page must contain this substring
(data, hdr) => {
return hdr.Status == 200 && data.length > 100;
} //this function must return true
]
});
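The condition semantics above can be sketched as a plain-JavaScript evaluator (an illustration of the documented behavior; this is not A-Parser's actual implementation):

```javascript
// Evaluate a check_content-style condition list against response data.
// Strings match by substring, regexes by test(), functions by their
// return value; a condition wrapped in an array is negated.
function checkContent(conditions, data, headers) {
  return conditions.every(cond => {
    let negate = false;
    if (Array.isArray(cond)) {
      negate = true;
      cond = cond[0];
    }
    let ok;
    if (typeof cond === 'string') ok = data.includes(cond);
    else if (cond instanceof RegExp) ok = cond.test(data);
    else ok = Boolean(cond(data, headers));
    return negate ? !ok : ok;
  });
}
```

For example, `checkContent(['xxxx', [/yyyy/]], 'page with xxxx', {})` is true, while the same conditions fail on a page that also matches /yyyy/.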
opts.decode
decode: 'auto-html'
- automatic detection of encoding and conversion to utf8
Possible values:
- auto-html - based on headers, meta tags, and page content (the optimal, recommended option)
- utf8 - indicates that the document is in utf8 encoding
- <encoding> - any other encoding
opts.headers
headers: { ... }
- hash with headers, header name is specified in lowercase, you can specify cookie as well
Example:
headers: {
accept: 'image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8',
'accept-encoding': 'gzip, deflate, br',
cookie: 'a=321; b=test',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}
opts.headers_order
headers_order: ['cookie', 'user-agent', ...]
- allows you to override the order of header sorting
opts.recurse
recurse: N
- the maximum number of redirects, by default 7, use 0 to disable redirects
opts.proxyretries
proxyretries: N
- the number of attempts to execute a request, by default taken from the parser settings
opts.parsecodes
parsecodes: { ... }
- a list of HTTP response codes that the parser will consider successful; by default taken from the parser settings. If you specify '*': 1, then all responses will be considered successful.
Example:
parsecodes: {
200: 1,
403: 1,
500: 1
}
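The resulting acceptance check can be sketched in plain JavaScript (illustrative; the codeOk name is hypothetical):

```javascript
// Decide whether an HTTP status code is considered successful
// under a parsecodes-style hash; '*' accepts any code.
function codeOk(parsecodes, status) {
  return Boolean(parsecodes['*'] || parsecodes[status]);
}
```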
opts.timeout
timeout: N
- response timeout in seconds, by default taken from the parser settings
opts.do_gzip
do_gzip: 1
- determines whether to use compression (gzip/deflate/br), enabled by default (1), to disable, set the value to 0
opts.max_size
max_size: N
- maximum response size in bytes, by default taken from the parser settings
opts.cookie_jar
cookie_jar: { ... }
- hash with cookies.
opts.attempt
attempt: N
- indicates the current attempt number; when this parameter is used, the built-in attempt handler for this request is ignored
opts.browser
browser: 1
- automatic emulation of browser headers (1 - enabled, 0 - disabled)
opts.use_proxy
use_proxy: 1
- overrides the use of a proxy for an individual request inside the JS parser, on top of the global Use proxy setting (1 - enabled, 0 - disabled)
opts.noextraquery
noextraquery: 0
- disables adding Extra query string to the request URL (1 - enabled, 0 - disabled)
opts.save_to_file
save_to_file: file
- allows you to download a file directly to disk, bypassing in-memory buffering. Instead of file, specify the path and name under which to save the file. When this option is used, everything related to response data is ignored (content checks in check_content, response.data will be empty, etc.).
opts.data_as_buffer
data_as_buffer: 0
- determines whether to return data as a String (0) or as a Buffer object (1), by default a String is returned
opts.bypass_cloudflare
bypass_cloudflare: 0
- automatic bypass of CloudFlare JavaScript protection using the Chrome browser (1 - enabled, 0 - disabled)
opts.follow_meta_refresh
follow_meta_refresh: 0
- allows you to follow redirects declared via the HTML meta tag:
<meta http-equiv="refresh" content="time; url=..." />
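Extracting the redirect target from such a tag can be sketched with a regular expression in plain JavaScript (illustrative only; it assumes the http-equiv attribute precedes content and does not cover every HTML variation):

```javascript
// Extract the target URL from a <meta http-equiv="refresh"> tag.
// Returns null when no refresh redirect is present.
function metaRefreshUrl(html) {
  const m = html.match(/<meta[^>]+http-equiv=["']?refresh["']?[^>]*content=["']?\s*\d+\s*;\s*url=([^"'>]+)/i);
  return m ? m[1].trim() : null;
}
```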
opts.tlsOpts
tlsOpts: { ... }
- allows you to pass settings for HTTPS connections
yield this.parser.request()
yield this.parser.request(parser, preset, overrideParams, query)
Receiving results from another scraper (built-in or another JS scraper); the following arguments are specified:
- parser - scraper name (SE::Google, JS::Custom::Example)
- preset - preset of the called scraper
- overrideParams - hash with overrides of the called scraper's settings
- query - the query
In overrideParams, you can override the parameters of the called scraper; the following flags are also available:
overrideParams.resultArraysWithObjects
resultArraysWithObjects: 0
- determines in what form to return arrays of results of the called scraper:
- if enabled (1) - arrays of objects will be returned
[{link: 'link1', anchor: 'anchor1'}, {link: 'link2', anchor: 'anchor2'}, ...]
- if disabled (0) - standard arrays of values will be returned
['link1', 'anchor1', 'link2', 'anchor2', ...]
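The relationship between the two forms can be illustrated in plain JavaScript (the toObjects name and field list are hypothetical):

```javascript
// Convert a flat results array into an array of objects, as
// resultArraysWithObjects: 1 would return it. Illustration only.
function toObjects(flat, fields) {
  const out = [];
  for (let i = 0; i < flat.length; i += fields.length) {
    const obj = {};
    fields.forEach((f, j) => obj[f] = flat[i + j]);
    out.push(obj);
  }
  return out;
}
```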
overrideParams.needData
needData: 1
- determines whether to pass (1) or not (0) data/pages[] in the response, can be used for optimization
tools.*
The global tools object provides access to built-in A-Parser functions (analogous to the Template Toolkit $tools.* object).
tools.query
is not available; use this.query instead
this.doLog()
Indicates whether task logging is enabled; it can be used as an optimization flag to avoid building a complex expression for this.logger.put when logging is disabled.
this.logger.*
.put()
this.logger.put(message)
- adds a line message to the task log
.putHTML()
this.logger.putHTML(code)
- output of HTML code to the task log, which will be displayed in the textarea
yield this.sleep()
yield this.sleep(sec)
Sets a delay in the thread for the number of seconds sec, can be fractional.
yield this.mutex.*
Mutex for synchronization between threads, allows you to block a section of code for one thread
.lock()
Waits for the lock; execution continues for the first thread that acquires the lock, while the other threads wait for it to be released
.unlock()
Releases the lock; the next thread waiting on .lock() will continue execution
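The lock/unlock pattern can be sketched with a minimal promise-based mutex in plain JavaScript (an illustration of the concept; A-Parser's this.mutex is provided by the runtime):

```javascript
// Minimal promise-based mutex: lock() resolves with an unlock function
// once all earlier holders have released. Illustration only.
class Mutex {
  constructor() {
    this._chain = Promise.resolve();
  }
  lock() {
    let release;
    const held = new Promise(resolve => { release = resolve; });
    const acquired = this._chain.then(() => release);
    this._chain = this._chain.then(() => held);
    return acquired; // await this to enter the critical section
  }
}
```

Typical usage: `const unlock = await mutex.lock(); try { /* critical section */ } finally { unlock(); }`.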
this.cookies.*
Working with cookies for the current request
.getAll()
Getting an array of cookies
.setAll()
Setting cookies, an array with cookies must be passed as an argument
.set()
this.cookies.set(host, path, name, value)
- setting a single cookie
this.query.add()
this.query.add(query, maxLvl)
Adds a new query (query), optionally specifying the maximum level (maxLvl); similar to tools.query.add(). You can also pass a hash with parameters as the query, which works similarly to the Query Builder.
Example:
this.query.add({
query: "http://site.com",
param1: "..",
...
});
this.proxy.*
Working with proxies
.next()
Switch to the next proxy, the old proxy will no longer be used for the current request
.ban()
Switches to a new proxy and bans the current one (use this when the service blocks by IP); the proxy will be banned for the time specified in the parser settings (proxybannedcleanup)
.get()
Get the current proxy (the last proxy with which the request was made)
.set()
this.proxy.set('http://127.0.0.1:8080', noChange = false)
- set the proxy for the next request, the noChange parameter is optional, if set to true, the proxy will not change between attempts
yield this.captcha.*
Working with captcha
.recognize()
yield this.captcha.recognize(preset, image, type, overrides)
- loading a captcha for recognition
- image - binary image data for recognition
- preset - specifies the preset for Util::AntiGate
- type - specified as one of: 'jpeg', 'gif', 'png'
The result will be a hash with the fields:
- answer - the text from the image
- id - captcha id, for reporting an error through reportBad
- error - error text, if answer is not set
.recognizeFromUrl()
yield this.captcha.recognizeFromUrl(preset, url, overrides)
- similar to the previous method, but the captcha will be loaded automatically by link (url), without using a proxy
.reportBad()
yield this.captcha.reportBad(preset, id, overrides)
- report to the service that the captcha was recognized incorrectly
this.utils.*
.updateResultsData()
this.utils.updateResultsData(results, data)
- a method for automatically filling $pages.$i.data and $data, must be called to add content to the resulting page
.urlFromHTML()
this.utils.urlFromHTML(url, base)
- processes a link obtained from HTML code: decodes entities (&amp; etc.). Optionally you can pass base, the base URL (for example, the URL of the source page), to obtain the full absolute link
.url.extractDomain()
this.utils.url.extractDomain(url, removeDefaultSubdomain)
- the method takes a link as the first parameter and returns the domain from that link. The second, optional parameter determines whether to remove the www subdomain from the domain; by default it is 0, i.e. www is not removed.
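A rough plain-JavaScript approximation using the standard WHATWG URL API (a sketch, not A-Parser's implementation):

```javascript
// Approximate extractDomain(): return the hostname of a URL,
// optionally stripping a leading "www." subdomain.
function extractDomain(url, removeDefaultSubdomain = 0) {
  let host = new URL(url).hostname;
  if (removeDefaultSubdomain && host.startsWith('www.')) host = host.slice(4);
  return host;
}
```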
.url.extractTopDomain()
this.utils.url.extractTopDomain(url)
- the method takes a link as the first parameter and returns the domain from this link, without subdomains.
.url.extractTopDomainByZone()
this.utils.url.extractTopDomainByZone(url)
- the method takes a link as the first parameter and returns the domain from this link, without subdomains including regional zones.
.url.extractMaxPath()
this.utils.url.extractMaxPath(url)
- the method takes a string and extracts a URL from it.
.url.extractWOParams()
this.utils.url.extractWOParams(url)
- the method takes a link and returns the same link truncated before the query string, i.e. it returns the URL up to the ?.
.removeHtml()
this.utils.removeHtml(string)
- the method takes a string and returns it cleared of HTML tags.
.removeNoDigit()
this.utils.removeNoDigit(string)
- the method takes a string, removes everything except digits from it, and returns the result.
.removeComma()
this.utils.removeComma(string)
- the method takes a string, removes characters such as .,\r\n from it, and returns the result.
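Rough plain-JavaScript approximations of these three helpers (sketches of the documented behavior, not A-Parser's implementation):

```javascript
// Illustrative equivalents of the documented string helpers.
const removeHtml = s => s.replace(/<[^>]*>/g, '');   // strip HTML tags
const removeNoDigit = s => s.replace(/\D/g, '');     // keep digits only
const removeComma = s => s.replace(/[.,\r\n]/g, ''); // drop . , \r \n
```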
this.sessionManager.*
To use sessions in a JS scraper, you first need to initialize the session manager. This is done using the init() function.
init() {
this.sessionManager.init({
//additional parameters can be set here
});
}
In this.sessionManager.init(), you can use the following parameters:
- name - optional parameter; allows you to override the name of the scraper to which the sessions belong, by default equal to the name of the scraper in which initialization takes place
- canChangeProxy - optional parameter; whether the proxy may be changed, equal to 1 by default
- domain - optional parameter; specifies whether to search for sessions among all saved for this scraper (if the value is not set) or only for a specific domain (specify the domain with a leading dot, for example .site.com)
There are several functions for working with sessions:
.get()
this.sessionManager.get()
- gets a new session, must be called before making a request.
.reset()
this.sessionManager.reset()
- clears cookies and gets a new session. It must be called if the request with the current session was unsuccessful.
.save()
this.sessionManager.save()
- saves a successful session or saves arbitrary data in the session.
results.<array>.addElement()
The results.<array>.addElement() method allows you to fill arrays in results more conveniently. When using it, you do not need to remember the order of variables in the array or list them manually.
results.serp.addElement({
link: 'https://google.com',
anchor: 'Google',
snippet: 'Lorem ipsum...',
});
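What addElement saves you from can be sketched in plain JavaScript: for a flat results array it pushes the values in the array's declared field order, so the caller never lists them manually (the addElement function and field order here are hypothetical):

```javascript
// Sketch of addElement for a flat results array: push values
// in the declared field order, regardless of object key order.
function addElement(arr, fields, obj) {
  fields.forEach(f => arr.push(obj[f]));
}

const serp = [];
addElement(serp, ['link', 'anchor', 'snippet'], {
  anchor: 'Google',          // key order in the object does not matter
  link: 'https://google.com',
  snippet: 'Lorem ipsum...',
});
```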
init() and destroy() methods
The init() method is called at the start of a task, and destroy() is called at the end.
Example of use:
const puppeteer = require("puppeteer");
let globalBrowser;
class Parser {
constructor() {
...
}
async init() {
globalBrowser = await puppeteer.launch();
};
async destroy() {
if(globalBrowser)
await globalBrowser.close();
}
}
Useful links
Example of saving a file to disk
An example demonstrating how to save files directly to disk
Example of working with sessions
Using session functionality in JavaScript scrapers
Example of saving data in a session
Demonstration of the ability to store arbitrary data in a session
Using results.addElement()
An example of filling an array of data using results.addElement() and demonstrating the difference from regular .push()