Description of Methods (v1)

caution

This JavaScript API is considered deprecated; we recommend using API version 2.

Please note that some methods require the use of the keyword yield

`yield this.request()`

yield this.request(method, url, queryParams, opts)

Sends an HTTP request and returns the response. The arguments are:

method - request method (GET, POST...)
url - link for the request
queryParams - hash with the GET parameters, or with the body of a POST request
opts - hash with request options

If the POST method is used, the request body can be passed in two ways:

by simply listing the variable names and their values in queryParams. For example:

{
    key: set.query,
    id: 1234,
    type: 'text'
}

through the body variable in opts. For example:

body: 'key=' + set.query + '&id=1234&type=text'
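
The two forms above carry the same fields. As a rough sketch of how they relate, the snippet below turns a hash into the string form using the standard `URLSearchParams` class. The helper name `buildFormBody` and the `query` constant (a stand-in for `set.query`) are assumptions made for this illustration; whether A-Parser encodes `queryParams` in exactly this way is not stated here.

```javascript
// Hypothetical helper (not part of the A-Parser API): serializes a hash of
// POST fields into the 'key=value&...' string form that can be put in opts.body.
function buildFormBody(fields) {
    // URLSearchParams percent-encodes field names and values
    return new URLSearchParams(fields).toString();
}

// Stand-in for set.query, used only in this illustration
const query = 'test';

const asHash = { key: query, id: 1234, type: 'text' };
const asBody = buildFormBody(asHash);

console.log(asBody); // key=test&id=1234&type=text
```

Unlike plain string concatenation, this approach escapes special characters in the values, which matters when the query contains spaces or `&`.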

`opts.check_content`

check_content: [ condition1, condition2, ...] - an array of conditions for checking the received content. If a check fails, the request is repeated with a different proxy.

Capabilities:

use of strings as conditions (search by string inclusion)
use of regular expressions as conditions
use of custom check functions, to which data and response headers are passed
multiple different types of conditions can be set at once
for logical negation, place the condition in an array: check_content: ['xxxx', [/yyyy/]] means that the request is considered successful if the received data contains the substring xxxx and the regular expression /yyyy/ finds no match on the page

All checks listed in the array must pass for the request to be successful

Example (comments indicate what is needed for the request to be considered successful):

let response = yield this.request('GET', set.query, {}, {
    check_content: [
        /<\/html>|<\/body>/, // this regular expression must match the received page
        ['XXXX'], // the received page must not contain this substring
        '</html>', // the received page must contain this substring
        (data, hdr) => {
            return hdr.Status == 200 && data.length > 100;
        } // this function must return true
    ]
});
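
To make the rules for conditions concrete, here is a minimal sketch that evaluates a check_content-style array in plain JavaScript. It only mirrors the behavior described above (strings, regular expressions, functions, negation by array wrapping, all checks must pass); the function names `checkOne` and `passesChecks` are invented for this illustration and do not reflect how A-Parser implements the check internally.

```javascript
// Illustration only: follows the rules described in this section,
// not A-Parser's actual implementation.
function checkOne(condition, data, headers) {
    if (Array.isArray(condition)) {
        // A condition wrapped in an array is logically negated
        return !checkOne(condition[0], data, headers);
    }
    if (typeof condition === 'string') {
        return data.includes(condition); // substring search
    }
    if (condition instanceof RegExp) {
        return condition.test(data);
    }
    if (typeof condition === 'function') {
        return Boolean(condition(data, headers));
    }
    throw new TypeError('Unsupported condition type');
}

function passesChecks(conditions, data, headers) {
    // Every condition in the array must pass
    return conditions.every((condition) => checkOne(condition, data, headers));
}

const conditions = ['xxxx', [/yyyy/]];
console.log(passesChecks(conditions, 'aaa xxxx bbb', {}));  // true
console.log(passesChecks(conditions, 'aaa xxxx yyyy', {})); // false
```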

`opts.decode`

decode: 'auto-html' - automatic detection of encoding and conversion to utf8

Possible values:

auto-html - based on headers, meta tags, and page content (optimal recommended option)
utf8 - indicates that the document is in utf8 encoding
<encoding> - any other encoding

`opts.headers`

headers: { ... } - hash with headers. Header names are set in lowercase; you can also specify the cookie header. Example:

headers: {
    accept: 'image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    cookie: 'a=321; b=test',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}

`opts.headers_order`

headers_order: ['cookie', 'user-agent', ...] - allows you to override the order in which headers are sorted

`opts.recurse`

recurse: N - the maximum number of redirects to follow, by default 7, use 0 to disable following redirects

`opts.proxyretries`

proxyretries: N - the number of attempts to execute a request, by default taken from the scraper settings

`opts.parsecodes`

parsecodes: { ... } - a list of HTTP response codes that the scraper will consider successful, by default taken from the scraper settings. If you specify '*': 1 then all responses will be considered successful. Example:

parsecodes: {
    200: 1,
    403: 1,
    500: 1
}
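
The lookup implied by this option can be sketched in a few lines. The function name `isAcceptedStatus` is hypothetical and the logic is only an interpretation of the description above (a code is accepted if it is listed, or if the `'*'` wildcard is set); it is not taken from A-Parser itself.

```javascript
// Illustration only: decides whether a status code counts as successful
// for a given parsecodes hash, following the description above.
function isAcceptedStatus(parsecodes, status) {
    if (parsecodes['*']) {
        return true; // wildcard: every response is considered successful
    }
    return Boolean(parsecodes[status]);
}

const parsecodes = { 200: 1, 403: 1, 500: 1 };

console.log(isAcceptedStatus(parsecodes, 200));   // true
console.log(isAcceptedStatus(parsecodes, 404));   // false
console.log(isAcceptedStatus({ '*': 1 }, 404));   // true
```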

`opts.timeout`

timeout: N - response timeout in seconds, by default taken from the scraper settings

`opts.do_gzip`

do_gzip: 1 - determines whether to use compression (gzip/deflate/br), by default enabled (1), to disable set value to 0

`opts.max_size`

max_size: N - the maximum response size in bytes, by default taken from the scraper settings

`opts.cookie_jar`

cookie_jar: { ... } - hash with cookies.

`opts.attempt`

attempt: N - indicates the number of the current attempt. When this parameter is used, the built-in attempt handler is ignored for this request

`opts.browser`

browser: 1 - automatic emulation of browser headers (1 - enabled, 0 - disabled)

`opts.use_proxy`

use_proxy: 1 - overrides the use of a proxy for an individual request within the JS scraper on top of the global parameter Use proxy (1 - enabled, 0 - disabled)

`opts.noextraquery`

noextraquery: 0 - disables adding Extra query string to the request URL (1 - enabled, 0 - disabled)

`opts.save_to_file`

save_to_file: file - allows you to download a file directly to disk, bypassing writing to memory. Instead of file, specify the name and path under which to save the file. When using this option, everything related to data is ignored (content check in check_content, response.data will be empty, etc.).

`opts.data_as_buffer`

data_as_buffer: 0 - determines whether to return data as a String (0) or as a Buffer object (1), by default a String is returned
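
One reason to prefer a Buffer is binary content. The sketch below uses only the standard Node.js `Buffer` API to show that bytes which are not valid UTF-8 do not survive a round trip through a string. It says nothing about how A-Parser itself decodes response bodies; the variable names are chosen for this illustration.

```javascript
// The first four bytes of the PNG file signature; 0x89 is not valid UTF-8
const signature = Buffer.from([0x89, 0x50, 0x4e, 0x47]);

// Decoding to a string replaces the invalid byte with U+FFFD
const decoded = signature.toString('utf8');

// Encoding the string again does not restore the original bytes
const roundTrip = Buffer.from(decoded, 'utf8');

console.log(signature.equals(roundTrip)); // false
console.log(signature.length, roundTrip.length); // 4 6
```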

`opts.bypass_cloudflare`

bypass_cloudflare: 0 - automatic bypass of CloudFlare's JavaScript protection using the Chrome browser (1 - enabled, 0 - disabled)

`opts.follow_meta_refresh`

follow_meta_refresh: 0 - allows you to follow redirects declared through the HTML meta tag:

<meta http-equiv="refresh" content="time; url=..." />

`opts.tlsOpts`

tlsOpts: { ... } - allows you to pass settings for HTTPS connections

`yield this.parser.request()`

yield this.parser.request(parser, preset, overrideParams, query)

Getting results from another scraper (built-in or another JS scraper), the following arguments are specified:

parser - name of the scraper (SE::Google, JS::Custom::Example)
preset - preset of the settings of the called scraper
overrideParams - hash with overrides of the settings of the called scraper
query - request

In overrideParams, you can override the parameters of the called scraper, and the following flags are also available:

`overrideParams.resultArraysWithObjects`

resultArraysWithObjects: 0 - determines in what form to return arrays of results from the called scraper:

if enabled (1) - arrays of objects will be returned

[{link: 'link1', anchor: 'anchor1'}, {link: 'link2', anchor: 'anchor2'}, ...]

if disabled (0) - standard arrays of values will be returned
```
['link1', 'anchor1', 'link2', 'anchor2', ...]
```
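
The relationship between the two shapes can be sketched as a simple regrouping. The helper `toObjects` is hypothetical and assumes that the flat array repeats its fields in a fixed order, as in the example above; the field names `link` and `anchor` are taken from that example.

```javascript
// Illustration only: regroups a flat array of values into an array of
// objects, given the field names in the order they repeat.
function toObjects(flat, fields) {
    const objects = [];
    for (let i = 0; i < flat.length; i += fields.length) {
        const item = {};
        fields.forEach((name, offset) => {
            item[name] = flat[i + offset];
        });
        objects.push(item);
    }
    return objects;
}

const flat = ['link1', 'anchor1', 'link2', 'anchor2'];
const objects = toObjects(flat, ['link', 'anchor']);

console.log(objects.length);  // 2
console.log(objects[1].link); // link2
```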

`overrideParams.needData`

needData: 1 - determines whether to pass (1) or not (0) data/pages[] in the response, can be used for optimization

`tools.*`

The global tools object allows access to A-Parser's built-in functions (analogous to template toolkit tools $tools.*).

note

tools.query is not available, use this.query

`this.doLog()`

Returns whether task logging is enabled. It can be used as an optimization flag: when logging is off, you can skip evaluating a complex expression that would only be passed to this.logger.put
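
A minimal sketch of this guard follows. So that it runs on its own, it uses a hypothetical `makeContext` stand-in for the scraper's `this`, with only `doLog()` and `logger.put()`; inside a real scraper the same check would be written against `this`. The names `describeResults`, `logResults`, and `buildCount` are also invented for the illustration.

```javascript
// Hypothetical stand-in for the scraper's `this`, so the snippet is runnable.
function makeContext(loggingEnabled) {
    const lines = [];
    return {
        lines,
        doLog: () => loggingEnabled,
        logger: { put: (message) => lines.push(message) },
    };
}

// Counts how often the (potentially expensive) message is actually built
let buildCount = 0;
function describeResults(results) {
    buildCount += 1;
    return 'results: ' + JSON.stringify(results);
}

// The guard: build the message only when it will be written to the log
function logResults(self, results) {
    if (self.doLog()) {
        self.logger.put(describeResults(results));
    }
}

const quiet = makeContext(false);
logResults(quiet, { total: 3 });
console.log(buildCount); // 0, the message was never built

const verbose = makeContext(true);
logResults(verbose, { total: 3 });
console.log(buildCount); // 1
```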

`this.logger.*`

`.put()`

this.logger.put(message) - adds the line message to the task log

`.putHTML()`

this.logger.putHTML(code) - outputs HTML code to the task log, which will be displayed in a textarea

`yield this.sleep()`

yield this.sleep(sec)

Pauses the thread for sec seconds; the value can be fractional.

`yield this.mutex.*`

Mutex for synchronization between threads, allows you to lock a section of code for one thread

`.lock()`

Waits for the lock. The first thread to capture the lock continues execution; other threads wait until the lock is released

`.unlock()`

Releases the lock. If another thread was waiting for the lock (.lock()), it continues execution

`this.cookies.*`

Working with cookies for the current request

`.getAll()`

Getting an array of cookies

`.setAll()`

Setting cookies, an array of cookies must be passed as an argument

`.set()`

this.cookies.set(host, path, name, value) - setting a single cookie

`this.query.add()`

this.query.add(query, maxLvl)

Adds a new request (query), optionally with a maximum level (maxLvl), similar to tools.query.add(). You can pass a hash with parameters as the request (query); this works similarly to Query Builder

Example:

this.query.add({
    query: "http://site.com",
    param1: "..",
    ...
});

`this.proxy.*`

Working with proxies

`.next()`

Change the proxy to the next one, the old proxy will no longer be used for the current request

`.ban()`

Change and ban the proxy (necessary to use when the service blocks work by IP), the proxy will be banned for the time specified in the scraper settings (proxybannedcleanup)

`.get()`

Get the current proxy (the last proxy with which a request was made)

`.set()`

this.proxy.set('http://127.0.0.1:8080', noChange = false) - set a proxy for the next request, the noChange parameter is optional, if set to true then the proxy will not change between attempts

`yield this.captcha.*`

Working with captcha

`.recognize()`

yield this.captcha.recognize(preset, image, type, overrides) - upload a captcha for recognition

preset - indicates a preset for Util::AntiGate
image - binary image data for recognition
type - specify one of: 'jpeg', 'gif', 'png'

The result will be a hash with fields:

answer - text from the image
id - captcha id, for the possibility to report an error later through reportBad
error - text error, if answer is not set

`.recognizeFromUrl()`

yield this.captcha.recognizeFromUrl(preset, url, overrides) - similar to the previous method, but the captcha will be automatically uploaded by the link (url), without using a proxy

`.reportBad()`

yield this.captcha.reportBad(preset, id, overrides) - report to the service that the captcha was solved incorrectly

`this.utils.*`

`.updateResultsData()`

this.utils.updateResultsData(results, data) - a method for automatically filling $pages.$i.data and $data; call it to add content to the resulting page

`.urlFromHTML()`

this.utils.urlFromHTML(url, base) - processes a link obtained from HTML code by decoding entities (&amp; etc.). Optionally, you can pass base, a base URL (for example, the URL of the source page), so that a full link can be obtained

`.url.extractDomain()`

this.utils.url.extractDomain(url, removeDefaultSubdomain) - the method takes a link as the first parameter and returns the domain from this link. The second, optional parameter determines whether to cut the www subdomain from the domain. The default is 0, that is, do not cut.

`.url.extractTopDomain()`

this.utils.url.extractTopDomain(url) - the method takes a link as the first parameter and returns the domain from this link, without subdomains.

`.url.extractTopDomainByZone()`

this.utils.url.extractTopDomainByZone(url) - the method takes a link as the first parameter and returns the domain from this link, without subdomains as well. Works with all regional zones

`.url.extractMaxPath()`

this.utils.url.extractMaxPath(url) - the method takes a string and selects a URL from it

`.url.extractWOParams()`

this.utils.url.extractWOParams(url) - the method takes a link and returns the same link trimmed before the parameter string. That is, it returns the URL up to ?

`.removeHtml()`

this.utils.removeHtml(string) - the method takes a string and returns it cleaned from HTML tags

`.removeNoDigit()`

this.utils.removeNoDigit(string) - the method takes a string, removes everything except digits from it, and returns the result

`.removeComma()`

this.utils.removeComma(string) - the method takes a string, removes such characters as .,\r\n and returns the result
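
To illustrate the described behavior of the last few helpers, here are rough plain-JavaScript equivalents. They are assumptions based only on the descriptions above, written with standard string methods; the names `withoutParams`, `digitsOnly`, and `withoutCommas` are invented, and the real this.utils methods may differ in edge cases.

```javascript
// Rough equivalent of the described extractWOParams(): the URL up to '?'
const withoutParams = (url) => url.split('?')[0];

// Rough equivalent of the described removeNoDigit(): keep digits only
const digitsOnly = (text) => text.replace(/\D/g, '');

// Rough equivalent of the described removeComma(): drop . , \r and \n
const withoutCommas = (text) => text.replace(/[.,\r\n]/g, '');

console.log(withoutParams('http://site.com/page?a=1&b=2')); // http://site.com/page
console.log(digitsOnly('1,234.50 USD'));                    // 123450
console.log(withoutCommas('1,234.50\r\n'));                 // 123450
```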

`this.sessionManager.*`

To use sessions in the JS scraper, you first need to initialize the Session Manager. This is done using the init() function

init() {
    this.sessionManager.init({
       // additional parameters can be set here
    });
}

In this.sessionManager.init() you can use the following parameters:

name - an optional parameter, allows you to redefine the name of the scraper to which the sessions belong, by default it is equal to the name of the scraper in which the initialization occurs
canChangeProxy - an optional parameter, the ability to change the proxy, by default it is equal to 1
domain - an optional parameter, indicates to look for sessions among all saved for this scraper (if the value is not set), or only for a specific domain (you need to specify the domain with a dot in front, for example .site.com)

There are several functions for working with sessions:

`.get()`

this.sessionManager.get() - gets a new session, you need to call it before making a request

`.reset()`

this.sessionManager.reset() - clearing cookies and getting a new session. It is necessary to call if the request with the current session was not successful.

`.save()`

this.sessionManager.save() - saving a successful session or saving arbitrary data in the session

`results.<array>.addElement()`

The method results.<array>.addElement() allows you to fill arrays in results more conveniently. When using it, you do not need to remember the sequence of variables in the array and list them manually.

results.serp.addElement({
    link: 'https://google.com',
    anchor: 'Google',
    snippet: 'Lorem ipsum...',
});

Methods `init()` and `destroy()`

The init() method is called at the start of the task, destroy() - at the end.

Example of use:

const puppeteer = require("puppeteer");
let globalBrowser;

class Parser {
    constructor() {
       ...
    }
 
    async init() {
        globalBrowser = await puppeteer.launch();
    };

    async destroy() {
        if(globalBrowser)
            await globalBrowser.close();
    }
}

Useful links

🔗 Example of saving a file to disk

🔗 Example of working with sessions

🔗 Example of saving data in a session

🔗 Using results.addElement()
