Method descriptions (v1)

caution

This JavaScript API is considered deprecated; we recommend using API version 2

Please note that some methods require the yield keyword

yield this.request()

yield this.request(method, url, queryParams, opts)

Sends an HTTP request and returns the response. The following arguments are specified:

  • method - request method (GET, POST...)
  • url - request link
  • queryParams - hash with GET parameters, or hash with the POST request body
  • opts - hash with request options

If the POST method is used, the request body can be passed in two ways:

  • by simply listing variable names and their values in queryParams. For example:
    {
        key: set.query,
        id: 1234,
        type: 'text'
    }
  • via the body variable in opts. For example:
    body: 'key=' + set.query + '&id=1234&type=text'
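
Both forms carry the same fields. When the opts.body string is built by hand, the values generally need to be URL-encoded. The sketch below does this with the standard URLSearchParams class. The helper name buildBody and the sample query value are hypothetical, and this is not necessarily how A-Parser serializes queryParams internally.

```javascript
// Hypothetical helper: turns a hash of fields into a form-encoded body string.
function buildBody(params) {
  // URLSearchParams stringifies the values and percent-encodes them
  // (spaces become "+", as in HTML form submissions).
  return new URLSearchParams(params).toString();
}

// Sample value standing in for set.query.
const query = 'red shoes';
const body = buildBody({ key: query, id: 1234, type: 'text' });
```

The resulting string could then be passed as the body variable in opts.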

opts.check_content

check_content: [ condition1, condition2, ...] - an array of conditions to check the received content; if the check fails, the request will be retried with a different proxy.

Features:

  • using strings as conditions (search by string occurrence)
  • using regular expressions as conditions
  • using custom check functions that receive response data and headers
  • multiple different types of conditions can be specified at once
  • for logical negation, place the condition in an array, i.e., check_content: ['xxxx', [/yyyy/]] means the request will be considered successful if the received data contains the substring xxxx and the regular expression /yyyy/ finds no matches on the page

All checks specified in the array must pass for a successful request

Example (comments indicate what is needed for the request to be considered successful):

let response = yield this.request('GET', set.query, {}, {
    check_content: [
        /<\/html>|<\/body>/, // this regular expression must match on the received page
        ['XXXX'], // this substring must not be present on the received page
        '</html>', // this substring must be present on the received page
        (data, hdr) => {
            return hdr.Status == 200 && data.length > 100;
        } // this function must return true
    ]
});
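
To make the rules above concrete, here is a minimal sketch that re-implements them outside A-Parser. The functions checkOne and passesChecks are hypothetical and only model the documented behavior (substring, regular expression, custom function, array-wrapped negation, all conditions required); the real check is performed by the parser runtime.

```javascript
// Illustrative model of the documented rules; not A-Parser's actual code.
function checkOne(condition, data, hdr) {
  if (typeof condition === 'string') return data.includes(condition);
  if (condition instanceof RegExp) return condition.test(data);
  if (typeof condition === 'function') return Boolean(condition(data, hdr));
  throw new TypeError('Unsupported condition type');
}

function passesChecks(conditions, data, hdr) {
  // Every condition must pass; a condition wrapped in an array is negated.
  return conditions.every((condition) =>
    Array.isArray(condition)
      ? !checkOne(condition[0], data, hdr)
      : checkOne(condition, data, hdr)
  );
}

const conditions = [
  /<\/html>|<\/body>/,
  ['XXXX'],
  '</html>',
  (data, hdr) => hdr.Status == 200 && data.length > 100,
];

// Sample page that is long enough and properly closed.
const page = '<html><body>' + 'x'.repeat(100) + '</body></html>';
const ok = passesChecks(conditions, page, { Status: 200 });
const blocked = passesChecks(conditions, page + 'XXXX', { Status: 200 });
```

Here ok is true, while blocked is false because the negated substring XXXX is present.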

opts.decode

decode: 'auto-html' - automatic encoding detection and conversion to utf8

Possible values:

  • auto-html - based on headers, meta tags, and page content (optimal recommended option)
  • utf8 - specifies that the document is in utf8 encoding
  • <encoding> - any other encoding

opts.headers

headers: { ... } - hash with headers; header names, including cookie, are specified in lowercase. Example:

headers: {
    accept: 'image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    cookie: 'a=321; b=test',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}

opts.headers_order

headers_order: ['cookie', 'user-agent', ...] - allows overriding the header sorting order

opts.recurse

recurse: N - maximum number of redirect follows, default is 7, use 0 to disable redirect following

opts.proxyretries

proxyretries: N - number of request attempts, default is taken from parser settings

opts.parsecodes

parsecodes: { ... } - list of HTTP response codes that the parser will consider successful, default is taken from parser settings. If '*': 1 is specified, all responses will be considered successful. Example:

parsecodes: {
    200: 1,
    403: 1,
    500: 1
}
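
The lookup described above can be modeled in a few lines. The helper isAcceptedCode is hypothetical and only mirrors the description; the actual decision is made inside the parser.

```javascript
// Hypothetical helper: decides whether a status code counts as successful.
function isAcceptedCode(parsecodes, status) {
  // '*': 1 accepts every response code.
  if (parsecodes['*']) return true;
  return Boolean(parsecodes[status]);
}

const parsecodes = { 200: 1, 403: 1, 500: 1 };
const accepts403 = isAcceptedCode(parsecodes, 403);
const accepts404 = isAcceptedCode(parsecodes, 404);
const wildcard = isAcceptedCode({ '*': 1 }, 404);
```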

opts.timeout

timeout: N - response timeout in seconds, default is taken from parser settings

opts.do_gzip

do_gzip: 1 - determines whether to use compression (gzip/deflate/br), enabled by default (1), set to 0 to disable

opts.max_size

max_size: N - maximum response size in bytes, default is taken from parser settings

opts.cookie_jar

cookie_jar: { ... } - hash with cookies.

opts.attempt

attempt: N - indicates the current attempt number; when using this parameter, the built-in attempt handler for this request is ignored

opts.browser

browser: 1 - automatic browser header emulation (1 - enabled, 0 - disabled)

opts.use_proxy

use_proxy: 1 - overrides the global Use proxy parameter for an individual request within the JS parser (1 - enabled, 0 - disabled)

opts.noextraquery

noextraquery: 0 - disables adding Extra query string to the request URL (1 - enabled, 0 - disabled)

opts.save_to_file

save_to_file: file - allows downloading a file directly to disk, bypassing memory storage. Instead of file, specify the name and path to save the file. When using this option, everything related to data is ignored (content check in check_content, response.data will be empty, etc.).

opts.data_as_buffer

data_as_buffer: 0 - determines whether to return data as a String (0) or as a Buffer object (1), default returns a String

opts.bypass_cloudflare

bypass_cloudflare: 0 - automatic CloudFlare JavaScript protection bypass using Chrome browser (1 - enabled, 0 - disabled)

opts.follow_meta_refresh

follow_meta_refresh: 0 - allows following redirects declared via HTML meta tag:

<meta http-equiv="refresh" content="time; url=..." />

opts.tlsOpts

tlsOpts: { ... } - allows passing settings for HTTPS connections

yield this.parser.request()

yield this.parser.request(parser, preset, overrideParams, query)

Retrieving results from another parser (built-in or another JS parser); the following arguments are specified:

  • parser - parser name (SE::Google, JS::Custom::Example)
  • preset - settings preset of the called parser
  • overrideParams - hash with setting overrides for the called parser
  • query - query

In overrideParams, you can override parameters of the called parser; the following flags are also available:

overrideParams.resultArraysWithObjects

resultArraysWithObjects: 0 - determines the format for returning result arrays of the called parser:

  • if enabled (1) - arrays of objects will be returned
    [{link: 'link1', anchor: 'anchor1'}, {link: 'link2', anchor: 'anchor2'}, ...]
  • if disabled (0) - standard arrays of values will be returned
    ['link1', 'anchor1', 'link2', 'anchor2', ...]
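
The two formats hold the same values. As a rough illustration, a flat array can be grouped into objects when the field order is known. The helper toObjects and the field list are assumptions made for this sketch, not part of the A-Parser API.

```javascript
// Hypothetical helper: groups a flat result array into objects,
// taking fields.length values per object in the given field order.
function toObjects(flat, fields) {
  const rows = [];
  for (let i = 0; i + fields.length <= flat.length; i += fields.length) {
    const row = {};
    fields.forEach((name, j) => {
      row[name] = flat[i + j];
    });
    rows.push(row);
  }
  return rows;
}

const flat = ['link1', 'anchor1', 'link2', 'anchor2'];
const rows = toObjects(flat, ['link', 'anchor']);
```

With resultArraysWithObjects: 1 this conversion is unnecessary, because the called parser already returns the object form.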

overrideParams.needData

needData: 1 - determines whether to pass (1) or not (0) data/pages[] in the response; can be used for optimization

tools.*

The global tools object gives access to built-in A-Parser functions (analogous to $tools.* in Template Toolkit).

note

tools.query is not available; you must use this.query

this.doLog()

Indicates whether task logging is enabled; can be used as an optimization flag for cases where logging is disabled and a complex expression is passed as an argument to this.logger.put
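
A minimal sketch of this pattern, using a stand-in context object so that it runs outside A-Parser. The names makeContext, buildSummary, and logItems are hypothetical; in a real parser the same check would be written against this.doLog() and this.logger.put().

```javascript
// Stand-in for the parser context; in A-Parser, doLog() and logger.put()
// are provided by the runtime.
function makeContext(loggingEnabled) {
  const lines = [];
  return {
    lines,
    doLog: () => loggingEnabled,
    logger: { put: (message) => lines.push(message) },
  };
}

let summaries = 0;

// Stands in for an expensive expression, e.g. serializing a large object.
function buildSummary(items) {
  summaries += 1;
  return 'Parsed items: ' + JSON.stringify(items);
}

function logItems(ctx, items) {
  // Build the message only when it will actually be written to the log.
  if (ctx.doLog()) ctx.logger.put(buildSummary(items));
}

const quiet = makeContext(false);
const verbose = makeContext(true);
logItems(quiet, [1, 2, 3]);
logItems(verbose, [1, 2, 3]);
```

The expensive message is built once, for the context with logging enabled only.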

this.logger.*

.put()

this.logger.put(message) - adds the message string to the task log

.putHTML()

this.logger.putHTML(code) - outputs HTML code to the task log, which will be displayed in the textarea

yield this.sleep()

yield this.sleep(sec)

Pauses the thread for sec seconds; sec can be fractional.

yield this.mutex.*

Mutex for synchronization between threads, allows locking a code section for one thread

.lock()

Waiting for a lock; the first thread that acquired the lock will continue execution, while other threads will wait for the lock to be released

.unlock()

Releasing a lock; the next thread will continue execution if it was waiting for a lock (.lock())
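
A common way to pair the two calls is to release the lock in a finally block, so that it is freed even when the protected code throws. The sketch below shows this with a stand-in mutex and a tiny synchronous driver so that it runs outside A-Parser. The names updateShared and run are hypothetical, and writing both calls with yield is an assumption based on the section heading.

```javascript
// Stand-in mutex that only records calls; the real this.mutex is supplied
// by A-Parser and makes other threads wait until the lock is released.
const events = [];
const mutex = {
  lock: () => { events.push('lock'); },
  unlock: () => { events.push('unlock'); },
};

// Generator in the style of a v1 parser method.
function* updateShared(mutexObj, work) {
  yield mutexObj.lock();
  try {
    return work(); // critical section
  } finally {
    yield mutexObj.unlock(); // runs even if work() throws
  }
}

// Tiny synchronous driver standing in for the A-Parser runtime:
// it resumes the generator with each yielded value until it finishes.
function run(gen) {
  let step = gen.next();
  while (!step.done) step = gen.next(step.value);
  return step.value;
}

const result = run(updateShared(mutex, () => 42));
```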

this.cookies.*

Working with cookies for the current request

.getAll()

Retrieving an array of cookies

.setAll()

Setting cookies; an array with cookies must be passed as an argument

.set()

this.cookies.set(host, path, name, value) - setting a single cookie

this.query.add()

this.query.add(query, maxLvl)

Adds a new query (query) with an optional maximum level (maxLvl), similar to tools.query.add(). A hash with parameters can be passed as the query (query); this works similarly to the Query Builder.

Example:

this.query.add({
    query: "http://site.com",
    param1: "..",
    ...
});

this.proxy.*

Working with proxies

.next()

Change proxy to the next one; the old proxy will no longer be used for the current request

.ban()

Change and ban proxy (must be used when a service blocks access by IP); the proxy will be banned for the time specified in parser settings (proxybannedcleanup)

.get()

Get current proxy (the last proxy used for a request)

.set()

this.proxy.set('http://127.0.0.1:8080', noChange = false) - set proxy for the next request; the noChange parameter is optional, if set to true, the proxy will not change between attempts

yield this.captcha.*

Working with captcha

.recognize()

yield this.captcha.recognize(preset, image, type, overrides) - uploading a captcha for recognition

  • preset - specifies the preset for Util::AntiGate
  • image - binary image data for recognition
  • type - one of: 'jpeg', 'gif', 'png'

The result will be a hash with fields:

  • answer - text from the image
  • id - captcha id, which can be used later to report an incorrect answer via reportBad
  • error - error text if answer is not set

.recognizeFromUrl()

yield this.captcha.recognizeFromUrl(preset, url, overrides) - similar to the previous method, but the captcha will be downloaded automatically via the link (url) without using a proxy

.reportBad()

yield this.captcha.reportBad(preset, id, overrides) - report to the service that the captcha was solved incorrectly

this.utils.*

.updateResultsData()

this.utils.updateResultsData(results, data) - method for automatically filling $pages.$i.data and $data; must be called to add content of the resulting page

.urlFromHTML()

this.utils.urlFromHTML(url, base) - processes a link obtained from HTML code by decoding entities (&amp; etc.). Optionally, a base URL (e.g., the URL of the source page) can be passed as base to obtain a full link.

.url.extractDomain()

this.utils.url.extractDomain(url, removeDefaultSubdomain) - the method takes a link as the first parameter and returns the domain from this link. The second optional parameter determines whether to strip the www subdomain from the domain. Default is 0 - do not strip.

.url.extractTopDomain()

this.utils.url.extractTopDomain(url) - the method takes a link as the first parameter and returns the domain from this link, without subdomains.

.url.extractTopDomainByZone()

this.utils.url.extractTopDomainByZone(url) - the method takes a link as the first parameter and returns the domain from this link without subdomains. Works with all regional zones.

.url.extractMaxPath()

this.utils.url.extractMaxPath(url) - the method takes a string and extracts a URL from it

.url.extractWOParams()

this.utils.url.extractWOParams(url) - the method takes a link and returns it truncated before the parameter string, i.e., the URL up to ?
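
For orientation, two of these helpers can be approximated in plain JavaScript. The functions below are rough stand-ins built on the standard URL class and string methods; they only follow the descriptions above, and the real this.utils.url.* methods may behave differently in edge cases. The sample link is made up.

```javascript
// Approximation of extractDomain: host name of the link, optionally
// without the leading "www." subdomain.
function extractDomain(url, removeDefaultSubdomain = 0) {
  const host = new URL(url).hostname;
  return removeDefaultSubdomain ? host.replace(/^www\./, '') : host;
}

// Approximation of extractWOParams: everything before the first "?".
function extractWOParams(url) {
  const pos = url.indexOf('?');
  return pos === -1 ? url : url.slice(0, pos);
}

const link = 'https://www.example.com/catalog/item?id=5&ref=home';
const domain = extractDomain(link);
const bareDomain = extractDomain(link, 1);
const withoutParams = extractWOParams(link);
```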

.removeHtml()

this.utils.removeHtml(string) - the method takes a string and returns it cleared of html tags

.removeNoDigit()

this.utils.removeNoDigit(string) - the method takes a string, removes everything except digits from it, and returns the result

.removeComma()

this.utils.removeComma(string) - the method takes a string, removes characters such as .,\r\n from it, and returns the result
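
The three cleanup helpers can be pictured as simple regular-expression replacements. The functions below are approximations written from the descriptions above, not the A-Parser implementations; in particular, the real removeHtml may treat malformed markup differently.

```javascript
// Approximation: drop anything that looks like an HTML tag.
function removeHtml(string) {
  return string.replace(/<[^>]*>/g, '');
}

// Approximation: keep digits only.
function removeNoDigit(string) {
  return string.replace(/\D/g, '');
}

// Approximation: drop dots, commas, and line-break characters.
function removeComma(string) {
  return string.replace(/[.,\r\n]/g, '');
}

const text = removeHtml('<b>Price:</b> 1,299.00');
const digits = removeNoDigit('Price: 1,299 USD');
const compact = removeComma('1,299.00\r\n');
```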

this.sessionManager.*

To use sessions in a JS parser, you first need to initialize the Session Manager. This is done using the init() function

init() {
    this.sessionManager.init({
        //additional parameters can be set here
    });
}

The following parameters can be used in this.sessionManager.init():

  • name - optional parameter, allows overriding the name of the parser to which the sessions belong; by default, it equals the name of the parser where initialization occurs
  • canChangeProxy - optional parameter, ability to change proxy, default is 1
  • domain - optional parameter, specifies whether to search for sessions among all saved for this parser (if value is not set), or only for a specific domain (the domain must be specified with a leading dot, e.g., .site.com)

Several functions exist for working with sessions:

.get()

this.sessionManager.get() - retrieves a new session; must be called before making a request

.reset()

this.sessionManager.reset() - clears cookies and retrieves a new session. Must be called if the request was not successful with the current session.

.save()

this.sessionManager.save() - saving a successful session or saving arbitrary data in a session

results.<array>.addElement()

The results.<array>.addElement() method allows for more convenient filling of arrays in results. When using it, you don't need to remember the sequence of variables in the array and list them manually.

results.serp.addElement({
    link: 'https://google.com',
    anchor: 'Google',
    snippet: 'Lorem ipsum...',
});
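
Conceptually, addElement() maps named fields onto the fixed order of the underlying array. The sketch below models this idea outside A-Parser. The factory makeResultArray and its field list are hypothetical, and the flat storage order is an assumption made for illustration.

```javascript
// Hypothetical model of a results array with a fixed field order.
function makeResultArray(fields) {
  const flat = [];
  flat.addElement = (element) => {
    // Values are appended in the declared field order,
    // regardless of the key order in the passed object.
    for (const name of fields) flat.push(element[name]);
  };
  return flat;
}

const serp = makeResultArray(['link', 'anchor', 'snippet']);
serp.addElement({
  snippet: 'Example snippet',
  anchor: 'Google',
  link: 'https://google.com',
});
```

The keys can be given in any order; the values still land in the link, anchor, snippet sequence.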

Methods init() and destroy()

The init() method is called when the task starts, and destroy() is called when it finishes.

Usage example:

const puppeteer = require("puppeteer");
let globalBrowser;

class Parser {
    constructor() {
        ...
    }

    async init() {
        globalBrowser = await puppeteer.launch();
    }

    async destroy() {
        if (globalBrowser)
            await globalBrowser.close();
    }
}