Method Description (v1)

caution

This JavaScript API is deprecated; we recommend using API version 2.

Note that some methods require the yield keyword.

yield this.request()

yield this.request(method, url, queryParams, opts)

Sends an HTTP request and returns the response. The arguments are:

  • method - request method (GET, POST...)
  • url - URL for the request
  • queryParams - hash with GET parameters or with the POST request body
  • opts - hash with request options

If the POST method is used, the request body can be passed in two ways:

  • simply by listing the variable names and their values in queryParams. For example:
{
key: set.query,
id: 1234,
type: 'text'
}
  • via the body variable in opts. For example:
body: 'key=' + set.query + '&id=1234&type=text'
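
The two forms above carry the same form-encoded payload. As a minimal sketch in plain Node.js (not part of the A-Parser API), the hypothetical helper buildBody below shows how a hash could be turned into the string form. Unlike manual concatenation, URLSearchParams also encodes special characters in the values. The value 'test' is an assumption standing in for set.query.

```javascript
// Hypothetical helper: serialises a hash into an
// application/x-www-form-urlencoded string that could be used as opts.body.
function buildBody(params) {
    // URLSearchParams is a Node.js global; it encodes keys and values.
    return new URLSearchParams(params).toString();
}

// 'test' stands in for set.query in this sketch.
const body = buildBody({ key: 'test', id: 1234, type: 'text' });
console.log(body); // key=test&id=1234&type=text
```

Encoding matters when the query contains spaces, & or = characters, which would otherwise break the body into the wrong fields.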

opts.check_content

check_content: [ condition1, condition2, ...] - an array of conditions to check the received content; if the check fails, the request will be retried with a different proxy.

Features:

  • using strings as conditions (search by string inclusion)
  • using regular expressions as conditions
  • using custom check functions, which are passed the response data and headers
  • can set several different types of conditions at once
  • for logical negation, put the condition in an array, e.g., check_content: ['xxxx', [/yyyy/]] means the request will be considered successful if the received data contains the substring xxxx and the regular expression /yyyy/ does not match on the page

All checks specified in the array must pass for the request to be successful

Example (comments indicate what is required for the request to be considered successful):

let response = yield this.request('GET', set.query, {}, {
check_content: [
/<\/html>|<\/body>/, //this regular expression must match on the received page
['XXXX'], //this substring must not exist on the received page
'</html>', //this substring must exist on the received page
(data, hdr) => {
return hdr.Status == 200 && data.length > 100;
} //this function must return true
]
});
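
To make the rules above concrete, here is an illustrative model of how such a list of conditions could be evaluated. It is a sketch of the described semantics only, not A-Parser's actual implementation; the function names testCondition and passesChecks are hypothetical.

```javascript
// Illustrative model only: evaluates check_content-style conditions
// against response data and headers.
function testCondition(cond, data, hdr) {
    if (typeof cond === 'string') return data.includes(cond);   // substring search
    if (cond instanceof RegExp) return cond.test(data);         // regular expression
    if (typeof cond === 'function') return Boolean(cond(data, hdr)); // custom check
    throw new TypeError('Unsupported condition type');
}

function passesChecks(conditions, data, hdr) {
    // Every condition must pass; a condition wrapped in an array is negated.
    return conditions.every((cond) =>
        Array.isArray(cond)
            ? !testCondition(cond[0], data, hdr)
            : testCondition(cond, data, hdr)
    );
}

const page = '<html><body>' + 'x'.repeat(100) + '</body></html>';
const checks = [
    /<\/html>|<\/body>/,
    ['XXXX'],
    '</html>',
    (data, hdr) => hdr.Status == 200 && data.length > 100,
];
console.log(passesChecks(checks, page, { Status: 200 }));          // true
console.log(passesChecks(checks, page + 'XXXX', { Status: 200 })); // false
```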

opts.decode

decode: 'auto-html' - automatic detection of encoding and conversion to utf8

Possible values:

  • auto-html - based on headers, meta tags, and page content (optimal recommended option)
  • utf8 - indicates that the document is in utf8 encoding
  • <encoding> - any other encoding

opts.headers

headers: { ... } - hash with request headers; header names are specified in lowercase, and a cookie header can be included. Example:

headers: {
accept: 'image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8',
'accept-encoding': 'gzip, deflate, br',
cookie: 'a=321; b=test',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}

opts.headers_order

headers_order: ['cookie', 'user-agent', ...] - allows overriding the header sorting order

opts.recurse

recurse: N - the maximum number of redirect hops, default is 7, use 0 to disable redirect following

opts.proxyretries

proxyretries: N - number of request attempts, defaults to the scraper settings

opts.parsecodes

parsecodes: { ... } - list of HTTP response codes that the scraper will consider successful, defaults to the scraper settings. If you specify '*': 1 then all responses will be considered successful. Example:

parsecodes: {
200: 1,
403: 1,
500: 1
}
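
The lookup described above can be modelled in a few lines. This is only a sketch of the documented behaviour, with a hypothetical function name isCodeAccepted; it does not show how A-Parser implements the check internally.

```javascript
// Illustrative model only: decides whether a status code counts as successful.
function isCodeAccepted(parsecodes, status) {
    // '*': 1 accepts every response code.
    if (parsecodes['*'] === 1) return true;
    // Numeric keys are stored as strings, so parsecodes[403] finds '403'.
    return parsecodes[status] === 1;
}

const parsecodes = { 200: 1, 403: 1, 500: 1 };
console.log(isCodeAccepted(parsecodes, 403));   // true
console.log(isCodeAccepted(parsecodes, 404));   // false
console.log(isCodeAccepted({ '*': 1 }, 404));   // true
```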

opts.timeout

timeout: N - response timeout in seconds, defaults to the scraper settings

opts.do_gzip

do_gzip: 1 - determines whether to use compression (gzip/deflate/br), enabled by default (1), to disable set the value to 0

opts.max_size

max_size: N - maximum response size in bytes, defaults to the scraper settings

opts.cookie_jar

cookie_jar: { ... } - hash with cookies.

opts.attempt

attempt: N - indicates the current attempt number; when using this parameter, the built-in attempt handler for this request is ignored

opts.browser

browser: 1 - automatic browser header emulation (1 - enabled, 0 - disabled)

opts.use_proxy

use_proxy: 1 - overrides the use of a proxy for an individual request within the JS scraper on top of the global Use proxy parameter (1 - enabled, 0 - disabled)

opts.noextraquery

noextraquery: 0 - disables adding Extra query string to the request URL (1 - enabled, 0 - disabled)

opts.save_to_file

save_to_file: file - allows downloading the file directly to disk, bypassing in-memory storage. Instead of file, specify the name and path where to save the file. When using this option, everything related to data is ignored (content check in check_content, response.data will be empty, etc.).

opts.data_as_buffer

data_as_buffer: 0 - determines whether to return data as a String (0) or as a Buffer object (1), String is returned by default

opts.bypass_cloudflare

bypass_cloudflare: 0 - automatic bypass of CloudFlare JavaScript protection using the Chrome browser (1 - enabled, 0 - disabled)

opts.follow_meta_refresh

follow_meta_refresh: 0 - allows following redirects declared via HTML meta tag:

<meta http-equiv="refresh" content="time; url=..." />

opts.tlsOpts

tlsOpts: { ... } - allows passing settings for https connections

yield this.parser.request()

yield this.parser.request(parser, preset, overrideParams, query)

Gets results from another scraper (built-in or another JS scraper). The arguments are:

  • parser - scraper name (SE::Google, JS::Custom::Example)
  • preset - preset settings of the called scraper
  • overrideParams - hash with overrides for the called scraper's settings
  • query - query

In overrideParams you can override the parameters of the called scraper, the following flags are also available:

overrideParams.resultArraysWithObjects

resultArraysWithObjects: 0 - determines how to return result arrays from the called scraper:

  • if enabled (1) - arrays of objects will be returned
    [{link: 'link1', anchor: 'anchor1'}, {link: 'link2', anchor: 'anchor2'}, ...]
  • if disabled (0) - standard arrays of values will be returned
    ['link1', 'anchor1', 'link2', 'anchor2', ...]
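
The relationship between the two shapes can be sketched with a small conversion. The helper toObjects and the field list are hypothetical and only illustrate how a flat array of values maps onto an array of objects.

```javascript
// Hypothetical helper: groups a flat result array into objects,
// taking fields.length consecutive values per object.
function toObjects(flat, fields) {
    if (fields.length === 0) throw new Error('fields must not be empty');
    const out = [];
    for (let i = 0; i < flat.length; i += fields.length) {
        const item = {};
        fields.forEach((name, j) => { item[name] = flat[i + j]; });
        out.push(item);
    }
    return out;
}

const flat = ['link1', 'anchor1', 'link2', 'anchor2'];
console.log(toObjects(flat, ['link', 'anchor']));
// [ { link: 'link1', anchor: 'anchor1' }, { link: 'link2', anchor: 'anchor2' } ]
```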

overrideParams.needData

needData: 1 - determines whether to include (1) or omit (0) data/pages[] in the response; can be used for optimization

tools.*

Global object tools, allows access to built-in A-Parser functions (similar to templating engine tools $tools.*).

note

tools.query is not available; use this.query instead.

this.doLog()

Returns whether job logging is enabled. It can serve as an optimization flag: when logging is disabled, there is no need to evaluate a complex expression passed as the argument to this.logger.put.
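
A minimal sketch of this guard follows. The parameter parser stands for this inside a scraper method, and fakeParser is a stand-in object so the snippet can run outside A-Parser; both names are assumptions made for illustration.

```javascript
// `parser` stands for `this` inside a scraper method.
function logSummary(parser, items) {
    // Skip the costly serialisation when logging is switched off.
    if (parser.doLog()) {
        parser.logger.put('Parsed items: ' + JSON.stringify(items));
    }
}

// Stand-in object for running the sketch outside A-Parser.
const lines = [];
const fakeParser = {
    doLog: () => true,
    logger: { put: (message) => lines.push(message) },
};
logSummary(fakeParser, [{ id: 1 }]);
console.log(lines[0]); // Parsed items: [{"id":1}]
```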

this.logger.*

.put()

this.logger.put(message) - adds the string message to the job log

.putHTML()

this.logger.putHTML(code) - outputs HTML code to the job log, which will be displayed in the textarea

yield this.sleep()

yield this.sleep(sec)

Pauses the thread for sec seconds; the value can be fractional.

yield this.mutex.*

Mutex for synchronization between threads, allows locking a code section for a single thread

.lock()

Waits for the lock. The first thread to acquire the lock continues execution; other threads wait until the lock is released.

.unlock()

Releases the lock. The next thread that was waiting for it (in .lock()) continues execution.
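
A typical pattern is to wrap the protected section in try/finally so the lock is released even if the section throws. The sketch below assumes that both calls are used with yield, as the section heading suggests; parser stands for this inside a v1 scraper method, and shared is a hypothetical object visible to all threads.

```javascript
// `parser` stands for `this` inside a v1 scraper method;
// `shared` is a hypothetical object visible to all threads.
function* incrementShared(parser, shared) {
    yield parser.mutex.lock();           // wait until this thread owns the lock
    try {
        shared.counter += 1;             // critical section: one thread at a time
    } finally {
        yield parser.mutex.unlock();     // always release, even after an error
    }
    return shared.counter;
}
```

Without the finally block, an exception inside the critical section would leave the lock held and stall every other thread waiting in .lock().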

this.cookies.*

Working with cookies for the current request

.getAll()

Get an array of cookies

.setAll()

Setting cookies, an array with cookies must be passed as an argument

.set()

this.cookies.set(host, path, name, value) - setting a single cookie

this.query.add()

this.query.add(query, maxLvl)

Adds a new query (query), optionally specifying the maximum level (maxLvl), similar to tools.query.add(). A hash with parameters can also be passed as query; this works similarly to the Query Builder.

Example:

this.query.add({
query: "http://site.com",
param1: "..",
...
});

this.proxy.*

Working with proxy

.next()

Switch the proxy to the next one, the old proxy will no longer be used for the current request

.ban()

Switch and ban the proxy (must be used when the service blocks access by IP), the proxy will be banned for the time specified in the scraper settings (proxybannedcleanup)

.get()

Get the current proxy (the last proxy used for the request)

.set()

this.proxy.set('http://127.0.0.1:8080', noChange = false) - set a proxy for the next request, the noChange parameter is optional; if set to true, the proxy will not change between attempts
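
As a rough sketch of how .ban() might be combined with a request, the generator below bans the current proxy whenever the page contains a block marker. The marker text, the retry limit, and the helper name are assumptions made for this example; parser stands for this inside a v1 scraper method. Note that this.request already retries on its own according to proxyretries, so a manual loop like this is only needed for blocks that the built-in checks do not catch.

```javascript
// `parser` stands for `this` inside a v1 scraper method.
// BLOCK_MARKER and maxTries are assumptions made for this sketch.
const BLOCK_MARKER = 'Access denied';

function* fetchAvoidingBlocks(parser, url, maxTries = 3) {
    for (let i = 0; i < maxTries; i++) {
        const response = yield parser.request('GET', url, {}, {});
        const data = String(response.data || '');
        if (!data.includes(BLOCK_MARKER)) {
            return response;         // page looks fine, keep the proxy
        }
        parser.proxy.ban();          // the service blocked this IP: switch and ban
    }
    return null;                     // every attempt was blocked
}
```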

yield this.captcha.*

Working with captcha

.recognize()

yield this.captcha.recognize(preset, image, type, overrides) - loading captcha for recognition

  • preset - the preset for Util::AntiGate
  • image - binary data of the image to recognize
  • type - one of: 'jpeg', 'gif', 'png'

The result will be a hash with fields:

  • answer - text from the image
  • id - captcha id, which allows reporting an error later via reportBad
  • error - error text, if answer is not set
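
The result hash can be handled along these lines. This is a sketch under assumptions: the preset name 'default', the type 'jpeg', the empty overrides hash, and the helper name solveCaptcha are all illustrative, and parser stands for this inside a v1 scraper method.

```javascript
// `parser` stands for `this` inside a v1 scraper method.
// The preset name 'default' and type 'jpeg' are assumptions for this sketch.
function* solveCaptcha(parser, image) {
    const result = yield parser.captcha.recognize('default', image, 'jpeg', {});
    if (result.answer) {
        // Keep the id so a wrong answer can be reported via reportBad later.
        return { text: result.answer, id: result.id };
    }
    parser.logger.put('Captcha not recognised: ' + result.error);
    return null;
}
```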

.recognizeFromUrl()

yield this.captcha.recognizeFromUrl(preset, url, overrides) - similar to the previous method, but the captcha will be loaded automatically via the link (url), without using a proxy

.reportBad()

yield this.captcha.reportBad(preset, id, overrides) - inform the service that the captcha was solved incorrectly

this.utils.*

.updateResultsData()

this.utils.updateResultsData(results, data) - method for automatic completion of $pages.$i.data and $data, must be called to add the content of the resulting page

.urlFromHTML()

this.utils.urlFromHTML(url, base) - processes a link obtained from HTML code - decodes entities (&amp; etc.), optionally you can pass a base - base URL (for example, the URL of the source page), thus a full link can be obtained

.url.extractDomain()

this.utils.url.extractDomain(url, removeDefaultSubdomain) - the method takes a link as the first parameter and returns the domain from that link. The second optional parameter determines whether to trim the www subdomain from the domain. Default is 0 - meaning do not trim.

.url.extractTopDomain()

this.utils.url.extractTopDomain(url) - the method takes a link as the first parameter and returns the domain from that link, without subdomains.

.url.extractTopDomainByZone()

this.utils.url.extractTopDomainByZone(url) - the method takes a link as the first parameter and returns the domain from that link without subdomains. Works with all regional zones.

.url.extractMaxPath()

this.utils.url.extractMaxPath(url) - the method takes a string and extracts a URL from it

.url.extractWOParams()

this.utils.url.extractWOParams(url) - the method takes a link and returns it trimmed before the parameter string, that is, the part of the URL before ?

.removeHtml()

this.utils.removeHtml(string) - the method takes a string and returns it cleared from html tags

.removeNoDigit()

this.utils.removeNoDigit(string) - the method takes a string, removes everything except digits from it, and returns the result

.removeComma()

this.utils.removeComma(string) - the method takes a string, removes characters like .,\r\n from it, and returns the result
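
For orientation, a few of the string helpers above can be approximated in plain JavaScript. These are rough equivalents based only on the descriptions, with hypothetical names; the built-in methods may handle edge cases differently (the tag stripping here is deliberately naive).

```javascript
// Plain-JavaScript approximations, named differently from the built-in methods.
const urlWithoutParams = (url) => url.split('?')[0];        // like extractWOParams
const digitsOnly = (text) => text.replace(/\D/g, '');       // like removeNoDigit
const stripTags = (html) => html.replace(/<[^>]*>/g, '');   // like removeHtml (naive)

console.log(urlWithoutParams('https://site.com/path?a=1&b=2')); // https://site.com/path
console.log(digitsOnly('Price: 1,250 USD'));                     // 1250
console.log(stripTags('<b>Hello</b> world'));                    // Hello world
```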

this.sessionManager.*

To use sessions in a JS scraper, you must first initialize the Session Manager. This is done using the init() function

init() {
this.sessionManager.init({
//additional parameters can be specified here
});
}

The following parameters can be used in this.sessionManager.init():

  • name - optional parameter, allows overriding the name of the scraper to which the sessions belong, defaults to the name of the scraper where initialization occurs
  • canChangeProxy - optional parameter, ability to change the proxy, defaults to 1
  • domain - optional parameter, indicates whether to search for sessions among all saved for this scraper (if the value is not set), or only for a specific domain (you must specify the domain with a dot in front, for example .site.com)

There are several functions for working with sessions:

.get()

this.sessionManager.get() - gets a new session, must be called before making a request

.reset()

this.sessionManager.reset() - clearing cookies and getting a new session. Must be called if the request with the current session was unsuccessful.

.save()

this.sessionManager.save() - saving a successful session or saving arbitrary data in the session

results.<array>.addElement()

The method results.<array>.addElement() allows for more convenient filling of arrays in results. When using it, you don't need to remember the sequence of variables in the array and list them manually.

results.serp.addElement({
link: 'https://google.com',
anchor: 'Google',
snippet: 'Lorem ipsum...',
});

Methods init() and destroy()

The method init() is called when the job starts, destroy() - upon completion.

Example usage:

const puppeteer = require("puppeteer");
let globalBrowser;

class Parser {
    constructor() {
        ...
    }

    async init() {
        globalBrowser = await puppeteer.launch();
    }

    async destroy() {
        if (globalBrowser)
            await globalBrowser.close();
    }
}