HTTP requests (including cookies, proxies, and sessions)

Base class methods

To collect data from a web page, you need to perform an HTTP request. The JavaScript API v2 of A-Parser provides an easy-to-use method for executing HTTP requests, which returns a JSON object whose contents depend on the specified method arguments. This section covers how an HTTP request is made, which arguments and options the method accepts, what each option does, how to specify the success condition of an HTTP request, and more.

Methods for manipulating cookies, proxies, and sessions in a custom scraper are also described. Before executing an HTTP request, or after a successful one, you can set or change the proxy, cookie, and session data used for HTTP requests, or save them for use by another thread with the Session Manager.

These methods are inherited from BaseParser and serve as the basis for creating custom scrapers.

await this.request(method, url[, queryParams][, opts])

await this.request(method, url, queryParams, opts)

Performs an HTTP request and returns the response. Arguments:

  • method - request method (GET, POST...)
  • url - request link
  • queryParams - a hash with get parameters or a hash with the post request body
  • opts - a hash with request options

opts.check_content

check_content: [ condition1, condition2, ...] - an array of conditions for checking the received content; if a check fails, the request is repeated with another proxy.

Features:

  • using strings as conditions (search by string occurrence)
  • using regular expressions as conditions
  • using custom check functions that receive data and response headers
  • multiple different types of conditions can be specified at once
  • for logical negation, place the condition in an array, i.e. check_content: ['xxxx', [/yyyy/]] means that the request is considered successful if the received data contains the substring xxxx and, at the same time, the regular expression /yyyy/ finds no matches on the page

For a successful request, all checks specified in the array must pass.

Example (the comments indicate what is needed for the request to be considered successful):

let response = await this.request('GET', set.query, {}, {
check_content: [
/<\/html>|<\/body>/, // this regular expression must trigger on the received page
['XXXX'], // this substring must not be present on the received page
'</html>', // this substring must be present on the received page
(data, hdr) => {
return hdr.Status == 200 && data.length > 100;
} // this function must return true
]
});
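
The rules above can be modeled outside of A-Parser with a small standalone function. This is only a sketch of the described semantics, not A-Parser's implementation: passesCheckContent is a hypothetical helper, and the sample page and header object are made up for illustration.

```javascript
// Hypothetical helper (not part of the A-Parser API): models the
// check_content rules described above for a given page and headers.
function passesCheckContent(conditions, data, hdr) {
  const test = (cond) => {
    if (typeof cond === 'string') return data.includes(cond);        // substring must occur
    if (cond instanceof RegExp) return cond.test(data);              // regex must match
    if (typeof cond === 'function') return Boolean(cond(data, hdr)); // function must return a truthy value
    throw new TypeError('Unsupported condition type');
  };
  // A condition wrapped in an array is negated; every entry must pass.
  return conditions.every((cond) =>
    Array.isArray(cond) ? !test(cond[0]) : test(cond)
  );
}

const page = '<html><body>' + 'x'.repeat(200) + '</body></html>';
const conditions = [
  /<\/html>|<\/body>/,
  ['XXXX'],
  '</html>',
  (data, hdr) => hdr.Status == 200 && data.length > 100,
];

console.log(passesCheckContent(conditions, page, { Status: 200 }));          // true
console.log(passesCheckContent(conditions, page + 'XXXX', { Status: 200 })); // false
```

In the second call the negated condition ['XXXX'] fails, so the whole check fails; in a real scraper this is the situation in which the request would be repeated with another proxy.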

opts.decode

decode: 'auto-html' - automatic encoding detection and conversion to utf8

Possible values:

  • auto-html - based on headers, meta tags, and page content (optimal recommended option)
  • utf8 - indicates that the document is in utf8 encoding
  • <encoding> - any other encoding

opts.headers

headers: { ... } - a hash with headers; header names are specified in lowercase, and cookie can be specified here as well.

Example:

headers: {
accept: 'image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8',
'accept-encoding': 'gzip, deflate, br',
cookie: 'a=321; b=test',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}

opts.headers_order

headers_order: ['cookie', 'user-agent', ...] - allows overriding the header sorting order

opts.onlyheaders

onlyheaders: 0 - controls whether the response body is read; if enabled (1), only the headers are received

opts.recurse

recurse: N - maximum number of redirect steps, default 7, use 0 to disable following redirects

opts.proxyretries

proxyretries: N - number of request execution attempts, by default taken from scraper settings

opts.parsecodes

parsecodes: { ... } - list of HTTP response codes that the scraper will consider successful, by default taken from scraper settings. If '*': 1 is specified, all responses will be considered successful.

Example:

parsecodes: {
200: 1,
403: 1,
500: 1
}
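
As a rough illustration of how such a hash can be read, the sketch below checks a status code against a parsecodes object, including the '*' wildcard. isAcceptedStatus is a hypothetical helper written for this example and is not an A-Parser function.

```javascript
// Hypothetical helper: models the parsecodes lookup described above.
function isAcceptedStatus(parsecodes, status) {
  // '*': 1 accepts every status; otherwise the code must be listed with a truthy value.
  return Boolean(parsecodes['*'] || parsecodes[status]);
}

const parsecodes = { 200: 1, 403: 1, 500: 1 };

console.log(isAcceptedStatus(parsecodes, 403));  // true
console.log(isAcceptedStatus(parsecodes, 404));  // false
console.log(isAcceptedStatus({ '*': 1 }, 404));  // true
```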

opts.timeout

timeout: N - response timeout in seconds, by default taken from scraper settings

opts.do_gzip

do_gzip: 1 - determines whether to use compression (gzip/deflate/br), enabled by default (1), to disable set the value to 0

opts.max_size

max_size: N - maximum response size in bytes, by default taken from scraper settings

opts.cookie_jar

cookie_jar: { ... } - a hash with cookies. Example hash:

"cookie_jar": {
"version": 1,
".google.com": {
"/": {
"login": {
"value": "true"
},
"lang": {
"value": "ru-RU"
}
}
},
".test.google.com": {
"/": {
"id": {
"value": 155643
}
}
}
}
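
When cookies come from a flat list (for example, parsed from a file), the nested host, path, name layout shown above can be assembled programmatically. The following is a minimal sketch; buildCookieJar and the shape of its input objects are assumptions made for this example, and only the resulting layout follows the hash shown above.

```javascript
// Hypothetical helper: builds a hash in the cookie_jar layout
// (host -> path -> name -> { value }) from a flat list of cookies.
function buildCookieJar(cookies) {
  const jar = { version: 1 };
  for (const { host, path = '/', name, value } of cookies) {
    jar[host] = jar[host] || {};
    jar[host][path] = jar[host][path] || {};
    jar[host][path][name] = { value };
  }
  return jar;
}

const jar = buildCookieJar([
  { host: '.google.com', name: 'login', value: 'true' },
  { host: '.google.com', name: 'lang', value: 'ru-RU' },
  { host: '.test.google.com', name: 'id', value: 155643 },
]);

console.log(JSON.stringify(jar, null, 2));
```

The resulting object could then be passed as cookie_jar in opts.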

opts.attempt

attempt: N - indicates the current attempt number; when using this parameter, the built-in attempt handler for this request is ignored

opts.browser

browser: 1 - automatic browser header emulation (1 - enabled, 0 - disabled)

opts.use_proxy

use_proxy: 1 - overrides proxy usage for an individual request inside the JS scraper on top of the global parameter Use proxy (1 - enabled, 0 - disabled)

opts.noextraquery

noextraquery: 0 - disables adding Extra query string to the request URL (1 - enabled, 0 - disabled)

opts.save_to_file

save_to_file: file - allows downloading a file directly to disk without buffering it in memory. Instead of file, specify the path and name under which to save the file. When this option is used, everything related to response data is skipped: the content check in opts.check_content is not performed, response.data is empty, and so on.

opts.bypass_cloudflare

bypass_cloudflare: 0 - automatic CloudFlare JavaScript protection bypass using Chrome browser (1 - enabled, 0 - disabled)

In this case, headless Chrome is controlled by the scraper settings bypassCloudFlareChromeMaxPages and bypassCloudFlareChromeHeadless, which must be specified in static defaultConf and static editableConf:

static defaultConf: typeof BaseParser.defaultConf = {
version: '0.0.1',
results: {
flat: [
['title', 'Title'],
]
},
max_size: 2 * 1024 * 1024,
parsecodes: {
200: 1,
},
results_format: "$title\n",
bypass_cloudflare: 1,
bypassCloudFlareChromeMaxPages: 20,
bypassCloudFlareChromeHeadless: 0
};

static editableConf: typeof BaseParser.editableConf = [
['bypass_cloudflare', ['textfield', 'bypass_cloudflare']],
['bypassCloudFlareChromeMaxPages', ['textfield', 'bypassCloudFlareChromeMaxPages']],
['bypassCloudFlareChromeHeadless', ['textfield', 'bypassCloudFlareChromeHeadless']],
];

async parse(set, results) {
const {success, data, headers} = await this.request('GET', set.query, {}, {
bypass_cloudflare: this.conf.bypass_cloudflare
});
return results;
}

opts.follow_meta_refresh

follow_meta_refresh: 0 - allows following redirects declared via HTML meta tag:

<meta http-equiv="refresh" content="time; url=..."/>
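
To make the tag format concrete, here is a simplified sketch that extracts the delay and the target URL from such a tag. parseMetaRefresh is a hypothetical helper; it uses regular expressions for brevity and does not describe how A-Parser itself processes the tag.

```javascript
// Hypothetical helper: extracts delay and URL from a meta refresh tag.
// Simplified on purpose; real pages may call for a proper HTML parser.
function parseMetaRefresh(html) {
  const tag = html.match(/<meta[^>]+http-equiv=["']?refresh["']?[^>]*>/i);
  if (!tag) return null;
  const content = tag[0].match(
    /content=["']\s*(\d+)\s*(?:;\s*url\s*=\s*([^"']+))?["']/i
  );
  if (!content) return null;
  return {
    delay: Number(content[1]),                    // seconds before the redirect
    url: content[2] ? content[2].trim() : null,   // null means "reload the same page"
  };
}

console.log(parseMetaRefresh(
  '<meta http-equiv="refresh" content="5; url=https://example.com/next"/>'
));
// { delay: 5, url: 'https://example.com/next' }
```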

opts.redirect_filter

redirect_filter: (hdr) => 1 | 0 - allows specifying a redirect filtering function; if the function returns 1, the scraper follows the redirect (subject to the opts.recurse parameter); if it returns 0, the scraper stops following redirects:

redirect_filter: (hdr) => {
if (hdr.location.match(/login/))
return 1;
return 0;
}
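
Another possible filter, sketched here under the same assumption as the example above (that hdr.location holds the redirect target), follows only redirects that stay on the original host. sameHostRedirectFilter is a hypothetical factory introduced for this illustration.

```javascript
// Hypothetical factory: returns a redirect_filter-style function that
// follows only redirects staying on the host of the original URL.
function sameHostRedirectFilter(originalUrl) {
  const originalHost = new URL(originalUrl).host;
  return (hdr) => {
    if (!hdr.location) return 0; // nothing to follow
    // Resolve relative Location values against the original URL.
    const target = new URL(hdr.location, originalUrl);
    return target.host === originalHost ? 1 : 0;
  };
}

const filter = sameHostRedirectFilter('https://a-parser.com/users/');

console.log(filter({ location: '/login' }));                    // 1
console.log(filter({ location: 'https://example.com/login' })); // 0
```

The returned function can be assigned to redirect_filter in opts; unlike the example above, it also tolerates a missing location header.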

opts.follow_common_rediects

opts.follow_common_rediects: 0 - determines whether to follow standard redirects (e.g., http -> https and/or www.domain.com -> domain.com), if you specify 1 then the scraper will follow standard redirects regardless of parameter opts.recurse

opts.http2

opts.http2: 0 - determines whether to use the HTTP/2 protocol when performing requests, by default HTTP/1.1 is used

opts.randomize_tls_fingerprint

opts.randomize_tls_fingerprint: 0 - this option allows bypassing website bans by TLS fingerprint (1 - enabled, 0 - disabled)

opts.tlsOpts

tlsOpts: { ... } - allows passing options for https connections

await this.cookies.*

Working with cookies for the current request

.getAll()

Getting an array of cookies

await this.cookies.getAll();
Example of the result of getting an array of cookies

.setAll(cookie_jar)

Setting cookies, a hash with cookies must be passed as an argument

async parse(set, results) {
this.logger.put("Start scraping query: " + set.query);

await this.cookies.setAll({
"version": 1,
".google.com": {
"/": {
"login": {
"value": "true"
},
"lang": {
"value": "ru-RU"
}
}
},
".test.google.com": {
"/": {
"id": {
"value": 155643
}
}
}
});

let cookies = await this.cookies.getAll();

this.logger.put("Cookies: " + JSON.stringify(cookies));

results.SKIP = 1;
return results;
}
Example of the result of setting an array of cookies

.set(host, path, name, value)

await this.cookies.set(host, path, name, value) - setting a single cookie.

The cookie scope directly depends on the format of the specified domain, so in host the presence of a dot before the host is considered:

  • if a dot is specified (this.cookies.set('.domain.com', ...)), then the cookie will be used for all subdomains (e.g., a.domain.com, b.a.domain.com)
  • if the host is specified without a leading dot (this.cookies.set('site.com', ...)), then the cookie will be used strictly for the specified host (host-only cookie) and is not passed to subdomains

This distinction is critically important, as the simultaneous existence of cookies with and without a dot can lead to their duplication and unpredictable website behavior. For correct emulation, always check exactly how the target website sets cookies (with or without the Domain attribute) and use the appropriate format.
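
The scope rule can be expressed as a small matching function. This sketch only models the rule as described; cookieAppliesTo is a hypothetical helper, and treating the bare domain as a match for a dotted cookie follows common browser behavior, which is an assumption here rather than a documented A-Parser detail.

```javascript
// Hypothetical helper: models which request hosts a cookie applies to,
// depending on whether the cookie host has a leading dot.
function cookieAppliesTo(cookieHost, requestHost) {
  if (cookieHost.startsWith('.')) {
    // Leading dot: the domain itself (assumption) and any subdomain match.
    const domain = cookieHost.slice(1);
    return requestHost === domain || requestHost.endsWith('.' + domain);
  }
  // No leading dot: host-only cookie, exact match required.
  return requestHost === cookieHost;
}

console.log(cookieAppliesTo('.domain.com', 'b.a.domain.com')); // true
console.log(cookieAppliesTo('.domain.com', 'notdomain.com'));  // false
console.log(cookieAppliesTo('site.com', 'site.com'));          // true
console.log(cookieAppliesTo('site.com', 'www.site.com'));      // false
```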

async parse(set, results) {
this.logger.put("Start scraping query: " + set.query);

await this.cookies.set('.a-parser.com', '/', 'Test-cookie-1', 1);
await this.cookies.set('.a-parser.com', '/', 'Test-cookie-2', 'test-value');

let cookies = await this.cookies.getAll();

this.logger.put("Cookies: " + JSON.stringify(cookies));

results.SKIP = 1;
return results;
}
Example of the result of setting a single cookie

await this.proxy.*

Working with proxies

.next()

Change proxy to the next one, the old proxy will no longer be used for the current request

.ban()

Change and ban proxy (necessary to use when the service blocks work by IP), the proxy will be banned for the time specified in the scraper settings (proxybannedcleanup)

.get()

Get current proxy (the last proxy with which the request was made)

.set(proxy, noChange?)

await this.proxy.set('http://127.0.0.1:8080', true) - set proxy for the next request. Parameter noChange is optional; if true is set, the proxy will not change between attempts. By default noChange = false

await this.sessionManager.*

Methods for working with sessions. Each session always stores the proxy and cookies that were used. You can also save arbitrary additional data. To use sessions in a JS scraper, you must first initialize the Session Manager. This is done by calling await this.sessionManager.init() in init().

.init(opts?)

Session Manager initialization. An object (opts) with additional parameters can be passed as an argument (all parameters are optional):

  • name - allows overriding the name of the scraper to which the sessions belong; by default it equals the name of the scraper in which initialization occurs
  • waitForSession - tells the scraper to wait for a session until it appears (this is relevant only when multiple tasks are running, e.g., one generates sessions, the second uses them), i.e. .get() and .reset() will always wait for a session
  • domain - indicates to search for sessions among all saved for this scraper (if value is not set), or only for a specific domain (must specify domain with a leading dot, e.g. .site.com)
  • sessionsKey - allows manually specifying the session storage names; if not set, the name is formed automatically based on name (or the scraper name if name is not set), domain, and proxy checker
  • expire - sets the session lifetime in minutes, default is unlimited

Usage example:

async init() {
await this.sessionManager.init({
name: 'JS::test',
expire: 15 * 60
});
}

.get(opts?)

Getting a new session, must be called before making a request (before the first attempt). Returns an object with arbitrary data saved in the session. An object can be passed as an argument (opts) with additional parameters (all parameters are optional):

  • waitTimeout - ability to specify how many minutes to wait for a session to appear, works independently of the waitForSession parameter in .init() (and ignores it); upon expiration, an empty session will be used
  • tag - getting a session with a given tag; for example, a domain name can be used to bind sessions to the domains they were obtained from

Usage example:

await this.sessionManager.get({
waitTimeout: 10,
tag: 'test session'
})

.reset(opts?)

Clearing cookies and getting a new session. Should be used if the request was not successful with the current session. Returns an object with arbitrary data saved in the session. An object can be passed as an argument (opts) with additional parameters (all parameters are optional):

  • waitTimeout - ability to specify how many minutes to wait for a session to appear, works independently of the waitForSession parameter in .init() (and ignores it); upon expiration, an empty session will be used
  • tag - getting a session with a given tag; for example, a domain name can be used to bind sessions to the domains they were obtained from

Usage example:

await this.sessionManager.reset({
waitTimeout: 5,
tag: 'test session'
})

.save(sessionOpts?, saveOpts?)

Saving a successful session with the ability to save arbitrary data in the session. Supports 2 optional arguments:

  • sessionOpts - arbitrary data for storage in the session, can be a number, string, array, or object
  • saveOpts - an object with session saving parameters:
    • multiply - optional parameter, allows multiplying the session; a number should be specified as the value
    • tag - optional parameter, sets a tag for the saved session; for example, a domain name can be used to bind sessions to the domains they were obtained from

Usage example:

await this.sessionManager.save('some data here', {
multiply: 3,
tag: 'test session'
})

.count()

Returns the number of sessions for the current Session Manager

Usage example:

let sesCount = await this.sessionManager.count();

.removeById(sessionId)

Deletes all sessions with a given id. Returns the number of deleted sessions. The current session id is contained in the variable this.sessionId.

Usage example:

const removedCount = await this.sessionManager.removeById(this.sessionId);

Complex example of using Session Manager

async init() {
await this.sessionManager.init({
expire: 15 * 60
});
}

async parse(set, results) {
let ses = await this.sessionManager.get();

for(let attempt = 1; attempt <= this.conf.proxyretries; attempt++) {
if(ses)
this.logger.put('Data from session:', ses);
const { success, data } = await this.request('GET', set.query, {}, { attempt });
if(success) {
// process data here
results.success = 1;
break;
} else if(attempt < this.conf.proxyretries) {
const removedCount = await this.sessionManager.removeById(this.sessionId);
this.logger.put(`Removed ${removedCount} bad sessions with id #${this.sessionId}`);
ses = await this.sessionManager.reset();
}
}

if(results.success) {
await this.sessionManager.save('Some data', { multiply: 2 });
this.logger.put(`Total we have ${await this.sessionManager.count()} sessions`);
}

return results;
}
Example of saving arbitrary data and further obtaining it

Request methods await this.request

GET Method

Request parameters can be passed directly in the request string https://a-parser.com/users/?type=staff:

const { success, data, headers } = await this.request('GET', 'https://a-parser.com/users/?type=staff');

Or as an object in queryParams, where key: value equals param=value:

const { success, data, headers } = await this.request('GET', 'https://a-parser.com/users/', {
type: 'staff'
});
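
The equivalence of the two forms can be checked with the standard URL class. This is an illustration only: withQuery is a hypothetical helper, and A-Parser may encode special characters differently than the URL class does.

```javascript
// Hypothetical helper: builds the full URL that corresponds to
// a base URL plus a queryParams object.
function withQuery(baseUrl, queryParams) {
  const url = new URL(baseUrl);
  for (const [key, value] of Object.entries(queryParams)) {
    url.searchParams.append(key, String(value));
  }
  return url.toString();
}

console.log(withQuery('https://a-parser.com/users/', { type: 'staff' }));
// https://a-parser.com/users/?type=staff
```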

POST Method

If the POST method is used, the request body can be passed in two ways:

  • List variable names and their values in queryParams, for example:

    {
    "key": set.query,
    "id": 1234,
    "type": "text"
    }
  • List them in opts.body, for example:

    body: 'key=' + set.query + '&id=1234&type=text'
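
When the body string is assembled by hand, values such as set.query may contain characters like & or spaces that need escaping. A minimal sketch of building an escaped body with the standard URLSearchParams class follows; buildFormBody is a hypothetical helper and the query value stands in for set.query.

```javascript
// Hypothetical helper: serializes fields as an
// application/x-www-form-urlencoded string with proper escaping.
function buildFormBody(fields) {
  const pairs = Object.entries(fields).map(([key, value]) => [key, String(value)]);
  return new URLSearchParams(pairs).toString();
}

const query = 'red & blue shoes'; // stands in for set.query

console.log(buildFormBody({ key: query, id: 1234, type: 'text' }));
// key=red+%26+blue+shoes&id=1234&type=text
```

The resulting string can be used as the value of opts.body.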

If the request body is passed as an object, it is automatically converted to form-urlencoded form. Also, if body is specified and no content-type header is given, content-type: application/x-www-form-urlencoded is assigned automatically:

const { success, data, headers } = await this.request('POST', 'https://jsonplaceholder.typicode.com/posts', {
title: 'foo',
body: 'bar',
userId: 1
});

If the body of the POST request is a string or buffer, it is passed as is:

// request with a string
const string = 'title=foo&body=bar&userId=1';
const { success, data, headers } = await this.request('POST', 'https://jsonplaceholder.typicode.com/posts', {}, {
body: string
});

// request with a buffer
const string = 'title=foo&body=bar&userId=1';
const buf = Buffer.from(string, 'utf8');
const { success, data, headers } = await this.request('POST', 'https://jsonplaceholder.typicode.com/posts', {}, {
body: buf
});

Uploading files

Sending a file via POST request using the form-data module:

const fs = require('fs');
const file = fs.readFileSync('pathToFile');
const FormData = require('form-data');
const format = new FormData();
format.append('file', file, 'fileName.ext');

const { success, data, headers } = await this.request('POST', 'https://file.io', {}, {
headers: format.getHeaders(),
body: format.getBuffer()
});

Example of sending a file in a POST request with content type multipart/form-data:

const EOL = '\r\n';
const fs = require('fs');
const file = fs.readFileSync('pathToFile');
const boundary = '----WebKitFormBoundary' + String(Math.random()).slice(2);
const requestHeaders = {
'content-type': 'multipart/form-data; boundary=' + boundary
};

const body = '--'
+ boundary
+ EOL
+ 'Content-Disposition: form-data; name="file"; filename="fileName.ext"'
+ EOL
+ 'Content-Type: text/html'
+ EOL
+ EOL
+ file // the Buffer is converted to a UTF-8 string here, which is only safe for text files
+ EOL
+ '--'
+ boundary
+ '--';

const { success, data, headers } = await this.request('POST', 'https://file.io', {}, {
headers: requestHeaders,
body
});
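
For binary files, the same body can be assembled as a Buffer so that the file bytes are not converted to a string. The sketch below is one way to do that; buildMultipartBody is a hypothetical helper, and the boundary and the sample bytes are placeholders for this example.

```javascript
// Hypothetical helper: assembles a single-file multipart/form-data body
// as a Buffer, keeping the file bytes intact.
function buildMultipartBody(boundary, fieldName, fileName, contentType, fileBuffer) {
  const EOL = '\r\n';
  const head =
    '--' + boundary + EOL +
    `Content-Disposition: form-data; name="${fieldName}"; filename="${fileName}"` + EOL +
    'Content-Type: ' + contentType + EOL + EOL;
  const tail = EOL + '--' + boundary + '--' + EOL;
  return Buffer.concat([
    Buffer.from(head, 'utf8'),
    fileBuffer,
    Buffer.from(tail, 'utf8'),
  ]);
}

const boundary = '----ExampleBoundary123';
// Stand-in for fs.readFileSync('pathToFile'); includes non-text bytes.
const fileBuffer = Buffer.from([0xff, 0xd8, 0xff, 0x00]);
const body = buildMultipartBody(
  boundary, 'file', 'fileName.ext', 'application/octet-stream', fileBuffer
);

console.log(body.includes(fileBuffer)); // true, the file bytes are preserved
```

Since a buffer body is passed as is, the result can be given as opts.body together with a content-type header of multipart/form-data that carries the same boundary.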