HTTP requests (+working with cookies, proxies, sessions)
Base class methods
To collect data from a web page, you need to perform an HTTP request. In JavaScript API v2
of A-Parser, there is an easy-to-use method for making HTTP requests, which returns a JSON object in response depending on the specified arguments of the method. Below you will learn: how an HTTP request is made, what arguments and options the method has, the results of the specified options, how to specify the condition for the success of the HTTP request, and more.
Also described are methods that allow you to easily manipulate cookies, proxies, and sessions in the scraper being created. After a successful HTTP request, or before making one, you can set/change proxy/cookie/session data for making HTTP requests or save it for execution by another thread using Session Manager.
These methods are inherited from BaseParser
and are the foundation for creating your own scrapers.
await this.request(method, url[, queryParams][, opts])
await this.request(method, url, queryParams, opts)
Performs an HTTP request and returns the response. The arguments are:
- method - request method (GET, POST, ...)
- url - request URL
- queryParams - hash with GET parameters or with the body of the POST request
- opts - hash with request options
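For example, a minimal sketch of such a request inside a scraper's parse() method (the page GET parameter and the log message are illustrative; set.query, this.request, and this.logger are provided by the A-Parser runtime):

```javascript
// A minimal sketch: a GET request with one illustrative query-string parameter
async parse(set, results) {
    const { success, data } = await this.request('GET', set.query, { page: 1 }, {
        timeout: 30, // response timeout in seconds
    });
    if (success)
        this.logger.put('Received ' + data.length + ' bytes');
    return results;
}
```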
opts.check_content
check_content: [ condition1, condition2, ...]
- an array of conditions for checking the received content; if a check fails, the request will be retried with a different proxy.
Capabilities:
- use of strings as conditions (search by string inclusion)
- use of regular expressions as conditions
- use of custom check functions, which are passed the data and headers of the response
- you can set several different types of conditions at once
- for logical negation, place the condition in an array, i.e., check_content: ['xxxx', [/yyyy/]] means that the request will be considered successful if the received data contains the substring xxxx and at the same time the regular expression /yyyy/ finds no matches on the page

All checks listed in the array must pass for the request to be successful.
Example (comments indicate what is needed for the request to be considered successful):
let response = await this.request('GET', set.query, {}, {
    check_content: [
        /<\/html>|<\/body>/, // this regular expression must match on the received page
        ['XXXX'], // the received page must not contain this substring
        '</html>', // the received page must contain this substring
        (data, hdr) => {
            return hdr.Status == 200 && data.length > 100;
        } // this function must return true
    ]
});
opts.decode
decode: 'auto-html'
- automatic detection of encoding and conversion to utf8
Possible values:
- auto-html - based on headers, meta tags, and the content of the page (the recommended option)
- utf8 - indicates that the document is in utf8 encoding
- <encoding> - any other encoding
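For instance, a short sketch of requesting a page whose encoding is not known in advance:

```javascript
// Let the scraper detect the encoding from headers/meta tags and convert to utf8
const { success, data } = await this.request('GET', set.query, {}, {
    decode: 'auto-html',
});
```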
opts.headers
headers: { ... }
- hash with headers; header names are set in lowercase, and a cookie header can also be specified.
Example:
headers: {
accept: 'image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8',
'accept-encoding': 'gzip, deflate, br',
cookie: 'a=321; b=test',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}
opts.headers_order
headers_order: ['cookie', 'user-agent', ...]
- allows you to redefine the order of header sorting
opts.onlyheaders
onlyheaders: 0
- determines whether to read the response body; if enabled (1), only the headers are received
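This can be used, for example, to check a resource before downloading it; a sketch (the hdr/headers fields follow the hdr.Status usage shown in the check_content example above):

```javascript
// Fetch only the response headers, skipping the body
const { success, headers } = await this.request('GET', set.query, {}, {
    onlyheaders: 1,
});
if (success)
    this.logger.put('Status: ' + headers.Status);
```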
opts.recurse
recurse: N
- the maximum number of redirects to follow, 7 by default; use 0 to disable following redirects
opts.proxyretries
proxyretries: N
- the number of attempts to execute the request, by default taken from the scraper settings
opts.parsecodes
parsecodes: { ... }
- a list of HTTP response codes that the scraper will consider successful; by default taken from the scraper settings. If you specify '*': 1, then all responses will be considered successful.
Example:
parsecodes: {
200: 1,
403: 1,
500: 1
}
opts.timeout
timeout: N
- response timeout in seconds, by default taken from the scraper settings
opts.do_gzip
do_gzip: 1
- determines whether to use compression (gzip/deflate/br); enabled (1) by default, set to 0 to disable
opts.max_size
max_size: N
- maximum response size in bytes, by default taken from the scraper settings
opts.cookie_jar
cookie_jar: { ... }
- hash with cookies. Example hash:
"cookie_jar": {
"version": 1,
".google.com": {
"/": {
"login": {
"value": "true"
},
"lang": {
"value": "ru-RU"
}
}
},
".test.google.com": {
"/": {
"id": {
"value": 155643
}
}
  }
}
opts.attempt
attempt: N
- indicates the number of the current attempt; when this parameter is used, the built-in attempt handler for the request is ignored
opts.browser
browser: 1
- automatic emulation of browser headers (1 - enabled, 0 - disabled)
opts.use_proxy
use_proxy: 1
- overrides the use of a proxy for an individual request within the JS scraper, on top of the global Use proxy parameter (1 - enabled, 0 - disabled)
opts.noextraquery
noextraquery: 0
- disables the addition of Extra query string to the request URL (1 - enabled, 0 - disabled)
opts.save_to_file
save_to_file: file
- allows downloading a file directly to disk, bypassing memory. Instead of file, specify the name and path under which to save the file. When this option is used, everything related to response data is ignored (the content check in opts.check_content will not be performed, response.data will be empty, etc.)
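A sketch (the file path here is illustrative):

```javascript
// Stream the response directly to disk; response.data stays empty
const { success } = await this.request('GET', set.query, {}, {
    save_to_file: 'downloads/result.html',
});
if (success)
    this.logger.put('File saved');
```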
opts.bypass_cloudflare
bypass_cloudflare: 0
- automatic bypass of CloudFlare's JavaScript protection using the Chrome browser (1 - enabled, 0 - disabled).
Chrome Headless in this case is controlled by the scraper settings bypassCloudFlareChromeMaxPages and bypassCloudFlareChromeHeadless, which need to be specified in static defaultConf and static editableConf:
static defaultConf: typeof BaseParser.defaultConf = {
version: '0.0.1',
results: {
flat: [
['title', 'Title'],
]
},
max_size: 2 * 1024 * 1024,
parsecodes: {
200: 1,
},
results_format: "$title\n",
bypass_cloudflare: 1,
bypassCloudFlareChromeMaxPages: 20,
bypassCloudFlareChromeHeadless: 0
};
static editableConf: typeof BaseParser.editableConf = [
['bypass_cloudflare', ['textfield', 'bypass_cloudflare']],
['bypassCloudFlareChromeMaxPages', ['textfield', 'bypassCloudFlareChromeMaxPages']],
['bypassCloudFlareChromeHeadless', ['textfield', 'bypassCloudFlareChromeHeadless']],
];
async parse(set, results) {
const {success, data, headers} = await this.request('GET', set.query, {}, {
bypass_cloudflare: this.conf.bypass_cloudflare
});
return results;
}
opts.follow_meta_refresh
follow_meta_refresh: 0
- allows following redirects declared through the HTML meta tag:
<meta http-equiv="refresh" content="time; url=..."/>
opts.redirect_filter
redirect_filter: (hdr) => 1 | 0
- allows setting a redirect filter function; if the function returns 1, the scraper will follow the redirect (taking the opts.recurse parameter into account), and if it returns 0 the redirect will be stopped:
redirect_filter: (hdr) => {
if (hdr.location.match(/login/))
return 1;
return 0;
}
opts.follow_common_redirects
follow_common_redirects: 0
- determines whether to follow standard redirects (for example http -> https and/or www.domain.com -> domain.com); if 1 is specified, the scraper will follow standard redirects regardless of the opts.recurse parameter
opts.http2
http2: 0
- determines whether to use the HTTP/2 protocol when making requests; by default HTTP/1.1 is used
opts.randomize_tls_fingerprint
randomize_tls_fingerprint: 0
- this option allows bypassing site bans by TLS fingerprint (1 - enabled, 0 - disabled)
opts.tlsOpts
tlsOpts: { ... }
- allows passing settings for HTTPS connections
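The exact set of supported keys is not listed here; a hypothetical sketch, assuming Node.js-style TLS connection options such as ciphers:

```javascript
// Hypothetical: pass custom TLS settings for HTTPS connections
// (the `ciphers` key is an assumption based on Node.js TLS options)
const { success, data } = await this.request('GET', set.query, {}, {
    tlsOpts: {
        ciphers: 'TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256',
    },
});
```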
await this.cookies.*
Working with cookies for the current request
.getAll()
Retrieving an array of cookies
await this.cookies.getAll();
.setAll(cookie_jar)
Setting cookies, a hash with cookies should be passed as an argument
async parse(set, results) {
this.logger.put("Start scraping query: " + set.query);
await this.cookies.setAll({
"version": 1,
".google.com": {
"/": {
"login": {
"value": "true"
},
"lang": {
"value": "ru-RU"
}
}
},
".test.google.com": {
"/": {
"id": {
"value": 155643
}
}
}
});
let cookies = await this.cookies.getAll();
this.logger.put("Cookies: " + JSON.stringify(cookies));
results.SKIP = 1;
return results;
}
.set(host, path, name, value)
await this.cookies.set(host, path, name, value)
- setting a single cookie
async parse(set, results) {
this.logger.put("Start scraping query: " + set.query);
await this.cookies.set('.a-parser.com', '/', 'Test-cookie-1', 1);
await this.cookies.set('.a-parser.com', '/', 'Test-cookie-2', 'test-value');
let cookies = await this.cookies.getAll();
this.logger.put("Cookies: " + JSON.stringify(cookies));
results.SKIP = 1;
return results;
}
await this.proxy.*
Working with proxy
.next()
Change the proxy to the next one; the old proxy will no longer be used for the current request
.ban()
Change and ban the proxy (should be used when the service blocks by IP); the proxy will be banned for the time specified in the scraper settings (proxybannedcleanup)
.get()
Get the current proxy (the last proxy with which a request was made)
.set(proxy, noChange?)
await this.proxy.set('http://127.0.0.1:8080', true)
- set a proxy for the next request. The noChange parameter is optional; if set to true, the proxy will not change between attempts. By default noChange = false
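Putting these methods together, a sketch of a retry loop that bans or switches proxies depending on the response (the 403 check is illustrative):

```javascript
async parse(set, results) {
    for (let attempt = 1; attempt <= this.conf.proxyretries; attempt++) {
        const { success, data, headers } = await this.request('GET', set.query, {}, { attempt });
        if (success) {
            // process data here
            results.success = 1;
            break;
        }
        if (headers && headers.Status == 403)
            await this.proxy.ban();  // the service blocks by IP - ban this proxy
        else
            await this.proxy.next(); // just switch to the next proxy
    }
    return results;
}
```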
await this.sessionManager.*
Methods for working with sessions. Each session necessarily stores the used proxy and cookies. You can also save additional arbitrary data.
To use sessions in a JS scraper, you must first initialize the Session Manager. This is done by calling await this.sessionManager.init() in the scraper's init() method.
.init(opts?)
Initialization of the Session Manager. As an argument, you can pass an object (opts) with additional parameters (all optional):
- name - overrides the name of the scraper to which the sessions belong; by default it is equal to the name of the scraper in which the initialization occurs
- waitForSession - tells the scraper to wait until a session appears (relevant only when several tasks are running, for example one generates sessions and another uses them), i.e. .get() and .reset() will always wait for a session
- domain - specifies whether to look for sessions among all saved for this scraper (if the value is not set) or only for a specific domain (specify the domain with a leading dot, for example .site.com)
- sessionsKey - manually sets the name of the session storage; if not set, the name is formed automatically from name (or the name of the scraper if name is not set), the domain, and the proxy checker
- expire - sets the lifetime of the session in minutes, unlimited by default
Usage example:
async init() {
await this.sessionManager.init({
name: 'JS::test',
expire: 15 * 60
});
}
.get(opts?)
Getting a new session; call it before making a request (before the first attempt). Returns an object with the arbitrary data saved in the session. As an argument, you can pass an object (opts) with additional parameters (all optional):
- waitTimeout - specifies how many minutes to wait for a session to appear; works independently of the waitForSession parameter in .init() (ignores it), and after it expires an empty session will be used
- tag - gets a session with the specified tag; you can use, for example, the domain name to link sessions to the domains from which they were obtained
Usage example:
await this.sessionManager.get({
waitTimeout: 10,
tag: 'test session'
})
.reset(opts?)
Clearing cookies and getting a new session. Use it if the current session was not successful. Returns an object with the arbitrary data saved in the session. As an argument, you can pass an object (opts) with additional parameters (all optional):
- waitTimeout - specifies how many minutes to wait for a session to appear; works independently of the waitForSession parameter in .init() (ignores it), and after it expires an empty session will be used
- tag - gets a session with the specified tag; you can use, for example, the domain name to link sessions to the domains from which they were obtained
Usage example:
await this.sessionManager.reset({
waitTimeout: 5,
tag: 'test session'
})
.save(sessionOpts?, saveOpts?)
Saving a successful session, with the ability to store arbitrary data in it. Supports 2 optional arguments:
- sessionOpts - arbitrary data to be stored in the session; can be a number, string, array, or object
- saveOpts - an object with session saving parameters:
  - multiply - optional; allows you to duplicate the session, specify a number as the value
  - tag - optional; sets the tag for the saved session, you can use, for example, the domain name to link sessions to the domains from which they were obtained
Usage example:
await this.sessionManager.save('some data here', {
multiply: 3,
tag: 'test session'
})
.count()
Returns the number of sessions for the current Session Manager
Usage example:
let sesCount = await this.sessionManager.count();
.removeById(sessionId)
Deletes all sessions with the specified id. Returns the number of deleted sessions. The id of the current session is contained in the variable this.sessionId
Usage example:
const removedCount = await this.sessionManager.removeById(this.sessionId);
Comprehensive example of using the Session Manager
async init() {
await this.sessionManager.init({
expire: 15 * 60
});
}
async parse(set, results) {
let ses = await this.sessionManager.get();
for(let attempt = 1; attempt <= this.conf.proxyretries; attempt++) {
if(ses)
this.logger.put('Data from session:', ses);
const { success, data } = await this.request('GET', set.query, {}, { attempt });
if(success) {
// process data here
results.success = 1;
break;
} else if(attempt < this.conf.proxyretries) {
const removedCount = await this.sessionManager.removeById(this.sessionId);
this.logger.put(`Removed ${removedCount} bad sessions with id #${this.sessionId}`);
ses = await this.sessionManager.reset();
}
}
if(results.success) {
await this.sessionManager.save('Some data', { multiply: 2 });
this.logger.put(`Total we have ${await this.sessionManager.count()} sessions`);
}
return results;
}