HTTP requests (+working with cookies, proxies, sessions)
Base class methods
To collect data from a web page, you need to perform an HTTP request. In JavaScript API v2
of A-Parser, there is an easy-to-use method for making HTTP requests, which returns a JSON object in response depending on the specified arguments of the method. Below you will learn: how an HTTP request is made, what arguments and options the method has, the results of the specified options, how to specify the condition for the success of the HTTP request, and more.
Also described are methods that allow you to easily manipulate cookies, proxies, and sessions in the scraper being created. After a successful HTTP request, or before making one, you can set/change proxy/cookie/session data for making HTTP requests or save it for execution by another thread using Session Manager.
These methods are inherited from BaseParser
and are the foundation for creating your own scrapers.
await this.request(method, url[, queryParams][, opts])
await this.request(method, url, queryParams, opts)
Performs an HTTP request and returns the response. The arguments are:
- method - request method (GET, POST, ...)
- url - request URL
- queryParams - hash with GET parameters or with the body of the POST request
- opts - hash with request options
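For example, a minimal sketch of such a request inside a scraper's parse() method (the page GET parameter and the log message are illustrative; set.query, this.request, and this.logger are provided by the A-Parser runtime):

```javascript
// A minimal sketch: a GET request with one illustrative query-string parameter
async parse(set, results) {
    const { success, data } = await this.request('GET', set.query, { page: 1 }, {
        timeout: 30, // response timeout in seconds
    });
    if (success)
        this.logger.put('Received ' + data.length + ' bytes');
    return results;
}
```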
opts.check_content
check_content: [ condition1, condition2, ...]
- an array of conditions for checking the received content; if a check fails, the request will be retried with a different proxy.
Capabilities:
- use of strings as conditions (search by string inclusion)
- use of regular expressions as conditions
- use of custom check functions, which are passed the data and headers of the response
- you can set several different types of conditions at once
- for logical negation, place the condition in an array, i.e., check_content: ['xxxx', [/yyyy/]] means that the request will be considered successful if the received data contains the substring xxxx and at the same time the regular expression /yyyy/ finds no matches on the page

All checks listed in the array must pass for the request to be successful.
Example (comments indicate what is needed for the request to be considered successful):
let response = await this.request('GET', set.query, {}, {
    check_content: [
        /<\/html>|<\/body>/, // this regular expression must match on the received page
        ['XXXX'], // the received page must not contain this substring
        '</html>', // the received page must contain this substring
        (data, hdr) => {
            return hdr.Status == 200 && data.length > 100;
        } // this function must return true
    ]
});
opts.decode
decode: 'auto-html'
- automatic detection of encoding and conversion to utf8
Possible values:
- auto-html - based on headers, meta tags, and the content of the page (the recommended option)
- utf8 - indicates that the document is in utf8 encoding
- <encoding> - any other encoding
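For instance, a short sketch of requesting a page whose encoding is not known in advance:

```javascript
// Let the scraper detect the encoding from headers/meta tags and convert to utf8
const { success, data } = await this.request('GET', set.query, {}, {
    decode: 'auto-html',
});
```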
opts.headers
headers: { ... }
- hash with headers; header names are set in lowercase, and a cookie header can also be specified.
Example:
headers: {
accept: 'image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8',
'accept-encoding': 'gzip, deflate, br',
cookie: 'a=321; b=test',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}
opts.headers_order
headers_order: ['cookie', 'user-agent', ...]
- allows you to redefine the order of header sorting
opts.onlyheaders
onlyheaders: 0
- determines whether to read the response body; if enabled (1), only the headers are received
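This can be used, for example, to check a resource before downloading it; a sketch (the hdr/headers fields follow the hdr.Status usage shown in the check_content example above):

```javascript
// Fetch only the response headers, skipping the body
const { success, headers } = await this.request('GET', set.query, {}, {
    onlyheaders: 1,
});
if (success)
    this.logger.put('Status: ' + headers.Status);
```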
opts.recurse
recurse: N
- the maximum number of redirects to follow, 7 by default; use 0 to disable following redirects
opts.proxyretries
proxyretries: N
- the number of attempts to execute the request, by default taken from the scraper settings
opts.parsecodes
parsecodes: { ... }
- a list of HTTP response codes that the scraper will consider successful; by default taken from the scraper settings. If you specify '*': 1, then all responses will be considered successful.
Example:
parsecodes: {
200: 1,
403: 1,
500: 1
}
opts.timeout
timeout: N
- response timeout in seconds, by default taken from the scraper settings
opts.do_gzip
do_gzip: 1
- determines whether to use compression (gzip/deflate/br); enabled (1) by default, set to 0 to disable
opts.max_size
max_size: N
- maximum response size in bytes, by default taken from the scraper settings
opts.cookie_jar
cookie_jar: { ... }
- hash with cookies. Example hash:
"cookie_jar": {
"version": 1,
".google.com": {
"/": {
"login": {
"value": "true"
},
"lang": {
"value": "ru-RU"
}
}
},
".test.google.com": {
"/": {
"id": {
"value": 155643
}
}
  }
}
opts.attempt
attempt: N
- indicates the number of the current attempt; when this parameter is used, the built-in attempt handler for the request is ignored
opts.browser
browser: 1
- automatic emulation of browser headers (1 - enabled, 0 - disabled)
opts.use_proxy
use_proxy: 1
- overrides the use of a proxy for an individual request within the JS scraper, on top of the global Use proxy parameter (1 - enabled, 0 - disabled)
opts.noextraquery
noextraquery: 0
- disables the addition of Extra query string to the request URL (1 - enabled, 0 - disabled)
opts.save_to_file
save_to_file: file
- allows downloading a file directly to disk, bypassing memory. Instead of file, specify the name and path under which to save the file. When this option is used, everything related to response data is ignored (the content check in opts.check_content will not be performed, response.data will be empty, etc.)
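A sketch (the file path here is illustrative):

```javascript
// Stream the response directly to disk; response.data stays empty
const { success } = await this.request('GET', set.query, {}, {
    save_to_file: 'downloads/result.html',
});
if (success)
    this.logger.put('File saved');
```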
opts.bypass_cloudflare
bypass_cloudflare: 0
- automatic bypass of CloudFlare's JavaScript protection using the Chrome browser (1 - enabled, 0 - disabled).
Chrome Headless in this case is controlled by the scraper settings bypassCloudFlareChromeMaxPages and bypassCloudFlareChromeHeadless, which need to be specified in static defaultConf and static editableConf:
static defaultConf: typeof BaseParser.defaultConf = {
version: '0.0.1',
results: {
flat: [
['title', 'Title'],
]
},
max_size: 2 * 1024 * 1024,
parsecodes: {
200: 1,
},
results_format: "$title\n",
bypass_cloudflare: 1,
bypassCloudFlareChromeMaxPages: 20,
bypassCloudFlareChromeHeadless: 0
};
static editableConf: typeof BaseParser.editableConf = [
['bypass_cloudflare', ['textfield', 'bypass_cloudflare']],
['bypassCloudFlareChromeMaxPages', ['textfield', 'bypassCloudFlareChromeMaxPages']],
['bypassCloudFlareChromeHeadless', ['textfield', 'bypassCloudFlareChromeHeadless']],
];
async parse(set, results) {
const {success, data, headers} = await this.request('GET', set.query, {}, {
bypass_cloudflare: this.conf.bypass_cloudflare
});
return results;
}
opts.follow_meta_refresh
follow_meta_refresh: 0
- allows following redirects declared through the HTML meta tag:
<meta http-equiv="refresh" content="time; url=..."/>
opts.redirect_filter
redirect_filter: (hdr) => 1 | 0
- allows setting a redirect filter function; if the function returns 1, the scraper will follow the redirect (taking the opts.recurse parameter into account), and if it returns 0 the redirect will be stopped:
redirect_filter: (hdr) => {
if (hdr.location.match(/login/))
return 1;
return 0;
}
opts.follow_common_redirects
follow_common_redirects: 0
- determines whether to follow standard redirects (for example http -> https and/or www.domain.com -> domain.com); if 1 is specified, the scraper will follow standard redirects regardless of the opts.recurse parameter
opts.http2
http2: 0
- determines whether to use the HTTP/2 protocol when making requests; by default HTTP/1.1 is used
opts.randomize_tls_fingerprint
randomize_tls_fingerprint: 0
- this option allows bypassing site bans by TLS fingerprint (1 - enabled, 0 - disabled)
opts.tlsOpts
tlsOpts: { ... }
- allows passing settings for HTTPS connections
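The exact set of supported keys is not listed here; a hypothetical sketch, assuming Node.js-style TLS connection options such as ciphers:

```javascript
// Hypothetical: pass custom TLS settings for HTTPS connections
// (the `ciphers` key is an assumption based on Node.js TLS options)
const { success, data } = await this.request('GET', set.query, {}, {
    tlsOpts: {
        ciphers: 'TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256',
    },
});
```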
await this.cookies.*
Working with cookies for the current request
.getAll()
Retrieving an array of cookies
await this.cookies.getAll();
.setAll(cookie_jar)
Setting cookies, a hash with cookies should be passed as an argument
async parse(set, results) {
this.logger.put("Start scraping query: " + set.query);
await this.cookies.setAll({
"version": 1,
".google.com": {
"/": {
"login": {
"value": "true"
},
"lang": {
"value": "ru-RU"
}
}
},
".test.google.com": {
"/": {
"id": {
"value": 155643
}
}
}
});
let cookies = await this.cookies.getAll();
this.logger.put("Cookies: " + JSON.stringify(cookies));
results.SKIP = 1;
return results;
}
.set(host, path, name, value)
await this.cookies.set(host, path, name, value)
- setting a single cookie
async parse(set, results) {
this.logger.put("Start scraping query: " + set.query);
await this.cookies.set('.a-parser.com', '/', 'Test-cookie-1', 1);
await this.cookies.set('.a-parser.com', '/', 'Test-cookie-2', 'test-value');
let cookies = await this.cookies.getAll();
this.logger.put("Cookies: " + JSON.stringify(cookies));
results.SKIP = 1;
return results;
}
await this.proxy.*
Working with proxy
.next()
Change the proxy to the next one; the old proxy will no longer be used for the current request
.ban()
Change and ban the proxy (should be used when the service blocks by IP); the proxy will be banned for the time specified in the scraper settings (proxybannedcleanup)
.get()
Get the current proxy (the last proxy with which a request was made)
.set(proxy, noChange?)
await this.proxy.set('http://127.0.0.1:8080', true)
- set a proxy for the next request. The noChange parameter is optional; if set to true, the proxy will not change between attempts. By default noChange = false
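Putting these methods together, a sketch of a retry loop that bans or switches proxies depending on the response (the 403 check is illustrative):

```javascript
async parse(set, results) {
    for (let attempt = 1; attempt <= this.conf.proxyretries; attempt++) {
        const { success, data, headers } = await this.request('GET', set.query, {}, { attempt });
        if (success) {
            // process data here
            results.success = 1;
            break;
        }
        if (headers && headers.Status == 403)
            await this.proxy.ban();  // the service blocks by IP - ban this proxy
        else
            await this.proxy.next(); // just switch to the next proxy
    }
    return results;
}
```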
await this.sessionManager.*
Methods for working with sessions. Each session necessarily stores the used proxy and cookies. You can also save additional arbitrary data.
To use sessions in a JS scraper, you must first initialize the Session Manager. This is done by calling await this.sessionManager.init() in the scraper's init() method.
.init(opts?)
Initialization of the Session Manager. As an argument, you can pass an object (opts) with additional parameters (all optional):
- name - overrides the name of the scraper to which the sessions belong; by default it is equal to the name of the scraper in which the initialization occurs
- waitForSession - tells the scraper to wait until a session appears (relevant only when several tasks are running, for example one generates sessions and another uses them), i.e. .get() and .reset() will always wait for a session
- domain - specifies whether to look for sessions among all saved for this scraper (if the value is not set) or only for a specific domain (specify the domain with a leading dot, for example .site.com)
- sessionsKey - manually sets the name of the session storage; if not set, the name is formed automatically from name (or the name of the scraper if name is not set), the domain, and the proxy checker
- expire - sets the lifetime of the session in minutes, unlimited by default
Usage example:
async init() {
await this.sessionManager.init({
name: 'JS::test',
expire: 15 * 60
});
}
.get(opts?)
Getting a new session; call it before making a request (before the first attempt). Returns an object with the arbitrary data saved in the session. As an argument, you can pass an object (opts) with additional parameters (all optional):
- waitTimeout - specifies how many minutes to wait for a session to appear; works independently of the waitForSession parameter in .init() (ignores it), and after it expires an empty session will be used
- tag - gets a session with the specified tag; you can use, for example, the domain name to link sessions to the domains from which they were obtained
Usage example:
await this.sessionManager.get({
waitTimeout: 10,
tag: 'test session'
})
.reset(opts?)
Clearing cookies and getting a new session. Use it if the current session was not successful. Returns an object with the arbitrary data saved in the session. As an argument, you can pass an object (opts) with additional parameters (all optional):
- waitTimeout - specifies how many minutes to wait for a session to appear; works independently of the waitForSession parameter in .init() (ignores it), and after it expires an empty session will be used
- tag - gets a session with the specified tag; you can use, for example, the domain name to link sessions to the domains from which they were obtained
Usage example:
await this.sessionManager.reset({
waitTimeout: 5,
tag: 'test session'
})
.save(sessionOpts?, saveOpts?)
Saving a successful session, with the ability to store arbitrary data in it. Supports 2 optional arguments:
- sessionOpts - arbitrary data to be stored in the session; can be a number, string, array, or object
- saveOpts - an object with session saving parameters:
  - multiply - optional; allows you to duplicate the session, specify a number as the value
  - tag - optional; sets the tag for the saved session, you can use, for example, the domain name to link sessions to the domains from which they were obtained
Usage example:
await this.sessionManager.save('some data here', {
multiply: 3,
tag: 'test session'
})
.count()
Returns the number of sessions for the current Session Manager
Usage example:
let sesCount = await this.sessionManager.count();
.removeById(sessionId)
Deletes all sessions with the specified id. Returns the number of deleted sessions. The id of the current session is contained in the variable this.sessionId
Usage example:
const removedCount = await this.sessionManager.removeById(this.sessionId);
Comprehensive example of using the Session Manager
async init() {
await this.sessionManager.init({
expire: 15 * 60
});
}
async parse(set, results) {
let ses = await this.sessionManager.get();
for(let attempt = 1; attempt <= this.conf.proxyretries; attempt++) {
if(ses)
this.logger.put('Data from session:', ses);
const { success, data } = await this.request('GET', set.query, {}, { attempt });
if(success) {
// process data here
results.success = 1;
break;
} else if(attempt < this.conf.proxyretries) {
const removedCount = await this.sessionManager.removeById(this.sessionId);
this.logger.put(`Removed ${removedCount} bad sessions with id #${this.sessionId}`);
ses = await this.sessionManager.reset();
}
}
if(results.success) {
await this.sessionManager.save('Some data', { multiply: 2 });
this.logger.put(`Total we have ${await this.sessionManager.count()} sessions`);
}
return results;
}