Skip to main content

JavaScript scrapers: An Overview of Capabilities

JavaScript scrapers are an opportunity to create your own full-fledged scrapers with as complex logic as you want, using the JavaScript language. At the same time, JS scrapers can also use the full functionality of standard scrapers.

Overview

Features

Using the full power of A-Parser, you can now write your own scraper/regger/poster with as complex logic as you want. JavaScript with ES6 features (v8 engine) is used for writing code.

The code of scrapers is as concise as possible, allowing you to focus on writing logic; A-Parser takes care of multithreading, networking, proxies, results, logs, etc. You can write code directly in the scraper interface by adding a new scraper in the Parser Editor. You can also use external editors to write scrapers, such as VSCode.

Automatic versioning is used when saving scraper code through the built-in editor.

note

Working with JavaScript scrapers is available for Pro and Enterprise licenses

Access to the JS Scraper Editor

If A-Parser is used remotely, for security reasons the JS Scraper Editor is not available by default. To grant access to it, you need to:

  • Set a password in the Settings -> General Settings
  • Add the following line to config/config.txt: allow_javascript_editor: 1
  • Restart A-Parser

Operating Instructions

In the Parser Editor, we create a new scraper and set the name of the scraper. By default, a simple example will be loaded, based on which you can quickly start creating your own scraper.

note

If an external editor is used to write code, you need to open the file of the scraper being edited in the /parsers/ folder. File structure of the installed program.

Once the code is ready, save it and use it as a regular scraper: in the Task Editor, select the created scraper, if necessary, you can set the required parameters, thread config, file name, etc.

You can edit the created scraper at any time. All changes related to the interface will appear after re-selecting the scraper from the list of scrapers or restarting A-Parser; changes in the scraper logic are applied when the task with the scraper is restarted.

By default, a standard icon is displayed for each created scraper, you can add your own in png or ico format by placing it in the scraper folder in /parsers/:

General Operating Principles

By default, an example of a simple scraper is created, ready for further editing.

files/parsers/v2-example/v2-example.ts

import { BaseParser } from 'a-parser-types';

export class JS_v2_example extends BaseParser {
static defaultConf: typeof BaseParser.defaultConf = {
version: '0.0.1',
results: {
flat: [
['title', 'HTML title'],
]
},
max_size: 2 * 1024 * 1024,
parsecodes: {
200: 1,
},
results_format: '$query: $title\\n',
};

static editableConf: typeof BaseParser.editableConf = [];

async parse(set, results) {
this.logger.put("Start scraping query: " + set.query);

let response = await this.request('GET', set.query, {}, {
check_content: ['<\/html>'],
decode: 'auto-html',
});

if (response.success) {
let matches = response.data.match(/<title>(.*?)<\/title>/i);
if (matches)
results.title = matches[1];
}

results.success = response.success;

return results;
}
}

The constructor is called once for each task. It is mandatory to set this.defaultConf.results and this.defaultConf.results_format, other fields are optional and will take default values.

The array this.editableConf defines which settings can be changed by the user from the A-Parser interface. You can use the following types of fields:

  • combobox - a drop-down selection menu. You can also make a selection menu for a preset of a standard scraper, for example:
['Util_AntiGate_preset', ['combobox', 'AntiGate preset']]
  • combobox with the ability to select multiple options. You need to additionally set the parameter {'multiSelect': 1}:
['proxyCheckers', ['combobox', 'Proxy Checkers', {'multiSelect': 1}, ['*', 'All']]]
  • checkbox - a checkbox for parameters that can have only 2 values (true/false)
  • textfield - a text field
  • textarea - a multiline text field

The parse method is an asynchronous function, and for any blocking operation, it must return await (this is the main and only difference from a regular function). The method is called for each request that is being processed. It is mandatory to pass set (a hash with the request and its parameters) and results (an empty template for the results). It is also mandatory to return the filled results, having previously set the success flag.

Automatic Versioning

The version has the format Major.Minor.Revision

this.defaultConf: typeof BaseParser.defaultConf = {
version: '0.1.1',
...
}

The Revision value (the last digit) is automatically incremented with each save. The other values (Major, Minor) can be changed manually, and the Revision can be reset to 0.

tip

If for some reason it is necessary to change the Revision only manually, then the version must be enclosed in double quotes ""

Bulk Query Processing

In some cases, it may be necessary to take several requests from the queue at once and process them together. For example, this functionality is used in SE::Yandex::Direct::FrequencySE::Yandex::Direct::Frequency: when scraping with accounts, data is collected in batches of 10 requests.

To implement the same functionality in a JS scraper, you need to set the value bulkQueries: N in this.defaultConf, where N is the required number of requests in a batch. In this case, the scraper will take requests in batches of N pieces, and all requests of the current iteration will be contained in the array set.bulkQueries (including all standard variables: query.first, query.orig, query.prev, etc.). Below is an example of such an array:

[
{
"first": "test",
"prev": "",
"lvl": 0,
"num": 0,
"query": "test",
"queryUid": "6eb301",
"orig": "test"
},
{
"first": "проверка",
"prev": "",
"lvl": 0,
"num": 1,
"query": "проверка",
"queryUid": "774563",
"orig": "проверка"
},
{
"first": "third query",
"prev": "",
"lvl": 0,
"num": 2,
"query": "third query",
"queryUid": "2bc8ed",
"orig": "third query"
}
]

Results during bulk processing must be filled in the array results.bulkResults, where each element is a results object. The elements in results.bulkResults are arranged in the same order that was in set.bulkQueries.