
JavaScript Scrapers: Overview of Capabilities

JavaScript scrapers let you build your own full-fledged scrapers with logic of any complexity, written in JavaScript. JS scrapers can also use all the functionality of the standard scrapers.

Features

Using the full power of A-Parser, you can write your own scraper, regger, or poster with logic of any complexity. Code is written in JavaScript with ES6 features (V8 engine).

Scraper code is kept as concise as possible so that you can focus on the logic; A-Parser handles multithreading, networking, proxies, results, logs, and so on. You can write code directly in the A-Parser interface by adding a new scraper in the Parser Editor, or use a third-party editor such as VSCode.

Automatic versioning is used when saving the scraper code through the built-in editor.

info

JavaScript scrapers are available with Pro and Enterprise licenses.

Access to the JS Parser Editor

When A-Parser is accessed remotely, the JS Parser Editor is disabled by default for security reasons. To enable access to it:

  • set a password
  • add the following line to config/config.txt: allow_javascript_editor: 1
  • restart A-Parser

Instructions for use

In the Parser Editor, create a new scraper and give it a name. A simple example is loaded by default, which you can use as a starting point for your own scraper.

info

If you write code in a third-party editor, open the scraper file located in the /parsers/ folder.

When the code is ready, save it and use it like a regular scraper: select it in the Task Editor and, if necessary, set its parameters, thread configuration, file name, and so on.

The created scraper can be edited at any time. All changes related to the interface will appear after re-selecting the scraper in the list of scrapers or restarting A-Parser; changes in the scraper logic are applied when the task with the scraper is restarted.

Each created scraper shows the standard icon by default. You can add your own icon in PNG or ICO format by placing it in the scraper's folder in /parsers/.

General principles of operation

By default, an example of a simple scraper is created, ready for further editing:

class Parser {
    constructor() {
        this.defaultConf = {
            results: {
                flat: [
                    ['title', 'HTML title'],
                ]
            },
            results_format: '$query: $title\\n',
            parsecodes: {
                200: 1,
            },
            max_size: 200 * 1024,
        };
    }

    *parse(set, results) {
        this.logger.put("Start scraping query: " + set.query);

        let response = yield this.request('GET', set.query, {}, {
            check_content: ['<\/html>'],
            decode: 'auto-html',
        });

        if (response.success) {
            let matches = response.data.match(/<title>(.*?)<\/title>/i);
            if (matches)
                results.title = matches[1];
        }

        results.success = response.success;

        return results;
    }
}

The constructor is called once for each task. It must set this.defaultConf.results and this.defaultConf.results_format; the remaining fields are optional and take default values if omitted.

The this.editableConf array determines which settings can be changed by the user from the A-Parser interface. You can use the following field types:

  • combobox - drop-down menu selection. You can also make a preset selection menu for a standard scraper, for example:
['Util_AntiGate_preset', ['combobox', 'AntiGate preset']]
  • combobox with the ability to select multiple items. You need to additionally set the parameter {'multiSelect': 1}:
['proxyCheckers', ['combobox', 'Proxy Checkers', {'multiSelect': 1}, ['*', 'All']]]
  • checkbox - for parameters that can only have two values (true/false)
  • textfield - text field
  • textarea - text field with multiline input
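
As an illustration of the field types above, the sketch below declares one setting of each type. The two combobox entries are taken from the list; the [name, [type, label]] shape used for checkbox, textfield, and textarea is assumed by analogy with them, and the option names use_cache, user_agent, and stop_words (with their defaults in this.defaultConf) are hypothetical.

```javascript
// Sketch only: option names other than Util_AntiGate_preset and proxyCheckers
// are hypothetical, and the checkbox/textfield/textarea entry shape is an
// assumption modelled on the combobox examples.
class Parser {
    constructor() {
        this.defaultConf = {
            results: { flat: [['title', 'HTML title']] },
            results_format: '$query: $title\\n',
            // Assumed: default values for the hypothetical options below
            use_cache: false,
            user_agent: '',
            stop_words: '',
        };

        this.editableConf = [
            ['Util_AntiGate_preset', ['combobox', 'AntiGate preset']],
            ['proxyCheckers', ['combobox', 'Proxy Checkers', {'multiSelect': 1}, ['*', 'All']]],
            ['use_cache', ['checkbox', 'Use cache']],
            ['user_agent', ['textfield', 'User-Agent']],
            ['stop_words', ['textarea', 'Stop words, one per line']],
        ];
    }
}

// Outside A-Parser the class is plain JavaScript, so the declared
// settings can be listed directly.
const fieldTypes = new Parser().editableConf.map(([name, [type]]) => `${name}: ${type}`);
console.log(fieldTypes.join('\n'));
```

Each entry pairs an option name with its field description, so the interface can render the control while the scraper reads the chosen value under that name.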

The *parse method is a generator, and every blocking operation in it must be called with yield (this is the main and only difference from a regular function). The method is called for each request received for processing. It receives two arguments: set (a hash with the request and its parameters) and results (an empty template for the results). It must return the filled results object with the success flag set.
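
To make the role of yield more concrete, here is a minimal sketch that can run outside A-Parser. It is not A-Parser's actual engine: fakeRequest and runParse are hypothetical stand-ins, and the request here returns a canned response synchronously, whereas a real request is asynchronous.

```javascript
// Simplified stand-in for a network request. NOT A-Parser's this.request:
// it ignores its arguments and returns a canned response immediately.
function fakeRequest(method, url) {
    return { success: true, data: '<html><title>Example page</title></html>' };
}

class DemoParser {
    *parse(set, results) {
        // yield hands control to the driver until the operation has a result
        const response = yield fakeRequest('GET', set.query);

        if (response.success) {
            const matches = response.data.match(/<title>(.*?)<\/title>/i);
            if (matches)
                results.title = matches[1];
        }

        results.success = response.success;
        return results;
    }
}

// Hypothetical driver: resumes the generator with each yielded value and
// returns whatever parse finally returns.
function runParse(parser, set) {
    const gen = parser.parse(set, {});
    let step = gen.next();
    while (!step.done) {
        step = gen.next(step.value);
    }
    return step.value;
}

const parsed = runParse(new DemoParser(), { query: 'http://example.com/' });
console.log(parsed);
```

The generator pauses at each yield and continues only when the driver passes the result back in, which is why the body of parse reads like ordinary sequential code.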

Automatic versioning

The version has the format: Major.Minor.Revision

this.defaultConf = {
    version: '0.1.1',
    ...
}

The Revision value (the last number) is automatically incremented with each save. The other values (Major, Minor) can be changed manually, and Revision can be manually reset to 0.
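
The increment rule can be sketched as a small function. bumpRevision is a hypothetical helper written for illustration; how the built-in editor actually implements the increment is not described here.

```javascript
// Hypothetical helper mirroring the described behaviour: only the last
// component of Major.Minor.Revision is incremented.
function bumpRevision(version) {
    const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(version);
    if (!match) {
        throw new Error('Expected Major.Minor.Revision, got: ' + version);
    }
    const [, major, minor, revision] = match;
    return `${major}.${minor}.${Number(revision) + 1}`;
}

// A save of version 0.1.1 moves Revision from 1 to 2
const savedOnce = bumpRevision('0.1.1');
// After Minor is raised by hand and Revision reset to 0, the next save starts from there
const afterManualReset = bumpRevision('0.2.0');
console.log(savedOnce, afterManualReset);
```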

tip

To keep Revision from being incremented automatically, so that it changes only when you edit it manually, enclose the version string in double quotes ("").

Batch processing of requests

In some cases, it may be necessary to take several requests from the queue and process them together. To do this, set bulkQueries: N in this.defaultConf. The scraper will then take requests in batches of N, and all requests of the current iteration will be available in the set.bulkQueries array.
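
The batching behaviour can be illustrated with a sketch that runs outside A-Parser. Only bulkQueries and set.bulkQueries come from the text above; the takeBatch function, the { query } shape of each batch item, and the count result field are assumptions made so that the example is self-contained.

```javascript
// Sketch only: takeBatch and the shape of each batch item ({ query }) are
// assumptions, not A-Parser's actual queue implementation.
class BulkParser {
    constructor() {
        this.defaultConf = {
            results: { flat: [['count', 'Queries in batch']] },
            results_format: '$count\\n',
            bulkQueries: 3, // take queries from the queue 3 at a time
        };
    }

    *parse(set, results) {
        // All queries of the current iteration arrive together
        results.count = set.bulkQueries.length;
        results.success = true;
        return results;
    }
}

// Hypothetical stand-in for the task queue: removes up to n queries
function takeBatch(queue, n) {
    return queue.splice(0, n).map(query => ({ query }));
}

const parser = new BulkParser();
const queue = ['q1', 'q2', 'q3', 'q4', 'q5'];
const batchSizes = [];
while (queue.length > 0) {
    const set = { bulkQueries: takeBatch(queue, parser.defaultConf.bulkQueries) };
    // parse has no yield here, so the first next() already returns the results
    batchSizes.push(parser.parse(set, {}).next().value.count);
}
console.log(batchSizes);
```

With five queued queries and a batch size of 3, the last batch in this sketch is simply smaller than N.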

Examples and Discussion

Forum thread with examples and discussion of JS scraper functionality

JS scraper catalog

Section in the resource catalog dedicated to JS scrapers

Overview of basic ES6 features

Article on habrahabr dedicated to the overview of basic ES6 features