JavaScript Scrapers: Overview of Capabilities
JavaScript scrapers let you build your own full-fledged scrapers with logic of any complexity using the JavaScript language, while retaining access to all the functionality of the standard scrapers.
Features
Using the full power of A-Parser, you can write your own scraper/regger/poster with logic of any complexity. Code is written in JavaScript with ES6 support (the V8 engine).
Scraper code stays as concise as possible, letting you focus on the logic; A-Parser takes care of multithreading, networking, proxies, results, logs, etc. The code can be written directly in the A-Parser interface by adding a new parser in the Parser Editor, or in a third-party editor such as VSCode.
Automatic versioning is used when saving the scraper code through the built-in editor.
Working with JavaScript scrapers is available with Pro and Enterprise licenses.
Access to the JS Parser Editor
If A-Parser is used remotely, the JS Parser Editor is unavailable by default for security reasons. To enable access to it:
- set a password
- add the following line to config/config.txt:
  allow_javascript_editor: 1
- restart A-Parser
Instruction for use
In the Parser Editor, create a new scraper and give it a name. By default, a simple example is loaded, which you can use as a starting point for your own scraper.
If a third-party editor is used to write the code, open the scraper file for editing in the /parsers/ folder.
When the code is ready, save it and use it like a regular scraper: select the created scraper in the Task Editor and, if necessary, set parameters, thread configuration, result file name, etc.
The created scraper can be edited at any time. Interface-related changes appear after re-selecting the scraper in the scraper list or restarting A-Parser; changes to the scraper logic take effect when the task using the scraper is restarted.
For each created scraper, a standard icon is displayed by default; you can add your own in png or ico format by placing it in the scraper's folder in /parsers/.
General principles of operation
By default, an example of a simple scraper is created, ready for further editing:
class Parser {
    constructor() {
        this.defaultConf = {
            results: {
                flat: [
                    ['title', 'HTML title'],
                ]
            },
            results_format: '$query: $title\n',
            parsecodes: {
                200: 1,
            },
            max_size: 200 * 1024,
        };
    }

    *parse(set, results) {
        this.logger.put("Start scraping query: " + set.query);
        let response = yield this.request('GET', set.query, {}, {
            check_content: ['<\/html>'],
            decode: 'auto-html',
        });
        if (response.success) {
            let matches = response.data.match(/<title>(.*?)<\/title>/i);
            if (matches)
                results.title = matches[1];
        }
        results.success = response.success;
        return results;
    }
}
The constructor is called once for each task. You must set this.defaultConf.results and this.defaultConf.results_format; the remaining fields are optional and take default values.
The this.editableConf array determines which settings the user can change from the A-Parser interface. The following field types are available:
- combobox - drop-down selection. You can also build a preset-selection menu for a standard scraper, for example:
  ['Util_AntiGate_preset', ['combobox', 'AntiGate preset']]
- combobox with multiple selection. Additionally set the parameter {'multiSelect': 1}:
  ['proxyCheckers', ['combobox', 'Proxy Checkers', {'multiSelect': 1}, ['*', 'All']]]
- checkbox - a checkbox, for parameters that can take only two values (true/false)
- textfield - a single-line text field
- textarea - a text field with multiline input
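As a sketch of how these pieces fit together, the constructor below declares a few custom settings in this.defaultConf and exposes them via this.editableConf using the field types above. The option names (pageCount, parseLinks, queryPrefix) are purely illustrative, not built-in A-Parser settings:

```javascript
// Sketch: declaring user-editable settings. Each editableConf entry pairs a
// defaultConf key with a [fieldType, label] description shown in the interface.
class Parser {
    constructor() {
        this.defaultConf = {
            results: { flat: [['title', 'HTML title']] },
            results_format: '$query: $title\n',
            // hypothetical custom settings with their default values
            pageCount: 1,
            parseLinks: 0,
            queryPrefix: '',
        };
        this.editableConf = [
            ['pageCount', ['combobox', 'Pages to scrape']],
            ['parseLinks', ['checkbox', 'Collect links']],
            ['queryPrefix', ['textfield', 'Query prefix']],
        ];
    }
}
```

Each key listed in editableConf must exist in this.defaultConf; its default value is what the user sees before changing anything.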
The *parse method is a generator: every blocking operation must be yielded (this is the main and only difference from a regular function). The method is called for each query received for processing and is passed set (a hash with the query and its parameters) and results (a blank for the results). It must return the filled results, setting the success flag.
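Because *parse is a generator, it can yield several requests in sequence within a single query. The sketch below follows the this.request(method, url, queryParams, options) signature from the bundled example to walk hypothetical numbered pages; the pageCount setting, the ?page= URL scheme, and reading settings via this.conf are assumptions for illustration:

```javascript
// Sketch: yielding multiple requests from one *parse call (pagination).
// The runtime resumes the generator with the response object after each yield.
class Parser {
    constructor() {
        this.defaultConf = {
            results: { flat: [['links', 'Collected links']] },
            results_format: '$query: $links.0\n',
            pageCount: 3, // hypothetical custom setting
        };
    }

    *parse(set, results) {
        results.links = [];
        for (let page = 1; page <= this.conf.pageCount; page++) {
            // Each blocking operation must be yielded
            let response = yield this.request('GET', set.query + '?page=' + page, {}, {
                decode: 'auto-html',
            });
            if (!response.success)
                break;
            // collect raw href attributes from the page body
            let matches = response.data.match(/href="([^"]+)"/g) || [];
            results.links.push(...matches);
        }
        results.success = results.links.length > 0;
        return results;
    }
}
```

The generator stops paginating on the first failed response, and success reflects whether anything was collected at all.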
Automatic versioning
The version has the format Major.Minor.Revision:
this.defaultConf = {
    version: '0.1.1',
    ...
}
The Revision value (the last number) is automatically incremented on each save. The other values (Major, Minor) can be changed manually, and Revision can also be reset to 0.
If for some reason Revision must only be changed manually, enclose the version in double quotes ("").
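A minimal sketch of the quoting difference, assuming the behavior described above (the surrounding results fields are just a placeholder configuration):

```javascript
// Single quotes: the built-in editor bumps Revision on each save,
// e.g. version: '0.1.5' becomes version: '0.1.6'.
// Double quotes: the version string is left untouched on save.
class Parser {
    constructor() {
        this.defaultConf = {
            version: "0.1.1", // pinned: never auto-incremented
            results: { flat: [['title', 'HTML title']] },
            results_format: '$query: $title\n',
        };
    }
}
```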
Batch processing of requests
In some cases it is necessary to take several queries from the queue at once and process them together. To do this, set bulkQueries: N in this.defaultConf. The scraper will then take queries in batches of N, and all queries of the current iteration will be available in the set.bulkQueries array.
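A minimal sketch of a batch-mode scraper, assuming bulkQueries works as described above; the result field name (count) is illustrative:

```javascript
// Sketch: with bulkQueries set, *parse receives a batch of queries per
// iteration in set.bulkQueries instead of a single set.query.
class Parser {
    constructor() {
        this.defaultConf = {
            results: { flat: [['count', 'Queries processed']] },
            results_format: '$count\n',
            bulkQueries: 2, // take queries in batches of 2
        };
    }

    *parse(set, results) {
        // set.bulkQueries holds up to bulkQueries raw queries
        results.count = set.bulkQueries.length;
        results.success = 1;
        return results;
    }
}
```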
Useful links
- Examples and Discussion - forum thread with examples and discussion of JS scraper functionality
- JS scraper catalog - section in the resource catalog dedicated to JS scrapers
- Overview of basic ES6 features - an article on Habrahabr covering the basic ES6 features