Hook Methods
These methods work on the principle of hooks. Implementing them lets you control the scraper's operation at different stages, from initialization to object destruction.
Only parse must be implemented; all other methods are optional.
async parse(set, results)
The method parse implements the main logic for processing the request and obtaining the parsing result. The arguments passed are:
- set - object with request information:
  - set.query - text string of the query
  - set.lvl - request level, 0 by default
- results - object with results that must be filled in and returned from parse()
  - the scraper must check for the presence of each key in the results object and fill it only if present; this optimizes speed, because only the data used in forming the result is parsed
  - results contains keys of the required flat variables with the value none, which by default means that the result has not been received, as well as keys of array variables with an empty array as the value, ready to be filled
  - results.success must be set to 1 upon successful processing of the request; the default value is 0, meaning that the request was processed with an error
Let's look at an example:
class JS_HTML_Tags extends BaseParser {
    static defaultConf = {
        results: {
            flat: [
                ['title', 'Title'],
            ],
            arrays: {
                h2: ['H2 Headers List', [
                    ['header', 'Header'],
                ]],
            }
        },
        ...
    };

    async parse(set, results) {
        // Fetch the HTML page whose address was passed in the query
        const {success, data, headers} = await this.request('GET', set.query);

        // Check success and the data type: for a correctly processed HTML page
        // data is a string, otherwise A-Parser returns a Buffer object
        if (success && typeof data === 'string') {
            let matches;

            // Collect the title only if it was requested.
            // The assignment must be parenthesized, otherwise this is a syntax error
            if (results.title && (matches = data.match(/<title[^>]*>(.*?)<\/title>/)))
                results.title = matches[1];

            // Collect h2 only if it was requested
            if (results.h2) {
                const re = /<h2[^>]*>(.*?)<\/h2>/g;
                while ((matches = re.exec(data))) {
                    // Save every h2 tag found
                    results.h2.push(matches[1]);
                }
            }

            // Report successful parsing
            results.success = 1;
        }

        // Return the processed results
        return results;
    }
};
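The key-presence checks can be tried outside A-Parser. The following is only a sketch under stated assumptions: makeResults and extractTags are hypothetical helpers written for this illustration (they are not part of the A-Parser API), the HTTP request is replaced by a literal HTML string, and the results object is built by hand to mimic the description above (flat keys preset to 'none', array keys preset to an empty array).

```javascript
// Hypothetical helper: builds a results object shaped as described above.
// Only the requested keys are present, so the scraper knows what to collect.
function makeResults(flatKeys, arrayKeys) {
    const results = { success: 0 };
    for (const key of flatKeys) results[key] = 'none';
    for (const key of arrayKeys) results[key] = [];
    return results;
}

// Hypothetical helper: the same extraction logic as in parse(), minus the request.
function extractTags(data, results) {
    if (typeof data !== 'string') return results;

    let matches;
    if (results.title && (matches = data.match(/<title[^>]*>(.*?)<\/title>/)))
        results.title = matches[1];

    if (results.h2) {
        const re = /<h2[^>]*>(.*?)<\/h2>/g;
        while ((matches = re.exec(data)))
            results.h2.push(matches[1]);
    }

    results.success = 1;
    return results;
}

const html = '<html><head><title>Demo</title></head>' +
    '<body><h2>One</h2><h2 class="x">Two</h2></body></html>';

// Only the title was requested: the h2 branch is skipped entirely
const onlyTitle = extractTags(html, makeResults(['title'], []));
// Both the title and the h2 array were requested
const both = extractTags(html, makeResults(['title'], ['h2']));

console.log(onlyTitle);
console.log(both);
```

In the first call the results object has no h2 key, so the regular expression for headers is never run; this is the optimization the key checks are meant to provide.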
Note that you can create your custom functions and methods for better code organization:
function Answer() {
    return 42;
}

class JS_HTML_Tags extends BaseParser {
    ...
    async parse(set, results) {
        results = await this.doWork(set, results);
        return results;
    }

    async doWork(set, results) {
        results.answer = Answer();
        return results;
    }
};
async processConf?(conf)
This method is used to transform the config according to certain rules. For example, when captcha is used, sessions must always be enabled:
async processConf(conf) {
    if (conf.useCaptcha)
        conf.useSessions = 1
}

async parse(set, results) {
    // Inside parse the config is read through this.conf
    if (this.conf.useSessions)
        await this.login();
}
The existence of this method is due to the fact that A-Parser supports dynamic config fields, and within one task, the config can have different values. This scenario is possible in two cases:
- Using templates in configuration fields, for example, [% tools.ua.random() %] for the field User-Agent
- Using overrides when calling one scraper from another for this.parser.request
The method processConf is called once before init(). For the cases described above, processConf is additionally called before processing each request.
Main application rules for processConf:
- Use it only if config transformation has an effect on performance
- Keep in mind that init is executed once, while processConf can be executed for each request, in which case the logic may be violated if init depends on changing config fields (see below)
async init?()
The method init is called once when the base scraper object is initialized, and serves to perform one-time actions:
- Starting the browser
- Initializing the session manager using the method this.sessionManager.init()
- Connecting to the database and creating tables in the DB
- Reading static data
- Etc.
Since the method is called once, the configuration fields on which init() depends cannot be used together with configuration field templates or with overrides when calling this.parser.request.
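The restriction is easier to see in a toy simulation of the call order described above. This is a sketch, not A-Parser's real scheduler: runJob is a hypothetical driver written for this illustration, proxyType and poolName are made-up config fields, and the hooks are synchronous here for brevity, whereas the real ones are async.

```javascript
// Toy scraper: init() caches a value derived from a config field
class DemoScraper {
    constructor(conf) {
        this.conf = conf;
        this.log = [];
    }

    processConf(conf) {
        // Derived field, recomputed every time the config changes
        conf.poolName = 'pool-' + conf.proxyType;
    }

    init() {
        // Runs once: whatever it reads from the config is frozen here
        this.pool = this.conf.poolName;
    }

    parse(query) {
        this.log.push(`${query}: conf=${this.conf.poolName} init=${this.pool}`);
    }
}

// Hypothetical driver that mimics the documented order: processConf once
// before init, then again before each request whose config is dynamic
// (modelled here as per-request overrides)
function runJob(scraper, requests) {
    scraper.processConf(scraper.conf);
    scraper.init();
    for (const { query, overrides } of requests) {
        if (overrides) {
            Object.assign(scraper.conf, overrides);
            scraper.processConf(scraper.conf);
        }
        scraper.parse(query);
    }
    return scraper.log;
}

const log = runJob(new DemoScraper({ proxyType: 'http' }), [
    { query: 'a' },
    { query: 'b', overrides: { proxyType: 'socks' } },
]);
console.log(log.join('\n'));
```

For the second request the config reports pool-socks while the value cached in init() is still pool-http, which is exactly the mismatch the rule above is meant to prevent.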
async destroy?()
The method destroy is called once when the job finishes and is needed to correctly release open resources:
- Closing the browser
- Closing the DB connection
- Etc.
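A minimal sketch of pairing init with destroy is shown below. Everything in it is an assumption made for illustration: FakeConnection stands in for a real resource such as a browser or a DB connection, ResourceScraper does not extend the real BaseParser, and runWithCleanup is a hypothetical driver, since the text does not describe how A-Parser itself schedules these calls.

```javascript
// Stand-in for a real resource (hypothetical)
class FakeConnection {
    constructor() {
        this.open = true;
    }

    async close() {
        this.open = false;
    }
}

class ResourceScraper {
    async init() {
        // One-time acquisition of the resource
        this.db = new FakeConnection();
    }

    async destroy() {
        // Guard in case init() failed before the resource was created
        if (this.db) {
            await this.db.close();
            this.db = null;
        }
    }
}

// Hypothetical driver: init once, do the work, always release the resource
async function runWithCleanup(scraper, work) {
    await scraper.init();
    try {
        return await work(scraper);
    } finally {
        await scraper.destroy();
    }
}

runWithCleanup(new ResourceScraper(), async (s) => s.db.open)
    .then((wasOpen) => console.log('open during work:', wasOpen));
```

Keeping every acquisition in init and its matching release in destroy makes it easy to check that nothing is left open when the job ends.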
async threadInit?()
This method is run when each thread is initialized. Each thread is a copy of the base scraper object with its own unique this.threadId, which ranges from 0 to threads_count - 1.
Main use cases:
- creating a browser page (tab) for each thread
async threadDestroy?()
This method is executed when a thread completes during job finalization and serves to free the resources allocated for that thread.
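The per-thread hooks can be sketched in the same spirit. The code below is an illustration under assumptions, not the A-Parser runtime: ThreadedScraper is a plain class rather than a copy of a real base scraper object, a plain object stands in for a browser page, and runThreads is a hypothetical driver that assigns threadId values from 0 to threadsCount - 1.

```javascript
class ThreadedScraper {
    constructor(threadId) {
        this.threadId = threadId;
    }

    async threadInit() {
        // For example, one browser tab per thread; a plain object stands in for the page
        this.page = { owner: this.threadId, closed: false };
    }

    async threadDestroy() {
        // Release only what this thread allocated
        if (this.page) this.page.closed = true;
    }
}

// Hypothetical driver: create the threads, initialize them, then tear them down
async function runThreads(threadsCount) {
    const threads = [];
    for (let id = 0; id < threadsCount; id++)
        threads.push(new ThreadedScraper(id));

    await Promise.all(threads.map((t) => t.threadInit()));
    const owners = threads.map((t) => t.page.owner);

    await Promise.all(threads.map((t) => t.threadDestroy()));
    return { owners, allClosed: threads.every((t) => t.page.closed) };
}

runThreads(3).then((summary) => console.log(summary));
```

Each thread owns exactly one page and closes only its own, so no coordination between threads is needed during teardown.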
async afterResultsProcessor?(results)
This method is executed after the results have been processed by the result constructor, filtering, and uniqueness checks. The main use case is adding requests to the queue with the method this.query.add after custom filters have been applied; this is how link filtering for navigation (followlinks) is implemented in the A-Parser scraper HTML::LinkExtractor.
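A rough sketch of that use case follows. Several things in it are assumptions: this.query.add is replaced by a stub that only records its argument (the real method is provided by A-Parser and may accept more parameters), LinkFollower does not extend the real BaseParser, and the shape of results (a links array of objects with a link field) is invented for the illustration.

```javascript
class LinkFollower {
    constructor() {
        this.queued = [];
        // Stub of this.query.add that records what would be queued
        this.query = { add: (query) => this.queued.push(query) };
    }

    async afterResultsProcessor(results) {
        // Assumed shape: results.links holds the items that survived filtering
        for (const item of results.links || [])
            this.query.add(item.link);
    }
}

const follower = new LinkFollower();
const done = follower.afterResultsProcessor({
    links: [
        { link: 'https://example.com/a' },
        { link: 'https://example.com/b' },
    ],
});
done.then(() => console.log(follower.queued));
```

Because the hook runs after filtering, only links that passed the user's filters reach the queue, so filtered-out pages are never requested.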