Hook Methods
These methods work on the principle of hooks. Implementing these methods allows you to control the operation of the scraper at different stages, from initialization to the destruction of the object.
Implementing every method except parse is optional.
async parse(set, results)
The parse method implements the main logic: it processes the request and produces the parsing result. It receives the following arguments:
- set - an object with information about the request:
  - set.query - the text string of the request
  - set.lvl - the level of the request, 0 by default
- results - an object with the results that must be filled in and returned from the parse() method:
  - the scraper must check that each key is present in the results object and fill it only if it exists; this saves time by parsing only the data that is actually used to build the result
  - results contains the keys of the required flat variables with the value none, which by default means that the result has not been obtained, as well as the keys of array variables (arrays) with an empty array as the value, ready to be filled
  - results.success should be set to 1 if the request was processed successfully; the default value is 0, which indicates that the request was processed with an error
Let's look at an example:
class JS_HTML_Tags extends BaseParser {
    static defaultConf = {
        results: {
            flat: [
                ['title', 'Title'],
            ],
            arrays: {
                h2: ['H2 Headers List', [
                    ['header', 'Header'],
                ]],
            }
        },
        ...
    };

    async parse(set, results) {
        // Fetch the HTML of the page whose address was passed in the request
        const {success, data, headers} = await this.request('GET', set.query);

        // Check success and the type of data: for a correctly processed HTML page it is a 'string', otherwise A-Parser returns a Buffer object
        if (success && typeof data == 'string') {
            let matches;

            // Collect title only if it is needed, and save the value
            if (results.title && (matches = data.match(/<title[^>]*>(.*?)<\/title>/)))
                results.title = matches[1];

            // Collect h2 only if it is needed
            if (results.h2) {
                const re = /<h2[^>]*>(.*?)<\/h2>/g;
                while ((matches = re.exec(data))) {
                    // Save every h2 tag found in the loop
                    results.h2.push(matches[1]);
                }
            }

            // Report that parsing succeeded
            results.success = 1;
        }

        // Return the processed results
        return results;
    }
};
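The key checks in parse() can be tried outside A-Parser. The following is a minimal sketch, assuming a hypothetical helper extractTags(html, results) that is not part of the A-Parser API: it receives HTML that has already been downloaded and a results object shaped as described above, and it fills only the keys that are present.

```javascript
// Hypothetical helper, not part of the A-Parser API: applies the same
// key checks as parse() to an HTML string that is already in memory.
function extractTags(html, results) {
    let matches;

    // Fill title only if the key was requested
    if (results.title && (matches = html.match(/<title[^>]*>(.*?)<\/title>/)))
        results.title = matches[1];

    // Fill h2 only if the array was requested
    if (results.h2) {
        const re = /<h2[^>]*>(.*?)<\/h2>/g;
        while ((matches = re.exec(html)))
            results.h2.push(matches[1]);
    }

    results.success = 1;
    return results;
}

const html = '<title>Demo</title><h2>First</h2><h2>Second</h2>';

// Only title is requested here, so the h2 tags are never searched for
const onlyTitle = extractTags(html, { success: 0, title: 'none' });

// Both keys are requested, so both are filled
const both = extractTags(html, { success: 0, title: 'none', h2: [] });
```

In the first call the h2 regular expression is never executed, which is the saving the key checks are meant to provide.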
Note that you can create your own functions and methods for better code organization:
function Answer() {
    return 42;
}

class JS_HTML_Tags extends BaseParser {
    ...

    async parse(set, results) {
        results = await this.doWork(set, results);
        return results;
    }

    async doWork(set, results) {
        results.answer = Answer();
        return results;
    }
};
async processConf?(conf)
This method transforms the config according to certain rules. For example, when captcha is used, sessions must always be enabled:
async processConf(conf) {
    if (conf.useCaptcha)
        conf.useSessions = 1;
}

async parse(set, results) {
    // Inside parse the config is read through this.conf
    if (this.conf.useSessions)
        await this.login();
}
This method exists because A-Parser supports dynamic config fields, so the config can take different values within a single task. This is possible in two cases:
- Use of templates in configuration fields, for example [% tools.ua.random() %] for the User-Agent field
- Use of overrides when calling one scraper from another through this.parser.request
The processConf method is called once before init(). In the cases described above, processConf is additionally called before each request is processed.
Main rules for applying processConf:
- Use it only if the config transformation affects performance
- Keep in mind that init is performed once, while processConf can be performed for each request; the logic may break if init depends on config fields that change (see below)
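The second rule can be illustrated with a minimal sketch. Everything here is a stand-in written for illustration, not the A-Parser runtime: the ConfDemo class, the sessionsReady field, and the runDemo() driver are hypothetical, and the driver only imitates the call order described above (processConf, then init once, then processConf again before a request whose config was overridden).

```javascript
// Stand-in scraper, NOT a real BaseParser subclass
class ConfDemo {
    constructor(conf) {
        this.conf = conf;
        this.log = [];
    }

    async processConf(conf) {
        if (conf.useCaptcha)
            conf.useSessions = 1;
    }

    async init() {
        // Runs once, so it only sees the config produced by the first processConf call
        this.sessionsReady = Boolean(this.conf.useSessions);
        this.log.push(`init sessionsReady=${this.sessionsReady}`);
    }

    async parse(set, results) {
        // A per-request override may enable sessions that init() never prepared
        if (this.conf.useSessions && !this.sessionsReady)
            this.log.push(`query ${set.query}: sessions required but not initialized`);
        results.success = 1;
        return results;
    }
}

// Hypothetical driver imitating a task in which one request carries an override
async function runDemo() {
    const parser = new ConfDemo({ useCaptcha: 0 });
    await parser.processConf(parser.conf);
    await parser.init();

    // The next request arrives with an override that switches captcha on
    parser.conf = { ...parser.conf, useCaptcha: 1 };
    await parser.processConf(parser.conf);
    await parser.parse({ query: 'q2', lvl: 0 }, { success: 0 });
    return parser.log;
}
```

In this sketch init() records that sessions are not needed, yet the overridden request turns them on afterwards, which is the kind of broken logic the rule warns about.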
async init?()
The init method is called once when the base scraper object is initialized. It performs one-time actions:
- Starting the browser
- Initializing the session manager with the this.sessionManager.init() method
- Connecting to the database and creating tables in the DB
- Reading static data
- Etc.
Since the method is called once, no configuration field that init() depends on can be used together with templates in configuration fields or with overrides when calling this.parser.request.
async destroy?()
The destroy method is called once when the task completes. It is needed to correctly release open resources:
- Closing the browser
- Closing the database connection
- Etc.
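A minimal sketch of how init() and destroy() pair up, under stated assumptions: FakeConnection is a hypothetical stand-in for a browser or database connection, ResourceDemo does not extend BaseParser so that the snippet runs on its own, and runTask() merely imitates the order in which A-Parser would call the hooks.

```javascript
// Hypothetical resource standing in for a browser or a DB connection
class FakeConnection {
    constructor() { this.open = true; }
    async close() { this.open = false; }
}

// Stand-in scraper; a real one would extend BaseParser
class ResourceDemo {
    async init() {
        // One-time setup for the whole task
        this.db = new FakeConnection();
    }

    async destroy() {
        // Release whatever init() opened; the guard covers a failed init()
        if (this.db) {
            await this.db.close();
            this.db = null;
        }
    }
}

// Imitates the task lifecycle; in A-Parser the hooks are called by the runtime
async function runTask() {
    const parser = new ResourceDemo();
    await parser.init();
    const conn = parser.db;
    await parser.destroy();
    return { closed: !conn.open, cleared: parser.db === null };
}
```

Keeping the release logic in destroy() mirrors the setup in init(), so each resource has one place where it is opened and one where it is closed.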
async threadInit?()
This method runs when each thread is initialized. Each thread is a copy of the base scraper object with its own unique this.threadId, which starts at 0 and ends at threads_count - 1.
Main application scenarios:
- Creating a browser page (tab) for each thread
async threadDestroy?()
It runs when a thread finishes as the task is completing, and frees the resources allocated for that thread.
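The per-thread hooks can be sketched in the same spirit. This is an illustration under assumptions, not the A-Parser implementation: FakePage is a hypothetical stand-in for a browser tab, ThreadDemo does not extend BaseParser, and runThreads() imitates the creation of thread copies with Object.create, which may differ from how A-Parser actually copies the base object.

```javascript
// Hypothetical page object standing in for a browser tab
class FakePage {
    constructor(id) { this.id = id; this.closed = false; }
    async close() { this.closed = true; }
}

// Stand-in scraper; a real one would extend BaseParser
class ThreadDemo {
    async threadInit() {
        // this.threadId is unique per thread: 0 .. threads_count - 1
        this.page = new FakePage(this.threadId);
    }

    async threadDestroy() {
        // Free only what this thread allocated
        if (this.page) {
            await this.page.close();
            this.page = null;
        }
    }
}

// Imitates the runtime: one copy of the base object per thread
async function runThreads(threadsCount) {
    const base = new ThreadDemo();
    const threads = [];
    for (let id = 0; id < threadsCount; id++) {
        const thread = Object.create(base); // assumption about the copy mechanism
        thread.threadId = id;
        await thread.threadInit();
        threads.push(thread);
    }
    const pageIds = threads.map((t) => t.page.id);
    for (const thread of threads)
        await thread.threadDestroy();
    return { pageIds, allReleased: threads.every((t) => t.page === null) };
}
```

Each thread ends up with its own page keyed by its threadId, and threadDestroy() releases that page without touching the pages of other threads.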
async afterResultsProcessor?(results)
This method runs after the results have been processed by the result constructor, filtering, and deduplication. The main use case is adding queries to the queue with the this.query.add method after applying custom filters, which implements link filtering for transitions (followlinks) for the HTML::LinkExtractor scraper.
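A minimal sketch of such a filter, with several assumptions stated plainly: LinkFilterDemo is a hypothetical class that does not extend HTML::LinkExtractor, this.query.add is replaced by a stub that only records its argument, results.links is assumed to be an array of URL strings, and the example.com host rule is an invented custom filter.

```javascript
// Stand-in scraper; in A-Parser this.query is provided by the runtime
class LinkFilterDemo {
    constructor() {
        this.queued = [];
        // Stub for illustration only: records the queries that would be queued
        this.query = { add: (query) => this.queued.push(query) };
    }

    async afterResultsProcessor(results) {
        // Hypothetical custom filter: follow only links on example.com
        for (const link of results.links) {
            if (new URL(link).hostname === 'example.com')
                this.query.add(link);
        }
    }
}

// Only the first link passes the filter and reaches the queue
const demo = new LinkFilterDemo();
const done = demo.afterResultsProcessor({
    links: ['https://example.com/a', 'https://other.org/b'],
});
```

Because the hook runs after filtering and deduplication, only links that survived those steps reach the custom filter.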