Hook Methods

These methods act as hooks: by implementing them you control how the scraper operates at each stage, from initialization to destruction of the object.

Implementation of all methods except parse is optional.

async parse(set, results)

The parse method implements the main logic for processing a request and producing the parsing result. It receives the following arguments:

  • set - an object with information about the request:
    • set.query - the text string of the request
    • set.lvl - the level of the request, 0 by default
  • results - an object with the results, which must be filled in and returned from the parse() method
    • the scraper must check whether each key is present in the results object and fill it only if it exists; this saves time because only the data actually used in the result is parsed
    • results contains the keys of the requested flat variables with the value none, which by default means that the result has not been obtained, and the keys of array variables with an empty array as the value, ready to be filled
    • results.success must be set to 1 if the request was processed successfully; the default value is 0, which means that the request was processed with an error

Let's look at an example:

class JS_HTML_Tags extends BaseParser {
    static defaultConf = {
        results: {
            flat: [
                ['title', 'Title'],
            ],
            arrays: {
                h2: ['H2 Headers List', [
                    ['header', 'Header'],
                ]],
            }
        },
        ...
    };

    async parse(set, results) {
        // Fetch the HTML of the page whose address was passed in the query
        const {success, data, headers} = await this.request('GET', set.query);

        // Check success and the type of data: for a correctly processed HTML page
        // it is a 'string', otherwise A-Parser returns a Buffer object
        if (success && typeof data == 'string') {
            let matches;

            // Check whether title is requested and save its value
            // (the assignment must be parenthesized to be valid inside the condition)
            if (results.title && (matches = data.match(/<title[^>]*>(.*?)<\/title>/)))
                results.title = matches[1];

            // Check whether h2 is requested
            if (results.h2) {
                const re = /<h2[^>]*>(.*?)<\/h2>/g;
                while ((matches = re.exec(data))) {
                    // Save every h2 tag found
                    results.h2.push(matches[1]);
                }
            }

            // Report that parsing succeeded
            results.success = 1;
        }

        // Return the processed results
        return results;
    }
}
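The key checks from the example can be tried outside A-Parser. The following is a minimal sketch, assuming the results object arrives in the shape described above (the value none for flat keys, an empty array for array keys). The helper extractTags and the sample HTML are hypothetical and not part of the A-Parser API.

```javascript
// Hypothetical helper: the same checks as in parse(), without A-Parser itself.
function extractTags(html, results) {
    // Fill "title" only if the key was requested
    if ('title' in results) {
        const m = html.match(/<title[^>]*>(.*?)<\/title>/);
        if (m) results.title = m[1];
    }

    // Fill "h2" only if the key was requested
    if ('h2' in results) {
        for (const m of html.matchAll(/<h2[^>]*>(.*?)<\/h2>/g)) {
            results.h2.push(m[1]);
        }
    }

    results.success = 1;
    return results;
}

const html = '<html><head><title>Demo</title></head>' +
    '<body><h2>One</h2><h2 class="x">Two</h2></body></html>';

// Assumed shape when both title and h2 are used in the result format
const full = extractTags(html, { success: 0, title: 'none', h2: [] });

// Only title is requested, so the h2 search is skipped entirely
const titleOnly = extractTags(html, { success: 0, title: 'none' });

console.log(full);
console.log(titleOnly);
```

When a key is absent, the corresponding regular expression is never run, which is the optimization the results contract is meant to enable.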

Note that you can create your own functions and methods for better code organization:

function Answer() {
    return 42;
}

class JS_HTML_Tags extends BaseParser {
    ...

    async parse(set, results) {
        results = await this.doWork(set, results);
        return results;
    }

    async doWork(set, results) {
        results.answer = Answer();
        return results;
    }
}

async processConf?(conf)

This method transforms the config according to certain rules. For example, when captcha is used, sessions must always be enabled:

async processConf(conf) {
    if (conf.useCaptcha)
        conf.useSessions = 1;
}

async parse(set, results) {
    // The config is read through this.conf inside the scraper methods
    if (this.conf.useSessions)
        await this.login();
}

This method exists because A-Parser supports dynamic config fields, so the config can take different values within a single task. This is possible in two cases:

  • Templates in configuration fields, for example [% tools.ua.random() %] in the User-Agent field
  • Overrides when one scraper calls another through this.parser.request

The processConf method is called once before init(). In the two cases described above, processConf is additionally called before each request is processed.

Main rules for applying processConf:

  • Use it only if the config transformation has an effect on performance
  • Keep in mind that init runs once, while processConf can run for every request; if init depends on config fields that change, the logic may break (see below)
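Because processConf can run both before init() and again before individual requests, it helps to keep the transformation idempotent, so that repeated calls leave the config unchanged. The sketch below illustrates this with a plain function; normalizeConf is a hypothetical stand-in for the body of processConf, and the config object is sample data.

```javascript
// Hypothetical stand-in for the transformation done in processConf().
// It must give the same result no matter how many times it runs.
function normalizeConf(conf) {
    if (conf.useCaptcha) conf.useSessions = 1;
    return conf;
}

const conf = { useCaptcha: 1, useSessions: 0 };

normalizeConf(conf);                         // the call before init()
const afterFirst = JSON.stringify(conf);

normalizeConf(conf);                         // a repeated per-request call
const afterSecond = JSON.stringify(conf);

// A config without captcha is left untouched
const plain = normalizeConf({ useCaptcha: 0, useSessions: 0 });

console.log(afterFirst === afterSecond, plain.useSessions);
```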

async init?()

The init method is called once when the base scraper object is initialized. It is used for one-time actions:

  • Starting the browser
  • Initialization of the session manager using the this.sessionManager.init() method
  • Connecting to the database and creating tables in the DB
  • Reading static data
  • Etc.

Caution

Since the method is called only once, none of the configuration fields that init() depends on can be used together with templates in configuration fields or with overrides when calling this.parser.request.

async destroy?()

The destroy method is called once when the task completes. It is needed to correctly release open resources:

  • Closing the browser
  • Closing the database connection
  • Etc.
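The pairing of init() and destroy() can be sketched as follows. This is an illustration only: BaseParser here is an empty stand-in so the code runs on its own, FakeConnection is a hypothetical resource in place of a real browser or database connection, and runLifecycle simulates calls that A-Parser makes itself.

```javascript
// Empty stand-in for the A-Parser base class (assumption for a standalone run)
class BaseParser {}

// Hypothetical resource; a real scraper might hold a browser or a DB connection
class FakeConnection {
    constructor() { this.open = true; }
    async close() { this.open = false; }
}

class JS_Resource_Demo extends BaseParser {
    async init() {
        // One-time setup for the whole task
        this.db = new FakeConnection();
    }

    async destroy() {
        // Release what init() acquired; the guard covers a failed init()
        if (this.db) {
            await this.db.close();
            this.db = null;
        }
    }
}

// Simulated lifecycle: init once, then destroy once
async function runLifecycle() {
    const parser = new JS_Resource_Demo();
    await parser.init();
    const conn = parser.db;
    const openAfterInit = conn.open;
    await parser.destroy();
    return { openAfterInit, openAfterDestroy: conn.open, ref: parser.db };
}

runLifecycle().then((state) => console.log(state));
```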

async threadInit?()

This method runs when each thread is initialized. Each thread is a copy of the base scraper object with its own unique this.threadId, which ranges from 0 to threads_count - 1.

Main application scenarios:

  • creating a browser page (tab) for each thread

async threadDestroy?()

This method runs when a thread finishes as the task is completing. It is used to free the resources allocated for that thread.
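A per-thread resource can be sketched like this. The example is hypothetical: BaseParser is an empty stand-in, the "page" is a plain object rather than a real browser tab, and simulateThreads only imitates the description above (a copy of the base object per thread, each with its own threadId). How A-Parser actually creates the copies is not specified here.

```javascript
// Empty stand-in for the A-Parser base class (assumption for a standalone run)
class BaseParser {}

class JS_Thread_Demo extends BaseParser {
    async threadInit() {
        // Per-thread resource, e.g. a browser tab; here just a labelled object
        this.page = { name: `page-${this.threadId}`, closed: false };
    }

    async threadDestroy() {
        // Free only what this thread allocated
        if (this.page) this.page.closed = true;
    }
}

// Simulation of the thread lifecycle with ids 0 .. threadsCount - 1
async function simulateThreads(threadsCount) {
    const base = new JS_Thread_Demo();
    const threads = [];

    for (let id = 0; id < threadsCount; id++) {
        // Copy of the base object with its own threadId
        const thread = Object.assign(
            Object.create(JS_Thread_Demo.prototype), base, { threadId: id });
        await thread.threadInit();
        threads.push(thread);
    }

    const names = threads.map((t) => t.page.name);
    for (const t of threads) await t.threadDestroy();

    return { names, allClosed: threads.every((t) => t.page.closed) };
}

simulateThreads(3).then((state) => console.log(state));
```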

async afterResultsProcessor?(results)

This method runs after the results have been processed by the result constructor, filtering, and deduplication. Its main use case is adding queries to the queue with the this.query.add method after custom filters have been applied, which makes it possible to implement link filtering for transitions (followlinks) in the HTML::LinkExtractor scraper.
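A minimal sketch of such a filter is shown below. Several parts are assumptions made for illustration: BaseParser is an empty stand-in, this.query is stubbed with an object that records added queries, results.links is assumed to be an array of objects with a link field, and the "/blog/" rule is a hypothetical custom filter.

```javascript
// Empty stand-in for the A-Parser base class (assumption for a standalone run)
class BaseParser {}

class JS_Link_Filter extends BaseParser {
    constructor() {
        super();
        // Stub of this.query; in A-Parser this object is provided by the runtime
        this.queued = [];
        this.query = { add: (q) => { this.queued.push(q); } };
    }

    async afterResultsProcessor(results) {
        // Assumed shape: results.links is an array of { link } objects
        for (const item of results.links || []) {
            // Hypothetical custom filter: follow only links under /blog/
            if (item.link.includes('/blog/')) {
                this.query.add(item.link);
            }
        }
    }
}

async function demo() {
    const parser = new JS_Link_Filter();
    await parser.afterResultsProcessor({
        links: [
            { link: 'https://example.com/blog/post-1' },
            { link: 'https://example.com/about' },
            { link: 'https://example.com/blog/post-2' },
        ],
    });
    return parser.queued;
}

demo().then((queued) => console.log(queued));
```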