Hook methods

These methods are hooks: implementing them lets you control the parser at each stage of its life cycle, from initialization to object destruction.

Every method except parse is optional.

async parse(set, results)

The parse method implements the main logic for processing a query and obtaining the parsing result; the following are passed as arguments:

  • set - an object with information about the query:
    • set.query - the text string of the query
    • set.lvl - the query level, defaults to 0
  • results - an object with results that need to be filled and returned from the parse() method
    • the parser must check whether each key is present in the results object and fill only the keys that are present; this saves time because only the data actually used to build the result is parsed
    • results contains a key for each required flat variable, set to none (the default, meaning no result was obtained), and a key for each required array variable, set to an empty array ready to be filled
    • results.success must be set to 1 when the query is processed successfully; the default value 0 means the query was processed with an error

Let's look at an example:

class JS_HTML_Tags extends BaseParser {
    static defaultConf = {
        results: {
            flat: [
                ['title', 'Title'],
            ],
            arrays: {
                h2: ['H2 Headers List', [
                    ['header', 'Header'],
                ]],
            }
        },
        ...
    };

    async parse(set, results) {
        // Get the content of the HTML page whose address was passed in the query
        const {success, data, headers} = await this.request('GET', set.query);

        // Check success and data type; when processing HTML pages correctly, we should get type 'string', otherwise A-Parser returns a Buffer object
        if (success && typeof data === 'string') {
            let matches;

            // Check the need to collect the title and save the value
            // (the assignment is parenthesized because && binds tighter than =)
            if (results.title && (matches = data.match(/<title[^>]*>(.*?)<\/title>/)))
                results.title = matches[1];

            // Check the need to collect h2
            if (results.h2) {
                const re = /<h2[^>]*>(.*?)<\/h2>/g;
                while ((matches = re.exec(data))) {
                    // Save all found h2 tags in a loop
                    results.h2.push(matches[1]);
                }
            }

            // Notify about parsing success
            results.success = 1;
        }

        // Return processed results
        return results;
    }
};

Note that you can create your own functions and methods for better code organization:

function Answer() {
    return 42;
}

class JS_HTML_Tags extends BaseParser {
    ...

    async parse(set, results) {
        results = await this.doWork(set, results);
        return results;
    }

    async doWork(set, results) {
        results.answer = Answer();
        return results;
    }
};

async processConf?(conf)

This method transforms the config according to your own rules. For example, when a captcha is used, sessions must always be enabled:

async processConf(conf) {
    if (conf.useCaptcha)
        conf.useSessions = 1;
}

async parse(set, results) {
    // Inside parse the processed config is read from this.conf
    if (this.conf.useSessions)
        await this.login();
}

This method exists because A-Parser supports dynamic config fields, so the config can take different values within a single task. This happens in two cases:

  • Using templates in configuration fields, for example [% tools.ua.random() %] for the User-Agent field
  • Using overrides when calling one parser from another via this.parser.request

The processConf method is always called once before init(). In the two cases described above, it is additionally called before each query is processed.

Main rules for applying processConf:

  • Use only if config transformations have an effect on performance
  • Keep in mind that init is executed once, while processConf can be executed for each query; in this case, logic may be broken if init depends on changing config fields (see below)
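
The call order described above can be illustrated with a small stand-alone sketch. The runTask harness, its dynamicConf flag, and the DemoParser class are hypothetical stand-ins written only to show when each hook fires; they are not part of the A-Parser API, and the real scheduler may differ in details.

```javascript
// Hypothetical harness: NOT A-Parser code, it only mimics the hook order described in the text.
async function runTask(parser, baseConf, queries, dynamicConf) {
    const calls = [];

    // processConf is always called once before init()
    parser.conf = { ...baseConf };
    await parser.processConf(parser.conf);
    calls.push('processConf');

    await parser.init();
    calls.push('init');

    for (const query of queries) {
        if (dynamicConf) {
            // With templates or overrides the config may differ per query,
            // so processConf runs again before each query
            parser.conf = { ...baseConf };
            await parser.processConf(parser.conf);
            calls.push('processConf');
        }
        await parser.parse({ query, lvl: 0 }, { success: 0 });
        calls.push('parse');
    }
    return calls;
}

// Hypothetical parser with the same processConf rule as in the example above
class DemoParser {
    async processConf(conf) {
        if (conf.useCaptcha) conf.useSessions = 1;
    }

    async init() {
        // Anything read from this.conf here is fixed for the whole task
    }

    async parse(set, results) {
        results.success = 1;
        return results;
    }
}
```

Calling runTask(new DemoParser(), {useCaptcha: 1}, ['a', 'b'], true) resolves to the sequence processConf, init, processConf, parse, processConf, parse. With dynamicConf set to false, processConf appears only once, before init.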

async init?()

The init method is called once during the initialization of the base parser object and serves to perform one-time actions:

  • Starting the browser
  • Initializing the session manager using the this.sessionManager.init() method
  • Connecting to a database and creating tables in the DB
  • Reading static data
  • Etc.

Caution: because init() is called only once, the configuration fields it depends on cannot be combined with configuration field templates or with overrides passed to this.parser.request.

async destroy?()

The destroy method is called once upon completion of the task and is necessary for the correct destruction of open resources:

  • Closing the browser
  • Closing the DB connection
  • Etc.
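
A minimal sketch of pairing init() with destroy() follows. The BaseParser stub and the FakeConnection class are hypothetical and exist only so the snippet runs outside A-Parser; a real parser would open an actual browser or database connection instead. The sketch also assumes that a property set in init() is visible later in parse().

```javascript
// Stub so the sketch runs outside A-Parser; inside A-Parser, BaseParser is provided for you.
class BaseParser {}

// Hypothetical resource standing in for a browser or a DB connection
class FakeConnection {
    constructor() {
        this.open = true;
    }

    async close() {
        this.open = false;
    }
}

class JS_Resource_Demo extends BaseParser {
    async init() {
        // One-time setup for the whole task
        this.db = new FakeConnection();
    }

    async parse(set, results) {
        // Assumption: the resource opened in init() is reachable here
        results.success = this.db.open ? 1 : 0;
        return results;
    }

    async destroy() {
        // Release whatever init() opened
        if (this.db) {
            await this.db.close();
        }
    }
}
```

Keeping the open and close logic in these two hooks means each resource is created once per task rather than once per query.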

async threadInit?()

This method is called when each thread is initialized. Each thread is a copy of the base parser object with its own unique this.threadId, which ranges from 0 to threads_count - 1.

Main use cases:

  • Creating a browser page (tab) for each thread

async threadDestroy?()

This method is called when a thread terminates as the task completes; it frees the resources allocated for that thread.
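
The per-thread hooks can be sketched as follows. The BaseParser stub, the plain page object, and the simulateThreads harness are hypothetical: they imitate the per-thread copies and this.threadId that A-Parser provides, so that the snippet can run on its own.

```javascript
// Stub so the sketch runs outside A-Parser
class BaseParser {}

class JS_Thread_Demo extends BaseParser {
    async threadInit() {
        // this.threadId is assigned per thread copy (0 .. threads_count - 1);
        // the plain object below is a hypothetical stand-in for a browser tab
        this.page = { name: `page-${this.threadId}`, closed: false };
    }

    async threadDestroy() {
        // Free the resource that belongs to this thread only
        if (this.page) {
            this.page.closed = true;
            this.page = null;
        }
    }
}

// Hypothetical harness imitating how each thread gets its own copy and id
async function simulateThreads(threadsCount) {
    const threads = [];
    for (let id = 0; id < threadsCount; id++) {
        const thread = new JS_Thread_Demo();
        thread.threadId = id;
        await thread.threadInit();
        threads.push(thread);
    }
    return threads;
}
```

Because every thread holds its own page, queries processed in parallel do not share a tab.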

async afterResultsProcessor?(results)

This method is executed after the results have been processed by the Results Builder, filtered, and deduplicated. The main use case is adding queries to the queue with the this.query.add method after user filters have been applied; this is how link filtering for transitions (followlinks) is implemented in the HTML::LinkExtractor parser.
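
A minimal sketch of this use case is shown below. The BaseParser stub, the links array with its link key, and the one-argument this.query.add stub are assumptions made for illustration; check the A-Parser API reference for the real signature of this.query.add and for the result names of your parser.

```javascript
// Stub so the sketch runs outside A-Parser
class BaseParser {}

class JS_Follow_Demo extends BaseParser {
    constructor() {
        super();
        // Stub for this.query; in A-Parser this object is provided by the API.
        // Only a single-argument add() is assumed here.
        this.queued = [];
        this.query = { add: (query) => this.queued.push(query) };
    }

    async afterResultsProcessor(results) {
        // `links` is a hypothetical array result that has already passed
        // user filters and deduplication by the time this hook runs
        for (const item of results.links || []) {
            this.query.add(item.link);
        }
    }
}
```

Since the hook runs after filtering, only links that survived the user's filters are queued as new queries.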