Helper methods (utils, tools, sleep)

this.utils.*

.updateResultsData(results, data)

await this.utils.updateResultsData(results, data) - automatically fills $pages.$i.data and $data; it must be called to add content to the result page

.urlFromHTML(url, base)

await this.utils.urlFromHTML(url, base) - processes a link taken from HTML code by decoding entities (&amp; etc.). Optionally pass base, a base URL (for example, the URL of the source page), to get the full absolute link
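
The behavior described above can be approximated in plain JavaScript. This is a minimal sketch, not A-Parser's implementation: the name urlFromHTMLSketch is hypothetical, and only a handful of entities are decoded, which is an assumption about what "etc." covers.

```javascript
// Plain-JavaScript approximation, NOT A-Parser's implementation.
// urlFromHTMLSketch is a hypothetical name used only for this example.
function urlFromHTMLSketch(url, base) {
  // Decode a few common HTML entities; the real method may handle more.
  const decoded = url
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, '&'); // decoded last so "&amp;lt;" becomes "&lt;", not "<"

  // Without a base, the decoded link is returned as-is.
  if (!base) return decoded;

  // The standard URL constructor resolves a relative link against the base.
  return new URL(decoded, base).href;
}

const full = urlFromHTMLSketch(
  '/search?q=a&amp;page=2',
  'https://example.com/dir/index.html'
);
console.log(full); // https://example.com/search?q=a&page=2
```

Inside a scraper the equivalent call would presumably be await this.utils.urlFromHTML(href, sourceUrl), where href and sourceUrl are placeholder variable names.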

.url.extractDomain(url, removeDefaultSubdomain)

await this.utils.url.extractDomain(url, removeDefaultSubdomain) - the method takes a link as its first parameter and returns the domain from that link. The optional second parameter determines whether to strip the www subdomain from the domain. The default is 0, i.e. do not strip.

.url.extractTopDomain(url)

await this.utils.url.extractTopDomain(url) - the method takes a link as its first parameter and returns the domain from that link, without subdomains.

.url.extractTopDomainByZone(url)

await this.utils.url.extractTopDomainByZone(url) - the method takes a link as its first parameter and returns the domain from that link, without subdomains. Works with all regional zones.

.url.extractMaxPath(url)

await this.utils.url.extractMaxPath(url) - the method takes a string and selects the URL from it

.url.extractWOParams(url)

await this.utils.url.extractWOParams(url) - the method takes a link and returns the same link truncated before the parameter string. That is, it returns the URL up to the ? character
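
To make the difference between these URL helpers concrete, here is a rough sketch of extractDomain and extractWOParams built on the standard URL class. The function names are hypothetical stand-ins, and A-Parser's own methods may treat edge cases (ports, fragments, malformed links) differently.

```javascript
// Hypothetical stand-ins for illustration, NOT A-Parser's implementation.
function extractDomainSketch(url, removeDefaultSubdomain = 0) {
  // hostname excludes the scheme, port, path and query string.
  const host = new URL(url).hostname;
  // Strip a leading "www." only when the flag is truthy.
  return removeDefaultSubdomain ? host.replace(/^www\./, '') : host;
}

function extractWOParamsSketch(url) {
  // Everything before the first "?" is kept; links without "?" are unchanged.
  const q = url.indexOf('?');
  return q === -1 ? url : url.slice(0, q);
}

const link = 'https://www.example.com/catalog/item?id=5&ref=top';
console.log(extractDomainSketch(link));    // www.example.com
console.log(extractDomainSketch(link, 1)); // example.com
console.log(extractWOParamsSketch(link));  // https://www.example.com/catalog/item
```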

.removeHtml(string)

await this.utils.removeHtml(string) - the method takes a string and returns it cleaned of HTML tags

.removeNoDigit(string)

await this.utils.removeNoDigit(string) - the method takes a string, removes everything but digits from it, and returns the result

.removeComma(string)

await this.utils.removeComma(string) - the method takes a string, removes characters such as periods, commas, and line breaks (. , \r \n) from it, and returns the result
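
The cleaning helpers can be pictured as simple regular-expression replacements. The sketch below covers removeHtml and removeNoDigit only; the function names are hypothetical, and the exact rules A-Parser applies (for example to comments or malformed tags) are not confirmed here.

```javascript
// Regex-based approximations, NOT A-Parser's implementation.
function removeHtmlSketch(s) {
  // Drop anything that looks like a tag: "<" up to the next ">".
  return s.replace(/<[^>]*>/g, '');
}

function removeNoDigitSketch(s) {
  // \D matches every character that is not a digit 0-9.
  return s.replace(/\D/g, '');
}

console.log(removeHtmlSketch('<b>Price:</b> <i>1,250</i> USD')); // Price: 1,250 USD
console.log(removeNoDigitSketch('About 1,250 results'));         // 1250
```

A typical use is turning a formatted counter such as "About 1,250 results" into a plain number before saving it to the results.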

.getAllBlocks(html, regexp, opts?)

await this.utils.getAllBlocks(html, regexp, opts?) - gets all blocks on the page together with their matching closing tags. The method takes an HTML string and a regular expression that marks the start of a block (any block with a paired closing tag, for example <div>...</div>). The result is an array of all blocks found

Options opts:

  • searchStartIndex - the index in the string from which to start searching, 0 by default

const blocks = this.utils.getAllBlocks(html, /<div [^>]*?class="results"/)
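
The idea of "a block with its matching closing tag" can be sketched as depth counting over opening and closing tags of the same name. The function below is a simplified illustration under stated assumptions, not A-Parser's implementation: getAllBlocksSketch is a hypothetical name, the start expression is assumed to match from the opening "<", the search resumes after each found block, and self-closing tags or tags inside comments and scripts are not handled.

```javascript
// Simplified illustration of balanced-tag extraction, NOT A-Parser's code.
function getAllBlocksSketch(html, startRe, opts = {}) {
  // Work on a global copy so lastIndex can drive the scan.
  const flags = startRe.flags.includes('g') ? startRe.flags : startRe.flags + 'g';
  const re = new RegExp(startRe.source, flags);
  re.lastIndex = opts.searchStartIndex || 0;

  const blocks = [];
  let m;
  while ((m = re.exec(html)) !== null) {
    // Guard against empty matches, which would never advance.
    if (m[0] === '') { re.lastIndex++; continue; }

    // The tag name is read from the start of the match, e.g. "div".
    const name = /^<([a-zA-Z][\w-]*)/.exec(m[0]);
    if (!name) continue;

    // Walk over every opening/closing tag with that name and track depth.
    const tagRe = new RegExp(`<(/?)${name[1]}\\b[^>]*>`, 'gi');
    tagRe.lastIndex = m.index;
    let depth = 0;
    let end = -1;
    let t;
    while ((t = tagRe.exec(html)) !== null) {
      depth += t[1] === '/' ? -1 : 1;
      if (depth === 0) { end = tagRe.lastIndex; break; }
    }
    if (end === -1) break; // unbalanced markup: stop searching

    blocks.push(html.slice(m.index, end));
    re.lastIndex = end; // continue after the block just captured
  }
  return blocks;
}

const html =
  '<p>intro</p><div class="results"><div>a</div></div><div class="results">b</div>';
const blocks = getAllBlocksSketch(html, /<div [^>]*?class="results"/);
console.log(blocks.length); // 2
console.log(blocks[0]);     // <div class="results"><div>a</div></div>
```

Passing { searchStartIndex: 13 } in this sketch skips the first block, because its opening tag starts at index 12.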

.getAllBlocksByAttr(html, tag, attrName, attrRegExp, opts?)

await this.utils.getAllBlocksByAttr(html, tag, attrName, attrRegExp, opts?) - a method similar to .getAllBlocks. Instead of a regular expression that marks the start of the block, it takes the tag name, the name of the attribute to search by (for example id or class), and a regular expression that is applied to the value of that attribute

const blocks = this.utils.getAllBlocksByAttr(html, 'div', 'class', /results/)

await tools.*

The global tools object gives access to A-Parser's built-in functions.

It is the analog of $tools.* in Template Toolkit templates.

Note: tools.query is unavailable; use this.query instead.

await tools.createTemplate(string)

Allows using the Template Toolkit inside the JavaScript scraper.

let template = await tools.createTemplate("Hello [% content %]!")
template = typeof template === 'function' ? await template({content: 'World'}) : template
this.logger.put(template) // Output: Hello World!

For a usage example, see the "Markdown to HTML Translation Scraper".

await this.sleep(sec)

Pauses the current thread for sec seconds; sec can be fractional.

await this.mutex.*

A mutex for synchronization between threads. It lets you lock a section of code so that only one thread executes it at a time.

.lock()

Waits for the lock. Execution continues in the first thread that acquires the lock; the other threads wait until the lock is released.

.unlock()

Releases the lock. If another thread is waiting in .lock(), it continues execution.
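
The usual pattern is to release the lock in a finally block so that an exception inside the critical section cannot leave other threads waiting forever. The sketch below uses SketchMutex, a hypothetical promise-based stand-in written only for this example; inside a scraper the calls would be await this.mutex.lock() and this.mutex.unlock() instead.

```javascript
// SketchMutex is a hypothetical stand-in, NOT A-Parser's this.mutex.
class SketchMutex {
  constructor() {
    this.locked = false;
    this.waiters = []; // resolve callbacks of tasks waiting for the lock
  }
  lock() {
    if (!this.locked) {
      this.locked = true;
      return Promise.resolve();
    }
    return new Promise((resolve) => this.waiters.push(resolve));
  }
  unlock() {
    const next = this.waiters.shift();
    if (next) next();           // hand the lock directly to the next waiter
    else this.locked = false;   // nobody is waiting: mark the lock free
  }
}

const mutex = new SketchMutex();
const log = [];

async function worker(id) {
  await mutex.lock();
  try {
    // Critical section: only one worker at a time gets here.
    log.push(`start ${id}`);
    await new Promise((r) => setTimeout(r, 10));
    log.push(`end ${id}`);
  } finally {
    mutex.unlock(); // always release, even if the section throws
  }
}

const done = Promise.all([worker(1), worker(2), worker(3)]);
done.then(() => console.log(log.join(', ')));
// start 1, end 1, start 2, end 2, start 3, end 3
```

Without the lock, the three "start" entries would all be logged before any "end" entry, because each worker yields while it waits.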

results.<array>.addElement()

The results.<array>.addElement() method allows for more convenient filling of arrays in results. When using it, you don't need to remember the sequence of variables in the array and list them manually.

results.serp.addElement({
    link: 'https://google.com',
    anchor: 'Google',
    snippet: 'Lorem ipsum...',
});

this.isContextAlive()

This method is needed for long-lived threads that process requests in a loop. It lets them terminate correctly when the job is stopped or removed.

while (this.isContextAlive()) {
    await this.request(...)
}
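
To show how such a loop ends, here is a small simulation that runs outside A-Parser. The ctx object is a hypothetical mock: its isContextAlive, sleep, and request methods only imitate the scraper context, and the mock "stops the job" after three requests.

```javascript
// ctx is a hypothetical mock of the scraper context, for illustration only.
const ctx = {
  alive: true,
  processed: 0,
  isContextAlive() {
    return this.alive;
  },
  sleep(sec) {
    // Same contract as described for this.sleep: seconds, may be fractional.
    return new Promise((resolve) => setTimeout(resolve, sec * 1000));
  },
  async request() {
    this.processed++;
    // Simulate the job being stopped after the third request.
    if (this.processed === 3) this.alive = false;
  },
};

async function run() {
  // The condition is re-checked before every iteration, so the loop
  // exits as soon as the context is reported as no longer alive.
  while (ctx.isContextAlive()) {
    await ctx.request();
    await ctx.sleep(0.01);
  }
  return ctx.processed;
}

const finished = run();
finished.then((n) => console.log(n)); // 3
```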