Chrome Management (puppeteer)

A-Parser + Puppeteer

A-Parser can use the Chrome (Chromium) browser as an engine for downloading and rendering pages through the popular puppeteer library. The main advantages of combining puppeteer with A-Parser are:

  • support for separate proxies for each browser tab
  • multi-threaded browser tab management
  • interception of requests
  • all the capabilities of A-Parser for managing queues, forming requests, and processing results

Using the Chrome browser opens up the following possibilities:

  • rendering DOM and JavaScript
  • the ability to interactively work with website elements:
    • filling out forms
    • following links
    • drag & drop
    • file downloads
    • mouse emulation
    • and much more, any standard actions can be automated
  • easier bypassing of various anti-scraping protections, since Chromium behaves almost identically to the browsers that real users run
  • the ability to work in Headless mode, i.e. without a graphical interface, which saves resources and allows running the browser on servers without a graphical environment

Note: running a browser is much more resource-intensive (CPU, memory) than regular A-Parser threads.

Depending on the complexity of the site, use no more than 1-2 threads (browser tabs) per available processor core; for example, 8 to 16 tabs on an 8-core processor.
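The rule of thumb above can be turned into a small calculation. The sketch below is only an illustration: `recommendedTabRange` and `TabRange` are hypothetical names, not part of A-Parser, and the core count is passed in by the caller (in Node.js it could come from `os.cpus().length`).

```typescript
// Hypothetical helper: the name and return shape are illustrative only,
// they are not part of the A-Parser API.
interface TabRange {
    min: number; // one tab per core, for complex sites
    max: number; // two tabs per core, for lighter sites
}

function recommendedTabRange(cpuCores: number): TabRange {
    if (!Number.isInteger(cpuCores) || cpuCores < 1) {
        throw new RangeError('cpuCores must be a positive integer');
    }
    // 1-2 browser tabs per available processor core
    return { min: cpuCores, max: cpuCores * 2 };
}

const range = recommendedTabRange(8);
console.log(`${range.min}-${range.max} tabs`); // prints "8-16 tabs"
```

The result is only a starting point for the thread count of a task; the actual limit depends on how heavy the target pages are.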

Usage example

Let's consider the example of the Chrome::ScreenshotMaker2 scraper:

  • the scraper takes screenshots of sites at a specified viewport size and can also resize (scale) the resulting image
  • can optionally use a proxy
  • creates a separate browser tab for each A-Parser thread
import { BaseParser, PuppeteerTypes } from 'a-parser-types';

let browser: PuppeteerTypes.Browser;
let jimp;

class JS_Chrome_ScreenshotsMaker2 extends BaseParser {
    static defaultConf: typeof BaseParser.defaultConf = {
        version: '0.2.1',
        results: {
            flat: [
                ['screenshot', 'PNG screenshot'],
            ]
        },
        results_format: '$screenshot',
        load_timeout: 30,
        width: 1024,
        height: 768,
        log_screenshots: 0,
        headless: 1,
    };

    static editableConf: typeof BaseParser.editableConf = [
        ['log_screenshots', ['checkbox', 'Log Screenshots']],
        ['width', ['textfield', 'Viewport Width']],
        ['height', ['textfield', 'Viewport Height']],
        ['resize_width', ['textfield', 'Resize Width']],
        ['resize_height', ['textfield', 'Resize Height']],
        ['headless', ['checkbox', 'Chrome Headless']],
    ];

    async init() {
        // initialize the browser
        browser = await this.puppeteer.launch({
            headless: this.conf.headless,
            logConnections: false,
            defaultViewport: {
                width: parseInt(this.conf.width),
                height: parseInt(this.conf.height),
            }
        });

        if(this.conf.resize_width) {
            // load the jimp module only if the screenshot has to be resized
            jimp = require('jimp');
        }
    }

    async destroy() {
        // close the browser when the task finishes
        if(browser)
            await browser.close();
    }

    page: PuppeteerTypes.Page;

    async threadInit() {
        // create a browser page when the thread is initialized
        this.page = await browser.newPage();

        // standard puppeteer methods
        await this.page.setCacheEnabled(true);
        await this.page.setDefaultNavigationTimeout(this.conf.timeout * 1000);

        // tell A-Parser to use a proxy for this page
        await this.puppeteer.setPageUseProxy(this.page);

        this.logger.put(`New page created for thread #${this.threadId}`);
    }

    async parse(set, results) {
        const self = this;
        const {conf, page} = self;

        for(let attempt = 1; attempt <= conf.proxyretries; attempt++) {
            try {
                self.logger.put(`Attempt #${attempt}`);

                // navigate to the page given in the query
                await page.goto(set.query);
                // hide the scroll bar for the screenshot
                await page.evaluate(() => { document.querySelector('html').style.overflow = 'hidden'; });

                // take the screenshot
                results.screenshot = await page.screenshot();

                if(parseInt(conf.resize_width)) {
                    // resize the image if requested
                    let image = await jimp.read(results.screenshot);
                    image.resize(parseInt(conf.resize_width), parseInt(conf.resize_height));
                    results.screenshot = await image.getBufferAsync('image/png');
                }

                self.logger.put(`Screenshot(${attempt}): OK, size: ${parseInt("" + (results.screenshot.length / 1024))}KB`);
                if(conf.log_screenshots)
                    self.logger.putHTML("<img src='data:image/png;base64," + results.screenshot.toString('base64') + "'>");

                results.success = 1;

                // close the current connections, because the browser uses keep-alive
                await self.puppeteer.closeActiveConnections();
                break;
            }
            catch(error) {
                self.logger.put(`Fetch page error: ${error}`);
                // close the current connections, because the browser uses keep-alive
                await self.puppeteer.closeActiveConnections();
                // switch the proxy for the browser tab
                await self.proxy.next();
            }
        }

        return results;
    }
}

This example shows how simple it is to use a different proxy for each tab and to run in multiple threads (1 thread = 1 browser tab).

Method description

await this.puppeteer.launch(opts?)

This method is similar to the .launch method of the puppeteer library: it launches the Chromium browser with the given options opts. The main differences are the integration with A-Parser, proxy support for each tab, and the following additional options:

logConnections?: boolean

Enables logging of all connections (whether or not a proxy is used); the log output is written separately for each thread.

stealth?: boolean

Uses the puppeteer-extra plugin to make Chromium look like a regular Chrome browser.

stealthOpts?: any

Additional options for the puppeteer-extra plugin.

extraPlugins?: array

A list of additional plugins to use, such as those found in puppeteer-extra/packages.

Other options

All other launch options are described in the original puppeteer documentation.
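The A-Parser-specific options can be combined with standard puppeteer launch options in a single object. The sketch below assembles such an object from scraper settings. `LaunchOptionsSketch` and `buildLaunchOptions` are hypothetical names introduced for illustration; only the option names themselves (headless, logConnections, stealth, defaultViewport) come from this page and from puppeteer.

```typescript
// Local illustration of the options discussed in this section.
// This interface is an assumption, not a type exported by a-parser-types.
interface LaunchOptionsSketch {
    headless?: boolean;
    logConnections?: boolean;
    stealth?: boolean;
    stealthOpts?: unknown;
    extraPlugins?: unknown[];
    defaultViewport?: { width: number; height: number };
}

// Hypothetical helper: converts scraper settings (text fields arrive as
// strings, checkboxes as 0/1) into a launch options object.
function buildLaunchOptions(conf: {
    headless: number;
    width: string | number;
    height: string | number;
}): LaunchOptionsSketch {
    return {
        headless: Boolean(conf.headless),
        logConnections: false, // set to true to log every connection per thread
        stealth: true,         // enable the puppeteer-extra masking described above
        defaultViewport: {
            width: Number.parseInt(String(conf.width), 10),
            height: Number.parseInt(String(conf.height), 10),
        },
    };
}

const launchOpts = buildLaunchOptions({ headless: 1, width: '1024', height: 768 });
console.log(launchOpts);
```

Inside a scraper's init() the resulting object would be passed as `await this.puppeteer.launch(launchOpts)`.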

await this.puppeteer.setPageUseProxy(page)

This method links the browser page to the A-Parser thread so that proxies work correctly. It must be called immediately after creating the page:

const page = await browser.newPage();
await this.puppeteer.setPageUseProxy(page);

await this.puppeteer.closeActiveConnections(page?)

This method should be called after request processing is complete, or before changing the proxy for the next attempt.

By default, the Chrome browser keeps connections open to the sites it has visited. This method lets you control resource usage and reduces the load on the proxy.

The page argument is optional. When called without an argument, A-Parser closes the connections for the tab associated with the current thread.
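The call order described here (close connections after every attempt, then switch the proxy only on failure) can be isolated into a small retry loop. The following is a minimal sketch, assuming stand-in interfaces: `ConnectionCloser`, `ProxyRotator`, and `fetchWithRetries` are hypothetical names that mimic `this.puppeteer.closeActiveConnections()` and `this.proxy.next()` from the example scraper, so that the pattern can run outside A-Parser.

```typescript
// Stand-ins for the A-Parser objects used in the example scraper.
// Both interfaces are assumptions made for illustration only.
interface ConnectionCloser { closeActiveConnections(): Promise<void>; }
interface ProxyRotator { next(): Promise<void>; }

// Hypothetical helper: runs `action` up to `attempts` times.
async function fetchWithRetries<T>(
    attempts: number,
    action: () => Promise<T>,
    puppeteer: ConnectionCloser,
    proxy: ProxyRotator,
): Promise<T | undefined> {
    for (let attempt = 1; attempt <= attempts; attempt++) {
        try {
            const result = await action();
            // success: release keep-alive connections before returning
            await puppeteer.closeActiveConnections();
            return result;
        } catch (error) {
            // failure: release connections first, then switch the proxy
            await puppeteer.closeActiveConnections();
            await proxy.next();
        }
    }
    return undefined; // all attempts failed
}

// Demo with stubs: the action fails twice, then succeeds.
let calls = 0;
let closed = 0;
let rotated = 0;
fetchWithRetries(
    3,
    async () => { if (++calls < 3) throw new Error('timeout'); return 'ok'; },
    { closeActiveConnections: async () => { closed++; } },
    { next: async () => { rotated++; } },
).then((result) => console.log(result, closed, rotated)); // prints "ok 3 2"
```

Connections are closed once per attempt, while the proxy is rotated only after a failed attempt, which matches the order used in the parse method of the example scraper.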

await this.puppeteer.logScreenshot()

The method logs a screenshot of the current page.