Managing Chrome (puppeteer)

A-Parser + Puppeteer

A-Parser can use the Chrome (Chromium) browser as an engine for downloading and rendering pages, controlled through the popular puppeteer library.

The main advantages of working with puppeteer in conjunction with A-Parser:

  • support for separate proxies for each browser tab
  • multithreaded control of browser tabs
  • interception of requests
  • all the capabilities of A-Parser for managing the queue, forming requests, and processing results

Using the Chrome browser opens up the following possibilities:

  • rendering DOM and JavaScript
  • the ability to interactively work with website elements:
    • filling out forms
    • following links
    • drag & drop
    • file uploads
    • mouse emulation
    • and much more: any standard user action can be automated
  • easier bypassing of various anti-scraper protections, as the Chromium browser is very similar to the one used by regular users
  • the ability to work in Headless mode, i.e., without a graphical interface, which saves resources and allows running the browser on servers without a graphical environment
Note

Keep in mind that running a browser is much more resource-intensive (CPU, memory) than regular A-Parser threads.

Depending on the complexity of the site, use no more than 1-2 threads (browser tabs) per available CPU core; for example, 8 to 16 tabs on an 8-core processor.
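
The guideline above can be turned into numbers with a small helper. This is a minimal sketch: `recommendedTabRange` and `TabRange` are hypothetical names used only for illustration and are not part of A-Parser.

```typescript
// Hypothetical helper (not an A-Parser API): applies the
// "1-2 browser tabs per CPU core" guideline to a given core count.
interface TabRange {
    min: number; // 1 tab per core
    max: number; // 2 tabs per core
}

function recommendedTabRange(cpuCores: number): TabRange {
    if (!Number.isInteger(cpuCores) || cpuCores < 1) {
        throw new RangeError('cpuCores must be a positive integer');
    }
    return { min: cpuCores, max: cpuCores * 2 };
}

const range = recommendedTabRange(8);
console.log(`${range.min} to ${range.max} tabs`); // prints "8 to 16 tabs"
```

For heavy, script-laden sites it is probably safer to start near the lower bound and raise the thread count only while CPU and memory usage stay acceptable.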

Use case example

Let's consider the example of the Chrome::ScreenshotMaker2 scraper:

  • the scraper takes screenshots of websites at a specified viewport size and can optionally resize (scale) the resulting image
  • can optionally use a proxy
  • creates a separate browser tab for each A-Parser thread
import { BaseParser, PuppeteerTypes } from 'a-parser-types';

let browser: PuppeteerTypes.Browser;
let jimp;

class JS_Chrome_ScreenshotsMaker2 extends BaseParser {
    static defaultConf: typeof BaseParser.defaultConf = {
        version: '0.2.1',
        results: {
            flat: [
                ['screenshot', 'PNG screenshot'],
            ]
        },
        results_format: '$screenshot',
        load_timeout: 30,
        width: 1024,
        height: 768,
        log_screenshots: 0,
        headless: 1,
    };

    static editableConf: typeof BaseParser.editableConf = [
        ['log_screenshots', ['checkbox', 'Log Screenshots']],
        ['width', ['textfield', 'Viewport Width']],
        ['height', ['textfield', 'Viewport Height']],
        ['resize_width', ['textfield', 'Resize Width']],
        ['resize_height', ['textfield', 'Resize Height']],
        ['headless', ['checkbox', 'Chrome Headless']],
    ];

    async init() {
        // initialize the browser
        browser = await this.puppeteer.launch({
            headless: this.conf.headless,
            logConnections: false,
            defaultViewport: {
                width: parseInt(this.conf.width),
                height: parseInt(this.conf.height),
            }
        });

        if (this.conf.resize_width) {
            // load the jimp module only if the screenshot needs to be resized
            jimp = require('jimp');
        }
    }

    async destroy() {
        // close the browser when the task finishes
        if (browser)
            await browser.close();
    }

    page: PuppeteerTypes.Page;

    async threadInit() {
        // create a browser page when the thread is initialized
        this.page = await browser.newPage();

        // standard puppeteer methods
        await this.page.setCacheEnabled(true);
        await this.page.setDefaultNavigationTimeout(this.conf.timeout * 1000);

        // tell A-Parser to use a proxy for this page
        await this.puppeteer.setPageUseProxy(this.page);

        this.logger.put(`New page created for thread #${this.threadId}`);
    }

    async parse(set, results) {
        const self = this;
        const { conf, page } = self;

        for (let attempt = 1; attempt <= conf.proxyretries; attempt++) {
            try {
                self.logger.put(`Attempt #${attempt}`);

                // navigate to the page specified in the query
                await page.goto(set.query);
                // hide the scrollbar for the screenshot
                await page.evaluate(() => { document.querySelector('html').style.overflow = 'hidden'; });

                // take the screenshot
                results.screenshot = await page.screenshot();

                if (parseInt(conf.resize_width)) {
                    // resize the image if requested
                    let image = await jimp.read(results.screenshot);
                    image.resize(parseInt(conf.resize_width), parseInt(conf.resize_height));
                    results.screenshot = await image.getBufferAsync('image/png');
                }

                self.logger.put(`Screenshot(${attempt}): OK, size: ${Math.floor(results.screenshot.length / 1024)}KB`);
                if (conf.log_screenshots)
                    self.logger.putHTML("<img src='data:image/png;base64," + results.screenshot.toString('base64') + "'>");

                results.success = 1;

                // close the current connections, because the browser uses keep-alive
                await self.puppeteer.closeActiveConnections();
                break;
            }
            catch (error) {
                self.logger.put(`Fetch page error: ${error}`);
                // close the current connections, because the browser uses keep-alive
                await self.puppeteer.closeActiveConnections();
                // switch the proxy for the browser tab
                await self.proxy.next();
            }
        }

        return results;
    }
}

This example demonstrates how easy it is to use a different proxy for each tab and to run tabs in parallel (1 thread = 1 browser tab).
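
The retry loop in parse() follows a pattern that can be factored out: try the action, close the active connections either way, and switch the proxy after a failure. The sketch below isolates that pattern. `withProxyRetries`, `ProxyRotator` and `ConnectionCloser` are hypothetical names introduced for illustration; in a real scraper the two objects would be this.proxy and this.puppeteer.

```typescript
// Minimal stand-ins (hypothetical) for the two objects the retry loop relies on.
interface ProxyRotator {
    next(): Promise<void>;
}

interface ConnectionCloser {
    closeActiveConnections(): Promise<void>;
}

// Runs `action` up to `maxAttempts` times. Connections are closed after every
// attempt; the proxy is switched only after a failed one.
// Resolves to undefined when every attempt fails.
async function withProxyRetries<T>(
    maxAttempts: number,
    proxy: ProxyRotator,
    connections: ConnectionCloser,
    action: (attempt: number) => Promise<T>,
): Promise<T | undefined> {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            const result = await action(attempt);
            await connections.closeActiveConnections();
            return result;
        } catch (error) {
            await connections.closeActiveConnections();
            await proxy.next();
        }
    }
    return undefined;
}
```

Inside parse() this might be called as `withProxyRetries(conf.proxyretries, this.proxy, this.puppeteer, () => page.goto(set.query))`, which keeps the navigation logic separate from the retry bookkeeping.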

Description of methods

await this.puppeteer.launch(opts?)

This method is analogous to the .launch method of the puppeteer library: it launches the Chromium browser with the given options opts. The main differences are integration with A-Parser, support for a separate proxy per tab, and the following additional options:

logConnections?: boolean

Enables logging of all connections, whether or not they go through a proxy. Log output is separated by thread.

stealth?: boolean

Uses the puppeteer-extra plugin to disguise Chromium as regular Chrome

stealthOpts?: any

Additional options for the puppeteer-extra plugin

extraPlugins?: array

Additional plugins to load, for example those from puppeteer-extra/packages

Other options

All other launch options are described in the official puppeteer documentation.
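
A scraper typically builds the launch options from its own settings. The sketch below shows one way to do that. `buildLaunchOptions`, `LaunchOptionsSketch` and `ScreenshotConf` are hypothetical names, the `stealth` and `log_connections` settings are assumed for illustration, and the 1024x768 fallback simply mirrors the defaults of the example scraper.

```typescript
// Local illustration of the options discussed on this page.
// This is NOT a type exported by a-parser-types.
interface LaunchOptionsSketch {
    headless: boolean;
    logConnections: boolean;
    stealth: boolean;
    defaultViewport: { width: number; height: number };
}

// Scraper settings as they might arrive from the task editor:
// checkboxes as 0/1, text fields as strings (assumed shape).
interface ScreenshotConf {
    headless: number | string;
    width: number | string;
    height: number | string;
    stealth?: number | string;          // hypothetical setting
    log_connections?: number | string;  // hypothetical setting
}

// Treats 1 or "1" as enabled, everything else as disabled.
function toFlag(value: number | string | undefined): boolean {
    return Number(value) === 1;
}

// Parses a pixel size, falling back when the value is missing or invalid.
function toPixels(value: number | string, fallback: number): number {
    const parsed = parseInt(String(value), 10);
    return isNaN(parsed) || parsed <= 0 ? fallback : parsed;
}

function buildLaunchOptions(conf: ScreenshotConf): LaunchOptionsSketch {
    return {
        headless: toFlag(conf.headless),
        logConnections: toFlag(conf.log_connections),
        stealth: toFlag(conf.stealth),
        defaultViewport: {
            width: toPixels(conf.width, 1024),
            height: toPixels(conf.height, 768),
        },
    };
}

// Inside init() this could be used as:
//   browser = await this.puppeteer.launch(buildLaunchOptions(this.conf));
```

Validating the viewport size up front avoids passing NaN to the browser when a text field is left empty or contains a typo.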

await this.puppeteer.setPageUseProxy(page)

This method associates a browser page with an A-Parser thread so that proxies work correctly. It must be called immediately after creating the page:

const page = await browser.newPage();
await this.puppeteer.setPageUseProxy(page);

await this.puppeteer.closeActiveConnections(page?)

Call this method after a request has been processed, or before switching the proxy for the next attempt.

By default, the Chrome browser keeps connections open to the sites it has visited. This method lets you control the amount of resources used and also reduces the load on the proxy.

The page argument is optional. When called without an argument, A-Parser closes the connections for the tab associated with the current thread.
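
When a scraper works with a page other than the one bound to the current thread, the page can be passed explicitly, and a finally block guarantees the cleanup on both success and failure. This is a minimal sketch: `visitAndRelease`, `PageLike` and `ConnectionManager` are hypothetical names, with this.puppeteer expected to play the role of the manager.

```typescript
// Minimal stand-ins (hypothetical) for a puppeteer page and this.puppeteer.
interface PageLike {
    goto(url: string): Promise<unknown>;
}

interface ConnectionManager<P> {
    closeActiveConnections(page?: P): Promise<void>;
}

// Navigates to `url` and always closes the connections of `page` afterwards.
// Resolves to true when navigation succeeded, false when it threw.
async function visitAndRelease<P extends PageLike>(
    manager: ConnectionManager<P>,
    page: P,
    url: string,
): Promise<boolean> {
    try {
        await page.goto(url);
        return true;
    } catch (error) {
        return false;
    } finally {
        // Runs on both paths, so keep-alive connections never pile up.
        await manager.closeActiveConnections(page);
    }
}
```

With a real page this might be called as `await visitAndRelease(this.puppeteer, this.page, set.query)`.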

await this.puppeteer.logScreenshot()

This method logs a screenshot of the current page.