Managing Chrome (puppeteer)
A-Parser can use the Chrome (Chromium) browser as an engine for downloading and rendering pages, through the popular puppeteer library.
The main advantages of working with puppeteer in conjunction with A-Parser:
- support for separate proxies for each browser tab
- multithreaded control of browser tabs
- interception of requests
- all the capabilities of A-Parser for managing the queue, forming requests, and processing results
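As a rough illustration of request interception, the sketch below skips heavy resource types using the standard puppeteer calls `setRequestInterception`, `on('request', ...)`, `resourceType()`, `abort()` and `continue()`. The `InterceptedRequest` and `InterceptablePage` interfaces, the `blockHeavyResources` helper and the list of blocked types are hypothetical names and values introduced only to keep the example self-contained. It assumes the page behaves like a regular puppeteer page and is not taken from A-Parser itself.

```typescript
// Minimal structural types covering only the puppeteer methods used below.
// Hypothetical names: a real puppeteer page and request are expected to fit them.
interface InterceptedRequest {
    resourceType(): string;
    abort(): Promise<void>;
    continue(): Promise<void>;
}

interface InterceptablePage {
    setRequestInterception(enabled: boolean): Promise<void>;
    on(event: 'request', handler: (request: InterceptedRequest) => void): unknown;
}

// Resource types to skip; this list is an illustrative assumption.
const BLOCKED_TYPES = new Set(['image', 'media', 'font']);

async function blockHeavyResources(page: InterceptablePage): Promise<void> {
    // Interception must be enabled before requests can be aborted or continued.
    await page.setRequestInterception(true);
    page.on('request', (request) => {
        if (BLOCKED_TYPES.has(request.resourceType())) {
            void request.abort();
        } else {
            void request.continue();
        }
    });
}
```

In a scraper, a helper like this would typically be called once per tab, right after the page is created.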
Using the Chrome browser opens up the following possibilities:
- rendering DOM and JavaScript
- the ability to work interactively with website elements:
  - filling out forms
  - following links
  - drag & drop
  - file uploads
  - mouse emulation
  - and much more: any standard user action can be automated
- easier bypassing of various anti-scraper protections, as the Chromium browser is very similar to the one used by regular users
- the ability to work in Headless mode, i.e., without a graphical interface, which saves resources and allows running the browser on servers without a graphical environment
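A minimal sketch of the interactive work mentioned in the list, filling out a form and following the resulting navigation, using the standard puppeteer page methods `type`, `click` and `waitForNavigation`. The `FormPage` interface, the `submitSearchForm` helper and both CSS selectors are hypothetical placeholders, and would need to be adapted to the target site.

```typescript
// Only the puppeteer page methods used below (hypothetical interface name).
interface FormPage {
    type(selector: string, text: string): Promise<void>;
    click(selector: string): Promise<void>;
    waitForNavigation(): Promise<unknown>;
}

async function submitSearchForm(page: FormPage, query: string): Promise<void> {
    // Type the query into the text field (the selector is a placeholder).
    await page.type('input[name="q"]', query);
    // Start waiting for the navigation before clicking,
    // so the navigation triggered by the click is not missed.
    await Promise.all([
        page.waitForNavigation(),
        page.click('button[type="submit"]'),
    ]);
}
```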
Keep in mind that running a browser is much more resource-intensive (CPU, memory) than regular A-Parser threads.
Depending on the complexity of the site, use no more than 1-2 threads (browser tabs) per available CPU core; for example, 8 to 16 tabs on an 8-core processor.
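The rule of thumb above can be written as a small calculation. This is only a sketch: `TabRange` and `recommendedTabRange` are hypothetical names, and the core count is passed in explicitly (in Node.js it could be obtained from `os.cpus().length`).

```typescript
interface TabRange {
    min: number; // 1 tab per core
    max: number; // 2 tabs per core
}

// Applies the "1-2 browser tabs per CPU core" guideline.
function recommendedTabRange(cpuCores: number): TabRange {
    if (!Number.isInteger(cpuCores) || cpuCores < 1) {
        throw new RangeError('cpuCores must be a positive integer');
    }
    return { min: cpuCores, max: cpuCores * 2 };
}

// For an 8-core processor this gives the 8 to 16 tabs mentioned above.
const range = recommendedTabRange(8);
console.log(`${range.min}-${range.max} tabs`);
```

For heavy sites it is safer to stay near the lower bound and raise the thread count only if CPU and memory allow.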
Use case example
Let's consider the example of the Chrome::ScreenshotMaker2 scraper:
- the scraper takes screenshots of websites of a specified size, and can also reduce (scale) the image
- can optionally use a proxy
- creates a separate browser tab for each A-Parser thread
import { BaseParser, PuppeteerTypes } from 'a-parser-types';

let browser: PuppeteerTypes.Browser;
let jimp;

class JS_Chrome_ScreenshotsMaker2 extends BaseParser {
    static defaultConf: typeof BaseParser.defaultConf = {
        version: '0.2.1',
        results: {
            flat: [
                ['screenshot', 'PNG screenshot'],
            ]
        },
        results_format: '$screenshot',
        load_timeout: 30,
        width: 1024,
        height: 768,
        log_screenshots: 0,
        headless: 1,
    };

    static editableConf: typeof BaseParser.editableConf = [
        ['log_screenshots', ['checkbox', 'Log Screenshots']],
        ['width', ['textfield', 'Viewport Width']],
        ['height', ['textfield', 'Viewport Height']],
        ['resize_width', ['textfield', 'Resize Width']],
        ['resize_height', ['textfield', 'Resize Height']],
        ['headless', ['checkbox', 'Chrome Headless']],
    ];

    async init() {
        // initialize the browser
        browser = await this.puppeteer.launch({
            headless: this.conf.headless,
            logConnections: false,
            defaultViewport: {
                width: parseInt(this.conf.width),
                height: parseInt(this.conf.height),
            }
        });

        if (this.conf.resize_width) {
            // load the jimp module if the screenshot needs to be resized
            jimp = require('jimp');
        }
    }

    async destroy() {
        // close the browser when the task finishes
        if (browser)
            await browser.close();
    }

    page: PuppeteerTypes.Page;

    async threadInit() {
        // create a browser page when the thread is initialized
        this.page = await browser.newPage();
        // standard puppeteer methods
        await this.page.setCacheEnabled(true);
        await this.page.setDefaultNavigationTimeout(this.conf.timeout * 1000);
        // tell A-Parser to use a proxy for this page
        await this.puppeteer.setPageUseProxy(this.page);
        this.logger.put(`New page created for thread #${this.threadId}`);
    }

    async parse(set, results) {
        const self = this;
        const { conf, page } = self;

        for (let attempt = 1; attempt <= conf.proxyretries; attempt++) {
            try {
                self.logger.put(`Attempt #${attempt}`);
                // navigate to the page specified in the query
                await page.goto(set.query);
                // hide the scrollbar for the screenshot
                await page.evaluate(() => { document.querySelector('html').style.overflow = 'hidden'; });
                // take the screenshot
                results.screenshot = await page.screenshot();

                if (parseInt(conf.resize_width)) {
                    // resize the image if needed
                    let image = await jimp.read(results.screenshot);
                    image.resize(parseInt(conf.resize_width), parseInt(conf.resize_height));
                    results.screenshot = await image.getBufferAsync('image/png');
                }

                self.logger.put(`Screenshot(${attempt}): OK, size: ${parseInt("" + (results.screenshot.length / 1024))}KB`);
                if (conf.log_screenshots)
                    self.logger.putHTML("<img src='data:image/png;base64," + results.screenshot.toString('base64') + "'>");

                results.success = 1;
                // close the current connections, because the browser uses keep-alive
                await self.puppeteer.closeActiveConnections();
                break;
            }
            catch (error) {
                self.logger.put(`Fetch page error: ${error}`);
                // close the current connections, because the browser uses keep-alive
                await self.puppeteer.closeActiveConnections();
                // switch the proxy for the browser tab
                await self.proxy.next();
            }
        }

        return results;
    }
}
This example demonstrates how easy it is to use a different proxy for each tab, as well as multithreading (1 thread = 1 browser tab).
Description of methods
await this.puppeteer.launch(opts?)
This method is similar to the .launch method of the puppeteer library: it launches the Chromium browser with the given options opts. The main differences are the integration with A-Parser and support for a proxy per tab, as well as these additional options:
logConnections?: boolean
Enables logging of all connections (with or without a proxy); the log output is separated by thread
stealth?: boolean
Uses the puppeteer-extra plugin to disguise Chromium as a real Chrome
stealthOpts?: any
Additional options for the puppeteer-extra plugin
extraPlugins?: array
Use of additional plugins, such as puppeteer-extra/packages
Other options
All other launch options can be viewed in the original puppeteer documentation
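A hedged sketch of how the options described above might be assembled before being passed to `this.puppeteer.launch`. The option names `headless`, `logConnections`, `stealth` and `defaultViewport` come from this page. The `ChromeLaunchOptions` and `ScreenshotConf` types and the `buildLaunchOptions` helper are hypothetical, and converting the 0/1 checkbox value to a boolean is an assumption rather than a documented requirement.

```typescript
// Hypothetical shape of the scraper settings used in the example above.
interface ScreenshotConf {
    headless: number | boolean;
    width: number | string;
    height: number | string;
}

// Hypothetical subset of the options accepted by this.puppeteer.launch.
interface ChromeLaunchOptions {
    headless: boolean;
    logConnections: boolean;
    stealth: boolean;
    defaultViewport: { width: number; height: number };
}

function buildLaunchOptions(conf: ScreenshotConf): ChromeLaunchOptions {
    return {
        headless: Boolean(conf.headless), // checkbox values arrive as 0/1
        logConnections: true,             // log every connection, per thread
        stealth: true,                    // disguise Chromium as a real Chrome
        defaultViewport: {
            // text fields may hold strings, so parse them explicitly
            width: parseInt(String(conf.width), 10),
            height: parseInt(String(conf.height), 10),
        },
    };
}

// Inside init() this could then be used as:
//   browser = await this.puppeteer.launch(buildLaunchOptions(this.conf));
```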
await this.puppeteer.setPageUseProxy(page)
This method associates a browser page with an A-Parser thread so that proxies work correctly. It must be called immediately after creating the page:
const page = await browser.newPage();
await this.puppeteer.setPageUseProxy(page);
await this.puppeteer.closeActiveConnections(page?)
This method should be called after a request has been processed, or before changing the proxy for the next attempt.
By default, the Chrome browser keeps connections open to the sites it has visited. This method lets you control the amount of resources used and also reduces the load on the proxy.
The page argument is optional. When called without an argument, A-Parser closes the connections for the tab associated with the current thread.
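The calling pattern described here, closing connections after every attempt and switching the proxy only after a failure, can be isolated into a small helper. This is a sketch, not part of A-Parser: `RetryHooks` and `withProxyRetries` are hypothetical names, and the two hooks stand in for `this.puppeteer.closeActiveConnections()` and `this.proxy.next()` from the scraper example.

```typescript
interface RetryHooks {
    closeActiveConnections(): Promise<void>; // e.g. () => this.puppeteer.closeActiveConnections()
    nextProxy(): Promise<void>;              // e.g. () => this.proxy.next()
}

// Runs `action` up to `maxAttempts` times; returns undefined if every attempt fails.
async function withProxyRetries<T>(
    maxAttempts: number,
    action: (attempt: number) => Promise<T>,
    hooks: RetryHooks,
): Promise<T | undefined> {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            const result = await action(attempt);
            // Success: release keep-alive connections before returning.
            await hooks.closeActiveConnections();
            return result;
        } catch (error) {
            // Failure: release connections first, then switch to another proxy.
            await hooks.closeActiveConnections();
            await hooks.nextProxy();
        }
    }
    return undefined;
}
```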
await this.puppeteer.logScreenshot()
This method writes a screenshot of the current page to the log.