Chrome Management (puppeteer)

A-Parser can use the Chrome (Chromium) browser as an engine for downloading and rendering pages, by means of the popular puppeteer library.
The main advantages of using puppeteer together with A-Parser:
- support for separate proxies for each browser tab
- multi-threaded management of browser tabs
- request interception
- all A-Parser features for queue management, request generation, and result processing
Using the Chrome browser unlocks the following capabilities:
- DOM and JavaScript rendering
- ability for interactive work with website elements:
  - form filling
  - following links
  - drag & drop
  - file downloads
  - mouse emulation
  - and much more; any standard action can be automated
- simpler bypass of various scraping protections, as the Chromium browser closely resembles what users employ
- ability to work in Headless mode, i.e., without the graphical browser interface, which saves resources and allows launching the browser on servers without a graphical environment
Keep in mind that running a browser is much more resource-intensive (CPU, memory) than regular A-Parser threads.
Depending on site complexity, it is recommended to use no more than 1-2 threads (browser tabs) per available CPU core, for example, 8 to 16 tabs on an 8-core processor.
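The 1-2 tabs per core guideline can be turned into a small calculation. The sketch below is only an illustration: `recommendedTabRange` and `TabRange` are hypothetical names, not part of A-Parser, and the function simply multiplies the core count by the two bounds from the recommendation above.

```typescript
// Hypothetical helper (not an A-Parser API): derive a tab/thread range
// from the CPU core count using the 1-2 tabs per core guideline.
interface TabRange {
    min: number;
    max: number;
}

function recommendedTabRange(cpuCores: number): TabRange {
    // Reject zero, negative and fractional core counts.
    if (cpuCores < 1 || Math.floor(cpuCores) !== cpuCores) {
        throw new RangeError(`cpuCores must be a positive integer, got ${cpuCores}`);
    }
    return { min: cpuCores, max: cpuCores * 2 };
}

const range = recommendedTabRange(8);
console.log(`8 cores: ${range.min} to ${range.max} tabs`);
```

For heavy, script-laden sites the lower bound is the safer starting point; the upper bound suits lighter pages.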
Use Case
Let's look at the example of the Chrome::ScreenshotMaker2 scraper:
- the scraper takes website screenshots at a specified viewport size and can also reduce (scale) the image
- can optionally use a proxy
- creates a separate browser tab for each A-Parser thread
import { BaseParser, PuppeteerTypes } from 'a-parser-types';

let browser: PuppeteerTypes.Browser;
let jimp;

class JS_Chrome_ScreenshotsMaker2 extends BaseParser {
    static defaultConf: typeof BaseParser.defaultConf = {
        version: '0.2.1',
        results: {
            flat: [
                ['screenshot', 'PNG screenshot'],
            ]
        },
        results_format: '$screenshot',
        load_timeout: 30,
        width: 1024,
        height: 768,
        log_screenshots: 0,
        headless: 1,
    };

    static editableConf: typeof BaseParser.editableConf = [
        ['log_screenshots', ['checkbox', 'Log Screenshots']],
        ['width', ['textfield', 'Viewport Width']],
        ['height', ['textfield', 'Viewport Height']],
        ['resize_width', ['textfield', 'Resize Width']],
        ['resize_height', ['textfield', 'Resize Height']],
        ['headless', ['checkbox', 'Chrome Headless']],
    ];

    async init() {
        // initialize the browser
        browser = await this.puppeteer.launch({
            headless: this.conf.headless,
            logConnections: false,
            defaultViewport: {
                width: parseInt(this.conf.width),
                height: parseInt(this.conf.height),
            }
        });

        if (this.conf.resize_width) {
            // connect the jimp module if screenshot resizing is required
            jimp = require('jimp');
        };
    };

    async destroy() {
        // close the browser upon completion of the task
        if (browser)
            await browser.close();
    }

    page: PuppeteerTypes.Page;

    async threadInit() {
        // create a browser page upon thread initialization
        this.page = await browser.newPage();
        // standard puppeteer methods
        await this.page.setCacheEnabled(true);
        await this.page.setDefaultNavigationTimeout(this.conf.timeout * 1000);
        // instruct A-Parser to use the proxy for this page
        await this.puppeteer.setPageUseProxy(this.page);
        this.logger.put(`New page created for thread #${this.threadId}`);
    }

    async parse(set, results) {
        const self = this;
        const { conf, page } = self;

        for (let attempt = 1; attempt <= conf.proxyretries; attempt++) {
            try {
                self.logger.put(`Attempt #${attempt}`);
                // navigate to the page specified in the request
                await page.goto(set.query);
                // hide the scroll bar for the screenshot
                await page.evaluate(() => { document.querySelector('html').style.overflow = 'hidden'; });
                // get the screenshot
                results.screenshot = await page.screenshot();

                if (parseInt(conf.resize_width)) {
                    // resize the image if necessary
                    let image = await jimp.read(results.screenshot);
                    image.resize(parseInt(conf.resize_width), parseInt(conf.resize_height));
                    results.screenshot = await image.getBufferAsync('image/png');
                }

                self.logger.put(`Screenshot(${attempt}): OK, size: ${parseInt("" + (results.screenshot.length / 1024))}KB`);
                if (conf.log_screenshots)
                    self.logger.putHTML("<img src='data:image/png;base64," + results.screenshot.toString('base64') + "'>");

                results.success = 1;
                // close current connections, as the browser uses keep-alive
                await self.puppeteer.closeActiveConnections();
                break;
            }
            catch (error) {
                self.logger.put(`Fetch page error: ${error}`);
                // close current connections, as the browser uses keep-alive
                await self.puppeteer.closeActiveConnections();
                // change proxy for the browser tab
                await self.proxy.next();
            }
        }

        return results;
    }
}
This example demonstrates how simple it is to use a different proxy for each tab, as well as multi-threaded operation (1 thread = 1 browser tab).
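The retry loop in `parse()` follows a general pattern: try the request, and on failure run a recovery step (closing connections, switching the proxy) before the next attempt. The sketch below isolates that pattern so it can be read and tested on its own. `retryWithRotation` is a hypothetical helper, not part of the A-Parser API, and the "proxies" in the demo are plain strings standing in for real proxy rotation.

```typescript
// Hypothetical helper (not an A-Parser API): run `attempt` up to
// `maxAttempts` times, calling `onFailure` after each failed try,
// e.g. to close connections and rotate the proxy.
async function retryWithRotation<T>(
    maxAttempts: number,
    attempt: (n: number) => Promise<T>,
    onFailure: (error: unknown, n: number) => Promise<void>,
): Promise<T | undefined> {
    for (let n = 1; n <= maxAttempts; n++) {
        try {
            return await attempt(n);
        } catch (error) {
            await onFailure(error, n);
        }
    }
    return undefined; // every attempt failed
}

// Mock usage: the first two "proxies" fail, the third succeeds.
const proxies = ['proxy-a', 'proxy-b', 'proxy-c'];
let current = 0;

retryWithRotation(
    proxies.length,
    async () => {
        if (current < 2) throw new Error(`${proxies[current]} failed`);
        return `fetched via ${proxies[current]}`;
    },
    async () => { current++; },
).then((result) => console.log(result));
```

In the scraper above, the `attempt` callback would correspond to the `page.goto` and `page.screenshot` calls, and `onFailure` to `closeActiveConnections()` followed by `proxy.next()`.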
Description of Methods
await this.puppeteer.launch(opts?)
This method is analogous to the .launch method of the puppeteer library; it launches the Chromium browser with the given options opts. The main differences are integration with A-Parser, proxy support for each tab, and the following additional options:
logConnections?: boolean
Enables logging of all connections (regardless of whether a proxy is used), with logs displayed separately by thread
stealth?: boolean
Uses the puppeteer-extra plugin to make Chromium look like a regular Chrome browser
stealthOpts?: any
Additional options for the puppeteer-extra plugin
extraPlugins?: array
Enables additional plugins, such as those in puppeteer-extra/packages

Other options
All other launch options are described in the original puppeteer documentation.
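As a rough illustration of how these options combine, the sketch below builds a launch options object from scraper settings. The interface names and `buildLaunchOptions` are hypothetical: they only mirror the options listed above plus the two standard puppeteer options used in the example scraper, and are not A-Parser's own type definitions. Converting the `headless` flag to a boolean and enabling `stealth` are choices made for this sketch.

```typescript
// Assumption: mirrors only the A-Parser-specific options documented above.
interface AParserLaunchExtras {
    logConnections?: boolean;
    stealth?: boolean;
    stealthOpts?: unknown;
    extraPlugins?: unknown[];
}

// Standard puppeteer options used by the example scraper.
interface LaunchOptionsSketch extends AParserLaunchExtras {
    headless?: boolean;
    defaultViewport?: { width: number; height: number };
}

// Hypothetical helper: settings may arrive as numbers or strings,
// so they are normalized before being handed to launch().
function buildLaunchOptions(conf: {
    headless: number | string;
    width: number | string;
    height: number | string;
}): LaunchOptionsSketch {
    return {
        headless: Number(conf.headless) === 1,
        logConnections: false,
        stealth: true,
        defaultViewport: {
            width: parseInt(String(conf.width), 10),
            height: parseInt(String(conf.height), 10),
        },
    };
}

console.log(buildLaunchOptions({ headless: 1, width: '1024', height: 768 }));
```

Inside a scraper, the resulting object would be passed to `this.puppeteer.launch(...)` in `init()`.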
await this.puppeteer.setPageUseProxy(page)
This method links the browser page to the A-Parser thread for correct proxy operation; it must be called immediately after creating the page:
const page = await browser.newPage();
await this.puppeteer.setPageUseProxy(page);
await this.puppeteer.closeActiveConnections(page?)
This method must be called after request processing is complete, or before changing the proxy for the next attempt.
By default, the Chrome browser keeps connections open to the sites it has contacted; this method allows control over the number of resources used and reduces the load on proxies.
The page argument is optional; when called without an argument, A-Parser closes connections for the tab associated with the current thread.
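Since the connections need to be closed on both the success and the failure path, one way to avoid repeating the call is a `try`/`finally` wrapper. The sketch below is an assumption-laden illustration: `withConnectionCleanup` and the `ConnectionCloser` interface are hypothetical, and the mock object stands in for `this.puppeteer` so the snippet runs outside A-Parser.

```typescript
// Minimal structural type for the single method used here; inside a
// scraper the real object would be this.puppeteer.
interface ConnectionCloser {
    closeActiveConnections(): Promise<void>;
}

// Hypothetical helper: run one attempt and always release keep-alive
// connections afterwards, whether the attempt succeeded or threw.
async function withConnectionCleanup<T>(
    closer: ConnectionCloser,
    attempt: () => Promise<T>,
): Promise<T> {
    try {
        return await attempt();
    } finally {
        await closer.closeActiveConnections();
    }
}

// Mock usage: record each cleanup call instead of touching a browser.
const calls: string[] = [];
const mockCloser: ConnectionCloser = {
    async closeActiveConnections() { calls.push('closed'); },
};

withConnectionCleanup(mockCloser, async () => 'page loaded')
    .then((value) => console.log(value, calls));
```

An error thrown by the attempt still propagates to the caller after cleanup, so a surrounding retry loop can switch the proxy as before.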
await this.puppeteer.logScreenshot()
This method writes a screenshot of the current page to the log.