Chrome Management (puppeteer)

A-Parser + Puppeteer

A-Parser can use the Chrome (Chromium) browser as an engine for downloading and rendering pages, via the popular puppeteer library.

The main advantages of using puppeteer together with A-Parser:

  • support for separate proxies for each browser tab
  • multi-threaded management of browser tabs
  • request interception
  • all A-Parser features for queue management, request generation, and result processing

Using the Chrome browser unlocks the following capabilities:

  • DOM and JavaScript rendering
  • interactive work with website elements:
    • form filling
    • following links
    • drag & drop
    • file downloads
    • mouse emulation
    • and much more; any standard action can be automated
  • simpler bypass of various scraping protections, because Chromium closely resembles the browsers that real users run
  • ability to work in Headless mode, i.e., without the graphical browser interface, which saves resources and allows launching the browser on servers without a graphical environment
Note

Browser operation is much more resource-intensive (CPU, memory) than regular A-Parser threads.

Depending on site complexity, use no more than 1-2 threads (browser tabs) per available CPU core, for example 8 to 16 tabs on an 8-core processor.
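
The rule of thumb above can be written as a small helper. This is only a sketch: the function name `recommendedTabRange` and the `TabRange` type are hypothetical and not part of A-Parser, and the core count is passed in explicitly (in Node.js it could be taken from `os.cpus().length`).

```typescript
// Hypothetical helper, not part of A-Parser: turns the "1-2 tabs per CPU core"
// rule of thumb into a concrete range for the thread count setting.
interface TabRange {
    min: number; // 1 tab per core, for heavier sites
    max: number; // 2 tabs per core, for lighter sites
}

function recommendedTabRange(cpuCores: number): TabRange {
    if (!Number.isInteger(cpuCores) || cpuCores < 1) {
        throw new RangeError('cpuCores must be a positive integer');
    }
    return { min: cpuCores, max: cpuCores * 2 };
}

const range = recommendedTabRange(8);
console.log(`Use between ${range.min} and ${range.max} tabs`); // Use between 8 and 16 tabs
```

The upper bound is a starting point; for heavy, script-laden sites the lower bound is likely the safer choice.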

Use Case

Let's look at the example of the Chrome::ScreenshotMaker2 scraper:

  • the scraper takes screenshots of sites at a specified viewport size and can optionally scale the resulting image
  • can optionally use a proxy
  • creates a separate browser tab for each A-Parser thread
import { BaseParser, PuppeteerTypes } from 'a-parser-types';

let browser: PuppeteerTypes.Browser;
let jimp;

class JS_Chrome_ScreenshotsMaker2 extends BaseParser {
    static defaultConf: typeof BaseParser.defaultConf = {
        version: '0.2.1',
        results: {
            flat: [
                ['screenshot', 'PNG screenshot'],
            ]
        },
        results_format: '$screenshot',
        load_timeout: 30,
        width: 1024,
        height: 768,
        log_screenshots: 0,
        headless: 1,
    };

    static editableConf: typeof BaseParser.editableConf = [
        ['log_screenshots', ['checkbox', 'Log Screenshots']],
        ['width', ['textfield', 'Viewport Width']],
        ['height', ['textfield', 'Viewport Height']],
        ['resize_width', ['textfield', 'Resize Width']],
        ['resize_height', ['textfield', 'Resize Height']],
        ['headless', ['checkbox', 'Chrome Headless']],
    ];

    async init() {
        // initialize the browser
        browser = await this.puppeteer.launch({
            headless: this.conf.headless,
            logConnections: false,
            defaultViewport: {
                width: parseInt(this.conf.width),
                height: parseInt(this.conf.height),
            }
        });

        if (this.conf.resize_width) {
            // connect the jimp module if screenshot resizing is required
            jimp = require('jimp');
        };
    };

    async destroy() {
        // close the browser upon completion of the task
        if (browser)
            await browser.close();
    }

    page: PuppeteerTypes.Page;

    async threadInit() {
        // create a browser page upon thread initialization
        this.page = await browser.newPage();

        // standard puppeteer methods
        await this.page.setCacheEnabled(true);
        await this.page.setDefaultNavigationTimeout(this.conf.timeout * 1000);

        // instruct A-Parser to use the proxy for this page
        await this.puppeteer.setPageUseProxy(this.page);

        this.logger.put(`New page created for thread #${this.threadId}`);
    }

    async parse(set, results) {
        const self = this;
        const { conf, page } = self;

        for (let attempt = 1; attempt <= conf.proxyretries; attempt++) {
            try {
                self.logger.put(`Attempt #${attempt}`);

                // navigate to the page specified in the request
                await page.goto(set.query);
                // hide the scroll bar for the screenshot
                await page.evaluate(() => { document.querySelector('html').style.overflow = 'hidden'; });

                // get the screenshot
                results.screenshot = await page.screenshot();

                if (parseInt(conf.resize_width)) {
                    // resize the image if necessary
                    let image = await jimp.read(results.screenshot);
                    image.resize(parseInt(conf.resize_width), parseInt(conf.resize_height));
                    results.screenshot = await image.getBufferAsync('image/png');
                }

                self.logger.put(`Screenshot(${attempt}): OK, size: ${parseInt("" + (results.screenshot.length / 1024))}KB`);
                if (conf.log_screenshots)
                    self.logger.putHTML("<img src='data:image/png;base64," + results.screenshot.toString('base64') + "'>");

                results.success = 1;

                // close current connections, as the browser uses keep-alive
                await self.puppeteer.closeActiveConnections();
                break;
            }
            catch (error) {
                self.logger.put(`Fetch page error: ${error}`);
                // close current connections, as the browser uses keep-alive
                await self.puppeteer.closeActiveConnections();
                // change proxy for the browser tab
                await self.proxy.next();
            }
        }

        return results;
    }
}

This example shows how simple it is to use a different proxy for each tab and to run in multiple threads (1 thread = 1 browser tab).
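
The retry logic inside parse() can be isolated into a generic helper. The following is a minimal sketch, not code from A-Parser: the interfaces `ConnectionCloser` and `ProxyRotator` and the function `fetchWithProxyRetries` are hypothetical stand-ins for `this.puppeteer`, `this.proxy`, and the loop body. It keeps the same order as the scraper: connections are closed after every attempt, and the proxy is changed only after a failure.

```typescript
// Hypothetical stand-ins for the parts of A-Parser used in parse().
// These are assumptions made for this sketch, not the real typings.
interface ConnectionCloser {
    closeActiveConnections(): Promise<void>;
}
interface ProxyRotator {
    next(): Promise<void>;
}

// Runs `fetchOnce` up to `maxAttempts` times. Keep-alive connections are
// closed after every attempt; the proxy is rotated only when an attempt fails.
async function fetchWithProxyRetries<T>(
    maxAttempts: number,
    puppeteer: ConnectionCloser,
    proxy: ProxyRotator,
    fetchOnce: (attempt: number) => Promise<T>
): Promise<T | undefined> {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            try {
                return await fetchOnce(attempt);
            } finally {
                // runs on success and on failure, before the proxy is changed
                await puppeteer.closeActiveConnections();
            }
        } catch {
            // the attempt failed: switch to the next proxy and try again
            await proxy.next();
        }
    }
    return undefined; // every attempt failed
}
```

Using try/finally for the cleanup avoids repeating the closeActiveConnections() call in both the success and the error branch, at the cost of one extra level of nesting.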

Description of Methods

await this.puppeteer.launch(opts?)

This method is analogous to the .launch method of the puppeteer library; it launches the Chromium browser with the given options opts. The main differences are integration with A-Parser, proxy support for each tab, and the following additional options:

logConnections?: boolean

Enables logging of all connections (regardless of whether a proxy is used), with logs displayed separately by thread

stealth?: boolean

Uses the puppeteer-extra plugin to disguise Chromium as a real Chrome

stealthOpts?: any

Additional options for the puppeteer-extra plugin

extraPlugins?: array

Additional plugins to load, such as those from puppeteer-extra/packages

Other options

All other launch options can be viewed in the original puppeteer documentation
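
As an illustration of how the additional options combine with standard ones, here is a sketch that builds an options object from scraper settings. The type names `AParserLaunchExtras`, `BasicLaunchOptions`, and the function `buildLaunchOptions` are hypothetical; only the option names come from the descriptions above, and the chosen values are arbitrary examples.

```typescript
// Assumed shape of the A-Parser-specific launch options described above.
// The interface name is hypothetical; only the option names come from the text.
interface AParserLaunchExtras {
    logConnections?: boolean;
    stealth?: boolean;
    stealthOpts?: unknown;
    extraPlugins?: unknown[];
}

// A small subset of standard puppeteer launch options used in this sketch.
interface BasicLaunchOptions {
    headless?: boolean;
    defaultViewport?: { width: number; height: number };
}

type SketchLaunchOptions = BasicLaunchOptions & AParserLaunchExtras;

// Builds launch options from scraper settings; textfield values arrive as strings.
function buildLaunchOptions(conf: { headless: number; width: string; height: string }): SketchLaunchOptions {
    return {
        headless: Boolean(conf.headless),
        logConnections: true, // log all connections, shown separately by thread
        stealth: true,        // disguise Chromium as a real Chrome via puppeteer-extra
        defaultViewport: {
            width: parseInt(conf.width, 10),
            height: parseInt(conf.height, 10),
        },
    };
}

const opts = buildLaunchOptions({ headless: 1, width: '1024', height: '768' });
// Inside a scraper this object would be passed on: await this.puppeteer.launch(opts)
console.log(opts);
```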

await this.puppeteer.setPageUseProxy(page)

This method links the browser page to the A-Parser thread for correct proxy operation; it must be called immediately after creating the page:

const page = await browser.newPage();
await this.puppeteer.setPageUseProxy(page);

await this.puppeteer.closeActiveConnections(page?)

This method must be called after the request processing is complete or before changing the proxy for processing the next attempt

By default, the Chrome browser keeps connections to visited sites open (keep-alive); calling this method closes them, which limits resource usage and reduces the load on proxies

The argument page is optional; when called without an argument, A-Parser will close connections for the tab associated with the current thread

await this.puppeteer.logScreenshot()

The method logs a screenshot of the current page