Managing Chrome (puppeteer)

A-Parser + Puppeteer

A-Parser allows the use of the Chrome (Chromium) browser as an engine for downloading and rendering pages, using the popular puppeteer library.

The main advantages of working with puppeteer in conjunction with A-Parser:

- support for separate proxies for each browser tab
- multithreaded control of browser tabs
- interception of requests
- all the capabilities of A-Parser for managing the queue, forming requests, and processing results

Using the `Chrome` browser opens up the following possibilities:

- rendering DOM and JavaScript
- the ability to work interactively with website elements:
  - filling out forms
  - following links
  - drag & drop
  - file uploads
  - mouse emulation
  - and much more; any standard action can be automated
- easier bypassing of various anti-scraper protections, as the Chromium browser is very similar to the one used by regular users
- the ability to work in Headless mode, i.e., without a graphical interface, which saves resources and allows running the browser on servers without a graphical environment

note

Running a browser is much more resource-intensive (CPU, memory) than regular A-Parser threads.

Depending on the complexity of the site, use no more than 1-2 threads (browser tabs) per available CPU core. For example, on an 8-core processor that means 8 to 16 tabs.
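The sizing rule from the note can be written down as a small helper. This is only an illustrative sketch: `recommendedTabRange` and `TabRange` are hypothetical names and not part of the A-Parser API, and the core count is passed in explicitly rather than detected.

```typescript
// Hypothetical helper (not part of A-Parser): applies the "1-2 tabs per CPU core" rule.
interface TabRange {
    min: number; // 1 tab per core
    max: number; // 2 tabs per core
}

function recommendedTabRange(cpuCores: number): TabRange {
    if (!Number.isInteger(cpuCores) || cpuCores < 1) {
        throw new RangeError(`cpuCores must be a positive integer, got ${cpuCores}`);
    }
    return { min: cpuCores, max: cpuCores * 2 };
}

// The 8-core processor from the note: between 8 and 16 tabs.
const tabRange = recommendedTabRange(8);
```

Heavier sites would sit closer to `min`, lighter ones closer to `max`; the actual thread count is still set in the task settings.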

Use case example

Let's consider the example of the Chrome::ScreenshotMaker2 scraper:

- takes screenshots of websites at a specified viewport size and can also resize (scale) the resulting image
- can optionally use a proxy
- creates a separate browser tab for each A-Parser thread

import { BaseParser, PuppeteerTypes } from 'a-parser-types';

let browser: PuppeteerTypes.Browser;
let jimp;

class JS_Chrome_ScreenshotsMaker2 extends BaseParser {
    static defaultConf: typeof BaseParser.defaultConf = {
        version: '0.2.1',
        results: {
            flat: [
                ['screenshot', 'PNG screenshot'],
            ]
        },
        results_format: '$screenshot',
        load_timeout: 30,
        width: 1024,
        height: 768,
        log_screenshots: 0,
        headless: 1,
    };
    
    static editableConf: typeof BaseParser.editableConf = [
        ['log_screenshots', ['checkbox', 'Log Screenshots']],
        ['width', ['textfield', 'Viewport Width']],
        ['height', ['textfield', 'Viewport Height']],
        ['resize_width', ['textfield', 'Resize Width']],
        ['resize_height', ['textfield', 'Resize Height']],
        ['headless', ['checkbox', 'Chrome Headless']],
    ];

    async init() {
        // initialize the browser
        browser = await this.puppeteer.launch({
            headless: this.conf.headless,
            logConnections: false,
            defaultViewport: {
                width: parseInt(this.conf.width),
                height: parseInt(this.conf.height),
            }
        });
        
        if (this.conf.resize_width) {
            // load the jimp module only if the screenshot needs to be resized
            jimp = require('jimp');
        }
    }

    async destroy() {
        // close the browser when the task finishes
        if (browser)
            await browser.close();
    }

    page: PuppeteerTypes.Page;

    async threadInit() {
        // create a browser page when the thread is initialized
        this.page = await browser.newPage();

        // standard puppeteer methods
        await this.page.setCacheEnabled(true);
        await this.page.setDefaultNavigationTimeout(this.conf.timeout * 1000);

        // tell A-Parser to use a proxy for this page
        await this.puppeteer.setPageUseProxy(this.page);


        this.logger.put(`New page created for thread #${this.threadId}`);
    }

    async parse(set, results) {
        const self = this;
        const { conf, page } = self;
        
        for (let attempt = 1; attempt <= conf.proxyretries; attempt++) {
            try {
                self.logger.put(`Attempt #${attempt}`);

                // navigate to the page specified in the query
                await page.goto(set.query);
                // hide the scrollbar so it does not appear in the screenshot
                await page.evaluate(() => { document.querySelector('html').style.overflow = 'hidden'; });

                // take the screenshot
                results.screenshot = await page.screenshot();
                
                if (parseInt(conf.resize_width)) {
                    // resize the image if needed
                    let image = await jimp.read(results.screenshot);
                    image.resize(parseInt(conf.resize_width), parseInt(conf.resize_height));
                    results.screenshot = await image.getBufferAsync('image/png');
                }

                self.logger.put(`Screenshot(${attempt}): OK, size: ${parseInt("" + (results.screenshot.length / 1024))}KB`);
                if (conf.log_screenshots)
                    self.logger.putHTML("<img src='data:image/png;base64," + results.screenshot.toString('base64') + "'>");

                results.success = 1;

                // close the current connections, since the browser uses keep-alive
                await self.puppeteer.closeActiveConnections();
                break;
            }
            catch (error) {
                self.logger.put(`Fetch page error: ${error}`);
                // close the current connections, since the browser uses keep-alive
                await self.puppeteer.closeActiveConnections();
                // switch to another proxy for this browser tab
                await self.proxy.next();
            }
        }
        
        return results;
    }
}

This example demonstrates how easily each tab can use its own proxy, and how multithreading maps onto the browser (1 thread = 1 browser tab).

Description of methods

`await this.puppeteer.launch(opts?)`

This method is similar to the .launch method of the puppeteer library: it launches the Chromium browser with the given options opts. The main differences are the integration with A-Parser, support for a separate proxy for each tab, and the following additional options:

`logConnections?: boolean`

Enables logging of all connections (whether or not a proxy is used). Log output is separated by thread.

`stealth?: boolean`

Uses the puppeteer-extra plugin to disguise Chromium as a real Chrome

`stealthOpts?: any`

Additional options for the puppeteer-extra plugin

`extraPlugins?: array`

Additional plugins to use, such as those from puppeteer-extra/packages

Other options

All other launch options can be viewed in the original puppeteer documentation
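As a rough illustration of how these options fit together, the sketch below builds a launch options object from scraper settings. The interfaces `ExtraLaunchOptions` and `ScreenshotConf` and the function `buildLaunchOptions` are hypothetical, local stand-ins written for this example; in a real scraper the types come from a-parser-types and puppeteer.

```typescript
// Local mirror of the options described above (assumption: simplified for illustration).
interface ExtraLaunchOptions {
    headless?: boolean;
    logConnections?: boolean;
    stealth?: boolean;
    stealthOpts?: unknown;
    extraPlugins?: unknown[];
    defaultViewport?: { width: number; height: number };
}

// Hypothetical shape of the settings used by the screenshot example.
interface ScreenshotConf {
    headless: number | boolean;
    width: number | string;
    height: number | string;
}

function buildLaunchOptions(conf: ScreenshotConf): ExtraLaunchOptions {
    return {
        // checkbox settings arrive as 0/1, so convert them to a boolean
        headless: Boolean(conf.headless),
        logConnections: false,
        stealth: true,
        defaultViewport: {
            // text fields arrive as strings, so parse them explicitly
            width: parseInt(String(conf.width), 10),
            height: parseInt(String(conf.height), 10),
        },
    };
}

// Inside init() the result would be passed on as:
//   browser = await this.puppeteer.launch(launchOpts);
const launchOpts = buildLaunchOptions({ headless: 1, width: '1024', height: 768 });
```

Keeping the conversion in one place avoids passing raw setting values (strings, 0/1 flags) straight into the browser launch call.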

`await this.puppeteer.setPageUseProxy(page)`

This method associates a browser page with an A-Parser thread so that proxies work correctly. It must be called immediately after creating the page:

const page = await browser.newPage();
await this.puppeteer.setPageUseProxy(page);

`await this.puppeteer.closeActiveConnections(page?)`

This method should be called after a request has been processed, or before changing the proxy for the next attempt.

By default, the Chrome browser keeps connections open to the sites it has visited. This method lets you control the amount of resources used and also reduces the load on the proxy.

The page argument is optional. When called without an argument, A-Parser closes the connections for the tab associated with the current thread.
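The call order described here (close connections after every attempt, then switch the proxy only if the attempt failed) can be sketched as a small retry helper. `ConnectionCloser`, `ProxyRotator` and `fetchWithRetries` are hypothetical names introduced for this example; in a scraper their roles are played by `this.puppeteer` and `this.proxy`.

```typescript
// Minimal stand-ins for the objects A-Parser provides (assumptions for illustration).
interface ConnectionCloser {
    closeActiveConnections(): Promise<void>;
}
interface ProxyRotator {
    next(): Promise<void>;
}

async function fetchWithRetries<T>(
    attempts: number,
    puppeteer: ConnectionCloser,
    proxy: ProxyRotator,
    work: () => Promise<T>,
): Promise<T | undefined> {
    for (let attempt = 1; attempt <= attempts; attempt++) {
        let failed = false;
        try {
            // success: the finally block still runs before the value is returned
            return await work();
        } catch (error) {
            failed = true;
        } finally {
            // always release keep-alive connections first...
            await puppeteer.closeActiveConnections();
            // ...and only then switch the proxy for the next attempt
            if (failed) await proxy.next();
        }
    }
    return undefined; // all attempts failed
}
```

Using `finally` keeps the cleanup in one place, so connections are closed on both the success and the failure path, as in the example scraper above.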

`await this.puppeteer.logScreenshot()`

The method logs a screenshot of the current page
