HTML::ArticleExtractor - Article scraper

Scraper overview

HTML::ArticleExtractor collects articles from web pages.

It relies on the @mozilla/readability module, which is integrated into A-Parser, and collects basic data such as the title, the content with and without HTML markup, and the article length.

It is built on the Net::HTTP scraper and therefore supports all of its functionality. It supports multi-page scraping (pagination), has built-in means of bypassing CloudFlare protection, and can use Chrome as the engine for scraping articles from pages whose data is loaded by scripts.

The scraper can scale up to 200 requests per minute, which is 12,000 links per hour.

Collected data

Article title - $title
HTML string of the processed article content - $content
Text content of the article (all HTML removed) - $textContent
Article length in characters - $length
Article description or a short excerpt from the content - $excerpt
Author metadata - $byline
Site name - $siteName
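
The relationship between $content, $textContent, and $length can be illustrated with a minimal Python sketch. It is not A-Parser's or @mozilla/readability's implementation: the `strip_tags` helper and the sample HTML are hypothetical, and the sketch assumes that the length is the character count of the tag-free text.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.parts.append(data)


def strip_tags(html):
    """Hypothetical helper: return the fragment with all tags removed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Hypothetical $content value for a very short article.
content = "<div><h1>Buran</h1><p>The shuttle flew <b>once</b>.</p></div>"

# Analogue of $textContent: the same article with all HTML removed.
text_content = strip_tags(content)

# Analogue of $length, assuming it counts characters of the plain text.
length = len(text_content)
```

Note that removing tags does not insert whitespace between block elements, so the heading and the paragraph text end up adjacent in this sketch.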

Capabilities

Multi-page scraping (pagination)
Supports gzip/deflate/brotli compression
Detection and conversion of site encodings to UTF-8
Bypassing CloudFlare protection
Choice of engine (HTTP or Chrome)
Ability to set article length
Scraping articles with and without HTML tags

Use cases

Collecting ready-made articles from any website

Queries

Queries must be links to the pages from which articles should be scraped, for example:

https://a-parser.com/docs/
https://lenta.ru/articles/2021/09/11/buran/
https://www.thetimes.co.uk/article/the-russian-banker-the-royal-fixers-and-a-500-000-riddle-vvgc55b2s

Output results examples

A-Parser supports flexible formatting of results through the built-in Template Toolkit, which allows results to be output in any form, including structured formats such as CSV or JSON.
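
To give a rough idea of what structured output for one article might look like, here is a short Python sketch. It does not use A-Parser's Template Toolkit syntax. The field selection, the sample record, and the helper names `to_json_line` and `to_csv_line` are assumptions made for illustration only.

```python
import csv
import io
import json

# Assumed subset of the collected fields, in output column order.
FIELDS = ["title", "length", "excerpt", "byline", "siteName"]


def to_json_line(record):
    """Serialize one article record as a single JSON object."""
    return json.dumps({k: record.get(k) for k in FIELDS}, ensure_ascii=False)


def to_csv_line(record):
    """Serialize one article record as a single CSV row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([record.get(k, "") for k in FIELDS])
    return buf.getvalue()


# Hypothetical record; the values are invented for illustration.
record = {
    "title": "Buran, the only flight",
    "length": 27,
    "excerpt": "The shuttle flew once.",
    "byline": "Example Author",
    "siteName": "Example Site",
}

json_line = to_json_line(record)
csv_line = to_csv_line(record)
```

The CSV writer quotes the title because it contains a comma, which is the kind of escaping a hand-written result format would also need to handle.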

Possible settings

Note: see the common settings for all scrapers. This scraper supports all settings of the Net::HTTP scraper.
