HTML::ArticleExtractor - Article scraper

Scraper overview

HTML::ArticleExtractor collects articles from web pages.

It uses the @mozilla/readability module, which is integrated into A-Parser, and collects basic data such as the title, the content with and without HTML markup, and the article length.

It is based on the Net::HTTP scraper and therefore supports its functionality. It supports multi-page scraping (pagination), has built-in means to bypass CloudFlare protection, and can use Chrome as the engine for scraping articles from pages whose data is loaded by scripts.

It can scale up to 200 requests per minute, which is 12,000 links per hour.
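
The hourly figure follows directly from the per-minute rate. A minimal sketch of the arithmetic, where the function name `links_per_hour` is purely illustrative and one request is assumed to correspond to one link:

```python
def links_per_hour(requests_per_minute: int) -> int:
    """Convert a per-minute request rate into an hourly link count.

    Assumes one request fetches exactly one link (no pagination, no retries).
    """
    return requests_per_minute * 60


print(links_per_hour(200))  # 200 * 60 = 12000
```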

Collected data

  • Article title - $title
  • HTML string of the processed article content - $content
  • Text content of the article (all HTML removed) - $textContent
  • Article length in characters - $length
  • Article description or a short excerpt from the content - $excerpt
  • Author metadata - $byline
  • Site name - $siteName
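
To make the shape of one result concrete, the fields above can be modelled as a simple record. This is an illustrative Python sketch, not A-Parser code: the class name `Article` and the sample values are assumptions, and the field names mirror the variables listed above without the `$` prefix. It also assumes that `$length` is counted on the plain-text content.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Article:
    """Hypothetical container mirroring the variables listed above."""
    title: str                      # article title ($title)
    content: str                    # processed article HTML ($content)
    textContent: str                # article text with all HTML removed ($textContent)
    length: int                     # article length in characters ($length)
    excerpt: Optional[str] = None   # description or short excerpt ($excerpt)
    byline: Optional[str] = None    # author metadata ($byline)
    siteName: Optional[str] = None  # site name ($siteName)


# Made-up sample values, for illustration only.
sample = Article(
    title="Example title",
    content="<p>Example body.</p>",
    textContent="Example body.",
    length=len("Example body."),  # assumption: length is measured on the plain text
)
print(sample.title, sample.length)
```

Optional fields default to `None` here because metadata such as the author or site name may be absent from a page.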

Capabilities

  • Multi-page scraping (pagination)
  • Supports gzip/deflate/brotli compression
  • Detection and conversion of site encodings to UTF-8
  • Bypassing CloudFlare protection
  • Choice of engine (HTTP or Chrome)
  • Ability to set article length
  • Scraping articles with and without HTML tags
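
Two of the capabilities above, decompression and conversion to UTF-8, can be illustrated with Python's standard library. This is only a sketch of the general idea, not how A-Parser implements it: the helper `decode_body` and its parameters are hypothetical, the charset is assumed to be already known (real detection is more involved), and brotli is left out because it needs a third-party package.

```python
import gzip
import zlib


def decode_body(raw: bytes, content_encoding: str = "", charset: str = "utf-8") -> str:
    """Decompress an HTTP body and return it as a Python string.

    content_encoding: value of the Content-Encoding header ("gzip", "deflate", or "").
    charset: the page's declared encoding (assumed known; no detection here).
    """
    if content_encoding == "gzip":
        raw = gzip.decompress(raw)
    elif content_encoding == "deflate":
        # zlib-wrapped deflate; some servers send raw deflate, which this sketch ignores.
        raw = zlib.decompress(raw)
    # Undecodable bytes are replaced rather than raising an error.
    return raw.decode(charset, errors="replace")


# A windows-1251 page compressed with gzip, then re-encoded as UTF-8 bytes.
page = "Привет".encode("windows-1251")
text = decode_body(gzip.compress(page), "gzip", "windows-1251")
utf8_bytes = text.encode("utf-8")
print(text)
```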

Use cases

  • Collecting ready-made articles from any website

Queries

Queries must be links to the pages from which articles should be scraped, for example:

https://a-parser.com/docs/
https://lenta.ru/articles/2021/09/11/buran/
https://www.thetimes.co.uk/article/the-russian-banker-the-royal-fixers-and-a-500-000-riddle-vvgc55b2s
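
Before submitting a large list, it can help to check that every query is an absolute HTTP(S) link. A minimal Python sketch, assuming one query per line; the helper `is_valid_query` is hypothetical and not part of A-Parser:

```python
from urllib.parse import urlparse


def is_valid_query(line: str) -> bool:
    """Return True if the line looks like an absolute http(s) URL."""
    parts = urlparse(line.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


queries = """\
https://a-parser.com/docs/
https://lenta.ru/articles/2021/09/11/buran/
not a link
"""

# Keep only the lines that pass the check.
valid = [q for q in queries.splitlines() if is_valid_query(q)]
print(valid)
```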

Output results examples

A-Parser supports flexible formatting of results thanks to the built-in Template Toolkit, which allows it to output results in any form, as well as in structured formats such as CSV or JSON.
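
As a rough illustration of what structured output could look like, the Python sketch below serializes one result to JSON and to CSV. It does not use Template Toolkit or any A-Parser API; the record values are made up and the choice of columns is an assumption.

```python
import csv
import io
import json

# Made-up record using a few of the variables listed under "Collected data".
record = {
    "title": "Example title",
    "textContent": "Example body.",
    "length": 13,
    "siteName": "Example Site",
}

# JSON: one object per article.
json_line = json.dumps(record, ensure_ascii=False)

# CSV: a header row followed by one row per article.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
csv_text = buffer.getvalue()

print(json_line)
print(csv_text)
```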

Possible settings