HTML::ArticleExtractor - Article parser

Overview of the parser

HTML::ArticleExtractor extracts articles from web pages.

It uses the @mozilla/readability module, which is built into A-Parser, to collect basic data such as the title, the content with and without HTML markup, and the article length.

It is based on the Net::HTTP parser and therefore supports that parser's functionality. It supports multi-page parsing (pagination), has built-in tools for bypassing CloudFlare protection, and can use Chrome as the engine for parsing articles from pages where data is loaded by scripts.

It can reach speeds of up to 200 requests per minute, which is 12,000 links per hour.
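The throughput figure above can be turned into a rough planning estimate. The sketch below is only arithmetic on the quoted peak rate; it assumes one request per link (no pagination and no retries), and the function names are hypothetical, not part of A-Parser.

```python
import math

# Peak rate quoted in the text; real runs may be slower.
MAX_REQUESTS_PER_MINUTE = 200


def links_per_hour(requests_per_minute: int = MAX_REQUESTS_PER_MINUTE) -> int:
    """Hourly throughput, assuming one request per link."""
    return requests_per_minute * 60


def min_minutes_for(link_count: int,
                    requests_per_minute: int = MAX_REQUESTS_PER_MINUTE) -> int:
    """Lower bound on run time in whole minutes at the given rate."""
    return math.ceil(link_count / requests_per_minute)


print(links_per_hour())         # 12000
print(min_minutes_for(50_000))  # 250
```

Pages that need pagination or a Chrome render would take more than one request each, so the real time is likely to be higher than this lower bound.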

Collected data

  • Article title - $title
  • HTML string of the processed article content - $content
  • Text content of the article (all HTML removed) - $textContent
  • Article length in characters - $length
  • Article description or a short excerpt from the content - $excerpt
  • Author metadata - $byline
  • Site name - $siteName
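To show how the collected fields relate to each other, here is a minimal Python sketch of a record with the same seven names. It is an illustration only: the `Article` class, `strip_html`, and `make_article` are hypothetical helpers, and the assumption that `length` is the character count of the tag-free text is ours, not a statement about how the parser computes it.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Optional


class _TextCollector(HTMLParser):
    """Collects text nodes and drops all tags."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def strip_html(html: str) -> str:
    """Return the text of an HTML fragment with all markup removed."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()  # flush any buffered text
    return "".join(collector.parts)


@dataclass
class Article:
    title: str                      # $title
    content: str                    # $content (HTML)
    textContent: str                # $textContent (no HTML)
    length: int                     # $length (characters, assumed to count textContent)
    excerpt: Optional[str] = None   # $excerpt
    byline: Optional[str] = None    # $byline
    siteName: Optional[str] = None  # $siteName


def make_article(title: str, content: str, **meta) -> Article:
    """Derive the text-only fields from the HTML content."""
    text = strip_html(content)
    return Article(title=title, content=content,
                   textContent=text, length=len(text), **meta)


article = make_article("Hello", "<p>Hello, <b>world</b>!</p>", siteName="Example")
print(article.textContent)  # Hello, world!
print(article.length)       # 13
```

The optional fields default to `None` here because metadata such as the author or site name may be absent from a page.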

Capabilities

  • Multi-page parsing (pagination)
  • Supports gzip/deflate/brotli compression
  • Detection and conversion of website encodings to UTF-8
  • Bypassing CloudFlare protection
  • Choice of engine (HTTP or Chrome)
  • Ability to set article length
  • Parsing articles with and without HTML tags

Use cases

  • Collecting ready-made articles from any website

Queries

As queries, specify links to the pages from which articles should be parsed, for example:

https://a-parser.com/docs/
https://lenta.ru/articles/2021/09/11/buran/
https://www.thetimes.co.uk/article/the-russian-banker-the-royal-fixers-and-a-500-000-riddle-vvgc55b2s
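Because every query must be a page link, it can help to clean a query list before loading it. The following is a small sketch using only the Python standard library; `clean_queries` is a hypothetical helper, not an A-Parser function, and it assumes only http and https links are wanted.

```python
from typing import List
from urllib.parse import urlsplit


def clean_queries(raw: str) -> List[str]:
    """Return unique http(s) URLs from a newline-separated list, in order."""
    seen = set()
    queries = []
    for line in raw.splitlines():
        url = line.strip()
        if not url:
            continue
        try:
            parts = urlsplit(url)
        except ValueError:
            continue  # malformed URL, e.g. a broken IPv6 host
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue  # not an absolute web link
        if url not in seen:
            seen.add(url)
            queries.append(url)
    return queries


raw = """
https://a-parser.com/docs/
not a url
https://a-parser.com/docs/
https://lenta.ru/articles/2021/09/11/buran/
"""
print(clean_queries(raw))
# ['https://a-parser.com/docs/', 'https://lenta.ru/articles/2021/09/11/buran/']
```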

Output results examples

A-Parser supports flexible result formatting through the built-in Template Toolkit, which lets it output results in an arbitrary form as well as in structured formats such as CSV or JSON.
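To illustrate what the structured formats might look like for the collected fields, here is a Python sketch that writes the same records as CSV and as JSON lines. It is not Template Toolkit syntax and does not use any A-Parser API; the helper names, the choice of columns, and the sample record are all assumptions made for the example.

```python
import csv
import io
import json

# Columns chosen for the example; $content and $textContent are left out
# because long article bodies make CSV rows hard to read.
FIELDS = ["title", "length", "excerpt", "byline", "siteName"]


def to_csv(articles) -> str:
    """Render records as CSV with a header row; missing values become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(articles)
    return buffer.getvalue()


def to_json_lines(articles) -> str:
    """Render records as one JSON object per line, keeping non-ASCII text as-is."""
    return "\n".join(json.dumps(a, ensure_ascii=False) for a in articles)


# Made-up sample record, not real parser output.
articles = [{
    "title": "Buran, the shuttle",
    "length": 13,
    "excerpt": 'A "short" excerpt',
    "byline": None,
    "siteName": "Example",
}]

print(to_csv(articles))
print(to_json_lines(articles))
```

The `csv` module quotes the comma in the title and doubles the quotation marks in the excerpt, which is the kind of escaping a hand-written CSV template also has to handle.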

Possible settings