HTML::ArticleExtractor - Article Parser
HTML::ArticleExtractor Parser Overview

The parser uses the @mozilla/readability module, which is built into A-Parser, to collect basic article data: the title, the content with and without HTML markup, and the article length.
It is based on the Net::HTTP parser and therefore supports that parser's functionality, including multi-page parsing (page switching). It has built-in tools for bypassing CloudFlare protection and can use Chrome as the engine to parse articles from pages whose data is loaded by scripts.
It can reach speeds of up to 200 requests per minute, which is 12,000 links per hour.
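The throughput figure can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes one request per link (no page switching and no retries), and the function names are hypothetical, not part of A-Parser.

```python
import math

# Peak rate quoted in the text; real throughput depends on the target sites.
REQUESTS_PER_MINUTE = 200


def links_per_hour(requests_per_minute: int = REQUESTS_PER_MINUTE) -> int:
    """Convert a per-minute rate to links per hour, assuming one request per link."""
    return requests_per_minute * 60


def minutes_needed(link_count: int, requests_per_minute: int = REQUESTS_PER_MINUTE) -> int:
    """Lower bound on the whole minutes needed to process link_count links at the peak rate."""
    return math.ceil(link_count / requests_per_minute)


print(links_per_hour())        # 12000
print(minutes_needed(50_000))  # 250
```

With page switching enabled, each link may cost several requests, so the per-link rate would be lower than this estimate.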
Collected data
- Article title - $title
- HTML string of processed article content - $content
- Text content of the article (all HTML removed) - $textContent
- Article length in characters - $length
- Article description or short excerpt from content - $excerpt
- Author metadata - $byline
- Site name - $siteName
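To illustrate how these fields relate to each other, here is a minimal sketch that models one result as a record and derives the plain-text and length fields from the HTML content. It is not the Readability algorithm: the `Article` class, `strip_html`, and `make_article` are hypothetical names, and treating the length as the character count of the text content is an assumption made for this example.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Optional


class _TextCollector(HTMLParser):
    """Collects text nodes and discards all tags."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def strip_html(html: str) -> str:
    """Return the text of an HTML fragment with all markup removed."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return "".join(collector.parts)


@dataclass
class Article:
    title: str                       # $title
    content: str                     # $content (HTML)
    text_content: str                # $textContent
    length: int                      # $length
    excerpt: Optional[str] = None    # $excerpt
    byline: Optional[str] = None     # $byline
    site_name: Optional[str] = None  # $siteName


def make_article(title: str, content: str) -> Article:
    """Build a record, deriving text_content and length from the HTML content."""
    text = strip_html(content)
    return Article(title=title, content=content, text_content=text, length=len(text))


article = make_article("Example", "<div><p>Hello, <b>world</b>!</p></div>")
print(article.text_content)  # Hello, world!
print(article.length)        # 13
```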
Capabilities
- Multi-page parsing (page switching)
- Supports gzip/deflate/brotli compression
- Detection and conversion of site encodings to UTF-8
- Bypassing CloudFlare protection
- Choice of engine (HTTP or Chrome)
- Ability to set article length
- Parsing articles with HTML tags and without
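The compression and encoding items in the list above can be pictured with a small standard-library sketch. It shows the general technique only, not A-Parser's internal implementation: the function names are hypothetical, the source charset is assumed to be already known (for example from the Content-Type header), and brotli is left out because it needs a third-party package.

```python
import gzip
import zlib


def decompress_body(body: bytes, content_encoding: str) -> bytes:
    """Undo gzip or deflate compression based on the Content-Encoding value."""
    encoding = content_encoding.strip().lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        try:
            return zlib.decompress(body)  # zlib-wrapped stream
        except zlib.error:
            return zlib.decompress(body, -zlib.MAX_WBITS)  # raw deflate stream
    # Unknown or absent encoding (including brotli here): return unchanged.
    return body


def to_utf8(body: bytes, charset: str = "utf-8") -> bytes:
    """Re-encode a body from its source charset to UTF-8, replacing undecodable bytes."""
    return body.decode(charset, errors="replace").encode("utf-8")


# Sample page body in Windows-1251, compressed the way a server might send it.
raw = "Привет".encode("cp1251")
compressed = gzip.compress(raw)
utf8_body = to_utf8(decompress_body(compressed, "gzip"), "cp1251")
print(utf8_body.decode("utf-8"))  # Привет
```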
Use cases
- Collecting ready-made articles from any site
Query examples
As queries, specify links to the pages from which articles should be parsed, for example:
https://a-parser.com/docs/
https://lenta.ru/articles/2021/09/11/buran/
https://www.thetimes.co.uk/article/the-russian-banker-the-royal-fixers-and-a-500-000-riddle-vvgc55b2s
Possible result output formats
A-Parser supports flexible result formatting through the built-in Template Toolkit template engine, which lets it output results in any form, including structured formats such as CSV or JSON.
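For readers who post-process results outside A-Parser, the sketch below shows what structured CSV and JSON output of the collected fields could look like. It does not use Template Toolkit and is not A-Parser's result format syntax. The function names are hypothetical and the sample record is invented for illustration.

```python
import csv
import io
import json

# Columns for the CSV export; keys mirror the variable names listed above.
FIELDS = ["title", "length", "excerpt", "byline", "siteName"]


def to_json_lines(records) -> str:
    """Serialize each record as one JSON object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


def to_csv(records) -> str:
    """Serialize records as CSV with a header row; keys outside FIELDS are ignored."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Invented sample record, not real parser output.
records = [
    {"title": "A, B", "length": 13, "excerpt": "x", "byline": "", "siteName": "Example"}
]
print(to_json_lines(records))
print(to_csv(records))
```

The csv module quotes the title automatically because it contains a comma, which is the main reason to prefer a CSV writer over joining fields by hand.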
Possible settings
General settings for all parsers.
Supports all settings of the Net::HTTP parser.