HTML::ArticleExtractor - Article Parser
HTML::ArticleExtractor Parser Overview
HTML::ArticleExtractor collects articles from web pages.
It relies on the @mozilla/readability module, which is built into A-Parser, and collects basic data such as the title, the content with and without HTML markup, and the article length.
It is based on the Net::HTTP parser and therefore supports all of its functionality, including multi-page parsing (page switching). It has built-in tools for bypassing CloudFlare protection and can use Chrome as the engine for parsing articles from pages whose data is loaded by scripts.
It can reach a speed of up to 200 requests per minute, which is 12,000 links per hour.
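The throughput figure above is simple arithmetic, and the same calculation gives a rough lower bound on how long a job will take. The sketch below assumes one request per link and a sustained peak rate, which real runs may not reach; the function names are illustrative only.

```python
REQUESTS_PER_MINUTE = 200  # peak rate quoted in the overview


def links_per_hour(requests_per_minute: int = REQUESTS_PER_MINUTE) -> int:
    """Upper bound on links per hour, assuming one request per link."""
    return requests_per_minute * 60


def minutes_needed(total_links: int,
                   requests_per_minute: int = REQUESTS_PER_MINUTE) -> float:
    """Best-case minutes to process total_links at a sustained peak rate."""
    return total_links / requests_per_minute


print(links_per_hour())        # 12000
print(minutes_needed(50_000))  # 250.0
```

Multi-page parsing, Chrome rendering, or slow target sites would each lower the effective rate, so treat the result as a best case.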
- Article title
- HTML string of processed article content
- Text content of the article (all HTML removed)
- Article length in characters
- Article description or short excerpt from content
- Author metadata
- Site name
- Multi-page parsing (page switching)
- Supports gzip/deflate/brotli compression
- Detection and conversion of site encodings to UTF-8
- Bypassing CloudFlare protection
- Choice of engine (HTTP or Chrome)
- Ability to set article length
- Parsing articles with or without HTML tags
- Collecting ready-made articles from any site
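To make the collected fields and the length setting more concrete, here is a minimal sketch of how one result could be modeled and filtered outside A-Parser. The field names (`text_content`, `byline`, `site_name`, and so on) and the helper functions are assumptions chosen for illustration; they are not A-Parser's actual variable names or API, and the tag stripping uses Python's standard `html.parser` rather than the readability module.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Iterable, List, Optional


@dataclass
class Article:
    # Hypothetical field names; A-Parser's real result variables may differ.
    title: str
    content: str        # processed article content as an HTML string
    text_content: str   # the same content with all HTML removed
    length: int         # article length in characters
    excerpt: str = ""   # description or short excerpt
    byline: str = ""    # author metadata
    site_name: str = ""


class _TextCollector(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def strip_tags(html: str) -> str:
    """Return the text of an HTML fragment with whitespace normalized."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join("".join(collector.parts).split())


def make_article(title: str, content_html: str) -> Article:
    """Build an Article, deriving the plain text and its length."""
    text = strip_tags(content_html)
    return Article(title=title, content=content_html,
                   text_content=text, length=len(text))


def filter_by_length(articles: Iterable[Article], min_length: int = 0,
                     max_length: Optional[int] = None) -> List[Article]:
    """Keep articles whose length falls within the given bounds."""
    return [a for a in articles
            if a.length >= min_length
            and (max_length is None or a.length <= max_length)]


articles = [
    make_article("Short", "<p>Hi</p>"),
    make_article("Longer", "<p>Hello <b>world</b>, this is text.</p>"),
]
kept = filter_by_length(articles, min_length=10)
print([a.title for a in kept])  # ['Longer']
```

Keeping both `content` and `text_content` on the record mirrors the "with or without HTML tags" option: the caller picks whichever form the output needs.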
As queries, specify links to the pages from which articles should be parsed, for example:
Possible result output formats
A-Parser supports flexible result formatting through the built-in Template Toolkit template engine, which lets it output results in any form, including structured formats such as CSV or JSON.
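For readers who have not worked with structured output, the sketch below shows what CSV and JSON renderings of the same results look like. It is written in plain Python and is not Template Toolkit syntax; the sample records, URLs, and field names are made up for illustration.

```python
import csv
import io
import json

# Hypothetical result records; real output depends on the chosen format.
results = [
    {"url": "https://example.com/a", "title": "First article", "length": 1200},
    {"url": "https://example.com/b", "title": 'Second, with "quotes"', "length": 845},
]


def to_csv(rows, fields):
    """Render rows as CSV text with a header line and proper quoting."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json_lines(rows):
    """Render rows as one JSON object per line."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)


print(to_csv(results, ["url", "title", "length"]))
print(to_json_lines(results))
```

The second record shows why a real CSV writer matters: titles containing commas or quotes must be quoted and escaped, which naive string joining would get wrong.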
General settings for all parsers. Supports all settings of the Net::HTTP parser.