
HTML::TextExtractor - Parsing content (text) from websites

Parser overview

HTML::TextExtractor parses text blocks from the specified page. This content parser supports multi-page parsing (page navigation). It has built-in tools for bypassing CloudFlare protection and can use Chrome as the engine for pages where the data is loaded by scripts. It can reach speeds of up to 2,000 requests per minute, which is 120,000 links per hour.

Parser use cases

Text parsing via Chrome using lingualeo.com as an example

  1. Add the Engine option and select the Chrome (Slow, JavaScript Enabled) engine from the list.
  2. Specify the link to the site from which you want to parse text as a query.
Note: this option is useful when a site loads its main text with scripts as the page loads, so that with the HTTP (Fast, JavaScript Disabled) engine the result is missing or incomplete.

Download example

How to import an example into A-Parser

eJxtU01v2zAM/S9EDhsQJO1hF9/SYME6pHXXpqcgB8GmXa2ypOkjS2Hkv+/Jce2k
680kHx8fxeeWgvCv/sGx5+Ap27Zku2/KqORKRBVoSlY4zy6Vt/Rjc7fOsg0fwvdD
cKIIxgExYFsKb5bRbfbsnCwZRVkiZl1LnaK9UDEBihdnGqbjbjcljES3XxnXiDR6
Yq9nvY6h+CT2vDEoVlLxmF4huhdNYpyUInCqzqqO6MvXWTgkBlGWMkijhTpNSJuM
U5+1/NMp8sFJXQOP0En2KwhEOnBHkpJv7wq3NOliAk3s+n+deigLLvKUPNSuBLSU
Q6ESyqMiAzuBV8ttkoR8S0YvlFrzntUI6+hvolQlXn5Roem2b/wckv/HcRw2PB+F
s/x10DCwdNFNfjd2lWZtaiyuDdZWspEBsV+aqNNtrpB8ZbbDs90nWGMcD2N65n46
zGVZJw+MV1vYMXWxxsVlLpOF0ZWs895X78ioN3BwrpemsYrTXjoqhat4fhwdsvD9
GVIwCvzYvOxGXHg/GKP8z6eTVOskHPgtCWzwkudTe8pCKPX8uD6v0OgoBC8hWJ/N
5wpWi0KxmRWmmbs4p9QcuDZwFVY77ob/bvg720//vqw94mi//cMJnTZMWOTwVB4X
oez6+A9VbWHX

Text parsing with page navigation using news as an example

Results are saved in the aparser/results/example/textextractor directory in a separate file for each query. The query sequence number is used as the name.

  1. Add the Check next page option and specify (forum\/news\/page-\d+)"[^>]+>Next as the regex.
  2. Add the Page as new query option.
  3. Change File name to example/textextractor/${query.num}.txt.
  4. Specify the link to the first page of A-Parser news as a query: https://a-parser.com/forum/news/.
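
To illustrate what the Check next page regex from step 1 captures, here is a minimal Python sketch. The sample HTML fragment and the `find_next_page` helper are assumptions made for illustration; they are not taken from the A-Parser site or from the parser's implementation. The first capture group is the path that would be queued as the next page.

```python
import re
from typing import Optional

# Same pattern as in step 1: capture the link path that is followed by a "Next" label.
NEXT_PAGE_RE = re.compile(r'(forum\/news\/page-\d+)"[^>]+>Next')


def find_next_page(html: str) -> Optional[str]:
    """Return the captured next-page path, or None when there is no Next link."""
    match = NEXT_PAGE_RE.search(html)
    return match.group(1) if match else None


# Hypothetical pagination markup, used only to exercise the regex.
sample = '<a href="forum/news/page-2" class="text">Next &gt;</a>'
print(find_next_page(sample))
```

Note that the pattern requires at least one character (for example, another attribute) between the closing quote of the link and the `>` that precedes `Next`, so a bare `<a href="forum/news/page-2">Next</a>` would not match.
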
Download example

How to import an example into A-Parser

eJx1VN1v2jAQ/18sHjaVEtjoSx4qUVS0TRRoS58Ik6zkQj0c27UdPhTlf9/ZCQmw
7sXJne/jd7+7c0EsNVuz0GDAGhKuCqL8PwnJ44FmikMYLuFgHw9W09hKHYYzFBd0
A6RLFNUGtPNbkR/Lp+mVLVokkNKcW9ItiD0qwLBSWSaFwTuWoBi/Q7w9C7mjPHdm
X1Kp8yyKAgF7gx+F17dRlNx8jcjq9/365j7K+8PBN3d+T/15585h3513A68ZYkCa
JMxlpJyExWW6KcuYq7RPyvK/AF3ikZnB/jkHfWwRWp3DdfQtgPJmU9gBavpluV53
CTKKHJiJ1Bl1+Tpq0Ktpbi5f6Q6WEi9TxqFVT1Ca0czhgqofgUX0cKI46BQfLmFP
5FnZswd7UXGV0fWnRfEm2IdnWEi0dc4MzETLDFUubq08ntCuSMfLBEPk3ve58iFh
SrlBDgxCn1AEmlzfMAuaIsp5TSlSJMWIc09Pa+bjP+SMJzhMoxSdftaOn5vM/4lR
NuWdp9qB3mvE0ETx0sP8qfVK5FRuTmRwNw8om7HMRTUYXd/ThrOZM8ukhiZNHbnO
joukQLixaVs4Uq3qooyLtlwqYylStpljAZolcLLMxRK3dS7G0g2Cq0vknGNbDLy0
4zIydRuc0AK8dh77FAirWVFipeTm12sFVWmG43jnAGbI5HnWOmRMOX97mZ7fkHak
UHi3VpkwCOht9VD0YpkFfq/9VgfExbCwkThdWGG5bl6U5kEqPn1XwgIXlvwxi8ra
FepsUYeMGWwMCQflX6y1tO0=

Collected data

  • Text blocks parsed from the specified page
  • Array with all collected pages (used when the Use Pages option is active)

Capabilities

  • Multi-page text parsing (page navigation)
  • Automatic cleaning of text from HTML tags
  • Ability to set a minimum length for a text block
  • Optional removal of link anchors from text
  • Supports gzip/deflate/brotli compression
  • Detection and conversion of website encodings to UTF-8
  • CloudFlare protection bypass
  • Choice of engine (HTTP or Chrome)
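
The idea behind tag cleaning, ignored tags, and the minimum block length can be sketched with Python's standard `html.parser` module. This is a rough illustration under stated assumptions, not how A-Parser implements extraction: the `TextBlockExtractor` class and its parameters are hypothetical, and every text node is treated as a separate block.

```python
from html.parser import HTMLParser


class TextBlockExtractor(HTMLParser):
    """Collect text nodes, skipping ignored tags and fragments that are too short."""

    def __init__(self, min_block_length=50, ignore_tags=("script", "style")):
        super().__init__(convert_charrefs=True)
        self.min_block_length = min_block_length
        self.ignore_tags = set(ignore_tags)
        self.blocks = []
        self._ignored_depth = 0  # > 0 while inside an ignored tag

    def handle_starttag(self, tag, attrs):
        # Limitation: ignored tags are assumed to have a closing tag.
        if tag in self.ignore_tags:
            self._ignored_depth += 1

    def handle_endtag(self, tag):
        if tag in self.ignore_tags and self._ignored_depth > 0:
            self._ignored_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if self._ignored_depth == 0 and len(text) >= self.min_block_length:
            self.blocks.append(text)


html = (
    '<div><script>var x = "script text long enough to pass the filter";</script>'
    "<p>Short</p>"
    "<p>This paragraph is comfortably longer than twenty characters.</p></div>"
)
extractor = TextBlockExtractor(min_block_length=20)
extractor.feed(html)
extractor.close()
print(extractor.blocks)
```
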

Use cases

  • Parsing text content from any websites

Queries

Links to pages from which text blocks need to be parsed should be specified as queries, for example:

https://a-parser.com/

Output results examples

A-Parser supports flexible result formatting thanks to the built-in Template Toolkit, which allows it to output results in arbitrary form as well as in structured formats such as CSV or JSON.

Default output

Result format:

$texts.format('$text\n')
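
For readers unfamiliar with this template syntax, the default format writes each collected text block followed by a line break. A rough Python equivalent is sketched below; it assumes `texts` is a list of records with a `text` field, and the `render_texts` name is hypothetical.

```python
def render_texts(texts):
    # One output line per collected block, mirroring the '$text\n' item template.
    return "".join(f"{item['text']}\n" for item in texts)


print(render_texts([{"text": "First block"}, {"text": "Second block"}]), end="")
```
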

Result example:

Hello, Super Team of the Highest Professionals in their Field! Thank you for the opportunity to study Spanish, Turkish and Portuguese! I wish you further expansion of your Opportunities! Inspiration and Creativity! And please add the Opportunity to study German and French!”
I've been using Lingualeo for many years, first started when there was no app at all, only the website) Thanks to the developers, keep it up, with creativity and great love for the work)
Technical English for IT: dictionaries, textbooks, magazines
Learn languages online Learn English online Learn Vietnamese online Learn Greek online Learn Indonesian online Learn Spanish online Learn Italian online Learn Chinese online Learn Korean online Learn German online Learn Dutch online Learn Polish online Learn Portuguese online Learn Serbian online Learn Turkish online Learn Ukrainian online Learn French online Learn Hindi online Learn Czech online Learn Japanese online

Possible settings

| Parameter name | Default value | Description |
|---|---|---|
| Min block length | 50 | Minimum length of a text block in characters. |
| Skip anchor text | | Whether to skip anchors in the text. |
| Ignore tags list | | Option to specify tags that should be ignored. Example: div,span,p |
| Good status | All | Choice of which server response will be considered successful. If there is a different response from the server during parsing, the query will be repeated with a different proxy. |
| Good code RegEx | | Ability to specify a regular expression to check the response code. |
| Method | GET | Request method. |
| POST body | | Content to be sent to the server when using the POST method. Supports variables $query (request URL), $query.orig (original query), and $pagenum (page number when using the Use Pages option). |
| Cookies | | Ability to specify cookies for the request. |
| User agent | User-agent of the current Chrome version is automatically substituted | User-Agent header when requesting pages. |
| Additional headers | | Ability to specify custom request headers with support for template engine features and variables from the query builder. |
| Read only headers | | Read only headers. In some cases, this allows saving traffic if there is no need to process content. |
| Detect charset on content | | Recognize encoding based on page content. |
| Emulate browser headers | | Emulate browser headers. |
| Max redirects count | 7 | Maximum number of redirects the parser will follow. |
| Max cookies count | 16 | Maximum number of cookies to save. |
| Bypass CloudFlare | | Automatic CloudFlare check bypass. |
| Follow common redirects | | Allows redirects http <-> https and www.domain <-> domain within one domain, bypassing the Max redirects count limit. |
| Engine | HTTP (Fast, JavaScript Disabled) | Allows choosing the engine: HTTP (faster, no JavaScript) or Chrome (slower, JavaScript enabled). |
| Chrome Headless | | If the option is enabled, the browser will not be displayed. |
| Chrome DevTools | | Allows using Chromium debugging tools. |
| Chrome Log Proxy connections | | If the option is enabled, Chrome connection information will be output to the log. |
| Chrome Wait Until | networkidle2 | Determines when the page is considered loaded. |
| Use HTTP/2 transport | | Determines whether to use HTTP/2 instead of HTTP/1.1. For example, Google and Majestic ban immediately if HTTP/1.1 is used. |
| Bypass CloudFlare with Chrome (Experimental) | | CF bypass via Chrome. |
| Bypass CloudFlare with Chrome Max Pages | | Max number of pages when bypassing CF via Chrome. |
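
As an illustration of what the Follow common redirects description means, the sketch below classifies a redirect as "common" when only the scheme (http <-> https) or the www prefix changes. The `is_common_redirect` helper is hypothetical, and the requirement that path and query stay the same is an assumption; the actual rule used by A-Parser may differ.

```python
from urllib.parse import urlsplit


def _strip_www(host: str) -> str:
    return host[4:] if host.startswith("www.") else host


def is_common_redirect(src: str, dst: str) -> bool:
    """True when dst differs from src only by scheme and/or a leading 'www.'."""
    a, b = urlsplit(src), urlsplit(dst)
    same_domain = _strip_www(a.hostname or "") == _strip_www(b.hostname or "")
    # Assumption: path and query must be unchanged for the redirect to count as common.
    same_target = (a.path or "/") == (b.path or "/") and a.query == b.query
    return src != dst and same_domain and same_target


print(is_common_redirect("http://example.com/a", "https://www.example.com/a"))
print(is_common_redirect("http://example.com/a", "https://other.com/a"))
```

Under this reading, such redirects would not count toward the Max redirects count limit, while any other redirect would.
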