HTML::TextExtractor - Parsing content (text) from websites

Parser overview

HTML::TextExtractor parses text blocks from the specified page. It supports multi-page parsing (page navigation), has built-in tools for bypassing CloudFlare protection, and can use Chrome as the engine for pages whose data is loaded by scripts. It can reach speeds of up to 2,000 requests per minute, which is 120,000 links per hour.


Parser use cases

Text parsing via Chrome using lingualeo.com as an example

Add the Engine option and select the Chrome (Slow, JavaScript Enabled) engine from the list.
Specify the link to the site from which you want to parse text as a query.

Note: This option is useful when a site loads its main text with scripts as the page loads, and the HTTP (Fast, JavaScript Disabled) engine returns a missing or incomplete result.

Download example

How to import an example into A-Parser

eJxtU01v2zAM/S9EDhsQJO1hF9/SYME6pHXXpqcgB8GmXa2ypOkjS2Hkv+/Jce2k
680kHx8fxeeWgvCv/sGx5+Ap27Zku2/KqORKRBVoSlY4zy6Vt/Rjc7fOsg0fwvdD
cKIIxgExYFsKb5bRbfbsnCwZRVkiZl1LnaK9UDEBihdnGqbjbjcljES3XxnXiDR6
Yq9nvY6h+CT2vDEoVlLxmF4huhdNYpyUInCqzqqO6MvXWTgkBlGWMkijhTpNSJuM
U5+1/NMp8sFJXQOP0En2KwhEOnBHkpJv7wq3NOliAk3s+n+deigLLvKUPNSuBLSU
Q6ESyqMiAzuBV8ttkoR8S0YvlFrzntUI6+hvolQlXn5Roem2b/wckv/HcRw2PB+F
s/x10DCwdNFNfjd2lWZtaiyuDdZWspEBsV+aqNNtrpB8ZbbDs90nWGMcD2N65n46
zGVZJw+MV1vYMXWxxsVlLpOF0ZWs895X78ioN3BwrpemsYrTXjoqhat4fhwdsvD9
GVIwCvzYvOxGXHg/GKP8z6eTVOskHPgtCWzwkudTe8pCKPX8uD6v0OgoBC8hWJ/N
5wpWi0KxmRWmmbs4p9QcuDZwFVY77ob/bvg720//vqw94mi//cMJnTZMWOTwVB4X
oez6+A9VbWHX

Text parsing with page navigation using news as an example

Results are saved in the aparser/results/example/textextractor directory in a separate file for each query. The query sequence number is used as the name.

Add the Check next page option and specify (forum\/news\/page-\d+)"[^>]+>Next as the regex.
Add the Page as new query option.
Change File name to example/textextractor/${query.num}.txt.
Specify the link to the first page of A-Parser news as a query: https://a-parser.com/forum/news/.
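
To see what the Check next page regex from the steps above captures, here is a minimal Python sketch. The sample markup is invented for illustration, and resolving the captured path against the site root is an assumption about how the next query is built, not a description of A-Parser internals.

```python
import re
from typing import Optional
from urllib.parse import urljoin

# Regex from the steps above; group 1 is the relative path of the next page.
NEXT_PAGE_RE = re.compile(r'(forum\/news\/page-\d+)"[^>]+>Next')


def find_next_page(html: str, base_url: str) -> Optional[str]:
    """Return the absolute URL of the next page, or None if no Next link is found."""
    match = NEXT_PAGE_RE.search(html)
    if match is None:
        return None
    # Assumption: the captured path is relative to the site root.
    return urljoin(base_url, "/" + match.group(1))


# Hypothetical pagination markup, not copied from the real site.
sample_html = '<a href="forum/news/page-2" class="pageNav-jump">Next</a>'
next_url = find_next_page(sample_html, "https://a-parser.com/forum/news/")
print(next_url)
```

When the regex does not match, the function returns None, which corresponds to reaching the last page.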

Download example

How to import an example into A-Parser

eJx1VN1v2jAQ/18sHjaVEtjoSx4qUVS0TRRoS58Ik6zkQj0c27UdPhTlf9/ZCQmw
7sXJne/jd7+7c0EsNVuz0GDAGhKuCqL8PwnJ44FmikMYLuFgHw9W09hKHYYzFBd0
A6RLFNUGtPNbkR/Lp+mVLVokkNKcW9ItiD0qwLBSWSaFwTuWoBi/Q7w9C7mjPHdm
X1Kp8yyKAgF7gx+F17dRlNx8jcjq9/365j7K+8PBN3d+T/15585h3513A68ZYkCa
JMxlpJyExWW6KcuYq7RPyvK/AF3ikZnB/jkHfWwRWp3DdfQtgPJmU9gBavpluV53
CTKKHJiJ1Bl1+Tpq0Ktpbi5f6Q6WEi9TxqFVT1Ca0czhgqofgUX0cKI46BQfLmFP
5FnZswd7UXGV0fWnRfEm2IdnWEi0dc4MzETLDFUubq08ntCuSMfLBEPk3ve58iFh
SrlBDgxCn1AEmlzfMAuaIsp5TSlSJMWIc09Pa+bjP+SMJzhMoxSdftaOn5vM/4lR
NuWdp9qB3mvE0ETx0sP8qfVK5FRuTmRwNw8om7HMRTUYXd/ThrOZM8ukhiZNHbnO
joukQLixaVs4Uq3qooyLtlwqYylStpljAZolcLLMxRK3dS7G0g2Cq0vknGNbDLy0
4zIydRuc0AK8dh77FAirWVFipeTm12sFVWmG43jnAGbI5HnWOmRMOX97mZ7fkHak
UHi3VpkwCOht9VD0YpkFfq/9VgfExbCwkThdWGG5bl6U5kEqPn1XwgIXlvwxi8ra
FepsUYeMGWwMCQflX6y1tO0=

Collected data

Text blocks parsed from the specified page
Array of all collected pages (used when the Use Pages option is active)

Capabilities

Multi-page text parsing (page navigation)
Automatic cleaning of text from HTML tags
Ability to set a minimum length for a text block
Optional removal of link anchors from text
Supports gzip/deflate/brotli compression
Detection and conversion of website encodings to UTF-8
CloudFlare protection bypass
Choice of engine (HTTP or Chrome)
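
Several of the capabilities above (tag cleaning, minimum block length, anchor removal) can be illustrated with a rough Python sketch built on the standard `html.parser` module. This is not A-Parser's algorithm. The class name, the set of block-level tags, and the function `extract_text_blocks` are all hypothetical and chosen only for the example.

```python
from html.parser import HTMLParser


class TextBlockExtractor(HTMLParser):
    """Collects text between block-level tags (illustrative stand-in only)."""

    BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "td", "br"}
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self, min_block_length=50, skip_anchor_text=False):
        super().__init__()
        self.min_block_length = min_block_length
        self.skip_anchor_text = skip_anchor_text
        self.blocks = []
        self._buffer = []
        self._skip_depth = 0  # > 0 while inside a tag whose text is ignored

    def _is_skipped(self, tag):
        return tag in self.SKIPPED_TAGS or (self.skip_anchor_text and tag == "a")

    def _flush(self):
        # Collapse whitespace and keep the block only if it is long enough.
        text = " ".join("".join(self._buffer).split())
        if text and len(text) >= self.min_block_length:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._is_skipped(tag):
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if self._is_skipped(tag):
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self.BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def extract_text_blocks(html, min_block_length=50, skip_anchor_text=False):
    parser = TextBlockExtractor(min_block_length, skip_anchor_text)
    parser.feed(html)
    parser.close()
    return parser.blocks


sample = (
    "<div><p>Short.</p>"
    "<p>This paragraph is long enough to pass the fifty character minimum "
    '<a href="/more">read more</a></p>'
    "<script>var x = 1;</script></div>"
)
print(extract_text_blocks(sample, skip_anchor_text=True))
```

With the default minimum of 50 characters, the short paragraph is dropped, and with `skip_anchor_text=True` the link text "read more" is left out of the remaining block.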

Use cases

Parsing text content from any websites

Queries

Links to pages from which text blocks need to be parsed should be specified as queries, for example:

https://a-parser.com/

Output results examples

A-Parser supports flexible result formatting through the built-in Template Toolkit, which lets it output results in arbitrary form as well as in structured formats such as CSV or JSON.

Default output

Result format:

$texts.format('$text\n')

Result example:

Hello, Super Team of the Highest Professionals in their Field! Thank you for the opportunity to study Spanish, Turkish and Portuguese! I wish you further expansion of your Opportunities! Inspiration and Creativity! And please add the Opportunity to study German and French!”
I've been using Lingualeo for many years, first started when there was no app at all, only the website) Thanks to the developers, keep it up, with creativity and great love for the work)
Technical English for IT: dictionaries, textbooks, magazines
Learn languages online Learn English online Learn Vietnamese online Learn Greek online Learn Indonesian online Learn Spanish online Learn Italian online Learn Chinese online Learn Korean online Learn German online Learn Dutch online Learn Polish online Learn Portuguese online Learn Serbian online Learn Turkish online Learn Ukrainian online Learn French online Learn Hindi online Learn Czech online Learn Japanese online
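
One way to read the default format is "render each element of the `$texts` array with the template `$text\n`", which gives one text block per line as in the example above. The Python sketch below emulates that reading with invented sample data. It is not the template engine itself, and the JSON variant is only an illustration of a structured alternative.

```python
import json

# Hypothetical stand-in for the $texts array: one dict per extracted block.
texts = [
    {"text": "Technical English for IT: dictionaries, textbooks, magazines"},
    {"text": "Learn languages online"},
]


def render_default(blocks):
    # Emulates the default result format: each block followed by a newline.
    return "".join(f"{block['text']}\n" for block in blocks)


def render_json(blocks):
    # A structured alternative: a JSON array of block strings.
    return json.dumps([block["text"] for block in blocks], ensure_ascii=False)


print(render_default(texts), end="")
print(render_json(texts))
```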

Possible settings

Note: The common settings for all parsers also apply.

Parameter name	Default value	Description
Min block length	`50`	Minimum length of a text block in characters.
Skip anchor text	`☐`	Whether to remove link anchor text from the extracted text.
Ignore tags list		Option to specify tags that should be ignored. Example: div,span,p
Good status	`All`	Choice of which server response will be considered successful. If there is a different response from the server during parsing, the query will be repeated with a different proxy.
Good code RegEx		Ability to specify a regular expression to check the response code.
Method	`GET`	Request method.
POST body		Content to be sent to the server when using the POST method. Supports the variables `$query` (request URL), `$query.orig` (original query), and `$pagenum` (page number when using the Use Pages option).
Cookies		Ability to specify cookies for the request.
User agent	_User-agent of the current Chrome version is automatically substituted_	User-Agent header sent when requesting pages.
Additional headers		Ability to specify custom request headers with support for template engine features and variables from the query builder.
Read only headers	`☐`	Read only the response headers. This can save traffic when the page content does not need to be processed.
Detect charset on content	`☐`	Recognize encoding based on page content.
Emulate browser headers	`☐`	Emulate browser headers.
Max redirects count	`7`	Maximum number of redirects the parser will follow.
Max cookies count	`16`	Maximum number of cookies to save.
Bypass CloudFlare	`☑`	Automatic CloudFlare check bypass.
Follow common redirects	`☑`	Allows redirects http <-> https and www.domain <-> domain within one domain, bypassing the Max redirects count limit.
Engine	`HTTP (Fast, JavaScript Disabled)`	Allows choosing the engine: HTTP (faster, no JavaScript) or Chrome (slower, JavaScript enabled).
Chrome Headless	`☐`	If the option is enabled, the browser will not be displayed.
Chrome DevTools	`☑`	Allows using Chromium debugging tools.
Chrome Log Proxy connections	`☑`	If the option is enabled, chrome connection information will be output to the log.
Chrome Wait Until	`networkidle2`	Determines when the page is considered loaded.
Use HTTP/2 transport	`☐`	Determines whether to use HTTP/2 instead of HTTP/1.1. For example, Google and Majestic ban immediately if HTTP/1.1 is used.
Bypass CloudFlare with Chrome(Experimental)	`☐`	CloudFlare bypass via Chrome.
Bypass CloudFlare with Chrome Max Pages		Maximum number of pages when bypassing CloudFlare via Chrome.
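
The interaction between Follow common redirects and Max redirects count in the table above can be sketched as follows. This is one possible reading of the rule, written in Python under stated assumptions: a redirect is treated as "common" when only the scheme (http/https) or the `www.` prefix changes and the path stays the same. The function names are hypothetical, and the path requirement is an assumption.

```python
from urllib.parse import urlsplit

MAX_REDIRECTS = 7  # default Max redirects count from the table


def _strip_www(host):
    return host[4:] if host.startswith("www.") else host


def is_common_redirect(source_url, target_url):
    """True for http <-> https or www.domain <-> domain hops within one domain."""
    src, dst = urlsplit(source_url), urlsplit(target_url)
    if {src.scheme, dst.scheme} - {"http", "https"}:
        return False
    same_domain = _strip_www(src.hostname or "") == _strip_www(dst.hostname or "")
    changed = src.scheme != dst.scheme or src.hostname != dst.hostname
    # Assumption: the path must stay the same for the hop to count as "common".
    return same_domain and changed and src.path == dst.path


def counted_redirects(chain):
    """Number of hops in a redirect chain that count toward the limit."""
    return sum(
        1 for src, dst in zip(chain, chain[1:]) if not is_common_redirect(src, dst)
    )


chain = [
    "http://example.com/",
    "https://example.com/",
    "https://www.example.com/",
    "https://www.example.com/news",
]
print(counted_redirects(chain), counted_redirects(chain) <= MAX_REDIRECTS)
```

In this chain only the last hop changes the path, so it is the only one counted against the limit.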
