HTML::TextExtractor - Parsing content (text) from a website

Overview of the scraper

HTML::TextExtractor scrapes text blocks from the specified page. This content scraper supports multi-page parsing (navigation through pages). It has built-in means to bypass CloudFlare protection, as well as the ability to choose Chrome as the engine for scraping content from pages where data is loaded by scripts. It is capable of reaching speeds of up to 2000 requests per minute, which is 120,000 links per hour.

Use cases for the scraper

Parsing text via Chrome using lingualeo.com as an example

  1. Add the Engine option and select the Chrome (Slow, JavaScript Enabled) engine from the list.
  2. As a query, specify the link to the site from which you need to scrape the text.
note

This option is useful when the site loads its main text with scripts during page load, so that with HTTP (Fast, JavaScript Disabled) the result is absent or incomplete.
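
A quick way to decide whether the Chrome engine is needed is to fetch the page over plain HTTP and check whether the text you expect is already present in the raw HTML. A minimal sketch in Python (this is not part of A-Parser; it assumes the requests library and a hypothetical marker phrase):

import requests

URL = "https://lingualeo.com/"    # page you plan to scrape
MARKER = "expected text"          # hypothetical phrase that should appear in the parsed text

html = requests.get(URL, timeout=30).text
if MARKER in html:
    print("Text is present in the raw HTML: HTTP (Fast, JavaScript Disabled) should be enough")
else:
    print("Text is missing: it is likely rendered by scripts, use Chrome (Slow, JavaScript Enabled)")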

Download example

How to import the example into A-Parser

eJxtU01v2zAM/S9EDhsQJO1hF9/SYME6pHXXpqcgB8GmXa2ypOkjS2Hkv+/Jce2k
680kHx8fxeeWgvCv/sGx5+Ap27Zku2/KqORKRBVoSlY4zy6Vt/Rjc7fOsg0fwvdD
cKIIxgExYFsKb5bRbfbsnCwZRVkiZl1LnaK9UDEBihdnGqbjbjcljES3XxnXiDR6
Yq9nvY6h+CT2vDEoVlLxmF4huhdNYpyUInCqzqqO6MvXWTgkBlGWMkijhTpNSJuM
U5+1/NMp8sFJXQOP0En2KwhEOnBHkpJv7wq3NOliAk3s+n+deigLLvKUPNSuBLSU
Q6ESyqMiAzuBV8ttkoR8S0YvlFrzntUI6+hvolQlXn5Roem2b/wckv/HcRw2PB+F
s/x10DCwdNFNfjd2lWZtaiyuDdZWspEBsV+aqNNtrpB8ZbbDs90nWGMcD2N65n46
zGVZJw+MV1vYMXWxxsVlLpOF0ZWs895X78ioN3BwrpemsYrTXjoqhat4fhwdsvD9
GVIwCvzYvOxGXHg/GKP8z6eTVOskHPgtCWzwkudTe8pCKPX8uD6v0OgoBC8hWJ/N
5wpWi0KxmRWmmbs4p9QcuDZwFVY77ob/bvg720//vqw94mi//cMJnTZMWOTwVB4X
oez6+A9VbWHX

Parsing text with page navigation using news as an example


Results are saved to the directory aparser/results/example/textextractor, in a separate file for each query. Each file is named with the serial number of the query.

  1. Add the Check next page option and, as the regex, specify (forum\/news\/page-\d+)"[^>]+>Forward (a quick test of this regex is shown after the list).
  2. Add the Page as new query option.
  3. Change File name to example/textextractor/${query.num}.txt.
  4. As a query, specify the link to the first page of A-Parser news: https://a-parser.com/forum/news/.
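
To see what the Check next page regex from step 1 captures, you can run it against a fragment of the pagination markup. A minimal sketch in Python (the HTML snippet is a simplified, hypothetical version of the news page; in Python the slashes do not need to be escaped):

import re

# Simplified, hypothetical pagination markup
html = '<a href="forum/news/page-2" class="pageNav-next">Forward</a>'

match = re.search(r'(forum/news/page-\d+)"[^>]+>Forward', html)
if match:
    print("Next page:", match.group(1))    # forum/news/page-2
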
Download example

How to import the example into A-Parser

eJx1VN1v2jAQ/18sHjaVEtjoSx4qUVS0TRRoS58Ik6zkQj0c27UdPhTlf9/ZCQmw
7sXJne/jd7+7c0EsNVuz0GDAGhKuCqL8PwnJ44FmikMYLuFgHw9W09hKHYYzFBd0
A6RLFNUGtPNbkR/Lp+mVLVokkNKcW9ItiD0qwLBSWSaFwTuWoBi/Q7w9C7mjPHdm
X1Kp8yyKAgF7gx+F17dRlNx8jcjq9/365j7K+8PBN3d+T/15585h3513A68ZYkCa
JMxlpJyExWW6KcuYq7RPyvK/AF3ikZnB/jkHfWwRWp3DdfQtgPJmU9gBavpluV53
CTKKHJiJ1Bl1+Tpq0Ktpbi5f6Q6WEi9TxqFVT1Ca0czhgqofgUX0cKI46BQfLmFP
5FnZswd7UXGV0fWnRfEm2IdnWEi0dc4MzETLDFUubq08ntCuSMfLBEPk3ve58iFh
SrlBDgxCn1AEmlzfMAuaIsp5TSlSJMWIc09Pa+bjP+SMJzhMoxSdftaOn5vM/4lR
NuWdp9qB3mvE0ETx0sP8qfVK5FRuTmRwNw8om7HMRTUYXd/ThrOZM8ukhiZNHbnO
joukQLixaVs4Uq3qooyLtlwqYylStpljAZolcLLMxRK3dS7G0g2Cq0vknGNbDLy0
4zIydRuc0AK8dh77FAirWVFipeTm12sFVWmG43jnAGbI5HnWOmRMOX97mZ7fkHak
UHi3VpkwCOht9VD0YpkFfq/9VgfExbCwkThdWGG5bl6U5kEqPn1XwgIXlvwxi8ra
FepsUYeMGWwMCQflX6y1tO0=

Collected data

  • Text blocks scraped from the specified page
  • An array with all collected pages (used when the Use Pages option is active)

Capabilities

  • Multi-page text parsing (navigation through pages)
  • Automatic cleaning of text from HTML tags
  • Ability to set the minimum length of a text block
  • Optional removal of link anchors from text
  • Supports gzip/deflate/brotli compression
  • Detection and conversion of site encodings to UTF-8 (see the sketch after this list)
  • Bypassing CloudFlare protection
  • Choice of engine (HTTP or Chrome)
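
For the encoding detection capability, the general idea is to take the charset from the Content-Type header or from the page markup and re-decode the body to UTF-8. A minimal sketch of the principle in Python (an illustration only, not A-Parser's implementation):

import re
import requests

resp = requests.get("https://example.com/")    # any page that may use a non-UTF-8 encoding
raw = resp.content

# 1. Charset from the Content-Type header, if it explicitly names one
m = re.search(r'charset=([\w-]+)', resp.headers.get("Content-Type", ""), re.I)
charset = m.group(1) if m else None

# 2. Fall back to the <meta charset=...> declaration inside the markup
if not charset:
    m = re.search(rb'charset=["\']?([\w-]+)', raw)
    charset = m.group(1).decode() if m else "utf-8"

text = raw.decode(charset, errors="replace")    # the text is now a normal UTF-8 string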

Use cases

  • Parsing text content from any websites

Queries

As queries, you need to specify links to the pages from which you want to scrape text blocks, for example:

https://a-parser.com/

Output results examples

A-Parser supports flexible result formatting thanks to the built-in Template Toolkit, which allows results to be output in any form, including structured formats such as CSV or JSON.

Default Output

Result format:

$texts.format('$text\n')

Example of result:

Здравствуйте, Супер Команда Высочайших Профессионалов своего Дела! Спасибо за возможность изучения Испанского, Турецкого и Португальского языка! Желаю Вам дальнейшего расширения Ваших Возможностей! Вдохновения и Творчества! И просьба добавить Возможность изучения Немецкого и Французского языка!”
Использую лингвалео уже многие годы, первый раз начал заниматься еще когда приложения не было совсем, был только сайт) Спасибо разработчикам, продолжайте в том же духе, с креативом и с большой любовью к делу)
Технический английский для IT: словари, учебники, журналы
Изучай языки онлайн Изучай английский онлайн Изучай вьетнамский онлайн Изучай греческий онлайн Изучай индонезийский онлайн Изучай испанский онлайн Изучай итальянский онлайн Изучай китайский онлайн Изучай корейский онлайн Изучай немецкий онлайн Изучай нидерландский онлайн Изучай польский онлайн Изучай португальский онлайн Изучай сербский онлайн Изучай турецкий онлайн Изучай украинский онлайн Изучай французский онлайн Изучай хинди онлайн Изучай чешский онлайн Изучай японский онлайн
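
The default format iterates over the $texts array and outputs each collected block followed by a line break. A rough analogue in Python, for illustration only (the sample blocks are hypothetical):

texts = ["First text block", "Second text block"]    # hypothetical collected blocks
result = "".join(f"{block}\n" for block in texts)
print(result)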

Possible Settings

Parameter Name | Default Value | Description
Min block length | 50 | Minimum length of a text block in characters.
Skip anchor text | – | Whether to skip link anchors in the text.
Ignore tags list | – | Option to specify tags to ignore. Example: div,span,p
Good status | All | Selects which server response is considered successful. If another response is received during scraping, the request will be retried with a different proxy.
Good code RegEx | – | Ability to specify a regular expression to check the response code.
Method | GET | Request method.
POST body | – | Content to send to the server when using the POST method. Supports the variables $query (URL of the request), $query.orig (original request) and $pagenum (page number when the Use Pages option is active).
Cookies | – | Ability to specify cookies for the request.
User agent | Automatically substituted user-agent of the current version of Chrome | User-Agent header sent when requesting pages.
Additional headers | – | Ability to specify custom request headers, with support for template features and variables from the request constructor.
Read only headers | – | Read only the headers. In some cases this saves traffic when the content does not need to be processed.
Detect charset on content | – | Detect the charset based on the content of the page.
Emulate browser headers | – | Emulate browser headers.
Max redirects count | 7 | Maximum number of redirects the scraper will follow.
Max cookies count | 16 | Maximum number of cookies to save.
Bypass CloudFlare | – | Automatic bypass of CloudFlare checks.
Follow common redirects | – | Allows http <-> https and www.domain <-> domain redirects within the same domain, bypassing the Max redirects count limit.
Engine | HTTP (Fast, JavaScript Disabled) | Allows choosing between the HTTP engine (faster, without JavaScript) and Chrome (slower, JavaScript enabled).
Chrome Headless | – | If this option is enabled, the browser window will not be displayed.
Chrome DevTools | – | Allows the use of Chromium debugging tools.
Chrome Log Proxy connections | – | If this option is enabled, information about Chrome connections will be logged.
Chrome Wait Until | networkidle2 | Determines when the page is considered loaded. More about the values.
Use HTTP/2 transport | – | Determines whether to use HTTP/2 instead of HTTP/1.1. For example, Google and Majestic immediately ban requests made over HTTP/1.1.
Bypass CloudFlare with Chrome (Experimental) | – | Bypass CloudFlare checks through Chrome.
Bypass CloudFlare with Chrome Max Pages | – | Maximum number of pages when bypassing CloudFlare through Chrome.