HTML::TextExtractor - Parsing content (text) from a website

Overview

HTML::TextExtractor parses text blocks from the specified page. The parser supports multipage parsing (page switching), has built-in tools for bypassing CloudFlare protection, and can use Chrome as the engine for pages where data is loaded by scripts. It can reach speeds of up to 2,000 requests per minute, which is 120,000 links per hour.

Use cases for HTML::TextExtractor content parser

Parsing text through Chrome using lingualeo.com as an example

Case 1

  1. Add the Engine option and select Chrome (Slow, JavaScript Enabled) from the list.
  2. As the request, specify the link to the site whose text you want to parse.
Note: This option is useful when the site loads its main text with scripts as the page loads, so that the HTTP (Fast, JavaScript Disabled) engine returns an empty or incomplete result.

Download example

How to import an example into A-Parser

eJxtU01v2zAM/S9EDhsQJO1hF9/SYME6pHXXpqcgB8GmXa2ypOkjS2Hkv+/Jce2k
680kHx8fxeeWgvCv/sGx5+Ap27Zku2/KqORKRBVoSlY4zy6Vt/Rjc7fOsg0fwvdD
cKIIxgExYFsKb5bRbfbsnCwZRVkiZl1LnaK9UDEBihdnGqbjbjcljES3XxnXiDR6
Yq9nvY6h+CT2vDEoVlLxmF4huhdNYpyUInCqzqqO6MvXWTgkBlGWMkijhTpNSJuM
U5+1/NMp8sFJXQOP0En2KwhEOnBHkpJv7wq3NOliAk3s+n+deigLLvKUPNSuBLSU
Q6ESyqMiAzuBV8ttkoR8S0YvlFrzntUI6+hvolQlXn5Roem2b/wckv/HcRw2PB+F
s/x10DCwdNFNfjd2lWZtaiyuDdZWspEBsV+aqNNtrpB8ZbbDs90nWGMcD2N65n46
zGVZJw+MV1vYMXWxxsVlLpOF0ZWs895X78ioN3BwrpemsYrTXjoqhat4fhwdsvD9
GVIwCvzYvOxGXHg/GKP8z6eTVOskHPgtCWzwkudTe8pCKPX8uD6v0OgoBC8hWJ/N
5wpWi0KxmRWmmbs4p9QcuDZwFVY77ob/bvg720//vqw94mi//cMJnTZMWOTwVB4X
oez6+A9VbWHX

Parsing text with page switching using news as an example

Case 2

The results are saved in the directory aparser/results/example/textextractor, with a separate file for each request. Each file is named after the request number.

  1. Add the Check next page option and specify the regular expression (forum\/news\/page-\d+)"[^>]+>Forward.
  2. Add the Page as new query option.
  3. Change the File name to example/textextractor/${query.num}.txt.
  4. As the request, specify the link to the first news page on the A-Parser forum: https://a-parser.com/forum/news/.
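
To see what the Check next page expression from step 1 captures, the sketch below applies it in Python to a small piece of pagination markup. The markup and the names NEXT_PAGE_RE, SAMPLE_HTML and find_next_page are assumptions made for illustration; the real forum HTML and the way A-Parser applies the expression internally may differ.

```python
import re

# The "Check next page" expression from step 1, written as a Python raw string.
NEXT_PAGE_RE = re.compile(r'(forum\/news\/page-\d+)"[^>]+>Forward')

# Hypothetical pagination markup; the real forum HTML may differ.
SAMPLE_HTML = (
    '<a href="forum/news/page-1" class="pageNav-jump">Back</a>'
    '<a href="forum/news/page-3" class="pageNav-jump">Forward</a>'
)


def find_next_page(html):
    """Return the captured next-page path, or None if there is no Forward link."""
    match = NEXT_PAGE_RE.search(html)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(find_next_page(SAMPLE_HTML))
```

The capture group holds only the page path, which is the part that would be followed as the next page. Note that `[^>]+` requires at least one character between the closing quote of the link and `>`, so a link written as `href="forum/news/page-3">Forward` would not match this expression.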
Download example

How to import an example into A-Parser

eJx1VN1v2jAQ/18sHjaVEtjoSx4qUVS0TRRoS58Ik6zkQj0c27UdPhTlf9/ZCQmw
7sXJne/jd7+7c0EsNVuz0GDAGhKuCqL8PwnJ44FmikMYLuFgHw9W09hKHYYzFBd0
A6RLFNUGtPNbkR/Lp+mVLVokkNKcW9ItiD0qwLBSWSaFwTuWoBi/Q7w9C7mjPHdm
X1Kp8yyKAgF7gx+F17dRlNx8jcjq9/365j7K+8PBN3d+T/15585h3513A68ZYkCa
JMxlpJyExWW6KcuYq7RPyvK/AF3ikZnB/jkHfWwRWp3DdfQtgPJmU9gBavpluV53
CTKKHJiJ1Bl1+Tpq0Ktpbi5f6Q6WEi9TxqFVT1Ca0czhgqofgUX0cKI46BQfLmFP
5FnZswd7UXGV0fWnRfEm2IdnWEi0dc4MzETLDFUubq08ntCuSMfLBEPk3ve58iFh
SrlBDgxCn1AEmlzfMAuaIsp5TSlSJMWIc09Pa+bjP+SMJzhMoxSdftaOn5vM/4lR
NuWdp9qB3mvE0ETx0sP8qfVK5FRuTmRwNw8om7HMRTUYXd/ThrOZM8ukhiZNHbnO
joukQLixaVs4Uq3qooyLtlwqYylStpljAZolcLLMxRK3dS7G0g2Cq0vknGNbDLy0
4zIydRuc0AK8dh77FAirWVFipeTm12sFVWmG43jnAGbI5HnWOmRMOX97mZ7fkHak
UHi3VpkwCOht9VD0YpkFfq/9VgfExbCwkThdWGG5bl6U5kEqPn1XwgIXlvwxi8ra
FepsUYeMGWwMCQflX6y1tO0=

List of collected data

Hello, Super Team of Top-Class Professionals! Thank you for the opportunity to study Spanish, Turkish and Portuguese! I wish you further expansion of your capabilities! Inspiration and creativity! And a request: please add the ability to study German and French!
I have been using Lingualeo for many years. I first started studying back when there was no app at all, only the website :) Thanks to the developers, keep up the good work, with creativity and great love for what you do :)
Technical English for IT: dictionaries, textbooks, magazines
Learn languages online Learn English online Learn Vietnamese online Learn Greek online Learn Indonesian online Learn Spanish online Learn Italian online Learn Chinese online Learn Korean online Learn German online Learn Dutch online Learn Polish online Learn Portuguese online Learn Serbian online Learn Turkish online Learn Ukrainian online Learn French online Learn Hindi online Learn Czech online Learn Japanese online
  • Text blocks parsed from the specified page
  • An array with all collected pages (used when working with the Use Pages option)

HTML::TextExtractor content parser capabilities

  • Multipage text parsing (page switching)
  • Automatic text cleaning from HTML tags
  • Ability to set a minimum text block length
  • Optional removal of link anchors from text
  • Supports gzip/deflate/brotli compression
  • Detection and conversion of website encodings to UTF-8
  • Bypassing CloudFlare protection
  • Engine selection (HTTP or Chrome)
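
The tag cleaning, minimum block length and anchor removal listed above can be illustrated with a minimal Python sketch built on the standard html.parser module. This is not the parser's actual implementation: the class TextBlockExtractor, the function extract_text_blocks, the BLOCK_TAGS set and the rule that a block ends at a block-level tag are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries in this sketch (an assumption,
# not the real rule set of HTML::TextExtractor).
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "article", "section", "br"}


class TextBlockExtractor(HTMLParser):
    def __init__(self, min_length=50, skip_anchor_text=False):
        super().__init__(convert_charrefs=True)
        self.min_length = min_length
        self.skip_anchor_text = skip_anchor_text
        self.blocks = []
        self._parts = []
        self._anchor_depth = 0   # > 0 while inside <a>...</a>
        self._ignored_depth = 0  # > 0 while inside <script> or <style>

    def _flush(self):
        # Collapse whitespace and keep the block only if it is long enough.
        text = " ".join("".join(self._parts).split())
        if text and len(text) >= self.min_length:
            self.blocks.append(text)
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._ignored_depth += 1
        elif tag == "a":
            self._anchor_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._ignored_depth = max(0, self._ignored_depth - 1)
        elif tag == "a":
            self._anchor_depth = max(0, self._anchor_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._ignored_depth:
            return
        if self.skip_anchor_text and self._anchor_depth:
            return
        self._parts.append(data)

    def close(self):
        super().close()
        self._flush()


def extract_text_blocks(html, min_length=50, skip_anchor_text=False):
    """Return the text blocks of at least min_length characters found in html."""
    parser = TextBlockExtractor(min_length, skip_anchor_text)
    parser.feed(html)
    parser.close()
    return parser.blocks
```

In this sketch, min_length plays the role of the Min block length setting and skip_anchor_text the role of Skip anchor text: short fragments such as menu items are dropped, and link text is left out of a block when the flag is set.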

Usage scenarios

  • Parsing text content from any website

Examples of requests

Requests should specify links to pages from which text blocks need to be parsed, for example:

https://a-parser.com/

Possible result output formats

A-Parser supports flexible result formatting thanks to the built-in Template Toolkit template engine, which allows it to output results in any form, including structured formats such as CSV or JSON.

Possible settings

| Parameter name | Default value | Description |
|---|---|---|
| Min block length | 50 | The minimum length of a text block, in characters. |
| Skip anchor text | | Whether to skip link anchors in the text. |
| Good status | All | Selects which server response is considered successful. If parsing returns a different response from the server, the request is repeated with a different proxy. |
| Good code RegEx | - | A regular expression for checking the response code. |
| Method | GET | Request method. |
| POST body | - | Content to be passed to the server when using the POST method. Supports the variables $query (URL query), $query.orig (original query), and $pagenum (page number when using the Use Pages option). |
| Cookies | - | Cookies to send with the request. |
| User agent | Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) | User-Agent header sent when requesting pages. |
| Additional headers | - | Arbitrary request headers, with support for template engine capabilities and variables from the request builder. |
| Read only headers | | Read only the response headers. In some cases this saves traffic when the content does not need to be processed. |
| Detect charset on content | | Detect the encoding based on page content. |
| Emulate browser headers | | Emulate browser headers. |
| Max redirects count | 7 | The maximum number of redirects that the parser will follow. |
| Max cookies count | 16 | The maximum number of cookies to be saved. |
| Bypass CloudFlare | | Automatic bypass of the CloudFlare check. |
| Follow common redirects | | Allows http <-> https and www.domain <-> domain redirects within the same domain to bypass the Max redirects count limit. |
| Engine | HTTP (Fast, JavaScript Disabled) | Selects the HTTP engine (faster, without JavaScript) or Chrome (slower, with JavaScript). |
| Chrome Headless | | If enabled, the browser will not be displayed. |
| Chrome DevTools | | Allows you to use the Chromium debugging tools. |
| Chrome Log Proxy connections | | If enabled, information about Chrome connections is output to the log. |
| Chrome Wait Until | networkidle2 | Determines when the page is considered loaded. |
| Use HTTP/2 transport | | Determines whether to use HTTP/2 instead of HTTP/1.1. For example, Google and Majestic ban immediately if you use HTTP/1.1. |
| Bypass CloudFlare with Chrome (Experimental) | | Bypass CloudFlare via Chrome. |
| Bypass CloudFlare with Chrome Max Pages | - | The maximum number of pages when bypassing CloudFlare via Chrome. |
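
The interaction between Follow common redirects and Max redirects count can be pictured with a short Python sketch. It only illustrates the idea stated in the descriptions above, that http <-> https and www.domain <-> domain hops within the same domain are not charged against the limit. The function names is_common_redirect and count_redirects are hypothetical, and the sketch assumes the URL path is ignored; the parser's real rules may be stricter.

```python
from urllib.parse import urlsplit


def _strip_www(host):
    """Drop a leading 'www.' so that www.domain and domain compare equal."""
    return host[4:] if host.startswith("www.") else host


def is_common_redirect(source_url, target_url):
    """True if the hop only switches http <-> https and/or www.domain <-> domain."""
    src, dst = urlsplit(source_url), urlsplit(target_url)
    if src.scheme not in ("http", "https") or dst.scheme not in ("http", "https"):
        return False
    src_host = (src.hostname or "").lower()
    dst_host = (dst.hostname or "").lower()
    if not src_host or _strip_www(src_host) != _strip_www(dst_host):
        return False  # different domain: an ordinary redirect
    # Assumption: the path is not compared, only the scheme and the host.
    return src.scheme != dst.scheme or src_host != dst_host


def count_redirects(chain, follow_common=True):
    """Count the hops in a redirect chain that are charged against the limit."""
    counted = 0
    for source, target in zip(chain, chain[1:]):
        if follow_common and is_common_redirect(source, target):
            continue  # a common redirect does not consume the limit
        counted += 1
    return counted
```

For a chain such as http://example.com/ to https://example.com/ to https://www.example.com/ to https://www.example.com/login, only the last hop is counted when follow_common is enabled, while all three are counted when it is disabled.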