HTML::TextExtractor - Parsing content (text) from a website
Overview of the scraper
HTML::TextExtractor scrapes text blocks from the specified page. The scraper supports multi-page parsing (navigation through pages), has a built-in mechanism for bypassing CloudFlare protection, and lets you choose Chrome as the engine for scraping pages where content is loaded by scripts. It can reach speeds of up to 2000 requests per minute, i.e. 120,000 links per hour.
Use cases for the scraper
Parsing text via Chrome using lingualeo.com as an example
- Add the Engine option and select the Chrome (Slow, JavaScript Enabled) engine from the list.
- As a query, specify the link to the site from which you need to scrape the text.
note
This option is useful when the site loads the main text with scripts during page load, so with HTTP (Fast, JavaScript Disabled) the result is absent or incomplete. A quick way to check whether a page needs JavaScript rendering is shown in the sketch below.
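A minimal sketch (plain Python with the requests library, independent of A-Parser) of that check: fetch the raw HTML and see whether a phrase you can read in the browser is actually present in it. The URL and the sample phrase are placeholders.

```python
import requests

def needs_chrome_engine(url: str, sample_phrase: str) -> bool:
    """Return True if the phrase is missing from the raw HTML,
    which suggests the page renders its text with JavaScript."""
    html = requests.get(url, timeout=30).text
    return sample_phrase not in html

# Hypothetical example: check a page for a phrase visible in the browser
if needs_chrome_engine("https://lingualeo.com/", "Изучай английский"):
    print("Raw HTML lacks the text - use Chrome (Slow, JavaScript Enabled)")
else:
    print("Text is present in plain HTML - HTTP (Fast, JavaScript Disabled) is enough")
```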
Download example
How to import the example into A-Parser
eJxtU01v2zAM/S9EDhsQJO1hF9/SYME6pHXXpqcgB8GmXa2ypOkjS2Hkv+/Jce2k
680kHx8fxeeWgvCv/sGx5+Ap27Zku2/KqORKRBVoSlY4zy6Vt/Rjc7fOsg0fwvdD
cKIIxgExYFsKb5bRbfbsnCwZRVkiZl1LnaK9UDEBihdnGqbjbjcljES3XxnXiDR6
Yq9nvY6h+CT2vDEoVlLxmF4huhdNYpyUInCqzqqO6MvXWTgkBlGWMkijhTpNSJuM
U5+1/NMp8sFJXQOP0En2KwhEOnBHkpJv7wq3NOliAk3s+n+deigLLvKUPNSuBLSU
Q6ESyqMiAzuBV8ttkoR8S0YvlFrzntUI6+hvolQlXn5Roem2b/wckv/HcRw2PB+F
s/x10DCwdNFNfjd2lWZtaiyuDdZWspEBsV+aqNNtrpB8ZbbDs90nWGMcD2N65n46
zGVZJw+MV1vYMXWxxsVlLpOF0ZWs895X78ioN3BwrpemsYrTXjoqhat4fhwdsvD9
GVIwCvzYvOxGXHg/GKP8z6eTVOskHPgtCWzwkudTe8pCKPX8uD6v0OgoBC8hWJ/N
5wpWi0KxmRWmmbs4p9QcuDZwFVY77ob/bvg720//vqw94mi//cMJnTZMWOTwVB4X
oez6+A9VbWHX
Parsing text with page navigation, using the A-Parser news section as an example
Results are saved in the directory aparser/results/example/textextractor, with a separate file for each query. Each file is named after the sequential number of the query.
- Add the Check next page option and specify the regex (forum\/news\/page-\d+)"[^>]+>Forward (the sketch after this list shows how the pattern matches the "next page" link).
- Add the Page as new query option.
- Change File name to example/textextractor/${query.num}.txt.
- As a query, specify the link to the first page of A-Parser news: https://a-parser.com/forum/news/.
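A minimal sketch (Python's re module; the HTML fragment is made up for illustration and is not taken from the actual forum markup) showing how the Check next page regex captures the relative URL of the next page:

```python
import re

# Same pattern as in the preset (where the slash is escaped as \/)
NEXT_PAGE_RE = re.compile(r'(forum/news/page-\d+)"[^>]+>Forward')

# Hypothetical fragment of a forum page with a "Forward" pagination link
html = '<a href="forum/news/page-2" class="pageNav-jump pageNav-jump--next">Forward</a>'

match = NEXT_PAGE_RE.search(html)
if match:
    # The captured group becomes the next query when "Page as new query" is enabled
    print("Next page:", "https://a-parser.com/" + match.group(1))
```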
Download example
How to import the example into A-Parser
eJx1VN1v2jAQ/18sHjaVEtjoSx4qUVS0TRRoS58Ik6zkQj0c27UdPhTlf9/ZCQmw
7sXJne/jd7+7c0EsNVuz0GDAGhKuCqL8PwnJ44FmikMYLuFgHw9W09hKHYYzFBd0
A6RLFNUGtPNbkR/Lp+mVLVokkNKcW9ItiD0qwLBSWSaFwTuWoBi/Q7w9C7mjPHdm
X1Kp8yyKAgF7gx+F17dRlNx8jcjq9/365j7K+8PBN3d+T/15585h3513A68ZYkCa
JMxlpJyExWW6KcuYq7RPyvK/AF3ikZnB/jkHfWwRWp3DdfQtgPJmU9gBavpluV53
CTKKHJiJ1Bl1+Tpq0Ktpbi5f6Q6WEi9TxqFVT1Ca0czhgqofgUX0cKI46BQfLmFP
5FnZswd7UXGV0fWnRfEm2IdnWEi0dc4MzETLDFUubq08ntCuSMfLBEPk3ve58iFh
SrlBDgxCn1AEmlzfMAuaIsp5TSlSJMWIc09Pa+bjP+SMJzhMoxSdftaOn5vM/4lR
NuWdp9qB3mvE0ETx0sP8qfVK5FRuTmRwNw8om7HMRTUYXd/ThrOZM8ukhiZNHbnO
joukQLixaVs4Uq3qooyLtlwqYylStpljAZolcLLMxRK3dS7G0g2Cq0vknGNbDLy0
4zIydRuc0AK8dh77FAirWVFipeTm12sFVWmG43jnAGbI5HnWOmRMOX97mZ7fkHak
UHi3VpkwCOht9VD0YpkFfq/9VgfExbCwkThdWGG5bl6U5kEqPn1XwgIXlvwxi8ra
FepsUYeMGWwMCQflX6y1tO0=
Collected data
- Text blocks from the specified page
- An array with all collected pages (used when the Use Pages option is active); an illustrative result shape is sketched below
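A hypothetical Python model of that result, for illustration only — it mirrors the two items listed above and is not A-Parser's internal data structure:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TextExtractorResult:
    texts: List[str] = field(default_factory=list)  # extracted text blocks
    pages: List[str] = field(default_factory=list)  # visited pages (Use Pages option)

result = TextExtractorResult(
    texts=["First text block longer than the minimum length...",
           "Second text block..."],
    pages=["https://a-parser.com/forum/news/",
           "https://a-parser.com/forum/news/page-2"],
)
print(len(result.texts), "text blocks collected from", len(result.pages), "pages")
```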
Capabilities
- Multi-page text parsing (navigation through pages)
- Automatic cleaning of text from HTML tags
- Ability to set the minimum length of a text block
- Optional removal of link anchors from text
- Supports gzip/deflate/brotli compression
- Detection and conversion of site encodings to UTF-8
- Bypassing CloudFlare protection
- Choice of engine (HTTP or Chrome)
Use cases
- Parsing text content from any website
Queries
As queries, you need to specify links to the pages from which you want to scrape text blocks, for example:
https://a-parser.com/
Output results examples
A-Parser supports flexible result formatting thanks to the built-in Template Toolkit, which allows it to output results in any form, including structured formats such as CSV or JSON.
Default Output
Result format:
$texts.format('$text\n')
Example of result:
Здравствуйте, Супер Команда Высочайших Профессионалов своего Дела! Спасибо за возможность изучения Испанского, Турецкого и Португальского языка! Желаю Вам дальнейшего расширения Ваших Возможностей! Вдохновения и Творчества! И просьба добавить Возможность изучения Немецкого и Французского языка!”
Использую лингвалео уже многие годы, первый раз начал заниматься еще когда приложения не было совсем, был только сайт) Спасибо разработчикам, продолжайте в том же духе, с креативом и с большой любовью к делу)
Технический английский для IT: словари, учебники, журналы
Изучай языки онлайн Изучай английский онлайн Изучай вьетнамский онлайн Изучай греческий онлайн Изучай индонезийский онлайн Изучай испанский онлайн Изучай итальянский онлайн Изучай китайский онлайн Изучай корейский онлайн Изучай немецкий онлайн Изучай нидерландский онлайн Изучай польский онлайн Изучай португальский онлайн Изучай сербский онлайн Изучай турецкий онлайн Изучай украинский онлайн Изучай французский онлайн Изучай хинди онлайн Изучай чешский онлайн Изучай японский онлайн
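The default format above emits one text block per line. Structured output (CSV, JSON) can be produced directly with the result format, as noted above; as an alternative, here is a minimal Python sketch (the file path is hypothetical, matching the per-query files from the example preset) that converts a saved result file into JSON:

```python
import json
from pathlib import Path

# Hypothetical path: one result file per query, as in the example preset above
result_file = Path("aparser/results/example/textextractor/1.txt")

# Each non-empty line of the default output is a separate text block
blocks = [line for line in result_file.read_text(encoding="utf-8").splitlines()
          if line.strip()]

# Re-serialize the blocks as JSON for downstream processing
print(json.dumps({"query_num": 1, "texts": blocks}, ensure_ascii=False, indent=2))
```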
Possible Settings
Parameter Name | Default Value | Description |
---|---|---|
Min block length | 50 | Minimum length of a text block in characters. |
Skip anchor text | ☐ | Whether to skip link anchor text. |
Ignore tags list | | List of tags to ignore. Example: div,span,p |
Good status | All | Which server responses are considered successful. If another response is received during scraping, the request is retried with a different proxy. |
Good code RegEx | | Regular expression to check the response code against. |
Method | GET | Request method. |
POST body | | Content to send to the server when using the POST method. Supports the variables $query (URL of the request), $query.orig (original request), and $pagenum (page number when the Use Pages option is active). |
Cookies | | Cookies to send with the request. |
User agent | _Automatically substituted user agent of the current Chrome version_ | User-Agent header sent when requesting pages. |
Additional headers | | Custom request headers; supports template features and variables from the request constructor. |
Read only headers | ☐ | Read only the response headers. In some cases this saves traffic when the content does not need to be processed. |
Detect charset on content | ☐ | Detect the charset based on the page content. |
Emulate browser headers | ☐ | Emulate browser headers. |
Max redirects count | 7 | Maximum number of redirects the scraper will follow. |
Max cookies count | 16 | Maximum number of cookies to store. |
Bypass CloudFlare | ☑ | Automatic bypass of CloudFlare checks. |
Follow common redirects | ☑ | Allows http <-> https and www.domain <-> domain redirects within the same domain, bypassing the Max redirects count limit. |
Engine | HTTP (Fast, JavaScript Disabled) | Choice between the HTTP engine (faster, without JavaScript) and Chrome (slower, with JavaScript enabled). |
Chrome Headless | ☐ | If enabled, the browser window is not displayed. |
Chrome DevTools | ☑ | Allows the use of Chromium debugging tools. |
Chrome Log Proxy connections | ☑ | If enabled, information about Chrome proxy connections is logged. |
Chrome Wait Until | networkidle2 | Determines when the page is considered loaded. More about the possible values. |
Use HTTP/2 transport | ☐ | Whether to use HTTP/2 instead of HTTP/1.1. For example, Google and Majestic immediately ban requests made over HTTP/1.1. |
Bypass CloudFlare with Chrome (Experimental) | ☐ | Bypass CloudFlare checks through Chrome. |
Bypass CloudFlare with Chrome Max Pages | | Maximum number of pages when bypassing CloudFlare through Chrome. |
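To make the effect of Min block length and Ignore tags list more concrete, here is a rough, simplified Python sketch of the idea (not A-Parser's actual algorithm; the default ignored tags and the sample HTML are illustrative only): tags are stripped, the page is split into blocks, and only blocks longer than the threshold are kept.

```python
import re

def extract_text_blocks(html: str, min_block_length: int = 50,
                        ignore_tags: tuple = ("script", "style")) -> list:
    """Simplified illustration of Min block length / Ignore tags list.
    Not A-Parser's implementation - just the general idea."""
    # Drop the contents of ignored tags entirely
    for tag in ignore_tags:
        html = re.sub(rf"<{tag}\b[^>]*>.*?</{tag}>", " ", html,
                      flags=re.IGNORECASE | re.DOTALL)
    # Split on block-level boundaries, then strip the remaining tags
    raw_blocks = re.split(r"</?(?:p|div|br|li|h[1-6])\b[^>]*>", html,
                          flags=re.IGNORECASE)
    blocks = []
    for raw in raw_blocks:
        text = re.sub(r"<[^>]+>", " ", raw)       # remove inline tags
        text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
        if len(text) >= min_block_length:         # enforce Min block length
            blocks.append(text)
    return blocks

sample = "<div><p>Short.</p><p>" + "A much longer paragraph of text " * 3 + "</p></div>"
print(extract_text_blocks(sample))
```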