
HTML::TextExtractor - Parsing content (text) from a website

Overview of the scraper

HTML::TextExtractor scrapes text blocks from the specified page. This content scraper supports multi-page parsing (navigation through pages). It has built-in means to bypass CloudFlare protection, as well as the option to use Chrome as the engine for scraping pages where data is loaded by scripts. It is capable of reaching speeds of up to 2000 requests per minute, which is 120,000 links per hour.

Use cases for the scraper

Parsing text via Chrome using lingualeo as an example

  1. Add the Engine option, from the list select the Chrome (Slow, JavaScript Enabled) engine.
  2. As a query, specify the link to the site from which you need to scrape the text.

This option is useful when the site loads its main text with scripts during page load, so that with HTTP (Fast, JavaScript Disabled) the result is absent or incomplete.
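A minimal sketch of why the HTTP engine can come back empty on such pages: the text only exists after a script runs, so it is simply not present in the raw HTML that a script-free fetch sees. The page markup below is hypothetical, and the extraction function is a toy stand-in, not A-Parser's actual extractor.

```python
import re

# Hypothetical page markup: the visible text is injected by a script at
# load time, so it is absent from the raw HTML an HTTP engine downloads.
raw_html = """
<html><body>
  <div id="content"></div>
  <script>
    document.getElementById('content').textContent = 'Text loaded by JavaScript';
  </script>
</body></html>
"""

def extract_text(html: str) -> str:
    """Strip scripts and tags the way a simple HTTP-based extractor would."""
    html = re.sub(r"<script.*?</script>", "", html, flags=re.S)  # drop script bodies
    return re.sub(r"<[^>]+>", "", html).strip()                  # drop remaining tags

print(repr(extract_text(raw_html)))  # '' — the JS-injected text never appears
```

A Chrome-based engine, by contrast, executes the script first and extracts text from the rendered DOM, which is why switching the Engine option recovers the missing content.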

Download example

How to import the example into A-Parser


Parsing text with page navigation using news as an example


Results are saved in the aparser/results/example/textextractor directory, in a separate file for each query. The file name is the serial number of the query.

  1. Add the Check next page option, as a regex specify (forum\/news\/page-\d+)"[^>]+>Forward.
  2. Add the Page as new query option.
  3. Change File name to example/textextractor/${query.num}.txt.
  4. As a query, specify the link to the first page of A-Parser news:
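The Check next page regex from step 1 captures the relative path of the next news page from the "Forward" pagination link. The steps above can be sketched like this (the HTML fragment is an illustrative stand-in, not copied from the real site):

```python
import re

# The "Check next page" regex from step 1, with the same escaping as in the option.
NEXT_PAGE_RE = re.compile(r'(forum\/news\/page-\d+)"[^>]+>Forward')

# Hypothetical fragment of a news-list page footer.
html = '<a href="https://a-parser.com/forum/news/page-2" class="pageNav-jump">Forward</a>'

m = NEXT_PAGE_RE.search(html)
if m:
    next_path = m.group(1)  # relative path captured by the first group
    print(next_path)        # forum/news/page-2
```

When the regex matches, the Page as new query option from step 2 feeds the captured path back in as the next query, which is how the scraper walks through all pages of the news feed.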
Download example

How to import the example into A-Parser


Collected data

  • Text blocks from the specified page
  • An array with all collected pages (used when the Use Pages option is active)


Capabilities

  • Multi-page text parsing (navigation through pages)
  • Automatic cleaning of text from HTML tags
  • Ability to set the minimum length of a text block
  • Optional removal of link anchors from text
  • Supports gzip/deflate/brotli compression
  • Detection and conversion of site encodings to UTF-8
  • Bypassing CloudFlare protection
  • Choice of engine (HTTP or Chrome)
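A minimal Python sketch of the core extraction logic listed above — tag stripping, a minimum text-block length, and optional skipping of link anchors. The class, thresholds, and sample HTML are illustrative assumptions, not A-Parser's actual implementation.

```python
from html.parser import HTMLParser

class TextBlockExtractor(HTMLParser):
    """Toy text-block extractor: strips HTML tags, optionally skips link
    anchors, and drops blocks shorter than min_length characters."""

    def __init__(self, min_length=50, skip_anchors=True,
                 ignore_tags=("script", "style")):
        super().__init__()
        self.min_length = min_length
        self.skip_anchors = skip_anchors
        self.ignore_tags = set(ignore_tags)
        self.blocks = []
        self._skip_depth = 0  # >0 while inside an ignored tag or an anchor

    def handle_starttag(self, tag, attrs):
        if tag in self.ignore_tags or (self.skip_anchors and tag == "a"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.ignore_tags or (self.skip_anchors and tag == "a"):
            self._skip_depth = max(0, self._skip_depth - 1)

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0 and len(text) >= self.min_length:
            self.blocks.append(text)

html = (
    "<p>This paragraph is long enough to pass the minimum block length filter.</p>"
    "<p>Too short.</p>"
    "<a href='/x'>An anchor whose text is skipped even though it is quite long.</a>"
)
p = TextBlockExtractor(min_length=50)
p.feed(html)
print(p.blocks)  # only the first paragraph survives
```

The Min block length, Skip anchor text, and Ignore tags list settings described below control exactly these three filters in the real scraper.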

Use cases

  • Parsing text content from any website


As queries, specify links to the pages from which you want to scrape text blocks, for example:

Output results examples

A-Parser supports flexible result formatting thanks to the built-in Template Toolkit, which allows outputting results in any form, including structured formats such as CSV or JSON.
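In A-Parser itself this formatting is done with Template Toolkit templates; the Python sketch below only illustrates what the two structured shapes look like for hypothetical collected data (the record fields and URLs are assumptions for the example).

```python
import csv
import io
import json

# Hypothetical collected data: one record per query, each with its text blocks.
results = [
    {"query": "https://example.com/page1",
     "blocks": ["First text block", "Second text block"]},
    {"query": "https://example.com/page2",
     "blocks": ["Another block"]},
]

# JSON output: one object per line, convenient for downstream processing.
json_lines = "\n".join(json.dumps(r, ensure_ascii=False) for r in results)

# CSV output: one row per (query, block) pair.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["query", "block"])
for r in results:
    for block in r["blocks"]:
        writer.writerow([r["query"], block])

print(json_lines)
print(buf.getvalue())
```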

Default Output

Result format:


Example of result:

Здравствуйте, Супер Команда Высочайших Профессионалов своего Дела! Спасибо за возможность изучения Испанского, Турецкого и Португальского языка! Желаю Вам дальнейшего расширения Ваших Возможностей! Вдохновения и Творчества! И просьба добавить Возможность изучения Немецкого и Французского языка!”
Использую лингвалео уже многие годы, первый раз начал заниматься еще когда приложения не было совсем, был только сайт) Спасибо разработчикам, продолжайте в том же духе, с креативом и с большой любовью к делу)
Технический английский для IT: словари, учебники, журналы
Изучай языки онлайн Изучай английский онлайн Изучай вьетнамский онлайн Изучай греческий онлайн Изучай индонезийский онлайн Изучай испанский онлайн Изучай итальянский онлайн Изучай китайский онлайн Изучай корейский онлайн Изучай немецкий онлайн Изучай нидерландский онлайн Изучай польский онлайн Изучай португальский онлайн Изучай сербский онлайн Изучай турецкий онлайн Изучай украинский онлайн Изучай французский онлайн Изучай хинди онлайн Изучай чешский онлайн Изучай японский онлайн

Possible Settings

| Parameter Name | Default Value | Description |
| --- | --- | --- |
| Min block length | 50 | Minimum length of a text block in characters. |
| Skip anchor text | | Whether to skip link anchors in the text. |
| Ignore tags list | | Ability to specify tags to ignore. Example: div,span,p |
| Good status | All | Selects which server response is considered successful. If another response is received during scraping, the request will be retried with a different proxy. |
| Good code RegEx | | Ability to specify a regular expression to check the response code. |
| Method | GET | Request method. |
| POST body | | Content to send to the server when using the POST method. Supports the variables $query (URL of the request), $query.orig (original request), and $pagenum (page number when the Use Pages option is active). |
| Cookies | | Ability to specify cookies for the request. |
| User agent | Automatically substituted user-agent of the current version of Chrome | User-Agent header to send when requesting pages. |
| Additional headers | | Ability to specify custom request headers, with support for template features and variables from the request constructor. |
| Read only headers | | Read only the response headers. In some cases this saves traffic when the content does not need to be processed. |
| Detect charset on content | | Detect the charset based on the content of the page. |
| Emulate browser headers | | Emulate browser headers. |
| Max redirects count | 7 | Maximum number of redirects the scraper will follow. |
| Max cookies count | 16 | Maximum number of cookies to save. |
| Bypass CloudFlare | | Automatic bypass of CloudFlare checks. |
| Follow common redirects | | Allows http <-> https and www.domain <-> domain redirects within the same domain, bypassing the Max redirects count limit. |
| Engine | HTTP (Fast, JavaScript Disabled) | Choice between the HTTP engine (faster, JavaScript disabled) and Chrome (slower, JavaScript enabled). |
| Chrome Headless | | If this option is enabled, the browser window will not be displayed. |
| Chrome DevTools | | Allows the use of Chromium debugging tools. |
| Chrome Log Proxy connections | | If this option is enabled, information about Chrome proxy connections will be logged. |
| Chrome Wait Until | networkidle2 | Determines when the page is considered loaded. More about the values. |
| Use HTTP/2 transport | | Determines whether to use HTTP/2 instead of HTTP/1.1. For example, Google and Majestic immediately ban requests made over HTTP/1.1. |
| Bypass CloudFlare with Chrome (Experimental) | | Bypass CloudFlare checks through Chrome. |
| Bypass CloudFlare with Chrome Max Pages | | Maximum number of pages when bypassing CloudFlare through Chrome. |
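The POST body option substitutes the variables named in the table above into the request body. A rough sketch of that substitution, assuming simple string replacement (the function and template are hypothetical; A-Parser's actual template engine is Template Toolkit and is more capable):

```python
# Hypothetical sketch of POST body variable substitution: $query, $query.orig,
# and $pagenum are the variables named in the table above.
def render_post_body(template: str, query: str, query_orig: str, pagenum: int) -> str:
    # Replace the longer name first so "$query.orig" is not clobbered by "$query".
    return (template
            .replace("$query.orig", query_orig)
            .replace("$query", query)
            .replace("$pagenum", str(pagenum)))

body = render_post_body("url=$query&page=$pagenum",
                        "https://example.com/news", "example.com/news", 3)
print(body)  # url=https://example.com/news&page=3
```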