
HTML::TextExtractor - Parsing content (text) from a website

HTML::TextExtractor content parser overview


HTML::TextExtractor parses text blocks from the specified page. This content parser supports multipage parsing (page switching). It has built-in tools for bypassing CloudFlare protection and can use Chrome as the engine for pages where data is loaded by scripts. It can reach speeds of up to 2000 requests per minute, which is 120,000 links per hour.

Use cases for HTML::TextExtractor content parser

Parsing text through Chrome: an example

Case 1

  1. Add the Engine option, select the Chrome (Slow, JavaScript Enabled) engine from the list.
  2. Specify the link to the site from which you want to parse the text as a request.

This option is useful when the site loads its main text with scripts as the page loads, so that the HTTP (Fast, JavaScript Disabled) engine returns an absent or incomplete result.


Parsing text with page switching using news as an example

Case 2

The results are saved in the directory aparser/results/example/textextractor, in a separate file for each request. The file name contains the request number.

  1. Add the Check next page option, specify the regular expression (forum\/news\/page-\d+)"[^>]+>Forward.
  2. Add the Page as new query option.
  3. Change the File name to example/textextractor/${query.num}.txt.
  4. Specify the link to the first page with news on A-Parser as a request:
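
To see what the regular expression from step 1 captures, it can be tried outside A-Parser. The sketch below uses Python's re module and two pagination links invented for illustration; the real markup of the news page may differ. It only shows how the pattern behaves and is not how A-Parser evaluates the Check next page option internally.

```python
import re

# Regular expression from step 1, copied as-is.
NEXT_PAGE = re.compile(r'(forum\/news\/page-\d+)"[^>]+>Forward')

# Hypothetical pagination markup, invented for this illustration.
html_with_attrs = '<a href="/forum/news/page-2" class="next">Forward</a>'
html_bare = '<a href="/forum/news/page-2">Forward</a>'


def next_page(html):
    """Return the captured next-page path, or None when nothing matches."""
    match = NEXT_PAGE.search(html)
    return match.group(1) if match else None


print(next_page(html_with_attrs))  # forum/news/page-2
print(next_page(html_bare))        # None
```

The second link does not match because `[^>]+` requires at least one character between the closing quote of the href value and the `>` of the tag, so the pattern expects the link to carry further attributes after href.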

List of collected data

Hello, Super Team of the Highest Professionals in your Field! Thank you for the opportunity to study Spanish, Turkish and Portuguese! I wish you further expansion of your Capabilities! Inspiration and Creativity! And a request: please add the Ability to study German and French!
I have been using Lingualeo for many years; I first started studying back when there was no app at all, only the website :) Thanks to the developers, keep up the good work, with creativity and great love for what you do :)
Technical English for IT: dictionaries, textbooks, magazines
Learn languages online Learn English online Learn Vietnamese online Learn Greek online Learn Indonesian online Learn Spanish online Learn Italian online Learn Chinese online Learn Korean online Learn German online Learn Dutch online Learn Polish online Learn Portuguese online Learn Serbian online Learn Turkish online Learn Ukrainian online Learn French online Learn Hindi online Learn Czech online Learn Japanese online
  • Text blocks parsed from the specified page
  • An array with all collected pages (used when working with the Use Pages option)

HTML::TextExtractor content parser capabilities

  • Multipage text parsing (page switching)
  • Automatic text cleaning from HTML tags
  • Ability to set a minimum text block length
  • Optional removal of link anchors from text
  • Supports gzip/deflate/brotli compression
  • Detection and conversion of website encodings to UTF-8
  • Bypassing CloudFlare protection
  • Engine selection (HTTP or Chrome)
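
Three of the capabilities above (cleaning text from HTML tags, a minimum block length, and optional removal of link anchors) can be illustrated with a small standalone sketch. It uses Python's standard html.parser module and is not A-Parser's implementation: the class and function names, the set of tags treated as block boundaries, and the sample markup are all assumptions made for this example.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit text blocks; the real parser may use other rules.
BLOCK_TAGS = {"p", "div", "li", "td", "br", "h1", "h2", "h3", "h4", "h5", "h6"}
SKIPPED_TAGS = {"script", "style"}


class BlockExtractor(HTMLParser):
    """Collects tag-free text blocks that meet a minimum length."""

    def __init__(self, min_block_length=50, skip_anchor_text=False):
        super().__init__(convert_charrefs=True)
        self.min_block_length = min_block_length
        self.skip_anchor_text = skip_anchor_text
        self.blocks = []
        self._buffer = []
        self._skip_depth = 0    # nesting level inside <script>/<style>
        self._anchor_depth = 0  # nesting level inside <a>

    def flush(self):
        # Join buffered pieces and collapse runs of whitespace.
        text = " ".join("".join(self._buffer).split())
        self._buffer = []
        if text and len(text) >= self.min_block_length:
            self.blocks.append(text)

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            self._anchor_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "a" and self._anchor_depth:
            self._anchor_depth -= 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip_depth:
            return
        if self.skip_anchor_text and self._anchor_depth:
            return
        self._buffer.append(data)


def extract_blocks(html, min_block_length=50, skip_anchor_text=False):
    parser = BlockExtractor(min_block_length, skip_anchor_text)
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any text left after the last block tag
    return parser.blocks


# Hypothetical page, invented for this illustration.
sample = (
    "<html><body><style>p {color: red}</style>"
    "<p>Short.</p>"
    "<p>This paragraph is long enough to pass the length filter, "
    'and it has <a href="/more">a link anchor</a> inside.</p>'
    "</body></html>"
)

for block in extract_blocks(sample, min_block_length=50, skip_anchor_text=True):
    print(block)
```

With these inputs the short paragraph is dropped by the length filter and the anchor text is left out of the remaining block; passing skip_anchor_text=False keeps the words "a link anchor" in it.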

Usage scenarios

  • Parsing text content from any website

Examples of requests

Requests should specify links to pages from which text blocks need to be parsed, for example:

Possible result output formats

A-Parser supports flexible result formatting thanks to the built-in Template Toolkit template engine, which allows it to output results in any form, as well as in a structured form, such as CSV or JSON.
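
As a rough illustration of what "structured form" can mean for text blocks, the sketch below serializes a list of blocks to JSON and to CSV. It is plain Python and not a Template Toolkit template; the query URL, the field names query and texts, and the sample blocks are hypothetical and are not taken from A-Parser.

```python
import csv
import io
import json

# Hypothetical data: one query URL and the text blocks collected for it.
query = "https://example.com/news"
texts = ["First text block.", 'Second block, with a comma and a "quote".']

# JSON: one object per query, holding all of its text blocks.
as_json = json.dumps({"query": query, "texts": texts}, ensure_ascii=False)

# CSV: one row per text block; the csv module takes care of quoting.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["query", "text"])
for text in texts:
    writer.writerow([query, text])
as_csv = buffer.getvalue()

print(as_json)
print(as_csv)
```

JSON keeps all blocks of one query together in a single record, while CSV repeats the query on each row, which is convenient for spreadsheets.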

Possible settings

Parameter name | Default value | Description
Min block length | 50 | The minimum length of a text block in characters.
Skip anchor text |  | Whether to skip anchors in the text.
Good status | All | Select which server response will be considered successful. If parsing returns a different response from the server, the request will be repeated with a different proxy.
Good code RegEx | - | Ability to specify a regular expression to check the response code.
Method | GET | Request method.
POST body | - | Content to be passed to the server when using the POST method. Supports the variables $query (URL query), $query.orig (original query), and $pagenum (page number when using the Use Pages option).
Cookies | - | Ability to specify cookies for the request.
User agent | Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) | User-Agent header when requesting pages.
Additional headers | - | Ability to specify arbitrary request headers with support for template engine capabilities and use of variables from the request builder.
Read only headers |  | Read only headers. In some cases, it allows you to save traffic if there is no need to process content.
Detect charset on content |  | Detect encoding based on page content.
Emulate browser headers |  | Emulate browser headers.
Max redirects count | 7 | The maximum number of redirects that the parser will follow.
Max cookies count | 16 | The maximum number of cookies to be saved.
Bypass CloudFlare |  | Automatic bypass of the CloudFlare check.
Follow common redirects |  | Allows http <-> https and www.domain <-> domain redirects within the same domain to bypass the Max redirects count limit.
Engine | HTTP (Fast, JavaScript Disabled) | Allows you to choose the HTTP engine (faster, without JavaScript) or Chrome (slower, with JavaScript).
Chrome Headless |  | If enabled, the browser will not be displayed.
Chrome DevTools |  | Allows you to use Chromium debugging tools.
Chrome Log Proxy connections |  | If enabled, information about Chrome connections will be output to the log.
Chrome Wait Until | networkidle2 | Determines when the page is considered loaded.
Use HTTP/2 transport |  | Determines whether to use HTTP/2 instead of HTTP/1.1. For example, Google and Majestic ban requests made over HTTP/1.1 immediately.
Bypass CloudFlare with Chrome (Experimental) |  | Bypass CF via Chrome.
Bypass CloudFlare with Chrome Max Pages | - | Maximum number of pages when bypassing CF via Chrome.
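
The Follow common redirects description can be read as a simple predicate over a pair of URLs. The sketch below is one possible reading, not A-Parser's actual logic: the function names are hypothetical, and it assumes that only the scheme and a leading "www." are compared while the path is ignored.

```python
from urllib.parse import urlsplit


def strip_www(host):
    """Drop a leading 'www.' so that www.domain and domain compare equal."""
    return host[4:] if host.startswith("www.") else host


def is_common_redirect(source_url, target_url):
    """True for http <-> https or www.domain <-> domain hops on the same domain."""
    src, dst = urlsplit(source_url), urlsplit(target_url)
    if src.hostname is None or dst.hostname is None:
        return False
    same_domain = strip_www(src.hostname) == strip_www(dst.hostname)
    both_web = {src.scheme, dst.scheme} <= {"http", "https"}
    scheme_changed = src.scheme != dst.scheme
    www_changed = src.hostname != dst.hostname
    return same_domain and both_web and (scheme_changed or www_changed)


print(is_common_redirect("http://example.com/a", "https://example.com/a"))      # True
print(is_common_redirect("https://example.com/", "https://www.example.com/"))   # True
print(is_common_redirect("https://example.com/", "https://other.example.org/")) # False
```

Under this reading, a redirect chain such as http://example.com to https://example.com to https://www.example.com would not count against Max redirects count, while a hop to a different domain would.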