HTML::TextExtractor::LangDetect - Page language detection

HTML::TextExtractor::LangDetect parser overview

HTML::TextExtractor::LangDetect detects the language of a website and reports the detection accuracy as a percentage. It supports multi-page parsing and can follow a website's internal pages to a specified depth, which lets you walk every page of the site while collecting internal and external links. It has built-in tools for bypassing CloudFlare protection and can use Chrome as the engine for parsing pages whose data is loaded by scripts. It can reach speeds of up to 2,000 requests per minute, which is 120,000 links per hour.

List of collected data

  • Website language
  • Detection accuracy in %

Example results:

http://vk.com/: RUSSIAN
http://a-parser.com/: RUSSIAN
http://yandex.ru/: RUSSIAN
http://youtube.com/: ENGLISH
http://google.com/: ENGLISH
http://facebook.com/: ENGLISH

Features

  • Multi-page parsing (navigation through pages)
  • Supports gzip/deflate/brotli compression
  • Detects and converts website encodings to UTF-8
  • Bypasses CloudFlare protection
  • Choice of engine (HTTP or Chrome)
  • Detects website language without using third-party services
  • Accuracy of detection in %
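
To make "detects website language without using third-party services" and "accuracy of detection in %" more concrete, here is a minimal sketch of one common offline technique: counting stopword hits per language and reporting the winner's share of all hits as a percentage. This page does not describe the algorithm A-Parser actually uses, so the technique, the tiny word lists, and the function name `detect_language` are all assumptions made for illustration.

```python
import re

# Tiny illustrative stopword lists (assumed for this demo). A practical
# detector would use much larger profiles or character n-grams.
STOPWORDS = {
    "ENGLISH": {"the", "and", "of", "to", "in", "is", "that", "for", "with", "on"},
    "RUSSIAN": {"и", "в", "не", "на", "что", "с", "по", "это", "как", "для"},
}


def detect_language(text):
    """Return (language, confidence_percent), or (None, 0.0) if nothing matched."""
    words = re.findall(r"\w+", text.lower())
    # Count how many words of the text appear in each language's stopword list.
    hits = {
        lang: sum(1 for word in words if word in vocab)
        for lang, vocab in STOPWORDS.items()
    }
    total = sum(hits.values())
    if total == 0:
        return None, 0.0
    best = max(hits, key=hits.get)
    # Confidence is the winner's share of all stopword hits, in percent.
    return best, round(100.0 * hits[best] / total, 1)


if __name__ == "__main__":
    print(detect_language("The cat is on the roof of the house"))
    print(detect_language("Это пример текста на русском языке и не только"))
```

The percentage here is only a relative score between the candidate languages; it says nothing about how the parser itself computes its accuracy figure.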

Use cases

  • Selection of domains with a certain content language
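
As a rough illustration of this use case, the sketch below filters saved results down to the URLs detected as one language. It assumes each result line has the `URL: LANGUAGE` shape shown in the example results on this page; the function name `select_by_language` is hypothetical and is not part of A-Parser.

```python
def select_by_language(result_lines, wanted):
    """Return the URLs whose detected language equals `wanted` (case-insensitive)."""
    selected = []
    for line in result_lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last ": " so the colon inside "http://" is left alone.
        url, sep, lang = line.rpartition(": ")
        if sep and lang.strip().upper() == wanted.upper():
            selected.append(url)
    return selected


if __name__ == "__main__":
    results = [
        "http://vk.com/: RUSSIAN",
        "http://a-parser.com/: RUSSIAN",
        "http://yandex.ru/: RUSSIAN",
        "http://youtube.com/: ENGLISH",
        "http://google.com/: ENGLISH",
        "http://facebook.com/: ENGLISH",
    ]
    print(select_by_language(results, "RUSSIAN"))
```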

Query examples

A list of websites should be specified as queries, for example:

http://a-parser.com/
http://yandex.ru/
http://google.com/
http://vk.com/
http://facebook.com/
http://youtube.com/

Possible result output formats

A-Parser supports flexible result formatting thanks to the built-in Template Toolkit template engine, which allows it to output results in any form, as well as in structured formats such as CSV or JSON.
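
The sketch below shows what the same results could look like as CSV and as JSON. It is not Template Toolkit syntax and does not use any A-Parser template variables; it only illustrates the two target shapes with Python's standard library. The field names `url` and `language` are assumptions.

```python
import csv
import io
import json

# Example records; the field names are assumed for illustration.
RECORDS = [
    {"url": "http://yandex.ru/", "language": "RUSSIAN"},
    {"url": "http://google.com/", "language": "ENGLISH"},
]


def to_csv(rows):
    """Serialize rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["url", "language"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_json(rows):
    """Serialize rows as a pretty-printed JSON array."""
    return json.dumps(rows, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    print(to_csv(RECORDS))
    print(to_json(RECORDS))
```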

Possible settings

| Parameter name | Default value | Description |
|---|---|---|
| Good status | All | Select which server response will be considered successful. If a different response is received during parsing, the request will be repeated with a different proxy. |
| Good code RegEx | - | Ability to specify a regular expression to check the response code. |
| Method | GET | Request method. |
| POST body | - | Content to be passed to the server when using the POST method. Supports variables $query - URL query, $query.orig - original query, and $pagenum - page number when using the Use Pages option. |
| Cookies | - | Ability to specify cookies for the request. |
| User agent | Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) | User-Agent header when requesting pages. |
| Additional headers | - | Ability to specify arbitrary request headers with support for template engine capabilities and use of variables from the request builder. |
| Read only headers | | Read only headers. In some cases, it allows you to save traffic if there is no need to process content. |
| Detect charset on content | | Detect encoding based on page content. |
| Emulate browser headers | | Emulate browser headers. |
| Max redirects count | 7 | Maximum number of redirects to follow. |
| Max cookies count | 16 | Maximum number of cookies to save. |
| Bypass CloudFlare | | Automatic CloudFlare check bypass. |
| Follow common redirects | | Allows http <-> https and www.domain <-> domain redirects within the same domain to be made bypassing the Max redirects count limit. |
| Engine | HTTP (Fast, JavaScript Disabled) | Allows you to choose the HTTP engine (faster, without JavaScript) or Chrome (slower, with JavaScript) |
| Chrome Headless | | If enabled, the browser will not be displayed. |
| Chrome DevTools | | Allows you to use Chromium debugging tools. |
| Chrome Log Proxy connections | | If enabled, information about chrome connections will be output to the log. |
| Chrome Wait Until | networkidle2 | Determines when the page is considered loaded. |
| Use HTTP/2 transport | | Determines whether to use HTTP/2 instead of HTTP/1.1. For example, Google and Majestic immediately ban if you use HTTP/1.1. |
| Bypass CloudFlare with Chrome (Experimental) | | Bypass CF via Chrome. |
| Bypass CloudFlare with Chrome Max Pages | - | Max. number of pages when bypassing CF via Chrome. |
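
The "Follow common redirects" setting can be read as a rule that exempts certain hops from the "Max redirects count" limit. The sketch below is one possible interpretation of that rule, not A-Parser's implementation: a hop is treated as "common" when only the scheme (http/https) or a leading `www.` changes and the host is otherwise the same. The function names are hypothetical, and ignoring the path when comparing URLs is an assumption.

```python
from urllib.parse import urlsplit


def _bare_host(host):
    """Lowercase the host and drop a single leading 'www.'."""
    host = (host or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_common_redirect(src, dst):
    """True if src -> dst only switches http/https or adds/removes 'www.'."""
    a, b = urlsplit(src), urlsplit(dst)
    if a.scheme not in ("http", "https") or b.scheme not in ("http", "https"):
        return False
    # Both URLs must point at the same domain once 'www.' is ignored.
    if _bare_host(a.hostname) != _bare_host(b.hostname):
        return False
    # Something must actually differ: the scheme or the 'www.' prefix.
    # The path is ignored here (an assumption of this sketch).
    scheme_changed = a.scheme != b.scheme
    www_changed = (a.hostname or "").lower() != (b.hostname or "").lower()
    return scheme_changed or www_changed


def counted_redirects(chain):
    """Count the hops in a URL chain that would count toward the redirect limit."""
    return sum(
        1 for src, dst in zip(chain, chain[1:]) if not is_common_redirect(src, dst)
    )


if __name__ == "__main__":
    chain = [
        "http://example.com/",
        "https://example.com/",
        "https://www.example.com/",
        "https://www.example.com/home",
    ]
    print(counted_redirects(chain))
```

Under this reading, only the last hop in the example chain (a path change on the same host) counts against the limit.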