HTML::TextExtractor::LangDetect - Page language detection
HTML::TextExtractor::LangDetect parser overview

List of collected data
- Detected website language
- Detection accuracy in %
Example results:
http://vk.com/: RUSSIAN
http://a-parser.com/: RUSSIAN
http://yandex.ru/: RUSSIAN
http://youtube.com/: ENGLISH
http://google.com/: ENGLISH
http://facebook.com/: ENGLISH
Features
- Multi-page parsing (navigation through pages)
- Supports gzip/deflate/brotli compression
- Detects and converts website encodings to UTF-8
- Bypasses CloudFlare protection
- Choice of engine (HTTP or Chrome)
- Detects website language without using third-party services
- Accuracy of detection in %
Use cases
- Selection of domains with a certain content language
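As an illustration of this use case, the following is a minimal Python sketch that post-processes saved results outside of A-Parser. It assumes the results were written one per line in the `URL: LANGUAGE` form shown in the overview; the function names `parse_results` and `select_by_language` are hypothetical helpers and are not part of A-Parser.

```python
def parse_results(lines):
    """Parse 'URL: LANGUAGE' lines into (url, language) pairs."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last ': ' so the colon inside 'http://' is left alone.
        url, sep, language = line.rpartition(": ")
        if not sep:
            continue  # skip lines that do not match the assumed format
        pairs.append((url, language.upper()))
    return pairs


def select_by_language(lines, language):
    """Return the URLs whose detected language equals `language`."""
    wanted = language.upper()
    return [url for url, lang in parse_results(lines) if lang == wanted]


sample = [
    "http://vk.com/: RUSSIAN",
    "http://a-parser.com/: RUSSIAN",
    "http://yandex.ru/: RUSSIAN",
    "http://youtube.com/: ENGLISH",
    "http://google.com/: ENGLISH",
    "http://facebook.com/: ENGLISH",
]

russian_sites = select_by_language(sample, "russian")
print(russian_sites)
```

If a different result format is configured, only `parse_results` needs to change.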
Query examples
Specify a list of website URLs as queries, for example:
http://a-parser.com/
http://yandex.ru/
http://google.com/
http://vk.com/
http://facebook.com/
http://youtube.com/
Possible result output formats
A-Parser supports flexible result formatting thanks to the built-in Template Toolkit template engine, which allows it to output results in any form, as well as in structured formats such as CSV or JSON.
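To make the difference between these structured formats concrete, here is a small Python sketch that serializes a few language detection results as JSON and as CSV. It does not use Template Toolkit and is not A-Parser's own output code; the field names `url` and `language` are assumptions chosen for illustration.

```python
import csv
import io
import json

# Hypothetical records: one entry per query with its detected language.
results = [
    {"url": "http://a-parser.com/", "language": "RUSSIAN"},
    {"url": "http://google.com/", "language": "ENGLISH"},
]


def to_json(rows):
    """Serialize the records as a JSON array."""
    return json.dumps(rows, indent=2)


def to_csv(rows):
    """Serialize the records as CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["url", "language"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(to_json(results))
print(to_csv(results))
```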
Possible settings
Parameter name | Default value | Description |
---|---|---|
Good status | All | Select which server response will be considered successful. If a different response is received during parsing, the request will be repeated with a different proxy. |
Good code RegEx | - | Regular expression used to check the response code. |
Method | GET | Request method. |
POST body | - | Content to send to the server when using the POST method. Supports the variables $query (URL query), $query.orig (original query), and $pagenum (page number when the Use Pages option is enabled). |
Cookies | - | Cookies to send with the request. |
User agent | Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) | User-Agent header when requesting pages. |
Additional headers | - | Arbitrary request headers. Supports template engine features and variables from the request builder. |
Read only headers | ☐ | Reads only the response headers. This can save traffic when the content does not need to be processed. |
Detect charset on content | ☐ | Detect encoding based on page content. |
Emulate browser headers | ☐ | Emulates the headers sent by a browser. |
Max redirects count | 7 | Maximum number of redirects to follow. |
Max cookies count | 16 | Maximum number of cookies to save. |
Bypass CloudFlare | ☑ | Automatic CloudFlare check bypass. |
Follow common redirects | ☑ | Allows http <-> https and www.domain <-> domain redirects within the same domain to be made bypassing the Max redirects count limit. |
Engine | HTTP (Fast, JavaScript Disabled) | Selects the HTTP engine (faster, without JavaScript) or the Chrome engine (slower, with JavaScript). |
Chrome Headless | ☐ | If enabled, the browser will not be displayed. |
Chrome DevTools | ☑ | Allows you to use Chromium debugging tools. |
Chrome Log Proxy connections | ☑ | If enabled, information about Chrome connections is written to the log. |
Chrome Wait Until | networkidle2 | Determines when the page is considered loaded. |
Use HTTP/2 transport | ☐ | Determines whether to use HTTP/2 instead of HTTP/1.1. For example, Google and Majestic immediately ban requests made over HTTP/1.1. |
Bypass CloudFlare with Chrome(Experimental) | ☐ | Bypasses CloudFlare via Chrome. |
Bypass CloudFlare with Chrome Max Pages | - | Maximum number of pages when bypassing CloudFlare via Chrome. |
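
The Follow common redirects setting distinguishes "common" redirects (http <-> https and www.domain <-> domain within the same domain) from all others. The Python sketch below shows one way such a check could be expressed. It is an interpretation of the description in the table, not A-Parser's actual implementation; the function names are hypothetical, and treating a path-only change as a regular redirect is an assumption.

```python
from urllib.parse import urlsplit


def strip_www(host):
    """Drop a leading 'www.' label so both host variants compare equal."""
    return host[4:] if host.startswith("www.") else host


def is_common_redirect(source_url, target_url):
    """Return True for http <-> https or www <-> bare-domain redirects."""
    src, dst = urlsplit(source_url), urlsplit(target_url)
    if src.scheme not in ("http", "https") or dst.scheme not in ("http", "https"):
        return False
    src_host = src.hostname or ""
    dst_host = dst.hostname or ""
    # The redirect must stay on the same domain, ignoring the 'www.' prefix.
    if not src_host or strip_www(src_host) != strip_www(dst_host):
        return False
    # Assumption: at least the scheme or the 'www.' prefix must change;
    # a path-only change is treated as a regular redirect.
    return src.scheme != dst.scheme or src_host != dst_host


print(is_common_redirect("http://example.com/", "https://www.example.com/"))
print(is_common_redirect("http://example.com/", "http://other.com/"))
```

Under this reading, only redirects for which `is_common_redirect` returns False would count toward the Max redirects count limit.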