
HTML::TextExtractor::LangDetect - Language Detection of a Page

Overview of the scraper

HTML::TextExtractor::LangDetect determines the language of a website and reports detection confidence as a percentage. It supports multi-page scraping and navigation through a site's internal pages to a specified depth, which makes it possible to crawl all pages of a site while collecting internal and external links. It has built-in means of bypassing CloudFlare protection, as well as the option to use Chrome as the scraping engine for pages whose content is loaded by scripts. It is capable of reaching speeds of up to 2,000 requests per minute, which is 120,000 links per hour.
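
The scraper's own detection algorithm is not published; purely as an illustration of the kind of output it produces (a language code plus a confidence percentage), here is a naive stopword-based sketch. The stopword sets and scoring are assumptions for the example, not the tool's method.

```python
# Illustrative only: a toy language detector returning (lang, confidence %).
# Not the algorithm used by HTML::TextExtractor::LangDetect.

STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "de": {"der", "die", "und", "ist", "das", "nicht", "ein", "zu"},
    "fr": {"le", "la", "et", "est", "les", "des", "un", "une"},
}

def detect_language(text: str) -> tuple[str, float]:
    """Return (language_code, confidence_percent) for the given text."""
    words = [w.strip(".,!?;:()").lower() for w in text.split()]
    # Count stopword hits per candidate language
    scores = {lang: sum(1 for w in words if w in sw)
              for lang, sw in STOPWORDS.items()}
    total = sum(scores.values()) or 1
    best = max(scores, key=scores.get)
    return best, round(100.0 * scores[best] / total, 1)

print(detect_language("the cat is on the mat and it is happy"))
```

A real detector would use character n-gram statistics over many languages, but the shape of the result is the same: one language code and one accuracy figure per page.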

Collected Data

  • Determines the language of the website
  • Accuracy of detection in %


Capabilities

  • Multi-page scraping (navigation through pages)
  • Supports gzip/deflate/brotli compression
  • Detection and conversion of website encodings to UTF-8
  • Bypassing CloudFlare protection
  • Choice of engine (HTTP or Chrome)
  • Website language detection without using third-party services
  • Accuracy of detection in %
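
Two of the capabilities above, response decompression and charset conversion to UTF-8, can be sketched with the Python standard library. This is a simplified illustration: the charset here is read from the page's <meta> tag, while the scraper can also detect it from content, and brotli support would need a third-party library.

```python
# Sketch: decompress a gzip response body and decode it to UTF-8 text.
import gzip
import re

def decode_body(raw: bytes, content_encoding: str = "") -> str:
    if content_encoding == "gzip":
        raw = gzip.decompress(raw)
    # Look for a <meta charset="..."> declaration near the top of the page
    m = re.search(rb'charset=["\']?([\w-]+)', raw[:2048])
    charset = m.group(1).decode("ascii") if m else "utf-8"
    return raw.decode(charset, errors="replace")

page = '<meta charset="windows-1251"><p>Привет</p>'.encode("windows-1251")
print(decode_body(gzip.compress(page), "gzip"))
```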

Use Cases

  • Selecting domains with specific language content


Queries

As queries, specify a list of websites (URLs) to be processed.

Output Results Examples

A-Parser supports flexible formatting of results thanks to the built-in Template Toolkit, which allows results to be output in any form, as well as in structured formats such as CSV or JSON.

Default Output

Result format:

$query: $lang\n
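
To make the default template concrete, the sketch below renders the same made-up result set both as "$query: $lang" lines and as JSON (the URLs and languages are invented for illustration; A-Parser itself does this with Template Toolkit, not Python):

```python
# Sketch: rendering results in the default "$query: $lang\n" form and as JSON.
import json

results = [
    {"query": "http://example.com", "lang": "en"},
    {"query": "http://example.de", "lang": "de"},
]

# Default output: one "$query: $lang" line per request
for r in results:
    print(f"{r['query']}: {r['lang']}")

# The same results as structured JSON
print(json.dumps(results))
```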


Possible Settings

  • Good status (default: All): Selection of which server response is considered successful. If a different response is received during scraping, the request will be retried with a different proxy.
  • Good code RegEx: Ability to specify a regular expression to check the response code.
  • Method (default: GET): Request method.
  • POST body: Content to be sent to the server when using the POST method. Supports the variables $query (URL of the request), $query.orig (the original request), and $pagenum (page number when using the Use Pages option).
  • Cookies: Ability to specify cookies for the request.
  • User agent (default: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)): The User-Agent header when requesting pages.
  • Additional headers: Ability to specify custom request headers, with support for templating features and variables from the request constructor.
  • Read only headers: Read headers only. In some cases, this saves traffic when there is no need to process content.
  • Detect charset on content: Detect the charset based on the content of the page.
  • Emulate browser headers: Emulate browser headers.
  • Max redirects count (default: 7): Maximum number of redirects the scraper will follow.
  • Max cookies count (default: 16): Maximum number of cookies to save.
  • Bypass CloudFlare: Automatic bypass of CloudFlare checks.
  • Follow common redirects: Allows http <-> https and www.domain <-> domain redirects within the same domain, bypassing the Max redirects count limit.
  • Engine (default: HTTP (Fast, JavaScript Disabled)): Allows choosing between the HTTP engine (faster, without JavaScript) and Chrome (slower, with JavaScript enabled).
  • Chrome Headless: If this option is enabled, the browser will not be displayed.
  • Chrome DevTools: Allows the use of Chromium debugging tools.
  • Chrome Log Proxy connections: If this option is enabled, information about Chrome connections will be logged.
  • Chrome Wait Until (default: networkidle2): Determines when the page is considered loaded.
  • Use HTTP/2 transport: Determines whether to use HTTP/2 instead of HTTP/1.1. For example, Google and Majestic immediately ban requests if HTTP/1.1 is used.
  • Bypass CloudFlare with Chrome (Experimental): Bypass CloudFlare checks through Chrome.
  • Bypass CloudFlare with Chrome Max Pages: Maximum number of pages when bypassing CloudFlare through Chrome.