HTML::TextExtractor::LangDetect - Language Detection of a Page

Overview of the scraper

HTML::TextExtractor::LangDetect determines the language of a website and reports the detection accuracy as a percentage. It supports multi-page scraping and navigation through the internal pages of a site to a specified depth, which makes it possible to go through all pages of the site while collecting internal and external links. It has built-in means of bypassing CloudFlare protection and can use Chrome as the engine for scraping pages whose data is loaded by scripts. It can reach speeds of up to 2000 requests per minute, which is 120,000 links per hour.

Collected Data

Determines the language of the website
Accuracy of detection in %

Capabilities

Multi-page scraping (navigation through pages)
Supports gzip/deflate/brotli compression
Detection and conversion of website encodings to UTF-8
Bypassing CloudFlare protection
Choice of engine (HTTP or Chrome)
Website language detection without using third-party services
Accuracy of detection in %

Use Cases

Selecting domains with specific language content

Queries

Queries are specified as a list of website URLs, for example:

http://a-parser.com/
http://yandex.ru/
http://google.com/
http://vk.com/
http://facebook.com/
http://youtube.com/

Output Results Examples

A-Parser supports flexible formatting of results thanks to the built-in Template Toolkit, which allows results to be output in any form, including structured formats such as CSV or JSON.

Default Output

Result format:

$query: $lang\n

Example of the result:

http://vk.com/: RUSSIAN
http://a-parser.com/: RUSSIAN
http://yandex.ru/: RUSSIAN
http://youtube.com/: ENGLISH
http://google.com/: ENGLISH
http://facebook.com/: ENGLISH
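
The default output above is easy to post-process outside A-Parser, for example to select domains by language as described under Use Cases. The following is a minimal Python sketch, not part of A-Parser itself: the function names `parse_results` and `select_by_language` are hypothetical, and the code assumes every result line follows the default `$query: $lang` format shown above.

```python
import json

# Sample text copied from the example result above.
SAMPLE = """\
http://vk.com/: RUSSIAN
http://a-parser.com/: RUSSIAN
http://yandex.ru/: RUSSIAN
http://youtube.com/: ENGLISH
http://google.com/: ENGLISH
http://facebook.com/: ENGLISH
"""


def parse_results(text):
    """Parse lines of the form '<query>: <LANG>' into (url, language) pairs."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # URLs contain ':' themselves, so split only on the last ': '.
        url, lang = line.rsplit(": ", 1)
        pairs.append((url, lang))
    return pairs


def select_by_language(pairs, language):
    """Return the URLs whose detected language equals `language`."""
    return [url for url, lang in pairs if lang == language]


if __name__ == "__main__":
    pairs = parse_results(SAMPLE)
    # Keep only the sites detected as Russian.
    print(select_by_language(pairs, "RUSSIAN"))
    # Re-emit everything as JSON (hypothetical field names).
    print(json.dumps([{"query": u, "lang": l} for u, l in pairs], indent=2))
```

Splitting on the last `": "` rather than the first colon keeps the scheme part of each URL (`http:`) intact.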

Possible Settings

Note: see also Common settings for all scrapers.

Parameter Name	Default Value	Description
Good status	`All`	Selection of which server response will be considered successful. If a different response is received during scraping, the request will be repeated with a different proxy.
Good code RegEx		Ability to specify a regular expression to check the response code.
Method	`GET`	Request method.
POST body		Content to be sent to the server when using the POST method. Supports the variables `$query` (URL of the request), `$query.orig` (the original request), and `$pagenum` (page number when using the Use Pages option).
Cookies		Ability to specify cookies for the request.
User agent	`Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)`	The User-Agent header when requesting pages.
Additional headers		Ability to specify custom request headers with support for templating features and using variables from the request constructor.
Read only headers	`☐`	Read headers only. In some cases, it allows saving traffic if there is no need to process content.
Detect charset on content	`☐`	Detect charset based on the content of the page.
Emulate browser headers	`☐`	Emulate browser headers.
Max redirects count	`7`	Maximum number of redirects the scraper will follow.
Max cookies count	`16`	Maximum number of cookies to save.
Bypass CloudFlare	`☑`	Automatic bypass of CloudFlare checks.
Follow common redirects	`☑`	Allows redirects http <-> https and www.domain <-> domain within the same domain, bypassing the Max redirects count limit.
Engine	`HTTP (Fast, JavaScript Disabled)`	Allows choosing between the HTTP engine (faster, without JavaScript) or Chrome (slower, with JavaScript enabled).
Chrome Headless	`☐`	If this option is enabled, the browser will not be displayed.
Chrome DevTools	`☑`	Allows the use of Chromium debugging tools.
Chrome Log Proxy connections	`☑`	If this option is enabled, information about Chrome connections will be logged.
Chrome Wait Until	`networkidle2`	Determines when the page is considered loaded.
Use HTTP/2 transport	`☐`	Determines whether to use HTTP/2 instead of HTTP/1.1. For example, Google and Majestic immediately ban requests made over HTTP/1.1.
Bypass CloudFlare with Chrome(Experimental)	`☐`	Bypass CF through Chrome.
Bypass CloudFlare with Chrome Max Pages		Maximum number of pages when bypassing CloudFlare through Chrome.
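
The rule behind the Follow common redirects option in the table above (http <-> https and www.domain <-> domain within the same domain) can be illustrated with a short check. This is only a sketch of the idea, not A-Parser's actual implementation: the helper names `strip_www` and `is_common_redirect` are hypothetical, and the code assumes that only the scheme and host are compared while path and query are ignored.

```python
from urllib.parse import urlsplit


def strip_www(host):
    """Drop a leading 'www.' label so www.domain and domain compare equal."""
    return host[4:] if host.startswith("www.") else host


def is_common_redirect(source_url, target_url):
    """Return True if the redirect only switches http/https and/or the www prefix.

    Assumption: path and query are not taken into account.
    """
    src, dst = urlsplit(source_url), urlsplit(target_url)
    allowed_schemes = ("http", "https")
    if src.scheme not in allowed_schemes or dst.scheme not in allowed_schemes:
        return False
    src_host = src.hostname or ""
    dst_host = dst.hostname or ""
    if not src_host or not dst_host:
        return False
    # The redirect must stay on the same domain once 'www.' is ignored...
    same_domain = strip_www(src_host) == strip_www(dst_host)
    # ...and must actually change the scheme or the www prefix.
    changed = src.scheme != dst.scheme or src_host != dst_host
    return same_domain and changed


if __name__ == "__main__":
    print(is_common_redirect("http://example.com/", "https://www.example.com/"))
    print(is_common_redirect("http://example.com/", "http://other.com/"))
```

Under this reading, a redirect to a different domain is an ordinary redirect and counts toward Max redirects count.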