HTML::EmailExtractor - Scraping email addresses from website pages
Overview of the scraper
HTML::EmailExtractor collects email addresses from the specified pages. It can navigate through the internal pages of a website down to a specified depth, which allows it to cover all pages of the site while collecting internal and external links. The scraper has built-in means of bypassing CloudFlare protection and can use Chrome as its engine to scrape emails from pages where the data is loaded by scripts. It is capable of reaching speeds of up to 250 requests per minute, which is 15,000 links per hour.
Use cases for the scraper
Scraping emails from a website, navigating through its pages down to a specified depth
- Add the option Parse to level and select the required value (depth limit) from the list.
- In the Queries section, check the Unique queries option.
- In the Results section, check the Unique per line option.
- As a query, specify the link to the website from which you need to scrape emails.
Download example
How to import an example into A-Parser
eJxtU01z2jAQ/S8aDu0MY5pDL74RJkzTIXGakBPDQYPXREWWVEmGpB7+e98Kx4Ym
N+3u2/f2S62IMuzCg6dAMYh81QqX3iIXJVWy0VGMhZM+kOfwSvxY3i3y/KaWSt+8
Ri830XpAenAr4psjpFsXlTUBMVXCTBwL2pOGZy91A8zVcb0eC+ghM8ytryXrjtxV
1hXRB5/knpYWwUppGtxzWPeyZrlRKSNxNKsS0ZevWXxlBlmWiiuR+qTAbQyqz0b9
4VJEiF6ZLfAwvaIw97aGO1IiYefbe4UrMUq2AE2T8n+dckQefUNjEVDtHAOisg9U
UgdEVCQvMbGiG07eCmumWqfBDLBEf90oXWLs0wpJt13i55DiA8ex7/Bcak/+4FFD
z5Ks6+JuyCrtwm7RuLFoW6taRdhhZhvDu/kG547I9WO7Z1htPfUyHXOnjstyZPgA
hq1N3eC6aONiM5fOjTWV2hZowKuS3pGNWeJ8CzOztdPEfZlGa2wl0ONwIdPQrYGN
ocD/k2dJ4uLwo7U6/Hw6leq8wgV+5wJrTPJctaPcSK2fHxfnETFcFIyXGF3IJ5PD
4ZDt/taBl5r5ZiI4N9LW4qjQ2XHd/7n+Z7af/7y8PWJpv8PDCc4dMhg+jCpgI/zL
/gFm02Dr
Scraping emails from a database of websites, navigating through their pages down to a specified depth
- Add the option Parse to level and select the required value (depth limit) from the list.
- In the Queries section, check the Unique queries option.
- In the Results section, check the Unique per line option.
- As queries, specify the links to the websites from which you need to scrape emails, or select File in Queries from and upload a file with a database of websites.
Download example
How to import an example into A-Parser
eJxtU01z2jAQ/S8aDu0MY5pDL74RJkzTIXGakBPDQYPXREWWVEmGpB7+e98Kx4Ym
N+3u2/f2S62IMuzCg6dAMYh81QqX3iIXJVWy0VGMhZM+kOfwSvxY3i3y/KaWSt+8
Ri830XpAenAr4psjpFsXlTUBMVXCTBwL2pOGZy91A8zVcb0eC+ghM8ytryXrjtxV
1hXRB5/knpYWwUppGtxzWPeyZrlRKSNxNKsS0ZevWXxlBlmWiiuR+qTAbQyqz0b9
4VJEiF6ZLfAwvaIw97aGO1IiYefbe4UrMUq2AE2T8n+dckQefUNjEVDtHAOisg9U
UgdEVCQvMbGiG07eCmumWqfBDLBEf90oXWLs0wpJt13i55DiA8ex7/Bcak/+4FFD
z5Ks6+JuyCrtwm7RuLFoW6taRdhhZhvDu/kG547I9WO7Z1htPfUyHXOnjstyZPgA
hq1N3eC6aONiM5fOjTWV2hZowKuS3pGNWeJ8CzOztdPEfZlGa2wl0ONwIdPQrYGN
ocD/k2dJ4uLwo7U6/Hw6leq8wgV+5wJrTPJctaPcSK2fHxfnETFcFIyXGF3IJ5PD
4ZDt/taBl5r5ZiI4N9LW4qjQ2XHd/7n+Z7af/7y8PWJpv8PDCc4dMhg+jCpgI/zL
/gFm02Dr
Scraping emails from a database of links
- In the Queries section, check the Unique queries option.
- In the Results section, check the Unique per line option.
- As queries, specify the links from which you need to scrape emails, or select File in Queries from and upload a file with a link database.
Download example
How to import an example into A-Parser
eJxtU01z0zAQ/S+aHmAmOPTAxbc00wwwaV3a9BRyEPE6COuLXSkpePLfWTmOHZfe
tG/fvv1UI4Kkmh4QCAKJfN0I375FLkqoZNRBTISXSIDJvRafV3fLPL81Uunbl4By
Gxwy5UzebCaCBfhJC4dGJqErf511qr3zSe5h5dhZKQ0DvGDrXhpIUaUMkLxZ1Qq9
e5+Fl6Qgy1IF5azUpwypriHrs1W/Y4qngMrumM8mKqAFOsNwgFYkgX/OFa7FVWsL
lolt/LdTjMgDRpgI4moX3DGUvaOSmtijAqDkERQ+lcR4I5ydab2EPeiB1srfRKVL
nuOs4qAvXeDblOI/jWPf4WWqPeABuYZepbVuirshqnRLt+PGreO2tTIqsE1zF23a
zUcGawDfj+0+0YxD6NN0yl12PhUPtmTmsLWZH6BRG6PNjMGts5XaFdwAqhLOzGhX
fI+FnTvjNaS+bNSat0LwOFzIjLo1JGMo8HXwvE0xuuTgnKavT6dSPSq+wE+pQMOT
vMzaSW6l1s+Py0uPGC6KjZ8heMqn08PhkNV/DaWlZhin3+3Z8wMl4Bjy6Mq4DVuw
4bXLOKpZwoxRqSv5IUBNY5hMpqkVEKnUADvHN8yDPG76P9v/7Obtn5s3R76RX/Rw
oqeBJjJjvBniAxD59fEfH7B6cg==
Collected data
- Email addresses
- Total number of addresses on the page
- Array with all collected pages (used when Use Pages option is enabled)
Capabilities
- Multi-page scraping (pagination)
- Navigation through internal site pages down to a specified depth (option Parse to level), which allows covering all pages of a site while collecting internal and external links
- Determining the follow (dofollow/nofollow) status of links
- Limit on page transitions (option Follow links limit)
- Ability to consider subdomains as internal site pages
- Supports gzip/deflate/brotli compression
- Detection and conversion of site encodings to UTF-8
- CloudFlare protection bypass
- Choice of engine (HTTP or Chrome)
- Supports all the functionality of HTML::LinkExtractor
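The depth-limited traversal behind the Parse to level option can be illustrated with a short sketch. This is not A-Parser's implementation: the fetch step is simulated with an in-memory url-to-HTML map, and the function name crawl is made up for the example. It shows the internal/external distinction (only same-domain links are followed, matching the default Follow links = Internal only setting) and the depth cutoff.

```python
import re
from urllib.parse import urljoin, urlparse

HREF_RE = re.compile(r'href="([^"]+)"')

def crawl(pages: dict, start: str, max_depth: int) -> list:
    """Breadth-first traversal of internal links up to max_depth levels.
    `pages` simulates fetched HTML (url -> body) instead of real HTTP."""
    domain = urlparse(start).netloc
    visited = [start]
    frontier = [(start, 0)]
    while frontier:
        url, depth = frontier.pop(0)
        if depth >= max_depth:          # the "Parse to level" cutoff
            continue
        for href in HREF_RE.findall(pages.get(url, "")):
            link = urljoin(url, href)
            # follow internal links only, like the default Follow links setting
            if urlparse(link).netloc == domain and link not in visited:
                visited.append(link)
                frontier.append((link, depth + 1))
    return visited

site = {
    "https://example.com/":  '<a href="/a">A</a> <a href="https://other.com/">X</a>',
    "https://example.com/a": '<a href="/b">B</a>',
    "https://example.com/b": "",
}
print(crawl(site, "https://example.com/", 1))  # one level deep, external link skipped
print(crawl(site, "https://example.com/", 2))  # reaches /b at the second level
```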
Use cases
- Email address scraping
- Displaying the number of email addresses
Queries
As queries, specify links to pages, for example:
https://a-parser.com/pages/support/
Output results examples
A-Parser supports flexible result formatting thanks to the built-in Template Toolkit, which allows it to output results in any form, as well as in structured formats such as CSV or JSON.
Displaying the number of email addresses
Result format:
$mailcount
Example of result:
4
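For illustration, here is a minimal Python sketch of the kind of extraction that produces such a count. The regular expression is an approximation for the example only, not the pattern A-Parser actually uses:

```python
import re

# Permissive e-mail pattern; real scrapers use more elaborate rules,
# so treat this as an illustrative approximation only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def extract_emails(html: str) -> list:
    """Return unique e-mail addresses found in an HTML string, in order."""
    seen = {}
    for match in EMAIL_RE.findall(html):
        seen.setdefault(match, None)   # dict preserves insertion order
    return list(seen)

page = '<a href="mailto:support@a-parser.com">support@a-parser.com</a>'
emails = extract_emails(page)
print(len(emails))   # count of unique addresses, analogous to $mailcount
print(emails)        # → ['support@a-parser.com']
```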
Possible settings
Parameter Name | Default Value | Description |
---|---|---|
Good status | All | Selects which server responses are considered successful. If a different response is received during scraping, the request will be retried with a different proxy |
Good code RegEx | | Ability to specify a regular expression to check the response code |
Ban Proxy Code RegEx | | Ability to temporarily ban a proxy (for Proxy ban time) based on the server response code |
Method | GET | Request method |
POST body | | Content to send to the server when using the POST method. Supports the variables $query (URL query), $query.orig (original query), and $pagenum (page number when the Use Pages option is used) |
Cookies | | Ability to specify cookies for the request |
User agent | _Automatically substituted user-agent of the current Chrome version_ | User-Agent header when requesting pages |
Additional headers | | Ability to specify custom request headers, with support for templating and variables from the query builder |
Read only headers | ☐ | Read headers only. In some cases, it saves traffic if there is no need to process content |
Detect charset on content | ☐ | Detect charset based on the content of the page |
Emulate browser headers | ☐ | Emulate browser headers |
Max redirects count | 0 | Maximum number of redirects the scraper will follow |
Follow common redirects | ☑ | Allows for http <-> https and www.domain <-> domain redirects within the same domain, bypassing the Max redirects count limit |
Max cookies count | 16 | Maximum number of cookies to save |
Engine | HTTP (Fast, JavaScript Disabled) | Allows choosing between the HTTP engine (faster, without JavaScript) or Chrome (slower, JavaScript enabled) |
Chrome Headless | ☐ | If this option is enabled, the browser will not be displayed |
Chrome DevTools | ☑ | Allows the use of Chromium debugging tools |
Chrome Log Proxy connections | ☑ | If this option is enabled, information about Chrome connections will be logged |
Chrome Wait Until | networkidle2 | Determines when the page is considered loaded. More about the values. |
Use HTTP/2 transport | ☐ | Determines whether to use HTTP/2 instead of HTTP/1.1. For example, Google and Majestic immediately ban if HTTP/1.1 is used. |
Don't verify TLS certs | ☐ | Disable TLS certificate validation |
Randomize TLS Fingerprint | ☐ | This option allows bypassing site bans by TLS fingerprint |
Bypass CloudFlare | ☑ | Automatic bypass of CloudFlare checks |
Bypass CloudFlare with Chrome(Experimental) | ☐ | Bypass CF through Chrome |
Bypass CloudFlare with Chrome Max Pages | 20 | Max. number of pages when bypassing CF through Chrome |
Subdomains are internal | ☐ | Whether to consider subdomains as internal links |
Follow links | Internal only | Which links to follow |
Follow links limit | 0 | Follow links limit, applied to each unique domain |
Skip comment blocks | ☐ | Whether to skip comment blocks |
Search Cloudflare protected e-mails | ☑ | Whether to scrape Cloudflare-protected email addresses |
Skip non-HTML blocks | ☑ | Do not collect email addresses in tags (script, style, comment, etc.). |
Skip meta tags | ☐ | Do not collect email addresses in meta tags. |
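The Search Cloudflare protected e-mails option deals with Cloudflare's email obfuscation, where the address is stored as a hex string in a data-cfemail attribute, XOR-ed with a key held in the string's first byte. The decoding scheme is publicly known; the sketch below (with helper names made up for the example) shows how such an address can be recovered:

```python
def decode_cfemail(encoded: str) -> str:
    """Decode a Cloudflare-obfuscated e-mail (data-cfemail attribute value)."""
    key = int(encoded[:2], 16)              # first byte is the XOR key
    return "".join(
        chr(int(encoded[i:i + 2], 16) ^ key)
        for i in range(2, len(encoded), 2)
    )

def encode_cfemail(email: str, key: int = 0x42) -> str:
    """Inverse transform, handy for round-trip testing."""
    return f"{key:02x}" + "".join(f"{ord(c) ^ key:02x}" for c in email)

print(decode_cfemail(encode_cfemail("test@example.com")))  # → test@example.com
```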