HTML::EmailExtractor - Scraping email addresses from website pages
Overview of the scraper

HTML::EmailExtractor scrapes email addresses from the pages at the specified links; it can also navigate through a site's internal pages down to a given depth (the Parse to level option).
Use cases for the scraper
Scraping emails from a website, navigating through its pages down to a specified depth

- Add the Parse to level option and select the required depth (limit) from the list.
- In the Queries section, check the Unique queries option.
- In the Results section, check the Unique by line option.
- As a query, specify the link to the website from which you need to scrape emails.
Download example
How to import an example into A-Parser
eJxtU01z2jAQ/S8aDu0MY5pDL74RJkzTIXGakBPDQYPXREWWVEmGpB7+e98Kx4Ym
N+3u2/f2S62IMuzCg6dAMYh81QqX3iIXJVWy0VGMhZM+kOfwSvxY3i3y/KaWSt+8
Ri830XpAenAr4psjpFsXlTUBMVXCTBwL2pOGZy91A8zVcb0eC+ghM8ytryXrjtxV
1hXRB5/knpYWwUppGtxzWPeyZrlRKSNxNKsS0ZevWXxlBlmWiiuR+qTAbQyqz0b9
4VJEiF6ZLfAwvaIw97aGO1IiYefbe4UrMUq2AE2T8n+dckQefUNjEVDtHAOisg9U
UgdEVCQvMbGiG07eCmumWqfBDLBEf90oXWLs0wpJt13i55DiA8ex7/Bcak/+4FFD
z5Ks6+JuyCrtwm7RuLFoW6taRdhhZhvDu/kG547I9WO7Z1htPfUyHXOnjstyZPgA
hq1N3eC6aONiM5fOjTWV2hZowKuS3pGNWeJ8CzOztdPEfZlGa2wl0ONwIdPQrYGN
ocD/k2dJ4uLwo7U6/Hw6leq8wgV+5wJrTPJctaPcSK2fHxfnETFcFIyXGF3IJ5PD
4ZDt/taBl5r5ZiI4N9LW4qjQ2XHd/7n+Z7af/7y8PWJpv8PDCc4dMhg+jCpgI/zL
/gFm02Dr
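The steps above configure A-Parser itself, but the underlying idea (crawl a site's internal pages to a depth limit, collecting addresses along the way) can be sketched in plain Python. This is a rough conceptual illustration, not A-Parser's implementation; `EMAIL_RE`, `LinkCollector`, and `crawl_emails` are made-up names for the example.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

# Simplified pattern for illustration; real extractors handle more edge cases.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

class LinkCollector(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl_emails(start_url, max_level=2):
    """Breadth-first crawl of internal pages up to max_level, collecting emails."""
    site_host = urlparse(start_url).netloc
    visited, emails = set(), set()
    frontier = [start_url]
    for _ in range(max_level + 1):
        next_frontier = []
        for url in frontier:
            if url in visited:
                continue
            visited.add(url)
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except OSError:
                continue
            emails.update(EMAIL_RE.findall(html))
            collector = LinkCollector()
            collector.feed(html)
            for link in collector.links:
                absolute = urljoin(url, link)
                if urlparse(absolute).netloc == site_host:  # internal pages only
                    next_frontier.append(absolute)
        frontier = next_frontier
    return sorted(emails)
```

A-Parser adds what this sketch lacks: proxies, retries, encoding detection, CloudFlare bypass, and the limits and dedup options described above.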
Scraping emails from a database of websites, navigating through pages down to a specified depth

- Add the Parse to level option and select the required depth (limit) from the list.
- In the Queries section, check the Unique queries option.
- In the Results section, check the Unique by line option.
- As queries, specify the links to the websites from which you need to scrape emails, or in Queries from select File and upload a file with the website database.
Download example
How to import an example into A-Parser
eJxtU01z2jAQ/S8aDu0MY5pDL74RJkzTIXGakBPDQYPXREWWVEmGpB7+e98Kx4Ym
N+3u2/f2S62IMuzCg6dAMYh81QqX3iIXJVWy0VGMhZM+kOfwSvxY3i3y/KaWSt+8
Ri830XpAenAr4psjpFsXlTUBMVXCTBwL2pOGZy91A8zVcb0eC+ghM8ytryXrjtxV
1hXRB5/knpYWwUppGtxzWPeyZrlRKSNxNKsS0ZevWXxlBlmWiiuR+qTAbQyqz0b9
4VJEiF6ZLfAwvaIw97aGO1IiYefbe4UrMUq2AE2T8n+dckQefUNjEVDtHAOisg9U
UgdEVCQvMbGiG07eCmumWqfBDLBEf90oXWLs0wpJt13i55DiA8ex7/Bcak/+4FFD
z5Ks6+JuyCrtwm7RuLFoW6taRdhhZhvDu/kG547I9WO7Z1htPfUyHXOnjstyZPgA
hq1N3eC6aONiM5fOjTWV2hZowKuS3pGNWeJ8CzOztdPEfZlGa2wl0ONwIdPQrYGN
ocD/k2dJ4uLwo7U6/Hw6leq8wgV+5wJrTPJctaPcSK2fHxfnETFcFIyXGF3IJ5PD
4ZDt/taBl5r5ZiI4N9LW4qjQ2XHd/7n+Z7af/7y8PWJpv8PDCc4dMhg+jCpgI/zL
/gFm02Dr
Scraping emails from a database of links

- In the Queries section, check the Unique queries option.
- In the Results section, check the Unique by line option.
- As queries, specify the links from which you need to scrape emails, or in Queries from select File and upload a file with the link database.
Download example
How to import an example into A-Parser
eJxtU01z0zAQ/S+aHmAmOPTAxbc00wwwaV3a9BRyEPE6COuLXSkpePLfWTmOHZfe
tG/fvv1UI4Kkmh4QCAKJfN0I375FLkqoZNRBTISXSIDJvRafV3fLPL81Uunbl4By
Gxwy5UzebCaCBfhJC4dGJqErf511qr3zSe5h5dhZKQ0DvGDrXhpIUaUMkLxZ1Qq9
e5+Fl6Qgy1IF5azUpwypriHrs1W/Y4qngMrumM8mKqAFOsNwgFYkgX/OFa7FVWsL
lolt/LdTjMgDRpgI4moX3DGUvaOSmtijAqDkERQ+lcR4I5ydab2EPeiB1srfRKVL
nuOs4qAvXeDblOI/jWPf4WWqPeABuYZepbVuirshqnRLt+PGreO2tTIqsE1zF23a
zUcGawDfj+0+0YxD6NN0yl12PhUPtmTmsLWZH6BRG6PNjMGts5XaFdwAqhLOzGhX
fI+FnTvjNaS+bNSat0LwOFzIjLo1JGMo8HXwvE0xuuTgnKavT6dSPSq+wE+pQMOT
vMzaSW6l1s+Py0uPGC6KjZ8heMqn08PhkNV/DaWlZhin3+3Z8wMl4Bjy6Mq4DVuw
4bXLOKpZwoxRqSv5IUBNY5hMpqkVEKnUADvHN8yDPG76P9v/7Obtn5s3R76RX/Rw
oqeBJjJjvBniAxD59fEfH7B6cg==
Collected data
- Email addresses
- Total number of addresses on the page
- Array with all collected pages (used when Use Pages option is enabled)
Capabilities
- Multi-page scraping (pagination)
- Navigation through internal site pages up to a specified depth (Parse to level option) – allows covering all site pages, collecting internal and external links
- Determining which links to follow
- Limit on page transitions (option Follow links limit)
- Ability to consider subdomains as internal site pages
- Supports gzip/deflate/brotli compression
- Detection and conversion of site encodings to UTF-8
- CloudFlare protection bypass
- Choice of engine (HTTP or Chrome)
- Supports all the functionality of HTML::LinkExtractor
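The Subdomains are internal and Follow links options decide which discovered links count as internal site pages. A minimal sketch of that classification (`is_internal` is an illustrative helper, not part of A-Parser):

```python
from urllib.parse import urlparse

def is_internal(link_url, site_url, subdomains_are_internal=False):
    """Classify a link as internal to site_url.

    With subdomains_are_internal=True, blog.example.com counts as internal
    to example.com (mirroring the 'Subdomains are internal' option).
    """
    link_host = urlparse(link_url).netloc.lower()
    site_host = urlparse(site_url).netloc.lower()
    if link_host == site_host:
        return True
    if subdomains_are_internal and link_host.endswith("." + site_host):
        return True
    return False
```

With Follow links set to Internal only, a crawler would only enqueue URLs for which this check succeeds, and Follow links limit would cap how many such links are followed per unique domain.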
Use cases
- Email address scraping
- Displaying the number of email addresses
Queries
As queries, specify links to pages, for example:
https://a-parser.com/pages/support/
Output results examples
A-Parser supports flexible result formatting thanks to the built-in Template Toolkit, which allows it to output results in any form, as well as in structured formats such as CSV or JSON.
Displaying the number of email addresses
Result format:
$mailcount
Example of result:
4
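The same built-in Template Toolkit can emit a richer layout. A sketch: `$mailcount` is documented above, while the `mails` array name is an assumption for the example — check the scraper's Collected data list in your A-Parser version for the actual variable name.

```
[% "Found " _ mailcount _ " address(es):\n" _ mails.join("\n") %]
```

Here `_` is Template Toolkit's string-concatenation operator and `join` is a standard list vmethod.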
Possible settings
Parameter Name | Default Value | Description |
---|---|---|
Good status | All | Selection of which server response will be considered successful. If another response is received during scraping, the request will be repeated with a different proxy |
Good code RegEx | | Ability to specify a regular expression to check the response code |
Ban Proxy Code RegEx | | Ability to temporarily ban a proxy for a time (Proxy ban time) based on the server response code |
Method | GET | Request method |
POST body | | Content to be sent to the server when using the POST method. Supports the variables $query (URL request), $query.orig (original request), and $pagenum (page number when using the Use Pages option) |
Cookies | | Ability to specify cookies for the request |
User agent | _Automatically substituted user-agent of the current Chrome version_ | User-Agent header when requesting pages |
Additional headers | | Ability to specify custom request headers with support for templating features and using variables from the request builder |
Read only headers | ☐ | Read headers only. In some cases, it saves traffic if there is no need to process content |
Detect charset on content | ☐ | Detect charset based on the content of the page |
Emulate browser headers | ☐ | Emulate browser headers |
Max redirects count | 0 | Maximum number of redirects the scraper will follow |
Follow common redirects | ☑ | Allows http <-> https and www.domain <-> domain redirects within the same domain, bypassing the Max redirects count limit |
Max cookies count | 16 | Maximum number of cookies to save |
Engine | HTTP (Fast, JavaScript Disabled) | Allows choosing between the HTTP engine (faster, without JavaScript) or Chrome (slower, JavaScript enabled) |
Chrome Headless | ☐ | If this option is enabled, the browser will not be displayed |
Chrome DevTools | ☑ | Allows the use of Chromium debugging tools |
Chrome Log Proxy connections | ☑ | If this option is enabled, information about Chrome connections will be logged |
Chrome Wait Until | networkidle2 | Determines when the page is considered loaded |
Use HTTP/2 transport | ☐ | Determines whether to use HTTP/2 instead of HTTP/1.1. For example, Google and Majestic immediately ban requests made over HTTP/1.1 |
Don't verify TLS certs | ☐ | Disable TLS certificate validation |
Randomize TLS Fingerprint | ☐ | Allows bypassing site bans based on TLS fingerprint |
Bypass CloudFlare | ☑ | Automatic bypass of CloudFlare checks |
Bypass CloudFlare with Chrome (Experimental) | ☐ | Bypass CloudFlare checks through Chrome |
Bypass CloudFlare with Chrome Max Pages | 20 | Maximum number of pages when bypassing CloudFlare through Chrome |
Subdomains are internal | ☐ | Whether to consider subdomains as internal links |
Follow links | Internal only | Which links to follow |
Follow links limit | 0 | Limit on followed links, applied to each unique domain |
Skip comment blocks | ☐ | Whether to skip comment blocks |
Search Cloudflare protected e-mails | ☑ | Whether to scrape Cloudflare-protected email addresses |
Skip non-HTML blocks | ☑ | Do not collect email addresses inside script, style, comment, and similar tags |
Skip meta tags | ☐ | Do not collect email addresses in meta tags |
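As an illustration of the POST body variables from the table, a request template for a hypothetical endpoint could look like this (the parameter names `q` and `page` are made up for the example):

```
q=$query&page=$pagenum
```

With the Use Pages option enabled, $pagenum is substituted with the current page number on each request, while $query carries the URL request and $query.orig the original, untransformed query.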