HTML::EmailExtractor - Scraping Email Addresses from Website Pages

Scraper Overview

HTML::EmailExtractor collects email addresses from specified pages. It can traverse a site's internal pages down to a specified depth, iterating over all site pages and collecting internal and external links along the way. The scraper has built-in means of bypassing CloudFlare protection, as well as the option to select Chrome as the engine for scraping emails from pages whose content is loaded by scripts. It is capable of speeds up to 250 requests per minute, i.e. 15,000 links per hour.
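
The extraction itself boils down to matching email patterns in fetched HTML. As a rough illustration only (not A-Parser's actual implementation), a minimal Python sketch might look like the following; the regex is a deliberately simplified approximation:

import re

import requests

# Simplified pattern; a production extractor also handles obfuscated,
# URL-encoded, and Cloudflare-protected addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(url: str) -> set[str]:
    """Fetch a page and return the unique email addresses found in it."""
    resp = requests.get(url, timeout=10)  # gzip/deflate decoded transparently
    resp.raise_for_status()
    return set(EMAIL_RE.findall(resp.text))

print(extract_emails("https://a-parser.com/pages/support/"))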

Scraper Use Cases

Scraping emails from a site by traversing pages up to a specified limit
  1. Add the option Parse to level, and select the required value (limit) from the list.
  2. In the Queries section, check the Unique queries option.
  3. In the Results section, check the Unique by string option.
  4. As the query, specify the link to the site from which emails need to be scraped (a minimal sketch of the equivalent crawl logic follows the import example below).
Download example

How to import an example into A-Parser

eJxtU01z2jAQ/S8aDu0MY5pDL74RJkzTIXGakBPDQYPXREWWVEmGpB7+e98Kx4Ym
N+3u2/f2S62IMuzCg6dAMYh81QqX3iIXJVWy0VGMhZM+kOfwSvxY3i3y/KaWSt+8
Ri830XpAenAr4psjpFsXlTUBMVXCTBwL2pOGZy91A8zVcb0eC+ghM8ytryXrjtxV
1hXRB5/knpYWwUppGtxzWPeyZrlRKSNxNKsS0ZevWXxlBlmWiiuR+qTAbQyqz0b9
4VJEiF6ZLfAwvaIw97aGO1IiYefbe4UrMUq2AE2T8n+dckQefUNjEVDtHAOisg9U
UgdEVCQvMbGiG07eCmumWqfBDLBEf90oXWLs0wpJt13i55DiA8ex7/Bcak/+4FFD
z5Ks6+JuyCrtwm7RuLFoW6taRdhhZhvDu/kG547I9WO7Z1htPfUyHXOnjstyZPgA
hq1N3eC6aONiM5fOjTWV2hZowKuS3pGNWeJ8CzOztdPEfZlGa2wl0ONwIdPQrYGN
ocD/k2dJ4uLwo7U6/Hw6leq8wgV+5wJrTPJctaPcSK2fHxfnETFcFIyXGF3IJ5PD
4ZDt/taBl5r5ZiI4N9LW4qjQ2XHd/7n+Z7af/7y8PWJpv8PDCc4dMhg+jCpgI/zL
/gFm02Dr
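
For readers curious what the Parse to level traversal amounts to conceptually, here is a hedged Python sketch of a depth-limited, internal-only breadth-first crawl. It is an approximation under simplified assumptions (no proxies, retries, or CloudFlare handling), not the scraper's real algorithm:

from collections import deque
from urllib.parse import urljoin, urlparse
import re

import requests

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
HREF_RE = re.compile(r'href=["\'](.*?)["\']', re.I)

def crawl_emails(start_url: str, max_level: int = 2) -> set[str]:
    """Breadth-first crawl of internal pages up to max_level, collecting emails."""
    domain = urlparse(start_url).netloc
    seen, emails = {start_url}, set()
    queue = deque([(start_url, 0)])  # (url, depth)
    while queue:
        url, level = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # skip unreachable pages
        emails.update(EMAIL_RE.findall(html))
        if level >= max_level:
            continue  # the "Parse to level" analogue
        for href in HREF_RE.findall(html):
            link = urljoin(url, href)
            # follow internal links only, like "Follow links: Internal only"
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append((link, level + 1))
    return emails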

Scraping emails from a site database by traversing each site up to a specified depth limit
  1. Add the option Parse to level, and select the required value (limit) from the list.
  2. In the Queries section, check the Unique queries option.
  3. In the Results section, check the Unique by string option.
  4. As the query, specify the links to the sites from which emails need to be scraped, or set Queries from to File and upload a query file containing the site database (a file-driven sketch follows the import example below).
Download example

How to import an example into A-Parser

eJxtU01z2jAQ/S8aDu0MY5pDL74RJkzTIXGakBPDQYPXREWWVEmGpB7+e98Kx4Ym
N+3u2/f2S62IMuzCg6dAMYh81QqX3iIXJVWy0VGMhZM+kOfwSvxY3i3y/KaWSt+8
Ri830XpAenAr4psjpFsXlTUBMVXCTBwL2pOGZy91A8zVcb0eC+ghM8ytryXrjtxV
1hXRB5/knpYWwUppGtxzWPeyZrlRKSNxNKsS0ZevWXxlBlmWiiuR+qTAbQyqz0b9
4VJEiF6ZLfAwvaIw97aGO1IiYefbe4UrMUq2AE2T8n+dckQefUNjEVDtHAOisg9U
UgdEVCQvMbGiG07eCmumWqfBDLBEf90oXWLs0wpJt13i55DiA8ex7/Bcak/+4FFD
z5Ks6+JuyCrtwm7RuLFoW6taRdhhZhvDu/kG547I9WO7Z1htPfUyHXOnjstyZPgA
hq1N3eC6aONiM5fOjTWV2hZowKuS3pGNWeJ8CzOztdPEfZlGa2wl0ONwIdPQrYGN
ocD/k2dJ4uLwo7U6/Hw6leq8wgV+5wJrTPJctaPcSK2fHxfnETFcFIyXGF3IJ5PD
4ZDt/taBl5r5ZiI4N9LW4qjQ2XHd/7n+Z7af/7y8PWJpv8PDCc4dMhg+jCpgI/zL
/gFm02Dr
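
Conceptually, the only difference from the previous use case is that the queries come from a file and each site gets its own depth-limited crawl. A minimal sketch, assuming the crawl_emails() helper from the previous example and a hypothetical sites.txt with one site URL per line:

def scrape_site_database(path: str = "sites.txt", max_level: int = 2) -> set[str]:
    """Run a depth-limited crawl for every site listed in the query file."""
    with open(path, encoding="utf-8") as f:
        # a set of queries mirrors the "Unique queries" option
        sites = {line.strip() for line in f if line.strip()}
    emails: set[str] = set()
    for site in sites:
        emails |= crawl_emails(site, max_level)  # defined in the sketch above
    return emails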

Scraping emails by a list of links
  1. In the Queries section, check the Unique queries option.
  2. In the Results section, check the Unique by string option.
  3. As the query, specify the links from which emails need to be scraped, or set Queries from to File and upload a query file containing the database of links (a sketch of the equivalent logic follows the import example below).
Download example

How to import an example into A-Parser

eJxtU01z0zAQ/S+aHmAmOPTAxbc00wwwaV3a9BRyEPE6COuLXSkpePLfWTmOHZfe
tG/fvv1UI4Kkmh4QCAKJfN0I375FLkqoZNRBTISXSIDJvRafV3fLPL81Uunbl4By
Gxwy5UzebCaCBfhJC4dGJqErf511qr3zSe5h5dhZKQ0DvGDrXhpIUaUMkLxZ1Qq9
e5+Fl6Qgy1IF5azUpwypriHrs1W/Y4qngMrumM8mKqAFOsNwgFYkgX/OFa7FVWsL
lolt/LdTjMgDRpgI4moX3DGUvaOSmtijAqDkERQ+lcR4I5ydab2EPeiB1srfRKVL
nuOs4qAvXeDblOI/jWPf4WWqPeABuYZepbVuirshqnRLt+PGreO2tTIqsE1zF23a
zUcGawDfj+0+0YxD6NN0yl12PhUPtmTmsLWZH6BRG6PNjMGts5XaFdwAqhLOzGhX
fI+FnTvjNaS+bNSat0LwOFzIjLo1JGMo8HXwvE0xuuTgnKavT6dSPSq+wE+pQMOT
vMzaSW6l1s+Py0uPGC6KjZ8heMqn08PhkNV/DaWlZhin3+3Z8wMl4Bjy6Mq4DVuw
4bXLOKpZwoxRqSv5IUBNY5hMpqkVEKnUADvHN8yDPG76P9v/7Obtn5s3R76RX/Rw
oqeBJjJjvBniAxD59fEfH7B6cg==
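
Here no traversal is needed at all: each link is fetched once and the results are deduplicated. A minimal sketch, assuming the extract_emails() helper from the overview example and a hypothetical links.txt query file:

def scrape_link_list(path: str = "links.txt") -> list[str]:
    """Scrape emails from a flat list of links, deduplicating queries and results."""
    with open(path, encoding="utf-8") as f:
        links = {line.strip() for line in f if line.strip()}  # "Unique queries"
    emails: set[str] = set()  # a set plays the role of "Unique by string"
    for link in links:
        emails |= extract_emails(link)  # defined in the overview sketch
    return sorted(emails)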

Collected Data

Example of collected data

  • Email addresses
  • Total number of addresses on the page
  • Array with all collected pages (used when the Use Pages option is active)

Capabilities

  • Multi-page scraping (page navigation)
  • Traversal of internal site pages down to a specified depth (the Parse to level option), allowing iteration over all site pages and collection of internal and external links
  • Determining the follow/nofollow status of links
  • Page traversal limit (Follow links limit option)
  • Ability to specify whether subdomains should be considered internal site pages
  • Supports gzip/deflate/brotli compression
  • Detection and conversion of site encodings to UTF-8 (a rough sketch follows this list)
  • CloudFlare bypass
  • Engine selection (HTTP or Chrome)
  • Support for all HTML::LinkExtractor features
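
To give a feel for the encoding-detection capability, here is a rough, simplified sketch that falls back to the page's own meta charset declaration when the HTTP headers don't specify one; real detection is considerably more involved:

import re

import requests

META_CHARSET_RE = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.I)

def fetch_utf8(url: str) -> str:
    """Fetch a page and decode it to a UTF-8 string, guessing the charset."""
    # gzip/deflate decoded by requests; brotli needs the optional brotli package
    resp = requests.get(url, timeout=10)
    m = META_CHARSET_RE.search(resp.content)
    encoding = m.group(1).decode("ascii", "ignore") if m else (resp.encoding or "utf-8")
    return resp.content.decode(encoding, errors="replace")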

Use Cases

  • Scraping email addresses
  • Outputting the count of email addresses

Queries

As queries, you must specify links to pages, for example:

https://a-parser.com/pages/support/

Output Result Examples

A-Parser supports flexible result formatting thanks to the built-in Template Toolkit templating engine, which allows results to be output in an arbitrary form as well as in structured formats such as CSV or JSON.

Email address count output

Result format:

$mailcount

Example result:

4
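
Outside of A-Parser, the same flexibility is easy to picture: once a query and its collected addresses are in hand, structured output is just serialization. A hedged Python sketch with illustrative field names (echoing the $mailcount variable above, not A-Parser's exact result schema):

import csv
import json
import sys

def dump_results(query: str, emails: set[str]) -> None:
    """Emit the collected addresses as JSON and CSV; field names are illustrative."""
    record = {"query": query, "mails": sorted(emails), "mailcount": len(emails)}
    print(json.dumps(record, indent=2))  # structured JSON output
    writer = csv.writer(sys.stdout)      # flat CSV output
    writer.writerow(["query", "mail"])
    for mail in record["mails"]:
        writer.writerow([query, mail])

dump_results("https://a-parser.com/pages/support/", {"user@example.com"})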

Possible Settings

Parameter Name | Default Value | Description
Good status | All | Select which server response is considered successful. If the scraper receives a different server response, the request will be retried with a different proxy
Good code RegEx | – | Ability to specify a regular expression for checking the response code
Ban Proxy Code RegEx | – | Ability to temporarily ban proxies (Proxy ban time) based on the server response code
Method | GET | Request method
POST body | – | Content to send to the server when using the POST method. Supports the variables $query (request URL), $query.orig (original query), and $pagenum (page number when the Use Pages option is active)
Cookies | – | Ability to specify cookies for the request
User agent | Automatically inserts the user-agent of the current Chrome version | User-Agent header to send when requesting pages
Additional headers | – | Ability to specify custom request headers with support for templating features and variables from the query constructor
Read only headers | – | Read headers only. In some cases, this saves traffic if there is no need to process content
Detect charset on content | – | Detect encoding based on page content
Emulate browser headers | – | Emulate browser headers
Max redirects count | 0 | Maximum number of redirects the scraper will follow
Follow common redirects | – | Allows http <-> https and www.domain <-> domain redirects within the same domain, bypassing the Max redirects count limit
Max cookies count | 16 | Maximum number of cookies to save
Engine | HTTP (Fast, JavaScript Disabled) | Allows selection of the HTTP engine (faster, no JavaScript) or the Chrome engine (slower, JavaScript enabled)
Chrome Headless | – | If enabled, the browser window is not displayed
Chrome DevTools | – | Allows using Chromium debugging tools
Chrome Log Proxy connections | – | If enabled, information about Chrome connections is output to the log
Chrome Wait Until | networkidle2 | Determines when a page is considered loaded. More about values
Use HTTP/2 transport | – | Determines whether to use HTTP/2 instead of HTTP/1.1. For example, Google and Majestic immediately ban requests made over HTTP/1.1
Don't verify TLS certs | – | Disables TLS certificate validation
Randomize TLS Fingerprint | – | Allows bypassing site bans based on the TLS fingerprint
Bypass CloudFlare | – | Automatic CloudFlare protection bypass
Bypass CloudFlare with Chrome (Experimental) | – | CloudFlare bypass using Chrome
Bypass CloudFlare with Chrome Max Pages | 20 | Maximum number of pages when bypassing CloudFlare via Chrome
Subdomains are internal | – | Whether subdomains should be considered internal site pages
Follow links | Internal only | Which links to follow
Follow links limit | 0 | Limit on followed links, applied to each unique domain
Skip comment blocks | – | Whether comment blocks should be skipped
Search Cloudflare protected e-mails | – | Whether to scrape Cloudflare-protected email addresses
Skip non-HTML blocks | – | Do not collect email addresses inside non-HTML blocks (script, style, comments, etc.)
Skip meta tags | – | Do not collect email addresses in meta tags
Search URL encoded e-mails | – | Collect URL-encoded email addresses