
HTML::EmailExtractor - parsing of e-mail addresses from site pages

Dec 23, 2020

    Collected data

    • Collect email addresses from specified pages


    • Following the internal pages of the site down to a specified depth (the Parse to level option), which lets the parser walk through all pages of the site, collecting internal and external links
    • Detecting the site encoding
    • Definition of links for links
    • Ability to specify subdomains as internal pages of a site

    Use cases

    • Collecting e-mail addresses published on site pages
    • Outputting the number of e-mail addresses found on the site through the $mailcount variable


    The query must specify the URL from which you want to scrape e-mail addresses.
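
The collection step described above can be illustrated with a minimal Python sketch. It is not the parser's actual implementation: the `extract_emails` function and the simplified regular expression are assumptions made for illustration, and the `mailcount` variable only mirrors the role of the $mailcount variable mentioned above.

```python
import re

# Simplified e-mail pattern (an assumption for this sketch; real-world
# address syntax is broader than this expression covers).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(page_text):
    """Return unique e-mail addresses in order of first appearance."""
    seen = set()
    result = []
    for match in EMAIL_RE.findall(page_text):
        email = match.lower()  # treat addresses as case-insensitive
        if email not in seen:
            seen.add(email)
            result.append(email)
    return result


if __name__ == "__main__":
    sample = "Write to Info@Example.com or sales@example.com, or info@example.com."
    emails = extract_emails(sample)
    mailcount = len(emails)  # hypothetical counterpart of $mailcount
    print(emails, mailcount)
```

Lowercasing before deduplication is a design choice of this sketch, so that Info@Example.com and info@example.com count as one address.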

    Parse to level option

    Tells the parser to follow links to neighboring pages of the site down to the specified depth, for example:
    • If level 1 is specified, the parser follows all links found on the initial page
    • If level 2 is specified, the parser follows all links found on the initial page, and then all links collected from the first-level pages
    • etc.
    Since neighboring pages will most likely link back to the initial page or repeat links already seen, the Unique queries option must be enabled so that the parser does not go in circles.
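
The level logic and the need for unique queries can be sketched as a breadth-first walk. This is a hedged illustration, not the parser's own code: the `crawl` function, its `get_links` callback, and the in-memory link graph are hypothetical names introduced here so that the example runs without network access.

```python
from collections import deque


def crawl(start_url, get_links, max_level):
    """Return the URLs that would be requested for a given Parse to level.

    get_links(url) is a hypothetical callback returning the links found on
    a page. Level 0 means only the initial page is requested.
    """
    seen = {start_url}          # plays the role of Unique queries
    order = [start_url]
    queue = deque([(start_url, 0)])
    while queue:
        url, level = queue.popleft()
        if level >= max_level:
            continue            # page is requested, but its links are not followed
        for link in get_links(url):
            if link in seen:
                continue        # skip repeats and links back to earlier pages
            seen.add(link)
            order.append(link)
            queue.append((link, level + 1))
    return order


if __name__ == "__main__":
    # Toy site: "/a" links back to "/", which would loop without the seen set.
    graph = {"/": ["/a", "/b"], "/a": ["/", "/c"], "/b": ["/a"], "/c": ["/d"]}
    for level in range(4):
        print(level, crawl("/", lambda u: graph.get(u, []), level))
```

Without the `seen` set, the link from "/a" back to "/" would be queued again on every pass, which is the cycle that Unique queries prevents.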


    • The result is the list of e-mail addresses found on the site

    Possible settings

    Global settings for all parsers
    Parameter | Default value | Description
    Good status | All | Selects which server response is considered successful. If the server returns a different response during parsing, the request is repeated with another proxy
    Good code RegEx | - | Allows specifying a regular expression to check the response code
    Method | GET | Request method
    POST body | - | Content sent to the server when using the POST method. Supports the variables $query (URL of the request), $query.orig (the initial request) and $pagenum (page number when the Use Pages option is used)
    Cookies | - | Allows specifying cookies for the request
    User agent | Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) | User-Agent header sent when requesting pages
    Additional headers | - | Allows specifying arbitrary request headers, with support for Template Toolkit features and variables from the Query builder
    Read only headers |  | Read only the headers. In some cases this saves traffic when there is no need to parse the content
    Detect charset on content |  | Detect the encoding based on the page contents
    Emulate browser headers |  | Emulate browser headers
    Max redirects count | 7 | Maximum number of redirects the parser will follow
    Max cookies count | 16 | Maximum number of cookies to store
    Bypass CloudFlare |  | Automatically bypass the CloudFlare browser check
    Subdomains are internal |  | Whether to consider subdomains as internal links
    Follow links | Internal only | Determines which links to follow
    Search Cloudflare protected e-mails |  | Whether to collect e-mail addresses protected by Cloudflare
    Skip non-HTML blocks |  | Do not collect e-mail addresses inside <script>, <style>, <!-- comment -->...
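
To make three of these options more concrete (Subdomains are internal, Search Cloudflare protected e-mails and Skip non-HTML blocks), here is a rough Python sketch. All function names are hypothetical and the parser's real behavior may differ. The Cloudflare part assumes the commonly observed obfuscation scheme, in which the address is stored as a hex string in a `data-cfemail` attribute and the first byte is an XOR key for the remaining bytes.

```python
import re
from urllib.parse import urlsplit


def is_internal(url, base_url, subdomains_internal):
    """Decide whether url belongs to the same site as base_url."""
    host = urlsplit(url).hostname or ""
    base_host = urlsplit(base_url).hostname or ""
    if host == base_host:
        return True
    # Optionally treat blog.example.com as part of example.com.
    return subdomains_internal and host.endswith("." + base_host)


def decode_cfemail(encoded_hex):
    """Decode a data-cfemail value: the first byte is the XOR key."""
    data = bytes.fromhex(encoded_hex)
    key = data[0]
    return bytes(b ^ key for b in data[1:]).decode("utf-8")


def find_cf_emails(html):
    """Collect and decode all data-cfemail attributes in the markup."""
    return [decode_cfemail(h) for h in re.findall(r'data-cfemail="([0-9a-fA-F]+)"', html)]


# Blocks whose contents are not visible page text.
NON_HTML_RE = re.compile(
    r"<script\b[^>]*>.*?</script>|<style\b[^>]*>.*?</style>|<!--.*?-->",
    re.DOTALL | re.IGNORECASE,
)


def strip_non_html_blocks(html):
    """Remove <script>, <style> and comment blocks before searching."""
    return NON_HTML_RE.sub("", html)


if __name__ == "__main__":
    print(is_internal("http://blog.example.com/p", "http://example.com/", True))
    print(find_cf_emails('<a data-cfemail="422302206c21">[email protected]</a>'))
    print(strip_non_html_blocks("<p>hi</p><script>var m='x@y.zz';</script>"))
```

Regular expressions are used here only for brevity; a real HTML parser would handle nested or malformed markup more reliably.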