
Collecting 1.65 million emails from contact pages in 2.5 hours

Discussion in 'Share Experience' started by Support, Sep 16, 2015.

    1. First, use the SE::Google parser to collect links to contact pages:

    • Choose the "1000 Links use Proxy" preset, which saves the links for each query at maximum depth
    • Add deduplication of links by domain
    • Specify the key phrase "контакты" (Russian for "contacts")
    • Add 2 query formats to multiply the queries and obtain a large number of results
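
A rough sketch of how the two query formats could multiply a single key phrase. The formats in this step's task code decode to `$query {az:a:zzz}` and `$query {num:1:10000}`. The sketch assumes those macros enumerate the strings a through zzz and the numbers 1 through 10000, which is an assumption about the macro behavior rather than something stated in the post. The helper names `az_range` and `expand_queries` are illustrative only.

```python
from itertools import product
from string import ascii_lowercase


def az_range(max_len=3):
    """Yield a, b, ..., z, aa, ..., zzz (every lowercase string up to max_len)."""
    for length in range(1, max_len + 1):
        for letters in product(ascii_lowercase, repeat=length):
            yield "".join(letters)


def expand_queries(key):
    """Build the query list for one key phrase under the assumed macro semantics."""
    by_letters = [f"{key} {suffix}" for suffix in az_range(3)]  # $query {az:a:zzz}
    by_numbers = [f"{key} {n}" for n in range(1, 10001)]        # $query {num:1:10000}
    return by_letters + by_numbers


queries = expand_queries("контакты")
# 26 + 26**2 + 26**3 = 18278 letter suffixes, plus 10000 numeric suffixes
print(len(queries))
print(queries[0], "|", queries[-1])
```

Under these assumptions one key phrase becomes 28,278 distinct queries, which is how a single word can yield well over a million links.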
    Code:
    eyJwcmVzZXQiOiJkZWZhdWx0IiwidmFsdWUiOnsicGFyc2VycyI6W1siU0U6Okdv
    b2dsZSIsIjEwMDAgTGlua3MgdXNlIFByb3h5Iix7InR5cGUiOiJ1bmlxdWUiLCJy
    ZXN1bHQiOlsic2VycCIsImxpbmsiXSwidW5pcXVlVHlwZSI6ImRvbWFpbiIsInVu
    aXF1ZUdsb2JhbCI6dHJ1ZX1dXSwicmVzdWx0c0Zvcm1hdCI6IiRwMS5wcmVzZXQi
    LCJyZXN1bHRzU2F2ZVRvIjoiZmlsZSIsInJlc3VsdHNGaWxlTmFtZSI6Imxpbmtz
    LWNvbnRhY3RzLU9jdC0wNl8wOC0yOS01OS50eHQiLCJhZGRpdGlvbmFsRm9ybWF0
    cyI6W10sInJlc3VsdHNVbmlxdWUiOiJubyIsInF1ZXJ5Rm9ybWF0IjpbIiRxdWVy
    eSB7YXo6YTp6enp9IiwiJHF1ZXJ5IHtudW06MToxMDAwMH0iXSwidW5pcXVlUXVl
    cmllcyI6ZmFsc2UsInNhdmVGYWlsZWRRdWVyaWVzIjpmYWxzZSwiaXRlcmF0b3JP
    cHRpb25zIjp7Im9uQWxsTGV2ZWxzIjpmYWxzZX0sImRvTG9nIjoibm8iLCJrZWVw
    VW5pcXVlIjoiTm8iLCJtb3JlT3B0aW9ucyI6ZmFsc2UsInJlc3VsdHNQcmVwZW5k
    IjoiIiwicmVzdWx0c0FwcGVuZCI6IiIsInF1ZXJ5QnVpbGRlcnMiOltdLCJyZXN1
    bHRzQnVpbGRlcnMiOltdLCJjb25maWdPdmVycmlkZXMiOltdfSwicGFyc2Vyc0Nv
    bmZQcmVzZXRzIjp7IlNFOjpHb29nbGUiOnsiMTAwMCBMaW5rcyB1c2UgUHJveHki
    OnsicXVlcnlmb3JtYXQiOiIkcXVlcnkiLCJwYXJzZW5vdGZvdW5kIjp0cnVlLCJn
    bCI6IiIsInBhZ2Vjb3VudCI6IjEwIiwiZG9fZ3ppcCI6dHJ1ZSwiZG9tYWluIjoi
    d3d3Lmdvb2dsZS5jb20iLCJ0aW1lb3V0IjoiNjAiLCJ1c2Vwcm94eSI6dHJ1ZSwi
    YW50aWdhdGVwcmVzZXQiOiJkZWZhdWx0IiwiZXh0cmFxdWVyeSI6IiIsImxvY2F0
    aW9uIjoiIiwidXNlc2Vzc2lvbnMiOnRydWUsInNlcnB0aW1lIjoiIiwibGlua3Nw
    ZXJwYWdlIjoiMTAwIiwiZmlsdGVyIjp0cnVlLCJzZXJwIjoiIiwidXNlYW50aWdh
    dGUiOmZhbHNlLCJwcm94eXJldHJpZXMiOiIxNSIsInJlcXVlc3RkZWxheSI6IjAi
    LCJwcm94eWJhbm5lZGNsZWFudXAiOiI2MDAiLCJmb3JtYXRyZXN1bHQiOiIkc2Vy
    cC5mb3JtYXQoJyRsaW5rXFxuJykiLCJyYXdkYXRhIjowLCJsciI6IiIsInVzZWNh
    cHRjaGFraWxsZXIiOmZhbHNlLCJtYXhfc2l6ZSI6IjIwNDgwMCJ9fX19
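
The task code above appears to be Base64-encoded JSON wrapped at 64 characters (it begins with `eyJ`, the Base64 form of `{"`). A minimal sketch for inspecting such a block outside the program, assuming that encoding holds for the whole block; `decode_task_code` is a hypothetical helper name and the sample data is made up for the round trip.

```python
import base64
import json


def decode_task_code(code):
    """Decode an exported task code, assuming it is Base64-encoded JSON."""
    compact = "".join(code.split())  # drop line breaks and indentation
    return json.loads(base64.b64decode(compact).decode("utf-8"))


# Round trip on made-up settings, wrapped at 64 characters like the blocks in this post.
original = {"preset": "default", "value": {"queryFormat": ["$query {num:1:3}"]}}
encoded = base64.b64encode(json.dumps(original).encode("utf-8")).decode("ascii")
wrapped = "\n".join("    " + encoded[i:i + 64] for i in range(0, len(encoded), 64))

decoded = decode_task_code(wrapped)
print(decoded["value"]["queryFormat"])
```

Pasting one of the post's blocks into `decode_task_code` should expose fields such as `queryFormat` and `resultsFormat` for review before the task is imported.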

    This first task produces a database of 1,663,086 links to the contact pages of various sites.

    2. Run the Net::HTTP parser over the collected links and extract the email addresses with a regular expression:

    • Disable proxy use
    • With the Parse custom result option, specify the regular expression ((?>\b[-a-z0-9._%+]+)@[a-z0-9.-]+\.[a-z]{2,6})\b to collect email addresses from the page source ($data)
    • Convert the results to lower case and deduplicate them by string
    • As the queries, specify the database of links collected in the first task
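
The extraction, lower-casing and deduplication steps can be approximated outside the parser. This is a sketch in Python, not the parser's own implementation: the atomic group `(?>...)` is dropped because Python's `re` only accepts it from version 3.11 onward, and since `@` is not in the local-part character class, removing it should not change what is matched. The function name `extract_emails` and the sample HTML are made up for illustration.

```python
import re

# Same character classes as the pattern in the post, without the atomic group.
EMAIL_RE = re.compile(r"\b[-a-z0-9._%+]+@[a-z0-9.-]+\.[a-z]{2,6}\b", re.IGNORECASE)


def extract_emails(page_source, seen):
    """Return new lowercase addresses found in page_source, recording them in seen."""
    fresh = []
    for match in EMAIL_RE.finditer(page_source):
        mail = match.group(0).lower()
        if mail not in seen:  # global deduplication by exact string
            seen.add(mail)
            fresh.append(mail)
    return fresh


seen = set()
html = '<a href="mailto:Info@Example.com">info@example.com</a> sales@example.org'
print(extract_emails(html, seen))
```

Sharing one `seen` set across all pages mirrors the global "unique by string" setting: an address already collected from an earlier page is not emitted again.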
    Code:
    eyJwcmVzZXQiOiJkZWZhdWx0IiwidmFsdWUiOnsicGFyc2VycyI6W1siTmV0OjpI
    VFRQIiwiZGVmYXVsdCIseyJ0eXBlIjoib3ZlcnJpZGUiLCJpZCI6InVzZXByb3h5
    IiwidmFsdWUiOmZhbHNlfSx7InR5cGUiOiJjdXN0b21SZXN1bHQiLCJyZXN1bHQi
    OiJkYXRhIiwicmVnZXgiOiIoKD8+XFxiWy1hLXowLTkuXyUrXSspQFthLXowLTku
    LV0rXFwuW2Etel17Miw2fSlcXGIiLCJyZWdleFR5cGUiOiJpZyIsInJlc3VsdFR5
    cGUiOiJhcnJheSIsImFycmF5TmFtZSI6Im1haWxzIiwicmVzdWx0cyI6WyJtYWls
    Il19LHsidHlwZSI6InVuaXF1ZSIsInJlc3VsdCI6WyJtYWlscyIsIm1haWwiXSwi
    dW5pcXVlVHlwZSI6InN0cmluZyIsInVuaXF1ZUdsb2JhbCI6dHJ1ZX1dXSwicmVz
    dWx0c0Zvcm1hdCI6IiRwMS5tYWlscy5mb3JtYXQoJyRtYWlsXFxuJykiLCJyZXN1
    bHRzU2F2ZVRvIjoiZmlsZSIsInJlc3VsdHNGaWxlTmFtZSI6IiRkYXRlZmlsZS5m
    b3JtYXQoKS50eHQiLCJhZGRpdGlvbmFsRm9ybWF0cyI6W10sInJlc3VsdHNVbmlx
    dWUiOiJubyIsInF1ZXJ5Rm9ybWF0IjpbIiRxdWVyeSJdLCJ1bmlxdWVRdWVyaWVz
    IjpmYWxzZSwic2F2ZUZhaWxlZFF1ZXJpZXMiOmZhbHNlLCJpdGVyYXRvck9wdGlv
    bnMiOnsib25BbGxMZXZlbHMiOmZhbHNlfSwiZG9Mb2ciOiJubyIsImtlZXBVbmlx
    dWUiOiJObyIsIm1vcmVPcHRpb25zIjpmYWxzZSwicmVzdWx0c1ByZXBlbmQiOiIi
    LCJyZXN1bHRzQXBwZW5kIjoiIiwicXVlcnlCdWlsZGVycyI6W10sInJlc3VsdHNC
    dWlsZGVycyI6W3sic291cmNlIjpbMCxbIm1haWxzIiwibWFpbCJdXSwidHlwZSI6
    ImxjIiwiYXJyYXkiOiJtYWlscyIsInRvIjoibWFpbCJ9XSwiY29uZmlnT3ZlcnJp
    ZGVzIjpbXX19

    This second task produces a database of 1,647,115 unique email addresses.

    • The average processing speed was 12,000 links per minute
    • Top 10 email domains:
    Code:
    249772 mail.ru
    129894 gmail.com
    91901 yandex.ru
    25625 rambler.ru
    20821 bk.ru
    19773 hotmail.com
    14656 yahoo.com
    14117 list.ru
    13636 inbox.ru
    11670 ukr.net
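
The post does not say how this ranking was produced. One way to build a similar tally from the resulting list of addresses is sketched below; `top_domains` and the sample addresses are hypothetical.

```python
from collections import Counter


def top_domains(emails, n=10):
    """Count addresses per domain and return the n most common (domain, count) pairs."""
    counts = Counter(
        mail.rsplit("@", 1)[1].lower() for mail in emails if "@" in mail
    )
    return counts.most_common(n)


# Made-up sample; in practice the list would be read from the results file.
sample = ["a@mail.ru", "b@mail.ru", "c@gmail.com", "d@Mail.ru", "e@yandex.ru", "f@gmail.com"]
for domain, count in top_domains(sample, 3):
    print(count, domain)
```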
    
     
