Collecting 1.65 million emails from contact pages in 2.5 hours

Discussion in 'Share Experience' started by Support, Sep 16, 2015.

    1. First, use the SE::Google parser to collect links to contact pages:

    [​IMG]

    • Choose the 1000 Links use Proxy preset, which saves the links for each query at maximum depth
    • Make the links unique by domain
    • Specify the key phrase "контакты" ("contacts" in Russian)
    • Add 2 query formats to multiply the queries and get a large number of results
    Code:
    eyJwcmVzZXQiOiJkZWZhdWx0IiwidmFsdWUiOnsicGFyc2VycyI6W1siU0U6Okdv
    b2dsZSIsIjEwMDAgTGlua3MgdXNlIFByb3h5Iix7InR5cGUiOiJ1bmlxdWUiLCJy
    ZXN1bHQiOlsic2VycCIsImxpbmsiXSwidW5pcXVlVHlwZSI6ImRvbWFpbiIsInVu
    aXF1ZUdsb2JhbCI6dHJ1ZX1dXSwicmVzdWx0c0Zvcm1hdCI6IiRwMS5wcmVzZXQi
    LCJyZXN1bHRzU2F2ZVRvIjoiZmlsZSIsInJlc3VsdHNGaWxlTmFtZSI6Imxpbmtz
    LWNvbnRhY3RzLU9jdC0wNl8wOC0yOS01OS50eHQiLCJhZGRpdGlvbmFsRm9ybWF0
    cyI6W10sInJlc3VsdHNVbmlxdWUiOiJubyIsInF1ZXJ5Rm9ybWF0IjpbIiRxdWVy
    eSB7YXo6YTp6enp9IiwiJHF1ZXJ5IHtudW06MToxMDAwMH0iXSwidW5pcXVlUXVl
    cmllcyI6ZmFsc2UsInNhdmVGYWlsZWRRdWVyaWVzIjpmYWxzZSwiaXRlcmF0b3JP
    cHRpb25zIjp7Im9uQWxsTGV2ZWxzIjpmYWxzZX0sImRvTG9nIjoibm8iLCJrZWVw
    VW5pcXVlIjoiTm8iLCJtb3JlT3B0aW9ucyI6ZmFsc2UsInJlc3VsdHNQcmVwZW5k
    IjoiIiwicmVzdWx0c0FwcGVuZCI6IiIsInF1ZXJ5QnVpbGRlcnMiOltdLCJyZXN1
    bHRzQnVpbGRlcnMiOltdLCJjb25maWdPdmVycmlkZXMiOltdfSwicGFyc2Vyc0Nv
    bmZQcmVzZXRzIjp7IlNFOjpHb29nbGUiOnsiMTAwMCBMaW5rcyB1c2UgUHJveHki
    OnsicXVlcnlmb3JtYXQiOiIkcXVlcnkiLCJwYXJzZW5vdGZvdW5kIjp0cnVlLCJn
    bCI6IiIsInBhZ2Vjb3VudCI6IjEwIiwiZG9fZ3ppcCI6dHJ1ZSwiZG9tYWluIjoi
    d3d3Lmdvb2dsZS5jb20iLCJ0aW1lb3V0IjoiNjAiLCJ1c2Vwcm94eSI6dHJ1ZSwi
    YW50aWdhdGVwcmVzZXQiOiJkZWZhdWx0IiwiZXh0cmFxdWVyeSI6IiIsImxvY2F0
    aW9uIjoiIiwidXNlc2Vzc2lvbnMiOnRydWUsInNlcnB0aW1lIjoiIiwibGlua3Nw
    ZXJwYWdlIjoiMTAwIiwiZmlsdGVyIjp0cnVlLCJzZXJwIjoiIiwidXNlYW50aWdh
    dGUiOmZhbHNlLCJwcm94eXJldHJpZXMiOiIxNSIsInJlcXVlc3RkZWxheSI6IjAi
    LCJwcm94eWJhbm5lZGNsZWFudXAiOiI2MDAiLCJmb3JtYXRyZXN1bHQiOiIkc2Vy
    cC5mb3JtYXQoJyRsaW5rXFxuJykiLCJyYXdkYXRhIjowLCJsciI6IiIsInVzZWNh
    cHRjaGFraWxsZXIiOmZhbHNlLCJtYXhfc2l6ZSI6IjIwNDgwMCJ9fX19
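
To make the "2 query formats" step easier to picture, here is a minimal Python sketch of how one key phrase can be multiplied into many queries. It assumes the two formats append an alphabetic suffix (a to zzz) and a numeric suffix (1 to 10000) to the phrase, which is what the task code above appears to contain. The function names are hypothetical and this is not A-Parser's own implementation.

```python
from itertools import product
from string import ascii_lowercase


def az_suffixes(max_len=3):
    """Yield a, b, ..., z, aa, ab, ..., up to max_len letters (assumed a..zzz range)."""
    for length in range(1, max_len + 1):
        for letters in product(ascii_lowercase, repeat=length):
            yield "".join(letters)


def expand_queries(phrase, max_len=3, num_to=10000):
    """Multiply one key phrase into many queries (hypothetical helper)."""
    # First assumed format: phrase plus an alphabetic suffix.
    for suffix in az_suffixes(max_len):
        yield f"{phrase} {suffix}"
    # Second assumed format: phrase plus a number from 1 to num_to.
    for number in range(1, num_to + 1):
        yield f"{phrase} {number}"


queries = list(expand_queries("контакты"))
```

Under these assumptions a single key phrase turns into 18,278 alphabetic queries plus 10,000 numeric ones, and each query returns a different slice of search results.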

    The result is a database of 1,663,086 links to the contact pages of various sites:

    [​IMG]
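
The "unique by domain" filter from this step can be sketched roughly as follows. This is only an illustration in Python: `unique_by_domain` is a hypothetical helper, and it compares full hostnames, which may differ from how A-Parser itself defines a domain (for example, `www.example.com` and `example.com` count as different here).

```python
from urllib.parse import urlsplit


def unique_by_domain(links):
    """Keep only the first link seen for each hostname (hypothetical helper)."""
    seen = set()
    kept = []
    for link in links:
        # hostname is lower-cased by urlsplit; it is None for strings without a host.
        domain = urlsplit(link).hostname
        if domain is None or domain in seen:
            continue
        seen.add(domain)
        kept.append(link)
    return kept
```

Keeping one contact page per site avoids fetching the same site many times in the second step.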

    2. Next, run the Net::HTTP parser over the collected links and extract the email addresses with a regular expression:

    [​IMG]

    • Disable the use of proxies
    • With the Parse custom result option, specify the regular expression ((?>\b[-a-z0-9._%+]+)@[a-z0-9.-]+\.[a-z]{2,6})\b to collect email addresses from the page source ($data)
    • Convert the results to lower case and make them unique by string
    • As the queries, specify the link database collected in the first task
    Code:
    eyJwcmVzZXQiOiJkZWZhdWx0IiwidmFsdWUiOnsicGFyc2VycyI6W1siTmV0OjpI
    VFRQIiwiZGVmYXVsdCIseyJ0eXBlIjoib3ZlcnJpZGUiLCJpZCI6InVzZXByb3h5
    IiwidmFsdWUiOmZhbHNlfSx7InR5cGUiOiJjdXN0b21SZXN1bHQiLCJyZXN1bHQi
    OiJkYXRhIiwicmVnZXgiOiIoKD8+XFxiWy1hLXowLTkuXyUrXSspQFthLXowLTku
    LV0rXFwuW2Etel17Miw2fSlcXGIiLCJyZWdleFR5cGUiOiJpZyIsInJlc3VsdFR5
    cGUiOiJhcnJheSIsImFycmF5TmFtZSI6Im1haWxzIiwicmVzdWx0cyI6WyJtYWls
    Il19LHsidHlwZSI6InVuaXF1ZSIsInJlc3VsdCI6WyJtYWlscyIsIm1haWwiXSwi
    dW5pcXVlVHlwZSI6InN0cmluZyIsInVuaXF1ZUdsb2JhbCI6dHJ1ZX1dXSwicmVz
    dWx0c0Zvcm1hdCI6IiRwMS5tYWlscy5mb3JtYXQoJyRtYWlsXFxuJykiLCJyZXN1
    bHRzU2F2ZVRvIjoiZmlsZSIsInJlc3VsdHNGaWxlTmFtZSI6IiRkYXRlZmlsZS5m
    b3JtYXQoKS50eHQiLCJhZGRpdGlvbmFsRm9ybWF0cyI6W10sInJlc3VsdHNVbmlx
    dWUiOiJubyIsInF1ZXJ5Rm9ybWF0IjpbIiRxdWVyeSJdLCJ1bmlxdWVRdWVyaWVz
    IjpmYWxzZSwic2F2ZUZhaWxlZFF1ZXJpZXMiOmZhbHNlLCJpdGVyYXRvck9wdGlv
    bnMiOnsib25BbGxMZXZlbHMiOmZhbHNlfSwiZG9Mb2ciOiJubyIsImtlZXBVbmlx
    dWUiOiJObyIsIm1vcmVPcHRpb25zIjpmYWxzZSwicmVzdWx0c1ByZXBlbmQiOiIi
    LCJyZXN1bHRzQXBwZW5kIjoiIiwicXVlcnlCdWlsZGVycyI6W10sInJlc3VsdHNC
    dWlsZGVycyI6W3sic291cmNlIjpbMCxbIm1haWxzIiwibWFpbCJdXSwidHlwZSI6
    ImxjIiwiYXJyYXkiOiJtYWlscyIsInRvIjoibWFpbCJ9XSwiY29uZmlnT3ZlcnJp
    ZGVzIjpbXX19
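
For readers who want to test the extraction logic outside A-Parser, the same idea (regular expression, lower-casing, uniqueness by string) can be sketched in Python. Two assumptions here: the atomic group `(?>...)` is replaced by a plain pattern, because older Python versions do not support atomic groups and in this pattern it should only affect speed, not the matches; and `extract_emails` is a hypothetical helper, not part of A-Parser.

```python
import re

# Same pattern as in the task, minus the atomic group; case-insensitive
# like the regex flags used in the task.
EMAIL_RE = re.compile(
    r"\b[-a-z0-9._%+]+@[a-z0-9.-]+\.[a-z]{2,6}\b",
    re.IGNORECASE,
)


def extract_emails(page_source, seen):
    """Return new lower-cased addresses found in page_source.

    `seen` is a set shared between pages, so an address is reported
    only once across the whole run (uniqueness by string).
    """
    fresh = []
    for match in EMAIL_RE.findall(page_source):
        mail = match.lower()
        if mail not in seen:
            seen.add(mail)
            fresh.append(mail)
    return fresh
```

Note that `{2,6}` limits the top-level domain to 2 to 6 letters, so addresses in longer zones are skipped by this pattern.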

    As a result, we get a database containing 1,647,115 unique email addresses:

    [​IMG]

    • The average processing speed was 12,000 links per minute
    • Top 10 email domains:
    Code:
    249772 mail.ru
    129894 gmail.com
    91901 yandex.ru
    25625 rambler.ru
    20821 bk.ru
    19773 hotmail.com
    14656 yahoo.com
    14117 list.ru
    13636 inbox.ru
    11670 ukr.net
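
A domain ranking like the one above can be built from the resulting email file with a few lines of Python. This is a sketch under the assumption that the addresses are already lower-cased and one per list item. `top_domains` is a hypothetical helper and not necessarily how the numbers in this post were produced.

```python
from collections import Counter


def top_domains(emails, n=10):
    """Count addresses per domain and return the n most common (hypothetical helper)."""
    counts = Counter(
        mail.rsplit("@", 1)[1]  # everything after the last "@"
        for mail in emails
        if "@" in mail
    )
    return counts.most_common(n)
```

Example: `top_domains(["a@mail.ru", "b@mail.ru", "c@gmail.com"], 2)` gives `mail.ru` with 2 addresses and `gmail.com` with 1.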
    
     
