Parsing a base for XRumer: 420,000 forums in 9 hours

Discussion in 'Share Experience' started by Support, Sep 16, 2015.

  1. Support

    For parsing we will use a single keyword, "forum", expanded with digit and letter substitutions. We will not use the inurl: operator, which greatly increases parsing speed.

    [​IMG]

    • Use the SE::Google parser with the "1000 Links use Proxy" preset
    • Add link filtering by a regular expression that matches only popular forum engines
    • Keep links unique by top-level domain
    • Use 2 query formats: letter combinations from a to zzzz, and numbers from 1 to 50000
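
The two query formats from the list above can be sketched in plain Python. This is only an illustration, not A-Parser's own code: the function names `az_range` and `build_queries` are hypothetical, and it assumes the letter range runs through every combination of 1 to 4 lowercase letters. That assumption gives the same total as the request count reported at the end of this post.

```python
from itertools import product
from string import ascii_lowercase


def az_range(max_len):
    """Yield 'a', 'b', ..., 'z', 'aa', ..., up to max_len letters (assumed meaning of a..zzzz)."""
    for length in range(1, max_len + 1):
        for letters in product(ascii_lowercase, repeat=length):
            yield "".join(letters)


def build_queries(keyword, max_len=4, max_num=50000):
    """Expand one keyword into 'keyword <letters>' and 'keyword <number>' queries."""
    for suffix in az_range(max_len):
        yield f"{keyword} {suffix}"
    for n in range(1, max_num + 1):
        yield f"{keyword} {n}"


queries = list(build_queries("forum"))
print(len(queries))  # 525254 = 475254 letter queries + 50000 number queries
print(queries[0], "|", queries[-1])  # forum a | forum 50000
```
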
    Code:
    eyJwcmVzZXQiOiJkZWZhdWx0IiwidmFsdWUiOnsicGFyc2VycyI6W1siU0U6Okdv
    b2dsZSIsIjEwMDAgTGlua3MgdXNlIFByb3h5Iix7InR5cGUiOiJmaWx0ZXIiLCJy
    ZXN1bHQiOlsic2VycCIsImxpbmsiXSwiZmlsdGVyVHlwZSI6InJlbWF0Y2giLCJ2
    YWx1ZSI6InZpZXd0b3BpY1xcLnBocHx2aWV3Zm9ydW1cXC5waHB8dmlld3RocmVh
    ZFxcLnBocHx0aHJlYWQtfGZvcnVtXFwucGhwfHNob3d0aHJlYWRcXC5waHB8Zm9y
    dW1kaXNwbGF5XFwucGhwfFlhQkJcXC5wbHxZYUJCXFwuY2dpfHViYnRocmVhZHNc
    XC5waHB8dWx0aW1hdGViYlxcLnBocHx1bHRpbWF0ZWJiXFwuY2dpfGluZGV4XFwu
    cGhwXFw/c2hvd3RvcGljPXx0aHJlYWRzfHRvcGljfG1lbWJlcnN8bWVtYmVyXFwu
    cGhwfG1lbWJlcmxpc3RcXC5waHB8cHJvZmlsZVxcLnBocHx1c2VyaW5mb1xcLnBo
    cHx2aWV3dG9waWN8dmlld2ZvcnVtfHZpZXd0aHJlYWR8dG9waWN8dGhyZWFkfHNo
    b3d0aHJlYWR8c2hvd3RvcGljfHNob3dmb3J1bSIsIm9wdGlvbiI6ImkifSx7InR5
    cGUiOiJ1bmlxdWUiLCJyZXN1bHQiOlsic2VycCIsImxpbmsiXSwidW5pcXVlVHlw
    ZSI6InRvcGRvbWFpbiIsInVuaXF1ZUdsb2JhbCI6dHJ1ZX1dXSwicmVzdWx0c0Zv
    cm1hdCI6IiRwMS5wcmVzZXQiLCJyZXN1bHRzU2F2ZVRvIjoiZmlsZSIsInJlc3Vs
    dHNGaWxlTmFtZSI6Ik5vdi0wNV8xMS01Mi0xNS50eHQiLCJhZGRpdGlvbmFsRm9y
    bWF0cyI6W10sInJlc3VsdHNVbmlxdWUiOiJubyIsInF1ZXJ5Rm9ybWF0IjpbIiRx
    dWVyeSB7YXo6YTp6enp6fSIsIiRxdWVyeSB7bnVtOjE6NTAwMDB9Il0sInVuaXF1
    ZVF1ZXJpZXMiOmZhbHNlLCJzYXZlRmFpbGVkUXVlcmllcyI6ZmFsc2UsIml0ZXJh
    dG9yT3B0aW9ucyI6eyJvbkFsbExldmVscyI6ZmFsc2V9LCJkb0xvZyI6Im5vIiwi
    a2VlcFVuaXF1ZSI6Ik5vIiwibW9yZU9wdGlvbnMiOmZhbHNlLCJyZXN1bHRzUHJl
    cGVuZCI6IiIsInJlc3VsdHNBcHBlbmQiOiIiLCJxdWVyeUJ1aWxkZXJzIjpbXSwi
    cmVzdWx0c0J1aWxkZXJzIjpbXSwiY29uZmlnT3ZlcnJpZGVzIjpbXX0sInBhcnNl
    cnNDb25mUHJlc2V0cyI6eyJTRTo6R29vZ2xlIjp7IjEwMDAgTGlua3MgdXNlIFBy
    b3h5Ijp7InF1ZXJ5Zm9ybWF0IjoiJHF1ZXJ5IiwicGFyc2Vub3Rmb3VuZCI6dHJ1
    ZSwiZ2wiOiIiLCJwYWdlY291bnQiOiIxMCIsImRvX2d6aXAiOnRydWUsImRvbWFp
    biI6Ind3dy5nb29nbGUuY29tIiwidGltZW91dCI6IjYwIiwidXNlcHJveHkiOnRy
    dWUsImFudGlnYXRlcHJlc2V0IjoiZGVmYXVsdCIsImV4dHJhcXVlcnkiOiIiLCJs
    b2NhdGlvbiI6IiIsInVzZXNlc3Npb25zIjp0cnVlLCJzZXJwdGltZSI6IiIsImxp
    bmtzcGVycGFnZSI6IjEwMCIsImZpbHRlciI6dHJ1ZSwic2VycCI6IiIsInVzZWFu
    dGlnYXRlIjpmYWxzZSwicHJveHlyZXRyaWVzIjoiMTUiLCJyZXF1ZXN0ZGVsYXki
    OiIwIiwicHJveHliYW5uZWRjbGVhbnVwIjoiNjAwIiwiZm9ybWF0cmVzdWx0Ijoi
    JHNlcnAuZm9ybWF0KCckbGlua1xcbicpIiwicmF3ZGF0YSI6MCwibHIiOiIiLCJ1
    c2VjYXB0Y2hha2lsbGVyIjpmYWxzZSwibWF4X3NpemUiOiIyMDQ4MDAifX19fQ==
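
The preset above is base64-encoded JSON. Among other settings it holds a case-insensitive regex filter on the result links and a uniqueness rule by top domain. A minimal sketch of those two steps outside A-Parser might look like the following. The pattern is an abbreviated subset of the one in the preset, the helper names `top_domain` and `filter_unique` are hypothetical, and the "last two host labels" rule is a simplification that will misjudge suffixes such as co.uk.

```python
import re
from urllib.parse import urlsplit

# Abbreviated subset of forum-engine URL signatures (assumption: shortened for illustration).
FORUM_RE = re.compile(
    r"viewtopic\.php|viewforum\.php|showthread\.php|forumdisplay\.php"
    r"|YaBB\.pl|ubbthreads\.php|index\.php\?showtopic=|memberlist\.php",
    re.IGNORECASE,
)


def top_domain(url):
    """Naive top domain: the last two labels of the host name."""
    host = (urlsplit(url).hostname or "").lower()
    return ".".join(host.split(".")[-2:])


def filter_unique(links):
    """Keep links that match FORUM_RE, one per top domain, in first-seen order."""
    seen = set()
    kept = []
    for link in links:
        if not FORUM_RE.search(link):
            continue
        domain = top_domain(link)
        if domain and domain not in seen:
            seen.add(domain)
            kept.append(link)
    return kept


sample = [
    "http://board.example.com/viewtopic.php?t=1",
    "http://example.com/showthread.php?t=2",  # same top domain as above, dropped
    "http://example.org/about.html",  # no forum signature, dropped
    "http://forum.example.net/index.php?showtopic=3",
]
print(filter_unique(sample))
```

On the sample list only the first and the last link survive: the second shares a top domain with the first, and the third does not match the filter.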

    As a result we get a database of forum links containing 421,618 unique domains:

    [​IMG]

    In 9 hours:
    • 525,254 requests were processed to the maximum depth
    • 68 million links were parsed, of which 420K passed the filter and were unique by domain
    • The average parsing speed was about 1,000 requests per minute
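
As a quick sanity check, the figures above can be cross-checked with a few lines of arithmetic. The variable names are illustrative, and the link total is the rounded "68 million" from the post, so the derived ratios are approximate.

```python
total_requests = 525_254
links_parsed = 68_000_000  # rounded figure from the post
links_kept = 421_618       # unique forum domains in the final base
minutes = 9 * 60

requests_per_minute = round(total_requests / minutes)
links_per_request = round(links_parsed / total_requests)
kept_percent = round(100 * links_kept / links_parsed, 2)

print(requests_per_minute)  # 973, close to the stated 1,000 per minute
print(links_per_request)    # 129 links collected per request on average
print(kept_percent)         # 0.62 percent of parsed links ended up in the base
```
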
     
