Determine the CMS for 1,000,000 domains in 15 hours

Discussion in 'Share Experience' started by Support, Sep 15, 2015.

  1. Support

    As the domain list we use the Alexa top one million domains; the list can be downloaded here:
    Initial data:
    • Server with a quad-core Intel(R) Core(TM) i7 CPU 950 @ 3.07GHz, 8 GB RAM, and a 100 Mbit/s connection
    • The parser is set to use 6 CPU cores: with Hyper-Threading the processor exposes 8 logical cores, and 2 of them are left free to keep the system stable
    Screenshot of the task settings:

    [​IMG]

    • The source file with queries contains one million records in the format <alexa-rank>,<domain>, one domain per line. Using the Queries Builder, we separate the domain from its rank. The Rank::CMS parser needs a full link to the site or page, so we add http:// in the query format
    • We use the Rank::CMS parser with default settings, and specify that parsing runs without proxies and with at most 3 attempts per query
    • For convenience we save the result in two formats: the top-1m-cms.txt file gets the domain, Alexa Rank, and CMS name; the top-1m/ folder gets the domains, sorted automatically into files named after the CMS (i.e. WordPress.txt contains only domains running WordPress, and likewise for every detected CMS)
    • By default, the check covers all CMS, forum engines, and wiki engines
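The query and result handling described in the list above can be illustrated outside the parser. This is only a rough Python sketch, not A-Parser's own code: the function names are hypothetical, and the semicolon-separated field layout of the result line is an assumption.

```python
def build_query(line: str) -> dict:
    """Split '<alexa-rank>,<domain>' and add http://, like the Queries Builder step."""
    rank, domain = line.strip().split(",", 1)
    return {"alexa": rank, "domain": domain, "url": "http://" + domain}


def format_result(query: dict, cms: str) -> str:
    """One line for top-1m-cms.txt: domain, Alexa Rank, CMS name (layout assumed)."""
    return f"{query['domain']};{query['alexa']};{cms}\n"


def per_cms_path(cms: str) -> str:
    """File inside top-1m/ that collects every domain detected with this CMS."""
    return f"top-1m/{cms}.txt"


query = build_query("1,example.com\n")  # hypothetical input record
result_line = format_result(query, "WordPress")
result_path = per_cms_path("WordPress")
```

Here `query["url"]` is what would be passed to the CMS check, while `result_line` and `result_path` correspond to the two output formats.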
    Result of the task:

    [​IMG]

    Statistics:
    • Parsing speed is 1100 domains per minute
    • On 301841 of the 1000000 domains, one of the popular CMS, forum, or wiki engines was detected on the homepage
    • 126 different CMS were detected
    • Top 10 most popular CMS; the first value is the number of domains:
    Code:
    209855 WordPress
    23732 Joomla
    22945 Drupal
    6488 TYPO3 CMS
    4917 vBulletin
    3726 1C-Bitrix
    2515 phpBB
    2415 ExpressionEngine
    2022 DataLife Engine
    1928 Microsoft SharePoint
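A ranking like the one above can be rebuilt from the combined result file with a simple tally. A minimal sketch, assuming each line has the form `domain;rank;cms` and that the CMS field is empty when nothing was detected; both are assumptions about the output, and the sample lines are made up.

```python
from collections import Counter


def top_cms(lines, n=10):
    """Count detected CMS names and return the n most common as (name, count)."""
    counts = Counter()
    for line in lines:
        parts = line.rstrip("\n").rsplit(";", 2)
        # Skip malformed lines and lines where no CMS was detected.
        if len(parts) == 3 and parts[2]:
            counts[parts[2]] += 1
    return counts.most_common(n)


# Hypothetical sample; real input would be the lines of top-1m-cms.txt.
sample = [
    "a.example;1;WordPress\n",
    "b.example;2;\n",
    "c.example;3;Joomla\n",
    "d.example;4;WordPress\n",
]
ranking = top_cms(sample)
for name, count in ranking:
    print(count, name)
```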
    Preset of the task:
    Code:
    eyJwcmVzZXQiOiJSYW5rIENNUyBBbGV4YSB0b3AtMWtrIiwidmFsdWUiOnsicGFy
    c2VycyI6W1siUmFuazo6Q01TIiwiZGVmYXVsdCIseyJ0eXBlIjoib3ZlcnJpZGUi
    LCJpZCI6InVzZXByb3h5IiwidmFsdWUiOmZhbHNlfSx7InR5cGUiOiJvdmVycmlk
    ZSIsImlkIjoicHJveHlyZXRyaWVzIiwidmFsdWUiOiIzIn1dXSwicmVzdWx0c0Zv
    cm1hdCI6IiRxdWVyeTskcXVlcnkuYWxleGE7JHAxLmNtc1xcbiIsInJlc3VsdHNT
    YXZlVG8iOiJmaWxlIiwicmVzdWx0c0ZpbGVOYW1lIjoidG9wLTFtLWNtcy50eHQi
    LCJhZGRpdGlvbmFsRm9ybWF0cyI6W1sidG9wLTFtLyR7cDEuY21zfS50eHQiLCIk
    cXVlcnlcXG4iXV0sInJlc3VsdHNVbmlxdWUiOiJubyIsInF1ZXJ5Rm9ybWF0Ijoi
    aHR0cDovLyRxdWVyeSIsInVuaXF1ZVF1ZXJpZXMiOmZhbHNlLCJzYXZlRmFpbGVk
    UXVlcmllcyI6ZmFsc2UsImRvTG9nIjoibm8iLCJrZWVwVW5pcXVlIjoiTm8iLCJt
    b3JlT3B0aW9ucyI6ZmFsc2UsInJlc3VsdHNQcmVwZW5kIjoiIiwicmVzdWx0c0Fw
    cGVuZCI6IiIsInF1ZXJ5QnVpbGRlcnMiOlt7InNvdXJjZSI6InF1ZXJ5IiwidHlw
    ZSI6InN0cmluZ1NwbGl0Iiwic2VwYXJhdG9yIjoiLCIsInRvIjpbImFsZXhhIiwi
    cXVlcnkiXX1dLCJyZXN1bHRzQnVpbGRlcnMiOltdLCJjb25maWdPdmVycmlkZXMi
    OltdfX0=
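The block above appears to be Base64-encoded JSON, so its settings can be inspected before importing it. A minimal sketch using only the Python standard library; the `decode_preset` helper and the small sample preset are hypothetical and serve only to show the round trip.

```python
import base64
import json


def decode_preset(text: str) -> dict:
    """Join the wrapped lines, Base64-decode them, and parse the JSON inside."""
    compact = "".join(text.split())
    return json.loads(base64.b64decode(compact))


# Hypothetical miniature preset, encoded and wrapped the same way as the export.
sample = {"preset": "demo", "value": {"queryFormat": "http://$query"}}
encoded = base64.b64encode(json.dumps(sample).encode("utf-8")).decode("ascii")
wrapped = encoded[:16] + "\n" + encoded[16:]

decoded = decode_preset(wrapped)
print(json.dumps(decoded, indent=2))
```

Pasting the full block from the post into `decode_preset` should list the task options in readable form.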

    Result files:
    • File with the source domains, Alexa Rank, and detected CMS: top-1m-cms.txt, 37 MB
    • Archive with the files sorted by CMS: top-1m.zip, 7.6 MB

    Parsing speed can be increased considerably by reducing the number of CMS checked. The screenshot shows an example task in which only WordPress is checked: speed increased more than 8 times, and the server still has enough resources to raise the thread count further. Such a task completes in only 2 hours

    [​IMG]
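The run times quoted in this post follow directly from the measured speed. A short arithmetic sketch, using the 1100 domains per minute figure and treating the speed-up as exactly 8 times (the post says "more than 8", so the second estimate is an upper bound):

```python
def hours_needed(domains: int, per_minute: float) -> float:
    """Hours required to process `domains` at a steady rate of `per_minute`."""
    return domains / per_minute / 60


full_check = hours_needed(1_000_000, 1100)          # all CMS checked
wordpress_only = hours_needed(1_000_000, 1100 * 8)  # assumed exact 8x speed-up

print(f"all CMS: {full_check:.1f} h, WordPress only: {wordpress_only:.1f} h")
```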
     