
Determining the CMS for 1000000 domains in 15 hours

Discussion in 'Share Experience' started by Support, Sep 15, 2015.

    As the domain list we use the Alexa top one million domains; the list can be downloaded here:
    Initial data:
    • Server with a quad-core Intel(R) Core(TM) i7 CPU 950 @ 3.07GHz, 8 GB RAM and 100 Mbit/s bandwidth
    • The parser is set to use 6 CPU cores: with Hyper-Threading the processor exposes 8 logical cores, and 2 are left free to keep the system stable
    Screenshot of the task settings:


    • The source file with queries contains one million records in the format <alexa-rank>,<domain>, one domain per line. Using the Queries Builder we separate the domain from its rank. The Rank::CMS parser needs a full link to the site or page, so we add http:// in the query format
    • We use the Rank::CMS parser with default settings, and specify that parsing runs without proxies and with at most 3 attempts per query
    • For convenience we save the result in two formats: the top-1m-cms.txt file gets the domain, Alexa Rank and CMS name; in the top-1m/ folder the domains are saved into files named automatically after the CMS (i.e. WordPress.txt contains only domains running WordPress, and likewise for every detected CMS)
    • By default the check covers all CMSs, forum engines and wiki engines
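
The query preparation and the per-CMS sorting described above can be sketched outside A-Parser in a few lines of Python. This is only an illustration of the data flow, not how A-Parser implements it: the function names `build_query` and `group_by_cms` are hypothetical, and the CMS detection itself is assumed to have already produced (rank, domain, CMS) tuples.

```python
from collections import defaultdict


def build_query(line: str) -> tuple[int, str, str]:
    """Split '<alexa-rank>,<domain>' and build the full link for the CMS check."""
    rank, domain = line.strip().split(",", 1)
    return int(rank), domain, f"http://{domain}"


def group_by_cms(results):
    """Group domains by CMS name.

    results: iterable of (rank, domain, cms_name) tuples (assumed input shape).
    Returns {"<cms_name>.txt": [domains...]}, mirroring the top-1m/ folder layout.
    """
    files = defaultdict(list)
    for _rank, domain, cms in results:
        files[f"{cms}.txt"].append(domain)
    return dict(files)


if __name__ == "__main__":
    print(build_query("1,google.com"))
    sample = [(10, "a.com", "WordPress"), (20, "b.org", "WordPress"), (30, "c.net", "Drupal")]
    print(group_by_cms(sample))
```

Writing each list to its file is left out on purpose; the point is only that one pass over the results is enough to produce both output formats.
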
    Task results:


    • Parsing speed is 1100 domains per minute
    • In total, 301841 of the 1000000 domains were identified as running one of the popular CMSs, forum engines or wiki engines on the homepage
    • 126 different CMSs were detected
    • Top 10 most popular CMSs; the first value is the number of domains:
    209855 WordPress
    23732 Joomla
    22945 Drupal
    6488 TYPO3 CMS
    4917 vBulletin
    3726 1C-Bitrix
    2515 phpBB
    2415 ExpressionEngine
    2022 DataLife Engine
    1928 Microsoft SharePoint
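
To put the counts above in proportion, here is a small calculation using only the figures from this post. The variable names are mine; nothing here comes from A-Parser itself.

```python
TOTAL_DOMAINS = 1_000_000
DETECTED = 301_841  # domains with a recognized CMS, forum or wiki engine

TOP10 = {
    "WordPress": 209_855,
    "Joomla": 23_732,
    "Drupal": 22_945,
    "TYPO3 CMS": 6_488,
    "vBulletin": 4_917,
    "1C-Bitrix": 3_726,
    "phpBB": 2_515,
    "ExpressionEngine": 2_415,
    "DataLife Engine": 2_022,
    "Microsoft SharePoint": 1_928,
}

detected_share = DETECTED / TOTAL_DOMAINS * 100           # share of all checked domains
wordpress_share = TOP10["WordPress"] / DETECTED * 100     # share among detected domains
top10_total = sum(TOP10.values())
top10_share = top10_total / DETECTED * 100

print(f"detected: {detected_share:.1f}% of all domains")       # 30.2%
print(f"WordPress: {wordpress_share:.1f}% of detected")        # 69.5%
print(f"top 10: {top10_total} domains, {top10_share:.1f}% of detected")  # 280543, 92.9%
```

So roughly 30% of the list got a match, and about two thirds of those matches are WordPress.
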

    Result files:
    • File with the source domains, Alexa Rank and the detected CMS: top-1m-cms.txt, 37 MB
    • Archive with the files sorted by CMS: top-1m.zip, 7.6 MB

    Parsing speed can be increased considerably by reducing the number of CMSs checked. The screenshot shows an example task in which only WordPress is checked: the speed increased more than 8 times, and the server still has enough resources for a further increase in threads. Such a task will complete in only 2 hours
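
The run times quoted in this post follow directly from the measured speed. A quick check, assuming the speed stays constant for the whole run and taking the speed-up as exactly 8x (the post says "more than 8 times", so this is an upper bound on the time); `run_time_hours` is a hypothetical helper name:

```python
def run_time_hours(domains: int, domains_per_minute: float) -> float:
    """Estimated run time in hours at a constant parsing speed."""
    return domains / domains_per_minute / 60


DOMAINS = 1_000_000
FULL_CHECK_SPEED = 1100  # domains per minute with all CMSs checked
SPEEDUP = 8              # assumed lower bound for the WordPress-only task

full_check = run_time_hours(DOMAINS, FULL_CHECK_SPEED)
wordpress_only = run_time_hours(DOMAINS, FULL_CHECK_SPEED * SPEEDUP)

print(f"all CMSs: {full_check:.1f} h")            # 15.2 h
print(f"WordPress only: {wordpress_only:.1f} h")  # 1.9 h
```
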
