HTML::LinkExtractor - parses external and internal links from a specified site and can follow internal links down to a selected depth level

Nov 15, 2016

Collected data



• Number of external links
• Number of internal links
• External links: the link, the anchor, the anchor cleared of HTML tags, and the nofollow flag
• Internal links: the link, the anchor, the anchor cleared of HTML tags, and the nofollow flag

Features



• Follows the site's internal pages down to the specified depth (the Parse to level option), which lets the parser walk every page of the site while collecting internal and external links
• Automatically converts anchors from any character encoding to UTF-8
• Automatically strips HTML tags (for example <b>, <img>, etc.) from anchors
• Detects the nofollow attribute on each link (anchor cleaning and nofollow detection are both illustrated in the sketch after this list)
• Can treat subdomains as internal pages of the site
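As a rough illustration of what anchor cleaning and nofollow detection involve, here is a minimal Python sketch using only the standard library; it is a conceptual model, not the parser's actual implementation:

    from html.parser import HTMLParser

    # Minimal sketch: for each <a> tag, collect the link, the anchor text
    # cleared of nested HTML tags, and whether rel="nofollow" is set.
    class LinkExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []
            self._current = None

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                attrs = dict(attrs)
                self._current = {
                    "link": attrs.get("href", ""),
                    "anchor": "",
                    "nofollow": "nofollow" in (attrs.get("rel") or ""),
                }

        def handle_data(self, data):
            if self._current is not None:
                # Text inside nested tags like <b> is kept, the tags are dropped
                self._current["anchor"] += data

        def handle_endtag(self, tag):
            if tag == "a" and self._current is not None:
                self.links.append(self._current)
                self._current = None

    parser = LinkExtractor()
    parser.feed('<a rel="nofollow" href="http://example.com"><b>Example</b> site</a>')
    print(parser.links)
    # [{'link': 'http://example.com', 'anchor': 'Example site', 'nofollow': True}]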

Use cases



• Building a full map of the site by saving all internal links
• Collecting all external links from the site
• Checking backlinks to the site

Queries


As queries, specify the links to the pages from which links should be collected, or an entry point (for example, the site's homepage) if the Parse to level option is used.
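For example, with Parse to level enabled a single entry-point query such as http://example.com/ is enough to walk the whole site; without it, each page to be parsed must be listed as a separate query.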

The Parse to level option


Tells the parser to traverse the site's linked pages down to the specified depth level, for example:
• If level 1 is specified, the parser will follow all links found on the initial page
• If level 2 is specified, the parser will follow all links found on the initial page, and then all links collected from the pages of the first level
• and so on
In simple terms, the level is the minimum number of clicks between the initial page and the final one.
Since linked pages will most likely contain links back to the initial page, or repeat links already seen, be sure to enable query deduplication (the Unique queries option) so that the parser does not go in cycles.
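Conceptually, Parse to level behaves like a breadth-first traversal with deduplication. Below is a minimal Python sketch of the idea; the real parser fetches pages over HTTP, which is stubbed out here with a toy site map:

    from collections import deque

    def crawl_to_level(start_url, max_level, get_links):
        """Parse every page reachable from start_url in at most max_level
        clicks. get_links(url) stands in for fetching a page and
        extracting its internal links."""
        seen = {start_url}              # "Unique queries": never parse a URL twice
        queue = deque([(start_url, 0)])
        parsed = []
        while queue:
            url, level = queue.popleft()
            parsed.append(url)          # the real parser collects links here
            if level == max_level:
                continue                # depth limit reached, go no deeper
            for link in get_links(url):
                if link not in seen:    # skip repeats and backlinks
                    seen.add(link)
                    queue.append((link, level + 1))
        return parsed

    # Toy site: each page maps to the links it contains
    site = {"/": ["/a", "/b"], "/a": ["/", "/a/1"], "/b": ["/a"], "/a/1": []}
    print(crawl_to_level("/", 2, lambda url: site.get(url, [])))
    # ['/', '/a', '/b', '/a/1'] (/a/1 is two clicks away from /)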
Also, as in Net::HTTP, the following options are available: Check content, Use pages and Check next page.

When using this option, you can control which links are followed. The array $followlinks contains the links for the next step (level) of the traversal; you can apply filters to this array, thereby controlling where the parser may proceed. An example is given below.

Task examples


Collecting all external links from a site


[Screenshot: parser settings for collecting all external links from a site]
The following settings are selected in this screenshot:
• Follow linked pages down to level 10
• Save only the list of external links in the result (what counts as an external link is sketched below)
• Unique queries, i.e. the parser will not follow the same link twice
• Unique results per line, i.e. only unique links from the specified site will be saved in the output file
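For reference, an external link is simply one whose host differs from the host of the site being parsed. A minimal Python sketch of that check, including the Subdomains are internal behavior (an illustration, not the parser's actual code):

    from urllib.parse import urlparse

    def is_internal(link, site_host, subdomains_are_internal=False):
        # With subdomains_are_internal=True, hosts such as blog.example.com
        # count as internal pages of example.com
        host = urlparse(link).netloc.lower()
        if host == site_host:
            return True
        return subdomains_are_internal and host.endswith("." + site_host)

    links = ["http://example.com/page",
             "http://blog.example.com/post",
             "http://other.com/"]
    external = [l for l in links if not is_internal(l, "example.com", True)]
    print(external)  # ['http://other.com/']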

Collecting all internal links from a site


The settings are the same as in the previous example; we only change the Result format to $intlinks.format('$link\n') so that only the internal links are saved.

Following only links that do not contain the word forum


[Screenshot: settings with a filter applied to the $followlinks array]
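The original screenshot is not available, but conceptually this is a filter over the $followlinks array described above: every link whose URL contains the word forum is dropped before the next level of the traversal. A minimal Python sketch of the idea (this is not A-Parser's actual template syntax):

    # Conceptual filter over the follow-links array: keep only links
    # whose URL does not contain the word "forum"
    followlinks = [
        "http://example.com/news/1",
        "http://example.com/forum/topic-5",
        "http://example.com/about",
    ]
    followlinks = [link for link in followlinks if "forum" not in link]
    print(followlinks)
    # ['http://example.com/news/1', 'http://example.com/about']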

Possible settings

In addition to the global settings common to all parsers, the following options are available:

Parameter | Default value | Description
Subdomains are internal | | Whether to consider subdomains as internal links
Follow links | Internal only | Determines which links to follow
User agent | Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) | The User-Agent header sent when requesting pages
Max redirects count | 0 | The maximum number of redirects the parser will follow
Bypass CloudFlare | | Automatically bypass the CloudFlare browser check
Use gzip | | Determines whether to compress the transferred traffic