HTML::LinkExtractor - Parser for external and internal links from a specified website

Overview of HTML::LinkExtractor parser

HTML::LinkExtractor is a parser for external and internal links from a specified website. It supports multipage parsing and can navigate through the internal pages of a website to a specified depth, which lets you walk every page of the site while collecting internal and external links. It has built-in tools for bypassing CloudFlare protection and can use Chrome as the engine for parsing pages whose data is loaded by scripts. It can reach speeds of up to 2000 requests per minute, which is 120,000 requests per hour.

Use cases for HTML::LinkExtractor parser

Case 1

  1. Add the option Parse to level, select the value 10 in the list (navigate to neighboring pages up to the 10th level).
  2. Add the option Result format, specify $extlinks.format('$link\n') as the value (output of external links).
  3. In the Queries section, check the option Unique queries.
  4. In the Results section, check the option Unique by line.
  5. Specify the link to the website from which you want to parse external links as the query.

How to import an example into A-Parser

eJxtU01v2zAM/S9CgK5AlrSHXnxLgwZb4dZdm57SHISYztTIoirRWQrD/32U7NjJ
1ptIvsfHL9WCpN/5JwceyItkVQsb3yIRORSy0iTGwkrnwYXwSvxYPqRJkiqzuzuQ
kxtCx4geWwv6tMBstKTQeI6pnM2YIoU9aPbspa4Yc33VnOD34JzK4Ugo0JWSuJa2
hI4iRnAgzeJ+0gK+XYyC+fZmLi5Fs16PRUvxixgODHs96Xrqgy9yD0sMKkrD4F6w
9SjLqJNLghA96lxO6BAyyDxXoTOpW4UwlUH11aiPWKcnp8yW8Ww6BX7hsGQ3QUwS
nJ/HCldiFG3BaarI/9VyREKugrHwXO1Cci15Hyik9hxRBE7yBrJu2Ekt0My0joMe
YDH9baV0zlucFUz62RG/hmT/5Wj6Dk+leGV/HNfQZ4nWbfYwsHJMccuNG+S2tSoV
se3nWJmwmyt27gBsP7bHACvRQS/TZe7U+VAtmHAfw9ZmdnCdtXG2mXPnBk2htll3
c0dkZZb8GzIzx9JqCH2ZSmveiofn4UJmvltDMIYC/yXPo8TZPyJE7e9f2lKtU3yB
N6HAkid5qtql3EitX5/T04gYLoqN30Q2mU7ld4ueFzpRpsCpCESCLfJFcVvNuv+/
/S+vv/zFSd3wwt79U4sO3QUs+3hMnrfBP7b5C6wbebo=
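
To make the idea behind this case concrete, here is a minimal Python sketch of a depth-limited walk that collects unique external links. It is not A-Parser's implementation: the in-memory SITE dictionary, the function name crawl_external, and the exact meaning of "level" (the start page is level 0) are assumptions made for illustration only.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "site": page URL -> links found on that page.
SITE = {
    "https://example.com/": ["https://example.com/a", "https://other.org/x"],
    "https://example.com/a": ["https://example.com/b", "https://other.org/x"],
    "https://example.com/b": ["https://third.net/y"],
}


def crawl_external(start, max_depth, pages=SITE):
    """Breadth-first walk up to max_depth, returning unique external links."""
    host = urlparse(start).netloc
    seen_pages = {start}
    external = []  # first-seen order, duplicates dropped (like "Unique by line")
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        for link in pages.get(url, []):
            if urlparse(link).netloc != host:
                if link not in external:
                    external.append(link)
            elif depth < max_depth and link not in seen_pages:
                # Assumption: the start page is level 0, its neighbors level 1.
                seen_pages.add(link)
                queue.append((link, depth + 1))
    return external


# Rough analogue of $extlinks.format('$link\n'): one link per line.
print("".join(f"{link}\n" for link in crawl_external("https://example.com/", 10)), end="")
```

With a depth of 1 the walk stops before reaching the third page, so only the first external link is collected; with a depth of 10 both are found.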

Case 2

Similar to the first case, but in step 2 specify $intlinks.format('$link\n') as the Result format value (output of internal links).

How to import an example into A-Parser

eJxtU8tu2zAQ/BfCQBrAtZNDL7o5Roy2cOI0j5PjA2GtXNYUyZIrN4Ggf++QkiW7
zY27O7OzL9aCZdiHB0+BOIhsXQuX3iITORWy0izGwkkfyMfwWnx9vltm2VKZ/e0b
e7ll64HosbXgd0dgW8fKmoCYymGmFEs6kIbnIHUFzPVVc4I/kPcqpyOhsL6UjFra
EjqKGCnDGuJh0gI+XYyi+fpqLi5Fs9mMRUsJixSODHc96Xrqg0/yQM82qihNg3sB
616WSSeXTDF61Lmc8FvMIPNcxc6kbhXiVAbVF6N+pzoDe2V2wMP0isLC2xJuppQk
Ot+PFa7FKNkCaarE/9FyRMa+orEIqHYhUUveBwqpAyKKyUtsYNUNO6uFNTOt06AH
WEp/UymdY4uzAqRvHfFjyOq/HE3f4akUVvbHo4Y+S7JuVncDK7dLu0PjxqJtrUrF
sMPcVibu5grOPZHrx3YfYaX11Mt0mTt1HKojE+9j2NrMDa6zNs42c+7cWlOo3aq7
uSOyMs/4DSszt6XTFPsyldbYSqDH4UJmoVtDNIYC/yXPk8TZP2Jrdfj+1JbqvMIF
fokFlpjkqWqXciu1fnlcnkbEcFEwfjK7bDqVn50NWOhEmcJORSQy7SwuCm01m/7/
9r+8/vAXZ3WDhf0KDy06dhex8GFMAdvAj23+ApcrebQ=

Case 3

  1. Add the option Parse to level, select the value 3 in the list (navigate to neighboring pages up to the 3rd level).
  2. Add the option Result format, specify $query as the value.
  3. Add a filter. Filter by $followlinks.$i.link - Link, select Does not contain string as the type, and specify forum as the string.
  4. In the Queries section, check the option Unique queries.
  5. In the Results section, check the option Unique by line.
  6. Specify the link to the website from which you want to parse links as the query.

How to import an example into A-Parser

eJxtVE1v2zAM/S/CDhuQJS2GXXxLgwbd4DZdm57SHISYzrTIkipRaQvD/33UR2xn
6ykh+R75+CG3DLk7uHsLDtCxYtMyE/+zglVQcy+RTZjh1oEN4Q27Wd+WRVEKdbh+
Q8t3qC0hemzL8N0AsbVBoZWjmKjIjClKOIIkz5FLT5hv3Qh+BGtFBSd8rW3DkaQk
BZnBPr14sO/Pz4qNuLWQCEFFhhcbokupXyWpDArCL9tOMnCdWErjTivkQo3yU1nf
kJ3Uk8MB9dBtt6fkbhmFBSnmcppn1Qcf+RHWOkmCwb0k6443sYGKI4ToNHX4+csU
30IGXlUi1OQyVQjTHqo+KfESBTq0Qu0JHwYhwC2tbsiNEJPE6ZwUbvK0Quc+8n8l
DivQepgwR2qXnLRUfaDm0lFE0Jg4bXaVl1i0TKu5lHGBAyymv/JCVnQd85pIPzLx
Y8jqvxxd3+G4FN3CqyUNfZZoXa1uB1alS72PW4z7bQSS7Rbaq7CbC3IeAEw/trsA
a7SFvkzOnKvTAzCgwuENW5ubwXXWxtlmzp10UbXYr/Ixn5BeremVrdRCN0ZC6Et5
KWkrDh6GC5m7vIZgDAL/JS9iibP3iVpL9/MxSTVW0AV+DwIbmuS4ak6541I+PZTj
CBsuiozfiKaYzfjX9PCnO93MWOAh7DUdFHXVbfvPQv/xaD/8OBRtR/v64+4TOjQX
sOSjKbn4yi67v8azl7c=
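
The effect of the filter in this case can be sketched in Python as a traversal that refuses to follow any link containing a given substring and outputs each visited page once. This is an illustration under assumptions, not the parser's real logic: the PAGES dictionary, the site_map function, and its parameters are hypothetical.

```python
from collections import deque

# Hypothetical toy site: page URL -> internal links found on it.
PAGES = {
    "https://example.com/": ["https://example.com/docs", "https://example.com/forum/"],
    "https://example.com/docs": ["https://example.com/docs/api", "https://example.com/forum/t1"],
    "https://example.com/docs/api": [],
    "https://example.com/forum/": ["https://example.com/forum/t1"],
}


def site_map(start, max_depth, skip_substring, pages=PAGES):
    """Return visited pages in order, never following links that contain skip_substring."""
    visited = [start]
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand pages that sit at the last allowed level
        for link in pages.get(url, []):
            if skip_substring in link:
                continue  # analogue of the "Does not contain string" filter
            if link not in visited:
                visited.append(link)
                queue.append((link, depth + 1))
    return visited


# Each visited page is printed once, similar to outputting $query per page.
for page in site_map("https://example.com/", 3, "forum"):
    print(page)
```

Filtering at the point where links are followed, rather than on the final output, keeps the walk from spending requests on the excluded section at all.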

List of collected data

External links: 12
https://www.youtube.com/c/AParser_channel
https://t.me/a_parser
https://en.a-parser.com/
https://spyserp.com/ru/
https://sitechecker.pro/
https://arsenkin.ru/tools/
https://spyserp.com/
http://www.promkaskad.ru/
https://www.youtube.com/channel/UCvypGICrfCky8tPtebmIvQw
https://www.facebook.com/AParserRu
https://twitter.com/a_parser
https://www.youtube.com/c/AParser_channel

Internal links: 129
https://a-parser.com/
https://a-parser.com/
https://a-parser.com/a-parser-for-seo/
https://a-parser.com/a-parser-for-business-and-freelancers/
https://a-parser.com/a-parser-for-developers/
https://a-parser.com/a-parser-for-marketing-and-analytics/
https://a-parser.com/a-parser-for-e-commerce/
https://a-parser.com/a-parser-for-cpa/
https://a-parser.com/wiki/features-and-benefits/
https://a-parser.com/wiki/parsers/
  • Number of external links
  • Number of internal links
  • External links:
    • links themselves
    • anchors
    • anchors cleared of HTML tags
    • nofollow parameter
    • <a> tag in full
  • Internal links:
    • links themselves
    • anchors
    • anchors cleared of HTML tags
    • nofollow parameter
    • <a> tag in full
  • Array with all collected pages (used when working with the Use Pages option)
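
As a rough illustration of the per-link fields listed above (link, anchor, anchor cleared of HTML tags, nofollow), the following Python sketch extracts them with the standard library's html.parser. The class name AnchorCollector and the dictionary keys are assumptions chosen for this example and do not correspond to A-Parser variable names; the full <a> tag is omitted to keep the sketch short.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects href, raw anchor, tag-free anchor and nofollow flag for each <a>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []
        self._current = None  # record for the <a> currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            rel = (attrs.get("rel") or "").lower().split()
            self._current = {
                "link": attrs["href"],
                "anchor": "",
                "anchor_clean": "",
                "nofollow": "nofollow" in rel,
            }
        elif self._current is not None:
            # Keep nested markup in the raw anchor only.
            self._current["anchor"] += self.get_starttag_text()

    def handle_endtag(self, tag):
        if self._current is None:
            return
        if tag == "a":
            # Collapse whitespace in the tag-free anchor text.
            self._current["anchor_clean"] = " ".join(self._current["anchor_clean"].split())
            self.links.append(self._current)
            self._current = None
        else:
            self._current["anchor"] += f"</{tag}>"

    def handle_data(self, data):
        if self._current is not None:
            self._current["anchor"] += data
            self._current["anchor_clean"] += data


html = '<p><a href="/docs" rel="nofollow noopener">Read <b>the</b> docs</a> and <a href="https://example.org/">Example</a></p>'
collector = AnchorCollector()
collector.feed(html)
collector.close()
for record in collector.links:
    print(record)
```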

Capabilities

  • Multi-page parsing (navigation through pages)
  • Navigation through internal pages of the site up to a specified depth (Parse to level option) - allows to go through all pages of the site, collecting internal and external links
  • Automatic stripping of HTML tags from anchors
  • Detection of nofollow for each link
  • Option to treat subdomains as internal pages of the site
  • Support for gzip/deflate/brotli compression
  • Detection of site encodings and conversion to UTF-8
  • Bypassing CloudFlare protection
  • Choice of engine (HTTP or Chrome)
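
The distinction between internal and external links, and the effect of treating subdomains as internal, can be sketched in Python as follows. This is a simplified illustration: the function is_internal and its subdomains_are_internal parameter are hypothetical, and the suffix comparison assumes the page's own host is the base domain, which a real implementation may handle differently.

```python
from urllib.parse import urljoin, urlparse


def is_internal(link, page_url, subdomains_are_internal=False):
    """Classify a link found on page_url as internal (True) or external (False)."""
    page_host = (urlparse(page_url).hostname or "").lower()
    # Resolve relative links against the page before comparing hosts.
    link_host = (urlparse(urljoin(page_url, link)).hostname or "").lower()
    if link_host == page_host:
        return True
    if subdomains_are_internal:
        # Assumption: any host ending in ".<page host>" counts as a subdomain.
        return link_host.endswith("." + page_host)
    return False


page = "https://a-parser.com/"
print(is_internal("/wiki/parsers/", page))
print(is_internal("https://en.a-parser.com/", page))
print(is_internal("https://en.a-parser.com/", page, subdomains_are_internal=True))
```

Under this sketch, a link to en.a-parser.com found on a-parser.com is external unless subdomains are treated as internal.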

Usage scenarios

  • Obtaining a complete site map (saving all internal links)
  • Obtaining all external links from the site
  • Checking backlinks to your own site

Query examples

Queries should be links to the pages from which links are to be collected, or an entry point (for example, the site's main page) when the Parse to level option is used:

https://lenta.ru/
https://a-parser.com/wiki/index/

Result output options

A-Parser supports flexible result formatting thanks to the built-in Template Toolkit template engine, which allows it to output results in arbitrary form, as well as in structured form, such as CSV or JSON.
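
To show what the three output shapes amount to, here is a small Python sketch that renders the same records as plain lines, CSV, and JSON. It does not use Template Toolkit and is not how A-Parser formats results internally; the sample records and the function names as_lines, as_csv, and as_json are hypothetical.

```python
import csv
import io
import json

# Hypothetical collected records; the field names are chosen for this example.
records = [
    {"link": "https://example.org/", "anchor": "Example", "nofollow": False},
    {"link": "https://example.net/a", "anchor": "Page A", "nofollow": True},
]


def as_lines(rows):
    """One link per line, comparable to $extlinks.format('$link\\n')."""
    return "".join(f"{row['link']}\n" for row in rows)


def as_csv(rows):
    """CSV with a header row; the csv module handles quoting."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["link", "anchor", "nofollow"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def as_json(rows):
    """Structured JSON output."""
    return json.dumps(rows, indent=2)


print(as_lines(records), end="")
print(as_csv(records), end="")
print(as_json(records))
```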

Possible settings

Parameter name | Default value | Description
Good status | All | Selects which server response is considered successful. If a different response is received during parsing, the request is retried with a different proxy
Good code RegEx | - | Lets you specify a regular expression for checking the response code
Ban Proxy Code RegEx | - | Lets you ban a proxy for a certain time (Proxy ban time) based on the server response code
Method | GET | Request method
POST body | - | Content sent to the server when using the POST method. Supports the variables $query (the query URL), $query.orig (the original query), and $pagenum (the page number when the Use Pages option is used)
Cookies | - | Lets you specify cookies for the request
User agent | The User-Agent of the current Chrome version is inserted automatically | User-Agent header sent when requesting pages
Additional headers | - | Lets you specify arbitrary request headers, with support for template engine features and variables from the request constructor
Read only headers | | Reads only the headers. In some cases this saves traffic when the content does not need to be processed
Detect charset on content | | Detects the encoding based on the page content
Emulate browser headers | | Emulates browser headers
Max redirects count | 0 | Maximum number of redirects to follow
Follow common redirects | | Allows http <-> https and www.domain <-> domain redirects within one domain, bypassing the Max redirects count limit
Max cookies count | 16 | Maximum number of cookies to save
Engine | HTTP (Fast, JavaScript Disabled) | Lets you choose the HTTP engine (faster, without JavaScript) or Chrome (slower, with JavaScript)
Chrome Headless | | If enabled, the browser window is not displayed
Chrome DevTools | | Allows the use of Chromium debugging tools
Chrome Log Proxy connections | | If enabled, information about Chrome connections is written to the log
Chrome Wait Until | networkidle2 | Determines when the page is considered loaded
Use HTTP/2 transport | | Determines whether to use HTTP/2 instead of HTTP/1.1. For example, Google and Majestic ban immediately if HTTP/1.1 is used
Don't verify TLS certs | | Disables TLS certificate validation
Randomize TLS Fingerprint | | Lets you bypass site bans based on the TLS fingerprint
Bypass CloudFlare | | Automatically bypasses the CloudFlare check
Bypass CloudFlare with Chrome (Experimental) | | Bypasses CloudFlare via Chrome
Bypass CloudFlare with Chrome Max Pages | 20 | Maximum number of pages when bypassing CloudFlare via Chrome
Subdomains are internal | | Determines whether subdomains are treated as internal links
Follow links | Internal only | Determines which links to follow
Skip comment blocks | | Determines whether to skip comment blocks
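
The interaction between Max redirects count and Follow common redirects can be illustrated with a short Python sketch. It is only a model under stated assumptions: the functions is_common_redirect and resolve and the redirect map are hypothetical, and a "common" redirect is assumed here to change only the scheme (http/https) or the www prefix while keeping the path and query the same. The actual rules A-Parser applies may differ.

```python
from urllib.parse import urlparse


def strip_www(host):
    """Drop a leading 'www.' so www.domain and domain compare equal."""
    return host[4:] if host.startswith("www.") else host


def is_common_redirect(src, dst):
    """True if dst differs from src only by http/https scheme or www prefix (assumed rule)."""
    a, b = urlparse(src), urlparse(dst)
    if a.scheme not in ("http", "https") or b.scheme not in ("http", "https"):
        return False
    host_a = (a.hostname or "").lower()
    host_b = (b.hostname or "").lower()
    same_site = strip_www(host_a) == strip_www(host_b)
    same_rest = (a.path or "/", a.query) == (b.path or "/", b.query)
    changed = (a.scheme, host_a) != (b.scheme, host_b)
    return same_site and same_rest and changed


def resolve(url, redirects, max_redirects=0, follow_common=True):
    """Follow a hypothetical redirect map; return the final URL or None if the limit is hit."""
    used = 0
    seen = {url}
    while url in redirects:
        target = redirects[url]
        if target in seen:
            return None  # redirect loop
        if not (follow_common and is_common_redirect(url, target)):
            used += 1  # only non-common redirects consume the budget
            if used > max_redirects:
                return None
        seen.add(target)
        url = target
    return url


# Hypothetical redirect chain used only for this example.
redirects = {
    "http://example.com/": "https://www.example.com/",
    "https://www.example.com/": "https://www.example.com/home",
}
print(resolve("http://example.com/", redirects, max_redirects=1))
```

In this model, the first hop is free because it only changes the scheme and the www prefix, while the second hop changes the path and therefore counts against Max redirects count.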