HTML::LinkExtractor - Scraper of external and internal links from a specified website
Overview of the scraper
HTML::LinkExtractor is a scraper of external and internal links from a specified website. It supports multi-page scraping and navigating through the internal pages of a website down to a specified depth, which allows it to walk all pages of the site, collecting internal and external links. It has built-in means to bypass CloudFlare protection, as well as the ability to choose Chrome as the engine for scraping links from pages that load data via scripts. It is capable of reaching speeds of up to 2000 requests per minute, which is 120,000 requests per hour.
Use cases for the scraper
Collecting all external links from a website
1. Add the Parse to level option and select the value 10 in the list (navigate through the site's pages up to the 10th level).
2. Add the Result format option and specify $extlinks.format('$link\n') as the value (outputs external links).
3. In the Requests section, check the Unique requests option.
4. In the Results section, check the Unique by line option.
5. As the request, specify the link to the website from which you need to scrape external links.

A plain-Python sketch of the equivalent logic follows these steps.
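The preset does all of this inside A-Parser itself; purely as an illustration of the underlying logic, here is a minimal single-page sketch in plain Python that collects unique external links. Everything here (the function names, the standard-library fetch, the absence of depth-10 crawling) is an assumption for the example, not A-Parser code:

```python
# Fetch one page, collect <a href> links, keep unique external ones.
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Gathers every href found in <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self.links.append(href)

def external_links(page_url: str) -> list[str]:
    html = urlopen(page_url).read().decode("utf-8", errors="replace")
    parser = LinkCollector()
    parser.feed(html)
    base_host = urlparse(page_url).netloc
    seen, result = set(), []
    for href in parser.links:
        absolute = urljoin(page_url, href)    # resolve relative links
        host = urlparse(absolute).netloc
        if host and host != base_host and absolute not in seen:
            seen.add(absolute)                # the "Unique by line" analogue
            result.append(absolute)
    return result

print("\n".join(external_links("https://a-parser.com/")))
```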
Download example
How to import example into A-Parser
eJxtU01v2zAM/S9CgK5AlrSHXnxLgwZb4dZdm57SHISYztTIoirRWQrD/32U7NjJ
1ptIvsfHL9WCpN/5JwceyItkVQsb3yIRORSy0iTGwkrnwYXwSvxYPqRJkiqzuzuQ
kxtCx4geWwv6tMBstKTQeI6pnM2YIoU9aPbspa4Yc33VnOD34JzK4Ugo0JWSuJa2
hI4iRnAgzeJ+0gK+XYyC+fZmLi5Fs16PRUvxixgODHs96Xrqgy9yD0sMKkrD4F6w
9SjLqJNLghA96lxO6BAyyDxXoTOpW4UwlUH11aiPWKcnp8yW8Ww6BX7hsGQ3QUwS
nJ/HCldiFG3BaarI/9VyREKugrHwXO1Cci15Hyik9hxRBE7yBrJu2Ekt0My0joMe
YDH9baV0zlucFUz62RG/hmT/5Wj6Dk+leGV/HNfQZ4nWbfYwsHJMccuNG+S2tSoV
se3nWJmwmyt27gBsP7bHACvRQS/TZe7U+VAtmHAfw9ZmdnCdtXG2mXPnBk2htll3
c0dkZZb8GzIzx9JqCH2ZSmveiofn4UJmvltDMIYC/yXPo8TZPyJE7e9f2lKtU3yB
N6HAkid5qtql3EitX5/T04gYLoqN30Q2mU7ld4ueFzpRpsCpCESCLfJFcVvNuv+/
/S+vv/zFSd3wwt79U4sO3QUs+3hMnrfBP7b5C6wbebo=
Collecting all internal links from a website
Similar to the first case, but in step 2 you need to specify $intlinks.format('$link\n') as the value (outputs internal links).
Download example
How to import example into A-Parser
eJxtU8tu2zAQ/BfCQBrAtZNDL7o5Roy2cOI0j5PjA2GtXNYUyZIrN4Ggf++QkiW7
zY27O7OzL9aCZdiHB0+BOIhsXQuX3iITORWy0izGwkkfyMfwWnx9vltm2VKZ/e0b
e7ll64HosbXgd0dgW8fKmoCYymGmFEs6kIbnIHUFzPVVc4I/kPcqpyOhsL6UjFra
EjqKGCnDGuJh0gI+XYyi+fpqLi5Fs9mMRUsJixSODHc96Xrqg0/yQM82qihNg3sB
616WSSeXTDF61Lmc8FvMIPNcxc6kbhXiVAbVF6N+pzoDe2V2wMP0isLC2xJuppQk
Ot+PFa7FKNkCaarE/9FyRMa+orEIqHYhUUveBwqpAyKKyUtsYNUNO6uFNTOt06AH
WEp/UymdY4uzAqRvHfFjyOq/HE3f4akUVvbHo4Y+S7JuVncDK7dLu0PjxqJtrUrF
sMPcVibu5grOPZHrx3YfYaX11Mt0mTt1HKojE+9j2NrMDa6zNs42c+7cWlOo3aq7
uSOyMs/4DSszt6XTFPsyldbYSqDH4UJmoVtDNIYC/yXPk8TZP2Jrdfj+1JbqvMIF
fokFlpjkqWqXciu1fnlcnkbEcFEwfjK7bDqVn50NWOhEmcJORSQy7SwuCm01m/7/
9r+8/vAXZ3WDhf0KDy06dhex8GFMAdvAj23+ApcrebQ=
Follow only links that do not contain the word forum
1. Add the Parse to level option and select the value 3 in the list (navigate through the site's pages up to the 3rd level).
2. Add the Result format option and specify $query as the value.
3. Add a filter: filter by $followlinks.$i.link - Link, choose the type Does not contain string, and specify forum as the string.
4. In the Requests section, check the Unique requests option.
5. In the Results section, check the Unique by line option.
6. As the request, specify the link to the website from which you need to scrape links.

A depth-limited crawl sketch with the same filter follows these steps.
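Again for illustration only, here is a plain-Python sketch of this behavior: a breadth-first crawl of internal links down to level 3 that never follows a URL containing forum. The crawl function and its naive error handling are assumptions for the example, not how A-Parser implements its $followlinks filter:

```python
# Breadth-first crawl to a fixed depth, skipping URLs that contain "forum".
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self.links.append(href)

def crawl(start_url: str, max_level: int = 3) -> set[str]:
    host = urlparse(start_url).netloc
    visited = {start_url}                  # the "Unique requests" analogue
    queue = deque([(start_url, 0)])
    while queue:
        url, level = queue.popleft()
        if level >= max_level:             # the "Parse to level" analogue
            continue
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")
        except OSError:
            continue
        parser = LinkCollector()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)
            if (urlparse(link).netloc == host   # internal links only
                    and "forum" not in link     # the filter from step 3
                    and link not in visited):
                visited.add(link)
                queue.append((link, level + 1))
    return visited

for page in sorted(crawl("https://a-parser.com/")):
    print(page)
```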
Download example
How to import a preset into A-Parser
eJxtVE1v2zAM/S/CDhuQJS2GXXxLgwbd4DZdm57SHISYzrTIkipRaQvD/33UR2xn
6ykh+R75+CG3DLk7uHsLDtCxYtMyE/+zglVQcy+RTZjh1oEN4Q27Wd+WRVEKdbh+
Q8t3qC0hemzL8N0AsbVBoZWjmKjIjClKOIIkz5FLT5hv3Qh+BGtFBSd8rW3DkaQk
BZnBPr14sO/Pz4qNuLWQCEFFhhcbokupXyWpDArCL9tOMnCdWErjTivkQo3yU1nf
kJ3Uk8MB9dBtt6fkbhmFBSnmcppn1Qcf+RHWOkmCwb0k6443sYGKI4ToNHX4+csU
30IGXlUi1OQyVQjTHqo+KfESBTq0Qu0JHwYhwC2tbsiNEJPE6ZwUbvK0Quc+8n8l
DivQepgwR2qXnLRUfaDm0lFE0Jg4bXaVl1i0TKu5lHGBAyymv/JCVnQd85pIPzLx
Y8jqvxxd3+G4FN3CqyUNfZZoXa1uB1alS72PW4z7bQSS7Rbaq7CbC3IeAEw/trsA
a7SFvkzOnKvTAzCgwuENW5ubwXXWxtlmzp10UbXYr/Ixn5BeremVrdRCN0ZC6Et5
KWkrDh6GC5m7vIZgDAL/JS9iibP3iVpL9/MxSTVW0AV+DwIbmuS4ak6541I+PZTj
CBsuiozfiKaYzfjX9PCnO93MWOAh7DUdFHXVbfvPQv/xaD/8OBRtR/v64+4TOjQX
sOSjKbn4yi67v8azl7c=
Collected Data
- Number of external links
- Number of internal links
- External links:
  - the links themselves
  - anchors
  - anchors cleaned from HTML tags
  - nofollow attribute
  - the full <a> tag
- Internal links:
  - the links themselves
  - anchors
  - anchors cleaned from HTML tags
  - nofollow attribute
  - the full <a> tag
- An array with all the collected pages (used when the Use Pages option is enabled)
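As a hypothetical model of this structure, the following Python dataclasses mirror the fields above. The class and field names are illustrative only; in result templates the actual variables are $extcount, $intcount, $extlinks, $intlinks, and so on:

```python
from dataclasses import dataclass, field

@dataclass
class Link:
    link: str          # the URL itself
    anchor: str        # the raw anchor, may contain HTML tags
    anchor_text: str   # the anchor cleaned from HTML tags
    nofollow: bool     # True if rel="nofollow" is present
    a_tag: str         # the full <a ...>...</a> tag

@dataclass
class PageResult:
    extcount: int                                    # number of external links
    intcount: int                                    # number of internal links
    extlinks: list[Link] = field(default_factory=list)
    intlinks: list[Link] = field(default_factory=list)
    pages: list[str] = field(default_factory=list)   # with Use Pages enabled
```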
Capabilities
- Multi-page scraping (navigation through pages)
- Navigation through internal pages of the site up to a specified depth (Parse to level option) – allows you to go through all the pages of the site, collecting internal and external links
- Limit on page navigation (Follow links limit option)
- Automatically cleans anchors from HTML tags
- Determines nofollow for each link
- Ability to consider subdomains as internal pages of the site
- Supports gzip/deflate/brotli compression
- Detects and converts site encodings to UTF-8
- Bypasses CloudFlare protection
- Choice of engine (HTTP or Chrome)
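Two of these capabilities, anchor cleaning and nofollow detection, are simple enough to sketch in plain Python. This is an assumption-laden illustration of the idea, not A-Parser's implementation:

```python
import re

A_TAG = '<a href="https://example.com" rel="nofollow"><b>Example</b> site</a>'

def clean_anchor(anchor_html: str) -> str:
    """Strip HTML tags, keeping only the anchor text."""
    return re.sub(r"<[^>]+>", "", anchor_html).strip()

def is_nofollow(a_tag: str) -> bool:
    """True if the tag carries rel="nofollow" (possibly among other values)."""
    m = re.search(r'rel\s*=\s*["\']([^"\']*)["\']', a_tag, re.I)
    return bool(m) and "nofollow" in m.group(1).lower().split()

anchor = re.search(r">(.*)</a>", A_TAG, re.S).group(1)  # "<b>Example</b> site"
print(clean_anchor(anchor))   # -> "Example site"
print(is_nofollow(A_TAG))     # -> True
```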
Use Cases
- Obtaining a complete site map (saving all internal links)
- Collecting all external links from a site
- Checking backlinks to your own site
Queries
As queries, specify links to the pages from which you want to collect links, or, when the Parse to level option is used, an entry point (for example, the main page of the site):
https://lenta.ru/
https://a-parser.com/wiki/index/
Output Results Examples
A-Parser supports flexible result formatting thanks to the built-in Template Toolkit, which allows it to output results in any form, including structured formats such as CSV or JSON.
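To make "structured formats" concrete, this plain-Python snippet shows what CSV and JSON renditions of the same link records could look like. The record fields and sample values are made up for the example:

```python
import csv, json, sys

records = [
    {"link": "https://t.me/a_parser", "type": "external", "nofollow": False},
    {"link": "https://a-parser.com/wiki/parsers/", "type": "internal", "nofollow": False},
]

json.dump(records, sys.stdout, indent=2)    # JSON output
print()
writer = csv.DictWriter(sys.stdout, fieldnames=["link", "type", "nofollow"])
writer.writeheader()                        # CSV output
writer.writerows(records)
```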
Output of external and internal links with their count
Result format:
External links: $extcount\n$extlinks.format('$link\n')
Internal links: $intcount\n$intlinks.format('$link\n')
Example of result:
External links: 12
https://www.youtube.com/c/AParser_channel
https://t.me/a_parser
https://en.a-parser.com/
https://spyserp.com/ru/
https://sitechecker.pro/
https://arsenkin.ru/tools/
https://spyserp.com/
http://www.promkaskad.ru/
https://www.youtube.com/channel/UCvypGICrfCky8tPtebmIvQw
https://www.facebook.com/AParserRu
https://twitter.com/a_parser
https://www.youtube.com/c/AParser_channel
Internal links: 129
https://a-parser.com/
https://a-parser.com/
https://a-parser.com/a-parser-for-seo/
https://a-parser.com/a-parser-for-business-and-freelancers/
https://a-parser.com/a-parser-for-developers/
https://a-parser.com/a-parser-for-marketing-and-analytics/
https://a-parser.com/a-parser-for-e-commerce/
https://a-parser.com/a-parser-for-cpa/
https://a-parser.com/wiki/features-and-benefits/
https://a-parser.com/wiki/parsers/
Possible Settings
Parameter Name | Default Value | Description |
---|---|---|
Good status | All | Selection of which server response will be considered successful. If another response is received during scraping, the request will be repeated with a different proxy |
Good code RegEx | | Ability to specify a regular expression to check the response code |
Ban Proxy Code RegEx | | Ability to temporarily ban a proxy (for Proxy ban time) based on the server response code |
Method | GET | Request method |
POST body | | Content to be sent to the server when using the POST method. Supports the variables $query (the request URL), $query.orig (the original request), and $pagenum (the page number when the Use Pages option is used) |
Cookies | | Ability to specify cookies for the request |
User agent | _Automatically substitutes the user-agent of the current version of Chrome_ | User-Agent header for page requests |
Additional headers | | Ability to specify custom request headers, with support for template capabilities and variables from the request builder |
Read only headers | ☐ | Read headers only. In some cases, it allows saving traffic if there is no need to process content |
Detect charset on content | ☐ | Detect charset based on the content of the page |
Emulate browser headers | ☐ | Emulate browser headers |
Max redirects count | 0 | Maximum number of redirects the scraper will follow |
Follow common redirects | ☑ | Allows for http <-> https and www.domain <-> domain redirects within the same domain, bypassing the Max redirects count limit |
Max cookies count | 16 | Maximum number of cookies to keep |
Engine | HTTP (Fast, JavaScript Disabled) | Allows choosing between the HTTP engine (faster, without JavaScript) or Chrome (slower, with JavaScript enabled) |
Chrome Headless | ☐ | If this option is enabled, the browser will not be displayed |
Chrome DevTools | ☑ | Allows the use of Chromium debugging tools |
Chrome Log Proxy connections | ☑ | If this option is enabled, information about Chrome connections will be logged |
Chrome Wait Until | networkidle2 | Determines when the page is considered loaded. More about the values. |
Use HTTP/2 transport | ☐ | Determines whether to use HTTP/2 instead of HTTP/1.1. For example, Google and Majestic immediately ban if HTTP/1.1 is used. |
Don't verify TLS certs | ☐ | Disabling TLS certificate validation |
Randomize TLS Fingerprint | ☐ | This option allows bypassing site bans by TLS fingerprint |
Bypass CloudFlare | ☑ | Automatic bypass of CloudFlare checks |
Bypass CloudFlare with Chrome(Experimental) | ☐ | Bypass CF through Chrome |
Bypass CloudFlare with Chrome Max Pages | 20 | Max. number of pages when bypassing CF through Chrome |
Subdomains are internal | ☐ | Whether to consider subdomains as internal links |
Follow links | Internal only | Which links to follow |
Follow links limit | 0 | Limit on the number of links followed, applied to each unique domain |
Skip comment blocks | ☐ | Whether to skip comment blocks |