Using Regular Expressions
General Information
A-Parser uses Perl/JavaScript-compatible regular expressions, which can be applied:
- When parsing arbitrary information from any websites
- In the query builder to extract or replace part of a query
- In the results builder to transform any results
- When using filters
- In the regular expression constructor
- When checking the availability of the next page in the Net::HTTP scraper
Detailed documentation on regular expressions can be found in the following sources:
- Regular expressions on Wikipedia
- Universal encyclopedia of PCRE standard regular expressions
- The Sharing regexes thread on the forum
In A-Parser, any result can be processed with a regular expression by using the Use regex option:
Usage Specifics and Flags
- Regular expressions are written without the // delimiters
- The following flags are supported:
  - i - case-insensitive search
  - s - dot matches all characters, including line breaks
  - g - global search or replacement
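The effect of the three flags can be tried outside A-Parser. The following is a minimal sketch that assumes Python's re module as a stand-in engine (it is not A-Parser's own engine); the sample text is invented for illustration. In Python, the i and s flags correspond to re.IGNORECASE and re.DOTALL, and the g behaviour is approximated by re.findall.

```python
import re

# Invented sample text, used only for illustration.
text = "Test line one\nTEST line two\nno match here"

# i: case-insensitive search (re.IGNORECASE in Python).
case_insensitive = re.findall(r"test", text, flags=re.IGNORECASE)

# Without the s flag, the dot stops at a line break, so nothing is found.
dot_default = re.search(r"one.+TEST", text)

# s: the dot also matches line breaks (re.DOTALL in Python).
dot_all = re.search(r"one.+TEST", text, flags=re.DOTALL)

# g: a global search returns every match instead of only the first one.
first_only = re.search(r"\w+ line", text, flags=re.IGNORECASE).group(0)
all_matches = re.findall(r"\w+ line", text, flags=re.IGNORECASE)

print(case_insensitive)
print(dot_default, dot_all.group(0) if dot_all else None)
print(first_only, all_matches)
```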
Additionally, a flag can be specified within the regex itself. For example, the following expression searches for the word test in each line of the entire text (or page code, depending on what the regex is applied to) using the m flag (multiline: the symbols ^ and $ match the beginning and end of each line respectively):
(?m)^(.+?test.+?)$
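To see what the inline m flag changes, the expression can be run against a small multi-line string. This is a sketch that assumes Python's re module, which accepts the same (?m) syntax at the start of a pattern; the sample text is invented.

```python
import re

# Invented sample text with four lines.
page = "first line\nthis is a test line\nanother one\nlatest testing notes"

# With (?m), ^ and $ match at the start and end of every line.
with_m = re.findall(r"(?m)^(.+?test.+?)$", page)

# Without (?m), ^ matches only at the very start of the text,
# and the dot cannot cross line breaks, so nothing is found here.
without_m = re.findall(r"^(.+?test.+?)$", page)

print(with_m)
print(without_m)
```

Note that .+? requires at least one character on each side of test, so a line that begins or ends with the word itself is not matched by this expression.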
Extraction of Arbitrary Information
Using the Use regex option or the Results Constructor, regular expressions can extract arbitrary information from parsing results, for example from the source HTML code of pages or from already prepared results:
- In Apply to, the result from the scraper is selected, which can be a simple result or an array
- The regular expression is specified without delimiters, after which flags can be specified
- In Result Type, the type of result is indicated: Flat (simple result) or Array (array). If an array is selected as the source result or the g flag of the regular expression is used, the result will always be saved in an array. The name of the array is specified in the Name field
- Each capturing group of the regular expression can be saved as a separate element; the name of the element is written in the corresponding field ($1 to, $2 to, ...), where the number indicates the capturing group number
- In the Regex field, you can use the template engine, which allows using the query as part of the regular expression
The newly created results can be used in result formatting, in the results builder, in result filtering and deduplication, or in the next Use regex option.
This option is similar to using RegEx Match in the results builder.
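The mapping of capturing groups to named elements, and the difference between a Flat result and an Array result, can be illustrated roughly as follows. This is only a sketch in Python under stated assumptions: the source text, the pattern, and the element names title and price are hypothetical, and the dictionaries merely imitate how A-Parser stores results.

```python
import re

# Hypothetical source result and pattern, for illustration only.
source = "name=alpha; price=10\nname=beta; price=25"
pattern = re.compile(r"name=(\w+); price=(\d+)")

# Hypothetical element names, playing the role of the "$1 to" and "$2 to" fields.
names = ("title", "price")

# Flat (simple result): only the first match is kept.
flat = dict(zip(names, pattern.search(source).groups()))

# Array: with a global search, every match becomes one array element.
items = [dict(zip(names, m.groups())) for m in pattern.finditer(source)]

print(flat)
print(items)
```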
Example of parsing links to images from the source HTML code
To solve this task, we use the scraper Net::HTTP to obtain the source code of the page.
We apply a regular expression with the isg flags to $data (the downloaded page) and save the results in the src elements of the images array.
In the result format, we specify that all src elements are output separated by a newline.
As a result of parsing the query http://a-parser.com/, the result file will contain the following list:
/img/lang/en.png
/img/lang/ru.png
img/[email protected]
https://files.a-parser.com/img/site/tour_ru/V1qpV.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_1_all_parsers_list.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_1_quick_task.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_2_task_editor_easy.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_3_task_editor_analyze_domains.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_4_task_editor_parse_emails.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_5_queue_fast_google.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_6_queue_spyserp.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_7_javascript_parser.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_8_scheduler.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_9_settings.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_10_proxies.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_11_templates.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_12_task_tester.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_13_parser_test.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_14_api.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_15_resources.png
data/avatars/s/0/12.jpg?1507557563
data/avatars/s/0/12.jpg?1507557563
data/avatars/s/13/13392.jpg?1570706020
data/avatars/s/16/16560.jpg?1586782475
data/avatars/s/1/1240.jpg?1537376153
styles/uix/xenforo/avatars/avatar_s.png
data/avatars/s/0/371.jpg?1412969226
styles/uix/xenforo/avatars/avatar_s.png
//mc.yandex.ru/watch/26891250
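The steps of this example can be approximated outside A-Parser. The sketch below is written in Python and rests on several assumptions: the regular expression is a plausible guess and not necessarily the one used in the downloadable example, the HTML snippet is invented, and the list of dictionaries only imitates the images array with its src elements.

```python
import re

# Invented page source standing in for $data.
data = """<html><body>
<IMG class="flag" src="/img/lang/en.png">
<img
  src="https://example.com/tour.png" alt="tour">
</body></html>"""

# Assumed pattern; i and s are passed as flags, g is emulated by finditer.
pattern = re.compile(r'<img\b.+?src="([^"]+)"', re.IGNORECASE | re.DOTALL)

# Imitation of the images array, each element holding one src value.
images = [{"src": m.group(1)} for m in pattern.finditer(data)]

# Result format: all src elements separated by a newline.
output = "\n".join(item["src"] for item in images)
print(output)
```

The i flag lets the pattern match the uppercase IMG tag, and the s flag lets the dot cross the line break inside the second tag.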
Download example
How to import an example into A-Parser
eJxtVN9v2jAQ/l8sJArqYH3YS7Stokhomxgwmj5BJlnkyLz612yHFUX533d2Egfa
vYDv7rvvvvNdXBFH7bPdGLDgLEl2FdHhTBKSw5GW3JFboqmxYHx4R1bgkuRLmm7Q
HxEVcWcNmHMorVNiC7ZJNM0BuaijwS7gBc2PTBS7n5+zsTWH/d6OP/mf3XBPspvJ
+H4UTh08bZiZLSJh66LG0DM6w/+KigATtAAbkV4zwSIkq3uR6gTGsBwQxXK0j8oI
6kwn+kR56WGDhmvShG+GgyBWDkekzrJYYBGiHq7vJu3dxeAjPUGqfAnGoXcv0Gr1
DvBmwEe7MqOJe/EMNM+ZY0pS3lTwnfRVnyT7E0RKhVg8GgZ2YZRAl4NA4J3nTt2O
DIJNkKIMuT+aHJIcKbdwSyxKXVAUkr+OMAeGOmXW2utBf0WUnHG+hBPwHhb4H0rG
c1yV2RGTvraJ/4es33DUsb3LUjisvwY1RJZgPay/91m5WqqiuwzOBHNo27kqpR/M
e3Q+A+h4ZysPE8pALNMyt9Xxa9Ag/Wb0I5vp3nXVxtVYrp0HJY+sWLfb1iFLmeIn
t5ZzJTQH35csOcexWNj26zGz7Ri80Qt8nTwPJa4+VqcUt98eG6naMFy/D16gwJu8
rNpSHijnT9vlZYT0K4XGL+e0TaZT+q55BiYHJabEJzooFK4UtlVn8ZGIT0l18VQk
VY1j+m03Dcb35BHow8uxOAOS3NX/AFJvlP8=
Regular expression constructor
Starting with version 1.2.78, a Regular Expression Constructor was added.
It can be found in the Tools -> Regular Expression Constructor tab. You can also send the obtained page code directly to Test Parsing. For this, you need to enable debug mode and click on the link Go to RegEx Builder.
In the constructor, there is an option to select the programming language in which the obtained regular expressions will be used.
To work with the constructor, insert the source text into the field on the left (it is inserted automatically when you arrive from Test Parsing via the Go to RegEx Builder link). On the right, configure the parameters of the future regular expression.
To compose a simple Regular Expression (for example, to get a title), it is enough to specify the necessary elements of the regular expression.
- In the Before group field, we write the characters that are before the information we need
- In the After group field, we write the characters that are after the required data
- In the Group starts with field, we specify the characters with which the desired string should start
- In the Group ends with field, we indicate the characters that should be at the end of the desired string
As seen in the screenshot above, we are composing a regular expression that will select the site's title. We put <title> before the group and </title> after the group, and also, for example, indicate that the desired string starts with the letter W.
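How the four fields could combine into a single expression can be sketched in Python. This is an assumption about the general idea, not the constructor's actual algorithm or output: the function build_regex and its parameters are hypothetical, and the exact expression produced by A-Parser may differ.

```python
import re

def build_regex(before="", after="", starts_with="", ends_with=""):
    """Hypothetical helper: combine the four constructor fields into one pattern."""
    # The capturing group holds the required data, optionally anchored
    # by the characters it must start and end with.
    group = "(" + re.escape(starts_with) + ".+?" + re.escape(ends_with) + ")"
    # The Before and After fields surround the group as literal text.
    return re.escape(before) + group + re.escape(after)

pattern = build_regex(before="<title>", after="</title>", starts_with="W")

# Invented page fragments for testing the pattern.
match = re.search(pattern, "<head><title>Welcome to the site</title></head>")
no_match = re.search(pattern, "<head><title>Home</title></head>")

print(match.group(1))
print(no_match)
```

Escaping the field values with re.escape is a design choice of this sketch: it treats everything typed into the fields as literal text rather than as regex syntax.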
For full testing of the obtained regular expression, there is an option to enable the necessary flags: g, s, and i.
You can also compose more complex regular expressions, in which there are 2 or more groups.
For example, let's try to compose a regular expression for collecting all links and anchors in the list <li>. For this, we need to enable the g flag and add another search group, as the first group will contain links, and the second anchors.
After setting the necessary parameters for both groups, we get the regular expression:
<li><a href="(.+?)">(.+?)<\/a
To check the regular expression, press the Test button:
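The same check can be reproduced outside the constructor. The following sketch assumes Python's re module, where re.findall plays the role of the g flag; the HTML list is invented for illustration, and the expression is used exactly as shown above.

```python
import re

# Invented HTML list used as the source text.
html = '<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>'

# The expression from the text: group 1 is the link, group 2 is the anchor.
pattern = r'<li><a href="(.+?)">(.+?)<\/a'

# findall returns one (link, anchor) tuple per match, like a global search.
pairs = re.findall(pattern, html)

for link, anchor in pairs:
    print(link, anchor)
```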