Using Regular Expressions
General Information
A-Parser uses Perl/JavaScript-compatible regular expressions that can be used:
- When parsing arbitrary information from any websites
- In the query builder to extract or replace part of the query
- In the results builder to transform any results
- When using filters
- When checking the availability of the next page in the Net::HTTP scraper
Detailed documentation on regular expressions can be found in the following sources:
- Regular expressions on Wikipedia
- Universal encyclopedia of regular expressions of the PCRE standard
- Share your regular expressions on the forum
In A-Parser, any result can be processed with a regular expression by using the Parse custom result option.
Usage Specifics and Flags
- Regular expressions are written without the // delimiters
- The following flags are supported:
- i - case-insensitive search
- s - the dot includes all characters, including line breaks
- g - global search or replacement
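The effect of the three flags can be tried outside A-Parser. The following is a minimal sketch that uses Python's re module as a stand-in for A-Parser's engine: re.IGNORECASE approximates i, re.DOTALL approximates s, and re.findall approximates g. The sample text is invented for illustration.

```python
import re

text = "Hello\nWORLD hello"

# i: case-insensitive search finds both spellings of "hello".
case_insensitive = re.findall(r"hello", text, flags=re.IGNORECASE)

# s: without it the dot stops at a line break, with it the dot crosses lines.
without_s = re.search(r"Hello.WORLD", text)
with_s = re.search(r"Hello.WORLD", text, flags=re.DOTALL)

# g: a global search returns every match instead of only the first one.
first_only = re.search(r"l+", text).group(0)
all_matches = re.findall(r"l+", text)

print(case_insensitive, without_s, with_s.group(0), first_only, all_matches)
```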
Additionally, a flag can be specified inside the regular expression itself. For example, the following expression finds every line containing the word test in the entire text (or page code, depending on what the regular expression is applied to) by using the m flag (multi-line: the ^ and $ characters match the beginning and end of each line):
(?m)^(.+?test.+?)$
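The same expression can be checked in Python, whose re module also accepts the inline (?m) flag. This is only a sketch with invented sample text. Note that the pattern requires at least one character both before and after the word test on the same line, so a line consisting only of test is not matched.

```python
import re

text = "\n".join([
    "first line",
    "this is a test line here",
    "test",
    "last test ok",
])

# (?m) makes ^ and $ match at the start and end of every line.
lines_with_test = re.findall(r"(?m)^(.+?test.+?)$", text)

print(lines_with_test)
```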
Extraction of Any Information
With the help of the Parse custom results option or the Results builder, regular expressions can be used to extract arbitrary information from parsing results, for example, from the source code of HTML pages or from already prepared results.
- The result from the parser is selected as Parse result; it can be a simple result or an array
- The regular expression is specified without delimiters; flags can be specified after it
- The Result type field sets the type of the new result: Flat (simple result) or Array (array). If an array is selected as the source result or the g flag of the regular expression is used, the result will always be saved in an array. The Name field specifies the name of the array
- Each capturing group of the regular expression can be saved as a separate element. The name of the element is entered in the corresponding field $1 to, $2 to..., where the number denotes the number of the capturing group
- In the RegEx field, you can use the template engine, which allows you to use the query as part of the regular expression
The newly created results can be used in result formatting, in the results builder, in filtering and deduplication of results, or in the next Parse custom result option. This option is similar to the results builder when using RegEx Match.
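The mapping from capturing groups to named elements can be pictured with a short Python sketch. It is not A-Parser's implementation: the function parse_custom_result, the element names link and anchor, and the sample HTML are all hypothetical. The last part imitates using the query inside the expression, and escaping the query with re.escape is an assumption of this sketch, not documented A-Parser behavior.

```python
import re

html = '<a href="/docs">Docs</a> <a href="/forum">Forum</a>'

def parse_custom_result(source, pattern, names, flags=0):
    """Collect every match as a dict, naming group $1, $2, ... by `names`."""
    return [
        dict(zip(names, match.groups()))
        for match in re.finditer(pattern, source, flags)
    ]

# Two capturing groups: $1 is stored as "link", $2 as "anchor".
links = parse_custom_result(
    html,
    r'<a href="([^"]+)">([^<]+)</a>',
    names=["link", "anchor"],
    flags=re.IGNORECASE | re.DOTALL,
)

# Using the query as part of the expression (hypothetical query value).
query = "forum"
query_links = re.findall(rf'href="([^"]*{re.escape(query)}[^"]*)"', html)

print(links, query_links)
```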
Example of Parsing Links to Images from Source HTML Code
To solve this task, we use the Net::HTTP scraper to get the source code of the page. We apply the regular expression with the isg flags to $data (the downloaded page) and save the result in the images array, in the src element. In the result format, we output all src elements separated by a line break.
As a result of parsing, for the request http://a-parser.com/, we will get the following list in the result file:
/img/lang/en.png
/img/lang/ru.png
img/[email protected]
https://files.a-parser.com/img/site/tour_ru/V1qpV.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_1_all_parsers_list.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_1_quick_task.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_2_task_editor_easy.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_3_task_editor_analyze_domains.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_4_task_editor_parse_emails.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_5_queue_fast_google.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_6_queue_spyserp.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_7_javascript_parser.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_8_scheduler.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_9_settings.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_10_proxies.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_11_templates.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_12_task_tester.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_13_parser_test.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_14_api.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_15_resources.png
data/avatars/s/0/12.jpg?1507557563
data/avatars/s/0/12.jpg?1507557563
data/avatars/s/13/13392.jpg?1570706020
data/avatars/s/16/16560.jpg?1586782475
data/avatars/s/1/1240.jpg?1537376153
styles/uix/xenforo/avatars/avatar_s.png
data/avatars/s/0/371.jpg?1412969226
styles/uix/xenforo/avatars/avatar_s.png
//mc.yandex.ru/watch/26891250
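The same extraction can be approximated in Python. The exact expression used in the A-Parser example is not shown in the text, so the pattern below is an assumption, as is the sample page. The flags mirror isg: re.IGNORECASE and re.DOTALL for i and s, and findall for g.

```python
import re

# Invented sample page; the first tag is upper-case and spans two lines.
page = '''<IMG class="flag"
  src="/img/lang/en.png"> <img src="/img/lang/ru.png" alt="ru">'''

# Assumed pattern: lazily skip to the src attribute and capture its value.
IMG_SRC = re.compile(r'<img.+?src="(.+?)"', re.IGNORECASE | re.DOTALL)

# Store each capture in an "images" list under the "src" element.
images = [{"src": src} for src in IMG_SRC.findall(page)]

# Result format: every src element on its own line.
output = "\n".join(image["src"] for image in images)
print(output)
```

A pattern this loose can run past a tag that has no src attribute, so it is suitable only as an illustration of the flags and the result layout.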
Download example
How to import the example into A-Parser
eJxtVN9v2jAQ/l8sJArqYH3YS7Stokhomxgwmj5BJlnkyLz612yHFUX533d2Egfa
vYDv7rvvvvNdXBFH7bPdGLDgLEl2FdHhTBKSw5GW3JFboqmxYHx4R1bgkuRLmm7Q
HxEVcWcNmHMorVNiC7ZJNM0BuaijwS7gBc2PTBS7n5+zsTWH/d6OP/mf3XBPspvJ
+H4UTh08bZiZLSJh66LG0DM6w/+KigATtAAbkV4zwSIkq3uR6gTGsBwQxXK0j8oI
6kwn+kR56WGDhmvShG+GgyBWDkekzrJYYBGiHq7vJu3dxeAjPUGqfAnGoXcv0Gr1
DvBmwEe7MqOJe/EMNM+ZY0pS3lTwnfRVnyT7E0RKhVg8GgZ2YZRAl4NA4J3nTt2O
DIJNkKIMuT+aHJIcKbdwSyxKXVAUkr+OMAeGOmXW2utBf0WUnHG+hBPwHhb4H0rG
c1yV2RGTvraJ/4es33DUsb3LUjisvwY1RJZgPay/91m5WqqiuwzOBHNo27kqpR/M
e3Q+A+h4ZysPE8pALNMyt9Xxa9Ag/Wb0I5vp3nXVxtVYrp0HJY+sWLfb1iFLmeIn
t5ZzJTQH35csOcexWNj26zGz7Ri80Qt8nTwPJa4+VqcUt98eG6naMFy/D16gwJu8
rNpSHijnT9vlZYT0K4XGL+e0TaZT+q55BiYHJabEJzooFK4UtlVn8ZGIT0l18VQk
VY1j+m03Dcb35BHow8uxOAOS3NX/AFJvlP8=
Regular Expression Builder
The Regular Expression Builder is available starting from version 1.2.78.
You can find it on the Tools tab -> Regular Expression Builder. You can also send page code obtained in the Test scraper directly to the builder. To do this, enable the debug mode and click on the Go to RegEx Builder link.
In the builder, you can choose the programming language in which the obtained regular expressions will be used.
To work with the builder, you need to insert the source text into the left field (or it will be inserted automatically from the Test scraper when you go to the RegEx Builder). On the right, configure the parameters of the future regular expression.
To create a simple regular expression (for example, to get the title), it is enough to specify the necessary elements of the regular expression.
- In the Before group field, enter the characters that precede the information we need
- In the After group field, enter the characters that follow the desired data
- In the Group starts with field, specify the characters with which the desired string should begin
- In the Group ends with field, specify the characters that should be at the end of the desired string
In the example shown, we create a regular expression that selects the title of the site. We put <title> before the group and </title> after the group, and also, for example, indicate that the desired string starts with the letter A.
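One way to picture how the four fields could combine into a single expression is the Python sketch below. How the builder actually assembles its output is not described here, so the function build_regex and its composition rule are assumptions, and the sample page is invented.

```python
import re

def build_regex(before, after, starts_with="", ends_with=""):
    """Hypothetical composition: before + (starts_with ... ends_with) + after."""
    group = re.escape(starts_with) + ".+?" + re.escape(ends_with)
    return re.escape(before) + "(" + group + ")" + re.escape(after)

page = "<html><head><title>A-Parser - scraper</title></head></html>"

# Title example: <title> before the group, </title> after it,
# and the captured string must start with the letter A.
pattern = build_regex("<title>", "</title>", starts_with="A")
title = re.search(pattern, page, re.IGNORECASE | re.DOTALL).group(1)

print(pattern, title)
```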
To fully test the obtained regular expression, it is possible to enable the necessary flags: g, s, and i.
It is also possible to create more complex regular expressions with 2 or more groups.
For example, let's try to create a regular expression to collect all links and anchors in a list (<li>). To do this, we need to enable the g flag and add another search group, since the first group will contain links, and the second will contain anchors.
After setting the necessary parameters for both groups, we get the regular expression:
<li><a href="(.+?)">(.+?)<\/a
To check the created regular expression, click the Test button.
After executing the regular expression, the result of its work is displayed at the bottom: the full string and the captured groups. Double-clicking on any element in the result table scrolls the initial text to the location of this match.
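The expression above can also be tried in Python, where the escaped slash is accepted as a literal slash. The sketch below uses an invented HTML list and collects, for every match, the full string and both captured groups, similar to what the result table displays.

```python
import re

html = (
    '<ul><li><a href="/a">First</a></li>'
    '<li><a href="/b">Second</a></li></ul>'
)

# Expression produced by the builder, used as-is.
pattern = r'<li><a href="(.+?)">(.+?)<\/a'

# finditer plays the role of the g flag: one row per match.
rows = [
    (match.group(0), match.group(1), match.group(2))
    for match in re.finditer(pattern, html)
]

for full, link, anchor in rows:
    print(full, link, anchor)
```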
Useful links
Regular expressions for the little ones
My name is Vitaly Kotov and I know a little about regular expressions. Under the cut, I will tell you the basics of working with them...
Regular expressions (regexp) - basics
Regular expressions are a mechanism for finding and replacing text. In a string, file, multiple files...
⏩Parsing industrial equipment catalog
An example of using regular expressions in parsing an industrial equipment catalog
⏩Parsing the Booking.com resource
An example of using regular expressions in parsing the Booking.com resource
⏩Finding contact pages
An example of using regular expressions in parsing contact pages