
Using Regular Expressions

General Information

In A-Parser, Perl/JavaScript-compatible regular expressions are used, which can be applied:

Detailed documentation on regular expressions can be found in the following sources:

In A-Parser, any result can be processed with a regular expression by using the Use regex option:

Option to Parse custom result

Usage Specifics and Flags

  • Regular expressions are written without delimiters //
  • The following flags are supported:
    • i - case-insensitive search
    • s - dot matches all characters, including line breaks
    • g - global search or replacement
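To make the effect of each flag concrete, here is a minimal sketch in Python. Python's `re` module is only a stand-in for A-Parser's Perl/JavaScript-compatible engine, and the sample text is invented for illustration. `re.IGNORECASE` plays the role of `i`, `re.DOTALL` the role of `s`, and `re.findall` approximates `g`.

```python
import re

# Invented sample text with two lines
text = "Test line one\nsecond TEST line"

# i: case-insensitive search finds both spellings
case_insensitive = re.findall(r"test", text, flags=re.IGNORECASE)

# s: with DOTALL the dot also matches the line break between "one" and "second"
without_s = re.search(r"one.second", text)
with_s = re.search(r"one.second", text, flags=re.DOTALL)

# g: re.search stops at the first match, re.findall returns every match
first_only = re.search(r"line", text).group(0)
all_matches = re.findall(r"line", text)

print(case_insensitive)               # ['Test', 'TEST']
print(without_s, with_s is not None)  # None True
print(first_only, all_matches)        # line ['line', 'line']
```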

A flag can also be specified inside the regex itself. For example, the m flag (multiline) makes ^ and $ match the beginning and end of each line rather than of the whole text. The following expression finds every line containing the word test in the entire text (or page code, depending on what the regex is applied to):

(?m)^(.+?test.+?)$
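The behaviour of this expression can be checked outside A-Parser. The sketch below uses Python's `re` module, which also accepts the inline `(?m)` flag; the sample text is invented for illustration.

```python
import re

# Invented sample text: two of the four lines contain the word "test"
text = "first line\na test line here\nno match\nanother test, again"

# Same expression as above; (?m) makes ^ and $ work per line
pattern = r"(?m)^(.+?test.+?)$"

lines_with_test = re.findall(pattern, text)
print(lines_with_test)  # ['a test line here', 'another test, again']
```

Because `.+?` requires at least one character on each side, a line that begins or ends with the word test is not matched by this expression.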

Extraction of Arbitrary Information

Description of working with regular expressions in the task editor

With the Use regex option or the Results Constructor, regular expressions can extract arbitrary information from parsing results, for example from the source HTML code of pages or from already prepared results.

  • In Apply to, select the scraper result to process; it can be a simple result or an array
  • The regular expression is entered without delimiters; flags can be specified next to it
  • In Result Type, choose the type of result: Flat (simple result) or Array (array). If an array is selected as the source result, or the g flag is used, the result is always saved in an array. The name of the array is specified in the Name field
  • Each capturing group of the regular expression can be saved as a separate element. The element name is entered in the corresponding field $1 to, $2 to, and so on, where the number is the capturing group number
  • The Regex field supports the template engine, which allows the query to be used as part of the regular expression
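The relationship between capturing groups, element names, and the Flat/Array result types can be sketched roughly as follows. This is an illustration in Python, not A-Parser's implementation. The sample text, the regex, and the element names `name` and `count` are hypothetical stand-ins for what would be entered in the Regex, $1 to, and $2 to fields.

```python
import re

# Invented source result
source = "<b>apple</b>: 3\n<b>pear</b>: 5"

regex = r"<b>(.+?)</b>: (\d+)"
element_names = ["name", "count"]  # hypothetical names for "$1 to" and "$2 to"

# Flat: only the first match is kept, one value per capturing group
first = re.search(regex, source)
flat = dict(zip(element_names, first.groups()))

# Array: with the g flag every match becomes one array item
array = [dict(zip(element_names, m.groups())) for m in re.finditer(regex, source)]

print(flat)   # {'name': 'apple', 'count': '3'}
print(array)  # [{'name': 'apple', 'count': '3'}, {'name': 'pear', 'count': '5'}]
```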

The newly created results can be used in result formatting, in the Results Constructor, in filtering and deduplication of results, or in a subsequent Use regex option. This option works like the Results Constructor with RegEx Match.

Example of parsing links to images from the source HTML code

To solve this task, we use the Net::HTTP scraper to obtain the source code of the page. We apply a regular expression with the isg flags to $data (the downloaded page) and save the matches as src elements of the images array. In the result format, we output all src elements separated by newlines.
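The same idea can be sketched in Python. The regex below is an assumption made for illustration; the expression actually used in the example is stored in the importable task and is not reproduced here. The HTML snippet is also invented. `re.I | re.S` stand in for the i and s flags, and `re.finditer` for g.

```python
import re

# Invented page source standing in for $data
data = '''<html><body>
<IMG class="flag" SRC="/img/lang/en.png">
<img
  src="/img/lang/ru.png" alt="ru">
</body></html>'''

# Hypothetical regex: i handles the upper-case tag, s lets .+? cross the line break
img_regex = r'<img.+?src="(.+?)"'

# Collect every match as an item of the "images" array with one "src" element
images = [{"src": m.group(1)} for m in re.finditer(img_regex, data, flags=re.I | re.S)]

# Result format: all src elements separated by a newline
result = "\n".join(item["src"] for item in images)
print(result)
```

Without the i flag the first tag would be skipped, and without the s flag the second tag, whose attributes start on a new line, would not match.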

Parsing the query http://a-parser.com/ produces the following list in the result file:

/img/lang/en.png
/img/lang/ru.png
img/[email protected]
https://files.a-parser.com/img/site/tour_ru/V1qpV.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_1_all_parsers_list.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_1_quick_task.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_2_task_editor_easy.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_3_task_editor_analyze_domains.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_4_task_editor_parse_emails.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_5_queue_fast_google.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_6_queue_spyserp.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_7_javascript_parser.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_8_scheduler.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_9_settings.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_10_proxies.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_11_templates.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_12_task_tester.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_13_parser_test.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_14_api.png
https://files.a-parser.com/img/site/tour_ru/tour_ru_15_resources.png
data/avatars/s/0/12.jpg?1507557563
data/avatars/s/0/12.jpg?1507557563
data/avatars/s/13/13392.jpg?1570706020
data/avatars/s/16/16560.jpg?1586782475
data/avatars/s/1/1240.jpg?1537376153
styles/uix/xenforo/avatars/avatar_s.png
data/avatars/s/0/371.jpg?1412969226
styles/uix/xenforo/avatars/avatar_s.png
//mc.yandex.ru/watch/26891250
Download example

How to import an example into A-Parser


eJxtVN9v2jAQ/l8sJArqYH3YS7Stokhomxgwmj5BJlnkyLz612yHFUX533d2Egfa
vYDv7rvvvvNdXBFH7bPdGLDgLEl2FdHhTBKSw5GW3JFboqmxYHx4R1bgkuRLmm7Q
HxEVcWcNmHMorVNiC7ZJNM0BuaijwS7gBc2PTBS7n5+zsTWH/d6OP/mf3XBPspvJ
+H4UTh08bZiZLSJh66LG0DM6w/+KigATtAAbkV4zwSIkq3uR6gTGsBwQxXK0j8oI
6kwn+kR56WGDhmvShG+GgyBWDkekzrJYYBGiHq7vJu3dxeAjPUGqfAnGoXcv0Gr1
DvBmwEe7MqOJe/EMNM+ZY0pS3lTwnfRVnyT7E0RKhVg8GgZ2YZRAl4NA4J3nTt2O
DIJNkKIMuT+aHJIcKbdwSyxKXVAUkr+OMAeGOmXW2utBf0WUnHG+hBPwHhb4H0rG
c1yV2RGTvraJ/4es33DUsb3LUjisvwY1RJZgPay/91m5WqqiuwzOBHNo27kqpR/M
e3Q+A+h4ZysPE8pALNMyt9Xxa9Ag/Wb0I5vp3nXVxtVYrp0HJY+sWLfb1iFLmeIn
t5ZzJTQH35csOcexWNj26zGz7Ri80Qt8nTwPJa4+VqcUt98eG6naMFy/D16gwJu8
rNpSHijnT9vlZYT0K4XGL+e0TaZT+q55BiYHJabEJzooFK4UtlVn8ZGIT0l18VQk
VY1j+m03Dcb35BHow8uxOAOS3NX/AFJvlP8=

Regular expression constructor

Starting with version 1.2.78, a Regular Expression Constructor was added.

It can be found in the Tools -> Regular Expression Constructor tab. Page code obtained in Test Parsing can also be sent to the constructor directly: enable debug mode and click the Go to RegEx Builder link.

Open the page code in the regular expression constructor

In the constructor, there is an option to select the programming language in which the obtained regular expressions will be used.

To work with the constructor, insert the source text into the field on the left (it is filled in automatically when you arrive from Test Parsing via the Go to RegEx Builder link). The parameters of the future regular expression are configured on the right.

To compose a simple Regular Expression (for example, to get a title), it is enough to specify the necessary elements of the regular expression.

  • In the Before group field, we write the characters that are before the information we need
  • In the After group field, we write the characters that are after the required data
  • In the Group starts with field, we specify the characters with which the desired string should start
  • In the Group ends with field, we indicate the characters that should be at the end of the desired string

Example of getting a title using a regular expression constructor

As seen in the screenshot above, we are composing a regular expression that will select the site's title. We put <title> before the group and </title> after the group, and also, for example, indicate that the desired string starts with the letter W.
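One way to picture how the four fields combine into a single expression is the Python sketch below. The helper `build_group_regex` is hypothetical and is not the constructor's actual algorithm; it only shows a plausible composition of the Before group, Group starts with, Group ends with, and After group values around one capturing group.

```python
import re

def build_group_regex(before, after, starts_with="", ends_with=""):
    """Hypothetical composition of the constructor fields into one regex."""
    return (
        re.escape(before)                  # Before group
        + "(" + re.escape(starts_with)     # Group starts with
        + ".*?"                            # the data being captured
        + re.escape(ends_with) + ")"       # Group ends with
        + re.escape(after)                 # After group
    )

# Title example from the text: <title> before, </title> after, starts with W
title_regex = build_group_regex("<title>", "</title>", starts_with="W")

match = re.search(title_regex, "<head><title>Welcome to the site</title></head>")
title = match.group(1) if match else None
print(title)  # Welcome to the site

# A title that does not start with W is not captured
other = re.search(title_regex, "<title>Home</title>")
print(other)  # None
```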

For full testing of the obtained regular expression, there is an option to enable the necessary flags: g, s, and i.

You can also compose more complex regular expressions with two or more groups. For example, let's compose a regular expression that collects all links and their anchor texts from <li> list items. For this, we enable the g flag and add a second search group: the first group will contain the links, and the second the anchors.

Example of using groups in the regular expression constructor

After setting the necessary parameters for both groups, we get the regular expression:

<li><a href="(.+?)">(.+?)<\/a
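The resulting expression can also be tried outside the constructor. The sketch below runs it with Python's `re` module on an invented HTML fragment; `re.findall` stands in for the g flag and returns one (link, anchor) pair per match.

```python
import re

# Invented list markup for illustration
html = '<ul><li><a href="/docs">Docs</a></li><li><a href="/forum">Forum</a></li></ul>'

# The expression exactly as produced above
links_regex = r'<li><a href="(.+?)">(.+?)<\/a'

# Group 1 is the link, group 2 is the anchor text
pairs = re.findall(links_regex, html)
for link, anchor in pairs:
    print(link, anchor)
# /docs Docs
# /forum Forum
```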

To check the regular expression, press the Test button: