In A-Parser, it is possible to filter results according to a set of specific rules and save only the necessary data.
The main ways to use filters are:
- Saving only those links that contain a specific string (for example, a CMS signature)
- Filtering a domain database by specific parameters (for example, Yandex IKS of 300 or more, Alexa rank up to 100000, language RU)
- Checking the server response (for example, 200 OK or the content of specific headers)
- Checking whether the snippet contains the original query
- Any other use cases where it is necessary to limit results according to specific conditions
Filters are added in the Task Editor by clicking the tool icon next to the relevant scraper.
Types of Filtering
Both single results and arrays of results can be filtered (see Representation of Results). There are several types of filters:
- By equality or inequality of strings
- By the presence or absence of a substring
- By matching or not matching a regular expression
- By greater than, less than, or equal to (for numeric values)
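The filter types above can be pictured as simple predicates. The following is a minimal Python sketch of that logic for illustration only; it is not A-Parser's implementation, and the function name `passes` and the kind labels (`"contains"`, `"gt"`, and so on) are hypothetical.

```python
import re

def passes(value, kind, operand):
    """Return True if `value` satisfies a single filter of the given kind.

    The kind labels are illustrative names, not A-Parser identifiers.
    """
    if kind == "equal":
        return str(value) == str(operand)
    if kind == "not_equal":
        return str(value) != str(operand)
    if kind == "contains":
        return str(operand) in str(value)
    if kind == "not_contains":
        return str(operand) not in str(value)
    if kind == "regex":
        return re.search(operand, str(value)) is not None
    if kind == "not_regex":
        return re.search(operand, str(value)) is None
    # Numeric comparisons convert both sides to numbers first.
    if kind == "gt":
        return float(value) > float(operand)
    if kind == "lt":
        return float(value) < float(operand)
    if kind == "num_equal":
        return float(value) == float(operand)
    raise ValueError(f"unknown filter kind: {kind}")
```

For example, `passes("http://site.example/wp-content/", "contains", "wp-content")` is true, while `passes("150", "gt", 300)` is false.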
Features of Filtering
- When filtering arrays of results, only the items that pass the filter remain in the array
- When filtering simple results, if a result does not pass the filter, the entire result for that query is skipped, even when multiple scrapers are used
- When a task uses two or more filters, they are combined with a logical AND: a result is saved only if it meets the conditions of all filters simultaneously
- In the comparison value field (string, regular expression, or number), you can use the Template Toolkit template engine. All variables available for General Result Format are available here as well
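The difference between array and simple results, and the AND combination of filters, can be sketched in Python as follows. This is an illustration of the described behaviour under assumptions, not A-Parser code: `filter_array`, `filter_simple`, and the sample data are hypothetical, and `None` stands in for "the query's result is skipped".

```python
def filter_array(items, predicates):
    """Keep only the array items that satisfy every predicate (logical AND)."""
    return [item for item in items if all(p(item) for p in predicates)]

def filter_simple(result, predicates):
    """Return the result if it satisfies every predicate, otherwise None.

    None models the whole result for the query being skipped.
    """
    return result if all(p(result) for p in predicates) else None

links = [
    "http://a.example/wp-login.php",
    "http://b.example/index.html",
    "https://c.example/wp-admin/",
]
# Two filters: a link must pass both to be kept.
predicates = [lambda u: "wp-" in u, lambda u: u.startswith("https://")]
kept = filter_array(links, predicates)

# A simple result that fails its filter is dropped entirely.
skipped = filter_simple({"code": 404}, [lambda r: r["code"] == 200])
```

Here `kept` holds only the one link that satisfies both conditions, and `skipped` is `None` because the single result failed its filter.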
Filtering by Text on a Page
Checking a database of sites for the presence of specific text on the page.
We use a file with links as queries and, as a result, get a file with the links on which the desired text was found.
We use the Net::HTTP scraper to download each page and save the query (the link being checked) as the result. The filter is applied to the $data result, which holds the content of the downloaded page. The filter type is Contain string, and its value is the string to look for.
Filtering Images by Size
Filtering images by resolution.
When scraping with SE::Google::Images, we filter on image height and width and save only images larger than 500x500 pixels.
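As a rough Python illustration of this rule (not the scraper's actual data model), two numeric "greater than" filters are applied to every image, and an image is kept only if both pass. The field names `width` and `height`, the constant `MIN_SIDE`, and the sample records are assumptions made for the sketch.

```python
MIN_SIDE = 500  # threshold in pixels, taken from the example above

# Hypothetical image records; the real result fields may be named differently.
images = [
    {"url": "http://img.example/a.jpg", "width": 800, "height": 600},
    {"url": "http://img.example/b.jpg", "width": 1200, "height": 400},
    {"url": "http://img.example/c.jpg", "width": 500, "height": 500},
]

# Both conditions must hold, mirroring the logical AND between filters.
large = [
    img for img in images
    if img["width"] > MIN_SIDE and img["height"] > MIN_SIDE
]
```

With a strict "greater than" comparison, an image of exactly 500x500 pixels is not kept.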
Filtering by Several Attributes
Filtering links by the presence of any one of several strings.
To match any of several strings, we use a regular expression filter and list the alternatives separated by a delimiter.
Note that a number of characters must be escaped in regular expressions.
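To show why escaping matters, here is a small Python sketch that builds an "any of these strings" pattern using the standard regex alternation character `|`. It uses Python's `re.escape`, whose escaping rules are specific to Python and may differ in detail from what A-Parser's regular expression field expects; the helper name `build_any_of_pattern` and the sample strings are hypothetical.

```python
import re

def build_any_of_pattern(needles):
    """Join literal strings into one alternation pattern, escaping special characters."""
    return "|".join(re.escape(needle) for needle in needles)

# Hypothetical CMS signatures; "." and "?" would be metacharacters if left unescaped.
pattern = build_any_of_pattern(["wp-content", "index.php?option=com_", "/bitrix/"])

links = [
    "http://a.example/wp-content/themes/",
    "http://b.example/index.php?option=com_content",
    "http://c.example/about.html",
]
matched = [url for url in links if re.search(pattern, url)]
```

Without escaping, the `.` in `index.php` would match any character and the `?` would make the preceding `p` optional, so the filter could accept links it was not meant to.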
Filtering by a Specific Parameter
Saving sites with a specific Alexa Rank.
We save only sites with an Alexa Rank greater than 100.
Using a Query in a Filter
Saving Google snippets that contain the original query.
As the comparison string, we explicitly specify [% query %], a variable that contains the query. Select Sensitive to make the substring search case-sensitive.
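The effect of the case sensitivity option can be illustrated with a short Python sketch. It only models the substring check described above; the function name `snippet_contains_query` and its `case_sensitive` parameter are hypothetical and do not correspond to A-Parser identifiers.

```python
def snippet_contains_query(snippet, query, case_sensitive=True):
    """Return True if the query occurs in the snippet as a substring."""
    if not case_sensitive:
        # casefold() is a more aggressive lower() intended for caseless matching.
        snippet, query = snippet.casefold(), query.casefold()
    return query in snippet

snippet = "Buy a Laptop online with free delivery"
exact = snippet_contains_query(snippet, "laptop")
relaxed = snippet_contains_query(snippet, "laptop", case_sensitive=False)
```

With case sensitivity on, the query `laptop` does not match a snippet that contains `Laptop`; with it off, the same snippet passes the filter.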