Result Uniqueness
Uniqueness, deduplication, removing duplicates, removing repetitions: all of these terms mean the same thing, namely that repeated results are not wanted. A-Parser offers two methods of result uniqueness; each is described in detail below.
Unique results by string
This method is applied after the result has been formed (see Basic Formatting Principles), immediately before it is written to a file. Each line is checked for uniqueness, and only new, unique lines are written to the file.
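As a rough illustration of the idea (not A-Parser's actual implementation), string uniqueness can be sketched as a filter that remembers every line already seen and lets only first occurrences through. The function name `unique_lines` and the sample data are hypothetical.

```python
def unique_lines(lines):
    """Yield each already-formatted result line only the first time it appears."""
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line


# Hypothetical formatted results, in the order they would reach the file.
formatted = ["http://a.com/", "http://b.com/", "http://a.com/"]
deduplicated = list(unique_lines(formatted))
```

Because the check happens on the final formatted line, two results that differ only in a field that is not printed would count as duplicates under this approach.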
You can enable uniqueness by string in Quick Task:
or in Task Editor:
Uniqueness by any result
Uniqueness by any result lets you apply uniqueness directly to a selected result of a specific scraper (see Result Presentation in the scraper). To add this type of uniqueness, click the tool icon to the right of the scraper in Task Editor and choose Add unique result:
Now you can choose which result to apply uniqueness to, and the uniqueness type:
The Global switch applies when two or more scrapers are selected. It determines whether uniqueness is shared across all scrapers or maintained separately for each one.
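The difference between shared and per-scraper uniqueness can be sketched as follows. This is a simplified model under the assumption that results arrive as `(scraper_name, value)` pairs; the function `filter_unique` and its `global_scope` flag are illustrative names, not A-Parser settings.

```python
from collections import defaultdict


def filter_unique(results, global_scope):
    """Keep first occurrences, using one shared set or one set per scraper."""
    seen = defaultdict(set)
    kept = []
    for scraper, value in results:
        # With global scope every scraper shares the same bucket.
        bucket = "*" if global_scope else scraper
        if value not in seen[bucket]:
            seen[bucket].add(value)
            kept.append((scraper, value))
    return kept


results = [
    ("SE::Google", "http://a.com/"),
    ("HTML::LinkExtractor", "http://a.com/"),
    ("SE::Google", "http://a.com/"),
]
shared = filter_unique(results, global_scope=True)
separate = filter_unique(results, global_scope=False)
```

With shared uniqueness the link survives once in total; with separate uniqueness it survives once for each scraper that produced it.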
Types of uniqueness
Parameter | Description |
---|---|
String | Uniqueness by string (the entire result string is compared) |
Domain | Uniqueness by domain (the entire domain is compared, for example, www.domain.com and domain.com are different domains) |
Top-level domain | Uniqueness by the main domain, taking into account regional, commercial, educational and other domains (for example, domain.co.uk and domain2.co.uk are different domains, and sub1.domain.com and sub2.domain.com are the same) |
Second-level domain | Uniqueness by the second-level domain (second-level domains are compared, for example, www.domain.com, domain.com and user.subdomain.domain.com are all the same domain) |
Path | Uniqueness by path (the part of the link before the file name is compared, for example, http://domain.com/path1/file.php and http://domain.com/path1/file2.php have the same path) |
Without parameters | Uniqueness by link without parameters (links are compared with their query parameters removed, for example, http://domain.com/file.php?page=1 and http://domain.com/file.php?page=2 are treated as the same link) |
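One way to picture the types in the table is as different key functions: two results are duplicates when their keys are equal. The sketch below is an interpretation of the table, not A-Parser's code. All function names are hypothetical, the URLs are assumed to include a scheme, and `MULTI_LABEL_SUFFIXES` is a tiny stand-in for a full public suffix list.

```python
from urllib.parse import urlsplit

# Illustrative sample only; real suffix handling needs a complete suffix list.
MULTI_LABEL_SUFFIXES = {"co.uk", "org.uk", "com.au"}


def host(url):
    """Return the host name of a URL that includes a scheme."""
    return urlsplit(url).hostname or ""


def key_string(value):
    return value  # the entire result string is compared


def key_domain(url):
    return host(url)  # www.domain.com and domain.com stay distinct


def key_second_level(url):
    return ".".join(host(url).split(".")[-2:])  # last two labels only


def key_top_level(url):
    labels = host(url).split(".")
    # Keep one label in front of the suffix, which may itself span two labels.
    suffix_len = 2 if ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES else 1
    return ".".join(labels[-(suffix_len + 1):])


def key_path(url):
    parts = urlsplit(url)
    directory = parts.path.rsplit("/", 1)[0] + "/"  # drop the file name
    return f"{parts.scheme}://{parts.netloc}{directory}"


def key_without_parameters(url):
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}{parts.path}"  # drop the query string
```

A deduplication pass would then store `key(result)` in its set instead of the raw result, so the choice of type only changes which key function is used.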
Unique queries
With unique queries enabled, only queries that have not already been parsed in the current task are sent for parsing. The main use cases are:
- When the source queries contain duplicates that should not be parsed again (duplicate work)
- When using the Parse to level option, where only unique queries should be used to prevent the number of requests from growing and looping (for example, when using the HTML::LinkExtractor scraper)
In all other cases, enabling query uniqueness unnecessarily will only slow down the overall work of the scraper.
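The looping problem mentioned for Parse to level can be illustrated with a small breadth-first sketch. It assumes a hypothetical `extract_links` callback standing in for a link-extracting scraper and a hard-coded link graph; none of these names come from A-Parser.

```python
from collections import deque


def crawl(start_queries, extract_links, max_level):
    """Process each query once, following extracted links up to max_level."""
    seen = set()
    queue = deque((query, 0) for query in start_queries)
    processed = []
    while queue:
        query, level = queue.popleft()
        if query in seen:
            continue  # already parsed in this task, skip the duplicate
        seen.add(query)
        processed.append(query)
        if level < max_level:
            for link in extract_links(query):
                queue.append((link, level + 1))
    return processed


# Hypothetical site whose pages link back to each other.
LINKS = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/", "http://example.com/b"],
    "http://example.com/b": ["http://example.com/a"],
}
processed = crawl(
    ["http://example.com/", "http://example.com/"],
    lambda url: LINKS.get(url, []),
    max_level=3,
)
```

Without the `seen` check, the pages in this graph would keep re-queuing one another at every level, so the number of requests would grow with depth instead of stopping at three.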
Keeping uniqueness across tasks
A-Parser can save the uniqueness database for use in future tasks, so that new tasks save only results that are new and unique (for example, links when parsing SERPs with SE::Google).
To save the uniqueness database, you need to create a new database name when adding the first task:
For all subsequent tasks, select the previously created database name. Only new unique results will then be saved, regardless of whether they are written to the same file as in the first task or to a new file.
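The effect of a shared uniqueness database can be modeled with a small file-backed set. This is only a conceptual sketch: the source does not describe how A-Parser stores the database, so the plain-text file, the function `save_new_results`, and the file name `links.uniq` are all assumptions made for illustration.

```python
import os
import tempfile


def save_new_results(results, db_path):
    """Return results not yet recorded in db_path, and record them there."""
    seen = set()
    if os.path.exists(db_path):
        with open(db_path, encoding="utf-8") as db:
            seen = {line.rstrip("\n") for line in db}
    fresh = []
    with open(db_path, "a", encoding="utf-8") as db:
        for item in results:
            if item not in seen:
                seen.add(item)
                fresh.append(item)
                db.write(item + "\n")
    return fresh


# Two hypothetical tasks that reuse the same named database.
db_path = os.path.join(tempfile.mkdtemp(), "links.uniq")
first_task = save_new_results(["http://a.com/", "http://b.com/"], db_path)
second_task = save_new_results(["http://b.com/", "http://c.com/"], db_path)
```

The second task keeps only `http://c.com/`, because the other link was already recorded by the first task. Where each task writes its output does not matter, since the check is made against the database rather than the result file.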