Result Uniqueness
Uniqueness, deduplication, duplicate removal, repetition removal - all this implies that we do not need repeating results. In A-Parser, there are 2 methods of uniqueness, let's discuss each in detail.
Result Uniqueness by String
This method works after result formation, directly before writing the result to a file each line is checked for uniqueness and only new unique lines are written to the file.
See also: Order of processing requests
You can enable uniqueness by string in Quick task:
Or in Task Editor:
Uniqueness by Any Result
Uniqueness by any result allows for uniqueness directly on the selected result from a specific scraper. You can add this type of uniqueness in the Task Editor by clicking on the tool icon to the right of the scraper and pressing Add uniqueness:
Now you can choose which result to make unique and the type of uniqueness:
The Global switch is used when 2 or more scrapers are selected, it determines whether to make a common uniqueness or separate for each scraper.
Types of Uniqueness
Parameter | Description |
---|---|
String | Uniqueness by string (the entire result string is compared) |
Domain | Uniqueness by domain (the entire domain is compared, for example www.domain.com and domain.com are different domains) |
Top-level domain | Uniqueness by main domain considering regional, commercial, educational, and other domains (for example domain.co.uk and domain2.co.uk are different domains, while sub1.domain.com and sub2.domain.com are the same) |
Second-level domain | Uniqueness by second-level domain (second-level domains are compared, for example www.domain.com, domain.com, and user.subdomain.domain.com are all the same domain) |
Path | Uniqueness by path (parts of the link up to the file are compared, for example http://domain.com/path1/file.php and http://domain.com/path1/file2.php are the same up to the file) |
Without parameters | Uniqueness by link without parameters (links without parameters are compared, for example http://domain.com/file.php?page=1 and http://domain.com/file.php?page=2 are the same links) |
Query Uniqueness
Query uniqueness sends only unique queries that have not been previously scraped in the current task directly to scraping. Main use cases:
- If there are duplicates in the source queries and it is undesirable to scrape them (double work)
- When using the option Scrape to Level, it is necessary to use only unique queries to prevent the proliferation and looping of queries (for example, when using the scraper HTML::LinkExtractor)
In all other cases, unnecessary use of query uniqueness will only slow down the overall performance of the scraper
Preserving Uniqueness Between Tasks
It is possible to save the uniqueness database for use in future tasks, which allows new tasks to save only new unique results (for example, links when scraping SERPs in SE::Google)
To save the uniqueness database, when adding the first task, create a new database name:
For all subsequent tasks, it is necessary to select the previously created database name, thereby saving only new unique results, regardless of whether the results are being written to the same file as in the first task or to a new file.