Result Uniqueness

Uniqueness, deduplication, removal of duplicates, removal of repetitions: all of these terms mean that repeated results are not wanted. A-Parser offers two methods of result uniqueness; each is described in detail below.

Unique results by string

This method works after the result has been formatted (see Basic Formatting Principles). Immediately before the result is written to a file, each line is checked for uniqueness, and only lines that have not been written before are written to the file.

You can enable uniqueness by string in Quick Task or in Task Editor.
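
As an illustration of the idea only, not A-Parser's actual implementation, uniqueness by string can be sketched as a filter that remembers every formatted line it has already passed on. The function name `unique_lines` and the sample data are assumptions made for this example.

```python
from typing import Iterable, Iterator


def unique_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each formatted result line only the first time it is seen."""
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line


# Hypothetical formatted results, as they would look just before being written.
formatted = [
    "http://a.example/\n",
    "http://b.example/\n",
    "http://a.example/\n",  # exact duplicate of the first line
]

# Only lines that pass the filter would reach the result file.
written = list(unique_lines(formatted))
```

Because whole lines are compared, two results that differ by a single character (for example, a trailing slash) count as different.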

Unique any result

Uniqueness by any result applies uniqueness directly to a selected result of a specific scraper (see Result Presentation in the scraper). To add this type of uniqueness, click the tool icon to the right of the scraper in Task Editor and then click Add unique result.

You can now choose which result to make unique and the type of uniqueness to apply.

info

The Global switch applies when two or more scrapers are selected. It determines whether uniqueness is shared across all of them or tracked separately for each scraper.
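
The difference between shared and per-scraper uniqueness can be sketched as follows. This is only a model of the behaviour described above; the class `UniqueFilter`, its `global_scope` flag, and the scraper names `scraper_a` and `scraper_b` are hypothetical.

```python
from collections import defaultdict


class UniqueFilter:
    """Track seen values either in one shared set or in one set per scraper."""

    def __init__(self, global_scope: bool):
        self.global_scope = global_scope
        self._seen = defaultdict(set)

    def is_new(self, scraper: str, value: str) -> bool:
        # With global scope every scraper shares the same bucket.
        bucket = "*" if self.global_scope else scraper
        if value in self._seen[bucket]:
            return False
        self._seen[bucket].add(value)
        return True


results = [
    ("scraper_a", "example.com"),
    ("scraper_b", "example.com"),  # same value, different scraper
    ("scraper_a", "example.com"),  # repeat within the same scraper
]

shared = UniqueFilter(global_scope=True)
kept_shared = [r for r in results if shared.is_new(*r)]

separate = UniqueFilter(global_scope=False)
kept_separate = [r for r in results if separate.is_new(*r)]
```

In this sketch the shared filter keeps the value once in total, while the separate filter keeps it once per scraper.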

Types of uniqueness

| Parameter | Description |
| --- | --- |
| String | Uniqueness by string (the entire result string is compared) |
| Domain | Uniqueness by domain (the entire domain is compared, for example, www.domain.com and domain.com are different domains) |
| Top-level domain | Uniqueness by the main domain, taking into account regional, commercial, educational and other domains (for example, domain.co.uk and domain2.co.uk are different domains, and sub1.domain.com and sub2.domain.com are the same) |
| Second-level domain | Uniqueness by the second-level domain (second-level domains are compared, for example, www.domain.com, domain.com and user.subdomain.domain.com are all the same domain) |
| Path | Uniqueness by path (parts of the link before the file are compared, for example, http://domain.com/path1/file.php and http://domain.com/path1/file2.php are the same parts of the link before the file) |
| Without parameters | Uniqueness by link without parameters (links without parameters are compared, for example, http://domain.com/file.php?page=1 and http://domain.com/file.php?page=2 are the same links) |
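
One way to read the types above is that each one reduces a link to a comparison key, and two results count as duplicates when their keys match. The sketch below follows that reading using Python's standard `urllib.parse`; it is not A-Parser's code. All function names are hypothetical, and `MULTI_LABEL_SUFFIXES` is a deliberately tiny stand-in for a real public suffix list.

```python
from urllib.parse import urlsplit

# Assumption: a tiny sample of multi-label suffixes, for illustration only.
MULTI_LABEL_SUFFIXES = {"co.uk", "com.au"}


def domain_key(url: str) -> str:
    """Domain: the entire host name is compared."""
    return urlsplit(url).hostname or ""


def top_level_key(url: str) -> str:
    """Top-level domain: the registrable domain, aware of suffixes like co.uk."""
    labels = domain_key(url).split(".")
    suffix_len = 2 if ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES else 1
    return ".".join(labels[-(suffix_len + 1):])


def second_level_key(url: str) -> str:
    """Second-level domain: only the last two labels of the host name."""
    return ".".join(domain_key(url).split(".")[-2:])


def path_key(url: str) -> str:
    """Path: the part of the link before the file name."""
    parts = urlsplit(url)
    directory = parts.path.rsplit("/", 1)[0] + "/"
    return f"{parts.scheme}://{parts.netloc}{directory}"


def without_params_key(url: str) -> str:
    """Without parameters: the link with its query string removed."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}{parts.path}"
```

Under these assumptions, the example links from the table produce matching or differing keys exactly as the descriptions state. For instance, `path_key` gives `http://domain.com/path1/` for both `file.php` and `file2.php`.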

Unique queries

With unique queries enabled, only queries that have not already been parsed in the current task are sent for parsing. The main use cases are:

  • If there are duplicates in the source queries and they should not be parsed (double work)
  • When using the Parse to level option, only unique queries should be used, to keep the number of requests from growing and looping (for example, when using the HTML::LinkExtractor scraper)

info

In all other cases, enabling query uniqueness is unnecessary and will only slow down the scraper.
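
The looping problem with Parse to level can be illustrated with a small model. Here each page's extracted links become new queries for the next level. The link graph `LINKS` and the function `crawl` are hypothetical and only mimic the behaviour described above.

```python
from collections import deque

# Hypothetical pages and the links extracted from each of them.
LINKS = {
    "http://site.example/": ["http://site.example/a", "http://site.example/b"],
    "http://site.example/a": ["http://site.example/", "http://site.example/b"],
    "http://site.example/b": ["http://site.example/a"],
}


def crawl(start: str, max_level: int, unique_queries: bool = True) -> list:
    """Return the queries processed, level by level, up to max_level."""
    seen = {start}
    queue = deque([(start, 0)])
    processed = []
    while queue:
        query, level = queue.popleft()
        processed.append(query)
        if level == max_level:
            continue  # do not follow links beyond the requested level
        for link in LINKS.get(query, []):
            if unique_queries:
                if link in seen:
                    continue  # already queued or processed in this task
                seen.add(link)
            queue.append((link, level + 1))
    return processed


with_unique = crawl("http://site.example/", max_level=3, unique_queries=True)
without_unique = crawl("http://site.example/", max_level=3, unique_queries=False)
```

With uniqueness enabled, each of the three pages is processed once. Without it, the same pages are queued again at every level, so the number of requests keeps growing as the level increases.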

Keeping uniqueness across tasks

A-Parser can save the uniqueness database for use in future tasks, so that new tasks save only results that have not been saved before (for example, links when parsing SERPs with SE::Google).

To save the uniqueness database, create a new database name when adding the first task.

For all subsequent tasks, select the previously created database name. Only new unique results will then be saved, regardless of whether they are written to the same file as in the first task or to a new file.
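
The effect of a named, persistent uniqueness database can be sketched with Python's standard `sqlite3` module. This only models the behaviour and says nothing about how A-Parser stores its database. The function `keep_new`, the table name `seen`, and the file name `serp_links.db` are assumptions for this example.

```python
import os
import sqlite3
import tempfile


def keep_new(db_path: str, results: list) -> list:
    """Return only results not yet recorded in the database, and record them."""
    kept = []
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS seen (value TEXT PRIMARY KEY)")
        for value in results:
            cur = conn.execute(
                "INSERT OR IGNORE INTO seen (value) VALUES (?)", (value,)
            )
            if cur.rowcount == 1:  # the row was inserted, so the value is new
                kept.append(value)
        conn.commit()
    finally:
        conn.close()
    return kept


# Both "tasks" point at the same database file, like reusing a database name.
db_path = os.path.join(tempfile.mkdtemp(), "serp_links.db")
first_task = keep_new(db_path, ["http://a.example/", "http://b.example/"])
second_task = keep_new(db_path, ["http://b.example/", "http://c.example/"])
```

The second call keeps only `http://c.example/`, because `http://b.example/` was already recorded by the first. Where the kept results are written afterwards does not affect the check.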