Skip to main content

Result Uniqueness

Uniqueness, deduplication, duplicate removal, repetition removal - all this implies that we do not need repeating results. In A-Parser, there are 2 methods of uniqueness, let's discuss each in detail.

Result Uniqueness by String

This method works after result formation, directly before writing the result to a file each line is checked for uniqueness and only new unique lines are written to the file.

You can enable uniqueness by string in Quick task:

The option of unique results in a Quick task

Or in Task Editor:

The unique string option in the Task Editor

Uniqueness by Any Result

Uniqueness by any result allows for uniqueness directly on the selected result from a specific scraper. You can add this type of uniqueness in the Task Editor by clicking on the tool icon to the right of the scraper and pressing Add uniqueness:

The Add unique result option in the Task Editor

Now you can choose which result to make unique and the type of uniqueness:

The type of uniqueness in the Task Editor
note

The Global switch is used when 2 or more scrapers are selected, it determines whether to make a common uniqueness or separate for each scraper.

Types of Uniqueness

ParameterDescription
StringUniqueness by string (the entire result string is compared)
DomainUniqueness by domain (the entire domain is compared, for example www.domain.com and domain.com are different domains)
Top-level domainUniqueness by main domain considering regional, commercial, educational, and other domains (for example domain.co.uk and domain2.co.uk are different domains, while sub1.domain.com and sub2.domain.com are the same)
Second-level domainUniqueness by second-level domain (second-level domains are compared, for example www.domain.com, domain.com, and user.subdomain.domain.com are all the same domain)
PathUniqueness by path (parts of the link up to the file are compared, for example http://domain.com/path1/file.php and http://domain.com/path1/file2.php are the same up to the file)
Without parametersUniqueness by link without parameters (links without parameters are compared, for example http://domain.com/file.php?page=1 and http://domain.com/file.php?page=2 are the same links)

Query Uniqueness

Query uniqueness sends only unique queries that have not been previously scraped in the current task directly to scraping. Main use cases:

  • If there are duplicates in the source queries and it is undesirable to scrape them (double work)
  • When using the option Scrape to Level, it is necessary to use only unique queries to prevent the proliferation and looping of queries (for example, when using the scraper HTML::LinkExtractorHTML::LinkExtractor)
note

In all other cases, unnecessary use of query uniqueness will only slow down the overall performance of the scraper

Preserving Uniqueness Between Tasks

It is possible to save the uniqueness database for use in future tasks, which allows new tasks to save only new unique results (for example, links when scraping SERPs in SE::GoogleSE::Google)

To save the uniqueness database, when adding the first task, create a new database name:

Saving the uniqueness database in the Task Editor

For all subsequent tasks, it is necessary to select the previously created database name, thereby saving only new unique results, regardless of whether the results are being written to the same file as in the first task or to a new file.