Results deduplication

Deduplication (also called unique results, removing duplicates, or removing repeats) means discarding repeated results. A-Parser offers 2 deduplication methods; each is described in detail below.

Deduplication by string

This method works after result formatting; immediately before writing the result to a file, each line is checked for uniqueness, and only new unique lines are written to the file.
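The idea can be illustrated with a minimal Python sketch. This is not A-Parser's internal code; the function name `write_unique_lines` and the in-memory buffer are assumptions made for the example. It only shows the principle: a line is compared as a whole, after formatting, right before it is written.

```python
import io

def write_unique_lines(formatted_lines, out_file, seen=None):
    """Write each already-formatted line only if it has not been written before."""
    seen = set() if seen is None else seen
    written = 0
    for line in formatted_lines:
        if line in seen:
            continue  # exact duplicate of an earlier line, skip it
        seen.add(line)
        out_file.write(line + "\n")
        written += 1
    return written

# Hypothetical formatted results; the third line repeats the first.
buffer = io.StringIO()
count = write_unique_lines(
    ["http://a.com/", "http://b.com/", "http://a.com/"], buffer
)
```

Because the comparison is made on the final formatted line, two results that differ in any character (for example, a trailing slash) count as different lines.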

You can enable deduplication by string in the Quick Task:

The option of unique results in a Quick task

Or in the Task Editor:

The deduplication by string option in the Task Editor

Deduplication by any result

Deduplication by any result allows you to perform deduplication directly on a selected result from a specific parser. You can add this deduplication type in the Task Editor by clicking on the tool icon to the right of the parser and clicking Add unique result:

The Add unique result option in the Task Editor

Now you can choose which result to perform deduplication on and the deduplication type:

The deduplication type in the Task Editor
Note: The Global toggle applies when 2 or more parsers are selected; it determines whether deduplication is shared across all selected parsers or performed separately for each parser.
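The difference between shared and per-parser deduplication can be sketched as follows. This is only an illustration under assumptions: the class `UniqueFilter`, its `global_scope` flag, and the parser names `parser_a` and `parser_b` are hypothetical and do not correspond to A-Parser's API.

```python
from collections import defaultdict

class UniqueFilter:
    """Tracks seen values either in one shared set or in one set per parser."""

    def __init__(self, global_scope):
        self.global_scope = global_scope
        self._seen = defaultdict(set)

    def is_new(self, parser_name, value):
        # With a global scope every parser shares the same "seen" set.
        scope = "*" if self.global_scope else parser_name
        if value in self._seen[scope]:
            return False
        self._seen[scope].add(value)
        return True

# The same link is returned by two hypothetical parsers.
results = [("parser_a", "http://a.com/"), ("parser_b", "http://a.com/")]

shared = UniqueFilter(global_scope=True)
kept_shared = [r for r in results if shared.is_new(*r)]

separate = UniqueFilter(global_scope=False)
kept_separate = [r for r in results if separate.is_new(*r)]
```

With the shared scope only the first occurrence is kept; with separate scopes each parser keeps its own copy of the link.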

Deduplication types

| Parameter | Description |
|---|---|
| String | Deduplication by string (the entire result string is compared) |
| Domain | Deduplication by domain (the entire domain is compared, e.g., www.domain.com and domain.com are different domains) |
| Top-level domain | Deduplication by main domain considering regional, commercial, educational, and other domains (e.g., domain.co.uk and domain2.co.uk are different domains, while sub1.domain.com and sub2.domain.com are the same) |
| 2nd-level domain | Deduplication by 2nd-level domain (second-level domains are compared, e.g., www.domain.com, domain.com, and user.subdomain.domain.com are all the same domain) |
| Path | Deduplication by path (link parts up to the file are compared, e.g., http://domain.com/path1/file.php and http://domain.com/path1/file2.php have the same link parts up to the file) |
| Without parameters | Deduplication by link without parameters (links without parameters are compared, e.g., http://domain.com/file.php?page=1 and http://domain.com/file.php?page=2 are the same links) |
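One way to think about these types is that each one reduces a result to a comparison key, and two results are duplicates when their keys match. The Python sketch below is an interpretation of the descriptions above, not A-Parser's implementation. All function names are hypothetical, the inputs are assumed to be full URLs with a scheme, and `MULTI_PART_SUFFIXES` is a tiny stand-in for a real public suffix list.

```python
from urllib.parse import urlsplit

# Assumption: a tiny stand-in for a real public suffix list.
MULTI_PART_SUFFIXES = {"co.uk", "com.au", "org.uk"}

def host(url):
    return (urlsplit(url).hostname or "").lower()

def key_string(value):
    return value  # the entire result string is compared

def key_domain(url):
    return host(url)  # www.domain.com and domain.com stay different

def key_top_level_domain(url):
    labels = host(url).split(".")
    # Keep one extra label when the suffix is a regional one such as co.uk.
    keep = 3 if ".".join(labels[-2:]) in MULTI_PART_SUFFIXES else 2
    return ".".join(labels[-keep:])

def key_second_level_domain(url):
    return ".".join(host(url).split(".")[-2:])  # last two labels only

def key_path(url):
    parts = urlsplit(url)
    directory = parts.path.rsplit("/", 1)[0] + "/"  # drop the file name
    return f"{parts.scheme}://{parts.netloc}{directory}"

def key_without_parameters(url):
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}{parts.path}"  # drop ?query

def deduplicate(values, key):
    """Keep the first value for each distinct key, preserving order."""
    seen, unique = set(), []
    for value in values:
        k = key(value)
        if k not in seen:
            seen.add(k)
            unique.append(value)
    return unique

links = ["http://domain.com/file.php?page=1", "http://domain.com/file.php?page=2"]
unique_links = deduplicate(links, key_without_parameters)
```

In this sketch the two example links collapse into one under `key_without_parameters`, while `key_string` would keep both.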

Query deduplication

Query deduplication passes a query to the parser only if it has not already been processed in the current task. Main use cases:

  • If there are duplicates in the source queries and it is undesirable to parse them (double work)
  • When using the Parse to level option, it is necessary to use only unique queries to prevent the query list from growing uncontrollably and looping (for example, when using the HTML::LinkExtractor parser)
Note: In all other cases, query deduplication is unnecessary and only slows down overall parser performance.
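The looping problem with Parse to level can be demonstrated with a small simulation. This is a sketch under assumptions, not A-Parser code: the `LINKS` graph, the `site.test` addresses, and the function `parse_to_level` are invented for the example, and a dictionary lookup stands in for a real link-extracting parser.

```python
from collections import deque

# Hypothetical site where pages link back to each other.
LINKS = {
    "http://site.test/": ["http://site.test/a", "http://site.test/b"],
    "http://site.test/a": ["http://site.test/", "http://site.test/b"],
    "http://site.test/b": ["http://site.test/a"],
}

def parse_to_level(start_queries, max_level, unique_queries=True):
    """Follow extracted links up to max_level, optionally skipping repeated queries."""
    queue = deque((query, 0) for query in start_queries)
    seen = set()
    processed = []
    while queue:
        query, level = queue.popleft()
        if unique_queries:
            if query in seen:
                continue  # already parsed in this task
            seen.add(query)
        processed.append(query)
        if level < max_level:
            # Every link found on the page becomes a new query one level deeper.
            for link in LINKS.get(query, []):
                queue.append((link, level + 1))
    return processed

with_dedup = parse_to_level(["http://site.test/"], max_level=3)
without_dedup = parse_to_level(["http://site.test/"], max_level=3, unique_queries=False)
```

With deduplication each of the three pages is processed once. Without it, the same pages are queued again at every level, so the amount of work grows with the depth even though no new pages are discovered.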

Saving deduplication state across tasks

It is possible to save the deduplication database for use in future tasks, so that new tasks save only results that have not been seen before (for example, links when parsing SERPs in SE::Google).

To save the deduplication database, you must create a new database name when adding the first task:

Saving the uniqueness database in the Task Editor

For all subsequent tasks, you must select the previously created database name; this way, only new unique results will be saved, regardless of whether the results are written to the same file as in the first task or to a new file.
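Conceptually, a named deduplication database behaves like a persistent set of already-seen values that outlives a single task. The sketch below imitates that behavior with SQLite from Python's standard library. It is an assumption-based illustration only: the storage format A-Parser actually uses is not described here, and the function `save_new_results`, the table name `seen`, and the file name `serp_links.db` are hypothetical.

```python
import os
import sqlite3
import tempfile

def save_new_results(db_path, results):
    """Return only the results not yet recorded in the database at db_path."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS seen (value TEXT PRIMARY KEY)")
        fresh = []
        for value in results:
            cur = conn.execute(
                "INSERT OR IGNORE INTO seen (value) VALUES (?)", (value,)
            )
            if cur.rowcount == 1:  # the row was inserted, so the value is new
                fresh.append(value)
        conn.commit()
        return fresh
    finally:
        conn.close()

# Two "tasks" that share the same hypothetical database file.
db_path = os.path.join(tempfile.mkdtemp(), "serp_links.db")
first_task = save_new_results(db_path, ["http://a.com/", "http://b.com/"])
second_task = save_new_results(db_path, ["http://b.com/", "http://c.com/"])
```

The second call returns only the link that the first call had not stored, which mirrors how a later task that selects the same database name saves only new unique results, regardless of the output file.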