Query Formatting

Query formatting lets you add substitutions and shape each query into the desired form using templates. It is applied to every query.

Query formats

[Screenshot: example of query formats]

where:

  1. Query format for scraper 1
  2. Query format for scraper 2
  3. Common query format

There are 2 ways to specify a template:

  • Common query format - processed first and supports substitutions
  • Query format for each scraper - lets you set a specific format for individual scrapers

Consider the example in the screenshot. Assume the queries come from a file with a list of domains like this:

google.com
a-parser.com
yandex.ru

The common query format is set as

http://$query

The string http:// is prepended to each source query (domain), so google.com becomes http://google.com

The query format for scraper 1 is left unchanged, so scraper 1 will parse the query http://google.com

The query format for scraper 2 looks like this:

site:$query

The query for this scraper will be transformed: http://google.com -> site:http://google.com
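
The two-stage formatting described above can be sketched in Python. This is only an illustrative model, not A-Parser's implementation: the helper names `apply_format` and `format_for_scrapers` are hypothetical, and it assumes that an "unchanged" scraper format is equivalent to a bare $query.

```python
from string import Template


def apply_format(fmt: str, query: str) -> str:
    """Replace $query in a format string (hypothetical helper)."""
    return Template(fmt).substitute(query=query)


# Formats taken from the example; "$query" for scraper 1 is an assumption
# standing in for "left unchanged".
COMMON_FORMAT = "http://$query"
SCRAPER_FORMATS = {
    "scraper 1": "$query",
    "scraper 2": "site:$query",
}


def format_for_scrapers(original: str) -> dict:
    """Apply the common format first, then each scraper's own format."""
    common = apply_format(COMMON_FORMAT, original)
    return {name: apply_format(fmt, common)
            for name, fmt in SCRAPER_FORMATS.items()}


if __name__ == "__main__":
    for name, query in format_for_scrapers("google.com").items():
        print(name, "->", query)
```

The point of the sketch is the ordering: each scraper format receives the output of the common format, not the original query.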

Query templates

The query format fully supports Template Toolkit. The following variables are available:

  • $query - query after formatting through the common query format
  • $query.num - query number
  • $query.lvl - query nesting level when using Parse to level or Parse all results options
  • $query.orig - original query before formatting
  • $query.first - first query when using Parse to level or Parse all results options
  • $query.prev - the query from the previous level; works for HTML::LinkExtractor, tools.query.add and JS parsers (this.query.add)
  • All variables created through Query Builder

Substitution macros

Common query format supports the following macros:

| Macro | Description | Examples |
| --- | --- | --- |
| {az:START:END} | Substitutes an alphanumeric sequence. START is the beginning of the sequence and END is the end. END must be at least as long as START. The characters at the end of END must come after (in alphabetical order) the characters at the beginning of START. Any UTF-8 character sequences can be used. | {az:a:z} - all characters from a to z (a, b, c, ..., y, z); {az:aaa:zzz} - all sequences from aaa to zzz (aaa, aab, aac, ..., zzy, zzz); {az:a:zz} - all sequences from a to zz (a, b, c, ... aa, ab, ..., zy, zz); {az:00:99} - all numbers from 00 to 99 (00, 01, 02, ..., 98, 99); {az:а:яяя} - all Cyrillic sequences from а to яяя (а, б, ... аа, аб, ... яяю, яяя) |
| {each:WORD1,WORD2,...} | Substitutes the specified words WORD1, WORD2, etc. The number of words is not limited. | {each:green,blue,red,black} - the words green, blue, red, black; {each:,buy,sell} - an empty word, then buy and sell |
| {subs:NAME} | Substitutes additional words from files in the queries/subs/ folder. NAME is the file name without the .txt extension. | {subs:zones} - all lines from the file queries/subs/zones.txt |
| {num:START:END} | Iterates through the numbers in the specified range. START is the beginning of the interval and END is the end. Fractional numbers are supported. | {num:1:1000} - all numbers from 1 to 1000 (1, 2, 3, ..., 999, 1000) |
| {num:START:END:STEP} | Iterates through the numbers in the specified range with the specified step. START is the beginning of the interval, END is the end and STEP is the step. Fractional numbers are supported. | {num:0:1000:10} - all numbers from 0 to 1000 with a step of 10 (0, 10, 20, ..., 990, 1000) |
| {num:END:START} | Iterates through the numbers in the specified range in reverse order. END is the end of the interval and START is the beginning. Fractional numbers are supported. | {num:1000:1} - all numbers from 1000 down to 1 (1000, 999, 998, ..., 2, 1) |
| {num:END:START:STEP} | Iterates through the numbers in the specified range in reverse order with the specified step. END is the end of the interval, START is the beginning and STEP is the step. Fractional numbers are supported. | {num:1000:1:10} - all numbers from 1000 down to 1 with a step of 10 (1000, 990, 980, ..., 10, 1) |
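
To make the expansion rules for the macros listed above more concrete, here is a rough Python model of {num}, {each} and {az}. It is a sketch under stated assumptions, not the tool's actual code: the function names are hypothetical, `expand_az` assumes the alphabet runs from the first character of START to the last character of END by code point, and `expand_num` stops at the last value that does not pass END (how the real macro treats a step that does not land exactly on END is not modelled).

```python
from decimal import Decimal
from itertools import product


def expand_num(start: str, end: str, step: str = "1") -> list:
    """Model of {num:START:END[:STEP]}; counts down when START > END."""
    first, last, inc = Decimal(start), Decimal(end), abs(Decimal(step))
    if inc == 0:
        raise ValueError("step must be non-zero")
    values, current = [], first
    if first <= last:
        while current <= last:
            values.append(str(current))
            current += inc
    else:
        while current >= last:
            values.append(str(current))
            current -= inc
    return values


def expand_each(words: str) -> list:
    """Model of {each:WORD1,WORD2,...}; empty words are kept."""
    return words.split(",")


def expand_az(start: str, end: str) -> list:
    """Simplified model of {az:START:END} (assumed alphabet, see lead-in)."""
    alphabet = [chr(c) for c in range(ord(start[0]), ord(end[-1]) + 1)]
    result = []
    # Every length from len(START) to len(END), in lexicographic order.
    for length in range(len(start), len(end) + 1):
        result.extend("".join(chars)
                      for chars in product(alphabet, repeat=length))
    return result


if __name__ == "__main__":
    print(expand_num("0", "1000", "10")[:3])
    print(expand_each(",buy,sell"))
    print(len(expand_az("a", "zz")))
```

Decimal is used instead of float so that fractional steps do not accumulate rounding errors.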

⏩ Video: Substitution macros

This video covers:

  • the {num} macro, using page navigation and coordinate iteration in a Google Maps scraper as examples
  • the {az} macro, using parsing with inurl: as an example, to increase the number of queries and, consequently, results
  • the {each} macro, using suggestion parsing as an example, to generate word combinations

Combining substitution macros

Substitution macros can be combined. A complex example:

$query site:{subs:zones} {az:aa:zz}

Suppose one of the queries is viagra and the queries/subs/zones.txt file contains the zones com, net and org. The following set of combinations will then be parsed:

viagra site:com ab
...
viagra site:net jj
...
viagra site:org zz

The total number of queries is the product of the number of options for each part: 1 query (viagra) x 3 zones ({subs:zones}) x 676 character variations ({az:aa:zz}) = 2028 queries
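
The count can be checked with a short Python sketch that builds the same Cartesian product. The zone list is written inline instead of being read from queries/subs/zones.txt, and the order in which the real tool emits the combinations is an assumption.

```python
from itertools import product
from string import ascii_lowercase

queries = ["viagra"]
zones = ["com", "net", "org"]  # stands in for queries/subs/zones.txt
# Equivalent of {az:aa:zz}: every two-letter sequence from aa to zz.
pairs = ["".join(p) for p in product(ascii_lowercase, repeat=2)]

# One formatted query per (query, zone, pair) combination.
combined = [f"{q} site:{z} {s}" for q, z, s in product(queries, zones, pairs)]

if __name__ == "__main__":
    print(len(pairs), len(combined))
    print(combined[0])
    print(combined[-1])
```

Because the total grows multiplicatively, adding one more macro with N options multiplies the number of queries by N.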