Settings

A-Parser contains the following settings groups:

  • General settings - basic program settings: language, password, update parameters, number of active tasks
  • Thread settings - thread count and deduplication method settings for tasks
  • Parser settings - ability to configure each individual parser
  • Proxy check settings - thread count and all settings for the proxy checker
  • Additional settings - optional settings for advanced users
  • Task presets - saving tasks for future use

All settings (except general and additional) are saved in so-called presets - sets of pre-saved settings, for example:

  • Different setting presets for the SE::Google parser - one for parsing links with a maximum depth of 10 pages and 100 results per page, another for evaluating competition for a query, with a parsing depth of 1 page and 10 results
  • Different setting presets for the proxy checker - separate ones for HTTP and SOCKS proxies

For all settings there is a default preset (default), which cannot be changed; all changes must be saved in presets with new names.

General settings

Program general settings

| Parameter name | Default value | Description |
| --- | --- | --- |
| Password | No password | Set a password to log in to A-Parser |
| Language | English | Interface language |
| News and tips | English | News and tips language |
| Enable tips | | Determines whether to display tips |
| Check for updates | | Determines whether to display information about the availability of a new update in the Status bar |
| Save window size | | Determines whether to save the window size |
| Update channel | Stable | Select the update channel (Stable, Beta, Alpha) |
| Tasks per page | 5 | Number of tasks per page in the Task queue |
| Maximum active tasks | 1 | Maximum number of active tasks |
| Total thread limit | 10000 | Total thread limit in A-Parser; a task will not start if the total thread limit is less than the number of threads in the task |
| Dynamic thread limit | | Determines whether to use the Dynamic thread limit |
| CPU cores (task processing) | 2 | Support for processing tasks on different processor cores (Enterprise license only); described in more detail below |
| CPU cores (result processing) | 4 | Multiple cores are used only for filtering, the Result constructor, and Parse custom result (all license types) |
| Memory Saver | Best speed | Determines how much memory the parser can use (Best speed / Medium memory usage / Save max memory) |

CPU cores(task processing)

Support for processing tasks on different processor cores; this feature is available only with an Enterprise license.

This option speeds up (by several times) the processing of multiple tasks in the queue (Settings -> Maximum active tasks), but it does not speed up the execution of a single task.

Intelligent distribution of tasks across worker cores, based on the CPU load of each process, is also implemented. The number of processor cores used is set in the settings: 2 by default, 32 maximum.

As with threads, it is best to choose the number of cores experimentally; reasonable values are 2-3 cores for 4-core processors, 4-6 for 8-core ones, and so on. Keep in mind that with a large number of cores under high load, the main control process (aparser/aparser.exe) can itself reach 100% load, at which point further increasing the number of task-processing processes will only cause a general slowdown or unstable operation. Also note that each task-processing process can create additional load of up to 300% (i.e. up to 100% load on 3 cores simultaneously); this is related to multithreaded garbage collection in the JavaScript V8 engine.

Thread settings

A-Parser's work is based on the principle of multithreaded data processing. The scraper performs tasks in separate threads, the number of which can be flexibly varied depending on the server configuration.

How threads work

Let's understand what threads are in practice. Suppose you need to compile a report for three months.

Option 1
You can compile the report for the first month, then for the second, and then for the third. This is an example of single-threaded work: tasks are solved sequentially.

Option 2
Hire three accountants, each of whom compiles the report for one month, and then, after receiving the results from all three, assemble the overall report. This is an example of multithreaded work: tasks are solved simultaneously.

As can be seen from these examples, multithreaded work allows you to complete a task faster, but at the same time requires more resources (we need 3 accountants instead of 1). Multithreading works similarly in A-Parser. Suppose you need to scrape information from several links:

  • with one thread, the application will scrape each site in turn
  • when working in several threads, each thread will process its own link, and after completing it, it will move on to the next one in the list

Thus, in the second option, the entire task will be completed much faster, but it requires more server resources, so it is recommended to comply with the System Requirements.
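
The same pattern can be reduced to a minimal TypeScript sketch (purely illustrative, not A-Parser code): several workers share one list of links, and each worker takes the next unprocessed link as soon as it finishes the previous one.

```typescript
// Illustrative worker-pool sketch: `threads` workers share one URL list.
async function scrapeAll(urls: string[], threads: number): Promise<string[]> {
  const results: string[] = new Array(urls.length);
  let next = 0; // index of the next unprocessed URL

  // Each worker takes the next URL, processes it, and moves on
  // to the next one in the list until the list is exhausted.
  const worker = async (): Promise<void> => {
    while (next < urls.length) {
      const i = next++;
      const response = await fetch(urls[i]);
      results[i] = await response.text();
    }
  };

  // threads = 1 reproduces sequential, single-threaded processing;
  // larger values run several requests simultaneously at the cost
  // of more server resources.
  await Promise.all(Array.from({ length: threads }, worker));
  return results;
}
```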

Threads configuration

Threads configuration in A-Parser is carried out separately for each task, depending on the parameters required for its execution. By default, 2 thread configs are available: for 20 threads (default) and for 100 threads (100 Threads). To open the settings of the selected config, click the pencil icon (the edit threads config button), after which its settings will open.

edit threads config button in the task editor

You can also go to the threads settings through the menu item: Settings -> Threads settings.

Here we can:

  • create a new config with our own settings and save it under our own name (Add new button)
  • make changes to an existing config by selecting it from the drop-down list (Save button)

threads config settings

Threads count

This parameter sets the number of threads in which a task launched with this config will run. The number of threads can be anything, but you need to take into account the capabilities of your server, as well as any limit imposed by your proxy plan. For example, when using our proxies you cannot specify more threads than your plan allows.

note

It is also important to remember that the total number of threads in the scraper equals the sum across all running tasks and all enabled proxy checkers with proxy checking turned on. For example, if one task is running with 20 threads, two tasks with 100 threads each, and one proxy checker with proxy checking enabled in 15 threads, the scraper uses 20+100+100+15=235 threads in total. If the proxy plan is sized for 200 threads, there will be many unsuccessful requests. To avoid them, reduce the number of threads used: for example, disable proxy checking (if it is not needed, this saves 15 threads) and reduce the number of threads in one of the tasks by another 20. That is, for one of the running tasks create a config for 80 threads and leave the rest as they are.
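
The same arithmetic as a tiny, purely illustrative TypeScript sketch (the numbers are taken from the note above):

```typescript
// Total thread usage = sum over running tasks + proxy checkers with checking on.
const taskThreads = [20, 100, 100]; // three running tasks
const checkerThreads = [15];        // one proxy checker with checking enabled
const total = [...taskThreads, ...checkerThreads].reduce((sum, n) => sum + n, 0);
console.log(total); // 235 - exceeds a 200-thread proxy plan, so requests will fail
```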

Proxy Checkers

This parameter allows you to select a proxy checker with certain settings. You can either select All, which means all running proxy checkers will be used, or select only those that should be used in the task (multiple selection is available).

note

This setting allows you to run a task with only the necessary proxy checkers. The proxy checker setup process is described here.

Maximum threads per proxy

Here you set the maximum number of threads that can simultaneously use the same proxy. This allows configurations such as 1 thread = 1 proxy.

note

By default, this parameter is set to 0, which disables the function. In most cases this is enough, but if you need to limit the load on each proxy, it makes sense to change the value.
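
As a rough illustration of how such a cap might work (a hypothetical sketch, not A-Parser internals):

```typescript
// Pick a proxy so that no more than `maxThreadsPerProxy` threads
// use the same proxy at once (0 = no limit, handled by the caller).
const inUse = new Map<string, number>(); // proxy -> threads currently using it

function acquireProxy(proxies: string[], maxThreadsPerProxy: number): string | undefined {
  // maxThreadsPerProxy = 1 gives the "1 thread = 1 proxy" mode.
  const proxy = proxies.find(p => (inUse.get(p) ?? 0) < maxThreadsPerProxy);
  if (proxy !== undefined) inUse.set(proxy, (inUse.get(proxy) ?? 0) + 1);
  return proxy; // undefined: all proxies are at the limit, the thread must wait
}

function releaseProxy(proxy: string): void {
  inUse.set(proxy, Math.max(0, (inUse.get(proxy) ?? 0) - 1));
}
```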

Global proxy ban

All tasks launched with this option share a common proxy ban list. The key feature of this parameter is that the list of banned proxies is shared per parser across all running tasks.

For example, a proxy banned in SE::Google in task 1 will also be banned for SE::Google in task 2, but it can work freely in SE::Yandex in both tasks.
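
A minimal sketch of this behavior (hypothetical, for illustration only), with the ban list keyed by parser + proxy so that a ban in one parser does not affect another:

```typescript
// Shared ban list, scoped per parser: a key like "SE::Google|1.2.3.4:8080"
// is banned for every task using SE::Google, but the same proxy remains
// usable under "SE::Yandex|1.2.3.4:8080".
const bans = new Map<string, number>(); // key -> ban expiry (ms timestamp)

function banProxy(parser: string, proxy: string, seconds: number): void {
  bans.set(`${parser}|${proxy}`, Date.now() + seconds * 1000);
}

function isBanned(parser: string, proxy: string): boolean {
  const until = bans.get(`${parser}|${proxy}`);
  return until !== undefined && until > Date.now();
}
```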

Max connections per host

This parameter specifies the maximum number of connections per host and is designed to reduce the load on a site while scraping information from it. Essentially, it makes it possible to control the number of requests to each specific domain at any given moment. The parameter is enabled per task, but if you run multiple tasks simultaneously with the same thread config, the limit is counted across all of those tasks.

By default, this parameter is set to 0, i.e. disabled.
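
A sketch of the idea (illustrative only, assuming a global fetch as in Node 18+):

```typescript
// Limit simultaneous connections per host: requests to different hosts
// run freely, requests to the same host wait for a free slot.
const activePerHost = new Map<string, number>();

async function fetchWithHostLimit(url: string, maxPerHost: number): Promise<Response> {
  const host = new URL(url).hostname;
  // Poll until fewer than maxPerHost connections to this host are active.
  while ((activePerHost.get(host) ?? 0) >= maxPerHost) {
    await new Promise(resolve => setTimeout(resolve, 50));
  }
  activePerHost.set(host, (activePerHost.get(host) ?? 0) + 1);
  try {
    return await fetch(url);
  } finally {
    activePerHost.set(host, (activePerHost.get(host) ?? 0) - 1);
  }
}
```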

Reuse proxy between retries

This setting disables the proxy uniqueness check for each attempt, as well as proxy banning, which means a single proxy can be used for all attempts.

It is recommended to enable this parameter, for example, when you plan to use a single proxy whose output IP changes with each connection.
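
The difference between the two modes can be sketched like this (hypothetical code; fetchViaProxy is an assumed transport helper, not a real A-Parser function):

```typescript
// With reuseProxy = true the same proxy is kept for every attempt,
// which suits a single rotating proxy whose output IP changes per connection.
declare function fetchViaProxy(url: string, proxy: string): Promise<Response>;

async function requestWithRetries(
  url: string, proxies: string[], retries: number, reuseProxy: boolean
): Promise<Response | undefined> {
  let proxy = proxies[0];
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      return await fetchViaProxy(url, proxy);
    } catch {
      // Default behavior: switch to a different proxy for the next attempt.
      if (!reuseProxy) proxy = proxies[attempt % proxies.length];
    }
  }
  return undefined; // all attempts failed, the query is skipped
}
```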

Recommendations

This article covers all the settings that control threads. Note that when configuring a thread config, it is not necessary to set every parameter described here; it is enough to set only those that ensure a correct result. Usually only Threads count needs to be changed, and the rest of the settings can be left at their defaults.

Parser settings

Each parser has a multitude of settings and allows you to save different sets of settings in presets. The preset system allows you to use the same parser with different settings depending on the situation. Let's look at two example presets for the SE::Google parser.

Preset 1: "Collecting the maximum number of links"

  • Pages count: 10
  • Links per page: 100

Thus, the parser will collect the maximum number of links, going through all the pages of the search results.

Preset 2: "Parsing competition by query"

  • Pages count: 1
  • Links per page: 10
  • Result format: $query: $totalcount\n

In this case, we get the number of search results for the query (query competition), and for greater speed, it is enough for us to parse only the first page with a minimum number of links.
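
With this result format, each line in the output pairs a query with its total results count; for example (the numbers here are made up for illustration):

```
a-parser: 1230000
web scraping: 85600000
```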

Creating presets

preset creation screenshot

Creating a preset starts with selecting the parser/parsers and defining the result that needs to be obtained.

Next, you need to understand what the input data for the selected parser will be. In the screenshot above, the SE::Google parser is selected, and its input data is arbitrary query strings, just as if you were searching for something in a browser. You can select a query file or enter queries in a text field.

Now you need to override the settings (select options) for the parser and add deduplication if needed. Use the query builder if you need to process the queries, or the results builder if you need to process the results in some way.

Next, check the result file name and, if necessary, change it at your discretion.

The last step is to select additional options, especially the Log option, which is very useful if you want to know the reason for a parsing error.

After all this, you need to save the preset and add it to the task queue.

Overriding settings

Override preset is a quick override of the parser's settings; this option can be added directly in the Task Editor. Several parameters can be added in one click. The list of settings shows the default values; if an option is shown in bold, it has already been overridden in the preset.

where to find options for the selected parser and how to add options

In this example, two options were overridden: Pages count was changed from the default 10 to 5, and Links per page was set to 100.

A task can use an unlimited number of Override preset options, but if there are many changes, it is more convenient to create a new preset and save all the changes in it.

You can also easily save overrides using the Save Overrides function. They will be saved as a separate preset for the selected scraper.

how to save overridden options

After that, it is enough to simply select this saved preset from the list and use it.

selecting a saved preset for a specific scraper

Common settings for all scrapers

Each scraper has its own set of settings; information on each scraper's settings can be found in the corresponding section.

The table below lists the settings common to all scrapers.

| Parameter name | Default value | Description |
| --- | --- | --- |
| Request retries | 10 | Number of attempts for each request; if the request does not succeed within this number of attempts, it is considered unsuccessful and skipped |
| Use proxy | | Determines whether to use a proxy |
| Query format | $query | Query format |
| Result format | Each scraper has its own value | Output result format |
| Proxy ban time | Each scraper has its own value | Proxy ban time in seconds |
| Request timeout | 60 | Maximum request waiting time in seconds |
| Request delay | 0 | Delay between requests in seconds; a random range can be set, for example 10,30 for a delay of 10 to 30 seconds |
| Proxy Checker | All | Which proxy checkers to use (all, or a list of specific ones) |

Common settings for all scrapers working over the HTTP protocol

| Parameter name | Default value | Description |
| --- | --- | --- |
| Max body size | Each scraper has its own value | Maximum size of the search results page in bytes |
| Use gzip | | Determines whether to use compression of transmitted traffic |
| Extra query string | - | Allows you to specify additional parameters in the query string |

note

Default settings for each scraper may differ. They are stored in the default preset in the settings of each scraper.

Proxy checker settings

Proxy settings

Additional settings

where to find additional settings

  • Line break - allows you to choose between Unix and Windows line endings when saving results to a file
  • Number format - specifies how to display numbers in the A-Parser interface
  • Template Toolkit macros