Settings
A-Parser contains the following settings groups:
- General settings - basic program settings: language, password, update parameters, number of active tasks
- Thread settings - thread and deduplication settings for tasks
- Parser settings - ability to configure each individual parser
- Proxy check settings - thread count and all settings for the proxy checker
- Additional settings - optional settings for advanced users
- Task presets - saving tasks for future use
All settings (except General and Additional) are saved in so-called presets - sets of pre-saved settings. For example:
- Different presets for the SE::Google parser: one for parsing links to a maximum depth of 10 pages with 100 results each, another for evaluating competition for a query, with a parsing depth of 1 page with 10 results
- Different presets for the proxy checker: separate ones for HTTP and SOCKS proxies
For all settings there is a default preset (default), which cannot be changed; all changes must be saved as presets with new names.
General settings
Parameter name | Default value | Description |
---|---|---|
Password | No password | Set a password to log in to A-Parser |
Language | English | Interface language |
News and tips | English | News and tips language |
Enable tips | ☑ | Determines whether to display tips |
Check for updates | ☑ | Determines whether to display information about the availability of a new update in the Status bar |
Save window size | ☐ | Determines whether to save the window size |
Update channel | Stable | Select the update channel (Stable, Beta, Alpha) |
Tasks per page | 5 | Number of tasks per page in the Task queue |
Maximum active tasks | 1 | Maximum number of active tasks |
Total thread limit | 10000 | Total thread limit in A-Parser. The task will not start if the total thread limit is less than the number of threads in the task |
Dynamic thread limit | ☐ | Determines whether to use Dynamic thread limit |
CPU cores (task processing) | 2 | Support for processing tasks on different processor cores (Enterprise license only). Described in more detail below |
CPU cores (result processing) | 4 | Multiple cores are used only for filtering, the Result constructor, and Parse custom result (all license types) |
Memory Saver | Best speed | Determines how much memory the parser may use (Best speed / Medium memory usage / Save max memory) |
CPU cores (task processing)
Support for processing tasks on different processor cores is available only with the Enterprise license.
This option speeds up (by several times) the processing of several tasks in the queue (Settings -> Maximum active tasks), but it does not speed up the execution of a single task.
Tasks are also intelligently distributed across worker cores based on the CPU load of each process. The number of processor cores used is set in the settings; the default is 2, the maximum is 32.
As with threads, it is best to choose the number of cores experimentally: reasonable values are 2-3 cores for 4-core processors, 4-6 for 8-core ones, and so on. Keep in mind that with a large number of cores under high load, the main control process (aparser/aparser.exe) may reach 100% load, at which point further increasing the number of task-processing processes will only cause a general slowdown or unstable operation. Also note that each task-processing process can create additional load of up to 300% (i.e. up to 100% load on 3 cores simultaneously); this is related to multithreaded garbage collection in the JavaScript V8 engine.
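As a rough illustration of the sizing guideline above, here is a minimal sketch. It is not part of A-Parser; the function name and the halving rule are assumptions derived from the "2-3 cores for 4-core, 4-6 for 8-core" advice:

```typescript
import * as os from "os";

// Hypothetical helper: suggest a task-processing core count following the
// rule of thumb above (roughly half the cores, at least 2, at most 32).
function suggestTaskCores(totalCores: number): number {
  const half = Math.floor(totalCores / 2);
  return Math.min(Math.max(2, half), 32); // the setting caps at 32
}

// os.cpus() reports logical cores, which is a reasonable approximation here.
console.log(suggestTaskCores(os.cpus().length)); // e.g. 4 on an 8-core CPU
```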
Thread settings
A-Parser is built on the principle of multithreaded data processing: the scraper performs tasks in separate threads, and their number can be flexibly varied depending on the server configuration.
Description of thread operation
Let's understand what threads are in practice. Suppose you need to compile a report for three months.
Option 1
You can compile a report for the first month, then for the second, and then for the third. This is an example of single-threaded work. Tasks are solved sequentially.
Option 2
Hire three accountants, each of whom compiles the report for one month, and then, after receiving the results from all three, put together the general report. This is an example of multithreaded work. Tasks are solved simultaneously.
As can be seen from these examples, multithreaded work allows you to complete a task faster, but at the same time requires more resources (we need 3 accountants instead of 1). Multithreading works similarly in A-Parser. Suppose you need to scrape information from several links:
- with one thread, the application will scrape each site in turn
- when working in several threads, each thread will process its own link, and after completing it, it will move on to the next one in the list
Thus, in the second option, the entire task will be completed much faster, but it requires more server resources, so it is recommended to comply with the System Requirements.
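To make the two options concrete, here is a minimal sketch (not A-Parser code) contrasting sequential processing with a pool of N concurrent "threads", where each worker takes the next link from a shared queue. It assumes a runtime with a global fetch (e.g. Node 18+):

```typescript
// Stand-in for one scraping request.
async function fetchLink(url: string): Promise<string> {
  const res = await fetch(url);
  return `${url}: ${res.status}`;
}

// Option 1: one thread, links are processed one after another.
async function runSequential(links: string[]): Promise<string[]> {
  const results: string[] = [];
  for (const link of links) results.push(await fetchLink(link));
  return results;
}

// Option 2: N workers, each takes the next link from the queue when free.
async function runConcurrent(links: string[], threads: number): Promise<string[]> {
  const queue = [...links];
  const results: string[] = [];
  const worker = async () => {
    for (let link = queue.shift(); link; link = queue.shift()) {
      results.push(await fetchLink(link));
    }
  };
  await Promise.all(Array.from({ length: threads }, worker));
  return results;
}
```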
Threads configuration
Threads configuration in A-Parser is carried out separately for each task, depending on the parameters required for its execution. By default, 2 thread configs are available: for 20 (default) and 100 (100 Threads) threads. To edit the selected config, click the pencil icon, after which its settings will open.
You can also go to the threads settings through the menu item: Settings - Threads settings.
Here we can:
- create a new config with our own settings and save it under our own name (Add new button)
- make changes to an existing config by selecting it from the drop-down list (Save button)
Threads count
This parameter sets the number of threads in which a task launched with this config will run. The number of threads can be anything, but you need to take into account the capabilities of your server, as well as any thread limit in your proxy tariff. For example, for our proxies you can specify no more threads than your tariff allows.
It is also important to remember that the total number of threads in the scraper equals the sum of the threads of all running tasks plus all enabled proxy checkers with proxy checking turned on. For example, if one task is running with 20 threads, two tasks with 100 threads each, and one proxy checker is checking proxies in 15 threads, the scraper uses 20 + 100 + 100 + 15 = 235 threads in total. If the proxy tariff is designed for 200 threads, there will be many unsuccessful requests. To avoid them, reduce the number of threads used: for example, disable proxy checking (if it is not needed, this saves 15 threads) and reduce the number of threads in one of the tasks by another 20. In other words, create a config for 80 threads for one of the running tasks and leave the rest as they are.
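The bookkeeping from this example can be written down as a tiny sketch (a hypothetical helper, not an A-Parser API):

```typescript
// Total threads = sum of task threads + threads of enabled proxy checkers.
function totalThreads(taskThreads: number[], checkerThreads: number[]): number {
  return [...taskThreads, ...checkerThreads].reduce((sum, n) => sum + n, 0);
}

const total = totalThreads([20, 100, 100], [15]); // 235
const proxyTariffLimit = 200;                     // threads allowed by the proxy plan
if (total > proxyTariffLimit) {
  console.log(`Over budget by ${total - proxyTariffLimit} threads`); // 35 here
}
```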
Proxy Checkers
This parameter allows you to select a proxy checker with certain settings. You can either select All, which means using all running proxy checkers, or select only those that should be used in the task (multiple selections are possible).
This setting allows you to run a task only with the necessary proxy checkers. The proxy checker setup process is described here
Maximum threads per proxy
Here you set the maximum number of threads that can simultaneously use the same proxy. This allows you to set different ratios, for example 1 thread = 1 proxy.
By default, this parameter is set to 0, which disables the limit. In most cases this is enough, but if you need to limit the load on each proxy, it makes sense to change the value.
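An illustrative sketch of the idea behind this limit (not A-Parser internals; the class and method names are made up): a counter per proxy caps how many threads use it at once.

```typescript
class ProxyLimiter {
  private inUse = new Map<string, number>();
  constructor(private maxPerProxy: number) {}

  // A thread may take the proxy only while the per-proxy cap is not reached.
  tryAcquire(proxy: string): boolean {
    const used = this.inUse.get(proxy) ?? 0;
    if (this.maxPerProxy > 0 && used >= this.maxPerProxy) return false; // 0 = unlimited
    this.inUse.set(proxy, used + 1);
    return true;
  }

  release(proxy: string): void {
    this.inUse.set(proxy, Math.max(0, (this.inUse.get(proxy) ?? 0) - 1));
  }
}

const limiter = new ProxyLimiter(1); // 1 thread = 1 proxy
```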
Global proxy ban
All tasks launched with this option share a common proxy ban list: for each parser, the list of banned proxies is common to all running tasks.
For example, a proxy banned in SE::Google in task 1 will also be banned for SE::Google in task 2, but it can still work freely in SE::Yandex in both tasks.
Max connections per host
This parameter specifies the maximum number of connections per host and is designed to reduce the load on a site while parsing information from it. Essentially, it makes it possible to control the number of simultaneous requests to each specific domain. The limit applies per thread config: if you run multiple tasks simultaneously with the same thread config, the limit is counted across all of them.
By default, this parameter is set to 0, i.e. disabled.
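The same kind of cap, keyed by host rather than by proxy, can be sketched like this (illustrative only; A-Parser handles this internally):

```typescript
const perHostActive = new Map<string, number>();
const maxConnectionsPerHost = 2; // 0 would mean the limit is disabled

// A request may start only while the per-host cap is not reached.
function canRequest(url: string): boolean {
  const host = new URL(url).host; // e.g. "example.com"
  const active = perHostActive.get(host) ?? 0;
  return maxConnectionsPerHost === 0 || active < maxConnectionsPerHost;
}
```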
Reuse proxy between retries
This setting disables the check for proxy uniqueness on each attempt, as well as proxy banning. This in turn makes it possible to use a single proxy for all attempts.
It is recommended to enable this parameter, for example, when you plan to use a single proxy whose outgoing IP changes with each connection.
Recommendations
This article covers all the settings that allow you to control threads. Note that when configuring a thread config it is not necessary to set every parameter described here; it is enough to set only those that ensure a correct result. Usually only the Threads count needs to be changed, and the rest of the settings can be left at their defaults.
Parser settings
Each parser has a multitude of settings and allows you to save different sets of them in presets. The preset system lets you use the same parser with different settings depending on the situation. Let's look at an example with the SE::Google parser:
Preset 1: "Parsing the maximum number of links"
- Pages count: 10
- Links per page: 100
Thus, the parser will collect the maximum number of links, going through all the pages of the search results.
Preset 2: "Parsing competition by query"
- Pages count: 1
- Links per page: 10
- Results format: $query: $totalcount\n
In this case, we get the number of search results for the query (query competition), and for greater speed, it is enough for us to parse only the first page with a minimum number of links.
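The two presets can be pictured as plain config objects. This is only a sketch: the field names and the first preset's result format are assumptions, not A-Parser's internal format.

```typescript
interface GooglePreset {
  pagesCount: number;
  linksPerPage: number;
  resultFormat: string;
}

// Preset 1: collect the maximum number of links.
const maxLinks: GooglePreset = {
  pagesCount: 10,
  linksPerPage: 100,
  resultFormat: "$link\n", // assumed: one link per line
};

// Preset 2: query competition, first page only.
const competition: GooglePreset = {
  pagesCount: 1,
  linksPerPage: 10,
  resultFormat: "$query: $totalcount\n", // from the preset above
};
```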
Creating presets
Creating a preset starts with selecting the parser/parsers and defining the result that needs to be obtained.
Next, you need to understand what the input data will be for the selected parser. In the screenshot above, the SE::Google parser is selected, and its input data is arbitrary search queries, just as if you were searching in a browser. You can select a query file or enter queries in a text field.
Now you need to override the settings (select options) for the parser and add deduplication. Use the query builder if the queries need processing, or the results builder if the results need to be processed in some way.
Next, pay attention to the result file name and, if necessary, change it at your discretion.
The last step is to select additional options, especially the Log option, which is very useful if you want to know the reason for a parsing error.
After all this, you need to save the preset and add it to the task queue.
Overriding settings
Override preset is a quick override of parser settings; this option can be added directly in the Task Editor. In one click you can add several parameters. The list of settings shows the default values; if an option is shown in bold, it has already been overridden in the preset.
In this example, two options were overridden: Pages count was changed from the default 10 to 5, and Links per page was set to 100.
In a task you can use an unlimited number of Override preset options, but if there are many changes, it is more convenient to create a new preset and save all the changes to it.
You can also easily save overrides using the Save Overrides function; they will be saved as a separate preset for the selected scraper.
After that, it is enough to simply select this saved preset from the list and use it.
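Conceptually, an override preset is just a few changed fields applied on top of the defaults, as in this sketch (the object merge and field names are illustrative, not A-Parser's implementation):

```typescript
const defaults = { pagesCount: 10, linksPerPage: 10 };
const overrides = { pagesCount: 5, linksPerPage: 100 }; // the two options above

// The effective settings are the defaults with overrides applied on top.
const effective = { ...defaults, ...overrides };
console.log(effective); // { pagesCount: 5, linksPerPage: 100 }
```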
Common settings for all scrapers
Each scraper has its own set of settings; information on each scraper's settings can be found in the corresponding section.
The table below lists the settings common to all scrapers.
Parameter Name | Default Value | Description |
---|---|---|
Request retries | 10 | The number of attempts for each request; if the request fails within the specified number of attempts, it is considered unsuccessful and skipped |
Use proxy | ☑ | Determines whether to use a proxy |
Query format | $query | Query format |
Result format | Each scraper has its own value | Output result format |
Proxy ban time | Each scraper has its own value | Proxy ban time in seconds |
Request timeout | 60 | Maximum request waiting time in seconds |
Request delay | 0 | Delay between requests in seconds; you can set a random value within a range, for example 10,30 - a delay from 10 to 30 seconds (see the sketch after this table) |
Proxy Checker | All | Which proxy checkers to use (choose between all or list specific ones) |
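One way a range value like 10,30 for Request delay can be interpreted is shown below (illustrative only; A-Parser handles this internally):

```typescript
// Parse "10,30" into a min/max pair and pick a random delay in the range.
// A single value like "10" yields a fixed delay of 10 seconds.
function pickDelaySeconds(setting: string): number {
  const [min, max = min] = setting.split(",").map(Number);
  return min + Math.random() * (max - min);
}

console.log(pickDelaySeconds("10,30")); // e.g. 17.4
```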
Common settings for all scrapers working over the HTTP protocol
Parameter Name | Default Value | Description |
---|---|---|
Max body size | Each scraper has its own value | Maximum size of the search results page in bytes |
Use gzip | ☑ | Determines whether to use compression of transmitted traffic |
Extra query string | - | Allows you to specify additional parameters in the query string |
Default settings for each scraper may differ. They are stored in the default preset in the settings of each scraper.
Proxy Checker Settings
Additional Settings
- Line break - allows you to choose between Unix and Windows line endings when saving results to a file
- Number format - specifies how to display numbers in the A-Parser interface
- Template Toolkit macros