Settings
A-Parser contains the following groups of settings:
- Global Settings - main program settings: language, password, update parameters, number of active tasks
- Thread Settings - thread settings and deduplication methods for tasks
- Scraper Settings - the ability to configure each individual scraper
- Proxy Check Settings - number of threads and all settings for the proxychecker
- Additional Settings - optional settings for advanced users
- Task Presets - saving tasks for subsequent use
All settings (except for global and additional) are saved in so-called presets - sets of pre-saved settings, for example:
- Different presets for the SE::Google scraper: one for scraping links at maximum depth (10 pages of 100 results each), another for assessing competition by query (a depth of 1 page with 10 results)
- Different presets for the proxychecker settings - separate for HTTP and SOCKS proxies
For all settings there is a default preset (default) that cannot be changed; any changes must be saved as presets under new names.
Global Settings
| Parameter Name | Default Value | Description |
|---|---|---|
| Password | No password | Set a password to log into A-Parser |
| Language | English | Interface language |
| News and tips | English | Language of news and tips |
| Enable tips | ☑ | Determines whether to display tips |
| Check for updates | ☑ | Determines whether to display information about the availability of a new update in the Status Bar |
| Save window size | ☐ | Determines whether to save the window size |
| Update channel | Stable | Choice of update channel (Stable, Beta, Alpha) |
| Tasks per page | 5 | Number of tasks per page in the Task Queue |
| Maximum active tasks | 1 | Maximum number of active tasks |
| Overall thread limit | 10000 | Overall thread limit in A-Parser. A task will not start if the overall thread limit is less than the number of threads in the task |
| Dynamic thread limit | ☐ | Determines whether to use the Dynamic Thread Limit |
| CPU cores (task processing) | 2 | Support for task processing on different processor cores (Enterprise license only). Described in more detail below |
| CPU cores (result processing) | 4 | Multiple cores are used only for filtering, the Result Constructor, and Parse custom result (all license types) |
| Memory Saver | Best speed | Determines how much memory the scraper may use (Best speed / Medium memory usage / Save max memory) |
CPU cores (task processing)
Support for task processing on different processor cores; this feature is available only with the Enterprise license.
This option speeds up the processing of several tasks in the queue severalfold (Settings -> Maximum active tasks); it does not speed up the execution of a single task.
Intelligent distribution of tasks across CPU cores, based on the load of each process, is also implemented. The number of CPU cores used is set in the settings: 2 by default, 32 at most.
As with threads, it is best to choose the number of cores experimentally: 2-3 cores are reasonable for quad-core processors, 4-6 for octa-core, and so on. Keep in mind that with many cores under high load, the main control process (aparser/aparser.exe) can reach 100% load; beyond that point, adding more task-processing processes only causes overall slowdown or unstable operation. Also note that each task-processing process can create additional load of up to 300% (i.e., fully load 3 cores at once); this is due to multithreaded garbage collection in the V8 JavaScript engine.
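The idea is easy to illustrate outside A-Parser with a minimal Python sketch (hypothetical task names and worker function, not the scraper's internals): whole tasks are spread across a fixed pool of worker processes, so two tasks run in parallel while the rest wait in the queue.

```python
from multiprocessing import Pool

def run_task(task_name):
    # Each worker process handles one whole task at a time; inside A-Parser
    # a task would additionally run its own threads for requests.
    return f"finished {task_name}"

if __name__ == "__main__":
    tasks = [f"task-{i}" for i in range(8)]  # hypothetical task queue
    # Mirrors "CPU cores (task processing)" = 2: at most two tasks are
    # processed in parallel, the rest wait their turn.
    with Pool(processes=2) as pool:
        for result in pool.imap_unordered(run_task, tasks):
            print(result)
```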
Thread Settings
The operation of A-Parser is based on the principle of multithreaded data processing. The scraper processes tasks in separate threads in parallel, the number of which can be flexibly varied depending on the server configuration.
Description of Thread Operation
Let's figure out what threads are in practice. Suppose you need to compile a report for three months.
Option 1
You can compile the report first for the 1st month, then for the 2nd, and then for the 3rd. This is an example of single-threaded operation: tasks are solved sequentially.
Option 2
Hire three accountants, each of whom compiles the report for one month, and then, after receiving the results from all three, assemble the overall report. This is an example of multithreaded operation: tasks are solved simultaneously.
As these examples show, multithreaded operation completes the task faster but requires more resources (3 accountants instead of 1). Multithreading in A-Parser works the same way. Suppose you need to scrape information from several links:
- with one thread, the application will scrape each site in sequence
- with multiple threads, each will process its own link, upon completion of which it will proceed to the next unprocessed one in the list
Thus, in the second option, the entire task is completed much faster, but more server resources are required, so it is recommended to follow the System Requirements.
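As a rough sketch of the difference (plain Python for illustration, not A-Parser code; the URLs are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

urls = ["https://example.com/", "https://example.org/", "https://example.net/"]

def fetch(url):
    # Placeholder "scrape": download the page and report its size
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return url, len(resp.read())
    except OSError as exc:
        return url, f"error: {exc}"

# One thread: each site is scraped in sequence
for url in urls:
    print(fetch(url))

# Several threads: each takes the next unprocessed link from the list
with ThreadPoolExecutor(max_workers=3) as pool:
    for result in pool.map(fetch, urls):
        print(result)
```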
Thread Configuration
Thread configuration in A-Parser is done separately for each task, depending on the parameters required for its execution. By default, two thread configs are available: default with 20 threads and 100 Threads with 100 threads.
To access the settings of the selected config, click the pencil icon, after which its settings will open.
You can also go to the thread settings through the menu item: Settings -> Thread Settings
Here we can:
- create a new config with our own settings and save it under our name (Add new button)
- make changes to an existing config by selecting it from the drop-down list (Save button)
Number of Threads (Threads count)
This parameter sets the number of threads a job will use when launched with this configuration. Any number of threads can be set, but you must account for your server's capabilities and for your proxy plan's thread limit, if one applies. For example, with our proxies you cannot specify more threads than your chosen plan allows.
It is also important to remember that the total number of threads in the scraper equals the sum of threads across all running jobs and all enabled proxy checkers with proxy checking turned on. For example, if one job runs on 20 threads, two jobs run on 100 threads each, and one proxy checker performs checking on 15 threads, the scraper uses 20+100+100+15=235 threads in total. If the proxy plan is rated for 200 threads, there will be many failed requests. To avoid them, reduce the number of threads used: for example, disable proxy checking (saving 15 threads if it is not needed) and reduce one of the jobs by another 20 threads. So for one of the running jobs you would create a config for 80 threads and leave the rest as is.
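The budgeting rule from this example can be written out as a quick check (a hypothetical helper, not part of A-Parser):

```python
def total_threads(job_threads, checker_threads):
    # Total = threads of all running jobs + threads of all enabled
    # proxy checkers that have proxy checking turned on
    return sum(job_threads) + sum(checker_threads)

print(total_threads([20, 100, 100], [15]))  # 235 - over a 200-thread proxy plan
print(total_threads([20, 80, 100], []))     # 200 - fits after the adjustments
```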
Proxy Checkers
This parameter lets you choose a proxy checker with specific settings. You can select All, which means using all working proxy checkers, or only those that should be used in the job (multiple selection is available).
This setting allows you to launch a job with only the necessary proxy checkers. The process of setting up a proxy checker is discussed here.
Maximum Threads per Proxy (Max threads per proxy)
Here you set the maximum number of threads that may use the same proxy simultaneously. This makes it possible to enforce different ratios, for example, 1 thread = 1 proxy.
By default, this parameter is set to 0, which disables the feature. In most cases this is sufficient, but if you need to limit the load on each proxy, it makes sense to change the value.
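Conceptually this is a per-proxy concurrency cap. A minimal sketch of the idea (hypothetical names and placeholder proxies, not A-Parser internals):

```python
import threading

MAX_THREADS_PER_PROXY = 1  # e.g. 1 thread = 1 proxy; 0 in A-Parser means no limit

proxies = ["127.0.0.1:8080", "127.0.0.1:8081"]  # placeholder proxy list
slots = {p: threading.BoundedSemaphore(MAX_THREADS_PER_PROXY) for p in proxies}

def request_via(proxy, do_request):
    # Blocks while the proxy is already in use by the maximum number of threads
    with slots[proxy]:
        return do_request(proxy)
```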
Global Proxy Ban (Global proxy ban)
All jobs launched with this option share a common proxy ban database: for each scraper, the list of banned proxies is common to all running jobs.
For example, a proxy banned in SE::Google in job 1 will also be banned for SE::Google in job 2, but at the same time, it can work freely in SE::Yandex in both jobs.
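This behaviour can be modelled as one shared ban table keyed by scraper name rather than by job (a hypothetical structure, not A-Parser's actual storage):

```python
# One ban set per scraper, shared by every running job
global_proxy_ban = {
    "SE::Google": set(),
    "SE::Yandex": set(),
}

def ban(scraper, proxy):
    global_proxy_ban[scraper].add(proxy)

def is_banned(scraper, proxy):
    return proxy in global_proxy_ban[scraper]

ban("SE::Google", "10.0.0.5:3128")               # banned in job 1...
print(is_banned("SE::Google", "10.0.0.5:3128"))  # True: also banned for job 2
print(is_banned("SE::Yandex", "10.0.0.5:3128"))  # False: free in SE::Yandex
```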
Maximum Connections per Host (Max connections per host)
This parameter sets the maximum number of connections to a single host and is intended to reduce the load on a site while scraping it. Essentially, it controls how many requests can target each specific domain at any one time. The limit applies to the job; if you run multiple jobs simultaneously with the same thread configuration, the limit is counted across all of them.
By default, this parameter is set to 0, i.e., disabled.
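The same capping idea as for proxies above, but keyed by the target domain instead (again a hypothetical sketch with placeholder domains):

```python
import threading
from urllib.parse import urlparse

MAX_CONNECTIONS_PER_HOST = 2  # 0 in A-Parser means the limit is disabled

hosts = ["example.com", "example.org"]  # placeholder domains
host_slots = {h: threading.BoundedSemaphore(MAX_CONNECTIONS_PER_HOST) for h in hosts}

def fetch_limited(url, do_request):
    host = urlparse(url).netloc      # the limit is counted per domain
    with host_slots[host]:           # at most 2 concurrent requests per host
        return do_request(url)
```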
Reuse Proxy Between Retries (Reuse proxy between retries)
This setting disables the proxy-uniqueness check for each attempt; the proxy ban also stops working. In turn, this makes it possible to use 1 proxy for all attempts.
It is recommended to enable this parameter, for example, when you plan to use 1 proxy whose outgoing IP changes with each connection.
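The difference can be sketched as follows (a hypothetical helper; A-Parser's real selection logic is internal):

```python
import random

proxies = ["p1", "p2", "p3"]  # placeholder proxies

def pick_proxy(tried, reuse_between_retries):
    if reuse_between_retries and tried:
        return tried[-1]                    # keep the same proxy on every retry
    fresh = [p for p in proxies if p not in tried]
    return random.choice(fresh or proxies)  # otherwise prefer an untried proxy

tried = []
for attempt in range(3):
    tried.append(pick_proxy(tried, reuse_between_retries=True))
print(tried)  # the same proxy on every attempt, e.g. ['p2', 'p2', 'p2']
```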
Proxy Strategy (Proxy strategy)
Allows you to manage the proxy selection strategy when using sessions: keep the proxy from a successful request for the next request or always use a random proxy.
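A sketch of the two strategies (hypothetical names): "keep" reuses the proxy from the session's last successful request, while "random" ignores history.

```python
import random

proxies = ["p1", "p2", "p3"]  # placeholder proxies
session_proxy = {}            # session id -> proxy of the last successful request

def choose_proxy(session_id, strategy):
    if strategy == "keep" and session_id in session_proxy:
        return session_proxy[session_id]  # stick with the proxy that worked
    return random.choice(proxies)         # always random otherwise

def on_success(session_id, proxy):
    session_proxy[session_id] = proxy     # remember it for the next request
```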
Recommendations
This article covers all the settings that allow you to manage threads. Note that when configuring thread settings, it is not necessary to set every parameter mentioned in the article; it is enough to set only those that ensure the correct result. Usually, only Threads count needs changing, and the other settings can be left at their default values.
Scraper Settings
Each scraper has many settings and allows you to save different sets of settings in presets. The preset system lets you use the same scraper with different settings depending on the situation. Let's look at the SE::Google scraper as an example:
Preset 1: "Parsing the maximum number of links"
- Pages count: 10
- Links per page: 100
Thus, the scraper will collect the maximum number of links by going through all the pages of the search results.
Preset 2: "Parsing competition by query"
- Pages count: 1
- Links per page: 10
- Results format: $query: $totalcount\n
In this case, we get the number of search results for a query (query competition); for greater speed, it is enough to scrape only the first page with the minimum number of links.
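To make the two presets concrete, they could be written down as plain data like this (a hypothetical representation; the keys simply mirror the option labels above, not A-Parser's internal preset format):

```python
# Hypothetical sketch of the two SE::Google presets described above
google_presets = {
    "Parsing the maximum number of links": {
        "Pages count": 10,
        "Links per page": 100,
    },
    "Parsing competition by query": {
        "Pages count": 1,
        "Links per page": 10,
        "Results format": "$query: $totalcount\n",
    },
}
```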
Creating Presets
Creating a preset starts with selecting a scraper/scrapers and determining the result you want to achieve.
Next, you need to understand what the input data for the selected scraper will be. In the screenshot above, the SE::Google scraper is selected; its input is arbitrary strings, just as if you were searching for something in a browser. You can choose a file with queries or enter queries into the text field.
Now you need to redefine the settings (select options) for the scraper and add deduplication. Use the query builder if the queries need processing, or the results builder if the results need to be processed in some way.
Next, pay attention to the name of the results file and, if necessary, change it at your discretion.
The last step is to choose additional options, especially the Keep log option, which is very useful if you want to find out the reason for a scraping error.
After all this, you need to save the preset and add it to the task queue.
Overriding Settings
Override preset is a quick redefinition of scraper settings; this option can be added directly in the Task Editor. In one click, you can add several parameters. The list of settings shows the default values; if an option is highlighted in bold, it has already been redefined in the preset.
In this example, two options have been redefined: Pages count was set to 5 and Links per page was set to 100.
In the task, you can use an unlimited number of Override preset options, but if there are many changes, it is more convenient to create a new preset and save all the changes in it.
You can also easily save the overrides using the Save Overrides function. They will be saved as a separate preset for the selected scraper.
In the future, it is enough to simply select this saved preset from the list and use it.