# Integration of A-Parser with Redis: advanced API
## Comparison with the HTTP API
The A-Parser Redis API was developed to replace the oneRequest and bulkRequest methods, providing a more efficient implementation and supporting additional usage scenarios:
- Redis serves as a server for requests and results
- results can be requested asynchronously or in blocking mode
- multiple scrapers (on the same or on different servers) can be connected to process requests through a single entry point
- the number of threads for processing requests can be set, and work logs can be viewed
- timeouts can be organized for operations
- unused results expire automatically
Discussion thread on the forum
## Preliminary setup
Unlike the A-Parser HTTP API, using the Redis API requires pre-configuring and running a task with the API::Server::Redis scraper:
- install and run the Redis server (locally or remotely)
- create a settings preset for the API::Server::Redis scraper, specifying:
  - **Redis Host** and **Redis Port** - the address and port of the Redis server, by default `127.0.0.1`, port `6379`
  - **Redis Queue Key** - the name of the key used to exchange data with A-Parser, by default `aparser_redis_api`; you can create separate queues and process them with different tasks or different copies of A-Parser
  - **Result Expire (TTL)** - the lifetime of a result in seconds, used to automatically track and delete unused results, by default `3600` seconds (1 hour)
- add a task with the API::Server::Redis scraper:
  - as queries, specify `{num:1:N}`, where N must match the number of threads specified in the task
  - you can also enable the log option, which makes the log available for each request
## Executing requests
The Redis API is built on Redis Lists. List operations let you add an unlimited number of requests to the queue (limited only by RAM) and receive results either in blocking mode with a timeout (`blpop`) or in asynchronous mode (`lpop`).
- all settings except `useproxy`, `proxyChecker` and `proxybannedcleanup` are taken from the preset of the called scraper + `overrideOpts`
- the `useproxy`, `proxyChecker` and `proxybannedcleanup` settings are taken from the API::Server::Redis preset + `overrideOpts`
A request is added to Redis using the `lpush` command; each request is an array `[queryId, parser, preset, query, overrideOpts, apiOpts]` serialized as JSON:
- `parser`, `preset`, `query` - correspond to the oneRequest API request
- `queryId` - generated together with the request; we recommend using a sequential number from your database or a good random value; the result can be retrieved by this ID
- `overrideOpts` - overrides settings of the parser preset
- `apiOpts` - additional API processing parameters
Example of executing requests; for testing you can use `redis-cli`:
```
127.0.0.1:6379> lpush aparser_redis_api '["some_unique_id", "Net::HTTP", "default", "https://ya.ru"]'
(integer) 1
127.0.0.1:6379> blpop aparser_redis_api:some_unique_id 0
1) "aparser_redis_api:some_unique_id"
2) "{\"data\":\"<!DOCTYPE html><html.....
```
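The same request can be composed from any language with a Redis client. A minimal Python sketch of building the JSON payload (the queue name `aparser_redis_api` and the per-request result key match the defaults above; the `build_request` helper is illustrative, not part of A-Parser):

```python
import json

def build_request(query_id, parser, preset, query,
                  override_opts=None, api_opts=None):
    """Serialize a request into the [queryId, parser, preset, query,
    overrideOpts, apiOpts] array expected by API::Server::Redis."""
    return json.dumps([query_id, parser, preset, query,
                       override_opts or {}, api_opts or {}])

payload = build_request("some_unique_id", "Net::HTTP", "default", "https://ya.ru")
print(payload)

# The result for this request will appear under the per-request key:
result_key = "aparser_redis_api:some_unique_id"
```

The payload would then be pushed with `LPUSH aparser_redis_api <payload>` via any Redis client.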
## Various cases
### Asynchronous check for the presence of a result
```
lpop aparser_redis_api:some_unique_id
```
Returns the result if it has already been processed, or `nil` if the request is still being processed.
### Blocking result retrieval
```
blpop aparser_redis_api:some_unique_id 0
```
This command blocks until the result is available; you can also specify a maximum timeout for receiving the result, after which the command returns `nil`.
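Both retrieval modes return the same JSON-serialized result. A small Python sketch of decoding the replies (the reply shapes are assumed from the `redis-cli` examples: `blpop` yields a key/value pair, `lpop` yields the bare value or nil):

```python
import json

def parse_reply(reply):
    """Decode an lpop/blpop reply from the Redis API.

    Returns the result as a dict, or None when the request is still
    being processed (lpop returned nil) or blpop timed out.
    """
    if reply is None:
        return None
    # blpop returns a [key, value] pair; lpop returns the bare value
    value = reply[1] if isinstance(reply, (list, tuple)) else reply
    return json.loads(value)

# Simulated replies matching the examples above:
blocking = ["aparser_redis_api:some_unique_id", '{"data": "<!DOCTYPE html>..."}']
print(parse_reply(blocking)["data"])   # the scraped page content
print(parse_reply(None))               # still processing -> None
```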
### Saving results in a single queue
By default, A-Parser saves the result of each request under its own unique key `aparser_redis_api:query_id`, which allows you to organize multi-threaded processing by sending requests and receiving results separately in each thread.

In some cases it is necessary to process results in a single thread as they arrive; then it is more convenient to save results in a single result queue (its key must differ from the key used for requests).
To do this, specify the `output_queue` key in `apiOpts`:

```
lpush aparser_redis_api '["some_unique_id", "Net::HTTP", "default", "https://ya.ru", {}, {"output_queue": "aparser_results"}]'
```
Getting the result from the common queue:
```
127.0.0.1:6379> blpop aparser_results 0
1) "aparser_results"
2) "{\"queryId\":\"some_unique_id\",\"results\":{\"data\":\"<!DOCTYPE html><html class=...
```
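With a single output queue, one consumer can route results back by `queryId`. A Python sketch of such routing (the `dispatch` helper and handler registry are illustrative; the entry shape follows the example above):

```python
import json

def dispatch(raw_entry, handlers):
    """Route one entry from the common result queue to the handler
    registered for its queryId (each entry carries its own queryId)."""
    entry = json.loads(raw_entry)
    handler = handlers.get(entry["queryId"])
    if handler is not None:
        handler(entry["results"])

received = {}
handlers = {"some_unique_id": received.update}

# Simulated entry as it would arrive from `blpop aparser_results`:
raw = '{"queryId": "some_unique_id", "results": {"data": "<!DOCTYPE html>..."}}'
dispatch(raw, handlers)
print(received["data"])
```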
## Example of implementation (SpySERP case)
Suppose we are creating a SaaS service that evaluates domain parameters; for simplicity, we will check the domain registration date. Our service consists of two pages:
- `/index.php` - landing page with a form for entering a domain
- `/results.php?domain=google.com` - page with the results of the service
To improve the user experience, we want our service pages to load instantly, with the wait for data looking natural and showing a loader.
When `/results.php` is requested, we first send a request to the A-Parser Redis API, generating a unique request_id:

```
lpush aparser_redis_api '["request-1", "Net::Whois", "default", "google.com", {}, {}]'
```
After that, we can render the page and show a loader over the data area; since there are no delays, the server response time is limited only by the speed of connecting to Redis (usually within 10 ms).
A-Parser will start processing the request even before the user's browser receives the first content. After the browser loads all the necessary resources and scripts, we can display the result by sending an AJAX request to retrieve the data:

```
/get-results.php?request_id=request-1
```
The `get-results.php` script performs a blocking request to Redis with a timeout of 15 seconds:

```
blpop aparser_redis_api:request-1 15
```
It returns the response as soon as it is received from A-Parser. If the timeout expires with an empty (nil) result, we can show the user a data retrieval error.
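The `get-results.php` logic can be sketched in Python (illustrative only; `blpop` here stands for any Redis client call with BLPOP semantics, returning a key/value pair or `None` on timeout):

```python
import json

def get_results(blpop, request_id, timeout=15):
    """Block on the per-request key; report an error on timeout."""
    reply = blpop("aparser_redis_api:" + request_id, timeout)
    if reply is None:  # timeout expired, no result from A-Parser
        return {"error": "timed out waiting for the result"}
    return json.loads(reply[1])

# Simulated client: the result is already waiting in the queue
fake_blpop = lambda key, timeout: (key, '{"data": "whois response ..."}')
print(get_results(fake_blpop, "request-1")["data"])
```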
Thus, by sending a request to A-Parser the moment the page is first opened (`/results.php`), we reduce the user's wait for data (`/get-results.php`) to the time the browser spends receiving content, loading scripts, and executing the AJAX request.