
Integration of A-Parser with Redis: Advanced API


Comparison with HTTP API

The A-Parser Redis API was developed as a more efficient replacement for the oneRequest and bulkRequest methods, with support for additional use cases:

  • Redis acts as the server for requests and results
  • the ability to request results asynchronously or in blocking mode
  • the ability to connect multiple scrapers (both on the same and on different servers) to process requests with a single entry point
  • the ability to set the number of threads for processing requests and view work logs
  • the ability to organize timeouts for operations
  • automatic expiration of unclaimed results

Discussion thread on the forum

Pre-setup

Unlike the A-Parser HTTP API, the Redis API requires you to first configure and launch a task with the API::Server::Redis scraper:

  • install and start Redis server (locally or remotely)
  • create a settings preset for the API::Server::Redis scraper, specifying:
    • Redis Host and Redis Port - the address and port of the Redis server, by default 127.0.0.1 and 6379
    • Redis Queue Key - the name of the key used for data exchange with A-Parser, by default aparser_redis_api; you can create separate queues and process them with different tasks or different copies of A-Parser
    • Result Expire(TTL) - the lifetime of a result in seconds, used to automatically delete unclaimed results, by default 3600 seconds (1 hour)
  • add a task with the API::Server::Redis scraper
    • as the task queries, specify {num:1:N}, where N must match the number of threads set in the task
    • you can also enable the log option, which makes the log for each request available for viewing

Executing requests

The Redis API is built on Redis Lists. List operations allow adding an unlimited number of requests to the queue (limited only by RAM) and receiving results either in blocking mode with a timeout (blpop) or in asynchronous mode (lpop).

  • all settings except useproxy, proxyChecker, and proxybannedcleanup are taken from the preset of the called scraper + overrideOpts
  • the useproxy, proxyChecker, and proxybannedcleanup settings are taken from the API::Server::Redis preset + overrideOpts

A request is added to Redis with the lpush command. Each request is an array [queryId, parser, preset, query, overrideOpts, apiOpts] serialized as JSON:

  • parser, preset, query - the same as the corresponding parameters of the oneRequest API request
  • queryId - generated together with the request; it is recommended to use a sequential number from your database or a good random value; this ID is later used to retrieve the result
  • overrideOpts - overriding settings for the scraper preset
  • apiOpts - additional API processing parameters
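The array layout above can be produced with any JSON library. A minimal Python sketch follows; the helper name build_request is hypothetical and not part of A-Parser, and empty objects are sent for overrideOpts and apiOpts when none are given:

```python
import json


def build_request(query_id, parser, preset, query, override_opts=None, api_opts=None):
    """Serialize one request as [queryId, parser, preset, query, overrideOpts, apiOpts].

    build_request is an illustrative helper, not an A-Parser function.
    """
    return json.dumps(
        [query_id, parser, preset, query, override_opts or {}, api_opts or {}]
    )


# The resulting string is the value to pass to lpush on the request queue key.
payload = build_request("some_unique_id", "Net::HTTP", "default", "https://ya.ru")
print(payload)
```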

redis-cli

Example of executing a request; for testing you can use redis-cli:

127.0.0.1:6379> lpush aparser_redis_api '["some_unique_id", "Net::HTTP", "default", "https://ya.ru"]'
(integer) 1
127.0.0.1:6379> blpop aparser_redis_api:some_unique_id 0
1) "aparser_redis_api:some_unique_id"
2) "{\"data\":\"<!DOCTYPE html><html.....

Various use cases

Asynchronous check for the result

lpop aparser_redis_api:some_unique_id

Returns the result if the request has already been processed, or nil if it is still being processed
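The same non-blocking check can be sketched in Python. This assumes client is a redis-py redis.Redis instance (or any object with a compatible lpop method); poll_result is a hypothetical helper name:

```python
import json


def poll_result(client, query_id, queue_key="aparser_redis_api"):
    """Return the decoded result, or None if nothing is stored under the result key yet.

    `client` is assumed to expose lpop(key) like redis-py's redis.Redis.
    """
    raw = client.lpop(f"{queue_key}:{query_id}")
    if raw is None:
        return None  # not processed yet: try again later
    return json.loads(raw)
```

A caller would typically invoke poll_result periodically and continue with other work while it returns None.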

Blocking result retrieval

blpop aparser_redis_api:some_unique_id 0

This command blocks until the result is received (a timeout of 0 waits indefinitely). You can also specify a maximum timeout in seconds, after which the command returns nil
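A blocking wait with a timeout might look like this in Python. The sketch assumes a redis-py style client whose blpop returns a (key, value) pair, or None on timeout; wait_result and the 15-second default are illustrative choices:

```python
import json


def wait_result(client, query_id, timeout=15, queue_key="aparser_redis_api"):
    """Block until the result arrives; raise TimeoutError if blpop times out.

    `client` is assumed to expose blpop(key, timeout=...) like redis-py's redis.Redis.
    """
    reply = client.blpop(f"{queue_key}:{query_id}", timeout=timeout)
    if reply is None:
        raise TimeoutError(f"no result for {query_id} within {timeout} s")
    _key, raw = reply  # blpop returns the key name together with the popped value
    return json.loads(raw)
```

Raising an exception on timeout is a design choice of this sketch; returning None would work equally well if the caller prefers to check the value.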

Saving results to a single queue

By default, A-Parser saves the result for each query under its unique key aparser_redis_api:query_id, which allows organizing multithreaded processing by sending requests and receiving results separately for each thread

In some cases it is necessary to process results in a single thread as they arrive. In that case it is more convenient to save the results to a single result queue (its key must differ from the key used for requests)

To do this, specify the output_queue key in apiOpts:

lpush aparser_redis_api '["some_unique_id", "Net::HTTP", "default", "https://ya.ru", {}, {"output_queue": "aparser_results"}]'

Receiving the result from the common queue:

127.0.0.1:6379> blpop aparser_results 0
1) "aparser_results"
2) "{\"queryId\":\"some_unique_id\",\"results\":{\"data\":\"<!DOCTYPE html><html class=...
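A single-threaded consumer for the common queue could be sketched as follows. It relies on the queryId and results fields shown in the output above and assumes a redis-py style client; consume_results, handle, and idle_timeout are hypothetical names introduced for illustration:

```python
import json


def consume_results(client, handle, output_queue="aparser_results", idle_timeout=5):
    """Pop results one by one until the queue stays empty for idle_timeout seconds.

    `client` is assumed to expose blpop(key, timeout=...) like redis-py's redis.Redis.
    `handle` is a caller-supplied callback taking (query_id, results).
    Returns the number of results processed.
    """
    processed = 0
    while True:
        reply = client.blpop(output_queue, timeout=idle_timeout)
        if reply is None:
            return processed  # no new results within idle_timeout
        _key, raw = reply
        message = json.loads(raw)
        handle(message["queryId"], message["results"])
        processed += 1
```

Because every message carries its queryId, the consumer can match results back to the original requests even though they all arrive through one key.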

Implementation example (SpySERP case)

Let's assume we are creating a SaaS service that assesses domain parameters. For simplicity, we will check only the domain registration date

Our service consists of 2 pages:

  • /index.php - a landing page with a domain input form
  • /results.php?domain=google.com - a page with the service's results

To improve the user experience, we want the service pages to load instantly and the wait for data to look natural, with a loader displayed

When results.php is requested, we first send a request to the A-Parser Redis API, generating a unique request_id:

lpush aparser_redis_api '["request-1", "Net::Whois", "default", "google.com", {}, {}]'

After that we can render the page for the user and show a loader in the data display area. Since nothing else delays the response, the server response time is limited only by the speed of the Redis connection (usually within 10ms)

A-Parser starts processing the request even before the user's browser receives the first content. Once the browser has loaded all the necessary resources and scripts, we can display the result by sending an AJAX request for the data:

/get-results.php?request_id=request-1

The get-results.php script performs a blocking request to Redis with a 15-second timeout:

blpop aparser_redis_api:request-1 15

It returns the response as soon as it is received from A-Parser. If we get a nil result because of the timeout, we can show the user a data retrieval error

Thus, by sending the request to A-Parser when the page is first opened (/results.php), we reduce the time the user waits for data (/get-results.php) by the time the browser spends waiting for content, loading scripts, and executing the AJAX request
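The two server-side steps of this flow can be sketched in Python (the pages in the example are PHP; the logic is the same). This is a sketch under assumptions: client is a redis-py style object, start_lookup and get_results are hypothetical function names, the request_id is built from a random UUID, and the shape of the returned dictionary is an illustrative choice rather than anything defined by A-Parser:

```python
import json
import uuid

QUEUE_KEY = "aparser_redis_api"  # default Redis Queue Key from the preset


def start_lookup(client, domain):
    """Step 1 (rendering /results.php): enqueue the request and return its id."""
    request_id = f"request-{uuid.uuid4().hex}"
    request = [request_id, "Net::Whois", "default", domain, {}, {}]
    client.lpush(QUEUE_KEY, json.dumps(request))
    return request_id  # embed this id in the page for the AJAX call


def get_results(client, request_id, timeout=15):
    """Step 2 (/get-results.php): block until the result arrives or the timeout expires."""
    reply = client.blpop(f"{QUEUE_KEY}:{request_id}", timeout=timeout)
    if reply is None:
        return {"ok": False, "error": "timeout"}  # show a data retrieval error
    _key, raw = reply
    return {"ok": True, "result": json.loads(raw)}
```

The page handler calls start_lookup before sending any HTML, so A-Parser is already working while the browser loads; the AJAX handler then calls get_results with the same request_id.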