JavaScript parsers

Aug 20, 2020
  • JavaScript parsers let you create your own full-fledged parsers with logic of any complexity, written in JavaScript. JS parsers can also use all the functionality of the standard parsers.


    Features


    • You can write your own parser, reger or poster with logic of any complexity, using all the power of A-Parser
    • Code is written in JavaScript with ES6 features (V8 engine)
    • Parser code is as concise as possible, so you can focus on the logic; A-Parser takes care of multithreading, networking, proxies, results, logs, etc.
    • Code can be written directly in the A-Parser interface by adding a new parser in the Parsers editor; a simple example is loaded by default, which you can use as a starting point for your own parser
    • The parser code is versioned automatically when saved via the built-in editor
    • Available for Pro and Enterprise licenses on all operating systems except 32-bit Linux

    How to use


    • Create a new parser in the Parsers editor
    • Specify the name of the parser
    • Write the parser code
    • Save it and use it like an ordinary parser: select the created parser in the Task editor and, if necessary, set the desired settings, config preset, file name, etc.
    • The created parser can be edited at any time. Changes that affect the interface appear after you reselect the parser in the list of parsers or restart A-Parser; changes to the parser logic take effect when you restart the task that uses the parser
    • By default the standard icon is displayed; you can add your own in PNG or ICO format

    Documentation

    JS parsers are changing frequently and gaining new functionality, so this section will be updated as the functionality of JavaScript parsers changes

    1. General principles of work

    • The constructor is called once for each task
      • You must always set this.defaultConf.results and this.defaultConf.results_format; other fields are optional and take default values
      • In the this.defaultConf object you can set the parameter bulkQueries: N. The parser then takes queries in batches of N, and all queries for the current iteration are available in the set.bulkQueries array
      • The this.editableConf array determines which settings the user can change from the A-Parser interface
        • The following field types are available:
          • combobox - a drop-down menu. It can also be used to select a preset of a standard parser, for example:
            Code:
            ['Util_AntiGate_preset', ['combobox', 'AntiGate preset']]
            If you specify {'multiSelect': 1}, the menu allows multiple selection:
            Code:
            ['proxyCheckers', ['combobox', 'Proxy Checkers', {'multiSelect': 1}, ['*', 'All']]]
          • checkbox - a checkbox, for parameters that can have only two values (true/false)
          • textfield - a text field
    • The *parse method is a generator: every blocking operation must be called with yield (this is the main and only difference from a regular function)
      • The method is called for each query being processed
      • It receives set (a hash with the query and its parameters) and results (an empty template for the results)
      • It must always return the filled-in results, with the success flag set beforehand
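
    The structure described above can be sketched as follows. This is only an illustration: the shape of results and results_format, the response fields success and data, and the runOnce driver are assumptions, and inside A-Parser the runtime itself calls *parse and provides this.request.

```javascript
// Minimal JS parser skeleton (sketch). The exact shape of `results`,
// `results_format` and the response object is assumed for illustration.
class Parser {
    constructor() {
        // Called once per task: declare defaults and user-editable settings.
        this.defaultConf = {
            version: '0.1.1',
            results: { flat: [['title', 'Page title']] }, // assumed shape
            results_format: '$query: $title\\n',          // assumed format
        };
        this.editableConf = [];
    }

    // Generator: every blocking call is made with yield.
    *parse(set, results) {
        const response = yield this.request('GET', set.query, {}, {
            check_content: ['</html>'],
            decode: 'auto-html',
        });

        if (response.success) {
            const m = response.data.match(/<title>(.*?)<\/title>/i);
            if (m) results.title = m[1];
        }

        // Always set the success flag before returning the results.
        results.success = response.success;
        return results;
    }
}

// Hypothetical stand-in for the A-Parser runtime, so the sketch runs in Node.js.
function runOnce(parser, query) {
    parser.request = () => ({ success: 1, data: '<html><title>Demo</title></html>' });
    const gen = parser.parse({ query }, {});
    let step = gen.next();
    while (!step.done) step = gen.next(step.value); // feed each yielded value back in
    return step.value;
}

const results = runOnce(new Parser(), 'http://example.com/');
console.log(results);
```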

    2. Automatic versioning

    Code:
            this.defaultConf = {
                version: '0.1.1',
    
    • The version has the format Major.Minor.Revision
    • The Revision value (the last number) is incremented automatically on each save
    • The other values (Major, Minor) can be changed manually, and Revision can be reset to 0
    • If for some reason you need to change Revision only by hand, enclose the version in double quotes ("")
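
    To make the increment rule concrete, here is a small illustration of what happens to the version string on save. bumpRevision is a hypothetical helper written for this example; it is not an A-Parser function, and the editor performs this step itself.

```javascript
// Illustration only: mimics the described auto-increment of Revision.
// bumpRevision is a hypothetical helper, not part of A-Parser.
function bumpRevision(version) {
    const [major, minor, revision] = version.split('.').map(Number);
    return `${major}.${minor}.${revision + 1}`;
}

console.log(bumpRevision('0.1.1')); // 0.1.2
console.log(bumpRevision('1.0.0')); // 1.0.1
```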

    3. yield this.request(method, url, queryParams, opts)

    • Gets the HTTP response for a request; the arguments are the method, the URL, a hash with the query parameters and a hash with the request options
    • With the POST method, the data can be passed in two ways:
      • List the variable names and their values, for example { url: set.query }
      • Send the data in the request body, for example { body: 'url=' + set.query }. If there are several parameters, they are separated by commas in both cases
    • Available options (opts):
      • check_content: ['<\/html>'] - an array of checks; if a check fails, the request is retried with another proxy
    Possibilities:
    • Regular expressions can be used
    • Your own functions can be used; they receive the data and headers of the response
    How it works:
    • All specified checks must pass for the request to be considered successful
    • You can use strings (substring search), regular expressions and your own functions
    • For logical negation, wrap the check in an array, i.e. [/xxxx/] means that the regular expression must not match
    Code:
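    let response = yield this.request('GET', set.query, {}, {
        check_content: [
            /<\/html>|<\/body>/,   // regular expression: must match
            [/XXXX/],              // wrapped in an array: must NOT match
            '</html>',             // string: must occur in the content
            (data, hdr) => {       // function: must return a true value
                return hdr.Status == 200 && data.length > 100;
            },
        ],
        decode: 'auto-html',
    });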
    
    • decode: 'auto-html' - detects the charset and converts the content to UTF-8; possible values: auto-html (detection based on headers, meta tags and page content; the best option), utf8 (indicates that the document is already encoded in UTF-8)
    • headers: {'user-agent': 'Google bot'} - a hash with headers; header names must be lowercase; a cookie header can also be specified
    • headers_order: ['cookie', 'user-agent', ...] - overrides the order of the headers
    • recurse: 0 - the maximum number of redirects to follow, 7 by default
    • proxyretries: 10 - the number of attempts; by default taken from the parser settings
    • parsecodes: {200: 1} - the list of successful HTTP response codes; if '*': 1 is specified, all responses are considered successful; by default taken from the parser settings
    • timeout: 30 - response timeout in seconds
    • do_gzip: 1 - whether to use compression; enabled by default
    • max_size: 4096 - the maximum response size in bytes; by default taken from the parser settings
    • cookie_jar: { } - a hash with cookies (the format will be described below)
    • attempt: 3 - the number of the current attempt; when this parameter is used, the built-in retry handler is ignored for this request
    • browser: 1 - automatic emulation of browser headers
    • use_proxy: 1 - overrides the global Use proxy setting for a single request in the JS parser
    • noextraquery: 1 - disables appending the Extra query string to the request URL
    • save_to_file: file - downloads the response to a file; replace file with the name to save the file under
    • needData: 1 - whether to pass data / pages[] in the response; can be used for optimization
    • data_as_buffer: 0 - whether to return data as a String or as a Buffer object; a String is returned by default
    • bypass_cloudflare: 1 - works the same way as in Net::HTTP, automatically bypassing the CloudFlare JS protection
    • follow_meta_refresh: 1 - follows redirects declared with <meta http-equiv="refresh" content="time; url=..." />
    • tlsOpts: { } - passes settings for HTTPS connections (https://nodejs.org/dist/latest-v14.x/docs/api/tls.html#tls_tls_connect_options_callback)
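
    The check_content rules described earlier in this section (strings, regular expressions, functions, negation via an array, all checks required) can be modelled in plain JavaScript. The sketch below is a re-implementation for illustration only, not A-Parser's code; passesChecks and matchesRule are hypothetical names.

```javascript
// Illustrative model of the check_content rules. NOT A-Parser's implementation.
function matchesRule(rule, data, hdr) {
    if (typeof rule === 'string') return data.includes(rule);        // substring search
    if (rule instanceof RegExp) return rule.test(data);              // regular expression
    if (typeof rule === 'function') return Boolean(rule(data, hdr)); // custom function
    throw new TypeError('Unsupported check_content rule');
}

function passesChecks(checks, data, hdr) {
    // Every rule must pass; a rule wrapped in an array is negated.
    return checks.every((rule) =>
        Array.isArray(rule)
            ? !matchesRule(rule[0], data, hdr)
            : matchesRule(rule, data, hdr));
}

const checks = [
    /<\/html>|<\/body>/,
    [/XXXX/],
    '</html>',
    (data, hdr) => hdr.Status == 200 && data.length > 100,
];

const goodPage = '<html><body>' + 'x'.repeat(200) + '</body></html>';
const blockedPage = goodPage.replace('x', 'XXXX');

console.log(passesChecks(checks, goodPage, { Status: 200 }));    // true
console.log(passesChecks(checks, blockedPage, { Status: 200 })); // false
console.log(passesChecks(checks, goodPage, { Status: 503 }));    // false
```

    In a real parser a failed check does not return false to your code; according to the description above, the request is retried with another proxy.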

    4. yield this.parser.request(parser, preset, overrideParams, query)

    • Gets results from another (standard) parser; the arguments are the name of the parser, its preset, a hash with option overrides (optional) and the query
    • In the override hash, in addition to the parameters of the called parser, you can also specify:
      • resultArraysWithObjects: 1 - returns arrays of objects in the results instead of the standard arrays of values

    5. tools.*


    6. this.logger.*

    • The this.logger.put method writes a line to the log
    • this.doLog can be used as a flag for optimization: when the log is not being recorded, you can skip building a complex expression that would be passed to .put
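
    A short sketch of the doLog optimization. The ctx object below is a hypothetical stand-in for the parser instance (inside *parse you would write this.doLog and this.logger.put), and buildSummary represents any expensive expression.

```javascript
// Hypothetical stand-in for the runtime-provided logger and doLog flag.
const ctx = {
    doLog: false,
    lines: [],
    logger: { put(line) { ctx.lines.push(line); } },
};

let summariesBuilt = 0;
function buildSummary(items) {
    summariesBuilt++; // count how often the expensive work is done
    return `Parsed ${items.length} items: ${JSON.stringify(items)}`;
}

function report(self, items) {
    // Skip building the expensive string entirely when the log is off.
    if (self.doLog) self.logger.put(buildSummary(items));
}

report(ctx, [1, 2, 3]); // log disabled: nothing built, nothing logged
ctx.doLog = true;
report(ctx, [1, 2, 3]); // log enabled: one line logged
console.log(summariesBuilt, ctx.lines);
```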

    7. yield this.sleep(sec)

    • Pauses the thread for the given number of seconds; the value may be fractional

    8. yield this.mutex.lock(), yield this.mutex.unlock()

    • A mutex for synchronization between threads; it lets you restrict a section of code to a single thread at a time. See the JS::Rank::MOZ parser for an example

    9. this.cookies.*

    • Work with cookies
      • this.cookies.getAll() - returns a hash with the cookies
      • this.cookies.setAll(cookies) - sets the cookies; takes a hash of cookies as its argument
      • this.cookies.set(host, path, name, value) - sets a single cookie

    10. this.query.add(query, maxLvl)

    • Adds a new query, optionally with a maximum level (similar to tools.query.add)
    • You can pass a hash with parameters as the query; this works similarly to the Query builder:
      Code:
      this.query.add({query: "http://...", param1: "..", ...})

    11. this.proxy.*

    • Work with proxies
      • this.proxy.next() - switches to another proxy; the old proxy is no longer used for the current request
      • this.proxy.ban() - switches to another proxy and bans the old one (use it when the service blocks by IP)
      • this.proxy.get() - returns the current proxy (the last proxy a request was made with)
      • this.proxy.set('http://127.0.0.1:8080', noChange = false) - sets the proxy for the next request; the noChange parameter is optional; if true, the proxy does not change between attempts
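
    One way these methods might be combined is a manual retry loop that uses the attempt option of this.request. This is a sketch under assumptions: the response fields success and data, the looksBlocked helper and the mock runtime are hypothetical and exist only so that the example runs in plain Node.js.

```javascript
// Manual retry loop with proxy rotation (sketch).
// looksBlocked and the response fields are assumptions for illustration.
const looksBlocked = (response) => /access denied/i.test(response.data || '');

function* fetchWithRotation(self, url, maxAttempts) {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        // Passing attempt means the built-in retry handler is ignored.
        const response = yield self.request('GET', url, {}, { attempt });
        if (response.success) return response;
        if (looksBlocked(response)) self.proxy.ban(); // service blocked this IP
        else self.proxy.next();                       // just try another proxy
    }
    return null;
}

// --- hypothetical mock runtime, not part of A-Parser ---
const proxyCalls = [];
const replies = [
    { success: 0, data: 'Access denied' },
    { success: 0, data: '' },
    { success: 1, data: '<html>ok</html>' },
];
const mock = {
    request: () => replies.shift(),
    proxy: {
        ban: () => proxyCalls.push('ban'),
        next: () => proxyCalls.push('next'),
    },
};

function run(gen) {
    let step = gen.next();
    while (!step.done) step = gen.next(step.value); // feed yielded values back in
    return step.value;
}

const response = run(fetchWithRotation(mock, 'http://example.com/', 5));
console.log(proxyCalls, response.success);
```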

    12. yield this.captcha.*

    • Work with CAPTCHA
      • yield this.captcha.recognize(preset, image, type, overrides) - uploads a captcha for recognition; image is the binary image data, preset is the preset for Util::AntiGate, type is one of 'jpeg', 'gif', 'png'. The result is a hash with the fields answer (the text from the image, if recognized), id (the captcha id, which can later be used to report an error via reportBad) and error (the error text, if answer is not set)
      • yield this.captcha.recognizeFromUrl(preset, url, overrides) - same as the previous method, but the captcha image is downloaded automatically from the URL, without using a proxy
      • yield this.captcha.reportBad(preset, id, overrides) - reports to the service that the captcha was recognized incorrectly
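
    A possible recognize / verify / reportBad cycle is sketched below. Only the method names and the answer, id and error fields come from the description above; the preset name 'default', the submitAnswer step and the mock runtime are hypothetical.

```javascript
// Sketch of a captcha cycle inside a generator. The preset name,
// submitAnswer and the mock objects are assumptions for illustration.
function* solveCaptcha(self, image) {
    const res = yield self.captcha.recognize('default', image, 'png', {});
    if (!res.answer) {
        self.logger.put('Captcha error: ' + res.error);
        return null;
    }
    const accepted = yield self.submitAnswer(res.answer);
    if (!accepted) {
        // Tell the recognition service that its answer was wrong.
        yield self.captcha.reportBad('default', res.id, {});
        return null;
    }
    return res.answer;
}

// --- hypothetical mock runtime, not part of A-Parser ---
const reported = [];
const mock = {
    logger: { put: (line) => console.log(line) },
    captcha: {
        recognize: () => ({ answer: 'abc123', id: 42 }),
        reportBad: (preset, id) => { reported.push(id); return true; },
    },
    submitAnswer: (answer) => answer === 'xyz', // the site rejects 'abc123'
};

function run(gen) {
    let step = gen.next();
    while (!step.done) step = gen.next(step.value); // feed yielded values back in
    return step.value;
}

const answer = run(solveCaptcha(mock, Buffer.from('fake image bytes')));
console.log(answer, reported);
```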

    13. this.utils.*

    • this.utils.updateResultsData(results, data) - automatically fills $pages.$i.data and $data; call it to add the content of the resulting page
    • this.utils.urlFromHTML(url, [base]) - processes a link taken from HTML code: decodes entities (&amp; etc.); optionally you can pass base, the base URL (for example, the URL of the source page), to obtain a full link
    • this.utils.url.extractDomain(url, [removeDefaultSubdomain]) - takes a URL and returns its domain. The optional second parameter determines whether the www subdomain is stripped from the domain; the default is 0, i.e. it is not stripped
    • this.utils.url.extractTopDomain(url) - takes a URL and returns its domain without subdomains
    • this.utils.url.extractTopDomainByZone(url) - takes a URL and returns its domain without subdomains; works with all regional zones
    • this.utils.url.extractMaxPath(url) - takes a string and extracts the URL from it
    • this.utils.url.extractWOParams(url) - takes a URL and returns it without the parameter string, i.e. the URL up to the ?
    • this.utils.removeHtml(string) - takes a string and returns it with HTML tags removed
    • this.utils.removeNoDigit(string) - takes a string, removes everything except digits and returns the result
    • this.utils.removeComma(string) - takes a string, removes characters such as . , \r \n and returns the result
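
    To show the kind of transformation three of these helpers perform, here are rough plain-JavaScript approximations. They are not the A-Parser implementations, and edge cases (malformed HTML, URLs with fragments, and so on) may be handled differently by the real methods.

```javascript
// Rough approximations for illustration only; NOT A-Parser's implementations.
const removeHtml = (s) => s.replace(/<[^>]*>/g, '');    // strip tags
const removeNoDigit = (s) => s.replace(/\D/g, '');      // keep digits only
const extractWOParams = (url) => url.split('?')[0];     // cut at the first ?

console.log(removeHtml('<p>Price: <b>1,200</b></p>'));            // Price: 1,200
console.log(removeNoDigit('Price: 1,200 USD'));                   // 1200
console.log(extractWOParams('https://example.com/a/b?x=1&y=2'));  // https://example.com/a/b
```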

    14. this.logger.putHTML (since v1.2.61)

    Since version 1.2.61, HTML code can be output to the parser's log via the this.logger.putHTML method; the HTML is processed and rendered in the log.
    Syntax:
    Code:
    this.logger.putHTML(code)

    15. this.sessionManager.* (since v1.2.84)

    • Work with sessions
      • To use sessions in a JS parser, you must first initialize the Session Manager. This is done in the init() function:
        Code:
        init() {
            this.sessionManager.init({
                //here you can set additional parameters
            });
        }
    • The following parameters can be used in this.sessionManager.init():
      • name - optional; overrides the name of the parser the sessions belong to; by default it is the name of the parser in which the initialization occurs
      • canChangeProxy - optional; whether the proxy may be changed; the default is 1
      • domain - optional; specifies whether to look for sessions among all stored ones (if no value is given) or only for a specific domain (the domain must be given with a leading dot, for example .site.com)
    • There are several functions for working with sessions:
      • this.sessionManager.get() - gets a new session; call it before the request is executed. At present the following syntax must be used:
        Code:
        if(this.sessionManager.get())
            this.proxy.next();
      • this.sessionManager.reset() - clears the cookies and gets a new session. Call it if the request was not successful with the current session
      • this.sessionManager.save() - saves a successful session, or saves arbitrary data in the session
    Example of working with sessions
    Example of saving data in a session

    16. Method .addElement (since v1.2.368)

    Available since version 1.2.368, this method makes filling arrays in results more convenient: you do not need to remember the order of the variables in the array or list them manually.
    Example: https://a-parser.com/threads/5058/



    17. Methods init() and destroy() (since v1.2.890)

    The init() method is called when the task starts, destroy() when it finishes.
    Example of use:
    Code:
    const puppeteer = require("puppeteer");
    let globalBrowser;
    
    class Parser {
        constructor() {
           ...
        }
    
        async init() {
            globalBrowser = await puppeteer.launch();
        }
    
        async destroy() {
            if(globalBrowser)
                await globalBrowser.close();
        }
    }
    

    Useful links