Changing User Agent Used By Google?

Discussion in 'A-Parser Support Forum' started by scrapefun, Jun 11, 2015.

  1. scrapefun

    scrapefun A-Parser Enterprise License
    A-Parser Enterprise

    Joined:
    Feb 24, 2015
    Messages:
    184
    Likes Received:
    34
    How do you change the user agent used by the Google parser?

    I tried updating the user-agents.txt file, but it seems to have no effect, or I am doing something wrong. I am trying to get it to use an iPhone user agent to scrape mobile results.

    I can change the user agent fine with the Net::HTTP parser, since there is an option for it in the parser settings, but I see no way to do this with the Google parser.
     
  2. scrapefun

    I've been testing some more, and it seems the Google parser uses a fairly old user agent.

    When I run the Google parser in the test parser, the raw data never shows things like knowledge graph boxes.

    But if I run the Net::HTTP parser in the test parser with a Google search result URL and a more recent user agent, the raw data does include some of the more "modern" search elements, such as knowledge graph boxes.

    I have wanted a way to check for knowledge graph boxes while scraping search results and thought the Google parser was not capable of it, but it seems I just need to be able to use a different user agent.
     
  3. scrapefun

  4. Forbidden

    Forbidden Administrator
    Staff Member A-Parser Enterprise

    Joined:
    Mar 9, 2013
    Messages:
    3,337
    Likes Received:
    1,795
    It isn't possible to change the user agent for SE::Google, because doing so changes the source of the SERP page and the parser would no longer be able to parse it.
     
  5. scrapefun

    Oh, yes. I didn't think of this, but it makes perfect sense.

    I assume I can make my own custom Google parser with Net::HTTP and use Google query URLs as the queries instead of plain keywords? What regex would extract the result URL (the part after /url?q=) from this:

    <li class="g"><h3 class="r"><a href="/url?q=http://en.m.wikipedia.org/wiki/Registry_cleaner&amp;sa=U
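
One way to pull the target URL out of a snippet like the one above, sketched in Python rather than in A-Parser's own settings. The pattern, the helper name `extract_links`, and the unescaping steps are illustrative assumptions based only on this one snippet; real result pages may be structured differently.

```python
import html
import re
from urllib.parse import unquote

# The snippet quoted in the post above.
SNIPPET = (
    '<li class="g"><h3 class="r"><a href="/url?q='
    'http://en.m.wikipedia.org/wiki/Registry_cleaner&amp;sa=U'
)

# Capture everything between "/url?q=" and the "&amp;sa=U" that follows it.
# The non-greedy (.*?) stops at the first "&amp;sa=U" after each match start.
LINK_RE = re.compile(r'<h3 class="r"><a href="/url\?q=(.*?)&amp;sa=U')


def extract_links(page: str) -> list[str]:
    """Return the captured URLs, HTML-unescaped and percent-decoded."""
    return [unquote(html.unescape(raw)) for raw in LINK_RE.findall(page)]


links = extract_links(SNIPPET)
print(links)
```

Anchoring the pattern on `<h3 class="r">` keeps it from matching other `/url?q=` links on the page, at the cost of breaking whenever that markup changes.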


    The issue I see is making sure a query is not counted as a "success" when a captcha or automated-query page is shown. Is this possible with the Net::HTTP parser, and how would I save those as failed queries? Some kind of result filter?

    Thanks for all the help. I am actually trying to figure out the regex myself, but for some reason the logic hasn't clicked with me yet.
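
The captcha concern above can be illustrated with a small check written in Python, independent of A-Parser. This is only a sketch: the function name `is_blocked` and the marker phrases are assumptions chosen for illustration, and the real block page wording may differ.

```python
# Phrases assumed to appear on a captcha / automated-traffic page (illustrative).
CAPTCHA_MARKERS = (
    "our systems have detected unusual traffic",
    "please type the characters below",
)


def is_blocked(status_code: int, body: str) -> bool:
    """Treat any non-200 response, or a page containing a marker phrase, as blocked."""
    if status_code != 200:
        return True
    text = body.lower()  # case-insensitive comparison
    return any(marker in text for marker in CAPTCHA_MARKERS)


print(is_blocked(503, ""))
print(is_blocked(200, '<h3 class="r"><a href="/url?q=http://example.com'))
```

A query for which `is_blocked` returns true would then be recorded as failed instead of being counted as a success.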
     
  6. scrapefun

    Here are my attempts at creating the custom Google parser.

    There are two versions. The regex seems to extract the URLs properly, but I am not sure whether it could be better or more accurate. The real difference between the two versions is how I try to detect a captcha or automated-queries page.

    I think both versions work in that regard, but I am not 100% sure they will properly save queries as failed when an automated-queries/captcha page is returned.

    Here is attempt 1. (I just noticed this one has an earlier regex that turned out to be inaccurate; both versions now use the regex shown in the second one below.)

    mobile1.png
    eyJwcmVzZXQiOiJDdXN0b20gR29vZ2xlIE1vYmlsZSBTY3JhcGVyIiwidmFsdWUi
    OnsicHJlc2V0IjoiQ3VzdG9tIEdvb2dsZSBNb2JpbGUgU2NyYXBlciIsInBhcnNl
    cnMiOltbIk5ldDo6SFRUUCIsImRlZmF1bHQiLHsidHlwZSI6Im92ZXJyaWRlIiwi
    aWQiOiJ1c2VyLWFnZW50IiwidmFsdWUiOiJNb3ppbGxhLzUuMCA7aVBob25lOyBD
    UFUgaVBob25lIE9TIDhfMV8yIGxpa2UgTWFjIE9TIFg7IEFwcGxlV2ViS2l0LzYw
    MC4xLjQgO0tIVE1MLCBsaWtlIEdlY2tvOyBWZXJzaW9uLzguMCBNb2JpbGUvMTJC
    NDQwIFNhZmFyaS82MDAuMS40In0seyJ0eXBlIjoiY3VzdG9tUmVzdWx0IiwicmVz
    dWx0IjpbInBhZ2VzIiwiZGF0YSJdLCJyZWdleCI6IlwiXFwvdXJsXFw/cT0oLio/
    KSZhbXA7c2E9VSIsInJlZ2V4VHlwZSI6ImlnIiwicmVzdWx0VHlwZSI6ImFycmF5
    IiwiYXJyYXlOYW1lIjoic2VycCIsInJlc3VsdHMiOlsibGlua3MiXX0seyJ0eXBl
    Ijoib3ZlcnJpZGUiLCJpZCI6ImZvcm1hdHJlc3VsdCIsInZhbHVlIjoiWyUgRk9S
    RUFDSCBzZXJwIC0lXSAkbG9vcC5jb3VudDskbGlua3MgXFxuWyUgRU5EICVdIn0s
    eyJ0eXBlIjoiZmlsdGVyIiwicmVzdWx0IjpbInBhZ2VzIiwiZGF0YSJdLCJmaWx0
    ZXJUeXBlIjoiY29udGFpbiIsInZhbHVlIjoib3VyIHN5c3RlbXMgaGF2ZSBkZXRl
    Y3RlZCB1bnVzdWFsIHRyYWZmaWMgZnJvbSB5b3VyIGNvbXB1dGVyIiwib3B0aW9u
    IjoiaW5zZW5zIn0seyJ0eXBlIjoiZmlsdGVyIiwicmVzdWx0IjpbInBhZ2VzIiwi
    ZGF0YSJdLCJmaWx0ZXJUeXBlIjoiY29udGFpbiIsInZhbHVlIjoidG8gY29udGlu
    dWUsIHBsZWFzZSB0eXBlIHRoZSBjaGFyYWN0ZXJzIGJlbG93Iiwib3B0aW9uIjoi
    aW5zZW5zIn0seyJ0eXBlIjoiZmlsdGVyIiwicmVzdWx0IjpbInBhZ2VzIiwiZGF0
    YSJdLCJmaWx0ZXJUeXBlIjoiY29udGFpbiIsInZhbHVlIjoiYnV0IHlvdXIgcXVl
    cnkgbG9va3Mgc2ltaWxhciB0byBhdXRvbWF0ZWQgcmVxdWVzdHMiLCJvcHRpb24i
    OiJpbnNlbnMifSx7InR5cGUiOiJmaWx0ZXIiLCJyZXN1bHQiOlsicGFnZXMiLCJk
    YXRhIl0sImZpbHRlclR5cGUiOiJjb250YWluIiwidmFsdWUiOiJidXQgeW91ciBj
    b21wdXRlciBvciBuZXR3b3JrIG1heSBiZSBzZW5kaW5nIGF1dG9tYXRlZCBxdWVy
    aWVzIiwib3B0aW9uIjoiaW5zZW5zIn0seyJ0eXBlIjoib3ZlcnJpZGUiLCJpZCI6
    InByb3h5YmFubmVkY2xlYW51cCIsInZhbHVlIjoiMCJ9LHsidHlwZSI6Im92ZXJy
    aWRlIiwiaWQiOiJwcm94eXJldHJpZXMiLCJ2YWx1ZSI6IjAifV1dLCJyZXN1bHRz
    Rm9ybWF0IjoiJHAxLnByZXNldCIsInJlc3VsdHNTYXZlVG8iOiJmaWxlIiwicmVz
    dWx0c0ZpbGVOYW1lIjoiJHF1ZXJ5LnR4dCIsImFkZGl0aW9uYWxGb3JtYXRzIjpb
    XSwicmVzdWx0c1VuaXF1ZSI6Im5vIiwicXVlcnlGb3JtYXQiOlsiJHF1ZXJ5Il0s
    InVuaXF1ZVF1ZXJpZXMiOmZhbHNlLCJzYXZlRmFpbGVkUXVlcmllcyI6dHJ1ZSwi
    aXRlcmF0b3JPcHRpb25zIjp7Im9uQWxsTGV2ZWxzIjpmYWxzZSwicXVlcnlCdWls
    ZGVyc0FmdGVySXRlcmF0b3IiOmZhbHNlfSwicmVzdWx0c09wdGlvbnMiOnsib3Zl
    cndyaXRlIjpmYWxzZX0sImRvTG9nIjoibm8iLCJrZWVwVW5pcXVlIjoiTm8iLCJt
    b3JlT3B0aW9ucyI6ZmFsc2UsInJlc3VsdHNQcmVwZW5kIjoiIiwicmVzdWx0c0Fw
    cGVuZCI6IiIsInF1ZXJ5QnVpbGRlcnMiOltdLCJyZXN1bHRzQnVpbGRlcnMiOltd
    LCJjb25maWdPdmVycmlkZXMiOltdfX0=



    Here is attempt 2:
    mobile2.png
    eyJwcmVzZXQiOiJHb29nbGUgTW9iaWxlIFNjcmFwZXIgMiIsInZhbHVlIjp7InBy
    ZXNldCI6Ikdvb2dsZSBNb2JpbGUgU2NyYXBlciAyIiwicGFyc2VycyI6W1siTmV0
    OjpIVFRQIiwiZGVmYXVsdCIseyJ0eXBlIjoib3ZlcnJpZGUiLCJpZCI6InVzZXIt
    YWdlbnQiLCJ2YWx1ZSI6Ik1vemlsbGEvNS4wIDtpUGhvbmU7IENQVSBpUGhvbmUg
    T1MgOF8xXzIgbGlrZSBNYWMgT1MgWDsgQXBwbGVXZWJLaXQvNjAwLjEuNCA7S0hU
    TUwsIGxpa2UgR2Vja287IFZlcnNpb24vOC4wIE1vYmlsZS8xMkI0NDAgU2FmYXJp
    LzYwMC4xLjQifSx7InR5cGUiOiJjdXN0b21SZXN1bHQiLCJyZXN1bHQiOlsicGFn
    ZXMiLCJkYXRhIl0sInJlZ2V4IjoiPGxpIGNsYXNzPVwiZ1wiPjxoMyBjbGFzcz1c
    InJcIj48YSBocmVmPVwiXFwvdXJsXFw/cT0oLio/KSZhbXA7c2E9VSIsInJlZ2V4
    VHlwZSI6ImlnIiwicmVzdWx0VHlwZSI6ImFycmF5IiwiYXJyYXlOYW1lIjoic2Vy
    cCIsInJlc3VsdHMiOlsibGlua3MiXX0seyJ0eXBlIjoib3ZlcnJpZGUiLCJpZCI6
    ImZvcm1hdHJlc3VsdCIsInZhbHVlIjoiWyUgRk9SRUFDSCBzZXJwIC0lXSAkbG9v
    cC5jb3VudDskbGlua3MgXFxuWyUgRU5EICVdIn0seyJ0eXBlIjoib3ZlcnJpZGUi
    LCJpZCI6Imdvb2RDb2RlIiwidmFsdWUiOjIwMH0seyJ0eXBlIjoib3ZlcnJpZGUi
    LCJpZCI6InByb3h5cmV0cmllcyIsInZhbHVlIjoiMCJ9XV0sInJlc3VsdHNGb3Jt
    YXQiOiIkcDEucHJlc2V0IiwicmVzdWx0c1NhdmVUbyI6ImZpbGUiLCJyZXN1bHRz
    RmlsZU5hbWUiOiIke3F1ZXJ5fS50eHQiLCJhZGRpdGlvbmFsRm9ybWF0cyI6W10s
    InJlc3VsdHNVbmlxdWUiOiJubyIsInF1ZXJ5Rm9ybWF0IjpbIiRxdWVyeSJdLCJ1
    bmlxdWVRdWVyaWVzIjpmYWxzZSwic2F2ZUZhaWxlZFF1ZXJpZXMiOnRydWUsIml0
    ZXJhdG9yT3B0aW9ucyI6eyJvbkFsbExldmVscyI6ZmFsc2UsInF1ZXJ5QnVpbGRl
    cnNBZnRlckl0ZXJhdG9yIjpmYWxzZX0sInJlc3VsdHNPcHRpb25zIjp7Im92ZXJ3
    cml0ZSI6ZmFsc2V9LCJkb0xvZyI6Im5vIiwia2VlcFVuaXF1ZSI6Ik5vIiwibW9y
    ZU9wdGlvbnMiOmZhbHNlLCJyZXN1bHRzUHJlcGVuZCI6IiIsInJlc3VsdHNBcHBl
    bmQiOiIiLCJxdWVyeUJ1aWxkZXJzIjpbXSwicmVzdWx0c0J1aWxkZXJzIjpbXSwi
    Y29uZmlnT3ZlcnJpZGVzIjpbXX19


    Is any of this correct? :) Is there a better way to do it?

    Thanks!
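
The preset codes in this thread appear to be base64-encoded JSON: the first line of attempt 1 decodes to the start of a JSON object. A minimal Python sketch for inspecting such a code outside A-Parser follows; `decode_preset` is a hypothetical helper name, not part of A-Parser, and the round-trip sample is made up.

```python
import base64
import json


def decode_preset(code: str) -> dict:
    """Decode a base64 preset code (possibly wrapped over several lines) into a dict."""
    compact = "".join(code.split())  # drop line breaks and indentation
    return json.loads(base64.b64decode(compact).decode("utf-8"))


# First line of attempt 1 from the post above, decoded on its own.
first_line = "eyJwcmVzZXQiOiJDdXN0b20gR29vZ2xlIE1vYmlsZSBTY3JhcGVyIiwidmFsdWUi"
prefix = base64.b64decode(first_line).decode("utf-8")
print(prefix)

# Round trip with a tiny made-up preset to exercise decode_preset.
sample = {"preset": "demo", "value": {"parsers": []}}
encoded = base64.b64encode(json.dumps(sample).encode("utf-8")).decode("ascii")
wrapped = "\n    ".join(encoded[i:i + 16] for i in range(0, len(encoded), 16))
decoded = decode_preset(wrapped)
print(decoded)
```

Passing a full code from the thread to `decode_preset` should show the overrides, regex, and result format it contains, which makes two presets easier to compare.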
     
  7. Support

    Support Administrator
    Staff Member A-Parser Enterprise

    Joined:
    Mar 16, 2012
    Messages:
    4,547
    Likes Received:
    2,164
    Try this solution.
    Code:
    eyJwcmVzZXQiOiJodHRwOi8vYS1wYXJzZXIuY29tL3RocmVhZHMvMTY5Ny8iLCJ2
    YWx1ZSI6eyJwcmVzZXQiOiJodHRwOi8vYS1wYXJzZXIuY29tL3RocmVhZHMvMTY5
    Ny8iLCJwYXJzZXJzIjpbWyJOZXQ6OkhUVFAiLCJkZWZhdWx0Iix7InR5cGUiOiJv
    dmVycmlkZSIsImlkIjoidXNlcHJveHkiLCJ2YWx1ZSI6dHJ1ZX0seyJ0eXBlIjoi
    b3ZlcnJpZGUiLCJpZCI6InByb3h5cmV0cmllcyIsInZhbHVlIjoiMjAifSx7InR5
    cGUiOiJvdmVycmlkZSIsImlkIjoiZm9ybWF0cmVzdWx0IiwidmFsdWUiOiJbJSBJ
    RiBpbmZvLnN1Y2Nlc3MgPT0gMSAlXVslIEZPUkVBQ0ggc2VycCAlXSRsaW5rXFxu
    WyUgRU5EICVdWyUgRU5EICVdIn0seyJ0eXBlIjoib3ZlcnJpZGUiLCJpZCI6InF1
    ZXJ5Zm9ybWF0IiwidmFsdWUiOiJodHRwOi8vd3d3Lmdvb2dsZS5jb20vc2VhcmNo
    P251bT0xMDAmcT0kcXVlcnkifSx7InR5cGUiOiJvdmVycmlkZSIsImlkIjoidXNl
    ci1hZ2VudCIsInZhbHVlIjoiTW96aWxsYS81LjA7IGlQaG9uZTsgQ1BVIGlQaG9u
    ZSBPUyA4XzFfMiBsaWtlIE1hYyBPUyBYOyBBcHBsZVdlYktpdC82MDAuMS40OyBL
    SFRNTCwgbGlrZSBHZWNrbzsgVmVyc2lvbi84LjAgTW9iaWxlLzEyQjQ0MCBTYWZh
    cmkvNjAwLjEuNCJ9LHsidHlwZSI6Im92ZXJyaWRlIiwiaWQiOiJnb29kQ29kZSIs
    InZhbHVlIjoyMDB9LHsidHlwZSI6ImN1c3RvbVJlc3VsdCIsInJlc3VsdCI6ImRh
    dGEiLCJyZWdleCI6IjxoMyBjbGFzcz1cInJcIj48YSBocmVmPVwiKC4rPylcIiIs
    InJlZ2V4VHlwZSI6ImciLCJyZXN1bHRUeXBlIjoiYXJyYXkiLCJhcnJheU5hbWUi
    OiJzZXJwIiwicmVzdWx0cyI6WyJsaW5rIl19XV0sInJlc3VsdHNGb3JtYXQiOiIk
    cDEucHJlc2V0IiwicmVzdWx0c1NhdmVUbyI6ImZpbGUiLCJyZXN1bHRzRmlsZU5h
    bWUiOiJjdXN0b21HcGFyc2VyLyR7cXVlcnl9LnR4dCIsImFkZGl0aW9uYWxGb3Jt
    YXRzIjpbWyJjdXN0b21HcGFyc2VyL2ZhaWxlZC50eHQiLCJbJSBJRiBwMS5pbmZv
    LnN1Y2Nlc3MgPT0gMCAlXSRxdWVyeVxcblslIEVORCAlXSJdXSwicmVzdWx0c1Vu
    aXF1ZSI6Im5vIiwicXVlcnlGb3JtYXQiOlsiJHF1ZXJ5Il0sInVuaXF1ZVF1ZXJp
    ZXMiOmZhbHNlLCJzYXZlRmFpbGVkUXVlcmllcyI6ZmFsc2UsIml0ZXJhdG9yT3B0
    aW9ucyI6eyJvbkFsbExldmVscyI6ZmFsc2UsInF1ZXJ5QnVpbGRlcnNBZnRlckl0
    ZXJhdG9yIjpmYWxzZX0sInJlc3VsdHNPcHRpb25zIjp7Im92ZXJ3cml0ZSI6ZmFs
    c2V9LCJkb0xvZyI6Im5vIiwia2VlcFVuaXF1ZSI6Ik5vIiwibW9yZU9wdGlvbnMi
    OmZhbHNlLCJyZXN1bHRzUHJlcGVuZCI6IiIsInJlc3VsdHNBcHBlbmQiOiIiLCJx
    dWVyeUJ1aWxkZXJzIjpbXSwicmVzdWx0c0J1aWxkZXJzIjpbXSwiY29uZmlnT3Zl
    cnJpZGVzIjpbXX19
    In this task, Google search results are fetched with Net::HTTP using the specified user agent. A response code of 200 is treated as a successful response; any other response causes the request to be retried, up to the specified number of retries (the CAPTCHA page is returned with code 503, so it is treated as a bad request and retried). All queries that do not get a successful response within the specified number of retries are written to the file failed.txt.
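
The retry behaviour described above can be sketched in plain Python. This is not A-Parser's implementation: `run_queries`, the `fetch` callable, and the fake responses are assumptions for illustration, and the retry count is just a parameter.

```python
from typing import Callable, Iterable

GOOD_CODE = 200  # only this status counts as success


def run_queries(
    queries: Iterable[str],
    fetch: Callable[[str], tuple[int, str]],
    retries: int = 20,
) -> tuple[dict[str, str], list[str]]:
    """Try each query up to `retries` times; collect bodies and failed queries."""
    results: dict[str, str] = {}
    failed: list[str] = []
    for query in queries:
        for _ in range(retries):
            status, body = fetch(query)
            if status == GOOD_CODE:
                results[query] = body
                break
        else:  # no attempt returned GOOD_CODE
            failed.append(query)
    return results, failed


# Fake fetcher: "a" gets 503 (captcha) twice and then 200; "b" always gets 503.
attempts: dict[str, int] = {}


def fake_fetch(query: str) -> tuple[int, str]:
    attempts[query] = attempts.get(query, 0) + 1
    if query == "a" and attempts[query] >= 3:
        return 200, "ok"
    return 503, "captcha"


results, failed = run_queries(["a", "b"], fake_fetch, retries=5)
print(results, failed)
```

In a real run, the `failed` list is what would be written to a file such as failed.txt, and each retry would normally go through a different proxy.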
     
  8. scrapefun

    This looks great! I'll give it a shot. Thanks so much for the great support
     
