How do you change the user agent used by the Google parser? I tried updating the user-agents.txt file, but it seems to have no effect, or I am doing something wrong. I am trying to get it to use an iPhone user agent to scrape mobile results. I can change the user agent fine with the Net::HTTP parser, since there is an option for it in the parser settings, but I see no way to do this with Google.
I've been testing some more, and it seems like the Google parser uses a pretty old user agent. When I use the test parser with the Google parser, the raw data results never show things like knowledge graph boxes. But if I use the test parser with the Net::HTTP parser, a Google search result URL, and a more recent user agent, the raw data results do include some of the more "modern" search elements like knowledge graph boxes. I have wanted a way to check for the knowledge graph boxes and scrape the search results, and thought the Google parser was not capable of it, but it seems I just need to be able to use a different user agent.
Sorry for all the posts, but I just came across this: http://a-parser.com/wiki/template-tools/ Is there any way to use [% tools.ua.random() %] with the Google parser?
It isn't possible to change the user agent for SE::Google, because that would change the source of the SERP page and the parser would no longer be able to parse it.
Oh, yes. I didn't think of this, but it makes perfect sense. I assume I can make my own custom Google parser with Net::HTTP and just use Google query URLs as the queries instead of plain keywords? What would be the regex to parse the URL (the part after /url?q= and before &sa=U) from this: <li class="g"><h3 class="r"><a href="/url?q=http://en.m.wikipedia.org/wiki/Registry_cleaner&sa=U The issue I see is making sure a query is not counted as a "success" if a captcha/automated query page is shown. Is this possible with the Net::HTTP parser, and how do I save those as failed queries? Some kind of result filter? Thanks for all the help. I did try to figure out the regex myself, but for some reason the logic hasn't clicked with me yet.
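For anyone who wants to test the regex idea outside A-Parser first, here is a minimal Python sketch. It assumes the markup quoted above (`<h3 class="r"><a href="/url?q=...&sa=U`) and allows the ampersand to appear either raw or as `&amp;`. The names `LINK_RE` and `extract_links` are made up for illustration, and Google's markup may well differ from this sample.

```python
import re
from typing import List
from urllib.parse import unquote

# Capture everything between "/url?q=" and the following "&sa=U".
# The ampersand may be raw ("&") or HTML-escaped ("&amp;") in the page source.
LINK_RE = re.compile(r'<h3 class="r"><a href="/url\?q=(.+?)&(?:amp;)?sa=U')


def extract_links(html: str) -> List[str]:
    """Return the target URLs found in the given result-page HTML."""
    # unquote() decodes percent-escapes such as %3F that may appear in the link.
    return [unquote(match) for match in LINK_RE.findall(html)]


sample = (
    '<li class="g"><h3 class="r">'
    '<a href="/url?q=http://en.m.wikipedia.org/wiki/Registry_cleaner&sa=U'
)
print(extract_links(sample))
```

The non-greedy `(.+?)` stops at the first `&sa=U`, so each match stays within a single result link.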
Here are my attempts to create the custom Google parser. There are two versions. The regex seems to properly extract the urls but not sure if it could be better/more accurate. The difference between the two versions is really how I try to detect whether or not a captcha/automate queries page is detected. I think both versions work in that regard but not 100% sure they will properly save queries as failed with an automated/captcha page Here is attempt 1: (Just Noticed This One Has A Earlier regex I tried but it was inaccurate both versions now use the regex shown in the 2nd one below) Spoiler: Google Mobile Parser 1 eyJwcmVzZXQiOiJDdXN0b20gR29vZ2xlIE1vYmlsZSBTY3JhcGVyIiwidmFsdWUi OnsicHJlc2V0IjoiQ3VzdG9tIEdvb2dsZSBNb2JpbGUgU2NyYXBlciIsInBhcnNl cnMiOltbIk5ldDo6SFRUUCIsImRlZmF1bHQiLHsidHlwZSI6Im92ZXJyaWRlIiwi aWQiOiJ1c2VyLWFnZW50IiwidmFsdWUiOiJNb3ppbGxhLzUuMCA7aVBob25lOyBD UFUgaVBob25lIE9TIDhfMV8yIGxpa2UgTWFjIE9TIFg7IEFwcGxlV2ViS2l0LzYw MC4xLjQgO0tIVE1MLCBsaWtlIEdlY2tvOyBWZXJzaW9uLzguMCBNb2JpbGUvMTJC NDQwIFNhZmFyaS82MDAuMS40In0seyJ0eXBlIjoiY3VzdG9tUmVzdWx0IiwicmVz dWx0IjpbInBhZ2VzIiwiZGF0YSJdLCJyZWdleCI6IlwiXFwvdXJsXFw/cT0oLio/ KSZhbXA7c2E9VSIsInJlZ2V4VHlwZSI6ImlnIiwicmVzdWx0VHlwZSI6ImFycmF5 IiwiYXJyYXlOYW1lIjoic2VycCIsInJlc3VsdHMiOlsibGlua3MiXX0seyJ0eXBl Ijoib3ZlcnJpZGUiLCJpZCI6ImZvcm1hdHJlc3VsdCIsInZhbHVlIjoiWyUgRk9S RUFDSCBzZXJwIC0lXSAkbG9vcC5jb3VudDskbGlua3MgXFxuWyUgRU5EICVdIn0s eyJ0eXBlIjoiZmlsdGVyIiwicmVzdWx0IjpbInBhZ2VzIiwiZGF0YSJdLCJmaWx0 ZXJUeXBlIjoiY29udGFpbiIsInZhbHVlIjoib3VyIHN5c3RlbXMgaGF2ZSBkZXRl Y3RlZCB1bnVzdWFsIHRyYWZmaWMgZnJvbSB5b3VyIGNvbXB1dGVyIiwib3B0aW9u IjoiaW5zZW5zIn0seyJ0eXBlIjoiZmlsdGVyIiwicmVzdWx0IjpbInBhZ2VzIiwi ZGF0YSJdLCJmaWx0ZXJUeXBlIjoiY29udGFpbiIsInZhbHVlIjoidG8gY29udGlu dWUsIHBsZWFzZSB0eXBlIHRoZSBjaGFyYWN0ZXJzIGJlbG93Iiwib3B0aW9uIjoi aW5zZW5zIn0seyJ0eXBlIjoiZmlsdGVyIiwicmVzdWx0IjpbInBhZ2VzIiwiZGF0 YSJdLCJmaWx0ZXJUeXBlIjoiY29udGFpbiIsInZhbHVlIjoiYnV0IHlvdXIgcXVl cnkgbG9va3Mgc2ltaWxhciB0byBhdXRvbWF0ZWQgcmVxdWVzdHMiLCJvcHRpb24i 
OiJpbnNlbnMifSx7InR5cGUiOiJmaWx0ZXIiLCJyZXN1bHQiOlsicGFnZXMiLCJk YXRhIl0sImZpbHRlclR5cGUiOiJjb250YWluIiwidmFsdWUiOiJidXQgeW91ciBj b21wdXRlciBvciBuZXR3b3JrIG1heSBiZSBzZW5kaW5nIGF1dG9tYXRlZCBxdWVy aWVzIiwib3B0aW9uIjoiaW5zZW5zIn0seyJ0eXBlIjoib3ZlcnJpZGUiLCJpZCI6 InByb3h5YmFubmVkY2xlYW51cCIsInZhbHVlIjoiMCJ9LHsidHlwZSI6Im92ZXJy aWRlIiwiaWQiOiJwcm94eXJldHJpZXMiLCJ2YWx1ZSI6IjAifV1dLCJyZXN1bHRz Rm9ybWF0IjoiJHAxLnByZXNldCIsInJlc3VsdHNTYXZlVG8iOiJmaWxlIiwicmVz dWx0c0ZpbGVOYW1lIjoiJHF1ZXJ5LnR4dCIsImFkZGl0aW9uYWxGb3JtYXRzIjpb XSwicmVzdWx0c1VuaXF1ZSI6Im5vIiwicXVlcnlGb3JtYXQiOlsiJHF1ZXJ5Il0s InVuaXF1ZVF1ZXJpZXMiOmZhbHNlLCJzYXZlRmFpbGVkUXVlcmllcyI6dHJ1ZSwi aXRlcmF0b3JPcHRpb25zIjp7Im9uQWxsTGV2ZWxzIjpmYWxzZSwicXVlcnlCdWls ZGVyc0FmdGVySXRlcmF0b3IiOmZhbHNlfSwicmVzdWx0c09wdGlvbnMiOnsib3Zl cndyaXRlIjpmYWxzZX0sImRvTG9nIjoibm8iLCJrZWVwVW5pcXVlIjoiTm8iLCJt b3JlT3B0aW9ucyI6ZmFsc2UsInJlc3VsdHNQcmVwZW5kIjoiIiwicmVzdWx0c0Fw cGVuZCI6IiIsInF1ZXJ5QnVpbGRlcnMiOltdLCJyZXN1bHRzQnVpbGRlcnMiOltd LCJjb25maWdPdmVycmlkZXMiOltdfX0= Here is attempt 2: Spoiler: Google Mobile Scraper 2 eyJwcmVzZXQiOiJHb29nbGUgTW9iaWxlIFNjcmFwZXIgMiIsInZhbHVlIjp7InBy ZXNldCI6Ikdvb2dsZSBNb2JpbGUgU2NyYXBlciAyIiwicGFyc2VycyI6W1siTmV0 OjpIVFRQIiwiZGVmYXVsdCIseyJ0eXBlIjoib3ZlcnJpZGUiLCJpZCI6InVzZXIt YWdlbnQiLCJ2YWx1ZSI6Ik1vemlsbGEvNS4wIDtpUGhvbmU7IENQVSBpUGhvbmUg T1MgOF8xXzIgbGlrZSBNYWMgT1MgWDsgQXBwbGVXZWJLaXQvNjAwLjEuNCA7S0hU TUwsIGxpa2UgR2Vja287IFZlcnNpb24vOC4wIE1vYmlsZS8xMkI0NDAgU2FmYXJp LzYwMC4xLjQifSx7InR5cGUiOiJjdXN0b21SZXN1bHQiLCJyZXN1bHQiOlsicGFn ZXMiLCJkYXRhIl0sInJlZ2V4IjoiPGxpIGNsYXNzPVwiZ1wiPjxoMyBjbGFzcz1c InJcIj48YSBocmVmPVwiXFwvdXJsXFw/cT0oLio/KSZhbXA7c2E9VSIsInJlZ2V4 VHlwZSI6ImlnIiwicmVzdWx0VHlwZSI6ImFycmF5IiwiYXJyYXlOYW1lIjoic2Vy cCIsInJlc3VsdHMiOlsibGlua3MiXX0seyJ0eXBlIjoib3ZlcnJpZGUiLCJpZCI6 ImZvcm1hdHJlc3VsdCIsInZhbHVlIjoiWyUgRk9SRUFDSCBzZXJwIC0lXSAkbG9v cC5jb3VudDskbGlua3MgXFxuWyUgRU5EICVdIn0seyJ0eXBlIjoib3ZlcnJpZGUi LCJpZCI6Imdvb2RDb2RlIiwidmFsdWUiOjIwMH0seyJ0eXBlIjoib3ZlcnJpZGUi 
LCJpZCI6InByb3h5cmV0cmllcyIsInZhbHVlIjoiMCJ9XV0sInJlc3VsdHNGb3Jt YXQiOiIkcDEucHJlc2V0IiwicmVzdWx0c1NhdmVUbyI6ImZpbGUiLCJyZXN1bHRz RmlsZU5hbWUiOiIke3F1ZXJ5fS50eHQiLCJhZGRpdGlvbmFsRm9ybWF0cyI6W10s InJlc3VsdHNVbmlxdWUiOiJubyIsInF1ZXJ5Rm9ybWF0IjpbIiRxdWVyeSJdLCJ1 bmlxdWVRdWVyaWVzIjpmYWxzZSwic2F2ZUZhaWxlZFF1ZXJpZXMiOnRydWUsIml0 ZXJhdG9yT3B0aW9ucyI6eyJvbkFsbExldmVscyI6ZmFsc2UsInF1ZXJ5QnVpbGRl cnNBZnRlckl0ZXJhdG9yIjpmYWxzZX0sInJlc3VsdHNPcHRpb25zIjp7Im92ZXJ3 cml0ZSI6ZmFsc2V9LCJkb0xvZyI6Im5vIiwia2VlcFVuaXF1ZSI6Ik5vIiwibW9y ZU9wdGlvbnMiOmZhbHNlLCJyZXN1bHRzUHJlcGVuZCI6IiIsInJlc3VsdHNBcHBl bmQiOiIiLCJxdWVyeUJ1aWxkZXJzIjpbXSwicmVzdWx0c0J1aWxkZXJzIjpbXSwi Y29uZmlnT3ZlcnJpZGVzIjpbXX19 Any of this correct? Better way to do it? Thanks!
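The export codes in this thread appear to be base64-encoded JSON (they start with `eyJ`, which is how `{"` encodes). Assuming that holds, a small Python sketch like the one below can turn a pasted code back into readable settings for comparison. The function name `decode_preset` and the tiny demo preset are hypothetical, not taken from A-Parser.

```python
import base64
import json


def decode_preset(code: str) -> dict:
    """Decode a pasted export code, assuming it is base64-encoded JSON."""
    # Forum posts wrap the code with spaces and newlines; remove them first.
    compact = "".join(code.split())
    return json.loads(base64.b64decode(compact))


# Demo with a made-up preset, encoded and then wrapped the way a forum would.
demo_json = json.dumps({"preset": "demo", "value": {"saveFailedQueries": True}})
encoded = base64.b64encode(demo_json.encode("utf-8")).decode("ascii")
wrapped = " ".join(encoded[i:i + 16] for i in range(0, len(encoded), 16))

print(json.dumps(decode_preset(wrapped), indent=2))
```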
Try this solution. Spoiler: Code for import Code: eyJwcmVzZXQiOiJodHRwOi8vYS1wYXJzZXIuY29tL3RocmVhZHMvMTY5Ny8iLCJ2 YWx1ZSI6eyJwcmVzZXQiOiJodHRwOi8vYS1wYXJzZXIuY29tL3RocmVhZHMvMTY5 Ny8iLCJwYXJzZXJzIjpbWyJOZXQ6OkhUVFAiLCJkZWZhdWx0Iix7InR5cGUiOiJv dmVycmlkZSIsImlkIjoidXNlcHJveHkiLCJ2YWx1ZSI6dHJ1ZX0seyJ0eXBlIjoi b3ZlcnJpZGUiLCJpZCI6InByb3h5cmV0cmllcyIsInZhbHVlIjoiMjAifSx7InR5 cGUiOiJvdmVycmlkZSIsImlkIjoiZm9ybWF0cmVzdWx0IiwidmFsdWUiOiJbJSBJ RiBpbmZvLnN1Y2Nlc3MgPT0gMSAlXVslIEZPUkVBQ0ggc2VycCAlXSRsaW5rXFxu WyUgRU5EICVdWyUgRU5EICVdIn0seyJ0eXBlIjoib3ZlcnJpZGUiLCJpZCI6InF1 ZXJ5Zm9ybWF0IiwidmFsdWUiOiJodHRwOi8vd3d3Lmdvb2dsZS5jb20vc2VhcmNo P251bT0xMDAmcT0kcXVlcnkifSx7InR5cGUiOiJvdmVycmlkZSIsImlkIjoidXNl ci1hZ2VudCIsInZhbHVlIjoiTW96aWxsYS81LjA7IGlQaG9uZTsgQ1BVIGlQaG9u ZSBPUyA4XzFfMiBsaWtlIE1hYyBPUyBYOyBBcHBsZVdlYktpdC82MDAuMS40OyBL SFRNTCwgbGlrZSBHZWNrbzsgVmVyc2lvbi84LjAgTW9iaWxlLzEyQjQ0MCBTYWZh cmkvNjAwLjEuNCJ9LHsidHlwZSI6Im92ZXJyaWRlIiwiaWQiOiJnb29kQ29kZSIs InZhbHVlIjoyMDB9LHsidHlwZSI6ImN1c3RvbVJlc3VsdCIsInJlc3VsdCI6ImRh dGEiLCJyZWdleCI6IjxoMyBjbGFzcz1cInJcIj48YSBocmVmPVwiKC4rPylcIiIs InJlZ2V4VHlwZSI6ImciLCJyZXN1bHRUeXBlIjoiYXJyYXkiLCJhcnJheU5hbWUi OiJzZXJwIiwicmVzdWx0cyI6WyJsaW5rIl19XV0sInJlc3VsdHNGb3JtYXQiOiIk cDEucHJlc2V0IiwicmVzdWx0c1NhdmVUbyI6ImZpbGUiLCJyZXN1bHRzRmlsZU5h bWUiOiJjdXN0b21HcGFyc2VyLyR7cXVlcnl9LnR4dCIsImFkZGl0aW9uYWxGb3Jt YXRzIjpbWyJjdXN0b21HcGFyc2VyL2ZhaWxlZC50eHQiLCJbJSBJRiBwMS5pbmZv LnN1Y2Nlc3MgPT0gMCAlXSRxdWVyeVxcblslIEVORCAlXSJdXSwicmVzdWx0c1Vu aXF1ZSI6Im5vIiwicXVlcnlGb3JtYXQiOlsiJHF1ZXJ5Il0sInVuaXF1ZVF1ZXJp ZXMiOmZhbHNlLCJzYXZlRmFpbGVkUXVlcmllcyI6ZmFsc2UsIml0ZXJhdG9yT3B0 aW9ucyI6eyJvbkFsbExldmVscyI6ZmFsc2UsInF1ZXJ5QnVpbGRlcnNBZnRlckl0 ZXJhdG9yIjpmYWxzZX0sInJlc3VsdHNPcHRpb25zIjp7Im92ZXJ3cml0ZSI6ZmFs c2V9LCJkb0xvZyI6Im5vIiwia2VlcFVuaXF1ZSI6Ik5vIiwibW9yZU9wdGlvbnMi OmZhbHNlLCJyZXN1bHRzUHJlcGVuZCI6IiIsInJlc3VsdHNBcHBlbmQiOiIiLCJx dWVyeUJ1aWxkZXJzIjpbXSwicmVzdWx0c0J1aWxkZXJzIjpbXSwiY29uZmlnT3Zl cnJpZGVzIjpbXX19 In this task, the issuance of Google search is 
parsed using Net::HTTP with the specified user agent. A response code of 200 is set as the successful response; any other code causes the request to be retried up to the specified number of times (a CAPTCHA page is returned with code 503, so it is treated as a bad request and retried). All queries that do not get a successful response within the specified number of retries are written to the file failed.txt.
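To make the retry logic easier to follow, here is a rough Python sketch of the same idea outside A-Parser. It is not how A-Parser implements it internally. The `fetch` callable, the function names, and treating `retries` as the total number of attempts are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

GOOD_CODE = 200  # only this status counts as success; a CAPTCHA page comes back as 503

# Hypothetical signature: fetch(query) returns (status_code, body).
Fetch = Callable[[str], Tuple[int, str]]


def fetch_with_retries(query: str, fetch: Fetch, retries: int) -> Optional[str]:
    """Try the query up to `retries` times; return the body, or None if all attempts fail."""
    for _ in range(retries):
        status, body = fetch(query)
        if status == GOOD_CODE:
            return body
    return None


def run(queries: List[str], fetch: Fetch, retries: int = 20) -> Tuple[Dict[str, str], List[str]]:
    """Split queries into successful results and a failed list."""
    results: Dict[str, str] = {}
    failed: List[str] = []
    for query in queries:
        body = fetch_with_retries(query, fetch, retries)
        if body is None:
            failed.append(query)  # these would be written to failed.txt
        else:
            results[query] = body
    return results, failed
```

In real use `fetch` would send the request through a proxy with the mobile user agent, and the `failed` list would be saved to a file so those queries can be run again later.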