Before I even attempt this: is it possible to download the full source of a page and save it to a file, AND also parse that page for certain data and save that to a file in JSON format? (I already download the source to a file; I just need to add the parse-to-JSON part.) So it would be:
1: Visit the page, grab the source and save it to a file
2: Extract data from that same page and save it as JSON
If it is possible, I will have a lot more questions, lol.
Yes, it is possible. To output JSON, use the .json method. Here is a preset that parses a site and saves both the source code and the results in JSON format (using Wikipedia as the example): Spoiler: Preset for import Code: eyJwcmVzZXQiOiJkZWZhdWx0IiwidmFsdWUiOnsicHJlc2V0IjoiZGVmYXVsdCIs InBhcnNlcnMiOltbIk5ldDo6SFRUUCIsImRlZmF1bHQiLHsidHlwZSI6Im92ZXJy aWRlIiwiaWQiOiJmb3JtYXRyZXN1bHQiLCJ2YWx1ZSI6IiR0aXRsZS5qc29uXFxu JHRvcDEwLmpzb24ifSx7InR5cGUiOiJjdXN0b21SZXN1bHQiLCJyZXN1bHQiOiJk YXRhIiwicmVnZXgiOiI8dGl0bGU+KC4rPyk8L3RpdGxlPiIsInJlZ2V4VHlwZSI6 IiIsInJlc3VsdFR5cGUiOiJmbGF0IiwiYXJyYXlOYW1lIjoiIiwicmVzdWx0cyI6 WyJ0aXRsZSJdfSx7InR5cGUiOiJjdXN0b21SZXN1bHQiLCJyZXN1bHQiOiJkYXRh IiwicmVnZXgiOiI8IS0tICguKz9ocikgLS0+IiwicmVnZXhUeXBlIjoiZyIsInJl c3VsdFR5cGUiOiJhcnJheSIsImFycmF5TmFtZSI6InRvcDEwIiwicmVzdWx0cyI6 WyJsYW5nIl19XV0sInJlc3VsdHNGb3JtYXQiOiIkcDEucHJlc2V0IiwicmVzdWx0 c1NhdmVUbyI6ImZpbGUiLCJyZXN1bHRzRmlsZU5hbWUiOiJ3aWtpL2pzb24udHh0 IiwiYWRkaXRpb25hbEZvcm1hdHMiOltbIndpa2kvc291cmNlLnR4dCIsIiRwMS5k YXRhIl1dLCJyZXN1bHRzVW5pcXVlIjoibm8iLCJxdWVyeUZvcm1hdCI6WyIkcXVl cnkiXSwidW5pcXVlUXVlcmllcyI6ZmFsc2UsInNhdmVGYWlsZWRRdWVyaWVzIjpm YWxzZSwiaXRlcmF0b3JPcHRpb25zIjp7Im9uQWxsTGV2ZWxzIjpmYWxzZSwicXVl cnlCdWlsZGVyc0FmdGVySXRlcmF0b3IiOmZhbHNlfSwicmVzdWx0c09wdGlvbnMi Onsib3ZlcndyaXRlIjpmYWxzZX0sImRvTG9nIjoiZGIiLCJrZWVwVW5pcXVlIjoi Tm8iLCJtb3JlT3B0aW9ucyI6ZmFsc2UsInJlc3VsdHNQcmVwZW5kIjoiIiwicmVz dWx0c0FwcGVuZCI6IiIsInF1ZXJ5QnVpbGRlcnMiOltdLCJyZXN1bHRzQnVpbGRl cnMiOltdLCJjb25maWdPdmVycmlkZXMiOltdfX0=
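For anyone who wants to see the two-step idea outside the parser, here is a minimal Python sketch of "save the source, then extract data from the same page and save it as JSON". It is not what the preset does internally: the function name `save_source_and_json`, the file names, and the sample HTML are assumptions made up for illustration, and the page is passed in as a string rather than downloaded.

```python
import json
import re
import tempfile
from pathlib import Path

# Lazy match on the page title; re.S lets "." span newlines.
TITLE_RE = re.compile(r"<title>(.+?)</title>", re.S)

# Hypothetical page source standing in for a downloaded page.
SAMPLE = "<html><head><title>Wikipedia</title></head><body>...</body></html>"


def save_source_and_json(html, out_dir):
    """Write the raw source and a JSON file with the extracted title."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Step 1: save the untouched source.
    source_path = out / "source.txt"
    source_path.write_text(html, encoding="utf-8")

    # Step 2: extract data from the same source and save it as JSON.
    match = TITLE_RE.search(html)
    data = {"title": match.group(1).strip() if match else None}
    json_path = out / "json.txt"
    json_path.write_text(json.dumps(data, ensure_ascii=False), encoding="utf-8")

    return source_path, json_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        _, json_file = save_source_and_json(SAMPLE, tmp)
        print(json_file.read_text(encoding="utf-8"))
```

Both files are written from the same string, so the page only has to be fetched once.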
As per usual, I am stuck on all the regex I need to do. Here is a sample query: https://www.google.co.uk/search?q=keywrd+planner&pws=0&uule=w+CAIQICINVW5pdGVkIFN0YXRlcw&num=20 I am using this as the user agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0 I need to get the title and link from each SERP result, but I also need to extract the data highlighted in the images below. Any help would be greatly appreciated. None of the regex I have tried works. Thanks
Regex for spell:
Code: <a class="spell".+?>(.+?)<\/a>
Regex for spell_orig:
Code: <a class="spell_orig".+?>(.+?)<\/a>
Any HTML tags left in the captured text can be removed with the Results builder.
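The two patterns can be tried outside the parser with a short Python sketch like the one below. The sample markup is hypothetical (real Google markup differs and changes over time), and `strip_tags` is only a rough stand-in for the "remove HTML" step of the Results builder, not its actual implementation.

```python
import re
from html import unescape

# Hypothetical fragment, loosely modelled on the markup the patterns target.
SAMPLE_HTML = (
    '<p>Showing results for <a class="spell" href="/search?q=keyword+planner">'
    '<b><i>keyword</i></b> planner</a><br>'
    'Search instead for <a class="spell_orig" href="/search?q=keywrd+planner">'
    'keywrd planner</a></p>'
)

# Same patterns as in the post; "/" needs no escaping in Python.
SPELL_RE = re.compile(r'<a class="spell".+?>(.+?)</a>', re.S)
SPELL_ORIG_RE = re.compile(r'<a class="spell_orig".+?>(.+?)</a>', re.S)


def strip_tags(fragment):
    """Drop anything that looks like a tag and decode HTML entities."""
    return unescape(re.sub(r"<[^>]+>", "", fragment)).strip()


def first_match(pattern, html):
    """Return the cleaned first capture, or None if the pattern is absent."""
    match = pattern.search(html)
    return strip_tags(match.group(1)) if match else None


if __name__ == "__main__":
    print(first_match(SPELL_RE, SAMPLE_HTML))
    print(first_match(SPELL_ORIG_RE, SAMPLE_HTML))
```

Because the first pattern requires the closing quote right after `spell`, it does not accidentally match the `spell_orig` link.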
Thanks! I have figured out most of the regex thanks to your examples except for when I try to extract the related keywords from the screenshot in my previous post. I have this: This works for grabbing the first suggestion from each of the two columns but does not grab all of them. Next I'm not sure how to properly format the json file. I want to create a json file for each keyword/query that has a layout something like this: Finally, I want to save the json file and a raw file containing the source code in different directories and then create a new directory every 5000 queries and save any failed queries to a separate file. I have code for this in another custom parser that I used below but not sure it translates to this new one. Here is everything I have so far: Spoiler Code: eyJwcmVzZXQiOiJHb29nbGUgUmF3ICYgUGFyc2UgVG8gSlNPTiIsInZhbHVlIjp7 InByZXNldCI6Ikdvb2dsZSBSYXcgJiBQYXJzZSBUbyBKU09OIiwicGFyc2VycyI6 W1siTmV0OjpIVFRQIiwiZGVmYXVsdCIseyJ0eXBlIjoib3ZlcnJpZGUiLCJpZCI6 ImZvcm1hdHJlc3VsdCIsInZhbHVlIjoiJHF1ZXJ5Lmpzb25cXG4kbG9vcC5jb3Vu dC5qc29uXFxuJHNlcnAuanNvblxcbiR0b3AxMC5qc29uXFxuJHNwZWxsLmpzb25c XG4kc3BlbGxfb3JpZ2luYWwuanNvblxcbiRyZWxhdGVkLmpzb25cXG4ifSx7InR5 cGUiOiJjdXN0b21SZXN1bHQiLCJyZXN1bHQiOiJkYXRhIiwicmVnZXgiOiI8aDMg Y2xhc3M9XCJyXCI+PGEgaHJlZj0uKz8+KC4rPyk8XFwvYT4iLCJyZWdleFR5cGUi OiJnIiwicmVzdWx0VHlwZSI6ImFycmF5IiwiYXJyYXlOYW1lIjoic2VycCIsInJl c3VsdHMiOlsidGl0bGUiXX0seyJ0eXBlIjoiY3VzdG9tUmVzdWx0IiwicmVzdWx0 IjoiZGF0YSIsInJlZ2V4IjoiPGgzIGNsYXNzPVwiclwiPjxhIGhyZWY9XCIoLis/ KVwiIiwicmVnZXhUeXBlIjoiZyIsInJlc3VsdFR5cGUiOiJhcnJheSIsImFycmF5 TmFtZSI6InRvcDEwIiwicmVzdWx0cyI6WyJsaW5rIl19LHsidHlwZSI6Im92ZXJy aWRlIiwiaWQiOiJ1c2VyLWFnZW50IiwidmFsdWUiOiJNb3ppbGxhLzUuMCAoV2lu ZG93cyBOVCA2LjE7IFdPVzY0OyBydjozOS4wKSBHZWNrby8yMDEwMDEwMSBGaXJl Zm94LzM5LjAifSx7InR5cGUiOiJvdmVycmlkZSIsImlkIjoiZ29vZENvZGUiLCJ2 YWx1ZSI6MjAwfSx7InR5cGUiOiJvdmVycmlkZSIsImlkIjoicXVlcnlmb3JtYXQi LCJ2YWx1ZSI6Imh0dHBzOi8vd3d3Lmdvb2dsZS5jby51ay9zZWFyY2g/cT0kcXVl 
cnkmcHdzPTAmdXVsZT13K0NBSVFJQ0lOVlc1cGRHVmtJRk4wWVhSbGN3Jm51bT0y MCJ9LHsidHlwZSI6ImN1c3RvbVJlc3VsdCIsInJlc3VsdCI6ImRhdGEiLCJyZWdl eCI6IjxhIGNsYXNzPVwic3BlbGxcIi4rPz4oLis/KTxcXC9hPiIsInJlZ2V4VHlw ZSI6InMiLCJyZXN1bHRUeXBlIjoiZmxhdCIsImFycmF5TmFtZSI6IiIsInJlc3Vs dHMiOlsic3BlbGwiXX0seyJ0eXBlIjoiY3VzdG9tUmVzdWx0IiwicmVzdWx0Ijoi ZGF0YSIsInJlZ2V4IjoiPGEgY2xhc3M9XCJzcGVsbFwiLis/PiguKz8pPFxcL2E+ IiwicmVnZXhUeXBlIjoicyIsInJlc3VsdFR5cGUiOiJmbGF0IiwiYXJyYXlOYW1l IjoiIiwicmVzdWx0cyI6WyJzcGVsbF9vcmlnaW5hbCJdfSx7InR5cGUiOiJjdXN0 b21SZXN1bHQiLCJyZXN1bHQiOiJkYXRhIiwicmVnZXgiOiI8ZGl2IGNsYXNzPVwi YnJzX2NvbFwiPjxwIGNsYXNzPVwiX2U0YlwiPjxhIGhyZWY9Lis/PiguKz8pPFxc L2E+PFxcL3A+IiwicmVnZXhUeXBlIjoiZyIsInJlc3VsdFR5cGUiOiJhcnJheSIs ImFycmF5TmFtZSI6InJlbGF0ZWQiLCJyZXN1bHRzIjpbInN1Z2dlc3Rpb25zIl19 XV0sInJlc3VsdHNGb3JtYXQiOiIkcDEucHJlc2V0IiwicmVzdWx0c1NhdmVUbyI6 ImZpbGUiLCJyZXN1bHRzRmlsZU5hbWUiOiJzZXJwX2pzb24vWyUgSUYgcDEuaW5m by5zdWNjZXNzID09IDEgJV1bJSBVU0UgTWF0aDsgXCJ1c19cIl8gTWF0aC5pbnQo cXVlcnkubnVtIC8gNTAwMCkgX1wiL1wiXyBxdWVyeSBfXCIuanNvblwiICVdWyUg RU5EICVdIiwiYWRkaXRpb25hbEZvcm1hdHMiOltbInNlcnBfcmF3L1slIElGIHAx LmluZm8uc3VjY2VzcyA9PSAxICVdWyUgVVNFIE1hdGg7IFwidXNfXCJfIE1hdGgu aW50KHF1ZXJ5Lm51bSAvIDUwMDApIF9cIi9cIl8gcXVlcnkgX1wiLmh0bWxcIiAl XVslIEVORCAlXSIsIiRwMS5kYXRhIl0sWyJzZXJwX2ZhaWwvZmFpbGVkLnR4dCIs IlslIElGIHAxLmluZm8uc3VjY2VzcyA9PSAwICVdJHF1ZXJ5XFxuWyUgRU5EICVd Il1dLCJyZXN1bHRzVW5pcXVlIjoibm8iLCJxdWVyeUZvcm1hdCI6WyIkcXVlcnki XSwidW5pcXVlUXVlcmllcyI6ZmFsc2UsInNhdmVGYWlsZWRRdWVyaWVzIjpmYWxz ZSwiaXRlcmF0b3JPcHRpb25zIjp7Im9uQWxsTGV2ZWxzIjpmYWxzZSwicXVlcnlC dWlsZGVyc0FmdGVySXRlcmF0b3IiOmZhbHNlfSwicmVzdWx0c09wdGlvbnMiOnsi b3ZlcndyaXRlIjpmYWxzZX0sImRvTG9nIjoibm8iLCJrZWVwVW5pcXVlIjoiTm8i LCJtb3JlT3B0aW9ucyI6ZmFsc2UsInJlc3VsdHNQcmVwZW5kIjoiIiwicmVzdWx0 c0FwcGVuZCI6IiIsInF1ZXJ5QnVpbGRlcnMiOltdLCJyZXN1bHRzQnVpbGRlcnMi Olt7InNvdXJjZSI6WzAsInNwZWxsIl0sInR5cGUiOiJyZW1vdmVIdG1sIiwidG8i OiJzcGVsbCJ9LHsic291cmNlIjpbMCwic3BlbGxfb3JpZ2luYWwiXSwidHlwZSI6 
InJlbW92ZUh0bWwiLCJ0byI6Im9yaWdpbmFsIn1dLCJjb25maWdPdmVycmlkZXMi OltdfX0= Thanks! I'm always amazed what this software can do but more amazed by the support!
The <div class="brs_col"> part of that regular expression is superfluous. It occurs only once per column, which is why only the first suggestion in each column was matched. Use this instead:
Code: <p class="_e4b"><a href=.+?>(.+?)<\/a><\/p>
For the JSON, you need to create a variable that holds all the data and then output that variable as JSON:
Code: [% result.spell = p1.spell; result.spell_original = p1.spellorig; result.suggestions = p1.related; result.serp = p1.serp; result.json() %]
The file-saving part of your preset is already done correctly. Putting it all together gives this preset: Spoiler: Code for import Code: eyJwcmVzZXQiOiJodHRwOi8vYS1wYXJzZXIuY29tL3RocmVhZHMvMTc5Mi8iLCJ2 YWx1ZSI6eyJwcmVzZXQiOiJodHRwOi8vYS1wYXJzZXIuY29tL3RocmVhZHMvMTc5 Mi8iLCJwYXJzZXJzIjpbWyJOZXQ6OkhUVFAiLCJkZWZhdWx0Iix7InR5cGUiOiJv dmVycmlkZSIsImlkIjoiZm9ybWF0cmVzdWx0IiwidmFsdWUiOiJbJSByZXN1bHQu c3BlbGwgPSBwMS5zcGVsbDtcbnJlc3VsdC5zcGVsbF9vcmlnaW5hbCA9IHAxLnNw ZWxsb3JpZztcbnJlc3VsdC5zdWdnZXN0aW9ucyA9IHAxLnJlbGF0ZWQ7XG5yZXN1 bHQuc2VycCA9IHAxLnNlcnA7XG5yZXN1bHQuanNvbigpICVdIn0seyJ0eXBlIjoi Y3VzdG9tUmVzdWx0IiwicmVzdWx0IjoiZGF0YSIsInJlZ2V4IjoiPGgzIGNsYXNz PVwiclwiPjxhIGhyZWY9XCIoLis/KVwiIG9ubW91c2Vkb3duLis/LCcoXFxkKykn LC4rP1wiPiguKz8pPFxcL2E+IiwicmVnZXhUeXBlIjoiZyIsInJlc3VsdFR5cGUi OiJhcnJheSIsImFycmF5TmFtZSI6InNlcnAiLCJyZXN1bHRzIjpbImxpbmsiLCJy YW5rIiwidGl0bGUiXX0seyJ0eXBlIjoib3ZlcnJpZGUiLCJpZCI6InVzZXItYWdl bnQiLCJ2YWx1ZSI6Ik1vemlsbGEvNS4wIChXaW5kb3dzIE5UIDYuMTsgV09XNjQ7 IHJ2OjM5LjApIEdlY2tvLzIwMTAwMTAxIEZpcmVmb3gvMzkuMCJ9LHsidHlwZSI6 Im92ZXJyaWRlIiwiaWQiOiJnb29kQ29kZSIsInZhbHVlIjoyMDB9LHsidHlwZSI6 Im92ZXJyaWRlIiwiaWQiOiJxdWVyeWZvcm1hdCIsInZhbHVlIjoiaHR0cHM6Ly93 d3cuZ29vZ2xlLmNvLnVrL3NlYXJjaD9xPSRxdWVyeSZwd3M9MCZ1dWxlPXcrQ0FJ UUlDSU5WVzVwZEdWa0lGTjBZWFJsY3cmbnVtPTIwIn0seyJ0eXBlIjoiY3VzdG9t UmVzdWx0IiwicmVzdWx0IjoiZGF0YSIsInJlZ2V4IjoiPGEgY2xhc3M9XCJzcGVs bFwiLis/PiguKz8pPFxcL2E+IiwicmVnZXhUeXBlIjoicyIsInJlc3VsdFR5cGUi OiJmbGF0IiwiYXJyYXlOYW1lIjoiIiwicmVzdWx0cyI6WyJzcGVsbCJdfSx7InR5 cGUiOiJjdXN0b21SZXN1bHQiLCJyZXN1bHQiOiJkYXRhIiwicmVnZXgiOiI8cCBj bGFzcz1cIl9lNGJcIj48YSBocmVmPS4rPz4oLis/KTxcXC9hPjxcXC9wPiIsInJl 
Z2V4VHlwZSI6ImciLCJyZXN1bHRUeXBlIjoiYXJyYXkiLCJhcnJheU5hbWUiOiJy ZWxhdGVkIiwicmVzdWx0cyI6WyJzdWdnZXN0aW9ucyJdfSx7InR5cGUiOiJjdXN0 b21SZXN1bHQiLCJyZXN1bHQiOiJkYXRhIiwicmVnZXgiOiI8YSBjbGFzcz1cInNw ZWxsX29yaWdcIi4rPz4oLis/KTxcXC9hPi4rIiwicmVnZXhUeXBlIjoicyIsInJl c3VsdFR5cGUiOiJmbGF0IiwiYXJyYXlOYW1lIjoiIiwicmVzdWx0cyI6WyJzcGVs bG9yaWciXX1dXSwicmVzdWx0c0Zvcm1hdCI6IiRwMS5wcmVzZXQiLCJyZXN1bHRz U2F2ZVRvIjoiZmlsZSIsInJlc3VsdHNGaWxlTmFtZSI6InNlcnBfanNvbi9bJSBJ RiBwMS5pbmZvLnN1Y2Nlc3MgPT0gMSAlXVslIFVTRSBNYXRoOyBcInVzX1wiXyBN YXRoLmludChxdWVyeS5udW0gLyA1MDAwKSBfXCIvXCJfIHF1ZXJ5IF9cIi5qc29u XCIgJV1bJSBFTkQgJV0iLCJhZGRpdGlvbmFsRm9ybWF0cyI6W1sic2VycF9yYXcv WyUgSUYgcDEuaW5mby5zdWNjZXNzID09IDEgJV1bJSBVU0UgTWF0aDsgXCJ1c19c Il8gTWF0aC5pbnQocXVlcnkubnVtIC8gNTAwMCkgX1wiL1wiXyBxdWVyeSBfXCIu aHRtbFwiICVdWyUgRU5EICVdIiwiJHAxLmRhdGEiXSxbInNlcnBfZmFpbC9mYWls ZWQudHh0IiwiWyUgSUYgcDEuaW5mby5zdWNjZXNzID09IDAgJV0kcXVlcnlcXG5b JSBFTkQgJV0iXV0sInJlc3VsdHNVbmlxdWUiOiJubyIsInF1ZXJ5Rm9ybWF0Ijpb IiRxdWVyeSJdLCJ1bmlxdWVRdWVyaWVzIjpmYWxzZSwic2F2ZUZhaWxlZFF1ZXJp ZXMiOmZhbHNlLCJpdGVyYXRvck9wdGlvbnMiOnsib25BbGxMZXZlbHMiOmZhbHNl LCJxdWVyeUJ1aWxkZXJzQWZ0ZXJJdGVyYXRvciI6ZmFsc2V9LCJyZXN1bHRzT3B0 aW9ucyI6eyJvdmVyd3JpdGUiOmZhbHNlfSwiZG9Mb2ciOiJubyIsImtlZXBVbmlx dWUiOiJObyIsIm1vcmVPcHRpb25zIjpmYWxzZSwicmVzdWx0c1ByZXBlbmQiOiIi LCJyZXN1bHRzQXBwZW5kIjoiIiwicXVlcnlCdWlsZGVycyI6W10sInJlc3VsdHNC dWlsZGVycyI6W3sic291cmNlIjpbMCwic3BlbGwiXSwidHlwZSI6InJlbW92ZUh0 bWwiLCJ0byI6InNwZWxsIn0seyJzb3VyY2UiOlswLCJzcGVsbG9yaWciXSwidHlw ZSI6InJlbW92ZUh0bWwiLCJ0byI6InNwZWxsb3JpZyJ9LHsic291cmNlIjpbMCxb InJlbGF0ZWQiLCJzdWdnZXN0aW9ucyJdXSwidHlwZSI6InJlbW92ZUh0bWwiLCJh cnJheSI6InJlbGF0ZWQiLCJ0byI6InN1Z2dlc3Rpb25zIn1dLCJjb25maWdPdmVy cmlkZXMiOltdfX0= Result:
Thanks! Works great. I never would have gotten the json and variable part right. Great support as always!
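For anyone following along without the parser, the two points from the accepted answer can be sketched in Python: why dropping the `<div class="brs_col">` prefix picks up every suggestion, and how collecting everything into one object before serialising gives a single JSON document per query. The sample markup, the `serp` entry, and the helper names are assumptions for illustration only. `bucket_dir` shows one way to start a new directory every 5000 queries; the `us_` prefix is just an example name.

```python
import json
import re

# Hypothetical markup: two columns, each holding several suggestions.
SAMPLE_HTML = (
    '<div class="brs_col">'
    '<p class="_e4b"><a href="/search?q=a">keyword tool</a></p>'
    '<p class="_e4b"><a href="/search?q=b">keyword <b>planner</b> free</a></p>'
    '</div>'
    '<div class="brs_col">'
    '<p class="_e4b"><a href="/search?q=c">adwords planner</a></p>'
    '<p class="_e4b"><a href="/search?q=d">keyword research</a></p>'
    '</div>'
)

# Original pattern: anchored on the column wrapper, so one hit per column.
WITH_DIV = re.compile(
    r'<div class="brs_col"><p class="_e4b"><a href=.+?>(.+?)</a></p>'
)
# Corrected pattern: matches every suggestion paragraph.
WITHOUT_DIV = re.compile(r'<p class="_e4b"><a href=.+?>(.+?)</a></p>')


def strip_tags(fragment):
    """Rough stand-in for a 'remove HTML' clean-up step."""
    return re.sub(r"<[^>]+>", "", fragment).strip()


def related(pattern, html):
    """Return all cleaned suggestions found by the given pattern."""
    return [strip_tags(hit) for hit in pattern.findall(html)]


def build_json(spell, spell_original, suggestions, serp):
    """Collect every field in one object, then serialise it once."""
    result = {
        "spell": spell,
        "spell_original": spell_original,
        "suggestions": suggestions,
        "serp": serp,
    }
    return json.dumps(result, ensure_ascii=False)


def bucket_dir(query_num, per_dir=5000, prefix="us_"):
    """Directory name that changes every `per_dir` queries."""
    return f"{prefix}{query_num // per_dir}"


if __name__ == "__main__":
    print(related(WITH_DIV, SAMPLE_HTML))
    print(related(WITHOUT_DIV, SAMPLE_HTML))
    # The serp entry below is made-up example data.
    serp = [{"link": "https://example.com/", "rank": "1", "title": "Example"}]
    print(build_json("keyword planner", "keywrd planner",
                     related(WITHOUT_DIV, SAMPLE_HTML), serp))
    print(bucket_dir(4999), bucket_dir(5000))
```

Serialising one combined object, rather than each field separately, is what keeps the output a single valid JSON document instead of several JSON fragments on separate lines.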