Is this possible? Download a File and Parse Its Data at the Same Time

Discussion in 'A-Parser Support Forum' started by scrapefun, Aug 7, 2015.

  1. scrapefun

    scrapefun A-Parser Enterprise License
    A-Parser Enterprise

    Joined:
    Feb 24, 2015
    Messages:
    184
    Likes Received:
    34
    Before I even attempt something: is it possible to download the full source of a page and save it as a file, AND also parse that page for certain data and save that data to a file in JSON format?

    (I already download the source to a file; I just need to add the parsing-to-JSON part.)

    So it would be:

    1: Visit page, grab source and save to file
    2: Extract data from that same page and save to json

    If it is possible, I will have a lot more questions, lol :)
     
  2. Support

    Support Administrator
    Staff Member A-Parser Enterprise

    Joined:
    Mar 16, 2012
    Messages:
    4,545
    Likes Received:
    2,163
    Yes, it is possible. To output JSON, use the .json method.
    Here is a preset that parses a site (Wikipedia, for example) and saves both the page source and the results in JSON format:
    Code:
    eyJwcmVzZXQiOiJkZWZhdWx0IiwidmFsdWUiOnsicHJlc2V0IjoiZGVmYXVsdCIs
    InBhcnNlcnMiOltbIk5ldDo6SFRUUCIsImRlZmF1bHQiLHsidHlwZSI6Im92ZXJy
    aWRlIiwiaWQiOiJmb3JtYXRyZXN1bHQiLCJ2YWx1ZSI6IiR0aXRsZS5qc29uXFxu
    JHRvcDEwLmpzb24ifSx7InR5cGUiOiJjdXN0b21SZXN1bHQiLCJyZXN1bHQiOiJk
    YXRhIiwicmVnZXgiOiI8dGl0bGU+KC4rPyk8L3RpdGxlPiIsInJlZ2V4VHlwZSI6
    IiIsInJlc3VsdFR5cGUiOiJmbGF0IiwiYXJyYXlOYW1lIjoiIiwicmVzdWx0cyI6
    WyJ0aXRsZSJdfSx7InR5cGUiOiJjdXN0b21SZXN1bHQiLCJyZXN1bHQiOiJkYXRh
    IiwicmVnZXgiOiI8IS0tICguKz9ocikgLS0+IiwicmVnZXhUeXBlIjoiZyIsInJl
    c3VsdFR5cGUiOiJhcnJheSIsImFycmF5TmFtZSI6InRvcDEwIiwicmVzdWx0cyI6
    WyJsYW5nIl19XV0sInJlc3VsdHNGb3JtYXQiOiIkcDEucHJlc2V0IiwicmVzdWx0
    c1NhdmVUbyI6ImZpbGUiLCJyZXN1bHRzRmlsZU5hbWUiOiJ3aWtpL2pzb24udHh0
    IiwiYWRkaXRpb25hbEZvcm1hdHMiOltbIndpa2kvc291cmNlLnR4dCIsIiRwMS5k
    YXRhIl1dLCJyZXN1bHRzVW5pcXVlIjoibm8iLCJxdWVyeUZvcm1hdCI6WyIkcXVl
    cnkiXSwidW5pcXVlUXVlcmllcyI6ZmFsc2UsInNhdmVGYWlsZWRRdWVyaWVzIjpm
    YWxzZSwiaXRlcmF0b3JPcHRpb25zIjp7Im9uQWxsTGV2ZWxzIjpmYWxzZSwicXVl
    cnlCdWlsZGVyc0FmdGVySXRlcmF0b3IiOmZhbHNlfSwicmVzdWx0c09wdGlvbnMi
    Onsib3ZlcndyaXRlIjpmYWxzZX0sImRvTG9nIjoiZGIiLCJrZWVwVW5pcXVlIjoi
    Tm8iLCJtb3JlT3B0aW9ucyI6ZmFsc2UsInJlc3VsdHNQcmVwZW5kIjoiIiwicmVz
    dWx0c0FwcGVuZCI6IiIsInF1ZXJ5QnVpbGRlcnMiOltdLCJyZXN1bHRzQnVpbGRl
    cnMiOltdLCJjb25maWdPdmVycmlkZXMiOltdfX0=
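
For readers who want to see what such a preset code contains: it appears to be base64-encoded JSON (the first line above decodes to `{"preset":"default","value":{"preset":"default",`). Below is a minimal Python sketch for inspecting one outside A-Parser. The function name `decode_preset` is hypothetical and not part of A-Parser, and the sample is a tiny stand-in rather than a real preset.

```python
import base64
import json


def decode_preset(code: str) -> dict:
    """Decode a preset code block, assumed to be base64-encoded JSON."""
    # The forum wraps the code across several lines, so drop all whitespace first.
    compact = "".join(code.split())
    return json.loads(base64.b64decode(compact).decode("utf-8"))


if __name__ == "__main__":
    # Round-trip a small stand-in preset instead of a real one.
    stand_in = {"preset": "default", "value": {"resultsSaveTo": "file"}}
    sample = base64.b64encode(json.dumps(stand_in).encode("utf-8")).decode("ascii")
    print(decode_preset(sample))
```

Pasting a full code block from this thread into `decode_preset` should show the regexes and file names that the preset uses, provided the encoding assumption holds.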
     
  3. scrapefun

    Thanks for pointing me in the right direction really helpful!
     
  4. scrapefun

    As per usual, I am stuck with all the regex I need to do.

    Here is a sample query:
    https://www.google.co.uk/search?q=keywrd+planner&pws=0&uule=w+CAIQICINVW5pdGVkIFN0YXRlcw&num=20

    I am using this as the user agent:
    Mozilla/5.0 (Windows NT 6.1; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0

    I need to get the title and link from each SERP result, but I also need to extract the data highlighted in the images below:


    misspell.png related.png


    Any help would be greatly appreciated. None of the regexes I have tried work. Thanks
     
  5. Support

    Regex for spell:
    Code:
    <a class="spell".+?>(.+?)<\/a>
    Regex for spell_orig:
    Code:
    <a class="spell_orig".+?>(.+?)<\/a>
    HTML tags can be removed using the Results builder.
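
To illustrate how these two patterns behave, here is a small Python sketch run against a made-up HTML fragment. The sample markup and the helper names `strip_tags` and `first_group` are assumptions for illustration only. Google's real markup differs and changes over time, and in A-Parser the tag removal would be done by the Results builder rather than by code like this.

```python
import re

# Hypothetical fragment, loosely modelled on a "Showing results for" block.
SAMPLE = (
    'Showing results for <a class="spell" href="/search?q=keyword+planner">'
    '<b><i>keyword</i></b> planner</a><br>'
    'Search instead for <a class="spell_orig" href="/search?q=keywrd+planner">'
    'keywrd planner</a>'
)

# The two patterns from the post, unchanged.
SPELL = re.compile(r'<a class="spell".+?>(.+?)<\/a>', re.S)
SPELL_ORIG = re.compile(r'<a class="spell_orig".+?>(.+?)<\/a>', re.S)


def strip_tags(fragment: str) -> str:
    """Crude tag removal, standing in for the Results builder step."""
    return re.sub(r"<[^>]+>", "", fragment)


def first_group(pattern: re.Pattern, html: str):
    """Return the first capture with tags removed, or None if nothing matches."""
    match = pattern.search(html)
    return strip_tags(match.group(1)) if match else None


spell = first_group(SPELL, SAMPLE)
spell_orig = first_group(SPELL_ORIG, SAMPLE)
print(spell, "|", spell_orig)
```

The closing quote in `class="spell"` is what keeps the first pattern from also matching the `spell_orig` link.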
     
  6. scrapefun

    Thanks!

    I have figured out most of the regex thanks to your examples except for when I try to extract the related keywords from the screenshot in my previous post. I have this:

    [screenshot: regex for the related keywords]


    This works for grabbing the first suggestion from each of the two columns but does not grab all of them.


    Next, I'm not sure how to properly format the JSON file. I want to create a JSON file for each keyword/query that has a layout something like this:

    [screenshot: desired JSON layout]

    Finally, I want to save the JSON file and a raw file containing the source code in different directories, create a new directory every 5000 queries, and save any failed queries to a separate file. I have code for this in another custom parser, which I reused below, but I'm not sure it translates to this new one.

    Here is everything I have so far:

    Code:
    eyJwcmVzZXQiOiJHb29nbGUgUmF3ICYgUGFyc2UgVG8gSlNPTiIsInZhbHVlIjp7
    InByZXNldCI6Ikdvb2dsZSBSYXcgJiBQYXJzZSBUbyBKU09OIiwicGFyc2VycyI6
    W1siTmV0OjpIVFRQIiwiZGVmYXVsdCIseyJ0eXBlIjoib3ZlcnJpZGUiLCJpZCI6
    ImZvcm1hdHJlc3VsdCIsInZhbHVlIjoiJHF1ZXJ5Lmpzb25cXG4kbG9vcC5jb3Vu
    dC5qc29uXFxuJHNlcnAuanNvblxcbiR0b3AxMC5qc29uXFxuJHNwZWxsLmpzb25c
    XG4kc3BlbGxfb3JpZ2luYWwuanNvblxcbiRyZWxhdGVkLmpzb25cXG4ifSx7InR5
    cGUiOiJjdXN0b21SZXN1bHQiLCJyZXN1bHQiOiJkYXRhIiwicmVnZXgiOiI8aDMg
    Y2xhc3M9XCJyXCI+PGEgaHJlZj0uKz8+KC4rPyk8XFwvYT4iLCJyZWdleFR5cGUi
    OiJnIiwicmVzdWx0VHlwZSI6ImFycmF5IiwiYXJyYXlOYW1lIjoic2VycCIsInJl
    c3VsdHMiOlsidGl0bGUiXX0seyJ0eXBlIjoiY3VzdG9tUmVzdWx0IiwicmVzdWx0
    IjoiZGF0YSIsInJlZ2V4IjoiPGgzIGNsYXNzPVwiclwiPjxhIGhyZWY9XCIoLis/
    KVwiIiwicmVnZXhUeXBlIjoiZyIsInJlc3VsdFR5cGUiOiJhcnJheSIsImFycmF5
    TmFtZSI6InRvcDEwIiwicmVzdWx0cyI6WyJsaW5rIl19LHsidHlwZSI6Im92ZXJy
    aWRlIiwiaWQiOiJ1c2VyLWFnZW50IiwidmFsdWUiOiJNb3ppbGxhLzUuMCAoV2lu
    ZG93cyBOVCA2LjE7IFdPVzY0OyBydjozOS4wKSBHZWNrby8yMDEwMDEwMSBGaXJl
    Zm94LzM5LjAifSx7InR5cGUiOiJvdmVycmlkZSIsImlkIjoiZ29vZENvZGUiLCJ2
    YWx1ZSI6MjAwfSx7InR5cGUiOiJvdmVycmlkZSIsImlkIjoicXVlcnlmb3JtYXQi
    LCJ2YWx1ZSI6Imh0dHBzOi8vd3d3Lmdvb2dsZS5jby51ay9zZWFyY2g/cT0kcXVl
    cnkmcHdzPTAmdXVsZT13K0NBSVFJQ0lOVlc1cGRHVmtJRk4wWVhSbGN3Jm51bT0y
    MCJ9LHsidHlwZSI6ImN1c3RvbVJlc3VsdCIsInJlc3VsdCI6ImRhdGEiLCJyZWdl
    eCI6IjxhIGNsYXNzPVwic3BlbGxcIi4rPz4oLis/KTxcXC9hPiIsInJlZ2V4VHlw
    ZSI6InMiLCJyZXN1bHRUeXBlIjoiZmxhdCIsImFycmF5TmFtZSI6IiIsInJlc3Vs
    dHMiOlsic3BlbGwiXX0seyJ0eXBlIjoiY3VzdG9tUmVzdWx0IiwicmVzdWx0Ijoi
    ZGF0YSIsInJlZ2V4IjoiPGEgY2xhc3M9XCJzcGVsbFwiLis/PiguKz8pPFxcL2E+
    IiwicmVnZXhUeXBlIjoicyIsInJlc3VsdFR5cGUiOiJmbGF0IiwiYXJyYXlOYW1l
    IjoiIiwicmVzdWx0cyI6WyJzcGVsbF9vcmlnaW5hbCJdfSx7InR5cGUiOiJjdXN0
    b21SZXN1bHQiLCJyZXN1bHQiOiJkYXRhIiwicmVnZXgiOiI8ZGl2IGNsYXNzPVwi
    YnJzX2NvbFwiPjxwIGNsYXNzPVwiX2U0YlwiPjxhIGhyZWY9Lis/PiguKz8pPFxc
    L2E+PFxcL3A+IiwicmVnZXhUeXBlIjoiZyIsInJlc3VsdFR5cGUiOiJhcnJheSIs
    ImFycmF5TmFtZSI6InJlbGF0ZWQiLCJyZXN1bHRzIjpbInN1Z2dlc3Rpb25zIl19
    XV0sInJlc3VsdHNGb3JtYXQiOiIkcDEucHJlc2V0IiwicmVzdWx0c1NhdmVUbyI6
    ImZpbGUiLCJyZXN1bHRzRmlsZU5hbWUiOiJzZXJwX2pzb24vWyUgSUYgcDEuaW5m
    by5zdWNjZXNzID09IDEgJV1bJSBVU0UgTWF0aDsgXCJ1c19cIl8gTWF0aC5pbnQo
    cXVlcnkubnVtIC8gNTAwMCkgX1wiL1wiXyBxdWVyeSBfXCIuanNvblwiICVdWyUg
    RU5EICVdIiwiYWRkaXRpb25hbEZvcm1hdHMiOltbInNlcnBfcmF3L1slIElGIHAx
    LmluZm8uc3VjY2VzcyA9PSAxICVdWyUgVVNFIE1hdGg7IFwidXNfXCJfIE1hdGgu
    aW50KHF1ZXJ5Lm51bSAvIDUwMDApIF9cIi9cIl8gcXVlcnkgX1wiLmh0bWxcIiAl
    XVslIEVORCAlXSIsIiRwMS5kYXRhIl0sWyJzZXJwX2ZhaWwvZmFpbGVkLnR4dCIs
    IlslIElGIHAxLmluZm8uc3VjY2VzcyA9PSAwICVdJHF1ZXJ5XFxuWyUgRU5EICVd
    Il1dLCJyZXN1bHRzVW5pcXVlIjoibm8iLCJxdWVyeUZvcm1hdCI6WyIkcXVlcnki
    XSwidW5pcXVlUXVlcmllcyI6ZmFsc2UsInNhdmVGYWlsZWRRdWVyaWVzIjpmYWxz
    ZSwiaXRlcmF0b3JPcHRpb25zIjp7Im9uQWxsTGV2ZWxzIjpmYWxzZSwicXVlcnlC
    dWlsZGVyc0FmdGVySXRlcmF0b3IiOmZhbHNlfSwicmVzdWx0c09wdGlvbnMiOnsi
    b3ZlcndyaXRlIjpmYWxzZX0sImRvTG9nIjoibm8iLCJrZWVwVW5pcXVlIjoiTm8i
    LCJtb3JlT3B0aW9ucyI6ZmFsc2UsInJlc3VsdHNQcmVwZW5kIjoiIiwicmVzdWx0
    c0FwcGVuZCI6IiIsInF1ZXJ5QnVpbGRlcnMiOltdLCJyZXN1bHRzQnVpbGRlcnMi
    Olt7InNvdXJjZSI6WzAsInNwZWxsIl0sInR5cGUiOiJyZW1vdmVIdG1sIiwidG8i
    OiJzcGVsbCJ9LHsic291cmNlIjpbMCwic3BlbGxfb3JpZ2luYWwiXSwidHlwZSI6
    InJlbW92ZUh0bWwiLCJ0byI6Im9yaWdpbmFsIn1dLCJjb25maWdPdmVycmlkZXMi
    OltdfX0=
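
The saving scheme described above (JSON and raw source in separate directories, a new subdirectory every 5000 queries, failed queries collected in one file) can be sketched in Python as follows. This is an illustration outside A-Parser, not the preset itself. The directory names `serp_json`, `serp_raw` and `serp_fail`, the `us_` prefix, and the function name `output_paths` are assumptions chosen for the example, and `query_num` is assumed to be a running query index.

```python
from pathlib import PurePosixPath

BATCH_SIZE = 5000  # start a new subdirectory every 5000 queries


def output_paths(query: str, query_num: int, success: bool) -> dict:
    """Return the file(s) that one query's output should be written to."""
    if not success:
        # All failed queries are appended to a single list file.
        return {"failed": PurePosixPath("serp_fail/failed.txt")}
    # Integer division gives the bucket: 0-4999 -> us_0, 5000-9999 -> us_1, ...
    bucket = f"us_{query_num // BATCH_SIZE}"
    return {
        "json": PurePosixPath("serp_json") / bucket / f"{query}.json",
        "raw": PurePosixPath("serp_raw") / bucket / f"{query}.html",
    }


if __name__ == "__main__":
    for num in (0, 4999, 5000):
        print(num, output_paths("keywrd planner", num, True)["json"])
```

Using the query text as a file name assumes it contains no characters that are invalid in paths, so a real setup may need to sanitise it first.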


    Thanks! I'm always amazed what this software can do but more amazed by the support!
     
    #6 scrapefun, Aug 24, 2015
    Last edited by a moderator: Aug 26, 2015
  7. Support

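    The string <div class="brs_col"> in this regular expression is superfluous. It ties every match to the start of a column, which is why only the first suggestion in each column was captured. Use this instead: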
    Code:
    <p class="_e4b"><a href=.+?>(.+?)<\/a><\/p>
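
A small Python sketch shows the effect of dropping the `<div class="brs_col">` prefix. The sample markup is invented for illustration. It only assumes what the thread implies: two columns, each a `<div class="brs_col">` that holds several `<p class="_e4b">` entries.

```python
import re

# Invented fragment: two columns with two suggestions each.
SAMPLE = (
    '<div class="brs_col">'
    '<p class="_e4b"><a href="/search?q=a">keyword tool</a></p>'
    '<p class="_e4b"><a href="/search?q=b">keyword finder</a></p>'
    '</div>'
    '<div class="brs_col">'
    '<p class="_e4b"><a href="/search?q=c">adwords planner</a></p>'
    '<p class="_e4b"><a href="/search?q=d">seo keywords</a></p>'
    '</div>'
)

# Pattern with the column prefix, and the shortened pattern from the reply.
WITH_DIV = r'<div class="brs_col"><p class="_e4b"><a href=.+?>(.+?)<\/a><\/p>'
WITHOUT_DIV = r'<p class="_e4b"><a href=.+?>(.+?)<\/a><\/p>'

# Only a <p> that directly follows the <div> can match the first pattern.
first_per_column = re.findall(WITH_DIV, SAMPLE)
# Every <p class="_e4b"> entry matches the second pattern.
all_suggestions = re.findall(WITHOUT_DIV, SAMPLE)

print(first_per_column)
print(all_suggestions)
```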
    You need to create a variable that holds all the data, and then output that variable as JSON:
    Code:
    [% result.spell = p1.spell;
    result.spell_original = p1.spellorig;
    result.suggestions = p1.related;
    result.serp = p1.serp;
    result.json() %]
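
The same idea, collecting every extracted field into one structure and serialising it in a single call, looks like this in Python. This is only an analogy to the template above, not A-Parser code. The function name `build_result` and all sample values are invented for illustration.

```python
import json


def build_result(spell, spell_original, suggestions, serp):
    """Gather all parsed fields into one dict, mirroring the `result` variable."""
    return {
        "spell": spell,
        "spell_original": spell_original,
        "suggestions": suggestions,
        "serp": serp,
    }


result = build_result(
    spell="keyword planner",
    spell_original="keywrd planner",
    suggestions=["keyword tool", "keyword finder"],
    serp=[{"link": "https://example.com/", "title": "Example"}],
)

# One call serialises the whole structure, nested arrays included.
as_json = json.dumps(result, ensure_ascii=False)
print(as_json)
```

Building the structure first and serialising once keeps the output valid JSON, whereas printing each field's JSON on its own line yields several separate documents per query.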
    The file-saving part of your preset is already set up correctly.

    The resulting preset:
    Code:
    eyJwcmVzZXQiOiJodHRwOi8vYS1wYXJzZXIuY29tL3RocmVhZHMvMTc5Mi8iLCJ2
    YWx1ZSI6eyJwcmVzZXQiOiJodHRwOi8vYS1wYXJzZXIuY29tL3RocmVhZHMvMTc5
    Mi8iLCJwYXJzZXJzIjpbWyJOZXQ6OkhUVFAiLCJkZWZhdWx0Iix7InR5cGUiOiJv
    dmVycmlkZSIsImlkIjoiZm9ybWF0cmVzdWx0IiwidmFsdWUiOiJbJSByZXN1bHQu
    c3BlbGwgPSBwMS5zcGVsbDtcbnJlc3VsdC5zcGVsbF9vcmlnaW5hbCA9IHAxLnNw
    ZWxsb3JpZztcbnJlc3VsdC5zdWdnZXN0aW9ucyA9IHAxLnJlbGF0ZWQ7XG5yZXN1
    bHQuc2VycCA9IHAxLnNlcnA7XG5yZXN1bHQuanNvbigpICVdIn0seyJ0eXBlIjoi
    Y3VzdG9tUmVzdWx0IiwicmVzdWx0IjoiZGF0YSIsInJlZ2V4IjoiPGgzIGNsYXNz
    PVwiclwiPjxhIGhyZWY9XCIoLis/KVwiIG9ubW91c2Vkb3duLis/LCcoXFxkKykn
    LC4rP1wiPiguKz8pPFxcL2E+IiwicmVnZXhUeXBlIjoiZyIsInJlc3VsdFR5cGUi
    OiJhcnJheSIsImFycmF5TmFtZSI6InNlcnAiLCJyZXN1bHRzIjpbImxpbmsiLCJy
    YW5rIiwidGl0bGUiXX0seyJ0eXBlIjoib3ZlcnJpZGUiLCJpZCI6InVzZXItYWdl
    bnQiLCJ2YWx1ZSI6Ik1vemlsbGEvNS4wIChXaW5kb3dzIE5UIDYuMTsgV09XNjQ7
    IHJ2OjM5LjApIEdlY2tvLzIwMTAwMTAxIEZpcmVmb3gvMzkuMCJ9LHsidHlwZSI6
    Im92ZXJyaWRlIiwiaWQiOiJnb29kQ29kZSIsInZhbHVlIjoyMDB9LHsidHlwZSI6
    Im92ZXJyaWRlIiwiaWQiOiJxdWVyeWZvcm1hdCIsInZhbHVlIjoiaHR0cHM6Ly93
    d3cuZ29vZ2xlLmNvLnVrL3NlYXJjaD9xPSRxdWVyeSZwd3M9MCZ1dWxlPXcrQ0FJ
    UUlDSU5WVzVwZEdWa0lGTjBZWFJsY3cmbnVtPTIwIn0seyJ0eXBlIjoiY3VzdG9t
    UmVzdWx0IiwicmVzdWx0IjoiZGF0YSIsInJlZ2V4IjoiPGEgY2xhc3M9XCJzcGVs
    bFwiLis/PiguKz8pPFxcL2E+IiwicmVnZXhUeXBlIjoicyIsInJlc3VsdFR5cGUi
    OiJmbGF0IiwiYXJyYXlOYW1lIjoiIiwicmVzdWx0cyI6WyJzcGVsbCJdfSx7InR5
    cGUiOiJjdXN0b21SZXN1bHQiLCJyZXN1bHQiOiJkYXRhIiwicmVnZXgiOiI8cCBj
    bGFzcz1cIl9lNGJcIj48YSBocmVmPS4rPz4oLis/KTxcXC9hPjxcXC9wPiIsInJl
    Z2V4VHlwZSI6ImciLCJyZXN1bHRUeXBlIjoiYXJyYXkiLCJhcnJheU5hbWUiOiJy
    ZWxhdGVkIiwicmVzdWx0cyI6WyJzdWdnZXN0aW9ucyJdfSx7InR5cGUiOiJjdXN0
    b21SZXN1bHQiLCJyZXN1bHQiOiJkYXRhIiwicmVnZXgiOiI8YSBjbGFzcz1cInNw
    ZWxsX29yaWdcIi4rPz4oLis/KTxcXC9hPi4rIiwicmVnZXhUeXBlIjoicyIsInJl
    c3VsdFR5cGUiOiJmbGF0IiwiYXJyYXlOYW1lIjoiIiwicmVzdWx0cyI6WyJzcGVs
    bG9yaWciXX1dXSwicmVzdWx0c0Zvcm1hdCI6IiRwMS5wcmVzZXQiLCJyZXN1bHRz
    U2F2ZVRvIjoiZmlsZSIsInJlc3VsdHNGaWxlTmFtZSI6InNlcnBfanNvbi9bJSBJ
    RiBwMS5pbmZvLnN1Y2Nlc3MgPT0gMSAlXVslIFVTRSBNYXRoOyBcInVzX1wiXyBN
    YXRoLmludChxdWVyeS5udW0gLyA1MDAwKSBfXCIvXCJfIHF1ZXJ5IF9cIi5qc29u
    XCIgJV1bJSBFTkQgJV0iLCJhZGRpdGlvbmFsRm9ybWF0cyI6W1sic2VycF9yYXcv
    WyUgSUYgcDEuaW5mby5zdWNjZXNzID09IDEgJV1bJSBVU0UgTWF0aDsgXCJ1c19c
    Il8gTWF0aC5pbnQocXVlcnkubnVtIC8gNTAwMCkgX1wiL1wiXyBxdWVyeSBfXCIu
    aHRtbFwiICVdWyUgRU5EICVdIiwiJHAxLmRhdGEiXSxbInNlcnBfZmFpbC9mYWls
    ZWQudHh0IiwiWyUgSUYgcDEuaW5mby5zdWNjZXNzID09IDAgJV0kcXVlcnlcXG5b
    JSBFTkQgJV0iXV0sInJlc3VsdHNVbmlxdWUiOiJubyIsInF1ZXJ5Rm9ybWF0Ijpb
    IiRxdWVyeSJdLCJ1bmlxdWVRdWVyaWVzIjpmYWxzZSwic2F2ZUZhaWxlZFF1ZXJp
    ZXMiOmZhbHNlLCJpdGVyYXRvck9wdGlvbnMiOnsib25BbGxMZXZlbHMiOmZhbHNl
    LCJxdWVyeUJ1aWxkZXJzQWZ0ZXJJdGVyYXRvciI6ZmFsc2V9LCJyZXN1bHRzT3B0
    aW9ucyI6eyJvdmVyd3JpdGUiOmZhbHNlfSwiZG9Mb2ciOiJubyIsImtlZXBVbmlx
    dWUiOiJObyIsIm1vcmVPcHRpb25zIjpmYWxzZSwicmVzdWx0c1ByZXBlbmQiOiIi
    LCJyZXN1bHRzQXBwZW5kIjoiIiwicXVlcnlCdWlsZGVycyI6W10sInJlc3VsdHNC
    dWlsZGVycyI6W3sic291cmNlIjpbMCwic3BlbGwiXSwidHlwZSI6InJlbW92ZUh0
    bWwiLCJ0byI6InNwZWxsIn0seyJzb3VyY2UiOlswLCJzcGVsbG9yaWciXSwidHlw
    ZSI6InJlbW92ZUh0bWwiLCJ0byI6InNwZWxsb3JpZyJ9LHsic291cmNlIjpbMCxb
    InJlbGF0ZWQiLCJzdWdnZXN0aW9ucyJdXSwidHlwZSI6InJlbW92ZUh0bWwiLCJh
    cnJheSI6InJlbGF0ZWQiLCJ0byI6InN1Z2dlc3Rpb25zIn1dLCJjb25maWdPdmVy
    cmlkZXMiOltdfX0=

     
  8. scrapefun

    Thanks! Works great. I never would have gotten the json and variable part right.

    Great support as always!
     