Character Encoding Problem?

Discussion in 'A-Parser Support Forum' started by scrapefun, Feb 23, 2016.

  1. scrapefun

    scrapefun A-Parser Enterprise License
    A-Parser Enterprise

    Joined:
    Feb 24, 2015
    Messages:
    184
    Likes Received:
    34
    I think I'm having a character encoding problem, but I can't tell whether it is with how A-Parser is set up or possibly something on my server.

    I have a custom Net::HTTP parser that queries Google and then saves the results page as an HTML file. It works great, but for some reason a few queries will not save as a file.

    They appear to work fine in the parser test, and the queries aren't failing; they simply aren't being saved. All the queries that fail to save are non-English words/phrases.

    I've included the parser code below. The one way I was able to get it to work was to change this line for the results file name:
    serp_raw/[% IF p1.info.success == 1 %][% USE Math; "test_4"_ Math.int(query.num / 2500) _"/"_ query _".html" %][% END %]

    To this:
    serp_raw/[% IF p1.info.success == 1 %][% USE Math; "test_4"_ Math.int(query.num / 2500) _"/test.html" %][% END %]

    With the updated line I can perform one query at a time and the file will be generated, but of course the file naming is no longer dynamic.

    I can then manually rename the files to the correct query name with no problems.


    Here is the parser code, which includes the failing queries:
    Code:
    eyJwcmVzZXQiOiJUZXN0IC0gUkFXIEhUTUwiLCJ2YWx1ZSI6eyJwcmVzZXQiOiJU
    ZXN0IC0gUkFXIEhUTUwiLCJwYXJzZXJzIjpbWyJOZXQ6OkhUVFAiLCJkZWZhdWx0
    Iix7InR5cGUiOiJvdmVycmlkZSIsImlkIjoidXNlci1hZ2VudCIsInZhbHVlIjoi
    TW96aWxsYS81LjAgKFdpbmRvd3MgTlQgNi4xOyBXT1c2NCkgQXBwbGVXZWJLaXQv
    NTM3LjM2IChLSFRNTCwgbGlrZSBHZWNrbykgQ2hyb21lLzQ3LjAuMjUyNi4xMTEg
    U2FmYXJpLzUzNy4zNiJ9LHsidHlwZSI6Im92ZXJyaWRlIiwiaWQiOiJmb3JtYXRy
    ZXN1bHQiLCJ2YWx1ZSI6IlslIElGIGluZm8uc3VjY2VzcyA9PSAxICVdJHBhZ2Vz
    LmZvcm1hdCgnJGRhdGFcXG4nKVslIEVORCAlXSJ9LHsidHlwZSI6Im92ZXJyaWRl
    IiwiaWQiOiJwcm94eXJldHJpZXMiLCJ2YWx1ZSI6IjIwMCJ9LHsidHlwZSI6Im92
    ZXJyaWRlIiwiaWQiOiJ1c2Vwcm94eSIsInZhbHVlIjp0cnVlfSx7InR5cGUiOiJv
    dmVycmlkZSIsImlkIjoiZ29vZENvZGUiLCJ2YWx1ZSI6WzIwMF19LHsidHlwZSI6
    Im92ZXJyaWRlIiwiaWQiOiJwcm94eWJhbm5lZGNsZWFudXAiLCJ2YWx1ZSI6IjAi
    fSx7InR5cGUiOiJvdmVycmlkZSIsImlkIjoicXVlcnlmb3JtYXQiLCJ2YWx1ZSI6
    Imh0dHBzOi8vd3d3Lmdvb2dsZS5jb20vc2VhcmNoP3E9JHF1ZXJ5JnB3cz0wJnV1
    bGU9dytDQUlRSUNJTlZXNXBkR1ZrSUZOMFlYUmxjdyJ9LHsidHlwZSI6Im92ZXJy
    aWRlIiwiaWQiOiJyZXF1ZXN0ZGVsYXkiLCJ2YWx1ZSI6IjAifSx7InR5cGUiOiJv
    dmVycmlkZSIsImlkIjoidGltZW91dCIsInZhbHVlIjoiMzAifV1dLCJyZXN1bHRz
    Rm9ybWF0IjoiJHAxLnByZXNldCIsInJlc3VsdHNTYXZlVG8iOiJmaWxlIiwicmVz
    dWx0c0ZpbGVOYW1lIjoic2VycF9yYXcvWyUgSUYgcDEuaW5mby5zdWNjZXNzID09
    IDEgJV1bJSBVU0UgTWF0aDsgXCJ0ZXN0XzRcIl8gTWF0aC5pbnQocXVlcnkubnVt
    IC8gMjUwMCkgX1wiL1wiXyBxdWVyeSBfXCIuaHRtbFwiICVdWyUgRU5EICVdIiwi
    YWRkaXRpb25hbEZvcm1hdHMiOltbImZhaWxlZC9mYWlsZWQudHh0IiwiWyUgSUYg
    cDEuaW5mby5zdWNjZXNzID09IDAgJV0kcXVlcnlcXG5bJSBFTkQgJV0iXV0sInJl
    c3VsdHNVbmlxdWUiOiJubyIsInF1ZXJpZXNGcm9tIjoidGV4dCIsInF1ZXJ5Rm9y
    bWF0IjpbIiRxdWVyeSJdLCJ1bmlxdWVRdWVyaWVzIjp0cnVlLCJzYXZlRmFpbGVk
    UXVlcmllcyI6ZmFsc2UsIml0ZXJhdG9yT3B0aW9ucyI6eyJvbkFsbExldmVscyI6
    ZmFsc2UsInF1ZXJ5QnVpbGRlcnNBZnRlckl0ZXJhdG9yIjpmYWxzZSwicXVlcnlC
    dWlsZGVyc09uQWxsTGV2ZWxzIjpmYWxzZX0sInJlc3VsdHNPcHRpb25zIjp7Im92
    ZXJ3cml0ZSI6ZmFsc2V9LCJkb0xvZyI6ImRiIiwia2VlcFVuaXF1ZSI6Ik5vIiwi
    bW9yZU9wdGlvbnMiOmZhbHNlLCJyZXN1bHRzUHJlcGVuZCI6IiIsInJlc3VsdHNB
    cHBlbmQiOiIiLCJxdWVyeUJ1aWxkZXJzIjpbXSwicmVzdWx0c0J1aWxkZXJzIjpb
    XSwiY29uZmlnT3ZlcnJpZGVzIjpbXSwicXVlcmllcyI6Ilx1YmU0NVx1YmM0NW1v
    bnN0ZXJcdWI0ZTNcdWFlMzBcblx1MDQzNFx1MDQ0ZFx1MDQ0MyBcdTA0NDJcdTA0
    MzhcdTA0M2FcdTA0M2UgXHUwNDNlXHUwNDQyXHUwNDM3XHUwNDRiXHUwNDMyXHUw
    NDRiXG5nXHUyNjZkIG1ham9yXG5kciBwZXJvIHZyXHUwMTdlb2dpXHUwMTA3In19
    
     
  2. scrapefun

    I'm also having trouble with queries like:

    The "&", "+", and "#" characters don't seem to be passed/encoded properly.

    Also, when saving the file, if a query contains a "%" it won't be used in the filename, even though that is an acceptable character for Windows filenames.

    Even if I encode the query myself, it still doesn't work properly. I need to keep the queries in my query list in their original, unencoded form anyway; I just wanted to see what would happen with an already-encoded query.
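
To illustrate why "&", "+", and "#" are a problem when left unescaped, here is a minimal Python sketch (not A-Parser code) of how a URL is read when a raw query is pasted into it as-is. The `example.com` URL template, the helper name `query_as_received`, and the sample queries are assumptions made for illustration only.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical query format, standing in for something like "...?q=$query&pws=0".
URL_TEMPLATE = "https://example.com/search?q={query}&pws=0"


def query_as_received(raw_query: str) -> str:
    """Return the value of `q` that a server would parse when the raw
    query is substituted into the URL without any escaping."""
    url = URL_TEMPLATE.format(query=raw_query)
    parts = urlsplit(url)           # everything after "#" becomes the fragment
    params = parse_qs(parts.query)  # "&" separates parameters, "+" decodes to a space
    return params.get("q", [""])[0]


if __name__ == "__main__":
    for raw in ["R&B", "c++", "c#", "plain"]:
        print(repr(raw), "->", repr(query_as_received(raw)))
```

In this sketch, "&" cuts the query short, "+" turns into a space, and "#" moves the rest of the URL into the fragment, so the search term that arrives is not the one that was typed.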
     
    #2 scrapefun, Feb 23, 2016
    Last edited: Mar 2, 2016
  3. Forbidden

    Forbidden Administrator
    Staff Member A-Parser Enterprise

    Joined:
    Mar 9, 2013
    Messages:
    3,337
    Likes Received:
    1,793
    I'm working on this issue; a new version will be released soon.

    You have to apply the escape filter:

    [​IMG]

    This will also be fixed.
     
  4. scrapefun

    Thanks for the help!

    For the escape filter, would I need to create a filter for each character I am having issues with, or is there a way to specify multiple characters in a single filter?

    Also, I'm not clear on where to put this in my task settings. The screenshot providing the example is for the Google parser, but I am using the Net::HTTP parser.

    Looking forward to the update for the other issues. Fantastic support as always!
     
  5. Forbidden

    Just replace $query with [% query | uri %] in your Query format.
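
For readers unfamiliar with URI escaping, the following Python sketch shows the kind of percent-encoding a `uri`-style filter performs. It uses the standard library's `urllib.parse.quote` as a stand-in; it is not A-Parser's implementation, and the exact set of characters the `uri` filter leaves unescaped may differ. The `example.com` URL template, the helper name `build_url`, and the sample queries are illustrative assumptions.

```python
from urllib.parse import quote, urlsplit, parse_qs

# Hypothetical query format, standing in for something like "...?q=$query&pws=0".
URL_TEMPLATE = "https://example.com/search?q={query}&pws=0"


def build_url(raw_query: str) -> str:
    """Percent-encode the query (as UTF-8) before placing it in the URL."""
    # safe="" means reserved characters such as "&", "+", "#" and "/" are escaped too.
    return URL_TEMPLATE.format(query=quote(raw_query, safe=""))


if __name__ == "__main__":
    for raw in ["a+b & c#", "50%", "g\u266d major"]:
        url = build_url(raw)
        # Parse the URL back to confirm the original query survives the round trip.
        received = parse_qs(urlsplit(url).query)["q"][0]
        print(url, "->", repr(received))
```

The query list itself can stay in its original, unencoded form; the escaping happens only at the moment the URL is built.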
     
  6. scrapefun

    haha...couldn't be much easier than that :)

    Will you post here when the update is ready or should I just check the RU forum for the latest updates?

    Thanks for the help!
     
  7. Forbidden

    I'll post here, of course.
     
  8. Forbidden

    Try the new beta, 1.1.427.
     
  9. scrapefun

    I tested with the latest update. The file naming issues I was having seem to be fixed, and most of the characters are passed fine after using the escape filter, but I'm still having problems with some words.

    Mainly those containing the "+", "&", "<", and ">" characters.

    Here are some examples with the original phrase on the left of the "=" and what is returned in the Google result page I'm downloading on the right:



    Granted, some of these are pretty much nonsense for testing purposes, but I need to be able to submit these characters properly. It could very well be that I'm doing something wrong on my end, of course :)
     
    #9 scrapefun, Feb 26, 2016
    Last edited: Mar 2, 2016
  10. Forbidden

    Those aren't valid symbols in HTML: you can't use < and > (and several other symbols) directly in HTML. That is why you get &lt; &gt; &amp; etc. These are called "HTML entities".

    You will get exactly the same from Google in a browser:

    [​IMG]
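
As a small illustration of HTML entities, the Python sketch below uses the standard library's `html` module to show how "<", ">", and "&" appear in raw page source and how they map back to the original characters. The helper names and the sample string are assumptions for illustration; this is not how A-Parser or Google produces the page.

```python
import html


def to_html_text(text: str) -> str:
    """Escape the characters that cannot appear literally in HTML text."""
    # quote=False leaves quotation marks alone; only &, < and > are replaced.
    return html.escape(text, quote=False)


def from_html_text(markup: str) -> str:
    """Convert HTML entities back to the characters a browser would display."""
    return html.unescape(markup)


if __name__ == "__main__":
    original = "a < b & c > d"
    in_source = to_html_text(original)
    print(in_source)                  # what a text editor shows in the saved file
    print(from_html_text(in_source))  # what a browser renders
```

So a saved results page viewed in a text editor shows the entity form, while the same file opened in a browser shows the original characters.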
     
  11. scrapefun

    Thanks for the explanation on that. I was checking in my text editor instead of a browser, so I didn't see those rendered correctly.

    This explains what I was seeing with all the characters except the "+".

    When I view the files for those queries in either my text editor or browser, the "+" characters are not there. It's as if they have been left off.

    I see this for all the queries containing a "+". They don't seem to be there.

    Thanks for all the help and patience.
     
    #11 scrapefun, Feb 26, 2016
    Last edited: Mar 2, 2016
  12. Forbidden

    The "+" issue is fixed in 1.1.433, thanks for your report
     
  13. scrapefun

    Everything is working! Thanks for the help in sorting all of this out.
     
