Customize Google Images Scraper

Discussion in 'A-Parser Support Forum' started by scrapefun, May 3, 2015.

  1. scrapefun

    scrapefun A-Parser Enterprise License
    A-Parser Enterprise

    Joined:
    Feb 24, 2015
    Messages:
    184
    Likes Received:
    34
    I need to extract some additional information from Google Images results and am not sure how to go about it.

    On the Google Images results page, each image links to a URL like this:

    href="http://www.google.com/imgres?imgurl...BQ&tbm=isch&ved=0CDQQMygCMAI&biw=1366&bih=631"

    I need to extract the values for these parameters:

    imgurl=
    imgrefurl=
    tbnid=

    And finally, is there a way to extract the filetype of the image into a variable as well (jpg, png, etc)? Something like $filetype?

    So for the final result I would like stored on each line:
    $query;$loop.count;$imgurl;$imgrefurl;$tbnid.$filetype\n
     
  2. Forbidden

    Forbidden Administrator
    Staff Member A-Parser Enterprise

    Joined:
    Mar 9, 2013
    Messages:
    3,337
    Likes Received:
    1,793
    It is better for you to write your own parser based on Net::HTTP.
     
  3. Forbidden

    I can post a solution later.
     
  4. scrapefun

    That would be great! Thank you very much.
     
  5. scrapefun

    I know Forbidden is super busy, so I would be open to hiring someone to build this solution. If anyone is interested, just send me a PM.
     
  6. Forbidden

    A very interesting solution:
    • Use SE::Google::Images with the Raw data results option to generate the queries and get the raw HTML
    • Use a complex regex to extract all the data
    • Use the power of the Result format to generate the proper result
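
For readers who want to see the three steps above outside A-Parser, here is a minimal Python sketch. It is not the preset posted in this thread: instead of one large regex it uses a small regex to find the imgres links and then ordinary query-string parsing to pull out imgurl, imgrefurl and tbnid. The markup shape (an href containing /imgres? with &amp;-escaped parameters), the trailing colon on tbnid, and the "jpg" fallback for a missing extension are assumptions made for illustration, and the function names are hypothetical.

```python
import html
import posixpath
import re
from urllib.parse import parse_qs, urlsplit

# Assumed markup: href="...?imgurl=...&amp;imgrefurl=...&amp;tbnid=..."
IMGRES_HREF = re.compile(r'href="([^"]*?/imgres\?[^"]*)"', re.IGNORECASE)


def extract_images(raw_html):
    """Return one dict per imgres link found in the raw HTML."""
    rows = []
    for match in IMGRES_HREF.finditer(raw_html):
        url = html.unescape(match.group(1))       # turn &amp; back into &
        params = parse_qs(urlsplit(url).query)    # query string -> dict of lists
        imgurl = params.get("imgurl", [""])[0]
        if not imgurl:
            continue
        # File type is taken from the extension of the image URL path.
        ext = posixpath.splitext(urlsplit(imgurl).path)[1].lstrip(".").lower()
        rows.append({
            "imgurl": imgurl,
            "imgrefurl": params.get("imgrefurl", [""])[0],
            # Assumption: tbnid may carry a trailing colon that is not wanted.
            "tbnid": params.get("tbnid", [""])[0].rstrip(":"),
            # Assumption: fall back to "jpg" when the URL has no extension.
            "filetype": ext or "jpg",
        })
    return rows


def format_rows(query, rows):
    """Build lines shaped like $query;$loop.count;$imgurl;$imgrefurl;$tbnid.$filetype."""
    return "".join(
        f"{query};{i};{r['imgurl']};{r['imgrefurl']};{r['tbnid']}.{r['filetype']}\n"
        for i, r in enumerate(rows, start=1)
    )
```

Calling format_rows("cats", extract_images(page_html)) on a saved results page would give one semicolon-separated line per image, numbered from 1, provided the page still uses links of the assumed shape.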


    [IMG]

    Code:
    eyJwcmVzZXQiOiJ0b3BpYy0xNjA5OiBjdXN0b20gZ29vZ2xlIGltYWdlcyBwYXJz
    ZXIiLCJ2YWx1ZSI6eyJwcmVzZXQiOiJ0b3BpYy0xNjA5OiBjdXN0b20gZ29vZ2xl
    IGltYWdlcyBwYXJzZXIiLCJwYXJzZXJzIjpbWyJTRTo6R29vZ2xlOjpJbWFnZXMi
    LCJkZWZhdWx0Iix7InR5cGUiOiJvdmVycmlkZSIsImlkIjoicmF3ZGF0YSIsInZh
    bHVlIjp0cnVlfSx7InR5cGUiOiJjdXN0b21SZXN1bHQiLCJyZXN1bHQiOlsicGFn
    ZXMiLCJkYXRhIl0sInJlZ2V4IjoiaW1ndXJsPShbXiZdKj8oPzpcXC4oanBlP2d8
    cG5nfGdpZikpPykmYW1wO2ltZ3JlZnVybD0oW14mXSspJi4qP3RibmlkPShbXjpd
    Kyk6IiwicmVnZXhUeXBlIjoiaWciLCJyZXN1bHRUeXBlIjoiYXJyYXkiLCJhcnJh
    eU5hbWUiOiJpbWdzIiwicmVzdWx0cyI6WyJsaW5rIiwidHlwZSIsInJlZiIsInRi
    bmlkIl19LHsidHlwZSI6Im92ZXJyaWRlIiwiaWQiOiJmb3JtYXRyZXN1bHQiLCJ2
    YWx1ZSI6IlslIEZPUkVBQ0ggaW1ncyAtJV1cbiRxdWVyeTskbG9vcC5jb3VudDsk
    bGluazskcmVmOyR7dGJuaWR9LlslIHR5cGUgPT0gJ25vbmUnID8gJ2RlZmF1bHQu
    anBnJyA6IHR5cGUgJV0gXG5bJSBFTkQgJV0ifV1dLCJyZXN1bHRzRm9ybWF0Ijoi
    JHAxLnByZXNldCIsInJlc3VsdHNTYXZlVG8iOiJmaWxlIiwicmVzdWx0c0ZpbGVO
    YW1lIjoiJGRhdGVmaWxlLmZvcm1hdCgpLnR4dCIsImFkZGl0aW9uYWxGb3JtYXRz
    IjpbXSwicmVzdWx0c1VuaXF1ZSI6Im5vIiwicXVlcnlGb3JtYXQiOlsiJHF1ZXJ5
    Il0sInVuaXF1ZVF1ZXJpZXMiOmZhbHNlLCJzYXZlRmFpbGVkUXVlcmllcyI6ZmFs
    c2UsIml0ZXJhdG9yT3B0aW9ucyI6eyJvbkFsbExldmVscyI6ZmFsc2UsInF1ZXJ5
    QnVpbGRlcnNBZnRlckl0ZXJhdG9yIjpmYWxzZX0sInJlc3VsdHNPcHRpb25zIjp7
    Im92ZXJ3cml0ZSI6ZmFsc2V9LCJkb0xvZyI6Im5vIiwia2VlcFVuaXF1ZSI6Ik5v
    IiwibW9yZU9wdGlvbnMiOmZhbHNlLCJyZXN1bHRzUHJlcGVuZCI6IiIsInJlc3Vs
    dHNBcHBlbmQiOiIiLCJxdWVyeUJ1aWxkZXJzIjpbXSwicmVzdWx0c0J1aWxkZXJz
    IjpbXSwiY29uZmlnT3ZlcnJpZGVzIjpbXX19
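
The code above is meant to be imported into A-Parser, but it can also be inspected by hand. It begins with "eyJ", which is how a base64-encoded JSON object starts, so the sketch below assumes the preset is plain base64-wrapped JSON (the thread does not state the export format). The function name decode_preset is hypothetical.

```python
import base64
import json


def decode_preset(code):
    """Decode a line-wrapped base64 preset string into a Python dict."""
    compact = "".join(code.split())            # drop line breaks and indentation
    raw = base64.b64decode(compact)            # base64 text -> bytes
    return json.loads(raw.decode("utf-8"))     # bytes -> JSON -> dict
```

Pasting the block above into decode_preset and pretty-printing the result with json.dumps(..., indent=2) should make the regex and the result format readable without opening the A-Parser interface.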
     
  7. scrapefun

    Thanks! This works great.

    Is it possible to use the results of one parser as the queries for another parser? The help files said this was not possible at the time that page was written, and I wondered whether it is possible now.

    Basically, I want to use the Net::HTTP parser to download the actual image from Google Images. I got it working as a standalone task, but I would like to use the "$link" result value from the Google Images parser as the query for the Net::HTTP parser.

    Thanks again for your help!
     
  8. Forbidden

    Still not possible.
     
  9. scrapefun

    Ok.

    What I am doing is creating an additional result file when scraping Google Images that contains just the image URLs. I then use those URLs as the $query for the Net::HTTP parser in a separate task. With this method, however, I can't match each image to the original keyword query.

    I want to name the images with the query from the Google Images task. How do I match each image to the correct query from the Google Images parser task?
     
  10. Support

    Support Administrator
    Staff Member A-Parser Enterprise

    Joined:
    Mar 16, 2012
    Messages:
    4,532
    Likes Received:
    2,159
    As the query file, select the file obtained from the previous task.
    [IMG]
    The result is an img folder with pictures named by keyword and number.
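
The screenshot with the actual task settings is not preserved here, so the following Python sketch only illustrates the idea of the second step: each line of the first task's result file already carries the keyword and the counter, so the download target can be named from them. It assumes lines shaped like $query;$loop.count;$imgurl;$imgrefurl;$tbnid.$filetype with no semicolons inside the fields; the function name and the file-name pattern are hypothetical.

```python
import re


def target_filename(result_line, folder="img"):
    """Map one result line to (image URL, local path named by keyword and number)."""
    # Assumption: exactly five semicolon-separated fields, none containing ';'.
    query, count, imgurl, _imgrefurl, last = result_line.rstrip("\n").split(";", 4)
    ext = last.rpartition(".")[2]                        # text after the final dot
    safe_query = re.sub(r"[^\w-]+", "_", query).strip("_")  # filesystem-friendly keyword
    return imgurl, f"{folder}/{safe_query}_{count}.{ext}"
```

For example, a line for the keyword "red cats" with counter 2 and file type png maps to img/red_cats_2.png, which keeps each picture tied to the query that produced it.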
     
  11. scrapefun

    Works perfectly! This software is so amazing :)

    Thanks so much for your help.
     
