SE::Dogpile::Images - image scraper from Dogpile
Scraper Overview
SE::Dogpile::Images scrapes images from Dogpile search results. With it, you can build databases of image links or collect images ready for further use. Queries are written the same way you would enter them in the Dogpile search bar.
The A-Parser functionality allows you to save the parsing settings of the Dogpile scraper for further use (presets), set up a parsing schedule, and much more. You can use automatic query multiplication, substitution of subqueries from files, enumeration of alphanumeric combinations and lists to obtain the maximum possible number of results.
Results can be saved in the form and structure you need thanks to the built-in Template Toolkit templating engine, which lets you apply additional logic to the results and output data in various formats, including JSON, SQL, and CSV.
Scraper Use Cases
Downloading images by link
A-Parser lets you chain tasks: when the first task completes, the second one starts, using the links collected by the first task as its queries.
Download example
How to import the example into A-Parser
eJyNVktT2zAQ/iuMhkNoQ2IOvfjCBGimdCihEE4hnVHjtSuQJSPJAcbkv3dXNn6k
JvRmrfa9335ywRy3D/bKgAVnWbgoWOa/WcjOdJIJCXvnKU9gL9JPSmoegWFDlnFj
wZD+gt18DcNKNQy9rkWNCGKeS8eWyyFDh/hpp9qknBzvZ0ejKkp9ecPXMNd4GaOf
RjzF0yVPgawi7oBuR7F3NDgYuWfywKNIOKEVl2UESquJeqvEY0721hmhEtTHoxFg
p0anKHbgnZDw5S3DBdv3Z4Zucm//s7RhYcylhSGzmO6UYzLR9o1wYLjTZpZRTigv
mFYTKS9gDbJR8/5PciGxoXYSo9F5ZdivMvvHx6YusR1qDebJYA61F386mf1orCJ9
oROsPPqNdUuRCodne6pzRcMJUPgAkNV9u9QoSbWBOowzOdTBEToZqAgVm6lNskbU
qaIzma5wpVUskhnmb0QEb5q5miM+Z+pUp5kEKouVENs7a+Mxt3DdAGZiq6HQoU53
29WpD0h9qKA6ZE5rab/flIlnRiAev1C6Kba1nUPV2hWX8vb6opNdgy/yrBNYaaEY
6TpINEIK69oMi84GXYILw2/z+VVrb1DFQALP6AU740C5Q/eSQTj6dCyoAeNBppLX
+wyOk9dExAe++ag/RyW0EbQXxvCXanmouPJmlVun07Jb9cRQ/ge4H8abqHIUS97a
UloN2hdfwppLDxClFTT+cUOdn4n2cKHFA2XbkcqFK/WqKEo7eMy5ZJs2XzS770u2
43IrRypPD/eL6hsVNiNilNJtD2JULuU7QO/DcC89/A96doIwaPZO6Q9Ja5sN30Fa
G1T9NIucyXuWsGBW52ZFbkqiI+zTcKmdbDmssTceLH6Nl58P7u5Gg+OwC7n9HsxV
ICjNN8th85r0rW0PXWzRbdBHRvVm9rBw0N6/7ZXusFjwDhFtvxt+XB+xerCb0bev
O2webPo4JtjxJvVz5a7HIGg/BBTQzwB7fuSHVJJR/RdQ9D7qYYF2DVfhEc/39qo0
JphXOhjW+tU/2vwFUiDnrA==
Collected Data
- Links to images
- Page links
- Height and width
- Preview links
Use Cases
- Collecting images to fill your blogs
- Collecting images to fill your websites
- Collecting avatar databases
Queries
Search phrases should be specified as queries, for example:
Remington
Tiger
Romeo and Juliet
Quantum mechanics
Query Substitutions
You can use built-in macros to multiply queries. For example, to get a very large database of forums, specify several main queries in different languages:
forum
форум
foro
论坛
In the query format, specify an enumeration of characters from a to zzzz. This method rotates the search results as much as possible and yields many new unique results:
$query {az:a:zzzz}
This macro creates 475254 additional queries for each original search query, which gives 4 x 475254 = 1901016 search queries in total. That is an impressive number, but it is not a problem for A-Parser: at a speed of 2000 requests per minute, the task will be processed in about 16 hours.
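The arithmetic above can be checked with a short Python sketch. It assumes the {az:a:zzzz} macro enumerates every string of lowercase Latin letters from one to four characters long, and that each generated query costs a single request; both are assumptions made for illustration, and the helper name `az_combinations` is hypothetical.

```python
import string

ALPHABET = string.ascii_lowercase  # 26 letters, a..z


def az_combinations(max_len):
    """Count all strings of length 1..max_len over the letters a..z."""
    return sum(len(ALPHABET) ** n for n in range(1, max_len + 1))


per_query = az_combinations(4)  # every combination from "a" to "zzzz"
total = 4 * per_query           # four base queries from the example
hours = total / 2000 / 60       # 2000 requests per minute (assumed 1 request per query)

print(per_query, total, round(hours, 1))
```

The result is roughly 15.8 hours, which agrees with the rounded figure of 16 hours. If more than one page is requested per query, the real duration would grow accordingly.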
Output Results Options
A-Parser supports flexible formatting of results thanks to the built-in Template Toolkit templating engine, which can output results in any form, including structured formats such as CSV or JSON.
Default Output
Result format:
$serp.format('$link\n')
Example result:
http://crossexamined.org/wp-content/uploads/2014/06/Quantum_Computer.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/e/e7/Hydrogen_Density_Plots.png/1200px-Hydrogen_Density_Plots.png
http://3.bp.blogspot.com/-7mo9xgi0zZ0/VDcYLKYsZmI/AAAAAAAABc8/toMaFUqtcEc/s1600/24854-quantum-mechanics.jpg
http://4.bp.blogspot.com/-FnufNdvAIAI/T6GAIsE9QrI/AAAAAAAADgs/ini4LJG_Nes/s1600/A+Mass+&+Energy.jpg
http://40.media.tumblr.com/tumblr_ma6rb5smWd1rx06nvo1_1280.jpg
https://media.buzzle.com/media/images-en/gallery/education/physics/1200-261760-basics-of-quantum-mechanics.jpg
https://wonderopolis.org/wp-content/uploads/2017/03/Quantum_Physicsdreamstime_xxl_60222747.jpg
https://cdn.wallpapersafari.com/20/6/FkmvcC.gif
https://media.buzzle.com/media/images-en/gallery/education/chemistry/1200-96168909-atoms-emit-light.jpg
http://www.therealityfiles.com/wp-content/uploads/2014/12/quantum_mechanics.jpg
http://i.dailymail.co.uk/i/pix/2010/03/18/article-1258932-014CFA7D000004B0-375_468x462.jpg
https://cdn.wallpapersafari.com/7/34/jXU8Ay.gif
http://mednorthwest.com/wp-content/uploads/2015/09/Quantum-entanglement-wave-particle.jpg
http://steve-patterson.com/wp-content/uploads/2015/02/QuantumPhysics2.jpg
http://cdn1.collective-evolution.com/assets/uploads/2016/07/QuantumPhysics-759x500.jpg
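A list of links in this form is easy to post-process outside A-Parser. The sketch below is one possible way to deduplicate the links and choose a local file name for each before downloading. The function name `local_names` and the naming scheme are hypothetical and are not part of A-Parser.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def local_names(links):
    """Deduplicate links (keeping order) and pick a unique file name for each."""
    names = {}    # link -> chosen file name
    used = set()  # file names already taken
    for link in links:
        link = link.strip()
        if not link or link in names:
            continue  # skip blank lines and repeated links
        base = posixpath.basename(unquote(urlsplit(link).path)) or "image"
        stem, ext = posixpath.splitext(base)
        name, n = base, 1
        while name in used:  # avoid clashes between different hosts
            name = f"{stem}_{n}{ext}"
            n += 1
        used.add(name)
        names[link] = name
    return names


links = [
    "https://cdn.wallpapersafari.com/20/6/FkmvcC.gif",
    "https://cdn.wallpapersafari.com/7/34/jXU8Ay.gif",
    "https://cdn.wallpapersafari.com/20/6/FkmvcC.gif",  # duplicate line
]
names = local_names(links)
```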
Output in CSV Table
Result format:
[% FOREACH item IN serp;
tools.CSVline(query, item.link, item.width, item.height, item.page, item.thumb);
END %]
Example result:
cats,https://cdn2.theweek.co.uk/sites/theweek/files/2017/11/131117-wd-cats.jpg,1400,788,https://www.theweek.co.uk/94877/why-are-so-many-australian-towns-introducing-cat-curfews,https://tse3.mm.bing.net/th?id=OIP.iYyPimFLj1_wgKEsTsggQgHaEK&pid=Api
cats,http://mymodernmet.com/wp/wp-content/uploads/2017/03/gabrielius-khiterer-stray-cats-8.jpg,750,1028,https://mymodernmet.com/gabrielius-khiterer-stray-cat-photos/,https://tse2.mm.bing.net/th?id=OIP.ZjfS8JQc9sahsK0-w8dRFAHaKJ&pid=Api
cats,https://www.israelhayom.com/wp-content/uploads/2020/04/why-cats-are-best-pets-worshipped-animals-1559234295.jpg,2119,1415,https://www.israelhayom.com/2020/04/23/2-nyc-cats-test-positive-for-coronavirus-officials-recommend-pet-precautions/,https://tse1.mm.bing.net/th?id=OIP.U7274nc_llbuQTChXpKVNgHaE8&pid=Api
cats,http://fishsubsidy.org/wp-content/uploads/2020/01/abyssinian-cats.jpg,1204,1445,http://fishsubsidy.org/category/cat/cat-breeds/,https://tse3.mm.bing.net/th?id=OIP.uHEu4-5TLJ6SSgDree6ahQHaI4&pid=Api
cats,https://external-preview.redd.it/gxbKXOj-OF1_RSHa7Ncp8Gs_OFFP5i6V7SU5DPT2t1E.jpg?auto=webp&s=b6e85ba0f1517dc629d21208a7d9db992d550ba9,1920,2560,https://www.reddit.com/r/cats/comments/2k2pio/my_very_ugly_cat/,https://tse1.mm.bing.net/th?id=OIP.t2BxlpEwcGrXJJQSToWVBAHaJ4&pid=Api
cats,http://www.zastavki.com/pictures/originals/2013/Animals_Cats_Sleeping_gray_kitten_036760_.jpg,2560,1600,http://www.zastavki.com/eng/Animals/Cats/wallpaper-36760.htm,https://tse4.mm.bing.net/th?id=OIP.3c_ISLWidlMWXHfjqkpB2wHaEo&pid=Api
cats,https://d.ibtimes.co.uk/en/full/1457779/cats-dont-need-owners.jpg,720,1280,https://www.ibtimes.co.uk/cats-prefer-their-owners-other-people-dont-need-them-feel-safe-1518912,https://tse1.mm.bing.net/th?id=OIP.COdza3KGEWHT3uo9gJ5-0QCoEs&pid=Api
cats,https://img.webmd.com/dtmcms/live/webmd/consumer_assets/site_images/article_thumbnails/reference_guide/why_cats_sneeze_ref_guide/1800x1200_why_cats_sneeze_ref_guide.jpg,1800,1200,https://pets.webmd.com/cats/why-cats-sneeze,https://tse4.mm.bing.net/th?id=OIP.6C8jTceMZG78kseu8RUyfAHaE8&pid=Api
cats,http://mcdaniel.hu/wp-content/uploads/2015/01/6784063-cute-cats-hd.jpg,2560,1600,http://mcdaniel.hu/cat-adoption-101/,https://tse4.mm.bing.net/th?id=OIP.QdEkrZjd1c_VN_aUtleoFgHaEo&pid=Api
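To consume a file in this format, a minimal Python sketch could look as follows. The column names are taken from the order of arguments in the tools.CSVline call above; the function name `read_results` and the width threshold are illustrative assumptions.

```python
import csv
import io

# Column order follows the tools.CSVline(...) call in the result format.
FIELDS = ["query", "link", "width", "height", "page", "thumb"]


def read_results(text):
    """Parse CSV result text into dicts, converting sizes to integers."""
    rows = []
    for row in csv.DictReader(io.StringIO(text), fieldnames=FIELDS):
        row["width"] = int(row["width"])
        row["height"] = int(row["height"])
        rows.append(row)
    return rows


sample = (
    "cats,http://mcdaniel.hu/wp-content/uploads/2015/01/6784063-cute-cats-hd.jpg,"
    "2560,1600,http://mcdaniel.hu/cat-adoption-101/,"
    "https://tse4.mm.bing.net/th?id=OIP.QdEkrZjd1c_VN_aUtleoFgHaEo&pid=Api\n"
)
rows = read_results(sample)
# Keep only images at least 1920 pixels wide.
large = [r["link"] for r in rows if r["width"] >= 1920]
```

Using the `csv` module rather than splitting on commas keeps the parsing correct if a field is quoted because it contains a comma.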
Saving in SQL Format
Result format:
[% FOREACH serp;
"INSERT INTO serp VALUES('" _ query _ "', '"; link _ "', '"; page _ "', '"; thumb _ "')\n";
END %]
Example result:
INSERT INTO serp VALUES('cats', 'https://viralcats.net/blog/wp-content/uploads/2017/12/Mean-looking-cat-Viral-Cats-03.jpg', 'https://viralcats.net/blog/2017/12/30/10-kitties-that-you-dont-want-to-mess-with/', 'https://tse2.mm.bing.net/th?id=OIP.AdkhgipoWbJwiQBp9VIWpgAAAA&pid=Api')
INSERT INTO serp VALUES('cats', 'http://mymodernmet.com/wp/wp-content/uploads/2017/03/gabrielius-khiterer-stray-cats-8.jpg', 'https://mymodernmet.com/gabrielius-khiterer-stray-cat-photos/', 'https://tse2.mm.bing.net/th?id=OIP.ZjfS8JQc9sahsK0-w8dRFAHaKJ&pid=Api')
INSERT INTO serp VALUES('cats', 'http://fishsubsidy.org/wp-content/uploads/2020/01/abyssinian-cats.jpg', 'http://fishsubsidy.org/category/cat/cat-breeds/', 'https://tse3.mm.bing.net/th?id=OIP.uHEu4-5TLJ6SSgDree6ahQHaI4&pid=Api')
INSERT INTO serp VALUES('cats', 'https://cdn2.theweek.co.uk/sites/theweek/files/2017/11/131117-wd-cats.jpg', 'https://www.theweek.co.uk/94877/why-are-so-many-australian-towns-introducing-cat-curfews', 'https://tse3.mm.bing.net/th?id=OIP.iYyPimFLj1_wgKEsTsggQgHaEK&pid=Api')
INSERT INTO serp VALUES('cats', 'https://www.israelhayom.com/wp-content/uploads/2020/04/why-cats-are-best-pets-worshipped-animals-1559234295.jpg', 'https://www.israelhayom.com/2020/04/23/2-nyc-cats-test-positive-for-coronavirus-officials-recommend-pet-precautions/', 'https://tse1.mm.bing.net/th?id=OIP.U7274nc_llbuQTChXpKVNgHaE8&pid=Api')
INSERT INTO serp VALUES('cats', 'https://s-i.huffpost.com/gen/964776/images/o-CATS-KILL-BILLIONS-facebook.jpg', 'https://www.huffingtonpost.com/2013/01/30/domestic-cats-kill-billions-mice-birds-annually-study_n_2575833.html', 'https://tse1.mm.bing.net/th?id=OIP.ETFxELWtgKQwMlcoccq-SAHaHa&pid=Api')
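Statements like these can be loaded into a database with a few lines of Python. The sketch below uses an in-memory SQLite database and assumes a hypothetical `serp` table with four text columns in the same order as the result format (query, link, page, thumb); the real schema is up to you.

```python
import sqlite3

# Hypothetical schema matching the column order of the result format.
SCHEMA = "CREATE TABLE serp (query TEXT, link TEXT, page TEXT, thumb TEXT)"


def load_dump(sql_text):
    """Execute one INSERT statement per non-empty line and return the connection."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    for line in sql_text.splitlines():
        if line.strip():
            conn.execute(line)
    conn.commit()
    return conn


sample = (
    "INSERT INTO serp VALUES('cats', "
    "'http://fishsubsidy.org/wp-content/uploads/2020/01/abyssinian-cats.jpg', "
    "'http://fishsubsidy.org/category/cat/cat-breeds/', "
    "'https://tse3.mm.bing.net/th?id=OIP.uHEu4-5TLJ6SSgDree6ahQHaI4&pid=Api')\n"
)
conn = load_dump(sample)
count = conn.execute("SELECT COUNT(*) FROM serp").fetchone()[0]
```

The result format above inserts values verbatim, so a query or URL containing a single quote would produce an invalid statement. Escaping such values in the template or loading from CSV with parameterized inserts avoids that.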
Dump Results to JSON
General result format:
[% IF notFirst;
",\n";
ELSE;
notFirst = 1;
END;
obj = {};
obj.query = query;
obj.images = [];
FOREACH item IN p1.serp;
obj.images.push({
width = item.width
height = item.height
link = item.link
page = item.page
thumb = item.thumb
});
END;
obj.json %]
Initial text:
[
Final text:
]
Example result:
[{
"images": [
{
"link": "https://viralcats.net/blog/wp-content/uploads/2017/12/Mean-looking-cat-Viral-Cats-03.jpg",
"width": "462",
"page": "https://viralcats.net/blog/2017/12/30/10-kitties-that-you-dont-want-to-mess-with/",
"thumb": "https://tse2.mm.bing.net/th?id=OIP.AdkhgipoWbJwiQBp9VIWpgAAAA&pid=Api",
"height": "722"
},
{
"link": "http://mymodernmet.com/wp/wp-content/uploads/2017/03/gabrielius-khiterer-stray-cats-8.jpg",
"width": "750",
"page": "https://mymodernmet.com/gabrielius-khiterer-stray-cat-photos/",
"thumb": "https://tse2.mm.bing.net/th?id=OIP.ZjfS8JQc9sahsK0-w8dRFAHaKJ&pid=Api",
"height": "1028"
},
{
"link": "http://fishsubsidy.org/wp-content/uploads/2020/01/abyssinian-cats.jpg",
"width": "1204",
"page": "http://fishsubsidy.org/category/cat/cat-breeds/",
"thumb": "https://tse3.mm.bing.net/th?id=OIP.uHEu4-5TLJ6SSgDree6ahQHaI4&pid=Api",
"height": "1445"
}
],
"query": "cats"
}]
To make the "Initial text" and "Final text" options available in the Task Editor, you need to activate "More options".
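Once the dump is a valid JSON array, it can be read back with any JSON library. The following Python sketch groups images by query; the function name `images_by_query` is hypothetical, and the sample is a shortened version of the example result.

```python
import json


def images_by_query(dump):
    """Map each query to a list of (link, width, height) tuples."""
    result = {}
    for entry in json.loads(dump):
        images = result.setdefault(entry["query"], [])
        for img in entry["images"]:
            # Sizes appear as strings in the example result, so convert them.
            images.append((img["link"], int(img["width"]), int(img["height"])))
    return result


sample = """[{
  "images": [
    {
      "link": "http://fishsubsidy.org/wp-content/uploads/2020/01/abyssinian-cats.jpg",
      "width": "1204",
      "page": "http://fishsubsidy.org/category/cat/cat-breeds/",
      "thumb": "https://tse3.mm.bing.net/th?id=OIP.uHEu4-5TLJ6SSgDree6ahQHaI4&pid=Api",
      "height": "1445"
    }
  ],
  "query": "cats"
}]"""
parsed = images_by_query(sample)
```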
Possible settings
Parameter | Default value | Description |
---|---|---|
Pages count | 5 | Number of pages to scrape |
Util::ReCaptcha2 preset | default | Determines whether to use Util::ReCaptcha2 to bypass captchas |
ReCaptcha2 retries | 3 | Number of times to retry submitting the captcha response without changing the proxy |