SE::YouTube::Video - YouTube Video Data Scraper
Overview of the scraper
With the YouTube Video Data Scraper, you can collect all basic video data, as well as subtitles and comments. Queries must be links to YouTube video pages; such links can be collected using SE::YouTube. The scraper collects all data about a video in multithreaded mode.
A-Parser functionality allows you to save scraping settings for the SE::YouTube::Video scraper for future use (presets), set a scraping schedule, and much more.
Results can be saved in the format and structure you need, thanks to the powerful built-in Template Toolkit templating engine, which allows you to apply additional logic to the results and output data in various formats, including JSON, SQL, and CSV.
Collected data
- Video title and description
- Video duration
- Number of views, likes, and comments
- Link to the preview image
- Author's name, links to their avatar and channel, as well as the number of subscribers
- Video subtitles (including display time information)
- List of tags
- List of comments (including comment replies)
  - Comment ID and Parent Comment ID (for replies)
  - Author's name, profile link, and avatar
  - Comment text and publication time
- List of related videos
  - Link and video title
  - Author and date
  - Number of views and video duration
- Video chapter information ($chapters)
  - Title, start time in seconds, and a link to the preview image
Capabilities
- Interface language selection
- Subtitle language selection
- Specifying the number of comment pages (approx. 20 comments per page)
- Specifying the maximum number of reply pages for each comment (approx. 10 replies on the first page, approx. 50 on subsequent ones)
- Specifying the number of related video pages (approx. 20 videos per page)
- Shorts support
Use cases
- Collecting statistical data about YouTube videos
- Scraping subtitles and comments as a source of text data
- Searching for related videos
Features
Subtitle language selection logic
The scraper uses the following priority (in descending order): original, original translated, generated, generated translated.
For example, if the scraper is set to scrape English subtitles, then:
- if the video has original English subtitles, original subtitles will be scraped
- if the video has original subtitles but in a different language, original translated to English will be scraped
- if the video does not have original subtitles but has generated ones in English, generated subtitles will be scraped
- if the video does not have original subtitles, and generated ones are in another language (because the video is in another language), generated translated subtitles will be scraped
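The priority order above can be sketched in Python. This is only an illustration of the documented order, not the scraper's actual code: the `Track` type and `pick_subtitles` function are hypothetical names, and taking the first available track as the translation source is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Track:
    """Hypothetical description of one subtitle track on a video."""
    language: str    # language code, e.g. "en"
    generated: bool  # True for automatically generated subtitles


def pick_subtitles(tracks: list[Track], wanted: str) -> Optional[tuple[str, Track]]:
    """Return (kind, source track) following the documented priority, or None."""
    originals = [t for t in tracks if not t.generated]
    generated = [t for t in tracks if t.generated]

    # 1. Original subtitles in the requested language.
    for track in originals:
        if track.language == wanted:
            return ("original", track)
    # 2. Original subtitles in another language, translated.
    #    Assumption: the first original track is used as the source.
    if originals:
        return ("original translated", originals[0])
    # 3. Generated subtitles in the requested language.
    for track in generated:
        if track.language == wanted:
            return ("generated", track)
    # 4. Generated subtitles in another language, translated.
    if generated:
        return ("generated translated", generated[0])
    return None
```

For a video that has only generated German subtitles, `pick_subtitles([Track("de", True)], "en")` falls through to the last case and reports "generated translated".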
Scraping comments
Comments are collected in a single thread, so their scraping can be quite time-consuming, especially when scraping a large number of pages and replies. It is recommended not to set a large number of reply pages; usually 1-3 is enough, or you can disable reply scraping entirely, which will significantly speed up the process.
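To get a feel for how quickly the work grows, here is a rough upper-bound estimate in Python. It uses the approximate page sizes listed under Capabilities (about 20 comments per page, about 10 replies on the first reply page, about 50 on subsequent ones). The function name is hypothetical, and it assumes one request per page and that 0 reply pages models disabled reply scraping; real numbers will usually be lower because most comments have few replies.

```python
def estimate_comment_load(comment_pages: int, reply_pages: int) -> dict[str, int]:
    """Rough worst-case counts for one video; all page sizes are approximate."""
    comments_per_page = 20   # approx. size of a comment page
    first_reply_page = 10    # approx. replies on the first reply page
    next_reply_page = 50     # approx. replies on each subsequent reply page

    top_level = comment_pages * comments_per_page
    if reply_pages > 0:
        replies_per_comment = first_reply_page + (reply_pages - 1) * next_reply_page
    else:
        replies_per_comment = 0  # reply scraping disabled

    return {
        "top_level": top_level,
        "max_replies": top_level * replies_per_comment,
        # Assumption: one sequential request per comment page and per reply page.
        "max_pages": comment_pages + top_level * reply_pages,
    }
```

With the default settings (5 comment pages, 3 reply pages) this gives up to 100 top-level comments, up to 11000 replies, and up to 305 pages fetched one after another for a single video; with replies disabled it drops to 5 pages.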
Queries
Queries must be video links, for example:
https://www.youtube.com/watch?v=lWA2pjMjpBs
https://www.youtube.com/watch?v=EDwb9jOVRtU
https://www.youtube.com/watch?v=5NPBIwQyPWE
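Before submitting a large query list, it can help to filter out lines that are not video links. The following Python sketch is one possible pre-check, not part of A-Parser: the `video_id` helper is hypothetical, and the accepted host names and the `/shorts/` path form are assumptions about which links the scraper accepts.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Assumption: these hosts are treated as YouTube video page hosts.
YOUTUBE_HOSTS = {"www.youtube.com", "youtube.com", "m.youtube.com"}


def video_id(url: str) -> Optional[str]:
    """Return the video ID from a watch or Shorts link, or None if it is not one."""
    parts = urlparse(url.strip())
    if parts.scheme not in ("http", "https"):
        return None
    if parts.hostname not in YOUTUBE_HOSTS:
        return None
    if parts.path == "/watch":
        ids = parse_qs(parts.query).get("v", [])
        return ids[0] if ids else None
    if parts.path.startswith("/shorts/"):
        # Take the first path segment after /shorts/.
        candidate = parts.path[len("/shorts/"):].split("/")[0]
        return candidate or None
    return None


def keep_video_links(lines: list[str]) -> list[str]:
    """Keep only lines that look like video links, preserving their order."""
    return [line.strip() for line in lines if video_id(line) is not None]
```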
Output results examples
A-Parser supports flexible result formatting thanks to the built-in Template Toolkit templating engine, which allows it to output results in an arbitrary form, as well as in a structured format such as CSV or JSON.
Default output
Result format:
$query - $title\nViews: $viewsCount, likes: $likesCount, comments: $commentsCount\n
The result will display the video link, its title, and the number of views, likes, and comments:
https://www.youtube.com/watch?v=5NPBIwQyPWE - Avril Lavigne - Complicated (Official Video)
Views: 571331713, likes: 3959948, comments: 143597
https://www.youtube.com/watch?v=EDwb9jOVRtU - Madonna - Hung Up (Official Video) [HD]
Views: 414662791, likes: 2153344, comments: 91895
https://www.youtube.com/watch?v=lWA2pjMjpBs - Rihanna - Diamonds
Views: 2104207258, likes: 10235971, comments: 394622
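If the default output needs to be processed further, it can be read back into records. The Python sketch below assumes the file contains exactly the two-line pattern produced by the default result format and that titles contain no line breaks; the function name and the record keys are hypothetical.

```python
import re

# First line of a record: "<link> - <title>". Links contain no spaces,
# so the title may itself contain " - ".
HEADER = re.compile(r"^(\S+) - (.*)$")
# Second line of a record: the three counters.
STATS = re.compile(r"^Views: (\d+), likes: (\d+), comments: (\d+)$")


def parse_default_output(text: str) -> list[dict]:
    """Parse pairs of lines written by the default result format."""
    lines = [line for line in text.splitlines() if line.strip()]
    records = []
    for head, stats in zip(lines[0::2], lines[1::2]):
        h, s = HEADER.match(head), STATS.match(stats)
        if not (h and s):
            continue  # skip pairs that do not follow the expected pattern
        records.append({
            "query": h.group(1),
            "title": h.group(2),
            "views": int(s.group(1)),
            "likes": int(s.group(2)),
            "comments": int(s.group(3)),
        })
    return records
```

For anything beyond quick checks, a structured format such as CSV or JSON is usually easier to parse than free-form text.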
Subtitle output
Result format:
$query\n$subtitles.format('$text ')\n\n
The result will display the video link and subtitles in the specified language.
Output to a CSV table
The built-in tools.CSVline tool allows you to create correct tabular documents ready for import into Excel or Google Sheets.
General result format:
[% tools.CSVline(query, p1.author, p1.date, p1.duration, p1.title, p1.viewsCount, p1.likesCount, p1.commentsCount, p1.tags.format('$tag,')) %]
File name:
$datefile.format().csv
Initial text:
Link,Author,"Publish date",Duration,Title,"Views count","Likes count","Comments count",Tags
The General Result Format uses the Template Toolkit templating engine.
In the result file name, you just need to change the file extension to csv.
For the "Initial text" option to be available in the Job Editor, you need to activate "More options". In "Initial text", enter the column names separated by commas and leave the second line empty.
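A file produced with these settings can be loaded with any CSV reader. As a minimal Python sketch, assuming the column names from the "Initial text" above and that the Tags column holds tags joined with commas (with a trailing comma, as the `'$tag,'` format suggests); the `read_videos` helper is a hypothetical name, and a tag that itself contains a comma would be split by this approach:

```python
import csv
import io


def read_videos(csv_text: str) -> list[dict]:
    """Read the CSV result into dictionaries keyed by the header row."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Split the combined Tags field and drop the empty item
        # left by the trailing comma.
        row["Tags"] = [tag for tag in row["Tags"].split(",") if tag]
        rows.append(row)
    return rows
```

To read from disk, pass the file contents, for example `read_videos(open(path, encoding="utf-8", newline="").read())`, where `path` points to the saved result file.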
Possible settings
| Parameter name | Default value | Description |
|---|---|---|
| Interface language | English | Interface language selection |
| Subtitles language | English | Subtitle language selection |
| Comments pages count | 5 | Number of comment pages |
| Pages count for replies | 3 | Number of reply pages for each comment |
| Pages count for related videos | 5 | Number of pages with related videos |
| Login required is error | ☑ | Treats the "authorization required" message as an error and retries the query |