Tool for extracting comments or subtitles from youtube video's
A tool for extracting info from youtube:
You can install this from the official python repository using pip
:
pip3 install youtube-tool
This will add a command yttool
to your python binaries directory,
and probably also to your search path. So you can run this like:
yttool ....arguments....
Note: depending on your local python installation(s), you may have to type
one of pip
, pip3
, or maybe even: pip3.8
.
You can also 'install' this by executing the yttool.py
file directly from
the source directory:
python3 yttool.py ....arguments...
This script needs python 3.8 or later to run.
The python3.8 specific feature I am using is the new :=
walrus operator.
This will output the subtitles in all available languages.
yttool --subtitles https://www.youtube.com/watch?v=bJOuzqu3MUQ
Or list the subtitles prefixed with timestamps
yttool -v --subtitles https://www.youtube.com/watch?v=bJOuzqu3MUQ
You can also extract the subtitles in a format suitable for
creating .srt
subtitle files:
yttool --srt --subtitles https://www.youtube.com/watch?v=bJOuzqu3MUQ
Or you can filter by language, for example only output the english subtitles:
yttool --language en --subtitles https://www.youtube.com/watch?v=0xY06PT5JDE
Or only output the automatically generated subtitles:
yttool --language asr --subtitles https://www.youtube.com/watch?v=0xY06PT5JDE
List all the comments for this Numberphile video:
yttool --comments https://www.youtube.com/watch?v=bJOuzqu3MUQ
Print out an entire livechat replay:
yttool --replay https://www.youtube.com/watch?v=lE0u_jIDh0E
Note: this does not yet work!
Print messages from a livechat as they come:
yttool --livechat https://www.youtube.com/watch?v=EEIk7gwjgIM
List all the video's contained in this System of a Down playlist:
yttool --playlist https://www.youtube.com/playlist?list=PLSKnqXUHTaSdXuK8Z2d-hXLFtJbRZwPtJ
The output will look like this:
CSvFpBOe8eY - System Of A Down - Chop Suey! (Official Video)
zUzd9KyIDrM - System Of A Down - B.Y.O.B. (Official Video)
L-iepu3EtyE - System Of A Down - Aerials (Official Video)
iywaBOMvYLI - System Of A Down - Toxicity (Official Video)
DnGdoEa1tPg - System Of A Down - Lonely Day (Official Video)
LoheCz4t2xc - System Of A Down - Hypnotize (Official Video)
5vBGOrI6yBk - System Of A Down - Sugar (Official Video)
SqZNMvIEHhs - System Of A Down - Spiders (Official Video)
ENBv2i88g6Y - System Of A Down - Question! (Official Video)
bE2r7r7VVic - System Of A Down - Boom! (Official Video)
F46r-_jPPHY - System Of A Down - War? (Official Video)
The first 11 characters are the video id, you can load the corresponding video
by typing: https://www.youtube.com/watch?v=5vBGOrI6yBk
in your browser's URL bar.
Or list all video's from a channel:
yttool -l https://www.youtube.com/channel/UCoxcjq-8xIDTYp3uz647V5A
Or when you don't know the channelid, you can get the same with the username:
yttool -l https://www.youtube.com/user/numberphile
This:
yttool -q somequery
Will list first couple of the video's matching that query.
You can also call yttool with only the video id as an argument:
yttool --info CSvFpBOe8eY
For example if you would like to use TOR, you would do this:
yttool --proxy socks5://localhost:9050 --info https://www.youtube.com/watch?v=Ll-_LV9U1tA
Note that setting a socks proxy via the https_proxy
environment variable does NOT work very well with python's urllib library.
This script does not use the official youtube API, instead, it uses youtube's internal api, which is what is used on the youtube website itself. This does mean there is no guarantee that this script will keep working without maintenance. Youtube will keep changing the way it works internally. So I will need to keep updating this script.
The advantage of using the internal API, is that there are apparently no limits to how many requests you can do. And you don't have to bother with any kind of registration.
These are the main internal api urls I am using:
https://www.youtube.com/comment_service_ajax
https://www.youtube.com/live_chat_replay/get_live_chat_replay
https://www.youtube.com/youtubei/v1/search
https://www.youtube.com/browse_ajax
Also, you can get youtube to respond with json instead of html by adding a &pbj=1
argument to most urls,
and add http headers: x-youtube-client-name: 1
and x-youtube-client-version: 2.20200603.01.00
to your request.
Also the user-agent header needs to be of the right format, see my script for a working example.
Then, for search you need to add a innertubeapikey
. Which I have currently hardcoded in my script, as i did with the client-version.
A future improvement would be to automatically extract these from the current youtube front page.
Youtube's id's are structured in several ways:
A videoid is 11 characters long, when decoded using base64, this results in exactly 8 bytes.
The last character of a videoid can only be: 048AEIMQUYcgkosw
--> 10x6+4 = 64 bits
A playlist id is either 24 or 34 characters long, and has the following format:
course
page is broken,
with lots of overlapping text.A channel-id is 22 base64 characters, with the last character one of: AQgw
, so this decodes to 21x6+2 = 128 bits
"UC
"PU
"UU
"LL
"FL
"RDCMUC
prefixes CL, EL, MQ, TT, WL also seem to have a special meaning
These take
"TLGG<22chars>" -- temporary list - redir from watch_videos
"RDEM<22chars>" -- radio channel
"RD
"OLAK5uy_<33chars>" -- album playlist.
klmn
: 0b1001xxAEIMQUYcgkosw048
--> 2 + 31x6 + 4 = 192 bits"WL" -- 'watch later'
"UL" -- channel video mix
"LM" -- music.youtube likes
"RDMM" -- music.youtube your mix
"RDAMVM
"RDAO<22chars>"
"RDAMPL" + prefix+playlistid
"RDCLAK5uy_" + 33chars
"RDTMAK5uy_" + 33chars
prefixes EL, CL also seem to have a special meaning.
Domains:
youtu.be
youtube.com
UrlPath:
/watch?v=<videoid>&t=123s&list=<listid>
/v/<videoid>
/embed/<videoid>
/embed/videoseries?list=<playlistid>
/watch/<videoid>
/playlist?list=<playlistid>
/channel/<channelid>
/user/<username>
/watch_videos?video_ids=<videoid>,<videoid>,...
Some id's are base64 encoded protobuf packets, like: clickTrackingParams, continuation.
I added a tool: ytdump.py
, which i use to investigate youtube json dictionaries.
--replay
option.Willem Hengeveld [email protected]