
Editing YouTube subtitles (captions). 001

Last updated at 2021-02-07. Posted at 2021-01-31.

In this article we pull the subtitles out of a YouTube video as a file.

Using Python.
Install the module we will use.
Here that is youtube_transcript_api.

install

pip install youtube_transcript_api

Let's look at the help, because it matters.

help

$ youtube_transcript_api --help
usage: youtube_transcript_api [-h] [--list-transcripts] [--languages [LANGUAGES [LANGUAGES ...]]]
                              [--exclude-generated] [--exclude-manually-created] [--json]
                              [--translate TRANSLATE] [--http-proxy URL] [--https-proxy URL]
                              [--cookies COOKIES]
                              video_ids [video_ids ...]

This is an python API which allows you to get the transcripts/subtitles for a given YouTube video.
It also works for automatically generated subtitles and it does not require a headless browser,
like other selenium based solutions do!

positional arguments:
  video_ids             List of YouTube video IDs.

optional arguments:
  -h, --help            show this help message and exit
  --list-transcripts    This will list the languages in which the given videos are available in.
  --languages [LANGUAGES [LANGUAGES ...]]
                        A list of language codes in a descending priority. For example, if this is
                        set to "de en" it will first try to fetch the german transcript (de) and
                        then fetch the english transcript (en) if it fails to do so. As I can't
                        provide a complete list of all working language codes with full certainty,
                        you may have to play around with the language codes a bit, to find the one
                        which is working for you!
  --exclude-generated   If this flag is set transcripts which have been generated by YouTube will
                        not be retrieved.
  --exclude-manually-created
                        If this flag is set transcripts which have been manually created will not
                        be retrieved.
  --json                If this flag is set the output will be JSON formatted.
  --translate TRANSLATE
                        The language code for the language you want this transcript to be
                        translated to. Use the --list-transcripts feature to find out which
                        languages are translatable and which translation languages are available.
  --http-proxy URL      Use the specified HTTP proxy.
  --https-proxy URL     Use the specified HTTPS proxy.
  --cookies COOKIES     The cookie file that will be used for authorization with youtube.

I have never once understood anything from reading help output. It always confuses me. Still, it costs nothing, so I think it is worth at least a quick glance.

How to use the module.

Git and GitHub for Beginners - Crash Course
https://www.youtube.com/watch?v=RGOj5yH7evk&t
Let's try it with this video.

I want the timing information, so I am going to extract the subtitles, but first I have no idea what the data looks like, so let's take a look.

from youtube_transcript_api import YouTubeTranscriptApi

line =[]
line[:] = YouTubeTranscriptApi.get_transcript('RGOj5yH7evk',languages=['en'])

for l in line:
    print(l)
    
del line

line = {}
line = YouTubeTranscriptApi.list_transcripts('RGOj5yH7evk')    

for l in line:
    print(l)
        
transcript = line.find_transcript(['en'])
translated = transcript.translate('ja')
print(translated.fetch())

I still cannot quite follow it. (Looking at the output above does give a rough feel for what is going on.)

Looking more closely, each entry in the object returned by `get_transcript(video_id)` has three keys:

text, start, and duration.

line =[]
line[:] = YouTubeTranscriptApi.get_transcript('belS2Ek4-ow',languages=['en'])

for l in line:
    print("text:", l['text'])
    print("start:", l['start'])
    print("duration:", l['duration'])
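Since the point of extracting the subtitles is the timing, here is a minimal sketch of how entries with these three keys could be turned into SRT-style cues. It assumes `start` and `duration` are seconds and that a cue ends at `start + duration`; `sample_entries`, `format_timestamp`, and `to_srt` are hypothetical names for illustration, and the sample values are made up rather than fetched from YouTube.

```python
# Hypothetical sample shaped like the entries described above
# (keys: text, start, duration); the values are made up.
sample_entries = [
    {"text": "Hello and welcome", "start": 0.0, "duration": 2.5},
    {"text": "to this short clip", "start": 2.5, "duration": 3.04},
]


def format_timestamp(seconds):
    """Format a number of seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(entries):
    """Build SRT text, assuming each cue ends at start + duration."""
    blocks = []
    for index, entry in enumerate(entries, start=1):
        start = format_timestamp(entry["start"])
        end = format_timestamp(entry["start"] + entry["duration"])
        blocks.append(f"{index}\n{start} --> {end}\n{entry['text']}\n")
    # A blank line separates consecutive cues.
    return "\n".join(blocks)


print(to_srt(sample_entries))
```

With real data, the list returned by `get_transcript` would take the place of `sample_entries`. Auto-generated captions may overlap in time, so the computed end times should be treated as approximate.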

The 'belS2Ek4-ow' passed to get_transcript here differs from the earlier one; this string is the video_id.

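As for what a video_id is, the video below makes it clear within its first 3 seconds. (I knew nothing about it beforehand, but I got it. Watching the whole thing may give a deeper understanding, but the video is made so carefully that the first 3 seconds or so, or even just looking at it without playing it, are enough to see which part is meant. That said, running all the code in this article will probably make it clear anyway, so skipping the video is fine too.)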
https://youtu.be/j70AA9arThc

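In other words, it identifies the YouTube video itself. The first video_id, 'RGOj5yH7evk', belongs to a long video (duration: the running time of the video), and I noticed that a shorter one returns results faster and is easier to read, so I switched to the short 'belS2Ek4-ow' for now. That is, a different clip (video).
If the clip's URL looks like this,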
https://www.youtube.com/watch?v=belS2Ek4-ow

then the video_id is the 11-character part attached at the end, after v=:
youtube.com/watch?v=belS2Ek4-ow
Here, that is belS2Ek4-ow.

e.g.
To extract the video_id from a YouTube URL in a Python program, see this reference.
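As one possible way to do that extraction, here is a minimal sketch using only the standard library. It assumes just the two URL shapes that appear in this article (`watch?v=...` and `youtu.be/...`); `extract_video_id` is a hypothetical helper name, and other URL forms (embeds, shorts, playlists) are not handled.

```python
from urllib.parse import urlparse, parse_qs

# Hosts accepted for the watch?v=... form (an assumption of this sketch).
WATCH_HOSTS = ("youtube.com", "www.youtube.com", "m.youtube.com")


def extract_video_id(url):
    """Return the video_id from a watch or youtu.be URL, or None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short form: the ID is the first path segment.
        candidate = parsed.path.lstrip("/").split("/")[0]
    elif host in WATCH_HOSTS:
        # Watch form: the ID is the value of the v query parameter.
        candidate = parse_qs(parsed.query).get("v", [""])[0]
    else:
        return None
    return candidate or None


for url in (
    "https://www.youtube.com/watch?v=belS2Ek4-ow",
    "https://www.youtube.com/watch?v=RGOj5yH7evk&t",
    "https://youtu.be/j70AA9arThc",
):
    print(extract_video_id(url))
```

Parsing the query string rather than slicing the last 11 characters keeps the result correct when extra parameters such as `&t` follow the ID.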

OK, let's try it like this.

from youtube_transcript_api import YouTubeTranscriptApi

line =[]
line[:] = YouTubeTranscriptApi.get_transcript('belS2Ek4-ow',languages=['en'])

for l in line:
    print("text: ", l['text'])
    #print("start:", l['start'])
    #print("duration:", l['duration'])
        
del line

print("- - - - - - - - - - - - - - - - - - - -- - - - - - - - - - - - - - - - - - - -")

line = YouTubeTranscriptApi.list_transcripts('belS2Ek4-ow')    
        
transcript = line.find_transcript(['en'])
print(transcript.fetch())

print("- - - - - - - - - - - - - - - - - - - -- - - - - - - - - - - - - - - - - - - -")

translated = transcript.translate('ja')
for dict_obj in translated.fetch():
    print( "text: ", dict_obj['text'] )

I feel like some understanding is starting to sprout, but the output is still long and the content is hard to follow.
So let's change the video_id to qviM_GnJbOM:

from youtube_transcript_api import YouTubeTranscriptApi

line =[]
line[:] = YouTubeTranscriptApi.get_transcript('qviM_GnJbOM',languages=['en'])

for l in line:
    print("text: ", l['text'])
    #print("start:", l['start'])
    #print("duration:", l['duration'])
        
del line

print("- - - - - - - - - - - - - - - - - - - -- - - - - - - - - - - - - - - - - - - -")

line = YouTubeTranscriptApi.list_transcripts('qviM_GnJbOM')    
        
transcript = line.find_transcript(['en'])
# print(transcript.fetch())

print("- - - - - - - - - - - - - - - - - - - -- - - - - - - - - - - - - - - - - - - -")

translated = transcript.translate('ja')
for dict_obj in translated.fetch():
    print( "text: ", dict_obj['text'] )

The API's method names are similar enough to get mixed up, but consider:

YouTubeTranscriptApi.get_transcript('qviM_GnJbOM',languages=['en'])

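In this case, get_transcript() (transcript in the singular) returns a list object. My evidence was that the result went into a list object (line[:]) without an error, although strictly speaking slice assignment accepts any iterable, so that alone only shows the result is iterable.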

from youtube_transcript_api import YouTubeTranscriptApi

line = YouTubeTranscriptApi.get_transcript('qviM_GnJbOM')

print(type(line))

The type(object) function can also be used to check this, as above.
What I actually did, though, was declare a list and see whether the result would go into it.

Anyway, I now know that the YouTubeTranscriptApi.get_transcript(video_id) method returns a list object.
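To compare the two ways of checking, here is a small sketch that runs without the API. It assigns a tuple and a generator into a list slice to show that the assignment succeeds for any iterable, while `isinstance()` inspects the object itself. The sample values are hypothetical stand-ins for "some API result".

```python
# Stand-ins for "some API result"; both are hypothetical sample values.
as_tuple = ({"text": "a"}, {"text": "b"})
as_generator = ({"text": t} for t in ("a", "b"))

line = []
line[:] = as_tuple          # no error, although the source is a tuple
print(type(line), len(line))

line[:] = as_generator      # no error with a generator either
print(type(line), len(line))

# isinstance() looks at the object itself, not at what it can be copied into.
print(isinstance(as_tuple, list))   # False
print(isinstance(line, list))       # True
```

So "it fit into `line[:]`" tells us the result can be iterated, and `type()` or `isinstance()` is the check that tells us it is a list.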

from youtube_transcript_api import YouTubeTranscriptApi

line = YouTubeTranscriptApi.get_transcript('qviM_GnJbOM')

## type(line)

text_list = []
for l in line:
    print(l['text'])
    text_list.append(l['text'])

And Still I Rise

It's poetry.

Something just caught my interest, so I am going to skip quite far ahead; I thought of something I want to try...

(This is abrupt, but) let's see what kind of difference there is between YouTube's own subtitle translation and the Google Translate API.

Install googletrans. Everything below also runs on Google Colab, so even without a local environment you can paste it into a Google Colab page in your browser and run it.

googletrans==4.0.0-rc1

pip install googletrans==4.0.0-rc1

YouTubeTranscriptApi

pip install youtube_transcript_api

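The script shows two things: the Japanese subtitles that YouTube generates from its auto-generated English subtitles, and the same auto-generated English subtitles translated by Google Translate through googletrans.
Lines are printed with line numbers so that they can be compared.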

youtube_vs_googletrans

from youtube_transcript_api import YouTubeTranscriptApi
from googletrans import Translator

video_id ='qviM_GnJbOM'
line =[]
line[:] = YouTubeTranscriptApi.get_transcript(video_id,languages=['en'])

text_list = []
for l in line:
    ##print("text: ", l['text'])
    ##print("start:", l['start'])
    ##print("duration:", l['duration'])
    text_list.append(l['text'].rstrip('\n'))
## print(line)    
del line
## print(text_list)

print("- - - - - - - - - - - - - - - - - - -  youtube  - - - - - - - - - - - - - - - - - - -")

line = YouTubeTranscriptApi.list_transcripts(video_id)    

transcript = line.find_transcript(['en'])
## print(transcript.fetch())
for count, dict_obj in enumerate(transcript.fetch()):
    print( count+1, dict_obj['text'] )

print("- - - - - - - - - - - - - - - - - - translated - - - - - - - - - - - - - - - - - - -")

translated = transcript.translate('ja')
for count, dict_obj in enumerate(translated.fetch()):
    print( count+1, dict_obj['text'] )

print("- - - - - - - - - - - - - - - - -  googletrans  - - - - - - - - - - - - - - - - -")

text_list[:] = [l.replace('\n',' ') for l in text_list]
text_list[:] = [a for a in text_list if a != ' ']

text_compo = []
for count, l in enumerate(text_list):
    print(count+1,l)

print("- - - - - - - - - - - - - - - - - - translated - - - - - - - - - - - - - - - - - - -")    
translator = Translator()
for count, l in enumerate(text_list):
    translated = translator.translate(l, dest="ja")
    ##print(count, l)
    print(count+1, translated.text)

gist code: in that version the video_id is 'LjHORRHXtyI'.
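The script prints the two translations one after the other. If you would rather see them next to each other, a minimal sketch like the following pairs them by line number. It assumes the two outputs have already been collected into lists; `side_by_side` is a hypothetical helper, and the strings are placeholders rather than real translation results.

```python
from itertools import zip_longest

# Hypothetical stand-ins for the two translated outputs; the strings are
# placeholders, not real translation results.
youtube_lines = ["yt line 1", "yt line 2"]
googletrans_lines = ["gt line 1", "gt line 2", "gt line 3", "gt line 4"]


def side_by_side(left, right, missing="-"):
    """Pair two lists by position, padding the shorter one with `missing`."""
    rows = []
    pairs = zip_longest(left, right, fillvalue=missing)
    for number, (a, b) in enumerate(pairs, start=1):
        rows.append(f"{number:>3} | {a} | {b}")
    return rows


for row in side_by_side(youtube_lines, googletrans_lines):
    print(row)
```

`zip_longest` is used instead of `zip` so that nothing is dropped when the two sides have different numbers of lines.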

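Here is something I found by trying it. Take the translate instance (is that what you call it?), the object created by Translator(). Rather than running a single one for the whole loop, if you create one, use it for a while, discard it, and create a new one, the program seems to keep running to the end without errors, even on long clips of nonstop talking (a 90-minute lecture, for example).

I have not seen anyone else try this. When the program first stopped with an error I suspected the amount of text, but with this approach the errors went away. If an error still occurs, I suspect blocking, since this is a cloud resource.

The code above does not include this improvement, but the code linked in the margin of this article, and the code on the other linked pages, use it.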

Put into words, this is unverified and no more than a hunch, so it is hard to follow.
This is what I mean.

from googletrans import Translator

# text_compo: the list of English lines to translate. The script above only
# creates it as an empty list, so fill it first (for example from text_list).
translator = Translator()
batch_size = 5  # recreate the translator after this many lines

for count, l in enumerate(text_compo):
    translated = translator.translate(l, dest='ja')
    print(count+1, translated.text)
    if (count + 1) % batch_size == 0:
        # Discard the used translator and start over with a fresh one.
        del translator
        translator = Translator()

del translator
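The same create, use, discard pattern can be sketched without network access. The version below is only an illustration under stated assumptions: `FakeTranslator` is a hypothetical offline stand-in (it upper-cases text instead of translating), and `translate_in_batches` is a hypothetical helper that builds a fresh object for every batch. Whether recreating the real Translator actually prevents the errors is, as noted above, unverified.

```python
class FakeTranslator:
    """Offline stand-in for a translator object (hypothetical)."""
    created = 0  # counts how many instances were built

    def __init__(self):
        FakeTranslator.created += 1

    def translate(self, text):
        # Placeholder "translation": just upper-case the text.
        return text.upper()


def translate_in_batches(lines, make_translator, batch_size=5):
    """Translate lines, building a fresh translator for each batch."""
    results = []
    for offset in range(0, len(lines), batch_size):
        translator = make_translator()  # new object per batch
        for text in lines[offset:offset + batch_size]:
            results.append(translator.translate(text))
    return results


lines = [f"line {n}" for n in range(1, 13)]
translated = translate_in_batches(lines, FakeTranslator, batch_size=5)
print(len(translated), FakeTranslator.created)
```

Passing the factory as an argument keeps the batching logic separate from the library. With googletrans, the factory would need to return an object whose `translate` call also passes `dest='ja'`, as in the snippet above.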

For googletrans, I refer to this.

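As for the differences, what stands out at a glance is that YouTube's Japanese translation has about half as many lines overall.
Presumably because of that, roughly two lines' worth of text comes out as a single expression in which the parts complement each other as modifiers. The googletrans side, by contrast, does not join two lines before translating¹, so each line is translated on its own; by coincidence, this sometimes feels almost as if it were staged as the rhythm of a poem.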

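Since the subtitles are originally transcribed from audio, the time axis and the character-count breaks probably affect them, but how much of that is taken into account when the English subtitles are translated into Japanese?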
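To experiment with the line-count difference, one option is to join every two transcript entries into one before translating. The following is a minimal sketch of that idea, not the linked program: it assumes entries shaped like the ones from get_transcript (text, start, duration), `merge_pairs` is a hypothetical helper, and the sample values are made up.

```python
# Hypothetical sample entries; the texts and timings are made up.
entries = [
    {"text": "first half of\na sentence", "start": 0.0, "duration": 2.0},
    {"text": "and its second half", "start": 2.0, "duration": 1.5},
    {"text": "a leftover line", "start": 3.5, "duration": 2.0},
]


def merge_pairs(entries):
    """Join every two consecutive entries into one entry."""
    merged = []
    for i in range(0, len(entries), 2):
        pair = entries[i:i + 2]  # the last pair may hold a single entry
        first, last = pair[0], pair[-1]
        merged.append({
            "text": " ".join(e["text"].replace("\n", " ").strip() for e in pair),
            "start": first["start"],
            # From the start of the first entry to the end of the last one.
            "duration": last["start"] + last["duration"] - first["start"],
        })
    return merged


for entry in merge_pairs(entries):
    print(entry)
```

Keeping `start` and recomputing `duration` means the merged entries still carry usable timing, so they can be compared against YouTube's translated lines by time as well as by line number.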

To be continued...

Some of the things I did after this article

¹ Program code with an added part that joins two lines into one before translating with googletrans. Running it lets you compare the results.
