
googletrans, YouTube subtitles, with Google Colab

Posted at 2021-02-11


Previously, when I wanted to transcribe what someone was saying, I would upload the recording to YouTube, wait about an hour for captions to be generated automatically, download the caption file as .srt / .sbv, correct the mistakes in a full-featured subtitle editor such as **Subtitle Edit** or **Aegisub**, and export the finished result as a text file.
I also tried [the demo of IBM Watson Speech to Text](https://speech-to-text-demo.ng.bluemix.net/) (I have never seen a Japanese-language page for it; the link seems not to be maintained at all) and the experimental public captioning server of the Kawahara laboratory at Kyoto University, and compared the results. In terms of ease of editing, subtitle files were the smoothest to work with. At the time I didn't know about emacs, so I also didn't know how pleasant it is to transcribe in a text editor by combining mpv with emacs. I was somewhat involved in [making an archive](https://project.log.osaka/nakajima/), a project of recording and transcribing discussions about [the photographs](https://atom.log.osaka/index.php/sn01) of a photographer who kept taking pictures related to ....

rf. Kyoto University, Kawahara Laboratory: Captioning System based on Automatic Speech Recognition Technology (音声認識を用いた自動字幕)
http://caption.ist.i.kyoto-u.ac.jp/

rf. 中島敏 Photo Archive "KAIDAI" (中島敏フォトアーカイブ・解題)
https://project.log.osaka/nakajima/

rf. Nikolaj Lynge Olsson's 'Subtitle Edit'
https://nikse.dk/

rf. Aegisub GitHub page
https://github.com/Aegisub/Aegisub

... It looks like it will be evening by the time we get there.

This article takes a YouTube video that already has subtitles (the language doesn't matter, but here I assume English) and fetches its translation as plain text, with results that feel a bit better than YouTube's automatic translation.
There is a lot of room for improvement.

If you read the code, what the program does should be clear. If something is unclear, you can uncomment the relevant lines and inspect values with print('what_you_want_to_know'), then adapt the code to what you want to do. This project was written and verified almost entirely on an Amazon Kindle Fire 8 (2020), a 6,500-yen Android tablet, so the small screen is a handicap, but it should also work on a smartphone as long as the internet connection and browser support Google Colab.

Google Colab has its merits here, but I am not yet sure enough about them to state them plainly. One advantage is that the IP address can be changed easily. By the way, you can check the IP address of the Colab notebook with:

```
!curl ipecho.net/plain
```

https://ipecho.net/developers.html

> Please note that this is a free service, provided as-is without any warranty.
>
> I know the links have to be consistent so they will probably never change, I'll also try to keep the uptime as close to 100% as possible, but again, I can't offer any warranty for anything and I reserve the right to do whatever I please with it, so you're using the service at your own risk.
>
> If you're planning to use this service in an application you might consider using this link instead: http://ipecho.net/plain
>
> PS: Please don't abuse the system, try to cache the IP for a reasonable amount of time before doing another request, so everyone can enjoy it.

You can check it that way. Incidentally, there also seems to be a way to use Google Colab with a local runtime.

If you concatenate the two programs below and set urltext to the URL of the target YouTube video, the script automatically picks up the subtitle text, and googletrans processes the fetched text. I had intended to display the results of YouTube's automatic translation and the Google Translate API side by side for comparison, but after trying a few videos, YouTube's automatic translation appeared to be no different from Google Translate applied to captions concatenated two lines at a time (for the original English text). So I ignore it and instead merge three lines into one, then two into one, then two into one again, and so on, to feed the translator the longer sentences that machine translation handles well, before passing the text to the Google Translate API.
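
To make the merging scheme concrete, here is a minimal standalone sketch. The helper merge_lines is my own hypothetical shorthand; the actual programs below implement the same idea with line_edit2 / line_edit3.

```python
# Sketch: join every n consecutive caption lines into one longer line.
def merge_lines(lines, n):
    return [' '.join(lines[i:i + n]) for i in range(0, len(lines), n)]

captions = ['so today we are', 'going to look at', 'machine translation,',
            'which works best', 'on longer', 'sentences']
step1 = merge_lines(captions, 3)  # 3 lines -> 1
step2 = merge_lines(step1, 2)     # then 2 lines -> 1
print(step2)
# ['so today we are going to look at machine translation, which works best on longer sentences']
```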

Since I sometimes want to compare the output with another machine translation, I split the process into two parts: (1) fetching the subtitle text from YouTube and concatenating the lines, and (2) translating that text with Google Translate. If you add one more wrapper program, you can batch-process a whole list of URLs, as sketched below.
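
One possible shape for such a batch wrapper, assuming youtube_transcript_api is installed; the URL list and output file names here are hypothetical, and extract_video_id is the same helper as in program #1:

```python
# Hypothetical batch driver: fetch English subtitles for a list of videos.
from urllib.parse import urlparse, parse_qs
from youtube_transcript_api import YouTubeTranscriptApi

def extract_video_id(url):
    query = urlparse(url)
    if query.hostname == 'youtu.be':
        return query.path[1:]
    if query.hostname in {'www.youtube.com', 'youtube.com'} and query.path == '/watch':
        return parse_qs(query.query)['v'][0]
    return None

urls = ['https://www.youtube.com/watch?v=P9GLDezYVX4']  # add more URLs here

for url in urls:
    vid = extract_video_id(url)
    entries = YouTubeTranscriptApi.get_transcript(vid, languages=['en'])
    with open('subtitle_' + vid + '.txt', 'w') as f:  # one output file per video
        for e in entries:
            f.write(e['text'].replace('\n', ' ') + '\n')
```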

Of course, this can also be done on Android (verified with Pydroid 3; QPython gave an error). The code below assumes Google Colab, so you will need to rewrite it a bit elsewhere, but that shouldn't be difficult, and it should be almost the same in a Jupyter notebook. (I don't know for sure, as I haven't tried it yet. I'd like to migrate and check soon, but I lost hours failing to install Jupyter under QPython, so I still have no path to a Jupyter test environment.)
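
One common way to keep the Colab-specific parts optional is to guard the google.colab import; this is a sketch of the pattern, not something from the original code:

```python
# Guard the Colab-only import so the same script also runs in plain Jupyter.
try:
    from google.colab import files
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

# ... produce 'subtitle.txt' / 'translated.txt' as in the programs below ...

if IN_COLAB:
    files.download('translated.txt')  # Colab's browser-download helper
```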

#1 Subtitle
youtube_transcript_api (0.3.1)

```shell:googlecolab
!pip install youtube_transcript_api
```
```python:googlecolab
from youtube_transcript_api import YouTubeTranscriptApi
#from googletrans import translator
from google.colab import files
#import time
import sys  # needed below for the sys.stdout redirection
from urllib.parse import urlparse, parse_qs

urltext ='https://www.youtube.com/watch?v=P9GLDezYVX4' ## YouTube URL
args = [urltext]
video_id = ''

def extract_video_id(url):
    """Extract the video id from the usual YouTube URL formats."""
    query = urlparse(url)
    if query.hostname == 'youtu.be':
        return query.path[1:]
    if query.hostname in {'www.youtube.com', 'youtube.com'}:
        if query.path == '/watch':
            return parse_qs(query.query)['v'][0]
        if query.path[:7] == '/embed/':
            return query.path.split('/')[2]
        if query.path[:3] == '/v/':
            return query.path.split('/')[2]
    return None  # not a recognized YouTube URL

for url in args:
    video_id = (extract_video_id(url))
    print('youtube video_id:',video_id)

## youtube video_id

line = YouTubeTranscriptApi.get_transcript(video_id, languages=['en']) ## if the YouTube subtitles are in English

text_list = [] # subtitle text lines
for l in line:
    ##print("text: ", l['text'])
    ##print("start:", l['start'])
    ##print("duration:", l['duration'])

    l['text'] = l['text'].strip()             # strip leading/trailing whitespace
    l['text'] = l['text'].rstrip('\n')        # drop a trailing newline
    l['text'] = l['text'].rstrip('\r')        # drop a trailing carriage return
    l['text'] = l['text'].replace('\r', '')   # remove remaining carriage returns
    l['text'] = l['text'].replace('\n', ' ')  # newlines inside a caption become spaces
    text_list.append(l['text'])

##text_list[:] = [a for a in text_list if a != ' '] ## same as above
##text_list[:] = [l.replace('\n',' ') for l in text_list]
##print(line)    

del line

##print(text_list)

original_stdout = sys.stdout ## stdout backup
filename = 'subtitle.txt' ## print subtitle text to this file
with open(filename, 'w') as f:
    sys.stdout = f # stdout to file

    print('youtube video_id:',video_id)
    print()
    print("haywhnk-A.K.A-@dauuricus")
    print("- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -")
    print("YouTube captions")
    print("- - - - - - - - - - - - - - - - - - -  YouTube  - - - - - - - - - - - - - - - - - - -")
    print()
    print()

    line = YouTubeTranscriptApi.list_transcripts(video_id)    

    transcript = line.find_transcript(['en']) ## select English subtitle
    #print(transcript.fetch())
    
##    caption_line =[]
##    for count, dict_obj in enumerate(transcript.fetch()):
##        print( dict_obj['text'] )
##        caption_line.append(dict_obj['text'])
##    print()
##    print()
##    print("************************************************************************************")
##    print()
##    print("- - - - - - - - - - - - - - - - - - translated - - - - - - - - - - - - - - - - - - -")
##    print()
##    print()
  
##    translated = transcript.translate('ja')
##    for count, dict_obj in enumerate(translated.fetch()):# japanese
##        print( count+1, dict_obj['text'] )
##    print()
##    print("-----------------------------------------------------------------------------------")
##    print()
##    print("text compositimg ...")
##    print("- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -")
##    print()
##    print()

####    text_e = [] ##2 lines to 1 line
####    i = 0
####    txt_e = ''
####    for count,l in enumerate(caption_line):
####        if i == 0:
####          txt_e += l
####          i = i + 1
####          text_e.append(txt_e)
####        elif i == 1:
####          txt_e += ' ' +l
####          text_e.pop()
####          text_e.append(txt_e)
####          i = 0
####          txt_e = ''
    
    def line_edit2(textlines): ##2 lines to 1 line same as line_edit1(textlines)
        text_compo = []
        txt = ''
        for count,l in enumerate(textlines):
          if (count+1)%2 == 0:
            txt = text_compo.pop()
            txt += ' ' +l
            text_compo.append(txt)
          else :
            txt = l
            text_compo.append(txt)
        return text_compo

    def line_edit3(textlines): ##3 lines to 1 line
        text_compo = []
        txt = ''
        i = 0
        for count,l in enumerate(textlines):
          if i == 0:
            txt += l
            i = i + 1
            text_compo.append(txt)
          elif i == 1:
            txt = text_compo.pop()
            txt += ' ' + l
            i = i + 1
            text_compo.append(txt)
          elif i == 2:
            txt = text_compo.pop()
            txt += ' ' + l
            text_compo.append(txt)
            txt = ''
            i = 0
        return text_compo

    def line_edit1(textlines): ##2 lines to 1 line
        text_compo = []
        i = 0
        txt = ''
        for count,l in enumerate(textlines):
          if i == 0:
            txt += l
            i = i + 1
            text_compo.append(txt)
          elif i == 1:
            txt += ' ' +l
            text_compo.pop()
            text_compo.append(txt)
            i = 0
            txt = ''
        return text_compo

###    text_compo = [] ##2 lines to 1 line
###    i = 0
###    txt = ''
###    for count,l in enumerate(text_list):
###        if i == 0:
###          txt += l
###          i = i + 1
###          text_compo.append(txt)
###        elif i == 1:
###          txt += ' ' +l
###          text_compo.pop()
###          text_compo.append(txt)
###          i = 0
###          txt = ''
###    print()
###    print()
###    print("************************************************************************************")
###    print()
###    print()
##    for count, l in enumerate(text_list):
##        print(count+1,' ',l)
    print()
    print()
    print("************************************************************************************")
    print("shrink text")
    print()
####    for count, l in enumerate(text_e):
####        print(count+1,l)
    print()
    print()
    text_compo = (line_edit3(text_list))
    text_compo[:] = (line_edit2(text_compo))
    #for count, l in enumerate(text_compo):
    #    print(l)
    text_compo2 = (line_edit2(text_compo))
    text_compo2[:] = (line_edit2(text_compo2))
    for count, l in enumerate(text_compo2):
        print(l)
    print()
    print()
    print()
    print("************************************************************************************")
    print()
    print("Thank you.")

sys.stdout = original_stdout # restore stdout
##files.download(filename) ## uncomment if you want to download the file
```

Run the program ...
Now the YouTube subtitles are saved as subtitle.txt in the Colab runtime, where the next program can read them.
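
If you want to double-check the output before translating, a quick (hypothetical) peek works in any Python environment:

```python
# Print the first ten lines of the subtitle file produced above.
with open('subtitle.txt') as f:
    for _ in range(10):
        print(f.readline().rstrip())
```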

cf. Extracting the YouTube video id from a YouTube URL
https://qiita.com/dauuricus/private/9e70c4c25566fedb9c19
#2 Translate
googletrans (4.0.0-rc1)

```shell:googlecolab
!pip install googletrans==4.0.0-rc1
```
```python:googlecolab
from google.colab import files
from googletrans import Translator
import sys
#import re

#####uploaded = files.upload()

####filename = ''
####for fn in uploaded.keys():
####  print('User uploaded file "{name}" with length {length} bytes'.format(
####      name=fn, length=len(uploaded[fn])))
####  filename = fn

filename = 'subtitle.txt'
#args = sys.argv
args = ['translate.py', filename]

print('open ' + args[1])
with open(args[1]) as f: # the subtitle file produced by program #1
  line = f.readlines()   # the with block closes the file automatically

line[:] = [l.strip() for l in line]
line[:] = [l.rstrip('\n') for l in line]
line[:] = [a for a in line if a != '']
line[:] = [l.replace('\n',' ') for l in line]
line[:] = [l.replace('\r',' ') for l in line]
#print(line)

#print()

####for line_num,l in enumerate(line):
####  if re.search(r".*?i'm$", l):
####    print(line_num, '   ', l)
####  elif re.search(r'.*?to\Z', l):
####    print(line_num, '   ', l)
####  if re.search(r'.*?the$', l):
####    print(line_num, '   ', l)
####  elif re.search(r'.*?the\Z', l):
####    print(line_num, '   ', l)


#for line_num,l in enumerate(line):
#    print(line_num,'   ',l)
    
translator = Translator()
num = 20                       # renew the Translator object every 20 lines
#obj_num = 1
filename = 'translated.txt'
backup_stdout = sys.stdout
print("translating...")
with open(filename, 'w') as f:
    sys.stdout = f

    for count, l in enumerate(line):
        if count < 7:          # skip the header lines written by program #1
            continue
        if count + 1 < num:
            translated = translator.translate(l, dest='ja')
            ##print(count+1, '  ', l) # original text
            print(translated.text)
        else:
            translated = translator.translate(l, dest='ja')
            ##print(count+1, '  ', l) # original text
            print(translated.text)
            del translator     # recreate the Translator periodically;
            num = num + 20     # this seems to help avoid being blocked
            #obj_num = obj_num + 1
            #print("")
            #print("--- translator :", obj_num)
            #print("")
            translator = Translator()
    sys.stdout = backup_stdout # restore stdout
del translator
print("saving...",filename)

#files.download(filename) # uncomment to download translated.txt
```
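
As a quick sanity check that googletrans works in your runtime, a minimal sketch (assuming googletrans==4.0.0-rc1 is installed as above):

```python
# Translate a single string to Japanese and print the result.
from googletrans import Translator

translator = Translator()
result = translator.translate('Hello, world!', dest='ja')
print(result.text)
```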

Language list

'af': 'afrikaans' 'sq': 'albanian' 'am': 'amharic' 'ar': 'arabic'
'hy': 'armenian' 'az': 'azerbaijani' 'eu': 'basque' 'be': 'belarusian'
'bn': 'bengali' 'bs': 'bosnian' 'bg': 'bulgarian' 'ca': 'catalan'
'ceb': 'cebuano' 'ny': 'chichewa' 'zh-cn': 'chinese (simplified)' 'zh-tw': 'chinese (traditional)'
'co': 'corsican' 'hr': 'croatian' 'cs': 'czech' 'da': 'danish'
'nl': 'dutch' 'en': 'english' 'eo': 'esperanto' 'et': 'estonian'
'fi': 'finnish' 'fr': 'french' 'fy': 'frisian' 'gl': 'galician'
'ka': 'georgian' 'de': 'german' 'el': 'greek' 'gu': 'gujarati'
'ht': 'haitian creole' 'ha': 'hausa' 'haw': 'hawaiian' 'iw': 'hebrew'
'he': 'hebrew' 'hi': 'hindi' 'hmn': 'hmong' 'hu': 'hungarian'
'is': 'icelandic' 'ig': 'igbo' 'id': 'indonesian' 'ga': 'irish'
'it': 'italian' 'ja': 'japanese' 'jw': 'javanese' 'kn': 'kannada'
'kk': 'kazakh' 'km': 'khmer' 'ko': 'korean' 'ku': 'kurdish (kurmanji)'
'ky': 'kyrgyz' 'lo': 'lao' 'la': 'latin' 'lv': 'latvian'
'lt': 'lithuanian' 'lb': 'luxembourgish' 'mk': 'macedonian' 'mg': 'malagasy'
'ms': 'malay' 'ml': 'malayalam' 'mt': 'maltese' 'mi': 'maori'
'mr': 'marathi' 'mn': 'mongolian' 'my': 'myanmar (burmese)' 'ne': 'nepali'
'no': 'norwegian' 'or': 'odia' 'ps': 'pashto' 'fa': 'persian'
'pl': 'polish' 'pt': 'portuguese' 'pa': 'punjabi' 'ro': 'romanian'
'ru': 'russian' 'sm': 'samoan' 'gd': 'scots gaelic' 'sr': 'serbian'
'st': 'sesotho' 'sn': 'shona' 'sd': 'sindhi' 'si': 'sinhala'
'sk': 'slovak' 'sl': 'slovenian' 'so': 'somali' 'es': 'spanish'
'su': 'sundanese' 'sw': 'swahili' 'sv': 'swedish' 'tg': 'tajik'
'ta': 'tamil' 'te': 'telugu' 'th': 'thai' 'tr': 'turkish'
'uk': 'ukrainian' 'ur': 'urdu' 'ug': 'uyghur' 'uz': 'uzbek'
'vi': 'vietnamese' 'cy': 'welsh' 'xh': 'xhosa' 'yi': 'yiddish'
'yo': 'yoruba' 'zu': 'zulu'