YouTube, Deepspeech, with Google Colaboratory [testing_0001]

Posted at 2021-02-15

###Deepspeech-0.6.1-models

This article is a set of notes on a test run in Google Colaboratory: extracting the audio from a YouTube video and using DeepSpeech (a so-called ASR, automatic speech recognition, program) to transcribe the English speech to text.

####About Google Colaboratory

Frequently Asked Questions
What is Colaboratory?
Colaboratory (abbreviation: Colab) is a service provided by Google Research. Colab is especially suitable for machine learning, data analysis, and education because anyone can write and run Python in a browser. Specifically, it is a hosted Jupyter Notebook service that allows you to access computing resources such as the GPU for free and without any special settings.

Is it really free to use?
I don't know. Who cares? Who knows?
https://research.google.com/colaboratory/faq.html

youtube-dl is in charge of acquiring the audio from the YouTube video, and deepspeech automatically recognizes the English speech and outputs the corresponding text. (deepspeech-0.6.1-models / TensorFlow 1 is used in the programs below.)

This is "ready", but be aware that deepspeech-0.6.1-models has a data size of 1.14G.
However, if you do not leave any settings, the data will be lost when the Google Colabratory runtime ends .

This may look like a wall of intimidating characters, but to put it briefly: just paste the following Python code into a Google Colab cell and execute it.
In Google Colab you can run a cell with Ctrl + Enter.
The biggest psychological barrier is creating a Google account, but nothing here is harder than that. You don't have to follow along, either; this is just left here as a sample for anyone who wants to see how it works.

Google Colab also allows vim key bindings in the editor settings, so people who live in vim can paste with Shift + Insert.

###Setting up Google Colaboratory

GoogleColaboratory
from google.colab import drive 
drive.mount('/content/drive')

Rf.
External data: local files, drives, spreadsheets, Cloud Storage
https://colab.research.google.com/notebooks/io.ipynb
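
If you want the downloaded model to survive a runtime reset, one option is to copy the extracted directory into the mounted Drive once the install cell below has unpacked it. A minimal sketch, assuming Drive was mounted as above; the MyDrive path is an assumption and may appear as "My Drive" depending on the Colab environment.

GoogleColaboratory
# Optional: copy the extracted model into Google Drive so it survives a runtime reset.
# Run this after the install cell below has downloaded and unpacked the model.
!cp -r deepspeech-0.6.1-models "/content/drive/MyDrive/"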

###Speech Recognition with DeepSpeech

Try searching for this phrase; the citations below are all from this source. My version differs in places, so please note the differences and make your own improvements. Without a sample that actually works you can't really get a feel for it, so I'm grateful that the author has published these recipe notes.

  • MozillaDeepSpeech.ipynb ... mozilla/DeepSpeech with LM on Youtube videos

Rf.
Erdene-Ochir Tuguldur
tugstugi
Берлиний Техникийн Их Сургууль (Technical University of Berlin)
https://github.com/tugstugi/dl-colab-notebooks

This notebook uses an open source project mozilla/DeepSpeech to transcribe a given youtube video.

For other deep-learning Colab notebooks, visit tugstugi/dl-colab-notebooks.

###Install DeepSpeech

GoogleColaboratory
#@title
import os
from os.path import exists, join, basename, splitext

if not exists('deepspeech-0.6.1-models'):
  # sox for audio handling, deepspeech-gpu 0.6.1 for inference, youtube-dl for downloading
  !apt-get install -qq sox
  !pip install -q deepspeech-gpu==0.6.1 youtube-dl
  # download and unpack the pretrained English model (about 1.14 GB)
  !wget https://github.com/mozilla/DeepSpeech/releases/download/v0.6.1/deepspeech-0.6.1-models.tar.gz
  !tar xvfz deepspeech-0.6.1-models.tar.gz

from IPython.display import YouTubeVideo
log
Selecting previously unselected package libopencore-amrnb0:amd64.
(Reading database ... 146425 files and directories currently installed.)
Preparing to unpack .../0-libopencore-amrnb0_0.1.3-2.1_amd64.deb ...
Unpacking libopencore-amrnb0:amd64 (0.1.3-2.1) ...
Selecting previously unselected package libopencore-amrwb0:amd64.
Preparing to unpack .../1-libopencore-amrwb0_0.1.3-2.1_amd64.deb ...
Unpacking libopencore-amrwb0:amd64 (0.1.3-2.1) ...
Selecting previously unselected package libmagic-mgc.
Preparing to unpack .../2-libmagic-mgc_1%3a5.32-2ubuntu0.4_amd64.deb ...
Unpacking libmagic-mgc (1:5.32-2ubuntu0.4) ...
Selecting previously unselected package libmagic1:amd64.
Preparing to unpack .../3-libmagic1_1%3a5.32-2ubuntu0.4_amd64.deb ...
Unpacking libmagic1:amd64 (1:5.32-2ubuntu0.4) ...
Selecting previously unselected package libsox3:amd64.
Preparing to unpack .../4-libsox3_14.4.2-3ubuntu0.18.04.1_amd64.deb ...
Unpacking libsox3:amd64 (14.4.2-3ubuntu0.18.04.1) ...
Selecting previously unselected package libsox-fmt-alsa:amd64.
Preparing to unpack .../5-libsox-fmt-alsa_14.4.2-3ubuntu0.18.04.1_amd64.deb ...
Unpacking libsox-fmt-alsa:amd64 (14.4.2-3ubuntu0.18.04.1) ...
Selecting previously unselected package libsox-fmt-base:amd64.
Preparing to unpack .../6-libsox-fmt-base_14.4.2-3ubuntu0.18.04.1_amd64.deb ...
Unpacking libsox-fmt-base:amd64 (14.4.2-3ubuntu0.18.04.1) ...
Selecting previously unselected package sox.
Preparing to unpack .../7-sox_14.4.2-3ubuntu0.18.04.1_amd64.deb ...
Unpacking sox (14.4.2-3ubuntu0.18.04.1) ...
Setting up libmagic-mgc (1:5.32-2ubuntu0.4) ...
Setting up libmagic1:amd64 (1:5.32-2ubuntu0.4) ...
Setting up libopencore-amrnb0:amd64 (0.1.3-2.1) ...
Setting up libopencore-amrwb0:amd64 (0.1.3-2.1) ...
Setting up libsox3:amd64 (14.4.2-3ubuntu0.18.04.1) ...
Setting up libsox-fmt-base:amd64 (14.4.2-3ubuntu0.18.04.1) ...
Setting up libsox-fmt-alsa:amd64 (14.4.2-3ubuntu0.18.04.1) ...
Setting up sox (14.4.2-3ubuntu0.18.04.1) ...
Processing triggers for libc-bin (2.27-3ubuntu1.3) ...
/sbin/ldconfig.real: /usr/local/lib/python3.6/dist-packages/ideep4py/lib/libmkldnn.so.0 is not a symbolic link

Processing triggers for man-db (2.8.3-2ubuntu0.1) ...
Processing triggers for mime-support (3.60ubuntu1) ...
     |████████████████████████████████| 18.7MB 160kB/s 
     |████████████████████████████████| 1.9MB 49.9MB/s 
--2021-02-13 17:57:27--  https://github.com/mozilla/DeepSpeech/releases/download/v0.6.1/deepspeech-0.6.1-models.tar.gz
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github-releases.githubusercontent.com/60273704/f29e6300-33cd-11ea-8523-3fc40b31be9a?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20210213%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210213T175727Z&X-Amz-Expires=300&X-Amz-Signature=385f1997b95eb6dfac74a33bd120afe1ef4e11c74ffdc081c45d6de333ba5a0b&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=60273704&response-content-disposition=attachment%3B%20filename%3Ddeepspeech-0.6.1-models.tar.gz&response-content-type=application%2Foctet-stream [following]
--2021-02-13 17:57:27--  https://github-releases.githubusercontent.com/60273704/f29e6300-33cd-11ea-8523-3fc40b31be9a?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20210213%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210213T175727Z&X-Amz-Expires=300&X-Amz-Signature=385f1997b95eb6dfac74a33bd120afe1ef4e11c74ffdc081c45d6de333ba5a0b&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=60273704&response-content-disposition=attachment%3B%20filename%3Ddeepspeech-0.6.1-models.tar.gz&response-content-type=application%2Foctet-stream
Resolving github-releases.githubusercontent.com (github-releases.githubusercontent.com)... 185.199.108.154, 185.199.109.154, 185.199.110.154, ...
Connecting to github-releases.githubusercontent.com (github-releases.githubusercontent.com)|185.199.108.154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1229020343 (1.1G) [application/octet-stream]
Saving to: ‘deepspeech-0.6.1-models.tar.gz’

deepspeech-0.6.1-mo 100%[===================>]   1.14G  96.4MB/s    in 12s     

2021-02-13 17:57:39 (95.4 MB/s) - ‘deepspeech-0.6.1-models.tar.gz’ saved [1229020343/1229020343]

._deepspeech-0.6.1-models
deepspeech-0.6.1-models/
deepspeech-0.6.1-models/._lm.binary
deepspeech-0.6.1-models/lm.binary
deepspeech-0.6.1-models/._output_graph.pbmm
deepspeech-0.6.1-models/output_graph.pbmm
deepspeech-0.6.1-models/._output_graph.pb
deepspeech-0.6.1-models/output_graph.pb
deepspeech-0.6.1-models/._trie
deepspeech-0.6.1-models/trie
deepspeech-0.6.1-models/output_graph.tflite

deepspeech-0.6.1-mo 100%[===================>] 1.14G

size: 1.14G

###Extracting the YouTube video_id from a YouTube URL

GoogleColaboratory
from urllib.parse import urlparse, parse_qs

urltext ='https://www.youtube.com/watch?v=qviM_GnJbOM' 
args = [urltext]
video_id = ''


def extract_video_id(url):
    query = urlparse(url)
    if query.hostname == 'youtu.be': return query.path[1:]
    if query.hostname in {'www.youtube.com', 'youtube.com'}:
        if query.path == '/watch': return parse_qs(query.query)['v'][0]
        if query.path[:7] == '/embed/': return query.path.split('/')[2]
        if query.path[:3] == '/v/': return query.path.split('/')[2]
    # fail?
    return None

for url in args:
    video_id = (extract_video_id(url))
    print('youtube video_id:',video_id)
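
As a quick sanity check, the same function also handles the short youtu.be form and /embed/ URLs. The extra URLs below are just illustrations built from the same video id:

GoogleColaboratory
# Quick check of the URL forms handled by extract_video_id() above.
for u in ['https://youtu.be/qviM_GnJbOM',
          'https://www.youtube.com/watch?v=qviM_GnJbOM',
          'https://www.youtube.com/embed/qviM_GnJbOM']:
    print(u, '->', extract_video_id(u))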

Rf.
extracting youtube video id from youtube URL
https://qiita.com/dauuricus/private/9e70c4c25566fedb9c19

###Transcribe YouTube Video

We are going to run speech recognition on the following YouTube video.

GoogleColaboratory
YouTubeVideo(video_id)

Download the above video, convert it to a WAV file, and run speech recognition.

GoogleColaboratory
#!rm -rf *.wav
!youtube-dl --extract-audio --audio-format wav --output "extract.%(ext)s" {urltext}

youtube-dl --extract-audio --audio-format wav --output "extract.%(ext)s"
This extracts the audio from the video as a WAV file named extract.wav. DeepSpeech seems to expect audio with a sampling rate of 16000 Hz, so the file is converted in the next step.

[youtube] qviM_GnJbOM: Downloading webpage
[download] Destination: extract.m4a
[download] 100% of 2.05MiB in 00:00
[ffmpeg] Destination: extract.wav
Deleting original file extract.m4a (pass -k to keep)

Rf.
Download Audio from YouTube
https://gist.github.com/umidjons/8a15ba3813039626553929458e3ad1fc
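
Before converting, you can check what youtube-dl actually produced. soxi (installed together with sox above) prints the WAV header; a quick check, assuming the file name from the command above:

GoogleColaboratory
# Inspect the downloaded audio; it should report the original sample rate
# (44100 Hz for this video) and channel count, which is why the conversion below is needed.
!soxi extract.wav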

GoogleColaboratory
#!apt install ffmpeg   # only needed if ffmpeg is not already installed
!ffmpeg -i extract.wav -vn -acodec pcm_s16le -ac 1 -ar 16000 -f wav test.wav
!deepspeech --model deepspeech-0.6.1-models/output_graph.pbmm --lm deepspeech-0.6.1-models/lm.binary --trie deepspeech-0.6.1-models/trie --audio test.wav

deepspeech requires 16000 Hz WAV input, so extract.wav (44100 Hz) is converted to test.wav (PCM signed 16-bit little-endian, 16000 Hz, mono).
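
To confirm that the converted file is what deepspeech expects, here is a minimal check with Python's standard wave module (file name as in the ffmpeg command above):

GoogleColaboratory
import wave

# test.wav should now be 16 kHz, mono, 16-bit PCM.
with wave.open('test.wav', 'rb') as w:
    print('channels    :', w.getnchannels())                  # expect 1
    print('sample rate :', w.getframerate())                  # expect 16000
    print('sample width:', w.getsampwidth(), 'bytes')         # expect 2 (16-bit)
    print('duration (s):', w.getnframes() / w.getframerate())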

ffmpeg_cheatsheet_audio
-codecs          # list codecs
-c:a             # audio codec (-acodec)
-fs SIZE         # limit file size (bytes)
-b:v 1M          # video bitrate (1M = 1Mbit/s)
-b:a 1M          # audio bitrate
-vn              # no video
-aq QUALITY      # audio quality (codec-specific)
-ar 16000        # audio sample rate (hz)
-ac 1            # audio channels (1=mono, 2=stereo)
-an              # no audio
-vol N           # volume (256=normal)
log
ffmpeg version 3.4.8-0ubuntu0.2 Copyright (c) 2000-2020 the FFmpeg developers
  built with gcc 7 (Ubuntu 7.5.0-3ubuntu1~18.04)
  configuration: --prefix=/usr --extra-version=0ubuntu0.2 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --enable-gpl --disable-stripping --enable-avresample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librubberband --enable-librsvg --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzmq --enable-libzvbi --enable-omx --enable-openal --enable-opengl --enable-sdl2 --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libopencv --enable-libx264 --enable-shared
  libavutil      55. 78.100 / 55. 78.100
  libavcodec     57.107.100 / 57.107.100
  libavformat    57. 83.100 / 57. 83.100
  libavdevice    57. 10.100 / 57. 10.100
  libavfilter     6.107.100 /  6.107.100
  libavresample   3.  7.  0 /  3.  7.  0
  libswscale      4.  8.100 /  4.  8.100
  libswresample   2.  9.100 /  2.  9.100
  libpostproc    54.  7.100 / 54.  7.100
Guessed Channel Layout for Input Stream #0.0 : stereo
Input #0, wav, from 'test.wav':
  Metadata:
    encoder         : Lavf57.83.100
  Duration: 00:02:48.86, bitrate: 1411 kb/s
    Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 44100 Hz, stereo, s16, 1411 kb/s
Stream mapping:
  Stream #0:0 -> #0:0 (pcm_s16le (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, wav, to 'test1.wav':
  Metadata:
    ISFT            : Lavf57.83.100
    Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
    Metadata:
      encoder         : Lavc57.107.100 pcm_s16le
size=    5277kB time=00:02:48.85 bitrate= 256.0kbits/s speed=1.24e+03x    
video:0kB audio:5277kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.001444%

###The result of deepspeech's automatic speech recognition (ASR)

GoogleColaboratory
you may write me down and history with your bitter rested line you may try me in the very dirt but still like dust or i does my satinette you why are you visitations a waltari have oil wells pumping in my living room that's like moons and like sons with the seance just like hopes springing high still and he did you want to see me broken bowed head and lowered eyes soldiering down like hiram weakened by my soul socrates my sansonnetto do take it to her i just got a laugh as if i have gold man sinking in my own back yard you can shoot me with your words you can cut me with your lies you can kill me with your hatefulness but just like life ran does my saxon as the firm you all does it come as a surprise that i danced as if i have diamonds that the meeting of my size out of a hut of history shame i ride up from a past rooted in pain i rise a black ocean leaving and by welling and swelling and bearing him i leaving behind might of terror and fear i ran into a daybreak miraculously clear i right bringing the gifts that my emphasis gay i am the whole and the dream of the sleeve and so that 

Comparison: here are the YouTube captions for the same video.

youtube-caption
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
YouTube captions
- - - - - - - - - - - - - - - - - - -  YouTube  - - - - - - - - - - - - - - - - - - -


1    you may write me down in history with
2    your bitter twisted lies
3    you may tribe me in the very dirt but
4    still like dust a lie does my sassiness
5    upset you
6    why are you beset with gloom just
7    because I walked as if I have oil wells
8    pumping in my living room just like
9    moons and like Suns with the certainty
10    of tides just like hope springing high
11    still I rise did you want to see me
12    broken bowed head and lowered eyes
13    shoulders falling down like teardrops we
14    can buy my soul who cries does my
15    sassiness upset you don't take it too
16    hard just cuz I laugh as if I have gold
17    mines digging in my own backyard you can
18    shoot me with your words you can cut me
19    with your lies you can kill me with your
20    hatefulness but just like life arise
21    just my sexiness offend you oh does it
22    come as a surprise that I dance as if I
23    have diamonds at the meeting of my
24    thighs
25    out of the huts of history's shame I
26    rise up from a past rooted in pain I
27    rise a black ocean leaping and wide
28    Welling and swelling and bearing in the
29    time leaving behind nights of terror and
30    fear I rise into a daybreak miraculously
31    clear I rise bringing the gifts that my
32    ancestors gave I am the hope and the
33    dream of the slave and so there go


************************************************************************************
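
To compare the deepspeech output with the YouTube captions more quantitatively than by eye, a word error rate (WER) can be computed with a standard edit-distance calculation. A minimal sketch: paste the full caption text and the full ASR output into the two strings (the "..." below are placeholders, not actual text):

GoogleColaboratory
# Word error rate: (substitutions + insertions + deletions) / number of reference words,
# computed with a word-level Levenshtein distance.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

reference_text = "you may write me down in history with your bitter twisted lies ..."   # YouTube captions
asr_text       = "you may write me down and history with your bitter rested line ..."   # deepspeech output
print('WER:', wer(reference_text, asr_text))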

##Cf. Still I Rise by MAYA ANGELOU
https://www.poetryfoundation.org/poems/46446/still-i-rise

###The command-line options differ from the latest version of deepspeech.
deepspeech-0.6.1-models

usage: deepspeech [-h] --model MODEL [--lm [LM]] [--trie [TRIE]] --audio AUDIO
                  [--beam_width BEAM_WIDTH] [--lm_alpha LM_ALPHA]
                  [--lm_beta LM_BETA] [--version] [--extended] [--json]

Running DeepSpeech inference.

optional arguments:
  -h, --help            show this help message and exit
  --model MODEL         Path to the model (protocol buffer binary file)
  --lm [LM]             Path to the language model binary file
  --trie [TRIE]         Path to the language model trie file created with
                        native_client/generate_trie
  --audio AUDIO         Path to the audio file to run (WAV format)
  --beam_width BEAM_WIDTH
                        Beam width for the CTC decoder
  --lm_alpha LM_ALPHA   Language model weight (lm_alpha)
  --lm_beta LM_BETA     Word insertion bonus (lm_beta)
  --version             Print version and exits
  --extended            Output string from extended metadata
  --json                Output json from metadata with timestamp of each word
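
For reference, in the 0.9.x series the separate language model (--lm) and trie (--trie) files were replaced by a single external scorer, so the 0.6.1 command above does not work as-is. A rough sketch of the equivalent steps with deepspeech 0.9.3 (not tested in this article):

GoogleColaboratory
!pip install -q deepspeech-gpu==0.9.3
!wget https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
!wget https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer
!deepspeech --model deepspeech-0.9.3-models.pbmm --scorer deepspeech-0.9.3-models.scorer --audio test.wav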

##Remarks

Initially, I ran deepspeech on Google Colaboratory in parallel with the IBM Watson Speech to Text demo, transcribing a video clip more than 3 hours long, but there was no sign it would ever finish. As the playback audio of the IBM Watson demo was nearing its end, it occurred to me to wonder whether this is even processed in real time the way Watson does it, so I threw away the unfinished run and started over with a short clip. I also tried to rewrite the article from the installation step onward for deepspeech 0.9.3, which is close to the latest version, but the file structure has changed with the version upgrade, so I can't follow it without examining it carefully.

Cf.
Speech to Text
The IBM Watson Speech to Text service uses speech recognition capabilities to convert Arabic, English, Spanish, French, Brazilian Portuguese, Japanese, Korean, German, and Mandarin speech into text.
https://speech-to-text-demo.ng.bluemix.net/

Rf.
Real-time Speech to Text with DeepSpeech - Getting Started on Windows and Transcribe Microphone Free
https://www.youtube.com/watch?v=c_0Q3T0XYTA

DeepSpeech | Speech to Text | Common Voice | Donate Your Voice by tuxfoo
https://www.youtube.com/watch?v=GixIsv_1__A

  1. A.I. ?== deep-learning | Comparison of deep-learning software Wiki
