###deepspeech-0.9.3-models
[This version is not backwards compatible with earlier versions.](https://github.com/mozilla/DeepSpeech/releases)
This article is a set of test notes on using Google Colaboratory to take the audio from a YouTube video and run English speech recognition on it with DeepSpeech, an automatic speech recognition (ASR) program that converts speech to text.
The article corresponding to deepspeech 0.6.1 has been rewritten to match deepspeech 0.9.3. The file structure and commands are different from earlier versions, so please check your version as similar differences may appear in the future.
####About Google Colaboratory
Frequently Asked Questions
What is Colaboratory?
Colaboratory (abbreviation: Colab) is a service provided by Google Research. Colab is especially suitable for machine learning, data analysis, and education because anyone can write and run Python in a browser. Specifically, it is a hosted Jupyter Notebook service that allows you to access computing resources such as the GPU for free and without any special settings.
Is it really free to use?
I don't know. Who knows?
https://research.google.com/colaboratory/faq.html
youtube-dl is in charge of fetching the audio from the YouTube video, and deepspeech automatically recognizes the English speaker's speech, guesses the corresponding text, and displays it. (deepspeech-0.9.3-models with TensorFlow is used in the programs below.)
That completes the preparation, but be aware that the deepspeech-0.9.3-models files are fairly large (about 180 MB for the .pbmm model and about 909 MB for the scorer).
However, unless you save the data somewhere persistent, it will be lost when the Google Colaboratory runtime ends.
This may look like a wall of difficult terms. In short, though, you only need to paste the following Python code into a Google Colab cell and run it.
Google Colab lets you run the code in a cell with Control + Enter.
The biggest psychological barrier is creating a Google account; nothing beyond that is difficult. You do not have to do any of this, though. It is only posted here for a while as a sample for those who want to know.
Google Colab offers Vim key bindings in its editor settings, so people who live in Vim can paste with Shift + Insert.
That is all.
Setting up Google Colaboratory
from google.colab import drive
drive.mount('/content/drive')
Rf.
External data: local files, drives, spreadsheets, Cloud Storage
https://colab.research.google.com/notebooks/io.ipynb
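Because files in the runtime disappear when the session ends, one option is to keep a copy of the large model files on the mounted Drive and copy them back at the start of the next session. The following is only a minimal sketch of that idea and is not part of the original notebook. The helper name `restore_or_cache` and the Drive folder used in the usage note are assumptions made for illustration.

```python
import shutil
from pathlib import Path


def restore_or_cache(filename, cache_dir, work_dir="."):
    """Copy `filename` between a persistent cache folder and the working folder.

    Returns "restored" if the file was copied from the cache, "cached" if it
    was copied to the cache, "present" if both copies already exist, and
    "missing" if neither exists (the file still has to be downloaded).
    """
    cache_dir = Path(cache_dir)
    local = Path(work_dir) / filename
    cached = cache_dir / filename

    if local.exists() and cached.exists():
        return "present"
    if cached.exists():
        shutil.copy2(cached, local)      # cache -> runtime
        return "restored"
    if local.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(local, cached)      # runtime -> cache
        return "cached"
    return "missing"
```

In Colab this might be called as `restore_or_cache('deepspeech-0.9.3-models.pbmm', '/content/drive/MyDrive/deepspeech')`, once before the download cell and once after it. The folder name is hypothetical, and the exact Drive path depends on how the mount appears in your runtime.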
Speech Recognition with DeepSpeech
Try searching for this phrase. The citations below are all from the notebook that the search turns up. The code here differs from it in places, so be aware of the differences and make your own improvements. It is hard to get anywhere without seeing a sample that actually works, so I am grateful to the author for publishing these recipe notes.
- MozillaDeepSpeech.ipynb ... mozilla/DeepSpeech with LM on Youtube videos
Rf.
Erdene-Ochir Tuguldur
tugstugi
Technical University of Berlin
https://github.com/tugstugi/dl-colab-notebooks
This notebook uses an open source project mozilla/DeepSpeech to transcribe a given youtube video.
For other deep-learning Colab notebooks, visit tugstugi/dl-colab-notebooks.
Install DeepSpeech
```python
import os
from os.path import exists
import wave

# Install the GPU build of DeepSpeech 0.9.3 together with youtube-dl.
!pip install -q deepspeech-gpu==0.9.3 youtube-dl

# Download the model and the scorer only if they are not already present.
if not exists('deepspeech-0.9.3-models.pbmm'):
    !wget https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
if not exists('deepspeech-0.9.3-models.scorer'):
    !wget https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer
#!tar xvfz deepspeech-0.9.3-models.tar.gz

from IPython.display import YouTubeVideo
```
Pre-trained model files:
- .pbmm ... for the TensorFlow runtime
- .tflite ... for the TensorFlow Lite runtime
- .scorer ... the external scorer file passed with --scorer
log
--2021-02-15 15:40:31-- https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
Resolving github.com (github.com)... 52.69.186.44
Connecting to github.com (github.com)|52.69.186.44|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github-releases.githubusercontent.com/60273704/8b25f180-3b0f-11eb-8fc1-de4f4ec3b5a3?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20210215%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210215T154032Z&X-Amz-Expires=300&X-Amz-Signature=de84c8f71f6fb0d61801e0e6eade089738aab5899a4bd80fdda9fed4e77735d6&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=60273704&response-content-disposition=attachment%3B%20filename%3Ddeepspeech-0.9.3-models.pbmm&response-content-type=application%2Foctet-stream [following]
--2021-02-15 15:40:32-- https://github-releases.githubusercontent.com/60273704/8b25f180-3b0f-11eb-8fc1-de4f4ec3b5a3?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20210215%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210215T154032Z&X-Amz-Expires=300&X-Amz-Signature=de84c8f71f6fb0d61801e0e6eade089738aab5899a4bd80fdda9fed4e77735d6&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=60273704&response-content-disposition=attachment%3B%20filename%3Ddeepspeech-0.9.3-models.pbmm&response-content-type=application%2Foctet-stream
Resolving github-releases.githubusercontent.com (github-releases.githubusercontent.com)... 185.199.110.154, 185.199.111.154, 185.199.108.154, ...
Connecting to github-releases.githubusercontent.com (github-releases.githubusercontent.com)|185.199.110.154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 188915987 (180M) [application/octet-stream]
Saving to: ‘deepspeech-0.9.3-models.pbmm’
deepspeech-0.9.3-mo 100%[===================>] 180.16M 20.4MB/s in 9.1s
2021-02-15 15:40:41 (19.9 MB/s) - ‘deepspeech-0.9.3-models.pbmm’ saved [188915987/188915987]
--2021-02-15 15:40:41-- https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer
Resolving github.com (github.com)... 52.192.72.89
Connecting to github.com (github.com)|52.192.72.89|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github-releases.githubusercontent.com/60273704/924cff80-3b0f-11eb-878c-cacaa2a0d946?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20210215%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210215T154041Z&X-Amz-Expires=300&X-Amz-Signature=2a8ac24c6d349b794a20407523a3416878ee60c0f079d8c68c8eb6b59bc980af&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=60273704&response-content-disposition=attachment%3B%20filename%3Ddeepspeech-0.9.3-models.scorer&response-content-type=application%2Foctet-stream [following]
--2021-02-15 15:40:42-- https://github-releases.githubusercontent.com/60273704/924cff80-3b0f-11eb-878c-cacaa2a0d946?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20210215%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210215T154041Z&X-Amz-Expires=300&X-Amz-Signature=2a8ac24c6d349b794a20407523a3416878ee60c0f079d8c68c8eb6b59bc980af&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=60273704&response-content-disposition=attachment%3B%20filename%3Ddeepspeech-0.9.3-models.scorer&response-content-type=application%2Foctet-stream
Resolving github-releases.githubusercontent.com (github-releases.githubusercontent.com)... 185.199.108.154, 185.199.109.154, 185.199.110.154, ...
Connecting to github-releases.githubusercontent.com (github-releases.githubusercontent.com)|185.199.108.154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 953363776 (909M) [application/octet-stream]
Saving to: ‘deepspeech-0.9.3-models.scorer’
deepspeech-0.9.3-mo 100%[===================>] 909.20M 25.9MB/s in 40s
2021-02-15 15:41:22 (22.6 MB/s) - ‘deepspeech-0.9.3-models.scorer’ saved [953363776/953363776]
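The `exists()` check in the install cell only tests whether a file is there, so a download that was interrupted partway would still be accepted. The wget log above reports the full byte counts, so one cautious option is to compare file sizes against them. This is a sketch under that assumption; `download_is_complete` and `report` are hypothetical helper names, not part of the original notebook.

```python
import os

# Byte counts taken from the "Length:" lines of the wget log above.
EXPECTED_BYTES = {
    "deepspeech-0.9.3-models.pbmm": 188915987,
    "deepspeech-0.9.3-models.scorer": 953363776,
}


def download_is_complete(path, expected_bytes):
    """Return True only if `path` exists and has exactly `expected_bytes` bytes."""
    return os.path.isfile(path) and os.path.getsize(path) == expected_bytes


def report(directory="."):
    """Map each model file name to True/False depending on its size on disk."""
    return {
        name: download_is_complete(os.path.join(directory, name), size)
        for name, size in EXPECTED_BYTES.items()
    }
```

Calling `report()` in the notebook's working directory returns a dictionary such as `{'deepspeech-0.9.3-models.pbmm': True, ...}`; any `False` entry suggests deleting that file and downloading it again.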
!apt-get install -qq sox
sox - The Python and Node.JS clients use SoX to resample files to 16kHz.
Extracting YouTube video_id from YouTube URL
```python
from urllib.parse import urlparse, parse_qs

urltext = 'https://www.youtube.com/watch?v=qviM_GnJbOM'
args = [urltext]
video_id = ''

def extract_video_id(url):
    """Return the video ID from a YouTube URL, or None if it is not recognized."""
    query = urlparse(url)
    # Short links: https://youtu.be/<id>
    if query.hostname == 'youtu.be': return query.path[1:]
    if query.hostname in {'www.youtube.com', 'youtube.com'}:
        # https://www.youtube.com/watch?v=<id>
        if query.path == '/watch': return parse_qs(query.query)['v'][0]
        # https://www.youtube.com/embed/<id>
        if query.path[:7] == '/embed/': return query.path.split('/')[2]
        # https://www.youtube.com/v/<id>
        if query.path[:3] == '/v/': return query.path.split('/')[2]
    # fail?
    return None

for url in args:
    video_id = extract_video_id(url)
    print('youtube video_id:', video_id)
```
Rf.
extracting youtube video id from youtube URL
https://qiita.com/dauuricus/private/9e70c4c25566fedb9c19
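To see why the function above looks at `hostname`, `path`, and the `v` query parameter, it may help to print what `urlparse` and `parse_qs` return for each URL shape. The short and embed URLs below are illustrative variants of the same video ID used in this article.

```python
from urllib.parse import urlparse, parse_qs

# Long form: the ID is the "v" query parameter.
long_url = urlparse('https://www.youtube.com/watch?v=qviM_GnJbOM')
print(long_url.hostname)          # www.youtube.com
print(long_url.path)              # /watch
print(parse_qs(long_url.query))   # {'v': ['qviM_GnJbOM']}

# Short form: the ID is the path without its leading slash.
short_url = urlparse('https://youtu.be/qviM_GnJbOM')
print(short_url.hostname)         # youtu.be
print(short_url.path[1:])         # qviM_GnJbOM

# Embed form: splitting the path on "/" puts the ID at index 2.
embed_url = urlparse('https://www.youtube.com/embed/qviM_GnJbOM')
print(embed_url.path.split('/'))  # ['', 'embed', 'qviM_GnJbOM']
```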
Transcribe Youtube Video
We are going to run speech recognition on the following YouTube video.
YouTubeVideo(video_id)
Download the above video, convert to a WAV file and do speech recognition
#!rm -rf *.wav
!youtube-dl --extract-audio --audio-format wav --output "extract.%(ext)s" {urltext}
`youtube-dl --extract-audio --audio-format wav --output "extract.%(ext)s"` extracts the audio from the video in WAV format under the file name extract.wav. DeepSpeech seems to expect audio with a sampling rate of 16000 Hz.
[youtube] qviM_GnJbOM: Downloading webpage
[download] Destination: extract.m4a
[download] 100% of 2.05MiB in 00:00
[ffmpeg] Destination: extract.wav
Deleting original file extract.m4a (pass -k to keep)
Rf.
Download Audio from YouTube
https://gist.github.com/umidjons/8a15ba3813039626553929458e3ad1fc
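The same download can also be driven from Python instead of the `!youtube-dl` shell command. The sketch below assumes the `youtube_dl` package installed earlier, its `YoutubeDL` class, and its `FFmpegExtractAudio` postprocessor. The helper names `build_options` and `download_audio` are made up for this example, the options only roughly mirror the command-line flags, and the download itself needs network access and ffmpeg.

```python
def build_options(basename="extract", codec="wav"):
    """Options roughly mirroring:
    --extract-audio --audio-format wav --output "extract.%(ext)s"
    """
    return {
        "format": "bestaudio/best",
        "outtmpl": basename + ".%(ext)s",
        "postprocessors": [
            {"key": "FFmpegExtractAudio", "preferredcodec": codec},
        ],
    }


def download_audio(url, basename="extract"):
    """Download `url` and return the expected name of the resulting WAV file."""
    import youtube_dl  # imported lazily so build_options() works without it

    with youtube_dl.YoutubeDL(build_options(basename)) as ydl:
        ydl.download([url])
    return basename + ".wav"
```

In the notebook, `download_audio(urltext)` would then stand in for the shell command, which keeps the URL and the output name in ordinary Python variables.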
This test case does not necessarily have to be YouTube audio, so if you have not installed youtube-dl yet, you may not have ffmpeg installed. If you need ffmpeg separately to convert the audio, you can install it now.
#!apt install ffmpeg  # uncomment if ffmpeg is not installed
!ffmpeg -i extract.wav -vn -acodec pcm_s16le -ac 1 -ar 16000 -f wav test.wav
DeepSpeech seems to require 16000 Hz WAV audio, so convert extract.wav (44100 Hz) to test.wav (PCM signed 16-bit little-endian, 16000 Hz).
Here is an ffmpeg command cheat sheet:
-codecs # list codecs
-c:a # audio codec (-acodec)
-fs SIZE # limit file size (bytes)
-b:v 1M # video bitrate (1M = 1Mbit/s)
-b:a 1M # audio bitrate
-vn # no video
-aq QUALITY # audio quality (codec-specific)
-ar 16000 # audio sample rate (hz)
-ac 1 # audio channels (1=mono, 2=stereo)
-an # no audio
-vol N # volume (256=normal)
log
ffmpeg version 3.4.8-0ubuntu0.2 Copyright (c) 2000-2020 the FFmpeg developers
built with gcc 7 (Ubuntu 7.5.0-3ubuntu1~18.04)
configuration: --prefix=/usr --extra-version=0ubuntu0.2 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --enable-gpl --disable-stripping --enable-avresample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librubberband --enable-librsvg --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzmq --enable-libzvbi --enable-omx --enable-openal --enable-opengl --enable-sdl2 --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libopencv --enable-libx264 --enable-shared
libavutil 55. 78.100 / 55. 78.100
libavcodec 57.107.100 / 57.107.100
libavformat 57. 83.100 / 57. 83.100
libavdevice 57. 10.100 / 57. 10.100
libavfilter 6.107.100 / 6.107.100
libavresample 3. 7. 0 / 3. 7. 0
libswscale 4. 8.100 / 4. 8.100
libswresample 2. 9.100 / 2. 9.100
libpostproc 54. 7.100 / 54. 7.100
Guessed Channel Layout for Input Stream #0.0 : stereo
Input #0, wav, from 'test.wav':
Metadata:
encoder : Lavf57.83.100
Duration: 00:02:48.86, bitrate: 1411 kb/s
Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 44100 Hz, stereo, s16, 1411 kb/s
Stream mapping:
Stream #0:0 -> #0:0 (pcm_s16le (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, wav, to 'test1.wav':
Metadata:
ISFT : Lavf57.83.100
Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
Metadata:
encoder : Lavc57.107.100 pcm_s16le
size= 5277kB time=00:02:48.85 bitrate= 256.0kbits/s speed=1.24e+03x
video:0kB audio:5277kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.001444%
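The ffmpeg command above targets 16000 Hz, one channel, and 16-bit PCM. Before handing the file to DeepSpeech, those three properties can be read back from the WAV header with Python's standard `wave` module. This is a small sketch, with `describe_wav` and `is_deepspeech_ready` as hypothetical helper names.

```python
import wave


def describe_wav(path):
    """Return (sample_rate_hz, channels, bits_per_sample) read from the WAV header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.getnchannels(), wav.getsampwidth() * 8


def is_deepspeech_ready(path, rate=16000):
    """True if the file is mono 16-bit PCM at the given sample rate."""
    return describe_wav(path) == (rate, 1, 16)
```

Running `describe_wav('extract.wav')` and `describe_wav('test.wav')` side by side shows whether the conversion did what was intended. The `wave` module only reads uncompressed PCM WAV files, which is the format produced here.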
####deepspeech speech to text (STT)
!deepspeech --model deepspeech-0.9.3-models.pbmm --scorer deepspeech-0.9.3-models.scorer --audio test.wav
#!deepspeech --model deepspeech-0.6.1-models/output_graph.pbmm --lm deepspeech-0.6.1-models/lm.binary --trie deepspeech-0.6.1-models/trie --audio test.wav ## old version
The commands are (probably) different from the latest version of deepspeech.
deepspeech-0.9.3-models:
usage: deepspeech [-h] --model MODEL [--scorer SCORER] --audio AUDIO
[--beam_width BEAM_WIDTH] [--lm_alpha LM_ALPHA]
[--lm_beta LM_BETA] [--version] [--extended] [--json]
[--candidate_transcripts CANDIDATE_TRANSCRIPTS]
[--hot_words HOT_WORDS]
Running DeepSpeech inference.
optional arguments:
-h, --help show this help message and exit
--model MODEL Path to the model (protocol buffer binary file)
--scorer SCORER Path to the external scorer file
--audio AUDIO Path to the audio file to run (WAV format)
--beam_width BEAM_WIDTH
Beam width for the CTC decoder
--lm_alpha LM_ALPHA Language model weight (lm_alpha). If not specified,
use default from the scorer package.
--lm_beta LM_BETA Word insertion bonus (lm_beta). If not specified, use
default from the scorer package.
--version Print version and exits
--extended Output string from extended metadata
--json Output json from metadata with timestamp of each word
--candidate_transcripts CANDIDATE_TRANSCRIPTS
Number of candidate transcripts to include in JSON
output
--hot_words HOT_WORDS
Hot-words and their boosts.
log
2021-02-15 16:02:27.698878: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Loading model from file deepspeech-0.9.3-models.pbmm
TensorFlow: v2.3.0-6-g23ad988
DeepSpeech: v0.9.3-0-gf2e9c85
2021-02-15 16:02:27.891101: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-02-15 16:02:27.892196: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2021-02-15 16:02:27.898478: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-15 16:02:27.899231: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:00:04.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2021-02-15 16:02:27.899265: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-02-15 16:02:27.904517: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-02-15 16:02:27.907846: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-02-15 16:02:27.908329: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2021-02-15 16:02:27.911375: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2021-02-15 16:02:27.912475: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2021-02-15 16:02:27.917975: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-02-15 16:02:27.918097: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-15 16:02:27.918905: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-15 16:02:27.919609: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2021-02-15 16:02:28.101511: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-02-15 16:02:28.101591: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263] 0
2021-02-15 16:02:28.101610: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0: N
2021-02-15 16:02:28.101755: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-15 16:02:28.102641: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-15 16:02:28.103507: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-15 16:02:28.104252: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2021-02-15 16:02:28.104298: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10597 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7)
Loaded model in 0.228s.
Loading scorer from files deepspeech-0.9.3-models.scorer
Loaded scorer in 0.000237s.
Running inference.
2021-02-15 16:02:28.162010: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
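Besides the command-line client, the `deepspeech` pip package can be called from Python directly, which avoids reloading the model for every file. The following is a rough sketch rather than the method used in these notes. `Model`, `enableExternalScorer`, `sampleRate`, and `stt` are the package's calls as I understand the 0.9.x Python API, while `load_pcm` and `transcribe` are hypothetical helper names. It needs numpy and the model files downloaded earlier.

```python
import wave


def load_pcm(path):
    """Read a WAV file and return (sample_rate_hz, raw_pcm_bytes)."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.readframes(wav.getnframes())


def transcribe(audio_path,
               model_path="deepspeech-0.9.3-models.pbmm",
               scorer_path="deepspeech-0.9.3-models.scorer"):
    """Run DeepSpeech on a 16-bit mono WAV file and return the transcript."""
    # Imported here so that load_pcm() works without these packages installed.
    import numpy as np
    from deepspeech import Model

    model = Model(model_path)
    model.enableExternalScorer(scorer_path)

    rate, pcm = load_pcm(audio_path)
    if rate != model.sampleRate():
        raise ValueError("expected %d Hz audio, got %d Hz"
                         % (model.sampleRate(), rate))

    # stt() expects the samples as a 16-bit integer array.
    return model.stt(np.frombuffer(pcm, dtype=np.int16))
```

With the files from this article in place, `print(transcribe('test.wav'))` should give the same kind of output as the `!deepspeech` command.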
###The result of deepspeech's automatic speech recognition (ASR) is
Please check the result for yourself ... I don't yet know how to make the task faster, and you will be waiting for quite a while, so you may want to use the shell redirection operator `>` to send the output to a text file.
!deepspeech --model deepspeech-0.9.3-models.pbmm --scorer deepspeech-0.9.3-models.scorer --audio test.wav > test.txt
Comparison: YouTube subtitles.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
YouTube captions
- - - - - - - - - - - - - - - - - - - YouTube - - - - - - - - - - - - - - - - - - -
1 you may write me down in history with
2 your bitter twisted lies
3 you may tribe me in the very dirt but
4 still like dust a lie does my sassiness
5 upset you
6 why are you beset with gloom just
7 because I walked as if I have oil wells
8 pumping in my living room just like
9 moons and like Suns with the certainty
10 of tides just like hope springing high
11 still I rise did you want to see me
12 broken bowed head and lowered eyes
13 shoulders falling down like teardrops we
14 can buy my soul who cries does my
15 sassiness upset you don't take it too
16 hard just cuz I laugh as if I have gold
17 mines digging in my own backyard you can
18 shoot me with your words you can cut me
19 with your lies you can kill me with your
20 hatefulness but just like life arise
21 just my sexiness offend you oh does it
22 come as a surprise that I dance as if I
23 have diamonds at the meeting of my
24 thighs
25 out of the huts of history's shame I
26 rise up from a past rooted in pain I
27 rise a black ocean leaping and wide
28 Welling and swelling and bearing in the
29 time leaving behind nights of terror and
30 fear I rise into a daybreak miraculously
31 clear I rise bringing the gifts that my
32 ancestors gave I am the hope and the
33 dream of the slave and so there go
************************************************************************************
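To put a number on the comparison rather than judging it by eye, a common measure is the word error rate (WER): the word-level edit distance between two transcripts divided by the number of words in the reference. The original notes do not compute this. The sketch below is one plain-Python way to do it, with `word_error_rate` as a made-up helper name, and it ignores punctuation handling entirely.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # previous[j] = edits needed to turn the reference words seen so far
    # (before the current one) into the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One substituted word out of eight ("trod" heard as "tribe").
print(word_error_rate("you may trod me in the very dirt",
                      "you may tribe me in the very dirt"))  # 0.125
```

With the poem text as the reference, the same function can score both the YouTube captions and the contents of test.txt, after stripping the caption line numbers.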
##Cf. Still I Rise by MAYA ANGELOU
https://www.poetryfoundation.org/poems/46446/still-i-rise
##Remarks
Initially, I ran DeepSpeech on Google Colaboratory in parallel with the IBM Watson Speech to Text demo to recognize and transcribe a video clip. The test ran for more than 3 hours with no sign that DeepSpeech would ever finish. As the audio playing in the IBM Watson demo neared its end, I realized that DeepSpeech was not processing the audio in real time the way Watson does, so I discarded that run and started over with a short clip.
I rewrote the article from the installation section onward for deepspeech 0.9.3, which is close to the latest version.
The command is different from the latest version of deepspeech.
Cf.
Speech to Text
The IBM Watson Speech to Text service uses speech recognition capabilities to convert Arabic, English, Spanish, French, Brazilian Portuguese, Japanese, Korean, German, and Mandarin speech into text.
https://speech-to-text-demo.ng.bluemix.net/
Rf.
Real-time Speech to Text with DeepSpeech - Getting Started on Windows and Transcribe Microphone Free
https://www.youtube.com/watch?v=c_0Q3T0XYTA
DeepSpeech | Speech to Text | Common Voice | Donate Your Voice by tuxfoo
https://www.youtube.com/watch?v=GixIsv_1__A
Speech to Text using Python - Fast and Accurate
https://www.youtube.com/watch?v=iWha--55Lz0
AutoSub(deepspeech)
https://github.com/abhirooptalasila/AutoSub
autosub(not deepspeech) and googlecolab
https://github.com/HandsomeWJ/SubMe/blob/master/SubMe.ipynb