###Deepspeech-0.6.1-models
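This article is a test notebook for running deepspeech on Google Colaboratory. deepspeech is a program that performs speech recognition on English audio and turns it into text (what is usually called speech recognition or ASR); here the audio is taken from a YouTube video.
####About Google Colaboratory
Frequently asked questions
The basics
What is Colaboratory?
Colaboratory (Colab for short) is a service provided by Google Research. Colab lets anyone write and run Python in the browser, which makes it especially well suited to machine learning, data analysis, and education. More specifically, it is a hosted Jupyter Notebook service that needs no special setup and gives free access to computing resources such as GPUs.
Is it really free to use?
Yes. Colab is free to use.
It sounds too good to be true. Are there any limitations?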
https://research.google.com/colaboratory/faq.html
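youtube-dl handles fetching the audio from the YouTube video, and deepspeech runs automatic speech recognition on the English speaker's audio and prints the text it infers. (The program below uses deepspeech-0.6.1-models / TensorFlow 1.)
In other words, this is something you can do right away, but note that the deepspeech-0.6.1-models data is 1.14G in size.
However, unless you save it somewhere persistent, the data disappears when the Google Colaboratory runtime ends.
It may look like a wall of difficult text. Put plainly, though, all you do is paste the Python code below into Google Colab cells and run them one after another.
In Google Colab, Control + Enter runs the code in a cell.
The highest psychological barrier is creating a Google account; nothing here is harder than that. Then again, you do not need to run it at all. It is simply here for a while as a sample for people who want to know.
Google Colab lets you choose vim key bindings in the editor settings, so people who are faster in vim can paste with Shift + Insert.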
Setting up Google Colaboratory
from google.colab import drive
drive.mount('/content/drive')
Rf.
External data: local files, Drive, Sheets, Cloud Storage
https://colab.research.google.com/notebooks/io.ipynb
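Because the runtime's data disappears when the session ends, one option is to keep the 1.14G model archive on the mounted Drive and copy it back at the start of a later session. The following is only a minimal sketch of that idea and is not part of the original notebook; the helper name `restore_or_save` and the Drive folder in the usage comment are assumptions.

```python
import shutil
from pathlib import Path


def restore_or_save(local_file, cache_dir):
    """Sync one file between the runtime disk and a persistent folder.

    Returns:
        'cached'   if both copies already exist (nothing is copied),
        'restored' if the file was copied from the cache to the runtime,
        'saved'    if the file was copied from the runtime into the cache,
        'missing'  if neither side has the file yet.
    """
    local_file = Path(local_file)
    cached_file = Path(cache_dir) / local_file.name

    if local_file.exists() and cached_file.exists():
        return 'cached'
    if cached_file.exists():
        shutil.copy2(cached_file, local_file)
        return 'restored'
    if local_file.exists():
        # Create the cache folder on first use.
        cached_file.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(local_file, cached_file)
        return 'saved'
    return 'missing'


# Hypothetical usage on Colab, after drive.mount('/content/drive'):
# restore_or_save('deepspeech-0.6.1-models.tar.gz',
#                 '/content/drive/MyDrive/deepspeech_cache')
```

A return value of `'missing'` would mean the archive still has to be downloaded once; calling the helper again after the download would then store it for later sessions.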
Speech Recognition with DeepSpeech
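Try searching for this phrase. Everything quoted below comes from there. There are also differences, so be aware of them and make your own improvements. It is hard to get started without seeing a sample that actually works, so I am grateful that the recipe notebook has been published.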
- MozillaDeepSpeech.ipynb ... mozilla/DeepSpeech with LM on Youtube videos
Rf.
Erdene-Ochir Tuguldur
tugstugi
Берлиний Техникийн Их Сургууль
https://github.com/tugstugi/dl-colab-notebooks
This notebook uses an open source project mozilla/DeepSpeech to transcribe a given youtube video.
For other deep-learning Colab notebooks, visit tugstugi/dl-colab-notebooks.
Install DeepSpeech
#@title
import os
from os.path import exists, join, basename, splitext
if not exists('deepspeech-0.6.1-models'):
  !apt-get install -qq sox
  !pip install -q deepspeech-gpu==0.6.1 youtube-dl
  !wget https://github.com/mozilla/DeepSpeech/releases/download/v0.6.1/deepspeech-0.6.1-models.tar.gz
  !tar xvfz deepspeech-0.6.1-models.tar.gz
from IPython.display import YouTubeVideo
log
Selecting previously unselected package libopencore-amrnb0:amd64.
(Reading database ... 146425 files and directories currently installed.)
Preparing to unpack .../0-libopencore-amrnb0_0.1.3-2.1_amd64.deb ...
Unpacking libopencore-amrnb0:amd64 (0.1.3-2.1) ...
Selecting previously unselected package libopencore-amrwb0:amd64.
Preparing to unpack .../1-libopencore-amrwb0_0.1.3-2.1_amd64.deb ...
Unpacking libopencore-amrwb0:amd64 (0.1.3-2.1) ...
Selecting previously unselected package libmagic-mgc.
Preparing to unpack .../2-libmagic-mgc_1%3a5.32-2ubuntu0.4_amd64.deb ...
Unpacking libmagic-mgc (1:5.32-2ubuntu0.4) ...
Selecting previously unselected package libmagic1:amd64.
Preparing to unpack .../3-libmagic1_1%3a5.32-2ubuntu0.4_amd64.deb ...
Unpacking libmagic1:amd64 (1:5.32-2ubuntu0.4) ...
Selecting previously unselected package libsox3:amd64.
Preparing to unpack .../4-libsox3_14.4.2-3ubuntu0.18.04.1_amd64.deb ...
Unpacking libsox3:amd64 (14.4.2-3ubuntu0.18.04.1) ...
Selecting previously unselected package libsox-fmt-alsa:amd64.
Preparing to unpack .../5-libsox-fmt-alsa_14.4.2-3ubuntu0.18.04.1_amd64.deb ...
Unpacking libsox-fmt-alsa:amd64 (14.4.2-3ubuntu0.18.04.1) ...
Selecting previously unselected package libsox-fmt-base:amd64.
Preparing to unpack .../6-libsox-fmt-base_14.4.2-3ubuntu0.18.04.1_amd64.deb ...
Unpacking libsox-fmt-base:amd64 (14.4.2-3ubuntu0.18.04.1) ...
Selecting previously unselected package sox.
Preparing to unpack .../7-sox_14.4.2-3ubuntu0.18.04.1_amd64.deb ...
Unpacking sox (14.4.2-3ubuntu0.18.04.1) ...
Setting up libmagic-mgc (1:5.32-2ubuntu0.4) ...
Setting up libmagic1:amd64 (1:5.32-2ubuntu0.4) ...
Setting up libopencore-amrnb0:amd64 (0.1.3-2.1) ...
Setting up libopencore-amrwb0:amd64 (0.1.3-2.1) ...
Setting up libsox3:amd64 (14.4.2-3ubuntu0.18.04.1) ...
Setting up libsox-fmt-base:amd64 (14.4.2-3ubuntu0.18.04.1) ...
Setting up libsox-fmt-alsa:amd64 (14.4.2-3ubuntu0.18.04.1) ...
Setting up sox (14.4.2-3ubuntu0.18.04.1) ...
Processing triggers for libc-bin (2.27-3ubuntu1.3) ...
/sbin/ldconfig.real: /usr/local/lib/python3.6/dist-packages/ideep4py/lib/libmkldnn.so.0 is not a symbolic link
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...
Processing triggers for mime-support (3.60ubuntu1) ...
|████████████████████████████████| 18.7MB 160kB/s
|████████████████████████████████| 1.9MB 49.9MB/s
--2021-02-13 17:57:27-- https://github.com/mozilla/DeepSpeech/releases/download/v0.6.1/deepspeech-0.6.1-models.tar.gz
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github-releases.githubusercontent.com/60273704/f29e6300-33cd-11ea-8523-3fc40b31be9a?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20210213%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210213T175727Z&X-Amz-Expires=300&X-Amz-Signature=385f1997b95eb6dfac74a33bd120afe1ef4e11c74ffdc081c45d6de333ba5a0b&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=60273704&response-content-disposition=attachment%3B%20filename%3Ddeepspeech-0.6.1-models.tar.gz&response-content-type=application%2Foctet-stream [following]
--2021-02-13 17:57:27-- https://github-releases.githubusercontent.com/60273704/f29e6300-33cd-11ea-8523-3fc40b31be9a?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20210213%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210213T175727Z&X-Amz-Expires=300&X-Amz-Signature=385f1997b95eb6dfac74a33bd120afe1ef4e11c74ffdc081c45d6de333ba5a0b&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=60273704&response-content-disposition=attachment%3B%20filename%3Ddeepspeech-0.6.1-models.tar.gz&response-content-type=application%2Foctet-stream
Resolving github-releases.githubusercontent.com (github-releases.githubusercontent.com)... 185.199.108.154, 185.199.109.154, 185.199.110.154, ...
Connecting to github-releases.githubusercontent.com (github-releases.githubusercontent.com)|185.199.108.154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1229020343 (1.1G) [application/octet-stream]
Saving to: ‘deepspeech-0.6.1-models.tar.gz’
deepspeech-0.6.1-mo 100%[===================>] 1.14G 96.4MB/s in 12s
2021-02-13 17:57:39 (95.4 MB/s) - ‘deepspeech-0.6.1-models.tar.gz’ saved [1229020343/1229020343]
._deepspeech-0.6.1-models
deepspeech-0.6.1-models/
deepspeech-0.6.1-models/._lm.binary
deepspeech-0.6.1-models/lm.binary
deepspeech-0.6.1-models/._output_graph.pbmm
deepspeech-0.6.1-models/output_graph.pbmm
deepspeech-0.6.1-models/._output_graph.pb
deepspeech-0.6.1-models/output_graph.pb
deepspeech-0.6.1-models/._trie
deepspeech-0.6.1-models/trie
deepspeech-0.6.1-models/output_graph.tflite
deepspeech-0.6.1-mo 100%[===================>] 1.14G
size: 1.14G
Extracting YouTube video_id from YouTube URL
from urllib.parse import urlparse, parse_qs
urltext ='https://www.youtube.com/watch?v=qviM_GnJbOM'
args = [urltext]
video_id = ''
def extract_video_id(url):
    query = urlparse(url)
    if query.hostname == 'youtu.be': return query.path[1:]
    if query.hostname in {'www.youtube.com', 'youtube.com'}:
        if query.path == '/watch': return parse_qs(query.query)['v'][0]
        if query.path[:7] == '/embed/': return query.path.split('/')[2]
        if query.path[:3] == '/v/': return query.path.split('/')[2]
    # fail?
    return None

for url in args:
    video_id = (extract_video_id(url))
    print('youtube video_id:',video_id)
Rf.
extracting youtube video id from youtube URL
https://qiita.com/dauuricus/private/9e70c4c25566fedb9c19
Transcribe Youtube Video
We are going to make speech recognition on the following youtube video
YouTubeVideo(video_id)
Download the above video, convert to a WAV file and do speech recognition
#!rm -rf *.wav
!youtube-dl --extract-audio --audio-format wav --output "extract.%(ext)s" {urltext}
youtube-dl --extract-audio --audio-format wav --output "extract.%(ext)s"
extracts the audio track from the video in WAV format and saves it as extract.wav. deepspeech appears to expect audio with a 16000 Hz sample rate.
[youtube] qviM_GnJbOM: Downloading webpage
[download] Destination: extract.m4a
[download] 100% of 2.05MiB in 00:00
[ffmpeg] Destination: extract.wav
Deleting original file extract.m4a (pass -k to keep)
Rf.
Download Audio from YouTube
https://gist.github.com/umidjons/8a15ba3813039626553929458e3ad1fc
The audio in this test case does not have to come from YouTube, so if you have not installed youtube-dl, ffmpeg may not be installed either. If you need ffmpeg separately to convert the audio, you can install it with this.
!apt install ffmpeg
!ffmpeg -i extract.wav -vn -acodec pcm_s16le -ac 1 -ar 16000 -f wav test.wav
!deepspeech --model deepspeech-0.6.1-models/output_graph.pbmm --lm deepspeech-0.6.1-models/lm.binary --trie deepspeech-0.6.1-models/trie --audio test.wav
deepspeech apparently needs a 16000 Hz WAV, so the 44100 Hz extract.wav is first converted to test.wav (PCM signed 16-bit little-endian, 16000 Hz) before deepspeech is run.
-codecs # list codecs
-c:a # audio codec (-acodec)
-fs SIZE # limit file size (bytes)
-b:v 1M # video bitrate (1M = 1Mbit/s)
-b:a 1M # audio bitrate
-vn # no video
-aq QUALITY # audio quality (codec-specific)
-ar 16000 # audio sample rate (hz)
-ac 1 # audio channels (1=mono, 2=stereo)
-an # no audio
-vol N # volume (256=normal)
log
ffmpeg version 3.4.8-0ubuntu0.2 Copyright (c) 2000-2020 the FFmpeg developers
built with gcc 7 (Ubuntu 7.5.0-3ubuntu1~18.04)
configuration: --prefix=/usr --extra-version=0ubuntu0.2 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --enable-gpl --disable-stripping --enable-avresample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librubberband --enable-librsvg --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzmq --enable-libzvbi --enable-omx --enable-openal --enable-opengl --enable-sdl2 --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libopencv --enable-libx264 --enable-shared
libavutil 55. 78.100 / 55. 78.100
libavcodec 57.107.100 / 57.107.100
libavformat 57. 83.100 / 57. 83.100
libavdevice 57. 10.100 / 57. 10.100
libavfilter 6.107.100 / 6.107.100
libavresample 3. 7. 0 / 3. 7. 0
libswscale 4. 8.100 / 4. 8.100
libswresample 2. 9.100 / 2. 9.100
libpostproc 54. 7.100 / 54. 7.100
Guessed Channel Layout for Input Stream #0.0 : stereo
Input #0, wav, from 'test.wav':
Metadata:
encoder : Lavf57.83.100
Duration: 00:02:48.86, bitrate: 1411 kb/s
Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 44100 Hz, stereo, s16, 1411 kb/s
Stream mapping:
Stream #0:0 -> #0:0 (pcm_s16le (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, wav, to 'test1.wav':
Metadata:
ISFT : Lavf57.83.100
Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
Metadata:
encoder : Lavc57.107.100 pcm_s16le
size= 5277kB time=00:02:48.85 bitrate= 256.0kbits/s speed=1.24e+03x
video:0kB audio:5277kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.001444%
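Since the text above says deepspeech seems to need a 16000 Hz WAV, it can be reassuring to check the converted file before starting a long recognition run. The sketch below uses only Python's standard `wave` module; the function names are made up for illustration, and the "mono, 16000 Hz, 16-bit" condition is taken from the ffmpeg options used above rather than from the deepspeech documentation.

```python
import wave


def wav_info(path):
    """Return (channels, sample_rate_hz, bits_per_sample, duration_seconds)."""
    with wave.open(path, 'rb') as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8
        duration = wav.getnframes() / rate
    return channels, rate, bits, duration


def looks_like_deepspeech_input(path):
    """True if the file is mono, 16000 Hz, 16-bit PCM (assumed requirement)."""
    channels, rate, bits, _ = wav_info(path)
    return (channels, rate, bits) == (1, 16000, 16)


# Hypothetical usage in a Colab cell after the ffmpeg conversion:
# print(wav_info('test.wav'))
# print(looks_like_deepspeech_input('test.wav'))
```

The duration value is also a rough hint of how long the recognition step may take, given the observation in the notes that processing a long clip took about as long as the audio itself.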
###The result of deepspeech automatic speech recognition (ASR):
you may write me down and history with your bitter rested line you may try me in the very dirt but still like dust or i does my satinette you why are you visitations a waltari have oil wells pumping in my living room that's like moons and like sons with the seance just like hopes springing high still and he did you want to see me broken bowed head and lowered eyes soldiering down like hiram weakened by my soul socrates my sansonnetto do take it to her i just got a laugh as if i have gold man sinking in my own back yard you can shoot me with your words you can cut me with your lies you can kill me with your hatefulness but just like life ran does my saxon as the firm you all does it come as a surprise that i danced as if i have diamonds that the meeting of my size out of a hut of history shame i ride up from a past rooted in pain i rise a black ocean leaving and by welling and swelling and bearing him i leaving behind might of terror and fear i ran into a daybreak miraculously clear i right bringing the gifts that my emphasis gay i am the whole and the dream of the sleeve and so that
For comparison, here are the YouTube captions.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
YouTube captions
- - - - - - - - - - - - - - - - - - - YouTube - - - - - - - - - - - - - - - - - - -
1 you may write me down in history with
2 your bitter twisted lies
3 you may tribe me in the very dirt but
4 still like dust a lie does my sassiness
5 upset you
6 why are you beset with gloom just
7 because I walked as if I have oil wells
8 pumping in my living room just like
9 moons and like Suns with the certainty
10 of tides just like hope springing high
11 still I rise did you want to see me
12 broken bowed head and lowered eyes
13 shoulders falling down like teardrops we
14 can buy my soul who cries does my
15 sassiness upset you don't take it too
16 hard just cuz I laugh as if I have gold
17 mines digging in my own backyard you can
18 shoot me with your words you can cut me
19 with your lies you can kill me with your
20 hatefulness but just like life arise
21 just my sexiness offend you oh does it
22 come as a surprise that I dance as if I
23 have diamonds at the meeting of my
24 thighs
25 out of the huts of history's shame I
26 rise up from a past rooted in pain I
27 rise a black ocean leaping and wide
28 Welling and swelling and bearing in the
29 time leaving behind nights of terror and
30 fear I rise into a daybreak miraculously
31 clear I rise bringing the gifts that my
32 ancestors gave I am the hope and the
33 dream of the slave and so there go
************************************************************************************
##Cf. Still I Rise by MAYA ANGELOU
https://www.poetryfoundation.org/poems/46446/still-i-rise
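The comparison above is done by eye. One common way to put a number on it is the word error rate (WER): the word-level edit distance between a reference text and the recognized text, divided by the number of reference words. The sketch below is an illustration added here, not something the original notebook computes. It treats the YouTube captions as the reference, which is itself an assumption, since the captions also differ from the published poem.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError('reference must contain at least one word')

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# First caption lines versus the start of the deepspeech output.
captions = 'you may write me down in history with your bitter twisted lies'
deepspeech_out = 'you may write me down and history with your bitter rested line'
print(word_error_rate(captions, deepspeech_out))  # prints 0.25
```

In this opening fragment three of the twelve reference words are substituted, which gives 0.25; the same function could be applied to the full texts to compare the two systems over the whole clip.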
###The commands (probably) differ from the latest version of deepspeech. (I have not checked.)
deepspeech-0.6.1-models
usage: deepspeech [-h] --model MODEL [--lm [LM]] [--trie [TRIE]] --audio AUDIO
[--beam_width BEAM_WIDTH] [--lm_alpha LM_ALPHA]
[--lm_beta LM_BETA] [--version] [--extended] [--json]
Running DeepSpeech inference.
optional arguments:
-h, --help show this help message and exit
--model MODEL Path to the model (protocol buffer binary file)
--lm [LM] Path to the language model binary file
--trie [TRIE] Path to the language model trie file created with
native_client/generate_trie
--audio AUDIO Path to the audio file to run (WAV format)
--beam_width BEAM_WIDTH
Beam width for the CTC decoder
--lm_alpha LM_ALPHA Language model weight (lm_alpha)
--lm_beta LM_BETA Word insertion bonus (lm_beta)
--version Print version and exits
--extended Output string from extended metadata
--json Output json from metadata with timestamp of each word
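The options in the usage text above can also be assembled from Python instead of typing the full command into a `!` cell. The following is a small sketch under that assumption: it only uses flags that appear in the 0.6.1 usage text (`--model`, `--lm`, `--trie`, `--audio`, `--json`), while the helper names `build_deepspeech_cmd` and `transcribe` are hypothetical.

```python
import subprocess
from pathlib import Path


def build_deepspeech_cmd(model_dir, audio, json_output=False):
    """Build the argument list for the deepspeech 0.6.1 command line."""
    model_dir = Path(model_dir)
    cmd = [
        'deepspeech',
        '--model', str(model_dir / 'output_graph.pbmm'),
        '--lm', str(model_dir / 'lm.binary'),
        '--trie', str(model_dir / 'trie'),
        '--audio', str(audio),
    ]
    if json_output:
        # Per the usage text: JSON with a timestamp for each word.
        cmd.append('--json')
    return cmd


def transcribe(model_dir, audio, json_output=False):
    """Run the CLI and return whatever it writes to standard output."""
    result = subprocess.run(
        build_deepspeech_cmd(model_dir, audio, json_output),
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout


# Hypothetical usage once the models are unpacked and test.wav exists:
# text = transcribe('deepspeech-0.6.1-models', 'test.wav')
print(' '.join(build_deepspeech_cmd('deepspeech-0.6.1-models', 'test.wav')))
```

Keeping the command construction separate from the call makes it easy to print and inspect the exact command before a long run.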
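##Notes
At first, as an operation test, I targeted a video clip more than 3 hours long and ran deepspeech on Google Colaboratory in parallel with the IBM Watson Speech to Text demo to transcribe it, but it showed no sign of finishing. Around the time the audio playing in the IBM Watson demo was nearing its end, I realized that deepspeech, like Watson, might take roughly the real-time length of the audio to process. I discarded the result that it did eventually produce and started over with a short clip, which is the result shown here.
I tried to rewrite the article from the installation part onward for deepspeech 0.9.3, which is close to the latest version, but the file layout changed with the version upgrade, so I could not work it out without looking into it properly.
(Addendum)
So I looked into it, and the rewritten version is: YouTube, Deepspeech, with Google Colaboratory testing_0002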
Cf.
Speech to Text
The IBM Watson Speech to Text service uses speech recognition capabilities to convert Arabic, English, Spanish, French, Brazilian Portuguese, Japanese, Korean, German, and Mandarin speech into text.
https://speech-to-text-demo.ng.bluemix.net/