
VOSK test_simple.py on GoogleColaboratory [004]

Last updated at 2021-03-29, posted at 2021-03-11

This is a continuation of 'VOSK test_simple.py on GoogleColaboratory [001]'.

Tested on Google Colab.
Vosk is a speech recognition program. It is designed to run locally, but here I run it on Google Colab; I have not tested it on a local machine.

Vosk is a speech recognition toolkit. The best things in Vosk are:

Supports 17 languages and dialects - English, Indian English, German, French, Spanish, Portuguese, Chinese, Russian, Turkish, Vietnamese, Italian, Dutch, Catalan, Arabic, Greek, Farsi, Filipino. More to come.
Works offline, even on lightweight devices - Raspberry Pi, Android, iOS
Installs with simple pip3 install vosk
Portable per-language models are only 50Mb each, but there are much bigger server models available.
Provides streaming API for the best user experience (unlike popular speech-recognition python packages)
There are bindings for different programming languages, too - java/csharp/javascript etc.
Allows quick reconfiguration of vocabulary for best accuracy.
Supports speaker identification beside simple speech recognition.

From spoken Chinese to text, then translation with the Google Translate API.

Install VOSK on GoogleColaboratory

GoogleColab

!pip install vosk

!git clone https://github.com/alphacep/vosk-api

Download Language Model

Here we use the Chinese LM (language model) to test Chinese speech recognition.

case a: English language model
case b: Chinese language model

Download via https://alphacephei.com/vosk/models

a:English ASR testing

GoogleColab

%cd vosk-api/python/example
# English lang model
!wget https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip
!unzip vosk-model-small-en-us-0.15.zip
%mv vosk-model-small-en-us-0.15 model

b:Chinese ASR testing

GoogleColab

%cd vosk-api/python/example
# Chinese lang model
!wget https://alphacephei.com/vosk/models/vosk-model-small-cn-0.3.zip 
!unzip vosk-model-small-cn-0.3.zip
%mv vosk-model-small-cn-0.3 model
!rm -rf vosk-model-small-cn-0.3.zip

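Download the LM, unzip it, and rename the unzipped folder to model inside the example directory of the Git clone. That is this line: %mv vosk-model-small-cn-0.3 model
Once the folder is in place, the downloaded zip file is no longer needed, so delete it.
Prefixing a line with ! runs it as a terminal command.
% has a similar effect, but the two behave slightly differently: a ! command runs in a temporary subshell, so !cd does not persist, whereas %cd changes the notebook's working directory.
%%bash makes the whole cell run as bash.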

Reference:
https://qiita.com/funatsufumiya/items/e455ab8d801af6e1415d
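The difference between ! and % is easiest to see with cd. The following is only a plain-Python analogy, not IPython's actual implementation: running cd in a child shell (roughly what !cd does) leaves the current process where it was, while os.chdir (roughly what %cd does) moves the process itself. The scratch directory is created just for this demonstration.

```python
import os
import subprocess
import tempfile

start_dir = os.getcwd()
target_dir = os.path.realpath(tempfile.mkdtemp())  # scratch directory for the demo

# Roughly like `!cd target_dir`: the cd happens in a child shell and is lost.
subprocess.run(["sh", "-c", 'cd "$1"', "sh", target_dir], check=True)
after_shell_cd = os.getcwd()

# Roughly like `%cd target_dir`: the current process changes directory.
os.chdir(target_dir)
after_magic_cd = os.path.realpath(os.getcwd())

os.chdir(start_dir)  # restore the original directory
print("unchanged after shell cd:", after_shell_cd == start_dir)
print("moved after os.chdir:", after_magic_cd == target_dir)
```

This is why the cells in this article use %cd, not !cd, to enter vosk-api/python/example.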

Model structure

Once you trained the model arrange the files according to the following layout (see en-us-aspire for details):

am/final.mdl - acoustic model
conf/mfcc.conf - mfcc config file. Make sure you take mfcc_hires.conf version if you are using hires model (most external ones)
conf/model.conf - provide default decoding beams and silence phones. you have to create this file yourself, it is not present in kaldi model
ivector/final.dubm - take ivector files from ivector extractor (optional folder if the model is trained with ivectors)
ivector/final.ie
ivector/final.mat
ivector/splice.conf
ivector/global_cmvn.stats
ivector/online_cmvn.conf
graph/phones/word_boundary.int - from the graph
graph/HCLG.fst - this is the decoding graph, if you are not using lookahead
graph/HCLr.fst - use Gr.fst and HCLr.fst instead of one big HCLG.fst if you want to run rescoring
graph/Gr.fst
graph/phones.txt - from the graph
graph/words.txt - from the graph
rescore/G.carpa - carpa rescoring is optional but helpful in big models. Usually located inside data/lang_test_rescore
rescore/G.fst - also optional if you want to use rescoring
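One way to make the layout above more concrete is to check an unpacked model folder against it. This is a minimal sketch: the function name is hypothetical, and treating these three files as mandatory is my assumption, not something the Vosk documentation states. Older or smaller models may be laid out differently, so a non-empty result is a hint to look closer, not proof of a broken model.

```python
import os

# Assumption: a small subset of the layout above, treated as mandatory here.
REQUIRED_FILES = ["am/final.mdl", "conf/mfcc.conf", "conf/model.conf"]


def missing_model_files(model_dir, required=REQUIRED_FILES):
    """Return the entries of `required` that are not files under model_dir."""
    return [
        rel for rel in required
        if not os.path.isfile(os.path.join(model_dir, *rel.split("/")))
    ]


missing = missing_model_files("model")
print("missing files:", missing if missing else "none")
```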

This layout will not mean much unless you know how the program works. I honestly do not understand it myself, so let's move on.

directory check

GoogleColab

!pwd

This command shows the current location (path). Confirm that you are in /content/vosk-api/python/example. If you are somewhere else, move there before continuing; working from a different directory causes errors.

%mv vosk-model-small-cn-0.3 model

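If you are thinking "you should have said so at the start" and have already copied the folder somewhere else,

!rm -rf

will delete files and folders.
If you have run the cells in order from the top, you are in vosk-api/python/example/. If the timing of cell execution slips, for example by moving on to the next step before the folder has been copied, you may be somewhere else.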
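The same directory check can be done from Python. This is a minimal sketch; the helper name ensure_cwd is hypothetical, and the path is the one the pwd step is expected to print.

```python
import os

EXPECTED_DIR = "/content/vosk-api/python/example"  # path from the pwd step


def ensure_cwd(expected):
    """Change into `expected` if needed; return True when we end up there."""
    if os.path.realpath(os.getcwd()) == os.path.realpath(expected):
        return True
    if os.path.isdir(expected):
        os.chdir(expected)
        return True
    return False  # the directory does not exist (yet)


if not ensure_cwd(EXPECTED_DIR):
    print("Not found:", EXPECTED_DIR, "- run the git clone cell first.")
```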

Test Audio sampling

case b: YouTube

case b: in other words, we try Chinese speech recognition. The audio to recognize comes from a video on YouTube.

GoogleColab

urltext ='https://youtu.be/cNSq5RdVf28' # Chinese YouTube Clip with no captions

GoogleColab

from urllib.parse import urlparse, parse_qs

args = [urltext]
video_id = ''


def extract_video_id(url):
    query = urlparse(url)
    if query.hostname == 'youtu.be': return query.path[1:]
    if query.hostname in {'www.youtube.com', 'youtube.com'}:
        if query.path == '/watch': return parse_qs(query.query)['v'][0]
        if query.path[:7] == '/embed/': return query.path.split('/')[2]
        if query.path[:3] == '/v/': return query.path.split('/')[2]
    # fail?
    return None

for url in args:
    video_id = (extract_video_id(url))
    print('youtube video_id:',video_id)
    
from IPython.display import YouTubeVideo

YouTubeVideo(video_id)

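This extracts the 11-character video id from the YouTube URL. Running the cells above in order stores those 11 characters in video_id. That is all it does, but it is sometimes handy.
You do not have to extract the id with a program, though. Just pick out the part of the URL that looks like the id by eye; if anything, that is the more reliable method.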
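Whether the id was picked out by eye or by the function above, a quick sanity check can catch copy-and-paste mistakes. This is a minimal sketch with a hypothetical helper name; the 11-character length comes from the text above, while the allowed character set (letters, digits, "-" and "_") is my assumption.

```python
import re

# Assumption: ids are exactly 11 characters from letters, digits, "-" and "_".
VIDEO_ID_RE = re.compile(r"[A-Za-z0-9_-]{11}")


def looks_like_video_id(candidate):
    """Return True if `candidate` has the shape of a YouTube video id."""
    return isinstance(candidate, str) and VIDEO_ID_RE.fullmatch(candidate) is not None


print(looks_like_video_id("cNSq5RdVf28"))  # id from the URL used in this article
```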

GoogleColab

!rm -rf e*.wav
!pip install -q youtube-dl
!youtube-dl --extract-audio --audio-format wav --output "extract.%(ext)s" {urltext}

This installs youtube-dl, then uses it to extract the audio as a wav file.

GoogleColab

!apt install ffmpeg

!ffmpeg -i extract.wav -vn -acodec pcm_s16le -ac 1 -ar 16000 -f wav test1.wav

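Any tool will do, but here ffmpeg converts the audio to 16000 Hz mono. An ffmpeg audio-conversion cheat sheet follows. If the command looks baffling, do not give up; it is baffling only because nobody explains it anywhere.
It is simple once you look at it. extract.wav is the audio extracted by youtube-dl. The options -vn -acodec pcm_s16le -ac 1 -ar 16000 mean no video, 16-bit PCM, 1 channel (mono), at 16000 Hz. Finally, -f wav test1.wav writes a separate .wav file with that name. There is no need to memorize any of this. The options are simply listed in that order, so you get used to them quickly.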

-codecs          # list codecs
-c:a             # audio codec (-acodec)
-fs SIZE         # limit file size (bytes)
-b:v 1M          # video bitrate (1M = 1Mbit/s)
-b:a 1M          # audio bitrate
-vn              # no video
-aq QUALITY      # audio quality (codec-specific)
-ar 16000        # audio sample rate (hz)
-ac 1            # audio channels (1=mono, 2=stereo)
-an              # no audio
-vol N           # volume (256=normal)
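To tie the flags back to the conversion step, the sketch below builds the same ffmpeg command as a Python argument list and reads back the properties of a WAV file with the standard wave module. Both helper names are hypothetical. The argument list could be passed to subprocess.run, which is not done here so that the block runs without ffmpeg installed.

```python
import wave


def ffmpeg_args(src, dst, rate=16000, channels=1):
    """Build the ffmpeg command used above as an argument list."""
    return [
        "ffmpeg", "-i", src,
        "-vn",                   # no video
        "-acodec", "pcm_s16le",  # 16-bit little-endian PCM
        "-ac", str(channels),    # number of audio channels
        "-ar", str(rate),        # sample rate in Hz
        "-f", "wav", dst,
    ]


def wav_info(path):
    """Return (channels, sample width in bytes, sample rate) of a WAV file."""
    with wave.open(path, "rb") as wf:
        return wf.getnchannels(), wf.getsampwidth(), wf.getframerate()


print(" ".join(ffmpeg_args("extract.wav", "test1.wav")))
```

After the conversion, wav_info("test1.wav") should report 1 channel, 2-byte samples, and a 16000 Hz sample rate if the flags took effect.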

ASR test_simple.py ...

Speech to text
The result is saved to a file in JSON format. How to view the result is described here: https://qiita.com/dauuricus/items/6dde7129b8dbc3ff905a

GoogleColab

# !/usr/bin/env python3

from vosk import Model, KaldiRecognizer, SetLogLevel
import sys
import os
import wave
import json

path = '/content/vosk-api/python/example'  # no trailing slash; '/' is added when joining below

SetLogLevel(0)

if not os.path.exists("model"):
    print ("Please download the model from https://alphacephei.com/vosk/models and unpack as 'model' in the current folder.")
    exit (1)

# wf = wave.open(path+'/test.wav',"rb")#English test sample
wf = wave.open(path+'/test1.wav',"rb")#Chinese lang test sample
sound = path+'/test1.wav'
if wf.getnchannels() != 1 or wf.getsampwidth() != 2 or wf.getcomptype() != "NONE":
    print ("Audio file must be WAV format mono PCM.")
    exit (1)

model = Model("model")
rec = KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        continue
        #print(rec.Result())
       ## res = json.loads(rec.Result())
        #print(res['text'])
    #else:
        #print(rec.PartialResult())

# Write the final recognition result (a JSON string) to vosk.json.
with open('vosk.json', 'w') as f:
    f.write(rec.FinalResult() + '\n')
## res = json.loads(rec.FinalResult())
## print(res['text'])
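Once vosk.json exists, the recognized text can be read back with the standard json module. This is a minimal sketch: the helper name is hypothetical, and it assumes the saved result has a 'text' field, as the commented-out res['text'] lines above suggest, and that the file is UTF-8 encoded.

```python
import json
import os


def read_vosk_text(json_path):
    """Load a saved Vosk result and return its 'text' field ('' if absent)."""
    with open(json_path, encoding="utf-8") as f:
        result = json.load(f)
    return result.get("text", "")


if os.path.exists("vosk.json"):
    print(read_vosk_text("vosk.json"))
```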

GoogleColab

from IPython.display import Audio

# Audio(path+'/test.wav') # a:English
Audio(path+'/test1.wav') # b:Chinese

Putting it all together

googlecolab link:

Original 'test_simple.py'

The original, unmodified code can be viewed like this.

GoogleColab

%%bash
cat -n /content/vosk-api/python/example/test_simple.py

test_ffmpeg.py

GoogleColab

%%bash
cat -n /content/vosk-api/python/example/test_ffmpeg.py

check:
YouTube, Deepspeech, with Google Colaboratory [testing_0003]

YoavRamon/awesome-kaldi
https://github.com/YoavRamon/awesome-kaldi
