
VOSK test_simple.py on GoogleColaboratory [002]

Posted at 2021-02-24

The previous article in this series is 'VOSK test_simple.py on GoogleColaboratory [001]'.


The Google Colab file for this article is linked at the bottom of the page.1

Chinese speech-to-text, with translation via the Google API:

https://youtu.be/qtNFAPnqkhQ

Install VOSK on GoogleColaboratory

GoogleColab
# install the VOSK Python package
!pip install vosk

# clone the vosk-api repository to get the example scripts
!git clone https://github.com/alphacep/vosk-api
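To confirm the installation before going further, a quick import check (a minimal sketch):

GoogleColab
# verify that the package imports cleanly
from vosk import Model, KaldiRecognizer, SetLogLevel
print('vosk import OK')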

Download Language Model

Download via https://alphacephei.com/vosk/models

a: English ASR testing

GoogleColab
%cd vosk-api/python/example
# English language model (small)
!wget https://alphacephei.com/kaldi/models/vosk-model-small-en-us-0.15.zip
!unzip vosk-model-small-en-us-0.15.zip
# the example script expects the model directory to be named 'model'
%mv vosk-model-small-en-us-0.15 model

b: Chinese ASR testing

GoogleColab
%cd vosk-api/python/example
# Chinese language model (small)
!wget https://alphacephei.com/vosk/models/vosk-model-small-cn-0.3.zip
!unzip vosk-model-small-cn-0.3.zip
# the example script expects the model directory to be named 'model'
%mv vosk-model-small-cn-0.3 model
!rm -rf vosk-model-small-cn-0.3.zip
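The same three steps (download, unzip, rename to 'model') repeat for every model, so they can be wrapped in a small helper. A minimal sketch using only the standard library; fetch_model is a hypothetical name, not part of vosk-api:

GoogleColab
import os, shutil, urllib.request, zipfile

def fetch_model(name, base='https://alphacephei.com/vosk/models/'):
    # hypothetical helper: download a model zip and unpack it as ./model
    zip_name = name + '.zip'
    urllib.request.urlretrieve(base + zip_name, zip_name)
    with zipfile.ZipFile(zip_name) as z:
        z.extractall('.')
    if os.path.isdir('model'):
        shutil.rmtree('model')   # replace any previously unpacked model
    os.rename(name, 'model')
    os.remove(zip_name)

# e.g. fetch_model('vosk-model-small-cn-0.3')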

Model structure

Once you have trained a model, arrange the files according to the following layout (see en-us-aspire for details). A quick layout check in Python follows the list:

  • am/final.mdl - acoustic model
  • conf/mfcc.conf - MFCC config file. Make sure you take the mfcc_hires.conf version if you are using a hires model (most external ones are)
  • conf/model.conf - provides default decoding beams and silence phones. You have to create this file yourself; it is not present in the Kaldi model
  • ivector/final.dubm - take the ivector files from the ivector extractor (the folder is optional; only present if the model is trained with ivectors)
  • ivector/final.ie
  • ivector/final.mat
  • ivector/splice.conf
  • ivector/global_cmvn.stats
  • ivector/online_cmvn.conf
  • graph/phones/word_boundary.int - from the graph
  • graph/HCLG.fst - the decoding graph, if you are not using lookahead
  • graph/HCLr.fst - use Gr.fst and HCLr.fst instead of one big HCLG.fst if you want to run rescoring
  • graph/Gr.fst
  • graph/phones.txt - from the graph
  • graph/words.txt - from the graph
  • rescore/G.carpa - CARPA rescoring is optional but helpful for big models. Usually located inside data/lang_test_rescore
  • rescore/G.fst - also optional, if you want to use rescoring
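A minimal sketch that spot-checks an unpacked model against this layout (assumes the model sits in ./model; which optional files exist depends on how the model was built):

GoogleColab
import os

# core files that every layout above includes
required = ['am/final.mdl', 'conf/mfcc.conf']
# files that depend on how the model was built
optional = ['conf/model.conf', 'graph/HCLG.fst', 'graph/HCLr.fst',
            'graph/words.txt', 'ivector/final.ie', 'rescore/G.carpa']

for f in required:
    print(('OK      ' if os.path.exists(os.path.join('model', f)) else 'MISSING ') + f)
for f in optional:
    if os.path.exists(os.path.join('model', f)):
        print('found optional ' + f)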

Directory check

GoogleColab
!pwd

Test Audio sampling

case b: YouTube

GoogleColab
urltext ='https://youtu.be/cNSq5RdVf28' # Chinese YouTube Clip with no captions
GoogleColab
from urllib.parse import urlparse, parse_qs

args = [urltext]
video_id = ''


def extract_video_id(url):
    query = urlparse(url)
    if query.hostname == 'youtu.be': return query.path[1:]
    if query.hostname in {'www.youtube.com', 'youtube.com'}:
        if query.path == '/watch': return parse_qs(query.query)['v'][0]
        if query.path[:7] == '/embed/': return query.path.split('/')[2]
        if query.path[:3] == '/v/': return query.path.split('/')[2]
    # fail?
    return None

for url in args:
    video_id = extract_video_id(url)
    print('youtube video_id:', video_id)

from IPython.display import YouTubeVideo

YouTubeVideo(video_id)
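For reference, the same function also handles standard watch URLs via the parse_qs branch; a quick check with an equivalent URL for the same clip:

GoogleColab
# the '/watch' branch extracts the v= parameter
print(extract_video_id('https://www.youtube.com/watch?v=cNSq5RdVf28'))  # -> cNSq5RdVf28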
GoogleColab
# remove any previous extract, then download the audio track as WAV
!rm -rf e*.wav
!pip install -q youtube-dl
!youtube-dl --extract-audio --audio-format wav --output "extract.%(ext)s" {urltext}
GoogleColab
!apt install -y ffmpeg

# convert to 16 kHz mono 16-bit PCM, the format VOSK expects
!ffmpeg -i extract.wav -vn -acodec pcm_s16le -ac 1 -ar 16000 -f wav test1.wav
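Before running recognition, it is worth confirming that the converted file really is mono 16-bit PCM at 16 kHz, since the script below rejects anything else. A minimal sketch using the standard wave module:

GoogleColab
import wave

# confirm test1.wav matches what the recognizer expects
with wave.open('test1.wav', 'rb') as wf:
    print('channels :', wf.getnchannels())  # expect 1 (mono)
    print('sampwidth:', wf.getsampwidth())  # expect 2 bytes (16-bit)
    print('framerate:', wf.getframerate())  # expect 16000
    print('comptype :', wf.getcomptype())   # expect 'NONE' (plain PCM)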

ASR test_simple.py ...

Speech to text

GoogleColab
#!/usr/bin/env python3

from vosk import Model, KaldiRecognizer, SetLogLevel
import sys
import os
import wave

path = '/content/vosk-api/python/example/'

SetLogLevel(0)

if not os.path.exists("model"):
    print("Please download a model from https://alphacephei.com/vosk/models and unpack it as 'model' in the current folder.")
    sys.exit(1)

#wf = wave.open(path + 'test.wav', "rb")  # a: English test sample
wf = wave.open(path + 'test1.wav', "rb")  # b: Chinese test sample
if wf.getnchannels() != 1 or wf.getsampwidth() != 2 or wf.getcomptype() != "NONE":
    print("Audio file must be WAV format mono PCM.")
    sys.exit(1)

model = Model("model")
rec = KaldiRecognizer(model, wf.getframerate())

# feed the audio in 4000-frame chunks; print a full result at each
# utterance boundary and a partial hypothesis otherwise
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        print(rec.Result())
    else:
        print(rec.PartialResult())

print(rec.FinalResult())
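Result(), PartialResult() and FinalResult() all return JSON strings. To collect just the recognized text, the file can be decoded again and the JSON parsed; a minimal sketch assuming wf and model from the cell above are still in scope:

GoogleColab
import json

# decode again, collecting only the 'text' field from each result
wf.rewind()
rec = KaldiRecognizer(model, wf.getframerate())
texts = []
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        texts.append(json.loads(rec.Result()).get('text', ''))
texts.append(json.loads(rec.FinalResult()).get('text', ''))
print(' '.join(t for t in texts if t))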
GoogleColab
from IPython.display import Audio

#Audio(path + 'test.wav')  # a: English
Audio(path + 'test1.wav')  # b: Chinese

Original 'test_simple.py'

GoogleColab
%%bash
cat -n /content/vosk-api/python/example/test_simple.py

test_ffmpeg.py

GoogleColab
%%bash
cat -n /content/vosk-api/python/example/test_ffmpeg.py

check:
YouTube, Deepspeech, with Google Colaboratory [testing_0003]

YoavRamon/awesome-kaldi
https://github.com/YoavRamon/awesome-kaldi


  1. test [googlecolab] removed 
