
VOSK test_simple.py on GoogleColaboratory [003]

Last updated at 2021-03-08. Posted at 2021-03-07.

GoogleColab

If you are wondering what Vosk is and what this is all about, have a look at the page linked here first to get the gist.
Here

This article saves the Vosk recognition result as JSON.

FinalResult()

returns the final recognition result for the audio. In the Python binding this appears to be a JSON string rather than a parsed object.

So, at the end of test_simple.py, the result can be printed to a file, for example like this.

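# Temporarily point stdout at the file so that print() writes into it.
original_stdout = sys.stdout
with open('vosk.json','w') as f:
    sys.stdout = f
    print(rec.FinalResult())
    # Restore the real stdout before leaving the block.
    sys.stdout = original_stdout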

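That gathers everything into a JSON file for now.
You could work without writing a file, but with very long audio, having to run recognition again takes a long time. So the work is split into steps: write the result to a file once, treat the audio processing as finished at that point, and then inspect the result data from the JSON. That is where we are now, so speech recognition is already done. From here on the VOSK program itself is not touched. The recognition result is in a JSON file, and the question is what to do with it next.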
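As a minimal sketch of the same save step without redirecting sys.stdout, the JSON string can be validated and written directly. The helper names save_result and load_result and the sample string are assumptions for illustration; in the real script the string would come from rec.FinalResult().

```python
import json
import os
import tempfile


def save_result(json_text, path):
    """Validate a JSON string and write it to path as UTF-8."""
    data = json.loads(json_text)  # raises ValueError if the text is not valid JSON
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False)
    return data


def load_result(path):
    """Read the saved recognition result back as Python objects."""
    with open(path, 'r', encoding='utf-8') as f:
        return json.load(f)


# Hypothetical stand-in for rec.FinalResult(); the real string comes from Vosk.
sample = '{"result": [{"conf": 1.0, "end": 0.54, "start": 0.39, "word": "you"}], "text": "you"}'

# A temporary directory keeps this demo from overwriting a real vosk.json.
out_path = os.path.join(tempfile.mkdtemp(), 'vosk.json')
save_result(sample, out_path)
loaded = load_result(out_path)
print(loaded['text'])
```

Validating before writing means a truncated or empty result fails at save time instead of later, when the file is read back.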

OK.
vosk.json is ready.

Let's go through this data and see how it is structured.

Before that, as a reminder, the original vosk test_simple.py looked like this.

test_simple.py

#!/usr/bin/env python3

from vosk import Model, KaldiRecognizer, SetLogLevel
import sys
import os
import wave

SetLogLevel(0)

if not os.path.exists("model"):
    print ("Please download the model from https://alphacephei.com/vosk/models and unpack as 'model' in the current folder.")
    exit (1)

wf = wave.open(sys.argv[1], "rb")
if wf.getnchannels() != 1 or wf.getsampwidth() != 2 or wf.getcomptype() != "NONE":
    print ("Audio file must be WAV format mono PCM.")
    exit (1)

model = Model("model")
rec = KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        print(rec.Result())
    else:
        print(rec.PartialResult())

print(rec.FinalResult())

Starting from this original, I first made the following simple changes.

#!/usr/bin/env python3

from vosk import Model, KaldiRecognizer, SetLogLevel
import sys
import os
import wave
import json

path = '/content/vosk-api/python/example/'

SetLogLevel(0)

if not os.path.exists("model"):
    print ("Please download the model from https://alphacephei.com/vosk/models and unpack as 'model' in the current folder.")
    exit (1)

# wf = wave.open(path+'/test.wav',"rb")#English test sample
wf = wave.open(path+'/test1.wav',"rb")#Chinese lang test sample
sound = path+'/test1.wav'
if wf.getnchannels() != 1 or wf.getsampwidth() != 2 or wf.getcomptype() != "NONE":
    print ("Audio file must be WAV format mono PCM.")
    exit (1)

model = Model("model")
rec = KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        # To inspect each finalized utterance, uncomment the next two lines.
        # res = json.loads(rec.Result())
        # print(res['text'])
        continue
    # else:
    #     print(rec.PartialResult())

original_stdout = sys.stdout
with open('vosk.json','w') as f:
    sys.stdout = f
    print(rec.FinalResult())
    sys.stdout = original_stdout
## res = json.loads(rec.FinalResult())
## print(res['text'])

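I have not dug into how it works yet, but unlike DeepSpeech 0.9.3, Vosk does not end up with an unusually long final audio segment, and the Result() calls in the while loop above do not include the final sentence.
I will think about that another time. If you want to check, uncomment the lines inside the while loop and look at the very end of the printout. This is specific to test_simple.py and there are other ways to handle it, but I will leave it alone here and treat the data from rec.FinalResult() as JSON.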

That is the preparation so far.

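To pull the target data out of the JSON, I first tried the standard library only, writing what was probably my first recursive function and going through a lot of trial and error. My conclusion: if the function may be called recursively more than 1000 times (it usually will be), Python considers that a bit of a stretch. The recursion limit can be raised with sys.setrecursionlimit(10000), but even that is sometimes not enough, so another approach may be better. Rather than optimizing the recursive function (which is a puzzle game), I looked for a different method.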

Rf.
https://stackoverflow.com/questions/3323001/what-is-the-maximum-recursion-depth-in-python-and-how-to-increase-it
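One way around the recursion limit, sketched here under assumptions, is to walk the nested JSON with an explicit stack instead of recursive calls. The function name iter_values and the sample data are made up for illustration, and values are not yielded in document order.

```python
import json


def iter_values(data, key):
    """Yield every value stored under `key`, at any nesting depth, without recursion."""
    stack = [data]
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            for k, v in node.items():
                if k == key:
                    yield v
                stack.append(v)  # keep descending into the value
        elif isinstance(node, list):
            stack.extend(node)


# Hypothetical sample in the same shape as a Vosk result.
sample = json.loads(
    '{"result": [{"word": "you", "start": 0.39}, {"word": "may", "start": 0.54}],'
    ' "text": "you may"}'
)

# The stack is last-in first-out, so sort if a stable order is needed.
words = sorted(iter_values(sample, 'word'))
print(words)
```

Because the pending nodes live in an ordinary list, the nesting depth is limited by memory rather than by sys.setrecursionlimit().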

Let's try pandas to read the JSON file.


import json
import pandas as pd

with open('vosk.json','r') as f:
    jso = pd.read_json(f)
    print(jso.result)

With pandas read_json(), the data looks like this.

0      {'conf': 1.0, 'end': 0.5399999999999999, 'star...
1      {'conf': 1.0, 'end': 0.75, 'start': 0.53999999...
2      {'conf': 0.698261, 'end': 1.11, 'start': 0.75,...
3      {'conf': 1.0, 'end': 1.29, 'start': 1.14, 'wor...
4      {'conf': 1.0, 'end': 1.68, 'start': 1.29, 'wor...
                             ...                        
241    {'conf': 1.0, 'end': 120.39, 'start': 119.43, ...
242    {'conf': 1.0, 'end': 121.08, 'start': 120.81, ...
243    {'conf': 1.0, 'end': 121.53, 'start': 121.08, ...
244    {'conf': 1.0, 'end': 126.476719, 'start': 126....
245    {'conf': 0.299539, 'end': 126.75, 'start': 126...
Name: result, Length: 246, dtype: object

Show the first 5 rows.

print(jso.head())

                                                result                                               text
0    {'conf': 1.0, 'end': 0.5399999999999999, 'star...  you may write me down in history with your vis...
1    {'conf': 1.0, 'end': 0.75, 'start': 0.53999999...  you may write me down in history with your vis...
2    {'conf': 0.698261, 'end': 1.11, 'start': 0.75,...  you may write me down in history with your vis...
3    {'conf': 1.0, 'end': 1.29, 'start': 1.14, 'wor...  you may write me down in history with your vis...
4    {'conf': 1.0, 'end': 1.68, 'start': 1.29, 'wor...  you may write me down in history with your vis...

I would like to see a bit more, but this gives the general picture.
Next, look at a single entry of result in full.


print(jso.result[0])

{'conf': 1.0, 'end': 0.5399999999999999, 'start': 0.38999999999999996, 'word': 'you'}

The reason for showing only small pieces is that pasting long JSON text into Qiita crashes the browser (see, just like this). So I will not paste all of it.

What this shows is that, for each word, Vosk stores the start time and end time in seconds (not milliseconds) together with the recognized word under the key word, as a dictionary in the JSON.
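To make that structure concrete, here is a small sketch that reads entries of this shape with the standard json module and derives each word's duration. The sample is a shortened, made-up stand-in (the second and third words are assumptions, since the output above truncates them), and word_rows is a hypothetical helper.

```python
import json

# Hypothetical, shortened sample in the same shape as vosk.json.
sample = json.loads('''{
  "result": [
    {"conf": 1.0, "end": 0.54, "start": 0.39, "word": "you"},
    {"conf": 1.0, "end": 0.75, "start": 0.54, "word": "may"},
    {"conf": 0.698261, "end": 1.11, "start": 0.75, "word": "write"}
  ],
  "text": "you may write"
}''')


def word_rows(result):
    """Return (word, start, end, duration) tuples, all times in seconds."""
    return [
        (w['word'], w['start'], w['end'], w['end'] - w['start'])
        for w in result['result']
    ]


rows = word_rows(sample)
for word, start, end, duration in rows:
    print(f'{word:>6} {start:7.3f} {end:7.3f} {duration:6.3f}')
```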

Based on this, let's write code that converts the JSON into a SubRip subtitle file (.srt), a format that other programs can also handle.
The SubRip format looks like this.

1
00:00:00,390 --> 00:00:02,430
you may write me down in history

2
00:00:02,970 --> 00:00:05,490
you may write me down in history
with your visit twisted line you

3
00:00:05,490 --> 00:00:07,530
with your visit twisted line you
may charge me in the very dirt
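The timestamp line in the sample above has the form HH:MM:SS,mmm. As a minimal sketch, it can be built from a time in seconds with integer arithmetic. The names srt_timestamp and srt_cue are assumptions for illustration, and this version rounds to the nearest millisecond, which may differ by one millisecond from a version that truncates.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))  # round to the nearest millisecond
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f'{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}'


def srt_cue(index, start, end, text):
    """Build one SRT cue: index line, timing line, then the caption text."""
    return f'{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n'


print(srt_cue(1, 0.39, 2.43, 'you may write me down in history'))
```

Working in whole milliseconds avoids going through datetime, so times of 24 hours or more do not overflow.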

From JSON data to SubRip format

How about this?

import json
import pandas as pd
import datetime
import copy

def fmttime(sec):
    secs = sec #millisecs / 1000.0
    d = datetime.timedelta(seconds=secs)
    t = (datetime.datetime.min + d).time()
    milli = t.strftime('%f')[:3]
    value = t.strftime('%H:%M:%S,') + milli
    return value

with open('vosk.json','r') as f:
    jso = pd.read_json(f)
    #print(jso)

caption=[]
prev = []
laststart =[]
lastend =[]
laststart.append(0)
lastend.append(0)

lineNum=1
for i,v in enumerate(jso.result):
    
    caption.append(v['word'])

    if len(caption) == 1:
        start = v['start']

    if i > 0 and len(caption) > 2 and v['start'] - lastend[0] > 2:
        #print(fmttime(lastend[0]),fmttime(v['start']))
        print(lineNum)
        print(fmttime(start),'-->',fmttime(lastend[0]))
        if len(prev) > 0:
             print(*prev)
        temp = caption.pop()
        print(*caption)
        print()
        prev = copy.deepcopy(caption)
        caption.clear()
        start = v['start']
        lineNum +=1
        caption.append(temp)

    if v['end'] - start > 3:
        print(lineNum)
        print(fmttime(start),'-->',fmttime(v['end']))
        if len(prev) > 0:
             print(*prev)
        print(*caption)
        print()
        prev = copy.deepcopy(caption)
        caption.clear()
        start = v['end']
        lineNum += 1
    if i +1 == len(jso.result):
        print(lineNum)
        print(fmttime(start),'-->',fmttime(v['end']))
        if len(prev) > 0:
             print(*prev)
        print(*caption)
        print()

    laststart.pop()
    lastend.pop()
    laststart.append(v['start'])
    lastend.append(v['end'])

The result of converting the JSON to SubRip (.srt) looks like this.

1
00:00:00,390 --> 00:00:03,660
you may write me down in history with your visit

2
00:00:03,660 --> 00:00:06,900
you may write me down in history with your visit
twisted line you may charge me in the very

3
00:00:06,900 --> 00:00:11,202
twisted line you may charge me in the very
dirt but still like dust our

4
00:00:11,202 --> 00:00:14,302
dirt but still like dust our
and does my says in this upsets

5
00:00:14,302 --> 00:00:17,430
and does my says in this upsets
you why are you visit with gloom just

6
00:00:17,430 --> 00:00:20,880
you why are you visit with gloom just
go back while kids if i have oil wells

7
00:00:20,880 --> 00:00:24,450
go back while kids if i have oil wells
pump in in my living room that

8
00:00:24,570 --> 00:00:28,350
pump in in my living room that
like moons and lifespans fans with the certainty

9
00:00:28,440 --> 00:00:31,770
like moons and lifespans fans with the certainty
of tides just like hopes bringing hi

10
00:00:32,640 --> 00:00:36,090
of tides just like hopes bringing hi
still a why did you want to see me broken

11
00:00:36,540 --> 00:00:39,990
still a why did you want to see me broken
bow to head and lowered eyes soldiers

12
00:00:40,020 --> 00:00:43,140
bow to head and lowered eyes soldiers
falling down like tier drops weakened

13
00:00:43,170 --> 00:00:46,590
falling down like tier drops weakened
my my soul food drive does my assassin

14
00:00:46,590 --> 00:00:47,340
my my soul food drive does my assassin
a subset to

15
00:00:50,130 --> 00:00:53,310
a subset to
take it so i go that laugh as

16
00:00:53,310 --> 00:00:56,340
take it so i go that laugh as
if i have gold mines digging in my own

17
00:00:56,340 --> 00:00:59,970
if i have gold mines digging in my own
back yard you can shoot me with your words

18
00:01:00,180 --> 00:01:03,240
back yard you can shoot me with your words
you can currently with your allies you can kill

19
00:01:03,240 --> 00:01:07,740
you can currently with your allies you can kill
me with your hateful miss but just like life iran

20
00:01:08,460 --> 00:01:10,650
me with your hateful miss but just like life iran
does my sex sexiness as the thin do

21
00:01:14,250 --> 00:01:17,790
does my sex sexiness as the thin do
does it come as a surprise that add dance

22
00:01:21,120 --> 00:01:24,240
does it come as a surprise that add dance
as if i have diamonds at the meeting of my

23
00:01:24,240 --> 00:01:27,600
as if i have diamonds at the meeting of my
time out

24
00:01:27,600 --> 00:01:30,750
time out
of the huts of history's same i rise

25
00:01:31,440 --> 00:01:34,710
of the huts of history's same i rise
up from a pastor rooted in pain iran

26
00:01:35,310 --> 00:01:39,000
up from a pastor rooted in pain iran
a black ocean leaping and why i'd welling

27
00:01:39,000 --> 00:01:42,870
a black ocean leaping and why i'd welling
and swelling and bearing in the town leaving

28
00:01:42,870 --> 00:01:46,380
and swelling and bearing in the town leaving
behind nights of terror and via i

29
00:01:46,380 --> 00:01:50,070
behind nights of terror and via i
ran into a daybreak miraculously

30
00:01:50,070 --> 00:01:53,490
ran into a daybreak miraculously
clear iran bringing

31
00:01:53,490 --> 00:01:56,610
clear iran bringing
the gifts that my ancestors gave i

32
00:01:56,610 --> 00:02:00,390
the gifts that my ancestors gave i
am the whole and the dream of the sneeze

33
00:02:00,810 --> 00:02:01,530
am the whole and the dream of the sneeze
and so

34
00:02:06,090 --> 00:02:06,750
and so
that go

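Since this is a famous poem, the recognition result is easy to compare with the words actually spoken.
YouTube currently allows subtitle files to be uploaded and shown over the video, and a SubRip file can be uploaded as is. So this text, saved as subtitle.srt, should work as subtitles for a video. It might fail, though, because some cues start at exactly the time the previous cue ends. Adding an offset of less than 1 second to the start time, so that it no longer coincides with the end of the previous caption, might avoid the problem.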
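The offset idea can be sketched as follows. Whether identical end and start times are actually rejected is not confirmed here; the function name separate_cues, the (start, end, text) tuple layout, and the 0.05 second gap are all assumptions for illustration.

```python
def separate_cues(cues, gap=0.05):
    """Shift a cue's start so it begins at least `gap` seconds after the previous end.

    `cues` is a list of (start, end, text) tuples with times in seconds.
    """
    fixed = []
    prev_end = None
    for start, end, text in cues:
        if prev_end is not None and start < prev_end + gap:
            # Never push the start past the cue's own end time.
            start = min(prev_end + gap, end)
        fixed.append((start, end, text))
        prev_end = end
    return fixed


cues = [
    (0.39, 3.66, 'you may write me down'),
    (3.66, 6.90, 'twisted line'),   # starts exactly when the previous cue ends
    (7.50, 9.00, 'dirt but still'),  # already separated, left unchanged
]
for start, end, text in separate_cues(cues):
    print(f'{start:.2f} --> {end:.2f}  {text}')
```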

Converting to a SubRip file

json_to_srtFile.py

import sys
import json
import pandas as pd
import datetime
import copy

def fmttime(sec):
    secs = sec #millisecs / 1000.0
    d = datetime.timedelta(seconds=secs)
    t = (datetime.datetime.min + d).time()
    milli = t.strftime('%f')[:3]
    value = t.strftime('%H:%M:%S,') + milli
    return value

with open('vosk.json','r') as f:
    jso = pd.read_json(f)
    #print(jso)

caption=[]
prev = []
laststart =[]
lastend =[]
laststart.append(0)
lastend.append(0)

lineNum=1

with open('vosk.srt','w') as df:
    orig_std = sys.stdout
    sys.stdout = df

    for i,v in enumerate(jso.result):

        caption.append(v['word'])

        if len(caption) == 1:
            start = v['start']

        if i > 0 and len(caption) > 2 and v['start'] - lastend[0] > 2:
            #print(fmttime(lastend[0]),fmttime(v['start']))
            print(lineNum)
            print(fmttime(start),'-->',fmttime(lastend[0]))
            if len(prev) > 0:
                 print(*prev)
            temp = caption.pop()
            print(*caption)
            print()
            prev = copy.deepcopy(caption)
            caption.clear()
            start = v['start']
            lineNum +=1
            caption.append(temp)

        if v['end'] - start > 3:
            print(lineNum)
            print(fmttime(start),'-->',fmttime(v['end']))
            if len(prev) > 0:
                 print(*prev)
            print(*caption)
            print()
            prev = copy.deepcopy(caption)
            caption.clear()
            start = v['end']
            lineNum += 1
        if i +1 == len(jso.result):
            print(lineNum)
            print(fmttime(start),'-->',fmttime(v['end']))
            if len(prev) > 0:
                 print(*prev)
            print(*caption)
            print()


        laststart.pop()
        lastend.pop()
        laststart.append(v['start'])
        lastend.append(v['end'])
    
    sys.stdout = orig_std
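For comparison, the grouping rule used above (close a caption once it spans more than 3 seconds, or when a silence of more than 2 seconds precedes the next word) can be sketched as a function that returns cues instead of printing them. This is a simplified variant and not the script above: it does not repeat the previous caption line, and group_words and its parameter names are assumptions.

```python
def _close(words):
    """Turn a non-empty list of word dicts into one (start, end, text) cue."""
    text = ' '.join(w['word'] for w in words)
    return (words[0]['start'], words[-1]['end'], text)


def group_words(words, max_span=3.0, max_gap=2.0):
    """Group word dicts (keys: word, start, end; times in seconds) into cues."""
    cues = []
    current = []
    for w in words:
        # A long silence before this word ends the caption collected so far.
        if current and w['start'] - current[-1]['end'] > max_gap:
            cues.append(_close(current))
            current = []
        current.append(w)
        # Close the caption once it covers more than max_span seconds.
        if w['end'] - current[0]['start'] > max_span:
            cues.append(_close(current))
            current = []
    if current:  # flush whatever remains at the end of the data
        cues.append(_close(current))
    return cues


# Hypothetical word list in the shape of the Vosk result entries.
demo = [
    {'word': 'a', 'start': 0.0, 'end': 1.0},
    {'word': 'b', 'start': 1.0, 'end': 2.0},
    {'word': 'c', 'start': 2.0, 'end': 3.5},
    {'word': 'd', 'start': 6.0, 'end': 6.5},
    {'word': 'e', 'start': 6.5, 'end': 7.0},
    {'word': 'f', 'start': 10.0, 'end': 10.2},
]
for cue in group_words(demo):
    print(cue)
```

Returning a list keeps the grouping separate from the output step, so the same cues can be printed, written to a file, or checked in a test.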

Cf. Still I Rise by MAYA ANGELOU

youtube-captions

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
YouTube captions
- - - - - - - - - - - - - - - - - - -  YouTube  - - - - - - - - - - - - - - - - - - -


1    you may write me down in history with
2    your bitter twisted lies
3    you may tribe me in the very dirt but
4    still like dust a lie does my sassiness
5    upset you
6    why are you beset with gloom just
7    because I walked as if I have oil wells
8    pumping in my living room just like
9    moons and like Suns with the certainty
10    of tides just like hope springing high
11    still I rise did you want to see me
12    broken bowed head and lowered eyes
13    shoulders falling down like teardrops we
14    can buy my soul who cries does my
15    sassiness upset you don't take it too
16    hard just cuz I laugh as if I have gold
17    mines digging in my own backyard you can
18    shoot me with your words you can cut me
19    with your lies you can kill me with your
20    hatefulness but just like life arise
21    just my sexiness offend you oh does it
22    come as a surprise that I dance as if I
23    have diamonds at the meeting of my
24    thighs
25    out of the huts of history's shame I
26    rise up from a past rooted in pain I
27    rise a black ocean leaping and wide
28    Welling and swelling and bearing in the
29    time leaving behind nights of terror and
30    fear I rise into a daybreak miraculously
31    clear I rise bringing the gifts that my
32    ancestors gave I am the hope and the
33    dream of the slave and so there go


************************************************************************************

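After running this kind of program check over and over, I found that starting from the end of the data makes it easier to see what is wrong.
Missing data at the end goes unnoticed forever unless you look for it, so it should be checked first. For that, you need to know what the end of the data should be before testing, although in practice it takes a few tries to find out.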
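A check of that kind could look like the sketch below, which compares the last recognized word with the last caption. The function name check_tail, the (start, end, text) cue layout, and the sample values are assumptions for illustration.

```python
def check_tail(words, cues):
    """Return a list of problems found at the end of the converted data.

    `words` is a list of dicts with keys word, start, end (seconds);
    `cues` is a list of (start, end, text) tuples.
    """
    problems = []
    if not words:
        return problems
    if not cues:
        return ['no cues were produced']
    last_word = words[-1]
    _, last_end, last_text = cues[-1]
    tokens = last_text.split()
    if not tokens or tokens[-1] != last_word['word']:
        problems.append('last cue does not end with the last recognized word')
    if abs(last_end - last_word['end']) > 1e-6:
        problems.append('last cue end time differs from the last word end time')
    return problems


# Hypothetical tail of a recognition result.
tail_words = [
    {'word': 'that', 'start': 126.0, 'end': 126.4},
    {'word': 'go', 'start': 126.5, 'end': 126.75},
]
print(check_tail(tail_words, [(126.0, 126.75, 'that go')]))  # complete tail
print(check_tail(tail_words, [(126.0, 126.4, 'that')]))      # last word dropped
```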

With youtube_transcript_api, subtitles can be extracted from a YouTube video. It needs the video's video_id, which is an 11-character string.
It is the 11-character part that is always attached somewhere in the URL, as in
youtube.com/watch?v=j70AA9arThc

Assign a YouTube URL to the variable urltext and the code below returns the video_id.
All it does is pull out those 11 attached characters.

from urllib.parse import urlparse, parse_qs
urltext = 'https://youtu.be/qviM_GnJbOM'
args = [urltext]
video_id = ''


def extract_video_id(url):
    query = urlparse(url)
    if query.hostname == 'youtu.be': return query.path[1:]
    if query.hostname in {'www.youtube.com', 'youtube.com'}:
        if query.path == '/watch': return parse_qs(query.query)['v'][0]
        if query.path[:7] == '/embed/': return query.path.split('/')[2]
        if query.path[:3] == '/v/': return query.path.split('/')[2]
    # fail?
    return None

for url in args:
    video_id = (extract_video_id(url))
    print('youtube video_id:',video_id)
    
from IPython.display import YouTubeVideo

YouTubeVideo(video_id)

!pip install youtube_transcript_api

from youtube_transcript_api import YouTubeTranscriptApi
import datetime

def fmttime(sec):
    secs = sec #millisecs / 1000.0
    d = datetime.timedelta(seconds=secs)
    t = (datetime.datetime.min + d).time()
    milli = t.strftime('%f')[:3]
    value = t.strftime('%H:%M:%S,') + milli
    return value

transcript = YouTubeTranscriptApi.get_transcript(video_id)

i = 1
for tr in transcript:
    print(i)
    print(fmttime(tr['start']),'-->',fmttime(tr['start']+tr['duration']))
    print(tr['text'])
    print()
    i += 1
