Japanese speech recognition on longer wav files with Python's SpeechRecognition

Posted at 2021-07-24

pip install SpeechRecognition

The Python SpeechRecognition library, which can be installed with the command above, offers two main options for Japanese speech recognition:

Google Speech Recognition
Google Cloud Speech API

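However, the former is awkward to use with longer wav files and the latter is a paid service, so these are my notes from wrestling with whether the former would be enough for trial use.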

import speech_recognition as sr
import time

r = sr.Recognizer()

# https://github.com/Uberi/speech_recognition/blob/master/speech_recognition/__init__.py#L500-L509
r.dynamic_energy_threshold = False  # True by default; set to False because the audio tends to be long
r.energy_threshold = 500  # minimum audio energy to consider for recording (300)
r.phrase_threshold = 0.2  # minimum seconds of speaking audio before we consider the speaking audio a phrase - values below this are ignored (for filtering out clicks and pops) (0.3)
r.pause_threshold = 0.1  # seconds of non-speaking audio before a phrase is considered complete (0.8)
r.non_speaking_duration = 0.1  # seconds of non-speaking audio to keep on both sides of the recording (0.5)

with sr.AudioFile("long.wav") as source:
    while True:
        audio_data = r.listen(source)
        # The request is sent as FLAC, so what matters is the size in bytes after FLAC encoding
        # https://github.com/Uberi/speech_recognition/blob/master/speech_recognition/__init__.py#L866-L877
        flac_data = audio_data.get_flac_data(
            convert_rate=None if audio_data.sample_rate >= 8000 else 8000,  # audio samples must be at least 8 kHz
            convert_width=2  # audio samples must be 16-bit
        )
        if len(flac_data) == 0:  # end of file; checked before sending so no empty request is made
            break
        print("len: {} ← errors become likely as this approaches 10 MB, so watch the size and adjust the parameters".format(len(flac_data)))
        try:
            print(r.recognize_google(audio_data, language='ja-JP'))
        except sr.UnknownValueError:
            print("Oops! Didn't catch that")  # some segments cannot be recognized; ignore them
        except sr.RequestError as e:
            print("Request failed: {}".format(e))  # e.g. Bad request when a segment is too large
        time.sleep(1)  # a small pause, out of concern about sending requests too quickly

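That said, Uberi/speech_recognition already has a feature that detects silence in the audio and splits on it, so all that is needed is to tune the parameters so that no segment is sent at a length that cannot be requested (at around 10 MB the request fails with Bad request). dynamic_energy_threshold is True by default, but tuning the other parameters with it left at True was painful, so I set it to False.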
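
To make the tuning loop a little less manual, the segment sizes can be measured without sending any requests. The following is only a rough sketch: the function names `oversized` and `measure_segment_sizes`, the `margin` parameter, and the exact 10 MB figure are assumptions for illustration (the limit is just the approximate size at which Bad request was observed), and end of file is assumed to show up as an empty `frame_data`.

```python
SIZE_LIMIT = 10 * 1024 * 1024  # assumed limit; roughly where "Bad request" was observed


def oversized(sizes, limit=SIZE_LIMIT, margin=0.8):
    """Return (index, size) pairs for segments larger than margin * limit."""
    return [(i, n) for i, n in enumerate(sizes) if n > limit * margin]


def measure_segment_sizes(path, pause_threshold, energy_threshold=500):
    """Segment a wav file with r.listen() and return the FLAC size of each segment.

    No recognition request is sent; this only shows how a parameter choice
    affects segment sizes.
    """
    import speech_recognition as sr  # imported here so oversized() works without it

    r = sr.Recognizer()
    r.dynamic_energy_threshold = False
    r.energy_threshold = energy_threshold
    r.pause_threshold = pause_threshold
    # listen() requires pause_threshold >= non_speaking_duration
    r.non_speaking_duration = min(0.1, pause_threshold)

    sizes = []
    with sr.AudioFile(path) as source:
        while True:
            audio_data = r.listen(source)
            if len(audio_data.frame_data) == 0:  # assumed end-of-file signal
                break
            sizes.append(len(audio_data.get_flac_data(convert_width=2)))
    return sizes
```

For example, calling `oversized(measure_segment_sizes("long.wav", pause_threshold=0.1))` for a few candidate values of `pause_threshold` would show which settings still produce segments close to the limit.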

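As long as a single request does not approach 10 MB there is no problem, so another option is to split the original wav file into fixed-length pieces beforehand. Handling it with the tuning above, however, seems to slightly reduce the cases where the audio is cut off mid-utterance.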
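
For comparison, the fixed-length alternative could look something like the sketch below, which uses only the standard library `wave` module. The function name `split_wav`, the `prefix` parameter, and the output file naming are hypothetical, and this is not what the script above does. Note that it cuts purely by time, so it can split in the middle of a word.

```python
import wave


def split_wav(path, chunk_seconds, prefix="chunk"):
    """Split a wav file into consecutive pieces of at most chunk_seconds each.

    Returns the list of written file paths (prefix_0000.wav, prefix_0001.wav, ...).
    """
    out_paths = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = int(src.getframerate() * chunk_seconds)
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:  # no more audio to read
                break
            out_path = "{}_{:04d}.wav".format(prefix, index)
            with wave.open(out_path, "wb") as dst:
                dst.setparams(params)  # frame count in the header is corrected on close
                dst.writeframes(frames)
            out_paths.append(out_path)
            index += 1
    return out_paths
```

Something like `split_wav("long.wav", 30)` would write 30-second pieces, each of which could then be opened with `sr.AudioFile` and passed to `r.record(source)`.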

In the end, YouTube's captioning feature seemed to give better recognition accuracy, so I am not really using this at the moment...
