I wrote a Python script that transcribes speech from a microphone using openai / whisper

Posted at 2022-09-25

Intro

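I saw people on Twitter saying that the speech recognition model developed by OpenAI recognizes Japanese quite well.
I had wanted a transcription tool for a while, so I wrote a Python script that transcribes what I say into the microphone.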

Trying it out

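🗣️ < "はい、ということで今日はOpenAIが開発した Whisper について勉強していきたいとおもいます。Whisper は音声認識モデルです。" ("OK, so today I'd like to study Whisper, which was developed by OpenAI. Whisper is a speech recognition model.")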

Output

$ python main.py
* recording
^C* done recording
Detected language: ja
はい、ということで今日はオープンAIが開発したMISPER、マイツイテレンキをしていきたいと思いますMISPERは大さな仕込でです

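"Whisper" came out as "MISPER". Worse, "音声認識モデル" (speech recognition model) came out as "大さな仕込". And what is マイツイテレンキ??
Whether my articulation is poor or Whisper still has a way to go, please check for yourself with your own voice.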
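
One way to tell the two explanations apart is to run the same recording through more than one model size and compare the transcripts. The following is only a minimal sketch and not part of the original script: `compare_models` and its `load_model` parameter are hypothetical names introduced here, `output.wav` is the file the script in this post writes, and passing `language="ja"` and `fp16=False` to `transcribe` is an assumption about what suits a Japanese recording on a CPU.

```python
from typing import Callable, Dict, Iterable, Optional


def compare_models(
    wav_path: str,
    model_names: Iterable[str],
    load_model: Optional[Callable] = None,
) -> Dict[str, str]:
    """Transcribe one recording with several Whisper model sizes.

    `load_model` is injectable so the function can be exercised without
    downloading weights; by default it is `whisper.load_model`.
    """
    if load_model is None:
        import whisper  # imported lazily: only needed for real runs

        load_model = whisper.load_model

    transcripts = {}
    for name in model_names:
        model = load_model(name)
        # language="ja" skips language detection; fp16=False requests
        # float32 decoding, which is what a CPU run needs.
        result = model.transcribe(wav_path, language="ja", fp16=False)
        transcripts[name] = result["text"]
    return transcripts
```

Calling `compare_models("output.wav", ["base", "small"])` returns a dict keyed by model name. If the larger model gets the sentence right, the model size is the more likely culprit than the speaker.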

Implementation

Tested environment

macOS Big Sur
Intel CPU
Python 3.9

Setup

Install the following libraries.

brew install portaudio
- required by pyaudio, so install it first
pip install pyaudio
pip install git+https://github.com/openai/whisper.git
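
After installing, a quick check that everything is visible to Python can save a confusing traceback later. This is a minimal sketch rather than anything from the original setup: `missing_modules` and `ffmpeg_available` are hypothetical helper names, and the `ffmpeg` check is included because Whisper's `load_audio` shells out to the `ffmpeg` command, which is not in the list above.

```python
import importlib.util
import shutil
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names from `names` that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def ffmpeg_available() -> bool:
    """Report whether an `ffmpeg` executable is on PATH."""
    return shutil.which("ffmpeg") is not None


def report() -> None:
    """Print a short summary of what is installed and what is not."""
    missing = missing_modules(["pyaudio", "whisper"])
    if missing:
        print("missing Python modules:", ", ".join(missing))
    else:
        print("pyaudio and whisper are importable")
    print("ffmpeg on PATH:", ffmpeg_available())
```

Running `report()` once in the target environment prints which pieces, if any, still need to be installed.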

Code

import pyaudio
import wave
import whisper

model = whisper.load_model("base")


def main():
    CHUNK = 1024
    FORMAT = pyaudio.paInt16
    CHANNELS = 1
    RATE = 44100
    WAVE_OUTPUT_FILENAME = "output.wav"

    p = pyaudio.PyAudio()

    stream = p.open(format=FORMAT,
                    channels=CHANNELS,
                    rate=RATE,
                    input=True,
                    frames_per_buffer=CHUNK)

    print("* recording")

    frames = []

    while True:
        try:
            # Record
            d = stream.read(CHUNK)
            frames.append(d)

        except KeyboardInterrupt:
            # Ctrl-C ends the recording
            break

    print("* done recording")

    stream.stop_stream()
    stream.close()
    p.terminate()

    # the with-block closes the file, so no explicit close() is needed
    with wave.open(WAVE_OUTPUT_FILENAME, 'wb') as wf:
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(p.get_sample_size(FORMAT))
        wf.setframerate(RATE)
        wf.writeframes(b''.join(frames))


def recognize():
    # load audio and pad/trim it to fit 30 seconds
    audio = whisper.load_audio("output.wav")
    audio = whisper.pad_or_trim(audio)

    # make log-Mel spectrogram and move to the same device as the model
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    # detect the spoken language
    _, probs = model.detect_language(mel)
    print(f"Detected language: {max(probs, key=probs.get)}")

    # decode the audio (fp16=False: half precision is not supported on CPU)
    options = whisper.DecodingOptions(fp16=False)
    result = whisper.decode(model, mel, options)

    # print the recognized text
    print(result.text)


if __name__ == '__main__':
    main()
    recognize()
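
Note that `whisper.pad_or_trim` pads or cuts the audio to 30 seconds, so `recognize()` as written only sees the first 30 seconds of a longer recording. Below is a minimal sketch of one way around that, assuming you are willing to switch to the higher-level `model.transcribe` call, which walks over the whole file in 30-second windows. `wav_duration_seconds` and `transcribe_wav` are hypothetical helper names, not part of the script above.

```python
import wave

WHISPER_WINDOW_SECONDS = 30  # length that whisper.pad_or_trim pads/trims to


def wav_duration_seconds(path: str) -> float:
    """Return the length of a WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def transcribe_wav(path: str, model) -> str:
    """Transcribe a WAV file of any length with a loaded Whisper model."""
    if wav_duration_seconds(path) > WHISPER_WINDOW_SECONDS:
        print("* recording is longer than 30 s; decode() alone would truncate it")
    # transcribe() slides a 30-second window over the whole file.
    result = model.transcribe(path, fp16=False)
    return result["text"]
```

With the globals from the script, `print(transcribe_wav("output.wav", model))` could stand in for the call to `recognize()`.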

Errors encountered at runtime

The script is essentially the PyAudio sample and the whisper sample combined, so I won't walk through it, but here are the links that helped me resolve the errors I ran into.

urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify faile
- https://github.com/openai/whisper/discussions/80
RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'
- https://github.com/openai/whisper/discussions/92
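
The second error comes from decoding in half precision on a CPU, and the script above sidesteps it with `fp16=False`. A minimal sketch of choosing the flag from the device instead, so the same script could use half precision on a GPU, might look like this. `use_fp16` and `decoding_kwargs` are hypothetical helpers, and the rule "fp16 only on CUDA" is an assumption made here, not something taken from the linked discussion.

```python
def use_fp16(device_type: str) -> bool:
    """Return True only for CUDA devices; CPU decoding stays in float32."""
    return device_type == "cuda"


def decoding_kwargs(device_type: str) -> dict:
    """Build keyword arguments intended for whisper.DecodingOptions."""
    return {"fp16": use_fp16(device_type)}


# With the script's globals this would be used as:
#   options = whisper.DecodingOptions(**decoding_kwargs(model.device.type))
```

On the Intel Mac used in this post, `model.device.type` is `"cpu"`, so the result is the same `fp16=False` the script already passes.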

References

The implementation that stops the recording with Ctrl-C was borrowed from here 🙏
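
The Ctrl-C technique boils down to catching `KeyboardInterrupt` around a blocking read loop. Here is a minimal sketch that isolates the pattern from PyAudio. `record_until_interrupt` and its `read_chunk` parameter are hypothetical names; in the script above, `read_chunk` would be `lambda: stream.read(CHUNK)`.

```python
from typing import Callable, List


def record_until_interrupt(read_chunk: Callable[[], bytes]) -> List[bytes]:
    """Collect chunks from `read_chunk` until Ctrl-C is pressed."""
    frames: List[bytes] = []
    try:
        while True:
            frames.append(read_chunk())
    except KeyboardInterrupt:
        # Ctrl-C raises KeyboardInterrupt in the main thread; the chunks
        # gathered so far are kept and returned.
        pass
    return frames
```

Wrapping the whole loop in a single `try` behaves the same as the per-iteration `try` in the script, and taking the reader as a parameter lets the loop be tried out without a microphone.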
