Trying out Furigana Whisper

Posted at 2025-07-14, last updated at 2025-07-14

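Introduction

I tried running the furigana Whisper model introduced in the article "日本語TTS用の学習データの精度を上げる「ふりがなWhisper」を作った話" (roughly, "How we built Furigana Whisper to improve the accuracy of training data for Japanese TTS") on my own MacBook.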

Environment

MacBook Air M1
Python 3.11.2

Trial

Sample audio

Since this is for non-commercial purposes, I used sample_jvs001.wav from the JVS corpus.

Environment setup

uv init .
uv add "numpy==1.*" # only NumPy v1 can be used because of torch's dependencies
uv add torch==2.2 # the latest torch could not be added with uv
uv add transformers

Implementation

transcribe.py

from transformers import pipeline
from pathlib import Path

pipe = pipeline(
    "automatic-speech-recognition",
    model="Parakeet-Inc/furigana_whisper_small_jsut",
    device="cpu"
)

def transcribe_with_prompt(pipe, audio_path: str | Path, prompt: str) -> str:
    prompt_ids = pipe.tokenizer.get_prompt_ids(
        prompt, return_tensors="pt"
    ).to(pipe.device)
    generate_kwargs = {"prompt_ids": prompt_ids}
    result = pipe(str(audio_path), generate_kwargs=generate_kwargs)
    return result["text"]

# Example run
audio_path = "jvs001_VOICEACTRESS100_001.wav"
prompt = "また当時のように五大明王と呼ばれる主要な明王の中央に配されることも多い"
transcription = transcribe_with_prompt(pipe, audio_path, prompt)
print(transcription)

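I changed the following from the original code.

Added device="cpu" when building the pipeline, so that it runs in a non-GPU environment.
Matched audio_path and prompt to the JVS sample. The kanji in the prompt are a rough guess on my part.

Let's run it.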
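
As a side note on the device="cpu" change: instead of hard-coding the device, it could be chosen at runtime. The following is only a minimal sketch. pick_device is a hypothetical helper name that is not in the original code, and I have not checked whether this model actually runs correctly on anything other than "cpu".

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what torch reports.

    Hypothetical helper (not part of the original article's code).
    Falls back to "cpu" when torch is missing or no accelerator is found.
    """
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps exists on recent torch builds; guard for older ones.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(device)
```

The returned string could then be passed as device=device to pipeline(...) in place of the literal "cpu". On this setup, passing "cpu" explicitly is still the only configuration that was actually tried.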

uv run transcribe.py

Device set to use cpu
/Users/jiro/development/test/.venv/lib/python3.11/site-packages/transformers/models/whisper/generation_whisper.py:604: FutureWarning: The input name `inputs` is deprecated. Please make sure to use `input_features` instead.
  warnings.warn(
マタトウジノヨウニゴダイミョウオウトヨバレルシュヨウナミョウオウノチュウオウニハイサレルコトモオオイ

The result looks plausible.
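
The model prints the reading in katakana. If hiragana is preferred for furigana, the output can be converted with the standard library alone, because the katakana and hiragana Unicode blocks are offset by 0x60. This is a small sketch of that idea. katakana_to_hiragana is a name I made up for illustration and is not part of the model or of transformers.

```python
def katakana_to_hiragana(text: str) -> str:
    """Convert katakana (U+30A1..U+30F6) to the matching hiragana.

    Characters outside that range, such as the long-vowel mark "ー",
    are left unchanged.
    """
    converted = []
    for ch in text:
        code = ord(ch)
        if 0x30A1 <= code <= 0x30F6:
            # Hiragana sits exactly 0x60 code points below katakana.
            converted.append(chr(code - 0x60))
        else:
            converted.append(ch)
    return "".join(converted)


# The first part of the transcription shown above.
reading = "マタトウジノヨウニ"
print(katakana_to_hiragana(reading))  # prints "またとうじのように"
```

The same function can be applied to the whole string returned by transcribe_with_prompt.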
