Getting Started with whisper in VSCode

Last updated at 2024-12-29, Posted at 2024-12-29

This article shows how even beginners can start using whisper, the speech-to-text AI provided by OpenAI.

What is whisper?

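As described on OpenAI's page, whisper can transcribe speech in many languages. If you are curious about the underlying principles or why it works, please read one of the detailed articles written on that topic.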

Setup

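First, here is what you need to install in order to run whisper:

Item	Description
Python	whisper is run from Python
whisper	whisper itself
ffmpeg	used by whisper to read audio files, and in this article to convert them
(CUDA Toolkit, cuDNN)	for those who want to run on a GPU

Installing Python and CUDA is covered in detail by articles from more knowledgeable authors, so please refer to those.
Everything below assumes that Python is already installed.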

Installing whisper

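Either download whisper from its GitHub repository (https://github.com/openai/whisper), or run the following commands in Command Prompt or PowerShell:

pip install git+https://github.com/openai/whisper.git
pip install numpy torch torchaudio

Once Python is set up, this is all that is needed to install whisper itself. Reading audio files additionally requires ffmpeg, which is covered next.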

ffmpeg

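whisper reads audio through ffmpeg, so ffmpeg must be installed. In this article, m4a and wav files are also converted to mp3 with ffmpeg first, so that every later step works on mp3 files only.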

Download the Windows build from the following URL.

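Detailed download instructions

Cloning with Git

git clone https://git.ffmpeg.org/ffmpeg.git ffmpeg

Note that this fetches the ffmpeg source code, which you then have to build yourself.

Downloading a prebuilt binary yourself
A minimal feature set is enough here, so open the following URL and download the file labeled "ffmpeg-release-essentials".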

https://www.gyan.dev/ffmpeg/builds/

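When you do this, be sure to copy the path of the directory that contains "ffmpeg.exe".
Also, place the downloaded (and extracted) folder at the top level of your drive, for example directly under C:\.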

With the default names, the directory that contains ffmpeg.exe should be
C:\ffmpeg-7.1-essentials_build\bin
Note: the folder name depends on the version, so check yours.

Next, add the directory that contains the ffmpeg executable to the PATH environment variable.

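Steps
1. Press the Windows key, search for "environment variables", and select "Edit environment variables for your account" (Control Panel)
2. Double-click "Path" under "User variables"
3. Select "New" and enter the path of the directory that contains ffmpeg.exe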

That completes the setup.
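To confirm that the PATH entry was picked up, something like the following can be run from Python. This is only a minimal sketch: the helper names `find_tool` and `tool_version_line` are made up for illustration, and it assumes that a new terminal or VSCode window was opened after PATH was edited.

```python
import shutil
import subprocess


def find_tool(name):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


def tool_version_line(name):
    """Return the first line printed by `<name> -version`, or None if the tool is missing."""
    path = find_tool(name)
    if path is None:
        return None
    completed = subprocess.run([path, "-version"], capture_output=True, text=True)
    lines = completed.stdout.splitlines()
    return lines[0] if lines else ""


# Prints the ffmpeg version banner, or a hint when ffmpeg is not reachable
print(tool_version_line("ffmpeg") or "ffmpeg was not found on PATH")
```

If the hint is printed, the PATH entry is probably missing or the program that runs Python was started before PATH was changed.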

Running it in VSCode

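The following uses Python. If you want to run on a GPU with CUDA, see the GPU section later in this article.
VSCode may also ask you to install a few small extensions along the way. Install them whenever the prompt appears.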

First, create a folder wherever you like and set up the following layout.

transcription/ 
    ┣ original_data/ 
    ┣ mp3＿data/ 
    ┣ data_segments/ 
    ┣ result/

Once it is created, open the whole folder in VSCode.
Here is an example of creating the folders from Command Prompt.

cd .\Desktop\
md transcription
cd .\transcription\
md original_data
md mp3＿data
md data_segments
md result
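The same layout can also be created from Python, which is convenient if you prefer to stay inside the notebook. The sketch below is an alternative to the commands above, not part of the original procedure; `create_layout` and `SUBFOLDERS` are hypothetical names, and the sub-folder names are copied from the layout shown above.

```python
from pathlib import Path

# Sub-folder names copied from the layout shown above
SUBFOLDERS = ["original_data", "mp3＿data", "data_segments", "result"]


def create_layout(base_dir, subfolders=SUBFOLDERS):
    """Create base_dir/transcription and its sub-folders; return the root path."""
    root = Path(base_dir) / "transcription"
    for name in subfolders:
        # exist_ok=True makes the call safe to repeat
        (root / name).mkdir(parents=True, exist_ok=True)
    return root
```

For example, `create_layout(Path.home() / "Desktop")` would build the folders on the desktop, assuming the desktop lives at that location on your machine.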

Next, open transcription in VSCode and create an ipynb file.
Here we will name it "voice_to_word.ipynb".

Contents of the ipynb file

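At the top left of the notebook you should see the "+ Code" and "+ Markdown" buttons.
From here on, "+ Code" is called the Code button and "+ Markdown" the Markdown button.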

In the example code, be sure to replace the user name in the paths with your own.

Step 1 Convert the file format

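In what follows, press the Code button and write the code in the new cell.
First, add code that converts the audio files to mp3 format. It uses pydub, which can be installed with pip install pydub.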

import os
import shutil
from pydub import AudioSegment

def convert_and_copy_files(source_dir, mp3_dir):
    """
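    Convert m4a and wav files in a folder to mp3 and save them in mp3_dir.
    Existing mp3 files are copied to mp3_dir as they are.

    Args:
        source_dir (str): Path of the source folder.
        mp3_dir (str): Path of the destination folder.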
    """
    if not os.path.exists(mp3_dir):
        os.makedirs(mp3_dir)

    for root, _, files in os.walk(source_dir):
        for file in files:
            source_path = os.path.join(root, file)
            filename, ext = os.path.splitext(file)
            ext = ext.lower()

            # If the file is m4a or wav, convert it to mp3
            if ext in ['.m4a', '.wav']:
                try:
                    audio = AudioSegment.from_file(source_path)
                    output_path = os.path.join(mp3_dir, f"{filename}.mp3")
                    audio.export(output_path, format="mp3")
                    print(f"Converted: {source_path} -> {output_path}")
                except Exception as e:
                    print(f"Error converting {source_path}: {e}")

            # If the file is already mp3, copy it as is
            elif ext == '.mp3':
                try:
                    output_path = os.path.join(mp3_dir, file)
                    shutil.copy2(source_path, output_path)
                    print(f"Copied: {source_path} -> {output_path}")
                except Exception as e:
                    print(f"Error copying {source_path}: {e}")

# Example usage
source_directory = r"C:\Users\<username>\Desktop\transcription\original_data"  # folder with the source files
mp3_directory = r"C:\Users\<username>\Desktop\transcription\mp3＿data"  # destination folder

convert_and_copy_files(source_directory, mp3_directory)

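Step 2 Split the audio

Next, split each file into 30-minute chunks (the length can be changed). This step is optional, but if you skip it, point audio_dir in step 3 at the mp3 folder instead of data_segments.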

from pydub import AudioSegment
import os

def split_mp3_in_directory(input_directory, output_directory, interval=30*60*1000):  # interval is in milliseconds (30 minutes)
    # Create the output directory if it does not exist
    if not os.path.exists(output_directory):
        os.makedirs(output_directory)

    # Go through every MP3 file in the input directory
    for filename in os.listdir(input_directory):
        if filename.endswith(".mp3"):
            file_path = os.path.join(input_directory, filename)
            print(f"Processing {file_path}")

            # Load the MP3 file
            audio = AudioSegment.from_mp3(file_path)

            # File name without its extension
            base_name = os.path.splitext(os.path.basename(file_path))[0]

            # Length of the audio in milliseconds
            duration = len(audio)

            # Split and save
            for start_ms in range(0, duration, interval):
                end_ms = min(start_ms + interval, duration)

                # Extract this chunk
                segment = audio[start_ms:end_ms]

                # Build the output file name inside the output directory
                output_file = os.path.join(output_directory, f"{base_name}_part_{start_ms // interval + 1}.mp3")

                # Save the chunk as an MP3 file
                segment.export(output_file, format="mp3")
                print(f"Saved {output_file}")

# Example usage
input_directory_path = r"C:\Users\<username>\Desktop\transcription\mp3＿data"  # directory that holds the input mp3 files
output_directory_path = r"C:\Users\<username>\Desktop\transcription\data_segments"  # output directory


split_mp3_in_directory(input_directory_path, output_directory_path)
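The boundary arithmetic of the splitting loop can be checked without any audio file. The following is a small sketch that reproduces only the range/min logic from the function above; `segment_bounds` is a hypothetical helper introduced here for illustration, not something pydub provides.

```python
def segment_bounds(duration_ms, interval_ms):
    """Return (part_number, start_ms, end_ms) for each chunk, as the split loop computes them."""
    bounds = []
    for start_ms in range(0, duration_ms, interval_ms):
        # The last chunk is cut short at the end of the audio
        end_ms = min(start_ms + interval_ms, duration_ms)
        part_number = start_ms // interval_ms + 1
        bounds.append((part_number, start_ms, end_ms))
    return bounds


# A 70-minute recording split every 30 minutes gives two full parts and one 10-minute part
for part, start, end in segment_bounds(70 * 60 * 1000, 30 * 60 * 1000):
    print(part, start, end)
```

This makes it easy to see how many parts a recording of a given length will produce before running the real conversion.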

Step 3 Run whisper

Finally, here is the code that runs whisper.

import os
import re
import json
import whisper

# Add the ffmpeg directory to PATH for this process
os.environ["PATH"] += os.pathsep + r"C:\ffmpeg-7.1-essentials_build\bin"

# Load the Whisper model
model = whisper.load_model("medium")

# Directory that holds the mp3 files, output directory, and dictionary file
audio_dir = r"C:\Users\<username>\Desktop\transcription\data_segments"
txt_dir = r"C:\Users\<username>\Desktop\transcription\result"
dictionary_file = r"C:\Users\<username>\Desktop\transcription\dic.json"  # path of the dictionary file

# Load the dictionary file as JSON
with open(dictionary_file, 'r', encoding='utf-8') as f:
    correction_dict = json.load(f)

# Go through every file in the directory
for file_name in os.listdir(audio_dir):
    if file_name.endswith(".mp3"):  # only .mp3 files
        try:
            audio_file = os.path.join(audio_dir, file_name)
            print(f"Processing {file_name}...")

            # Run the transcription
            # fp16=True  ... 16-bit floats, faster on a GPU
            # fp16=False ... 32-bit floats; on a CPU whisper falls back to this with a warning
            # For Japanese ("ja"), use the medium or large model
            result = model.transcribe(audio_file, fp16=True, language="ja")

            # Normalize full-width digits to half-width digits
            transcribed_text = re.sub(r"(\d+)", lambda x: str(int(x.group())), result["text"])

            # Dictionary-based correction
            for key, value in correction_dict.items():
                transcribed_text = transcribed_text.replace(key, value)

            # Save the result as a text file
            text_file_name = os.path.splitext(file_name)[0] + ".txt"  # change the extension to .txt
            text_file_path = os.path.join(txt_dir, text_file_name)
            with open(text_file_path, "w", encoding="utf-8") as f:
                f.write(transcribed_text)

            print(f"Finished transcribing {file_name} to {text_file_name}")
        except FileNotFoundError:
            print(f"File not found: {audio_file}")
        except Exception as e:
            print(f"Error processing {file_name}: {e}")

print("All transcriptions are complete.")
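The digit normalization used in the code above relies on two Python behaviors: `\d` matches full-width digits in a str pattern, and `int()` accepts them. The sketch below isolates that step so the effect is visible; `normalize_digits` is a hypothetical name used only here. Note the side effect that leading zeros are dropped.

```python
import re


def normalize_digits(text):
    """Rewrite every run of digits (full-width or half-width) as a plain half-width integer."""
    return re.sub(r"\d+", lambda m: str(int(m.group())), text)


print(normalize_digits("２０２４年"))  # full-width digits become 2024
print(normalize_digits("007"))        # leading zeros are lost: 7
```

If leading zeros matter in your transcripts (phone numbers, for example), this step may need to be adjusted or skipped.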

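Step 4 Dictionary-based correction

As you may have noticed from the code so far, the transcript is post-processed with a correction dictionary.
To use it, create "dic.json" at the top level of the transcription folder. The step 3 code reads this file when it starts, so create it before running that cell.
Then fill in dic.json, for example as follows.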

{
    "AI": "人工知能",
    "GPT": "生成型言語モデル",
    "ML": "機械学習",
    "NLP": "自然言語処理",
    "100ヨーセパンダ": "客寄せパンダ",
    "チーカー": "ちいかわ",
    "薄身": "炭火",
    "ウナラでは": "ならでは",
    "で分かりますか": "でとかありますか",
    "カルノサイクル": "カルノ―サイクル"
}
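Because the corrections are plain string replacements applied one after another, their order can matter when one key is contained in another. The sketch below is a variation on the loop in step 3, not the article's exact code: it applies longer keys first. `apply_corrections` and the sample entries are hypothetical.

```python
def apply_corrections(text, corrections):
    """Apply replacements, longest key first, so a short key cannot break up a longer one."""
    for key in sorted(corrections, key=len, reverse=True):
        text = text.replace(key, corrections[key])
    return text


# Hypothetical sample entries where one key ("ML") is part of another ("AutoML")
sample = {"ML": "machine learning", "AutoML": "automated machine learning"}

print(apply_corrections("AutoML and ML", sample))

# For comparison: plain dictionary order replaces "ML" inside "AutoML" first
naive = "AutoML and ML"
for key, value in sample.items():
    naive = naive.replace(key, value)
print(naive)
```

With the entries shown in dic.json above this makes no difference, but it can help as the dictionary grows.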

Customization

Changing the split length
As the comment in the code says, the interval is given in milliseconds.
So, to split every n minutes, set interval to

n\times 60\times 1000 \hspace{6pt} \text{ms}
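As a worked example of the conversion, a tiny helper can do the arithmetic. `minutes_to_ms` is a hypothetical name; it returns an int because the interval is used as the step of `range()` in the splitting code.

```python
def minutes_to_ms(minutes):
    """Convert minutes to milliseconds (n * 60 * 1000), returned as an int."""
    return int(minutes * 60 * 1000)


print(minutes_to_ms(30))  # the 30-minute default used in step 2
print(minutes_to_ms(10))  # a 10-minute split
```

The result could then be passed as the interval argument, for example `split_mp3_in_directory(input_directory_path, output_directory_path, interval=minutes_to_ms(10))`.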

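Changing whisper's accuracy
Change the model name passed to whisper.load_model to one of the sizes below.

Size	Parameters	Accuracy	Speed
tiny	39 M	low	very fast
base	74 M	medium-low	fast
small	244 M	medium-high	moderate
medium	769 M	high	somewhat slow
large	1550 M	highest	slow
turbo	809 M	medium to medium-high	very fast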

Specifying the language

In the code shown earlier,

result = model.transcribe(audio_file, fp16=True, language="ja")

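the language was set to Japanese. whisper detects the language automatically if you do not set it, but when you know the language in advance you can pass one of the codes below as language. Most codes are two letters; a few, such as haw and yue, are three. The entries at the end of the list (from burmese onward) are alternative names for codes that already appear earlier.

Languages and their codes

Code	Language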
en	english
zh	chinese
de	german
es	spanish
ru	russian
ko	korean
fr	french
ja	japanese
pt	portuguese
tr	turkish
pl	polish
ca	catalan
nl	dutch
ar	arabic
sv	swedish
it	italian
id	indonesian
hi	hindi
fi	finnish
vi	vietnamese
he	hebrew
uk	ukrainian
el	greek
ms	malay
cs	czech
ro	romanian
da	danish
hu	hungarian
ta	tamil
no	norwegian
th	thai
ur	urdu
hr	croatian
bg	bulgarian
lt	lithuanian
la	latin
mi	maori
ml	malayalam
cy	welsh
sk	slovak
te	telugu
fa	persian
lv	latvian
bn	bengali
sr	serbian
az	azerbaijani
sl	slovenian
kn	kannada
et	estonian
mk	macedonian
br	breton
eu	basque
is	icelandic
hy	armenian
ne	nepali
mn	mongolian
bs	bosnian
kk	kazakh
sq	albanian
sw	swahili
gl	galician
mr	marathi
pa	punjabi
si	sinhala
km	khmer
sn	shona
yo	yoruba
so	somali
af	afrikaans
oc	occitan
ka	georgian
be	belarusian
tg	tajik
sd	sindhi
gu	gujarati
am	amharic
yi	yiddish
lo	lao
uz	uzbek
fo	faroese
ht	haitian creole
ps	pashto
tk	turkmen
nn	nynorsk
mt	maltese
sa	sanskrit
lb	luxembourgish
my	myanmar
bo	tibetan
tl	tagalog
mg	malagasy
as	assamese
tt	tatar
haw	hawaiian
ln	lingala
ha	hausa
ba	bashkir
jw	javanese
su	sundanese
yue	cantonese
my	burmese
ca	valencian
nl	flemish
ht	haitian
lb	letzeburgesch
ps	pushto
pa	panjabi
ro	moldavian
ro	moldovan
si	sinhalese
es	castilian
zh	mandarin

Showing progress

This uses tqdm, so install the library first.

pip install tqdm

Next is a modified version of the code that runs whisper.

Modified example

import os
import re
import json
import whisper
from tqdm import tqdm  # progress bar library

# Add the ffmpeg directory to PATH for this process
os.environ["PATH"] += os.pathsep + r"C:\ffmpeg-7.1-essentials_build\bin"

# Load the Whisper model
model = whisper.load_model("medium")

# Directory that holds the mp3 files, output directory, and dictionary file
audio_dir = r"C:\Users\<username>\Desktop\transcription\data_segments"
txt_dir = r"C:\Users\<username>\Desktop\transcription\result"
dictionary_file = r"C:\Users\<username>\Desktop\transcription\dic.json"  # path of the dictionary file

# Load the dictionary file as JSON
with open(dictionary_file, 'r', encoding='utf-8') as f:
    correction_dict = json.load(f)

# Collect the mp3 files
audio_files = [f for f in os.listdir(audio_dir) if f.endswith(".mp3")]

# Process the files in order, with a progress bar
for file_name in tqdm(audio_files, desc="Processing files", unit="file"):
    try:
        audio_file = os.path.join(audio_dir, file_name)
        print(f"\nProcessing {file_name}...")  # show the file currently being processed

        # Run the transcription
        result = model.transcribe(audio_file, fp16=True, language="ja")

        # Normalize full-width digits to half-width digits
        transcribed_text = re.sub(r"(\d+)", lambda x: str(int(x.group())), result["text"])

        # Dictionary-based correction
        for key, value in correction_dict.items():
            transcribed_text = transcribed_text.replace(key, value)

        # Save the result as a text file
        text_file_name = os.path.splitext(file_name)[0] + ".txt"  # change the extension to .txt
        text_file_path = os.path.join(txt_dir, text_file_name)
        with open(text_file_path, "w", encoding="utf-8") as f:
            f.write(transcribed_text)

        print(f"Finished transcribing {file_name} to {text_file_name}")
    except FileNotFoundError:
        print(f"File not found: {audio_file}")
    except Exception as e:
        print(f"Error processing {file_name}: {e}")

print("\nAll transcriptions are complete.")

Common errors

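File paths stop working
The problem is sometimes on the Windows side. Also, when you write a Windows path in Python, prefix the string with "r" (a raw string) so that backslashes are not treated as escape characters.
ffmpeg is not set up
Go through the installation steps again. The PATH registration may also not have taken effect yet; programs that were already open, including VSCode, may need to be restarted to see the change.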
MP3 format
Transcription sometimes fails even when the audio file is already in MP3 format. In that case, try converting the file to mp3 again manually.
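The raw-string point in the first item can be illustrated with a short sketch. The folder name below is made up so that the problem is visible: in a normal string `\t` and `\n` are read as a tab and a newline. A path that begins with `C:\Users` fails even earlier in a normal string, because `\U` starts a Unicode escape and Python reports a syntax error.

```python
from pathlib import PureWindowsPath

# Hypothetical path chosen so that the backslash sequences \t and \n appear in it
raw = r"C:\transcription\new_data"       # raw string: backslashes kept as written
escaped = "C:\\transcription\\new_data"  # same path with doubled backslashes
broken = "C:\transcription\new_data"     # \t and \n silently become tab and newline

print(raw == escaped)  # True
print(raw == broken)   # False

# Forward slashes are another way to avoid the problem on Windows
print(PureWindowsPath("C:/transcription/new_data") == PureWindowsPath(raw))  # True
```

Any of the three working forms (raw string, doubled backslashes, forward slashes) can be used for the paths in this article.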

Running on a GPU

In step 3, where whisper is run, you only need to change the line that loads the model.

model = whisper.load_model("medium", device="cuda")
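If the same notebook has to run on machines with and without a GPU, the device and the fp16 flag can be chosen together. This is a minimal sketch under that assumption; `pick_device` is a hypothetical helper, and the commented lines show how it might be wired to torch and whisper.

```python
def pick_device(cuda_available):
    """Return (device, fp16) for whisper, given whether CUDA is usable."""
    if cuda_available:
        return "cuda", True   # GPU: 16-bit floats are supported
    return "cpu", False       # CPU: use 32-bit floats


print(pick_device(True))
print(pick_device(False))

# Possible use in the notebook (requires torch and whisper to be installed):
#   import torch
#   device, fp16 = pick_device(torch.cuda.is_available())
#   model = whisper.load_model("medium", device=device)
#   result = model.transcribe(audio_file, fp16=fp16, language="ja")
```

Keeping the two settings together avoids passing fp16=True on a CPU, where whisper would otherwise warn and fall back to 32-bit.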

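Closing

Did the setup work for you?
Setting up an environment on Windows is often difficult, but this one needs relatively few pieces, so why not give it a try?
If you want to go further, the whisper repository you downloaded from OpenAI contains various explanations and usage examples, so reading those is a good next step.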
