More than 3 years have passed since last update.

Pythonで音声ファイルを文字起こしする

Posted at 2021-11-07

はじめに

会議議事録作成など、音声ファイルをPythonとGoogle Speech recognitionを使って、文字起こしにチャレンジします。
Google Speech recognitionは無料で使えるのは1～3分程度の音声ファイルの為、音声ファイルを1分単位に分割し、そのファイルを分けて、音声認識させています。

実行環境

os：Windows10
Pythonバージョン：3.9

処理の流れ

以下のプログラムソースを実行すると、インプットファイルとなる音声ファイル（.wav)を選択するポップアップが表示されますので、変換したい音声ファイル(.wav)を選択してください。
処理結果は、実行フォルダにOUTPUTというフォルダが作成され、その中に、１分ごとに分割された.wavファイル（0.wav、1.wav・・・）と変換されたエクセルファイル(out.xlsx)が出力されます。

全体の流れや処理結果をYouTubeで紹介しています。

プログラムソース

音声文字変換3.py

import wave
import struct
from scipy import fromstring,int16
import numpy as np
import os
import math
import speech_recognition as sr
import pandas as pd
import tkinter.filedialog

# filenameに読み込むファイル、timeにカットする間隔
def cut_wav(filename,time):  
    # timeの単位は[sec]

    # ファイルを読み出し
    wavf = filename
    wr = wave.open(wavf, 'r')

    # waveファイルが持つ性質を取得
    ch = wr.getnchannels()
    width = wr.getsampwidth()
    fr = wr.getframerate()
    fn = wr.getnframes()
    total_time = 1.0 * fn / fr 
    integer = math.floor(total_time*100) # 小数点以下切り捨て
    t = int(time*100)  # 秒数[sec]
    frames = int(ch * fr * t /100)
    num_cut = int(integer//t)
    # waveの実データを取得し、数値化
    data = wr.readframes(wr.getnframes())
    wr.close()
    X = np.frombuffer(data, dtype=int16)

    for i in range(num_cut + 1):
        # 出力データを生成
        outf = out_dir + '/' + str(i) + '.wav' 
        # 音声をカットした部分は少し巻き戻す
        if i > 0:
            start_cut = int(i*frames) - int(180000)
        else:
            start_cut = int(i*frames)

        end_cut = int(i*frames + frames)
        # print(start_cut)
        # print(end_cut)
        Y = X[start_cut:end_cut]
        outd = struct.pack("h" * len(Y), *Y)

        # 書き出し
        ww = wave.open(outf, 'w')
        ww.setnchannels(ch)
        ww.setsampwidth(width)
        ww.setframerate(fr)
        ww.writeframes(outd)
        ww.close()

    str_out = ""
    list1 = [wavf,"",""]
    df_x = pd.DataFrame([list1])
    df_x.columns = ['No', '音声ファイル', '変換結果']

    for ii in range(num_cut + 1):
        outf = out_dir + '/' + str(ii) + '.wav' 
        str_out = wav_to_text(outf)
        df_x.loc[ii] = [ii,str(ii) + '.wav',str_out]
    
    # excelへ書き出し
    with pd.ExcelWriter(out_file) as writer:
        df_x.to_excel(writer, sheet_name='結果', index=False)

def wav_to_text(wavfile):
    r = sr.Recognizer()

    with sr.AudioFile(wavfile) as source:
        audio = r.record(source)

    wav_to_text = r.recognize_google(audio, language='ja-JP')

    print(wav_to_text)

    return wav_to_text

# 一応既に同じ名前のディレクトリがないか確認。
out_dir = "output"
file = os.path.exists(out_dir)
# print(file)

if file == False:
    #保存先のディレクトリの作成
    os.mkdir(out_dir)

fTyp = [("","*.wav")]
iDir = os.path.abspath(os.path.dirname(__file__))
f_name = tkinter.filedialog.askopenfilename(filetypes = fTyp,initialdir = iDir)

cut_time = 60
out_file = "output/out.xlsx"
cut_wav(f_name,float(cut_time))

参考にさせて頂いたページ

まとめ

色々な音声ファイルでテストをしましたが、Google Speech recognitionの変換結果に若干、ムラがあるようです。
また、変換精度は、録音音声の品質に大きく影響されますので、ご注意ください。
私は、エクセルマクロの経験は長いですが、Pythonはほぼ素人でした。
それが、Qiitaなどに掲載されているサンプルプログラムを組み合わせて、コピー＆ペーストで、このような実用的なプログラムが作成でき、Pythonのすばらしさに感動しました。
今後も、参考になるプログラムを掲載出来ればと思っています。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up