
Speech recognition with Whisper on foreign-language audio (SILK format), followed by machine translation

Posted at 2023-05-21

Let's run speech recognition on foreign-language audio data in SILK format with Whisper, then machine-translate the result.

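Background

SILK is an audio compression format originally developed by Skype, and it appears to be used for things like voice calls.
WeChat, the social networking app used mainly in China, stores its voice chat messages in this SILK format, so I tried to find a workable way to turn them into Japanese.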

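Steps

Convert the silk files to wave files
Transcribe the wave files with Whisper and translate them into English
Translate the English text into Japanese with EasyNMT + Fugu-MT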

Converting silk files to wave files

First, convert the silk files to wave files.
The conversion uses silk-v3-decoder and FFmpeg.
※ Warning: the Windows binary of silk-v3-decoder contains malware. Compile it from source yourself.
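
As a side note, the second half of this conversion (raw PCM to WAV) can also be done with Python's standard library alone. The following is only a minimal sketch, not what the script in this article does: it assumes the decoder output is headerless 16-bit mono PCM at 24 kHz (the same parameters the script passes to pydub), and `pcm_to_wav` is a name made up for illustration.

```python
import wave
from pathlib import Path


def pcm_to_wav(pcm_path: Path, wav_path: Path, sample_rate: int = 24000) -> None:
    """Wrap headerless 16-bit mono PCM in a WAV container (hypothetical helper)."""
    pcm_bytes = pcm_path.read_bytes()
    with wave.open(str(wav_path), "wb") as wav_file:
        wav_file.setnchannels(1)            # mono (assumption)
        wav_file.setsampwidth(2)            # 2 bytes per sample = 16 bit (assumption)
        wav_file.setframerate(sample_rate)  # 24 kHz by default (assumption)
        wav_file.writeframes(pcm_bytes)
```

If the decoder was run with a different output sampling rate, `sample_rate` has to match it, otherwise the audio plays back at the wrong speed.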

silk2wav.py

import os
import subprocess
import tempfile
from pathlib import Path
from typing import Optional

# pydub needs an ffmpeg executable, so install one separately
from pydub import AudioSegment


class silkconv:
    def __init__(self, osname: str):
        # Supports both Linux and Windows (match the names of the binaries you compiled)
        if osname == "posix":
            self.silkdecoder = "SilkDecoder"
            self.ffmpeg = "ffmpeg"
        elif osname == "nt":
            self.silkdecoder = "SilkDecoder.exe"
            self.ffmpeg = "ffmpeg.exe"
        else:
            raise ValueError(f"unsupported OS name: {osname}")

    def convertSilk(self, infile: Path) -> Optional[str]:
        # WeChat voice chat data (.aud) may carry a junk byte before the SILK header, so strip it
        with open(infile, "rb") as bf:
            bf.seek(1)
            header = bf.read(8)
        if header == b"#!SILK_V":
            # One extra byte precedes "#!SILK_V3": write a cleaned copy and use that
            infile = self.splitaud(infile)
        elif header != b"!SILK_V3":
            # Not a SILK file
            return None

        outfile = Path(str(infile.resolve()) + ".wav")
        if outfile.is_file():
            # Already converted
            return None
        # First decode the silk file to raw PCM with SilkDecoder
        pcmfile = self.silk2pcm(infile)
        if os.path.getsize(pcmfile) == 0:
            # The decoder produced nothing
            os.remove(pcmfile)
            return None
        # Then build the wav file from the PCM data with pydub
        pcm = AudioSegment.from_raw(pcmfile, sample_width=2, frame_rate=24000, channels=1)
        pcm.export(str(outfile), format="wav")
        os.remove(pcmfile)
        return str(outfile) if outfile.is_file() else None

    def silk2pcm(self, infile: Path) -> str:
        # Decode the silk file into a PCM file in the temp folder
        fd, pcmfile = tempfile.mkstemp(suffix=".pcm")
        os.close(fd)
        cmd = [self.silkdecoder, str(infile), pcmfile]
        subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        return pcmfile

    def splitaud(self, f: Path) -> Path:
        # Write a copy of the file without its first byte
        output = f.with_suffix(".aud.silk")
        if output.is_file():
            return output
        with open(f, "rb") as bf:
            bf.seek(1)
            data = bf.read()
        with open(output, "wb") as wf:
            wf.write(data)
        return output


if __name__ == '__main__':
    data_dir = Path("SILK_DATA_PATH")
    sc = silkconv(os.name)
    # The extension varies (by WeChat version?): sometimes one, sometimes both exist
    extensions = ['.silk', '.aud']
    files = [i for i in data_dir.glob('**/*.*') if i.suffix in extensions]
    for file in files:
        wav = sc.convertSilk(file)
        print(wav)
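
The header check in `convertSilk` is easier to follow when written against bytes in memory. Here is a small sketch of the same idea, assuming a SILK stream begins with the magic string `#!SILK_V3` and that WeChat files put exactly one extra byte in front of it; `normalize_silk` and `SILK_MAGIC` are names I made up for illustration.

```python
from typing import Optional

SILK_MAGIC = b"#!SILK_V3"  # assumed magic string at the start of a SILK stream


def normalize_silk(data: bytes) -> Optional[bytes]:
    """Return the data starting at the SILK magic, or None if it is not SILK."""
    if data.startswith(SILK_MAGIC):
        return data       # plain SILK, nothing to strip
    if data[1:1 + len(SILK_MAGIC)] == SILK_MAGIC:
        return data[1:]   # drop the single leading junk byte
    return None           # unknown format
```

Unlike the script above, this reads the whole file into memory, which should be fine for short voice messages.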

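Transcribing the wav files with Whisper

The transcription uses Whisper, provided by OpenAI.
This time I use Faster Whisper, a reimplementation of Whisper built on CTranslate2.
※ From what I tried, this was the fastest option (and it can use the GPU).

Note: using CUDA with Faster Whisper requires the CUDA Toolkit and cuDNN.
Install them separately.
On Windows, it is enough to extract at least the following files and place them in the same directory.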

cudnn_cnn_infer64_8.dll
cublasLt64_12.dll
cublas64_12.dll
cublas64_11.dll
cudnn_ops_infer64_8.dll
cudnn64_8.dll
zlibwapi.dll
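
To confirm that nothing from this list is missing, a quick check like the following may help. It is only a sketch: `missing_dlls` is a hypothetical helper, and the directory to pass in is whichever one you copied the files to.

```python
from pathlib import Path
from typing import List

# File names taken from the list above
REQUIRED_DLLS = [
    "cudnn_cnn_infer64_8.dll",
    "cublasLt64_12.dll",
    "cublas64_12.dll",
    "cublas64_11.dll",
    "cudnn_ops_infer64_8.dll",
    "cudnn64_8.dll",
    "zlibwapi.dll",
]


def missing_dlls(directory: Path) -> List[str]:
    """Return the listed DLL names that are not present in directory."""
    # Compare case-insensitively, since Windows file names are not case-sensitive
    present = {p.name.lower() for p in directory.iterdir() if p.is_file()}
    return [name for name in REQUIRED_DLLS if name.lower() not in present]
```

An empty return value means every listed file was found.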

Installing Whisper

!pip install faster-whisper

whisper.py

from faster_whisper import WhisperModel
from pathlib import Path
import time  # for measuring processing time

if __name__ == '__main__':
    # With CUDA, pick a model that fits in video memory; large needs 10 GB
    model_name = "large-v2"  # other choices: tiny, base, small, medium, large-v1
    # For CUDA use device="cuda"; which compute_type values work depends on the GPU
    model = WhisperModel(model_name, device="cuda", compute_type="int8", cpu_threads=10)

    wav_dir = Path("WAV_DIRECTORY_PATH")
    for file in wav_dir.glob("**/*.wav"):
        # Omitting language enables automatic detection; task="translate" outputs English
        segments, info = model.transcribe(str(file), beam_size=5, language='zh', task="translate")
        # segments is a lazy generator, so the actual decoding runs inside the loop below
        t_start = time.perf_counter()
        result = []
        for seg in segments:
            print(seg.text)
            result.append(seg.text + "\n")
        print(time.perf_counter() - t_start)
        # Save the English text as a txt file next to the wav file
        with open(file.with_suffix(".txt"), "w", encoding="utf-8") as f:
            f.writelines(result)
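
Each segment yielded by Faster Whisper also carries `start` and `end` times in seconds, which are handy for matching the text back to the audio. As a sketch, the lines written to the txt file could include them; `format_timestamp` and `format_segment` are hypothetical helpers, not part of Faster Whisper.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as HH:MM:SS.mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def format_segment(start: float, end: float, text: str) -> str:
    """Build one output line such as '[00:00:00.000 -> 00:00:02.250] Hello'."""
    return f"[{format_timestamp(start)} -> {format_timestamp(end)}] {text.strip()}"
```

Inside the loop, this would be used as `result.append(format_segment(seg.start, seg.end, seg.text) + "\n")`.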

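Translating the English text into Japanese (EasyNMT + Fugu-MT)

I have already written this part up, so see my article 「クローズド環境における、お手軽機械翻訳の構築」 (Building easy machine translation in a closed environment).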
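
To connect that step to the txt files written by whisper.py, a loop like the following might be enough. This is a sketch under assumptions: `translate` is a placeholder for whatever function wraps the EasyNMT + Fugu-MT setup from that article, and both `translate_txt_files` and the `.ja.txt` output name are made up for illustration.

```python
from pathlib import Path
from typing import Callable, List


def translate_txt_files(directory: Path, translate: Callable[[str], str]) -> List[Path]:
    """Translate every .txt file line by line and save it as <name>.ja.txt."""
    written = []
    # sorted() collects the file list first, so new output files are not revisited
    for txt in sorted(directory.glob("**/*.txt")):
        if txt.name.endswith(".ja.txt"):
            continue  # skip output from a previous run
        lines = txt.read_text(encoding="utf-8").splitlines()
        translated = [translate(line) for line in lines if line.strip()]
        out = txt.with_suffix(".ja.txt")
        out.write_text("\n".join(translated) + "\n", encoding="utf-8")
        written.append(out)
    return written
```

Passing the translator in as a plain callable keeps this loop independent of the translation library, so swapping the engine later only changes the argument.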

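Results

For Chinese, Whisper's error rate is 14.7%, considerably worse than the 5.3% for Japanese.
(Is it the dialects, or the tones...? There seems to be plenty of training data, though.)
Still, the output was good enough to work out roughly what the conversation was about.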
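
For anyone who wants to measure this on their own recordings: the figures above do not say which metric they use, but for Chinese and Japanese a character error rate (CER) is a common choice. Below is a minimal sketch assuming CER is defined as the edit distance between reference and recognized text divided by the reference length; both function names are my own.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_error_rate(ref: str, hyp: str) -> float:
    """Edit distance divided by the reference length."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

For example, one wrong character in a six-character reference gives a CER of 1/6, or about 16.7%.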

Future work

EasyNMT also depends on old libraries, so I plan to modify and update those parts as I go.
