[Speech Translation] I Built a Real-Time Translator with whisper × DeepL (Zundamon Version Planned)

Posted at 2025-05-12

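Introduction

I'm studying English and wanted to build a tool that is both practical and a little fun, so I implemented a translator.
Beyond translating, having it answer out loud would be both handy and entertaining, which is how the idea for a "Zundamon translator" came about.
The voice part is not implemented yet; I plan to publish a version that actually makes Zundamon speak in the near future.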

Main Libraries Used

deepl

DeepL is a service that provides high-accuracy machine translation.
The deepl library is the client for calling the DeepL API from Python.
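
As a minimal sketch of how the library is typically called: the helper names `load_api_key`, `translate_en_to_ja`, and `make_translator` are hypothetical and only for illustration, and the key is assumed to live in a `DEEPL_API_KEY` environment variable, as in the program later in this post. `translate_text` returns a result object whose `.text` attribute holds the translation when a single string is passed.

```python
import os


def load_api_key(env_name="DEEPL_API_KEY"):
    """Return the API key from the environment, failing early if it is missing."""
    key = os.getenv(env_name)
    if not key:
        raise RuntimeError(f"Set the {env_name} environment variable first.")
    return key


def translate_en_to_ja(translator, text):
    """Translate one English string to Japanese and return plain text.

    `translator` is expected to be a deepl.Translator (or any object with a
    compatible translate_text method).
    """
    result = translator.translate_text(text, source_lang="EN", target_lang="JA")
    return result.text


def make_translator():
    # Imported lazily so the helpers above can be tried without the package.
    import deepl
    return deepl.Translator(load_api_key())
```

With the package installed and the key set, `translate_en_to_ja(make_translator(), "Good morning")` would send one request to the DeepL API.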

whisper

Whisper is a speech recognition model developed by OpenAI. The whisper library automatically
transcribes audio files to text.
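
The dictionary returned by `transcribe` contains the full `text` plus a list of `segments` with start and end times. The sketch below formats those segments as timestamped lines; `format_segments` and `transcribe_file` are hypothetical helper names, and reading an audio file this way assumes `ffmpeg` is installed on the machine.

```python
def format_segments(result):
    """Turn a Whisper transcription result into timestamped lines."""
    lines = []
    for seg in result.get("segments", []):
        lines.append(
            f"[{seg['start']:6.2f}s - {seg['end']:6.2f}s] {seg['text'].strip()}"
        )
    return lines


def transcribe_file(wav_path, model_name="base"):
    # Imported lazily: loading whisper pulls in PyTorch and takes a while.
    import whisper
    model = whisper.load_model(model_name)
    return model.transcribe(wav_path, language="en")
```

Timestamps like these are handy for checking which part of a recording was misheard.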

sounddevice

A library for recording and playing audio from Python.
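
Device indices differ from machine to machine, so a hard-coded value such as the `self.device = 4` used in the program later in this post will probably need adjusting. A small sketch for finding a usable output device, where `output_devices` and `list_output_devices` are hypothetical helper names built on `sd.query_devices()`:

```python
def output_devices(devices):
    """Return (index, name) pairs for devices that can play audio."""
    return [
        (index, dev["name"])
        for index, dev in enumerate(devices)
        if dev["max_output_channels"] > 0
    ]


def list_output_devices():
    # Imported lazily so output_devices() can be tried without audio hardware.
    import sounddevice as sd
    return output_devices(sd.query_devices())
```

Printing `list_output_devices()` once shows which index to pass as `device=` when playing audio.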

soundfile

A library for reading and writing audio files such as WAV, FLAC, and OGG.

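Overall Flow

┌────────────────────────────────────────────┐
│ Start recording (space key starts/stops)   │
└─────────────────────┬──────────────────────┘
                      ▼
┌────────────────────────────────────────────┐
│ Save as a WAV file                         │
└─────────────────────┬──────────────────────┘
                      ▼
┌────────────────────────────────────────────┐
│ Transcribe with Whisper                    │
└─────────────────────┬──────────────────────┘
                      ▼
┌────────────────────────────────────────────┐
│ Translate to Japanese with DeepL           │
└─────────────────────┬──────────────────────┘
                      ▼
┌────────────────────────────────────────────┐
│ Display the translated text                │
└─────────────────────┬──────────────────────┘
                      ▼
┌────────────────────────────────────────────┐
│ (Next time) Zundamon speaks via VOICEVOX!  │
└────────────────────────────────────────────┘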
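
The flow above can be sketched as a chain of interchangeable stages. This is only an illustration of the structure, not the program itself: `run_pipeline` is a hypothetical name, and each stage is passed in as a plain function so it can be replaced with a stub while experimenting.

```python
def run_pipeline(record, transcribe, translate, show=print):
    """Run one record -> transcribe -> translate -> display cycle."""
    wav_path = record()              # returns the path of the saved WAV file
    english = transcribe(wav_path)   # speech -> English text
    japanese = translate(english)    # English text -> Japanese text
    show(japanese)
    return japanese


# Stub stages standing in for sounddevice, Whisper, and DeepL.
demo = run_pipeline(
    record=lambda: "recording.wav",
    transcribe=lambda path: "Good morning",
    translate=lambda text: "おはようございます",
    show=lambda text: None,
)
```

Keeping the stages separate like this would also make it easy to append a speech step at the end later.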

The Full Program

import os
import time
from datetime import datetime

import deepl
import keyboard
import numpy as np
import sounddevice as sd
import soundfile as sf
import whisper
from dotenv import load_dotenv

# Load DEEPL_API_KEY from a .env file.
load_dotenv()


class TranslateManager:
    def __init__(self):
        self.deepl_api_key = os.getenv('DEEPL_API_KEY')
        self.wav_dir = r'/path/to/your/project/wav_file'
        self.model = whisper.load_model('base')
        self.now = datetime.now()
        self.fs = 44100   # sampling rate in Hz
        self.device = 4   # output device index; differs per machine

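    def fetch_recording_speech(self):
        is_recording = False
        playback = False
        recording = None
        started_at = 0.0
        formatted = self.now.strftime("%Y-%m-%d %H%M%S")

        print("Press SPACE to start recording and SPACE again to stop. Press 'e' to finish recording and move on to playback.")

        while True:
            event = keyboard.read_event()
            if event.event_type != keyboard.KEY_DOWN:
                continue
            if event.name == 'space' and not is_recording:
                print("Recording...")
                is_recording = True
                started_at = time.time()
                # sd.rec() returns immediately and fills an 8-second buffer in the
                # background. Do not call sd.wait() here: it would block until the
                # buffer is full, so the second SPACE could not stop the recording.
                recording = sd.rec(int(8 * self.fs), samplerate=self.fs, channels=1, dtype='float64')

            elif event.name in ('space', 'e') and is_recording:
                sd.stop()
                is_recording = False
                # Keep only the part captured before the stop (approximate, by elapsed time).
                captured = int((time.time() - started_at) * self.fs)
                recording = recording[:min(len(recording), captured)]
                print('Recording stopped. Audio captured.')

            if event.name == 'e':
                print('Recording phase finished -> moving on to playback.')
                time.sleep(2)
                break

        if recording is None or len(recording) == 0:
            raise RuntimeError('No audio was recorded.')

        print("Press 'p' to play the recording, 'p' again to stop, and 'e' to finish.")
        while True:
            event = keyboard.read_event()
            if event.event_type != keyboard.KEY_DOWN:
                continue
            if event.name == 'p' and not playback:
                print('Playing...')
                sd.play(recording, self.fs, device=self.device)
                playback = True

            elif event.name == 'p' and playback:
                playback = False
                sd.stop()
                print('Playback stopped.')

            elif event.name == 'e':
                sd.stop()
                print('Done.')
                break

        # Flatten to a 1-D mono array before saving.
        recording = np.squeeze(recording).astype(np.float32)
        print("Peak amplitude of the recording:", np.max(np.abs(recording)))

        wav_file = fr'{self.wav_dir}/recording_{formatted}.wav'
        sf.write(wav_file, recording, self.fs)

        return wav_file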

    def convert_speech_to_text(self, wav_file):
        # transcribe() accepts a file path; language='en' skips language detection.
        result = self.model.transcribe(wav_file, language='en')
        print(result['text'])
        return result['text']
    
    def translate_to_japanese(self, text):
        translator = deepl.Translator(self.deepl_api_key)
        result = translator.translate_text(text, target_lang="JA")
        texts = []
        if isinstance(result, list):
            for item in result:
                texts.append(item.text)
        else:
            texts.append(result.text)
        print(texts)
        return texts

    
    def speak_translation(self):
        # To be implemented with the Zundamon (VOICEVOX) API.
        return

    def main(self):
        en_speech = self.fetch_recording_speech()
        en_text = self.convert_speech_to_text(en_speech)
        self.translate_to_japanese(en_text)
    
if __name__ == "__main__":
    translate = TranslateManager()
    translate.main()

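Impressions of Using whisper

English transcription was more accurate than I expected: the recorded audio was converted to English text with high precision.
Out of curiosity I also tried transcribing Japanese, but the results were less convincing and some recordings were not transcribed well. The next time I work mainly with whisper, I want to experiment with how to make the transcription more accurate.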
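
As a starting point for that experiment, here are a few settings that could be tried. This is a sketch under assumptions, not a verified fix: `transcribe_japanese` is a hypothetical helper, and whether any of these options actually improves the results on a given recording is untested here. Loading a larger model (for example `whisper.load_model('small')` or `'medium'`) is another variable worth comparing.

```python
def transcribe_japanese(model, wav_path, hint=None):
    """Transcribe Japanese speech with a few options worth experimenting with."""
    options = {
        "language": "ja",   # skip auto-detection and decode as Japanese
        "fp16": False,      # use FP32; avoids the FP16 warning on CPU-only machines
    }
    if hint:
        # initial_prompt passes expected vocabulary or names to the decoder.
        options["initial_prompt"] = hint
    result = model.transcribe(wav_path, **options)
    return result["text"].strip()
```

Here `model` is whatever `whisper.load_model(...)` returned, so the same recording can be run through several model sizes and the outputs compared side by side.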

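Future Plans

Next time, I'll take on the feature that actually speaks the translated text aloud, using Zundamon through the VOICEVOX API!
If real-time translation plus read-aloud works, it could make a fun little study aid.

Go from speech to translation in one continuous pipeline
Evolve into a "learn by listening" tool through speech synthesis (TTS)
Possibly expand to character voices other than Zundamon?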
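
For reference, here is a rough sketch of what the speech step might look like. It is not the planned implementation and rests on several assumptions: a VOICEVOX engine running locally at `http://127.0.0.1:50021`, speaker ID `3` for Zundamon's normal style, and the engine's two-step `audio_query` then `synthesis` HTTP flow. The function names are hypothetical.

```python
import urllib.parse
import urllib.request

VOICEVOX_URL = "http://127.0.0.1:50021"  # assumed: local engine address
ZUNDAMON_ID = 3                           # assumed: Zundamon, normal style


def audio_query_url(text, speaker=ZUNDAMON_ID, base=VOICEVOX_URL):
    """Build the URL for the engine's audio_query endpoint."""
    params = urllib.parse.urlencode({"text": text, "speaker": speaker})
    return f"{base}/audio_query?{params}"


def synthesize(text, speaker=ZUNDAMON_ID, base=VOICEVOX_URL):
    """Return WAV bytes for `text`. Needs a running VOICEVOX engine."""
    # Step 1: ask the engine for synthesis parameters (a JSON document).
    req = urllib.request.Request(
        audio_query_url(text, speaker, base), data=b"", method="POST"
    )
    with urllib.request.urlopen(req) as resp:
        query = resp.read()
    # Step 2: send that JSON back to /synthesis to get the audio.
    req = urllib.request.Request(
        f"{base}/synthesis?speaker={speaker}",
        data=query,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

The returned bytes could be written to a `.wav` file and played back with sounddevice, which would complete the last box in the flow diagram.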
