
Forcing real-time translation into Zoom

Posted at 2020-05-01. Last updated at 2020-05-01.

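I want to show real-time translation on my Zoom webcam video!

An example of English-to-Japanese real-time translation running in a Zoom Meeting.

Zoom makes it easy to cross national borders, but if you cannot communicate in English you miss out on that benefit. So I built a simple setup to deal with it.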

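The rough flow is as follows.

Soundflower routes Zoom's audio output internally so that it can be treated as an audio input, and Python passes that audio to the Microsoft Azure API for real-time speech translation. TouchDesigner takes the webcam input and overlays the translation results, which Python sends over OSC, as subtitles. TouchDesigner then outputs the composited video through Syphon Spout Out, and CamTwist exposes it to Zoom as a virtual webcam. It is a brute-force approach.

Note that a Zoom Pro account is not required at all.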

Environment and software I tested with

・macOS Catalina
・Python 3.7
・Microsoft Azure account
・TouchDesigner
・Soundflower
・CamTwist

Installing Soundflower and CamTwist

Download them from here.

Soundflower
https://github.com/mattingalls/Soundflower/releases/tag/2.0b2
(Read the notes on that page carefully.)

CamTwist
http://camtwiststudio.com/

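Soundflower settings

Once it is installed, Soundflower entries appear for both input and output in the macOS Sound menu. Select the 2ch device for both input and output.
With this, the sound you hear from Zoom can be treated as microphone input. On Windows, VoiceMeeter Banana is quite capable. For macOS, Soundflower is the only tool I have found so far that works properly.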

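Real-time speech translation with Azure

Within Azure, I use the set of APIs called Cognitive Services.
https://azure.microsoft.com/ja-jp/services/cognitive-services/
Sign up from the page above. I am only on the free trial myself, so if you want to use it heavily it will of course cost money.

After signing up, note down the subscription key and the region code.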

Calling Azure real-time translation from the Python environment on the Mac

The sample code is available here.
https://github.com/Azure-Samples/cognitive-services-speech-sdk
Download it.
In every file in the python/console folder, replace
"YourSubscriptionKey", "YourServiceRegion"
with your own values.
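As an alternative to pasting the key into every file, the two values could be read from environment variables. The following is only a minimal sketch: the variable names `SPEECH_KEY` and `SPEECH_REGION` and the helper `load_speech_credentials` are hypothetical names chosen for this example and are not part of the Azure samples. It assumes the sample files keep the two values in module-level variables such as `speech_key` and `service_region`, as the excerpt later in this article does.

```python
import os


def load_speech_credentials(env=os.environ):
    """Return (subscription_key, service_region) read from environment variables.

    SPEECH_KEY and SPEECH_REGION are hypothetical names used only in this sketch.
    """
    try:
        return env["SPEECH_KEY"], env["SPEECH_REGION"]
    except KeyError as missing:
        # Fail early with a readable message instead of sending a placeholder key to Azure.
        raise RuntimeError("environment variable {} is not set".format(missing)) from None


# Possible usage inside a sample file (assumption, not the samples' own code):
# speech_key, service_region = load_speech_credentials()
```

With this approach the key stays out of the downloaded source files, which matters if the folder is ever shared or committed.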

To get the real-time translation results from the Mac's audio input, edit the contents of translation_sample.py.

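OSC settings

# At the top of the file
from pythonosc import udp_client
from pythonosc.osc_message_builder import OscMessageBuilder
IP = '~'  # address of the machine running TouchDesigner (left unspecified here)
PORT = ...  # placeholder: choose any free port and use the same one in TouchDesigner

Set the translation target to Japanese, and add code that sends the result to TouchDesigner over OSC.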


def translation_continuous():
    """performs continuous speech translation on input from the default microphone"""
    # <TranslationContinuous>
    # set up translation parameters: source language and target languages
    translation_config = speechsdk.translation.SpeechTranslationConfig(
        subscription=speech_key, region=service_region,
        speech_recognition_language='en-US',
        target_languages=('ja', 'fr'), voice_name="de-DE-Hedda")
    # the default microphone is Soundflower once it is selected as the system input
    audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)

    # Creates a translation recognizer using the default microphone as input.
    recognizer = speechsdk.translation.TranslationRecognizer(
        translation_config=translation_config, audio_config=audio_config)

    # one UDP client is enough; it is reused for every result
    client = udp_client.UDPClient(IP, PORT)

    def result_callback(event_type, evt):
        """callback to display a translation result and forward it over OSC"""
        print("{}: {}\n\tTranslations: {}\n\tResult Json: {}".format(
            event_type, evt, evt.result.translations['ja'], evt.result.json))
        # send the Japanese text as the only argument of a /translation message
        msg = OscMessageBuilder(address='/translation')
        msg.add_arg(evt.result.translations['ja'])
        client.send(msg.build())

    done = False

    # (rest omitted)
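To make it clearer what actually travels from Python to TouchDesigner, here is a rough sketch of how an OSC message with a single string argument is laid out in bytes. This hand-rolled encoder is for illustration only and is not how python-osc is implemented; the helper names `osc_pad` and `encode_osc_string_message` are made up for this example, and encoding the text as UTF-8 is an assumption (the OSC 1.0 specification itself only describes ASCII strings).

```python
def osc_pad(data: bytes) -> bytes:
    """Null-terminate the data and pad it with nulls to a multiple of 4 bytes."""
    data += b"\x00"
    return data + b"\x00" * (-len(data) % 4)


def encode_osc_string_message(address: str, text: str) -> bytes:
    """Encode an OSC message: address pattern, type tag string ',s', one string argument."""
    return (
        osc_pad(address.encode("ascii"))   # e.g. "/translation"
        + osc_pad(b",s")                   # type tags: exactly one string follows
        + osc_pad(text.encode("utf-8"))    # the subtitle text (UTF-8 is an assumption)
    )


packet = encode_osc_string_message("/translation", "hi")
print(packet)
```

Every part is aligned to 4 bytes, so the address is a separate field from the text and does not need to be cut out of the text on the receiving side.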

With this in place, run main.py in the console folder from the terminal and play something like an English YouTube video. The translation results should appear in the console like this.
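Before moving on to TouchDesigner, it can help to confirm that the OSC packets really leave the Python script. The sketch below is a plain UDP listener built on the standard library. The helper names `open_listener` and `receive_one_datagram` are hypothetical, and the loopback address assumes the script and the listener run on the same machine.

```python
import socket


def open_listener(host: str = "127.0.0.1", port: int = 0) -> socket.socket:
    """Bind a UDP socket. Port 0 lets the OS pick a free port (useful for a quick test)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_one_datagram(sock: socket.socket, timeout: float = 5.0) -> bytes:
    """Wait for a single datagram and return its raw bytes (raises socket.timeout if none arrives)."""
    sock.settimeout(timeout)
    data, _sender = sock.recvfrom(65535)
    return data


# Possible usage: bind to the same port as PORT in the translation script, then
# print(receive_one_datagram(open_listener(port=<your port>)))
```

Close this listener before starting TouchDesigner, since only one program should be bound to the port at a time.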

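Compositing the translation results and the webcam data in TouchDesigner

I had only used TouchDesigner a handful of times, so I was feeling my way through this part.
I think it could also be implemented with oF (openFrameworks) or similar.

From the menu, pick the following nodes and connect them.

・(TOP) Video Device In: webcam input
・(TOP) Text: displays the translated subtitles
・(DAT) OSC In: receives OSC and updates the subtitle text
・(TOP) Over: composites the webcam video and the subtitles
・(TOP) Syphon Spout Out: outputs the result through Syphon
Incidentally, Syphon is apparently an open-source technology for passing images between applications on Mac OS X.

In the OSC node, enter the port you chose in Python, and rewrite the callback code as follows.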

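def onReceiveOSC(dat, rowIndex, message, bytes, timeStamp, address, args, peer):
	# remove the OSC address only when it is really the prefix of the message
	prefix = "/translation "
	text = message[len(prefix):] if message.startswith(prefix) else message
	# "text2" is the name of the Text TOP that shows the subtitles
	op("text2").par.text = text.strip()
	return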
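One point to be careful about when cutting the address off the message: Python's `str.strip` treats its argument as a set of characters, not as a prefix. The small sketch below runs in plain Python outside TouchDesigner and compares the two approaches. The message format `"/translation <text>"` and the function names are assumptions made for this example.

```python
def strip_chars(message: str) -> str:
    """Trim any of the characters in '/translation ' from both ends of the message."""
    return message.strip("/translation ")


def drop_prefix(message: str, prefix: str = "/translation ") -> str:
    """Remove the address only when the message actually starts with it."""
    return message[len(prefix):] if message.startswith(prefix) else message


japanese = "/translation こんにちは"
english = "/translation nice to meet you"

print(strip_chars(japanese))   # こんにちは
print(strip_chars(english))    # ce to meet you
print(drop_prefix(english))    # nice to meet you
```

Japanese subtitles come through either way because none of their characters appear in the address, but English text loses its leading "ni", so prefix removal is the safer choice once the translation direction is reversed.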

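The result should now be displayed as shown below.

Sending the TouchDesigner output to Zoom through CamTwist

Launch CamTwist.
Select Syphon, and a TouchDesigner entry should appear.
CamTwist can reportedly turn the output from TouchDesigner into a virtual webcam.

Now launch Zoom.

CamTwist should appear in Zoom's camera selection. Select it, and the TouchDesigner screen is output.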

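The accuracy is, well, so-so.
Japanese to English should also be quick to set up by rewriting the Python code.
Nothing here is particularly difficult, but a lot of software is involved, so I wrote it down as a memo.
If you know a better way, please leave a comment.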
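For the Japanese-to-English direction mentioned above, the change would mainly be in the arguments passed to `SpeechTranslationConfig` and in the key used to read the result. The sketch below only builds those arguments as plain data so that it runs without the Speech SDK. The helper `translation_settings` and the direction labels are hypothetical, and the locale `ja-JP` is my assumption for the recognition language.

```python
def translation_settings(direction: str) -> dict:
    """Return keyword arguments for SpeechTranslationConfig for the given direction."""
    settings = {
        "en->ja": {"speech_recognition_language": "en-US", "target_languages": ("ja",)},
        "ja->en": {"speech_recognition_language": "ja-JP", "target_languages": ("en",)},
    }
    return settings[direction]


def result_key(direction: str) -> str:
    """Key to look up in evt.result.translations (assumed to be the target language code)."""
    return translation_settings(direction)["target_languages"][0]


# Possible usage in translation_sample.py (assumption, not tested against Azure here):
# translation_config = speechsdk.translation.SpeechTranslationConfig(
#     subscription=speech_key, region=service_region, **translation_settings("ja->en"))
# text = evt.result.translations[result_key("ja->en")]
print(translation_settings("ja->en"))
```

The Soundflower routing would also need to carry your own microphone instead of Zoom's output in that case, which is outside what this sketch covers.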
