Adding free English-translated captions to Japanese speech in Zoom

Posted at 2022-06-01

This article shows how to add English-translated captions to Japanese speech in Zoom at no cost.

Speech translation uses the free tier of Azure Speech translation.

Architecture

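The user speaks into Zoom as usual.
Zoom outputs the voice to the speaker.
The speaker output is picked up as microphone input.
The audio is sent to Azure speech translation, and the translation comes back as text.
The translated text is sent through the Zoom API and displayed as captions.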

Mixer setup

https://vb-audio.com/Cable/
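Install the mixer driver from the link above.
With it, audio that an application plays can be captured as microphone input.
By routing Zoom's playback to the mixer, another application can read that audio as its microphone input.

Because the program reads audio from the default microphone, set the mixer's microphone as the default recording device in the Windows settings.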
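
To check that the virtual device is visible to Java before running the translator, the capture devices can be listed with the standard javax.sound.sampled API. This is only a diagnostic sketch; the class name CaptureDevices is made up for this example, and the Speech SDK does its own device selection independently of this code.

```java
import java.util.ArrayList;
import java.util.List;

import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.Line;
import javax.sound.sampled.Mixer;
import javax.sound.sampled.TargetDataLine;

class CaptureDevices {

    /** Returns the names of all mixers that offer at least one capture (microphone-like) line. */
    static List<String> captureDeviceNames() {
        List<String> names = new ArrayList<>();
        Line.Info captureLine = new Line.Info(TargetDataLine.class);
        for (Mixer.Info info : AudioSystem.getMixerInfo()) {
            Mixer mixer = AudioSystem.getMixer(info);
            if (mixer.isLineSupported(captureLine)) {
                names.add(info.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        for (String name : captureDeviceNames()) {
            System.out.println(name);
        }
    }
}
```

With the virtual cable installed, the list would be expected to include an entry for the cable's output side, although the exact name depends on the driver and the OS.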

Azure

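Azure setup

Create a free Azure account. A credit card is required for identity verification, but the free account itself costs nothing.
Use "Create Speech Services" to create a Speech resource so that the speech service can be used.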
- Get the subscription key.
- Specify the region.
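
The key and region can be kept out of the source file. The following is a minimal sketch that reads them from environment variables; the variable names SPEECH_KEY and SPEECH_REGION and the class SpeechSettings are assumptions for illustration, not something Azure or the SDK requires.

```java
import java.util.Map;

class SpeechSettings {
    final String key;
    final String region;

    SpeechSettings(String key, String region) {
        this.key = key;
        this.region = region;
    }

    // SPEECH_KEY and SPEECH_REGION are hypothetical names chosen for this sketch.
    static SpeechSettings fromEnv(Map<String, String> env) {
        String key = env.get("SPEECH_KEY");
        if (key == null || key.isEmpty()) {
            throw new IllegalStateException("SPEECH_KEY is not set");
        }
        // Fall back to the region used in this article when none is given.
        String region = env.getOrDefault("SPEECH_REGION", "japaneast");
        return new SpeechSettings(key, region);
    }

    public static void main(String[] args) {
        SpeechSettings settings = fromEnv(System.getenv());
        System.out.println("Region: " + settings.region);
    }
}
```
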
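The free tier was described as 5 hours per month, so switch to a paid plan if you need more than that. It seems that the service stops once the free 5 hours are exceeded (unconfirmed); there appears to be no sudden billing.
You can also train a custom speech model. One option is to upload UTF-8 text with one sentence per line; other data formats are also available. If recognition accuracy is low, training a model is a way to address it.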
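
As an illustration of the "one sentence per line, UTF-8" format, the sketch below builds such a file from a list of sentences. The class name TrainingText, the file name training.txt, and the sample sentences are made up for this example; check the Azure documentation for the exact requirements of the training data.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

class TrainingText {

    /** Joins sentences into plain text with one sentence per line, skipping blank entries. */
    static String format(List<String> sentences) {
        StringBuilder sb = new StringBuilder();
        for (String sentence : sentences) {
            String line = sentence.trim();
            if (!line.isEmpty()) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        List<String> sentences = Arrays.asList(
                "本日の会議を始めます。",
                "議事録は後ほど共有します。");
        // Write the text explicitly as UTF-8, independent of the platform default encoding.
        Files.write(Paths.get("training.txt"), format(sentences).getBytes(StandardCharsets.UTF_8));
    }
}
```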

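Code

The code is essentially the official sample used as-is.
When a translation is finalized, the text is sent to Zoom as a caption.
The code is written to start recognition again when a session ends because of an error (in the listing, when it is canceled with a ServiceError).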

package speechtran;

import java.io.IOException;
import java.util.Map;
import java.util.Scanner;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.atomic.AtomicBoolean;

import com.microsoft.cognitiveservices.speech.*;
import com.microsoft.cognitiveservices.speech.translation.*;

/**
 * https://github.com/Azure-Samples/cognitive-services-speech-sdk/blob/master/samples/java/jre/console/src/com/microsoft/cognitiveservices/speech/samples/console/TranslationSamples.java
 */
public class Speech {

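    // Endpoint ID of the custom speech model (placeholder value).
    static final String SPEECH__MODEL = "XXXXXXXXXXX-XXXXXXXX-9719-32ed32f643a7";
    // Azure Speech subscription key (placeholder value).
    static final String SPEECH__KEY = "XXXXXXXXXXXc119cfXXXXXXXa99c1e";
    // Zoom caption token URL. When empty, Caption reads it from setup.properties.
    static final String ZOOM = ""; // System.getenv("ZOOM"); //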

    Caption caption = new Caption(ZOOM);

    AtomicBoolean end = new AtomicBoolean(false);

    public Speech() {

        caption.renewCount();
    }

    public void translate() {
        if (end.get()) {
            return;
        }

        new Thread(() -> {
            try {
                translationWithMicrophoneAsync();
            } catch (InterruptedException | ExecutionException | IOException e) {
                e.printStackTrace();
            }
        }).start();
    }

    public void start() {
        translate();
        System.out.println("Say something...");
        System.out.println("Press Enter to stop");
        new Scanner(System.in).nextLine();
        System.out.println("STOP!!");
        end.set(true);
        caption.end();
    }

    public void checkEnd(CountDownLatch countDownLatch) {
        if (end.get()) {
            countDownLatch.countDown();
        }
    }

    public void translationWithMicrophoneAsync() throws InterruptedException, ExecutionException, IOException {
        String speechSubscriptionKey = SPEECH__KEY;
        String speechServiceRegion = "japaneast";

        CountDownLatch countDownLatch = new CountDownLatch(1);

        try (SpeechTranslationConfig config = SpeechTranslationConfig.fromSubscription(speechSubscriptionKey,
                speechServiceRegion)) {

            String fromLanguage = "ja-JP";
            config.setSpeechRecognitionLanguage(fromLanguage);
            config.setEndpointId(SPEECH__MODEL);
            config.addTargetLanguage("en");

            String voice = "Speech Text to Trans";
            config.setVoiceName(voice);

            try (TranslationRecognizer recognizer = new TranslationRecognizer(config)) {
                // Subscribes to events.
                recognizer.recognizing.addEventListener((s, e) -> {
                    System.out.println("RECOGNIZING in : Text=" + e.getResult().getText());
                    checkEnd(countDownLatch);
                });

                recognizer.recognized.addEventListener((s, e) -> {
                    String t2 = e.getResult().getText(); // recognized source (Japanese) text
                    String t1 = ""; // translated (English) text
                    if (e.getResult().getReason() == ResultReason.TranslatedSpeech) {
                        Map<String, String> map = e.getResult().getTranslations();
                        for (String element : map.keySet()) {
                            System.out.println("    TRANSLATED into : " + map.get(element));
                            t1 = map.get(element);
                        }
                        if (!"".equals(t2)) {
                            caption.send(t1, t2);
                        }
                    }
                    if (e.getResult().getReason() == ResultReason.RecognizedSpeech) {
                        System.out.println("RECOGNIZED: Text=" + e.getResult().getText());
                        System.out.println("    Speech not translated.");
                    } else if (e.getResult().getReason() == ResultReason.NoMatch) {
                        System.out.println("NOMATCH: Speech could not be recognized.");
                    }
                });

                recognizer.synthesizing.addEventListener((s, e) -> {
                    System.out.println(
                            "Synthesis result received. Size of audio data: " + e.getResult().getAudio().length);
                });

                recognizer.canceled.addEventListener((s, e) -> {
                    System.out.println("CANCELED:" + e.getSessionId() + " Reason=" + e.getReason());

                    if (e.getReason() == CancellationReason.Error) {
                        System.out.println("CANCELED: ErrorCode=" + e.getErrorCode());
                        System.out.println("CANCELED: ErrorDetails=" + e.getErrorDetails());
                    }
                    if (e.getErrorCode() == CancellationErrorCode.ServiceError) {
                        // Retry: start a new recognition thread, then release this one.
                        translate();
                        countDownLatch.countDown();
                    }
                });

                recognizer.sessionStarted.addEventListener((s, e) -> {
                    System.out.println("Session started:" + e.getSessionId());
                    checkEnd(countDownLatch);
                });

                recognizer.sessionStopped.addEventListener((s, e) -> {
                    System.out.println("Session stopped:" + e.getSessionId());
                    countDownLatch.countDown();
                });

                // Starts continuous recognition. Uses StopContinuousRecognitionAsync() to stop
                // recognition.
                recognizer.startContinuousRecognitionAsync().get();

                countDownLatch.await();
                System.out.println("end process");

                recognizer.stopContinuousRecognitionAsync().get();
            }
        }
    }

    public static void main(String[] args) {
        try {
            new Speech().start();
        } catch (Exception ex) {
            System.out.println("Unexpected exception: " + ex.getMessage());
            assert (false);
            System.exit(1);
        }
    }
}
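
The listing above restarts recognition from inside the canceled handler. As an alternative way to think about the same idea, the restart logic can be pulled out into a small loop that runs one session at a time and pauses a little before retrying. This is a sketch that does not use the Speech SDK; RestartLoop, the attempt limit, and the delay rule are all assumptions for illustration and are not part of the author's code.

```java
import java.util.function.BooleanSupplier;

class RestartLoop {

    /**
     * Runs sessions one after another until one ends cleanly (returns true)
     * or maxAttempts sessions have been tried. Returns the number of sessions run.
     */
    static int run(BooleanSupplier session, int maxAttempts, long baseDelayMillis) {
        int attempts = 0;
        while (attempts < maxAttempts) {
            attempts++;
            boolean endedCleanly;
            try {
                endedCleanly = session.getAsBoolean();
            } catch (RuntimeException e) {
                System.out.println("Session failed: " + e.getMessage());
                endedCleanly = false;
            }
            if (endedCleanly || attempts == maxAttempts) {
                break;
            }
            try {
                // Wait a little longer after each failure, capped at 30 seconds.
                Thread.sleep(Math.min(baseDelayMillis * attempts, 30_000L));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return attempts;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Fake session that fails twice and then ends cleanly.
        int used = run(() -> ++calls[0] >= 3, 5, 100L);
        System.out.println("Sessions run: " + used);
    }
}
```

In a real program the BooleanSupplier would wrap one call to the recognition method and return true only when the user asked to stop.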

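Zoom setup

As described in the Zoom documentation, click [Closed Caption] in the Zoom window, then click [Copy the API token]. The token URL is copied to the clipboard.
In this example the token URL is loaded from a Java properties file.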

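How to send captions to Zoom

As described in Zoom's documentation on using the caption URL over HTTP, add the "seq" and "lang" parameters to the token URL you obtained.
The caption text itself is sent in the body of a POST request.
"seq" starts at 0 and is incremented each time text is sent.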
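
A minimal sketch of how the POST URL changes from one caption to the next. The class and method names are made up for this example, and the lang value en-US is an assumption; use whatever value the Zoom documentation specifies for your captions.

```java
class CaptionUrl {

    /** Appends seq and lang to a token URL that already contains a query string. */
    static String forPost(String tokenUrl, int seq, String lang) {
        return tokenUrl + "&seq=" + seq + "&lang=" + lang;
    }

    public static void main(String[] args) {
        String tokenUrl = "https://wmcapi.zoom.us/closedcaption?id=200610693&ns=GZHkEA=="
                + "&expire=86400&spparams=id%2Cns%2Cexpire&signature=nYtXJqRKCW";
        // Each caption uses the next sequence number, starting from 0.
        for (int seq = 0; seq < 3; seq++) {
            System.out.println(forPost(tokenUrl, seq, "en-US"));
        }
    }
}
```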

To get the current "seq", take the token URL:

 https://wmcapi.zoom.us/closedcaption?id=200610693&ns=GZHkEA==&expire=86400&spparams=id%2Cns%2Cexpire&signature=nYtXJqRKCW

Change the path to /closedcaption/seq [GET]:

 https://wmcapi.zoom.us/closedcaption/seq?id=200610693&ns=GZHkEA==&expire=86400&spparams=id%2Cns%2Cexpire&signature=nYtXJqRKCW

Send this URL with GET. The current number is returned.
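
The URL rewrite and the handling of the returned number can be sketched as two small pure functions. The names are made up for this example; adding 1 to the returned value mirrors what renewCount does in this article's Caption class.

```java
class SeqEndpoint {

    /** Turns the caption token URL into the URL of the seq endpoint. */
    static String seqUrl(String tokenUrl) {
        return tokenUrl.replace("/closedcaption?", "/closedcaption/seq?");
    }

    /** Parses the body returned by the seq endpoint and returns the next number to use. */
    static int nextSeq(String responseBody) {
        return Integer.parseInt(responseBody.trim()) + 1;
    }

    public static void main(String[] args) {
        String tokenUrl = "https://wmcapi.zoom.us/closedcaption?id=200610693&ns=GZHkEA=="
                + "&expire=86400&spparams=id%2Cns%2Cexpire&signature=nYtXJqRKCW";
        System.out.println(seqUrl(tokenUrl));
        System.out.println(nextSeq("12")); // prints 13
    }
}
```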

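Code

HTTP requests are handled with java.net.HttpURLConnection, so no additional libraries are needed.
Because an HTTP request takes some time to return a response, requests are handled on a separate thread.
Executors.newFixedThreadPool(1) provides a single worker thread, so captions are sent in the background in the order they were submitted.
The send method sends a caption to Zoom.
renewCount gets the current sequence number.
Put the token URL in setup.properties under the key zoom (zoom=...). The key is lowercase because the code reads it with rb.getString("zoom"), and the lookup is case-sensitive.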
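
For reference, a sketch of what the properties entry looks like and how it parses. It uses java.util.Properties on an in-memory string instead of the ResourceBundle lookup in the listing, purely so the example is self-contained; the class name and the sample URL are placeholders.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class SetupProperties {

    /** Parses properties text and returns the value stored under the lowercase key "zoom". */
    static String zoomUrl(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String url = props.getProperty("zoom");
        if (url == null || url.isEmpty()) {
            throw new IllegalStateException("zoom is not set in setup.properties");
        }
        return url;
    }

    public static void main(String[] args) {
        // Example content of setup.properties (placeholder token URL).
        String text = "zoom=https://wmcapi.zoom.us/closedcaption?id=1111&ns=xxxx&signature=yyyy\n";
        System.out.println(zoomUrl(text));
    }
}
```

Only the first "=" on the line separates the key from the value, so the "=" characters inside the token URL are kept as part of the value.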

package speechtran;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ResourceBundle;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import javax.net.ssl.HttpsURLConnection;

public class Caption {

    String url;
    int cnt = 0;
    ExecutorService exec = Executors.newFixedThreadPool(1);

    public Caption(String url) {
        if (url == null || "".equals(url)) {
            ResourceBundle rb = ResourceBundle.getBundle("setup");
            this.url = rb.getString("zoom");
        } else {
            this.url = url;
        }
        System.out.println(this.url);
    }

    public void send(String txt1, String txt2) {

        exec.submit(() -> {
            try {
                URL obj = new URL(url + "&lang=jp-JP&seq=" + cnt++);
                HttpsURLConnection con = (HttpsURLConnection) obj.openConnection();

                // add request header
                con.setRequestMethod("POST");
                con.setRequestProperty("Content-Type", "text/plain");

                String urlParameters = txt1 + "\n" + txt2;

                // Send post request
                con.setDoOutput(true);
                OutputStream outputStream = con.getOutputStream();
                outputStream.write(urlParameters.getBytes("UTF-8"));
                outputStream.flush();
                outputStream.close();

                int responseCode = con.getResponseCode();

                // print result
                System.out.println(responseCode);
            } catch (Exception e) {
                e.printStackTrace();
            }

        });
    }

    public void renewCount() {
        try {

            URL obj = new URL(url.replace("/closedcaption?", "/closedcaption/seq?"));
            HttpURLConnection con = (HttpURLConnection) obj.openConnection();

            con.setRequestMethod("GET");

            con.getResponseCode();

            BufferedReader in = new BufferedReader(
                    new InputStreamReader(con.getInputStream()));
            String inputLine;
            StringBuffer response = new StringBuffer();

            while ((inputLine = in.readLine()) != null) {
                response.append(inputLine);
            }
            in.close();
            cnt = Integer.parseInt(response.toString()) + 1;
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public void end() {
        try {
            exec.shutdown();
            exec.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

    // public static void main(String[] args) {
    // Caption caption = new Caption(
    // "https://wmcc.zoom.us/closedcaption?id=1111111111111&ns=xxxxxxxxxx&expire=86400&sparams=id%2Cns%2Cexpire&signature=xxxxxxxxxxxxxxxxxxxxx");
    // caption.renewCount();
    // caption.send("aaaaaあああああああ1", "bbbbbb");
    // caption.send("aaaaa2", "bbbbbb");
    // caption.send("aaaaa3", "bbbbbb");
    // caption.send("aaaaa4", "bbbbbb");
    // caption.send("aaaaa5", "bbbbbb");
    // caption.send("aaaaa6", "bbbbbb");
    // caption.send("aaaaa7", "bbbbbb");
    // caption.send("aaaaa8", "bbbbbb");
    // caption.send("aaaaa9", "bbbbbb");
    // caption.end();
    // }

}
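
The reason a pool of size 1 is enough here can be shown in isolation: tasks submitted to a single worker thread run one at a time in submission order, which keeps the sequence numbers in order. The sketch below replaces the HTTP POST with a list append; OrderedSender is a made-up name for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class OrderedSender {

    /** Submits count fake "send" tasks and returns the order in which they actually ran. */
    static List<Integer> sendAll(int count) {
        List<Integer> sent = Collections.synchronizedList(new ArrayList<>());
        ExecutorService exec = Executors.newFixedThreadPool(1);
        for (int i = 0; i < count; i++) {
            final int seq = i;
            exec.submit(() -> {
                // A real task would POST the caption for this sequence number here.
                sent.add(seq);
            });
        }
        // Stop accepting new tasks and wait for the queued ones to finish.
        exec.shutdown();
        try {
            exec.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new ArrayList<>(sent);
    }

    public static void main(String[] args) {
        System.out.println(sendAll(5)); // prints [0, 1, 2, 3, 4]
    }
}
```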

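Running it

Run the main method of Speech.

The words you speak are translated and displayed in Zoom.

Both the spoken Japanese and its English translation were displayed successfully.

Other notes

The finer details are omitted from this explanation.
It works once Azure, Zoom, and the audio are each connected together.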
