HoloLens2 × Azure Cognitive Services （Translator APIで英語から日本語に翻訳し音声合成）

Posted at 2020-12-10

はじめに

HoloLensアドベントカレンダー2020の8日目の記事です。
API叩くの慣れてきましたかー？前回の続きで、画像分析APIで画像説明文を生成しそれを読み上げてみましたが、説明文は英語のため、日本語に翻訳して音声合成してみましょう！

開発環境

Azure
- Computer Vision API (画像分析 API)
- Translator API
- Speech SDK 1.14.0
Unity 2019.4.1f1
MRTK 2.5.1
Windows 10 PC
HoloLens2

導入

1．前回の記事まで終わらせてください。

2．Azureポータルを開き、Translatorを作成、エンドポイントとキー、場所（リージョン）をメモっておいてください。

3．Translatorのチュートリアル（C#, Python, Translate.py）を参考にTapToCaptureAnalyzeAPI.csを編集します。画像分析APIで得られた画像説明文（英語）をTranslator APIに投げ、翻訳した日本語のテキストをSSMLを用いて音声合成します。

TapToCaptureAnalyzeAPI.cs

using System.Collections;
using System.Collections.Generic;
using System.Linq;
using System;
using UnityEngine;
using Microsoft.MixedReality.Toolkit.Utilities;
using System.Threading.Tasks;
using OpenCVForUnity.CoreModule;
using OpenCVForUnity.UnityUtils;
using OpenCVForUnity.ImgprocModule;

// SpeechSDK 追加分ここから
using System.IO;
using System.Text;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
// SpeechSDK 追加分ここまで

public class TapToCaptureAnalyzeAPI : MonoBehaviour
{

    // Translator 追加分ここから
    private string translator_endpoint = "https://api.cognitive.microsofttranslator.com/translate?api-version=3.0&to=ja";
    private string translator_subscription_key = "<Insert Your API Key>";

    [System.Serializable]
    public class Translator
    {
        public DetectedLanguage detectedLanguage;
        public Translations[] translations;
    }

    [System.Serializable]
    public class DetectedLanguage
    {
        public string language;
        public float score;
    }

    [System.Serializable]
    public class Translations
    {
        public string text;
        public string to;
    }
    // Translator 追加分ここまで

    // SpeechSDK 追加分ここから
    public AudioSource audioSource;

    async Task SynthesizeAudioAsync(string text) 
    {
        var config = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");
        var synthesizer = new SpeechSynthesizer(config, null); // nullを省略するとPCのスピーカーから出力されるが、HoloLensでは出力されない。
        
        // SSMLを使用して日本語で音声合成
        // https://docs.microsoft.com/ja-jp/azure/cognitive-services/speech-service/get-started-text-to-speech?fbclid=IwAR2zk6s9XcOGxXzGlOb9tfrg8hw7jTozXuGEnbk-M3ZORrJFbLPZE6Y2d1k&tabs=script%2Cwindowsinstall&pivots=programming-language-csharp
        string ssml = "<speak version=\"1.0\" xmlns=\"https://www.w3.org/2001/10/synthesis\" xml:lang=\"ja-JP\"> <voice name=\"ja-JP-Ichiro\">" + text + "</voice> </speak>";

        // https://github.com/Azure-Samples/cognitive-services-speech-sdk/blob/master/quickstart/csharp/unity/text-to-speech/Assets/Scripts/HelloWorld.cs
        // Starts speech synthesis, and returns after a single utterance is synthesized.
        // using (var result = synthesizer.SpeakTextAsync(text).Result)
        using (var result = synthesizer.SpeakSsmlAsync(ssml).Result)
        {
            // Checks result.
            if (result.Reason == ResultReason.SynthesizingAudioCompleted)
            {
                // Native playback is not supported on Unity yet (currently only supported on Windows/Linux Desktop).
                // Use the Unity API to play audio here as a short term solution.
                // Native playback support will be added in the future release.
                var sampleCount = result.AudioData.Length / 2;
                var audioData = new float[sampleCount];
                for (var i = 0; i < sampleCount; ++i)
                {
                    audioData[i] = (short)(result.AudioData[i * 2 + 1] << 8 | result.AudioData[i * 2]) / 32768.0F;
                }

                // The output audio format is 16K 16bit mono
                var audioClip = AudioClip.Create("SynthesizedAudio", sampleCount, 1, 16000, false);
                audioClip.SetData(audioData, 0);
                audioSource.clip = audioClip;
                audioSource.Play();

                // newMessage = "Speech synthesis succeeded!";
            }
            else if (result.Reason == ResultReason.Canceled)
            {
                var cancellation = SpeechSynthesisCancellationDetails.FromResult(result);
                // newMessage = $"CANCELED:\nReason=[{cancellation.Reason}]\nErrorDetails=[{cancellation.ErrorDetails}]\nDid you update the subscription info?";
            }
        }
    }
    // SpeechSDK 追加分ここまで

    public GameObject quad;

    [System.Serializable]
    public class Analyze
    {
        public Categories[] categories;
        public Color color;
        public Description description;
        public string requestId;
        public Metadata metadata;
    }

    [System.Serializable]
    public class Categories
    {
        public string name;
        public float score;
    }

    [System.Serializable]
    public class Color
    {
        public string dominantColorForeground;
        public string dominantColorBackground;
        public string[] dominantColors;
        public string accentColor;
        public bool isBwImg;
        public bool isBWImg;
    }

    [System.Serializable]
    public class Description
    {
        public string[] tags;
        public Captions[] captions;
    }

    [System.Serializable]
    public class Captions
    {
        public string text;
        public float confidence;
    }

    [System.Serializable]
    public class Metadata
    {
        public int height;
        public int width;
        public string format;
    }

    UnityEngine.Windows.WebCam.PhotoCapture photoCaptureObject = null;
    Texture2D targetTexture = null;

    private string endpoint = "https://<Insert Your Endpoint>/vision/v3.1/analyze";
    private string subscription_key = "<Insert Your API Key>";
    private bool waitingForCapture;

    void Start(){
        waitingForCapture = false;
    }

    public void AirTap()
    {
        if (waitingForCapture) return;
        waitingForCapture = true;

        Resolution cameraResolution = UnityEngine.Windows.WebCam.PhotoCapture.SupportedResolutions.OrderByDescending((res) => res.width * res.height).First();
        targetTexture = new Texture2D(cameraResolution.width, cameraResolution.height);

        // PhotoCapture オブジェクトを作成します
        UnityEngine.Windows.WebCam.PhotoCapture.CreateAsync(false, delegate (UnityEngine.Windows.WebCam.PhotoCapture captureObject) {
            photoCaptureObject = captureObject;
            UnityEngine.Windows.WebCam.CameraParameters cameraParameters = new UnityEngine.Windows.WebCam.CameraParameters();
            cameraParameters.hologramOpacity = 0.0f;
            cameraParameters.cameraResolutionWidth = cameraResolution.width;
            cameraParameters.cameraResolutionHeight = cameraResolution.height;
            cameraParameters.pixelFormat = UnityEngine.Windows.WebCam.CapturePixelFormat.BGRA32;

            // カメラをアクティベートします
            photoCaptureObject.StartPhotoModeAsync(cameraParameters, delegate (UnityEngine.Windows.WebCam.PhotoCapture.PhotoCaptureResult result) {
                // 写真を撮ります
                photoCaptureObject.TakePhotoAsync(OnCapturedPhotoToMemoryAsync);
            });
        });
    }

    async void OnCapturedPhotoToMemoryAsync(UnityEngine.Windows.WebCam.PhotoCapture.PhotoCaptureResult result, UnityEngine.Windows.WebCam.PhotoCaptureFrame photoCaptureFrame)
    {
        // ターゲットテクスチャに RAW 画像データをコピーします
        photoCaptureFrame.UploadImageDataToTexture(targetTexture);
        byte[] bodyData = targetTexture.EncodeToJPG();

        Response response = new Response();

        try
        {
            string query = endpoint + "?visualFeatures=Categories,Description,Color";
            var headers = new Dictionary<string, string>();
            headers.Add("Ocp-Apim-Subscription-Key", subscription_key);
            response = await Rest.PostAsync(query, bodyData, headers, -1, true);
        }
        catch (Exception e)
        {
            photoCaptureObject.StopPhotoModeAsync(OnStoppedPhotoMode);
            return;
        }

        if (!response.Successful)
        {
            photoCaptureObject.StopPhotoModeAsync(OnStoppedPhotoMode);
            return;
        }

        Debug.Log(response.ResponseCode);
        Debug.Log(response.ResponseBody);
        Analyze analyze = JsonUtility.FromJson<Analyze>(response.ResponseBody);
        Debug.Log(analyze.description.captions[0].text);

        // Translator 追加分ここから
        try
        {
            var headers = new Dictionary<string, string>();
            headers.Add("Ocp-Apim-Subscription-Key", translator_subscription_key);
            headers.Add("Ocp-Apim-Subscription-Region", "japaneast");
            // headers.Add("Content-type", "application/json");
            string jsonData = "[{" + "'text'" + ":'" + analyze.description.captions[0].text + "'}]";
            // headers.Add("X-ClientTraceId", "uuid4");
            response = await Rest.PostAsync(translator_endpoint, jsonData, headers, -1, true);
        }
        catch (Exception e)
        {
            photoCaptureObject.StopPhotoModeAsync(OnStoppedPhotoMode);
            return;
        }

        if (!response.Successful)
        {
            photoCaptureObject.StopPhotoModeAsync(OnStoppedPhotoMode);
            return;
        }

        Debug.Log(response.ResponseCode);
        Debug.Log(response.ResponseBody);
        string newResponseBody = "{ \"results\": " + response.ResponseBody + "}";
        Translator[] translator = JsonHelper.FromJson<Translator>(newResponseBody);
        Debug.Log(translator[0].translations[0].text);
        // Translator 追加分ここまで

        // SpeechSDK 追加分ここから
        // 生成された画像説明文をSynthesizeAudioAsyncに投げる
        // await SynthesizeAudioAsync(analyze.description.captions[0].text); // en
        await SynthesizeAudioAsync(translator[0].translations[0].text); // jp 
        // SpeechSDK 追加分ここまで

        // OpenCVを用いて結果をて画像に書き込み
        Mat imgMat = new Mat(targetTexture.height, targetTexture.width, CvType.CV_8UC4);
        Utils.texture2DToMat(targetTexture, imgMat);
        Debug.Log("imgMat.ToString() " + imgMat.ToString());
        Imgproc.putText(imgMat, analyze.description.captions[0].text, new Point(10, 100), Imgproc.FONT_HERSHEY_SIMPLEX, 4.0, new Scalar(255, 255, 0, 255), 4, Imgproc.LINE_AA, false);
        Texture2D texture = new Texture2D(imgMat.cols(), imgMat.rows(), TextureFormat.RGBA32, false);
        Utils.matToTexture2D(imgMat, texture);
        Renderer quadRenderer = quad.GetComponent<Renderer>() as Renderer;
        quadRenderer.material.SetTexture("_MainTex", texture);

        // カメラを非アクティブにします
        photoCaptureObject.StopPhotoModeAsync(OnStoppedPhotoMode);
    }

    void OnStoppedPhotoMode(UnityEngine.Windows.WebCam.PhotoCapture.PhotoCaptureResult result)
    {
        // photo capture のリソースをシャットダウンします
        photoCaptureObject.Dispose();
        photoCaptureObject = null;
        waitingForCapture = false;
    }
}

4．ソースコードの重要なところ説明していきますね。まずは、translator_subscription_keyにメモしたキーをコピぺしてください。エンドポイントはそのままでOKです。クエリパラメータにapi-version=3.0と日本語への翻訳（to=ja）を指定しています。

5．MRTKのRestを用いて、前回の画像説明文の生成結果がanalyze.description.captions[0].textに格納されているので、json形式にして、Translator APIにPOSTします。ヘッダーはOcp-Apim-Subscription-Keyにtranslator_subscription_key、Ocp-Apim-Subscription-Regionにメモした場所（japaneast）を入力します。

6．POSTが成功したら、次のようなレスポンスボディが返ってくるので、仕様に合わせてTranslatorクラス、DetectedLanguageクラス、Translationsクラスを作ります。

[{"detectedLanguage":{"language":"en","score":1.0},"translations":[{"text":"部屋の人","to":"ja"}]}]

7．今回は、リストのjsonになっているので、Face APIのときみたいに、JsonHelperを用いてパースします。

string newResponseBody = "{ \"results\": " + response.ResponseBody + "}";
Translator[] translator = JsonHelper.FromJson<Translator>(newResponseBody);

8．あとはtranslator[0].translations[0].textに日本語に翻訳されたテキストがあるので、SynthesizeAudioAsyncに投げます。ただし、前回のままだと「ArgumentException: Length of created clip must be larger than 0」エラーが発生し、言語が対応してないことがわかります。

await SynthesizeAudioAsync(translator[0].translations[0].text);

9．そこで、SSML を使用して音声の特徴をカスタマイズするを参考に日本語で音声合成します。SSMLを作成し、日本語、男性、標準音声のja-JP-Ichiroを指定します。サポートしている言語の一覧はこちらです。

string ssml = "<speak version=\"1.0\" xmlns=\"https://www.w3.org/2001/10/synthesis\" xml:lang=\"ja-JP\"> <voice name=\"ja-JP-Ichiro\">" + text + "</voice> </speak>";

10．synthesizer.SpeakTextAsync(text)をsynthesizer.SpeakSsmlAsync(ssml)に変更することで、ssmlを読み込み、指定した音声で合成することができます。

音声合成マークアップ言語 (SSML) を使用して合成を改善する

11．今回、標準音声を用いていますが、ニューラル音声の方がより人間らしく合成できるみたいです。使用したい場合は、Azureの音声リソースの場所を"米国東部"、"東南アジア"、"西ヨーロッパ"のいずれかにしておく必要があるようです。

12．日本語の標準音声は、ja-JP-Ayumi（女性）、ja-JP-HarukaRUS（女性）、ja-JP-Ichiro（男性）、ニューラル音声は、ja-JP-NanamiNeural（女性）、ja-JP-KeitaNeural（男性）がありますので、変更してみてください。

実行

実行してみた結果がこちらの動画になります。

画像キャプチャ→画像説明文生成→説明文（英語）→日本語に翻訳→音声合成（SSML, 標準音声, 男性, ja-JP-Ichiro）#Azure #CognitiveServices #画像分析 #Translator #TTS #API #AI #HoloLens2 #MRTK #Unity #OpenCV pic.twitter.com/UVVq0yfSWq
— 藤本賢志(ガチ本)@XRKaigi (@sotongshi) December 10, 2020

お疲れ様でした。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up