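Introduction

In this post I try out real-time speech-to-text (STT) in Unity using Azure's Speech Service. As of July 2023, Speech Service appears to be treated as one of the Azure AI services.

According to the official Microsoft documentation, everything previously called Cognitive Services or Azure Applied AI Services is now included in Azure AI services.

Azure AI services

Azure AI services let developers and organizations quickly add a variety of intelligent features to their applications using ready-to-use, prebuilt, customizable APIs and models. The services included in Azure AI services are listed below. Many services are available besides the Speech Service used in this post.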

Anomaly Detector
Azure Cognitive Search
Azure OpenAI
Bot Service
Content Safety
Custom Vision
Document Intelligence
Face
Immersive Reader
Language
Metrics Advisor
Personalizer
Speech
Translator
Video Indexer
Vision

Speech Service features

Speech to text (STT)
Text to speech (TTS)
Pronunciation assessment
Speech translation
Speaker recognition
Custom keyword
Intent recognition

Languages supported by Speech Service

The supported languages also differ from feature to feature within Speech Service.

Using Azure Speech Service from Unity

To use Speech Service from Unity (C#), you use either the REST API or the Speech SDK. The sample projects published by Microsoft show the Speech SDK approach, so I will start by trying the Hello World.
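
To give a feel for the Speech SDK route before opening the samples, here is a minimal sketch of one-shot recognition from the default microphone. It is not taken from the sample repository: the class name OneShotRecognition and the helper RecognizeFromDefaultMicAsync are made up for illustration, the key and region are placeholders, and it assumes the Speech SDK for Unity package is already imported and that the platform lets the SDK open the default microphone.

```csharp
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using UnityEngine;

// Hypothetical component for illustration only; not part of the Microsoft samples.
public class OneShotRecognition : MonoBehaviour
{
    // Placeholders: replace with your own Speech Service key and region.
    private const string SubscriptionKey = "YourSubscriptionKey";
    private const string ServiceRegion = "YourServiceRegion";

    private async void Start()
    {
        string text = await RecognizeFromDefaultMicAsync();
        Debug.Log("Result: " + text);
    }

    private static async Task<string> RecognizeFromDefaultMicAsync()
    {
        var config = SpeechConfig.FromSubscription(SubscriptionKey, ServiceRegion);
        config.SpeechRecognitionLanguage = "ja-JP";

        // With no AudioConfig, the recognizer reads from the default microphone.
        using (var recognizer = new SpeechRecognizer(config))
        {
            // Listens for a single utterance and returns when it ends.
            SpeechRecognitionResult result = await recognizer.RecognizeOnceAsync();

            switch (result.Reason)
            {
                case ResultReason.RecognizedSpeech:
                    return result.Text;
                case ResultReason.NoMatch:
                    return "(no speech could be recognized)";
                case ResultReason.Canceled:
                    var details = CancellationDetails.FromResult(result);
                    return "Canceled: " + details.Reason + " " + details.ErrorDetails;
                default:
                    return "(unexpected result: " + result.Reason + ")";
            }
        }
    }
}
```

One-shot recognition stops after the first utterance, whereas the from-unitymicrophone sample used later in this post runs continuous recognition and feeds the audio in itself.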

GitHub: Azure-Samples/cognitive-services-speech-sdk/tree/master/samples/csharp/unity

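The repository provides five sample projects.

Project	Description
embedded-speech	Demo of embedded (offline, on-device) speech recognition and synthesis using the Speech SDK for Unity.
from-unitymicrophone	Demo of streaming audio captured from Unity's microphone input through a `PushAudioInputStream`. Using the Unity microphone instead of the microphone handling built into the Speech SDK is useful in scenarios where the user wants the recorded audio itself.
keywordrecognizer	Demo of starting speech recognition with a keyword in Unity.
speechrecognizer	Demo of real-time speech recognition, translation into multiple languages, and intent recognition from speech input using natural language understanding.
virtual-assistant	Demo that connects to a bot with `DialogServiceConnector`, sends and receives activities, and uses the Speech SDK for speech recognition and spoken output.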
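
The push-stream approach used by from-unitymicrophone can be sketched as follows. The sample itself calls AudioInputStream.CreatePushStream() with no arguments; this sketch passes the audio format explicitly so that it is visible in code what the stream expects. The class name PushStreamSetup and the method CreateRecognizer are hypothetical names for illustration, and the 16 kHz, 16-bit, mono values are my assumption of a format that matches the 16000 Hz Unity microphone clip the sample records.

```csharp
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

// Hypothetical helper for illustration only; not part of the Microsoft samples.
public static class PushStreamSetup
{
    public static SpeechRecognizer CreateRecognizer(
        SpeechConfig config, out PushAudioInputStream pushStream)
    {
        // Describe the raw PCM the application will push:
        // 16000 samples per second, 16 bits per sample, 1 channel.
        AudioStreamFormat format = AudioStreamFormat.GetWaveFormatPCM(16000, 16, 1);

        // The application writes audio bytes into this stream itself,
        // instead of letting the SDK open a microphone.
        pushStream = AudioInputStream.CreatePushStream(format);

        AudioConfig audioInput = AudioConfig.FromStreamInput(pushStream);
        return new SpeechRecognizer(config, audioInput);
    }
}
```

The caller keeps the returned pushStream and calls pushStream.Write(bytes) whenever new microphone data is available, then pushStream.Close() when it is finished.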

Steps

Let's try the from-unitymicrophone sample.

Clone the repository from GitHub.

git clone https://github.com/Azure-Samples/cognitive-services-speech-sdk.git

Open the from-unitymicrophone folder in Unity Hub.

The sample targets Unity 2020.3 or later.

The project does not yet include the Speech SDK for Unity (.unitypackage) and therefore has compile errors, so an "Enter Safe Mode?" dialog appears. Click the Ignore button to open the project. Warnings are shown once the project opens, but they disappear after you import the Speech SDK for Unity in the next step, so they can safely be ignored.

Import the Speech SDK for Unity (.unitypackage) via Assets > Import Package > Custom Package... The .unitypackage is distributed by Microsoft as a separate download.

Once the custom package has been imported, open the Hello World scene.

Next, open HelloWorld.cs in the Assets > Scripts folder. On line 155, replace the arguments of SpeechConfig.FromSubscription() with your own Speech Service subscription key and region.

config = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");

By default, speech is recognized as English. To try recognition in Japanese, add the following code immediately after line 155.

config.SpeechRecognitionLanguage = "ja-JP";
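
Hardcoding the key in HelloWorld.cs is fine for a quick test, but it is easy to commit by accident. As one possible alternative, the sketch below reads the key and region from environment variables and sets the recognition language in one place. The class SpeechConfigFactory and the variable names SPEECH_KEY and SPEECH_REGION are assumptions chosen for this example, and environment variables are only practical in the Editor or desktop builds, not on mobile devices.

```csharp
using System;
using Microsoft.CognitiveServices.Speech;

// Hypothetical helper for illustration only; not part of the Microsoft samples.
public static class SpeechConfigFactory
{
    public static SpeechConfig CreateFromEnvironment(string language = "ja-JP")
    {
        // Assumed variable names; set them before launching the Unity Editor.
        string key = Environment.GetEnvironmentVariable("SPEECH_KEY");
        string region = Environment.GetEnvironmentVariable("SPEECH_REGION");

        if (string.IsNullOrEmpty(key) || string.IsNullOrEmpty(region))
        {
            throw new InvalidOperationException(
                "Set the SPEECH_KEY and SPEECH_REGION environment variables.");
        }

        SpeechConfig config = SpeechConfig.FromSubscription(key, region);

        // Locale of the speech to recognize, for example "ja-JP" or "en-US".
        config.SpeechRecognitionLanguage = language;
        return config;
    }
}
```

With this helper, the assignment in Start() would become config = SpeechConfigFactory.CreateFromEnvironment();.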

The complete source code is shown below.

HelloWorld.cs

//
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE.md file in the project root for full license information.
//

using UnityEngine;
using UnityEngine.UI;
using Microsoft.CognitiveServices.Speech;
using System;
using System.Collections;
using Microsoft.CognitiveServices.Speech.Audio;
using System.IO;
#if PLATFORM_ANDROID
using UnityEngine.Android;
#endif
#if PLATFORM_IOS
using UnityEngine.iOS;
#endif

public class HelloWorld : MonoBehaviour
{
    private bool micPermissionGranted = false;
    public Text outputText;
    public Button recoButton;
    SpeechRecognizer recognizer;
    SpeechConfig config;
    AudioConfig audioInput;
    PushAudioInputStream pushStream;

    private object threadLocker = new object();
    private bool recognitionStarted = false;
    private string message;
    int lastSample = 0;
    AudioSource audioSource;

#if PLATFORM_ANDROID || PLATFORM_IOS
    // Required to manifest microphone permission, cf.
    // https://docs.unity3d.com/Manual/android-manifest.html
    private Microphone mic;
#endif

    private byte[] ConvertAudioClipDataToInt16ByteArray(float[] data)
    {
        MemoryStream dataStream = new MemoryStream();
        int x = sizeof(Int16);
        Int16 maxValue = Int16.MaxValue;
        int i = 0;
        while (i < data.Length)
        {
            dataStream.Write(BitConverter.GetBytes(Convert.ToInt16(data[i] * maxValue)), 0, x);
            ++i;
        }
        byte[] bytes = dataStream.ToArray();
        dataStream.Dispose();
        return bytes;
    }

    private void RecognizingHandler(object sender, SpeechRecognitionEventArgs e)
    {
        lock (threadLocker)
        {
            message = e.Result.Text;
            Debug.Log("RecognizingHandler: " + message);
        }
    }

    private void RecognizedHandler(object sender, SpeechRecognitionEventArgs e)
    {
        lock (threadLocker)
        {
            message = e.Result.Text;
            Debug.Log("RecognizedHandler: " + message);
        }
    }

    private void CanceledHandler(object sender, SpeechRecognitionCanceledEventArgs e)
    {
        lock (threadLocker)
        {
            message = e.ErrorDetails.ToString();
            Debug.Log("CanceledHandler: " + message);
        }
    }

    public async void ButtonClick()
    {
        if (recognitionStarted)
        {
            await recognizer.StopContinuousRecognitionAsync().ConfigureAwait(true);

            if (Microphone.IsRecording(Microphone.devices[0]))
            {
                Debug.Log("Microphone.End: " + Microphone.devices[0]);
                Microphone.End(null);
                lastSample = 0;
            }

            lock (threadLocker)
            {
                recognitionStarted = false;
                Debug.Log("RecognitionStarted: " + recognitionStarted.ToString());
            }
        }
        else
        {
            if (!Microphone.IsRecording(Microphone.devices[0]))
            {
                Debug.Log("Microphone.Start: " + Microphone.devices[0]);
                audioSource.clip = Microphone.Start(Microphone.devices[0], true, 200, 16000);
                Debug.Log("audioSource.clip channels: " + audioSource.clip.channels);
                Debug.Log("audioSource.clip frequency: " + audioSource.clip.frequency);
            }

            await recognizer.StartContinuousRecognitionAsync().ConfigureAwait(false);
            lock (threadLocker)
            {
                recognitionStarted = true;
                Debug.Log("RecognitionStarted: " + recognitionStarted.ToString());
            }
        }
    }

    void Start()
    {
        if (outputText == null)
        {
            UnityEngine.Debug.LogError("outputText property is null! Assign a UI Text element to it.");
        }
        else if (recoButton == null)
        {
            message = "recoButton property is null! Assign a UI Button to it.";
            UnityEngine.Debug.LogError(message);
        }
        else
        {
            // Continue with normal initialization, Text and Button objects are present.
#if PLATFORM_ANDROID
            // Request to use the microphone, cf.
            // https://docs.unity3d.com/Manual/android-RequestingPermissions.html
            message = "Waiting for mic permission";
            if (!Permission.HasUserAuthorizedPermission(Permission.Microphone))
            {
                Permission.RequestUserPermission(Permission.Microphone);
            }
#elif PLATFORM_IOS
            if (!Application.HasUserAuthorization(UserAuthorization.Microphone))
            {
                Application.RequestUserAuthorization(UserAuthorization.Microphone);
            }
#else
            micPermissionGranted = true;
            message = "Click button to recognize speech";
#endif
            config = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");
            pushStream = AudioInputStream.CreatePushStream();
            audioInput = AudioConfig.FromStreamInput(pushStream);
            recognizer = new SpeechRecognizer(config, audioInput);
            recognizer.Recognizing += RecognizingHandler;
            recognizer.Recognized += RecognizedHandler;
            recognizer.Canceled += CanceledHandler;

            recoButton.onClick.AddListener(ButtonClick);
            foreach (var device in Microphone.devices)
            {
                Debug.Log("DeviceName: " + device);                
            }
            audioSource = GameObject.Find("MyAudioSource").GetComponent<AudioSource>();
        }
    }

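    void OnDisable()
    {
        // Unity invokes OnDisable; a method named Disable is never called.
        // recognizer is null if Start() bailed out on a missing UI reference.
        if (recognizer == null) return;
        recognizer.Recognizing -= RecognizingHandler;
        recognizer.Recognized -= RecognizedHandler;
        recognizer.Canceled -= CanceledHandler;
        pushStream.Close();
        recognizer.Dispose();
        recognizer = null;
    }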

    void FixedUpdate()
    {
#if PLATFORM_ANDROID
        if (!micPermissionGranted && Permission.HasUserAuthorizedPermission(Permission.Microphone))
        {
            micPermissionGranted = true;
            message = "Click button to recognize speech";
        }
#elif PLATFORM_IOS
        if (!micPermissionGranted && Application.HasUserAuthorization(UserAuthorization.Microphone))
        {
            micPermissionGranted = true;
            message = "Click button to recognize speech";
        }
#endif
        lock (threadLocker)
        {
            if (recoButton != null)
            {
                recoButton.interactable = micPermissionGranted;
            }
            if (outputText != null)
            {
                outputText.text = message;
            }
        }

        if (Microphone.IsRecording(Microphone.devices[0]) && recognitionStarted == true)
        {
            GameObject.Find("MyButton").GetComponentInChildren<Text>().text = "Stop";
            int pos = Microphone.GetPosition(Microphone.devices[0]);
            int diff = pos - lastSample;

            if (diff > 0)
            {
                float[] samples = new float[diff * audioSource.clip.channels];
                audioSource.clip.GetData(samples, lastSample);
                byte[] ba = ConvertAudioClipDataToInt16ByteArray(samples);
                if (ba.Length != 0)
                {
                    Debug.Log("pushStream.Write pos:" + Microphone.GetPosition(Microphone.devices[0]).ToString() + " length: " + ba.Length.ToString());
                    pushStream.Write(ba);
                }
            }
            lastSample = pos;
        }
        else if (!Microphone.IsRecording(Microphone.devices[0]) && recognitionStarted == false)
        {
            GameObject.Find("MyButton").GetComponentInChildren<Text>().text = "Start";
        }
    }
}
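
The two pieces of arithmetic that FixedUpdate relies on can be looked at in isolation. The following standalone console sketch is my own and not part of the sample: FloatToPcm16 converts Unity-style float samples (nominally -1.0 to 1.0) into 16-bit little-endian PCM, clamping out-of-range values so the conversion cannot overflow, and NewSampleCount shows one way to count new samples when the looping microphone clip wraps around, a case in which the sample's diff > 0 check skips the data. Both method names are hypothetical.

```csharp
using System;

// Standalone sketch for illustration only; not part of the Microsoft samples.
public static class PcmConversion
{
    // Converts float samples in [-1, 1] to 16-bit little-endian PCM bytes.
    public static byte[] FloatToPcm16(float[] samples)
    {
        var bytes = new byte[samples.Length * sizeof(short)];
        for (int i = 0; i < samples.Length; i++)
        {
            // Clamp first so values outside [-1, 1] cannot overflow Int16.
            float clamped = Math.Max(-1f, Math.Min(1f, samples[i]));
            short value = (short)Math.Round(clamped * short.MaxValue);

            // Low byte first (little-endian), independent of the platform.
            bytes[2 * i] = (byte)(value & 0xFF);
            bytes[2 * i + 1] = (byte)((value >> 8) & 0xFF);
        }
        return bytes;
    }

    // Number of samples recorded since lastSample in a looping clip
    // that holds clipSamples samples per channel.
    public static int NewSampleCount(int lastSample, int position, int clipSamples)
    {
        int diff = position - lastSample;
        return diff >= 0 ? diff : diff + clipSamples;
    }

    public static void Main()
    {
        byte[] pcm = FloatToPcm16(new[] { 0f, 1f, -1f, 2f });
        Console.WriteLine(BitConverter.ToString(pcm)); // prints "00-00-FF-7F-01-80-FF-7F"

        // A 200-second clip at 16000 Hz holds 3,200,000 samples.
        Console.WriteLine(NewSampleCount(3199000, 1000, 3200000)); // prints "2000"
    }
}
```

When the position has wrapped, the new samples span the end and the start of the clip, so reading them takes two GetData calls (or one call with a wrapped offset) rather than the single call used in the sample.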

Press the Play button in Unity to try the sample.

[MR Dev Tips #15] Azure AI Service - Trying the Speech Service Unity sample
