Building a Voice Dialogue System with Recruit A3RT and Azure

Posted at 2022-12-22 (last updated at 2022-12-22)

This article is the day 22 entry in the Craft Egg Advent Calendar 2022.

Introduction

I'm Suzuki, a Unity client engineer at Craft Egg Inc. For this article, I built a voice dialogue system using Recruit A3RT and the Azure Speech service.

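Demo of the finished product

Because the demo is a GIF, the audio cannot be heard, but typing こんにちは ("Hello") returns こんにちは, and typing いい天気ですね ("Nice weather, isn't it?") returns 晴れてよかったですね ("I'm glad it's sunny"), as both speech and text.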

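Complete code

If you just want to see the code I wrote, here it is.
Note that Text2SpeechSample is taken from the article "Azureの音声合成，音声認識をUnityから利用" (Using Azure speech synthesis and speech recognition from Unity).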

// RequestTalkAPI.cs
using UnityEngine;
using UnityEngine.UI;
using UnityEngine.Networking;
using System;
using System.Collections;
using System.Text;

public class RequestTalkAPI : MonoBehaviour
{
    [SerializeField]
    private InputField inputField;

    [SerializeField]
    private Text2SpeechSample text2SpeechSample;

    const string url = "https://api.a3rt.recruit.co.jp/talk/v1/smalltalk";
    const string apikey = "YOUR_API_KEY";

    public void OnInputMessage()
    {
        var inputMessage = inputField.text;

        StartCoroutine(SendAPI(inputMessage));
    }

    IEnumerator SendAPI(string message)
    {
        WWWForm form = new WWWForm();

        form.AddField("apikey", apikey);
        form.AddField("query", message, Encoding.UTF8);

        using (var request = UnityWebRequest.Post(url, form))
        {

            yield return request.SendWebRequest();

            if (request.result != UnityWebRequest.Result.Success)
            {
                Debug.Log(request.error);
                yield break;
            }

            try
            {
                string itemJson = request.downloadHandler.text;
                var jsnode = JsonUtility.FromJson<ResponseInfo>(itemJson);

                Debug.Log(jsnode.results[0].reply);
                text2SpeechSample.SynthesizeAudioClip(jsnode.results[0].reply);
            }
            catch (Exception e)
            {
                Debug.Log("JsonNode:" + e.Message);
            }
        }
    }
}

[System.Serializable]
public class ResponseInfo
{
    public int status;

    public string message;

    public Reply[] results;
}


[System.Serializable]
public class Reply
{
    public float perplexity;

    public string reply;
}

// Text2SpeechSample.cs
using System.IO;
using UnityEngine;
using UnityEngine.UI;

[RequireComponent(typeof(AudioSource))]
public class Text2SpeechSample : MonoBehaviour
{
    [SerializeField] 
    private InputField _inputField;
    
    [SerializeField]
    private Text text;

    /// <summary>
    /// Set the subscription key and region obtained from Azure.
    /// </summary>
    private const string subscriptionKey = "YOUR_SUBSCRIPTION_KEY";
    private const string region = "japaneast";

    private SpeechSDKHelper speechSDKHelper = new SpeechSDKHelper(subscriptionKey, region);

    public void OnInputTextToSpeech()
    {
        var message = _inputField.text;
        SynthesizeAudioClip(message);
    }
    
    public async void SynthesizeAudioClip(string synthText)
    {
        var clip = await speechSDKHelper.Text2SpeechAudioClip(synthText);
        GetComponent<AudioSource>().PlayOneShot(clip);
        text.text = synthText;
    }

    public async void SynthesizeWAV(string synthText)
    {
        await speechSDKHelper.Text2SpeechWAV(synthText, Path.Combine(Application.dataPath, "test.wav"));
    }
}

Steps

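Briefly, building a dialogue system requires the following steps:

Accept voice input from the user
Convert the speech to text
Create a reply from the input text
Synthesize speech from the reply text
Return the reply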

In this implementation the input is left as text rather than speech, but the Azure Speech service I tried also provides speech-to-text, so voice input could be added with it.
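One way to keep the input stage swappable is to pass each stage in as a delegate. The following is only a minimal sketch of that idea, not the code used in this article: DialoguePipeline and RunTurnAsync are hypothetical names, and the delegates stand in for the input field or a speech recognizer, the Talk API call, and the speech synthesis call.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical glue class (not part of the article's code). Each stage is
// injected, so typed input can later be replaced by speech recognition
// without touching the reply or synthesis stages.
public class DialoguePipeline
{
    private readonly Func<Task<string>> getUserText;          // steps 1-2: get the utterance as text
    private readonly Func<string, Task<string>> createReply;  // step 3: e.g. a Talk API request
    private readonly Func<string, Task> speak;                // steps 4-5: synthesize and play

    public DialoguePipeline(
        Func<Task<string>> getUserText,
        Func<string, Task<string>> createReply,
        Func<string, Task> speak)
    {
        this.getUserText = getUserText;
        this.createReply = createReply;
        this.speak = speak;
    }

    // Runs one conversational turn and returns the reply text,
    // or null when there was no input to respond to.
    public async Task<string> RunTurnAsync()
    {
        string userText = await getUserText();
        if (string.IsNullOrWhiteSpace(userText))
        {
            return null; // nothing typed or recognized, so skip this turn
        }

        string reply = await createReply(userText);
        await speak(reply);
        return reply;
    }
}
```

With this shape, moving from text to voice input would only mean passing a different first delegate.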

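Registering for each API
Issue an API key for the Talk API from the A3RT Talk API site.
- Set the issued API key as apikey in RequestTalkAPI.
Issue a subscription key for Azure Text to Speech.
- Set the issued subscription key as subscriptionKey in Text2SpeechSample.
- Reference: "Azure Cognitive Services の音声サービスで日本語のテキスト読み上げ（ニューラル音声の利用）" (Japanese text-to-speech with the Azure Cognitive Services Speech service, using neural voices)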

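Using the A3RT Talk API

The Talk API is used by sending a POST request to https://api.a3rt.recruit.co.jp/talk/v1/smalltalk. By default the response comes back as JSON, so I parse it with JsonUtility. The results field of ResponseInfo, which holds the reply text, is an array of Reply objects, so you can receive multiple candidate replies, each with its own perplexity (a measure of predictive performance).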

    IEnumerator SendAPI(string message)
    {
        WWWForm form = new WWWForm();

        form.AddField("apikey", apikey);
        form.AddField("query", message, Encoding.UTF8);

        using (var request = UnityWebRequest.Post(url, form))
        {

            yield return request.SendWebRequest();

            if (request.result != UnityWebRequest.Result.Success)
            {
                Debug.Log(request.error);
                yield break;
            }

            try
            {
                string itemJson = request.downloadHandler.text;
                var jsnode = JsonUtility.FromJson<ResponseInfo>(itemJson);

                Debug.Log(jsnode.results[0].reply);
                text2SpeechSample.SynthesizeAudioClip(jsnode.results[0].reply);
            }
            catch (Exception e)
            {
                Debug.Log("JsonNode:" + e.Message);
            }
        }
    }
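
The excerpt above always uses results[0]. Since each candidate carries its own perplexity, one option is to pick the candidate with the lowest value. The following is a minimal sketch, not part of the original implementation. It assumes that a status of 0 means success and that a lower perplexity indicates a better reply. ReplySelector and SelectBestReply are hypothetical names, and the two data classes are repeated here only to keep the sketch self-contained.

```csharp
using System;

[Serializable]
public class Reply
{
    public float perplexity;
    public string reply;
}

[Serializable]
public class ResponseInfo
{
    public int status;
    public string message;
    public Reply[] results;
}

// Hypothetical helper, not part of the article's code.
public static class ReplySelector
{
    // Returns the reply text with the lowest perplexity, or null when the
    // response has no usable results.
    // Assumption: status 0 means success, and lower perplexity is better.
    public static string SelectBestReply(ResponseInfo response)
    {
        if (response == null || response.status != 0 ||
            response.results == null || response.results.Length == 0)
        {
            return null;
        }

        Reply best = response.results[0];
        foreach (Reply candidate in response.results)
        {
            if (candidate.perplexity < best.perplexity)
            {
                best = candidate;
            }
        }
        return best.reply;
    }
}
```

In SendAPI, the result of SelectBestReply(jsnode) could then be checked for null before it is passed to SynthesizeAudioClip, which would also avoid relying on the catch block when results is empty.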

Using Azure Text to Speech

For Azure Text to Speech, I used SpeechSDKHelper. Because it returns the conversion result as an AudioClip, it can be used without any special implementation.
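
The internals of SpeechSDKHelper are not reproduced in this article. As a rough illustration of the kind of conversion needed to turn synthesized audio into an AudioClip, here is a minimal sketch that assumes the synthesizer returns raw 16-bit little-endian PCM. PcmConverter and Pcm16ToFloat are hypothetical names, and this is not necessarily how SpeechSDKHelper works.

```csharp
using System;

// Hypothetical helper, not taken from SpeechSDKHelper.
public static class PcmConverter
{
    // Converts 16-bit little-endian PCM bytes into float samples in [-1, 1),
    // which is the sample format Unity's AudioClip.SetData expects.
    public static float[] Pcm16ToFloat(byte[] pcm)
    {
        if (pcm == null) throw new ArgumentNullException(nameof(pcm));

        int sampleCount = pcm.Length / 2; // a trailing odd byte is ignored
        var samples = new float[sampleCount];
        for (int i = 0; i < sampleCount; i++)
        {
            // Combine low and high bytes, then reinterpret as a signed 16-bit value.
            short value = unchecked((short)(pcm[2 * i] | (pcm[2 * i + 1] << 8)));
            samples[i] = value / 32768f;
        }
        return samples;
    }
}
```

In Unity, the resulting array could be handed to a clip created with AudioClip.Create and filled with SetData. The channel count and sample rate passed to AudioClip.Create would have to match the output format requested from the speech service.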

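References

"Azureの音声合成，音声認識をUnityから利用" (Using Azure speech synthesis and speech recognition from Unity)
SpeechSDKHelper
"Azure Cognitive Services の音声サービスで日本語のテキスト読み上げ（ニューラル音声の利用）" (Japanese text-to-speech with the Azure Cognitive Services Speech service, using neural voices)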

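Conclusion

I built a voice dialogue system using Recruit A3RT and the Azure Speech service.

This time I implemented both reply generation and text-to-speech with external APIs, but now that running AI models locally is becoming practical, doing everything on-device looks feasible. With Unity Barracuda, you can use trained machine learning models exported in the ONNX format. I would also like to try integrating software such as CeVIO AI for the speech output.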
