[Unity × C# × Whisper (OpenAI)] Send a WAV file to Whisper and get the text back

Posted at 2023-12-23

Overview

This is a demo that sends a pre-recorded WAV file (audio data) to the Whisper API and gets the transcribed text back.

Development environment

Windows 10
Unity 2019.4.31f1
Api Compatibility Level .NET Standard 2.0

Packages used

UniTask

Download the unitypackage from the link below and import it into the project beforehand.
https://github.com/Cysharp/UniTask

System.Text.Json

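Used for parsing JSON. Download the package from the link below, create a Plugins folder under Unity's Assets folder, and put the dll file in it. (The .nupkg file is a zip archive; with the Api Compatibility Level above, the dll under lib/netstandard2.0 should be the one to use.)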
https://www.nuget.org/packages/System.Text.Json/

In the same way, add the dlls of the following dependency packages as well.
If you forget one, Unity reports it as an error.
https://www.nuget.org/packages/System.Threading.Tasks.Extensions
https://www.nuget.org/packages/System.Text.Encodings.Web
https://www.nuget.org/packages/System.Runtime.CompilerServices.Unsafe
https://www.nuget.org/packages/System.Memory
https://www.nuget.org/packages/System.Buffers
https://www.nuget.org/packages/Microsoft.Bcl.AsyncInterfaces

Package management tools such as UnityNuget and NugetForUnity can apparently install these too, but I did not use them this time.

Implementation

Details are given as comments in the scripts.

OpenAIModel.cs

This defines the model class for the JSON data received in the Whisper response.

namespace OpenAI.QueryJson
{
    using System;
    using System.Text.Json.Serialization;

    [Serializable]
    public class WhisperResponseModel
    {
        [JsonPropertyName("text")]
        public string text { get; set; }
    }
}
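
As a rough illustration of how this model is used, the sketch below deserializes a sample response body. The JSON string and the `DeserializeSketch` class are made up for illustration, and the model class is repeated here only so that the snippet is self-contained.

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Serialization;

// Copy of the model above, repeated so this sketch compiles on its own.
public class WhisperResponseModel
{
    [JsonPropertyName("text")]
    public string text { get; set; }
}

// Hypothetical helper class, not part of the original scripts.
public static class DeserializeSketch
{
    // Returns the "text" field, or an empty string when it is missing.
    public static string ExtractText(string json)
    {
        var model = JsonSerializer.Deserialize<WhisperResponseModel>(json);
        return model?.text ?? string.Empty;
    }

    public static void Main()
    {
        // Sample body, assumed to have the shape {"text": "..."}.
        string sample = "{\"text\":\"Hello, world.\"}";
        Console.WriteLine(ExtractText(sample));
    }
}
```

Fields in the JSON that the model does not declare are ignored by default, so the class only needs the properties you actually want to read.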

WhisperRequest.cs

Note that this targets a Unity 2019 environment.
Details are in the comments.

using System;
using UnityEngine;
using UnityEngine.Networking;
using Cysharp.Threading.Tasks;
using System.IO;
using System.Collections.Generic;
using OpenAI.QueryJson;
using System.Text.Json;

public class WhisperResponse
{
    public async UniTask<string> RequestToWhisper(string openai_api_key, string filePath)
    {
        string url = "https://api.openai.com/v1/audio/transcriptions";
        string model = "whisper-1";
        // The Content-Type header (multipart/form-data) is set automatically by UnityWebRequest.
        // string contentType = "multipart/form-data";

        // Read the file into a byte array.
        byte[] fileData = File.ReadAllBytes(filePath);

        // Build the multipart form data.
        List<IMultipartFormSection> formData = new List<IMultipartFormSection>();
        formData.Add(new MultipartFormDataSection("model", model));
        formData.Add(new MultipartFormFileSection("file", fileData, Path.GetFileName(filePath), "audio/wav"));

        // Create the request. The using block disposes it once we are done.
        using (UnityWebRequest request = UnityWebRequest.Post(url, formData))
        {
            // Set the download handler and the Authorization header.
            request.downloadHandler = new DownloadHandlerBuffer();
            request.SetRequestHeader("Authorization", $"Bearer {openai_api_key}");

            // Send the request and wait for the response.
            await request.SendWebRequest();
            // Debug.Log("HTTP Status Code: " + request.responseCode);

            // This uses the isNetworkError and isHttpError properties.
            // On Unity 2020.1 or later, UnityWebRequest has a result property; use that instead.
            if (request.isNetworkError || request.isHttpError)
            {
                Debug.LogError(request.error);
                throw new Exception("Whisper request failed: " + request.error);
            }

            var responseString = request.downloadHandler.text;
            // Debug.Log("Response: " + responseString);
            try
            {
                // WhisperResponseModel is defined in a separate file; deserialize into it and extract the text.
                var responseObject = JsonSerializer.Deserialize<WhisperResponseModel>(responseString);
                // Return the text property if responseObject is not null; otherwise return string.Empty.
                return responseObject?.text ?? string.Empty;
            }
            catch (JsonException ex)
            {
                Debug.LogError("JSON Parse Error: " + ex.Message);
                return string.Empty;
            }
        }
    }
}
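
The comment in the script notes that Unity 2020.1 and later provide a `result` property in place of `isNetworkError` and `isHttpError`. A minimal sketch of a check that compiles on both sides of that boundary could look like the following. The `WebRequestStatus` class and `HasFailed` method are hypothetical names introduced here; they are not part of the original script.

```csharp
using UnityEngine.Networking;

// Hypothetical helper, introduced only for this sketch.
public static class WebRequestStatus
{
    // Returns true when the finished request ended in an error state.
    public static bool HasFailed(UnityWebRequest request)
    {
#if UNITY_2020_1_OR_NEWER
        // Unity 2020.1+: inspect the result enum.
        return request.result == UnityWebRequest.Result.ConnectionError
            || request.result == UnityWebRequest.Result.ProtocolError
            || request.result == UnityWebRequest.Result.DataProcessingError;
#else
        // Older versions such as Unity 2019: use the boolean properties.
        return request.isNetworkError || request.isHttpError;
#endif
    }
}
```

With a helper like this, the condition in `RequestToWhisper` could be written as `if (WebRequestStatus.HasFailed(request))`, which keeps the version-specific branching in one place.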

DemoRequestWhisper.cs

This demo script loads a WAV file prepared in advance and gets the text from Whisper.
It relies on a few assumptions, so check the comments for details.

using UnityEngine;

public class DemoRequestWhisper : MonoBehaviour
{
    // Set your OpenAI API key in the Unity Inspector beforehand.
    [SerializeField] private string openai_api_key = "openai_api_key";

    // Start is called before the first frame update
    async void Start()
    {
        // Path of the WAV file to load.
        var filePath = string.Format("{0}/{1}/{2}", Application.persistentDataPath, "recordings", "recordedAudio.wav");
        // The recordedAudio.wav file must actually be placed at the filePath logged here.
        Debug.Log("filePath: " + filePath);

        // Send the request to Whisper and convert the audio to text.
        WhisperResponse whisperResponse = new WhisperResponse();
        string response = await whisperResponse.RequestToWhisper(openai_api_key, filePath);
        Debug.Log("WhisperResponse: " + response);
    }

    // Update is called once per frame
    void Update() { }
}

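Usage

1. Prepare the WAV file you want to load in advance.
   If you follow the demo script as is, put the WAV file (named recordedAudio.wav) in the following folder:
   C:/Users/[User Name]/AppData/LocalLow/DefaultCompany/[Unity Project Name]/recordings
   If that does not work, run Play mode once (it fails because the file is missing), check the file path shown in the Console window, and put the WAV file there.
2. Launch Unity.
3. Create a GameObject such as a Cube.
4. Add DemoRequestWhisper.cs as a component of that GameObject.
5. Set your OpenAI API key in the "openai_api_key" field in the Inspector.
6. Open the Console window and enter Play mode.
7. If the text appears after a few seconds, it worked.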
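
To avoid the trial run that fails on a missing file, the demo could check for the file before sending the request. The sketch below is one possible way to do that, assuming the same recordings/recordedAudio.wav layout as the demo script. `RecordingFileCheck` and its methods are hypothetical names, and the base directory is passed in as a parameter so the snippet does not depend on Unity.

```csharp
using System;
using System.IO;

// Hypothetical helper, introduced only for this sketch.
public static class RecordingFileCheck
{
    // Builds <baseDir>/recordings/recordedAudio.wav, mirroring the demo script's layout.
    public static string BuildPath(string baseDir)
    {
        return Path.Combine(baseDir, "recordings", "recordedAudio.wav");
    }

    // Creates the recordings folder if needed and reports whether the WAV file is present.
    public static bool EnsureFolderAndCheckFile(string baseDir, out string filePath)
    {
        filePath = BuildPath(baseDir);
        Directory.CreateDirectory(Path.GetDirectoryName(filePath));
        return File.Exists(filePath);
    }
}
```

In Unity, `baseDir` would be `Application.persistentDataPath`. When the method returns false, the script can log `filePath` and return early instead of calling Whisper, and the recordings folder will already exist for you to drop the file into.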

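Supplement: Implementing HTTP communication in Unity

Before implementing the HTTP communication, I looked into which APIs are available,
because several exist depending on the .NET Framework and Unity versions.
A quick survey shows that the following are in use.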

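WWW: has been around for a long time (marked obsolete in current Unity versions)
HttpWebRequest: the standard C# API
WebClient: a wrapper class that makes HttpWebRequest easier to use
HttpClient: added in .NET Framework 4.5 (currently the newest?)
UnityWebRequest: added in the Unity 5.4 series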

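Each has its pros and cons, so the choice depends on the use case, but for use in Unity it seems best to use UnityWebRequest unless you have a specific reason not to.
That is why I chose UnityWebRequest this time.

If you need HTTP/2 for some particular reason, YetAnotherHttpHandler (used together with HttpClient) appears to provide it.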
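
For comparison with the UnityWebRequest version in this article, here is a minimal sketch of the same request written with HttpClient. This is not what the article uses. The `WhisperHttpClientSketch` class and its method names are hypothetical, and it assumes that System.Net.Http is available in your Unity project.

```csharp
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

// Hypothetical alternative, introduced only for comparison.
public static class WhisperHttpClientSketch
{
    // A single shared HttpClient instance is reused across requests.
    private static readonly HttpClient Client = new HttpClient();

    // Builds the multipart body with the same "model" and "file" fields as the article's script.
    public static MultipartFormDataContent BuildForm(byte[] wavBytes, string fileName)
    {
        var form = new MultipartFormDataContent();
        form.Add(new StringContent("whisper-1"), "model");

        var file = new ByteArrayContent(wavBytes);
        file.Headers.ContentType = new MediaTypeHeaderValue("audio/wav");
        form.Add(file, "file", fileName);
        return form;
    }

    // Sends the WAV file and returns the raw JSON response body.
    public static async Task<string> TranscribeAsync(string apiKey, string filePath)
    {
        const string url = "https://api.openai.com/v1/audio/transcriptions";
        using (var request = new HttpRequestMessage(HttpMethod.Post, url))
        {
            request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", apiKey);
            // Disposing the request also disposes this content.
            request.Content = BuildForm(File.ReadAllBytes(filePath), Path.GetFileName(filePath));

            using (var response = await Client.SendAsync(request))
            {
                // Throws HttpRequestException for non-success status codes.
                response.EnsureSuccessStatusCode();
                return await response.Content.ReadAsStringAsync();
            }
        }
    }
}
```

The returned string is the same JSON that the article's script deserializes into `WhisperResponseModel`, so the parsing step would stay unchanged.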
