
How to use docomo's speech synthesis API in Unity

Last updated at 2019-03-31, posted at 2017-07-06

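I could not find any write-up on using docomo's speech synthesis API from Unity, so I experimented until I got it working.
This article documents how to do it.

Use it when, for example, you want a character in Unity to speak with synthesized speech.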

Environment

Unity 2017
macOS 10.13

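Implementation

About docomo's speech synthesis APIs

docomo provides three speech synthesis APIs.

AI (エーアイ)
HOYA
NTT TechnoCross

I used the AI API this time,
because the service terms in its guidelines seemed the least restrictive.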

The AI API accepts input data in three notations.

SSML
Intermediate language (AI kana)
JEITA kana

I used SSML this time.
SSML stands for Speech Synthesis Markup Language.
It is an XML-based markup language.

To use docomo's speech synthesis API, you need to create an account at docomo Developer support and apply for an API key there.

From sending the request to receiving the response

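// This fragment runs inside a coroutine (an IEnumerator method) that receives the string `text` to speak.
// It needs: using System.Collections.Generic; using UnityEngine;
string apikey = "YOUR_APIKEY";
string url = "https://api.apigw.smt.docomo.ne.jp/aiTalk/v1/textToSpeech?APIKEY=" + apikey;

// Extra synthesis parameters; createSSML does not use them in this example.
Dictionary<string, string> aiTalksParams = new Dictionary<string, string>();

// Build the SSML request body and encode it as UTF-8.
var postData = createSSML(text, aiTalksParams);
var data = System.Text.Encoding.UTF8.GetBytes(postData);

Dictionary<string, string> headers = new Dictionary<string, string>();
headers["Content-Type"] = "application/ssml+xml";
headers["Accept"] = "audio/L16";
headers["Content-Length"] = data.Length.ToString();

// POST the request and wait until the response arrives.
WWW www = new WWW(url, data, headers);
yield return www;

if (!string.IsNullOrEmpty(www.error))
{
	Debug.LogError(www.error);
	yield break;
}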

The code above sends a request to docomo's speech synthesis API and stores the response in www.
It is based on https://teratail.com/questions/76058.
It runs as a coroutine.
www.bytes holds the binary audio data.

createSSML is a method that builds and returns the SSML.

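public string createSSML(string text, Dictionary<string, string> dic)
{
    // Escape XML special characters (&, <, >, ", ') so the text cannot break the markup.
    string escaped = System.Security.SecurityElement.Escape(text);
    // dic is reserved for voice parameters and is not used in this example.
    return "<?xml version=\"1.0\" encoding=\"utf-8\" ?><speak version=\"1.1\"><voice name=\"maki\"><prosody pitch=\"1.5\" rate=\"0.85\">" + escaped + " </prosody></voice></speak>";
}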

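The above is just one example.
You can also adjust the voice, the volume, and other properties of the synthesized speech.
See the API documentation for details.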
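As an illustration of adjusting those settings, here is a minimal sketch that builds the SSML from a dictionary of prosody attributes instead of a hard-coded string. The class name SsmlSketch and the method Build are hypothetical, the code assumes System.Xml.Linq is available at your Unity API compatibility level, and which attributes and voice names the API actually accepts should be checked against the API documentation.

```csharp
using System.Collections.Generic;
using System.Xml.Linq;

// Hypothetical helper, not part of the original article.
public static class SsmlSketch
{
    // Builds an SSML document. XElement escapes special characters
    // in both the text and the attribute values.
    public static string Build(string text, Dictionary<string, string> prosody, string voice = "maki")
    {
        var prosodyElement = new XElement("prosody", text);
        foreach (var pair in prosody)
        {
            // For example pitch, rate, or volume (assumed attribute names).
            prosodyElement.SetAttributeValue(pair.Key, pair.Value);
        }

        var speak = new XElement("speak",
            new XAttribute("version", "1.1"),
            new XElement("voice", new XAttribute("name", voice), prosodyElement));

        // DisableFormatting keeps the output on one line with no added whitespace.
        return "<?xml version=\"1.0\" encoding=\"utf-8\" ?>"
            + speak.ToString(SaveOptions.DisableFormatting);
    }
}
```

Calling SsmlSketch.Build(text, aiTalksParams) would then let the aiTalksParams dictionary from the request code carry values such as pitch and rate.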

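Endian conversion

Next, convert the endianness of the binary audio data.
The audio/L16 response is 16-bit PCM in big-endian byte order, while WAV stores samples in little-endian order, so each pair of bytes has to be swapped.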

private byte[] convertBytesEndian(byte[] bytes)
{
	byte[] newBytes = new byte[bytes.Length];
	// Swap each pair of bytes; stop before a trailing odd byte so i + 1 stays in range.
	for (int i = 0; i + 1 < bytes.Length; i += 2)
	{
		newBytes[i] = bytes[i + 1];
		newBytes[i + 1] = bytes[i];
	}
	// Prepend the 44-byte WAV header to newBytes.
	newBytes = addWAVHeader(newBytes);
	return newBytes;
}

This code is based on https://teratail.com/questions/76058.
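To see why the swap matters, the following sketch decodes the same two bytes before and after swapping. The names EndianSketch, SwapBytePairs, and ReadInt16LittleEndian are hypothetical and only serve this illustration.

```csharp
// Hypothetical helper, not part of the original article.
public static class EndianSketch
{
    // Returns a copy with every byte pair swapped (big-endian to little-endian).
    // A trailing odd byte, if any, is left as zero.
    public static byte[] SwapBytePairs(byte[] bigEndian)
    {
        byte[] littleEndian = new byte[bigEndian.Length];
        for (int i = 0; i + 1 < bigEndian.Length; i += 2)
        {
            littleEndian[i] = bigEndian[i + 1];
            littleEndian[i + 1] = bigEndian[i];
        }
        return littleEndian;
    }

    // Decodes one signed 16-bit sample the way a WAV reader would (low byte first).
    public static short ReadInt16LittleEndian(byte[] bytes, int offset)
    {
        return (short)(bytes[offset] | (bytes[offset + 1] << 8));
    }
}
```

The big-endian bytes 0x12 0x34 represent the sample 0x1234. Read as little-endian without swapping they decode to 0x3412, which is a different amplitude; after SwapBytePairs they decode to 0x1234 as intended.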

Adding a WAV header

Next, prepend a WAV header to the audio data.

private byte[] addWAVHeader(byte[] bytes)
{
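	byte[] header = new byte[44];
	// Sampling rate in Hz
	long longSampleRate = 16000;
	// Number of channels
	int channels = 1;
	// Bits per sample
	int bits = 16;
	// Data rate in bytes per second
	long byteRate = longSampleRate * (bits / 8) * channels;
	long dataLength = bytes.Length;
	// RIFF chunk size: total file size minus the 8 bytes of "RIFF" and this field
	long totalDataLen = dataLength + 36;
	// Final WAV file bytes (header followed by audio data)
	byte[] finalWAVBytes = new byte[bytes.Length + header.Length];
	// Size of one array element in bytes (1 for byte[])
	int typeSize = System.Runtime.InteropServices.Marshal.SizeOf(bytes.GetType().GetElementType());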

	// RIFF chunk descriptor
	header[0] = convertByte("R");
	header[1] = convertByte("I");
	header[2] = convertByte("F");
	header[3] = convertByte("F");
	header[4] = (byte)(totalDataLen & 0xff);
	header[5] = (byte)((totalDataLen >> 8) & 0xff);
	header[6] = (byte)((totalDataLen >> 16) & 0xff);
	header[7] = (byte)((totalDataLen >> 24) & 0xff);
	header[8] = convertByte("W");
	header[9] = convertByte("A");
	header[10] = convertByte("V");
	header[11] = convertByte("E");
	// "fmt " sub-chunk
	header[12] = convertByte("f");
	header[13] = convertByte("m");
	header[14] = convertByte("t");
	header[15] = convertByte(" ");
	// Size of the fmt sub-chunk (16 for PCM)
	header[16] = 16;
	header[17] = 0;
	header[18] = 0;
	header[19] = 0;
	// Audio format (1 = linear PCM)
	header[20] = 1;
	header[21] = 0;
	header[22] = (byte)channels;
	header[23] = 0;
	header[24] = (byte)(longSampleRate & 0xff);
	header[25] = (byte)((longSampleRate >> 8) & 0xff);
	header[26] = (byte)((longSampleRate >> 16) & 0xff);
	header[27] = (byte)((longSampleRate >> 24) & 0xff);
	header[28] = (byte)(byteRate & 0xff);
	header[29] = (byte)((byteRate >> 8) & 0xff);
	header[30] = (byte)((byteRate >> 16) & 0xff);
	header[31] = (byte)((byteRate >> 24) & 0xff);
	// Block align: bytes per sample frame across all channels
	header[32] = (byte)((bits / 8) * channels);
	header[33] = 0;
	header[34] = (byte)bits;
	header[35] = 0;
	// "data" sub-chunk, followed by the length of the audio data
	header[36] = convertByte("d");
	header[37] = convertByte("a");
	header[38] = convertByte("t");
	header[39] = convertByte("a");
	header[40] = (byte)(dataLength & 0xff);
	header[41] = (byte)((dataLength >> 8) & 0xff);
	header[42] = (byte)((dataLength >> 16) & 0xff);
	header[43] = (byte)((dataLength >> 24) & 0xff);

	System.Buffer.BlockCopy(header, 0, finalWAVBytes, 0, header.Length * typeSize);
	System.Buffer.BlockCopy(bytes, 0, finalWAVBytes, header.Length * typeSize, bytes.Length * typeSize);

	return finalWAVBytes;
}

private byte convertByte(string str)
{
    return System.Text.Encoding.UTF8.GetBytes(str)[0];
}

This code is based on http://sky.geocities.jp/kmaedam/directx9/waveform.html and http://qiita.com/tkinjo1/items/a1cf73a471f06ab3ff65.
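For comparison, the same 44-byte header can be written more compactly with BinaryWriter, which emits integers in little-endian order as the WAV format requires. This is only a sketch of an alternative, not the article's method; the names WavHeaderSketch and AddHeader are hypothetical, and the defaults mirror the 16 kHz, mono, 16-bit values used above.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper, not part of the original article.
public static class WavHeaderSketch
{
    // Returns a new array: 44-byte PCM WAV header followed by the PCM data.
    public static byte[] AddHeader(byte[] pcm, int sampleRate = 16000, short channels = 1, short bitsPerSample = 16)
    {
        short blockAlign = (short)(channels * bitsPerSample / 8); // bytes per sample frame
        int byteRate = sampleRate * blockAlign;                   // bytes per second

        using (var stream = new MemoryStream(44 + pcm.Length))
        using (var writer = new BinaryWriter(stream))
        {
            writer.Write(Encoding.ASCII.GetBytes("RIFF"));
            writer.Write(36 + pcm.Length);   // RIFF chunk size (4-byte int)
            writer.Write(Encoding.ASCII.GetBytes("WAVE"));
            writer.Write(Encoding.ASCII.GetBytes("fmt "));
            writer.Write(16);                // fmt sub-chunk size for PCM
            writer.Write((short)1);          // audio format: linear PCM
            writer.Write(channels);
            writer.Write(sampleRate);
            writer.Write(byteRate);
            writer.Write(blockAlign);
            writer.Write(bitsPerSample);
            writer.Write(Encoding.ASCII.GetBytes("data"));
            writer.Write(pcm.Length);        // data sub-chunk size
            writer.Write(pcm);               // raw bytes, no length prefix
            writer.Flush();
            return stream.ToArray();
        }
    }
}
```

The type of each argument selects the BinaryWriter.Write overload, so the short fields occupy 2 bytes and the int fields 4 bytes, which is what keeps the header at exactly 44 bytes.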

Generating an AudioClip

To play audio in Unity, the audio data has to be converted into an AudioClip.
I found source code that can generate an AudioClip from binary audio data, so I will introduce it here.

http://posposi.blog.fc2.com/blog-entry-245.html
https://github.com/Suzeep/audioclip_maker

Only AudioClipMaker.cs is needed here.
Make one small change to this source code. Replace

AudioClip clip = AudioClip.Create( name, samples, channels, frequency, is3D, isStream );

with

AudioClip clip = AudioClip.Create( name, samples, channels, frequency, isStream );

because the 3D-related argument of AudioClip.Create has been deprecated.

Use this source code as follows.

AudioClip clip = Create(name, wavBytes, 44, 16, Samples, 1, 16000, false, false);

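name is the name of the clip.
It is not used afterwards, so any value is fine.
wavBytes is the binary audio data with the WAV header prepended.
Samples is half the size in bytes of the original audio data (without the header), because each 16-bit sample takes 2 bytes.

The AudioClip is now ready, so all that remains is to create an AudioSource and play it.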
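The factor of one half comes from the sample width: 16-bit PCM uses 2 bytes per sample. The sketch below shows that arithmetic and one common way to turn the bytes after the header into the float samples Unity works with. It is an assumption about the general technique, not a copy of AudioClipMaker.cs, and the names PcmSketch and ToFloatSamples are hypothetical.

```csharp
// Hypothetical helper, not part of the original article.
public static class PcmSketch
{
    // Converts little-endian 16-bit PCM (after an optional header) to floats in [-1, 1).
    public static float[] ToFloatSamples(byte[] wavBytes, int headerSize = 44)
    {
        // Two bytes per sample, so the sample count is half the data size.
        int sampleCount = (wavBytes.Length - headerSize) / 2;
        float[] samples = new float[sampleCount];
        for (int i = 0; i < sampleCount; i++)
        {
            int offset = headerSize + i * 2;
            // Low byte first, then high byte; the cast restores the sign.
            short value = (short)(wavBytes[offset] | (wavBytes[offset + 1] << 8));
            samples[i] = value / 32768f;
        }
        return samples;
    }
}
```

In Unity, an array like this could be handed to AudioClip.SetData on a clip created with the same sample count.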

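public void Play(AudioClip clip, int samples)
{
	// Reuse an AudioSource already on this GameObject; add one only if none exists.
	AudioSource audio = gameObject.GetComponent<AudioSource>();
	if (audio == null)
	{
		audio = gameObject.AddComponent<AudioSource>();
	}
	// Copies the clip's samples into rawData; playback itself does not use this array.
	float[] rawData = new float[samples * clip.channels];
	clip.GetData(rawData, 0);

	audio.clip = clip;
	audio.Play();
}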

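Closing

The implementation took quite a while,
because I had to search all over for information.
Still, I am glad I got it working.

If you have any questions, feel free to leave a comment.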
