
Stream processing with the Azure Text-to-Speech API - Node.js

Last updated at 2023-12-03, posted at 2023-12-03

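I worked out how to call the TTS (text-to-speech) API of the Azure Speech Service and process the received audio data incrementally as a stream, so here is how. ~~(Why put it that way? Because MS has not published detailed documentation. Do your job, MS!)~~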

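Prerequisites

Prepare an environment where you can use the Azure Speech Service.
For how to get started with speech synthesis in the Speech Service, see here. (It is in C#, though...)

Conclusion

Pass an instance of a class that extends PushAudioOutputStreamCallback as the stream argument of SpeechSynthesizer.speakTextAsync. In addition, do not pass an AudioConfig to the SpeechSynthesizer constructor.

The details are explained below.

Full code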

import * as MsTtsSdk from "microsoft-cognitiveservices-speech-sdk";

// Method of the class that owns this.array (the array that collects the audio chunks).
async TTS(text) {
    // API configuration. This does not have to be done on every call; it can be reused.
    //var audioFileName = "temp_tts_" + uuidv4() + '.wav';
    const speechConfig = MsTtsSdk.SpeechConfig.fromSubscription(process.env.SPEECH_KEY, process.env.SPEECH_REGION);
    //const audioConfig = MsTtsSdk.AudioConfig.fromAudioFileOutput(audioFileName);

    // Language, voice, and output format
    speechConfig.speechSynthesisLanguage = "ja-JP";
    speechConfig.speechSynthesisVoiceName = "ja-JP-NanamiNeural";
    speechConfig.speechSynthesisOutputFormat = MsTtsSdk.SpeechSynthesisOutputFormat.Riff8Khz8BitMonoMULaw;

    // Instantiate the speech synthesizer (no AudioConfig when streaming)
    var synthesizer = new MsTtsSdk.SpeechSynthesizer(speechConfig);
    // Instantiate the callback class that processes the streamed data
    let callback = new SamplePushAudioOutputStreamCallback(this.array);

    // speakTextAsync is declared as returning void, so there is nothing to await;
    // the outcome is reported through the two callbacks below.
    synthesizer.speakTextAsync(
        text,
        function (result) {
            if (result.reason === MsTtsSdk.ResultReason.SynthesizingAudioCompleted) {
                console.log("synthesis finished.");
            } else {
                console.error("Speech synthesis canceled, " + result.errorDetails +
                    "\nDid you set the speech resource key and region values?");
            }
            synthesizer.close();
            synthesizer = null;
        },
        function (err) {
            console.trace("err - " + err);
            synthesizer.close();
            synthesizer = null;
        },
        callback);
}

class SamplePushAudioOutputStreamCallback extends MsTtsSdk.PushAudioOutputStreamCallback {
    constructor(array) {
        super();
        this.array = array;
    }

    write(data) {
        // Process each streamed chunk here
        console.log(`Received ${data.byteLength} bytes of audio data.`);

        // Encode to base64 and append to the array (Array has push(), not add())
        let buffer = Buffer.from(data);
        let base64 = buffer.toString('base64');
        this.array.push(base64);
    }

    close() {
        console.log('Stream closed.');
    }
}

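Explanation

API configuration

Here the API configuration is built from the SPEECH_KEY and SPEECH_REGION environment variables.

Several voices are available for Japanese. See the official reference for the list.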

//var audioFileName = "temp_tts_" + uuidv4() + '.wav';
const speechConfig = MsTtsSdk.SpeechConfig.fromSubscription(process.env.SPEECH_KEY, process.env.SPEECH_REGION);
//const audioConfig = MsTtsSdk.AudioConfig.fromAudioFileOutput(audioFileName);

// Language, voice, and output format
speechConfig.speechSynthesisLanguage = "ja-JP";
speechConfig.speechSynthesisVoiceName = "ja-JP-NanamiNeural";
speechConfig.speechSynthesisOutputFormat = MsTtsSdk.SpeechSynthesisOutputFormat.Riff8Khz8BitMonoMULaw;
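
Since the configuration depends on two environment variables, it can help to fail early with a clear message when one of them is missing. The following is only a minimal sketch: `requireEnv` is a hypothetical helper of my own, not part of the Speech SDK.

```javascript
// Hypothetical helper (not part of the Speech SDK): read a required
// environment variable and throw a descriptive error if it is missing or blank.
function requireEnv(name, env = process.env) {
  const value = env[name];
  if (typeof value !== "string" || value.trim() === "") {
    throw new Error(`Environment variable ${name} is not set`);
  }
  return value;
}

// Intended use with the configuration above:
//   const speechConfig = MsTtsSdk.SpeechConfig.fromSubscription(
//     requireEnv("SPEECH_KEY"), requireEnv("SPEECH_REGION"));
```

The second parameter exists only so that the helper can be tried out with a plain object instead of the real environment.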

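Instantiating the classes

Instantiate the SpeechSynthesizer class and the class that extends PushAudioOutputStreamCallback.

When saving to a file, you pass an AudioConfig that specifies the file name to the SpeechSynthesizer constructor. When processing the audio as a stream, you must not pass it.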

// Instantiate the speech synthesizer (no AudioConfig when streaming)
var synthesizer = new MsTtsSdk.SpeechSynthesizer(speechConfig);
// Instantiate the callback class that processes the streamed data
let callback = new SamplePushAudioOutputStreamCallback(this.array);

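The class that extends PushAudioOutputStreamCallback

write() takes an ArrayBuffer as its argument; this is where the binary audio data arrives, chunk by chunk. In this example each chunk is converted to base64 and appended to an array prepared in advance.

close() is called when the stream ends.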

class SamplePushAudioOutputStreamCallback extends MsTtsSdk.PushAudioOutputStreamCallback {
    constructor(array) {
        super();
        this.array = array;
    }

    write(data) {
        // Process each streamed chunk here
        console.log(`Received ${data.byteLength} bytes of audio data.`);

        // Encode to base64 and append to the array (Array has push(), not add())
        let buffer = Buffer.from(data);
        let base64 = buffer.toString('base64');
        this.array.push(base64);
    }

    close() {
        console.log('Stream closed.');
    }
}
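
One thing to watch when each chunk is base64-encoded separately: the resulting strings should also be decoded separately. Joining the strings first could leave base64 padding (`=`) in the middle of the data, which not every decoder accepts. A minimal sketch of the round trip, using stand-in bytes instead of real synthesized audio (the function names are my own):

```javascript
// Encode one audio chunk (an ArrayBuffer, as passed to write()) to base64.
function chunkToBase64(arrayBuffer) {
  return Buffer.from(arrayBuffer).toString("base64");
}

// Rebuild the complete audio from the per-chunk base64 strings.
// Each string is decoded on its own and the raw bytes are concatenated.
function base64ChunksToBuffer(base64Chunks) {
  return Buffer.concat(base64Chunks.map((s) => Buffer.from(s, "base64")));
}

// Stand-in data in place of real audio chunks.
const demoChunks = [
  new Uint8Array([1, 2, 3, 4]).buffer,
  new Uint8Array([5, 6]).buffer,
];
const demoEncoded = demoChunks.map(chunkToBase64);
const demoAudio = base64ChunksToBuffer(demoEncoded);
console.log(demoAudio.length); // 6
```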

PushAudioOutputStreamCallback is an abstract class that requires write() and close() to be implemented.

export declare abstract class PushAudioOutputStreamCallback {
    /**
     * Writes audio data into the data buffer.
     * @member PushAudioOutputStreamCallback.prototype.write
     * @function
     * @public
     * @param {ArrayBuffer} dataBuffer - The byte array that stores the audio data to write.
     */
    abstract write(dataBuffer: ArrayBuffer): void;
    /**
     * Closes the audio output stream.
     * @member PushAudioOutputStreamCallback.prototype.close
     * @function
     * @public
     */
    abstract close(): void;
}
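
Because only write() and close() are required, the subclass can simply forward each chunk to whatever handler you like instead of storing base64 in an array. The sketch below is an assumption-laden illustration, not the author's code: `defineForwardingCallback`, `onChunk`, and `onClose` are hypothetical names, and the base class is passed in as a parameter so the example runs without the SDK. In real use you would pass `MsTtsSdk.PushAudioOutputStreamCallback` as `Base`.

```javascript
// Build a callback class on top of the given base class.
// In real use, Base would be MsTtsSdk.PushAudioOutputStreamCallback.
function defineForwardingCallback(Base) {
  return class ForwardingCallback extends Base {
    constructor(onChunk, onClose = () => {}) {
      super();
      this.onChunk = onChunk;   // hypothetical handler for each chunk
      this.onClose = onClose;   // hypothetical handler for end of stream
      this.totalBytes = 0;
    }

    // Receives one ArrayBuffer of audio data per call.
    write(dataBuffer) {
      this.totalBytes += dataBuffer.byteLength;
      this.onChunk(dataBuffer);
    }

    // Reports the number of bytes seen when the stream is closed.
    close() {
      this.onClose(this.totalBytes);
    }
  };
}

// Demo with a stand-in base class and fake chunks.
class StandInBase {}
const DemoCallback = defineForwardingCallback(StandInBase);
const demoCallback = new DemoCallback(
  (chunk) => console.log(`chunk of ${chunk.byteLength} bytes`),
  (total) => console.log(`total bytes: ${total}`)
);
demoCallback.write(new ArrayBuffer(320));
demoCallback.write(new ArrayBuffer(160));
demoCallback.close(); // logs "total bytes: 480"
```

Passing the base class in keeps the subclass a genuine descendant of whatever class is supplied, which is why the article's approach of extending PushAudioOutputStreamCallback carries over unchanged.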

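Passing the callback as an argument

SpeechSynthesizer.speakTextAsync takes the following arguments.

First: the text to synthesize.

Second: called when the API call succeeds. The synthesis can still end in an error for some reason, so the result has to be checked as in the code above.

Third: called when the API call could not be made at all.

Fourth: an instance of a class that extends PushAudioOutputStreamCallback.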


// Declared as returning void; the outcome is reported through the callbacks.
synthesizer.speakTextAsync(text, SuccessCallback, ErrorCallback, StreamCallback);

According to the library's type definitions, an AudioOutputStream or a PathLike also appears to be accepted as the fourth argument.

/**
 * Executes speech synthesis on plain text.
 * The task returns the synthesis result.
 * @member SpeechSynthesizer.prototype.speakTextAsync
 * @function
 * @public
 * @param text - Text to be synthesized.
 * @param cb - Callback that received the SpeechSynthesisResult.
 * @param err - Callback invoked in case of an error.
 * @param stream - AudioOutputStream to receive the synthesized audio.
 */
speakTextAsync(text: string, cb?: (e: SpeechSynthesisResult) => void, err?: (e: string) => void, stream?: AudioOutputStream | PushAudioOutputStreamCallback | PathLike): void;
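
The declaration above returns void, so `await` on speakTextAsync does not wait for synthesis to finish. If you want to await completion, one option is to wrap the call in a Promise yourself. The sketch below assumes only the four-argument signature shown above plus close(); `speakTextToPromise` is a hypothetical helper name, and the demo uses a fake synthesizer so that it runs without the SDK or network access.

```javascript
// Wrap the callback-style speakTextAsync(text, cb, err, stream) in a Promise.
// `synthesizer` is any object with that method and a close() method.
function speakTextToPromise(synthesizer, text, streamCallback) {
  return new Promise((resolve, reject) => {
    synthesizer.speakTextAsync(
      text,
      (result) => {
        synthesizer.close();
        resolve(result); // the caller still has to check result.reason
      },
      (err) => {
        synthesizer.close();
        reject(new Error(String(err)));
      },
      streamCallback
    );
  });
}

// Demo with a fake synthesizer standing in for MsTtsSdk.SpeechSynthesizer.
const fakeSynthesizer = {
  speakTextAsync(text, cb, err, stream) {
    stream.write(new ArrayBuffer(8)); // pretend one audio chunk arrived
    stream.close();
    cb({ reason: "stand-in result" });
  },
  close() {},
};
const demoStream = {
  write(data) { console.log(`Received ${data.byteLength} bytes of audio data.`); },
  close() { console.log("Stream closed."); },
};
speakTextToPromise(fakeSynthesizer, "こんにちは", demoStream)
  .then((result) => console.log(result.reason));
```

With the real SDK you would pass the SpeechSynthesizer instance and an instance of your PushAudioOutputStreamCallback subclass, then `await speakTextToPromise(...)` inside an async function.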

References

Official reference

Official GitHub

An article describing an approach based on AudioOutputStream (this did not work for me)

