Building a Discord Bot with AivisSpeech, Part ②: A Discord Bot That Calls a TTS Server on Google Cloud

Posted at 2024-12-04

This article is the day 5 entry of the TSG Advent Calendar 2024.
It is also part ② of the series "Building a Discord Bot with AivisSpeech"; part ① is the preceding article in the series.

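TSG's bots

TSG runs a number of Slack bots on a remote server, and there are channels with a rich ecosystem where you can answer quizzes, solve puzzles, and chat with bots.

TSG's idol, 「今言うな」

The implementations of these Slack bots are collected in a GitHub repository, which also hosts the bots that run on Discord.

Incidentally, on Discord you can play buzzer quizzes, with a bot joining the voice channel and reading out the questions.

One of these Discord bots implements a custom command called TTS. It is for people who cannot speak aloud but want to join a conversation: the bot joins the voice channel on their behalf and reads out the posts of TTS users. A variety of voices is available for reading.

Most of the voices call cloud services (paid for out of the bot implementer's own pocket), but by running an inference model on the TSG server, locally running TTS such as VOICEVOX can in principle be used as well. However, the TSG server is currently short on memory and cannot run rich deep-learning-based models. So I deployed the VOICEVOX engine to Google Cloud Run and have the Discord bot call it, which makes TTS voices such as Zundamon available.

This article uses the cloud-hosted AivisSpeech engine built in part ① and shows how to build a TTS bot that reads out posts on Discord.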

Setting up Node.js

Discord bots can be implemented in many languages; here we use Node.js.
Install Node by following the instructions on the official website.

$ curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.1/install.sh | bash
$ nvm install 22
$ node -v
v22.11.0
$ npm -v
10.9.0

Create a project folder and set up an npm environment in it.

$ mkdir discord-tts-bot
$ cd discord-tts-bot/
$ npm init --yes
$ npm install typescript ts-node axios discord.js google-auth-library dotenv @discordjs/voice libsodium-wrappers async-mutex

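Registering the bot with Discord

Creating the app

First, click New Application in the developer portal to create a new app, then open the Bot settings page of the app you created.

Also enable Message Content Intent so that the bot can read messages that do not mention it.

Then create and copy an access token for this bot under TOKEN on the same page.

Guard this token carefully. If it leaks, your Discord bot can be hijacked.

Save the copied token in a file named .env as shown below. Be careful not to push this file to GitHub by accident: add .env itself to .gitignore and commit a .env.example with dummy values instead.
While you are at it, also save the URL of the service deployed to Google Cloud Run.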

.env

DISCORD_TOKEN=<access token>
TTS_ENDPOINT=https://xxxxxxxxxxxxxxxxxxxx.us-central1.run.app

This file can be read with the dotenv npm package.
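Since a missing or empty variable otherwise only shows up later as a confusing login or HTTP error, it can help to check the values right after loading them. The following is a minimal sketch; `requireEnv` and `sampleEnv` are hypothetical names that do not appear in the bot itself.

```typescript
// Hypothetical helper (not part of the original bot): fail fast when a
// required variable is missing instead of passing `undefined` around.
type Env = Record<string, string | undefined>;

function requireEnv(env: Env, key: string): string {
  const value = env[key];
  if (value === undefined || value.trim() === '') {
    throw new Error(`Missing environment variable: ${key}`);
  }
  return value;
}

// A literal object stands in for process.env here so that the snippet
// runs without a .env file.
const sampleEnv: Env = {
  DISCORD_TOKEN: 'dummy-token',
  TTS_ENDPOINT: 'https://example.invalid',
};

const discordToken = requireEnv(sampleEnv, 'DISCORD_TOKEN');
const ttsEndpoint = requireEnv(sampleEnv, 'TTS_ENDPOINT');
console.log(`token length: ${discordToken.length}, endpoint: ${ttsEndpoint}`);
```

In the bot this would be called as `requireEnv(process.env, 'DISCORD_TOKEN')` after `dotenv.config()` has run.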

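Inviting the app

Next, on the OAuth2 page, set the permissions this bot needs and create an invite link.
Select bot under scopes, then choose the bot's permissions in the bot permissions box that appears below. The minimum set for the bot built here is probably the following: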

General Permissions > View Channels
Text Permissions > Send Messages
Voice Permissions > Connect
Voice Permissions > Speak

Opening the generated authorization URL takes you to Discord, which shows a confirmation screen asking whether to invite this bot.
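The URL the portal generates carries the selected permissions as a single integer, the bitwise OR of one flag per permission. The sketch below reproduces that number for the four permissions listed above, using the bit positions from Discord's permission documentation. `buildInviteUrl` and the all-zero client ID are placeholders for illustration, and the portal may add further query parameters of its own.

```typescript
// Permission flags for the four permissions listed above.
const VIEW_CHANNEL = 1 << 10;
const SEND_MESSAGES = 1 << 11;
const CONNECT = 1 << 20;
const SPEAK = 1 << 21;

// OR the flags together to get the value of the `permissions` query parameter.
const permissionBits = VIEW_CHANNEL | SEND_MESSAGES | CONNECT | SPEAK;

// Hypothetical helper: `clientId` is the application ID shown in the developer portal.
function buildInviteUrl(clientId: string, bits: number): string {
  const query = [
    `client_id=${encodeURIComponent(clientId)}`,
    'scope=bot',
    `permissions=${bits}`,
  ].join('&');
  return `https://discord.com/oauth2/authorize?${query}`;
}

console.log(permissionBits); // 3148800
console.log(buildInviteUrl('000000000000000000', permissionBits));
```

If the `permissions` value in your own invite link is larger than this, the bot is being granted more than the minimum set.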

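Using discord.js

See the official discord.js guide for details.

As a first step, let's write an implementation in which the bot joins when someone types aivistts in the text chat of a voice channel.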

index.ts

import dotenv from 'dotenv';
import {Client, Events, Message, TextChannel, VoiceChannel, GatewayIntentBits} from 'discord.js';
import {VoiceConnection, AudioPlayer, joinVoiceChannel, PlayerSubscription, createAudioResource, createAudioPlayer, AudioPlayerStatus} from '@discordjs/voice';
import {Mutex} from 'async-mutex';

dotenv.config({ override: true });

const client = new Client({
	intents: [
		GatewayIntentBits.Guilds,
		GatewayIntentBits.GuildMessages,
		GatewayIntentBits.GuildVoiceStates,
		GatewayIntentBits.MessageContent,
	],
});
let connection: VoiceConnection | null = null;
let audioPlayer: AudioPlayer | null = null;
let subscription: PlayerSubscription | null = null;
const mutex = new Mutex();

client.once(Events.ClientReady, readyClient => {
	console.log(`Ready! Logged in as ${readyClient.user.tag}`);
});
client.login(process.env.DISCORD_TOKEN);  // the access token saved in the .env file

client.on('messageCreate', async (message: Message) => {
	if (message.author.bot) return;  // ignore messages from bots
	if (!(message.channel instanceof VoiceChannel)) return;  // ignore anything outside a voice channel
	const channel: VoiceChannel = message.channel;
	if (message.content === 'aivistts') {
		// connect
		mutex.runExclusive(async () => {
			if (connection === null) {
				connection = joinVoiceChannel({
					channelId: channel.id,
					guildId: channel.guild.id,
					adapterCreator: channel.guild.voiceAdapterCreator,
				});
				audioPlayer = createAudioPlayer();
				subscription = connection.subscribe(audioPlayer)!;
			}
		});
	} else if (message.content === 'aivistts stop') {
		// disconnect
		mutex.runExclusive(async () => {
			subscription?.unsubscribe();
			connection?.destroy();
			connection = null;
		});
	} else {
		// speak
		mutex.runExclusive(async () => {
			if (connection) {
				console.log(`Speaking: ${message.content}`);
			}
		});
	}
});

Now let's run it.

$ npx tsc --init
$ npx ts-node index.ts

The bot now connects to a voice channel only when aivistts is posted there, and each message after that produces a log line like this:

Speaking: oaoa
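Every branch of the handler is wrapped in `mutex.runExclusive`, so connect, disconnect and speak requests are processed one at a time in arrival order, even though each handler is asynchronous. To illustrate that behaviour without depending on async-mutex, here is a sketch built on a plain promise chain. `SerialQueue` is a stand-in I made up for this example, not the library's implementation.

```typescript
// Hand-rolled stand-in for async-mutex's Mutex: each task starts only after
// the previous one has settled, so tasks never interleave.
class SerialQueue {
  private tail: Promise<void> = Promise.resolve();

  runExclusive<T>(task: () => Promise<T>): Promise<T> {
    const result = this.tail.then(task);
    // Keep the chain usable even if a task rejects.
    this.tail = result.then(() => undefined, () => undefined);
    return result;
  }
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function demo(): Promise<string[]> {
  const queue = new SerialQueue();
  const log: string[] = [];
  // Task A takes longer than task B, yet B does not start until A has finished.
  const first = queue.runExclusive(async () => {
    log.push('start A');
    await sleep(30);
    log.push('end A');
  });
  const second = queue.runExclusive(async () => {
    log.push('start B');
    await sleep(5);
    log.push('end B');
  });
  await Promise.all([first, second]);
  return log;
}

demo().then((log) => console.log(log.join(', '))); // start A, end A, start B, end B
```

For a TTS bot this ordering matters once audio is involved: without it, two messages posted in quick succession would both try to play on the same audio player at once.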

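Calling an authenticated Google Cloud Run (GCR) service

Next we write the code that talks to the TTS server deployed on GCR and plays the audio from the bot.
Place the JSON file containing the private key of the service account created for the bot in this directory under the name service_account.json.

As mentioned in part ①, this JSON file must not be made public either. Take care that it does not leak.

As a test, let's download some audio.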

download.ts

import axios, {AxiosError} from 'axios';
import dotenv from 'dotenv';
import {GoogleAuth, IdTokenClient} from 'google-auth-library';
import fs from 'fs';

dotenv.config({ override: true });

async function getSpeech(cloud_client: IdTokenClient, text: string): Promise<{data: Buffer}> {
    const headers = await cloud_client.getRequestHeaders();  // headers for accessing GCR
    headers['content-type'] = 'application/json';
    return new Promise((resolve, reject) => {
        axios.post<Buffer>(process.env.TTS_ENDPOINT!, {text, speaker: 888753760}, {
            headers,
            responseType: 'arraybuffer',
        }).then((response) => {
            resolve({data: response.data});
        }).catch((reason: AxiosError) => {
            console.error(`The TTS API server has returned an error: ${reason.response?.data?.toString()}`);
            reject(reason);
        });
    });
}

(async () => {
    const cloud_auth = new GoogleAuth({keyFile: "./service_account.json"});
    const cloud_client = await cloud_auth.getIdTokenClient(process.env.TTS_ENDPOINT!);
    const {data} = await getSpeech(cloud_client, "メリークリスマス！");
    fs.writeFileSync("output.wav", data);
})();

$ npx ts-node download.ts
$ file output.wav
output.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 44100 Hz

The download worked. Next we build this into the bot program from earlier.
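The check that `file` performed above can also be done in code, which is handy if you want the bot to reject a response that is not audio. The sketch below reads the fields `file` reported from a WAV header. It assumes the canonical layout in which the `fmt ` chunk directly follows the RIFF header; `readWavHeader` and `makeSampleHeader` are hypothetical helpers, and the sample bytes are synthetic rather than real server output.

```typescript
interface WavInfo {
  audioFormat: number;   // 1 means PCM
  channels: number;
  sampleRate: number;
  bitsPerSample: number;
}

// Assumes the canonical layout: "RIFF" at 0, "WAVE" at 8, "fmt " at 12.
function readWavHeader(bytes: Uint8Array): WavInfo {
  if (bytes.byteLength < 36) throw new Error('Too short to be a WAV file');
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const tag = (offset: number): string =>
    String.fromCharCode(
      view.getUint8(offset), view.getUint8(offset + 1),
      view.getUint8(offset + 2), view.getUint8(offset + 3));
  if (tag(0) !== 'RIFF' || tag(8) !== 'WAVE' || tag(12) !== 'fmt ') {
    throw new Error('Not a canonical RIFF/WAVE header');
  }
  return {
    audioFormat: view.getUint16(20, true),  // all fields are little-endian
    channels: view.getUint16(22, true),
    sampleRate: view.getUint32(24, true),
    bitsPerSample: view.getUint16(34, true),
  };
}

// Builds a synthetic 44-byte header (16-bit mono PCM, 44100 Hz, no samples).
function makeSampleHeader(): Uint8Array {
  const bytes = new Uint8Array(44);
  const view = new DataView(bytes.buffer);
  const put = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i));
  };
  put(0, 'RIFF'); view.setUint32(4, 36, true); put(8, 'WAVE');
  put(12, 'fmt '); view.setUint32(16, 16, true);
  view.setUint16(20, 1, true);       // PCM
  view.setUint16(22, 1, true);       // mono
  view.setUint32(24, 44100, true);   // sample rate
  view.setUint32(28, 88200, true);   // byte rate
  view.setUint16(32, 2, true);       // block align
  view.setUint16(34, 16, true);      // bits per sample
  put(36, 'data'); view.setUint32(40, 0, true);
  return bytes;
}

console.log(readWavHeader(makeSampleHeader()));
```

Because a Node.js `Buffer` is a `Uint8Array`, the `data` written to output.wav in download.ts could be passed to `readWavHeader` directly.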

index.ts

import axios, {AxiosError} from 'axios';
import dotenv from 'dotenv';
import {Client, Events, Message, TextChannel, VoiceChannel, GatewayIntentBits} from 'discord.js';
import {VoiceConnection, AudioPlayer, joinVoiceChannel, PlayerSubscription, createAudioResource, createAudioPlayer, AudioPlayerStatus, StreamType} from '@discordjs/voice';
import {Mutex} from 'async-mutex';
import {GoogleAuth, IdTokenClient} from 'google-auth-library';
import {Readable} from 'stream';

dotenv.config({ override: true });

async function getSpeech(cloud_client: IdTokenClient, text: string): Promise<{data: Readable}> {
    const headers = await cloud_client.getRequestHeaders();  // headers for accessing GCR
    headers['content-type'] = 'application/json';
    return new Promise((resolve, reject) => {
        axios.post<Readable>(process.env.TTS_ENDPOINT! + "/tts", {text, speaker: 888753760}, {
            headers,
            responseType: 'stream',
        }).then((response) => {
            resolve({data: response.data});
        }).catch((reason: AxiosError) => {
            console.error(`The TTS API server has returned an error: ${reason.response?.data?.toString()}`);
            reject(reason);
        });
    });
}

(async () => {

const cloud_auth = new GoogleAuth({keyFile: "./service_account.json"});
const cloud_client = await cloud_auth.getIdTokenClient(process.env.TTS_ENDPOINT!);

const client = new Client({
	intents: [
		GatewayIntentBits.Guilds,
		GatewayIntentBits.GuildMessages,
		GatewayIntentBits.GuildVoiceStates,
		GatewayIntentBits.MessageContent,
	],
});
let connection: VoiceConnection | null = null;
let audioPlayer: AudioPlayer | null = null;
let subscription: PlayerSubscription | null = null;
const mutex = new Mutex();

client.once(Events.ClientReady, readyClient => {
	console.log(`Ready! Logged in as ${readyClient.user.tag}`);
});
client.login(process.env.DISCORD_TOKEN);

client.on('messageCreate', async (message: Message) => {
	if (message.author.bot) return;
	if (!(message.channel instanceof VoiceChannel)) return;
	const channel: VoiceChannel = message.channel;
	if (message.content === 'aivistts') {
		// connect
		mutex.runExclusive(async () => {
			if (connection === null) {
				connection = joinVoiceChannel({
					channelId: channel.id,
					guildId: channel.guild.id,
					adapterCreator: channel.guild.voiceAdapterCreator,
				});
				audioPlayer = createAudioPlayer();
				subscription = connection.subscribe(audioPlayer)!;
			}
		});
	} else if (message.content === 'aivistts stop') {
		// disconnect
		mutex.runExclusive(async () => {
			subscription?.unsubscribe();
			connection?.destroy();
			connection = null;
		});
	} else {
		// speak
		mutex.runExclusive(async () => {
			if (connection) {
				const {data} = await getSpeech(cloud_client, message.content);
				const resource = createAudioResource(data);
				const playFinished = new Promise<void>((resolve) => {
					audioPlayer?.once(AudioPlayerStatus.Idle, resolve);
					audioPlayer?.play(resource);
				});
				let timeout;
				await Promise.race([
					playFinished,
					new Promise<void>((resolve) => {
						timeout = setTimeout(() => {
							console.log(`timeout. message: ${message.content}`);
							audioPlayer?.removeAllListeners();
							audioPlayer?.stop();  // stop playback if it has not finished after 10 seconds
							resolve();
						}, 10 * 1000);
					}),
				]);
				clearTimeout(timeout);
			}
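		}).catch((error) => {
			// Without this handler, a failed TTS request would surface as an unhandled rejection.
			console.error(`Failed to speak: ${error}`);
		});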
	}
});

})();

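This implementation is heavily simplified. Managing which users are using TTS, letting them choose a voice, and having the bot disconnect automatically once nobody is using it would make it more convenient.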
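As one possible starting point for those extensions, the sketch below keeps a per-user voice setting and reports when the last TTS user has left. `TtsSession` and its methods are hypothetical names for illustration; only the speaker ID 888753760 comes from the code above.

```typescript
const DEFAULT_SPEAKER = 888753760; // the speaker ID used in the requests above

// Hypothetical bookkeeping for who is using TTS and with which voice.
class TtsSession {
  private speakers = new Map<string, number>();

  // Register a user, optionally with a voice of their choice.
  join(userId: string, speaker: number = DEFAULT_SPEAKER): void {
    this.speakers.set(userId, speaker);
  }

  // Remove a user. Returns true when nobody is left, i.e. the bot should disconnect.
  leave(userId: string): boolean {
    this.speakers.delete(userId);
    return this.speakers.size === 0;
  }

  // The voice for a user, or undefined if their posts should not be read out.
  speakerFor(userId: string): number | undefined {
    return this.speakers.get(userId);
  }
}

const session = new TtsSession();
session.join('user-a');
session.join('user-b', 1234); // 1234 is a made-up speaker ID
console.log(session.speakerFor('user-b'));
console.log(session.leave('user-a'));
console.log(session.leave('user-b'));
```

In the message handler, `speakerFor(message.author.id)` would decide whether to read a post and which `speaker` value to send. A `leave` call returning true would trigger the same cleanup as `aivistts stop`; it could be driven by a voice state update event from discord.js when a user leaves the channel.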

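Dealing with cold starts

When deploying to Google Cloud Run I set the minimum number of instances to 0, so all instances are shut down after a period without traffic. That is good for the bill, but a request made in that state has to wait about 30 seconds for an instance to start, which is a very long latency. So rather than sending a TTS query out of the blue, it is a good idea to hit the TTS server with some GET request when a user connects or when the TTS bot is summoned.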

async function touch(cloud_client: IdTokenClient) {
        const headers = await cloud_client.getRequestHeaders();  // headers for accessing GCR
        headers['content-type'] = 'application/json';
        axios.get(process.env.TTS_ENDPOINT!, {
                headers,
        }).then(() => {
                console.log("The TTS API server woke up.");
        }).catch((reason: AxiosError) => {
                console.error(`The TTS API server has returned an error: ${reason.response?.data?.toString()}`);
        });
}
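Calling `touch` before every message would double the number of requests, so one option is to warm up only when the server has probably gone idle. The sketch below tracks the time of the last request and says whether a warm-up is worthwhile. `WarmUpGate` is a hypothetical helper, and the 10-minute threshold is an assumed value to tune, not a documented Cloud Run timeout.

```typescript
// Hypothetical helper: decides whether a warm-up request is worth sending.
class WarmUpGate {
  private lastRequestAt: number | null = null;
  private readonly idleThresholdMs: number;

  constructor(idleThresholdMs: number) {
    this.idleThresholdMs = idleThresholdMs;
  }

  // Records a request made at `now` (milliseconds) and returns true if the
  // server has probably been idle long enough to have scaled down to zero.
  shouldWarmUp(now: number): boolean {
    const cold =
      this.lastRequestAt === null || now - this.lastRequestAt >= this.idleThresholdMs;
    this.lastRequestAt = now;
    return cold;
  }
}

const TEN_MINUTES = 10 * 60 * 1000; // assumed idle threshold
const gate = new WarmUpGate(TEN_MINUTES);

console.log(gate.shouldWarmUp(0));                    // true: first request ever
console.log(gate.shouldWarmUp(60 * 1000));            // false: server was used a minute ago
console.log(gate.shouldWarmUp(60 * 1000 + TEN_MINUTES)); // true: idle for ten minutes
```

In the bot, the connect branch could call `touch(cloud_client)` whenever `gate.shouldWarmUp(Date.now())` returns true.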

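Summary

In this article we built a Discord bot that calls a TTS server deployed on Google Cloud Run and implemented reading out the text that users post in chat. Next time (if there is one), I will show how to modify the AivisSpeech implementation to reduce latency.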
