Introducing Speech Recognition: Discord Bot Development [Part 3]

Last updated at 2024-02-02. Posted at 2024-01-24.

Hello. This article is a continuation of "Everything About Audio Processing Implemented in Java: Discord Bot Development [Part 2]".

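1. Introduction

If you are a Discord bot developer thinking about adding speech recognition, you probably want the technical details. This article walks through the concrete code flow for integrating the Google Cloud Speech-to-Text API into a Discord bot.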

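2. Overview of the Google Cloud Speech-to-Text API

This API provides speech recognition and speech-to-text conversion, including real-time use cases, and supports many languages. Its accuracy and flexibility make it well suited to extending a Discord bot.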

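3. Main Classes and Their Roles

This section looks at the key classes behind the bot's speech recognition feature and what each one does.

SpeechToTextService:

SpeechToTextService is the class that talks directly to the Google Cloud Speech-to-Text API. It creates a Google Cloud Recognizer resource and sends recognition requests to the API. It then takes the API response and extracts the transcript text so that the Discord bot can work with it. The whole path from receiving audio data to producing text runs through this class.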

SpeechToTextService.java

package jp.livlog.cotogoto.api.discord.speech2text;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.ExecutionException;

import com.google.cloud.speech.v2.AutoDetectDecodingConfig;
import com.google.cloud.speech.v2.CreateRecognizerRequest;
import com.google.cloud.speech.v2.RecognitionConfig;
import com.google.cloud.speech.v2.RecognizeRequest;
import com.google.cloud.speech.v2.Recognizer;
import com.google.cloud.speech.v2.SpeechClient;
import com.google.cloud.speech.v2.SpeechRecognitionResult;
import com.google.protobuf.ByteString;

public class SpeechToTextService {

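    private static String generateRecognizerId() {

        // Build "rec" plus exactly 10 random base-36 characters. Appending one digit at a time
        // avoids substring(0, 10) failing when the encoded value is shorter than 10 characters.
        final var rand = new Random();
        final var id = new StringBuilder("rec");
        for (var i = 0; i < 10; i++) {
            id.append(Character.forDigit(rand.nextInt(36), 36));
        }
        return id.toString();
    }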


    public static Recognizer createRecognizer(final String projectId)
            throws ExecutionException, InterruptedException, IOException {

        try (var speechClient = SpeechClient.create()) {

            final var recognizerId = SpeechToTextService.generateRecognizerId(); // Generate an ID that matches the required pattern
            final var parent = String.format("projects/%s/locations/global", projectId);

            @SuppressWarnings ("deprecation")
            final var recognizer = Recognizer.newBuilder()
                    .setModel("latest_long")
                    .addLanguageCodes("ja-jp")
                    .build();

            final var request = CreateRecognizerRequest.newBuilder()
                    .setParent(parent)
                    .setRecognizerId(recognizerId)
                    .setRecognizer(recognizer)
                    .build();

            final var future = speechClient.createRecognizerAsync(request);
            return future.get();
        }

    }


    public static List <String> transcribeAudio(final SpeechClient speechClient, final Recognizer recognizer, final byte[] audioData)
            throws IOException {

        final var audioBytes = ByteString.copyFrom(audioData);

        final var config = RecognitionConfig.newBuilder()
                .setAutoDecodingConfig(AutoDetectDecodingConfig.newBuilder().build())
                .build();

        final var request = RecognizeRequest.newBuilder()
                .setConfig(config)
                .setRecognizer(recognizer.getName())
                .setContent(audioBytes)
                .build();

        final var response = speechClient.recognize(request);
        final List <String> transcriptions = new ArrayList <>();
        for (final SpeechRecognitionResult result : response.getResultsList()) {
            if (result.getAlternativesCount() > 0) {
                transcriptions.add(result.getAlternativesList().get(0).getTranscript());
            }
        }

        return transcriptions;
    }

}
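
The recognizer ID is the one piece of SpeechToTextService that can be checked without calling the API. The following is a minimal sketch, not the author's code: the class name RecognizerIds is hypothetical, and the validation pattern assumes the ID must be 4 to 63 characters of lowercase letters, digits, and hyphens, starting with a letter. Please verify that constraint against the current API reference before relying on it.

```java
import java.security.SecureRandom;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the article's code base.
final class RecognizerIds {

    // Assumed constraint: 4-63 characters, lowercase letters, digits and hyphens,
    // starting with a letter and not ending with a hyphen.
    private static final Pattern VALID = Pattern.compile("[a-z][a-z0-9-]{2,61}[a-z0-9]");

    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789";

    private static final SecureRandom RANDOM = new SecureRandom();

    private RecognizerIds() {
    }

    /** Returns "rec" followed by exactly suffixLength random base-36 characters. */
    static String generate(final int suffixLength) {
        final var id = new StringBuilder("rec");
        for (var i = 0; i < suffixLength; i++) {
            id.append(ALPHABET.charAt(RANDOM.nextInt(ALPHABET.length())));
        }
        return id.toString();
    }

    /** Checks a candidate ID against the assumed pattern. */
    static boolean isValid(final String id) {
        return VALID.matcher(id).matches();
    }

    public static void main(final String[] args) {
        final var id = generate(10);
        System.out.println(id + " valid=" + isValid(id));
    }
}
```

Because every character is drawn from a fixed alphabet, the result always has the same length, which keeps the ID predictable regardless of the random value.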

AudioProcessor:

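The main role of the AudioProcessor class is to process audio data received from Discord and turn it into text. It converts the raw audio into a suitable format (WAV) and passes it to SpeechToTextService, which is what makes near-real-time recognition possible. Noise reduction and other audio optimizations also belong to this class by design, although the listing here covers only the PCM-to-WAV conversion and the transcription call.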

AudioProcessor.java

package jp.livlog.cotogoto.api.discord;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import javax.sound.sampled.AudioFileFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;

import com.google.cloud.speech.v2.Recognizer;
import com.google.cloud.speech.v2.SpeechClient;
import com.google.inject.Guice;
import com.google.inject.Injector;

import jp.livlog.cotogoto.api.discord.speech2text.SpeechToTextService;
import jp.livlog.cotogoto.share.CotogotoModule;
import net.dv8tion.jda.api.audio.AudioSendHandler;

public class AudioProcessor {

    private final Recognizer recognizer;

    private final Injector   injector = Guice.createInjector(new CotogotoModule());

    public AudioProcessor(final Recognizer recognizer) {

        this.recognizer = recognizer;
    }


    public byte[] processAudio(final String userId, final byte[] audioData) {

        try {
            final var wavData = this.convertPcmToWav(audioData);
            // Use the WAV data here for speech recognition or synthesis
            try (var speechClient = SpeechClient.create()) {
                final var transcriptions = SpeechToTextService.transcribeAudio(speechClient, this.recognizer, wavData);
                for (final String transcription : transcriptions) {
                    System.out.println("Transcription: " + transcription);
                }
            }

            return wavData;
        } catch (final IOException e) {
            e.printStackTrace();
            return null;
        }
    }


    private byte[] convertPcmToWav(final byte[] pcmData) throws IOException {

        // AudioInputStream takes its length in sample frames, not bytes.
        final long frameLength = pcmData.length / AudioSendHandler.INPUT_FORMAT.getFrameSize();

        try (
                var wavOutputStream = new ByteArrayOutputStream();
                var audioInputStream = new AudioInputStream(
                        new ByteArrayInputStream(pcmData),
                        AudioSendHandler.INPUT_FORMAT,
                        frameLength)) {

            AudioSystem.write(audioInputStream, AudioFileFormat.Type.WAVE, wavOutputStream);
            return wavOutputStream.toByteArray();
        }
    }
}
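
The PCM-to-WAV step can be tried on its own, without JDA or Google Cloud on the classpath. The sketch below is an illustration under assumptions: the class name PcmToWavSketch is hypothetical, and PCM_FORMAT is declared by hand on the assumption that it mirrors JDA's audio format (48 kHz, 16-bit, stereo, signed, big-endian). It shows the length of the AudioInputStream being given in sample frames, which is what the constructor expects.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import javax.sound.sampled.AudioFileFormat;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;

// Hypothetical stand-alone version of the conversion step, for illustration only.
final class PcmToWavSketch {

    // Assumed to mirror JDA's audio format: 48 kHz, 16-bit, stereo, signed, big-endian.
    static final AudioFormat PCM_FORMAT = new AudioFormat(48000.0f, 16, 2, true, true);

    private PcmToWavSketch() {
    }

    /** Wraps raw PCM bytes in a WAV container held in memory. */
    static byte[] toWav(final byte[] pcmData) throws IOException {
        // The stream length is counted in sample frames (4 bytes each here), not bytes.
        final long frames = pcmData.length / PCM_FORMAT.getFrameSize();
        try (var out = new ByteArrayOutputStream();
                var in = new AudioInputStream(new ByteArrayInputStream(pcmData), PCM_FORMAT, frames)) {
            AudioSystem.write(in, AudioFileFormat.Type.WAVE, out);
            return out.toByteArray();
        }
    }

    public static void main(final String[] args) throws IOException {
        final var silence = new byte[3840]; // 20 ms of audio: 960 frames x 4 bytes
        final var wav = toWav(silence);
        System.out.println("PCM bytes: " + silence.length + ", WAV bytes: " + wav.length);
    }
}
```

Writing to a ByteArrayOutputStream keeps the conversion in memory, so the resulting bytes can be handed directly to a recognition request.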

NobyBot:

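Finally, the NobyBot class serves as the main class of the Discord bot. It brings together SpeechToTextService and AudioProcessor so that the bot can recognize users' speech as it arrives and react to it. It controls the bot's overall behavior and manages responses to users' voice input. Event handling and the other pieces needed for interacting with users, such as the command interface, are also wired up here.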

NobyBot.java

package jp.livlog.cotogoto.api.discord;

import java.io.IOException;
import java.util.concurrent.ExecutionException;

import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Service;

import com.sedmelluq.discord.lavaplayer.player.AudioPlayerManager;
import com.sedmelluq.discord.lavaplayer.player.DefaultAudioPlayerManager;
import com.sedmelluq.discord.lavaplayer.source.AudioSourceManagers;

import jp.livlog.cotogoto.api.discord.source.CustomInputStreamSourceManager;
import jp.livlog.cotogoto.api.discord.speech2text.SpeechToTextService;
import lombok.extern.slf4j.Slf4j;
import net.dv8tion.jda.api.entities.channel.ChannelType;
import net.dv8tion.jda.api.events.message.MessageReceivedEvent;
import net.dv8tion.jda.api.hooks.ListenerAdapter;

@Service
@Slf4j
public class NobyBot extends ListenerAdapter {

    @Value ("${google.projectId}")
    private String googleProjectId;

    @Override
    public void onMessageReceived(final MessageReceivedEvent event) {

        try {
            if (event.getAuthor().isBot()) {
                return; // Ignore messages from bots
            }

            this.printMessage(event);

            final var message = event.getMessage().getContentDisplay();
            if (message.startsWith("!")) {
                this.handleCommand(event, message);
            } else {
                this.echoMessage(event, message);
            }
        } catch (final Exception e) {
            NobyBot.log.error(e.getMessage(), e);
        }

    }


    private void printMessage(final MessageReceivedEvent event) {

        if (event.isFromType(ChannelType.PRIVATE)) {
            NobyBot.log.info("[PM] {}: {}", event.getAuthor().getName(), event.getMessage().getContentDisplay());
        } else {
            NobyBot.log.info("[{}][{}] {}: {}", event.getGuild().getName(), event.getChannel().getName(),
                    event.getMember().getEffectiveName(), event.getMessage().getContentDisplay());
        }
    }


    private void handleCommand(final MessageReceivedEvent event, final String message) throws ExecutionException, InterruptedException, IOException {

        final var command = message.split(" ")[0].substring(1).toLowerCase();
        switch (command) {
            case "join":
                this.joinVoiceChannel(event);
                break;
            case "leave":
                this.leaveVoiceChannel(event.getGuild());
                break;
            // Add more commands as needed
        }
    }


    private void echoMessage(final MessageReceivedEvent event, final String message) {

        event.getChannel().sendMessage(message).queue();
    }


    private void joinVoiceChannel(final MessageReceivedEvent event) throws ExecutionException, InterruptedException, IOException {

        // Get the message sender (null check)
        var member = event.getMember();
        if (member == null) {
            // Look the member up in the guild where the event occurred
            member = event.getGuild().getMemberById(event.getAuthor().getId());
        }

        if (member != null) {
            final var voiceState = member.getVoiceState();
            if (voiceState != null) {
                final var voiceChannel = voiceState.getChannel(); // Get the member's current voice channel
                if (voiceChannel != null) { // The user is in a voice channel
                    final var audioManager = voiceChannel.getGuild().getAudioManager();

                    final var recognizer = SpeechToTextService.createRecognizer(this.googleProjectId);
                    final var audioProcessor = new AudioProcessor(recognizer);
                    final var sharedAudioData = new SharedAudioData();
                    final var scheduler = new DataCheckScheduler(sharedAudioData);
                    scheduler.start();

                    final AudioPlayerManager playerManager = new DefaultAudioPlayerManager();
                    playerManager.registerSourceManager(new CustomInputStreamSourceManager());
                    AudioSourceManagers.registerLocalSource(playerManager);

                    final var nobyHandler = new NobyAudioHandler(audioProcessor, sharedAudioData, playerManager);
                    audioManager.setReceivingHandler(nobyHandler); // Receive audio through NobyAudioHandler
                    audioManager.setSendingHandler(nobyHandler); // Send audio through NobyAudioHandler
                    audioManager.openAudioConnection(voiceChannel); // Connect to that channel
                    return;
                }
            }
        }

        // Reply in the channel the message came from.
        // The text means "Nobody is in the voice channel."
        final var channel = event.getChannel();
        channel.sendMessage("ボイスチャンネルに誰もいません。").queue();
    }


    private void leaveVoiceChannel(final net.dv8tion.jda.api.entities.Guild guild) {

        final var audioManager = guild.getAudioManager();
        if (audioManager.isConnected()) {
            audioManager.closeAudioConnection();
        }
    }

}
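
The command handling in NobyBot takes the first word of a message, strips the "!" prefix, and lowercases it. A minimal sketch of that parsing as a stand-alone, testable function follows. The class name CommandParser and the choice to return an Optional are assumptions made for illustration; they are not part of the original bot.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper that isolates the parsing done in handleCommand.
final class CommandParser {

    private static final String PREFIX = "!";

    private CommandParser() {
    }

    /**
     * Returns the lowercased command name without the prefix,
     * or an empty Optional when the message is not a command.
     */
    static Optional<String> parse(final String message) {
        if (message == null || !message.startsWith(PREFIX)) {
            return Optional.empty();
        }
        // First whitespace-separated token, e.g. "!Join" from "!Join now".
        final var firstToken = message.trim().split("\\s+")[0];
        final var name = firstToken.substring(PREFIX.length()).toLowerCase(Locale.ROOT);
        return name.isEmpty() ? Optional.empty() : Optional.of(name);
    }

    public static void main(final String[] args) {
        System.out.println(parse("!Join now"));
        System.out.println(parse("hello"));
    }
}
```

Using Locale.ROOT keeps the lowercasing independent of the server's default locale, and returning an empty Optional for a bare "!" avoids dispatching on an empty command name.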

4. Code Flow and Analysis

SpeechToTextService.java handles communication with the Google Cloud API.
AudioProcessor.java receives audio data and converts it to text through SpeechToTextService.
NobyBot.java controls the bot's overall behavior and coordinates the other classes.

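5. Key Integration Steps

Enable the API in Google Cloud Platform and set up credentials. SpeechClient.create(), as used above, relies on Application Default Credentials (for example, a service account key referenced by the GOOGLE_APPLICATION_CREDENTIALS environment variable) rather than a plain API key.
Set up the Discord bot and integrate the required classes.
Call the API from SpeechToTextService and process the audio data in AudioProcessor.
Run the integrated features from NobyBot to enable interaction with users.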
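
Before starting the bot, it can help to check the configuration that the first step depends on. The sketch below is an assumption-laden illustration: the class SpeechSetupCheck and its methods are hypothetical, "my-project" is a placeholder project ID, and the check only looks at the GOOGLE_APPLICATION_CREDENTIALS environment variable. Credentials can also come from other sources, so a missing variable does not by itself mean authentication will fail.

```java
import java.util.Map;

// Hypothetical preflight check; not part of the article's code base.
final class SpeechSetupCheck {

    static final String CREDENTIALS_ENV = "GOOGLE_APPLICATION_CREDENTIALS";

    private SpeechSetupCheck() {
    }

    /** Builds the parent resource name in the same shape createRecognizer uses. */
    static String recognizerParent(final String projectId, final String location) {
        if (projectId == null || projectId.isBlank()) {
            throw new IllegalArgumentException("projectId must not be blank");
        }
        return String.format("projects/%s/locations/%s", projectId, location);
    }

    /** True when the environment points at a credentials file. */
    static boolean hasCredentialsPath(final Map<String, String> env) {
        final var path = env.get(CREDENTIALS_ENV);
        return path != null && !path.isBlank();
    }

    public static void main(final String[] args) {
        // "my-project" is a placeholder; use the value configured as google.projectId.
        System.out.println(recognizerParent("my-project", "global"));
        System.out.println("credentials path set: " + hasCredentialsPath(System.getenv()));
    }
}
```

Passing the environment in as a Map keeps the check easy to test without changing real environment variables.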

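6. Integration with CotoGoto

Finally, this technical memo was written as part of extending CotoGoto by connecting it to Discord. CotoGoto is a conversational app with built-in artificial intelligence that analyzes what you are working on through everyday conversation and supports task and schedule management. With the Discord bot in place, CotoGoto can interact with more users and provide information that helps with their daily life and work.

For details, see the following.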

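7. Summary

Integrating the Google Cloud Speech-to-Text API into a Discord bot brings high-accuracy speech recognition and better responsiveness. Use the code flow and integration steps covered in this article as a starting point for strengthening your own bot.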
