Discord Advent Calendar 2019

My Text-to-Speech Bot Now Has Emotions

Posted at 2019-12-24

Introduction

This article is the day-24 entry of the Discord Advent Calendar 2019.

I'm @Sashimimochi, and lately I've been building chatbots on Discord bit by bit.
There are many platforms and APIs for building chatbots besides Discord, but one of the attractions of choosing Discord is that you can use voice channels.

As @coolwind0202 explains in the same Advent Calendar, the provided API makes it easy to control voice channels.
Converting text to an audio file (text-to-speech) also takes no special effort if you use gTTS or Open JTalk.

In my case I use Open JTalk, because its parameters can be tuned in detail and it offers a choice of voices.
As of Ver. 1.8, Mei, the standard female voice for Open JTalk, comes in

Normal
Happy
Sad
Angry
Bashful

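five styles, while
HTS voice tohoku-f01, published by the Ito-Nose Laboratory (Department of Communications Engineering, Graduate School of Engineering, Tohoku University), comes in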

Neutral (calm)
Happy (joy)
Sad (sadness)
Angry (anger)

four styles.

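For ordinary reading, Normal/Neutral (calm) seems the best fit, but with this many styles available I wanted to find a way to use the others too. What to do? Then it occurred to me: estimate the emotion from the sentence and switch the voice style accordingly. This article describes the text-to-speech bot I built to try that out.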

Basic functionality

First, I implement the minimum needed to control a Discord voice channel, which is the foundation of the text-to-speech bot.
The implementation is in Python, so I use discord.py.

The versions I used are as follows.

discord==1.0.1
discord.py==1.2.5

Since it only needs to work for now, here is a quick first version.


app.py

import discord


client = discord.Client()
client_id = 'your_client_id'  # placeholder; client.run() below expects the bot token

voice = None       # VoiceClient of the connected voice channel (None when disconnected)
volume = 1.0       # playback volume, 1.0 = 100%
read_mode = False  # whether ordinary messages are read aloud

@client.event
async def on_ready():
    # Runs when the bot has started
    print('Bot is wake up.')

@client.event
async def on_message(message):
    # Runs when a message is sent to a text channel
    global voice, volume, read_mode
    # NLP and VoiceChannel are the classes defined in the following sections
    nlp = NLP()
    vc = VoiceChannel()

    if client.user != message.author:
        text = message.content
        if text == '!login':
            channel = message.author.voice.channel
            voice = await channel.connect()
            await message.channel.send('Logged in to the voice channel')
        elif text == '!logout':
            if voice is not None:
                await voice.disconnect()
                voice = None
            await message.channel.send('Logged out of the voice channel')
        elif text == '!status':
            if voice is not None and voice.is_connected():
                await message.channel.send('Connected to the voice channel')
        elif text == '!volume_up':
            volume += 0.1
            await message.channel.send('Turned the volume up')
        elif text == '!volume_down':
            volume = max(0.0, volume - 0.1)
            await message.channel.send('Turned the volume down')
        elif text == '!bye':
            await client.close()
        elif text == '!read_mode_on':
            read_mode = True
            await message.channel.send('Reading mode turned on')
        elif text == '!read_mode_off':
            read_mode = False
            await message.channel.send('Reading mode turned off')
        else:
            if read_mode and voice is not None:
                emotion = nlp.analysis_emotion(text)
                voice_file = vc.make_by_jtalk(text, emotion=emotion)
                # Wrap the source so that the current volume setting takes effect
                audio_source = discord.PCMVolumeTransformer(
                    discord.FFmpegPCMAudio(voice_file), volume=volume)
                voice.play(audio_source, after=lambda e: vc.after_play(e))

client.run(client_id)
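
The '!volume_up' and '!volume_down' commands change the volume in steps of 0.1 with nothing to stop it drifting out of range. As a minimal sketch, a small helper could keep the value within bounds. The name adjust_volume and the bounds 0.0 and 2.0 are assumptions made for this illustration, not part of the bot above.

```python
def adjust_volume(current, delta, lower=0.0, upper=2.0):
    """Return current + delta, clamped to the range [lower, upper].

    The bounds are illustrative assumptions (1.0 is treated as 100%).
    """
    clamped = min(upper, max(lower, current + delta))
    # Rounding hides floating-point noise that builds up over repeated steps.
    return round(clamped, 2)


volume = 1.0
volume = adjust_volume(volume, 0.1)    # what '!volume_up' would do
volume = adjust_volume(volume, -0.1)   # what '!volume_down' would do
print(volume)
```

In the handler, `volume += 0.1` would then become `volume = adjust_volume(volume, 0.1)`.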

Text to speech

As mentioned at the beginning, I use Open JTalk.
I'll leave detailed installation instructions to the references; on a Mac,

$ brew install open-jtalk

is all it takes.

Dictionary files (/usr/local/Cellar/open-jtalk/1.11/dic)
Voice files (/usr/local/Cellar/open-jtalk/1.11/voice/)

are installed along with it.

$ open_jtalk
The Japanese TTS System "Open JTalk"
Version 1.10 (http://open-jtalk.sourceforge.net/)

I call this from Python code.
For compatibility with discord.py, the .wav output is converted to .mp3.


app.py

import os
import subprocess
from pydub import AudioSegment

class VoiceChannel:
    def __init__(self):
        self.conf = {
            "voice_configs": {
                "htsvoice_resource": "/usr/local/Cellar/open-jtalk/1.11/voice/",
                "jtalk_dict": "/usr/local/Cellar/open-jtalk/1.11/dic"
            }
        }


    def make_by_jtalk(self, text, filepath='voice_message', voicetype='mei', emotion='normal'):
        htsvoice = {
            'mei': {
                'normal': ['-m', os.path.join(self.conf['voice_configs']['htsvoice_resource'], 'mei/mei_normal.htsvoice')],
                'angry': ['-m', os.path.join(self.conf['voice_configs']['htsvoice_resource'], 'mei/mei_angry.htsvoice')],
                'bashful': ['-m', os.path.join(self.conf['voice_configs']['htsvoice_resource'], 'mei/mei_bashful.htsvoice')],
                'happy': ['-m', os.path.join(self.conf['voice_configs']['htsvoice_resource'], 'mei/mei_happy.htsvoice')],
                'sad': ['-m', os.path.join(self.conf['voice_configs']['htsvoice_resource'], 'mei/mei_sad.htsvoice')]
            },
            'm100': {
                'normal': ['-m', os.path.join(self.conf['voice_configs']['htsvoice_resource'], 'm100/nitech_jp_atr503_m001.htsvoice')]
            },
            'tohoku-f01': {
                'normal': ['-m', os.path.join(self.conf['voice_configs']['htsvoice_resource'], 'htsvoice-tohoku-f01-master/tohoku-f01-neutral.htsvoice')],
                'angry': ['-m', os.path.join(self.conf['voice_configs']['htsvoice_resource'], 'htsvoice-tohoku-f01-master/tohoku-f01-angry.htsvoice')],
                'happy': ['-m', os.path.join(self.conf['voice_configs']['htsvoice_resource'], 'htsvoice-tohoku-f01-master/tohoku-f01-happy.htsvoice')],
                'sad': ['-m', os.path.join(self.conf['voice_configs']['htsvoice_resource'], 'htsvoice-tohoku-f01-master/tohoku-f01-sad.htsvoice')]
            }
        }

        open_jtalk = ['open_jtalk']
        mech = ['-x', self.conf['voice_configs']['jtalk_dict']]
        speed = ['-r', '1.0']
        outwav = ['-ow', filepath+'.wav']
        cmd = open_jtalk + mech + htsvoice[voicetype][emotion] + speed + outwav
        c = subprocess.Popen(cmd, stdin=subprocess.PIPE)
        c.stdin.write(text.encode())
        c.stdin.close()
        c.wait()
        audio_segment = AudioSegment.from_wav(filepath+'.wav')
        os.remove(filepath+'.wav')
        audio_segment.export(filepath+'.mp3', format='mp3')
        return filepath+'.mp3'

    def after_play(self, e):
        print(e)

With this, passing text to the make_by_jtalk() function generates an audio file named voice_message.mp3.
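
Not every voice offers every emotion (tohoku-f01 has no 'bashful', for example), so looking up htsvoice[voicetype][emotion] directly can raise a KeyError. The following is a minimal sketch of assembling the same open_jtalk command line with a fallback to the 'normal' style. The function build_jtalk_command, the VOICE_FILES table (a subset of the table above), and the example directories are hypothetical names introduced here. The flags -x, -m, -r and -ow are the ones used in make_by_jtalk().

```python
import os

# Subset of the article's voice table, as paths relative to the voice directory.
VOICE_FILES = {
    'mei': {
        'normal': 'mei/mei_normal.htsvoice',
        'happy': 'mei/mei_happy.htsvoice',
        'bashful': 'mei/mei_bashful.htsvoice',
    },
    'tohoku-f01': {
        'normal': 'htsvoice-tohoku-f01-master/tohoku-f01-neutral.htsvoice',
        'happy': 'htsvoice-tohoku-f01-master/tohoku-f01-happy.htsvoice',
    },
}


def build_jtalk_command(voice_dir, dict_dir, out_wav,
                        voicetype='mei', emotion='normal'):
    """Build the open_jtalk argument list, falling back to 'normal'."""
    voices = VOICE_FILES[voicetype]
    # Use the 'normal' style when the requested emotion is not available.
    voice_file = voices.get(emotion, voices['normal'])
    return [
        'open_jtalk',
        '-x', dict_dir,                             # dictionary directory
        '-m', os.path.join(voice_dir, voice_file),  # voice model
        '-r', '1.0',                                # speech rate
        '-ow', out_wav,                             # output wav file
    ]


# '/voices' and '/dic' are placeholder directories for this example.
cmd = build_jtalk_command('/voices', '/dic', 'voice_message.wav',
                          voicetype='tohoku-f01', emotion='bashful')
print(cmd)
```

The returned list can be passed to subprocess.Popen in the same way as cmd in make_by_jtalk().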

Positive/negative classification

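Next I implement the core piece: classifying a sentence as positive or negative.
There are many ways to do this, but here I use polarity analysis, which looks at the impression (polarity) of each word.
For the polarity dictionary, which lists the polarity of each word, I used the 単語感情極性対応表 (word-emotion polarity table) published by the Takamura Laboratory, Precision and Intelligence Laboratory, Tokyo Institute of Technology.
The table lists a polarity value between $-1$ and $+1$ for each word.
I split the sentence into words, look each one up in this table, and compute a positive/negative score.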

To make it easier to work with, I first convert the polarity table to JSON.

pn_ja.json

{
    "—粉": {
        "pos": "名詞",
        "surface": "みじんこ",
        "value": "-0.629769"
    },
    "ああ": {
        "pos": "副詞",
        "surface": "ああ",
        "value": "-0.31688"
    },
(omitted)
}

I downloaded the table in txt format and converted it as follows.

reformat_pn_table.py


import json

def main():
    with open('pn_ja.txt', 'r') as f:
        words = f.readlines()

    word_dict = {}

    for word in words:
        tmp = word.split(':')
        word_dict[tmp[0]] = {}
        word_dict[tmp[0]]['surface'] = tmp[1]
        word_dict[tmp[0]]['pos'] = tmp[2]
        word_dict[tmp[0]]['value'] = tmp[3].replace('\n', '')

    with open('pn_ja.json', 'w') as f:
        json.dump(word_dict, f, ensure_ascii=False, indent=4, sort_keys=True, separators=(',', ': '))

if __name__ == '__main__':
    main()
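
The script above assumes every line has exactly four colon-separated fields. As a minimal sketch, the per-line step could be isolated into a function that skips malformed lines instead of raising an IndexError. The function name parse_pn_line is hypothetical, and unlike the script above it converts the value to a float, which is a choice made only for this example. The sample line is reconstructed from the "ああ" entry in pn_ja.json.

```python
def parse_pn_line(line):
    """Parse one 'word:reading:pos:value' line.

    Returns (word, info) or None when the line is malformed.
    """
    fields = line.rstrip('\n').split(':')
    if len(fields) != 4:
        return None
    word, reading, pos, value = fields
    try:
        score = float(value)
    except ValueError:
        return None
    # 'surface' holds the reading, mirroring the keys used in pn_ja.json.
    return word, {'surface': reading, 'pos': pos, 'value': score}


print(parse_pn_line('ああ:ああ:副詞:-0.31688\n'))
print(parse_pn_line('not a valid line'))
```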

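Then I compute the polarity of the whole sentence from the per-word polarity values.
Ideally the context should be taken into account, but for simplicity I judge whether the sentence is positive or negative simply by the sum of the polarity values of its words.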

$$ S = \sum_{i} s_i$$

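$s_i$ is the polarity value of the $i$-th word and $S$ is the total score.
The emotion of the voice is chosen according to this score.
There is no particular basis for them, but going by my own judgment I used the following thresholds.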

$0.5 < S$: Happy
$-0.5 \leq S \leq 0.5$: Normal
$-1.0 \leq S < -0.5$: Sad
$S < -1.0$: Angry

That is the mapping I settled on.
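
As a small worked example of the sum and the thresholds above, the sketch below scores a few word lists. The polarity values are made up for illustration and are not taken from the dictionary; the function names sentence_score and classify_emotion are likewise hypothetical.

```python
def sentence_score(word_scores):
    """S = sum of the per-word polarity values s_i."""
    return sum(word_scores)


def classify_emotion(score):
    """Map the total score S to a voice style using the thresholds above."""
    if score > 0.5:
        return 'happy'
    if score < -1.0:
        return 'angry'
    if score < -0.5:
        return 'sad'
    return 'normal'


# Made-up per-word polarity values for three hypothetical sentences.
examples = {
    'mostly positive': [0.9, -0.2, 0.1],
    'mildly negative': [-0.6, -0.3],
    'strongly negative': [-0.7, -0.6],
}
for label, scores in examples.items():
    print(label, classify_emotion(sentence_score(scores)))
```

Note that the boundary values $S = 0.5$ and $S = -0.5$ both fall in the Normal range, and $S = -1.0$ counts as Sad.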


app.py

import MeCab
import json

class CommonModule:
    def load_json(self, file):
        with open(file, 'r', encoding='utf-8') as f:
            json_data = json.load(f)
        return json_data

class NLP:
    def __init__(self):
        self.cm = CommonModule()

    def morphological_analysis(self, text, keyword='-Ochasen'):
        words = []
        tagger = MeCab.Tagger(keyword)
        result = tagger.parse(text)
        result = result.split('\n')
        result = result[:-2]

        for word in result:
            temp = word.split('\t')
            print(word)
            word_info = {
                'surface': temp[0],
                'kana': temp[1],
                'base': temp[2],
                'pos': temp[3],
                'conjugation': temp[4],
                'form': temp[5]
            }
            words.append(word_info)
        return words

    def evaluate_pn_ja_wordlist(self, wordlist, word_pn_dictpath=None):
        if word_pn_dictpath is None:
            word_pn_dict = self.cm.load_json('pn_ja.json')
        else:
            word_pn_dict = self.cm.load_json(word_pn_dictpath)

        pn_value = 0
        for word in wordlist:
            pn_value += self.evaluate_pn_ja_word(word, word_pn_dict)

        return pn_value

    def evaluate_pn_ja_word(self, word, word_pn_dict:dict):
        if type(word) is dict:
            word = word['base']
        elif type(word) is str:
            pass
        else:
            raise TypeError

        if word in word_pn_dict.keys():
            pn_value = float(word_pn_dict[word]['value'])
            return pn_value
        return 0

    def analysis_emotion(self, text):
        split_words = self.morphological_analysis(text, "-Ochasen")
        pn_value = self.evaluate_pn_ja_wordlist(split_words)
        if pn_value > 0.5:
            emotion = 'happy'
        elif pn_value < -1.0:
            emotion = 'angry'
        elif pn_value < -0.5:
            emotion = 'sad'
        else:
            emotion = 'normal'
        return emotion
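
To see what morphological_analysis() does with the tagger output without installing MeCab, here is a minimal sketch that parses a hard-coded string. The sample text is an illustrative assumption of what tab-separated '-Ochasen' style output looks like (surface, reading, base form, part of speech, conjugation type, conjugation form, then an 'EOS' line); it was not captured from a real run. The function name parse_chasen is hypothetical.

```python
# Assumed sample of '-Ochasen' style output for a short phrase.
SAMPLE = (
    '楽しかっ\tタノシカッ\t楽しい\t形容詞-自立\t形容詞・イ段\t連用タ接続\n'
    'た\tタ\tた\t助動詞\t特殊・タ\t基本形\n'
    'EOS\n'
)


def parse_chasen(output):
    """Turn tab-separated tagger output into a list of word dicts."""
    words = []
    for line in output.split('\n'):
        # Skip the 'EOS' marker and the empty string after the last newline.
        if not line or line == 'EOS':
            continue
        cols = line.split('\t')
        if len(cols) < 4:
            continue  # ignore lines that do not have the expected columns
        words.append({'surface': cols[0], 'base': cols[2], 'pos': cols[3]})
    return words


for word in parse_chasen(SAMPLE):
    print(word['surface'], '->', word['base'])
```

The last two items of the split output are 'EOS' and an empty string, which is why the code above drops them with result[:-2]. The base form is what matters for the lookup, since evaluate_pn_ja_word() searches the dictionary by word['base'].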

For something more sophisticated, you could use a machine-learning emotion estimation model.
Among the free and easy-to-try options, the COTOHA API looked promising.

Full implementation

Finally, connect everything together and it's done.

Full code

app.py

import MeCab
import json
import discord
import os
import subprocess
from pydub import AudioSegment

class CommonModule:
    def load_json(self, file):
        with open(file, 'r', encoding='utf-8') as f:
            json_data = json.load(f)
        return json_data

class NLP:
    def __init__(self):
        self.cm = CommonModule()

    def morphological_analysis(self, text, keyword='-Ochasen'):
        words = []
        tagger = MeCab.Tagger(keyword)
        result = tagger.parse(text)
        result = result.split('\n')
        result = result[:-2]

        for word in result:
            temp = word.split('\t')
            print(word)
            word_info = {
                'surface': temp[0],
                'kana': temp[1],
                'base': temp[2],
                'pos': temp[3],
                'conjugation': temp[4],
                'form': temp[5]
            }
            words.append(word_info)
        return words

    def evaluate_pn_ja_wordlist(self, wordlist, word_pn_dictpath=None):
        if word_pn_dictpath is None:
            word_pn_dict = self.cm.load_json('pn_ja.json')
        else:
            word_pn_dict = self.cm.load_json(word_pn_dictpath)

        pn_value = 0
        for word in wordlist:
            pn_value += self.evaluate_pn_ja_word(word, word_pn_dict)

        return pn_value

    def evaluate_pn_ja_word(self, word, word_pn_dict:dict):
        if type(word) is dict:
            word = word['base']
        elif type(word) is str:
            pass
        else:
            raise TypeError

        if word in word_pn_dict.keys():
            pn_value = float(word_pn_dict[word]['value'])
            return pn_value
        return 0

    def analysis_emotion(self, text):
        split_words = self.morphological_analysis(text, "-Ochasen")
        pn_value = self.evaluate_pn_ja_wordlist(split_words)
        if pn_value > 0.5:
            emotion = 'happy'
        elif pn_value < -1.0:
            emotion = 'angry'
        elif pn_value < -0.5:
            emotion = 'sad'
        else:
            emotion = 'normal'
        return emotion

class VoiceChannel:
    def __init__(self):
        self.conf = {
            "voice_configs": {
                "htsvoice_resource": "/usr/local/Cellar/open-jtalk/1.11/voice/",
                "jtalk_dict": "/usr/local/Cellar/open-jtalk/1.11/dic"
            }
        }


    def make_by_jtalk(self, text, filepath='voice_message', voicetype='mei', emotion='normal'):
        htsvoice = {
            'mei': {
                'normal': ['-m', os.path.join(self.conf['voice_configs']['htsvoice_resource'], 'mei/mei_normal.htsvoice')],
                'angry': ['-m', os.path.join(self.conf['voice_configs']['htsvoice_resource'], 'mei/mei_angry.htsvoice')],
                'bashful': ['-m', os.path.join(self.conf['voice_configs']['htsvoice_resource'], 'mei/mei_bashful.htsvoice')],
                'happy': ['-m', os.path.join(self.conf['voice_configs']['htsvoice_resource'], 'mei/mei_happy.htsvoice')],
                'sad': ['-m', os.path.join(self.conf['voice_configs']['htsvoice_resource'], 'mei/mei_sad.htsvoice')]
            },
            'm100': {
                'normal': ['-m', os.path.join(self.conf['voice_configs']['htsvoice_resource'], 'm100/nitech_jp_atr503_m001.htsvoice')]
            },
            'tohoku-f01': {
                'normal': ['-m', os.path.join(self.conf['voice_configs']['htsvoice_resource'], 'htsvoice-tohoku-f01-master/tohoku-f01-neutral.htsvoice')],
                'angry': ['-m', os.path.join(self.conf['voice_configs']['htsvoice_resource'], 'htsvoice-tohoku-f01-master/tohoku-f01-angry.htsvoice')],
                'happy': ['-m', os.path.join(self.conf['voice_configs']['htsvoice_resource'], 'htsvoice-tohoku-f01-master/tohoku-f01-happy.htsvoice')],
                'sad': ['-m', os.path.join(self.conf['voice_configs']['htsvoice_resource'], 'htsvoice-tohoku-f01-master/tohoku-f01-sad.htsvoice')]
            }
        }

        open_jtalk = ['open_jtalk']
        mech = ['-x', self.conf['voice_configs']['jtalk_dict']]
        speed = ['-r', '1.0']
        outwav = ['-ow', filepath+'.wav']
        cmd = open_jtalk + mech + htsvoice[voicetype][emotion] + speed + outwav
        c = subprocess.Popen(cmd, stdin=subprocess.PIPE)
        c.stdin.write(text.encode())
        c.stdin.close()
        c.wait()
        audio_segment = AudioSegment.from_wav(filepath+'.wav')
        os.remove(filepath+'.wav')
        audio_segment.export(filepath+'.mp3', format='mp3')
        return filepath+'.mp3'

    def after_play(self, e):
        print(e)

client = discord.Client()
client_id = 'your_client_id'  # placeholder; client.run() below expects the bot token

voice = None       # VoiceClient of the connected voice channel (None when disconnected)
volume = 1.0       # playback volume, 1.0 = 100%
read_mode = False  # whether ordinary messages are read aloud

@client.event
async def on_ready():
    # Runs when the bot has started
    print('Bot is wake up.')

@client.event
async def on_message(message):
    # Runs when a message is sent to a text channel
    global voice, volume, read_mode
    nlp = NLP()
    vc = VoiceChannel()

    if client.user != message.author:
        text = message.content
        if text == '!login':
            channel = message.author.voice.channel
            voice = await channel.connect()
            await message.channel.send('Logged in to the voice channel')
        elif text == '!logout':
            if voice is not None:
                await voice.disconnect()
                voice = None
            await message.channel.send('Logged out of the voice channel')
        elif text == '!status':
            if voice is not None and voice.is_connected():
                await message.channel.send('Connected to the voice channel')
        elif text == '!volume_up':
            volume += 0.1
            await message.channel.send('Turned the volume up')
        elif text == '!volume_down':
            volume = max(0.0, volume - 0.1)
            await message.channel.send('Turned the volume down')
        elif text == '!bye':
            await client.close()
        elif text == '!read_mode_on':
            read_mode = True
            await message.channel.send('Reading mode turned on')
        elif text == '!read_mode_off':
            read_mode = False
            await message.channel.send('Reading mode turned off')
        else:
            if read_mode and voice is not None:
                emotion = nlp.analysis_emotion(text)
                voice_file = vc.make_by_jtalk(text, emotion=emotion)
                # Wrap the source so that the current volume setting takes effect
                audio_source = discord.PCMVolumeTransformer(
                    discord.FFmpegPCMAudio(voice_file), volume=volume)
                voice.play(audio_source, after=lambda e: vc.after_play(e))

client.run(client_id)

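Conclusion

With emotion estimation + Open JTalk, the bot now reads each input text aloud with feeling.
Use it for a chatbot's replies, and even a Christmas spent alone won't feel lonely.
And with that, Merry Christmas 🎄

References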
