I built a smart speaker that answers in my favorite voice with a Raspberry Pi 5

Last updated at 2024-10-19 / Posted at 2024-06-02

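I wanted to try out the Raspberry Pi 5, so I took the usual DIY Raspberry Pi smart speaker and added a twist: it answers in a voice of my choice by way of a locally hosted TTS model.

Addendum (2024-09-16): I also made it possible to operate it from outside the house.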

Architecture

The flow from voice input to audio output is shown below.
Each step is explained later.

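Hardware

Raspberry Pi 5

The Raspberry Pi itself. I bought the 8GB model.

USB power adapter 5V 3A Type C 1.5m

It is sold separately, so you need to buy one.

microSDXC 64GB SanDisk Extreme PRO

The microSD card used for storage. I picked a good-quality one.

HDMI cable HDMI(A)-micro(D) V2.0 1m black

Needed if you want to connect a monitor and use the GUI.
You can do without it.

USB-Audio-A-For-PI5

The speaker for audio output.
It mounts under the Raspberry Pi 5 and produces sound once it is connected through the pins from the back.
It is compact, which I really like.

SunFounder ultra-compact USB mini microphone

A tiny microphone that plugs into a USB port. Used for voice input.

PC for running the TTS model locally

A desktop PC with an RTX 4070 Ti Super.
The TTS model runs on the GPU.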

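Raspberry Pi 5 setup

Up to first boot

On any PC, download the Raspberry Pi OS installer from the page above and write the image to the microSD card.
The PC has to recognize the microSD card for this, so if it has no card slot you need a USB adapter.

Raspberry Pi OS (64-bit) was very prominently marked as Recommended, so I chose it.

Select the microSD card you bought.

Set the hostname, username and password, Wi-Fi, and locale.
I forgot to take a screenshot, but the "Services" tab has an option that generates an RSA private/public key pair so that you can connect over SSH.

Write the data to the microSD card.

I inserted the microSD card into the Raspberry Pi 5 and it booted!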

Initial settings

Keyboard settings

I could not type the pipe character. Setting Layout to OADG 109A in Mouse and Keyboard Settings fixed it.

Python environment

I set up the environment with poetry.

$ pipx install poetry

PyAudio is used for audio handling, so install the library it depends on.

$ sudo apt-get install portaudio19-dev

Audio

Check that the playback device is recognized.

$ lsusb # Check USB devices
Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 003 Device 002: ID 0c76:1203 JMTek, LLC. USB PnP Audio Device
Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 002: ID 08bb:2902 Texas Instruments PCM2902 Audio Codec
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub


$ aplay -l # List audio playback devices
**** List of PLAYBACK Hardware Devices ****
card 0: vc4hdmi0 [vc4-hdmi-0], device 0: MAI PCM i2s-hifi-0 [MAI PCM i2s-hifi-0]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 1: vc4hdmi1 [vc4-hdmi-1], device 0: MAI PCM i2s-hifi-0 [MAI PCM i2s-hifi-0]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 3: Device_1 [USB PnP Audio Device], device 0: USB Audio [USB Audio]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
  
# Playback test on card 3, device 0
$ aplay -D plughw:3,0 /usr/share/sounds/alsa/Front_Center.wav

If you hear the sound, you are good.

Next, check the PyAudio device indexes.

$ poetry run python

import pyaudio
audio = pyaudio.PyAudio()
for i in range(audio.get_device_count()):
    print(audio.get_device_info_by_index(i))

{'index': 0, 'structVersion': 2, 'name': 'USB PnP Sound Device: Audio (hw:2,0)', 'hostApi': 0, 'maxInputChannels': 1, 'maxOutputChannels': 0, 'defaultLowInputLatency': 0.008684807256235827, 'defaultLowOutputLatency': -1.0, 'defaultHighInputLatency': 0.034829931972789115, 'defaultHighOutputLatency': -1.0, 'defaultSampleRate': 44100.0}
{'index': 1, 'structVersion': 2, 'name': 'USB PnP Audio Device: Audio (hw:3,0)', 'hostApi': 0, 'maxInputChannels': 0, 'maxOutputChannels': 2, 'defaultLowInputLatency': -1.0, 'defaultLowOutputLatency': 0.008684807256235827, 'defaultHighInputLatency': -1.0, 'defaultHighOutputLatency': 0.034829931972789115, 'defaultSampleRate': 44100.0}
{'index': 2, 'structVersion': 2, 'name': 'pulse', 'hostApi': 0, 'maxInputChannels': 32, 'maxOutputChannels': 32, 'defaultLowInputLatency': 0.008684807256235827, 'defaultLowOutputLatency': 0.008684807256235827, 'defaultHighInputLatency': 0.034807256235827665, 'defaultHighOutputLatency': 0.034807256235827665, 'defaultSampleRate': 44100.0}
{'index': 3, 'structVersion': 2, 'name': 'default', 'hostApi': 0, 'maxInputChannels': 32, 'maxOutputChannels': 32, 'defaultLowInputLatency': 0.008684807256235827, 'defaultLowOutputLatency': 0.008684807256235827, 'defaultHighInputLatency': 0.034807256235827665, 'defaultHighOutputLatency': 0.034807256235827665, 'defaultSampleRate': 44100.0}

Index 1 is the speaker.
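The `(hw:2,0)` part of a device name may change depending on the order in which USB devices are enumerated, so matching on the full name can be fragile. As a rough sketch (not part of the original setup), the helper below picks a device by a substring of its name and by whether it has input or output channels. `find_device_index` and `list_pyaudio_devices` are hypothetical names introduced here for illustration.

```python
from typing import Iterable, Optional


def find_device_index(devices: Iterable[dict], name_part: str, need_input: bool) -> Optional[int]:
    """Return the index of the first device whose name contains name_part.

    need_input=True looks for a capture device, False for a playback device.
    """
    channel_key = "maxInputChannels" if need_input else "maxOutputChannels"
    for info in devices:
        if name_part in info["name"] and info[channel_key] > 0:
            return info["index"]
    return None


def list_pyaudio_devices() -> list:
    """Collect the device info dicts from PyAudio (requires PyAudio to be installed)."""
    import pyaudio  # imported lazily so find_device_index works without audio hardware

    audio = pyaudio.PyAudio()
    try:
        return [audio.get_device_info_by_index(i) for i in range(audio.get_device_count())]
    finally:
        audio.terminate()
```

With the device list printed above, `find_device_index(list_pyaudio_devices(), "USB PnP Audio Device", need_input=False)` would be expected to select the speaker entry.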

Next, check the recording device.

$ arecord -l # List recording devices
**** List of CAPTURE Hardware Devices ****
card 2: Device [USB PnP Sound Device], device 0: USB Audio [USB Audio]
  Subdevices: 1/1
  Subdevice #0: subdevice #0

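After confirming the devices were recognized, I ran report.py from the page below to test recording and playback.

If you hear your own voice played back without problems, you are set.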

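Writing the programs

First, the program for the "voice input" and "WAV generation" steps of the architecture.
It is used from app.py, introduced later, and stops accepting voice input when you press Enter.

record.py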

import pyaudio
import time
import wave

CHUNK = 4096
CHANNELS = 1
FRAME_RATE = 44100

class AudioRecorder:

    def __init__(self):
        self.audio = pyaudio.PyAudio()
        # Stays None (PortAudio's default input device) if the USB microphone is not found
        self.card_num = None
        for x in range(0, self.audio.get_device_count()):
            if self.audio.get_device_info_by_index(x)['name'] == 'USB PnP Sound Device: Audio (hw:2,0)':
                self.card_num = self.audio.get_device_info_by_index(x)['index']
        self.wav_file = None
        self.stream = None


    # Callback invoked by PyAudio for each captured buffer
    def callback(self, in_data, frame_count, time_info, status):
        # Append the captured frames to the WAV file
        self.wav_file.writeframes(in_data)
        return None, pyaudio.paContinue

    # Start recording
    def start_record(self, output_filename='record.wav'):

        # Open the WAV file
        print("Recording started. Press Enter when you have finished speaking.")
        self.wav_file = wave.open(output_filename, 'w')
        self.wav_file.setnchannels(CHANNELS)
        self.wav_file.setsampwidth(2)  # 16bits
        self.wav_file.setframerate(FRAME_RATE)

        # Open the stream; with a callback set, capture starts right away
        self.stream = self.audio.open(format=self.audio.get_format_from_width(self.wav_file.getsampwidth()),
                                      channels=self.wav_file.getnchannels(),
                                      rate=self.wav_file.getframerate(),
                                      input_device_index=self.card_num,
                                      input=True,
                                      output=False,
                                      frames_per_buffer=CHUNK,
                                      stream_callback=self.callback)

    # Stop recording
    def stop_record(self):

        print("Recording stopped.")
        # Stop the stream
        self.stream.stop_stream()
        self.stream.close()

        # Close the WAV file
        self.wav_file.close()

    # Release the instance
    def destructor(self):

        # Terminate the PyAudio instance
        self.audio.terminate()


    # Record until Enter is pressed and return the path of the resulting WAV file
    def record_for(self, output_filename='record.wav'):
        self.start_record(output_filename)
        input()
        self.stop_record()
        self.destructor()
        return output_filename
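To check that a recording actually captured something before sending it on, the length of the WAV file can be read back with the standard `wave` module. This is a small sketch rather than part of the original program; `wav_duration_seconds` is a hypothetical helper name.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the length of a WAV file in seconds (frame count / frame rate)."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()
```

For example, a result close to zero for `wav_duration_seconds("record.wav")` would suggest that the microphone was not picked up.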

Next, the program covering the steps from "Speech to Text with the Whisper-1 API" through "audio output" in the architecture.
To run it, use a command like the following.

$ poetry run python app.py https://xxxxxxx (ngrok URL) 2> /dev/null

app.py

import json
import os
import tempfile
import wave
import sys
from io import BytesIO
import logging

from scipy.io.wavfile import read, write
import pyaudio
import requests
from dotenv import load_dotenv
import openai
from record import AudioRecorder


# Logger setup
logger = logging.getLogger()
logger.setLevel(logging.INFO)
console_handler = logging.StreamHandler()
console_handler.setLevel(logging.INFO)
file_handler = logging.FileHandler('app.log')
file_handler.setLevel(logging.INFO)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
console_handler.setFormatter(formatter)
file_handler.setFormatter(formatter)
logger.addHandler(console_handler)
logger.addHandler(file_handler)

try:
    url_arg = sys.argv[1]
    logger.info(f"URL : {url_arg}")

except IndexError:
    raise ValueError("Pass the URL used to reach the TTS API as an argument.") from None

load_dotenv()
p = pyaudio.PyAudio()
CHUNK = 1024
# CARD_NUM = 2 # device (card) number found with arecord -l

client = openai.OpenAI(
    api_key=os.environ.get('OPENAI_API_KEY')
)
recorder = AudioRecorder()

if __name__ == "__main__":
    # Start recording
    recorded_file = recorder.record_for()

    # Speech to text with Whisper
    try:
        with open(recorded_file, "rb") as wavfile:
            transcript = client.audio.transcriptions.create(
                model="whisper-1",
                file=wavfile,
                language='ja'
            )
        logger.info(f"Recorded Question : {transcript.text}")
    except openai.APIStatusError as e:
        # A plain string cannot be raised; wrap the API error in an exception
        raise RuntimeError(f"openai status error. {e}") from e

    # Chat with OpenAI GPT-4o
    chat_completion = client.chat.completions.create(
        messages=[
            {"role": "system", "content": "あなたはゆずソフトのキャラクター「在原 七海」です。七海ちゃんの口調で回答してください。回答自体はできるだけ短くしてください。"},
            {"role": "user", "content": f"{transcript.text}"}
        ],
        model=os.environ.get('OPENAI_API_MODEL')
    )
    answer = chat_completion.choices[0].message.content
    
    # Convert the answer to Nanami's voice with nanami-moe-tts
    logger.info(f"Answer : {answer}")
    payload = {"text": f"{answer}"}
    headers = {"Content-Type": "application/json"}
    response = requests.post(f"{url_arg}/run", headers=headers, data=json.dumps(payload))
    if response.ok:
        with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as tmp_file:
            tmp_file.write(response.content)
            audio_file_path = tmp_file.name
            logger.info(f"Answer Audio Path : {audio_file_path}")
        with wave.open(audio_file_path, 'rb') as wf:
            # Instantiate PyAudio and initialize PortAudio system resources (1)
            p = pyaudio.PyAudio()

            # Open stream (2)
            stream = p.open(format=p.get_format_from_width(wf.getsampwidth()),
                            channels=wf.getnchannels(),
                            rate=wf.getframerate(),
                            output=True # output_device_index is left unset to use the default device
                            )

            # Play samples from the wave file (3)
            while len(data := wf.readframes(CHUNK)):  # Requires Python 3.8+ for :=
                stream.write(data)

            # Close stream (4)
            stream.close()

            # Release PortAudio system resources (5)
            p.terminate()
        
    else:
        print(f"Request Failed with status code {response.status_code}: {response.text}")
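The request to the TTS server is built as `f"{url_arg}/run"`, which yields a double slash if the ngrok URL is passed with a trailing slash, and the call has no timeout. The sketch below shows one way to build the arguments for `requests.post()`. `tts_run_url` and `tts_request_kwargs` are hypothetical helper names, and the 30-second timeout is an assumed value, not something measured for this setup.

```python
import json


def tts_run_url(base_url: str) -> str:
    """Build the /run endpoint URL, tolerating a trailing slash on base_url."""
    return base_url.rstrip("/") + "/run"


def tts_request_kwargs(base_url: str, text: str, timeout_s: float = 30.0) -> dict:
    """Keyword arguments for requests.post(); timeout_s is an assumed value."""
    return {
        "url": tts_run_url(base_url),
        "headers": {"Content-Type": "application/json"},
        "data": json.dumps({"text": text}),
        "timeout": timeout_s,
    }
```

In app.py this could be used as `response = requests.post(**tts_request_kwargs(url_arg, answer))`.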

The recorded question and the reply are written to app.log.

app.log

2024-09-01 12:42:41,834 - root - INFO - URL : https://ngrok-free.app
2024-09-01 12:42:49,158 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/audio/transcriptions "HTTP/1.1 200 OK"
2024-09-01 12:42:49,158 - root - INFO - Recorded Question : おはよう
2024-09-01 12:42:49,674 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-01 12:42:49,677 - root - INFO - Answer : おはようございます♪
2024-09-01 12:42:59,338 - root - INFO - Answer Audio Path : /tmp/tmpwre1w81z.wav
2024-09-01 12:44:40,031 - root - INFO - URL : https://ngrok-free.app
2024-09-01 12:44:43,887 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/audio/transcriptions "HTTP/1.1 200 OK"
2024-09-01 12:44:43,888 - root - INFO - Recorded Question : こんにちは
2024-09-01 12:44:44,349 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-01 12:44:44,353 - root - INFO - Answer : こんにちは～！元気にしてる？
2024-09-01 12:44:49,621 - root - INFO - Answer Audio Path : /tmp/tmpj0wcd2bu.wav
2024-09-01 12:53:20,088 - root - INFO - URL : https://ngrok-free.app
2024-09-01 12:53:25,357 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/audio/transcriptions "HTTP/1.1 200 OK"
2024-09-01 12:53:25,357 - root - INFO - Recorded Question : 今日は何食べたい?
2024-09-01 12:53:25,804 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-01 12:53:25,808 - root - INFO - Answer : うーん、お寿司が食べたいかな！
2024-09-01 12:53:30,664 - root - INFO - Answer Audio Path : /tmp/tmpj6o_v6o6.wav
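Because every entry carries a timestamp, the log can also be used to estimate how long each reply took. The sketch below measures the time from each "Recorded Question" entry to the following "Answer Audio Path" entry, which covers the chat completion and the TTS round trip but not the recording or the playback. `response_latencies` is a hypothetical helper, and the pattern assumes the log format shown above.

```python
import re
from datetime import datetime

# Matches the timestamp and label of the two root-logger entries of interest
ENTRY = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - root - INFO - "
    r"(Recorded Question|Answer Audio Path) :"
)


def response_latencies(log_text: str) -> list:
    """Seconds from each 'Recorded Question' entry to the next 'Answer Audio Path' entry."""
    latencies = []
    question_time = None
    for stamp, label in ENTRY.findall(log_text):
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")
        if label == "Recorded Question":
            question_time = when
        elif question_time is not None:
            latencies.append((when - question_time).total_seconds())
            question_time = None
    return latencies
```

Applied to the three exchanges in the log above, this gives roughly 10.2, 5.7, and 5.3 seconds.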

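TTS model setup

As briefly mentioned in a comment in the program above, the TTS model used here is moe-tts.

See the page above for details.
Given the source data, I think it is best to limit it to personal use.

ngrok

Used for communication between the PC running the TTS model and the Raspberry Pi.

ngrok is a service that exposes a server running on your local PC to the outside.

The TTS model runs on Ubuntu under WSL2 on Windows 10 Home, and in this setup the two machines' local addresses are not on the same subnet.

Apparently Windows 10 Pro can bridge the network with Hyper-V.
I am on Home, so I used ngrok as an alternative.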
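The subnet mismatch can be illustrated with the standard `ipaddress` module. In the sketch below, 172.18.172.133 is the address the Flask server reports later in this article (presumably the WSL2-side address), while 192.168.1.0/24 is only an assumed example of a home LAN that the Raspberry Pi might be on.

```python
import ipaddress


def same_subnet(address: str, network: str) -> bool:
    """True if address belongs to network (given in CIDR notation)."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(network)


# The WSL2-side address is outside the assumed home LAN range
print(same_subnet("172.18.172.133", "192.168.1.0/24"))
```

When this prints `False`, the Raspberry Pi cannot reach the Flask server by that address directly, which is the gap ngrok fills here.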

Set it up as follows.

$ curl -s https://ngrok-agent.s3.amazonaws.com/ngrok.asc \
	| sudo tee /etc/apt/trusted.gpg.d/ngrok.asc >/dev/null \
	&& echo "deb https://ngrok-agent.s3.amazonaws.com buster main" \
	| sudo tee /etc/apt/sources.list.d/ngrok.list \
	&& sudo apt update \
	&& sudo apt install ngrok

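Turning the TTS model into an API

The model I want to use was built for Hugging Face Spaces, so I wrap it in an API with Flask.

moe-tts ships a requirements.txt, so install the libraries with cat requirements.txt | xargs poetry add.
opencc threw an error during this step, so I left it out. It is a library for converting between Chinese character variants (for example Traditional and Simplified), which is not needed here.

Modify moe-tts's app.py as follows. The Flask app later in this article runs the result as main.py.
It contains many functions for the Gradio UI, Voice Conversion, and Soft Voice Conversion, so a lot can be removed.

Also, since I want the voice of Nanami Arihara from Yuzusoft's RIDDLE JOKER, I edited the part that offers multiple choices in Gradio and specify the array index directly.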

import argparse
import os
import re
import warnings

import soundfile as sf
import torch
from torch import LongTensor, no_grad

import commons
import utils
from models import SynthesizerTrn
from text import text_to_sequence

warnings.filterwarnings('ignore')
limitation = os.getenv("SYSTEM") == "spaces"  # limit text and audio length in huggingface spaces

def get_text(text, hps, is_symbol):
    text_norm = text_to_sequence(text, hps.symbols, [] if is_symbol else hps.data.text_cleaners)
    if hps.data.add_blank:
        text_norm = commons.intersperse(text_norm, 0)
    text_norm = LongTensor(text_norm)
    return text_norm


def create_tts_fn(model, hps, speaker_ids):
    def tts_fn(text, speaker, speed, is_symbol):
        if limitation:
            text_len = len(re.sub(r"\[([A-Z]{2})\]", "", text))
            max_len = 150
            if is_symbol:
                max_len *= 3
            if text_len > max_len:
                return "Error: Text is too long", None

        speaker_id = speaker_ids[speaker]
        stn_tst = get_text(text, hps, is_symbol)
        with no_grad():
            x_tst = stn_tst.unsqueeze(0).to(device)
            x_tst_lengths = LongTensor([stn_tst.size(0)]).to(device)
            sid = LongTensor([speaker_id]).to(device)
            audio = model.infer(x_tst, x_tst_lengths, sid=sid, noise_scale=.667, noise_scale_w=0.8,
                                length_scale=1.0 / speed)[0][0, 0].data.cpu().float().numpy()
        del stn_tst, x_tst, x_tst_lengths, sid
        return "Success", (hps.data.sampling_rate, audio)

    return tts_fn

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--device', type=str, default='cpu')
    parser.add_argument('--text', type=str, required=True, help='Input text for TTS')
    parser.add_argument('--speed', type=float, default=1.0, help='Speed for TTS')
    args = parser.parse_args()
    device = torch.device(args.device)
    # Changed to load only Nanami
    config_path = "saved_model/0/config.json"
    model_path = "saved_model/0/model.pth"
    cover_path = "saved_model/0/cover.jpg"
    hps = utils.get_hparams_from_file(config_path)
    model = SynthesizerTrn(
            len(hps.symbols),
            hps.data.filter_length // 2 + 1,
            hps.train.segment_size // hps.data.hop_length,
            n_speakers=hps.data.n_speakers,
            **hps.model)
    utils.load_checkpoint(model_path, model, None)
    model.eval().to(device)
    # Specify Nanami
    speakers = ["\u5728\u539f\u4e03\u6d77"]
    speaker_ids = [6]
    
    tts_fn = create_tts_fn(model, hps, speaker_ids)
    output_message, generated_audio = tts_fn(args.text, 0 , args.speed, False)
    
    if output_message == "Success":
        sampling_rate, audio_data = generated_audio
        sf.write("output.wav", audio_data, sampling_rate)
        print("Audio Generated successfully and saved to 'output.wav'")
    else:
        print(output_message)
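Two small steps in this script are easy to miss: when `hps.data.add_blank` is set, `get_text` inserts a blank token (0) around every symbol, and the `speed` argument reaches the model as `length_scale=1.0 / speed`. `commons.intersperse` is not shown in this article, so the `intersperse` below is a stand-in modelled on the usual VITS-style helper and should be read as an assumption. `length_scale_for` is a hypothetical wrapper that also rejects a speed of zero, which would otherwise cause a division error.

```python
def intersperse(sequence: list, item: int) -> list:
    """Put item between every element and at both ends: [a, b] -> [item, a, item, b, item]."""
    result = [item] * (len(sequence) * 2 + 1)
    result[1::2] = sequence
    return result


def length_scale_for(speed: float) -> float:
    """Convert a user-facing speed into the length_scale passed to model.infer()."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    return 1.0 / speed
```

A speed of 2.0 therefore halves the length scale, so the generated audio is shorter.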

Next, the program that starts the Flask API.

app.py

# app.py
import shlex
import subprocess

from flask import Flask, jsonify, request, send_file

app = Flask(__name__)

@app.route('/run', methods=['POST'])
def run_script():
    # Read the parameters from the request
    text = request.json.get('text', '')
    device = request.json.get('device', 'cpu')
    speed = request.json.get('speed', 1.0)

    # Build the script command as an argument list (no shell is involved, so no quoting is needed)
    command = ["python", "main.py", "--text", str(text), "--device", str(device), "--speed", str(speed)]
    try:
        result = subprocess.run(
            command,
            capture_output=True,
            text=True,
            check=True
        )
        return send_file('./output.wav', as_attachment=True)
    except subprocess.CalledProcessError as e:
        print(e.stderr)
        return jsonify({'error': e.stderr, 'status': 'failure'}), 500

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
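The endpoint passes whatever arrives in the JSON body straight to main.py, so an empty `text` or a `speed` of 0 only fails inside the subprocess (the latter at `1.0 / speed`). As a sketch of checking the payload up front, the function below could be called at the top of `run_script` and its `ValueError` turned into a 400 response. `parse_run_payload` and the device allow-list pattern are assumptions introduced here, not part of the original code.

```python
import re

DEVICE_PATTERN = re.compile(r"^(cpu|cuda(:\d+)?)$")  # assumed allow-list of device strings


def parse_run_payload(payload) -> tuple:
    """Validate the JSON body of /run and return (text, device, speed).

    Raises ValueError with a message suitable for a 400 response.
    """
    if not isinstance(payload, dict):
        raise ValueError("JSON object expected")
    text = payload.get("text", "")
    if not isinstance(text, str) or not text.strip():
        raise ValueError("'text' must be a non-empty string")
    device = payload.get("device", "cpu")
    if not isinstance(device, str) or not DEVICE_PATTERN.match(device):
        raise ValueError("'device' must be 'cpu' or 'cuda[:N]'")
    try:
        speed = float(payload.get("speed", 1.0))
    except (TypeError, ValueError):
        raise ValueError("'speed' must be a number") from None
    if speed <= 0:
        raise ValueError("'speed' must be positive")
    return text, device, speed
```

Keeping the check in a plain function makes it testable without starting Flask.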

Start the server.

$ poetry install
$ poetry run python app.py
* Serving Flask app 'app'
* Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
* Running on all addresses (0.0.0.0)
* Running on http://127.0.0.1:5000
* Running on http://172.18.172.133:5000

Use ngrok to make localhost:5000 reachable from a public URL.

$ ngrok config add-authtoken xxxxxxxxx
$ ngrok http http://localhost:5000
Forwarding  https://xxxxxxxx.ngrok-free.app -> http://localhost:5000

Run the following command; if it works, you are good.

$ curl -X POST -H "Content-Type: application/json" -d '{"text": "おはよう"}' https://xxxx.ngrok-free.app/run --output output.wav
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 38980  100 38956  100    24   9978      6  0:00:04  0:00:03  0:00:01  9984
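Note that `--output output.wav` saves whatever body the server returns, so a JSON error from the 500 branch would also end up in output.wav. A quick way to tell the two apart is to look at the RIFF/WAVE magic bytes at the start of the data. `looks_like_wav` is a hypothetical helper, shown only as a sketch.

```python
def looks_like_wav(data: bytes) -> bool:
    """Check the RIFF/WAVE magic bytes at the start of a WAV file."""
    return len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE"
```

The same check could be applied to `response.content` on the Raspberry Pi side before the audio is played.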

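Once you get this far, the setup is done!

Running the TTS model on the Raspberry Pi 5

Incidentally, I tried running moe-tts on the Raspberry Pi 5 itself, but it failed with an OOM error.
Even if it had run, I suspect the conversion would take quite a while.

Demo

Here is a video of it in action.

The waiting time is cut from the video; in practice it takes about 5 to 10 seconds from the end of recording to the output.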

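Improvements

Thank you for reading this far.
I plan to keep improving this little by little and will update the article when I fix things.
Any advice in the comments would be appreciated.

Things I would like to improve:

Responses take about 10 seconds, and I would like to make that faster
- Image models have things like LCM-LoRA, so maybe TTS can be adapted in a similar way? I will look into it
- I suspect the way the model is executed is not very fast either. I do not understand this area well yet
Pronunciation is occasionally off
- Fixing this looks like it will take a lot of study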
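One thing that may contribute to the latency: each /run request starts a new `python main.py` process, and main.py loads the checkpoint every time it runs. How much of the roughly 10 seconds this accounts for was not measured here. As a sketch of the alternative, the cache below loads a model once and reuses it for later requests. `ModelCache` is a hypothetical class, and the `loader` callable stands in for the checkpoint-loading steps in main.py.

```python
class ModelCache:
    """Keeps loaded models in memory so repeated requests skip the slow load."""

    def __init__(self, loader):
        self._loader = loader  # callable(model_path, device) -> model
        self._models = {}      # (model_path, device) -> loaded model
        self.load_count = 0    # how many times the loader actually ran

    def get(self, model_path: str, device: str):
        key = (model_path, device)
        if key not in self._models:
            self._models[key] = self._loader(model_path, device)
            self.load_count += 1
        return self._models[key]
```

In the Flask app, a single module-level `ModelCache` could replace the subprocess call, so that only the first request pays the loading cost.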

Reference pages
