Running the M5Stack LLM8850 Module on a Raspberry Pi 5 (TTS Edition)

Posted at 2025-10-06, last updated at 2025-10-07

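In the previous article, I set up the M5Stack LLM8850 module on a Raspberry Pi 5 and confirmed that the pre-converted, pre-quantized LLM models provided by AXERA could run.

This time, I run the MeloTTS text-to-speech (TTS) model that M5Stack officially provides.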

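Setup

For the most part, I follow the official procedure.
(One step in the official procedure was hard to follow this time. I cover it below.)

Getting the source and building

From GitHub, fetch the MeloTTS source tree that has been modified for AXERA chips (the AX8850 in this case).
(This repository appears to belong to a developer's personal account rather than to an official AXERA or M5Stack organization. Its location may change later, so keep that in mind.)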

# Clone from GitHub
$ git clone https://github.com/ml-inory/melotts.axcl.git
Cloning into 'melotts.axcl'...
remote: Enumerating objects: 2167, done.
remote: Counting objects: 100% (2167/2167), done.
remote: Compressing objects: 100% (1776/1776), done.
remote: Total 2167 (delta 403), reused 2079 (delta 325), pack-reused 0 (from 0)
Receiving objects: 100% (2167/2167), 21.76 MiB | 1.09 MiB/s, done.
Resolving deltas: 100% (403/403), done.

# Move into the cloned directory
$ cd melotts.axcl/

# Give the build script execute permission
$ chmod +x build_aarch64.sh

# Run the build script
$ ./build_aarch64.sh
-- The C compiler identification is GNU 14.2.0
-- The CXX compiler identification is GNU 14.2.0
# (output omitted)
Install the project...
-- Install configuration: "Release"
-- Installing: /home/pi/melotts.axcl/install/./melotts
-- Set non-toolchain portion of runtime path of "/home/pi/melotts.axcl/install/./melotts" to "$ORIGIN/../3rdparty/onnxruntime_aarch64/lib"

Downloading the models

Once the build finishes, download the models.

$ ./download_models.sh

2025-10-06 15:17:08 (1.25 MB/s) - `models.tar.gz' へ保存完了 [64503548/64503548]

models/encoder.onnx
models/decoder.axmodel

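There is one thing to watch out for here.
The wording in the official documentation at this step is easy to misread.

On a quick read, it can look as if running download_models.sh automatically downloads the MeloTTS model data on Hugging Face for every language from Chinese through Spanish.

However, as the log above shows, ./download_models.sh downloads only two files, encoder.onnx and decoder.axmodel. The per-language models are not downloaded.

So what you actually need to do here is:

Move up one directory (to the parent of the melotts.axcl/ directory)
git clone the model you need (MeloTTS-English on Hugging Face if you want to run the demo)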

$ cd ..
$ git clone https://huggingface.co/M5Stack/MeloTTS-English-ax650
$ cd melotts.axcl

If you skip this, the melotts command in a later step cannot find the model it needs and fails with the following error.

Open ../MeloTTS-English-ax650/g-en-au.bin failed!
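To catch this before melotts is started, the asset paths can be checked up front. The following is a minimal sketch, not part of the official procedure: missing_files is a hypothetical helper name, and the file names are the ones used in the sample command of this article.

```shell
# Report every argument that is not an existing regular file.
# Returns 0 when all files exist, 1 when at least one is missing.
missing_files() {
  local rc=0 path
  for path in "$@"; do
    if [ ! -f "$path" ]; then
      printf 'missing: %s\n' "$path"
      rc=1
    fi
  done
  return "$rc"
}

# File names taken from the sample command in this article.
model_dir=../MeloTTS-English-ax650
if ! missing_files \
    "$model_dir/encoder-en.onnx" \
    "$model_dir/decoder-en-au.axmodel" \
    "$model_dir/lexicon-en.txt" \
    "$model_dir/tokens-en.txt" \
    "$model_dir/g-en-au.bin"; then
  echo "Clone MeloTTS-English-ax650 next to melotts.axcl/ before running melotts." >&2
fi
```

Run it from inside melotts.axcl/ so that the relative path ../MeloTTS-English-ax650 resolves the same way it does for melotts.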

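Build (second time)

After downloading the models, run the build again.
(I reran it because the official procedure says to, but since the build had already completed, I doubt it is necessary.)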

$ ./build_aarch64.sh 
CMake Warning (dev) at CMakeLists.txt:25 (install):
  Policy CMP0177 is not set: install() DESTINATION paths are normalized.  Run
  "cmake --help-policy CMP0177" for policy details.  Use the cmake_policy
  command to set the policy and suppress this warning.
This warning is for project developers.  Use -Wno-dev to suppress it.

-- Configuring done (0.0s)
-- Generating done (0.0s)
-- Build files have been written to: /home/pi/melotts.axcl/build_aarch64
[100%] Built target melotts
[100%] Built target melotts
Install the project...
-- Install configuration: "Release"
-- Up-to-date: /home/pi/melotts.axcl/install/./melotts

Verifying with the sample

Run the sample command and check whether TTS actually works.

$ ./install/melotts \
  -e  ../MeloTTS-English-ax650/encoder-en.onnx \
  -d  ../MeloTTS-English-ax650/decoder-en-au.axmodel \
  -l  ../MeloTTS-English-ax650/lexicon-en.txt \
  -t  ../MeloTTS-English-ax650/tokens-en.txt \
  --g ../MeloTTS-English-ax650/g-en-au.bin \
  -s "M5Stack is a leading provider of IoT solutions, committed to providing developers worldwide with convenient and flexible development components and tools. " 
  
encoder: ../MeloTTS-English-ax650/encoder-en.onnx
decoder: ../MeloTTS-English-ax650/decoder-en-au.axmodel
lexicon: ../MeloTTS-English-ax650/lexicon-en.txt
token: ../MeloTTS-English-ax650/tokens-en.txt
sentence: M5Stack is a leading provider of IoT solutions, committed to providing developers worldwide with convenient and flexible development components and tools. 
wav: output.wav
speed: 0.800000
sample_rate: 44100
Load encoder
Load decoder model
Encoder run take 238.44ms
decoder slice num: 8
Decode slice(1/8) take 39.91ms
Decode slice(2/8) take 39.66ms
Decode slice(3/8) take 39.63ms
Decode slice(4/8) take 39.53ms
Decode slice(5/8) take 39.81ms
Decode slice(6/8) take 39.53ms
Decode slice(7/8) take 39.51ms
Decode slice(8/8) take 39.50ms
Saved audio to output.wav

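The generated audio is written to melotts.axcl/output.wav, so play it back.
If you can hear "M5Stack is a leading provider of IoT solutions, committed to providing developers worldwide with convenient and flexible development components and tools.", it worked.

A side note

If the -s parameter (the text to synthesize) is not read correctly because of a mistake in the command, melotts falls back to a default sentence that appears to be hard-coded: "爱芯元智半导体股份有限公司，致力于打造世界领先的人工智能感知与边缘计算芯片。服务智慧城市、智能驾驶、机器人的海量普惠的应用" (a description of AXERA). In the run below, the mistake is a stray space after the backslash that ends the -d line. The space breaks the line continuation, so melotts runs with only -e and -d, and bash then tries to execute -l as a separate command.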

$ ./install/melotts \
  -e ../MeloTTS-English-ax650/encoder-en.onnx \
  -d ../MeloTTS-English-ax650/decoder-en-au.axmodel \ 
  -l ../MeloTTS-English-ax650/lexicon-en.txt \
  -t ../MeloTTS-English-ax650/tokens-en.txt \
  --g ../MeloTTS-English-ax650/g-en-au.bin \
  -s "M5Stack is a leading provider of IoT solutions, committed to providing developers worldwide with convenient and flexible development components and tools. " 
  
encoder: ../MeloTTS-English-ax650/encoder-en.onnx
decoder: ../MeloTTS-English-ax650/decoder-en-au.axmodel
lexicon: ./models/lexicon.txt
token: ./models/tokens.txt
sentence: 爱芯元智半导体股份有限公司，致力于打造世界领先的人工智能感知与边缘计算芯片。服务智慧城市、智能驾驶、机器人的海量普惠的应用
wav: output.wav
speed: 0.800000
sample_rate: 44100
Load encoder
Load decoder model
Encoder run take 180.32ms
decoder slice num: 6
Decode slice(1/6) take 39.87ms
Decode slice(2/6) take 39.67ms
Decode slice(3/6) take 39.76ms
Decode slice(4/6) take 39.58ms
Decode slice(5/6) take 39.84ms
Decode slice(6/6) take 39.49ms
Saved audio to output.wav
-bash: -l: コマンドが見つかりません
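Whitespace after a line-continuation backslash is hard to spot by eye. As a rough sketch, assuming the command has been saved to a script file first, grep can flag such lines. find_broken_continuations is a hypothetical helper name, not something provided by melotts.

```shell
# Print "line-number:line" for every line that ends in a backslash
# followed by one or more spaces or tabs (a broken line continuation).
# Exits non-zero when no such line is found, as grep does.
find_broken_continuations() {
  grep -nE '\\[[:space:]]+$' "$1"
}
```

For example, find_broken_continuations run_tts.sh (run_tts.sh being an illustrative file name) would list the -d line of the failing command above.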

About the command (a rough explanation for specifying paths)

The command for the sample MeloTTS port used here looks like this.

$ ./install/melotts \
  -e  ../MeloTTS-English-ax650/encoder-en.onnx \
  -d  ../MeloTTS-English-ax650/decoder-en-au.axmodel \
  -l  ../MeloTTS-English-ax650/lexicon-en.txt \
  -t  ../MeloTTS-English-ax650/tokens-en.txt \
  --g ../MeloTTS-English-ax650/g-en-au.bin \
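  -s "text to synthesize"
# -e, -d, -l, -t and --g point to the model, the tokenizer data, and the other assets to use

As mentioned under "Downloading the models", you need to download the converted model from Hugging Face and pass the paths of its assets to -e, -d, -l, -t and --g.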
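Since all five assets live in one model directory, the call can be wrapped so that only the directory and file names change between languages. This is a sketch under assumptions: run_melotts and the MELOTTS_BIN override are hypothetical names introduced here for illustration, and only the options shown in this article (-e, -d, -l, -t, --g, -s) are used.

```shell
# Run melotts with all assets taken from one model directory.
# MELOTTS_BIN is a hypothetical override (not part of melotts) so the call can be dry-run.
run_melotts() {
  local dir="$1" encoder="$2" decoder="$3" lexicon="$4" tokens="$5" g="$6" sentence="$7"
  # Rebuild the positional parameters one option per line, with no backslashes.
  set -- -e "$dir/$encoder"
  set -- "$@" -d "$dir/$decoder"
  set -- "$@" -l "$dir/$lexicon"
  set -- "$@" -t "$dir/$tokens"
  set -- "$@" --g "$dir/$g"
  set -- "$@" -s "$sentence"
  "${MELOTTS_BIN:-./install/melotts}" "$@"
}

# Example (run from melotts.axcl/):
# run_melotts ../MeloTTS-English-ax650 encoder-en.onnx decoder-en-au.axmodel \
#   lexicon-en.txt tokens-en.txt g-en-au.bin "Hello from the LLM8850 module."
```

Building the argument list line by line avoids long chains of backslash continuations inside the function, which is where a trailing space can silently cut a command short.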

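Japanese

Now let's generate Japanese speech as well.
The converted MeloTTS model for Japanese is on Hugging Face (M5Stack/MeloTTS-Japanese-ax650).

First, fetch the model data as described earlier.

# Clone the model data into the parent directory (run this from the directory that contains melotts.axcl/)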
$ git clone https://huggingface.co/M5Stack/MeloTTS-Japanese-ax650

# This is what it contains. Note the file names, since they are referenced later.
$ ls -al MeloTTS-Japanese-ax650
合計 81304
drwxrwxr-x  3 pi pi     4096 10月  6 15:30 .
drwx------ 21 pi pi     4096 10月  6 15:35 ..
drwxrwxr-x  9 pi pi     4096 10月  6 15:30 .git
-rw-rw-r--  1 pi pi     1649 10月  6 15:30 .gitattributes
-rw-rw-r--  1 pi pi       24 10月  6 15:30 README.md
-rw-rw-r--  1 pi pi 44847762 10月  6 15:30 decoder-ja-jp.axmodel
-rw-rw-r--  1 pi pi 31479747 10月  6 15:30 encoder-jp.onnx
-rw-rw-r--  1 pi pi     1024 10月  6 15:30 g-jp.bin
-rw-rw-r--  1 pi pi  2813318 10月  6 15:30 ja_tn_tagger.fst
-rw-rw-r--  1 pi pi    86902 10月  6 15:30 ja_tn_verbalizer.fst
-rw-rw-r--  1 pi pi  3986148 10月  6 15:30 lexicon-jp.txt
-rw-rw-r--  1 pi pi     1440 10月  6 15:30 tokens-jp.txt

With the Japanese model in place, run TTS.
Remember to replace the file passed to each parameter with the corresponding file under ../MeloTTS-Japanese-ax650.

$ cd melotts.axcl
$ ./install/melotts \
  -e  ../MeloTTS-Japanese-ax650/encoder-jp.onnx \
  -d  ../MeloTTS-Japanese-ax650/decoder-ja-jp.axmodel \
  -l  ../MeloTTS-Japanese-ax650/lexicon-jp.txt \
  -t  ../MeloTTS-Japanese-ax650/tokens-jp.txt \
  --g ../MeloTTS-Japanese-ax650/g-jp.bin \
  -s "吾輩は猫である。名前はまだ無い。どこで生れたかとんと見当がつかぬ。何でも薄暗いじめじめした所でニャーニャー泣いていた事だけは記憶している。" 

encoder: ../MeloTTS-Japanese-ax650/encoder-jp.onnx
decoder: ../MeloTTS-Japanese-ax650/decoder-ja-jp.axmodel
lexicon: ../MeloTTS-Japanese-ax650/lexicon-jp.txt
token: ../MeloTTS-Japanese-ax650/tokens-jp.txt
sentence: 吾輩は猫である。名前はまだ無い。どこで生れたかとんと見当けんとうがつかぬ。何でも薄暗いじめじめした所でニャーニャー泣いていた事だけは記憶している。
wav: output.wav
speed: 0.800000
sample_rate: 44100
Load encoder
Load decoder model
Encoder run take 259.64ms
decoder slice num: 8
Decode slice(1/8) take 39.94ms
Decode slice(2/8) take 39.68ms
Decode slice(3/8) take 39.72ms
Decode slice(4/8) take 39.60ms
Decode slice(5/8) take 39.86ms
Decode slice(6/8) take 39.57ms
Decode slice(7/8) take 39.55ms
Decode slice(8/8) take 39.48ms
Saved audio to output.wav

Generation took roughly 3 seconds.
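The log itself reports how much of that time is spent in the encoder and decoder. As a small sketch, assuming the log format shown above and that the output was saved to a file (for example with tee), the per-step timings can be summed with awk. sum_inference_ms is a hypothetical helper name.

```shell
# Sum the "Encoder run take ...ms" and "Decode slice(...) take ...ms"
# timings from a melotts log on stdin and print the total in milliseconds.
sum_inference_ms() {
  awk '/^Encoder run take|^Decode slice/ { sub(/ms$/, "", $NF); total += $NF }
       END { printf "%.2f\n", total }'
}

# Usage (melotts.log is an illustrative file name):
#   sum_inference_ms < melotts.log
```

Fed the Japanese run above, this prints 577.04, that is, about 0.58 seconds of encoder and decoder time. The rest of the roughly 3 seconds is presumably model loading and other startup work, although that was not measured here.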

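Length that can be generated

Apparently not all of the input text is synthesized. Only the first 11 to 12 seconds are generated, and the audio is cut off partway through.

English: 155 characters, 12 seconds, decoder slices up to 9
Japanese: 58 characters, 11 seconds, decoder slices up to 8

I do not know enough to say whether this is a limitation of MeloTTS or of the AX8850.
If anyone knows, I would appreciate a comment.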
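One possible stopgap, which I have not verified on the device, is to synthesize a long text sentence by sentence so that each call stays short. The sketch below only assumes what the logs above show: melotts writes its result to output.wav in the current directory. split_sentences, synthesize_parts, and the part-NN.wav names are hypothetical, and whether every sentence fits within the limit still depends on its length.

```shell
# Print one sentence per line, splitting after each Japanese full stop "。".
split_sentences() {
  printf '%s\n' "$1" | awk '{ gsub(/。/, "。\n"); print }' | awk '$0 != ""'
}

# Hypothetical helper: synthesize each sentence separately and keep every result.
# Model paths are the ones used in this article; run it from melotts.axcl/.
synthesize_parts() {
  local model_dir=../MeloTTS-Japanese-ax650 n=0 sentence
  split_sentences "$1" | while IFS= read -r sentence; do
    n=$((n + 1))
    ./install/melotts \
      -e "$model_dir/encoder-jp.onnx" \
      -d "$model_dir/decoder-ja-jp.axmodel" \
      -l "$model_dir/lexicon-jp.txt" \
      -t "$model_dir/tokens-jp.txt" \
      --g "$model_dir/g-jp.bin" \
      -s "$sentence" < /dev/null || break
    # Keep this sentence's audio before the next call overwrites output.wav.
    mv output.wav "$(printf 'part-%02d.wav' "$n")"
  done
}
```

The resulting part files would still need to be played in order or joined with a separate audio tool.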

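Addendum

I promptly received a comment from someone knowledgeable. Thank you.

The advice was to change the decoder parameters and rebuild the model, so I would like to write an article once I have tried it.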

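Wrapping up

That's it for this time.
Next, I would like to try running VL models and image generation, and also look into how to drive all of this from Python or a similar language (at the cost of some sleep).

(AXERA seems to expose an API, so I suspect calling it would work, but it bothers me that the executables in the LLM samples look like they are written in C or C++...)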
