Memo: notes from trying out text-to-speech

Posted at 2023-10-14

These are my notes from trying Japanese text-to-speech with ESPNet2.
I used espnet-tts-streamlit, which uses the Tsukuyomi-chan corpus (つくよみちゃんコーパス).

Environment

Windows 10
Python 3.10

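Contents

These are notes on the points where I got stuck when synthesizing speech with https://github.com/syoyo/espnet-tts-streamlit, following the article "ESPNet2 で日本語 TTS(Text-to-speech)するメモ (Windows でも動くよ)" (a memo on Japanese TTS with ESPNet2, which also works on Windows).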

Setup and run

As described in the repository above, set up the environment for espnet-tts-streamlit with:

$ python -m pip install -r requirements.txt

Then launch the app with:

$ streamlit run espnet_tts_app_streamlit.py

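The model is not downloaded from espnet_model_zoo

Pressing "Load/Setup model" produced no visible response.
The model download seems to take a long time, so I left it running overnight, and by morning the model had been downloaded.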

The model files cannot be extracted on Windows because the path is too long (260 characters or more)

The model was downloaded, but an error like the following appeared:
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\path\\to\\espnet-tts-streamlit-main\\.venv\\lib\\site-packages\\espnet_model_zoo\\684a47d8e290fdff14ac8687ecd2f3fe\\exp\\tts_finetune_full_band_jsut_vits_raw_phn_jaconv_pyopenjtalk_prosody\\images\\discriminator_train_time.png'

Suspecting a path-length problem, I removed the path length limit by following
https://learn.microsoft.com/ja-jp/windows/win32/fileio/maximum-file-path-limitation?tabs=registry
and the error above no longer appeared.
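To judge in advance whether a file would land at or beyond the limit, a check along these lines may help. It is only a minimal sketch: the helper name `exceeds_max_path` is hypothetical, and the threshold of 260 is simply the classic Windows MAX_PATH value mentioned in the heading above. `ntpath` is used so that Windows-style joining behaves the same on any OS.

```python
import ntpath

# Classic Windows MAX_PATH value; paths of this length or longer are assumed to fail.
MAX_PATH = 260


def exceeds_max_path(base_dir, relative_path, limit=MAX_PATH):
    """Return True if the joined Windows path is `limit` characters or longer."""
    full_path = ntpath.join(base_dir, relative_path)
    return len(full_path) >= limit
```

Passing the virtual environment's site-packages directory as `base_dir` and a file path inside the model archive as `relative_path` shows whether extraction is likely to hit the limit.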

core.pyx is missing

Running "Load/Setup model" produced the following warning:

C:\path\to\espnet-tts-streamlit-main\.venv\lib\site-packages\espnet2\gan_tts\vits\monotonic_align\__init__.py:19: UserWarning: Cython version is not available. Fallback to 'EXPERIMETAL' numba version. If you want to use the cython version, please build it as follows: `cd espnet2/gan_tts/vits/monotonic_align; python setup.py build_ext --inplace`

I downloaded core.pyx into
C:\path\to\espnet-tts-streamlit-main\.venv\Lib\site-packages\espnet2\gan_tts\vits\monotonic_align\
and ran the following command in that directory:

python setup.py build_ext --inplace

This built core.cp310-win_amd64.pyd and resolved the issue.
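One way to confirm that the built extension is now visible to Python is to ask the import system whether it can locate the module. The following is a rough sketch: `module_available` is a hypothetical helper, and the dotted module name is an assumption derived from the directory shown in the warning and the name of the built .pyd file.

```python
import importlib.util


def module_available(name):
    """Return True if the import system can locate the module, False otherwise."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or the name is malformed.
        return False


if __name__ == "__main__":
    # Assumed module path, based on the directory named in the warning.
    print(module_available("espnet2.gan_tts.vits.monotonic_align.core"))
```

Note that locating a submodule imports its parent packages, so the warning itself may still be printed once if the extension is not found.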

Error related to torch's weight_norm

Running "Load/Setup model" still produced the following messages:

C:\path\to\espnet-tts-streamlit-main\.venv\lib\site-packages\torch\nn\utils\weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
WARNING:root:It seems weight norm is not applied in the pretrained model but the current model uses it. To keep the compatibility, we remove the norm from the current model. This may cause unexpected behavior due to the parameter mismatch in finetuning. To avoid this issue, please change the following parameters in config to false:
 - discriminator_params.follow_official_norm
 - discriminator_params.scale_discriminator_params.use_weight_norm
 - discriminator_params.scale_discriminator_params.use_spectral_norm

See also:
 - https://github.com/espnet/espnet/pull/5240
 - https://github.com/espnet/espnet/pull/5249

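Apparently the behavior of torch's weight_norm.py has changed.

Downgrading torch did not fix it, so I concluded that the problem was on the espnet side rather than the torch side.
I installed a somewhat older version of espnet, chosen more or less arbitrarily.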

pip install espnet==202209
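When swapping versions back and forth like this, it is easy to lose track of what is actually installed in the virtual environment. A small check such as the one below can confirm it; this is a sketch only, and `installed_version` is a hypothetical helper built on the standard library's `importlib.metadata`.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # The two packages whose versions mattered in this section.
    for name in ("espnet", "torch"):
        print(name, installed_version(name))
```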

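Results

With this, text-to-speech synthesis finally worked!
Wonderful!!

For a short text (such as the sample sentence 「吾輩は猫である。名前はまだない。」), synthesis finishes almost instantly even on CPU.

However, for a text of about 500 characters (about 1 minute of synthesized audio), the CPU is pegged at 100% and synthesis takes about 2 to 3 minutes.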
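Put as a real-time factor (processing time divided by audio duration), these rough figures work out as follows. The function name is hypothetical, and the inputs are just the approximate numbers from this memo, not careful measurements.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Return processing time divided by audio duration (values above 1 are slower than real time)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# About 1 minute of audio taking 2 to 3 minutes to synthesize on CPU.
print(real_time_factor(120, 60))  # 2.0
print(real_time_factor(180, 60))  # 3.0
```

In other words, for long inputs CPU synthesis ran at roughly two to three times slower than real time.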
