
Japanese speech synthesis with NVIDIA NeMo (GlowTTS)

Posted at 2020-10-21

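These are notes mainly for people who don't really understand machine learning but want to do TTS with NVIDIA/NeMo anyway (the author is one of them). Everything here was tried on Google Colab only.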

Environment

This assumes NVIDIA/NeMo 1.0.0b1, released on 2020/10/06, or later.

2020/10/05 Announcing NVIDIA NeMo: Fast Development of Speech and Language Models | NVIDIA Developer Blog
2020/10/06 Release NVIDIA Neural Modules 1.0.0b1 · NVIDIA/NeMo

A merge on 2020/10/20 added support for the PyTorch Lightning 1.0 series released on 2020/10/13, so I use the latest main. (A ckpt from the PTL 0.9 series can be made readable in PTL 1.0 by running pytorch_lightning.utilities.upgrade_checkpoint on it, but that is cumbersome.)

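0. What is NVIDIA NeMo?

The official site has the details, but it is a toolkit built on PyTorch Lightning whose main purpose seems to be building, training, and tuning conversational AI with ASR / NLP / TTS. Feeding one model's output into another is probably the real appeal, but this article only covers using the TTS part on its own.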

1. Preparing the data

As in the data preparation for "Japanese speech synthesis with Mozilla TTS (Tacotron2)", which used mozilla/TTS, prepare the data in the same structure as LJSpeech.

LJSpeech-1.1/
- wavs/
  - LJ001-0001.wav
  - LJ001-0002.wav
  - ...
- metadata.csv

LJ001-0002|in being comparatively modern.|in being comparatively modern.

The pipe-separated columns contain, in order:

the wav file ID (the file name in wavs/ without the .wav extension)
the text
the normalized text
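For reference, here is a minimal sketch of splitting one metadata.csv line into those three columns. MetadataRow and parse_metadata_line are hypothetical names made up for this illustration; they are not part of NeMo or LJSpeech.

```python
# Sketch: split one LJSpeech-style metadata.csv line into its three columns.
# MetadataRow and parse_metadata_line are hypothetical helpers for illustration.
from typing import NamedTuple


class MetadataRow(NamedTuple):
    wav_id: str           # file name under wavs/ without the .wav extension
    text: str             # raw transcript
    normalized_text: str  # normalized transcript


def parse_metadata_line(line: str) -> MetadataRow:
    # Drop the trailing newline, then split on the pipe delimiter.
    wav_id, text, normalized_text = line.rstrip("\n").split("|")
    return MetadataRow(wav_id, text, normalized_text)


row = parse_metadata_line(
    "LJ001-0002|in being comparatively modern.|in being comparatively modern.\n"
)
print(row.wav_id)           # LJ001-0002
print(row.normalized_text)  # in being comparatively modern.
```

The unpacking fails with a ValueError if a line does not have exactly three columns, which is a cheap way to spot a stray pipe in a transcript.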

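As in "Japanese speech synthesis with Mozilla TTS (Tacotron2)", I again try this with rough data that was generated automatically by ASR (speech recognition) and never corrected by eye.

2. Training on the data

Training is done by running the scripts in examples/tts of NVIDIA/NeMo. A sample of the notebook I used for GlowTTS training is here.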

2.1. Installing NVIDIA NeMo

## NVIDIA/apex
!git clone https://github.com/NVIDIA/apex
!cd apex; pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

## Build dependencies for NVIDIA/NeMo
!apt-get update && apt-get install -y libsndfile1 ffmpeg && pip install Cython

## NVIDIA/NeMo
!git clone https://github.com/NVIDIA/NeMo
!cd NeMo; ./reinstall.sh

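This installs the version current as of 2020/10/21 (equivalent to 1.0.0b1) from source.

2.2. Preprocessing the data

The scripts in examples/tts read data via arguments given at run time, but they cannot read metadata.csv directly. It has to be converted to JSON files beforehand.
This is a bit of a chore, but the conversion can be done by borrowing part of get_ljspeech_data.py, which lived in scripts until around 2020/03 before vanishing in a big cleanup.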
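To make the target format concrete, here is a minimal sketch of a single manifest line, one JSON object per line with the audio_filepath, duration, and text keys that the borrowed script writes. make_manifest_entry is a hypothetical helper, and the 1.9 second duration and the Japanese transcript are placeholders, not values from real data.

```python
# Sketch: build one manifest line (a JSON object on a single line).
# make_manifest_entry is a hypothetical helper; the duration is a placeholder.
import json


def make_manifest_entry(wav_path: str, duration_sec: float, transcript: str) -> str:
    entry = {
        "audio_filepath": wav_path,
        "duration": float(duration_sec),
        "text": transcript,
    }
    # json.dumps escapes non-ASCII characters as \uXXXX by default;
    # json.loads turns them back into the original text.
    return json.dumps(entry)


line = make_manifest_entry("/content/hogehoge/wavs/LJ001-0002.wav", 1.9, "こんにちは")
print(line)

restored = json.loads(line)
print(restored["text"])  # こんにちは
```

Because of the default escaping, Japanese text looks like a run of \u sequences in the file, but it reads back unchanged.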

Below is a sample for the case where the *.wav files are under /content/hogehoge/wavs/ and metadata.csv is at /content/hogehoge/metadata.csv.

## https://github.com/NVIDIA/NeMo/blob/d671630adf24ac7a5105a1e83dc5c4cda806ec3a/scripts/get_ljspeech_data.py
import json
import os
import random

from scipy.io.wavfile import read


def __process_data(data_folder: str, dst_folder: str):
    """
    To generate manifest
    Args:
        data_folder: source with wav files
        dst_folder: where manifest files will be stored
    Returns:
    """

    if not os.path.exists(dst_folder):
        os.makedirs(dst_folder)

    metadata_csv_path = os.path.join(data_folder, "metadata.csv")
    wav_folder = os.path.join(data_folder, "wavs")
    entries = []

    with open(metadata_csv_path) as f:
        line = f.readline()
        while line:
            # Strip the trailing newline so it does not end up in the transcript
            file, _, transcript = line.rstrip("\n").split("|")
            wav_file = os.path.join(wav_folder, file + ".wav")
            sr, y = read(wav_file)
            assert sr == 22050
            duration = len(y) / sr

            entry = dict()
            entry['audio_filepath'] = os.path.abspath(wav_file)
            entry['duration'] = float(duration)
            entry['text'] = transcript
            entries.append(entry)
            line = f.readline()

    # Randomly split 64 samples from the entire dataset to create the
    # validation set
    random.seed(1) # a fixed seed gives the same val / training split on every run (handy on Google Colab)
    random.shuffle(entries)
    training_set = entries[:-64]
    val_set = entries[-64:]
    with open(os.path.join(dst_folder, "ljspeech_train.json"), 'w') as fout:
        for m in training_set:
            fout.write(json.dumps(m) + '\n')
    with open(os.path.join(dst_folder, "ljspeech_eval.json"), 'w') as fout:
        for m in val_set:
            fout.write(json.dumps(m) + '\n')

## Run the function above
__process_data('/content/hogehoge/', '/content/hogehoge/')

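Running the above splits the data so that /content/hogehoge/ljspeech_eval.json gets 64 lines and /content/hogehoge/ljspeech_train.json gets the rest.
The 64 lines for validation_datasets were chosen against LJSpeech's 13,100 lines, so it is probably better to scale that number with the size of your data. Also note that the script assumes 22,050 Hz mono audio (the sample rate is asserted), so for data in a different format the check and the duration calculation need to be changed accordingly.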
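As a rough sketch of both adjustments, the snippet below reads the duration from the WAV header with Python's standard wave module instead of assuming a sample rate, and scales the validation size in proportion to LJSpeech's 64 out of 13,100. Both helper names are hypothetical, and keeping the same ratio is just my assumption of a reasonable starting point, not something NeMo prescribes.

```python
# Sketch: sample-rate-independent duration and a proportional validation size.
# wav_duration_seconds and validation_size are hypothetical helpers.
import io
import wave


def wav_duration_seconds(wav_file) -> float:
    """Duration from the frame count and the sample rate stored in the header."""
    with wave.open(wav_file, "rb") as w:
        return w.getnframes() / w.getframerate()


def validation_size(num_entries: int, ratio: float = 64 / 13100) -> int:
    """Scale LJSpeech's 64-of-13,100 split to another dataset size (at least 1)."""
    return max(1, round(num_entries * ratio))


# Build a half-second, 16 kHz, mono, 16-bit silent clip in memory for the demo.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 8000)
buf.seek(0)

print(wav_duration_seconds(buf))  # 0.5
print(validation_size(13100))     # 64
print(validation_size(1000))      # 5
```

The wave module only reads uncompressed PCM WAV files. Changing how the duration is computed does not change what sample rate the model config expects, so the audio itself may still need resampling.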

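2.3. Training

Basically, as long as you have data, you only need to tweak the yaml in examples/tts/conf as appropriate and run. That said, NVIDIA/NeMo uses Hydra (facebookresearch/hydra), so any setting present in the yaml can be overridden by script arguments at run time without rewriting the yaml itself. However, some settings that are handy on Google Colab are not in the yaml, so I could not specify them at run time. So before running, I add the following items to the exp_manager section of the yaml.

resume_if_exists → True
resume_past_end → True
resume_ignore_no_checkpoint → True
- ↑ All three are handy when resuming a previous training run on Google Colab.
version → hogehoge
- Without this, the run date and time is used as exp_manager.version by default.
- Training progress is saved in {exp_manager.exp_dir}/{name}/{exp_manager.version}/, and resume also works from that directory. Since exp_manager.version defaults to the run date and time, a new directory is created on every run.
- As a result, leaving the default means starting from scratch every time, so setting it explicitly seems easier.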
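The effect of pinning version can be sketched as below. run_dir is a hypothetical function that only mimics the {exp_manager.exp_dir}/{name}/{exp_manager.version}/ layout described above. It is not NeMo's code, and the timestamp format is my guess.

```python
# Sketch: why a timestamp default gives a new directory on every run.
# run_dir is hypothetical; the timestamp format is an assumption.
from datetime import datetime
from pathlib import PurePosixPath
from typing import Optional


def run_dir(exp_dir: str, name: str, version: Optional[str],
            started_at: datetime) -> PurePosixPath:
    if version is None:
        # Default: derive the version from the time the run started.
        version = started_at.strftime("%Y-%m-%d_%H-%M-%S")
    return PurePosixPath(exp_dir) / name / version


first = datetime(2020, 10, 21, 9, 0, 0)
second = datetime(2020, 10, 21, 21, 30, 0)

# Two runs with the default version land in different directories.
print(run_dir("/content/nemo_experiments", "GlowTTS", None, first))
print(run_dir("/content/nemo_experiments", "GlowTTS", None, second))

# With a fixed version, both runs share one directory, so resume can find it.
print(run_dir("/content/nemo_experiments", "GlowTTS", "hogehoge", first))
print(run_dir("/content/nemo_experiments", "GlowTTS", "hogehoge", second))
```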

For other options and details, see the comments in exp_manager.py.

2.3.1 Tacotron2 training

The tacotron2.py script reads conf/tacotron2.yaml, so edit it before running.

--- conf/tacotron2.yaml.org
+++ conf/tacotron2.yaml
@@ -146,6 +146,10 @@
 
 exp_manager:
   exp_dir: null
+  resume_if_exists: True
+  resume_past_end: True
+  resume_ignore_no_checkpoint: True
+  version: hogehoge
   name: *name
   create_tensorboard_logger: True
   create_checkpoint_callback: True

After editing the yaml, I created a directory for saving to Google Drive (for Google Colab) and then ran the script as follows.

## Save to Google Drive
!mkdir -p /content/drive/My\ Drive/nemo_experiments
!ln -s /content/drive/My\ Drive/nemo_experiments

## View TensorBoard
%load_ext tensorboard
%tensorboard --logdir /content/nemo_experiments/Tacotron2/hogehoge/

!cd NeMo/examples/tts; python tacotron2.py train_dataset=/content/hogehoge/ljspeech_train.json validation_datasets=/content/hogehoge/ljspeech_eval.json exp_manager.exp_dir=/content/nemo_experiments trainer.max_epochs=1000 trainer.check_val_every_n_epoch=5

Unlike in glow_tts.yaml, trainer.max_epochs is a placeholder in tacotron2.yaml, so I set it explicitly. trainer.check_val_every_n_epoch was a rather long 25, so I lowered it to 5. When I hit memory allocation errors, I reduced the batch size from the original 48, for example with model.train_ds.dataloader_params.batch_size=24 model.validation_ds.dataloader_params.batch_size=24.
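When the list of overrides grows, it may be easier to keep them in a dict and assemble the command from it. The following is only a convenience sketch. build_command is a hypothetical helper, and it assumes plain key=value overrides like the ones used in this article.

```python
# Sketch: assemble the training command from a dict of key=value overrides.
# build_command is a hypothetical helper, not part of NeMo or Hydra.
import shlex


def build_command(script: str, overrides: dict) -> str:
    args = [f"{key}={value}" for key, value in overrides.items()]
    # shlex.quote leaves safe strings untouched and quotes anything else.
    return " ".join(["python", script] + [shlex.quote(arg) for arg in args])


overrides = {
    "train_dataset": "/content/hogehoge/ljspeech_train.json",
    "validation_datasets": "/content/hogehoge/ljspeech_eval.json",
    "exp_manager.exp_dir": "/content/nemo_experiments",
    "trainer.max_epochs": 1000,
    "trainer.check_val_every_n_epoch": 5,
    # Smaller batches for when memory allocation errors show up.
    "model.train_ds.dataloader_params.batch_size": 24,
    "model.validation_ds.dataloader_params.batch_size": 24,
}

command = build_command("tacotron2.py", overrides)
print(command)
```

The printed string can be pasted after `!cd NeMo/examples/tts;` in a Colab cell.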

That said, val_loss is still around 2.4 even at its lowest, so I may well be doing something wrong above. I have set this aside for now.

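2.3.2 GlowTTS training

The glow_tts.py script reads conf/glow_tts.yaml, so edit it before running.
Also, fmax was set to 8000 and the synthesized result was so-so (the voice came out high-pitched when synthesizing as-is in the notebook described later), so I set it to null, as in tacotron2.yaml.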

--- conf/glow_tts.yaml.org
+++ conf/glow_tts.yaml
@@ -2,7 +2,7 @@
 sample_rate: &sr 22050
 n_fft: &n_fft 1024
 n_mels: &n_mels 80
-fmax: &fmax 8000
+fmax: &fmax null
 pad_value: &pad_value -11.52
 gin_channels: &gin_channels 0
 
@@ -143,6 +143,10 @@
 
 exp_manager:
   exp_dir: null
+  resume_if_exists: True
+  resume_past_end: True
+  resume_ignore_no_checkpoint: True
+  version: hogehoge
   name: *name
   create_tensorboard_logger: True
   create_checkpoint_callback: True
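On the fmax change: as far as I understand, a null fmax usually means the mel filterbank extends up to the Nyquist frequency, which is half the sample rate. That is how librosa treats fmax=None, and I am assuming NeMo behaves the same way without having checked its internals. effective_fmax below is a hypothetical helper that just spells out the arithmetic.

```python
# Sketch: the upper mel frequency implied by fmax under the Nyquist assumption.
# effective_fmax is a hypothetical helper, not a NeMo function.
from typing import Optional


def effective_fmax(sample_rate: int, fmax: Optional[float]) -> float:
    # None (null in yaml) falls back to half the sample rate.
    return float(fmax) if fmax is not None else sample_rate / 2


print(effective_fmax(22050, 8000))  # 8000.0
print(effective_fmax(22050, None))  # 11025.0
```

Under that assumption, the original setting and the null setting cover different frequency ranges (8,000 Hz versus 11,025 Hz at a 22,050 Hz sample rate), so spectrograms produced with one do not line up with tools expecting the other.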

After editing the yaml, I created a directory for saving to Google Drive (for Google Colab) and then ran the script as follows.

## Save to Google Drive
!mkdir -p /content/drive/My\ Drive/nemo_experiments
!ln -s /content/drive/My\ Drive/nemo_experiments

## View TensorBoard
%load_ext tensorboard
%tensorboard --logdir /content/nemo_experiments/GlowTTS/hogehoge/

!cd NeMo/examples/tts; python glow_tts.py train_dataset=/content/hogehoge/ljspeech_train.json validation_datasets=/content/hogehoge/ljspeech_eval.json exp_manager.exp_dir=/content/nemo_experiments

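Beyond this point val_loss looks likely to climb because of overfitting or the like, so for now I use the checkpoint with low val_loss from around 65k steps for synthesis.

3. Speech synthesis

I output audio using tutorials/tts/1_TTS_inference.ipynb, which is also linked from the NVIDIA/NeMo README. To run it on Google Colab, I followed the "If you're using Google Colab and not running locally, uncomment and run this cell." part at the top.

To use a model you trained yourself, changing the load_spectrogram_model function roughly as follows worked.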

def load_spectrogram_model():
    if spectrogram_generator == "tacotron2":
        from nemo.collections.tts.models import Tacotron2Model as SpecModel
        #pretrained_model = "Tacotron2-22050Hz"
        model = SpecModel.load_from_checkpoint("/content/drive/My Drive/nemo_experiments/Tacotron2/hogehoge/checkpoints/Tacotron2--last.ckpt")
    elif spectrogram_generator == "glow_tts":
        from nemo.collections.tts.models import GlowTTSModel as SpecModel
        #pretrained_model = "GlowTTS-22050Hz"
        model = SpecModel.load_from_checkpoint("/content/drive/My Drive/nemo_experiments/GlowTTS/hogehoge/checkpoints/GlowTTS--last.ckpt")
    else:
        raise NotImplementedError

    #model = SpecModel.from_pretrained(pretrained_model)
    with open_dict(model._cfg):
        global SAMPLE_RATE
        global NFFT
        global NMEL
        global FMAX
        SAMPLE_RATE = model._cfg.sample_rate or SAMPLE_RATE
        NFFT = model._cfg.n_fft or NFFT
        NMEL = model._cfg.n_mels or NMEL
        FMAX = model._cfg.fmax or FMAX
    return model

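3.1. Synthesis results

The Tacotron2 results are considerably worse in quality than the Tacotron2 results from mozilla/TTS, so there seems to be more to tweak on the training side. Compared even with that mozilla/TTS Tacotron2, the GlowTTS results here gave the impression of less broken Japanese. However, GlowTTS's intonation also feels more mechanical.

4. Summary

Even with the rough data used as-is, I was happy that GlowTTS sounded fairly Japanese. I am also looking forward to the trend of making models built with NVIDIA/NeMo easier to use at the edge (NVIDIA Jarvis and so on).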
