
Notes on neural text to speech (as of March 28, 2020)

Last updated at 2020-05-18. Posted at 2018-10-06.

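The goal is to generate natural (human-sounding) speech from text. These notes focus on methods that run, or look easy to get running, in real time or interactively on mobile (offline) with LibTorch or TensorFlow C++.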

Only English is covered.

Trained on videos from the internet; reproduces the speaker's voice quality.

FastSpeech

Appears to make TTS fast. The source code is slated for release.

FastSpeech: Fast, Robust and Controllable Text to Speech
https://arxiv.org/abs/1905.09263
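
Part of why FastSpeech can be fast is that, per the paper, it produces mel frames in parallel and relies on a length regulator that repeats each phoneme's hidden state according to a predicted duration. Below is a minimal sketch of that expansion step only. Plain Python lists stand in for tensors, and the name `length_regulate` and the round-half-up rule are assumptions made for illustration, not code from any implementation.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def length_regulate(hidden_states: Sequence[T],
                    durations: Sequence[float],
                    alpha: float = 1.0) -> List[T]:
    """Repeat each phoneme state `duration * alpha` times (rounded).

    alpha > 1.0 slows the speech down, alpha < 1.0 speeds it up.
    This is an illustrative sketch, not the paper's exact code.
    """
    if len(hidden_states) != len(durations):
        raise ValueError("one duration is required per hidden state")
    expanded: List[T] = []
    for state, duration in zip(hidden_states, durations):
        # Round half up so that 0.5 becomes 1 frame rather than 0.
        repeats = max(0, int(duration * alpha + 0.5))
        expanded.extend([state] * repeats)
    return expanded


states = ["h1", "h2", "h3", "h4"]
frames = length_regulate(states, [2, 2, 3, 1])
faster = length_regulate(states, [2, 2, 3, 1], alpha=0.5)
```

With durations `[2, 2, 3, 1]`, four phoneme states become eight frames; scaling by `alpha` changes the speaking rate without touching the text.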

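Community (unofficial?) implementation

The pretrained model gives reasonably good inference results.

Even on a CPU, "I am happy to see you again" can be synthesized in about 1 second (about 0.1 s for the Transformer and about 0.9 s for Griffin-Lim). Combined with WaveGlow, it takes about 9 seconds.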
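
To put the timings above in perspective, here is a small sketch that computes each stage's share of the total and a real-time factor (synthesis time divided by audio duration). The stage timings are the rough figures quoted above. The 2.0 s audio duration is an assumption for illustration only, since the length of the generated clip was not measured here.

```python
from typing import Dict


def stage_shares(stage_seconds: Dict[str, float]) -> Dict[str, float]:
    """Return each stage's fraction of the total synthesis time."""
    total = sum(stage_seconds.values())
    return {name: seconds / total for name, seconds in stage_seconds.items()}


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Below 1.0 means faster than real time."""
    return synthesis_seconds / audio_seconds


# Rough CPU timings from the note above.
griffin_lim_run = {"transformer": 0.1, "griffin_lim": 0.9}
shares = stage_shares(griffin_lim_run)

# ASSUMPTION: the utterance lasts about 2 seconds (not measured).
assumed_audio_seconds = 2.0
rtf_griffin_lim = real_time_factor(sum(griffin_lim_run.values()), assumed_audio_seconds)
rtf_waveglow = real_time_factor(9.0, assumed_audio_seconds)
```

Under that assumed duration the Griffin-Lim path would be faster than real time and the WaveGlow path several times slower. Either way, the vocoder dominates the cost rather than the Transformer.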

Transformer-TTS

Neural Speech Synthesis with Transformer Network
https://arxiv.org/abs/1809.08895

LPCNet

Runs on mobile.

Mellotron

Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens

Can generate emotional speech and singing voices without any emotion or singing training data.

The official implementation was released on November 25, 2019.

WaveFlow

Has a more compact representation than WaveGlow (fewer parameters). Uses 2D convolutions. WaveGlow and WaveNet can also be defined as special cases derived from it.

WaveFlow: A Compact Flow-based Model for Raw Audio
https://arxiv.org/abs/1912.01219
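
The 2D convolution is possible because, as I read the paper, WaveFlow first squeezes the 1-D waveform into a matrix with `h` rows in column-major order, so that adjacent samples end up in the same column. Here is a minimal sketch of that reshaping step only. The function name `squeeze_column_major` is hypothetical, and plain lists stand in for tensors.

```python
from typing import List, Sequence


def squeeze_column_major(samples: Sequence[float], height: int) -> List[List[float]]:
    """Reshape a 1-D signal into `height` rows, filling column by column.

    Adjacent samples land in the same column, so the height axis
    captures short-range structure and the width axis long-range structure.
    """
    if height <= 0 or len(samples) % height != 0:
        raise ValueError("signal length must be a positive multiple of height")
    width = len(samples) // height
    return [
        [samples[col * height + row] for col in range(width)]
        for row in range(height)
    ]


matrix = squeeze_column_major(list(range(8)), height=4)
# Taking height equal to the signal length yields a single column.
single_column = squeeze_column_major(list(range(8)), height=8)
```

In this picture, letting the height grow to the full signal length leaves one column that is processed purely sequentially. This appears to be how the paper relates WaveFlow to a WaveNet-style autoregressive model.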

Work-in-progress implementation
https://github.com/L0SG/WaveFlow

ForwardTacotron

FastSpeech inspired.

Seems to combine the best of Tacotron and FastSpeech.

Others

ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech
https://arxiv.org/abs/1807.07281

Close to Human Quality TTS with Transformer
https://arxiv.org/abs/1809.08895

Using https://github.com/tensorflow/models/tree/master/official/transformer, it looks like this could be implemented fairly easily...?

FFTNet http://gfx.cs.princeton.edu/pubs/Jin_2018_FAR/
Aimed at real-time use. Possibly better than WaveRNN?

GAN-based text-to-speech synthesis and voice conversion (VC)
https://github.com/r9y9/gantts

AlignTTS: Efficient Feed-Forward Text-to-Speech System without Explicit Alignment
https://arxiv.org/abs/2003.01950

WG-WaveNet: Real-Time High-Fidelity Speech Synthesis without GPU
https://arxiv.org/abs/2005.07412

Impressions

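Tacotron has the advantage of being trainable end to end, but getting good quality seems to take a lot of trial and error in training.
As of March 28, 2020:

- Tacotron2 + WaveGlow is the quality baseline (English).
- FastSpeech + WaveGlow -> compared with Tacotron2 + WaveGlow, it sounds somewhat more mechanical, but prosody such as intonation is better controlled (English). Training on a dataset other than LJSpeech might improve it.
- For mobile, perhaps FastSpeech + SqueezeWave or ForwardTacotron + SqueezeWave?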
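
All of the combinations above share the same two-stage shape: an acoustic model turns text into a mel spectrogram, and a vocoder turns the mel spectrogram into a waveform. The sketch below shows why the two halves can be swapped independently. Everything in it is a hypothetical stand-in: `synthesize`, `dummy_acoustic_model`, and `dummy_vocoder` are illustrative names, and 80 mel bins and a hop length of 256 are assumed values, not settings taken from any of the models listed.

```python
from typing import Callable, List

Mel = List[List[float]]   # frames x mel bins
Waveform = List[float]    # audio samples


def synthesize(text: str,
               acoustic_model: Callable[[str], Mel],
               vocoder: Callable[[Mel], Waveform]) -> Waveform:
    """Two-stage TTS: text -> mel spectrogram -> waveform."""
    return vocoder(acoustic_model(text))


def dummy_acoustic_model(text: str) -> Mel:
    # Stand-in for Tacotron2 / FastSpeech: one silent frame per character.
    return [[0.0] * 80 for _ in text]


def dummy_vocoder(mel: Mel, hop_length: int = 256) -> Waveform:
    # Stand-in for WaveGlow / SqueezeWave: hop_length samples per frame.
    return [0.0] * (len(mel) * hop_length)


audio = synthesize("hello", dummy_acoustic_model, dummy_vocoder)
```

Because the mel spectrogram is the only interface between the stages, swapping WaveGlow for a lighter vocoder on mobile would not require touching the acoustic model, provided both use the same mel settings.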

Transformer-based models are fast and look well suited to running on mobile.

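Tacotron + WaveRNN appears to be geared toward a single speaker (one set of training data per speaker), so for multi-speaker synthesis, voice conversion, or adding emotion, a different model seems better (DeepVoice, VoiceLoop, Mellotron, etc.).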

Looking forward to MelNet.

Implementations

https://github.com/espnet/espnet
- End-to-End Speech Processing Toolkit
- Packed with all sorts of features. Thank you.
https://github.com/keithito/tacotron
- Chinese (Mandarin) version https://github.com/keithito/tacotron/issues/118
https://github.com/NVIDIA/tacotron2
https://github.com/fatchord/WaveRNN
- Network only, so treat it as a reference.
Tacotron2 + WaveRNN https://github.com/h-meru/Tacotron-WaveRNN
Combination of the Tacotron-2 implementation by Rayhane-mamah with the WaveRNN-inspired method by fatchord https://github.com/m-toman/tacorn



Mel spectrogram