Purpose

References

VOCA: Voice Operated Character Animation (CVPR 2019)
https://voca.is.tue.mpg.de/

Neural Voice Puppetry: Audio-driven Facial Reenactment
https://arxiv.org/abs/1912.05566

Direction: papers on making a character speak automatically and naturally (e.g. driven by text-to-speech), combined with 3D or photorealistic rendering.

Neural Voice Puppetry: 2D neural rendering, but the image quality is high. Can generate in real time.
- Also estimates facial expressions from audio (uses DeepSpeech)
VOCA: possibly the state of the art among 3D approaches? However, it outputs shape only.
- Learns shape from FLAME 4D capture data
- http://flame.is.tue.mpg.de/
- https://github.com/Rubikplayer/flame-fitting
Realistic Speech-Driven Facial Animation with GANs
- Image quality is coarse, but the audio and visuals are well matched

Montreal Forced Aligner (MFA): extracts an alignment (phonemes and their time positions, etc.) from text and audio.
https://montreal-forced-aligner.readthedocs.io/en/latest/
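
To make the alignment output concrete, here is a minimal sketch of reading phone intervals back in. It assumes the aligner wrote a Praat TextGrid in the long text format and that the phone tier is named "phones"; both are assumptions about the setup, and `read_tier` is a hypothetical helper, not part of MFA or Praat.

```python
import re

# One interval entry in Praat's long TextGrid text format (assumed layout).
INTERVAL_RE = re.compile(
    r'intervals \[\d+\]:\s*'
    r'xmin = ([0-9.eE+-]+)\s*'
    r'xmax = ([0-9.eE+-]+)\s*'
    r'text = "((?:[^"]|"")*)"'
)
TIER_NAME_RE = re.compile(r'name = "([^"]*)"')


def read_tier(textgrid_text, tier_name):
    """Return [(start_sec, end_sec, label), ...] for the named interval tier."""
    # Each tier begins with a line such as `item [1]:`; drop the file header.
    chunks = re.split(r'item \[\d+\]:', textgrid_text)[1:]
    for chunk in chunks:
        name = TIER_NAME_RE.search(chunk)
        if name and name.group(1) == tier_name:
            return [
                # Praat escapes a double quote inside a label as "".
                (float(start), float(end), label.replace('""', '"'))
                for start, end, label in INTERVAL_RE.findall(chunk)
            ]
    raise KeyError(tier_name)
```

A regex is enough for a quick look at the data; for the short TextGrid format or point tiers, a dedicated TextGrid reader would be the safer choice.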

PRAAT

From the alignment information output by a tool such as MFA above, compute the mean/min/max of pitch and intensity for each phone.
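
The per-phone statistics can be sketched as follows. This is a rough illustration, not Praat's own procedure: it assumes the pitch or intensity track has already been exported as frame times and values (with `None` for undefined frames such as unvoiced pitch), and that phones are given as `(start, end, label)` tuples. `phone_stats` is a hypothetical helper name.

```python
from statistics import mean


def phone_stats(phones, times, values):
    """Summarise a frame-level track (pitch in Hz or intensity in dB) per phone.

    phones: [(start_sec, end_sec, label), ...] taken from the alignment
    times:  frame centre times in seconds
    values: one track value per frame; None marks an undefined frame
    Returns [(label, mean, min, max), ...]; stats are None if no frame is usable.
    """
    rows = []
    for start, end, label in phones:
        if not label:  # skip silence / empty intervals
            continue
        # Keep only defined frames whose centre falls inside the phone.
        inside = [v for t, v in zip(times, values)
                  if start <= t < end and v is not None]
        if inside:
            rows.append((label, mean(inside), min(inside), max(inside)))
        else:
            rows.append((label, None, None, None))
    return rows
```

Running it once with the pitch track and once with the intensity track gives the two sets of mean/min/max values. Short phones may contain no usable frame at all, which is why the sketch returns `None` rather than raising.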