Guide to Building a Commercially Usable Voice-Cloning Model with F5-TTS

Posted at 2025-12-09

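Overview

Building a commercially usable voice-cloning model with F5-TTS requires two things:

Training data: a speech dataset released under a license that permits commercial use
Pretrained model: a model trained from scratch, free of the CC-BY-NC restriction

Important: the official pretrained model (trained on the Emilia dataset) is released under CC-BY-NC 4.0, so it cannot be used commercially even after fine-tuning. Commercial use requires training from scratch.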

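1. Training data requirements

1.1 Technical specifications

Item	Requirement	Notes
Sampling rate	24 kHz	For Vocos. 44 kHz is recommended for BigVGAN
Channels	Mono	Stereo must be downmixed
Format	WAV	16-bit recommended
Segment length	3-15 s	5-10 s recommended
DNSMOS P.835 score	3.0 or higher	Quality-filtering threshold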
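
The format rows of this table can be checked mechanically before training. The following is a minimal sketch using only the Python standard library; the function name `check_clip` and its defaults are illustrative assumptions, and the DNSMOS score is not covered because it needs a separate model.

```python
import wave


def check_clip(path, rate=24000, min_sec=3.0, max_sec=15.0):
    """Return a list of spec violations for one WAV file (empty list = OK).

    `path` may be a file name or an open binary file object.
    """
    problems = []
    with wave.open(path, "rb") as wf:
        if wf.getframerate() != rate:
            problems.append(f"sample rate {wf.getframerate()} != {rate}")
        if wf.getnchannels() != 1:
            problems.append(f"{wf.getnchannels()} channels, expected mono")
        if wf.getsampwidth() != 2:  # sample width is in bytes; 2 bytes = 16-bit
            problems.append(f"{8 * wf.getsampwidth()}-bit, expected 16-bit")
        duration = wf.getnframes() / wf.getframerate()
    if not min_sec <= duration <= max_sec:
        problems.append(f"duration {duration:.2f}s outside {min_sec}-{max_sec}s")
    return problems
```

Running this over every file and logging the non-empty results gives a quick list of clips that need resampling, downmixing, or re-segmentation.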

Sources:

GitHub Discussion #57: "Each clip around 10sec long, normalized, re-sampled to 24000, noise-reduced."
- The clips in this report were about 10 seconds each, normalized, resampled to 24,000 Hz, and noise-reduced
- https://github.com/SWivid/F5-TTS/discussions/57
Emilia-Pipe technical specification: "Filtering using linguistic and acoustic metrics: Segments with... DNSMOS scores below 3.0... are systematically removed"
- Segments with a DNSMOS P.835 score below 3.0 are removed
- https://www.emergentmind.com/topics/emilia-data-processing-pipeline

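1.2 Required amount of data

Use case	Recommended hours	Approx. training steps	Notes
Single-speaker fine-tuning	10-15 hours	50,000-100,000	Starts from an existing model
Multi-speaker fine-tuning	50+ hours	100,000-150,000	At least 50 speakers recommended
Good base model	100+ hours	200,000-500,000	Supports zero-shot cloning
High-quality base model	300+ hours	500,000-1,000,000	Commercial grade
Training from scratch (new language)	90+ hours	350,000+	About one week on an A100

Sources: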

GitHub Discussion #143: "As I see here, you can fine-tune a single voice with just 10 to 15 hours. but for multiple speakers, you'll need more—about 50 hours to start. If you want a good model, aim for at least 100 hours; for something perfect, aim for at least 300 hours or more."
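- Single-voice fine-tuning works with 10-15 hours; multiple speakers need about 50 hours to start; a good model needs at least 100 hours, and 300 hours or more is advised for the best results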
- https://github.com/SWivid/F5-TTS/discussions/143
GitHub Discussion #1168: "Community tip: Use datasets with diverse speakers and accents for robustness; one user trained from scratch with 90 hours on an A100, converging at 350,000 steps with clear Polish speech."
- One user trained from scratch on 90 hours of data with an A100 and converged to clear Polish speech at 350,000 steps
- https://github.com/SWivid/F5-TTS/discussions/1168

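1.3 Quality requirements

Required

Noise removal: remove background noise, background music, and ambient sound
Speaker separation: no overlapping speech from multiple speakers
Normalization: peak-normalize by dividing the waveform by its maximum absolute amplitude
Transcription: an accurate text transcript is required for every clip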
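
The normalization item above is plain peak normalization. As a minimal sketch, assuming the samples are already decoded to floats (the function name `peak_normalize` and the `target_peak` parameter are illustrative, not part of F5-TTS):

```python
def peak_normalize(samples, target_peak=1.0):
    """Scale samples so the largest absolute value equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0:
        # Pure silence (or an empty clip): nothing to scale, avoid dividing by zero
        return list(samples)
    scale = target_peak / peak
    return [s * scale for s in samples]
```

In practice a target slightly below full scale (for example 0.95) leaves headroom against clipping after resampling; that value is a common convention rather than something the sources above specify.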

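Recommended

Speaker diversity: variation in gender, age, and accent
Speaking-style diversity: talk shows, interviews, read speech, and so on
Prosodic variation: changes in emotion, speed, and emphasis

Sources: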

Emilia paper: "Emilia addresses a pressing challenge in speech generation: the insufficiency of diverse and spontaneous speech data. Traditional datasets, primarily derived from audiobooks, fail to capture the natural variability and spontaneity found in real-world conversations"
- Conventional audiobook-derived datasets do not capture the natural variability and spontaneity of real conversation
- https://arxiv.org/html/2501.15907v2

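2. Commercially usable datasets

2.1 Recommended datasets

Dataset	Language	Hours	License	Commercial use
Emilia-YODAS	Multilingual	114,000 hours	CC-BY 4.0	Yes
LibriTTS	English	585 hours	CC-BY 4.0 (built from public-domain LibriVox audio)	Yes
LibriTTS-R	English	585 hours	CC-BY 4.0 (built from public-domain LibriVox audio)	Yes
VCTK	English	44 hours	CC-BY 4.0	Yes
Common Voice	Multilingual	Varies by language	CC0	Yes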

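2.2 Dataset details

Emilia-YODAS (recommended)

Hours: 114,000 hours
Languages: multilingual (English, Chinese, German, French, Japanese, Korean)
License: CC-BY 4.0 (commercial use permitted)
Characteristics: diverse speakers and spontaneous speaking styles

Sources: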

Hugging Face: "For data in Emilia-YODSA, we download the raw data from espnet/yodas2, and use the same license family: CC BY 4.0."
- The Emilia-YODAS data is released under CC-BY 4.0
- https://huggingface.co/datasets/amphion/Emilia-Dataset

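LibriTTS / LibriTTS-R (recommended for English)

Hours: 585 hours
Speakers: 2,456
Sampling rate: 24 kHz
License: CC-BY 4.0 for the corpus itself; the underlying LibriVox recordings are in the public domain

Sources: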

OpenSLR: "LibriTTS is a multi-speaker English corpus of approximately 585 hours of read English speech at 24kHz sampling rate"
- LibriTTS is a multi-speaker English corpus of about 585 hours at a 24 kHz sampling rate
- https://www.openslr.org/60/
LibriVox (official): "LibriVox recordings are in the public domain, which means people can do anything they like with them... sell them... put them in commercials..."
- LibriVox recordings are in the public domain and may be used freely, including commercially
- https://librivox.org/pages/public-domain/

2.3 Data preprocessing pipeline

Preprocessing flow with Emilia-Pipe:

1. Standardization
   - Format conversion: WAV, mono, 16-bit, 24 kHz
   - Amplitude normalization

2. Source Separation
   - Background-music removal (using UVR-MDX-NET)
   - Reported SDR ≈ 11.15

3. Speaker Diarization
   - Uses pyannote/speaker-diarization-3.1
   - Splits the audio into speaker turns

4. Voice Activity Detection (VAD)
   - Uses Silero-VAD (ROC-AUC ≈ 0.99)
   - Splits the audio into segments of 3-30 seconds

5. ASR (automatic speech recognition)
   - Uses WhisperX
   - Generates text transcripts

6. Filtering
   - Discards segments with DNSMOS P.835 < 3.0
   - Checks language-ID confidence
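
The final filtering step reduces to a few threshold checks once the scores exist. The sketch below assumes the DNSMOS score and language-ID confidence were already computed by earlier stages; the `Segment` class, the function names, and the 0.8 confidence threshold are illustrative assumptions and not Emilia-Pipe's actual code.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    path: str
    duration_sec: float
    dnsmos: float            # DNSMOS P.835 overall score, computed upstream
    lang_confidence: float   # language-ID confidence in [0, 1], computed upstream


def keep_segment(seg, min_dnsmos=3.0, min_sec=3.0, max_sec=30.0, min_lang_conf=0.8):
    """True if the segment passes the quality, length, and language checks."""
    return (
        seg.dnsmos >= min_dnsmos
        and min_sec <= seg.duration_sec <= max_sec
        and seg.lang_confidence >= min_lang_conf
    )


def filter_segments(segments, **thresholds):
    """Keep only the segments that pass keep_segment, preserving order."""
    return [seg for seg in segments if keep_segment(seg, **thresholds)]
```

Passing `max_sec=15.0` tightens the length window to the 3-15 second range recommended in section 1.1.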

Sources:

Emilia-Pipe README: "The Emilia-Pipe is the first open-source preprocessing pipeline designed to transform raw, in-the-wild speech data into high-quality training data with annotations for speech generation."
- Emilia-Pipe describes itself as the first open-source pipeline for turning raw, in-the-wild speech into high-quality annotated training data
- https://github.com/open-mmlab/Amphion/blob/main/preprocessors/Emilia/README.md

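3. Building the pretrained model

3.1 Choosing an architecture

F5-TTS comes in two model sizes:

Model	Parameters	dim	depth	heads	Intended use
F5TTS_Small	151M	768	18	12	Lightweight; for limited resources
F5TTS_Base	335M	1024	22	16	High quality; recommended for commercial use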

3.2 Choosing a vocoder

Vocoder	Sampling rate	Characteristics	License
Vocos	24 kHz	Fast and lightweight	MIT
BigVGAN	44 kHz	Higher audio quality, heavier	MIT

Sources:

GitHub Discussion #1168: "For the vocoder, choose BigVGAN at 44kHz with hop length 512—this improves alignment for Polish's complex consonant clusters and longer words"
- BigVGAN at 44 kHz with hop length 512 is reported to handle complex consonant clusters better
- https://github.com/SWivid/F5-TTS/discussions/1168

3.3 Training configuration (for training from scratch)

# Example configuration (F5TTS_Base + Vocos)

# Model settings
model_cfg = {
    "dim": 1024,
    "depth": 22,
    "heads": 16,
    "ff_mult": 2,
    "text_dim": 512,
    "conv_layers": 4
}

# Mel spectrogram settings
mel_spec_cfg = {
    "target_sample_rate": 24000,
    "n_mel_channels": 100,
    "hop_length": 256,
    "win_length": 1024,
    "n_fft": 1024,
    "mel_spec_type": "vocos"  # or "bigvgan"
}

# Training settings
train_cfg = {
    "batch_size_per_gpu": 4000,  # measured in mel frames, not samples
    "batch_size_type": "frame",
    "max_samples": 64,
    "learning_rate": 7.5e-5,
    "num_warmup_updates": 20000,
    "grad_accumulation_steps": 4,
    "max_grad_norm": 1.0,
    "epochs": 11,
    "save_per_updates": 50000
}
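
It helps to translate these settings into an expected number of optimizer updates, since the tables in this guide are expressed in steps. The sketch below is a rough estimate under stated assumptions: batches are always full, padding and the `max_samples` cap are ignored, and one update is counted per `grad_accumulation_steps` batches. The function name and parameters are illustrative.

```python
def estimate_updates(hours, epochs, batch_frames_per_gpu, num_gpus=1,
                     grad_accum=1, sample_rate=24000, hop_length=256):
    """Rough count of optimizer updates for a frame-based batch size."""
    frames_per_sec = sample_rate / hop_length          # 24000 / 256 = 93.75
    total_frames = hours * 3600 * frames_per_sec       # mel frames in the dataset
    frames_per_update = batch_frames_per_gpu * num_gpus * grad_accum
    return int(total_frames * epochs / frames_per_update)


# 100 hours, 11 epochs, 4,000 frames per GPU, 1 GPU, 4 accumulation steps
print(estimate_updates(100, 11, 4000, num_gpus=1, grad_accum=4))  # 23203
```

If this estimate holds, 100 hours of audio with the settings above yields only about 23,000 updates on a single GPU, barely more than the 20,000 warm-up updates. Reaching the step counts listed in section 1.2 would then require raising `epochs` accordingly.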

Sources:

GitHub Discussion #453: Training configuration example with complete mel_spec and model settings
- https://github.com/SWivid/F5-TTS/discussions/453
GitHub Discussion #1168: "Configure batch size to 4,000 with 4-10 gradient accumulation steps... Use AdamW optimizer with a peak learning rate of 7.5e-5, linear warm-up over 20,000 steps"
- Batch size 4,000, 4-10 gradient accumulation steps, AdamW with a peak learning rate of 7.5e-5, and a 20,000-step linear warm-up
- https://github.com/SWivid/F5-TTS/discussions/1168

3.4 Steps for training from scratch

Step 1: Prepare the dataset

# 1. Directory layout
data/
├── audio/
│   ├── speaker_001/
│   │   ├── audio_0001.wav
│   │   └── audio_0002.wav
│   └── speaker_002/
│       └── ...
├── metadata.csv    # audio_file|text format
└── vocab.txt       # vocabulary file
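
Generating `metadata.csv` from a list of (audio path, transcript) pairs is straightforward. The sketch below is one way to do it; the function name `format_metadata` is illustrative, and the `audio_file|text` header row is an assumption that should be checked against the data-preparation script you actually use.

```python
def format_metadata(rows):
    """Render (audio_path, transcript) pairs as pipe-delimited metadata text."""
    lines = ["audio_file|text"]  # header row (assumed; drop it if your loader expects none)
    for audio_path, text in rows:
        text = " ".join(text.split())  # collapse newlines and repeated whitespace
        if "|" in audio_path or "|" in text:
            raise ValueError(f"'|' is the field delimiter and cannot appear in: {audio_path}")
        lines.append(f"{audio_path}|{text}")
    return "\n".join(lines) + "\n"
```

Write the result with `open("data/metadata.csv", "w", encoding="utf-8")` so that non-ASCII transcripts are stored consistently.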

Step 2: Create the vocabulary file (vocab.txt)

# Basic characters (for English)
a
b
c
...
z
A
B
...
Z
0
1
...
9
 
.
,
!
?
'
-

Important: for a new language, the vocabulary must include every character that language needs.
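
Rather than listing characters by hand, the vocabulary can be derived from the transcripts themselves, which guarantees that every character in the training text is covered. This is a minimal sketch; `build_vocab` is an illustrative name, and placing the space character first is an assumption based on F5-TTS using index 0 as the fallback for unknown characters, which is worth confirming against the tokenizer code in your version.

```python
def build_vocab(transcripts):
    """Collect every distinct character in the transcripts, one entry per line."""
    chars = set()
    for text in transcripts:
        chars.update(text)
    chars.discard("\n")  # line breaks are separators in vocab.txt, not tokens
    chars.discard(" ")   # re-added below so that it lands at index 0
    return [" "] + sorted(chars)


def vocab_text(transcripts):
    """Render the vocabulary in vocab.txt form (one character per line)."""
    return "\n".join(build_vocab(transcripts)) + "\n"
```

Building the file from the same `metadata.csv` that is used for training keeps the two in sync when new data is added.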

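Step 3: Start training

# Training from scratch: do not pass --finetune and do not place a pretrained checkpoint in the output folder.
# The flags below match the argparse interface of finetune_cli.py (train.py is driven by a Hydra config
# in recent releases); check them against your installed F5-TTS version.
# --exp_name selects the architecture; it is not a free-form run name.
accelerate launch --mixed_precision=bf16 \
    src/f5_tts/train/finetune_cli.py \
    --exp_name F5TTS_Base \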
    --learning_rate 7.5e-5 \
    --batch_size_per_gpu 4000 \
    --batch_size_type frame \
    --max_samples 64 \
    --grad_accumulation_steps 4 \
    --max_grad_norm 1.0 \
    --epochs 11 \
    --num_warmup_updates 20000 \
    --save_per_updates 50000 \
    --dataset_name your_dataset \
    --tokenizer custom \
    --tokenizer_path /path/to/vocab.txt

Sources:

GitHub Discussion #57: "Hi all, this is very important and might be confusing for some. You need to copy the original model F5TTS_Base/model_1200000.pt into the folder where you are training for fine-tuning. If you start training without copying this model, it will train from scratch!"
- Unless the original model is copied into the training folder for fine-tuning, training starts from scratch
- https://github.com/SWivid/F5-TTS/discussions/57

3.5 Required hardware and training time

GPU	VRAM	Batch size	Approx. training time for 100 hours of data
RTX 3090	24GB	1,600 frames	5-7 days
RTX 4090	24GB	4,000 frames	3-5 days
A100 40GB	40GB	8,000 frames	2-3 days
A100 80GB	80GB	16,000 frames	1-2 days

Sources:

GitHub Discussion #57: "run simple change only the dataname my_speak in 3090 with about 60-80 hours dataset working well... batch_size_per_gpu = 1618"
- Works well on an RTX 3090 with a 60-80 hour dataset and a batch size of 1618
- https://github.com/SWivid/F5-TTS/discussions/57
Hugging Face SPRINGLab: "The model was trained on 8x A100 40GB GPUs for close to a week."
- Trained for about a week on 8x A100 40GB
- https://huggingface.co/SPRINGLab/F5-Hindi-24KHz

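4. Deliverables

Artifacts needed to ship a commercial voice-cloning model:

Artifact	Format	Description
model_XXXXXX.pt / .safetensors	PyTorch model	Trained model checkpoint
vocab.txt	Text	Vocabulary file for the tokenizer
config.json	JSON	Model configuration (optional but recommended)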
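
Because `config.json` is optional and this guide does not define its schema, the layout below is purely hypothetical: it simply records the `model_cfg` and `mel_spec_cfg` values from section 3.3 next to the vocabulary file name, so that a checkpoint can later be reloaded with matching settings. The function name and key names are assumptions.

```python
import json


def build_config(model_cfg, mel_spec_cfg, vocab_file="vocab.txt"):
    """Serialize the training-time settings to a JSON string (hypothetical schema)."""
    config = {
        "model": model_cfg,        # architecture hyperparameters (dim, depth, ...)
        "mel_spec": mel_spec_cfg,  # sample rate, hop length, vocoder type, ...
        "vocab_file": vocab_file,  # vocabulary shipped alongside the checkpoint
    }
    return json.dumps(config, indent=2, sort_keys=True)
```

Storing this file with each checkpoint avoids the common failure mode of loading weights into a model built with different dimensions or mel settings.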

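4.1 Example inference setup

# Signatures follow f5_tts.infer.utils_infer at the time of writing; verify them against your installed version.
import soundfile as sf
from f5_tts.infer.utils_infer import infer_process, load_model, load_vocoder
from f5_tts.model import DiT

# The architecture must match the training configuration (F5TTS_Base)
model_cfg = dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)

# Load the vocoder and the trained checkpoint
vocoder = load_vocoder(vocoder_name="vocos", device="cuda")
model = load_model(
    DiT,
    model_cfg,
    "/path/to/model_500000.safetensors",
    mel_spec_type="vocos",
    vocab_file="/path/to/vocab.txt",
    device="cuda",
)

# Inference: infer_process returns the waveform, its sample rate, and the spectrogram
wave, sample_rate, _ = infer_process(
    "/path/to/reference.wav",
    "Transcript of the reference audio",
    "Text to synthesize",
    model,
    vocoder,
    mel_spec_type="vocos",
    device="cuda",
)
sf.write("output.wav", wave, sample_rate)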

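5. Legal checklist for commercial use

5.1 Verifying dataset licenses

Confirm that every dataset used permits commercial use
Confirm that no CC-BY-NC data is included
Provide attribution wherever the license requires it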
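
The second item of this checklist can be partly automated if you keep a manifest of dataset names and license strings. The sketch below flags any license whose identifier carries an NC (NonCommercial) component; the function name and manifest layout are illustrative assumptions, and a string match is no substitute for reading the actual license terms.

```python
def flag_noncommercial(datasets):
    """Return the names of datasets whose license string contains an NC clause."""
    flagged = []
    for name, license_id in datasets.items():
        # Normalize separators so "CC BY-NC 4.0" and "cc_by_nc" split the same way
        parts = license_id.upper().replace("_", "-").replace(" ", "-").split("-")
        if "NC" in parts:
            flagged.append(name)
    return flagged


manifest = {
    "Emilia-YODAS": "CC-BY 4.0",
    "Emilia": "CC-BY-NC 4.0",
    "Common Voice": "CC0",
}
print(flag_noncommercial(manifest))  # ['Emilia']
```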

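5.2 Rights when using voice cloning

Obtain permission from the rights holder of the reference audio
Take publicity rights and likeness rights into account
Put terms of use in place to prevent misuse

5.3 Rights in the generated output

When training from scratch on commercially licensed data:

Copyright in the output audio is generally treated as belonging to the user (as generative-AI output), though the treatment varies by jurisdiction
The consent of the rights holder of the reference audio is still required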

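6. Reference information

Appendix: recording guidelines (for recording your own data)

Recording environment

Item	Recommendation
Room	A soundproof booth, or a quiet room with acoustic treatment
Microphone	Condenser microphone, or a dynamic microphone of SM58 grade or better
Audio interface	Supports 24-bit/48 kHz or higher
Pop filter	Required

Recording settings

Sampling rate: 48 kHz (downsampled to 24 kHz afterwards)
Bit depth: 24-bit
Format: WAV

What to record

Varied text: news, conversation, stories, technical documents, and so on
Emotional variation: neutral, cheerful, serious, and so on
Speaking rate: normal, slightly faster, and slightly slower
Session length: no more than 2-3 hours per session (to avoid vocal fatigue)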

Post-processing

# Downmix to mono, resample to 24 kHz, and convert to 16-bit
ffmpeg -i input.wav -ac 1 -ar 24000 -sample_fmt s16 output.wav

# Segment splitting (5-10 seconds)
# Use Emilia-Pipe or an equivalent tool
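
For a whole recording session, the same ffmpeg invocation can be generated per file while mirroring the speaker folder layout. The sketch below only builds the command lists and does not run them; the function names and the `raw/` and `clean/` folder names are illustrative assumptions.

```python
from pathlib import Path


def ffmpeg_cmd(src, dst, rate=24000):
    """Build the ffmpeg argument list: mono, `rate` Hz, 16-bit PCM, overwrite output."""
    return ["ffmpeg", "-y", "-i", str(src), "-ac", "1",
            "-ar", str(rate), "-sample_fmt", "s16", str(dst)]


def plan_conversion(src_paths, src_root, dst_root, rate=24000):
    """Map each source file to the same relative path under dst_root."""
    jobs = []
    for src in src_paths:
        dst = Path(dst_root) / Path(src).relative_to(src_root)
        jobs.append(ffmpeg_cmd(src, dst, rate))
    return jobs
```

To execute the plan, collect the inputs with `sorted(Path("raw").rglob("*.wav"))`, create each output's parent folder, and pass every command to `subprocess.run(cmd, check=True)`.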

This guide is based on information available as of June 2025. Check the official repository for the latest details.
