Making Qwen3-TTS Speak in Your Own Voice

Python

Last updated at 2026-01-25 / Posted at 2026-01-25

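Qwen3-TTS is a family of speech generation models that covers not only text-to-speech (TTS) but also voice cloning from reference audio (Voice Clone) and Voice Design, where the voice timbre is specified in text. The official repository states support for 10 languages, low-latency streaming, 3-second cloning, and more. (GitHub)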

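This article summarizes how to "train on a real voice and make the model speak with it". As a first step, this is done not through retraining (fine-tuning) but as Voice Clone, which conditions on reference audio at inference time.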

0. What "training it to speak" realistically means (for Qwen3-TTS)

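In most cases what you need is not additional training of the model. Instead, with Base (Voice Clone), you supply

reference audio (ref_audio)
an accurate transcript of the reference audio (ref_text; optional but recommended)

and the voice timbre (plus speaking style) is matched at inference time. The official README likewise describes Base as a "3-second rapid voice clone" and notes that an embedding-only mode without ref_text (x_vector_only_mode=True) exists but may reduce quality. (GitHub)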
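
The choice between the two modes can be made explicit in code. The sketch below is a hypothetical helper (build_clone_prompt_kwargs is not part of qwen-tts) that only assembles keyword arguments under the parameter names cited above (ref_audio, ref_text, x_vector_only_mode). It assumes that embedding-only mode can be used with ref_text omitted, as the README description suggests.

```python
from typing import Any, Dict, Optional


def build_clone_prompt_kwargs(ref_audio: str, ref_text: Optional[str] = None) -> Dict[str, Any]:
    """Assemble keyword arguments for a voice-clone prompt (hypothetical helper).

    With a transcript, use the full mode (timbre plus speaking style).
    Without one, fall back to embedding-only mode, which may reduce quality.
    """
    transcript = (ref_text or "").strip()
    if transcript:
        return {
            "ref_audio": ref_audio,
            "ref_text": transcript,
            "x_vector_only_mode": False,
        }
    # No usable transcript: only the speaker embedding is used.
    return {"ref_audio": ref_audio, "x_vector_only_mode": True}
```

The returned dictionary could then be unpacked into the prompt-building call shown in section 4.2, for example `model.create_voice_clone_prompt(**kwargs)`.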

1. Which model to choose

The official model descriptions break down as follows. (GitHub)

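CustomVoice: preset speakers (several are provided) plus instruction-based style control
VoiceDesign: design the voice (timbre, etc.) with a text description
Base: Voice Clone from reference audio (3 seconds); also serves as the base for fine-tuning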
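
As a small illustration of this breakdown, the lookup below maps a use case to a model ID. The model IDs are the 1.7B ones used elsewhere in this article; the use-case labels ("preset", "design", "clone") and the function name are made up for this sketch.

```python
# Model IDs as used in the demo commands in this article.
MODEL_IDS = {
    "preset": "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",  # preset speakers + style instructions
    "design": "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",  # voice described in text
    "clone": "Qwen/Qwen3-TTS-12Hz-1.7B-Base",          # voice clone from reference audio
}


def pick_model(use_case: str) -> str:
    """Return the model ID for a use-case label (labels are hypothetical)."""
    try:
        return MODEL_IDS[use_case]
    except KeyError:
        valid = ", ".join(sorted(MODEL_IDS))
        raise ValueError(f"unknown use case {use_case!r}; expected one of: {valid}") from None


print(pick_model("clone"))
```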

2. Installation (shortest path)

The official Quickstart only requires installing from PyPI. (GitHub)

pip install -U qwen-tts

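FlashAttention 2 (flash-attn) is described as recommended for reducing GPU memory usage, but it is highly environment-dependent (installation often gets stuck on Windows). (GitHub)
For local testing, it is fine to start without it.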
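
To see whether flash-attn is present in the current environment without importing it, a check like the following may help. This is a minimal sketch using only the standard library; module_available is a name chosen for this example, and flash_attn is the import name of the flash-attn package.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(name) is not None


if module_available("flash_attn"):
    print("flash-attn found: FlashAttention 2 can be used")
else:
    print("flash-attn not found: continue without it for local testing")
```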

3. Launch the local Web UI demo (three variants)

The official README includes qwen-tts-demo launch examples for each model. (GitHub)

# CustomVoice
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --ip 0.0.0.0 --port 8800

# VoiceDesign
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign --ip 0.0.0.0 --port 8800

# Base (Voice Clone)
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-Base --ip 0.0.0.0 --port 8800

Note: a community article launches the demo with --no-flash-attn so that flash-attn is not used (it also includes notes on running on Windows). (きしだのHatena)

qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-Base --port 8800 --no-flash-attn
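
If the demo is launched from a script, the command line can be assembled programmatically. The sketch below only builds the argument list from the flags that appear in this article (--ip, --port, and the community-reported --no-flash-attn); build_demo_command is a hypothetical helper, and the command is printed rather than executed.

```python
import shlex
from typing import List, Optional


def build_demo_command(
    model_id: str,
    port: int = 8800,
    ip: Optional[str] = None,
    no_flash_attn: bool = False,
) -> List[str]:
    """Build the qwen-tts-demo argument list (hypothetical helper)."""
    cmd = ["qwen-tts-demo", model_id]
    if ip is not None:
        cmd += ["--ip", ip]
    cmd += ["--port", str(port)]
    if no_flash_attn:
        # Flag taken from the community article, not from the official README examples.
        cmd.append("--no-flash-attn")
    return cmd


cmd = build_demo_command("Qwen/Qwen3-TTS-12Hz-1.7B-Base", no_flash_attn=True)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)` once qwen-tts is installed.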

4. Shortest path to "speaking in your own voice" (Base / Voice Clone)

4.1 Try it in the UI first (recommended)

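When you launch Base, the UI changes to one that takes reference audio plus a transcript (as reported in the community article). (きしだのHatena)
What matters here is the accuracy of ref_text. The official README also says that the simplified mode without ref_text may reduce quality. (GitHub)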

4.2 Run it from Python (minimal example)

Following the official design (ref_audio + ref_text, switching x_vector_only_mode if needed), the flow looks roughly like this. The Voice Clone inputs themselves (ref_audio / ref_text / x_vector_only_mode) are documented in the README. (GitHub)

import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",          # use "cpu" when running on CPU
    dtype=torch.bfloat16,         # fp16/bf16 may be recommended, e.g. when using flash-attn
)

ref_audio = "ref.wav"  # your own voice (with as little noise as possible)
ref_text  = "The exact text read aloud in ref.wav (accurate, including punctuation)"

# Build a voice-clone prompt from the reference audio
prompt = model.create_voice_clone_prompt(
    ref_audio=ref_audio,
    ref_text=ref_text,
    x_vector_only_mode=False,     # True makes ref_text unnecessary, but quality may drop
)

# Japanese sample input: "Write another sentence to read aloud here."
text = "ここに読み上げたい別の文章を書く。"
wavs, sr = model.generate_voice_clone(
    text=text,
    voice_clone_prompt=prompt,
    language="Japanese",
)

sf.write("out.wav", wavs[0], sr)
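
Once a prompt has been built, it can be reused for several sentences instead of rebuilding it each time. The following sketch assumes the generate_voice_clone call shown above (text, voice_clone_prompt, language) and takes the model as a parameter; synthesize_lines is a hypothetical helper, not part of qwen-tts.

```python
from typing import Any, Iterable, List, Tuple


def synthesize_lines(
    model: Any,
    prompt: Any,
    lines: Iterable[str],
    language: str = "Japanese",
) -> List[Tuple[Any, int]]:
    """Generate one waveform per non-empty line, reusing the same clone prompt."""
    results = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        wavs, sr = model.generate_voice_clone(
            text=text,
            voice_clone_prompt=prompt,
            language=language,
        )
        results.append((wavs[0], sr))
    return results
```

Each returned pair can be written out the same way as in the example above, for instance `sf.write(f"out_{i}.wav", wav, sr)`.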

5. Tips for a more natural voice (most effective first)

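Use reference audio that is not too short and has little silence and little background noise
Make ref_text match exactly what was read aloud (misreadings tend to break the output) (GitHub)
For Japanese, prepare a sentence rich in voiceless consonants (k/s/t/h/p), long vowels, geminate consonants (sokuon), and the moraic nasal (hatsuon); this has been reported to help in practice (きしだのHatena)
Prefer matching the speaking style as well (with ref_text) over matching only the timbre (embedding only), consistent with the README's caveat (GitHub)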
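
The first tip can be checked roughly before cloning. The sketch below uses only the standard library to report the duration and the share of near-silent samples in a 16-bit PCM WAV file. The function name and the silence threshold (1% of full scale) are assumptions for illustration, not values from the README.

```python
import array
import sys
import wave


def inspect_reference_wav(path: str, silence_threshold: float = 0.01) -> dict:
    """Report duration and the fraction of near-silent samples in a 16-bit PCM WAV."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM WAV files")
        n_channels = wf.getnchannels()
        sample_rate = wf.getframerate()
        n_frames = wf.getnframes()
        samples = array.array("h")
        samples.frombytes(wf.readframes(n_frames))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    limit = silence_threshold * 32768
    quiet = sum(1 for s in samples if abs(s) < limit)
    return {
        "duration_s": n_frames / sample_rate if sample_rate else 0.0,
        "channels": n_channels,
        "sample_rate_hz": sample_rate,
        "silent_fraction": quiet / len(samples) if samples else 1.0,
    }
```

A very short duration or a large silent fraction suggests re-recording or trimming the reference before use.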

6. How to use Voice Design (creating a voice from text)

VoiceDesign generates speech conditioned on a voice description (for example, "female voice"). Dedicated Qwen3-TTS-…-VoiceDesign models are listed explicitly. (GitHub)
There is also a report that specifying "female voice" produced an anime-like voice. (きしだのHatena)

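The quickest way is to launch the demo with the VoiceDesign model as in the official instructions and enter the voice description in the instruction field (description/instruction) of the UI. (GitHub)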
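
When experimenting with several descriptions, it can be convenient to compose them from individual traits. The helper below is a hypothetical sketch that joins traits into a comma-separated string, mirroring the style of the default instruction used in the Colab script later in this article. Whether the model prefers this exact format is not confirmed here.

```python
from typing import Iterable


def build_voice_description(traits: Iterable[str]) -> str:
    """Join voice traits into one comma-separated description, dropping blanks and duplicates."""
    seen = set()
    cleaned = []
    for trait in traits:
        normalized = " ".join(trait.split())  # collapse stray whitespace
        key = normalized.lower()
        if normalized and key not in seen:
            seen.add(key)
            cleaned.append(normalized)
    return ", ".join(cleaned)


print(build_voice_description(["female voice", "young adult", "clear articulation"]))
```

The resulting string is what would be pasted into the demo's instruction field.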

7. Cautions (the bare minimum)

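Do not clone someone else's voice without their consent and then publish it or use it for business (the terms-of-use and legal risks are high)
For internal use, decide the recording conditions, scope of use, prohibition on sharing with third parties, and retention period beforehand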

Appendix: just the commands (copy and paste)

pip install -U qwen-tts

# Clone your own voice (Base)
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-Base --port 8800 --no-flash-attn

# Specify the voice with text (VoiceDesign)
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign --port 8800 --no-flash-attn

# Preset speakers (CustomVoice)
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --port 8800 --no-flash-attn

(--no-flash-attn is a practical flag taken from the community article. (きしだのHatena) The launch examples in the official README mainly use --ip/--port. (GitHub))

# Program Name: qwen3_tts_colab_fix_transformers_and_speak.py
# Creation Date: 20260125
# Purpose: Fix qwen-tts import error (GenerationMixin) by pinning transformers, then speak with Qwen3-TTS in Colab.

# =========================
# PARAM_INIT (centralized settings)
# =========================
PARAM_INIT = {
    # ---- Runtime ----
    "SEED": 42,
    "FORCE_CPU": False,                  # True: force CPU (slow)
    "DTYPE": "bfloat16",                 # "float16" / "bfloat16" / "float32"
    "DEVICE_MAP_CUDA": "cuda:0",
    "SAMPLE_RATE_FALLBACK_HZ": 24000,    # [Hz]
    # ---- Version pins (important) ----
    # An API change in transformers breaks the qwen-tts import (ImportError),
    # so pin transformers to a version known to work.
    "TRANSFORMERS_PIN": "4.57.3",
    "ACCELERATE_PIN": "1.12.0",
    "QWEN_TTS_MIN": "0.0.5",
    # ---- Models (0.6B recommended on CPU) ----
    "MODEL_CUSTOMVOICE_06B": "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice",
    "MODEL_BASE_06B": "Qwen/Qwen3-TTS-12Hz-0.6B-Base",
    "MODEL_VOICEDESIGN_17B": "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    # ---- Defaults ----
    "DEFAULT_TEXT_JA": "今日はとても晴れた日で、風は少し冷たく感じます。",
    "DEFAULT_LANGUAGE": "Japanese",      # "Japanese" / "English" / "auto"
    "DEFAULT_SPEAKER": "Ono Anna",
    "DEFAULT_INSTRUCT": "female voice, Japanese native, young adult, high pitch, anime-style, cheerful, clear articulation",
    # ---- Output ----
    "OUT_DIR": "/content",
    "OUT_PREFIX": "qwen3tts_speak",
}

# ==========================================
# Install (for Colab: install + version pinning)
# ==========================================
# Pin transformers first, because API changes in transformers break the qwen_tts import
try:
    import sys, subprocess, os, time, datetime, math, importlib

    def _run(cmd):
        print(">>", " ".join(cmd))
        subprocess.check_call(cmd)

    def pip_install(pkgs):
        _run([sys.executable, "-m", "pip", "install", "-U"] + pkgs)

    def pip_uninstall(pkgs):
        _run([sys.executable, "-m", "pip", "uninstall", "-y"] + pkgs)

    # Uninstall transformers first, then install the pinned versions
    try:
        pip_uninstall(["transformers"])
    except Exception:
        pass

    pip_install([
        f"transformers=={PARAM_INIT['TRANSFORMERS_PIN']}",
        f"accelerate=={PARAM_INIT['ACCELERATE_PIN']}",
        "numpy",
        "matplotlib",
        "ipywidgets",
        "numba",
        "soundfile",
        "torchaudio",
        "librosa",
        "onnxruntime",
        "sox",
        f"qwen-tts>={PARAM_INIT['QWEN_TTS_MIN']}",
    ])

except Exception as e:
    print("Install failed:", repr(e))
    raise

# ==========================================
# Imports
# ==========================================
try:
    import numpy as np
    import matplotlib.pyplot as plt  # not used below
    import soundfile as sf
    import torch
    from numba import njit
    import ipywidgets as widgets
    from IPython.display import display, Audio, clear_output
except Exception as e:
    print("Import failed:", repr(e))
    raise

# ==========================================
# Seed (set random seeds explicitly)
# ==========================================
def set_seed(seed: int):
    # Fix random seeds for reproducibility
    np.random.seed(seed)
    try:
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except Exception:
        pass

set_seed(PARAM_INIT["SEED"])

# ==========================================
# Numba JIT helpers
# ==========================================
@njit(cache=False)
def rms_numba(x: np.ndarray) -> float:
    """
    Inputs:
        x: 1-D float array
    Outputs:
        rms: root-mean-square value (unitless)
    Process:
        RMS calculation using Numba JIT.
    """
    s = 0.0
    n = x.size
    for i in range(n):
        s += x[i] * x[i]
    if n == 0:
        return 0.0
    return math.sqrt(s / n)

def ensure_mono_float32(audio: np.ndarray) -> np.ndarray:
    # Convert to mono float32 with shape (N,)
    if audio.ndim == 2:
        audio = np.mean(audio, axis=1)
    return audio.astype(np.float32, copy=False)

# ==========================================
# Device / dtype selection
# ==========================================
def pick_device_and_dtype():
    # Use the GPU if available (CPU only when FORCE_CPU is set)
    use_cuda = torch.cuda.is_available() and (not PARAM_INIT["FORCE_CPU"])
    device_map = PARAM_INIT["DEVICE_MAP_CUDA"] if use_cuda else "cpu"

    s = PARAM_INIT["DTYPE"]
    if s == "float16":
        dtype = torch.float16
    elif s == "bfloat16":
        dtype = torch.bfloat16
    else:
        dtype = torch.float32
    return device_map, dtype, use_cuda

DEVICE_MAP, DTYPE, HAS_CUDA = pick_device_and_dtype()

# ==========================================
# qwen_tts import (purge the module cache first)
# ==========================================
def purge_modules(prefixes):
    # Purge cached modules; a stale transformers left in sys.modules breaks the import
    import sys
    kill = []
    for k in list(sys.modules.keys()):
        for p in prefixes:
            if k == p or k.startswith(p + "."):
                kill.append(k)
                break
    for k in kill:
        sys.modules.pop(k, None)

try:
    purge_modules(["qwen_tts", "transformers"])
    from qwen_tts import Qwen3TTSModel
except Exception as e:
    print("qwen_tts import failed even after pinning transformers.")
    print("Error:", repr(e))
    raise

# ==========================================
# Model loader
# ==========================================
_MODEL_CACHE = {}

def load_model(model_id: str):
    # Cache loaded models to avoid reloading the same one
    if model_id in _MODEL_CACHE:
        return _MODEL_CACHE[model_id]
    model = Qwen3TTSModel.from_pretrained(
        model_id,
        device_map=DEVICE_MAP,
        dtype=DTYPE,
    )
    _MODEL_CACHE[model_id] = model
    return model

# ==========================================
# Utility: save wav with timestamp
# ==========================================
def save_wav(out_wav: np.ndarray, sr: int, tag: str) -> str:
    ts = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    out_path = os.path.join(PARAM_INIT["OUT_DIR"], f"{PARAM_INIT['OUT_PREFIX']}_{tag}_{ts}.wav")
    sf.write(out_path, out_wav, sr)
    return out_path

# ==========================================
# UI (slider / button / dropdown)
# ==========================================
mode_dd = widgets.Dropdown(
    # On CPU, prefer 0.6B CustomVoice (VoiceDesign is 1.7B and heavy)
    options=[
        ("CustomVoice 0.6B (Recommended on CPU)", "customvoice_06b"),
        ("Base 0.6B (Voice Clone prompt API needed)", "base_06b"),
        ("VoiceDesign 1.7B (Heavy on CPU)", "voicedesign_17b"),
    ],
    value=("customvoice_06b" if not HAS_CUDA else "voicedesign_17b"),
    description="MODE:",
)

lang_dd = widgets.Dropdown(
    options=[("Japanese", "Japanese"), ("English", "English"), ("Auto", "auto")],
    value=PARAM_INIT["DEFAULT_LANGUAGE"],
    description="LANG:",
)

text_in = widgets.Textarea(
    value=PARAM_INIT["DEFAULT_TEXT_JA"],
    description="TEXT:",
    layout=widgets.Layout(width="100%", height="110px"),
)

speaker_in = widgets.Text(
    value=PARAM_INIT["DEFAULT_SPEAKER"],
    description="SPK:",
)

instruct_in = widgets.Textarea(
    value=PARAM_INIT["DEFAULT_INSTRUCT"],
    description="INSTR:",
    layout=widgets.Layout(width="100%", height="90px"),
)

gen_btn = widgets.Button(description="Speak (Generate)", button_style="primary")
reset_btn = widgets.Button(description="Reset", button_style="")
out_area = widgets.Output(layout={"border": "1px solid #ccc", "padding": "8px"})

def ui_reset(_):
    mode_dd.value = ("customvoice_06b" if not HAS_CUDA else "voicedesign_17b")
    lang_dd.value = PARAM_INIT["DEFAULT_LANGUAGE"]
    text_in.value = PARAM_INIT["DEFAULT_TEXT_JA"]
    speaker_in.value = PARAM_INIT["DEFAULT_SPEAKER"]
    instruct_in.value = PARAM_INIT["DEFAULT_INSTRUCT"]
    with out_area:
        clear_output()

reset_btn.on_click(ui_reset)

def _print_env_banner():
    print("=== ENV ===")
    print("Torch:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())
    print("DEVICE_MAP:", DEVICE_MAP)
    print("DTYPE:", str(DTYPE).replace("torch.", ""))

def speak(_):
    with out_area:
        clear_output()
        try:
            _print_env_banner()

            mode = mode_dd.value
            lang = lang_dd.value
            text = text_in.value.strip()

            if len(text) == 0:
                raise ValueError("TEXT is empty.")

            t0 = time.time()

            if mode == "customvoice_06b":
                model = load_model(PARAM_INIT["MODEL_CUSTOMVOICE_06B"])
                speaker = speaker_in.value.strip()
                instruct = instruct_in.value.strip()
                if len(speaker) == 0:
                    raise ValueError("SPK is empty.")
                if not hasattr(model, "generate_custom_voice"):
                    raise AttributeError("generate_custom_voice not found. Update qwen-tts.")
                wavs, sr = model.generate_custom_voice(text=text, language=lang, speaker=speaker, instruct=instruct)
                tag = "customvoice06b"

            elif mode == "voicedesign_17b":
                model = load_model(PARAM_INIT["MODEL_VOICEDESIGN_17B"])
                instruct = instruct_in.value.strip()
                if len(instruct) == 0:
                    raise ValueError("INSTR is empty.")
                if not hasattr(model, "generate_voice_design"):
                    raise AttributeError("generate_voice_design not found. Update qwen-tts.")
                wavs, sr = model.generate_voice_design(text=text, language=lang, instruct=instruct)
                tag = "voicedesign17b"

            else:
                # Base is for voice cloning; here we only check that the API exists and tell the user.
                model = load_model(PARAM_INIT["MODEL_BASE_06B"])
                if not hasattr(model, "create_voice_clone_prompt") or not hasattr(model, "generate_voice_clone"):
                    raise AttributeError("Base Voice Clone API not found in this qwen-tts build.")
                raise RuntimeError(
                    "Base (Voice Clone) requires ref_audio/ref_text. "
                    "This cell prioritizes plain speech (TTS), so use CustomVoice/VoiceDesign instead."
                )

            dt = time.time() - t0

            out_wav = ensure_mono_float32(np.array(wavs[0]))
            sr = int(sr) if sr is not None else int(PARAM_INIT["SAMPLE_RATE_FALLBACK_HZ"])

            dur_s = out_wav.size / float(sr)      # [s]
            rms_1 = rms_numba(out_wav)            # unitless
            out_path = save_wav(out_wav, sr, tag)

            print("=== SPEAK RESULT ===")
            print(f"Mode: {mode}")
            print(f"Language: {lang}")
            print(f"SampleRate [Hz]: {sr}")
            print(f"Duration [s]: {dur_s:.3f}")
            print(f"RMS [1]: {rms_1:.6f}")
            print(f"Elapsed [s]: {dt:.2f}")
            print(f"Saved: {out_path}")
            display(Audio(out_wav, rate=sr))

        except Exception as e:
            print("Speak failed.")
            print("Error:", repr(e))

gen_btn.on_click(speak)

def on_mode_change(change):
    # Switch visible inputs by mode
    if change["new"] == "customvoice_06b":
        speaker_in.layout.display = "flex"
        instruct_in.layout.display = "flex"
    elif change["new"] == "voicedesign_17b":
        speaker_in.layout.display = "none"
        instruct_in.layout.display = "flex"
    else:
        speaker_in.layout.display = "none"
        instruct_in.layout.display = "none"

mode_dd.observe(on_mode_change, names="value")
on_mode_change({"new": mode_dd.value})  # init

display(widgets.HBox([mode_dd, lang_dd, gen_btn, reset_btn]))
display(text_in)
display(speaker_in)
display(instruct_in)
display(out_area)
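
Because the script above depends on exact version pins, it can help to confirm that the pins actually took effect before importing qwen_tts. The following is a minimal sketch using the standard library's importlib.metadata; check_pin is a hypothetical helper, and the versions are the ones set in PARAM_INIT.

```python
from importlib import metadata


def check_pin(package: str, expected: str) -> str:
    """Return 'ok', 'mismatch', or 'missing' for an exact version pin."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return "missing"
    return "ok" if installed == expected else "mismatch"


# Pins taken from PARAM_INIT in the script above.
for name, pin in {"transformers": "4.57.3", "accelerate": "1.12.0"}.items():
    print(f"{name}=={pin}: {check_pin(name, pin)}")
```

If a pin reports "mismatch" in Colab, restarting the runtime after installation is a common remedy, since already-imported modules keep the old version.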
