Getting Started with Gemini 3.1 Flash TTS: Implementing Audio Tags and Multi-Speaker Speech via the API

Last updated at 2026-04-24, posted at 2026-04-24

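Introduction

On April 15, 2026, Google released Gemini 3.1 Flash TTS in preview. Unlike earlier TTS models, its defining feature is a large set of audio tags that let you control emotion, pace, and accent line by line using natural language.

This article draws on the official documentation to cover the main features of Gemini 3.1 Flash TTS and how to call the API from Python.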

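What you will learn

What sets Gemini 3.1 Flash TTS apart from earlier models
Implementing single-speaker TTS
Controlling expression with audio tags
Generating multi-speaker dialogue
Setting voice quality and character with Audio Profile
Pricing and caveats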

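Intended audience

Developers building apps on the Gemini API
Engineers considering audio content generation or automated narration
Teams looking to replace an existing TTS service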

Prerequisites

Python 3.10+
google-genai SDK installed
Gemini API key obtained (Google AI Studio)

TL;DR

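Model ID: gemini-3.1-flash-tts-preview
70+ languages, 30 preset voices, and a wide range of audio tags
Generates a dialogue with up to 2 speakers in a single call
Pricing: $1.00 per 1M input tokens, $20.00 per 1M audio output tokens
50% discount when using the Batch API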

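What is Gemini 3.1 Flash TTS?

Gemini 3.1 Flash TTS is a text-to-speech model that Google released in preview on April 15, 2026. According to the official blog, it scored an Elo of 1,211 on the Artificial Analysis TTS leaderboard, placing it in what the blog describes as the region with the best balance of quality and cost.

How it differs from conventional TTS

Conventional TTS has a simple structure: text in, audio out. Gemini 3.1 Flash TTS instead takes a prompt, which lets you control vocal expression in fine detail.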

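Item	Conventional TTS	Gemini 3.1 Flash TTS
Emotion control	Predefined styles only	Freely specified with natural-language tags
Languages	Limited per model	70+ languages, detected automatically
Multi-speaker	Requires separate calls	2 speakers generated in 1 call
Voice customization	Preset selection only	Character setup via Audio Profile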

Model specifications

Item	Value
Model ID	`gemini-3.1-flash-tts-preview`
Input token limit	8,192 tokens
Output token limit	16,384 tokens
Supported languages	70+
Preset voices	30
Audio tags	Many (no exhaustive list is published)

Setup

pip install google-genai

Set the API key as an environment variable.

export GOOGLE_API_KEY="your-api-key"

Prepare a utility function for saving the audio output from Python.

import wave

def wave_file(filename, pcm, channels=1, rate=24000, sample_width=2):
    with wave.open(filename, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(rate)
        wf.writeframes(pcm)

The standard output format is 16-bit PCM, mono, at a sample rate of 24,000 Hz (24 kHz).
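Given that format, the playback length of a response can be worked out from the byte count alone. The following is a minimal sketch; the helper name `pcm_duration_seconds` is hypothetical, and the defaults assume the 24 kHz, mono, 16-bit format described above.

```python
def pcm_duration_seconds(pcm: bytes, channels: int = 1,
                         rate: int = 24000, sample_width: int = 2) -> float:
    """Return the playback length of raw PCM data in seconds."""
    # Each second of audio occupies channels * rate * sample_width bytes.
    bytes_per_second = channels * rate * sample_width
    return len(pcm) / bytes_per_second


# One second of 16-bit mono silence at 24 kHz is 48,000 bytes.
silence = b"\x00\x00" * 24000
print(pcm_duration_seconds(silence))  # 1.0
```

A check like this can help spot truncated or unexpectedly short responses before writing them to disk.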

Implementing single-speaker TTS

This is the simplest configuration: set response_modalities=["AUDIO"] and a speech_config.

from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-3.1-flash-tts-preview",
    contents="Say cheerfully: Have a wonderful day!",
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(
                    voice_name='Kore',
                )
            )
        ),
    )
)

data = response.candidates[0].content.parts[0].inline_data.data
wave_file('output.wav', data)

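Preset voice options

The table below lists some of the voices in the official documentation.

Voice name	Characteristics
Kore	Calm, steady voice (Firm)
Puck	Bright and lively (Upbeat)
Zephyr	Fresh and clear (Bright)
Enceladus	Soft and breathy (Breathy)
Algieba	Smooth and easy to listen to (Smooth)

Including these, 30 preset voices are available.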

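Controlling expression with audio tags

Audio tags are the standout feature of Gemini 3.1 Flash TTS. Embedding a [tag name] in the text lets you switch emotion, pace, and voice quality line by line.

Representative audio tags

Tag	Effect
`[whispers]`	Whispering
`[laughs]`	Speaking while laughing
`[excited]`	Excited tone
`[slow]`	Speaking slowly
`[shouting]`	Shouting
`[sarcastic]`	Sarcastic tone

Using audio tags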

text = """
[excited] We just hit our first million users! 
[slow] But our infrastructure bill is... [whispers] quite concerning.
"""

response = client.models.generate_content(
    model="gemini-3.1-flash-tts-preview",
    contents=text,
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(
                    voice_name='Puck',
                )
            )
        ),
    )
)

data = response.candidates[0].content.parts[0].inline_data.data
wave_file('tagged_output.wav', data)

The official documentation does not publish an exhaustive tag list, but many audio tags are available, including emotion tags such as [cheerful], [sad], [nervous], and [authoritative].
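Because tags are plain bracketed words inside the text, it can be handy to list the tags a script uses, or to recover the spoken text without them (for subtitles, say). The sketch below is one way to do that with a regular expression; the helper names and the assumption that a tag consists only of letters and spaces are illustrative and not part of the API.

```python
import re

# Assumption: a tag is a bracketed run of letters and spaces, e.g. [excited].
TAG_PATTERN = re.compile(r"\[([A-Za-z][A-Za-z ]*)\]")


def extract_audio_tags(text: str) -> list[str]:
    """Return the tag names in order of appearance, lowercased."""
    return [name.lower() for name in TAG_PATTERN.findall(text)]


def strip_audio_tags(text: str) -> str:
    """Return the text with tags removed and runs of spaces collapsed."""
    without_tags = TAG_PATTERN.sub("", text)
    return re.sub(r"[ \t]+", " ", without_tags).strip()


line = "[excited] We just hit our first million users! [whispers] Quietly."
print(extract_audio_tags(line))  # ['excited', 'whispers']
print(strip_audio_tags(line))    # We just hit our first million users! Quietly.
```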

Implementing multi-speaker TTS

You can generate a conversation between two characters in a single API call. Use MultiSpeakerVoiceConfig to map each speaker's name to a voice.

from google import genai
from google.genai import types

client = genai.Client()

prompt = """TTS the following conversation between Taro and Hanako:
Taro: APIから直接音声が生成できるなんて、すごい時代になりましたね。
Hanako: しかも感情タグで表情まで制御できるんですよ。[excited] 早速プロダクトに組み込みたいです！
Taro: [laughs] 私もそう思います。マルチスピーカーが1コールで済むのも嬉しいですね。"""

response = client.models.generate_content(
    model="gemini-3.1-flash-tts-preview",
    contents=prompt,
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
                speaker_voice_configs=[
                    types.SpeakerVoiceConfig(
                        speaker='Taro',
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(
                                voice_name='Kore',
                            )
                        )
                    ),
                    types.SpeakerVoiceConfig(
                        speaker='Hanako',
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(
                                voice_name='Zephyr',
                            )
                        )
                    ),
                ]
            )
        )
    )
)

data = response.candidates[0].content.parts[0].inline_data.data
wave_file('multi_speaker_output.wav', data)

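Caveats

Multi-speaker mode supports up to 2 speakers.
Writing the speaker names explicitly in the prompt (Taro:, Hanako:, and so on) applies the corresponding voice settings.
The single-speaker voice_config and the multi-speaker multi_speaker_voice_config are mutually exclusive and cannot be specified together.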
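Since the voice mapping depends on speaker names in the prompt matching the configured names, a quick check before sending the request can catch typos. The following is a rough sketch: the function names are hypothetical, and it assumes each spoken line starts with a single-word label followed by a colon, so any other line shaped like `Word: text` would also be counted as a speaker.

```python
import re

MAX_SPEAKERS = 2  # limit stated above for the current preview


def find_speakers(prompt: str) -> list[str]:
    """Return speaker labels in order of first appearance."""
    speakers: list[str] = []
    for line in prompt.splitlines():
        # Matches lines such as "Taro: Hello" (one-word label, then a colon).
        match = re.match(r"^\s*(\w+)\s*:\s*\S", line)
        if match and match.group(1) not in speakers:
            speakers.append(match.group(1))
    return speakers


def check_speakers(prompt: str, configured: list[str]) -> list[str]:
    """Return a list of problems; an empty list means prompt and config agree."""
    problems = []
    if len(configured) > MAX_SPEAKERS:
        problems.append(f"{len(configured)} speakers configured; limit is {MAX_SPEAKERS}")
    found = find_speakers(prompt)
    for name in found:
        if name not in configured:
            problems.append(f"speaker '{name}' has no voice config")
    for name in configured:
        if name not in found:
            problems.append(f"speaker '{name}' never speaks in the prompt")
    return problems
```

For the conversation prompt shown earlier, `check_speakers(prompt, ["Taro", "Hanako"])` would return an empty list, while a misspelled name in either place would be reported.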

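Specifying voice quality and character in detail with Audio Profile

When choosing a preset voice is not enough, you can use a prompt structure called an Audio Profile. The official documentation recommends a structure that combines the following four elements.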

audio_profile_prompt = """
Audio Profile:
You are Jaz R., an energetic radio presenter from East London. 
Your voice reflects your background: you frequently use Brixton slang and have an upbeat, 
infectious energy. You speak quickly but clearly, with rhythm and flow.

Scene:
You're announcing the morning show at a local radio station. 
Background: lively, upbeat music fading in.

Director's Notes:
Keep it fast-paced. Let the excitement show. 
Lean into the Brixton accent. Short punchy sentences.

Transcript:
Good morning London! [excited] It's six AM and you're listening to the number one breakfast show!
[slow] Take a breath... [excited] Because today is going to be LEGENDARY!
"""

response = client.models.generate_content(
    model="gemini-3.1-flash-tts-preview",
    contents=audio_profile_prompt,
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(
                    voice_name='Puck',
                )
            )
        ),
    )
)

data = response.candidates[0].content.parts[0].inline_data.data
wave_file('character_output.wav', data)

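The main elements you can specify in an Audio Profile are as follows.

Element	Content
Audio Profile	The character's personality, origin, and speaking style
Scene	Context: environment, situation, and background sound
Director's Notes	Direction on pace, tone, and accent
Transcript	The actual text to read, including audio tags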
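If you generate many clips for the same character, assembling the four sections programmatically keeps the structure consistent. This is a minimal sketch; `build_audio_profile_prompt` is a hypothetical helper, and the section order and `Title:` layout simply mirror the example prompt above.

```python
def build_audio_profile_prompt(profile: str, scene: str,
                               notes: str, transcript: str) -> str:
    """Join the four Audio Profile sections into one prompt string."""
    sections = [
        ("Audio Profile", profile),
        ("Scene", scene),
        ("Director's Notes", notes),
        ("Transcript", transcript),
    ]
    # Each section is a "Title:" line followed by its body; blank line between.
    blocks = [f"{title}:\n{body.strip()}" for title, body in sections]
    return "\n\n".join(blocks)


prompt = build_audio_profile_prompt(
    profile="You are Jaz R., an energetic radio presenter.",
    scene="You're announcing the morning show at a local radio station.",
    notes="Keep it fast-paced. Short punchy sentences.",
    transcript="Good morning London! [excited] It's six AM!",
)
print(prompt)
```

The resulting string can be passed as `contents` in the same way as the hand-written prompt; only the transcript needs to change between clips.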

Using the Batch API

When generating large volumes of audio, using the Batch API cuts costs by 50%.

# Batch API requests are assembled in REST API form (a different interface from the Python SDK)
batch_request = {
    "model": "gemini-3.1-flash-tts-preview",
    "contents": [
        {"parts": [{"text": "[cheerful] Welcome to our service!"}]},
    ],
    "generation_config": {
        "response_modalities": ["AUDIO"],
        "speech_config": {
            "voice_config": {
                "prebuilt_voice_config": {
                    "voice_name": "Kore"
                }
            }
        }
    }
}

See the official Batch API documentation for details.
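One way to prepare many such requests is to serialize them as JSON Lines, one request per line. The sketch below rests on assumptions that are not taken from this article: the `key`/`request` wrapper on each line, the `tts-N` key naming, and the helper name `build_batch_lines` are illustrative, so confirm the exact input layout against the Batch API documentation before relying on it.

```python
import json


def build_batch_lines(texts: list[str], voice_name: str = "Kore") -> list[str]:
    """Return one JSON string per text, each describing a TTS request."""
    lines = []
    for index, text in enumerate(texts, start=1):
        request = {
            "contents": [{"parts": [{"text": text}]}],
            "generation_config": {
                "response_modalities": ["AUDIO"],
                "speech_config": {
                    "voice_config": {
                        "prebuilt_voice_config": {"voice_name": voice_name}
                    }
                },
            },
        }
        # Assumption: each line wraps the request with a caller-chosen key
        # so results can be matched back to inputs.
        wrapped = {"key": f"tts-{index}", "request": request}
        lines.append(json.dumps(wrapped, ensure_ascii=False))
    return lines


for line in build_batch_lines(["[cheerful] Welcome!", "[slow] Goodbye."]):
    print(line)
```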

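Pricing

From the official pricing page.

Item	Standard price	Batch price (50% off)
Input (text)	$1.00 / 1M tokens	$0.50 / 1M tokens
Output (audio)	$20.00 / 1M tokens	$10.00 / 1M tokens

Note that audio output costs more per token than ordinary text output. For high-volume generation, the Batch API is recommended.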
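To get a feel for how the two rates combine, here is a small estimator. It is a sketch that hard-codes the prices from the table above (which may change while the model is in preview); the function name is hypothetical and the token counts are inputs you supply, for example from the usage metadata of earlier responses.

```python
# Rates copied from the pricing table above (USD per 1M tokens).
INPUT_USD_PER_MILLION = 1.00
OUTPUT_USD_PER_MILLION = 20.00
BATCH_DISCOUNT = 0.5  # Batch API price is 50% of the standard price


def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      batch: bool = False) -> float:
    """Estimate the cost of a workload from its token counts."""
    cost = (input_tokens * INPUT_USD_PER_MILLION
            + output_tokens * OUTPUT_USD_PER_MILLION) / 1_000_000
    return cost * BATCH_DISCOUNT if batch else cost


# Example: 2,000 input tokens and 50,000 audio output tokens.
print(f"{estimate_cost_usd(2_000, 50_000):.3f}")              # 1.002
print(f"{estimate_cost_usd(2_000, 50_000, batch=True):.3f}")  # 0.501
```

In this example almost all of the cost comes from the audio output, which is why the batch discount matters most for long or numerous clips.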

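Caveats

The main caveats listed in the official documentation are summarized below.

1. No streaming support
The current preview version does not support streaming output. For long audio, splitting the text across multiple requests is recommended.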
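A simple way to split a long script is to pack whole sentences into chunks up to a size limit. The sketch below is one such approach under stated assumptions: the function name and the 500-character default are arbitrary, and the splitter only recognizes `.`, `!`, and `?` followed by whitespace, so text in languages written without spaces between sentences would need a different rule.

```python
import re


def split_into_chunks(text: str, max_chars: int = 500) -> list[str]:
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole in its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            # Adding this sentence would exceed the limit: start a new chunk.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(split_into_chunks("One. Two. Three.", max_chars=9))  # ['One. Two.', 'Three.']
```

Each chunk can then be sent as its own request and the returned PCM data concatenated before saving. Keeping sentences intact also keeps an audio tag together with the sentence it opens.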

2. Output can be unstable
Long audio of several minutes or more can show quality drift. In rare cases the model also returns text tokens instead of audio, so implementing retry logic is recommended.

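import time

def generate_tts_with_retry(client, contents, config, max_retries=3):
    """Retry until the response carries audio; the model occasionally returns text instead."""
    for attempt in range(max_retries):
        response = client.models.generate_content(
            model="gemini-3.1-flash-tts-preview",
            contents=contents,
            config=config,
        )
        # Guard against empty candidates or parts before indexing into them.
        candidates = response.candidates or []
        content = candidates[0].content if candidates else None
        parts = (content.parts or []) if content else []
        inline = parts[0].inline_data if parts else None
        if inline and (inline.mime_type or "").startswith("audio/"):
            return inline.data
        if attempt < max_retries - 1:
            time.sleep(1)  # brief pause before the next attempt
    raise RuntimeError("Failed to get audio response after retries")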

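3. SynthID watermark
All generated audio carries an imperceptible SynthID watermark, which makes it possible to trace the audio as AI-generated.

4. Multi-speaker is limited to 2 speakers
In the current preview, at most 2 speakers can be generated together.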

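Summary

Gemini 3.1 Flash TTS was released in preview on April 15, 2026
A wide range of audio tags allows line-by-line control of emotion and pace
Generates a dialogue with up to 2 speakers in a single call
Supports 70+ languages and 30 preset voices
Audio Profile lets you customize a character's voice
The Batch API can cut costs by 50%
Streaming is not currently supported, so long audio needs to be processed in chunks

Combining audio tags with Audio Profile covers a broad range of uses, including narration, character voices, and multilingual content generation.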

Reference links

Gemini 3.1 Flash TTS official blog: feature overview and use cases
Text-to-speech generation | Gemini API: API reference and code samples
Gemini 3.1 Flash TTS Preview model specifications: model parameter details
Gemini API pricing: latest pricing
