Writing Robust Retry Logic with Python's Tenacity: Practical Patterns for AI APIs, External APIs, and Batch Jobs

Posted at 2026-06-06

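Introduction

When you work with external APIs, LLM APIs, databases, cloud storage, scraping, or batch jobs, you will inevitably run into transient failures such as:

API rate limits
Brief network drops
Timeouts
Transient 503 / 504 errors
LLM API congestion
Temporary DB connection failures
Transient errors from cloud services such as S3 / GCS

Hand-writing try-except + sleep + a for loop for each of these gets painful fast.

This is where Tenacity comes in.

Tenacity is a library that lets you write retry logic declaratively in Python.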

pip install tenacity

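This article covers Tenacity from the basics through practical retry design for AI API calls.

What is Tenacity?

Tenacity is a Python library that makes it easy to add retry conditions, stop conditions, wait times, logging, and more to a function.

For example, adding a decorator is all it takes to make a function retry: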

from tenacity import retry

@retry
def unstable_function():
    print("call")
    raise RuntimeError("temporary error")

unstable_function()

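However, I don't recommend writing it this way in production code.

The reason is that, by default, Tenacity keeps retrying for as long as the function raises an exception, with no wait between attempts.
In production, always make the following explicit:

Maximum number of attempts
Wait time
Which exceptions to retry
Logging
What happens when the call ultimately fails

Basic form: retry up to 3 times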

from tenacity import retry, stop_after_attempt

@retry(stop=stop_after_attempt(3))
def fetch_data():
    print("fetching...")
    raise TimeoutError("timeout")

fetch_data()

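stop_after_attempt(3) means "try at most 3 times, including the first call."

In other words, the flow is:

First call
If it fails, second attempt
If it fails, third attempt
If it still fails, raise an exception (by default a RetryError that wraps the last failure)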

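Adding a wait time

Retrying an external API or LLM API immediately after a failure is a bad idea, because it adds load to a service that is already struggling.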

from tenacity import retry, stop_after_attempt, wait_fixed

@retry(
    stop=stop_after_attempt(3),
    wait=wait_fixed(2),
)
def call_api():
    print("calling api...")
    raise TimeoutError("timeout")

call_api()

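In this example, each failure is followed by a 2-second wait before the next attempt.

Using exponential backoff

Exponential backoff is a common choice for API rate limits and transient outages.

With exponential backoff, the wait time grows longer with each retry.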

from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=30),
)
def call_external_api():
    print("calling external api...")
    raise TimeoutError("temporary failure")

call_external_api()

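With this configuration, the wait time grows exponentially with each retry, clamped to the following bounds:

Minimum wait: 1 second
Maximum wait: 30 seconds
Maximum attempts: 5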
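To get a feel for the numbers, here is a rough model of a doubling backoff with the same bounds. It is an illustrative calculation only: `backoff_schedule` is a hypothetical helper, and the exact formula inside Tenacity may differ between versions.

```python
def backoff_schedule(attempts, multiplier=1.0, min_wait=1.0, max_wait=30.0):
    """Waits between attempts for a simple doubling model (illustrative only)."""
    waits = []
    for retry_index in range(attempts - 1):  # N attempts means N - 1 waits
        raw = multiplier * 2 ** retry_index
        waits.append(max(min_wait, min(raw, max_wait)))
    return waits


print(backoff_schedule(5))  # [1.0, 2.0, 4.0, 8.0]
print(backoff_schedule(8))  # the later waits are capped at 30.0
```

Under this model, five attempts never reach the 30-second cap. The cap only starts to matter once the number of attempts grows.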

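However, when multiple processes or workers may retry at the same time, plain exponential backoff can cause their retries to land at the same moments.

For that reason, exponential backoff with added randomness (jitter) is the more common choice in practice.

For production: random exponential backoff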

from tenacity import retry, stop_after_attempt, wait_random_exponential

@retry(
    stop=stop_after_attempt(6),
    wait=wait_random_exponential(multiplier=1, max=60),
)
def call_busy_api():
    print("calling busy api...")
    raise TimeoutError("server is busy")

call_busy_api()

wait_random_exponential adds randomness to exponential backoff.

This is useful in cases such as:

LLM APIs such as the OpenAI API
Batch jobs that hit the same API from multiple workers
Handling 429 Too Many Requests
Handling 503 Service Unavailable
Crawlers and ETL jobs
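The idea behind the randomness can be sketched in plain Python. This is a simplified model rather than Tenacity's own code: `jittered_wait` is a hypothetical helper that picks a wait uniformly between 0 and the capped exponential bound.

```python
import random


def jittered_wait(retry_index, rng, multiplier=1.0, max_wait=60.0):
    """Pick a wait uniformly between 0 and the capped exponential bound."""
    upper = min(multiplier * 2 ** retry_index, max_wait)
    return rng.uniform(0, upper)


# Two hypothetical workers that fail at the same moment draw different waits,
# so their retries spread out instead of arriving together.
worker_a = random.Random(1)
worker_b = random.Random(2)
for retry_index in range(4):
    wait_a = jittered_wait(retry_index, worker_a)
    wait_b = jittered_wait(retry_index, worker_b)
    print(retry_index, round(wait_a, 2), round(wait_b, 2))
```

The upper bound still grows exponentially, but each worker lands somewhere different below it.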

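Limiting which exceptions are retried

In production code, retrying on every exception is dangerous.

For example, retrying is unlikely to help with errors like these:

Authentication errors
Validation errors
Permission errors
Access to a resource that does not exist
Bugs in your code

So restrict the retried exceptions explicitly.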

from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
    retry_if_exception_type,
)

@retry(
    retry=retry_if_exception_type((TimeoutError, ConnectionError)),
    stop=stop_after_attempt(5),
    wait=wait_random_exponential(multiplier=1, max=30),
)
def fetch_resource():
    print("fetching resource...")
    raise TimeoutError("temporary timeout")

fetch_resource()

In this example, retries happen only for TimeoutError and ConnectionError. Any other exception propagates immediately.

Logging

Without logs, retry behavior is hard to investigate in production.

Tenacity's before_sleep_log lets you emit a log entry before each retry.

import logging

from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
    retry_if_exception_type,
    before_sleep_log,
)

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

@retry(
    retry=retry_if_exception_type((TimeoutError, ConnectionError)),
    stop=stop_after_attempt(5),
    wait=wait_random_exponential(multiplier=1, max=30),
    before_sleep=before_sleep_log(logger, logging.WARNING),
)
def fetch_with_logging():
    print("fetching...")
    raise TimeoutError("temporary timeout")

fetch_with_logging()

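The log entry includes the function name, how long Tenacity will wait before the next retry, and the exception that was raised. It does not include the attempt number, so add a custom callback if you need it.

In production, it helps to include at least the following in your logs:

Target API
Retry count
Exception type
Wait time
Request ID
User ID or job ID
Model name
Batch ID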

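AI API example: using Tenacity with OpenAI API calls

LLM APIs can fail in ways such as:

Rate limits
Transient network errors
Timeouts
Transient server-side errors
Momentary congestion during batch processing

So let's wrap an OpenAI API call with Tenacity.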

import logging

from openai import OpenAI
from openai import RateLimitError, APIConnectionError, APITimeoutError, InternalServerError

from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
    retry_if_exception_type,
    before_sleep_log,
)

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

client = OpenAI()

@retry(
    retry=retry_if_exception_type((
        RateLimitError,
        APIConnectionError,
        APITimeoutError,
        InternalServerError,
    )),
    wait=wait_random_exponential(multiplier=1, max=60),
    stop=stop_after_attempt(6),
    before_sleep=before_sleep_log(logger, logging.WARNING),
)
def generate_summary(text: str) -> str:
    response = client.responses.create(
        model="gpt-5.5",
        input=[
            {
                "role": "system",
                "content": "You are an expert at summarizing text. Summarize the key points concisely.",
            },
            {
                "role": "user",
                "content": text,
            },
        ],
    )

    return response.output_text

Usage looks like this:

if __name__ == "__main__":
    text = """
    Put a long piece of text here.
    Think meeting minutes, support inquiries, logs, specifications, and so on.
    """

    summary = generate_summary(text)
    print(summary)

The key point is that the retried exceptions are listed explicitly.

retry=retry_if_exception_type((
    RateLimitError,
    APIConnectionError,
    APITimeoutError,
    InternalServerError,
))

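Rate limits and transient connection errors are reasonable candidates for retrying.
Authentication errors and malformed requests, on the other hand, are very unlikely to succeed no matter how many times you retry, so it is safer to leave them out.
Note that the OpenAI Python SDK also retries some errors on its own (the client's max_retries setting), so the total number of requests multiplies unless you set max_retries=0 on the client.

AI batch example: summarizing a large number of texts

Next is a batch job that summarizes multiple texts one after another.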

import logging
from dataclasses import dataclass

from openai import OpenAI
from openai import RateLimitError, APIConnectionError, APITimeoutError, InternalServerError

from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
    retry_if_exception_type,
    before_sleep_log,
)

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

client = OpenAI()

@dataclass
class SummaryJob:
    id: str
    text: str

@retry(
    retry=retry_if_exception_type((
        RateLimitError,
        APIConnectionError,
        APITimeoutError,
        InternalServerError,
    )),
    wait=wait_random_exponential(multiplier=1, max=60),
    stop=stop_after_attempt(6),
    before_sleep=before_sleep_log(logger, logging.WARNING),
)
def summarize_one(job: SummaryJob) -> dict:
    response = client.responses.create(
        model="gpt-5.5",
        input=[
            {
                "role": "system",
                "content": "Summarize the text in three lines in Japanese.",
            },
            {
                "role": "user",
                "content": job.text,
            },
        ],
    )

    return {
        "id": job.id,
        "summary": response.output_text,
    }

def run_batch(jobs: list[SummaryJob]) -> list[dict]:
    results = []

    for job in jobs:
        try:
            result = summarize_one(job)
            results.append(result)
        except Exception as e:
            logger.exception("summary job failed: job_id=%s error=%s", job.id, e)
            results.append({
                "id": job.id,
                "summary": None,
                "error": str(e),
            })

    return results

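With this design, the batch keeps going even when one summary ultimately fails.
Note that after the final attempt the except block receives Tenacity's RetryError, so str(e) describes the wrapper. Pass reraise=True to @retry if you want the original exception recorded instead.

For AI batch processing, the following principles matter:

Don't stop the whole batch when one item fails
Retry only retryable errors
Record final failures in both the logs and the results
Always log the job ID
Store results in a form that can be re-run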

Using it with async code

Tenacity also works with async def.

import logging
import httpx

from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
    retry_if_exception_type,
    before_sleep_log,
)

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

@retry(
    retry=retry_if_exception_type((httpx.TimeoutException, httpx.ConnectError)),
    wait=wait_random_exponential(multiplier=1, max=30),
    stop=stop_after_attempt(5),
    before_sleep=before_sleep_log(logger, logging.WARNING),
)
async def fetch_json(url: str) -> dict:
    async with httpx.AsyncClient(timeout=10) as client:
        response = await client.get(url)
        response.raise_for_status()
        return response.json()

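The same approach works for LLM APIs, external APIs, webhooks, crawling, and other workloads heavy on asynchronous I/O. Note that in this example, HTTP error statuses raised by raise_for_status() (httpx.HTTPStatusError) are not retried, because only timeouts and connection errors are listed.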

Retrying based on the return value

Sometimes you want to retry based on the return value rather than an exception.

For example, retrying only when the API response is None.

from tenacity import retry, stop_after_attempt, wait_fixed, retry_if_result

def is_none(value):
    return value is None

@retry(
    retry=retry_if_result(is_none),
    stop=stop_after_attempt(3),
    wait=wait_fixed(1),
)
def get_cached_value():
    print("checking cache...")
    return None

result = get_cached_value()

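Note that this sample always returns None, so after three attempts the call ends with a RetryError. In real code the function would eventually return a value.

This pattern is useful in cases such as:

Waiting for a cache to be populated
Waiting for an asynchronous job to finish
Status-check APIs
APIs that temporarily return an empty response
Waiting for a vector DB insert to complete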

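Raising your own exception after retries are exhausted

When the stop condition is reached, Tenacity by default raises a RetryError that wraps the last attempt (pass reraise=True to get the original exception instead).
To make this easier for the application to handle, you can convert the exception on the caller side.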

from tenacity import RetryError

class ExternalApiUnavailableError(Exception):
    pass

try:
    result = fetch_resource()
except RetryError as e:
    raise ExternalApiUnavailableError("external api is unavailable") from e

In an API server, you could handle this exception and return 503 Service Unavailable.

Recommended settings for production

Personally, I often start from a configuration like this for external APIs and AI APIs.

@retry(
    retry=retry_if_exception_type((
        TimeoutError,
        ConnectionError,
    )),
    wait=wait_random_exponential(multiplier=1, max=60),
    stop=stop_after_attempt(6),
    before_sleep=before_sleep_log(logger, logging.WARNING),
)
def call_api():
    ...

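The reasoning is as follows.

Item	Recommendation
Stop condition	Always set one
Wait strategy	wait_random_exponential
Maximum wait	About 30 to 60 seconds
Maximum attempts	About 3 to 6
Retried exceptions	Restrict explicitly
Logging	Always log
Authentication errors	Do not retry
Validation errors	Do not retry
429	Generally a retry candidate
500 / 502 / 503 / 504	Retry candidates
400 / 401 / 403 / 404	Do not retry, as a rule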

Common anti-patterns

1. Infinite retries

@retry
def call_api():
    ...

This is dangerous in production.
Always set stop_after_attempt or stop_after_delay.

2. Retrying on every exception

@retry(stop=stop_after_attempt(5))
def call_api():
    ...

Written this way, even program bugs and authentication errors get retried.

Specify the retried exceptions explicitly.

3. Many retries at a fixed interval

@retry(
    stop=stop_after_attempt(10),
    wait=wait_fixed(1),
)
def call_api():
    ...

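In an environment with many workers running at once, retries pile up at the same moments.
For rate limits, a backoff with randomness is safer.

4. No logging

A retry is a "failure that disappears once it succeeds."

Without logs, you won't notice problems like these:

Every call actually takes three retries
You are hitting the API rate limit frequently
Failures happen only during certain hours
Failures happen only for a particular user
Cost and latency are creeping up

Design retries together with observability.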

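Design points for AI applications

For LLM APIs and AI workloads, the following considerations matter more than for ordinary APIs.

1. Retries have a cost

With LLM APIs, failed requests and retries affect token consumption, rate limits, and latency.
Rather than retrying without limit, decide on a clear maximum number of attempts.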

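2. Keep idempotency in mind

Check whether resending the same input is safe.

For example, summarization and classification are relatively easy to retry.
The following kinds of operations, on the other hand, need care:

Billing
Sending email
Posting to Slack
DB updates
Ticket creation
Registration in external systems

With these, a retry can cause the operation to run twice.
Design in job IDs, request IDs, deduplication keys, and similar safeguards.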

3. Design timeouts alongside retries

Configuring retries alone does not help if a single API call can block for a long time.
Always set timeouts on the HTTP client or SDK as well.

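4. Combine with a queue

For large volumes of AI work, combining Tenacity with a queue is more stable than relying on Tenacity alone.

For example, combining it with tools such as

Celery
RQ
Dramatiq
Cloud Tasks
SQS
Pub/Sub
Cloud Run Jobs
Kubernetes Job

gives you re-execution, monitoring, and failure management at the job level.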

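Summary

Tenacity lets you write retry logic in Python that is both simple and robust.

For AI APIs and external APIs in particular, the following settings matter:

Set the maximum number of attempts with stop_after_attempt
Use random exponential backoff with wait_random_exponential
Restrict the retried exceptions with retry_if_exception_type
Emit retry logs with before_sleep_log
Do not retry authentication or validation errors
In batch jobs, don't stop the whole run when one item fails
Think about retries together with cost, latency, and idempotency

Retry logic is more than error handling.
It is an important part of reliability design for connecting safely to external services.

If you are building AI applications or external API integrations in Python, Tenacity is a very practical choice.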
