Automatically diagnose why an LLM agent failed, with a single pip install

Last updated at 2026-04-01. Posted at 2026-04-01.

Try it in 2 minutes

pip install agent-failure-debugger

from agent_failure_debugger import diagnose

raw_log = {
    "inputs": {"query": "Change my flight to tomorrow morning"},
    "outputs": {"response": "I found several hotels near the airport."},
    "steps": [
        {"type": "tool", "name": "search_flights",
         "inputs": {"date": "2025-03-20"}, "outputs": {"flights": []}, "error": None},
        {"type": "tool", "name": "search_flights",
         "inputs": {"date": "2025-03-20"}, "outputs": {"flights": []}, "error": None},
        {"type": "tool", "name": "search_flights",
         "inputs": {"date": "2025-03-20"}, "outputs": {"flights": []}, "error": None},
        {"type": "llm", "outputs": {"text": "I found several hotels near the airport."}}
    ],
    "feedback": {"user_correction": "I asked about flights, not hotels."}
}

result = diagnose(raw_log, adapter="langchain")
print(result["summary"]["root_cause"])
# → agent_tool_call_loop
print(result["explanation"]["context_summary"])
# → Root cause identified: the agent repeatedly invoked tools without making meaningful state progress.

The only dependency is pyyaml. Requires Python 3.11+.

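What is happening

In the log above, the agent:

calls the flight search three times and gets an empty result every time
starts talking about hotels for no apparent reason
is corrected by the user: "I was asking about flights"

Looking at the log alone, it is tempting to stop at "the output is wrong," but this tool follows the causal chain back to the root cause.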
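
To make the first symptom concrete, here is a minimal sketch of how repeated, unproductive tool calls could be spotted in a log of the shape shown earlier. This is not the library's matcher: `find_repeated_tool_calls` and its `min_repeats` threshold are hypothetical names chosen for illustration.

```python
import json
from collections import Counter


def find_repeated_tool_calls(steps, min_repeats=3):
    """Return (tool name, inputs-as-JSON) pairs that were called at least
    `min_repeats` times with identical inputs and never returned any data."""
    counts = Counter()
    made_progress = set()
    for step in steps:
        if step.get("type") != "tool":
            continue
        # Identical name + identical inputs counts as "the same call".
        key = (step["name"], json.dumps(step.get("inputs", {}), sort_keys=True))
        counts[key] += 1
        outputs = step.get("outputs") or {}
        # Any non-empty output value is treated as progress.
        if any(outputs.values()):
            made_progress.add(key)
    return [k for k, n in counts.items() if n >= min_repeats and k not in made_progress]


tool_call = {"type": "tool", "name": "search_flights",
             "inputs": {"date": "2025-03-20"}, "outputs": {"flights": []}, "error": None}
steps = [tool_call] * 3 + [{"type": "llm", "outputs": {"text": "hotels..."}}]

print(find_repeated_tool_calls(steps))
# → [('search_flights', '{"date": "2025-03-20"}')]
```

A real detector would likely also look at near-identical inputs and at state changes between calls; this sketch only covers the exact-repeat case from the example log.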

agent_tool_call_loop (root cause)
  → same tool called 3 times with no progress
  → generated off-topic output
  → incorrect_output (surface symptom)
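
The chain above can be read as a walk over a small cause-and-effect graph, from the surface symptom back to a node that has no cause of its own. The sketch below shows that idea under stated assumptions: the `CAUSES` edges and the `off_topic_output` label are hypothetical, and the library's actual graph and traversal may differ.

```python
# Hypothetical causal edges: cause -> list of effects.
CAUSES = {
    "agent_tool_call_loop": ["off_topic_output"],
    "off_topic_output": ["incorrect_output"],
}


def trace_to_root(symptom, causes):
    """Follow cause links backwards from `symptom` and return the path
    ordered from root cause to symptom."""
    # Invert the graph so each effect points at its cause.
    parent = {effect: cause for cause, effects in causes.items() for effect in effects}
    path = [symptom]
    while path[-1] in parent:
        cause = parent[path[-1]]
        if cause in path:  # guard against cycles in a malformed graph
            break
        path.append(cause)
    return list(reversed(path))


print(trace_to_root("incorrect_output", CAUSES))
# → ['agent_tool_call_loop', 'off_topic_output', 'incorrect_output']
```

The first element of the returned path is what gets reported as the root cause. The last is the symptom a reader would notice in the log.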

Add it to a LangGraph agent with one line

No changes to the existing agent code are required.

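from langchain_core.messages import HumanMessage
from llm_failure_atlas.adapters.callback_handler import watch

# `workflow` is your existing StateGraph; just add this one line
graph = watch(workflow.compile(), auto_diagnose=True)

# Then use it as usual
result = graph.invoke({"messages": [HumanMessage(content="What were Q4 sales?")]})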

When the run finishes, the detection results are printed automatically.

Difference between auto_diagnose and auto_pipeline

Flag	Behavior	Requires Debugger?
`auto_diagnose=True`	Pattern matching only (what happened)	No
`auto_pipeline=True`	Causal analysis + explanation + fix proposal (why it happened)	Yes

Reading the results

When a failure is detected

Root cause:  agent_tool_call_loop (conf=0.55)
Failures:    1
Gate:        proposal_only (score=0.0)

Explanation:
  Context: Root cause identified: the agent repeatedly invoked tools
           without making meaningful state progress.
  Risk: MEDIUM
  Action: Review the proposed fix before applying.

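The confidence score reflects accumulated rule-based evidence. It is not a statistical probability.

Range	Meaning
< 0.5	Weak signal. Treat as a hint only
0.5–0.7	Multiple signals agree. Check the trace
> 0.7	Strong agreement. Consider the proposed fix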

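If you post-process results in your own code, the confidence bands can be encoded as a small helper. This is a sketch only: `confidence_band` and its return labels are hypothetical and not part of the library. The thresholds come from the ranges given for the confidence score.

```python
def confidence_band(conf: float) -> str:
    """Map a rule-based confidence score to a coarse band."""
    if conf < 0.5:
        return "weak"      # treat as a hint only
    if conf <= 0.7:
        return "moderate"  # multiple signals agree; check the trace
    return "strong"        # consider the proposed fix


print(confidence_band(0.55))
# → moderate
```

The `conf=0.55` from the sample output falls in the middle band, so the trace is worth a manual look before acting on any fix proposal.
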
When no failure is detected but attention is still needed

Failures:   none detected
Grounding:  tool_provided_data=False  uncertainty_acknowledged=True

The tools returned no data, but the agent disclosed that fact. This is treated as acceptable behavior.
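
The two grounding flags combine into three practical cases. The sketch below spells that out. The flag names are taken from the output above, while `grounding_status` and its return labels are hypothetical and only illustrate the reading rule.

```python
def grounding_status(tool_provided_data: bool, uncertainty_acknowledged: bool) -> str:
    """Interpret the two grounding flags printed in the report."""
    if tool_provided_data:
        return "grounded"      # the answer can be backed by tool output
    if uncertainty_acknowledged:
        return "acceptable"    # no data, but the agent said so
    return "needs_review"      # no data and no disclosure


print(grounding_status(tool_provided_data=False, uncertainty_acknowledged=True))
# → acceptable
```

The third case, no tool data and no acknowledged uncertainty, is the one to watch for: the agent answered in detail with nothing to back it up.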

Supported adapters

Adapter	Use
`langchain`	LangChain / LangGraph traces
`langsmith`	LangSmith run-tree exports
`crewai`	CrewAI execution logs
`redis_help_demo`	Help center from the Redis RAG workshop

# Analyzing a LangSmith trace
result = diagnose(langsmith_trace, adapter="langsmith")

# For CrewAI
result = diagnose(crew_log, adapter="crewai")

Try it without an API key (using FakeListLLM)

pip install agent-failure-debugger[langchain] langgraph

from langchain_core.language_models import FakeListLLM
from langchain_core.messages import HumanMessage, AIMessage
from langgraph.graph import StateGraph, MessagesState, START, END
from llm_failure_atlas.adapters.callback_handler import watch

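# Adjacent string literals are concatenated, so this is one canned response.
llm = FakeListLLM(responses=[
    "Q3 2024 revenue was $4.2M, up 31% year over year. "
    "The Asia-Pacific segment accounts for 45% of the total. "
    "Operating margin expanded to 19.3% across all regions."
])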

def agent(state: MessagesState):
    return {"messages": [AIMessage(content=llm.invoke(state["messages"]))]}

workflow = StateGraph(MessagesState)
workflow.add_node("agent", agent)
workflow.add_edge(START, "agent")
workflow.add_edge("agent", END)

graph = watch(workflow.compile(), auto_diagnose=True)
graph.invoke({"messages": [HumanMessage(content="What were Q3 sales?")]})

This agent returns specific figures without calling any tools. Atlas detects this as "generated a detailed answer without any data from tools."

Using the CLI

# Run the full pipeline from a raw log
python -m agent_failure_debugger.diagnose log.json --adapter langchain

# Diagnose only, from matcher output
python -m agent_failure_debugger.main matcher_output.json
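
The CLI reads a JSON file, while the quick-start example builds the log as a Python dict. A minimal sketch of writing such a dict to `log.json` follows. It assumes the CLI accepts the same structure serialized as JSON, which the article does not state explicitly. The trimmed `raw_log` and the temporary directory are only for illustration.

```python
import json
import tempfile
from pathlib import Path

# Trimmed version of the quick-start log; note the Python `None`.
raw_log = {
    "inputs": {"query": "Change my flight to tomorrow morning"},
    "outputs": {"response": "I found several hotels near the airport."},
    "steps": [
        {"type": "tool", "name": "search_flights",
         "inputs": {"date": "2025-03-20"}, "outputs": {"flights": []}, "error": None},
    ],
}

# Write to a scratch directory; in practice use any path you like.
log_path = Path(tempfile.mkdtemp()) / "log.json"
log_path.write_text(json.dumps(raw_log, ensure_ascii=False, indent=2), encoding="utf-8")

print(log_path.name)
# → log.json
```

`json.dumps` turns Python's `None` into JSON `null`, so a dict literal pasted straight into a `.json` file would not parse. Serializing it this way avoids that.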

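What it cannot do

Stated up front:

It does not fact-check answers: it cannot judge whether "$4.2M" is correct
It cannot detect semantic mismatches: when a tool returns irrelevant data, keyword-based matching cannot tell
Multi-agent coordination failures are out of scope: only runtime failures of a single agent are covered
A wrong input format does not raise an error: if the adapter cannot extract signals, it silently returns zero detections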
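
Because a malformed log yields zero detections rather than an error, a cheap shape check before diagnosing can save confusion. The sketch below is a hypothetical guard, not something the library provides: `check_log_shape` and `REQUIRED_KEYS` are illustrative, and the expected keys are inferred from the quick-start log rather than from a documented schema.

```python
# Top-level keys seen in the quick-start log (assumed, not a documented schema).
REQUIRED_KEYS = ("inputs", "outputs", "steps")


def check_log_shape(raw_log):
    """Return a list of human-readable problems; an empty list means the
    log at least looks like the quick-start example."""
    if not isinstance(raw_log, dict):
        return ["log is not a dict"]
    problems = [f"missing top-level key: {key}" for key in REQUIRED_KEYS if key not in raw_log]
    steps = raw_log.get("steps")
    if isinstance(steps, list):
        for i, step in enumerate(steps):
            if not isinstance(step, dict) or "type" not in step:
                problems.append(f"steps[{i}] has no 'type'")
    elif "steps" in raw_log:
        problems.append("'steps' is not a list")
    return problems


print(check_log_shape({"input": {}, "steps": "oops"}))
# → ['missing top-level key: inputs', 'missing top-level key: outputs', "'steps' is not a list"]
```

If the list is non-empty, a "none detected" result says more about the input than about the agent.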

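Validation status

A 10-case evaluation dataset (with ground truth; computes precision/recall/F1, root accuracy, and path match)
A validation set of 30 scenarios × 30 annotations (including comparison against human judgment scores)
Cross-model validation against real APIs (gpt-4o-mini, Claude Haiku 4.5, Gemini 2.5 Flash): 9/9 PASS
Mutation testing: 13/13 KILLED (100%)
Semantic Cache experiment in a real Redis demo environment (30 seed/probe pairs)
Mapping analysis against the MAST taxonomy (NeurIPS 2025)

Quantitative detection accuracy against external agent benchmarks (BFCL, ToolBench, WebArena, SWE-bench, GAIA, etc.) has not been evaluated yet, because the trace conversion work has not been done. Taxonomy comparison with MAST and the Cogency Framework is complete.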
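
For readers who want to run a similar evaluation on their own traces, precision, recall, and F1 over detected failure labels can be computed as below. This is a generic micro-averaged sketch, not the project's evaluation script. `micro_prf` is a hypothetical helper and `toy_cases` is made-up data that does not reproduce any result listed above.

```python
def micro_prf(cases):
    """Micro-averaged precision, recall, and F1 over
    (expected_labels, detected_labels) pairs of sets."""
    tp = fp = fn = 0
    for expected, detected in cases:
        tp += len(expected & detected)   # correctly detected failures
        fp += len(detected - expected)   # false alarms
        fn += len(expected - detected)   # missed failures
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data for illustration only.
toy_cases = [
    ({"agent_tool_call_loop", "incorrect_output"}, {"agent_tool_call_loop"}),  # one miss
    (set(), {"incorrect_output"}),                                             # one false alarm
]

print(micro_prf(toy_cases))
# → (0.5, 0.5, 0.5)
```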

Links

PyPI: agent-failure-debugger / llm-failure-atlas
GitHub: Atlas / Debugger
MIT License

Please try it on real-world agent failure traces. Feedback on both what it detected and what it missed is welcome.
