[Google Colab] Building an AI Sushi Critic with Qwen3-Reranker: Testing the Accuracy of Its yes/no Judgments

Posted at 2025-06-10

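Introduction

I used the recently released Qwen3-Reranker-0.6B to build an AI sushi critic that answers sushi-related questions with a blunt "yes" or "no". The judgments turned out to be more accurate than I expected, so this post shares the implementation and the results.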

What this article covers:

Basic usage of Qwen3-Reranker
How to implement it on Google Colab
Actual judgment results and an accuracy analysis
Practical applications of a reranker

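What is Qwen3-Reranker?

Qwen3-Reranker is a recent text ranking model from Alibaba Cloud. The 0.6B variant has only about 600 million parameters, yet it handles more than 100 languages (including Japanese) and a context of up to 32,000 tokens. It scores how well a document matches a query as a value between 0 and 1, computed from the relative probabilities of the "yes" and "no" tokens, which makes it easy to judge intuitively whether a document answers the question.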
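
To make the scoring concrete, here is a minimal sketch of how two logits can be turned into a 0-1 score. The function name `relevance_score` and the sample logit values are illustrative assumptions, not values taken from the model. The point is only that a softmax restricted to the "yes" and "no" tokens equals a sigmoid of their logit difference.

```python
import math


def relevance_score(yes_logit: float, no_logit: float) -> float:
    """Two-way softmax over the yes/no logits (hypothetical helper name).

    exp(y) / (exp(y) + exp(n)) equals sigmoid(y - n). The two branches
    keep the exponent non-positive so math.exp never overflows.
    """
    diff = yes_logit - no_logit
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    z = math.exp(diff)
    return z / (1.0 + z)


# Made-up logits, for illustration only.
print(relevance_score(2.0, 2.0))   # equal logits give exactly 0.5
print(relevance_score(5.0, -1.0))  # "yes" clearly ahead: close to 1
print(relevance_score(-3.0, 4.0))  # "no" clearly ahead: close to 0
```

Only the gap between the two logits matters, so adding the same constant to both leaves the score unchanged.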

Implementation: Building the AI Sushi Critic

The implementation I tried on Google Colab

# Install libraries (Qwen3 models need transformers 4.51.0 or later)
!pip install -q "transformers>=4.51.0" torch

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import math

def create_sushi_judge():
    """Initialize the AI sushi critic."""
    print("Starting the AI sushi critic...")

    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Reranker-0.6B", padding_side='left')
    model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-Reranker-0.6B").eval()

    # Token IDs the model uses for "yes" and "no"
    token_yes = tokenizer.convert_tokens_to_ids("yes")
    token_no = tokenizer.convert_tokens_to_ids("no")

    print("Ready!\n")
    return tokenizer, model, token_yes, token_no

def judge_sushi(tokenizer, model, token_yes, token_no, question, document):
    """Main judgment routine of the AI sushi critic."""

    # Prompt format expected by Qwen3-Reranker
    prompt = f"""<|im_start|>system
Judge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>
<|im_start|>user
<Instruct>: Given a web search query, retrieve relevant passages that answer the query

<Query>: {question}

<Document>: {document}<|im_end|>
<|im_start|>assistant
<think>

</think>

"""
    
    # Run inference
    inputs = tokenizer(prompt, return_tensors="pt", max_length=8192, truncation=True)

    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits[0, -1, :]  # next-token logits at the last position

        # Logits of the "yes" and "no" tokens
        yes_logit = logits[token_yes].item()
        no_logit = logits[token_no].item()

        # Final score in the 0-1 range: softmax over the two logits
        yes_score = math.exp(yes_logit)
        no_score = math.exp(no_logit)
        final_score = yes_score / (yes_score + no_score)

        # Verdict
        judgment = "yes" if final_score > 0.5 else "no"

    return judgment, final_score

def custom_test(tokenizer, model, token_yes, token_no, question, document):
    """Run a user-defined test case."""
    judgment, score = judge_sushi(tokenizer, model, token_yes, token_no, question, document)
    print("\nCustom test result:")
    print(f"Question: {question}")
    print(f"Document: {document}")
    print(f"AI judgment: {judgment} (score: {score:.3f})")
    return judgment, score

# ========== Main ==========

# Initialization
tokenizer, model, token_yes, token_no = create_sushi_judge()

# Test cases (queries and documents stay in Japanese; English glosses in comments)
test_cases = [
    {
        # Q: Tell me about tuna sushi
        # D: Tuna is called the king of sushi toppings; otoro in particular is fatty and rich.
        "question": "マグロの寿司について教えて",
        "document": "マグロは寿司ネタの王様と呼ばれ、特に大トロは脂が多く濃厚な味わいが特徴です。",
        "expected": "perfect match"
    },
    {
        # Q: Explain how conveyor-belt sushi works
        # D: Sushi is a traditional Japanese dish of fresh seafood on vinegared rice.
        "question": "回転寿司の仕組みを教えて",
        "document": "寿司は酢飯の上に新鮮な魚介類を乗せた日本の伝統的な料理です。",
        "expected": "loosely related"
    },
    {
        # Q: How do you make sushi?
        # D: Ramen is a tasty dish made with Chinese-style noodles; the soup is the key.
        "question": "寿司の作り方は？",
        "document": "ラーメンは中華麺を使った美味しい料理で、スープが決め手です。",
        "expected": "completely off"
    },
    {
        # Q: Does this sushi restaurant use fresh fish?
        # D: We use only fresh seafood bought every morning from Tsukiji Market.
        "question": "この寿司屋は新鮮な魚を使っていますか？",
        "document": "毎朝築地市場から仕入れた新鮮な魚介類のみを使用しています。",
        "expected": "direct answer"
    }
]

# Run the tests
print("Starting the AI sushi critic tests!\n")
print("=" * 70)

results = []
for i, test in enumerate(test_cases, 1):
    judgment, score = judge_sushi(
        tokenizer, model, token_yes, token_no,
        test["question"],
        test["document"]
    )

    results.append({
        "no": i,
        "expected": test["expected"],
        "question": test["question"],
        "document": test["document"],
        "judgment": judgment,
        "score": score
    })

    print(f"\nTest case {i}: {test['expected']}")
    print(f"Question: {test['question']}")
    print(f"Document: {test['document']}")
    print(f"AI judgment: {judgment} (score: {score:.3f})")
    print("-" * 70)

# Print the results summary
print("\n\nResults summary")
print("=" * 70)
print("| No | Expected        | Judgment | Score | Pass? |")
print("|----|-----------------|----------|-------|-------|")

for result in results:
    # Check whether the judgment and score fall in the band expected for the pattern
    success = "○" if (
        (result["expected"] in ["perfect match", "direct answer"] and result["judgment"] == "yes" and result["score"] > 0.9) or
        (result["expected"] == "loosely related" and result["judgment"] == "yes" and 0.5 < result["score"] < 0.7) or
        (result["expected"] == "completely off" and result["judgment"] == "no" and result["score"] < 0.1)
    ) else "×"

    print(f"| {result['no']:<2} | {result['expected']:<15} | {result['judgment']:<8} | {result['score']:.3f} | {success:<5} |")

print("\n" + "=" * 70)

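# Example custom tests
print("\n\nLet's run some custom tests!")
print("=" * 70)

# Custom test 1
# Q: Is salmon sushi good for your health?
# D: Salmon is rich in omega-3 fatty acids, which are considered good for heart health.
custom_test(
    tokenizer, model, token_yes, token_no,
    "サーモンの寿司は健康に良いですか？",
    "サーモンにはオメガ3脂肪酸が豊富で、心臓の健康に良いとされています。"
)

# Custom test 2
# Q: How much wasabi is the right amount?
# D: When eating sushi, it is common to dip it lightly in soy sauce.
custom_test(
    tokenizer, model, token_yes, token_no,
    "わさびの適量は？",
    "寿司を食べる際は、醤油に少量つけるのが一般的です。"
)

# Custom test 3
# Q: What characterizes Edomae sushi?
# D: Edomae sushi uses seafood caught in Tokyo Bay, prepared with the chef's craft.
custom_test(
    tokenizer, model, token_yes, token_no,
    "江戸前寿司の特徴は？",
    "江戸前寿司は、東京湾で獲れた魚介類を使い、職人の技術で仕込みを施した寿司です。"
)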

Test case design and execution

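With sushi as the theme, I tested four different patterns:

Perfect match: the question and the document correspond directly
Loosely related: the document is related but does not directly answer the question
Completely off: the content is entirely unrelated
Direct answer: the document clearly answers the question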
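
The four patterns can also be written down as data, so that the pass/fail rule is visible in one place. This is only a sketch: the English category keys and the names `EXPECTED_BANDS` and `meets_expectation` are hypothetical, and the bands simply restate the thresholds used in the summary check of the script above (above 0.9, between 0.5 and 0.7, below 0.1).

```python
# Hypothetical English keys standing in for the four test patterns.
# Each entry: (expected judgment, exclusive lower bound, exclusive upper bound).
EXPECTED_BANDS = {
    "perfect match": ("yes", 0.9, None),
    "direct answer": ("yes", 0.9, None),
    "loosely related": ("yes", 0.5, 0.7),
    "completely off": ("no", None, 0.1),
}


def meets_expectation(category: str, judgment: str, score: float) -> bool:
    """Return True when the judgment and score fall inside the category's band."""
    expected_judgment, low, high = EXPECTED_BANDS[category]
    if judgment != expected_judgment:
        return False
    if low is not None and not score > low:
        return False
    if high is not None and not score < high:
        return False
    return True


# Scores as reported later in this article.
print(meets_expectation("loosely related", "yes", 0.582))
print(meets_expectation("completely off", "no", 0.002))
```

Keeping the bands in a table makes it easy to tighten or loosen a threshold without touching the comparison logic.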

Results and analysis

Sample output

Starting the AI sushi critic...
Ready!

Starting the AI sushi critic tests!

======================================================================

Test case 1: perfect match
Question: マグロの寿司について教えて
Document: マグロは寿司ネタの王様と呼ばれ、特に大トロは脂が多く濃厚な味わいが特徴です。
AI judgment: yes (score: 1.000)
----------------------------------------------------------------------

Test case 2: loosely related
Question: 回転寿司の仕組みを教えて
Document: 寿司は酢飯の上に新鮮な魚介類を乗せた日本の伝統的な料理です。
AI judgment: yes (score: 0.582)
----------------------------------------------------------------------

Test case 3: completely off
Question: 寿司の作り方は？
Document: ラーメンは中華麺を使った美味しい料理で、スープが決め手です。
AI judgment: no (score: 0.002)
----------------------------------------------------------------------

Test case 4: direct answer
Question: この寿司屋は新鮮な魚を使っていますか？
Document: 毎朝築地市場から仕入れた新鮮な魚介類のみを使用しています。
AI judgment: yes (score: 0.999)
----------------------------------------------------------------------

Results summary

No	Expected	Judgment	Score	Pass?
1	Perfect match	yes	1.000	○
2	Loosely related	yes	0.582	○
3	Completely off	no	0.002	○
4	Direct answer	yes	0.999	○

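Key observations

A score around 0.5 picks out the borderline case that is "related but not sufficient as an answer" (here, 0.582).

The completely unrelated pair returned an extremely low value, 0.002, so a false positive is all but ruled out.

The direct matches scored 0.999 or higher, a level of confidence that is reassuring for production use as well.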
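
Because the score is a softmax over two logits, it can be inverted to see how far apart the "yes" and "no" logits were. The sketch below is an illustration under that assumption; `logit_gap` is a hypothetical helper name, and the inputs are the rounded scores printed above, so the gaps are approximate.

```python
import math


def logit_gap(score: float) -> float:
    """Invert the yes/no softmax: return yes_logit - no_logit for a 0-1 score."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return math.log(score / (1.0 - score))


# Scores taken from the run above; 1.000 is skipped because a rounded
# score of exactly 1 has no finite logit gap.
for score in (0.999, 0.582, 0.002):
    print(f"{score:.3f} -> gap {logit_gap(score):+.2f}")
```

The borderline case sits at a gap of roughly +0.33, while the clear cases are near +6.9 and -6.2, which shows how much narrower the margin is around 0.5.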

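Technical points

How the reranker works

Input: a pair of question and document
Processing: the Transformer interprets the meaning of the pair
Output: a comparison of the probabilities of the yes and no tokens
Judgment: a relevance score on a 0-1 scale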
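
Applying that per-pair score to a list of candidates is what turns the model into a reranker. The sketch below shows only the sorting step. `rerank` and `overlap_score` are hypothetical names, and `overlap_score` is a toy word-overlap stand-in so the example runs without downloading a model; in practice one would pass a function that returns the model's yes-probability for the pair.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    documents: List[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every (query, document) pair and return the top_k by score."""
    scored = [(doc, score_fn(query, doc)) for doc in documents]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def overlap_score(query: str, document: str) -> float:
    """Toy stand-in scorer: fraction of query words that appear in the document.

    Not a relevance model; it exists only to make the sketch runnable.
    """
    words = set(query.lower().split())
    if not words:
        return 0.0
    return len(words & set(document.lower().split())) / len(words)


docs = [
    "ramen noodles and soup",
    "tuna sushi and fatty toro",
    "sushi rice with vinegar",
]
for doc, score in rerank("tuna sushi", docs, overlap_score, top_k=2):
    print(f"{score:.2f}  {doc}")
```

Passing the scorer as a parameter keeps the ranking logic independent of how the score is produced.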

The importance of prompt engineering

Qwen3-Reranker requires a specific prompt format:

<|im_start|>system
Judge whether the Document meets the requirements...
<|im_end|>
<|im_start|>user
<Instruct>: [instruction]
<Query>: [query]
<Document>: [document]
<|im_end|>
<|im_start|>assistant
<think>

</think>
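
A small helper makes it harder to get this format wrong. The following is a sketch rather than an official API: `build_prompt`, `SYSTEM_PROMPT`, and `DEFAULT_INSTRUCTION` are hypothetical names, the system and instruction text are copied from the script earlier in this article, and separating the three fields with single newlines is an assumption based on the outline above.

```python
SYSTEM_PROMPT = (
    "Judge whether the Document meets the requirements based on the Query "
    'and the Instruct provided. Note that the answer can only be "yes" or "no".'
)
DEFAULT_INSTRUCTION = (
    "Given a web search query, retrieve relevant passages that answer the query"
)


def build_prompt(query: str, document: str, instruction: str = DEFAULT_INSTRUCTION) -> str:
    """Assemble the chat-style prompt, ending with an empty think block."""
    return (
        f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
        f"<|im_start|>user\n"
        f"<Instruct>: {instruction}\n"
        f"<Query>: {query}\n"
        f"<Document>: {document}<|im_end|>\n"
        f"<|im_start|>assistant\n<think>\n\n</think>\n\n"
    )


print(build_prompt("寿司の作り方は？", "寿司は酢飯の上に魚介類を乗せた料理です。"))
```

Exposing the instruction as a parameter allows the task description to be changed per use case while the surrounding template stays fixed.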

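Summary

Trying out the AI sushi critic with Qwen3-Reranker confirmed that an implementation of roughly 100 lines can judge the relevance between a question and a document, Japanese included, on a 0-1 scale, and that it did so accurately in these tests. I think there is room to apply it in real services such as recommendation, customer support, and search re-ranking.

Give it a try yourself!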
