How to Use Qwen3-Reranker

Posted at 2025-07-17

Qwen3-Reranker in Depth: How LLM-Based Rerankers Work, with Fine-Tuning Tips

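When you build a RAG (Retrieval-Augmented Generation) pipeline, a reranker is the indispensable "final push". This article looks at Qwen3-Reranker, one of the most closely watched LLM-based rerankers, and covers:

Key points for reading the code

Inference techniques for avoiding OOM

How to apply the approach to other local LLMs

The fine-tuning procedure and data format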

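Why an LLM-based reranker?

Conventional cross-encoder rerankers are lightweight, but their accuracy has limits.
More recently, using an LLM itself as the reranker has drawn attention as a way to get high recall from a simple implementation.
Qwen3-Reranker's distinguishing design is that it judges whether a query and a document are relevant with a "yes" / "no" answer.

Basic usage (from_pretrained → inference) is covered in detail on the model card, so this article focuses on the key points of the internal implementation.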

Code reading: three key functions

1. `format_instruction`

def format_instruction(instruction, query, doc):
    if instruction is None:
        instruction = (
            'Given a web search query, retrieve relevant passages that answer the query'
        )
    output = (
        "<Instruct>: {instruction}\n<Query>: {query}\n<Document>: {doc}"
        .format(instruction=instruction, query=query, doc=doc)
    )
    return output

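Role: formats the query and document into the template that Qwen3-Reranker expects.
Key point: this is where prompt engineering comes in. Changing only <Instruct> lets you extend the reranker to other tasks.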
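As a small illustration of that point, the sketch below reuses the same template with a different task description. The FAQ instruction, query, and document are invented for this example and do not come from the model card.

```python
def format_instruction(instruction, query, doc):
    # Same template as the function above; None falls back to the default task.
    if instruction is None:
        instruction = "Given a web search query, retrieve relevant passages that answer the query"
    return "<Instruct>: {instruction}\n<Query>: {query}\n<Document>: {doc}".format(
        instruction=instruction, query=query, doc=doc
    )

# Hypothetical task description, made up for illustration.
faq_task = "Given a customer support question, retrieve FAQ entries that resolve it"

pair = format_instruction(
    faq_task,
    "How do I reset my password?",
    "Open Settings and choose Reset password.",
)
print(pair)
```

Only the first line of the prompt changes; the <Query> and <Document> lines keep the same shape, so the rest of the pipeline stays untouched.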

2. `process_inputs`

def process_inputs(pairs):
    inputs = tokenizer(
        pairs,
        padding=False,
        truncation='longest_first',
        return_attention_mask=False,
        max_length=max_length - len(prefix_tokens) - len(suffix_tokens),
    )
    for i, ele in enumerate(inputs['input_ids']):
        inputs['input_ids'][i] = prefix_tokens + ele + suffix_tokens
    inputs = tokenizer.pad(inputs, padding=True, return_tensors="pt", max_length=max_length)
    for key in inputs:
        inputs[key] = inputs[key].to(model.device)
    return inputs

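Role: attaches the prefix / suffix tokens while trimming each input so that it fits within the maximum length.
The tensors are also moved to the model's device here, so downstream code does not need to care whether it runs on GPU or CPU.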
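To make the length budget concrete, here is a toy sketch that uses plain lists of integers instead of a real tokenizer. The helper names `build_row` and `left_pad` and all token IDs are made up, and simple right-side truncation stands in for the tokenizer's `longest_first` strategy. Left padding is assumed, as in the model card's tokenizer setup (`padding_side='left'`).

```python
def build_row(body_ids, prefix_ids, suffix_ids, max_length):
    """Trim the body so that prefix + body + suffix fits in max_length tokens."""
    budget = max_length - len(prefix_ids) - len(suffix_ids)
    return prefix_ids + body_ids[:budget] + suffix_ids

def left_pad(rows, pad_id):
    """Pad on the left so the last position of every row is a real token."""
    width = max(len(r) for r in rows)
    return [[pad_id] * (width - len(r)) + r for r in rows]

# Made-up token IDs, for illustration only.
prefix = [101, 102]
suffix = [901, 902, 903]
bodies = [list(range(1, 11)), [7, 8]]  # one long input, one short input

rows = [build_row(b, prefix, suffix, max_length=8) for b in bodies]
batch = left_pad(rows, pad_id=0)
for row in batch:
    print(row)
```

Reserving room for the prefix and suffix before truncating means the suffix is never cut off, which matters because the score is read from the position right after it.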

3. `compute_logits`: the core of the reranker

def compute_logits(inputs, **kwargs):
    batch_scores = model(**inputs).logits[:, -1, :]
    true_vector  = batch_scores[:, token_true_id]   # raw logit for "yes"
    false_vector = batch_scores[:, token_false_id]  # raw logit for "no"

    score_mat = torch.stack([false_vector, true_vector], dim=1)  # (B, 2)
    score_mat = torch.nn.functional.log_softmax(score_mat, dim=1)

    # Convert the log-probability of "yes" back with exp → similarity score
    return score_mat[:, 1].exp().tolist()

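This is the most interesting function: it computes the similarity (relevance) score used for reranking.
First, batch_scores takes the logits at the last token position, that is, the output of the final linear layer. These are the raw scores over the vocabulary, before anything is turned back into text. (Indexing with -1 assumes left padding, as in the model card's tokenizer setup, so that the last position is a real token in every row.)
From batch_scores, the function pulls out the score for the "yes" token (token_true_id) and the score for the "no" token (token_false_id).
These are how strongly the model leans toward "yes" and toward "no".
After stacking the two scores,
torch.nn.functional.log_softmax() turns them into log-probabilities over just those two answers. In other words, for the question "is this document relevant?", we now have, in log form, the probability that the model answers "yes" versus "no".
Finally, exp() converts the log-probability of "yes" back into an ordinary probability, and that value is returned as the rerank score.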
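One way to read this computation: a softmax over only two logits is the same as a sigmoid applied to their difference, so the score depends only on how far the "yes" logit sits above the "no" logit. The sketch below checks this with plain Python and made-up logit values; the function names are chosen for this example.

```python
import math

def yes_probability(yes_logit, no_logit):
    """Two-way softmax over (no, yes), returning P(yes)."""
    m = max(yes_logit, no_logit)  # subtract the max for numerical stability
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Made-up logits for illustration.
yes_logit, no_logit = 2.0, -1.0
score = yes_probability(yes_logit, no_logit)
print(score, sigmoid(yes_logit - no_logit))  # the two values agree
```

A consequence is that equal logits always give a score of 0.5, regardless of how large or small the two logits are individually.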

Tip 1: Mini-batch inference to avoid OOM

Passing many documents in a single call tends to exhaust VRAM. Simply splitting the batch into chunks, as below, makes inference far more stable.

@torch.no_grad()
def compute_logits_chunked(inputs, chunk_size=4):
    def slice_dict(src, start, end):
        return {k: v[start:end] for k, v in src.items()}

    all_scores = []
    total = inputs["input_ids"].size(0)
    for start in range(0, total, chunk_size):
        end  = min(start + chunk_size, total)
        mini = slice_dict(inputs, start, end)

        logits = model(**mini).logits[:, -1, :]
        true_vec  = logits[:, token_true_id]
        false_vec = logits[:, token_false_id]
        score_mat = torch.stack([false_vec, true_vec], dim=1)
        score_mat = torch.nn.functional.log_softmax(score_mat, dim=1)
        all_scores.append(score_mat[:, 1].exp())

    return torch.cat(all_scores).tolist()

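Key point: don't forget torch.no_grad(). At inference time, simply not tracking gradients saves a substantial amount of memory, because intermediate activations no longer need to be kept for backpropagation.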
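A rough back-of-the-envelope calculation shows why chunking helps. `model(**inputs).logits` holds one score per vocabulary entry for every position of every row, so the tensor grows linearly with the batch size. The vocabulary size and padded length below are assumptions for illustration (check `model.config.vocab_size` for the real value), and 2 bytes per value assumes half precision.

```python
def logits_bytes(batch_size, seq_len, vocab_size, bytes_per_value=2):
    """Size of a (batch, seq_len, vocab) logits tensor in bytes."""
    return batch_size * seq_len * vocab_size * bytes_per_value

VOCAB_SIZE = 151_936  # assumed vocabulary size, not read from the model
SEQ_LEN = 2_048       # assumed padded length per query-document pair

for batch in (32, 4):
    gib = logits_bytes(batch, SEQ_LEN, VOCAB_SIZE) / 2**30
    print(f"batch={batch:>2}: logits tensor is about {gib:.1f} GiB")
```

This counts the logits tensor only; attention and other activations come on top. Shrinking chunk_size reduces all of them proportionally.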

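Tip 2: Turn your favorite local LLM into a reranker

Qwen3-Reranker has no special head; it uses the yes/no probability directly as the score.
Computing the yes/no probability in the same way lets you apply the approach to other local LLMs (although fine-tuning may be needed).
Note, however, that the mapping between token IDs and words (for example, "yes" = 1098 and "no" = 3798) differs from model to model.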
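Rather than hard-coding IDs, it is safer to look them up from the tokenizer of the model you are using and to confirm that each answer word maps to exactly one token. The sketch below assumes a Hugging Face style `encode(text, add_special_tokens=False)` method; `single_token_id` is a hypothetical helper, and `ToyTokenizer` with its IDs is a stand-in so that the example runs without downloading a model.

```python
def single_token_id(tokenizer, word):
    """Return the ID of `word` if the tokenizer encodes it as exactly one token."""
    ids = tokenizer.encode(word, add_special_tokens=False)
    if len(ids) != 1:
        raise ValueError(f"{word!r} maps to {len(ids)} tokens: {ids}")
    return ids[0]

class ToyTokenizer:
    """Stand-in for a real tokenizer; the vocabulary and IDs are made up."""
    vocab = {"yes": 11, "no": 12}

    def encode(self, text, add_special_tokens=False):
        # Unknown words are split into two dummy tokens to mimic sub-word splits.
        return [self.vocab[text]] if text in self.vocab else [0, 0]

tok = ToyTokenizer()
token_true_id = single_token_id(tok, "yes")
token_false_id = single_token_id(tok, "no")
print(token_true_id, token_false_id)
```

If a model splits the answer word into several tokens, reading a single logit no longer captures the whole answer, so the check fails loudly instead of returning a misleading ID.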

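Tip 3: Repurposing it as a classifier

Swap the final answer token for numbers or other labels and the same setup becomes a simple multi-class classifier.

Example: look at the probabilities of "<label>: 1" / "<label>: 2" ...
The advantage over ordinary text generation is that there is much less risk of the output format breaking.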
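The yes/no scoring generalizes directly: take the last-position logits for each candidate label token and apply a softmax over just those entries. The sketch below uses plain Python and invented logit values; in practice the values would come from the model's logits indexed by each label's token ID.

```python
import math

def label_probabilities(label_logits):
    """Softmax restricted to the candidate label tokens."""
    m = max(label_logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(v - m) for label, v in label_logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

# Made-up last-position logits for the label tokens "1", "2", "3".
probs = label_probabilities({"1": 0.2, "2": 2.9, "3": -1.0})
predicted = max(probs, key=probs.get)
print(predicted, probs)
```

Because only the listed label tokens take part in the softmax, the result is always a valid distribution over the classes, whatever else the model might have wanted to generate.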

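Fine-tuning procedure

The starting point is the SWIFT script in the official repository:
https://github.com/QwenLM/Qwen3-Embedding/blob/main/docs/training/SWIFT.md
Pay attention to the data format: arrange the data as JSON Lines whose expected final output is yes/no.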

{"system":"Judge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be \"yes\" or \"no\".","input":"<Instruct>: Given a search query, retrieve relevant passages that answer the query\n<Query>: sentence1\n<Document>: sentence1-positive","output":"<think>\n\n</think>\n\nyes"}
{"system":"Judge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be \"yes\" or \"no\".","input":"<Instruct>: Given a search query, retrieve relevant passages that answer the query\n<Query>: sentence1\n<Document>: sentence1-negative1","output":"<think>\n\n</think>\n\nno"}

After that, fine-tune as usual with LoRA, QLoRA, or similar.
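Rows in the format shown above can be generated from (query, positives, negatives) triples with a few lines of Python. This is a minimal sketch: `make_row` and `to_jsonl` are hypothetical helper names, and the system and instruction strings are copied from the two sample rows.

```python
import json

SYSTEM = ('Judge whether the Document meets the requirements based on the Query '
          'and the Instruct provided. Note that the answer can only be "yes" or "no".')
INSTRUCT = "Given a search query, retrieve relevant passages that answer the query"

def make_row(query, doc, relevant):
    """Build one training record; the label becomes the final yes/no token."""
    return {
        "system": SYSTEM,
        "input": f"<Instruct>: {INSTRUCT}\n<Query>: {query}\n<Document>: {doc}",
        "output": "<think>\n\n</think>\n\n" + ("yes" if relevant else "no"),
    }

def to_jsonl(query, positives, negatives):
    """Serialize positives as "yes" rows and negatives as "no" rows."""
    rows = [make_row(query, d, True) for d in positives]
    rows += [make_row(query, d, False) for d in negatives]
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in rows)

jsonl = to_jsonl("sentence1", ["sentence1-positive"], ["sentence1-negative1"])
print(jsonl)
```

Passing ensure_ascii=False keeps non-ASCII text such as Japanese queries readable in the output file.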

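Summary

Qwen3-Reranker achieves high-accuracy reranking with a simple yes/no framing.
Once you follow the code, porting the idea to other models or applying it to classification tasks is straightforward.
Fine-tuning is also relatively easy to customize once the JSON Lines data is prepared.

If you want to improve the accuracy of your RAG pipeline, give it a try!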
