Evaluating Azure AI Search retrieval accuracy with trec_eval

Last updated at 2024-11-30, posted at 2024-11-30

Introduction

At work we decided we wanted to improve the retrieval accuracy of AI Search, so I started by building a foundation for measuring that accuracy.

Environment setup

Required tools and libraries

Python 3.8 or later
Required Python packages
- azure-search-documents
- azure-identity
- python-dotenv
- pytrec_eval ← the star of this article
A .env file (for the Azure endpoint and index settings)

pip install azure-search-documents azure-identity python-dotenv pytrec_eval

Preparing the Azure AI Search service

I assume that the Azure AI Search resource and the index have already been created.
Put the following in the .env file:

AZURE_AI_SEARCH_ENDPOINT="https://<your-search-service>.search.windows.net"
AZURE_AI_SEARCH_INDEX="example"

Implementing the accuracy evaluation

Getting the AI Search client

First, get an AI Search client.
This article uses AzureCliCredential, so the Azure CLI login can be reused as-is.

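import os

from azure.identity import AzureCliCredential
from azure.search.documents import SearchClient
from dotenv import load_dotenv

# Load AZURE_AI_SEARCH_ENDPOINT and AZURE_AI_SEARCH_INDEX from the .env file
load_dotenv()

def get_client():
    # AzureCliCredential reuses the account signed in via `az login`
    credential = AzureCliCredential()
    search_client = SearchClient(
        endpoint=os.getenv("AZURE_AI_SEARCH_ENDPOINT", ""),
        index_name=os.getenv("AZURE_AI_SEARCH_INDEX", ""),
        credential=credential
    )
    return search_client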
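
Because get_client falls back to an empty string when a variable is missing, a typo in the .env file only surfaces later as a confusing connection error. As a minimal sketch, a fail-fast check might look like the following. The helper name require_env is hypothetical (it is not part of any Azure SDK), and the values are the placeholders from the .env example, passed in as a plain dict so the snippet runs without a real environment.

```python
import os
from typing import Mapping, Optional


def require_env(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return a required setting, raising if it is missing or blank."""
    # Hypothetical helper: reads from os.environ unless a mapping is supplied
    source = os.environ if env is None else env
    value = source.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Placeholder settings standing in for the values loaded from .env
settings = {
    "AZURE_AI_SEARCH_ENDPOINT": "https://<your-search-service>.search.windows.net",
    "AZURE_AI_SEARCH_INDEX": "example",
}
endpoint = require_env("AZURE_AI_SEARCH_ENDPOINT", settings)
index_name = require_env("AZURE_AI_SEARCH_INDEX", settings)
```

Calling require_env without the second argument reads from the real process environment instead.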

Searching with AI Search

Send a query to AI Search and get the search results.
Note: top is the maximum number of results to retrieve.

def search_azure(client: SearchClient, query: str, top: int = 30):
    results = client.search(search_text=query, top=top)
    return results

Saving the search results in TREC format

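def save_trec_results(client, queries, results_file="results.trec"):
    # One line per hit: query_id Q0 doc_id rank score run_tag
    with open(results_file, "w", encoding="utf-8") as file:
        for query_id, query_text in queries.items():
            results = search_azure(client, query_text)
            # Ranks start at 1, in the order returned by AI Search
            for rank, result in enumerate(results, start=1):
                doc_id = result["id"]  # assumes the index key field is named "id"
                score = result["@search.score"]
                file.write(f"{query_id} Q0 {doc_id} {rank} {score} azure_search\n")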
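
Each line of the run file has six whitespace-separated columns: query ID, the literal Q0, document ID, rank, score, and a run tag. To make that layout concrete, here is a small sketch that builds one line and parses it back. The helper names format_run_line and parse_run_line are my own, and the document ID and score are made-up sample values. Since the columns are split on whitespace, this assumes document IDs contain no spaces.

```python
def format_run_line(query_id, doc_id, rank, score, run_tag="azure_search"):
    """Build one TREC run line: query_id Q0 doc_id rank score run_tag."""
    return f"{query_id} Q0 {doc_id} {rank} {score} {run_tag}"


def parse_run_line(line):
    """Split a TREC run line back into its six columns."""
    query_id, _q0, doc_id, rank, score, run_tag = line.split()
    return {
        "query_id": query_id,
        "doc_id": doc_id,
        "rank": int(rank),
        "score": float(score),
        "run_tag": run_tag,
    }


# Made-up sample values, for illustration only
line = format_run_line("1", "doc1", 1, 12.5)
record = parse_run_line(line)
```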

Running TREC Eval

Preparing the evaluation data

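qrels.txt: the gold-standard file (relevance judgments that tie each query to documents). Each line holds the query ID, an iteration field (conventionally 0), the document ID, and the relevance label. Sample data:
```
1 0 doc1 1
1 0 doc2 0
2 0 doc3 1
```
results.trec: the file containing the AI Search results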
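
If the relevance judgments live in a spreadsheet or a Python structure rather than a hand-written file, the qrels format is easy to generate. The following is only a sketch: write_qrels is a hypothetical helper, and the judgments dict simply mirrors the sample data shown for qrels.txt. It writes to an in-memory buffer here, but any open text file works the same way.

```python
import io


def write_qrels(judgments, file):
    """Write {query_id: {doc_id: relevance}} as qrels lines."""
    for query_id, docs in judgments.items():
        for doc_id, relevance in docs.items():
            # The second column is the iteration field, conventionally 0
            file.write(f"{query_id} 0 {doc_id} {relevance}\n")


# Same judgments as the qrels.txt sample
judgments = {"1": {"doc1": 1, "doc2": 0}, "2": {"doc3": 1}}

buffer = io.StringIO()
write_qrels(judgments, buffer)
qrels_text = buffer.getvalue()
```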

Evaluating the search results

This article computes the evaluation metrics with pytrec_eval, a Python wrapper around trec_eval.
For now, I output every metric it can compute.

import pytrec_eval

def run_trec_eval(qrels_file, results_file, eval_output="trec_eval_output.txt"):
    if not os.path.exists(qrels_file) or not os.path.exists(results_file):
        print("Error: qrels or results file is missing!")
        return

    # Load the qrels file
    with open(qrels_file, 'r') as f:
        qrels = pytrec_eval.parse_qrel(f)

    # Load the search results file
    with open(results_file, 'r') as f:
        run = pytrec_eval.parse_run(f)

    # Run the evaluation
    evaluator = pytrec_eval.RelevanceEvaluator(qrels, pytrec_eval.supported_measures)
    results = evaluator.evaluate(run)

    # Save the results to a file
    with open(eval_output, "w") as output_file:
        for query_id, metrics in results.items():
            for metric, value in metrics.items():
                output_file.write(f"{query_id} {metric} {value}\n")
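
The evaluator returns one dict of metrics per query, so with many queries a per-metric average is often easier to read than the raw listing. A minimal sketch of that aggregation in plain Python follows. The function name mean_per_metric and the sample numbers are made up for illustration. A plain arithmetic mean suits metrics such as map or recip_rank, but not counters like num_rel or geometric-mean metrics like gm_map.

```python
from statistics import mean


def mean_per_metric(results):
    """Average each metric over all queries in {query_id: {metric: value}}."""
    values_by_metric = {}
    for metrics in results.values():
        for metric, value in metrics.items():
            values_by_metric.setdefault(metric, []).append(value)
    return {metric: mean(values) for metric, values in values_by_metric.items()}


# Made-up per-query results in the same nested-dict shape
sample_results = {
    "1": {"map": 0.5, "recip_rank": 1.0},
    "2": {"map": 1.0, "recip_rank": 0.5},
}
summary = mean_per_metric(sample_results)
```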

Example output

1 runid 0.0
1 num_q 1.0
1 num_ret 30.0
1 num_rel 4.0
1 num_rel_ret 4.0
1 map 0.8541666666666666
1 gm_map -0.15762894420358317
1 Rprec 0.75
1 bpref 1.0
1 recip_rank 1.0
1 iprec_at_recall_0.00 1.0
1 iprec_at_recall_0.10 1.0
1 iprec_at_recall_0.20 1.0
1 iprec_at_recall_0.30 1.0
1 iprec_at_recall_0.40 1.0
1 iprec_at_recall_0.50 1.0
1 iprec_at_recall_0.60 0.75
1 iprec_at_recall_0.70 0.75
1 iprec_at_recall_0.80 0.6666666666666666
1 iprec_at_recall_0.90 0.6666666666666666
1 iprec_at_recall_1.00 0.6666666666666666
1 P_5 0.6
1 P_10 0.4
1 P_15 0.26666666666666666
1 P_20 0.2
1 P_30 0.13333333333333333
1 P_100 0.04
1 P_200 0.02
1 P_500 0.008
1 P_1000 0.004
1 relstring 0.0
1 recall_5 0.75
1 recall_10 1.0
1 recall_15 1.0
1 recall_20 1.0
1 recall_30 1.0
1 recall_100 1.0
1 recall_200 1.0
1 recall_500 1.0
1 recall_1000 1.0
1 infAP 0.8541643750340272
1 gm_bpref 0.0
1 Rprec_mult_0.20 1.0
1 Rprec_mult_0.40 1.0
1 Rprec_mult_0.60 0.6666666666666666
1 Rprec_mult_0.80 0.75
1 Rprec_mult_1.00 0.75
1 Rprec_mult_1.20 0.6
1 Rprec_mult_1.40 0.6666666666666666
1 Rprec_mult_1.60 0.5714285714285714
1 Rprec_mult_1.80 0.5
1 Rprec_mult_2.00 0.5
1 utility -22.0
1 11pt_avg 0.8636363636363636
1 binG 0.7827324383928644
1 G 0.7827324383928644
1 ndcg 0.9438661545147249
1 ndcg_rel 0.9371690323796685
1 Rndcg 0.8743380647593371
1 ndcg_cut_5 0.8048099750039491
1 ndcg_cut_10 0.9438661545147249
1 ndcg_cut_15 0.9438661545147249
1 ndcg_cut_20 0.9438661545147249
1 ndcg_cut_30 0.9438661545147249
1 ndcg_cut_100 0.9438661545147249
1 ndcg_cut_200 0.9438661545147249
1 ndcg_cut_500 0.9438661545147249
1 ndcg_cut_1000 0.9438661545147249
1 map_cut_5 0.6875
1 map_cut_10 0.8541666666666666
1 map_cut_15 0.8541666666666666
1 map_cut_20 0.8541666666666666
1 map_cut_30 0.8541666666666666
1 map_cut_100 0.8541666666666666
1 map_cut_200 0.8541666666666666
1 map_cut_500 0.8541666666666666
1 map_cut_1000 0.8541666666666666
1 relative_P_5 0.75
1 relative_P_10 1.0
1 relative_P_15 1.0
1 relative_P_20 1.0
1 relative_P_30 1.0
1 relative_P_100 1.0
1 relative_P_200 1.0
1 relative_P_500 1.0
1 relative_P_1000 1.0
1 success_1 1.0
1 success_5 1.0
1 success_10 1.0
1 set_P 0.13333333333333333
1 set_relative_P 1.0
1 set_recall 1.0
1 set_map 0.13333333333333333
1 set_F 0.23529411764705882
1 num_nonrel_judged_ret 3.0
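
To get a feel for what these numbers mean, a few of them can be recomputed by hand. The output does not list the ranking itself, but relevant documents at ranks 1, 2, 4, and 6 are consistent with the figures above (4 relevant documents, P_5 = 0.6, Rprec = 0.75, recip_rank = 1.0). That ranking is my inference, not something pytrec_eval reports. The sketch below assumes binary relevance labels and a log2(rank + 1) discount for nDCG.

```python
import math

# Assumed ranks of the relevant documents (inferred from the output above)
relevant_ranks = [1, 2, 4, 6]
num_rel = 4


def precision_at(k):
    """Fraction of the top k results that are relevant."""
    return sum(1 for rank in relevant_ranks if rank <= k) / k


def recall_at(k):
    """Fraction of all relevant documents found in the top k."""
    return sum(1 for rank in relevant_ranks if rank <= k) / num_rel


# Average precision: mean of the precision at each relevant document's rank
average_precision = sum(
    (i + 1) / rank for i, rank in enumerate(relevant_ranks)
) / num_rel

# nDCG with binary gain and a log2(rank + 1) discount
dcg = sum(1 / math.log2(rank + 1) for rank in relevant_ranks)
ideal_dcg = sum(1 / math.log2(i + 2) for i in range(num_rel))
ndcg = dcg / ideal_dcg

print(precision_at(5), recall_at(5), average_precision, ndcg)
```

Under these assumptions the values line up with P_5, recall_5, map, and ndcg in the listing.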

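Conclusion

I used trec_eval to output metrics for evaluating the retrieval accuracy of Azure AI Search.
With these, the quality of the search results can be evaluated quantitatively.
That said, building a good gold standard is the really hard part...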
