Examining fusion methods for hybrid search and implementing them in Azure AI Search

Posted at 2024-08-01, last updated at 2024-08-01

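A detailed analysis paper on hybrid search from the Pinecone team. It compares two methods for fusing lexical and semantic search in hybrid retrieval, convex combination (CC) and reciprocal rank fusion (RRF), and analyzes in detail how each performs under different conditions.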
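For reference, RRF scores a document by summing 1 / (η + rank) over every ranked list it appears in. Below is a minimal sketch, assuming ranks start at 1 and the commonly used constant η = 60. The function name `rrf_fuse` and the toy rankings are hypothetical and are not taken from the paper.

```python
def rrf_fuse(rankings, eta=60):
    """Fuse ranked lists of doc ids with reciprocal rank fusion.

    rankings: list of lists, each ordered from best to worst.
    eta: smoothing constant (60 is a common default, assumed here).
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (eta + rank)
    # Sort by fused score, best first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy rankings (made up for illustration)
lexical_ranking = ["d1", "d2", "d3"]
semantic_ranking = ["d2", "d3", "d4"]
fused = rrf_fuse([lexical_ranking, semantic_ranking])
print([doc_id for doc_id, _ in fused])
```

Note that RRF only looks at ranks, so the raw score values never enter the fusion, whereas the convex combination uses the scores themselves.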

Datasets

MS MARCO, Natural Questions, Quora, NFCorpus, HotpotQA, Fever, SciFact, DBPedia, FiQA

Models

Lexical: BM25 (k1=0.9, b=0.4), topk=1,000
Semantic: the all-MiniLM-L6-v2 model, which converts queries and documents into 384-dimensional vectors, IndexFlatIP, topk=1,000
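To make the roles of k1 and b concrete, here is a minimal pure-Python sketch of BM25 scoring. It assumes whitespace tokenization and a Lucene-style idf; the paper's exact BM25 implementation is not described in this article, so treat both as assumptions. The function name `bm25_scores` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=0.9, b=0.4):
    """Score every tokenized document against the query with BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # Number of documents that contain each term
    doc_freq = Counter(term for doc in docs_tokens for term in set(doc))

    scores = []
    for doc in docs_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            n_t = doc_freq[term]
            # Lucene-style idf (an assumption), always non-negative
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            tf = term_freq[term]
            # b controls length normalization, k1 controls tf saturation
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (made up for illustration)
docs = [
    "hybrid search fuses lexical and semantic scores".split(),
    "semantic search uses dense vectors".split(),
    "bm25 is a lexical ranking function".split(),
]
scores = bm25_scores("lexical search".split(), docs)
print(scores)
```

With this idf variant a document that shares no term with the query scores exactly 0, which is the theoretical minimum used later for normalization.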

Score fusion methods

Before normalization:

f_{\text{Convex}}(q, d) = \alpha \cdot f_{\text{sem}}(q, d) + (1 - \alpha) \cdot f_{\text{lex}}(q, d)

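Fusing the raw scores as they are is problematic because their ranges and distributions differ, so each score is normalized first.

After normalization: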

f_{\text{Convex}}(q, d) = \alpha \cdot \phi_{sem}(f_{\text{sem}}(q, d)) + (1 - \alpha) \cdot \phi_{lex}(f_{\text{lex}}(q, d))

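$\alpha$ is a weighting parameter, set for example to 0.8.
$\phi_{sem}(f_{\text{sem}}(q, d))$ is the normalized semantic score $f_{\text{sem}}(q, d)$.
$\phi_{lex}(f_{\text{lex}}(q, d))$ is the normalized lexical score $f_{\text{lex}}(q, d)$.
Here $0 \le \alpha \le 1$. In other words, the expression equals the normalized semantic score when $\alpha = 1$ and the normalized lexical score when $\alpha = 0$.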
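The effect of normalization on the convex combination can be seen with a small example. This is a minimal sketch with made-up scores; the helper names `convex_combination` and `top_doc` are hypothetical, and documents missing from one list are assumed to score 0 there.

```python
def convex_combination(sem_scores, lex_scores, alpha):
    """Return alpha * semantic + (1 - alpha) * lexical for each doc id."""
    doc_ids = set(sem_scores) | set(lex_scores)
    return {
        doc_id: alpha * sem_scores.get(doc_id, 0.0)
        + (1 - alpha) * lex_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }


def top_doc(scores):
    """Return the doc id with the highest fused score."""
    return max(scores, key=scores.get)


# Toy raw scores: cosine similarity stays in [-1, 1], BM25 is unbounded above
sem_raw = {"a": 0.90, "b": 0.20}
lex_raw = {"a": 2.0, "b": 14.0}

# The same scores after theoretical min-max normalization (rounded)
sem_norm = {"a": 1.00, "b": 0.63}
lex_norm = {"a": 0.14, "b": 1.00}

alpha = 0.8
print(top_doc(convex_combination(sem_raw, lex_raw, alpha)))    # BM25 dominates
print(top_doc(convex_combination(sem_norm, lex_norm, alpha)))  # alpha takes effect
```

Even with 80% of the weight on the semantic side, the raw fusion is decided by the much larger BM25 values, while the normalized fusion follows the intended weighting.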

Normalization method

TM2C2, the convex combination of theoretically min-max normalized scores, uses theoretical min-max normalization.

\phi_{tmm}(f_o(q, d)) = \frac{f_o(q, d) - \inf f_o(q, \cdot)}{M_q - \inf f_o(q, \cdot)}

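Here, $\inf f_o(q, \cdot)$ is the theoretical minimum and $M_q$ is the maximum score. The paper also compares Min-Max normalization and Z-Score normalization.

I see, so this kind of normalization method exists.

$f_o(q, d)$ is the score for query $q$ and document $d$
$\inf f_o(q, \cdot)$ is the theoretical minimum (for example, 0 for BM25 and -1 for cosine similarity)
$M_q$ is the maximum score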
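The three normalizations can be compared side by side on toy scores. This is a minimal sketch: the function names are hypothetical, the z-score uses the population standard deviation as an assumption, and the inputs are assumed to contain at least two distinct scores so that no division by zero occurs.

```python
import statistics


def min_max(scores):
    """Min-max: uses the observed minimum and maximum of the retrieved scores."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def z_score(scores):
    """Z-score: centers on the mean and scales by the standard deviation."""
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores)  # population std, an assumption
    return [(s - mean) / std for s in scores]


def theoretical_min_max(scores, theoretical_min):
    """TMM: replaces the observed minimum with the theoretical infimum."""
    m_q = max(scores)
    return [(s - theoretical_min) / (m_q - theoretical_min) for s in scores]


# Toy retrieved scores for one query (made up for illustration)
bm25 = [12.0, 9.0, 6.0]    # theoretical minimum 0
cosine = [0.8, 0.5, 0.2]   # theoretical minimum -1

print(min_max(bm25))
print(z_score(bm25))
print(theoretical_min_max(bm25, 0.0))
print(theoretical_min_max(cosine, -1.0))
```

Min-max always maps the lowest retrieved score to 0, whereas the theoretical variant keeps it above 0 whenever it is above the theoretical minimum, so the result depends less on which documents happened to be retrieved.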

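Figure 3. Effect of normalization on the performance of $f_{Convex}$ (as a function of $\alpha$ on the validation sets).

NDCG@1000 tends to be highest in the range $\alpha = 0.6 \sim 0.8$.

→ Fusing at a ratio of 80% semantic score and 20% BM25 score appears to give the best accuracy.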

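The most effective score fusion method differs by dataset, but overall TM2C2 ($\alpha = 0.8$) scores high.

Note: the training data of the all-MiniLM-L6-v2 model used for semantic search includes MS MARCO, NQ, and Quora, so these are treated as in-domain. The other datasets are treated as out-of-domain.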

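When fusing scores, changing the normalization method and the weight $\alpha$ may improve accuracy. Judging from Figure 3, I would like to observe how accuracy changes while varying $\alpha$.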
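Such a sweep over $\alpha$ could look roughly like the sketch below. It uses made-up, already-normalized scores and relevance labels for a single query and NDCG@3 instead of the paper's NDCG@1000, so it only illustrates the procedure. The helper names `ndcg_at_k` and `rank_by_fusion` are hypothetical.

```python
import math


def ndcg_at_k(ranked_ids, relevance, k):
    """NDCG@k with gains taken directly from the relevance labels."""
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
    )
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


def rank_by_fusion(sem, lex, alpha):
    """Rank doc ids by alpha * sem + (1 - alpha) * lex (scores already normalized)."""
    doc_ids = set(sem) | set(lex)
    fused = {d: alpha * sem.get(d, 0.0) + (1 - alpha) * lex.get(d, 0.0) for d in doc_ids}
    # Sort by fused score (descending), breaking ties by doc id for determinism
    return sorted(fused, key=lambda d: (-fused[d], d))


# Toy normalized scores and relevance labels for one query (made up)
sem = {"a": 1.0, "b": 0.7, "c": 0.4}
lex = {"b": 1.0, "c": 0.9, "d": 0.5}
relevance = {"a": 1, "b": 2}

for step in range(0, 11, 2):
    alpha = step / 10
    ranking = rank_by_fusion(sem, lex, alpha)
    print(f"alpha={alpha:.1f}  NDCG@3={ndcg_at_k(ranking, relevance, 3):.3f}")
```

In practice the metric would be averaged over all queries of a validation set, and the $\alpha$ with the best average would be kept.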

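Implementation in Azure AI Search

Azure AI Search does not allow setting weights on the search scores of keyword search (lexical in the paper) and vector search (semantic in the paper); at present weighting is implemented only for multi-vector search. I therefore compute TM2C2 myself and rank by it.

Asynchronous multiple searches

The multiple search requests are issued asynchronously to speed things up.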

from azure.core.credentials import AzureKeyCredential
from azure.search.documents.aio import SearchClient
from azure.search.documents.models import VectorizedQuery
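import asyncio

# search_api_key, endpoint, index_name, vector_field, and generate_embeddings()
# are assumed to be defined elsewhere (they are not shown in this article).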

async def execute_search(search_client, query=None, vector_query=None, top=10):
    search_params = {
        "search_text": query,
        "vector_queries": [vector_query] if vector_query else None,
        "search_fields": ["text"],
        "select": ["text", "docid"],
        "top": top
    }
    
    results = await search_client.search(**search_params)

    scores = {}
    
    async for result in results:
        scores[result['docid']] = result['@search.score']

    return scores

async def perform_search(query: str, top: int):
    credential = AzureKeyCredential(search_api_key)
    search_client = SearchClient(endpoint=endpoint, index_name=index_name, credential=credential)

    async with search_client:
        # Create the task for semantic search
        vector_query = VectorizedQuery(vector=generate_embeddings(query), k_nearest_neighbors=top, fields=vector_field)
        semantic_task = execute_search(search_client, vector_query=vector_query, top=top)

        # Create the task for lexical search
        lexical_task = execute_search(search_client, query=query, top=top)

        # Run both searches concurrently and wait for the results
        semantic_scores, lexical_scores = await asyncio.gather(semantic_task, lexical_task)

    return {
        "semantic_scores": semantic_scores,
        "lexical_scores": lexical_scores
    }

k_nearest_neighbors=top also needs to be tuned separately.

Computing TM2C2

async def search_tm2c2(query, top):
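    # Run the searches
    search_results = await perform_search(query, top)
    semantic_scores = search_results['semantic_scores']
    lexical_scores = search_results['lexical_scores']

    # perform_search always returns a dict, so check the score lists themselves
    if not semantic_scores and not lexical_scores:
        print("No results found.")
        return set()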

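    # Score normalization (theoretical min-max normalization).
    # inf is the theoretical minimum; sup is the largest retrieved score.
    def tmm_normalize(scores, inf):
        sup = max(scores.values(), default=inf)
        if sup <= inf:
            # Empty result list or zero score range: avoid dividing by zero
            return {doc_id: 0.0 for doc_id in scores}
        return {doc_id: (score - inf) / (sup - inf) for doc_id, score in scores.items()}

    # Normalize the scores. inf=-1 assumes raw cosine similarity; if the service
    # returns a transformed vector score, use that score's theoretical minimum.
    normalized_lexical_scores = tmm_normalize(lexical_scores, inf=0)
    normalized_semantic_scores = tmm_normalize(semantic_scores, inf=-1)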

    # Compute the TM2C2 score
    alpha = 0.8
    tm2c2_scores = {}
    for doc_id in set(normalized_semantic_scores.keys()).union(normalized_lexical_scores.keys()):
        semantic_score = normalized_semantic_scores.get(doc_id, 0)
        lexical_score = normalized_lexical_scores.get(doc_id, 0)
        tm2c2_score = alpha * semantic_score + (1 - alpha) * lexical_score
        tm2c2_scores[doc_id] = tm2c2_score
        
    search_result_ids = set()
    # Display the results
    sorted_results = sorted(tm2c2_scores.items(), key=lambda item: item[1], reverse=True)[:top]
    for doc_id, score in sorted_results:
        print(f"Document ID: {doc_id}, TM2C2 Score: {score}")
        search_result_ids.add(doc_id)
        
    return search_result_ids

Note: I withdrew the accuracy evaluation because it contained mistakes.

