Implementing and Comparing Hybrid Search and RRF Search with Elasticsearch (Evaluated for Use in RAG)

Elasticsearch

Posted at 2024-12-26

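Abstract

Using Elasticsearch, I implemented hybrid search and RRF (Reciprocal Rank Fusion) search and compared their coverage for natural-language queries. The implementation and characteristics of each approach are described below.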

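Python code for hybrid search

In the hybrid search, a keyword search first narrows down the documents, and the score of each remaining document is then recomputed from its vector similarity to the query.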

from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

# Elasticsearch connection settings
es_host = 'https://localhost:9200'
es_user = 'elastic'
es_password = '[password for the "elastic" user]'
ca_cert_path = '/path where the ca.crt taken from the elasticsearch container is stored/ca.crt'

# Initialize the Elasticsearch client
es_client = Elasticsearch(
    es_host,
    basic_auth=(es_user, es_password),
    verify_certs=True,
    ca_certs=ca_cert_path
)

# Load the embedding model
model = SentenceTransformer('pkshatech/RoSEtta-base-ja', trust_remote_code=True)

def hybrid_search(es_client, query, index_name='knowledge_base'):
    # Vectorize the query
    query_vector = model.encode(query).tolist()
    # Hybrid query: the match clause selects candidates by keyword, and
    # script_score replaces each candidate's score with cosine similarity + 1.0
    search_query = {
        "query": {
            "script_score": {
                "query": {
                    "match": {
                        "content": query
                    }
                },
                "script": {
                    "source": "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
                    "params": {
                        "query_vector": query_vector
                    }
                }
            }
        }
    }
    # Send the search request to Elasticsearch
    response = es_client.search(
        index=index_name,
        query=search_query['query']
    )
    return response

if __name__ == '__main__':
    user_input = input("Enter a search query: ")
    results = hybrid_search(es_client, user_input)
    for hit in results['hits']['hits']:
        print(f"Score: {hit['_score']}, Title: {hit['_source'].get('title', 'No Title')}, Content: {hit['_source']['content']}")
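
The search code assumes an index named `knowledge_base` whose documents have a `content` text field and an `embedding` vector field, but the article does not show how that index was created. The following is a minimal sketch of a mapping that would satisfy the `match` and `cosineSimilarity` calls. The helper name `build_index_mapping`, the field type chosen for `title`, and the value 768 for `dims` are assumptions for illustration; the real dimension should be read from the model, for example with `model.get_sentence_embedding_dimension()`.

```python
def build_index_mapping(dims):
    """Return a mapping with full-text fields and a dense_vector field.

    Hypothetical helper: the field names mirror those used by the search
    code (title, content, embedding); dims must equal the embedding size.
    """
    if dims <= 0:
        raise ValueError("dims must be a positive integer")
    return {
        "properties": {
            "title": {"type": "text"},
            "content": {"type": "text"},
            "embedding": {"type": "dense_vector", "dims": dims},
        }
    }


mapping = build_index_mapping(dims=768)  # 768 is an assumed dimension
print(mapping["properties"]["embedding"])
```

With a live cluster, the result could be passed as `es_client.indices.create(index='knowledge_base', mappings=mapping)`. For Japanese text the `content` field would normally also need a Japanese analyzer (such as the one from the analysis-kuromoji plugin), which is omitted here.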

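Python code for RRF search

RRF search merges multiple result lists based on rank alone: each document receives 1 / (k + rank) from every list in which it appears, and these contributions are summed. An implementation example follows.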

from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

# Elasticsearch connection settings
es_host = 'https://localhost:9200'
es_user = 'elastic'
es_password = '[password for the "elastic" user]'
ca_cert_path = '/path where the ca.crt taken from the elasticsearch container is stored/ca.crt'

# Initialize the Elasticsearch client
es_client = Elasticsearch(
    es_host,
    basic_auth=(es_user, es_password),
    verify_certs=True,
    ca_certs=ca_cert_path
)

# Load the embedding model
model = SentenceTransformer('pkshatech/RoSEtta-base-ja', trust_remote_code=True)

# Function that runs the keyword search
def keyword_search(es_client, query, index_name='knowledge_base'):
    # Keyword search query
    search_query = {
        "query": {
            "match": {
                "content": query
            }
        }
    }
    # Send the search request to Elasticsearch (same call style as hybrid_search)
    response = es_client.search(index=index_name, query=search_query['query'])
    return response['hits']['hits']

# Function that runs the vector search
def vector_search(es_client, query_vector, index_name='knowledge_base'):
    # Vector search query: match_all means every document is scored
    search_query = {
        "query": {
            "script_score": {
                "query": {
                    "match_all": {}
                },
                "script": {
                    "source": "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
                    "params": {
                        "query_vector": query_vector
                    }
                }
            }
        }
    }
    # Send the search request to Elasticsearch (same call style as hybrid_search)
    response = es_client.search(index=index_name, query=search_query['query'])
    return response['hits']['hits']

# Function that computes RRF scores and merges the rankings
def rrf_fusion(keyword_results, vector_results, k=10):
    # k is the RRF smoothing constant; rank is 0-based, so rank + 1 is the 1-based rank
    scores = {}
    content_map = {}

    # Process the keyword search results
    for rank, hit in enumerate(keyword_results):
        doc_id = hit['_id']
        score = 1 / (k + rank + 1)
        scores[doc_id] = scores.get(doc_id, 0) + score
        content_map[doc_id] = hit['_source'].get('content', 'No content')

    # Process the vector search results
    for rank, hit in enumerate(vector_results):
        doc_id = hit['_id']
        score = 1 / (k + rank + 1)
        scores[doc_id] = scores.get(doc_id, 0) + score
        content_map[doc_id] = hit['_source'].get('content', 'No content')

    # Sort by fused score to build the final ranking
    sorted_results = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return [(doc_id, score, content_map[doc_id]) for doc_id, score in sorted_results]

if __name__ == '__main__':
    # Read the user's query
    user_input = input("Enter a search query: ")

    # Vectorize the query
    query_vector = model.encode(user_input).tolist()

    # Run the keyword search
    keyword_results = keyword_search(es_client, user_input)

    # Run the vector search
    vector_results = vector_search(es_client, query_vector)

    # Compute RRF scores and build the merged ranking
    final_results = rrf_fusion(keyword_results, vector_results)

    # Print the results
    for doc_id, score, content in final_results:
        print(f"Doc ID: {doc_id}, RRF Score: {score}")
        print(f"Content: {content}\n")
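
To make the arithmetic of the fusion step concrete, here is a small worked example that needs no Elasticsearch cluster. It applies the same 1 / (k + rank) rule with k = 10 to two toy rankings. The function name `rrf_scores` and the document IDs are made up for illustration; this is a sketch of the technique, not a replacement for the code above.

```python
def rrf_scores(rankings, k=10):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document receives 1 / (k + rank) from every list that contains it,
    where rank is 1-based; the contributions are summed.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


keyword_ranking = ["A", "B", "C"]  # toy keyword-search order
vector_ranking = ["C", "A", "D"]   # toy vector-search order

fused = rrf_scores([keyword_ranking, vector_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
# A: 0.1742
# C: 0.1678
# B: 0.0833
# D: 0.0769
```

Document A wins because it is near the top of both lists (1/11 + 1/12), while B and D each appear in only one list and so receive a single contribution (1/12 and 1/13 respectively).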

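Comparison of hybrid search and RRF search

Characteristics

Characteristic	Hybrid search	RRF search
Score calculation	Keyword search narrows the candidates and vector similarity determines the final score (the code above discards the keyword score)	Score computed from rank positions
Coverage	May miss documents because the keyword search filters first	High coverage because both result sets are merged
Accuracy	High accuracy is achievable depending on score weighting	Ignores score differences within each search method, so it is less suited to specialized use cases
Compute cost	Similarity is scored per document on the filtered set, so cost grows with the number of keyword matches	Rank fusion itself is lightweight, but the implementation above issues two queries, and its vector query scores every document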
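
The hybrid query shown earlier scores documents by cosine similarity alone, so the keyword score does not contribute to the final ranking. If one wanted the score weighting mentioned in the table, one possible sketch is to blend the keyword `_score` with the similarity inside the script. The function name `build_weighted_hybrid_query` and the parameter `alpha` are hypothetical. Because BM25 scores are unbounded while the similarity term lies in [0, 2], a suitable `alpha` would have to be found empirically.

```python
def build_weighted_hybrid_query(query_text, query_vector, alpha=0.5):
    """Build a script_score query that blends keyword and vector scores.

    alpha weights the keyword (_score) term and (1 - alpha) weights the
    cosine-similarity term. Hypothetical helper, not from the article.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return {
        "script_score": {
            "query": {"match": {"content": query_text}},
            "script": {
                "source": (
                    "params.alpha * _score + (1.0 - params.alpha) * "
                    "(cosineSimilarity(params.query_vector, 'embedding') + 1.0)"
                ),
                "params": {"alpha": alpha, "query_vector": query_vector},
            },
        }
    }


query = build_weighted_hybrid_query("sample query", [0.1, 0.2, 0.3], alpha=0.3)
print(query["script_score"]["script"]["source"])
```

The returned dict could be passed as `es_client.search(index='knowledge_base', query=query)`. Keeping `alpha` within [0, 1] also keeps the script result non-negative, which `script_score` requires.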

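Coverage when used for RAG (Retrieval-Augmented Generation)

Hybrid search:
- Because vector search is applied after the keyword search narrows the candidates, it is good at capturing semantic relevance among those candidates.
- However, documents that do not match the keyword search are excluded, so relevant documents can be missed.
RRF search:
- Because keyword and vector results are merged on an equal footing, coverage is high.
- It is particularly advantageous for natural-language queries such as questions, since partial keyword matches and semantic relevance are taken into account at the same time.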
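
The coverage argument above can be checked on a labelled query set. The article does not describe how coverage was measured, so the following is only one way to quantify it: recall, the fraction of known-relevant documents that a method returns. The helper name `recall_at_k` and the document IDs are made up for illustration; the toy data simply mimics a case where one relevant document shares no keywords with the query.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant documents found in the top-k retrieved IDs."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    found = relevant.intersection(retrieved_ids[:k])
    return len(found) / len(relevant)


# Made-up IDs: doc3 is relevant but contains none of the query keywords.
relevant = ["doc1", "doc3"]
hybrid_ids = ["doc1", "doc2"]       # the keyword filter never sees doc3
rrf_ids = ["doc1", "doc3", "doc2"]  # the vector results contribute doc3

hybrid_recall = recall_at_k(hybrid_ids, relevant, k=3)
rrf_recall = recall_at_k(rrf_ids, relevant, k=3)
print(hybrid_recall)  # 0.5
print(rrf_recall)     # 1.0
```

In practice `retrieved_ids` would come from the `_id` values returned by each search method, averaged over many queries.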

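Conclusion

If coverage is the priority for natural-language queries, RRF search is the better fit. If, on the other hand, the goal is high precision through carefully tuned score weights, hybrid search is the stronger choice. It is best to choose between them according to the system's requirements.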

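Thoughts

There are several ways to retrieve text chunks for RAG: chunking and vectorizing the text in advance and extracting the chunks that look most relevant, as above, or, as a more advanced approach, langgraph has also been proposed (although it does not yet seem to handle Japanese text).
That said, trying this out reminded me that RAG feels like little more than a stopgap for the fact that an LLM's context window is too small to take in the full text you want it to reference. With a context window as vast as Gemini's, I find myself wondering whether there is still any point in vectorizing documents for RAG.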
