Databricks Advent Calendar 2024

Hybrid Search and Filtering with Vector Search

Posted at 2024-12-20

Mosaic AI Vector Search is a vector database built into the Databricks Data Intelligence Platform.

Hybrid search

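Databricks Vector Search supports two query types: vector-based embedding search (set the query_type parameter to ann; this is the default) and hybrid search, which combines vector-based embedding search with traditional keyword-based search (set the query_type parameter to hybrid).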

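Vector-based embedding search uses the HNSW (Hierarchical Navigable Small World) algorithm for approximate nearest-neighbor lookup and the L2 distance metric to measure the similarity of embedding vectors. (Note: a query returns at most 10,000 rows.)
Hybrid search retrieves not only documents that exactly match words in the query but also documents that are conceptually similar, which gives more comprehensive and relevant results. See the documentation for how similarity is calculated. Every String-type column in the index is subject to keyword search, and at present all string columns are weighted equally. (Note: at the time of writing, a query returns at most 50 rows.)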
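Since embedding search ranks by L2 distance, it can help to see how that relates to cosine similarity. The following is a minimal, standalone sketch (plain Python, no Databricks calls; the vectors and document names are made up for illustration). It shows that when vectors are normalized to unit length, ranking by ascending L2 distance gives the same order as ranking by descending cosine similarity.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.hypot(*v)
    return [x / norm for x in v]


# Illustrative vectors only; real embeddings have hundreds of dimensions.
query = normalize([1.0, 2.0, 0.5])
docs = {
    "doc_a": normalize([1.0, 2.1, 0.4]),
    "doc_b": normalize([-1.0, 0.5, 2.0]),
    "doc_c": normalize([0.2, 1.0, 1.0]),
}

# Smaller L2 distance means more similar; larger cosine means more similar.
by_l2 = sorted(docs, key=lambda name: l2_distance(query, docs[name]))
by_cosine = sorted(docs, key=lambda name: -cosine_similarity(query, docs[name]))

print(by_l2)
print(by_cosine)
```

For unit vectors the squared L2 distance equals 2 - 2 * cosine, which is why the two orderings agree. If you prefer cosine-style ranking, normalizing embeddings before indexing is one way to get it.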

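from databricks.vector_search.client import VectorSearchClient

# Set VECTOR_SEARCH_ENDPOINT_NAME and index_name to your own endpoint and index names
vsc = VectorSearchClient()

# Delta Sync Index with embeddings computed by Databricks
results = vsc.get_index(VECTOR_SEARCH_ENDPOINT_NAME, index_name).similarity_search(
    query_text="Greek myths",
    columns=["id", "text"],
    num_results=2
    )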

# Delta Sync Index using hybrid search, with embeddings computed by Databricks
results3 = vsc.get_index(VECTOR_SEARCH_ENDPOINT_NAME, index_name).similarity_search(
    query_text="Greek myths",
    columns=["id", "text"],
    num_results=2,
    query_type="hybrid"
    )
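To work with what similarity_search returns, a small helper can turn the response into a list of dictionaries. This is a sketch under assumptions: it assumes the response is a dictionary with column names under `manifest.columns` and rows under `result.data_array`, with a trailing score column. The helper name `rows_as_dicts` and the sample response are hypothetical and hand-written for illustration, not real query output.

```python
def rows_as_dicts(response):
    """Pair each row in `data_array` with the column names from the manifest.

    Assumes the response shape described above; adjust if your client
    version returns something different.
    """
    column_names = [col["name"] for col in response["manifest"]["columns"]]
    rows = response["result"].get("data_array", [])
    return [dict(zip(column_names, row)) for row in rows]


# Hand-written sample in the assumed shape (values are made up).
sample_response = {
    "manifest": {
        "column_count": 3,
        "columns": [{"name": "id"}, {"name": "text"}, {"name": "score"}],
    },
    "result": {
        "row_count": 2,
        "data_array": [
            [1, "Athena, goddess of wisdom", 0.62],
            [2, "Ares, god of war", 0.58],
        ],
    },
}

for row in rows_as_dicts(sample_response):
    print(row["id"], row["score"], row["text"])
```

With a real index you would pass `results` or `results3` from the snippets above instead of `sample_response`.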

Filter

A query can specify filters based on any column in the index table. similarity_search returns only the rows that match the specified filters. The supported filter operators are listed in the documentation.

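Note: the behavior of LIKE is currently somewhat unintuitive. It is being improved to be easier to understand, but for now it does not do SQL-style partial matching; it only supports "token" matching (a token is roughly equivalent to a word).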
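The difference between SQL-style partial matching and token matching can be sketched locally. This is only a rough mental model: it assumes tokens are produced by splitting on whitespace and compared case-sensitively, which is a simplification and not the service's actual tokenizer. Both function names are hypothetical.

```python
def substring_match(value, pattern):
    """SQL-like partial match, similar to `value LIKE '%pattern%'`."""
    return pattern in value


def token_match(value, pattern):
    """Token match: the pattern must equal a whole token.

    Tokenization here is a plain whitespace split (an assumption).
    """
    return pattern in value.split()


title = "Hercules and the Hydra"

# A partial word matches as a substring but not as a token.
print(substring_match(title, "Herc"), token_match(title, "Herc"))

# A whole word matches under both interpretations.
print(substring_match(title, "Hercules"), token_match(title, "Hercules"))
```

In other words, a LIKE filter on a word fragment such as "Herc" should not be expected to match "Hercules" under the current behavior.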

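# Get a handle to the index (same endpoint and index names as above)
index = vsc.get_index(VECTOR_SEARCH_ENDPOINT_NAME, index_name)

# Match rows where `title` exactly matches `Athena` or `Ares`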
results = index.similarity_search(
    query_text="Greek myths",
    columns=["id", "text"],
    filters={"title": ["Ares", "Athena"]},
    num_results=2
    )

# Match rows where `title` or `id` exactly matches `Athena` or `Ares`
results = index.similarity_search(
    query_text="Greek myths",
    columns=["id", "text"],
    filters={"title OR id": ["Ares", "Athena"]},
    num_results=2
    )

# Match only rows where `title` is not `Hercules`
results = index.similarity_search(
    query_text="Greek myths",
    columns=["id", "text"],
    filters={"title NOT": "Hercules"},
    num_results=2
    )
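To make the three filter shapes above concrete, here is a small local emulation that applies the same dictionaries to an in-memory list of rows. It is a mental model only, not how the service evaluates filters: the function `matches` is hypothetical, it covers only the list, `OR`, and `NOT` forms shown above, and it assumes that multiple entries in one filter dictionary are combined with AND. The sample rows are made up.

```python
def matches(row, filters):
    """Return True if `row` satisfies every entry in `filters`.

    Supports only the shapes used above: {"col": value_or_list},
    {"col1 OR col2": value_or_list}, and {"col NOT": value_or_list}.
    Entries are ANDed together (an assumption).
    """
    for key, expected in filters.items():
        values = expected if isinstance(expected, list) else [expected]
        if key.endswith(" NOT"):
            column = key[: -len(" NOT")]
            if row[column] in values:
                return False
        elif " OR " in key:
            columns = key.split(" OR ")
            if not any(row[c] in values for c in columns):
                return False
        elif row[key] not in values:
            return False
    return True


# Made-up rows for illustration.
rows = [
    {"id": "1", "title": "Athena"},
    {"id": "2", "title": "Ares"},
    {"id": "3", "title": "Hercules"},
    {"id": "Ares", "title": "Zeus"},
]


def select_ids(filters):
    """Ids of the sample rows that pass the given filters."""
    return [r["id"] for r in rows if matches(r, filters)]


print(select_ids({"title": ["Ares", "Athena"]}))
print(select_ids({"title OR id": ["Ares", "Athena"]}))
print(select_ids({"title NOT": "Hercules"}))
```

Note how the `OR` form also picks up the row whose `id` is "Ares" even though its title is "Zeus", while the plain `title` filter does not.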

Hybrid search and filters can also be used together.

# Hybrid search restricted to rows where `title` or `id` exactly matches `Athena` or `Ares`
results = index.similarity_search(
    query_text="Greek myths",
    columns=["id", "text"],
    filters={"title OR id": ["Ares", "Athena"]},
    num_results=2,
    query_type="hybrid"
    )
