Trying Hybrid Search in Databricks Mosaic AI Vector Search

Posted at 2024-06-10

Others have already tried this out.

Incidentally, the official name is currently Mosaic AI Vector Search. All LLM/RAG-related features are now positioned under the Mosaic AI umbrella.

I'll try it against a Vector Index that was built previously.

First, upgrade the databricks-vectorsearch library. Without the upgrade, the hybrid search argument is not accepted. Confirm that the installed version is 0.38.

%pip install --upgrade databricks-vectorsearch
dbutils.library.restartPython()
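To confirm the version after the restart, a small check like the following can help. This is a minimal sketch using only the standard library; the helper names `installed_version` and `meets_minimum` are hypothetical, and treating 0.38 as the minimum is an assumption based on the version mentioned above.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare the leading numeric components of two version strings."""
    def parse(v: str):
        # Keep at most the first three numeric components, e.g. "0.38.1" -> (0, 38, 1)
        return tuple(int(n) for n in re.findall(r"\d+", v)[:3])
    return parse(installed) >= parse(minimum)


found = installed_version("databricks-vectorsearch")
if found is None:
    print("databricks-vectorsearch is not installed")
else:
    # 0.38 is the version named in this post; the true minimum may differ (assumption).
    print(found, "OK" if meets_minimum(found, "0.38") else "older than 0.38")
```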

Get the Vector Search Index.

from databricks.vector_search.client import VectorSearchClient
import os

VECTOR_SEARCH_ENDPOINT_NAME="one-env-shared-endpoint-8"
catalog = "takaakiyayoi_catalog"
dbName = db = "rag_chatbot"

index_name=f"{catalog}.{db}.databricks_documentation_vs_index"
host = "https://" + spark.conf.get("spark.databricks.workspaceUrl")
os.environ['DATABRICKS_TOKEN'] = dbutils.secrets.get("demo-token-takaaki.yayoi", "rag_sp_token")
os.environ["DATABRICKS_HOST"] = host

# Get the vector search index
vsc = VectorSearchClient(workspace_url=host, personal_access_token=os.environ["DATABRICKS_TOKEN"])
vs_index = vsc.get_index(
    endpoint_name=VECTOR_SEARCH_ENDPOINT_NAME,
    index_name=index_name
)

First, check the default behavior. As the documentation states, the default query type is "ann" (approximate nearest neighbor).

question = "Deltaログの保持期間は？"  # "What is the retention period for Delta logs?"

results = vs_index.similarity_search(
  query_text=question,
  columns=["url", "content"],
  num_results=5,
  query_type="ann")
docs = results.get('result', {}).get('data_array', [])
display(docs)
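The rows in `data_array` are plain lists, so it can be easier to read them as dictionaries keyed by column name. The sketch below assumes the response also carries a `manifest` entry listing the column names (with a trailing `score` column); that shape is an assumption, and `rows_as_dicts` and the sample data are hypothetical.

```python
def rows_as_dicts(response: dict) -> list:
    """Pair each row of data_array with the column names listed in the manifest."""
    columns = [c["name"] for c in response.get("manifest", {}).get("columns", [])]
    rows = response.get("result", {}).get("data_array", [])
    return [dict(zip(columns, row)) for row in rows]


# Hypothetical response with the assumed shape, for illustration only.
sample = {
    "manifest": {"columns": [{"name": "url"}, {"name": "content"}, {"name": "score"}]},
    "result": {
        "data_array": [
            ["https://example.com/a", "text a", 0.71],
            ["https://example.com/b", "text b", 0.64],
        ]
    },
}

for row in rows_as_dicts(sample):
    print(row["score"], row["url"])
```

In the notebook, passing `results` instead of `sample` would give the same view of the actual search output, provided the response has this shape.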

Hmm, okay. Nothing remarkable here.

Now turn on hybrid search.

results = vs_index.similarity_search(
  query_text=question,
  columns=["url", "content"],
  num_results=5,
  query_type="hybrid")
docs = results.get('result', {}).get('data_array', [])
display(docs)

As expected, the hybrid results subjectively look more accurate.
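To make the comparison a little less subjective, the two result lists can be compared by URL. This is only a sketch: it assumes the rows from each query were kept under separate names (the hypothetical `docs_ann` and `docs_hybrid`, since the cells above overwrite `docs`) and that the URL is the first column, as in `columns=["url", "content"]`.

```python
def compare_urls(docs_ann: list, docs_hybrid: list, url_index: int = 0) -> dict:
    """Report which URLs appear in both result lists and which appear in only one."""
    ann = [row[url_index] for row in docs_ann]
    hybrid = [row[url_index] for row in docs_hybrid]
    return {
        "both": [u for u in ann if u in hybrid],
        "ann_only": [u for u in ann if u not in hybrid],
        "hybrid_only": [u for u in hybrid if u not in ann],
    }


# Hypothetical rows standing in for the two data_array results.
docs_ann = [["https://example.com/a", "..."], ["https://example.com/b", "..."]]
docs_hybrid = [["https://example.com/b", "..."], ["https://example.com/c", "..."]]

print(compare_urls(docs_ann, docs_hybrid))
```

This only shows where the two result sets differ, not which one is better. Judging that still requires knowing which documents actually answer the question.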

Next, I'd like to try evaluating the retriever's accuracy as well.

