Can a Vector Search Index Be Created Directly on a Databricks Materialized View?

Posted at 2024-05-25

Conclusion

As of May 2024, it does not appear to work.

That is only as far as I was able to test.
Please note that this is not an official statement.

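Background

Databricks lets you create an index for Databricks Vector Search from an existing Delta Table.

In practice, the source table is usually the output of some ETL/ELT process. Databricks in particular offers Delta Live Tables (DLT), a managed data transformation framework, so tables are often built from the results of DLT processing.

However, when you create and run a Unity Catalog-enabled pipeline in DLT (currently in preview), the data objects it produces are Materialized Views or Streaming Tables rather than ordinary tables.

That made me wonder whether an index can be created directly from a Materialized View. I could not tell from the documentation alone, so I tried it.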

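Verification code

I download a QA dataset (yulanfmy/databricks-qa-ja on Hugging Face) and test whether an index can be created from it.

First, store the data in Unity Catalog.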

%pip install datasets

from datasets import load_dataset
import pyspark.sql.functions as F
from pyspark.sql import Window

def load_databricks_qa_jp_dataset():
    # Load the dataset from Hugging Face
    dataset = load_dataset("yulanfmy/databricks-qa-ja")

    df = spark.createDataFrame(dataset['train'])
    # Add an ID column to use as the primary key
    df = df.withColumn("id", F.row_number().over(Window.orderBy("source")))
    return df

# Load and save. The catalog and schema are assumed to already exist.
CATALOG = "training"
SCHEMA = "databricks_qa_jp"

df = load_databricks_qa_jp_dataset()
df.write.mode("overwrite").option("mergeSchema", True).saveAsTable(f"{CATALOG}.{SCHEMA}.databricks_qa_jp_raw")

Create a Materialized View (MV) from the saved table.
An index cannot be created unless Change Data Feed is enabled, so I enable it here.

USE CATALOG training;
USE SCHEMA databricks_qa_jp;

CREATE MATERIALIZED VIEW IF NOT EXISTS databricks_qa_jp_mv TBLPROPERTIES (delta.enableChangeDataFeed = true) AS
SELECT
  *
FROM
  databricks_qa_jp_raw;

Next, try to create a Vector Search Index from the MV.

First, create a Vector Search Endpoint.

%pip install --upgrade --force-reinstall databricks-vectorsearch

dbutils.library.restartPython()

from databricks.vector_search.client import VectorSearchClient

vector_search_endpoint_name = "exsample_vector_search_endpoint"
source_table_name = "training.databricks_qa_jp.databricks_qa_jp_mv"
index_name = "training.databricks_qa_jp.databricks_qa_jp_index"

embedding_endpoint_name = "embedding_bge_m3_endpoint"

client = VectorSearchClient()

client.create_endpoint(
    name=vector_search_endpoint_name,
    endpoint_type="STANDARD"
)

Once the endpoint has finished provisioning, create the index.
(An embedding model serving endpoint was prepared in advance.)
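Waiting for provisioning to finish can be scripted instead of checked by hand. The following is a minimal sketch of a generic polling helper, not part of the original experiment. The helper name `wait_for_state`, its parameters, and the default target state "ONLINE" are assumptions made for illustration. How the state is read from the Vector Search client is also an assumption and appears only in a comment.

```python
import time
from typing import Callable


def wait_for_state(
    get_state: Callable[[], str],
    target: str = "ONLINE",
    timeout_s: float = 1200.0,
    interval_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_state() until it returns `target` or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if state == target:
            return state
        if clock() >= deadline:
            raise TimeoutError(f"state is still {state!r} after {timeout_s} seconds")
        sleep(interval_s)


# Hypothetical usage (assumes get_endpoint returns a dict shaped like
# {"endpoint_status": {"state": "ONLINE"}}; verify against your client version):
# wait_for_state(
#     lambda: client.get_endpoint(vector_search_endpoint_name)["endpoint_status"]["state"]
# )
```

Passing the state lookup as a callable keeps the helper independent of any particular client version, so only the one-line lambda needs to change if the response shape differs.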

index = client.create_delta_sync_index(
  endpoint_name=vector_search_endpoint_name,
  source_table_name=source_table_name,
  index_name=index_name,
  pipeline_type='TRIGGERED',
  primary_key="id",
  embedding_dimension=1024,
  embedding_source_column="instruction",
  embedding_model_endpoint_name=embedding_endpoint_name,
)

The result was an error.
(The message does not say that the MV itself is the problem, so something else may be causing it.)

Output

Exception: Response content b'{"message":"The service at /api/2.0/vector-search/endpoints/exsample_vector_search_endpoint/indexes is taking too long to process your request. Please try again later or try a faster operation. [TraceId: 00-39f06994fd4239ee862e5e5fd756fd28-73f9d400bdb3a500-00]","error_code":"TEMPORARILY_UNAVAILABLE"}\n', status_code 504
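To log or branch on this kind of failure, the status code and error code can be pulled out of the exception text. The sketch below assumes the message always has the shape shown above (`Response content b'...', status_code NNN`); that format is inferred from this single output and is not a documented contract. The function name `parse_vector_search_error` is hypothetical, and the sample string is a shortened stand-in for the real message.

```python
import ast
import json
import re

# Matches the bytes literal and the HTTP status code in the exception text.
_ERROR_PATTERN = re.compile(r"Response content (b'.*'), status_code (\d+)", re.DOTALL)


def parse_vector_search_error(message: str) -> dict:
    """Extract status_code, error_code and message from the exception text."""
    match = _ERROR_PATTERN.search(message)
    if match is None:
        raise ValueError("unrecognized error message format")
    # The body is printed as a Python bytes literal, so evaluate it safely first.
    body = ast.literal_eval(match.group(1))
    payload = json.loads(body.decode("utf-8"))
    return {
        "status_code": int(match.group(2)),
        "error_code": payload.get("error_code"),
        "message": payload.get("message"),
    }


# Shortened sample in the same shape as the output above.
SAMPLE = r"""Response content b'{"message":"taking too long","error_code":"TEMPORARILY_UNAVAILABLE"}\n', status_code 504"""

if __name__ == "__main__":
    print(parse_vector_search_error(SAMPLE))
```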

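I tried the equivalent operation from the UI, and it failed as well.
Incidentally, when I specified the table that the MV reads from as the source, the index was created without any problem.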
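Because the 504 response asks the caller to try again later, one way to tell a transient timeout from a real limitation is to retry a few times before giving up. The following is a minimal sketch of such a retry wrapper and was not part of the original test. The function name, the attempt count, and the delays are assumptions, and it detects the transient case only by looking for the string TEMPORARILY_UNAVAILABLE in the exception text.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_on_temporarily_unavailable(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation(), retrying with exponential backoff on transient errors."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            transient = "TEMPORARILY_UNAVAILABLE" in str(exc)
            # Re-raise anything non-transient, or the final transient failure.
            if not transient or attempt == max_attempts:
                raise
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise RuntimeError("unreachable")


# Hypothetical usage, wrapping the index creation call from this article:
# index = retry_on_temporarily_unavailable(
#     lambda: client.create_delta_sync_index(...)  # same arguments as above
# )
```

If every attempt still fails with the same error while the base table succeeds on the first try, that supports the reading that the MV source is what the service cannot handle.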

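Summary

For integration with Delta Live Tables, the ideal would be for a DLT pipeline to go all the way through to index creation, but that looks difficult at the moment.
Unity Catalog-enabled DLT pipelines are themselves still in preview and come with many restrictions, so I hope this will continue to improve.
(The fact that I love DLT is one reason for that hope.)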
