Getting a Feel for Embeddings with Chroma

Posted at 2024-05-05

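Introduction

I know that good embedding implementations are already available as ready-made modules, but to understand the basics I tried out the Chroma API directly.
This article is a record of that process.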

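What is Chroma?

Chroma is an open-source vector database.
Internally, recent versions (0.4 and later) reportedly use SQLite for storage and an HNSW index for vector search.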
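
To make "vector database" concrete, here is a minimal sketch of the core operation: given a query vector, find the stored vector closest to it. The three-dimensional vectors and the helper names `squared_l2` and `nearest` are made up for illustration. Real embeddings have hundreds of dimensions, and a vector database would normally use an approximate index rather than this brute-force scan.

```python
# Toy 3-dimensional "embeddings" (hypothetical values, for illustration only).
stored = {
    "id1": [0.9, 0.1, 0.0],
    "id2": [0.0, 0.8, 0.2],
    "id3": [0.1, 0.2, 0.9],
}


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query, vectors, n_results=1):
    """Return the ids of the n_results vectors closest to query (brute force)."""
    ranked = sorted(vectors, key=lambda key: squared_l2(query, vectors[key]))
    return ranked[:n_results]


query = [0.8, 0.2, 0.1]
print(nearest(query, stored))
```

The smaller the distance, the more similar the two vectors are taken to be, so the first id in the returned list is the best match.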

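Setting up the Python environment

The steps below assume an Anaconda environment.
However, because conda packages sometimes lag behind the latest library releases, the packages are installed with pip.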

conda create -n embtest python=3.8
conda activate embtest
pip install chromadb streamlit

Implementing the Python program

The program stores hard-coded documents in the vector database and then searches them.

embed_basic.py

import chromadb
import streamlit as st

# Get a Chroma client (in-memory, nothing is persisted to disk)
chroma_client = chromadb.Client()

# Create the collection, or reuse it if it already exists
collection = chroma_client.get_or_create_collection(name="test_collection")

# First, build the vector database
st.markdown("## Build")

# Add documents to the collection, using the default embedding model (all-MiniLM-L6-v2)
# The first run is slow because the embedding model is downloaded into the cache directory
collection.add(
    documents=[
        "Opus One Winery is a winery in Oakville, California, United States. The wine was called napamedoc until 1982 when it was named Opus One. The winery was founded as a joint venture between Baron Philippe de Rothschild of Château Mouton Rothschild and Robert Mondavi to create a single Bordeaux style blend based upon Napa Valley Cabernet Sauvignon.",
        "Dominio de Pingus is a Spanish winery located in Quintanilla de Onésimo in the Province of Valladolid with vineyards in La Horra area of the Ribera del Duero region. The estate's flagship wine, Pingus, is considered a \"cult wine\", sold at extremely high prices while remaining very inaccessible, and commands an average price of $811 per bottle.",
        "Cooper Mountain Vineyards is an American winery located in Beaverton, Oregon, United States. Started in 1978, the certified organic wine maker produces Pinot noir, Pinot gris, Chardonnay, and Pinot blanc. Located in the Portland metropolitan area, the vineyard is sited on the western slopes of Cooper Mountain, an extinct volcano."],
    metadatas=[
        {"source": "Wikipedia"},
        {"source": "Wikipedia"},
        {"source": "Wikipedia"}],
    ids=["id1", "id2", "id3"]
)

# Inspect the contents of the collection
head = collection.peek()
st.write(head)
count = collection.count()
st.write(f"count: {count}")

# Search the vector database
st.markdown("## Search")
query_text = st.text_input("Query text", value="What is Opus One?")
st.write(f"Query text: {query_text}")

# Query the collection (i.e. run a nearest-neighbor search)
# The default embedding model converts query_texts into vectors for us
result = collection.query(
    query_texts=[query_text],
    n_results=1
)
st.write(result)

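Running this program with streamlit run embed_basic.py shows that the search appears to work correctly.

Note that the first run takes a while because the embedding model has to be downloaded.
Output like the following appears in the terminal.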
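
The dictionary returned by collection.query() is nested: each key holds one inner list per query text. The sketch below unpacks that layout using a hand-written stand-in instead of a live Chroma call. The helper name `hits_for_query`, the distance value, and the shortened document text are placeholders for illustration, not real output.

```python
# A hand-written stand-in for the dictionary returned by collection.query().
# The nesting (one inner list per query text) follows Chroma's result layout;
# the distance and the truncated document are placeholders, not real output.
sample_result = {
    "ids": [["id1"]],
    "distances": [[0.5]],
    "metadatas": [[{"source": "Wikipedia"}]],
    "documents": [["Opus One Winery is a winery in Oakville, California, ..."]],
}


def hits_for_query(result, query_index=0):
    """Flatten the hits of one query text into a list of dictionaries."""
    return [
        {"id": doc_id, "distance": distance, "metadata": metadata, "document": document}
        for doc_id, distance, metadata, document in zip(
            result["ids"][query_index],
            result["distances"][query_index],
            result["metadatas"][query_index],
            result["documents"][query_index],
        )
    ]


for hit in hits_for_query(sample_result):
    print(hit["id"], hit["distance"], hit["metadata"]["source"])
```

With n_results=1 and a single query text, as in embed_basic.py, the list contains exactly one hit.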

.cache\chroma\onnx_models\all-MiniLM-L6-v2\onnx.tar.gz: 100%|██████████████████████████████████| 79.3M/79.3M [01:12<00:00, 1.15MiB/s]

Occasionally, the error ValueError: Could not connect to tenant default_tenant occurs.
This seems to happen while Chroma is still starting up and the connection is not yet established; reloading the page fixes it.
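
If the error really is transient, as it appears to be, one alternative to reloading by hand is to retry the client creation a few times. The following is only a sketch under that assumption: `call_with_retry` is a hypothetical helper, not part of Chroma, and I have not confirmed that a retry resolves the error in every case.

```python
import time


def call_with_retry(factory, attempts=3, delay_seconds=1.0):
    """Call factory(); on ValueError, wait and try again, up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return factory()
        except ValueError:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            time.sleep(delay_seconds)


# Intended use (requires chromadb, so it is left as a comment here):
#   chroma_client = call_with_retry(chromadb.Client)
```

Only ValueError is caught, so unrelated failures still surface immediately.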

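Notes

To use Japanese documents, you need an embedding model that supports Japanese.
When I tried Japanese with the default embedding model, no error was raised, but the search returned documents whose content was not similar to the query text.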
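
One way to swap the model is to pass an embedding function when creating the collection. The sketch below assumes the sentence-transformers package is installed and uses paraphrase-multilingual-MiniLM-L12-v2 as one example of a multilingual model. That model choice, the helper name `get_multilingual_collection`, and the collection name are my assumptions, not something tested in this article.

```python
# Assumed model: one multilingual option from sentence-transformers.
MULTILINGUAL_MODEL = "paraphrase-multilingual-MiniLM-L12-v2"


def get_multilingual_collection(client, name="test_collection_ja"):
    """Create (or reuse) a collection that embeds text with MULTILINGUAL_MODEL."""
    # Imported lazily: needs `pip install chromadb sentence-transformers`.
    from chromadb.utils import embedding_functions

    embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name=MULTILINGUAL_MODEL
    )
    return client.get_or_create_collection(
        name=name, embedding_function=embedding_function
    )


# Intended use:
#   collection = get_multilingual_collection(chromadb.Client())
#   collection.add(documents=["..."], ids=["id1"])
```

A separate collection name is used here so that vectors produced by different models never end up in the same collection, since distances between them would not be meaningful.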
