Basic usage of Chroma DB | Storing data, embedding, persistence, starting a server, and more

Last updated at 2024-03-16 | Posted at 2024-03-16

Overview

This article summarizes the basic usage of Chroma DB.

Installing the Chroma Python library

pip install chromadb

Adding data to a Collection

First, get a Chroma client.

import chromadb
chroma_client = chromadb.Client()

Next, create a Collection.

collection = chroma_client.create_collection(name="my_collection")

To connect to a collection that already exists, use the get_collection method.

collection = chroma_client.get_collection(name="my_collection")

There is also the get_or_create_collection method, which returns the Collection if it already exists and creates it otherwise.

collection = chroma_client.get_or_create_collection("new_collection")

Chroma DB has a built-in embedding mechanism, so documents can be stored without preparing a separate embedding pipeline.

collection.add(
    documents=["This is a document", "This is another document"],
    metadatas=[{"source": "my_source"}, {"source": "my_source"}],
    ids=["id1", "id2"]
)

However, once a collection contains embedded documents, trying to add embeddings with a different number of dimensions fails with an invalid-dimension error.

collection.add(
    embeddings=[[1.2, 2.3, 4.5], [6.7, 8.2, 9.2]],
    documents=["This is a document", "This is another document"],
    metadatas=[{"source": "my_source"}, {"source": "my_source"}],
    ids=["id1", "id2"]
)

Error raised at run time

InvalidDimensionException: Embedding dimension 3 does not match collection dimensionality 384

This shows that Chroma's default embedding produces 384-dimensional vectors.
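A mismatch like the one above can be caught before calling collection.add. The following is a minimal sketch in plain Python; check_dimensions is a hypothetical helper (not part of the chromadb API), and the value 384 is taken from the error message above.

```python
def check_dimensions(embeddings, expected_dim):
    """Hypothetical pre-flight check; not part of the chromadb API."""
    for index, vector in enumerate(embeddings):
        if len(vector) != expected_dim:
            raise ValueError(
                f"Embedding {index} has dimension {len(vector)}, expected {expected_dim}"
            )


# Dimensionality reported in the error message above.
DEFAULT_DIM = 384

# A vector of the right size passes silently.
check_dimensions([[0.0] * DEFAULT_DIM], DEFAULT_DIM)

# The 3-dimensional embeddings from the failing example are rejected.
try:
    check_dimensions([[1.2, 2.3, 4.5], [6.7, 8.2, 9.2]], DEFAULT_DIM)
except ValueError as error:
    message = str(error)
    print(message)
```

Running a check like this before collection.add gives an error that names the offending vector, which can be easier to act on when the embeddings come from several sources.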

The peek method is a convenient way to take a quick look at the data in a Collection. By default it returns 10 items.

collection.peek()

The count method returns the number of embeddings in the Collection.

collection.count()

Searching data

Data can be searched by calling the query method on a collection.

collection.query(
    query_texts=["This is a query document"],
    n_results=2
)

Output

{'ids': [['id1', 'id2']],
 'distances': [[0.7111212015151978, 1.0109771490097046]],
 'metadatas': [[{'source': 'my_source'}, {'source': 'my_source'}]],
 'embeddings': None,
 'documents': [['This is a document', 'This is another document']]}

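A dict is returned with keys such as ids and distances, whose values are lists.
Each value holds one inner list per query text, and the entries of that inner list are the search results, ordered from nearest to farthest.
For example, the item at index 0 of each inner list, that is, the document judged nearest, has the id id1, the distance 0.7111212015151978, and the original document This is a document.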
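Because the result stores ids, distances, metadatas, and documents as parallel lists, it can be handy to regroup them into one record per hit. The sketch below uses the result shown above as a literal; flatten_hits is a hypothetical helper name, not a chromadb function.

```python
# Result copied from the query output shown above.
result = {
    "ids": [["id1", "id2"]],
    "distances": [[0.7111212015151978, 1.0109771490097046]],
    "metadatas": [[{"source": "my_source"}, {"source": "my_source"}]],
    "embeddings": None,
    "documents": [["This is a document", "This is another document"]],
}


def flatten_hits(result, query_index=0):
    """Hypothetical helper: regroup one query's parallel lists into per-hit dicts."""
    keys = ("ids", "distances", "metadatas", "documents")
    # Pick the inner list that belongs to the requested query text.
    columns = [result[key][query_index] for key in keys]
    return [
        {"id": doc_id, "distance": distance, "metadata": metadata, "document": document}
        for doc_id, distance, metadata, document in zip(*columns)
    ]


hits = flatten_hits(result)
for hit in hits:
    print(hit["id"], round(hit["distance"], 4), hit["document"])
```

The query_index argument selects which query text's results to read, since a single call to query can carry several query texts.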

Search similarity

To understand the similarity scores, we first need to know which embedding Chroma uses by default.
By default, Chroma uses an embedding model called all-MiniLM-L6-v2.
Let's try embedding some text with the default embedding model.

import numpy as np
from chromadb.utils import embedding_functions

default_ef = embedding_functions.DefaultEmbeddingFunction()

Convert the document stored in the Collection earlier and the search query, then inspect the resulting array.

embedded = np.array(default_ef(["This is a document", "This is a query document"]))
embedded.dtype, embedded.shape, np.linalg.norm(embedded, axis=1)

Output

(dtype('float64'), (2, 384), array([1., 1.]))

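The embedding converts each of the two documents into a 384-dimensional float array with norm 1.
Computing the squared distance (Squared L2) between these two vectors gives the same value as the distance returned by the search, up to floating-point precision.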

((embedded[0] - embedded[1])**2).sum()

Output

0.7111213785443224
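Since the embeddings have norm 1, the squared L2 distance maps directly to cosine similarity through ||a - b||^2 = 2 - 2 cos(a, b). The sketch below checks this identity in plain Python; the two example vectors are arbitrary stand-ins chosen for illustration, not real embeddings.

```python
import math


def normalize(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def squared_l2(a, b):
    """Sum of squared element-wise differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Arbitrary stand-in vectors, normalized to length 1 like the embeddings above.
a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 1.0, 0.5])

distance = squared_l2(a, b)
similarity = cosine_similarity(a, b)

# For unit vectors the two printed values agree up to rounding.
print(distance, 2 - 2 * similarity)

# Cosine similarity implied by the squared L2 distance computed above.
implied_similarity = 1 - 0.7111213785443224 / 2
print(implied_similarity)
```

Under this identity, the distance of about 0.711 corresponds to a cosine similarity of about 0.644, and a smaller squared L2 distance always means a higher cosine similarity.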

Using OpenAI embedding models with Chroma

Chroma can also use other embedding models.
For example, to use an OpenAI embedding model, create the embedding function as follows. This assumes that the OPENAI_API_KEY environment variable holds your OpenAI API key.

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
                model_name="text-embedding-3-small"
            )

If the OPENAI_API_KEY environment variable is not set, the API key can also be passed directly through the api_key argument.

With a different embedding model, embedding works the same way as with the default one: pass a list of strings as the argument.

np.array(openai_ef(["This is a document"])).shape

Output

(1, 1536)

This shows that OpenAI's text-embedding-3-small embeds text into 1536-dimensional vectors.
Each embedding model embeds text with completely different logic, so it is important to use the same embedding model consistently.

If running the OpenAI embedding above produces an error like the following,

You tried to access openai.Embedding, but this is no longer supported in openai>=1.0.0

one workaround is to follow the message and install an older version of openai.

pip uninstall openai
pip install openai==0.28

Persisting Chroma data

With the approach shown so far, the data stored in Chroma is lost when the program exits.
To persist the data stored in Chroma, use PersistentClient.

import chromadb
client = chromadb.PersistentClient(path="/path/to/persist/directory")

When experimenting with Chroma clients in IPython or Jupyter Notebook, you may hit the error ValueError: An instance of Chroma already exists for ephemeral with different settings.
It occurs because a Chroma instance is already running with different settings, so restarting the console and re-running the necessary cells clears it.

Incidentally, many articles describe persisting data with persist_directory, as shown below.

import chromadb
from chromadb.config import Settings
client = chromadb.Client(Settings(
    chroma_db_impl="duckdb+parquet",
    persist_directory="/path/to/persist/directory"
))

Running this raises ValueError: You are using a deprecated configuration of Chroma., so use PersistentClient instead.

As a test, create a PersistentClient with ./tmp/chroma as the path.

chroma_client = chromadb.PersistentClient(path="./tmp/chroma")

Looking inside ./tmp/chroma at this point confirms that a sqlite3 file has been created.

$ ls ./tmp/chroma

After creating a Collection and adding a document as follows,

collection = chroma_client.create_collection(name="test")
collection.add(
    embeddings=[[1.2, 2.3, 4.5]],
    documents=["This is a document"],
    metadatas=[{"source": "my_source"}],
    ids=["id1"]
)

a directory named with a UUID is created for the collection, and the data is stored there.

$ ls ./tmp/chroma
chroma.sqlite3                       fbaeeb3f-3846-4616-8cf6-a8ece75c42c7
$ ls ./tmp/chroma/fbaeeb3f-3846-4616-8cf6-a8ece75c42c7
data_level0.bin header.bin      length.bin      link_lists.bin
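The same layout can be inspected from Python. The sketch below separates UUID-named directories from the other entries in a persist directory; split_persist_dir is a hypothetical helper, and the temporary directory only simulates the listing above, so point it at a real persist directory in practice.

```python
import tempfile
import uuid
from pathlib import Path


def split_persist_dir(path):
    """Hypothetical helper: separate UUID-named directories from other entries."""
    uuid_dirs, others = [], []
    for entry in sorted(Path(path).iterdir()):
        try:
            # uuid.UUID raises ValueError for names that are not valid UUIDs.
            uuid.UUID(entry.name)
            is_uuid_dir = entry.is_dir()
        except ValueError:
            is_uuid_dir = False
        (uuid_dirs if is_uuid_dir else others).append(entry.name)
    return uuid_dirs, others


# Simulated layout (an assumption for demonstration) mirroring the listing above.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "chroma.sqlite3").touch()
    (root / "fbaeeb3f-3846-4616-8cf6-a8ece75c42c7").mkdir()
    uuid_dirs, others = split_persist_dir(root)

print(uuid_dirs, others)
```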

The UUID of the created collection is available through its id attribute.

collection.id

Output

UUID('bdd7cc46-4f64-444d-8b28-a865fcafa7bb')

The heartbeat method checks whether Chroma is reachable.

chroma_client.heartbeat()

Running Chroma in server mode

Chroma can be started in server mode with the chroma command.
Running the following command in a terminal, not in Python, displays the Chroma logo and starts the Chroma server.

$ chroma run --path ./tmp/chroma


                (((((((((    (((((####
             ((((((((((((((((((((((#########
           ((((((((((((((((((((((((###########
         ((((((((((((((((((((((((((############
        (((((((((((((((((((((((((((#############
        (((((((((((((((((((((((((((#############
         (((((((((((((((((((((((((##############
         ((((((((((((((((((((((((##############
           (((((((((((((((((((((#############
             ((((((((((((((((##############
                (((((((((    #########



Running Chroma

Saving data to: ./tmp/chroma
Connect to chroma at: http://localhost:8000
Getting started guide: https://docs.trychroma.com/getting-started

To connect to Chroma running in server mode, use HttpClient.

import chromadb
client = chromadb.HttpClient(host='localhost', port=8000)
