
Vector Database Summary

Posted at 2023-08-04

These are my notes from looking into vector databases.

What is a vector database?

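A vector database is a database for searching for similar vectors. In machine learning, text, images, and similar data are converted into high-dimensional vectors so that similar items end up close to each other, and a vector database is what you use to search those vectors.
For 2- or 3-dimensional vectors, a space-partitioning tree makes the search easy, but in high dimensions such trees can no longer search quickly, so a method called approximate nearest neighbor search, which explores the data probabilistically, is used instead.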
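For contrast, here is a minimal sketch of the exact search that approximate methods try to avoid: a linear scan that compares the query against every stored vector. The helper name `exact_nearest` and the data sizes are my own assumptions for illustration. The cost grows with the number of vectors times the dimension, which is why indexes such as HNSW accept a small loss in accuracy to skip most of the comparisons.

```python
import math
import random


def exact_nearest(query, vectors, k=3):
    """Return (index, distance) pairs for the k vectors closest to query (L2 distance)."""
    scored = [(i, math.dist(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


random.seed(0)
dim = 8
vectors = [[random.random() for _ in range(dim)] for _ in range(1000)]

# Querying with a stored vector returns that vector first, at distance 0.
hits = exact_nearest(vectors[42], vectors, k=3)
print(hits[0])  # (42, 0.0)
```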

List

DB name	Stars	Language	License	Main developers' affiliation
milvus	20k	Go	Apache-2.0	zilliz (Chinese company)
qdrant	11.1k	Rust	Apache-2.0	qdrant (German company)
weaviate	6.4k	Go	3-clause BSD	weaviate (Dutch company)
cozodb	2.5k	Rust	MPL-2.0	University of Cambridge (UK university)
velox	2.5k	C++	Apache-2.0	Meta (US company)

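There are no major differences in licensing; all of them use open-source licenses. As you would expect from recent products, most are written in Go or Rust.
Sorted by star count, I was surprised that the one from the mighty Meta comes in last (by a narrow margin).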

Of these, I actually tried running milvus and qdrant.

milvus

Installation and startup

wget https://github.com/milvus-io/milvus/releases/download/v2.0.2/milvus-standalone-docker-compose.yml -O docker-compose.yml
docker compose up -d

Data is stored in ${DOCKER_VOLUME_DIRECTORY}/volumes if the DOCKER_VOLUME_DIRECTORY environment variable is set, and in ./volumes otherwise.

Operations

Connecting

from pymilvus import connections

connections.connect("default", host="localhost", port="19530")

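'default' is the connection alias, and subsequent operations need to specify this alias.
However, the corresponding arguments default to 'default', so in this case you do not actually have to pass it.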

Creating and getting a collection

from pymilvus import Collection, CollectionSchema, DataType, FieldSchema

schema = CollectionSchema([
    FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="value", dtype=DataType.DOUBLE),
    FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=8),
])

my_collection = Collection("my_collection", schema)

my_collection.create_index(
    "embeddings",
    {
        "index_type": "HNSW",
        "metric_type": "L2",
        "params": {"M": 16, "efConstruction": 100},
    },
)

If the collection already exists, you can get it with either my_collection = Collection("my_collection", schema) or my_collection = Collection("my_collection").
If you call my_collection = Collection("my_collection", schema) and the collection already exists with a different schema, an error is raised.
DataType.STRING exists, but it does not seem to be usable yet.

Inserting

import numpy

rng = numpy.random.default_rng()
insert_result = my_collection.insert(
    [
        rng.random(100).tolist(),
        rng.random((100, 8)),
    ]
)

id_list = insert_result.primary_keys
print(f"{len(id_list)=}")  # len(id_list)=100
print(f"{id_list[0]=}")    # id_list[0]=442366941865181988

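Note that the data is not passed row by row, as in

[
    [entry1_field1, entry1_field2, entry1_field3],
    [entry2_field1, entry2_field2, entry2_field3],
    ...
]

but column by column, with one list per field:

[
    [entry1_field1, entry2_field1, ...],
    [entry1_field2, entry2_field2, ...],
    [entry1_field3, entry2_field3, ...],
]

Fields with auto_id=True, such as pk in the schema above, are left out of the input.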
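If your data starts out as one record per entry, a transpose with zip gets it into the per-field layout. This is only a sketch: the `rows` records below are made-up sample data, assuming the two non-auto fields (value, embeddings) of the schema used here.

```python
# Hypothetical row-oriented records: (value, embedding) per entry.
rows = [
    (0.1, [0.0] * 8),
    (0.2, [1.0] * 8),
    (0.3, [2.0] * 8),
]

# zip(*rows) transposes the rows into one sequence per field.
columns = [list(column) for column in zip(*rows)]

print(columns[0])       # [0.1, 0.2, 0.3]
print(len(columns[1]))  # 3
```

The resulting `columns` list has the same shape as the argument passed to my_collection.insert in the example above.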

Retrieving

my_collection.load()
expr = f"pk in [{id_list[0]}, {id_list[1]}]"
query_result = my_collection.query(expr, ["value", "embeddings"])

print(f"{len(query_result)=}")              # len(query_result)=2
print(f"{query_result[0]['embeddings']=}")  # query_result[0]['embeddings']=[0.612779, 0.729674, 0.871807, 0.797426, 0.396676, 0.352511, 0.400366, 0.205642]

Vector search

my_collection.load()
search_result = my_collection.search(
    rng.random((10, 8)),
    "embeddings",
    # The index is HNSW, so the search parameter is ef (nprobe applies to IVF indexes)
    {"metric_type": "L2", "params": {"ef": 10}},
    limit=3,
    output_fields=["pk", "value"],
)

print(f"{len(search_result)=}")     # len(search_result)=10
print(f"{len(search_result[0])=}")  # len(search_result[0])=3

hit = search_result[0][0]
print(f"{hit.id=}, {hit.distance=}")       # hit.id=442366941865182017, hit.distance=0.3244890570640564
print(f"pk={hit.entity.get('pk')}")        # pk=442366941865182017
print(f"value={hit.entity.get('value')}")  # value=0.6984228743147701

qdrant

Installation and startup

There are three ways to run it.

Run it inside the client library (in-memory).
Run it inside the client library (on-disk).
Set up a server (with docker, etc.).

It is nice that you can use it in a portable way during development and testing.

To set up a server with docker, run the following.

docker run -p 6333:6333 -v $(pwd)/qdrant_storage:/qdrant/storage qdrant/qdrant

Connecting

from qdrant_client import QdrantClient

client = QdrantClient(":memory:")                   # run in memory
client = QdrantClient(path="./data")                # run on disk
client = QdrantClient(host="localhost", port=6333)  # connect to a server

Creating a collection

from qdrant_client.models import Distance, VectorParams

client.create_collection(
    collection_name="my_collection",
    vectors_config=VectorParams(size=8, distance=Distance.EUCLID),
)

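Data other than vectors is handled as payload, and no schema is defined for it.
If a collection with the same name already exists, an error is raised (even if the schema is the same).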
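Because creating an existing collection fails, a script that may run more than once needs a guard. The following is a minimal sketch, assuming the client exposes get_collections() returning an object whose collections entries have a name attribute, as qdrant_client does. The helper name `ensure_collection` is hypothetical, and it does not verify that an existing collection has the same vector settings.

```python
def ensure_collection(client, name, vectors_config):
    """Create the collection only if it does not exist yet.

    Returns True if the collection was created, False if it was already there.
    """
    existing = {c.name for c in client.get_collections().collections}
    if name in existing:
        return False
    client.create_collection(collection_name=name, vectors_config=vectors_config)
    return True
```

With the client from above, this would be called as ensure_collection(client, "my_collection", VectorParams(size=8, distance=Distance.EUCLID)).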

Inserting

from random import random
from uuid import uuid4

from qdrant_client.models import Record

client.upload_records(
    collection_name="my_collection",
    records=[
        Record(
            id=str(uuid4()),
            vector=[random() for _ in range(8)],
            payload={"name": f"entry{i}", "value": random()},
        )
        for i in range(100)
    ],
)

The id can be an unsigned integer or a UUID string.
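Random uuid4 ids have to be stored somewhere if you want to fetch the same points later. One alternative, sketched here as my own suggestion rather than anything qdrant requires, is to derive the id from a stable key with uuid5, so the same name always maps to the same id. The namespace value and the helper name `point_id` are assumptions for illustration.

```python
import uuid

# Hypothetical namespace for this application; any fixed UUID works.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")


def point_id(name: str) -> str:
    """Derive a stable UUID string from an entry name."""
    return str(uuid.uuid5(NAMESPACE, name))


print(point_id("entry0") == point_id("entry0"))  # True
print(point_id("entry0") == point_id("entry1"))  # False
```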

Retrieving

retrieve_result = client.retrieve(
    collection_name="my_collection",
    ids=[
        "94c7f62d-2f1c-4750-b297-cd6bdd9579fb",
        "76975215-f312-4695-8ddf-926589a8edb5",
    ],
    with_vectors=True,
)
print(f"{len(retrieve_result)=}")  # len(retrieve_result)=2

hit = retrieve_result[0]
print(f"{hit.id=}")      # hit.id='94c7f62d-2f1c-4750-b297-cd6bdd9579fb'
print(f"{hit.vector=}")  # hit.vector=[0.60470337, 0.14107256, 0.30591038, 0.8446568, 0.17880422, 0.5339794, 0.9225369, 0.8381986]

Vector search

from random import random

from qdrant_client.models import SearchRequest

search_result = client.search_batch(
    collection_name="my_collection",
    requests=[
        SearchRequest(
            vector=[random() for _ in range(8)],
            limit=3,
            with_vector=True,
            with_payload=["name"],
        )
        for _ in range(10)
    ],
)

print(f"{len(search_result)=}")     # len(search_result)=10
print(f"{len(search_result[0])=}")  # len(search_result[0])=3

hit = search_result[0][0]
print(f"{hit.id=} {hit.score=}")  # hit.id='cf73c608-89d2-42d6-bfc5-1f6549fe2c1a' hit.score=0.786472
print(f"{hit.vector=}")           # hit.vector=[0.5125426, 0.6317287, 0.375882, 0.6864018, 0.8782652, 0.5168138, 0.8444203, 0.4356831]
print(f"{hit.payload=}")          # hit.payload={'name': 'entry69'}

Impressions

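Both of the databases I tried have well-maintained tutorials and were easy to get started with. Client libraries for languages other than Python also seem to be available, so I think they should be easy to adopt.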
