
Trying Spotify's Voyager with Titan Embeddings G1

Posted at 2024-01-13

A little while ago, this news came out:

Spotify open-sources Voyager, a nearest-neighbor search library

I had been curious about it, so I tried it with Titan Embeddings G1, the text embedding model on Amazon Bedrock.

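Information

Official site

GitHub

Voyager supports Python and Java, and index data can apparently be shared between the two. At the time of writing, Windows does not appear to be supported.

Official demo

I followed the demo video on the official site. (Only a video is available, so I have transcribed it here.)

Installing the library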
```
pip install -Uq voyager
```

Getting the test data

```
wget https://storage.googleapis.com/embedding-projector/data/word2vec_10000_200d_tensors.bytes
wget https://storage.googleapis.com/embedding-projector/data/word2vec_10000_200d_labels.tsv
```

Loading NumPy

From here on, the code is Python. The video runs it in IPython.
```
import numpy as np
```

Loading the test data

The test data appears to be vectors produced by Word2Vec.

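```
num_dimensions = 200

# The file is read as raw float32 values; reshape gives one 200-d row per word
with open("word2vec_10000_200d_tensors.bytes", "rb") as f:
  vectors = np.fromfile(f, np.float32).reshape(-1, num_dimensions)

# Take the first tab-separated column of each line, skipping the first line
with open("word2vec_10000_200d_labels.tsv", "r") as f:
  labels = [line.split("\t")[0] for line in f.readlines()[1:]]
```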
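
The `.bytes` file seems to be a flat run of 32-bit floats with no header, which is why `reshape(-1, 200)` is needed to get one row per word. Here is a minimal sketch of the same idea using only the standard library. The helper name `rows_from_float32_bytes`, the little-endian byte order, and the tiny 2 x 3 buffer are my own assumptions for illustration, not something taken from the real file.

```python
import struct


def rows_from_float32_bytes(raw: bytes, num_dimensions: int) -> list:
    """Split a flat little-endian float32 buffer into rows of num_dimensions values."""
    row_size = 4 * num_dimensions  # 4 bytes per float32
    if len(raw) % row_size != 0:
        raise ValueError("buffer length is not a multiple of the row size")
    count = len(raw) // 4
    flat = struct.unpack(f"<{count}f", raw)
    return [list(flat[i:i + num_dimensions]) for i in range(0, count, num_dimensions)]


# Hypothetical 2 x 3 buffer; the values are chosen to be exactly representable.
raw = struct.pack("<6f", 0.5, 1.0, 1.5, 2.0, 2.5, 3.0)
rows = rows_from_float32_bytes(raw, 3)
print(rows)
```

By the same arithmetic, a file holding exactly 10,000 rows of 200 float32 values would be 10000 x 200 x 4 = 8,000,000 bytes, which is a quick sanity check before loading.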

Checking the data
```
labels[1]
```
'the'
```
labels[2]
```
'of'
```
labels[3]
```
'and'
```
vectors[0]
```
array([-0.141771 , 0.249576 , -0.188584 , -0.0815223 , 0.128442 ,
0.547185 , -0.197366 , 0.142269 , -0.438946 , -0.0416157 ,
0.370258 , 0.408382 , 0.011527 , 0.274481 , -0.00205041,
0.165503 , -0.0883049 , 0.286902 , -0.0418692 , 0.0736817 ,
-0.0211798 , -0.0613568 , 0.17691 , -0.141145 , 0.0105192 ,
-0.226281 , -0.324913 , 0.266758 , -0.104392 , -0.170748 ,
0.00121182, -0.0421411 , -0.126701 , -0.335706 , -0.0201676 ,
-0.314706 , 0.227294 , 0.181603 , 0.103264 , 0.333935 ,
0.0354471 , 0.0635742 , 0.205139 , 0.249157 , -0.136408 ,
-0.0435595 , 0.095526 , -0.0772112 , -0.00595369, -0.182302 ,
0.323586 , -0.204001 , -0.0916038 , 0.0807845 , -0.243777 ,
0.119747 , 0.00691663, -0.1902 , 0.263702 , 0.244449 ,
0.0441175 , 0.0958303 , -0.0618864 , -0.202204 , -0.342813 ,
0.318309 , -0.094516 , 0.307758 , 0.109799 , 0.179937 ,
0.209827 , 0.270957 , 0.0364361 , -0.26446 , 0.364246 ,
-0.366684 , -0.0671071 , 0.10821 , -0.259401 , 0.0538033 ,
0.159056 , -0.206028 , -0.0396725 , -0.336107 , 0.234527 ,
-0.0116628 , -0.0904525 , -0.212477 , -0.408584 , -0.0243164 ,
-0.274519 , 0.403208 , 0.215137 , -0.132371 , -0.0714231 ,
0.188115 , 0.0488086 , -0.0825052 , -0.0325133 , -0.143736 ,
-0.0349842 , 0.646822 , 0.17857 , 0.386225 , 0.266737 ,
0.261134 , 0.0294623 , 0.069207 , 0.0511157 , 0.145836 ,
-0.0794004 , 0.204002 , -0.193424 , 0.157486 , -0.0425395 ,
-0.297959 , -0.0443972 , -0.24584 , 0.328743 , -0.0362118 ,
-0.109993 , 0.368324 , -0.0865976 , -0.0313383 , 0.148474 ,
...
0.183288 , 0.0209599 , -0.104203 , -0.169894 , -0.107271 ,
0.292521 , -0.177604 , -0.108201 , -0.367897 , -0.281144 ,
0.0879999 , -0.291526 , -0.231764 , 0.17579 , 0.0101314 ,
0.161831 , -0.0566941 , -0.0891432 , 0.263995 , -0.303327 ],
dtype=float32)

Creating the Voyager index

```
import voyager

index = voyager.Index(voyager.Space.Cosine, 200)
```

Adding the data
```
len(vectors)
```
10000
```
index.add_items(vectors)
```

Checking the added data
```
labels.index("dog")
```
1902

Running a search

```
vector = vectors[labels.index("dog")]

ids, distances = index.query(vector, 5)
```

Checking the search results
```
ids
```
array([1902, 4138, 2602, 2974, 5054], dtype=uint64)
```
distances
```
array([5.3644180e-07, 2.7525765e-01, 3.5191095e-01, 3.7909913e-01,
3.8402766e-01], dtype=float32)
```
for label_id, distance in zip(ids, distances):
  print(f"\t{labels[label_id]!r} is {distance:.2f} away from dog.")
```
'dog' is 0.00 away from dog.
'dogs' is 0.28 away from dog.
'cat' is 0.35 away from dog.
'bird' is 0.38 away from dog.
'breed' is 0.38 away from dog.
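
To make these distances less of a black box: with `voyager.Space.Cosine`, the reported value appears to be 1 minus the cosine similarity, which would explain why 'dog' is essentially 0 away from itself (the 5.36e-07 looks like floating-point rounding). As far as I understand, Voyager finds neighbors approximately. The sketch below instead does the exact brute-force version in plain Python on made-up 2-D vectors, so `cosine_distance`, `brute_force_query`, and `toy_vectors` are illustrative names and not part of Voyager.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def brute_force_query(vectors, query, k):
    """Exact k-nearest-neighbor search: score every vector, keep the k closest."""
    scored = sorted((cosine_distance(v, query), i) for i, v in enumerate(vectors))
    ids = [i for _, i in scored[:k]]
    distances = [d for d, _ in scored[:k]]
    return ids, distances


# Made-up 2-D vectors, for illustration only.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
ids, distances = brute_force_query(toy_vectors, [1.0, 0.0], 3)

for i, distance in zip(ids, distances):
    print(f"vector {i} is {distance:.2f} away from the query")
```

Brute force compares the query against every vector, which is fine for a handful of items but is exactly the cost an index like Voyager is meant to avoid at scale.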

Saving the index
```
index.save("demo-index.voy")
```
A saved index can be loaded like this:
```
index = voyager.Index.load("demo-index.voy")
```

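Does that make the usage clear? Voyager is specialized for vector search only, so if you want to get back the original document from a matched vector, you need to manage that mapping separately.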
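
One way to manage that mapping is to save the labels in a small JSON file next to the index. This is only a sketch under my own assumptions: `save_labels` and `load_labels` are hypothetical helpers, the `.labels.json` file name is arbitrary, and it relies on IDs matching list positions, as they did in the demo where `labels.index("dog")` and the returned ID were both 1902.

```python
import json
import tempfile
from pathlib import Path


def save_labels(labels, index_path):
    """Write labels to a JSON file beside the index; list position = vector ID."""
    labels_path = Path(index_path).with_suffix(".labels.json")
    labels_path.write_text(json.dumps(labels, ensure_ascii=False), encoding="utf-8")
    return labels_path


def load_labels(index_path):
    """Read back the labels that were saved beside the index."""
    labels_path = Path(index_path).with_suffix(".labels.json")
    return json.loads(labels_path.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    index_path = Path(tmp) / "demo-index.voy"
    saved_path = save_labels(["dog", "dogs", "cat"], index_path)
    restored = load_labels(index_path)
    # IDs in the order a query might return them (made-up values).
    hits = [restored[i] for i in [2, 0]]
    print(saved_path.name, hits)
```

If items are ever added with explicit IDs or deleted from the index, a list keyed by position would no longer be safe and a dictionary keyed by ID would be the more careful choice.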

Trying it with Titan Embeddings G1

I repeated the same steps as the demo above using Titan Embeddings G1 on Amazon Bedrock.

Installing the libraries
```
pip install -Uq voyager boto3
```

Loading the libraries and creating the Bedrock client

```
import json
import boto3
client = boto3.client('bedrock-runtime')
```

Creating a function that calls Titan Embeddings G1

I wrote a function that takes a text and returns its embedding vector.

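```
def embedding(inputText: str):
  # Request body for the Titan Embeddings G1 text model
  body = {
    "inputText": inputText
  }

  # Call Bedrock through the client created above
  response = client.invoke_model(
    modelId="amazon.titan-embed-text-v1",
    contentType="application/json",
    accept="*/*",
    body=json.dumps(body)
  )

  # The response body is a stream; parse the JSON and pull out the vector
  return json.loads(response["body"].read())["embedding"]
```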

Creating the test data

```
labels = ["dog", "dogs", "cat", "bird", "breed"]

# One Bedrock call per label
vectors = [embedding(label) for label in labels]
```
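
Since every call to `embedding` is a request to Bedrock, it can be worth avoiding repeat calls for the same text once the label list grows. A minimal sketch of that idea follows. `make_cached_embedder` is a hypothetical helper of mine, and `fake_embedding` is a stand-in with made-up output so the example runs without AWS credentials; in practice you would pass the real `embedding` function instead.

```python
def make_cached_embedder(embed_fn):
    """Wrap embed_fn so each distinct text is embedded only once."""
    cache = {}

    def cached(text):
        if text not in cache:
            cache[text] = embed_fn(text)
        return cache[text]

    return cached


# Stand-in for the Bedrock-backed function so the sketch runs offline.
calls = []


def fake_embedding(text):
    calls.append(text)
    return [float(len(text)), 1.0]


cached_embedding = make_cached_embedder(fake_embedding)
vectors = [cached_embedding(text) for text in ["dog", "dogs", "dog"]]
print(len(calls), "calls for", len(vectors), "vectors")
```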

Creating the Voyager index and adding the data

```
import voyager
index = voyager.Index(voyager.Space.Cosine, 1536)
index.add_items(vectors)
```
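
The 1536 here has to match the length of the vectors returned by the embedding model (200 was used for the Word2Vec data earlier). Rather than hard-coding it, the dimension could be read from the vectors themselves. The helper below is a small sketch of that check; `infer_num_dimensions` and `toy_vectors` are names I made up for illustration.

```python
def infer_num_dimensions(vectors):
    """Return the shared length of all vectors, or raise if they differ."""
    lengths = {len(v) for v in vectors}
    if len(lengths) != 1:
        raise ValueError(f"expected one vector length, got {sorted(lengths)}")
    return lengths.pop()


# Made-up 3-D vectors, for illustration only.
toy_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
num_dimensions = infer_num_dimensions(toy_vectors)
print(num_dimensions)
```

The result could then be passed as the second argument when creating the index, so switching to a different embedding model would not require editing the number by hand.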

Running a search

```
vector = vectors[labels.index("dog")]
ids, distances = index.query(vector, 5)
```

Checking the search results
```
for label_id, distance in zip(ids, distances):
  print(f"\t{labels[label_id]!r} is {distance:.2f} away from dog.")
```
'dog' is 0.00 away from dog.
'dogs' is 0.14 away from dog.
'cat' is 0.25 away from dog.
'breed' is 0.47 away from dog.
'bird' is 0.48 away from dog.

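It worked without any particular pitfalls.

Summary

Voyager was easy to use, and it also worked with embeddings from Bedrock.

I also found that the search results (not just the distances but also the order) differ between the prepared test data and Titan Embeddings G1. Perhaps this is where the quality of the embedding model shows?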

Word2Vec

'dog' is 0.00 away from dog.
'dogs' is 0.28 away from dog.
'cat' is 0.35 away from dog.
'bird' is 0.38 away from dog.
'breed' is 0.38 away from dog.

Titan Embeddings G1

'dog' is 0.00 away from dog.
'dogs' is 0.14 away from dog.
'cat' is 0.25 away from dog.
'breed' is 0.47 away from dog.
'bird' is 0.48 away from dog.
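
To see exactly what changed between the two lists, the rankings can be compared programmatically. The sketch below uses the two orderings printed above; `rank_changes` is a hypothetical helper of mine. Note that the two words that swap places, 'bird' and 'breed', were nearly tied in the Word2Vec results (about 0.379 versus 0.384), so I would be cautious about reading too much into this one swap. Distances from different embedding models are also probably not directly comparable in absolute terms.

```python
# Orderings copied from the results above.
word2vec_ranking = ["dog", "dogs", "cat", "bird", "breed"]
titan_ranking = ["dog", "dogs", "cat", "breed", "bird"]


def rank_changes(before, after):
    """Map each label whose position differs to (old_rank, new_rank), 0-based."""
    new_rank = {label: i for i, label in enumerate(after)}
    return {
        label: (i, new_rank[label])
        for i, label in enumerate(before)
        if new_rank[label] != i
    }


changes = rank_changes(word2vec_ranking, titan_ranking)
print(changes)
```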
