Brute-force cosine similarity search with faiss

Posted at 2024-05-28

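I implemented cosine similarity search with faiss, a Python package for ANN (Approximate Nearest Neighbor) search, using the IndexFlatIP index. IndexFlatIP itself performs an exact, brute-force inner-product search rather than an approximate one. It was slightly fiddly, so I am writing it down.

It follows the officially recommended method.

Here is an earlier related article.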

Environment

I ran this on Google Colab, with faiss-cpu installed via pip.

Type	Version	Notes
Python	3.10.12
faiss-cpu	1.8.0	Running on CPU
numpy	1.25.2
pandas	2.0.3	Not required, but I am used to it

Program

Package imports

Nothing special about the imports.

import numpy as np
import faiss
import pandas as pd

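Creating the test data

Create test data with 3 rows of 2-dimensional vectors. With the default float64, faiss.normalize_L2 in the next step raised an error, so I use float32.
I have not looked into how to run the later steps with float64.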

df = pd.DataFrame({'D1': [.1, .2, .3],
                   'D2': [.2, .3, .4]}, dtype='float32')

Here is what the data looks like.

    D1   D2
0  0.1  0.2
1  0.2  0.3
2  0.3  0.4
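
As a side note on the float32 requirement above: if the data already exists as a float64 array, one possible way to prepare it is to convert it in a single step. This is a minimal sketch using only NumPy; the names data64 and data32 are made up for illustration.

```python
import numpy as np

# Hypothetical float64 input (NumPy's default float dtype)
data64 = np.array([[0.1, 0.2],
                   [0.2, 0.3],
                   [0.3, 0.4]])

# Convert to float32 and guarantee a C-contiguous memory layout in one call
data32 = np.ascontiguousarray(data64, dtype='float32')

print(data64.dtype)                  # float64
print(data32.dtype)                  # float32
print(data32.flags['C_CONTIGUOUS'])  # True
```

The conversion loses some precision, which is usually acceptable for similarity search but worth keeping in mind.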

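L2-normalizing the data

I don't fully understand every detail here, but this step L2-normalizes the array. Note that faiss.normalize_L2 modifies the array in place rather than returning a new one.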

# Convert the DataFrame to a NumPy array
df_array = df.to_numpy()

# Make the array C-contiguous
df_array = np.ascontiguousarray(df_array)

# Normalize the array using the faiss library
faiss.normalize_L2(df_array)
print(df_array)

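You can see that each row has been L2-normalized. Because the vectors are L2-normalized, the inner-product score returned by the search is the cosine similarity itself.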

df_array

[[0.44721362 0.89442724]
 [0.5547002  0.8320503 ]
 [0.6        0.8       ]]
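
To make it clearer what the normalization does, here is a rough NumPy-only equivalent: each row is divided by its own Euclidean (L2) norm. This is a sketch for cross-checking the values, not a description of how faiss implements it internally, and the variable names are hypothetical.

```python
import numpy as np

vectors = np.array([[0.1, 0.2],
                    [0.2, 0.3],
                    [0.3, 0.4]], dtype='float32')

# L2 norm of each row, kept as a column so it broadcasts across the row
norms = np.linalg.norm(vectors, axis=1, keepdims=True)

# Divide every row by its own norm
normalized = vectors / norms
print(normalized)

# Every row now has length 1 (up to float32 rounding)
print(np.linalg.norm(normalized, axis=1))
```

Unlike faiss.normalize_L2, this version leaves the input untouched and returns a new array.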

Building the index

Build the index by wrapping IndexFlatIP in IndexIDMap2, so that custom IDs can be attached to the vectors.

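# Flat (brute-force) inner-product index for 2-dimensional vectors
quantizer = faiss.IndexFlatIP(2)

# Wrap it so that custom IDs can be mapped to the vectors
index = faiss.IndexIDMap2(quantizer)

# Flat indexes need no training, so this call is effectively a no-op
index.train(df_array)

# [1, 2, 3] are the IDs, passed explicitly as an int64 array
index.add_with_ids(df_array, np.array([1, 2, 3], dtype='int64'))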

Search

Look for the vectors closest to [1, 1]. The query vector is L2-normalized before searching.

query = np.array([[1, 1]], dtype='float32')
faiss.normalize_L2(query)
print(query)
distance, ind = index.search(query, 2)
print(f'{distance=}')
print(f'{ind=}')

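IDs 3 and 2 are returned, which is the expected result. The cosine similarities are also close to 1, which looks right, although I did not compute them separately.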

[[0.70710677 0.70710677]]
distance=array([[0.9899495, 0.9805807]], dtype=float32)
ind=array([[3, 2]])
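
The scores can be cross-checked without faiss. The following sketch computes the cosine similarity directly with NumPy, picks the top 2, and maps the row positions to the IDs 1, 2, 3 used above. The variable names are hypothetical, and this is only meant as a sanity check on small data.

```python
import numpy as np

vectors = np.array([[0.1, 0.2],
                    [0.2, 0.3],
                    [0.3, 0.4]], dtype='float32')
ids = np.array([1, 2, 3], dtype='int64')
query = np.array([1, 1], dtype='float32')

# Cosine similarity = dot product divided by the product of the norms
cos_sim = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))

# Row positions of the 2 highest scores, best first
top = np.argsort(-cos_sim)[:2]

print(ids[top])      # [3 2]
print(cos_sim[top])  # approximately [0.9899 0.9806]
```

Both the IDs and the scores agree with the faiss output, which suggests the normalize-then-inner-product approach is behaving as cosine similarity.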
