faiss のちょっとニッチな機能紹介

Posted at 2019-12-24

こんにちは，ABEJA Advent Calendar の23日目です．

はじめに

みなさん，faissというライブラリをご存知でしょうか？Facebook Resarch の近傍探索ライブラリで，非常に使いやすくチュートリアルも充実しているため，業務でも趣味でも頻繁に利用しています．
しかし，そんな便利なライブラリにも弱点はあります．ちょっとニッチなことをやろうとすると，「まぁ細かいことは説明しなんでテストコード見て雰囲気掴んでね(-д☆)ｷﾗｯ」と言わんばかりのこれです．

実際にテストコードを読んだり，時によっては本体の C++ のコードを読んだりして使い方を勉強したので，将来の自分への備忘も込めて書いていきます．

ちょっとニッチな機能紹介

faiss の特徴量取得

faiss が保持している特徴量を検索するのではなく，それ自体を取得したいことが稀にあります．例えば，faiss.IndexFlatL2だとxbという attribute から特徴量そのものを取得することができます．さらにfaiss.vector_float_to_arrayという関数を利用することでそれをnumpy.ndarrayに変換することができます．

>>> d = 2
>>> nb = 10
>>> xb = np.random.random((nb, d)).astype(np.float32)
>>>
>>> index = faiss.IndexFlatL2(d)
>>> index.add(xb)
>>>
>>> index.xb
<faiss.swigfaiss.FloatVector; proxy of <Swig Object of type 'std::vector< float > *' at 0x7f7ff26dede0> >
>>> faiss.vector_float_to_array(index.xb)
array([0.70434606, 0.8172881 , 0.27514696, 0.04918063, 0.16584012,
       0.58303493, 0.83627784, 0.7318148 , 0.91633004, 0.16084996,
       0.6760264 , 0.65586466, 0.45432937, 0.35858378, 0.0895841 ,
       0.3424391 , 0.6606455 , 0.7392444 , 0.07704416, 0.13714503],
      dtype=float32)

なお，どの attribute から特徴量そのものを取得できるかはfaiss/IndexFlat.hなどのヘッダーから確認できます．

`numpy.ndarray` => `float *x` の変換

faiss/python/faiss.pyに実装されていない swig interface の関数を使うためにfaiss/c_api/IndexFlat_c.hなどを見るとfloat *xなどに遭遇することがあります．numpy.ndarrayにfaiss.swig_ptrを使えば swig のポインタに変換できるので，そういったちょっとニッチな関数も使うことが出来ます．

>>> x = np.random.random(10).astype(np.float32)
>>> faiss.swig_ptr(x)
<Swig Object of type 'faiss::IndexReplicasTemplate< faiss::Index >::component_t *' at 0x7fe8a57c3b10>

部分空間を扱う

今まではベーシックな機能の紹介が多かったですが，これからはちょっと応用編です．faiss における部分空間の取り扱いについてです．
まずは部分空間での距離計算です．compute_distance_subsetを使えば以下のように実現できます．

>>> def compute_distance_subset(index, xq, subset):
...     n, _ = xq.shape
...     _, k = subset.shape
...     distances = np.empty((n, k), dtype=np.float32)
...     index.compute_distance_subset(
...             n, faiss.swig_ptr(xq),
...             k, faiss.swig_ptr(distances),
...             faiss.swig_ptr(subset)
...     )
...     return distances
... 
>>>
>>> d = 2
>>> nb = 10000
>>> nq = 10
>>> k = 5
>>> xb = np.random.random((nb, d)).astype(np.float32)
>>> xq = np.random.random((nq, d)).astype(np.float32)
>>> subset = np.random.choice(range(nb), (nq, k))
>>> 
>>> index = faiss.IndexFlatL2(d)
>>> index.add(xb)
>>> 
>>> compute_distance_subset(index, xq, subset)
array([[0.04731181, 0.1585833 , 0.4276843 , 0.02083743, 0.14153683],
       [0.55289364, 0.19499591, 0.24127454, 0.16293366, 0.02044217],
       [0.61750704, 0.48981428, 0.51042193, 0.12334089, 0.55514073],
       [0.5959296 , 0.6604827 , 0.20945217, 0.10136123, 0.01619768],
       [0.13882631, 0.16818088, 0.01572821, 0.17454663, 0.03992677],
       [0.46265444, 0.70609426, 0.49902472, 0.730565  , 0.09248901],
       [1.1638596 , 1.1041545 , 0.73789394, 0.60920525, 0.21328084],
       [0.02405633, 0.00557276, 0.6880306 , 0.821055  , 0.0421453 ],
       [0.08726364, 0.33441633, 0.15067156, 0.28792596, 0.30785137],
       [0.04219329, 0.747885  , 0.01912764, 0.19305223, 0.51132184]],
      dtype=float32)

なお，compute_distance_subsetはあくまで subset の距離計算をするだけで，K近傍などは計算してくれないのであしからず．

次は部分空間そのものを作る方法です．copy_subset_toを使えば実現できます．Splitting and merging indexesを見れば大体わかるので注意点だけ書きます．
1点目はドキュメントにも書いてありますが，IndexIVFでしか使えない点です．
2点目は部分空間の作り方にやや難がある点です．ソースコードを見れば分かるのですが，連続したコピーや周期的なコピーしかサポートされていません．

faiss/IndexIVF.h

/** copy a subset of the entries index to the other index
 *
 * if subset_type == 0: copies ids in [a1, a2)
 * if subset_type == 1: copies ids if id % a1 == a2
 * if subset_type == 2: copies inverted lists such that a1
 *                      elements are left before and a2 elements are after
 */
virtual void copy_subset_to (IndexIVF & other, int subset_type,
                             idx_t a1, idx_t a2) const;

独自の ID 体系を使う

デフォルトでは連続した ID が割り振られますが，IndexIDMapを使えば独自の ID 体系を使うこともできます．Faiss ID mappingを見れば大体わかますが，以下のように独自の ID 体系を使うことができます．

>>> d = 2
>>> nb = 10000
>>> nq = 10
>>> k = 2
>>> xb = np.random.random((nb, d)).astype(np.float32)
>>> xq = np.random.random((nq, d)).astype(np.float32)
>>> ids = np.arange(nb) * 10
>>>
>>> index = faiss.IndexFlatL2(d)
>>> custom_index = faiss.IndexIDMap(index)
>>> custom_index.add_with_ids(xb, ids)
>>> custom_index.search(xq, k)
(array([[1.97887421e-05, 2.86698341e-05],
       [1.38282776e-05, 6.81877136e-05],
       [4.52995300e-06, 1.12056732e-05],
       [8.55922699e-05, 9.26256180e-05],
       [1.41859055e-05, 1.38044357e-04],
       [2.40206718e-05, 4.58657742e-05],
       [6.55651093e-06, 3.46302986e-05],
       [5.24520874e-06, 1.35898590e-05],
       [2.90870667e-05, 3.90410423e-05],
       [2.38418579e-07, 2.86102295e-05]], dtype=float32), array([[38210, 66060],
       [51500, 97890],
       [17100, 97780],
       [42300, 51430],
       [ 3790, 63660],
       [19050, 26470],
       [22070, 45900],
       [39140,  4190],
       [10040,  7850],
       [14390, 48690]]))

ここで注意点があります．faiss.IndexIDMapはあくまで元々の index と ID 体系のマッピングをするだけなので，faiss の古いバージョンを使っていると，以下のように index が GC されうる状況で Segmentation fault になってしまいます．なお，最新版の faiss (1.5.3) ではこの問題は解決されているようです．

>>> index = faiss.IndexIDMap(faiss.IndexFlatL2(d))
>>> index.add_with_ids(xb, ids)
Segmentation fault (core dumped)

ちなみに，使い道はあまりありませんが，元々の index に対して検索をかけることもできます．

>>> index.search(xq, k)
(array([[1.97887421e-05, 2.86698341e-05],
       [1.38282776e-05, 6.81877136e-05],
       [4.52995300e-06, 1.12056732e-05],
       [8.55922699e-05, 9.26256180e-05],
       [1.41859055e-05, 1.38044357e-04],
       [2.40206718e-05, 4.58657742e-05],
       [6.55651093e-06, 3.46302986e-05],
       [5.24520874e-06, 1.35898590e-05],
       [2.90870667e-05, 3.90410423e-05],
       [2.38418579e-07, 2.86102295e-05]], dtype=float32), array([[3821, 6606],
       [5150, 9789],
       [1710, 9778],
       [4230, 5143],
       [ 379, 6366],
       [1905, 2647],
       [2207, 4590],
       [3914,  419],
       [1004,  785],
       [1439, 4869]]))

まとめ

Facebook Resarch の近傍探索ライブラリである faiss について，自分への備忘も込めてちょっとニッチな機能を紹介しました．中途半端にニッチな記事になって誰が読むんだ...と突っ込まれそうですが，もし誰か一人の役にでもなれたら幸いです．今後も faiss のちょっとニッチな機能を試したら追記していこうと思います．

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up