Speeding up vector similarity computation in Python

Last updated at 2018-10-08, posted at 2018-10-08

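After vectorizing 30,000 documents, computing the pairwise similarity over that many vectors was taking an enormous amount of time, so I looked into more efficient ways to compute it. Using the scipy library made it roughly 100 times faster, so I am leaving a note here.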

calc_similarity.py

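import numpy as np
from scipy.spatial import distance

# Build the matrix (M is a 30,000 x 100 array)
M = np.array([a.vector for a in articles])  # a.vector is a 100-element np.array
# Compute the all-pairs comparison for the 30,000 articles (cosine distance)
dist_M = distance.cdist(M, M, metric='cosine')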

Note that the result is the cosine distance, not the cosine similarity.
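Since scipy defines cosine distance as 1 minus cosine similarity, the similarity can be recovered with a single subtraction. The following is a minimal sketch, assuming a small random matrix as a stand-in for the article vectors (the names `sim_M` and `sim_direct` are illustrative and not from the original code):

```python
import numpy as np
from scipy.spatial import distance

# Small random stand-in for the 30,000 x 100 article matrix (assumption)
rng = np.random.default_rng(0)
M = rng.normal(size=(5, 100))

# cdist returns cosine distance, i.e. 1 - cosine similarity
dist_M = distance.cdist(M, M, metric='cosine')
sim_M = 1.0 - dist_M

# Cross-check against the definition: dot product of L2-normalized rows
normed = M / np.linalg.norm(M, axis=1, keepdims=True)
sim_direct = normed @ normed.T
```

With this conversion, the diagonal of `sim_M` is 1 (each vector compared with itself), and larger values mean more similar articles.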

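Since both arguments are the same array in this case, pdist, which computes only the upper triangle, is apparently the better choice (it should cut the computation time roughly in half). I will try it later.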
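A minimal sketch of the pdist variant, assuming a small random matrix in place of the real article vectors. It is not benchmarked here, so the halved runtime remains an expectation rather than a measured result. `pdist` returns a condensed 1-D array of the n*(n-1)/2 upper-triangle entries, and `squareform` expands it back to a full matrix when one is needed:

```python
import numpy as np
from scipy.spatial import distance

# Small random stand-in for the article matrix (assumption)
rng = np.random.default_rng(0)
M = rng.normal(size=(6, 100))

# Condensed form: one entry per unordered pair (i < j), length n*(n-1)/2
condensed = distance.pdist(M, metric='cosine')

# Expand to a symmetric n x n matrix with zeros on the diagonal
dist_full = distance.squareform(condensed)

# Same values as the cdist result for M against itself
dist_cdist = distance.cdist(M, M, metric='cosine')
```

For 30,000 vectors the full matrix holds 900 million entries, so keeping the condensed form and skipping `squareform` also roughly halves the memory needed.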
