MinHashによる類似度の計算のサンプルを実装してみた.
参考 b-Bit MinHashによる高速かつ省スペースな類似度判定
上記文献に従って, ハッシュの計算にはMurmurHash3(のpython wrapper)を使った.
minhash.py
# -*- coding: utf-8
import mmh3
MIN_HASH_VALUE = 2 ** 128
def min_hash(words, seed):
min_hash_word = None
min_hash_value = MIN_HASH_VALUE
for word in words:
hash_ = mmh3.hash128(word, seed)
if hash_ < min_hash_value:
min_hash_word = word
min_hash_value = hash_
return min_hash_word
def get_score(s1, s2, k):
"""Get the similarity between s1 and s2 using min-hash algorithm
k: the number of hash functions
"""
num_match = 0
for seed in xrange(k):
if min_hash(s1, seed) == min_hash(s2, seed):
num_match += 1
return float(num_match) / k
def main():
s1 = ['a', 'b']
s2 = ['a']
s3 = ['b']
# a-z
s4 = [chr(ascii) for ascii in xrange(97, 123)]
k = 2 ** 10
# Should be ~0.074
print get_score(s1, s4, k)
# Should be ~0.370
print get_score(s2, s4, k)
# Should be ~0.370
print get_score(s3, s4, k)
if __name__ == '__main__':
main()
[yamaneko]$ time python min_hash.py
0.078125
0.046875
0.03125
real 0m0.112s
user 0m0.076s
sys 0m0.015s
kの値を大きくした場合, 値が収束したが計算時間は増えた.
次はb-bit MinHashを実装したい.