More than 5 years have passed since last update.

MinHashによる類似度の算出

Python

Last updated at 2014-07-25Posted at 2014-07-25

MinHashによる類似度の計算のサンプルを実装してみた.

上記文献に従って, ハッシュの計算にはMurmurHash3(のpython wrapper)を使った.

minhash.py

# -*- coding: utf-8


import mmh3


MIN_HASH_VALUE = 2 ** 128


def min_hash(words, seed):
    min_hash_word = None
    min_hash_value = MIN_HASH_VALUE

    for word in words:
        hash_ = mmh3.hash128(word, seed)
        if hash_ < min_hash_value:
            min_hash_word = word
            min_hash_value = hash_

    return min_hash_word


def get_score(s1, s2, k):
    """Get the similarity between s1 and s2 using min-hash algorithm

    k: the number of hash functions
    """
    num_match = 0

    for seed in xrange(k):
        if min_hash(s1, seed) == min_hash(s2, seed):
            num_match += 1

    return float(num_match) / k


def main():
    s1 = ['a', 'b']
    s2 = ['a']
    s3 = ['b']
    # a-z
    s4 = [chr(ascii) for ascii in xrange(97, 123)]

    k = 2 ** 10

    # Should be ~0.074
    print get_score(s1, s4, k)
    # Should be ~0.370
    print get_score(s2, s4, k)
    # Should be ~0.370
    print get_score(s3, s4, k)


if __name__ == '__main__':
    main()

[yamaneko]$ time python min_hash.py 
0.078125
0.046875
0.03125

real    0m0.112s
user    0m0.076s
sys     0m0.015s

kの値を大きくした場合, 値が収束したが計算時間は増えた.

次はb-bit MinHashを実装したい.

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up