11
10

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?

More than 5 years have passed since last update.

MinHashによる類似度の算出

Last updated at Posted at 2014-07-25

MinHashによる類似度の計算のサンプルを実装してみた.

参考 b-Bit MinHashによる高速かつ省スペースな類似度判定

上記文献に従って, ハッシュの計算にはMurmurHash3(のpython wrapper)を使った.

minhash.py
# -*- coding: utf-8


import mmh3


MIN_HASH_VALUE = 2 ** 128


def min_hash(words, seed):
    min_hash_word = None
    min_hash_value = MIN_HASH_VALUE

    for word in words:
        hash_ = mmh3.hash128(word, seed)
        if hash_ < min_hash_value:
            min_hash_word = word
            min_hash_value = hash_

    return min_hash_word


def get_score(s1, s2, k):
    """Get the similarity between s1 and s2 using min-hash algorithm

    k: the number of hash functions
    """
    num_match = 0

    for seed in xrange(k):
        if min_hash(s1, seed) == min_hash(s2, seed):
            num_match += 1

    return float(num_match) / k


def main():
    s1 = ['a', 'b']
    s2 = ['a']
    s3 = ['b']
    # a-z
    s4 = [chr(ascii) for ascii in xrange(97, 123)]

    k = 2 ** 10

    # Should be ~0.074
    print get_score(s1, s4, k)
    # Should be ~0.370
    print get_score(s2, s4, k)
    # Should be ~0.370
    print get_score(s3, s4, k)


if __name__ == '__main__':
    main()
[yamaneko]$ time python min_hash.py 
0.078125
0.046875
0.03125

real    0m0.112s
user    0m0.076s
sys     0m0.015s

kの値を大きくした場合, 値が収束したが計算時間は増えた.

次はb-bit MinHashを実装したい.

11
10
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
11
10

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?