Getting the high-frequency words from a Gensim Dictionary

Posted at 2022-11-14

Environment

Python 3.7.15
Gensim 3.6.0

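Getting the top n most frequent words from a Gensim Dictionary

gensim.corpora.Dictionary has a filter_n_most_frequent method that removes the given number of most frequent words, ranked by the number of documents they appear in.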

from gensim.corpora import Dictionary

corpus = [
    ["ビール", "寿司", "焼肉"],
    ["ハンバーグ", "寿司"],
    ["焼肉", "ビール", "寿司"],
]
dct = Dictionary(corpus)

# dct.dfs maps token id -> number of documents that contain the token
for token_id, doc_freq in dct.dfs.items():
    print(dct[token_id], doc_freq)

Output

ビール 2
寿司 3
焼肉 2
ハンバーグ 1
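
The numbers printed above are document frequencies: how many documents contain each word, not how many times the word occurs in total. As a rough illustration, the same counts can be reproduced with the standard library alone. This is only a sketch; document_frequencies is a hypothetical helper written for this example and is not part of Gensim.

```python
from collections import Counter

corpus = [
    ["ビール", "寿司", "焼肉"],
    ["ハンバーグ", "寿司"],
    ["焼肉", "ビール", "寿司"],
]


def document_frequencies(documents):
    """Count, for each word, how many documents contain it at least once."""
    counts = Counter()
    for doc in documents:
        # set() makes a word repeated inside one document count only once.
        counts.update(set(doc))
    return dict(counts)


df = document_frequencies(corpus)
for word in ["ビール", "寿司", "焼肉", "ハンバーグ"]:
    print(word, df[word])
```

Because each document is reduced to a set first, a word that is repeated within a single document would still add only 1 to its count.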

Calling filter_n_most_frequent(2) removes the top two words from the dictionary.

dct.filter_n_most_frequent(2)

for token_id, doc_freq in dct.dfs.items():
    print(dct[token_id], doc_freq)

Output

焼肉 2
ハンバーグ 1

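Conversely, if you want to get the high-frequency words that would be removed, Gensim looks like it should have a method for that, but it does not. You can, however, take the processing used inside Gensim's filter_n_most_frequent() implementation and adapt it slightly.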

def ids_to_words(dictionary: Dictionary, ids):
    # Map token ids back to their words
    return [dictionary[idx] for idx in ids]

def most_frequent_words_top_n(dictionary: Dictionary, top_n: int):
    # Collect every token id, sort by document frequency (highest first),
    # and keep only the first top_n ids
    most_frequent_ids = (v for v in dictionary.token2id.values())
    most_frequent_ids = sorted(most_frequent_ids, key=dictionary.dfs.get, reverse=True)
    most_frequent_ids = most_frequent_ids[:top_n]
    return ids_to_words(dictionary, most_frequent_ids)

corpus = [
    ["ビール", "寿司", "焼肉"],
    ["ハンバーグ", "寿司"],
    ["焼肉", "ビール", "寿司"],
]
dct = Dictionary(corpus)

print(most_frequent_words_top_n(dct, 2))

Output

['寿司', 'ビール']
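
Note that ビール and 焼肉 both appear in two documents, so which of them makes it into the top 2 depends on how ties are broken. Python's sorted() is stable even with reverse=True, so tied ids keep the order in which they were passed in. The sketch below uses hand-written dictionaries as stand-ins for dct.token2id and dct.dfs (the id values are assumptions for illustration, not read from Gensim) and also shows one possible explicit tie-break.

```python
# Hand-written stand-ins for dct.token2id and dct.dfs (assumed values).
token2id = {"ビール": 0, "寿司": 1, "焼肉": 2, "ハンバーグ": 3}
dfs = {0: 2, 1: 3, 2: 2, 3: 1}
id2token = {idx: word for word, idx in token2id.items()}

# Same ranking step as in most_frequent_words_top_n: a stable sort, so ids
# with equal document frequency keep their order from token2id.
ranked_ids = sorted(token2id.values(), key=dfs.get, reverse=True)
ranked_words = [id2token[idx] for idx in ranked_ids]
print(ranked_words[:2])

# Variant with an explicit tie-break: higher document frequency first,
# then the word itself in ascending code point order.
explicit = sorted(token2id, key=lambda w: (-dfs[token2id[w]], w))
print(explicit[:2])
```

If the order among tied words matters for your use case, an explicit key like the second one may be easier to reason about than relying on the order of token2id.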

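Getting words that appear in more than x% of the documents

In the same way, passing no_above to filter_extremes() removes words that appear in more than the given fraction of the documents.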

corpus = [
    ["ビール", "寿司", "焼肉"],
    ["ハンバーグ", "寿司"],
    ["焼肉", "ビール", "寿司"],
]
dct = Dictionary(corpus)

dct.filter_extremes(no_above=0.5, no_below=1)

for token_id, doc_freq in dct.dfs.items():
    print(dct[token_id], doc_freq)

Output

ハンバーグ 1
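
Only ハンバーグ survives because the fraction is converted to a whole number of documents with int(no_above * num_docs), which rounds down: with three documents, 0.5 becomes 1, so even a word found in two of the three documents is removed. The following sketch walks through that arithmetic for a few values. It is an illustration only: no_above_to_abs and kept_words are hypothetical helpers, and dfs_by_word is copied by hand from the first output in this article.

```python
def no_above_to_abs(no_above: float, num_docs: int) -> int:
    """Convert a fractional threshold to a document count, rounding down."""
    return int(no_above * num_docs)


# Document frequencies copied by hand from the first output above.
dfs_by_word = {"ビール": 2, "寿司": 3, "焼肉": 2, "ハンバーグ": 1}
num_docs = 3


def kept_words(no_above: float):
    """Words whose document frequency does not exceed the converted limit."""
    limit = no_above_to_abs(no_above, num_docs)
    return [word for word, df in dfs_by_word.items() if df <= limit]


for no_above in (0.5, 0.7, 1.0):
    print(no_above, no_above_to_abs(no_above, num_docs), kept_words(no_above))
```

With so few documents, small changes in no_above can move the limit by a whole document, so it may help to check the converted count before filtering.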

Here too, if you want to get those words rather than remove them, there is no such method, but you can get them by following the implementation of filter_extremes().

corpus = [
    ["ビール", "寿司", "焼肉"],
    ["ハンバーグ", "寿司"],
    ["焼肉", "ビール", "寿司"],
]
dct = Dictionary(corpus)

def most_frequent_words_rate(dictionary: Dictionary, threshold: float):
    # Same conversion as filter_extremes(): fraction -> absolute document count
    threshold_abs = int(threshold * dictionary.num_docs)
    # Keep the ids whose document frequency exceeds that limit
    ids = [v for v in dictionary.token2id.values() if dictionary.dfs.get(v, 0) > threshold_abs]
    return ids_to_words(dictionary, ids)

print(most_frequent_words_rate(dct, 0.5))

Output

['ビール', '寿司', '焼肉']
