Getting the high-frequency words from a Gensim Dictionary

Environment

  • Python 3.7.15
  • Gensim 3.6.0

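Getting the top n most frequent words from a Gensim Dictionary

gensim.corpora.Dictionary has a filter_n_most_frequent method that removes the given number of most frequent words, where frequency means the number of documents a word appears in.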

from gensim.corpora import Dictionary

corpus = [
    ["ビール", "寿司", "焼肉"],
    ["ハンバーグ", "寿司"],
    ["焼肉","ビール", "寿司"]
]
dct = Dictionary(corpus)

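# dct.dfs maps each token id to the number of documents that contain the token
for token_id, doc_freq in dct.dfs.items():
    print(dct[token_id], doc_freq)
Output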
ビール 2
寿司 3
焼肉 2
ハンバーグ 1
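
The counts printed above come from dct.dfs, which records how many documents contain each word, not how many times the word occurs in total. The following is a minimal Gensim-free sketch of that difference; the helper names document_frequencies and term_frequencies and the two-document sample are illustrative assumptions, not part of Gensim.

```python
from collections import Counter


def document_frequencies(docs):
    """Count, for each word, how many documents contain it at least once."""
    df = {}
    for doc in docs:
        # dict.fromkeys() drops repeats inside one document, keeping first-seen order
        for word in dict.fromkeys(doc):
            df[word] = df.get(word, 0) + 1
    return df


def term_frequencies(docs):
    """Count every occurrence of each word across all documents."""
    return Counter(word for doc in docs for word in doc)


# Hypothetical sample: "ビール" occurs twice, but only in one document
docs = [
    ["ビール", "ビール", "寿司"],
    ["寿司"],
]
print(document_frequencies(docs))    # {'ビール': 1, '寿司': 2}
print(dict(term_frequencies(docs)))  # {'ビール': 2, '寿司': 2}
```

A word repeated many times within a single document therefore still counts as 1 in a ranking based on document frequency.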

Calling filter_n_most_frequent(2) removes the top 2 words.

dct.filter_n_most_frequent(2)

for token_id, doc_freq in dct.dfs.items():
    print(dct[token_id], doc_freq)
Output
焼肉 2
ハンバーグ 1

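Conversely, if you want to get the high-frequency words that would be removed, Gensim looks like it should have a method for that, but it does not. You can, however, take the logic used inside the implementation of filter_n_most_frequent() and adapt it slightly.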

def ids_to_words(dictionary: Dictionary, ids):
    # Map token ids back to their words
    return [dictionary[idx] for idx in ids]

def most_frequent_words_top_n(dictionary: Dictionary, top_n: int):
    # Sort token ids by document frequency, highest first, and keep the top n
    most_frequent_ids = sorted(dictionary.token2id.values(), key=dictionary.dfs.get, reverse=True)
    most_frequent_ids = most_frequent_ids[:top_n]
    return ids_to_words(dictionary, most_frequent_ids)

corpus = [
    ["ビール", "寿司", "焼肉"],
    ["ハンバーグ", "寿司"],
    ["焼肉","ビール", "寿司"]
]
dct = Dictionary(corpus)

print(most_frequent_words_top_n(dct, 2))
Output
['寿司', 'ビール']
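
In this corpus ビール and 焼肉 both appear in 2 documents, so which of them takes second place depends on how ties are broken. The sketch below reproduces the same sorting step in plain Python to show this; a word-keyed dict stands in for dictionary.dfs, and the function name top_n_by_document_frequency is an assumption made for illustration.

```python
def top_n_by_document_frequency(df, top_n):
    """Return the top_n words by document frequency, highest first."""
    # sorted() is stable, and reverse=True still keeps equal-count words
    # in their original (insertion) order
    return sorted(df, key=df.get, reverse=True)[:top_n]


# Same document frequencies as the example corpus, keyed by word
df = {"ビール": 2, "寿司": 3, "焼肉": 2, "ハンバーグ": 1}
print(top_n_by_document_frequency(df, 2))  # ['寿司', 'ビール']

# Same counts, different insertion order: the tie is resolved differently
reordered = {"焼肉": 2, "寿司": 3, "ビール": 2, "ハンバーグ": 1}
print(top_n_by_document_frequency(reordered, 2))  # ['寿司', '焼肉']
```

If the choice among tied words matters, it may be safer to add an explicit secondary sort key than to rely on insertion order.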

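Getting the words that appear in more than x% of documents

In the same way, the no_above parameter of filter_extremes() lets you exclude words that appear in more than x% of the documents.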

corpus = [
    ["ビール", "寿司", "焼肉"],
    ["ハンバーグ", "寿司"],
    ["焼肉","ビール", "寿司"]
]
dct = Dictionary(corpus)

dct.filter_extremes(no_above=0.5, no_below=1)

for token_id, doc_freq in dct.dfs.items():
    print(dct[token_id], doc_freq)
Output
ハンバーグ 1

Here too, there is no method that returns these words instead of removing them, but you can get them by following the implementation of filter_extremes().

corpus = [
    ["ビール", "寿司", "焼肉"],
    ["ハンバーグ", "寿司"],
    ["焼肉","ビール", "寿司"]
]
dct = Dictionary(corpus)

def most_frequent_words_rate(dictionary: Dictionary, threshold: float):
    # Convert the ratio into an absolute document count, as filter_extremes() does
    threshold_abs = int(threshold * dictionary.num_docs)
    # Keep the ids whose document frequency exceeds that count
    ids = [v for v in dictionary.token2id.values() if dictionary.dfs.get(v, 0) > threshold_abs]
    return ids_to_words(dictionary, ids)

print(most_frequent_words_rate(dct, 0.5))
Output
['ビール', '寿司', '焼肉']
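
The result depends on how the ratio is turned into a document count: int() truncates, and the comparison is strictly greater than. The following plain-Python sketch works through that arithmetic, assuming the same int() conversion as the function above; words_above_ratio and the four-document sample are hypothetical names and data used only for illustration.

```python
def words_above_ratio(df, num_docs, threshold):
    """Return words whose document frequency exceeds int(threshold * num_docs)."""
    threshold_abs = int(threshold * num_docs)  # truncates toward zero
    return [word for word, count in df.items() if count > threshold_abs]


# Document frequencies of the 3-document example corpus, keyed by word
df = {"ビール": 2, "寿司": 3, "焼肉": 2, "ハンバーグ": 1}

print(words_above_ratio(df, 3, 0.5))  # int(1.5) == 1 -> ['ビール', '寿司', '焼肉']
print(words_above_ratio(df, 3, 0.7))  # int(2.1) == 2 -> ['寿司']
print(words_above_ratio(df, 3, 1.0))  # int(3.0) == 3 -> []

# Hypothetical 4-document case: a word in exactly half the documents is NOT returned
print(words_above_ratio({"a": 2, "b": 3}, 4, 0.5))  # int(2.0) == 2 -> ['b']
```

Because of the strict comparison, a word that appears in exactly x% of the documents is left out whenever x% of the document count is a whole number.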
