Prerequisites
- model is a trained gensim model
- Janome's Tokenizer is used to determine parts of speech
- Retrieve the 1,000 words most similar to '鬼' (oni, "demon"), then filter them
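Using a TokenFilter
from janome.analyzer import Analyzer
from janome.tokenfilter import POSKeepFilter, POSStopFilter

# Assumption: the analyzer `a` uses the same filters as the TokenFilter usage example
# (keep nouns; drop symbols and particles)
token_filters = [POSKeepFilter(['名詞']), POSStopFilter(['記号', '助詞'])]
a = Analyzer(token_filters=token_filters)

target_word = '鬼'
topn = 1000
# most_similar returns (word, similarity) pairs.
# On a full Word2Vec model in gensim 4.x, call model.wv.most_similar instead.
for word, score in model.most_similar(target_word, topn=topn):
    # Only tokens that pass the filters are yielded
    for token in a.analyze(word):
        print(token, score)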
TokenFilter usage example
from janome.analyzer import Analyzer
from janome.tokenfilter import POSKeepFilter, POSStopFilter

target_word = '「鬼の」'
# Keep nouns; drop symbols and particles.
# Both filters take a list of part-of-speech prefixes.
token_filters = [POSKeepFilter(['名詞']), POSStopFilter(['記号', '助詞'])]
a = Analyzer(token_filters=token_filters)
for token in a.analyze(target_word):
    print(token)
#=> 鬼 名詞,一般,*,*,*,*,鬼,オニ,オニ
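To make the filtering rule concrete, here is a minimal pure-Python sketch of the keep/stop behaviour. It assumes that each entry in a filter's list is matched as a prefix of the token's part-of-speech string. The functions `keep_pos` and `stop_pos` and the hand-written `(surface, part_of_speech)` tuples are hypothetical stand-ins for illustration, not Janome APIs.

```python
# Hypothetical stand-ins for Janome tokens: (surface, part_of_speech) pairs
# written by hand for the input '「鬼の」'.
tokens = [
    ('「', '記号,括弧開,*,*'),
    ('鬼', '名詞,一般,*,*'),
    ('の', '助詞,連体化,*,*'),
    ('」', '記号,括弧閉,*,*'),
]


def keep_pos(tokens, pos_list):
    """Yield only tokens whose POS string starts with an entry in pos_list."""
    for surface, pos in tokens:
        if any(pos.startswith(p) for p in pos_list):
            yield surface, pos


def stop_pos(tokens, pos_list):
    """Yield tokens whose POS string starts with none of the entries in pos_list."""
    for surface, pos in tokens:
        if not any(pos.startswith(p) for p in pos_list):
            yield surface, pos


# Chain the two steps in the same order as the token_filters list.
filtered = list(stop_pos(keep_pos(tokens, ['名詞']), ['記号', '助詞']))
print(filtered)
```

In this sketch the keep step already discards everything except nouns, so the stop step removes nothing further; it is included only to mirror the order of the filter list.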
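Without a TokenFilter
from janome.tokenizer import Tokenizer

t = Tokenizer()
target_word = '鬼'
topn = 1000
# On a full Word2Vec model in gensim 4.x, call model.wv.most_similar instead.
for word, score in model.most_similar(target_word, topn=topn):
    for token in t.tokenize(word):
        # part_of_speech is a comma-separated string such as '名詞,形容動詞語幹,*,*'
        pos0, pos1 = token.part_of_speech.split(',')[:2]
        # Keep only adjectives (形容詞) and na-adjectives (形容動詞).
        # The IPA dictionary tags na-adjective stems as 名詞,形容動詞語幹, so match on the prefix.
        if pos0 == '形容詞' or pos1.startswith('形容動詞'):
            print(token.surface, score, pos0, pos1)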
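The part-of-speech check can also be pulled out into a small helper so that the results are collected rather than printed. The following is a sketch under stated assumptions: `is_adjectival`, `filter_similar`, and the `tag_word` hook are hypothetical names, and the sample words, scores, and POS strings are hand-written illustrative values, not output from a real model or tokenizer.

```python
def is_adjectival(part_of_speech):
    """Return True for adjectives (形容詞) and na-adjective stems (形容動詞...)."""
    fields = part_of_speech.split(',')
    pos0 = fields[0]
    pos1 = fields[1] if len(fields) > 1 else ''
    return pos0 == '形容詞' or pos1.startswith('形容動詞')


def filter_similar(similar, tag_word):
    """Keep (word, score) pairs where at least one POS string is adjectival.

    similar:  iterable of (word, score) pairs, e.g. the list from most_similar
    tag_word: callable mapping a word to a list of POS strings (hypothetical hook)
    """
    return [(word, score) for word, score in similar
            if any(is_adjectival(pos) for pos in tag_word(word))]


# Hand-written sample data standing in for the model and the tokenizer.
sample_similar = [('怖い', 0.8), ('悪魔', 0.7), ('残酷', 0.6)]
sample_pos = {
    '怖い': ['形容詞,自立,*,*'],
    '悪魔': ['名詞,一般,*,*'],
    '残酷': ['名詞,形容動詞語幹,*,*'],
}
result = filter_similar(sample_similar, lambda w: sample_pos[w])
print(result)
```

With Janome, `tag_word` could be something like `lambda w: [tok.part_of_speech for tok in t.tokenize(w)]`. Note that this variant keeps a word once if any of its tokens matches, whereas the loop above prints every matching token.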