[Python3] Extracting nouns from a text file in order of frequency with MeCab

Last updated at 2018-09-18, posted at 2018-09-16

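Goal

Use MeCab from Python 3 to pull only the nouns out of a text file (specifically, tokens MeCab tags as 名詞 with subtype 一般, i.e. general nouns) and list them by number of occurrences.

Full code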

count_noun.py

import MeCab
import sys
import re
from collections import Counter

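# Read the input file given as the only command-line argument
cmd, infile = sys.argv
with open(infile) as f:
    data = f.read()

# Parse: by default MeCab emits one token per line as
# "surface<TAB>comma-separated features", so split on tab and comma
mecab = MeCab.Tagger()
parse = mecab.parse(data)
lines = parse.split('\n')
items = (re.split('[\t,]', line) for line in lines)

# Collect general nouns (名詞 / 一般) into a list,
# skipping EOS, empty lines, and a few noise tokens
words = [item[0]
         for item in items
         if (item[0] not in ('EOS', '', 't', 'ー') and
             item[1] == '名詞' and item[2] == '一般')]

# Print in descending order of frequency
counter = Counter(words)
for word, count in counter.most_common():
    print(f"{word}: {count}")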
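
To make the split-and-filter step easier to follow without MeCab installed, here is a minimal sketch that runs the same logic on a hard-coded sample string. The sample assumes IPADIC-style output (surface, a tab, then comma-separated features); other dictionaries may use different fields. `SAMPLE_OUTPUT` and `extract_general_nouns` are hypothetical names used only for illustration.

```python
import re
from collections import Counter

# Hypothetical sample imitating MeCab output in IPADIC-style format
# (an assumption): surface<TAB>POS,subtype1,subtype2,...
SAMPLE_OUTPUT = (
    "猫\t名詞,一般,*,*,*,*,猫,ネコ,ネコ\n"
    "は\t助詞,係助詞,*,*,*,*,は,ハ,ワ\n"
    "猫\t名詞,一般,*,*,*,*,猫,ネコ,ネコ\n"
    "主人\t名詞,一般,*,*,*,*,主人,シュジン,シュジン\n"
    "EOS\n"
)


def extract_general_nouns(mecab_output):
    """Return the surfaces whose POS is 名詞 and whose subtype is 一般."""
    words = []
    for line in mecab_output.split("\n"):
        item = re.split("[\t,]", line)
        # Skip EOS, empty lines, and anything without POS fields
        if len(item) < 3 or item[0] in ("EOS", ""):
            continue
        if item[1] == "名詞" and item[2] == "一般":
            words.append(item[0])
    return words


counter = Counter(extract_general_nouns(SAMPLE_OUTPUT))
for word, count in counter.most_common():
    print(f"{word}: {count}")
```

The particle は is dropped because its part of speech is 助詞, and the `len(item) < 3` check is an extra guard, not part of the original script, against lines that carry no feature fields.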

How to run

In a terminal, pass the text file as the argument:

$ python3 count_noun.py filename.txt
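
The script unpacks `sys.argv` directly, so running it with no argument or with extra arguments raises a `ValueError`. As one possible alternative, the sketch below uses `argparse` to get a usage message and an optional file encoding. The `--encoding` option and the names `parse_args` and `read_text` are assumptions added for illustration; they are not part of the original script.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; argv=None means use sys.argv[1:]."""
    parser = argparse.ArgumentParser(
        description="Count general nouns in a text file."
    )
    parser.add_argument("infile", help="path to the input text file")
    # Hypothetical option: lets the caller name the file's encoding
    parser.add_argument("--encoding", default="utf-8",
                        help="encoding of the input file (default: utf-8)")
    return parser.parse_args(argv)


def read_text(path, encoding):
    """Read the whole file as a single string."""
    with open(path, encoding=encoding) as f:
        return f.read()


# Demonstration with an explicit argument list instead of the real command line
args = parse_args(["filename.txt", "--encoding", "shift_jis"])
print(args.infile, args.encoding)
```

In the real script, calling `parse_args()` with no argument and passing `args.infile` and `args.encoding` to `read_text` would replace the `cmd, infile = sys.argv` line and the `open` call.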

Bonus

As a test, I looked up the 10 most frequent words in Natsume Soseki's "I Am a Cat" (吾輩は猫である, Aozora Bunko edition):

主人: 933
人: 357
迷亭: 329
先生: 274
顔: 273
人間: 272
猫: 248
細君: 213
鼻: 199
自分: 175

The word 猫 (cat), which appears in the title, shows up only 248 times in the work, which is surprisingly few.
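
The script prints every noun it finds, so a top-10 list like the one above corresponds to the first ten lines of its output. A minimal sketch of limiting the output directly: `Counter.most_common` accepts an optional count `n`. The word list here is made up for illustration and is not taken from the novel.

```python
from collections import Counter

# Made-up word list standing in for the nouns extracted by the script
words = ["主人", "猫", "主人", "先生", "猫", "主人"]

counter = Counter(words)

# most_common(n) returns only the n most frequent (word, count) pairs
top2 = counter.most_common(2)
for word, count in top2:
    print(f"{word}: {count}")
```

Changing the loop in `count_noun.py` to iterate over `counter.most_common(10)` would print only the ten most frequent nouns.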
