[Python3] Extracting nouns from a text file in order of frequency with MeCab

Posted at 2018-09-16

Goal

Use MeCab from Python 3 to extract only the nouns from a text file and list them by number of occurrences.

Full code

count_noun.py
import MeCab
import sys
import re
from collections import Counter

# Read the input file given as the single command-line argument
cmd, infile = sys.argv
with open(infile) as f:
    data = f.read()

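# Parse: MeCab returns one token per line as "surface<TAB>feature,feature,..."
mecab = MeCab.Tagger()
parse = mecab.parse(data)
lines = parse.split('\n')
# Split each line on tabs and commas: [surface, POS, subcategory, ...]
items = (re.split('[\t,]', line) for line in lines)
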
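# Collect general nouns (POS '名詞', subcategory '一般') into a list.
# The length check skips 'EOS' and empty lines, which have no feature fields.
words = [item[0]
         for item in items
         if (len(item) > 2 and item[0] != '' and
             item[1] == '名詞' and item[2] == '一般')]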

# Print in descending order of frequency
counter = Counter(words)
for word, count in counter.most_common():
    print(f"{word}: {count}")
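
To make the parsing step concrete without requiring a MeCab installation, here is a minimal sketch that applies the same field-splitting idea to a hand-written sample in the `surface<TAB>POS,subcategory,...` layout the script expects. The sample text and the helper name `general_nouns` are illustrative assumptions, not real MeCab output and not part of count_noun.py; the actual feature fields depend on the installed dictionary. Unlike the script, the sketch splits on the first tab only, so a surface form containing a comma would stay in one piece.

```python
from collections import Counter

# Hand-written sample in the assumed "surface<TAB>features" layout.
# Real MeCab output depends on the installed dictionary.
SAMPLE = (
    "猫\t名詞,一般,*,*,*,*,猫,ネコ,ネコ\n"
    "は\t助詞,係助詞,*,*,*,*,は,ハ,ワ\n"
    "猫\t名詞,一般,*,*,*,*,猫,ネコ,ネコ\n"
    "、\t記号,読点,*,*,*,*,、,、,、\n"
    "主人\t名詞,一般,*,*,*,*,主人,シュジン,シュジン\n"
    "EOS\n"
)


def general_nouns(parsed):
    """Yield surfaces whose POS is 名詞 and subcategory is 一般 (hypothetical helper)."""
    for line in parsed.splitlines():
        surface, sep, features = line.partition("\t")
        if not sep:
            continue  # "EOS" and empty lines contain no tab
        fields = features.split(",")
        if len(fields) > 1 and fields[0] == "名詞" and fields[1] == "一般":
            yield surface


counts = Counter(general_nouns(SAMPLE))
for word, count in counts.most_common():
    print(f"{word}: {count}")
```

With this sample, the particle, the punctuation mark, and the `EOS` line are skipped, and only 猫 and 主人 are counted.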

How to run

Pass the text file as a command-line argument in the terminal:

$ python3 count_noun.py filename.txt
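
Because the script unpacks `sys.argv` into exactly two names, running it with no argument or with extra arguments raises a `ValueError`. One possible alternative, sketched here, is to use the standard `argparse` module. The `--encoding` and `--top` options and the file name `neko.txt` are hypothetical additions for illustration, not part of the original script.

```python
import argparse


def build_parser():
    """Build a command-line parser (hypothetical replacement for sys.argv unpacking)."""
    parser = argparse.ArgumentParser(
        description="Count general nouns in a text file.")
    parser.add_argument("infile", help="path to the input text file")
    parser.add_argument("--encoding", default="utf-8",
                        help="encoding of the input file (assumed default: utf-8)")
    parser.add_argument("--top", type=int, default=None,
                        help="print only the N most frequent words")
    return parser


# An explicit argument list is parsed here so the sketch runs without a real
# file; a script would call build_parser().parse_args() with no arguments.
args = build_parser().parse_args(["neko.txt", "--top", "10"])
print(args.infile, args.encoding, args.top)
```

The parsed values could then be passed to `open(args.infile, encoding=args.encoding)` and `counter.most_common(args.top)`.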

Bonus

As a test, I looked up the 10 most frequent words in Natsume Soseki's "I Am a Cat" (Aozora Bunko edition):


主人: 933
人: 357
迷亭: 329
先生: 274
顔: 273
人間: 272
猫: 248
細君: 213
鼻: 199
自分: 175

The word 猫 ("cat"), which appears in the title, occurs only 248 times in the work, which is surprisingly few.
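
The script as listed prints every noun it finds. A shortened list like the top 10 above can be obtained by passing a limit to `Counter.most_common`. A minimal sketch with made-up toy data (the word list is not taken from the novel):

```python
from collections import Counter

# Toy word list standing in for the extracted nouns (made-up data).
words = ["主人", "猫", "主人", "顔", "猫", "主人"]

counter = Counter(words)
top_two = counter.most_common(2)  # at most the 2 most frequent entries
for word, count in top_two:
    print(f"{word}: {count}")
```

In count_noun.py, the equivalent change would be `counter.most_common(10)` in the final loop.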

