[Python 3] Extracting nouns from a text file with MeCab, sorted by frequency


Goal

Use MeCab from Python 3 to extract only the nouns from a text file and list them by number of occurrences.

Full code

count_noun.py
import MeCab
import sys
import re
from collections import Counter

# Read the input file given on the command line
cmd, infile = sys.argv
with open(infile) as f:
    data = f.read()

# Parse the whole text with MeCab; each output line is
# "surface<TAB>POS,subcategory,..." plus a trailing "EOS" line
mecab = MeCab.Tagger()
parse = mecab.parse(data)
lines = parse.split('\n')
items = (re.split(r'[\t,]', line) for line in lines)

# Collect the surface forms of general nouns (名詞-一般) in a list
words = [item[0]
         for item in items
         if (item[0] not in ('EOS', '', 't', 'ー') and
             item[1] == '名詞' and item[2] == '一般')]

# Print each noun with its count, most frequent first
counter = Counter(words)
for word, count in counter.most_common():
    print(f"{word}: {count}")

How to run

Just run it in a terminal with the text file as an argument:

$ python3 count_noun.py ファイル名.txt

Bonus

As a test, I ran it on Natsume Sōseki's I Am a Cat (from Aozora Bunko) and looked at the top 10 most frequent nouns:


主人: 933
人: 357
迷亭: 329
先生: 274
顔: 273
人間: 272
猫: 248
細君: 213
鼻: 199
自分: 175


Surprisingly, 猫 (the "cat" of the title) appears only 248 times in the whole work.
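
A related caveat: the script counts only tokens tagged 名詞-一般, so nouns that IPADIC files under other subcategories (吾輩, for example, is tagged 名詞-代名詞) never enter the ranking at all. Relaxing the filter to any 名詞 would include them, at the cost of also counting numbers and dependent nouns. A sketch of that variation, not part of the original script:

# Variation: accept every 名詞 subcategory instead of only 一般
words = [item[0]
         for item in items
         if (item[0] not in ('EOS', '', 't', 'ー') and
             item[1] == '名詞')]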

