Use Counter when you need to compute word frequencies

Last updated at 2017-12-09 / Posted at 2017-12-09

The Counter class is handy for counting how many times items such as words or numbers occur.
In this post I will introduce the Counter class with some working examples.

The Counter class

[Difficulty ★☆☆] A simple example

To start, here is an example of how the Counter class can be used.

Target data

names = ["Suzuki", "Tanaka", "Kato", "Suzuki", "Kato"]

numbers = [5, 2, 4, 5, 1, 3, 5, 6]

Counter counts how many times each element appears in a list.

Counting the elements

from collections import Counter

Counter(names)
"""
Counter({'Kato': 2, 'Suzuki': 2, 'Tanaka': 1})
"""

Counter(numbers)
"""
Counter({1: 1, 2: 1, 3: 1, 4: 1, 5: 3, 6: 1})
"""
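Because Counter is a subclass of dict, individual counts can also be looked up by key. The following is a minimal sketch that reuses the names list from above; the variable name name_counts and the key "Sato" are introduced here only for illustration.

```python
from collections import Counter

names = ["Suzuki", "Tanaka", "Kato", "Suzuki", "Kato"]
name_counts = Counter(names)  # name_counts is an illustrative name

# Look up the count of a single element, just like a dict
print(name_counts["Suzuki"])  # 2

# An element that never appeared returns 0 instead of raising KeyError
print(name_counts["Sato"])  # 0

# Total number of items that were counted
print(sum(name_counts.values()))  # 5
```

Looking up a missing key does not add it to the counter, so checking for unseen elements is safe.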

[Difficulty ★★☆] An applied example

With the data in the example above you can check the frequencies by eye, but what about data like the following?

Target data

import random

random.seed(1)
# Prepare 100 random integers between 1 and 10
numbers = [random.randint(1, 10) for i in range(100)]

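This is where the most_common method comes in handy.
Given an argument n, it returns the n most frequent elements together with their counts, in the following form:
[(element A, count), (element B, count), ...]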

Sorting by descending frequency

cnt = Counter(numbers)
# Show the 5 most frequent elements together with their counts
cnt.most_common(5)
"""
[(9, 14), (7, 13), (8, 13), (1, 12), (4, 12)]
"""
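most_common can also be called without an argument, in which case it returns every element ordered from most to least frequent. The sketch below uses a small made-up list (colors is a hypothetical example, not data from this article) so the result is easy to verify by hand, and it also shows the slicing idiom for getting the least frequent elements.

```python
from collections import Counter

# Hypothetical sample data, chosen so that no two counts are tied
colors = ["red", "blue", "red", "green", "blue", "red"]
color_counts = Counter(colors)

# No argument: all elements, most frequent first
print(color_counts.most_common())
# [('red', 3), ('blue', 2), ('green', 1)]

# Only the single most frequent element
print(color_counts.most_common(1))
# [('red', 3)]

# The 2 least frequent elements, least frequent first
print(color_counts.most_common()[:-3:-1])
# [('green', 1), ('blue', 2)]
```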

[Difficulty ★★★] A practical example using an English email

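That alone is not very interesting, so here is a slightly more practical example that combines the two examples above.
The test.txt file used in the program contains the sample English email from the page linked below.
Ready-to-use English email examples [Business English email edition] "Self-introduction" sample English email (今すぐ使える英語メール文例集【ビジネス英語メール編】「自己紹介」　サンプル英語メール)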

Examining word frequencies

from collections import Counter
import re
from pprint import pprint

pattern = "[,.;?!]"
file_name = "test.txt"
with open(file_name, encoding="utf-8") as txt:
    t = txt.read()
sentence = t.split("\n")

p = re.compile(pattern)
result = []

for s in sentence:
    # Remove punctuation from the line
    replaced_sentence = p.sub("", s)
    # Split on whitespace
    words = replaced_sentence.split()
    result.extend(words)

cnt = Counter(result)
# Show the 5 most frequent words together with their counts
pprint(cnt.most_common(5))
"""
[('I', 11), ('of', 5), ('a', 5), ('to', 5), ('your', 4)]
"""
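The program above treats words that differ only in case, such as "Your" and "your", as different words. One possible variation, sketched below, lowercases the text and extracts words with re.findall instead of stripping punctuation. The inline sample_text is a made-up stand-in for test.txt, and the pattern [a-z']+ is an assumption about what should count as a word.

```python
import re
from collections import Counter

# Hypothetical stand-in for the contents of test.txt
sample_text = (
    "I am glad to meet you. I hope to work with you.\n"
    "Nice to meet you, and thank you!"
)

# Lowercase first so that "Nice" and "nice" are counted together,
# then pull out runs of letters and apostrophes as words
words = re.findall(r"[a-z']+", sample_text.lower())
word_counts = Counter(words)

# Show the 2 most frequent words together with their counts
print(word_counts.most_common(2))
# [('you', 4), ('to', 3)]
```

Whether lowercasing is appropriate depends on the text; here it also turns the pronoun "I" into "i".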

That's all. Thanks for reading.

References

Python library reference
Counter class
