自然言語処理をpythonで書く中で大文字のみのカウンターを書く場面があったので忘れないようにメモ
・大文字から始まる単語のカウント
・単語の先頭のみが大文字の場合のみ
・先頭大文字+数字なども対象
※jupyter notebookで書いているため言語指定などは行っていません。。。
import re
from collections import Counter
org_text = """
Of course a writer can write from the viewpoint of Southern slave owners who of course would be racist and
view black people with, at best, benevolent paternalism, without her characters'
attitudes necessarily being her own. But it's very apparent that Mitchell was wholly and uncritically sympathetic to her antebellum ancestors.
The Old South was a graceful, chivalrous land where slavery was not a horrible and oppressive institution creating generations of misery and oppression,
but a divinely-ordained means of preserving racial harmony. And for all that Mitchell,
like her characters, probably considered herself to be kind and affectionate to all the African-Americans she knew personally,
there isn't a single black character in the book who isn't an ignorant, semi-human ape -- which is literally how they are described.
Even beloved Mammy is repeatedly compared to a monkey.
"""
#文字列を区切って配列へ
words= re.split(r'\s|\,|\.|\(|\)',org_text)
#大文字から始まる単語のみ抽出
r = re.compile("^[A-Z]$|^[A-Z][a-z0-9]+$")
dict_word=[x for x in words if r.match(x)]
counter=Counter(dict_word)
for word,count in counter.most_common():
print("%s,%d" % (word,count))