
Extracting frequent words with morphological analysis

Last updated at 2020-09-23, posted at 2020-09-09

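What this code does
・Reads a txt file
・Runs morphological analysis and collects only the nouns into lists
・Flattens the lists and removes symbols to clean up the data
・Counts how often each word occurs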
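
The flattening, symbol-removal and counting steps in the list above can be tried on their own, without MeCab. The following is a minimal sketch under that assumption: `nouns_per_line` is hypothetical hand-tokenized input standing in for what the morphological analysis would return, and the sample words are made up for illustration.

```python
import collections
import itertools
import string

# Hypothetical pre-tokenized input: one list of nouns per line of text.
# In the real script these lists would come from MeCab.
nouns_per_line = [
    ["猫", "犬", "!"],
    ["猫", "(", "鳥"],
    ["猫", "犬"],
]

# Flatten the list of lists into a single list of words.
flat = list(itertools.chain.from_iterable(nouns_per_line))

# Strip ASCII punctuation from every word, then drop words that became empty.
table = str.maketrans("", "", string.punctuation)
words = [w.translate(table) for w in flat]
words = [w for w in words if w]

# Count occurrences; most_common() returns pairs sorted by descending count.
counts = collections.Counter(words)
print(counts.most_common())  # [('猫', 3), ('犬', 2), ('鳥', 1)]
```

Note that `string.punctuation` only contains ASCII symbols, so full-width Japanese punctuation such as 「、」 or 「。」 would pass through this filter unchanged.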

The code, for now

qiita.py


import MeCab
import itertools
import collections
import string

# Treat the whole data set as a source of nouns; read it in as a list of lines
with open("deta.txt", "r", encoding='utf-8') as f:
    a = f.read().splitlines()

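# Morphological analysis, pattern 2 (there are several ways to do this)

def split_text_only_noun(text):
    """Return the nouns found in text as one comma-separated string."""
    tokenizer = MeCab.Tagger()
    node = tokenizer.parseToNode(text)
    keywords = []
    while node:
        # The first field of node.feature is the part of speech
        if node.feature.split(",")[0] == "名詞":
            keywords.append(node.surface)
        node = node.next
    return ','.join(keywords)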

# Run morphological analysis on each element of list a and extract the nouns
r_aa = []

for aa in a:
    r_a = split_text_only_noun(aa).split(',')
    r_aa.append(r_a)

# Flatten the list of lists and remove symbols
r_aa = list(itertools.chain.from_iterable(r_aa))
kigo = string.punctuation  # ASCII punctuation only
table = str.maketrans('', '', kigo)

word_list = [bb.translate(table) for bb in r_aa]

# Remove empty strings
words_list = [w for w in word_list if w != '']

# Count how often each word occurs
c = collections.Counter(words_list)

# Write the results to a text file
with open('結果.txt', mode='w', encoding='utf-8') as f:
    for word, count in c.most_common():
        # Only keep words that occur more than 50 times
        if count > 50:
            f.write(f"{word}:{count}\n")

This gives decent results. It looks usable for computing the TF that TF-IDF needs.
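
As a rough idea of how the counts could be turned into TF, here is a minimal sketch. It assumes one common definition of term frequency (a word's count divided by the total number of words in the document); other definitions exist. The helper name `term_frequencies` and the sample words are hypothetical.

```python
import collections


def term_frequencies(words):
    """Return {word: count / total number of words} for a list of words."""
    counts = collections.Counter(words)
    total = sum(counts.values())
    if total == 0:
        # An empty document has no term frequencies.
        return {}
    return {word: n / total for word, n in counts.items()}


# Hypothetical cleaned-up word list, standing in for words_list.
sample_words = ["猫", "犬", "猫", "鳥"]
tf = term_frequencies(sample_words)
print(tf["猫"])  # 0.5
```

With this definition the values for one document always sum to 1, which makes documents of different lengths comparable.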

Below is code I am currently attempting (and stuck on) that tries to do the sorting without using collections.Counter.

zasetu.py


result = []

# For every position in the list, count how often that word occurs overall
for num_1 in range(len(words_list)):
    ct = 0
    for num_2 in range(len(words_list)):
        if words_list[num_1] == words_list[num_2]:
            ct += 1
    if ct > 50:
        result.append(words_list[num_1])
        result.append(ct)

print(result)

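With this, a frequent word gets counted again every time it appears, so the same word ends up in the result many times. I think the result would still have to be deduplicated and sorted afterwards.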
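
One possible way around the repeated counting, still without collections.Counter, is to keep the counts in a plain dict so each word is stored once, and then sort the pairs with `sorted()`. This is only a sketch: the helper names `count_words` and `frequent_words`, the sample data and the threshold of 1 are made up for illustration.

```python
def count_words(words):
    """Count occurrences with a plain dict, one entry per distinct word."""
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts


def frequent_words(words, threshold):
    """Return (word, count) pairs with count > threshold, most frequent first."""
    counts = count_words(words)
    frequent = [(w, n) for w, n in counts.items() if n > threshold]
    return sorted(frequent, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample standing in for words_list.
sample = ["a", "b", "a", "c", "a", "b"]
print(frequent_words(sample, 1))  # [('a', 3), ('b', 2)]
```

This makes a single pass over the list instead of comparing every word with every other word, and each word appears in the result only once.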

Conclusion: use collections.Counter (it is a class).

Continued in
Sequel: weighting without using a tf-idf function
