
Kaggle Notes ~NLP with Disaster Tweets, Part 1~

Posted at 2020-09-06, last updated at 2020-09-07

Taking on Kaggle

I had not touched Kaggle for a while, so I decided to give it another try.

This is the competition I am taking on:
Real or Not? NLP with Disaster Tweets
https://www.kaggle.com/c/nlp-getting-started

First, load the dataset into DataFrames.

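import os
import pandas as pd

# Read every CSV under the competition input directory into a dict keyed by file name (without ".csv")
dfs = {}
for dirname, _, filenames in os.walk('../input/nlp-getting-started'):
    for filename in filenames:
        if filename.endswith('.csv'):
            path = os.path.join(dirname, filename)
            dfs[filename.replace('.csv', '')] = pd.read_csv(path)

# Explicit names instead of creating variables with exec()
train_df = dfs['train']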

My guess was that certain words correlate with tweets about real disasters, so I wrote the following code.

# Split each tweet into words and store one row per word, together with the tweet's target label
words_df = (
    train_df[['text', 'target']]
    .assign(words=lambda d: d['text'].str.split(' '))
    .explode('words')
    .rename(columns={'target': 'target_count'})
    .loc[:, ['words', 'target_count']]
    .reset_index(drop=True)
)

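# Keep only words longer than 5 characters as a rough way to drop stop words
long_words_df = words_df[words_df['words'].str.len() > 5]
# Group identical words and sum the target labels (the number of occurrences in disaster tweets), largest first
long_words_df.groupby(['words']).sum().sort_values("target_count", ascending=False)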

The results are shown below.
It is intriguing that the word "Hiroshima" ranks so high.

words	target_count
California	86
killed	86
people	83
suicide	71
disaster	59
Hiroshima	58
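
One caveat about the ranking above: target_count is a raw sum, so a word can rank high simply because it appears often, whether or not the tweets containing it are mostly about disasters. A rough way to check the correlation idea is to also look at the share of a word's occurrences that fall in disaster tweets. The following is a minimal sketch of that, not the code used for the results above. The function name word_disaster_stats, the min_len and min_count thresholds, and the sample rows are all assumptions made up for illustration.

```python
from collections import Counter


def word_disaster_stats(rows, min_len=6, min_count=2):
    """Return (word, occurrences, disaster_count, disaster_rate) tuples.

    `rows` is an iterable of (text, target) pairs with target in {0, 1}.
    `min_len` and `min_count` are illustrative thresholds, not tuned values.
    """
    occurrences = Counter()
    disaster_counts = Counter()
    for text, target in rows:
        for word in text.split(' '):
            # Same rough stop-word filter as above: keep words longer than 5 characters
            if len(word) >= min_len:
                occurrences[word] += 1
                disaster_counts[word] += target
    stats = [
        (word, n, disaster_counts[word], disaster_counts[word] / n)
        for word, n in occurrences.items()
        if n >= min_count  # skip words too rare for the rate to mean much
    ]
    # Highest disaster rate first; ties broken by how often the word occurs
    return sorted(stats, key=lambda s: (s[3], s[1]), reverse=True)


# Tiny made-up sample for illustration only (not taken from the competition data)
sample_rows = [
    ('wildfire spreading near village', 1),
    ('wildfire evacuation ordered tonight', 1),
    ('people love this wildfire mixtape', 0),
    ('people killed after bridge collapse', 1),
]
stats = word_disaster_stats(sample_rows)
for word, n, hits, rate in stats:
    print(f"{word}\t{n}\t{hits}\t{rate:.2f}")
```

With the DataFrame from this article, the rows could be supplied as zip(train_df['text'], train_df['target']). Sorting by rate rather than by raw sum should push down words that are merely common, at the cost of needing a minimum-count cutoff so that rare words do not dominate.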
