Analyzing the emotions in Twitter users' tweets with sentiment_ja

Last updated at 2019-10-09, posted at 2019-10-08

Sentiment analysis came up during a machine learning study group at work, so I analyzed the emotions of the residents of Twitter-land.

What I did

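Ran sentiment analysis on a large number of tweets to find out what Twitter users feel in general
- Steps
  1. Fetch a large number of tweets
  2. Analyze the emotions of each tweet and keep only the emotion(s) with the highest score
    - e.g. with happy: score 10, sad: score 10, disgust: score 9, only happy and sad are kept
  3. Count the extracted emotions
  4. Check which emotion occurs most often
- Used sentiment_ja for the sentiment analysis
- Used twitterscraper to fetch the tweets
- Emotions fall into six categories (following Paul Ekman's classification of facial expressions)
  - happy (happiness)
  - sad (sadness)
  - disgust (disgust)
  - angry (anger)
  - fear (fear)
  - surprise (surprise)
- Measurement period: only tweets from 2019-01-01 onward
- Retweets are excluded from the tally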
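
As a rough sketch of steps 2 to 4, the tie-aware extraction and counting could look like the following. The function names `top_emotions` and `tally` and the example scores are made up for illustration (they are not part of sentiment_ja), and the scores are assumed to be plain numbers.

```python
from collections import Counter


def top_emotions(scores):
    """Return every emotion label that shares the highest score."""
    best = max(scores.values())
    return [label for label, score in scores.items() if score == best]


def tally(score_dicts):
    """Count how often each emotion is a top emotion across all tweets."""
    counts = Counter()
    for scores in score_dicts:
        counts.update(top_emotions(scores))
    return counts


# Made-up scores for two imaginary tweets (not real results)
example_scores = [
    {"happy": 10, "sad": 10, "disgust": 9, "angry": 0, "fear": 0, "surprise": 0},
    {"happy": 1, "sad": 2, "disgust": 30, "angry": 5, "fear": 0, "surprise": 0},
]
example_counts = tally(example_scores)
print(example_counts.most_common())
```

Because ties are all counted, the sum of the counts can be larger than the number of tweets.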

Disclaimer

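Sarcasm is not detected.
- 「良いご身分ですね！」 ("Must be nice to be you!") -> classified as happy (sentiment_ja takes everything at face value)
There is no discussion of machine learning itself here.
Frankly, the accuracy of the resulting statistics is not good.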

How to run the tally

Follow the sentiment_ja page to get sentiment_ja ready to use
Follow the twitterscraper page to get twitterscraper ready to use (run pip install twitterscraper and so on)
Open sentimentja/sentiment.py under the sentiment_ja directory
Apply the code below, then run something like $ python sentimentja/sentiment.py to produce the tally

Code (roughly explained in the comments)

sentimentja/sentiment.py

# coding:utf-8

from keras.preprocessing.sequence import pad_sequences
from keras.models import load_model
from twitterscraper import query_tweets
import datetime
import json
import pickle
import tensorflow as tf


def preprocess(data, tokenizer, maxlen=280):
    # Convert the texts to token-id sequences and pad them to a fixed length
    return pad_sequences(tokenizer.texts_to_sequences(data), maxlen=maxlen)


def predict(sentences, graph, emolabels, tokenizer, model, maxlen):
    preds = []
    targets = preprocess(sentences, tokenizer, maxlen=maxlen)
    with graph.as_default():
        # Scale each model output by 100, round it, and store it as a string per label
        for i, ds in enumerate(model.predict(targets)):
            preds.append({
                "sentence": sentences[i],
                "emotions": dict(zip(emolabels, [str(round(100.0*d)) for d in ds]))
            })
    return preds


def load(path):
    model = load_model(path)
    # tf.get_default_graph() is the TensorFlow 1.x API
    graph = tf.get_default_graph()
    return model, graph


if __name__ == "__main__":
    maxlen = 280
    model, graph = load("sentimentja/model_2018-08-28-15:00.h5")

    with open("sentimentja/tokenizer_cnn_ja.pkl", "rb") as f:
        tokenizer = pickle.load(f)

    emolabels = ["happy", "sad", "disgust", "angry", "fear", "surprise"]

    # Fetch tweets, up to 100000. Fetching runs 64-way in parallel, so mind the load
    list_of_tweets = query_tweets(
        "lang:ja", begindate=datetime.date(2019, 1, 1), limit=100000, poolsize=64)
    # From the fetched tweets, keep only those that are not retweets
    # and take just the tweet body (text)
    list_of_tweets_text = [tweet.text for tweet in
        list_of_tweets if tweet.is_retweet == 0]

    # Run the sentiment analysis
    text_with_emotion_list = predict(
        list_of_tweets_text, graph, emolabels, tokenizer, model, maxlen)

    # Container for counting the emotions
    emo_count = {
        "happy": 0, "sad": 0, "disgust": 0, "angry": 0, "fear": 0, "surprise": 0
    }
    for text_with_emotion in text_with_emotion_list:
        emotions = text_with_emotion['emotions']
        # Keep only the emotion(s) with the highest score and count them (ties are all kept)
        max_score = max(float(score) for score in emotions.values())
        max_emos = [emo for emo, score in emotions.items() if float(score) == max_score]

        for max_emo in max_emos:
            emo_count[max_emo] += 1

    # Show the result (number of tweets analyzed and the tally)
    print("-------------")
    print(len(text_with_emotion_list))
    print(str(emo_count))
    print("=============")
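
The script above only prints the raw counts, so step 4 (checking which emotion is most common) is left to the reader. A minimal sketch of that last step, assuming a dict shaped like `emo_count`, might look like this. The helper name `summarize` and the sample counts are hypothetical and are not the results of this article.

```python
def summarize(emo_count):
    """Return (most frequent emotions, share of each emotion in percent)."""
    total = sum(emo_count.values())
    if total == 0:
        # Nothing was counted, so there is no winner and no share to report
        return [], {}
    shares = {emo: 100.0 * n / total for emo, n in emo_count.items()}
    top = max(emo_count.values())
    winners = [emo for emo, n in emo_count.items() if n == top]
    return winners, shares


# Placeholder counts for illustration only (not the article's results)
sample_count = {"happy": 3, "sad": 1, "disgust": 0, "angry": 0, "fear": 0, "surprise": 0}
winners, shares = summarize(sample_count)
print(winners, shares)
```

Since tied emotions are each counted once, the shares here are relative to the total number of counted top emotions, not to the number of tweets.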