
Sentiment Analysis That Even a Cat Could Build

Last updated at 2022-10-26. Posted at 2022-10-22.

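Sentiment analysis is a natural language processing task that predicts the emotion expressed in an input text. This article shows that preparing the data, training a model, and running it can all be done easily.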

Note: Run this at your own risk.

1. Data collection

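sentiment140 is an English sentiment analysis dataset. It pairs tweets with emotions by searching for tweets that contain emoticons (Western-style text smileys). (This kind of automatic annotation is called distant supervision.)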

Emoji are widely available now, so searching by emoji makes it possible to collect Japanese data as well.
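The emoji-based labeling described above can be sketched as follows. This is only a minimal illustration: the `EMOJI_LABELS` mapping mirrors the emoji used later in this article, while the helper name `label_by_emoji` is hypothetical and not part of any library.

```python
# Hypothetical helper illustrating distant supervision with emoji.
# The mapping mirrors the emoji/label pairs used later in this article.
EMOJI_LABELS = {
    "😊": "happy",
    "😢": "sad",
    "🤢": "disgust",
    "😠": "angry",
    "😮": "surprise",
    "😨": "fear",
}

def label_by_emoji(text):
    """Return (text with the label emoji removed, set of labels found)."""
    # A label applies when its emoji appears anywhere in the text.
    labels = {name for emoji, name in EMOJI_LABELS.items() if emoji in text}
    # Strip the emoji so that the label is not left inside the text itself.
    for emoji in EMOJI_LABELS:
        text = text.replace(emoji, " ")
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split()), labels

print(label_by_emoji("今日は楽しかった😊"))
```

A tweet can contain several emoji, so the helper returns a set of labels rather than a single one.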

The following tool is used to collect the data.
https://github.com/sugiyamath/twdl

git clone --depth=1 https://github.com/sugiyamath/twdl.git
cd twdl && pip3 install . -r requirements.txt

An example of collecting tweets:

twdl -s "Donald Trump" --since 2022-01-01 --until 2022-09-01

Install tor and parallel.

sudo apt install tor parallel

The following Python code generates date-range data, which is used to run twdl in parallel.

generate_date.py

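import datetime

# Print one pair of consecutive dates per line, going back from today.
today = datetime.datetime.now()
n = (2022-2007)*365  # roughly the number of days from 2007 to 2022
for i in range(n):
    d1 = datetime.timedelta(days = i)
    d2 = datetime.timedelta(days = i+1)
    a = today - d1  # later date (passed to --until by gather.py)
    b = today - d2  # one day earlier (passed to --since by gather.py)
    print(a.strftime("%Y-%m-%d")+" "+b.strftime("%Y-%m-%d"))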

python3 generate_date.py > dates.txt

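Next, create the collection script. The variable q holds the emotions you want to use for sentiment analysis; set it however you like. (From what I tried, using 😃 for happy may work better than using 😊.)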

gather.py

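from subprocess import check_output
import sys
# argv[1] is one line of dates.txt: "<until> <since>"
a, b = sys.argv[1].strip().split()
# Emoji to search for; lang:ja restricts the search to Japanese tweets
q = "(😊 OR 😢 OR 🤢 OR 😠 OR 😮 OR 😨 OR 😐) lang:ja"
# Run twdl through tor and append the results to tweets.txt (10-minute timeout)
check_output(f"torsocks -i twdl -s \"{q}\" --until {a} --since {b} >> tweets.txt", timeout=60*10, shell=True)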

Feed in the date ranges and collect in parallel. Start tor beforehand.

cat dates.txt | parallel -j 100 python3 gather.py {}

Leave it running for roughly a day and the tweets will be saved to tweets.txt. Whether collection succeeds, however, depends on whether tor gets blocked.

2. Running in Jupyter

An example run is published below.

Install beforehand

pip3 install scikit-learn scipy numpy pandas sudachipy sudachidict_core

Import what is needed.

In[1]:

import re
import numpy as np
import pandas as pd
from sudachipy import tokenizer
from sudachipy import dictionary
from scipy.sparse import vstack as svstack
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
tokenizer_obj = dictionary.Dictionary().create()
mode = tokenizer.Tokenizer.SplitMode.C

Load the data.

In[2]:

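data = []
# Patterns for mentions/hashtags and URLs, each matched up to the next space
regs = [r"[@#].+? ", r"http.+? "]
# (label name, emoji) pairs
labels = list(zip("happy,sad,disgust,angry,surprise,fear".split(","), "😊 😢 🤢 😠 😮 😨".split(" ")))
with open("tweets.txt", "r", encoding="utf-8") as f:
    for line in f:
        line = line.split("[:::]")[0]
        # Drop the first five whitespace-separated fields and keep the rest as the text
        x = ' '.join(line.strip().split()[5:])+" "
        for r in regs:
            x = re.sub(r, ' ', x)
        # Tokenize with Sudachi and join the dictionary forms with spaces
        x = ' '.join([m.dictionary_form() for m in tokenizer_obj.tokenize(x, mode)]).strip()
        # One boolean per label: does the text contain that label's emoji?
        y = {lab[0]: (lab[1] in x) for lab in labels}
        # Remove the label emoji from the text
        for lab in labels:
            x = x.replace(lab[1], " ")
        # Skip tweets that have neither a label emoji nor the neutral 😐
        if "😐" not in x and not any(a for _, a in y.items()):
            continue
        y["text"] = x
        data.append(y)
df = pd.DataFrame(data)
del data
df.head()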

The steps are as follows.

Vectorize the text with tf-idf.
Split the data with train_test_split for each label.
Train logistic regression on the training data.
Print classification_report on the test data.

In[3]:

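X = df["text"]
# Fit tf-idf on all texts, ignoring terms in fewer than 3 documents or in more than 30% of them
vect = TfidfVectorizer(min_df=3, max_df=0.3).fit(X)
X_vec = vect.transform(X)
models = []
# Train one binary classifier per label column (every column except the last, "text")
for c in df.columns[:-1]:
    y = df[c]
    inds1 = np.where(y==True)[0]
    # Sample as many negatives as positives (np.random.choice samples with replacement by default)
    inds2 = np.random.choice(np.where(y==False)[0], len(inds1))
    X_sel = svstack([X_vec[inds1], X_vec[inds2]])
    y_sel = np.array(y[inds1].tolist() + y[inds2].tolist())
    # Hold out 1% of the balanced data for testing
    X_train, X_test, y_train, y_test = train_test_split(X_sel, y_sel, shuffle=True, test_size=0.01)
    clf = LogisticRegression(solver="liblinear", penalty="l1").fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # Reports are printed in column order: happy, sad, disgust, angry, surprise, fear
    print(classification_report(y_test, y_pred))
    models.append((c, clf))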

Out[3]:

              precision    recall  f1-score   support

       False       0.79      0.80      0.79    112085
        True       0.80      0.78      0.79    111253

    accuracy                           0.79    223338
   macro avg       0.79      0.79      0.79    223338
weighted avg       0.79      0.79      0.79    223338

              precision    recall  f1-score   support

       False       0.76      0.78      0.77     23322
        True       0.78      0.76      0.77     23509

    accuracy                           0.77     46831
   macro avg       0.77      0.77      0.77     46831
weighted avg       0.77      0.77      0.77     46831

              precision    recall  f1-score   support

       False       0.76      0.73      0.75       720
        True       0.74      0.77      0.76       723

    accuracy                           0.75      1443
   macro avg       0.75      0.75      0.75      1443
weighted avg       0.75      0.75      0.75      1443

              precision    recall  f1-score   support

       False       0.83      0.80      0.81      1978
        True       0.81      0.83      0.82      1986

    accuracy                           0.82      3964
   macro avg       0.82      0.82      0.82      3964
weighted avg       0.82      0.82      0.82      3964

              precision    recall  f1-score   support

       False       0.74      0.66      0.70      2320
        True       0.69      0.76      0.72      2291

    accuracy                           0.71      4611
   macro avg       0.71      0.71      0.71      4611
weighted avg       0.71      0.71      0.71      4611

              precision    recall  f1-score   support

       False       0.81      0.76      0.78      5180
        True       0.77      0.83      0.80      5152

    accuracy                           0.79     10332
   macro avg       0.79      0.79      0.79     10332
weighted avg       0.79      0.79      0.79     10332

Save the models.

In[4]:

import pickle
with open("model.pkl", "wb") as f:
    pickle.dump((vect, tuple(models)), f)

3. Example run

import pickle
import pprint
from sudachipy import tokenizer
from sudachipy import dictionary

with open("model.pkl", "rb") as f:
    vect, models = pickle.load(f)

tokenizer_obj = dictionary.Dictionary().create()
mode = tokenizer.Tokenizer.SplitMode.C

def tok(x):
    return ' '.join([m.dictionary_form() for m in tokenizer_obj.tokenize(x, mode)]).strip()

texts = [
    "ティファの技って面白いですね",
    "エアリスが死んで悲しいよ",
    "クラウドって自称ソルジャーとかキモいね",
    "バレットって叫んでばっかでうるさくてムカつく",
    "セフィロスチョー強くてびっくり",
    "タークスとかいう闇組織こわ",]

v = vect.transform(tok(x) for x in texts)
out = [{"text": x} for x in texts]
for n, m in models:
    for i, p in enumerate(m.predict_proba(v)[:,1]):
        out[i][n] = p

pprint.pprint(out)

[Output]

[{'angry': 0.4224291515429129,
  'disgust': 0.4173733262323974,
  'fear': 0.359762507050206,
  'happy': 0.7178410456561672,
  'sad': 0.1611998429397086,
  'surprise': 0.5737928576153907,
  'text': 'ティファの技って面白いですね'},
 {'angry': 0.28790300106603023,
  'disgust': 0.6518970091207457,
  'fear': 0.5216814638042353,
  'happy': 0.012175882719503922,
  'sad': 0.9862391779299033,
  'surprise': 0.2140938162742958,
  'text': 'エアリスが死んで悲しいよ'},
 {'angry': 0.8017693086335025,
  'disgust': 0.914084995109833,
  'fear': 0.8251661193502758,
  'happy': 0.2468809876476842,
  'sad': 0.3667284373890585,
  'surprise': 0.6902082109170887,
  'text': 'クラウドって自称ソルジャーとかキモいね'},
 {'angry': 0.9991787157146794,
  'disgust': 0.8073388378968666,
  'fear': 0.40859210600758866,
  'happy': 0.07045632207386164,
  'sad': 0.2319066526171211,
  'surprise': 0.6695058352724972,
  'text': 'バレットって叫んでばっかでうるさくてムカつく'},
 {'angry': 0.5421001856739588,
  'disgust': 0.6279451519549495,
  'fear': 0.8930258596749754,
  'happy': 0.2795665261469537,
  'sad': 0.39058399704319413,
  'surprise': 0.8688049878299227,
  'text': 'セフィロスチョー強くてびっくり'},
 {'angry': 0.6322618613146697,
  'disgust': 0.6578651098916986,
  'fear': 0.8964255553614843,
  'happy': 0.18828917617500254,
  'sad': 0.5363194193208867,
  'surprise': 0.6429357498601701,
  'text': 'タークスとかいう闇組織こわ'}]
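Each label has its own binary classifier, so the six scores for a text are independent and do not need to sum to 1. As a minimal sketch of one way to read such an output, the helper below picks the highest-scoring label. The function name `top_label` and the 0.5 threshold are assumptions made for illustration, and the example values are the first entry of the output above, rounded.

```python
def top_label(scores, threshold=0.5):
    """Return the highest-scoring label, or None if no score reaches the threshold."""
    # Ignore the "text" entry; every other key maps a label to a probability.
    probs = {k: v for k, v in scores.items() if k != "text"}
    label = max(probs, key=probs.get)
    # The threshold is an arbitrary choice for this sketch.
    return label if probs[label] >= threshold else None

# First entry of the output above, with the scores rounded to two decimals.
example = {"angry": 0.42, "disgust": 0.42, "fear": 0.36,
           "happy": 0.72, "sad": 0.16, "surprise": 0.57,
           "text": "ティファの技って面白いですね"}
print(top_label(example))
```

Several labels can exceed the threshold at once (as in the later entries above), so keeping every label above the threshold is an equally reasonable way to read the output.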

Q: Why did you delete your GitHub account once?

I used the sentiment_ja library, which makes it easy to run sentiment analysis on text in Python. However (I don't know the reason), the GitHub repository had been deleted right when I was preparing my materials.

I came across an article that said the above. sentiment_ja is something I made.

Reason: influential people were making noise about "your model is harmful" and I got scared

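Cathy O'Neil and people in her circle often say things along the lines of "(machine learning) algorithms are nothing more than encoded opinions, so is there any value in a model that promotes discrimination?"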

“Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy,” the data scientist argues that the mathematical models underpinning these algorithms aren’t just flawed, they are encoded opinions and biases disguised as empirical fact, silently introducing and enforcing inequities that inflict harm right under our noses. https://news.harvard.edu/gazette/story/2016/10/dont-trust-that-algorithm/

For example, what if the model outputs a negative emotion when the name of a particular ethnic group appears in the input?

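That is what scares me. In the end, the prejudices that exist in society get reflected directly in the model. If there is a statistical association of the form "the word X evokes the emotion Y," most such associations can be called prejudice.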
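One way to look for this kind of word-to-emotion association is to score the same sentence while swapping only one term. The sketch below is a hedged illustration, not a method from this article: `probe`, `spread`, and `toy_score` are hypothetical names, and `toy_score` is a stand-in that would be replaced by a call to a trained model.

```python
def probe(template, terms, score_fn):
    """Score the same sentence with each term substituted into the {} slot."""
    return {term: score_fn(template.format(term)) for term in terms}

def spread(scores):
    """Largest score difference across the substituted terms."""
    values = list(scores.values())
    return max(values) - min(values)

# Stand-in scorer for illustration only; a real check would call a trained
# model here (for example, one label's predict_proba on the tokenized text).
def toy_score(text):
    return 0.9 if "B" in text else 0.2

# "A" and "B" are neutral placeholders for the terms being compared.
result = probe("{}さんが来た", ["A", "B"], toy_score)
print(result, spread(result))
```

A large spread for sentences that differ only in the substituted term suggests that the term itself is driving the score.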

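(Google and others also seem to use attribute prediction models behind their ads, which is, well, pretty crappy. Some companies use these "harmful" models to justify prejudice. Before we even get to "generalization performance," the task itself can be harmful.)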

What finally made me feel responsible was research like the following.

http://iminseisaku.org/top/conference/conf2020/200524_f2-4_omoya.pdf

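Roughly, the picture of this research is as follows.

There is Alice, who holds some prejudices.
Bob asks Alice to "analyze what prejudices Japanese people hold toward foreigners."
Alice analyzes society's prejudices based on her own prejudices.
She reports the results of the analysis to Bob.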

Well, I have no idea what grounds there are for trusting Alice (sentiment_ja).

So, well, I have schizophrenia, and with my persecutory delusions growing stronger too, I deleted both the repository and the account once.

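However, I thought that "how sentiment analysis is trained" is educational in itself and deserves to be widely known. In other words, the idea is that "to understand the bad side of machine learning, you need to know what is inside it." This is similar to the thinking in cybersecurity.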

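So I am publishing it again on the conditions of "publishing the entire training process" and "ensuring model transparency." Since it is logistic regression, the model is more transparent than a neural network and can be interpreted to some extent by looking at the coefficients.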
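Looking at the coefficients can be sketched as follows. This is a minimal illustration under assumptions: `top_terms` is a hypothetical helper, and the toy terms and weights are made up. With the pickled objects from this article one would pass `vect.get_feature_names_out()` (available in scikit-learn 1.0 and later) and `clf.coef_[0]` instead.

```python
def top_terms(feature_names, coefs, k=10):
    """Return (most positive, most negative) lists of (term, weight) pairs."""
    # Sort all (term, weight) pairs by weight, ascending.
    pairs = sorted(zip(feature_names, coefs), key=lambda t: t[1])
    negative = pairs[:k]        # terms that push the prediction toward False
    positive = pairs[::-1][:k]  # terms that push the prediction toward True
    return positive, negative

# Toy values for illustration only; with the pickled objects from this article
# one would pass vect.get_feature_names_out() and clf.coef_[0] instead.
names = ["悲しい", "嬉しい", "普通", "最高"]
weights = [-2.0, 1.5, 0.0, 3.0]
pos, neg = top_terms(names, weights, k=2)
print(pos)
print(neg)
```

Because the models are trained with an L1 penalty, many coefficients are exactly zero, which keeps the list of influential terms comparatively short.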

Addendum: I think what I wrote here is not limited to the specific tool "sentiment_ja."

