Trying out document classification with Jubatus #Python

I tried document classification with Jubatus. Surprisingly, I couldn't find any articles that did this.

About the dataset

The dataset I used is the livedoor news corpus.
It is a collection of news articles, each belonging to one of the following nine categories.

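トピックニュース (topic-news)
Sports Watch (sports-watch)
ITライフハック (it-life-hack)
家電チャンネル (kaden-channel)
MOVIE ENTER (movie-enter)
独女通信 (dokujo-tsushin)
エスマックス (smax)
livedoor HOMME (livedoor-homme)
Peachy (peachy)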

This time I'll tackle the problem of predicting a news article's category from its title.
See here for the complete Jupyter notebook.

After downloading and extracting the dataset, you get CHANGES.txt, README.txt, and one directory per category under the text directory. Each category directory contains LICENSE.txt and the article files.

Preparing the software

This article uses the following software and libraries.

Software
- Jubatus (with the MeCab plugin)
Python libraries
- scikit-learn
- pandas
- embedded_jubatus

Install each of them by following its official documentation.
For reference, in my environment Jubatus is installed into ${HOME}/local using jubatus-installer.

Procedure

We'll proceed in the following steps.
1. Prepare the data
2. Prepare the classifier
3. Cross-validation
4. Hold-out validation

Preparing the data

First, build pairs of category and data. I load them into pandas so they are easy to manipulate later.

import os
import pandas as pd

categories = [f for f in os.listdir("text") if os.path.isdir(os.path.join("text", f))]
print(categories)
articles = []
for c in categories:
    articles = articles + [(c, os.path.join("text", c, t)) for t in os.listdir(os.path.join("text", c)) if t != "LICENSE.txt"]
df = pd.DataFrame(articles, columns=["target", "data"])
df

This produces a DataFrame of (category, article file path) pairs, as shown below.

['sports-watch', 'smax', 'it-life-hack', 'livedoor-homme', 'kaden-channel', 'topic-news', 'peachy', 'movie-enter', 'dokujo-tsushin']


    target          data
0   sports-watch    text/sports-watch/sports-watch-6311178.txt
1   sports-watch    text/sports-watch/sports-watch-5470923.txt
2   sports-watch    text/sports-watch/sports-watch-5655567.txt
3   sports-watch    text/sports-watch/sports-watch-5724486.txt
4   sports-watch    text/sports-watch/sports-watch-5417346.txt

Next, build the training data to feed to Jubatus. Jubatus accepts a key/value data format called Datum. Here each Datum takes the form "title": article title.
Each article file is laid out as follows.

Line 1: link to the article
Line 2: timestamp
Line 3: article title
Line 4 onward: article body

So I build the training data from line 3 only.

from jubatus.common import Datum

datum_list = []
for d in df["data"]:
    dt = Datum()
    with open(d) as f:
        l = f.readlines()
        doc = l[2].rstrip()
        dt.add_string("title", doc)  # add the text data to the Datum
    datum_list.append(dt)
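
The four-part file layout described above can also be parsed into named fields, which is handy if you later want to experiment with the body text as well. The following is a minimal sketch, not part of the original notebook: the `Article` type and `parse_article` helper are hypothetical names, the sample file contents are made up, and UTF-8 encoding is an assumption about the corpus files.

```python
import os
import tempfile
from typing import NamedTuple


class Article(NamedTuple):
    """One corpus file split into the four parts listed above."""
    url: str
    timestamp: str
    title: str
    body: str


def parse_article(path, encoding="utf-8"):
    """Read one article file: line 1 URL, line 2 timestamp, line 3 title, rest body."""
    with open(path, encoding=encoding) as f:
        lines = f.read().splitlines()
    if len(lines) < 3:
        raise ValueError("expected at least 3 lines (URL, timestamp, title): " + path)
    return Article(lines[0], lines[1], lines[2], "\n".join(lines[3:]))


# A tiny made-up file in the same layout, used only for demonstration.
sample = (
    "http://example.com/article/1\n"
    "2012-01-01T00:00:00+0900\n"
    "Sample title\n"
    "First body line\n"
    "Second body line\n"
)
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write(sample)
    article = parse_article(sample_path)

print(article.title)
```

With a helper like this, the loop above would only need `parse_article(d).title` in place of indexing into `readlines()`.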

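Preparing the classifier

This time I use embedded-jubatus. Traditionally Jubatus runs in a server-client model, so you had to start and stop the server outside of Python. embedded-jubatus removes that hassle and lets everything run inside a single Python program. See this article for details.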

from embedded_jubatus import Classifier
config = {"converter" : {
        "string_filter_types" : {},
        "string_filter_rules" : [],
        "num_filter_types" : {},
        "num_filter_rules" : [],
        "string_types": {
                "mecab": {
                    "method": "dynamic",
                    "path": "libmecab_splitter.so",
                    "function": "create",
                    "arg": "-d /home/TkrUdagawa/local/lib/mecab/dic/ipadic",
                    "ngram": "1",
                    "base": "true",
                    "include_features": "*",
                    "exclude_features": ""
                }
        },
        "string_rules" : [
            { "key" : "*", "type" : "mecab", "sample_weight" : "bin", "global_weight" : "bin" }
        ],
        "num_types" : {},
        "num_rules" : [
            { "key" : "*", "type" : "num" }
        ]
    },
    "parameter" : {
        "regularization_weight" : 1.0
    },
    "method" : "AROW"
}
cl = Classifier(config)

Make sure the MeCab argument ("arg") set in string_types points to the dictionary location in your own environment.
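
One way to avoid hard-coding the dictionary path is to patch it into the config just before creating the classifier. The sketch below is only an illustration: `with_mecab_dictionary` is a hypothetical helper, `base_config` is a trimmed-down stand-in for the full config above, and the directory under the home directory is an assumed location modeled on the ${HOME}/local install mentioned earlier.

```python
import copy
import os

# Trimmed-down stand-in for the converter config above (only the part we touch).
base_config = {
    "converter": {
        "string_types": {
            "mecab": {
                "method": "dynamic",
                "path": "libmecab_splitter.so",
                "function": "create",
                "arg": "-d /path/to/ipadic",  # placeholder, replaced below
            }
        }
    }
}


def with_mecab_dictionary(config, dic_dir, must_exist=True):
    """Return a copy of config whose MeCab "arg" points at dic_dir."""
    if must_exist and not os.path.isdir(dic_dir):
        raise FileNotFoundError("MeCab dictionary directory not found: " + dic_dir)
    new_config = copy.deepcopy(config)  # leave the caller's config untouched
    new_config["converter"]["string_types"]["mecab"]["arg"] = "-d " + dic_dir
    return new_config


# Assumed location; adjust to wherever your dictionary actually lives.
dic_dir = os.path.join(os.path.expanduser("~"), "local", "lib", "mecab", "dic", "ipadic")
# must_exist=False only so this demo runs on machines without the dictionary.
my_config = with_mecab_dictionary(base_config, dic_dir, must_exist=False)
print(my_config["converter"]["string_types"]["mecab"]["arg"])
```

In real use, leaving `must_exist=True` gives a clear error up front if the path is wrong.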

Cross-validation

Split the data into a training set and a test set, then run cross-validation on the training set. scikit-learn makes it easy to split the data nicely (randomly, and balanced across labels). Here I use 4 folds.

from sklearn.model_selection import train_test_split, StratifiedKFold
# Split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(df["data"], df["target"], random_state=42, stratify=df["target"])
num_splits = 4
# Set up cross-validation
kf = StratifiedKFold(n_splits=num_splits, random_state=42, shuffle=True)

Run cross-validation on the split data.


import random
from sklearn.utils import shuffle

random.seed(42)
y_cv_results = []
for fold, indexes in enumerate(kf.split(X_train.index, y_train)):
    cl.clear()
    train_index, test_index = indexes

    # Build a list of (label, Datum) pairs
    training_data = [(df["target"][X_train.index[i]], datum_list[X_train.index[i]]) for i in train_index]

    # Train Jubatus
    cl.train(training_data)

    test_data = [datum_list[X_train.index[i]] for i in test_index]

    # Have Jubatus classify the held-out fold
    result = cl.classify(test_data)

    # Take the label with the highest classification score as the prediction
    y_pred = [max(x, key=lambda y:y.score).label  for x in result]

    # Collect the correct labels
    y = [df["target"][X_train.index[i]] for i in test_index]

    y_cv_results.append([y, y_pred])
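
The `max(..., key=...)` line in the loop above is the step that turns per-label scores into a single prediction. The sketch below isolates it with stand-in data: `LabelScore` is a hypothetical namedtuple that models only the two attributes the loop relies on (`label` and `score`), and the scores are made up.

```python
from collections import namedtuple

# Stand-in for the per-label results returned by classify(); only the
# attributes used in the loop above (label, score) are modeled here.
LabelScore = namedtuple("LabelScore", ["label", "score"])

# One inner list per classified Datum, one entry per known label (made-up scores).
result = [
    [LabelScore("movie-enter", 0.71), LabelScore("peachy", 0.12), LabelScore("smax", -0.30)],
    [LabelScore("movie-enter", -0.05), LabelScore("peachy", 0.02), LabelScore("smax", 0.44)],
]

# For each Datum, keep the label whose score is largest.
y_pred = [max(candidates, key=lambda c: c.score).label for candidates in result]
print(y_pred)
```

Note that `max()` raises `ValueError` on an empty sequence, so this pattern assumes every Datum comes back with at least one scored label.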

Check the cross-validation results. You could look at each fold separately, but here I only look at the overall result.

from sklearn.metrics import classification_report, confusion_matrix

y_sum = []
y_pred_sum = []
for y, y_pred in y_cv_results:
    y_sum.extend(y)
    y_pred_sum.extend(y_pred)
print(classification_report(y_sum, y_pred_sum))
print(confusion_matrix(y_sum, y_pred_sum))

Running the code above gives the following result.

                precision    recall  f1-score   support

dokujo-tsushin       0.78      0.81      0.80       652
  it-life-hack       0.85      0.83      0.84       653
 kaden-channel       0.92      0.90      0.91       648
livedoor-homme       0.74      0.61      0.67       383
   movie-enter       0.80      0.81      0.80       652
        peachy       0.72      0.69      0.71       632
          smax       0.90      0.94      0.92       652
  sports-watch       0.89      0.80      0.84       675
    topic-news       0.70      0.85      0.77       578

   avg / total       0.82      0.82      0.81      5525

[[530  11   4  14  18  49   5   4  17]
 [ 16 544  17   8  12  18  24   3  11]
 [  2  22 583  10   6   9   9   0   7]
 [ 26  21   9 234  22  33  12   6  20]
 [ 12  10   1  10 530  34   1   9  45]
 [ 61  15   7  29  42 437  13   8  20]
 [  1  14   6   4   6   5 616   0   0]
 [ 18   0   1   6  10  12   1 537  90]
 [ 12   6   3   2  19   8   1  34 493]]

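The overall average F-score is 0.81, which is reasonably good. Even the article titles alone are enough to classify with decent accuracy.

Hold-out validation

Finally, let's check the accuracy on the data that was set aside for testing.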
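
To see how the report and the confusion matrix relate, the per-class numbers can be recomputed from the matrix by hand. This is a plain-Python sketch using the matrix printed above, reading rows as true labels and columns as predicted labels in the same sorted order as the report.

```python
labels = ["dokujo-tsushin", "it-life-hack", "kaden-channel", "livedoor-homme",
          "movie-enter", "peachy", "smax", "sports-watch", "topic-news"]

# Confusion matrix copied from the output above (rows: true, columns: predicted).
cm = [
    [530, 11, 4, 14, 18, 49, 5, 4, 17],
    [16, 544, 17, 8, 12, 18, 24, 3, 11],
    [2, 22, 583, 10, 6, 9, 9, 0, 7],
    [26, 21, 9, 234, 22, 33, 12, 6, 20],
    [12, 10, 1, 10, 530, 34, 1, 9, 45],
    [61, 15, 7, 29, 42, 437, 13, 8, 20],
    [1, 14, 6, 4, 6, 5, 616, 0, 0],
    [18, 0, 1, 6, 10, 12, 1, 537, 90],
    [12, 6, 3, 2, 19, 8, 1, 34, 493],
]

n = len(labels)
metrics = {}
for k, label in enumerate(labels):
    tp = cm[k][k]                                     # correctly predicted as k
    predicted_as_k = sum(cm[i][k] for i in range(n))  # column sum
    actually_k = sum(cm[k])                           # row sum (the support)
    precision = tp / predicted_as_k
    recall = tp / actually_k
    f1 = 2 * precision * recall / (precision + recall)
    metrics[label] = (precision, recall, f1)

# Overall accuracy: diagonal total over all samples.
accuracy = sum(cm[k][k] for k in range(n)) / sum(map(sum, cm))

for label, (p, r, f) in metrics.items():
    print(f"{label:>15} {p:.2f} {r:.2f} {f:.2f}")
print(f"accuracy {accuracy:.4f}")
```

The largest off-diagonal entry is 90, in the sports-watch row and topic-news column, so sports articles misread as topic news are the most common single confusion.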

cl.clear()
training_data = [(df["target"][i], datum_list[i]) for i in X_train.index]
test_data = [datum_list[i] for i in X_test.index]
y_test = [df["target"][i] for i in X_test.index]
cl.train(training_data)
r = cl.classify(test_data)

y_pred = [max(x, key=lambda y:y.score).label  for x in r]
report = classification_report(y_test, y_pred)
print(report)

                precision    recall  f1-score   support

dokujo-tsushin       0.71      0.86      0.78       218
  it-life-hack       0.81      0.87      0.84       217
 kaden-channel       0.96      0.87      0.91       216
livedoor-homme       0.81      0.71      0.76       128
   movie-enter       0.85      0.80      0.82       218
        peachy       0.73      0.70      0.71       210
          smax       0.93      0.91      0.92       218
  sports-watch       0.95      0.82      0.88       225
    topic-news       0.78      0.88      0.82       192

   avg / total       0.84      0.83      0.83      1842

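Hold-out validation gave about the same accuracy as cross-validation.

Bonus: looking inside the model with jubadump

When you use one of Jubatus's linear models, the command-line tool jubadump lets you inspect the feature weights. Since every feature value is 1 in this setup, a larger weight means that feature contributes more to the classification.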
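
The support columns of the two reports also let you check that the stratified split did its job. Because every training sample lands in exactly one cross-validation test fold, the cross-validation supports are the per-class sizes of the training set. The sketch below compares them with the hold-out supports; `train_test_split` uses a test share of 0.25 when no size is given, so each class should sit near that value.

```python
# Per-class support from the cross-validation report (the whole training set).
cv_support = {
    "dokujo-tsushin": 652, "it-life-hack": 653, "kaden-channel": 648,
    "livedoor-homme": 383, "movie-enter": 652, "peachy": 632,
    "smax": 652, "sports-watch": 675, "topic-news": 578,
}
# Per-class support from the hold-out report (the test set).
holdout_support = {
    "dokujo-tsushin": 218, "it-life-hack": 217, "kaden-channel": 216,
    "livedoor-homme": 128, "movie-enter": 218, "peachy": 210,
    "smax": 218, "sports-watch": 225, "topic-news": 192,
}

# Share of each class that ended up in the test set.
test_share = {
    label: holdout_support[label] / (cv_support[label] + holdout_support[label])
    for label in cv_support
}
total = sum(cv_support.values()) + sum(holdout_support.values())

for label, share in test_share.items():
    print(f"{label:>15} {share:.4f}")
print("total articles:", total)
```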

Using jubadump takes the following steps.
1. Save the Jubatus model with the save method
2. Convert the model contents to JSON with the jubadump command
3. Parse the JSON

Saving the Jubatus model

Call the save method on the classifier object from earlier.
The argument is the name to give the model. When the save succeeds, Jubatus creates the model file under /tmp by default.

cl.save("livedoor_title")
> {'127.0.0.1_0': '/tmp/127.0.0.1_0_classifier_livedoor_title.jubatus'}

Using the jubadump command

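From the command line, run jubadump on the model saved above.

$ jubadump -i /tmp/127.0.0.1_0_classifier_livedoor_title.jubatus > title_weights.json

Parsing the JSON

The JSON that jubadump outputs has a rather confusing structure, but basically looking at the value called v1 is enough. Parse the JSON and build a DataFrame whose rows are morphemes and whose columns are labels, holding the weight each morpheme has for each label.

import json
import re

# Load the jubadump output saved in the previous step
with open("title_weights.json", encoding="utf-8") as f:
    j = json.load(f)

weights = {k:[] for k in categories}
index = []
for w in j["storage"]["storage"]["weight"]:
    # Extract only the morpheme part of the feature name
    k = re.search(r"\$.+@", w).group(0).replace("$", "").replace("@", "")
    index.append(k)
    for label in categories:
        try:
            weights[label].append(j["storage"]["storage"]["weight"][w][label]["v1"])
        except KeyError:
            # This feature has no weight for this label
            weights[label].append(0)
d = pd.DataFrame(weights, index=index)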
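
To make the morpheme extraction easier to follow, here is a small self-contained sketch on a hand-made weight table. The feature-name format `title$映画@mecab#bin/bin` is an assumption about what jubadump emits for this config (only the `$...@` part matters to the regular expression), `morpheme_of` is a hypothetical helper, and the table holds just three of the weights printed in this article rather than a real dump.

```python
import re

# Hand-made stand-in for j["storage"]["storage"]["weight"]; the key format is assumed.
weight = {
    "title$映画@mecab#bin/bin": {
        "movie-enter": {"v1": 0.3879},
        "it-life-hack": {"v1": -0.209024},
    },
    "title$ゴルフ@mecab#bin/bin": {
        "livedoor-homme": {"v1": 0.526195},
    },
}
labels = ["movie-enter", "it-life-hack", "livedoor-homme"]


def morpheme_of(feature_name):
    """Return the text between the first '$' and the last '@', or None."""
    m = re.search(r"\$(.+)@", feature_name)
    return m.group(1) if m else None


# One row per morpheme, one column per label; missing weights become 0.0.
table = {}
for name, per_label in weight.items():
    row = [per_label.get(label, {}).get("v1", 0.0) for label in labels]
    table[morpheme_of(name)] = row

for morpheme, row in table.items():
    print(morpheme, row)
```

Using a capture group instead of chained `replace` calls keeps any `$` or `@` that happens to be part of the morpheme itself.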

Then print the top 3 and bottom 3 weights for each category.

for c in categories:
    print(c)
    print("positive feature")
    print(d[c].sort_values(ascending=False)[:3])
    print("")
    print("negative feature")
    print(d[c].sort_values()[:3])
    print("")

sports-watch
positive feature
Watch     0.525509
Sports    0.525509
なでしこ      0.394376
Name: sports-watch, dtype: float64

negative feature
：      -0.270749
お      -0.235614
ニュース   -0.199929
Name: sports-watch, dtype: float64

smax
positive feature
アプリ        0.419476
レポート       0.345398
Android    0.342291
Name: smax, dtype: float64

negative feature
話題   -0.382349
　    -0.335096
デジ   -0.268417
Name: smax, dtype: float64

it-life-hack
positive feature
凄い     0.452164
孫      0.428668
掴める    0.419770
Name: it-life-hack, dtype: float64

negative feature
映画     -0.209024
夏      -0.208336
おすすめ   -0.200895
Name: it-life-hack, dtype: float64

livedoor-homme
positive feature
ゴルフ    0.526195
特集     0.345328
｜      0.338539
Name: livedoor-homme, dtype: float64

negative feature
映画   -0.235123
結婚   -0.174290
S    -0.159502
Name: livedoor-homme, dtype: float64

kaden-channel
positive feature
話題       0.462991
SALON    0.427266
売れ筋      0.370941
Name: kaden-channel, dtype: float64

negative feature
韓国      -0.395027
プレゼント   -0.266264
レポート    -0.256500
Name: kaden-channel, dtype: float64

topic-news
positive feature
物議      0.439206
ネット     0.370690
ニュース    0.362856
Name: topic-news, dtype: float64

negative feature
！    -0.453532
方    -0.344805
カレ   -0.262586
Name: topic-news, dtype: float64

peachy
positive feature
プレゼント         0.475135
ガールズコレクション    0.394012
クリスマス         0.385962
Name: peachy, dtype: float64

negative feature
部     -0.337270
理由    -0.301786
すぎる   -0.238123
Name: peachy, dtype: float64

movie-enter
positive feature
映画      0.387900
エンター    0.347985
映像      0.316921
Name: movie-enter, dtype: float64

negative feature
女子       -0.447995
オトナ      -0.258326
バレンタイン   -0.228755
Name: movie-enter, dtype: float64

dokujo-tsushin
positive feature
独      0.475533
女      0.455230
オトナ    0.410368
Name: dokujo-tsushin, dtype: float64

negative feature
氏    -0.314418
さん   -0.304657
やる   -0.270551
Name: dokujo-tsushin, dtype: float64

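Obvious features such as "if the title contains 映画 (movie), of course it's movie-enter" are learned, but some slightly odd ones are picked up too: for reasons I don't understand, ｜ is effective for classifying livedoor-homme. It also looks like the morphological analysis stage isn't handling everything cleanly, so switching dictionaries or adding part-of-speech filtering would probably yield more sensible features.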
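
As a rough illustration of how such weights turn into a decision, a linear model scores each label by summing the weights of the features present, and with binary feature values each present morpheme simply contributes its weight once. The toy below uses only a handful of the weights printed above and treats every weight not listed as 0, which is not true of the real model, so the outcome is illustrative only. The token list is a made-up title, not MeCab output.

```python
# A few weights copied from the output above: weights[morpheme][label].
weights = {
    "映画": {"movie-enter": 0.387900, "it-life-hack": -0.209024, "livedoor-homme": -0.235123},
    "プレゼント": {"peachy": 0.475135, "kaden-channel": -0.266264},
}

# Morphemes of a hypothetical title (hand-written for this example).
tokens = ["映画", "プレゼント"]

# Sum the weights of the morphemes that are present, per label.
scores = {}
for token in set(tokens):  # binary features: each morpheme counts once
    for label, w in weights.get(token, {}).items():
        scores[label] = scores.get(label, 0.0) + w

best_label = max(scores, key=scores.get)
for label, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label:>15} {score:+.6f}")
print("predicted:", best_label)
```

In this toy, プレゼント pulls harder toward peachy than 映画 pulls toward movie-enter, which shows how a single strong feature can outweigh an "obvious" one.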