Aggressively tuning Jubatus document classification (COTOHA API) #Python

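Introduction

I took the document classifier I built earlier with Jubatus and tried a range of tuning techniques on it.
The things I try are listed below. I go through them in order and keep whatever improves accuracy.

Pushing Jubatus itself
- Increase the number of training passes
- Improve feature extraction
  - Change the dictionary
  - Change the weighting
  - Try morpheme n-grams
- Change the algorithm and hyperparameters
Mixing in features from outside Jubatus
- Add COTOHA API analysis results

The sample code is here.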

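Baseline

First, a recap of the problem setup and a measurement of the baseline accuracy.

The task uses the livedoor news corpus: estimate each article's category (9 classes) from the article title alone. The constraint throughout is that no information other than the title is used.

As the baseline, I use Jubatus with MeCab morphological analysis only, and measure accuracy with both cross-validation and holdout validation.

Code: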


import os
import pandas as pd
import random
import json
from jubatus.common import Datum
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import classification_report, confusion_matrix
from embedded_jubatus import Classifier

# Load the data and build a DataFrame
categories = [f for f in os.listdir("text") if os.path.isdir(os.path.join("text", f))]
print(categories)
articles = []
for c in categories:
    articles = articles + [(c, os.path.join("text", c, t)) for t in os.listdir(os.path.join("text", c)) if t != "LICENSE.txt"]
df = pd.DataFrame(articles, columns=["target", "data"])

# Build the list of Datum objects up front
datum_list = []
for d in df["data"]:
    dt = Datum()
    with open(d) as f:
        l = f.readlines()
        doc = l[2].rstrip()  # the title is read from the third line of each file
        dt.add_string("title", doc) # add the text to the Datum
    datum_list.append(dt)

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df["data"], df["target"], random_state=42, stratify=df["target"])
num_splits = 4

# Prepare cross-validation
kf = StratifiedKFold(n_splits=num_splits, random_state=42, shuffle=True)


# Set up Jubatus
config = {"converter" : {
        "string_filter_types" : {},
        "string_filter_rules" : [],
        "num_filter_types" : {},
        "num_filter_rules" : [],
        "string_types": {
                "mecab": {
                    "method": "dynamic",
                    "path": "libmecab_splitter.so",
                    "function": "create",
                    "arg": "-d /home/TkrUdagawa/local/lib/mecab/dic/ipadic",
                    "ngram": "1",
                    "base": "true",
                    "include_features": "*",
                    "exclude_features": ""
                }
        },
        "string_rules" : [
            { "key" : "*", "type" : "mecab", "sample_weight" : "bin", "global_weight" : "bin" }
        ],
        "num_types" : {},
        "num_rules" : [
            { "key" : "*", "type" : "num" }
        ]
    },
    "parameter" : {
        "regularization_weight" : 1.0
    },
    "method" : "AROW"
}
cl = Classifier(config)

# Function that runs cross-validation
def do_cv(cl, n=3):
    random.seed(42)
    y_cv_results = []
    for fold, indexes in enumerate(kf.split(X_train.index, y_train)):
        cl.clear()
        train_index, test_index = indexes

        # Build a list of (label, Datum) pairs
        training_data = [(df["target"][X_train.index[i]], datum_list[X_train.index[i]]) for i in train_index]

        # Train Jubatus on the same data n times
        for i in range(n):
            cl.train(training_data)

        test_data = [datum_list[X_train.index[i]] for i in test_index]

        # Classify with Jubatus
        result = cl.classify(test_data)

        # Take the label with the highest classification score as the prediction
        y_pred = [max(x, key=lambda y:y.score).label  for x in result]

        # Extract the ground-truth labels
        y = [df["target"][X_train.index[i]] for i in test_index]

        y_cv_results.append([y, y_pred])
    y_sum = []
    y_pred_sum = []
    for y, y_pred in y_cv_results:
        y_sum.extend(y)
        y_pred_sum.extend(y_pred)
    print(classification_report(y_sum, y_pred_sum, digits=4))
    print(confusion_matrix(y_sum, y_pred_sum))

# Function that runs holdout validation
# (cl is not cleared here; call cl.clear() beforehand to train from scratch)
def do_holdout(cl, n):
    random.seed(42)
    training_data = [(df["target"][i], datum_list[i]) for i in X_train.index]
    test_data = [datum_list[i] for i in X_test.index]
    y_true = [df["target"][i] for i in X_test.index]

    for i in range(n):
        cl.train(training_data)
    result = cl.classify(test_data)
    y_pred = [max(x, key=lambda y:y.score).label  for x in result]

    print(classification_report(y_true=y_true, y_pred=y_pred, digits=4))

do_cv(cl, 1)
do_holdout(cl, 1)

Cross-validation
                precision    recall  f1-score   support

dokujo-tsushin     0.7581    0.8267    0.7909       652
  it-life-hack     0.8193    0.8469    0.8328       653
 kaden-channel     0.9074    0.9074    0.9074       648
livedoor-homme     0.8345    0.6188    0.7106       383
   movie-enter     0.7791    0.8006    0.7897       652
        peachy     0.7244    0.6820    0.7025       632
          smax     0.9104    0.9509    0.9302       652
  sports-watch     0.9050    0.8326    0.8673       675
    topic-news     0.7500    0.8304    0.7882       578

   avg / total     0.8218    0.8203    0.8192      5525


Holdout validation
                precision    recall  f1-score   support

dokujo-tsushin     0.7214    0.8670    0.7875       218
  it-life-hack     0.8515    0.8986    0.8744       217
 kaden-channel     0.9439    0.9352    0.9395       216
livedoor-homme     0.9231    0.6562    0.7671       128
   movie-enter     0.8106    0.8440    0.8270       218
        peachy     0.8085    0.7238    0.7638       210
          smax     0.9067    0.9358    0.9210       218
  sports-watch     0.8815    0.8267    0.8532       225
    topic-news     0.8103    0.8229    0.8165       192

   avg / total     0.8481    0.8436    0.8430      1842

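Plenty of numbers come out, but to preserve the model's generalization performance I tune against the cross-validation f1-score in the avg / total row. The starting point is 0.8192.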
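
For reference, the avg / total row printed by classification_report in the scikit-learn version used here is, to my understanding, the support-weighted average of the per-class scores. The following is a minimal sketch of that computation in plain Python. The labels are toy values made up for illustration, not taken from the corpus, and weighted_f1 is a hypothetical helper name.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of the per-class F1 scores."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n  # weight each class by its number of true examples
    return total / len(y_true)

# Toy labels for illustration only
y_true = ["smax", "smax", "peachy", "peachy"]
y_pred = ["smax", "peachy", "peachy", "peachy"]
score = weighted_f1(y_true, y_pred)
```

Because each class is weighted by its support, large categories influence this number more than small ones such as livedoor-homme.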

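Pushing Jubatus itself

Repeated training

Most of the algorithms Jubatus uses are online learning algorithms. Online learning updates the model one example at a time, so the resulting model depends on the order in which the data is presented, and a single pass over the data does not necessarily settle on the best model. As a result, training on the same data several times can improve accuracy.
First, let's see what happens when the number of training passes increases.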
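
To make the effect of repeated passes concrete, here is a small sketch using a plain perceptron as a stand-in for an online learner. This is not what AROW or CW do internally, and the four training examples are made up for illustration. It only shows that presenting the same data again can change the model.

```python
def train_perceptron(data, passes):
    """Train a binary perceptron, presenting the data `passes` times in order."""
    w = {}
    for _ in range(passes):
        for features, label in data:
            score = sum(w.get(k, 0.0) * v for k, v in features.items())
            if label * score <= 0:  # mistake (or zero margin): update the weights
                for k, v in features.items():
                    w[k] = w.get(k, 0.0) + label * v
    return w

def accuracy(w, data):
    """Fraction of examples whose predicted sign matches the label."""
    correct = 0
    for features, label in data:
        score = sum(w.get(k, 0.0) * v for k, v in features.items())
        pred = 1 if score > 0 else -1
        correct += (pred == label)
    return correct / len(data)

# Made-up, linearly separable toy data: (sparse features, label in {+1, -1})
toy_data = [
    ({"a": 1, "b": 1}, 1),
    ({"a": 1, "c": 1}, -1),
    ({"b": 1}, 1),
    ({"c": 1, "b": 1}, -1),
]
acc_by_passes = {n: accuracy(train_perceptron(toy_data, n), toy_data)
                 for n in (1, 2, 3)}
```

On this toy data, the model after one pass still misclassifies half of the training examples, while the model after two passes classifies all of them correctly.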

do_cv(cl, 2)
do_cv(cl, 3)
do_cv(cl, 4)
do_cv(cl, 5)

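Results:

2 passes
   avg / total     0.8329    0.8324    0.8316      5525

3 passes
   avg / total     0.8358    0.8351    0.8345      5525

4 passes
   avg / total     0.8351    0.8340    0.8337      5525

5 passes
   avg / total     0.8347    0.8337    0.8334      5525

Accuracy improved up to the third pass, reaching 0.8345.
Repeated training helps, so I adopt it.
Three passes was best here, but whenever features are added or the algorithm changes later on, I vary the number of passes again and re-measure.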

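Changing the dictionary

I switch the dictionary used for morphological analysis from the Jubatus default to NEologd. See the official NEologd site for installation. The only change to the source code is the dictionary directory specified in the Jubatus config.
In my environment it is installed under local/lib in my home directory, so I point to that.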


config = {"converter" : {
        "string_filter_types" : {},
        "string_filter_rules" : [],
        "num_filter_types" : {},
        "num_filter_rules" : [],
        "string_types": {
                "mecab": {
                    "method": "dynamic",
                    "path": "libmecab_splitter.so",
                    "function": "create",
-                   "arg": "-d /home/TkrUdagawa/local/lib/mecab/dic/mecab-ipadic/",
+                   "arg": "-d /home/TkrUdagawa/local/lib/mecab/dic/mecab-ipadic-neologd/",
                    "ngram": "1",
                    "base": "true",
                    "include_features": "*",
                    "exclude_features": ""
                }
        },
        "string_rules" : [
            { "key" : "*", "type" : "mecab", "sample_weight" : "bin", "global_weight" : "bin" }
        ],
        "num_types" : {},
        "num_rules" : [
            { "key" : "*", "type" : "num" }
        ]
    },
    "parameter" : {
        "regularization_weight" : 1.0
    },
    "method" : "AROW"
}
cl = Classifier(config)

Results:

   avg / total     0.8245    0.8221    0.8217      5525

I tried NEologd with up to 5 training passes, but accuracy dropped slightly, so I do not adopt it.

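Using n-grams

Jubatus can use morpheme n-grams, which extract runs of n consecutive morphemes. They too only require a change to the Jubatus config. As shown below, define a feature extraction method in string_types with a different ngram value, then add an entry in string_rules that applies it.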
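
As a rough picture of what a morpheme n-gram is, the sketch below slides a window of n morphemes over an already segmented title. The segmentation is illustrative, and how Jubatus actually joins the morphemes into a feature key is not shown here; morpheme_ngrams is a hypothetical helper.

```python
def morpheme_ngrams(morphemes, n):
    """Return every run of n consecutive morphemes as a tuple."""
    return [tuple(morphemes[i:i + n]) for i in range(len(morphemes) - n + 1)]

# Illustrative segmentation of part of a title
morphemes = ["トップ", "3", "を", "邦画", "が", "独占"]
bigrams = morpheme_ngrams(morphemes, 2)
trigrams = morpheme_ngrams(morphemes, 3)
```

A title with k morphemes yields k - n + 1 n-grams, so higher n produces fewer and sparser features on short titles.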

bi-gram


config = {"converter" : {
        "string_filter_types" : {},
        "string_filter_rules" : [],
        "num_filter_types" : {},
        "num_filter_rules" : [],
        "string_types": {
                "mecab": {
                    "method": "dynamic",
                    "path": "libmecab_splitter.so",
                    "function": "create",
                    "arg": "-d /home/TkrUdagawa/local/lib/mecab/dic/ipadic",
                    "ngram": "1",
                    "base": "true",
                    "include_features": "*",
                    "exclude_features": ""
                },
+           "mecab-bi": {
+                   "method": "dynamic",
+                   "path": "libmecab_splitter.so",
+                   "function": "create",
+                   "arg": "-d /home/TkrUdagawa/local/lib/mecab/dic/ipadic",
+                   "ngram": "2",
+                  "base": "true",
+                   "include_features": "*",
+                   "exclude_features": ""                
+           }
        },
        "string_rules" : [
            { "key" : "*", "type" : "mecab", "sample_weight" : "bin", "global_weight" : "bin" },
+           { "key" : "*", "type" : "mecab-bi", "sample_weight" : "bin", "global_weight" : "bin" }
        ],
        "num_types" : {},
        "num_rules" : [
            { "key" : "*", "type" : "num" }
        ]
    },
    "parameter" : {
        "regularization_weight" : 1.0
    },
    "method" : "AROW"
}

tri-gram


config = {"converter" : {
        "string_filter_types" : {},
        "string_filter_rules" : [],
        "num_filter_types" : {},
        "num_filter_rules" : [],
        "string_types": {
                "mecab": {
                    "method": "dynamic",
                    "path": "libmecab_splitter.so",
                    "function": "create",
                    "arg": "-d /home/TkrUdagawa/local/lib/mecab/dic/ipadic",
                    "ngram": "1",
                    "base": "true",
                    "include_features": "*",
                    "exclude_features": ""
                },
+           "mecab-bi": {
+                   "method": "dynamic",
+                   "path": "libmecab_splitter.so",
+                   "function": "create",
+                   "arg": "-d /home/TkrUdagawa/local/lib/mecab/dic/ipadic",
+                   "ngram": "2",
+                  "base": "true",
+                   "include_features": "*",
+                   "exclude_features": ""                
+           },
+           "mecab-tri": {
+                   "method": "dynamic",
+                   "path": "libmecab_splitter.so",
+                   "function": "create",
+                   "arg": "-d /home/TkrUdagawa/local/lib/mecab/dic/ipadic",
+                   "ngram": "3",
+                  "base": "true",
+                   "include_features": "*",
+                   "exclude_features": ""                
+           }
        },
        "string_rules" : [
            { "key" : "*", "type" : "mecab", "sample_weight" : "bin", "global_weight" : "bin" },
+           { "key" : "*", "type" : "mecab-bi", "sample_weight" : "bin", "global_weight" : "bin" },
+           { "key" : "*", "type" : "mecab-tri", "sample_weight" : "bin", "global_weight" : "bin" }
        ],
        "num_types" : {},
        "num_rules" : [
            { "key" : "*", "type" : "num" }
        ]
    },
    "parameter" : {
        "regularization_weight" : 1.0
    },
    "method" : "AROW"
}

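The cross-validation results were as follows.

With bi-grams added
   avg / total     0.8492    0.8456    0.8458      5525

With tri-grams added as well
   avg / total     0.8482    0.8424    0.8427      5525

Adding bi-grams improves accuracy, so I adopt them. Adding tri-grams on top lowered accuracy, so I do not adopt them.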

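Changing the weighting

So far, every extracted feature has been trained with the same weight of 1. In practice, accuracy can improve when highly distinctive features get larger weights and features that appear in every document, and so contribute nothing to classification, get smaller ones. Jubatus offers two such weighting schemes: TF-IDF and Okapi BM25.

The change is to the sample_weight and global_weight settings inside string_rules, as shown below. Set sample_weight to tf; set global_weight to idf for TF-IDF or to bm25 for BM25.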
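
To illustrate the idea behind the weighting, here is a sketch of the textbook TF-IDF formula, tf * log(N / df), on three made-up token lists. The exact formula Jubatus uses may differ (for example in smoothing), so treat this only as an illustration of the principle; tf_idf is a hypothetical helper.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weighted

# Made-up token lists for illustration
docs = [
    ["映画", "で", "独占"],
    ["映画", "で", "初", "登場"],
    ["スマホ", "で", "撮影"],
]
weights = tf_idf(docs)
```

Here "で" appears in every document and gets weight 0, while "独占", which appears in only one document, gets a larger weight than "映画".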

tf-idf


config = {"converter" : {
        "string_filter_types" : {},
        "string_filter_rules" : [],
        "num_filter_types" : {},
        "num_filter_rules" : [],
        "string_types": {
                "mecab": {
                    "method": "dynamic",
                    "path": "libmecab_splitter.so",
                    "function": "create",
                    "arg": "-d /home/TkrUdagawa/local/lib/mecab/dic/ipadic",
                    "ngram": "1",
                    "base": "true",
                    "include_features": "*",
                    "exclude_features": ""
                },
            "mecab-bi": {
                    "method": "dynamic",
                    "path": "libmecab_splitter.so",
                    "function": "create",
                    "arg": "-d /home/TkrUdagawa/local/lib/mecab/dic/ipadic",
                    "ngram": "2",
                    "base": "true",
                    "include_features": "*",
                    "exclude_features": ""                
            }
        },
        "string_rules" : [
-           { "key" : "*", "type" : "mecab", "sample_weight" : "bin", "global_weight" : "bin" },
+           { "key" : "*", "type" : "mecab", "sample_weight" : "tf", "global_weight" : "idf" },
-           { "key" : "*", "type" : "mecab-bi", "sample_weight" : "bin", "global_weight" : "bin" }
+           { "key" : "*", "type" : "mecab-bi", "sample_weight" : "tf", "global_weight" : "idf" }
        ],
        "num_types" : {},
        "num_rules" : [
            { "key" : "*", "type" : "num" }
        ]
    },
    "parameter" : {
        "regularization_weight" : 1.0
    },
    "method" : "AROW"
}

BM25


config = {"converter" : {
        "string_filter_types" : {},
        "string_filter_rules" : [],
        "num_filter_types" : {},
        "num_filter_rules" : [],
        "string_types": {
                "mecab": {
                    "method": "dynamic",
                    "path": "libmecab_splitter.so",
                    "function": "create",
                    "arg": "-d /home/TkrUdagawa/local/lib/mecab/dic/ipadic",
                    "ngram": "1",
                    "base": "true",
                    "include_features": "*",
                    "exclude_features": ""
                },
            "mecab-bi": {
                    "method": "dynamic",
                    "path": "libmecab_splitter.so",
                    "function": "create",
                    "arg": "-d /home/TkrUdagawa/local/lib/mecab/dic/ipadic",
                    "ngram": "2",
                    "base": "true",
                    "include_features": "*",
                    "exclude_features": ""                
            }
        },
        "string_rules" : [
-           { "key" : "*", "type" : "mecab", "sample_weight" : "bin", "global_weight" : "bin" },
+           { "key" : "*", "type" : "mecab", "sample_weight" : "tf", "global_weight" : "bm25" },
-           { "key" : "*", "type" : "mecab-bi", "sample_weight" : "bin", "global_weight" : "bin" }
+           { "key" : "*", "type" : "mecab-bi", "sample_weight" : "tf", "global_weight" : "bm25" }
        ],
        "num_types" : {},
        "num_rules" : [
            { "key" : "*", "type" : "num" }
        ]
    },
    "parameter" : {
        "regularization_weight" : 1.0
    },
    "method" : "AROW"
}

The cross-validation results were as follows.

tf-idf
   avg / total     0.8487    0.8454    0.8455      5525

bm25
   avg / total     0.8471    0.8452    0.8450      5525

bm25 kept improving up to 6 training passes but topped out at 0.8450. Neither scheme improved on the current best of 0.8458, so I do not adopt either.

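Parameter tuning

Next I tune the algorithm and its parameters.
Done properly, every algorithm should be tried with fine-grained parameter adjustments, but given limited time and effort I tried only two algorithms, CW and AROW, over a range of parameter values.
CW and AROW each expose a parameter called regularization_weight.
I try five fixed values: 0.01, 0.1, 0.5, 1.0, and 10.0.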
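
One way to generate the ten configurations for this grid is sketched below. make_configs is a hypothetical helper, and base_config is a placeholder; in practice it would carry the full converter section used so far.

```python
import copy
from itertools import product

def make_configs(base_config, methods, reg_weights):
    """Yield (method, weight, config) for every combination in the grid."""
    for method, weight in product(methods, reg_weights):
        config = copy.deepcopy(base_config)  # leave the base config untouched
        config["method"] = method
        config["parameter"] = {"regularization_weight": weight}
        yield method, weight, config

# Placeholder base config; the real one holds the full "converter" section
base_config = {
    "converter": {},
    "parameter": {"regularization_weight": 1.0},
    "method": "AROW",
}
grid = list(make_configs(base_config, ["CW", "AROW"], [0.01, 0.1, 0.5, 1.0, 10.0]))
```

Each generated config could then be passed to Classifier and evaluated with do_cv in a loop.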

CW/0.01
   avg / total     0.8402    0.8371    0.8360      5525

CW/0.1
   avg / total     0.8466    0.8449    0.8437      5525

CW/0.5 
   avg / total     0.8541    0.8525    0.8510      5525

CW/1.0
   avg / total     0.8540    0.8525    0.8508      5525

CW/10.0
   avg / total     0.8375    0.8376    0.8351      5525

AROW/0.01
   avg / total     0.8438    0.8402    0.8394      5525

AROW/0.1
   avg / total     0.8506    0.8471    0.8469      5525

AROW/0.5
   avg / total     0.8492    0.8460    0.8460      5525

AROW/1.0
   avg / total     0.8492    0.8456    0.8458      5525

AROW/10.0
   avg / total     0.8482    0.8445    0.8447      5525

As shown above, CW with regularization_weight set to 0.5 gave the best result.

Adding features from outside Jubatus

So far I have tuned using only what Jubatus itself can do, raising the cross-validation score from 0.8192 to 0.8510. To squeeze out a bit more, I now build features outside Jubatus and add them.

COTOHA API

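A natural language processing API published by NTT Communications.
It offers seven Japanese-language processing APIs, including syntactic parsing, named entity extraction, and keyword extraction.
Anyone can apparently register as a Developer user for free and use the API.
See the official site and related articles for details.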

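Collecting the data

Developer accounts for the COTOHA API are free, but the number of requests per day is limited.
To build the dataset for analysis, I spread the requests over several days, saving the results as I went.
Because I used both syntactic parsing and named entity extraction this time, it took even longer.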
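
A collection loop that can be resumed on the next day might look like the sketch below. This is not the script actually used for this article. collect, call_api, and daily_limit are hypothetical names, and the actual request limit is not assumed here; the caller passes it in.

```python
import json
import os

def collect(items, out_dir, call_api, daily_limit):
    """Call the API for items not yet saved, stopping after daily_limit requests.

    items: iterable of (name, text) pairs; call_api: function text -> dict.
    Returns the number of requests made in this run.
    """
    os.makedirs(out_dir, exist_ok=True)
    made = 0
    for name, text in items:
        path = os.path.join(out_dir, name + ".json")
        if os.path.exists(path):
            continue  # already collected on a previous run
        if made >= daily_limit:
            break  # out of quota for today; rerun later to resume
        result = call_api(text)  # call first so a failure leaves no empty file
        with open(path, "w", encoding="utf-8") as f:
            json.dump(result, f, ensure_ascii=False)
        made += 1
    return made
```

Because saved files are skipped, rerunning the same call each day picks up where the previous run stopped.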

The code that calls the COTOHA API is as follows.


import requests
import json
import os

# The values below are shown after logging in to the COTOHA API Portal.
CLIENT_SECRET = "your CLIENT SECRET"
CLIENT_ID = "your CLIENT ID"
TOKEN_URL = "your TOKEN_URL"
API_BASE = "your API_BASE"

def get_token():
    """Authenticate and obtain an access token
    """
    headers = {
        "Content-Type": "application/json",
        "charset": "UTF-8"
    }
    data = {
        "grantType": "client_credentials",
        "clientId": CLIENT_ID,
        "clientSecret": CLIENT_SECRET
    }
    r = requests.post(TOKEN_URL, headers=headers, data=json.dumps(data))
    return r.json()


def parse(text, token):
    """Run syntactic parsing
    """
    headers = {
        "Content-Type": "application/json",
        "charset": "UTF-8",
        "Authorization": "Bearer {}".format(token)
    }
    data = {
        "sentence": text,
        "type": "default"
    }
    r = requests.post(API_BASE + "v1/parse", headers=headers, data=json.dumps(data))
    if r.json()["status"] != 0:
        print(r.json()["status"], text)
    return r.json()


def ne(text, token):
    """Run named entity extraction
    """
    headers = {
        "Content-Type": "application/json",
        "charset": "UTF-8",
        "Authorization": "Bearer {}".format(token)
    }
    data = {
        "sentence": text,
        "type": "default",
        "dic_type": []
    }
    r = requests.post(API_BASE + "v1/ne", headers=headers, data=json.dumps(data))
    if r.json()["status"] != 0:
        print(r.json()["status"], text)
    return r.json()

TOKEN = get_token()["access_token"]
text = "【週末映画まとめ読み】 『モテキ』初登場2位でトップ3を邦画が独占＜10月1日号＞"
print(json.dumps(parse(text, TOKEN), indent=2, ensure_ascii=False))
print(json.dumps(ne(text, TOKEN), indent=2, ensure_ascii=False))

Syntactic parsing result


{
  "result": [
    {
      "chunk_info": {
        "id": 0,
        "head": 5,
        "dep": "D",
        "chunk_head": 1,
        "chunk_func": 1,
        "links": []
      },
      "tokens": [
        {
          "id": 0,
          "form": "【",
          "kana": "",
          "lemma": "【",
          "pos": "括弧",
          "features": [
            "開括弧"
          ],
          "attributes": {}
        },
        {
          "id": 1,
          "form": "週末",
          "kana": "シュウマツ",
          "lemma": "週末",
          "pos": "名詞",
          "features": [
            "時",
            "連用"
          ],
          "dependency_labels": [
            {
              "token_id": 0,
              "label": "punct"
            }
          ],
          "attributes": {}
        }
      ]
    },
    {
      "chunk_info": {
        "id": 1,
        "head": 2,
        "dep": "D",
        "chunk_head": 0,
        "chunk_func": 0,
        "links": []
      },
      "tokens": [
        {
          "id": 2,
          "form": "映画",
          "kana": "エイガ",
          "lemma": "映画",
          "pos": "名詞",
          "features": [],
          "dependency_labels": [],
          "attributes": {}
        }
      ]
    },
    {
      "chunk_info": {
        "id": 2,
        "head": 3,
        "dep": "A",
        "chunk_head": 1,
        "chunk_func": 1,
        "links": [
          {
            "link": 1,
            "label": "other"
          }
        ]
      },
      "tokens": [
        {
          "id": 3,
          "form": "まとめ",
          "kana": "マトメ",
          "lemma": "まとめ",
          "pos": "名詞",
          "features": [],
          "attributes": {}
        },
        {
          "id": 4,
          "form": "読み",
          "kana": "ヨミ",
          "lemma": "読み",
          "pos": "名詞",
          "features": [],
          "dependency_labels": [
            {
              "token_id": 2,
              "label": "nmod"
            },
            {
              "token_id": 3,
              "label": "compound"
            },
            {
              "token_id": 5,
              "label": "punct"
            },
            {
              "token_id": 6,
              "label": "punct"
            }
          ],
          "attributes": {}
        },
        {
          "id": 5,
          "form": "】",
          "kana": "",
          "lemma": "】",
          "pos": "括弧",
          "features": [
            "閉括弧"
          ],
          "attributes": {}
        },
        {
          "id": 6,
          "form": " ",
          "kana": "",
          "lemma": " ",
          "pos": "空白",
          "features": [],
          "attributes": {}
        }
      ]
    },
    {
      "chunk_info": {
        "id": 3,
        "head": 4,
        "dep": "D",
        "chunk_head": 1,
        "chunk_func": 1,
        "links": [
          {
            "link": 2,
            "label": "other"
          }
        ]
      },
      "tokens": [
        {
          "id": 7,
          "form": "『",
          "kana": "",
          "lemma": "『",
          "pos": "括弧",
          "features": [
            "開括弧"
          ],
          "attributes": {}
        },
        {
          "id": 8,
          "form": "モテキ",
          "kana": "モテキ",
          "lemma": "モテキ",
          "pos": "名詞",
          "features": [],
          "dependency_labels": [
            {
              "token_id": 4,
              "label": "nmod"
            },
            {
              "token_id": 7,
              "label": "punct"
            },
            {
              "token_id": 9,
              "label": "punct"
            }
          ],
          "attributes": {}
        },
        {
          "id": 9,
          "form": "』",
          "kana": "",
          "lemma": "』",
          "pos": "括弧",
          "features": [
            "閉括弧"
          ],
          "attributes": {}
        }
      ]
    },
    {
      "chunk_info": {
        "id": 4,
        "head": 5,
        "dep": "D",
        "chunk_head": 0,
        "chunk_func": 0,
        "links": [
          {
            "link": 3,
            "label": "other"
          }
        ]
      },
      "tokens": [
        {
          "id": 10,
          "form": "初",
          "kana": "ハツ",
          "lemma": "初",
          "pos": "冠名詞",
          "features": [],
          "dependency_labels": [
            {
              "token_id": 8,
              "label": "dep"
            }
          ],
          "attributes": {}
        }
      ]
    },
    {
      "chunk_info": {
        "id": 5,
        "head": 6,
        "dep": "D",
        "chunk_head": 0,
        "chunk_func": 0,
        "links": [
          {
            "link": 0,
            "label": "time"
          },
          {
            "link": 4,
            "label": "other"
          }
        ]
      },
      "tokens": [
        {
          "id": 11,
          "form": "登場",
          "kana": "トウジョウ",
          "lemma": "登場",
          "pos": "名詞",
          "features": [
            "動作"
          ],
          "dependency_labels": [
            {
              "token_id": 1,
              "label": "nmod"
            },
            {
              "token_id": 10,
              "label": "dep"
            }
          ],
          "attributes": {}
        }
      ]
    },
    {
      "chunk_info": {
        "id": 6,
        "head": 10,
        "dep": "D",
        "chunk_head": 1,
        "chunk_func": 2,
        "links": [
          {
            "link": 5,
            "label": "other"
          }
        ]
      },
      "tokens": [
        {
          "id": 12,
          "form": "2",
          "kana": "ニ",
          "lemma": "2",
          "pos": "Number",
          "features": [],
          "attributes": {}
        },
        {
          "id": 13,
          "form": "位",
          "kana": "イ",
          "lemma": "位",
          "pos": "助数詞",
          "features": [],
          "dependency_labels": [
            {
              "token_id": 11,
              "label": "nmod"
            },
            {
              "token_id": 12,
              "label": "compound"
            },
            {
              "token_id": 14,
              "label": "cop"
            }
          ],
          "attributes": {}
        },
        {
          "id": 14,
          "form": "で",
          "kana": "デ",
          "lemma": "で",
          "pos": "判定詞",
          "features": [
            "連用"
          ],
          "attributes": {}
        }
      ]
    },
    {
      "chunk_info": {
        "id": 7,
        "head": 8,
        "dep": "D",
        "chunk_head": 0,
        "chunk_func": 0,
        "links": []
      },
      "tokens": [
        {
          "id": 15,
          "form": "トップ",
          "kana": "トップ",
          "lemma": "トップ",
          "pos": "名詞",
          "features": [],
          "dependency_labels": [],
          "attributes": {}
        }
      ]
    },
    {
      "chunk_info": {
        "id": 8,
        "head": 10,
        "dep": "D",
        "chunk_head": 0,
        "chunk_func": 1,
        "links": [
          {
            "link": 7,
            "label": "time"
          }
        ]
      },
      "tokens": [
        {
          "id": 16,
          "form": "3",
          "kana": "サン",
          "lemma": "3",
          "pos": "Number",
          "features": [],
          "dependency_labels": [
            {
              "token_id": 15,
              "label": "nmod"
            },
            {
              "token_id": 17,
              "label": "case"
            }
          ],
          "attributes": {}
        },
        {
          "id": 17,
          "form": "を",
          "kana": "ヲ",
          "lemma": "を",
          "pos": "格助詞",
          "features": [
            "連用"
          ],
          "attributes": {}
        }
      ]
    },
    {
      "chunk_info": {
        "id": 9,
        "head": 10,
        "dep": "D",
        "chunk_head": 0,
        "chunk_func": 1,
        "links": []
      },
      "tokens": [
        {
          "id": 18,
          "form": "邦画",
          "kana": "ホウガ",
          "lemma": "邦画",
          "pos": "名詞",
          "features": [],
          "dependency_labels": [
            {
              "token_id": 19,
              "label": "case"
            }
          ],
          "attributes": {}
        },
        {
          "id": 19,
          "form": "が",
          "kana": "ガ",
          "lemma": "が",
          "pos": "格助詞",
          "features": [
            "連用"
          ],
          "attributes": {}
        }
      ]
    },
    {
      "chunk_info": {
        "id": 10,
        "head": 11,
        "dep": "D",
        "chunk_head": 0,
        "chunk_func": 0,
        "links": [
          {
            "link": 6,
            "label": "other"
          },
          {
            "link": 8,
            "label": "object"
          },
          {
            "link": 9,
            "label": "agent"
          }
        ],
        "predicate": []
      },
      "tokens": [
        {
          "id": 20,
          "form": "独占",
          "kana": "ドクセン",
          "lemma": "独占",
          "pos": "名詞",
          "features": [
            "動作"
          ],
          "dependency_labels": [
            {
              "token_id": 13,
              "label": "nmod"
            },
            {
              "token_id": 16,
              "label": "dobj"
            },
            {
              "token_id": 18,
              "label": "nsubj"
            }
          ],
          "attributes": {}
        }
      ]
    },
    {
      "chunk_info": {
        "id": 11,
        "head": -1,
        "dep": "O",
        "chunk_head": 2,
        "chunk_func": 2,
        "links": [
          {
            "link": 10,
            "label": "other"
          }
        ]
      },
      "tokens": [
        {
          "id": 21,
          "form": "<",
          "kana": "",
          "lemma": "<",
          "pos": "Symbol",
          "features": [],
          "attributes": {}
        },
        {
          "id": 22,
          "form": "10月1日",
          "kana": "ジュウガツイチニチ",
          "lemma": "10月1日",
          "pos": "名詞",
          "features": [
            "日時"
          ],
          "attributes": {}
        },
        {
          "id": 23,
          "form": "号",
          "kana": "ゴウ",
          "lemma": "号",
          "pos": "名詞",
          "features": [],
          "dependency_labels": [
            {
              "token_id": 20,
              "label": "nmod"
            },
            {
              "token_id": 22,
              "label": "compound"
            },
            {
              "token_id": 21,
              "label": "compound"
            },
            {
              "token_id": 24,
              "label": "punct"
            }
          ],
          "attributes": {}
        },
        {
          "id": 24,
          "form": ">",
          "kana": "",
          "lemma": ">",
          "pos": "Symbol",
          "features": [],
          "attributes": {}
        }
      ]
    }
  ],
  "status": 0,
  "message": ""
}

Named entity extraction result


{
  "result": [
    {
      "begin_pos": 13,
      "end_pos": 16,
      "form": "モテキ",
      "std_form": "モテキ",
      "class": "ART",
      "extended_class": "",
      "source": "basic"
    },
    {
      "begin_pos": 34,
      "end_pos": 39,
      "form": "10月1日",
      "std_form": "10月1日",
      "class": "DAT",
      "extended_class": "",
      "source": "basic"
    },
    {
      "begin_pos": 20,
      "end_pos": 22,
      "form": "2位",
      "std_form": "2位",
      "class": "NUM",
      "extended_class": "Rank",
      "source": "basic"
    },
    {
      "begin_pos": 23,
      "end_pos": 27,
      "form": "トップ3",
      "std_form": "トップ3",
      "class": "NUM",
      "extended_class": "Rank",
      "source": "basic"
    }
  ],
  "status": 0,
  "message": ""
}

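Syntactic parsing yields information on each morpheme along with its grammatical role and any special features.
Named entity extraction pulls out expressions that carry a specific meaning, such as proper nouns and numeric expressions.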
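
As a small worked example of reading the parse response, the sketch below flattens the chunks into a list of lemmas. The sample dict is a trimmed-down response in the same shape as the output above, and extract_lemmas is a hypothetical helper name.

```python
def extract_lemmas(parse_response):
    """Flatten the chunks of a parse response into a list of lemmas."""
    return [token["lemma"]
            for chunk in parse_response["result"]
            for token in chunk["tokens"]]

# Trimmed-down response with the same structure as the parse output
sample = {
    "result": [
        {"chunk_info": {"id": 1},
         "tokens": [{"id": 2, "form": "映画", "lemma": "映画", "pos": "名詞"}]},
        {"chunk_info": {"id": 9},
         "tokens": [
             {"id": 18, "form": "邦画", "lemma": "邦画", "pos": "名詞"},
             {"id": 19, "form": "が", "lemma": "が", "pos": "格助詞"},
         ]},
    ],
    "status": 0,
}
lemmas = extract_lemmas(sample)
```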

I save these analysis results in the directory structure below.
Under each article category are the JSON files holding the analysis results.

├── ne_title
│   ├── dokujo-tsushin
│   ├── it-life-hack
│   ├── kaden-channel
│   ├── livedoor-homme
│   ├── movie-enter
│   ├── peachy
│   ├── smax
│   ├── sports-watch
│   └── topic-news
└── parse_title
    ├── dokujo-tsushin
    ├── it-life-hack
    ├── kaden-channel
    ├── livedoor-homme
    ├── movie-enter
    ├── peachy
    ├── smax
    ├── sports-watch
    └── topic-news
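
Since this layout mirrors the text directory, the path of an analysis file can be derived from the path of its article. The sketch below does this with pathlib, assuming each JSON file is named after its source article. The file name is hypothetical and analysis_path is an illustrative helper.

```python
from pathlib import Path

def analysis_path(text_path, root):
    """Map text/<category>/<name>.txt to <root>/<category>/<name>.json."""
    p = Path(text_path)
    return Path(root) / p.parent.name / (p.stem + ".json")

# Hypothetical file name, for illustration only
src = "text/movie-enter/movie-enter-0001.txt"
parse_file = analysis_path(src, "parse_title")
ne_file = analysis_path(src, "ne_title")
```

Working on path components avoids accidental matches when a file name itself happens to contain the string "text" or "txt".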

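To extract the morpheme lemmas (base forms) from the parse results and the named entities from these files, I prepared the following functions.

Feature extraction functions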


def get_tokens(result):
    tokens = []
    for r in result:
        for t in r["tokens"]:
            tokens.append(t)
    return tokens

def make_datum_list_with_cotoha(df, add_lemma=False,
                                add_ne_form=False, ne_filter=None):
    datum_list = []
    for d in df["data"]:
        dt = Datum()
        with open(d) as f:
            l = f.readlines()
            doc = l[2].rstrip()
            dt.add_string("title", doc) # add the text to the Datum

        parse_file = d.replace("text", "parse_title").replace("txt", "json")
        ne_file = d.replace("text", "ne_title").replace("txt", "json")
        with open(parse_file) as f, open(ne_file) as ne:
            j = json.load(f)
            ne_j = json.load(ne)
            tokens = get_tokens(j["result"])

            # Add the named entities (all of them when ne_filter is empty)
            for r in ne_j["result"]:
                if add_ne_form:
                    if ne_filter:
                        if r["class"] in ne_filter:
                            dt.add_number("ne-{}".format(r["form"]), 1.0)                            
                    else:
                        dt.add_number("ne-{}".format(r["form"]), 1.0)

            # Get the lemma from each token
            for r in j["result"]:
                for t in r["tokens"]:
                    k = "lemma-{}".format(t["lemma"])
                    v = 1.0
                    if add_lemma:
                        dt.add_number(k, v)
        datum_list.append(dt)
    print(len(datum_list))
    return datum_list

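Adding lemmas

The COTOHA API and MeCab seem to produce fairly different morphological analyses.
(Reference) Comparison of the COTOHA API and MeCab

The COTOHA API tends to build longer morphemes, so it may pick up more distinctive words and improve classification accuracy.

Adding the lemma information and running cross-validation with 3 training passes gave the following result.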


datum_list = make_datum_list_with_cotoha(df, add_lemma=True)
do_cv(cl, 3) # Jubatus runs with CW, regularization_weight 0.5

   avg / total     0.8548    0.8530    0.8516      5525

Accuracy rose slightly, so I adopt it.

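Adding named entities

Named entities are expressions with a specific meaning, such as person names, place names, and numeric expressions.
The COTOHA API can extract them, and because it pulls out strings that span morpheme boundaries, it may yield features unlike any used so far.
The named entity classes available from the COTOHA API are listed here.
Extracting dates and times seems unlikely to help classification, so this time I extract organization names, person names, locations, and artifact names.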
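
The effect of this class filter can be seen on the named entity extraction result shown earlier. In the sketch below the entities are copied from that result with the other fields trimmed, and ne_features is a hypothetical helper that mirrors the "ne-" feature naming used in the feature extraction function.

```python
def ne_features(ne_response, allowed_classes):
    """Return feature names for entities whose class is in allowed_classes."""
    return ["ne-{}".format(e["form"])
            for e in ne_response["result"]
            if e["class"] in allowed_classes]

# Entities from the named entity extraction result shown earlier (fields trimmed)
sample = {"result": [
    {"form": "モテキ", "class": "ART"},
    {"form": "10月1日", "class": "DAT"},
    {"form": "2位", "class": "NUM"},
    {"form": "トップ3", "class": "NUM"},
]}
features = ne_features(sample, {"ORG", "PSN", "LOC", "ART"})
```

Only the film title survives the filter; the date and the two rank expressions are dropped.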


datum_list = make_datum_list_with_cotoha(
    df, add_lemma=True,
    add_ne_form=True, ne_filter=set(["ORG", "PSN", "LOC", "ART"]))

do_cv(cl, 3) # Jubatus runs with CW, regularization_weight 0.5

   avg / total     0.8550    0.8534    0.8518      5525

This also improved accuracy, if only slightly, so I adopt it.

Algorithm selection and parameter tuning

With more features in place, I select the algorithm and hyperparameters again.

AROW 0.01
   avg / total     0.8505    0.8485    0.8474      5525
AROW 0.1
   avg / total     0.8543    0.8516    0.8513      5525
AROW 0.5
   avg / total     0.8526    0.8501    0.8499      5525
AROW 1.0
   avg / total     0.8547    0.8523    0.8522      5525
AROW 10.0
   avg / total     0.8528    0.8500    0.8498      5525

CW 0.01
   avg / total     0.8426    0.8393    0.8379      5525
CW 0.1
   avg / total     0.8522    0.8509    0.8495      5525
CW 0.5
   avg / total     0.8553    0.8538    0.8522      5525
CW 1.0
   avg / total     0.8538    0.8521    0.8506      5525
CW 10.0
   avg / total     0.8467    0.8460    0.8439      5525

AROW at 1.0 and CW at 0.5 tied. With more training passes CW improved further, so I adopt CW.
Increasing the number of passes, CW ended up with the following scores.

3 passes
   avg / total     0.8553    0.8538    0.8522      5525

4 passes
   avg / total     0.8578    0.8565    0.8551      5525

5 passes
   avg / total     0.8582    0.8567    0.8553      5525

6 passes
   avg / total     0.8590    0.8576    0.8563      5525

7 passes
   avg / total     0.8595    0.8581    0.8569      5525

8 passes
   avg / total     0.8588    0.8577    0.8564      5525
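
Picking the pass count from such a sweep is a simple argmax over the f1-score column. The sketch below uses the scores listed above; the variable names are illustrative.

```python
# Weighted f1 per number of training passes, copied from the results above
f1_by_passes = {3: 0.8522, 4: 0.8551, 5: 0.8553, 6: 0.8563, 7: 0.8569, 8: 0.8564}

# Choose the pass count with the highest cross-validation f1-score
best_passes = max(f1_by_passes, key=f1_by_passes.get)
best_f1 = f1_by_passes[best_passes]
```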

The model trained with 7 passes was the most accurate, reaching 0.8569.
Finally, I run holdout validation with this model.


do_holdout(cl, 7)

                precision    recall  f1-score   support

dokujo-tsushin     0.8201    0.8991    0.8578       218
  it-life-hack     0.8879    0.9124    0.9000       217
 kaden-channel     0.9807    0.9398    0.9598       216
livedoor-homme     0.9362    0.6875    0.7928       128
   movie-enter     0.8369    0.8945    0.8647       218
        peachy     0.8534    0.7762    0.8130       210
          smax     0.8921    0.9862    0.9368       218
  sports-watch     0.9163    0.8756    0.8955       225
    topic-news     0.8492    0.8802    0.8645       192

   avg / total     0.8841    0.8817    0.8806      1842

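Summary

Tuning with Jubatus alone

0.8192 
-> repeated training: 0.8345 
-> morpheme bi-grams: 0.8458 
-> algorithm selection: 0.8510

Adding the COTOHA API

0.8510
-> adding lemmas: 0.8516
-> adding named entities: 0.8518
-> algorithm selection: 0.8522
-> more training passes: 0.8569

Holdout validation

0.8430 -> 0.8806

Even with only article titles to go on, trial and error raised accuracy a little. As for the COTOHA API, beyond extracting strings it also returns dependency relations between morphemes and per-morpheme features (sub-POS tags?), so I suspect there is room to make more use of it.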