Using Jubatus to learn which characteristic words appear in text and categorize input text

Last updated at 2015-04-15, posted at 2015-04-15

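I tried getting started with Jubatus.

Prerequisites and goal

CentOS-6.6
A large number of categorized blog posts (body text) are stored in MySQL
The goal is to have Jubatus learn the correspondence between the characteristic words in blog posts and their categories, so that when it is fed an arbitrary piece of text it estimates which category the text most likely belongs to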

Installing Jubatus

Install from the package, following the instructions on the official site.

$ sudo rpm -Uvh http://download.jubat.us/yum/rhel/6/stable/x86_64/jubatus-release-6-1.el6.x86_64.rpm
$ sudo yum install jubatus jubatus-client

Getting the samples

There is a sample repository called jubatus-example, so clone it.

$ git clone https://github.com/jubatus/jubatus-example.git

The explanations, including a Japanese README, are fairly thorough, so I think this is an easy place to start.

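Modifying a sample

For this goal, the sample called twitter_streaming_location looks usable. It works as follows.

Training
Fetch tweets from the Twitter public stream that carry geotags within the Tokyo, Hokkaido, or Kyushu areas
Train on the tweet text, labeling each tweet with the region it came from
Classification
Given a piece of text, estimate which region it was tweeted from

Copy the whole twitter_streaming_location directory under a suitable name and modify the copy.

The training step learns the correspondence between blog categories and body text,
and the classifier is given an arbitrary piece of text and estimates its category.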

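Preparing the training step

Preparing the training data

Write a suitable SQL query and dump the list of blog categories and bodies to a text file (category in the first column, body in the second, which is the order train.py expects). From the CLI, the following yields tab-separated data.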

$ mysql -uuser -p -N db < blog.sql > blog.txt
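
When `mysql` writes to a file rather than a terminal it runs in batch mode, which, as far as I know, separates columns with tabs and escapes tabs, newlines, and backslashes inside values as `\t`, `\n`, and `\\`. The sketch below shows one way to turn such a line back into a (category, body) pair. The names `unescape` and `parse_row` are hypothetical, and the code assumes the query selects exactly two columns, category first and body second.

```python
import re

# Escape sequences that mysql batch mode emits inside column values.
_ESCAPES = {"n": "\n", "t": "\t", "0": "\0", "\\": "\\"}


def unescape(value):
    """Undo mysql batch-mode escaping in a single column value."""
    # Handle each backslash pair left to right so that an escaped backslash
    # followed by the letter n is not mistaken for a newline.
    return re.sub(r"\\(.)", lambda m: _ESCAPES.get(m.group(1), m.group(0)), value)


def parse_row(line):
    """Split one exported line into (category, body), or None if malformed."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 2:
        return None
    category, body = (unescape(f) for f in fields)
    return category, body
```

Whether to unescape before training is a judgment call: leaving the literal `\n` sequences in the body only adds a few extra tokens, so the simpler script works too.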

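Modifying train.py

The original train.py does things like parsing tweet geotags, so all of that is cut out. Instead of tweets fetched over the network, it is rewritten slightly to train on tab-separated data fed through standard input.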

train.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys

from jubatus.classifier import client
from jubatus.common import Datum

# Jubatus Configuration
host = "127.0.0.1"
port = 9199
instance_name = "" # required only when using distributed mode

def print_color(color, msg, end):
    sys.stdout.write('\033[' + str(color) + 'm' + str(msg) + '\033[0m' + str(end))

def print_red(msg, end="\n"):
    print_color(31, msg, end)

def print_green(msg, end="\n"):
    print_color(32, msg, end)

def train():
    classifier = client.Classifier(host, port, instance_name)
    for line in sys.stdin:
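        # Split only on the first tab, drop the trailing newline,
        # and skip lines that lack a category or a body.
        fields = line.rstrip("\n").split("\t", 1)
        if len(fields) != 2:
            continue
        category_name, body = fields
        d = Datum({'text': body})
        classifier.train([(category_name, d)])

        # Print trained entry
        print_green(category_name, ' ')
        print body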

    # To back up the trained model after training, enable the line below
    # classifier.save("foo")

if __name__ == '__main__':
    try:
        train()
    except KeyboardInterrupt:
        print "Stopped."
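
The script above sends one request per input line. Since `classifier.train` takes a list, several entries could be sent per call instead. A minimal sketch of the grouping step follows; `batched` and the batch size of 100 are illustrative choices, not part of the original script or of Jubatus, and whether batching is measurably faster here was not tested.

```python
def batched(items, size):
    """Yield lists of up to `size` consecutive items from any iterable."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly shorter, batch
        yield batch


# Hypothetical use inside train(), with `pairs` being an iterable of
# (category_name, Datum) tuples built from stdin:
#
#     for batch in batched(pairs, 100):
#         classifier.train(batch)
```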

Preparing the classification step

Modifying classify.py

This one needs almost no changes, but I changed it to display only the top three estimated categories.

classify.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys

from jubatus.classifier import client
from jubatus.common import Datum

# Jubatus configuration
host = "127.0.0.1"
port = 9199
instance_name = "" # required only when using distributed mode

def estimate_blog_category_for(text):
    classifier = client.Classifier(host, port, instance_name)

    # Create datum for Jubatus
    d = Datum({'text': text})

    # Send estimation query to Jubatus
    result = classifier.classify([d])

    if len(result[0]) > 0:
        # Sort results by score
        est = sorted(result[0], key=lambda e: e.score, reverse=True)

        # Print the result
        print "Estimated Category for %s:" % text
        # Show only the top three categories
        for e in est[:3]:
            print "  " + e.label + " (" + str(e.score) + ")"
    else:
        # No estimation results; maybe we haven't trained enough
        print "No estimation results available."
        print "Train more data or try using another text."

if __name__ == '__main__':
    if len(sys.argv) == 2:
        estimate_blog_category_for(sys.argv[1])
    else:
        print "Usage: %s data" % sys.argv[0]
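
Sorting every label is fine for a handful of categories. With many categories, `heapq.nlargest` from the standard library picks the top entries without a full sort. In this sketch `Estimate` is a hypothetical stand-in for the result objects returned by `classify`, which expose `label` and `score` as used in classify.py; `top_k` is likewise an illustrative name.

```python
import heapq
from collections import namedtuple

# Stand-in for Jubatus estimate results (assumption: only .label and .score matter).
Estimate = namedtuple("Estimate", ["label", "score"])


def top_k(estimates, k=3):
    """Return the k highest-scoring estimates, best first."""
    return heapq.nlargest(k, estimates, key=lambda e: e.score)


sample = [
    Estimate("a", 0.1),
    Estimate("b", 0.7),
    Estimate("c", 0.3),
    Estimate("d", 0.5),
]
for e in top_k(sample):
    print("  %s (%s)" % (e.label, e.score))
```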

Starting the Jubatus server

I wanted text to be tokenized with MeCab rather than character bigrams, so I edited the config slightly.

blog_category.json

{
  "method": "NHERD",
  "parameter": {
    "regularization_weight": 0.001
  },
  "converter": {
    "num_filter_types": {
    },
    "num_filter_rules": [
    ],
    "string_filter_types": {
    },
    "string_filter_rules": [
    ],
    "num_types": {
    },
    "num_rules": [
    ],
    "string_types": {
        "bigram":  { "method": "ngram", "char_num": "2" },
        "mecab": {
          "method": "dynamic",
          "path": "libmecab_splitter.so",
          "function": "create"
        }
    },
    "string_rules": [
        { "key": "*", "type": "mecab", "sample_weight": "bin", "global_weight": "idf" }
    ]
  }
}
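
For intuition about the `string_types` and `string_rules` entries: an `ngram` splitter with `char_num` 2 cuts text into overlapping two-character pieces, while the MeCab splitter cuts it into words. With `sample_weight` set to `bin`, a feature counts once per document however often it occurs, and `idf` down-weights features that appear in many documents. The sketch below illustrates these ideas in plain Python. It is not Jubatus's implementation, the function names are made up for illustration, and the idf formula is the textbook log(N/df), which may differ from what Jubatus computes internally.

```python
import math


def char_ngrams(text, n=2):
    """Split text into overlapping character n-grams."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def binary_features(tokens):
    """'bin' style weighting: each distinct token gets weight 1."""
    return {tok: 1.0 for tok in tokens}


def idf_weights(documents):
    """Textbook idf, log(N / df), over a list of token lists."""
    n_docs = len(documents)
    df = {}
    for tokens in documents:
        for tok in set(tokens):
            df[tok] = df.get(tok, 0) + 1
    return {tok: math.log(n_docs / float(count)) for tok, count in df.items()}


docs = [char_ngrams("abcab"), char_ngrams("abd")]
print(docs[0])                           # ['ab', 'bc', 'ca', 'ab']
print(sorted(binary_features(docs[0])))  # ['ab', 'bc', 'ca']
```

In this toy data "ab" occurs in both documents, so its idf is log(2/2) = 0, while "bc" occurs in only one and gets log(2).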

Start the server with blog_category.json.

$ jubaclassifier -f blog_category.json -t 0
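
Before launching, a small check can catch a `string_rules` entry whose `type` is not defined under `string_types`. The sketch below assumes the config layout shown in blog_category.json. `undefined_rule_types` is a hypothetical helper, and the set of built-in type names (`str`, `space`) reflects my understanding of Jubatus's predefined string types, so it should be checked against the documentation for your version.

```python
import json

# Assumption: type names usable without being declared in string_types.
BUILTIN_STRING_TYPES = {"str", "space"}


def undefined_rule_types(config):
    """Return string_rules types that string_types does not define."""
    converter = config.get("converter", {})
    defined = set(converter.get("string_types", {})) | BUILTIN_STRING_TYPES
    missing = set()
    for rule in converter.get("string_rules", []):
        rule_type = rule.get("type", "<missing type>")
        if rule_type not in defined:
            missing.add(rule_type)
    return sorted(missing)


# Deliberately broken example: the rule uses "mecab" but only "bigram" is defined.
example = json.loads("""
{
  "converter": {
    "string_types": {"bigram": {"method": "ngram", "char_num": "2"}},
    "string_rules": [{"key": "*", "type": "mecab"}]
  }
}
""")
print(undefined_rule_types(example))  # ['mecab']
```

To check the real file, load it with `json.load` and pass the resulting dict to the function; an empty list means every rule refers to a known type.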

Testing it

Training

Feed the prepared training data to train.py.

$ cat blog.txt | ./train.py

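Classification

Feed it some arbitrary text and have it estimate the category. (The input below means "Nice to meet you. My name is Tanaka."; the labels 自己紹介, 日記, and お知らせ mean self-introduction, diary, and announcements.)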

$ ./classify.py "はじめまして。田中といいます。"
Estimated Category for はじめまして。田中といいます。:
  自己紹介 (0.231856495142)
  日記 (0.0823381990194)
  お知らせ (0.0661180838943)

References

Jubatus official site
Jubatus References (I consult it often)
