
Learning Japanese text categories with tf-idf and random forest: using livedoor news as the subject

Last updated at 2016-07-25, posted at 2016-06-10

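Introduction

This topic has arguably been done to death already, but it seemed like good material for getting used to handling Japanese text, so I am writing it up on Qiita.

This article uses the following.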

python (anaconda3-2.5.0)
scikit-learn (sklearn)
MeCab (called through natto-py)

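livedoor news corpus

It was created by collecting the news articles from "livedoor news", operated by NHN Japan Corporation, that are covered by the following Creative Commons license, and removing HTML tags as far as possible.

I use the data available from http://www.rondhuit.com/download.html.

Preparing data is one of the most time-consuming tasks in machine learning, so I am truly grateful that cleaned-up data like this is made available.

Extracting the tar archive gives the following layout.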

livedoor-news-data
├── CHANGES.txt
├── README.txt
├── dokujo-tsushin
│   ├── LICENSE.txt
│   ├── dokujo-tsushin-4782522.txt
│   ├── dokujo-tsushin-4788357.txt
...skipping...
├── it-life-hack
│   ├── LICENSE.txt
│   ├── it-life-hack-6292880.txt
│   ├── it-life-hack-6294340.txt
...skipping...
    ├── topic-news-6841012.txt
    └── topic-news-6918105.txt

Next, the text data has to be loaded in a form that is easy to use in the later steps.

import glob

def load_livedoor_news_corpus():
    category = {
        'dokujo-tsushin': 1,
        'it-life-hack':2,
        'kaden-channel': 3,
        'livedoor-homme': 4,
        'movie-enter': 5,
        'peachy': 6,
        'smax': 7,
        'sports-watch': 8,
        'topic-news':9
    }
    docs  = []
    labels = []

    for c_name, c_id in category.items():
        files = glob.glob("./text/{c_name}/{c_name}*.txt".format(c_name=c_name))

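        for path in files:
            # The corpus is distributed as UTF-8 text, so pass the encoding
            # explicitly instead of relying on the locale default.
            with open(path, 'r', encoding='utf-8') as f:
                lines = f.read().splitlines()

            # Line 1 is the URL and line 2 the timestamp; neither is used here.
            subject = lines[2]
            body = "\n".join(lines[3:])
            text = subject + "\n" + body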

            docs.append(text)
            labels.append(c_id)
    
    return docs, labels

docs, labels = load_livedoor_news_corpus()

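In each text file the third line is the title and the body starts on the fourth line, so the two are joined and stored in the docs list. The category is converted to an integer and stored in the labels list.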
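To make the file layout concrete, here is a minimal sketch that splits one article into its parts. The helper name parse_article and the sample text are my own illustrations (the sample is made up, not taken from the corpus); only the line order (URL, timestamp, title, body) comes from the loader above.

```python
def parse_article(raw_text):
    """Split one corpus file into its URL, timestamp, title and body."""
    lines = raw_text.splitlines()
    url, timestamp, title = lines[0], lines[1], lines[2]
    body = "\n".join(lines[3:])
    return {"url": url, "timestamp": timestamp, "title": title, "body": body}


# Made-up sample in the same four-part layout (not taken from the corpus).
sample = (
    "http://example.com/article/1\n"
    "2012-01-01T00:00:00+0900\n"
    "Sample title\n"
    "First body line.\n"
    "Second body line."
)

article = parse_article(sample)
# Same concatenation as in load_livedoor_news_corpus: title, newline, body.
text = article["title"] + "\n" + article["body"]
print(text)
```

Keeping the URL and timestamp in the returned dict costs nothing and leaves them available if they turn out to be useful later.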

print(len(docs))   # 7367
print(len(labels)) # 7367

Creating training and test data

There are 7,367 documents, so I simply split them into 7,000 training documents and 367 test documents.
docs and labels are ordered by category, so they need to be shuffled first.

import random

# indices is a list of the integers 0 to 7366 in random order
random.seed()
indices = list(range(len(docs)))
random.shuffle(indices)

train_data   = [docs[i] for i in indices[0:7000]]
train_labels = [labels[i] for i in indices[0:7000]]
test_data    = [docs[i] for i in indices[7000:]]
test_labels  = [labels[i] for i in indices[7000:]]
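Since random.seed() without an argument gives a different shuffle on every run, the scores are hard to compare between runs. The following is a small sketch of a reproducible variant under my own assumptions: the function name shuffled_split and the seed value are arbitrary, and the toy lists merely stand in for the real docs and labels.

```python
import random


def shuffled_split(docs, labels, n_train, seed=0):
    """Shuffle documents and labels together, then split at n_train."""
    rng = random.Random(seed)  # private generator; the seed value is arbitrary
    indices = list(range(len(docs)))
    rng.shuffle(indices)
    train_idx, test_idx = indices[:n_train], indices[n_train:]
    train = ([docs[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([docs[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test


# Toy data standing in for the real docs and labels.
toy_docs = ["doc{}".format(i) for i in range(10)]
toy_labels = [i % 3 for i in range(10)]

(train_docs, train_y), (test_docs, test_y) = shuffled_split(
    toy_docs, toy_labels, n_train=7, seed=42
)
print(len(train_docs), len(test_docs))  # 7 3
```

With the real data the call would be shuffled_split(docs, labels, n_train=7000), and the same seed always yields the same split.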

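Converting documents to tf-idf vectors

To classify documents by category, each document has to be converted into some kind of multidimensional vector. Here I represent the documents as tf-idf vectors.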

from natto import MeCab
from sklearn.feature_extraction.text import TfidfVectorizer

def tokenize(text):
    tokens = []
    with MeCab('-F%f[0],%f[6]') as nm:
        for n in nm.parse(text, as_nodes=True):
            # ignore any end-of-sentence nodes
            if not n.is_eos() and n.is_nor():
                klass, word = n.feature.split(',', 1)
                if klass in ['名詞', '形容詞', '形容動詞', '動詞']:
                    tokens.append(word)

    return tokens

vectorizer = TfidfVectorizer(tokenizer=tokenize)
train_matrix = vectorizer.fit_transform(train_data)

tokenize keeps only the tokens whose part of speech is one of ['名詞', '形容詞', '形容動詞', '動詞'] (noun, adjective, adjectival noun, verb).
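To show what the numbers in a tf-idf vector mean, here is a pure-Python sketch of the calculation. It assumes the weighting that, as far as I understand the scikit-learn documentation, TfidfVectorizer applies by default: raw term counts, a smoothed idf of ln((1 + n) / (1 + df)) + 1, and L2 normalisation. The function name tfidf_vectors and the English toy tokens are my own stand-ins for the output of tokenize.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return (vocabulary, list of L2-normalised tf-idf vectors)."""
    n_docs = len(tokenized_docs)
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    # Smoothed idf, assumed to match TfidfVectorizer's default settings.
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocabulary}

    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        raw = [tf[tok] * idf[tok] for tok in vocabulary]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        vectors.append([v / norm for v in raw])
    return vocabulary, vectors


# Toy, already-tokenised documents (hypothetical stand-ins for tokenize() output).
toy = [["movie", "release", "movie"], ["phone", "launch"], ["movie", "launch"]]
vocab, vecs = tfidf_vectors(toy)
for tok, weight in zip(vocab, vecs[0]):
    print(tok, round(weight, 3))
```

A token that occurs often in one document but in few documents overall gets a large weight, and a token that is absent from a document gets zero, which is why the real matrix is sparse.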

Training with naive Bayes

I train the model with Multinomial Naive Bayes.

MultinomialNB implements the naive Bayes algorithm for multinomially distributed data, and is one of the two classic naive Bayes variants used in text classification (where the data are typically represented as word vector counts, although tf-idf vectors are also known to work well in practice).

from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
clf.fit(train_matrix, train_labels)

test_matrix = vectorizer.transform(test_data)

print(clf.score(train_matrix, train_labels)) # 0.881
print(clf.score(test_matrix, test_labels)) # 0.825613079019

Because the data is shuffled randomly, the result differs on every run, but I think 82% is a reasonable result.
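The score method of a scikit-learn classifier reports mean accuracy, which hides which categories get confused. Below is a minimal sketch of the same calculation plus a per-category breakdown. The function names and the toy labels are hypothetical; with the real model, y_pred would come from clf.predict(test_matrix).

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def per_category_accuracy(y_true, y_pred):
    """Accuracy computed separately for each true category id."""
    total = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / total[label] for label in sorted(total)}


# Hypothetical labels and predictions, for illustration only.
y_true = [1, 1, 2, 2, 2, 9]
y_pred = [1, 2, 2, 2, 9, 9]
print(accuracy(y_true, y_pred))
print(per_category_accuracy(y_true, y_pred))
```

A category whose accuracy is much lower than the overall figure is a good place to start looking at misclassified articles.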

Training with random forest

While I am at it, I also try a random forest.

from sklearn.ensemble import RandomForestClassifier

clf2 = RandomForestClassifier(n_estimators=100)
clf2.fit(train_matrix, train_labels)

print(clf2.score(train_matrix, train_labels)) # 1.0
print(clf2.score(test_matrix, test_labels)) # 0.896457765668

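Accuracy on the training data is 100% while the test data reaches only 89%, so the model looks somewhat overfitted.

Some extra work, such as reducing the number of features in the tf-idf matrix, may be needed.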
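One way to reduce the features is to drop rare tokens before building the matrix. TfidfVectorizer has the min_df, max_df and max_features parameters for this; the sketch below only imitates the idea in pure Python so the effect is easy to see. The function name prune_vocabulary and the toy documents are hypothetical, and ranking by document frequency is my simplification, not necessarily how scikit-learn orders terms.

```python
from collections import Counter


def prune_vocabulary(tokenized_docs, min_df=2, max_features=None):
    """Keep tokens appearing in at least min_df documents, most frequent first."""
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    kept = [(tok, count) for tok, count in df.items() if count >= min_df]
    # Sort by descending document frequency, then alphabetically for stable output.
    kept.sort(key=lambda item: (-item[1], item[0]))
    if max_features is not None:
        kept = kept[:max_features]
    return [tok for tok, _ in kept]


# Toy, already-tokenised documents (hypothetical).
toy_docs = [
    ["movie", "release", "star"],
    ["movie", "phone"],
    ["phone", "release", "movie"],
    ["camera"],
]
print(prune_vocabulary(toy_docs, min_df=2))
print(prune_vocabulary(toy_docs, min_df=2, max_features=2))
```

Tokens that appear in only one document cannot help the model generalise to unseen articles, so removing them is a cheap first step against overfitting.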

Addendum

With tuning, I was able to raise the accuracy to 94.8%.
http://qiita.com/kotaroito/items/0f7898c036304d5d5ae0

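Summary

Written up as an article, it makes me think "that was not so hard after all," but while working on it I kept groaning "hmm..." and tripping over trivial things.

It reminded me once again that everyday familiarity with the underlying techniques (in this case, morphological analysis and tf-idf) is the key to implementing things quickly and without mistakes.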
