
Training a Japanese Text Classification Task with GiNZA

Posted at 2019-10-24

Introduction

This article trains a Japanese text classification model using GiNZA.
GiNZA is an open-source Japanese NLP library based on Universal Dependencies. (Quoted from the description on its GitHub Pages site.)

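"GiNZA" is an open-source Japanese natural language processing library featuring one-step installation, fast and highly accurate analysis, and internationalization support at the level of word dependency structure analysis. "GiNZA" uses "spaCy", a natural language processing library that incorporates state-of-the-art machine learning techniques, as its framework, and embeds the open-source morphological analyzer "SudachiPy" for tokenization. The "GiNZA Japanese UD model" incorporates the results of joint research between Megagon Labs and the National Institute for Japanese Language and Linguistics.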

In my experience, Japanese NLP environments are often complicated to set up, so GiNZA, which installs in one step and offers fast, accurate analysis, looks promising.
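As a rough illustration of what "spaCy as the framework" means in practice, the sketch below loads a GiNZA model through the ordinary spaCy API and lists each token with its part-of-speech tag and dependency label. The helper names `token_rows` and `analyze` are hypothetical, and the model name `ja_ginza` is taken from the command line used later in this article; spaCy is imported lazily so the helper can be read and tested without GiNZA installed.

```python
def token_rows(doc):
    """Return (text, part-of-speech, dependency label) for every token in doc."""
    return [(token.text, token.pos_, token.dep_) for token in doc]


def analyze(text, model="ja_ginza"):
    """Parse text with a spaCy pipeline and return its token rows.

    Assumes ginza and the ja_ginza model are installed; spaCy is imported
    here rather than at module level so that token_rows stays usable alone.
    """
    import spacy

    nlp = spacy.load(model)
    return token_rows(nlp(text))
```

Calling `analyze("銀座でランチをご一緒しましょう。")` in an environment with GiNZA installed should return one tuple per token; the exact tags depend on the model version.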

Environment

The program in this article was tested in the following environment.

macOS Mojave
Python 3.6.5
ginza 2.2.0
spacy 2.2.1
scikit-learn 0.21.3

Dataset

This article uses the livedoor news corpus for the classification task.
Prepare the data as follows.

$ wget https://www.rondhuit.com/download/ldcc-20140209.tar.gz
$ tar -xvf ldcc-20140209.tar.gz
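Before training, it can help to confirm that the archive extracted as expected. The following is a minimal sketch, assuming the layout that load_data in train.py relies on: a `text` directory containing one subdirectory per category, each holding article files plus a license file. The function name `count_articles` is hypothetical.

```python
from collections import Counter
from pathlib import Path


def count_articles(root="text"):
    """Count article files per category directory under root."""
    counts = Counter()
    for category in sorted(Path(root).iterdir()):
        if not category.is_dir():
            continue  # ignore any files sitting directly under root
        for article in category.iterdir():
            if "LICENSE" in article.name:
                continue  # license files are not articles
            counts[category.name] += 1
    return counts
```

Running `count_articles()` from the directory where the archive was extracted gives a quick view of how balanced the categories are.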

Installing GiNZA

I was able to install it by following the instructions on the GitHub Pages site.

$ pip install "https://github.com/megagonlabs/ginza/releases/download/latest/ginza-latest.tar.gz"

There is a pitfall when running on Google Colaboratory: the following extra step is reportedly required.

import pkg_resources, imp
imp.reload(pkg_resources)

Details can be found in the article "【GiNZA】GoogleColabで日本語NLPライブラリGiNZAがloadできない".
Many thanks to the author; it was a great help.
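The `imp` module used in the workaround has been deprecated since Python 3.4 and was removed in Python 3.12, so on newer runtimes the same reload can be written with `importlib`. The sketch below is one way to do that; `reload_module` is a hypothetical helper. My understanding, which I have not verified against the spaCy source, is that the reload is needed because `pkg_resources` caches the list of installed packages when the kernel starts.

```python
import importlib


def reload_module(name):
    """Import the named module if possible and reload it.

    Returns the reloaded module, or None when the module is not installed.
    """
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    return importlib.reload(module)
```

On Colab this would be called as `reload_module("pkg_resources")` right after the pip install and before `spacy.load`.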

Implementation

I kept the implementation as close as possible to the spaCy reference example.
The two main changes are in load_data and evaluate.
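Both functions revolve around one data shape: each article is annotated with a dict that maps every category to True or False, and a prediction is read back by taking the highest-scoring category. The sketch below shows that shape in isolation. `make_cats` and `top_label` are hypothetical helper names; train.py builds the same dict inline.

```python
CATEGORIES = [
    "dokujo-tsushin", "it-life-hack", "kaden-channel",
    "livedoor-homme", "movie-enter", "peachy",
    "smax", "sports-watch", "topic-news",
]


def make_cats(label, categories=CATEGORIES):
    """Build the {category: bool} annotation with exactly one True entry."""
    if label not in categories:
        raise ValueError("unknown category: {}".format(label))
    return {cat: cat == label for cat in categories}


def top_label(scores):
    """Return the category with the highest value (a bool or a probability)."""
    return max(scores.items(), key=lambda item: item[1])[0]
```

Because `top_label` works on both the gold dicts and the predicted score dicts, the gold and predicted labels can be compared as plain strings.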

Passing pretrained_vectors to begin_training follows the article "はじめての自然言語処理第4回 spaCy/GiNZA を用いた自然言語処理".

train.py

from __future__ import unicode_literals, print_function
import plac
import random
from pathlib import Path

import spacy
from spacy.util import minibatch, compounding
from sklearn.metrics import classification_report

categories = [
    'dokujo-tsushin',
    'it-life-hack',
    'kaden-channel',
    'livedoor-homme',
    'movie-enter',
    'peachy',
    'smax',
    'sports-watch',
    'topic-news'
]

@plac.annotations(
    model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_texts=("Number of texts to train from", "option", "t", int),
    n_iter=("Number of training iterations", "option", "n", int),
    init_tok2vec=("Pretrained tok2vec weights", "option", "t2v", Path),
)
def main(model=None, output_dir=None, n_iter=20, n_texts=2000, init_tok2vec=None):
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()

    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")

    # add the text classifier to the pipeline if it doesn't exist
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "textcat" not in nlp.pipe_names:
        textcat = nlp.create_pipe(
            "textcat", config={"exclusive_classes": True, "architecture": "simple_cnn"}
        )
        nlp.add_pipe(textcat, last=True)
    # otherwise, get it, so we can add labels to it
    else:
        textcat = nlp.get_pipe("textcat")

    # add label to text classifier
    for cat in categories:
        textcat.add_label(cat)

    # load the livedoor news corpus dataset
    print("Loading livedoor news corpus data...")
    (train_texts, train_cats), (dev_texts, dev_cats) = load_data()
    train_texts = train_texts[:n_texts]
    train_cats = train_cats[:n_texts]
    print(
        "Using {} examples ({} training, {} evaluation)".format(
            n_texts, len(train_texts), len(dev_texts)
        )
    )
    train_data = list(zip(train_texts, [{"cats": cats} for cats in train_cats]))

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
    with nlp.disable_pipes(*other_pipes):  # only train textcat
        optimizer = nlp.begin_training(pretrained_vectors='ja_nopn.vectors')
        if init_tok2vec is not None:
            with init_tok2vec.open("rb") as file_:
                textcat.model.tok2vec.from_bytes(file_.read())
        print("Training the model...")
        print("{:^5}\t{:^5}\t{:^5}\t{:^5}".format("LOSS", "P", "R", "F"))
        batch_sizes = compounding(4.0, 32.0, 1.001)
        for i in range(n_iter):
            losses = {}
            # batch up the examples using spaCy's minibatch
            random.shuffle(train_data)
            batches = minibatch(train_data, size=batch_sizes)
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
            with textcat.model.use_params(optimizer.averages):
                # evaluate on the dev data split off in load_data()
                scores = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats)
            print(
                "{0:.3f}\t{1:.3f}\t{2:.3f}\t{3:.3f}".format(  # print a simple table
                    losses["textcat"],
                    scores["textcat_p"],
                    scores["textcat_r"],
                    scores["textcat_f"],
                )
            )

    # test the trained model
    test_text = "This movie sucked"
    doc = nlp(test_text)
    print(test_text, doc.cats)

    if output_dir is not None:
        with nlp.use_params(optimizer.averages):
            nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        doc2 = nlp2(test_text)
        print(test_text, doc2.cats)


def load_data(limit=0, split=0.8):
    """Load data from the livedoor news corpus dataset."""
    # Partition off part of the train data for evaluation
    annotation = dict((cat, False) for cat in categories)

    texts = []
    cats = []
    for cat in Path('text').iterdir():
        if not cat.is_dir():
            continue

        label = annotation.copy()
        label[cat.name] = True
        for news in cat.iterdir():
            if 'LICENSE' in news.name:
                continue
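            # Read as UTF-8 explicitly so the result does not depend on the OS locale.
            # The first two lines are skipped (in this corpus they hold the URL and timestamp).
            with open(news, encoding='utf-8') as f:
                text = '\n'.join(f.read().splitlines()[2:])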
            texts.append(text)
            cats.append(label)

    train_data = list(zip(texts, cats))
    random.shuffle(train_data)
    train_data = train_data[-limit:]
    texts, cats = zip(*train_data)
    split = int(len(train_data) * split)
    return (texts[:split], cats[:split]), (texts[split:], cats[split:])


def evaluate(tokenizer, textcat, texts, cats):
    docs = (tokenizer(text) for text in texts)

    y_true = [max(cat.items(), key=lambda x:x[1])[0] for cat in cats]
    y_pred = [max(doc.cats.items(), key=lambda x:x[1])[0] for doc in textcat.pipe(docs)]

    score = classification_report(y_true, y_pred, output_dict=True)['macro avg']
    return {"textcat_p": score['precision'], "textcat_r": score['recall'], "textcat_f": score['f1-score']}


if __name__ == "__main__":
    plac.call(main)
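The batch-size schedule in train.py, `compounding(4.0, 32.0, 1.001)`, may not be obvious at first glance. The sketch below re-implements the idea in plain Python for illustration only; it is not spaCy's own code. The size starts at 4, is multiplied by 1.001 after every batch, and is capped at 32, and each batch takes the integer part of the current size.

```python
import itertools


def compounding(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... clipped at stop."""
    value = float(start)
    while True:
        yield min(value, stop) if start <= stop else max(value, stop)
        value *= compound


def minibatch(items, sizes):
    """Split items into consecutive batches whose lengths follow sizes."""
    items = iter(items)
    while True:
        batch = list(itertools.islice(items, int(next(sizes))))
        if not batch:
            return
        yield batch
```

With a factor of 1.001 it takes roughly 2,080 batches to grow from 4 to 32. In train.py the generator is created once before the epoch loop, so the growth continues across epochs rather than restarting each time.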

Results

$ python train.py -m ja_ginza
Loaded model 'ja_ginza'
Loading livedoor news corpus data...
Using 2000 examples (2000 training, 1474 evaluation)
Training the model...
LOSS      P       R       F  
14.769  0.756   0.721   0.705
2.371   0.806   0.788   0.782
0.644   0.840   0.835   0.835
0.192   0.849   0.845   0.846
0.065   0.861   0.860   0.860
0.027   0.871   0.868   0.869
0.010   0.871   0.869   0.870
0.003   0.874   0.874   0.874
0.002   0.876   0.875   0.876
0.001   0.875   0.873   0.874
0.001   0.877   0.876   0.877
0.000   0.879   0.878   0.879
0.000   0.879   0.878   0.878
0.000   0.878   0.878   0.878
0.000   0.881   0.880   0.881
0.000   0.883   0.882   0.882
0.000   0.882   0.881   0.881
0.000   0.884   0.883   0.883
0.000   0.886   0.884   0.885
0.000   0.886   0.885   0.885
This movie sucked {'dokujo-tsushin': 1.4956882371628018e-14, 'it-life-hack': 1.25015605836078e-19, 'kaden-channel': 3.96688877009495e-32, 'livedoor-homme': 1.3391357476061384e-19, 'movie-enter': 3.578584823912253e-22, 'peachy': 4.913896465250873e-31, 'smax': 6.952530924856761e-23, 'sports-watch': 1.0, 'topic-news': 1.988718738299422e-11}
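The P, R, and F columns come from the 'macro avg' row of scikit-learn's classification_report: precision, recall, and F1 are computed per category and then averaged with equal weight. The sketch below spells out that calculation in plain Python. `macro_prf` is a hypothetical name and a simplified stand-in, not scikit-learn's implementation.

```python
def macro_prf(y_true, y_pred):
    """Return macro-averaged (precision, recall, F1) over all labels seen."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        raise ValueError("y_true and y_pred must not be empty")
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```

Since every category counts equally, a small category that is classified poorly pulls the macro average down as much as a large one would.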

Open issues

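Obtaining the results above took more than half a day on a MacBook Pro.
That is partly unavoidable when training on a CPU, but only one core was in use, so training could probably be sped up somewhat if it can be run on multiple cores.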

The article I referred to, "はじめての自然言語処理第4回 spaCy/GiNZA を用いた自然言語処理", trains on a GPU.
I would very much like to try that.
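As a starting point for that experiment, spaCy provides `spacy.prefer_gpu()`, which activates the GPU when one is usable and otherwise leaves the CPU in use. The following is a minimal sketch; `enable_gpu` is a hypothetical wrapper, and I have not confirmed how GiNZA 2.2.0 behaves on a GPU. A GPU-enabled spaCy installation (with cupy) is assumed for the call to return True.

```python
def enable_gpu():
    """Ask spaCy to use the GPU if possible; return True when it was activated."""
    try:
        import spacy
    except ImportError:
        return False  # spaCy is not installed in this environment
    # prefer_gpu() falls back to the CPU and returns False when no GPU is usable.
    return bool(spacy.prefer_gpu())
```

In train.py this would be called at the top of main, before `spacy.load(model)`, so that the pipeline is created on the chosen device.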

Summary

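I trained a news article classifier using GiNZA.
I liked that training required only a simple installation and a short script.
Training time and accuracy remain open issues, but there seem to be many possible remedies.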

Next, I would like to look into what is needed to run this in production.
