
[Model building] Building an MLP with Keras to classify news by topic using the Reuters dataset (TensorFlow 2.x)

Last updated at 2020-01-26, posted at 2020-01-04

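Previous part: preprocessing

This article is the sequel to the [Preprocessing] part:
[Preprocessing] Building an MLP with Keras to classify news by topic using the Reuters dataset (TensorFlow 2.x)

See the preprocessing part for the runtime environment as well.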

Training the model

We build a model from the preprocessed news article text x_train and the news labels y_train.

To keep things simple, the model is a two-layer MLP (multilayer perceptron).

Layer (type)                 Output Shape              Param #
=================================================================
dense_1 (Dense)              (None, 512)               512512
_________________________________________________________________
dropout (Dropout)            (None, 512)               0
_________________________________________________________________
dense_2 (Dense)              (None, 46)                23598
=================================================================
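
The parameter counts in the summary can be checked by hand: a Dense layer has one weight per (input, unit) pair plus one bias per unit. A minimal sketch of that arithmetic, where `dense_params` is a hypothetical helper name introduced only for this check:

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


# Hidden layer: 1000 input features -> 512 units
hidden = dense_params(1000, 512)
# Output layer: 512 hidden units -> 46 classes
output = dense_params(512, 46)

print(hidden, output, hidden + output)  # 512512 23598 536110
```

The Dropout layer has no trainable parameters, which is why its row shows 0.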

Let's build the model¹.

In [92]: import tensorflow as tf

In [93]: from tensorflow.keras import layers

In [96]: model = keras.Sequential(
    ...:     [
    ...:         layers.Dense(512, input_shape=(1000,), activation=tf.nn.relu),
    ...:         layers.Dropout(0.5),
    ...:         layers.Dense(number_of_classes, activation=tf.nn.softmax),
    ...:     ]
    ...: )

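input_shape=(1000,) is used because preprocessing turns every news article into a vector of length 1000
The output layer applies softmax over its number_of_classes units, so that the class with the largest value can be returned as the label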
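
As a rough illustration of what the softmax output layer does, the sketch below turns raw scores into probabilities and picks the index of the largest one as the label. The scores are made-up values for 4 classes (the real model has 46), and `softmax` is written by hand here rather than taken from TensorFlow.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for 4 classes
logits = [1.0, 3.0, 0.5, 2.0]
probs = softmax(logits)

# The predicted label is the index of the largest probability
predicted_label = max(range(len(probs)), key=probs.__getitem__)
print(predicted_label)  # 1
```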

Next, compile the model before training.

In [99]: model.compile(
    ...:     loss="categorical_crossentropy",
    ...:     optimizer=keras.optimizers.Adam(),
    ...:     metrics=["accuracy"],
    ...: )

Because this is multi-class classification, loss is set to "categorical_crossentropy", the optimizer is Adam, and accuracy is the metric.
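
To make the loss concrete, here is a minimal hand-written sketch of categorical cross-entropy for a single sample with a one-hot label. The probability vectors are made up, and the `eps` clipping is an assumption added here only to avoid log(0); it is not meant to reproduce Keras's implementation exactly.

```python
import math


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Loss for one sample: -sum(y_true * log(y_pred))."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


y_true = [0, 1, 0]             # one-hot label: the correct class is 1
confident = [0.05, 0.9, 0.05]  # hypothetical prediction, confident and correct
unsure = [0.4, 0.3, 0.3]       # hypothetical prediction, spread out

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)
print(loss_confident < loss_unsure)  # True
```

Only the probability assigned to the correct class contributes, so the loss falls as that probability approaches 1.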

Now train the model.

In [100]: history = model.fit(
     ...:     x_train,
     ...:     y_train,
     ...:     batch_size=32,
     ...:     epochs=5,
     ...:     verbose=1,
     ...:     validation_split=0.1,
     ...: )
Train on 8083 samples, validate on 899 samples
Epoch 1/5
8083/8083 [==============================] - 2s 192us/sample - loss: 1.4148 - accuracy: 0.6828 - val_loss: 1.0709 - val_accuracy: 0.7653
Epoch 2/5
8083/8083 [==============================] - 1s 104us/sample - loss: 0.7804 - accuracy: 0.8169 - val_loss: 0.9457 - val_accuracy: 0.7920
Epoch 3/5
8083/8083 [==============================] - 1s 102us/sample - loss: 0.5557 - accuracy: 0.8659 - val_loss: 0.8587 - val_accuracy: 0.8076
Epoch 4/5
8083/8083 [==============================] - 1s 100us/sample - loss: 0.4175 - accuracy: 0.8976 - val_loss: 0.8491 - val_accuracy: 0.8176
Epoch 5/5
8083/8083 [==============================] - 1s 103us/sample - loss: 0.3269 - accuracy: 0.9171 - val_loss: 0.8689 - val_accuracy: 0.8065

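10% of the training data is held out as validation data; it is not used for training, only for checking accuracy.
After 5 epochs, the loss on the training data keeps decreasing, while the loss on the validation data has started to increase, which suggests the model is beginning to overfit.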
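
The sample counts in the log (8083 for training, 899 for validation) follow from validation_split=0.1. A small sketch of the arithmetic, assuming the split index is the sample count times (1 - validation_split), truncated to an integer:

```python
n_samples = 8083 + 899  # total training samples, from the log above
validation_split = 0.1

# Assumed rule: everything before split_at is used for training, the rest for validation
split_at = int(n_samples * (1 - validation_split))
n_train = split_at
n_val = n_samples - split_at

print(n_train, n_val)  # 8083 899
```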

Training accuracy is plotted as circles (◯). It rises as the number of epochs grows and training progresses
Validation accuracy is plotted as a solid line. It peaks at epoch 4, and at epoch 5 it is lower than at epoch 4
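
The same reading can be made directly from the numbers in the training log. A minimal sketch that finds the best epoch by validation loss and by validation accuracy, with the values copied from the log above:

```python
# Per-epoch validation metrics, copied from the training log
val_loss = [1.0709, 0.9457, 0.8587, 0.8491, 0.8689]
val_accuracy = [0.7653, 0.7920, 0.8076, 0.8176, 0.8065]

# Epochs are 1-based in the log, list indices are 0-based
best_loss_epoch = val_loss.index(min(val_loss)) + 1
best_acc_epoch = val_accuracy.index(max(val_accuracy)) + 1

print(best_loss_epoch, best_acc_epoch)  # 4 4
```

Keras also provides an EarlyStopping callback that can stop training based on a monitored metric; it is not used in this article.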

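Checking model performance

Accuracy is computed with sklearn.metrics.accuracy_score (documentation).

First, check performance on the data used for training.
Note: to compute accuracy_score, the data is loaded again in its form before one-hot encoding
(the seed argument of the load_data method has a default value, so the result is reproducible)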

In [101]: pred_train = model.predict_classes(x_train)

In [109]: from sklearn.metrics import accuracy_score

In [114]: (_, y_train), (_, y_test) = reuters.load_data(num_words=1000)

In [115]: accuracy_score(y_train, pred_train)
Out[115]: 0.9380984190603429

On the data used for training, accuracy is 93%, above 90%, so training appears to have worked.
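
In more recent TensorFlow 2.x releases, predict_classes is no longer available on Sequential models; taking the argmax of the predicted probabilities plays the same role. The sketch below uses small made-up arrays in place of model.predict(x_train) and the one-hot y_train, and also shows that argmax can undo the one-hot encoding as an alternative to reloading the data.

```python
import numpy as np

# Hypothetical softmax outputs for 3 samples and 4 classes,
# standing in for model.predict(x_train)
probs = np.array([
    [0.1, 0.7, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.2, 0.2, 0.5, 0.1],
])
# Hypothetical one-hot labels, standing in for the one-hot y_train
y_onehot = np.array([
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
])

pred_labels = np.argmax(probs, axis=-1)     # same role as predict_classes
true_labels = np.argmax(y_onehot, axis=-1)  # undo the one-hot encoding

# Accuracy is the fraction of positions where prediction and label agree
accuracy = float(np.mean(pred_labels == true_labels))
print(pred_labels.tolist(), true_labels.tolist())  # [1, 0, 2] [1, 3, 2]
```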

Next, check performance on data not used for training (the test data).

In [116]: pred = model.predict_classes(x_test)

In [117]: accuracy_score(y_test, pred)
Out[117]: 0.7916295636687445

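On data not used for training, accuracy was 79%.
For a simple MLP, that seems reasonable.

Checking the classification results

Because the news articles are skewed toward labels 3 and 4, we check accuracy per label.

The filter function (documentation) is used to extract the matching elements.
Its return value is converted to a list, and len of that list gives the number of matches.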

In [126]: for label in range(46):
     ...:     train_count = len(list(filter(lambda x: x==label, y_train)))
     ...:     pred_train_count = len(list(filter(lambda x: x==label, pred_train)))
     ...:     train_correct = len(list(filter(lambda pair: pair[0]==label and
     ...:  pair[0]==pair[1], zip(y_train, pred_train))))
     ...:     test_count = len(list(filter(lambda x: x==label, y_test)))
     ...:     pred_count = len(list(filter(lambda x: x==label, pred)))
     ...:     test_correct = len(list(filter(lambda pair: pair[0]==label and
     ...: pair[0]==pair[1], zip(y_test, pred))))
     ...:     print(f'{label}, {train_count}, {pred_train_count}, {train_corr
     ...: ect}({train_correct/train_count:.4f}), {test_count}, {pred_count},
     ...: {test_correct}({test_correct/test_count:.4f})')
     ...:
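# Label (integer), count in training data, count predicted by the model on training data (including errors), correct predictions on training data (accuracy),
# count in test data, count predicted by the model on test data (including errors), correct predictions on test data (accuracy)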
0, 55, 58, 52(0.9455), 12, 11, 9(0.7500)
1, 432, 426, 397(0.9190), 105, 104, 78(0.7429)
2, 74, 75, 70(0.9459), 20, 13, 10(0.5000)
3, 3159, 3196, 3072(0.9725), 813, 837, 765(0.9410)
4, 1949, 2018, 1873(0.9610), 474, 525, 418(0.8819)
5, 17, 14, 13(0.7647), 5, 1, 1(0.2000)
6, 48, 46, 46(0.9583), 14, 11, 11(0.7857)
7, 16, 15, 14(0.8750), 3, 2, 1(0.3333)
8, 139, 149, 125(0.8993), 38, 42, 27(0.7105)
9, 101, 109, 99(0.9802), 25, 23, 20(0.8000)
10, 124, 121, 113(0.9113), 30, 30, 27(0.9000)
11, 390, 395, 366(0.9385), 83, 104, 62(0.7470)
12, 49, 42, 42(0.8571), 13, 8, 5(0.3846)
13, 172, 184, 161(0.9360), 37, 60, 24(0.6486)
14, 26, 19, 18(0.6923), 2, 0, 0(0.0000)
15, 20, 19, 19(0.9500), 9, 2, 1(0.1111)
16, 444, 442, 404(0.9099), 99, 124, 76(0.7677)
17, 39, 36, 36(0.9231), 12, 6, 5(0.4167)
18, 66, 61, 61(0.9242), 20, 12, 10(0.5000)
19, 549, 519, 481(0.8761), 133, 108, 84(0.6316)
20, 269, 252, 223(0.8290), 70, 67, 37(0.5286)
21, 100, 96, 93(0.9300), 27, 34, 22(0.8148)
22, 15, 11, 10(0.6667), 7, 0, 0(0.0000)
23, 41, 33, 33(0.8049), 12, 7, 3(0.2500)
24, 62, 65, 57(0.9194), 19, 20, 9(0.4737)
25, 92, 93, 87(0.9457), 31, 26, 22(0.7097)
26, 24, 20, 20(0.8333), 8, 1, 1(0.1250)
27, 15, 11, 11(0.7333), 4, 1, 1(0.2500)
28, 48, 47, 45(0.9375), 10, 4, 2(0.2000)
29, 19, 16, 16(0.8421), 4, 4, 3(0.7500)
30, 45, 43, 40(0.8889), 12, 7, 7(0.5833)
31, 39, 40, 37(0.9487), 13, 9, 6(0.4615)
32, 32, 32, 31(0.9688), 10, 5, 5(0.5000)
33, 11, 10, 10(0.9091), 5, 4, 3(0.6000)
34, 50, 48, 47(0.9400), 7, 3, 3(0.4286)
35, 10, 9, 9(0.9000), 6, 2, 2(0.3333)
36, 49, 39, 37(0.7551), 11, 7, 4(0.3636)
37, 19, 18, 16(0.8421), 2, 0, 0(0.0000)
38, 19, 15, 15(0.7895), 3, 0, 0(0.0000)
39, 24, 17, 17(0.7083), 5, 3, 0(0.0000)
40, 36, 39, 30(0.8333), 10, 5, 3(0.3000)
41, 30, 25, 23(0.7667), 8, 1, 0(0.0000)
42, 13, 10, 10(0.7692), 3, 0, 0(0.0000)
43, 21, 22, 20(0.9524), 6, 8, 6(1.0000)
44, 12, 10, 10(0.8333), 5, 4, 4(0.8000)
45, 18, 17, 17(0.9444), 1, 1, 1(1.0000)

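Labels 3 and 4, which contain many news articles, also reach high accuracy on the test data.
In contrast, some labels with few news articles show low accuracy on the test data (some are at 0%), so there still seems to be plenty of room to improve the model.
It may be worth looking at the text of those labels and reconsidering how the features are built.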
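
The per-label counts can also be gathered in a single pass with collections.Counter instead of repeated filter calls. This is a sketch of an alternative, not the code used for the table above; `per_label_stats` is a hypothetical helper name and the labels are tiny made-up values rather than the Reuters data.

```python
from collections import Counter


def per_label_stats(y_true, y_pred):
    """Return {label: (true_count, predicted_count, correct_count, accuracy)}."""
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    correct_counts = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {
        label: (
            true_counts[label],
            pred_counts[label],
            correct_counts[label],
            correct_counts[label] / true_counts[label],
        )
        for label in sorted(true_counts)
    }


# Tiny hypothetical labels for illustration
y_true = [3, 3, 3, 4, 4, 5]
y_pred = [3, 3, 4, 4, 4, 3]
stats = per_label_stats(y_true, y_pred)
for label, row in stats.items():
    print(label, *row)
```

The per-label accuracy computed here (correct predictions divided by the number of true samples of that label) is what is usually called recall for that class.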

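Full code

I am also sharing the code written so that it can be run as a script.

The steps described in this article are grouped into functions, so the code is not exactly identical
The constants at the top of the script can be changed to try models with different hyperparameters (I used code from 『直感 Deep Learning』 as a reference)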

keras_mlp.py

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.datasets import reuters
from tensorflow.keras.preprocessing.text import Tokenizer


np.random.seed(42)
tf.random.set_seed(1234)

MAX_WORDS = 1000
DROPOUT = 0.5
OPTIMIZER = keras.optimizers.Adam()
BATCH_SIZE = 32
EPOCHS = 5


class IndexWordMapper:
    def __init__(self, index_word_map):
        self.index_word_map = index_word_map

    @staticmethod
    def initialize_index_word_map():
        word_index = reuters.get_word_index()
        index_word_map = {
            index + 3: word for word, index in word_index.items()
        }
        index_word_map[0] = "[padding]"
        index_word_map[1] = "[start]"
        index_word_map[2] = "[oov]"
        return index_word_map

    def print_original_sentence(self, indices_of_words):
        for index in indices_of_words:
            print(self.index_word_map[index], end=" ")


class TokenizePreprocessor:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    @staticmethod
    def initialize_tokenizer(max_words):
        return Tokenizer(num_words=max_words)

    def convert_text_to_matrix(self, texts, mode):
        return self.tokenizer.sequences_to_matrix(texts, mode=mode)


def convert_to_onehot(labels, number_of_classes):
    return keras.utils.to_categorical(labels, number_of_classes)


def build_model(number_of_classes, max_words, drop_out, optimizer):
    model = keras.Sequential(
        [
            layers.Dense(512, input_shape=(max_words,), activation=tf.nn.relu),
            layers.Dropout(drop_out),
            layers.Dense(number_of_classes, activation=tf.nn.softmax),
        ]
    )
    model.compile(
        loss="categorical_crossentropy",
        optimizer=optimizer,
        metrics=["accuracy"],
    )
    return model


def plot_accuracy(history):
    accuracy = history["accuracy"]
    val_accuracy = history["val_accuracy"]
    epochs = range(1, len(accuracy) + 1)

    plt.plot(epochs, accuracy, "bo", label="Training accuracy")
    plt.plot(epochs, val_accuracy, "b", label="Validation accuracy")
    plt.title("Training and Validation accuracy")
    plt.legend()
    plt.savefig("accuracy.png")


if __name__ == "__main__":
    index_word_map = IndexWordMapper.initialize_index_word_map()
    index_word_mapper = IndexWordMapper(index_word_map)

    (x_train, y_train), (x_test, y_test) = reuters.load_data(
        num_words=MAX_WORDS
    )
    number_of_classes = np.max(y_train) + 1

    tokenizer = TokenizePreprocessor.initialize_tokenizer(MAX_WORDS)
    preprocessor = TokenizePreprocessor(tokenizer)
    x_train = preprocessor.convert_text_to_matrix(x_train, "binary")
    x_test = preprocessor.convert_text_to_matrix(x_test, "binary")

    y_train = convert_to_onehot(y_train, number_of_classes)
    y_test = convert_to_onehot(y_test, number_of_classes)

    model = build_model(number_of_classes, MAX_WORDS, DROPOUT, OPTIMIZER)

    history = model.fit(
        x_train,
        y_train,
        batch_size=BATCH_SIZE,
        epochs=EPOCHS,
        verbose=1,
        validation_split=0.1,
    )
    score = model.evaluate(x_test, y_test, batch_size=BATCH_SIZE, verbose=0)
    print(score)

    plot_accuracy(history.history)

$ python keras_mlp.py
Train on 8083 samples, validate on 899 samples
Epoch 1/5
8083/8083 [==============================] - 1s 161us/sample - loss: 1.4255 - accuracy: 0.6828 - val_loss: 1.0781 - val_accuracy: 0.7631
Epoch 2/5
8083/8083 [==============================] - 1s 102us/sample - loss: 0.7915 - accuracy: 0.8122 - val_loss: 0.9229 - val_accuracy: 0.7942
Epoch 3/5
8083/8083 [==============================] - 1s 99us/sample - loss: 0.5530 - accuracy: 0.8689 - val_loss: 0.8850 - val_accuracy: 0.8042
Epoch 4/5
8083/8083 [==============================] - 1s 99us/sample - loss: 0.4072 - accuracy: 0.8983 - val_loss: 0.8857 - val_accuracy: 0.8087
Epoch 5/5
8083/8083 [==============================] - 1s 99us/sample - loss: 0.3336 - accuracy: 0.9150 - val_loss: 0.9134 - val_accuracy: 0.8053
[0.9012499411830069, 0.7907391]

The last line of output is the result of applying the Sequential model's evaluate method (documentation) to the test data.

Returns the loss value & metrics values for the model in test mode.

So the first value (score[0]) is the loss and the second (score[1]) is the accuracy (the metric specified at compile time).
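
For readability, the returned list can be paired with names before printing. A small sketch using the values from the run above; the name list is an assumption that mirrors the order described (loss first, then the metrics passed to compile).

```python
# Values printed by the script run above
score = [0.9012499411830069, 0.7907391]

# Assumed order: loss first, then the metrics given at compile time
metric_names = ["loss", "accuracy"]
results = dict(zip(metric_names, score))

print(f"test loss: {results['loss']:.4f}, test accuracy: {results['accuracy']:.4f}")
```

On a compiled Keras model, the metrics_names attribute can serve as the source of these names instead of a hard-coded list.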

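Ensuring reproducibility

When putting everything into a script, I got stuck on fixing seeds for reproducibility.
There seems to be little information² about fixing seeds on TensorFlow 2.x.

In the end, I did the following two things.

Fix the numpy seed
Fix the seed with tensorflow.random.set_seed (documentation)³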

np.random.seed(42)
tf.random.set_seed(1234)
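
What fixing a seed buys can be seen on the NumPy side alone. A minimal sketch, where `draw` is a hypothetical helper: reseeding with the same value reproduces the same random numbers, while a different seed does not. tf.random.set_seed serves the same purpose for TensorFlow's own random operations and is not exercised here.

```python
import numpy as np


def draw(seed):
    """Reseed NumPy's global generator and draw three numbers."""
    np.random.seed(seed)
    return np.random.rand(3)


first = draw(42)
second = draw(42)
other = draw(43)

print(np.array_equal(first, second))  # True
print(np.array_equal(first, other))   # False
```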

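Things to try next

Change the model
- I came across the "Embedding layer" while working on this and would like to try it
- ref: https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/
Grid search over hyperparameters
- optimizer, dropout rate, batch size, number of epochs
- whether taking the top 1000 words is enough
Dig deeper into preprocessing (try count, tfidf, freq)
Check the data to see whether the imbalance in article counts across labels needs handling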
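
As a starting point for the grid search item, the combinations to try can be enumerated with itertools.product. This is only a sketch: the candidate values are hypothetical, and only 0.5, 32, 5, and 1000 correspond to the constants used in the script.

```python
from itertools import product

# Hypothetical candidate values for each hyperparameter
grid = {
    "dropout": [0.3, 0.5],
    "batch_size": [32, 64],
    "epochs": [5, 10],
    "max_words": [1000, 5000],
}

# One dict per combination, keyed by hyperparameter name
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

print(len(combinations))  # 16
print(combinations[0])    # {'dropout': 0.3, 'batch_size': 32, 'epochs': 5, 'max_words': 1000}
```

Each dict could then replace the constants at the top of the script for one training run, with the validation metrics recorded for comparison.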

I plan to try various things with this work as a foundation.

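Summary

Built a two-layer MLP as a simple model
Test accuracy was 79%. Accuracy varies with the number of news articles per label
When turning this into a script, two lines were needed to ensure reproducibility on TensorFlow 2.x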

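¹ Running model.summary() after building this model shows the layers listed earlier
² A digression from reproducibility: if training finishes quickly, another approach is to skip fixing the seed, repeat the run several times, and evaluate with summary statistics. ref: https://machinelearningmastery.com/reproducible-results-neural-networks-keras/
³ ref: https://stackoverflow.com/a/58639060 . tensorflow.set_random_seed appears to have been removed in 2.x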
