
Improving performance metrics with a two-stage learning model

Last updated at 2020-08-29, posted at 2020-08-29

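1. About this article

This article explains how to improve the F1 score of a binary classifier, which assigns each input a label of 0 or 1, by introducing a two-stage learning model.

Suppose we want to build a binary classifier that labels input data as 0 or 1. The classifier model is trained by feeding it training data.

Here, we first train classifier 1 on the training data, then append the output of classifier 1 to the training data and train classifier 2. Classifier 2 produces fewer false positives than classifier 1, so the F1 score improves.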
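
As a rough illustration of why removing false positives raises the F1 score, the sketch below computes precision, recall, and F1 from confusion-matrix counts. The counts and the helper name `f1_from_counts` are made up for this example; they are not results from this article.

```python
def f1_from_counts(tp, fp, fn):
    """Return (precision, recall, f1) from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: classifier 1 is tuned for high recall, so it has many false positives.
stage1 = f1_from_counts(tp=99, fp=30, fn=1)
# Hypothetical counts: classifier 2 rejects most false positives and loses a few true positives.
stage2 = f1_from_counts(tp=97, fp=5, fn=3)

print("stage 1: precision=%.3f recall=%.3f f1=%.3f" % stage1)
print("stage 2: precision=%.3f recall=%.3f f1=%.3f" % stage2)
```

With these assumed counts, recall drops slightly in the second stage, but precision rises enough that F1 goes up.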

2. Contents

2-1 Prepare and preprocess the data

[1] Import the MNIST dataset: 28 x 28 pixel grayscale images of the ten handwritten digits (0 to 9), with 60,000 training images and 10,000 test images.

sample.py

from keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train / 255.0  # Normalize pixel values by dividing by 255.
x_test = x_test / 255.0    # Normalize pixel values by dividing by 255.

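x_train holds the 28 x 28 pixel handwritten digit images, with each pixel value scaled to the range 0 to 1.
y_train holds the digit that each image represents.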

Execution result
Shape of x_train -> (60000, 28, 28)

y_train -> the digit that each image represents (size 60000)
[5 0 4 ... 5 6 8]

[2] From x_train, y_train, x_test, and y_test, extract the samples whose label is "3" or "5". -> x_sub_train, x_sub_test

sample.py

# Change these params if you want to change the numbers selected
num1 = 3
num2 = 5


# Subset on only two numbers: keep the rows of x_train whose y_train is 3 or 5.
x_sub_train = x_train[(y_train == num1) | (y_train == num2)]
y_sub_train = y_train[(y_train == num1) | (y_train == num2)]

# Subset on only two numbers: keep the rows of x_test whose y_test is 3 or 5.
x_sub_test = x_test[(y_test == num1) | (y_test == num2)]
y_sub_test = y_test[(y_test == num1) | (y_test == num2)]
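
The boolean-mask selection above can be tried on a small array. This is a minimal sketch with toy values standing in for MNIST; `x_toy` and `y_toy` are names invented for the example.

```python
import numpy as np

# Toy stand-ins for y_train and x_train (assumed values, not MNIST).
y_toy = np.array([5, 0, 3, 4, 3, 5])
x_toy = np.arange(6 * 4).reshape(6, 2, 2)  # six 2x2 "images"

num1, num2 = 3, 5
# Element-wise OR of two boolean arrays: True where the label is 3 or 5.
mask = (y_toy == num1) | (y_toy == num2)

x_sub = x_toy[mask]  # keeps the images at indices 0, 2, 4, 5
y_sub = y_toy[mask]

print(mask)
print(x_sub.shape, y_sub)
```

Building the mask once and reusing it for both arrays keeps the images and labels aligned.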

[3] Convert the data format (reshape the dimensions).

sample.py

from keras.utils import to_categorical
from sklearn.model_selection import train_test_split

# Convert the 3-D data (11552, 28, 28) into 2-D data (11552, 28*28).
x_train_flat = x_sub_train.reshape(x_sub_train.shape[0], 28*28)
# Convert the 3-D data (1902, 28, 28) into 2-D data (1902, 28*28).
x_test_flat = x_sub_test.reshape(x_sub_test.shape[0], 28*28)

# One hot encode target variables
# If an element of y_sub_train is 3 -> 1, which to_categorical turns into [0, 1].
# If an element of y_sub_train is 5 -> 0, which to_categorical turns into [1, 0].
y_sub_train_encoded = to_categorical([1 if value == num1 else 0 for value in y_sub_train])

# Split the data into a training set and a validation set.
X_train, X_val, Y_train, Y_val = train_test_split(x_train_flat, y_sub_train_encoded, test_size = 0.1, random_state=42)
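
To see the label encoding in isolation, the sketch below reproduces the mapping 3 -> 1 -> [0, 1] and 5 -> 0 -> [1, 0] on toy labels. It uses `np.eye` indexing as a stand-in for `to_categorical`; that substitution, and the toy labels, are assumptions of this example.

```python
import numpy as np

num1 = 3
y_toy = np.array([3, 5, 5, 3])        # assumed toy labels

binary = (y_toy == num1).astype(int)  # 3 -> 1, 5 -> 0
one_hot = np.eye(2)[binary]           # 1 -> [0, 1], 0 -> [1, 0]

print(binary)
print(one_hot)
```

Column 1 of the encoded array is therefore the "is a 3" indicator, which is why later code reads index 1 of each row.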

2-3 Build the first learning model (primary ML)

Build the first learning model. The model is a neural network built with the Keras library.

sample.py

from keras.models import Sequential
from keras.layers import Dense

# Build primary model
model = Sequential()
model.add(Dense(units=2, activation='softmax'))
# units: number of outputs
# activation: activation function (https://keras.io/ja/activations/#relu)

# Specify the loss function; here it is categorical_crossentropy.
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# The batch size is deliberately large so that the model is fit poorly; otherwise it is easy to reach 99% accuracy.
model.fit(x=X_train, y=Y_train, validation_data=(X_val, Y_val), epochs=3, batch_size=320)
# epochs: the number of times the whole of X_train is passed through training.
# batch_size: the number of samples in each chunk (a "mini-batch") that X_train is split into; the weights are updated once per mini-batch.

(Reference)
http://marupeke296.com/IKDADV_DL_No2_Keras.html
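
To make the `epochs` and `batch_size` settings above concrete, this small sketch counts how many weight updates one epoch contains for the two batch sizes used in this article (320 for the primary model, 32 for the secondary model). It assumes 10396 training rows, the length this article reports for X_train, and that the final partial batch counts as one update.

```python
import math

n_samples = 10396  # number of rows in X_train as reported in this article

# One weight update per mini-batch; the last, smaller batch still counts as one.
updates = {batch_size: math.ceil(n_samples / batch_size) for batch_size in (320, 32)}

for batch_size, n_updates in updates.items():
    print("batch_size=%d -> %d updates per epoch" % (batch_size, n_updates))
```

A larger batch size means far fewer updates per epoch, which is consistent with the primary model being fit only loosely after 3 epochs.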

2-4 Evaluate the first learning model (neural network)

Using the trained neural network model, draw the ROC curve.

sample.py

import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics as sk_metrics

def plot_roc(actual, prediction):
    # Calculate ROC / AUC
    fpr, tpr, thresholds = sk_metrics.roc_curve(actual, prediction, pos_label=1)
    roc_auc = sk_metrics.auc(fpr, tpr)

    # Plot
    plt.plot(fpr, tpr, color='darkorange',
             lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic Example')
    plt.legend(loc="lower right")
    plt.show()

# Plot ROC (plot_roc is defined first so that the script runs top to bottom)
print('X_train', '\n', X_train, len(X_train))  # length: 10396

prediction = model.predict(X_train)  # prediction: output of the neural network
print('prediction', '\n', prediction, len(prediction))  # length: 10396; each row is [probability of 5, probability of 3]

prediction = np.array([i[1] for i in prediction])  # take column 1, the probability of being a 3
print('prediction', '\n', prediction, len(prediction))  # length: 10396

print('Y_train', '\n', Y_train)  # [0,1] (digit 3) or [1,0] (digit 5)
actual = np.array([i[1] for i in Y_train]) == 1

plot_roc(actual, prediction)

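The ROC curve lies above the diagonal (the dashed line that corresponds to random guessing), so we can see that the model is a well-performing binary classifier.

Adjust the threshold so that recall becomes 0.99.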
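
Each point on an ROC curve is a (false positive rate, true positive rate) pair obtained at one threshold. The following sketch computes a few such points by hand on toy scores; the helper `roc_point` and the data are invented for illustration, and it uses a strict `>` comparison to match the thresholding used in this article.

```python
def roc_point(actual, scores, threshold):
    """Return (fpr, tpr) when scores strictly above `threshold` are called positive."""
    predicted = [s > threshold for s in scores]
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    positives = sum(actual)
    negatives = len(actual) - positives
    return fp / negatives, tp / positives

# Assumed toy data: three positives and three negatives.
actual = [True, True, True, False, False, False]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]

for t in (0.7, 0.5, 0.3):
    print(t, roc_point(actual, scores, t))
```

Lowering the threshold moves the point up and to the right: more true positives are caught, at the price of more false positives.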

sample.py

# Create a model with high recall, change the threshold until a good recall level is reached
threshold = .30

print(prediction)  # probability that each sample is a 3 (column 1 of the model output)

prediction_int = np.array(prediction) > threshold #prediction_int -> [False,True,.....]
print("prediction_int",prediction_int)

# Classification report
print(sk_metrics.classification_report(actual, prediction_int))

# Confusion matrix
cm = sk_metrics.confusion_matrix(actual, prediction_int)
print('Confusion Matrix')
print(cm)
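
Rather than adjusting the threshold by hand, one could scan candidate thresholds and keep the largest one that still reaches the target recall. This is only a sketch on toy data; `recall_at`, `highest_threshold_with_recall`, and the candidate grid are hypothetical names and values, not part of the original code.

```python
def recall_at(actual, scores, threshold):
    """Recall when scores strictly above `threshold` are called positive."""
    tp = sum(a and s > threshold for a, s in zip(actual, scores))
    return tp / sum(actual)

def highest_threshold_with_recall(actual, scores, target, candidates):
    """Return the largest candidate threshold whose recall is at least `target`."""
    for t in sorted(candidates, reverse=True):
        if recall_at(actual, scores, t) >= target:
            return t
    return None  # no candidate reaches the target recall

# Assumed toy data: four positives and two negatives.
actual = [True, True, True, True, False, False]
scores = [0.95, 0.80, 0.55, 0.35, 0.45, 0.10]
candidates = [0.1, 0.2, 0.3, 0.4, 0.5]

best = highest_threshold_with_recall(actual, scores, target=0.99, candidates=candidates)
print(best)
```

Choosing the largest qualifying threshold keeps recall at the target while admitting as few false positives as possible.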

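2-5 Build the second learning model (neural network)

・Output of the first model + X_train → training input for the second model
・Output of the first model & Y_train → training target for the second model

Once most of the positive cases have been identified by the primary model, the F1 score is raised by filtering out the false positives. In other words, the role of the secondary machine learning algorithm is to decide whether a positive prediction from the primary model is true or false.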
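
The logic described above can be traced on a handful of toy labels. In this sketch the secondary model is assumed, hypothetically, to reproduce the meta labels perfectly; a real model would only approximate them.

```python
# Assumed toy data: the primary model has one false positive (index 2)
# and one false negative (index 4).
actual       = [True, True, False, False, True]
primary_pred = [True, True, True,  False, False]

# Meta label: True only where the primary model said "positive" and was right.
meta_labels = [p and a for p, a in zip(primary_pred, actual)]

# Hypothetical perfect secondary model: its predictions equal the meta labels.
secondary_pred = list(meta_labels)

# Final decision: positive only if both models say positive.
final_pred = [s and p for s, p in zip(secondary_pred, primary_pred)]

false_pos_before = sum(p and not a for p, a in zip(primary_pred, actual))
false_pos_after = sum(f and not a for f, a in zip(final_pred, actual))
print(meta_labels, false_pos_before, false_pos_after)
```

Note that the secondary stage can only remove positives; the false negative at index 4 stays missed, which is why the primary model is tuned for high recall first.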

sample.py

# Get meta labels
meta_labels = prediction_int & actual

print("prediction_int",prediction_int) #[False True True ...]
print("meta_labels",meta_labels) #[False True True ...]

meta_labels_encoded = to_categorical(meta_labels) #[1,0] [0,1] [0,1],....
print(meta_labels_encoded)

# Reshape data
prediction_int = prediction_int.reshape((-1, 1))  # reshape (n,) into a column (n, 1): False -> [False], True -> [True]
print("prediction_int",prediction_int)  #[False],[True],[True],....
print("X_train", X_train) #28*28 [0,0,....0]

# np.concatenate joins arrays together.
# prediction_int becomes the first column, followed by the MNIST pixel data.
new_features = np.concatenate((prediction_int, X_train), axis=1)
print("new_features",new_features ) #[1. 0. 0. ... 0. 0. 0.],....

# Train a new model 
# Build model
meta_model = Sequential()
meta_model.add(Dense(units=2, activation='softmax'))

meta_model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# new_features = prediction_int + MNIST data -> [1. 0. 0. ... 0. 0. 0.], [1. 0. 0. ... 0. 0. 0.], ...
# meta_labels_encoded = [1,0] [0,1] [0,1], ...
# x_train and y_train are Numpy arrays --just like in the Scikit-Learn API.
meta_model.fit(x=new_features, y=meta_labels_encoded, epochs=4, batch_size=32)
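
The shape handling in the feature-building step is easy to check on a tiny array. The sketch below uses assumed toy sizes (3 samples, 4 features) in place of the MNIST data.

```python
import numpy as np

x_toy = np.zeros((3, 4))                  # 3 samples, 4 pixel features (assumed toy sizes)
pred_toy = np.array([False, True, True])  # thresholded output of the primary model

pred_col = pred_toy.reshape((-1, 1))      # shape (3,) -> (3, 1)
# Joining along axis=1 adds the prediction as a new first column.
features = np.concatenate((pred_col, x_toy), axis=1)

print(features.shape)
print(features[:, 0])
```

Because the pixel array is floating point, the boolean column is promoted to 0.0 and 1.0 in the combined array.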

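2-6 Evaluate the second learning model

We fed the X_train data into the first learning model (neural network) and the second learning model (neural network) to obtain predictions, compared them with Y_train, and printed a classification report for each. The second model achieved higher accuracy than the first.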

sample.py


def test_meta_label(primary_model, secondary_model, x, y, threshold):
    """
    :param primary_model: model object (First, we build a model that achieves high recall, even if the precision is not particularly high)
    :param secondary_model: model object (the role of the secondary ML algorithm is to determine whether a positive from the primary (exogenous) model
                            is true or false. It is not its purpose to come up with a betting opportunity. Its purpose is to determine whether
                            we should act or pass on the opportunity that has been presented.)
    :param x: Explanatory variables
    :param y: Target variable (One hot encoded)
    :param threshold: The confidence threshold. This is used
    :return: Print the classification report for both the base model and the meta model.
    """
    # Get the actual labels (y) from the encoded y labels
    actual = np.array([i[1] for i in y]) == 1

    # Use primary model to score the data x
    primary_prediction = primary_model.predict(x)
    primary_prediction = np.array([i[1] for i in primary_prediction]).reshape((-1, 1))
    primary_prediction_int = primary_prediction > threshold # binary labels

    # Print output for base model
    # Note: the classification report uses a fixed 0.50 cut-off, while the
    # confusion matrix and accuracy below use `threshold`.
    print('Base Model Metrics:')
    print(sk_metrics.classification_report(actual, primary_prediction > 0.50))
    print('Confusion Matrix')
    print(sk_metrics.confusion_matrix(actual, primary_prediction_int))
    accuracy = (actual == primary_prediction_int.flatten()).sum() / actual.shape[0]
    print('Accuracy: ', round(accuracy, 4))
    print('')

    # Secondary model
    new_features = np.concatenate((primary_prediction_int, x), axis=1)

    # Use secondary model to score the new features
    meta_prediction = secondary_model.predict(new_features)
    meta_prediction = np.array([i[1] for i in meta_prediction])
    meta_prediction_int = meta_prediction > 0.5 # binary labels

    # Now combine primary and secondary model in a final prediction
    final_prediction = (meta_prediction_int & primary_prediction_int.flatten())

    # Print output for meta model
    print('Meta Label Metrics: ')
    print(sk_metrics.classification_report(actual, final_prediction))
    print('Confusion Matrix')
    print(sk_metrics.confusion_matrix(actual, final_prediction))
    accuracy = (actual == final_prediction).sum() / actual.shape[0]
    print('Accuracy: ', round(accuracy, 4))

# Call the function after it has been defined so that the script runs top to bottom.
test_meta_label(primary_model=model, secondary_model=meta_model, x=X_train, y=Y_train, threshold=threshold)

When the held-out validation data (X_val, Y_val) was used instead of the training data, the second neural network again achieved higher accuracy.

sample.py

test_meta_label(primary_model=model, secondary_model=meta_model, x=X_val, y=Y_val, threshold=threshold)
