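Trying out a Kaggle clinical laboratory dataset, part 1: comparing model performance

Posted at 2022-04-02, last updated at 2022-04-06

Overview

I ran a data analysis on a blood-test dataset from Kaggle.
After trying various things the write-up grew long, so I have split it into several parts.
This is part 1.

The other parts are here:
Part 2: Selecting features and visualising their importance
Part 3: Ensemble learning
Part 4: Handling outliers
Part 5: Handling imbalanced data

Dataset used: Patient Treatment Classification (Electronic Health Record Dataset)
The task is to decide, from blood-test results collected at a hospital in Indonesia, whether a patient needs treatment.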

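Models and evaluation metrics

The following six models are used in this part:

XGBoost
Neural network
Random forest
Logistic regression
Decision tree
k-nearest neighbours

Evaluation metrics
Accuracy and cross-validation are also tried for some models, but the comparison mainly uses the following five:

ROC curve
ROC-AUC (area under the ROC curve)
Specificity (share of actual negatives correctly predicted as negative)
Recall (share of actual positives correctly predicted as positive; also called sensitivity)
F1 score (harmonic mean of precision and recall)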
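
As a minimal sketch of how specificity, recall and the F1 score follow from the four confusion-matrix counts: the labels and predictions below are made-up toy values, and `binary_metrics` is a hypothetical helper that is not part of the original notebook.

```python
def binary_metrics(y_true, y_pred):
    """Return (specificity, recall, f1) for 0/1 labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    specificity = tn / (tn + fp)   # negatives correctly predicted as negative
    recall = tp / (tp + fn)        # positives correctly predicted as positive
    precision = tp / (tp + fp)     # predicted positives that really are positive
    f1 = 2 * precision * recall / (precision + recall)
    return specificity, recall, f1


# Toy data (assumption): 4 patients labelled "out" (0) and 4 labelled "in" (1)
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

specificity, recall, f1 = binary_metrics(y_true, y_pred)
print(specificity, recall, f1)
```

The helper assumes that every denominator is non-zero, which holds for the toy data but would need a guard on real predictions.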

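Checking the data

The data consists of blood-test results.
I could not find out which measurement methods were used for the dataset, so the reference ranges are taken from the National Cancer Center (Japan), fiscal year 2016. (Note: reference ranges vary slightly depending on the measurement method.)

Col No	Test item	Meaning	Reference range	If high	If low
1	HAEMATOCRIT	Haematocrit	Male: 40.7-50.1 %, Female: 35.1-44.4 %	Polycythaemia, etc.	Anaemia, etc.
2	HAEMOGLOBINS	Haemoglobin	Male: 13.7-16.8 g/dL, Female: 11.6-14.8 g/dL	Polycythaemia, etc.	Anaemia, etc.
3	ERYTHROCYTE	Red blood cells	Male: 4.35-5.55 ×10⁶/μL, Female: 3.86-4.92 ×10⁶/μL	Polycythaemia, etc.	Anaemia, etc.
4	LEUCOCYTE	White blood cells	3.3-8.6 ×10³/μL	Infection, leukaemia, etc.	Some infections, collagen disease, anaemia, etc.
5	THROMBOCYTE	Platelets	158-348 ×10³/μL	Thrombocythaemia, leukaemia, polycythaemia, etc.	Anaemia, purpura, etc.
6	MCH	Mean corpuscular haemoglobin	27.5-33.2 pg	Megaloblastic anaemia, etc.	Iron-deficiency anaemia, etc.
7	MCHC	Mean corpuscular haemoglobin concentration	31.7-35.3 g/dL	Dehydration, polycythaemia, etc.	Anaemia, etc.
8	MCV	Mean corpuscular volume	83.6-98.2 fL	Megaloblastic anaemia, etc.	Iron-deficiency anaemia, etc.
9	AGE	Age	-	-	-
10	SEX	Sex	-	-	-
11	SOURCE	Whether treatment is needed	out: no treatment needed, in: treatment needed	-	-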

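Implementation

Preprocessing

For each test value, add a column that flags whether it lies within the reference range (0: normal, 1: high, 2: low), then one-hot encode those flag columns.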
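
The flagging and one-hot step can be sketched on a single value as follows. This is only an illustration: `flag_value` and `one_hot` are hypothetical helpers, and the three haematocrit readings are made up; the range 40.7 to 50.1 % is the male haematocrit range from the table above.

```python
def flag_value(value, low, high):
    """Return 0 if value lies within [low, high], 1 if it is above, 2 if it is below."""
    if value > high:
        return 1
    if value < low:
        return 2
    return 0


def one_hot(flag, n_classes=3):
    """Turn a flag 0/1/2 into a one-hot list such as [0, 1, 0]."""
    return [1 if i == flag else 0 for i in range(n_classes)]


# Made-up male haematocrit readings: normal, high, low
for value in (45.0, 52.3, 38.9):
    flag = flag_value(value, 40.7, 50.1)
    print(value, flag, one_hot(flag))
```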

Importing the libraries

import numpy as np
import pandas as pd
import datetime
import matplotlib.pyplot as plt
import missingno as msno

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
import xgboost as xgb

import scipy.stats
from sklearn import model_selection, ensemble
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split, cross_validate, StratifiedKFold
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import accuracy_score, log_loss, roc_curve, roc_auc_score,  recall_score, f1_score, confusion_matrix
from mlxtend.plotting import plot_confusion_matrix
from sklearn.inspection import permutation_importance
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import VotingClassifier, BaggingClassifier, StackingClassifier

%matplotlib inline

Loading the data

df = pd.read_csv("../input/patient-treatment-classification/data-ori.csv")

Checking the data

df.info()

>>
RangeIndex: 4412 entries, 0 to 4411
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   HAEMATOCRIT   4412 non-null   float64
 1   HAEMOGLOBINS  4412 non-null   float64
 2   ERYTHROCYTE   4412 non-null   float64
 3   LEUCOCYTE     4412 non-null   float64
 4   THROMBOCYTE   4412 non-null   int64  
 5   MCH           4412 non-null   float64
 6   MCHC          4412 non-null   float64
 7   MCV           4412 non-null   float64
 8   AGE           4412 non-null   int64  
 9   SEX           4412 non-null   object 
 10  SOURCE        4412 non-null   object 
dtypes: float64(7), int64(2), object(2)

Converting the data and adding columns

# Work on a copy and leave the original data untouched
df_samp = df.copy()

# Insert a flag column (<column name>_AB) right after each test value; initial value 0 (normal)
for i in reversed(range(1, 9)):
    df_samp.insert(i, f'{df_samp.columns[i-1]}_AB', 0)

# Reference ranges as (lower, upper); only the first three tests differ by sex
COMMON_RANGES = {
    "LEUCOCYTE": (3.3, 8.6),
    "THROMBOCYTE": (158, 348),
    "MCH": (27.5, 33.2),
    "MCHC": (31.7, 35.3),
    "MCV": (83.6, 98.2),
}
MALE_RANGES = {"HAEMATOCRIT": (40.7, 50.1), "HAEMOGLOBINS": (13.7, 16.8),
               "ERYTHROCYTE": (4.35, 5.55), **COMMON_RANGES}
FEMALE_RANGES = {"HAEMATOCRIT": (35.1, 44.4), "HAEMOGLOBINS": (11.6, 14.8),
                 "ERYTHROCYTE": (3.86, 4.92), **COMMON_RANGES}

# Convert the data to numbers row by row
#   sex:        M (male) -> 0, otherwise (female) -> 1
#   test value: normal -> 0, high -> 1, low -> 2
for index, data in df_samp.iterrows():
    if df_samp.loc[index, "SEX"] == "M":
        df_samp.loc[index, "SEX"] = 0
        ranges = MALE_RANGES
    else:
        df_samp.loc[index, "SEX"] = 1
        ranges = FEMALE_RANGES

    # Flag each test value against the reference range for this patient's sex
    for col, (low, high) in ranges.items():
        if df_samp.loc[index, col] > high:
            df_samp.loc[index, f"{col}_AB"] = 1
        elif df_samp.loc[index, col] < low:
            df_samp.loc[index, f"{col}_AB"] = 2

    # Recode SOURCE as the target variable: out -> 0, in -> 1
    if df_samp.loc[index, "SOURCE"] == "out":
        df_samp.loc[index, "SOURCE"] = 0
    else:
        df_samp.loc[index, "SOURCE"] = 1

Checking the data

df_samp.info()

>>
RangeIndex: 4412 entries, 0 to 4411
Data columns (total 19 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   HAEMATOCRIT      4412 non-null   float64
 1   HAEMATOCRIT_AB   4412 non-null   int64  
 2   HAEMOGLOBINS     4412 non-null   float64
 3   HAEMOGLOBINS_AB  4412 non-null   int64  
 4   ERYTHROCYTE      4412 non-null   float64
 5   ERYTHROCYTE_AB   4412 non-null   int64  
 6   LEUCOCYTE        4412 non-null   float64
 7   LEUCOCYTE_AB     4412 non-null   int64  
 8   THROMBOCYTE      4412 non-null   int64  
 9   THROMBOCYTE_AB   4412 non-null   int64  
 10  MCH              4412 non-null   float64
 11  MCH_AB           4412 non-null   int64  
 12  MCHC             4412 non-null   float64
 13  MCHC_AB          4412 non-null   int64  
 14  MCV              4412 non-null   float64
 15  MCV_AB           4412 non-null   int64  
 16  AGE              4412 non-null   int64  
 17  SEX              4412 non-null   object 
 18  SOURCE           4412 non-null   object 
dtypes: float64(7), int64(10), object(2)

One-hot encoding the added columns with pandas

df_samp = pd.get_dummies(df_samp, columns=["HAEMATOCRIT_AB", "HAEMOGLOBINS_AB", "ERYTHROCYTE_AB", "LEUCOCYTE_AB",
                                          "THROMBOCYTE_AB", "MCH_AB", "MCHC_AB", "MCV_AB"])
df_samp.info()

>>
RangeIndex: 4412 entries, 0 to 4411
Data columns (total 35 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   HAEMATOCRIT        4412 non-null   float64
 1   HAEMOGLOBINS       4412 non-null   float64
 2   ERYTHROCYTE        4412 non-null   float64
 3   LEUCOCYTE          4412 non-null   float64
 4   THROMBOCYTE        4412 non-null   int64  
 5   MCH                4412 non-null   float64
 6   MCHC               4412 non-null   float64
 7   MCV                4412 non-null   float64
 8   AGE                4412 non-null   int64  
 9   SEX                4412 non-null   object 
 10  SOURCE             4412 non-null   object 
 11  HAEMATOCRIT_AB_0   4412 non-null   uint8  
 12  HAEMATOCRIT_AB_1   4412 non-null   uint8  
 13  HAEMATOCRIT_AB_2   4412 non-null   uint8  
 14  HAEMOGLOBINS_AB_0  4412 non-null   uint8  
 15  HAEMOGLOBINS_AB_1  4412 non-null   uint8  
 16  HAEMOGLOBINS_AB_2  4412 non-null   uint8  
 17  ERYTHROCYTE_AB_0   4412 non-null   uint8  
 18  ERYTHROCYTE_AB_1   4412 non-null   uint8  
 19  ERYTHROCYTE_AB_2   4412 non-null   uint8  
 20  LEUCOCYTE_AB_0     4412 non-null   uint8  
 21  LEUCOCYTE_AB_1     4412 non-null   uint8  
 22  LEUCOCYTE_AB_2     4412 non-null   uint8  
 23  THROMBOCYTE_AB_0   4412 non-null   uint8  
 24  THROMBOCYTE_AB_1   4412 non-null   uint8  
 25  THROMBOCYTE_AB_2   4412 non-null   uint8  
 26  MCH_AB_0           4412 non-null   uint8  
 27  MCH_AB_1           4412 non-null   uint8  
 28  MCH_AB_2           4412 non-null   uint8  
 29  MCHC_AB_0          4412 non-null   uint8  
 30  MCHC_AB_1          4412 non-null   uint8  
 31  MCHC_AB_2          4412 non-null   uint8  
 32  MCV_AB_0           4412 non-null   uint8  
 33  MCV_AB_1           4412 non-null   uint8  
 34  MCV_AB_2           4412 non-null   uint8  
dtypes: float64(7), int64(2), object(2), uint8(24)

Checking for missing values

Missing values

# Visualise the data
msno.matrix(df_samp)
pd.set_option('display.max_rows', None)
df_samp.isnull().sum()

There are no missing values in this dataset, so no handling is needed.

Shared helpers

Copy the DataFrame before use

# Copy the data and split it into features and target
def dframe_copy(dataframe, object_column):
  copy_data = dataframe.copy()
  y_copy_data = copy_data[object_column].values
  copy_data = copy_data.drop([object_column], axis=1)
  X_copy_data = copy_data.values
  return X_copy_data, y_copy_data, copy_data

Plotting and computing the evaluation metrics

# ROC curve
def create_ROCcurve(true_data, pred_data, drop_intermediate=False):
  fpr, tpr, thresholds = roc_curve(true_data,pred_data, drop_intermediate=drop_intermediate)
  plt.plot(fpr, tpr, marker='o')
  plt.xlabel('FPR: False positive rate')
  plt.ylabel('TPR: True positive rate')
  plt.grid()
  AUC_score = roc_auc_score(true_data, pred_data)
  print("AUC :{}".format(AUC_score))
  return AUC_score

# Confusion matrix
def create_cm(true_data, pred_data, continuous=False):
  # Continuous predictions are probabilities of class 1, so convert them to 0/1 labels.
  # Note: pred_data is overwritten in place; the accuracy_score calls made after create_cm rely on this.
  if continuous:
    for i in range(len(pred_data)):
      # Label 1 when the probability of class 1 is 0.5 or higher
      if pred_data[i] >= 0.5:
        pred_data[i] = 1
      else:
        pred_data[i] = 0

  cm = confusion_matrix(true_data, pred_data)
  tn, fp, fn, tp = cm.flatten()
  print("tn: {}, fp: {}, fn: {}, tp: {}".format(tn, fp, fn, tp))
  specificity = tn/(fp+tn)

  plot_confusion_matrix(cm,figsize=(12,8), hide_ticks=True, cmap=plt.cm.RdPu)
  plt.xticks(range(2), ['Out', 'In'], fontsize=16)
  plt.yticks(range(2), ['Out', 'In'], fontsize=16)
  plt.show()

  recall = recall_score(true_data, pred_data)
  f1 = f1_score(true_data, pred_data)
                        
  print("Specificity: {}".format(specificity))
  print("Sensitivity (recall): {}".format(recall))
  print("f1_score: {}".format(f1))

  return tn, fp, fn, tp, specificity, recall, f1
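
When `continuous=True`, `create_cm` overwrites the predictions it is given. If the original probabilities are still needed afterwards, one option is to threshold into a new list instead. The sketch below is an assumption-labelled alternative, not the notebook's own approach; `to_labels` is a hypothetical helper and the probabilities are made up.

```python
def to_labels(probabilities, threshold=0.5):
    """Return a new list of 0/1 labels; the input sequence is left untouched."""
    return [1 if p >= threshold else 0 for p in probabilities]


probs = [0.12, 0.5, 0.87, 0.49]   # made-up class-1 probabilities
labels = to_labels(probs)

print(labels)  # thresholded copy
print(probs)   # original probabilities are still available
```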

Grid search and random search

# Hyperparameter search
def param_search(train_X, test_X, train_y, test_y, model_param_set, search_type="grid"):
  # Best value seen so far for each metric, and the parameters that produced it
  max_score = 0
  max_AUC = 0
  max_specificity = 0
  max_recall = 0
  score_best_param = None
  AUC_best_param = None
  specificity_best_param = None
  recall_best_param = None
 
  for model, param in model_param_set.items():
    if search_type == "grid":
      # Search the parameters with grid search
      clf = GridSearchCV(model, param)
    elif search_type == "random":
      # Search the parameters with random search
      clf = RandomizedSearchCV(model, param)
    print(search_type)
    print(model.__class__.__name__)
    clf.fit(train_X, train_y)
    # Evaluate the refitted best estimator on the test data (hard 0/1 labels)
    pred_y = clf.predict(test_X)
    score = f1_score(test_y, pred_y)
    AUC = roc_auc_score(test_y, pred_y)
    cm = confusion_matrix(test_y, pred_y)
    tn, fp, fn, tp = cm.flatten()
    specificity = tn/(fp+tn)
    recall = recall_score(test_y, pred_y)
    
    # Update the best value and its parameters whenever a metric improves
    if max_score < score:
      max_score = score
      score_best_param = clf.best_params_
    if max_AUC < AUC:
      max_AUC = AUC
      AUC_best_param = clf.best_params_
    if max_specificity < specificity:
      max_specificity = specificity
      specificity_best_param = clf.best_params_
    if max_recall < recall:
      max_recall = recall
      recall_best_param = clf.best_params_
  
  # Print the parameters that gave the best value of each metric
  print("Best AUC:{}".format(AUC_best_param))
  print("Best specificity:{}".format(specificity_best_param))
  print("Best sensitivity:{}".format(recall_best_param))
  print("Best f1_score:{}".format(score_best_param))

  # Print the best value of each metric
  print("Best AUC:", max_AUC)
  print("Best specificity:", max_specificity)
  print("Best sensitivity:", max_recall)
  print("Best f1_score:",max_score)
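
To make the difference between the two search types concrete, the sketch below enumerates candidates the way a grid search does and samples them the way a random search does. It is a plain-Python illustration under stated assumptions, not how scikit-learn is implemented: `grid_candidates` and `random_candidates` are hypothetical helpers, and the ranges mirror the random forest settings used later in this article.

```python
import itertools
import random


def grid_candidates(param_grid):
    """Yield every combination of the listed values (exhaustive search)."""
    keys = list(param_grid)
    for values in itertools.product(*(param_grid[k] for k in keys)):
        yield dict(zip(keys, values))


def random_candidates(param_ranges, n_iter, seed=0):
    """Yield n_iter combinations drawn uniformly from integer ranges [low, high)."""
    rng = random.Random(seed)
    for _ in range(n_iter):
        yield {k: rng.randrange(low, high) for k, (low, high) in param_ranges.items()}


grid = list(grid_candidates({"n_estimators": range(40, 50), "max_depth": range(10, 20)}))
sampled = list(random_candidates({"n_estimators": (10, 100), "max_depth": (1, 20)}, n_iter=10))

print(len(grid))     # number of grid combinations
print(len(sampled))  # number of random draws
```

The grid grows multiplicatively with each added parameter, while the random search evaluates a fixed number of draws (scikit-learn's `RandomizedSearchCV` defaults to `n_iter=10`) over a wider range.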

Lists for storing the results to compare

# Prepare a list for each metric
AUC_list = []
specificity_list = []
recall_list = []
f1_score_list = []

Cross-validation settings

scoring = {"p": "precision_macro",
           "r": "recall_macro",
           "f":"f1_macro"}
skf = StratifiedKFold(shuffle=True, random_state=0)

XGBoost (GBDT: gradient boosting decision trees)

An ensemble learning model that combines boosting with decision trees.

Training (XGBoost)

# Copy the shared data
X_GBDT, y_GBDT, df_GBDT = dframe_copy(df_samp, "SOURCE")

# Split into training and test data, then split a validation set off the training data
train_X, test_X, train_y, test_y = train_test_split(X_GBDT, y_GBDT, test_size=0.2, random_state=42)
train_X, val_X, train_y, val_y = train_test_split(train_X, train_y, test_size=0.1, random_state=42)

# Train with GBDT
dtrain = xgb.DMatrix(train_X, label=train_y)
dval = xgb.DMatrix(val_X, label=val_y)
dtest = xgb.DMatrix(test_X)

# 'verbosity': 0 replaces the deprecated 'silent': 1
params = {'objective': 'binary:logistic', 'verbosity': 0, 'random_state': 71, 'eval_metric': 'auc'}
num_round = 25

watchlist = [(dtrain, 'train'), (dval, 'eval')]
GBDT_model = xgb.train(params, dtrain, num_round, evals=watchlist)

val_pred = GBDT_model.predict(dval)
val_y = val_y.tolist()
score_val = log_loss(val_y, val_pred)
print(f'logloss_val: {score_val:.4f}')

pred_y = GBDT_model.predict(dtest)

Results

# Convert to lists because the arrays raised a type error
test_y = test_y.tolist()
pred_y = pred_y.tolist()
AUC_GBDT = create_ROCcurve(test_y, pred_y)
tn_GBDT, fp_GBDT, fn_GBDT, tp_GBDT, specificity_GBDT, recall_GBDT, f1_GBDT = create_cm(test_y, pred_y, True)
accuracy_GBDT = accuracy_score(test_y, pred_y)
print(f'acc: {accuracy_GBDT}')
AUC_list.append(["AUC_GBDT:", np.round(AUC_GBDT, 3)])
specificity_list.append(["specificity_GBDT：", np.round(specificity_GBDT, 3)])
recall_list.append(["recall_GBDT：", np.round(recall_GBDT, 3)])
f1_score_list.append(["f1_GBDT：", np.round(f1_GBDT, 3)])

Results

Neural network

A model with a multi-layer network structure modelled on neural circuits.

Importing the required libraries

from tensorflow import keras
from tensorflow.keras import optimizers
from keras.utils.np_utils import to_categorical
from keras.layers import Dense, Activation, Dropout,Input
from keras.models import Sequential, Model
from keras.wrappers.scikit_learn import KerasRegressor
from keras.utils.vis_utils import plot_model

Building the model

# Model
def create_model():
  model = Sequential()
  model.add(Dense(4096, input_shape=(train_X.shape[1], )))
  model.add(Activation('relu'))
  model.add(Dropout(0.3))
  model.add(Dense(2048))
  model.add(Activation('relu'))
  model.add(Dropout(0.3))
  model.add(Dense(512))
  model.add(Activation('relu'))
  model.add(Dropout(0.3))
  model.add(Dense(64))
  model.add(Activation('relu'))
  model.add(Dropout(0.3))
  model.add(Dense(16))
  model.add(Activation('relu'))
  model.add(Dropout(0.3))

  model.add(Dense(1))
  model.add(Activation('sigmoid'))

  model.compile(optimizer=Adam, loss='binary_crossentropy',
                metrics=['accuracy'])
  return model

Preparing the data

# Copy the shared data
X_NN, y_NN, df_NN = dframe_copy(df_samp, "SOURCE")

# Split into training, validation and test data
train_X, test_X, train_y, test_y = train_test_split(X_NN, y_NN, test_size=0.2, random_state=42)
train_X, val_X, train_y, val_y = train_test_split(train_X, train_y, test_size=0.1, random_state=42)

train_X = np.asarray(train_X).astype(np.float32)
train_y = np.asarray(train_y).astype(np.float32)
val_X = np.asarray(val_X).astype(np.float32)
val_y = np.asarray(val_y).astype(np.float32)
test_X = np.asarray(test_X).astype(np.float32)
test_y = np.asarray(test_y).astype(np.float32)

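Training and results

# Optimizer
Adam = optimizers.Adam(learning_rate=0.0001)

batch_size = 4
epochs = 14

# KerasRegressor's predict() returns the raw sigmoid output, which create_ROCcurve and create_cm(..., True) expect
my_model = KerasRegressor(build_fn=create_model, batch_size=batch_size, epochs=epochs, validation_data=(val_X, val_y)) 
history = my_model.fit(train_X, train_y)

# Plot training accuracy and loss per epoch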
plt.plot(history.history["accuracy"], label="accuracy", ls="-", marker="o")
plt.plot(history.history["loss"], label = "loss", ls="-", marker="x")
plt.ylabel('acc')
plt.xlabel('epoch')
plt.legend(loc='best')
plt.show()

pred_y = my_model.predict(test_X)

AUC_NN = create_ROCcurve(test_y, pred_y)
tn_NN, fp_NN, fn_NN, tp_NN, specificity_NN, recall_NN, f1_NN = create_cm(test_y, pred_y, True)
accuracy_NN = accuracy_score(test_y, pred_y)
print(f'acc: {accuracy_NN}')

AUC_list.append(["AUC_NN", np.round(AUC_NN, 3)])
specificity_list.append(["specificity_NN", np.round(specificity_NN, 3)])
recall_list.append(["recall_NN", np.round(recall_NN, 3)])
f1_score_list.append(["f1_NN", np.round(f1_NN, 3)])

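Results

Random forest

An ensemble learning model that combines multiple decision trees.

Training and results

# Copy the shared data
X_RF, y_RF, df_RF  = dframe_copy(df_samp, "SOURCE")

# Train the model (train_X, test_X, train_y and test_y are reused from the neural network section)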
RF_model = RandomForestClassifier()

RF_model.fit(train_X, train_y)
RF_model.score(test_X, test_y)

pred_y = RF_model.predict(test_X)

AUC_RF = create_ROCcurve(test_y, pred_y)
tn_RF, fp_RF, fn_RF, tp_RF, specificity_RF, recall_RF, f1_RF = create_cm(test_y, pred_y)
accuracy_RF = accuracy_score(test_y, pred_y)
print(f'acc: {accuracy_RF}')

AUC_list.append(["AUC_RF", np.round(AUC_RF, 3)])
specificity_list.append(["specificity_RF", np.round(specificity_RF, 3)])
recall_list.append(["recall_RF", np.round(recall_RF, 3)])
f1_score_list.append(["f1_RF", np.round(f1_RF, 3)])

scores = cross_validate(RF_model, test_X, test_y, cv=skf, scoring=scoring)
print(scores)

>>
'fit_time': array([0.2679925 , 0.24997258, 0.25059104, 0.25084162, 0.25553393]), 
'score_time': array([0.0143342 , 0.0141151 , 0.01439357, 0.01439023, 0.01639438]), 
'test_p': array([0.7047619 , 0.73393045, 0.68744912, 0.71425722, 0.76700061]),
'test_r': array([0.70235294, 0.73215686, 0.68058824, 0.71343402, 0.73221122]),
'test_f': array([0.70334221, 0.73294411, 0.68250066, 0.71382114, 0.73692078])

Results

Running the search

Grid search

model_param_set_grid = {
    RandomForestClassifier(): {
        "n_estimators": [i for i in range(40, 50)],
        "max_depth": [i for i in range(10, 20)]
    }
}

param_search(train_X, test_X, train_y, test_y, model_param_set_grid)

>>
grid
RandomForestClassifier
Best AUC:{'max_depth': 13, 'n_estimators': 42}
Best specificity:{'max_depth': 13, 'n_estimators': 42}
Best sensitivity:{'max_depth': 13, 'n_estimators': 42}
Best f1_score:{'max_depth': 13, 'n_estimators': 42}
Best AUC: 0.7337024468655118
Best specificity: 0.831041257367387
Best sensitivity: 0.6363636363636364
Best f1_score: 0.681948424068768

Random search

model_param_set_random = {
    RandomForestClassifier(): {
        "n_estimators": scipy.stats.randint(10, 100),
        "max_depth": scipy.stats.randint(1, 20)
    }
}

param_search(train_X, test_X, train_y, test_y, model_param_set_random, "random")

>>
random
RandomForestClassifier
Best AUC:{'max_depth': 15, 'n_estimators': 82}
Best specificity:{'max_depth': 15, 'n_estimators': 82}
Best sensitivity:{'max_depth': 15, 'n_estimators': 82}
Best f1_score:{'max_depth': 15, 'n_estimators': 82}
Best AUC: 0.7461180042654676
Best specificity: 0.8585461689587426
Best sensitivity: 0.6336898395721925
Best f1_score: 0.6939970717423133

Applying the search results

RF_model = RandomForestClassifier(max_depth=15, n_estimators=82)

RF_model.fit(train_X, train_y)
RF_model.score(test_X, test_y)

pred_y = RF_model.predict(test_X)

AUC_RF = create_ROCcurve(test_y, pred_y)
tn_RF, fp_RF, fn_RF, tp_RF, specificity_RF, recall_RF, f1_RF = create_cm(test_y, pred_y)

AUC_list.append(["AUC_RF_S", np.round(AUC_RF, 3)])
specificity_list.append(["specificity_RF_S", np.round(specificity_RF, 3)])
recall_list.append(["recall_RF_S", np.round(recall_RF, 3)])
f1_score_list.append(["f1_RF_S", np.round(f1_RF, 3)])

Search results

Logistic regression

A model that predicts the probability of an outcome from several factors.
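
As a minimal sketch of what "predicting a probability from several factors" means: logistic regression passes a weighted sum of the features through the sigmoid function. The weights, bias and feature values below are made up for illustration, and `predict_probability` is a hypothetical helper rather than anything from the notebook.

```python
import math


def predict_probability(features, weights, bias):
    """Return the sigmoid of the weighted feature sum, i.e. the estimated P(class 1)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Made-up coefficients for two features
weights = [0.8, -0.5]
bias = -0.2

p = predict_probability([1.0, 2.0], weights, bias)
label = 1 if p >= 0.5 else 0   # same 0.5 cut-off that create_cm uses
print(p, label)
```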

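Training and results

# Copy the shared data
X_LR, y_LR, df_LR = dframe_copy(df_samp, "SOURCE")
train_X, test_X, train_y, test_y = train_test_split(X_LR, y_LR, test_size=0.2, random_state=42)
train_X, val_X, train_y, val_y = train_test_split(train_X, train_y, test_size=0.1, random_state=42)

# Convert to float32: SOURCE is stored as object dtype, which scikit-learn classifiers may reject as an unknown label type
train_X = np.asarray(train_X).astype(np.float32)
train_y = np.asarray(train_y).astype(np.float32)
val_X = np.asarray(val_X).astype(np.float32)
val_y = np.asarray(val_y).astype(np.float32)
test_X = np.asarray(test_X).astype(np.float32)
test_y = np.asarray(test_y).astype(np.float32)

# Train the model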
LR_model = LogisticRegression(max_iter=4500)

LR_model.fit(train_X, train_y)
LR_model.score(test_X, test_y)
pred_y = LR_model.predict(test_X)

AUC_LR = create_ROCcurve(test_y, pred_y)
tn_LR, fp_LR, fn_LR, tp_LR, specificity_LR, recall_LR, f1_LR = create_cm(test_y, pred_y)
accuracy_LR = accuracy_score(test_y, pred_y)
print(f'acc: {accuracy_LR}')

AUC_list.append(["AUC_LR", np.round(AUC_LR, 3)])
specificity_list.append(["specificity_LR", np.round(specificity_LR, 3)])
recall_list.append(["recall_LR", np.round(recall_LR, 3)])
f1_score_list.append(["f1_LR", np.round(f1_LR, 3)])

scores = cross_validate(LR_model, test_X, test_y, cv=skf, scoring=scoring)
print(scores)

>>
'fit_time': array([0.64652061, 0.7547183 , 0.61075044, 0.5947938 , 0.57881808]),
'score_time': array([0.00534153, 0.00532842, 0.00534463, 0.00515461, 0.00546002]),
'test_p': array([0.65761015, 0.76617527, 0.71668892, 0.71080928, 0.75096222]),
'test_r': array([0.65431373, 0.75490196, 0.71215686, 0.69740329, 0.73240924]),
'test_f': array([0.6554034 , 0.7583658 , 0.7138096 , 0.70043573, 0.73638344])

Results

Running the search

Grid search

model_param_set_grid = {
    LogisticRegression(max_iter=15000): {
        "C": [10 ** i for i in range(-5, 5)],
        "random_state": [42]
    }
}

param_search(train_X, test_X, train_y, test_y, model_param_set_grid)

>>
grid
LogisticRegression
Best AUC:{'C': 0.1, 'random_state': 42}
Best specificity:{'C': 0.1, 'random_state': 42}
Best sensitivity:{'C': 0.1, 'random_state': 42}
Best f1_score:{'C': 0.1, 'random_state': 42}
Best AUC: 0.7212343590767258
Best specificity: 0.8408644400785854
Best sensitivity: 0.6016042780748663
Best f1_score: 0.6617647058823529

Random search

model_param_set_random = {
    LogisticRegression(max_iter=15000): {
        "C": scipy.stats.uniform(0.00001, 1000),
        "random_state": scipy.stats.randint(0, 100)
    },
}

param_search(train_X, test_X, train_y, test_y, model_param_set_random, "random")

>>
random
LogisticRegression
Best AUC:{'C': 79.42047819026727, 'random_state': 40}
Best specificity:{'C': 79.42047819026727, 'random_state': 40}
Best sensitivity:{'C': 79.42047819026727, 'random_state': 40}
Best f1_score:{'C': 79.42047819026727, 'random_state': 40}
Best AUC: 0.7189965645125707
Best specificity: 0.831041257367387
Best sensitivity: 0.606951871657754
Best f1_score: 0.6608442503639009

Applying the search results

X_LR, y_LR, df_LR = dframe_copy(df_samp, "SOURCE")
train_X, test_X, train_y, test_y = train_test_split(X_LR, y_LR, test_size=0.2, random_state=42)
train_X, val_X, train_y, val_y = train_test_split(train_X, train_y, test_size=0.1, random_state=42)

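# Train the model with the parameters found by the random search above
LR_model = LogisticRegression(max_iter=15000, C=79.42047819026727, random_state=40)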

train_X = np.asarray(train_X).astype(np.float32)
train_y = np.asarray(train_y).astype(np.float32)
val_X = np.asarray(val_X).astype(np.float32)
val_y = np.asarray(val_y).astype(np.float32)
test_X = np.asarray(test_X).astype(np.float32)
test_y = np.asarray(test_y).astype(np.float32)

LR_model.fit(train_X, train_y)
LR_model.score(test_X, test_y)
pred_y = LR_model.predict(test_X)

AUC_LR = create_ROCcurve(test_y, pred_y)
tn_LR, fp_LR, fn_LR, tp_LR, specificity_LR, recall_LR, f1_LR = create_cm(test_y, pred_y)

AUC_list.append(["AUC_LR_S", np.round(AUC_LR, 3)])
specificity_list.append(["specificity_LR_S", np.round(specificity_LR, 3)])
recall_list.append(["recall_LR_S", np.round(recall_LR, 3)])
f1_score_list.append(["f1_LR_S", np.round(f1_LR, 3)])

Search results

Decision tree

A model that performs classification or regression using a tree structure.
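
Each node of such a tree picks a feature and a threshold that separate the classes as cleanly as possible. The sketch below shows that idea for a single feature using Gini impurity, which is the default criterion of scikit-learn's `DecisionTreeClassifier`. The helpers `gini` and `best_threshold` and the toy values are assumptions for illustration only.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means the node is pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best split 'value <= threshold'."""
    best = (None, float("inf"))
    for threshold in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# Made-up feature values with labels 0 ("out") and 1 ("in")
values = [3.1, 4.0, 5.2, 9.5, 11.0, 12.4]
labels = [0, 0, 0, 1, 1, 1]

threshold, impurity = best_threshold(values, labels)
print(threshold, impurity)
```

A full tree repeats this search over every feature and then recurses into the two child nodes until a stopping rule such as `max_depth` is reached.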

Training and results

# Copy the shared data
X_DT, y_DT, df_DT = dframe_copy(df_samp, "SOURCE")
train_X, test_X, train_y, test_y = train_test_split(X_DT, y_DT, test_size=0.2, random_state=42)
train_X, val_X, train_y, val_y = train_test_split(train_X, train_y, test_size=0.1, random_state=42)
train_X = np.asarray(train_X).astype(np.float32)
train_y = np.asarray(train_y).astype(np.float32)
val_X = np.asarray(val_X).astype(np.float32)
val_y = np.asarray(val_y).astype(np.float32)
test_X = np.asarray(test_X).astype(np.float32)
test_y = np.asarray(test_y).astype(np.float32)

# Train the model
DT_model = DecisionTreeClassifier()
DT_model.fit(train_X, train_y)
DT_model.score(test_X, test_y)

pred_y = DT_model.predict(test_X)

AUC_DT = create_ROCcurve(test_y, pred_y)
tn_DT, fp_DT, fn_DT, tp_DT, specificity_DT, recall_DT, f1_DT = create_cm(test_y, pred_y)
accuracy_DT = accuracy_score(test_y, pred_y)
print(f'acc: {accuracy_DT}')

AUC_list.append(["AUC_DT", np.round(AUC_DT, 3)])
specificity_list.append(["specificity_DT", np.round(specificity_DT, 3)])
recall_list.append(["recall_DT", np.round(recall_DT, 3)])
f1_score_list.append(["f1_DT", np.round(f1_DT, 3)])

scores = cross_validate(DT_model, test_X, test_y, cv=skf, scoring=scoring)
print(scores)

>>
'fit_time': array([0.01100492, 0.00973248, 0.00667858, 0.00607133, 0.00647211]),
'score_time': array([0.0052526 , 0.00298262, 0.00294089, 0.00298834, 0.00300097]),
'test_p': array([0.60513479, 0.59307733, 0.56039216, 0.64848485, 0.63181818]),
'test_r': array([0.60705882, 0.59156863, 0.56039216, 0.64281929, 0.62633663]),
'test_f': array([0.60512787, 0.59200474, 0.56039216, 0.64420485, 0.62730665])

Results

Running the search

Grid search

model_param_set_grid = {
    DecisionTreeClassifier(): {
        "max_depth": [i for i in range(1, 100)],
    }
}

param_search(train_X, test_X, train_y, test_y, model_param_set_grid)

>>
grid
DecisionTreeClassifier
Best AUC:{'max_depth': 4}
Best specificity:{'max_depth': 4}
Best sensitivity:{'max_depth': 4}
Best f1_score:{'max_depth': 4}
Best AUC: 0.7042906821596294
Best specificity: 0.831041257367387
Best sensitivity: 0.5775401069518716
Best f1_score: 0.6390532544378698

Random search

model_param_set_random = {
    DecisionTreeClassifier(): {
        "max_depth": scipy.stats.randint(1, 20),
    }
}

param_search(train_X, test_X, train_y, test_y, model_param_set_random, "random")

>>
random
DecisionTreeClassifier
Best AUC:{'max_depth': 6}
Best specificity:{'max_depth': 6}
Best sensitivity:{'max_depth': 6}
Best f1_score:{'max_depth': 6}
Best AUC: 0.7213157811794123
Best specificity: 0.8330058939096268
Best sensitivity: 0.6096256684491979
Best f1_score: 0.663755458515284

Applying the search results

# Train the model
DT_model = DecisionTreeClassifier(max_depth=6)
DT_model.fit(train_X, train_y)
DT_model.score(test_X, test_y)

pred_y = DT_model.predict(test_X)

AUC_DT = create_ROCcurve(test_y, pred_y)
tn_DT, fp_DT, fn_DT, tp_DT, specificity_DT, recall_DT, f1_DT = create_cm(test_y, pred_y)

AUC_list.append(["AUC_DT_S", np.round(AUC_DT, 3)])
specificity_list.append(["specificity_DT_S", np.round(specificity_DT, 3)])
recall_list.append(["recall_DT_S", np.round(recall_DT, 3)])
f1_score_list.append(["f1_DT_S", np.round(f1_DT, 3)])

Search results

k-nearest neighbours

A model that outputs the average or the majority vote of the neighbouring data points as its prediction.
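
For classification, the majority-vote variant can be sketched in a few lines of plain Python. The points and labels are made up, and `knn_predict` is a hypothetical helper; the default `k=5` simply mirrors the default `n_neighbors` of scikit-learn's `KNeighborsClassifier`.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=5):
    """Return the most common label among the k training points closest to query."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Two made-up clusters: label 0 near (1, 1) and label 1 near (6, 6)
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, (2, 2), k=3))
print(knn_predict(points, labels, (6, 5), k=3))
```

Because the vote depends on raw distances, features with large numeric ranges dominate the result unless the inputs are scaled first.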

Training and results

# Copy the shared data
X_KN, y_KN, df_KN = dframe_copy(df_samp, "SOURCE")
train_X, test_X, train_y, test_y = train_test_split(X_KN, y_KN, test_size=0.2, random_state=42)
train_X, val_X, train_y, val_y = train_test_split(train_X, train_y, test_size=0.1, random_state=42)
train_X = np.asarray(train_X).astype(np.float32)
train_y = np.asarray(train_y).astype(np.float32)
val_X = np.asarray(val_X).astype(np.float32)
val_y = np.asarray(val_y).astype(np.float32)
test_X = np.asarray(test_X).astype(np.float32)
test_y = np.asarray(test_y).astype(np.float32)

# Train the model
KN_model = KNeighborsClassifier()
KN_model.fit(train_X, train_y)
KN_model.score(test_X, test_y)

pred_y = KN_model.predict(test_X)

AUC_KN = create_ROCcurve(test_y, pred_y)
tn_KN, fp_KN, fn_KN, tp_KN, specificity_KN, recall_KN, f1_KN = create_cm(test_y, pred_y)
accuracy_KN = accuracy_score(test_y, pred_y)
print(f'acc: {accuracy_KN}')

AUC_list.append(["AUC_KN", np.round(AUC_KN, 3)])
specificity_list.append(["specificity_KN", np.round(specificity_KN, 3)])
recall_list.append(["recall_KN", np.round(recall_KN, 3)])
f1_score_list.append(["f1_KN", np.round(f1_KN, 3)])

scores = cross_validate(KN_model, test_X, test_y, cv=skf, scoring=scoring)
print(scores)

>>
'fit_time': array([0.00431228, 0.00247431, 0.00247526, 0.00253296, 0.00250769]), 
'score_time': array([0.02033329, 0.01186132, 0.011904  , 0.0119164 , 0.01167512]), 
'test_p': array([0.69305556, 0.74102806, 0.6523314 , 0.64574315, 0.71490656]),
'test_r': array([0.69078431, 0.73352941, 0.65117647, 0.64719131, 0.69584158]), 
'test_f': array([0.69170857, 0.73602227, 0.65166623, 0.64627195, 0.69875394])

Results

Running the search

# Grid search
model_param_set_grid = {
    KNeighborsClassifier(): {
        "n_neighbors": [i for i in range(1, 10)]
    }
}

param_search(train_X, test_X, train_y, test_y, model_param_set_grid)

>>
grid
KNeighborsClassifier
Best AUC:{'n_neighbors': 6}
Best specificity:{'n_neighbors': 6}
Best sensitivity:{'n_neighbors': 6}
Best f1_score:{'n_neighbors': 6}
Best AUC: 0.6732163306472795
Best specificity: 0.862475442043222
Best sensitivity: 0.4839572192513369
Best f1_score: 0.5792

Random search

model_param_set_random = {
    KNeighborsClassifier(): {
        "n_neighbors": scipy.stats.randint(1, 20)
    }
}

param_search(train_X, test_X, train_y, test_y, model_param_set_random, "random")

>>

random
KNeighborsClassifier
Best AUC:{'n_neighbors': 15}
Best specificity:{'n_neighbors': 15}
Best sensitivity:{'n_neighbors': 15}
Best f1_score:{'n_neighbors': 15}
Best AUC: 0.6896951136232311
Best specificity: 0.8526522593320236
Best sensitivity: 0.5267379679144385
Best f1_score: 0.609907120743034

Applying the search results

# Train the model with the n_neighbors found by the random search above
KN_model = KNeighborsClassifier(n_neighbors=15)
KN_model.fit(train_X, train_y)
KN_model.score(test_X, test_y)

pred_y = KN_model.predict(test_X)

AUC_KN = create_ROCcurve(test_y, pred_y)
tn_KN, fp_KN, fn_KN, tp_KN, specificity_KN, recall_KN, f1_KN = create_cm(test_y, pred_y)

AUC_list.append(["AUC_KN_S", np.round(AUC_KN, 3)])
specificity_list.append(["specificity_KN_S", np.round(specificity_KN, 3)])
recall_list.append(["recall_KN_S", np.round(recall_KN, 3)])
f1_score_list.append(["f1_KN_S", np.round(f1_KN, 3)])

Search results

Comparing the evaluation metrics

AUC_list = sorted(AUC_list, reverse=True, key=lambda x: x[1])
specificity_list = sorted(specificity_list, reverse=True, key=lambda x: x[1])
recall_list = sorted(recall_list, reverse=True, key=lambda x: x[1])
f1_score_list = sorted(f1_score_list, reverse=True, key=lambda x: x[1])

print("---ROC_AUC---")
for i in range(len(AUC_list)):
  print(AUC_list[i])
print("---Specificity---")
for i in range(len(specificity_list)):
  print(specificity_list[i])
print("---Sensitivity---")
for i in range(len(recall_list)):
  print(recall_list[i])
print("---F1 score---")
for i in range(len(f1_score_list)):
  print(f1_score_list[i])

Sorted in descending order for comparison (entries ending in _S use the parameters found by the search)

ROC_AUC	Specificity	Sensitivity	F1 score
['AUC_GBDT:', 0.816]	['specificity_NN', 0.933]	['recall_RF_S', 0.658]	['f1_RF', 0.707]
['AUC_NN', 0.78]	['specificity_RF', 0.859]	['recall_RF', 0.652]	['f1_RF_S', 0.703]
['AUC_RF', 0.755]	['specificity_GBDT：', 0.855]	['recall_GBDT：', 0.634]	['f1_GBDT：', 0.692]
['AUC_RF_S', 0.75]	['specificity_KN_S', 0.853]	['recall_DT', 0.618]	['f1_LR', 0.666]
['AUC_LR', 0.723]	['specificity_RF_S', 0.843]	['recall_LR', 0.61]	['f1_DT_S', 0.664]
['AUC_DT_S', 0.721]	['specificity_LR', 0.837]	['recall_DT_S', 0.61]	['f1_LR_S', 0.661]
['AUC_LR_S', 0.719]	['specificity_DT_S', 0.833]	['recall_LR_S', 0.607]	['f1_KN', 0.641]
['AUC_KN', 0.702]	['specificity_LR_S', 0.831]	['recall_KN', 0.591]	['f1_DT', 0.625]
['AUC_KN_S', 0.69]	['specificity_KN', 0.813]	['recall_KN_S', 0.527]	['f1_KN_S', 0.61]
['AUC_DT', 0.677]	['specificity_DT', 0.737]	['recall_NN', 0.366]	['f1_NN', 0.503]

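Discussion

Overall, XGBoost and random forest gave the best results.
On the other hand, sensitivity, that is, how well patients who need treatment ("in") are identified, was low across all models and needs improvement.
The neural network reached a specificity of 93.3%, but its sensitivity was only 36.6%, so it is unlikely to be usable in practice.
Using the parameters found by grid search or random search did not bring any large improvement.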
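
One caveat when reading the ROC_AUC column: in the code above, XGBoost and the neural network are scored on continuous outputs, while the scikit-learn models are scored on the 0/1 labels returned by `predict()`. The sketch below, with made-up scores and a hypothetical `roc_auc` helper, illustrates how thresholding before computing the AUC can lower the value for the same underlying model.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up ground truth and predicted probabilities of class 1
y_true = [0, 0, 0, 1, 1, 1]
probs = [0.1, 0.3, 0.6, 0.4, 0.7, 0.9]
hard_labels = [1 if p >= 0.5 else 0 for p in probs]

auc_from_probs = roc_auc(y_true, probs)
auc_from_labels = roc_auc(y_true, hard_labels)
print(auc_from_probs, auc_from_labels)
```

For the scikit-learn classifiers, passing `model.predict_proba(test_X)[:, 1]` to `roc_auc_score` instead of the output of `predict()` would likely make the AUC values more directly comparable across models.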

