
A Struggle with Hyperparameter Tuning for GBDT-Family Machine Learning Models ~ CatBoost vs LightGBM vs XGBoost vs Random Forests ~ Part 1

Last updated at 2019-02-01, posted at 2019-01-31

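Introduction

Purpose

This page attempts to tune representative GBDT (Gradient Boosting Decision Tree) machine learning models.
In my articles I always try to reach the goal with as little programming as possible, so if you know a simpler way to write any of this, please let me know.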

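Contents

The machine learning models covered on this page are Random Forests, XGBoost, LightGBM, and CatBoost.
The representative parameters of each model are introduced.
Each model is then tuned using those parameters.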

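Caution

The machine learning models covered on this page require a large amount of computing resources. Be aware that running them puts a heavy load on your PC.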

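What you will be able to do after reading this page

Tune representative decision-tree-based models.
Compare the accuracy and computation speed of the models.

A more thorough follow-up is planned as "Random Forests vs XGBoost vs LightGBM vs CatBoost: A Tuning Struggle, Part 2" (under construction).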

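Prerequisites

If you have never used GBDT-family machine learning models before, please refer to the previous article, "Trying out the GBDT-family machine learning models XGBoost, LightGBM, and CatBoost".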

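Background

Academic background of the machine learning models covered on this page
- For XGBoost through CatBoost, see the previous article.
- Random Forests (hereafter RF) was published in Machine Learning (Breiman, 2001).
  - RF is often compared with methods such as SVM on many datasets.
  - When the data contain relatively little noise, RF can be outperformed in prediction accuracy by a very finely tuned SVM. However, tuning RF seems easier than tuning SVM.
  - Random Forests is much easier to handle than XGBoost, LightGBM, or CatBoost.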

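Experiments

Loading the data

Key variables
- data_set holds the features (also called feature variables in machine learning terminology)
- target_set holds the class (the classification target in machine learning terminology)
Data (Kaggle)
- Predict FIFA 2018 Man of the Match: Match statistics with which team player has won Man of the match https://www.kaggle.com/mathan/fifa-2018-match-statistics
- The column Man of the Match contains a binary value indicating whether the team had the so-called MVP player. This column is used as the classification target (class).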

import numpy as np
import pandas as pd
# from sklearn.model_selection import train_test_split
# from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
# Classification with CatBoost
from catboost import CatBoostClassifier
import lightgbm as lgb
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
# For computing metrics
from sklearn.metrics import accuracy_score, cohen_kappa_score, make_scorer, f1_score, recall_score
# For nicer-looking output
import matplotlib.pyplot as plt
import seaborn as sns
import pprint, pydotplus
from pylab import rcParams
# Saving and loading models
# (sklearn.externals.joblib has been removed from scikit-learn; import joblib directly)
from joblib import dump
from joblib import load

data = pd.read_csv("FIFA_data.csv")
pd.set_option('display.max_rows', 10)
print(data.columns)
feature_names = [i for i in data.columns if data[i].dtype in [np.int64]]
target_set = data['Man of the Match'] 
target_set = target_set.replace('Yes', 1)
target_set = target_set.replace('No', 0)
print(target_set,type(target_set))
print(feature_names,len(feature_names))
data_set = data[feature_names]

Common setup

Cross-validation: stratified ten-fold cross-validation
If these terms are unfamiliar, see "Cross-validation: Performance differences between KFold and StratifiedKFold".

skf = StratifiedKFold(n_splits=10,
                      shuffle=True,
                      random_state=0)
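To make the idea of stratification concrete, here is a minimal hand-rolled sketch in plain Python. It is not how scikit-learn implements StratifiedKFold. The function name stratified_folds and the toy labels are hypothetical and exist only for illustration. The point is that each fold keeps roughly the same class ratio as the whole dataset.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, n_splits, seed=0):
    """Assign each sample index to a fold so that every fold keeps
    roughly the same class ratio as the full label list."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the indices of one class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


# Toy labels (hypothetical): 60 negatives and 40 positives.
toy_labels = [0] * 60 + [1] * 40
toy_folds = stratified_folds(toy_labels, n_splits=10)
fold_class_counts = [Counter(toy_labels[i] for i in fold) for fold in toy_folds]

# Every fold holds 6 negatives and 4 positives, the same 60:40 ratio.
print(fold_class_counts[0])
```

With a plain (non-stratified) split, a small dataset can easily produce folds whose class ratio drifts away from the overall ratio, which makes the per-fold scores noisier.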

Random Forests

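The most important parameters of Random Forests are the following.

n_estimators: the number of trees (in my experience, around 500 to 1000)
max_features: the number of features considered at each split when growing a decision tree (in my experience, around sqrt(number of features))

If the model overfits, set the following parameter.

max_depth

The following is decided directly from the data. It is unnecessary when the class labels occur in equal proportions.

class_weight: specify "balanced" or "balanced_subsample" when the class-label proportions are skewed. Not needed this time.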
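As a rough illustration of what "balanced" does, scikit-learn documents the weight of each class as n_samples / (n_classes * count of that class). The sketch below reproduces that formula in plain Python. The helper name balanced_class_weights and the toy labels are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count),
    the formula documented for class_weight="balanced"."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Skewed toy labels (hypothetical): 90 negatives and 10 positives.
skewed_weights = balanced_class_weights([0] * 90 + [1] * 10)
# Evenly split toy labels: both classes get weight 1.0.
even_weights = balanced_class_weights([0] * 50 + [1] * 50)

print(skewed_weights)
print(even_weights)
```

When the classes are evenly split, every weight is 1.0, which is why the setting makes no difference on balanced data.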

For more detailed parameters, see the sklearn.ensemble.RandomForestClassifier page.

%%time
model = RandomForestClassifier(random_state = 43,
                               n_jobs = -1,
                               oob_score=True)
# Set the parameters
param_grid = {"n_estimators":[100,500,1000], # going up to 2000 is unnecessary
              "max_features": [1, 2, 3, 4, 5, 7, 10],
              "max_depth": [3,5,7,10,15,None], # 20 and 30 cause overfitting
              "min_samples_leaf":  [1, 2, 4],
              "min_samples_split": [2, 5, 10]
             }
# Tune the parameters with grid search
grid_result = GridSearchCV(estimator = model,
                           param_grid = param_grid,
                           scoring = 'balanced_accuracy', #accuracy, balanced_accuracy
                           cv = skf,
                           return_train_score = True,
                           n_jobs = -1)

grid_result.fit(data_set, target_set)

CPU times: user 45.3 s, sys: 2.09 s, total: 47.3 s
Wall time: 1h 6min 9s
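The long wall time is easier to understand once the size of the grid is counted. The following sketch uses only the standard library. The grid is copied from the code above, and the variable names are my own.

```python
from itertools import product
from math import prod

# Same grid as the Random Forests search above.
rf_param_grid = {
    "n_estimators": [100, 500, 1000],
    "max_features": [1, 2, 3, 4, 5, 7, 10],
    "max_depth": [3, 5, 7, 10, 15, None],
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 5, 10],
}
n_splits = 10  # ten-fold cross-validation

# Number of parameter combinations: 3 * 7 * 6 * 3 * 3.
n_candidates = prod(len(values) for values in rf_param_grid.values())
# Each combination is fitted once per fold.
n_cv_fits = n_candidates * n_splits

# Enumerating the combinations explicitly gives the same count.
all_candidates = list(product(*rf_param_grid.values()))

print(n_candidates)  # 1134
print(n_cv_fits)     # 11340
```

With the default refit=True, GridSearchCV additionally fits the best combination once more on the full data.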

Extract the best classifier

pprint.pprint(grid_result.best_estimator_)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features=3, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
            oob_score=True, random_state=43, verbose=0, warm_start=False)

Extract the best parameters

pprint.pprint(grid_result.best_params_)

{'max_depth': 10,
 'max_features': 5,
 'min_samples_leaf': 1,
 'min_samples_split': 10,
 'n_estimators': 100}

Extract the best accuracy

Displays the mean cross-validated score on the held-out folds (the metric specified by scoring in GridSearchCV).

pprint.pprint(grid_result.best_score_)

0.7734375
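Because the search above optimizes balanced_accuracy rather than plain accuracy, it may help to see how the two differ. Balanced accuracy is the mean of the per-class recalls. The sketch below computes both in plain Python on hypothetical toy predictions. The function names are my own, not scikit-learn's.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose prediction matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the recall obtained on each class."""
    recalls = []
    for cls in sorted(set(y_true)):
        preds_for_cls = [p for t, p in zip(y_true, y_pred) if t == cls]
        recalls.append(sum(p == cls for p in preds_for_cls) / len(preds_for_cls))
    return sum(recalls) / len(recalls)


# Toy case (hypothetical): 8 negatives, 2 positives,
# and a classifier that always predicts the majority class.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10

acc = accuracy(y_true, y_pred)
bal = balanced_accuracy(y_true, y_pred)
print(acc)  # 0.8
print(bal)  # 0.5
```

A classifier that ignores the minority class still looks good under accuracy but not under balanced accuracy.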

XGBoost

See here for the parameters.

Initializing XGBoost and setting its parameters

%%time
# Initialize XGBoost
model = xgb.XGBClassifier()
# pprint.pprint(model.get_params())
# Set the parameters
param_grid = {"max_depth": [ 3, 6, 10,25], #10, 25,
              "learning_rate" : [0.0001,0.001,0.01], # 0.05,0.1
              "min_child_weight" : [1,3,6],
              "n_estimators": [100,200,300], # 500
              "subsample": [0.5,0.75,0.9],
              "gamma":[0,0.1,0.2],
              "eta": [0.3,0.15,0.10] # note: in XGBoost, eta is an alias of learning_rate
             }
# Configure the grid search used for parameter tuning
## Note: GridSearchCV picks the best candidate according to the metric given in scoring
grid_result = GridSearchCV(estimator = model,
                           param_grid = param_grid,
                           scoring = 'balanced_accuracy',
                           cv = skf,
                           verbose=3,
                           return_train_score = True,
                           n_jobs = -1)
grid_result.fit(data_set, target_set)

CPU times: user 55.2 s, sys: 2 s, total: 57.2 s
Wall time: 10min 11s
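To get a feel for how learning_rate and n_estimators interact, here is a deliberately simplified toy calculation. It is not XGBoost itself. It assumes squared loss and a "tree" that only predicts the mean residual, so each boosting round removes a fraction learning_rate of the remaining residual. The function name fraction_fitted is hypothetical.

```python
def fraction_fitted(learning_rate, n_estimators):
    """Fraction of the initial mean residual removed after boosting
    with constant weak learners under squared loss (toy setting)."""
    residual = 1.0
    for _ in range(n_estimators):
        # Each round shrinks the remaining residual by learning_rate.
        residual -= learning_rate * residual
    return 1.0 - residual


# Same learning_rate and n_estimators values as the grid above.
for lr in (0.0001, 0.001, 0.01):
    for n in (100, 200, 300):
        print(lr, n, round(fraction_fitted(lr, n), 4))
```

In this toy setting, learning_rate=0.0001 with 200 rounds removes only about 2% of the initial residual, whereas learning_rate=0.01 with 300 rounds removes about 95%. Real boosted trees behave differently in detail, but the same trade-off applies: a very small learning rate needs many more trees.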

Extract the best classifier

pprint.pprint(grid_result.best_estimator_)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, eta=0.3, gamma=0, learning_rate=0.0001,
       max_delta_step=0, max_depth=3, min_child_weight=3, missing=None,
       n_estimators=200, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=True,
       subsample=0.75)

Extract the best parameters

pprint.pprint(grid_result.best_params_)

{'eta': 0.3,
 'gamma': 0,
 'learning_rate': 0.0001,
 'max_depth': 3,
 'min_child_weight': 3,
 'n_estimators': 200,
 'subsample': 0.75}

Extract the best accuracy

Displays the mean cross-validated score on the held-out folds (the metric specified by scoring in GridSearchCV).

pprint.pprint(grid_result.best_score_)

LightGBM

Initializing LightGBM and setting its parameters

%%time
# Initialize LightGBM
model = lgb.LGBMClassifier(silent=False)
# pprint.pprint(model.get_params())
# Set the parameters
param_grid = {"max_depth": [10, 25, 50, 75],
              "learning_rate" : [0.001,0.01,0.05,0.1],
              "num_leaves": [100,300,900,1200],
              "n_estimators": [100,200,500]
             }
# Configure the grid search used for parameter tuning
## Note: GridSearchCV picks the best candidate according to the metric given in scoring
grid_result = GridSearchCV(estimator = model,
                           param_grid = param_grid,
                           scoring = 'balanced_accuracy',
                           cv = skf,
                           verbose=3,
                           return_train_score = True,
                           n_jobs = -1)

grid_result.fit(data_set, target_set)

CPU times: user 3.32 s, sys: 150 ms, total: 3.47 s
Wall time: 40.6 s
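The grid above varies max_depth and num_leaves independently, but the two are related: a binary tree of depth d can have at most 2^d leaves. The small sketch below checks which combinations in the grid ask for more leaves than the depth limit allows. The helper name max_leaves_for_depth is hypothetical.

```python
def max_leaves_for_depth(max_depth):
    """Upper bound on the number of leaves of a binary tree of this depth."""
    return 2 ** max_depth


# Same values as the LightGBM grid above.
grid_max_depth = [10, 25, 50, 75]
grid_num_leaves = [100, 300, 900, 1200]

# Combinations where num_leaves exceeds what max_depth permits.
capped = [
    (depth, leaves)
    for depth in grid_max_depth
    for leaves in grid_num_leaves
    if leaves > max_leaves_for_depth(depth)
]
print(capped)  # [(10, 1200)]
```

Only max_depth=10 with num_leaves=1200 exceeds the bound (2^10 = 1024). For the deeper settings, num_leaves is the limit that actually constrains tree size.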

Extract the best classifier

pprint.pprint(grid_result.best_estimator_)

Extract the best parameters

pprint.pprint(grid_result.best_params_)

Extract the best accuracy

Displays the mean cross-validated score on the held-out folds (the metric specified by scoring in GridSearchCV).

pprint.pprint(grid_result.best_score_)

0.703125

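Tuning CatBoost

For more detailed parameters, see Yandex's Training parameters page. Yandex's Parameter tuning page is also a useful reference.

The following CatBoost parameters are candidates for tuning.
- depth
- learning_rate
- l2_leaf_reg
- iterations
Parameter setting for ensuring reproducibility
- random_seed: required, because reproducibility is a minimum requirement in academic papers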

%%time
model = CatBoostClassifier()
# pprint.pprint(model.get_params())
# Set the parameters
param_grid = {'depth': [4, 7, 10],
         'learning_rate' : [0.01, 0.1, 0.15],
         'l2_leaf_reg': [1,4,9],
         'iterations': [300]}
# Configure the grid search used for parameter tuning
## Note: GridSearchCV picks the best candidate according to the metric given in scoring
grid_result = GridSearchCV(estimator = model,
                           param_grid = param_grid,
                           scoring = 'accuracy',
                           cv = skf,
                           verbose=3,
                           return_train_score = True,
                           n_jobs = -1)
grid_result.fit(data_set, target_set)

CPU times: user 5.22 s, sys: 972 ms, total: 6.19 s
Wall time: 16min 48s

Extract the best parameters

pprint.pprint(grid_result.best_params_)

{'depth': 4, 'iterations': 300, 'l2_leaf_reg': 4, 'learning_rate': 0.01}

Extract the best accuracy

Displays the mean cross-validated score on the held-out folds (the metric specified by scoring in GridSearchCV).

pprint.pprint(grid_result.best_score_)

0.7578125

Results

Random Forests
- Accuracy: 0.7734375
- Training time: 1 hour 6 minutes
XGBoost
- Accuracy: 0.71875
- Training time: 10 minutes
LightGBM
- Accuracy: 0.703125
- Training time: 40 seconds
CatBoost
- Accuracy: 0.7578125
- Training time: 16 minutes
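The figures above can be ranked with a few lines of Python. The scores and wall times are copied from the outputs on this page. Treat the ranking as rough, because the CatBoost search used scoring='accuracy' while the others used 'balanced_accuracy', and the four grids differ in size.

```python
# Scores and grid-search wall times copied from the outputs above.
results = {
    "Random Forests": {"score": 0.7734375, "seconds": 1 * 3600 + 6 * 60 + 9},
    "XGBoost": {"score": 0.71875, "seconds": 10 * 60 + 11},
    "LightGBM": {"score": 0.703125, "seconds": 40.6},
    "CatBoost": {"score": 0.7578125, "seconds": 16 * 60 + 48},
}

# Highest score first.
by_score = sorted(results, key=lambda name: results[name]["score"], reverse=True)
# Shortest wall time first.
by_time = sorted(results, key=lambda name: results[name]["seconds"])

print("Ranked by score:", by_score)
print("Ranked by time: ", by_time)
```

Random Forests has the best score but the longest search time, while LightGBM is the fastest but has the lowest score.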
