
Trying out the GBDT-family machine learning models XGBoost, LightGBM, and CatBoost.

Last updated at 2019-01-29. Posted at 2019-01-29.

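Purpose

The purpose of this page is to get representative GBDT (Gradient Boosting Decision Tree) machine learning models up and running.

Contents

This page covers three machine learning models: XGBoost, LightGBM, and CatBoost.
It summarizes the original papers and the official sites for these GBDT-family models.
Since the goal is simply to get things running, there is no detailed explanation, and the important parameters are set to arbitrarily chosen values.

What you will be able to do after reading this page

Run the representative GBDT methods in your own environment
Get a feel for the background of the main GBDT research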

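Background

Academic background of the machine learning models covered on this page
- XGBoost (eXtreme Gradient Boosting) was published in Chen et al., 2016, Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  - A type of GBDT (Gradient Boosting Decision Tree) that is widely used on Kaggle.
  - Reportedly fairly fast and well suited to sparse data.
  - Reportedly runs into efficiency and scalability problems when the data is large.
- LightGBM was published in Ke et al., 2017, NIPS.
  - Reportedly speeds up training by up to 20 times compared with conventional GBDT.
- CatBoost was published in Prokhorenkova et al., 2018, NeurIPS.
  - Said to outperform other boosting methods on a variety of datasets.
    - At a study group, its performance was described as suspiciously high.
  - Reportedly works especially well on small to medium-sized datasets that contain categorical data.
  - Reportedly works well on datasets of 40,000 samples or fewer thanks to a strategy called Ordered Target Statistics.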
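To make the Ordered Target Statistics idea a little more concrete, here is a minimal sketch in plain Python. It assumes the commonly described form of the technique: each example's category is encoded using only the targets of examples that come before it in a random permutation, smoothed toward a prior. The function name `ordered_target_statistics` and the parameters `prior`, `a`, and `seed` are hypothetical names chosen for this illustration, and the code is not CatBoost's actual implementation.

```python
import random


def ordered_target_statistics(categories, targets, prior, a=1.0, seed=0):
    """Encode a categorical feature using only 'earlier' examples.

    Hypothetical helper for illustration; CatBoost's real implementation
    differs in many details (multiple permutations, other priors, ...).
    """
    n = len(categories)
    order = list(range(n))
    random.Random(seed).shuffle(order)  # random "time" order of the examples

    sums = {}    # running sum of targets seen so far, per category
    counts = {}  # running count of examples seen so far, per category
    encoded = [0.0] * n
    for i in order:
        c = categories[i]
        s = sums.get(c, 0.0)
        k = counts.get(c, 0)
        # Smoothed mean of the targets of preceding examples with the same category
        encoded[i] = (s + a * prior) / (k + a)
        # Only now does example i become visible to later examples
        sums[c] = s + targets[i]
        counts[c] = k + 1
    return encoded


teams = ["A", "B", "A", "A", "B"]
labels = [1, 0, 1, 0, 1]
print(ordered_target_statistics(teams, labels, prior=0.5))
```

Because an example never sees its own target, the encoding avoids the most direct form of target leakage, which is the motivation usually given for this strategy.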
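What is gradient boosting in the first place?
- A powerful machine learning technique that can achieve high-quality results on a wide range of practical tasks (datasets)
- It builds an ensemble of weak learners by gradient descent: each new learner is fitted to reduce the loss of the current ensemble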
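The idea of building an ensemble by gradient descent can be sketched in a few lines of plain Python. The example below is a toy under stated assumptions: a single numeric feature, squared-error loss (whose negative gradient is simply the residual), and one-split regression stumps as weak learners. The names `fit_stump`, `gradient_boost`, and `mse` are hypothetical helpers for this illustration; XGBoost, LightGBM, and CatBoost add regularization, better split finding, and many other refinements.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression stump to the residuals (least squares)."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # skip the max so both sides are non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def gradient_boost(x, y, n_rounds=50, learning_rate=0.1):
    """Boost stumps on squared error; returns a prediction function."""
    base = sum(y) / len(y)  # start from the mean prediction
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of 1/2 * (y - F)^2 with respect to F is the residual
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Take a small step in the direction of the fitted stump
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)


def mse(y, pred):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)


x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [0.1, 0.9, 2.1, 2.9, 4.2, 4.8, 6.1, 7.0]
model = gradient_boost(x, y)
baseline_mse = mse(y, [sum(y) / len(y)] * len(y))
boosted_mse = mse(y, [model(xi) for xi in x])
print(baseline_mse, boosted_mse)
```

Each round lowers the training error a little, which is also why an unconstrained booster can fit a small training set perfectly.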

Installation

Preparing xgboost: on macOS (for Windows, see the separate guide)
- brew install gcc@7
  - If you get "Error: Xcode alone is not sufficient on High Sierra."
    - Work around it with xcode-select --install
- pip install xgboost
Preparing lightgbm: on macOS (reference)
- brew install libomp
- pip install lightgbm
Preparing CatBoost: on macOS (reference)
- Run either pip install catboost or conda install catboost
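After installing, a quick way to confirm that all three packages are visible to your Python interpreter is the following sketch. It only uses the standard library; `check_installed` is a hypothetical helper name, and it checks that each package can be found without actually importing it.

```python
import importlib.util


def check_installed(packages=("xgboost", "lightgbm", "catboost")):
    """Return {package_name: True/False} depending on whether it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


for name, ok in check_installed().items():
    print("{}: {}".format(name, "installed" if ok else "NOT found"))
```

If a package is reported as not found, rerun the corresponding install step in the same environment that runs your notebook.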

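Experiment

Loading the data

Key variables
- data_set holds what machine learning calls the features (or feature variables)
- target_set holds what machine learning calls the class (the classification target)
Data (Kaggle)
- Predict FIFA 2018 Man of the Match: match statistics with which team player has won Man of the Match https://www.kaggle.com/mathan/fifa-2018-match-statistics
- The column Man of the Match contains a binary value indicating whether the team had the so-called MVP player. This column is used as the classification target (class).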

import numpy as np
import pandas as pd
# Cross-validation utilities
from sklearn.model_selection import StratifiedKFold, cross_validate
# Gradient boosting classifiers
from catboost import CatBoostClassifier
import lightgbm as lgb
import xgboost as xgb
# Metrics
from sklearn.metrics import accuracy_score, cohen_kappa_score, make_scorer
# Pretty-printing
import pprint
# Saving and loading models
# (sklearn.externals.joblib has been removed from scikit-learn; import joblib directly)
from joblib import dump, load

# Load the Kaggle FIFA 2018 match statistics
data = pd.read_csv("FIFA_data.csv")
pd.set_option('display.max_rows', 10)
print(data.columns)
# Use only the integer-valued columns as features
feature_names = [i for i in data.columns if data[i].dtype in [np.int64]]
# Encode the target: 'Yes' -> 1, 'No' -> 0
target_set = data['Man of the Match'].map({'Yes': 1, 'No': 0})
print(target_set, type(target_set))
print(feature_names)
data_set = data[feature_names]

Index(['Date', 'Team', 'Opponent', 'Goal Scored', 'Ball Possession %',
       'Attempts', 'On-Target', 'Off-Target', 'Blocked', 'Corners', 'Offsides',
       'Free Kicks', 'Saves', 'Pass Accuracy %', 'Passes',
       'Distance Covered (Kms)', 'Fouls Committed', 'Yellow Card',
       'Yellow & Red', 'Red', 'Man of the Match', '1st Goal', 'Round', 'PSO',
       'Goals in PSO', 'Own goals', 'Own goal Time'],
      dtype='object')
0      1
1      0
2      0
3      1
4      0
      ..
123    0
124    1
125    0
126    1
127    0
Name: Man of the Match, Length: 128, dtype: int64 <class 'pandas.core.series.Series'>
['Goal Scored', 'Ball Possession %', 'Attempts', 'On-Target', 'Off-Target', 'Blocked', 'Corners', 'Offsides', 'Free Kicks', 'Saves', 'Pass Accuracy %', 'Passes', 'Distance Covered (Kms)', 'Fouls Committed', 'Yellow Card', 'Yellow & Red', 'Red', 'Goals in PSO']

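Shared setup

Evaluation metrics: accuracy and the kappa coefficient (Cohen's kappa).
Cross-validation: stratified 10-fold cross-validation.
If these are unfamiliar, see "Cross-validation: the performance difference between KFold and StratifiedKFold".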
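As a small worked example of how the two metrics relate, the sketch below computes accuracy and Cohen's kappa by hand, using kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement (the accuracy) and p_e is the agreement expected by chance. The toy labels are made up for illustration: 14 samples, 7 per class, 8 predicted correctly. The helper names are hypothetical, and the function assumes p_e is not exactly 1.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)  # observed agreement
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    labels = set(true_counts) | set(pred_counts)
    # Chance agreement: product of the marginal label frequencies, summed over labels
    p_e = sum(true_counts[k] * pred_counts[k] for k in labels) / (n * n)
    return (p_o - p_e) / (1 - p_e)  # undefined when p_e == 1


# Toy fold: 7 positives, 7 negatives, 4 + 4 = 8 correct predictions
y_true = [1] * 7 + [0] * 7
y_pred = [1] * 4 + [0] * 3 + [0] * 4 + [1] * 3

print(accuracy(y_true, y_pred))     # 8 / 14
print(cohen_kappa(y_true, y_pred))  # (8/14 - 0.5) / 0.5 = 1 / 7
```

When the true labels are perfectly balanced between two classes, p_e is 0.5 and kappa reduces to 2 * accuracy - 1, so a kappa of 0 corresponds to coin-flip accuracy.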

scoring = {'accuracy': make_scorer(accuracy_score),
           'kappa': make_scorer(cohen_kappa_score)
          }
skf = StratifiedKFold(n_splits=10,
                      shuffle=True,
                      random_state=0)
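To illustrate what stratification means, here is a simplified plain-Python sketch that deals the indices of each class out to the folds in turn, so that every fold keeps roughly the same class ratio. This is not scikit-learn's exact algorithm; `stratified_folds` is a hypothetical helper, and the 64/64 class split used below is an assumption made for the example rather than a figure stated in the dataset description.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, n_splits=10, seed=0):
    """Assign sample indices to folds while preserving the class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)  # deal out round-robin
    return folds


# Assumed class balance for illustration: 64 positives and 64 negatives
labels = [1] * 64 + [0] * 64
folds = stratified_folds(labels)
fold_sizes = [len(f) for f in folds]
class_counts = [Counter(labels[i] for i in f) for f in folds]
print(fold_sizes)
print(class_counts[0], class_counts[-1])
```

With 128 samples and 10 folds, each test fold holds only 12 to 14 samples, so a single misclassification moves a fold's accuracy by 7 to 8 percentage points.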

XGBoost

Let's run XGBoost.

%%time
# XGBoost model; the hyperparameters are arbitrary, not tuned
model = xgb.XGBClassifier(max_depth = 50,
                           learning_rate = 0.16,
                           min_child_weight = 1,
                           n_estimators = 200)
scores_xgb = cross_validate(model, 
                        data_set,   # features
                        target_set, # class labels
                        cv=skf, 
                        n_jobs = -1,
                        scoring=scoring,
                        return_train_score=True)  # needed on recent scikit-learn to get the train_* scores
pprint.pprint("XGBoost:{}".format(scores_xgb))
pprint.pprint("XGBoost_accuracy:{}".format(scores_xgb["test_accuracy"].mean()))
pprint.pprint("XGBoost_kappa:{}".format(scores_xgb["test_kappa"].mean()))

("XGBoost:{'fit_time': array([0.07296681, 0.07270503, 0.06822896, 0.08056283, "
 '0.06243491,\n'
 '       0.06437612, 0.080024  , 0.07008386, 0.08049893, 0.09190106]), '
 "'score_time': array([0.00357699, 0.0035851 , 0.00481105, 0.00362611, "
 '0.0033021 ,\n'
 '       0.00332212, 0.00316381, 0.00327396, 0.00260711, 0.00225186]), '
 "'test_accuracy': array([0.57142857, 0.71428571, 0.64285714, 0.85714286, "
 '0.83333333,\n'
 '       0.5       , 0.5       , 0.91666667, 0.33333333, 0.58333333]), '
 "'train_accuracy': array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]), "
 "'test_kappa': array([ 0.14285714,  0.42857143,  0.28571429,  0.71428571,  "
 '0.66666667,\n'
 '        0.        ,  0.        ,  0.83333333, -0.33333333,  0.16666667]), '
 "'train_kappa': array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])}")
'XGBoost_accuracy:0.6452380952380952'
'XGBoost_kappa:0.29047619047619044'
CPU times: user 64.7 ms, sys: 44.8 ms, total: 110 ms
Wall time: 2.03 s

LightGBM

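Let's run LightGBM.

%%time
# LightGBM model; n_estimators is the scikit-learn style name for num_trees.
# The `silent` argument is no longer accepted by recent LightGBM releases, so it is omitted.
model = lgb.LGBMClassifier(num_leaves = 31,
                           n_estimators = 100,
                           objective = 'binary',
                           metric = 'binary_logloss')

scores_lgb = cross_validate(model, 
                        data_set,   # features
                        target_set, # class labels
                        cv=skf, 
                        n_jobs = -1,
                        scoring=scoring,
                        return_train_score=True)  # needed on recent scikit-learn to get the train_* scores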

CPU times: user 34.1 ms, sys: 3.33 ms, total: 37.4 ms
Wall time: 283 ms

pprint.pprint("LightGBM:{}".format(scores_lgb))
pprint.pprint("LightGBM_accuracy:{}".format(scores_lgb["test_accuracy"].mean()))
pprint.pprint("LightGBM_kappa:{}".format(scores_lgb["test_kappa"].mean()))

("LightGBM:{'fit_time': array([0.05086493, 0.04860687, 0.04421997, 0.04607511, "
 '0.0268271 ,\n'
 '       0.0277791 , 0.02678585, 0.02445316, 0.01633024, 0.02036715]), '
 "'score_time': array([0.02009106, 0.01956511, 0.02327609, 0.02541184, "
 '0.01577282,\n'
 '       0.01657176, 0.01604891, 0.01570582, 0.00908589, 0.00763202]), '
 "'test_accuracy': array([0.57142857, 0.57142857, 0.57142857, 0.78571429, "
 '0.75      ,\n'
 '       0.66666667, 0.41666667, 0.91666667, 0.33333333, 0.66666667]), '
 "'train_accuracy': array([0.94736842, 0.96491228, 0.93859649, 0.94736842, "
 '0.94827586,\n'
 '       0.96551724, 0.96551724, 0.94827586, 0.97413793, 0.95689655]), '
 "'test_kappa': array([ 0.14285714,  0.14285714,  0.14285714,  0.57142857,  "
 '0.5       ,\n'
 '        0.33333333, -0.16666667,  0.83333333, -0.33333333,  0.33333333]), '
 "'train_kappa': array([0.89473684, 0.92982456, 0.87719298, 0.89473684, "
 '0.89655172,\n'
 '       0.93103448, 0.93103448, 0.89655172, 0.94827586, 0.9137931 ])}')
'LightGBM_accuracy:0.625'
'LightGBM_kappa:0.25000000000000006'

CatBoost

To use CatBoost, use the CatBoostClassifier class.

%%time
# Set up the CatBoost model; the hyperparameters are arbitrary, not tuned
model = CatBoostClassifier(iterations=500, 
                           eval_metric = 'Kappa',
                           learning_rate=0.01, 
                           l2_leaf_reg = 9,
                           depth=10,
                           one_hot_max_size = 50,
                           loss_function='Logloss')

scores_cat = cross_validate(model, 
                        data_set,   # features
                        target_set, # class labels
                        cv=skf, 
                        n_jobs = -1,
                        scoring=scoring,
                        return_train_score=True)  # needed on recent scikit-learn to get the train_* scores
pprint.pprint("CatBoost:{}".format(scores_cat))
pprint.pprint("CatBoost_accuracy:{}".format(scores_cat["test_accuracy"].mean()))
pprint.pprint("CatBoost_kappa:{}".format(scores_cat["test_kappa"].mean()))

("CatBoost:{'fit_time': array([33.30261898, 33.46484399, 33.24761176, "
 '33.5051899 , 37.29434705,\n'
 '       38.02754211, 37.78509092, 38.18597794, 21.43462086, 19.71722698]), '
 "'score_time': array([0.01334691, 0.01270199, 0.01476884, 0.00987601, "
 '0.01335907,\n'
 '       0.01004386, 0.01510406, 0.01020694, 0.00449705, 0.00260401]), '
 "'test_accuracy': array([0.5       , 0.78571429, 0.78571429, 0.92857143, "
 '0.83333333,\n'
 '       0.66666667, 0.66666667, 0.75      , 0.5       , 0.66666667]), '
 "'train_accuracy': array([0.95614035, 0.94736842, 0.93859649, 0.93859649, "
 '0.95689655,\n'
 '       0.93965517, 0.93103448, 0.93965517, 0.94827586, 0.92241379]), '
 "'test_kappa': array([0.        , 0.57142857, 0.57142857, 0.85714286, "
 '0.66666667,\n'
 '       0.33333333, 0.33333333, 0.5       , 0.        , 0.33333333]), '
 "'train_kappa': array([0.9122807 , 0.89473684, 0.87719298, 0.87719298, "
 '0.9137931 ,\n'
 '       0.87931034, 0.86206897, 0.87931034, 0.89655172, 0.84482759])}')
'CatBoost_accuracy:0.7083333333333333'
'CatBoost_kappa:0.4166666666666667'
CPU times: user 56.7 ms, sys: 34.1 ms, total: 90.7 ms
Wall time: 1min 33s

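Results

This was not a properly controlled comparison, but:
- CatBoost had the best predictive performance (mean accuracy 0.708, mean kappa 0.417).
- However, CatBoost's cross-validation took about 1 min 33 s of wall time, which is vastly longer than LightGBM's 283 ms (XGBoost took 2.03 s).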

pprint.pprint("XGBoost_accuracy:{}".format(scores_xgb["test_accuracy"].mean()))
pprint.pprint("XGBoost_kappa:{}".format(scores_xgb["test_kappa"].mean()))
pprint.pprint("LightGBM_accuracy:{}".format(scores_lgb["test_accuracy"].mean()))
pprint.pprint("LightGBM_kappa:{}".format(scores_lgb["test_kappa"].mean()))
pprint.pprint("CatBoost_accuracy:{}".format(scores_cat["test_accuracy"].mean()))
pprint.pprint("CatBoost_kappa:{}".format(scores_cat["test_kappa"].mean()))

'XGBoost_accuracy:0.6452380952380952'
'XGBoost_kappa:0.29047619047619044'
'LightGBM_accuracy:0.625'
'LightGBM_kappa:0.25000000000000006'
'CatBoost_accuracy:0.7083333333333333'
'CatBoost_kappa:0.4166666666666667'
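Because each test fold is small, it may help to look at the spread of the per-fold scores as well as their means. The sketch below copies the test_accuracy arrays printed above and summarizes them with the standard library; it is only a reading aid for the numbers already shown, not an additional experiment.

```python
from statistics import mean, stdev

# Per-fold test accuracies copied from the cross_validate outputs above
test_accuracy = {
    "XGBoost": [0.57142857, 0.71428571, 0.64285714, 0.85714286, 0.83333333,
                0.5, 0.5, 0.91666667, 0.33333333, 0.58333333],
    "LightGBM": [0.57142857, 0.57142857, 0.57142857, 0.78571429, 0.75,
                 0.66666667, 0.41666667, 0.91666667, 0.33333333, 0.66666667],
    "CatBoost": [0.5, 0.78571429, 0.78571429, 0.92857143, 0.83333333,
                 0.66666667, 0.66666667, 0.75, 0.5, 0.66666667],
}

# Mean and sample standard deviation across the 10 folds
summary = {name: (mean(scores), stdev(scores))
           for name, scores in test_accuracy.items()}

for name, (m, s) in summary.items():
    print("{}: mean={:.3f}, std={:.3f}, min={:.3f}, max={:.3f}".format(
        name, m, s, min(test_accuracy[name]), max(test_accuracy[name])))
```

The fold accuracies vary widely (XGBoost's range from 0.33 to 0.92, for example), so the ranking of the three models on this 128-row dataset should be read as tentative.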
