
Trying to extract at-a-glance rules from decision-tree-based machine learning models (Random Forests, XGBoost) using defragTrees.

Posted at 2019-02-15






  • Random Forests
  • XGBoost



This page is based on the paper Making Tree Ensembles Interpretable: A Bayesian Model Selection Approach. For the paper's source code, it refers to defragTrees.


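defragTrees cannot be installed with the pip command or from a git+URL. Instead, download defragTrees.py from the defragTrees repository. Place the downloaded defragTrees.py in the same folder as your source file, and it can then be used with "from defragTrees import DefragModel".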
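Since defragTrees.py has to be copied by hand, it can help to confirm that Python actually sees it before running the rest of the code. The following is only a minimal sketch: the helper name module_available is hypothetical (it is not part of defragTrees), and it assumes the file sits in the folder passed as folder.

```python
import importlib.util
import os
import sys


def module_available(name: str, folder: str = ".") -> bool:
    """Return True if `name` can be found once `folder` is on sys.path."""
    folder = os.path.abspath(folder)
    if folder not in sys.path:
        # Mirror the "same folder as the source file" setup described above.
        sys.path.insert(0, folder)
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    if module_available("defragTrees"):
        print("defragTrees.py found; the import should work")
    else:
        print("defragTrees.py not found; place it next to this script")
```

If the check fails, the most likely cause is that the script is being run from a different working directory than the one containing defragTrees.py.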

import numpy as np
from sklearn.datasets import make_classification
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.metrics import accuracy_score, cohen_kappa_score, balanced_accuracy_score, make_scorer, f1_score, recall_score
from sklearn.ensemble import RandomForestClassifier
import lightgbm as lgb
import xgboost as xgb
from defragTrees import DefragModel
from joblib import dump, load  # sklearn.externals.joblib was removed in scikit-learn 0.23




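  • n_samples: number of samples
    • In a questionnaire survey, this corresponds to the number of respondents
  • n_features: number of features
    • The number of features used to identify the classification target (class)
    • In a questionnaire survey, for example, each question-and-answer pair
  • n_informative: number of features that are actually related to the classification target
    • The larger this number, the easier the prediction becomes
  • n_classes: number of distinct classes
    • Set this to 2 for a binary classification problem
  • random_state: random seed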
data = make_classification(n_samples=1000, # number of samples to generate
data_set = data[0] # features
target_set = data[1] # class labels
#from sklearn.datasets import load_iris
#iris = load_iris()
#data_set = iris.data
#target_set =iris.target
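The make_classification call above is shown only up to n_samples, so the remaining arguments are not known. As a rough sketch of what a complete call could look like, the example below fills them in with assumed values: n_features=100 is a guess based on the rules further down referring to x_96, and n_informative, n_classes and random_state are plain placeholders, not the values actually used.

```python
from sklearn.datasets import make_classification

# Every argument except n_samples is an assumption made for illustration.
data_set, target_set = make_classification(
    n_samples=1000,   # from the article
    n_features=100,   # assumption: the extracted rules mention x_96
    n_informative=2,  # assumption (scikit-learn's default)
    n_classes=2,      # assumption: binary classification
    random_state=0,   # assumption: any fixed seed makes the data reproducible
)

print(data_set.shape)    # (1000, 100)
print(target_set.shape)  # (1000,)
```

Because the real settings are unknown, scores and rules obtained from this data will not match the numbers shown in this article.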

Random Forests


model = BalancedRandomForestClassifier(random_state = 43,
                               n_jobs = 1,
                               n_estimators = 500,
                               max_features = "log2",
                               class_weight = 'balanced',
                               sampling_strategy = 'all',
                               max_depth = None,
                               min_samples_leaf = 2) # value taken from the fitted model's parameters printed below
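# metrics reported by cross-validation
scoring = {'accuracy': make_scorer(accuracy_score),
           'kappa': make_scorer(cohen_kappa_score),
           'balanced_accuracy': make_scorer(balanced_accuracy_score) }
# 5-fold stratified cross-validation
skf = StratifiedKFold(n_splits=5,shuffle=True,random_state=0)
scores = cross_validate(model, data_set, target_set,n_jobs = 1,
                        cv=skf, return_train_score=True,scoring=scoring)
print(scores['test_accuracy'])
print(scores['test_kappa'])
print(scores['test_balanced_accuracy'])
# refit on the full data set for rule extraction
model.fit(data_set, target_set)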
[0.735 0.765 0.795 0.765 0.815]
[0.46356275 0.49570815 0.56008584 0.5        0.59606987]
[0.77261905 0.775      0.81071429 0.7797619  0.825     ]

BalancedRandomForestClassifier(bootstrap=True, class_weight='balanced',
                criterion='gini', max_depth=None, max_features='log2',
                max_leaf_nodes=None, min_impurity_decrease=0.0,
                min_samples_leaf=2, min_samples_split=2,
                min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=1,
                oob_score=False, random_state=43, replacement=False,
                sampling_strategy='all', verbose=0, warm_start=False)
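The three arrays printed above appear to follow the order of the scoring dictionary: accuracy, Cohen's kappa, and balanced accuracy for each of the five folds. To show why the latter two are worth reporting next to accuracy, here is a small pure-Python sketch of the three metrics. The toy labels are made up for illustration, and the helper functions are hand-written, not scikit-learn's implementations.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose prediction equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recall."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def cohen_kappa(y_true, y_pred):
    """(p_o - p_e) / (1 - p_e), where p_e is the agreement expected by chance."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Toy labels: 7 samples of class 0 and 3 of class 1 (made up for illustration).
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
always_zero = [0] * 10  # a trivial classifier that ignores its input

print(accuracy(y_true, always_zero))           # 0.7
print(balanced_accuracy(y_true, always_zero))  # 0.5
print(cohen_kappa(y_true, always_zero))        # 0.0
```

A classifier that always answers 0 still reaches 70% accuracy on these labels, while balanced accuracy and kappa expose it as no better than chance.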

Extracting simple rules from Random Forests

splitter = DefragModel.parseSLtrees(model) # parse sklearn tree ensembles into the array of (feature index, threshold)
mdl = DefragModel(modeltype='classification', maxitr=100, qitr=0, tol=1e-6, restart=20, verbose=0)
mdl.fit(data_set, target_set, splitter, 10, fittype='FAB') # 10 is the maximum number of rules

# results (evaluated on the same data used for fitting, despite the "Test" labels)
score, cover, coll = mdl.evaluate(data_set, target_set)
print('<< defragTrees >>')
print('----- Evaluated Results -----')
print('Test Error = %f' % (score,))
print('Test Coverage = %f' % (cover,))
print('Overlap = %f' % (coll,))
print('----- Found Rules -----')
print(mdl)
[Seed   0] TrainingError = 0.30, K = 2
[Seed   1] TrainingError = 0.30, K = 3
[Seed   2] TrainingError = 0.30, K = 2
[Seed   3] TrainingError = 0.30, K = 3
[Seed   4] TrainingError = 0.30, K = 2
[Seed   5] TrainingError = 0.30, K = 2
[Seed   6] TrainingError = 0.30, K = 2
[Seed   7] TrainingError = 0.30, K = 2
[Seed   8] TrainingError = 0.30, K = 3
[Seed   9] TrainingError = 0.30, K = 2
[Seed  10] TrainingError = 0.30, K = 3
[Seed  11] TrainingError = 0.30, K = 2
[Seed  12] TrainingError = 0.30, K = 1
[Seed  13] TrainingError = 0.30, K = 2
[Seed  14] TrainingError = 0.30, K = 1
[Seed  15] TrainingError = 0.30, K = 2
[Seed  16] TrainingError = 0.30, K = 2
[Seed  17] TrainingError = 0.30, K = 2
[Seed  18] TrainingError = 0.30, K = 2
[Seed  19] TrainingError = 0.30, K = 3
Optimal Model >> Seed   0, TrainingError = 0.30, K = 2

<< defragTrees >>
----- Evaluated Results -----
Test Error = 0.300000
Test Coverage = 1.000000
Overlap = 0.942000

----- Found Rules -----
[Rule  1]
y = 0 when
	 x_31 < 14.781519

[Rule  2]
y = 0 when
	 x_13 >= -13.205372
	 x_28 < 13.061854
	 x_96 < 12.163099

y = 0
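The comment on parseSLtrees says that the ensemble is parsed into an array of (feature index, threshold) pairs. To give a feel for what that means, here is a minimal sketch on a hand-written toy tree. The nested-dict layout and the function collect_splits are hypothetical and only illustrate the idea; they are not how scikit-learn or defragTrees store trees internally.

```python
def collect_splits(node, out=None):
    """Walk a toy tree and gather every (feature index, threshold) pair."""
    if out is None:
        out = []
    if "leaf" in node:  # leaves carry a prediction and have no split
        return out
    out.append((node["feature"], node["threshold"]))
    collect_splits(node["left"], out)
    collect_splits(node["right"], out)
    return out


# A made-up tree with two internal nodes and three leaves.
toy_tree = {
    "feature": 30, "threshold": 14.78,
    "left": {"leaf": 0},
    "right": {
        "feature": 12, "threshold": -13.21,
        "left": {"leaf": 1},
        "right": {"leaf": 0},
    },
}

# An "ensemble" of two identical trees; duplicates collapse in the set.
forest = [toy_tree, toy_tree]
splits = sorted({s for tree in forest for s in collect_splits(tree)})
print(splits)  # [(12, -13.21), (30, 14.78)]
```

The candidate conditions that appear in the extracted rules (such as x_31 < 14.781519) come from a pool of split points like this one.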



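  • Rule 1
    • y = 0 when
      • x_31 < 14.781519
    • Interpretation: when the 31st feature (x_31) is less than 14.781519, the class label (y) is classified as 0
  • Rule 2
    • y = 0 when
      • x_13 >= -13.205372
      • x_28 < 13.061854
      • x_96 < 12.163099
    • Interpretation: when the 13th feature (x_13) is at least -13.205372, the 28th feature (x_28) is less than 13.061854, and the 96th feature (x_96) is less than 12.163099, the class label (y) is classified as 0
  • Otherwise, the class label is also 0, so every rule, including the default, predicts class 0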
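To make the reading of these rules concrete, the sketch below encodes Rule 1 and Rule 2 by hand and applies them to a few made-up samples. Two things are assumptions on my part: that the first matching rule decides the prediction, and that "Coverage" and "Overlap" mean the share of samples matched by at least one rule and by two or more rules, respectively. defragTrees itself may resolve overlapping rules and compute these figures differently.

```python
import operator

OPS = {"<": operator.lt, ">=": operator.ge}

# Rules copied from the Random Forests output. Feature numbers are used as
# dictionary keys, so no assumption about 0- or 1-based indexing is needed.
RULES = [
    {"pred": 0, "conds": [(31, "<", 14.781519)]},
    {"pred": 0, "conds": [(13, ">=", -13.205372),
                          (28, "<", 13.061854),
                          (96, "<", 12.163099)]},
]
DEFAULT_PRED = 0  # the trailing "y = 0" line


def matches(rule, x):
    """True if sample x (dict: feature number -> value) meets every condition."""
    return all(OPS[op](x[f], thr) for f, op, thr in rule["conds"])


def predict(x):
    """Assumption: the first matching rule wins; otherwise use the default."""
    for rule in RULES:
        if matches(rule, x):
            return rule["pred"]
    return DEFAULT_PRED


def coverage_and_overlap(samples):
    """Assumed definitions: share matched by >= 1 rule and by >= 2 rules."""
    hits = [sum(matches(r, x) for r in RULES) for x in samples]
    n = len(samples)
    return sum(h >= 1 for h in hits) / n, sum(h >= 2 for h in hits) / n


# Made-up samples that only contain the features the rules look at.
samples = [
    {31: 0.0, 13: 0.0, 28: 0.0, 96: 0.0},     # satisfies both rules
    {31: 20.0, 13: 0.0, 28: 0.0, 96: 0.0},    # only Rule 2
    {31: 20.0, 13: -20.0, 28: 0.0, 96: 0.0},  # no rule, falls to the default
    {31: 0.0, 13: -20.0, 28: 0.0, 96: 0.0},   # only Rule 1
]
print([predict(x) for x in samples])  # [0, 0, 0, 0]
print(coverage_and_overlap(samples))  # (0.75, 0.25)
```

Under these assumed definitions, an overlap as high as the 0.942 reported above would mean that most samples satisfy both rules at once.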



XGBoost

model = xgb.XGBClassifier(max_depth = 50,
                           learning_rate = 0.16,
                           min_child_weight = 1,
                           n_estimators = 200)
scoring = {'accuracy': make_scorer(accuracy_score),
           'kappa': make_scorer(cohen_kappa_score),
           'balanced_accuracy': make_scorer(balanced_accuracy_score) }
skf = StratifiedKFold(n_splits=5,shuffle=True,random_state=0)
scores = cross_validate(model, data_set, target_set,n_jobs = 1,
                        cv=skf, return_train_score=True,scoring=scoring)
print(scores['test_accuracy'])
print(scores['test_kappa'])
print(scores['test_balanced_accuracy'])
# train a Booster with the native API so that its trees can be dumped as text
num_round = 50
dtrain = xgb.DMatrix(data_set, label=target_set)
# 'n_estimators' belongs to the scikit-learn wrapper; xgb.train takes the number of rounds from num_round
param = {'max_depth':50, 'learning_rate':0.16, 'min_child_weight':1, 'n_estimators':200}
bst = xgb.train(param, dtrain, num_round)

# output xgb model as text
bst.dump_model('xgbmodel.txt')

[0.8   0.825 0.88  0.8   0.86 ]
[0.46236559 0.54188482 0.67741935 0.44444444 0.62365591]
[0.7047619  0.74642857 0.8        0.69047619 0.77619048]
Extracting simple rules from XGBoost

splitter = DefragModel.parseXGBtrees('xgbmodel.txt') # parse the XGBoost text dump into the array of (feature index, threshold)
mdl = DefragModel(modeltype='classification', maxitr=100, qitr=0, tol=1e-6, restart=20, verbose=0)
mdl.fit(data_set, target_set, splitter, 10, fittype='FAB') # 10 is the maximum number of rules

# results (evaluated on the same data used for fitting, despite the "Test" labels)
score, cover, coll = mdl.evaluate(data_set, target_set)
print('<< defragTrees >>')
print('----- Evaluated Results -----')
print('Test Error = %f' % (score,))
print('Test Coverage = %f' % (cover,))
print('Overlap = %f' % (coll,))
print('----- Found Rules -----')
print(mdl)
[Seed   0] TrainingError = 0.30, K = 3
[Seed   1] TrainingError = 0.30, K = 3
[Seed   2] TrainingError = 0.30, K = 2
[Seed   3] TrainingError = 0.30, K = 4
[Seed   4] TrainingError = 0.30, K = 2
[Seed   5] TrainingError = 0.30, K = 3
[Seed   6] TrainingError = 0.30, K = 3
[Seed   7] TrainingError = 0.30, K = 1
[Seed   8] TrainingError = 0.30, K = 3
[Seed   9] TrainingError = 0.30, K = 3
[Seed  10] TrainingError = 0.30, K = 2
[Seed  11] TrainingError = 0.30, K = 3
[Seed  12] TrainingError = 0.30, K = 3
[Seed  13] TrainingError = 0.30, K = 3
[Seed  14] TrainingError = 0.30, K = 4
[Seed  15] TrainingError = 0.30, K = 2
[Seed  16] TrainingError = 0.30, K = 2
[Seed  17] TrainingError = 0.30, K = 3
[Seed  18] TrainingError = 0.30, K = 3
[Seed  19] TrainingError = 0.30, K = 4
Optimal Model >> Seed   0, TrainingError = 0.30, K = 3

<< defragTrees >>
----- Evaluated Results -----
Test Error = 0.300000
Test Coverage = 1.000000
Overlap = 0.122000

----- Found Rules -----
[Rule  1]
y = 0 when
	 x_1 < 0.389858
	 x_3 >= -1.891694
	 x_4 < 13.444655
	 x_10 >= -15.387629
	 x_12 >= -11.851440
	 x_14 >= -16.041128
	 x_18 >= -14.103243
	 x_45 < 13.571693
	 x_68 >= -11.976764
	 x_89 < 16.081133

[Rule  2]
y = 0 when
	 x_1 < 0.389858
	 x_3 < 0.876493
	 x_9 < 2.240515
	 x_20 < 12.560732
	 x_26 < 15.807751
	 x_41 < 14.043766
	 x_78 >= -15.805157
	 x_81 < 14.394110

[Rule  3]
y = 0 when
	 x_1 >= -0.301440

y = 0



  • Rule 3
    • y = 0 when x_1 >= -0.301440
      • When feature 1 (x_1) is greater than or equal to -0.301440, the class label is classified as 0
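The XGBoost rules above were obtained from a text dump of the model ('xgbmodel.txt') rather than from the model object. As a rough illustration of what such a dump contains, the sketch below pulls the (feature index, threshold) pairs out of a hand-written fragment with a regular expression. The fragment is made up, the function parse_xgb_dump is hypothetical, and this is not the parser that defragTrees uses. It only assumes that split lines look like "0:[f1<0.389858] yes=1,no=2,missing=1".

```python
import re

# Matches the "[f<index><<threshold>]" part of a split line in the text dump.
SPLIT_RE = re.compile(r"\[f(\d+)<([^\]]+)\]")


def parse_xgb_dump(text):
    """Return the sorted, de-duplicated (feature index, threshold) pairs."""
    pairs = set()
    for line in text.splitlines():
        m = SPLIT_RE.search(line)
        if m:  # leaf lines and "booster[0]:" headers do not match
            pairs.add((int(m.group(1)), float(m.group(2))))
    return sorted(pairs)


# Hand-written dump fragment for illustration (not taken from the real model).
sample_dump = """booster[0]:
0:[f0<0.389858] yes=1,no=2,missing=1
    1:[f2<-1.891694] yes=3,no=4,missing=3
        3:leaf=0.1
        4:leaf=-0.2
    2:leaf=0.3
booster[1]:
0:[f0<0.389858] yes=1,no=2,missing=1
    1:leaf=0.05
    2:leaf=-0.05
"""
print(parse_xgb_dump(sample_dump))  # [(0, 0.389858), (2, -1.891694)]
```

Opening the real xgbmodel.txt in a text editor is a quick way to check whether the thresholds in the extracted rules actually come from splits in the boosted trees.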


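I built complex decision-tree-based machine learning models and was able to extract rules of the form "if the feature is ***, the class label is ***". How useful these rules are is still unclear, but I was at least able to get defragTrees running.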


