
Selecting features with Boruta

Posted at 2019-10-28

These are my notes on how to use Boruta, a feature selection tool.

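What is Boruta?

Boruta is a feature selection tool. It adds a randomly shuffled copy of each feature (a "shadow feature") to the data and keeps only the features whose importance is consistently higher than that of the shuffled copies.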
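The idea of comparing each feature against randomly shuffled copies of the features ("shadow features") can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions, not how boruta_py is implemented: as a stand-in for tree-based feature importance it uses the absolute correlation with the target, and the function name `shadow_feature_screen`, its parameters, and the synthetic data are all made up for this example.

```python
import numpy as np


def shadow_feature_screen(X, y, n_rounds=20, seed=0):
    """Count how often each real feature beats the best shadow feature.

    Importance here is |Pearson correlation| with y, a stand-in chosen only
    to keep the sketch dependency-free. Boruta itself uses tree-ensemble
    importances and a statistical test on the hit counts.
    """
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    hits = np.zeros(n_features, dtype=int)

    def importance(A):
        # |corr(column, y)| for every column of A
        A_c = A - A.mean(axis=0)
        y_c = y - y.mean()
        denom = np.sqrt((A_c ** 2).sum(axis=0) * (y_c ** 2).sum())
        return np.abs(A_c.T @ y_c) / denom

    real_imp = importance(X)
    for _ in range(n_rounds):
        # Shadow features: every column shuffled independently, which breaks
        # any relationship with y while keeping the column's distribution.
        shadows = np.column_stack(
            [rng.permutation(X[:, j]) for j in range(n_features)]
        )
        shadow_max = importance(shadows).max()
        # A "hit" means the real feature outscored the best shadow feature.
        hits += real_imp > shadow_max
    return hits


# Synthetic data: only columns 0 and 1 actually influence y.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=500)

hits = shadow_feature_screen(X_demo, y_demo)
print(hits)
```

The two informative columns should win in every round, while the pure-noise columns win only by chance. Boruta turns this kind of hit count into an accept/reject decision per feature.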

This article explains the method in detail:
https://aotamasaki.hatenablog.com/entry/2019/01/05/195813

The official Boruta repository is here:
https://github.com/scikit-learn-contrib/boruta_py

Test environment

Windows 10
Python 3.6
Jupyter Notebook
boruta 0.3

Installation

Boruta can be installed with conda. It can apparently be installed with pip as well, but that did not work in my environment.

terminal

conda install -c conda-forge boruta_py
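Note that the conda package is named boruta_py, while the module imported in the code later in this post is `boruta`. A quick way to confirm that the installation worked is to check whether the module can be found. The helper name `is_installed` below is made up for this sketch. It uses only the standard library.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be found by the current interpreter."""
    return importlib.util.find_spec(module_name) is not None


# The import name is `boruta`, not `boruta_py`.
print('boruta importable:', is_installed('boruta'))
```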

Using Boruta

Let's try it with the Boston house-prices sample dataset.
In the version used here, Boruta_py cannot handle pandas.DataFrame, so always convert the data to numpy.array before passing it in.

python

import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from boruta import BorutaPy

# Load the data (note: load_boston was removed in scikit-learn 1.2, so this needs an older version)
boston = load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = pd.Series(boston.target, name='target')
print(X.shape, y.shape)
display(X.head())
display(y.head())

# Train on all features (the score below is R^2 on the training data itself)
rf1 = RandomForestRegressor(n_jobs=-1, max_depth=5)
rf1.fit(X, y)
print('SCORE with ALL Features: %1.2f\n' % rf1.score(X, y))

# Run Boruta with a RandomForestRegressor
rf = RandomForestRegressor(n_jobs=-1, max_depth=5)
feat_selector = BorutaPy(rf, n_estimators='auto', verbose=2, random_state=1)
feat_selector.fit(X.values, y.values)

# Check which features were selected (support_ is a boolean mask over the columns)
selected = feat_selector.support_
print('Number of selected features: %d' % np.sum(selected))
print(selected)
print(X.columns[selected])

# Train on the selected features only
X_selected = X[X.columns[selected]]
rf2 = RandomForestRegressor(n_jobs=-1, max_depth=5)
rf2.fit(X_selected, y)
print('SCORE with selected Features: %1.2f' % rf2.score(X_selected, y))

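The scores came out as follows. There is not much of a difference. Note that both are R^2 scores computed on the training data, so they say little about generalization.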

output

SCORE with ALL Features: 0.93
SCORE with selected Features: 0.94

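Retrying with a different dataset

It was frustrating to see no difference, so I tried again with a different dataset. The data used in the article linked above does show a clear difference, so I retried based on its code.

Reference code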
https://github.com/masakiaota/blog/blob/master/boruta/Madalon_Data_Set.ipynb

python

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from boruta import BorutaPy
from multiprocessing import cpu_count

# Load the data
data_url='https://archive.ics.uci.edu/ml/machine-learning-databases/madelon/MADELON/madelon_train.data'
label_url='https://archive.ics.uci.edu/ml/machine-learning-databases/madelon/MADELON/madelon_train.labels'
X_data = pd.read_csv(data_url, sep=" ", header=None)
y_data = pd.read_csv(label_url, sep=" ", header=None)
# Keep the 500 feature columns; copy so that adding a column does not write into a slice
data = X_data.iloc[:, 0:500].copy()
data['target'] = y_data[0]

y=data['target']
X=data.drop(columns='target')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)


# Train a RandomForest on all features
rf1=RandomForestClassifier(n_estimators=500,
                           random_state=42,
                           n_jobs=int(cpu_count()/2))
rf1.fit(X_train.values, y_train.values)

y_test_pred = rf1.predict(X_test.values)
print(confusion_matrix(y_test.values, y_test_pred, labels=rf1.classes_), '\n')
print('ACCURACY with ALL Features: %1.2f\n' % accuracy_score(y_test, y_test_pred))

# Run Boruta
rf = RandomForestClassifier(n_jobs=int(cpu_count()/2), max_depth=7)
feat_selector = BorutaPy(rf, n_estimators='auto', two_step=False, verbose=2, random_state=42)
feat_selector.fit(X_train.values, y_train.values)
print(X_train.columns[feat_selector.support_])

# Extract the selected features
X_train_selected = X_train.iloc[:,feat_selector.support_]
X_test_selected = X_test.iloc[:,feat_selector.support_]
display(X_test_selected.head())


# Train on the selected features only
rf2=RandomForestClassifier(n_estimators=500,
                           random_state=42,
                           n_jobs=int(cpu_count()/2))
rf2.fit(X_train_selected.values, y_train.values)

y_test_pred2 = rf2.predict(X_test_selected.values)
print(confusion_matrix(y_test.values, y_test_pred2, labels=rf2.classes_), '\n')
print('ACCURACY with selected Features: %1.2f\n' % accuracy_score(y_test, y_test_pred2))

This time the model trained after feature selection was clearly better.

output

[[179  70]
 [ 96 155]] 

ACCURACY with ALL Features: 0.67


[[219  30]
 [ 29 222]] 

ACCURACY with selected Features: 0.88
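As a sanity check, the reported accuracies can be recomputed from the confusion matrices above: accuracy is the sum of the diagonal divided by the total number of test samples. The helper name `accuracy_from_confusion` is made up for this sketch. The matrices are copied from the output.

```python
def accuracy_from_confusion(matrix):
    """Accuracy = correctly classified samples (the diagonal) / all samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Confusion matrices copied from the output above
cm_all = [[179, 70], [96, 155]]
cm_selected = [[219, 30], [29, 222]]

acc_all = accuracy_from_confusion(cm_all)            # 334 / 500
acc_selected = accuracy_from_confusion(cm_selected)  # 441 / 500

print('ACCURACY with ALL Features: %1.2f' % acc_all)            # 0.67
print('ACCURACY with selected Features: %1.2f' % acc_selected)  # 0.88
```

In other words, the number of misclassified test samples dropped from 166 to 59 out of 500.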

Give it a try!
