
[Machine Learning] Feature Selection with RFE

Posted at 2018-03-19

This post summarizes how to select features using RFE.

What is RFE?

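RFE stands for Recursive Feature Elimination.
Given an external estimator that assigns weights to features (for example a random forest, gradient boosting, or a linear model), RFE fits the estimator on the current feature set, identifies the least important features, removes them, and repeats until the specified number of features remains.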
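The elimination loop described above can be sketched by hand. This is only a minimal illustration, not scikit-learn's actual implementation: `manual_rfe` is a hypothetical helper name, and a decision tree is assumed as the external estimator simply because it exposes `feature_importances_` and trains quickly.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier


def manual_rfe(estimator, X, y, n_features_to_select, step=1):
    """Hypothetical helper: drop the least important features until
    n_features_to_select remain. Returns a boolean mask over the columns."""
    support = np.ones(X.shape[1], dtype=bool)
    while support.sum() > n_features_to_select:
        remaining = np.flatnonzero(support)
        # Refit a fresh copy of the estimator on the surviving features only
        model = clone(estimator).fit(X[:, remaining], y)
        importances = model.feature_importances_
        # Never remove more features than needed to reach the target count
        n_drop = min(step, support.sum() - n_features_to_select)
        weakest = remaining[np.argsort(importances)[:n_drop]]
        support[weakest] = False
    return support


X, y = load_breast_cancer(return_X_y=True)
mask = manual_rfe(DecisionTreeClassifier(random_state=0), X, y, n_features_to_select=20)
print(mask.sum())  # number of features that survived
```

The key point is that the estimator is refitted after every elimination round, so importances are always measured relative to the features that are still in play.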

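Suppose the data has 20 features. If you pass it to a model as is, the model also has to take the low-importance features among those 20 into account during training.
(For example, if data for predicting the weather contained a feature such as "stock prices over the past week", that feature would clearly be irrelevant to the result.)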

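With RFE, if you specify a target number of features, say 10, the features are ranked by importance through repeated elimination, and only the 10 that survive are passed on for training.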
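To make the idea concrete, here is a small sketch on synthetic data. Everything in it is an assumption made for illustration: the data is randomly generated, only the first three columns drive the target, and `LinearRegression` is used as the external estimator so that RFE ranks features by their coefficients.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only columns 0, 1 and 2 influence the target; the other 7 are pure noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

rfe = RFE(estimator=LinearRegression(), n_features_to_select=3, step=1)
rfe.fit(X, y)

# Indices of the columns that RFE kept
selected = np.flatnonzero(rfe.support_)
print(selected)
print(rfe.ranking_)
```

With a signal this strong, the three informative columns should be the ones that survive, while the noise columns (the "stock prices" of this toy example) are eliminated one by one.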

RFE in practice

We try RFE on the breast_cancer dataset bundled with scikit-learn (sklearn).
The breast_cancer data has 30 features. We reduce them to 20.


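import pandas as pd
from IPython.display import display  # built into Jupyter; imported so the code also runs as a script
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.ensemble import GradientBoostingRegressor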

# Load the sample data
dataset = load_breast_cancer()
X = pd.DataFrame(dataset.data, columns=dataset.feature_names)
y = pd.DataFrame(dataset.target, columns=['target'])

# Shape of the original data (30 features)
print(X.shape)
### Output ###
(569, 30)

# Feature names in the original data
display(X.columns)
### Output ###
Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error', 'fractal dimension error',
       'worst radius', 'worst texture', 'worst perimeter', 'worst area',
       'worst smoothness', 'worst compactness', 'worst concavity',
       'worst concave points', 'worst symmetry', 'worst fractal dimension'],
      dtype='object')


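# Create the RFE model
# Gradient boosting is the external estimator; the target number of features is 20
# step=0.5 removes up to half of the original features (15) per iteration, capped so that 20 remain
rfe = RFE(estimator=GradientBoostingRegressor(random_state=0), n_features_to_select=20, step=0.5)

# Run the feature elimination
# (DataFrame.as_matrix() was removed in pandas 1.0; use .values instead)
rfe.fit(X, y.values.ravel())

# Rebuild a DataFrame from the reduced data
rfeData = pd.DataFrame(rfe.transform(X), columns=X.columns.values[rfe.support_])

# After elimination, the number of features is 20
display(rfeData.shape)
### Output ###
(569, 20)

# 10 features have been eliminated
display(rfeData.columns)
### Output ###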
Index(['mean texture', 'mean perimeter', 'mean concave points',
       'mean symmetry', 'mean fractal dimension', 'texture error',
       'perimeter error', 'area error', 'smoothness error',
       'compactness error', 'concavity error', 'symmetry error',
       'fractal dimension error', 'worst perimeter', 'worst area',
       'worst smoothness', 'worst compactness', 'worst concavity',
       'worst concave points', 'worst symmetry'],
      dtype='object')

Inside RFE

The parameters and attributes of RFE.

Parameters


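estimator	The external estimator to use: a supervised learning model. Feature importance is judged based on this model. Required.
n_features_to_select	The number of features to select. If not specified, half of the features are selected.
step	How fast features are eliminated. An integer of 1 or more is the number of features removed per iteration; a float between 0 and 1 is the fraction of the original features removed per iteration.
verbose	Controls the verbosity of the output.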
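The effect of step can be seen by comparing a few values on the same data. The sketch below is an illustration under assumptions: it swaps in a decision tree as the external estimator purely to keep the run short, and `max_rank` is a name introduced here. The final selection always has 20 features; what changes is how many elimination rounds it takes, which shows up as the largest value in `ranking_`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # 30 features

max_rank = {}
for step in (1, 3, 0.5):
    rfe = RFE(
        estimator=DecisionTreeClassifier(random_state=0),
        n_features_to_select=20,
        step=step,
    )
    rfe.fit(X, y)
    # ranking_ is 1 for kept features; features dropped earlier get larger ranks
    max_rank[step] = int(rfe.ranking_.max())
    print(step, rfe.n_features_, max_rank[step])
```

With step=1 the 10 surplus features are removed one per round, so the ranks run from 1 to 11. With step=0.5 all 10 are removed in a single round, so only ranks 1 and 2 appear. A small step gives a finer ranking at the cost of more model fits.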

parameter

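# Default parameters of an RFE model
# (the estimator class is passed here only to display the defaults; fitting requires an instance such as GradientBoostingRegressor())
# A separate variable name keeps the fitted rfe from the practice section available for the attribute examples
rfe_default = RFE(estimator=GradientBoostingRegressor)
display(rfe_default)

### Output ###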
RFE(estimator=<class 'sklearn.ensemble.gradient_boosting.GradientBoostingRegressor'>,
  n_features_to_select=None, step=1, verbose=0)

Attributes


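n_features_	The number of selected features.
support_	A boolean mask over the features: True for selected, False for not selected.
ranking_	The feature ranking. Selected features have rank 1.
estimator_	The external estimator, fitted on the selected features.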
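These attributes are easier to read once they are tied back to the column names. The following is a minimal sketch, again assuming a decision tree as a fast stand-in for the external estimator; `selected_names` and `ranking` are names introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

dataset = load_breast_cancer()
X = pd.DataFrame(dataset.data, columns=dataset.feature_names)
y = dataset.target

rfe = RFE(estimator=DecisionTreeClassifier(random_state=0), n_features_to_select=20)
rfe.fit(X, y)

# support_ is a boolean mask, so it can index the column labels directly
selected_names = X.columns[rfe.support_]

# Pair every feature with its rank; rank 1 means "selected"
ranking = pd.Series(rfe.ranking_, index=X.columns).sort_values()

print(rfe.n_features_)
print(list(selected_names))
print(ranking.tail(5))  # the features that were eliminated first
```

Sorting the ranking makes it easy to see not only which features were kept but also the order in which the others were dropped.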

attribute

print("n_features_----------------------")
display(rfe.n_features_)
print("support_-------------------------")
display(rfe.support_)
print("ranking_-------------------------")
display(rfe.ranking_)
print("estimator_-----------------------")
display(rfe.estimator_)

### Output ###
n_features_----------------------
20
support_-------------------------
array([False,  True,  True, False, False, False, False,  True,  True,
        True, False,  True,  True,  True,  True,  True,  True, False,
        True,  True, False, False,  True,  True,  True,  True,  True,
        True,  True, False], dtype=bool)
ranking_-------------------------
array([2, 1, 1, 2, 2, 2, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 2, 1,
       1, 1, 1, 1, 1, 1, 2])
estimator_-----------------------
GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=100, presort='auto', random_state=0,
             subsample=1.0, verbose=0, warm_start=False)

That's all.
