Introduction
There are many feature selection methods, and trying them all one by one is tedious.
After experimenting with several, I wanted a balanced way to pick out the features that every selection method agrees on.
Having tried it, I found that presenting the result as a ranked list is quite persuasive.
Once you have prepared x_train and y_train, the following may be worth a try.
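If you do not yet have x_train and y_train, here is a minimal sketch of one way to prepare them. The breast-cancer dataset bundled with scikit-learn is used purely as example data (an assumption; any labeled DataFrame works):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Example data only: 569 samples x 30 numeric features with a binary label
data = load_breast_cancer()
x = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Hold out 20% for evaluation; the selectors below only see x_train / y_train
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
```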
Sample code
Feature selection using Classifiers
(I plan to add the Regressor versions as well.)
feature_selection.py
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.feature_selection import RFE

# RFE with Logistic Regression
from sklearn.linear_model import LogisticRegression
rfe_selector = RFE(estimator=LogisticRegression(), n_features_to_select=100, step=200, verbose=5)
rfe_selector.fit(x_train, y_train)
rfe_support = rfe_selector.get_support()
rfe_feature = x_train.loc[:, rfe_support].columns.tolist()
# SelectFromModel with Logistic Regression (l1 penalty)
# Note: the l1 penalty requires the liblinear (or saga) solver
embeded_lr_selector = SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear"), threshold=0.1)
embeded_lr_selector.fit(x_train, y_train)
n_features = embeded_lr_selector.transform(x_train).shape[1]
# Raise the threshold until at most 100 features remain
while n_features > 100:
    embeded_lr_selector.threshold += 0.1
    x_transform = embeded_lr_selector.transform(x_train)
    n_features = x_transform.shape[1]
embeded_lr_support = embeded_lr_selector.get_support()
embeded_lr_feature = x_train.loc[:, embeded_lr_support].columns.tolist()
# SelectFromModel with Lasso (l1-regularized linear regression)
from sklearn.linear_model import Lasso
# Start from an explicit numeric threshold so the loop below can increment it
embeded_la_selector = SelectFromModel(Lasso(alpha=0.005), threshold=1e-5)
embeded_la_selector.fit(x_train, y_train)
n_features = embeded_la_selector.transform(x_train).shape[1]
# Raise the threshold until at most 100 features remain
while n_features > 100:
    embeded_la_selector.threshold += 0.1
    x_transform = embeded_la_selector.transform(x_train)
    n_features = x_transform.shape[1]
embeded_la_support = embeded_la_selector.get_support()
embeded_la_feature = x_train.loc[:, embeded_la_support].columns.tolist()
# Feature selection with RandomForest Classifier
from sklearn.ensemble import RandomForestClassifier
embeded_rf_selector = SelectFromModel(RandomForestClassifier(n_estimators=10), threshold=0.001)
embeded_rf_selector.fit(x_train, y_train)
n_features = embeded_rf_selector.transform(x_train).shape[1]
# Raise the threshold until at most 100 features remain
while n_features > 100:
    embeded_rf_selector.threshold += 0.001
    x_transform = embeded_rf_selector.transform(x_train)
    n_features = x_transform.shape[1]
embeded_rf_support = embeded_rf_selector.get_support()
embeded_rf_feature = x_train.loc[:, embeded_rf_support].columns.tolist()
# Feature selection with LGBM Classifier
from lightgbm import LGBMClassifier
lgbc = LGBMClassifier(n_estimators=30, learning_rate=0.05, num_leaves=32, colsample_bytree=0.2,
                      reg_alpha=3, reg_lambda=1, min_split_gain=0.01, min_child_weight=40)
embeded_lgb_selector = SelectFromModel(lgbc, threshold=0.001)
embeded_lgb_selector.fit(x_train, y_train)
n_features = embeded_lgb_selector.transform(x_train).shape[1]
# Raise the threshold until at most 100 features remain
while n_features > 100:
    embeded_lgb_selector.threshold += 0.005
    x_transform = embeded_lgb_selector.transform(x_train)
    n_features = x_transform.shape[1]
embeded_lgb_support = embeded_lgb_selector.get_support()
embeded_lgb_feature = x_train.loc[:, embeded_lgb_support].columns.tolist()
# ANOVA F-test (univariate selection)
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
skb = SelectKBest(f_classif, k=50).fit(x_train, y_train)
skb_f_classif_support = skb.get_support()
skb_f_classif_feature = x_train.loc[:,skb_f_classif_support].columns.tolist()
feature_name = x_train.columns.tolist()
# Put all selections together
feature_selection_df = pd.DataFrame({'Feature': feature_name,
                                     'ANOVA': skb_f_classif_support,
                                     'RFE': rfe_support,
                                     'Logistic': embeded_lr_support,
                                     'Lasso': embeded_la_support,
                                     'Random Forest': embeded_rf_support,
                                     'LightGBM': embeded_lgb_support})
# Count how many times each feature was selected (sum of the boolean columns only)
feature_selection_df['Total'] = feature_selection_df.drop(columns='Feature').sum(axis=1)
feature_selection_df = feature_selection_df.sort_values(['Total', 'Feature'], ascending=False)
feature_selection_df.index = range(1, len(feature_selection_df) + 1)
feature_selection_df.head(200)
Output
The result comes back as a DataFrame. The ranking format makes it easy to read.
ID | Feature | ANOVA | RFE | Logistic | Lasso | Random Forest | LightGBM | Total |
---|---|---|---|---|---|---|---|---|
1 | AAA | True | True | True | True | True | True | 6 |
2 | BBB | True | True | True | True | True | True | 6 |
3 | CCC | True | True | False | True | True | True | 5 |
4 | DDD | True | True | False | True | True | True | 5 |
5 | EEE | True | True | False | False | True | True | 4 |
You can also turn the result into a list and use it to build a new DataFrame for analysis.
extract.py
# Features selected by all six methods
selected_features = feature_selection_df.loc[feature_selection_df['Total'] == 6, 'Feature'].tolist()
feature_selected_df = x_train[selected_features]
# Selection by Lasso only
lasso_selected_df = x_train[embeded_la_feature]
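Until the promised Regressor version is added, here is a hedged sketch of what the same pipeline could look like for a regression target. The synthetic data from make_regression and the feature names f0..f29 are assumptions for illustration; the selectors are swapped for their regression counterparts (LinearRegression for RFE, Lasso, RandomForestRegressor, and f_regression for the univariate test):

```python
import pandas as pd
from sklearn.datasets import make_regression  # assumption: synthetic data for illustration
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.ensemble import RandomForestRegressor

# 200 samples, 30 features, 10 of which actually drive the target
x, y = make_regression(n_samples=200, n_features=30, n_informative=10, random_state=0)
x_train = pd.DataFrame(x, columns=[f"f{i}" for i in range(30)])
y_train = pd.Series(y)

# RFE with plain linear regression
rfe_support = RFE(LinearRegression(), n_features_to_select=10).fit(x_train, y_train).get_support()
# SelectFromModel with Lasso (l1 shrinks irrelevant coefficients toward zero)
la_support = SelectFromModel(Lasso(alpha=0.005)).fit(x_train, y_train).get_support()
# SelectFromModel with RandomForestRegressor feature importances
rf_support = SelectFromModel(RandomForestRegressor(n_estimators=50, random_state=0)).fit(x_train, y_train).get_support()
# Univariate F-test for regression targets
kb_support = SelectKBest(f_regression, k=10).fit(x_train, y_train).get_support()

# Same ranking table as in the Classifier version
df = pd.DataFrame({'Feature': x_train.columns, 'RFE': rfe_support,
                   'Lasso': la_support, 'Random Forest': rf_support, 'F-test': kb_support})
df['Total'] = df[['RFE', 'Lasso', 'Random Forest', 'F-test']].sum(axis=1)
df = df.sort_values(['Total', 'Feature'], ascending=False)
print(df.head())
```

The same while-loop threshold trick from the Classifier version applies unchanged if you need to cap the number of surviving features.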
Closing remarks
If you can show that the chosen features perform in a balanced way across every model, that becomes a solid justification for the final model selection.
The key point is to avoid ending up with features that were cherry-picked to favor any single method.
I hope this saves you, even a little, from getting stuck in an endless loop of feature selection.