Introduction
There are many feature selection methods, and trying them all one by one is tedious.
After experimenting with several, I wanted a balanced way to pick out the features that every selection method agrees on.
Having tried it, I found that presenting the result as a ranked list is quite persuasive.
Once you have prepared x_train and y_train, the following may be worth a try.
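If you do not yet have x_train and y_train, here is a minimal sketch of one way to prepare them. The breast-cancer dataset bundled with scikit-learn is used purely as example data (an assumption; any labeled DataFrame works):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Example data only: 569 samples x 30 numeric features with a binary label
data = load_breast_cancer()
x = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Hold out 20% for evaluation; the selectors below only see x_train / y_train
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
```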
Sample code
Feature selection using Classifiers
(I plan to add the Regressor versions as well.)
feature_selection.py
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.feature_selection import RFE

# RFE with Logistic Regression
from sklearn.linear_model import LogisticRegression
rfe_selector = RFE(estimator=LogisticRegression(), n_features_to_select=100, step=200, verbose=5)
rfe_selector.fit(x_train, y_train)
rfe_support = rfe_selector.get_support()
rfe_feature = x_train.loc[:, rfe_support].columns.tolist()
# SelectFromModel with Logistic Regression (l1 penalty)
# Note: the l1 penalty requires the liblinear (or saga) solver
embeded_lr_selector = SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear"), threshold=0.1)
embeded_lr_selector.fit(x_train, y_train)
n_features = embeded_lr_selector.transform(x_train).shape[1]
# Raise the threshold until at most 100 features remain
while n_features > 100:
    embeded_lr_selector.threshold += 0.1
    x_transform = embeded_lr_selector.transform(x_train)
    n_features = x_transform.shape[1]
embeded_lr_support = embeded_lr_selector.get_support()
embeded_lr_feature = x_train.loc[:, embeded_lr_support].columns.tolist()
# SelectFromModel with Lasso (l1-regularized linear regression)
from sklearn.linear_model import Lasso
# Start from an explicit numeric threshold so the loop below can increment it
embeded_la_selector = SelectFromModel(Lasso(alpha=0.005), threshold=1e-5)
embeded_la_selector.fit(x_train, y_train)
n_features = embeded_la_selector.transform(x_train).shape[1]
# Raise the threshold until at most 100 features remain
while n_features > 100:
    embeded_la_selector.threshold += 0.1
    x_transform = embeded_la_selector.transform(x_train)
    n_features = x_transform.shape[1]
embeded_la_support = embeded_la_selector.get_support()
embeded_la_feature = x_train.loc[:, embeded_la_support].columns.tolist()
# Feature selection with RandomForest Classifier
from sklearn.ensemble import RandomForestClassifier
embeded_rf_selector = SelectFromModel(RandomForestClassifier(n_estimators=10), threshold=0.001)
embeded_rf_selector.fit(x_train, y_train)
n_features = embeded_rf_selector.transform(x_train).shape[1]
# Raise the threshold until at most 100 features remain
while n_features > 100:
    embeded_rf_selector.threshold += 0.001
    x_transform = embeded_rf_selector.transform(x_train)
    n_features = x_transform.shape[1]
embeded_rf_support = embeded_rf_selector.get_support()
embeded_rf_feature = x_train.loc[:, embeded_rf_support].columns.tolist()
# Feature selection with LGBM Classifier
from lightgbm import LGBMClassifier
lgbc = LGBMClassifier(n_estimators=30, learning_rate=0.05, num_leaves=32, colsample_bytree=0.2,
                      reg_alpha=3, reg_lambda=1, min_split_gain=0.01, min_child_weight=40)
embeded_lgb_selector = SelectFromModel(lgbc, threshold=0.001)
embeded_lgb_selector.fit(x_train, y_train)
n_features = embeded_lgb_selector.transform(x_train).shape[1]
# Raise the threshold until at most 100 features remain
while n_features > 100:
    embeded_lgb_selector.threshold += 0.005
    x_transform = embeded_lgb_selector.transform(x_train)
    n_features = x_transform.shape[1]
embeded_lgb_support = embeded_lgb_selector.get_support()
embeded_lgb_feature = x_train.loc[:, embeded_lgb_support].columns.tolist()
# ANOVA F-test (univariate selection)
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
skb = SelectKBest(f_classif, k=50).fit(x_train, y_train)
skb_f_classif_support = skb.get_support()
skb_f_classif_feature = x_train.loc[:,skb_f_classif_support].columns.tolist()
feature_name = x_train.columns.tolist()
# Put all selections together
feature_selection_df = pd.DataFrame({'Feature': feature_name,
                                     'ANOVA': skb_f_classif_support,
                                     'RFE': rfe_support,
                                     'Logistic': embeded_lr_support,
                                     'Lasso': embeded_la_support,
                                     'Random Forest': embeded_rf_support,
                                     'LightGBM': embeded_lgb_support})
# Count how many times each feature was selected (sum of the boolean columns only)
feature_selection_df['Total'] = feature_selection_df.drop(columns='Feature').sum(axis=1)
feature_selection_df = feature_selection_df.sort_values(['Total', 'Feature'], ascending=False)
feature_selection_df.index = range(1, len(feature_selection_df) + 1)
feature_selection_df.head(200)
Output
The result comes back as a DataFrame. The ranking format makes it easy to read.
ID | Feature | ANOVA | RFE | Logistic | Lasso | Random Forest | LightGBM | Total |
---|---|---|---|---|---|---|---|---|
1 | AAA | True | True | True | True | True | True | 6 |
2 | BBB | True | True | True | True | True | True | 6 |
3 | CCC | True | True | False | True | True | True | 5 |
4 | DDD | True | True | False | True | True | True | 5 |
5 | EEE | True | True | False | False | True | True | 4 |
You can also turn the result into a list and use it to build a new DataFrame for analysis.
extract.py
# Features selected by all six methods
selected_features = feature_selection_df.loc[feature_selection_df['Total'] == 6, 'Feature'].tolist()
feature_selected_df = x_train[selected_features]
# Selection by Lasso only
lasso_selected_df = x_train[embeded_la_feature]
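Until the promised Regressor version is added, here is a hedged sketch of what the same pipeline could look like for a regression target. The synthetic data from make_regression and the feature names f0..f29 are assumptions for illustration; the selectors are swapped for their regression counterparts (LinearRegression for RFE, Lasso, RandomForestRegressor, and f_regression for the univariate test):

```python
import pandas as pd
from sklearn.datasets import make_regression  # assumption: synthetic data for illustration
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.ensemble import RandomForestRegressor

# 200 samples, 30 features, 10 of which actually drive the target
x, y = make_regression(n_samples=200, n_features=30, n_informative=10, random_state=0)
x_train = pd.DataFrame(x, columns=[f"f{i}" for i in range(30)])
y_train = pd.Series(y)

# RFE with plain linear regression
rfe_support = RFE(LinearRegression(), n_features_to_select=10).fit(x_train, y_train).get_support()
# SelectFromModel with Lasso (l1 shrinks irrelevant coefficients toward zero)
la_support = SelectFromModel(Lasso(alpha=0.005)).fit(x_train, y_train).get_support()
# SelectFromModel with RandomForestRegressor feature importances
rf_support = SelectFromModel(RandomForestRegressor(n_estimators=50, random_state=0)).fit(x_train, y_train).get_support()
# Univariate F-test for regression targets
kb_support = SelectKBest(f_regression, k=10).fit(x_train, y_train).get_support()

# Same ranking table as in the Classifier version
df = pd.DataFrame({'Feature': x_train.columns, 'RFE': rfe_support,
                   'Lasso': la_support, 'Random Forest': rf_support, 'F-test': kb_support})
df['Total'] = df[['RFE', 'Lasso', 'Random Forest', 'F-test']].sum(axis=1)
df = df.sort_values(['Total', 'Feature'], ascending=False)
print(df.head())
```

The same while-loop threshold trick from the Classifier version applies unchanged if you need to cap the number of surviving features.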
Closing remarks
If you can show that the chosen features perform in a balanced way across every model, that becomes a solid justification for the final model selection.
The key point is to avoid ending up with features that were cherry-picked to favor any single method.
I hope this saves you, even a little, from getting stuck in an endless loop of feature selection.