
Kaggle:Feature Selection


Posted at 2020-02-21

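This is a continuation of Manual Feature Engineering Part1 and Part2. It looks like the notebook uses train_bureau_raw.csv and test_bureau_raw.csv, the files created in Manual Feature Engineering, as they are. They are included in the notebook's data anyway.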

# Bureau only features
bureau_features = list(set(bureau_columns) - set(previous_columns))

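set is Python's built-in for building a set, that is, the unique elements of a collection. Subtracting one set from another leaves the columns that appear only in bureau_columns. I covered this before.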
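As a reminder of how the set difference behaves, here is a minimal sketch. The column names are made up for illustration and are not the notebook's actual columns. Since a set has no guaranteed order, the result is sorted to make it reproducible.

```python
# Hypothetical column lists, only for illustration
bureau_columns = ["SK_ID_CURR", "bureau_DAYS_CREDIT_mean", "bureau_AMT_CREDIT_SUM_max"]
previous_columns = ["SK_ID_CURR", "previous_AMT_ANNUITY_mean"]

# Columns that appear in bureau_columns but not in previous_columns
bureau_features = sorted(set(bureau_columns) - set(previous_columns))
print(bureau_features)
```

The shared SK_ID_CURR disappears from the result because it is in both lists.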

Admit and Correct Mistakes!

When doing manual feature engineering, I accidentally created some columns derived from the client id, SK_ID_CURR. As this is a unique identifier for each client, it should not have any predictive power, and we would not want to build a model trained on this "feature". Let's remove any columns built on the SK_ID_CURR.

Hmm, when did I create anything like that...

There are 1 columns that contain SK_ID_CURR
There are 0 columns that contain SK_ID_BUREAU
There are 0 columns that contain SK_ID_PREV
Training shape: (307511, 1463)
Testing shape: (48744, 1463)

And in fact there is only one.
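A check like the one that produced the counts above might look like the sketch below. This is my guess at the idea, not the notebook's exact code: the derived column name is hypothetical, and I am assuming the id column itself should be kept so that it can still be used later.

```python
import pandas as pd

# Toy frame; "client_SK_ID_CURR_mean" is a hypothetical column derived from the id
train = pd.DataFrame({
    "SK_ID_CURR": [100001, 100002],
    "AMT_CREDIT": [1000.0, 2000.0],
    "client_SK_ID_CURR_mean": [100001.0, 100002.0],
})

id_col = "SK_ID_CURR"
# Columns whose name contains the id, excluding the id column itself
derived_from_id = [c for c in train.columns if id_col in c and c != id_col]
print("There are %d columns derived from %s" % (len(derived_from_id), id_col))

train = train.drop(columns=derived_from_id)
```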

Remove Collinear Variables

Identify Correlated Variables

# Threshold for removing correlated variables
threshold = 0.9

# Absolute value correlation matrix
corr_matrix = train.corr().abs()
corr_matrix.head()

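I used to think that if you are going to look at correlation coefficients, checking the correlation with the target would be enough. Now I understand that to find collinear variables, it is better to compute the whole matrix between all features in one go.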

np.triu with the k=1 option returns the upper triangular matrix without the diagonal elements.
https://blog.amedama.jp/entry/2017/11/30/231443

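The correlation matrix is symmetric, so the same values appear on both sides of the diagonal and either the upper or the lower half is enough.
More to the point, keeping only one half makes the condition easy to write, as in the if any(upper[column] > threshold) check below.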

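In this example, the column AMT_GOODS_PRICE has a correlation of 0.986046 with AMT_CREDIT, which is above the threshold. However, the lower-half values have been removed from the AMT_CREDIT column, so

 if any(upper[column] > threshold)

is triggered only by AMT_GOODS_PRICE, and AMT_CREDIT is never dropped. It neatly removes just one side of each correlated pair.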
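To convince myself of this one-sided behavior, here is a small self-contained sketch. The toy columns a, b, c are made up (b is almost a multiple of a, c is unrelated), and I use the built-in bool for the mask, which is an assumption about how the notebook's mask would be written on a recent NumPy.

```python
import numpy as np
import pandas as pd

# Toy data (made up): b is almost a multiple of a, c is unrelated
train = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 13],
    "c": [5, 1, 4, 2, 6, 3],
})

threshold = 0.9
corr_matrix = train.corr().abs()

# Keep only the entries strictly above the diagonal; everything else becomes NaN
mask = np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
upper = corr_matrix.where(mask)

# A column is flagged only if it correlates with a column that comes BEFORE it
to_drop = [column for column in upper.columns if any(upper[column] > threshold)]
print(to_drop)  # ['b']
```

The pair (a, b) is highly correlated, but upper["a"] holds only NaN, so a survives and only b is dropped.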

Remove Missing Values

# Need to save the labels because aligning will remove this column
train_labels = train["TARGET"]
train_ids = train['SK_ID_CURR']
test_ids = test['SK_ID_CURR']

train = pd.get_dummies(train.drop(columns = all_missing))
test = pd.get_dummies(test.drop(columns = all_missing))

train, test = train.align(test, join = 'inner', axis = 1)

print('Training set full shape: ', train.shape)
print('Testing set full shape: ' , test.shape)

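get_dummies again... why do it twice? Running it a second time should not change the contents, though, because by default it only encodes object and category columns and leaves already-numeric columns alone.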
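A quick way to check both points (that a second get_dummies is a no-op, and that align with join='inner' drops TARGET from train) is the sketch below. The data and the column name CONTRACT_TYPE are made up for illustration.

```python
import pandas as pd

# Toy frames (made up); test has no TARGET and no "Revolving" category
train = pd.DataFrame({
    "AMT_CREDIT": [1000.0, 2000.0, 1500.0],
    "CONTRACT_TYPE": ["Cash", "Revolving", "Cash"],
    "TARGET": [0, 1, 0],
})
test = pd.DataFrame({
    "AMT_CREDIT": [1200.0, 900.0],
    "CONTRACT_TYPE": ["Cash", "Cash"],
})

# Save the labels first, because the inner alignment removes TARGET
train_labels = train["TARGET"]

train_once = pd.get_dummies(train)
train_twice = pd.get_dummies(train_once)  # nothing left to encode
test_once = pd.get_dummies(test)

# Keep only the columns that exist in both frames
train_aligned, test_aligned = train_once.align(test_once, join="inner", axis=1)
print(train_once.equals(train_twice))
print(list(train_aligned.columns))
```

This also shows why the labels have to be saved beforehand: TARGET exists only in train, so the inner join discards it.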

train = train.drop(columns = ['SK_ID_CURR'])
test = test.drop(columns = ['SK_ID_CURR'])

And the ID is dropped again... a mystery. Maybe it was needed for the case where the full data is reloaded, as mentioned above?

Feature Selection through Feature Importances

Since the LightGBM model does not need missing values to be imputed, we can directly fit on the training data.

I read this as: LightGBM accepts data that still contains missing values, so let's take the easy route and use it to measure feature importance.

# Initialize an empty array to hold feature importances
feature_importances = np.zeros(train.shape[1])

# Create the model with several hyperparameters
model = lgb.LGBMClassifier(objective='binary', boosting_type = 'goss', n_estimators = 10000, class_weight = 'balanced')

The explanation above this code also says:

We will use Early Stopping to determine the optimal number of iterations and run the model twice, averaging the feature importances to try and avoid overfitting to a certain set of features.

So the model is fit several times, and the array is initialized to zeros so that the results of each fit can be added to it. I learned this pattern before.

288 features required for 0.90 of cumulative importance

Using a cumulative measure as the criterion is a good idea.
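The accumulate-then-average pattern and the cumulative cutoff can be sketched with plain NumPy and pandas. The importance values below are made up and stand in for what two LightGBM fits would return; no model is trained here, and the feature names f0 to f5 are hypothetical.

```python
import numpy as np
import pandas as pd

features = ["f0", "f1", "f2", "f3", "f4", "f5"]  # hypothetical names

# Made-up importances standing in for model.feature_importances_ from two fits
runs = [
    np.array([45, 25, 20, 5, 5, 0]),
    np.array([35, 35, 10, 15, 5, 0]),
]

# Start from zeros, add each run, then divide to get the average
feature_importances = np.zeros(len(features))
for run_importances in runs:
    feature_importances += run_importances
feature_importances /= len(runs)

df = pd.DataFrame({"feature": features, "importance": feature_importances})
df = df.sort_values("importance", ascending=False).reset_index(drop=True)
df["normalized"] = df["importance"] / df["importance"].sum()
df["cumulative"] = df["normalized"].cumsum()

threshold = 0.9
# Position of the first row whose cumulative importance reaches the threshold
n_required = int(np.argmax(df["cumulative"].values >= threshold)) + 1
zero_features = list(df.loc[df["importance"] == 0.0, "feature"])

print("%d features required for %.2f of cumulative importance" % (n_required, threshold))
print("Zero importance features:", zero_features)
```

With these numbers the cumulative shares are 0.40, 0.70, 0.85, 0.95, so four features are needed to pass 0.90.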

Other Options for Dimensionality Reduction

I still have not quite wrapped my head around PCA...
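For my own reference, here is a minimal sketch of what PCA does, written with NumPy's SVD rather than whatever the notebook uses. The data is randomly generated for illustration, and missing values would have to be filled in first (not shown). The idea is to center the data, find the directions of largest variance, and keep only the first few.

```python
import numpy as np

# Made-up data: the second column is almost a multiple of the first
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([
    base,
    2 * base + 0.01 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# PCA works on centered data
X_centered = X - X.mean(axis=0)

# Rows of Vt are the principal directions, ordered by decreasing variance
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
explained_variance_ratio = S ** 2 / np.sum(S ** 2)

# Project onto the first two directions: 3 columns become 2
n_components = 2
X_reduced = X_centered @ Vt[:n_components].T

print(explained_variance_ratio)
print(X_reduced.shape)
```

Because two of the three columns carry nearly the same information, two components keep almost all of the variance. The new columns are mixtures of the originals, which is the trade-off compared with simply dropping features.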

Conclusion

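In short:

Drop columns with a large share of NA values (above 75% here)
Drop features whose feature importance is 0
For highly collinear variables, keep one and drop the rest (correlation above 0.9 here)

These seem worth doing as a baseline step. That's all for this one.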
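The missing-value step was the one I did not write out above, so here is a minimal sketch of it. The toy columns are made up, and this only looks at one frame; how the notebook combines train and test is not shown here.

```python
import numpy as np
import pandas as pd

# Toy frame (made up): a is 80% missing, b is 20% missing, c is complete
train = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan, np.nan],
    "b": [1.0, 2.0, np.nan, 4.0, 5.0],
    "c": [1.0, 2.0, 3.0, 4.0, 5.0],
})

missing_threshold = 0.75

# isnull() gives booleans, so the column mean is the fraction of missing values
missing_fraction = train.isnull().mean()
to_drop = list(missing_fraction.index[missing_fraction > missing_threshold])

train = train.drop(columns=to_drop)
print(to_drop)
```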
