A quick summary for when you find yourself asking "What is SelectKBest?"

Posted at 2021-10-03

While studying data analysis I ran into code that used a class called SelectKBest,
went "????", and so tried it out briefly myself.
I'm writing it down here as a log.

I'll keep the explanation simple. The $\chi^2$ test comes up, but I'll skip the rigorous explanation for now.

Data used

Following the official documentation, I'll use the classic load_iris.
(The code is written in one go, so it appears below.)

Following the documentation first

To start, I'll copy and paste the example straight from the documentation.

import numpy as np

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2


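# Load the iris features (X) and class labels (y).
X, y = load_iris(return_X_y=True)
print(X.shape)
# Keep the 2 features with the highest chi-squared scores.
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
print(X_new.shape)

(150, 4)
(150, 2)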

Unsurprisingly, the result matches the documentation.

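That said, the only use of $\chi^2$ I knew was the $\chi^2$ test (roughly, "how strongly are two qualitative variables associated?"),
so all I could picture was a cross table of Categorical × Categorical.
Which left me wondering "what on earth is this?",
so for a start I looked at a couple of rows of X and X_new.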

print(X[0])
print(X_new[0])
print(X[10])
print(X_new[10])


[5.1 3.5 1.4 0.2]
[1.4 0.2]
[5.4 3.7 1.5 0.2]
[1.5 0.2]
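
On the question of how $\chi^2$ can apply to continuous columns at all: as far as I understand scikit-learn's chi2, it treats each non-negative feature value as if it were a count, sums the feature per class to get an "observed" table, and compares it with the table expected from class frequencies alone. The following is a minimal sketch of that computation, not the library's actual code; the variable names (observed, expected, manual_scores) are my own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2
from sklearn.preprocessing import LabelBinarizer

X, y = load_iris(return_X_y=True)

# One-hot encode the class labels: shape (n_samples, n_classes).
Y = LabelBinarizer().fit_transform(y)

# "Observed" table: per-class sum of each feature, shape (n_classes, n_features).
observed = Y.T @ X

# "Expected" table: each feature's total split by class frequency alone.
class_prob = Y.mean(axis=0)
feature_total = X.sum(axis=0)
expected = np.outer(class_prob, feature_total)

# Chi-squared statistic per feature (summed over classes).
manual_scores = ((observed - expected) ** 2 / expected).sum(axis=0)

# Scores from scikit-learn for comparison (p-values are returned too).
sklearn_scores, p_values = chi2(X, y)

print(manual_scores)
print(sklearn_scores)
```

If the two printed arrays agree, that supports reading the score as "how unevenly a feature's total is spread across the classes".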

Verifying with scipy too

So SelectKBest is a feature selection method, and here it judged the last two columns to be the best in terms of $\chi^2$. I wondered whether that was really true, so I also checked with scipy.stats.

from scipy.stats import chi2_contingency

# Each call treats the stacked array (features + label column) as one contingency table.
print('X as is:', chi2_contingency(np.hstack((X, y.reshape(-1, 1))), correction=False)[:3])
X_new2 = X[:, [0, 1]]  # columns 0, 1
print('Columns 0, 1:', chi2_contingency(np.hstack((X_new2, y.reshape(-1, 1))), correction=False)[:3])
X_new2 = X[:, [0, 2]]  # columns 0, 2
print('Columns 0, 2:', chi2_contingency(np.hstack((X_new2, y.reshape(-1, 1))), correction=False)[:3])
X_new2 = X[:, [0, 3]]  # columns 0, 3
print('Columns 0, 3:', chi2_contingency(np.hstack((X_new2, y.reshape(-1, 1))), correction=False)[:3])
X_new2 = X[:, [1, 2]]  # columns 1, 2
print('Columns 1, 2:', chi2_contingency(np.hstack((X_new2, y.reshape(-1, 1))), correction=False)[:3])
X_new2 = X[:, [1, 3]]  # columns 1, 3
print('Columns 1, 3:', chi2_contingency(np.hstack((X_new2, y.reshape(-1, 1))), correction=False)[:3])
# columns 2, 3 (the ones SelectKBest picked)
print('SelectKBest pick:', chi2_contingency(np.hstack((X_new, y.reshape(-1, 1))), correction=False)[:3])


X as is: (197.72548568528768, 1.0, 596)
Columns 0, 1: (95.8009422151861, 1.0, 298)
Columns 0, 2: (108.3388708044952, 1.0, 298)
Columns 0, 3: (106.60836667325174, 1.0, 298)
Columns 1, 2: (150.73817774749693, 0.9999999999999512, 298)
Columns 1, 3: (137.88456015851705, 1.0, 298)
SelectKBest pick: (34.65598151635649, 1.0, 298)

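I see: in this check the last two columns do give the smallest $\chi^2$ statistic. One caveat is that this is not the same calculation SelectKBest performs. chi2_contingency treats the whole 150×3 array as a single contingency table (hence 298 degrees of freedom), whereas sklearn's chi2 scores each feature against the class labels, and SelectKBest keeps the k features with the highest scores.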
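
To see what SelectKBest itself computed, the per-feature scores can be read from the fitted selector. The following is a minimal sketch on the same iris data; scores_ and get_support are part of the fitted SelectKBest object, while the variable names selector and selected_idx are only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Fit the selector without transforming, so it can be inspected afterwards.
selector = SelectKBest(chi2, k=2).fit(X, y)

# One chi-squared score per feature; the k highest are kept.
print(selector.scores_)

# Column indices of the kept features.
selected_idx = selector.get_support(indices=True)
print(selected_idx)
```

The printed indices should correspond to the last two columns, matching the rows of X_new shown earlier.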

So, is it fine to reduce dimensions this way?

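If SelectKBest can pick the best features, it's tempting to think "who needs PCA then?!", but it's not that simple.
These are the best features only with respect to $\chi^2$,
so I checked that fitting, for example, LogisticRegression on X_new does not give a particularly good accuracy.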

All features as is

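from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Fixed random_state so every comparison below uses the same split.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)
model = LogisticRegression(solver='liblinear')
model.fit(X_train, y_train)
model.score(X_test, y_test)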

0.9736842105263158

X_new

X_train, X_test, y_train, y_test = train_test_split(X_new, y, random_state=123)
model = LogisticRegression(solver='liblinear')
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.9210526315789473

Columns 0, 1

X_new2 = X[:, [0, 1]]
X_train, X_test, y_train, y_test = train_test_split(X_new2, y, random_state=123)
model = LogisticRegression(solver='liblinear')
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.8421052631578947

Columns 1, 3

X_new2 = X[:, [1, 3]]
X_train, X_test, y_train, y_test = train_test_split(X_new2, y, random_state=123)
model = LogisticRegression(solver='liblinear')
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.9473684210526315

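Game over: X_new scores below both the full feature set and the 1, 3 pair. (Swapping chi2 for f_regression gives the same result, though f_regression is meant for regression targets and f_classif is the F-test for classification.)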
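
The pair-by-pair comparison above can also be run in one loop over every two-column combination. This is a minimal sketch that reuses the same split settings (random_state=123) and the same LogisticRegression(solver='liblinear'); the dictionary name pair_scores is my own, and exact scores may vary slightly between library versions.

```python
from itertools import combinations

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

pair_scores = {}
for cols in combinations(range(X.shape[1]), 2):
    # Keep only the two columns in this pair.
    X_pair = X[:, list(cols)]
    X_train, X_test, y_train, y_test = train_test_split(X_pair, y, random_state=123)
    model = LogisticRegression(solver='liblinear')
    model.fit(X_train, y_train)
    pair_scores[cols] = model.score(X_test, y_test)

# Print the pairs from best to worst test accuracy.
for cols, score in sorted(pair_scores.items(), key=lambda item: item[1], reverse=True):
    print(cols, round(score, 3))
```

Trying every combination like this is only practical because iris has four features; the number of pairs grows quickly with more columns.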

For machine learning, something like RFE looks better

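So when does SelectKBest have the advantage? It is (apparently) suited to feature selection from the standpoint of statistics.

That means it does not necessarily shine with a linear model like the one above.
When you want to pick features well for machine learning,
it is better to use something like RFE (Recursive Feature Elimination) or EFS (Exhaustive Feature Selector).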

I tried it out to check.

from sklearn.feature_selection import RFE


# Baseline: all four features.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)
model = LogisticRegression(solver='liblinear')
model.fit(X_train, y_train)
print('All features', model.score(X_test, y_test))

# RFE repeatedly drops the weakest feature until 2 remain (same split as above).
rfe = RFE(estimator=LogisticRegression(solver='liblinear'), n_features_to_select=2)
rfe.fit(X_train, y_train)
print('RFE features 2', rfe.score(X_test, y_test))


All features 0.9736842105263158
RFE features 2 0.9473684210526315

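Looking at the result, RFE's two features score 0.947, above the 0.921 of SelectKBest's X_new, so it does seem to pick features that suit the model.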
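
To check which columns RFE actually kept, the fitted object can be inspected. A minimal sketch, assuming the same split and estimator as above; support_, ranking_ and get_support belong to the fitted RFE object, and kept_idx is just an illustrative name.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

rfe = RFE(estimator=LogisticRegression(solver='liblinear'), n_features_to_select=2)
rfe.fit(X_train, y_train)

# Boolean mask over the 4 columns: True means the feature was kept.
print(rfe.support_)

# Rank 1 marks the kept features; larger ranks were eliminated earlier.
print(rfe.ranking_)

# Column indices of the kept features.
kept_idx = rfe.get_support(indices=True)
print(kept_idx)
```

Comparing kept_idx with the columns SelectKBest chose shows whether the two methods agree on this data.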

Summary

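What started as "what is SelectKBest?" ended up branching into how to choose between the approaches.
To recap:
Feature selection from a statistical standpoint → SelectKBest (among others)
Feature selection from a machine learning standpoint → RFE (among others)
That seems like an appropriate way to split them.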
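
Since the comparison above rests on a single train/test split, one way to make it a little sturdier is to put each selector in a pipeline and cross-validate it. This is only a sketch under my own assumptions (5 folds, k=2, the helper make_model and the dictionary cv_scores are all illustrative), not a result from the original experiment.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)


def make_model():
    # Same estimator settings as used throughout the article.
    return LogisticRegression(solver='liblinear')


# Each pipeline refits its selector inside every fold, so the held-out
# fold never influences which features are chosen.
candidates = {
    'SelectKBest(chi2, k=2)': make_pipeline(SelectKBest(chi2, k=2), make_model()),
    'RFE(n_features_to_select=2)': make_pipeline(
        RFE(estimator=make_model(), n_features_to_select=2), make_model()
    ),
}

cv_scores = {}
for name, pipe in candidates.items():
    cv_scores[name] = cross_val_score(pipe, X, y, cv=5).mean()
    print(name, round(cv_scores[name], 3))
```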

I hope this is useful in some way!

References

[For beginners] A summary of feature selection basics (scikit-learn, sometimes mlxtend) ⇦ Extremely helpful.
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html#sklearn.feature_selection.chi2
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html#sklearn.feature_selection.f_regression
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html
