sklearn.feature_selection.chi2 #scikit-learn

Judging by its name, sklearn.feature_selection is a module for feature selection. Its documentation includes the following sample code:

>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectKBest
>>> from sklearn.feature_selection import chi2
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
>>> X_new.shape
(150, 2)

I couldn't tell at a glance what this sample code was doing, so I looked into it. These are my notes for future reference.

What does X_new look like?

print(X[0])
print(X_new[0])

# [ 5.1  3.5  1.4  0.2]
# [ 1.4  0.2]

print(X[1])
print(X_new[1])

# [ 4.9  3.   1.4  0.2]
# [ 1.4  0.2]

It looks like the 3rd and 4th features (petal length and petal width in the iris dataset) were selected.
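Rather than eyeballing the rows, the selection can also be checked directly. The following is a minimal sketch that assumes the fitted selector is kept in a variable (here called `selector`, a name not used in the original sample) so that its `get_support` method can be queried:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
X, y = iris.data, iris.target

# Keep the fitted selector instead of calling fit_transform in one go
selector = SelectKBest(chi2, k=2).fit(X, y)

# Column indices of the features that were kept
selected = selector.get_support(indices=True)
print(selected)

# Map the indices back to the feature names shipped with the dataset
selected_names = [iris.feature_names[i] for i in selected]
print(selected_names)
```

The indices are zero-based, so the 3rd and 4th features show up as columns 2 and 3.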

SelectKBest

Select features according to the k highest scores.

In other words, it selects the k features with the best scores.
It seems that score_func can be set to f_classif, chi2, and so on.
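To see what "score" means here, the per-feature scores can be printed for each score_func. This is only a sketch for exploration; the `kept` and `scores` dictionaries are names introduced for this example, and I have left out the printed values rather than quote numbers from memory:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, f_classif

iris = load_iris()
X, y = iris.data, iris.target

kept = {}    # score function name -> indices of the selected columns
scores = {}  # score function name -> score of every feature

for score_func in (chi2, f_classif):
    selector = SelectKBest(score_func, k=2).fit(X, y)
    name = score_func.__name__
    scores[name] = selector.scores_
    kept[name] = selector.get_support(indices=True)

    print(name)
    for feature, score in zip(iris.feature_names, selector.scores_):
        print(f"  {feature}: {score:.2f}")
    print("  kept columns:", kept[name])
```

The scores from different score functions are on different scales, so they are only comparable within one score_func, not across them.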

chi2

Compute chi-squared stats between each non-negative feature and class.

snip...

Recall that the chi-square test measures dependence between stochastic variables, so using this function “weeds out” the features that are the most likely to be independent of class and therefore irrelevant for classification.

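A very rough paraphrase: it computes the chi-squared statistic between each feature and the class. (snip) Since the chi-square test measures dependence between random variables, selecting by this score ends up discarding the features that are most likely independent of the class (in other words, features that don't help with classification).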

Reading the actual source code, you can see that it computes the chi-squared statistic from the expected frequencies (expected) and the observed frequencies (observed).
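As I understand the source, the observed value for a (class, feature) pair is the sum of that feature over the samples in the class, and the expected value is the class proportion times the feature's overall total. The sketch below redoes that calculation by hand with NumPy and compares it with `chi2`. It is my own reconstruction, not a copy of the library code, and the one-hot encoding via `np.eye` assumes the labels are the integers 0, 1, 2, as they are for iris:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2

X, y = load_iris(return_X_y=True)

# One-hot encode the labels: Y[i, c] is 1 if sample i belongs to class c
# (assumes the labels are consecutive integers starting at 0)
Y = np.eye(len(np.unique(y)))[y]

# Observed: sum of each feature's values within each class, shape (3, 4)
observed = Y.T @ X

# Expected under independence: class proportion times the feature's total
class_prob = Y.mean(axis=0)      # shape (3,)
feature_total = X.sum(axis=0)    # shape (4,)
expected = np.outer(class_prob, feature_total)

# Chi-squared statistic per feature: sum over classes of (O - E)^2 / E
manual_scores = ((observed - expected) ** 2 / expected).sum(axis=0)

sklearn_scores, p_values = chi2(X, y)
print(manual_scores)
print(sklearn_scores)
print(np.allclose(manual_scores, sklearn_scores))
```

Because the "frequencies" are sums of feature values, this only makes sense for non-negative features, which matches the wording in the docstring.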

To summarize...

What the sample code does is select, out of the 4 features, the top 2 that depend most strongly on the label.