
A comment on Boruta algorithm

Posted at 2020-05-15

Dispensing with the preliminaries,

Boruta is a feature selection method based on feature importance.

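I studied it using the explanation on the site below. It is very easy to follow and includes experiments, so it was a great help.

[ランダムフォレストと検定を用いた特徴量選択手法 Boruta](https://aotamasaki.hatenablog.com/entry/2019/01/05/195813) ("Boruta: a feature selection method using random forests and hypothesis testing")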

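To decide whether a given feature is useful for prediction, the Boruta algorithm sets up the null hypothesis "the importance of this feature is the same as the importance of a feature that does not contribute to classification (regression)" and tests it. If the null hypothesis is true, which of the two importances comes out larger should be random, so the number of times the feature's importance is the larger one should follow a binomial distribution with $p=0.5$.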

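What caught my attention here is that there is not just one shadow feature: there are as many shadow features as there are original features. And the comparison is "the importance of a given feature vs. the largest importance among the shadow features".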
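
To make the comparison concrete, here is a minimal sketch of one round of "feature vs. largest shadow feature". The toy data and the `importance` function are my own assumptions for illustration: Boruta uses random forest importances, whereas this sketch swaps in the absolute Pearson correlation with the target so that it runs with NumPy alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): 200 samples, 4 features; only feature 0 drives y.
n_samples, n_features = 200, 4
X = rng.normal(size=(n_samples, n_features))
y = 2.0 * X[:, 0] + rng.normal(size=n_samples)


def importance(features, target):
    """Stand-in importance: |Pearson correlation| of each column with target.

    This is NOT what Boruta uses (it uses random forest importances).
    """
    centered = features - features.mean(axis=0)
    t = target - target.mean()
    return np.abs(centered.T @ t) / (
        np.linalg.norm(centered, axis=0) * np.linalg.norm(t)
    )


# One shadow feature per original feature: each column is shuffled
# independently, which breaks any relationship with y.
X_shadow = rng.permuted(X, axis=0)

imp = importance(np.hstack([X, X_shadow]), y)
imp_real, imp_shadow = imp[:n_features], imp[n_features:]

# A feature scores a "hit" when it beats the LARGEST shadow importance.
hits = imp_real > imp_shadow.max()
print("importances (real):  ", np.round(imp_real, 3))
print("importances (shadow):", np.round(imp_shadow, 3))
print("hit per feature:     ", hits)
```

Repeating this round many times and counting the hits per feature gives the counts that the binomial test is then applied to.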

The image above is quoted from [ランダムフォレストと検定を用いた特徴量選択手法 Boruta](https://aotamasaki.hatenablog.com/entry/2019/01/05/195813).

The boruta_py implementation also appears to work this way.

    def _do_tests(self, dec_reg, hit_reg, _iter):
        active_features = np.where(dec_reg >= 0)[0]
        hits = hit_reg[active_features]
        # get uncorrected p values based on hit_reg
        to_accept_ps = sp.stats.binom.sf(hits - 1, _iter, .5).flatten()
        to_reject_ps = sp.stats.binom.cdf(hits, _iter, .5).flatten()
        ...
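
To check what those two uncorrected p-values mean, here is a small sketch for a single feature. The values of `n_iter` and `hits` are hypothetical numbers I picked for illustration; the only library calls are `scipy.stats.binom.sf` and `binom.cdf`, the same ones used in the excerpt.

```python
from scipy.stats import binom

n_iter = 20   # hypothetical number of iterations so far
hits = 15     # hypothetical hit count for one feature

# sf(k) = P(X > k), so sf(hits - 1) = P(X >= hits):
# a small value is evidence that the feature beats the shadows too often.
p_accept = binom.sf(hits - 1, n_iter, 0.5)

# cdf(k) = P(X <= k):
# a small value is evidence that the feature beats the shadows too rarely.
p_reject = binom.cdf(hits, n_iter, 0.5)

print(f"P(X >= {hits}) = {p_accept:.4f}")
print(f"P(X <= {hits}) = {p_reject:.4f}")
```

Both tail probabilities are computed with $p=0.5$, which is exactly the value I am questioning below.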

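However, let $S_{\mathrm{max}}$ be the random variable for the largest of the shadow feature importances. With four shadow features, writing $F_1$ for the importance of the feature being tested and $S_1, \dots, S_4$ for the shadow feature importances,

$$
S_{\mathrm{max}} \leq F_1 \Leftrightarrow S_1 \leq F_1 \wedge S_2 \leq F_1 \wedge S_3 \leq F_1 \wedge S_4 \leq F_1
$$

Therefore, if the four events on the right-hand side are treated as independent, each with probability $1/2$ under the null hypothesis,

$$
\mathrm{Pr}(S_{\mathrm{max}} \leq F_1) = \mathrm{Pr}(S_1 \leq F_1)\mathrm{Pr}(S_2 \leq F_1)\mathrm{Pr}(S_3 \leq F_1)\mathrm{Pr}(S_4 \leq F_1) = \left(\frac{1}{2}\right)^4
$$

so when the null hypothesis is true, the number of times a feature's importance exceeds the largest shadow feature importance should follow a binomial distribution with $p=(1/2)^4$. (Strictly speaking, all four events involve $F_1$, so treating them as independent is only an approximation.)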
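
A quick Monte Carlo check of this probability is sketched below. It rests on an assumption of mine that is not part of Boruta itself: under the null hypothesis, $F_1$ and $S_1, \dots, S_4$ are drawn as independent, identically distributed standard normal values. Real random forest importances need not behave like this.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_shadow = 200_000, 4

# Assumption: under the null, F_1 and S_1..S_4 are i.i.d. standard normal.
f1 = rng.normal(size=n_trials)
shadows = rng.normal(size=(n_trials, n_shadow))

# Pr(S_1 <= F_1): comparison against a single shadow feature.
p_single = np.mean(shadows[:, 0] <= f1)
# Pr(S_max <= F_1): comparison against the largest of the four.
p_max = np.mean(shadows.max(axis=1) <= f1)

print(f"estimated Pr(S_1 <= F_1)   = {p_single:.3f}")
print(f"estimated Pr(S_max <= F_1) = {p_max:.3f}")
print(f"(1/2)^4 = {0.5**4:.4f}, 1/5 = {1/5:.4f}")
```

Under this i.i.d. assumption each of the five values is equally likely to be the largest, so the estimate lands near $1/5$ rather than exactly $(1/2)^4$: the four comparisons share $F_1$ and are therefore not independent. Either way, the hit probability is well below $0.5$.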

    # Jupyter notebook magic; drop this line when running as a plain script.
    %matplotlib inline
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import binom

    # IPAPGothic is a Japanese font; matplotlib falls back to its default if it is missing.
    plt.rcParams['font.family'] = 'IPAPGothic'
    plt.rcParams["figure.figsize"] = [12, 6]
    plt.rcParams['font.size'] = 20
    plt.rcParams['xtick.labelsize'] = 15
    plt.rcParams['ytick.labelsize'] = 15

    # Number of trials
    N = 100
    # Possible hit counts out of N trials (array)
    k = np.arange(N + 1)

    # Plot the binomial PMFs for p=0.5 and p=0.5^4
    fig, ax = plt.subplots(1, 1)
    ax.plot(k, binom.pmf(k, N, p=0.5), 'bo', ms=8, label='p=0.5')
    ax.vlines(k, 0, binom.pmf(k, N, p=0.5), colors='b', lw=1, alpha=0.2)
    ax.plot(k, binom.pmf(k, N, p=0.5**4), 'x', ms=8, color='r', label='p=0.5^4')
    ax.vlines(k, 0, binom.pmf(k, N, p=0.5**4), colors='r', lw=1, alpha=0.2)
    ax.set_xlabel('number of hits')
    ax.set_ylabel('probability')
    ax.legend()
    plt.show()

Plotting it makes this clear: the rejection regions of the two distributions are quite different.
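
To put numbers on the difference, the sketch below computes, for each value of $p$, the smallest hit count whose upper-tail probability falls at or below a significance level. The level `alpha = 0.05` is a hypothetical choice of mine, and no multiple-testing correction is applied.

```python
import numpy as np
from scipy.stats import binom

N = 100        # number of trials, matching the plot
alpha = 0.05   # hypothetical one-sided significance level (uncorrected)
k = np.arange(N + 1)

thresholds = {}
for p in (0.5, 0.5**4):
    # P(X >= k) for every possible hit count k, since sf(k - 1) = P(X > k - 1).
    upper_tail = binom.sf(k - 1, N, p)
    # Smallest hit count whose upper-tail probability is at most alpha.
    thresholds[p] = int(k[upper_tail <= alpha].min())
    print(f"p = {p:.4f}: upper-tail region starts at hits >= {thresholds[p]}")
```

A feature would need far fewer hits to look significant under $p=0.5^4$ than under $p=0.5$, which is the gap visible in the plot.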

Yours in haste,

Having written all this, though, Boruta does seem to work properly when I actually use it, so I wonder whether my reasoning is wrong somewhere.
