Visualizing feature distributions with kernel density estimation plots

Posted at 2022-11-17

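I came across code that uses kernel density estimation (KDE) plots to visualize feature distributions and to compare, feature by feature, the distributions in the training and test datasets, so I am noting it down here. (TPS September 2021 EDA (Kaggle))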

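A situation in which the distributions of the training data and the test data do not match is called domain shift,
and it can reportedly degrade the performance of a machine learning model.
Reference: Domain Shift and Performance Degradation in Machine Learning (ドメインシフトと機械学習の性能低下)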

What is kernel density estimation?

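Kernel density estimation is one method for estimating an overall distribution from a finite set of sample points.
When a distribution cannot be described by a parametric model, nonparametric estimation is used instead.
Kernel density estimation is a representative example of nonparametric estimation.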
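
To make the idea concrete, here is a minimal sketch of a one-dimensional Gaussian kernel density estimate written with NumPy only. The function name `gaussian_kde_1d`, the synthetic samples, and the bandwidth of 0.3 are assumptions chosen for illustration; seaborn's `kdeplot` picks its bandwidth automatically and is not implemented this way line for line.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate on `grid` (illustrative helper)."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Scaled distance between every grid point and every sample.
    u = (grid[:, None] - samples[None, :]) / bandwidth
    # Standard normal kernel evaluated at each scaled distance.
    kernel = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    # Average the kernels over the samples and rescale by the bandwidth.
    return kernel.sum(axis=1) / (samples.size * bandwidth)

# Synthetic data standing in for one feature column (assumption).
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=500)
grid = np.linspace(-6.0, 6.0, 601)

density = gaussian_kde_1d(samples, grid, bandwidth=0.3)

# A density should integrate to roughly 1 over a wide enough grid.
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```

Each sample contributes a small bell curve centered on itself, and the estimate is the average of those curves. A smaller bandwidth gives a spikier curve, and a larger one gives a smoother curve.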

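Reference: The Difference Between Parametric and Nonparametric Methods (パラメトリック手法とノンパラメトリック手法の違い)
Reference: What Is Kernel Density Estimation? (カーネル密度推定とは)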

Code used

The dataset is from Tabular Playground Series - Sep 2021.
(The training dataset is stored in train_df and the test dataset in test_df.)

import matplotlib.pyplot as plt
import seaborn as sns

# Every column except the ID and the target is a feature.
features = [feature for feature in train_df.columns if feature not in ['id', 'claim']]
# Only 25 features fit on the 5x5 grid, so this overwrites the list above with a 25-column slice.
features = list(train_df.columns[1:26])
# features = ['f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20', 'f21', 'f22', 'f23', 'f24', 'f25', 'f26']

# Defined before the figure is built because the subplots use it.
background_color = "#f6f5f5"

plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 10), facecolor=background_color)
gs = fig.add_gridspec(5, 5)
gs.update(wspace=0.3, hspace=0.3)

# Create one Axes per grid cell and hide its top and right spines.
axes = []
for row in range(5):
    for col in range(5):
        ax = fig.add_subplot(gs[row, col])
        ax.set_facecolor(background_color)
        for s in ["top", "right"]:
            ax.spines[s].set_visible(False)
        axes.append(ax)

# Overlay the training (yellow) and test (red) KDE curves for each feature.
for ax, col in zip(axes, features):
    sns.kdeplot(ax=ax, x=train_df[col], zorder=2, alpha=1, linewidth=1, color='#ffd514')
    sns.kdeplot(ax=ax, x=test_df[col], zorder=2, alpha=1, linewidth=1, color='#ff355d')
    ax.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    ax.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    ax.set_ylabel('')
    ax.set_xlabel(col, fontsize=4, fontweight='bold')
    ax.tick_params(labelsize=4, width=0.5)
    ax.xaxis.offsetText.set_fontsize(4)
    ax.yaxis.offsetText.set_fontsize(4)

plt.show()

Reference: TPS September 2021 EDA (Kaggle)

Output

For the 25 features plotted, the distributions in the training and test datasets turned out to be almost identical.
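
A visual comparison can be complemented with a number. The sketch below is not part of the referenced notebook: it uses synthetic data and SciPy's two-sample Kolmogorov-Smirnov test (`scipy.stats.ks_2samp`) as one possible way to score how far apart two samples of a feature are. The variable names and the shift of 1.0 are assumptions for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic stand-ins for one feature column (assumption, not the Kaggle data).
rng = np.random.default_rng(42)
train_feature = rng.normal(0.0, 1.0, size=2000)
test_same = rng.normal(0.0, 1.0, size=2000)      # same distribution as train
test_shifted = rng.normal(1.0, 1.0, size=2000)   # mean shifted by 1.0

# The KS statistic is the largest gap between the two empirical CDFs:
# values near 0 mean similar samples, larger values mean a bigger mismatch.
stat_same = ks_2samp(train_feature, test_same).statistic
stat_shifted = ks_2samp(train_feature, test_shifted).statistic

print(f"same distribution:    KS statistic = {stat_same:.3f}")
print(f"shifted distribution: KS statistic = {stat_shifted:.3f}")
```

With real data, the same call could be applied per column, for example `ks_2samp(train_df[col], test_df[col])`, to flag features whose curves deserve a closer look.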

Functions and features in the code that were new to me

add_gridspec

Calling Figure.add_gridspec(nrows, ncols) returns a GridSpec that divides the figure into nrows rows (vertically) and ncols columns (horizontally).
The code above uses

gs = fig.add_gridspec(5,5)

so the plots are laid out in a 5×5 grid, as seen in the output.
The following line sets the spacing between the plots.

gs.update(wspace=0.3, hspace=0.3)
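
As a smaller, self-contained illustration of the same two calls, the sketch below builds a 2×3 grid instead of 5×5. The figure size, the grid shape, and the cell titles are assumptions chosen for brevity, and the Agg backend is selected only so that the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(6, 4))

# Split the figure into 2 rows and 3 columns.
gs = fig.add_gridspec(2, 3)
# Widen the horizontal and vertical gaps between the cells.
gs.update(wspace=0.3, hspace=0.3)

# Indexing the GridSpec with [row, col] selects one cell for add_subplot.
axes = [fig.add_subplot(gs[row, col]) for row in range(2) for col in range(3)]
for i, ax in enumerate(axes):
    ax.set_title(f"cell {i}", fontsize=8)

print(gs.get_geometry())  # (rows, columns) of the grid
print(len(fig.axes))      # number of subplots created
```

Keeping the Axes objects in a list, as here, makes it easy to pair each one with a feature name later via `zip`.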

spines["top", "right"].set_visible(False)

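Spines are the lines that connect the axis tick marks and mark the boundary of the data area. They can be placed at arbitrary positions.
In the code above, spines["top"] and spines["right"] select the top and right edges of each plot's frame, and set_visible(False) hides those edges.
Reference: Creating Graphs Without Frame Lines in matplotlib (matplotlibで枠線を消したグラフを作る)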

# ax is the Axes object of one subplot.
for s in ["top", "right"]:
    ax.spines[s].set_visible(False)
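
The effect can be checked on a single plot. In this minimal sketch the plotted values are arbitrary and the Agg backend is an assumption made so that the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])

# Hide the top and right edges of the frame; left and bottom stay visible.
for side in ["top", "right"]:
    ax.spines[side].set_visible(False)

visibility = {side: ax.spines[side].get_visible()
              for side in ["left", "bottom", "top", "right"]}
print(visibility)
# {'left': True, 'bottom': True, 'top': False, 'right': False}
```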
