TPS22Jul: Walkthrough of a Top-Ranking Notebook

Posted at 2022-08-04

Walkthrough of a top-ranking TPS22Jul notebook (public score 0.81912)
https://www.kaggle.com/code/mehrankazeminia/3-3-tps22jul-clustering-ensembling

GMM overview
https://datachemeng.com/gaussianmixturemodel/
GaussianMixture parameters
https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html

Loading libraries

Import everything needed up front

.py
import os
import gc
import random

import numpy as np 
import pandas as pd
import seaborn as sns

from tqdm import tqdm
from scipy import stats
from pathlib import Path

import matplotlib.pyplot as plt
import plotly.figure_factory as ff
import plotly.express as px
%matplotlib inline
!ls ../input/*

!pip install sklego
from sklego.mixture import BayesianGMMClassifier
from sklearn.mixture import BayesianGaussianMixture
from sklearn.preprocessing import MinMaxScaler, PowerTransformer, StandardScaler, RobustScaler, LabelEncoder

Data preparation

Load the competition data

.py
DATA = pd.read_csv('../input/tabular-playground-series-jul-2022/data.csv')
SAMPLE = pd.read_csv('../input/tabular-playground-series-jul-2022/sample_submission.csv')

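Copy the data into df
Drop the id column from df (it is not a feature)
inplace=True modifies df itself; DATA keeps its id column because df is a copy
Store the column names in cols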

.py
df = DATA.copy()
df.drop("id", axis=1, inplace=True)
cols = list(df.columns)
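
As a quick check of how `inplace=True` interacts with `copy()`, here is a minimal sketch on a made-up three-row frame (the values are illustrative, not competition data). Because `df` is an independent copy, dropping `id` from it should leave the source frame unchanged.

```python
import pandas as pd

# Toy stand-in for the competition data (values are made up)
DATA = pd.DataFrame({"id": [0, 1, 2], "f_00": [0.1, 0.2, 0.3], "f_01": [1.0, 2.0, 3.0]})

df = DATA.copy()                       # independent copy of DATA
df.drop("id", axis=1, inplace=True)    # modifies df only
cols = list(df.columns)

print(cols)                 # feature columns only
print(list(DATA.columns))   # DATA still contains "id"
```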

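Shapiro-Wilk test
Null hypothesis H: the sample was drawn from a normally distributed population
With a significance level of 0.05, reject the null hypothesis when p < 0.05
Keep only the columns with p_value <= alpha, i.e. those judged not to come from a normal population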

.py
cols_select = []
alpha = 0.05

for col in cols:
    _, p_value = stats.shapiro(df[col])
    
    if (p_value <= alpha): 
        cols_select.append(col)       
print(cols_select)  
# ['f_07', 'f_08', 'f_09', 'f_10', 'f_11', 'f_12', 'f_13', 'f_22', 'f_23', 'f_24',
#  'f_25', 'f_26', 'f_27', 'f_28']
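
The selection rule can be tried in isolation with a small sketch on synthetic data. The frame `toy` and the helper `select_non_normal` are hypothetical names introduced here for illustration; only `stats.shapiro` comes from the article's code.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic columns: one Gaussian, one clearly skewed (illustrative only)
toy = pd.DataFrame({
    "gaussian": rng.normal(size=500),
    "skewed": rng.exponential(size=500),
})

def select_non_normal(frame, alpha=0.05):
    """Return the columns for which Shapiro-Wilk rejects normality."""
    selected = []
    for col in frame.columns:
        _, p_value = stats.shapiro(frame[col])
        if p_value <= alpha:
            selected.append(col)
    return selected

selected = select_non_normal(toy)
print(selected)
```

SciPy's documentation notes that the Shapiro-Wilk p-value may not be accurate for N > 5000, so on the 98000-row competition data the result is probably best read as a rough filter rather than an exact test.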

Scaling

Use PowerTransformer() to make the data more Gaussian-like

.py
dff = DATA[cols_select]

dffs = dff.copy()
dffs = PowerTransformer().fit_transform(dffs)
dffs = pd.DataFrame(dffs, columns=cols_select)
dffs
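
To see what PowerTransformer() does, the following sketch applies it to a synthetic, strongly skewed column and compares the skewness before and after. The data and the column name `skewed` are assumptions for illustration; with default arguments the transformer uses the Yeo-Johnson method and standardizes the output.

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# Log-normal data is heavily right-skewed (illustrative only)
toy = pd.DataFrame({"skewed": rng.lognormal(size=2000)})

# Default settings: Yeo-Johnson transform followed by standardization
transformed = PowerTransformer().fit_transform(toy)
transformed = pd.DataFrame(transformed, columns=toy.columns)

skew_before = stats.skew(toy["skewed"])
skew_after = stats.skew(transformed["skewed"])
print(skew_before, skew_after)
```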

Classification: loading past submissions

Load three previously submitted prediction files
They are used as the training labels

.py
sub_prime = pd.read_csv('../input/tps22jul81580/submission.csv', index_col=[0])
sub_prime['Predicted'].value_counts()

3 16358
5 16343
6 14699
1 13775
4 13018
7 12524
2 11283
Name: Predicted, dtype: int64

Show the label distribution of the past submission

.py
sub_prime['Predicted'] += -1   # shift labels from 1-7 to 0-6
sub_prime['Predicted'].value_counts().plot(kind='bar')
sub_prime['Predicted'].value_counts()

2    16358
4    16343
5    14699
0    13775
3    13018
6    12524
1    11283
Name: Predicted, dtype: int64

![image.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/1982114/3c868d04-49e4-3891-c086-8de388623c09.png)

Load the second past submission

.py
support1 = pd.read_csv('../input/tps22jul81661/submission.csv', index_col=[0])
support1['Predicted'].value_counts()

3 16778
1 15776
5 15063
2 14217
6 12855
4 11935
7 11376
Name: Predicted, dtype: int64

.py
support1['Predicted'] += -1
support1['Predicted'].value_counts().plot(kind='bar')
support1['Predicted'].value_counts()

2 16778
0 15776
4 15063
1 14217
5 12855
3 11935
6 11376
Name: Predicted, dtype: int64
[Figure: bar chart of the support1 label counts]

.py
support2 = pd.read_csv('../input/tps22jun81232/submission.csv', index_col=[0])
support2['Predicted'].value_counts()

4 16355
5 16331
2 14769
1 13701
6 12988
3 12532
7 11324
Name: Predicted, dtype: int64

.py
support2['Predicted'] += -1
support2['Predicted'].value_counts().plot(kind='bar')
support2['Predicted'].value_counts()

3 16355
4 16331
1 14769
0 13701
5 12988
2 12532
6 11324
Name: Predicted, dtype: int64
[Figure: bar chart of the support2 label counts]

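Classification: training BayesianGMMClassifier

dffs (the columns selected above, after PowerTransformer) is the training input
sub_prime (a previous submission) is label set 1
support1 and support2 (previous submissions) are label sets 2 and 3

The BGMM is trained with these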

.py
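X = np.array(dffs)
# Take the single 'Predicted' column as a 1-D label vector
y = sub_prime['Predicted'].to_numpy()

s1 = support1['Predicted'].to_numpy()
s2 = support2['Predicted'].to_numpy()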

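n_components=7: number of mixture components; other articles reported that 7 works well. BayesianGMMClassifier fits a separate mixture for each class label, so this value applies to each per-class mixture
random_state=123: random seed
tol=0.001: convergence threshold; EM iterations stop once the average gain of the lower bound falls below this value
max_iter=200: maximum number of EM iterations
n_init=3: number of initializations to run; the best result is kept
verbose=0: no progress output

Fit the BGMM with each label set (y, s1, s2) and store the predicted probabilities in proba, probs1, and probs2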

.py
bgm = BayesianGMMClassifier(n_components=7, random_state=123, tol=0.001, max_iter=200, n_init=3, verbose=0)
# predict_proba gives, for each of the 98000 rows, the probability of each of the 7 labels: shape (98000, 7)

bgm.fit(X,y)
proba = bgm.predict_proba(X)

bgm.fit(X,s1)
probs1 = bgm.predict_proba(X)

bgm.fit(X,s2)
probs2 = bgm.predict_proba(X)

proba.shape, probs1.shape, probs2.shape
#((98000, 7), (98000, 7), (98000, 7))
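
The idea behind a GMM-based classifier can be sketched with scikit-learn alone: fit one mixture per class, then normalize the per-class likelihoods into probabilities. The class `TinyGMMClassifier` below is a hypothetical helper written for this explanation, assumes equal class priors, and is not sklego's actual implementation; the two-blob data is synthetic.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

class TinyGMMClassifier:
    """Hypothetical helper: one Bayesian GMM per class, equal class priors."""

    def __init__(self, n_components=2, random_state=0):
        self.n_components = n_components
        self.random_state = random_state

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # Fit a separate mixture on the rows belonging to each class
        self.models_ = [
            BayesianGaussianMixture(
                n_components=self.n_components, random_state=self.random_state
            ).fit(X[y == c])
            for c in self.classes_
        ]
        return self

    def predict_proba(self, X):
        # Per-class log-likelihood of every sample: (n_samples, n_classes)
        log_lik = np.column_stack([m.score_samples(X) for m in self.models_])
        log_lik -= log_lik.max(axis=1, keepdims=True)  # numerical stability
        lik = np.exp(log_lik)
        return lik / lik.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs with known labels
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(8, 1, (200, 2))])
y = np.repeat([0, 1], 200)

clf = TinyGMMClassifier().fit(X, y)
proba = clf.predict_proba(X)
pred = clf.classes_[proba.argmax(axis=1)]
print(proba.shape, (pred == y).mean())
```

Refitting the same object with different labels, as the article does with y, s1, and s2, simply replaces the per-class mixtures, so each `predict_proba` call reflects the most recent label set.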

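Join the arrays column-wise with concatenate() (axis=1)
Scale the probabilities down in the order proba, probs1, probs2, as a weighting that reflects each submission's accuracy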

.py
prob = np.concatenate((proba, probs1*0.90), axis=1)
prob = np.concatenate((prob,  probs2*0.80), axis=1)
prob.shape
#(98000, 21)

Ensemble: assign each row to the column with the highest weighted probability

.py
pred = np.argmax(prob, axis=1)
pred, min(pred), max(pred)
#(array([0, 6, 5, ..., 5, 1, 4]), 0, 20)
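
A tiny hand-checkable version of this step may help. The sketch below uses made-up probabilities for 2 classes and 2 rows (the article uses 7 classes and 98000 rows) and the same weights 0.90 and 0.80. The names `source_model` and `local_label` are introduced here for illustration.

```python
import numpy as np

n_classes = 2  # the article uses 7
# Made-up probabilities from three models for two rows
proba  = np.array([[0.70, 0.30], [0.40, 0.60]])
probs1 = np.array([[0.20, 0.80], [0.90, 0.10]])
probs2 = np.array([[0.50, 0.50], [0.30, 0.70]])

# Column-wise join; later blocks are down-weighted
prob = np.concatenate((proba, probs1 * 0.90, probs2 * 0.80), axis=1)
pred = np.argmax(prob, axis=1)

# Which model won, and which label it chose in its own label space
source_model = pred // n_classes
local_label = pred % n_classes
print(pred, source_model, local_label)
# pred = [3 2], source_model = [1 1], local_label = [1 0]
```

The winning index encodes both the model and that model's own label. Since each past submission numbers its clusters differently, the indices beyond the first block still have to be translated into the label space of y.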

Compare the y labels with the s1 labels (columns: s1, rows: y)
For each s1 label, the most frequent y label turned out to be:
s1  y
0   4
1   0
2   2
3   6
4   5
5   3
6   1

.py
clusters = np.zeros(shape=(7, 7), dtype=int)
for n1, n2 in zip(y, s1):
    clusters[n1, n2] += 1
    
clusters

array([[   33, 13518,    70,    24,    81,    10,    39],
       [   31,    23,    48,    24,    12,    15, 11130],
       [   30,    20, 16149,    33,    65,    34,    27],
       [   10,    10,    93,    66,   167, 12651,    21],
       [15613,   521,    60,    29,    10,    15,    95],
       [    7,    27,    78,    27, 14448,    86,    26],
       [   52,    98,   280, 11732,   280,    44,    38]])

From the matrix above, take the most frequent y label for each s1 label

.py
max_clusters = np.argmax(clusters, axis=0)
max_clusters
#array([4, 0, 2, 6, 5, 3, 1])
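
The count matrix and the column-wise argmax can be verified by hand on a small case. This sketch uses made-up label vectors with 3 classes and builds the matrix with `np.add.at`, an alternative to the explicit loop that is swapped in here for brevity.

```python
import numpy as np

# Made-up labels from two clusterings of the same 7 rows
y  = np.array([0, 0, 1, 1, 2, 2, 2])
s1 = np.array([1, 1, 2, 2, 0, 0, 1])

# clusters[i, j] = number of rows with y == i and s1 == j
clusters = np.zeros((3, 3), dtype=int)
np.add.at(clusters, (y, s1), 1)

# For each s1 label (column), the y label (row) it overlaps with most
max_clusters = np.argmax(clusters, axis=0)
print(clusters)
print(max_clusters)
# clusters     = [[0 2 0], [0 0 2], [2 1 0]]
# max_clusters = [2 0 1]
```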

The predictions are in the range 0-20, so convert them to 0-6. Values 7-13 come from the s1 block

.py
for i in range(len(pred)):
    
    if (pred[i] == 7): 
        pred[i] = max_clusters[0]
    if (pred[i] == 8): 
        pred[i] = max_clusters[1]
    if (pred[i] == 9): 
        pred[i] = max_clusters[2]
    if (pred[i] == 10): 
        pred[i] = max_clusters[3]
    if (pred[i] == 11): 
        pred[i] = max_clusters[4]
    if (pred[i] == 12): 
        pred[i] = max_clusters[5]
    if (pred[i] == 13): 
        pred[i] = max_clusters[6]        

pred, min(pred), max(pred)
#(array([0, 6, 5, ..., 5, 1, 4]), 0, 20)

Repeat the same steps for s2, whose block covers the values 14-20

.py
clusters = np.zeros(shape=(7, 7), dtype=int)
for n1, n2 in zip(y, s2):
    clusters[n1, n2] += 1
    
clusters

array([[13570,    44,    47,    38,    35,     5,    36],
       [   18,    11,    30,    56,    24,     6, 11138],
       [   28,    73,    66,    41, 16089,    25,    36],
       [    2,    51,    27,     5,    24, 12908,     1],
       [   37,     6,    30, 16174,    30,     2,    64],
       [   26, 14526,    54,     8,    41,    27,    17],
       [   20,    58, 12278,    33,    88,    15,    32]])

.py
max_clusters = np.argmax(clusters, axis=0)
max_clusters

array([0, 5, 6, 4, 2, 3, 1])

.py
for i in range(len(pred)):
    
    if (pred[i] == 14): 
        pred[i] = max_clusters[0]
    if (pred[i] == 15): 
        pred[i] = max_clusters[1]
    if (pred[i] == 16): 
        pred[i] = max_clusters[2]
    if (pred[i] == 17): 
        pred[i] = max_clusters[3]
    if (pred[i] == 18): 
        pred[i] = max_clusters[4]
    if (pred[i] == 19): 
        pred[i] = max_clusters[5]
    if (pred[i] == 20): 
        pred[i] = max_clusters[6]        

pred, min(pred), max(pred)

(array([0, 6, 5, ..., 5, 1, 4]), 0, 6)
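
The two if-chains can also be written as one vectorized helper. This is a sketch of an equivalent approach, not the original notebook's code; `remap_block` is a hypothetical name, the two mapping arrays are the max_clusters values printed above, and the short `pred` vector is made up.

```python
import numpy as np

def remap_block(pred, mapping, offset):
    """Map labels in [offset, offset + len(mapping)) through `mapping`."""
    out = pred.copy()
    mask = (out >= offset) & (out < offset + len(mapping))
    out[mask] = mapping[out[mask] - offset]
    return out

# max_clusters for s1 and s2, as printed in the article
map_s1 = np.array([4, 0, 2, 6, 5, 3, 1])
map_s2 = np.array([0, 5, 6, 4, 2, 3, 1])

pred = np.array([0, 7, 13, 14, 20])          # made-up ensemble output
pred = remap_block(pred, map_s1, offset=7)   # 7-13  -> y labels
pred = remap_block(pred, map_s2, offset=14)  # 14-20 -> y labels
print(pred)
# [0 4 1 0 1]
```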

Submission

.py
sub = SAMPLE.copy()
sub['Predicted'] = pred
hist_data = [sub_prime.iloc[:, 0], pred]  
group_labels = ['Sub_Prime', 'Submission']
  
fig = ff.create_distplot(hist_data, group_labels, bin_size=.2, show_hist=False, show_rug=False) 
fig.show()

.py
sub.to_csv("submission.csv", index=False)
!ls

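Impressions

Because this is an unsupervised problem, a classifier on its own has limits
→ Use previously submitted predictions as labels and train in a supervised way

The data has many dimensions
→ Look at the correlation plot and train only on the strongly correlated features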
