
Machine learning: handling imbalanced data with k-NN in sklearn

Last updated at 2020-02-25, posted at 2020-02-23

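(Added 2020/02/25) TODO: The k-NN weights are currently computed from the sum of the distances rather than from the inverse of the distances. I plan to change this to the inverse of the distance. (The computation in KNeighborsClassifier is not wrong; the computation in my own function is.)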

Conclusion

With sklearn's KNeighborsClassifier, I was able to assign a heavier weight to the class with fewer samples in an imbalanced dataset.
As a result, recall for the minority class improved.
- before: confusion matrix
  [[2641   67]
   [ 167  125]]
- after: confusion matrix
  [[2252  456]
   [  80  212]]
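
To make the recall claim concrete, the two confusion matrices above can be turned into recall and precision for the minority class (class 1). This is a minimal sketch in plain Python; the helper names are my own, and it assumes the sklearn convention that rows are true classes and columns are predicted classes.

```python
# Confusion matrices reported above: rows are true classes, columns are predictions.
before = [[2641, 67], [167, 125]]
after = [[2252, 456], [80, 212]]

def recall_minority(cm):
    """Recall of class 1: true positives / all samples that truly are class 1."""
    false_neg, true_pos = cm[1]
    return true_pos / (false_neg + true_pos)

def precision_minority(cm):
    """Precision of class 1: true positives / all samples predicted as class 1."""
    false_pos, true_pos = cm[0][1], cm[1][1]
    return true_pos / (false_pos + true_pos)

for name, cm in (("before", before), ("after", after)):
    print(name, round(recall_minority(cm), 3), round(precision_minority(cm), 3))
```

Recall for class 1 rises from about 0.43 to about 0.73, while its precision drops from about 0.65 to about 0.32, which is the usual trade-off when the minority class is weighted more heavily.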

Illustration: before

Illustration: after

Background and problem

The behavior of the weights argument of sklearn.neighbors.KNeighborsClassifier was unclear to me, so I looked into it.
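
As a quick orientation, the weights argument accepts 'uniform', 'distance', or a callable that receives the array of neighbor distances and returns an array of the same shape. The sketch below compares the three on a toy 1-D dataset; the data, the helper predict_with and the function raw_distance are made up for illustration and are not part of this article's experiment.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 1-D training set (hypothetical data, not the article's dataset).
X_toy = np.array([[0.0], [1.0], [3.0]])
y_toy = np.array([0, 0, 1])
query = np.array([[2.5]])  # distances to the training points: 2.5, 1.5, 0.5

def predict_with(weights):
    """Fit a 3-NN classifier with the given `weights` setting and classify `query`."""
    model = KNeighborsClassifier(n_neighbors=3, weights=weights)
    model.fit(X_toy, y_toy)
    return int(model.predict(query)[0])

def raw_distance(dist):
    # A callable receives the (n_queries, n_neighbors) distance array and must
    # return an array of the same shape holding one weight per neighbor.
    return dist

pred_uniform = predict_with("uniform")      # every neighbor counts as 1
pred_distance = predict_with("distance")    # weight = 1 / distance
pred_callable = predict_with(raw_distance)  # weight = the distance itself
print(pred_uniform, pred_distance, pred_callable)
```

Here 'uniform' gives class 0 (two votes to one), 'distance' gives class 1 (1/0.5 = 2.0 against 1/2.5 + 1/1.5, roughly 1.07), and the raw-distance callable gives class 0 again (2.5 + 1.5 = 4.0 against 0.5).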

Method

Without any handling of the imbalanced data

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score

%matplotlib inline

from sklearn.datasets import make_classification
data_base = make_classification(
    n_samples = 10000, n_features = 2, n_informative = 2, n_redundant = 0, 
    n_repeated = 0, n_classes = 2, n_clusters_per_class = 2, weights = [0.9, 0.1], 
    flip_y = 0, class_sep = 0.5, hypercube = True, shift = 0.0, 
    scale = 1.0, shuffle = True, random_state =5)

df = pd.DataFrame(data_base[0], columns = ['f1', 'f2'])
df['class'] = data_base[1]

fig = plt.figure()
ax = fig.add_subplot()
# Plot each class as its own series of small markers
for _, cls in df.groupby('class'):
    ax.plot(cls['f1'], cls['f2'], 'o', ms=2)

plt.show()

X = df[["f1","f2"]]
y = df["class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)

print("train", X_train.shape, y_train.shape)
print("test", X_test.shape, y_test.shape)

train (7000, 2) (7000,)
test (3000, 2) (3000,)


from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
    
result   = confusion_matrix(y_test, pred)
result2 = accuracy_score(y_test, pred)

print("confusion matrix \n",result) 
print("accuracy \n", result2 )

confusion matrix
[[2641 67]
[ 167 125]]
accuracy
0.922

With handling of the imbalanced data

First, compute each class's weight as the inverse of its share of the samples.

size_and_weight = pd.DataFrame({
                'class0': [sum(clf._y == 0),1/ (sum(clf._y == 0)/ len(clf._y))],
                'class1': [sum(clf._y == 1),1/ (sum(clf._y == 1)/ len(clf._y))]}).T
size_and_weight.columns = ['sample_size', 'weight']
size_and_weight

	sample_size	weight
class0	6292.0	1.112524
class1	708.0	9.887006
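
The same inverse-share weights can be checked by hand without pandas. This is a minimal sketch that only assumes the class sizes shown in the table above (6292 and 708 training samples); the variable names are my own.

```python
from collections import Counter

# Training labels with the class sizes shown in the table above.
labels = [0] * 6292 + [1] * 708

counts = Counter(labels)
n_total = len(labels)

# weight = 1 / (class share) = n_total / class count
class_weight = {cls: n_total / n for cls, n in counts.items()}
for cls in sorted(class_weight):
    print(cls, counts[cls], round(class_weight[cls], 6))
```

This gives 7000 / 6292, roughly 1.112524, for class 0 and 7000 / 708, roughly 9.887006, for class 1, matching the table.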

Finish fitting on the training data first, then run the distance computation against the test data as well.



# Map each training label (0 or 1) to its class weight.
# rename_categories is the supported way to relabel the categories.
weights_array = pd.Categorical(clf._y).rename_categories(
    [size_and_weight.loc['class0', 'weight'],
     size_and_weight.loc['class1', 'weight']])

clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
neigh_dist, neigh_ind = clf.kneighbors(X_test)  # these arrays are explained later

weights_array = np.array(weights_array).reshape((-1, 1))[neigh_ind,0]
pd.DataFrame(weights_array).head()

	0	1	2	3	4
0	1.112524	1.112524	1.112524	1.112524	1.112524
1	1.112524	1.112524	1.112524	1.112524	1.112524
2	1.112524	9.887006	1.112524	1.112524	1.112524
3	1.112524	1.112524	1.112524	1.112524	1.112524
4	1.112524	1.112524	1.112524	1.112524	1.112524

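The table above completes the weights for handling the imbalanced data.
To take these weights into account, set the weights argument and run through to prediction.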

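def tmp(array_):
    # array_ holds the neighbor distances, shape (n_test, n_neighbors).
    # Multiply them element-wise by the class weights prepared above.
    # This only works when predicting on X_test, because weights_array
    # was built from the neighbors of X_test.
    return array_ * weights_array

clf = KNeighborsClassifier(n_neighbors=5, weights=tmp)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)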
result   = confusion_matrix(y_test, pred)
result2 = accuracy_score(y_test, pred)

print("confusion matrix \n",result) 
print("accuracy \n", result2 )

confusion matrix
[[2252 456]
[ 80 212]]
accuracy
0.8213333333333334
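
The note at the top of this article says the weights should use the inverse of the distance rather than the distance itself. A minimal sketch of what such a per-neighbor weight could look like is shown below; the function name, the eps guard and the toy numbers are my own assumptions, and this is not the function that produced the results above.

```python
import numpy as np

def inverse_distance_class_weight(dist, neighbor_class_weight, eps=1e-12):
    """Per-neighbor weight = class weight / distance (hypothetical helper).

    `eps` guards against division by zero when a query coincides with a
    training point; it is an assumption of this sketch.
    """
    return neighbor_class_weight / np.maximum(dist, eps)

# One query with three neighbors: two near majority-class points and one
# farther minority-class point (toy numbers).
dist = np.array([[0.1, 0.2, 0.4]])
neighbor_class_weight = np.array([[1.0, 1.0, 10.0]])

weights = inverse_distance_class_weight(dist, neighbor_class_weight)
print(weights)
```

With these toy numbers the minority-class neighbor ends up with a weight of about 25 against about 10 + 5 for the two majority-class neighbors. To use such a function as the weights argument, the class weights would still have to be supplied from outside, as tmp does with weights_array.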

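Conclusion

KNeighborsClassifier assigns each test point to the class whose neighbors have the largest summed weight. With the default weights='uniform' every neighbor has weight 1, so this is a plain majority vote. Because my function returns the distances as weights, the test point goes to the class with the largest sum of distances. (Being assigned to the farther class is somewhat counterintuitive, but the distances are only computed for the n nearest points, so without the class weights the estimate stays closer to a majority vote than to a true sum of distances.)
I customized this to handle imbalanced data by replacing that sum with the sum of the element-wise product of the distances and weights_array.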
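
The voting rule can be sketched in a few lines of plain Python. The example below uses the neighbor pattern of row 2 in the weights_array table (one minority-class neighbor among five) and, as a simplifying assumption, leaves the distance factor out so that only the class weights matter; weighted_vote is a hypothetical helper, not sklearn code.

```python
from collections import defaultdict

def weighted_vote(neighbor_classes, neighbor_weights):
    """Return the class whose neighbors have the largest summed weight."""
    totals = defaultdict(float)
    for cls, w in zip(neighbor_classes, neighbor_weights):
        totals[cls] += w
    return max(totals, key=totals.get)

# Five neighbors: one minority-class point among four majority-class points.
neighbor_classes = [0, 1, 0, 0, 0]

uniform = [1.0] * 5
class_only = [1.112524, 9.887006, 1.112524, 1.112524, 1.112524]

vote_uniform = weighted_vote(neighbor_classes, uniform)
vote_weighted = weighted_vote(neighbor_classes, class_only)
print(vote_uniform, vote_weighted)
```

With uniform weights class 0 wins four to one; with the class weights the single class-1 neighbor (9.887006) outweighs the four class-0 neighbors (4 x 1.112524, about 4.45), so the vote flips to class 1.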

Details: the custom argument

Replacing that sum with the element-wise product with weights_array requires understanding how the following data structures correspond to one another.

X_train
X_test
y_train
neigh_dist <- distances from the X_train / X_test distance computation
neigh_ind <- indices from the X_train / X_test distance computation
class of neigh_ind

Understanding the correspondence between neigh_dist, neigh_ind, and y_train (clf._y) from tables.

neigh_dist, neigh_ind = clf.kneighbors(X_test)
pd.DataFrame(neigh_dist).tail(5)
pd.DataFrame(neigh_ind).tail(5)
pd.DataFrame(clf._y.reshape((-1, 1))[neigh_ind,0]).tail(5)

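The outputs above are, from left to right, neigh_dist, neigh_ind, and "class of neigh_ind".

Observation 1: The three tables above have the same shape.
Observation 2: [number of rows in the three tables] = [number of rows in the test data]
Observation 3: number of columns = [n_neighbors=5]
Observation 4: In neigh_dist, the values increase from left to right. In other words, the five training points nearest to each test point appear to have been extracted.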
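
These observations can be reproduced with a brute-force nearest-neighbor search in NumPy. The sketch below is my own toy re-implementation on random data, not sklearn's actual code path; it only illustrates why the outputs have shape (number of test rows, n_neighbors) and why each row is sorted by distance.

```python
import numpy as np

def brute_force_kneighbors(train, test, k):
    """Return (distances, indices) of the k nearest training points per test point."""
    # Pairwise Euclidean distances, shape (n_test, n_train)
    diff = test[:, None, :] - train[None, :, :]
    all_dist = np.sqrt((diff ** 2).sum(axis=2))
    # Indices of the k smallest distances, ordered nearest first
    ind = np.argsort(all_dist, axis=1)[:, :k]
    dist = np.take_along_axis(all_dist, ind, axis=1)
    return dist, ind

rng = np.random.default_rng(0)
train = rng.normal(size=(50, 2))  # toy stand-in for X_train
test = rng.normal(size=(8, 2))    # toy stand-in for X_test

dist, ind = brute_force_kneighbors(train, test, k=5)
print(dist.shape, ind.shape)
```

Both arrays have one row per test point and one column per neighbor, and the distances in each row are in ascending order.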

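Understanding the correspondence between neigh_dist, X_train, and X_test by computing neigh_dist

Compute the following value in the neigh_dist DataFrame:
- index = 2998 # the test sample at index 2998
- values = 0.015318 # the distance between [the X_train sample at index 1374, judged to be the nearest] and [the test sample at that index]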

test_index = 2998
tmp1 = pd.DataFrame(X_test.iloc[test_index])
display(tmp1.T)

train_index = 1374
tmp2 = pd.DataFrame(X_train.iloc[train_index])
display(tmp2.T)

# Compute the Euclidean distance
(
sum(   (tmp1.values - tmp2.values)  **2    )
**(1/2)
)

array([0.01531811])

The value at row 2998, column 0 of neigh_dist could thus be computed from the training data and the test data.
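
For reference, the same Euclidean distance formula written as a small standalone function. The points here are toy values chosen so the result is easy to check by hand; they are not the rows from X_train and X_test.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points given as equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Toy points (not the article's rows): a 3-4-5 right triangle
d = euclidean((0.0, 0.0), (3.0, 4.0))
print(d)
```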

Correspondence between y_train and clf._y

sum(clf._y == y_train) == len(y_train)

True

This shows that y_train and clf._y match.
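
Because the fitted labels line up with the training rows, a neighbor index can be turned into a class label, and from there into a class weight, with NumPy integer-array indexing. A minimal sketch with toy arrays follows; y_fit and neigh_ind_toy are made-up stand-ins for clf._y and neigh_ind.

```python
import numpy as np

# Toy stand-ins: labels of 6 training points and neighbor indices for 2 test points
y_fit = np.array([0, 0, 1, 0, 1, 0])
neigh_ind_toy = np.array([[2, 0, 5],
                          [4, 2, 1]])

# Integer-array indexing: the result has the same shape as neigh_ind_toy
neigh_class = y_fit[neigh_ind_toy]

# Index a per-class weight vector by class label to get one weight per neighbor
class_weights = np.array([1.112524, 9.887006])
neigh_weight = class_weights[neigh_class]
print(neigh_class)
print(neigh_weight)
```

The result keeps the (test rows, neighbors) shape, which is what allows it to be multiplied element-wise with the distance array inside a weights function.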

Correspondence between neigh_ind and "class of neigh_ind"

index_ = neigh_ind[2998,:]
pd.DataFrame(clf._y[index_]).T

Row 2998 of "class of neigh_ind" could thus be built from neigh_ind and y_train.

Closing remarks

The section [Details: the custom argument] only walks through the source code of KNeighborsClassifier.predict, so reading the source code on git may be quicker. Reference: sklearn on git
