
[Rabbit Challenge] Machine Learning, Chapter 5: Algorithms

Machine Learning

Last updated at 2019-06-26, posted at 2019-06-23

k-Nearest Neighbors (kNN)

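A machine learning method for classification problems.
It assigns the most frequent class label among the $k$ nearest neighbors.
Changing $k$ changes the result.
- Increasing $k$ makes the decision boundary smoother.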
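
To make the effect of $k$ concrete, here is a minimal sketch of the majority vote on a hypothetical one-dimensional toy dataset (the data and the helper `knn_vote` are made up for illustration and are not part of the notebook). A single training point with label 1 sits among points with label 0; with $k=1$ it decides the prediction for nearby queries, while with $k=3$ it is outvoted.

```python
import numpy as np

# Hypothetical toy data: 1-D points with binary labels.
# The point at 1.0 is the only one labeled 1.
x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = np.array([0, 1, 0, 0, 0])

def knn_vote(x_query, k):
    """Return the majority label among the k training points nearest to x_query."""
    nearest = np.argsort(np.abs(x_train - x_query))[:k]
    return int(np.bincount(y_train[nearest]).argmax())

pred_k1 = knn_vote(1.1, k=1)  # only the nearest neighbor (the point at 1.0) votes
pred_k3 = knn_vote(1.1, k=3)  # the points at 1.0, 2.0 and 0.0 vote
print(pred_k1, pred_k3)
```

The isolated label wins at $k=1$ and loses at $k=3$, which is the mechanism behind the smoother decision boundary for larger $k$.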

Hands-on

np_knn.ipynb

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

Generating training data

np_knn.ipynb

def gen_data():
    x0 = np.random.normal(size=50).reshape(-1, 2) - 1
    x1 = np.random.normal(size=50).reshape(-1, 2) + 1.
    x_train = np.concatenate([x0, x1])
    y_train = np.concatenate([np.zeros(25), np.ones(25)]).astype(int)
    return x_train, y_train

np_knn.ipynb

X_train, ys_train = gen_data()
plt.scatter(X_train[:, 0], X_train[:, 1], c=ys_train)


Training

There is no explicit training step.

Prediction

Each query point is assigned the most frequent label (the mode) among the $k$ training points closest to it.

np_knn.ipynb

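def distance(x1, x2):
    # Squared Euclidean distance; the square root is omitted because it does not change the ranking
    return np.sum((x1 - x2)**2, axis=1)

def knc_predict(n_neighbors, x_train, y_train, X_test):
    y_pred = np.empty(len(X_test), dtype=y_train.dtype)
    for i, x in enumerate(X_test):
        # Distances from the query point to every training point
        # (uses the x_train argument rather than the global X_train)
        distances = distance(x, x_train)
        # Indices of the n_neighbors closest training points
        nearest_index = distances.argsort()[:n_neighbors]
        # Most frequent label among those neighbors
        mode, _ = stats.mode(y_train[nearest_index])
        y_pred[i] = mode
    return y_pred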

def plt_resut(x_train, y_train, y_pred):
    xx0, xx1 = np.meshgrid(np.linspace(-5, 5, 100), np.linspace(-5, 5, 100))
    xx = np.array([xx0, xx1]).reshape(2, -1).T
    plt.scatter(x_train[:, 0], x_train[:, 1], c=y_train)
    plt.contourf(xx0, xx1, y_pred.reshape(100, 100).astype(dtype=float), alpha=0.2, levels=np.linspace(0, 1, 3))

np_knn.ipynb

n_neighbors = 3

xx0, xx1 = np.meshgrid(np.linspace(-5, 5, 100), np.linspace(-5, 5, 100))
X_test = np.array([xx0, xx1]).reshape(2, -1).T

y_pred = knc_predict(n_neighbors, X_train, ys_train, X_test)
plt_resut(X_train, ys_train, y_pred)

scikit-learn implementation

np_knn.ipynb

from sklearn.neighbors import KNeighborsClassifier
knc = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X_train, ys_train)
# Predict on the same grid X_test that was used above
plt_resut(X_train, ys_train, knc.predict(X_test))

Discussion

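The method is intuitive.
The result depends on $k$, so the question is how to choose $k$.
- Perhaps run the algorithm several times with $k$ as a parameter and pick the value that minimizes an objective function?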
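
One common way to act on this idea is cross-validation: instead of an objective on the training data (which $k=1$ fits perfectly), compare held-out accuracy across candidate values of $k$. The sketch below is an assumption on my part rather than part of the course notebook; the data mirrors `gen_data()` above, and the candidate list and the seed are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
# Two Gaussian blobs of 25 points each, mirroring gen_data()
X = np.concatenate([rng.normal(size=(25, 2)) - 1, rng.normal(size=(25, 2)) + 1])
y = np.concatenate([np.zeros(25), np.ones(25)]).astype(int)

candidate_ks = [1, 3, 5, 7, 9]  # hypothetical search range
mean_scores = {}
for k in candidate_ks:
    # 5-fold cross-validated accuracy for this value of k
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    mean_scores[k] = scores.mean()

# Pick the k with the highest mean held-out accuracy
best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores, best_k)
```

With only 50 points the scores are noisy, so the selected $k$ should be read as a rough guide rather than a definitive answer.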

k-means

An unsupervised learning method.
A clustering technique.
It partitions the given data into $k$ clusters.
$k$ must be chosen in advance.

The k-means algorithm

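1. Set initial values for the cluster centers.
2. For each data point, compute the distance to every cluster center and assign the point to the nearest cluster.
3. Compute the mean vector (center) of each cluster.
4. Repeat steps 2 and 3 until convergence.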
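
The alternation between assignment and mean update can be read as coordinate descent on the within-cluster sum of squares. The notation below ($N$, $x_n$, $\mu_j$, $r_{nj}$) is introduced here for illustration and does not appear in the original notes:

$$
J = \sum_{n=1}^{N} \sum_{j=1}^{k} r_{nj} \, \lVert x_n - \mu_j \rVert^2, \qquad r_{nj} \in \{0, 1\}, \quad \sum_{j=1}^{k} r_{nj} = 1
$$

The assignment step minimizes $J$ over the indicators $r_{nj}$ with the centers fixed, and the mean update minimizes $J$ over the centers with the assignments fixed, which gives $\mu_j = \sum_n r_{nj} x_n / \sum_n r_{nj}$. Neither step can increase $J$, so the iteration converges, though in general only to a local minimum.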

Hands-on

The assignment page was the one for logistic regression, so the actual task specification is unclear.
I work through np_kmeans.ipynb instead.

np_kmeans.ipynb

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

Generating data

np_kmeans.ipynb

def gen_data():
    x1 = np.random.normal(size=(100, 2)) + np.array([-5, -5])
    x2 = np.random.normal(size=(100, 2)) + np.array([5, -5])
    x3 = np.random.normal(size=(100, 2)) + np.array([0, 5])
    return np.vstack((x1, x2, x3))

np_kmeans.ipynb

# Generate the data
X_train = gen_data()
# Plot the data
plt.scatter(X_train[:, 0], X_train[:, 1])


np_kmeans.ipynb

def distance(x1, x2):
    return np.sum((x1 - x2)**2, axis=1)

n_clusters = 3
iter_max = 100

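# Initialize the cluster centers with randomly chosen data points
centers = X_train[np.random.choice(len(X_train), n_clusters, replace=False)]

for _ in range(iter_max):
    prev_centers = np.copy(centers)
    D = np.zeros((len(X_train), n_clusters))
    # Compute the distance from each data point to each cluster center
    for i, x in enumerate(X_train):
        D[i] = distance(x, centers)
    # Assign each data point to its nearest cluster
    cluster_index = np.argmin(D, axis=1)
    # Recompute each cluster center as the mean of its members
    for k in range(n_clusters):
        index_k = cluster_index == k
        # Skip empty clusters so the center does not become NaN
        if np.any(index_k):
            centers[k] = np.mean(X_train[index_k], axis=0)
    # Convergence check: stop once the centers no longer move
    if np.allclose(prev_centers, centers):
        break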
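
As a side note, the inner loop that fills `D` row by row can also be written with NumPy broadcasting. The sketch below uses a small made-up dataset (not the notebook's `X_train`) and only checks that the two formulations produce the same distance matrix; it is an optional alternative, not how the notebook does it.

```python
import numpy as np

def distance(x1, x2):
    # Squared Euclidean distance, as in the notebook
    return np.sum((x1 - x2)**2, axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))   # hypothetical data: 6 points in 2-D
centers = X[:3].copy()        # hypothetical centers: the first 3 points

# Loop version: one row of distances per data point
D_loop = np.zeros((len(X), len(centers)))
for i, x in enumerate(X):
    D_loop[i] = distance(x, centers)

# Broadcast version: (6, 1, 2) - (1, 3, 2) -> (6, 3, 2), then sum over the last axis
D_vec = np.sum((X[:, None, :] - centers[None, :, :])**2, axis=2)

print(np.allclose(D_loop, D_vec))
```

The broadcast form allocates an intermediate array of shape (points, clusters, dimensions), so it trades memory for speed.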

Clustering result

np_kmeans.ipynb

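def plt_result(X_train, centers, xx):
    # Label each training point by its nearest center, so the point colors
    # always match the centers passed in rather than a global y_pred
    labels = np.array([np.argmin(distance(x, centers)) for x in X_train])
    # Plot the data
    plt.scatter(X_train[:, 0], X_train[:, 1], c=labels, cmap='spring')
    # Plot the cluster centers
    plt.scatter(centers[:, 0], centers[:, 1], s=200, marker='X', lw=2, c='black', edgecolor="white")
    # Plot the region assigned to each cluster
    pred = np.empty(len(xx), dtype=int)
    for i, x in enumerate(xx):
        d = distance(x, centers)
        pred[i] = np.argmin(d)
    # xx0 and xx1 are the global meshgrid arrays from which xx was built
    plt.contourf(xx0, xx1, pred.reshape(100, 100), alpha=0.2, cmap='spring')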

np_kmeans.ipynb

y_pred = np.empty(len(X_train), dtype=int)
for i, x in enumerate(X_train):
    d = distance(x, centers)
    y_pred[i] = np.argmin(d)

np_kmeans.ipynb

xx0, xx1 = np.meshgrid(np.linspace(-10, 10, 100), np.linspace(-10, 10, 100))
xx = np.array([xx0, xx1]).reshape(2, -1).T

plt_result(X_train, centers, xx)

scikit-learn implementation

np_kmeans.ipynb

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, random_state=0).fit(X_train)

np_kmeans.ipynb

print("labels: {}".format(kmeans.labels_))
print("cluster_centers: {}".format(kmeans.cluster_centers_))
kmeans.cluster_centers_

labels: [2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1]
cluster_centers: [[ 4.9692623  -4.84907152]
 [-0.0671198   4.9758858 ]
 [-5.13504067 -4.95842931]]





array([[ 4.9692623 , -4.84907152],
       [-0.0671198 ,  4.9758858 ],
       [-5.13504067, -4.95842931]])

np_kmeans.ipynb

plt_result(X_train, kmeans.cluster_centers_, xx)

Discussion

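The method is intuitive.
As with kNN, the result depends on the value of $k$, and additionally on the initial cluster centers.
- The algorithm always converges to a local optimum, so running it multiple times with $k$ and the initial centers as parameters and searching for the best solution looks like a reasonable approach.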
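
A minimal sketch of that search with scikit-learn follows, under a few assumptions of mine: the data mirrors `gen_data()` from the k-means notebook, the range of $k$ is arbitrary, and `n_init=10` asks `KMeans` to restart from 10 initializations and keep the run with the lowest within-cluster sum of squares (`inertia_`).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
# Three Gaussian blobs of 100 points each, mirroring gen_data()
X = np.vstack([
    rng.normal(size=(100, 2)) + np.array([-5, -5]),
    rng.normal(size=(100, 2)) + np.array([5, -5]),
    rng.normal(size=(100, 2)) + np.array([0, 5]),
])

inertias = {}
for k in range(1, 7):  # hypothetical search range for k
    # n_init=10: run k-means from 10 initializations and keep the best run
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

One caveat: the restarts address the dependence on initial centers, but the inertia typically keeps shrinking as $k$ grows, so minimizing it over $k$ would just favor the largest $k$. A common heuristic is to look for the "elbow" where the decrease levels off, which for this data should be around $k=3$.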
