
【Rabbit Challenge】【Machine Learning】Algorithms


Posted at 2020-01-06, last updated at 2020-01-06

Report submitted for the Rabbit Challenge

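1. k-nearest neighbors (k-NN)

■　A machine learning method for classification problems.
　●　Take the $k$ data points nearest to the query point and assign the class to which most of them belong.
■　Changing $k$ changes the result. With $k=1$ the method is called the nearest-neighbor method.
■　Increasing $k$ makes the decision boundary smoother.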
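
To make the majority vote concrete, here is a minimal sketch on hypothetical one-dimensional toy data (the arrays and the helper name `knn_vote` are illustrative assumptions, not part of the report). It shows how the predicted class for the same query point can change with $k$.

```python
import numpy as np

# Hypothetical 1-D toy data: three points of class 0 and three of class 1.
x_train = np.array([0.0, 1.0, 2.4, 2.6, 4.0, 5.0])
y_train = np.array([0, 0, 1, 0, 1, 1])

def knn_vote(x, k):
    """Return the majority label among the k training points nearest to x."""
    order = np.argsort(np.abs(x_train - x))   # indices sorted by distance to x
    votes = y_train[order[:k]]                # labels of the k nearest points
    return int(np.bincount(votes).argmax())   # most frequent label

for k in (1, 3, 5):
    print(k, knn_vote(2.45, k))
```

With $k=1$ the single nearest point (2.4, class 1) decides the label, while with $k=3$ or $k=5$ the class-0 neighbors outvote it, so the prediction flips to class 0.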

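2. k-means

【Overview】

■　Unsupervised learning
■　A clustering method (groups together items with similar features)
■　Partitions the given data into $k$ clusters

【Algorithm】

1) Set an initial value for each cluster center
2) For each data point, compute the distance to every cluster center and assign the point to the nearest cluster
3) Compute the mean vector (center) of each cluster
4) Repeat steps 2 and 3 until convergence

■　Changing the initial centers can change the clustering result
■　Changing the value of $k$ also changes the clustering result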
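
Why the initial centers matter can be read off the objective that k-means minimizes. As a sketch, in notation that does not appear in the original report ($N$ data points $x_n$, cluster centers $\mu_j$, and indicators $r_{nj}$ that equal 1 when $x_n$ is assigned to cluster $j$ and 0 otherwise), the within-cluster sum of squares is

$$
J = \sum_{n=1}^{N} \sum_{j=1}^{k} r_{nj} \, \lVert x_n - \mu_j \rVert^2
$$

The assignment step minimizes $J$ over the $r_{nj}$ with the centers held fixed, and the update step minimizes $J$ over the centers with the assignments held fixed, which gives the cluster mean:

$$
\mu_j = \frac{\sum_{n} r_{nj} \, x_n}{\sum_{n} r_{nj}}
$$

Neither step can increase $J$, so the iteration converges, but only to a local minimum, and which local minimum is reached depends on the starting centers.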

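3. Hands-on

k-nearest neighbors (k-NN)

【Implementation exercise results】

　●　Setting: classify synthetic (artificially generated) data
　●　Task: plot the synthetic data together with the classification result

■　Importing the required modules and data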

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

■　Generating the training data

def gen_data():
    x0 = np.random.normal(size=50).reshape(-1, 2) - 1
    x1 = np.random.normal(size=50).reshape(-1, 2) + 1.
    x_train = np.concatenate([x0, x1])
    y_train = np.concatenate([np.zeros(25), np.ones(25)]).astype(int)
    return x_train, y_train

X_train, ys_train = gen_data()
plt.scatter(X_train[:, 0], X_train[:, 1], c=ys_train)

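■　Training
There is no explicit training step.
■　Prediction
Each point to be predicted is assigned the most frequent label (the mode) among the $k$ training points closest to it.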

def distance(x1, x2):
    return np.sum((x1 - x2)**2, axis=1)

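def knc_predict(n_neighbors, x_train, y_train, X_test):
    y_pred = np.empty(len(X_test), dtype=y_train.dtype)
    for i, x in enumerate(X_test):
        # Squared distances from x to every training point
        # (use the x_train argument, not the global X_train)
        distances = distance(x, x_train)
        # Indices of the n_neighbors closest training points
        nearest_index = distances.argsort()[:n_neighbors]
        # Most frequent label among the neighbors; np.ravel handles both the
        # array result of older SciPy and the scalar result of newer SciPy
        mode, _ = stats.mode(y_train[nearest_index])
        y_pred[i] = np.ravel(mode)[0]
    return y_pred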

def plt_resut(x_train, y_train, y_pred):
    xx0, xx1 = np.meshgrid(np.linspace(-5, 5, 100), np.linspace(-5, 5, 100))
    xx = np.array([xx0, xx1]).reshape(2, -1).T
    plt.scatter(x_train[:, 0], x_train[:, 1], c=y_train)
    plt.contourf(xx0, xx1, y_pred.reshape(100, 100).astype(float), alpha=0.2, levels=np.linspace(0, 1, 3))

n_neighbors = 3
xx0, xx1 = np.meshgrid(np.linspace(-5, 5, 100), np.linspace(-5, 5, 100))
X_test = np.array([xx0, xx1]).reshape(2, -1).T

y_pred = knc_predict(n_neighbors, X_train, ys_train, X_test)
plt_resut(X_train, ys_train, y_pred)
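
The per-point loop above can also be written with NumPy broadcasting. The following is a minimal sketch under a few assumptions: labels are non-negative integers, ties go to the smaller label, and the function name `knc_predict_vectorized` and the small random data set are illustrative rather than taken from the report.

```python
import numpy as np

def knc_predict_vectorized(n_neighbors, x_train, y_train, x_test):
    """k-NN prediction without a Python loop over test points.

    Assumes the labels in y_train are non-negative integers.
    """
    # Pairwise squared distances, shape (n_test, n_train), via broadcasting.
    diff = x_test[:, None, :] - x_train[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    # Indices of the n_neighbors closest training points for every test point.
    nearest = np.argsort(sq_dist, axis=1)[:, :n_neighbors]
    neighbor_labels = y_train[nearest]            # shape (n_test, n_neighbors)
    # Majority vote per row: count each label, then take the most frequent one.
    n_classes = int(y_train.max()) + 1
    counts = np.stack(
        [(neighbor_labels == c).sum(axis=1) for c in range(n_classes)], axis=1
    )
    return counts.argmax(axis=1).astype(y_train.dtype)

# Small check on hypothetical data shaped like the report's training set.
rng = np.random.default_rng(0)
x_train = np.concatenate([rng.normal(size=(25, 2)) - 1, rng.normal(size=(25, 2)) + 1])
y_train = np.concatenate([np.zeros(25), np.ones(25)]).astype(int)
x_test = np.array([[-3.0, -3.0], [3.0, 3.0]])
print(knc_predict_vectorized(3, x_train, y_train, x_test))
```

The trade-off is memory: the intermediate array holds one entry per (test point, training point, feature) triple, which is fine for a 100 x 100 grid and 50 training points but grows quickly for larger data.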

■　scikit-learn implementation

xx0, xx1 = np.meshgrid(np.linspace(-5, 5, 100), np.linspace(-5, 5, 100))
xx = np.array([xx0, xx1]).reshape(2, -1).T

from sklearn.neighbors import KNeighborsClassifier
knc = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X_train, ys_train)
plt_resut(X_train, ys_train, knc.predict(xx))

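【Discussion】

■　Using k-NN, the data could be classified.
■　To check whether the decision boundary becomes smoother, I increased the value of the parameter above (n_neighbors).
　●　With n_neighbors = 20, the decision boundary did become smoother, as the result below shows.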

n_neighbors = 20

xx0, xx1 = np.meshgrid(np.linspace(-5, 5, 100), np.linspace(-5, 5, 100))
X_test = np.array([xx0, xx1]).reshape(2, -1).T

y_pred = knc_predict(n_neighbors, X_train, ys_train, X_test)
plt_resut(X_train, ys_train, y_pred)
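
The smoothing effect can also be checked numerically instead of only by eye. The sketch below uses a rough proxy that is my own assumption, not something from the report: it counts how many pairs of adjacent grid cells receive different predicted labels, so a more jagged boundary gives a larger count. The helper names and the seeded data set are hypothetical, and the two classes are assumed to be labeled 0 and 1.

```python
import numpy as np

def knn_grid_labels(k, x_train, y_train, grid_points):
    """Predict 0/1 labels for all grid points with k-NN (binary labels assumed)."""
    diff = grid_points[:, None, :] - x_train[None, :, :]
    sq_dist = (diff ** 2).sum(axis=2)
    nearest = np.argsort(sq_dist, axis=1)[:, :k]
    # Predict 1 only when more than half of the neighbors are 1 (ties go to 0).
    return (y_train[nearest].mean(axis=1) > 0.5).astype(int)

def boundary_length(labels_2d):
    """Count adjacent grid-cell pairs whose predicted labels differ."""
    horizontal = np.sum(labels_2d[:, 1:] != labels_2d[:, :-1])
    vertical = np.sum(labels_2d[1:, :] != labels_2d[:-1, :])
    return int(horizontal + vertical)

# Hypothetical data generated the same way as the report's training set.
rng = np.random.default_rng(0)
x_train = np.concatenate([rng.normal(size=(25, 2)) - 1, rng.normal(size=(25, 2)) + 1])
y_train = np.concatenate([np.zeros(25), np.ones(25)]).astype(int)

g = np.linspace(-5, 5, 100)
xx0, xx1 = np.meshgrid(g, g)
grid = np.array([xx0, xx1]).reshape(2, -1).T

for k in (1, 3, 20):
    labels = knn_grid_labels(k, x_train, y_train, grid).reshape(100, 100)
    print(k, boundary_length(labels))
```

One would expect the count to drop as $k$ grows, since isolated islands around single training points disappear, although this is not guaranteed for every random draw.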

k-means

【Implementation exercise results】

■　Importing the required modules

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

■　Generating the data

def gen_data():
    x1 = np.random.normal(size=(100, 2)) + np.array([-5, -5])
    x2 = np.random.normal(size=(100, 2)) + np.array([5, -5])
    x3 = np.random.normal(size=(100, 2)) + np.array([0, 5])
    return np.vstack((x1, x2, x3))

# Create the data
X_train = gen_data()
# Plot the data
plt.scatter(X_train[:, 0], X_train[:, 1])

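■　Training
The k-means algorithm is as follows:
1) Set an initial value for each cluster center
2) For each data point, compute the distance to every cluster center and assign the point to the nearest cluster
3) Compute the mean vector (center) of each cluster
4) Repeat steps 2 and 3 until convergence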

def distance(x1, x2):
    return np.sum((x1 - x2)**2, axis=1)

n_clusters = 3
iter_max = 100

# Initialize the cluster centers with randomly chosen data points
centers = X_train[np.random.choice(len(X_train), n_clusters, replace=False)]

for _ in range(iter_max):
    prev_centers = np.copy(centers)
    D = np.zeros((len(X_train), n_clusters))
    # Compute the squared distance from each data point to each cluster center
    for i, x in enumerate(X_train):
        D[i] = distance(x, centers)
    # Assign each data point to its nearest cluster
    cluster_index = np.argmin(D, axis=1)
    # Recompute each cluster center as the mean of its members
    for k in range(n_clusters):
        index_k = cluster_index == k
        centers[k] = np.mean(X_train[index_k], axis=0)
    # Convergence check: stop once the centers no longer move
    if np.allclose(prev_centers, centers):
        break

■　Clustering result

def plt_result(X_train, centers, xx):
    # Plot the data, colored by cluster (uses the global y_pred computed below)
    plt.scatter(X_train[:, 0], X_train[:, 1], c=y_pred, cmap='spring')
    # Plot the cluster centers
    plt.scatter(centers[:, 0], centers[:, 1], s=200, marker='X', lw=2, c='black', edgecolor="white")
    # Plot the region assigned to each cluster (uses the global xx0, xx1 grid)
    pred = np.empty(len(xx), dtype=int)
    for i, x in enumerate(xx):
        d = distance(x, centers)
        pred[i] = np.argmin(d)
    plt.contourf(xx0, xx1, pred.reshape(100, 100), alpha=0.2, cmap='spring')

y_pred = np.empty(len(X_train), dtype=int)
for i, x in enumerate(X_train):
    d = distance(x, centers)
    y_pred[i] = np.argmin(d)

xx0, xx1 = np.meshgrid(np.linspace(-10, 10, 100), np.linspace(-10, 10, 100))
xx = np.array([xx0, xx1]).reshape(2, -1).T

plt_result(X_train, centers, xx)

■　scikit-learn implementation

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, random_state=0).fit(X_train)

print("labels: {}".format(kmeans.labels_))
print("cluster_centers: {}".format(kmeans.cluster_centers_))
kmeans.cluster_centers_

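【Discussion】

■　The k-NN and k-means algorithms are both easy to understand.
■　A drawback of k-means is that the result depends strongly on the initial values.
■　The number of clusters has to be decided in advance.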
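
Both points can be probed with a small experiment. The sketch below is one possible setup, not the report's method: it reruns a plain NumPy k-means from several random initializations and for several cluster counts, then prints the within-cluster sum of squared distances for each run. The function `kmeans`, the seeds, and the choice to keep the old center when a cluster is empty are all assumptions made for illustration.

```python
import numpy as np

def kmeans(X, n_clusters, rng, iter_max=100):
    """Plain k-means; returns (centers, labels, within-cluster sum of squares)."""
    # Start from n_clusters distinct data points chosen at random.
    centers = X[rng.choice(len(X), n_clusters, replace=False)].copy()
    for _ in range(iter_max):
        # Squared distance from every point to every center, shape (n, k).
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(n_clusters):
            members = X[labels == j]
            if len(members) > 0:      # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment and objective value for the returned centers.
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)
    sse = d[np.arange(len(X)), labels].sum()
    return centers, labels, float(sse)

# Three Gaussian blobs, generated like the report's data but with a fixed seed.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(100, 2)) + offset
               for offset in ([-5, -5], [5, -5], [0, 5])])

for n_clusters in (2, 3, 4):
    runs = [kmeans(X, n_clusters, np.random.default_rng(seed))[2] for seed in range(5)]
    print(n_clusters, [round(s, 1) for s in runs])
```

If the printed values differ across seeds for the same cluster count, those runs ended in different local minima. A common remedy is to keep the run with the smallest value, and comparing the values across cluster counts gives a rough guide for choosing the number of clusters.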

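【Machine Learning】List of reports

【Rabbit Challenge】【Machine Learning】Linear regression model
【Rabbit Challenge】【Machine Learning】Nonlinear regression model
【Rabbit Challenge】【Machine Learning】Logistic regression model
【Rabbit Challenge】【Machine Learning】Principal component analysis
【Rabbit Challenge】【Machine Learning】Support vector machine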
