
How Machine Learning Models Classify

Last updated at 2024-05-29. Posted at 2024-05-28.

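"Classification" in machine learning covers many different approaches.
To see how each one draws its boundaries, this article evaluates each model on a fine grid covering 0 to 1 on both axes and visualizes the result.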

Function

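from sklearn import preprocessing
import matplotlib.pyplot as plt
import numpy as np

def showline_clf(x, y, model, modelname, x0="x0", x1="x1"):
    """Fit model on min-max scaled x and plot its predictions over [0, 1] x [0, 1]."""
    fig, ax = plt.subplots(figsize=(8, 6))
    # Scale each feature to [0, 1] so the data lines up with the prediction grid.
    x = preprocessing.minmax_scale(x)
    # 1000 x 1000 grid covering the unit square (stated explicitly instead of
    # relying on the default limits of a fresh Axes, which are also 0 to 1).
    X, Y = np.meshgrid(np.linspace(0, 1, 1000), np.linspace(0, 1, 1000))
    XY = np.column_stack([X.ravel(), Y.ravel()])
    model.fit(x, y)
    # Predict a label for every grid point and reshape back to the grid.
    Z = model.predict(XY).reshape(X.shape)
    ax.contourf(X, Y, Z, alpha=0.1, cmap="brg")
    points = ax.scatter(x[:, 0], x[:, 1], c=y, cmap="brg")
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.set_title(modelname)
    fig.colorbar(points, ax=ax)
    ax.set_xlabel(x0)
    ax.set_ylabel(x1)
    plt.show()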
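
As a rough illustration of the grid that showline_clf predicts on, the sketch below builds the same kind of point grid at a much smaller size. The helper name make_unit_grid is made up for this example and is not part of the article's code.

```python
import numpy as np

def make_unit_grid(n):
    """Return an (n*n, 2) array of points covering [0, 1] x [0, 1]."""
    axis = np.linspace(0.0, 1.0, n)
    X, Y = np.meshgrid(axis, axis)
    # Each row is one (x0, x1) point that a model can be asked to classify.
    return np.column_stack([X.ravel(), Y.ravel()])

grid = make_unit_grid(3)
print(grid.shape)
print(grid)
```

With n=1000, as in the function above, the model is asked for one million predictions, which is why slower predictors such as k-NN can take a while to plot.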

Program

Importing libraries

from sklearn.datasets import make_blobs
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.ensemble import GradientBoostingClassifier as GBC
from sklearn.tree import DecisionTreeClassifier as DTC
from sklearn.linear_model import LogisticRegression as LR
from sklearn.neighbors import KNeighborsClassifier as KNN
import matplotlib.pyplot as plt

Visualizing the dataset

x, y = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=1.0)
plt.scatter(x[:, 0], x[:, 1], c=y, cmap="brg")
plt.show()
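
For readers who want numbers alongside the scatter plot, here is a small sketch that inspects the same dataset and shows what min-max scaling does to it. It assumes only the make_blobs call above; the variable name x_scaled is introduced here for illustration.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.datasets import make_blobs

x, y = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=1.0)

# Two features per sample, and the samples are split evenly over the 4 centers.
print(x.shape, np.bincount(y))

# After min-max scaling, every feature spans exactly 0 to 1,
# which is the range the plotting function draws.
x_scaled = preprocessing.minmax_scale(x)
print(x_scaled.min(axis=0), x_scaled.max(axis=0))
```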

The machine learning models

model1 = SVC()
model2 = SVC(kernel="linear")
model3 = DTC()
model4 = RFC()
model5 = GBC()
model6 = LR()
model7 = KNN()

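SVM (RBF)

This method lifts the data into a higher-dimensional space where it becomes separable, classifies it there, and maps the result back. The kernel it uses is a Gaussian function.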

model1.fit(x, y)
showline_clf(x, y, model1, "SVM RBF kernel", x0="x0", x1="x1")
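
The Gaussian function behind the RBF kernel can be sketched in a few lines. This is only an illustration of the similarity measure, not of how SVC is implemented; the function name rbf_kernel_value and the choice gamma=1.0 are assumptions for the example (SVC picks its own gamma by default).

```python
import numpy as np

def rbf_kernel_value(a, b, gamma=1.0):
    """Gaussian (RBF) similarity: exp(-gamma * ||a - b||^2)."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

# Similarity is 1 for identical points and decays toward 0 with distance.
same = rbf_kernel_value([0.2, 0.3], [0.2, 0.3])
near = rbf_kernel_value([0.2, 0.3], [0.3, 0.3])
far = rbf_kernel_value([0.2, 0.3], [0.9, 0.9])
print(same, near, far)
```

Because the similarity depends only on distance, the resulting decision regions tend to be rounded blobs around the training points.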

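SVM (linear)

This draws supporting lines parallel to the decision boundary and places the boundary so that the gap (the margin) between those supporting lines is as large as possible.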

model2.fit(x, y)
showline_clf(x, y, model2, "SVM", x0="x0", x1="x1")
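
To make the margin idea concrete, the following sketch fits a linear SVC on a tiny hand-made dataset and computes the distance between the two supporting lines as 2 / ||w||. The toy data and the large C value (used to approximate a hard margin) are assumptions for this demo.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two classes separated along the first axis.
toy_x = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
toy_y = np.array([0, 0, 1, 1])

# A large C approximates a hard margin (an assumption for this demo).
svm = SVC(kernel="linear", C=1000.0)
svm.fit(toy_x, toy_y)

w = svm.coef_[0]        # normal vector of the decision boundary
b = svm.intercept_[0]   # offset of the decision boundary
# Distance between the two supporting lines w.x + b = -1 and w.x + b = +1.
margin_width = 2.0 / np.linalg.norm(w)
print("w =", w, "b =", b, "margin width =", margin_width)
```

Here the classes sit at x0 = 0 and x0 = 2, so the widest possible gap is about 2, with the boundary halfway between.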

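Decision tree

This algorithm classifies by asking, one variable at a time, whether the value is above or below a threshold, like a chain of if statements. As a result, the boundaries consist only of vertical and horizontal lines.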

model3.fit(x, y)
showline_clf(x, y, model3, "DecisionTree", x0="x0", x1="x1")
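
The if-statement structure can be inspected directly. As a small sketch, the code below fits a shallow tree on the same blobs and prints its rules with scikit-learn's export_text; max_depth=2 is chosen here only to keep the output short.

```python
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier, export_text

x, y = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=1.0)

# max_depth=2 is an assumption for readability, not the article's setting.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(x, y)

# Each rule compares exactly one feature with one threshold,
# which is why every boundary is parallel to an axis.
rules = export_text(tree, feature_names=["x0", "x1"])
print(rules)
```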

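Random forest

This method builds many decision trees and combines them by bagging, which is essentially a majority vote, to improve accuracy. The boundaries look slightly curved, but on closer inspection they are made only of vertical and horizontal lines, just like a single decision tree.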

model4.fit(x, y)
showline_clf(x, y, model4, "RandomForest", x0="x0", x1="x1")
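
The vote can be reproduced by hand. One detail worth knowing: scikit-learn's random forest averages the class probabilities of its trees rather than counting hard votes. The sketch below checks that on the same blobs; n_estimators=25 is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

x, y = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=1.0)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(x, y)

# Collect the class probabilities predicted by each individual tree,
# then average them across trees (soft voting).
per_tree = np.stack([t.predict_proba(x) for t in forest.estimators_])
averaged = per_tree.mean(axis=0)

# The class with the highest averaged probability is the forest's answer.
manual_labels = forest.classes_[averaged.argmax(axis=1)]
print(np.array_equal(manual_labels, forest.predict(x)))
```

Since every tree contributes axis-aligned steps at slightly different positions, the averaged boundary looks like a fine staircase, which is what gives the curved impression.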

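Gradient boosting decision trees

Keeping each tree shallow, this method builds successively better trees for each class and then classifies with the softmax function. So although it is an ensemble method, its boundaries, unlike those of a random forest, end up resembling a single decision tree's.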

model5.fit(x, y)
showline_clf(x, y, model5, "GBDT", x0="x0", x1="x1")
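
Both parts of that description, one tree per class and a softmax on top, can be checked on a fitted model. This is a sketch under the assumption of a small n_estimators=10 to keep it quick; the softmax helper is written here for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import GradientBoostingClassifier

x, y = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=1.0)

gbdt = GradientBoostingClassifier(n_estimators=10, random_state=0)
gbdt.fit(x, y)

# One regression tree per class at every boosting stage: (stages, classes).
print(gbdt.estimators_.shape)

def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Raw per-class scores accumulated over all stages, shape (n_samples, n_classes).
raw_scores = gbdt.decision_function(x)
probabilities = softmax(raw_scores)
print(np.allclose(probabilities, gbdt.predict_proba(x)))
```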

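Logistic regression

This classifies by extending the sigmoid function to multiple variables. With more than two classes, as here, that generalization is the softmax function.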

model6.fit(x, y)
showline_clf(x, y, model6, "LogisticRegression", x0="x0", x1="x1")
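
The relationship between the sigmoid and its multi-class counterpart can be shown with plain NumPy. In this sketch the function names and the sample score z = 1.5 are chosen for illustration only.

```python
import numpy as np

def sigmoid(z):
    """Squash a single score into a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(scores):
    """Turn a vector of class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # shift for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

z = 1.5
# With two classes scored (z, 0), softmax gives the same value as sigmoid(z).
two_class = softmax(np.array([z, 0.0]))
print(sigmoid(z), two_class[0])

# With four classes, the outputs are four probabilities summing to 1.
four_class = softmax(np.array([1.0, 2.0, 3.0, 4.0]))
print(four_class, four_class.sum())
```

Because the class scores are linear in x0 and x1, the boundaries between classes come out as straight lines.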

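K-nearest neighbors

This algorithm classifies an unknown point by majority vote among the labeled points closest to it in distance. The k in k-NN is the number of neighbors that take part in the vote.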

model7.fit(x, y)
showline_clf(x, y, model7, "K-NN", x0="x0", x1="x1")
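
The voting step is short enough to write from scratch. The following is a minimal sketch of the idea, not scikit-learn's implementation; knn_predict and the toy data are made up for this example.

```python
from collections import Counter

import numpy as np

def knn_predict(train_x, train_y, query, k=5):
    """Label a query point by majority vote among its k nearest training points."""
    distances = np.linalg.norm(train_x - query, axis=1)
    nearest = np.argsort(distances)[:k]          # indices of the k closest points
    votes = Counter(train_y[nearest].tolist())   # count labels among them
    return votes.most_common(1)[0][0]

# Hypothetical toy data: class 0 near the origin, class 1 near (5, 5).
train_x = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
train_y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(train_x, train_y, np.array([0.5, 0.5]), k=3))
print(knn_predict(train_x, train_y, np.array([5.5, 5.5]), k=3))
```

KNeighborsClassifier uses k = 5 by default, which is the value behind the plot above.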

Naive Bayes

from sklearn.naive_bayes import GaussianNB
model11 = GaussianNB()
model11.fit(x, y)
showline_clf(x, y, model11, "Naive Bayes", x0="x0", x1="x1")

Aside

Clustering

K-means

from sklearn.cluster import KMeans
model10 = KMeans(n_clusters=4)
model10.fit_predict(x)
showline_clf(x, y, model10, "K-means", x0="x0", x1="x1")

So the boundaries come out as straight lines...
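
One way to see why: K-means assigns each point to its nearest centroid, and the set of points equally far from two centroids is a straight line (their perpendicular bisector). The sketch below checks the nearest-centroid rule against KMeans.predict; n_init=10 and random_state=0 are assumptions added for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

x, _ = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=1.0)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
kmeans.fit(x)

# Distance from every point to every centroid, shape (n_samples, n_clusters).
diffs = x[:, None, :] - kmeans.cluster_centers_[None, :, :]
distances = np.linalg.norm(diffs, axis=2)

# Picking the closest centroid by hand reproduces the model's predictions.
nearest_centroid = distances.argmin(axis=1)
print(np.array_equal(nearest_centroid, kmeans.predict(x)))
```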

Anomaly detection algorithms

In the maps, blue regions are anomalies and green regions are normal.

IsolationForest

from sklearn.ensemble import IsolationForest
model8 = IsolationForest()
model8.fit(x)
showline_clf(x, y, model8, "IsolationForest", x0="x0", x1="x1")

OneClassSVM

from sklearn.svm import OneClassSVM
model9 = OneClassSVM()
model9.fit(x)
showline_clf(x, y, model9, "OneClassSVM", x0="x0", x1="x1")
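
The two colors in these maps come from the labels the detectors return: scikit-learn's outlier detectors predict -1 for anomalies and 1 for normal points, and the "brg" colormap draws the low value in blue and the high value in green. A small sketch to confirm the labels (random_state=0 and the far-away test point are assumptions for the demo):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

x, _ = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=1.0)

iso = IsolationForest(random_state=0).fit(x)
ocsvm = OneClassSVM().fit(x)

# Both detectors only ever output -1 (anomaly) or 1 (normal).
print(np.unique(iso.predict(x)))
print(np.unique(ocsvm.predict(x)))

# A point far from all training data is flagged as an anomaly by OneClassSVM.
far_point = np.array([[100.0, 100.0]])
print(ocsvm.predict(far_point))
```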

LightGBM

Scikit-Learn has GradientBoostingClassifier, but LightGBM is a gradient boosting decision tree implementation that runs faster.

from lightgbm import LGBMClassifier
model11 = LGBMClassifier()
model11.fit(x, y)
showline_clf(x, y, model11, "LightGBM", x0="x0", x1="x1")

Summary

Choose the model that fits the job, based on the characteristics of each one.
