
Supervised hyperparameter tuning for UMAP

Posted at 2022-03-30

This is a follow-up to "Supervised hyperparameter tuning for t-SNE" and "Supervised hyperparameter tuning for Isomap".

Installing UMAP

UMAP is not included in scikit-learn, so we install it separately.

!pip install umap-learn
Collecting umap-learn
  Downloading umap-learn-0.5.2.tar.gz (86 kB)
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.7/dist-packages (from umap-learn) (1.21.5)
Requirement already satisfied: scikit-learn>=0.22 in /usr/local/lib/python3.7/dist-packages (from umap-learn) (1.0.2)
Requirement already satisfied: scipy>=1.0 in /usr/local/lib/python3.7/dist-packages (from umap-learn) (1.4.1)
Requirement already satisfied: numba>=0.49 in /usr/local/lib/python3.7/dist-packages (from umap-learn) (0.51.2)
Collecting pynndescent>=0.5
  Downloading pynndescent-0.5.6.tar.gz (1.1 MB)
Requirement already satisfied: tqdm in /usr/local/lib/python3.7/dist-packages (from umap-learn) (4.63.0)
Requirement already satisfied: setuptools in /usr/local/lib/python3.7/dist-packages (from numba>=0.49->umap-learn) (57.4.0)
Requirement already satisfied: llvmlite<0.35,>=0.34.0.dev0 in /usr/local/lib/python3.7/dist-packages (from numba>=0.49->umap-learn) (0.34.0)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from pynndescent>=0.5->umap-learn) (1.1.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn>=0.22->umap-learn) (3.1.0)
Building wheels for collected packages: umap-learn, pynndescent
  Building wheel for umap-learn (setup.py) ... done
  Building wheel for pynndescent (setup.py) ... done
Successfully built umap-learn pynndescent
Installing collected packages: pynndescent, umap-learn
Successfully installed pynndescent-0.5.6 umap-learn-0.5.2

Varying UMAP's parameters

As a test dataset, we use the diabetes dataset available through scikit-learn.

from umap import UMAP
import sklearn.datasets
import matplotlib.pyplot as plt

dataset = sklearn.datasets.load_diabetes()

mapper = UMAP()
mapper.fit(dataset.data)
embedding = mapper.transform(dataset.data)
plt.scatter(embedding[:, 0], embedding[:, 1], c=dataset.target, alpha=0.5)
plt.colorbar()
plt.show()

UMAPの教師ありハイパラチューニング_3_1.png

Note that the colors of the points come from dataset.target, a variable that was not used when fitting UMAP. UMAP is a form of unsupervised learning that visualizes the structure of the high-dimensional training data; as a by-product, relationships with variables not used in training can sometimes emerge.

Now let's see how much this mapping changes as we vary the parameters.

n_neighbors

from umap import UMAP
import sklearn.datasets
import matplotlib.pyplot as plt

dataset = sklearn.datasets.load_diabetes()

for n_neighbors in (2, 5, 10, 30, 50, 100):
    mapper = UMAP(n_neighbors=n_neighbors)
    mapper.fit(dataset.data)
    embedding = mapper.transform(dataset.data)
    title='n_neighbors = {0}'.format(n_neighbors)
    plt.title(title)
    plt.scatter(embedding[:, 0], embedding[:, 1], c=dataset.target, alpha=0.5)
    plt.colorbar()
    plt.show()

UMAPの教師ありハイパラチューニング_5_0.png

UMAPの教師ありハイパラチューニング_5_1.png

UMAPの教師ありハイパラチューニング_5_2.png

UMAPの教師ありハイパラチューニング_5_3.png

UMAPの教師ありハイパラチューニング_5_4.png

UMAPの教師ありハイパラチューニング_5_5.png

min_dist

from umap import UMAP
import sklearn.datasets
import matplotlib.pyplot as plt

dataset = sklearn.datasets.load_diabetes()

for min_dist in (0.0, 0.1, 0.25, 0.5, 0.8, 0.99):
    mapper = UMAP(min_dist=min_dist)
    mapper.fit(dataset.data)
    embedding = mapper.transform(dataset.data)
    title='min_dist = {0}'.format(min_dist)
    plt.title(title)
    plt.scatter(embedding[:, 0], embedding[:, 1], c=dataset.target, alpha=0.5)
    plt.colorbar()
    plt.show()

UMAPの教師ありハイパラチューニング_7_0.png

UMAPの教師ありハイパラチューニング_7_1.png

UMAPの教師ありハイパラチューニング_7_2.png

UMAPの教師ありハイパラチューニング_7_3.png

UMAPの教師ありハイパラチューニング_7_4.png

UMAPの教師ありハイパラチューニング_7_5.png

metric

from umap import UMAP
import sklearn.datasets
import matplotlib.pyplot as plt

dataset = sklearn.datasets.load_diabetes()

for metric in ["euclidean", "manhattan", "chebyshev", "minkowski", "canberra", 
               "braycurtis", "mahalanobis", "cosine", "correlation"]:

    mapper = UMAP(metric=metric)
    mapper.fit(dataset.data)
    embedding = mapper.transform(dataset.data)
    title='metric = {0}'.format(metric)
    plt.title(title)
    plt.scatter(embedding[:, 0], embedding[:, 1], c=dataset.target, alpha=0.5)
    plt.colorbar()
    plt.show()

UMAPの教師ありハイパラチューニング_9_0.png

UMAPの教師ありハイパラチューニング_9_1.png

UMAPの教師ありハイパラチューニング_9_2.png

UMAPの教師ありハイパラチューニング_9_3.png

UMAPの教師ありハイパラチューニング_9_4.png

UMAPの教師ありハイパラチューニング_9_5.png

UMAPの教師ありハイパラチューニング_9_6.png

UMAPの教師ありハイパラチューニング_9_7.png

UMAPの教師ありハイパラチューニング_9_8.png

So which parameters are best, after all?

We have now clearly seen that the visualization changes as the parameters change. So which parameters should we choose in the end?

When there is a target variable, the explanatory variables often include some that are only weakly related to it. Unsupervised learning maps the data without any regard for the target, but ideally we would like the mapping to suppress the influence of such irrelevant variables as much as possible (wouldn't we?).

A score function for regression problems

So let us design a score function that turns UMAP, an unsupervised method, into a "supervised UMAP". For regression, one possible design is the following: for every pair of mapped points, take the squared difference of their target values, divide it by the squared Euclidean distance between the two points on the map, and average over all pairs.

def regression_scorer(X, Y):
    """Average over all point pairs of (target gap)^2 / (map distance)^2."""
    total = 0.0
    for n1, (x1, y1) in enumerate(zip(X, Y)):
        for n2, (x2, y2) in enumerate(zip(X, Y)):
            if n1 > n2:
                # squared Euclidean distance on the 2-D map;
                # the tiny constant guards against division by zero
                dist = (x1[0] - x2[0]) ** 2 + (x1[1] - x2[1]) ** 2 + 1e-53
                total += (y1 - y2) ** 2 / dist

    return total / (len(Y) * (len(Y) - 1) / 2)
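The double loop above is O(n²) in pure Python and gets slow on larger datasets. It can be vectorized with SciPy's pdist; this is a sketch (the name regression_scorer_vec is mine, not from the article), which should return the same value as the loop version:

```python
import numpy as np
from scipy.spatial.distance import pdist

def regression_scorer_vec(X, Y):
    # condensed vectors of pairwise quantities, one entry per point pair
    d2 = pdist(np.asarray(X, dtype=float)[:, :2], metric="sqeuclidean") + 1e-53
    dy2 = pdist(np.asarray(Y, dtype=float).reshape(-1, 1), metric="sqeuclidean")
    # mean over the n*(n-1)/2 pairs, matching the loop version
    return np.sum(dy2 / d2) / len(d2)
```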

Optimization with Optuna

Parameter optimization means Optuna, and Optuna means parameter optimization. So let's install it.

!pip install optuna
Collecting optuna
  Downloading optuna-2.10.0-py3-none-any.whl (308 kB)
Collecting alembic
  Downloading alembic-1.7.7-py3-none-any.whl (210 kB)
Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from optuna) (1.21.5)
Requirement already satisfied: sqlalchemy>=1.1.0 in /usr/local/lib/python3.7/dist-packages (from optuna) (1.4.32)
Collecting cmaes>=0.8.2
  Downloading cmaes-0.8.2-py3-none-any.whl (15 kB)
Requirement already satisfied: PyYAML in /usr/local/lib/python3.7/dist-packages (from optuna) (3.13)
Requirement already satisfied: scipy!=1.4.0 in /usr/local/lib/python3.7/dist-packages (from optuna) (1.4.1)
Requirement already satisfied: tqdm in /usr/local/lib/python3.7/dist-packages (from optuna) (4.63.0)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.7/dist-packages (from optuna) (21.3)
Collecting cliff
  Downloading cliff-3.10.1-py3-none-any.whl (81 kB)
Collecting colorlog
  Downloading colorlog-6.6.0-py2.py3-none-any.whl (11 kB)
...
Successfully built pyperclip
Installing collected packages: pyperclip, pbr, stevedore, Mako, cmd2, autopage, colorlog, cmaes, cliff, alembic, optuna
Successfully installed Mako-1.2.0 alembic-1.7.7 autopage-0.5.0 cliff-3.10.1 cmaes-0.8.2 cmd2-2.4.0 colorlog-6.6.0 optuna-2.10.0 pbr-5.8.1 pyperclip-1.8.2 stevedore-3.5.0

Supervised UMAP

Using Optuna, we design a class that optimizes UMAP.

import scipy.stats

class SupervisedUMAP:
    def __init__(self, X, Y, scorer):
        self.X = X
        self.Y = Y
        self.scorer = scorer
        self.best_score = 1e53
        self.best_model = None

    def __call__(self, trial):
        # search space for the three UMAP hyperparameters
        n_neighbors = trial.suggest_int("n_neighbors", 2, len(self.Y))
        min_dist = trial.suggest_uniform("min_dist", 0.0, 0.99)
        metric = trial.suggest_categorical("metric", 
                ["euclidean", "manhattan", "chebyshev", "minkowski", "canberra", 
               "braycurtis", "mahalanobis", "cosine", "correlation"])

        mapper = UMAP(
            n_neighbors=n_neighbors, 
            min_dist=min_dist,
            metric=metric
            )
        mapper.fit(self.X)
        embedding = mapper.transform(self.X)
        # z-score the embedding so the score does not depend on its overall scale
        score = self.scorer(scipy.stats.zscore(embedding), self.Y)

        if self.best_score > score:
            self.best_score = score
            self.best_model = mapper

            print(self.best_model)
            title = 'trial={0}, score={1:.3e}'.format(trial.number, score)
            plt.title(title)
            plt.scatter(embedding[:, 0], embedding[:, 1], c=self.Y, alpha=0.5)
            plt.colorbar()
            plt.show()

        return score
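A side note on why the embedding is z-scored before scoring: standardization makes the score invariant to the embedding's overall scale, so scores stay comparable across trials. A quick check with a minimal inline pairwise scorer (illustrative only, not the article's exact function):

```python
import numpy as np
from scipy.stats import zscore
from scipy.spatial.distance import pdist

def pair_score(X, Y):
    # squared target gaps over squared map distances, averaged over pairs
    d2 = pdist(np.asarray(X, dtype=float), metric="sqeuclidean") + 1e-53
    dy2 = pdist(np.asarray(Y, dtype=float).reshape(-1, 1), metric="sqeuclidean")
    return np.sum(dy2 / d2) / len(d2)

rng = np.random.default_rng(0)
emb = rng.normal(size=(20, 2))
y = rng.normal(size=20)

# rescaling the raw embedding leaves its z-scores, and hence the score, unchanged
s1 = pair_score(zscore(emb), y)
s2 = pair_score(zscore(emb * 100.0), y)
```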

Here is an example optimization run. The results below are shown with some trials omitted.

import optuna

objective = SupervisedUMAP(dataset.data, dataset.target, regression_scorer)
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)
UMAP(metric='braycurtis', min_dist=0.17087855399688387, n_neighbors=165, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})

UMAPの教師ありハイパラチューニング_17_2.png

UMAP(metric='manhattan', min_dist=0.2525304699853455, n_neighbors=178, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})

UMAPの教師ありハイパラチューニング_17_5.png

UMAP(metric='mahalanobis', min_dist=0.10402066419499499, n_neighbors=140, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})

UMAPの教師ありハイパラチューニング_17_8.png

UMAP(metric='canberra', min_dist=0.965734984822116, n_neighbors=426, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})

UMAPの教師ありハイパラチューニング_17_11.png

UMAP(metric='minkowski', min_dist=0.552229974539016, n_neighbors=246, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})

UMAPの教師ありハイパラチューニング_17_14.png

UMAP(angular_rp_forest=True, metric='cosine', min_dist=0.6024771436878388, n_neighbors=116, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})

UMAPの教師ありハイパラチューニング_17_17.png

UMAP(metric='minkowski', min_dist=0.5463771791221853, n_neighbors=147, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})

UMAPの教師ありハイパラチューニング_17_20.png

UMAP(metric='chebyshev', min_dist=0.465599002041312, n_neighbors=269, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})

UMAPの教師ありハイパラチューニング_17_23.png

UMAP(metric='chebyshev', min_dist=0.8192578183604611, n_neighbors=264, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})

UMAPの教師ありハイパラチューニング_17_26.png

UMAP(metric='chebyshev', min_dist=0.8053374897725938, n_neighbors=272, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})

UMAPの教師ありハイパラチューニング_17_29.png

As the optimization progressed, it appears to have selected the parameters under which the target variable separates most easily.

Mapping predicted values

One of the reasons I like UMAP is that it supports inverse_transform, which reconstructs the original high-dimensional vector from coordinates on the 2-D map.

objective.best_model.inverse_transform([[0, 0]])
array([[ 0.02235418, -0.06212789, -0.06006454, -0.03912916,  0.00229801,
         0.02006099, -0.03019603,  0.03950487,  0.00173235, -0.03239265]],
      dtype=float32)

Using this, we can map the predictions of any predictive model onto the plane.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

kernel = 1.0 * RBF(length_scale=1.0, length_scale_bounds=(1e-1, 10.0))
model = GaussianProcessRegressor(kernel=kernel, random_state=0)
model.fit(dataset.data, dataset.target)
figsize=(6, 4)
h = 1
alpha = 0.8

embedding = objective.best_model.transform(dataset.data)
x_min = embedding[:, 0].min() - 1
x_max = embedding[:, 0].max() + 1
y_min = embedding[:, 1].min() - 1
y_max = embedding[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
plt.figure(figsize=figsize)
Z = model.predict(objective.best_model.inverse_transform(np.c_[xx.ravel(), yy.ravel()]))
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=alpha)
plt.colorbar()
plt.scatter(embedding[:, 0], embedding[:, 1], c="black", alpha=0.5)
plt.grid()
plt.show()

UMAPの教師ありハイパラチューニング_23_0.png

With a method that can quantify predictive variance, such as Gaussian process regression, mapping that variance should enable further analysis. Estimating the spread of predictions via bagging or blending might also work well.

plt.figure(figsize=figsize)
# return_std=True yields the predictive standard deviation alongside the mean
_, Z_std = model.predict(objective.best_model.inverse_transform(np.c_[xx.ravel(), yy.ravel()]), return_std=True)
Z = Z_std.reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=alpha)
plt.colorbar()
plt.scatter(embedding[:, 0], embedding[:, 1], c="black", alpha=0.5)
plt.grid()
plt.show()

UMAPの教師ありハイパラチューニング_25_0.png
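The bagging idea mentioned above could look roughly like this: fit an ensemble on the diabetes data and use the spread of the members' predictions as a rough uncertainty proxy. This is a sketch of my own, not part of the original article; the plotting step would mirror the Gaussian-process example.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

# each member tree is trained on a bootstrap resample of the data
bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50, random_state=0)
bag.fit(X, y)

# stack per-member predictions: shape (n_estimators, n_samples)
member_preds = np.stack([est.predict(X) for est in bag.estimators_])
pred_mean = member_preds.mean(axis=0)   # ensemble prediction
pred_std = member_preds.std(axis=0)     # spread = rough uncertainty estimate
```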

A score function for classification problems

With a small modification, the regression score function becomes one for classification. Unlike the regression case, here the "difference" between two distinct classes is always taken to be 1.

def classification_scorer(X, Y):
    """Average over all point pairs of (1 if labels differ else 0) / (map distance)^2."""
    total = 0.0
    for n1, (x1, y1) in enumerate(zip(X, Y)):
        for n2, (x2, y2) in enumerate(zip(X, Y)):
            if n1 > n2 and y1 != y2:
                # squared Euclidean distance on the 2-D map;
                # the tiny constant guards against division by zero
                dist = (x1[0] - x2[0]) ** 2 + (x1[1] - x2[1]) ** 2 + 1e-53
                total += 1 / dist

    return total / (len(Y) * (len(Y) - 1) / 2)
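As with the regression scorer, the double loop can be vectorized. Here is a sketch using pdist (the name classification_scorer_vec is mine): a "hamming" distance computed on a single label column is exactly 1 when the two labels differ and 0 otherwise.

```python
import numpy as np
from scipy.spatial.distance import pdist

def classification_scorer_vec(X, Y):
    # squared pairwise map distances, one entry per point pair
    d2 = pdist(np.asarray(X, dtype=float)[:, :2], metric="sqeuclidean") + 1e-53
    # hamming on one column: 1 for differing labels, 0 for equal labels
    differ = pdist(np.asarray(Y, dtype=float).reshape(-1, 1), metric="hamming")
    return np.sum(differ / d2) / len(d2)
```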

Everything else is the same as in the regression case.

Breast cancer dataset

An example of binary classification using the breast cancer dataset.

from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()
import optuna

objective = SupervisedUMAP(dataset.data, dataset.target, classification_scorer)
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)
UMAP(min_dist=0.2172843223845458, n_neighbors=218, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})

UMAPの教師ありハイパラチューニング_30_2.png

UMAP(metric='mahalanobis', min_dist=0.47752441562991854, n_neighbors=293, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})

UMAPの教師ありハイパラチューニング_30_5.png

UMAP(metric='mahalanobis', min_dist=0.6024885895334441, n_neighbors=52, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})

UMAPの教師ありハイパラチューニング_30_8.png

UMAP(metric='canberra', min_dist=0.9858623684125, n_neighbors=108, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})

UMAPの教師ありハイパラチューニング_30_11.png

UMAP(metric='canberra', min_dist=0.929187495285436, n_neighbors=364, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})

UMAPの教師ありハイパラチューニング_30_14.png

UMAP(metric='canberra', min_dist=0.9718475179229352, n_neighbors=326, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})

UMAPの教師ありハイパラチューニング_30_17.png

UMAP(metric='canberra', min_dist=0.7476737707654739, n_neighbors=402, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})

UMAPの教師ありハイパラチューニング_30_20.png

Mapping predict_proba

For classification models, mapping predict_proba may be a good way to gauge the reliability of the predictions.

import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

kernel = 1.0 * RBF(length_scale=1.0, length_scale_bounds=(1e-1, 10.0))
model = GaussianProcessClassifier(kernel=kernel, random_state=0)
model.fit(dataset.data, dataset.target)
figsize=(6, 4)
h = 1
alpha = 0.8

embedding = objective.best_model.transform(dataset.data)
x_min = embedding[:, 0].min() - 1
x_max = embedding[:, 0].max() + 1
y_min = embedding[:, 1].min() - 1
y_max = embedding[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
for n in range(2):
    plt.figure(figsize=figsize)
    plt.title("predict {} probability".format(n))
    Z = model.predict_proba(objective.best_model.inverse_transform(np.c_[xx.ravel(), yy.ravel()]))[:, n]
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=alpha)
    plt.colorbar()
    plt.scatter(embedding[:, 0], embedding[:, 1], c="black", alpha=0.5)
    plt.grid()
    plt.show()

UMAPの教師ありハイパラチューニング_34_0.png

UMAPの教師ありハイパラチューニング_34_1.png

Wine dataset

An example using the wine dataset. As this shows, the method also works when there are three or more label classes.

from sklearn.datasets import load_wine

dataset = load_wine()
import optuna

objective = SupervisedUMAP(dataset.data, dataset.target, classification_scorer)
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)
UMAP(metric='canberra', min_dist=0.7820271577093002, n_neighbors=130, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})

UMAPの教師ありハイパラチューニング_37_2.png

UMAP(metric='canberra', min_dist=0.2664782777245317, n_neighbors=116, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})

UMAPの教師ありハイパラチューニング_37_5.png

UMAP(metric='canberra', min_dist=0.5972978429297398, n_neighbors=132, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})

UMAPの教師ありハイパラチューニング_37_8.png

UMAP(metric='canberra', min_dist=0.9498778003225311, n_neighbors=34, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})

UMAPの教師ありハイパラチューニング_37_11.png

import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

kernel = 1.0 * RBF(length_scale=1.0, length_scale_bounds=(1e-1, 10.0))
model = GaussianProcessClassifier(kernel=kernel, random_state=0)
model.fit(dataset.data, dataset.target)
figsize=(6, 4)
h = 1
alpha = 0.8

embedding = objective.best_model.transform(dataset.data)
x_min = embedding[:, 0].min() - 1
x_max = embedding[:, 0].max() + 1
y_min = embedding[:, 1].min() - 1
y_max = embedding[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
for n in range(3):
    plt.figure(figsize=figsize)
    plt.title("predict {} probability".format(n))
    Z = model.predict_proba(objective.best_model.inverse_transform(np.c_[xx.ravel(), yy.ravel()]))[:, n]
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=alpha)
    plt.colorbar()
    plt.scatter(embedding[:, 0], embedding[:, 1], c="black", alpha=0.5)
    plt.grid()
    plt.show()

UMAPの教師ありハイパラチューニング_40_0.png

UMAPの教師ありハイパラチューニング_40_1.png

UMAPの教師ありハイパラチューニング_40_2.png

Handwritten digits dataset

An example using the handwritten digits dataset, which has ten label classes.

from sklearn.datasets import load_digits

dataset = load_digits()
import optuna

objective = SupervisedUMAP(dataset.data, dataset.target, classification_scorer)
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)
UMAP(metric='minkowski', min_dist=0.08295894663258277, n_neighbors=845, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})

UMAPの教師ありハイパラチューニング_43_2.png

UMAP(angular_rp_forest=True, metric='correlation', min_dist=0.8610569729117126, n_neighbors=1512, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})

UMAPの教師ありハイパラチューニング_43_5.png

UMAP(metric='braycurtis', min_dist=0.22248833651912708, n_neighbors=594, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})

UMAPの教師ありハイパラチューニング_43_8.png

UMAP(min_dist=0.6227914042692446, n_neighbors=530, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})

UMAPの教師ありハイパラチューニング_43_11.png

UMAP(min_dist=0.6510861250675604, n_neighbors=392, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})

UMAPの教師ありハイパラチューニング_43_14.png

UMAP(metric='manhattan', min_dist=0.520042472164499, n_neighbors=299, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})

UMAPの教師ありハイパラチューニング_43_17.png

UMAP(min_dist=0.5368016672875904, n_neighbors=468, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})

UMAPの教師ありハイパラチューニング_43_20.png

UMAP(min_dist=0.48661142862105233, n_neighbors=336, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})

UMAPの教師ありハイパラチューニング_43_23.png

UMAP(min_dist=0.5262945217120664, n_neighbors=464, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})

UMAPの教師ありハイパラチューニング_43_26.png

UMAP(min_dist=0.33899588494964167, n_neighbors=374, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})

UMAPの教師ありハイパラチューニング_43_29.png

import numpy as np
from sklearn.neural_network import MLPClassifier

model = MLPClassifier()
model.fit(dataset.data, dataset.target)
figsize=(6, 4)
h = 1
alpha = 0.8

embedding = objective.best_model.transform(dataset.data)
x_min = embedding[:, 0].min() - 1
x_max = embedding[:, 0].max() + 1
y_min = embedding[:, 1].min() - 1
y_max = embedding[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
for n in range(10):
    plt.figure(figsize=figsize)
    plt.title("predict {} probability".format(n))
    Z = model.predict_proba(objective.best_model.inverse_transform(np.c_[xx.ravel(), yy.ravel()]))[:, n]
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=alpha)
    plt.colorbar()
    plt.scatter(embedding[:, 0], embedding[:, 1], c="black", alpha=0.1)
    plt.grid()
    plt.show()

UMAPの教師ありハイパラチューニング_46_0.png

UMAPの教師ありハイパラチューニング_46_1.png

UMAPの教師ありハイパラチューニング_46_2.png

UMAPの教師ありハイパラチューニング_46_3.png

UMAPの教師ありハイパラチューニング_46_4.png

UMAPの教師ありハイパラチューニング_46_5.png

UMAPの教師ありハイパラチューニング_46_6.png

UMAPの教師ありハイパラチューニング_46_7.png

UMAPの教師ありハイパラチューニング_46_8.png

UMAPの教師ありハイパラチューニング_46_9.png

Closing remarks

I like t-SNE and Isomap. But I like UMAP even more.
