This is a sequel to "Supervised hyperparameter tuning for t-SNE" and "Supervised hyperparameter tuning for Isomap".
Installing UMAP
UMAP is not included in scikit-learn, so we install it separately.
!pip install umap-learn
Collecting umap-learn
  Downloading umap-learn-0.5.2.tar.gz (86 kB)
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.7/dist-packages (from umap-learn) (1.21.5)
Requirement already satisfied: scikit-learn>=0.22 in /usr/local/lib/python3.7/dist-packages (from umap-learn) (1.0.2)
Requirement already satisfied: scipy>=1.0 in /usr/local/lib/python3.7/dist-packages (from umap-learn) (1.4.1)
Requirement already satisfied: numba>=0.49 in /usr/local/lib/python3.7/dist-packages (from umap-learn) (0.51.2)
Collecting pynndescent>=0.5
  Downloading pynndescent-0.5.6.tar.gz (1.1 MB)
Requirement already satisfied: tqdm in /usr/local/lib/python3.7/dist-packages (from umap-learn) (4.63.0)
Requirement already satisfied: setuptools in /usr/local/lib/python3.7/dist-packages (from numba>=0.49->umap-learn) (57.4.0)
Requirement already satisfied: llvmlite<0.35,>=0.34.0.dev0 in /usr/local/lib/python3.7/dist-packages (from numba>=0.49->umap-learn) (0.34.0)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from pynndescent>=0.5->umap-learn) (1.1.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn>=0.22->umap-learn) (3.1.0)
Building wheels for collected packages: umap-learn, pynndescent
  Building wheel for umap-learn (setup.py) ... done
  Created wheel for umap-learn: filename=umap_learn-0.5.2-py3-none-any.whl size=82708 sha256=2ad8270ad7ff916a8cafad0b4ca1352d32e05d95597ed159180383308e885b42
  Stored in directory: /root/.cache/pip/wheels/84/1b/c6/aaf68a748122632967cef4dffef68224eb16798b6793257d82
  Building wheel for pynndescent (setup.py) ... done
  Created wheel for pynndescent: filename=pynndescent-0.5.6-py3-none-any.whl size=53943 sha256=90c3c97fea10d5a6f03000dd193fd4c848b1fa9825d2660b2ecac3ab98214097
  Stored in directory: /root/.cache/pip/wheels/03/f1/56/f80d72741e400345b5a5b50ec3d929aca581bf45e0225d5c50
Successfully built umap-learn pynndescent
Installing collected packages: pynndescent, umap-learn
Successfully installed pynndescent-0.5.6 umap-learn-0.5.2
Varying UMAP's parameters
As a toy dataset, let's use the diabetes dataset available from scikit-learn.
from umap import UMAP
import sklearn.datasets
import matplotlib.pyplot as plt
dataset = sklearn.datasets.load_diabetes()
mapper = UMAP()
mapper.fit(dataset.data)
embedding = mapper.transform(dataset.data)
plt.scatter(embedding[:, 0], embedding[:, 1], c=dataset.target, alpha=0.5)
plt.colorbar()
plt.show()
Note that the point colors come from dataset.target, a variable that was not used to fit UMAP. UMAP is a form of unsupervised learning: it visualizes the structure of the high-dimensional data it was fitted on, and as a by-product, relationships with variables not used in the fitting can sometimes become visible.
Now, let's see how much this mapping changes when we vary the parameters.
n_neighbors
from umap import UMAP
import sklearn.datasets
import matplotlib.pyplot as plt
dataset = sklearn.datasets.load_diabetes()
for n_neighbors in (2, 5, 10, 30, 50, 100):
    mapper = UMAP(n_neighbors=n_neighbors)
    mapper.fit(dataset.data)
    embedding = mapper.transform(dataset.data)
    title = 'n_neighbors = {0}'.format(n_neighbors)
    plt.title(title)
    plt.scatter(embedding[:, 0], embedding[:, 1], c=dataset.target, alpha=0.5)
    plt.colorbar()
    plt.show()
min_dist
from umap import UMAP
import sklearn.datasets
import matplotlib.pyplot as plt
dataset = sklearn.datasets.load_diabetes()
for min_dist in (0.0, 0.1, 0.25, 0.5, 0.8, 0.99):
    mapper = UMAP(min_dist=min_dist)
    mapper.fit(dataset.data)
    embedding = mapper.transform(dataset.data)
    title = 'min_dist = {0}'.format(min_dist)
    plt.title(title)
    plt.scatter(embedding[:, 0], embedding[:, 1], c=dataset.target, alpha=0.5)
    plt.colorbar()
    plt.show()
metric
from umap import UMAP
import sklearn.datasets
import matplotlib.pyplot as plt
dataset = sklearn.datasets.load_diabetes()
for metric in ["euclidean", "manhattan", "chebyshev", "minkowski", "canberra",
               "braycurtis", "mahalanobis", "cosine", "correlation"]:
    mapper = UMAP(metric=metric)
    mapper.fit(dataset.data)
    embedding = mapper.transform(dataset.data)
    title = 'metric = {0}'.format(metric)
    plt.title(title)
    plt.scatter(embedding[:, 0], embedding[:, 1], c=dataset.target, alpha=0.5)
    plt.colorbar()
    plt.show()
So which parameters are best, after all?
We have now seen clearly that the visualization changes as the parameters change. So which parameters should we actually use?
When there is a target variable, the explanatory variables often include some that are only weakly related to it. Unsupervised learning maps the data without any regard for the target, but ideally we would like the mapping to suppress the influence of those irrelevant variables as much as possible (right?).
A score function for regression problems
So let's design a score function that turns UMAP, an unsupervised method, into a "supervised UMAP". For a regression problem, one possibility is the following: for every pair of mapped points, compute the squared difference of their target values, divide it by the squared Euclidean distance between the two points, and average this quantity over all pairs.
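Written as a formula (with \(x_i\) denoting the mapped 2-D points and \(y_i\) the target values; the tiny constant in the code below guards against division by zero), the score is:

\[
\mathrm{score}(X, y) = \frac{2}{n(n-1)} \sum_{i < j} \frac{(y_i - y_j)^2}{\lVert x_i - x_j \rVert^2}
\]

A small score means that points with similar target values end up close together on the map.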
def regression_scorer(X, Y):
    total = 0
    n1 = 0
    for x1, y1 in zip(X, Y):
        n1 += 1
        n2 = 0
        for x2, y2 in zip(X, Y):
            n2 += 1
            if n1 > n2:
                # squared Euclidean distance; the tiny constant avoids division by zero
                dist = ((x1[0] - x2[0])**2 + (x1[1] - x2[1])**2) + 1e-53
                total += (y1 - y2)**2 / dist
    # average over the n * (n - 1) / 2 pairs
    return total / (len(Y) * (len(Y) - 1) / 2)
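As a side note, the double loop above can be vectorized with scipy.spatial.distance.pdist, which computes the same per-pair quantities in one pass; a minimal sketch (the function name is mine, not from the article):

```python
import numpy as np
from scipy.spatial.distance import pdist

def regression_scorer_vectorized(X, Y):
    # Squared Euclidean distances between all pairs of mapped 2-D points
    d2 = pdist(np.asarray(X, dtype=float)[:, :2], metric="sqeuclidean") + 1e-53
    # Squared differences between all pairs of target values, in the same pair order
    dy2 = pdist(np.asarray(Y, dtype=float).reshape(-1, 1), metric="sqeuclidean")
    # Mean over the n * (n - 1) / 2 pairs, matching the loop's normalization
    return np.mean(dy2 / d2)
```

pdist returns the condensed vector over the n(n-1)/2 unordered pairs, so taking the mean reproduces the loop version's normalization exactly.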
Optimizing with Optuna
When it comes to parameter optimization, Optuna is the go-to tool, so let's install it.
!pip install optuna
Collecting optuna
  Downloading optuna-2.10.0-py3-none-any.whl (308 kB)
Collecting alembic
  Downloading alembic-1.7.7-py3-none-any.whl (210 kB)
Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from optuna) (1.21.5)
Requirement already satisfied: sqlalchemy>=1.1.0 in /usr/local/lib/python3.7/dist-packages (from optuna) (1.4.32)
Collecting cmaes>=0.8.2
  Downloading cmaes-0.8.2-py3-none-any.whl (15 kB)
Requirement already satisfied: PyYAML in /usr/local/lib/python3.7/dist-packages (from optuna) (3.13)
Requirement already satisfied: scipy!=1.4.0 in /usr/local/lib/python3.7/dist-packages (from optuna) (1.4.1)
Requirement already satisfied: tqdm in /usr/local/lib/python3.7/dist-packages (from optuna) (4.63.0)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.7/dist-packages (from optuna) (21.3)
Collecting cliff
  Downloading cliff-3.10.1-py3-none-any.whl (81 kB)
Collecting colorlog
  Downloading colorlog-6.6.0-py2.py3-none-any.whl (11 kB)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from packaging>=20.0->optuna) (3.0.7)
Requirement already satisfied: importlib-metadata in /usr/local/lib/python3.7/dist-packages (from sqlalchemy>=1.1.0->optuna) (4.11.3)
Requirement already satisfied: greenlet!=0.4.17 in /usr/local/lib/python3.7/dist-packages (from sqlalchemy>=1.1.0->optuna) (1.1.2)
Collecting Mako
  Downloading Mako-1.2.0-py3-none-any.whl (78 kB)
Requirement already satisfied: importlib-resources in /usr/local/lib/python3.7/dist-packages (from alembic->optuna) (5.4.0)
Collecting cmd2>=1.0.0
  Downloading cmd2-2.4.0-py3-none-any.whl (150 kB)
Collecting pbr!=2.1.0,>=2.0.0
  Downloading pbr-5.8.1-py2.py3-none-any.whl (113 kB)
Collecting stevedore>=2.0.1
  Downloading stevedore-3.5.0-py3-none-any.whl (49 kB)
Collecting autopage>=0.4.0
  Downloading autopage-0.5.0-py3-none-any.whl (29 kB)
Requirement already satisfied: PrettyTable>=0.7.2 in /usr/local/lib/python3.7/dist-packages (from cliff->optuna) (3.2.0)
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.7/dist-packages (from cmd2>=1.0.0->cliff->optuna) (3.10.0.2)
Requirement already satisfied: attrs>=16.3.0 in /usr/local/lib/python3.7/dist-packages (from cmd2>=1.0.0->cliff->optuna) (21.4.0)
Collecting pyperclip>=1.6
  Downloading pyperclip-1.8.2.tar.gz (20 kB)
Requirement already satisfied: wcwidth>=0.1.7 in /usr/local/lib/python3.7/dist-packages (from cmd2>=1.0.0->cliff->optuna) (0.2.5)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata->sqlalchemy>=1.1.0->optuna) (3.7.0)
Requirement already satisfied: MarkupSafe>=0.9.2 in /usr/local/lib/python3.7/dist-packages (from Mako->alembic->optuna) (2.0.1)
Building wheels for collected packages: pyperclip
  Building wheel for pyperclip (setup.py) ... done
  Created wheel for pyperclip: filename=pyperclip-1.8.2-py3-none-any.whl size=11137 sha256=031b37d441d73814df94ef903581ef03bb76ca151064df06b639a6e908740c40
  Stored in directory: /root/.cache/pip/wheels/9f/18/84/8f69f8b08169c7bae2dde6bd7daf0c19fca8c8e500ee620a28
Successfully built pyperclip
Installing collected packages: pyperclip, pbr, stevedore, Mako, cmd2, autopage, colorlog, cmaes, cliff, alembic, optuna
Successfully installed Mako-1.2.0 alembic-1.7.7 autopage-0.5.0 cliff-3.10.1 cmaes-0.8.2 cmd2-2.4.0 colorlog-6.6.0 optuna-2.10.0 pbr-5.8.1 pyperclip-1.8.2 stevedore-3.5.0
Supervised UMAP
We design a class that uses Optuna to optimize UMAP.
import scipy.stats

class SupervisedUMAP:
    def __init__(self, X, Y, scorer):
        self.X = X
        self.Y = Y
        self.scorer = scorer
        self.best_score = 1e53
        self.best_model = None

    def __call__(self, trial):
        n_neighbors = trial.suggest_int("n_neighbors", 2, len(self.Y))
        min_dist = trial.suggest_uniform("min_dist", 0.0, 0.99)
        metric = trial.suggest_categorical("metric",
            ["euclidean", "manhattan", "chebyshev", "minkowski", "canberra",
             "braycurtis", "mahalanobis", "cosine", "correlation"])
        mapper = UMAP(
            n_neighbors=n_neighbors,
            min_dist=min_dist,
            metric=metric
        )
        mapper.fit(self.X)
        embedding = mapper.transform(self.X)
        # Standardize the embedding so the score does not depend on its scale
        score = self.scorer(scipy.stats.zscore(embedding), self.Y)
        if self.best_score > score:
            self.best_score = score
            self.best_model = mapper
            print(self.best_model)
            title = 'trial={0}, score={1:.3e}'.format(trial.number, score)
            plt.title(title)
            plt.scatter(embedding[:, 0], embedding[:, 1], c=self.Y, alpha=0.5)
            plt.colorbar()
            plt.show()
        return score
Here is an example run of the optimization. The results below are shown with some output omitted.
import optuna
objective = SupervisedUMAP(dataset.data, dataset.target, regression_scorer)
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)
UMAP(metric='braycurtis', min_dist=0.17087855399688387, n_neighbors=165, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})
UMAP(metric='manhattan', min_dist=0.2525304699853455, n_neighbors=178, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})
UMAP(metric='mahalanobis', min_dist=0.10402066419499499, n_neighbors=140, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})
UMAP(metric='canberra', min_dist=0.965734984822116, n_neighbors=426, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})
UMAP(metric='minkowski', min_dist=0.552229974539016, n_neighbors=246, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})
UMAP(angular_rp_forest=True, metric='cosine', min_dist=0.6024771436878388, n_neighbors=116, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})
UMAP(metric='minkowski', min_dist=0.5463771791221853, n_neighbors=147, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})
UMAP(metric='chebyshev', min_dist=0.465599002041312, n_neighbors=269, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})
UMAP(metric='chebyshev', min_dist=0.8192578183604611, n_neighbors=264, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})
UMAP(metric='chebyshev', min_dist=0.8053374897725938, n_neighbors=272, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})
As the optimization progressed, I believe we were able to select the parameters that separate the target variable best.
Mapping predicted values
One of the reasons I like UMAP is that it supports inverse_transform, which reconstructs the original high-dimensional vector from a coordinate on the 2-D map.
objective.best_model.inverse_transform([[0, 0]])
array([[ 0.02235418, -0.06212789, -0.06006454, -0.03912916, 0.00229801,
0.02006099, -0.03019603, 0.03950487, 0.00173235, -0.03239265]],
dtype=float32)
Using this, we can map the predictions of any predictive model onto the embedding.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
kernel = 1.0 * RBF(length_scale=1.0, length_scale_bounds=(1e-1, 10.0))
model = GaussianProcessRegressor(kernel=kernel, random_state=0)
model.fit(dataset.data, dataset.target)
figsize=(6, 4)
h = 1
alpha = 0.8
embedding = objective.best_model.transform(dataset.data)
x_min = embedding[:, 0].min() - 1
x_max = embedding[:, 0].max() + 1
y_min = embedding[:, 1].min() - 1
y_max = embedding[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
plt.figure(figsize=figsize)
Z = model.predict(objective.best_model.inverse_transform(np.c_[xx.ravel(), yy.ravel()]))
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=alpha)
plt.colorbar()
plt.scatter(embedding[:, 0], embedding[:, 1], c="black", alpha=0.5)
plt.grid()
plt.show()
If we use a method that can quantify predictive variance, such as Gaussian process regression, mapping that variance should enable further insights. Estimating the spread of predictions via bagging or blending might also work well.
plt.figure(figsize=figsize)
# This time, map the predictive standard deviation instead of the mean
Z_mean, Z_std = model.predict(objective.best_model.inverse_transform(np.c_[xx.ravel(), yy.ravel()]), return_std=True)
Z = Z_std.reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=alpha)
plt.colorbar()
plt.scatter(embedding[:, 0], embedding[:, 1], c="black", alpha=0.5)
plt.grid()
plt.show()
A score function for classification problems
With a small modification, the regression score function becomes one for classification. Unlike the regression case, in classification we treat the "difference" between any two distinct classes as always being 1.
def classification_scorer(X, Y):
    total = 0
    n1 = 0
    for x1, y1 in zip(X, Y):
        n1 += 1
        n2 = 0
        for x2, y2 in zip(X, Y):
            n2 += 1
            if n1 > n2:
                # squared Euclidean distance; the tiny constant avoids division by zero
                dist = ((x1[0] - x2[0])**2 + (x1[1] - x2[1])**2) + 1e-53
                if y1 != y2:
                    # pairs from different classes contribute 1 / (squared distance)
                    total += 1 / dist
    return total / (len(Y) * (len(Y) - 1) / 2)
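To sanity-check the intuition, here is a small synthetic demo (the data and variable names are mine, not from the article): when the two classes are mapped far apart, the score should be much smaller than when they overlap. The scorer below mirrors the one defined above.

```python
import numpy as np

def classification_scorer(X, Y):
    # Mirrors the scorer above: penalize pairs with different labels
    # that are mapped close together on the 2-D map.
    total = 0.0
    for i in range(len(Y)):
        for j in range(i):
            if Y[i] != Y[j]:
                dist = (X[i][0] - X[j][0])**2 + (X[i][1] - X[j][1])**2 + 1e-53
                total += 1.0 / dist
    return total / (len(Y) * (len(Y) - 1) / 2)

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 25)
separated = rng.normal(size=(50, 2)) + labels[:, None] * 10  # clusters far apart
mixed = rng.normal(size=(50, 2))                             # clusters overlap
print(classification_scorer(separated, labels) < classification_scorer(mixed, labels))  # → True
```

A lower score thus rewards embeddings where the classes are well separated, which is exactly what the Optuna objective minimizes.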
Everything else is the same as in the regression case.
Breast cancer dataset
An example of binary classification using the breast cancer dataset.
from sklearn.datasets import load_breast_cancer
dataset = load_breast_cancer()
import optuna
objective = SupervisedUMAP(dataset.data, dataset.target, classification_scorer)
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)
UMAP(min_dist=0.2172843223845458, n_neighbors=218, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})
UMAP(metric='mahalanobis', min_dist=0.47752441562991854, n_neighbors=293, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})
UMAP(metric='mahalanobis', min_dist=0.6024885895334441, n_neighbors=52, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})
UMAP(metric='canberra', min_dist=0.9858623684125, n_neighbors=108, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})
UMAP(metric='canberra', min_dist=0.929187495285436, n_neighbors=364, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})
UMAP(metric='canberra', min_dist=0.9718475179229352, n_neighbors=326, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})
UMAP(metric='canberra', min_dist=0.7476737707654739, n_neighbors=402, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})
Mapping predict_proba
For a classification model, mapping predict_proba may be a good way to gauge how reliable its predictions are across the embedding.
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
kernel = 1.0 * RBF(length_scale=1.0, length_scale_bounds=(1e-1, 10.0))
model = GaussianProcessClassifier(kernel=kernel, random_state=0)
model.fit(dataset.data, dataset.target)
figsize=(6, 4)
h = 1
alpha = 0.8
embedding = objective.best_model.transform(dataset.data)
x_min = embedding[:, 0].min() - 1
x_max = embedding[:, 0].max() + 1
y_min = embedding[:, 1].min() - 1
y_max = embedding[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
for n in range(2):
    plt.figure(figsize=figsize)
    plt.title("predict {} probability".format(n))
    Z = model.predict_proba(objective.best_model.inverse_transform(np.c_[xx.ravel(), yy.ravel()]))[:, n]
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=alpha)
    plt.colorbar()
    plt.scatter(embedding[:, 0], embedding[:, 1], c="black", alpha=0.5)
    plt.grid()
    plt.show()
Wine dataset
An example using the wine dataset. As shown here, the approach also works when there are three or more classes.
from sklearn.datasets import load_wine
dataset = load_wine()
import optuna
objective = SupervisedUMAP(dataset.data, dataset.target, classification_scorer)
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)
UMAP(metric='canberra', min_dist=0.7820271577093002, n_neighbors=130, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})
UMAP(metric='canberra', min_dist=0.2664782777245317, n_neighbors=116, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})
UMAP(metric='canberra', min_dist=0.5972978429297398, n_neighbors=132, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})
UMAP(metric='canberra', min_dist=0.9498778003225311, n_neighbors=34, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
kernel = 1.0 * RBF(length_scale=1.0, length_scale_bounds=(1e-1, 10.0))
model = GaussianProcessClassifier(kernel=kernel, random_state=0)
model.fit(dataset.data, dataset.target)
figsize=(6, 4)
h = 1
alpha = 0.8
embedding = objective.best_model.transform(dataset.data)
x_min = embedding[:, 0].min() - 1
x_max = embedding[:, 0].max() + 1
y_min = embedding[:, 1].min() - 1
y_max = embedding[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
for n in range(3):
    plt.figure(figsize=figsize)
    plt.title("predict {} probability".format(n))
    Z = model.predict_proba(objective.best_model.inverse_transform(np.c_[xx.ravel(), yy.ravel()]))[:, n]
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=alpha)
    plt.colorbar()
    plt.scatter(embedding[:, 0], embedding[:, 1], c="black", alpha=0.5)
    plt.grid()
    plt.show()
Handwritten digits dataset
An example using the handwritten digits dataset, which has 10 classes.
from sklearn.datasets import load_digits
dataset = load_digits()
import optuna
objective = SupervisedUMAP(dataset.data, dataset.target, classification_scorer)
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)
UMAP(metric='minkowski', min_dist=0.08295894663258277, n_neighbors=845, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})
UMAP(angular_rp_forest=True, metric='correlation', min_dist=0.8610569729117126, n_neighbors=1512, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})
UMAP(metric='braycurtis', min_dist=0.22248833651912708, n_neighbors=594, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})
UMAP(min_dist=0.6227914042692446, n_neighbors=530, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})
UMAP(min_dist=0.6510861250675604, n_neighbors=392, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})
UMAP(metric='manhattan', min_dist=0.520042472164499, n_neighbors=299, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})
UMAP(min_dist=0.5368016672875904, n_neighbors=468, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})
UMAP(min_dist=0.48661142862105233, n_neighbors=336, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})
UMAP(min_dist=0.5262945217120664, n_neighbors=464, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})
UMAP(min_dist=0.33899588494964167, n_neighbors=374, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})
import numpy as np
from sklearn.neural_network import MLPClassifier
model = MLPClassifier()
model.fit(dataset.data, dataset.target)
figsize=(6, 4)
h = 1
alpha = 0.8
embedding = objective.best_model.transform(dataset.data)
x_min = embedding[:, 0].min() - 1
x_max = embedding[:, 0].max() + 1
y_min = embedding[:, 1].min() - 1
y_max = embedding[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
for n in range(10):
    plt.figure(figsize=figsize)
    plt.title("predict {} probability".format(n))
    Z = model.predict_proba(objective.best_model.inverse_transform(np.c_[xx.ravel(), yy.ravel()]))[:, n]
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=alpha)
    plt.colorbar()
    plt.scatter(embedding[:, 0], embedding[:, 1], c="black", alpha=0.1)
    plt.grid()
    plt.show()
Closing remarks
I like t-SNE and Isomap. But I like UMAP even more.