This is a sequel to "Supervised hyperparameter tuning for t-SNE". The same methodology can also be applied to Isomap, so I gave it a try.
Varying Isomap's parameters
Isomap is implemented in scikit-learn, so let's use that. As test data, we will use the diabetes dataset, which can be loaded from scikit-learn.
from sklearn.manifold import Isomap
import sklearn.datasets
import matplotlib.pyplot as plt
dataset = sklearn.datasets.load_diabetes()
mapper = Isomap()
embedding = mapper.fit_transform(dataset.data)
plt.scatter(embedding[:, 0], embedding[:, 1], c=dataset.target, alpha=0.5)
plt.colorbar()
plt.show()
Note that the colors of the points come from the variable dataset.target, which was not used when fitting the Isomap. Isomap is a form of "unsupervised learning": it visualizes the structure of the high-dimensional data it was fit on, and as a by-product, relationships with variables that were not used in the fit sometimes become visible.
Now let's see how much this mapping changes when we vary the parameters.
n_neighbors
from sklearn.manifold import Isomap
import sklearn.datasets
import matplotlib.pyplot as plt
dataset = sklearn.datasets.load_diabetes()
for n_neighbors in (5, 10, 30, 50):
    mapper = Isomap(n_neighbors=n_neighbors)
    embedding = mapper.fit_transform(dataset.data)
    title = 'n_neighbors = {0}'.format(n_neighbors)
    plt.title(title)
    plt.scatter(embedding[:, 0], embedding[:, 1], c=dataset.target, alpha=0.5)
    plt.colorbar()
    plt.show()
eigen_solver
from sklearn.manifold import Isomap
import sklearn.datasets
import matplotlib.pyplot as plt
dataset = sklearn.datasets.load_diabetes()
for eigen_solver in ("arpack", "dense"):
    mapper = Isomap(eigen_solver=eigen_solver)
    embedding = mapper.fit_transform(dataset.data)
    title = 'eigen_solver = {0}'.format(eigen_solver)
    plt.title(title)
    plt.scatter(embedding[:, 0], embedding[:, 1], c=dataset.target, alpha=0.5)
    plt.colorbar()
    plt.show()
path_method
from sklearn.manifold import Isomap
import sklearn.datasets
import matplotlib.pyplot as plt
dataset = sklearn.datasets.load_diabetes()
for path_method in ["FW", "D"]:
    mapper = Isomap(path_method=path_method)
    embedding = mapper.fit_transform(dataset.data)
    title = 'path_method = {0}'.format(path_method)
    plt.title(title)
    plt.scatter(embedding[:, 0], embedding[:, 1], c=dataset.target, alpha=0.5)
    plt.colorbar()
    plt.show()
neighbors_algorithm
from sklearn.manifold import Isomap
import sklearn.datasets
import matplotlib.pyplot as plt
dataset = sklearn.datasets.load_diabetes()
for neighbors_algorithm in ["brute", "kd_tree", "ball_tree"]:
    mapper = Isomap(neighbors_algorithm=neighbors_algorithm)
    embedding = mapper.fit_transform(dataset.data)
    title = 'neighbors_algorithm = {0}'.format(neighbors_algorithm)
    plt.title(title)
    plt.scatter(embedding[:, 0], embedding[:, 1], c=dataset.target, alpha=0.5)
    plt.colorbar()
    plt.show()
The parameter dependence looks smaller than that of t-SNE, so, frankly, the analysis from here on may not mean all that much.
A score function for regression problems
Let's design a score function that turns the unsupervised Isomap into a "supervised Isomap". For a regression problem, one possibility is the following: for every pair of mapped points, take the squared difference of their target values, divide it by the squared Euclidean distance between the two points, and average over all pairs.
def regression_scorer(X, Y):
    total = 0
    n1 = 0
    for x1, y1 in zip(X, Y):
        n1 += 1
        n2 = 0
        for x2, y2 in zip(X, Y):
            n2 += 1
            if n1 > n2:  # visit each pair only once
                # squared Euclidean distance in the embedding (epsilon avoids division by zero)
                dist = ((x1[0] - x2[0])**2 + (x1[1] - x2[1])**2) + 1e-53
                # squared target difference, weighted by closeness in the embedding
                total += (y1 - y2)**2 / dist
    # average over all pairs
    return total / (len(Y) * (len(Y) - 1) / 2)
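As a side note (this is my own sketch, not part of the original article), the double loop above can be replaced with a vectorized equivalent built on scipy.spatial.distance.pdist, which is considerably faster for larger datasets:

```python
import numpy as np
from scipy.spatial.distance import pdist

def regression_scorer_fast(X, Y):
    # squared Euclidean distances between all pairs of embedded points
    d2 = pdist(np.asarray(X)[:, :2], metric="sqeuclidean") + 1e-53
    # squared target differences for the same pairs, in the same pair order
    dy2 = pdist(np.asarray(Y, dtype=float).reshape(-1, 1), metric="sqeuclidean")
    # mean over all pairs, as in the loop version
    return np.sum(dy2 / d2) / len(d2)
```

Since pdist enumerates the same i < j pairs in both calls, this returns the same value as the loop version up to floating-point rounding.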
Optimizing with Optuna
When you think of parameter optimization, you think of Optuna; when you think of Optuna, you think of parameter optimization. So let's install it.
!pip install optuna
Successfully installed Mako-1.2.0 alembic-1.7.7 autopage-0.5.0 cliff-3.10.1 cmaes-0.8.2 cmd2-2.4.0 colorlog-6.6.0 optuna-2.10.0 pbr-5.8.1 pyperclip-1.8.2 stevedore-3.5.0
Supervised Isomap
Let's design a class that optimizes Isomap with Optuna.
import scipy.stats
import matplotlib.pyplot as plt
from sklearn.manifold import Isomap

class SupervisedIsomap:
    def __init__(self, X, Y, scorer):
        self.X = X
        self.Y = Y
        self.scorer = scorer
        self.best_score = 1e53
        self.best_model = None

    def __call__(self, trial):
        n_neighbors = trial.suggest_int("n_neighbors", 5, 50)
        eigen_solver = trial.suggest_categorical("eigen_solver", ["arpack", "dense"])
        path_method = trial.suggest_categorical("path_method", ["FW", "D"])
        neighbors_algorithm = trial.suggest_categorical("neighbors_algorithm", ["brute", "kd_tree", "ball_tree"])
        mapper = Isomap(
            n_neighbors=n_neighbors,
            eigen_solver=eigen_solver,
            path_method=path_method,
            neighbors_algorithm=neighbors_algorithm,
        )
        embedding = mapper.fit_transform(self.X)
        # score the standardized embedding against the target variable
        score = self.scorer(scipy.stats.zscore(embedding), self.Y)
        if self.best_score > score:
            # remember and display each new best model
            self.best_score = score
            self.best_model = mapper
            print(self.best_model)
            title = 'trial={0}, score={1:.3e}'.format(trial.number, score)
            plt.title(title)
            plt.scatter(embedding[:, 0], embedding[:, 1], c=self.Y, alpha=0.5)
            plt.colorbar()
            plt.show()
        return score
Here is an example optimization run. The output is shown with some parts omitted.
import optuna
objective = SupervisedIsomap(dataset.data, dataset.target, regression_scorer)
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)
Isomap(eigen_solver='dense', n_neighbors=50, neighbors_algorithm='brute',
path_method='FW')
Isomap(eigen_solver='dense', n_neighbors=41, neighbors_algorithm='brute',
path_method='FW')
Isomap(eigen_solver='arpack', n_neighbors=41, neighbors_algorithm='brute',
path_method='D')
Isomap(eigen_solver='arpack', n_neighbors=44, neighbors_algorithm='brute',
path_method='D')
Isomap(eigen_solver='arpack', n_neighbors=44, neighbors_algorithm='ball_tree',
path_method='D')
Isomap(eigen_solver='dense', n_neighbors=47, neighbors_algorithm='brute',
path_method='FW')
I think we managed to select the parameters under which the target variable separates most easily.
A score function for classification problems
With a small modification, the regression score function becomes a score function for classification problems. Unlike the regression case, in the classification case we treat the "difference" between two distinct classes as always being 1.
def classification_scorer(X, Y):
    total = 0
    n1 = 0
    for x1, y1 in zip(X, Y):
        n1 += 1
        n2 = 0
        for x2, y2 in zip(X, Y):
            n2 += 1
            if n1 > n2:  # visit each pair only once
                # squared Euclidean distance in the embedding (epsilon avoids division by zero)
                dist = ((x1[0] - x2[0])**2 + (x1[1] - x2[1])**2) + 1e-53
                # only pairs from different classes are penalized for being close
                if y1 != y2:
                    total += 1 / dist
    # average over all pairs
    return total / (len(Y) * (len(Y) - 1) / 2)
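As a quick sanity check (my own illustration, not from the original article), an embedding in which the classes are well separated should receive a lower score than one in which they overlap:

```python
import numpy as np

def classification_scorer(X, Y):
    # same logic as the scorer above, restated so this snippet runs on its own
    total = 0.0
    n = len(Y)
    for i in range(n):
        for j in range(i):
            dist = (X[i][0] - X[j][0])**2 + (X[i][1] - X[j][1])**2 + 1e-53
            if Y[i] != Y[j]:
                total += 1 / dist
    return total / (n * (n - 1) / 2)

rng = np.random.default_rng(0)
labels = np.array([0] * 20 + [1] * 20)
shift = np.column_stack([labels * 10.0, np.zeros(len(labels))])
separated = rng.normal(size=(40, 2)) + shift   # class 1 moved far away
overlapping = rng.normal(size=(40, 2))         # both classes share one cloud
sep_score = classification_scorer(separated, labels)
mix_score = classification_scorer(overlapping, labels)
```

With this fixed seed, sep_score comes out strictly lower than mix_score, which is the behavior the optimizer exploits.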
Everything else is the same as in the regression case.
Breast cancer dataset
An example of binary classification using the breast cancer dataset.
from sklearn.datasets import load_breast_cancer
dataset = load_breast_cancer()
import optuna
objective = SupervisedIsomap(dataset.data, dataset.target, classification_scorer)
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)
Isomap(eigen_solver='dense', n_neighbors=47, neighbors_algorithm='ball_tree',
path_method='D')
Isomap(eigen_solver='arpack', n_neighbors=47, neighbors_algorithm='brute',
path_method='FW')
Isomap(eigen_solver='arpack', n_neighbors=47, neighbors_algorithm='brute',
path_method='FW')
Isomap(eigen_solver='arpack', n_neighbors=47, neighbors_algorithm='brute',
path_method='FW')
Isomap(eigen_solver='arpack', n_neighbors=47, neighbors_algorithm='brute',
path_method='FW')
Isomap(eigen_solver='arpack', n_neighbors=47, neighbors_algorithm='brute',
path_method='FW')
Wine dataset
An example using the wine dataset. As this shows, the method also works when there are three or more classes.
from sklearn.datasets import load_wine
dataset = load_wine()
import optuna
objective = SupervisedIsomap(dataset.data, dataset.target, classification_scorer)
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)
Isomap(eigen_solver='arpack', n_neighbors=8, neighbors_algorithm='ball_tree',
path_method='D')
Isomap(eigen_solver='arpack', n_neighbors=42, neighbors_algorithm='ball_tree',
path_method='FW')
Isomap(eigen_solver='dense', n_neighbors=42, neighbors_algorithm='brute',
path_method='FW')
Isomap(eigen_solver='arpack', n_neighbors=45, neighbors_algorithm='brute',
path_method='D')
Isomap(eigen_solver='arpack', n_neighbors=45, neighbors_algorithm='ball_tree',
path_method='D')
Handwritten digits dataset
An example using the handwritten digits dataset. Here there are 10 classes.
from sklearn.datasets import load_digits
dataset = load_digits()
import optuna
objective = SupervisedIsomap(dataset.data, dataset.target, classification_scorer)
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)
Isomap(eigen_solver='arpack', n_neighbors=32, neighbors_algorithm='kd_tree',
path_method='D')
Isomap(eigen_solver='arpack', n_neighbors=21, neighbors_algorithm='kd_tree',
path_method='FW')
Isomap(eigen_solver='dense', n_neighbors=45, neighbors_algorithm='brute',
path_method='FW')
Isomap(eigen_solver='dense', n_neighbors=50, neighbors_algorithm='kd_tree',
path_method='FW')
Isomap(eigen_solver='dense', n_neighbors=44, neighbors_algorithm='kd_tree',
path_method='FW')
Isomap(eigen_solver='arpack', n_neighbors=29, neighbors_algorithm='kd_tree',
path_method='D')
Isomap(eigen_solver='arpack', n_neighbors=29, neighbors_algorithm='kd_tree',
path_method='D')
Conclusion
Although it depends on the data, Isomap's parameter dependence seems small. It is interesting to compare this with t-SNE.