
Supervised hyperparameter tuning for t-SNE

Last updated at 2023-03-16, posted at 2022-03-29

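t-SNE is a popular technique for visualizing high-dimensional data. However, it has a number of parameters, the result changes considerably depending on them, and there is no "quantitative" way to judge whether a result is good or bad (even if "subjective" ways exist).

(If you know of one, I would be very grateful for a comment.)

For background on t-SNE, the following article is easy to follow.

高次元のデータを可視化するt-SNEの効果的な使い方 (How to use t-SNE effectively to visualize high-dimensional data)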

Varying the t-SNE parameters

t-SNE is implemented in scikit-learn, so let's use that implementation.

Diabetes dataset

As a trial dataset, we use the diabetes dataset that can be loaded from scikit-learn.

from sklearn.manifold import TSNE

import sklearn.datasets
import matplotlib.pyplot as plt

dataset = sklearn.datasets.load_diabetes()

mapper = TSNE()
embedding = mapper.fit_transform(dataset.data)
plt.scatter(embedding[:, 0], embedding[:, 1], c=dataset.target, alpha=0.5)
plt.colorbar()
plt.show()

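Here the point colors are the variable in dataset.target. Note that these values are not used when fitting t-SNE. t-SNE is a form of unsupervised learning that visualizes the structure of the high-dimensional data it was fit on, but as a by-product, relationships with variables that were not used in fitting sometimes become visible as well.

Now let's see how much this mapping changes when the parameters are changed.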

perplexity

from sklearn.manifold import TSNE
import sklearn.datasets
import matplotlib.pyplot as plt

dataset = sklearn.datasets.load_diabetes()

for perplexity in (5, 10, 30, 50):
    mapper = TSNE(perplexity=perplexity)
    embedding = mapper.fit_transform(dataset.data)
    title='perplexity = {0}'.format(perplexity)
    plt.title(title)
    plt.scatter(embedding[:, 0], embedding[:, 1], c=dataset.target, alpha=0.5)
    plt.colorbar()
    plt.show()

early_exaggeration

from sklearn.manifold import TSNE
import sklearn.datasets
import matplotlib.pyplot as plt

dataset = sklearn.datasets.load_diabetes()

for early_exaggeration in (6, 12, 24, 48):
    mapper = TSNE(early_exaggeration=early_exaggeration)
    embedding = mapper.fit_transform(dataset.data)
    title='early_exaggeration = {0}'.format(early_exaggeration)
    plt.title(title)
    plt.scatter(embedding[:, 0], embedding[:, 1], c=dataset.target, alpha=0.5)
    plt.colorbar()
    plt.show()

init

from sklearn.manifold import TSNE
import sklearn.datasets
import matplotlib.pyplot as plt

dataset = sklearn.datasets.load_diabetes()

for init in ["random", "pca"]:
    mapper = TSNE(init=init)
    embedding = mapper.fit_transform(dataset.data)
    title='init = {0}'.format(init)
    plt.title(title)
    plt.scatter(embedding[:, 0], embedding[:, 1], c=dataset.target, alpha=0.5)
    plt.colorbar()
    plt.show()
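
The sweeps above leave random_state unset, so the differences between plots mix the effect of the parameter with run-to-run randomness. The following is a minimal sketch of one way to separate the two, assuming a hypothetical helper named embed (not part of the original article) and a 100-row subset chosen only to keep the run short. random_state is a documented TSNE argument; passing an integer makes repeated calls reproducible.

```python
from sklearn.datasets import load_diabetes
from sklearn.manifold import TSNE


def embed(data, seed=0, **tsne_params):
    """Hypothetical helper: fit t-SNE with a fixed seed so only tsne_params vary."""
    mapper = TSNE(random_state=seed, **tsne_params)
    return mapper.fit_transform(data)


# Assumption: a small subset is enough to illustrate the idea and runs quickly.
data = load_diabetes().data[:100]

# Same seed for every run, different perplexity values.
embeddings = {p: embed(data, seed=0, perplexity=p) for p in (5, 30)}
for perplexity, embedding in embeddings.items():
    print(perplexity, embedding.shape)
```

With the seed held fixed, whatever differs between the two embeddings can be attributed to perplexity rather than to the random initialization.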

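So which parameters are best?

We have seen clearly that changing the parameters changes the visualization. So, in the end, which parameters should we use?

The article introduced earlier recommends producing several visualizations with several parameter settings and using them together to understand the structure of the data.

When there is no target variable, that may be enough. But what if there is one? The explanatory variables often include some that have little to do with the target. Unsupervised learning builds the mapping without considering the relationship to the target variable, but we would like the mapping to minimize the influence of variables we do not need (right?).

Score function for regression problems

So let's design a score function that turns the unsupervised t-SNE into a "supervised t-SNE." For a regression problem it could be built as follows, for example: for each pair of mapped points, take the squared difference of the target variable and divide it by the squared Euclidean distance between the two points, then average this value over all pairs. A smaller score means that points with very different target values are mapped far apart.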

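def regression_scorer(X, Y):
    # Mean over all point pairs of
    # (squared target difference) / (squared 2-D distance); lower is better.
    total = 0.0  # avoid shadowing the built-in sum()
    n = len(Y)
    for i in range(n):
        for j in range(i):  # visit each unordered pair once
            # The tiny constant avoids division by zero for coincident points
            dist = (X[i][0] - X[j][0])**2 + (X[i][1] - X[j][1])**2 + 1e-53
            total += (Y[i] - Y[j])**2 / dist

    return total / (n * (n - 1) / 2)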
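
The double loop above visits every pair in pure Python, which gets slow as the number of points grows. Below is a sketch of an equivalent vectorized form, assuming SciPy is available. The name regression_scorer_vectorized and the eps argument are my own and do not appear in the original article. scipy.spatial.distance.pdist returns the pairwise values in the same pair order for both arrays, so they can be divided element-wise.

```python
import numpy as np
from scipy.spatial.distance import pdist


def regression_scorer_vectorized(X, Y, eps=1e-53):
    """Mean over all pairs of squared target difference / squared 2-D distance."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float).reshape(-1, 1)
    # Squared Euclidean distances between embedded points (first two columns only,
    # matching the loop version), plus a tiny constant to avoid division by zero.
    sq_dist = pdist(X[:, :2], metric="sqeuclidean") + eps
    # Squared differences of the target variable, in the same pair order.
    sq_diff = pdist(Y, metric="sqeuclidean")
    return float(np.mean(sq_diff / sq_dist))


# Tiny worked example: three points, three pairs.
example_X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
example_Y = [0.0, 1.0, 3.0]
example_score = regression_scorer_vectorized(example_X, example_Y)
print(example_score)  # pair ratios are 1/1, 9/4 and 4/5, so the mean is about 1.35
```

Because it takes the same (X, Y) arguments, this function could be passed anywhere the loop version is used.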

Optimizing with Optuna

Speaking of parameter optimization, there is Optuna. Speaking of Optuna, there is parameter optimization. So let's install it.

!pip install optuna

Successfully installed Mako-1.2.0 alembic-1.7.7 autopage-0.5.0 cliff-3.10.1 cmaes-0.8.2 cmd2-2.4.0 colorlog-6.6.0 optuna-2.10.0 pbr-5.8.1 pyperclip-1.8.2 stevedore-3.5.0

Supervised t-SNE

We design a class that uses Optuna to optimize t-SNE.

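import matplotlib.pyplot as plt
import scipy.stats  # "import scipy" alone does not guarantee scipy.stats is loaded
from sklearn.manifold import TSNE

class SupervisedTSNE:
    """Optuna objective: fit t-SNE with the trial's parameters and score the embedding."""

    def __init__(self, X, Y, scorer):
        self.X = X
        self.Y = Y
        self.scorer = scorer
        self.best_score = 1e53  # large initial value, replaced by the first real score
        self.best_model = None

    def __call__(self, trial):
        # suggest_float is used because newer Optuna releases deprecate suggest_uniform
        perplexity = trial.suggest_float("perplexity", 5, 50)
        early_exaggeration = trial.suggest_float("early_exaggeration", 6, 48)
        init = trial.suggest_categorical("init", ["random", "pca"])

        mapper = TSNE(
            perplexity=perplexity,
            early_exaggeration=early_exaggeration,
            init=init,
        )
        embedding = mapper.fit_transform(self.X)
        # Standardize each embedding axis before scoring
        score = self.scorer(scipy.stats.zscore(embedding), self.Y)

        # Print and plot only when this trial improves on the best score so far
        if self.best_score > score:
            self.best_score = score
            self.best_model = mapper

            print(self.best_model)
            title = 'trial={0}, score={1:.3e}'.format(trial.number, score)
            plt.title(title)
            plt.scatter(embedding[:, 0], embedding[:, 1], c=self.Y, alpha=0.5)
            plt.colorbar()
            plt.show()

        return score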
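
The class standardizes the embedding with scipy.stats.zscore before scoring it. The article does not say why. One plausible reading (my interpretation, not the author's statement) is that the pairwise score depends on the overall scale of the embedding: multiplying every coordinate by k divides the score by k squared, so without standardization a setting could look better merely by producing a larger map. The sketch below checks this on random data. pair_score is a hypothetical stand-in for the regression score, written here only for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import zscore


def pair_score(X, Y, eps=1e-53):
    """Hypothetical stand-in: mean of squared target difference / squared distance."""
    sq_dist = pdist(np.asarray(X, dtype=float), metric="sqeuclidean") + eps
    sq_diff = pdist(np.asarray(Y, dtype=float).reshape(-1, 1), metric="sqeuclidean")
    return float(np.mean(sq_diff / sq_dist))


rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))  # stand-in for a 2-D embedding
Y = rng.normal(size=30)       # stand-in for a target variable

raw = pair_score(X, Y)
scaled = pair_score(10 * X, Y)             # same layout, 10 times larger
raw_z = pair_score(zscore(X), Y)
scaled_z = pair_score(zscore(10 * X), Y)   # standardization removes the scale

print(raw / scaled)      # close to 100
print(raw_z / scaled_z)  # close to 1
```

After standardization the two layouts receive the same score, so only the relative arrangement of the points matters.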

Here is an example run of the optimization. The results are shown with some parts omitted.

import optuna

objective = SupervisedTSNE(dataset.data, dataset.target, regression_scorer)
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)

TSNE(early_exaggeration=29.57478761904391, init='pca',
     perplexity=7.288180627051028)

TSNE(early_exaggeration=32.743719902491954, init='random',
     perplexity=24.065691436818355)

TSNE(early_exaggeration=32.016253092352976, init='random',
     perplexity=36.81327058527401)

TSNE(early_exaggeration=18.6793601909901, init='random',
     perplexity=27.80237026497791)

TSNE(early_exaggeration=22.976468842875498, init='random',
     perplexity=29.058905297492934)

As the optimization proceeds, I think it arrives at parameters under which the target variable is easiest to separate.

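Score function for classification problems

We modify the regression score function slightly to obtain a score function for classification. Unlike the regression case, in classification the "difference" between two different classes is always taken to be 1 (and pairs in the same class contribute nothing).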

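def classification_scorer(X, Y):
    # Mean over all point pairs of 1 / (squared 2-D distance), counting only
    # pairs whose labels differ; same-label pairs contribute 0. Lower is better.
    total = 0.0  # avoid shadowing the built-in sum()
    n = len(Y)
    for i in range(n):
        for j in range(i):  # visit each unordered pair once
            if Y[i] != Y[j]:
                # The tiny constant avoids division by zero for coincident points
                dist = (X[i][0] - X[j][0])**2 + (X[i][1] - X[j][1])**2 + 1e-53
                total += 1 / dist

    return total / (n * (n - 1) / 2)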

Everything else is the same as in the regression case.
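
As a quick sanity check of the idea behind this score, the sketch below evaluates a vectorized variant on one fixed set of coordinates with two different labelings: one where each cluster holds a single class, and one where both clusters contain both classes. The function name classification_scorer_vectorized and the toy coordinates are my own and are not from the original article. The labeling that separates the classes should receive the lower score.

```python
import numpy as np
from scipy.spatial.distance import pdist


def classification_scorer_vectorized(X, Y, eps=1e-53):
    """Sum of 1 / squared distance over differently labeled pairs, divided by all pairs."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(Y)
    sq_dist = pdist(X[:, :2], metric="sqeuclidean") + eps
    # Pair indices in the same row-major (i < j) order that pdist uses.
    i, j = np.triu_indices(len(labels), k=1)
    differs = labels[i] != labels[j]
    return float(np.sum(1.0 / sq_dist[differs]) / len(sq_dist))


# Two tight clusters of four points each, far apart along the first axis.
coords = np.array([
    [0, 0], [0, 1], [1, 0], [1, 1],
    [10, 0], [10, 1], [11, 0], [11, 1],
], dtype=float)

separated_labels = [0, 0, 0, 0, 1, 1, 1, 1]  # one class per cluster
mixed_labels = [0, 1, 0, 1, 0, 1, 0, 1]      # both classes in each cluster

separated = classification_scorer_vectorized(coords, separated_labels)
mixed = classification_scorer_vectorized(coords, mixed_labels)
print(separated < mixed)
```

In the separated labeling every differently labeled pair is at least 9 units apart, while the mixed labeling has differently labeled neighbors at distance 1, which dominates the sum.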

Breast cancer dataset

This is an example of binary classification using the breast cancer dataset.

from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()

import optuna

objective = SupervisedTSNE(dataset.data, dataset.target, classification_scorer)
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)

TSNE(early_exaggeration=12.611510237145549, init='random',
     perplexity=35.56361120894745)

TSNE(early_exaggeration=11.628359825589639, init='random',
     perplexity=34.11686139958637)

TSNE(early_exaggeration=24.20393636473322, init='pca',
     perplexity=24.985853567716287)

TSNE(early_exaggeration=6.663314666588436, init='pca',
     perplexity=49.552144264042596)

TSNE(early_exaggeration=13.917835653986023, init='pca',
     perplexity=46.58919209262085)

Wine dataset

This is an example using the wine dataset. As it shows, the approach also works when there are three or more labels.

from sklearn.datasets import load_wine

dataset = load_wine()

import optuna

objective = SupervisedTSNE(dataset.data, dataset.target, classification_scorer)
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)

TSNE(early_exaggeration=44.81403485997158, init='random',
     perplexity=25.09137902984912)

TSNE(early_exaggeration=27.620977309632256, init='pca',
     perplexity=46.49121817564552)

TSNE(early_exaggeration=21.437834212793774, init='pca',
     perplexity=35.0983127345436)

TSNE(early_exaggeration=16.6199311486226, init='pca',
     perplexity=35.26687480395333)

TSNE(early_exaggeration=37.38551556793539, init='pca',
     perplexity=37.02128781194007)

TSNE(early_exaggeration=43.617308902887665, init='pca',
     perplexity=25.688919658222332)

TSNE(early_exaggeration=41.44898877661107, init='pca',
     perplexity=49.518301278226495)

Handwritten digits dataset

This is an example using the handwritten digits dataset. In this example there are 10 labels.

from sklearn.datasets import load_digits

dataset = load_digits()

import optuna

objective = SupervisedTSNE(dataset.data, dataset.target, classification_scorer)
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)

TSNE(early_exaggeration=18.55172689770817, init='random',
     perplexity=24.74617700539417)

TSNE(early_exaggeration=24.70355395287924, init='pca',
     perplexity=9.82525783382701)

TSNE(early_exaggeration=40.247290546695076, init='random',
     perplexity=10.034670577895955)

TSNE(early_exaggeration=47.52143321379234, init='pca',
     perplexity=19.072079610868933)

TSNE(early_exaggeration=47.905280869230495, init='pca',
     perplexity=19.914995719242043)

TSNE(early_exaggeration=47.7352451400118, init='pca',
     perplexity=13.574817469405804)

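Closing remarks

When a target variable is available, running an optimization like this should make it easier to grasp the overall picture "with the relationship to the target variable taken into account."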
