I previously wrote an article on supervised hyperparameter tuning for UMAP, t-SNE, and Isomap. There is also a library called Ivis that can do supervised, unsupervised, and semi-supervised dimensionality reduction, so I gave it a try.
Installing Ivis
!pip install git+https://github.com/beringresearch/ivis.git
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/beringresearch/ivis.git
Cloning https://github.com/beringresearch/ivis.git to /tmp/pip-req-build-fmiazz46
Running command git clone -q https://github.com/beringresearch/ivis.git /tmp/pip-req-build-fmiazz46
Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from ivis==2.0.8) (1.21.6)
Requirement already satisfied: scikit-learn>0.20.0 in /usr/local/lib/python3.7/dist-packages (from ivis==2.0.8) (1.0.2)
Collecting annoy>=1.15.2
Downloading annoy-1.17.1.tar.gz (647 kB)
     |████████████████████████████████| 647 kB 5.1 MB/s
Requirement already satisfied: tqdm in /usr/local/lib/python3.7/dist-packages (from ivis==2.0.8) (4.64.1)
Requirement already satisfied: dill in /usr/local/lib/python3.7/dist-packages (from ivis==2.0.8) (0.3.6)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn>0.20.0->ivis==2.0.8) (1.2.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn>0.20.0->ivis==2.0.8) (3.1.0)
Requirement already satisfied: scipy>=1.1.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn>0.20.0->ivis==2.0.8) (1.7.3)
Building wheels for collected packages: ivis, annoy
Building wheel for ivis (setup.py) ... done
Created wheel for ivis: filename=ivis-2.0.8-py3-none-any.whl size=35462 sha256=2254db91d6b5d1e2b898313c211ff5fc15bc0b165301dbcd03e2cbda32254878
Stored in directory: /tmp/pip-ephem-wheel-cache-xjhcw8qi/wheels/d1/d7/92/46dbb25fa631e6ed1b333d872b9898f8d2df3f5437452fb834
Building wheel for annoy (setup.py) ... done
Created wheel for annoy: filename=annoy-1.17.1-cp37-cp37m-linux_x86_64.whl size=395185 sha256=a0bf153dd01cd1bcb4a5062d921af7467759c698f8d5e15857512ec38fc2d6e8
Stored in directory: /root/.cache/pip/wheels/81/94/bf/92cb0e4fef8770fe9c6df0ba588fca30ab7c306b6048ae8a54
Successfully built ivis annoy
Installing collected packages: annoy, ivis
Successfully installed annoy-1.17.1 ivis-2.0.8
Diabetes data, unsupervised
As a test dataset, let's start with the diabetes data available from scikit-learn, used without supervision. First, the training step.
from ivis import Ivis
import sklearn.datasets
import matplotlib.pyplot as plt
dataset = sklearn.datasets.load_diabetes()
mapper = Ivis()
mapper.fit(dataset.data)
Building KNN index
100%|██████████| 442/442 [00:00<00:00, 135181.74it/s]
Extracting KNN neighbours
100%|██████████| 442/442 [00:00<00:00, 869.80it/s]
Training neural network
Epoch 1/1000
4/4 [==============================] - 2s 21ms/step - loss: 1.2240
Epoch 2/1000
4/4 [==============================] - 0s 14ms/step - loss: 1.1184
Epoch 3/1000
4/4 [==============================] - 0s 12ms/step - loss: 1.0714
Epoch 4/1000
4/4 [==============================] - 0s 10ms/step - loss: 1.1124
Epoch 5/1000
4/4 [==============================] - 0s 9ms/step - loss: 1.0280
(snipped)
Epoch 168/1000
4/4 [==============================] - 0s 11ms/step - loss: 0.5203
Epoch 169/1000
4/4 [==============================] - 0s 10ms/step - loss: 0.5536
Once training has finished, call transform as follows.
embedding = mapper.transform(dataset.data)
4/4 [==============================] - 0s 7ms/step
Let's check the shape, just to be sure. The number of output dimensions defaults to 2, but you can of course choose a different number if you prefer, as sketched below.
embedding.shape
(442, 2)
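For example, to get a 3-dimensional embedding instead, the embedding_dims constructor argument can be set. This is only a sketch based on the Ivis documentation; I did not rerun it for this article.
from ivis import Ivis
# Sketch: request a 3-dimensional embedding instead of the default 2
# (embedding_dims per the Ivis docs; not run in this article).
mapper_3d = Ivis(embedding_dims=3)
mapper_3d.fit(dataset.data)
embedding_3d = mapper_3d.transform(dataset.data)
print(embedding_3d.shape)  # expected: (442, 3)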
Let's visualize the result of the dimensionality reduction.
plt.scatter(embedding[:, 0], embedding[:, 1], c=dataset.target, alpha=0.5)
plt.colorbar()
plt.show()
Ivis performs deep learning based on TensorFlow, so unless you fix the random seeds, the result will differ from run to run. Here is the result of running the same computation again.
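By the way, if you want the embedding to be reproducible, one option is to fix the random seeds before fitting. This is just a sketch based on standard NumPy/TensorFlow practice; I have not verified that it makes Ivis fully deterministic (the Annoy index build and GPU ops may still introduce variation).
import random
import numpy as np
import tensorflow as tf
# Fix every random number generator involved before fitting
# (sketch; not verified to make Ivis fully deterministic).
random.seed(0)
np.random.seed(0)
tf.random.set_seed(0)
mapper = Ivis()
mapper.fit(dataset.data)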
Diabetes data, supervised (regression)
This time, still with the diabetes data, we pass the target variable and perform supervised dimensionality reduction with a regression objective.
from ivis import Ivis
import sklearn.datasets
import matplotlib.pyplot as plt
dataset = sklearn.datasets.load_diabetes()
mapper = Ivis(supervision_metric="mae")
mapper.fit(dataset.data, dataset.target)
embedding = mapper.transform(dataset.data)
plt.scatter(embedding[:, 0], embedding[:, 1], c=dataset.target, alpha=0.5)
plt.colorbar()
plt.show()
Building KNN index
100%|██████████| 442/442 [00:00<00:00, 91131.22it/s]
Extracting KNN neighbours
100%|██████████| 442/442 [00:00<00:00, 1086.21it/s]
Training neural network
Epoch 1/1000
4/4 [==============================] - 2s 9ms/step - loss: 76.6442 - stacked_triplets_loss: 1.1391 - supervised_loss: 152.1492
Epoch 2/1000
4/4 [==============================] - 0s 10ms/step - loss: 76.5158 - stacked_triplets_loss: 1.0523 - supervised_loss: 151.9793
Epoch 3/1000
4/4 [==============================] - 0s 14ms/step - loss: 76.4499 - stacked_triplets_loss: 1.1045 - supervised_loss: 151.7953
Epoch 4/1000
4/4 [==============================] - 0s 10ms/step - loss: 76.3735 - stacked_triplets_loss: 1.0908 - supervised_loss: 151.6561
Epoch 5/1000
4/4 [==============================] - 0s 12ms/step - loss: 76.2924 - stacked_triplets_loss: 1.0760 - supervised_loss: 151.5088
(snipped)
Epoch 149/1000
4/4 [==============================] - 0s 9ms/step - loss: 28.2583 - stacked_triplets_loss: 6.5393 - supervised_loss: 49.9772
Epoch 150/1000
4/4 [==============================] - 0s 12ms/step - loss: 28.7334 - stacked_triplets_loss: 6.4428 - supervised_loss: 51.0239
4/4 [==============================] - 0s 6ms/step
Here is the result of running the same computation again.
With supervision, the embedding does seem to reflect the magnitude of the target variable to some extent.
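According to the Ivis documentation there is also a supervision_weight parameter (default 0.5) that balances the triplet loss against the supervised loss. A hypothetical variant that leans more heavily on the target would look like this; the value 0.8 is an arbitrary illustration and I did not run it.
# Sketch: weight the supervised (regression) loss more heavily than the
# triplet loss; 0.8 is an arbitrary value chosen for illustration.
mapper = Ivis(supervision_metric="mae", supervision_weight=0.8)
mapper.fit(dataset.data, dataset.target)
embedding = mapper.transform(dataset.data)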
Breast cancer data, unsupervised
Next, let's use the breast cancer data, a classification dataset. First we reduce dimensions without supervision, i.e. without giving the target variable.
from ivis import Ivis
import sklearn.datasets
import matplotlib.pyplot as plt
dataset = sklearn.datasets.load_breast_cancer()
mapper = Ivis()
mapper.fit(dataset.data)
embedding = mapper.transform(dataset.data)
plt.scatter(embedding[:, 0], embedding[:, 1], c=dataset.target, alpha=0.5)
plt.colorbar()
plt.show()
Building KNN index
100%|██████████| 569/569 [00:00<00:00, 125173.55it/s]
Extracting KNN neighbours
100%|██████████| 569/569 [00:00<00:00, 1404.38it/s]
Training neural network
Epoch 1/1000
5/5 [==============================] - 1s 10ms/step - loss: 18.8974
Epoch 2/1000
5/5 [==============================] - 0s 10ms/step - loss: 11.1786
Epoch 3/1000
5/5 [==============================] - 0s 12ms/step - loss: 9.5566
Epoch 4/1000
5/5 [==============================] - 0s 12ms/step - loss: 8.4553
Epoch 5/1000
5/5 [==============================] - 0s 12ms/step - loss: 8.2069
(snipped)
Epoch 82/1000
5/5 [==============================] - 0s 10ms/step - loss: 1.0038
Epoch 83/1000
5/5 [==============================] - 0s 10ms/step - loss: 0.8592
5/5 [==============================] - 0s 4ms/step
Here is the result of running the same computation again.
The two results look quite different (that is, the result does not seem very stable)...
Breast cancer data, supervised (classification)
For the same data, this time let's give the classification target and perform supervised dimensionality reduction.
from ivis import Ivis
import sklearn.datasets
import matplotlib.pyplot as plt
dataset = sklearn.datasets.load_breast_cancer()
mapper = Ivis()
mapper.fit(dataset.data, dataset.target)
embedding = mapper.transform(dataset.data)
plt.scatter(embedding[:, 0], embedding[:, 1], c=dataset.target, alpha=0.5)
plt.colorbar()
plt.show()
Building KNN index
100%|██████████| 569/569 [00:00<00:00, 53078.29it/s]
Extracting KNN neighbours
100%|██████████| 569/569 [00:00<00:00, 681.58it/s]
Training neural network
Epoch 1/1000
5/5 [==============================] - 2s 10ms/step - loss: 16.7840 - stacked_triplets_loss: 15.3962 - supervised_loss: 18.1718
Epoch 2/1000
5/5 [==============================] - 0s 10ms/step - loss: 11.9572 - stacked_triplets_loss: 13.6644 - supervised_loss: 10.2501
Epoch 3/1000
5/5 [==============================] - 0s 16ms/step - loss: 9.0844 - stacked_triplets_loss: 10.7357 - supervised_loss: 7.4332
Epoch 4/1000
5/5 [==============================] - 0s 11ms/step - loss: 8.2655 - stacked_triplets_loss: 8.6759 - supervised_loss: 7.8552
Epoch 5/1000
5/5 [==============================] - 0s 9ms/step - loss: 8.3221 - stacked_triplets_loss: 8.7451 - supervised_loss: 7.8990
(snipped)
Epoch 210/1000
5/5 [==============================] - 0s 12ms/step - loss: 0.3639 - stacked_triplets_loss: 0.4976 - supervised_loss: 0.2302
Epoch 211/1000
5/5 [==============================] - 0s 11ms/step - loss: 0.3241 - stacked_triplets_loss: 0.4379 - supervised_loss: 0.2104
5/5 [==============================] - 0s 4ms/step
Here is the result of running the same computation again.
The embedding seems to reflect the difference in the target variable somewhat (only slightly?).
Wine data
We do the same with the wine data: unsupervised and then supervised dimensionality reduction. First, unsupervised.
from ivis import Ivis
import sklearn.datasets
import matplotlib.pyplot as plt
dataset = sklearn.datasets.load_wine()
mapper = Ivis()
mapper.fit(dataset.data)
embedding = mapper.transform(dataset.data)
plt.scatter(embedding[:, 0], embedding[:, 1], c=dataset.target, alpha=0.5)
plt.colorbar()
plt.show()
Building KNN index
100%|██████████| 178/178 [00:00<00:00, 180160.74it/s]
Extracting KNN neighbours
100%|██████████| 178/178 [00:00<00:00, 573.19it/s]
Training neural network
Epoch 1/1000
2/2 [==============================] - 1s 14ms/step - loss: 19.4780
Epoch 2/1000
2/2 [==============================] - 0s 12ms/step - loss: 14.3163
Epoch 3/1000
2/2 [==============================] - 0s 13ms/step - loss: 11.7559
Epoch 4/1000
2/2 [==============================] - 0s 9ms/step - loss: 11.7646
Epoch 5/1000
2/2 [==============================] - 0s 10ms/step - loss: 16.4651
(snipped)
Epoch 70/1000
2/2 [==============================] - 0s 17ms/step - loss: 1.4658
Epoch 71/1000
2/2 [==============================] - 0s 16ms/step - loss: 1.6900
2/2 [==============================] - 0s 6ms/step
Here is the result of running the same computation again.
Next, in the same way, supervised dimensionality reduction with the (classification) target variable given.
from ivis import Ivis
import sklearn.datasets
import matplotlib.pyplot as plt
dataset = sklearn.datasets.load_wine()
mapper = Ivis()
mapper.fit(dataset.data, dataset.target)
embedding = mapper.transform(dataset.data)
plt.scatter(embedding[:, 0], embedding[:, 1], c=dataset.target, alpha=0.5)
plt.colorbar()
plt.show()
Building KNN index
100%|██████████| 178/178 [00:00<00:00, 167546.25it/s]
Extracting KNN neighbours
100%|██████████| 178/178 [00:00<00:00, 577.28it/s]
Training neural network
Epoch 1/1000
2/2 [==============================] - 2s 16ms/step - loss: 56.4703 - stacked_triplets_loss: 19.4923 - supervised_loss: 93.4482
Epoch 2/1000
2/2 [==============================] - 0s 14ms/step - loss: 27.5077 - stacked_triplets_loss: 19.3239 - supervised_loss: 35.6914
Epoch 3/1000
2/2 [==============================] - 0s 17ms/step - loss: 30.7517 - stacked_triplets_loss: 11.3563 - supervised_loss: 50.1470
Epoch 4/1000
2/2 [==============================] - 0s 16ms/step - loss: 30.9872 - stacked_triplets_loss: 18.4747 - supervised_loss: 43.4996
Epoch 5/1000
2/2 [==============================] - 0s 15ms/step - loss: 19.9857 - stacked_triplets_loss: 13.8399 - supervised_loss: 26.1315
(snipped)
Epoch 111/1000
2/2 [==============================] - 0s 20ms/step - loss: 1.0617 - stacked_triplets_loss: 0.9649 - supervised_loss: 1.1585
Epoch 112/1000
2/2 [==============================] - 0s 26ms/step - loss: 2.1483 - stacked_triplets_loss: 2.9829 - supervised_loss: 1.3138
WARNING:tensorflow:5 out of the last 15 calls to <function Model.make_predict_function.<locals>.predict_function at 0x7fc88118b320> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has reduce_retracing=True option that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for more details.
2/2 [==============================] - 0s 9ms/step
Hmm, some warning appeared... (ignoring it for now)
The result does not look improved. Maybe I need to deal with the warning... (ignoring it for now)
Handwritten digits data
Finally, let's try the handwritten digits dataset. First, unsupervised dimensionality reduction without the target variable.
from ivis import Ivis
import sklearn.datasets
import matplotlib.pyplot as plt
dataset = sklearn.datasets.load_digits()
mapper = Ivis()
mapper.fit(dataset.data)
embedding = mapper.transform(dataset.data)
plt.scatter(embedding[:, 0], embedding[:, 1], c=dataset.target, alpha=0.5)
plt.colorbar()
plt.show()
Building KNN index
100%|██████████| 1797/1797 [00:00<00:00, 116449.04it/s]
Extracting KNN neighbours
100%|██████████| 1797/1797 [00:01<00:00, 1148.54it/s]
Training neural network
Epoch 1/1000
15/15 [==============================] - 1s 11ms/step - loss: 1.0162
Epoch 2/1000
15/15 [==============================] - 0s 12ms/step - loss: 0.8303
Epoch 3/1000
15/15 [==============================] - 0s 16ms/step - loss: 0.5445
Epoch 4/1000
15/15 [==============================] - 0s 12ms/step - loss: 0.5609
Epoch 5/1000
15/15 [==============================] - 0s 13ms/step - loss: 0.4943
(snipped)
Epoch 132/1000
15/15 [==============================] - 0s 11ms/step - loss: 0.1893
Epoch 133/1000
15/15 [==============================] - 0s 11ms/step - loss: 0.1837
15/15 [==============================] - 0s 3ms/step
Next, supervised dimensionality reduction with the target variable given.
from ivis import Ivis
import sklearn.datasets
import matplotlib.pyplot as plt
dataset = sklearn.datasets.load_digits()
mapper = Ivis()
mapper.fit(dataset.data, dataset.target)
embedding = mapper.transform(dataset.data)
plt.scatter(embedding[:, 0], embedding[:, 1], c=dataset.target, alpha=0.5)
plt.colorbar()
plt.show()
Building KNN index
100%|██████████| 1797/1797 [00:00<00:00, 40648.71it/s]
Extracting KNN neighbours
100%|██████████| 1797/1797 [00:01<00:00, 1011.28it/s]
Training neural network
Epoch 1/1000
15/15 [==============================] - 2s 11ms/step - loss: 1.7909 - stacked_triplets_loss: 1.0847 - supervised_loss: 2.4971
Epoch 2/1000
15/15 [==============================] - 0s 11ms/step - loss: 1.3948 - stacked_triplets_loss: 0.8389 - supervised_loss: 1.9507
Epoch 3/1000
15/15 [==============================] - 0s 11ms/step - loss: 1.1755 - stacked_triplets_loss: 0.5271 - supervised_loss: 1.8240
Epoch 4/1000
15/15 [==============================] - 0s 11ms/step - loss: 1.1343 - stacked_triplets_loss: 0.5658 - supervised_loss: 1.7028
Epoch 5/1000
15/15 [==============================] - 0s 12ms/step - loss: 1.0922 - stacked_triplets_loss: 0.5646 - supervised_loss: 1.6198
(snipped)
Epoch 87/1000
15/15 [==============================] - 0s 13ms/step - loss: 0.5124 - stacked_triplets_loss: 0.4945 - supervised_loss: 0.5302
Epoch 88/1000
15/15 [==============================] - 0s 12ms/step - loss: 0.4649 - stacked_triplets_loss: 0.4065 - supervised_loss: 0.5233
15/15 [==============================] - 0s 3ms/step
When the target variable is given, the embedding seems to capture the structure quite well, but without it, the result seems inferior to UMAP and the like...
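For reference, the corresponding unsupervised UMAP baseline on the same data would look roughly like the sketch below (it requires umap-learn to be installed separately; I did not rerun it here).
# Sketch of an unsupervised UMAP baseline for comparison (not run here).
import umap
reducer = umap.UMAP()
embedding_umap = reducer.fit_transform(dataset.data)
plt.scatter(embedding_umap[:, 0], embedding_umap[:, 1], c=dataset.target, alpha=0.5)
plt.colorbar()
plt.show()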
More details
This was only a quick look at Ivis; for details, see the official documentation at https://bering-ivis.readthedocs.io/en/latest/. Semi-supervised learning is also supported, and there are many other parameters, with detailed explanations of their effects.
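For example, the documentation describes semi-supervised classification as marking unlabeled points with -1 in the label array. Here is a rough sketch on the digits data; the fraction of masked labels is arbitrary and I did not run this for the article.
import numpy as np
from ivis import Ivis
import sklearn.datasets
dataset = sklearn.datasets.load_digits()
# Pretend half the labels are unknown: per the Ivis docs, -1 marks an
# unlabeled point in semi-supervised mode (sketch, not run here).
labels = dataset.target.copy()
rng = np.random.default_rng(0)
labels[rng.random(len(labels)) < 0.5] = -1
mapper = Ivis()
mapper.fit(dataset.data, labels)
embedding = mapper.transform(dataset.data)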
Impressions
The documentation mentions that Ivis is particularly strong on high-dimensional, sparse data, so it may be a good choice when you have to work with that kind of data. For ordinary(?) data, though, I feel UMAP and the like work better (it may simply be that I have not learned to use Ivis well).