
[Python / UMAP] How to fix "TypeError: Cannot use scipy.linalg.eigh for sparse A with k >= N. Use scipy.linalg.eigh(A.toarray()) or reduce k."

Posted at 2024-01-05

Symptom

While writing unit tests for code that uses umap.UMAP, I ran into the following error.

TypeError: Cannot use scipy.linalg.eigh for sparse A with k >= N. Use scipy.linalg.eigh(A.toarray()) or reduce k.

Environment

	version
Docker image	python:3.10.12-slim
Python	3.10.12
umap-learn	0.5.5

Steps to reproduce

Code

import numpy as np
import umap


SEED: int = 42

# Data to reduce in dimensionality
vectors: np.ndarray = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]])

target_dim: int = 2
reducer = umap.UMAP(random_state=SEED, n_components=target_dim)
reducer.fit(vectors)
embedding = reducer.transform(vectors)

Error

/usr/local/lib/python3.10/site-packages/scipy/sparse/linalg/_eigen/arpack/arpack.py:1600: RuntimeWarning: k >= N for N * N square matrix. Attempting to use scipy.linalg.eigh instead.
  warnings.warn("k >= N for N * N square matrix. "
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/workspace/src/ml_tools/dim_reduction.py", line 26, in reduce_dim_with_umap
    reducer.fit(vectors)
  File "/usr/local/lib/python3.10/site-packages/umap/umap_.py", line 2780, in fit
    self.embedding_, aux_data = self._fit_embed_data(
  File "/usr/local/lib/python3.10/site-packages/umap/umap_.py", line 2826, in _fit_embed_data
    return simplicial_set_embedding(
  File "/usr/local/lib/python3.10/site-packages/umap/umap_.py", line 1106, in simplicial_set_embedding
    embedding = spectral_layout(
  File "/usr/local/lib/python3.10/site-packages/umap/spectral.py", line 304, in spectral_layout
    return _spectral_layout(
  File "/usr/local/lib/python3.10/site-packages/umap/spectral.py", line 521, in _spectral_layout
    eigenvalues, eigenvectors = scipy.sparse.linalg.eigsh(
  File "/usr/local/lib/python3.10/site-packages/scipy/sparse/linalg/_eigen/arpack/arpack.py", line 1605, in eigsh
    raise TypeError("Cannot use scipy.linalg.eigh for sparse A with "
TypeError: Cannot use scipy.linalg.eigh for sparse A with k >= N. Use scipy.linalg.eigh(A.toarray()) or reduce k.

Cause

According to the following GitHub issue (a comment by buhrmann), this error occurs when the number of data samples is too small.

The number of dimensions you're embedding into with UMAP must be at least 2 less than the number of samples (documents). So you cannot e.g. embed into 10 dimensions if you haven't got at least 12 samples.
(comment by buhrmann)

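Restated with a few details filled in:

The number of dimensions you embed into with UMAP (the n_components argument) must be at least 2 smaller than the number of samples.
For example, to embed into 10 dimensions you need at least 12 samples. In the reproduction above, n_components is 2 but there are only 3 samples, so the condition is not met.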
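
The size check that raises this error can be reproduced with SciPy alone, without UMAP. The sketch below assumes that UMAP's spectral initialization requests k = n_components + 1 eigenvectors from scipy.sparse.linalg.eigsh, which is consistent with the rule quoted above but is an assumption here, not something confirmed in this article. The function name eigsh_error_message and the diagonal stand-in matrix are hypothetical.

```python
import warnings
from typing import Optional

import numpy as np
import scipy.sparse
import scipy.sparse.linalg


def eigsh_error_message(n_samples: int, n_components: int) -> Optional[str]:
    """Return the TypeError message raised by eigsh, or None if eigsh succeeds."""
    # Stand-in for the N x N sparse matrix that UMAP decomposes.
    # A diagonal matrix is enough to trigger the size check.
    matrix = scipy.sparse.diags(
        np.arange(1, n_samples + 1, dtype=float), format="csr"
    )
    # Assumption: one more eigenvector than the embedding dimension is requested.
    k = n_components + 1
    try:
        with warnings.catch_warnings():
            # Silence the "k >= N" RuntimeWarning that precedes the TypeError.
            warnings.simplefilter("ignore", RuntimeWarning)
            scipy.sparse.linalg.eigsh(matrix, k=k)
    except TypeError as exc:
        return str(exc)
    return None


# 3 samples, n_components=2: k = 3 is not smaller than N = 3, so eigsh refuses.
print(eigsh_error_message(n_samples=3, n_components=2))
```

With 3 samples and n_components=2, k equals N, which is exactly the k >= N case named in the error message.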

Fix

Providing at least n_components + 2 samples resolves the error.
Mine was unit-test code, so I wrote something like the following and built it into the unit test function.

Code

import numpy as np
import umap


SEED: int = 42

# Data to reduce in dimensionality (one more sample added)
vectors: np.ndarray = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9], [0.4, 0.5, 0.6]])

target_dim: int = 2
reducer = umap.UMAP(random_state=SEED, n_components=target_dim)
reducer.fit(vectors)
embedding = reducer.transform(vectors)

# Output
array([[11.799392 ,  6.375716 ],
       [12.055416 ,  5.505439 ],
       [10.8652525,  5.305329 ],
       [11.5016775,  4.7107167]], dtype=float32)

And that solves it!
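
If the same condition could also be hit outside of tests, a small guard can turn the SciPy TypeError into a clearer message. This is a minimal sketch based only on the n_components + 2 rule quoted from the GitHub issue. The function name check_umap_sample_count is hypothetical and is not part of umap-learn.

```python
from typing import Sized


def check_umap_sample_count(vectors: Sized, n_components: int) -> None:
    """Raise a readable error when there are too few samples for UMAP."""
    n_samples = len(vectors)  # works for lists and for 2-D NumPy arrays
    # Rule quoted from the GitHub issue: samples >= n_components + 2.
    min_samples = n_components + 2
    if n_samples < min_samples:
        raise ValueError(
            f"UMAP with n_components={n_components} needs at least "
            f"{min_samples} samples, but got {n_samples}."
        )


# Four samples are enough for n_components=2, so this passes silently.
check_umap_sample_count(
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9], [0.4, 0.5, 0.6]],
    n_components=2,
)
```

Calling it just before reducer.fit(vectors) would make the 3-sample case fail early with the required sample count in the message.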
