[Python / pytorch] Investigating the cause of CUDA error: device-side assert triggered CUDA kernel errors

Posted at 2024-04-07. Last updated at 2024-04-07.

Error

This is the error that caught my eye.

RuntimeError: CUDA error: device-side assert triggered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
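The message itself suggests passing `CUDA_LAUNCH_BLOCKING=1` so that the stack trace points at the call that actually failed. A minimal sketch of doing that from Python follows. It assumes the variable is set before any CUDA work starts; setting it before importing torch is the conservative choice.

```python
import os

# Make CUDA kernel launches synchronous so the Python stack trace
# points at the call that actually failed.
# Assumption: this runs before CUDA is initialized, so it is placed
# ahead of the torch import as the conservative choice.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch only after this point, then rerun the failing code
```

Running the same batch on the CPU instead of the GPU is another way to get a more readable error, since there is no asynchronous kernel launch involved.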

These messages were also printed.

 from the models weights: update the call to weights.transforms(antialias=True).
  warnings.warn(
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [11,0,0] Assertion `t >= 0 && t < n_classes` failed.
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [14,0,0] Assertion `t >= 0 && t < n_classes` failed.
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [16,0,0] Assertion `t >= 0 && t < n_classes` failed.
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [27,0,0] Assertion `t >= 0 && t < n_classes` failed.

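Cause

The cause was that I had set a ground-truth label ID to -1.

Constraint stated in the error message	Meaning
t >= 0	Apparently a negative value cannot be used as a ground-truth label.
t < n_classes	Apparently the ground-truth label `t` must be smaller than the number of classes `n_classes`. With N classes the valid IDs are 0 to N-1, so label IDs that skip numbers can end up at or above `n_classes` and would presumably fail the same assertion.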
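The two constraints in the table can be checked before the labels ever reach the GPU. The following is a minimal sketch in plain Python. `find_invalid_labels` and the sample labels are hypothetical names and data made up for illustration, not part of PyTorch or of my original code.

```python
from typing import Iterable, List


def find_invalid_labels(labels: Iterable[int], n_classes: int) -> List[int]:
    """Return the sorted distinct label IDs that violate 0 <= t < n_classes."""
    return sorted({int(t) for t in labels if not (0 <= int(t) < n_classes)})


# Hypothetical labels; -1 mimics the bug described in this article.
labels = [0, 2, -1, 1, 3, -1]
bad = find_invalid_labels(labels, n_classes=3)
print(bad)  # [-1, 3]
```

For labels stored in a tensor, calling `.tolist()` on it first should let the same check run on the CPU, where a failure is much easier to read than a device-side assert.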

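Solution

Restricting the label IDs to 0 and positive integers resolved the error.

A referenced article states that class (label) IDs are only allowed in the range [0, nb_classes-1].
This is consistent with the error message.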
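One way to bring arbitrary label IDs into the range [0, nb_classes-1] is to remap them to contiguous integers. This is a sketch under the assumption that every distinct ID, including -1, is a real class; `remap_labels` is a hypothetical helper, not a PyTorch function.

```python
from typing import Dict, Iterable, List, Tuple


def remap_labels(labels: Iterable[int]) -> Tuple[List[int], Dict[int, int]]:
    """Map arbitrary integer label IDs to contiguous IDs 0..K-1."""
    labels = list(labels)
    # Sort the distinct IDs so the mapping is deterministic.
    mapping = {old: new for new, old in enumerate(sorted(set(labels)))}
    return [mapping[t] for t in labels], mapping


# Hypothetical raw labels with a negative ID and gaps.
raw = [-1, 5, 10, 5, -1]
new, mapping = remap_labels(raw)
print(new)      # [0, 1, 2, 1, 0]
print(mapping)  # {-1: 0, 5: 1, 10: 2}
```

Keeping `mapping` around makes it possible to translate predictions back to the original IDs. If -1 was meant as a "no label" marker rather than a class, `torch.nn.CrossEntropyLoss` has an `ignore_index` parameter for that purpose, which is a different fix from the one used here.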
