PyTorch @ WSL: when you get UserWarning: CUDA initialization ~~ Error 500

Posted at 2024-05-31

A sudden encounter

While setting up a 3D deep learning environment for the first time in a while, I ran into a suspicious error ⇩

RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 500: named symbol not found

This one is probably the same error ⇩

UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 500: named symbol not found

(work) ➜  ~ pip install git+https://github.com/KAIR-BAIR/nerfacc.git@v0.5.2
Collecting git+https://github.com/KAIR-BAIR/nerfacc.git@v0.5.2
~~~
~~~
        File "/root/venv/work/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1760, in _get_cuda_arch_flags
          capability = torch.cuda.get_device_capability(i)
        File "/root/venv/work/lib/python3.9/site-packages/torch/cuda/__init__.py", line 381, in get_device_capability
          prop = get_device_properties(device)
        File "/root/venv/work/lib/python3.9/site-packages/torch/cuda/__init__.py", line 395, in get_device_properties
          _lazy_init()  # will define _get_device_properties
        File "/root/venv/work/lib/python3.9/site-packages/torch/cuda/__init__.py", line 247, in _lazy_init
          torch._C._cuda_init()
      RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 500: named symbol not found
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for nerfacc
  Running setup.py clean for nerfacc
Failed to build nerfacc
ERROR: Could not build wheels for nerfacc, which is required to install pyproject.toml-based projects
(work) ➜  ~

Environment where the error occurred

  • Ubuntu on Docker on WSL2
  • nvidia-driver on the WSL2 side: 555.85
  • docker image (2 patterns)
    • ubuntu 20.04
    • cuda: 11.8.0 / 12.1.1
    • python: 3.9.19 / 3.11.9
    • pytorch: 2.0.0 / 2.3.0

Investigating the error

It turned out the problem was not specific to the nerfacc build: the error showed up as soon as I tried to use PyTorch from Python at all

UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 500: named symbol not found

(work)   ~ python
Python 3.9.19 (main, Apr  6 2024, 17:59:24) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
/root/venv/work/lib/python3.11/site-packages/torch/cuda/__init__.py:118: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 500: named symbol not found (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
False
>>>
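
To reproduce the check above without typing it into the REPL each time, something like the following could work. This is only a minimal sketch: `probe_cuda` is a hypothetical helper name I made up, and it assumes the CUDA initialization warning goes through Python's standard `warnings` machinery (as the UserWarning in the log above suggests).

```python
import warnings


def probe_cuda(is_available):
    """Call a CUDA availability check and collect the warnings it emits.

    `is_available` is any zero-argument callable, for example
    torch.cuda.is_available. Returns (result, [warning message strings]).
    """
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, even ones a filter would normally hide.
        warnings.simplefilter("always")
        result = bool(is_available())
    return result, [str(w.message) for w in caught]


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        ok, messages = probe_cuda(torch.cuda.is_available)
        print("CUDA available:", ok)
        for message in messages:
            print("warning:", message)
```

Passing the check in as a callable keeps the helper usable (and testable) in environments where torch itself is not installed.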

Cause

Apparently this is a problem that occurs when NVIDIA Driver 555.xx is used on the WSL side. Ugh.
Updating nvidia-container-toolkit on the WSL side to 1.14.4 or newer is reported to fix it.

Checking the currently installed nvidia-container-toolkit version showed 1.14.0 (a release candidate, 1.14.0~rc.2), which was indeed a bit old

(work) ➜  ~ dpkg -l | grep nvidia-container-toolkit
ii  nvidia-container-toolkit        1.14.0~rc.2-1    amd64    NVIDIA Container toolkit
ii  nvidia-container-toolkit-base   1.14.0~rc.2-1    amd64    NVIDIA Container Toolkit Base
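
The comparison against "1.14.4 or newer" can be sketched in a few lines. The function names and `MIN_FIXED` are hypothetical, and the parsing is deliberately simplistic: it reads only the leading major.minor.patch and ignores suffixes such as `~rc.2` or `-1`, so it is not a full Debian version comparison.

```python
import re

# "1.14.4 or newer" is the threshold mentioned in this article.
MIN_FIXED = (1, 14, 4)


def parse_toolkit_version(version):
    """Extract (major, minor, patch) from a dpkg version like '1.14.0~rc.2-1'.

    Simplification: pre-release and revision suffixes are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def needs_update(version, minimum=MIN_FIXED):
    """Return True if the given version is older than the minimum."""
    return parse_toolkit_version(version) < minimum


if __name__ == "__main__":
    print(needs_update("1.14.0~rc.2-1"))  # the version shown in the dpkg output above
```

The version string can be copied from the third column of the `dpkg -l` output.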

Note that searching for "CUDA initialization: Unexpected error from cudaGetDeviceCount()" mostly turns up articles that are not directly related to this error. The key part was apparently "Error 500: named symbol not found"
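
Since the trailing "Error N: ..." part is what matters when searching, here is a small sketch that pulls it out of the full message. `extract_cuda_error` is a hypothetical helper, and the regular expression assumes the message layout seen in the logs in this article.

```python
import re


def extract_cuda_error(message):
    """Return (code, description) from a message containing 'Error N: text'.

    Returns None when no such part is found. The description stops at an
    opening parenthesis, which drops the '(Triggered internally at ...)' tail.
    """
    match = re.search(r"Error (\d+): ([^(\n]+)", message)
    if match is None:
        return None
    return int(match.group(1)), match.group(2).strip()


if __name__ == "__main__":
    sample = (
        "UserWarning: CUDA initialization: Unexpected error from "
        "cudaGetDeviceCount(). Error 500: named symbol not found "
        "(Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)"
    )
    print(extract_cuda_error(sample))
```

The extracted description is the part worth pasting into a search engine, rather than the generic first half of the message.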

Asking GPT did not help either: it suspected the CUDA Toolkit, but never arrived at nvidia-container-toolkit (chat log)

Solution

Looking at the nvidia-container-toolkit page, I could not find an update procedure at a glance

sudo apt update && sudo apt upgrade alone did not seem to be enough (which makes sense?), so for now I simply ran through the installation steps again on the WSL side
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

$ curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
$ sudo apt-get update
$ sudo apt-get install -y nvidia-container-toolkit

The package was updated successfully

(work) ➜  ~ dpkg -l | grep nvidia-container-toolkit
ii  nvidia-container-toolkit        1.15.0-1    amd64    NVIDIA Container toolkit
ii  nvidia-container-toolkit-base   1.15.0-1    amd64    NVIDIA Container Toolkit Base

And the UserWarning: CUDA initialization is gone

(work)   ~ python
Python 3.9.19 (main, Apr  6 2024, 17:57:55) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
>>>

That's all!
