A sudden encounter
While setting up an environment for 3D deep learning for the first time in a while, I hit a suspicious error:
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 500: named symbol not found
This one is probably the same error:
UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 500: named symbol not found
(work) ➜ ~ pip install git+https://github.com/KAIR-BAIR/nerfacc.git@v0.5.2
Collecting git+https://github.com/KAIR-BAIR/nerfacc.git@v0.5.2
~~~
~~~
File "/root/venv/work/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1760, in _get_cuda_arch_flags
capability = torch.cuda.get_device_capability(i)
File "/root/venv/work/lib/python3.9/site-packages/torch/cuda/__init__.py", line 381, in get_device_capability
prop = get_device_properties(device)
File "/root/venv/work/lib/python3.9/site-packages/torch/cuda/__init__.py", line 395, in get_device_properties
_lazy_init() # will define _get_device_properties
File "/root/venv/work/lib/python3.9/site-packages/torch/cuda/__init__.py", line 247, in _lazy_init
torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 500: named symbol not found
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for nerfacc
Running setup.py clean for nerfacc
Failed to build nerfacc
ERROR: Could not build wheels for nerfacc, which is required to install pyproject.toml-based projects
(work) ➜ ~
Environment where the error occurred
- Ubuntu on Docker on WSL2
- NVIDIA driver on the WSL2 side: 555.85
- Docker image (2 variants)
  - ubuntu 20.04
  - cuda: 11.8.0 / 12.1.1
  - python: 3.9.19 / 3.11.9
  - pytorch: 2.0.0 / 2.3.0
Investigating the error
It turns out the error was already showing up just from trying to use PyTorch in Python, before any package build was involved:
UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 500: named symbol not found
(work) ➜ ~ python
Python 3.9.19 (main, Apr 6 2024, 17:59:24) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
/root/venv/work/lib/python3.11/site-packages/torch/cuda/__init__.py:118: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 500: named symbol not found (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
False
>>>
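Cause
Apparently this is a problem that shows up when using NVIDIA Driver 555.xx on the WSL side. Crap.
Updating nvidia-container-toolkit on the WSL side
to 1.14.4 or newer
is apparently the fix.
When I checked the currently installed nvidia-container-toolkit, it was 1.14.0~rc.2, which was indeed a bit old: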
(work) ➜ ~ dpkg -l | grep nvidia-container-toolkit
ii nvidia-container-toolkit 1.14.0~rc.2-1 amd64 NVIDIA Container toolkit
ii nvidia-container-toolkit-base 1.14.0~rc.2-1 amd64 NVIDIA Container Toolkit Base
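The version check above can also be scripted. The following is a minimal sketch, not an official tool: the helper names `parse_dpkg_versions` and `needs_update` are hypothetical, the 1.14.4 threshold is the one quoted in this post, and the parsing deliberately ignores Debian suffixes such as `~rc.2-1` (so a release candidate of the threshold version itself would be treated as new enough).

```python
import re

# Threshold quoted in this post; adjust if the upstream guidance changes.
MIN_VERSION = (1, 14, 4)


def parse_dpkg_versions(dpkg_output):
    """Map package name -> (major, minor, patch) from `dpkg -l` style lines."""
    versions = {}
    for line in dpkg_output.splitlines():
        fields = line.split()
        # Installed packages are listed as: "ii <name> <version> <arch> ..."
        if len(fields) >= 3 and fields[0] == "ii":
            match = re.match(r"(\d+)\.(\d+)\.(\d+)", fields[2])
            if match:
                versions[fields[1]] = tuple(int(part) for part in match.groups())
    return versions


def needs_update(version, minimum=MIN_VERSION):
    """True if the numeric version is older than the minimum."""
    return version < minimum


# Sample text copied from the dpkg output shown in this post.
sample = """\
ii nvidia-container-toolkit 1.14.0~rc.2-1 amd64 NVIDIA Container toolkit
ii nvidia-container-toolkit-base 1.14.0~rc.2-1 amd64 NVIDIA Container Toolkit Base
"""

for name, version in parse_dpkg_versions(sample).items():
    status = "needs update" if needs_update(version) else "ok"
    print(name, version, status)
```

In practice you would feed it the real output of `dpkg -l | grep nvidia-container-toolkit` from the WSL side instead of the pasted sample.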
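By the way, be careful: searching for CUDA initialization: Unexpected error from cudaGetDeviceCount()
mostly turns up articles that are not directly related to this error. The key part was apparently Error 500: named symbol not found
I also asked GPT, but no luck: it suspected the CUDA Toolkit, but never got as far as nvidia-container-toolkit (chat log)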
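Since the trailing error code is the part that matters, a small check can flag this specific case. This is only a sketch under assumptions: `looks_like_toolkit_issue` and `check_cuda` are hypothetical helper names, and I am assuming the PyTorch warning can be captured with the standard `warnings` module in a fresh Python process (I have not verified that across PyTorch versions).

```python
import warnings

# The part of the message that actually identifies this problem.
KEY_PHRASE = "Error 500: named symbol not found"


def looks_like_toolkit_issue(message):
    """True if the message contains the key phrase from this post."""
    return KEY_PHRASE in message


def check_cuda():
    """Return (cuda_available, matched_key_phrase). Requires PyTorch."""
    import torch  # imported lazily so the helper above works without PyTorch

    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        try:
            available = torch.cuda.is_available()
        except RuntimeError as err:
            # Some code paths raise instead of warning.
            return False, looks_like_toolkit_issue(str(err))
    matched = any(looks_like_toolkit_issue(str(w.message)) for w in caught)
    return available, matched
```

If `check_cuda()` returns `(False, True)` inside the container, the nvidia-container-toolkit version on the WSL side is the first thing I would look at.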
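Fix
Looking at the nvidia-container-toolkit page, I couldn't spot any update instructions at a glance.
sudo apt update && sudo apt upgrade
alone didn't seem to be enough (well, of course?), so for now I simply redid the installation steps on the WSL side: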
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
$ curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
$ sudo apt-get update
$ sudo apt-get install -y nvidia-container-toolkit
It updated successfully:
(work) ➜ ~ dpkg -l | grep nvidia-container-toolkit
ii nvidia-container-toolkit 1.15.0-1 amd64 NVIDIA Container toolkit
ii nvidia-container-toolkit-base 1.15.0-1 amd64 NVIDIA Container Toolkit Base
And the UserWarning: CUDA initialization
is gone:
(work) ➜ ~ python
Python 3.9.19 (main, Apr 6 2024, 17:57:55) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
>>>
That's all!