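This is a fix for when, even though you have not changed anything, nvidia-smi suddenly stops printing its usual output or docker can no longer get hold of the GPU: reload the driver.
A reboot usually fixes it too, but there are plenty of times when you do not want to reboot. In that case, reloading the driver with the steps below gets things working again without one.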
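# Unload the NVIDIA kernel modules, dependents first and nvidia itself last
$ sudo rmmod nvidia_drm
$ sudo rmmod nvidia_uvm
$ sudo rmmod nvidia_modeset
$ sudo rmmod nvidia
# This should print nothing once all of the modules are unloaded
$ lsmod | grep nvidia
# Running nvidia-smi next should load the driver again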
$ nvidia-smi
Fri Mar 15 11:40:11 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    Off | 00000000:00:1B.0 Off |                    0 |
|  0%   27C    P0              56W / 300W |      0MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A10G                    Off | 00000000:00:1C.0 Off |                    0 |
|  0%   28C    P0              60W / 300W |      0MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A10G                    Off | 00000000:00:1D.0 Off |                    0 |
|  0%   26C    P0              56W / 300W |      0MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A10G                    Off | 00000000:00:1E.0 Off |                    0 |
|  0%   26C    P0              53W / 300W |      0MiB / 23028MiB |      4%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
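If this happens often, the manual steps above could be wrapped in a small script. What follows is only a sketch under a few assumptions: the function names nvidia_unload_order and reload_nvidia_driver are made up for illustration, the module list is just the four modules used in this post (other machines may have different ones loaded), and rmmod will still fail if something is holding the GPU at the time.

```shell
#!/bin/sh
# Sketch: reload the NVIDIA driver without rebooting.
# Assumption: only the four modules used in this post are loaded.

# Hypothetical helper: read lsmod-style text on stdin and print the loaded
# NVIDIA modules in unload order (dependents first, nvidia itself last).
nvidia_unload_order() {
    # The first column of lsmod output is the module name.
    _loaded=$(awk '{print $1}')
    for _mod in nvidia_drm nvidia_uvm nvidia_modeset nvidia; do
        if printf '%s\n' "$_loaded" | grep -qx "$_mod"; then
            echo "$_mod"
        fi
    done
}

# Hypothetical wrapper: unload whatever is loaded, stop at the first
# failure, then run nvidia-smi, which should load the driver again.
reload_nvidia_driver() {
    for _m in $(lsmod | nvidia_unload_order); do
        sudo rmmod "$_m" || return 1
    done
    nvidia-smi
}
```

Saved as, say, reload-nvidia.sh (a hypothetical file name), it could be used with `. ./reload-nvidia.sh && reload_nvidia_driver`. Deriving the list from lsmod rather than hard-coding four rmmod calls means a module that is not loaded is simply skipped instead of producing an error.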
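Reference: "nvidia-smi で Failed to initialize NVML: Driver/library version mismatch と言われたとき【GPU】" (what to do when nvidia-smi reports "Failed to initialize NVML: Driver/library version mismatch")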