After rebooting the server, I got the following error:
$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the lates
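A quick way to narrow this error down is to check whether the nvidia kernel module is loaded at all. The following is a minimal sketch: the function name `module_loaded` and its optional file argument are my own additions for illustration, not part of any NVIDIA tool. It only reads `/proc/modules`, which lists the currently loaded modules.

```shell
# module_loaded NAME [FILE]
# Succeeds (exit status 0) if NAME is listed as a loaded kernel module.
# FILE defaults to /proc/modules; the argument exists only so the check
# can also be run against a saved copy of that file.
module_loaded() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }' \
        "${2:-/proc/modules}"
}

# Usage:
#   if module_loaded nvidia; then
#       echo "nvidia module is loaded"
#   else
#       echo "nvidia module is NOT loaded"
#   fi
```

If the module is not loaded, the problem is probably on the kernel side rather than in `nvidia-smi` itself.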
I worked through the fix while following this article:
Ubuntuの更新ついでにNVIDIAのドライバを更新しようとしたらハマった話 ("How I got stuck trying to update the NVIDIA driver while updating Ubuntu")
Check the Ubuntu version
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.3 LTS
Release: 22.04
Codename: jammy
Check the state of the kernel modules
$ dkms status
nvidia/510.108.03, 5.15.0-91-generic, x86_64: installed
Check the running kernel version
$ uname -r
5.15.0-113-generic
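These two outputs already point at a likely cause: DKMS lists the nvidia module as installed only for 5.15.0-91-generic, while the running kernel is 5.15.0-113-generic. The comparison can be sketched as below. The helper name `dkms_built_for` is hypothetical, and the parsing assumes the `dkms status` line format shown in this post.

```shell
# dkms_built_for KERNEL
# Reads `dkms status` output on stdin and succeeds if an nvidia module
# is reported as installed for KERNEL.
dkms_built_for() {
    awk -v kernel="$1" -F'[,:] *' '
        /^nvidia/ && /installed/ {
            # Compare every field, so it does not matter which column
            # holds the kernel release.
            for (i = 1; i <= NF; i++) if ($i == kernel) found = 1
        }
        END { exit !found }'
}

# Usage:
#   dkms status | dkms_built_for "$(uname -r)" \
#       || echo "no nvidia module built for $(uname -r)"
```

With the values above, the check fails for 5.15.0-113-generic and succeeds for 5.15.0-91-generic.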
Check which GPUs are installed
$ lspci | grep -i nvidia
af:00.0 VGA compatible controller: NVIDIA Corporation GA102GL [RTX A6000] (rev a1)
af:00.1 Audio device: NVIDIA Corporation GA102 High Definition Audio Controller (rev a1)
d8:00.0 VGA compatible controller: NVIDIA Corporation GA102GL [RTX A6000] (rev a1)
d8:00.1 Audio device: NVIDIA Corporation GA102 High Definition Audio Controller (rev a1)
$ dpkg -l | grep nvidia
ii libnvidia-cfg1-510:amd64 510.108.03-0ubuntu0.22.04.1 amd64 NVIDIA binary OpenGL/GLX configuration library
ii libnvidia-common-510 525.147.05-0ubuntu2.22.04.1 all Transitional package for libnvidia-common-535
ii libnvidia-common-535 535.183.01-0ubuntu0.22.04.1 all Shared files used by the NVIDIA libraries
ii libnvidia-compute-495:amd64 510.108.03-0ubuntu0.22.04.1 amd64 Transitional package for libnvidia-compute-510
ii libnvidia-compute-510:amd64 510.108.03-0ubuntu0.22.04.1 amd64 NVIDIA libcompute package
ii libnvidia-container-tools 1.10.0-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.10.0-1 amd64 NVIDIA container runtime library
ii libnvidia-decode-510:amd64 510.108.03-0ubuntu0.22.04.1 amd64 NVIDIA Video Decoding runtime libraries
ii libnvidia-egl-wayland1:amd64 1:1.1.9-1.1 amd64 Wayland EGL External Platform library -- shared library
ii libnvidia-encode-510:amd64 510.108.03-0ubuntu0.22.04.1 amd64 NVENC Video Encoding runtime library
ii libnvidia-extra-510:amd64 510.108.03-0ubuntu0.22.04.1 amd64 Extra libraries for the NVIDIA driver
ii libnvidia-fbc1-510:amd64 510.108.03-0ubuntu0.22.04.1 amd64 NVIDIA OpenGL-based Framebuffer Capture runtime library
ii libnvidia-gl-510:amd64 510.108.03-0ubuntu0.22.04.1 amd64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii libnvidia-ml-dev:amd64 11.5.50~11.5.1-1ubuntu1 amd64 NVIDIA Management Library (NVML) development files
ii nvidia-compute-utils-510 510.108.03-0ubuntu0.22.04.1 amd64 NVIDIA compute utilities
ii nvidia-container-toolkit 1.10.0-1 amd64 NVIDIA container runtime hook
ii nvidia-cuda-dev:amd64 11.5.1-1ubuntu1 amd64 NVIDIA CUDA development files
ii nvidia-cuda-gdb 11.5.114~11.5.1-1ubuntu1 amd64 NVIDIA CUDA Debugger (GDB)
ii nvidia-cuda-toolkit 11.5.1-1ubuntu1 amd64 NVIDIA CUDA development toolkit
ii nvidia-cuda-toolkit-doc 11.5.1-1ubuntu1 all NVIDIA CUDA and OpenCL documentation
ii nvidia-dkms-510 510.108.03-0ubuntu0.22.04.1 amd64 NVIDIA DKMS package
ii nvidia-driver-510 510.108.03-0ubuntu0.22.04.1 amd64 NVIDIA driver metapackage
ii nvidia-kernel-common-510 510.108.03-0ubuntu0.22.04.1 amd64 Shared files used with the kernel module
ii nvidia-kernel-source-510 510.108.03-0ubuntu0.22.04.1 amd64 NVIDIA kernel source package
ii nvidia-modprobe 510.47.03-0ubuntu1 amd64 Load the NVIDIA kernel driver and create device files
ii nvidia-opencl-dev:amd64 11.5.1-1ubuntu1 amd64 NVIDIA OpenCL development files
ii nvidia-prime 0.8.17.1 all Tools to enable NVIDIA's Prime
ii nvidia-profiler 11.5.114~11.5.1-1ubuntu1 amd64 NVIDIA Profiler for CUDA and OpenCL
ii nvidia-settings 510.47.03-0ubuntu1 amd64 Tool for configuring the NVIDIA graphics driver
ii nvidia-utils-510 510.108.03-0ubuntu0.22.04.1 amd64 NVIDIA driver support binaries
ii nvidia-visual-profiler 11.5.114~11.5.1-1ubuntu1 amd64 NVIDIA Visual Profiler for CUDA and OpenCL
ii screen-resolution-extra 0.18.2 all Extension for the nvidia-settings control panel
ii xserver-xorg-video-nvidia-510 510.108.03-0ubuntu0.22.04.1 amd64 NVIDIA binary Xorg driver
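The listing above mixes several driver versions (510.108.03, 510.47.03, 525.147.05 and 535.183.01). To spot that kind of mix without reading every line, something like the following sketch can help. The function name `driver_versions` is made up for this example, and the three-digit pattern is only a heuristic for recognising driver-style version strings.

```shell
# driver_versions
# Reads `dpkg -l` output on stdin and prints the distinct driver-style
# versions (e.g. 510.108.03) of installed packages whose name contains
# "nvidia". Toolkit-style versions such as 11.5.1 are skipped on purpose.
driver_versions() {
    awk '$1 == "ii" && $2 ~ /nvidia/ && $3 ~ /^[0-9][0-9][0-9]\.[0-9]+\.[0-9]+/ {
        split($3, v, "-")   # drop the packaging suffix after the first "-"
        print v[1]
    }' | LC_ALL=C sort -u
}

# Usage:
#   dpkg -l | driver_versions
```

More than one line of output suggests that packages from different driver branches are installed side by side.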
$ dpkg -l | grep cuda
ii cuda 11.6.1-1 amd64 CUDA meta-package
ii cuda-11-6 11.6.1-1 amd64 CUDA 11.6 meta-package
ii cuda-cccl-11-6 11.6.55-1 amd64 CUDA CCCL
ii cuda-command-line-tools-11-6 11.6.1-1 amd64 CUDA command-line tools
ii cuda-compiler-11-6 11.6.1-1 amd64 CUDA compiler
ii cuda-cudart-11-6 11.6.55-1 amd64 CUDA Runtime native Libraries
ii cuda-cudart-dev-11-6 11.6.55-1 amd64 CUDA Runtime native dev links, headers
ii cuda-cuobjdump-11-6 11.6.112-1 amd64 CUDA cuobjdump
ii cuda-cupti-11-6 11.6.112-1 amd64 CUDA profiling tools runtime libs.
ii cuda-cupti-dev-11-6 11.6.112-1 amd64 CUDA profiling tools interface.
ii cuda-cuxxfilt-11-6 11.6.112-1 amd64 CUDA cuxxfilt
ii cuda-demo-suite-11-6 11.6.55-1 amd64 Demo suite for CUDA
ii cuda-documentation-11-6 11.6.112-1 amd64 CUDA documentation
ii cuda-driver-dev-11-6 11.6.55-1 amd64 CUDA Driver native dev stub library
ii cuda-drivers 510.47.03-1 amd64 CUDA Driver meta-package, branch-agnostic
ii cuda-drivers-510 510.47.03-1 amd64 CUDA Driver meta-package, branch-specific
ii cuda-gdb-11-6 11.6.112-1 amd64 CUDA-GDB
ii cuda-libraries-11-6 11.6.1-1 amd64 CUDA Libraries 11.6 meta-package
ii cuda-libraries-dev-11-6 11.6.1-1 amd64 CUDA Libraries 11.6 development meta-package
ii cuda-memcheck-11-6 11.6.112-1 amd64 CUDA-MEMCHECK
ii cuda-nsight-11-6 11.6.112-1 amd64 CUDA nsight
ii cuda-nsight-compute-11-6 11.6.1-1 amd64 NVIDIA Nsight Compute
ii cuda-nsight-systems-11-6 11.6.1-1 amd64 NVIDIA Nsight Systems
ii cuda-nvcc-11-6 11.6.112-1 amd64 CUDA nvcc
ii cuda-nvdisasm-11-6 11.6.104-1 amd64 CUDA disassembler
ii cuda-nvml-dev-11-6 11.6.55-1 amd64 NVML native dev links, headers
ii cuda-nvprof-11-6 11.6.112-1 amd64 CUDA Profiler tools
ii cuda-nvprune-11-6 11.6.112-1 amd64 CUDA nvprune
ii cuda-nvrtc-11-6 11.6.112-1 amd64 NVRTC native runtime libraries
ii cuda-nvrtc-dev-11-6 11.6.112-1 amd64 NVRTC native dev links, headers
ii cuda-nvtx-11-6 11.6.112-1 amd64 NVIDIA Tools Extension
ii cuda-nvvp-11-6 11.6.112-1 amd64 CUDA Profiler tools
ii cuda-repo-ubuntu1804-11-6-local 11.6.1-510.47.03-1 amd64 cuda repository configuration files
ii cuda-runtime-11-6 11.6.1-1 amd64 CUDA Runtime 11.6 meta-package
ii cuda-samples-11-6 11.6.101-1 amd64 CUDA example applications
ii cuda-sanitizer-11-6 11.6.112-1 amd64 CUDA Sanitizer
ii cuda-toolkit-11-6 11.6.1-1 amd64 CUDA Toolkit 11.6 meta-package
ii cuda-toolkit-11-6-config-common 11.6.55-1 all Common config package for CUDA Toolkit 11.6.
ii cuda-toolkit-11-config-common 11.6.55-1 all Common config package for CUDA Toolkit 11.
ii cuda-toolkit-config-common 11.6.55-1 all Common config package for CUDA Toolkit.
ii cuda-tools-11-6 11.6.1-1 amd64 CUDA Tools meta-package
ii cuda-visual-tools-11-6 11.6.1-1 amd64 CUDA visual tools
ii libcudart11.0:amd64 11.5.117~11.5.1-1ubuntu1 amd64 NVIDIA CUDA Runtime Library
ii nvidia-cuda-dev:amd64 11.5.1-1ubuntu1 amd64 NVIDIA CUDA development files
ii nvidia-cuda-gdb 11.5.114~11.5.1-1ubuntu1 amd64 NVIDIA CUDA Debugger (GDB)
ii nvidia-cuda-toolkit 11.5.1-1ubuntu1 amd64 NVIDIA CUDA development toolkit
ii nvidia-cuda-toolkit-doc 11.5.1-1ubuntu1 all NVIDIA CUDA and OpenCL documentation
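Remove the drivers (the patterns are quoted so that the shell does not expand them against files in the current directory)
sudo apt-get --purge remove 'nvidia-*'
sudo apt-get --purge remove 'cuda-*'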
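Before purging, it can be worth reviewing which installed packages are actually involved. This is a small sketch of that step: `installed_matching` is a hypothetical helper name, and the regular expression in the usage line is only one possible choice.

```shell
# installed_matching REGEX
# Reads `dpkg -l` output on stdin and prints the names of installed
# packages ("ii" state) that match the extended regular expression REGEX.
installed_matching() {
    awk -v re="$1" '$1 == "ii" {
        sub(/:.*/, "", $2)          # strip an architecture suffix like ":amd64"
        if ($2 ~ re) print $2
    }'
}

# Usage:
#   dpkg -l | installed_matching '^(lib)?nvidia-|^cuda'
#
# apt-get also has a simulation mode that prints what would be removed
# without changing anything:
#   sudo apt-get -s --purge remove 'nvidia-*'
```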
Check the recommended driver
$ ubuntu-drivers devices
ERROR:root:aplay command not found
== /sys/devices/pci0000:ae/0000:ae:00.0/0000:af:00.0 ==
modalias : pci:v000010DEd00002230sv000010DEsd00001459bc03sc00i00
vendor : NVIDIA Corporation
model : GA102GL [RTX A6000]
driver : nvidia-driver-535 - distro non-free recommended
driver : nvidia-driver-470-server - distro non-free
driver : nvidia-driver-535-server-open - distro non-free
driver : nvidia-driver-470 - distro non-free
driver : nvidia-driver-535-server - distro non-free
driver : nvidia-driver-545 - distro non-free
driver : nvidia-driver-535-open - distro non-free
driver : nvidia-driver-545-open - distro non-free
driver : xserver-xorg-video-nouveau - distro free builtin
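When repeating this on several machines, the recommended package name can be pulled out of that output instead of being read by eye. This is a minimal sketch that assumes the `driver : NAME - ... recommended` line layout shown above; the function name `recommended_driver` is my own.

```shell
# recommended_driver
# Reads `ubuntu-drivers devices` output on stdin and prints the package
# name from the first driver line marked "recommended".
recommended_driver() {
    awk '$1 == "driver" && /recommended/ { print $3; exit }'
}

# Usage:
#   ubuntu-drivers devices 2>/dev/null | recommended_driver
```

For the output above this yields nvidia-driver-535.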
Install it with the following command
sudo apt install nvidia-driver-535
Download the CUDA Toolkit from this site:
NVIDIA CUDA Toolkit 12.1 Downloads
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.5.1/local_installers/cuda-repo-ubuntu2204-12-5-local_12.5.1-555.42.06-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204-12-5-local_12.5.1-555.42.06-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2204-12-5-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-5
So the latest release is 12.5 now. Things have moved along.
$ nvidia-smi
Wed Jul 10 15:13:05 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A6000 Off | 00000000:AF:00.0 Off | Off |
| 30% 56C P8 27W / 300W | 1MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A6000 Off | 00000000:D8:00.0 Off | Off |
| 30% 51C P8 8W / 300W | 1MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
Both commands now print their output properly.
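One thing to note: `nvcc -V` reports release 11.5 even though cuda-toolkit-12-5 was installed. That probably means an older nvcc comes first on PATH (the earlier dpkg listing shows Ubuntu's nvidia-cuda-toolkit at 11.5.1). Below is a small sketch for checking this. The helper name `nvcc_release` is hypothetical, and `/usr/local/cuda-12.5` is an assumption about where the toolkit was installed, so verify the path on your machine.

```shell
# nvcc_release
# Reads `nvcc -V` output on stdin and prints the release number,
# for example "11.5".
nvcc_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Usage:
#   command -v nvcc              # which nvcc is being picked up
#   nvcc -V | nvcc_release       # which release it reports
#
# Assumed install location of the 12.5 toolkit (check before using):
#   export PATH=/usr/local/cuda-12.5/bin:$PATH
```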
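Why does the driver sometimes stop working after a reboot, I wonder?
I have lost count of how many machines I have had to update.
This looks like something I will need to do regularly.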