Steps for when CUDA needs to be reinstalled
CUDA is not recognized (nvidia-smi and nvcc don't work)
$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
$ nvcc -V
Command 'nvcc' not found, but can be installed with:
sudo apt install nvidia-cuda-toolkit
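Before changing anything, it can help to note which of the two tools is actually missing from PATH. The following is only a minimal sketch; `report_tool` is a hypothetical helper name, not part of any NVIDIA tooling. Note that "found" only means the command exists: in the output above, nvidia-smi exists but cannot talk to the driver.

```shell
#!/bin/sh
# report_tool is a hypothetical helper: it only checks whether a command
# can be found on PATH, not whether the driver behind it works.
report_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: missing"
  fi
}

report_tool nvidia-smi
report_tool nvcc
```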
What to try first
- Reboot the PC
sudo apt update
sudo apt upgrade
Extras below
sudo apt --fix-broken install
sudo dpkg --audit
sudo dpkg --configure --pending
sudo dpkg --configure <package name>
ls /var/lib/dpkg/info/<file name> (check that it exists)
sudo dpkg-reconfigure
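To see which packages dpkg considers unfinished, one option is to filter the status column of `dpkg -l`. This is a rough sketch: `unfinished_packages` is a hypothetical helper name, and it assumes the usual `dpkg -l` layout where the second status letter is H, U, F, W or t for half-installed, unpacked, half-configured and trigger states.

```shell
#!/bin/sh
# unfinished_packages is a hypothetical helper. It reads `dpkg -l` output on
# stdin and prints the names of packages whose second status letter is
# H (half-installed), U (unpacked), F (half-configured),
# W (triggers-awaited) or t (triggers-pending).
unfinished_packages() {
  awk '$1 ~ /^[uihrp][HUFWt]/ { print $2 }'
}

# Usage on a real system:
#   dpkg -l | unfinished_packages
```

Fully installed packages (`ii`) and removed packages with leftover config files (`rc`) are not printed.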
If that doesn't work: CUDA needs to be reinstalled
Check the model of the GPU in your PC and the compatible driver version (see NVIDIA's page)
2.1. Verify You Have a CUDA-Capable GPU
$ lspci | grep -i nvidia
01:00.0 VGA compatible controller: NVIDIA Corporation Device 2704 (rev a1)
01:00.1 Audio device: NVIDIA Corporation Device 22bb (rev a1)
↓ Screenshot clipping (my PC's GPU is an NVIDIA GeForce RTX 3070)
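When lspci prints "Device 2704" instead of a model name, the local PCI ID database does not know the device, so the four-digit hex ID has to be looked up separately. A small sketch for pulling those IDs out of the output; `nvidia_device_ids` is a hypothetical helper name, and the pattern assumes the "NVIDIA Corporation Device XXXX" wording shown above.

```shell
#!/bin/sh
# nvidia_device_ids is a hypothetical helper. It reads lspci output on stdin
# and prints the 4-digit hex device ID from lines where lspci could not
# resolve a model name ("NVIDIA Corporation Device XXXX").
nvidia_device_ids() {
  sed -n 's/.*NVIDIA Corporation Device \([0-9a-fA-F]\{4\}\).*/\1/p'
}

# Usage on a real system:
#   lspci | nvidia_device_ids
```

`lspci -nn` also prints the numeric vendor:device pair next to each entry, and `sudo update-pciids` may refresh the local database so that names appear; whether that helps depends on how recent the GPU is.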
2.2. Verify You Have a Supported Version of Linux
$ uname -m && cat /etc/*release
x86_64
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.3 LTS"
PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.3 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
2.3. Verify the System Has gcc Installed
$ gcc --version
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
2.4. Verify the System has the Correct Kernel Headers and Development Packages Installed
$ uname -r
6.5.0-21-generic
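The kernel release matters because the headers package has to match it exactly. A minimal sketch of that check, assuming Ubuntu's `linux-headers-<release>` naming (the same name appears in the install step later in this post); `headers_package` and `have_package` are hypothetical helper names.

```shell
#!/bin/sh
# headers_package builds the Ubuntu headers package name for a kernel release.
headers_package() {
  echo "linux-headers-$1"
}

# have_package succeeds if dpkg knows the package as installed.
have_package() {
  dpkg -s "$1" >/dev/null 2>&1
}

# Usage on a real system:
#   pkg=$(headers_package "$(uname -r)")
#   if have_package "$pkg"; then echo "$pkg is installed"; else echo "$pkg is missing"; fi
```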
2.7. Download the NVIDIA CUDA Toolkit
The NVIDIA CUDA Toolkit is available at https://developer.nvidia.com/cuda-downloads.
Saying goodbye to the broken CUDA
Uninstall all of the currently installed CUDA-related libraries
- Removing CUDA Toolkit and Driver
Ubuntu and Debian
apt is newer than apt-get and recommended! (It's worth replacing apt-get with apt in the commands.)
To remove CUDA Toolkit:
sudo apt --purge remove "*cuda*" "*cublas*" "*cufft*" "*cufile*" "*curand*" \
"*cusolver*" "*cusparse*" "*gds-tools*" "*npp*" "*nvjpeg*" "nsight*" "*nvvm*"
↓ Output this time
Package 'cuda-drivers-fabricmanager-550' is not installed, so not removed
You might want to run 'apt --fix-broken install' to correct these.
The following packages have unmet dependencies:
libnvjitlink-12-2 : Depends: cuda-toolkit-config-common but it is not going to be installed
Depends: cuda-toolkit-12-config-common but it is not going to be installed
Depends: cuda-toolkit-12-2-config-common but it is not going to be installed
nvidia-dkms-535 : Depends: nvidia-kernel-common-535 (= 535.161.07-0ubuntu1) but 535.154.05-0ubuntu1 is to be installed
E: Unmet dependencies. Try 'apt --fix-broken install' with no packages (or specify a solution).
↓ I tried following the suggestion
$ sudo apt --fix-broken install
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Correcting dependencies... Done
The following packages were automatically installed and are no longer required:
nvidia-firmware-535-535.113.01 nvidia-firmware-535-535.146.02 nvidia-firmware-535-535.161.07
Use 'sudo apt autoremove' to remove them.
The following additional packages will be installed:
nvidia-kernel-common-535
The following packages will be upgraded:
nvidia-kernel-common-535
1 upgraded, 0 newly installed, 0 to remove and 112 not upgraded.
2 not fully installed or removed.
Need to get 0 B/38.3 MB of archives.
After this operation, 45.1 kB of additional disk space will be used.
Do you want to continue? [Y/n] Y
(Reading database ... 316795 files and directories currently installed.)
Preparing to unpack .../nvidia-kernel-common-535_535.161.07-0ubuntu1_amd64.deb ...
Unpacking nvidia-kernel-common-535 (535.161.07-0ubuntu1) over (535.154.05-0ubuntu1) ...
dpkg: error processing archive /var/cache/apt/archives/nvidia-kernel-common-535_535.161.07-0ubuntu1_a
md64.deb (--unpack):
trying to overwrite '/lib/firmware/nvidia/535.161.07/gsp_ga10x.bin', which is also in package nvidia
-firmware-535-535.161.07 535.161.07-0ubuntu0.22.04.1
dpkg-deb: error: paste subprocess was killed by signal (Broken pipe)
Errors were encountered while processing:
/var/cache/apt/archives/nvidia-kernel-common-535_535.161.07-0ubuntu1_amd64.deb
E: Sub-process /usr/bin/dpkg returned an error code (1)
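The key line in this failure is "trying to overwrite ... which is also in package ...": two packages claim the same file. A small sketch for extracting the contested path from such a log; `overwrite_conflict_path` is a hypothetical helper name, and it assumes the message is on a single line (the log above is wrapped by the terminal).

```shell
#!/bin/sh
# overwrite_conflict_path is a hypothetical helper. It reads dpkg/apt output
# on stdin and prints the file path named in a "trying to overwrite" error.
overwrite_conflict_path() {
  sed -n "s/.*trying to overwrite '\([^']*\)'.*/\1/p"
}

# Usage on a real system:
#   sudo apt --fix-broken install 2>&1 | overwrite_conflict_path
#   dpkg -S /path/printed/above    # shows which installed package owns the file
```

In the log above the file belongs to nvidia-firmware-535-535.161.07, which apt also lists among the packages that are no longer required.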
To remove NVIDIA Drivers:
sudo apt --purge remove "*nvidia*" "libxnvctrl*"
↓ Output
Package 'nvidia-docker2' is not installed, so not removed
You might want to run 'apt --fix-broken install' to correct these.
The following packages have unmet dependencies:
cuda-drivers-535 : Depends: libnvidia-common-535 (>= 535.104.12) but it is not going to be installed
Depends: libnvidia-compute-535 (>= 535.104.12) but it is not going to be installed
Depends: libnvidia-decode-535 (>= 535.104.12) but it is not going to be installed
Depends: libnvidia-encode-535 (>= 535.104.12) but it is not going to be installed
Depends: libnvidia-fbc1-535 (>= 535.104.12) but it is not going to be installed
Depends: libnvidia-gl-535 (>= 535.104.12) but it is not going to be installed
Depends: nvidia-compute-utils-535 (>= 535.104.12) but it is not going to be installed
Depends: nvidia-dkms-535 (>= 535.104.12)
Depends: nvidia-driver-535 (>= 535.104.12) but it is not going to be installed
Depends: nvidia-kernel-common-535 (>= 535.104.12) but it is not going to be installed
Depends: nvidia-kernel-source-535 (>= 535.104.12) but it is not going to be installed or
nvidia-kernel-open-535 (>= 535.104.12) but it is not going to be installed
Depends: nvidia-utils-535 (>= 535.104.12) but it is not going to be installed
Depends: xserver-xorg-video-nvidia-535 (>= 535.104.12) but it is not going to be installed
Depends: nvidia-modprobe (>= 535.104.12) but it is not going to be installed
Depends: nvidia-settings (>= 535.104.12) but it is not going to be installed
libnvidia-gl-535:i386 : Depends: libnvidia-common-535:i386 (= 535.161.07-0ubuntu1)
E: Unmet dependencies. Try 'apt --fix-broken install' with no packages (or specify a solution).
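With this many "Depends:" lines it is easy to lose track of which packages are actually blocking apt. A rough sketch that lists only the packages on the left-hand side of each unmet dependency; `blocked_packages` is a hypothetical helper name, and it assumes the "name : Depends:" layout seen in the output above.

```shell
#!/bin/sh
# blocked_packages is a hypothetical helper. It reads apt's "unmet
# dependencies" output on stdin and prints each package that apt reports as
# having an unmet dependency (the name before " : Depends:"), once each.
blocked_packages() {
  sed -n 's/^ *\([^ ][^ ]*\) : Depends:.*/\1/p' | sort -u
}

# Usage on a real system:
#   sudo apt --fix-broken install 2>&1 | blocked_packages
```

Continuation lines that start with "Depends:" are ignored, so each blocking package is printed only once.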
To clean up the uninstall:
sudo apt autoremove
↓ Output
$ sudo apt autoremove
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
You might want to run 'apt --fix-broken install' to correct these.
The following packages have unmet dependencies:
nvidia-dkms-535 : Depends: nvidia-kernel-common-535 (= 535.161.07-0ubuntu1) but 535.154.05-0ubuntu1 is installed
E: Unmet dependencies. Try 'apt --fix-broken install' with no packages (or specify a solution).
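That didn't work either, so I had no choice but to force-remove the packages one at a time (T_T)
I'm really exhausted
When things are even worse
I'll write down what I did
The cause this time (from sudo dpkg --audit)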
nvidia-dkms-535 NVIDIA DKMS package
nvidia-driver-535 NVIDIA driver metapackage
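Warning
This breaks dependencies, so do it at your own risk
↓ A command that searches everywhere for the name you want to remove (nvidia-dkms-535 this time; replace as appropriate) and deletes every match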
# grep -F matches the literal name (grep patterns are not shell globs); xargs -r skips rm when nothing matches
sudo dpkg -L nvidia-dkms-535 | grep -F 'nvidia-dkms-535' | sudo xargs -r rm -rf ;\
sudo find /etc -name "*nvidia-dkms-535*" | while IFS= read -r f; do sudo rm -rf "$f"; done ;\
sudo find /var -name "*nvidia-dkms-535*" | while IFS= read -r f; do sudo rm -rf "$f"; done ;\
sudo find /usr -name "*nvidia-dkms-535*" | while IFS= read -r f; do sudo rm -rf "$f"; done
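Since rm -rf cannot be undone, it may be worth printing the matches first and reading the list before deleting anything. A minimal sketch of such a dry run; `list_matches` is a hypothetical helper name, and it only prints paths without removing them.

```shell
#!/bin/sh
# list_matches is a hypothetical dry-run helper. It prints every path under
# the given directories whose name contains the pattern, and deletes nothing.
# Arguments: pattern, then one or more directories.
list_matches() {
  pattern=$1
  shift
  for dir in "$@"; do
    if [ -d "$dir" ]; then
      # Unreadable subdirectories are skipped silently.
      find "$dir" -name "*${pattern}*" -print 2>/dev/null || true
    fi
  done
  return 0
}

# Usage on a real system (from a root shell to see everything):
#   list_matches nvidia-dkms-535 /etc /var /usr
```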
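↓ Force the package to be reconfigured
sudo dpkg-reconfigure --force nvidia-dkms-535
↓ Keep following the error messages and force-purging until everything is clean
sudo dpkg --purge --force-all nvidia-dkms-535
I ended up running this command about 10 times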
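What to do after the goodbye
Go back to ★ and install the CUDA Toolkit
The instructions are at the bottom of the page mentioned earlier
Use the Base Installer (apt is probably the best choice)
Now for the real thing! Installing the driver
Install CUDA by following the NVIDIA documentation
* The commands and their output are shown below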
3.10. Ubuntu
3.10.1. Prepare Ubuntu
- Perform the pre-installation actions.
Already done, because I'm diligent.
- The kernel headers and development packages for the currently running kernel can be installed with:
$ sudo apt install linux-headers-$(uname -r)
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
linux-headers-6.5.0-21-generic is already the newest version (6.5.0-21.21~22.04.1).
linux-headers-6.5.0-21-generic set to manually installed.
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
- Remove Outdated Signing Key:
$ sudo apt-key del 7fa2af80
Warning: apt-key is deprecated. Manage keyring files in trusted.gpg.d instead (see apt-key(8)).
OK
- Choose an installation method: local repo or network repo.
I chose the network repo this time
3.10.3. Network Repo Installation for Ubuntu
$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
--2024-02-29 22:34:59-- https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
Resolving www-proxy.waseda.jp (www-proxy.waseda.jp)... 133.9.4.20
Connecting to www-proxy.waseda.jp (www-proxy.waseda.jp)|133.9.4.20|:8080... connected.
Proxy request sent, awaiting response... 200 OK
Length: 4332 (4.2K) [application/x-deb]
Saving to: ‘cuda-keyring_1.1-1_all.deb.2’
cuda-keyring_1.1-1_all.de 100%[====================================>] 4.23K --.-KB/s in 0s
2024-02-29 22:34:59 (12.5 MB/s) - ‘cuda-keyring_1.1-1_all.deb.2’ saved [4332/4332]
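The download URL embeds the distribution and architecture checked earlier (ubuntu2204, x86_64). The sketch below builds it from those values. Everything here is an assumption inferred from the single URL in this post: `cuda_keyring_url` is a hypothetical helper name, the keyring version 1.1-1 is copied from the command above, and other distributions or releases may not follow the same pattern.

```shell
#!/bin/sh
# cuda_keyring_url is a hypothetical helper. It assumes the repo directory is
# the distro ID followed by the version without the dot (ubuntu + 22.04 ->
# ubuntu2204), as in the URL used in this post.
cuda_keyring_url() {
  distro_id=$1    # ID from /etc/os-release, e.g. ubuntu
  version_id=$2   # VERSION_ID from /etc/os-release, e.g. 22.04
  arch=$3         # output of uname -m, e.g. x86_64
  repo="${distro_id}$(printf '%s' "$version_id" | tr -d '.')"
  echo "https://developer.download.nvidia.com/compute/cuda/repos/${repo}/${arch}/cuda-keyring_1.1-1_all.deb"
}

# Usage on a real system:
#   . /etc/os-release
#   wget -O cuda-keyring.deb "$(cuda_keyring_url "$ID" "$VERSION_ID" "$(uname -m)")"
```

In the log above wget saved the file as cuda-keyring_1.1-1_all.deb.2 because older copies already existed in the directory, while dpkg -i reads cuda-keyring_1.1-1_all.deb. Passing -O to wget, as in the usage comment, keeps the file name fixed.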
$ sudo dpkg -i cuda-keyring_1.1-1_all.deb
(Reading database ... 257830 files and directories currently installed.)
Preparing to unpack cuda-keyring_1.1-1_all.deb ...
Unpacking cuda-keyring (1.1-1) over (1.1-1) ...
Setting up cuda-keyring (1.1-1) ...
Scroll down a little to
3.10.4. Common Installation Instructions for Ubuntu
These instructions apply to both local and network installation for Ubuntu.
- Update the Apt repository cache:
$ sudo apt update
Hit:1 https://download.docker.com/linux/ubuntu jammy InRelease
Hit:2 https://nvidia.github.io/libnvidia-container/stable/deb/amd64 InRelease
Hit:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 InRelease
Hit:4 https://dl.google.com/linux/chrome/deb stable InRelease
Hit:5 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:6 https://linux.teamviewer.com/deb stable InRelease
Hit:7 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:8 http://jp.archive.ubuntu.com/ubuntu jammy InRelease
Hit:9 http://jp.archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:10 http://jp.archive.ubuntu.com/ubuntu jammy-backports InRelease
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
All packages are up to date.
- Install CUDA SDK:
$ sudo apt install cuda-toolkit
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
cuda-toolkit
0 upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
Need to get 2,722 B of archives.
After this operation, 9,216 B of additional disk space will be used.
Get:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 cuda-toolkit 12.3.2-1 [2,722 B]
Fetched 2,722 B in 0s (30.6 kB/s)
Selecting previously unselected package cuda-toolkit.
(Reading database ... 257830 files and directories currently installed.)
Preparing to unpack .../cuda-toolkit_12.3.2-1_amd64.deb ...
Unpacking cuda-toolkit (12.3.2-1) ...
Setting up cuda-toolkit (12.3.2-1) ...
Setting alternatives
To include all GDS packages:
sudo apt install nvidia-gds
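If nvcc is still "not found" after the toolkit installs cleanly, the toolkit's bin directory may simply be missing from PATH. The sketch below assumes the toolkit lives under /usr/local/cuda, which is the usual location for NVIDIA's packages but is not confirmed anywhere in this post; `prepend_path` is a hypothetical helper name.

```shell
#!/bin/sh
# prepend_path is a hypothetical helper. It puts a directory at the front of
# PATH unless it is already listed, so running it twice has no extra effect.
prepend_path() {
  case ":$PATH:" in
    *":$1:"*) ;;                      # already present, leave PATH unchanged
    *) PATH="$1${PATH:+:$PATH}" ;;
  esac
  export PATH
}

# Usage (assumed install location, adjust if yours differs):
#   prepend_path /usr/local/cuda/bin
```

Putting the call in a shell startup file such as ~/.bashrc would make it persist across logins.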
Check that CUDA works
After this, just to be safe, I ran
sudo apt update
sudo apt upgrade
Once you've confirmed there are no errors, reboot
Revival or death
Run nvidia-smi, nvcc -V, and so on to confirm that the PC has come back to life!!!!
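For a quick check that the installed toolkit is the version you expect, the release number can be pulled out of the nvcc -V output. A small sketch, assuming nvcc prints a line of the form "Cuda compilation tools, release X.Y, V..."; `nvcc_release` is a hypothetical helper name.

```shell
#!/bin/sh
# nvcc_release is a hypothetical helper. It reads `nvcc -V` output on stdin
# and prints the release number from the "release X.Y," line.
nvcc_release() {
  sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

# Usage on a real system:
#   nvcc -V | nvcc_release
```

If nothing is printed, either nvcc did not run or its output format differs from the assumption above.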
Good luck