After a reboot, the following error occurred in "nvidia-device-plugin-daemonset":
Error: failed to start container "nvidia-device-plugin-ctr":
Error response from daemon: OCI runtime create failed: container_linux.go:349:
starting container process caused "process_linux.go:449:
container init caused \"process_linux.go:432:
running prestart hook 0 caused \\\"error running hook:
exit status 1, stdout: , stderr:
nvidia-container-cli: initialization error: nvml error: driver not loaded\\\\n\\\"\"": unknown
nvidia-smi does not work either
# nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
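Both messages ("driver not loaded" from nvidia-container-cli and the nvidia-smi failure) suggest that the nvidia kernel module is not loaded on the host. A minimal sketch of how this could be confirmed before reinstalling is shown here. The function name nvidia_module_loaded and its optional file argument are hypothetical helpers and not part of any NVIDIA tool; the check only assumes the standard /proc/modules layout, where each line starts with the module name.

```shell
#!/bin/sh
# Return 0 if a module named exactly "nvidia" appears in a /proc/modules-style listing.
# The optional file argument (hypothetical) lets the check run against saved output.
nvidia_module_loaded() {
    modules_file="${1:-/proc/modules}"
    # Each line begins with the module name followed by a space,
    # so '^nvidia ' does not match nvidia_uvm or nvidia_drm.
    grep -q '^nvidia ' "$modules_file" 2>/dev/null
}

if nvidia_module_loaded; then
    echo "nvidia kernel module is loaded"
else
    echo "nvidia kernel module is NOT loaded"
fi
```

If the module is reported as not loaded, the failure is on the host side and not in Kubernetes or the container runtime.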
Reinstall the driver
# bash NVIDIA-Linux-x86_64-470.57.02.run
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 470.57.02
Re-check nvidia-smi: OK
# nvidia-smi
Wed Sep 29 01:08:37 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 35%   32C    P8    N/A /  75W |      0MiB /  4040MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Redeploy the device plugin
# kubectl apply -f nvidia-device-plugin.yml
daemonset.apps/nvidia-device-plugin-daemonset unchanged
# kubectl get pods --all-namespaces
NAMESPACE     NAME                                   READY   STATUS    RESTARTS   AGE
kube-system   nvidia-device-plugin-daemonset-tmn88   1/1     Running   10         32m
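The apply reported "unchanged", so the pod was probably not recreated by it. The 10 restarts and the 32m age suggest the existing pod kept retrying and came up on its own once the driver was back. As a rough sketch, the listing can be filtered for anything that is still not Running. The function name not_running_pods is a hypothetical helper, and it assumes the column order of "kubectl get pods --all-namespaces" shown above (STATUS is the fourth column).

```shell
#!/bin/sh
# Print "namespace/name status" for every pod whose STATUS column is not "Running".
# Reads `kubectl get pods --all-namespaces` style output on stdin.
# Note: pods in "Completed" state are also listed, since they are not "Running".
not_running_pods() {
    # NR > 1 skips the header row; $4 is STATUS when --all-namespaces is used.
    awk 'NR > 1 && $4 != "Running" { print $1 "/" $2 " " $4 }'
}

# Usage (assumes kubectl is configured for the cluster):
#   kubectl get pods --all-namespaces | not_running_pods
```

If the device plugin pod stays in a failed state after the driver is fixed, one option is to restart it explicitly with "kubectl rollout restart daemonset/nvidia-device-plugin-daemonset -n kube-system" instead of re-applying an unchanged manifest.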