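A misunderstanding about the NVIDIA GPU concurrent session limit
With two 1050 Ti cards installed, I assumed the limit would be 3 * 2 = 6 concurrent sessions, but that was wrong:
the 3-session cap apparently applies to the system as a whole, not to each GPU.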
# nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 30%   25C    P8    N/A /  75W |      0MiB /  4040MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:02:00.0 Off |                  N/A |
| 35%   42C    P0    N/A /  75W |    227MiB /  4038MiB |      8%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A   1196663      C   some-process1                      99MiB |
|    1   N/A  N/A   1220659      C   some-process2                      62MiB |
|    1   N/A  N/A   1257213      C   some-process3                      62MiB |
+-----------------------------------------------------------------------------+
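Rather than reading the process table by eye, the per-GPU process count can be tallied with a small script. The following is only a sketch: `count_per_gpu` is a hypothetical helper name, and it assumes input in the "bus_id, pid" form that `nvidia-smi --query-compute-apps=gpu_bus_id,pid --format=csv,noheader` prints. It counts compute processes, which is not necessarily the same thing as encoder sessions.

```shell
#!/bin/sh
# count_per_gpu (hypothetical helper): reads "bus_id, pid" CSV lines on stdin
# and prints "<bus_id> <count>" for each GPU, sorted by bus ID.
count_per_gpu() {
    awk -F', *' 'NF >= 2 { n[$1]++ } END { for (id in n) print id, n[id] }' | sort
}

# Usage on a real host (requires the NVIDIA driver):
#   nvidia-smi --query-compute-apps=gpu_bus_id,pid --format=csv,noheader | count_per_gpu
```

With the three processes shown above, all on the card at 00000000:02:00.0, this would print a single line with a count of 3.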
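Patching the session limit
Apply the patch from the nvidia-patch repository (https://github.com/keylase/nvidia-patch) to remove the limit.
Host OS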
# git clone https://github.com/keylase/nvidia-patch.git
Cloning into 'nvidia-patch'...
remote: Enumerating objects: 3084, done.
remote: Counting objects: 100% (1161/1161), done.
remote: Compressing objects: 100% (423/423), done.
remote: Total 3084 (delta 840), reused 817 (delta 738), pack-reused 1923
Receiving objects: 100% (3084/3084), 1.69 MiB | 3.64 MiB/s, done.
Resolving deltas: 100% (1859/1859), done.
# cd nvidia-patch/
# bash ./patch.sh
Detected nvidia driver version: 470.57.02
Attention! Backup not found. Copying current libnvidia-encode.so to backup.
18b4249e6513b77a69b26c3212c5779736d749a3 /opt/nvidia/libnvidia-encode-backup/libnvidia-encode.so.470.57.02
29da0846e54cc1de195ed2efa4aa01388d469a48 /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.470.57.02
Patched!
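The two 40-character hex strings that patch.sh prints look like SHA-1 digests of the backup and of the installed library (an assumption based on their length). Under that assumption, one way to double-check later that the installed file still differs from the backup is sketched below; `is_patched` is a hypothetical helper, not part of nvidia-patch.

```shell
#!/bin/sh
# is_patched (hypothetical helper): returns 0 when the installed library's
# SHA-1 differs from the backup's, 1 when they match, 2 if a file is missing.
is_patched() {
    [ -f "$1" ] && [ -f "$2" ] || return 2
    backup_sum=$(sha1sum "$1" | cut -d' ' -f1)
    current_sum=$(sha1sum "$2" | cut -d' ' -f1)
    [ "$backup_sum" != "$current_sum" ]
}

# Usage with the paths printed by patch.sh above:
#   is_patched /opt/nvidia/libnvidia-encode-backup/libnvidia-encode.so.470.57.02 \
#              /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.470.57.02 && echo "patched"
```

A differing checksum only shows that the file was modified, not that the patch works; the real confirmation is running more than three sessions.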
Container
Add the following to your own Dockerfile, modeled on the Dockerfile in the checked-out nvidia-patch source.
RUN mkdir -p /usr/local/bin /patched-lib
COPY patch.sh docker-entrypoint.sh /usr/local/bin/
RUN chmod +x /usr/local/bin/patch.sh /usr/local/bin/docker-entrypoint.sh
ENTRYPOINT ["/usr/local/bin/docker-entrypoint.sh"]
Result
Confirm that more than 3 sessions can now run concurrently (four processes on GPU 1 below).
# nvidia-smi
Thu Oct 21 03:38:49 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 30%   25C    P8    N/A /  75W |      0MiB /  4040MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:02:00.0 Off |                  N/A |
| 35%   42C    P0    N/A /  75W |    318MiB /  4038MiB |     11%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A   11966xx      C   some-process1                      99MiB |
|    1   N/A  N/A   12206xx      C   some-process2                      62MiB |
|    1   N/A  N/A   12572xx      C   some-process3                      62MiB |
|    1   N/A  N/A   12578xx      C   some-process4                      62MiB |
+-----------------------------------------------------------------------------+
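Since the cap seemed to apply system-wide, the number to watch is the total across all GPUs. The sketch below sums a per-GPU session column; `total_sessions` is a hypothetical helper, and it assumes "index, count" lines such as those from `nvidia-smi --query-gpu=index,encoder.stats.sessionCount --format=csv,noheader` (check `nvidia-smi --help-query-gpu` to confirm that field is available on your driver version).

```shell
#!/bin/sh
# total_sessions (hypothetical helper): reads "gpu_index, session_count" CSV
# lines on stdin and prints the sum of the second column.
total_sessions() {
    awk -F', *' '{ s += $2 } END { print s + 0 }'
}

# Usage on a real host (field name is an assumption to verify locally):
#   nvidia-smi --query-gpu=index,encoder.stats.sessionCount --format=csv,noheader | total_sessions
```

A total above 3 would indicate that the patched driver is accepting more sessions than the original limit allowed.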