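In watsonx.ai, foundation models that support NVIDIA Multi-Instance GPU (MIG) can be deployed on MIG-configured GPUs. (This lets you install several small models on a single GPU and use GPU resources more efficiently.)
This post walks through an example of configuring MIG and then disabling it.
Initial state (MIG not configured)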
$ oc exec -n nvidia-gpu-operator -it nvidia-driver-daemonset-416.94.202412100237-0-fst8j -- nvidia-smi
Thu Jun 5 05:35:22 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 NVL On | 00000000:08:00.0 Off | 0 |
| N/A 28C P0 57W / 400W | 1MiB / 95830MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Configuring MIG
- Set the MIG advertisement strategy to single.
  Specify the host name, strategy, and configuration label as environment variables.
  - STRATEGY: single
    IBM® Software Hub version 5.1.0 validated the NVIDIA MIG single strategy and added support for it. The single strategy lets you use a fixed partition size on a single GPU.
  - MIG_CONFIGURATION: all-3g.47gb, selected as the appropriate value from the list at the linked page.
  NODE_NAME=mynode STRATEGY=single MIG_CONFIGURATION=all-3g.47gb
- Apply the desired MIG partitioning profile.
oc label node/${NODE_NAME} nvidia.com/mig.config=${MIG_CONFIGURATION} --overwrite
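The steps above set a STRATEGY variable but show no command that applies it. The following is a minimal sketch of one way to do that, assuming the strategy lives in the spec.mig.strategy field of the GPU Operator's ClusterPolicy and that the policy is named gpu-cluster-policy. Both the function names and the policy name are assumptions for illustration, not something this post confirms. The block only prints the JSON patch; set_mig_strategy is the part that would actually talk to the cluster.

```shell
# Build the JSON patch that sets spec.mig.strategy on the ClusterPolicy.
# Only the strategy names none, single, and mixed are accepted.
mig_strategy_patch() {
  case "$1" in
    none|single|mixed)
      printf '[{"op":"replace","path":"/spec/mig/strategy","value":"%s"}]\n' "$1"
      ;;
    *)
      echo "unsupported MIG strategy: $1" >&2
      return 1
      ;;
  esac
}

# Apply the patch with oc. The default policy name "gpu-cluster-policy"
# is an assumption; pass the real name as the second argument if it differs.
set_mig_strategy() {
  patch=$(mig_strategy_patch "$1") || return 1
  oc patch "clusterpolicy/${2:-gpu-cluster-policy}" --type=json -p "$patch"
}

# Print the patch only; nothing is sent to the cluster here.
mig_strategy_patch "${STRATEGY:-single}"
```

With the variables from the steps above, running set_mig_strategy "${STRATEGY}" before labeling the node would apply the strategy. Check the name of your ClusterPolicy with oc get clusterpolicy first.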
Verifying the MIG configuration (all-3g.47gb)
$ oc get node/${NODE_NAME} -o json | jq '.metadata.labels'| grep mig
"nvidia.com/gpu.deploy.mig-manager": "true",
"nvidia.com/mig.config": "all-3g.47gb",
"nvidia.com/mig.config.state": "success"
$ oc exec -n nvidia-gpu-operator -it nvidia-driver-daemonset-416.94.202412100237-0-fst8j -- nvidia-smi -L
GPU 0: NVIDIA H100 NVL (UUID: GPU-f279a52b-e802-c262-e776-a4197f09a5f7)
MIG 3g.47gb Device 0: (UUID: MIG-16600d9a-ed18-5d1b-915c-472019110d02)
MIG 3g.47gb Device 1: (UUID: MIG-4fedd3e9-d1bc-582e-b01f-bc24dd98e15c)
$ oc exec -n nvidia-gpu-operator -it nvidia-driver-daemonset-416.94.202412100237-0-fst8j -- nvidia-smi
Fri Jun 6 07:08:40 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 NVL On | 00000000:08:00.0 Off | On |
| N/A 28C P0 57W / 400W | 76MiB / 95830MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+==================================+===========+=======================|
| 0 1 0 0 | 38MiB / 47488MiB | 60 0 | 3 0 3 0 3 |
| | 0MiB / 65535MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 0 2 0 1 | 38MiB / 47488MiB | 60 0 | 3 0 3 0 3 |
| | 0MiB / 65535MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
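The check above is done by eye. As a rough sketch, the nvidia-smi -L listing can also be counted in a script, which is handy when the same verification runs on several nodes. The function name count_mig_devices is hypothetical, and the sample text is the listing from this post.

```shell
# Count the MIG devices of a given profile in `nvidia-smi -L` output (stdin).
# awk splits on any run of whitespace, so column padding does not matter.
count_mig_devices() {
  awk -v profile="$1" '
    $1 == "MIG" && $2 == profile && $3 == "Device" { n++ }
    END { print n + 0 }
  '
}

# Sample listing copied from the verification output in this post.
sample='GPU 0: NVIDIA H100 NVL (UUID: GPU-f279a52b-e802-c262-e776-a4197f09a5f7)
MIG 3g.47gb Device 0: (UUID: MIG-16600d9a-ed18-5d1b-915c-472019110d02)
MIG 3g.47gb Device 1: (UUID: MIG-4fedd3e9-d1bc-582e-b01f-bc24dd98e15c)'

printf '%s\n' "$sample" | count_mig_devices 3g.47gb   # prints 2
```

Against a live cluster, the same function can read from the driver pod instead of the sample, for example by piping the oc exec ... nvidia-smi -L command shown earlier into count_mig_devices 3g.47gb.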
Disabling MIG
MIG_CONFIGURATION=all-disabled && \
oc label node/$NODE_NAME nvidia.com/mig.config=$MIG_CONFIGURATION --overwrite
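Changing the label does not take effect instantly. The node reports progress through the nvidia.com/mig.config.state label, which read "success" in the verification output earlier. The sketch below polls that label until it settles. The function names, the retry defaults, and the assumption that a failed run is reported as "failed" are illustrative and not taken from this post.

```shell
# Print the current nvidia.com/mig.config.state label of a node.
# Dots inside the label key are escaped for the jsonpath expression.
mig_config_state() {
  oc get "node/$1" -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'
}

# Poll until the state is "success". Return non-zero if the state becomes
# "failed" (assumed value) or if the attempts run out.
# Usage: wait_for_mig_config NODE [ATTEMPTS] [INTERVAL_SECONDS]
wait_for_mig_config() {
  node=$1
  attempts=${2:-60}
  interval=${3:-10}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    state=$(mig_config_state "$node") || state=""
    case "$state" in
      success)
        echo "MIG configuration applied on $node"
        return 0
        ;;
      failed)
        echo "MIG configuration failed on $node" >&2
        return 1
        ;;
    esac
    i=$((i + 1))
    sleep "$interval"
  done
  echo "timed out waiting for MIG configuration on $node" >&2
  return 1
}
```

Calling wait_for_mig_config "$NODE_NAME" right after the oc label command would block until the change has been applied, so that the nvidia-smi check that follows reflects the new state.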
After disabling MIG
$ oc exec -n nvidia-gpu-operator -it nvidia-driver-daemonset-416.94.202412100237-0-fst8j -- nvidia-smi
Fri Jun 6 08:39:20 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 NVL On | 00000000:08:00.0 Off | 0 |
| N/A 28C P0 60W / 400W | 1MiB / 95830MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Environment
- GPU: NVIDIA H100 NVL 94GB
- OCP 4.16
- NVIDIA GPU Operator 24.6.2
- IBM Software Hub 5.1.3