Example: finding out which foundation model is using which GPU in watsonx.ai

Last updated at 2025-04-24. Posted at 2025-04-23.

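When several foundation models are deployed, you sometimes need to know which foundation model is using which GPU.

Getting the node where each foundation model runs

When you deploy a foundation model in watsonx.ai, a predictor pod starts.
The following example shows two custom foundation models deployed (some characters have been replaced). The output shows which node each model's pod is running on.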

# oc get pod -o wide | grep predictor
cfm-74945870-d634-4f67-9763-126664e0ffcb-predictor-5ff45b54t7cb   1/1     Running     0             26h     10.128.6.44    compute-1-ru25.ocp.xxx.yyy.com   <none>           <none>
cfm-d5d3dd30-ebe4-4705-afe7-87166c12b628-predictor-7954f8cgshvx   1/1     Running     0             26h     10.128.6.43    compute-1-ru25.ocp.xxx.yyy.com   <none>           <none>

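Because this example uses custom foundation models, it is hard to tell from the pod name which foundation model a predictor pod belongs to. In that case, refer to the pod's .metadata.annotations.model-id value. When you use a foundation model provided by IBM in watsonx.ai, the pod name contains the name of the foundation model.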
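The annotation lookup described above could be scripted roughly as follows. This is a minimal sketch, not an official procedure: the function name `model_id_of_pod` is made up for illustration, and the namespace `wx` in the usage line is an assumption taken from the pod log path that appears later in this article, so substitute the namespace your predictor pods actually run in.

```shell
# Hypothetical helper: print the model-id annotation of one predictor pod.
# Arguments: <pod-name> <namespace>
model_id_of_pod() {
  oc get pod "$1" -n "$2" -o jsonpath='{.metadata.annotations.model-id}'
}

# Usage (the namespace "wx" is an assumption; use your own):
#   model_id_of_pod cfm-74945870-d634-4f67-9763-126664e0ffcb-predictor-5ff45b54t7cb wx
```

If the annotation is not set on a pod, the command simply prints nothing.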

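Getting the processes that are using each GPU

Next, run nvidia-smi in the nvidia-driver-daemonset pod on the node where the predictor pods are running. The output shows the GPU usage, plus the PID and process name of each process using a GPU.
In the example output below, GPU 0 and GPU 1 are in use, and GPU 0 is being used by PID 2647282.
However, the Process name is /opt/vllm/bin/python3 in both rows, so this output alone does not tell you which model is using which GPU.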

# oc exec -n nvidia-gpu-operator -it nvidia-driver-daemonset-416.94.202412170927-0-9rgpj -- nvidia-smi
Fri Apr 18 05:55:33 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03             Driver Version: 550.144.03     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  |   00000000:41:00.0 Off |                    0 |
| N/A   29C    P0             45W /  250W |   36235MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          On  |   00000000:61:00.0 Off |                    0 |
| N/A   34C    P0             47W /  250W |   36655MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-PCIE-40GB          On  |   00000000:A1:00.0 Off |                    0 |
| N/A   27C    P0             37W /  250W |       1MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   2647282      C   /opt/vllm/bin/python3                       36216MiB |
|    1   N/A  N/A   2615491      C   /opt/vllm/bin/python3                       36636MiB |
+-----------------------------------------------------------------------------------------+
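When a node has many GPUs, it may help to reduce the table above to plain GPU and PID pairs. The following is a minimal sketch that assumes the `Processes:` table layout shown in this article (driver 550.144.03); other driver versions may lay the table out differently. The function name `gpu_pids` is hypothetical.

```shell
# Hypothetical helper: read nvidia-smi output on stdin and print
# "<GPU index> <PID>" for every row of the "Processes:" table.
gpu_pids() {
  awk '
    /Processes:/ { in_table = 1; next }      # rows we want come after this header
    in_table && $1 == "|" && $2 ~ /^[0-9]+$/ { print $2, $5 }
  '
}

# Usage (drop -it when piping; the pod name is the one from this article):
#   oc exec -n nvidia-gpu-operator nvidia-driver-daemonset-416.94.202412170927-0-9rgpj -- nvidia-smi | gpu_pids
```

For the output shown above, this prints `0 2647282` and `1 2615491`. nvidia-smi also offers a `--query-compute-apps` option with `--format=csv`, which may be easier to parse, although it identifies GPUs by UUID or bus ID rather than by index.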

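Linking a process to a foundation model

Next, find out which of the two models PID=2647282 belongs to.
First, open a debug session on the node that hosts the GPU in question.

Example command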

oc debug node/compute-1-ru25.ocp.xxx.yyy.com

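In that session, follow the parent processes of PID=2647282 (/opt/vllm/bin/python3). Its parent is 2645293 (python3 -m vllm_tgis_adapter), whose parent is the conmon process 2645276. The conmon command line names the pod, which shows that the process was started from cfm-74945870-d634-4f67-9763-126664e0ffcb-predictor-5ff45b54t7cb.

Example run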

sh-5.1# ps -ef | grep 2647282
1001040+ 2647282 2645293  0 Apr17 ?        00:05:40 /opt/vllm/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=17, pipe_handle=19) --multiprocessing-fork
root     3314196 3005053  0 06:26 ?        00:00:00 grep 2647282
sh-5.1# ps -ef | grep 2645293
1001040+ 2645293 2645276  0 Apr17 ?        00:01:32 python3 -m vllm_tgis_adapter --uvicorn-log-level=warning
1001040+ 2647281 2645293  0 Apr17 ?        00:00:00 /opt/vllm/bin/python3 -c from multiprocessing.resource_tracker import main;main(16)
1001040+ 2647282 2645293  0 Apr17 ?        00:05:40 /opt/vllm/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=17, pipe_handle=19) --multiprocessing-fork
root     3324494 3005053  0 06:27 ?        00:00:00 grep 2645293
sh-5.1# ps -ef | grep 2645276
root     2645276       1  0 Apr17 ?        00:00:00 /usr/bin/conmon -b /run/containers/storage/overlay-containers/3fce3d76554207b939a73ad530a05f6507d07fa3da273aad177855a82e69d845/userdata -c 3fce3d76554207b939a73ad530a05f6507d07fa3da273aad177855a82e69d845 --exit-dir /var/run/crio/exits -l /var/log/pods/wx_cfm-74945870-d634-4f67-9763-126664e0ffcb-predictor-5ff45b54t7cb_278b0c9f-90f3-4303-8d5c-aaff2ae5f237/kserve-container/0.log --log-level info -n k8s_kserve-container_cfm-74945870-d634-4f67-9763-126664e0ffcb-predictor-5ff45b54t7cb_wx_278b0c9f-90f3-4303-8d5c-aaff2ae5f237_0 -P /run/containers/storage/overlay-containers/3fce3d76554207b939a73ad530a05f6507d07fa3da273aad177855a82e69d845/userdata/conmon-pidfile -p /run/containers/storage/overlay-containers/3fce3d76554207b939a73ad530a05f6507d07fa3da273aad177855a82e69d845/userdata/pidfile --persist-dir /var/lib/containers/storage/overlay-containers/3fce3d76554207b939a73ad530a05f6507d07fa3da273aad177855a82e69d845/userdata -r /usr/bin/runc --runtime-arg --root=/run/runc --socket-dir-path /var/run/crio --syslog -u 3fce3d76554207b939a73ad530a05f6507d07fa3da273aad177855a82e69d845 -s
1001040+ 2645293 2645276  0 Apr17 ?        00:01:32 python3 -m vllm_tgis_adapter --uvicorn-log-level=warning
root     3326596 3005053  0 06:27 ?        00:00:00 grep 2645276

So in this case, GPU 0 is being used by cfm-74945870-d634-4f67-9763-126664e0ffcb-predictor-5ff45b54t7cb.
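The manual `ps -ef | grep` steps above could be automated along these lines. This is only a sketch under assumptions: both function names are hypothetical, and the parsing relies on the conmon `-n k8s_<container>_<pod>_<namespace>_<uid>_<attempt>` naming seen in the output above, which may differ on other container runtimes or versions.

```shell
# Hypothetical helper: print "<PID> <command line>" for a process and each
# of its ancestors, stopping before PID 1.
pid_ancestry() {
  pid=$1
  while [ -n "$pid" ] && [ "$pid" -gt 1 ]; do
    printf '%s %s\n' "$pid" "$(ps -o args= -p "$pid")"
    pid=$(ps -o ppid= -p "$pid" | tr -d ' ')
  done
}

# Hypothetical helper: read process lines on stdin and, for a conmon line,
# print "<namespace>/<pod>" taken from its -n k8s_... argument.
pod_from_conmon_args() {
  sed -n 's|.* -n k8s_[^_]*_\([^_]*\)_\([^_]*\)_[^ ]*.*|\2/\1|p'
}

# Usage, inside the node debug session:
#   pid_ancestry 2647282 | pod_from_conmon_args
```

With the process tree shown above, the usage line should print `wx/cfm-74945870-d634-4f67-9763-126664e0ffcb-predictor-5ff45b54t7cb`. Walking the chain by PPID avoids the unrelated matches that a plain `grep` on a PID number can pick up.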

Verification environment

OCP 4.16
IBM Software Hub 5.1.2
watsonx.ai 2.1.2
