
[NVIDIA] How to Use Data Center GPU Manager (DCGM)

Posted at 2023-01-10, last updated at 2023-01-10

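Introduction

The steps written on GitHub made no sense to me...
Maybe they are obvious to people who already know the tool?
This is a memo for people like me, and for everyone else.

So there is no new information here.
For the authoritative details, see GitHub.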

References

GitHub NVIDIA/DCGM (https://github.com/NVIDIA/DCGM)

What is DCGM?

DCGM stands for Data Center GPU Manager. Put simply, it is a tool for managing and operating GPUs.
NVIDIA Management Library (NVML) is a similar tool, but DCGM feels like the more advanced of the two.

- NVIDIA System Management Interface (nvidia-smi): monitoring CLI
- NVIDIA Management Library (NVML): something like an API
- Data Center GPU Manager (DCGM): advanced GPU management tool

Environment

OS: CentOS 8 series
GPU: V100
DCGM: 3.1.3

Steps

(1) Download DCGM
(2) Install DCGM
(3) Start the DCGM service
(4) Verify operation
(5) dcgmi commands
- Diag
- Health
- NVLink

(1) Download DCGM

Add the NVIDIA CUDA repository for RHEL 8 so that dnf can fetch the DCGM package.

sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo

(2) Install DCGM

sudo dnf clean expire-cache
sudo dnf install -y datacenter-gpu-manager

(3) Start the DCGM service

sudo systemctl --now enable nvidia-dcgm
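
Right after the service is enabled, the host engine may need a moment before it answers. The following is a minimal sketch for waiting on it, not something taken from the DCGM documentation. It assumes that `dcgmi discovery -l` (which lists the GPUs DCGM can see) exits with a non-zero status while the host engine is unreachable. The function name `wait_for_dcgm` and its retry parameters are hypothetical.

```shell
# Poll the DCGM host engine until it responds or the retries run out.
# Usage: wait_for_dcgm [tries] [delay_seconds]
wait_for_dcgm() {
    tries=${1:-5}
    delay=${2:-2}
    i=1
    while [ "$i" -le "$tries" ]; do
        # Assumption: a non-zero exit status means the host engine is not reachable yet.
        if dcgmi discovery -l >/dev/null 2>&1; then
            echo "DCGM host engine is responding (attempt $i)"
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    echo "DCGM host engine did not respond after $tries attempts" >&2
    return 1
}
```

On a DCGM host this could be called as `wait_for_dcgm 10 3` before moving on to the checks in step (4).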

(4) Verify operation

dcgmi
---
Usage: 
   dcgmi subsystem 
   dcgmi -v 

For complete USAGE and HELP type: 
   dcgmi --help
---

dcgmi group -l
---
+-------------------+----------------------------------------------------------+
| GROUPS                                                                       |
| 2 groups found.                                                              |
+===================+==========================================================+
| Groups            |                                                          |
| -> 0              |                                                          |
|    -> Group ID    | 0                                                        |
|    -> Group Name  | DCGM_ALL_SUPPORTED_GPUS                                  |
|    -> Entities    | GPU 0, GPU 1, GPU 2, GPU 3                               |
| -> 1              |                                                          |
|    -> Group ID    | 1                                                        |
|    -> Group Name  | DCGM_ALL_SUPPORTED_NVSWITCHES                            |
|    -> Entities    | None                                                     |
+-------------------+----------------------------------------------------------+
---

By default, the GPUs are assigned to Group 0. The idea is that you run commands against this group.
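
Since the later commands all take a group ID, it can be handy to pull the IDs out of the listing in a script. The sketch below assumes the table layout shown above (a `-> Group ID` row with the value in the second column); the function name `list_group_ids` is hypothetical, and the layout may differ in other DCGM versions.

```shell
# Print the numeric ID of every group found in `dcgmi group -l` output read from stdin.
list_group_ids() {
    # Fields are split on '|': $2 is the label column, $3 is the value column.
    awk -F'|' '/-> Group ID/ { gsub(/ /, "", $3); print $3 }'
}

# Usage on a DCGM host:
#   dcgmi group -l | list_group_ids
```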

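If the group you need does not exist, create it with -c.

dcgmi group -c <groupName> -a <entityId>

# -c (create) test (group name) -a (add) 0,1 (assign GPU 0 and GPU 1)
dcgmi group -c test -a 0,1

# list the groups again to confirm that the new group exists
dcgmi group -l
---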
+-------------------+----------------------------------------------------------+
| GROUPS                                                                       |
| 3 groups found.                                                              |
+===================+==========================================================+
| Groups            |                                                          |
| -> 0              |                                                          |
|    -> Group ID    | 0                                                        |
|    -> Group Name  | DCGM_ALL_SUPPORTED_GPUS                                  |
|    -> Entities    | GPU 0, GPU 1, GPU 2, GPU 3                               |
| -> 1              |                                                          |
|    -> Group ID    | 1                                                        |
|    -> Group Name  | DCGM_ALL_SUPPORTED_NVSWITCHES                            |
|    -> Entities    | None                                                     |
| -> 4              |                                                          |
|    -> Group ID    | 2                                                        |
|    -> Group Name  | test                                                     |
|    -> Entities    | GPU 0, GPU 1                                             |
+-------------------+----------------------------------------------------------+
---
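
The ID assigned to a newly created group is what the -g option expects later, so a script may want to look it up by name. This is a rough sketch under the same assumption about the table layout as the listing above; the function name `group_id_by_name` is hypothetical.

```shell
# Print the group ID whose "Group Name" row matches the given name.
# Reads `dcgmi group -l` output from stdin; prints nothing if no group matches.
group_id_by_name() {
    awk -F'|' -v want="$1" '
        /-> Group ID/   { id = $3; gsub(/ /, "", id) }
        /-> Group Name/ {
            name = $3
            gsub(/^ +| +$/, "", name)   # trim the padding around the name
            if (name == want) print id
        }
    '
}

# Usage on a DCGM host:
#   dcgmi group -l | group_id_by_name test
```

With the listing above, looking up `test` would give 2, the value in its Group ID row.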

(5) dcgmi commands

Diag

dcgmi diag -h

---
   dcgmi diag --host <IP/FQDN> -g <groupId> -r <diag> -p
        <test_name.variable_name=variable_value> -c
        </full/path/to/config/file> -f <fakeGpuList> -i <gpuList> -v
        --statsonfail --debugLogFile <debug file> --statspath <plugin
        statistics path> -j --throttle-mask <> --fail-early
        --check-interval <failure check interval> --iterations <iterations>

  -r  --run        diag       Run a diagnostic. (Note: higher numbered tests
                               include all beneath.)   
                               1 - Quick (System Validation ~ seconds)  
                               2 - Medium (Extended System Validation ~ 2
                               minutes)  
                               3 - Long (System HW Diagnostics ~ 15 minutes)  
                               4 - Extended (Longer-running System HW
                               Diagnostics)  
                               Specific tests to run may be specified by name,
                               and multiple tests may be specified as a comma
                               separated list. For example, the command: 
                                
                               dcgmi diag -r "sm stress,diagnostic"  
                                
                               would run the SM Stress and Diagnostic tests
                               together.
---

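# -g (group) 0 (group ID shown by dcgmi group -l) -r (run diagnostic) 1 (run at level 1)
dcgmi diag -g 0 -r 1

dcgmi diag -g 0 -r <1-4>
1 - Quick (system validation, ~ seconds)
2 - Medium (extended system validation, ~ 2 minutes)
3 - Long (system HW diagnostics, ~ 15 minutes)
4 - Extended (longer-running system HW diagnostics)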
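
Because the higher levels take a long time, it may help to reject a mistyped level before anything starts. The wrapper below is only a sketch: the name `run_diag` is hypothetical, and it deliberately accepts just the numeric levels 1 to 4, even though the help text says -r can also take test names.

```shell
# Run a DCGM diagnostic after checking that the level is one of 1-4.
# Usage: run_diag <groupId> <level>
run_diag() {
    group=$1
    level=$2
    case "$level" in
        1|2|3|4) ;;
        *)
            echo "run_diag: level must be 1, 2, 3, or 4 (got '$level')" >&2
            return 2
            ;;
    esac
    dcgmi diag -g "$group" -r "$level"
}
```

For example, `run_diag 0 1` runs the quick check on Group 0, while `run_diag 0 5` stops with an error message instead of calling dcgmi.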

Health

dcgmi health -h

---
Usage: dcgmi health
   dcgmi health --host <IP/FQDN> -g <groupId> -c -j 
   dcgmi health --host <IP/FQDN> -g <groupId> -f -j 
   dcgmi health --host <IP/FQDN> -g <groupId> -s <flags> -j -m <seconds> -u <seconds>

  -s  --set        flags      Set the watches to be monitored. [default = pm] 
                               a - all watches 
                               p - PCIe watches (*) 
                               m - memory watches (*) 
                               i - infoROM watches 
                               t - thermal and power watches (*) 
                               n - NVLink watches (*) 
                               (*) watch requires 60 sec before first query

  -c  --check                 Check to see if any errors or warnings have
                               occurred in the currently monitored watches.
  -h  --help                  Displays usage information and exits.
  -j  --json                  Print the output in a json format
---

# -g (group) 0 (group ID shown by dcgmi group -l) -s (set watches) a (all watches)
dcgmi health -g 0 -s a

# -g (group) 0 (group ID shown by dcgmi group -l) -c (check) -j (print as JSON)
dcgmi health -g 0 -c -j

The flow is: enable the watches on the group ID first, then run the check.
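
The help text notes that the starred watches need 60 seconds before the first query, so the two commands can be tied together with a pause in between. The following is a minimal sketch of that sequence; the function name `health_check` and its optional wait argument are hypothetical.

```shell
# Enable all health watches on a group, wait for the warm-up period, then check.
# Usage: health_check <groupId> [wait_seconds]
health_check() {
    group=$1
    # Default of 60 s follows the "(*) watch requires 60 sec before first query" note.
    wait_seconds=${2:-60}
    dcgmi health -g "$group" -s a || return 1
    sleep "$wait_seconds"
    dcgmi health -g "$group" -c -j
}
```

Calling `health_check 0` would enable every watch on Group 0, wait a minute, and then print the check result as JSON.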

NVLink

dcgmi nvlink -h
---
Usage: dcgmi nvlink
   dcgmi nvlink --host <IP/FQDN> -g <gpuId> -e -j 
   dcgmi nvlink --host <IP/FQDN> -s 

  -e  --errors                Print NvLink errors for a given gpuId (-g).
  -s  --link-status           Print NvLink link status for all GPUs and NvSwitches in the system.
---

# -g (GPU ID, not a group ID; see the usage above) 0 -e (print NVLink errors)
dcgmi nvlink -g 0 -e

# -s (print link status for all GPUs and NVSwitches; -g is not needed)
dcgmi nvlink -s
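
Since -e works on one GPU ID at a time, checking every GPU means repeating the command. Here is a small sketch that loops over the IDs passed as arguments and then prints the overall link status. The function name `nvlink_report` is hypothetical, and the GPU IDs are assumed to be the ones shown under Entities in the group listing.

```shell
# Print NVLink errors for each GPU ID given as an argument, then the overall link status.
# Usage: nvlink_report 0 1 2 3
nvlink_report() {
    for gpu_id in "$@"; do
        echo "== NVLink errors for GPU $gpu_id =="
        dcgmi nvlink -g "$gpu_id" -e
    done
    echo "== NVLink link status =="
    dcgmi nvlink -s
}
```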

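For now, I was able to run everything.
There are many other commands, but these are the ones that seem likely to be useful for maintenance work.
I hope this helps.

See you next time.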
