
【NVIDIA】How to Use Data Center GPU Manager (DCGM)

Posted at 2023-01-10

Introduction

The instructions on GitHub left me completely lost...
Maybe they make sense if you are the right kind of reader???
These are notes for me, and for anyone else in the same boat.

So there is no new information here.
Please see GitHub.

What is DCGM?

DCGM is short for Data Center GPU Manager; simply put, it is a tool for managing and operating GPUs. There is a similar tool, the NVIDIA Management Library (NVML), but DCGM is the more advanced of the two.

  • NVIDIA System Management Interface (nvidia-smi): monitoring CLI
    • Something like a CLI for the NVIDIA Management Library (NVML) API
  • Data Center GPU Manager (DCGM): advanced GPU management tool
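
A quick way to see the difference side by side (an added example, not from the original article; dcgmi discovery needs the nvidia-dcgm service from step (3) to be running):

# One-shot monitoring snapshot from the classic CLI
nvidia-smi

# List the GPUs/NvSwitches that DCGM can manage
dcgmi discovery -l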


Environment

  • OS: CentOS 8
  • GPU: V100
  • DCGM: 3.1.3

Steps

  • (1) Download DCGM
  • (2) Install DCGM
  • (3) Start the DCGM service
  • (4) Verify operation
  • (5) dcgmi commands
    • Diag
    • Health
    • NVLink

(1) Download DCGM (add the package repository)

$ sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
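
To confirm the repository was added (an extra check I have added, not part of the original steps):

# The cuda-rhel8 repo should now appear in the repo list
dnf repolist | grep -i cuda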

(2) Install DCGM

sudo dnf clean expire-cache 
sudo dnf install -y datacenter-gpu-manager
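
A quick way to confirm the install (an added check; dcgmi -v also appears in the usage output below):

# Confirm the package is present and print the dcgmi version
rpm -q datacenter-gpu-manager
dcgmi -v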

(3) Start the DCGM service

sudo systemctl --now enable nvidia-dcgm
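
To confirm the service came up (an added check, not in the original):

# The service should report active (running)
systemctl status nvidia-dcgm --no-pager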

(4) Verify operation

dcgmi
---
Usage: 
   dcgmi subsystem 
   dcgmi -v 

For complete USAGE and HELP type: 
   dcgmi --help
---

dcgmi group -l
---
+-------------------+----------------------------------------------------------+
| GROUPS                                                                       |
| 2 groups found.                                                              |
+===================+==========================================================+
| Groups            |                                                          |
| -> 0              |                                                          |
|    -> Group ID    | 0                                                        |
|    -> Group Name  | DCGM_ALL_SUPPORTED_GPUS                                  |
|    -> Entities    | GPU 0, GPU 1, GPU 2, GPU 3                               |
| -> 1              |                                                          |
|    -> Group ID    | 1                                                        |
|    -> Group Name  | DCGM_ALL_SUPPORTED_NVSWITCHES                            |
|    -> Entities    | None                                                     |
+-------------------+----------------------------------------------------------+
---

By default, the GPUs are assigned to group 0. The idea is that you run commands against this group.

If the group you need has not been created, create it with -c:

dcgmi group -c <groupName> -a <entityId> 

# -c (create) a group named test, -a (add) assigns 0,1 (GPU #0/#1)
dcgmi group -c test -a 0,1
---
+-------------------+----------------------------------------------------------+
| GROUPS                                                                       |
| 3 groups found.                                                              |
+===================+==========================================================+
| Groups            |                                                          |
| -> 0              |                                                          |
|    -> Group ID    | 0                                                        |
|    -> Group Name  | DCGM_ALL_SUPPORTED_GPUS                                  |
|    -> Entities    | GPU 0, GPU 1, GPU 2, GPU 3                               |
| -> 1              |                                                          |
|    -> Group ID    | 1                                                        |
|    -> Group Name  | DCGM_ALL_SUPPORTED_NVSWITCHES                            |
|    -> Entities    | None                                                     |
| -> 4              |                                                          |
|    -> Group ID    | 2                                                        |
|    -> Group Name  | test                                                     |
|    -> Entities    | GPU 0, GPU 1                                             |
+-------------------+----------------------------------------------------------+
---

(5) dcgmi commands

Diag

dcgmi diag -h

---
   dcgmi diag --host <IP/FQDN> -g <groupId> -r <diag> -p
        <test_name.variable_name=variable_value> -c
        </full/path/to/config/file> -f <fakeGpuList> -i <gpuList> -v
        --statsonfail --debugLogFile <debug file> --statspath <plugin
        statistics path> -j --throttle-mask <> --fail-early
        --check-interval <failure check interval> --iterations <iterations>

  -r  --run        diag       Run a diagnostic. (Note: higher numbered tests
                               include all beneath.)   
                               1 - Quick (System Validation ~ seconds)  
                               2 - Medium (Extended System Validation ~ 2
                               minutes)  
                               3 - Long (System HW Diagnostics ~ 15 minutes)  
                               4 - Extended (Longer-running System HW
                               Diagnostics)  
                               Specific tests to run may be specified by name,
                               and multiple tests may be specified as a comma
                               separated list. For example, the command: 
                                
                               dcgmi diag -r "sm stress,diagnostic"  
                                
                               would run the SM Stress and Diagnostic tests
                               together.
---

# -g (group) 0 (a Group ID shown by dcgmi group -l), -r (run a diagnostic) 1 (level 1)
dcgmi diag -g 0 -r 1

dcgmi diag -g 0 -r <1-4>
1 - Quick (system validation, ~seconds)
2 - Medium (extended system validation, ~2 minutes)
3 - Long (system HW diagnostics, ~15 minutes)
4 - Extended (longer-running system HW diagnostics)
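
If you want machine-readable results (an added example; the -j flag is listed in the help output above):

# Run the quick diagnostic and print the results as JSON
dcgmi diag -g 0 -r 1 -j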

Health

dcgmi health -h

---
Usage: dcgmi health
   dcgmi health --host <IP/FQDN> -g <groupId> -c -j 
   dcgmi health --host <IP/FQDN> -g <groupId> -f -j 
   dcgmi health --host <IP/FQDN> -g <groupId> -s <flags> -j -m <seconds> -u <seconds>

  -s  --set        flags      Set the watches to be monitored. [default = pm] 
                               a - all watches 
                               p - PCIe watches (*) 
                               m - memory watches (*) 
                               i - infoROM watches 
                               t - thermal and power watches (*) 
                               n - NVLink watches (*) 
                               (*) watch requires 60 sec before first query

  -c  --check                 Check to see if any errors or warnings have
                               occurred in the currently monitored watches.
  -h  --help                  Displays usage information and exits.
  -j  --json                  Print the output in a json format
---

# -g (group) 0 (a Group ID shown by dcgmi group -l), -s (set watches) a (all watches)
dcgmi health -g 0 -s a

# -g (group) 0 (a Group ID shown by dcgmi group -l), -c (check), -j (print as JSON)
dcgmi health -g 0 -c -j 

Enable the watches on the Group ID, then run the check.
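
Putting the two together (an added sketch; per the help output above, watches marked (*) need about 60 seconds before the first query):

# Enable all watches, wait out the watch interval, then check
dcgmi health -g 0 -s a
sleep 60
dcgmi health -g 0 -c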

NVLink

dcgmi nvlink -h
---
Usage: dcgmi nvlink
   dcgmi nvlink --host <IP/FQDN> -g <gpuId> -e -j 
   dcgmi nvlink --host <IP/FQDN> -s 

  -e  --errors                Print NvLink errors for a given gpuId (-g).
  -s  --link-status           Print NvLink link status for all GPUs and NvSwitches in the system.
---

# -g (GPU ID, not a Group ID here) 0, -e (print NVLink errors for GPU 0)
dcgmi nvlink -g 0 -e

# -s (print NVLink link status for all GPUs and NvSwitches; -g is not needed)
dcgmi nvlink -s
 

That gets the basics running.
There are many other commands, but here I have covered the ones I am likely to use for maintenance; one more example is sketched below.
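
One more command that can be handy for maintenance (an added, hedged example: dcgmi dmon streams field values; I am assuming field IDs 150 and 155 are GPU temperature and power draw, so verify with dcgmi dmon -h):

# Stream GPU temperature (field 150) and power usage (field 155), 5 samples
dcgmi dmon -e 150,155 -c 5
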
I hope this helps.

See you next time.
