
[NVIDIA] How to use nccl-tests

Last updated at 2022-12-28, posted at 2022-12-02

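Introduction

The instructions on GitHub made no sense to me...
Maybe they are obvious to people who already know the tool?
This is a memo for people like me, and for everyone else.

So there is no new information here.
Please see GitHub for the details.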

References

GitHub nccl-tests (https://github.com/NVIDIA/nccl-tests)

Environment

OS: CentOS 7.x
GPU: V100
CUDA: 11.2
NCCL: 2.8.4

Steps

(1) Download nccl-tests
(2) Build nccl-tests
(3) Verify operation
(4) Advanced usage

(1) Download nccl-tests

Download nccl-tests from GitHub.

Click the green "Code" button, then "Download ZIP".

(2) Build nccl-tests

/tmp

# Save the ZIP somewhere such as /tmp and extract it
unzip nccl-tests-master.zip

nccl-tests-master/

# Move into the extracted directory
cd nccl-tests-master

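There appear to be two ways to run the tests: on a single node, or across multiple nodes with MPI.
This article proceeds with the single-node case.

CUDA can be installed in several ways (rpm, local installer, and so on),
so substitute the directory where it is installed on your system.
I installed CUDA from the local (.run) file, so it lives in the directory shown below.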

nccl-tests-master/

# CUDA_HOME and NCCL_HOME must each point to the top-level directory of the install
# /opt/cuda/cuda11.2/  <- this directory
# ├── bin
# ~~~
# └── tools
make CUDA_HOME=/opt/cuda/cuda11.2 NCCL_HOME=/opt/nccl/nccl2.8.4

---
# Done
~~~
../verifiable/verifiable.cu(1117): warning: variable "floating" was declared but never referenced
~~~
make[1]: Leaving directory `/tmp/nccl-tests-master/src'
---
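
If the build stops with a missing compiler, header, or library, the usual cause is that CUDA_HOME or NCCL_HOME does not point at the top-level install directory. The following is a minimal sketch of a pre-check, not part of nccl-tests. The helper name check_build_homes is hypothetical, and the files it looks for (bin/nvcc, include/nccl.h, lib/libnccl.so) are assumptions about a typical local install, so adjust them to your layout.

```shell
# Hypothetical helper: report which files the build is assumed to need are absent.
# The three paths checked are assumptions about a typical install layout.
check_build_homes() {
    cuda_home="$1"   # e.g. /opt/cuda/cuda11.2
    nccl_home="$2"   # e.g. /opt/nccl/nccl2.8.4
    status=0
    for path in "$cuda_home/bin/nvcc" \
                "$nccl_home/include/nccl.h" \
                "$nccl_home/lib/libnccl.so"; do
        if [ ! -e "$path" ]; then
            echo "missing: $path" >&2
            status=1
        fi
    done
    return "$status"
}

# Example (run from nccl-tests-master/):
# check_build_homes /opt/cuda/cuda11.2 /opt/nccl/nccl2.8.4 \
#     && make CUDA_HOME=/opt/cuda/cuda11.2 NCCL_HOME=/opt/nccl/nccl2.8.4
```

With this guard, make only starts when both directories look complete, so a wrong path is reported in one line instead of partway through the compile.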

nccl-tests-master/build/

# A build directory is created inside nccl-tests-master
build/
├── all_gather_perf
├── all_reduce_perf
├── alltoall_perf
├── broadcast_perf
├── gather_perf
├── hypercube_perf
├── reduce_perf
├── reduce_scatter_perf
├── scatter_perf
├── sendrecv_perf
├── timer.o
└── verifiable
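
To confirm that the build produced every test binary in the listing above, a small check like the following may help. This is only a sketch: the function name list_missing_perf_binaries is hypothetical, and the list of names is copied from this article's build, so other nccl-tests versions may produce a different set.

```shell
# Print the *_perf binaries from the article's listing that are absent
# (or not executable) in the given build directory.
list_missing_perf_binaries() {
    build_dir="$1"
    for name in all_gather all_reduce alltoall broadcast gather \
                hypercube reduce reduce_scatter scatter sendrecv; do
        if [ ! -x "$build_dir/${name}_perf" ]; then
            echo "${name}_perf"
        fi
    done
}

# Example (run from nccl-tests-master/):
# list_missing_perf_binaries ./build
```

No output means all ten binaries are present.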

(3) Verify operation

nccl-tests-master/

./build/all_reduce_perf

---
#  Rank  0 Group  0 Pid  16856 on　・・・
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
    33554432       8388608     float     sum      -1    89.63  374.39    0.00      0     0.42  79109.82    0.00      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0 
#
---
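
The busbw column reads 0.00 here because only one GPU takes part. As far as I understand the nccl-tests performance notes, the bus bandwidth for all_reduce is derived from algbw as busbw = algbw × 2 × (n − 1) / n, where n is the number of ranks, so a single rank always gives zero. The sketch below just reproduces that arithmetic. The function name allreduce_busbw is hypothetical, and the formula applies to all_reduce only.

```shell
# Bus bandwidth for all_reduce: busbw = algbw * 2 * (n - 1) / n
# (n = number of ranks). Other collectives use different factors.
allreduce_busbw() {
    algbw="$1"    # algorithm bandwidth in GB/s, as printed in the algbw column
    nranks="$2"   # number of ranks (GPUs) taking part
    awk -v a="$algbw" -v n="$nranks" \
        'BEGIN { printf "%.3f\n", a * 2 * (n - 1) / n }'
}

# Example:
# allreduce_busbw 374.39 1   # single GPU, factor is 0
# allreduce_busbw 50 4       # four GPUs, factor is 1.5
```

With four GPUs the factor is 1.5, which is consistent with the 4-GPU run in section (4): an algbw of 53.11 GB/s corresponds to a busbw of roughly 79.66 GB/s.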

(4) Advanced usage

nccl-tests-master/

# Scan message sizes from 8 bytes (-b 8) to 128 MB (-e 128M), multiplying by 2 at each step (-f 2), on 4 GPUs (-g 4)
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
---
#  Rank  0 Group  0 Pid  17066 on　・・・
#  Rank  1 Group  0 Pid  17066 on　・・・
#  Rank  2 Group  0 Pid  17066 on　・・・
#  Rank  3 Group  0 Pid  17066 on　・・・
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1    12.51    0.00    0.00      0    11.78    0.00    0.00      0
          16             4     float     sum      -1    13.12    0.00    0.00      0    11.78    0.00    0.00      0
~~~ 
   33554432       8388608     float     sum      -1    631.8   53.11   79.66      0    634.4   52.89   79.33      0
    67108864      16777216     float     sum      -1   1221.3   54.95   82.42      0   1220.9   54.97   82.45      0
   134217728      33554432     float     sum      -1   2456.8   54.63   81.95      0   2421.5   55.43   83.14      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 21.308 
#
---
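
When comparing several runs, it can be handy to pull only the summary figure out of the output. The following is a minimal sketch that assumes the summary line has the same "# Avg bus bandwidth    : <value>" form as in the output above. The function name extract_avg_busbw is hypothetical.

```shell
# Read nccl-tests output on stdin and print the value from the
# "# Avg bus bandwidth" summary line, with surrounding spaces removed.
extract_avg_busbw() {
    awk -F':' '/Avg bus bandwidth/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Example (run from nccl-tests-master/):
# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4 | extract_avg_busbw
```

Note that this value appears to be averaged over every message size in the scan, including the tiny ones where busbw is close to zero, which would explain why 21.308 is far below the roughly 80 GB/s seen at the largest sizes.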

That got the tests running, at least.
I hope this is useful as a reference.

See you next time.
