Building cluster-smi, the multi-machine version of nvidia-smi

Posted at 2018-11-23

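When you run computations on GPUs spread across several machines, running nvidia-smi on each machine in turn is tedious. Being able to see everything at once is convenient, and that is what cluster-smi does.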

This is apparently what it looks like when running.

[Screenshot: image.png]

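This was my first time with Go, the library setup is complicated, and the build was not straightforward, so here are my notes. The environment is as follows.

  • Ubuntu 18.04 LTS
  • CUDA 10.0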

Preparation

Install Go. The zmq dependency comes from apt. The GOPATH concept did not make much sense to me at first...

sudo apt install -y golang-go libzmq3-dev
mkdir -p ~/go/src/github.com/patwie
cd ~/go/src/github.com/patwie
git clone https://github.com/tttamaki/cluster-smi.git
cd cluster-smi

Check

Check where the header and the shared library for the NVIDIA library are located.

$ locate nvml.h
/usr/local/cuda-10.0/targets/x86_64-linux/include/nvml.h
$ ldconfig -p | grep nvidia-ml
        libnvidia-ml.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
        libnvidia-ml.so (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvidia-ml.so
$ 
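
If locate is not available, the same check can be sketched in Go. This is only a rough sketch: the glob patterns below are my assumptions about typical install locations (only the two paths shown in the output above were actually observed), and findFiles is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// findFiles returns every existing path that matches any of the glob patterns.
func findFiles(patterns []string) []string {
	var found []string
	for _, p := range patterns {
		matches, err := filepath.Glob(p)
		if err != nil {
			continue // malformed pattern; skip it
		}
		found = append(found, matches...)
	}
	return found
}

func main() {
	// Candidate locations are assumptions; adjust them to your system.
	headers := findFiles([]string{
		"/usr/local/cuda*/targets/*/include/nvml.h",
		"/usr/local/cuda*/include/nvml.h",
		"/usr/include/nvml.h",
	})
	libs := findFiles([]string{
		"/usr/lib/x86_64-linux-gnu/libnvidia-ml.so*",
		"/usr/lib64/libnvidia-ml.so*",
	})
	fmt.Println("nvml.h candidates:", headers)
	fmt.Println("libnvidia-ml candidates:", libs)
}
```

The directory of whichever nvml.h it prints is what goes after -I in CFLAGS in the next step.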

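Modifications

The files under nvml/ were for CUDA 8, so I modified them.

  • Remove the CUDA 8 nvml.h (renamed out of the way below)
  • Point CFLAGS at the cuda-10.0 include directory found above, using -I
  • Add -no-pie to LDFLAGS (with this the error goes away)

Adjust the paths in the diff below to suit your system.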

cd ~/go/src/github.com/patwie/cluster-smi/nvml
mv nvml.h nvml-cuda80.h
patch -p0 <<EOF
--- nvml.go.org 2018-11-23 15:49:00.330334478 +0900
+++ nvml.go     2018-11-23 15:49:28.834330994 +0900
@@ -1,8 +1,8 @@
 package nvml

 /*
-#cgo CFLAGS: -I/usr/local/cuda/include
-#cgo LDFLAGS: -lnvidia-ml -L/usr/local/cuda-8.0/targets/x86_64-linux/lib/stubs/
+#cgo CFLAGS: -I/usr/local/cuda-10.0/targets/x86_64-linux/include
+#cgo LDFLAGS: -lnvidia-ml -no-pie

 // #cgo CFLAGS: -I/graphics/opt/opt_Ubuntu16.04/cuda/toolkit_8.0/cuda/include
 // #cgo LDFLAGS: -lnvidia-ml -L/graphics/opt/opt_Ubuntu16.04/cuda/toolkit_8.0/cuda/lib64/stubs
EOF

Build

Because a flag was added to LDFLAGS above, it has to be allowed via CGO_LDFLAGS_ALLOW.

cd ~/go/src/github.com/patwie/cluster-smi
export CGO_LDFLAGS_ALLOW=-no-pie
cp config.example.go config.go
cd proc
go install
cd ..
make all
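
As far as I understand it, CGO_LDFLAGS_ALLOW is a regular expression, and a flag from a #cgo LDFLAGS line is accepted only when the expression matches the whole flag. The following is a small sketch of that matching rule, not the go tool's actual code; flagAllowed is a hypothetical name, and the full-match behavior is my assumption about how the allow-list is applied.

```go
package main

import (
	"fmt"
	"regexp"
)

// flagAllowed reports whether pattern matches the entire flag.
// Assumption: the allow-list requires a full match, not a substring match.
func flagAllowed(pattern, flag string) bool {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return false // an invalid pattern allows nothing
	}
	return re.FindString(flag) == flag
}

func main() {
	// Same value as the export above.
	pattern := "-no-pie"
	for _, flag := range []string{"-no-pie", "-no-pie-extra", "-static"} {
		fmt.Printf("%-14s allowed=%v\n", flag, flagAllowed(pattern, flag))
	}
}
```

If more flags are needed later, the pattern can be an alternation such as -(no-pie|static) instead of exporting the variable several times.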

Running

cluster-smi-local

This one is for the local machine. It does the same job as nvidia-smi.

$ ~/go/src/github.com/patwie/cluster-smi/cluster-smi-local

Fri Nov 23 15:58:48 2018 (http://github.com/patwie/cluster-smi)
+----------------------+-----------------------+---------------------------+----------+
| Node                 | Gpu                   | Memory-Usage              | GPU-Util |
+----------------------+-----------------------+---------------------------+----------+
| your-system-name-her | 0:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB (  0 %) |   0 %    |
|                      | 1:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB (  0 %) |   0 %    |
|                      | 2:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB (  0 %) |   0 %    |
|                      | 3:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB (  0 %) |   0 %    |
|                      | 4:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB (  0 %) |   0 %    |
|                      | 5:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB (  0 %) |   0 %    |
|                      | 6:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB (  0 %) |   0 %    |
|                      | 7:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB (  0 %) |   0 %    |
+----------------------+-----------------------+---------------------------+----------+

I wanted a temperature display too, but there was none... so I added fan speed, temperature, and power consumption!
- Addendum: Fan, Temp, and Power added! https://github.com/tttamaki/cluster-smi

Sat Nov 24 15:14:24 2018 (http://github.com/patwie/cluster-smi)
+----------------------+-----------------------+---------------------------+----------+-------+-------+-------+
| Node                 | Gpu                   | Memory-Usage              | GPU-Util | Fan   | Temp  | Power |
+----------------------+-----------------------+---------------------------+----------+-------+-------+-------+
| your-system-name-her | 0:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB (  0 %) |   0 %    |  23 % |  23 C |   7W  |
|                      | 1:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB (  0 %) |   0 %    |  23 % |  25 C |   8W  |
|                      | 2:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB (  0 %) |   0 %    |  23 % |  23 C |   7W  |
|                      | 3:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB (  0 %) |   0 %    |  23 % |  25 C |   8W  |
|                      | 4:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB (  0 %) |   0 %    |  23 % |  24 C |   8W  |
|                      | 5:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB (  0 %) |   0 %    |  27 % |  25 C |   7W  |
|                      | 6:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB (  0 %) |   0 %    |  23 % |  23 C |   9W  |
|                      | 7:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB (  0 %) |   0 %    |  23 % |  24 C |   8W  |
+----------------------+-----------------------+---------------------------+----------+-------+-------+-------+

cluster-smi

Specify the IP of the machine that will act as the router in the yml file

Change the IP to your own.

cd ~/go/src/github.com/patwie/cluster-smi
cp cluster-smi.example.yml cluster-smi.yml
patch -p0 <<EOF
--- cluster-smi.example.yml     2018-11-23 16:17:35.611878651 +0900
+++ cluster-smi.yml     2018-11-23 16:18:30.204312021 +0900
@@ -1,6 +1,6 @@
 tick: 3               # tick for receiving data in seconds
 timeout: 180             # duration considered as timeout (machine is considered as offline after x sec)
-router_ip: 127.0.0.1     # ip of cluster-smi-server
+router_ip: 10.0.0.10     # ip of cluster-smi-server
 ports:
   nodes: 9090            # port of cluster-smi-server, which nodes send to
   clients: 9091          # port of cluster-smi-server, where clients subscribe to
EOF

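After this, all that is left is to run the node, the router, and cluster-smi. Don't forget to export the environment variable.

Run the node on each machine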

$ export CLUSTER_SMI_CONFIG_PATH=/home/tamaki/go/src/github.com/patwie/cluster-smi/cluster-smi.yml
$ ~/go/src/github.com/patwie/cluster-smi/cluster-smi-node

2018/11/23 16:34:09 Read configuration from /home/tamaki/go/src/github.com/patwie/cluster-smi/cluster-smi.yml
2018/11/23 16:34:09   Tick: 3
2018/11/23 16:34:09   Timeout: 180
2018/11/23 16:34:09   RouterIp: 10.0.0.10
2018/11/23 16:34:09   Ports:
2018/11/23 16:34:09     Nodes: 9090
2018/11/23 16:34:09     Clients: 9091
2018/11/23 16:34:09 Now pushing to tcp://10.30.81.234:9090
2018/11/23 16:34:09 Cluster-SMI-Node is active. Press CTRL+C to shut down.

Run the router on the machine acting as the router

$ export CLUSTER_SMI_CONFIG_PATH=/home/tamaki/go/src/github.com/patwie/cluster-smi/cluster-smi.yml
$ ~/go/src/github.com/patwie/cluster-smi/cluster-smi-router

2018/11/23 16:31:49 Read configuration from /home/tamaki/go/src/github.com/patwie/cluster-smi/cluster-smi.yml
2018/11/23 16:31:49   Tick: 3
2018/11/23 16:31:49   Timeout: 180
2018/11/23 16:31:49   RouterIp: 10.0.0.10
2018/11/23 16:31:49   Ports:
2018/11/23 16:31:49     Nodes: 9090
2018/11/23 16:31:49     Clients: 9091
2018/11/23 16:31:49 Cluster-SMI-Router is active. Press CTRL+C to shut down.
2018/11/23 16:31:49 Waiting for clients connecting to tcp://*:9091
2018/11/23 16:31:49 waiting for nodes connecting to  tcp://*:9090
2018/11/23 16:31:49 A new node "226" connected
2018/11/23 16:31:49 A new node "224" connected
2018/11/23 16:31:49 A new node "220" connected
2018/11/23 16:31:49 A new node "228" connected
2018/11/23 16:31:49 A new node "242" connected
2018/11/23 16:34:12 A new node "230" connected
2018/11/23 16:36:05 A new node "232" connected
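
If a node never shows up in the router log, it may help to check from that machine whether the router's two ports accept TCP connections at all. A minimal sketch, assuming the router_ip and ports from the cluster-smi.yml above; endpoint and reachable are hypothetical helper names, and this only tests plain TCP connectivity, not the cluster-smi protocol.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
	"time"
)

// endpoint joins an IP and a port into a host:port string.
func endpoint(ip string, port int) string {
	return net.JoinHostPort(ip, strconv.Itoa(port))
}

// reachable reports whether a TCP connection to addr succeeds within timeoutMs milliseconds.
func reachable(addr string, timeoutMs int) bool {
	conn, err := net.DialTimeout("tcp", addr, time.Duration(timeoutMs)*time.Millisecond)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

func main() {
	routerIP := "10.0.0.10" // example value from cluster-smi.yml; replace with your own
	ports := []struct {
		name string
		port int
	}{{"nodes", 9090}, {"clients", 9091}}
	for _, p := range ports {
		addr := endpoint(routerIP, p.port)
		fmt.Printf("%-7s %s reachable=%v\n", p.name, addr, reachable(addr, 1000))
	}
}
```

Run it while cluster-smi-router is up; if a port shows reachable=false, look at firewalls or the router_ip setting before suspecting cluster-smi itself.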

Run cluster-smi anywhere

$ export CLUSTER_SMI_CONFIG_PATH=/home/tamaki/go/src/github.com/patwie/cluster-smi/cluster-smi.yml
$ ~/go/src/github.com/patwie/cluster-smi/cluster-smi -t

Fri Nov 23 16:36:25 2018 (http://github.com/patwie/cluster-smi)
+----------------------+-----------------------+------------------------------+----------+--------------------------+
| Node                 | Gpu                   | Memory-Usage                 | GPU-Util | Last Seen                |
+----------------------+-----------------------+------------------------------+----------+--------------------------+
| 230                  | 0:GeForce GTX 1080 Ti |  231 MiB / 11175 MiB (  2 %) |   3 %    | Fri Nov 23 16:36:24 2018 |
|                      | 1:GeForce GTX 1080 Ti |    2 MiB / 11178 MiB (  0 %) |   0 %    |                          |
|                      | 2:GeForce GTX 1080 Ti |    2 MiB / 11178 MiB (  0 %) |   0 %    |                          |
|                      | 3:GeForce GTX 1080 Ti |    2 MiB / 11178 MiB (  0 %) |   0 %    |                          |
+----------------------+-----------------------+------------------------------+----------+--------------------------+
| 234                  | 0:GeForce GTX 1080 Ti |    0 MiB / 11178 MiB (  0 %) |   0 %    | Fri Nov 23 16:36:24 2018 |
|                      | 1:GeForce GTX 1080 Ti |    0 MiB / 11178 MiB (  0 %) |   0 %    |                          |
|                      | 2:GeForce GTX 1080 Ti |    0 MiB / 11178 MiB (  0 %) |   0 %    |                          |
|                      | 3:GeForce GTX 1080 Ti |    0 MiB / 11178 MiB (  0 %) |   0 %    |                          |
|                      | 4:GeForce GTX 1080 Ti |    0 MiB / 11178 MiB (  0 %) |   0 %    |                          |
|                      | 5:GeForce GTX 1080 Ti |    0 MiB / 11178 MiB (  0 %) |   0 %    |                          |
|                      | 6:GeForce GTX 1080 Ti |    0 MiB / 11178 MiB (  0 %) |   0 %    |                          |
|                      | 7:GeForce GTX 1080 Ti |    0 MiB / 11178 MiB (  0 %) |   0 %    |                          |
+----------------------+-----------------------+------------------------------+----------+--------------------------+
| 226                  | 0:GeForce GTX 1080 Ti | 6193 MiB / 11178 MiB ( 55 %) |   0 %    | Fri Nov 23 16:36:23 2018 |
|                      | 1:GeForce GTX 1080 Ti | 6207 MiB / 11178 MiB ( 55 %) |   0 %    |                          |
+----------------------+-----------------------+------------------------------+----------+--------------------------+
| 228                  | 0:GeForce GTX 1080 Ti |  322 MiB / 11178 MiB (  2 %) |   0 %    | Fri Nov 23 16:36:25 2018 |
|                      | 1:GeForce GTX 1080 Ti |    1 MiB / 11178 MiB (  0 %) |   0 %    |                          |
+----------------------+-----------------------+------------------------------+----------+--------------------------+
| 232                  | 0:GeForce GTX 1080    |  3677 MiB / 8119 MiB ( 45 %) |  65 %    | Fri Nov 23 16:36:23 2018 |
|                      | 1:GeForce GTX 1080    |  3319 MiB / 8118 MiB ( 40 %) |  12 %    |                          |
+----------------------+-----------------------+------------------------------+----------+--------------------------+
| 220                  | 0:GeForce GTX 1080 Ti |  233 MiB / 11177 MiB (  2 %) |   0 %    | Fri Nov 23 16:36:24 2018 |
|                      | 1:GeForce GTX 1080 Ti |    2 MiB / 11178 MiB (  0 %) |   0 %    |                          |
+----------------------+-----------------------+------------------------------+----------+--------------------------+
| 224                  | 0:GeForce GTX 1080 Ti |  199 MiB / 11177 MiB (  1 %) |   0 %    | Fri Nov 23 16:36:23 2018 |
|                      | 1:GeForce GTX 1080 Ti | 3843 MiB / 11178 MiB ( 34 %) |  49 %    |                          |
+----------------------+-----------------------+------------------------------+----------+--------------------------+

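For detailed output, use cluster-smi -t -p.

I wanted a temperature display here too, but there was none... so I added fan speed, temperature, and power consumption!

Problems

After a few minutes, cluster-smi-router crashes with a panic...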

panic: interrupted system call

goroutine 1 [running]:
main.main()
        /home/tamaki/go/src/github.com/patwie/cluster-smi/cluster-smi-router.go:83 +0x61d

It looks like receiving (or sending) a message is failing?

80:     // read request of client
81:     msg, err := messaging.ReceiveMultipartMessage(router_socket)
82:     if err != nil {
83:         panic(err)
84:     }

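Addendum: after adding more machines, it stopped crashing. Is it simply crashing when no message arrives? If so, maybe it is fine once there are enough machines sending plenty of messages.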
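
"interrupted system call" is the message for EINTR, so one possible workaround is to retry the receive instead of panicking. The sketch below is not a patch for cluster-smi: receiveWithRetry and errInterrupted are hypothetical names, the receive call is replaced by a fake function, and I am assuming the error can be compared with syscall.EINTR. The real error type returned by the messaging code may need a different check.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// errInterrupted is the error treated as retryable (assumption: EINTR).
var errInterrupted error = syscall.EINTR

// receiveWithRetry calls recv until it succeeds, fails with an error other
// than errInterrupted, or has been interrupted more than maxRetries times.
func receiveWithRetry(recv func() ([]string, error), maxRetries int) ([]string, error) {
	for attempt := 0; ; attempt++ {
		msg, err := recv()
		if err == nil {
			return msg, nil
		}
		if !errors.Is(err, errInterrupted) || attempt >= maxRetries {
			return nil, err
		}
	}
}

func main() {
	calls := 0
	// fakeRecv stands in for the real receive call:
	// it is interrupted twice and then delivers a message.
	fakeRecv := func() ([]string, error) {
		calls++
		if calls <= 2 {
			return nil, errInterrupted
		}
		return []string{"hello"}, nil
	}
	msg, err := receiveWithRetry(fakeRecv, 5)
	fmt.Println(msg, err, "calls:", calls)
}
```

In the router, the idea would be to wrap the call on line 81 this way and panic only when the returned error is something other than an interruption.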
