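When you run computations on GPUs spread across several machines, running nvidia-smi on each machine is tedious. Being able to see them all at once is convenient, and that is what cluster-smi does.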
- https://github.com/PatWie/cluster-smi
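- Update: I added Fan, Temp, and Power columns in my fork: https://github.com/tttamaki/cluster-smi
This is apparently what it looks like when running.
This was my first time using Go, the library setup is complicated, and the build was not straightforward, so I am leaving these notes for future reference. My environment:
- Ubuntu 18.04 LTS
- CUDA 10.0
Preparation
Install Go. The zmq dependency is installed via apt. The GOPATH concept confused me at first... Note that the fork is cloned under github.com/patwie below, presumably so that the original import paths still resolve.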
sudo apt install -y golang-go libzmq3-dev
mkdir -p ~/go/src/github.com/patwie
cd ~/go/src/github.com/patwie
git clone https://github.com/tttamaki/cluster-smi.git
cd cluster-smi
Checking paths
Check where the header and the shared library for the NVIDIA management library (NVML) are located.
$ locate nvml.h
/usr/local/cuda-10.0/targets/x86_64-linux/include/nvml.h
$ ldconfig -p | grep nvidia-ml
libnvidia-ml.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
libnvidia-ml.so (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvidia-ml.so
$
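Modifications
The files under nvml/ were written for cuda-8, so they need changes:
- Remove the bundled cuda-8 nvml.h (the commands below rename it out of the way)
- In CFLAGS, point -I at the cuda-10.0 include directory found above
- In LDFLAGS, drop the cuda-8.0 stubs path and add -no-pie (with this flag the build error goes away)
Adjust the paths in the following diff to match your system.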
cd ~/go/src/github.com/patwie/cluster-smi/nvml
mv nvml.h nvml-cuda80.h
patch -p0 <<EOF
--- nvml.go.org 2018-11-23 15:49:00.330334478 +0900
+++ nvml.go 2018-11-23 15:49:28.834330994 +0900
@@ -1,8 +1,8 @@
 package nvml
 /*
-#cgo CFLAGS: -I/usr/local/cuda/include
-#cgo LDFLAGS: -lnvidia-ml -L/usr/local/cuda-8.0/targets/x86_64-linux/lib/stubs/
+#cgo CFLAGS: -I/usr/local/cuda-10.0/targets/x86_64-linux/include
+#cgo LDFLAGS: -lnvidia-ml -no-pie
 // #cgo CFLAGS: -I/graphics/opt/opt_Ubuntu16.04/cuda/toolkit_8.0/cuda/include
 // #cgo LDFLAGS: -lnvidia-ml -L/graphics/opt/opt_Ubuntu16.04/cuda/toolkit_8.0/cuda/lib64/stubs
EOF
Build
Because -no-pie was added to LDFLAGS above, it has to be allowed explicitly via CGO_LDFLAGS_ALLOW.
cd ~/go/src/github.com/patwie/cluster-smi
export CGO_LDFLAGS_ALLOW=-no-pie
cp config.example.go config.go
cd proc
go install
cd ..
make all
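The role of CGO_LDFLAGS_ALLOW can be illustrated with a small Go sketch. The cgo documentation describes the variable as a regular expression matching additional flags to allow. The helper below, `flagAllowed`, is a hypothetical name, and the whole-flag matching rule it uses is an assumption about how such an allow-list behaves, not the go tool's actual code, which is more involved.

```go
package main

import (
	"fmt"
	"regexp"
)

// flagAllowed reports whether the allow-list pattern matches the ENTIRE flag.
// This is a simplified stand-in for an allow-list check, not the go tool's code.
func flagAllowed(flag, allowPattern string) bool {
	re, err := regexp.Compile(allowPattern)
	if err != nil {
		// An invalid pattern allows nothing in this sketch.
		return false
	}
	// Require that the leftmost match covers the whole flag.
	return re.FindString(flag) == flag
}

func main() {
	pattern := "-no-pie" // the value exported as CGO_LDFLAGS_ALLOW above

	for _, flag := range []string{"-no-pie", "-no-pie-extra", "-static"} {
		fmt.Printf("%-14s allowed=%v\n", flag, flagAllowed(flag, pattern))
	}
}
```

Under this assumption, only the exact flag `-no-pie` passes, so the exported value does not open the door to arbitrary linker flags.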
Running
cluster-smi-local
For the local machine only. It does the same job as nvidia-smi.
$ ~/go/src/github.com/patwie/cluster-smi/cluster-smi-local
Fri Nov 23 15:58:48 2018 (http://github.com/patwie/cluster-smi)
+----------------------+-----------------------+---------------------------+----------+
| Node | Gpu | Memory-Usage | GPU-Util |
+----------------------+-----------------------+---------------------------+----------+
| your-system-name-her | 0:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB ( 0 %) | 0 % |
| | 1:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB ( 0 %) | 0 % |
| | 2:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB ( 0 %) | 0 % |
| | 3:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB ( 0 %) | 0 % |
| | 4:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB ( 0 %) | 0 % |
| | 5:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB ( 0 %) | 0 % |
| | 6:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB ( 0 %) | 0 % |
| | 7:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB ( 0 %) | 0 % |
+----------------------+-----------------------+---------------------------+----------+
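The original had no temperature display, which I wanted, so I added fan speed, temperature, and power consumption.
- Update: with the Fan, Temp, and Power columns from https://github.com/tttamaki/cluster-smi, the output looks like this: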
Sat Nov 24 15:14:24 2018 (http://github.com/patwie/cluster-smi)
+----------------------+-----------------------+---------------------------+----------+-------+-------+-------+
| Node | Gpu | Memory-Usage | GPU-Util | Fan | Temp | Power |
+----------------------+-----------------------+---------------------------+----------+-------+-------+-------+
| your-system-name-her | 0:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB ( 0 %) | 0 % | 23 % | 23 C | 7W |
| | 1:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB ( 0 %) | 0 % | 23 % | 25 C | 8W |
| | 2:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB ( 0 %) | 0 % | 23 % | 23 C | 7W |
| | 3:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB ( 0 %) | 0 % | 23 % | 25 C | 8W |
| | 4:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB ( 0 %) | 0 % | 23 % | 24 C | 8W |
| | 5:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB ( 0 %) | 0 % | 27 % | 25 C | 7W |
| | 6:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB ( 0 %) | 0 % | 23 % | 23 C | 9W |
| | 7:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB ( 0 %) | 0 % | 23 % | 24 C | 8W |
+----------------------+-----------------------+---------------------------+----------+-------+-------+-------+
cluster-smi
Specify the IP of the machine that will act as the router in the yml file.
Change the IP to your own.
cd ~/go/src/github.com/patwie/cluster-smi
cp cluster-smi.example.yml cluster-smi.yml
patch -p0 <<EOF
--- cluster-smi.example.yml 2018-11-23 16:17:35.611878651 +0900
+++ cluster-smi.yml 2018-11-23 16:18:30.204312021 +0900
@@ -1,6 +1,6 @@
 tick: 3 # tick for receiving data in seconds
 timeout: 180 # duration considered as timeout (machine is considered as offline after x sec)
-router_ip: 127.0.0.1 # ip of cluster-smi-server
+router_ip: 10.0.0.10 # ip of cluster-smi-server
 ports:
   nodes: 9090 # port of cluster-smi-server, which nodes send to
   clients: 9091 # port of cluster-smi-server, where clients subscribe to
EOF
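To make the meaning of these settings concrete, here is a minimal Go sketch of how the three programs could derive their ZeroMQ endpoints from the config. The `Config` and `Ports` types and the method names are hypothetical and written for illustration. Only the `tcp://<router_ip>:<port>` and `tcp://*:<port>` address shapes are taken from the log output of the node and the router; the client subscribe address is an assumption.

```go
package main

import "fmt"

// Ports mirrors the "ports" section of cluster-smi.yml (hypothetical type).
type Ports struct {
	Nodes   int // port the router listens on for node data
	Clients int // port the router listens on for client requests
}

// Config mirrors the yml fields shown in the diff (hypothetical type).
type Config struct {
	Tick     int    // seconds between updates
	Timeout  int    // seconds until a node counts as offline
	RouterIP string // IP of the machine running cluster-smi-router
	Ports    Ports
}

// NodePushEndpoint is where each cluster-smi-node sends its GPU data.
func (c Config) NodePushEndpoint() string {
	return fmt.Sprintf("tcp://%s:%d", c.RouterIP, c.Ports.Nodes)
}

// ClientEndpoint is where cluster-smi would connect (assumed shape).
func (c Config) ClientEndpoint() string {
	return fmt.Sprintf("tcp://%s:%d", c.RouterIP, c.Ports.Clients)
}

// RouterBindEndpoints are the wildcard addresses the router listens on.
func (c Config) RouterBindEndpoints() (nodes, clients string) {
	return fmt.Sprintf("tcp://*:%d", c.Ports.Nodes),
		fmt.Sprintf("tcp://*:%d", c.Ports.Clients)
}

func main() {
	cfg := Config{Tick: 3, Timeout: 180, RouterIP: "10.0.0.10",
		Ports: Ports{Nodes: 9090, Clients: 9091}}

	nodesBind, clientsBind := cfg.RouterBindEndpoints()
	fmt.Println("node pushes to:   ", cfg.NodePushEndpoint())
	fmt.Println("client connects:  ", cfg.ClientEndpoint())
	fmt.Println("router binds to:  ", nodesBind, "and", clientsBind)
}
```

The point of the sketch is that every machine shares the same yml: nodes and clients only need to know router_ip, while the router itself binds to all interfaces.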
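After this, all that remains is to run the node, the router, and cluster-smi itself. Don't forget to export the environment variable.
Run the node on each machine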
$ export CLUSTER_SMI_CONFIG_PATH=/home/tamaki/go/src/github.com/patwie/cluster-smi/cluster-smi.yml
$ ~/go/src/github.com/patwie/cluster-smi/cluster-smi-node
2018/11/23 16:34:09 Read configuration from /home/tamaki/go/src/github.com/patwie/cluster-smi/cluster-smi.yml
2018/11/23 16:34:09 Tick: 3
2018/11/23 16:34:09 Timeout: 180
2018/11/23 16:34:09 RouterIp: 10.0.0.10
2018/11/23 16:34:09 Ports:
2018/11/23 16:34:09 Nodes: 9090
2018/11/23 16:34:09 Clients: 9091
2018/11/23 16:34:09 Now pushing to tcp://10.30.81.234:9090
2018/11/23 16:34:09 Cluster-SMI-Node is active. Press CTRL+C to shut down.
Run the router on the machine chosen as the router
$ export CLUSTER_SMI_CONFIG_PATH=/home/tamaki/go/src/github.com/patwie/cluster-smi/cluster-smi.yml
$ ~/go/src/github.com/patwie/cluster-smi/cluster-smi-router
2018/11/23 16:31:49 Read configuration from /home/tamaki/go/src/github.com/patwie/cluster-smi/cluster-smi.yml
2018/11/23 16:31:49 Tick: 3
2018/11/23 16:31:49 Timeout: 180
2018/11/23 16:31:49 RouterIp: 10.0.0.10
2018/11/23 16:31:49 Ports:
2018/11/23 16:31:49 Nodes: 9090
2018/11/23 16:31:49 Clients: 9091
2018/11/23 16:31:49 Cluster-SMI-Router is active. Press CTRL+C to shut down.
2018/11/23 16:31:49 Waiting for clients connecting to tcp://*:9091
2018/11/23 16:31:49 waiting for nodes connecting to tcp://*:9090
2018/11/23 16:31:49 A new node "226" connected
2018/11/23 16:31:49 A new node "224" connected
2018/11/23 16:31:49 A new node "220" connected
2018/11/23 16:31:49 A new node "228" connected
2018/11/23 16:31:49 A new node "242" connected
2018/11/23 16:34:12 A new node "230" connected
2018/11/23 16:36:05 A new node "232" connected
Run cluster-smi from anywhere
$ export CLUSTER_SMI_CONFIG_PATH=/home/tamaki/go/src/github.com/patwie/cluster-smi/cluster-smi.yml
$ ~/go/src/github.com/patwie/cluster-smi/cluster-smi -t
Fri Nov 23 16:36:25 2018 (http://github.com/patwie/cluster-smi)
+----------------------+-----------------------+------------------------------+----------+--------------------------+
| Node | Gpu | Memory-Usage | GPU-Util | Last Seen |
+----------------------+-----------------------+------------------------------+----------+--------------------------+
| 230 | 0:GeForce GTX 1080 Ti | 231 MiB / 11175 MiB ( 2 %) | 3 % | Fri Nov 23 16:36:24 2018 |
| | 1:GeForce GTX 1080 Ti | 2 MiB / 11178 MiB ( 0 %) | 0 % | |
| | 2:GeForce GTX 1080 Ti | 2 MiB / 11178 MiB ( 0 %) | 0 % | |
| | 3:GeForce GTX 1080 Ti | 2 MiB / 11178 MiB ( 0 %) | 0 % | |
+----------------------+-----------------------+------------------------------+----------+--------------------------+
| 234 | 0:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB ( 0 %) | 0 % | Fri Nov 23 16:36:24 2018 |
| | 1:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB ( 0 %) | 0 % | |
| | 2:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB ( 0 %) | 0 % | |
| | 3:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB ( 0 %) | 0 % | |
| | 4:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB ( 0 %) | 0 % | |
| | 5:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB ( 0 %) | 0 % | |
| | 6:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB ( 0 %) | 0 % | |
| | 7:GeForce GTX 1080 Ti | 0 MiB / 11178 MiB ( 0 %) | 0 % | |
+----------------------+-----------------------+------------------------------+----------+--------------------------+
| 226 | 0:GeForce GTX 1080 Ti | 6193 MiB / 11178 MiB ( 55 %) | 0 % | Fri Nov 23 16:36:23 2018 |
| | 1:GeForce GTX 1080 Ti | 6207 MiB / 11178 MiB ( 55 %) | 0 % | |
+----------------------+-----------------------+------------------------------+----------+--------------------------+
| 228 | 0:GeForce GTX 1080 Ti | 322 MiB / 11178 MiB ( 2 %) | 0 % | Fri Nov 23 16:36:25 2018 |
| | 1:GeForce GTX 1080 Ti | 1 MiB / 11178 MiB ( 0 %) | 0 % | |
+----------------------+-----------------------+------------------------------+----------+--------------------------+
| 232 | 0:GeForce GTX 1080 | 3677 MiB / 8119 MiB ( 45 %) | 65 % | Fri Nov 23 16:36:23 2018 |
| | 1:GeForce GTX 1080 | 3319 MiB / 8118 MiB ( 40 %) | 12 % | |
+----------------------+-----------------------+------------------------------+----------+--------------------------+
| 220 | 0:GeForce GTX 1080 Ti | 233 MiB / 11177 MiB ( 2 %) | 0 % | Fri Nov 23 16:36:24 2018 |
| | 1:GeForce GTX 1080 Ti | 2 MiB / 11178 MiB ( 0 %) | 0 % | |
+----------------------+-----------------------+------------------------------+----------+--------------------------+
| 224 | 0:GeForce GTX 1080 Ti | 199 MiB / 11177 MiB ( 1 %) | 0 % | Fri Nov 23 16:36:23 2018 |
| | 1:GeForce GTX 1080 Ti | 3843 MiB / 11178 MiB ( 34 %) | 49 % | |
+----------------------+-----------------------+------------------------------+----------+--------------------------+
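To show more detail, run cluster-smi -t -p.
The original had no temperature display here either, which I wanted, so my fork adds fan speed, temperature, and power consumption.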
Known issues
After a few minutes, cluster-smi-router crashes with a panic...
panic: interrupted system call
goroutine 1 [running]:
main.main()
/home/tamaki/go/src/github.com/patwie/cluster-smi/cluster-smi-router.go:83 +0x61d
Apparently sending or receiving a message fails here?
80: // read request of client
81: msg, err := messaging.ReceiveMultipartMessage(router_socket)
82: if err != nil {
83: panic(err)
84: }
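Update: after adding more machines, it stopped crashing. Maybe it simply crashes when no messages arrive? If so, more machines means more messages, which may be why it is fine now.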