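I also wanted to run this on the GTX1080Ti, but that requires installing CUDA 9 and bringing up NVIDIA Docker, and I did not have time for it.
Here I test two cards: the Radeon Vega FE (Vega10) and the Radeon Ⅶ (Vega20).
The version published on 03/08 is provisional, so please bear with the gaps.
I plan to revise it soon, so the content of this article may change.
(On 3/20 and 3/21 I added part of the InceptionV3 benchmark for the GTX1080Ti and the RX470 16GB.)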
#Benchmarking the deep learning performance of AMD GPUs
Let's see how well TensorFlow performs on ROCm.
The environment was set up as described here:
https://qiita.com/_JG1WWK/items/bfb59e2589b82bf5a8b3
CPU Xeon E5-2603 v4
MB msi-x99 Gaming7
RAM DDR4-2400 32GB
GPU0 NVIDIA GTX1080Ti
GPU1 AMD Vega Frontier Edition (swapped for the RadeonⅦ during benchmarking)
OS Ubuntu16.04.5 LTS kernel version 4.13
All tests below were run on this configuration.
#Table of contents
・Benchmark preparation commands
・InceptionV3
・ResNet-50
・ResNet-152
・AlexNet
・VGG16
・Summary of results
#Preparation
Run the commands below. Anaconda or Miniconda is required.
$ git clone https://github.com/tensorflow/benchmarks.git -b cnn_tf_v1.12_compatible
$ cd ~/benchmarks/scripts/tf_cnn_benchmarks
$ conda create -n rocm_tf_test python=3.6
$ conda activate rocm_tf_test
$ python --version
Python 3.5.6 :: Anaconda, Inc.
$ pip install tensorflow-rocm==1.12
##About the power limit
To apply a power limit, use the following command:
sudo /opt/rocm/bin/rocm-smi --setpoweroverdrive 120
The example above sets the limit to 120 W.
In this article, whenever a power limit is applied to the VegaFE, the cap is 150 W.
======================== ROCm System Management Interface ========================
******WARNING******
Operating your AMD GPU outside of official AMD specifications or outside of
factory settings, including but not limited to the conducting of overclocking,
over-volting or under-volting (including use of this interface software,
even if such software has been directly or indirectly provided by AMD or otherwise
affiliated in any way with AMD), may cause damage to your AMD GPU, system components
and/or result in system failure, as well as cause other problems.
DAMAGES CAUSED BY USE OF YOUR AMD GPU OUTSIDE OF OFFICIAL AMD SPECIFICATIONS OR
OUTSIDE OF FACTORY SETTINGS ARE NOT COVERED UNDER ANY AMD PRODUCT WARRANTY AND
MAY NOT BE COVERED BY YOUR BOARD OR SYSTEM MANUFACTURER'S WARRANTY.
Please use this utility with caution.
Do you accept these terms? [y/N] y
The tool asks for this confirmation every time, so check that you are not passing an unreasonable value before you accept.
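Since a mistyped wattage is the main risk here, one option is to build the command through a small guard instead of typing it by hand. The following is only a sketch: `power_limit_command` and the `max_watts` bound of 300 are my own hypothetical choices, not anything defined by rocm-smi or AMD. Only the path and the `--setpoweroverdrive` flag come from the command shown above.

```python
ROCM_SMI = "/opt/rocm/bin/rocm-smi"  # path used in this article


def power_limit_command(watts, max_watts=300):
    """Return the argv that asks rocm-smi to cap board power at `watts`.

    `max_watts` is a hypothetical sanity bound, not an AMD specification.
    """
    if type(watts) is not int or not 0 < watts <= max_watts:
        raise ValueError(f"refusing suspicious power limit: {watts!r}")
    return ["sudo", ROCM_SMI, "--setpoweroverdrive", str(watts)]


# Only prints the command; nothing is executed here.
print(" ".join(power_limit_command(150)))
```

The returned list could be handed to `subprocess.run`, but the confirmation prompt shown above would still have to be answered.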
#InceptionV3
I download the benchmark repository for TensorFlow 1.12 and run it.
The benchmark follows the procedure on the site below.
https://www.leadergpu.com/articles/431-tensorflow-inception-v3-benchmark
The command is as follows:
$ python ./tf_cnn_benchmarks.py --num_gpus=1 --model inception3 --batch_size 32
##VegaFE measurement
(A large amount of log output is omitted.)
Done warm up
Step Img/sec total_loss
1 images/sec: 88.7 +/- 0.0 (jitter = 0.0) 7.383
10 images/sec: 87.2 +/- 0.3 (jitter = 1.3) 7.431
20 images/sec: 87.3 +/- 0.3 (jitter = 1.3) 7.336
30 images/sec: 87.2 +/- 0.3 (jitter = 1.0) 7.449
40 images/sec: 87.1 +/- 0.2 (jitter = 1.0) 7.387
50 images/sec: 87.1 +/- 0.2 (jitter = 0.9) 7.370
60 images/sec: 87.1 +/- 0.2 (jitter = 0.8) 7.435
70 images/sec: 87.1 +/- 0.2 (jitter = 0.8) 7.339
80 images/sec: 87.1 +/- 0.2 (jitter = 0.8) 7.343
90 images/sec: 87.1 +/- 0.1 (jitter = 0.8) 7.489
100 images/sec: 87.0 +/- 0.1 (jitter = 0.8) 7.435
----------------------------------------------------------------
total images/sec: 86.99
----------------------------------------------------------------
That is roughly 87 images/sec. Next I lower the power limit and run again.
Result:
1 images/sec: 81.2 +/- 0.0 (jitter = 0.0) 7.354
10 images/sec: 82.0 +/- 0.3 (jitter = 0.8) 7.432
20 images/sec: 81.8 +/- 0.2 (jitter = 0.8) 7.320
30 images/sec: 81.9 +/- 0.2 (jitter = 0.7) 7.478
40 images/sec: 81.8 +/- 0.2 (jitter = 0.8) 7.381
50 images/sec: 81.7 +/- 0.2 (jitter = 0.8) 7.364
60 images/sec: 81.7 +/- 0.2 (jitter = 0.7) 7.444
70 images/sec: 81.6 +/- 0.2 (jitter = 0.8) 7.345
80 images/sec: 81.6 +/- 0.1 (jitter = 0.8) 7.397
90 images/sec: 81.5 +/- 0.1 (jitter = 0.9) 7.503
100 images/sec: 81.5 +/- 0.1 (jitter = 0.9) 7.409
----------------------------------------------------------------
total images/sec: 81.48
----------------------------------------------------------------
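The `total images/sec` line is the figure I quote for every run in this article. As a minimal sketch (not part of tf_cnn_benchmarks), it could be pulled out of a saved log as follows. `parse_total_images_per_sec` and `SAMPLE` are hypothetical names, and the regular expression assumes the line looks exactly like the output pasted above.

```python
import re

# Matches the summary line printed at the end of a run.
TOTAL_RE = re.compile(
    r"^total images/sec:\s*([0-9]+(?:\.[0-9]+)?)\s*$", re.MULTILINE
)


def parse_total_images_per_sec(log_text):
    """Return the value of the 'total images/sec' line as a float."""
    match = TOTAL_RE.search(log_text)
    if match is None:
        raise ValueError("no 'total images/sec' line found")
    return float(match.group(1))


# Tail of the power-limited VegaFE log shown above.
SAMPLE = """\
100 images/sec: 81.5 +/- 0.1 (jitter = 0.9) 7.409
----------------------------------------------------------------
total images/sec: 81.48
----------------------------------------------------------------
"""

print(parse_total_images_per_sec(SAMPLE))
```

The per-step lines also contain `images/sec:`, which is why the pattern is anchored to a line that starts with `total`.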
##RadeonⅦ measurement
Measured at stock settings.
1 images/sec: 131.5 +/- 0.0 (jitter = 0.0) 7.373
10 images/sec: 130.6 +/- 0.8 (jitter = 0.5) 7.430
20 images/sec: 130.6 +/- 0.5 (jitter = 0.5) 7.356
30 images/sec: 130.5 +/- 0.4 (jitter = 0.5) 7.476
40 images/sec: 130.5 +/- 0.4 (jitter = 0.4) 7.370
50 images/sec: 130.4 +/- 0.4 (jitter = 0.4) 7.378
60 images/sec: 130.5 +/- 0.3 (jitter = 0.5) 7.434
70 images/sec: 130.5 +/- 0.3 (jitter = 0.4) 7.318
80 images/sec: 130.5 +/- 0.3 (jitter = 0.5) 7.361
90 images/sec: 130.6 +/- 0.2 (jitter = 0.5) 7.475
100 images/sec: 130.7 +/- 0.2 (jitter = 0.6) 7.427
----------------------------------------------------------------
total images/sec: 130.62
----------------------------------------------------------------
As expected from the RadeonⅦ.
Next I set the power limit to 150 W and measure.
Step Img/sec total_loss
1 images/sec: 118.2 +/- 0.0 (jitter = 0.0) 7.383
10 images/sec: 117.6 +/- 0.4 (jitter = 0.2) 7.442
20 images/sec: 117.5 +/- 0.3 (jitter = 0.2) 7.339
30 images/sec: 117.4 +/- 0.3 (jitter = 0.3) 7.501
40 images/sec: 117.5 +/- 0.2 (jitter = 0.3) 7.408
50 images/sec: 117.4 +/- 0.2 (jitter = 0.3) 7.361
60 images/sec: 117.4 +/- 0.2 (jitter = 0.3) 7.398
70 images/sec: 117.3 +/- 0.2 (jitter = 0.4) 7.279
80 images/sec: 117.2 +/- 0.2 (jitter = 0.4) 7.361
90 images/sec: 117.2 +/- 0.2 (jitter = 0.4) 7.464
100 images/sec: 117.2 +/- 0.2 (jitter = 0.5) 7.418
----------------------------------------------------------------
total images/sec: 117.12
----------------------------------------------------------------
That is 7 nm for you: the drop is only about 13.5 images/sec.
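To put the two cards side by side, the drop can also be expressed as a percentage. This is just the arithmetic on the InceptionV3 totals reported above; `power_limit_drop` is a hypothetical helper name.

```python
def power_limit_drop(unlimited, limited):
    """Return (absolute drop in images/sec, relative drop in percent)."""
    absolute = unlimited - limited
    return absolute, 100.0 * absolute / unlimited


# InceptionV3 total images/sec from this article: (no limit, 150 W limit)
results = {
    "VegaFE": (86.99, 81.48),
    "RadeonVII": (130.62, 117.12),
}

for gpu, (unlimited, limited) in results.items():
    absolute, percent = power_limit_drop(unlimited, limited)
    print(f"{gpu}: -{absolute:.2f} images/sec ({percent:.1f}%)")
```

In relative terms the RadeonⅦ loses a little more than the VegaFE on this model (about 10% versus about 6%), although it remains far faster in absolute numbers.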
##RX570 16GB
Step Img/sec total_loss
1 images/sec: 45.9 +/- 0.0 (jitter = 0.0) 7.363
10 images/sec: 45.9 +/- 0.0 (jitter = 0.0) 7.425
20 images/sec: 45.9 +/- 0.0 (jitter = 0.0) 7.346
30 images/sec: 45.9 +/- 0.0 (jitter = 0.0) 7.468
40 images/sec: 45.9 +/- 0.0 (jitter = 0.0) 7.396
50 images/sec: 45.9 +/- 0.0 (jitter = 0.0) 7.350
60 images/sec: 45.9 +/- 0.0 (jitter = 0.0) 7.419
70 images/sec: 45.9 +/- 0.0 (jitter = 0.0) 7.323
80 images/sec: 45.9 +/- 0.0 (jitter = 0.0) 7.345
90 images/sec: 45.9 +/- 0.0 (jitter = 0.0) 7.485
100 images/sec: 45.9 +/- 0.0 (jitter = 0.0) 7.424
----------------------------------------------------------------
total images/sec: 45.94
----------------------------------------------------------------
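Not bad, perhaps? The card behaved very erratically during the benchmark, so the results may not be reproducible unless you follow the memo at the end of this article.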
#ResNet-50
https://www.leadergpu.com/articles/429-tensorflow-resnet-50-benchmark
Next I measure ResNet-50.
$ python ./tf_cnn_benchmarks.py --num_gpus=1 --model resnet50 --batch_size 32
##VegaFE
For now this runs at full power with --setsclk 7.
Step Img/sec total_loss
1 images/sec: 149.8 +/- 0.0 (jitter = 0.0) 8.458
10 images/sec: 155.7 +/- 1.3 (jitter = 0.9) 7.997
20 images/sec: 155.5 +/- 0.8 (jitter = 1.0) 8.260
30 images/sec: 155.6 +/- 0.6 (jitter = 0.9) 8.339
40 images/sec: 155.6 +/- 0.5 (jitter = 1.0) 8.190
50 images/sec: 155.6 +/- 0.5 (jitter = 1.1) 7.757
60 images/sec: 155.6 +/- 0.4 (jitter = 1.3) 8.069
70 images/sec: 155.7 +/- 0.4 (jitter = 1.2) 8.481
80 images/sec: 155.7 +/- 0.4 (jitter = 1.1) 8.290
90 images/sec: 155.6 +/- 0.3 (jitter = 1.1) 8.032
100 images/sec: 155.5 +/- 0.3 (jitter = 1.1) 8.021
----------------------------------------------------------------
total images/sec: 155.38
----------------------------------------------------------------
======================== ROCm System Management Interface ========================
================================================================================================
GPU Temp AvgPwr SCLK MCLK PCLK Fan Perf PwrCap SCLK OD MCLK OD GPU%
0 65.0c 233.0W 1528Mhz 945Mhz 8.0GT/s, x16 20.78% manual 220.0W 0% 0% 99%
1 N/A N/A N/A N/A N/A 0% N/A N/A N/A N/A N/A
================================================================================================
======================== End of ROCm SMI Log ========================
At full power the card draws as much as 230 W.
Next I set the limit to 150 W and measure.
1 images/sec: 133.4 +/- 0.0 (jitter = 0.0) 8.458
10 images/sec: 138.0 +/- 1.3 (jitter = 0.4) 7.997
20 images/sec: 138.1 +/- 0.7 (jitter = 0.5) 8.260
30 images/sec: 138.1 +/- 0.6 (jitter = 0.5) 8.337
40 images/sec: 138.0 +/- 0.5 (jitter = 0.5) 8.181
50 images/sec: 138.1 +/- 0.4 (jitter = 0.5) 7.753
60 images/sec: 138.2 +/- 0.4 (jitter = 0.6) 8.052
70 images/sec: 138.3 +/- 0.4 (jitter = 0.5) 8.463
80 images/sec: 138.3 +/- 0.3 (jitter = 0.5) 8.274
90 images/sec: 138.3 +/- 0.3 (jitter = 0.5) 8.034
100 images/sec: 138.3 +/- 0.3 (jitter = 0.5) 8.006
----------------------------------------------------------------
total images/sec: 138.19
----------------------------------------------------------------
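Throughput falls to about 89% of the full-power result, a drop of roughly 11%.
In exchange, power consumption falls by about 80 W, which suggests that the Vega architecture ships with very aggressive clock settings.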
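The trade-off becomes clearer as images/sec per watt. The sketch below uses the 233 W `AvgPwr` reading from the rocm-smi snapshot above for the full-power run. For the limited run it assumes the card sits exactly at the 150 W cap, which I did not log for this model, so treat the second number as an estimate.

```python
def images_per_sec_per_watt(images_per_sec, watts):
    """Throughput divided by board power."""
    return images_per_sec / watts


# ResNet-50 on the VegaFE, totals from this article.
full = images_per_sec_per_watt(155.38, 233.0)     # AvgPwr from rocm-smi
limited = images_per_sec_per_watt(138.19, 150.0)  # assumed: draw equals the cap

print(f"full power : {full:.3f} images/sec/W")
print(f"150 W limit: {limited:.3f} images/sec/W")
print(f"efficiency gain: {100.0 * (limited / full - 1.0):.0f}%")
```

Under those assumptions the limited configuration does noticeably more work per watt, which is the point of the comment about the clock settings.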
I also tried enabling FP16, with the power limit left unchanged.
$ python ./tf_cnn_benchmarks.py --num_gpus=1 --model resnet50 --batch_size 32 --use_fp16
I ran it with the command above.
Step Img/sec total_loss
1 images/sec: 145.9 +/- 0.0 (jitter = 0.0) 7.971
10 images/sec: 148.2 +/- 1.3 (jitter = 3.4) 8.046
20 images/sec: 148.9 +/- 0.9 (jitter = 1.9) 8.330
30 images/sec: 149.2 +/- 0.7 (jitter = 1.9) 8.094
40 images/sec: 149.0 +/- 0.7 (jitter = 1.7) 8.683
50 images/sec: 149.0 +/- 0.6 (jitter = 1.9) 8.299
60 images/sec: 148.9 +/- 0.5 (jitter = 1.7) 8.334
70 images/sec: 148.9 +/- 0.5 (jitter = 1.9) 8.153
80 images/sec: 148.9 +/- 0.4 (jitter = 1.8) 8.132
90 images/sec: 148.9 +/- 0.4 (jitter = 1.8) 8.380
100 images/sec: 148.9 +/- 0.4 (jitter = 1.7) 8.333
----------------------------------------------------------------
total images/sec: 148.86
----------------------------------------------------------------
Performance seems to have gone up, if only slightly.
##RadeonⅦ
Measured at stock settings.
Done warm up
Step Img/sec total_loss
1 images/sec: 207.0 +/- 0.0 (jitter = 0.0) 8.458
10 images/sec: 210.5 +/- 2.6 (jitter = 1.4) 7.998
20 images/sec: 210.3 +/- 1.8 (jitter = 1.4) 8.261
30 images/sec: 210.0 +/- 1.4 (jitter = 1.4) 8.335
40 images/sec: 210.1 +/- 1.2 (jitter = 1.3) 8.184
50 images/sec: 209.9 +/- 1.0 (jitter = 1.3) 7.751
60 images/sec: 210.0 +/- 1.0 (jitter = 1.4) 8.066
70 images/sec: 210.2 +/- 0.9 (jitter = 1.4) 8.463
80 images/sec: 210.5 +/- 0.8 (jitter = 1.4) 8.294
90 images/sec: 210.4 +/- 0.8 (jitter = 1.4) 8.012
100 images/sec: 210.3 +/- 0.7 (jitter = 1.4) 8.014
----------------------------------------------------------------
total images/sec: 210.18
----------------------------------------------------------------
Next I measure with the power limit applied.
Step Img/sec total_loss
1 images/sec: 209.4 +/- 0.0 (jitter = 0.0) 8.458
10 images/sec: 207.6 +/- 1.4 (jitter = 0.5) 7.997
20 images/sec: 207.9 +/- 0.8 (jitter = 0.7) 8.262
30 images/sec: 207.5 +/- 0.7 (jitter = 0.8) 8.337
40 images/sec: 207.6 +/- 0.6 (jitter = 0.8) 8.191
50 images/sec: 207.6 +/- 0.5 (jitter = 0.8) 7.760
60 images/sec: 207.6 +/- 0.5 (jitter = 0.8) 8.059
70 images/sec: 207.5 +/- 0.5 (jitter = 0.9) 8.469
80 images/sec: 207.6 +/- 0.4 (jitter = 1.0) 8.292
90 images/sec: 207.3 +/- 0.5 (jitter = 1.0) 8.021
100 images/sec: 207.3 +/- 0.5 (jitter = 0.9) 8.001
----------------------------------------------------------------
total images/sec: 207.12
----------------------------------------------------------------
On the latest RadeonⅦ, enabling FP16 does make things noticeably faster.
Step Img/sec total_loss
1 images/sec: 289.6 +/- 0.0 (jitter = 0.0) 7.976
10 images/sec: 288.6 +/- 2.8 (jitter = 2.9) 8.040
20 images/sec: 288.1 +/- 2.1 (jitter = 1.5) 8.314
30 images/sec: 288.7 +/- 1.6 (jitter = 1.1) 8.075
40 images/sec: 288.9 +/- 1.3 (jitter = 1.2) 8.678
50 images/sec: 288.3 +/- 1.3 (jitter = 1.2) 8.303
60 images/sec: 286.7 +/- 1.5 (jitter = 1.7) 8.346
70 images/sec: 287.0 +/- 1.4 (jitter = 1.6) 8.166
80 images/sec: 286.9 +/- 1.3 (jitter = 1.8) 8.147
90 images/sec: 287.1 +/- 1.2 (jitter = 2.0) 8.411
100 images/sec: 287.4 +/- 1.1 (jitter = 1.8) 8.329
----------------------------------------------------------------
total images/sec: 287.15
----------------------------------------------------------------
##RX570 16GB
On ResNet-50 the performance was so-so.
Step Img/sec total_loss
1 images/sec: 83.1 +/- 0.0 (jitter = 0.0) 8.458
10 images/sec: 83.1 +/- 0.0 (jitter = 0.1) 7.997
20 images/sec: 83.1 +/- 0.0 (jitter = 0.1) 8.261
30 images/sec: 83.1 +/- 0.0 (jitter = 0.1) 8.338
40 images/sec: 83.1 +/- 0.0 (jitter = 0.1) 8.188
50 images/sec: 83.1 +/- 0.0 (jitter = 0.1) 7.743
60 images/sec: 83.1 +/- 0.0 (jitter = 0.1) 8.072
70 images/sec: 83.1 +/- 0.0 (jitter = 0.1) 8.463
80 images/sec: 83.1 +/- 0.0 (jitter = 0.1) 8.295
90 images/sec: 83.1 +/- 0.0 (jitter = 0.1) 8.032
100 images/sec: 83.1 +/- 0.0 (jitter = 0.1) 8.015
----------------------------------------------------------------
total images/sec: 83.05
----------------------------------------------------------------
#ResNet-152
The command is as follows:
python ./tf_cnn_benchmarks.py --num_gpus=1 --model resnet152 --batch_size 32
##VegaFE
Measured at full power with --setsclk 7.
1 images/sec: 65.2 +/- 0.0 (jitter = 0.0) 9.909
10 images/sec: 64.7 +/- 0.2 (jitter = 0.6) 9.650
20 images/sec: 64.6 +/- 0.2 (jitter = 0.6) 9.755
30 images/sec: 64.6 +/- 0.1 (jitter = 0.7) 9.905
40 images/sec: 64.4 +/- 0.1 (jitter = 1.0) 9.960
50 images/sec: 64.2 +/- 0.1 (jitter = 1.2) 10.084
60 images/sec: 63.8 +/- 0.1 (jitter = 1.6) 10.244
70 images/sec: 63.6 +/- 0.1 (jitter = 1.8) 9.957
80 images/sec: 63.4 +/- 0.1 (jitter = 1.5) 9.885
90 images/sec: 63.2 +/- 0.1 (jitter = 1.5) 10.220
100 images/sec: 63.0 +/- 0.1 (jitter = 1.4) 10.038
----------------------------------------------------------------
total images/sec: 63.02
----------------------------------------------------------------
======================== ROCm System Management Interface ========================
================================================================================================
GPU Temp AvgPwr SCLK MCLK PCLK Fan Perf PwrCap SCLK OD MCLK OD GPU%
0 81.0c 251.0W 1269Mhz 945Mhz 8.0GT/s, x16 40.78% manual 220.0W 0% 0% 100%
1 N/A N/A N/A N/A N/A 0% N/A N/A N/A N/A N/A
================================================================================================
======================== End of ROCm SMI Log ========================
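The peak is around 250 W.
Next I run with a 150 W power limit.
(The result I pasted here was from the wrong run, so I will re-measure it later.)
I also tried FP16 here as an experiment, but the result was terrible.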
$ TF_ROCM_FUSION_ENABLE=1 python ./tf_cnn_benchmarks.py --num_gpus=1 --model resnet152 --batch_size 32 --use_fp16
Step Img/sec total_loss
1 images/sec: 57.1 +/- 0.0 (jitter = 0.0) 10.044
10 images/sec: 56.6 +/- 0.2 (jitter = 0.6) 9.827
20 images/sec: 56.6 +/- 0.2 (jitter = 0.7) 9.752
30 images/sec: 56.4 +/- 0.1 (jitter = 0.8) 10.009
40 images/sec: 56.4 +/- 0.1 (jitter = 0.8) 9.830
50 images/sec: 56.4 +/- 0.1 (jitter = 0.8) 10.032
60 images/sec: 56.4 +/- 0.1 (jitter = 0.8) 9.625
70 images/sec: 56.3 +/- 0.1 (jitter = 0.9) 9.874
80 images/sec: 56.3 +/- 0.1 (jitter = 0.9) 9.608
90 images/sec: 56.3 +/- 0.1 (jitter = 0.9) 10.094
100 images/sec: 56.3 +/- 0.1 (jitter = 0.9) 10.082
----------------------------------------------------------------
total images/sec: 56.32
----------------------------------------------------------------
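It is slower than FP32 (63.02 images/sec) and less than half of the RadeonⅦ FP16 result below (116.19 images/sec). What is going on?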
##RadeonⅦ
Step Img/sec total_loss
1 images/sec: 89.5 +/- 0.0 (jitter = 0.0) 9.906
10 images/sec: 88.6 +/- 0.3 (jitter = 0.6) 9.670
20 images/sec: 88.6 +/- 0.3 (jitter = 0.7) 9.745
30 images/sec: 88.6 +/- 0.2 (jitter = 0.5) 9.937
40 images/sec: 88.6 +/- 0.2 (jitter = 0.5) 9.977
50 images/sec: 88.5 +/- 0.2 (jitter = 0.4) 10.093
60 images/sec: 88.5 +/- 0.2 (jitter = 0.5) 10.304
70 images/sec: 88.5 +/- 0.2 (jitter = 0.7) 10.016
80 images/sec: 88.5 +/- 0.1 (jitter = 0.6) 9.965
90 images/sec: 88.5 +/- 0.1 (jitter = 0.6) 10.213
100 images/sec: 88.4 +/- 0.1 (jitter = 0.6) 10.020
----------------------------------------------------------------
total images/sec: 88.41
----------------------------------------------------------------
With the power limit applied:
1 images/sec: 82.8 +/- 0.0 (jitter = 0.0) 9.873
10 images/sec: 82.3 +/- 0.3 (jitter = 0.2) 9.694
20 images/sec: 82.0 +/- 0.2 (jitter = 0.4) 9.757
30 images/sec: 82.0 +/- 0.2 (jitter = 0.3) 9.884
40 images/sec: 82.0 +/- 0.2 (jitter = 0.4) 9.931
50 images/sec: 81.9 +/- 0.2 (jitter = 0.5) 10.092
60 images/sec: 81.9 +/- 0.1 (jitter = 0.5) 10.253
70 images/sec: 81.9 +/- 0.1 (jitter = 0.5) 9.958
80 images/sec: 81.9 +/- 0.1 (jitter = 0.5) 9.902
90 images/sec: 82.0 +/- 0.1 (jitter = 0.4) 10.204
100 images/sec: 81.9 +/- 0.1 (jitter = 0.5) 10.048
----------------------------------------------------------------
total images/sec: 81.90
----------------------------------------------------------------
With FP16 enabled:
Step Img/sec total_loss
1 images/sec: 113.9 +/- 0.0 (jitter = 0.0) 10.044
10 images/sec: 116.1 +/- 0.6 (jitter = 1.1) 9.814
20 images/sec: 116.3 +/- 0.4 (jitter = 0.8) 9.755
30 images/sec: 116.3 +/- 0.3 (jitter = 0.8) 10.030
40 images/sec: 116.3 +/- 0.3 (jitter = 0.7) 9.825
50 images/sec: 116.2 +/- 0.2 (jitter = 0.9) 10.052
60 images/sec: 116.3 +/- 0.2 (jitter = 0.8) 9.654
70 images/sec: 116.3 +/- 0.2 (jitter = 0.8) 9.919
80 images/sec: 116.2 +/- 0.2 (jitter = 0.8) 9.650
90 images/sec: 116.2 +/- 0.2 (jitter = 0.8) 10.083
100 images/sec: 116.2 +/- 0.2 (jitter = 0.8) 10.041
----------------------------------------------------------------
total images/sec: 116.19
----------------------------------------------------------------
##RX570 16GB
Done warm up
Step Img/sec total_loss
1 images/sec: 35.5 +/- 0.0 (jitter = 0.0) 9.936
10 images/sec: 35.5 +/- 0.0 (jitter = 0.1) 9.670
20 images/sec: 35.5 +/- 0.0 (jitter = 0.0) 9.764
30 images/sec: 35.5 +/- 0.0 (jitter = 0.0) 9.933
40 images/sec: 35.5 +/- 0.0 (jitter = 0.0) 9.962
50 images/sec: 35.5 +/- 0.0 (jitter = 0.0) 10.067
60 images/sec: 35.5 +/- 0.0 (jitter = 0.0) 10.225
70 images/sec: 35.5 +/- 0.0 (jitter = 0.0) 9.986
80 images/sec: 35.5 +/- 0.0 (jitter = 0.0) 9.944
90 images/sec: 35.5 +/- 0.0 (jitter = 0.0) 10.225
100 images/sec: 35.5 +/- 0.0 (jitter = 0.0) 10.069
----------------------------------------------------------------
total images/sec: 35.49
----------------------------------------------------------------
#AlexNet
https://www.leadergpu.com/articles/428-tensorflow-alexnet-benchmark
The command below follows the page above.
python ./tf_cnn_benchmarks.py --num_gpus=1 --model alexnet --batch_size 32
Run with a batch size of 32.
##VegaFE
Run at full power with --setsclk 7.
Step Img/sec total_loss
1 images/sec: 862.0 +/- 0.0 (jitter = 0.0) nan
10 images/sec: 858.1 +/- 2.5 (jitter = 6.1) nan
20 images/sec: 858.6 +/- 1.9 (jitter = 6.1) nan
30 images/sec: 859.4 +/- 1.3 (jitter = 5.3) nan
40 images/sec: 859.6 +/- 1.1 (jitter = 5.4) nan
50 images/sec: 860.1 +/- 0.9 (jitter = 5.2) nan
60 images/sec: 858.4 +/- 1.6 (jitter = 5.3) nan
70 images/sec: 858.4 +/- 1.4 (jitter = 5.3) nan
80 images/sec: 856.4 +/- 2.5 (jitter = 5.1) nan
90 images/sec: 857.2 +/- 2.3 (jitter = 4.9) nan
100 images/sec: 858.0 +/- 2.1 (jitter = 4.7) nan
----------------------------------------------------------------
total images/sec: 855.59
----------------------------------------------------------------
Run with the power limit set to 150 W.
Step Img/sec total_loss
1 images/sec: 821.2 +/- 0.0 (jitter = 0.0) nan
10 images/sec: 800.8 +/- 2.8 (jitter = 4.7) nan
20 images/sec: 801.2 +/- 1.7 (jitter = 3.2) nan
30 images/sec: 801.4 +/- 1.3 (jitter = 3.5) nan
40 images/sec: 802.0 +/- 1.0 (jitter = 3.6) nan
50 images/sec: 801.9 +/- 0.9 (jitter = 3.9) nan
60 images/sec: 801.6 +/- 0.8 (jitter = 3.8) nan
70 images/sec: 799.3 +/- 1.8 (jitter = 3.8) nan
80 images/sec: 799.6 +/- 1.6 (jitter = 3.4) nan
90 images/sec: 799.8 +/- 1.4 (jitter = 3.4) nan
100 images/sec: 799.7 +/- 1.3 (jitter = 3.1) nan
----------------------------------------------------------------
total images/sec: 797.83
----------------------------------------------------------------
Measured with FP16 enabled.
$ TF_ROCM_FUSION_ENABLE=1 python ./tf_cnn_benchmarks.py --num_gpus=1 --model alexnet --batch_size 32 --use_fp16
Step Img/sec total_loss
1 images/sec: 570.9 +/- 0.0 (jitter = 0.0) 7.199
10 images/sec: 562.0 +/- 1.8 (jitter = 5.6) 7.199
20 images/sec: 561.1 +/- 1.5 (jitter = 7.0) 7.200
30 images/sec: 562.3 +/- 1.3 (jitter = 7.6) 7.199
40 images/sec: 562.6 +/- 1.0 (jitter = 6.8) 7.199
50 images/sec: 562.8 +/- 0.9 (jitter = 6.2) 7.199
60 images/sec: 560.7 +/- 1.7 (jitter = 7.5) 7.199
70 images/sec: 560.7 +/- 1.4 (jitter = 7.4) 7.199
80 images/sec: 561.2 +/- 1.3 (jitter = 7.1) 7.200
90 images/sec: 561.6 +/- 1.2 (jitter = 6.5) 7.199
100 images/sec: 561.7 +/- 1.1 (jitter = 6.5) 7.199
----------------------------------------------------------------
total images/sec: 560.68
----------------------------------------------------------------
It actually got slower.
##RadeonⅦ
This one is also measured at stock settings.
Done warm up
Step Img/sec total_loss
1 images/sec: 983.5 +/- 0.0 (jitter = 0.0) nan
10 images/sec: 984.9 +/- 2.6 (jitter = 6.6) nan
20 images/sec: 980.3 +/- 2.9 (jitter = 8.2) nan
30 images/sec: 982.1 +/- 2.3 (jitter = 8.4) nan
40 images/sec: 983.5 +/- 1.8 (jitter = 8.4) nan
50 images/sec: 983.8 +/- 1.5 (jitter = 7.6) nan
60 images/sec: 983.5 +/- 1.4 (jitter = 7.7) nan
70 images/sec: 984.0 +/- 1.2 (jitter = 6.9) nan
80 images/sec: 980.7 +/- 3.1 (jitter = 6.9) nan
90 images/sec: 981.6 +/- 2.8 (jitter = 6.1) nan
100 images/sec: 982.2 +/- 2.5 (jitter = 6.2) nan
----------------------------------------------------------------
total images/sec: 979.09
----------------------------------------------------------------
With the 150 W power limit applied:
Step Img/sec total_loss
1 images/sec: 965.4 +/- 0.0 (jitter = 0.0) nan
10 images/sec: 940.8 +/- 13.7 (jitter = 28.5) nan
20 images/sec: 950.9 +/- 7.6 (jitter = 18.3) nan
30 images/sec: 956.7 +/- 5.4 (jitter = 14.0) nan
40 images/sec: 957.2 +/- 4.5 (jitter = 13.5) nan
50 images/sec: 958.2 +/- 3.7 (jitter = 13.1) nan
60 images/sec: 959.0 +/- 3.1 (jitter = 12.9) nan
70 images/sec: 956.1 +/- 4.0 (jitter = 13.1) nan
80 images/sec: 957.2 +/- 3.6 (jitter = 12.9) nan
90 images/sec: 957.8 +/- 3.2 (jitter = 11.8) nan
100 images/sec: 958.3 +/- 2.9 (jitter = 12.8) nan
----------------------------------------------------------------
total images/sec: 955.23
----------------------------------------------------------------
With FP16 enabled:
1 images/sec: 979.4 +/- 0.0 (jitter = 0.0) 7.178
10 images/sec: 973.1 +/- 3.9 (jitter = 5.2) 7.197
20 images/sec: 972.6 +/- 2.3 (jitter = 8.9) 7.198
30 images/sec: 973.1 +/- 1.8 (jitter = 7.7) 7.199
40 images/sec: 972.9 +/- 1.6 (jitter = 7.7) 7.199
50 images/sec: 974.2 +/- 1.6 (jitter = 8.9) 7.199
60 images/sec: 970.5 +/- 4.0 (jitter = 9.5) 7.200
70 images/sec: 970.5 +/- 3.5 (jitter = 9.0) 7.199
80 images/sec: 972.0 +/- 3.1 (jitter = 8.5) 7.200
90 images/sec: 972.9 +/- 2.8 (jitter = 8.6) 7.199
100 images/sec: 974.0 +/- 2.5 (jitter = 9.1) 7.199
----------------------------------------------------------------
total images/sec: 970.86
----------------------------------------------------------------
It cannot beat the Tesla V100, of course, but the performance looks to be roughly on par with the P100.
##RX570 16GB
Step Img/sec total_loss
1 images/sec: 344.3 +/- 0.0 (jitter = 0.0) nan
10 images/sec: 364.6 +/- 2.2 (jitter = 1.7) nan
20 images/sec: 366.2 +/- 1.2 (jitter = 2.0) nan
30 images/sec: 366.4 +/- 0.9 (jitter = 2.3) nan
40 images/sec: 366.6 +/- 0.7 (jitter = 2.2) nan
50 images/sec: 366.9 +/- 0.5 (jitter = 2.2) nan
60 images/sec: 367.0 +/- 0.5 (jitter = 2.3) nan
70 images/sec: 366.9 +/- 0.4 (jitter = 2.4) nan
80 images/sec: 367.1 +/- 0.4 (jitter = 2.3) nan
90 images/sec: 367.1 +/- 0.4 (jitter = 2.4) nan
100 images/sec: 367.1 +/- 0.3 (jitter = 2.4) nan
----------------------------------------------------------------
total images/sec: 366.67
----------------------------------------------------------------
#VGG16
python ./tf_cnn_benchmarks.py --num_gpus=1 --model vgg16 --batch_size 32
I ran it with the command above.
##VegaFE
Measured with --setsclk 7.
Step Img/sec total_loss
1 images/sec: 93.7 +/- 0.0 (jitter = 0.0) 7.262
10 images/sec: 94.3 +/- 0.1 (jitter = 0.4) 7.242
20 images/sec: 94.2 +/- 0.1 (jitter = 0.3) 7.273
30 images/sec: 94.2 +/- 0.1 (jitter = 0.3) 7.212
40 images/sec: 94.1 +/- 0.1 (jitter = 0.3) 7.314
50 images/sec: 94.1 +/- 0.1 (jitter = 0.4) 7.276
60 images/sec: 94.1 +/- 0.1 (jitter = 0.3) 7.247
70 images/sec: 94.1 +/- 0.1 (jitter = 0.3) 7.240
80 images/sec: 93.9 +/- 0.1 (jitter = 0.4) 7.265
90 images/sec: 93.6 +/- 0.1 (jitter = 0.4) 7.269
100 images/sec: 93.4 +/- 0.1 (jitter = 0.5) 7.275
----------------------------------------------------------------
total images/sec: 93.35
----------------------------------------------------------------
Power consumption stayed around 220 W.
Measured with the power limit lowered to 150 W.
Step Img/sec total_loss
1 images/sec: 79.3 +/- 0.0 (jitter = 0.0) 7.276
10 images/sec: 79.0 +/- 0.2 (jitter = 0.7) 7.235
20 images/sec: 79.1 +/- 0.2 (jitter = 0.5) 7.289
30 images/sec: 79.0 +/- 0.1 (jitter = 0.7) 7.227
40 images/sec: 78.9 +/- 0.1 (jitter = 0.7) 7.273
50 images/sec: 78.9 +/- 0.1 (jitter = 0.7) 7.260
60 images/sec: 78.8 +/- 0.1 (jitter = 0.7) 7.271
70 images/sec: 78.7 +/- 0.1 (jitter = 0.6) 7.264
80 images/sec: 78.7 +/- 0.1 (jitter = 0.6) 7.252
90 images/sec: 78.7 +/- 0.1 (jitter = 0.6) 7.267
100 images/sec: 78.7 +/- 0.1 (jitter = 0.6) 7.267
----------------------------------------------------------------
total images/sec: 78.64
----------------------------------------------------------------
Power consumption stayed around 150 W.
###About FP16
I tested VGG16 performance with FP16. The command was:
$ python ./tf_cnn_benchmarks.py --num_gpus=1 --model vgg16 --batch_size 32 --use_fp16
Step Img/sec total_loss
1 images/sec: 52.9 +/- 0.0 (jitter = 0.0) 7.275
10 images/sec: 52.9 +/- 0.1 (jitter = 0.4) 7.298
20 images/sec: 52.8 +/- 0.1 (jitter = 0.3) 7.294
30 images/sec: 52.8 +/- 0.1 (jitter = 0.2) 7.251
40 images/sec: 52.8 +/- 0.1 (jitter = 0.2) 7.285
50 images/sec: 52.8 +/- 0.1 (jitter = 0.2) 7.251
60 images/sec: 52.8 +/- 0.0 (jitter = 0.2) 7.252
70 images/sec: 52.7 +/- 0.0 (jitter = 0.2) 7.263
80 images/sec: 52.7 +/- 0.0 (jitter = 0.2) 7.266
90 images/sec: 52.8 +/- 0.0 (jitter = 0.2) 7.255
100 images/sec: 52.7 +/- 0.0 (jitter = 0.3) 7.252
----------------------------------------------------------------
total images/sec: 52.72
----------------------------------------------------------------
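Compared with the Tesla P100 and V100 results at
https://www.nttpc.co.jp/gpu/article/benchmark03.html
the FP16 performance here is poor. FP16 is in fact slower than FP32, even though the card should support it natively according to the spec sheet, so I suspect there is some problem in the ROCm implementation.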
##RadeonⅦ
These are the numbers at stock settings.
Step Img/sec total_loss
1 images/sec: 124.6 +/- 0.0 (jitter = 0.0) 7.256
10 images/sec: 125.0 +/- 0.1 (jitter = 0.3) 7.249
20 images/sec: 124.9 +/- 0.1 (jitter = 0.4) 7.275
30 images/sec: 124.9 +/- 0.1 (jitter = 0.5) 7.213
40 images/sec: 124.9 +/- 0.1 (jitter = 0.5) 7.279
50 images/sec: 124.9 +/- 0.1 (jitter = 0.4) 7.278
60 images/sec: 124.8 +/- 0.1 (jitter = 0.5) 7.252
70 images/sec: 124.8 +/- 0.1 (jitter = 0.5) 7.242
80 images/sec: 124.7 +/- 0.1 (jitter = 0.5) 7.253
90 images/sec: 124.7 +/- 0.1 (jitter = 0.5) 7.264
100 images/sec: 124.7 +/- 0.1 (jitter = 0.5) 7.275
----------------------------------------------------------------
total images/sec: 124.61
----------------------------------------------------------------
With a 150 W power limit:
Step Img/sec total_loss
1 images/sec: 108.9 +/- 0.0 (jitter = 0.0) 7.246
10 images/sec: 109.3 +/- 0.1 (jitter = 0.3) 7.241
20 images/sec: 109.4 +/- 0.1 (jitter = 0.2) 7.255
30 images/sec: 109.4 +/- 0.0 (jitter = 0.2) 7.240
40 images/sec: 109.3 +/- 0.1 (jitter = 0.2) 7.305
50 images/sec: 109.2 +/- 0.1 (jitter = 0.2) 7.276
60 images/sec: 109.2 +/- 0.1 (jitter = 0.3) 7.260
70 images/sec: 109.1 +/- 0.0 (jitter = 0.4) 7.261
80 images/sec: 109.0 +/- 0.1 (jitter = 0.3) 7.249
90 images/sec: 109.0 +/- 0.1 (jitter = 0.3) 7.271
100 images/sec: 109.0 +/- 0.1 (jitter = 0.3) 7.272
----------------------------------------------------------------
total images/sec: 108.99
----------------------------------------------------------------
With FP16 enabled:
Step Img/sec total_loss
1 images/sec: 161.6 +/- 0.0 (jitter = 0.0) 7.244
10 images/sec: 162.0 +/- 0.2 (jitter = 0.5) 7.260
20 images/sec: 161.8 +/- 0.3 (jitter = 0.4) 7.307
30 images/sec: 161.8 +/- 0.2 (jitter = 0.5) 7.257
40 images/sec: 161.8 +/- 0.1 (jitter = 0.6) 7.259
50 images/sec: 161.8 +/- 0.1 (jitter = 0.5) 7.240
60 images/sec: 161.7 +/- 0.1 (jitter = 0.5) 7.265
70 images/sec: 161.7 +/- 0.1 (jitter = 0.4) 7.257
80 images/sec: 161.7 +/- 0.1 (jitter = 0.5) 7.269
90 images/sec: 161.6 +/- 0.1 (jitter = 0.5) 7.243
100 images/sec: 161.6 +/- 0.1 (jitter = 0.5) 7.250
----------------------------------------------------------------
total images/sec: 161.54
----------------------------------------------------------------
##RX570 16GB
Step Img/sec total_loss
1 images/sec: 34.2 +/- 0.0 (jitter = 0.0) 9.924
10 images/sec: 34.2 +/- 0.0 (jitter = 0.0) 9.608
20 images/sec: 34.2 +/- 0.0 (jitter = 0.0) 9.733
30 images/sec: 34.2 +/- 0.0 (jitter = 0.0) 9.907
40 images/sec: 34.2 +/- 0.0 (jitter = 0.0) 9.926
50 images/sec: 34.2 +/- 0.0 (jitter = 0.0) 10.055
60 images/sec: 34.2 +/- 0.0 (jitter = 0.0) 10.276
70 images/sec: 34.2 +/- 0.0 (jitter = 0.0) 9.989
80 images/sec: 34.2 +/- 0.0 (jitter = 0.0) 9.863
90 images/sec: 34.2 +/- 0.0 (jitter = 0.0) 10.227
100 images/sec: 34.2 +/- 0.0 (jitter = 0.0) 10.029
----------------------------------------------------------------
total images/sec: 34.19
----------------------------------------------------------------
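#Summary
##Results
I threw the chart together in LibreOffice, so it is a bit rough. A per-benchmark breakdown is hard to read, so I left it out.
I will remake it in WPS Office later.
(The values shown as zero are measurements I only noticed were wrong after swapping in the RadeonⅦ, so I decided to re-measure them later.)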
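Until the chart is redone, the totals without a power limit can be lined up as a plain table. The sketch below only collects the `total images/sec` values already reported in this article; `TOTALS` and `markdown_table` are hypothetical names, and `None` marks combinations I have not measured.

```python
GPUS = ["VegaFE", "RadeonVII", "RX570 16GB", "GTX1080Ti"]

# total images/sec, batch size 32, no power limit, FP32
TOTALS = {
    "inception3": [86.99, 130.62, 45.94, 134.78],
    "resnet50": [155.38, 210.18, 83.05, 196.62],
    "resnet152": [63.02, 88.41, 35.49, None],
    "alexnet": [855.59, 979.09, 366.67, None],
    "vgg16": [93.35, 124.61, 34.19, None],
}


def markdown_table(totals, gpus):
    """Render the results as a Markdown table, one row per model."""
    lines = [
        "| model | " + " | ".join(gpus) + " |",
        "|---" * (len(gpus) + 1) + "|",
    ]
    for model, values in totals.items():
        cells = ["-" if v is None else f"{v:.2f}" for v in values]
        lines.append(f"| {model} | " + " | ".join(cells) + " |")
    return "\n".join(lines)


print(markdown_table(TOTALS, GPUS))
```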
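##Overall assessment of the VegaFE
It is better not to use FP16. A power limit does not reduce performance drastically, so it is reasonable to leave one on for continuous operation.
##Overall assessment of the RadeonⅦ
This is without doubt the fastest Radeon for deep learning to date. It reportedly sells for about 100,000 yen, so AMD is probably not making much money on it.
FP16 also gives a decent speedup, so use it if you can. Without a power limit the fan gets very loud, so cooling measures are essential.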
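The FP16 verdicts for both cards can be checked against the totals reported above. The sketch below divides each FP16 result by the FP32 result, once against the run without a power limit and once against the power-limited run where I have one. I did not record the power state for every FP16 run (only the VegaFE ResNet-50 run is stated to keep the 150 W limit), so which baseline applies is an assumption. `RESULTS`, `ratio` and `fmt` are hypothetical names.

```python
# (FP32 without limit, FP32 power-limited or None if not reported, FP16)
RESULTS = {
    ("VegaFE", "resnet50"): (155.38, 138.19, 148.86),
    ("VegaFE", "resnet152"): (63.02, None, 56.32),
    ("VegaFE", "alexnet"): (855.59, 797.83, 560.68),
    ("VegaFE", "vgg16"): (93.35, 78.64, 52.72),
    ("RadeonVII", "resnet50"): (210.18, 207.12, 287.15),
    ("RadeonVII", "resnet152"): (88.41, 81.90, 116.19),
    ("RadeonVII", "alexnet"): (979.09, 955.23, 970.86),
    ("RadeonVII", "vgg16"): (124.61, 108.99, 161.54),
}


def ratio(fp16, baseline):
    """FP16 throughput relative to an FP32 baseline, or None if missing."""
    return None if baseline is None else fp16 / baseline


def fmt(value):
    """Format a ratio for the table, showing n/a for missing baselines."""
    return "  n/a" if value is None else f"{value:5.2f}"


for (gpu, model), (unlimited, limited, fp16) in RESULTS.items():
    print(
        f"{gpu:10s} {model:10s} "
        f"vs no limit {fmt(ratio(fp16, unlimited))}  "
        f"vs limited {fmt(ratio(fp16, limited))}"
    )
```

A ratio below 1.00 means FP16 was slower than FP32 for that baseline.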
#CUDA (work in progress)
I also tested with CUDA.
##Environment
GPU GTX1080Ti
Tensorflow-gpu 1.12.0
python 3.6.8
##InceptionV3
Step Img/sec total_loss
1 images/sec: 133.7 +/- 0.0 (jitter = 0.0) 7.345
10 images/sec: 133.9 +/- 0.8 (jitter = 1.8) 7.438
20 images/sec: 134.5 +/- 0.5 (jitter = 1.0) 7.323
30 images/sec: 134.7 +/- 0.3 (jitter = 0.9) 7.496
40 images/sec: 134.8 +/- 0.3 (jitter = 0.9) 7.332
50 images/sec: 134.8 +/- 0.2 (jitter = 0.8) 7.318
60 images/sec: 134.8 +/- 0.2 (jitter = 0.8) 7.392
70 images/sec: 134.8 +/- 0.2 (jitter = 0.8) 7.320
80 images/sec: 134.8 +/- 0.2 (jitter = 0.8) 7.361
90 images/sec: 134.8 +/- 0.2 (jitter = 0.8) 7.497
100 images/sec: 134.8 +/- 0.2 (jitter = 0.7) 7.400
----------------------------------------------------------------
total images/sec: 134.78
----------------------------------------------------------------
##ResNet-50
Step Img/sec total_loss
1 images/sec: 191.2 +/- 0.0 (jitter = 0.0) 8.458
10 images/sec: 190.3 +/- 2.5 (jitter = 10.6) 7.997
20 images/sec: 191.8 +/- 1.5 (jitter = 7.1) 8.260
30 images/sec: 193.4 +/- 1.1 (jitter = 6.0) 8.336
40 images/sec: 193.9 +/- 0.9 (jitter = 4.6) 8.195
50 images/sec: 194.1 +/- 0.7 (jitter = 4.5) 7.749
60 images/sec: 195.0 +/- 0.7 (jitter = 4.7) 8.065
70 images/sec: 195.9 +/- 0.6 (jitter = 5.1) 8.474
80 images/sec: 196.6 +/- 0.6 (jitter = 5.0) 8.287
90 images/sec: 196.7 +/- 0.6 (jitter = 4.7) 8.003
100 images/sec: 196.8 +/- 0.5 (jitter = 4.6) 8.007
----------------------------------------------------------------
total images/sec: 196.62
----------------------------------------------------------------
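#Memo for running the RX570 16GB
It kept complaining that gfx803_32.cd.pdb.txt was missing, so I copied gfx803_36.cd.pdb.txt, named the copy gfx803_32.cd.pdb.txt, and forced it to run. The files are in the following path: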
/opt/rocm/miopen/share/miopen/db$ ls
gfx803_32.cd.pdb.txt gfx900_56.cd.pdb.txt gfx906_60.cd.pdb.txt
gfx803_36.cd.pdb.txt gfx900_64.cd.pdb.txt gfx906_64.cd.pdb.txt
gfx803_64.cd.pdb.txt gfx906_56.cd.pdb.txt
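The workaround is a plain file copy. As a sketch, it could be wrapped so that an existing database is never overwritten; `clone_perf_db` is a hypothetical helper, and whether a 36-CU database is really appropriate for a 32-CU card is something I have not verified.

```python
import shutil
from pathlib import Path


def clone_perf_db(db_dir, source_name, target_name):
    """Copy `source_name` to `target_name` in `db_dir` unless the target exists."""
    db_dir = Path(db_dir)
    source = db_dir / source_name
    target = db_dir / target_name
    if target.exists():
        return target  # leave an existing database untouched
    shutil.copyfile(source, target)
    return target
```

For the case above the call would be `clone_perf_db("/opt/rocm/miopen/share/miopen/db", "gfx803_36.cd.pdb.txt", "gfx803_32.cd.pdb.txt")`, run with enough permissions to write under /opt/rocm.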
#References
https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/173
https://qiita.com/syoyo/items/58bc1ed7558660defe29
https://github.com/RadeonOpenCompute/ROC-smi
#todo
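Also measure the RX470 8GB
Fill in the missing VegaFE measurements
Rework the results into an easier-to-read, chart-centered format
Rewrite this as a new article
Measuring by hand every time is tedious, so it may be worth writing a .sh script to batch the runs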
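As a first step toward that last item, the command lines used in this article can be generated rather than typed. The sketch below is in Python rather than .sh and only prints the commands. `benchmark_command` and `MODELS` are hypothetical names, while the flags are the ones used throughout this article.

```python
MODELS = ["inception3", "resnet50", "resnet152", "alexnet", "vgg16"]


def benchmark_command(model, batch_size=32, num_gpus=1, use_fp16=False):
    """Return the tf_cnn_benchmarks argv for one run."""
    cmd = [
        "python",
        "./tf_cnn_benchmarks.py",
        f"--num_gpus={num_gpus}",
        "--model",
        model,
        "--batch_size",
        str(batch_size),
    ]
    if use_fp16:
        cmd.append("--use_fp16")
    return cmd


# Print every FP32 and FP16 command; nothing is executed here.
for model in MODELS:
    for use_fp16 in (False, True):
        print(" ".join(benchmark_command(model, use_fp16=use_fp16)))
```

Each list could be passed to `subprocess.run` from the `tf_cnn_benchmarks` directory, with the output saved to one log file per run.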