Running TensorFlow benchmarks on AMD GPUs to get a rough picture of deep learning performance (draft)

Posted at 2019-03-08

I actually wanted to run this on the GTX1080Ti as well, but that requires installing CUDA 9, and I did not have time to bring up NVIDIA Docker.
Here I test with two cards: the Radeon Vega FE (Vega10) and the RadeonⅦ (Vega20).

The version published on 03/08 is a draft, so please bear with the gaps in it.
I will revise it in the next few days, so the content of this article may change.

(On 3/20 and 3/21 I added part of the InceptionV3 benchmark results for the GTX1080Ti and the RX470 16GB.)

Evaluating the deep learning performance of AMD GPUs

Let's see how well TensorFlow performs on ROCm.

The environment was set up as described here:
https://qiita.com/_JG1WWK/items/bfb59e2589b82bf5a8b3

CPU Xeon E5-2603 v4
MB msi-x99 Gaming7
RAM DDR4-2400 32GB
GPU0 NVIDIA GTX1080Ti
GPU1 AMD Vega Frontier Edition (swapped for the RadeonⅦ during benchmarking)
OS Ubuntu16.04.5 LTS kernel version 4.13

The tests below use the configuration above.

Table of contents

・Benchmark preparation commands
・InceptionV3
・ResNet-50
・ResNet-152
・AlexNet
・VGG16
・Summary of results

Preparation

Run the steps below. Anaconda or Miniconda is required.

$ git clone https://github.com/tensorflow/benchmarks.git -b cnn_tf_v1.12_compatible
$ cd ~/benchmarks/scripts/tf_cnn_benchmarks
$ conda create -n rocm_tf_test python=3.6
$ conda activate rocm_tf_test
$ python --version
Python 3.5.6 :: Anaconda, Inc.
$  pip install tensorflow-rocm==1.12

About the power limit

To apply a power limit, use the following command:

sudo /opt/rocm/bin/rocm-smi --setpoweroverdrive 120

The command above sets the limit to 120 W.

In this article, whenever a power limit is applied to the VegaFE, the cap is set to 150 W.

========================        ROCm System Management Interface        ========================

          ******WARNING******

          Operating your AMD GPU outside of official AMD specifications or outside of
          factory settings, including but not limited to the conducting of overclocking,
          over-volting or under-volting (including use of this interface software,
          even if such software has been directly or indirectly provided by AMD or otherwise
          affiliated in any way with AMD), may cause damage to your AMD GPU, system components
          and/or result in system failure, as well as cause other problems.
          DAMAGES CAUSED BY USE OF YOUR AMD GPU OUTSIDE OF OFFICIAL AMD SPECIFICATIONS OR
          OUTSIDE OF FACTORY SETTINGS ARE NOT COVERED UNDER ANY AMD PRODUCT WARRANTY AND
          MAY NOT BE COVERED BY YOUR BOARD OR SYSTEM MANUFACTURER'S WARRANTY.
          Please use this utility with caution.

Do you accept these terms? [y/N] y

You will be asked this, so check that you have not entered an unreasonable value before accepting.
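
One way to guard against an odd value is to validate the wattage before the command is ever assembled. The following is only a sketch: the function name and the 50 to 300 W sanity bounds are hypothetical choices for illustration, not AMD limits. Only the `rocm-smi` path and the `--setpoweroverdrive` option come from this article.

```python
ROCM_SMI = "/opt/rocm/bin/rocm-smi"  # path used in this article


def power_limit_command(watts, min_watts=50, max_watts=300):
    """Return the rocm-smi argv for a power cap, rejecting out-of-range values.

    min_watts and max_watts are hypothetical sanity bounds, not AMD limits.
    """
    if isinstance(watts, bool) or not isinstance(watts, int):
        raise TypeError("watts must be an integer")
    if not min_watts <= watts <= max_watts:
        raise ValueError(
            "refusing power limit of {} W (allowed {}-{} W)".format(
                watts, min_watts, max_watts
            )
        )
    return ["sudo", ROCM_SMI, "--setpoweroverdrive", str(watts)]


# Only prints the command; nothing is executed here.
print(" ".join(power_limit_command(150)))
```

The function only builds the argument list. Running it, and answering the confirmation prompt shown above, is left to the user.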

InceptionV3

Download the benchmark repository for TensorFlow 1.12 and run it.

The benchmark follows the procedure on the site below.
https://www.leadergpu.com/articles/431-tensorflow-inception-v3-benchmark

The command is as follows:

$ python ./tf_cnn_benchmarks.py --num_gpus=1  --model inception3 --batch_size 32

VegaFE measurement



(A large amount of log output is omitted here.)
Done warm up
Step    Img/sec total_loss
1   images/sec: 88.7 +/- 0.0 (jitter = 0.0) 7.383
10  images/sec: 87.2 +/- 0.3 (jitter = 1.3) 7.431
20  images/sec: 87.3 +/- 0.3 (jitter = 1.3) 7.336
30  images/sec: 87.2 +/- 0.3 (jitter = 1.0) 7.449
40  images/sec: 87.1 +/- 0.2 (jitter = 1.0) 7.387
50  images/sec: 87.1 +/- 0.2 (jitter = 0.9) 7.370
60  images/sec: 87.1 +/- 0.2 (jitter = 0.8) 7.435
70  images/sec: 87.1 +/- 0.2 (jitter = 0.8) 7.339
80  images/sec: 87.1 +/- 0.2 (jitter = 0.8) 7.343
90  images/sec: 87.1 +/- 0.1 (jitter = 0.8) 7.489
100 images/sec: 87.0 +/- 0.1 (jitter = 0.8) 7.435
----------------------------------------------------------------
total images/sec: 86.99
----------------------------------------------------------------
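
Since every run in this article ends with the same summary block, the throughput can be pulled out of a saved log programmatically. This is a minimal sketch that assumes the output format shown above; `parse_total` and `parse_steps` are illustrative names and not part of tf_cnn_benchmarks.

```python
import re

# Matches the final summary line, e.g. "total images/sec: 86.99"
TOTAL_RE = re.compile(r"^total images/sec:\s*([0-9.]+)\s*$", re.MULTILINE)

# Matches per-step lines, e.g. "10  images/sec: 87.2 +/- 0.3 (jitter = 1.3) 7.431"
STEP_RE = re.compile(
    r"^(\d+)\s+images/sec:\s*([0-9.]+)\s*\+/-\s*([0-9.]+)"
    r"\s*\(jitter = ([0-9.]+)\)\s*(\S+)\s*$",
    re.MULTILINE,
)


def parse_total(log_text):
    """Return the 'total images/sec' value as a float, or None if absent."""
    match = TOTAL_RE.search(log_text)
    return float(match.group(1)) if match else None


def parse_steps(log_text):
    """Return a list of (step, images_per_sec, total_loss) tuples."""
    rows = []
    for match in STEP_RE.finditer(log_text):
        # float() also accepts the "nan" losses that appear in the AlexNet runs.
        rows.append((int(match.group(1)), float(match.group(2)), float(match.group(5))))
    return rows


sample = """\
Step    Img/sec total_loss
1   images/sec: 88.7 +/- 0.0 (jitter = 0.0) 7.383
100 images/sec: 87.0 +/- 0.1 (jitter = 0.8) 7.435
----------------------------------------------------------------
total images/sec: 86.99
----------------------------------------------------------------
"""
print(parse_total(sample))
print(parse_steps(sample))
```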


Roughly 87 images/s. Next, I lower the power limit and run it again.

The result:

1   images/sec: 81.2 +/- 0.0 (jitter = 0.0) 7.354
10  images/sec: 82.0 +/- 0.3 (jitter = 0.8) 7.432
20  images/sec: 81.8 +/- 0.2 (jitter = 0.8) 7.320
30  images/sec: 81.9 +/- 0.2 (jitter = 0.7) 7.478
40  images/sec: 81.8 +/- 0.2 (jitter = 0.8) 7.381
50  images/sec: 81.7 +/- 0.2 (jitter = 0.8) 7.364
60  images/sec: 81.7 +/- 0.2 (jitter = 0.7) 7.444
70  images/sec: 81.6 +/- 0.2 (jitter = 0.8) 7.345
80  images/sec: 81.6 +/- 0.1 (jitter = 0.8) 7.397
90  images/sec: 81.5 +/- 0.1 (jitter = 0.9) 7.503
100 images/sec: 81.5 +/- 0.1 (jitter = 0.9) 7.409
----------------------------------------------------------------
total images/sec: 81.48
----------------------------------------------------------------

RadeonⅦ measurement

Measured at stock settings

1   images/sec: 131.5 +/- 0.0 (jitter = 0.0)    7.373
10  images/sec: 130.6 +/- 0.8 (jitter = 0.5)    7.430
20  images/sec: 130.6 +/- 0.5 (jitter = 0.5)    7.356
30  images/sec: 130.5 +/- 0.4 (jitter = 0.5)    7.476
40  images/sec: 130.5 +/- 0.4 (jitter = 0.4)    7.370
50  images/sec: 130.4 +/- 0.4 (jitter = 0.4)    7.378
60  images/sec: 130.5 +/- 0.3 (jitter = 0.5)    7.434
70  images/sec: 130.5 +/- 0.3 (jitter = 0.4)    7.318
80  images/sec: 130.5 +/- 0.3 (jitter = 0.5)    7.361
90  images/sec: 130.6 +/- 0.2 (jitter = 0.5)    7.475
100 images/sec: 130.7 +/- 0.2 (jitter = 0.6)    7.427
----------------------------------------------------------------
total images/sec: 130.62
----------------------------------------------------------------

As expected of the RadeonⅦ.
Next, I measure with the power limit set to 150 W.

Step    Img/sec total_loss
1   images/sec: 118.2 +/- 0.0 (jitter = 0.0)    7.383
10  images/sec: 117.6 +/- 0.4 (jitter = 0.2)    7.442
20  images/sec: 117.5 +/- 0.3 (jitter = 0.2)    7.339
30  images/sec: 117.4 +/- 0.3 (jitter = 0.3)    7.501
40  images/sec: 117.5 +/- 0.2 (jitter = 0.3)    7.408
50  images/sec: 117.4 +/- 0.2 (jitter = 0.3)    7.361
60  images/sec: 117.4 +/- 0.2 (jitter = 0.3)    7.398
70  images/sec: 117.3 +/- 0.2 (jitter = 0.4)    7.279
80  images/sec: 117.2 +/- 0.2 (jitter = 0.4)    7.361
90  images/sec: 117.2 +/- 0.2 (jitter = 0.4)    7.464
100 images/sec: 117.2 +/- 0.2 (jitter = 0.5)    7.418
----------------------------------------------------------------
total images/sec: 117.12
----------------------------------------------------------------

That is what you would expect from 7 nm: the drop is only about 13.5 images/s (130.62 to 117.12).

RX570 16GB

Step    Img/sec total_loss
1   images/sec: 45.9 +/- 0.0 (jitter = 0.0) 7.363
10  images/sec: 45.9 +/- 0.0 (jitter = 0.0) 7.425
20  images/sec: 45.9 +/- 0.0 (jitter = 0.0) 7.346
30  images/sec: 45.9 +/- 0.0 (jitter = 0.0) 7.468
40  images/sec: 45.9 +/- 0.0 (jitter = 0.0) 7.396
50  images/sec: 45.9 +/- 0.0 (jitter = 0.0) 7.350
60  images/sec: 45.9 +/- 0.0 (jitter = 0.0) 7.419
70  images/sec: 45.9 +/- 0.0 (jitter = 0.0) 7.323
80  images/sec: 45.9 +/- 0.0 (jitter = 0.0) 7.345
90  images/sec: 45.9 +/- 0.0 (jitter = 0.0) 7.485
100 images/sec: 45.9 +/- 0.0 (jitter = 0.0) 7.424
----------------------------------------------------------------
total images/sec: 45.94
----------------------------------------------------------------

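Passable, I suppose? The card behaved very strangely during the benchmark, so the result may not be reproducible unless you follow the notes on running the RX570 16GB near the end of this article.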

ResNet50

https://www.leadergpu.com/articles/429-tensorflow-resnet-50-benchmark
Next, I measure ResNet50.

$ python ./tf_cnn_benchmarks.py --num_gpus=1 --model resnet50 --batch_size 32

VegaFE

For now, this runs at full power with --setsclk 7.

Step    Img/sec total_loss
1   images/sec: 149.8 +/- 0.0 (jitter = 0.0)    8.458
10  images/sec: 155.7 +/- 1.3 (jitter = 0.9)    7.997
20  images/sec: 155.5 +/- 0.8 (jitter = 1.0)    8.260
30  images/sec: 155.6 +/- 0.6 (jitter = 0.9)    8.339
40  images/sec: 155.6 +/- 0.5 (jitter = 1.0)    8.190
50  images/sec: 155.6 +/- 0.5 (jitter = 1.1)    7.757
60  images/sec: 155.6 +/- 0.4 (jitter = 1.3)    8.069
70  images/sec: 155.7 +/- 0.4 (jitter = 1.2)    8.481
80  images/sec: 155.7 +/- 0.4 (jitter = 1.1)    8.290
90  images/sec: 155.6 +/- 0.3 (jitter = 1.1)    8.032
100 images/sec: 155.5 +/- 0.3 (jitter = 1.1)    8.021
----------------------------------------------------------------
total images/sec: 155.38
----------------------------------------------------------------
========================        ROCm System Management Interface        ========================
================================================================================================
GPU   Temp   AvgPwr   SCLK    MCLK    PCLK           Fan     Perf    PwrCap   SCLK OD   MCLK OD  GPU%
0     65.0c  233.0W   1528Mhz 945Mhz  8.0GT/s, x16   20.78%  manual  220.0W   0%        0%       99%      
1     N/A    N/A      N/A     N/A     N/A            0%      N/A     N/A      N/A       N/A      N/A      
================================================================================================
========================               End of ROCm SMI Log              ========================


At full power it draws as much as 230 W.
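
The power reading can also be extracted from a saved `rocm-smi` table rather than read by eye. This is a sketch that assumes the column layout shown above (GPU index, Temp, AvgPwr as the first three fields). Other `rocm-smi` versions may print a different layout, and `parse_rocm_smi_power` is an illustrative name.

```python
def parse_rocm_smi_power(smi_text):
    """Map GPU index -> average power in watts (None when reported as N/A)."""
    readings = {}
    for line in smi_text.splitlines():
        fields = line.split()
        # Data rows start with an integer GPU index followed by Temp and AvgPwr.
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        avg_pwr = fields[2]
        if avg_pwr.endswith("W"):
            readings[int(fields[0])] = float(avg_pwr[:-1])
        else:
            readings[int(fields[0])] = None
    return readings


sample = """\
GPU   Temp   AvgPwr   SCLK    MCLK    PCLK           Fan     Perf    PwrCap   SCLK OD   MCLK OD  GPU%
0     65.0c  233.0W   1528Mhz 945Mhz  8.0GT/s, x16   20.78%  manual  220.0W   0%        0%       99%
1     N/A    N/A      N/A     N/A     N/A            0%      N/A     N/A      N/A       N/A      N/A
"""
print(parse_rocm_smi_power(sample))
```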

Next, I set the limit to 150 W and measure.

1   images/sec: 133.4 +/- 0.0 (jitter = 0.0)    8.458
10  images/sec: 138.0 +/- 1.3 (jitter = 0.4)    7.997
20  images/sec: 138.1 +/- 0.7 (jitter = 0.5)    8.260
30  images/sec: 138.1 +/- 0.6 (jitter = 0.5)    8.337
40  images/sec: 138.0 +/- 0.5 (jitter = 0.5)    8.181
50  images/sec: 138.1 +/- 0.4 (jitter = 0.5)    7.753
60  images/sec: 138.2 +/- 0.4 (jitter = 0.6)    8.052
70  images/sec: 138.3 +/- 0.4 (jitter = 0.5)    8.463
80  images/sec: 138.3 +/- 0.3 (jitter = 0.5)    8.274
90  images/sec: 138.3 +/- 0.3 (jitter = 0.5)    8.034
100 images/sec: 138.3 +/- 0.3 (jitter = 0.5)    8.006
----------------------------------------------------------------
total images/sec: 138.19
----------------------------------------------------------------

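Performance drops to roughly 89% of the full-power result (138.19 vs. 155.38 images/s).
Power draw falls by about 80 W in exchange, so the Vega architecture seems to ship with a rather peaky clock setting.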
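
The trade-off can be put into numbers with a few lines of Python. This is a sketch using the VegaFE ResNet50 figures from this section. It assumes that the card draws the full 150 W under the cap, and it takes the 233 W reading from the single `rocm-smi` snapshot above, so the per-watt values are rough estimates.

```python
def retained_fraction(limited_ips, full_ips):
    """Fraction of full-power throughput that survives the power cap."""
    return limited_ips / full_ips


def images_per_watt(images_per_sec, watts):
    """Throughput per watt of board power."""
    return images_per_sec / watts


full_ips, full_watts = 155.38, 233.0    # full power, AvgPwr from rocm-smi
capped_ips, capped_watts = 138.19, 150.0  # assumes the card sits at the 150 W cap

print(f"throughput retained: {retained_fraction(capped_ips, full_ips):.1%}")
print(f"full power: {images_per_watt(full_ips, full_watts):.3f} images/s per W")
print(f"150 W cap:  {images_per_watt(capped_ips, capped_watts):.3f} images/s per W")
```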

As an experiment, I also enable FP16. The power limit stays the same.

$ python ./tf_cnn_benchmarks.py --num_gpus=1 --model resnet50 --batch_size 32 --use_fp16

I ran it with the command above.

Step    Img/sec total_loss
1   images/sec: 145.9 +/- 0.0 (jitter = 0.0)    7.971
10  images/sec: 148.2 +/- 1.3 (jitter = 3.4)    8.046
20  images/sec: 148.9 +/- 0.9 (jitter = 1.9)    8.330
30  images/sec: 149.2 +/- 0.7 (jitter = 1.9)    8.094
40  images/sec: 149.0 +/- 0.7 (jitter = 1.7)    8.683
50  images/sec: 149.0 +/- 0.6 (jitter = 1.9)    8.299
60  images/sec: 148.9 +/- 0.5 (jitter = 1.7)    8.334
70  images/sec: 148.9 +/- 0.5 (jitter = 1.9)    8.153
80  images/sec: 148.9 +/- 0.4 (jitter = 1.8)    8.132
90  images/sec: 148.9 +/- 0.4 (jitter = 1.8)    8.380
100 images/sec: 148.9 +/- 0.4 (jitter = 1.7)    8.333
----------------------------------------------------------------
total images/sec: 148.86
----------------------------------------------------------------

Performance seems to have gone up slightly: 148.86 vs. 138.19 images/s under the same power limit, about 8%.

RadeonⅦ

Measured at stock settings.

Done warm up
Step    Img/sec total_loss
1   images/sec: 207.0 +/- 0.0 (jitter = 0.0)    8.458
10  images/sec: 210.5 +/- 2.6 (jitter = 1.4)    7.998
20  images/sec: 210.3 +/- 1.8 (jitter = 1.4)    8.261
30  images/sec: 210.0 +/- 1.4 (jitter = 1.4)    8.335
40  images/sec: 210.1 +/- 1.2 (jitter = 1.3)    8.184
50  images/sec: 209.9 +/- 1.0 (jitter = 1.3)    7.751
60  images/sec: 210.0 +/- 1.0 (jitter = 1.4)    8.066
70  images/sec: 210.2 +/- 0.9 (jitter = 1.4)    8.463
80  images/sec: 210.5 +/- 0.8 (jitter = 1.4)    8.294
90  images/sec: 210.4 +/- 0.8 (jitter = 1.4)    8.012
100 images/sec: 210.3 +/- 0.7 (jitter = 1.4)    8.014
----------------------------------------------------------------
total images/sec: 210.18
----------------------------------------------------------------

Now measured with the power limit applied.

Step    Img/sec total_loss
1   images/sec: 209.4 +/- 0.0 (jitter = 0.0)    8.458
10  images/sec: 207.6 +/- 1.4 (jitter = 0.5)    7.997
20  images/sec: 207.9 +/- 0.8 (jitter = 0.7)    8.262
30  images/sec: 207.5 +/- 0.7 (jitter = 0.8)    8.337
40  images/sec: 207.6 +/- 0.6 (jitter = 0.8)    8.191
50  images/sec: 207.6 +/- 0.5 (jitter = 0.8)    7.760
60  images/sec: 207.6 +/- 0.5 (jitter = 0.8)    8.059
70  images/sec: 207.5 +/- 0.5 (jitter = 0.9)    8.469
80  images/sec: 207.6 +/- 0.4 (jitter = 1.0)    8.292
90  images/sec: 207.3 +/- 0.5 (jitter = 1.0)    8.021
100 images/sec: 207.3 +/- 0.5 (jitter = 0.9)    8.001
----------------------------------------------------------------
total images/sec: 207.12
----------------------------------------------------------------

On the latest RadeonⅦ, using FP16 does make things noticeably faster.

Step    Img/sec total_loss
1   images/sec: 289.6 +/- 0.0 (jitter = 0.0)    7.976
10  images/sec: 288.6 +/- 2.8 (jitter = 2.9)    8.040
20  images/sec: 288.1 +/- 2.1 (jitter = 1.5)    8.314
30  images/sec: 288.7 +/- 1.6 (jitter = 1.1)    8.075
40  images/sec: 288.9 +/- 1.3 (jitter = 1.2)    8.678
50  images/sec: 288.3 +/- 1.3 (jitter = 1.2)    8.303
60  images/sec: 286.7 +/- 1.5 (jitter = 1.7)    8.346
70  images/sec: 287.0 +/- 1.4 (jitter = 1.6)    8.166
80  images/sec: 286.9 +/- 1.3 (jitter = 1.8)    8.147
90  images/sec: 287.1 +/- 1.2 (jitter = 2.0)    8.411
100 images/sec: 287.4 +/- 1.1 (jitter = 1.8)    8.329
----------------------------------------------------------------
total images/sec: 287.15
----------------------------------------------------------------

RX570 16GB

On ResNet50 the performance was passable.

Step    Img/sec total_loss
1   images/sec: 83.1 +/- 0.0 (jitter = 0.0) 8.458
10  images/sec: 83.1 +/- 0.0 (jitter = 0.1) 7.997
20  images/sec: 83.1 +/- 0.0 (jitter = 0.1) 8.261
30  images/sec: 83.1 +/- 0.0 (jitter = 0.1) 8.338
40  images/sec: 83.1 +/- 0.0 (jitter = 0.1) 8.188
50  images/sec: 83.1 +/- 0.0 (jitter = 0.1) 7.743
60  images/sec: 83.1 +/- 0.0 (jitter = 0.1) 8.072
70  images/sec: 83.1 +/- 0.0 (jitter = 0.1) 8.463
80  images/sec: 83.1 +/- 0.0 (jitter = 0.1) 8.295
90  images/sec: 83.1 +/- 0.0 (jitter = 0.1) 8.032
100 images/sec: 83.1 +/- 0.0 (jitter = 0.1) 8.015
----------------------------------------------------------------
total images/sec: 83.05
----------------------------------------------------------------

ResNet-152

The command is as follows:

$ python ./tf_cnn_benchmarks.py --num_gpus=1 --model resnet152 --batch_size 32

VegaFE

Measured at full power with --setsclk 7

1   images/sec: 65.2 +/- 0.0 (jitter = 0.0) 9.909
10  images/sec: 64.7 +/- 0.2 (jitter = 0.6) 9.650
20  images/sec: 64.6 +/- 0.2 (jitter = 0.6) 9.755
30  images/sec: 64.6 +/- 0.1 (jitter = 0.7) 9.905
40  images/sec: 64.4 +/- 0.1 (jitter = 1.0) 9.960
50  images/sec: 64.2 +/- 0.1 (jitter = 1.2) 10.084
60  images/sec: 63.8 +/- 0.1 (jitter = 1.6) 10.244
70  images/sec: 63.6 +/- 0.1 (jitter = 1.8) 9.957
80  images/sec: 63.4 +/- 0.1 (jitter = 1.5) 9.885
90  images/sec: 63.2 +/- 0.1 (jitter = 1.5) 10.220
100 images/sec: 63.0 +/- 0.1 (jitter = 1.4) 10.038
----------------------------------------------------------------
total images/sec: 63.02
----------------------------------------------------------------

========================        ROCm System Management Interface        ========================
================================================================================================
GPU   Temp   AvgPwr   SCLK    MCLK    PCLK           Fan     Perf    PwrCap   SCLK OD   MCLK OD  GPU%
0     81.0c  251.0W   1269Mhz 945Mhz  8.0GT/s, x16   40.78%  manual  220.0W   0%        0%       100%     
1     N/A    N/A      N/A     N/A     N/A            0%      N/A     N/A      N/A       N/A      N/A      
================================================================================================
========================               End of ROCm SMI Log              ========================

It peaks at around 250 W.

Next, I run it with a 150 W power limit.

(I pasted the wrong result here, so I will re-measure later.)

I also tried FP16 here as an experiment, but the result was poor.

$ TF_ROCM_FUSION_ENABLE=1 python ./tf_cnn_benchmarks.py --num_gpus=1 --model resnet152 --batch_size 32 --use_fp16

Step    Img/sec total_loss
1   images/sec: 57.1 +/- 0.0 (jitter = 0.0) 10.044
10  images/sec: 56.6 +/- 0.2 (jitter = 0.6) 9.827
20  images/sec: 56.6 +/- 0.2 (jitter = 0.7) 9.752
30  images/sec: 56.4 +/- 0.1 (jitter = 0.8) 10.009
40  images/sec: 56.4 +/- 0.1 (jitter = 0.8) 9.830
50  images/sec: 56.4 +/- 0.1 (jitter = 0.8) 10.032
60  images/sec: 56.4 +/- 0.1 (jitter = 0.8) 9.625
70  images/sec: 56.3 +/- 0.1 (jitter = 0.9) 9.874
80  images/sec: 56.3 +/- 0.1 (jitter = 0.9) 9.608
90  images/sec: 56.3 +/- 0.1 (jitter = 0.9) 10.094
100 images/sec: 56.3 +/- 0.1 (jitter = 0.9) 10.082
----------------------------------------------------------------
total images/sec: 56.32
----------------------------------------------------------------

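What is going on here? FP16 is slower than the FP32 run above (56.32 vs. 63.02 images/s) and reaches less than half of the RadeonⅦ FP16 figure (116.19 images/s).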

RadeonⅦ

Step    Img/sec total_loss
1   images/sec: 89.5 +/- 0.0 (jitter = 0.0) 9.906
10  images/sec: 88.6 +/- 0.3 (jitter = 0.6) 9.670
20  images/sec: 88.6 +/- 0.3 (jitter = 0.7) 9.745
30  images/sec: 88.6 +/- 0.2 (jitter = 0.5) 9.937
40  images/sec: 88.6 +/- 0.2 (jitter = 0.5) 9.977
50  images/sec: 88.5 +/- 0.2 (jitter = 0.4) 10.093
60  images/sec: 88.5 +/- 0.2 (jitter = 0.5) 10.304
70  images/sec: 88.5 +/- 0.2 (jitter = 0.7) 10.016
80  images/sec: 88.5 +/- 0.1 (jitter = 0.6) 9.965
90  images/sec: 88.5 +/- 0.1 (jitter = 0.6) 10.213
100 images/sec: 88.4 +/- 0.1 (jitter = 0.6) 10.020
----------------------------------------------------------------
total images/sec: 88.41
----------------------------------------------------------------

With the power limit applied

1   images/sec: 82.8 +/- 0.0 (jitter = 0.0) 9.873
10  images/sec: 82.3 +/- 0.3 (jitter = 0.2) 9.694
20  images/sec: 82.0 +/- 0.2 (jitter = 0.4) 9.757
30  images/sec: 82.0 +/- 0.2 (jitter = 0.3) 9.884
40  images/sec: 82.0 +/- 0.2 (jitter = 0.4) 9.931
50  images/sec: 81.9 +/- 0.2 (jitter = 0.5) 10.092
60  images/sec: 81.9 +/- 0.1 (jitter = 0.5) 10.253
70  images/sec: 81.9 +/- 0.1 (jitter = 0.5) 9.958
80  images/sec: 81.9 +/- 0.1 (jitter = 0.5) 9.902
90  images/sec: 82.0 +/- 0.1 (jitter = 0.4) 10.204
100 images/sec: 81.9 +/- 0.1 (jitter = 0.5) 10.048
----------------------------------------------------------------
total images/sec: 81.90
----------------------------------------------------------------

With FP16 enabled

Step    Img/sec total_loss
1   images/sec: 113.9 +/- 0.0 (jitter = 0.0)    10.044
10  images/sec: 116.1 +/- 0.6 (jitter = 1.1)    9.814
20  images/sec: 116.3 +/- 0.4 (jitter = 0.8)    9.755
30  images/sec: 116.3 +/- 0.3 (jitter = 0.8)    10.030
40  images/sec: 116.3 +/- 0.3 (jitter = 0.7)    9.825
50  images/sec: 116.2 +/- 0.2 (jitter = 0.9)    10.052
60  images/sec: 116.3 +/- 0.2 (jitter = 0.8)    9.654
70  images/sec: 116.3 +/- 0.2 (jitter = 0.8)    9.919
80  images/sec: 116.2 +/- 0.2 (jitter = 0.8)    9.650
90  images/sec: 116.2 +/- 0.2 (jitter = 0.8)    10.083
100 images/sec: 116.2 +/- 0.2 (jitter = 0.8)    10.041
----------------------------------------------------------------
total images/sec: 116.19
----------------------------------------------------------------

RX570 16GB

Done warm up
Step    Img/sec total_loss
1   images/sec: 35.5 +/- 0.0 (jitter = 0.0) 9.936
10  images/sec: 35.5 +/- 0.0 (jitter = 0.1) 9.670
20  images/sec: 35.5 +/- 0.0 (jitter = 0.0) 9.764
30  images/sec: 35.5 +/- 0.0 (jitter = 0.0) 9.933
40  images/sec: 35.5 +/- 0.0 (jitter = 0.0) 9.962
50  images/sec: 35.5 +/- 0.0 (jitter = 0.0) 10.067
60  images/sec: 35.5 +/- 0.0 (jitter = 0.0) 10.225
70  images/sec: 35.5 +/- 0.0 (jitter = 0.0) 9.986
80  images/sec: 35.5 +/- 0.0 (jitter = 0.0) 9.944
90  images/sec: 35.5 +/- 0.0 (jitter = 0.0) 10.225
100 images/sec: 35.5 +/- 0.0 (jitter = 0.0) 10.069
----------------------------------------------------------------
total images/sec: 35.49
----------------------------------------------------------------

AlexNet

Taken from
https://www.leadergpu.com/articles/428-tensorflow-alexnet-benchmark

$ python ./tf_cnn_benchmarks.py --num_gpus=1 --model alexnet --batch_size 32

Run with a batch size of 32.

VegaFE

Run at full power with --setsclk 7

Step    Img/sec total_loss
1   images/sec: 862.0 +/- 0.0 (jitter = 0.0)    nan
10  images/sec: 858.1 +/- 2.5 (jitter = 6.1)    nan
20  images/sec: 858.6 +/- 1.9 (jitter = 6.1)    nan
30  images/sec: 859.4 +/- 1.3 (jitter = 5.3)    nan
40  images/sec: 859.6 +/- 1.1 (jitter = 5.4)    nan
50  images/sec: 860.1 +/- 0.9 (jitter = 5.2)    nan
60  images/sec: 858.4 +/- 1.6 (jitter = 5.3)    nan
70  images/sec: 858.4 +/- 1.4 (jitter = 5.3)    nan
80  images/sec: 856.4 +/- 2.5 (jitter = 5.1)    nan
90  images/sec: 857.2 +/- 2.3 (jitter = 4.9)    nan
100 images/sec: 858.0 +/- 2.1 (jitter = 4.7)    nan
----------------------------------------------------------------
total images/sec: 855.59
----------------------------------------------------------------

Run with the power limit set to 150 W

Step    Img/sec total_loss
1   images/sec: 821.2 +/- 0.0 (jitter = 0.0)    nan
10  images/sec: 800.8 +/- 2.8 (jitter = 4.7)    nan
20  images/sec: 801.2 +/- 1.7 (jitter = 3.2)    nan
30  images/sec: 801.4 +/- 1.3 (jitter = 3.5)    nan
40  images/sec: 802.0 +/- 1.0 (jitter = 3.6)    nan
50  images/sec: 801.9 +/- 0.9 (jitter = 3.9)    nan
60  images/sec: 801.6 +/- 0.8 (jitter = 3.8)    nan
70  images/sec: 799.3 +/- 1.8 (jitter = 3.8)    nan
80  images/sec: 799.6 +/- 1.6 (jitter = 3.4)    nan
90  images/sec: 799.8 +/- 1.4 (jitter = 3.4)    nan
100 images/sec: 799.7 +/- 1.3 (jitter = 3.1)    nan
----------------------------------------------------------------
total images/sec: 797.83
----------------------------------------------------------------

Measured with FP16 enabled

$ TF_ROCM_FUSION_ENABLE=1  python ./tf_cnn_benchmarks.py --num_gpus=1 --model alexnet  --batch_size 32  --use_fp16
Step    Img/sec total_loss
1   images/sec: 570.9 +/- 0.0 (jitter = 0.0)    7.199
10  images/sec: 562.0 +/- 1.8 (jitter = 5.6)    7.199
20  images/sec: 561.1 +/- 1.5 (jitter = 7.0)    7.200
30  images/sec: 562.3 +/- 1.3 (jitter = 7.6)    7.199
40  images/sec: 562.6 +/- 1.0 (jitter = 6.8)    7.199
50  images/sec: 562.8 +/- 0.9 (jitter = 6.2)    7.199
60  images/sec: 560.7 +/- 1.7 (jitter = 7.5)    7.199
70  images/sec: 560.7 +/- 1.4 (jitter = 7.4)    7.199
80  images/sec: 561.2 +/- 1.3 (jitter = 7.1)    7.200
90  images/sec: 561.6 +/- 1.2 (jitter = 6.5)    7.199
100 images/sec: 561.7 +/- 1.1 (jitter = 6.5)    7.199
----------------------------------------------------------------
total images/sec: 560.68
----------------------------------------------------------------

It actually got slower.

RadeonⅦ

This one is also measured at stock settings.

Done warm up
Step    Img/sec total_loss
1   images/sec: 983.5 +/- 0.0 (jitter = 0.0)    nan
10  images/sec: 984.9 +/- 2.6 (jitter = 6.6)    nan
20  images/sec: 980.3 +/- 2.9 (jitter = 8.2)    nan
30  images/sec: 982.1 +/- 2.3 (jitter = 8.4)    nan
40  images/sec: 983.5 +/- 1.8 (jitter = 8.4)    nan
50  images/sec: 983.8 +/- 1.5 (jitter = 7.6)    nan
60  images/sec: 983.5 +/- 1.4 (jitter = 7.7)    nan
70  images/sec: 984.0 +/- 1.2 (jitter = 6.9)    nan
80  images/sec: 980.7 +/- 3.1 (jitter = 6.9)    nan
90  images/sec: 981.6 +/- 2.8 (jitter = 6.1)    nan
100 images/sec: 982.2 +/- 2.5 (jitter = 6.2)    nan
----------------------------------------------------------------
total images/sec: 979.09
----------------------------------------------------------------

With a 150 W power limit applied

Step    Img/sec total_loss
1   images/sec: 965.4 +/- 0.0 (jitter = 0.0)    nan
10  images/sec: 940.8 +/- 13.7 (jitter = 28.5)  nan
20  images/sec: 950.9 +/- 7.6 (jitter = 18.3)   nan
30  images/sec: 956.7 +/- 5.4 (jitter = 14.0)   nan
40  images/sec: 957.2 +/- 4.5 (jitter = 13.5)   nan
50  images/sec: 958.2 +/- 3.7 (jitter = 13.1)   nan
60  images/sec: 959.0 +/- 3.1 (jitter = 12.9)   nan
70  images/sec: 956.1 +/- 4.0 (jitter = 13.1)   nan
80  images/sec: 957.2 +/- 3.6 (jitter = 12.9)   nan
90  images/sec: 957.8 +/- 3.2 (jitter = 11.8)   nan
100 images/sec: 958.3 +/- 2.9 (jitter = 12.8)   nan
----------------------------------------------------------------
total images/sec: 955.23
----------------------------------------------------------------

With FP16 enabled

1   images/sec: 979.4 +/- 0.0 (jitter = 0.0)    7.178
10  images/sec: 973.1 +/- 3.9 (jitter = 5.2)    7.197
20  images/sec: 972.6 +/- 2.3 (jitter = 8.9)    7.198
30  images/sec: 973.1 +/- 1.8 (jitter = 7.7)    7.199
40  images/sec: 972.9 +/- 1.6 (jitter = 7.7)    7.199
50  images/sec: 974.2 +/- 1.6 (jitter = 8.9)    7.199
60  images/sec: 970.5 +/- 4.0 (jitter = 9.5)    7.200
70  images/sec: 970.5 +/- 3.5 (jitter = 9.0)    7.199
80  images/sec: 972.0 +/- 3.1 (jitter = 8.5)    7.200
90  images/sec: 972.9 +/- 2.8 (jitter = 8.6)    7.199
100 images/sec: 974.0 +/- 2.5 (jitter = 9.1)    7.199
----------------------------------------------------------------
total images/sec: 970.86
----------------------------------------------------------------

It cannot beat the Tesla V100, of course, but it appears to perform roughly on par with the P100.

RX570 16GB

Step    Img/sec total_loss
1   images/sec: 344.3 +/- 0.0 (jitter = 0.0)    nan
10  images/sec: 364.6 +/- 2.2 (jitter = 1.7)    nan
20  images/sec: 366.2 +/- 1.2 (jitter = 2.0)    nan
30  images/sec: 366.4 +/- 0.9 (jitter = 2.3)    nan
40  images/sec: 366.6 +/- 0.7 (jitter = 2.2)    nan
50  images/sec: 366.9 +/- 0.5 (jitter = 2.2)    nan
60  images/sec: 367.0 +/- 0.5 (jitter = 2.3)    nan
70  images/sec: 366.9 +/- 0.4 (jitter = 2.4)    nan
80  images/sec: 367.1 +/- 0.4 (jitter = 2.3)    nan
90  images/sec: 367.1 +/- 0.4 (jitter = 2.4)    nan
100 images/sec: 367.1 +/- 0.3 (jitter = 2.4)    nan
----------------------------------------------------------------
total images/sec: 366.67
----------------------------------------------------------------

VGG16

$ python ./tf_cnn_benchmarks.py --num_gpus=1 --model vgg16 --batch_size 32

I ran it with the command above.

Measured on VegaFE

Measured with --setsclk 7

Step    Img/sec total_loss
1   images/sec: 93.7 +/- 0.0 (jitter = 0.0) 7.262
10  images/sec: 94.3 +/- 0.1 (jitter = 0.4) 7.242
20  images/sec: 94.2 +/- 0.1 (jitter = 0.3) 7.273
30  images/sec: 94.2 +/- 0.1 (jitter = 0.3) 7.212
40  images/sec: 94.1 +/- 0.1 (jitter = 0.3) 7.314
50  images/sec: 94.1 +/- 0.1 (jitter = 0.4) 7.276
60  images/sec: 94.1 +/- 0.1 (jitter = 0.3) 7.247
70  images/sec: 94.1 +/- 0.1 (jitter = 0.3) 7.240
80  images/sec: 93.9 +/- 0.1 (jitter = 0.4) 7.265
90  images/sec: 93.6 +/- 0.1 (jitter = 0.4) 7.269
100 images/sec: 93.4 +/- 0.1 (jitter = 0.5) 7.275
----------------------------------------------------------------
total images/sec: 93.35
----------------------------------------------------------------

Power draw hovered around 220 W.

Measured with the power limit lowered to 150 W

Step    Img/sec total_loss
1   images/sec: 79.3 +/- 0.0 (jitter = 0.0) 7.276
10  images/sec: 79.0 +/- 0.2 (jitter = 0.7) 7.235
20  images/sec: 79.1 +/- 0.2 (jitter = 0.5) 7.289
30  images/sec: 79.0 +/- 0.1 (jitter = 0.7) 7.227
40  images/sec: 78.9 +/- 0.1 (jitter = 0.7) 7.273
50  images/sec: 78.9 +/- 0.1 (jitter = 0.7) 7.260
60  images/sec: 78.8 +/- 0.1 (jitter = 0.7) 7.271
70  images/sec: 78.7 +/- 0.1 (jitter = 0.6) 7.264
80  images/sec: 78.7 +/- 0.1 (jitter = 0.6) 7.252
90  images/sec: 78.7 +/- 0.1 (jitter = 0.6) 7.267
100 images/sec: 78.7 +/- 0.1 (jitter = 0.6) 7.267
----------------------------------------------------------------
total images/sec: 78.64
----------------------------------------------------------------

Power draw hovered around 150 W.

About FP16

I tested performance with FP16 enabled on vgg16. The command was:

$ python ./tf_cnn_benchmarks.py --num_gpus=1 --model vgg16 --batch_size 32 --use_fp16

Step    Img/sec total_loss
1   images/sec: 52.9 +/- 0.0 (jitter = 0.0) 7.275
10  images/sec: 52.9 +/- 0.1 (jitter = 0.4) 7.298
20  images/sec: 52.8 +/- 0.1 (jitter = 0.3) 7.294
30  images/sec: 52.8 +/- 0.1 (jitter = 0.2) 7.251
40  images/sec: 52.8 +/- 0.1 (jitter = 0.2) 7.285
50  images/sec: 52.8 +/- 0.1 (jitter = 0.2) 7.251
60  images/sec: 52.8 +/- 0.0 (jitter = 0.2) 7.252
70  images/sec: 52.7 +/- 0.0 (jitter = 0.2) 7.263
80  images/sec: 52.7 +/- 0.0 (jitter = 0.2) 7.266
90  images/sec: 52.8 +/- 0.0 (jitter = 0.2) 7.255
100 images/sec: 52.7 +/- 0.0 (jitter = 0.3) 7.252
----------------------------------------------------------------
total images/sec: 52.72
----------------------------------------------------------------

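Compared with the Tesla P100 and V100 results at
https://www.nttpc.co.jp/gpu/article/benchmark03.html
the FP16 performance here is not good. It actually got slower, even though the hardware should support FP16 natively according to the spec sheet, so there is probably some problem in the ROCm implementation.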
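
To see how FP16 behaves across models on the VegaFE, the totals from this article can be compared as ratios. This is a sketch under one stated assumption: only the ResNet50 FP16 run is explicitly described as power-limited, and pairing the AlexNet and VGG16 FP16 runs with the 150 W FP32 runs is my guess. ResNet-152 is left out because its 150 W FP32 result is missing.

```python
# (FP32 total at the 150 W limit, FP16 total) in images/sec, from the VegaFE logs above.
# Pairing FP16 with the 150 W FP32 run is an assumption for alexnet and vgg16.
VEGA_FE = {
    "resnet50": (138.19, 148.86),
    "alexnet": (797.83, 560.68),
    "vgg16": (78.64, 52.72),
}


def fp16_speedup(fp32_ips, fp16_ips):
    """Ratio above 1.0 means FP16 is faster than FP32."""
    return fp16_ips / fp32_ips


for model, (fp32, fp16) in VEGA_FE.items():
    ratio = fp16_speedup(fp32, fp16)
    verdict = "faster" if ratio > 1.0 else "slower"
    print(f"{model:<10} FP16/FP32 = {ratio:.2f} ({verdict})")
```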

RadeonⅦ

These are the figures at stock settings.

Step    Img/sec total_loss
1   images/sec: 124.6 +/- 0.0 (jitter = 0.0)    7.256
10  images/sec: 125.0 +/- 0.1 (jitter = 0.3)    7.249
20  images/sec: 124.9 +/- 0.1 (jitter = 0.4)    7.275
30  images/sec: 124.9 +/- 0.1 (jitter = 0.5)    7.213
40  images/sec: 124.9 +/- 0.1 (jitter = 0.5)    7.279
50  images/sec: 124.9 +/- 0.1 (jitter = 0.4)    7.278
60  images/sec: 124.8 +/- 0.1 (jitter = 0.5)    7.252
70  images/sec: 124.8 +/- 0.1 (jitter = 0.5)    7.242
80  images/sec: 124.7 +/- 0.1 (jitter = 0.5)    7.253
90  images/sec: 124.7 +/- 0.1 (jitter = 0.5)    7.264
100 images/sec: 124.7 +/- 0.1 (jitter = 0.5)    7.275
----------------------------------------------------------------
total images/sec: 124.61
----------------------------------------------------------------

With a 150 W power limit

Step    Img/sec total_loss
1   images/sec: 108.9 +/- 0.0 (jitter = 0.0)    7.246
10  images/sec: 109.3 +/- 0.1 (jitter = 0.3)    7.241
20  images/sec: 109.4 +/- 0.1 (jitter = 0.2)    7.255
30  images/sec: 109.4 +/- 0.0 (jitter = 0.2)    7.240
40  images/sec: 109.3 +/- 0.1 (jitter = 0.2)    7.305
50  images/sec: 109.2 +/- 0.1 (jitter = 0.2)    7.276
60  images/sec: 109.2 +/- 0.1 (jitter = 0.3)    7.260
70  images/sec: 109.1 +/- 0.0 (jitter = 0.4)    7.261
80  images/sec: 109.0 +/- 0.1 (jitter = 0.3)    7.249
90  images/sec: 109.0 +/- 0.1 (jitter = 0.3)    7.271
100 images/sec: 109.0 +/- 0.1 (jitter = 0.3)    7.272
----------------------------------------------------------------
total images/sec: 108.99
----------------------------------------------------------------

With FP16 enabled

Step    Img/sec total_loss
1   images/sec: 161.6 +/- 0.0 (jitter = 0.0)    7.244
10  images/sec: 162.0 +/- 0.2 (jitter = 0.5)    7.260
20  images/sec: 161.8 +/- 0.3 (jitter = 0.4)    7.307
30  images/sec: 161.8 +/- 0.2 (jitter = 0.5)    7.257
40  images/sec: 161.8 +/- 0.1 (jitter = 0.6)    7.259
50  images/sec: 161.8 +/- 0.1 (jitter = 0.5)    7.240
60  images/sec: 161.7 +/- 0.1 (jitter = 0.5)    7.265
70  images/sec: 161.7 +/- 0.1 (jitter = 0.4)    7.257
80  images/sec: 161.7 +/- 0.1 (jitter = 0.5)    7.269
90  images/sec: 161.6 +/- 0.1 (jitter = 0.5)    7.243
100 images/sec: 161.6 +/- 0.1 (jitter = 0.5)    7.250
----------------------------------------------------------------
total images/sec: 161.54
----------------------------------------------------------------

RX570 16GB

Step    Img/sec total_loss
1   images/sec: 34.2 +/- 0.0 (jitter = 0.0) 9.924
10  images/sec: 34.2 +/- 0.0 (jitter = 0.0) 9.608
20  images/sec: 34.2 +/- 0.0 (jitter = 0.0) 9.733
30  images/sec: 34.2 +/- 0.0 (jitter = 0.0) 9.907
40  images/sec: 34.2 +/- 0.0 (jitter = 0.0) 9.926
50  images/sec: 34.2 +/- 0.0 (jitter = 0.0) 10.055
60  images/sec: 34.2 +/- 0.0 (jitter = 0.0) 10.276
70  images/sec: 34.2 +/- 0.0 (jitter = 0.0) 9.989
80  images/sec: 34.2 +/- 0.0 (jitter = 0.0) 9.863
90  images/sec: 34.2 +/- 0.0 (jitter = 0.0) 10.227
100 images/sec: 34.2 +/- 0.0 (jitter = 0.0) 10.029
----------------------------------------------------------------
total images/sec: 34.19
----------------------------------------------------------------

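Summary

Results

I put this together quickly in LibreOffice, so it is a bit rough. A per-benchmark breakdown is hard to read, so I am leaving it out.
I will redo it in WPS Office later.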

Screenshot from 2019-03-08 22-22-28.png
(The values shown as zero are measurement mistakes that I only noticed after swapping in the RadeonⅦ, so I will re-measure them later.)
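
As a text alternative to the chart, the FP32 totals from the first (non-limited) run of each card can be laid out as a table. This is a sketch: the numbers are copied from the logs in this article, while the dictionary layout and the `format_table` helper are illustrative. A `-` marks combinations that were not measured here.

```python
# FP32 totals in images/sec from the first (non-limited) run of each card above.
RESULTS = {
    "inception3": {"VegaFE": 86.99, "RadeonVII": 130.62, "RX570 16GB": 45.94, "GTX1080Ti": 134.78},
    "resnet50": {"VegaFE": 155.38, "RadeonVII": 210.18, "RX570 16GB": 83.05, "GTX1080Ti": 196.62},
    "resnet152": {"VegaFE": 63.02, "RadeonVII": 88.41, "RX570 16GB": 35.49},
    "alexnet": {"VegaFE": 855.59, "RadeonVII": 979.09, "RX570 16GB": 366.67},
    "vgg16": {"VegaFE": 93.35, "RadeonVII": 124.61, "RX570 16GB": 34.19},
}
GPUS = ["VegaFE", "RadeonVII", "RX570 16GB", "GTX1080Ti"]


def format_table(results, gpus):
    """Render results as fixed-width text, one row per model."""
    header = f"{'model':<10}" + "".join(f"{gpu:>12}" for gpu in gpus)
    lines = [header]
    for model, by_gpu in results.items():
        cells = "".join(
            f"{by_gpu[gpu]:>12.2f}" if gpu in by_gpu else f"{'-':>12}"
            for gpu in gpus
        )
        lines.append(f"{model:<10}" + cells)
    return "\n".join(lines)


table = format_table(RESULTS, GPUS)
print(table)
```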

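Overall assessment of the VegaFE

It is better not to use FP16. Performance does not drop drastically with a power limit applied, so applying one is a reasonable choice for always-on workloads.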

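Overall assessment of the RadeonⅦ

This is without question the fastest deep learning performance of any Radeon so far. It reportedly sells for about 100,000 yen, so AMD is probably not making much money on it.
FP16 also gives a decent speedup, so use it if you can. Without a power limit the fan gets very loud, so heat management is a must.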

CUDA section (work in progress)

I also ran the tests with CUDA.

Environment

GPU GTX1080Ti
Tensorflow-gpu 1.12.0
python 3.6.8

InceptionV3

Step    Img/sec total_loss
1   images/sec: 133.7 +/- 0.0 (jitter = 0.0)    7.345
10  images/sec: 133.9 +/- 0.8 (jitter = 1.8)    7.438
20  images/sec: 134.5 +/- 0.5 (jitter = 1.0)    7.323
30  images/sec: 134.7 +/- 0.3 (jitter = 0.9)    7.496
40  images/sec: 134.8 +/- 0.3 (jitter = 0.9)    7.332
50  images/sec: 134.8 +/- 0.2 (jitter = 0.8)    7.318
60  images/sec: 134.8 +/- 0.2 (jitter = 0.8)    7.392
70  images/sec: 134.8 +/- 0.2 (jitter = 0.8)    7.320
80  images/sec: 134.8 +/- 0.2 (jitter = 0.8)    7.361
90  images/sec: 134.8 +/- 0.2 (jitter = 0.8)    7.497
100 images/sec: 134.8 +/- 0.2 (jitter = 0.7)    7.400
----------------------------------------------------------------
total images/sec: 134.78
----------------------------------------------------------------

Resnet50

Step    Img/sec total_loss
1   images/sec: 191.2 +/- 0.0 (jitter = 0.0)    8.458
10  images/sec: 190.3 +/- 2.5 (jitter = 10.6)   7.997
20  images/sec: 191.8 +/- 1.5 (jitter = 7.1)    8.260
30  images/sec: 193.4 +/- 1.1 (jitter = 6.0)    8.336
40  images/sec: 193.9 +/- 0.9 (jitter = 4.6)    8.195
50  images/sec: 194.1 +/- 0.7 (jitter = 4.5)    7.749
60  images/sec: 195.0 +/- 0.7 (jitter = 4.7)    8.065
70  images/sec: 195.9 +/- 0.6 (jitter = 5.1)    8.474
80  images/sec: 196.6 +/- 0.6 (jitter = 5.0)    8.287
90  images/sec: 196.7 +/- 0.6 (jitter = 4.7)    8.003
100 images/sec: 196.8 +/- 0.5 (jitter = 4.6)    8.007
----------------------------------------------------------------
total images/sec: 196.62
----------------------------------------------------------------

Notes on running the RX570 16GB

It kept complaining that gfx803_32.cd.pdb.txt was missing, so I made a copy of gfx803_36.cd.pdb.txt under that name and forced it to run. The file path is as follows:

/opt/rocm/miopen/share/miopen/db$ ls
gfx803_32.cd.pdb.txt  gfx900_56.cd.pdb.txt  gfx906_60.cd.pdb.txt
gfx803_36.cd.pdb.txt  gfx900_64.cd.pdb.txt  gfx906_64.cd.pdb.txt
gfx803_64.cd.pdb.txt  gfx906_56.cd.pdb.txt
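
The copy step described above can be written as a small helper. This is a sketch: `clone_perf_db` is an illustrative name, and the demonstration runs in a throwaway directory rather than the real `/opt/rocm/miopen/share/miopen/db`, where writing would normally need root. Whether a copied performance database is actually appropriate for the card is untested here; this only reproduces the workaround.

```python
import shutil
import tempfile
from pathlib import Path


def clone_perf_db(db_dir, src_name="gfx803_36.cd.pdb.txt", dst_name="gfx803_32.cd.pdb.txt"):
    """Copy src_name to dst_name inside db_dir unless dst_name already exists."""
    src = Path(db_dir) / src_name
    dst = Path(db_dir) / dst_name
    if not dst.exists():
        shutil.copyfile(str(src), str(dst))
    return dst


# Demonstration in a temporary directory instead of the real ROCm path.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "gfx803_36.cd.pdb.txt").write_text("dummy perf db\n")
    created = clone_perf_db(tmp)
    demo_name = created.name
    demo_content = created.read_text()

print(demo_name)
```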

References

https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/173
https://qiita.com/syoyo/items/58bc1ed7558660defe29
https://github.com/RadeonOpenCompute/ROC-smi

todo

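Measure the RX470 8GB as well
Fill in the missing VegaFE measurements
Rework the results to be chart-based and easier to read
Rewrite this as a new article
Measuring everything by hand is tedious, so a .sh script that runs it all in one go might be worth writing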
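
The last item could also be done in Python instead of a shell script. This is a rough sketch, assuming it is started from the `tf_cnn_benchmarks` directory inside the conda environment. It uses only the flags that appear in this article; `build_command` and `run_all` are illustrative names, and by default it only prints the commands instead of running them.

```python
import subprocess
import sys

MODELS = ["inception3", "resnet50", "resnet152", "alexnet", "vgg16"]


def build_command(model, batch_size=32, num_gpus=1, use_fp16=False):
    """Build the benchmark argv using the flags shown in this article."""
    cmd = [
        sys.executable,
        "./tf_cnn_benchmarks.py",
        "--num_gpus={}".format(num_gpus),
        "--model", model,
        "--batch_size", str(batch_size),
    ]
    if use_fp16:
        cmd.append("--use_fp16")
    return cmd


def run_all(models=MODELS, use_fp16=False, dry_run=True):
    """Run each model in turn; with dry_run=True only return the command lines."""
    outputs = {}
    for model in models:
        cmd = build_command(model, use_fp16=use_fp16)
        if dry_run:
            outputs[model] = " ".join(cmd)
            continue
        # Merge stderr into stdout so the whole log ends up in one string.
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        outputs[model] = proc.stdout
    return outputs


for model, command in run_all().items():
    print(command)
```

Calling `run_all(dry_run=False)` would execute the benchmarks one after another and return the captured logs keyed by model name.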
