
Running TensorFlow on an AMD GPU with ROCm: a CPU vs. GPU comparison


Introduction

In the previous article I walked through getting a TensorFlow sample running on an AMD GPU. This time I measured how much of a speedup the GPU actually provides.


Setup

CPU: Celeron G3930

GPU: Radeon Vega 56, RX570, RX580

Ubuntu: 18.04 LTS (kernel 4.15)

ROCm Version: 2.1

Python: 3.6

Tensorflow: 1.12


Displaying GPU information (rocm-smi)

We use a tool called the ROCm System Management Interface.

Showing GPU utilization

$ sudo /opt/rocm/bin/rocm-smi -u

Output:

========================        ROCm System Management Interface        ========================

================================================================================================
GPU[0] : Cannot get GPU use.
GPU[1] : Current GPU use: 64%
================================================================================================
======================== End of ROCm SMI Log ========================
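Incidentally, this utilization readout is easy to consume from a script. Below is a minimal Python sketch that parses the output shown above; the regex, the rocm-smi path, and the exact log format are assumptions based on this log, and `universal_newlines` is used for Python 3.6 compatibility.

```python
import re
import subprocess

# Matches lines such as: "GPU[1] : Current GPU use: 64%"
USE_RE = re.compile(r"GPU\[(\d+)\].*?Current GPU use:\s*(\d+)%")

def parse_gpu_use(text):
    """Parse `rocm-smi -u` output into {gpu_index: utilization_percent}.

    GPUs that report "Cannot get GPU use." are simply omitted.
    """
    return {int(idx): int(use) for idx, use in USE_RE.findall(text)}

def gpu_use():
    """Run rocm-smi and return the current utilization per GPU."""
    out = subprocess.run(
        ["sudo", "/opt/rocm/bin/rocm-smi", "-u"],
        stdout=subprocess.PIPE, universal_newlines=True, check=True,
    ).stdout
    return parse_gpu_use(out)
```

Applied to the log above, `parse_gpu_use` would return `{1: 64}`, since GPU[0] does not report a usage value.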

Other options


-h, --help show this help message and exit

--load FILE Load Clock, Fan, Performance and Profile settings from FILE

--save FILE Save Clock, Fan, Performance and Profile settings to FILE

-d DEVICE, --device DEVICE Execute command on specified device

-i, --showid Show GPU ID

-v, --showvbios Show VBIOS version

--showhw Show Hardware details

-t, --showtemp Show current temperature

-c, --showclocks Show current clock frequencies

-g, --showgpuclocks Show current GPU clock frequencies

-f, --showfan Show current fan speed

-p, --showperflevel Show current DPM Performance Level

-P, --showpower Show current Average Graphics Package Power Consumption

-o, --showoverdrive Show current GPU Clock OverDrive level

-m, --showmemoverdrive Show current GPU Memory Clock OverDrive level

-M, --showmaxpower Show maximum graphics package power this GPU will consume

-l, --showprofile Show Compute Profile attributes

-s, --showclkfrq Show supported GPU and Memory Clock

-u, --showuse Show current GPU use

-b, --showbw Show estimated PCIe use

-S, --showclkvolt Show supported GPU and Memory Clocks and Voltages

-a, --showallinfo Show Temperature, Fan and Clock values

-r, --resetclocks Reset sclk, mclk and pclk to default

--setsclk LEVEL [LEVEL ...] Set GPU Clock Frequency Level(s) (requires manual Perf level)

--setmclk LEVEL [LEVEL ...] Set GPU Memory Clock Frequency Level(s) (requires manual Perf level)

--setpclk LEVEL [LEVEL ...] Set PCIE Clock Frequency Level(s) (requires manual Perf level)

--setslevel SCLKLEVEL SCLK SVOLT Change GPU Clock frequency (MHz) and Voltage (mV) for a specific Level

--setmlevel MCLKLEVEL MCLK MVOLT Change GPU Memory clock frequency (MHz) and Voltage (mV) for a specific Level

--resetfans Reset fans to automatic (driver) control

--setfan LEVEL Set GPU Fan Speed (Level or %)

--setperflevel LEVEL Set Performance Level

--setoverdrive % Set GPU OverDrive level (requires manual|high Perf level)

--setmemoverdrive % Set GPU Memory Overclock OverDrive level (requires manual|high Perf level)

--setpoweroverdrive WATTS Set the maximum GPU power using Power OverDrive in Watts

--resetpoweroverdrive Set the maximum GPU power back to the device default state

--setprofile SETPROFILE Specify Power Profile level (#) or a quoted string of CUSTOM Profile attributes "# # # #..." (requires manual Perf level)

--resetprofile Reset Power Profile back to default

--autorespond RESPONSE Response to automatically provide for all prompts (NOT RECOMMENDED)

--loglevel ILEVEL How much output will be printed for what program is doing, one of debug/info/warning/error/critical



CPU vs. GPU comparison using CIFAR-10

Move to the tutorial directory:

cd ~/models/tutorials/image/cifar10

Specifying the GPU

The device numbers correspond to the indices shown by rocm-smi.

export HIP_VISIBLE_DEVICES=0  # or 1
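The same selection can also be made from inside a Python script, as long as it happens before TensorFlow is imported, since the runtime reads the variable only once at initialization. A minimal sketch (the helper name is mine, not part of any API):

```python
import os

def select_gpu(index):
    """Expose only the given GPU to the ROCm runtime.

    Must run before `import tensorflow`, because HIP_VISIBLE_DEVICES
    is read only when the runtime initializes.
    """
    os.environ["HIP_VISIBLE_DEVICES"] = str(index)

select_gpu(0)
# import tensorflow as tf  # import *after* selecting the device
```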

Running cifar10_train.py

python3 ./cifar10_train.py

The defaults are apparently as follows, and training would take quite a long time as-is, so adjust the values as needed:

--max_steps 1000000

--batch_size 128

For this experiment I used --max_steps 5000.
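The wall-clock times reported below can be obtained by wrapping the run in a simple timer; here is a sketch of one way to do it in Python (the shell's `time` builtin works just as well):

```python
import subprocess
import time

def timed_run(cmd):
    """Run a command and return (elapsed_seconds, return_code)."""
    start = time.monotonic()
    rc = subprocess.call(cmd)
    return time.monotonic() - start, rc

# elapsed, rc = timed_run(["python3", "./cifar10_train.py", "--max_steps", "5000"])
```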

First, the CPU:

export HIP_VISIBLE_DEVICES=1

python3 ./cifar10_train.py --max_steps 5000

2019-02-14 14:49:36.524011: step 0, loss = 4.68 (334.1 examples/sec; 0.383 sec/batch)

2019-02-14 14:49:42.452843: step 10, loss = 4.61 (215.9 examples/sec; 0.593 sec/batch)
2019-02-14 14:49:48.248498: step 20, loss = 4.49 (220.9 examples/sec; 0.580 sec/batch)
... (snip) ...
2019-02-14 15:37:19.149804: step 4950, loss = 1.33 (222.7 examples/sec; 0.575 sec/batch)
2019-02-14 15:37:24.893713: step 4960, loss = 0.88 (222.8 examples/sec; 0.574 sec/batch)
2019-02-14 15:37:30.636494: step 4970, loss = 1.20 (222.9 examples/sec; 0.574 sec/batch)
2019-02-14 15:37:36.384617: step 4980, loss = 0.95 (222.7 examples/sec; 0.575 sec/batch)
2019-02-14 15:37:42.136948: step 4990, loss = 1.15 (222.5 examples/sec; 0.575 sec/batch)

This took 48 minutes 6 seconds.

Next, the GPU:

export HIP_VISIBLE_DEVICES=0

python3 ./cifar10_train.py --max_steps 5000

2019-02-14 16:25:06.689031: step 0, loss = 4.68 (72.8 examples/sec; 1.758 sec/batch)

2019-02-14 16:25:07.120987: step 10, loss = 4.60 (2963.2 examples/sec; 0.043 sec/batch)
2019-02-14 16:25:07.435396: step 20, loss = 4.52 (4071.1 examples/sec; 0.031 sec/batch)
... (snip) ...
2019-02-14 16:27:56.178652: step 4950, loss = 1.05 (4090.8 examples/sec; 0.031 sec/batch)
2019-02-14 16:27:56.488431: step 4960, loss = 1.08 (4132.0 examples/sec; 0.031 sec/batch)
2019-02-14 16:27:56.796098: step 4970, loss = 1.01 (4160.4 examples/sec; 0.031 sec/batch)
2019-02-14 16:27:57.110136: step 4980, loss = 1.27 (4075.9 examples/sec; 0.031 sec/batch)
2019-02-14 16:27:57.409439: step 4990, loss = 1.06 (4276.6 examples/sec; 0.030 sec/batch)

This took 2 minutes 51 seconds.

The GPU finished in roughly 1/17 of the CPU's time (171 s vs. 2,886 s).
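As a sanity check, the steady-state per-batch times in the logs are consistent with the wall-clock results:

```python
# Cross-check the logged per-batch times against the measured wall-clock times.
steps = 5000
batch_size = 128

cpu_sec_per_batch = 0.575   # steady-state value from the CPU log above
gpu_sec_per_batch = 0.031   # steady-state value from the GPU log above

cpu_total = steps * cpu_sec_per_batch   # 2875 s, close to the measured 48 min 6 s (2886 s)
gpu_total = steps * gpu_sec_per_batch   # 155 s; the measured 171 s includes startup overhead

per_batch_speedup = cpu_total / gpu_total       # ~18.5x from per-batch figures
wall_clock_speedup = (48 * 60 + 6) / (2 * 60 + 51)  # ~16.9x, i.e. roughly 1/17
```

The small gap between the two ratios comes from the GPU's slow first batch (1.758 s at step 0) and other one-time startup costs, which matter more over a short 5000-step run.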