
Building a TensorFlow Environment with an AMD GPU (Installing TensorFlow and Running a Sample)


Introduction

This article describes the steps up to running a TensorFlow sample on an AMD GPU.

The goal is to see whether a former mining machine can be repurposed as a TensorFlow environment using ROCm.

The previous article covered installing ROCm, so this one covers installing TensorFlow and running a sample.


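This article is the summary version.

The details are covered in "Building a TensorFlow Environment with an AMD GPU (Installing TensorFlow and Running a Sample): Detailed Version".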



Configuration

CPU: Celeron G3930

GPU: Radeon Vega 56

Ubuntu: 18.04 LTS (Kernel 4.15)

ROCm Version: 2.1

Python: 3.6

TensorFlow: 1.12


Preparing to Install TensorFlow

・Install the required Python packages



sudo apt-get update && sudo apt-get install -y \
python3-numpy \
python3-dev \
python3-wheel \
python3-mock \
python3-future \
python3-pip \
python3-yaml \
python3-setuptools && \
sudo apt-get clean && \
sudo rm -rf /var/lib/apt/lists/*
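To confirm that the packages above ended up importable from python3, a short check like the following may help. This is only a sketch: the function name `missing_modules` is hypothetical, and the module names are my assumption of what each apt package provides (for example, python3-yaml is assumed to provide `yaml`). python3-dev ships headers rather than a module, so it is not checked.

```python
import importlib.util

# Assumed mapping from the apt packages above to importable module names.
EXPECTED_MODULES = ["numpy", "wheel", "mock", "future", "pip", "yaml", "setuptools"]


def missing_modules(names):
    """Return the names that cannot be located by the import system."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(EXPECTED_MODULES)
    if missing:
        print("Not importable:", ", ".join(missing))
    else:
        print("All expected modules are importable.")
```

An empty result only shows that the modules can be found. It does not prove that their versions match what tensorflow-rocm expects.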


Installing TensorFlow

・Install tensorflow-rocm with pip

pip3 install tensorflow-rocm


Running TensorFlow

Checking that TensorFlow works → an error saying that various things are missing

python3

>>> import tensorflow
Traceback (most recent call last):
File "/home/tk/.local/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
from tensorflow.python.pywrap_tensorflow_internal import *
File "/home/tk/.local/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
_pywrap_tensorflow_internal = swig_import_helper()
File "/home/tk/.local/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
_mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
File "/usr/lib/python3.6/imp.py", line 243, in load_module
return load_dynamic(name, filename, file)
File "/usr/lib/python3.6/imp.py", line 343, in load_dynamic
return _load(spec)
ImportError: libCXLActivityLogger.so: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/tk/.local/lib/python3.6/site-packages/tensorflow/__init__.py", line 24, in <module>
from tensorflow.python import pywrap_tensorflow # pylint: disable=unused-import
File "/home/tk/.local/lib/python3.6/site-packages/tensorflow/python/__init__.py", line 49, in <module>
from tensorflow.python import pywrap_tensorflow
File "/home/tk/.local/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 74, in <module>
raise ImportError(msg)
ImportError: Traceback (most recent call last):
File "/home/tk/.local/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
from tensorflow.python.pywrap_tensorflow_internal import *
File "/home/tk/.local/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
_pywrap_tensorflow_internal = swig_import_helper()
File "/home/tk/.local/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
_mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
File "/usr/lib/python3.6/imp.py", line 243, in load_module
return load_dynamic(name, filename, file)
File "/usr/lib/python3.6/imp.py", line 343, in load_dynamic
return _load(spec)
ImportError: libCXLActivityLogger.so: cannot open shared object file: No such file or directory

Failed to load the native TensorFlow runtime.

See https://www.tensorflow.org/install/errors

for some common reasons and solutions. Include the entire stack trace
above this error message when asking for help.
>>>
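The last ImportError in the traceback is the actual cause: the dynamic loader cannot open libCXLActivityLogger.so. As a rough way to test this without importing TensorFlow, the sketch below asks the loader directly through the standard `ctypes` module. The helper name `can_load` is hypothetical. A failure here can also mean that one of the library's own dependencies is missing, not only the file itself.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        # CDLL raises OSError when the library (or one of its dependencies)
        # cannot be found or loaded.
        ctypes.CDLL(library_name)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    name = "libCXLActivityLogger.so"  # library named in the traceback above
    print(name, "loadable:", can_load(name))
```

Running the same check again after installing any missing packages shows whether the loader can now find the library.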

Searching turned up other people with the same problem, so I followed the advice given there and ran the following command.

$ sudo apt-get update && \
sudo apt-get install -y --allow-unauthenticated \
rocm-dkms rocm-dev rocm-libs \
rocm-device-libs \
hsa-ext-rocr-dev hsakmt-roct-dev hsa-rocr-dev \
rocm-opencl rocm-opencl-dev \
rocm-utils \
rocm-profiler cxlactivitylogger \
miopen-hip miopengemm

After those packages finished installing, I checked TensorFlow again → it works, at least for now.

python3

Python 3.6.7 (default, Oct 22 2018, 11:32:17)
[GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
WARNING:tensorflow:From /home/tk/.local/lib/python3.6/site-packages/tensorflow/python/ops/distributions/distribution.py:265: ReparameterizationType.__init__ (from tensorflow.python.ops.distributions.distribution) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
WARNING:tensorflow:From /home/tk/.local/lib/python3.6/site-packages/tensorflow/python/ops/distributions/bernoulli.py:169: RegisterKL.__init__ (from tensorflow.python.ops.distributions.kullback_leibler) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
>>>
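A clean import shows that the native runtime loads, but not that TensorFlow can see the GPU. One way to check is to list the local devices, sketched below. It assumes that `tensorflow.python.client.device_lib` is available in the installed build, and that ROCm devices are reported with the device type "GPU". The helper name `list_gpu_devices` is hypothetical.

```python
def list_gpu_devices():
    """Return the names of GPU devices TensorFlow reports, or [] if none."""
    try:
        # Imported lazily so the function degrades gracefully when the
        # TensorFlow runtime cannot be imported (as in the earlier error).
        from tensorflow.python.client import device_lib
    except ImportError:
        return []
    devices = device_lib.list_local_devices()
    return [d.name for d in devices if d.device_type == "GPU"]


if __name__ == "__main__":
    gpus = list_gpu_devices()
    print("GPU devices:", gpus if gpus else "none found")
```

An empty list means either that TensorFlow could not be imported or that it only sees the CPU. In that case the training sample would still run, only much more slowly.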


Running the Sample

・Install Git

sudo apt install git

・Git clone: I cloned the repository as described on the official GitHub page.

cd ~

git clone https://github.com/tensorflow/models.git

・From this repository, I ran the CIFAR10 benchmark.

cd ~/models/tutorials/image/cifar10

export HIP_VISIBLE_DEVICES=0
python3 ./cifar10_train.py

・Result → it ran without problems.

2019-02-12 16:58:19.769456: step 7720, loss = 0.85 (3943.5 examples/sec; 0.032 sec/batch)
2019-02-12 16:58:20.132842: step 7730, loss = 0.87 (3522.4 examples/sec; 0.036 sec/batch)
2019-02-12 16:58:20.468507: step 7740, loss = 0.88 (3813.3 examples/sec; 0.034 sec/batch)
2019-02-12 16:58:20.791237: step 7750, loss = 1.07 (3966.2 examples/sec; 0.032 sec/batch)
2019-02-12 16:58:21.121733: step 7760, loss = 0.90 (3873.0 examples/sec; 0.033 sec/batch)
2019-02-12 16:58:21.487553: step 7770, loss = 0.83 (3499.0 examples/sec; 0.037 sec/batch)
2019-02-12 16:58:21.851375: step 7780, loss = 0.81 (3518.2 examples/sec; 0.036 sec/batch)
2019-02-12 16:58:22.170275: step 7790, loss = 0.87 (4013.8 examples/sec; 0.032 sec/batch)
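To compare throughput with other people's results, it can help to reduce the log to a few numbers. The sketch below pulls the examples/sec figure out of lines in the format shown above and reports the minimum, mean, and maximum. The function names and the regular expression are my own and assume that the log format stays exactly as printed here. The sample lines are copied from the output above.

```python
import re

# Three lines copied from the training log above.
LOG = """\
2019-02-12 16:58:20.132842: step 7730, loss = 0.87 (3522.4 examples/sec; 0.036 sec/batch)
2019-02-12 16:58:20.468507: step 7740, loss = 0.88 (3813.3 examples/sec; 0.034 sec/batch)
2019-02-12 16:58:20.791237: step 7750, loss = 1.07 (3966.2 examples/sec; 0.032 sec/batch)
"""

# Captures the throughput figure that precedes "examples/sec".
RATE_PATTERN = re.compile(r"\(([\d.]+) examples/sec")


def parse_throughput(text):
    """Return every examples/sec value found in the log text, in order."""
    return [float(m.group(1)) for m in RATE_PATTERN.finditer(text)]


def summarize(rates):
    """Return (minimum, mean, maximum) of a non-empty list of rates."""
    return min(rates), sum(rates) / len(rates), max(rates)


if __name__ == "__main__":
    low, mean, high = summarize(parse_throughput(LOG))
    print("examples/sec  min=%.1f  mean=%.1f  max=%.1f" % (low, mean, high))
```

Redirecting the training output to a file and passing its contents to `parse_throughput` would give a summary over the whole run rather than a handful of steps.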

・Checking the GPU load also suggests that it is running properly on the GPU.

$ sudo /opt/rocm/bin/rocm-smi -u

======================== ROCm System Management Interface ========================
================================================================================================
GPU[0] : Cannot get GPU use.
GPU[1] : Current GPU use: 64%
================================================================================================
======================== End of ROCm SMI Log ========================
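Since a single reading says little about how the load changes over time, the output can be parsed so that repeated readings are easy to collect. The sketch below extracts the per-GPU percentage from text in the format printed above. The function name and the regular expression are assumptions based only on this one sample of output, and other ROCm versions may format it differently.

```python
import re

# The two device lines copied from the rocm-smi output above.
SAMPLE = """\
GPU[0] : Cannot get GPU use.
GPU[1] : Current GPU use: 64%
"""

# Matches lines such as "GPU[1] : Current GPU use: 64%".
USE_PATTERN = re.compile(r"GPU\[(\d+)\]\s*:\s*Current GPU use:\s*(\d+)%")


def parse_gpu_use(text):
    """Map GPU index to utilization percent for each GPU that reports one."""
    # GPUs that print "Cannot get GPU use." are simply left out of the result.
    return {int(m.group(1)): int(m.group(2)) for m in USE_PATTERN.finditer(text)}


if __name__ == "__main__":
    for index, percent in sorted(parse_gpu_use(SAMPLE).items()):
        print("GPU[%d]: %d%%" % (index, percent))
```

Feeding this function the output of the same rocm-smi command at regular intervals during training would show how much the utilization actually varies.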


Summary

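I managed to get through installing ROCm and TensorFlow and running a sample. However, other people's benchmarks show higher throughput, and the GPU load appeared to fluctuate quite a bit, so I would like to look further into parameter tuning.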