
Trying to run the PyTorch-ROCm Docker image (the build succeeded, but I have doubts about whether it actually runs)

Last updated at 2019-03-19. Posted at 2019-03-18.

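References
(1) https://rocm.github.io/pytorch.html
(Following this exactly as written did not work at all (#^ω^))

(2) https://rocm-documentation.readthedocs.io/en/latest/Deep_learning/Deep-learning.html
Although both are official documentation, this one is more sensible, so it is the better one to follow (even so, it did not work until I rearranged the contents of some directories).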

The rough environment is as follows:
OS Ubuntu 16.04
GPU Radeon VII (gfx906)
ROCm version 2.1
Docker version 18.09.3

Environment setup

I assume Docker and ROCm are already installed and configured.

$ sudo apt update
$ sudo apt upgrade
$ docker pull rocm/pytorch:rocm2.1_ubuntu16.04_pytorch_gfx906
$ sudo docker run --name pytorch -it -v $HOME:/data --privileged --rm --device=/dev/kfd --device=/dev/dri --group-add video rocm/pytorch:rocm2.1_ubuntu16.04_pytorch_gfx906
(The image tag should probably be changed to match your own GPU...?)

Document (2) says to run docker pull rocm/pytorch:rocm2.1, but if you do, you get

Error response from daemon: manifest for rocm/pytorch:rocm2.1 not found

in response.
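As a reading aid, the `docker run` command above can be written out as an argument list with one flag per line. This is only a sketch: the helper name `build_docker_run` and its parameters are hypothetical, and the comments describe what each flag usually does rather than anything the ROCm documentation guarantees.

```python
import os
import shlex

# Image tag that actually exists, as used in this article.
IMAGE = "rocm/pytorch:rocm2.1_ubuntu16.04_pytorch_gfx906"


def build_docker_run(image=IMAGE, name="pytorch", host_dir=None):
    """Return the argv list for the `docker run` command shown above."""
    host_dir = host_dir or os.path.expanduser("~")
    return [
        "sudo", "docker", "run",
        "--name", name,
        "-it",                      # interactive shell with a TTY
        "-v", f"{host_dir}:/data",  # mount the host home directory at /data
        "--privileged",
        "--rm",                     # remove the container when it exits
        "--device=/dev/kfd",        # ROCm compute device node
        "--device=/dev/dri",        # GPU render device nodes
        "--group-add", "video",     # group that typically owns those nodes
        image,
    ]


if __name__ == "__main__":
    # Print a copy-pastable shell command.
    print(" ".join(shlex.quote(arg) for arg in build_docker_run()))
```

Because of `--rm`, anything written outside `/data` is lost when the container exits, which is why the PyTorch sources are cloned into the host home directory.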

Pull the image first, then run it.

cd ~
git clone https://github.com/pytorch/pytorch.git
cd pytorch
git submodule init
git submodule update

Run the commands above on the host.

Next, go back to the container:

# cd /data/pytorch

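Your host home directory should already be mounted at /data.
Note that for architectures other than vega10 (gfx900), you need to add the target explicitly as an environment variable. For gfx906, that is:

export HCC_AMDGPU_TARGET=gfx906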
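If you launch the build from a script rather than an interactive shell, the same rule can be sketched like this. The helper `build_env` and the `KNOWN_TARGETS` table are hypothetical names; the table contains only the two GPU-to-target pairs mentioned in this article.

```python
import os

# Only the pairs named in this article; extend at your own risk.
KNOWN_TARGETS = {"vega10": "gfx900", "radeon vii": "gfx906"}


def build_env(gpu_name, base_env=None):
    """Return a copy of the environment with HCC_AMDGPU_TARGET set if needed."""
    env = dict(os.environ if base_env is None else base_env)
    target = KNOWN_TARGETS[gpu_name.lower()]
    # Per the article, only targets other than gfx900 need the variable.
    if target != "gfx900":
        env["HCC_AMDGPU_TARGET"] = target
    return env


if __name__ == "__main__":
    env = build_env("Radeon VII")
    print(env.get("HCC_AMDGPU_TARGET"))
    # The result could then be passed on, for example:
    # subprocess.run(["./build.sh"], env=env, check=True)
```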

# cp -r ./tools/  ./.jenkins/pytorch/
# cp -r ./requirements.txt  ./.jenkins/pytorch/
# cd /data/pytorch/.jenkins/pytorch

+ pip install -q -r requirements.txt
Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'
You are using pip version 18.1, however version 19.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

You may get taunted by messages like this, so upgrade pip:

# pip install --upgrade pip
# ./build.sh

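Incidentally, this build takes about an hour, so have a cup of tea while you wait (measured on a Xeon E5-2603 v4 with 32 GB of RAM).

As things stand, managing Python versions is a hassle, so next I install conda.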

# cd ~
# wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# touch .bashrc (not needed if .bashrc already exists because you edited environment variables)
# bash Miniconda3-latest-Linux-x86_64.sh 
# source .bashrc
# conda -V      
conda 4.5.12

Next, move on to the tests.

# cd /data/pytorch
# conda create -n py37 python=3.7
# conda activate py37
# pip install -q -r requirements.txt (you may get scolded to upgrade pip here)
# pip install torchvision
# cd ./test
# PYTORCH_TEST_WITH_ROCM=1 python ./run_test.py --verbose

The key point here is to use Python 3.7; it did not work with 3.5.

Running this starts checking each feature one by one, but

======================================================================
FAIL: test_multinomial_invalid_probs_cuda (test_cuda.TestCuda)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/pytorch/test/common_utils.py", line 298, in wrapper
    method(*args, **kwargs)
  File "/data/pytorch/test/test_cuda.py", line 2285, in test_multinomial_invalid_probs_cuda
    self._spawn_method(test_method, torch.Tensor([1, -1, 1]))
  File "/data/pytorch/test/test_cuda.py", line 2265, in _spawn_method
    self.fail(e)
AssertionError: False

----------------------------------------------------------------------
Ran 159 tests in 44.647s

FAILED (failures=1, skipped=76)
Traceback (most recent call last):
  File "./run_test.py", line 457, in <module>
    main()
  File "./run_test.py", line 449, in main
    raise RuntimeError(message)
RuntimeError: test_cuda failed!
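The two summary lines near the end of the log above say more than just "failed". Here is a small sketch that pulls the numbers out of them; the function name `summarize` is hypothetical, and the regular expressions assume the standard unittest summary format seen in this log.

```python
import re

RAN_RE = re.compile(r"^Ran (\d+) tests? in ([\d.]+)s$", re.MULTILINE)
RESULT_RE = re.compile(r"^(OK|FAILED)(?: \((.*)\))?$", re.MULTILINE)


def summarize(log):
    """Extract totals from a unittest log, or return None if not found."""
    ran = RAN_RE.search(log)
    result = RESULT_RE.search(log)
    if ran is None or result is None:
        return None
    counts = {}
    if result.group(2):
        # e.g. "failures=1, skipped=76"
        for item in result.group(2).split(", "):
            key, value = item.split("=")
            counts[key] = int(value)
    total = int(ran.group(1))
    skipped = counts.get("skipped", 0)
    return {
        "status": result.group(1),
        "total": total,
        "seconds": float(ran.group(2)),
        "failures": counts.get("failures", 0),
        "errors": counts.get("errors", 0),
        "skipped": skipped,
        "executed": total - skipped,
    }


# Summary lines copied from the log above.
SAMPLE = """Ran 159 tests in 44.647s

FAILED (failures=1, skipped=76)
"""

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

For this log, 76 of the 159 tests were skipped, so only 83 actually ran, and one of those failed.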

It failed here. For now, let's just try running some PyTorch code.

# cd /data
# git clone https://github.com/pytorch/examples.git
# cd ./examples/mnist
# pip install -r requirements.txt
# python ./main.py
(output omitted)
Test set: Average loss: 0.0318, Accuracy: 9893/10000 (99%)

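It runs from Epoch 1 through Epoch 10. However, judging from rocm-smi, the GPU did not seem to be doing much (GPU utilization was around 40% at most?), so I am not entirely convinced.

Since I had come this far, I thought I would run a benchmark, and tried this one: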
https://github.com/ryujaehun/pytorch-gpu-benchmark

# git clone https://github.com/ryujaehun/pytorch-gpu-benchmark.git
# cd ./pytorch-gpu-benchmark/
# pip install -r  requirement.txt 
# ./test.sh 
(py37) root@ac1b159c79ec:~/pytorch-gpu-benchmark# ./test.sh 
Number of GPUs on current device 0
CUDA version = 9.0.176
cudnn version= 7102
Traceback (most recent call last):
  File "benchmark_models.py", line 13, in <module>
    print_info()
  File "/root/pytorch-gpu-benchmark/info_utils.py", line 23, in print_info
    print_device_name()
  File "/root/pytorch-gpu-benchmark/info_utils.py", line 16, in print_device_name
    print('device_name=', torch.cuda.get_device_name(0))
  File "/root/miniconda3/envs/py37/lib/python3.7/site-packages/torch/cuda/__init__.py", line 272, in get_device_name
    return get_device_properties(device).name
  File "/root/miniconda3/envs/py37/lib/python3.7/site-packages/torch/cuda/__init__.py", line 290, in get_device_properties
    init()  # will define _get_device_properties and _CudaDeviceProperties
  File "/root/miniconda3/envs/py37/lib/python3.7/site-packages/torch/cuda/__init__.py", line 143, in init
    _lazy_init()
  File "/root/miniconda3/envs/py37/lib/python3.7/site-packages/torch/cuda/__init__.py", line 160, in _lazy_init
    _check_driver()
  File "/root/miniconda3/envs/py37/lib/python3.7/site-packages/torch/cuda/__init__.py", line 81, in _check_driver
    http://www.nvidia.com/Download/index.aspx""")
AssertionError: 
Found no NVIDIA driver on your system. Please check that you
have an NVIDIA GPU and installed a driver from
http://www.nvidia.com/Download/index.aspx

Apparently there is a check for whether CUDA is installed, and it would not run...
I could not figure out what to do about it, so I am putting this on hold for now.
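One possible reading of the log above (my assumption, not something I verified): the traceback points into the py37 conda environment's site-packages, and it reports CUDA version 9.0.176, so the torch being imported may be a CUDA wheel installed by pip rather than the ROCm build. A minimal sketch for checking which build is in use follows. `describe_torch` is a hypothetical helper, and `torch.version.hip` may not exist on older builds, hence the `getattr`.

```python
from types import SimpleNamespace


def describe_torch(torch_module):
    """Collect facts that hint at which PyTorch build was imported."""
    version = torch_module.version
    return {
        "file": torch_module.__file__,
        "cuda": getattr(version, "cuda", None),
        "hip": getattr(version, "hip", None),  # may be absent on older builds
        "gpu_available": torch_module.cuda.is_available(),
        "gpu_count": torch_module.cuda.device_count(),
    }


# Stand-in object mirroring values visible in the log above (hypothetical;
# it only exists so the helper can be tried without PyTorch or a GPU).
log_like = SimpleNamespace(
    __file__="/root/miniconda3/envs/py37/lib/python3.7/site-packages/torch/__init__.py",
    version=SimpleNamespace(cuda="9.0.176"),
    cuda=SimpleNamespace(is_available=lambda: False, device_count=lambda: 0),
)

if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        target = log_like  # fall back to the stand-in
    else:
        target = torch
    for key, value in describe_torch(target).items():
        print(f"{key}: {value}")
```

If `file` points into the conda environment and `gpu_count` is 0, the scripts are probably not using the build produced by build.sh at all.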

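Which makes me think the earlier

FAIL: test_multinomial_invalid_probs_cuda (test_cuda.TestCuda)

might have been the culprit...? So I will rebuild the environment and verify again.

Rebuild and re-verification (currently being edited; more will be added soon)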

$ sudo docker container run --name pytorch2 -it -v $HOME:/data --privileged --rm --device=/dev/kfd --device=/dev/dri --group-add video rocm/pytorch:rocm2.1_ubuntu16.04_pytorch_gfx906

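If you want to save the container, use docker commit pytorch2.
As in the earlier steps, place the same files (the pytorch clone) in the host machine's home directory.

Once inside the container: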

# apt update
# apt upgrade
# apt install nano (optional; use whichever editor you like)
# apt install git
# cd ~
# nano .bashrc (add the line export HCC_AMDGPU_TARGET=gfx906)
# wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# touch .bashrc (not needed if .bashrc already exists because you edited environment variables)
# bash Miniconda3-latest-Linux-x86_64.sh
# source .bashrc
# conda -V
# cd /data
# cd /data/pytorch/.jenkins/pytorch
# cp ./build.sh  ../../
# cd ../../
# ./build.sh
# conda create -n py37 python=3.7
# conda activate py37
#  pip install -q -r requirements.txt 
# cd ./test
# PYTORCH_TEST_WITH_ROCM=1 python ./run_test.py --verbose

I typed in the whole sequence, but

======================================================================
FAIL: test_multinomial_invalid_probs_cuda (test_cuda.TestCuda)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/pytorch/test/common_utils.py", line 298, in wrapper
    method(*args, **kwargs)
  File "/data/pytorch/test/test_cuda.py", line 2285, in test_multinomial_invalid_probs_cuda
    self._spawn_method(test_method, torch.Tensor([1, -1, 1]))
  File "/data/pytorch/test/test_cuda.py", line 2265, in _spawn_method
    self.fail(e)
AssertionError: False

----------------------------------------------------------------------
Ran 159 tests in 45.323s

it fails at this point.
Fully working operation still seems out of reach, so I am hoping for future updates.
