These are my notes on the various problems I ran into while trying to run "FPGAでDeep Learningしてみる - きゅうりを選果する" (sorting cucumbers with deep learning on an FPGA) on AWS.
Agenda
- Setting up the GPU environment
- Installing Python, Theano, Lasagne and Pylearn2
- Installing the training data and BNN-PYNQ, and running the training
- Copying the trained parameters to the PYNQ and converting them to binary format
- Running inference on the PYNQ
Note: replace the scp destinations, SSH key names and so on below with the ones for your own environment.
#1. Setting up the GPU environment
AWS provides a Deep Learning AMI with Python, Theano, NVIDIA drivers, CUDA and cuDNN preinstalled, but when I tried to run cucumber9.py (described below) on it, it always failed with the error shown here no matter what I tried, so I decided to start from a plain Ubuntu image and install everything myself, beginning with the NVIDIA driver.
ubuntu@ip-172-31-26-1:~/BNN-PYNQ/bnn/src/training$ python2.7 cucumber9.py
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release. Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29
Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN 7104)
/home/ubuntu/.local/lib/python2.7/site-packages/theano/sandbox/cuda/__init__.py:620: UserWarning: Your cuDNN version is more recent than the one Theano officially supports. If you see any problems, try updating Theano or downgrading cuDNN to version 5.1.
warnings.warn(warn)
Traceback (most recent call last):
File "cucumber9.py", line 43, in <module>
import lasagne
File "/home/ubuntu/.local/lib/python2.7/site-packages/lasagne/__init__.py", line 12, in <module>
import theano
File "/home/ubuntu/.local/lib/python2.7/site-packages/theano/__init__.py", line 115, in <module>
theano.sandbox.cuda.tests.test_driver.test_nvidia_driver1()
File "/home/ubuntu/.local/lib/python2.7/site-packages/theano/sandbox/cuda/tests/test_driver.py", line 41, in test_nvidia_driver1
raise Exception("The nvidia driver version installed with this OS "
Exception: The nvidia driver version installed with this OS does not give good results for reduction.Installing the nvidia driver available on the same download page as the cuda package will fix the problem: http://developer.nvidia.com/cuda-downloads
From Instances in the AWS EC2 console, click Launch Instance and choose Ubuntu Server 16.04 LTS (HVM), SSD Volume Type from Quick Start.
For the Instance Type, pick p2.xlarge under GPU Compute.
(GPU instances are not available by default: open Limits in the left-hand column of the EC2 console, submit a Request limit increase for p2.xlarge, and get the limit raised to 1 or more.)
In Add Storage, increase the root volume to 16 GiB.
(If you leave it at 8 GiB, the disk fills up partway through the Python installation.)
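If you prefer to script the launch rather than click through the console, the same setup can be done with boto3. The sketch below is only illustrative: the region, AMI ID and key pair name are placeholders for your own environment, and /dev/sda1 is assumed to be the root device of the Ubuntu 16.04 AMI.
# Minimal sketch: launch a p2.xlarge with a 16 GiB root volume via boto3.
# ami-xxxxxxxx and my-key are placeholders; substitute your own values.
import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')
resp = ec2.run_instances(
    ImageId='ami-xxxxxxxx',        # Ubuntu Server 16.04 LTS (HVM), SSD Volume Type
    InstanceType='p2.xlarge',
    KeyName='my-key',
    MinCount=1,
    MaxCount=1,
    BlockDeviceMappings=[{
        'DeviceName': '/dev/sda1',                       # root volume
        'Ebs': {'VolumeSize': 16, 'VolumeType': 'gp2'},  # 16 GiB instead of the default 8 GiB
    }],
)
print(resp['Instances'][0]['InstanceId'])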
After launching the instance and connecting over SSH, I basically followed "FPGAでDeep Learningしてみる - きゅうりを選果する" to install the NVIDIA drivers, CUDA and cuDNN; for the driver alone, though, the version listed there still produced driver-related errors, so I installed both the driver and CUDA from the CUDA Toolkit 8.0 - Feb 2017 runfile as shown below.
The installer asks for gcc and make, so install them first:
sudo apt-get update
sudo apt-get install gcc make
wget https://developer.nvidia.com/compute/cuda/8.0/Prod2/local_installers/cuda_8.0.61_375.26_linux-run
sudo sh cuda_8.0.61_375.26_linux-run
Accept the default install options.
Also apply the BLAS library patch:
wget https://developer.nvidia.com/compute/cuda/8.0/Prod2/patches/2/cuda_8.0.61.2_linux-run
sudo sh cuda_8.0.61.2_linux-run
For cuDNN, as in "FPGAでDeep Learningしてみる - きゅうりを選果する", create an account on the NVIDIA site, pick Download cuDNN v5.1 (Jan 20, 2017), for CUDA 8.0 from the cuDNN Archive, and download libcudnn5_5.1.10-1+cuda8.0_amd64.deb and libcudnn5-dev_5.1.10-1+cuda8.0_amd64.deb to your local machine.
Then transfer them to the AWS instance and install them.
From the local machine:
scp -i my-key.pem libcudnn5_5.1.10-1+cuda8.0_amd64.deb ubuntu@ec2-52-87-164-28.compute-1.amazonaws.com:
scp -i my-key.pem libcudnn5-dev_5.1.10-1+cuda8.0_amd64.deb ubuntu@ec2-52-87-164-28.compute-1.amazonaws.com:
On the AWS instance:
sudo dpkg -i libcudnn5_5.1.10-1+cuda8.0_amd64.deb libcudnn5-dev_5.1.10-1+cuda8.0_amd64.deb
Set up the paths, then reboot so the newly installed driver and libraries take effect:
sudo sh -c "echo 'CUDA_HOME=/usr/local/cuda' >> /etc/profile.d/cuda.sh"
sudo sh -c "echo 'export LD_LIBRARY_PATH=\${LD_LIBRARY_PATH}:\${CUDA_HOME}/lib64' >> /etc/profile.d/cuda.sh"
sudo sh -c "echo 'export LIBRARY_PATH=\${LIBRARY_PATH}:\${CUDA_HOME}/lib64' >> /etc/profile.d/cuda.sh"
sudo sh -c "echo 'export C_INCLUDE_PATH=\${C_INCLUDE_PATH}:\${CUDA_HOME}/include' >> /etc/profile.d/cuda.sh"
sudo sh -c "echo 'export CXX_INCLUDE_PATH=\${CXX_INCLUDE_PATH}:\${CUDA_HOME}/include' >> /etc/profile.d/cuda.sh"
sudo sh -c "echo 'export PATH=\${PATH}:\${CUDA_HOME}/bin' >> /etc/profile.d/cuda.sh"
sudo reboot
Reconnect over SSH and run nvidia-smi; the driver version is now 375.26:
ubuntu@ip-172-31-89-44:~$ nvidia-smi
Sun Oct 7 11:30:52 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26 Driver Version: 375.26 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0000:00:1E.0 Off | 0 |
| N/A 46C P0 72W / 149W | 0MiB / 11439MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
#2. Installing Python, Theano, Lasagne and Pylearn2
Install Python.
Running python cucumber9.py later complains that g++ is missing, so install g++ as well:
sudo apt-get install git g++ openssl libssl-dev libbz2-dev libreadline-dev libsqlite3-dev
git clone https://github.com/yyuu/pyenv.git ~/.pyenv
Configure pyenv by adding the following to .bashrc with vi:
export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"
Set the Python version to 2.7:
source .bashrc
env PYTHON_CONFIGURE_OPTS="--enable-shared" pyenv install 2.7.13
pyenv global 2.7.13
From here on, things mostly follow "FPGAでDeep Learningしてみる - きゅうりを選果する".
Install Theano:
pip install --user git+https://github.com/Theano/Theano.git@rel-0.9.0beta1
Install Lasagne, then write the Theano settings to ~/.theanorc (a quick GPU sanity check follows after the config):
pip install --user https://github.com/Lasagne/Lasagne/archive/master.zip
echo "[global]" >> ~/.theanorc
echo "floatX = float32" >> ~/.theanorc
echo "device = gpu" >> ~/.theanorc
echo "openmp = True" >> ~/.theanorc
echo "openmp_elemwise_minsize = 200000" >> ~/.theanorc
echo "" >> ~/.theanorc
echo "[nvcc]" >> ~/.theanorc
echo "fastmath = True" >> ~/.theanorc
echo "" >> ~/.theanorc
echo "[blas]" >> ~/.theanorc
echo "ldflags = -lopenblas" >> ~/.theanorc
Install Pylearn2:
git clone https://github.com/lisa-lab/pylearn2
cd pylearn2/
Running setup.py fails with the error below,
Traceback (most recent call last):
File "setup.py", line 8, in <module>
from theano.compat.six.moves import input
ImportError: No module named theano.compat.six.moves
so open setup.py in vi and change the offending import as follows.
Before:
from theano.compat.six.moves import input
After:
from six.moves import input
Set up pylearn2:
python setup.py develop --user
An error like the following appears partway through, but it can be ignored:
In file included from ext/_yaml.c:565:0:
ext/_yaml.h:2:18: fatal error: yaml.h: No such file or directory
compilation terminated.
Error compiling module, falling back to pure Python
#3. Installing the training data and BNN-PYNQ, and running the training
Copy the cucumber data (a quick check of the extracted batches follows below):
cd
git clone https://github.com/workpiles/CUCUMBER-9.git
cd CUCUMBER-9/prototype_1/
tar -zxvf cucumber-9-python.tar.gz
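The archive extracts to CIFAR-10-style pickle batches with 'data', 'labels' and 'filenames' keys, the same layout the PYNQ notebook in section 5 reads (three consecutive data rows per image, one per colour plane). Here is a quick sketch to confirm the extracted data looks sane, assuming that layout; test_batch is the file the notebook uses later, and the 0-8 label range is my assumption based on the 9 classes.
# Peek at an extracted cucumber batch (run inside CUCUMBER-9/prototype_1/).
from __future__ import print_function
import cPickle

with open('test_batch', 'rb') as fo:
    batch = cPickle.load(fo)

print(batch.keys())                    # expect 'data', 'labels', 'filenames'
print(len(batch['labels']), 'images')  # samples in this batch
print(len(batch['data'][0]))           # 1024 values per channel row (32x32)
print(sorted(set(batch['labels'])))    # should cover the 9 cucumber grades (presumably 0-8)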
Also grab the BNN-PYNQ sources:
cd
git clone https://github.com/Xilinx/BNN-PYNQ.git
cd BNN-PYNQ/bnn/src/training/
Take cucumber9.py from "FPGAでDeep Learningしてみる - きゅうりを選果する" and run it:
python cucumber9.py
ubuntu@ip-172-31-89-44:~/BNN-PYNQ/bnn/src/training$ python cucumber9.py
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release. Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29
Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN 5110)
batch_size = 50
alpha = 0.1
epsilon = 0.0001
W_LR_scale = Glorot
num_epochs = 500
LR_start = 0.001
LR_fin = 3e-07
LR_decay = 0.983907435305
save_path = cucumber9_parameters.npz
train_set_size = 2475
shuffle_parts = 1
Loading CUCUMBER9 dataset...
Building the CNN...
/home/ubuntu/.local/lib/python2.7/site-packages/theano/tensor/basic.py:2144: UserWarning: theano.tensor.round() changed its default from `half_away_from_zero` to `half_to_even` to have the same default as NumPy. Use the Theano flag `warn.round=False` to disable this warning.
"theano.tensor.round() changed its default from"
W_LR_scale = 20.049938
H = 1
W_LR_scale = 27.712812
H = 1
W_LR_scale = 33.941124
H = 1
W_LR_scale = 39.191837
H = 1
W_LR_scale = 48.0
H = 1
W_LR_scale = 55.425625
H = 1
W_LR_scale = 22.627417
H = 1
W_LR_scale = 26.127892
H = 1
W_LR_scale = 18.63688
H = 1
Training...
Epoch 1 of 500 took 5.78331184387s
LR: 0.001
training loss: 1.485121870527462
validation loss: 2.055072214868334
validation error rate: 61.11111177338494%
best epoch: 1
best validation error rate: 61.11111177338494%
test loss: 2.055072214868334
test error rate: 61.11111177338494%
…
Epoch 500 of 500 took 5.06151795387s
LR: 3.04906731299e-07
training loss: 0.0023622200804645534
validation loss: 0.15342154022720125
validation error rate: 15.333333197567198%
best epoch: 373
best validation error rate: 11.777777804268732%
test loss: 0.12793712980217403
test error rate: 11.777777804268732%
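Training writes the learned parameters to cucumber9_parameters.npz (the save_path shown in the log above). A quick way to peek at what was saved, assuming the standard NumPy .npz container:
# Inspect the saved parameter archive.
from __future__ import print_function
import numpy as np

params = np.load('cucumber9_parameters.npz')
print(len(params.files), 'arrays saved')
for name in params.files[:5]:          # show just the first few entries
    print(name, params[name].shape, params[name].dtype)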
Running cucumber9-gen-binary-weights.py (which I wrote based on "FPGAでDeep Learningしてみる - きゅうりを選果する") on the same instance fails as shown below, so I decided to do the conversion to binary format on the PYNQ instead.
ubuntu@ip-172-31-89-44:~/BNN-PYNQ/bnn/src/training$ python cucumber9-gen-binary-weights.py cucumber9_parameters.npz
Using peCount = 16 simdCount = 3 for engine 0
Traceback (most recent call last):
File "cucumber9-gen-binary-weights.py", line 63, in <module>
(w,t) = rHW.readConvBNComplex(usePopCount=False)
TypeError: readConvBNComplex() takes at least 7 arguments (2 given)
#4. Copying the trained parameters to the PYNQ and converting them to binary format
scp cucumber9-gen-binary-weights.py and the cucumber9_parameters.npz produced on AWS to the PYNQ.
Copy cucumber9_parameters.npz to the local machine:
scp -i my-key.pem ubuntu@ec2-52-87-164-28.compute-1.amazonaws.com:~/BNN-PYNQ/bnn/src/training/cucumber9_parameters.npz ./
Copy cucumber9-gen-binary-weights.py and cucumber9_parameters.npz to the PYNQ:
scp cucumber9-gen-binary-weights.py cucumber9_parameters.npz xilinx@192.168.0.9:
Run cucumber9-gen-binary-weights.py on the PYNQ:
sudo cp cucumber9-gen-binary-weights.py /opt/python3.6/lib/python3.6/site-packages/bnn/src/training/
python /opt/python3.6/lib/python3.6/site-packages/bnn/src/training/cucumber9-gen-binary-weights.py cucumber9_parameters.npz
xilinx@pynq:~$ python /opt/python3.6/lib/python3.6/site-packages/bnn/src/training/cucumber9-gen-binary-weights.py cucumber9_parameters.npz
Using peCount = 16 simdCount = 3 for engine 0
Extracting conv-BN complex, OFM=64 IFM=3 k=3
Layer 0: 64 x 27
WMem = 36 TMem = 4
Using peCount = 32 simdCount = 32 for engine 1
Extracting conv-BN complex, OFM=64 IFM=64 k=3
Layer 1: 64 x 576
WMem = 36 TMem = 2
Using peCount = 16 simdCount = 32 for engine 2
Extracting conv-BN complex, OFM=128 IFM=64 k=3
Layer 2: 128 x 576
WMem = 144 TMem = 8
Using peCount = 16 simdCount = 32 for engine 3
Extracting conv-BN complex, OFM=128 IFM=128 k=3
Layer 3: 128 x 1152
WMem = 288 TMem = 8
Using peCount = 4 simdCount = 32 for engine 4
Extracting conv-BN complex, OFM=256 IFM=128 k=3
Layer 4: 256 x 1152
WMem = 2304 TMem = 64
Using peCount = 1 simdCount = 32 for engine 5
Extracting conv-BN complex, OFM=256 IFM=256 k=3
Layer 5: 256 x 2304
WMem = 18432 TMem = 256
Using peCount = 1 simdCount = 4 for engine 6
Extracting FCBN complex, ins = 256 outs = 512
Interleaving 256 channels in fully connected layer...
Layer 6: 512 x 256
WMem = 32768 TMem = 512
Using peCount = 1 simdCount = 8 for engine 7
Extracting FCBN complex, ins = 512 outs = 512
Layer 7: 512 x 512
WMem = 32768 TMem = 512
Using peCount = 4 simdCount = 1 for engine 8
Extracting FCBN complex, ins = 512 outs = 9
Layer 8: 12 x 512
WMem = 1536 TMem = 3
Copy the parameters converted to binary format into place:
sudo mkdir /opt/python3.6/lib/python3.6/site-packages/bnn/params/cucumber9
sudo cp binparam-cnv-pynq/* /opt/python3.6/lib/python3.6/site-packages/bnn/params/cucumber9/
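At this point it is worth confirming, still on the PYNQ, that the bnn package can see the new parameter set; this uses the same call that appears in the notebook in the next section.
import bnn
# 'cucumber9' should now be listed among the available CNV parameter sets
print(bnn.available_params(bnn.NETWORK_CNV))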
#5. Running inference on the PYNQ
Cloning the cucumber data with git fails with the certificate error below, so tell git to skip certificate verification:
xilinx@pynq:~$ git clone https://github.com/workpiles/CUCUMBER-9.git
Cloning into 'CUCUMBER-9'...
fatal: unable to access 'https://github.com/workpiles/CUCUMBER-9.git/': server certificate verification failed. CAfile: /etc/ssl/certs/ca-certificates.crt CRLfile: none
export GIT_SSL_NO_VERIFY=1
Copy the training (inference) data:
git clone https://github.com/workpiles/CUCUMBER-9.git
cd CUCUMBER-9/prototype_1/
tar -zxvf cucumber-9-python.tar.gz
In Jupyter Notebook, duplicate and rename Cifar10.ipynb, edit it into the Cucumber9.ipynb shown below, and run it.
In "4. Launching BNN in hardware", the FPGA appears to classify correctly in about 0.0016 seconds.
In "5. Launching BNN in software", the CPU takes about 0.834 seconds, so the FPGA is roughly 500 times faster (0.834 / 0.0016 ≈ 520).
In[1]:
import bnn
print(bnn.available_params(bnn.NETWORK_CNV))
classifier = bnn.CnvClassifier('cucumber9')
In[2]:
print(classifier.bnn.classes)
In[3]:
import _pickle, random
from PIL import Image
import numpy as np
def unpickle(file):
fo = open(file, 'rb')
dict = _pickle.load(fo, encoding='latin-1')
fo.close()
return dict
def image_from_cucumber(index, dic):
name = dic['filenames'][index]
r = dic['data'][index*3]
g = dic['data'][index*3+1]
b = dic['data'][index*3+2]
label = dic['labels'][index]
data = np.array([r, g, b]).T.reshape(32, 32, 3)
img = Image.fromarray(data, 'RGB')
img.save(name)
return name, label
dic = unpickle('/home/xilinx/CUCUMBER-9/prototype_1/test_batch')
filename, label = image_from_cucumber(random.randint(0, len(dic['filenames']) - 1), dic)  # randint is inclusive at both ends
im = Image.open('/home/xilinx/jupyter_notebooks/bnn/'+filename)
print(classifier.class_name(label))
im
In[4]:
class_out=classifier.classify_image(im)
print("Class number: {0}".format(class_out))
print("Class name: {0}".format(classifier.class_name(class_out)))
print("Correct Class number: {0}".format(label))
print("Correct Class name: {0}".format(classifier.class_name(label)))
In[5]:
sw_class = bnn.CnvClassifier('cucumber9', bnn.RUNTIME_SW)
class_out = sw_class.classify_image(im)
print("Class number: {0}".format(class_out))
print("Class name: {0}".format(classifier.class_name(class_out)))
print("Correct Class number: {0}".format(label))
print("Correct Class name: {0}".format(classifier.class_name(label)))
In[6]:
from IPython.display import display
im.thumbnail((64, 64), Image.ANTIALIAS)
display(im)
class_detail = classifier.classify_details(im)
print("classes: {0}".format(class_detail))
In[7]:
%matplotlib inline
import matplotlib.pyplot as plt
x_pos = np.arange(len(class_detail))
fig, ax = plt.subplots()
ax.bar(x_pos, (class_detail/255)-1)
ax.set_xticklabels(classifier.bnn.classes, rotation='vertical')
ax.set_xticks(x_pos)
plt.show()
In[8]:
from pynq import Xlnk
xlnk = Xlnk()
xlnk.xlnk_reset()