0. Overview
Among the BNNs that can run on an FPGA, there is an open-source one called BNN-PYNQ (FINN).
In this article, we will customize it and get it running on an FPGA.
1. Setting Up the Environment
First, let's set up an environment in which BNN-PYNQ can be built.
Start by downloading Vivado Design Suite (WebPACK) on Linux:
https://japan.xilinx.com/support/download.html
Install it as follows:
$ cd ~/Downloads/
$ chmod +x Vivado_Installer.bin
$ ./Vivado_Installer.bin
Once a screen like this appears, just follow the installer the rest of the way.
At "Select Edition to Install", choose the Vivado WebPACK Edition.
Leave it alone for about 30 minutes and the installation will finish.
Finally, set up the PATH so Vivado can be used. Add the following two lines to your .bashrc:
source "/YOUR_PATH/Xilinx/Vivado/2018.3/.settings64-Vivado.sh"
source "/YOUR_PATH/Xilinx/Vivado/2018.3/settings64.sh"
2. Building the BNN-PYNQ Hardware
First, pull down BNN-PYNQ:
$ git clone https://github.com/Xilinx/BNN-PYNQ.git
Next, set the root path:
$ export XILINX_BNN_ROOT="/YOUR_PATH/BNN-PYNQ/bnn/src/"
Then run the shell script that builds the bitstream:
$ cd /YOUR_PATH/BNN-PYNQ/bnn/src/network/
$ ./make-hw.sh {network} {platform} {mode}
The available networks are cnvW1A1, cnvW1A2, cnvW2A2, lfcW1A1, and lfcW1A2.
The available platforms are pynqZ1-Z2 and ultra96.
The modes are h (high-level synthesis), b (logic synthesis), and a (run both).
So let's run it with the following parameters:
$ ./make-hw.sh cnvW1A1 pynqZ1-Z2 a
A horrendous amount of output later, the word Done appears:
.....
INFO: [Common 17-206] Exiting Vivado at Mon Jan 7 15:06:07 2019...
Bitstream copied to /YOUR_PATH/BNN-PYNQ/bnn/src/network/output/bitstream/cnvW1A1-pynqZ1-Z2.bit
Done!
The bitstream is written out to the following directory:
$ ls /YOUR_PATH/BNN-PYNQ/bnn/src/network/output/
Finally, copy this bitstream and its related files over to the PYNQ and place them in the folder below; after that, they can be used on the PYNQ:
PIP_PATH/bnn/bitstreams/
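Here PIP_PATH is wherever the bnn Python package is installed on the PYNQ. One way to locate it (a sketch, assuming the package is importable as bnn):
$ python3 -c "import bnn, os; print(os.path.dirname(bnn.__file__))"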
3. Building the BNN-PYNQ Software
We now have the bitstream that performs inference on the hardware side.
Next, we generate the weights used at inference time.
3.1. Environment Setup
I will assume a working GPU environment from here on.
For reference, my setup is CUDA 9.2, cuDNN 7, and Ubuntu 18.
On top of that, Theano is needed at runtime, so we install it as well.
The standard procedure would be the following, but:
# not recommended
$ conda create -n pynq python=2.7 anaconda
$ source activate pynq
$ pip install tensorflow-gpu
$ pip install --user git+https://github.com/Theano/Theano.git@rel-0.9.0beta1
$ pip install --user https://github.com/Lasagne/Lasagne/archive/master.zip
# Pylearn
$ pip install --user numpy==1.11.0 # Pylearn2 seems to not work with the latest version of numpy
$ git clone https://github.com/lisa-lab/pylearn2
$ cd pylearn2
$ python setup.py develop --user
$ cd ..
# dataset
$ export PYLEARN2_DATA_PATH=~/.pylearn2
$ mkdir -p ~/.pylearn2
$ cd pylearn2/pylearn2/scripts/datasets
$ python download_mnist.py
$ ./download_cifar10.sh
$ cd ../../..
Pylearn2 is no longer developed, so I switched to loading the datasets through Keras.
(Looking at the source, it does not use whitening or anything like it; Pylearn2 is only ever used as a dataset loader.)
The underlying problem: Pylearn2 does not work with Theano 0.9 or later, but Theano below 1.0 does not support cuDNN...
So I decided to drop Pylearn2; training without a GPU is simply not viable.
Set up the environment as follows:
$ conda create -n pynq python=2.7
$ conda install tensorflow-gpu
$ pip install cython
$ pip install keras
$ pip install theano==1.0.3
$ pip install lasagne
Note that it also runs on Python 3.x if you slightly modify line 5 of binary_net's source code,
but there is no particular reason to use 3.x, so we proceed with 2.x.
Next, write out the Theano config:
$ echo "[global]" >> ~/.theanorc
$ echo "floatX = float32" >> ~/.theanorc
$ echo "device = cuda" >> ~/.theanorc
$ echo "openmp = True" >> ~/.theanorc
$ echo "openmp_elemwise_minsize = 200000" >> ~/.theanorc
$ echo "" >> ~/.theanorc
$ echo "[nvcc]" >> ~/.theanorc
$ echo "fastmath = True" >> ~/.theanorc
$ echo "" >> ~/.theanorc
$ echo "[blas]" >> ~/.theanorc
$ echo "ldflags = -lopenblas" >> ~/.theanorc
If you get GPU-related errors, change device = cuda in ~/.theanorc to device = gpu.
If you get an error like
ModuleNotFoundError: No module named 'theano.compat.six'
change the line
from theano.compat.six.moves import input
in setup.py to
from six.moves import input
Also install Pillow for image handling, just in case:
$ pip install --user Pillow
If you get the error
AttributeError: module 'urllib' has no attribute 'urlretrieve'
change every urllib in download_mnist.py to urllib.request.
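As a sketch, a version-agnostic import (my own workaround, not the upstream fix) would be:
# works on both Python 2 and 3
try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2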
3.2. Training
Replace the Pylearn2 dataset loading with the Keras datasets. The file I modified is cifar10.py:
(snip)
#from pylearn2.datasets.zca_dataset import ZCA_Dataset
#from pylearn2.datasets.cifar10 import CIFAR10
#from pylearn2.utils import serial
from keras.datasets import cifar10
(snip)
# load data
(X_train, y_train), (X_test, y_test) = cifar10.load_data()
# Transpose: Batch, C, X, Y
train_set_X = np.float32(X_train[0:45000].transpose(0, 3, 1, 2))
train_set_X = (train_set_X * 2./255.) - 1
train_set_y = y_train[0:45000]
train_set_y = np.float32(np.eye(10)[train_set_y.flatten()])
train_set_y = 2 * train_set_y - 1.
valid_set_X = np.float32(X_train[45000:].transpose(0, 3, 1, 2))
valid_set_X = (valid_set_X * 2./255.) - 1
valid_set_y = y_train[45000:]
valid_set_y = np.float32(np.eye(10)[valid_set_y.flatten()])
valid_set_y = 2 * valid_set_y - 1.
test_set_X = np.float32(X_test.transpose(0, 3, 1, 2))
test_set_X = (test_set_X * 2./255.) - 1
test_set_y = y_test
test_set_y = np.float32(np.eye(10)[test_set_y.flatten()])
test_set_y = 2 * test_set_y - 1.
# original code
#valid_set = CIFAR10(which_set="train",start=train_set_size,stop = 50000)
#test_set = CIFAR10(which_set="test")
# bc01 format
# Inputs in the range [-1,+1]
# print("Inputs in the range [-1,+1]")
#train_set.X = np.reshape(np.subtract(np.multiply(2./255.,train_set.X),1.),(-1,3,32,32))
#valid_set.X = np.reshape(np.subtract(np.multiply(2./255.,valid_set.X),1.),(-1,3,32,32))
#test_set.X = np.reshape(np.subtract(np.multiply(2./255.,test_set.X),1.),(-1,3,32,32))
# flatten targets
#train_set.y = np.hstack(train_set.y)
#valid_set.y = np.hstack(valid_set.y)
#test_set.y = np.hstack(test_set.y)
# Onehot the targets
#train_set.y = np.float32(np.eye(10)[train_set.y])
#valid_set.y = np.float32(np.eye(10)[valid_set.y])
#test_set.y = np.float32(np.eye(10)[test_set.y])
# for hinge loss
#train_set.y = 2* train_set.y - 1.
#valid_set.y = 2* valid_set.y - 1.
#test_set.y = 2* test_set.y - 1.
(snip)
binary_net.train(
train_fn,val_fn,
cnn,
batch_size,
LR_start,LR_decay,
num_epochs,
train_set_X,train_set_y,
valid_set_X,valid_set_y,
test_set_X,test_set_y,
save_path=save_path,
shuffle_parts=shuffle_parts)
With that, the following should run:
$ python cifar10.py
If you get an OpenBLAS error, run the following:
$ sudo apt install libatlas-base-dev
$ sudo apt install libatlas-doc
$ sudo apt install libopenblas-base
$ sudo apt install libopenblas-dev
If everything works, training starts like this.
3.3. Splitting the Weights
Next, we split the weights written out by training (cifar10_parameters.npz) so that the FPGA can easily compute on them in parallel.
First, copy the script that will hold the conversion:
cp cifar10-gen-weights-W1A1.py cifar10-gen-weights.py
Then just quickly rewrite the paths, like this:
if __name__ == "__main__":
bnnRoot = "."
npzFile = bnnRoot + "/cifar10_parameters.npz"
targetDirBin = bnnRoot + "/binparam-cifar10-learned-pynq"
targetDirHLS = bnnRoot + "/binparam-cifar10-learned-pynq/hw"
#topology of convolutional layers (only for config.h defines)
ifm = [32, 30, 14, 12, 5, 3]
ofm = [30, 28, 12, 10, 3, 1]
ifm_ch = [ 3, 64, 64, 128, 128, 256]
ofm_ch = [64, 64, 128, 128, 256, 256]
filterDim = [ 3, 3, 3, 3, 3, 3]
WeightsPrecisions_fractional = [0 , 0 , 0 , 0 , 0 , 0 , 0, 0, 0]
ActivationPrecisions_fractional = [0 , 0 , 0 , 0 , 0 , 0 , 0, 0, 0]
InputPrecisions_fractional = [7 , 0 , 0 , 0 , 0 , 0 , 0, 0, 0]
WeightsPrecisions_integer = [1 , 1 , 1 , 1 , 1 , 1 , 1, 1, 1]
ActivationPrecisions_integer = [1 , 1 , 1 , 1 , 1 , 1 , 1, 1, 16]
InputPrecisions_integer = [1 , 1 , 1 , 1 , 1 , 1 , 1, 1, 1]
classes = ['Airplane', 'Automobile', 'Bird', 'Cat', 'Deer', 'Dog', 'Frog', 'Horse', 'Ship', 'Truck']
#configuration of PE and SIMD counts
peCounts = [16, 32, 16, 16, 4, 1, 1, 1, 4]
simdCounts = [ 3, 32, 32, 32, 32, 32, 4, 8, 1]
These arrays hold the per-layer precision information. Everything else stays unchanged:
WeightsPrecisions_fractional = [0 , 0 , 0 , 0 , 0 , 0 , 0, 0, 0]
ActivationPrecisions_fractional = [0 , 0 , 0 , 0 , 0 , 0 , 0, 0, 0]
InputPrecisions_fractional = [7 , 0 , 0 , 0 , 0 , 0 , 0, 0, 0]
WeightsPrecisions_integer = [1 , 1 , 1 , 1 , 1 , 1 , 1, 1, 1]
ActivationPrecisions_integer = [1 , 1 , 1 , 1 , 1 , 1 , 1, 1, 16]
InputPrecisions_integer = [1 , 1 , 1 , 1 , 1 , 1 , 1, 1, 1]
The split weights get written out to the specified folder.
If it looks like this, you're good:
0-0-thres.bin 0-5-weights.bin* 1-16-thres.bin* 1-26-weights.bin 1-8-thres.bin* 2-3-weights.bin 3-14-thres.bin 4-0-weights.bin
0-0-weights.bin 0-6-thres.bin* 1-16-weights.bin* 1-27-thres.bin 1-8-weights.bin* 2-4-thres.bin 3-14-weights.bin 4-1-thres.bin
0-10-thres.bin* 0-6-weights.bin* 1-17-thres.bin* 1-27-weights.bin 1-9-thres.bin* 2-4-weights.bin 3-15-thres.bin 4-1-weights.bin
0-10-weights.bin* 0-7-thres.bin* 1-17-weights.bin* 1-28-thres.bin 1-9-weights.bin* 2-5-thres.bin 3-15-weights.bin 4-2-thres.bin
0-11-thres.bin* 0-7-weights.bin* 1-18-thres.bin* 1-28-weights.bin 2-0-thres.bin 2-5-weights.bin 3-1-thres.bin 4-2-weights.bin
0-11-weights.bin* 0-8-thres.bin* 1-18-weights.bin* 1-29-thres.bin 2-0-weights.bin 2-6-thres.bin 3-1-weights.bin 4-3-thres.bin
0-12-thres.bin* 0-8-weights.bin* 1-19-thres.bin* 1-29-weights.bin 2-10-thres.bin 2-6-weights.bin 3-2-thres.bin 4-3-weights.bin
0-12-weights.bin* 0-9-thres.bin* 1-19-weights.bin* 1-2-thres.bin* 2-10-weights.bin 2-7-thres.bin 3-2-weights.bin 5-0-thres.bin
0-13-thres.bin* 0-9-weights.bin* 1-1-thres.bin* 1-2-weights.bin* 2-11-thres.bin 2-7-weights.bin 3-3-thres.bin 5-0-weights.bin
0-13-weights.bin* 1-0-thres.bin* 1-1-weights.bin* 1-30-thres.bin 2-11-weights.bin 2-8-thres.bin 3-3-weights.bin 6-0-thres.bin
0-14-thres.bin* 1-0-weights.bin* 1-20-thres.bin* 1-30-weights.bin 2-12-thres.bin 2-8-weights.bin 3-4-thres.bin 6-0-weights.bin
0-14-weights.bin* 1-10-thres.bin* 1-20-weights.bin* 1-31-thres.bin 2-12-weights.bin 2-9-thres.bin 3-4-weights.bin 7-0-thres.bin
0-15-thres.bin* 1-10-weights.bin* 1-21-thres.bin* 1-31-weights.bin 2-13-thres.bin 2-9-weights.bin 3-5-thres.bin 7-0-weights.bin
0-15-weights.bin* 1-11-thres.bin* 1-21-weights.bin* 1-3-thres.bin* 2-13-weights.bin 3-0-thres.bin 3-5-weights.bin 8-0-thres.bin
0-1-thres.bin 1-11-weights.bin* 1-22-thres.bin* 1-3-weights.bin* 2-14-thres.bin 3-0-weights.bin 3-6-thres.bin 8-0-weights.bin
0-1-weights.bin 1-12-thres.bin* 1-22-weights.bin* 1-4-thres.bin* 2-14-weights.bin 3-10-thres.bin 3-6-weights.bin 8-1-thres.bin
0-2-thres.bin 1-12-weights.bin* 1-23-thres.bin 1-4-weights.bin* 2-15-thres.bin 3-10-weights.bin 3-7-thres.bin 8-1-weights.bin
0-2-weights.bin 1-13-thres.bin* 1-23-weights.bin 1-5-thres.bin* 2-15-weights.bin 3-11-thres.bin 3-7-weights.bin 8-2-thres.bin
0-3-thres.bin 1-13-weights.bin* 1-24-thres.bin 1-5-weights.bin* 2-1-thres.bin 3-11-weights.bin 3-8-thres.bin 8-2-weights.bin
0-3-weights.bin 1-14-thres.bin* 1-24-weights.bin 1-6-thres.bin* 2-1-weights.bin 3-12-thres.bin 3-8-weights.bin 8-3-thres.bin
0-4-thres.bin* 1-14-weights.bin* 1-25-thres.bin 1-6-weights.bin* 2-2-thres.bin 3-12-weights.bin 3-9-thres.bin 8-3-weights.bin
0-4-weights.bin* 1-15-thres.bin* 1-25-weights.bin 1-7-thres.bin* 2-2-weights.bin 3-13-thres.bin 3-9-weights.bin classes.txt
0-5-thres.bin* 1-15-weights.bin* 1-26-thres.bin 1-7-weights.bin* 2-3-thres.bin 3-13-weights.bin 4-0-thres.bin
3.4. Running It on the PYNQ
First, SSH into the PYNQ and create the following folders:
$ cd /usr/local/lib/python3.6/dist-packages/bnn/params/
$ mkdir original
$ cd original
$ mkdir cnvW1A1
Next, copy the weights split in 3.3 into the newly created cnvW1A1 folder.
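For example, from the training machine (a sketch; the host name and user are assumptions):
$ scp -r binparam-cifar10-learned-pynq/* xilinx@pynq:/usr/local/lib/python3.6/dist-packages/bnn/params/original/cnvW1A1/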
Then open Jupyter and run the cells one after another.
With this, our own trained weights are loaded into the VGG-16-style network.
All that remains is to run classification.
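A minimal sketch of such a cell, assuming the BNN-PYNQ Python API of this era (class and constant names may differ between releases):
import bnn
from PIL import Image

# "original" is the params folder created above
classifier = bnn.CnvClassifier(bnn.NETWORK_CNVW1A1, "original", bnn.RUNTIME_HW)
img = Image.open("/home/xilinx/image.png")
idx = classifier.classify_image(img)
print(classifier.class_name(idx))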
4. Running Your Own Network and Dataset on the FPGA
This time we build a simple CNN model and train it on MNIST.
Since we now handle grayscale input, substantial changes are required.
4.1. Building and Training the Network
First, build and train the new network.
Under YOUR_PATH/BNN-PYNQ/bnn/src/training/, copy the following files:
# used for training
$ cp cifar10.py custom_train.py
# the network structure itself
$ cp cnv.py custom_network.py
Rewrite custom_train.py as follows.
The main changes are the import target, the learning rate, and the save-file name.
from __future__ import print_function
import sys
import os
import time
import numpy as np
np.random.seed(1234) # for reproducibility?
import theano
import theano.tensor as T
import lasagne
import gzip
import binary_net
import custom_network as nw
from keras.datasets import mnist
import cv2
from collections import OrderedDict
if __name__ == "__main__":
learning_parameters = OrderedDict()
# BN parameters
batch_size = 50
print("batch_size = "+str(batch_size))
# alpha is the exponential moving average factor
learning_parameters.alpha = .1
print("alpha = "+str(learning_parameters.alpha))
learning_parameters.epsilon = 1e-3
print("epsilon = "+str(learning_parameters.epsilon))
# W_LR_scale = 1.
learning_parameters.W_LR_scale = "Glorot" # "Glorot" means we are using the coefficients from Glorot's paper
print("W_LR_scale = "+str(learning_parameters.W_LR_scale))
# Training parameters
num_epochs = 500
print("num_epochs = "+str(num_epochs))
# Decaying LR
LR_start = 0.001
print("LR_start = "+str(LR_start))
LR_fin = 0.0000003
print("LR_fin = "+str(LR_fin))
LR_decay = (LR_fin/LR_start)**(1./num_epochs)
print("LR_decay = "+str(LR_decay))
# BTW, LR decay might be good for the BN moving average...
save_path = "custom_parameters.npz"
print("save_path = "+str(save_path))
train_set_size = 45000
print("train_set_size = "+str(train_set_size))
shuffle_parts = 1
print("shuffle_parts = "+str(shuffle_parts))
print('Loading dataset...')
# load data
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# Transpose: Batch, C, X, Y
tmp = []
for i in range(len(X_train)):
tmp.append( np.reshape(cv2.resize(X_train[i], (60, 60)), (1, 60, 60)) )
X_train = np.array(tmp)
tmp = []
for i in range(len(X_test)):
tmp.append( np.reshape(cv2.resize(X_test[i], (60, 60)), (1, 60, 60)) )
X_test = np.array(tmp)
train_set_X = np.float32(X_train[0:50000])
train_set_X = (train_set_X * 2./255.) - 1
train_set_y = y_train[0:50000]
train_set_y = np.float32(np.eye(10)[train_set_y.flatten()])
train_set_y = 2 * train_set_y - 1.
valid_set_X = np.float32(X_train[50000:])
valid_set_X = (valid_set_X * 2./255.) - 1
valid_set_y = y_train[50000:]
valid_set_y = np.float32(np.eye(10)[valid_set_y.flatten()])
valid_set_y = 2 * valid_set_y - 1.
test_set_X = np.float32(X_test)
test_set_X = (test_set_X * 2./255.) - 1
test_set_y = y_test
test_set_y = np.float32(np.eye(10)[test_set_y.flatten()])
test_set_y = 2 * test_set_y - 1.
print('Building the CNN...')
# Prepare Theano variables for inputs and targets
input = T.tensor4('inputs')
target = T.matrix('targets')
LR = T.scalar('LR', dtype=theano.config.floatX)
cnn = nw.genNetwork(input, 10, learning_parameters)
train_output = lasagne.layers.get_output(cnn, deterministic=False)
# squared hinge loss
loss = T.mean(T.sqr(T.maximum(0.,1.-target*train_output)))
# W updates
W = lasagne.layers.get_all_params(cnn, binary=True)
W_grads = binary_net.compute_grads(loss,cnn)
updates = lasagne.updates.adam(loss_or_grads=W_grads, params=W, learning_rate=LR)
updates = binary_net.clipping_scaling(updates,cnn)
# other parameters updates
params = lasagne.layers.get_all_params(cnn, trainable=True, binary=False)
updates = OrderedDict(updates.items() + lasagne.updates.adam(loss_or_grads=loss, params=params, learning_rate=LR).items())
test_output = lasagne.layers.get_output(cnn, deterministic=True)
test_loss = T.mean(T.sqr(T.maximum(0.,1.-target*test_output)))
test_err = T.mean(T.neq(T.argmax(test_output, axis=1), T.argmax(target, axis=1)),dtype=theano.config.floatX)
# Compile a function performing a training step on a mini-batch (by giving the updates dictionary)
# and returning the corresponding training loss:
train_fn = theano.function([input, target, LR], loss, updates=updates)
# Compile a second function computing the validation loss and accuracy:
val_fn = theano.function([input, target], [test_loss, test_err])
print('Training...')
binary_net.train(
train_fn,val_fn,
cnn,
batch_size,
LR_start,LR_decay,
num_epochs,
train_set_X,train_set_y,
valid_set_X,valid_set_y,
test_set_X,test_set_y,
save_path=save_path,
shuffle_parts=shuffle_parts)
Rewrite custom_network.py as follows as well.
The main change is adapting it to the larger input.
import lasagne
import binary_net
def genNetwork(input, num_outputs, learning_parameters):
# A function to generate the cnv network topology which matches the overlay for the Pynq board.
# WARNING: If you change this file, it's likely the resultant weights will not fit on the Pynq overlay.
if num_outputs < 1 or num_outputs > 64:
error("num_outputs should be in the range of 1 to 64.")
stochastic = False
binary = True
H = 1
activation = binary_net.binary_tanh_unit
W_LR_scale = learning_parameters.W_LR_scale
epsilon = learning_parameters.epsilon
alpha = learning_parameters.alpha
cnn = lasagne.layers.InputLayer(
shape=(None, 1, 60, 60),
input_var=input)
print(cnn.output_shape)
# 64C3-64C3-P2
cnn = binary_net.Conv2DLayer(
cnn,
binary=binary,
stochastic=stochastic,
H=H,
W_LR_scale=W_LR_scale,
num_filters=64,
filter_size=(3, 3),
pad='valid',
flip_filters=False,
nonlinearity=lasagne.nonlinearities.identity)
cnn = lasagne.layers.BatchNormLayer(
cnn,
epsilon=epsilon,
alpha=alpha)
cnn = lasagne.layers.NonlinearityLayer(
cnn,
nonlinearity=activation)
print(cnn.output_shape)
cnn = binary_net.Conv2DLayer(
cnn,
binary=binary,
stochastic=stochastic,
H=H,
W_LR_scale=W_LR_scale,
num_filters=64,
filter_size=(3, 3),
pad='valid',
flip_filters=False,
nonlinearity=lasagne.nonlinearities.identity)
cnn = lasagne.layers.MaxPool2DLayer(cnn, pool_size=(2, 2))
cnn = lasagne.layers.BatchNormLayer(
cnn,
epsilon=epsilon,
alpha=alpha)
cnn = lasagne.layers.NonlinearityLayer(
cnn,
nonlinearity=activation)
print(cnn.output_shape)
# 256C3-256C3-P2
cnn = binary_net.Conv2DLayer(
cnn,
binary=binary,
stochastic=stochastic,
H=H,
W_LR_scale=W_LR_scale,
num_filters=128,
filter_size=(3, 3),
pad='valid',
flip_filters=False,
nonlinearity=lasagne.nonlinearities.identity)
cnn = lasagne.layers.BatchNormLayer(
cnn,
epsilon=epsilon,
alpha=alpha)
cnn = lasagne.layers.NonlinearityLayer(
cnn,
nonlinearity=activation)
print(cnn.output_shape)
cnn = binary_net.Conv2DLayer(
cnn,
binary=binary,
stochastic=stochastic,
H=H,
W_LR_scale=W_LR_scale,
num_filters=128,
filter_size=(3, 3),
pad='valid',
flip_filters=False,
nonlinearity=lasagne.nonlinearities.identity)
cnn = lasagne.layers.MaxPool2DLayer(cnn, pool_size=(2, 2))
cnn = lasagne.layers.BatchNormLayer(
cnn,
epsilon=epsilon,
alpha=alpha)
cnn = lasagne.layers.NonlinearityLayer(
cnn,
nonlinearity=activation)
print(cnn.output_shape)
cnn = binary_net.Conv2DLayer(
cnn,
binary=binary,
stochastic=stochastic,
H=H,
W_LR_scale=W_LR_scale,
num_filters=256,
filter_size=(3, 3),
pad='valid',
flip_filters=False,
nonlinearity=lasagne.nonlinearities.identity)
cnn = lasagne.layers.BatchNormLayer(
cnn,
epsilon=epsilon,
alpha=alpha)
cnn = lasagne.layers.NonlinearityLayer(
cnn,
nonlinearity=activation)
print(cnn.output_shape)
cnn = binary_net.Conv2DLayer(
cnn,
binary=binary,
stochastic=stochastic,
H=H,
W_LR_scale=W_LR_scale,
num_filters=256,
filter_size=(3, 3),
pad='valid',
flip_filters=False,
nonlinearity=lasagne.nonlinearities.identity)
cnn = lasagne.layers.MaxPool2DLayer(cnn, pool_size=(2, 2))
cnn = lasagne.layers.BatchNormLayer(
cnn,
epsilon=epsilon,
alpha=alpha)
cnn = lasagne.layers.NonlinearityLayer(
cnn,
nonlinearity=activation)
print(cnn.output_shape)
cnn = binary_net.Conv2DLayer(
cnn,
binary=binary,
stochastic=stochastic,
H=H,
W_LR_scale=W_LR_scale,
num_filters=256,
filter_size=(3, 3),
pad='valid',
flip_filters=False,
nonlinearity=lasagne.nonlinearities.identity)
cnn = lasagne.layers.BatchNormLayer(
cnn,
epsilon=epsilon,
alpha=alpha)
cnn = lasagne.layers.NonlinearityLayer(
cnn,
nonlinearity=activation)
# 512FP-outputFP
print(cnn.output_shape)
cnn = binary_net.DenseLayer(
cnn,
binary=binary,
stochastic=stochastic,
H=H,
W_LR_scale=W_LR_scale,
nonlinearity=lasagne.nonlinearities.identity,
num_units=512)
cnn = lasagne.layers.BatchNormLayer(
cnn,
epsilon=epsilon,
alpha=alpha)
cnn = lasagne.layers.NonlinearityLayer(
cnn,
nonlinearity=activation)
print(cnn.output_shape)
cnn = binary_net.DenseLayer(
cnn,
binary=binary,
stochastic=stochastic,
H=H,
W_LR_scale=W_LR_scale,
nonlinearity=lasagne.nonlinearities.identity,
num_units=num_outputs)
cnn = lasagne.layers.BatchNormLayer(
cnn,
epsilon=epsilon,
alpha=alpha)
print(cnn.output_shape)
return cnn
Structurally, it looks like this:
Input (1, 60, 60) -> CNV (64, 3, 3) -> B & A -> CNV (64, 3, 3) -> MAXPooling2x2 -> B & A ->
CNV (128, 3, 3) -> B & A -> CNV (128, 3, 3) -> MAXPooling2x2 -> B & A ->
CNV (256, 3, 3) -> B & A -> CNV (256, 3, 3) -> MAXPooling2x2 -> B & A ->
CNV (256, 3, 3) -> B & A -> DNS (512) -> B & A -> DNS (classes_num) -> B
Now, finally, start training with:
$ python custom_train.py
The network structure should be printed out like this:
(None, 1, 60, 60)
(None, 64, 58, 58)
(None, 64, 28, 28)
(None, 128, 26, 26)
(None, 128, 12, 12)
(None, 256, 10, 10)
(None, 256, 4, 4)
(None, 256, 2, 2)
(None, 512)
(None, 10)
4.2. Converting the Weights
Next, convert the trained weights for the FPGA.
Under YOUR_PATH/BNN-PYNQ/bnn/src/training/, copy the following file:
$ cp cifar10-gen-weights.py custom-gen-weights.py
Open custom-gen-weights.py.
4.2.1. Sizes
First, based on the network structure, rewrite the input/output image sizes and channel counts of the convolution layers, along with the filter sizes.
Using the structure from 4.1, rewrite them as follows:
ifm = [60, 58, 28, 26, 12, 10, 4]
ofm = [58, 56, 26, 24 , 10, 8, 2]
ifm_ch = [ 1, 64, 64, 128, 128, 256, 256]
ofm_ch = [64, 64, 128, 128, 256, 256, 256]
filterDim = [ 3, 3, 3, 3, 3, 3, 3]
4.2.2. Precision
Next, rewrite the precision arrays as follows. For now, everything other than the input and the output can stay binary:
WeightsPrecisions_fractional = [0 , 0 , 0 , 0 , 0 , 0, 0 , 0 , 0]
ActivationPrecisions_fractional = [0 , 0 , 0 , 0 , 0 , 0, 0 , 0 , 0]
InputPrecisions_fractional = [7 , 0 , 0 , 0 , 0 , 0, 0 , 0 , 0]
WeightsPrecisions_integer = [1 , 1 , 1 , 1 , 1 , 1, 1 , 1 , 1]
ActivationPrecisions_integer = [1 , 1 , 1 , 1 , 1 , 1, 1 , 1 , 16]
InputPrecisions_integer = [1 , 1 , 1 , 1 , 1 , 1, 1 , 1 , 1]
4.2.3. Parallelism
The next tricky part is working out the PE and SIMD counts.
So that no single layer becomes a bottleneck on the FPGA, the larger layers are parallelized more finely.
・P is the number of PEs; S is the number of SIMD lanes per PE
・matrix height: P
・matrix width (tile): S
・P*S elements are processed at once
・each row of a tile is handled by a different PE
・each column of a tile is handled by a different SIMD lane
・the slowest layer sets the overall throughput, so tune SIMD/PE so that every layer runs at about the same speed
In other words, each row of a convolution kernel (tile) is processed by a PE and each column by a SIMD lane.
For an $X \times Y$ matrix, $F_n = X/P$ is the neuron fold and $F_s = Y/S$ is the synapse fold; the total fold $F$ is given by $F_n \cdot F_s$.
For example, a 6×4 weight matrix split across 3 PEs with 2 SIMD lanes each gives $F_n = 6/3$ and $F_s = 4/2$, so $F_n \cdot F_s$ is 4 cycles.
For a convolutional layer, the total fold is $F = F_m \cdot F_n \cdot F_s$, where $F_m$ is the number of matrix-vector products, which equals the number of output pixels.
With streaming, the classification throughput can be defined as $F_{clk}$ (the clock frequency) divided by $II_{max}$ (the initiation interval of the slowest layer).
In a fully-connected layer, the total fold $F$ equals the initiation interval, so to balance the layers, tune $F_n$ and $F_s$ per layer so that $F = F_{clk}/\mathrm{FPS}$.
...
Yikes. I barely follow any of this.
Boiled down, it just means the latency of every layer should come out as equal as possible.
Let's have a casual look at the latency-estimation code for conv layers in finnthesizer.py:
# return HW config string as C #define's for a Conv layer
def printConvDefines(prefix, kernelDim, ifm_ch, ifm_dim, ofm_ch, ofm_dim, simd, pe, wmem, tmem, wpi, api, wpf, apf):
#network topology
config = ""
numb_ops = 2*ifm_ch*ofm_ch*kernelDim*kernelDim*ofm_dim*ofm_dim # 2* because of MAC
est_latency = numb_ops/(2*simd*pe)
...
From this we can roughly size the simd*pe product:
numb_ops = 2 * input channels * output channels * kernel_x * kernel_y * output width * output height
latency = numb_ops / (2 * SIMD count * PE count)
How to split between SIMD and PE, though, is still unclear.
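As a worked example for conv layer 1 of the cifar10 network (ifm_ch = 64, ofm_ch = 64, 3x3 kernel, 56x56 output): numb_ops = 2*64*64*3*3*56*56 = 231,211,008, and with pe = 32 and simd = 32 that gives est_latency = 231,211,008 / (2*32*32) = 112,896 cycles.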
FC layers, by the way, are much simpler:
# return HW config string as C #define's for a FC layer
def printFCDefines(prefix, simd, pe, wmem, tmem, mw, mh, wpi, api, wpf, apf):
config = ""
numb_ops = 2*mw*mh # 2* because of MAC
est_latency = numb_ops/(2*simd*pe)
While we are at it, let's also touch on WMEM and TMEM.
For conv layers:
# compute the padded width and height
paddedH = padTo(w.shape[0], peCount)
paddedW = padTo(w.shape[1], simdCount)
# compute memory needed for weights and thresholds
neededWMem = (paddedW * paddedH) / (simdCount * peCount)
neededTMem = paddedH / peCount
print "Layer %d: %d x %d" % (convl, paddedH, paddedW)
print "WMem = %d TMem = %d" % (neededWMem, neededTMem)
# return val to nearest multiple of pad
def padTo(val, pad):
rem = val % pad
return val if rem == 0 else (val + pad - rem)
The PE count ultimately drives FPGA LUT usage, so keep PE low and fine-tune until the amount of work per layer comes out roughly equal:
#configuration of PE and SIMD counts
peCounts = [16, 32, 16, 16, 16, 8, 4, 1, 4]
simdCounts = [ 1, 32, 32, 32, 32, 32, 8, 8, 1]
Note that the SIMD/PE counts must divide IFM_CH evenly and must not exceed it
(since that is the number of ways the channels get parallelized).
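A throwaway script (my own sketch, not part of the repo) that reuses the finnthesizer estimate to check the balance across the conv layers:

# sanity-check latency balance: est_latency = numb_ops / (2 * simd * pe)
ifm_ch = [ 1, 64,  64, 128, 128, 256, 256]
ofm_ch = [64, 64, 128, 128, 256, 256, 256]
ofm    = [58, 56,  26,  24,  10,   8,   2]
peCounts   = [16, 32, 16, 16, 16, 8, 4]   # conv entries only
simdCounts = [ 1, 32, 32, 32, 32, 32, 8]  # conv entries only

for l in range(7):
    numb_ops = 2 * ifm_ch[l] * ofm_ch[l] * 3 * 3 * ofm[l] * ofm[l]
    est_latency = numb_ops / (2 * simdCounts[l] * peCounts[l])
    print("conv %d: est_latency = %d" % (l, est_latency))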
4.2.4. Generation (Splitting the Weights)
All that remains is to adjust the class list, the loop counts, and so on to match the modified network.
The finished script looks like this:
#BSD 3-Clause License
#=======
#
#Copyright (c) 2017, Xilinx
#All rights reserved.
#
#Redistribution and use in source and binary forms, with or without
#modification, are permitted provided that the following conditions are met:
#
#* Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
#* Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
#* Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
#THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
#AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
#IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
#DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
#FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
#DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
#SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
#CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
#OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
#OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
import os
import sys
from finnthesizer import *
if __name__ == "__main__":
bnnRoot = "."
npzFile = bnnRoot + "/custom_parameters.npz"
targetDirBin = bnnRoot + "/binparam-custom"
targetDirHLS = bnnRoot + "/binparam-custom/hw"
#topology of convolutional layers (only for config.h defines)
ifm = [60, 58, 28, 26, 12, 10, 4]
ofm = [58, 56, 26, 24 , 10, 8, 2]
ifm_ch = [ 1, 64, 64, 128, 128, 256, 256]
ofm_ch = [64, 64, 128, 128, 256, 256, 256]
filterDim = [ 3, 3, 3, 3, 3, 3, 3]
WeightsPrecisions_fractional = [0 , 0 , 0 , 0 , 0 , 0, 0 , 0 , 0]
ActivationPrecisions_fractional = [0 , 0 , 0 , 0 , 0 , 0, 0 , 0 , 0]
InputPrecisions_fractional = [7 , 0 , 0 , 0 , 0 , 0, 0 , 0 , 0]
WeightsPrecisions_integer = [1 , 1 , 1 , 1 , 1 , 1, 1 , 1 , 1]
ActivationPrecisions_integer = [1 , 1 , 1 , 1 , 1 , 1, 1 , 1 , 16]
InputPrecisions_integer = [1 , 1 , 1 , 1 , 1 , 1, 1 , 1 , 1]
classes = ['num_0', 'num_1', 'num_2', 'num_3', 'num_4', 'num_5', 'num_6', 'num_7', 'num_8', 'num_9']
#configuration of PE and SIMD counts
peCounts = [16, 32, 16, 16, 16, 8, 4, 1, 4]
simdCounts = [ 1, 32, 32, 32, 32, 32, 8, 8, 1]
#peCounts = [16, 32, 16, 16, 4, 1, 1, 1, 4]
#simdCounts = [ 3, 32, 32, 32, 32, 32, 4, 8, 1]
if not os.path.exists(targetDirBin):
os.mkdir(targetDirBin)
if not os.path.exists(targetDirHLS):
os.mkdir(targetDirHLS)
#read weights
rHW = BNNWeightReader(npzFile, True)
config = "/**\n"
config+= " * Finnthesizer Config-File Generation\n";
config+= " *\n **/\n\n"
config+= "#ifndef __LAYER_CONFIG_H_\n#define __LAYER_CONFIG_H_\n\n"
# process convolutional layers
for convl in range(0, 7):
peCount = peCounts[convl]
simdCount = simdCounts[convl]
WPrecision_fractional = WeightsPrecisions_fractional[convl]
APrecision_fractional = ActivationPrecisions_fractional[convl]
IPrecision_fractional = InputPrecisions_fractional[convl]
WPrecision_integer = WeightsPrecisions_integer[convl]
APrecision_integer = ActivationPrecisions_integer[convl]
IPrecision_integer = InputPrecisions_integer[convl]
print "Using peCount = %d simdCount = %d for engine %d" % (peCount, simdCount, convl)
if convl == 0:
# use fixed point weights for the first layer
(w,t) = rHW.readConvBNComplex(WPrecision_fractional, APrecision_fractional, IPrecision_fractional, WPrecision_integer, APrecision_integer, IPrecision_integer, usePopCount=False)
# compute the padded width and height
paddedH = padTo(w.shape[0], peCount)
paddedW = padTo(w.shape[1], simdCount)
# compute memory needed for weights and thresholds
neededWMem = (paddedW * paddedH) / (simdCount * peCount)
neededTMem = paddedH / peCount
print "Layer %d: %d x %d" % (convl, paddedH, paddedW)
print "WMem = %d TMem = %d" % (neededWMem, neededTMem)
print "IPrecision = %d.%d WPrecision = %d.%d APrecision = %d.%d" % (IPrecision_integer, IPrecision_fractional, WPrecision_integer,WPrecision_fractional, APrecision_integer, APrecision_fractional)
m = BNNProcElemMem(peCount, simdCount, neededWMem, neededTMem, WPrecision_integer, APrecision_integer, IPrecision_integer, WPrecision_fractional, APrecision_fractional, IPrecision_fractional, numThresBits=24, numThresIntBits=16)
m.addMatrix(w,t,paddedW,paddedH)
config += (printConvDefines("L%d" % convl, filterDim[convl], ifm_ch[convl], ifm[convl], ofm_ch[convl], ofm[convl], simdCount, peCount, neededWMem, neededTMem, WPrecision_integer, APrecision_integer, WPrecision_fractional, APrecision_fractional)) + "\n"
#generate HLS weight and threshold header file to initialize memory directly on bitstream generation
#m.createHLSInitFiles(targetDirHLS + "/memdata-" + str(convl) + ".h", str(convl))
#generate binary weight and threshold files to initialize memory during runtime
#because HLS might not work for very large header files
m.createBinFiles(targetDirBin, str(convl))
else:
# regular binarized layer
(w,t) = rHW.readConvBNComplex(WPrecision_fractional, APrecision_fractional, IPrecision_fractional, WPrecision_integer, APrecision_integer, IPrecision_integer)
# compute the padded width and height
paddedH = padTo(w.shape[0], peCount)
paddedW = padTo(w.shape[1], simdCount)
# compute memory needed for weights and thresholds
neededWMem = (paddedW * paddedH) / (simdCount * peCount)
neededTMem = paddedH / peCount
print "Layer %d: %d x %d" % (convl, paddedH, paddedW)
print "WMem = %d TMem = %d" % (neededWMem, neededTMem)
print "IPrecision = %d.%d WPrecision = %d.%d APrecision = %d.%d" % (IPrecision_integer, IPrecision_fractional, WPrecision_integer,WPrecision_fractional, APrecision_integer, APrecision_fractional)
m = BNNProcElemMem(peCount, simdCount, neededWMem, neededTMem, WPrecision_integer, APrecision_integer, IPrecision_integer, WPrecision_fractional, APrecision_fractional, IPrecision_fractional)
m.addMatrix(w,t,paddedW,paddedH)
config += (printConvDefines("L%d" % convl, filterDim[convl], ifm_ch[convl], ifm[convl], ofm_ch[convl], ofm[convl], simdCount, peCount, neededWMem, neededTMem, WPrecision_integer, APrecision_integer, WPrecision_fractional, APrecision_fractional)) + "\n"
#generate HLS weight and threshold header file to initialize memory directly on bitstream generation
#m.createHLSInitFiles(targetDirHLS + "/memdata-" + str(convl) + ".h", str(convl))
#generate binary weight and threshold files to initialize memory during runtime
#because HLS might not work for very large header files
m.createBinFiles(targetDirBin, str(convl))
# process fully-connected layers
for fcl in range(7,9):
peCount = peCounts[fcl]
simdCount = simdCounts[fcl]
WPrecision_fractional = WeightsPrecisions_fractional[fcl]
APrecision_fractional = ActivationPrecisions_fractional[fcl]
IPrecision_fractional = InputPrecisions_fractional[fcl]
WPrecision_integer = WeightsPrecisions_integer[fcl]
APrecision_integer = ActivationPrecisions_integer[fcl]
IPrecision_integer = InputPrecisions_integer[fcl]
print "Using peCount = %d simdCount = %d for engine %d" % (peCount, simdCount, fcl)
(w,t) = rHW.readFCBNComplex(WPrecision_fractional, APrecision_fractional, IPrecision_fractional, WPrecision_integer, APrecision_integer, IPrecision_integer)
# compute the padded width and height
paddedH = padTo(w.shape[0], peCount)
if (fcl == 8):
paddedH = padTo(w.shape[0], 64)
paddedW = padTo(w.shape[1], simdCount)
# compute memory needed for weights and thresholds
neededWMem = (paddedW * paddedH) / (simdCount * peCount)
neededTMem = paddedH / peCount
print "Layer %d: %d x %d" % (fcl, paddedH, paddedW)
print "WMem = %d TMem = %d" % (neededWMem, neededTMem)
print "IPrecision = %d.%d WPrecision = %d.%d APrecision = %d.%d" % (IPrecision_integer, IPrecision_fractional, WPrecision_integer,WPrecision_fractional, APrecision_integer, APrecision_fractional)
m = BNNProcElemMem(peCount, simdCount, neededWMem, neededTMem, WPrecision_integer, APrecision_integer, IPrecision_integer, WPrecision_fractional, APrecision_fractional, IPrecision_fractional)
m.addMatrix(w,t,paddedW,paddedH)
config += (printFCDefines("L%d" % fcl, simdCount, peCount, neededWMem, neededTMem, paddedW, paddedH, WPrecision_integer, APrecision_integer, WPrecision_fractional, APrecision_fractional)) + "\n"
#generate HLS weight and threshold header file to initialize memory directly on bitstream generation
#m.createHLSInitFiles(targetDirHLS + "/memdata-" + str(fcl) + ".h", str(fcl))
#generate binary weight and threshold files to initialize memory during runtime
#because HLS might not work for very large header files
m.createBinFiles(targetDirBin, str(fcl))
config+="#endif //__LAYER_CONFIG_H_\n"
configFile = open(targetDirHLS+"/config.h", "w")
configFile.write(config)
configFile.close()
with open(targetDirBin + "/classes.txt", "w") as f:
f.write("\n".join(classes))
Now split the weights:
$ python custom-gen-weights.py
In /binparam-custom/hw/config.h
there are entries labeled Ext Latency.
This is the estimated latency. If these values come out roughly equal for every layer, no single layer becomes the critical path.
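You can check them all at once with, for example:
$ grep "Ext Latency" binparam-custom/hw/config.h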
4.3. Implementing the FPGA Side
Now we build the same network structure on the FPGA side.
First, copy the template:
$ cd /YOUR_PATH/BNN-PYNQ/bnn/src/network/
$ cp -R cnvW1A1 cnvCustom
Next, take the config.h generated in 4.2 under /binparam-custom/hw
and overwrite cnvCustom/hw/config.h with it.
Then rewrite top.cpp to match this config.h.
4.3.1. [H/W] Weights (CNV, FC)
First, there are nine weight memories in total (7 conv layers + 2 FC layers), so define them as follows:
static BinaryWeights<L0_SIMD, L0_PE, L0_WMEM> weights0;
static BinaryWeights<L1_SIMD, L1_PE, L1_WMEM> weights1;
static BinaryWeights<L2_SIMD, L2_PE, L2_WMEM> weights2;
static BinaryWeights<L3_SIMD, L3_PE, L3_WMEM> weights3;
static BinaryWeights<L4_SIMD, L4_PE, L4_WMEM> weights4;
static BinaryWeights<L5_SIMD, L5_PE, L5_WMEM> weights5;
static BinaryWeights<L6_SIMD, L6_PE, L6_WMEM> weights6;
static BinaryWeights<L7_SIMD, L7_PE, L7_WMEM> weights7;
static BinaryWeights<L8_SIMD, L8_PE, L8_WMEM> weights8;
4.3.2. [H/W] Weights (Batch Normalization, Activation)
Next, define the thresholds for batch normalization and activation:
static ThresholdsActivation<L0_TMEM, L0_PE, L0_API, ap_fixed<24, 16>, ap_uint<L0_API> > threshs0;
static ThresholdsActivation<L1_TMEM, L1_PE, L1_API, ap_int<16>, ap_uint<L1_API>> threshs1;
static ThresholdsActivation<L2_TMEM, L2_PE, L2_API, ap_int<16>, ap_uint<L2_API>> threshs2;
static ThresholdsActivation<L3_TMEM, L3_PE, L3_API, ap_int<16>, ap_uint<L3_API>> threshs3;
static ThresholdsActivation<L4_TMEM, L4_PE, L4_API, ap_int<16>, ap_uint<L4_API>> threshs4;
static ThresholdsActivation<L5_TMEM, L5_PE, L5_API, ap_int<16>, ap_uint<L5_API>> threshs5;
static ThresholdsActivation<L6_TMEM, L6_PE, L6_API, ap_int<16>, ap_uint<L6_API>> threshs6;
static ThresholdsActivation<L7_TMEM, L7_PE, L7_API, ap_int<16>, ap_uint<L7_API>> threshs7;
4.3.3. [H/W] Loading the Weights
Next, write the code that loads the weights into the variables we just defined:
void DoMemInit(unsigned int targetLayer, unsigned int targetMem, unsigned int targetInd, unsigned int targetThresh, ap_uint<64> val) {
switch (targetLayer) {
case 0:
weights0.m_weights[targetMem][targetInd] = val;
break;
case 1:
threshs0.m_thresholds[targetMem][targetInd][targetThresh] = *reinterpret_cast<ap_fixed<64, 56> *>(&val);
break;
case 2:
weights1.m_weights[targetMem][targetInd] = val;
break;
case 3:
threshs1.m_thresholds[targetMem][targetInd][targetThresh] = val;
break;
case 4:
weights2.m_weights[targetMem][targetInd] = val;
break;
case 5:
threshs2.m_thresholds[targetMem][targetInd][targetThresh] = val;
break;
case 6:
weights3.m_weights[targetMem][targetInd] = val;
break;
case 7:
threshs3.m_thresholds[targetMem][targetInd][targetThresh] = val;
break;
case 8:
weights4.m_weights[targetMem][targetInd] = val;
break;
case 9:
threshs4.m_thresholds[targetMem][targetInd][targetThresh] = val;
break;
case 10:
weights5.m_weights[targetMem][targetInd] = val;
break;
case 11:
threshs5.m_thresholds[targetMem][targetInd][targetThresh] = val;
break;
case 12:
weights6.m_weights[targetMem][targetInd] = val;
break;
case 13:
threshs6.m_thresholds[targetMem][targetInd][targetThresh] = val;
break;
case 14:
weights7.m_weights[targetMem][targetInd] = val;
break;
case 15:
threshs7.m_thresholds[targetMem][targetInd][targetThresh] = val;
break;
case 16:
weights8.m_weights[targetMem][targetInd] = val;
break;
case 17:
// do nothing, no thres mem for layer 8 as PassThrough activation is used
break;
}
}
4.3.4. [H/W] Compute
Here we rebuild the same network we constructed for training:
void DoCompute(ap_uint<64> *in, ap_uint<64>* out, const unsigned int numReps) {
#pragma HLS DATAFLOW
stream<ap_uint<64>> inter0("DoCompute.inter0");
stream<ap_uint<192>> inter0_1("DoCompute.inter0_1");
stream<ap_uint<24>> inter0_2("DoCompute.inter0_2");
#pragma HLS STREAM variable=inter0_2 depth=128
stream<ap_uint<64>> inter1("DoCompute.inter1");
#pragma HLS STREAM variable=inter1 depth=128
stream<ap_uint<64>> inter2("DoCompute.inter2");
stream<ap_uint<64>> inter3("DoCompute.inter3");
#pragma HLS STREAM variable=inter3 depth=128
stream<ap_uint<128>> inter4("DoCompute.inter4");
#pragma HLS STREAM variable=inter4 depth=128
stream<ap_uint<128>> inter5("DoCompute.inter5");
stream<ap_uint<128>> inter6("DoCompute.inter6");
#pragma HLS STREAM variable=inter6 depth=81
stream<ap_uint<256>> inter7("DoCompute.inter7");
#pragma HLS STREAM variable=inter7 depth=1
stream<ap_uint<256>> inter8("DoCompute.inter8");
stream<ap_uint<256>> inter9("DoCompute.inter9");
#pragma HLS STREAM variable=inter9 depth=1
stream<ap_uint<64>> inter10("DoCompute.inter10");
#pragma HLS STREAM variable=inter10 depth=128
stream<ap_uint<64>> inter11("DoCompute.inter11");
#pragma HLS STREAM variable=inter11 depth=3
stream<ap_uint<64>> memOutStrm("DoCompute.memOutStrm");
const unsigned int inBits = 60 * 60 * 8;
// const unsigned int inBitsPadded = paddedSize(inBits, 64);
const unsigned int outBits = L8_MH*16;
Mem2Stream_Batch<64, inBits / 8>(in, inter0, numReps);
StreamingDataWidthConverter_Batch<64, 192, (60 * 60 * 8) / 64>(inter0, inter0_1, numReps);
StreamingDataWidthConverter_Batch<192, 24, (60 * 60 * 8) / 192>(inter0_1, inter0_2, numReps);
// convolutional layers
ConvLayer_Batch<L0_K, L0_IFM_CH, L0_IFM_DIM, L0_OFM_CH, L0_OFM_DIM, L0_SIMD, L0_PE, Slice<ap_fixed<8, 1, AP_TRN, AP_SAT>>, Identity, Recast<Binary>>(inter0_2, inter1, weights0, threshs0, numReps, ap_resource_lut());
ConvLayer_Batch<L1_K, L1_IFM_CH, L1_IFM_DIM, L1_OFM_CH, L1_OFM_DIM, L1_SIMD, L1_PE, Recast<XnorMul>>(inter1, inter2, weights1, threshs1, numReps, ap_resource_lut());
StreamingMaxPool_Batch<L1_OFM_DIM, 2, L1_OFM_CH>(inter2, inter3, numReps);
ConvLayer_Batch<L2_K, L2_IFM_CH, L2_IFM_DIM, L2_OFM_CH, L2_OFM_DIM, L2_SIMD, L2_PE, Recast<XnorMul>>(inter3, inter4, weights2, threshs2, numReps, ap_resource_lut());
ConvLayer_Batch<L3_K, L3_IFM_CH, L3_IFM_DIM, L3_OFM_CH, L3_OFM_DIM, L3_SIMD, L3_PE, Recast<XnorMul>>(inter4, inter5, weights3, threshs3, numReps, ap_resource_lut());
StreamingMaxPool_Batch<L3_OFM_DIM, 2, L3_OFM_CH>(inter5, inter6, numReps);
ConvLayer_Batch<L4_K, L4_IFM_CH, L4_IFM_DIM, L4_OFM_CH, L4_OFM_DIM, L4_SIMD, L4_PE, Recast<XnorMul>>(inter6, inter7, weights4, threshs4, numReps, ap_resource_lut());
ConvLayer_Batch<L5_K, L5_IFM_CH, L5_IFM_DIM, L5_OFM_CH, L5_OFM_DIM, L5_SIMD, L5_PE, Recast<XnorMul>>(inter7, inter8, weights5, threshs5, numReps, ap_resource_lut());
StreamingMaxPool_Batch<L5_OFM_DIM, 2, L5_OFM_CH>(inter8, inter9, numReps);
ConvLayer_Batch<L6_K, L6_IFM_CH, L6_IFM_DIM, L6_OFM_CH, L6_OFM_DIM, L6_SIMD, L6_PE, Recast<XnorMul>>(inter9, inter10, weights6, threshs6, numReps, ap_resource_lut());
// fully connected layers
WidthAdjustedOutputStream<16 * L8_PE, 64, L8_MH / L8_PE> wa_out(memOutStrm, numReps);
StreamingFCLayer_Batch<L7_MW, L7_MH, L7_SIMD, L7_PE, Recast<XnorMul>>
(inter10, inter11, weights7, threshs7, numReps, ap_resource_lut());
StreamingFCLayer_Batch<L8_MW, L8_MH, L8_SIMD, L8_PE, Recast<XnorMul>, Slice<ap_uint<16> >>
(inter11, static_cast<hls::stream<ap_uint<16 * L8_PE>>&>(wa_out), weights8, PassThroughActivation<ap_uint<16>>(), numReps, ap_resource_lut());
Stream2Mem_Batch<64, outBits/8>(memOutStrm, out, numReps);
}
The following conversion is the crucial part:
Mem2Stream_Batch<64, inBits / 8>(in, inter0, numReps);
StreamingDataWidthConverter_Batch<64, 192, (60 * 60 * 8) / 64>(inter0, inter0_1, numReps);
StreamingDataWidthConverter_Batch<192, 24, (60 * 60 * 8) / 192>(inter0_1, inter0_2, numReps);
First, the data is streamed in 64-bit words to match the AXI4 bus width; the input, 450 words of 64 bits (60*60*8 bits in total), is then converted into 150 words of 192 bits, which are in turn split into 1200 words of 24 bits.
The data ultimately has to be split into 24-bit words before entering the convolution layer, and if this conversion is wrong the computation silently breaks.
In other words, the total bit count must be divisible by both 64 and 192; 32*32*3*8 or 60*60*1*8, for example.
Alternatively, you can add padding so that it divides evenly.
A few extra runs of zeros in the data do no harm, since the arithmetic uses popcount anyway.
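As a sketch of the padding route (mirroring the padTo helper from finnthesizer; my own example, not repo code):

# round the input bit count up to a multiple of lcm(64, 192) = 192
def padded_size(bits, pad):
    return bits if bits % pad == 0 else bits + pad - (bits % pad)

print(padded_size(60 * 60 * 8, 192))  # 28800, already divisible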
If you run into errors here, rewrite
BNN-PYNQ/bnn/src/hls/streamtools.h
as follows; it will show how the input data gets transformed step by step:
#ifndef STREAMTOOLS_H
#define STREAMTOOLS_H
// only let the first X elements of a stream to pass through, the remainder
// are consumed from input but not re-emitted from the output
// useful for getting rid of e.g. padding words
template<unsigned int DataWidth, // stream width
unsigned int NumAllowed, // number of words to pass through
unsigned int NumTotal // total number of words (NumTotal-NumAllowed swallowed)
>
void StreamLimiter(hls::stream<ap_uint<DataWidth> > & in,
hls::stream<ap_uint<DataWidth> > & out) {
CASSERT_DATAFLOW(NumTotal >= NumAllowed);
unsigned int numLeft = NumAllowed;
for (unsigned int i = 0; i < NumTotal; i++) {
#pragma HLS PIPELINE II=1
ap_uint<DataWidth> e = in.read();
if (numLeft > 0) {
out.write(e);
numLeft--;
}
}
}
template<unsigned int DataWidth, // stream width
unsigned int NumAllowed, // number of words to pass through
unsigned int NumTotal // total number of words (NumTotal-NumAllowed swallowed)
>
void StreamLimiter_Batch(hls::stream<ap_uint<DataWidth> > & in,
hls::stream<ap_uint<DataWidth> > & out, unsigned int numReps) {
for (unsigned int rep = 0; rep < numReps; rep++) {
StreamLimiter<DataWidth, NumAllowed, NumTotal>(in, out);
}
}
template<typename InT, typename OutT>
void StreamingCast(hls::stream<InT> & in, hls::stream<OutT> & out, unsigned int numReps) {
for(unsigned int i = 0; i < numReps; i++) {
#pragma HLS PIPELINE II=1
out.write((OutT) in.read());
}
}
template<unsigned int InWidth, // width of input stream
unsigned int OutWidth, // width of output stream
unsigned int NumInWords // number of input words to process
>
void StreamingDataWidthConverter_Batch(
hls::stream<ap_uint<InWidth> > & in,
hls::stream<ap_uint<OutWidth> > & out,
const unsigned int numReps) {
if (InWidth > OutWidth) {
cout << "InWidth > OutWidth" << endl;
// emit multiple output words per input word read
CASSERT_DATAFLOW(InWidth % OutWidth == 0);
const unsigned int outPerIn = InWidth / OutWidth;
const unsigned int totalIters = NumInWords * outPerIn * numReps;
unsigned int o = 0;
cout << "InWidth: " << InWidth << endl;
cout << "OutWidth: " << OutWidth << endl;
cout << "NumInWords: " << NumInWords << endl;
cout << "outPerIn: " << outPerIn << endl;
cout << "numReps: " << numReps << endl;
cout << "totalIters: " << totalIters << endl;
ap_uint<InWidth> ei = 0;
for (unsigned int t = 0; t < totalIters; t++) {
#pragma HLS PIPELINE II=1
// read new input word if current out count is zero
if (o == 0) {
ei = in.read();
}
// pick output word from the rightmost position
ap_uint<OutWidth> eo = ei(OutWidth - 1, 0);
out.write(eo);
// shift input to get new output word for next iteration
ei = ei >> OutWidth;
// increment written output count
o++;
// wraparound indices to recreate the nested loop structure
if (o == outPerIn) {
o = 0;
}
}
} else if (InWidth == OutWidth) {
cout << "InWidth == OutWidth" << endl;
// straight-through copy
for (unsigned int i = 0; i < NumInWords * numReps; i++) {
#pragma HLS PIPELINE II=1
ap_uint<InWidth> e = in.read();
out.write(e);
}
} else { // InWidth < OutWidth
// read multiple input words per output word emitted
cout << "InWidth < OutWidth" << endl;
CASSERT_DATAFLOW(OutWidth % InWidth == 0);
const unsigned int inPerOut = OutWidth / InWidth;
const unsigned int totalIters = NumInWords * numReps;
unsigned int i = 0;
cout << "InWidth: " << InWidth << endl;
cout << "OutWidth: " << OutWidth << endl;
cout << "NumInWords: " << NumInWords << endl;
cout << "inPerOut: " << inPerOut << endl;
cout << "numReps: " << numReps << endl;
cout << "totalIters: " << totalIters << endl;
ap_uint<OutWidth> eo = 0;
for (unsigned int t = 0; t < totalIters; t++) {
#pragma HLS PIPELINE II=1
// read input and shift into output buffer
ap_uint<InWidth> ei = in.read();
eo = eo >> InWidth;
eo(OutWidth - 1, OutWidth - InWidth) = ei;
// increment read input count
i++;
// wraparound logic to recreate nested loop functionality
if (i == inPerOut) {
i = 0;
out.write(eo);
}
}
}
cout << endl;
}
template<unsigned IW, unsigned OW, unsigned N>
class WidthAdjustedInputStream {
hls::stream<ap_uint<OW>> m_target;
public:
WidthAdjustedInputStream(hls::stream<ap_uint<IW> >& source, unsigned const reps) {
StreamingDataWidthConverter_Batch<IW, OW, N>(source, m_target, reps);
}
~WidthAdjustedInputStream() {}
public:
operator hls::stream<ap_uint<OW> >&() {
return m_target;
}
};
template<unsigned W, unsigned N>
class WidthAdjustedInputStream<W, W, N> {
hls::stream<ap_uint<W>> &m_source;
public:
WidthAdjustedInputStream(hls::stream<ap_uint<W> >& source, unsigned const reps) : m_source(source) {}
~WidthAdjustedInputStream() {}
public:
operator hls::stream<ap_uint<W> >&() {
return m_source;
}
};
template<unsigned IW, unsigned OW, unsigned N>
class WidthAdjustedOutputStream {
hls::stream<ap_uint<IW>> m_buffer;
hls::stream<ap_uint<OW>> &m_target;
unsigned const m_reps;
public:
WidthAdjustedOutputStream(hls::stream<ap_uint<OW> >& target, unsigned const reps) : m_target(target), m_reps(reps) {}
~WidthAdjustedOutputStream() {
StreamingDataWidthConverter_Batch<IW, OW, N>(m_buffer, m_target, m_reps);
}
public:
operator hls::stream<ap_uint<IW> >&() {
return m_buffer;
}
};
template<unsigned W, unsigned N>
class WidthAdjustedOutputStream<W, W, N> {
hls::stream<ap_uint<W>> &m_target;
public:
WidthAdjustedOutputStream(hls::stream<ap_uint<W> >& target, unsigned const reps)
: m_target(target) {}
~WidthAdjustedOutputStream() {}
public:
operator hls::stream<ap_uint<W> >&() {
return m_target;
}
};
#endif
4.3.5. [H/W] Putting It All Together
The final code looks like this:
#include "config.h"
#include "bnn-library.h"
#include "weights.hpp"
#include "activations.hpp"
#include "interpret.hpp"
#include "mvau.hpp"
static BinaryWeights<L0_SIMD, L0_PE, L0_WMEM> weights0;
static BinaryWeights<L1_SIMD, L1_PE, L1_WMEM> weights1;
static BinaryWeights<L2_SIMD, L2_PE, L2_WMEM> weights2;
static BinaryWeights<L3_SIMD, L3_PE, L3_WMEM> weights3;
static BinaryWeights<L4_SIMD, L4_PE, L4_WMEM> weights4;
static BinaryWeights<L5_SIMD, L5_PE, L5_WMEM> weights5;
static BinaryWeights<L6_SIMD, L6_PE, L6_WMEM> weights6;
static BinaryWeights<L7_SIMD, L7_PE, L7_WMEM> weights7;
static BinaryWeights<L8_SIMD, L8_PE, L8_WMEM> weights8;
static ThresholdsActivation<L0_TMEM, L0_PE, L0_API, ap_fixed<24, 16>, ap_uint<L0_API> > threshs0;
static ThresholdsActivation<L1_TMEM, L1_PE, L1_API, ap_int<16>, ap_uint<L1_API>> threshs1;
static ThresholdsActivation<L2_TMEM, L2_PE, L2_API, ap_int<16>, ap_uint<L2_API>> threshs2;
static ThresholdsActivation<L3_TMEM, L3_PE, L3_API, ap_int<16>, ap_uint<L3_API>> threshs3;
static ThresholdsActivation<L4_TMEM, L4_PE, L4_API, ap_int<16>, ap_uint<L4_API>> threshs4;
static ThresholdsActivation<L5_TMEM, L5_PE, L5_API, ap_int<16>, ap_uint<L5_API>> threshs5;
static ThresholdsActivation<L6_TMEM, L6_PE, L6_API, ap_int<16>, ap_uint<L6_API>> threshs6;
static ThresholdsActivation<L7_TMEM, L7_PE, L7_API, ap_int<16>, ap_uint<L7_API>> threshs7;
unsigned int paddedSizeHW(unsigned int in, unsigned int padTo) {
if(in % padTo == 0) {
return in;
} else {
return in + padTo - (in % padTo);
}
}
void DoMemInit(unsigned int targetLayer, unsigned int targetMem, unsigned int targetInd, unsigned int targetThresh, ap_uint<64> val) {
switch (targetLayer) {
case 0:
weights0.m_weights[targetMem][targetInd] = val;
break;
case 1:
threshs0.m_thresholds[targetMem][targetInd][targetThresh] = *reinterpret_cast<ap_fixed<64, 56> *>(&val);
break;
case 2:
weights1.m_weights[targetMem][targetInd] = val;
break;
case 3:
threshs1.m_thresholds[targetMem][targetInd][targetThresh] = val;
break;
case 4:
weights2.m_weights[targetMem][targetInd] = val;
break;
case 5:
threshs2.m_thresholds[targetMem][targetInd][targetThresh] = val;
break;
case 6:
weights3.m_weights[targetMem][targetInd] = val;
break;
case 7:
threshs3.m_thresholds[targetMem][targetInd][targetThresh] = val;
break;
case 8:
weights4.m_weights[targetMem][targetInd] = val;
break;
case 9:
threshs4.m_thresholds[targetMem][targetInd][targetThresh] = val;
break;
case 10:
weights5.m_weights[targetMem][targetInd] = val;
break;
case 11:
threshs5.m_thresholds[targetMem][targetInd][targetThresh] = val;
break;
case 12:
weights6.m_weights[targetMem][targetInd] = val;
break;
case 13:
threshs6.m_thresholds[targetMem][targetInd][targetThresh] = val;
break;
case 14:
weights7.m_weights[targetMem][targetInd] = val;
break;
case 15:
threshs7.m_thresholds[targetMem][targetInd][targetThresh] = val;
break;
case 16:
weights8.m_weights[targetMem][targetInd] = val;
break;
case 17:
// do nothing, no thres mem for layer 8 as PassThrough activation is used
break;
}
}
void DoCompute(ap_uint<64> *in, ap_uint<64>* out, const unsigned int numReps) {
#pragma HLS DATAFLOW
stream<ap_uint<64>> inter0("DoCompute.inter0");
stream<ap_uint<192>> inter0_1("DoCompute.inter0_1");
stream<ap_uint<24>> inter0_2("DoCompute.inter0_2");
#pragma HLS STREAM variable=inter0_2 depth=128
stream<ap_uint<64>> inter1("DoCompute.inter1");
#pragma HLS STREAM variable=inter1 depth=128
stream<ap_uint<64>> inter2("DoCompute.inter2");
stream<ap_uint<64>> inter3("DoCompute.inter3");
#pragma HLS STREAM variable=inter3 depth=128
stream<ap_uint<128>> inter4("DoCompute.inter4");
#pragma HLS STREAM variable=inter4 depth=128
stream<ap_uint<128>> inter5("DoCompute.inter5");
stream<ap_uint<128>> inter6("DoCompute.inter6");
#pragma HLS STREAM variable=inter6 depth=81
stream<ap_uint<256>> inter7("DoCompute.inter7");
#pragma HLS STREAM variable=inter7 depth=1
stream<ap_uint<256>> inter8("DoCompute.inter8");
stream<ap_uint<256>> inter9("DoCompute.inter9");
#pragma HLS STREAM variable=inter9 depth=1
stream<ap_uint<64>> inter10("DoCompute.inter10");
#pragma HLS STREAM variable=inter10 depth=128
stream<ap_uint<64>> inter11("DoCompute.inter11");
#pragma HLS STREAM variable=inter11 depth=3
stream<ap_uint<64>> memOutStrm("DoCompute.memOutStrm");
const unsigned int inBits = 60 * 60 * 8; //32 * 32 * 3 * 8;
// const unsigned int inBitsPadded = paddedSize(inBits, 64);
const unsigned int outBits = L8_MH*16;
Mem2Stream_Batch<64, inBits / 8>(in, inter0, numReps);
//StreamingDataWidthConverter_Batch<64, 192, (32 * 32 * 3 * 8) / 64>(inter0, inter0_1, numReps);
//StreamingDataWidthConverter_Batch<192, 24, (32 * 32 * 3 * 8) / 192>(inter0_1, inter0_2, numReps);
StreamingDataWidthConverter_Batch<64, 192, (60 * 60 * 8) / 64>(inter0, inter0_1, numReps);
StreamingDataWidthConverter_Batch<192, 24, (60 * 60 * 8) / 192>(inter0_1, inter0_2, numReps);
// convolutional layers
ConvLayer_Batch<L0_K, L0_IFM_CH, L0_IFM_DIM, L0_OFM_CH, L0_OFM_DIM, L0_SIMD, L0_PE, Slice<ap_fixed<8, 1, AP_TRN, AP_SAT>>, Identity, Recast<Binary>>(inter0_2, inter1, weights0, threshs0, numReps, ap_resource_lut());
ConvLayer_Batch<L1_K, L1_IFM_CH, L1_IFM_DIM, L1_OFM_CH, L1_OFM_DIM, L1_SIMD, L1_PE, Recast<XnorMul>>(inter1, inter2, weights1, threshs1, numReps, ap_resource_lut());
StreamingMaxPool_Batch<L1_OFM_DIM, 2, L1_OFM_CH>(inter2, inter3, numReps);
ConvLayer_Batch<L2_K, L2_IFM_CH, L2_IFM_DIM, L2_OFM_CH, L2_OFM_DIM, L2_SIMD, L2_PE, Recast<XnorMul>>(inter3, inter4, weights2, threshs2, numReps, ap_resource_lut());
ConvLayer_Batch<L3_K, L3_IFM_CH, L3_IFM_DIM, L3_OFM_CH, L3_OFM_DIM, L3_SIMD, L3_PE, Recast<XnorMul>>(inter4, inter5, weights3, threshs3, numReps, ap_resource_lut());
StreamingMaxPool_Batch<L3_OFM_DIM, 2, L3_OFM_CH>(inter5, inter6, numReps);
ConvLayer_Batch<L4_K, L4_IFM_CH, L4_IFM_DIM, L4_OFM_CH, L4_OFM_DIM, L4_SIMD, L4_PE, Recast<XnorMul>>(inter6, inter7, weights4, threshs4, numReps, ap_resource_lut());
ConvLayer_Batch<L5_K, L5_IFM_CH, L5_IFM_DIM, L5_OFM_CH, L5_OFM_DIM, L5_SIMD, L5_PE, Recast<XnorMul>>(inter7, inter8, weights5, threshs5, numReps, ap_resource_lut());
StreamingMaxPool_Batch<L5_OFM_DIM, 2, L5_OFM_CH>(inter8, inter9, numReps);
ConvLayer_Batch<L6_K, L6_IFM_CH, L6_IFM_DIM, L6_OFM_CH, L6_OFM_DIM, L6_SIMD, L6_PE, Recast<XnorMul>>(inter9, inter10, weights6, threshs6, numReps, ap_resource_lut());
// fully connected layers
WidthAdjustedOutputStream<16 * L8_PE, 64, L8_MH / L8_PE> wa_out(memOutStrm, numReps);
StreamingFCLayer_Batch<L7_MW, L7_MH, L7_SIMD, L7_PE, Recast<XnorMul>>
(inter10, inter11, weights7, threshs7, numReps, ap_resource_lut());
StreamingFCLayer_Batch<L8_MW, L8_MH, L8_SIMD, L8_PE, Recast<XnorMul>, Slice<ap_uint<16> >>
(inter11, static_cast<hls::stream<ap_uint<16 * L8_PE>>&>(wa_out), weights8, PassThroughActivation<ap_uint<16>>(), numReps, ap_resource_lut());
Stream2Mem_Batch<64, outBits/8>(memOutStrm, out, numReps);
}
void BlackBoxJam(ap_uint<64> *in, ap_uint<64> *out, bool doInit,
unsigned int targetLayer, unsigned int targetMem,
unsigned int targetInd, unsigned int targetThresh, ap_uint<64> val, unsigned int numReps) {
// pragmas for MLBP jam interface
// signals to be mapped to the AXI Lite slave port
#pragma HLS INTERFACE s_axilite port=return bundle=control
#pragma HLS INTERFACE s_axilite port=doInit bundle=control
#pragma HLS INTERFACE s_axilite port=targetLayer bundle=control
#pragma HLS INTERFACE s_axilite port=targetMem bundle=control
#pragma HLS INTERFACE s_axilite port=targetInd bundle=control
#pragma HLS INTERFACE s_axilite port=targetThresh bundle=control
#pragma HLS INTERFACE s_axilite port=val bundle=control
#pragma HLS INTERFACE s_axilite port=numReps bundle=control
// signals to be mapped to the AXI master port (hostmem)
#pragma HLS INTERFACE m_axi offset=slave port=in bundle=hostmem depth=512
#pragma HLS INTERFACE s_axilite port=in bundle=control
#pragma HLS INTERFACE m_axi offset=slave port=out bundle=hostmem depth=16
#pragma HLS INTERFACE s_axilite port=out bundle=control
// partition PE arrays
#pragma HLS ARRAY_PARTITION variable=weights0.m_weights complete dim=1
#pragma HLS ARRAY_PARTITION variable=threshs0.m_thresholds complete dim=1
#pragma HLS ARRAY_PARTITION variable=threshs0.m_thresholds complete dim=3
#pragma HLS ARRAY_PARTITION variable=weights1.m_weights complete dim=1
#pragma HLS ARRAY_PARTITION variable=threshs1.m_thresholds complete dim=1
#pragma HLS ARRAY_PARTITION variable=threshs1.m_thresholds complete dim=3
#pragma HLS ARRAY_PARTITION variable=weights2.m_weights complete dim=1
#pragma HLS ARRAY_PARTITION variable=threshs2.m_thresholds complete dim=1
#pragma HLS ARRAY_PARTITION variable=threshs2.m_thresholds complete dim=3
#pragma HLS ARRAY_PARTITION variable=weights3.m_weights complete dim=1
#pragma HLS ARRAY_PARTITION variable=threshs3.m_thresholds complete dim=1
#pragma HLS ARRAY_PARTITION variable=threshs3.m_thresholds complete dim=3
#pragma HLS ARRAY_PARTITION variable=weights4.m_weights complete dim=1
#pragma HLS ARRAY_PARTITION variable=threshs4.m_thresholds complete dim=1
#pragma HLS ARRAY_PARTITION variable=threshs4.m_thresholds complete dim=3
#pragma HLS ARRAY_PARTITION variable=weights5.m_weights complete dim=1
#pragma HLS ARRAY_PARTITION variable=threshs5.m_thresholds complete dim=1
#pragma HLS ARRAY_PARTITION variable=threshs5.m_thresholds complete dim=3
#pragma HLS ARRAY_PARTITION variable=weights6.m_weights complete dim=1
#pragma HLS ARRAY_PARTITION variable=threshs6.m_thresholds complete dim=1
#pragma HLS ARRAY_PARTITION variable=threshs6.m_thresholds complete dim=3
#pragma HLS ARRAY_PARTITION variable=weights7.m_weights complete dim=1
#pragma HLS ARRAY_PARTITION variable=threshs7.m_thresholds complete dim=1
#pragma HLS ARRAY_PARTITION variable=threshs7.m_thresholds complete dim=3
#pragma HLS ARRAY_PARTITION variable=weights8.m_weights complete dim=1
if (doInit) {
DoMemInit(targetLayer, targetMem, targetInd, targetThresh, val);
} else {
DoCompute(in, out, numReps);
}
}
4.4. Implementing the SoC Side
The program that becomes the FPGA circuit is done,
so next we build the software part that sends data to the FPGA circuit.
Open main_python.cpp under cnvCustom/sw and start rewriting it.
4.4.1. [S/W] Building the Network
Specify the image size as follows:
void makeNetwork(network<mse, adagrad> & nn) {
nn
#ifdef OFFLOAD
<< chaninterleave_layer<identity>(1, 60 * 60, false)
<< offloaded_layer(1 * 60 * 60, 10, &FixedFoldedMVOffload<8, 1, ap_int<16>>, 0xdeadbeef, 0)
#endif
;
}
4.4.2. [S/W] Loading the Weights
Write it so that every weight described in config.h gets loaded:
extern "C" void load_parameters(const char* path) {
#include "config.h"
FoldedMVInit("cnvCustom");
network<mse, adagrad> nn;
makeNetwork(nn);
cout << "Setting network weights and thresholds in accelerator..." << endl;
FoldedMVLoadLayerMem(path, 0, L0_PE, L0_WMEM, L0_TMEM, L0_API);
FoldedMVLoadLayerMem(path, 1, L1_PE, L1_WMEM, L1_TMEM, L1_API);
FoldedMVLoadLayerMem(path, 2, L2_PE, L2_WMEM, L2_TMEM, L2_API);
FoldedMVLoadLayerMem(path, 3, L3_PE, L3_WMEM, L3_TMEM, L3_API);
FoldedMVLoadLayerMem(path, 4, L4_PE, L4_WMEM, L4_TMEM, L4_API);
FoldedMVLoadLayerMem(path, 5, L5_PE, L5_WMEM, L5_TMEM, L5_API);
FoldedMVLoadLayerMem(path, 6, L6_PE, L6_WMEM, L6_TMEM, L6_API);
FoldedMVLoadLayerMem(path, 7, L7_PE, L7_WMEM, L7_TMEM, L7_API);
FoldedMVLoadLayerMem(path, 8, L8_PE, L8_WMEM, L8_TMEM, 0);
}
4.4.3. [S/W] Inference
The image loading does not support grayscale, so we need to develop that ourselves.
If test data is enough, something like this is fine:
// gray scale data
std::vector<vec_t> test_image;
std::vector<label_t> test_label;
vec_t img;
img.resize(60 * 60 * 1, -1.0);
for (int i = 0; i < 60 * 60 * 1; i++){
img[i] = 0.5;
}
test_image.push_back(img);
This time we want proper image loading, so first we write a function: it reads a binary file and scales the pixel values toward the [-1, 1] range (as written, c * (1.0 / 255.0) - 1 maps 0..255 to [-1, 0]; use 2.0 / 255.0 if you want the full [-1, 1] range). It supports sizes other than 60*60, and RGB or not makes no difference.
void parse_image_grayscale(const std::string& filename,
std::vector<vec_t> *train_images,
int img_size)
{
std::cout << "[parse_image_grayscale]: called" << std::endl;
std::ifstream ifs(filename.c_str(), std::ios::in | std::ios::binary);
if (ifs.fail() || ifs.bad())
throw nn_error("failed to open file:" + filename);
std::vector<unsigned char> buf(img_size);
if (!ifs.read((char*) &buf[0], img_size)) return; // load buffer
vec_t img;
std::cout << "[parse_image_grayscale]: cast unsigned char" << std::endl;
std::transform(buf.begin(), buf.end(), std::back_inserter(img),
[=](unsigned char c) { return c * (1.0 / 255.0) - 1; });
train_images->push_back(img);
std::cout << "[parse_image_grayscale]: indicating loading data the below" << std::endl;
std::cout << img[0] << ", " << img[1] << ", " << img[2] << ", " << img[3] << ", " << img[4] << ", " << img[5] << endl;
std::cout << "[parse_image_grayscale]: done" << std::endl;
}
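For reference, the same preprocessing sketched in Python (the filename is hypothetical; it assumes a raw 8-bit grayscale dump and mirrors the c * (1.0 / 255.0) - 1 transform above):
import numpy as np

def parse_image_grayscale_py(filename, img_size=60 * 60):
    # read img_size raw bytes and scale them exactly like the C++ version
    buf = np.fromfile(filename, dtype=np.uint8, count=img_size)
    if buf.size != img_size:
        raise IOError("failed to read %d bytes from %s" % (img_size, filename))
    return buf.astype(np.float32) * (1.0 / 255.0) - 1.0  # 0..255 -> [-1, 0]

# e.g. img = parse_image_grayscale_py("60x60.bin")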
Next, build the part that loads an image, hands the data to the FPGA, and runs inference.
extern "C" int inference(const char* path, int results[64], int number_class, float* usecPerImage) {
cout << "[main_python.cpp::inference::called" << endl;
std::vector<label_t> test_labels;
std::vector<vec_t> test_images;
std::vector<int> class_result;
float usecPerImage_int;
FoldedMVInit("cnvCustom");
network<mse, adagrad> nn;
makeNetwork(nn);
parse_image_grayscale(path, &test_images, 60*60); // load image
class_result=testPrebuiltCUSTOM_from_image<8, 16, ap_int<16>>(test_images, number_class, usecPerImage_int);
if(results) {
std::copy(class_result.begin(),class_result.end(), results);
}
if (usecPerImage) {
*usecPerImage = usecPerImage_int;
}
return (std::distance(class_result.begin(),std::max_element(class_result.begin(), class_result.end())));
}
The line below is where processing is handed off to the FPGA, and this function too has to be written:
class_result=testPrebuiltCUSTOM_from_image<8, 16, ap_int<16>>(test_images, number_class, usecPerImage_int);
Open /BNN-PYNQ/bnn/src/library/host/foldedmv-offload.h.
void testPrebuiltCIFAR10(std::vector<tiny_cnn::vec_t> & imgs, std::vector<tiny_cnn::label_t> & labels, const unsigned int numCategories) {
...
}
You will find several functions of this form. Copy one and rewrite it as follows.
template<unsigned int inWidth, unsigned int outWidth, typename LowPrecType>
void testPrebuiltCUSTOM(std::vector<tiny_cnn::vec_t> & imgs, std::vector<tiny_cnn::label_t> & labels, const unsigned int numCategories) {
const unsigned int count = imgs.size();
cout << "[SW-mode] Packing and interleaving CUSTOM inputs..." << endl;
// number of ExtMemWords per image
const unsigned int psi = paddedSize(imgs[0].size()*inWidth, bitsPerExtMemWord) / bitsPerExtMemWord;
// number of ExtMemWords per output
const unsigned int pso = 16; //paddedSize(numCategories*outWidth, bitsPerExtMemWord) / bitsPerExtMemWord;
cout << psi << endl;
cout << pso << endl;
if(INPUT_BUF_ENTRIES < count*psi) {
throw "Not enough space in accelBufIn";
}
if(OUTPUT_BUF_ENTRIES < count*pso) {
throw "Not enough space in accelBufOut";
}
// allocate host-side buffers for packed input and outputs
ExtMemWord * packedImages = new ExtMemWord[(count * psi)];
ExtMemWord * packedOut = new ExtMemWord[(count * pso)];
cout << "ExtMemWord" << endl;
tiny_cnn::chaninterleave_layer<tiny_cnn::activation::identity> interleaver(1, 60 * 60, false);
cout << "tiny_cnn" << endl;
// interleave and pack inputs
for(unsigned int i = 0; i < count; i++) {
tiny_cnn::vec_t interleaved = interleaver.forward_propagation(imgs[i], 0);
quantiseAndPack<inWidth, 1>(interleaved, &packedImages[i * psi], psi);
}
cout << "Running prebuilt CUSTOM test for " << count << " images..." << endl;
auto t1 = chrono::high_resolution_clock::now();
// call the accelerator in compute mode
BlackBoxJam((ap_uint<64> *)packedImages, (ap_uint<64> *)packedOut, false, 0, 0, 0, 0,0, count);
auto t2 = chrono::high_resolution_clock::now();
// compare against labels
unsigned int ok = 0, failed = 0;
tiny_cnn::vec_t outTest(numCategories, 0);
for(unsigned int i = 0; i < count; i++) {
copyFromLowPrecBuffer<LowPrecType>(&packedOut[i * pso], outTest);
int maxInd = 0;
LowPrecType maxVal = 0;
for(unsigned int j = 0; j < numCategories; j++) {
if(outTest[j] > maxVal) {
maxVal = outTest[j];
maxInd = j;
}
}
if(maxInd == labels[i]) {
ok++;
} else {
failed++;
}
}
cout << "Succeeded " << ok << " failed " << failed << " accuracy " << 100.0*(float)ok/count << "%" << endl;
auto duration = chrono::duration_cast<chrono::microseconds>( t2 - t1 ).count();
float usecPerImage = (float)duration / count;
cout << "Inference took " << duration << " microseconds, " << usecPerImage << " usec per image" << endl;
cout << "Classification rate: " << 1000000.0 / usecPerImage << " images per second" << endl;
delete[] packedImages;
delete[] packedOut;
}
There are multiple copies because the S/W-debug path and the H/W path are split with #ifdefs and compiled separately. The image-handling parts have to be rewritten in every copy. With everything rewritten, the file looks like this:
#pragma once
#include <string>
#include <iostream>
#include "tiny_cnn/tiny_cnn.h"
#include "ap_int.h"
using namespace std;
typedef unsigned long long ExtMemWord;
const unsigned int bitsPerExtMemWord = sizeof(ExtMemWord)*8;
#ifndef VIRTUAL
#define INPUT_BUF_ENTRIES 3840000
#define OUTPUT_BUF_ENTRIES 160000
#else
#define INPUT_BUF_ENTRIES 8192
#define OUTPUT_BUF_ENTRIES 1024
#endif
#define FOLDEDMV_INPUT_PADCHAR 0
void FoldedMVOffloadBinarized(const ExtMemWord * in,
ExtMemWord * out,
const unsigned int inBufWords,
const unsigned int outBufWords,
const unsigned int numImages);
void FoldedMVInit(const char * attachName);
void FoldedMVDeinit();
void FoldedMVLoadLayerMem(std::string dir,
unsigned int peCount,
unsigned int layerNo,
unsigned int linesWMem,
unsigned int linesTMem,
unsigned int numThresh);
void FoldedMVMemSet(unsigned int targetLayer,
unsigned int targetMem,
unsigned int targetInd,
unsigned int targetThresh,
ExtMemWord val);
std::vector<int> testPrebinarized_nolabel_multiple_images(std::vector<tiny_cnn::vec_t> & imgs,
const unsigned int labelBits,
float &usecPerImage);
std::vector<int> testPrebinarized_nolabel(std::vector<tiny_cnn::vec_t> & imgs,
const unsigned int labelBits,
float &usecPerImage);
void testPrebinarized(std::vector<tiny_cnn::vec_t> & imgs,
std::vector<tiny_cnn::label_t> & labels,
const unsigned int labelBits);
void binarizeAndPack(const tiny_cnn::vec_t & in,
ExtMemWord * out,
unsigned int inBufSize=INPUT_BUF_ENTRIES);
void unpackAndDebinarize(const ExtMemWord * in, tiny_cnn::vec_t &out);
unsigned int paddedSize(unsigned int in, unsigned int padTo);
std::string getBNNRoot();
template<typename LowPrecType>
void copyFromLowPrecBuffer(void * buf, tiny_cnn::vec_t & out) {
LowPrecType * lpbuf = (LowPrecType *) buf;
for(unsigned int i = 0; i < out.size(); i++) {
out[i] = (tiny_cnn::float_t) lpbuf[i];
}
}
template<unsigned int inWidth, unsigned int SIMDWidth>
void quantiseAndPack(const tiny_cnn::vec_t & in, ExtMemWord * out, unsigned int inBufSize=INPUT_BUF_ENTRIES) {
if((in.size() * inWidth) > (inBufSize * bitsPerExtMemWord)) {
throw "Not enough space in input buffer";
}
// first, fill the target buffer with padding data
memset(out, 0, inBufSize * sizeof(ExtMemWord));
ExtMemWord tmpv[bitsPerExtMemWord / inWidth];
// now pack each quantised value as required.
for(unsigned int i=0; i < in.size(); i++) {
ap_fixed<inWidth, 1, AP_TRN, AP_SAT> fxdValue = in[i];
ap_uint<inWidth> uValue = *reinterpret_cast<ap_uint<inWidth> *>(&fxdValue); // Interpret the fixed value as an integer.
ExtMemWord v = ((ExtMemWord)uValue & (~(ExtMemWord)0 >> (bitsPerExtMemWord - inWidth))); // Keep only the inWidth least significant bits.
out[i / (bitsPerExtMemWord / inWidth)] |= (v << inWidth*(i % (bitsPerExtMemWord / inWidth)));
}
}
#if defined(OFFLOAD) && defined(RAWHLS)
#include "bnn-library.h"
void BlackBoxJam(ap_uint<64> * in, ap_uint<64> * out, bool doInit, unsigned int targetLayer, unsigned int targetMem, unsigned int targetInd, unsigned int targetThresh, ap_uint<64> val, unsigned int numReps);
extern ExtMemWord * bufIn, * bufOut;
template<typename LowPrecType>
void FoldedMVOffload(const tiny_cnn::vec_t &in, tiny_cnn::vec_t & out, unsigned int offloadID, tiny_cnn::OffloadConvParams * convParams) {
// binarize input and pack into bit stream
binarizeAndPack(in, bufIn);
// call the accelerator in compute mode
BlackBoxJam((ap_uint<64> *)bufIn, (ap_uint<64> *)bufOut, false, 0, 0, 0, 0, 0, 1);
// unpack output bits and convert output back to float
if(offloadID == 0xdeadbeef) {
copyFromLowPrecBuffer<LowPrecType>((void *)bufOut, out);
} else {
unpackAndDebinarize(bufOut, out);
}
}
template<unsigned int inWidth, unsigned int SIMDWidth, typename LowPrecType>
void FixedFoldedMVOffload(const tiny_cnn::vec_t &in, tiny_cnn::vec_t &out, unsigned int offloadID, tiny_cnn::OffloadConvParams * convParams) {
// binarize input and pack into bit stream
quantiseAndPack<inWidth, SIMDWidth>(in, bufIn);
// call the accelerator in compute mode
BlackBoxJam((ap_uint<64> *)bufIn, (ap_uint<64> *)bufOut, false, 0, 0, 0, 0, 0, 1);
// unpack output bits and convert output back to float
if(offloadID == 0xdeadbeef) {
copyFromLowPrecBuffer<LowPrecType>((void *)bufOut, out);
} else {
unpackAndDebinarize(bufOut, out);
}
}
template<unsigned int inWidth, unsigned int outWidth, typename LowPrecType>
void testPrebuiltCIFAR10(std::vector<tiny_cnn::vec_t> & imgs, std::vector<tiny_cnn::label_t> & labels, const unsigned int numCategories) {
const unsigned int count = imgs.size();
cout << "Packing and interleaving CIFAR-10 inputs..." << endl;
// number of ExtMemWords per image
const unsigned int psi = paddedSize(imgs[0].size()*inWidth, bitsPerExtMemWord) / bitsPerExtMemWord;
// number of ExtMemWords per output
const unsigned int pso = 16; //paddedSize(numCategories*outWidth, bitsPerExtMemWord) / bitsPerExtMemWord;
if(INPUT_BUF_ENTRIES < count*psi) {
throw "Not enough space in accelBufIn";
}
if(OUTPUT_BUF_ENTRIES < count*pso) {
throw "Not enough space in accelBufOut";
}
// allocate host-side buffers for packed input and outputs
ExtMemWord * packedImages = new ExtMemWord[(count * psi)];
ExtMemWord * packedOut = new ExtMemWord[(count * pso)];
tiny_cnn::chaninterleave_layer<tiny_cnn::activation::identity> interleaver(3, 32 * 32, false);
// interleave and pack inputs
for(unsigned int i = 0; i < count; i++) {
tiny_cnn::vec_t interleaved = interleaver.forward_propagation(imgs[i], 0);
quantiseAndPack<inWidth, 1>(interleaved, &packedImages[i * psi], psi);
}
cout << "Running prebuilt CIFAR-10 test for " << count << " images..." << endl;
auto t1 = chrono::high_resolution_clock::now();
// call the accelerator in compute mode
BlackBoxJam((ap_uint<64> *)packedImages, (ap_uint<64> *)packedOut, false, 0, 0, 0, 0,0, count);
auto t2 = chrono::high_resolution_clock::now();
// compare against labels
unsigned int ok = 0, failed = 0;
tiny_cnn::vec_t outTest(numCategories, 0);
for(unsigned int i = 0; i < count; i++) {
copyFromLowPrecBuffer<LowPrecType>(&packedOut[i * pso], outTest);
int maxInd = 0;
LowPrecType maxVal = 0;
for(unsigned int j = 0; j < numCategories; j++) {
if(outTest[j] > maxVal) {
maxVal = outTest[j];
maxInd = j;
}
}
if(maxInd == labels[i]) {
ok++;
} else {
failed++;
}
}
cout << "Succeeded " << ok << " failed " << failed << " accuracy " << 100.0*(float)ok/count << "%" << endl;
auto duration = chrono::duration_cast<chrono::microseconds>( t2 - t1 ).count();
float usecPerImage = (float)duration / count;
cout << "Inference took " << duration << " microseconds, " << usecPerImage << " usec per image" << endl;
cout << "Classification rate: " << 1000000.0 / usecPerImage << " images per second" << endl;
delete[] packedImages;
delete[] packedOut;
}
// CNV CUSTOM
template<unsigned int inWidth, unsigned int outWidth, typename LowPrecType>
void testPrebuiltCUSTOM(std::vector<tiny_cnn::vec_t> & imgs, std::vector<tiny_cnn::label_t> & labels, const unsigned int numCategories) {
const unsigned int count = imgs.size();
cout << "[SW-mode] Packing and interleaving CUSTOM inputs..." << endl;
// number of ExtMemWords per image
const unsigned int psi = paddedSize(imgs[0].size()*inWidth, bitsPerExtMemWord) / bitsPerExtMemWord;
// number of ExtMemWords per output
const unsigned int pso = 16; //paddedSize(numCategories*outWidth, bitsPerExtMemWord) / bitsPerExtMemWord;
cout << psi << endl;
cout << pso << endl;
if(INPUT_BUF_ENTRIES < count*psi) {
throw "Not enough space in accelBufIn";
}
if(OUTPUT_BUF_ENTRIES < count*pso) {
throw "Not enough space in accelBufOut";
}
// allocate host-side buffers for packed input and outputs
ExtMemWord * packedImages = new ExtMemWord[(count * psi)];
ExtMemWord * packedOut = new ExtMemWord[(count * pso)];
cout << "ExtMemWord" << endl;
tiny_cnn::chaninterleave_layer<tiny_cnn::activation::identity> interleaver(1, 60 * 60, false);
cout << "tiny_cnn" << endl;
// interleave and pack inputs
for(unsigned int i = 0; i < count; i++) {
tiny_cnn::vec_t interleaved = interleaver.forward_propagation(imgs[i], 0);
quantiseAndPack<inWidth, 1>(interleaved, &packedImages[i * psi], psi);
}
cout << "Running prebuilt CUSTOM test for " << count << " images..." << endl;
auto t1 = chrono::high_resolution_clock::now();
// call the accelerator in compute mode
BlackBoxJam((ap_uint<64> *)packedImages, (ap_uint<64> *)packedOut, false, 0, 0, 0, 0,0, count);
auto t2 = chrono::high_resolution_clock::now();
// compare against labels
unsigned int ok = 0, failed = 0;
tiny_cnn::vec_t outTest(numCategories, 0);
for(unsigned int i = 0; i < count; i++) {
copyFromLowPrecBuffer<LowPrecType>(&packedOut[i * pso], outTest);
int maxInd = 0;
LowPrecType maxVal = 0;
for(unsigned int j = 0; j < numCategories; j++) {
if(outTest[j] > maxVal) {
maxVal = outTest[j];
maxInd = j;
}
}
if(maxInd == labels[i]) {
ok++;
} else {
failed++;
}
}
cout << "Succeeded " << ok << " failed " << failed << " accuracy " << 100.0*(float)ok/count << "%" << endl;
auto duration = chrono::duration_cast<chrono::microseconds>( t2 - t1 ).count();
float usecPerImage = (float)duration / count;
cout << "Inference took " << duration << " microseconds, " << usecPerImage << " usec per image" << endl;
cout << "Classification rate: " << 1000000.0 / usecPerImage << " images per second" << endl;
delete[] packedImages;
delete[] packedOut;
}
template<unsigned int inWidth, unsigned int outWidth, typename LowPrecType>
std::vector<int> testPrebuiltCIFAR10_from_image(std::vector<tiny_cnn::vec_t> & imgs, const unsigned int numCategories, float &usecPerImage) {
const unsigned int count = 1;
cout << "Packing and interleaving CIFAR-10 inputs..." << endl;
// number of ExtMemWords per image
const unsigned int psi = paddedSize(imgs[0].size()*inWidth, bitsPerExtMemWord) / bitsPerExtMemWord;
// number of ExtMemWords per output
const unsigned int pso = paddedSize(64*outWidth, bitsPerExtMemWord) / bitsPerExtMemWord;
if(INPUT_BUF_ENTRIES < count*psi) {
throw "Not enough space in accelBufIn";
}
if(OUTPUT_BUF_ENTRIES < count*pso) {
throw "Not enough space in accelBufOut";
}
// allocate host-side buffers for packed input and outputs
ExtMemWord * packedImages = new ExtMemWord[(count * psi)];
ExtMemWord * packedOut = new ExtMemWord[(count * pso)];
tiny_cnn::chaninterleave_layer<tiny_cnn::activation::identity> interleaver(3, 32 * 32, false);
// interleave and pack inputs
for(unsigned int i = 0; i < count; i++) {
tiny_cnn::vec_t interleaved = interleaver.forward_propagation(imgs[i], 0);
quantiseAndPack<inWidth, 1>(interleaved, &packedImages[i * psi], psi);
}
cout << "Running prebuilt CIFAR-10 test for " << count << " images..." << endl;
auto t1 = chrono::high_resolution_clock::now();
// call the accelerator in compute mode
BlackBoxJam((ap_uint<64> *)packedImages, (ap_uint<64> *)packedOut, false, 0, 0, 0, 0, 0, count);
auto t2 = chrono::high_resolution_clock::now();
// compare against labels
unsigned int ok = 0, failed = 0;
tiny_cnn::vec_t outTest(numCategories, 0);
copyFromLowPrecBuffer<LowPrecType>(&packedOut[0], outTest);
std::vector<int> result;
for(unsigned int j = 0; j < numCategories; j++) {
result.push_back(outTest[j]);
}
auto duration = chrono::duration_cast<chrono::microseconds>( t2 - t1 ).count();
usecPerImage = (float)duration / (count);
cout << "Inference took " << duration << " microseconds, " << usecPerImage << " usec per image" << endl;
cout << "Classification rate: " << 1000000.0 / usecPerImage << " images per second" << endl;
delete[] packedImages;
delete[] packedOut;
return result;
}
// CNV CUSTOM
template<unsigned int inWidth, unsigned int outWidth, typename LowPrecType>
std::vector<int> testPrebuiltCUSTOM_from_image(std::vector<tiny_cnn::vec_t> & imgs, const unsigned int numCategories, float &usecPerImage) {
const unsigned int count = 1;
cout << "[SW-mode] Packing and interleaving CUSTOM inputs..." << endl;
// number of ExtMemWords per image
const unsigned int psi = paddedSize(imgs[0].size()*inWidth, bitsPerExtMemWord) / bitsPerExtMemWord;
// number of ExtMemWords per output
const unsigned int pso = paddedSize(64*outWidth, bitsPerExtMemWord) / bitsPerExtMemWord;
if(INPUT_BUF_ENTRIES < count*psi) {
throw "Not enough space in accelBufIn";
}
if(OUTPUT_BUF_ENTRIES < count*pso) {
throw "Not enough space in accelBufOut";
}
// allocate host-side buffers for packed input and outputs
ExtMemWord * packedImages = new ExtMemWord[(count * psi)];
ExtMemWord * packedOut = new ExtMemWord[(count * pso)];
tiny_cnn::chaninterleave_layer<tiny_cnn::activation::identity> interleaver(1, 60 * 60, false);
// interleave and pack inputs
for(unsigned int i = 0; i < count; i++) {
tiny_cnn::vec_t interleaved = interleaver.forward_propagation(imgs[i], 0);
quantiseAndPack<inWidth, 1>(interleaved, &packedImages[i * psi], psi);
}
cout << "Running prebuilt CIFAR-10 test for " << count << " images..." << endl;
auto t1 = chrono::high_resolution_clock::now();
// call the accelerator in compute mode
BlackBoxJam((ap_uint<64> *)packedImages, (ap_uint<64> *)packedOut, false, 0, 0, 0, 0, 0, count);
auto t2 = chrono::high_resolution_clock::now();
// compare against labels
unsigned int ok = 0, failed = 0;
tiny_cnn::vec_t outTest(numCategories, 0);
copyFromLowPrecBuffer<LowPrecType>(&packedOut[0], outTest);
std::vector<int> result;
for(unsigned int j = 0; j < numCategories; j++) {
result.push_back(outTest[j]);
}
auto duration = chrono::duration_cast<chrono::microseconds>( t2 - t1 ).count();
usecPerImage = (float)duration / (count);
cout << "Inference took " << duration << " microseconds, " << usecPerImage << " usec per image" << endl;
cout << "Classification rate: " << 1000000.0 / usecPerImage << " images per second" << endl;
delete[] packedImages;
delete[] packedOut;
return result;
}
template<unsigned int inWidth, unsigned int outWidth, typename LowPrecType>
std::vector<int> testPrebuiltCIFAR10_multiple_images(std::vector<tiny_cnn::vec_t> & imgs, const unsigned int numCategories, std::vector<int> & detailed_results, float & usecPerImage) {
const unsigned int count = imgs.size();
std::vector<int> results;
cout << "Packing and interleaving CIFAR-10 inputs..." << endl;
// number of ExtMemWords per image
const unsigned int psi = paddedSize(imgs[0].size()*inWidth, bitsPerExtMemWord) / bitsPerExtMemWord;
// number of ExtMemWords per output
const unsigned int pso = paddedSize(64*outWidth, bitsPerExtMemWord) / bitsPerExtMemWord;
if(INPUT_BUF_ENTRIES < count*psi)
throw "Not enough space in accelBufIn";
if(OUTPUT_BUF_ENTRIES < count*pso)
throw "Not enough space in accelBufOut";
// allocate host-side buffers for packed input and outputs
ExtMemWord * packedImages = new ExtMemWord[(count * psi)];
ExtMemWord * packedOut = new ExtMemWord[(count * pso)];
tiny_cnn::chaninterleave_layer<tiny_cnn::activation::identity> interleaver(3, 32 * 32, false);
// interleave and pack inputs
for(unsigned int i = 0; i < count; i++) {
tiny_cnn::vec_t interleaved = interleaver.forward_propagation(imgs[i], 0);
quantiseAndPack<inWidth, 1>(interleaved, &packedImages[i * psi], psi);
}
cout << "Running prebuilt CIFAR-10 test for " << count << " images..." << endl;
// copy inputs to accelerator
auto t1 = chrono::high_resolution_clock::now();
// call the accelerator in compute mode
BlackBoxJam((ap_uint<64> *)packedImages, (ap_uint<64> *)packedOut, false, 0, 0, 0, 0, 0, count);
auto t2 = chrono::high_resolution_clock::now();
// compare against labels
tiny_cnn::vec_t outTest(numCategories, 0);
for(unsigned int i = 0; i < count; i++) {
copyFromLowPrecBuffer<LowPrecType>(&packedOut[i * pso], outTest);
int maxInd = 0;
LowPrecType maxVal = 0;
for(unsigned int j = 0; j < numCategories; j++) {
detailed_results.push_back(outTest[j]);
if(outTest[j] > maxVal) {
maxVal = outTest[j];
maxInd = j;
}
}
results.push_back(maxInd);
}
auto duration = chrono::duration_cast<chrono::microseconds>(t2 - t1).count();
usecPerImage = (float)duration / (count);
cout << "Inference took " << duration << " microseconds, " << usecPerImage << " usec per image" << endl;
cout << "Classification rate: " << 1000000.0 / usecPerImage << " images per second" << endl;
delete[] packedImages;
delete[] packedOut;
return results;
}
#elif defined(OFFLOAD) && !defined(RAWHLS)
#include "platform.hpp"
#include <vector>
extern DonutDriver * thePlatform;
extern void * accelBufIn, * accelBufOut;
extern ExtMemWord * bufIn, * bufOut;
void ExecAccel();
template<typename LowPrecType>
void FoldedMVOffload(const tiny_cnn::vec_t &in, tiny_cnn::vec_t &out, unsigned int offloadID, tiny_cnn::OffloadConvParams * convParams) {
// always operates on a single image per call for now -- set numImages to 1
thePlatform->writeJamRegAddr(0x5C, 1);
// binarize input and pack into bit stream
binarizeAndPack(in, bufIn);
// TODO size to pad input to is max(64, PE_SYNGROUP_BITS)
unsigned int paddedInDim = paddedSize(in.size(), bitsPerExtMemWord);
// copy into accelerator input
const unsigned int numInpWords = (paddedInDim / bitsPerExtMemWord);
thePlatform->copyBufferHostToAccel((void *)bufIn, accelBufIn, sizeof(ExtMemWord) * numInpWords);
// launch
ExecAccel();
if(offloadID == 0xdeadbeef) {
unsigned int paddedOutDim = paddedSize(out.size() * 16, bitsPerExtMemWord);
const unsigned int numOutWords = (paddedOutDim / bitsPerExtMemWord);
thePlatform->copyBufferAccelToHost(accelBufOut, (void *)bufOut, sizeof(ExtMemWord) * numOutWords);
copyFromLowPrecBuffer<LowPrecType>((void *)bufOut, out);
} else {
// TODO size to pad input to is max(64, NUM_PE_ELEMENTS)
unsigned int paddedOutDim = paddedSize(out.size(), bitsPerExtMemWord);
// copy from accelerator output
const unsigned int numOutWords = ( paddedOutDim / bitsPerExtMemWord);
thePlatform->copyBufferAccelToHost(accelBufOut, (void *)bufOut, sizeof(ExtMemWord) * numOutWords);
// unpack output bits and convert output back to float
unpackAndDebinarize(bufOut, out);
}
}
template<unsigned int inWidth, unsigned int SIMDWidth, typename LowPrecType>
void FixedFoldedMVOffload(const tiny_cnn::vec_t &in, tiny_cnn::vec_t &out, unsigned int offloadID, tiny_cnn::OffloadConvParams * convParams) {
// always operates on a single image per call for now -- set numImages to 1
thePlatform->writeJamRegAddr(0x5C, 1);
// binarize input and pack into bit stream
quantiseAndPack<inWidth, SIMDWidth>(in, bufIn);
// TODO size to pad input to is max(64, PE_SYNGROUP_BITS)
unsigned int paddedInDim = paddedSize(in.size(), bitsPerExtMemWord);
// copy into accelerator input
const unsigned int numInpWords = (paddedInDim / (bitsPerExtMemWord / inWidth));
thePlatform->copyBufferHostToAccel((void *)bufIn, accelBufIn, sizeof(ExtMemWord) * numInpWords);
// launch
ExecAccel();
if(offloadID == 0xdeadbeef) {
unsigned int paddedOutDim = paddedSize(out.size() * 16, bitsPerExtMemWord);
const unsigned int numOutWords = ( paddedOutDim / bitsPerExtMemWord);
thePlatform->copyBufferAccelToHost(accelBufOut, (void *)bufOut, sizeof(ExtMemWord) * numOutWords);
copyFromLowPrecBuffer<LowPrecType>((void *)bufOut, out);
} else {
// TODO size to pad input to is max(64, NUM_PE_ELEMENTS)
unsigned int paddedOutDim = paddedSize(out.size(), bitsPerExtMemWord);
// copy from accelerator output
const unsigned int numOutWords = ( paddedOutDim / bitsPerExtMemWord);
thePlatform->copyBufferAccelToHost(accelBufOut, (void *)bufOut, sizeof(ExtMemWord) * numOutWords);
// unpack output bits and convert output back to float
unpackAndDebinarize(bufOut, out);
}
}
template<unsigned int inWidth, unsigned int outWidth, typename LowPrecType>
void testPrebuiltCIFAR10(std::vector<tiny_cnn::vec_t> & imgs, std::vector<tiny_cnn::label_t> & labels, const unsigned int numCategories) {
const unsigned int count = imgs.size();
cout << "Packing and interleaving CIFAR-10 inputs..." << endl;
// # of ExtMemWords per image
const unsigned int psi = paddedSize(imgs[0].size() * inWidth, bitsPerExtMemWord) / bitsPerExtMemWord;
// # of ExtMemWords per output
const unsigned int pso = paddedSize(numCategories * outWidth, bitsPerExtMemWord) / bitsPerExtMemWord;
if(INPUT_BUF_ENTRIES < count*psi) {
throw "Not enough space in accelBufIn";
}
if(OUTPUT_BUF_ENTRIES < count*pso) {
throw "Not enough space in accelBufOut";
}
// allocate host-side buffers for packed input and outputs
ExtMemWord * packedImages = new ExtMemWord[(count * psi)];
ExtMemWord * packedOut = new ExtMemWord[(count * pso)];
tiny_cnn::chaninterleave_layer<tiny_cnn::activation::identity> interleaver(3, 32 * 32, false);
// interleave and pack inputs
for(unsigned int i = 0; i < count; i++) {
tiny_cnn::vec_t interleaved = interleaver.forward_propagation(imgs[i], 0);
quantiseAndPack<inWidth, 1>(interleaved, &packedImages[i * psi], psi);
}
cout << "Running prebuilt CIFAR-10 test for " << count << " images..." << endl;
// copy inputs to accelerator
thePlatform->copyBufferHostToAccel((void *)packedImages, accelBufIn, sizeof(ExtMemWord) * count * psi);
// set number of images to recognize
thePlatform->writeJamRegAddr(0x5C, count);
// recognize
auto t1 = chrono::high_resolution_clock::now();
ExecAccel();
auto t2 = chrono::high_resolution_clock::now();
// copy results back to host
thePlatform->copyBufferAccelToHost(accelBufOut, (void *)packedOut, sizeof(ExtMemWord) * count * pso);
// compare against labels
unsigned int ok = 0, failed = 0;
tiny_cnn::vec_t outTest(numCategories, 0);
for(unsigned int i = 0; i < count; i++) {
copyFromLowPrecBuffer<LowPrecType>(&packedOut[i * pso], outTest);
int maxInd = 0;
LowPrecType maxVal = 0;
for(unsigned int j = 0; j < numCategories; j++) {
if(outTest[j] > maxVal) {
maxVal = outTest[j];
maxInd = j;
}
}
if(maxInd == labels[i]) {
ok++;
} else {
failed++;
}
}
cout << "Succeeded " << ok << " failed " << failed << " accuracy " << 100.0 * (float)ok / count << "%" << endl;
auto duration = chrono::duration_cast<chrono::microseconds>( t2 - t1 ).count();
float usecPerImage = (float)duration / (count);
cout << "Inference took " << duration << " microseconds, " << usecPerImage << " usec per image" << endl;
cout << "Classification rate: " << 1000000.0 / usecPerImage << " images per second" << endl;
delete[] packedImages;
delete[] packedOut;
}
template<unsigned int inWidth, unsigned int outWidth, typename LowPrecType>
std::vector<int> testPrebuiltCIFAR10_from_image(std::vector<tiny_cnn::vec_t> & imgs, const unsigned int numCategories, float &usecPerImage) {
const unsigned int count = 1;
cout << "Packing and interleaving CIFAR-10 inputs..." << endl;
// number of ExtMemWords per image
const unsigned int psi = paddedSize(imgs[0].size()*inWidth, bitsPerExtMemWord) / bitsPerExtMemWord;
// number of ExtMemWords per output
const unsigned int pso = paddedSize(64*outWidth, bitsPerExtMemWord) / bitsPerExtMemWord;
if(INPUT_BUF_ENTRIES < count*psi) {
throw "Not enough space in accelBufIn";
}
if(OUTPUT_BUF_ENTRIES < count*pso) {
throw "Not enough space in accelBufOut";
}
// allocate host-side buffers for packed input and outputs
ExtMemWord * packedImages = new ExtMemWord[(count * psi)];
ExtMemWord * packedOut = new ExtMemWord[(count * pso)];
tiny_cnn::chaninterleave_layer<tiny_cnn::activation::identity> interleaver(3, 32 * 32, false);
// interleave and pack inputs
for(unsigned int i = 0; i < count; i++) {
tiny_cnn::vec_t interleaved = interleaver.forward_propagation(imgs[i], 0);
quantiseAndPack<inWidth, 1>(interleaved, &packedImages[i * psi], psi);
}
cout << "Running prebuilt CIFAR-10 test for " << count << " images..." << endl;
// copy inputs to accelerator
thePlatform->copyBufferHostToAccel((void *)packedImages, accelBufIn, sizeof(ExtMemWord) * count * psi);
// set number of images to recognize
thePlatform->writeJamRegAddr(0x5C, count);
// recognize
auto t1 = chrono::high_resolution_clock::now();
ExecAccel();
auto t2 = chrono::high_resolution_clock::now();
// copy results back to host
thePlatform->copyBufferAccelToHost(accelBufOut, (void *)packedOut, sizeof(ExtMemWord) * count * pso);
// compare against labels
unsigned int ok = 0, failed = 0;
tiny_cnn::vec_t outTest(numCategories, 0);
copyFromLowPrecBuffer<LowPrecType>(&packedOut[0], outTest);
std::vector<int> result;
for(unsigned int j = 0; j < numCategories; j++) {
result.push_back(outTest[j]);
}
auto duration = chrono::duration_cast<chrono::microseconds>( t2 - t1 ).count();
usecPerImage = (float)duration / (count);
cout << "Inference took " << duration << " microseconds, " << usecPerImage << " usec per image" << endl;
cout << "Classification rate: " << 1000000.0 / usecPerImage << " images per second" << endl;
delete [] packedImages;
delete [] packedOut;
return result;
}
template<unsigned int inWidth, unsigned int outWidth, typename LowPrecType>
std::vector<int> testPrebuiltCUSTOM_from_image(std::vector<tiny_cnn::vec_t> & imgs, const unsigned int numCategories, float &usecPerImage) {
const unsigned int count = 1;
cout << "[HW-mode] Packing and interleaving CUSTOM inputs..." << endl;
// number of ExtMemWords per image
const unsigned int psi = paddedSize(imgs[0].size()*inWidth, bitsPerExtMemWord) / bitsPerExtMemWord;
// number of ExtMemWords per output
const unsigned int pso = paddedSize(64*outWidth, bitsPerExtMemWord) / bitsPerExtMemWord;
if(INPUT_BUF_ENTRIES < count*psi) {
throw "Not enough space in accelBufIn";
}
if(OUTPUT_BUF_ENTRIES < count*pso) {
throw "Not enough space in accelBufOut";
}
// allocate host-side buffers for packed input and outputs
ExtMemWord * packedImages = new ExtMemWord[(count * psi)];
ExtMemWord * packedOut = new ExtMemWord[(count * pso)];
tiny_cnn::chaninterleave_layer<tiny_cnn::activation::identity> interleaver(1, 60 * 60, false);
// interleave and pack inputs
for(unsigned int i = 0; i < count; i++) {
tiny_cnn::vec_t interleaved = interleaver.forward_propagation(imgs[i], 0);
quantiseAndPack<inWidth, 1>(interleaved, &packedImages[i * psi], psi);
}
cout << "Running prebuilt CUSTOM test for " << count << " images..." << endl;
// copy inputs to accelerator
thePlatform->copyBufferHostToAccel((void *)packedImages, accelBufIn, sizeof(ExtMemWord) * count * psi);
// set number of images to recognize
thePlatform->writeJamRegAddr(0x5C, count);
// recognize
auto t1 = chrono::high_resolution_clock::now();
ExecAccel();
auto t2 = chrono::high_resolution_clock::now();
// copy results back to host
thePlatform->copyBufferAccelToHost(accelBufOut, (void *)packedOut, sizeof(ExtMemWord) * count * pso);
// compare against labels
unsigned int ok = 0, failed = 0;
tiny_cnn::vec_t outTest(numCategories, 0);
copyFromLowPrecBuffer<LowPrecType>(&packedOut[0], outTest);
std::vector<int> result;
for(unsigned int j = 0; j < numCategories; j++) {
result.push_back(outTest[j]);
}
auto duration = chrono::duration_cast<chrono::microseconds>( t2 - t1 ).count();
usecPerImage = (float)duration / (count);
cout << "Inference took " << duration << " microseconds, " << usecPerImage << " usec per image" << endl;
cout << "Classification rate: " << 1000000.0 / usecPerImage << " images per second" << endl;
delete [] packedImages;
delete [] packedOut;
return result;
}
template<unsigned int inWidth, unsigned int outWidth, typename LowPrecType>
std::vector<int> testPrebuiltCIFAR10_multiple_images(std::vector<tiny_cnn::vec_t> & imgs, const unsigned int numCategories, std::vector<int> & detailed_results, float &usecPerImage) {
const unsigned int count = imgs.size();
std::vector<int> results;
cout << "Packing and interleaving CIFAR-""10 inputs..." << endl;
// number of ExtMemWords per image
const unsigned int psi = paddedSize(imgs[0].size()*inWidth, bitsPerExtMemWord) / bitsPerExtMemWord;
// number of ExtMemWords per output
const unsigned int pso = paddedSize(64*outWidth, bitsPerExtMemWord) / bitsPerExtMemWord;
if(INPUT_BUF_ENTRIES < count*psi) {
throw "Not enough space in accelBufIn";
}
if(OUTPUT_BUF_ENTRIES < count*pso) {
throw "Not enough space in accelBufOut";
}
// allocate host-side buffers for packed input and outputs
ExtMemWord * packedImages = new ExtMemWord[(count * psi)];
ExtMemWord * packedOut = new ExtMemWord[(count * pso)];
tiny_cnn::chaninterleave_layer<tiny_cnn::activation::identity> interleaver(3, 32 * 32, false);
// interleave and pack inputs
for(unsigned int i = 0; i < count; i++) {
tiny_cnn::vec_t interleaved = interleaver.forward_propagation(imgs[i], 0);
quantiseAndPack<inWidth, 1>(interleaved, &packedImages[i * psi], psi);
}
cout << "Running prebuilt CIFAR-10 test for " << count << " images..." << endl;
// copy inputs to accelerator
thePlatform->copyBufferHostToAccel((void *)packedImages, accelBufIn, sizeof(ExtMemWord) * count * psi);
// set number of images to recognize
thePlatform->writeJamRegAddr(0x5C, count);
// recognize
auto t1 = chrono::high_resolution_clock::now();
ExecAccel();
auto t2 = chrono::high_resolution_clock::now();
// copy results back to host
thePlatform->copyBufferAccelToHost(accelBufOut, (void *)packedOut, sizeof(ExtMemWord) * count * pso);
tiny_cnn::vec_t outTest(numCategories, 0);
for(unsigned int i = 0; i < count; i++) {
copyFromLowPrecBuffer<LowPrecType>(&packedOut[i * pso], outTest);
int maxInd = 0;
LowPrecType maxVal = 0;
for(unsigned int j = 0; j < numCategories; j++) {
detailed_results.push_back(outTest[j]);
if(outTest[j] > maxVal) {
maxVal = outTest[j];
maxInd = j;
}
}
results.push_back(maxInd);
}
auto duration = chrono::duration_cast<chrono::microseconds>(t2 - t1).count();
usecPerImage = (float)duration / (count);
cout << "Inference took " << duration << " microseconds, " << usecPerImage << " usec per image" << endl;
cout << "Classification rate: " << 1000000.0 / usecPerImage << " images per second" << endl;
delete[] packedImages;
delete[] packedOut;
return results;
}
#endif
4.4.4. [S/W] Putting it together
Finally, main() has a check that fails the build when the inference result is wrong; it's a nuisance here, so comment it out.
The complete file is shown below.
#include "tiny_cnn/tiny_cnn.h"
#include "tiny_cnn/util/util.h"
#include <iostream>
#include <string.h>
#include <chrono>
#include "foldedmv-offload.h"
#include <algorithm>
using namespace std;
using namespace tiny_cnn;
using namespace tiny_cnn::activation;
void makeNetwork(network<mse, adagrad> & nn) {
nn
#ifdef OFFLOAD
<< chaninterleave_layer<identity>(1, 60 * 60, false)
<< offloaded_layer(1 * 60 * 60, 10, &FixedFoldedMVOffload<8, 1, ap_int<16>>, 0xdeadbeef, 0)
#endif
;
}
void parse_image_grayscale(const std::string& filename,
std::vector<vec_t> *train_images,
int img_size)
{
std::cout << "[parse_image_grayscale]: called" << std::endl;
std::ifstream ifs(filename.c_str(), std::ios::in | std::ios::binary);
if (ifs.fail() || ifs.bad())
throw nn_error("failed to open file:" + filename);
std::vector<unsigned char> buf(img_size);
if (!ifs.read((char*) &buf[0], img_size)) return; // load buffer
vec_t img;
std::cout << "[parse_image_grayscale]: cast unsigned char" << std::endl;
std::transform(buf.begin(), buf.end(), std::back_inserter(img),
[=](unsigned char c) { return c * (1.0 / 255.0) - 1; });
train_images->push_back(img);
std::cout << "[parse_image_grayscale]: indicating loading data the below" << std::endl;
std::cout << img[0] << ", " << img[1] << ", " << img[2] << ", " << img[3] << ", " << img[4] << ", " << img[5] << endl;
std::cout << "[parse_image_grayscale]: done" << std::endl;
}
extern "C" void load_parameters(const char* path) {
#include "config.h"
FoldedMVInit("cnvCustom");
network<mse, adagrad> nn;
makeNetwork(nn);
cout << "Setting network weights and thresholds in accelerator..." << endl;
FoldedMVLoadLayerMem(path, 0, L0_PE, L0_WMEM, L0_TMEM, L0_API);
FoldedMVLoadLayerMem(path, 1, L1_PE, L1_WMEM, L1_TMEM, L1_API);
FoldedMVLoadLayerMem(path, 2, L2_PE, L2_WMEM, L2_TMEM, L2_API);
FoldedMVLoadLayerMem(path, 3, L3_PE, L3_WMEM, L3_TMEM, L3_API);
FoldedMVLoadLayerMem(path, 4, L4_PE, L4_WMEM, L4_TMEM, L4_API);
FoldedMVLoadLayerMem(path, 5, L5_PE, L5_WMEM, L5_TMEM, L5_API);
FoldedMVLoadLayerMem(path, 6, L6_PE, L6_WMEM, L6_TMEM, L6_API);
FoldedMVLoadLayerMem(path, 7, L7_PE, L7_WMEM, L7_TMEM, L7_API);
FoldedMVLoadLayerMem(path, 8, L8_PE, L8_WMEM, L8_TMEM, 0);
}
extern "C" int inference(const char* path, int results[64], int number_class, float* usecPerImage) {
cout << "[main_python.cpp::inference::called" << endl;
std::vector<label_t> test_labels;
std::vector<vec_t> test_images;
std::vector<int> class_result;
float usecPerImage_int;
FoldedMVInit("cnvCustom");
network<mse, adagrad> nn;
makeNetwork(nn);
parse_image_grayscale(path, &test_images, 60*60); // load image
/*
// gray scale data
std::vector<vec_t> test_image;
std::vector<label_t> test_label;
vec_t img;
img.resize(60 * 60 * 1, -1.0);
for (int i = 0; i < 60 * 60 * 1; i++){
img[i] = 0.5;
}
test_image.push_back(img);*/
// make test_images loadable: prepare a 60*60*1 binary (the deer.bin equivalent)
class_result=testPrebuiltCUSTOM_from_image<8, 16, ap_int<16>>(test_images, number_class, usecPerImage_int);
if(results) {
std::copy(class_result.begin(),class_result.end(), results);
}
if (usecPerImage) {
*usecPerImage = usecPerImage_int;
}
return (std::distance(class_result.begin(),std::max_element(class_result.begin(), class_result.end())));
}
extern "C" int* inference_multiple(const char* path, int number_class, int* image_number, float* usecPerImage, int enable_detail = 0) {
std::vector<int> detailed_results;
std::vector<label_t> test_labels;
std::vector<vec_t> test_images;
std::vector<int> all_result;
float usecPerImage_int;
int * result;
FoldedMVInit("cnvCustom");
network<mse, adagrad> nn;
makeNetwork(nn);
parse_cifar10(path, &test_images, &test_labels, -1.0, 1.0, 0, 0);
all_result=testPrebuiltCIFAR10_multiple_images<8, 16, ap_int<16>>(test_images, number_class, detailed_results, usecPerImage_int);
if (image_number) {
*image_number = all_result.size();
}
if (usecPerImage) {
*usecPerImage = usecPerImage_int;
}
if (enable_detail) {
result = new int [detailed_results.size()];
std::copy(detailed_results.begin(),detailed_results.end(), result);
} else {
result = new int [all_result.size()];
std::copy(all_result.begin(),all_result.end(), result);
}
return result;
}
extern "C" void free_results(int* result) {
delete[] result;
}
extern "C" void deinit() {
FoldedMVDeinit();
}
extern "C" int main(int argc, char** argv) {
if (argc != 5) {
cout << "4 parameters are needed: " << endl;
cout << "1 - folder for the binarized weights (binparam-***) - full path " << endl;
cout << "2 - path to image to be classified" << endl;
cout << "3 - number of classes in the dataset" << endl;
cout << "4 - expected result" << endl;
return 1;
}
float execution_time = 0;
int class_inference = 0;
int scores[64];
load_parameters(argv[1]);
class_inference = inference(argv[2], scores, atol(argv[3]), &execution_time);
cout << "Detected class " << class_inference << endl;
cout << "in " << execution_time << " microseconds" << endl;
deinit();
/*if (class_inference != atol(argv[4])) {
return 1;
} else {
return 0;
}*/
// force success
return 0;
}
That completes the source code.
4.5. Building
There are two builds: H/W and S/W. The H/W build generates the bitstream for the FPGA side; the S/W build produces the SoC-side library that lets Python drive that bitstream.
4.5.1. H/W build
Work inside the /BNN-PYNQ/bnn/src/network/ directory.
First, confirm that the weight data and the network-definition sources are all in place. Then copy the script and open it:
$ cp make-hw.sh make-hw-custom.sh
Little needs to change, but the location of the weights must be set as below, and a test image is required: raw binary data written out as bytes. Normally the build is aborted if the test data is misclassified, but we removed that check, so any file filled with arbitrary bytes will do. Note that too little data triggers an error, so make the file generously large.
PARAMS="$XILINX_BNN_ROOT/../params/custom/$NETWORK"
TEST_INPUT="$XILINX_BNN_ROOT/../../tests/Test_image/60x60.bin"
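For the dummy test input, a minimal sketch (assumes NumPy; the path is an example, and oversizing the file avoids the too-little-data error mentioned above):
import numpy as np

# ten images' worth of random bytes so the testbench never runs short
np.random.randint(0, 256, size=10 * 60 * 60, dtype=np.uint8).tofile("60x60.bin")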
The final script is as follows.
#!/bin/bash
NETWORKS=$(ls -d *W*A*/ | cut -f1 -d'/' | tr "\n" " ")
if [ "$#" -ne 3 ]; then
echo "Usage: $0 <network> <platform> <mode>" >&2
echo "where <network> = $NETWORKS" >&2
echo "<platform> = pynqZ1-Z2 ultra96" >&2
echo "<mode> = regenerate (h)ls only, (b)itstream only, (a)ll" >&2
exit 1
fi
NETWORK=$1
PLATFORM=$2
MODE=$3
PATH_TO_VIVADO=$(which vivado)
PATH_TO_VIVADO_HLS=$(which vivado_hls)
if [ -z "$XILINX_BNN_ROOT" ]; then
export XILINX_BNN_ROOT="$( ( cd "$(dirname "$0")/.."; pwd) )"
fi
if [ -z "$PATH_TO_VIVADO" ]; then
echo "Error: Vivado not found."
exit 1
fi
if [ -z "$PATH_TO_VIVADO_HLS" ]; then
echo "Error: Vivado HLS not found."
exit 1
fi
if [ ! -d "$NETWORK" ]; then
echo "Error: Network is not available. Available are: $NETWORKS."
exit 1
fi
OLD_DIR=$(pwd)
cd $XILINX_BNN_ROOT
if [ -d "${XILINX_BNN_ROOT}/xilinx-tiny-cnn/" ]; then
echo "xilinx-tiny-cnn already cloned"
else
git clone https://github.com/Xilinx/xilinx-tiny-cnn.git
fi
cd $OLD_DIR
BNN_PATH=$XILINX_BNN_ROOT/network
HLS_SRC_DIR="$BNN_PATH/$NETWORK/hw"
HLS_OUT_DIR="$BNN_PATH/output/hls-syn/$NETWORK-$PLATFORM"
HLS_SCRIPT=$BNN_PATH/hls-syn.tcl
HLS_IP_REPO="$HLS_OUT_DIR/sol1/impl/ip"
VIVADO_HLS_LOG="$BNN_PATH/output/hls-syn/vivado_hls.log"
HLS_REPORT_PATH="$HLS_OUT_DIR/sol1/syn/report/BlackBoxJam_csynth.rpt"
REPORT_OUT_DIR="$BNN_PATH/output/report/$NETWORK-$PLATFORM"
VIVADO_SCRIPT_DIR=$XILINX_BNN_ROOT/library/script/$PLATFORM
VIVADO_SCRIPT=$VIVADO_SCRIPT_DIR/make-vivado-proj.tcl
# regenerate HLS if requested
if [[ ("$MODE" == "h") || ("$MODE" == "a") ]]; then
mkdir -p $HLS_OUT_DIR
mkdir -p $REPORT_OUT_DIR
OLDDIR=$(pwd)
echo "Calling Vivado HLS for hardware synthesis..."
cd $HLS_OUT_DIR/..
if [[ ("$NETWORK" == "cnv"*) ]]; then
PARAMS="$XILINX_BNN_ROOT/../params/cifar10/$NETWORK"
TEST_INPUT="$XILINX_BNN_ROOT/../../tests/Test_image/deer.bin"
TEST_RESULT=4
elif [[ ("$NETWORK" == "lfc"*) ]]; then
PARAMS="$XILINX_BNN_ROOT/../params/mnist/$NETWORK"
TEST_INPUT="$XILINX_BNN_ROOT/../../tests/Test_image/3.image-idx3-ubyte"
TEST_RESULT=3
fi
## adding ##################################################################
PARAMS="$XILINX_BNN_ROOT/../params/custom/$NETWORK"
TEST_INPUT="$XILINX_BNN_ROOT/../../tests/Test_image/60x60.bin"
# TEST_INPUT="$XILINX_BNN_ROOT/../../tests/Test_image/3.image-idx3-ubyte"
# TEST_RESULT=3
############################################################################
if [[ ("$PLATFORM" == "pynqZ1-Z2") ]]; then
PLATFORM_PART="xc7z020clg400-1"
TARGET_CLOCK=5
elif [[ ("$PLATFORM" == "ultra96") ]]; then
PLATFORM_PART="xczu3eg-sbva484-1-i"
TARGET_CLOCK=3
else
echo "Error: Platform not supported. Please choose between pynqZ1-Z2 and ultra96."
exit 1
fi
if [ ! -d "$PARAMS" ]; then
echo "Error: Please copy binary weight and threshold parameters to $PARAMS"
exit 1
fi
vivado_hls -f $HLS_SCRIPT -tclargs $NETWORK-$PLATFORM $HLS_SRC_DIR $PARAMS $TEST_INPUT $TEST_RESULT $PLATFORM_PART $TARGET_CLOCK
if cat $VIVADO_HLS_LOG | grep "ERROR"; then
echo "Error in Vivado_HLS"
exit 1
fi
if cat $VIVADO_HLS_LOG | grep "CRITICAL WARNING"; then
echo "Critical warning in Vivado_HLS"
exit 1
fi
cat $HLS_REPORT_PATH | grep "Utilization Estimates" -A 20 > $REPORT_OUT_DIR/hls.txt
cat $REPORT_OUT_DIR/hls.txt
echo "HLS synthesis complete"
echo "HLS-generated IP is at $HLS_IP_REPO"
cd $OLDDIR
fi
# generate bitstream if requested
TARGET_NAME="$NETWORK-$PLATFORM"
VIVADO_OUT_DIR="$BNN_PATH/output/vivado/$TARGET_NAME"
BITSTREAM_PATH="$BNN_PATH/output/bitstream"
TARGET_BITSTREAM="$BITSTREAM_PATH/$NETWORK-$PLATFORM.bit"
TARGET_TCL="$BITSTREAM_PATH/$NETWORK-$PLATFORM.tcl"
if [[ ("$MODE" == "b") || ("$MODE" == "a") ]]; then
mkdir -p "$BNN_PATH/output/vivado"
mkdir -p $BITSTREAM_PATH
echo "Setting up Vivado project..."
if [ -d "$VIVADO_OUT_DIR" ]; then
read -p "Remove existing project at $VIVADO_OUT_DIR (y/n)? " -n 1 -r
echo # (optional) move to a new line
if [[ $REPLY =~ ^[Nn]$ ]]
then
echo "Cancelled"
exit 1
fi
rm -rf $VIVADO_OUT_DIR
fi
vivado -mode batch -notrace -source $VIVADO_SCRIPT -tclargs $HLS_IP_REPO $TARGET_NAME $VIVADO_OUT_DIR $VIVADO_SCRIPT_DIR
cp -f "$VIVADO_OUT_DIR/$TARGET_NAME.runs/impl_1/procsys_wrapper.bit" $TARGET_BITSTREAM
cp -f "$VIVADO_OUT_DIR/procsys.tcl" $TARGET_TCL
# extract parts of the post-implementation reports
cat "$VIVADO_OUT_DIR/$TARGET_NAME.runs/impl_1/procsys_wrapper_timing_summary_routed.rpt" | grep "| Design Timing Summary" -B 3 -A 10 > $REPORT_OUT_DIR/vivado.txt
cat "$VIVADO_OUT_DIR/$TARGET_NAME.runs/impl_1/procsys_wrapper_utilization_placed.rpt" | grep "| Slice LUTs" -B 3 -A 11 >> $REPORT_OUT_DIR/vivado.txt
cat "$VIVADO_OUT_DIR/$TARGET_NAME.runs/impl_1/procsys_wrapper_utilization_placed.rpt" | grep "| CLB LUTs" -B 3 -A 11 >> $REPORT_OUT_DIR/vivado.txt
cat "$VIVADO_OUT_DIR/$TARGET_NAME.runs/impl_1/procsys_wrapper_utilization_placed.rpt" | grep "| Block RAM Tile" -B 3 -A 5 >> $REPORT_OUT_DIR/vivado.txt
cat "$VIVADO_OUT_DIR/$TARGET_NAME.runs/impl_1/procsys_wrapper_utilization_placed.rpt" | grep "| DSPs" -B 3 -A 3 >> $REPORT_OUT_DIR/vivado.txt
echo "Bitstream copied to $TARGET_BITSTREAM"
fi
echo "Done!"
exit 0
Run it like this:
./make-hw-custom.sh cnvCustom pynqZ1-Z2 a
Compilation should finish within half a day or so. The bitstream lands in the output folder, and the report folder under output holds FPGA utilization figures and the like for reference.
4.5.2. S/W build
Next comes the S/W build, which handles DMA and related plumbing. It emits a .so shared library, which is what we ultimately use to send images to the FPGA.
Work inside the /BNN-PYNQ/bnn/src/network/ directory. As with the H/W build, copy the script and open it:
$ cp make-sw.sh make-sw-custom.sh
It mostly works as-is, but libcma-related errors sometimes appear; cloning PYNQ and overriding the include path as below often resolves them.
PYNQ_INCLUDE_PATH="/YOUR_PATH/PYNQ/sdbuild/packages/libsds/libcma/"
The final script is as follows.
#!/bin/bash
NETWORKS=$(ls -d *W*A*/ | cut -f1 -d'/' | tr "\n" " ")
if [ "$#" -ne 2 ]; then
echo "Usage: $0 <network> <runtime> where ">&2
echo "<network> = $NETWORKS" >&2
echo "<runtime> = python_sw python_hw" >&2
exit 1
fi
NETWORK=$1
RUNTIME=$2
BOARD="Pynq-Z1"
#if [ -z "$XILINX_BNN_ROOT" ]; then
# echo "Need to set XILINX_BNN_ROOT"
# exit 1
#fi
if [ -z "$XILINX_BNN_ROOT" ]; then
export XILINX_BNN_ROOT="$( ( cd "$(dirname "$0")/.."; pwd) )"
fi
if [ -z "$VIVADOHLS_INCLUDE_PATH" ]; then
VIVADOHLS_INCLUDE_PATH="/tools/Xilinx/Vivado/2018.3/include"
#"$(which vivado_hls)/../../include/"
#echo "Need to set VIVADOHLS_INCLUDE_PATH to rebuild from source"
#echo "The pre-compiled shared objects will be included"
#exit 1
fi
OLD_DIR=$(pwd)
cd $XILINX_BNN_ROOT
if [ -d "xilinx-tiny-cnn/" ]; then
echo "xilinx-tiny-cnn already cloned"
else
git clone https://github.com/Xilinx/xilinx-tiny-cnn.git
fi
cd $OLD_DIR
if [[ ("$BOARD" == "Pynq-Z1") || ("$BOARD" == "Pynq-Z2") ]]; then
DEF_BOARD="PYNQ"
PLATFORM="pynqZ1-Z2"
elif [[ ("$BOARD" == "Ultra96") ]]; then
DEF_BOARD="ULTRA"
PLATFORM="ultra96"
else
echo "Error: BOARD variable has to be Ultra96, Pynq-Z1 and Pynq-Z2 Board."
exit 1
fi
TINYCNN_PATH=$XILINX_BNN_ROOT/xilinx-tiny-cnn
BNN_PATH=$XILINX_BNN_ROOT/network
BNNLIB=$XILINX_BNN_ROOT/library
HOSTLIB=$BNNLIB/host
HLSLIB=$BNNLIB/hls
HLSTOP=$BNN_PATH/$NETWORK/hw
DRIVER_PATH=$BNNLIB/driver
SRCS_HOSTLIB=$HOSTLIB/*.cpp
SRCS_HLSLIB=$HLSLIB/*.cpp
SRCS_HLSTOP=$HLSTOP/top.cpp
SRCS_HOST=$BNN_PATH/$NETWORK/sw/main.cpp
OUTPUT_DIR=$XILINX_BNN_ROOT/network/output/sw
mkdir -p $OUTPUT_DIR
OUTPUT_FILE="$OUTPUT_DIR/$RUNTIME-$NETWORK-$PLATFORM"
PYNQ_INCLUDE_PATH="/YOUR_PATH/PYNQ/sdbuild/packages/libsds/libcma/"
if [[ ("$RUNTIME" == "python_sw") ]]; then
SRCS_HOST=$BNN_PATH/$NETWORK/sw/main_python.cpp
SRCS_ALL="$SRCS_HOSTLIB $SRCS_HLSTOP $SRCS_HOST"
arm-linux-gnueabihf-g++-7 -g -DOFFLOAD -DRAWHLS -std=c++11 -pthread -O2 -fPIC -shared $SRCS_ALL -I$VIVADOHLS_INCLUDE_PATH -I$TINYCNN_PATH -I$HOSTLIB -I$HLSLIB -I$HLSTOP -o $OUTPUT_FILE.so
elif [[ ("$RUNTIME" == "python_hw") ]]; then
SRCS_HOST=$BNN_PATH/$NETWORK/sw/main_python.cpp
SRCS_ALL="$DRIVER_PATH/platform-xlnk.cpp $SRCS_HOSTLIB $SRCS_HOST"
arm-linux-gnueabihf-g++-7 -g -DOFFLOAD -D$DEF_BOARD -std=c++11 -pthread -O3 -fPIC -shared $SRCS_ALL -I$PYNQ_INCLUDE_PATH -I$DRIVER_PATH -I$VIVADOHLS_INCLUDE_PATH -I$TINYCNN_PATH -I$HOSTLIB -I$HLSLIB -I$HLSTOP -o $OUTPUT_FILE.so -lcma
fi
echo "Output at $OUTPUT_FILE"
Run it like this:
./make-sw-custom.sh cnvCustom python_hw
./make-sw-custom.sh cnvCustom python_sw
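Once the .so files exist, they can be sanity-checked from Python on the board via their extern "C" interface before touching bnn.py. This is only a sketch; the parameter and image paths are assumptions:
import cffi

ffi = cffi.FFI()
ffi.cdef("""
    void load_parameters(const char* path);
    int inference(const char* path, int results[64], int number_class, float* usecPerImage);
    void deinit();
""")
lib = ffi.dlopen("./output/sw/python_hw-cnvCustom-pynqZ1-Z2.so")

lib.load_parameters(b"/YOUR_PATH/BNN-PYNQ/bnn/params/custom/cnvCustom")  # assumed weight dir
results = ffi.new("int[64]")
usec = ffi.new("float *")
cls = lib.inference(b"/YOUR_PATH/60x60.bin", results, 10, usec)  # assumed test image
print("class:", cls, "usec/image:", usec[0])
lib.deinit()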
4.5.3. Preparing the PYNQ
If you are using a PYNQ, things can stay as they are; for other boards, see:
https://qiita.com/harmegiddo/private/0bab2b39b75db9ce88f7
For the PYNQ, read on. First, move the files built so far over to the PYNQ. On the host machine the needed files are:
・Bitstream: /BNN-PYNQ/bnn/src/network/output/cnvCustom-pynqZ1-Z2.bit
・Bitstream TCL: /BNN-PYNQ/bnn/src/network/output/cnvCustom-pynqZ1-Z2.tcl
・SW Library: /BNN-PYNQ/bnn/src/network/output/sw/python_sw-cnvCustom-pynqZ1-Z2.so
・HW Library: /BNN-PYNQ/bnn/src/network/output/sw/python_hw-cnvCustom-pynqZ1-Z2.so
The working folder on the PYNQ board is:
/usr/local/lib/python3.6/dist-packages/bnn
Move the generated bitstream into the following folder:
/usr/local/lib/python3.6/dist-packages/bnn/bitstreams/pynqZ1-Z2
※ move both the .tcl and the .bit
Move the generated DMA libraries into the following folder:
/usr/local/lib/python3.6/dist-packages/bnn/libraries/pynqZ1-Z2
※ python_hw-cnvCustom-pynqZ1-Z2.so and python_sw-cnvCustom-pynqZ1-Z2.so
Next, modify the source so these files can be loaded. Open:
$ vi /usr/local/lib/python3.6/dist-packages/bnn/bnn.py
This is the base file of BNN-PYNQ. Here we define the name of the newly added bitstream to load.
from pynq import Overlay, PL
from PIL import Image
import numpy as np
import cffi
import os
import tempfile
RUNTIME_HW = "python_hw"
RUNTIME_SW = "python_sw"
NETWORK_CNVW1A1 = "cnvW1A1"
NETWORK_CNVW1A2 = "cnvW1A2"
NETWORK_CNVW2A2 = "cnvW2A2"
NETWORK_LFCW1A1 = "lfcW1A1"
NETWORK_LFCW1A2 = "lfcW1A2"
NETWORK_CNVCUSTOM = "cnvCustom"
if os.environ['BOARD'] == 'Ultra96':
PLATFORM="ultra96"
elif os.environ['BOARD'] == 'Pynq-Z1' or os.environ['BOARD'] == 'Pynq-Z2':
PLATFORM="pynqZ1-Z2"
else:
raise RuntimeError("Board not supported")
...
If you want to adjust the image size at classification time, you can also add something like this to the CnvClassifier class:
def image_to_custom(self, img, fp):
img = img.resize((60, 60))
img = (np.array(img))
img = img[:,:].flatten()
fp.write(img.tobytes())
def classify_custom_image(self, img):
#f = tempfile.NamedTemporaryFile()
#self.image_to_custom(img, f)
#f.flush()
#print(f.name)
with tempfile.NamedTemporaryFile() as tmp:
#self.image_to_custom(img, tmp)
tmp.write(img.tobytes())
tmp.flush()
result = self.bnn.inference(tmp.name)
self.usecPerImage = self.bnn.usecPerImage
return result
Next, edit the following file so that bnn.py can be initialized:
$ vi /usr/local/lib/python3.6/dist-packages/bnn/__init__.py
Rewrite its contents as follows:
from .bnn import PynqBNN, CnvClassifier, LfcClassifier, RUNTIME_HW, RUNTIME_SW
from .bnn import NETWORK_CNVW1A1, NETWORK_CNVW1A2, NETWORK_CNVW2A2, NETWORK_LFCW1A1, NETWORK_LFCW1A2, NETWORK_CNVCUSTOM, available_params
__version__ = 0.1
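A quick sanity check on the board that the new constant is exported (a sketch):
import bnn
print(bnn.NETWORK_CNVCUSTOM)  # expect: cnvCustom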
With this in place, everything can be driven from Jupyter on the PYNQ. First come the imports.
import cv2
import random  # used by the loader below when randomly=True
import numpy as np
from PIL import Image
import bnn
import matplotlib.pyplot as plt
%matplotlib inline
Next, write the image-loading code. batch.log is assumed to list the location and label of each test sample, like so:
/data/img/img1.png 0
/data/img/img2.png 1
And here is the loader:
img_size_x = 60
img_size_y = 60
img_size_c = 1
# flatten each image into a 1-D vector
def load_data_all_no_one_hot(filename="", classes_num=10, randomly = False):
setFileList = []
tmp_line = []
f = open(filename, 'r')
for line in f:
tmp_line.append(line.rstrip())
_num = len(tmp_line)
if randomly == True:
tmp_line = random.sample(tmp_line, _num)
images = []
labels = []
for i in range(_num):
parse = tmp_line[i].split()
# read data
img = cv2.imread(parse[0])
img = cv2.resize(img, (img_size_x, img_size_y))[:,:,0]
img = np.array(img, np.uint8)
# img
images.append(img.flatten())
#print(float(parse[1]))
memo = int(round( float(parse[1]) * float(classes_num - 1) ))
labels.append(memo)
f.close()
return (images, labels)
imgs, labls = load_data_all_no_one_hot("/home/xilinx/jupyter_notebooks/data/out_length/test/batch.log")
Next, load the BNN we have built and classify:
classifier = bnn.CnvClassifier(bnn.NETWORK_CNVCUSTOM, 'custom', bnn.RUNTIME_HW)
class_out = classifier.classify_custom_image(imgs[0])
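To run the whole test set rather than a single image, a minimal sketch (assumes imgs and labls from the loader above):
# classify every test image and report overall accuracy
correct = 0
for img, lab in zip(imgs, labls):
    if classifier.classify_custom_image(img) == lab:
        correct += 1
print("accuracy: %.2f%%" % (100.0 * correct / len(imgs)))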
4.5.4. Results on the PYNQ
10. Debug
batch.log:
filename1 class1
filename2 class2
filename3 class3
...
generate_bin.py:
import cv2
import numpy as np
import tempfile
def load_data_all(filename=""):
setFileList = []
tmp_line = []
f = open(filename, 'r')
for line in f:
tmp_line.append(line.rstrip())
_num = len(tmp_line)
images = []
labels = []
for i in range(_num):
parse = tmp_line[i].split()
# read data
img = cv2.imread(parse[0])
img = cv2.resize(img, (60, 60))[:,:,0]
img = np.array(img, np.uint8)
# saving, grayscale
images.append(img.flatten())
labels.append(int(parse[1]))
f.close()
return (images, labels)
imgs, labls = load_data_all("./batch.log")
# generate test data
f = open('tests.bin','wb')  # binary mode: we write raw bytes
for i in range(len(imgs)):
f.write(imgs[i].tobytes())
f.close()
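A quick check that the dump has the expected size (one byte per pixel, grayscale):
import os
expected = len(imgs) * 60 * 60  # images x pixels, one byte each
print(os.path.getsize('tests.bin'), expected)  # the two numbers should match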