What is MIOpen?
Roughly speaking, it is the AMD counterpart of cuDNN: it accelerates machine-learning kernels on the GPU (or accelerator).
It has two backends, HIP and OpenCL.
(The kernels appear to be written as OpenCL code; with the HIP backend they are converted via clang-opencl. The HIP backend seems to be slightly better supported, presumably because it can use asm and the like (e.g. bfloat16).)
Unlike cuDNN, MIOpen currently supports Linux only (support for Windows, macOS, or other unix-like systems is not planned: https://github.com/ROCmSoftwarePlatform/MIOpen/issues/38).
That said, the OpenCL build of MIOpen should be portable as long as no asm is involved, and with a few tweaks it reportedly runs in non-ROCm environments (even on NVIDIA/Intel GPUs?), so with some effort it might even be possible to get it running on Windows.
Using MIOpen (C/C++ API)
Environment
- Radeon VEGA 56
- Ubuntu 16.04 or 18.04
- ROCm and its development packages installed via apt https://github.com/RadeonOpenCompute/ROCm
- The program is assumed to be set up with CMake against
/opt/rocm/miopen/lib/cmake/miopen
The HIP version requires fiddly environment and compiler configuration, so we use the OpenCL version here.
On Ubuntu, install it with apt install miopen-opencl
(If the HIP version is already installed, remove it first with apt remove miopen-hip )
The MIOpen documentation pages only contain build/installation instructions and the API reference.
(There is no tutorial code either.)
The MIOpen API basically mirrors the cuDNN API, so we will write the code with cuDNN as a reference.
The cuDNN documentation https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html is also painful in that it has no tutorial code, but a sample that uses Convolution is available below.
(The cuDNN download site also seems to offer SDK samples (unverified), though looking at the ChainerX CC code is probably more helpful.)
Data format and data types
MIOpen supports only the NCHW data format (as of 2.0.1).
fp16 appears to be more or less fully supported (as of 2.0.1).
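As a reminder, NCHW means W is the fastest-varying dimension in memory. The following is a purely illustrative helper (not part of the MIOpen API) for computing the linear offset of an element in a packed NCHW buffer:
#include <cstddef>
// Illustration only: linear offset of element (n, c, h, w) in a packed NCHW buffer
// with shape (N, C, H, W). W varies fastest, then H, then C, then N.
inline size_t nchw_offset(size_t n, size_t c, size_t h, size_t w,
                          size_t C, size_t H, size_t W) {
  return ((n * C + c) * H + h) * W + w;
}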
MIOpen porting guide
For some reason it is a PDF and the content is outdated, but it is still a useful reference.
Perhaps because MIOpen is the newer of the two, its API is somewhat cleaner.
(In cuDNN, for example, cudnnSetTensor4dDescriptor and cudnnSetFilter4dDescriptor take their arguments in different orders, which is confusing.)
cudnnCreateFilterDescriptor => use the TensorDescriptor API (MIOpen describes filters with ordinary tensor descriptors).
In MIOpen convolution, you obtain the workspace first and then find the algorithm.
(In cuDNN, you get/find the algorithm first and then obtain the workspace.)
Code example
Here is a code example. It is based on cuDNN-sample and MIOpen test/main.cpp.
With the OpenCL backend, obtain the OpenCL queue with miopenGetStream,
then obtain the OpenCL context from it with clGetCommandQueueInfo.
(MIOpen creates the OpenCL context internally; creating your own context leads to a seg fault.)
Even when the conv algorithm is fixed, it appears you still need to call miopenFindConvolutionForwardAlgorithm itself.
(The code below uses the algorithm found by Find, perf.fwd_algo,
whereas miopen test/main.cpp hard-codes the algorithm.)
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <iostream>
#include <array>
#include <numeric>
// https://stackoverflow.com/questions/28500496/opencl-function-found-deprecated-by-visual-studio
#define CL_USE_DEPRECATED_OPENCL_1_2_APIS
#include <CL/cl.h>
#include <miopen/miopen.h>
#include <miopen/version.h>
#define CHECK_MIOPEN(cmd) \
{ \
miopenStatus_t err = (cmd); \
if (err != miopenStatusSuccess) { \
fprintf(stderr, "error: '%s'(%d) at %s:%d\n", \
miopenGetErrorString(err), err, __FILE__, __LINE__); \
exit(EXIT_FAILURE); \
} \
}
int main(int argc, char **argv)
{
std::cout << "MIOPEN_VERSION_MAJOR:" << MIOPEN_VERSION_MAJOR << std::endl;
std::cout << "MIOPEN_VERSION_MINOR:" << MIOPEN_VERSION_MINOR << std::endl;
std::cout << "MIOPEN_VERSION_PATCH:" << MIOPEN_VERSION_PATCH << std::endl;
miopenHandle_t handle;
CHECK_MIOPEN(miopenCreate(&handle));
// get OpenCL queue from miopen.
cl_command_queue q{};
CHECK_MIOPEN(miopenGetStream(handle, &q));
CHECK_MIOPEN(miopenEnableProfiling(handle, true));
// input tensor
const int in_shape[4] = {1, 1, 5, 5}; // NCHW
miopenTensorDescriptor_t in_desc;
CHECK_MIOPEN(miopenCreateTensorDescriptor(&in_desc));
CHECK_MIOPEN(miopenSet4dTensorDescriptor(in_desc, miopenFloat,
in_shape[0], in_shape[1], in_shape[2], in_shape[3])); // NCHW
// filter tensor
const int filt_shape[4] = {1, 1, 2, 2}; // KCHW
miopenTensorDescriptor_t filt_desc;
CHECK_MIOPEN(miopenCreateTensorDescriptor(&filt_desc));
CHECK_MIOPEN(miopenSet4dTensorDescriptor(filt_desc, miopenFloat,
filt_shape[0], filt_shape[1], filt_shape[2], filt_shape[3]));
miopenConvolutionDescriptor_t conv_desc;
CHECK_MIOPEN(miopenCreateConvolutionDescriptor(&conv_desc));
const int pad_h = 1;
const int pad_w = 1;
const int stride_h = 1;
const int stride_w = 1;
const int dilation_h = 1;
const int dilation_w = 1;
CHECK_MIOPEN(miopenInitConvolutionDescriptor(conv_desc, miopenConvolution, pad_h, pad_w, stride_h, stride_w, dilation_h, dilation_w));
// output
int out_shape[4]; // NCHW
CHECK_MIOPEN(miopenGetConvolutionForwardOutputDim(conv_desc, in_desc, filt_desc,
&out_shape[0], &out_shape[1],
&out_shape[2], &out_shape[3]));
miopenTensorDescriptor_t out_desc;
CHECK_MIOPEN(miopenCreateTensorDescriptor(&out_desc));
CHECK_MIOPEN(miopenSet4dTensorDescriptor(out_desc, miopenFloat,
out_shape[0], out_shape[1], out_shape[2], out_shape[3]));
// workspace
size_t ws_size = 0;
CHECK_MIOPEN(miopenConvolutionForwardGetWorkSpaceSize(handle, /* w */filt_desc, /* x */in_desc, conv_desc, /* y */out_desc, &ws_size));
//std::cout << "ws_size = " << ws_size << "\n";
// Get OpenCL context from queue
cl_context cl_ctx;
clGetCommandQueueInfo(q, CL_QUEUE_CONTEXT, sizeof(cl_context), &cl_ctx, nullptr);
size_t in_data_size = in_shape[0] * in_shape[1] * in_shape[2] * in_shape[3];
size_t filt_data_size = filt_shape[0] * filt_shape[1] * filt_shape[2] * filt_shape[3];
size_t out_data_size = out_shape[0] * out_shape[1] * out_shape[2] * out_shape[3];
cl_int errcode = 0;
cl_mem in_data = clCreateBuffer(cl_ctx, CL_MEM_READ_ONLY, in_data_size * sizeof(float), nullptr, &errcode);
if (errcode != CL_SUCCESS) {
std::cerr << "in_data clCreateBuffer failed.\n";
return EXIT_FAILURE;
}
cl_mem filt_data = clCreateBuffer(cl_ctx, CL_MEM_READ_ONLY, filt_data_size * sizeof(float), nullptr, &errcode);
if (errcode != CL_SUCCESS) {
std::cerr << "in_data clCreateBuffer failed.\n";
return EXIT_FAILURE;
}
cl_mem out_data = clCreateBuffer(cl_ctx, CL_MEM_READ_WRITE, out_data_size * sizeof(float), nullptr, &errcode);
if (errcode != CL_SUCCESS) {
std::cerr << "out_data clCreateBuffer failed.\n";
return EXIT_FAILURE;
}
cl_mem ws_data = clCreateBuffer(cl_ctx, CL_MEM_READ_WRITE, ws_size, nullptr, &errcode);
if (errcode != CL_SUCCESS) {
std::cerr << "ws_data clCreateBuffer failed.\n";
return EXIT_FAILURE;
}
std::vector<float> in_data_buf(in_data_size);
std::vector<float> filt_data_buf(filt_data_size);
std::vector<float> out_data_buf(out_data_size);
std::vector<uint8_t> ws_buf(ws_size);
//
// fill in_data_buf and filt_data_buf ...
//
//std::cout << "in_data_size = " << in_data_size << "\n";
//std::cout << "filt_data_size = " << filt_data_size << "\n";
//std::cout << "out_data_size = " << out_data_size << "\n";
errcode = clEnqueueWriteBuffer(q, in_data, CL_TRUE, 0, in_data_size * sizeof(float), in_data_buf.data(), 0, nullptr, nullptr);
if (errcode != CL_SUCCESS) {
std::cerr << "in_data clEnqueueWriteBuffer failed. " << errcode << "\n";
return EXIT_FAILURE;
}
errcode = clEnqueueWriteBuffer(q, filt_data, CL_TRUE, 0, filt_data_size * sizeof(float), filt_data_buf.data(), 0, nullptr, nullptr);
if (errcode != CL_SUCCESS) {
std::cerr << "in_data clEnqueueWriteBuffer failed. " << errcode << "\n";
return EXIT_FAILURE;
}
errcode = clEnqueueWriteBuffer(q, out_data, CL_TRUE, 0, out_data_size * sizeof(float), out_data_buf.data(), 0, nullptr, nullptr);
if (errcode != CL_SUCCESS) {
std::cerr << "in_data clEnqueueWriteBuffer failed.\n";
return EXIT_FAILURE;
}
errcode = clEnqueueWriteBuffer(q, ws_data, CL_TRUE, 0, ws_size, ws_buf.data(), 0, nullptr, nullptr);
if (errcode != CL_SUCCESS) {
std::cerr << "in_data clEnqueueWriteBuffer failed.\n";
return EXIT_FAILURE;
}
miopenConvAlgoPerf_t perf{};
int algo_count = 0;
bool exhaustive_search = false;
std::cout << "find conv algo" << std::endl;
CHECK_MIOPEN(miopenFindConvolutionForwardAlgorithm(handle, in_desc,
/* cl_mem */in_data, filt_desc, /* cl_mem */filt_data,
conv_desc, out_desc, /* cl_mem */out_data, /* req algos*/1, &algo_count, &perf, /* cl_mem */ws_data, ws_size, exhaustive_search));
float alpha = 1.0f;
float beta = 0.0f;
CHECK_MIOPEN(miopenConvolutionForward(handle,
&alpha,
in_desc,
/* cl_mem */in_data,
filt_desc,
/* cl_mem */filt_data,
conv_desc,
perf.fwd_algo,
&beta,
out_desc,
out_data,
ws_data,
ws_size));
float time = -1.0f;
CHECK_MIOPEN(miopenGetKernelTime(handle, &time));
//std::cout << "time : " << time << "\n";
clReleaseMemObject(in_data);
clReleaseMemObject(filt_data);
clReleaseMemObject(out_data);
clReleaseMemObject(ws_data);
clReleaseCommandQueue(q);
CHECK_MIOPEN(miopenDestroyConvolutionDescriptor(conv_desc));
CHECK_MIOPEN(miopenDestroyTensorDescriptor(in_desc));
CHECK_MIOPEN(miopenDestroyTensorDescriptor(filt_desc));
CHECK_MIOPEN(miopenDestroyTensorDescriptor(out_desc));
CHECK_MIOPEN(miopenDestroy(handle));
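Note that the example above runs the convolution but never reads the result back to the host. For completeness, a readback of out_data (using the same queue q, placed before the clReleaseMemObject calls) could look like this:
// Read the convolution result back into out_data_buf (blocking read).
errcode = clEnqueueReadBuffer(q, out_data, CL_TRUE, 0, out_data_size * sizeof(float), out_data_buf.data(), 0, nullptr, nullptr);
if (errcode != CL_SUCCESS) {
std::cerr << "out_data clEnqueueReadBuffer failed. " << errcode << "\n";
return EXIT_FAILURE;
}
//for (size_t i = 0; i < out_data_size; i++) {
//  std::cout << "out[" << i << "] = " << out_data_buf[i] << "\n";
//}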
For reference, here is the log from a run with MIOPEN_ENABLE_LOGGING=1
MIOPEN_LOG_LEVEL=6
set.
MIOPEN_VERSION_MAJOR:2
MIOPEN_VERSION_MINOR:0
MIOPEN_VERSION_PATCH:1
MIOpen(OpenCL): Info [Handle] stream: 0x55c42a95b300, device_id: 0x55c42ab344e0
MIOpen(OpenCL): miopenStatus_t miopenCreateTensorDescriptor(miopenTensorDescriptor**){
MIOpen(OpenCL): tensorDesc = 0x300000000
MIOpen(OpenCL): }
MIOpen(OpenCL): miopenStatus_t miopenSet4dTensorDescriptor(miopenTensorDescriptor_t, miopenDataType_t, int, int, int, int){
MIOpen(OpenCL): tensorDesc =
MIOpen(OpenCL): dataType = 1
MIOpen(OpenCL): n = 1
MIOpen(OpenCL): c = 1
MIOpen(OpenCL): h = 5
MIOpen(OpenCL): w = 5
MIOpen(OpenCL): }
MIOpen(OpenCL): miopenStatus_t miopenCreateTensorDescriptor(miopenTensorDescriptor**){
MIOpen(OpenCL): tensorDesc = 0x7
MIOpen(OpenCL): }
MIOpen(OpenCL): miopenStatus_t miopenSet4dTensorDescriptor(miopenTensorDescriptor_t, miopenDataType_t, int, int, int, int){
MIOpen(OpenCL): tensorDesc =
MIOpen(OpenCL): dataType = 1
MIOpen(OpenCL): n = 1
MIOpen(OpenCL): c = 1
MIOpen(OpenCL): h = 2
MIOpen(OpenCL): w = 2
MIOpen(OpenCL): }
MIOpen(OpenCL): miopenStatus_t miopenCreateConvolutionDescriptor(miopenConvolutionDescriptor**){
MIOpen(OpenCL): convDesc = 0x30
MIOpen(OpenCL): }
MIOpen(OpenCL): miopenStatus_t miopenInitConvolutionDescriptor(miopenConvolutionDescriptor_t, miopenConvolutionMode_t, int, int, int, int, int, int){
MIOpen(OpenCL): convDesc = conv2d, miopenConvolution, miopenPaddingDefault, {0, 0}, {1, 1}, {1, 1},
MIOpen(OpenCL): c_mode = 0
MIOpen(OpenCL): pad_h = 1
MIOpen(OpenCL): pad_w = 1
MIOpen(OpenCL): stride_h = 1
MIOpen(OpenCL): stride_w = 1
MIOpen(OpenCL): dilation_h = 1
MIOpen(OpenCL): dilation_w = 1
MIOpen(OpenCL): }
MIOpen(OpenCL): miopenStatus_t miopenGetConvolutionForwardOutputDim(miopenConvolutionDescriptor_t, miopenTensorDescriptor_t, miopenTensorDescriptor_t, int*, int*, int*, int*){
MIOpen(OpenCL): convDesc = conv2d, miopenConvolution, miopenPaddingDefault, {1, 1}, {1, 1}, {1, 1},
MIOpen(OpenCL): inputTensorDesc = 1, 1, 5, 5
MIOpen(OpenCL): filterDesc = 1, 1, 2, 2
MIOpen(OpenCL): n = -1836566048
MIOpen(OpenCL): c = 32720
MIOpen(OpenCL): h = -1843288951
MIOpen(OpenCL): w = 32720
MIOpen(OpenCL): }
MIOpen(OpenCL): miopenStatus_t miopenCreateTensorDescriptor(miopenTensorDescriptor**){
MIOpen(OpenCL): tensorDesc = 0x55c42a4bb018
MIOpen(OpenCL): }
MIOpen(OpenCL): miopenStatus_t miopenSet4dTensorDescriptor(miopenTensorDescriptor_t, miopenDataType_t, int, int, int, int){
MIOpen(OpenCL): tensorDesc =
MIOpen(OpenCL): dataType = 1
MIOpen(OpenCL): n = 1
MIOpen(OpenCL): c = 1
MIOpen(OpenCL): h = 6
MIOpen(OpenCL): w = 6
MIOpen(OpenCL): }
MIOpen(OpenCL): miopenStatus_t miopenConvolutionForwardGetWorkSpaceSize(miopenHandle_t, miopenTensorDescriptor_t, miopenTensorDescriptor_t, miopenConvolutionDescriptor_t, miopenTensorDescriptor_t, size_t*){
MIOpen(OpenCL): handle = stream: 0x55c42a95b300, device_id: 0x55c42ab344e0
MIOpen(OpenCL): wDesc = 1, 1, 2, 2
MIOpen(OpenCL): yDesc = 1, 1, 6, 6
MIOpen(OpenCL): convDesc = conv2d, miopenConvolution, miopenPaddingDefault, {1, 1}, {1, 1}, {1, 1},
MIOpen(OpenCL): workSpaceSize = 0
MIOpen(OpenCL): }
MIOpen(OpenCL): Info [ForwardGetWorkSpaceSize]
MIOpen(OpenCL): Info [DetectAmdRocmMetadataVersion] ROCm MD version AMDHSA_1_0, MIOpen version 2.0.1.7405-rocm-rel-2.7-22-4e39a83
MIOpen(OpenCL): Info2 [ValidateGcnAssemblerImpl] Running: '/opt/rocm/opencl/bin/x86_64/clang --version'
MIOpen(OpenCL): Info2 [ValidateGcnAssemblerImpl] clang version 8.0
MIOpen(OpenCL): Info2 [ValidateGcnAssemblerImpl] Target: amdgcn-unknown-amdhsa
MIOpen(OpenCL): Info2 [SearchForAllSolutions] ConvAsm3x3U: Not applicable
MIOpen(OpenCL): Info2 [SearchForAllSolutions] ConvAsm1x1U: Not applicable
MIOpen(OpenCL): Info2 [SearchForAllSolutions] ConvAsm1x1UV2: Not applicable
MIOpen(OpenCL): Info2 [SearchForAllSolutions] ConvAsm5x10u2v2f1: Not applicable
MIOpen(OpenCL): Info2 [SearchForAllSolutions] ConvAsm7x7c3h224w224k64u2v2p3q3f1: Not applicable
MIOpen(OpenCL): Info2 [SearchForAllSolutions] ConvAsm5x10u2v2b1: Not applicable
MIOpen(OpenCL): Info2 [SearchForAllSolutions] ConvOclDirectFwd11x11: Not applicable
MIOpen(OpenCL): Info2 [SearchForAllSolutions] ConvOclDirectFwdGen: Not applicable
MIOpen(OpenCL): Info2 [SearchForAllSolutions] ConvOclDirectFwd3x3: Not applicable
MIOpen(OpenCL): Info2 [SearchForAllSolutions] ConvOclDirectFwd1x1: Not applicable
MIOpen(OpenCL): Info2 [SearchForAllSolutions] ConvOclDirectFwd: Not applicable
MIOpen(OpenCL): Info2 [SearchForAllSolutions] ConvHipImplicitGemmV4Fwd: Not applicable
MIOpen(OpenCL): Info2 [SearchForAllSolutions] ConvHipImplicitGemmV4_1x1: Not applicable
MIOpen(OpenCL): miopenStatus_t miopenFindConvolutionForwardAlgorithm(miopenHandle_t, miopenTensorDescriptor_t, const void*, miopenTensorDescriptor_t, const void*, miopenConvolutionDescriptor_t, miopenTensorDescriptor_t, void*, int, int*, miopenConvAlgoPerf_t*, void*, size_t, bool){
MIOpen(OpenCL): handle = stream: 0x55c42a95b300, device_id: 0x55c42ab344e0
MIOpen(OpenCL): xDesc = 1, 1, 5, 5
MIOpen(OpenCL): x = 0x55c42a7f3fe0
MIOpen(OpenCL): wDesc = 1, 1, 2, 2
MIOpen(OpenCL): w = 0x55c42a7589e0
MIOpen(OpenCL): convDesc = conv2d, miopenConvolution, miopenPaddingDefault, {1, 1}, {1, 1}, {1, 1},
MIOpen(OpenCL): yDesc = 1, 1, 6, 6
MIOpen(OpenCL): y = 0x55c42a5dc830
MIOpen(OpenCL): requestAlgoCount = 1
MIOpen(OpenCL): returnedAlgoCount = 0
MIOpen(OpenCL): perfResults =
MIOpen(OpenCL): workSpace = 0x55c42a5f8930
MIOpen(OpenCL): workSpaceSize = 576
MIOpen(OpenCL): exhaustiveSearch = 0
MIOpen(OpenCL): }
MIOpen(OpenCL): Info [FindConvFwdAlgorithm] requestAlgoCount = 1, workspace = 576
MIOpen(OpenCL): Info [Measure] Db::Prefetch time: 66.9299 ms
MIOpen(OpenCL): Info2 [FindRecordUnsafe] Looking for key 1-5-5-2x2-1-6-6-1-1x1-1x1-1x1-0-NCHW-FP32-F in file /home/syoyo/.config/miopen//gfx906_60.OpenCL.2_0_1.ufdb.txt
MIOpen(OpenCL): Info2 [FindRecordUnsafe] Key match: 1-5-5-2x2-1-6-6-1-1x1-1x1-1x1-0-NCHW-FP32-F
MIOpen(OpenCL): Info2 [FindRecordUnsafe] Contents found: miopenConvolutionFwdAlgoGEMM:gemm,0.01232,576,MIOpenGEMM,0_0_36_4_36_36_1_4
MIOpen(OpenCL): Info2 [Measure] Db::FindRecord time: 0.109111 ms
MIOpen(OpenCL): Info2 [LogFindDbItem] Kernel cache entry not found for solver <miopenConvolutionFwdAlgoGEMM::gemm> at network config: 1-5-5-2x2-1-6-6-1-1x1-1x1-1x1-0-NCHW-FP32-F and kernel cache key: MIOpenGEMM, 0_0_36_4_36_36_1_4
MIOpen(OpenCL): Info2 [LogFindDbItem] Find-db record content: <miopenConvolutionFwdAlgoGEMM::gemm> at network config: 0_0_36_4_36_36_1_4 and algorithm name: MIOpenGEMM
MIOpen(OpenCL): Info [TryLoad] Find-db regenerating.
MIOpen(OpenCL): void miopen::DirConvFindCore(miopen::Handle&, const miopen::TensorDescriptor&, ConstData_t, const miopen::TensorDescriptor&, ConstData_t, const miopen::TensorDescriptor&, Data_t, Data_t, size_t, const miopen::ConvolutionDescriptor&, bool, miopen::DbRecord&){
MIOpen(OpenCL): "convolution, non 1x1" = convolution, non 1x1
MIOpen(OpenCL): }
MIOpen(OpenCL): Info2 [GetKernels] 0 kernels for key: miopenIm2d2Col "c1i5_5w2_2p1_1s1_1d1_1t1"
MIOpen(OpenCL): Info2 [AddKernel] Key: miopenIm2Col "c1i5_5w2_2p1_1s1_1d1_1t1"
MIOpen(OpenCL): Info2 [AddKernelDumpKernelParams] runcl MIOpenIm2d2Col.cl -k Im2d2Col -dumpilisa -r 10 if#0: if#0: if#0: iv#0 256,1,1/256,1,1 -DNUM_CH_PER_WG=1 -DNUM_IM_BLKS_X=1 -DNUM_IM_BLKS=1 -DLOCAL_MEM_SIZE=297 -DSTRIDE_GT_1=0 -DTILE_SZ_X=32 -DTILE_SZ_Y=8 -DUSE_IM_OFF_GUARD=1 -DMIOPEN_USE_FP16=0 -DMIOPEN_USE_FP32=1 -DMIOPEN_USE_INT8=0 -DMIOPEN_USE_INT8x4=0 -DMIOPEN_USE_BFP16=0 -DMIOPEN_USE_INT32=0 -DMIOPEN_USE_RNE_BFLOAT16=1
MIOpen(OpenCL): Info2 [CallGemm] gemm_desc: {isColMajor 0, transA 0, transB 0, m 1, n 36, k 4, lda 4, ldb 36, ldc 36, batch_count 1, strideA 0, strideB 0, strideC 0, alpha 1, beta 0, dataType 1}
MIOpen(OpenCL): miopenStatus_t miopen::CallGemm(miopen::Handle&, miopen::GemmDescriptor, ConstData_t, int, ConstData_t, int, Data_t, int, miopen::FindDbKCacheKey*, bool, miopen::GemmBackend_t){
MIOpen(OpenCL): "MIOpenGEMM" = MIOpenGEMM
MIOpen(OpenCL): }
MIOpen(OpenCL): Info2 [GetKernels] 0 kernels for key: MIOpenGEMM "0_0_36_4_36_36_1_4"
MIOpen(OpenCL): Info2 [AddKernel] Key: MIOpenGEMM "0_0_36_4_36_36_1_4"
MIOpen(OpenCL): Info2 [AddKernelDumpKernelParams] runcl (source provided by miopengemm) -k miog_betac_alphaab -dumpilisa -r 10 if#0: if#0: if#0: iv#0 36,1,1/1,1,1
MIOpen(OpenCL): Info2 [GetKernels] 1 kernels for key: MIOpenGEMM "0_0_36_4_36_36_1_4"
MIOpen(OpenCL): Info2 [CallGemm] gemm_desc: {isColMajor 0, transA 0, transB 0, m 1, n 36, k 4, lda 4, ldb 36, ldc 36, batch_count 1, strideA 0, strideB 0, strideC 0, alpha 1, beta 0, dataType 1}
MIOpen(OpenCL): miopenStatus_t miopen::CallGemm(miopen::Handle&, miopen::GemmDescriptor, ConstData_t, int, ConstData_t, int, Data_t, int, miopen::FindDbKCacheKey*, bool, miopen::GemmBackend_t){
MIOpen(OpenCL): "MIOpenGEMM" = MIOpenGEMM
MIOpen(OpenCL): }
MIOpen(OpenCL): Info2 [GetKernels] 1 kernels for key: MIOpenGEMM "0_0_36_4_36_36_1_4"
MIOpen(OpenCL): Info [SetValues] 1-5-5-2x2-1-6-6-1-1x1-1x1-1x1-0-NCHW-FP32-F, content inserted: miopenConvolutionFwdAlgoGEMM:gemm,0.01264,576,MIOpenGEMM,0_0_36_4_36_36_1_4
MIOpen(OpenCL): Info2 [SearchForSolution] ConvBinWinograd3x3U: Not applicable
MIOpen(OpenCL): Info2 [SearchForSolution] ConvBinWinogradRxS: Not applicable
MIOpen(OpenCL): Info2 [SearchForAllSolutions] ConvAsm3x3U: Not applicable
MIOpen(OpenCL): Info2 [SearchForAllSolutions] ConvAsm1x1U: Not applicable
MIOpen(OpenCL): Info2 [SearchForAllSolutions] ConvAsm1x1UV2: Not applicable
MIOpen(OpenCL): Info2 [SearchForAllSolutions] ConvAsm5x10u2v2f1: Not applicable
MIOpen(OpenCL): Info2 [SearchForAllSolutions] ConvAsm7x7c3h224w224k64u2v2p3q3f1: Not applicable
MIOpen(OpenCL): Info2 [SearchForAllSolutions] ConvAsm5x10u2v2b1: Not applicable
MIOpen(OpenCL): Info2 [SearchForAllSolutions] ConvOclDirectFwd11x11: Not applicable
MIOpen(OpenCL): Info2 [SearchForAllSolutions] ConvOclDirectFwdGen: Not applicable
MIOpen(OpenCL): Info2 [SearchForAllSolutions] ConvOclDirectFwd3x3: Not applicable
MIOpen(OpenCL): Info2 [SearchForAllSolutions] ConvOclDirectFwd1x1: Not applicable
MIOpen(OpenCL): Info2 [SearchForAllSolutions] ConvOclDirectFwd: Not applicable
MIOpen(OpenCL): Info2 [SearchForAllSolutions] ConvHipImplicitGemmV4Fwd: Not applicable
MIOpen(OpenCL): Info2 [SearchForAllSolutions] ConvHipImplicitGemmV4_1x1: Not applicable
MIOpen(OpenCL): Info2 [StoreRecordUnsafe] Storing record: 1-5-5-2x2-1-6-6-1-1x1-1x1-1x1-0-NCHW-FP32-F
MIOpen(OpenCL): Info2 [FindRecordUnsafe] Looking for key 1-5-5-2x2-1-6-6-1-1x1-1x1-1x1-0-NCHW-FP32-F in file /home/syoyo/.config/miopen//gfx906_60.OpenCL.2_0_1.ufdb.txt
MIOpen(OpenCL): Info2 [FindRecordUnsafe] Key match: 1-5-5-2x2-1-6-6-1-1x1-1x1-1x1-0-NCHW-FP32-F
MIOpen(OpenCL): Info2 [FindRecordUnsafe] Contents found: miopenConvolutionFwdAlgoGEMM:gemm,0.01232,576,MIOpenGEMM,0_0_36_4_36_36_1_4
MIOpen(OpenCL): Info2 [Measure] Db::StoreRecord time: 0.211942 ms
MIOpen(OpenCL): Info [FindConvFwdAlgorithm] miopenConvolutionFwdAlgoGEMM 0.01264 576
MIOpen(OpenCL): Info [FindConvFwdAlgorithm] FW Chosen Algorithm: gemm , 576, 0.01264
MIOpen(OpenCL): miopenStatus_t miopenConvolutionForward(miopenHandle_t, const void*, miopenTensorDescriptor_t, const void*, miopenTensorDescriptor_t, const void*, miopenConvolutionDescriptor_t, miopenConvFwdAlgorithm_t, const void*, miopenTensorDescriptor_t, void*, void*, size_t){
MIOpen(OpenCL): handle = stream: 0x55c42a95b300, device_id: 0x55c42ab344e0
MIOpen(OpenCL): alpha = 0x7ffc0672187c
MIOpen(OpenCL): xDesc = 1, 1, 5, 5
MIOpen(OpenCL): x = 0x55c42a7f3fe0
MIOpen(OpenCL): wDesc = 1, 1, 2, 2
MIOpen(OpenCL): w = 0x55c42a7589e0
MIOpen(OpenCL): convDesc = conv2d, miopenConvolution, miopenPaddingDefault, {1, 1}, {1, 1}, {1, 1},
MIOpen(OpenCL): algo = 0
MIOpen(OpenCL): beta = 0x7ffc06721880
MIOpen(OpenCL): yDesc = 1, 1, 6, 6
MIOpen(OpenCL): y = 0x55c42a5dc830
MIOpen(OpenCL): workSpace = 0x55c42a5f8930
MIOpen(OpenCL): workSpaceSize = 576
MIOpen(OpenCL): }
MIOpen(OpenCL): Info [ConvolutionForward] algo = 0, workspace = 576
MIOpen(OpenCL): void miopen::ConvolutionDescriptor::ConvFwdGemm(miopen::Handle&, const miopen::ConvFwdTensors&, Data_t, std::size_t) const{
MIOpen(OpenCL): "convolution, non 1x1" = convolution, non 1x1
MIOpen(OpenCL): }
MIOpen(OpenCL): Info2 [GetKernels] 0 kernels for key: miopenIm2d2Col "c1i5_5w2_2p1_1s1_1d1_1t1"
MIOpen(OpenCL): Info2 [AddKernel] Key: miopenIm2Col "c1i5_5w2_2p1_1s1_1d1_1t1"
MIOpen(OpenCL): Info2 [CallGemm] gemm_desc: {isColMajor 0, transA 0, transB 0, m 1, n 36, k 4, lda 4, ldb 36, ldc 36, batch_count 1, strideA 0, strideB 0, strideC 0, alpha 1, beta 0, dataType 1}
MIOpen(OpenCL): miopenStatus_t miopen::CallGemm(miopen::Handle&, miopen::GemmDescriptor, ConstData_t, int, ConstData_t, int, Data_t, int, miopen::FindDbKCacheKey*, bool, miopen::GemmBackend_t){
MIOpen(OpenCL): "MIOpenGEMM" = MIOpenGEMM
MIOpen(OpenCL): }
MIOpen(OpenCL): Info2 [GetKernels] 1 kernels for key: MIOpenGEMM "0_0_36_4_36_36_1_4"
MIOpen(OpenCL): miopenStatus_t miopenDestroyConvolutionDescriptor(miopenConvolutionDescriptor_t){
MIOpen(OpenCL): convDesc = conv2d, miopenConvolution, miopenPaddingDefault, {1, 1}, {1, 1}, {1, 1},
MIOpen(OpenCL): }
MIOpen(OpenCL): miopenStatus_t miopenDestroyTensorDescriptor(miopenTensorDescriptor_t){
MIOpen(OpenCL): tensorDesc = 1, 1, 5, 5
MIOpen(OpenCL): }
MIOpen(OpenCL): miopenStatus_t miopenDestroyTensorDescriptor(miopenTensorDescriptor_t){
MIOpen(OpenCL): tensorDesc = 1, 1, 2, 2
MIOpen(OpenCL): }
MIOpen(OpenCL): miopenStatus_t miopenDestroyTensorDescriptor(miopenTensorDescriptor_t){
MIOpen(OpenCL): tensorDesc = 1, 1, 6, 6
MIOpen(OpenCL): }
Other code examples
Once you have confirmed that the sample code works, the ROCm TensorFlow and ROCm PyTorch code bases are good references to move on to.
Seg fault on the first run?
With the sample above, there seem to be cases where it seg faults only on the very first run after booting the OS.
Perhaps something around the kernel cache.
Immediate mode
Immediate mode was introduced in MIOpen 2.0.
Algorithm selection via Find() is slow, so the Find results are recorded in the Find-Db database,
and Immediate mode speeds things up by fetching the algorithm from that database instead.
Note that Immediate mode is only available on VEGA-class GPUs (gfx900 or gfx906), and with the OpenCL backend only fp32 is supported.
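As a rough sketch of how this looks in code (reusing the handle, descriptors, and cl_mem buffers from the example above; the function and struct names, miopenConvolutionForwardGetSolutionCount / GetSolution / CompileSolution / ForwardImmediate and miopenConvSolution_t, are based on my reading of the 2.x headers, so treat the exact signatures as an assumption and check miopen.h):
// Sketch only: Immediate mode forward convolution (MIOpen >= 2.0).
size_t solution_count = 0;
CHECK_MIOPEN(miopenConvolutionForwardGetSolutionCount(handle, filt_desc, in_desc, conv_desc, out_desc, &solution_count));
std::vector<miopenConvSolution_t> solutions(solution_count);
CHECK_MIOPEN(miopenConvolutionForwardGetSolution(handle, filt_desc, in_desc, conv_desc, out_desc, solution_count, &solution_count, solutions.data()));
// solutions[0] is expected to be the fastest entry in the database.
CHECK_MIOPEN(miopenConvolutionForwardCompileSolution(handle, filt_desc, in_desc, conv_desc, out_desc, solutions[0].solution_id));
CHECK_MIOPEN(miopenConvolutionForwardImmediate(handle, filt_desc, filt_data, in_desc, in_data, conv_desc, out_desc, out_data, ws_data, solutions[0].workspace_size, solutions[0].solution_id));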
Debugging
https://qiita.com/syoyo/items/6684a491bef39ff5ed78
https://github.com/ROCmSoftwarePlatform/MIOpen/wiki/MIOpen-Environment-Variables
There are several environment variables. You can also set breakpoints with lldb/gdb (in that case a debug build of MIOpen from source is recommended). With cuDNN, when an inscrutable error shows up you are out of luck because the internals are a black-box binary, whereas MIOpen at least comes with source code, so tracking down misbehavior may be somewhat easier... perhaps.
RNN and LSTM may not work well
I have only used these through ROCm TensorFlow and have not investigated in detail, but be aware that code involving RNN, LSTM, or FFT can be extremely slow (falling back to the CPU?) and/or get stuck in what looks like an infinite loop during algorithm search.
For RNNs, MIOpen supports RNN ReLU, RNN Tanh, LSTM, and GRU.
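For reference, creating an RNN descriptor for, say, an LSTM looks roughly like this (a sketch only; the enum values such as miopenRNNlinear / miopenRNNunidirection / miopenRNNwithBias / miopenRNNdefault are my reading of the 2.x headers, so double-check miopen.h):
// Sketch only: a 2-layer LSTM descriptor with hidden size 128, fp32.
miopenRNNDescriptor_t rnn_desc;
CHECK_MIOPEN(miopenCreateRNNDescriptor(&rnn_desc));
CHECK_MIOPEN(miopenSetRNNDescriptor(rnn_desc,
    /* hidden size */ 128, /* num layers */ 2,
    miopenRNNlinear, miopenRNNunidirection,
    miopenLSTM, miopenRNNwithBias,
    miopenRNNdefault, miopenFloat));
// ... set up the input/hidden tensor descriptors and call
// miopenRNNForwardInference / miopenRNNForwardTraining here ...
CHECK_MIOPEN(miopenDestroyRNNDescriptor(rnn_desc));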
Fusion API
It fuses multiple kernels (conv, activation, etc.) into a single kernel to raise compute efficiency.
It looks like it can give up to roughly a 3x improvement. Probably something for advanced users.
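A rough sketch of fusing Conv + ReLU with the Fusion API follows (reusing the handle, descriptors, and cl_mem buffers from the example above; the function names come from the MIOpen fusion header, but treat the exact signatures as an assumption and consult the Fusion API docs):
// Sketch only: build, compile, and run a Conv + ReLU fusion plan.
miopenFusionPlanDescriptor_t fuse_plan;
CHECK_MIOPEN(miopenCreateFusionPlan(&fuse_plan, miopenVerticalFusion, in_desc));
miopenFusionOpDescriptor_t conv_op, activ_op;
CHECK_MIOPEN(miopenCreateOpConvForward(fuse_plan, &conv_op, conv_desc, filt_desc));
CHECK_MIOPEN(miopenCreateOpActivationForward(fuse_plan, &activ_op, miopenActivationRELU));
// Compilation fails (e.g. miopenStatusUnsupportedOp) if this fusion is not supported.
CHECK_MIOPEN(miopenCompileFusionPlan(handle, fuse_plan));
miopenOperatorArgs_t fusion_args;
CHECK_MIOPEN(miopenCreateOperatorArgs(&fusion_args));
float one = 1.0f, zero = 0.0f;
CHECK_MIOPEN(miopenSetOpArgsConvForward(fusion_args, conv_op, &one, &zero, filt_data));
CHECK_MIOPEN(miopenSetOpArgsActivForward(fusion_args, activ_op, &one, &zero, 0.0, 0.0, 0.0));
CHECK_MIOPEN(miopenExecuteFusionPlan(handle, fuse_plan, in_desc, in_data, out_desc, out_data, fusion_args));
CHECK_MIOPEN(miopenDestroyOperatorArgs(fusion_args));
CHECK_MIOPEN(miopenDestroyFusionPlan(fuse_plan));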
fp16, bfloat16
The MIOpen 2.0.1 headers define the following data types.
Whether bfloat16 can actually be used with the OpenCL backend is unclear to me.
/*! @ingroup tensor
* @enum miopenDataType_t
* MIOpen floating point datatypes. Both 32-bit and 16-bit floats are supported in MIOpen.
*/
typedef enum {
miopenHalf = 0, /*!< 16-bit floating point (Fully supported) */
miopenFloat = 1, /*!< 32-bit floating point (Fully supported) */
miopenInt32 = 2, /*!< 32-bit int point (Partially supported) */
miopenInt8 = 3, /*!< 8-bit int point (Partially supported) */
miopenInt8x4 =
4, /*!< Pack of four 8-bit int points in NCHW_VECT_C format (Partially supported) */
miopenBFloat16 = 5, /*!< 16-bit binary floating point (8-bit exponent, 7-bit fraction)
(Partially supported) */
} miopenDataType_t;
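To run the convolution example in fp16, you basically just pass miopenHalf when setting up the tensor descriptors (and use half-precision host buffers). For instance, for the input tensor from the example above:
// Same shape as before, but declared as fp16 (miopenHalf).
CHECK_MIOPEN(miopenSet4dTensorDescriptor(in_desc, miopenHalf,
    in_shape[0], in_shape[1], in_shape[2], in_shape[3]));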
Using multiple GPUs
miopenCreateWithStream()
and miopenSetStream()
exist, so I imagine you can run on multiple GPUs by explicitly specifying the stream (an OpenCL command queue in the case of the OpenCL backend) through them.
That is, create an OpenCL command queue for each device (GPU) and call miopenCreateWithStream
on each of those queues, as sketched below.
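A minimal sketch of that idea (untested; ctx, device0, and device1 are assumed to be an OpenCL context and two GPU device IDs obtained via the usual clGetPlatformIDs/clGetDeviceIDs calls):
// Sketch only (untested): one MIOpen handle per GPU with the OpenCL backend.
cl_int err = CL_SUCCESS;
cl_command_queue q0 = clCreateCommandQueue(ctx, device0, 0, &err);
cl_command_queue q1 = clCreateCommandQueue(ctx, device1, 0, &err);
miopenHandle_t handle0, handle1;
CHECK_MIOPEN(miopenCreateWithStream(&handle0, q0));
CHECK_MIOPEN(miopenCreateWithStream(&handle1, q1));
// Each handle now issues its work to its own command queue / GPU.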
Using it with RCCL
There is the RCCL library for efficient multi-GPU communication (the ROCm counterpart of NCCL).
Combining it with MIOpen should, I imagine, allow more efficient multi-GPU operation, but RCCL is HIP-only, so you would first have to use the HIP backend build of MIOpen as well.
TODO
- Try the HIP build of MIOpen.
- Compare numerical accuracy and speed against cuDNN.
- Try fp16/bfloat16, the Fusion API, and so on.
- Verify MIOpen operation on multiple GPUs.
- Extract the OpenCL kernel code from MIOpen into my own machine-learning library and get it running on Vulkan via clspv.
- Set out on a journey to establish a scheme by which brilliant cuDNN youngsters are elevated, at the fastest pace in human history, into brilliant MIOpen youngsters.
- Hoping for "I implemented it with MIOpen" YouTubers to show up soon.