目的

OpenMP 4以降からOpenMPがGPUに対応したが、誰でも気軽に利用できる環境（※）を実際に用意する方法についての情報は潤沢ではない。今回、実際に動かすことに成功したため、情報を共有しておく。（cmakeのオプションの精査などがまだ甘いと思うが、その辺はご容赦を。）

※IBMやCrayのコンパイラを用意しろと言われてもそう簡単には用意できないわけで、GNUやLLVMのようなコンパイラによる環境が望ましい、というわけである

主な参考資料

LLVMのオンラインドキュメント
- https://clang.llvm.org/get_started.html とか
- https://llvm.org/docs/GettingStarted.html とか
- https://clang.llvm.org/docs/OpenMPSupport.html llvmのOpenMPサポート状況はこんならしいよ
NVIDIAの資料
- http://on-demand.gputechconf.com/gtc/2016/presentation/s6510-jeff-larkin-targeting-gpus-openmp.pdf やや古いが、環境変数設定の参考などに
hpcwiki
- https://hpc-wiki.info/hpc/Building_LLVM/Clang_with_OpenMP_Offloading_to_NVIDIA_GPUs 終わってみたら、最終的にはこのページだけ見ればいいんじゃないかな？という気分になった……

環境

HW: Intel Xeon + Tesla V100
OS: RHEL 7.8
LLVM: commit 1956a8a7cb79e94dbe073e36eba2d6b003f91046
- git cloneしたら落ちてきた、Date: Sun Apr 26 21:28:32 2020 -0700 のもの
ビルド環境: cmake 3.15.0, devtoolset-7, CUDA 10.1
- https://llvm.org/docs/GettingStarted.html#id7 をみて十分新しいcmakeとdevtoolsetを導入。
- devtoolsetは既に8や9があるが、新しすぎるとCUDAの対応するgccのバージョンの問題が起きたため7を利用。
- CUDAは既に10.2があるが、現時点ではLLVM側が10.1までしか認識していないようで、10.2を入れると実行時にCUDAのバージョンが変だと警告が出た。（動いたので多分大丈夫だが、念のため。）

手順1

git cloneでllvmを入手し、ビルド場所を確保する。

$ git clone https://github.com/llvm/llvm-project.git
$ cd llvm-project
$ mkdir build-debug
$ cd build-debug

cmake、make、make installする。（makeにはかなり時間がかかる。）

$ cmake -DCMAKE_BUILD_TYPE=Debug -DLLVM_TARGETS_TO_BUILD="X86;NVPTX" -DCMAKE_INSTALL_PREFIX="/path/to/llvm-nvptx-debug" -DLLVM_ENABLE_ASSERTIONS=ON -DLLVM_ENABLE_BACKTRACES=ON -DLLVM_ENABLE_WERROR=OFF -DBUILD_SHARED_LIBS=OFF -DLLVM_ENABLE_RTTI=ON -G "Unix Makefiles" -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra;libcxx;libcxxabi;lld;openmp" -DCLANG_OPENMP_NVPTX_DEFAULT_ARCH=sm_70 -DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=70 ../llvm
$ make -j 4
$ make install

ターゲットGPUがV100なので-DCLANG_OPENMP_NVPTX_DEFAULT_ARCH=sm_70 -DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=70を指定したが、このあたりは実際に使うGPUにあわせて選択しよう。
これで/path/to/llvm-nvptx-debug/bin/clangを叩けばコンパイルできる。
試しに行列積を実行させてみる。

matmul1.c

// 行列の確保や初期化は省略、n*nの配列を3つ用意しただけ
#pragma omp target map(tofrom:a[:n*n]) map(tofrom:b[:n*n]) map(tofrom:c[:n*n])
  {
#pragma omp parallel for
     for(int i=0; i<n; i++){
      for(int j=0; j<n; j++){
        for(int k=0; k<n; k++){
          c[i*n+j] += a[i*n+k] * b[k*n+j];
        }
      }
    }
  }

コンパイルして実行させる。環境変数は、コンパイルだけならPATHだけで良いが、コンパイルしたものを実行する際にはLD_LIBRARY_PATHも必要（指定しないとlibomp.soがなくて躓く）。

$ export PATH=/path/to/llvm-nvptx-debug/bin:${PATH}
$ export LD_LIBRARY_PATH=/path/to/llvm-nvptx-debug/lib:${LD_LIBRARY_PATH}
$ clang -fopenmp -fopenmp-targets=nvptx64 matmul1.c
clang-11: warning: No library 'libomptarget-nvptx-sm_70.bc' found in the default clang lib directory or in LIBRARY_PATH. Expect degraded performance due to no inlining of runtime functions on target devices. [-Wopenmp-target]
$ ./a.out
（省略）

なにやらwarningは出るのだが、ちゃんと実行はされる。CPUで動いていたりしない？という点については、環境変数LIBOMPTARGET_DEBUGを宣言して実行すればCPU-GPU間の通信やらGPUカーネルの起動やらの情報がたくさんでてくるのですぐにわかる。もちろん、-fopemp-targetsオプションを外せばこのようなGPU駆動に関するメッセージは一切出なくなる、つまり、従来通りのCPU上での並列計算となる。
実際に出力されたメッセージの例は以下の通り。（ppc64やaarch64のファイルを読みにいって躓いているのが気になるが、何かコンパイル時にオプションを与えておけばこれもなくせるのだろうか？）
CPU-GPU間のデータ転送サイズなどもしっかり表示されているため、map指示節の挙動の確認にも使えそうだ。

$ ./a.out
Libomptarget --> Loading RTLs...
Libomptarget --> Loading library 'libomptarget.rtl.ppc64.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.ppc64.so': libomptarget.rtl.ppc64.so: cannot open shared object file: No such file or directory!
Libomptarget --> Loading library 'libomptarget.rtl.x86_64.so'...
Libomptarget --> Successfully loaded library 'libomptarget.rtl.x86_64.so'!
Libomptarget --> Registering RTL libomptarget.rtl.x86_64.so supporting 4 devices!
Libomptarget --> Loading library 'libomptarget.rtl.cuda.so'...
Target CUDA RTL --> Start initializing CUDA
Libomptarget --> Successfully loaded library 'libomptarget.rtl.cuda.so'!
Libomptarget --> Registering RTL libomptarget.rtl.cuda.so supporting 1 devices!
Libomptarget --> Loading library 'libomptarget.rtl.aarch64.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.aarch64.so': libomptarget.rtl.aarch64.so: cannot open shared object file: No such file or directory!
Libomptarget --> RTLs loaded!
Libomptarget --> Image 0x00000000004020a0 is NOT compatible with RTL libomptarget.rtl.x86_64.so!
Libomptarget --> Image 0x00000000004020a0 is compatible with RTL libomptarget.rtl.cuda.so!
Libomptarget --> RTL 0x0000000000430600 has index 0!
Libomptarget --> Registering image 0x00000000004020a0 with RTL libomptarget.rtl.cuda.so!
Libomptarget --> Done registering entries!
n = 10 # これはプログラム中で問題サイズをprintfしたもの
Libomptarget --> Call to omp_get_num_devices returning 1
Libomptarget --> Default TARGET OFFLOAD policy is now mandatory (devices were found)
Libomptarget --> Entering target region with entry point 0x0000000000402046 and device Id -1
Libomptarget --> Checking whether device 0 is ready.
Libomptarget --> Is the device 0 (local ID 0) initialized? 0
Target CUDA RTL --> Init requires flags to 1
Target CUDA RTL --> Getting device 0
Target CUDA RTL --> Max CUDA blocks per grid 2147483647 exceeds the hard team limit 65536, capping at the hard limit
Target CUDA RTL --> Using 1024 CUDA threads per block
Target CUDA RTL --> Using warp size 32
Target CUDA RTL --> Max number of CUDA blocks 65536, threads 1024 & warp size 32
Target CUDA RTL --> Default number of teams set according to library's default 128
Target CUDA RTL --> Default number of threads set according to library's default 128
Libomptarget --> Device 0 is ready to use.
Target CUDA RTL --> Load data from image 0x00000000004020a0
Target CUDA RTL --> CUDA module successfully loaded!
Target CUDA RTL --> Entry point 0x0000000000000000 maps to __omp_offloading_28_4cb6343_main_l35 (0x0000000000d49900)
Target CUDA RTL --> Sending global device environment data 4 bytes
Libomptarget --> Entry  0: Base=0x00007fff0000000a, Begin=0x00007fff0000000a, Size=4, Type=0x320
Libomptarget --> Entry  1: Base=0x00000000004a7f20, Begin=0x00000000004a7f20, Size=800, Type=0x23
Libomptarget --> Entry  2: Base=0x00000000004a78c0, Begin=0x00000000004a78c0, Size=800, Type=0x23
Libomptarget --> Entry  3: Base=0x00000000004a7bf0, Begin=0x00000000004a7bf0, Size=800, Type=0x23
Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000004a7f20, Size=800)...
Libomptarget --> Creating new map entry: HstBase=0x00000000004a7f20, HstBegin=0x00000000004a7f20, HstEnd=0x00000000004a8240, TgtBegin=0x00007fffbe600000
Libomptarget --> There are 800 bytes allocated at target address 0x00007fffbe600000 - is new
Libomptarget --> Moving 800 bytes (hst:0x00000000004a7f20) -> (tgt:0x00007fffbe600000)
Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000004a78c0, Size=800)...
Libomptarget --> Creating new map entry: HstBase=0x00000000004a78c0, HstBegin=0x00000000004a78c0, HstEnd=0x00000000004a7be0, TgtBegin=0x00007fffbe600400
Libomptarget --> There are 800 bytes allocated at target address 0x00007fffbe600400 - is new
Libomptarget --> Moving 800 bytes (hst:0x00000000004a78c0) -> (tgt:0x00007fffbe600400)
Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000004a7bf0, Size=800)...
Libomptarget --> Creating new map entry: HstBase=0x00000000004a7bf0, HstBegin=0x00000000004a7bf0, HstEnd=0x00000000004a7f10, TgtBegin=0x00007fffbe600800
Libomptarget --> There are 800 bytes allocated at target address 0x00007fffbe600800 - is new
Libomptarget --> Moving 800 bytes (hst:0x00000000004a7bf0) -> (tgt:0x00007fffbe600800)
Libomptarget --> Forwarding first-private value 0x00007fff0000000a to the target construct
Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000004a7f20, Size=800)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00000000004a7f20, TgtPtrBegin=0x00007fffbe600000, Size=800, RefCount=1
Libomptarget --> Obtained target argument 0x00007fffbe600000 from host pointer 0x00000000004a7f20
Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000004a78c0, Size=800)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00000000004a78c0, TgtPtrBegin=0x00007fffbe600400, Size=800, RefCount=1
Libomptarget --> Obtained target argument 0x00007fffbe600400 from host pointer 0x00000000004a78c0
Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000004a7bf0, Size=800)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00000000004a7bf0, TgtPtrBegin=0x00007fffbe600800, Size=800, RefCount=1
Libomptarget --> Obtained target argument 0x00007fffbe600800 from host pointer 0x00000000004a7bf0
Libomptarget --> Launching target execution __omp_offloading_28_4cb6343_main_l35 with pointer 0x0000000000d3eb60 (index=0).
Target CUDA RTL --> Setting CUDA threads per block to default 128
Target CUDA RTL --> Using requested number of teams 1
Target CUDA RTL --> Launch kernel with 1 blocks and 128 threads
Target CUDA RTL --> Launch of entry point at 0x0000000000d3eb60 successful!
Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000004a7bf0, Size=800)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00000000004a7bf0, TgtPtrBegin=0x00007fffbe600800, Size=800, updated RefCount=1
Libomptarget --> There are 800 bytes allocated at target address 0x00007fffbe600800 - is last
Libomptarget --> Moving 800 bytes (tgt:0x00007fffbe600800) -> (hst:0x00000000004a7bf0)
Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000004a7bf0, Size=800)...
Libomptarget --> Deleting tgt data 0x00007fffbe600800 of size 800
Libomptarget --> Removing mapping with HstPtrBegin=0x00000000004a7bf0, TgtPtrBegin=0x00007fffbe600800, Size=800
Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000004a78c0, Size=800)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00000000004a78c0, TgtPtrBegin=0x00007fffbe600400, Size=800, updated RefCount=1
Libomptarget --> There are 800 bytes allocated at target address 0x00007fffbe600400 - is last
Libomptarget --> Moving 800 bytes (tgt:0x00007fffbe600400) -> (hst:0x00000000004a78c0)
Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000004a78c0, Size=800)...
Libomptarget --> Deleting tgt data 0x00007fffbe600400 of size 800
Libomptarget --> Removing mapping with HstPtrBegin=0x00000000004a78c0, TgtPtrBegin=0x00007fffbe600400, Size=800
Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000004a7f20, Size=800)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00000000004a7f20, TgtPtrBegin=0x00007fffbe600000, Size=800, updated RefCount=1
Libomptarget --> There are 800 bytes allocated at target address 0x00007fffbe600000 - is last
Libomptarget --> Moving 800 bytes (tgt:0x00007fffbe600000) -> (hst:0x00000000004a7f20)
Libomptarget --> Looking up mapping(HstPtrBegin=0x00000000004a7f20, Size=800)...
Libomptarget --> Deleting tgt data 0x00007fffbe600000 of size 800
Libomptarget --> Removing mapping with HstPtrBegin=0x00000000004a7f20, TgtPtrBegin=0x00007fffbe600000, Size=800
result: # ここで行列計算結果を出力しているのだが、省略
Libomptarget --> Unloading target library!
Libomptarget --> Image 0x00000000004020a0 is compatible with RTL 0x0000000000430600!
Libomptarget --> Unregistered image 0x00000000004020a0 from RTL 0x0000000000430600!
Libomptarget --> Done unregistering images!
Libomptarget --> Removing translation table for descriptor 0x000000000041ee10
Libomptarget --> Done unregistering library!
Libomptarget --> Deinit target library!

手順2

手順1でコンパイルした際に出ていたwarningを消すことを考える。

$ clang -fopenmp -fopenmp-targets=nvptx64 matmul1.c
clang-11: warning: No library 'libomptarget-nvptx-sm_70.bc' found in the default clang lib directory or in LIBRARY_PATH. Expect degraded performance due to no inlining of runtime functions on target devices. [-Wopenmp-target]

実はこれはllvm（clang, clang++）でllvmをビルドすれば解消する。要するにこんな感じ。

$ cd llvm-project
$ mkdir build-clang
$ cd build-clang
$ cmake -DCMAKE_BUILD_TYPE=Debug -DLLVM_TARGETS_TO_BUILD="X86;NVPTX" -DCMAKE_INSTALL_PREFIX="/path/to/llvm-nvptx-debug-clang" -DLLVM_ENABLE_ASSERTIONS=ON -DLLVM_ENABLE_BACKTRACES=ON -DLLVM_ENABLE_WERROR=OFF -DBUILD_SHARED_LIBS=OFF -DLLVM_ENABLE_RTTI=ON -G "Unix Makefiles" -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra;libcxx;libcxxabi;lld;openmp" -DCLANG_OPENMP_NVPTX_DEFAULT_ARCH=sm_70 -DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=70 -DCMAKE_C_COMPILER=/path/to/llvm-nvptx-debug/bin/clang -DCMAKE_CXX_COMPILER=/path/to/llvm-nvptx-debug/bin/clang++ ../llvm
$ make -j 4
$ make install

CMAKE_C_COMPILERとCMAKE_CXX_COMPILERで先ほど作成したclang, clang++を指定するのがポイント。
これで生成されたものにたいして環境変数PATHとLD_LIBRARY_PATHを指定してコンパイル・実行してやれば良い。

終わりに

これでいつでもどこでも（？）OpenMPでGPUを叩いて遊べるはず。
性能はまだ未調査。

RHEL7.8にllvmをいれてOpenMPでGPUを駆動するまで

目的

主な参考資料

環境

手順1

手順2

終わりに