CUDA-aware MPI_Send/Recv with OpenACC

Posted at 2018-03-07

Introduction

This post takes f_ex01.f from the MPI Examples as a base and turns it into code that does a CUDA-aware MPI_Send/MPI_Recv with OpenACC.
As the references show, !$acc host_data use_device() is all that is needed, but it is hard to verify that the MPI transfers really are CUDA-aware.
Here I conclude that they are, based on the print output and on Open MPI's debug log.
(nvprof might show it graphically; a sketch of how one might try that follows.)
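
For example, each rank could be wrapped in nvprof and the resulting timelines inspected for the kinds of memory copies involved. This is only a sketch I did not run for this post; the mpirun path is assumed to match the compile command below, and -o with %p (process ID) is standard nvprof usage:

Profiling with nvprof (untested sketch)
$ /opt/pgi/linuxpower/2017/mpi/openmpi-1.10.2/bin/mpirun -n 2 \
    nvprof -o send_recv.%p.nvprof ./f90_send_recv.o

Each rank writes its own .nvprof file (%p is replaced by the process ID), which can then be opened in the NVIDIA Visual Profiler (nvvp) to see whether the MPI transfers show up as device-related copies.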

Environment and compilation

Environment

PGI Compiler: 17.10 Community Edition (pgfortran 17.10-0 linuxpower target on Linuxpower)
MPI: OpenMPI 1.10.2 (bundled with the PGI compiler)
Machine: IBM Power System S822LC (2-CPU, 2-GPU model)

Compile command

Note that without -ta=nvidia you do not get CUDA-aware MPI!

Compile
$ /opt/pgi/linuxpower/2017/mpi/openmpi-1.10.2/bin/mpif90 -acc -ta=nvidia:cc60 -Minfo=acc send_recv.F90 -o f90_send_recv.o
hello:
     27, Generating enter data create(buffer)
     31, Generating update device(buffer)
     51, Generating update self(buffer)
     55, Generating exit data delete(buffer)

Running and verifying CUDA-aware MPI

Check by running normally

Right after MPI_Recv, the host copy of buffer is still 0; only after !$acc update host(buffer) does it read 5678. The received data therefore landed in device memory, not host memory.

Run normally and check with print
$ /opt/pgi/linuxpower/2017/mpi/openmpi-1.10.2/bin/mpirun --tag-output -n 2 ./f90_send_recv.o
[1,1]<stdout>: MPI_Recv @            1
[1,0]<stdout>: MPI_Send @            0
[1,0]<stdout>: processor             0  sent          5678
[1,1]<stdout>: before update host: processor             1  got             0
[1,1]<stdout>: after  update host: processor             1  got          5678

Check by running with Open MPI's debug log enabled

From the line below we can (probably) confirm that the buffer is being handled through smcuda (shared memory with CUDA):
[1,1]<stddiag>:[minsky:145555] CUDA: cuMemHostRegister OK on mpool smcuda: address=0x3fffa3be0008, bufsize=134217728

Run with Open MPI's debug log enabled
$ /opt/pgi/linuxpower/2017/mpi/openmpi-1.10.2/bin/mpirun --mca mpi_common_cuda_verbose 20 --tag-output -n 2 ./f90_send_recv.o
[1,1]<stddiag>:[minsky:145555] CUDA: stage_one_init_ref_count is now 1, initializing
[1,1]<stddiag>:[minsky:145555] CUDA: Library successfully opened libcuda.so.1
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuStreamCreate
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuCtxGetCurrent
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuEventCreate
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuEventRecord
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuMemHostRegister_v2
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuMemHostUnregister
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuPointerGetAttribute
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuEventQuery
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuEventDestroy_v2
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuStreamWaitEvent
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuMemcpyAsync
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuMemcpy
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuMemFree_v2
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuMemAlloc_v2
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuMemGetAddressRange_v2
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuIpcGetEventHandle
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuIpcOpenEventHandle
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuIpcOpenMemHandle
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuIpcCloseMemHandle
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuIpcGetMemHandle
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuCtxGetDevice
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuDeviceCanAccessPeer
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuDeviceGet
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuPointerSetAttribute
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuCtxSetCurrent
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuEventSynchronize
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuStreamSynchronize
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuStreamDestroy_v2
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuPointerGetAttributes
[1,1]<stddiag>:[minsky:145555] CUDA: stage_one_init_ref_count is now 2, no need to init
[1,1]<stddiag>:[minsky:145555] CUDA: stage_one_init_ref_count is now 3, no need to init
[1,1]<stddiag>:[minsky:145555] CUDA: stage_one_init_ref_count is now 4, no need to init
[1,1]<stddiag>:[minsky:145555] CUDA: mca_common_cuda_fini, never completed initialization so skipping fini, ref_count is now 3
[1,0]<stddiag>:[minsky:145554] CUDA: stage_one_init_ref_count is now 1, initializing
[1,0]<stddiag>:[minsky:145554] CUDA: Library successfully opened libcuda.so.1
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuStreamCreate
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuCtxGetCurrent
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuEventCreate
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuEventRecord
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuMemHostRegister_v2
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuMemHostUnregister
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuPointerGetAttribute
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuEventQuery
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuEventDestroy_v2
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuStreamWaitEvent
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuMemcpyAsync
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuMemcpy
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuMemFree_v2
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuMemAlloc_v2
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuMemGetAddressRange_v2
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuIpcGetEventHandle
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuIpcOpenEventHandle
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuIpcOpenMemHandle
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuIpcCloseMemHandle
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuIpcGetMemHandle
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuCtxGetDevice
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuDeviceCanAccessPeer
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuDeviceGet
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuPointerSetAttribute
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuCtxSetCurrent
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuEventSynchronize
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuStreamSynchronize
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuStreamDestroy_v2
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuPointerGetAttributes
[1,0]<stddiag>:[minsky:145554] CUDA: stage_one_init_ref_count is now 2, no need to init
[1,0]<stddiag>:[minsky:145554] CUDA: stage_one_init_ref_count is now 3, no need to init
[1,0]<stddiag>:[minsky:145554] CUDA: stage_one_init_ref_count is now 4, no need to init
[1,0]<stddiag>:[minsky:145554] CUDA: mca_common_cuda_fini, never completed initialization so skipping fini, ref_count is now 3
[1,1]<stdout>: MPI_Recv @            1
[1,1]<stddiag>:[minsky:145555] CUDA: entering stage three init
[1,1]<stddiag>:[minsky:145555] CUDA: cuCtxGetCurrent succeeded
[1,0]<stdout>: MPI_Send @            0
[1,0]<stdout>: processor             0  sent          5678
[1,1]<stddiag>:[minsky:145555] CUDA: cuMemHostRegister OK on mpool smcuda: address=0x3fffa3be0008, bufsize=134217728
[1,1]<stddiag>:[minsky:145555] CUDA: cuMemHostRegister OK on test region
[1,1]<stdout>: before update host: processor             1  got             0
[1,1]<stdout>: after  update host: processor             1  got          5678
[1,0]<stddiag>:[minsky:145554] CUDA: mca_common_cuda_fini, never completed initialization so skipping fini, ref_count is now 2
[1,0]<stddiag>:[minsky:145554] CUDA: mca_common_cuda_fini, never completed initialization so skipping fini, ref_count is now 1
[1,1]<stddiag>:[minsky:145555] CUDA: mca_common_cuda_fini, ref_count=3, cuda still in use
[1,1]<stddiag>:[minsky:145555] CUDA: mca_common_cuda_fini, ref_count=2, cuda still in use
[1,0]<stddiag>:[minsky:145554] CUDA: mca_common_cuda_fini, never completed initialization so skipping fini, ref_count is now 0
[1,1]<stddiag>:[minsky:145555] CUDA: mca_common_cuda_fini, ref_count=1, cleaning up started
[1,1]<stddiag>:[minsky:145555] CUDA: mca_common_cuda_fini, cuMemHostUnregister returned 0, ctx_ok=1
[1,1]<stddiag>:[minsky:145555] CUDA: mca_common_cuda_fini, ref_count=1, cleaning up all done
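
As a complementary check that is independent of the run-time logs above, ompi_info can report whether this Open MPI build was compiled with CUDA support at all. This only confirms the capability, not that a given transfer actually took the CUDA-aware path; the parameter name is the one documented in the Open MPI FAQ, and the ompi_info path is assumed to sit next to mpif90:

Check the Open MPI build (complementary, untested here)
$ /opt/pgi/linuxpower/2017/mpi/openmpi-1.10.2/bin/ompi_info --parsable --all | grep mpi_built_with_cuda_support:value

A CUDA-enabled build should report mca:mpi:base:param:mpi_built_with_cuda_support:value:true.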

Full code

f_ex01.f with OpenACC directives added so it runs on the GPU

send_recv.F90
module fmpi
  !DEC$ NOFREEFORM
  use mpi
  !DEC$ FREEFORM
end module fmpi
!****************************************************************
!  This is a simple send/receive program in MPI
!  Processor 0 sends an integer to processor 1,
!  while processor 1 receives the integer from proc. 0
!****************************************************************
program hello
  use fmpi
  implicit none
  integer myid, ierr,numprocs
  integer tag,source,destination,count
  integer buffer
  integer status(MPI_STATUS_SIZE)

  call MPI_INIT( ierr )
  call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
  call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )

  tag         = 1234
  source      = 0
  destination = 1
  count       = 1
  buffer      = 0
  !$acc enter data create(buffer)  

  if(myid .eq. source)then
     buffer = 5678
     !$acc update device(buffer)

     print *, "MPI_Send @", myid
     !$acc host_data use_device(buffer)
     call MPI_Send(buffer, count, MPI_INTEGER,destination,&
          tag, MPI_COMM_WORLD, ierr)
     !$acc end host_data

     write(*,*)"processor ",myid," sent ",buffer
  endif


  if(myid .eq. destination)then
     print *, "MPI_Recv @", myid
     !$acc host_data use_device(buffer)
     call MPI_Recv(buffer, count, MPI_INTEGER,source,&
          tag, MPI_COMM_WORLD, status,ierr)
     !$acc end host_data

     write(*,*)"before update host: processor ",myid," got ",buffer
     !$acc update host(buffer)     
     write(*,*)"after  update host: processor ",myid," got ",buffer
  endif

  !$acc exit data delete(buffer)  
  call MPI_FINALIZE(ierr)

end program hello
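
One thing the code above leaves to the runtime is GPU selection: with two ranks on the 2-GPU S822LC, both ranks may end up on the default device. Below is a minimal, untested sketch of pinning one rank per GPU; it assumes the 0-based device numbering of the PGI/NVIDIA OpenACC runtime and is not part of the original example. In send_recv.F90 the same two lines (the acc_get_num_devices query and the acc_set_device_num call) would go right after MPI_COMM_RANK, before !$acc enter data create(buffer).

gpu_per_rank.F90 (untested sketch)
program gpu_per_rank
  use mpi
  use openacc
  implicit none
  integer :: myid, ierr, ngpus

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)

  ! Count the NVIDIA GPUs visible to this process and pick one per rank.
  ! Device numbers are assumed to be 0-based (PGI/NVIDIA OpenACC runtime).
  ngpus = acc_get_num_devices(acc_device_nvidia)
  if (ngpus > 0) then
     call acc_set_device_num(mod(myid, ngpus), acc_device_nvidia)
     write(*,*) "rank ", myid, " uses GPU ", mod(myid, ngpus), " of ", ngpus
  end if

  call MPI_FINALIZE(ierr)
end program gpu_per_rank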

References

  1. 第91回 お試しアカウント付き並列プログラミング講習会「OpenACCとMPIによるマルチGPUプログラミング入門」, p. 24
  2. S. Chandrasekaran and G. Juckeland, OpenACC for Programmers: Concepts and Strategies, Pearson, p. 211