Introduction
This post takes f_ex01.f from the MPI Examples as a base and turns it into code that performs CUDA-aware MPI_Send/MPI_Recv with OpenACC.
As described in the references, this can be done with !$acc host_data use_device(),
but it is hard to confirm whether the MPI transfer is actually CUDA-aware.
Here I concluded that it works, based on print output and OpenMPI's debug log.
(nvprof might show it graphically.)
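To show the idea up front, the send side of the program (listed in full at the end of this post) boils down to the following excerpt; the host_data use_device region makes MPI_Send see the device address of buffer instead of the host copy:

!$acc update device(buffer)
!$acc host_data use_device(buffer)
call MPI_Send(buffer, count, MPI_INTEGER, destination, &
     tag, MPI_COMM_WORLD, ierr)
!$acc end host_data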
Environment and compilation
Environment
PGI Compiler: 17.10 Community Edition (pgfortran 17.10-0 linuxpower target on Linuxpower)
MPI: OpenMPI 1.10.2 (the build bundled with the PGI compiler)
Machine: IBM Power System S822LC (2-CPU, 2-GPU model)
Compile command
You must pass -ta=nvidia; without it you do not get CUDA-aware MPI!
$ /opt/pgi/linuxpower/2017/mpi/openmpi-1.10.2/bin/mpif90 -acc -ta=nvidia:cc60 -Minfo=acc send_recv.F90 -o f90_send_recv.o
hello:
27, Generating enter data create(buffer)
31, Generating update device(buffer)
51, Generating update self(buffer)
55, Generating exit data delete(buffer)
Running and verifying CUDA-aware MPI
Checking with a normal run
On rank 1, the value of buffer printed before the !$acc update host(buffer) is still 0, and after it the value is 5678. In other words, MPI_Recv wrote the received data directly into the device copy of buffer, not the host copy.
$ /opt/pgi/linuxpower/2017/mpi/openmpi-1.10.2/bin/mpirun --tag-output -n 2 ./f90_send_recv.o
[1,1]<stdout>: MPI_Recv @ 1
[1,0]<stdout>: MPI_Send @ 0
[1,0]<stdout>: processor 0 sent 5678
[1,1]<stdout>: before update host: processor 1 got 0
[1,1]<stdout>: after update host: processor 1 got 5678
Checking by running with OpenMPI's debug log enabled
From the line below we can (probably) confirm that the device memory is being handled through smcuda (shared memory with CUDA):
[1,1]<stddiag>:[minsky:145555] CUDA: cuMemHostRegister OK on mpool smcuda: address=0x3fffa3be0008, bufsize=134217728
$ /opt/pgi/linuxpower/2017/mpi/openmpi-1.10.2/bin/mpirun --mca mpi_common_cuda_verbose 20 --tag-output -n 2 ./f90_send_recv.o
[1,1]<stddiag>:[minsky:145555] CUDA: stage_one_init_ref_count is now 1, initializing
[1,1]<stddiag>:[minsky:145555] CUDA: Library successfully opened libcuda.so.1
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuStreamCreate
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuCtxGetCurrent
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuEventCreate
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuEventRecord
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuMemHostRegister_v2
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuMemHostUnregister
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuPointerGetAttribute
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuEventQuery
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuEventDestroy_v2
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuStreamWaitEvent
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuMemcpyAsync
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuMemcpy
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuMemFree_v2
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuMemAlloc_v2
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuMemGetAddressRange_v2
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuIpcGetEventHandle
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuIpcOpenEventHandle
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuIpcOpenMemHandle
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuIpcCloseMemHandle
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuIpcGetMemHandle
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuCtxGetDevice
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuDeviceCanAccessPeer
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuDeviceGet
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuPointerSetAttribute
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuCtxSetCurrent
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuEventSynchronize
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuStreamSynchronize
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuStreamDestroy_v2
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuPointerGetAttributes
[1,1]<stddiag>:[minsky:145555] CUDA: stage_one_init_ref_count is now 2, no need to init
[1,1]<stddiag>:[minsky:145555] CUDA: stage_one_init_ref_count is now 3, no need to init
[1,1]<stddiag>:[minsky:145555] CUDA: stage_one_init_ref_count is now 4, no need to init
[1,1]<stddiag>:[minsky:145555] CUDA: mca_common_cuda_fini, never completed initialization so skipping fini, ref_count is now 3
[1,0]<stddiag>:[minsky:145554] CUDA: stage_one_init_ref_count is now 1, initializing
[1,0]<stddiag>:[minsky:145554] CUDA: Library successfully opened libcuda.so.1
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuStreamCreate
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuCtxGetCurrent
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuEventCreate
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuEventRecord
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuMemHostRegister_v2
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuMemHostUnregister
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuPointerGetAttribute
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuEventQuery
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuEventDestroy_v2
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuStreamWaitEvent
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuMemcpyAsync
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuMemcpy
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuMemFree_v2
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuMemAlloc_v2
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuMemGetAddressRange_v2
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuIpcGetEventHandle
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuIpcOpenEventHandle
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuIpcOpenMemHandle
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuIpcCloseMemHandle
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuIpcGetMemHandle
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuCtxGetDevice
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuDeviceCanAccessPeer
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuDeviceGet
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuPointerSetAttribute
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuCtxSetCurrent
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuEventSynchronize
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuStreamSynchronize
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuStreamDestroy_v2
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuPointerGetAttributes
[1,0]<stddiag>:[minsky:145554] CUDA: stage_one_init_ref_count is now 2, no need to init
[1,0]<stddiag>:[minsky:145554] CUDA: stage_one_init_ref_count is now 3, no need to init
[1,0]<stddiag>:[minsky:145554] CUDA: stage_one_init_ref_count is now 4, no need to init
[1,0]<stddiag>:[minsky:145554] CUDA: mca_common_cuda_fini, never completed initialization so skipping fini, ref_count is now 3
[1,1]<stdout>: MPI_Recv @ 1
[1,1]<stddiag>:[minsky:145555] CUDA: entering stage three init
[1,1]<stddiag>:[minsky:145555] CUDA: cuCtxGetCurrent succeeded
[1,0]<stdout>: MPI_Send @ 0
[1,0]<stdout>: processor 0 sent 5678
[1,1]<stddiag>:[minsky:145555] CUDA: cuMemHostRegister OK on mpool smcuda: address=0x3fffa3be0008, bufsize=134217728
[1,1]<stddiag>:[minsky:145555] CUDA: cuMemHostRegister OK on test region
[1,1]<stdout>: before update host: processor 1 got 0
[1,1]<stdout>: after update host: processor 1 got 5678
[1,0]<stddiag>:[minsky:145554] CUDA: mca_common_cuda_fini, never completed initialization so skipping fini, ref_count is now 2
[1,0]<stddiag>:[minsky:145554] CUDA: mca_common_cuda_fini, never completed initialization so skipping fini, ref_count is now 1
[1,1]<stddiag>:[minsky:145555] CUDA: mca_common_cuda_fini, ref_count=3, cuda still in use
[1,1]<stddiag>:[minsky:145555] CUDA: mca_common_cuda_fini, ref_count=2, cuda still in use
[1,0]<stddiag>:[minsky:145554] CUDA: mca_common_cuda_fini, never completed initialization so skipping fini, ref_count is now 0
[1,1]<stddiag>:[minsky:145555] CUDA: mca_common_cuda_fini, ref_count=1, cleaning up started
[1,1]<stddiag>:[minsky:145555] CUDA: mca_common_cuda_fini, cuMemHostUnregister returned 0, ctx_ok=1
[1,1]<stddiag>:[minsky:145555] CUDA: mca_common_cuda_fini, ref_count=1, cleaning up all done
Full code
f_ex01.f with OpenACC directives added so that it runs on the GPU:
module fmpi
!DEC$ NOFREEFORM
  use mpi
!DEC$ FREEFORM
end module fmpi
!****************************************************************
! This is a simple send/receive program in MPI
! Processor 0 sends an integer to processor 1,
! while processor 1 receives the integer from proc. 0
!****************************************************************
program hello
  use fmpi
  implicit none
  integer myid, ierr, numprocs
  integer tag, source, destination, count
  integer buffer
  integer status(MPI_STATUS_SIZE)
  call MPI_INIT( ierr )
  call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
  call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
  tag = 1234
  source = 0
  destination = 1
  count = 1
  buffer = 0
  !$acc enter data create(buffer)
  if(myid .eq. source)then
     buffer = 5678
     !$acc update device(buffer)
     print *, "MPI_Send @", myid
     !$acc host_data use_device(buffer)
     call MPI_Send(buffer, count, MPI_INTEGER, destination, &
          tag, MPI_COMM_WORLD, ierr)
     !$acc end host_data
     write(*,*)"processor ",myid," sent ",buffer
  endif
  if(myid .eq. destination)then
     print *, "MPI_Recv @", myid
     !$acc host_data use_device(buffer)
     call MPI_Recv(buffer, count, MPI_INTEGER, source, &
          tag, MPI_COMM_WORLD, status, ierr)
     !$acc end host_data
     write(*,*)"before update host: processor ",myid," got ",buffer
     !$acc update host(buffer)
     write(*,*)"after update host: processor ",myid," got ",buffer
  endif
  !$acc exit data delete(buffer)
  call MPI_FINALIZE(ierr)
end program hello
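In a real application the buffer is usually an array rather than a single integer. The same host_data use_device pattern carries over directly; the following is a minimal, untested sketch under the same environment as above (the program name, array length n, and fill value are made up for illustration, and real(8) is assumed to map to MPI_DOUBLE_PRECISION):

program send_recv_array
  use mpi
  implicit none
  integer, parameter :: n = 1024                 ! hypothetical buffer length
  real(8) :: a(n)
  integer :: myid, ierr, numprocs, tag, status(MPI_STATUS_SIZE)
  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
  tag = 1234
  a = 0.0d0
  !$acc enter data copyin(a)
  if (myid == 0) then
     !$acc kernels
     a(:) = 3.14d0                               ! fill the device copy
     !$acc end kernels
     ! pass the device address of a to MPI_Send
     !$acc host_data use_device(a)
     call MPI_Send(a, n, MPI_DOUBLE_PRECISION, 1, tag, MPI_COMM_WORLD, ierr)
     !$acc end host_data
  else if (myid == 1) then
     ! receive directly into the device copy of a
     !$acc host_data use_device(a)
     call MPI_Recv(a, n, MPI_DOUBLE_PRECISION, 0, tag, MPI_COMM_WORLD, status, ierr)
     !$acc end host_data
     !$acc update host(a)
     write(*,*) "rank 1 received a(1) =", a(1)
  end if
  !$acc exit data delete(a)
  call MPI_FINALIZE(ierr)
end program send_recv_array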