Introduction
This post takes f_ex01.f from the MPI Examples as a base and turns it into code that performs CUDA-aware MPI_Send/MPI_Recv with OpenACC.
As described in the references, this can be done with !$acc host_data use_device(),
but it is hard to confirm whether the MPI transfer is actually CUDA-aware.
Here I concluded that it works, based on print output and OpenMPI's debug log.
(nvprof might show it graphically.)
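To show the idea up front, the send side of the program (listed in full at the end of this post) boils down to the following excerpt; the host_data use_device region makes MPI_Send see the device address of buffer instead of the host copy:

!$acc update device(buffer)
!$acc host_data use_device(buffer)
call MPI_Send(buffer, count, MPI_INTEGER, destination, &
     tag, MPI_COMM_WORLD, ierr)
!$acc end host_data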
Environment and compilation
Environment
PGI Compiler: 17.10 Community Edition (pgfortran 17.10-0 linuxpower target on Linuxpower)
MPI: OpenMPI 1.10.2 (the build bundled with the PGI compiler)
Machine: IBM Power System S822LC (2-CPU, 2-GPU model)
Compile command
You must pass -ta=nvidia; without it you do not get CUDA-aware MPI!
$ /opt/pgi/linuxpower/2017/mpi/openmpi-1.10.2/bin/mpif90 -acc -ta=nvidia:cc60 -Minfo=acc send_recv.F90 -o f90_send_recv.o
hello:
27, Generating enter data create(buffer)
31, Generating update device(buffer)
51, Generating update self(buffer)
55, Generating exit data delete(buffer)
Running and verifying CUDA-aware MPI
Checking with a normal run
On rank 1, the value of buffer printed before the !$acc update host(buffer) is still 0, and after it the value is 5678. In other words, MPI_Recv wrote the received data directly into the device copy of buffer, not the host copy.
$ /opt/pgi/linuxpower/2017/mpi/openmpi-1.10.2/bin/mpirun --tag-output -n 2 ./f90_send_recv.o
[1,1]<stdout>: MPI_Recv @ 1
[1,0]<stdout>: MPI_Send @ 0
[1,0]<stdout>: processor 0 sent 5678
[1,1]<stdout>: before update host: processor 1 got 0
[1,1]<stdout>: after update host: processor 1 got 5678
Checking by running with OpenMPI's debug log enabled
From the line below we can (probably) confirm that the device memory is being handled through smcuda (shared memory with CUDA):
[1,1]<stddiag>:[minsky:145555] CUDA: cuMemHostRegister OK on mpool smcuda: address=0x3fffa3be0008, bufsize=134217728
$ /opt/pgi/linuxpower/2017/mpi/openmpi-1.10.2/bin/mpirun --mca mpi_common_cuda_verbose 20 --tag-output -n 2 ./f90_send_recv.o
[1,1]<stddiag>:[minsky:145555] CUDA: stage_one_init_ref_count is now 1, initializing
[1,1]<stddiag>:[minsky:145555] CUDA: Library successfully opened libcuda.so.1
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuStreamCreate
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuCtxGetCurrent
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuEventCreate
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuEventRecord
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuMemHostRegister_v2
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuMemHostUnregister
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuPointerGetAttribute
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuEventQuery
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuEventDestroy_v2
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuStreamWaitEvent
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuMemcpyAsync
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuMemcpy
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuMemFree_v2
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuMemAlloc_v2
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuMemGetAddressRange_v2
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuIpcGetEventHandle
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuIpcOpenEventHandle
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuIpcOpenMemHandle
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuIpcCloseMemHandle
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuIpcGetMemHandle
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuCtxGetDevice
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuDeviceCanAccessPeer
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuDeviceGet
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuPointerSetAttribute
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuCtxSetCurrent
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuEventSynchronize
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuStreamSynchronize
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuStreamDestroy_v2
[1,1]<stddiag>:[minsky:145555] CUDA: successful dlsym of cuPointerGetAttributes
[1,1]<stddiag>:[minsky:145555] CUDA: stage_one_init_ref_count is now 2, no need to init
[1,1]<stddiag>:[minsky:145555] CUDA: stage_one_init_ref_count is now 3, no need to init
[1,1]<stddiag>:[minsky:145555] CUDA: stage_one_init_ref_count is now 4, no need to init
[1,1]<stddiag>:[minsky:145555] CUDA: mca_common_cuda_fini, never completed initialization so skipping fini, ref_count is now 3
[1,0]<stddiag>:[minsky:145554] CUDA: stage_one_init_ref_count is now 1, initializing
[1,0]<stddiag>:[minsky:145554] CUDA: Library successfully opened libcuda.so.1
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuStreamCreate
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuCtxGetCurrent
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuEventCreate
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuEventRecord
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuMemHostRegister_v2
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuMemHostUnregister
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuPointerGetAttribute
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuEventQuery
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuEventDestroy_v2
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuStreamWaitEvent
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuMemcpyAsync
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuMemcpy
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuMemFree_v2
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuMemAlloc_v2
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuMemGetAddressRange_v2
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuIpcGetEventHandle
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuIpcOpenEventHandle
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuIpcOpenMemHandle
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuIpcCloseMemHandle
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuIpcGetMemHandle
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuCtxGetDevice
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuDeviceCanAccessPeer
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuDeviceGet
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuPointerSetAttribute
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuCtxSetCurrent
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuEventSynchronize
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuStreamSynchronize
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuStreamDestroy_v2
[1,0]<stddiag>:[minsky:145554] CUDA: successful dlsym of cuPointerGetAttributes
[1,0]<stddiag>:[minsky:145554] CUDA: stage_one_init_ref_count is now 2, no need to init
[1,0]<stddiag>:[minsky:145554] CUDA: stage_one_init_ref_count is now 3, no need to init
[1,0]<stddiag>:[minsky:145554] CUDA: stage_one_init_ref_count is now 4, no need to init
[1,0]<stddiag>:[minsky:145554] CUDA: mca_common_cuda_fini, never completed initialization so skipping fini, ref_count is now 3
[1,1]<stdout>: MPI_Recv @ 1
[1,1]<stddiag>:[minsky:145555] CUDA: entering stage three init
[1,1]<stddiag>:[minsky:145555] CUDA: cuCtxGetCurrent succeeded
[1,0]<stdout>: MPI_Send @ 0
[1,0]<stdout>: processor 0 sent 5678
[1,1]<stddiag>:[minsky:145555] CUDA: cuMemHostRegister OK on mpool smcuda: address=0x3fffa3be0008, bufsize=134217728
[1,1]<stddiag>:[minsky:145555] CUDA: cuMemHostRegister OK on test region
[1,1]<stdout>: before update host: processor 1 got 0
[1,1]<stdout>: after update host: processor 1 got 5678
[1,0]<stddiag>:[minsky:145554] CUDA: mca_common_cuda_fini, never completed initialization so skipping fini, ref_count is now 2
[1,0]<stddiag>:[minsky:145554] CUDA: mca_common_cuda_fini, never completed initialization so skipping fini, ref_count is now 1
[1,1]<stddiag>:[minsky:145555] CUDA: mca_common_cuda_fini, ref_count=3, cuda still in use
[1,1]<stddiag>:[minsky:145555] CUDA: mca_common_cuda_fini, ref_count=2, cuda still in use
[1,0]<stddiag>:[minsky:145554] CUDA: mca_common_cuda_fini, never completed initialization so skipping fini, ref_count is now 0
[1,1]<stddiag>:[minsky:145555] CUDA: mca_common_cuda_fini, ref_count=1, cleaning up started
[1,1]<stddiag>:[minsky:145555] CUDA: mca_common_cuda_fini, cuMemHostUnregister returned 0, ctx_ok=1
[1,1]<stddiag>:[minsky:145555] CUDA: mca_common_cuda_fini, ref_count=1, cleaning up all done
Full code
f_ex01.f with OpenACC directives added so that it runs on the GPU:
module fmpi
!DEC$ NOFREEFORM
  use mpi
!DEC$ FREEFORM
end module fmpi
!****************************************************************
! This is a simple send/receive program in MPI
! Processor 0 sends an integer to processor 1,
! while processor 1 receives the integer from proc. 0
!****************************************************************
program hello
  use fmpi
  implicit none
  integer myid, ierr, numprocs
  integer tag, source, destination, count
  integer buffer
  integer status(MPI_STATUS_SIZE)
  call MPI_INIT( ierr )
  call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
  call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
  tag = 1234
  source = 0
  destination = 1
  count = 1
  buffer = 0
  !$acc enter data create(buffer)
  if(myid .eq. source)then
     buffer = 5678
     !$acc update device(buffer)
     print *, "MPI_Send @", myid
     !$acc host_data use_device(buffer)
     call MPI_Send(buffer, count, MPI_INTEGER, destination, &
          tag, MPI_COMM_WORLD, ierr)
     !$acc end host_data
     write(*,*)"processor ",myid," sent ",buffer
  endif
  if(myid .eq. destination)then
     print *, "MPI_Recv @", myid
     !$acc host_data use_device(buffer)
     call MPI_Recv(buffer, count, MPI_INTEGER, source, &
          tag, MPI_COMM_WORLD, status, ierr)
     !$acc end host_data
     write(*,*)"before update host: processor ",myid," got ",buffer
     !$acc update host(buffer)
     write(*,*)"after update host: processor ",myid," got ",buffer
  endif
  !$acc exit data delete(buffer)
  call MPI_FINALIZE(ierr)
end program hello
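In a real application the buffer is usually an array rather than a single integer. The same host_data use_device pattern carries over directly; the following is a minimal, untested sketch under the same environment as above (the program name, array length n, and fill value are made up for illustration, and real(8) is assumed to map to MPI_DOUBLE_PRECISION):

program send_recv_array
  use mpi
  implicit none
  integer, parameter :: n = 1024                 ! hypothetical buffer length
  real(8) :: a(n)
  integer :: myid, ierr, numprocs, tag, status(MPI_STATUS_SIZE)
  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
  tag = 1234
  a = 0.0d0
  !$acc enter data copyin(a)
  if (myid == 0) then
     !$acc kernels
     a(:) = 3.14d0                               ! fill the device copy
     !$acc end kernels
     ! pass the device address of a to MPI_Send
     !$acc host_data use_device(a)
     call MPI_Send(a, n, MPI_DOUBLE_PRECISION, 1, tag, MPI_COMM_WORLD, ierr)
     !$acc end host_data
  else if (myid == 1) then
     ! receive directly into the device copy of a
     !$acc host_data use_device(a)
     call MPI_Recv(a, n, MPI_DOUBLE_PRECISION, 0, tag, MPI_COMM_WORLD, status, ierr)
     !$acc end host_data
     !$acc update host(a)
     write(*,*) "rank 1 received a(1) =", a(1)
  end if
  !$acc exit data delete(a)
  call MPI_FINALIZE(ierr)
end program send_recv_array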