More than 3 years have passed since last update.

Fortran の多次元配列のメモリ配置と行列積

Fortran

Posted at 2020-10-29

はじめに

定番中の定番ネタ。
話で聞くだけで自分で手を動かしたことがないことに気づいたので、今更ながらやってみた記録。

2 次元配列のメモリ配置を確認する

ソースコード

! main.f03
program array_map
      implicit none

      interface
          function my_objdump(ptr, len) bind(c)
              integer :: my_objdump, ptr
              integer, value :: len
          end function
      end interface

      integer, parameter :: N = 32
      integer, dimension(N,N) :: array
      integer :: i, j, istat

      do i=1,N
        do j=1,N
            array(i,j) = (j-1)*N + i
        end do
      end do

      istat = my_objdump(array(1,1), N*N)
end program

! my_objdump.c
#include <stdio.h>

int my_objdump(unsigned char *attr, int length) {
    int i;

    for (i = 0; i < length; i+=8) {
        printf("%p: %02x%02x%02x%02x %02x%02x%02x%02x\n",
                attr + i,
                attr[i+3], attr[i+2], attr[i+1], attr[i+0],
                attr[i+4+3], attr[i+4+2], attr[i+4+1], attr[i+4+0]);
    }

    printf("%d@%p\n", length, attr);
    return 0;
}

何をしているか、何をしたいか

Fortran でこんな多次元配列を作る ( array(i,j) )。

\	i=1	i=2	i=3	...	i=32
j=1	1	2	3	...	32
j=2	33	34	35	...	64
j=3	65	66	67	...	96
...	...	...	...	...	...

C で書いた my_objdump に &array(1,1) を渡し、メモリ上の値を dump していく(Intel CPU なので、見やすいように 4byte ずつ逆順で出力している)。
メモリ上の順番は

1 -> 2 -> 3 -> ... -> 31 -> 32 -> 33 -> 34 -> ... ( i を先に incr する )
1 -> 33 -> 65 -> ... -> 993 -> 2 -> 34 -> ... ( j を先に incr する )

のどちら?というのが今回のテーマ。

結果

$./a.out  | head
0x404080: 00000001 00000002
0x404088: 00000003 00000004
0x404090: 00000005 00000006
0x404098: 00000007 00000008
0x4040a0: 00000009 0000000a
0x4040a8: 0000000b 0000000c
0x4040b0: 0000000d 0000000e
0x4040b8: 0000000f 00000010
0x4040c0: 00000011 00000012
0x4040c8: 00000013 00000014

i が先に incr されているね。

行列積の計算

NxN 正方行列の掛け算を考える。

行列積とは

こんなん書かれんでもわかるわ!!っていうのは重々承知だけど、書かないと落ち着かない(&考えらない)ので書かせてね。

C = A x B

のとき

C(r,c) = Σ A(r,i)B(i,c) for i in 1 to N

である。

また、今後「掛け算の左側の行列」を「A」、「掛け算の右側の行列」を「B」と記す。

行列積を計算するとき、どういう順番で各行列の要素にアクセスするか

C(r,c) を計算するとき、

A は row が固定で column が incr される
B は column が固定で row が incr される

メモリアクセスの効率化(キャッシュ)

キャッシュを持つ CPU の場合、飛び飛びの番地にアクセスするより、連続した(近しい番地の)番地にアクセスしたほうがパフォーマンスがよくなる(∵メモリアクセス時にキャッシュにガコっとコピーされ、以後はキャッシュアクセスで済むから(死ぬほど雑な説明))。
つまり、

A はメモリ上に A(1,1) -> A(1,2) -> A(1,3) -> ... -> A(1,32) -> A(2,1) -> ... の順番にあるといい
B はメモリ上に B(1,1) -> B(2,1) -> B(3,1) -> ... -> B(32,1) -> B(1,2) -> ... の順番にあるといい

ということになる。

計算性能の比較

ソースコード

program all_cpu_multmat
    use, intrinsic :: iso_fortran_env
    implicit none

    integer, parameter :: N = 256
    integer, parameter :: C = 50

    integer, dimension(N,N) :: matA, matB, matC
    real, dimension(N,N) :: tmpMat

    integer(int32) :: tbegin, tend, cps, cmax

    integer :: i
    real :: sec
    real :: f1

    call init_mats()

    call system_clock(tbegin, cps, cmax)
    do i=1,C
        call mult_mat_1()
    end do
    call system_clock(tend)
    sec = real(tend - tbegin)/cps
    f1 = sec_to_mflops(sec)
    write (*,*) f1

    contains

        ! sec -> mflops
        function sec_to_mflops(sec) result(mf)
            real :: mf, tmp, sec

            tmp = N/1000.0
            tmp = tmp**3

            mf = C*1000.0*(2.0/sec) * tmp

        end function

        ! 初期化
        subroutine init_mats()
            integer :: i, j, counter

            counter = 1
            do i=1,N
                do j=1,N
                    matA(i,j) = counter
                    matB(i,j) = counter + 1
                    counter = counter + 2
                end do
            end do
        end subroutine

        ! row,column x row,column
        subroutine mult_mat_1()
            integer :: row, column, tmp, i

            do row=1,N
                do column=1,N
                    tmp = 0
                    do i=1,N
                        tmp = tmp + matA(row,i)*matB(i,column)
                    end do
                    matC(row,column) = tmp
                end do
            end do
        end subroutine

        ! row,column x column,row
        subroutine mult_mat_2()
            integer :: row, column, tmp, i

            do row=1,N
                do column=1,N
                    tmp = 0
                    do i=1,N
                        tmp = tmp + matA(row,i)*matB(column,i)
                    end do
                    matC(row,column) = tmp
                end do
            end do
        end subroutine

        ! column,row x column,row
        subroutine mult_mat_3()
            integer :: row, column, tmp, i

            do row=1,N
                do column=1,N
                    tmp = 0
                    do i=1,N
                        tmp = tmp + matA(i,row)*matB(column,i)
                    end do
                    matC(row,column) = tmp
                end do
            end do
        end subroutine

        ! column,row x row,column
        subroutine mult_mat_4()
            integer :: row, column, tmp, i

            do row=1,N
                do column=1,N
                    tmp = 0
                    do i=1,N
                        tmp = tmp + matA(i,row)*matB(i,column)
                    end do
                    matC(row,column) = tmp
                end do
            end do
        end subroutine
end program

何をしているか

mult_mat_1 ~ mult_mat_4 にて行列積の計算を実装
matA, matB を行列 A, B を表す配列とするが、行列の表し方を次の 4 通りに変えている
- mult_mat_1: matA(i,j) = A(i,j), matB(i,j) = B(i,j)
- mult_mat_2: matA(i,j) = A(i,j), matB(i,j) = B(j,i)
- mult_mat_3: matA(i,j) = A(j,i), matB(i,j) = B(j,i)
- mult_mat_4: matA(i,j) = A(j,i), matB(i,j) = B(i,j)
256x256 正方行列の掛け算を行い、その時間から MIOPS を算出
- Million Integer Operations Per Sec
- ソースコード上 MFLOPS になってるのは勘弁して...
- 1 要素の計算に 2N 回の整数演算が必要
- それが N*N 要素ある

どうなることを期待しているか?

matA(column, row) が効率的
matB(row, column) が効率的

なはずなので、

matA(i,j) = A(j,i), matB(i,j) = B(i,j) が一番はやい ( mult_mat_4 )
matA(i,j) = A(i,j), matB(i,j) = B(j,i) が一番おそい ( mult_mat_2 )
残りの 2 つはその中間 ( mult_mat_1, mult_mat_3 )

となることが予想される。

実験環境

CPU: Intel Core i5 9400F
コンパイラ: nvfortran 20.9-0 ( From NVIDIA HPC SDK )
コンパイルオプション: -O0

結果

それぞれ 10 回ずつ実行し、平均 MIOPS で比較する。

	mult_mat_1	mult_mat_2	mult_mat_3	mult_mat_4
#1	741.370728	605.894531	702.563599	983.424377
#2	761.562378	636.706604	704.925171	984.001099
#3	753.693542	575.942932	704.629089	984.001099
#4	741.370728	587.643433	750.658569	984.578613
#5	743.341553	586.001343	728.177917	981.123901
#6	756.753235	578.924011	738.109131	985.735596
#7	751.330872	648.520264	734.232727	985.735596
#8	755.050354	611.637573	738.759155	984.001099
#9	745.654175	589.708923	760.526672	984.578613
#10	740.716064	586.411011	735.198059	985.156799
Avg	749.084362	600.7390625	729.7780089	984.2336792

想定通りの結果が得られた。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up