EPYC instances have arrived!
Which means users all over the world who have been saying "I want to run HPC on EPYC on AWS!" are rejoicing, or so the story goes (really, though?).
So, time to benchmark an EPYC instance. This is a follow-up to
https://qiita.com/telmin_orca/items/bea941314b8051793db2
where I took some quick-and-dirty benchmarks across three generations of Ryzen.
Hardware
/proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 23
model : 1
model name : AMD EPYC 7571
stepping : 2
microcode : 0x8001227
cpu MHz : 2598.604
cache size : 512 KB
physical id : 0
siblings : 48
core id : 0
cpu cores : 24
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save
bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass
bogomips : 4399.32
TLB size : 2560 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management:
EPYC is an MCM (multi-chip module), so you get NUMA even inside a single package. What a headache of a topology... sigh.
$ numactl -H
available: 3 nodes (0-2)
node 0 cpus: 0 1 2 3 4 5 6 7 24 25 26 27 28 29 30 31
node 0 size: 63310 MB
node 0 free: 62715 MB
node 1 cpus: 8 9 10 11 12 13 14 15 32 33 34 35 36 37 38 39
node 1 size: 63374 MB
node 1 free: 62707 MB
node 2 cpus: 16 17 18 19 20 21 22 23 40 41 42 43 44 45 46 47
node 2 size: 63352 MB
node 2 free: 62892 MB
node distances:
node 0 1 2
0: 10 16 16
1: 16 10 16
2: 16 16 10
Three clusters, huh? That feels kind of unsettling.
Benchmarks
STREAM
Without controlling anything
$ ./stream_cxx.out -s 1G
-------------------------------------------------------------
STREAM version $Revision : 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array Size = 1073741824 (elements), Offset = 0 (elements)
Memory per array = 8192 MiB (= 8 GiB).
Total Memory required = 24576 MiB (= 24 GiB).
Each kernel will be executed 10 times.
Function Best Rate MB/s Avg time Min time Max time
Copy: 30452.7 0.737589 0.564149 1.488491
Scale: 31323.2 0.691516 0.548470 1.135745
Add: 32674.3 0.912753 0.788686 1.135196
Triad: 30949.2 0.919351 0.832648 1.123737
-------------------------------------------------------------
With numactl --localalloc
$ numactl --localalloc ./stream_cxx.out -s 1G
-------------------------------------------------------------
STREAM version $Revision : 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array Size = 1073741824 (elements), Offset = 0 (elements)
Memory per array = 8192 MiB (= 8 GiB).
Total Memory required = 24576 MiB (= 24 GiB).
Each kernel will be executed 10 times.
Function Best Rate MB/s Avg time Min time Max time
Copy: 35016.2 0.907369 0.490627 3.395209
Scale: 34924.1 1.116415 0.491921 5.284777
Add: 36964.8 0.988101 0.697144 1.739983
Triad: 40281.4 0.772501 0.639745 1.186238
-------------------------------------------------------------
With numactl --interleave=all
$ numactl --interleave=all ./stream_cxx.out -s 1G
-------------------------------------------------------------
STREAM version $Revision : 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array Size = 1073741824 (elements), Offset = 0 (elements)
Memory per array = 8192 MiB (= 8 GiB).
Total Memory required = 24576 MiB (= 24 GiB).
Each kernel will be executed 10 times.
Function Best Rate MB/s Avg time Min time Max time
Copy: 51429.9 0.336871 0.334044 0.343149
Scale: 51399.8 0.335059 0.334240 0.339794
Add: 55332.8 0.468382 0.465724 0.474934
Triad: 55190.6 0.469062 0.466924 0.475884
-------------------------------------------------------------
Whoa, that's a big jump.
Now, what about using only 24 threads (i.e., one thread per physical core) and doing a rough bind with numactl?
$ OMP_NUM_THREADS=24 numactl --cpubind=0,1,2 --membind=0,1,2 ./stream_cxx.out -s 1G
-------------------------------------------------------------
STREAM version $Revision : 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array Size = 1073741824 (elements), Offset = 0 (elements)
Memory per array = 8192 MiB (= 8 GiB).
Total Memory required = 24576 MiB (= 24 GiB).
Each kernel will be executed 10 times.
Function Best Rate MB/s Avg time Min time Max time
Copy: 18458.0 0.935805 0.930755 0.942491
Scale: 18457.4 0.937502 0.930783 0.945430
Add: 19904.0 1.302825 1.294708 1.312583
Triad: 19975.5 1.305319 1.290069 1.317308
-------------------------------------------------------------
That got slower. Do I need to set thread affinity properly?
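One thing I would try next (just a sketch, not something I ran here): pin the OpenMP threads explicitly with the standard OMP_PLACES / OMP_PROC_BIND environment variables instead of leaving placement to the scheduler, e.g.

$ # one thread per physical core, spread over the NUMA nodes, memory allocated locally
$ OMP_NUM_THREADS=24 OMP_PLACES=cores OMP_PROC_BIND=spread numactl --localalloc ./stream_cxx.out -s 1G

OMP_PLACES / OMP_PROC_BIND are OpenMP 4.0 features, so this assumes the binary was built with a reasonably recent compiler.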
Let's try 24 threads with interleave=all...
$ OMP_NUM_THREADS=24 numactl --interleave=all ./stream_cxx.out -s 1G
-------------------------------------------------------------
STREAM version $Revision : 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array Size = 1073741824 (elements), Offset = 0 (elements)
Memory per array = 8192 MiB (= 8 GiB).
Total Memory required = 24576 MiB (= 24 GiB).
Each kernel will be executed 10 times.
Function Best Rate MB/s Avg time Min time Max time
Copy: 51883.6 0.334359 0.331124 0.352611
Scale: 50945.1 0.342290 0.337223 0.363494
Add: 55125.5 0.469205 0.467475 0.473028
Triad: 55026.9 0.472796 0.468313 0.504610
-------------------------------------------------------------
Hardly any difference.
Which means the memory really is being drawn from all the nodes properly... I see...
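If I wanted to double-check where the pages actually landed, numastat can break down a running process's memory by NUMA node (assuming the numactl/numastat tools are installed), along the lines of:

$ # per-node memory usage of the running STREAM process
$ numastat -p $(pgrep stream_cxx)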
FFT
A 3-D FFT.
Default run, nothing special
$ ./fftw
3-D FFT: 128 x 128 x 128
On-board: 0.005765 msec, 38.195858 GFLOPS.
On-board: 0.003017 msec, 72.975774 GFLOPS.
3-D FFT: 256 x 256 x 256
On-board: 0.064630 msec, 31.150834 GFLOPS.
On-board: 0.061369 msec, 32.806107 GFLOPS.
Huh, that's damn fast.
With 24 threads
$ OMP_NUM_THREADS=24 ./fftw
3-D FFT: 128 x 128 x 128
On-board: 0.004932 msec, 44.643677 GFLOPS.
On-board: 0.003317 msec, 66.378631 GFLOPS.
3-D FFT: 256 x 256 x 256
On-board: 0.099970 msec, 20.138779 GFLOPS.
On-board: 0.086103 msec, 23.381948 GFLOPS.
Dropped a little.
I tried adding numactl --interleave=all
$ numactl --interleave=all ./fftw
3-D FFT: 128 x 128 x 128
On-board: 0.005450 msec, 40.403388 GFLOPS.
On-board: 0.003117 msec, 70.648559 GFLOPS.
3-D FFT: 256 x 256 x 256
On-board: 0.064362 msec, 31.280534 GFLOPS.
On-board: 0.065973 msec, 30.516550 GFLOPS.
Hmm?
$ OMP_NUM_THREADS=24 numactl --interleave=all ./fftw
3-D FFT: 128 x 128 x 128
On-board: 0.005006 msec, 43.985406 GFLOPS.
On-board: 0.003303 msec, 66.665421 GFLOPS.
3-D FFT: 256 x 256 x 256
On-board: 0.085201 msec, 23.629514 GFLOPS.
On-board: 0.070042 msec, 28.743652 GFLOPS.
Not much change...
Just to be thorough:
$ numactl --localalloc ./fftw
3-D FFT: 128 x 128 x 128
On-board: 0.005704 msec, 38.601561 GFLOPS.
On-board: 0.003038 msec, 72.470349 GFLOPS.
3-D FFT: 256 x 256 x 256
On-board: 0.063662 msec, 31.624237 GFLOPS.
On-board: 0.063053 msec, 31.929787 GFLOPS.
No really meaningful difference here... I guess that's just how it is.
DGEMM, SGEMM
What will GEMM look like?
Judging from the ThreadRipper results, performance will probably drop quite a bit once all the SMT threads are used (32 threads on that chip). Which is to be expected: the innermost loop already has its instruction slots almost completely filled, so the logical (SMT) threads shouldn't buy anything.
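Incidentally, from the numactl -H output above, CPUs 0-23 and 24-47 look like SMT siblings of the same 24 physical cores, so one way to force one thread per physical core would be something like the following. The 0-23 mapping is my assumption here; it's worth confirming with lscpu -e before relying on it.

$ # pin to what appear to be the first SMT threads of the 24 physical cores
$ OMP_NUM_THREADS=24 numactl --physcpubind=0-23 --interleave=all ./gemms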
First, without doing anything special
$ ./gemms
M 10240 N 10240 K 10240 al -1 b 1
Dgemm start
memory use 2.34375 GB
1:20:32 initialize done.
238.255 GFlops, 89.1692
Sgemm start
1:21:7 initialize done.
727.395 GFlops, 99.035
Then with 24 threads.
$ OMP_NUM_THREADS=24 ./gemms
M 10240 N 10240 K 10240 al -1 b 1
Dgemm start
memory use 2.34375 GB
1:22:28 initialize done.
201.881 GFlops, 22.8572
Sgemm start
1:23:2 initialize done.
693.773 GFlops, 6.97528
Hold on, it went DOWN! (delivered in Kaguya Luna style)
Then with numactl --interleave=all applied
$ OMP_NUM_THREADS=24 numactl --interleave=all ./gemms
M 10240 N 10240 K 10240 al -1 b 1
Dgemm start
memory use 2.34375 GB
1:24:54 initialize done.
338.489 GFlops, 0.915681
Sgemm start
1:25:28 initialize done.
706.782 GFlops, 0.809519
That's quite a bit better. The run-to-run variance also dropped a lot, so NUMA control really is a must after all... ugh. What a pain.
Also, I can't believe it would matter, but... just in case... let's check 48 threads as well.
$ numactl --interleave=all ./gemms
M 10240 N 10240 K 10240 al -1 b 1
Dgemm start
memory use 2.34375 GB
1:26:35 initialize done.
391.78 GFlops, 1.15513
Sgemm start
1:27:10 initialize done.
814.027 GFlops, 5.30401
Whaaat... (I'm a bit taken aback)
HPL
$ ./lu2 -n 32768
main top omp_max_threads=48 procs=48
optchar = n optarg=32768
N=32768 Seed=1 NB=2048 usehuge=0
Board id=0 # boards=1
N=32768 Seed=1 NB=2048 usehuge=0
Board id=0 # boards=1
read/set mat end
copy mat end
before lumcolumn omp_max_threads=48 procs=48
Emax= 1.618e-05
Nswap=0 cpsec = 11626.1 wsec=407.86 57.5105 Gflops
swaprows time= 5.35257e+09 ops/cycle=0.100302
scalerow time= 1.5302e+08 ops/cycle=3.5085
trans rtoc time= 9.99468e+10 ops/cycle=0.00537157
trans ctor time= 7.00587e+10 ops/cycle=0.00766316
trans mmul time= 4.07757e+11 ops/cycle=0.0197496
tr nr cdec time= 4.99821e+09 ops/cycle=0.107413
trans vvmul time= 1.90071e+09 ops/cycle=0.282458
trans findp time= 3.06734e+09 ops/cycle=0.175028
solve tri u time= 1.55983e+10 ops/cycle=2.10074e-06
solve tri time= 2.93913e+10 ops/cycle=74.819
matmul nk8 time= 0 ops/cycle=inf
matmul snk time= 4.19544e+07 ops/cycle=204.745
trans mmul8 time= 1.51567e+11 ops/cycle=0.028337
trans mmul4 time= 2.5248e+11 ops/cycle=0.00850557
trans mmul2 time= 3.70726e+09 ops/cycle=0.289632
DGEMM2K time= 1.90254e+11 ops/cycle=117.51
DGEMM1K time= 1.09285e+10 ops/cycle=50.3048
DGEMM512 time= 7.84843e+09 ops/cycle=35.0233
DGEMMrest time= 8.24044e+10 ops/cycle=3.33572
col dec t time= 5.83185e+11 ops/cycle=0.0294587
Total time= 8.93463e+11 ops/cycle=26.2532
Fifty-something GFLOPS? Seriously?
$ numactl --interleave=all ./lu2 -n 32768
main top omp_max_threads=48 procs=48
optchar = n optarg=32768
N=32768 Seed=1 NB=2048 usehuge=0
Board id=0 # boards=1
N=32768 Seed=1 NB=2048 usehuge=0
Board id=0 # boards=1
read/set mat end
copy mat end
before lumcolumn omp_max_threads=48 procs=48
Emax= 1.618e-05
Nswap=0 cpsec = 10791.6 wsec=346.712 67.6534 Gflops
swaprows time= 1.42583e+09 ops/cycle=0.376533
scalerow time= 1.82128e+08 ops/cycle=2.94777
trans rtoc time= 9.52548e+10 ops/cycle=0.00563616
trans ctor time= 5.13747e+10 ops/cycle=0.0104501
trans mmul time= 3.89029e+11 ops/cycle=0.0207004
tr nr cdec time= 5.81692e+09 ops/cycle=0.0922947
trans vvmul time= 2.36237e+09 ops/cycle=0.227259
trans findp time= 3.43785e+09 ops/cycle=0.156165
solve tri u time= 8.44437e+09 ops/cycle=3.88045e-06
solve tri time= 2.34857e+10 ops/cycle=93.6326
matmul nk8 time= 0 ops/cycle=inf
matmul snk time= 1.19951e+06 ops/cycle=7161.23
trans mmul8 time= 1.44905e+11 ops/cycle=0.02964
trans mmul4 time= 2.40618e+11 ops/cycle=0.00892485
trans mmul2 time= 3.5034e+09 ops/cycle=0.306486
DGEMM2K time= 1.27428e+11 ops/cycle=175.446
DGEMM1K time= 7.71364e+09 ops/cycle=71.2706
DGEMM512 time= 6.1085e+09 ops/cycle=44.9992
DGEMMrest time= 7.1944e+10 ops/cycle=3.82072
col dec t time= 5.41751e+11 ops/cycle=0.0317118
Total time= 7.61303e+11 ops/cycle=30.8107
Uh, now it's sixty-something, lol.
$ OMP_NUM_THREADS=24 numactl --interleave=all ./lu2 -n 32768
main top omp_max_threads=24 procs=48
optchar = n optarg=32768
N=32768 Seed=1 NB=2048 usehuge=0
Board id=0 # boards=1
N=32768 Seed=1 NB=2048 usehuge=0
Board id=0 # boards=1
read/set mat end
copy mat end
before lumcolumn omp_max_threads=24 procs=48
Emax= 6.855e-06
Nswap=0 cpsec = 1752.89 wsec=78.1302 300.22 Gflops
swaprows time= 1.2731e+09 ops/cycle=0.421703
scalerow time= 1.54877e+08 ops/cycle=3.46642
trans rtoc time= 1.15648e+09 ops/cycle=0.46423
trans ctor time= 7.43654e+08 ops/cycle=0.721936
trans mmul time= 1.97772e+09 ops/cycle=4.07189
tr nr cdec time= 1.02064e+09 ops/cycle=0.526014
trans vvmul time= 3.40386e+08 ops/cycle=1.57724
trans findp time= 6.78659e+08 ops/cycle=0.791077
solve tri u time= 3.49471e+09 ops/cycle=9.37646e-06
solve tri time= 1.67429e+10 ops/cycle=131.341
matmul nk8 time= 0 ops/cycle=inf
matmul snk time= 618574 ops/cycle=13886.7
trans mmul8 time= 7.82076e+08 ops/cycle=5.49175
trans mmul4 time= 6.9772e+08 ops/cycle=3.07786
trans mmul2 time= 4.9601e+08 ops/cycle=2.16476
DGEMM2K time= 1.38579e+11 ops/cycle=161.328
DGEMM1K time= 5.73259e+09 ops/cycle=95.9
DGEMM512 time= 3.96182e+09 ops/cycle=69.3817
DGEMMrest time= 1.19867e+10 ops/cycle=22.9319
col dec t time= 4.9462e+09 ops/cycle=3.47335
Total time= 1.70473e+11 ops/cycle=137.595
Don't just suddenly jump to 300, lol.
$ OMP_NUM_THREADS=24 numactl --interleave=all ./lu2 -n 65536
main top omp_max_threads=24 procs=48
optchar = n optarg=65536
N=65536 Seed=1 NB=2048 usehuge=0
Board id=0 # boards=1
N=65536 Seed=1 NB=2048 usehuge=0
Board id=0 # boards=1
read/set mat end
copy mat end
before lumcolumn omp_max_threads=24 procs=48
Emax= 1.712e-03
Nswap=0 cpsec = 12019.3 wsec=555.36 337.889 Gflops
swaprows time= 3.90339e+09 ops/cycle=0.550159
scalerow time= 3.09347e+08 ops/cycle=6.94199
trans rtoc time= 4.34846e+09 ops/cycle=0.493849
trans ctor time= 3.12601e+09 ops/cycle=0.686973
trans mmul time= 6.34419e+09 ops/cycle=5.07745
tr nr cdec time= 4.45475e+09 ops/cycle=0.482066
trans vvmul time= 1.32924e+09 ops/cycle=1.61558
trans findp time= 3.121e+09 ops/cycle=0.688075
solve tri u time= 6.51899e+09 ops/cycle=1.00531e-05
solve tri time= 6.40876e+10 ops/cycle=137.251
matmul nk8 time= 0 ops/cycle=inf
matmul snk time= 607926 ops/cycle=56519.6
trans mmul8 time= 2.63849e+09 ops/cycle=6.51124
trans mmul4 time= 1.78724e+09 ops/cycle=4.80627
trans mmul2 time= 1.91419e+09 ops/cycle=2.24375
DGEMM2K time= 1.10841e+12 ops/cycle=165.329
DGEMM1K time= 1.93481e+10 ops/cycle=113.655
DGEMM512 time= 1.39368e+10 ops/cycle=78.893
DGEMMrest time= 4.25798e+10 ops/cycle=25.8224
col dec t time= 1.83769e+10 ops/cycle=3.73944
Total time= 1.21635e+12 ops/cycle=154.273
The result improves as the matrix gets bigger. So would it get even better if I went bigger still?
Well, that's just how HPL works, but it takes forever to run, which means my wallet dies, so I'd rather not.
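Rough back-of-the-envelope numbers (my own estimate, not measured): one N x N double-precision matrix takes 8 * N^2 bytes, and numactl reported roughly 190 GB across the three nodes, so

8 B * 65536^2 = 32 GiB (the run above)
8 B * 140000^2 ≈ 157 GB (probably around the practical limit here)

and since HPL runtime scales roughly as N^3, N=140000 would be on the order of 10x longer than the N=65536 run. Hence the wallet concern.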
Summary
Now, unusually for me, I hadn't computed this along the way, so let's work out the peak performance here. ~~The clock shown on the AWS page was 2 GHz, so~~
** Corrected 2018/11/09: that was just me misreading things; the official 2.5 GHz is correct, so please look at the second formula below. **
24 (cores) * 2 (GHz) * 2 (SIMD) * 2 (FMA) * 2 (ports) = 384 GFLOPS (DP)
24 (cores) * 2.5 (GHz) * 2 (SIMD) * 2 (FMA) * 2 (ports) = 480 GFLOPS (DP)
If we take 480 as the peak, the effective efficiencies for DGEMM and SGEMM come out to
391.78 / 480 = 81.6%
814.027 / 960 = 84.8%
which look like pretty respectable numbers.
For HPL it's
337.889 / 480 = 70.4%
so I'd kind of like to get a bit more out of it.
Other implementations might be a bit more efficient, and I'd like to try them as well. That said, HPL takes a really long time, so it's really not kind to my wallet...
As for memory bandwidth, there isn't much information about the exact SKU AWS uses, but judging from similar SKUs it looks like 8 channels of DDR4.
https://en.wikipedia.org/wiki/Epyc
If that's the case, I get the feeling... maybe... that it might be... losing to 6-channel Skylake...
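As a rough sanity check (assuming DDR4-2666 on all 8 channels, which is a guess on my part):

8 (ch) * 2666 (MT/s) * 8 (B) = ~170 GB/s theoretical peak

so the ~55 GB/s Triad measured above is roughly a third of that, while a 6-channel DDR4-2666 Skylake socket would sit at 6 * 2666 * 8 = ~128 GB/s theoretical.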
Next time, I should probably take proper benchmarks on something like Skylake...
Or rather, before that, I should redo these tests on a different EPYC environment.