EPYC instances have arrived!
Which means users all over the world who have been saying "I want to run HPC on EPYC on AWS!" are rejoicing, or so the story goes (really, though?).
So, time to benchmark an EPYC instance. This is a follow-up to
https://qiita.com/telmin_orca/items/bea941314b8051793db2
where I took some quick-and-dirty benchmarks across three generations of Ryzen.
Hardware
/proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 23
model : 1
model name : AMD EPYC 7571
stepping : 2
microcode : 0x8001227
cpu MHz : 2598.604
cache size : 512 KB
physical id : 0
siblings : 48
core id : 0
cpu cores : 24
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save
bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass
bogomips : 4399.32
TLB size : 2560 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management:
EPYC is an MCM (multi-chip module), so you get NUMA even inside a single package. What a headache of a topology... sigh.
$ numactl -H
available: 3 nodes (0-2)
node 0 cpus: 0 1 2 3 4 5 6 7 24 25 26 27 28 29 30 31
node 0 size: 63310 MB
node 0 free: 62715 MB
node 1 cpus: 8 9 10 11 12 13 14 15 32 33 34 35 36 37 38 39
node 1 size: 63374 MB
node 1 free: 62707 MB
node 2 cpus: 16 17 18 19 20 21 22 23 40 41 42 43 44 45 46 47
node 2 size: 63352 MB
node 2 free: 62892 MB
node distances:
node 0 1 2
0: 10 16 16
1: 16 10 16
2: 16 16 10
Three clusters, huh? That feels kind of unsettling.
Benchmarks
STREAM
Without controlling anything
$ ./stream_cxx.out -s 1G
-------------------------------------------------------------
STREAM version $Revision : 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array Size = 1073741824 (elements), Offset = 0 (elements)
Memory per array = 8192 MiB (= 8 GiB).
Total Memory required = 24576 MiB (= 24 GiB).
Each kernel will be executed 10 times.
Function Best Rate MB/s Avg time Min time Max time
Copy: 30452.7 0.737589 0.564149 1.488491
Scale: 31323.2 0.691516 0.548470 1.135745
Add: 32674.3 0.912753 0.788686 1.135196
Triad: 30949.2 0.919351 0.832648 1.123737
-------------------------------------------------------------
With numactl --localalloc
$ numactl --localalloc ./stream_cxx.out -s 1G
-------------------------------------------------------------
STREAM version $Revision : 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array Size = 1073741824 (elements), Offset = 0 (elements)
Memory per array = 8192 MiB (= 8 GiB).
Total Memory required = 24576 MiB (= 24 GiB).
Each kernel will be executed 10 times.
Function Best Rate MB/s Avg time Min time Max time
Copy: 35016.2 0.907369 0.490627 3.395209
Scale: 34924.1 1.116415 0.491921 5.284777
Add: 36964.8 0.988101 0.697144 1.739983
Triad: 40281.4 0.772501 0.639745 1.186238
-------------------------------------------------------------
With numactl --interleave=all
$ numactl --interleave=all ./stream_cxx.out -s 1G
-------------------------------------------------------------
STREAM version $Revision : 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array Size = 1073741824 (elements), Offset = 0 (elements)
Memory per array = 8192 MiB (= 8 GiB).
Total Memory required = 24576 MiB (= 24 GiB).
Each kernel will be executed 10 times.
Function Best Rate MB/s Avg time Min time Max time
Copy: 51429.9 0.336871 0.334044 0.343149
Scale: 51399.8 0.335059 0.334240 0.339794
Add: 55332.8 0.468382 0.465724 0.474934
Triad: 55190.6 0.469062 0.466924 0.475884
-------------------------------------------------------------
Whoa, that's a big jump.
Now, what about using only 24 threads (i.e., one thread per physical core) and doing a rough bind with numactl?
$ OMP_NUM_THREADS=24 numactl --cpubind=0,1,2 --membind=0,1,2 ./stream_cxx.out -s 1G
-------------------------------------------------------------
STREAM version $Revision : 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array Size = 1073741824 (elements), Offset = 0 (elements)
Memory per array = 8192 MiB (= 8 GiB).
Total Memory required = 24576 MiB (= 24 GiB).
Each kernel will be executed 10 times.
Function Best Rate MB/s Avg time Min time Max time
Copy: 18458.0 0.935805 0.930755 0.942491
Scale: 18457.4 0.937502 0.930783 0.945430
Add: 19904.0 1.302825 1.294708 1.312583
Triad: 19975.5 1.305319 1.290069 1.317308
-------------------------------------------------------------
That got slower. Do I need to set thread affinity properly?
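One thing I would try next (just a sketch, not something I ran here): pin the OpenMP threads explicitly with the standard OMP_PLACES / OMP_PROC_BIND environment variables instead of leaving placement to the scheduler, e.g.

$ # one thread per physical core, spread over the NUMA nodes, memory allocated locally
$ OMP_NUM_THREADS=24 OMP_PLACES=cores OMP_PROC_BIND=spread numactl --localalloc ./stream_cxx.out -s 1G

OMP_PLACES / OMP_PROC_BIND are OpenMP 4.0 features, so this assumes the binary was built with a reasonably recent compiler.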
Let's try 24 threads with interleave=all...
$ OMP_NUM_THREADS=24 numactl --interleave=all ./stream_cxx.out -s 1G
-------------------------------------------------------------
STREAM version $Revision : 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array Size = 1073741824 (elements), Offset = 0 (elements)
Memory per array = 8192 MiB (= 8 GiB).
Total Memory required = 24576 MiB (= 24 GiB).
Each kernel will be executed 10 times.
Function Best Rate MB/s Avg time Min time Max time
Copy: 51883.6 0.334359 0.331124 0.352611
Scale: 50945.1 0.342290 0.337223 0.363494
Add: 55125.5 0.469205 0.467475 0.473028
Triad: 55026.9 0.472796 0.468313 0.504610
-------------------------------------------------------------
Hardly any difference.
Which means the memory really is being drawn from all the nodes properly... I see...
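If I wanted to double-check where the pages actually landed, numastat can break down a running process's memory by NUMA node (assuming the numactl/numastat tools are installed), along the lines of:

$ # per-node memory usage of the running STREAM process
$ numastat -p $(pgrep stream_cxx)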
FFT
A 3-D FFT.
Default run, nothing special
$ ./fftw
3-D FFT: 128 x 128 x 128
On-board: 0.005765 msec, 38.195858 GFLOPS.
On-board: 0.003017 msec, 72.975774 GFLOPS.
3-D FFT: 256 x 256 x 256
On-board: 0.064630 msec, 31.150834 GFLOPS.
On-board: 0.061369 msec, 32.806107 GFLOPS.
Huh, that's damn fast.
With 24 threads
$ OMP_NUM_THREADS=24 ./fftw
3-D FFT: 128 x 128 x 128
On-board: 0.004932 msec, 44.643677 GFLOPS.
On-board: 0.003317 msec, 66.378631 GFLOPS.
3-D FFT: 256 x 256 x 256
On-board: 0.099970 msec, 20.138779 GFLOPS.
On-board: 0.086103 msec, 23.381948 GFLOPS.
Dropped a little.
I tried adding numactl --interleave=all
$ numactl --interleave=all ./fftw
3-D FFT: 128 x 128 x 128
On-board: 0.005450 msec, 40.403388 GFLOPS.
On-board: 0.003117 msec, 70.648559 GFLOPS.
3-D FFT: 256 x 256 x 256
On-board: 0.064362 msec, 31.280534 GFLOPS.
On-board: 0.065973 msec, 30.516550 GFLOPS.
Hmm?
$ OMP_NUM_THREADS=24 numactl --interleave=all ./fftw
3-D FFT: 128 x 128 x 128
On-board: 0.005006 msec, 43.985406 GFLOPS.
On-board: 0.003303 msec, 66.665421 GFLOPS.
3-D FFT: 256 x 256 x 256
On-board: 0.085201 msec, 23.629514 GFLOPS.
On-board: 0.070042 msec, 28.743652 GFLOPS.
Not much change...
Just to be thorough:
$ numactl --localalloc ./fftw
3-D FFT: 128 x 128 x 128
On-board: 0.005704 msec, 38.601561 GFLOPS.
On-board: 0.003038 msec, 72.470349 GFLOPS.
3-D FFT: 256 x 256 x 256
On-board: 0.063662 msec, 31.624237 GFLOPS.
On-board: 0.063053 msec, 31.929787 GFLOPS.
No really meaningful difference here... I guess that's just how it is.
DGEMM, SGEMM
What will GEMM look like?
Judging from the ThreadRipper results, performance will probably drop quite a bit once all the SMT threads are used (32 threads on that chip). Which is to be expected: the innermost loop already has its instruction slots almost completely filled, so the logical (SMT) threads shouldn't buy anything.
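Incidentally, from the numactl -H output above, CPUs 0-23 and 24-47 look like SMT siblings of the same 24 physical cores, so one way to force one thread per physical core would be something like the following. The 0-23 mapping is my assumption here; it's worth confirming with lscpu -e before relying on it.

$ # pin to what appear to be the first SMT threads of the 24 physical cores
$ OMP_NUM_THREADS=24 numactl --physcpubind=0-23 --interleave=all ./gemms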
First, without doing anything special
$ ./gemms
M 10240 N 10240 K 10240 al -1 b 1
Dgemm start
memory use 2.34375 GB
1:20:32 initialize done.
238.255 GFlops, 89.1692
Sgemm start
1:21:7 initialize done.
727.395 GFlops, 99.035
Then with 24 threads.
$ OMP_NUM_THREADS=24 ./gemms
M 10240 N 10240 K 10240 al -1 b 1
Dgemm start
memory use 2.34375 GB
1:22:28 initialize done.
201.881 GFlops, 22.8572
Sgemm start
1:23:2 initialize done.
693.773 GFlops, 6.97528
Hold on, it went DOWN! (delivered in Kaguya Luna style)
Then with numactl --interleave=all applied
$ OMP_NUM_THREADS=24 numactl --interleave=all ./gemms
M 10240 N 10240 K 10240 al -1 b 1
Dgemm start
memory use 2.34375 GB
1:24:54 initialize done.
338.489 GFlops, 0.915681
Sgemm start
1:25:28 initialize done.
706.782 GFlops, 0.809519
That's quite a bit better. The run-to-run variance also dropped a lot, so NUMA control really is a must after all... ugh. What a pain.
Also, I can't believe it would matter, but... just in case... let's check 48 threads as well.
$ numactl --interleave=all ./gemms
M 10240 N 10240 K 10240 al -1 b 1
Dgemm start
memory use 2.34375 GB
1:26:35 initialize done.
391.78 GFlops, 1.15513
Sgemm start
1:27:10 initialize done.
814.027 GFlops, 5.30401
Whaaat... (I'm a bit taken aback)
HPL
$ ./lu2 -n 32768
main top omp_max_threads=48 procs=48
optchar = n optarg=32768
N=32768 Seed=1 NB=2048 usehuge=0
Board id=0 # boards=1
N=32768 Seed=1 NB=2048 usehuge=0
Board id=0 # boards=1
read/set mat end
copy mat end
before lumcolumn omp_max_threads=48 procs=48
Emax= 1.618e-05
Nswap=0 cpsec = 11626.1 wsec=407.86 57.5105 Gflops
swaprows time= 5.35257e+09 ops/cycle=0.100302
scalerow time= 1.5302e+08 ops/cycle=3.5085
trans rtoc time= 9.99468e+10 ops/cycle=0.00537157
trans ctor time= 7.00587e+10 ops/cycle=0.00766316
trans mmul time= 4.07757e+11 ops/cycle=0.0197496
tr nr cdec time= 4.99821e+09 ops/cycle=0.107413
trans vvmul time= 1.90071e+09 ops/cycle=0.282458
trans findp time= 3.06734e+09 ops/cycle=0.175028
solve tri u time= 1.55983e+10 ops/cycle=2.10074e-06
solve tri time= 2.93913e+10 ops/cycle=74.819
matmul nk8 time= 0 ops/cycle=inf
matmul snk time= 4.19544e+07 ops/cycle=204.745
trans mmul8 time= 1.51567e+11 ops/cycle=0.028337
trans mmul4 time= 2.5248e+11 ops/cycle=0.00850557
trans mmul2 time= 3.70726e+09 ops/cycle=0.289632
DGEMM2K time= 1.90254e+11 ops/cycle=117.51
DGEMM1K time= 1.09285e+10 ops/cycle=50.3048
DGEMM512 time= 7.84843e+09 ops/cycle=35.0233
DGEMMrest time= 8.24044e+10 ops/cycle=3.33572
col dec t time= 5.83185e+11 ops/cycle=0.0294587
Total time= 8.93463e+11 ops/cycle=26.2532
Fifty-something GFLOPS? Seriously?
$ numactl --interleave=all ./lu2 -n 32768
main top omp_max_threads=48 procs=48
optchar = n optarg=32768
N=32768 Seed=1 NB=2048 usehuge=0
Board id=0 # boards=1
N=32768 Seed=1 NB=2048 usehuge=0
Board id=0 # boards=1
read/set mat end
copy mat end
before lumcolumn omp_max_threads=48 procs=48
Emax= 1.618e-05
Nswap=0 cpsec = 10791.6 wsec=346.712 67.6534 Gflops
swaprows time= 1.42583e+09 ops/cycle=0.376533
scalerow time= 1.82128e+08 ops/cycle=2.94777
trans rtoc time= 9.52548e+10 ops/cycle=0.00563616
trans ctor time= 5.13747e+10 ops/cycle=0.0104501
trans mmul time= 3.89029e+11 ops/cycle=0.0207004
tr nr cdec time= 5.81692e+09 ops/cycle=0.0922947
trans vvmul time= 2.36237e+09 ops/cycle=0.227259
trans findp time= 3.43785e+09 ops/cycle=0.156165
solve tri u time= 8.44437e+09 ops/cycle=3.88045e-06
solve tri time= 2.34857e+10 ops/cycle=93.6326
matmul nk8 time= 0 ops/cycle=inf
matmul snk time= 1.19951e+06 ops/cycle=7161.23
trans mmul8 time= 1.44905e+11 ops/cycle=0.02964
trans mmul4 time= 2.40618e+11 ops/cycle=0.00892485
trans mmul2 time= 3.5034e+09 ops/cycle=0.306486
DGEMM2K time= 1.27428e+11 ops/cycle=175.446
DGEMM1K time= 7.71364e+09 ops/cycle=71.2706
DGEMM512 time= 6.1085e+09 ops/cycle=44.9992
DGEMMrest time= 7.1944e+10 ops/cycle=3.82072
col dec t time= 5.41751e+11 ops/cycle=0.0317118
Total time= 7.61303e+11 ops/cycle=30.8107
Uh, now it's sixty-something, lol.
$ OMP_NUM_THREADS=24 numactl --interleave=all ./lu2 -n 32768
main top omp_max_threads=24 procs=48
optchar = n optarg=32768
N=32768 Seed=1 NB=2048 usehuge=0
Board id=0 # boards=1
N=32768 Seed=1 NB=2048 usehuge=0
Board id=0 # boards=1
read/set mat end
copy mat end
before lumcolumn omp_max_threads=24 procs=48
Emax= 6.855e-06
Nswap=0 cpsec = 1752.89 wsec=78.1302 300.22 Gflops
swaprows time= 1.2731e+09 ops/cycle=0.421703
scalerow time= 1.54877e+08 ops/cycle=3.46642
trans rtoc time= 1.15648e+09 ops/cycle=0.46423
trans ctor time= 7.43654e+08 ops/cycle=0.721936
trans mmul time= 1.97772e+09 ops/cycle=4.07189
tr nr cdec time= 1.02064e+09 ops/cycle=0.526014
trans vvmul time= 3.40386e+08 ops/cycle=1.57724
trans findp time= 6.78659e+08 ops/cycle=0.791077
solve tri u time= 3.49471e+09 ops/cycle=9.37646e-06
solve tri time= 1.67429e+10 ops/cycle=131.341
matmul nk8 time= 0 ops/cycle=inf
matmul snk time= 618574 ops/cycle=13886.7
trans mmul8 time= 7.82076e+08 ops/cycle=5.49175
trans mmul4 time= 6.9772e+08 ops/cycle=3.07786
trans mmul2 time= 4.9601e+08 ops/cycle=2.16476
DGEMM2K time= 1.38579e+11 ops/cycle=161.328
DGEMM1K time= 5.73259e+09 ops/cycle=95.9
DGEMM512 time= 3.96182e+09 ops/cycle=69.3817
DGEMMrest time= 1.19867e+10 ops/cycle=22.9319
col dec t time= 4.9462e+09 ops/cycle=3.47335
Total time= 1.70473e+11 ops/cycle=137.595
Don't just suddenly jump to 300, lol.
$ OMP_NUM_THREADS=24 numactl --interleave=all ./lu2 -n 65536
main top omp_max_threads=24 procs=48
optchar = n optarg=65536
N=65536 Seed=1 NB=2048 usehuge=0
Board id=0 # boards=1
N=65536 Seed=1 NB=2048 usehuge=0
Board id=0 # boards=1
read/set mat end
copy mat end
before lumcolumn omp_max_threads=24 procs=48
Emax= 1.712e-03
Nswap=0 cpsec = 12019.3 wsec=555.36 337.889 Gflops
swaprows time= 3.90339e+09 ops/cycle=0.550159
scalerow time= 3.09347e+08 ops/cycle=6.94199
trans rtoc time= 4.34846e+09 ops/cycle=0.493849
trans ctor time= 3.12601e+09 ops/cycle=0.686973
trans mmul time= 6.34419e+09 ops/cycle=5.07745
tr nr cdec time= 4.45475e+09 ops/cycle=0.482066
trans vvmul time= 1.32924e+09 ops/cycle=1.61558
trans findp time= 3.121e+09 ops/cycle=0.688075
solve tri u time= 6.51899e+09 ops/cycle=1.00531e-05
solve tri time= 6.40876e+10 ops/cycle=137.251
matmul nk8 time= 0 ops/cycle=inf
matmul snk time= 607926 ops/cycle=56519.6
trans mmul8 time= 2.63849e+09 ops/cycle=6.51124
trans mmul4 time= 1.78724e+09 ops/cycle=4.80627
trans mmul2 time= 1.91419e+09 ops/cycle=2.24375
DGEMM2K time= 1.10841e+12 ops/cycle=165.329
DGEMM1K time= 1.93481e+10 ops/cycle=113.655
DGEMM512 time= 1.39368e+10 ops/cycle=78.893
DGEMMrest time= 4.25798e+10 ops/cycle=25.8224
col dec t time= 1.83769e+10 ops/cycle=3.73944
Total time= 1.21635e+12 ops/cycle=154.273
The result improves as the matrix gets bigger. So would it get even better if I went bigger still?
Well, that's just how HPL works, but it takes forever to run, which means my wallet dies, so I'd rather not.
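Rough back-of-the-envelope numbers (my own estimate, not measured): one N x N double-precision matrix takes 8 * N^2 bytes, and numactl reported roughly 190 GB across the three nodes, so

8 B * 65536^2 = 32 GiB (the run above)
8 B * 140000^2 ≈ 157 GB (probably around the practical limit here)

and since HPL runtime scales roughly as N^3, N=140000 would be on the order of 10x longer than the N=65536 run. Hence the wallet concern.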
Summary
Now, unusually for me, I hadn't computed this along the way, so let's work out the peak performance here. ~~The clock shown on the AWS page was 2 GHz, so~~
** Corrected 2018/11/09: that was just me misreading things; the official 2.5 GHz is correct, so please look at the second formula below. **
24 (cores) * 2 (GHz) * 2 (SIMD) * 2 (FMA) * 2 (ports) = 384 GFLOPS (DP)
24 (cores) * 2.5 (GHz) * 2 (SIMD) * 2 (FMA) * 2 (ports) = 480 GFLOPS (DP)
If we take 480 as the peak, the effective efficiencies for DGEMM and SGEMM come out to
391.78 / 480 = 81.6%
814.027 / 960 = 84.8%
which look like pretty respectable numbers.
For HPL it's
337.889 / 480 = 70.4%
so I'd kind of like to get a bit more out of it.
Other implementations might be a bit more efficient, and I'd like to try them as well. That said, HPL takes a really long time, so it's really not kind to my wallet...
As for memory bandwidth, there isn't much information about the exact SKU AWS uses, but judging from similar SKUs it looks like 8 channels of DDR4.
https://en.wikipedia.org/wiki/Epyc
If that's the case, I get the feeling... maybe... that it might be... losing to 6-channel Skylake...
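As a rough sanity check (assuming DDR4-2666 on all 8 channels, which is a guess on my part):

8 (ch) * 2666 (MT/s) * 8 (B) = ~170 GB/s theoretical peak

so the ~55 GB/s Triad measured above is roughly a third of that, while a 6-channel DDR4-2666 Skylake socket would sit at 6 * 2666 * 8 = ~128 GB/s theoretical.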
Next time, I should probably take proper benchmarks on something like Skylake...
Or rather, before that, I should redo these tests on a different EPYC environment.