
Testing whether AWS EPYC instances are usable for HPC

Posted at 2018-11-09

The EPYC instances have arrived!

So the story goes that HPC users all over the world who had been waiting to run HPC on EPYC in AWS rejoiced (did they, really?).
In any case, let's benchmark an EPYC instance.

This is a follow-up to https://qiita.com/telmin_orca/items/bea941314b8051793db2, where I took rough benchmarks across three generations of Ryzen.

Hardware

/proc/cpuinfo

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 23
model           : 1
model name      : AMD EPYC 7571
stepping        : 2
microcode       : 0x8001227
cpu MHz         : 2598.604
cache size      : 512 KB
physical id     : 0
siblings        : 48
core id         : 0
cpu cores       : 24
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save
bugs            : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass
bogomips        : 4399.32
TLB size        : 2560 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management:

EPYC is an MCM, so a single package is itself NUMA inside. What a tangled topology... sigh.

$ numactl -H
available: 3 nodes (0-2)
node 0 cpus: 0 1 2 3 4 5 6 7 24 25 26 27 28 29 30 31
node 0 size: 63310 MB
node 0 free: 62715 MB
node 1 cpus: 8 9 10 11 12 13 14 15 32 33 34 35 36 37 38 39
node 1 size: 63374 MB
node 1 free: 62707 MB
node 2 cpus: 16 17 18 19 20 21 22 23 40 41 42 43 44 45 46 47
node 2 size: 63352 MB
node 2 free: 62892 MB
node distances:
node   0   1   2
  0:  10  16  16
  1:  16  10  16
  2:  16  16  10

Three clusters? That feels a little off somehow.
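(Side note: the same topology can be read programmatically from sysfs. A minimal sketch, assuming a Linux machine with the standard /sys/devices/system/node layout; it falls back to the current affinity mask if that tree is absent.)

```python
import os
import glob

def numa_nodes():
    """Map NUMA node id -> CPU list string, read from Linux sysfs.

    Falls back to a single pseudo-node holding the current affinity
    mask when the sysfs node tree does not exist.
    """
    nodes = sorted(glob.glob("/sys/devices/system/node/node[0-9]*"))
    if not nodes:
        cpus = sorted(os.sched_getaffinity(0))
        return {0: ",".join(map(str, cpus))}
    result = {}
    for path in nodes:
        node_id = int(path.rsplit("node", 1)[1])
        with open(os.path.join(path, "cpulist")) as f:
            result[node_id] = f.read().strip()  # e.g. "0-7,24-31"
    return result

print(numa_nodes())
```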

Benchmarks

STREAM

With no control at all

$ ./stream_cxx.out -s 1G
-------------------------------------------------------------
STREAM version $Revision : 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array Size = 1073741824 (elements), Offset = 0 (elements)
Memory per array = 8192 MiB (= 8 GiB).
Total Memory required = 24576 MiB (= 24 GiB).
Each kernel will be executed 10 times.
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           30452.7     0.737589     0.564149     1.488491
Scale:          31323.2     0.691516     0.548470     1.135745
Add:            32674.3     0.912753     0.788686     1.135196
Triad:          30949.2     0.919351     0.832648     1.123737
-------------------------------------------------------------

With numactl --localalloc

$ numactl --localalloc ./stream_cxx.out -s 1G
-------------------------------------------------------------
STREAM version $Revision : 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array Size = 1073741824 (elements), Offset = 0 (elements)
Memory per array = 8192 MiB (= 8 GiB).
Total Memory required = 24576 MiB (= 24 GiB).
Each kernel will be executed 10 times.
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           35016.2     0.907369     0.490627     3.395209
Scale:          34924.1     1.116415     0.491921     5.284777
Add:            36964.8     0.988101     0.697144     1.739983
Triad:          40281.4     0.772501     0.639745     1.186238
-------------------------------------------------------------

With numactl --interleave=all

$ numactl --interleave=all ./stream_cxx.out -s 1G
-------------------------------------------------------------
STREAM version $Revision : 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array Size = 1073741824 (elements), Offset = 0 (elements)
Memory per array = 8192 MiB (= 8 GiB).
Total Memory required = 24576 MiB (= 24 GiB).
Each kernel will be executed 10 times.
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           51429.9     0.336871     0.334044     0.343149
Scale:          51399.8     0.335059     0.334240     0.339794
Add:            55332.8     0.468382     0.465724     0.474934
Triad:          55190.6     0.469062     0.466924     0.475884
-------------------------------------------------------------

Whoa, that's a huge jump.

What about 24 threads only (i.e. one thread per physical core), with a rough numactl binding?

$ OMP_NUM_THREADS=24 numactl --cpubind=0,1,2 --membind=0,1,2 ./stream_cxx.out -s 1G
-------------------------------------------------------------
STREAM version $Revision : 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array Size = 1073741824 (elements), Offset = 0 (elements)
Memory per array = 8192 MiB (= 8 GiB).
Total Memory required = 24576 MiB (= 24 GiB).
Each kernel will be executed 10 times.
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           18458.0     0.935805     0.930755     0.942491
Scale:          18457.4     0.937502     0.930783     0.945430
Add:            19904.0     1.302825     1.294708     1.312583
Triad:          19975.5     1.305319     1.290069     1.317308
-------------------------------------------------------------

It got slower. Maybe thread affinity needs to be set properly?
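For reference, numactl --cpubind only restricts the process as a whole; each OpenMP thread can still migrate among the allowed CPUs. Per-thread pinning is what OMP_PROC_BIND=true / OMP_PLACES=cores (or GOMP_CPU_AFFINITY) provide. A minimal process-level sketch of the same idea in Python (Linux only):

```python
import os

# Restrict this process to the first half of the CPUs it may run on --
# a rough stand-in for numactl --cpubind (without the memory policy).
cpus = sorted(os.sched_getaffinity(0))
first_half = set(cpus[: max(1, len(cpus) // 2)])
os.sched_setaffinity(0, first_half)
print(sorted(os.sched_getaffinity(0)))
```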

24 threads with --interleave=all...

$ OMP_NUM_THREADS=24 numactl --interleave=all ./stream_cxx.out -s 1G
-------------------------------------------------------------
STREAM version $Revision : 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array Size = 1073741824 (elements), Offset = 0 (elements)
Memory per array = 8192 MiB (= 8 GiB).
Total Memory required = 24576 MiB (= 24 GiB).
Each kernel will be executed 10 times.
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           51883.6     0.334359     0.331124     0.352611
Scale:          50945.1     0.342290     0.337223     0.363494
Add:            55125.5     0.469205     0.467475     0.473028
Triad:          55026.9     0.472796     0.468313     0.504610
-------------------------------------------------------------

Barely any change.
So the pages were being placed correctly after all... I see.
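As an aside on STREAM's accounting: Copy counts two bytes of traffic per array byte (one read, one write), and the MB/s figures above are bytes moved divided by the best time over the repetitions. A toy illustration of that bookkeeping (pure Python, so treat the absolute number as a curiosity, not a measurement):

```python
import time

N = 64 * 1024 * 1024          # 64 MiB source buffer
src = bytearray(N)
best = float("inf")
for _ in range(3):            # STREAM-style: report the best of several runs
    t0 = time.perf_counter()
    dst = bytes(src)          # one read of src + one write of dst
    best = min(best, time.perf_counter() - t0)
print(f"Copy: {2 * N / best / 1e6:.1f} MB/s (best of 3)")
```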

FFT

A 3-D FFT.

Default run

$ ./fftw
3-D FFT: 128 x 128 x 128
On-board: 0.005765 msec, 38.195858 GFLOPS.
On-board: 0.003017 msec, 72.975774 GFLOPS.
3-D FFT: 256 x 256 x 256
On-board: 0.064630 msec, 31.150834 GFLOPS.
On-board: 0.061369 msec, 32.806107 GFLOPS.

That's surprisingly fast.

24 threads

$ OMP_NUM_THREADS=24 ./fftw
3-D FFT: 128 x 128 x 128
On-board: 0.004932 msec, 44.643677 GFLOPS.
On-board: 0.003317 msec, 66.378631 GFLOPS.
3-D FFT: 256 x 256 x 256
On-board: 0.099970 msec, 20.138779 GFLOPS.
On-board: 0.086103 msec, 23.381948 GFLOPS.

It dropped a little.

With numactl --interleave=all added

$ numactl --interleave=all ./fftw
3-D FFT: 128 x 128 x 128
On-board: 0.005450 msec, 40.403388 GFLOPS.
On-board: 0.003117 msec, 70.648559 GFLOPS.
3-D FFT: 256 x 256 x 256
On-board: 0.064362 msec, 31.280534 GFLOPS.
On-board: 0.065973 msec, 30.516550 GFLOPS.

Hmm?

$ OMP_NUM_THREADS=24 numactl --interleave=all ./fftw
3-D FFT: 128 x 128 x 128
On-board: 0.005006 msec, 43.985406 GFLOPS.
On-board: 0.003303 msec, 66.665421 GFLOPS.
3-D FFT: 256 x 256 x 256
On-board: 0.085201 msec, 23.629514 GFLOPS.
On-board: 0.070042 msec, 28.743652 GFLOPS.

Not much change...

Just to be sure:

$ numactl --localalloc ./fftw
3-D FFT: 128 x 128 x 128
On-board: 0.005704 msec, 38.601561 GFLOPS.
On-board: 0.003038 msec, 72.470349 GFLOPS.
3-D FFT: 256 x 256 x 256
On-board: 0.063662 msec, 31.624237 GFLOPS.
On-board: 0.063053 msec, 31.929787 GFLOPS.

No statistically meaningful difference here... I suppose that's just how it is.
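A quick sanity check on the GFLOPS figures: using the customary 5·N·log2(N) flop count for a complex FFT, the reported rates are reproduced only when the printed times are read as seconds, so the "msec" label in this benchmark's output looks mislabeled. (That flop-count convention is my assumption about how this benchmark counts operations.)

```python
import math

def fft_gflops(n_points, t_seconds):
    # Customary 5 * N * log2(N) flop count for a complex FFT of
    # n_points total points, converted to GFLOPS.
    return 5 * n_points * math.log2(n_points) / t_seconds / 1e9

# 128^3 run, second reported time taken as seconds:
print(fft_gflops(128 ** 3, 0.003017))   # ~72.99, vs the reported 72.975774
```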

DGEMM, SGEMM

What happens with GEMM?
Judging from the trend on ThreadRipper, performance should drop quite a bit when all 48 threads are used. That much is to be expected: the innermost loop already keeps the execution units nearly fully occupied, so the extra logical threads shouldn't help.

First, with no tuning:

$ ./gemms
M 10240 N 10240 K 10240 al -1 b 1
Dgemm start
memory use 2.34375 GB
1:20:32 initialize done.
238.255 GFlops, 89.1692
Sgemm start
1:21:7 initialize done.
727.395 GFlops, 99.035

Now with 24 threads.

$ OMP_NUM_THREADS=24 ./gemms
M 10240 N 10240 K 10240 al -1 b 1
Dgemm start
memory use 2.34375 GB
1:22:28 initialize done.
201.881 GFlops, 22.8572
Sgemm start
1:23:2 initialize done.
693.773 GFlops, 6.97528

Wait, it went down instead!

Applying numactl --interleave=all:

$ OMP_NUM_THREADS=24 numactl --interleave=all ./gemms
M 10240 N 10240 K 10240 al -1 b 1
Dgemm start
memory use 2.34375 GB
1:24:54 initialize done.
338.489 GFlops, 0.915681
Sgemm start
1:25:28 initialize done.
706.782 GFlops, 0.809519

Noticeably better, and the run-to-run variance also dropped sharply, so NUMA control really does seem mandatory... ugh, what a hassle.

And, just in case... just in case... let's also check 48 threads.

$ numactl --interleave=all ./gemms
M 10240 N 10240 K 10240 al -1 b 1
Dgemm start
memory use 2.34375 GB
1:26:35 initialize done.
391.78 GFlops, 1.15513
Sgemm start
1:27:10 initialize done.
814.027 GFlops, 5.30401

Seriously...? (taken aback)

HPL

$ ./lu2 -n 32768
main top omp_max_threads=48 procs=48
optchar = n optarg=32768
N=32768 Seed=1 NB=2048 usehuge=0
Board id=0 # boards=1
N=32768 Seed=1 NB=2048 usehuge=0
Board id=0 # boards=1
read/set mat end
copy mat end
before lumcolumn omp_max_threads=48 procs=48
Emax=  1.618e-05
Nswap=0 cpsec =  11626.1 wsec=407.86 57.5105 Gflops
swaprows    time=     5.35257e+09 ops/cycle=0.100302
scalerow    time=      1.5302e+08 ops/cycle=3.5085
trans rtoc  time=     9.99468e+10 ops/cycle=0.00537157
trans ctor  time=     7.00587e+10 ops/cycle=0.00766316
trans mmul  time=     4.07757e+11 ops/cycle=0.0197496
tr nr  cdec time=     4.99821e+09 ops/cycle=0.107413
trans vvmul time=     1.90071e+09 ops/cycle=0.282458
trans findp time=     3.06734e+09 ops/cycle=0.175028
solve tri u time=     1.55983e+10 ops/cycle=2.10074e-06
solve tri   time=     2.93913e+10 ops/cycle=74.819
matmul nk8  time=               0 ops/cycle=inf
matmul snk  time=     4.19544e+07 ops/cycle=204.745
trans mmul8 time=     1.51567e+11 ops/cycle=0.028337
trans mmul4 time=      2.5248e+11 ops/cycle=0.00850557
trans mmul2 time=     3.70726e+09 ops/cycle=0.289632
DGEMM2K     time=     1.90254e+11 ops/cycle=117.51
DGEMM1K     time=     1.09285e+10 ops/cycle=50.3048
DGEMM512    time=     7.84843e+09 ops/cycle=35.0233
DGEMMrest   time=     8.24044e+10 ops/cycle=3.33572
col dec t   time=     5.83185e+11 ops/cycle=0.0294587
Total       time=     8.93463e+11 ops/cycle=26.2532

Fifty-something?!

$ numactl --interleave=all ./lu2 -n 32768
main top omp_max_threads=48 procs=48
optchar = n optarg=32768
N=32768 Seed=1 NB=2048 usehuge=0
Board id=0 # boards=1
N=32768 Seed=1 NB=2048 usehuge=0
Board id=0 # boards=1
read/set mat end
copy mat end
before lumcolumn omp_max_threads=48 procs=48
Emax=  1.618e-05
Nswap=0 cpsec =  10791.6 wsec=346.712 67.6534 Gflops
swaprows    time=     1.42583e+09 ops/cycle=0.376533
scalerow    time=     1.82128e+08 ops/cycle=2.94777
trans rtoc  time=     9.52548e+10 ops/cycle=0.00563616
trans ctor  time=     5.13747e+10 ops/cycle=0.0104501
trans mmul  time=     3.89029e+11 ops/cycle=0.0207004
tr nr  cdec time=     5.81692e+09 ops/cycle=0.0922947
trans vvmul time=     2.36237e+09 ops/cycle=0.227259
trans findp time=     3.43785e+09 ops/cycle=0.156165
solve tri u time=     8.44437e+09 ops/cycle=3.88045e-06
solve tri   time=     2.34857e+10 ops/cycle=93.6326
matmul nk8  time=               0 ops/cycle=inf
matmul snk  time=     1.19951e+06 ops/cycle=7161.23
trans mmul8 time=     1.44905e+11 ops/cycle=0.02964
trans mmul4 time=     2.40618e+11 ops/cycle=0.00892485
trans mmul2 time=      3.5034e+09 ops/cycle=0.306486
DGEMM2K     time=     1.27428e+11 ops/cycle=175.446
DGEMM1K     time=     7.71364e+09 ops/cycle=71.2706
DGEMM512    time=      6.1085e+09 ops/cycle=44.9992
DGEMMrest   time=      7.1944e+10 ops/cycle=3.82072
col dec t   time=     5.41751e+11 ops/cycle=0.0317118
Total       time=     7.61303e+11 ops/cycle=30.8107

Now sixty-something, lol.

$ OMP_NUM_THREADS=24 numactl --interleave=all ./lu2 -n 32768
main top omp_max_threads=24 procs=48
optchar = n optarg=32768
N=32768 Seed=1 NB=2048 usehuge=0
Board id=0 # boards=1
N=32768 Seed=1 NB=2048 usehuge=0
Board id=0 # boards=1
read/set mat end
copy mat end
before lumcolumn omp_max_threads=24 procs=48
Emax=  6.855e-06
Nswap=0 cpsec =  1752.89 wsec=78.1302 300.22 Gflops
swaprows    time=      1.2731e+09 ops/cycle=0.421703
scalerow    time=     1.54877e+08 ops/cycle=3.46642
trans rtoc  time=     1.15648e+09 ops/cycle=0.46423
trans ctor  time=     7.43654e+08 ops/cycle=0.721936
trans mmul  time=     1.97772e+09 ops/cycle=4.07189
tr nr  cdec time=     1.02064e+09 ops/cycle=0.526014
trans vvmul time=     3.40386e+08 ops/cycle=1.57724
trans findp time=     6.78659e+08 ops/cycle=0.791077
solve tri u time=     3.49471e+09 ops/cycle=9.37646e-06
solve tri   time=     1.67429e+10 ops/cycle=131.341
matmul nk8  time=               0 ops/cycle=inf
matmul snk  time=          618574 ops/cycle=13886.7
trans mmul8 time=     7.82076e+08 ops/cycle=5.49175
trans mmul4 time=      6.9772e+08 ops/cycle=3.07786
trans mmul2 time=      4.9601e+08 ops/cycle=2.16476
DGEMM2K     time=     1.38579e+11 ops/cycle=161.328
DGEMM1K     time=     5.73259e+09 ops/cycle=95.9
DGEMM512    time=     3.96182e+09 ops/cycle=69.3817
DGEMMrest   time=     1.19867e+10 ops/cycle=22.9319
col dec t   time=      4.9462e+09 ops/cycle=3.47335
Total       time=     1.70473e+11 ops/cycle=137.595

Don't just jump straight to 300, lol.

$ OMP_NUM_THREADS=24 numactl --interleave=all ./lu2 -n 65536
main top omp_max_threads=24 procs=48
optchar = n optarg=65536
N=65536 Seed=1 NB=2048 usehuge=0
Board id=0 # boards=1
N=65536 Seed=1 NB=2048 usehuge=0
Board id=0 # boards=1
read/set mat end
copy mat end
before lumcolumn omp_max_threads=24 procs=48
Emax=  1.712e-03
Nswap=0 cpsec =  12019.3 wsec=555.36 337.889 Gflops
swaprows    time=     3.90339e+09 ops/cycle=0.550159
scalerow    time=     3.09347e+08 ops/cycle=6.94199
trans rtoc  time=     4.34846e+09 ops/cycle=0.493849
trans ctor  time=     3.12601e+09 ops/cycle=0.686973
trans mmul  time=     6.34419e+09 ops/cycle=5.07745
tr nr  cdec time=     4.45475e+09 ops/cycle=0.482066
trans vvmul time=     1.32924e+09 ops/cycle=1.61558
trans findp time=       3.121e+09 ops/cycle=0.688075
solve tri u time=     6.51899e+09 ops/cycle=1.00531e-05
solve tri   time=     6.40876e+10 ops/cycle=137.251
matmul nk8  time=               0 ops/cycle=inf
matmul snk  time=          607926 ops/cycle=56519.6
trans mmul8 time=     2.63849e+09 ops/cycle=6.51124
trans mmul4 time=     1.78724e+09 ops/cycle=4.80627
trans mmul2 time=     1.91419e+09 ops/cycle=2.24375
DGEMM2K     time=     1.10841e+12 ops/cycle=165.329
DGEMM1K     time=     1.93481e+10 ops/cycle=113.655
DGEMM512    time=     1.39368e+10 ops/cycle=78.893
DGEMMrest   time=     4.25798e+10 ops/cycle=25.8224
col dec t   time=     1.83769e+10 ops/cycle=3.73944
Total       time=     1.21635e+12 ops/cycle=154.273

Performance improves as the matrix grows. So would it get even better with a still larger one?
Well, that's just how HPL behaves, but bigger runs take forever, which means my wallet dies, so I'd rather not.
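For reference, the Gflops figures lu2 prints are consistent with the standard HPL operation count of (2/3)N³ + 2N², e.g. for the N=65536 run (that lu2 uses exactly this accounting is my assumption):

```python
N = 65536
t_wall = 555.36                          # wsec reported above
flops = (2 / 3) * N ** 3 + 2 * N ** 2    # standard HPL operation count
print(flops / t_wall / 1e9)              # ~337.9 Gflops, vs the printed 337.889
```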

Summary

Now, unusually for me, I hadn't computed this along the way, so let's work out peak performance here. ~The clock shown on the AWS page was 2 GHz, so~
** Corrected 2018/11/09: my eyes were simply failing me; 2.5 GHz is the official figure, so see the second formula below. **

24(Core) * 2(clock) * 2(SIMD) * 2(FMA) * 2(port) = 384GFlops/DP
24(Core) * 2.5(clock) * 2(SIMD) * 2(FMA) * 2(port) = 480GFlops/DP

Assuming 480, the effective efficiencies of DGEMM and SGEMM come to

391.78 / 480 = 81.6%
814.027 / 960 = 84.8%

which look like pretty decent numbers.

For HPL, it comes to

337.889 / 480 = 70.4%

so it falls a bit short; I'd want a little more there.
Other HPL implementations might be more efficient, and I'd like to try them too. But HPL runs take so long that they're really not wallet-friendly...
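Putting the summary arithmetic in one place (the 2 x SIMD, 2 x FMA, 2 x port factors are as in the formula above):

```python
cores, clock_ghz = 24, 2.5
dp_peak = cores * clock_ghz * 2 * 2 * 2   # SIMD * FMA * port, as above
sp_peak = dp_peak * 2                     # single precision: twice the lanes
print(dp_peak, sp_peak)                   # 480.0 960.0

for name, measured, peak in [("DGEMM", 391.78,  dp_peak),
                             ("SGEMM", 814.027, sp_peak),
                             ("HPL",   337.889, dp_peak)]:
    print(f"{name}: {100 * measured / peak:.1f}%")
```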

As for memory bandwidth, there isn't much public information on the SKU AWS uses, but judging from similar SKUs it looks like 8-channel DDR4.
https://en.wikipedia.org/wiki/Epyc

If so, it may actually be losing to 6-channel Skylake... or so... it kind of... feels...
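Back-of-the-envelope numbers, assuming DDR4-2666 on both sides (the actual memory speed of this instance is not published, so that rate is a guess):

```python
def mem_bw_gbs(channels, mts=2666, bytes_per_xfer=8):
    # channels * transfers/sec (millions) * bytes per transfer -> GB/s
    return channels * mts * bytes_per_xfer / 1e3

print(mem_bw_gbs(8))   # EPYC, 8 channels    -> ~170.6 GB/s
print(mem_bw_gbs(6))   # Skylake, 6 channels -> ~128.0 GB/s
# The best STREAM Triad above was ~55 GB/s, well below either figure,
# though virtualization and NUMA effects make a direct comparison rough.
```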

Next I should probably take proper benchmarks on Skylake...
No, before that, I should rerun these on a different EPYC setup.
