A bit late to the party, I know, but I've been running benchmarks on a Threadripper 1950X.
This is a follow-up to
https://qiita.com/telmin_orca/items/2d30323a7c96db929ecf
Hardware information
cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 23
model : 1
model name : AMD Ryzen Threadripper 1950X 16-Core Processor
stepping : 1
microcode : 0x8001129
cpu MHz : 2200.000
cache size : 512 KB
physical id : 0
siblings : 32
core id : 0
cpu cores : 16
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc art rep_good nopl nonstop_tsc extd_apicid amd_dcm aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_l2 hw_pstate avic fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold overflow_recov succor smca
bogomips : 6786.29
TLB size : 2560 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate eff_freq_ro [13] [14]
Memory
Memory Device
Array Handle: 0x0037
Error Information Handle: 0x003F
Total Width: 64 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: DIMM 1
Bank Locator: P0 CHANNEL A
Type: DDR4
Type Detail: Synchronous Unbuffered (Unregistered)
Speed: 3003 MHz
Manufacturer: Unknown
Serial Number: 00000000
Asset Tag: Not Specified
Part Number: CMK32GX4M2B3000C15
Rank: 2
Configured Clock Speed: 1467 MHz
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V
$ cat /proc/meminfo
MemTotal: 65729192 kB
MemFree: 63709928 kB
MemAvailable: 64435684 kB
Buffers: 2264 kB
Cached: 1119860 kB
SwapCached: 0 kB
Active: 651584 kB
Inactive: 563276 kB
Active(anon): 93852 kB
Inactive(anon): 8528 kB
Active(file): 557732 kB
Inactive(file): 554748 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 24420348 kB
SwapFree: 24420348 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 92808 kB
Mapped: 37448 kB
Shmem: 9632 kB
Slab: 267244 kB
SReclaimable: 121256 kB
SUnreclaim: 145988 kB
KernelStack: 8688 kB
PageTables: 6952 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 57284944 kB
Committed_AS: 590544 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 301032 kB
VmallocChunk: 34359310332 kB
HardwareCorrupted: 0 kB
AnonHugePages: 4096 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 232028 kB
DirectMap2M: 4880384 kB
DirectMap1G: 61865984 kB
So this is the box with 64 GB installed.
The rated spec should be DDR4-2400, but... isn't it running faster than that? (The configured clock of 1467 MHz works out to roughly 2933 MT/s, and the part number CMK32GX4M2B3000C15 appears to be a Corsair 3000 kit, so presumably an XMP profile is active.)
Benchmarks
As usual, everything below uses
https://github.com/telmin/YAMADABenchmarkSuite
I really do need to get around to maintaining this properly...
STREAM
$ OMP_NUM_THREADS=16 ./stream_cxx.out -s 500M
-------------------------------------------------------------
STREAM version $Revision : 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array Size = 524288000 (elements), Offset = 0 (elements)
Memory per array = 4000 MiB (= 3.90625 GiB).
Total Memory required = 12000 MiB (= 11.7188 GiB).
Each kernel will be executed 10 times.
Function Best Rate MB/s Avg time Min time Max time
Copy: 44463.4 0.188855 0.188663 0.189471
Scale: 44171.1 0.190106 0.189912 0.190254
Add: 47620.6 0.264557 0.264233 0.265262
Triad: 47695.5 0.263960 0.263818 0.264307
-------------------------------------------------------------
$ OMP_NUM_THREADS=32 ./stream_cxx.out -s 500M
-------------------------------------------------------------
STREAM version $Revision : 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array Size = 524288000 (elements), Offset = 0 (elements)
Memory per array = 4000 MiB (= 3.90625 GiB).
Total Memory required = 12000 MiB (= 11.7188 GiB).
Each kernel will be executed 10 times.
Function Best Rate MB/s Avg time Min time Max time
Copy: 42352.1 0.198253 0.198068 0.198475
Scale: 42183.2 0.199021 0.198861 0.199232
Add: 45896.0 0.274497 0.274161 0.275280
Triad: 46188.5 0.273190 0.272425 0.273858
-------------------------------------------------------------
Compared with the Ryzen results this is roughly double, and the channel count is also doubled, so the numbers are pretty much what you'd expect.
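For reference, the Triad kernel behind these numbers boils down to the loop below. This is a minimal sketch, not the suite's exact code (the array size and the single timed pass are my simplifications); bandwidth is counted as three 8-byte streams per element, the usual STREAM convention.

```cpp
// Minimal STREAM-Triad sketch (not the suite's exact code). Build: g++ -O2 -fopenmp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t N = 1u << 26;  // ~64M elements per array (assumed size)
    const double scalar = 3.0;
    std::vector<double> a(N), b(N, 1.0), c(N, 2.0);

    auto t0 = std::chrono::steady_clock::now();
#pragma omp parallel for
    for (std::size_t i = 0; i < N; ++i)
        a[i] = b[i] + scalar * c[i];  // Triad: 2 loads + 1 store per element
    auto t1 = std::chrono::steady_clock::now();

    const double sec = std::chrono::duration<double>(t1 - t0).count();
    // 3 arrays touched, 8 bytes each; STREAM counts MB as 1e6 bytes.
    std::printf("Triad: %.1f MB/s\n", 3.0 * N * sizeof(double) / sec / 1e6);
    return 0;
}
```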
FFTW
3-D FFT
$ OMP_NUM_THREADS=16 ./fftw
3-D FFT: 128 x 128 x 128
On-board: 0.039270 msec, 5.607428 GFLOPS.
On-board: 0.037642 msec, 5.849856 GFLOPS.
3-D FFT: 256 x 256 x 256
On-board: 0.704716 msec, 2.856847 GFLOPS.
On-board: 0.120807 msec, 16.665106 GFLOPS.
$ OMP_NUM_THREADS=32 ./fftw
3-D FFT: 128 x 128 x 128
On-board: 0.023219 msec, 9.483667 GFLOPS.
On-board: 0.024668 msec, 8.926539 GFLOPS.
3-D FFT: 256 x 256 x 256
On-board: 0.302052 msec, 6.665296 GFLOPS.
On-board: 0.108893 msec, 18.488524 GFLOPS.
...Now this is a rather interesting result, I think.
Threadripper is built as 16 physical cores exposing 32 logical threads, so 16 OpenMP threads should be enough to pull full performance out of it.
And indeed STREAM does pull full bandwidth at 16 threads, yet the FFT doesn't deliver there. Does that mean the bottleneck is on the compute side rather than the memory side?
The execution units just aren't getting filled? Hmm.
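For context, what's being timed is essentially fftw_execute on a threaded 3-D plan, roughly as in this sketch (my reconstruction, not the suite's exact code; I'm assuming the GFLOPS figure uses the standard 5N·log2(N) operation count for a complex transform):

```cpp
// Threaded 3-D FFTW timing sketch (reconstruction, not the suite's code).
// Build: g++ -O2 -fopenmp fft_sketch.cpp -lfftw3_omp -lfftw3
#include <fftw3.h>
#include <omp.h>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdio>

int main() {
    const int n = 256;  // per-dimension size
    fftw_init_threads();
    fftw_plan_with_nthreads(omp_get_max_threads());  // honors OMP_NUM_THREADS

    const std::size_t total = (std::size_t)n * n * n;
    fftw_complex* buf = fftw_alloc_complex(total);
    fftw_plan plan = fftw_plan_dft_3d(n, n, n, buf, buf, FFTW_FORWARD, FFTW_MEASURE);
    for (std::size_t i = 0; i < total; ++i)
        buf[i][0] = buf[i][1] = 0.0;  // FFTW_MEASURE clobbers the buffer; reinitialize

    auto t0 = std::chrono::steady_clock::now();
    fftw_execute(plan);
    auto t1 = std::chrono::steady_clock::now();

    const double sec = std::chrono::duration<double>(t1 - t0).count();
    const double flops = 5.0 * total * std::log2((double)total);  // usual estimate
    std::printf("%d^3: %f msec, %f GFLOPS\n", n, sec * 1e3, flops / sec / 1e9);

    fftw_destroy_plan(plan);
    fftw_free(buf);
    fftw_cleanup_threads();
    return 0;
}
```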
xGEMM
DGEMM and SGEMM.
I run one moderately small matrix size and one large one; what gets timed boils down to a single cblas_*gemm call, as in the sketch below.
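A minimal sketch of the DGEMM side, assuming a CBLAS/OpenBLAS backend (my reconstruction, not the suite's exact code):

```cpp
// DGEMM timing sketch against CBLAS/OpenBLAS (reconstruction).
// Build: g++ -O2 gemm_sketch.cpp -lopenblas
#include <cblas.h>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const int M = 2048, N = 2048, K = 2048;
    const double alpha = -1.0, beta = 1.0;  // matches the "al -1 b 1" in the log
    std::vector<double> A((std::size_t)M * K, 1.0);
    std::vector<double> B((std::size_t)K * N, 1.0);
    std::vector<double> C((std::size_t)M * N, 0.0);

    auto t0 = std::chrono::steady_clock::now();
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K, alpha, A.data(), K, B.data(), N, beta, C.data(), N);
    auto t1 = std::chrono::steady_clock::now();

    const double sec = std::chrono::duration<double>(t1 - t0).count();
    std::printf("%.3f GFlops\n", 2.0 * M * N * K / sec / 1e9);  // FMA = 2 flops
    return 0;
}
```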
$ OMP_NUM_THREADS=16 ./gemms
M 2048 N 2048 K 2048 al -1 b 1
Dgemm start
memory use 0.09375 GB
23:33:30 initialize done.
355.006 GFlops, 5.40285
Sgemm start
23:34:0 initialize done.
732.785 GFlops, 46.0008
$ OMP_NUM_THREADS=32 ./gemms
M 2048 N 2048 K 2048 al -1 b 1
Dgemm start
memory use 0.09375 GB
23:34:58 initialize done.
287.46 GFlops, 31.8844
Sgemm start
23:35:28 initialize done.
526.99 GFlops, 26.8436
Doing a rough peak-performance estimate (yes, rough again...), the 1950X comes out to
16 (cores) * 3.75 (GHz) * 2 (DP elements per 128-bit unit) * 2 (FMA) * 2 (pipes) = 480 GFlops in double precision,
so at 16 threads DGEMM reaches about 74.0% of peak and SGEMM about 76.3% (against the SP peak of 960 GFlops).
With large matrices it looks like this:
$ OMP_NUM_THREADS=16 ./gemms
M 10240 N 10240 K 10240 al -1 b 1
Dgemm start
memory use 2.34375 GB
1:32:53 initialize done.
410.368 GFlops, 16.3553
Sgemm start
1:33:55 initialize done.
792.339 GFlops, 82.0971
$ OMP_NUM_THREADS=32 ./gemms
M 10240 N 10240 K 10240 al -1 b 1
Dgemm start
memory use 2.34375 GB
1:34:59 initialize done.
303.08 GFlops, 43.8209
Sgemm start
1:36:6 initialize done.
637.024 GFlops, 97.3738
And so on.
Quite interesting: at this size DGEMM efficiency climbs to about 85%, and SGEMM to 82.5%.
HPL
N=32768, 16 threads
$ OMP_NUM_THREADS=16 ./lu2 -n 32768
Nswap=0 cpsec = 1251.89 wsec=78.6553 298.216 Gflops
swaprows time= 1.57564e+09 ops/cycle=0.340733
scalerow time= 1.81038e+08 ops/cycle=2.96552
trans rtoc time= 1.59613e+09 ops/cycle=0.336357
trans ctor time= 1.41867e+09 ops/cycle=0.378433
trans mmul time= 3.37718e+09 ops/cycle=2.38455
tr nr cdec time= 8.83348e+08 ops/cycle=0.607768
trans vvmul time= 3.44348e+08 ops/cycle=1.55909
trans findp time= 5.40833e+08 ops/cycle=0.992674
solve tri u time= 3.80403e+09 ops/cycle=8.61403e-06
solve tri time= 2.40074e+10 ops/cycle=91.5977
matmul nk8 time= 0 ops/cycle=inf
matmul snk time= 147560 ops/cycle=58213.2
trans mmul8 time= 1.59286e+09 ops/cycle=2.69638
trans mmul4 time= 1.2552e+09 ops/cycle=1.71088
trans mmul2 time= 5.26504e+08 ops/cycle=2.03938
DGEMM2K time= 2.25837e+11 ops/cycle=98.9952
DGEMM1K time= 7.22312e+09 ops/cycle=76.1106
DGEMM512 time= 4.39899e+09 ops/cycle=62.4866
DGEMMrest time= 1.48088e+10 ops/cycle=18.5618
col dec t time= 7.3319e+09 ops/cycle=2.34317
Total time= 2.65152e+11 ops/cycle=88.4635
N=65536, 16 threads
$ OMP_NUM_THREADS=16 ./lu2 -n 65536
Nswap=0 cpsec = 12910.4 wsec=811.004 231.38 Gflops
swaprows time= 5.47763e+09 ops/cycle=0.392046
scalerow time= 3.57613e+08 ops/cycle=6.00505
trans rtoc time= 7.23227e+09 ops/cycle=0.296931
trans ctor time= 5.90193e+09 ops/cycle=0.363861
trans mmul time= 1.13457e+10 ops/cycle=2.83917
tr nr cdec time= 4.36849e+09 ops/cycle=0.491585
trans vvmul time= 1.4129e+09 ops/cycle=1.51991
trans findp time= 2.95845e+09 ops/cycle=0.725882
solve tri u time= 9.05843e+09 ops/cycle=7.2348e-06
solve tri time= 1.32972e+11 ops/cycle=66.1498
matmul nk8 time= 0 ops/cycle=inf
matmul snk time= 143446 ops/cycle=239531
trans mmul8 time= 5.62061e+09 ops/cycle=3.05658
trans mmul4 time= 3.51078e+09 ops/cycle=2.44673
trans mmul2 time= 2.20799e+09 ops/cycle=1.94519
DGEMM2K time= 2.57941e+12 ops/cycle=71.044
DGEMM1K time= 3.71842e+10 ops/cycle=59.1387
DGEMM512 time= 2.26805e+10 ops/cycle=48.4783
DGEMMrest time= 5.97308e+10 ops/cycle=18.4078
col dec t time= 2.8972e+10 ops/cycle=2.37192
Total time= 2.74448e+12 ops/cycle=68.3736
HPL, as a rule, should get faster the more memory you give it...
Hmm, so why is performance dropping?
At this point HPL sits at around 60% efficiency?
Without properly hunting down the bottleneck this won't amount to much of an evaluation... let's call it a TODO...
Says the guy who never does the things he calls TODOs, so: actually maintain the thing, self...
N=32768, 32 threads
$ OMP_NUM_THREADS=32 ./lu2 -n 32768
Nswap=0 cpsec = 3681.38 wsec=156.199 150.169 Gflops
swaprows time= 2.23137e+09 ops/cycle=0.240601
scalerow time= 2.23747e+08 ops/cycle=2.39945
trans rtoc time= 3.17154e+10 ops/cycle=0.0169278
trans ctor time= 3.92966e+10 ops/cycle=0.013662
trans mmul time= 1.15372e+11 ops/cycle=0.0698011
tr nr cdec time= 3.71547e+09 ops/cycle=0.144496
trans vvmul time= 1.74624e+09 ops/cycle=0.307443
trans findp time= 1.9485e+09 ops/cycle=0.27553
solve tri u time= 6.98219e+09 ops/cycle=4.69308e-06
solve tri time= 3.22386e+10 ops/cycle=68.2109
matmul nk8 time= 0 ops/cycle=inf
matmul snk time= 7.68726e+06 ops/cycle=1117.42
trans mmul8 time= 3.76865e+10 ops/cycle=0.113966
trans mmul4 time= 7.5779e+10 ops/cycle=0.0283388
trans mmul2 time= 1.90277e+09 ops/cycle=0.564306
DGEMM2K time= 2.85387e+11 ops/cycle=78.3382
DGEMM1K time= 1.1104e+10 ops/cycle=49.5096
DGEMM512 time= 7.12292e+09 ops/cycle=38.5906
DGEMMrest time= 2.64021e+10 ops/cycle=10.4112
col dec t time= 1.90337e+11 ops/cycle=0.0902601
Total time= 5.28254e+11 ops/cycle=44.4033
True to the DGEMM trend, using 32 threads drops performance to roughly half.
What bothers me is that while DGEMM itself shows something like 85% efficiency, this run comes in at 298 GFlops, i.e. about 62%, an oddly middling number.
Hmm... nothing in particular comes to mind just from staring at these figures...
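One thing worth ruling out when 32 threads halve throughput is whether pairs of OpenMP threads land on SMT siblings of the same physical core. A quick check, my own sketch rather than anything in the suite:

```cpp
// Print which logical CPU each OpenMP thread runs on (sketch for checking
// whether two threads share a physical core). Build: g++ -O2 -fopenmp
#include <omp.h>
#include <sched.h>
#include <cstdio>

int main() {
#pragma omp parallel
    {
        // On Linux, sched_getcpu() returns the logical CPU id; /proc/cpuinfo's
        // "core id" field maps that back to a physical core.
        std::printf("thread %2d -> cpu %2d\n", omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}
```

Pinning with the standard OpenMP environment variables (e.g. OMP_PLACES=cores OMP_PROC_BIND=close) is one way to keep 16 threads on 16 distinct physical cores.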
One thing that does catch my attention, though:
$ for((i=0;i<5;++i)); do ./gemms; done
M 10240 N 10240 K 10240 al -1 b 1
Dgemm start
memory use 2.34375 GB
18:25:4 initialize done.
418.267 GFlops, 9.5063
Sgemm start
18:26:6 initialize done.
824.668 GFlops, 68.7114
M 10240 N 10240 K 10240 al -1 b 1
Dgemm start
memory use 2.34375 GB
18:26:40 initialize done.
326.868 GFlops, 21.1696
Sgemm start
18:27:44 initialize done.
727.008 GFlops, 96.0577
M 10240 N 10240 K 10240 al -1 b 1
Dgemm start
memory use 2.34375 GB
18:28:16 initialize done.
291.473 GFlops, 14.6529
Sgemm start
18:29:24 initialize done.
669.241 GFlops, 98.7481
M 10240 N 10240 K 10240 al -1 b 1
Dgemm start
memory use 2.34375 GB
18:29:59 initialize done.
275.481 GFlops, 11.3038
Sgemm start
18:31:1 initialize done.
650.018 GFlops, 97.4159
M 10240 N 10240 K 10240 al -1 b 1
Dgemm start
memory use 2.34375 GB
18:31:33 initialize done.
268.727 GFlops, 8.76842
Sgemm start
18:32:36 initialize done.
637.529 GFlops, 97.3936
...running gemm back to back like this, performance quietly degrades run after run.
I'd like to blame heat, but for whatever reason lm_sensors doesn't work properly on this machine, so I can't read the temperatures... (the end)
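Since lm_sensors is a no-go, one indirect way to check for thermal or power throttling would be to watch the effective clock through the standard Linux cpufreq sysfs node while the GEMM loop runs (a sketch; whether the node is present depends on the cpufreq driver):

```cpp
// Poll the current frequency of CPU 0 once a second (sketch).
// Reads the standard Linux cpufreq sysfs node; a sagging value during a
// long GEMM run would point at thermal or power throttling.
#include <chrono>
#include <fstream>
#include <iostream>
#include <string>
#include <thread>

int main() {
    const std::string node =
        "/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq";
    for (int i = 0; i < 60; ++i) {  // watch for one minute
        std::ifstream f(node);      // reopen each time: sysfs values are snapshots
        long khz = 0;
        if (f >> khz)
            std::cout << khz / 1000.0 << " MHz" << std::endl;
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
    return 0;
}
```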
Chainer
Well, I figured I should probably try this trendy deep learning stuff too... something like that... (total bandwagoner)
I used Chainer (https://github.com/chainer/chainer), the HEAD of master.
The environment was built with pip for the most part, except numpy, which I rebuilt myself so it would use my own OpenBLAS build.
$ pip freeze
chainer==6.0.0a1
Cython==0.29
filelock==3.0.10
numpy==1.16.0.dev0+c41c011
Pillow==5.3.0
protobuf==3.6.1
six==1.11.0
From the examples: MNIST and imagenet (alexnet, resnet50).
- MNIST
epoch main/loss validation/main/loss main/accuracy validation/main/accuracy elapsed_time
1 0.191196 0.0977384 0.941817 0.9699 11.0106
2 0.0741213 0.0782149 0.97685 0.9749 21.9254
3 0.0490689 0.0800515 0.9849 0.9761 32.7277
4 0.0383919 0.0716018 0.98775 0.977 43.4329
5 0.0272046 0.0785329 0.990817 0.9795 54.1377
6 0.0265784 0.080272 0.991533 0.9799 64.8223
7 0.0199796 0.0873966 0.9934 0.9818 75.5063
8 0.0199766 0.102661 0.99375 0.9741 86.1714
9 0.0156266 0.0917326 0.994917 0.9786 96.8369
10 0.0138881 0.0876374 0.995883 0.9801 107.496
11 0.0155263 0.100137 0.995417 0.9781 118.157
12 0.0130711 0.0894545 0.995933 0.9806 128.811
13 0.0115755 0.0894831 0.9964 0.9826 139.475
14 0.0104541 0.0842216 0.997 0.983 150.129
15 0.0129433 0.0979006 0.996133 0.9825 160.79
16 0.00923257 0.0886761 0.997183 0.9827 171.477
17 0.00851278 0.100802 0.99765 0.9816 182.287
18 0.00883718 0.106414 0.997533 0.9794 193.263
19 0.0105243 0.124271 0.996833 0.978 204.412
20 0.00644211 0.126378 0.998083 0.979 215.717
For alexnet and resnet50, 20 epochs looked like it would take forever, so I set the log interval to every 10 iterations and report the time until one epoch finishes.
- alexnet
$ python train_imagenet.py -a alex -E 1 ../../../chainer_imagenet_tools/train.txt ../../../chainer_imagenet_tools/test.txt
epoch iteration main/loss validation/main/loss main/accuracy validation/main/accuracy lr elapsed_time
0 10 6.88606 0.034375 0.01 41.3393
0 20 6.27314 0.078125 0.01 79.9324
0 30 5.35235 0.06875 0.01 118.542
0 40 5.14217 0.053125 0.01 156.84
0 50 4.69628 0.0875 0.01 194.714
0 60 4.7203 0.078125 0.01 232.382
0 70 4.63171 0.053125 0.01 270.075
0 80 4.46738 0.0875 0.01 307.801
0 90 4.54931 0.0375 0.01 345.483
0 100 4.5486 0.109375 0.01 383.219
0 110 4.36946 0.1 0.01 420.975
0 120 4.28879 0.10625 0.01 458.817
0 130 4.40169 0.115625 0.01 496.668
0 140 4.43244 0.1375 0.01 534.556
0 150 4.31911 0.125 0.01 572.412
0 160 4.30947 0.15625 0.01 610.293
0 170 4.37239 0.11875 0.01 648.203
0 180 4.18134 0.15 0.01 686.034
0 190 4.28241 0.146875 0.01 723.877
0 200 4.06563 0.184375 0.01 761.713
0 210 3.97252 0.203125 0.01 799.537
- resnet50
$ python train_imagenet.py -a resnet50 -E 1 ../../../chainer_imagenet_tools/train.txt ../../../chainer_imagenet_tools/test.txt
epoch iteration main/loss validation/main/loss main/accuracy validation/main/accuracy lr elapsed_time
0 10 7.64181 0.046875 0.01 167.915
0 20 7.41486 0.084375 0.01 332.973
0 30 6.54101 0.053125 0.01 499.218
0 40 6.55887 0.1125 0.01 665.126
0 50 6.06997 0.1 0.01 833.873
0 60 5.465 0.128125 0.01 1003.78
0 70 4.97667 0.121875 0.01 1173.57
0 80 4.56097 0.178125 0.01 1342.09
0 90 4.54822 0.1875 0.01 1507.72
0 100 4.60423 0.18125 0.01 1673.12
0 110 4.25759 0.16875 0.01 1838.45
0 120 4.23644 0.184375 0.01 2004.29
0 130 4.35106 0.190625 0.01 2172.86
0 140 4.20861 0.178125 0.01 2342.25
0 150 4.33225 0.146875 0.01 2509.55
0 160 3.9726 0.19375 0.01 2675.24
0 170 3.70222 0.23125 0.01 2843.33
0 180 3.86638 0.228125 0.01 3011.9
0 190 3.76488 0.228125 0.01 3178.59
0 200 3.79957 0.240625 0.01 3344.85
0 210 3.83217 0.209375 0.01 3509.89
I'm firmly in the don't-understand-deep-learning camp, so if anything here is unreasonably slow, please point it out.
Bonus
I also grabbed instruction-level measurements, using
https://github.com/tanakamura/instruction-bench
Only a proper degenerate gets excited looking at this stuff (?
AMD Ryzen Threadripper 1950X 16-Core Processor
== latency/throughput ==
reg64: add: latency: CPI= 1.00, IPC= 1.00
reg64: add:throughput: CPI= 0.26, IPC= 3.88
reg64: lea: latency: CPI= 1.00, IPC= 1.00
reg64: lea:throughput: CPI= 0.26, IPC= 3.88
reg64: xor dst,dst: latency: CPI= 0.26, IPC= 3.88
reg64: xor dst,dst:throughput: CPI= 0.26, IPC= 3.88
reg64: xor: latency: CPI= 0.26, IPC= 3.88
reg64: xor:throughput: CPI= 0.26, IPC= 3.88
reg64: load: latency: CPI= 4.00, IPC= 0.25
reg64: load:throughput: CPI= 0.63, IPC= 1.60
reg64: crc32: latency: CPI= 3.02, IPC= 0.33
reg64: crc32:throughput: CPI= 3.02, IPC= 0.33
reg64: store [mem+0]->load[mem+0]: latency: CPI= 38.15, IPC= 0.03
reg64: store [mem+0]->load[mem+0]:throughput: CPI= 3.17, IPC= 0.32
reg64: store [mem+0]->load[mem+1]: latency: CPI= 36.97, IPC= 0.03
reg64: store [mem+0]->load[mem+1]:throughput: CPI= 14.01, IPC= 0.07
m128: pxor: latency: CPI= 0.25, IPC= 4.00
m128: pxor:throughput: CPI= 0.25, IPC= 4.00
m128: padd: latency: CPI= 1.00, IPC= 1.00
m128: padd:throughput: CPI= 0.33, IPC= 3.00
m128: pmuldq: latency: CPI= 3.00, IPC= 0.33
m128: pmuldq:throughput: CPI= 1.00, IPC= 1.00
m128: loadps:throughput: CPI= 0.50, IPC= 2.00
m128: loadps->movq: latency: CPI= 9.00, IPC= 0.11
m128: movq->movq: latency: CPI= 6.00, IPC= 0.17
m128: movq->movq:throughput: CPI= 1.00, IPC= 1.00
m128: xorps: latency: CPI= 0.25, IPC= 4.00
m128: xorps:throughput: CPI= 0.25, IPC= 4.00
m128: addps: latency: CPI= 3.00, IPC= 0.33
m128: addps:throughput: CPI= 0.50, IPC= 2.00
m128: mulps: latency: CPI= 3.00, IPC= 0.33
m128: mulps:throughput: CPI= 0.50, IPC= 2.00
m128: divps: latency: CPI= 10.00, IPC= 0.10
m128: divps:throughput: CPI= 3.00, IPC= 0.33
m128: divpd: latency: CPI= 8.00, IPC= 0.12
m128: divpd:throughput: CPI= 4.00, IPC= 0.25
m128: rsqrtps: latency: CPI= 5.00, IPC= 0.20
m128: rsqrtps:throughput: CPI= 1.00, IPC= 1.00
m128: rcpps: latency: CPI= 5.00, IPC= 0.20
m128: rcpps:throughput: CPI= 1.00, IPC= 1.00
m128: blendps: latency: CPI= 1.00, IPC= 1.00
m128: blendps:throughput: CPI= 0.50, IPC= 2.00
m128: blendvps: latency: CPI= 1.00, IPC= 1.00
m128: blendvps:throughput: CPI= 0.50, IPC= 2.00
m128: pshufb: latency: CPI= 1.00, IPC= 1.00
m128: pshufb:throughput: CPI= 0.50, IPC= 2.00
m128: shufps: latency: CPI= 1.00, IPC= 1.00
m128: shufps:throughput: CPI= 0.50, IPC= 2.00
m128: pmullw: latency: CPI= 3.00, IPC= 0.33
m128: pmullw:throughput: CPI= 1.00, IPC= 1.00
m128: phaddd: latency: CPI= 2.00, IPC= 0.50
m128: phaddd:throughput: CPI= 2.00, IPC= 0.50
m128: haddps: latency: CPI= 2.00, IPC= 0.50
m128: haddps:throughput: CPI= 2.00, IPC= 0.50
m128: pinsrd: latency: CPI= 1.67, IPC= 0.60
m128: pinsrd:throughput: CPI= 1.31, IPC= 0.77
m128: pinsrd->pexr: latency: CPI= 8.00, IPC= 0.12
m128: dpps: latency: CPI= 15.00, IPC= 0.07
m128: dpps:throughput: CPI= 4.00, IPC= 0.25
m128: cvtps2dq: latency: CPI= 4.00, IPC= 0.25
m128: cvtps2dq:throughput: CPI= 1.00, IPC= 1.00
reg64: popcnt: latency: CPI= 1.00, IPC= 1.00
reg64: popcnt:throughput: CPI= 0.26, IPC= 3.88
m128: aesenc: latency: CPI= 4.00, IPC= 0.25
m128: aesenc:throughput: CPI= 0.50, IPC= 2.00
m128: aesenclast: latency: CPI= 4.00, IPC= 0.25
m128: aesenclast:throughput: CPI= 0.50, IPC= 2.00
m128: aesdec: latency: CPI= 4.00, IPC= 0.25
m128: aesdec:throughput: CPI= 0.50, IPC= 2.00
m128: aesdeclast: latency: CPI= 4.00, IPC= 0.25
m128: aesdeclast:throughput: CPI= 0.50, IPC= 2.00
m256: movaps [mem]: latency: CPI= 1.00, IPC= 1.00
m256: movaps [mem]:throughput: CPI= 1.00, IPC= 1.00
m256: vmovdqu [mem+1]: latency: CPI= 1.50, IPC= 0.67
m256: vmovdqu [mem+1]:throughput: CPI= 1.50, IPC= 0.67
m256: vmovdqu [mem+63] (cross cache): latency: CPI= 1.50, IPC= 0.67
m256: vmovdqu [mem+63] (cross cache):throughput: CPI= 1.50, IPC= 0.67
m256: vmovdqu [mem+2MB-1] (cross page): latency: CPI= 1.50, IPC= 0.67
m256: vmovdqu [mem+2MB-1] (cross page):throughput: CPI= 1.50, IPC= 0.67
m256: xorps: latency: CPI= 0.50, IPC= 2.00
m256: xorps:throughput: CPI= 0.50, IPC= 2.00
m256: mulps: latency: CPI= 3.00, IPC= 0.33
m256: mulps:throughput: CPI= 1.00, IPC= 1.00
m256: addps: latency: CPI= 3.00, IPC= 0.33
m256: addps:throughput: CPI= 1.00, IPC= 1.00
m256: divps: latency: CPI= 10.00, IPC= 0.10
m256: divps:throughput: CPI= 6.00, IPC= 0.17
m256: divpd: latency: CPI= 8.00, IPC= 0.12
m256: divpd:throughput: CPI= 8.00, IPC= 0.12
m256: rsqrtps: latency: CPI= 5.00, IPC= 0.20
m256: rsqrtps:throughput: CPI= 2.00, IPC= 0.50
m256: rcpps: latency: CPI= 5.00, IPC= 0.20
m256: rcpps:throughput: CPI= 2.00, IPC= 0.50
m256: sqrtps: latency: CPI= 8.00, IPC= 0.12
m256: sqrtps:throughput: CPI= 8.00, IPC= 0.12
m256: vperm2f128: latency: CPI= 3.00, IPC= 0.33
m256: vperm2f128:throughput: CPI= 3.00, IPC= 0.33
m256: pxor: latency: CPI= 0.50, IPC= 2.00
m256: pxor:throughput: CPI= 0.50, IPC= 2.00
m256: paddd: latency: CPI= 1.00, IPC= 1.00
m256: paddd:throughput: CPI= 0.67, IPC= 1.50
m256: vpermps: latency: CPI= 5.00, IPC= 0.20
m256: vpermps:throughput: CPI= 4.00, IPC= 0.25
m256: vpermpd: latency: CPI= 2.00, IPC= 0.50
m256: vpermpd:throughput: CPI= 2.00, IPC= 0.50
m256: vpmovsxwd: latency: CPI= 2.00, IPC= 0.50
m256: vpmovsxwd:throughput: CPI= 2.00, IPC= 0.50
m256: vpgatherdd: latency: CPI= 20.81, IPC= 0.05
m256: vpgatherdd:throughput: CPI= 20.00, IPC= 0.05
m256: gather32(<ld+ins>x8 + perm): latency: CPI= 17.39, IPC= 0.06
m256: gather32(<ld+ins>x8 + perm):throughput: CPI= 5.03, IPC= 0.20
m256: vgatherdpd: latency: CPI= 15.69, IPC= 0.06
m256: vgatherdpd:throughput: CPI= 12.00, IPC= 0.08
m256: gather64(<ld+ins>x4 + perm): latency: CPI= 13.01, IPC= 0.08
m256: gather64(<ld+ins>x4 + perm):throughput: CPI= 3.03, IPC= 0.33
m256: vpshufb: latency: CPI= 1.00, IPC= 1.00
m256: vpshufb:throughput: CPI= 1.00, IPC= 1.00
m256: vfmaps: latency: CPI= 5.00, IPC= 0.20
m256: vfmaps:throughput: CPI= 1.00, IPC= 1.00
m256: vfmapd: latency: CPI= 5.00, IPC= 0.20
m256: vfmapd:throughput: CPI= 1.00, IPC= 1.00
m128: vfmaps: latency: CPI= 5.00, IPC= 0.20
m128: vfmaps:throughput: CPI= 0.50, IPC= 2.00
m128: vfmapd: latency: CPI= 5.00, IPC= 0.20
m128: vfmapd:throughput: CPI= 0.50, IPC= 2.00
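For the curious: the latency/throughput split above is obtained by timing a dependent chain of an instruction (latency) versus several independent chains (throughput). A crude sketch of the idea, much less careful than instruction-bench itself (rdtsc counts reference cycles, so turbo skews the absolute numbers):

```cpp
// Crude latency/throughput measurement for 64-bit add (sketch; instruction-bench
// is far more careful about serialization and clock measurement).
// Build: g++ -O2 bench_add.cpp
#include <cstdint>
#include <cstdio>
#include <x86intrin.h>

int main() {
    const int64_t iters = 100000000;

    // Latency: each add depends on the previous one, so cycles/add ~= latency.
    uint64_t x = 1;
    uint64_t t0 = __rdtsc();
    for (int64_t i = 0; i < iters; ++i)
        asm volatile("add $1, %0" : "+r"(x));
    uint64_t t1 = __rdtsc();
    std::printf("add latency    CPI= %.2f\n", double(t1 - t0) / iters);

    // Throughput: four independent chains let the scheduler spread adds across
    // ALU ports, so CPI approaches 1/ports (loop overhead inflates it a bit).
    uint64_t a = 1, b = 1, c = 1, d = 1;
    t0 = __rdtsc();
    for (int64_t i = 0; i < iters; ++i)
        asm volatile("add $1, %0\n\tadd $1, %1\n\tadd $1, %2\n\tadd $1, %3"
                     : "+r"(a), "+r"(b), "+r"(c), "+r"(d));
    t1 = __rdtsc();
    std::printf("add throughput CPI= %.2f\n", double(t1 - t0) / (iters * 4));
    return (int)(x + a + b + c + d);  // keep the results live
}
```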
Summary
I gave the Threadripper 1950X a quick and rather sloppy performance workout.
This post is essentially a low-effort ramble, I admit.
The article itself doesn't mean much, but consider it groundwork for measuring EPYC...