A bit late to the party, I know, but I've been running benchmarks on a Threadripper 1950X.
This is a follow-up to
https://qiita.com/telmin_orca/items/2d30323a7c96db929ecf
Hardware information
cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 23
model : 1
model name : AMD Ryzen Threadripper 1950X 16-Core Processor
stepping : 1
microcode : 0x8001129
cpu MHz : 2200.000
cache size : 512 KB
physical id : 0
siblings : 32
core id : 0
cpu cores : 16
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc art rep_good nopl nonstop_tsc extd_apicid amd_dcm aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_l2 hw_pstate avic fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold overflow_recov succor smca
bogomips : 6786.29
TLB size : 2560 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate eff_freq_ro [13] [14]
Memory
Memory Device
Array Handle: 0x0037
Error Information Handle: 0x003F
Total Width: 64 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: DIMM 1
Bank Locator: P0 CHANNEL A
Type: DDR4
Type Detail: Synchronous Unbuffered (Unregistered)
Speed: 3003 MHz
Manufacturer: Unknown
Serial Number: 00000000
Asset Tag: Not Specified
Part Number: CMK32GX4M2B3000C15
Rank: 2
Configured Clock Speed: 1467 MHz
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V
$ cat /proc/meminfo
MemTotal: 65729192 kB
MemFree: 63709928 kB
MemAvailable: 64435684 kB
Buffers: 2264 kB
Cached: 1119860 kB
SwapCached: 0 kB
Active: 651584 kB
Inactive: 563276 kB
Active(anon): 93852 kB
Inactive(anon): 8528 kB
Active(file): 557732 kB
Inactive(file): 554748 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 24420348 kB
SwapFree: 24420348 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 92808 kB
Mapped: 37448 kB
Shmem: 9632 kB
Slab: 267244 kB
SReclaimable: 121256 kB
SUnreclaim: 145988 kB
KernelStack: 8688 kB
PageTables: 6952 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 57284944 kB
Committed_AS: 590544 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 301032 kB
VmallocChunk: 34359310332 kB
HardwareCorrupted: 0 kB
AnonHugePages: 4096 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 232028 kB
DirectMap2M: 4880384 kB
DirectMap1G: 61865984 kB
So this is the box with 64 GB installed.
The rated spec should be DDR4-2400, but... isn't it running faster than that? (The configured clock of 1467 MHz works out to roughly 2933 MT/s, and the part number CMK32GX4M2B3000C15 appears to be a Corsair 3000 kit, so presumably an XMP profile is active.)
Benchmarks
As usual, everything below uses
https://github.com/telmin/YAMADABenchmarkSuite
I really do need to get around to maintaining this properly...
STREAM
$ OMP_NUM_THREADS=16 ./stream_cxx.out -s 500M
-------------------------------------------------------------
STREAM version $Revision : 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array Size = 524288000 (elements), Offset = 0 (elements)
Memory per array = 4000 MiB (= 3.90625 GiB).
Total Memory required = 12000 MiB (= 11.7188 GiB).
Each kernel will be executed 10 times.
Function Best Rate MB/s Avg time Min time Max time
Copy: 44463.4 0.188855 0.188663 0.189471
Scale: 44171.1 0.190106 0.189912 0.190254
Add: 47620.6 0.264557 0.264233 0.265262
Triad: 47695.5 0.263960 0.263818 0.264307
-------------------------------------------------------------
$ OMP_NUM_THREADS=32 ./stream_cxx.out -s 500M
-------------------------------------------------------------
STREAM version $Revision : 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array Size = 524288000 (elements), Offset = 0 (elements)
Memory per array = 4000 MiB (= 3.90625 GiB).
Total Memory required = 12000 MiB (= 11.7188 GiB).
Each kernel will be executed 10 times.
Function Best Rate MB/s Avg time Min time Max time
Copy: 42352.1 0.198253 0.198068 0.198475
Scale: 42183.2 0.199021 0.198861 0.199232
Add: 45896.0 0.274497 0.274161 0.275280
Triad: 46188.5 0.273190 0.272425 0.273858
-------------------------------------------------------------
Compared with the Ryzen results this is roughly double, and the channel count is also doubled, so the numbers are pretty much what you'd expect.
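For reference, the Triad kernel behind these numbers boils down to the loop below. This is a minimal sketch, not the suite's exact code (the array size and the single timed pass are my simplifications); bandwidth is counted as three 8-byte streams per element, the usual STREAM convention.

```cpp
// Minimal STREAM-Triad sketch (not the suite's exact code). Build: g++ -O2 -fopenmp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t N = 1u << 26;  // ~64M elements per array (assumed size)
    const double scalar = 3.0;
    std::vector<double> a(N), b(N, 1.0), c(N, 2.0);

    auto t0 = std::chrono::steady_clock::now();
#pragma omp parallel for
    for (std::size_t i = 0; i < N; ++i)
        a[i] = b[i] + scalar * c[i];  // Triad: 2 loads + 1 store per element
    auto t1 = std::chrono::steady_clock::now();

    const double sec = std::chrono::duration<double>(t1 - t0).count();
    // 3 arrays touched, 8 bytes each; STREAM counts MB as 1e6 bytes.
    std::printf("Triad: %.1f MB/s\n", 3.0 * N * sizeof(double) / sec / 1e6);
    return 0;
}
```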
FFTW
3-D FFT
$ OMP_NUM_THREADS=16 ./fftw
3-D FFT: 128 x 128 x 128
On-board: 0.039270 msec, 5.607428 GFLOPS.
On-board: 0.037642 msec, 5.849856 GFLOPS.
3-D FFT: 256 x 256 x 256
On-board: 0.704716 msec, 2.856847 GFLOPS.
On-board: 0.120807 msec, 16.665106 GFLOPS.
$ OMP_NUM_THREADS=32 ./fftw
3-D FFT: 128 x 128 x 128
On-board: 0.023219 msec, 9.483667 GFLOPS.
On-board: 0.024668 msec, 8.926539 GFLOPS.
3-D FFT: 256 x 256 x 256
On-board: 0.302052 msec, 6.665296 GFLOPS.
On-board: 0.108893 msec, 18.488524 GFLOPS.
...Now this is a rather interesting result, I think.
Threadripper is built as 16 physical cores exposing 32 logical threads, so 16 OpenMP threads should be enough to pull full performance out of it.
And indeed STREAM does pull full bandwidth at 16 threads, yet the FFT doesn't deliver there. Does that mean the bottleneck is on the compute side rather than the memory side?
The execution units just aren't getting filled? Hmm.
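For context, what's being timed is essentially fftw_execute on a threaded 3-D plan, roughly as in this sketch (my reconstruction, not the suite's exact code; I'm assuming the GFLOPS figure uses the standard 5N·log2(N) operation count for a complex transform):

```cpp
// Threaded 3-D FFTW timing sketch (reconstruction, not the suite's code).
// Build: g++ -O2 -fopenmp fft_sketch.cpp -lfftw3_omp -lfftw3
#include <fftw3.h>
#include <omp.h>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdio>

int main() {
    const int n = 256;  // per-dimension size
    fftw_init_threads();
    fftw_plan_with_nthreads(omp_get_max_threads());  // honors OMP_NUM_THREADS

    const std::size_t total = (std::size_t)n * n * n;
    fftw_complex* buf = fftw_alloc_complex(total);
    fftw_plan plan = fftw_plan_dft_3d(n, n, n, buf, buf, FFTW_FORWARD, FFTW_MEASURE);
    for (std::size_t i = 0; i < total; ++i)
        buf[i][0] = buf[i][1] = 0.0;  // FFTW_MEASURE clobbers the buffer; reinitialize

    auto t0 = std::chrono::steady_clock::now();
    fftw_execute(plan);
    auto t1 = std::chrono::steady_clock::now();

    const double sec = std::chrono::duration<double>(t1 - t0).count();
    const double flops = 5.0 * total * std::log2((double)total);  // usual estimate
    std::printf("%d^3: %f msec, %f GFLOPS\n", n, sec * 1e3, flops / sec / 1e9);

    fftw_destroy_plan(plan);
    fftw_free(buf);
    fftw_cleanup_threads();
    return 0;
}
```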
xGEMM
DGEMM and SGEMM.
I run one moderately small matrix size and one large one; what gets timed boils down to a single cblas_*gemm call, as in the sketch below.
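A minimal sketch of the DGEMM side, assuming a CBLAS/OpenBLAS backend (my reconstruction, not the suite's exact code):

```cpp
// DGEMM timing sketch against CBLAS/OpenBLAS (reconstruction).
// Build: g++ -O2 gemm_sketch.cpp -lopenblas
#include <cblas.h>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const int M = 2048, N = 2048, K = 2048;
    const double alpha = -1.0, beta = 1.0;  // matches the "al -1 b 1" in the log
    std::vector<double> A((std::size_t)M * K, 1.0);
    std::vector<double> B((std::size_t)K * N, 1.0);
    std::vector<double> C((std::size_t)M * N, 0.0);

    auto t0 = std::chrono::steady_clock::now();
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K, alpha, A.data(), K, B.data(), N, beta, C.data(), N);
    auto t1 = std::chrono::steady_clock::now();

    const double sec = std::chrono::duration<double>(t1 - t0).count();
    std::printf("%.3f GFlops\n", 2.0 * M * N * K / sec / 1e9);  // FMA = 2 flops
    return 0;
}
```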
$ OMP_NUM_THREADS=16 ./gemms
M 2048 N 2048 K 2048 al -1 b 1
Dgemm start
memory use 0.09375 GB
23:33:30 initialize done.
355.006 GFlops, 5.40285
Sgemm start
23:34:0 initialize done.
732.785 GFlops, 46.0008
$ OMP_NUM_THREADS=32 ./gemms
M 2048 N 2048 K 2048 al -1 b 1
Dgemm start
memory use 0.09375 GB
23:34:58 initialize done.
287.46 GFlops, 31.8844
Sgemm start
23:35:28 initialize done.
526.99 GFlops, 26.8436
Doing a rough peak-performance estimate (yes, rough again...), the 1950X comes out to
16 (cores) * 3.75 (GHz) * 2 (DP elements per 128-bit unit) * 2 (FMA) * 2 (pipes) = 480 GFlops in double precision,
so at 16 threads DGEMM reaches about 74.0% of peak and SGEMM about 76.3% (against the SP peak of 960 GFlops).
With large matrices it looks like this:
$ OMP_NUM_THREADS=16 ./gemms
M 10240 N 10240 K 10240 al -1 b 1
Dgemm start
memory use 2.34375 GB
1:32:53 initialize done.
410.368 GFlops, 16.3553
Sgemm start
1:33:55 initialize done.
792.339 GFlops, 82.0971
$ OMP_NUM_THREADS=32 ./gemms
M 10240 N 10240 K 10240 al -1 b 1
Dgemm start
memory use 2.34375 GB
1:34:59 initialize done.
303.08 GFlops, 43.8209
Sgemm start
1:36:6 initialize done.
637.024 GFlops, 97.3738
And so on.
Quite interesting: at this size DGEMM efficiency climbs to about 85%, and SGEMM to 82.5%.
HPL
N=32768, 16 threads
$ OMP_NUM_THREADS=16 ./lu2 -n 32768
Nswap=0 cpsec = 1251.89 wsec=78.6553 298.216 Gflops
swaprows time= 1.57564e+09 ops/cycle=0.340733
scalerow time= 1.81038e+08 ops/cycle=2.96552
trans rtoc time= 1.59613e+09 ops/cycle=0.336357
trans ctor time= 1.41867e+09 ops/cycle=0.378433
trans mmul time= 3.37718e+09 ops/cycle=2.38455
tr nr cdec time= 8.83348e+08 ops/cycle=0.607768
trans vvmul time= 3.44348e+08 ops/cycle=1.55909
trans findp time= 5.40833e+08 ops/cycle=0.992674
solve tri u time= 3.80403e+09 ops/cycle=8.61403e-06
solve tri time= 2.40074e+10 ops/cycle=91.5977
matmul nk8 time= 0 ops/cycle=inf
matmul snk time= 147560 ops/cycle=58213.2
trans mmul8 time= 1.59286e+09 ops/cycle=2.69638
trans mmul4 time= 1.2552e+09 ops/cycle=1.71088
trans mmul2 time= 5.26504e+08 ops/cycle=2.03938
DGEMM2K time= 2.25837e+11 ops/cycle=98.9952
DGEMM1K time= 7.22312e+09 ops/cycle=76.1106
DGEMM512 time= 4.39899e+09 ops/cycle=62.4866
DGEMMrest time= 1.48088e+10 ops/cycle=18.5618
col dec t time= 7.3319e+09 ops/cycle=2.34317
Total time= 2.65152e+11 ops/cycle=88.4635
N=65536, 16 threads
$ OMP_NUM_THREADS=16 ./lu2 -n 65536
Nswap=0 cpsec = 12910.4 wsec=811.004 231.38 Gflops
swaprows time= 5.47763e+09 ops/cycle=0.392046
scalerow time= 3.57613e+08 ops/cycle=6.00505
trans rtoc time= 7.23227e+09 ops/cycle=0.296931
trans ctor time= 5.90193e+09 ops/cycle=0.363861
trans mmul time= 1.13457e+10 ops/cycle=2.83917
tr nr cdec time= 4.36849e+09 ops/cycle=0.491585
trans vvmul time= 1.4129e+09 ops/cycle=1.51991
trans findp time= 2.95845e+09 ops/cycle=0.725882
solve tri u time= 9.05843e+09 ops/cycle=7.2348e-06
solve tri time= 1.32972e+11 ops/cycle=66.1498
matmul nk8 time= 0 ops/cycle=inf
matmul snk time= 143446 ops/cycle=239531
trans mmul8 time= 5.62061e+09 ops/cycle=3.05658
trans mmul4 time= 3.51078e+09 ops/cycle=2.44673
trans mmul2 time= 2.20799e+09 ops/cycle=1.94519
DGEMM2K time= 2.57941e+12 ops/cycle=71.044
DGEMM1K time= 3.71842e+10 ops/cycle=59.1387
DGEMM512 time= 2.26805e+10 ops/cycle=48.4783
DGEMMrest time= 5.97308e+10 ops/cycle=18.4078
col dec t time= 2.8972e+10 ops/cycle=2.37192
Total time= 2.74448e+12 ops/cycle=68.3736
HPL, as a rule, should get faster the more memory you give it...
Hmm, so why is performance dropping?
At this point HPL sits at around 60% efficiency?
Without properly hunting down the bottleneck this won't amount to much of an evaluation... let's call it a TODO...
Says the guy who never does the things he calls TODOs, so: actually maintain the thing, self...
N=32768, 32 threads
$ OMP_NUM_THREADS=32 ./lu2 -n 32768
Nswap=0 cpsec = 3681.38 wsec=156.199 150.169 Gflops
swaprows time= 2.23137e+09 ops/cycle=0.240601
scalerow time= 2.23747e+08 ops/cycle=2.39945
trans rtoc time= 3.17154e+10 ops/cycle=0.0169278
trans ctor time= 3.92966e+10 ops/cycle=0.013662
trans mmul time= 1.15372e+11 ops/cycle=0.0698011
tr nr cdec time= 3.71547e+09 ops/cycle=0.144496
trans vvmul time= 1.74624e+09 ops/cycle=0.307443
trans findp time= 1.9485e+09 ops/cycle=0.27553
solve tri u time= 6.98219e+09 ops/cycle=4.69308e-06
solve tri time= 3.22386e+10 ops/cycle=68.2109
matmul nk8 time= 0 ops/cycle=inf
matmul snk time= 7.68726e+06 ops/cycle=1117.42
trans mmul8 time= 3.76865e+10 ops/cycle=0.113966
trans mmul4 time= 7.5779e+10 ops/cycle=0.0283388
trans mmul2 time= 1.90277e+09 ops/cycle=0.564306
DGEMM2K time= 2.85387e+11 ops/cycle=78.3382
DGEMM1K time= 1.1104e+10 ops/cycle=49.5096
DGEMM512 time= 7.12292e+09 ops/cycle=38.5906
DGEMMrest time= 2.64021e+10 ops/cycle=10.4112
col dec t time= 1.90337e+11 ops/cycle=0.0902601
Total time= 5.28254e+11 ops/cycle=44.4033
True to the DGEMM trend, using 32 threads drops performance to roughly half.
What bothers me is that while DGEMM itself shows something like 85% efficiency, this run comes in at 298 GFlops, i.e. about 62%, an oddly middling number.
Hmm... nothing in particular comes to mind just from staring at these figures...
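One thing worth ruling out when 32 threads halve throughput is whether pairs of OpenMP threads land on SMT siblings of the same physical core. A quick check, my own sketch rather than anything in the suite:

```cpp
// Print which logical CPU each OpenMP thread runs on (sketch for checking
// whether two threads share a physical core). Build: g++ -O2 -fopenmp
#include <omp.h>
#include <sched.h>
#include <cstdio>

int main() {
#pragma omp parallel
    {
        // On Linux, sched_getcpu() returns the logical CPU id; /proc/cpuinfo's
        // "core id" field maps that back to a physical core.
        std::printf("thread %2d -> cpu %2d\n", omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}
```

Pinning with the standard OpenMP environment variables (e.g. OMP_PLACES=cores OMP_PROC_BIND=close) is one way to keep 16 threads on 16 distinct physical cores.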
One thing that does catch my attention, though:
$ for((i=0;i<5;++i)); do ./gemms; done
M 10240 N 10240 K 10240 al -1 b 1
Dgemm start
memory use 2.34375 GB
18:25:4 initialize done.
418.267 GFlops, 9.5063
Sgemm start
18:26:6 initialize done.
824.668 GFlops, 68.7114
M 10240 N 10240 K 10240 al -1 b 1
Dgemm start
memory use 2.34375 GB
18:26:40 initialize done.
326.868 GFlops, 21.1696
Sgemm start
18:27:44 initialize done.
727.008 GFlops, 96.0577
M 10240 N 10240 K 10240 al -1 b 1
Dgemm start
memory use 2.34375 GB
18:28:16 initialize done.
291.473 GFlops, 14.6529
Sgemm start
18:29:24 initialize done.
669.241 GFlops, 98.7481
M 10240 N 10240 K 10240 al -1 b 1
Dgemm start
memory use 2.34375 GB
18:29:59 initialize done.
275.481 GFlops, 11.3038
Sgemm start
18:31:1 initialize done.
650.018 GFlops, 97.4159
M 10240 N 10240 K 10240 al -1 b 1
Dgemm start
memory use 2.34375 GB
18:31:33 initialize done.
268.727 GFlops, 8.76842
Sgemm start
18:32:36 initialize done.
637.529 GFlops, 97.3936
...running gemm back to back like this, performance quietly degrades run after run.
I'd like to blame heat, but for whatever reason lm_sensors doesn't work properly on this machine, so I can't read the temperatures... (the end)
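Since lm_sensors is a no-go, one indirect way to check for thermal or power throttling would be to watch the effective clock through the standard Linux cpufreq sysfs node while the GEMM loop runs (a sketch; whether the node is present depends on the cpufreq driver):

```cpp
// Poll the current frequency of CPU 0 once a second (sketch).
// Reads the standard Linux cpufreq sysfs node; a sagging value during a
// long GEMM run would point at thermal or power throttling.
#include <chrono>
#include <fstream>
#include <iostream>
#include <string>
#include <thread>

int main() {
    const std::string node =
        "/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq";
    for (int i = 0; i < 60; ++i) {  // watch for one minute
        std::ifstream f(node);      // reopen each time: sysfs values are snapshots
        long khz = 0;
        if (f >> khz)
            std::cout << khz / 1000.0 << " MHz" << std::endl;
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
    return 0;
}
```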
Chainer
Well, I figured I should probably try this trendy deep learning stuff too... something like that... (total bandwagoner)
I used Chainer (https://github.com/chainer/chainer), the HEAD of master.
The environment was built with pip for the most part, except numpy, which I rebuilt myself so it would use my own OpenBLAS build.
$ pip freeze
chainer==6.0.0a1
Cython==0.29
filelock==3.0.10
numpy==1.16.0.dev0+c41c011
Pillow==5.3.0
protobuf==3.6.1
six==1.11.0
From the examples: MNIST and imagenet (alexnet, resnet50).
- MNIST
epoch main/loss validation/main/loss main/accuracy validation/main/accuracy elapsed_time
1 0.191196 0.0977384 0.941817 0.9699 11.0106
2 0.0741213 0.0782149 0.97685 0.9749 21.9254
3 0.0490689 0.0800515 0.9849 0.9761 32.7277
4 0.0383919 0.0716018 0.98775 0.977 43.4329
5 0.0272046 0.0785329 0.990817 0.9795 54.1377
6 0.0265784 0.080272 0.991533 0.9799 64.8223
7 0.0199796 0.0873966 0.9934 0.9818 75.5063
8 0.0199766 0.102661 0.99375 0.9741 86.1714
9 0.0156266 0.0917326 0.994917 0.9786 96.8369
10 0.0138881 0.0876374 0.995883 0.9801 107.496
11 0.0155263 0.100137 0.995417 0.9781 118.157
12 0.0130711 0.0894545 0.995933 0.9806 128.811
13 0.0115755 0.0894831 0.9964 0.9826 139.475
14 0.0104541 0.0842216 0.997 0.983 150.129
15 0.0129433 0.0979006 0.996133 0.9825 160.79
16 0.00923257 0.0886761 0.997183 0.9827 171.477
17 0.00851278 0.100802 0.99765 0.9816 182.287
18 0.00883718 0.106414 0.997533 0.9794 193.263
19 0.0105243 0.124271 0.996833 0.978 204.412
20 0.00644211 0.126378 0.998083 0.979 215.717
For alexnet and resnet50, 20 epochs looked like it would take forever, so I set the log interval to every 10 iterations and report the time until one epoch finishes.
- alexnet
$ python train_imagenet.py -a alex -E 1 ../../../chainer_imagenet_tools/train.txt ../../../chainer_imagenet_tools/test.txt
epoch iteration main/loss validation/main/loss main/accuracy validation/main/accuracy lr elapsed_time
0 10 6.88606 0.034375 0.01 41.3393
0 20 6.27314 0.078125 0.01 79.9324
0 30 5.35235 0.06875 0.01 118.542
0 40 5.14217 0.053125 0.01 156.84
0 50 4.69628 0.0875 0.01 194.714
0 60 4.7203 0.078125 0.01 232.382
0 70 4.63171 0.053125 0.01 270.075
0 80 4.46738 0.0875 0.01 307.801
0 90 4.54931 0.0375 0.01 345.483
0 100 4.5486 0.109375 0.01 383.219
0 110 4.36946 0.1 0.01 420.975
0 120 4.28879 0.10625 0.01 458.817
0 130 4.40169 0.115625 0.01 496.668
0 140 4.43244 0.1375 0.01 534.556
0 150 4.31911 0.125 0.01 572.412
0 160 4.30947 0.15625 0.01 610.293
0 170 4.37239 0.11875 0.01 648.203
0 180 4.18134 0.15 0.01 686.034
0 190 4.28241 0.146875 0.01 723.877
0 200 4.06563 0.184375 0.01 761.713
0 210 3.97252 0.203125 0.01 799.537
- resnet50
$ python train_imagenet.py -a resnet50 -E 1 ../../../chainer_imagenet_tools/train.txt ../../../chainer_imagenet_tools/test.txt
epoch iteration main/loss validation/main/loss main/accuracy validation/main/accuracy lr elapsed_time
0 10 7.64181 0.046875 0.01 167.915
0 20 7.41486 0.084375 0.01 332.973
0 30 6.54101 0.053125 0.01 499.218
0 40 6.55887 0.1125 0.01 665.126
0 50 6.06997 0.1 0.01 833.873
0 60 5.465 0.128125 0.01 1003.78
0 70 4.97667 0.121875 0.01 1173.57
0 80 4.56097 0.178125 0.01 1342.09
0 90 4.54822 0.1875 0.01 1507.72
0 100 4.60423 0.18125 0.01 1673.12
0 110 4.25759 0.16875 0.01 1838.45
0 120 4.23644 0.184375 0.01 2004.29
0 130 4.35106 0.190625 0.01 2172.86
0 140 4.20861 0.178125 0.01 2342.25
0 150 4.33225 0.146875 0.01 2509.55
0 160 3.9726 0.19375 0.01 2675.24
0 170 3.70222 0.23125 0.01 2843.33
0 180 3.86638 0.228125 0.01 3011.9
0 190 3.76488 0.228125 0.01 3178.59
0 200 3.79957 0.240625 0.01 3344.85
0 210 3.83217 0.209375 0.01 3509.89
I'm firmly in the don't-understand-deep-learning camp, so if anything here is unreasonably slow, please point it out.
Bonus
I also grabbed instruction-level measurements, using
https://github.com/tanakamura/instruction-bench
Only a proper degenerate gets excited looking at this stuff (?
AMD Ryzen Threadripper 1950X 16-Core Processor
== latency/throughput ==
reg64: add: latency: CPI= 1.00, IPC= 1.00
reg64: add:throughput: CPI= 0.26, IPC= 3.88
reg64: lea: latency: CPI= 1.00, IPC= 1.00
reg64: lea:throughput: CPI= 0.26, IPC= 3.88
reg64: xor dst,dst: latency: CPI= 0.26, IPC= 3.88
reg64: xor dst,dst:throughput: CPI= 0.26, IPC= 3.88
reg64: xor: latency: CPI= 0.26, IPC= 3.88
reg64: xor:throughput: CPI= 0.26, IPC= 3.88
reg64: load: latency: CPI= 4.00, IPC= 0.25
reg64: load:throughput: CPI= 0.63, IPC= 1.60
reg64: crc32: latency: CPI= 3.02, IPC= 0.33
reg64: crc32:throughput: CPI= 3.02, IPC= 0.33
reg64: store [mem+0]->load[mem+0]: latency: CPI= 38.15, IPC= 0.03
reg64: store [mem+0]->load[mem+0]:throughput: CPI= 3.17, IPC= 0.32
reg64: store [mem+0]->load[mem+1]: latency: CPI= 36.97, IPC= 0.03
reg64: store [mem+0]->load[mem+1]:throughput: CPI= 14.01, IPC= 0.07
m128: pxor: latency: CPI= 0.25, IPC= 4.00
m128: pxor:throughput: CPI= 0.25, IPC= 4.00
m128: padd: latency: CPI= 1.00, IPC= 1.00
m128: padd:throughput: CPI= 0.33, IPC= 3.00
m128: pmuldq: latency: CPI= 3.00, IPC= 0.33
m128: pmuldq:throughput: CPI= 1.00, IPC= 1.00
m128: loadps:throughput: CPI= 0.50, IPC= 2.00
m128: loadps->movq: latency: CPI= 9.00, IPC= 0.11
m128: movq->movq: latency: CPI= 6.00, IPC= 0.17
m128: movq->movq:throughput: CPI= 1.00, IPC= 1.00
m128: xorps: latency: CPI= 0.25, IPC= 4.00
m128: xorps:throughput: CPI= 0.25, IPC= 4.00
m128: addps: latency: CPI= 3.00, IPC= 0.33
m128: addps:throughput: CPI= 0.50, IPC= 2.00
m128: mulps: latency: CPI= 3.00, IPC= 0.33
m128: mulps:throughput: CPI= 0.50, IPC= 2.00
m128: divps: latency: CPI= 10.00, IPC= 0.10
m128: divps:throughput: CPI= 3.00, IPC= 0.33
m128: divpd: latency: CPI= 8.00, IPC= 0.12
m128: divpd:throughput: CPI= 4.00, IPC= 0.25
m128: rsqrtps: latency: CPI= 5.00, IPC= 0.20
m128: rsqrtps:throughput: CPI= 1.00, IPC= 1.00
m128: rcpps: latency: CPI= 5.00, IPC= 0.20
m128: rcpps:throughput: CPI= 1.00, IPC= 1.00
m128: blendps: latency: CPI= 1.00, IPC= 1.00
m128: blendps:throughput: CPI= 0.50, IPC= 2.00
m128: blendvps: latency: CPI= 1.00, IPC= 1.00
m128: blendvps:throughput: CPI= 0.50, IPC= 2.00
m128: pshufb: latency: CPI= 1.00, IPC= 1.00
m128: pshufb:throughput: CPI= 0.50, IPC= 2.00
m128: shufps: latency: CPI= 1.00, IPC= 1.00
m128: shufps:throughput: CPI= 0.50, IPC= 2.00
m128: pmullw: latency: CPI= 3.00, IPC= 0.33
m128: pmullw:throughput: CPI= 1.00, IPC= 1.00
m128: phaddd: latency: CPI= 2.00, IPC= 0.50
m128: phaddd:throughput: CPI= 2.00, IPC= 0.50
m128: haddps: latency: CPI= 2.00, IPC= 0.50
m128: haddps:throughput: CPI= 2.00, IPC= 0.50
m128: pinsrd: latency: CPI= 1.67, IPC= 0.60
m128: pinsrd:throughput: CPI= 1.31, IPC= 0.77
m128: pinsrd->pexr: latency: CPI= 8.00, IPC= 0.12
m128: dpps: latency: CPI= 15.00, IPC= 0.07
m128: dpps:throughput: CPI= 4.00, IPC= 0.25
m128: cvtps2dq: latency: CPI= 4.00, IPC= 0.25
m128: cvtps2dq:throughput: CPI= 1.00, IPC= 1.00
reg64: popcnt: latency: CPI= 1.00, IPC= 1.00
reg64: popcnt:throughput: CPI= 0.26, IPC= 3.88
m128: aesenc: latency: CPI= 4.00, IPC= 0.25
m128: aesenc:throughput: CPI= 0.50, IPC= 2.00
m128: aesenclast: latency: CPI= 4.00, IPC= 0.25
m128: aesenclast:throughput: CPI= 0.50, IPC= 2.00
m128: aesdec: latency: CPI= 4.00, IPC= 0.25
m128: aesdec:throughput: CPI= 0.50, IPC= 2.00
m128: aesdeclast: latency: CPI= 4.00, IPC= 0.25
m128: aesdeclast:throughput: CPI= 0.50, IPC= 2.00
m256: movaps [mem]: latency: CPI= 1.00, IPC= 1.00
m256: movaps [mem]:throughput: CPI= 1.00, IPC= 1.00
m256: vmovdqu [mem+1]: latency: CPI= 1.50, IPC= 0.67
m256: vmovdqu [mem+1]:throughput: CPI= 1.50, IPC= 0.67
m256: vmovdqu [mem+63] (cross cache): latency: CPI= 1.50, IPC= 0.67
m256: vmovdqu [mem+63] (cross cache):throughput: CPI= 1.50, IPC= 0.67
m256: vmovdqu [mem+2MB-1] (cross page): latency: CPI= 1.50, IPC= 0.67
m256: vmovdqu [mem+2MB-1] (cross page):throughput: CPI= 1.50, IPC= 0.67
m256: xorps: latency: CPI= 0.50, IPC= 2.00
m256: xorps:throughput: CPI= 0.50, IPC= 2.00
m256: mulps: latency: CPI= 3.00, IPC= 0.33
m256: mulps:throughput: CPI= 1.00, IPC= 1.00
m256: addps: latency: CPI= 3.00, IPC= 0.33
m256: addps:throughput: CPI= 1.00, IPC= 1.00
m256: divps: latency: CPI= 10.00, IPC= 0.10
m256: divps:throughput: CPI= 6.00, IPC= 0.17
m256: divpd: latency: CPI= 8.00, IPC= 0.12
m256: divpd:throughput: CPI= 8.00, IPC= 0.12
m256: rsqrtps: latency: CPI= 5.00, IPC= 0.20
m256: rsqrtps:throughput: CPI= 2.00, IPC= 0.50
m256: rcpps: latency: CPI= 5.00, IPC= 0.20
m256: rcpps:throughput: CPI= 2.00, IPC= 0.50
m256: sqrtps: latency: CPI= 8.00, IPC= 0.12
m256: sqrtps:throughput: CPI= 8.00, IPC= 0.12
m256: vperm2f128: latency: CPI= 3.00, IPC= 0.33
m256: vperm2f128:throughput: CPI= 3.00, IPC= 0.33
m256: pxor: latency: CPI= 0.50, IPC= 2.00
m256: pxor:throughput: CPI= 0.50, IPC= 2.00
m256: paddd: latency: CPI= 1.00, IPC= 1.00
m256: paddd:throughput: CPI= 0.67, IPC= 1.50
m256: vpermps: latency: CPI= 5.00, IPC= 0.20
m256: vpermps:throughput: CPI= 4.00, IPC= 0.25
m256: vpermpd: latency: CPI= 2.00, IPC= 0.50
m256: vpermpd:throughput: CPI= 2.00, IPC= 0.50
m256: vpmovsxwd: latency: CPI= 2.00, IPC= 0.50
m256: vpmovsxwd:throughput: CPI= 2.00, IPC= 0.50
m256: vpgatherdd: latency: CPI= 20.81, IPC= 0.05
m256: vpgatherdd:throughput: CPI= 20.00, IPC= 0.05
m256: gather32(<ld+ins>x8 + perm): latency: CPI= 17.39, IPC= 0.06
m256: gather32(<ld+ins>x8 + perm):throughput: CPI= 5.03, IPC= 0.20
m256: vgatherdpd: latency: CPI= 15.69, IPC= 0.06
m256: vgatherdpd:throughput: CPI= 12.00, IPC= 0.08
m256: gather64(<ld+ins>x4 + perm): latency: CPI= 13.01, IPC= 0.08
m256: gather64(<ld+ins>x4 + perm):throughput: CPI= 3.03, IPC= 0.33
m256: vpshufb: latency: CPI= 1.00, IPC= 1.00
m256: vpshufb:throughput: CPI= 1.00, IPC= 1.00
m256: vfmaps: latency: CPI= 5.00, IPC= 0.20
m256: vfmaps:throughput: CPI= 1.00, IPC= 1.00
m256: vfmapd: latency: CPI= 5.00, IPC= 0.20
m256: vfmapd:throughput: CPI= 1.00, IPC= 1.00
m128: vfmaps: latency: CPI= 5.00, IPC= 0.20
m128: vfmaps:throughput: CPI= 0.50, IPC= 2.00
m128: vfmapd: latency: CPI= 5.00, IPC= 0.20
m128: vfmapd:throughput: CPI= 0.50, IPC= 2.00
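For the curious: the latency/throughput split above is obtained by timing a dependent chain of an instruction (latency) versus several independent chains (throughput). A crude sketch of the idea, much less careful than instruction-bench itself (rdtsc counts reference cycles, so turbo skews the absolute numbers):

```cpp
// Crude latency/throughput measurement for 64-bit add (sketch; instruction-bench
// is far more careful about serialization and clock measurement).
// Build: g++ -O2 bench_add.cpp
#include <cstdint>
#include <cstdio>
#include <x86intrin.h>

int main() {
    const int64_t iters = 100000000;

    // Latency: each add depends on the previous one, so cycles/add ~= latency.
    uint64_t x = 1;
    uint64_t t0 = __rdtsc();
    for (int64_t i = 0; i < iters; ++i)
        asm volatile("add $1, %0" : "+r"(x));
    uint64_t t1 = __rdtsc();
    std::printf("add latency    CPI= %.2f\n", double(t1 - t0) / iters);

    // Throughput: four independent chains let the scheduler spread adds across
    // ALU ports, so CPI approaches 1/ports (loop overhead inflates it a bit).
    uint64_t a = 1, b = 1, c = 1, d = 1;
    t0 = __rdtsc();
    for (int64_t i = 0; i < iters; ++i)
        asm volatile("add $1, %0\n\tadd $1, %1\n\tadd $1, %2\n\tadd $1, %3"
                     : "+r"(a), "+r"(b), "+r"(c), "+r"(d));
    t1 = __rdtsc();
    std::printf("add throughput CPI= %.2f\n", double(t1 - t0) / (iters * 4));
    return (int)(x + a + b + c + d);  // keep the results live
}
```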
Summary
I gave the Threadripper 1950X a quick and rather sloppy performance workout.
This post is essentially a low-effort ramble, I admit.
The article itself doesn't mean much, but consider it groundwork for measuring EPYC...