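I wondered whether a prebuilt binary might be lying around somewhere, and came across an article titled
【基本】CPUやGPUの理論値FLOPSの計算方法と測定方法 ([Basics] How to Calculate and Measure the Theoretical FLOPS of CPUs and GPUs).
Its subsection
- 実際に自分のパソコン(Windows)でLINPACKを測ってみた。 (I actually measured LINPACK on my own PC (Windows).)
mentions that Intel provides a build aimed at Xeon.
Intel® LINPACK Benchmark Download – License Agreement
I see, so a fully tuned binary does exist. I wanted to run it, but could not quite work out how to use it, so I downloaded the Linux version and ran it on Ubuntu.
As I found out later to my regret, this is not something you can casually run on the side.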
Acer Nitro 5 Core i5-8300H
Sample data file lininput_xeon64.
Current date/time: Tue Jul 16 17:36:04 2019
CPU frequency: 3.989 GHz
Number of CPUs: 1
Number of cores: 4
Number of threads: 4
Parameters are set to:
Number of tests: 15
Number of equations to solve (problem size) : | 1000 | 2000 | 5000 | 10000 | 15000 | 18000 | 20000 | 22000 | 25000 | 26000 | 27000 | 30000 | 35000 | 40000 | 45000 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Leading dimension of array : | 1000 | 2000 | 5008 | 10000 | 15000 | 18008 | 20016 | 22008 | 25000 | 26000 | 27000 | 30000 | 35000 | 40000 | 45000 |
Number of trials to run : | 4 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 1 |
Data alignment value (in Kbytes) : | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 1 | 1 | 1 | 1 |
Maximum memory requested that can be used=16200901024, at the size=45000
=================== Timing linear equation system solver ===================
Size | LDA | Align. | Time(s) | GFlops | Residual | Residual(norm) Check |
---|---|---|---|---|---|---|
1000 | 1000 | 4 | 0.007 | 92.4596 | 1.502479e-12 | 5.123838e-02 pass |
1000 | 1000 | 4 | 0.005 | 125.7945 | 1.502479e-12 | 5.123838e-02 pass |
1000 | 1000 | 4 | 0.005 | 128.9757 | 1.502479e-12 | 5.123838e-02 pass |
1000 | 1000 | 4 | 0.005 | 129.1973 | 1.502479e-12 | 5.123838e-02 pass |
2000 | 2000 | 4 | 0.040 | 132.6302 | 5.141276e-12 | 4.472280e-02 pass |
2000 | 2000 | 4 | 0.041 | 129.8964 | 5.141276e-12 | 4.472280e-02 pass |
5000 | 5008 | 4 | 0.536 | 155.5857 | 2.567213e-11 | 3.579772e-02 pass |
5000 | 5008 | 4 | 0.545 | 152.9813 | 2.567213e-11 | 3.579772e-02 pass |
10000 | 10000 | 4 | 4.316 | 154.5192 | 1.034086e-10 | 3.646293e-02 pass |
10000 | 10000 | 4 | 4.213 | 158.2908 | 1.034086e-10 | 3.646293e-02 pass |
15000 | 15000 | 4 | 14.515 | 155.0456 | 2.206862e-10 | 3.475844e-02 pass |
15000 | 15000 | 4 | 15.133 | 148.7150 | 2.206862e-10 | 3.475844e-02 pass |
18000 | 18008 | 4 | 24.324 | 159.8681 | 2.839891e-10 | 3.110030e-02 pass |
18000 | 18008 | 4 | 24.294 | 160.0637 | 2.839891e-10 | 3.110030e-02 pass |
20000 | 20016 | 4 | 33.223 | 160.5565 | 3.839458e-10 | 3.398761e-02 pass |
20000 | 20016 | 4 | 33.210 | 160.6187 | 3.839458e-10 | 3.398761e-02 pass |
22000 | 22008 | 4 | 44.239 | 160.4836 | 4.191068e-10 | 3.069793e-02 pass |
22000 | 22008 | 4 | 52.114 | 136.2331 | 4.191068e-10 | 3.069793e-02 pass |
25000 | 25000 | 4 | 84.023 | 123.9891 | 5.526876e-10 | 3.142937e-02 pass |
25000 | 25000 | 4 | 77.841 | 133.8358 | 5.526876e-10 | 3.142937e-02 pass |
26000 | 26000 | 4 | 93.571 | 125.2384 | 6.643859e-10 | 3.493541e-02 pass |
26000 | 26000 | 4 | 72.181 | 162.3513 | 6.643859e-10 | 3.493541e-02 pass |
27000 | 27000 | 4 | 80.921 | 162.1760 | 6.191166e-10 | 3.019127e-02 pass |
30000 | 30000 | 1 | 110.933 | 162.2761 | 8.044883e-10 | 3.171301e-02 pass |
35000 | 35000 | 1 | 176.163 | 162.2688 | 1.155255e-09 | 3.353531e-02 pass |
40000 | 40000 | 1 | 260.541 | 163.7740 | 1.430789e-09 | 3.182123e-02 pass |
45000 | 45000 | 1 | 14033.765 | 4.3291 | 1.628164e-09 | 2.864583e-02 pass |
Performance Summary (GFlops)
Size | LDA | Align. | Average | Maximal |
---|---|---|---|---|
1000 | 1000 | 4 | 119.1068 | 129.1973 |
2000 | 2000 | 4 | 131.2633 | 132.6302 |
5000 | 5008 | 4 | 154.2835 | 155.5857 |
10000 | 10000 | 4 | 156.4050 | 158.2908 |
15000 | 15000 | 4 | 151.8803 | 155.0456 |
18000 | 18008 | 4 | 159.9659 | 160.0637 |
20000 | 20016 | 4 | 160.5876 | 160.6187 |
22000 | 22008 | 4 | 148.3584 | 160.4836 |
25000 | 25000 | 4 | 128.9125 | 133.8358 |
26000 | 26000 | 4 | 143.7948 | 162.3513 |
27000 | 27000 | 4 | 162.1760 | 162.1760 |
30000 | 30000 | 1 | 162.2761 | 162.2761 |
35000 | 35000 | 1 | 162.2688 | 162.2688 |
40000 | 40000 | 1 | 163.7740 | 163.7740 |
45000 | 45000 | 1 | 4.3291 | 4.3291 |
Residual checks PASSED
End of tests 2019/07/16 22:19
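The GFlops column can be cross-checked by hand. The sketch below assumes the conventional LINPACK operation count of 2n³/3 + 2n² floating-point operations for solving an n×n system. The function name `linpack_gflops` is made up for this illustration, and small differences from the log are expected because the printed times are rounded to milliseconds.

```python
def linpack_gflops(n: int, seconds: float) -> float:
    """GFlops for an n x n solve, assuming the 2n^3/3 + 2n^2 operation count."""
    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    return flops / seconds / 1e9


# (size, time in seconds) pairs copied from the log above
for n, t in [(10000, 4.316), (40000, 260.541), (45000, 14033.765)]:
    print(f"{n:>6} {t:>10.3f} s  {linpack_gflops(n, t):8.2f} GFlops")
```

The size-45000 run took almost four hours, which is why its rate collapses to about 4.3 GFlops even though the amount of work is only about 1.4 times that of size 40000.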
- Ubuntu Linux on Windows 10
$ ./runme_xeon64
This is a SAMPLE run script for running a shared-memory version of
Intel(R) Distribution for LINPACK* Benchmark. Change it to reflect
the correct number of CPUs/threads, problem input files, etc..
*Other names and brands may be claimed as the property of others.
./runme_xeon64: 28: [: -gt: unexpected operator
Fri Jul 12 22:10:10 DST 2019
Sample data file lininput_xeon64.
Current date/time: Fri Jul 12 22:10:10 2019
CPU frequency: 3.989 GHz
Number of CPUs: 1
Number of cores: 4
Number of threads: 4
Parameters are set to:
Number of tests: 15
Number of equations to solve (problem size) : 1000 2000 5000 10000 15000 18000 20000 22000 25000 26000 27000 30000 35000 40000 45000
Leading dimension of array : 1000 2000 5008 10000 15000 18008 20016 22008 25000 26000 27000 30000 35000 40000 45000
Number of trials to run : 4 2 2 2 2 2 2 2 2 2 1 1 1 1 1
Data alignment value (in Kbytes) : 4 4 4 4 4 4 4 4 4 4 4 1 1 1 1
Maximum memory requested that can be used=16200901024, at the size=45000
=================== Timing linear equation system solver ===================
Size | LDA | Align. | Time(s) | GFlops | Residual | Residual(norm) Check |
---|---|---|---|---|---|---|
1000 | 1000 | 4 | 0.009 | 77.4449 | 1.502479e-12 | 5.123838e-02 pass |
1000 | 1000 | 4 | 0.007 | 99.1057 | 1.502479e-12 | 5.123838e-02 pass |
1000 | 1000 | 4 | 0.007 | 97.7584 | 1.502479e-12 | 5.123838e-02 pass |
1000 | 1000 | 4 | 0.005 | 125.8793 | 1.502479e-12 | 5.123838e-02 pass |
2000 | 2000 | 4 | 0.061 | 87.2840 | 5.141276e-12 | 4.472280e-02 pass |
2000 | 2000 | 4 | 0.049 | 109.8297 | 5.141276e-12 | 4.472280e-02 pass |
5000 | 5008 | 4 | 0.662 | 125.9371 | 2.567213e-11 | 3.579772e-02 pass |
5000 | 5008 | 4 | 0.804 | 103.7036 | 2.567213e-11 | 3.579772e-02 pass |
10000 | 10000 | 4 | 5.038 | 132.3704 | 1.034086e-10 | 3.646293e-02 pass |
10000 | 10000 | 4 | 5.161 | 129.2063 | 1.034086e-10 | 3.646293e-02 pass |
15000 | 15000 | 4 | 21.230 | 106.0046 | 2.206862e-10 | 3.475844e-02 pass |
15000 | 15000 | 4 | 20.339 | 110.6467 | 2.206862e-10 | 3.475844e-02 pass |
18000 | 18008 | 4 | 39.051 | 99.5791 | 2.839891e-10 | 3.110030e-02 pass |
18000 | 18008 | 4 | 31.372 | 123.9542 | 2.839891e-10 | 3.110030e-02 pass |
20000 | 20016 | 4 | 48.966 | 108.9351 | 3.839458e-10 | 3.398761e-02 pass |
20000 | 20016 | 4 | 47.882 | 111.4016 | 3.839458e-10 | 3.398761e-02 pass |
22000 | 22008 | 4 | 66.404 | 106.9153 | 4.191068e-10 | 3.069793e-02 pass |
22000 | 22008 | 4 | 60.647 | 117.0651 | 4.191068e-10 | 3.069793e-02 pass |
25000 | 25000 | 4 | 93.090 | 111.9123 | 5.526876e-10 | 3.142937e-02 pass |
25000 | 25000 | 4 | 84.308 | 123.5697 | 5.526876e-10 | 3.142937e-02 pass |
26000 | 26000 | 4 | 92.848 | 126.2134 | 6.643859e-10 | 3.493541e-02 pass |
26000 | 26000 | 4 | 93.009 | 125.9950 | 6.643859e-10 | 3.493541e-02 pass |
27000 | 27000 | 4 | 114.919 | 114.1970 | 6.191166e-10 | 3.019127e-02 pass |
30000 | 30000 | 1 | 153.655 | 117.1575 | 8.044883e-10 | 3.171301e-02 pass |
35000 | 35000 | 1 | 244.501 | 116.9146 | 1.199457e-09 | 3.481843e-02 pass |
40000 | 40000 | 1 | 1218.164 | 35.0280 | 1.430789e-09 | 3.182123e-02 pass |
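Toward the end, the PC itself became so sluggish that it was no longer really usable.
I gave up partway through the test with array size 45000.
It seems to sustain around 100 GFlops, but with large arrays the data no longer fits in memory, heavy swapping kicks in, and performance falls to less than half (35.0 GFlops at size 40000).
The last test would most likely have turned out even worse.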
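A rough estimate makes the swapping plausible. The sketch assumes the benchmark stores one double-precision matrix of LDA × n entries (8 bytes each); that assumption reproduces the "Maximum memory requested" figure in the log to within about 1 MB. The installed RAM is not stated in this post, so `ram_bytes` below is a hypothetical 16 GiB chosen only for illustration.

```python
BYTES_PER_DOUBLE = 8


def matrix_bytes(n: int, lda: int) -> int:
    """Bytes for one LDA x n matrix of doubles (assumed storage layout)."""
    return BYTES_PER_DOUBLE * lda * n


# Hypothetical RAM size; the post does not say how much memory the PC has.
ram_bytes = 16 * 2**30

# (problem size, leading dimension) pairs from the parameter list in the log
for n, lda in [(30000, 30000), (35000, 35000), (40000, 40000), (45000, 45000)]:
    need = matrix_bytes(n, lda)
    print(f"n={n}: {need / 1e9:5.1f} GB, {need / ram_bytes:4.0%} of assumed RAM")
```

For n = 45000 this gives 16,200,000,000 bytes, against the 16,200,901,024 bytes reported in the log. Under the assumed 16 GiB, the matrix alone takes about 75% of RAM at n = 40000 and about 94% at n = 45000, before the OS and other programs are counted.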
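Incidentally, since this is Intel's own software I may have no right to complain, but although the benchmark runs on non-Xeon processors, it apparently does not run on AMD processors. A pity.
I wanted to see how fast it would be in a head-to-head with Ryzen.
Having no other choice, I'll try the Intel processors I have on hand.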
Sandy Bridge Core i7-2600K
Sample data file lininput_xeon64.
Current date/time: Sat Jul 13 18:04:32 2019
CPU frequency: 3.491 GHz
Number of CPUs: 1
Number of cores: 4
Number of threads: 4
Parameters are set to:
Number of tests: 15
Number of equations to solve (problem size) : 1000 2000 5000 10000 15000 18000 20000 22000 25000 26000 27000 30000 35000 40000 45000
Leading dimension of array : 1000 2000 5008 10000 15000 18008 20016 22008 25000 26000 27000 30000 35000 40000 45000
Number of trials to run : 4 2 2 2 2 2 2 2 2 2 1 1 1 1 1
Data alignment value (in Kbytes) : 4 4 4 4 4 4 4 4 4 4 4 1 1 1 1
Maximum memory requested that can be used=7200601024, at the size=30000
=================== Timing linear equation system solver ===================
Size | LDA | Align. | Time(s) | GFlops | Residual | Residual(norm) Check |
---|---|---|---|---|---|---|
1000 | 1000 | 4 | 0.046 | 14.6478 | 1.070422e-12 | 3.650412e-02 pass |
1000 | 1000 | 4 | 0.011 | 58.8439 | 1.070422e-12 | 3.650412e-02 pass |
1000 | 1000 | 4 | 0.016 | 42.0040 | 1.070422e-12 | 3.650412e-02 pass |
1000 | 1000 | 4 | 0.012 | 57.2930 | 1.070422e-12 | 3.650412e-02 pass |
2000 | 2000 | 4 | 0.333 | 16.0505 | 4.491907e-12 | 3.907409e-02 pass |
2000 | 2000 | 4 | 0.101 | 52.8872 | 4.491907e-12 | 3.907409e-02 pass |
5000 | 5008 | 4 | 1.440 | 57.8912 | 2.066952e-11 | 2.882198e-02 pass |
5000 | 5008 | 4 | 1.212 | 68.7985 | 2.066952e-11 | 2.882198e-02 pass |
10000 | 10000 | 4 | 10.566 | 63.1154 | 9.739365e-11 | 3.434199e-02 pass |
10000 | 10000 | 4 | 14.245 | 46.8129 | 9.739365e-11 | 3.434199e-02 pass |
15000 | 15000 | 4 | 35.600 | 63.2145 | 2.097041e-10 | 3.302874e-02 pass |
15000 | 15000 | 4 | 39.833 | 56.4966 | 2.097041e-10 | 3.302874e-02 pass |
18000 | 18008 | 4 | 55.005 | 70.6958 | 3.084368e-10 | 3.377762e-02 pass |
18000 | 18008 | 4 | 46.225 | 84.1240 | 3.084368e-10 | 3.377762e-02 pass |
20000 | 20016 | 4 | 60.971 | 87.4865 | 3.810194e-10 | 3.372857e-02 pass |
20000 | 20016 | 4 | 69.257 | 77.0193 | 3.810194e-10 | 3.372857e-02 pass |
22000 | 22008 | 4 | 81.899 | 86.6872 | 4.221506e-10 | 3.092088e-02 pass |
22000 | 22008 | 4 | 87.022 | 81.5843 | 4.221506e-10 | 3.092088e-02 pass |
25000 | 25000 | 4 | 149.883 | 69.5071 | 6.010080e-10 | 3.417717e-02 pass |
25000 | 25000 | 4 | 152.261 | 68.4214 | 6.010080e-10 | 3.417717e-02 pass |
26000 | 26000 | 4 | 171.803 | 68.2100 | 6.133844e-10 | 3.225359e-02 pass |
26000 | 26000 | 4 | 161.636 | 72.5006 | 6.133844e-10 | 3.225359e-02 pass |
27000 | 27000 | 4 | 171.397 | 76.5678 | 6.508096e-10 | 3.173678e-02 pass |
30000 | 30000 | 1 | 246.497 | 73.0304 | 7.763313e-10 | 3.060306e-02 pass |
Performance Summary (GFlops)
Size | LDA | Align. | Average | Maximal |
---|---|---|---|---|
1000 | 1000 | 4 | 43.1972 | 58.8439 |
2000 | 2000 | 4 | 34.4689 | 52.8872 |
5000 | 5008 | 4 | 63.3448 | 68.7985 |
10000 | 10000 | 4 | 54.9642 | 63.1154 |
15000 | 15000 | 4 | 59.8556 | 63.2145 |
18000 | 18008 | 4 | 77.4099 | 84.1240 |
20000 | 20016 | 4 | 82.2529 | 87.4865 |
22000 | 22008 | 4 | 84.1358 | 86.6872 |
25000 | 25000 | 4 | 68.9642 | 69.5071 |
26000 | 26000 | 4 | 70.3553 | 72.5006 |
27000 | 27000 | 4 | 76.5678 | 76.5678 |
30000 | 30000 | 1 | 73.0304 | 73.0304 |
Residual checks PASSED
End of tests
2019/07/13 18:38
I suppose 100 GFlops is a tall order for this generation.
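A back-of-the-envelope peak estimate suggests why. The sketch multiplies cores × clock × double-precision FLOPs per cycle per core, using the clock reported in each log in this post. The per-cycle figures are assumptions about the microarchitectures and are not taken from the logs: 8 for Sandy Bridge (256-bit AVX with separate add and multiply) and 16 for Haswell and later client cores (two 256-bit FMA units). Sustained clocks under AVX load may be lower than the reported frequency, so the results are best read as upper bounds.

```python
def peak_gflops(cores: int, ghz: float, flops_per_cycle: int) -> float:
    """Theoretical double-precision peak: cores x clock (GHz) x FLOPs/cycle/core."""
    return cores * ghz * flops_per_cycle


# name: (cores, reported clock in GHz, assumed DP FLOPs per cycle per core)
cpus = {
    "Core i7-2600K (Sandy Bridge)": (4, 3.491, 8),
    "Core i7-4700MQ (Haswell)": (4, 3.172, 16),
    "Core i5-8300H": (4, 3.989, 16),
}

for name, (cores, ghz, fpc) in cpus.items():
    print(f"{name}: {peak_gflops(cores, ghz, fpc):.1f} GFlops peak")
```

Under these assumptions the 2600K tops out at about 112 GFlops in theory, so the measured maximum of 87.5 GFlops is already about 78% of peak, and sustaining 100 GFlops would require roughly 90% efficiency.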
Critea VF-AG Core i7-4700MQ
Sample data file lininput_xeon64.
Current date/time: Mon Jul 15 23:17:23 2019
CPU frequency: 3.172 GHz
Number of CPUs: 1
Number of cores: 4
Number of threads: 4
Parameters are set to:
Number of tests: 15
Number of equations to solve (problem size) : 1000 2000 5000 10000 15000 18000 20000 22000 25000 26000 27000 30000 35000 40000 45000
Leading dimension of array : 1000 2000 5008 10000 15000 18008 20016 22008 25000 26000 27000 30000 35000 40000 45000
Number of trials to run : 4 2 2 2 2 2 2 2 2 2 1 1 1 1 1
Data alignment value (in Kbytes) : 4 4 4 4 4 4 4 4 4 4 4 1 1 1 1
Maximum memory requested that can be used=16200901024, at the size=45000
=================== Timing linear equation system solver ===================
Size | LDA | Align. | Time(s) | GFlops | Residual | Residual(norm) Check |
---|---|---|---|---|---|---|
1000 | 1000 | 4 | 0.008 | 85.0169 | 1.502479e-12 | 5.123838e-02 pass |
1000 | 1000 | 4 | 0.007 | 95.6646 | 1.502479e-12 | 5.123838e-02 pass |
1000 | 1000 | 4 | 0.007 | 93.6088 | 1.502479e-12 | 5.123838e-02 pass |
1000 | 1000 | 4 | 0.007 | 95.8483 | 1.502479e-12 | 5.123838e-02 pass |
2000 | 2000 | 4 | 0.051 | 103.8164 | 5.141276e-12 | 4.472280e-02 pass |
2000 | 2000 | 4 | 0.051 | 104.1368 | 5.141276e-12 | 4.472280e-02 pass |
5000 | 5008 | 4 | 0.750 | 111.1409 | 2.567213e-11 | 3.579772e-02 pass |
5000 | 5008 | 4 | 0.731 | 113.9918 | 2.567213e-11 | 3.579772e-02 pass |
10000 | 10000 | 4 | 6.127 | 108.8381 | 1.034086e-10 | 3.646293e-02 pass |
10000 | 10000 | 4 | 6.723 | 99.1960 | 1.034086e-10 | 3.646293e-02 pass |
15000 | 15000 | 4 | 28.406 | 79.2234 | 2.206862e-10 | 3.475844e-02 pass |
15000 | 15000 | 4 | 31.787 | 70.7968 | 2.206862e-10 | 3.475844e-02 pass |
18000 | 18008 | 4 | 73.501 | 52.9061 | 2.839891e-10 | 3.110030e-02 pass |
18000 | 18008 | 4 | 70.655 | 55.0374 | 2.839891e-10 | 3.110030e-02 pass |
20000 | 20016 | 4 | 94.526 | 56.4303 | 3.839458e-10 | 3.398761e-02 pass |
20000 | 20016 | 4 | 90.065 | 59.2251 | 3.839458e-10 | 3.398761e-02 pass |
22000 | 22008 | 4 | 117.070 | 60.6444 | 4.191068e-10 | 3.069793e-02 pass |
22000 | 22008 | 4 | 151.761 | 46.7817 | 4.191068e-10 | 3.069793e-02 pass |
25000 | 25000 | 4 | 173.265 | 60.1270 | 5.526876e-10 | 3.142937e-02 pass |
25000 | 25000 | 4 | 289.820 | 35.9461 | 5.526876e-10 | 3.142937e-02 pass |
26000 | 26000 | 4 | 225.227 | 52.0305 | 6.643859e-10 | 3.493541e-02 pass |
26000 | 26000 | 4 | 232.029 | 50.5052 | 6.643859e-10 | 3.493541e-02 pass |
27000 | 27000 | 4 | 251.399 | 52.2017 | 6.191166e-10 | 3.019127e-02 pass |
30000 | 30000 | 1 | 298.730 | 60.2612 | 8.044883e-10 | 3.171301e-02 pass |
35000 | 35000 | 1 | 465.466 | 61.4133 | 1.155255e-09 | 3.353531e-02 pass |
40000 | 40000 | 1 | 696.626 | 61.2522 | 1.430789e-09 | 3.182123e-02 pass |
45000 | 45000 | 1 | 3433.360 | 17.6952 | 1.628164e-09 | 2.864583e-02 pass |
Performance Summary (GFlops)
Size | LDA | Align. | Average | Maximal |
---|---|---|---|---|
1000 | 1000 | 4 | 92.5347 | 95.8483 |
2000 | 2000 | 4 | 103.9766 | 104.1368 |
5000 | 5008 | 4 | 112.5664 | 113.9918 |
10000 | 10000 | 4 | 104.0170 | 108.8381 |
15000 | 15000 | 4 | 75.0101 | 79.2234 |
18000 | 18008 | 4 | 53.9718 | 55.0374 |
20000 | 20016 | 4 | 57.8277 | 59.2251 |
22000 | 22008 | 4 | 53.7131 | 60.6444 |
25000 | 25000 | 4 | 48.0366 | 60.1270 |
26000 | 26000 | 4 | 51.2679 | 52.0305 |
27000 | 27000 | 4 | 52.2017 | 52.2017 |
30000 | 30000 | 1 | 60.2612 | 60.2612 |
35000 | 35000 | 1 | 61.4133 | 61.4133 |
40000 | 40000 | 1 | 61.2522 | 61.2522 |
45000 | 45000 | 1 | 17.6952 | 17.6952 |
Residual checks PASSED
End of tests
2019/07/16 01:44
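Haswell turns out to be surprisingly fast.
While the array size is small it readily delivers 100 GFlops, but as the arrays grow, performance seems to fall off steadily.
The drop becomes noticeable once a single solve takes more than 10 seconds.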
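The 10-second observation can be checked directly against the result rows. The sketch below parses a few rows copied from the Core i7-4700MQ table and compares the mean GFlops of short and long runs. The helper name `parse_row` and the hard 10-second split are choices made for this illustration only.

```python
from statistics import mean

# A few result rows copied from the Core i7-4700MQ table above.
ROWS = """\
5000 | 5008 | 4 | 0.750 | 111.1409 | 2.567213e-11 | 3.579772e-02 pass |
10000 | 10000 | 4 | 6.127 | 108.8381 | 1.034086e-10 | 3.646293e-02 pass |
15000 | 15000 | 4 | 28.406 | 79.2234 | 2.206862e-10 | 3.475844e-02 pass |
18000 | 18008 | 4 | 73.501 | 52.9061 | 2.839891e-10 | 3.110030e-02 pass |
"""


def parse_row(line: str) -> tuple[int, float, float]:
    """Return (size, seconds, gflops) from one pipe-separated result row."""
    cells = [c.strip() for c in line.split("|")]
    # Take the first token so cells such as "0.009 Sec" also parse.
    return int(cells[0]), float(cells[3].split()[0]), float(cells[4].split()[0])


runs = [parse_row(line) for line in ROWS.splitlines() if line.strip()]
short = [g for _, t, g in runs if t < 10.0]
long_ = [g for _, t, g in runs if t >= 10.0]
print(f"under 10 s: {mean(short):.1f} GFlops, 10 s or more: {mean(long_):.1f} GFlops")
```

Pasting all rows of a run into `ROWS` gives the same comparison for the whole table. The split says nothing about the cause of the drop; it only confirms where it shows up.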