Edited at

C++は本当にRustに速度で負けるのか clang++編 〜「RustがC++に速度で勝った話」のベンチマークを追試する〜 その2


何故 clang++ 編?

前記事で、clang でも結果変わらないって書いてあったじゃん?

→ 再度追試を行ったら違う結果が出てきました。ごめんなさい。

(恐らくコンパイルした後にリンクのコマンドを打つのを忘れていたものと思われる……。辛い)

また、こちらの記事の方が正確に検証しています。

先に本記事の結論を書いておきます。


本記事の結論


  • clang++ は -O3 でリンクアンローリングを行う(前記事からの修正 / 具体的には自分でコンパイルして調べて頂ければ)

  • 下記ベンチマーク環境においては、clang++ での出力結果は Rust の出力と比較して、統計的に有意(p < 0.01)に速い


ベンチマーク環境


コンパイラ on 環境その2


  • clang


    • 8.0.0



  • Rustc


    • 1.39.0-nightly



環境その2でコンパイルしたものを環境その1で流用しています。


環境その1


  • OS


    • Fedora 30



  • CPU


    • AMD A6-1450



  • Memory


    • DDR3-1333 4GB x 1 (4GB)




環境その2


  • OS


    • Fedora 30



  • CPU


    • Intel i7 4771



  • Memory


    • DDR3-1600 8GB x 2 (16GB)




測定条件


コンパイルコマンド

# libSolveRust.so は前回のものを流用するものとして、省略

# clang++ でのコンパイル
clang++ main.cpp -I/usr/include/c++/9/x86_64-redhat-linux/ -O3 -c -o main.o
# リンク
g++ main.o -L. -lSolveRust

このコンパイルオプションにてmain.cppのみ編集し、 Rust → C++ → C++ → Rustの順番で実行するバイナリ及び、 C++ → Rust → Rust → C++の順番で実行するバイナリの計2つを用意した。


実行回数


  • 環境その1ではRust → C++ → C++ → Rustのバイナリを20回実行した。(Rust, C++の各々で40サンプルずつ)

  • 環境その2では2つのバイナリを各々100回実行するのを2回行った。(Rust, C++の各々で800サンプルずつ)


計測結果

以下で貼られるグラフは、横軸が実行回数、縦軸がSolveの実行時間(ms)です。


環境その1

生アウトプット

Generating Problem...

SequentialNative, PreProcess, 46.536, Solve, 7435.52, PostProcess, 6.584
Rust, PreProcess, 43.817, Solve, 7459.02, PostProcess, 6.512
Rust, PreProcess, 43.7, Solve, 7469.44, PostProcess, 6.947
SequentialNative, PreProcess, 49.377, Solve, 7603.81, PostProcess, 5.76
Generating Problem...
SequentialNative, PreProcess, 43.793, Solve, 7432.91, PostProcess, 6.422
Rust, PreProcess, 42.819, Solve, 7471.99, PostProcess, 5.735
Rust, PreProcess, 43.535, Solve, 7475.79, PostProcess, 5.66
SequentialNative, PreProcess, 43.664, Solve, 7433.53, PostProcess, 5.713
Generating Problem...
SequentialNative, PreProcess, 41.582, Solve, 7430.47, PostProcess, 6.403
Rust, PreProcess, 43.698, Solve, 7471.96, PostProcess, 5.635
Rust, PreProcess, 43.163, Solve, 7462.47, PostProcess, 5.72
SequentialNative, PreProcess, 42.248, Solve, 7419.78, PostProcess, 5.25
Generating Problem...
SequentialNative, PreProcess, 43.649, Solve, 7421.57, PostProcess, 5.47
Rust, PreProcess, 41.425, Solve, 7475.46, PostProcess, 5.697
Rust, PreProcess, 41.874, Solve, 7469.53, PostProcess, 5.636
SequentialNative, PreProcess, 43.456, Solve, 7432.9, PostProcess, 5.651
Generating Problem...
SequentialNative, PreProcess, 44.004, Solve, 7443.77, PostProcess, 5.707
Rust, PreProcess, 42.41, Solve, 7467.45, PostProcess, 5.665
Rust, PreProcess, 43.423, Solve, 7471.64, PostProcess, 5.658
SequentialNative, PreProcess, 42.495, Solve, 7417.09, PostProcess, 5.378
Generating Problem...
SequentialNative, PreProcess, 43.938, Solve, 7415.61, PostProcess, 5.558
Rust, PreProcess, 42.728, Solve, 7459.35, PostProcess, 5.663
Rust, PreProcess, 41.35, Solve, 7465.66, PostProcess, 5.687
SequentialNative, PreProcess, 43.277, Solve, 7421, PostProcess, 5.445
Generating Problem...
SequentialNative, PreProcess, 41.57, Solve, 7407.09, PostProcess, 5.531
Rust, PreProcess, 42.595, Solve, 7451.06, PostProcess, 5.621
Rust, PreProcess, 43.365, Solve, 7458.52, PostProcess, 5.556
SequentialNative, PreProcess, 41.133, Solve, 7424.89, PostProcess, 5.367
Generating Problem...
SequentialNative, PreProcess, 43.564, Solve, 7441.41, PostProcess, 5.779
Rust, PreProcess, 41.476, Solve, 7471.34, PostProcess, 5.633
Rust, PreProcess, 41.266, Solve, 7462.76, PostProcess, 5.714
SequentialNative, PreProcess, 43.277, Solve, 7456.74, PostProcess, 5.464
Generating Problem...
SequentialNative, PreProcess, 43.55, Solve, 7419.47, PostProcess, 5.526
Rust, PreProcess, 41.23, Solve, 7489.61, PostProcess, 5.523
Rust, PreProcess, 42.019, Solve, 7467.35, PostProcess, 5.541
SequentialNative, PreProcess, 41.42, Solve, 7422.92, PostProcess, 5.624
Generating Problem...
SequentialNative, PreProcess, 41.751, Solve, 7440.62, PostProcess, 5.373
Rust, PreProcess, 42.39, Solve, 7474.82, PostProcess, 5.516
Rust, PreProcess, 41.008, Solve, 7465.9, PostProcess, 5.436
SequentialNative, PreProcess, 41.701, Solve, 7429.15, PostProcess, 5.501
Generating Problem...
SequentialNative, PreProcess, 44.135, Solve, 7416, PostProcess, 5.71
Rust, PreProcess, 43.874, Solve, 7477.67, PostProcess, 5.568
Rust, PreProcess, 42.902, Solve, 7476.97, PostProcess, 5.579
SequentialNative, PreProcess, 43.827, Solve, 7444.09, PostProcess, 5.682
Generating Problem...
SequentialNative, PreProcess, 41.671, Solve, 7450.28, PostProcess, 5.814
Rust, PreProcess, 41.627, Solve, 7472.48, PostProcess, 5.467
Rust, PreProcess, 43.53, Solve, 7467.89, PostProcess, 5.763
SequentialNative, PreProcess, 41.289, Solve, 7431.97, PostProcess, 5.555
Generating Problem...
SequentialNative, PreProcess, 41.948, Solve, 7422.41, PostProcess, 5.309
Rust, PreProcess, 43.945, Solve, 7472.94, PostProcess, 5.808
Rust, PreProcess, 43.039, Solve, 7464.25, PostProcess, 5.4
SequentialNative, PreProcess, 42.787, Solve, 7427.44, PostProcess, 5.594
Generating Problem...
SequentialNative, PreProcess, 43.292, Solve, 7428.2, PostProcess, 5.551
Rust, PreProcess, 41.181, Solve, 7465.17, PostProcess, 5.479
Rust, PreProcess, 41.408, Solve, 7464.66, PostProcess, 5.809
SequentialNative, PreProcess, 42.568, Solve, 7429.14, PostProcess, 5.548
Generating Problem...
SequentialNative, PreProcess, 43.786, Solve, 7427.12, PostProcess, 5.46
Rust, PreProcess, 43.255, Solve, 7462.49, PostProcess, 5.565
Rust, PreProcess, 46.043, Solve, 7472.55, PostProcess, 5.524
SequentialNative, PreProcess, 42.449, Solve, 7441.32, PostProcess, 5.633
Generating Problem...
SequentialNative, PreProcess, 43.566, Solve, 7418.71, PostProcess, 5.55
Rust, PreProcess, 42.665, Solve, 7456.97, PostProcess, 5.627
Rust, PreProcess, 42.722, Solve, 7465.98, PostProcess, 5.671
SequentialNative, PreProcess, 43.447, Solve, 7423.07, PostProcess, 5.568
Generating Problem...
SequentialNative, PreProcess, 42.095, Solve, 7423.1, PostProcess, 5.58
Rust, PreProcess, 41.659, Solve, 7466.42, PostProcess, 5.654
Rust, PreProcess, 41.801, Solve, 7455.87, PostProcess, 5.612
SequentialNative, PreProcess, 43.358, Solve, 7410.63, PostProcess, 5.559
Generating Problem...
SequentialNative, PreProcess, 42.212, Solve, 7428.34, PostProcess, 5.506
Rust, PreProcess, 42.701, Solve, 7478.29, PostProcess, 5.496
Rust, PreProcess, 43.163, Solve, 7483.22, PostProcess, 5.577
SequentialNative, PreProcess, 42.366, Solve, 7445.22, PostProcess, 5.581
Generating Problem...
SequentialNative, PreProcess, 43.629, Solve, 7451.63, PostProcess, 5.63
Rust, PreProcess, 42.068, Solve, 7480.78, PostProcess, 5.635
Rust, PreProcess, 42.279, Solve, 7492.77, PostProcess, 5.615
SequentialNative, PreProcess, 41.138, Solve, 7432.22, PostProcess, 5.757
Generating Problem...
SequentialNative, PreProcess, 43.525, Solve, 7427.78, PostProcess, 5.664
Rust, PreProcess, 42.975, Solve, 7462.58, PostProcess, 5.613
Rust, PreProcess, 42.951, Solve, 7467.68, PostProcess, 5.495
SequentialNative, PreProcess, 44.966, Solve, 7422.13, PostProcess, 5.551


img1.png


t-検定

https://wandbox.org/permlink/bO5mBM9SDfLtjodi

    Welch Two Sample t-test

data: cpp and rust
t = -7.2259, df = 45.615, p-value = 4.377e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-45.3178 -25.5672
sample estimates:
mean of x mean of y
7433.776 7469.219


環境その2

(生アウトプットはでかすぎるので省略)

img2.png


  • 200回ごとに切れ切れに計測しているので、その間にCPU温度が下がり周期的に時間が短くなっている場所がある。


    • これは弊環境の空冷が貧弱なためである



  • 400回目までが、Rust → C++ → C++ → Rust の順番での計測、それ以降が C++ → Rust → Rust → C++ である。


t-検定

https://wandbox.org/permlink/yqkI2ZcpB54kfnFp

    Welch Two Sample t-test

data: cpp and rust
t = -3.2078, df = 1596.6, p-value = 0.001364
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-63.32819 -15.26924
sample estimates:
mean of x mean of y
2491.068 2530.366


結論

Rust より C++ の方が速い 🎉


感想

言語間の速度の比較は話題になるものの、実装者の腕やらコンパイラの最適化の差やらで厳密に定義することは本当に難しいなぁと思いました。

また、実行するCPUによってもレジスタリネーミングの強さやら、分岐予測の強さやら、命令のレイテンシー・スループットの違いなどなど、パラメーターが多すぎるのでこの結果が他のCPUでも同じとは到底言えません。

本結果は、あくまでも私の使える環境について結果を述べた似すぎず、かつコンパイラも上記バージョンのみしか使って居ないことはご了承ください。

とは言え、LLVM を使うような言語では、誤差は数%の差に過ぎないので、殆どのケースでは気にしなくてよいのではと思います。

おまけ


g++ はどうなの?

恐らく 100ms 差ぐらい付けられて負けます。@環境その2

追記:

ログ出力したら、CPUが低温ならば g++ の方が勝つような……。意味が分からん……。

ちなみにこの計測は loop-unrolling なし です。恐らく分岐予測がよく効いているものと思われます。(逆に言えばループアンローリングが聞くのはしょっぱい古いCPUだけかもしれません。)

CPU温度が上がるとクロックさがるだけじゃないんでしょうかね。

CPUはちゃんと冷やしましょう。

Generating Problem...

Rust, PreProcess, 10.298, Solve, 1740.17, PostProcess, 1.321
SequentialNative, PreProcess, 11.324, Solve, 1664.96, PostProcess, 1.167
SequentialNative, PreProcess, 10.829, Solve, 1646, PostProcess, 1.335
Rust, PreProcess, 11.378, Solve, 1740.03, PostProcess, 1.328
Generating Problem...
Rust, PreProcess, 10.438, Solve, 1743.86, PostProcess, 1.138
SequentialNative, PreProcess, 11.021, Solve, 1711.19, PostProcess, 1.299
SequentialNative, PreProcess, 11.244, Solve, 1715.12, PostProcess, 1.191
Rust, PreProcess, 10.281, Solve, 1740.28, PostProcess, 1.371
Generating Problem...
Rust, PreProcess, 10.584, Solve, 1771.32, PostProcess, 1.187
SequentialNative, PreProcess, 11.321, Solve, 1691.37, PostProcess, 1.259
SequentialNative, PreProcess, 10.662, Solve, 1685.1, PostProcess, 1.217
Rust, PreProcess, 11.471, Solve, 1771.29, PostProcess, 1.244
Generating Problem...
Rust, PreProcess, 10.587, Solve, 1741.31, PostProcess, 1.339
SequentialNative, PreProcess, 11.796, Solve, 1731.38, PostProcess, 1.172
SequentialNative, PreProcess, 11.583, Solve, 1734.81, PostProcess, 1.249
Rust, PreProcess, 11.23, Solve, 1761.27, PostProcess, 1.198
Generating Problem...
Rust, PreProcess, 10.377, Solve, 1755.59, PostProcess, 1.227
SequentialNative, PreProcess, 11.584, Solve, 1712.65, PostProcess, 1.264
SequentialNative, PreProcess, 11.349, Solve, 1714.66, PostProcess, 1.193
Rust, PreProcess, 11.142, Solve, 1744.15, PostProcess, 1.349
Generating Problem...
Rust, PreProcess, 10.327, Solve, 1740.28, PostProcess, 1.252
SequentialNative, PreProcess, 10.746, Solve, 1732.61, PostProcess, 1.238
SequentialNative, PreProcess, 10.985, Solve, 1730.71, PostProcess, 1.218
Rust, PreProcess, 11.272, Solve, 1755.21, PostProcess, 1.382
Generating Problem...
Rust, PreProcess, 10.903, Solve, 1761.61, PostProcess, 1.199
SequentialNative, PreProcess, 11.187, Solve, 1758.05, PostProcess, 1.271
SequentialNative, PreProcess, 11.57, Solve, 1808.02, PostProcess, 1.229
Rust, PreProcess, 11.812, Solve, 1778.94, PostProcess, 1.31
Generating Problem...
Rust, PreProcess, 10.703, Solve, 1790.12, PostProcess, 1.253
SequentialNative, PreProcess, 11.148, Solve, 1776.15, PostProcess, 1.21
SequentialNative, PreProcess, 11.759, Solve, 1797.99, PostProcess, 1.276
Rust, PreProcess, 12.531, Solve, 1815.31, PostProcess, 1.428
Generating Problem...
Rust, PreProcess, 10.896, Solve, 1763.88, PostProcess, 1.258
SequentialNative, PreProcess, 11.721, Solve, 1814.81, PostProcess, 1.444
SequentialNative, PreProcess, 12.809, Solve, 1796.03, PostProcess, 1.259
Rust, PreProcess, 12.445, Solve, 1777.48, PostProcess, 1.256
Generating Problem...
Rust, PreProcess, 10.502, Solve, 1763, PostProcess, 1.264
SequentialNative, PreProcess, 11.59, Solve, 1758.24, PostProcess, 1.215
SequentialNative, PreProcess, 11.914, Solve, 1756.01, PostProcess, 1.371
Rust, PreProcess, 11.353, Solve, 1752.24, PostProcess, 1.364
Generating Problem...
Rust, PreProcess, 11.144, Solve, 1762.41, PostProcess, 1.648
SequentialNative, PreProcess, 11.629, Solve, 1977.53, PostProcess, 1.322
SequentialNative, PreProcess, 13.661, Solve, 1805.7, PostProcess, 1.278
Rust, PreProcess, 11.349, Solve, 1776.45, PostProcess, 1.224
Generating Problem...
Rust, PreProcess, 11.592, Solve, 1772.13, PostProcess, 1.234
SequentialNative, PreProcess, 11.724, Solve, 1809.37, PostProcess, 1.23
SequentialNative, PreProcess, 12.203, Solve, 1787.89, PostProcess, 1.361
Rust, PreProcess, 12.22, Solve, 1782.05, PostProcess, 1.338
Generating Problem...
Rust, PreProcess, 11.399, Solve, 1866.52, PostProcess, 1.303
SequentialNative, PreProcess, 12.211, Solve, 1911.61, PostProcess, 1.393
SequentialNative, PreProcess, 13.346, Solve, 1929.25, PostProcess, 1.217
Rust, PreProcess, 11.559, Solve, 1755.84, PostProcess, 1.257
Generating Problem...
Rust, PreProcess, 11.583, Solve, 1801.09, PostProcess, 1.359
SequentialNative, PreProcess, 12.749, Solve, 1841.62, PostProcess, 1.292
SequentialNative, PreProcess, 13.008, Solve, 1882.31, PostProcess, 1.365
Rust, PreProcess, 15.713, Solve, 1826.94, PostProcess, 1.255
Generating Problem...
Rust, PreProcess, 12.021, Solve, 1871.9, PostProcess, 1.237
SequentialNative, PreProcess, 12.568, Solve, 1933.16, PostProcess, 1.268
SequentialNative, PreProcess, 14.05, Solve, 1944.84, PostProcess, 1.317
Rust, PreProcess, 13.348, Solve, 1855.93, PostProcess, 1.215
Generating Problem...
Rust, PreProcess, 11.941, Solve, 1811.95, PostProcess, 1.225
SequentialNative, PreProcess, 12.096, Solve, 1847.09, PostProcess, 1.327
SequentialNative, PreProcess, 12.583, Solve, 1873.07, PostProcess, 1.244
Rust, PreProcess, 12.592, Solve, 1859.44, PostProcess, 1.274
Generating Problem...
Rust, PreProcess, 11.605, Solve, 1841.15, PostProcess, 1.259
SequentialNative, PreProcess, 15.006, Solve, 1859.05, PostProcess, 1.232
SequentialNative, PreProcess, 12.989, Solve, 1848.43, PostProcess, 1.225
Rust, PreProcess, 12.971, Solve, 1839.02, PostProcess, 1.289
Generating Problem...
Rust, PreProcess, 11.813, Solve, 1806.84, PostProcess, 1.388
SequentialNative, PreProcess, 18.662, Solve, 1924.03, PostProcess, 1.475
SequentialNative, PreProcess, 14.061, Solve, 1936.88, PostProcess, 1.291
Rust, PreProcess, 12.808, Solve, 1842.07, PostProcess, 1.278
Generating Problem...
Rust, PreProcess, 11.617, Solve, 1834.12, PostProcess, 1.47
SequentialNative, PreProcess, 13.264, Solve, 1906.06, PostProcess, 1.28
SequentialNative, PreProcess, 12.754, Solve, 1988.05, PostProcess, 1.433
Rust, PreProcess, 14.747, Solve, 1878.82, PostProcess, 1.348
Generating Problem...
Rust, PreProcess, 11.886, Solve, 1867.49, PostProcess, 1.334
SequentialNative, PreProcess, 14.298, Solve, 1932.84, PostProcess, 1.324
SequentialNative, PreProcess, 13.551, Solve, 1944.08, PostProcess, 1.262
Rust, PreProcess, 14.161, Solve, 1874.66, PostProcess, 1.261
Generating Problem...
Rust, PreProcess, 14.477, Solve, 2033.64, PostProcess, 1.507
SequentialNative, PreProcess, 16.542, Solve, 2147.52, PostProcess, 1.416
SequentialNative, PreProcess, 15.122, Solve, 2105.68, PostProcess, 1.566
Rust, PreProcess, 15.674, Solve, 2022.19, PostProcess, 1.397
Generating Problem...
Rust, PreProcess, 13.042, Solve, 1942.61, PostProcess, 1.519
SequentialNative, PreProcess, 14.888, Solve, 2057.59, PostProcess, 1.393
SequentialNative, PreProcess, 15.27, Solve, 2062.63, PostProcess, 1.361
Rust, PreProcess, 15.091, Solve, 1993.46, PostProcess, 1.393
Generating Problem...
Rust, PreProcess, 13.817, Solve, 2001.04, PostProcess, 1.329
SequentialNative, PreProcess, 14.899, Solve, 2114.05, PostProcess, 1.514
SequentialNative, PreProcess, 14.662, Solve, 2107.47, PostProcess, 1.389
Rust, PreProcess, 15.896, Solve, 2020.75, PostProcess, 1.427
Generating Problem...
Rust, PreProcess, 13.41, Solve, 1978.48, PostProcess, 1.547
SequentialNative, PreProcess, 18.481, Solve, 2164.13, PostProcess, 1.599
SequentialNative, PreProcess, 18.607, Solve, 2134.39, PostProcess, 1.523
Rust, PreProcess, 16.256, Solve, 2093.99, PostProcess, 1.423
Generating Problem...
Rust, PreProcess, 14.789, Solve, 2040.35, PostProcess, 1.375
SequentialNative, PreProcess, 15.638, Solve, 2146.89, PostProcess, 1.557
SequentialNative, PreProcess, 17.118, Solve, 2123.09, PostProcess, 1.345
Rust, PreProcess, 15.063, Solve, 2034.15, PostProcess, 1.397
Generating Problem...
Rust, PreProcess, 12.828, Solve, 1893.72, PostProcess, 1.285
SequentialNative, PreProcess, 13.921, Solve, 1961.94, PostProcess, 1.286
SequentialNative, PreProcess, 13.517, Solve, 1986.06, PostProcess, 1.332
Rust, PreProcess, 13.932, Solve, 1905.88, PostProcess, 1.339
Generating Problem...
Rust, PreProcess, 12.983, Solve, 1911.49, PostProcess, 1.383
SequentialNative, PreProcess, 15.077, Solve, 1998.38, PostProcess, 1.337
SequentialNative, PreProcess, 13.65, ^C

追記ここまで。


-march=native -mtune=native はどうなの?

g++ + 環境その2 でしか試してないですが、速度は悪化しました。


原因究明(挫折)

しようと思いましたが、アセンブリを眺めてもよくわかりませんでした。ただし以下のような試行を行いました。


  • Rust 側のコードではポインタを受け取り slice にしています。しかしC++ではVectorTというラッパーオブジェクトを受け取っています。


    • これによって、Rust側ではメモリ領域が連続していることが容易に分かりますが、C++は解析が甘いとうまくSIMD最適化されない可能性があります。



  • Rust では &mut[_]&[_] が同一の参照先を持つことがありません。それを用いた最適化がおこなわれている可能性があります。


    • (一説によると無効化されている話もあるが正確なところは調べておりません)



以上を考慮して、ポインタを直接扱い、また__restrict__を指定するバージョンを作成しました。(動作が完全一致していることは確かめてません

diff --git a/src/ConjugateGradient/SolveRust.rs b/src/ConjugateGradient/SolveRust.rs

index 18d8526..b3f1626 100644
--- a/src/ConjugateGradient/SolveRust.rs
+++ b/src/ConjugateGradient/SolveRust.rs
@@ -1,3 +1,5 @@
+#![crate_type = "cdylib"] // dylibでも動くが無駄な情報がついてくるのでcdylibの方がいい
+
// 疎行列・ベクトル積 y = A x
fn spmv(y : &mut[f64], data : &[f64], column : &[usize], nonzero : &[usize], max_nonzero : usize, x : &[f64])
{
diff --git a/src/ConjugateGradient/SolveSequentialNative.hpp b/src/ConjugateGradient/SolveSequentialNative.hpp
index d4a9b4f..e37c3be 100644
--- a/src/ConjugateGradient/SolveSequentialNative.hpp
+++ b/src/ConjugateGradient/SolveSequentialNative.hpp
@@ -3,6 +3,8 @@

#include "Problem.hpp"

+#include <cstring>
+
namespace ConjugateGradient
{
// 普通に逐次実行で解く
@@ -10,21 +12,29 @@ namespace ConjugateGradient
{
private:
// 疎行列・ベクトル積 y = A x
- static void SpMV(Problem::VectorT& y, const Problem::MatrixT& A, const Problem::VectorT& x)
+ // static void SpMV(Problem::VectorT& __restrict__ y, const Problem::MatrixT& __restrict__ A, const Problem::VectorT& x)
+ static void SpMV(
+ double *const __restrict__ y,
+ double const *const __restrict__ data,
+ std::size_t const *const __restrict__ column,
+ std::size_t const *const __restrict__ nonzero,
+ std::size_t const max_nonzero,
+ double const *const __restrict__ px,
+ std::size_t const n
+ )
{
- const auto n = y.N;
for(auto i = decltype(n)(0); i < n; i++)
{
double y_i = 0;

- const auto nnz = A.Nonzero(i);
- for(auto idx = decltype(nnz)(0); idx < nnz; idx++)
+ auto const nnz = nonzero[i];
+ for(auto idx = decltype(nnz){}; idx < nnz; ++idx)
{
- const auto a_ij = A.Data(i, idx);
- const auto j = A.Column(i, idx);
- const auto x_j = x[j];
- const auto ax = a_ij * x_j;
- y_i += ax;
+ // const auto a_ij = A.Data(i, idx);
+ auto const a_ij = data[i * max_nonzero + idx];
+ auto const j = column[i * max_nonzero + idx];
+ auto const x_j = px[j];
+ y_i += a_ij * x_j;
}

y[i] = y_i;
@@ -32,85 +42,59 @@ namespace ConjugateGradient
}

// ベクトルの加算y += αx
- static void Add(Problem::VectorT& y, const double alpha, const Problem::VectorT& x)
+ static void Add(double *const __restrict__ py, const double alpha, double const *const __restrict__ px, std::size_t const n)
{
- const auto n = y.N;
- for(auto i = decltype(n)(0); i < n; i++)
+ for(auto i = decltype(n){}; i < n; ++i)
{
- const auto x_i = x[i];
- auto y_i = y[i];
-
- y_i += alpha * x_i;
- y[i] = y_i;
+ py[i] += alpha * px[i];
}
}
// ベクトルの加算y = x + βy
- static void Add(Problem::VectorT& y, const Problem::VectorT& x, const double beta)
+ static void Add(double *const __restrict__ py, double const *const __restrict__ px, const double beta, std::size_t const n)
{
- const auto n = y.N;
- for(auto i = decltype(n)(0); i < n; i++)
+ for(auto i = decltype(n){}; i < n; ++i)
{
- const auto x_i = x[i];
- const auto y_i0 = y[i];
-
- const auto y_i = x_i + beta * y_i0;
- y[i] = y_i;
+ py[i] = px[i] + beta * py[i];
}
}
// ベクトルの引き算z = x - y
- static void Sub(Problem::VectorT& z, const Problem::VectorT& x, const Problem::VectorT& y)
+ static void Sub(double *const __restrict__ pz, double const *const __restrict__ px, double const *const __restrict__ py, std::size_t const n)
{
- const auto n = y.N;
- for(auto i = decltype(n)(0); i < n; i++)
+ for(auto i = decltype(n){}; i < n; ++i)
{
- const auto x_i = x[i];
- const auto y_i = y[i];
-
- const auto z_i = x_i - y_i;
- z[i] = z_i;
+ pz[i] = px[i] - py[i];
}
}
// ベクトルの内積r = x・y
- static auto Dot(const Problem::VectorT& x, const Problem::VectorT& y)
+ static auto Dot(double const *const __restrict__ px, double const *const __restrict__ py, std::size_t const n)
{
- auto r = std::remove_reference_t<decltype(x)>::Type(0);
+ auto r = Problem::VectorT::Type{};

- const auto n = y.N;
- for(auto i = decltype(n)(0); i < n; i++)
+ for(auto i = decltype(n){}; i < n; ++i)
{
- const auto x_i = x[i];
- const auto y_i = y[i];
-
- const auto xy = x_i * y_i;
- r += xy;
+ r += px[i] * py[i];
}
return r;
}
// ベクトルの内積r = x・x
- static auto Dot(const Problem::VectorT& x)
+ static auto Dot(double const *const px, std::size_t const n)
{
- auto r = std::remove_reference_t<decltype(x)>::Type(0);
+ auto r = Problem::VectorT::Type{};

- const auto n = x.N;
- for(auto i = decltype(n)(0); i < n; i++)
+ for(auto i = decltype(n){}; i < n; ++i)
{
- const auto x_i = x[i];
-
- const auto xx = x_i * x_i;
- r += xx;
+ r += px[i] * px[i];
}
return r;
}
// ベクトルの複製y = x
- static void Copy(Problem::VectorT& y, const Problem::VectorT& x)
+ static void Copy(double *const __restrict__ py, double const *const __restrict__ px, std::size_t const n)
{
- const auto n = x.N;
- for(auto i = decltype(n)(0); i < n; i++)
- {
- const auto x_i = x[i];
-
- y[i] = x_i;
- }
+ std::memcpy(py, px, n * sizeof(double));
+ // for(auto i = decltype(n){}; i < n; ++i)
+ // {
+ // py[i] = px[i];
+ // }
}

// 解ベクトル
@@ -146,49 +130,40 @@ namespace ConjugateGradient
std::fill_n(x(), n, 0);
}

- void Solve(const Problem& problem)
+ static void SolveRustLike(
+ double const *const __restrict__ data_ptr,
+ std::size_t const *const __restrict__ column_ptr,
+ std::size_t const *const __restrict__ nonzero_ptr,
+ std::size_t const max_non_zero,
+ double *const __restrict__ x_ptr,
+ double const *const __restrict__ b_ptr,
+ double *const __restrict__ ap_ptr,
+ double *const __restrict__ r_ptr,
+ double *const __restrict__ p_ptr,
+ std::size_t const n,
+ std::size_t const loop_count
+ )
{
- const auto& A = problem.A();
- const auto& b = problem.B();
-
- // 初期値を設定
- // (Ap)_0 = A * x
- // r_0 = b - Ap
- // p_0 = r_0
- // rr = r・r
- SpMV(Ap, A, x);
- Sub(r, b, Ap);
- Copy(p, r);
- auto rr = Dot(r);
+ SpMV(ap_ptr, data_ptr, column_ptr, nonzero_ptr, max_non_zero, x_ptr, n);
+ Sub(r_ptr, b_ptr, ap_ptr, n);
+ Copy(p_ptr, r_ptr, n);
+ auto rr = Dot(r_ptr, n);

#ifdef OUTPUT_RESIDUAL
std::cout << std::endl << 0 << ", " << rr << std::endl;
#endif

- const auto loop = Problem::SOLVE_LOOP;
- for(auto i = decltype(loop)(0); i < loop; i++)
+ for(auto i = decltype(loop_count){}; i < loop_count; ++i)
{
- // 計算を実行
- // Ap = A * p
- // α = rr/(p・Ap)
- // x' += αp
- // r' -= αAp
- // r'r' = r'・r'
- SpMV(Ap, A, p);
- const auto pAp = Dot(p, Ap);
- const auto alpha = rr / pAp;
- Add(x, alpha, p);
- Add(r, -alpha, Ap);
- const auto rrNew = Dot(r);
-
- // 収束判定
-
- // 残りの計算を実行
- // β= r'r'/rr
- // p = r' + βp
- // rr = r'r'
- const auto beta = rrNew / rr;
- Add(p, r, beta);
+ SpMV(ap_ptr, data_ptr, column_ptr, nonzero_ptr, max_non_zero, p_ptr, n);
+ auto const pAp = Dot(p_ptr, ap_ptr, n);
+ auto const alpha = rr / pAp;
+ Add(x_ptr, alpha, p_ptr, n);
+ Add(r_ptr, -alpha, ap_ptr, n);
+ auto const rrNew = Dot(r_ptr, n);
+
+ auto const beta = rrNew / rr;
+ Add(p_ptr, r_ptr, beta, n);
rr = rrNew;

#ifdef OUTPUT_RESIDUAL
@@ -196,6 +171,15 @@ namespace ConjugateGradient
#endif
}
}
+ void Solve(const Problem& problem)
+ {
+ const auto& A = problem.A();
+ const auto& b = problem.B();
+
+ SolveRustLike(A.Data(), A.Column(), A.Nonzero(), A.MaxNonzeroCount, x(), b(), Ap(), r(), p(), x.N, Problem::SOLVE_LOOP);
+
+ return;
+ }

void PostProcess(Vector<double>& result)
{
diff --git a/src/ConjugateGradient/Solver.hpp b/src/ConjugateGradient/Solver.hpp
index b48b1a2..202d600 100644
--- a/src/ConjugateGradient/Solver.hpp
+++ b/src/ConjugateGradient/Solver.hpp
@@ -5,6 +5,7 @@
#include "Timer.hpp"

#include <iostream>
+#include <cfloat>

namespace ConjugateGradient
{
diff --git a/src/ConjugateGradient/main.cpp b/src/ConjugateGradient/main.cpp
index 85f5f72..ebb6b08 100644
--- a/src/ConjugateGradient/main.cpp
+++ b/src/ConjugateGradient/main.cpp
@@ -104,14 +104,6 @@ int main()
OutputProblem(problem);
#endif

-#ifdef SOLVER_SEQUENTIAL_NATIVE
- {
- std::cout << "SequentialNative, ";
- ConjugateGradient::SolveSequentialNative solver;
- ConjugateGradient::Solver::Solve(problem, solver);
- }
-#endif
-
#ifdef SOLVER_OPENMP_NATIVE
{
std::cout << "OpenMPNative(";
@@ -223,6 +215,14 @@ int main()
}
#endif

+#ifdef SOLVER_SEQUENTIAL_NATIVE
+ {
+ std::cout << "SequentialNative, ";
+ ConjugateGradient::SolveSequentialNative solver;
+ ConjugateGradient::Solver::Solve(problem, solver);
+ }
+#endif
+
#ifdef SOLVER_RUST
{
std::cout << "Rust, ";
@@ -231,5 +231,21 @@ int main()
}
#endif

+#ifdef SOLVER_RUST
+ {
+ std::cout << "Rust, ";
+ ConjugateGradient::SolveRust solver;
+ ConjugateGradient::Solver::Solve(problem, solver);
+ }
+#endif
+
+#ifdef SOLVER_SEQUENTIAL_NATIVE
+ {
+ std::cout << "SequentialNative, ";
+ ConjugateGradient::SolveSequentialNative solver;
+ ConjugateGradient::Solver::Solve(problem, solver);
+ }
+#endif
+
return 0;
}

これを実行した所、Rustに少々負けるぐらいの速度になりました。(上が平均時間下が標準偏差)

image.png

これをモニョモニョやっている際に、clang++ のミスに気がついて検証し直したのが本記事となります。

いつかちゃんと上のコードを最適化したい……。