More than 3 years have passed since last update.

円周率計算の高速化 (4) 並列化

Last updated at 2021-07-22Posted at 2020-02-24

円周率計算の高速化(1) 比較元の再構成と問題設定
円周率計算の高速化(2) 計算の正しさの確認
円周率計算の高速化(3) 無駄な計算の除去。約7％高速に
円周率計算の高速化(4) 並列化。60％の時間削減

円周率計算の高速化シリーズの4本目です。今回は並列化します。

理論

Binary Splittng の計算は

M_k=\begin{pmatrix}z_k & y_k \\ 0 & x_k\end{pmatrix}

としたときの

M_0M_1\cdots M_{n-1}

と実質的に同等という話をしました。つまり $M_0M_1\cdots M_{m-1}$ と $M_mM_{m+1}\cdots M_{n-1}$ の計算は独立に行うことができるわけです。なので使えるスレッドも半分ずつ担当させることにしましょう。例えば 8 スレッド実行できる環境で $M_0M_1\cdots M_{15}$ を計算しようとする場合、BS の最初の呼び出しでは

$M_0M_1\cdots M_{15}$ (8 threads) → $M_0M_1\cdots M_7$ (4 threads) + $M_8M_9\cdots M_{15}$ (4 threads)

と分けます。以下同様に 2 分割ずつして行くことで 1 スレッドに 2 個の $M_i$ という形で各スレッドが独自に計算できる範囲が決まることになるので、それぞれこれまでと同じ通常の BS を適用することができます。

次は統合していく方向になります。基本的な計算ルールはこれまでの BS と同じで、両担当スレッドの仕事の結果得られた $M_i M_j$ をかけることになります。ただ多倍長計算そのものに GMP ライブラリを使っている以上、それらの中身をマルチスレッド化することはできないのでこの計算はシリアルに行う事になります。

実装

この方針に基づいて parallel_drm() を実装します。

void Computer::parallel_drm(const int64_t n0,
                            const int64_t n1,
                            mpz_class& x0,
                            mpz_class& y0,
                            mpz_class& z0,
                            bool need_z,
                            const int number_of_threads) {
  // 1 スレッド割り当て時は元と同じ実装を使う
  if (number_of_threads == 1) {
    drm(n0, n1, x0, y0, z0, need_z);
    return;
  }

  int num_new_threads = number_of_threads / 2;
  int num_rest_threads = number_of_threads - num_new_threads;

  int64_t m = (n0 + n1) / 2;
  mpz_class x1, y1, z1;
  {
    auto clausure = [&] {
      parallel_drm(m, n1, x1, y1, z1, need_z, num_new_threads);
    };
    std::thread th(clausure);
    parallel_drm(n0, m, x0, y0, z0, true, num_rest_threads);
    th.join();
  }

  // このルーチンは元と同じ
  y0 *= x1;
  y1 *= z0;
  y0 += y1;

  x0 *= x1;
  if (need_z)
    z0 *= z1;
}

void Computer::compute() {
  const int64_t precision = config_.digits * std::log2(10) + 10;
  pi_.set_prec(precision);

  const int64_t n = terms(config_.digits);
  mpz_class x, y, z;
  parallel_drm(0, n, x, y, z, false, config_.number_of_threads);
  postProcess(x, y);
}

これで一応の並列化を行うことができました。コードの全体はこちらです。パラメータが増えたのでコマンドラインで指定できるように変更しつつ、Computer::Configuration にまとめました。

実験

ということで実験を行いました。これまでとは別の計算環境で 1 億桁の計算(基底変換と出力を除く)の時間をスレッド数別に計測した結果がこちらです。ちなみに 55 論理 CPU です。

Threads	1	2	4	8	16	32	64
Time[sec]	112.137	71.296	51.719	42.743	39.078	37.345	38.562
並列度[%]	100	78.6	54.2	32.8	17.9	9.4	4.5

基本的にスレッド数が増えれば計算時間は減っていますが、並列度(1スレッドでの時間/$n$スレッドでの時間/$n$)はあまり高くない形になりました。[理論]の最後に書いている辺りが原因だとは思いますが、まだ違和感もあるので後の投稿で対策を取ることにしましょう。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up