Frighteningly fast: the engineering of speed-ups. CUDA GPU: 2.45 s vs. C++ on CPU with compiler optimization: 2.99 s

Last updated at 2024-07-27. Posted at 2024-07-25.

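Plotting the Mandelbrot set in C++ and comparing processing time with and without optimization

The same C++ source is compiled once without and once with compiler optimization, each binary is run, and its processing time is measured.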

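Run of the C++ build without optimization:
Computation time: 41597 ms

C++, no compiler optimization, total time: 42.13 s

Run of the C++ build with optimization:
Computation time: 2892 ms

C++, compiler optimization enabled, total time: 2.99 s

CUDA GPU: 2.45 s

The millisecond figures are printed by the C++ program itself. The figures in seconds are presumably the wall-clock times measured from Python around the whole process, which would explain why they are slightly larger.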

The effect of optimization on performance is striking: the optimized build is roughly 14 times faster in this run.
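
As a quick sanity check on the numbers above, the speedup ratios can be worked out directly. This is only a sketch: the values are copied from the results in this article, the variable names and the `speedup` helper are my own, and the assumption that the figures in seconds are wall-clock times is labeled in the comments.

```python
# Timings copied from the results reported in this article.
no_opt_ms = 41597      # C++ without optimization, time printed by the program
opt_ms = 2892          # C++ with -O2, time printed by the program
no_opt_wall_s = 42.13  # C++ without optimization, total time (assumed wall clock)
opt_wall_s = 2.99      # C++ with -O2, total time (assumed wall clock)
cuda_s = 2.45          # CUDA GPU


def speedup(slow, fast):
    """Return how many times faster `fast` is than `slow`."""
    return slow / fast


print(f"-O2 vs no optimization (in-program): {speedup(no_opt_ms, opt_ms):.2f}x")
print(f"-O2 vs no optimization (total):      {speedup(no_opt_wall_s, opt_wall_s):.2f}x")
print(f"CUDA vs C++ -O2:                     {speedup(opt_wall_s, cuda_s):.2f}x")
```

So the optimized CPU build is about 14 times faster than the unoptimized one, while the GPU run is only about 1.2 times faster than the optimized CPU build.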

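Code walkthrough

Installing the required libraries:
Installs the Python packages (matplotlib, numpy) and the C++ development tools (g++, cmake, libopencv-dev).

Creating the C++ source code:
The C++ code that plots the Mandelbrot set is stored in the cpp_code variable and written to a file.

Compiling:
Two versions are compiled: one without optimization (-O0) and one with optimization (-O2).

Running and measuring the processing time:
The program is run without and with optimization, and the processing time of each run is measured and printed.

Displaying the generated images:
The Mandelbrot set images generated by the program are displayed.

Running this script lets you compare the processing time with and without optimization and see how much optimization affects performance.

The code that plots the Mandelbrot set in C++ and compares processing time with and without optimization: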

# Install the required libraries
!pip install matplotlib numpy
!apt-get install -y g++ cmake libopencv-dev

import subprocess
import time
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# C++ source code
cpp_code = """
#include <iostream>
#include <fstream>
#include <vector>
#include <complex>
#include <chrono>
#include <cstdint>
#include <string>
#include <tuple>
#include <opencv2/opencv.hpp>

using namespace std;
using namespace std::chrono;

int mandelbrot(const complex<double>& c, int max_iter) {
    complex<double> z = c;
    for (int n = 0; n < max_iter; ++n) {
        if (abs(z) > 2.0) {
            return n;
        }
        z = z * z + c;
    }
    return max_iter;
}

void plot_mandelbrot(double xmin, double xmax, double ymin, double ymax, int width, int height, int max_iter, const string& filename) {
    vector<uint8_t> img(width * height);
    double x_scale = (xmax - xmin) / (width - 1);
    double y_scale = (ymax - ymin) / (height - 1);

    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            double cr = xmin + x * x_scale;
            double ci = ymin + y * y_scale;
            complex<double> c(cr, ci);
            int color = mandelbrot(c, max_iter);
            img[y * width + x] = static_cast<uint8_t>(color % 256);
        }
    }

    cv::Mat image(height, width, CV_8UC1, img.data());
    cv::applyColorMap(image, image, cv::COLORMAP_INFERNO);
    cv::imwrite(filename, image);
}

int main() {
    auto start = high_resolution_clock::now();

    int width = 800, height = 600, max_iter = 256;
    vector<tuple<double, double, double, double>> positions = {
        {-2.0, 1.0, -1.5, 1.5},
        {-1.0, 0.5, -0.5, 0.5},
        {-1.5, -1.0, -1.0, 1.0},
        {-0.5, 0.5, -0.5, 0.5},
        {-2.5, -1.5, -1.5, 1.5}
    };

    for (size_t i = 0; i < positions.size(); ++i) {
        double xmin, xmax, ymin, ymax;
        tie(xmin, xmax, ymin, ymax) = positions[i];
        string filename = "mandelbrot_plot_" + to_string(i + 1) + ".png";
        plot_mandelbrot(xmin, xmax, ymin, ymax, width, height, max_iter, filename);
    }

    auto stop = high_resolution_clock::now();
    auto duration = duration_cast<milliseconds>(stop - start);
    cout << "Computation time: " << duration.count() << " ms" << endl;

    return 0;
}
"""

# Write the C++ source code to a file
with open('mandelbrot.cpp', 'w') as file:
    file.write(cpp_code)

# Get the OpenCV compiler and linker flags via pkg-config
pkg_config_output = subprocess.check_output(['pkg-config', '--cflags', '--libs', 'opencv4']).decode().strip()

# Compile without optimization (-O0 is g++'s default, stated explicitly here)
compile_command_no_opt = f'g++ -O0 -o mandelbrot_no_opt mandelbrot.cpp {pkg_config_output}'
subprocess.run(compile_command_no_opt, shell=True, check=True)

# Compile with optimization (-O2)
compile_command_opt = f'g++ -O2 -o mandelbrot_opt mandelbrot.cpp {pkg_config_output}'
subprocess.run(compile_command_opt, shell=True, check=True)

# Run the unoptimized build and measure the wall-clock time
print("Run without optimization:")
start_time = time.time()
result_no_opt = subprocess.run(['./mandelbrot_no_opt'], capture_output=True, text=True)
end_time = time.time()
print(result_no_opt.stdout)
print(f'Elapsed time: {end_time - start_time:.2f} s')

# Run the optimized build and measure the wall-clock time
print("\nRun with optimization:")
start_time = time.time()
result_opt = subprocess.run(['./mandelbrot_opt'], capture_output=True, text=True)
end_time = time.time()
print(result_opt.stdout)
print(f'Elapsed time: {end_time - start_time:.2f} s')

# Display the generated images
for i in range(1, 6):
    img = mpimg.imread(f'mandelbrot_plot_{i}.png')
    plt.figure(figsize=(10, 7))
    plt.imshow(img, cmap='inferno')
    plt.title(f'Mandelbrot Set - Plot {i}')
    plt.axis('off')
    plt.show()
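
For readers who want to follow the per-pixel computation without compiling anything, here is a minimal pure-Python sketch of the same escape-time logic. It mirrors the C++ `mandelbrot` function and the pixel-to-coordinate mapping in `plot_mandelbrot`; the helper name `pixel_to_complex` is my own and does not appear in the C++ code.

```python
def mandelbrot(c: complex, max_iter: int) -> int:
    """Escape-time count, mirroring the C++ function (z starts at c)."""
    z = c
    for n in range(max_iter):
        if abs(z) > 2.0:
            return n
        z = z * z + c
    return max_iter


def pixel_to_complex(x, y, xmin, xmax, ymin, ymax, width, height):
    """Map pixel (x, y) to a point in the complex plane, as plot_mandelbrot does."""
    x_scale = (xmax - xmin) / (width - 1)
    y_scale = (ymax - ymin) / (height - 1)
    return complex(xmin + x * x_scale, ymin + y * y_scale)


# A point inside the set never escapes and reaches max_iter.
print(mandelbrot(0j, 256))
# A point far outside the set escapes immediately.
print(mandelbrot(2 + 2j, 256))
# The top-left pixel of the first region in `positions`.
print(pixel_to_complex(0, 0, -2.0, 1.0, -1.5, 1.5, 800, 600))
```

One detail worth noticing: the C++ code stores `color % 256`, so points that reach `max_iter = 256` are stored as 0, the same value as points that escape on the very first check.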

================================================================

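As the next exercise, we compare the execution time of a loop that counts to one billion (adding up the integers from 1 to 1,000,000,000) using the following four approaches:

Python: runs the loop as ordinary Python code
Numba JIT: compiles the Python code with Numba's JIT compiler
C++ (no optimization): runs the loop in C++ compiled without optimization
C++ (optimization): runs the loop in C++ compiled with optimization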

Results:

Python counting time: 77.669248 seconds

Numba JIT Python counting time: 0.795481 seconds

C++ (no optimization) counting time: 3.189950 seconds

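C++ (optimization) counting time: 0.000000 seconds (too small to measure at this precision; the fastest of the four).

Note that count is never read after the loop in the C++ code, so at -O2 the compiler is free to delete the loop entirely. The near-zero time most likely reflects that, not a billion additions executed in no time.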
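
Since the loop adds up the integers from 1 to one billion, the value it should produce can be checked with the closed-form formula n(n+1)/2. The sketch below also confirms that the result fits in a signed 64-bit integer, which is what `long long` holds on typical platforms; the variable names are my own.

```python
n = 1_000_000_000

# Closed form for 1 + 2 + ... + n.
expected_sum = n * (n + 1) // 2

# Largest value a signed 64-bit integer can hold.
INT64_MAX = 2**63 - 1

print(expected_sum)               # 500000000500000000
print(expected_sum <= INT64_MAX)  # True
```

Printing `count` in each benchmark and comparing it with this value would be one way to confirm that every variant really computed the sum.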

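Summary

Numba JIT:

Compiling the Python code with Numba's JIT compiler made it run far faster than plain Python (about 98 times faster in this run, even though the measured time includes JIT compilation on the first call).

C++ with vs. without optimization:

Enabling optimization cut the C++ processing time dramatically.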

!pip install numba

import time
import subprocess
from numba import jit

# C++ code to be saved to a file
cpp_code = """
#include <iostream>
#include <chrono>

int main() {
    auto start = std::chrono::high_resolution_clock::now();
    
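    // NOTE: count is never read after the loop, so an optimizing
    // compiler is allowed to remove the loop entirely.
    long long count = 0;
    for (long long i = 1; i <= 1000000000; ++i) {
        count += i;
    }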
    
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> duration = end - start;
    std::cout << "Time taken for counting in C++: " << duration.count() << " seconds" << std::endl;

    return 0;
}
"""

# Write the C++ code to a file
with open('counting.cpp', 'w') as file:
    file.write(cpp_code)

# Function that compiles the C++ code
def compile_cpp_code(optimization_flag=""):
    try:
        compile_command = f"g++ {optimization_flag} -o counting counting.cpp"
        # Capture output so that e.output and e.stderr are populated on failure
        subprocess.run(compile_command, shell=True, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as e:
        print("Error in compilation:", e)
        print("Output:", e.output)
        print("Error output:", e.stderr)

# Function that runs the compiled C++ program and parses the time it prints
def run_cpp_code():
    try:
        result = subprocess.run(["./counting"], check=True, capture_output=True, text=True)
        cpp_time = float(result.stdout.split(": ")[1].split(" ")[0])
        return cpp_time
    except subprocess.CalledProcessError as e:
        print("Error in execution:", e)
        print("Output:", e.output)
        print("Error output:", e.stderr)
        return None

# Run the counting loop in plain Python
def python_counting():
    start = time.time()
    count = 0
    for i in range(1, 1000000001):
        count += i
    end = time.time()
    return end - start

# Counting loop compiled with Numba's JIT compiler
@jit(nopython=True)
def jit_counting():
    count = 0
    for i in range(1, 1000000001):
        count += i

# Measure the execution time in plain Python
python_time = python_counting()
print(f"Python counting time: {python_time:.6f} seconds")

# Measure the execution time with Numba JIT (this first call includes JIT compilation)
start = time.time()
jit_counting()
jit_time = time.time() - start
print(f"Numba JIT Python counting time: {jit_time:.6f} seconds")

# Compile the C++ code (no optimization)
compile_cpp_code()

# Measure the C++ execution time (no optimization)
cpp_time_no_opt = run_cpp_code()
if cpp_time_no_opt is not None:
    print(f"C++ (no optimization) counting time: {cpp_time_no_opt:.6f} seconds")

# Compile the C++ code (with optimization)
compile_cpp_code("-O2")

# Measure the C++ execution time (with optimization)
cpp_time_opt = run_cpp_code()
if cpp_time_opt is not None:  # 0.0 is a valid measurement, so compare against None
    print(f"C++ (optimization) counting time: {cpp_time_opt:.6f} seconds")

# Compare the results
if cpp_time_no_opt is not None and cpp_time_opt is not None:
    print(f"Numba JIT Python is {python_time / jit_time:.2f} times faster than regular Python")
    print(f"C++ (no optimization) is {python_time / cpp_time_no_opt:.2f} times faster than Python")
    if cpp_time_opt > 0:
        print(f"C++ (optimization) is {python_time / cpp_time_opt:.2f} times faster than Python")
        print(f"C++ (no optimization) is {cpp_time_no_opt / cpp_time_opt:.2f} times slower than C++ (optimization)")
    else:
        # Avoid dividing by zero when the measured time is exactly 0
        print("C++ (optimization) time was below the timer resolution; ratios skipped")
else:
    print("C++ execution failed")
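
One caveat about the Numba measurement above: `jit_counting()` is compiled on its first call, so that single timed call includes compilation. A common way to separate the two is to do an untimed warm-up call first and then take the best of a few timed calls. The helper below is a minimal sketch of that idea; `time_call` and the stand-in workload `small_sum` are hypothetical names of my own, not part of the script above.

```python
import time


def time_call(fn, warmup=1, repeats=3):
    """Call fn() `warmup` times untimed, then return the best of `repeats` timed calls."""
    for _ in range(warmup):
        fn()  # with Numba, the first call is where JIT compilation happens
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def small_sum():
    # Stand-in workload (hypothetical); in the notebook you would pass jit_counting.
    total = 0
    for i in range(1, 100_001):
        total += i
    return total


print(f"best of 3: {time_call(small_sum):.6f} seconds")
```

In the notebook, `time_call(jit_counting)` would report the run time without the compilation step.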

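ポエ太郎
Tell me about the C++ compiler used in Google Colab. It is the compiler that ships as standard on Linux, right? And with optimization enabled it produces frighteningly fast code. What techniques does it use? It is a compiler with quite a long history, isn't it?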

ChatGPT
The C++ compiler used in Google Colab is mainly the GNU Compiler Collection (GCC). GCC ships as the default compiler on many Linux distributions and is very widely used.

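GCC's designer and history
GCC was first released in 1987 by Richard Stallman, founder of the Free Software Foundation (FSF), as the compiler for the GNU Project. It was designed to be a cornerstone of free and open-source software, and over time it has received many improvements and optimizations, earning a reputation for strong performance and broad compatibility.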

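Optimization techniques
One reason GCC can produce frighteningly fast code is its powerful set of optimizations. Some of the main ones are:

Inlining:
To cut function-call overhead, the body of a small function is expanded directly at its call site.

Loop optimization:
Techniques such as loop unrolling and loop fusion are used to make loops run faster.

Dead code elimination:
Code that is never executed, or whose result is never used, is removed.

Common subexpression elimination:
When the same expression appears more than once, it is computed once and the result is reused.

Constant folding:
Constant expressions are evaluated at compile time, reducing the work done at run time.

Register allocation:
Values are kept in registers as effectively as possible, reducing memory accesses and improving speed.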
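
To make a few of the items above concrete, here is a small hand-written illustration of constant folding, common subexpression elimination and dead code elimination. It is only a sketch: GCC applies these transformations to its internal representation of C++ code, not to source text, and the function names here are hypothetical.

```python
def seconds_before(days, rate, n):
    """Straightforward version, as a programmer might write it."""
    seconds_per_day = 60 * 60 * 24              # constant expression
    a = days * seconds_per_day * rate + n       # product computed here...
    b = days * seconds_per_day * rate - n       # ...and computed again here
    unused = a * b                              # result is never used
    return a + b


def seconds_after(days, rate, n):
    """Same result after applying the three optimizations by hand."""
    common = days * 86400 * rate                # folded constant, computed once
    return (common + n) + (common - n)          # the unused product is gone


print(seconds_before(2, 3, 5))
print(seconds_after(2, 3, 5))
```

Both functions return the same value for integer inputs; the second simply does less work to get there, which is the whole point of these optimizations.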

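History and reputation
After more than 30 years of development and refinement, GCC is a very mature compiler. It is actively maintained by the open-source community and supports a wide range of architectures and languages. Its performance and reliability have led to its use in a broad range of applications.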

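ポエ太郎
In the Mandelbrot set computation, the processing time of the optimized C++ build was almost the same as the processing time using the GPU on Google Colab. Is code built with an optimizing C++ compiler really that fast? What does it mean that it matches the GPU? Please explain.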

ChatGPT
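Several factors are involved when code built with an optimizing C++ compiler runs in almost the same time as a GPU. The background is as follows.

C++ compiler optimization
Optimizing C++ compilers (for example GCC and Clang) use very sophisticated techniques to maximize execution speed on the CPU, including:

Loop unrolling: does more work per iteration, reducing the number of iterations and the loop overhead.
Vectorization: uses SIMD (Single Instruction, Multiple Data) instructions to process several data elements at once.
Inlining: removes function-call overhead.
Dead code elimination: removes unnecessary code.
Register allocation: uses registers effectively to minimize memory accesses.

These optimizations can dramatically improve the speed of C++ code on the CPU.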
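
Loop unrolling, the first item in the list above, is easy to show by hand. The sketch below sums a list four elements per iteration and then handles the leftover elements. It is purely illustrative: the function names are my own, a compiler does this on machine-level loops, and in CPython the unrolled version is not meant as a performance recommendation.

```python
def sum_rolled(values):
    """One addition per loop iteration."""
    total = 0
    for v in values:
        total += v
    return total


def sum_unrolled4(values):
    """Four additions per loop iteration, then a cleanup loop for the remainder."""
    total = 0
    n = len(values)
    limit = n - n % 4  # largest multiple of 4 not exceeding n
    for i in range(0, limit, 4):
        total += values[i] + values[i + 1] + values[i + 2] + values[i + 3]
    for i in range(limit, n):
        total += values[i]
    return total


data = list(range(10))
print(sum_rolled(data), sum_unrolled4(data))
```

The unrolled loop runs a quarter as many iterations, so the loop counter is updated and tested a quarter as often. Grouping independent operations like this is also what gives a compiler the opportunity to use SIMD instructions.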

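GPU characteristics by comparison
A GPU (Graphics Processing Unit) excels at massively parallel work. Its characteristics include:

Many cores: it can run a large number of threads at the same time, so it is good at parallel processing.
High throughput: it can process a large amount of data at once.

However, a GPU incurs data-transfer overhead: data has to be copied from the CPU to the GPU and back. A GPU is also very efficient for certain kinds of computation, but not for every kind.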

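Why are the C++ and GPU processing times the same?

Nature of the computation:
The Mandelbrot set computation is, for the most part, a collection of independent per-point calculations. That suits a GPU, but vectorization and other CPU optimizations are effective for it as well. When C++ optimization works very well and extracts close to the full performance of the CPU hardware, the gap to the GPU can narrow.

Data-transfer overhead:
With a small data set, or when transfer overhead is large, the GPU's advantage can shrink.

Higher-performance CPUs:
Recent CPUs have many cores and advanced SIMD instruction sets, so for some tasks they can rival a GPU.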
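
To put the "small data set" point in perspective, the size of this particular workload can be worked out from the parameters in the C++ `main()` function. The sketch below is only arithmetic on those values. It says nothing about actual transfer or kernel times, and the CUDA code itself is not shown in this article.

```python
width, height, max_iter = 800, 600, 256  # values from main() in the C++ code
num_images = 5                           # five regions in `positions`

pixels_per_image = width * height
total_pixels = pixels_per_image * num_images
# Upper bound on loop iterations: every pixel running all the way to max_iter.
max_iterations = total_pixels * max_iter
# One uint8_t per pixel before the color map is applied.
bytes_per_image = pixels_per_image

print(f"pixels per image : {pixels_per_image:,}")
print(f"total pixels     : {total_pixels:,}")
print(f"max iterations   : {max_iterations:,}")
print(f"bytes per image  : {bytes_per_image:,}")
```

At most about 614 million inner-loop iterations and under half a megabyte of raw data per image is a fairly modest job, so fixed costs such as start-up, image encoding and file writing may account for a noticeable share of both measured times.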

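Summary
When optimized C++ code runs at almost the same speed as a GPU, the likely explanations are:

The optimizing compiler is working very effectively.
The computation is well suited to the CPU.
Data-transfer overhead is cancelling out the GPU's advantage.

In short, a number of technical factors lie behind the speed of code built with an optimizing C++ compiler.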
