CUDA programming on Google Colab

Last updated at 2021-05-24, posted at 2019-03-14

1. Overview

CUDA programming is straightforward on Google Colab. There are two approaches:

- Use a plugin for Jupyter notebooks.
- Fetch a whole source tree from GitHub or a similar host.

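1.1. Using the plugin

Setting up the environment takes about three lines, after which code of about one page runs easily. With V2 of the plugin, programs split across multiple files can also be compiled and run.
As of April 2019, a Tesla T4 may be assigned in addition to the K80, so Tensor Cores can be exercised as well.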
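
Which GPU a session received can be checked with `!nvidia-smi`. As a small sketch of why the T4 matters here, the Python below compares the compute capabilities of the two GPUs against the 7.0 threshold at which Tensor Cores were introduced. The lookup table and the function name `has_tensor_cores` are illustrative and are not part of the plugin or of this article's original code.

```python
# Compute capabilities of the two GPUs named in this article.
# Assumption: K80 = 3.7 and T4 = 7.5, taken from NVIDIA's published values.
COMPUTE_CAPABILITY = {"Tesla K80": (3, 7), "Tesla T4": (7, 5)}

def has_tensor_cores(gpu_name):
    """Tensor Cores first appeared with compute capability 7.0 (Volta)."""
    return COMPUTE_CAPABILITY[gpu_name] >= (7, 0)

for name in COMPUTE_CAPABILITY:
    print(name, has_tensor_cores(name))
```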

1.2. From GitHub or a similar host

Clone the source from GitHub and compile it.

2. Environment setup

Check that the compiler is installed:

!/usr/local/cuda/bin/nvcc --version
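
The release number in that output matters later, when the sample Makefiles use options that an older nvcc may reject. The following is a minimal Python sketch of reading it programmatically. The helper names are hypothetical, and the sample string only imitates the format of the `nvcc --version` output; its version number is illustrative.

```python
import re
import shutil
import subprocess

def parse_nvcc_release(version_text):
    """Extract (major, minor) from the 'release X.Y' part of `nvcc --version` output."""
    match = re.search(r"release (\d+)\.(\d+)", version_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

def installed_nvcc_release(nvcc="/usr/local/cuda/bin/nvcc"):
    """Run nvcc if it exists and return its release tuple, otherwise None."""
    path = shutil.which(nvcc)
    if path is None:
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)

# Sample text in the format nvcc prints; the version number is illustrative.
sample = "Cuda compilation tools, release 11.2, V11.2.152"
print(parse_nvcc_release(sample))
print(installed_nvcc_release())  # None on a machine without nvcc
```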

Install and load the plugin that makes the NVCC compiler usable from Google Colab cells:

!pip install git+https://github.com/andreinechaev/nvcc4jupyter.git
%load_ext nvcc_plugin

Making the CUDA samples header files available.
- Install the header file (helper_cuda.h) that provides functions such as checkCudaErrors.

!git clone https://github.com/NVIDIA/cuda-samples/
!cp cuda-samples/Common/* /usr/local/include

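3. Usage

3.1. Plugin

The plugin comes in two versions: V1 (single source file) and V2 (multiple source files). At the time of writing, the default module load (%load_ext) brings in both.

3.1.1. V1

Putting %%cu at the top of a cell invokes NVCC on the cell contents and runs the result. The code below prints Hello world.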

%%cu
#include <iostream>
int main() {
    std::cout << "Hello world\n";
    return 0;
}

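Plenty of sample code is available, and it is worth trying some of it. Code of about one page runs easily.
However, command-line arguments (args) are not supplied at run time, so code that reads them must be rewritten to set those values itself.

The cuda-samples code mentioned above can also be built and run: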

!cd cuda-samples/Samples/simpleCUBLAS; make
!cuda-samples/bin/x86_64/linux/release/simpleCUBLAS

Source code
- %%cu

3.1.2. V2

V2 is distinguished by being able to compile and run several CUDA source files (.cu files) together. It provides the %%cuda and %%cuda_run cell magics. Details are given in:
README

Source code
- %%cuda_run
- %%cuda

3.2. Fetching source code from GitHub

3.2.1. cuda-samples

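Try NVIDIA's samples. When a Tesla T4 is assigned, Tensor Cores can be used too. The documentation is thorough; CUDA Samples has the details.

Because of the nvcc compiler options involved (presumably the nvcc installed on Colab does not accept them), first edit the Makefile: the sed commands below delete --threads 0 and rewrite the 86 architecture entry to 80.

The Tensor Core sample can be run as follows (it uses the CUDA runtime API, cuda_runtime):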

!git clone https://github.com/NVIDIA/cuda-samples
%cd cuda-samples/Samples/cudaTensorCoreGemm
!sed -i -e 's/--threads 0//g' Makefile
!sed -i -e 's/ 86/ 80/g' Makefile
!make
!../../bin/x86_64/linux/release/cudaTensorCoreGemm

The Mersenne Twister sample can be run as follows (it uses the CUDA runtime API, cuda_runtime):

!git clone https://github.com/NVIDIA/cuda-samples
%cd cuda-samples/Samples/MersenneTwisterGP11213
!sed -i -e 's/--threads 0//g' Makefile
!sed -i -e 's/ 86/ 80/g' Makefile
!make
!../../bin/x86_64/linux/release/MersenneTwisterGP11213

The matrix multiplication sample can be run as follows (it uses the CUDA driver API, cuda_driver):

!git clone https://github.com/NVIDIA/cuda-samples
%cd cuda-samples/Samples/matrixMulDrv/
!sed -i -e 's/--threads 0//g' Makefile
!sed -i -e 's/ 86/ 80/g' Makefile
!make
!../../bin/x86_64/linux/release/matrixMulDrv
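
The two sed edits repeated in the three examples above can be written out in Python to make their effect explicit. This is only a sketch: `patch_makefile` is a hypothetical helper, and the Makefile fragment is an illustration rather than a copy of the real cuda-samples Makefile. Since the second substitution rewrites 86 to 80 instead of deleting it, 80 can end up listed twice.

```python
def patch_makefile(text):
    """Apply the same two substitutions as the sed commands in this section."""
    text = text.replace("--threads 0", "")  # sed -e 's/--threads 0//g'
    text = text.replace(" 86", " 80")       # sed -e 's/ 86/ 80/g'
    return text

# Illustrative Makefile fragment, not copied from the real cuda-samples Makefile.
sample = "SMS ?= 70 75 80 86\nALL_CCFLAGS += --threads 0 --std=c++11\n"
patched = patch_makefile(sample)
print(patched)
```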

3.2.2. cuLDPC

cuLDPC serves as another example. It can be built and run with the following steps:

!git clone https://github.com/robertwgh/cuLDPC
%cd cuLDPC/src
%pwd
!make
!./app

4. More advanced usage

4.1. Creating a program

From Google Colab a file can be written as follows (a Jupyter notebook feature). In this case a file named test.cu is created.

%%writefile test.cu

int main() { return 0; }
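
A file written this way still has to be compiled and run. The sketch below shows the same flow in plain Python, on the assumption that nvcc is on PATH. The helper names `write_source` and `build_command` are hypothetical, and the compile step is skipped where nvcc is not installed. In a notebook the equivalent would be a shell cell such as `!nvcc test.cu -o test`.

```python
import shutil
import subprocess
from pathlib import Path

SOURCE = "int main() { return 0; }\n"

def write_source(path, text=SOURCE):
    """Write the CUDA source to disk, as %%writefile does."""
    p = Path(path)
    p.write_text(text)
    return p

def build_command(source, output="test"):
    """Return the nvcc command line that compiles `source` into `output`."""
    return ["nvcc", str(source), "-o", output]

src = write_source("test.cu")
cmd = build_command(src)
if shutil.which("nvcc"):  # compile only where nvcc is actually installed
    subprocess.run(cmd, check=True)
    subprocess.run(["./test"], check=True)
else:
    print("nvcc not found; would run:", " ".join(cmd))
```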

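4.2. Collecting a profile

A profile can be collected as follows. A trace records individual calls: --print-api-trace records CUDA API calls such as cudaLaunchKernel, and --print-gpu-trace records what executes on the GPU. Other useful options include --print-gpu-summary and --print-api-summary.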

!nvprof --print-gpu-trace --print-api-trace cuda-samples/bin/x86_64/linux/release/simpleCUBLAS

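Example output (sgemm only)
The first line, cudaLaunchKernel, is the output of the api option. The volta_sgemm_64x64_nn line after it is the output of the gpu option. Together they show that the sgemm operation runs because cudaLaunchKernel requests the launch of the GPU code.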

925.08ms  43.776us                    -               -         -         -         -         -           -           -           -                -         -         -  cudaLaunchKernel (volta_sgemm_64x64_nn [334])
925.12ms  82.624us              (5 5 1)        (64 1 1)       126  8.2500KB        0B         -           -           -           -     Tesla T4 (0)         1         7  volta_sgemm_64x64_nn [334]

For reference, the columns mean the following:

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
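
The grid and block columns of the GPU trace line can be turned into a total thread count. The following Python sketch parses the sgemm line shown above, with its whitespace condensed. The function names are illustrative and are not part of nvprof.

```python
import re

def launch_dims(trace_line):
    """Return (grid, block) tuples parsed from one --print-gpu-trace line."""
    dims = re.findall(r"\((\d+) (\d+) (\d+)\)", trace_line)
    grid, block = (tuple(int(v) for v in d) for d in dims[:2])
    return grid, block

def total_threads(grid, block):
    """Blocks in the grid times threads per block."""
    blocks = grid[0] * grid[1] * grid[2]
    threads_per_block = block[0] * block[1] * block[2]
    return blocks * threads_per_block

line = ("925.12ms  82.624us  (5 5 1)  (64 1 1)  126  8.2500KB  0B  -  -  -  -  "
        "Tesla T4 (0)  1  7  volta_sgemm_64x64_nn [334]")
grid, block = launch_dims(line)
print(grid, block, total_threads(grid, block))  # 25 blocks x 64 threads = 1600
```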

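4.3. Collecting roofline data

The roofline model, well known for characterizing compute devices, can be measured with the GPU tools.
However, nvprof can collect the required metrics only up to compute capability 7.2, so on Turing (compute capability 7.5, such as the Tesla T4) and later they must be collected with Nsight Compute. Profiling is very heavy, so it may be better to limit the number of profiled kernels, for example with --launch-count 10.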

!/usr/local/cuda/NsightCompute-1.0/nv-nsight-cu-cli samples/bin/x86_64/linux/release/simpleCUBLAS

The data is displayed like this:

GPU Device 0: "Tesla T4" with compute capability 7.5

simpleCUBLAS test running..
simpleCUBLAS test passed.
==PROF== Profiling -    1: 0%....50%....100%
==PROF== Report: profile.nsight-cuprof-report
volta_sgemm_64x64_nn, 2019-May-21 15:07:22
Section: GPU Speed Of Light
---------------------------------------------------------------------- --------------- ------------------------------
Memory Frequency                                                         cycle/nsecond                           1.15
SOL FB                                                                               %                           3.28
Elapsed Cycles                                                                   cycle                      45,164.20
SM Frequency                                                             cycle/usecond                         537.26
Memory [%]                                                                           %                          12.74
Duration                                                                       usecond                          84.06
SOL L2                                                                               %                           5.53
SOL TEX                                                                              %                          21.05
SM [%]                                                                               %                          15.53
---------------------------------------------------------------------- --------------- ------------------------------

Section: Compute Workload Analysis
---------------------------------------------------------------------- --------------- ------------------------------
Executed Ipc Active                                                         inst/cycle                           1.03
Executed Ipc Elapsed                                                        inst/cycle                           0.62
Issued Ipc Active                                                           inst/cycle                           1.03
Issue Slots Busy                                                                     %                          25.67
SM Busy                                                                              %                          25.67
---------------------------------------------------------------------- --------------- ------------------------------

Section: Memory Workload Analysis
---------------------------------------------------------------------- --------------- ------------------------------
Memory Throughput                                                         Gbyte/second                          12.28
Mem Busy                                                                             %                          12.74
Max Bandwidth                                                                        %                           9.54
L2 Hit Rate                                                                          %                          87.36
Mem Pipes Busy                                                                       %                          13.77
L1 Hit Rate                                                                          %                          30.17
---------------------------------------------------------------------- --------------- ------------------------------

Section: Scheduler Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Active Warps Per Scheduler                                                        warp                           1.00
Eligible Warps Per Scheduler                                                      warp                           0.51
No Eligible                                                                          %                          48.65
Instructions Per Active Issue Slot                                          inst/cycle                              1
Issued Warp Per Scheduler                                                                                        0.51
One or More Eligible                                                                 %                          51.35
---------------------------------------------------------------------- --------------- ------------------------------

Section: Warp State Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Avg. Not Predicated Off Threads Per Warp                                                                        31.52
Avg. Active Threads Per Warp                                                                                       32
Warp Cycles Per Executed Instruction                                             cycle                           1.95
Warp Cycles Per Issued Instruction                                                                               1.95
Warp Cycles Per Issue Active                                                                                     1.95
---------------------------------------------------------------------- --------------- ------------------------------

Section: Instruction Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Avg. Executed Instructions Per Scheduler                                          inst                       7,013.44
Executed Instructions                                                             inst                      1,122,150
Avg. Issued Instructions Per Scheduler                                            inst                       7,015.62
Issued Instructions                                                               inst                      1,122,500
---------------------------------------------------------------------- --------------- ------------------------------

Section: Launch Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Block Size                                                                                                         64
Grid Size                                                                                                          25
Registers Per Thread                                                   register/thread                            126
Shared Memory Configuration Size                                                 Kbyte                             48
Dynamic Shared Memory Per Block                                             byte/block                              0
Static Shared Memory Per Block                                             Kbyte/block                           8.25
Threads                                                                         thread                          1,600
Waves Per SM                                                                                                     0.09
---------------------------------------------------------------------- --------------- ------------------------------

Section: Occupancy
---------------------------------------------------------------------- --------------- ------------------------------
Block Limit SM                                                                   block                             16
Block Limit Registers                                                         register                              8
Block Limit Local Mem                                                             byte                              7
Block Limit Warps                                                                 warp                             16
Achieved Active Warps Per SM                                                      warp                           2.18
Achieved Occupancy                                                                   %                           6.80
Theoretical Active Warps per SM                                             warp/cycle                             14
Theoretical Occupancy                                                                %                          43.75
---------------------------------------------------------------------- --------------- ------------------------------
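
Several numbers in the Launch Statistics and Occupancy sections can be cross-checked by hand. The Python sketch below does so under two assumptions that are not in the report itself: a limit of 32 resident warps per SM for compute capability 7.5, and 40 SMs on a Tesla T4. The waves-per-SM formula is likewise an assumption that happens to reproduce the reported value; it is not taken from the Nsight Compute documentation.

```python
WARP_SIZE = 32
MAX_WARPS_PER_SM = 32  # assumption: limit for compute capability 7.5 (Turing)
SM_COUNT = 40          # assumption: number of SMs on a Tesla T4

# Values copied from the report above.
block_size = 64
grid_size = 25
achieved_warps = 2.18
block_limits = {"sm": 16, "registers": 8, "local_mem": 7, "warps": 16}

blocks_per_sm = min(block_limits.values())  # the tightest limit wins
warps_per_block = block_size // WARP_SIZE
theoretical_warps = blocks_per_sm * warps_per_block
theoretical_occupancy = 100.0 * theoretical_warps / MAX_WARPS_PER_SM
achieved_occupancy = 100.0 * achieved_warps / MAX_WARPS_PER_SM
waves_per_sm = grid_size / (SM_COUNT * blocks_per_sm)

print(theoretical_warps, theoretical_occupancy)
print(round(achieved_occupancy, 2), round(waves_per_sm, 2))
```

With these inputs the theoretical figures come out at 14 warps and 43.75 %, matching the report, and the achieved occupancy and waves per SM agree with the reported 6.80 % and 0.09 to within rounding.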

The available metrics can be listed as follows:

!/usr/local/cuda/NsightCompute-1.0/nv-nsight-cu-cli --query-metrics >& metrics.log
!wc metrics.log
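
For context on what the collected metrics feed into, the roofline model bounds attainable performance by the smaller of the compute peak and the product of arithmetic intensity and memory bandwidth. The following Python sketch shows that calculation. The hardware numbers are placeholders chosen for easy arithmetic, not measured or published values, and the function names are illustrative.

```python
def attainable_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Roofline: performance is capped by compute or by memory traffic."""
    return min(peak_gflops, intensity * bandwidth_gbs)

def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity (FLOP/byte) at which the two roofs meet."""
    return peak_gflops / bandwidth_gbs

# Placeholder hardware numbers, not measured values.
PEAK_GFLOPS = 8000.0
BANDWIDTH_GBS = 320.0

for intensity in (1.0, 25.0, 100.0):
    print(intensity, attainable_gflops(intensity, PEAK_GFLOPS, BANDWIDTH_GBS))
print("ridge point:", ridge_point(PEAK_GFLOPS, BANDWIDTH_GBS))
```

A kernel whose arithmetic intensity lies left of the ridge point is limited by memory bandwidth, and one to the right of it is limited by compute.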

A. References

(NVIDIA)How to Implement Performance Metrics in CUDA C/C++
(NVIDIA) CUDA Code Samples
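- (NVIDIA) cuda-samples (the sample code above, hosted on GitHub)
- (Qiita) CUDAによる素数リスト生成プログラム (a prime-list generation program in CUDA)
  - This program runs as-is once the part corresponding to args is edited.
  - On a K80 it took 49 seconds (with num=134217728).
  - On a T4 it took 14 seconds (with num=134217728).
(Qiita) ルーフラインモデルとは (what the roofline model is)
- TPU/K80/Haswell (die) roofline
  - An example of analysis with the roofline model
- Intel KNL/NVIDIA V100 roofline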
(PerformancePortability)Measuring Roofline Quantities on NVIDIA GPUs
- (NVIDIA) Profiler nvprof
- (NVIDIA)Nsight Compute CLI
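  - On Turing (compute capability 7.5) and later, metrics can only be collected with Nsight Compute (nvprof cannot collect them). Tracing still works as before, but this tool is required when metrics need to be measured.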
  - (NVIDIA FORUM)nvprof --analysis-metrics not working for RTX 2070 (CUDA 10.0)
Running CUDA C/C++ in Jupyter or how to run nvcc in Google CoLab
- Environment setup procedure as of May 2018. Only a year has passed, but the procedure has become simpler.
(NVIDIA)J. CUDA Environment Variables

Other online services

Wandbox
