So you can use CUDA on a GPU-equipped bare-metal server?

Last updated at 2016-10-16, posted at 2016-10-14

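When you hear that GPUs are available on hourly-billed bare-metal servers in the cloud, you may wonder how you would actually use them. Since GPU stands for Graphics Processing Unit, does it render graphics and display them somewhere? That is a natural thing to think, and I thought so too at first. These GPUs, however, are not there to render graphics; they are used for general-purpose computation such as machine learning.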

What is a server GPU, and what is it good for?

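Using a GPU for general-purpose computation in this way is called GPGPU (General-Purpose computing on Graphics Processing Units), to distinguish it from graphics use.
If someone asks "How is that different from the NVIDIA GeForce GTX in my PC?", the differences are roughly these.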

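It has no display outputs such as HDMI, DVI, or VGA for connecting a monitor.
It carries a lot of memory: 24 GB on the K80 and 16 GB on the M60, comparable to the memory of the server itself.
To sustain high-speed computation for long periods, the clock frequency is kept relatively low to limit heat, and the design prioritizes cooling.
It is a professional product designed for high durability, so naturally it is also expensive.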

How do you program it?

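A GPU is not an Intel-compatible CPU. It is a processor with a different machine instruction set, so programs for it have to be built with dedicated development tools. The most widely used of these is CUDA, described next.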

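CUDA (Compute Unified Device Architecture) is an integrated development environment for the C language targeting GPUs, provided by NVIDIA, and consists of a compiler (nvcc), libraries, and so on. The platform/architecture that forms the basis for running applications is itself sometimes also called CUDA. [Japanese Wikipedia: CUDA] (https://ja.wikipedia.org/wiki/CUDA)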

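The basic flow is this: the server (the host) sends code that has been cross-compiled to run on the NVIDIA GPU, together with the data, to the GPU, and then receives the results of the GPU's processing back on the host.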

Happily, CUDA also runs on a GeForce GTX card that sells for under 10,000 yen in Akihabara, so it is easy to start trying it out and programming with it.

Setting up the CUDA development environment

On Ubuntu 14.04, CUDA Toolkit 5.5 can be installed from the distribution's own repositories, but instead of using that we install the latest version distributed by NVIDIA. The procedure is documented in NVIDIA's CUDA Toolkit Documentation v8.0, which we follow as we go.

First, check the environment

The machine used here is a bare-metal server with two Intel Xeon E5-2620 v3 CPUs (12 cores in total), Ubuntu 14.04, and a Tesla K80.

First, confirm that the GPU card is installed.

root@server1:~# lspci | grep -i nvidia
83:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
84:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)

Check the Linux distribution version and confirm that CUDA supports it. The supported distributions are listed in the CUDA Toolkit Documentation v8.0.

root@server1:~# uname -m && cat /etc/*release
x86_64
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=14.04
DISTRIB_CODENAME=trusty
DISTRIB_DESCRIPTION="Ubuntu 14.04.5 LTS"
NAME="Ubuntu"
VERSION="14.04.5 LTS, Trusty Tahr"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 14.04.5 LTS"
VERSION_ID="14.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
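The manual check above can also be scripted. The following is a minimal sketch, not part of the original procedure: the helper names `release_field` and `is_article_platform` are hypothetical, and the check only compares against the combination used in this article (x86_64, Ubuntu 14.04). Whether another combination is supported still has to be confirmed in the CUDA documentation.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the value of KEY from an os-release style file.
release_field() {
    local key="$1" file="$2"
    # Keep only the "KEY=value" line, drop the key, then strip double quotes.
    sed -n "s/^${key}=//p" "$file" | head -n 1 | tr -d '"'
}

# Hypothetical helper: succeed only for the platform used in this article.
is_article_platform() {
    local arch="$1" id="$2" version="$3"
    [ "$arch" = "x86_64" ] && [ "$id" = "ubuntu" ] && [ "$version" = "14.04" ]
}

# Example against the local machine; prints a message either way.
if [ -r /etc/os-release ]; then
    if is_article_platform "$(uname -m)" \
        "$(release_field ID /etc/os-release)" \
        "$(release_field VERSION_ID /etc/os-release)"; then
        echo "same platform as the article"
    else
        echo "different platform; check the CUDA documentation for support"
    fi
fi
```

Anchoring the pattern at `^ID=` keeps `ID_LIKE=debian` from being picked up by mistake.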

Installing the required packages and the CUDA Toolkit

Install the packages that CUDA requires. Several installation methods are provided, including files for the Linux package manager, but here we choose the installer that is run locally.

# apt-get update
# apt-get install gcc g++ make linux-source

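The CUDA development environment can be downloaded free of charge from NVIDIA's CUDA download site. As of October 15, 2016, the latest version is CUDA Toolkit 8.0. Downloading it to my Mac and then uploading it to the bare-metal server in the cloud would be tedious, so I fetch it directly on the server with wget.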

# mkdir nvidia
# cd nvidia
# wget https://developer.nvidia.com/compute/cuda/8.0/prod/local_installers/cuda_8.0.44_linux-run

Run the installer.

root@server1:~/nvidia# sh cuda_8.0.44_linux-run

Answer the installer's questions.

Do you accept the previously read EULA?
accept/decline/quit: accept

Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 367.48?
(y)es/(n)o/(q)uit: y

Do you want to install the OpenGL libraries?
(y)es/(n)o/(q)uit [ default is yes ]: y

Do you want to run nvidia-xconfig?
This will update the system X configuration file so that the NVIDIA X driver
is used. The pre-existing X configuration file will be backed up.
This option should not be used on systems that require a custom
X configuration, such as systems with multiple GPU vendors.
(y)es/(n)o/(q)uit [ default is no ]: n

Install the CUDA 8.0 Toolkit?
(y)es/(n)o/(q)uit: y

Enter Toolkit Location
 [ default is /usr/local/cuda-8.0 ]: 

Do you want to install a symbolic link at /usr/local/cuda?
(y)es/(n)o/(q)uit: y

Install the CUDA 8.0 Samples?
(y)es/(n)o/(q)uit: y

Enter CUDA Samples Location
 [ default is /root ]:

When the installation is complete, set the path in the shell.

root@server1:~# vi .bashrc

Add the following line

export PATH="/usr/local/cuda-8.0/bin:$PATH"
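If you prefer not to edit the file by hand, the same line can be appended from a script. This is a minimal sketch under assumptions: `add_line_once` is a hypothetical helper, and the demonstration writes to a scratch file rather than the real `.bashrc`. It adds the line only when it is not already present, so running it twice does not duplicate the entry.

```shell
#!/usr/bin/env bash

# Hypothetical helper: append LINE to FILE only if it is not already there.
add_line_once() {
    local file="$1" line="$2"
    touch "$file"
    # -q: quiet, -x: match the whole line, -F: fixed string (no regex)
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Demonstration on a scratch file instead of the real ~/.bashrc.
rc_file="$(mktemp)"
# Single quotes keep $PATH literal so it is expanded when .bashrc is read.
cuda_path_line='export PATH="/usr/local/cuda-8.0/bin:$PATH"'
add_line_once "$rc_file" "$cuda_path_line"
add_line_once "$rc_file" "$cuda_path_line"   # second call changes nothing
cat "$rc_file"
```

To apply it for real, pass `"$HOME/.bashrc"` as the first argument and start a new shell afterwards.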

Add the library load path so that the file looks like the following

root@server1:~# cat /etc/ld.so.conf
include /etc/ld.so.conf.d/*.conf
/usr/local/cuda-8.0/lib64

Apply the change with the following command

root@server1:~/nvidia# ldconfig
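Because `/etc/ld.so.conf` already includes `/etc/ld.so.conf.d/*.conf`, an alternative to editing it directly is to place the path in a separate drop-in file. The sketch below is an assumption-labeled variation, not what was done in this article: `write_ld_dropin` and the file name `cuda-8-0.conf` are hypothetical, and the demonstration writes into a scratch directory.

```shell
#!/usr/bin/env bash

# Hypothetical helper: write a drop-in file that lists one library directory.
write_ld_dropin() {
    local conf_dir="$1" name="$2" lib_dir="$3"
    mkdir -p "$conf_dir"
    printf '%s\n' "$lib_dir" > "$conf_dir/$name.conf"
}

# Demonstration in a scratch directory. On a real system conf_dir would be
# /etc/ld.so.conf.d, and the call would be followed by ldconfig run as root.
scratch="$(mktemp -d)"
write_ld_dropin "$scratch" cuda-8-0 /usr/local/cuda-8.0/lib64
cat "$scratch/cuda-8-0.conf"
```

A drop-in file leaves the distribution's own `/etc/ld.so.conf` untouched, which makes the change easier to remove later.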

This completes the installation of CUDA Toolkit 8.0. Next, we check the installation with utilities included in the sample programs.

Checking the installation

Running deviceQuery from the sample programs retrieves information about the devices. Build deviceQuery as follows.

root@server1:~# cd NVIDIA_CUDA-8.0_Samples/
root@server1:~/NVIDIA_CUDA-8.0_Samples# cd 1_Utilities/
root@server1:~/NVIDIA_CUDA-8.0_Samples/1_Utilities# cd deviceQuery
root@server1:~/NVIDIA_CUDA-8.0_Samples/1_Utilities/deviceQuery# make

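Once the command is built, run it. The output shows that the Tesla K80 contains two GPUs, with 4,992 cores and about 22.3 GiB of RAM in total (2 × 11,440 MBytes as reported). The GTX 750 Ti sold in Akihabara has 640 cores and 2 GB of RAM, so these resources are on a different scale.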

root@server1:~/NVIDIA_CUDA-8.0_Samples/1_Utilities/deviceQuery# ./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "Tesla K80"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    3.7
  Total amount of global memory:                 11440 MBytes (11995578368 bytes)
  (13) Multiprocessors, (192) CUDA Cores/MP:     2496 CUDA Cores
  GPU Max Clock rate:                            824 MHz (0.82 GHz)
  Memory Clock rate:                             2505 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 131 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "Tesla K80"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    3.7
  Total amount of global memory:                 11440 MBytes (11995578368 bytes)
  (13) Multiprocessors, (192) CUDA Cores/MP:     2496 CUDA Cores
  GPU Max Clock rate:                            824 MHz (0.82 GHz)
  Memory Clock rate:                             2505 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 132 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU1) : Yes
> Peer access from Tesla K80 (GPU1) -> Tesla K80 (GPU0) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 2, Device0 = Tesla K80, Device1 = Tesla K80
Result = PASS
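The per-card totals can be recomputed from the output above rather than added by hand. The following sketch assumes the two line formats shown in this deviceQuery output; `summarize_devices` is a hypothetical helper, and the sample input is copied from the lines above.

```shell
#!/usr/bin/env bash

# Hypothetical helper: sum CUDA cores and global memory over all devices
# found in deviceQuery output read from standard input.
summarize_devices() {
    awk '
        # "(13) Multiprocessors, (192) CUDA Cores/MP:  2496 CUDA Cores"
        /CUDA Cores\/MP:/                { cores += $(NF-2) }
        # "Total amount of global memory:  11440 MBytes (... bytes)"
        /Total amount of global memory:/ { mbytes += $6 }
        END { printf "%d cores, %d MBytes\n", cores, mbytes }
    '
}

# Sample input: the relevant lines for Device 0 and Device 1.
summarize_devices <<'EOF'
  Total amount of global memory:                 11440 MBytes (11995578368 bytes)
  (13) Multiprocessors, (192) CUDA Cores/MP:     2496 CUDA Cores
  Total amount of global memory:                 11440 MBytes (11995578368 bytes)
  (13) Multiprocessors, (192) CUDA Cores/MP:     2496 CUDA Cores
EOF
# prints: 4992 cores, 22880 MBytes
```

On a machine with the samples built, the same function can read the live output with `./deviceQuery | summarize_devices`.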

The samples also include bandwidthTest, a utility that measures the transfer rate between the host and the GPU card.

root@server1:~/NVIDIA_CUDA-8.0_Samples/1_Utilities/bandwidthTest# ./bandwidthTest 
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla K80
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			8491.4

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			9888.1

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			156560.7

Result = PASS
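To read these figures at a glance, the output can be condensed to one line per direction. This is a rough sketch: `summarize_bandwidth` is a hypothetical helper, it assumes the Quick Mode layout shown above, and it simply divides the reported MB/s by 1000, which is an assumption about units made for readability only.

```shell
#!/usr/bin/env bash

# Hypothetical helper: condense bandwidthTest output read from standard input.
summarize_bandwidth() {
    awk '
        # Section header, e.g. " Host to Device Bandwidth, 1 Device(s)"
        / Bandwidth, /             { label = $1 " " $2 " " $3 }
        # Data row: transfer size in bytes followed by the rate in MB/s
        $1 ~ /^[0-9]+$/ && NF == 2 { printf "%s: %.1f GB/s\n", label, $2 / 1000 }
    '
}

# Sample input copied from the output above.
summarize_bandwidth <<'EOF'
 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			8491.4

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			9888.1
EOF
```

The device-to-device figure is far higher than the other two because that copy stays within the card's own memory and does not cross the bus to the host.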

That completes the environment for developing programs that use GPGPU.
