
Assorted notes on getting an Nvidia GPU running on Ubuntu 20.04

Last updated at 2023-06-01, posted at 2023-05-30

Introduction

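Notes on the various steps needed to get an Nvidia GPU (specifically an A100) installed and running on Ubuntu 20.04, along with related commands I use often.
Perpetually a work in progress.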

OS and hardware

cat /etc/os-release

Output

NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

head /proc/cpuinfo

Output

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 106
model name	: Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz
stepping	: 6
microcode	: 0xd000389
cpu MHz		: 800.783
cache size	: 24576 KB
physical id	: 0

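Initial setup done by the vendor

The GPU workstation was ordered from a vendor, who also installed it, so a fair amount of configuration had already been done.

For example, Nouveau, which would normally be enabled by default, does not show up in lsmod | grep nouveau, while lsmod | grep nvidia gives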

nvidia_drm             61440  1
nvidia_modeset       1241088  1 nvidia_drm
nvidia              56463360  82 nvidia_modeset
drm_kms_helper        184320  4 ast,nvidia_drm
drm                   495616  9 drm_kms_helper,drm_vram_helper,ast,nvidia,nvidia_drm,ttm

so the nvidia kernel modules are already loaded.
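The same check can be scripted. The following is a minimal sketch that parses lsmod-style text and reports whether the nouveau and nvidia modules are loaded; the function names loaded_modules, driver_status and check_live are hypothetical helpers introduced here, not part of any tool.

```python
import subprocess


def loaded_modules(lsmod_text: str) -> set:
    """Return module names from lsmod-style text (first column of each row)."""
    names = set()
    for line in lsmod_text.splitlines():
        fields = line.split()
        # Skip blank lines and the "Module  Size  Used by" header row.
        if not fields or fields[0] == "Module":
            continue
        names.add(fields[0])
    return names


def driver_status(lsmod_text: str) -> dict:
    """Report whether the nouveau and nvidia kernel modules are loaded."""
    mods = loaded_modules(lsmod_text)
    return {"nouveau": "nouveau" in mods, "nvidia": "nvidia" in mods}


def check_live() -> dict:
    """Run lsmod (assumed to be on PATH) and check the current machine."""
    out = subprocess.run(["lsmod"], capture_output=True, text=True, check=True)
    return driver_status(out.stdout)
```

On a machine set up like the one described here, check_live() would be expected to report nouveau as absent and nvidia as present.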
Running nvidia-smi shows

Tue May 30 10:37:19 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  On   | 00000000:4B:00.0 Off |                    0 |
| N/A   43C    P0    48W / 300W |      4MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1862      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

This confirms that the driver recognizes the GPU.
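When the table layout is inconvenient to read from a script, nvidia-smi can also emit CSV through its --query-gpu and --format options. Below is a rough sketch of reading that output from Python; the selected fields and the helper names parse_gpu_csv and query_gpus are choices made for this illustration.

```python
import csv
import io
import subprocess

# Fields passed to nvidia-smi --query-gpu; chosen here for illustration.
FIELDS = ["name", "driver_version", "memory.total"]


def parse_gpu_csv(text: str) -> list:
    """Parse the output of `nvidia-smi --query-gpu=... --format=csv,noheader`."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if record:  # ignore empty lines
            rows.append(dict(zip(FIELDS, (v.strip() for v in record))))
    return rows


def query_gpus() -> list:
    """Run nvidia-smi (must be on PATH) and return one dict per GPU."""
    cmd = ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS), "--format=csv,noheader"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)
```

Calling query_gpus() on the workstation should return a single entry, since only one GPU (index 0) is listed.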

Ref: 【Ubuntu + NVIDIA】Ubuntu に NVIDIA ドライバーをインストール – たまテク

Checking the CUDA version with /usr/local/cuda/bin/nvcc -V (typed as a full path because it is not on PATH) gives

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0

so the installed CUDA toolkit is release 11.6.
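Note that nvidia-smi reported "CUDA Version: 12.0" while nvcc reports 11.6. As far as I understand, the former is the highest CUDA version the driver supports and the latter is the installed toolkit, so the two need not match. A small sketch for extracting the toolkit release from the nvcc text (nvcc_release is a hypothetical helper name):

```python
import re
from typing import Optional, Tuple


def nvcc_release(nvcc_output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from the 'release X.Y' part of `nvcc -V` output."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Line taken from the nvcc output shown in this article.
sample = "Cuda compilation tools, release 11.6, V11.6.124"
print(nvcc_release(sample))  # (11, 6)
```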

Commonly used commands

Checking the configuration

# Check the driver
lsmod | grep nouveau
lsmod | grep nvidia
# Check the GPU name
lspci | grep -i nvidia  # sudo may be required
# Check the CUDA-related status
nvidia-smi

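Installing CuPy into an environment built with pyenv + virtualenv

Eventually I want to write code with CuPy, so I install it. To avoid polluting the OS default Python environment, I build the environment with pyenv + virtualenv.

Installing Python 3.11.3

I want the test environment to be as plain as possible, so I do not use Anaconda or similar. Since I like new things, I install 3.11.3, the latest release as of 2023/05/30.
Command: pyenv install 3.11.3

Output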

Downloading Python-3.11.3.tar.xz...
-> https://www.python.org/ftp/python/3.11.3/Python-3.11.3.tar.xz
Installing Python-3.11.3...
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/hogehoge/.pyenv/versions/3.11.3/lib/python3.11/tkinter/__init__.py", line 38, in <module>
    import _tkinter # If this fails your Python may not be configured for Tk
    ^^^^^^^^^^^^^^^
ModuleNotFoundError: No module named '_tkinter'
WARNING: The Python tkinter extension was not compiled and GUI subsystem has been detected. Missing the Tk toolkit?
Installed Python-3.11.3 to /home/hogehoge/.pyenv/versions/3.11.3

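tkinter was not built properly, but that is fine for now.
(python-tk and python3-tk are installed via apt, and import tkinter works with the OS default Python. Some library probably has to be installed before building, most likely the Tk development headers such as tk-dev, but I am putting that off until it becomes a problem.)

pyenv virtualenv 3.11.3 first_test creates a virtual environment named first_test on top of the Python just installed.
Then move into a suitable directory such as /home/hogehoge/test/first_test and switch to the test environment with pyenv local first_test.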
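To confirm that the switch took effect, it can help to ask the interpreter itself. This sketch reports the running version, whether it looks like a virtual environment, and whether the _tkinter extension from the build warning is present; describe_interpreter is a hypothetical helper, and the virtual-environment test assumes the environment was created with the standard venv mechanism.

```python
import importlib.util
import sys


def describe_interpreter() -> dict:
    """Summarize which Python is running and whether Tk support was built."""
    return {
        "version": ".".join(map(str, sys.version_info[:3])),
        "executable": sys.executable,
        # In a venv-style environment, sys.prefix differs from sys.base_prefix.
        "in_virtualenv": sys.prefix != sys.base_prefix,
        # _tkinter is the C extension that the pyenv build warned about.
        "has_tkinter": importlib.util.find_spec("_tkinter") is not None,
    }


print(describe_interpreter())
```

Inside the first_test directory, the version should read 3.11.3 and has_tkinter would presumably be False given the warning above.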

Installing cupy + jupyter

Command: pip3 install cupy (this takes quite a while)

Output

Collecting cupy
  Downloading cupy-12.1.0.tar.gz (2.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 25.1 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting numpy<1.27,>=1.20
  Downloading numpy-1.24.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17.3/17.3 MB 24.7 MB/s eta 0:00:00
Collecting fastrlock>=0.5
  Using cached fastrlock-0.8.1-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_24_x86_64.whl (49 kB)
Installing collected packages: fastrlock, numpy, cupy
  DEPRECATION: cupy is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at https://github.com/pypa/pip/issues/8559
  Running setup.py install for cupy ... done
Successfully installed cupy-12.1.0 fastrlock-0.8.1 numpy-1.24.3

[notice] A new release of pip available: 22.3.1 -> 23.1.2
[notice] To update, run: python3.11 -m pip install --upgrade pip

Then pip3 install jupyter (output omitted) and pip3 install tabulate (output omitted) complete a minimal environment. (Jupyter pulls in a great many dependencies.)
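A quick sanity check after installation is to ask CuPy which CUDA runtime and device it sees. The sketch below uses cupy.cuda.runtime calls that I believe exist in CuPy 12 (runtimeGetVersion, getDeviceCount, getDeviceProperties); cupy_summary is a hypothetical helper that returns None when CuPy or a GPU is unavailable.

```python
def cupy_summary():
    """Return basic CuPy/CUDA info, or None if CuPy or a GPU is unavailable."""
    try:
        import cupy as cp

        runtime = cp.cuda.runtime
        props = runtime.getDeviceProperties(0)
        name = props["name"]
        return {
            "cupy": cp.__version__,
            # Encoded as 1000*major + 10*minor, e.g. 11060 for CUDA 11.6.
            "cuda_runtime": runtime.runtimeGetVersion(),
            "device_count": runtime.getDeviceCount(),
            "device_0": name.decode() if isinstance(name, bytes) else name,
        }
    except Exception:
        # Covers a missing cupy package as well as CUDA runtime errors.
        return None


print(cupy_summary())
```

As an aside, the long install time comes from building CuPy from source; CuPy also publishes prebuilt wheels (for example cupy-cuda11x for CUDA 11.x) that avoid the build.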

Testing that cupy works

The code below, adapted from "CuPy 入門 — ディープラーニング入門：Chainer チュートリアル", is run with python3 test.py.

test.py

import numpy as np
import cupy as cp
import time
import tabulate

def get_w_np(x, t):
    """
    Solve the normal equations (X^T X) w = X^T t by matrix inversion (NumPy)
    """
    xx = np.dot(x.T, x)
    xx_inv = np.linalg.inv(xx)
    xt = np.dot(x.T, t)
    w = np.dot(xx_inv, xt)
    return w

def get_w_cp(x, t):
    """
    Solve the normal equations (X^T X) w = X^T t by matrix inversion (CuPy)
    """
    xx = cp.dot(x.T, x)
    xx_inv = cp.linalg.inv(xx)
    xt = cp.dot(x.T, t)
    w = cp.dot(xx_inv, xt)
    return w

Nlist = [10, 100, 1000, 10000]

# Measuring time for numpy (CPU)  
print("'get_w_np' on CPU")
times_cpu = []
for N in Nlist:
    np.random.seed(0)
    x = np.random.rand(N, N)
    t = np.random.rand(N, 1)

    time_start = time.time()

    w = get_w_np(x, t)

    time_end = time.time()
    elapsed_time = time_end - time_start

    print('N={:>5}:{:>8.5f} sec'.format(N, elapsed_time))
    times_cpu.append(elapsed_time)

# Measuring time for cupy (GPU)
print("'get_w_cp' on GPU")
times_gpu = []  # stores the GPU timings
for N in Nlist:
    cp.random.seed(0)
    x = cp.random.rand(N, N)
    t = cp.random.rand(N, 1)

    cp.cuda.Stream.null.synchronize()  # wait for pending GPU work to finish
    time_start = time.time()

    w = get_w_cp(x, t)

    cp.cuda.Stream.null.synchronize()  # wait for pending GPU work to finish
    time_end = time.time()
    elapsed_time = time_end - time_start

    print('N={:>5}:{:>8.5f} sec'.format(N, elapsed_time))
    times_gpu.append(elapsed_time)

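# Speedup ratio: CPU time divided by GPU time
times_cpu = np.asarray(times_cpu)
times_gpu = np.asarray(times_gpu)
ratio = ['{:.2f} x'.format(r) for r in times_cpu / times_gpu]

# Build the results table with tabulate.
# Header strings (Japanese): N, NumPy run time (sec), CuPy run time (sec), speedup
table = tabulate.tabulate(
    zip(Nlist, times_cpu, times_gpu, ratio),
    headers=['N', 'NumPyでの実行時間 (sec)', 'CuPy での実行時間 (sec)', '高速化倍率'])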

print(table)
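A side note on the benchmark itself: both functions invert X^T X explicitly, which is what the tutorial code does. Solving the linear system directly is usually preferred numerically and gives the same w. The sketch below compares the two on a small NumPy problem; get_w_inv and get_w_solve are names made up for this illustration, and swapping in the solve variant would change what the benchmark measures.

```python
import numpy as np


def get_w_inv(x, t):
    """Normal equations via explicit inversion, as in test.py."""
    xx = x.T @ x
    return np.linalg.inv(xx) @ (x.T @ t)


def get_w_solve(x, t):
    """Same system, solved directly without forming the inverse."""
    return np.linalg.solve(x.T @ x, x.T @ t)


rng = np.random.default_rng(0)
x = rng.random((200, 5))  # tall matrix keeps X^T X well conditioned
t = rng.random((200, 1))
print(np.allclose(get_w_inv(x, t), get_w_solve(x, t)))
```

CuPy offers cupy.linalg.solve with the same call shape, so the GPU version could be adapted the same way.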

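In my default environment, OMP_NUM_THREADS=1 is set as an environment variable. The output with NumPy running on OpenBLAS is shown below. (The column headers in the tables are: N, NumPy run time in sec, CuPy run time in sec, and speedup factor.)

Output (1 thread)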

'get_w_np' on CPU
N=   10: 0.00018 sec
N=  100: 0.00059 sec
N= 1000: 0.13014 sec
N=10000:44.42248 sec
'get_w_cp' on GPU
N=   10: 0.73800 sec
N=  100: 0.00068 sec
N= 1000: 0.00835 sec
N=10000: 0.40778 sec
    N    NumPyでの実行時間 (sec)    CuPy での実行時間 (sec)  高速化倍率
-----  -------------------------  -------------------------  ------------
   10                0.000183105                0.737995     0.00 x
  100                0.000587225                0.000683069  0.86 x
 1000                0.130135                   0.00835443   15.58 x
10000               44.4225                     0.407781     108.94 x
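The GPU time for N=10 (0.738 sec) is about a thousand times longer than for N=100 (0.00068 sec), which suggests that the first call is paying a one-time start-up cost rather than compute time; I have not confirmed the exact cause. A common way to separate that cost is to run a warm-up call before timing. The helper below is a generic sketch (time_call and its parameters are hypothetical names), with an optional sync hook for GPU code.

```python
import time


def time_call(fn, *args, warmup=1, repeat=3, sync=None):
    """Return the best wall-clock time of fn(*args) after warm-up runs.

    sync, if given, is called before the clock is read; for CuPy this
    could be cp.cuda.Stream.null.synchronize, as used in test.py.
    """
    for _ in range(warmup):
        fn(*args)
    if sync is not None:
        sync()
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        fn(*args)
        if sync is not None:
            sync()
        best = min(best, time.perf_counter() - start)
    return best


# CPU-only usage example; no GPU is needed to try the helper.
print(time_call(sorted, list(range(1000, 0, -1))))
```

For the GPU loop in test.py, the call would look like time_call(get_w_cp, x, t, sync=cp.cuda.Stream.null.synchronize).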

The machine has two 16-core CPUs, so I also compared speeds with 16 and 32 threads:

Output (16 threads)

    N    NumPyでの実行時間 (sec)    CuPy での実行時間 (sec)  高速化倍率
-----  -------------------------  -------------------------  ------------
   10                0.000550747                0.788793     0.00 x
  100                0.00157523                 0.000721693  2.18 x
 1000                0.0368173                  0.00833082   4.42 x
10000                5.98699                    0.408392     14.66 x

Output (32 threads)

    N    NumPyでの実行時間 (sec)    CuPy での実行時間 (sec)  高速化倍率
-----  -------------------------  -------------------------  ------------
   10                0.000542164                0.765965     0.00 x
  100                0.0742831                  0.000671387  110.64 x
 1000                0.0622079                  0.00815129   7.63 x
10000                4.37956                    0.405994     10.79 x

(Up to N=1000, 16 threads is faster than 32 threads. The matrix inversion is probably the rate-limiting step.)
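How the thread count was switched between runs is not shown; presumably OMP_NUM_THREADS was changed in the shell. One way to automate the sweep, sketched under that assumption, is to launch test.py in a fresh interpreter per setting so that the variable is already in place when NumPy loads its BLAS library. The helper names env_for_threads and run_benchmark are hypothetical.

```python
import os
import subprocess
import sys


def env_for_threads(n_threads: int) -> dict:
    """Copy the current environment with OMP_NUM_THREADS set to n_threads."""
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = str(n_threads)
    return env


def run_benchmark(script: str, n_threads: int) -> str:
    """Run script in a fresh interpreter so the setting is seen at import time."""
    result = subprocess.run(
        [sys.executable, script],
        env=env_for_threads(n_threads),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

A sweep would then be a loop such as: for n in (1, 16, 32): print(run_benchmark("test.py", n)).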

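With two Xeon(R) Gold 6326 CPUs (1.48 TFLOP/s each) and an A100 (9.7 TFLOP/s), the theoretical peak ratio is only about 3.3x, yet for this workload (probably dominated by the matrix inversion) the measured gap is larger. (Memory bandwidth, perhaps?)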
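The arithmetic behind that comparison, using only the peak figures quoted in the text and the N=10000 timings from the output tables, can be written out as follows:

```python
# Peak figures quoted in the text (TFLOP/s).
cpu_peak = 1.48 * 2  # two Xeon Gold 6326 sockets
gpu_peak = 9.7       # A100

theoretical = gpu_peak / cpu_peak

# Measured times at N=10000 from the output tables (sec): (NumPy, CuPy).
measured = {
    "1 thread": (44.4225, 0.407781),
    "16 threads": (5.98699, 0.408392),
    "32 threads": (4.37956, 0.405994),
}

print(f"theoretical ratio: {theoretical:.2f}")
for label, (cpu_s, gpu_s) in measured.items():
    print(f"{label}: measured {cpu_s / gpu_s:.2f}x")
```

Even with all 32 cores in use, the measured ratio stays well above the theoretical one.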
