Easy CUDA Programming with CuPy

CuPy

Posted at 2025-03-24, last updated at 2025-03-24

This article builds a program that captures frames with OpenCV's VideoCapture, converts them to grayscale with CUDA, and displays the result.

The program was tested on an Ubuntu 20.04 PC with an RTX 3080.

Installing CuPy

Check the CUDA version.

$ nvidia-smi
Sat Nov 16 12:20:51 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |

Install the CuPy package that matches that CUDA version (12.x here).

$ pip3 install cupy-cuda12x
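When choosing the package, note that the "CUDA Version" field printed by nvidia-smi is the highest CUDA version the installed driver supports, and CuPy's prebuilt wheels are named by CUDA major version (cupy-cuda11x, cupy-cuda12x). The helper below, `cupy_wheel_from_smi`, is a hypothetical name and not part of CuPy. It is a minimal sketch that assumes the nvidia-smi header format shown above and only handles CUDA 11.x and 12.x. It extracts the version and returns the matching wheel name.

```python
import re

# CuPy wheel names per CUDA major version (assumption: only 11.x and 12.x are handled)
_WHEELS = {11: "cupy-cuda11x", 12: "cupy-cuda12x"}


def cupy_wheel_from_smi(smi_text: str) -> str:
    """Return the CuPy wheel name matching the 'CUDA Version' field of nvidia-smi output."""
    m = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_text)
    if m is None:
        raise ValueError("no 'CUDA Version' field found in nvidia-smi output")
    major = int(m.group(1))
    if major not in _WHEELS:
        raise ValueError(f"unsupported CUDA major version: {major}")
    return _WHEELS[major]
```

To use it, feed it the text from `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout`. For the header shown above it returns `cupy-cuda12x`.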

Displaying the camera image

First, build a simple camera viewer with OpenCV.

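import cv2

def main():
    device = "/dev/video2"  # V4L2 device node of the camera; adjust for your system

    cap = cv2.VideoCapture(device)
    if not cap.isOpened():
        print("[Error] Failed to open video device:", device)
        return

    try:
        while True:
            ret, frame = cap.read()  # frame: (height, width, 3) uint8 array in BGR order
            if not ret:
                print("[Error] Failed to read video frame")
                break

            cv2.imshow("video", frame)
            # Quit on 'q'; waitKey also gives the window time to refresh
            if cv2.waitKey(10) & 0xFF == ord('q'):
                break
    finally:
        # Release the camera and close the window even if the loop exits early
        cap.release()
        cv2.destroyAllWindows()


if __name__ == '__main__':
    main()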

Preparing the grayscale kernel

To convert RGB to grayscale, we use the following formula (these are the ITU-R BT.709 luma coefficients):

$$Y = 0.2126R + 0.7152G + 0.0722B$$
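As a quick worked check of the formula in plain Python (no GPU needed), a pure-white pixel should map to Y = 255 because the three coefficients sum to 1. The sketch below also contrasts two ways of turning the result into an integer: truncating each weighted term separately before summing, and rounding the weighted sum once. The function names are illustrative.

```python
# BT.709 luma weights from the formula above
RR, GR, BR = 0.2126, 0.7152, 0.0722


def luma_rounded(b: int, g: int, r: int) -> int:
    """Weight the channels, sum them, then round once to the nearest integer."""
    return int(RR * r + GR * g + BR * b + 0.5)


def luma_truncated_per_term(b: int, g: int, r: int) -> int:
    """Truncate each weighted term to an integer before summing."""
    return int(BR * b) + int(GR * g) + int(RR * r)


print(luma_rounded(255, 255, 255))             # 255
print(luma_truncated_per_term(255, 255, 255))  # 254 (18 + 182 + 54)
print(luma_rounded(0, 255, 0))                 # 182 (pure green)
```

Per-term truncation can lose up to nearly 3 gray levels, because each of the three terms drops its fractional part. Rounding the sum once keeps the error below half a level.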

We create the kernel by passing CUDA C++ source code to cupy.RawKernel().

A kernel is a function that runs on the GPU, and CuPy offers several ways to create one. Here we write it as a CUDA C++ function.
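For comparison, here is a minimal sketch of the same conversion written without a hand-written kernel. It uses CuPy's NumPy-compatible array operations, which CuPy executes as GPU kernels internally. CuPy also provides `ElementwiseKernel` and `ReductionKernel` as other ways to define kernels. `cupy.get_array_module` selects CuPy or NumPy based on the input array, so this function also runs on a CPU-only machine without CuPy. The name `to_grayscale_array` is illustrative.

```python
import numpy as np

try:
    import cupy as cp
except ImportError:  # CPU-only environment: fall back to NumPy
    cp = None


def to_grayscale_array(frame):
    """Convert an (H, W, 3) uint8 BGR image to an (H, W) uint8 grayscale image.

    Accepts a cupy.ndarray (runs on the GPU) or a numpy.ndarray (runs on the CPU).
    """
    xp = cp.get_array_module(frame) if cp is not None else np
    f = frame.astype(xp.float32)
    b, g, r = f[..., 0], f[..., 1], f[..., 2]  # OpenCV stores channels as BGR
    y = 0.2126 * r + 0.7152 * g + 0.0722 * b
    return xp.clip(xp.rint(y), 0, 255).astype(xp.uint8)
```

On the GPU, call it as `to_grayscale_array(cp.asarray(frame))` and bring the result back with `cp.asnumpy(...)`. Each array operation launches its own kernel and allocates a temporary array, so the single RawKernel used in this article does the same work in one pass.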

import cupy as cp

to_grayscale = cp.RawKernel(r'''
typedef unsigned char uint8_t;
extern "C" __global__
void to_grayscale(const uint8_t* input, uint8_t* output, int height, int width) {
    // BT.709 luma weights
    const float rr = 0.2126f;
    const float gr = 0.7152f;
    const float br = 0.0722f;

    // Global pixel coordinates handled by this thread
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Threads in the edge blocks may fall outside the image
    if (x < width && y < height) {
        const int idx = y * width + x;

        // OpenCV frames are interleaved BGR
        const uint8_t b = input[idx*3 + 0];
        const uint8_t g = input[idx*3 + 1];
        const uint8_t r = input[idx*3 + 2];

        // Sum the weighted channels first, then round once to the nearest integer
        output[idx] = (uint8_t)(r*rr + g*gr + b*br + 0.5f);
    }
}
''', 'to_grayscale')
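The kernel reads the pixel at (x, y) as `input[idx*3 + c]` with `idx = y * width + x`. This relies on the frame being a C-contiguous (row-major) height × width × 3 uint8 array, which is what OpenCV returns. The NumPy sketch below runs on the CPU only, and the helper name `flat_bgr` is illustrative. It checks that this flat-index arithmetic addresses the same bytes as `frame[y, x, c]`.

```python
import numpy as np


def flat_bgr(frame: np.ndarray, x: int, y: int) -> tuple:
    """Read (b, g, r) at (x, y) the way the CUDA kernel does, through a flat index."""
    _, width, _ = frame.shape
    flat = np.ascontiguousarray(frame).ravel()  # row-major byte order, like the GPU buffer
    idx = y * width + x
    return int(flat[idx * 3 + 0]), int(flat[idx * 3 + 1]), int(flat[idx * 3 + 2])


# Small synthetic 4x5 "image" with a distinct value in every byte
frame = np.arange(4 * 5 * 3, dtype=np.uint8).reshape(4, 5, 3)
for y in range(4):
    for x in range(5):
        assert flat_bgr(frame, x, y) == tuple(int(v) for v in frame[y, x])
print("flat indexing matches frame[y, x]")
```

If the frame were ever a non-contiguous view, such as a crop like `frame[:, 100:]`, the flat index would no longer match. Calling `np.ascontiguousarray(frame)` before uploading it to the GPU restores the layout the kernel expects.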

The code that launches the kernel is as follows.

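block = (16, 16)  # threads per block in (x, y)
grid = ((width + block[0] - 1) // block[0],   # blocks needed along x (ceiling division)
        (height + block[1] - 1) // block[1])  # blocks needed along y

# CuPy passes a Python int as a 64-bit long long; wrap with cp.int32 to match the kernel's int parameters
to_grayscale(grid, block,
             (input, output, cp.int32(height), cp.int32(width)))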

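In CUDA, a kernel function is executed by a large number of threads in parallel. Inside the kernel, each thread uses blockIdx, blockDim and threadIdx to work out which part of the image it is responsible for.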

Here block = (16,16), so each block consists of 16 threads along x and 16 along y, 256 threads in total. Inside to_grayscale(), blockDim gives the block size (16x16 here) and threadIdx gives the thread's position within its block.
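To make the index calculation concrete, the pure-Python sketch below replays it on the CPU. The function name `emulate_launch` is illustrative. It loops over every block in the grid and every thread in a block, computes x and y exactly as the kernel does, and applies the same bounds check. For a 20×10 image with 16×16 blocks, it confirms that each pixel is handled by exactly one thread, and it counts the threads that fall outside the image and do nothing.

```python
def emulate_launch(width, height, block, grid):
    """Replay the kernel's x/y computation for every thread.

    Returns per-pixel hit counts and the number of threads that skip work.
    """
    hits = [[0] * width for _ in range(height)]
    idle = 0
    for by in range(grid[1]):               # blockIdx.y
        for bx in range(grid[0]):           # blockIdx.x
            for ty in range(block[1]):      # threadIdx.y
                for tx in range(block[0]):  # threadIdx.x
                    x = bx * block[0] + tx  # blockIdx.x * blockDim.x + threadIdx.x
                    y = by * block[1] + ty  # blockIdx.y * blockDim.y + threadIdx.y
                    if x < width and y < height:
                        hits[y][x] += 1
                    else:
                        idle += 1
    return hits, idle


width, height, block = 20, 10, (16, 16)
grid = ((width + block[0] - 1) // block[0], (height + block[1] - 1) // block[1])
hits, idle = emulate_launch(width, height, block, grid)
print(grid, all(h == 1 for row in hits for h in row), idle)  # (2, 1) True 312
```

On the GPU these threads run concurrently rather than in nested loops, but the mapping from (blockIdx, threadIdx) to (x, y) is the same.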

grid specifies how many 16x16 blocks are needed along x and y to cover the whole image.
Inside to_grayscale(), blockIdx tells each thread which block of the grid it belongs to.
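The grid size is the ceiling division `(n + b - 1) // b` from the launch code. The sketch below computes it for two common camera resolutions, and the helper name `grid_for` is illustrative. 640×480 divides evenly by 16. 1920×1080 does not (1080 / 16 = 67.5), so the last row of blocks extends past the image. This is why the kernel needs its `x < width && y < height` check.

```python
def grid_for(width: int, height: int, block=(16, 16)):
    """Number of blocks along x and y needed to cover a width x height image."""
    return ((width + block[0] - 1) // block[0],
            (height + block[1] - 1) // block[1])


block = (16, 16)
for w, h in [(640, 480), (1920, 1080)]:
    gx, gy = grid_for(w, h, block)
    threads = gx * gy * block[0] * block[1]
    print(f"{w}x{h}: grid=({gx}, {gy}), threads={threads}, idle={threads - w * h}")
# 640x480: grid=(40, 30), threads=307200, idle=0
# 1920x1080: grid=(120, 68), threads=2088960, idle=15360
```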

Source: https://www.3dgep.com/cuda-thread-execution-model/

Full code


import cv2
import cupy as cp

to_grayscale = cp.RawKernel(r'''
typedef unsigned char uint8_t;
extern "C" __global__
void to_grayscale(const uint8_t* input, uint8_t* output, int height, int width) {
    // BT.709 luma weights
    const float rr = 0.2126f;
    const float gr = 0.7152f;
    const float br = 0.0722f;

    // Global pixel coordinates handled by this thread
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Threads in the edge blocks may fall outside the image
    if (x < width && y < height) {
        const int idx = y * width + x;

        // OpenCV frames are interleaved BGR
        const uint8_t b = input[idx*3 + 0];
        const uint8_t g = input[idx*3 + 1];
        const uint8_t r = input[idx*3 + 2];

        // Sum the weighted channels first, then round once to the nearest integer
        output[idx] = (uint8_t)(r*rr + g*gr + b*br + 0.5f);
    }
}
''', 'to_grayscale')


def main():
    device = "/dev/video2"  # V4L2 device node of the camera; adjust for your system

    cap = cv2.VideoCapture(device)
    if not cap.isOpened():
        print("[Error] Failed to open video device:", device)
        return

    block = (16, 16)  # threads per block in (x, y)

    try:
        while True:
            ret, frame = cap.read()
            if not ret:
                print("[Error] Failed to read video frame")
                break

            height, width, _ = frame.shape
            # The grid covers every pixel, so an uninitialized output buffer is enough
            output = cp.empty((height, width), dtype=cp.uint8)
            # Copy the (C-contiguous) BGR frame to GPU memory
            input = cp.array(frame)

            grid = ((width + block[0] - 1) // block[0],
                    (height + block[1] - 1) // block[1])

            # Wrap scalars in cp.int32 to match the kernel's int parameters
            to_grayscale(grid, block,
                         (input, output, cp.int32(height), cp.int32(width)))

            # Copy the result back to host memory for display
            cv2.imshow("video", cp.asnumpy(output))
            if cv2.waitKey(10) & 0xFF == ord('q'):
                break
    finally:
        cap.release()
        cv2.destroyAllWindows()


if __name__ == '__main__':
    main()
