
Verification code for shared GPU memory

Last updated at 2023-08-31, posted at 2023-08-30

Shared memory in Stable Diffusion WebUI AUTOMATIC1111

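I saw a comment about this posted on GitHub, which made me wonder: with
Windows (WSL2)
+ CUDA
+ Python
+ PyTorch
can shared GPU memory be allocated at all?
That question got me, a former assembler programmer, writing code for the first time in about 30 years.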

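The reference I used is listed below. I bought it as an e-book around 2015, and as of 2023 used copies have become remarkably expensive.

References
CUDA C プロフェッショナルプログラミング (Professional CUDA C Programming)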

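Allocating GPU memory with CUDA

Allocate GPU memory and release it after 20 seconds.
(In practice, however, it does not release the memory, so just stop it with Ctrl+C whenever you like.)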
gpumem01.cu

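#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Number of int elements: 1 Gi elements x 4 bytes = 4 GiB.
#define SIZE (1024 * 1024 * 1024)

__global__ void kernel(int *data, const int *clockRate) {
  // One thread per element. SIZE is a multiple of the block size,
  // so every idx is in range.
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  data[idx] = idx;
  // Only the first thread busy-waits; if every thread waited, each wave of
  // blocks would add another 20 seconds to the kernel's run time.
  if (idx != 0) return;
  // clockRate is reported in kHz, so convert it to cycles per second.
  double cyclesPerSecond = (double)(*clockRate) * 1000.0;
  long long start = clock64();
  for (;;) {
    // clock64() is a 64-bit counter, so a plain difference is enough.
    long long cycles = clock64() - start;
    if (cycles / cyclesPerSecond >= 20.0) {
      break;
    }
  }
}

int main() {
  int *host_data;
  int *device_data;
  int *device_clockRate;
  size_t bytes = (size_t)SIZE * sizeof(int);
  host_data = (int*)malloc(bytes);
  if (host_data == NULL) {
    fprintf(stderr, "malloc failed\n");
    return 1;
  }
  // Allocate the 4 GiB buffer on the device and stop early if it fails.
  if (cudaMalloc(&device_data, bytes) != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed\n");
    free(host_data);
    return 1;
  }
  cudaMalloc(&device_clockRate, sizeof(int));
  // Pass the clock rate of device 0 to the kernel.
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  int clockRate = prop.clockRate;
  cudaMemcpy(device_clockRate, &clockRate, sizeof(int), cudaMemcpyHostToDevice);
  dim3 block(32);
  dim3 grid((SIZE + block.x - 1) / block.x);
  kernel<<<grid, block>>>(device_data, device_clockRate);
  cudaDeviceSynchronize();
  cudaMemcpy(host_data, device_data, bytes, cudaMemcpyDeviceToHost);
  // Print every element. With SIZE elements this produces a very large
  // amount of output, and the GPU memory stays allocated until the loop ends.
  for (int i = 0; i < SIZE; i++) {
    printf("%d ", host_data[i]);
    if ((i + 1) % 16 == 0) printf("\n");
  }
  printf("\n");
  free(host_data);
  cudaFree(device_data);
  cudaFree(device_clockRate);
  return 0;
}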
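
The 20-second wait in the kernel depends on turning a cycle count into seconds. CUDA reports the clockRate field of cudaDeviceProp in kHz, so the rate has to be multiplied by 1000 before dividing. The following is a minimal host-side sketch of that arithmetic in Python; the function name cycles_to_seconds and the example clock rate are assumptions for illustration, not values from this article.

```python
def cycles_to_seconds(cycles: int, clock_rate_khz: int) -> float:
    """Convert a GPU cycle count to seconds, given a clock rate in kHz."""
    cycles_per_second = clock_rate_khz * 1000
    return cycles / cycles_per_second


# Hypothetical device running at 1.5 GHz (1,500,000 kHz).
clock_rate_khz = 1_500_000
cycles_for_20s = 20 * clock_rate_khz * 1000
print(cycles_to_seconds(cycles_for_20s, clock_rate_khz))  # 20.0
```

Dividing the cycle count by the raw kHz value would give milliseconds instead, so a threshold of 20 would be reached after about 20 ms.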

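Allocating 4 GiB of shared memory with CUDA and releasing it after 20 seconds

(For some reason, this one does run for the full 20 seconds and then releases the memory.)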
sheardmem_20sec.cu

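#include <stdio.h>
#include <cuda_runtime.h>
#include <unistd.h>

__global__ void kernel(int *array)
{
    // modify the array: write one element so the value printed on the host is defined
    array[0] = 1;
}

int main()
{
    int *array;
    // allocate 4 GiB of unified (managed) memory: 1 Gi ints x 4 bytes each
    size_t bytes = (size_t)1024 * 1024 * 1024 * sizeof(int);
    if (cudaMallocManaged(&array, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMallocManaged failed\n");
        return 1;
    }
    // launch kernel to modify the array
    kernel<<<1, 1>>>(array);
    // synchronize device and host before touching the array on the host
    cudaDeviceSynchronize();
    // print the result
    printf("array[0] = %d\n", array[0]);
    // wait for 20 seconds
    printf("Waiting for 20 seconds...\n");
    sleep(20);
    // free the memory
    cudaFree(array);
    return 0;
}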
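
The allocation size here is the element count multiplied by the element size, which is why 1024*1024*1024 ints come to 4 GiB rather than 1 GiB. A small Python sketch of that arithmetic follows; the helper name allocation_bytes is hypothetical, and the int size is taken from struct.calcsize, which matches sizeof(int) on common platforms.

```python
import struct

GIB = 1024 ** 3


def allocation_bytes(count: int, item_size: int) -> int:
    """Total bytes needed for `count` elements of `item_size` bytes each."""
    return count * item_size


int_size = struct.calcsize("i")  # size of a C int on this platform (usually 4)
total = allocation_bytes(1024 * 1024 * 1024, int_size)
print(f"{total} bytes = {total / GIB} GiB")
```

The same helper applies to the PyTorch case: 1 Gi float16 elements at 2 bytes each come to 2 GiB.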

Allocating with Python

This does something similar.
gpu02.py

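import mmap
import time

import torch  # PyTorch

SIZE_BYTES = 2 * 1024 * 1024 * 1024  # 2 GiB

# Create a 2 GiB file by seeking to the last byte and writing a single zero.
f = open("test.txt", "w+b")
f.seek(SIZE_BYTES - 1)  # 2 GiB - 1 byte
f.write(b"\x00")
f.flush()

# Map the whole file into memory and check that it is writable.
m = mmap.mmap(f.fileno(), 0)
m[0:4] = b"abcd"
print(m[0:4])

# Allocate 2 GiB on the GPU: float16 uses 2 bytes per element.
device = torch.device("cuda")
g = torch.empty(SIZE_BYTES // 2, dtype=torch.float16, device=device)

# Hold the allocation for 20 seconds, then release everything.
time.sleep(20)
del g
m.close()
f.close()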
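
The file-backed part of the script works by seeking to the last byte, writing one zero byte to fix the file length, and then mapping the file. The sketch below repeats that pattern using only the standard library and a much smaller size so that it runs quickly; the function name make_mapped_file, the temporary directory, and the 1 MiB size are assumptions for illustration.

```python
import mmap
import os
import tempfile


def make_mapped_file(path: str, size: int) -> bytes:
    """Create a file of `size` bytes, map it, write 4 bytes, and read them back."""
    with open(path, "w+b") as f:
        f.seek(size - 1)   # jump to the last byte
        f.write(b"\x00")   # writing it sets the file length to `size`
        f.flush()
        with mmap.mmap(f.fileno(), 0) as m:  # length 0 maps the whole file
            m[0:4] = b"abcd"
            return bytes(m[0:4])


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.bin")
    data = make_mapped_file(path, 1024 * 1024)  # 1 MiB instead of 2 GiB
    file_size = os.path.getsize(path)

print(data, file_size)
```

Using context managers here closes the mapping and the file even if an error occurs partway through.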

That's all.

