How to fix NVIDIA GPU memory that is not released

Posted at 2017-05-24, last updated at 2017-08-23

The problem: memory is not released

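I had been running Chainer many times on a multi-GPU machine.
When I checked the GPU status, nothing was running, yet most of the memory was still taken.
The short answer: processes were left behind. Recent versions of Chainer seem to run as multiple parallel processes, so killing the parent apparently leaves a lot of child processes alive.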

nvidia-smi

The Processes table is empty, yet Memory-Usage shows the memory almost full.
Because of this, every new job fails with out of memory and nothing can run.

+------------------------------------------------------+                       
| NVIDIA-SMI    .       Driver Version:                |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 0000:05:00.0     Off |                  Off |
| N/A   32C    P8    26W / 149W |  12200MiB / 12287MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 0000:06:00.0     Off |                  Off |
| N/A   29C    P8    29W / 149W |  12200MiB / 12287MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
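The symptom in the output above (almost full Memory-Usage on every GPU) can also be checked from a script. The following is only a sketch: `busy_gpus` is a hypothetical helper name, and the 90% default threshold is an arbitrary assumption. It reads `index, used, total` lines in MiB, which is what the `nvidia-smi` query shown in the usage comment prints.

```shell
#!/bin/sh
# Hypothetical helper (not from the original article): read lines of the form
# "index, used_MiB, total_MiB" on stdin and print the GPUs whose used memory
# is at or above a percentage threshold (first argument, default 90).
busy_gpus() {
    threshold="${1:-90}"
    awk -F', *' -v t="$threshold" '
        $3 > 0 && ($2 * 100 / $3) >= t {
            printf "GPU %s: %d/%d MiB used\n", $1, $2, $3
        }'
}

# Usage (requires the NVIDIA driver to be installed):
#   nvidia-smi --query-gpu=index,memory.used,memory.total \
#              --format=csv,noheader,nounits | busy_gpus 90
```

If this reports GPUs as busy while nothing is supposed to be running, leftover processes are a likely cause.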

Check whether any processes are left

lsof /dev/nvidia*

Checking the details

ps f -o user,pgrp,pid,pcpu,pmem,start,time,command -p `lsof -n -w -t /dev/nvidia*`

Kill the processes that are no longer in use.
Note: this kills all of them at once, so check the list carefully before running it.

kill -9 $(lsof -t /dev/nvidia*)
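Sending SIGKILL straight away gives the processes no chance to clean up. A gentler variation, offered here as a sketch and not as the procedure used in this article, is to send SIGTERM first and force-kill only the processes that are still alive after a short wait. `stop_pids` and `GRACE_SECONDS` are hypothetical names introduced for this example.

```shell
#!/bin/sh
# Hypothetical helper (not from the original article): ask the given PIDs to
# exit with SIGTERM, wait a grace period, then SIGKILL any that remain.
stop_pids() {
    grace="${GRACE_SECONDS:-5}"
    if [ "$#" -eq 0 ]; then
        echo "no PIDs given"
        return 0
    fi
    # Polite request first; ignore errors for PIDs that are already gone.
    kill -TERM "$@" 2>/dev/null || true
    sleep "$grace"
    for pid in "$@"; do
        # kill -0 only tests whether the PID still exists.
        if kill -0 "$pid" 2>/dev/null; then
            echo "PID $pid still present after SIGTERM, sending SIGKILL"
            kill -KILL "$pid" 2>/dev/null || true
        fi
    done
}

# Usage (assumes lsof is installed and the /dev/nvidia* device files exist):
#   stop_pids $(lsof -n -w -t /dev/nvidia*)
```

As with the one-liner, review the PID list before passing it in, because every listed process will be stopped.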

Watching the GPU while you do this makes the effect easy to see.

watch -n1 "nvidia-smi -i 1"

+------------------------------------------------------+                       
| NVIDIA-SMI    .       Driver Version:    .           |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 0000:05:00.0     Off |                  Off |
| N/A   32C    P8    26W / 149W |     56MiB / 12287MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 0000:06:00.0     Off |                  Off |
| N/A   29C    P8    29W / 149W |     56MiB / 12287MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

All clean now.

Things that did not work

I thought the reset command would do it, but it failed with an error telling me to kill the processes first.

sudo nvidia-smi --gpu-reset -i 0

I also looked into whether the memory could be released by writing CUDA code.

http://d.hatena.ne.jp/RabbitMan/20121110/1352549552
How to run CUDA code:
http://qiita.com/gunn/items/34075251ec6687e06800

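An attempt to free the memory from CUDA, and a misunderstanding?
The code calls cudaFree, but the memory was not released. Is that because the leftover processes are still holding it? Probably: cudaFree only releases memory that the calling process allocated itself, so a separate program cannot free what other processes hold.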

t906.cu

#include <stdio.h>
#include <stdlib.h>

#define DSIZE_MAX 100000000

// Abort with a message if the most recent CUDA runtime call failed.
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

int main(int argc, char *argv[]){
  // The number of ints to allocate is given on the command line.
  if (argc < 2) {printf("must specify allocation size on command line\n"); return 1;}
  const int dsize = atoi(argv[1]);
  if ((dsize < 1)||(dsize > DSIZE_MAX)) {printf("invalid size %d\n", dsize); return 1;}

  int *data;
  // Allocate device memory, then release it right away.
  cudaMalloc(&data, dsize*sizeof(int));
  cudaCheckErrors("cudaMalloc fail");
  cudaFree(data);
  cudaCheckErrors("cudaFree fail");
  return 0;
}

Build
nvcc -o t906 t906.cu
Run
./t906 20000000

Nothing in particular happened. I had hoped it would free the memory.
https://devtalk.nvidia.com/default/topic/892873/how-to-effectively-free-large-memory-allocation-/?offset=1

References

https://stackoverflow.com/questions/8223811/top-command-for-gpus-using-cuda
https://askubuntu.com/questions/346394/how-to-write-a-shscript-to-kill-9-a-pid-which-is-found-via-lsof-i
https://stackoverflow.com/questions/3855127/find-and-kill-process-locking-port-3000-on-mac
https://stackoverflow.com/questions/4354257/stop-all-cuda-processes-in-linux-without-restarting-the-computer
