
How to free NVIDIA GPU memory that will not release

More than 3 years have passed since last update.

The problem: memory is not released

I had been running Chainer over and over on a multi-GPU machine.
When I checked the GPU status, nothing appeared to be running, yet a large chunk of memory was still allocated.
The short answer: processes were still alive. Recent Chainer parallelizes work across processes, so killing the parent can leave a pile of child processes behind.
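As a quick illustration of why the memory stays allocated: with SIGKILL the parent gets no chance to clean up, and its children are simply re-parented and keep running. This is a generic sketch (not Chainer-specific); `sleep 30` stands in for a worker process that would be holding GPU memory.

```shell
# Spawn a parent that starts one background child, then SIGKILL only
# the parent. 'sleep 30' stands in for a worker holding GPU memory.
pidfile=$(mktemp)
sh -c "sleep 30 & echo \$! > $pidfile; wait" &
parent=$!
sleep 1                        # give the child time to start

kill -9 "$parent"              # SIGKILL: the parent cannot clean up
sleep 1

survived=no
child=$(cat "$pidfile")
if kill -0 "$child" 2>/dev/null; then
    survived=yes
    echo "child $child survived the parent"
    kill -9 "$child"           # the orphan has to be killed explicitly
fi
rm -f "$pidfile"
```

This is exactly the situation below: the parent Python process is gone, but its orphaned workers still hold their CUDA contexts.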

nvidia-smi

The Processes table shows nothing, yet Memory-Usage shows the memory almost fully used.
Because of this, every new job fails with out of memory.

+------------------------------------------------------+                       
| NVIDIA-SMI    .       Driver Version:                |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 0000:05:00.0     Off |                  Off |
| N/A   32C    P8    26W / 149W |  12200MiB / 12287MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 0000:06:00.0     Off |                  Off |
| N/A   29C    P8    29W / 149W |  12200MiB / 12287MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Check whether any processes are still holding the device

lsof /dev/nvidia*

To see the details:

ps f -o user,pgrp,pid,pcpu,pmem,start,time,command -p `lsof -n -w -t /dev/nvidia*`
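A slightly more defensive variant of the same check (my own wrapper, not from the original thread): only run `ps` when `lsof` actually found something, so it also behaves sanely on a machine where nothing holds the device.

```shell
# List processes holding the NVIDIA device files, if any.
# stderr is silenced so the check is a no-op when /dev/nvidia* is absent.
pids=$(lsof -n -w -t /dev/nvidia* 2>/dev/null | sort -u)
if [ -n "$pids" ]; then
    ps f -o user,pgrp,pid,pcpu,pmem,start,time,command -p $pids
else
    echo "no processes holding /dev/nvidia*"
fi
```

Note that `lsof` only shows other users' processes when run as root, so `sudo` may be needed to see everything.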

Kill the processes you no longer need.
Warning: the command below kills them all at once, so check the list carefully before running it.

kill -9 $(lsof -t /dev/nvidia*)
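If you prefer to see exactly what is being killed, a loop variant (my own sketch) prints each PID before removing it; on a machine with no stale GPU processes it simply does nothing.

```shell
# Kill each process holding the NVIDIA devices, printing the PID first.
# Errors are silenced so this is harmless when /dev/nvidia* does not exist.
for pid in $(lsof -t /dev/nvidia* 2>/dev/null | sort -u); do
    echo "killing $pid"
    kill -9 "$pid"
done
```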

Keeping a watch running makes it easy to see the effect:

watch -n1 "nvidia-smi -i 1"
+------------------------------------------------------+                       
| NVIDIA-SMI    .       Driver Version:    .           |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 0000:05:00.0     Off |                  Off |
| N/A   32C    P8    26W / 149W |     56MiB / 12287MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 0000:06:00.0     Off |                  Off |
| N/A   29C    P8    29W / 149W |     56MiB / 12287MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Clean again.

What did not work

I thought the reset command would do it, but it failed with an error telling me to kill the processes using the GPU first.

sudo nvidia-smi --gpu-reset -i 0

https://devtalk.nvidia.com/default/topic/958159/11-gb-of-gpu-ram-used-and-no-process-listed-by-nvidia-smi/

I also looked into whether the memory could be released by writing CUDA code myself.

http://d.hatena.ne.jp/RabbitMan/20121110/1352549552
How to run CUDA:
http://qiita.com/gunn/items/34075251ec6687e06800

Trying to free the memory from CUDA (a misunderstanding?)
The code calls cudaFree, yet nothing was released. Presumably because the leftover processes were still holding it: cudaFree only releases allocations made by the calling process's own CUDA context, so a separate program cannot free memory owned by another process.

t906.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>  // cudaMalloc / cudaFree / cudaGetLastError
#define DSIZE_MAX 100000000
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)


int main(int argc, char *argv[]){
  if (argc < 2) {printf("must specify allocation size on command line\n"); return 1;}
  const int dsize = atoi(argv[1]);
  if ((dsize < 1)||(dsize > DSIZE_MAX)) {printf("invalid size %d\n", dsize); return 1;}

  int *data;
  cudaMalloc(&data, dsize*sizeof(int));
  cudaCheckErrors("cudaMalloc fail");
  cudaFree(data);
  cudaCheckErrors("cudaFree fail");
  return 0;
}

Build:
nvcc -o t906 t906.cu
Run:
./t906 20000000

Nothing happened. I had hoped it would release the memory, but it did not.
https://devtalk.nvidia.com/default/topic/892873/how-to-effectively-free-large-memory-allocation-/?offset=1

References

https://stackoverflow.com/questions/8223811/top-command-for-gpus-using-cuda
https://askubuntu.com/questions/346394/how-to-write-a-shscript-to-kill-9-a-pid-which-is-found-via-lsof-i
https://stackoverflow.com/questions/3855127/find-and-kill-process-locking-port-3000-on-mac
https://stackoverflow.com/questions/4354257/stop-all-cuda-processes-in-linux-without-restarting-the-computer

miyamotok0105