19
19

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?

More than 5 years have passed since last update.

TensorFlowでつまづいた(Out of GPU Memoryとは)

Last updated at Posted at 2015-12-03

はじめに

お初の投稿です。前々から開発の備忘録としてブログのようなものを探していたのですが、Qiitaに出会い、いつか投稿しようと考えていました。
で、今回、解決できない壁にぶち当たりまして、投稿させていただくことになりました。

問題点(未解決12/3時点)(解決?12/7時点)

GoogleからTensorFlowが公開されてもうすぐ一ヶ月がたとうとしています。そんな私も最近Deeplearningを勉強し始めていたこともあり、TensorFlowに飛びつきました。

TensorFlowについて、すでに色々なところでまとめられており、チュートリアルもスムーズに行きました。(後日まとめてみようかと思っています)

そして、画像データの認識を行おうとプログラムを書き実行してみました。
すると以下のエラーが発生

$ python test.py 
I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 12
I tensorflow/core/common_runtime/gpu/gpu_init.cc:88] Found device 0 with properties: 
name: GeForce GTX TITAN X
major: 5 minor: 2 memoryClockRate (GHz) 1.076
pciBusID 0000:05:00.0
Total memory: 11.99GiB
Free memory: 11.47GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:112] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:122] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:643] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:05:00.0)
I tensorflow/core/common_runtime/gpu/gpu_region_allocator.cc:47] Setting region size to 11701021287
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 12
I tensorflow/core/common_runtime/gpu/gpu_region_allocator.cc:339] Chunk size: 256 (256B) Pool: chunks: 64 free: 24 cumulative malloc: 134728 cumulative freed: 134688
Number of chunks: 64, in_use chunks: 40
I tensorflow/core/common_runtime/gpu/gpu_region_allocator.cc:339] Chunk size: 4096 (4.0KiB) Pool: chunks: 8 free: 2 cumulative malloc: 2812 cumulative freed: 2806
Number of chunks: 8, in_use chunks: 6
I tensorflow/core/common_runtime/gpu/gpu_region_allocator.cc:339] Chunk size: 8192 (8.0KiB) Pool: chunks: 8 free: 3 cumulative malloc: 2814 cumulative freed: 2809
Number of chunks: 8, in_use chunks: 5
I tensorflow/core/common_runtime/gpu/gpu_region_allocator.cc:339] Chunk size: 16384 (16.0KiB) Pool: chunks: 8 free: 3 cumulative malloc: 11233 cumulative freed: 11228
Number of chunks: 8, in_use chunks: 5
I tensorflow/core/common_runtime/gpu/gpu_region_allocator.cc:339] Chunk size: 65536 (64.0KiB) Pool: chunks: 16 free: 16 cumulative malloc: 44896 cumulative freed: 44896
Number of chunks: 16, in_use chunks: 0
I tensorflow/core/common_runtime/gpu/gpu_region_allocator.cc:339] Chunk size: 98304 (96.0KiB) Pool: chunks: 8 free: 8 cumulative malloc: 11224 cumulative freed: 11224
Number of chunks: 8, in_use chunks: 0
I tensorflow/core/common_runtime/gpu/gpu_region_allocator.cc:339] Chunk size: 131072 (128.0KiB) Pool: chunks: 4 free: 4 cumulative malloc: 14030 cumulative freed: 14030
Number of chunks: 4, in_use chunks: 0
I tensorflow/core/common_runtime/gpu/gpu_region_allocator.cc:339] Chunk size: 212992 (208.0KiB) Pool: chunks: 8 free: 3 cumulative malloc: 11232 cumulative freed: 11227
Number of chunks: 8, in_use chunks: 5
I tensorflow/core/common_runtime/gpu/gpu_region_allocator.cc:339] Chunk size: 229376 (224.0KiB) Pool: chunks: 2 free: 1 cumulative malloc: 2 cumulative freed: 1
Number of chunks: 2, in_use chunks: 1
I tensorflow/core/common_runtime/gpu/gpu_region_allocator.cc:339] Chunk size: 262144 (256.0KiB) Pool: chunks: 8 free: 8 cumulative malloc: 16836 cumulative freed: 16836
Number of chunks: 8, in_use chunks: 0
I tensorflow/core/common_runtime/gpu/gpu_region_allocator.cc:339] Chunk size: 425984 (416.0KiB) Pool: chunks: 1 free: 1 cumulative malloc: 2806 cumulative freed: 2806
Number of chunks: 1, in_use chunks: 0
I tensorflow/core/common_runtime/gpu/gpu_region_allocator.cc:339] Chunk size: 524288 (512.0KiB) Pool: chunks: 8 free: 8 cumulative malloc: 25254 cumulative freed: 25254
Number of chunks: 8, in_use chunks: 0
I tensorflow/core/common_runtime/gpu/gpu_region_allocator.cc:339] Chunk size: 1048576 (1.00MiB) Pool: chunks: 8 free: 8 cumulative malloc: 25254 cumulative freed: 25254
Number of chunks: 8, in_use chunks: 0
I tensorflow/core/common_runtime/gpu/gpu_region_allocator.cc:339] Chunk size: 13631488 (13.00MiB) Pool: chunks: 8 free: 3 cumulative malloc: 2814 cumulative freed: 2809
Number of chunks: 8, in_use chunks: 5
I tensorflow/core/common_runtime/gpu/gpu_region_allocator.cc:339] Chunk size: 268435456 (256.00MiB) Pool: chunks: 1 free: 1 cumulative malloc: 1 cumulative freed: 1
Number of chunks: 1, in_use chunks: 0
I tensorflow/core/common_runtime/gpu/gpu_region_allocator.cc:339] Chunk size: 369098752 (352.00MiB) Pool: chunks: 1 free: 1 cumulative malloc: 1 cumulative freed: 1
Number of chunks: 1, in_use chunks: 0
I tensorflow/core/common_runtime/gpu/gpu_region_allocator.cc:339] Chunk size: 738197504 (704.00MiB) Pool: chunks: 1 free: 0 cumulative malloc: 1 cumulative freed: 0
Number of chunks: 1, in_use chunks: 1
I tensorflow/core/common_runtime/gpu/gpu_region_allocator.cc:339] Chunk size: 1476395008 (1.38GiB) Pool: chunks: 0 free: 0 cumulative malloc: 0 cumulative freed: 0
Number of chunks: 0, in_use chunks: 0
I tensorflow/core/common_runtime/gpu/gpu_region_allocator.cc:339] Chunk size: 2952790016 (2.75GiB) Pool: chunks: 3 free: 3 cumulative malloc: 3 cumulative freed: 3
Number of chunks: 3, in_use chunks: 0
I tensorflow/core/common_runtime/gpu/gpu_region_allocator.cc:345] Aggregate Region Memory: 11701021287 (10.90GiB)
I tensorflow/core/common_runtime/gpu/gpu_region_allocator.cc:347] Aggregate Chunk Memory: 10363027456 (9.65GiB)
W tensorflow/core/common_runtime/gpu/gpu_region_allocator.cc:89] Out of GPU memory, see memory state dump above
W tensorflow/core/kernels/conv_ops.cc:162] Resource exhausted: OOM when allocating tensor with shapedim { size: 28060 } dim { size: 14 } dim { size: 14 } dim { size: 64 }
W tensorflow/core/common_runtime/executor.cc:1027] 0x10426540 Compute status: Resource exhausted: OOM when allocating tensor with shapedim { size: 28060 } dim { size: 14 } dim { size: 14 } dim { size: 64 }
	 [[Node: conv2/Conv2D = Conv2D[T=DT_FLOAT, padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](pool1/MaxPool, conv2/Variable)]]
W tensorflow/core/common_runtime/executor.cc:1027] 0x127a7090 Compute status: Resource exhausted: OOM when allocating tensor with shapedim { size: 28060 } dim { size: 14 } dim { size: 14 } dim { size: 64 }
	 [[Node: conv2/Conv2D = Conv2D[T=DT_FLOAT, padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](pool1/MaxPool, conv2/Variable)]]
	 [[Node: range_1/_15 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_394_range_1", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
W tensorflow/core/common_runtime/executor.cc:1027] 0x127a7090 Compute status: Resource exhausted: OOM when allocating tensor with shapedim { size: 28060 } dim { size: 14 } dim { size: 14 } dim { size: 64 }
	 [[Node: conv2/Conv2D = Conv2D[T=DT_FLOAT, padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](pool1/MaxPool, conv2/Variable)]]
	 [[Node: Cast/_13 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_393_Cast", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Traceback (most recent call last):
  File "img_ditect_train.py", line 229, in <module>
    keep_prob: 1.0})
  File "/home/tensorflow-GPU/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 345, in run
    results = self._do_run(target_list, unique_fetch_targets, feed_dict_string)
  File "/home/tensorflow-GPU/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 419, in _do_run
    e.code)
tensorflow.python.framework.errors.ResourceExhaustedError: OOM when allocating tensor with shapedim { size: 28060 } dim { size: 14 } dim { size: 14 } dim { size: 64 }
	 [[Node: conv2/Conv2D = Conv2D[T=DT_FLOAT, padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](pool1/MaxPool, conv2/Variable)]]
	 [[Node: range_1/_15 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_394_range_1", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Caused by op u'conv2/Conv2D', defined at:
  File "test.py", line 196, in <module>
    logits = inference(images_placeholder, keep_prob)
  File "test.py", line 70, in inference
    h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
  File "test.py", line 46, in conv2d
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')
  File "/home/tensorflow-GPU/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 207, in conv2d
    use_cudnn_on_gpu=use_cudnn_on_gpu, name=name)
  File "/home/tensorflow-GPU/local/lib/python2.7/site-packages/tensorflow/python/ops/op_def_library.py", line 633, in apply_op
    op_def=op_def)
  File "/home/tensorflow-GPU/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1710, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/tensorflow-GPU/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 988, in __init__
    self._traceback = _extract_stack()

現在、原因を調べているんだけれどもGPUのメモリが足りない?
んーわからない・・・
どなたかご教授お願いしたい・・・

追記(12/7)

日にちを置くと物事って冷静に見れるもので。
上のエラーに関してですが、とりあえず、エラー吐かずにプログラムが通ったので追記します。あくまでとりあえずですが。

MATS様より**”shapedim { size: 28060 } dim { size: 14 } dim { size: 14 } dim { size: 64 }のところのサイズを小さくして試してみてはいかがでしょう?”**というご指摘をいただき、{ size: 28060 }という部分に着目しました。

先週の段階では、何なんだよ28060って・・・となっていて色々悪循環に陥っていたのですが、冷静になって、与えている画像の枚数じゃないか?と気が付きました。(中途半端な枚数)
えぇ、なんで気が付かなかったんでしょうね。
ちなみにプログラムは画像認識のプログラムです。完成したら載せる予定です。

というわけで、画像の学習枚数を1000枚程度に減らしたら無事、プログラムが通りました。認識精度も悪くないです。

しかし、私の認識では、与えてやる画像が多くないと高い精度が出ないという認識なので、プログラム処理的に解決してやりたいなと考えています。

では、また更新があり次第追記します。

19
19
4

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
19
19

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?