# Purpose
When training a deep-learning model on a GPU, you may run into an error like the one below. (This assumes code with a track record, e.g., fetched from GitHub, so the code itself is not the problem.)

```
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
```

With the surrounding output included, it looks like this:
```
2019-09-21 17:13:14.228372: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 13 Chunks of size 921600 totalling 11.43MiB
2019-09-21 17:13:14.231275: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 1 Chunks of size 1707008 totalling 1.63MiB
2019-09-21 17:13:14.233812: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 2 Chunks of size 1843200 totalling 3.52MiB
2019-09-21 17:13:14.237542: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 1 Chunks of size 3145728 totalling 3.00MiB
2019-09-21 17:13:14.240348: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 10 Chunks of size 3686400 totalling 35.16MiB
2019-09-21 17:13:14.242938: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 1 Chunks of size 3791360 totalling 3.62MiB
2019-09-21 17:13:14.245915: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 2 Chunks of size 4194304 totalling 8.00MiB
2019-09-21 17:13:14.248532: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 1 Chunks of size 5487616 totalling 5.23MiB
2019-09-21 17:13:14.252219: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 3 Chunks of size 16777216 totalling 48.00MiB
2019-09-21 17:13:14.254865: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 2 Chunks of size 20971520 totalling 40.00MiB
2019-09-21 17:13:14.257753: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 17 Chunks of size 41943040 totalling 680.00MiB
2019-09-21 17:13:14.260766: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 1 Chunks of size 50331648 totalling 48.00MiB
2019-09-21 17:13:14.263354: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 1 Chunks of size 74894336 totalling 71.42MiB
2019-09-21 17:13:14.266044: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 23 Chunks of size 83886080 totalling 1.80GiB
2019-09-21 17:13:14.270016: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 1 Chunks of size 146800640 totalling 140.00MiB
2019-09-21 17:13:14.272684: I tensorflow/core/common_runtime/bfc_allocator.cc:816] Sum Total of in-use chunks: 2.87GiB
2019-09-21 17:13:14.275331: I tensorflow/core/common_runtime/bfc_allocator.cc:818] total_region_allocated_bytes_: 3146173440 memory_limit_: 3146173644 available bytes: 204 curr_region_allocation_bytes_: 4294967296
2019-09-21 17:13:14.279840: I tensorflow/core/common_runtime/bfc_allocator.cc:824] Stats:
Limit:                  3146173644
InUse:                  3086549760
MaxInUse:               3086550272
NumAllocs:                     835
MaxAllocSize:           1363542016
2019-09-21 17:13:14.286537: W tensorflow/core/common_runtime/bfc_allocator.cc:319] ********************x******************************************************************************x
2019-09-21 17:13:14.290629: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at transpose_op.cc:199 : Resource exhausted: OOM when allocating tensor with shape[128,32,16,160] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "error_analysis_cifar_finish.py", line 341, in <module>
    train(7)
  File "error_analysis_cifar_finish.py", line 325, in train
    callbacks=[scheduler, cb, hist], epochs=600)  # 300 epochs when using cosine decay
  File "C:\Users\XYZZZ\AppData\Roaming\Python\Python37\site-packages\tensorflow\python\keras\engine\training.py", line 1433, in fit_generator
    steps_name='steps_per_epoch')
  File "C:\Users\XYZZZ\AppData\Roaming\Python\Python37\site-packages\tensorflow\python\keras\engine\training_generator.py", line 264, in model_iteration
    batch_outs = batch_function(*batch_data)
  File "C:\Users\XYZZZ\AppData\Roaming\Python\Python37\site-packages\tensorflow\python\keras\engine\training.py", line 1175, in train_on_batch
    outputs = self.train_function(ins)  # pylint: disable=not-callable
  File "C:\Users\XYZZZ\AppData\Roaming\Python\Python37\site-packages\tensorflow\python\keras\backend.py", line 3292, in __call__
    run_metadata=self.run_metadata)
  File "C:\Users\XYZZZ\AppData\Roaming\Python\Python37\site-packages\tensorflow\python\client\session.py", line 1458, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[128,160,32,16] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node conv2d_15/Conv2D}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[loss/mul/_561]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[128,160,32,16] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node conv2d_15/Conv2D}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.
```
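As a sanity check, the log's numbers are internally consistent: the OOM message reports a float tensor of shape `[128, 160, 32, 16]`, which at 4 bytes per element is exactly 41,943,040 bytes, matching the "17 Chunks of size 41943040" line above. A quick back-of-envelope helper (mine, not from the original script) shows the arithmetic, and also that halving `batch_size` halves this allocation:

```python
def tensor_bytes(shape, bytes_per_elem=4):
    """Memory footprint of a dense tensor; float32 = 4 bytes per element."""
    n = 1
    for d in shape:
        n *= d
    return n * bytes_per_elem

# The failing tensor from the OOM message (batch_size = 128):
print(tensor_bytes([128, 160, 32, 16]))  # 41943040 bytes = 40 MiB

# Same tensor with batch_size halved to 64:
print(tensor_bytes([64, 160, 32, 16]))   # 20971520 bytes = 20 MiB
```

The 20,971,520-byte figure also appears in the chunk list above, which suggests the allocator is dominated by batch-sized activation tensors, and that shrinking `batch_size` attacks exactly the right term.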
# Fix
Reduce memory usage. Concretely, I kept shrinking `batch_size` until the error no longer appeared.
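The strategy above can be sketched as a halving loop. This is a simulation, not the original training script: the "training step" below is a plain function that raises `MemoryError` when the batch is too large (standing in for TensorFlow's `ResourceExhaustedError`), and the per-sample byte cost and budget are illustrative numbers, so the loop runs without TensorFlow installed:

```python
BUDGET = 3_086_549_760  # the "InUse" bytes from the log, used as a stand-in limit

def simulated_train_step(batch_size, per_sample_bytes=30_000_000):
    """Pretend to run one training step; fail if the batch exceeds the budget."""
    if batch_size * per_sample_bytes > BUDGET:
        raise MemoryError("OOM")  # stands in for ResourceExhaustedError

def find_working_batch_size(start=128):
    """Halve batch_size until one simulated step succeeds."""
    bs = start
    while bs >= 1:
        try:
            simulated_train_step(bs)
            return bs
        except MemoryError:
            bs //= 2
    return None

print(find_working_batch_size())  # with these numbers: 128 fails, 64 fits -> 64
```

In the real script you would simply edit the `batch_size` passed to your data generator or `fit` call; the point is that the search is cheap, so trying 64, then 32, and so on is a perfectly reasonable workflow.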
# Notes
Strictly speaking, I should reason this through properly: how much GPU memory the environment has, how much this run tries to allocate, and therefore why memory was exhausted. To save time, I'm skipping that analysis here.
# Summary
If this post helps anyone resolve the error by reducing `batch_size`, I'd be glad.
# Related (by the same author)
- Using Python stress-free! (Getting familiar with generators. Apparently "since 1975".)
- Using Python stress-free! (In Python, everything is implemented as an object)
- Using Python stress-free! (Getting along with Pylint)
- Using Python stress-free! (Expressions and Statements)
- Learning Python carefully, in both English and Japanese.
# Future work
Comments are welcome.
I'll keep studying.