
TensorFlow: Is AdamOptimizer failing to converge? ReluGrad input is not finite. : Tensor had NaN values



Introduction

I ran into the error "ReluGrad input is not finite. : Tensor had NaN values", so this is a short note to myself on how I fixed it.


Is the input not finite? ~ReluGrad input is not finite.~

The following GitHub issue solved the problem for me.

Tensorflow crashed when using AdamOptimizer #323

While running the image-training program I published the other day, the following error occurred from time to time.


Error output

I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 12

I tensorflow/core/common_runtime/gpu/gpu_init.cc:88] Found device 0 with properties:
name: GeForce GTX TITAN X
major: 5 minor: 2 memoryClockRate (GHz) 1.076
pciBusID 0000:05:00.0
Total memory: 11.99GiB
Free memory: 11.69GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:112] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:122] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:643] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:05:00.0)
I tensorflow/core/common_runtime/gpu/gpu_region_allocator.cc:47] Setting region size to 11926613607
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 12
step 0, training accuracy 0.486591
step 10, training accuracy 0.918225
step 20, training accuracy 0.963173
step 30, training accuracy 0.987819
step 40, training accuracy 0.996789
step 50, training accuracy 0.999339
step 60, training accuracy 0.999528
step 70, training accuracy 0.999528
step 80, training accuracy 0.999528
step 90, training accuracy 0.999717
step 100, training accuracy 0.999811
step 110, training accuracy 0.999717
step 120, training accuracy 0.999245
step 130, training accuracy 0.999811
step 140, training accuracy 0.999811
step 150, training accuracy 0.999622
step 160, training accuracy 0.999717
step 170, training accuracy 0.999811
step 180, training accuracy 0.999717
step 190, training accuracy 0.999811
step 200, training accuracy 0.999622
step 210, training accuracy 0.999811
step 220, training accuracy 0.999717
step 230, training accuracy 0.999717
step 240, training accuracy 0.999717
step 250, training accuracy 0.999622
E tensorflow/core/kernels/check_numerics_op.cc:142] abnormal_detected_host @0x7ffc4be93b20 = {1, 0} ReluGrad input is not finite.
W tensorflow/core/common_runtime/executor.cc:1027] 0x5b7c3f0 Compute status: Invalid argument: ReluGrad input is not finite. : Tensor had NaN values
[[Node: gradients/conv1/Relu_grad/conv1/Relu/CheckNumerics = CheckNumerics[T=DT_FLOAT, message="ReluGrad input is not finite.", _device="/job:localhost/replica:0/task:0/gpu:0"](conv1/add)]]
E tensorflow/core/kernels/check_numerics_op.cc:142] abnormal_detected_host @0x7ffc4be93a60 = {1, 0} ReluGrad input is not finite.
W tensorflow/core/common_runtime/executor.cc:1027] 0x5b7c3f0 Compute status: Invalid argument: ReluGrad input is not finite. : Tensor had NaN values
[[Node: gradients/conv2/Relu_grad/conv2/Relu/CheckNumerics = CheckNumerics[T=DT_FLOAT, message="ReluGrad input is not finite.", _device="/job:localhost/replica:0/task:0/gpu:0"](conv2/add)]]
E tensorflow/core/kernels/check_numerics_op.cc:142] abnormal_detected_host @0x7ffc27d45c40 = {1, 0} ReluGrad input is not finite.
W tensorflow/core/common_runtime/executor.cc:1027] 0x5b7c3f0 Compute status: Invalid argument: ReluGrad input is not finite. : Tensor had NaN values
[[Node: gradients/fc1/Relu_grad/fc1/Relu/CheckNumerics = CheckNumerics[T=DT_FLOAT, message="ReluGrad input is not finite.", _device="/job:localhost/replica:0/task:0/gpu:0"](fc1/add)]]
Traceback (most recent call last):
File "train.py", line 220, in <module>
keep_prob: 0.5})
File "/home/tensorflow-GPU/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 345, in run
results = self._do_run(target_list, unique_fetch_targets, feed_dict_string)
File "/home/tensorflow-GPU/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 419, in _do_run
e.code)
tensorflow.python.framework.errors.InvalidArgumentError: ReluGrad input is not finite. : Tensor had NaN values
[[Node: gradients/conv1/Relu_grad/conv1/Relu/CheckNumerics = CheckNumerics[T=DT_FLOAT, message="ReluGrad input is not finite.", _device="/job:localhost/replica:0/task:0/gpu:0"](conv1/add)]]
Caused by op u'gradients/conv1/Relu_grad/conv1/Relu/CheckNumerics', defined at:
File "train.py", line 197, in <module>
train_op = training(loss_value, FLAGS.learning_rate)
File "train.py", line 126, in training
train_step = tf.train.AdamOptimizer(learning_rate).minimize(loss)
File "/home/tensorflow-GPU/local/lib/python2.7/site-packages/tensorflow/python/training/optimizer.py", line 165, in minimize
gate_gradients=gate_gradients)
File "/home/tensorflow-GPU/local/lib/python2.7/site-packages/tensorflow/python/training/optimizer.py", line 205, in compute_gradients
loss, var_list, gate_gradients=(gate_gradients == Optimizer.GATE_OP))
File "/home/tensorflow-GPU/local/lib/python2.7/site-packages/tensorflow/python/ops/gradients.py", line 414, in gradients
in_grads = _AsList(grad_fn(op_wrapper, *out_grads))
File "/home/tensorflow-GPU/local/lib/python2.7/site-packages/tensorflow/python/ops/nn_grad.py", line 107, in _ReluGrad
t = _VerifyTensor(op.inputs[0], op.name, "ReluGrad input is not finite.")
File "/home/tensorflow-GPU/local/lib/python2.7/site-packages/tensorflow/python/ops/nn_grad.py", line 100, in _VerifyTensor
verify_input = array_ops.check_numerics(t, message=msg)
File "/home/tensorflow-GPU/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 48, in check_numerics
name=name)
File "/home/tensorflow-GPU/local/lib/python2.7/site-packages/tensorflow/python/ops/op_def_library.py", line 633, in apply_op
op_def=op_def)
File "/home/tensorflow-GPU/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1710, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/tensorflow-GPU/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 988, in __init__
self._traceback = _extract_stack()

...which was originally created as op u'conv1/Relu', defined at:
File "train.py", line 193, in <module>
logits = inference(images_placeholder, keep_prob)
File "train.py", line 59, in inference
h_conv1 = tf.nn.relu(conv2d(x_images, W_conv1) + b_conv1)
File "/home/tensorflow-GPU/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 506, in relu
return _op_def_lib.apply_op("Relu", features=features, name=name)
File "/home/tensorflow-GPU/local/lib/python2.7/site-packages/tensorflow/python/ops/op_def_library.py", line 633, in apply_op
op_def=op_def)
File "/home/tensorflow-GPU/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1710, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/tensorflow-GPU/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 988, in __init__
self._traceback = _extract_stack()


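When I reran the program several times, the error sometimes did not occur at all, which was unsettling.

Searching for the error message turned up an answer suggesting that AdamOptimizer (a gradient-based optimizer) might not be converging. But the results had been converging nicely up to that point...

Reading further, the cause became clear.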

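According to the thread, when the per-class probability-like values returned by the inference function (logits in my program) contain a 0, the cross_entropy computation inside the loss function ends up evaluating 0*log(0). Since log(0) is -inf and 0 times -inf is NaN in floating-point arithmetic, a NaN ends up in the loss.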
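The failure mode can be reproduced without TensorFlow. The following is a minimal sketch in plain Python, not the code from train.py: `ieee_log` and `cross_entropy` are hypothetical helpers, and `ieee_log` mimics the floating-point behaviour assumed here for tf.log (log(0) gives -inf), because Python's own math.log raises an exception for 0 instead.

```python
import math

def ieee_log(x):
    """Natural log with IEEE 754 semantics: log(0) is -inf, not an exception."""
    return -math.inf if x == 0.0 else math.log(x)

def cross_entropy(labels, probs):
    """Unprotected cross entropy: -sum(labels * log(probs))."""
    return -sum(y * ieee_log(p) for y, p in zip(labels, probs))

# A fully saturated prediction: the output is exactly 0 for the wrong classes.
labels = [0.0, 1.0, 0.0]
probs = [0.0, 1.0, 0.0]

# The first term is 0.0 * -inf, which is NaN, and NaN poisons the whole sum.
loss = cross_entropy(labels, probs)
print(loss)  # nan
```

A single such term is enough to turn the whole sum into NaN. That would fit the log above, where the error only shows up once training accuracy is close to 1.0 and the outputs have presumably saturated.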

So I changed the line

cross_entropy = -tf.reduce_sum(labels*tf.log(logits))

to

cross_entropy = -tf.reduce_sum(labels*tf.log(tf.clip_by_value(logits,1e-10,1.0)))

and the error went away.

All this does is clamp the value passed to log to the range 1e-10 to 1.0.
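As a rough illustration of what the clamp does to the loss, here is the same calculation in plain Python. `clip` and `clipped_cross_entropy` are hypothetical stand-ins for tf.clip_by_value and the fixed cross_entropy line. They are not TensorFlow APIs, and the input values are made up for the example.

```python
import math

def clip(x, lo, hi):
    """Clamp x to [lo, hi]; a scalar stand-in for tf.clip_by_value."""
    return max(lo, min(x, hi))

def clipped_cross_entropy(labels, probs, eps=1e-10):
    """Cross entropy with each probability clamped to [eps, 1.0] before the log."""
    return -sum(y * math.log(clip(p, eps, 1.0)) for y, p in zip(labels, probs))

# Saturated and correct: each 0 * log(0) term becomes 0 * log(1e-10), which is 0.
correct = clipped_cross_entropy([0.0, 1.0, 0.0], [0.0, 1.0, 0.0])

# Saturated and wrong: the loss is capped at -log(1e-10) instead of being inf.
wrong = clipped_cross_entropy([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])

print(math.isfinite(correct), math.isfinite(wrong))  # True True
print(wrong)  # about 23.03
```

With a lower bound of 1e-10, the penalty for a single completely wrong prediction is capped at about 23.03, so the loss stays finite in both cases.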