TensorFlow: is AdamOptimizer failing to converge? ReluGrad input is not finite. : Tensor had NaN values

Last updated at 2015-12-21. Posted at 2015-12-18.

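Introduction

I ran into the error "ReluGrad input is not finite. : Tensor had NaN values", so I am recording the fix here as a note to myself.

Is the input not finite? ~ReluGrad input is not finite.~

The problem was resolved with the help of the following issue.
Tensorflow crashed when using AdamOptimizer #323

While running the image-training program I published the other day, the following error occurred from time to time.

Error output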

I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 12
I tensorflow/core/common_runtime/gpu/gpu_init.cc:88] Found device 0 with properties: 
name: GeForce GTX TITAN X
major: 5 minor: 2 memoryClockRate (GHz) 1.076
pciBusID 0000:05:00.0
Total memory: 11.99GiB
Free memory: 11.69GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:112] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:122] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:643] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:05:00.0)
I tensorflow/core/common_runtime/gpu/gpu_region_allocator.cc:47] Setting region size to 11926613607
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 12
step 0, training accuracy 0.486591
step 10, training accuracy 0.918225
step 20, training accuracy 0.963173
step 30, training accuracy 0.987819
step 40, training accuracy 0.996789
step 50, training accuracy 0.999339
step 60, training accuracy 0.999528
step 70, training accuracy 0.999528
step 80, training accuracy 0.999528
step 90, training accuracy 0.999717
step 100, training accuracy 0.999811
step 110, training accuracy 0.999717
step 120, training accuracy 0.999245
step 130, training accuracy 0.999811
step 140, training accuracy 0.999811
step 150, training accuracy 0.999622
step 160, training accuracy 0.999717
step 170, training accuracy 0.999811
step 180, training accuracy 0.999717
step 190, training accuracy 0.999811
step 200, training accuracy 0.999622
step 210, training accuracy 0.999811
step 220, training accuracy 0.999717
step 230, training accuracy 0.999717
step 240, training accuracy 0.999717
step 250, training accuracy 0.999622
E tensorflow/core/kernels/check_numerics_op.cc:142] abnormal_detected_host @0x7ffc4be93b20 = {1, 0} ReluGrad input is not finite.
W tensorflow/core/common_runtime/executor.cc:1027] 0x5b7c3f0 Compute status: Invalid argument: ReluGrad input is not finite. : Tensor had NaN values
	 [[Node: gradients/conv1/Relu_grad/conv1/Relu/CheckNumerics = CheckNumerics[T=DT_FLOAT, message="ReluGrad input is not finite.", _device="/job:localhost/replica:0/task:0/gpu:0"](conv1/add)]]
E tensorflow/core/kernels/check_numerics_op.cc:142] abnormal_detected_host @0x7ffc4be93a60 = {1, 0} ReluGrad input is not finite.
W tensorflow/core/common_runtime/executor.cc:1027] 0x5b7c3f0 Compute status: Invalid argument: ReluGrad input is not finite. : Tensor had NaN values
	 [[Node: gradients/conv2/Relu_grad/conv2/Relu/CheckNumerics = CheckNumerics[T=DT_FLOAT, message="ReluGrad input is not finite.", _device="/job:localhost/replica:0/task:0/gpu:0"](conv2/add)]]
E tensorflow/core/kernels/check_numerics_op.cc:142] abnormal_detected_host @0x7ffc27d45c40 = {1, 0} ReluGrad input is not finite.
W tensorflow/core/common_runtime/executor.cc:1027] 0x5b7c3f0 Compute status: Invalid argument: ReluGrad input is not finite. : Tensor had NaN values
	 [[Node: gradients/fc1/Relu_grad/fc1/Relu/CheckNumerics = CheckNumerics[T=DT_FLOAT, message="ReluGrad input is not finite.", _device="/job:localhost/replica:0/task:0/gpu:0"](fc1/add)]]
Traceback (most recent call last):
  File "train.py", line 220, in <module>
    keep_prob: 0.5})
  File "/home/tensorflow-GPU/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 345, in run
    results = self._do_run(target_list, unique_fetch_targets, feed_dict_string)
  File "/home/tensorflow-GPU/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 419, in _do_run
    e.code)
tensorflow.python.framework.errors.InvalidArgumentError: ReluGrad input is not finite. : Tensor had NaN values
	 [[Node: gradients/conv1/Relu_grad/conv1/Relu/CheckNumerics = CheckNumerics[T=DT_FLOAT, message="ReluGrad input is not finite.", _device="/job:localhost/replica:0/task:0/gpu:0"](conv1/add)]]
Caused by op u'gradients/conv1/Relu_grad/conv1/Relu/CheckNumerics', defined at:
  File "train.py", line 197, in <module>
    train_op = training(loss_value, FLAGS.learning_rate)
  File "train.py", line 126, in training
    train_step = tf.train.AdamOptimizer(learning_rate).minimize(loss)
  File "/home/tensorflow-GPU/local/lib/python2.7/site-packages/tensorflow/python/training/optimizer.py", line 165, in minimize
    gate_gradients=gate_gradients)
  File "/home/tensorflow-GPU/local/lib/python2.7/site-packages/tensorflow/python/training/optimizer.py", line 205, in compute_gradients
    loss, var_list, gate_gradients=(gate_gradients == Optimizer.GATE_OP))
  File "/home/tensorflow-GPU/local/lib/python2.7/site-packages/tensorflow/python/ops/gradients.py", line 414, in gradients
    in_grads = _AsList(grad_fn(op_wrapper, *out_grads))
  File "/home/tensorflow-GPU/local/lib/python2.7/site-packages/tensorflow/python/ops/nn_grad.py", line 107, in _ReluGrad
    t = _VerifyTensor(op.inputs[0], op.name, "ReluGrad input is not finite.")
  File "/home/tensorflow-GPU/local/lib/python2.7/site-packages/tensorflow/python/ops/nn_grad.py", line 100, in _VerifyTensor
    verify_input = array_ops.check_numerics(t, message=msg)
  File "/home/tensorflow-GPU/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 48, in check_numerics
    name=name)
  File "/home/tensorflow-GPU/local/lib/python2.7/site-packages/tensorflow/python/ops/op_def_library.py", line 633, in apply_op
    op_def=op_def)
  File "/home/tensorflow-GPU/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1710, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/tensorflow-GPU/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 988, in __init__
    self._traceback = _extract_stack()

...which was originally created as op u'conv1/Relu', defined at:
  File "train.py", line 193, in <module>
    logits = inference(images_placeholder, keep_prob)
  File "train.py", line 59, in inference
    h_conv1 = tf.nn.relu(conv2d(x_images, W_conv1) + b_conv1)
  File "/home/tensorflow-GPU/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 506, in relu
    return _op_def_lib.apply_op("Relu", features=features, name=name)
  File "/home/tensorflow-GPU/local/lib/python2.7/site-packages/tensorflow/python/ops/op_def_library.py", line 633, in apply_op
    op_def=op_def)
  File "/home/tensorflow-GPU/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1710, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/tensorflow-GPU/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 988, in __init__
    self._traceback = _extract_stack()

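Rerunning the program several times sometimes produced no error at all, which was unsettling.
Searching for the error message turned up an answer suggesting that AdamOptimizer (a gradient-based optimizer) might not be converging. But the training accuracy was clearly converging up to that point...

Reading further, the cause became clear.
When a 0 appears among the per-class probability-like values computed by the inference function (logits in my program), the cross_entropy part of the loss function evaluates 0*log(0). In floating point, log(0) is -inf and 0 times -inf is NaN, so NaN ends up in the loss.
So I changed the line
cross_entropy = -tf.reduce_sum(labels*tf.log(logits))
to
cross_entropy = -tf.reduce_sum(labels*tf.log(tf.clip_by_value(logits,1e-10,1.0)))
and that resolved the error.
All this does is restrict the value passed to log to the range 1e-10 to 1.0.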
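
The effect of the clipping can be reproduced outside TensorFlow. The following is a minimal sketch in NumPy (an assumption made for illustration; the original program uses TensorFlow ops). The function names and the sample arrays are hypothetical. It shows that a single 0 inside the log turns the summed loss into NaN, while clipping to [1e-10, 1.0] keeps it finite.

```python
import numpy as np


def cross_entropy_naive(labels, probs):
    # Same formula as the original line: -sum(labels * log(probs)).
    # log(0) is -inf, and 0 * -inf is NaN, which poisons the whole sum.
    with np.errstate(divide="ignore", invalid="ignore"):
        return -np.sum(labels * np.log(probs))


def cross_entropy_clipped(labels, probs, eps=1e-10):
    # Clip into [eps, 1.0] before taking the log, mirroring
    # tf.clip_by_value(logits, 1e-10, 1.0) from the fix above.
    return -np.sum(labels * np.log(np.clip(probs, eps, 1.0)))


# Hypothetical saturated prediction: the model is fully confident and correct,
# so the probability of the wrong class is exactly 0.
labels = np.array([[1.0, 0.0]])
probs = np.array([[1.0, 0.0]])

naive = cross_entropy_naive(labels, probs)
clipped = cross_entropy_clipped(labels, probs)

print("naive:", naive)
print("clipped:", clipped)
```

This would also be consistent with the log above, where the crash appears only after the training accuracy is already close to 1.0: that is when outputs are most likely to saturate at exactly 0. Note that clipping also zeroes the gradient for values outside the range, which is a trade-off of this approach rather than a property of the original formula.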
