@Ueharamethod posted at 2021-07-28

Question about DeepLab v3+ with TensorFlow

Q&A

Python TensorFlow segmentation Xception DeepLab

What I want to solve

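This is my first question.
I am using DeepLab v3+ with TensorFlow-GPU to run semantic segmentation with the pretrained xception_65 model. During training, the two errors below occur.
My development environment is as follows: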

Windows 10.0.19042
anaconda3 4.10.3
Tensorflow-gpu 1.15.0
Tensorflow 1.15.0
Tensorflow-estimator 1.15.0
python 3.6.9
keras 2.3.1
CUDA 10.0
cuDNN 7.6.2.24
NVIDIA RTX 3080 Ti

All of my own images are 720×480 JPEG files.

I am following this article:
https://qiita.com/mucchyo/items/d21993abee5e6e44efad

Problem and errors

The terminal output is very long, so this is an excerpt.

tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Nan in summary histogram for: xception_65/entry_flow/block3/unit_1/xception_module/separable_conv2_pointwise/BatchNorm/gamma_1
         [[node xception_65/entry_flow/block3/unit_1/xception_module/separable_conv2_pointwise/BatchNorm/gamma_1 (defined at C:\Users\owner\anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]
         [[xception_65/middle_flow/block1/unit_2/xception_module/separable_conv2_pointwise/BatchNorm/moving_variance/read/_467]]
  (1) Invalid argument: Nan in summary histogram for: xception_65/entry_flow/block3/unit_1/xception_module/separable_conv2_pointwise/BatchNorm/gamma_1
         [[node xception_65/entry_flow/block3/unit_1/xception_module/separable_conv2_pointwise/BatchNorm/gamma_1 (defined at C:\Users\owner\anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

The error above appears eight times.

I0728 13:55:28.723722 13524 learning.py:783] Caught OutOfRangeError. Stopping Training. 2 root error(s) found.
  (0) Out of range: FIFOQueue '_2_prefetch_queue/fifo_queue' is closed and has insufficient elements (requested 1, current size 0)
         [[node fifo_queue_Dequeue (defined at C:\Users\owner\anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]
         [[fifo_queue_Dequeue/_1479]]
  (1) Out of range: FIFOQueue '_2_prefetch_queue/fifo_queue' is closed and has insufficient elements (requested 1, current size 0)
         [[node fifo_queue_Dequeue (defined at C:\Users\owner\anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

The error above appears twice.

Note that these errors appear right at the start of training, not partway through.

Also, even though training fails, a message along the lines of "Finished training, saving model to disk" appears partway through the output.

The command I am running is:

python train.py --logtostderr --training_number_of_steps=20000 --train_split="train" --model_variant="xception_65" --atrous_rates=6 --atrous_rates=12 --atrous_rates=18 --output_stride=16 --decoder_output_stride=4 --train_crop_size=481 --train_crop_size=481 --train_batch_size=1 --dataset="pascal_voc_seg" --tf_initial_checkpoint="./datasets/pascal_voc_seg/init_models/deeplabv3_pascal_train_aug/model.ckpt" --train_logdir="./datasets/pascal_voc_seg/exp/train_on_trainval_set/train" --dataset_dir="./datasets/pascal_voc_seg/tfrecord" --fine_tune_batch_norm=false --initialize_last_layer=false --last_layers_contain_logits_only=true
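The OutOfRangeError says the prefetch queue was closed while still empty ("current size 0"). One possible reason is that the reader found no usable examples, so a quick sanity check is to count the records in the TFRecord shards under `--dataset_dir`. The minimal pure-Python sketch below walks the TFRecord framing (uint64 length, uint32 length CRC, payload, uint32 payload CRC) without verifying the CRCs. The function names are hypothetical, and the glob `train-*` is an assumption about how the shards are named. Inside TensorFlow 1.15, `tf.compat.v1.io.tf_record_iterator` can be used for the same check.

```python
import glob
import struct


def count_records_in_stream(f):
    """Count records in an open binary TFRecord stream without decoding them.

    Each record is laid out as: uint64 length (little-endian),
    uint32 masked CRC of the length, `length` payload bytes,
    uint32 masked CRC of the payload. CRCs are skipped, not verified.
    """
    count = 0
    while True:
        header = f.read(8)
        if not header:
            break  # clean end of file
        if len(header) < 8:
            raise ValueError("truncated record header")
        (length,) = struct.unpack("<Q", header)
        # Skip the length CRC, the payload, and the payload CRC.
        body = f.read(4 + length + 4)
        if len(body) < 4 + length + 4:
            raise ValueError("truncated record body")
        count += 1
    return count


def count_tfrecords(pattern):
    """Return {path: record_count} for every file matching the glob pattern."""
    counts = {}
    for path in sorted(glob.glob(pattern)):
        with open(path, "rb") as f:
            counts[path] = count_records_in_stream(f)
    return counts


if __name__ == "__main__":
    # Hypothetical pattern: adjust it to the actual shard names in --dataset_dir.
    pattern = "./datasets/pascal_voc_seg/tfrecord/train-*"
    counts = count_tfrecords(pattern)
    if not counts:
        print("No files match", pattern)
    for path, n in counts.items():
        print(path, n)
```

If no files match, or every shard reports 0 records, the input pipeline has nothing to dequeue.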

What I tried

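While searching the TensorFlow community and similar places, I saw that this InvalidArgumentError can occur when the learning rate is too high, so I tried an extremely small learning rate such as 0.0000000001. The error did not change at all.
Reference: https://stackoverflow.com/questions/39854390/nan-in-summary-histogram#comment84708614_48355568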
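For context on the learning-rate experiment, DeepLab's train.py uses a "poly" schedule by default. This sketch assumes the default flags `--learning_policy=poly`, `--base_learning_rate`, and `--learning_power=0.9`, with decay over `--training_number_of_steps` (20000 in the command above), and it ignores any warm-up or slow-start options. The hypothetical `poly_learning_rate` helper reproduces the formula of `tf.compat.v1.train.polynomial_decay` with `cycle=False`. It shows that the rate at step 0 equals the base rate and that lowering the base rate scales every step proportionally.

```python
def poly_learning_rate(base_lr, step, decay_steps, power=0.9, end_lr=0.0):
    """Polynomial decay without cycling:
    (base_lr - end_lr) * (1 - step / decay_steps) ** power + end_lr
    """
    step = min(step, decay_steps)  # clamp once the schedule has finished
    return (base_lr - end_lr) * (1.0 - step / decay_steps) ** power + end_lr


if __name__ == "__main__":
    for s in (0, 10000, 20000):
        print(s, poly_learning_rate(1e-4, s, 20000))
```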

I also tried setting the crop size to 720×480, but the error did not change.
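On the crop-size experiment: DeepLab's examples use crop sizes of the form output_stride × k + 1 (for example 513 = 16 × 32 + 1, and the 481 in the command above = 16 × 30 + 1), whereas 720 and 480 are exact multiples of 16. The hypothetical helper below computes the smallest such value that is at least a given image dimension. It assumes the images are 720 pixels wide and 480 high and that `--train_crop_size` is given as height, then width.

```python
def deeplab_crop_size(size, output_stride=16):
    """Smallest value >= size of the form output_stride * k + 1."""
    k = -(-(size - 1) // output_stride)  # ceil((size - 1) / output_stride)
    return output_stride * k + 1


if __name__ == "__main__":
    height, width = 480, 720
    print(deeplab_crop_size(height), deeplab_crop_size(width))  # 481 721
```

Under those assumptions, a full-image crop would be passed as `--train_crop_size=481 --train_crop_size=721`.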

I have tried to explain everything in detail. Thank you in advance.
