Overview

Using the environment built in the previous post, I ran transfer learning (TL below) on an object detection model with data I annotated myself. These are my notes.

Date: April 2022
GPU: RTX3070
OS: Ubuntu20.04LTS
Anaconda: Python3.9.12
Tensorflow: 2.8.0

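Moving the data over from VoTT

The official tutorial gives the steps and code for converting data to TFRecord format, but that is quite tedious, so I used VoTT to export TFRecord files directly. The tutorial is still worth reading because it contains many useful hints.

I assume you already have data exported from VoTT in TFRecord format. The export contains the following two kinds of files.

File name	Description
0001.tfrecord ~ 0010.tfrecord	All annotation info for the boxes drawn on one jpg file each. This time I annotated 10 jpg files (about 10 boxes per file).
tf_label_map.pbtxt	Label info defined in VoTT for the classes to add

I use 8 of the tfrecord files for training and 2 for validation, so I placed them as follows.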

tf_env1/
└─ models/
   ├─ research/
      └─ object_detection/
         └─ my_data/     :new folder
            ├─ train/
               ├─ 0001.tfrecord...  :8 files in total
               └── ...
            ├─ val/
               ├─ ...
               └─ 0010.tfrecord...  :2 files in total
            ├─ save/     :for TL output
            └─ tf_label_map.pbtxt
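
The 8/2 split above can be done by hand, but a small script keeps it repeatable. The following is a minimal sketch, assuming the VoTT export sits in one flat folder. The function name split_tfrecords and the folder names passed to it are hypothetical, not part of VoTT or the Object Detection API.

```python
import shutil
from pathlib import Path


def split_tfrecords(export_dir, my_data_dir, n_val=2):
    """Copy VoTT-exported .tfrecord files into train/ and val/ subfolders.

    Files are sorted by name; the last n_val go to val/, the rest to train/.
    Returns (train_names, val_names).
    """
    export_dir, my_data_dir = Path(export_dir), Path(my_data_dir)
    records = sorted(export_dir.glob("*.tfrecord"))
    if n_val < 1 or len(records) <= n_val:
        raise ValueError("need n_val >= 1 and more than n_val .tfrecord files")

    train, val = records[:-n_val], records[-n_val:]
    for subdir, files in (("train", train), ("val", val)):
        target = my_data_dir / subdir
        target.mkdir(parents=True, exist_ok=True)
        for f in files:
            shutil.copy2(f, target / f.name)

    # The label map is shared by training and validation.
    label_map = export_dir / "tf_label_map.pbtxt"
    if label_map.exists():
        shutil.copy2(label_map, my_data_dir / label_map.name)

    return [f.name for f in train], [f.name for f in val]
```

With 10 exported files this puts 0001 to 0008 in train/ and 0009 and 0010 in val/, which matches the layout shown above. A random split would be the better choice with more data; a sorted split is used here only to keep the result predictable.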

Getting the object detection model for TL

The working directory is Documents/tf_env1, so cd there first.
For the TL work, install the packages that were missing from the previous environment setup.

pip install jupyterlab==3
pip install imageio

Start Jupyter Lab. To check the environment, you can optionally run the test script.

# !python models/research/object_detection/builders/model_builder_tf2_test.py

Download the model you want from the TensorFlow 2 Detection Model Zoo. Here I chose ssd_resnet50_v1.

!wget http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8.tar.gz
!tar -xf ssd_resnet50_v1_fpn_640x640_coco17_tpu-8.tar.gz

After extraction the layout looks like this.

tf_env1/
└─ ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/
   ├─ checkpoint/
   |  ├─ ckpt-0.index       # used for TL
   |  └── ...
   ├─ saved_model/
   └── pipeline.config      # used for TL
└─ models/
   └── ...
└─ cocoapi/
   └── ...

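Just in case, make a backup copy of pipeline.config.

TL procedure

TL is run by executing models/research/object_detection/model_main_tf2.py. The script reads pipeline.config, so edit the relevant parts of that file. Since I had a backup, I overwrote the original with my changes.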

model {
  ssd {
    num_classes: 4    # 1
  # ... (omitted) ...
train_config {
  batch_size: 4    # 2
  # ... (omitted) ...
  fine_tune_checkpoint: "./ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint/ckpt-0"    # 3
  # ... (omitted) ...
  fine_tune_checkpoint_type: "detection"    # 4
  # ... (omitted) ...
train_input_reader {
  label_map_path: "./models/research/object_detection/my_data/tf_label_map.pbtxt"    # 5
  tf_record_input_reader {
    input_path: "./models/research/object_detection/my_data/train/????.tfrecord"    # 6
  # ... (omitted) ...
eval_input_reader {
  label_map_path: "./models/research/object_detection/my_data/tf_label_map.pbtxt"    # 5
  # ... (omitted) ...
  tf_record_input_reader {
    input_path: "./models/research/object_detection/my_data/val/????.tfrecord"    # 7

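No.	Description
1	Number of labels (classes) to train on in TL
2	Batch size, conventionally a power of two
3	Checkpoint of the downloaded source model. Do not include the .index extension.
4	Originally 'classification'; change it to 'detection'
5	Label info file exported from VoTT
6	TFRecord files for training. The file names are 4-digit numbers, hence ????
7	TFRecord files for validation. The file names are 4-digit numbers, hence ????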
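
num_classes (No. 1) has to agree with the label map referenced in No. 5. A quick consistency check can be sketched as below, assuming the label map uses the usual `item { id: ... name: ... }` text layout. parse_label_map is a hypothetical helper built on regular expressions, not the Object Detection API's own parser, and the sample labels are made up for illustration.

```python
import re

# Patterns for the plain-text label map layout (assumed, see lead-in).
ITEM_RE = re.compile(r"item\s*\{(.*?)\}", re.DOTALL)
ID_RE = re.compile(r"\bid\s*:\s*(\d+)")
NAME_RE = re.compile(r"\bname\s*:\s*['\"]([^'\"]*)['\"]")


def parse_label_map(text):
    """Return {id: name} for every item block in a label map .pbtxt string."""
    labels = {}
    for body in ITEM_RE.findall(text):
        id_match, name_match = ID_RE.search(body), NAME_RE.search(body)
        if id_match and name_match:
            labels[int(id_match.group(1))] = name_match.group(1)
    return labels


# Made-up sample; in practice read tf_label_map.pbtxt from disk instead.
sample = """
item {
  id: 1
  name: 'cat'
}
item {
  id: 2
  name: 'dog'
}
"""
labels = parse_label_map(sample)
num_classes = len(labels)  # the value to put in pipeline.config
print(num_classes, labels)
```

If the printed count differs from num_classes in pipeline.config, one of the two files is out of date.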

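The command-line arguments passed to the script are listed in the table below.
There are more arguments than these; see the model_main_tf2.py source for details.

Argument	Description
pipeline_config_path	The pipeline.config file edited above
model_dir	Output directory for results during TL. Written every 1000 iterations by default (configurable with checkpoint_every_n)
num_train_steps	Number of iterations to run. Early stopping cannot be specified
trained_checkpoint_dir	Checkpoint of the source model. For some reason it only works as 'ckpt-0'; presumably this is 'ckpt-0.index' with the extension removed

This time the command is as follows, so run it.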

!python ./models/research/object_detection/model_main_tf2.py \
    --pipeline_config_path="./ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/pipeline.config" \
    --model_dir="./models/research/object_detection/my_data/save" \
    --alsologtostderr \
    --num_train_steps=10000 \
    --trained_checkpoint_dir="./ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint/ckpt-0"

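Once training starts successfully, the GPU fans spin up, and output like the following appears until the specified num_train_steps (10000) is reached.
By the way, with batch size 8 I got an error during TL on my machine. I also got errors when working with multiple monitors, so an 8 GB GPU may be a tight fit. The model has many parameters, and my impression is that it is sensitive to memory size.
Two GPU strategies can be specified, so on a machine with several GPUs installed, or with several GPU workstations, it should be possible to increase the available memory accordingly.

During training, watch Loss/total_loss. Apparently progress can also be monitored in TensorBoard, but I did not try that this time.
I could not tell whether this is the training loss or the validation loss.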

INFO:tensorflow:Step 100 per-step time 0.509s
I0424 13:54:13.544738 140704737298240 model_lib_v2.py:705] Step 100 per-step time 0.509s
INFO:tensorflow:{'Loss/classification_loss': 0.61247027,
 'Loss/localization_loss': 0.28274173,
 'Loss/regularization_loss': 0.26708964,
 'Loss/total_loss': 1.1623015,
 'learning_rate': 0.014666351}
[...]
INFO:tensorflow:Step 8200 per-step time 0.305s
I0424 14:35:28.173985 140704737298240 model_lib_v2.py:705] Step 8200 per-step time 0.305s
INFO:tensorflow:{'Loss/classification_loss': 0.09592646,
 'Loss/localization_loss': 0.016893983,
 'Loss/regularization_loss': 0.11835801,
 'Loss/total_loss': 0.23117846,
 'learning_rate': 0.03324672}
I0424 14:35:28.174188 140704737298240 model_lib_v2.py:708] {'Loss/classification_loss': 0.09592646,
 'Loss/localization_loss': 0.016893983,
 'Loss/regularization_loss': 0.11835801,
 'Loss/total_loss': 0.23117846,
 'learning_rate': 0.03324672}
[...]
INFO:tensorflow:Step 10000 per-step time 0.303s
I0424 14:44:38.082305 140704737298240 model_lib_v2.py:705] Step 10000 per-step time 0.303s
INFO:tensorflow:{'Loss/classification_loss': 0.14718756,
 'Loss/localization_loss': 0.020133844,
 'Loss/regularization_loss': 0.12079448,
 'Loss/total_loss': 0.2881159,
 'learning_rate': 0.029201299}
I0424 14:44:38.082512 140704737298240 model_lib_v2.py:708] {'Loss/classification_loss': 0.14718756,
 'Loss/localization_loss': 0.020133844,
 'Loss/regularization_loss': 0.12079448,
 'Loss/total_loss': 0.2881159,
 'learning_rate': 0.029201299}

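According to the official tutorial mentioned above, total_loss should ideally fall below 1 (that big!?), but for now I looked for the step with the smallest total loss. Past 8000 iterations it dropped to the low 0.2s, but it fluctuates a lot because the batch size is small. The learning rate and other settings probably need tuning as well.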
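
Finding the step with the smallest total loss by eye gets tedious over 10000 steps. Below is a minimal sketch that scans the training output for it, assuming the output was saved as text and has the format shown above. The function names are hypothetical, and the regular expressions only cover the `INFO:tensorflow:` lines seen in this log.

```python
import re

STEP_RE = re.compile(r"INFO:tensorflow:Step (\d+) per-step time")
LOSS_RE = re.compile(r"'Loss/total_loss': ([0-9.eE+-]+)")


def total_loss_by_step(log_text):
    """Return {step: total_loss} parsed from the training log text."""
    losses = {}
    step = None
    for line in log_text.splitlines():
        step_match = STEP_RE.search(line)
        if step_match:
            step = int(step_match.group(1))
            continue
        loss_match = LOSS_RE.search(line)
        # Each loss dict is logged twice; keep only the first value per step.
        if loss_match and step is not None and step not in losses:
            losses[step] = float(loss_match.group(1))
    return losses


def best_step(log_text):
    """Return (step, total_loss) for the smallest logged total_loss."""
    losses = total_loss_by_step(log_text)
    return min(losses.items(), key=lambda item: item[1])


# Short excerpt in the same format as the log above.
excerpt = """INFO:tensorflow:Step 100 per-step time 0.509s
INFO:tensorflow:{'Loss/classification_loss': 0.61247027,
 'Loss/total_loss': 1.1623015,
 'learning_rate': 0.014666351}
INFO:tensorflow:Step 8200 per-step time 0.305s
INFO:tensorflow:{'Loss/classification_loss': 0.09592646,
 'Loss/total_loss': 0.23117846,
 'learning_rate': 0.03324672}
"""
print(best_step(excerpt))
```

Since the loss is only logged every 100 steps and checkpoints are written less often than that, the best logged step will usually not coincide with a saved checkpoint. The nearest checkpoint is the practical choice.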

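Results are written to the path given by model_dir. Only ckpt-5.index and later remained. Judging by file creation time, ckpt-11.index is the last one.
I am still looking into why ckpt-0 to 4 are missing. Either they were never written, or they were written and then deleted so that only the last 7 are kept (ckpt-5 to ckpt-11 is exactly 7 checkpoints).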

tf_env1/
└─ ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/
   └── ...
└─ models/
   ├─ research/
      └─ object_detection/
         └─ my_data/     :new folder
            ├─ train/
            ├─ val/
            ├─ save/     :contents are created automatically during TL
            |  ├─ train/
            |  ├─ ckpt-5.index
            |  ├─ ...
            |  ├─ ckpt-11.index
            |  └─ ...
            └─ ...
└─ cocoapi/
   └── ...
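
The checkpoint numbering above can be reproduced with a little arithmetic. The sketch below rests on three assumptions inferred from the output rather than confirmed from the source code: a first checkpoint is written at step 0, one more follows every checkpoint_every_n steps, and only the newest 7 are kept. surviving_checkpoints is a hypothetical helper, not part of the Object Detection API.

```python
def surviving_checkpoints(num_train_steps, checkpoint_every_n=1000, max_to_keep=7):
    """Map checkpoint number -> training step for the checkpoints left on disk.

    Assumes ckpt-1 is saved at step 0, then one checkpoint every
    checkpoint_every_n steps, with only the newest max_to_keep retained.
    """
    steps = range(0, num_train_steps + 1, checkpoint_every_n)
    numbered = {number: step for number, step in enumerate(steps, start=1)}
    kept = sorted(numbered)[-max_to_keep:]
    return {number: numbered[number] for number in kept}


for number, step in surviving_checkpoints(10000).items():
    print(f"ckpt-{number}: step {step}")
```

Under these assumptions a 10000-step run leaves ckpt-5 to ckpt-11, which matches what was found in save/. It also means ckpt-9 (step 8000) and ckpt-10 (step 9000) would be the ones closest to the low-loss region after 8000 iterations.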

Running detection with the model after TL

For this part I followed the reference below. Please see it for details.

import matplotlib
import matplotlib.pyplot as plt
import os
import io
import scipy.misc
import numpy as np
from six import BytesIO
from PIL import Image, ImageDraw, ImageFont
import tensorflow as tf
from object_detection.utils import label_map_util
from object_detection.utils import config_util
from object_detection.utils import visualization_utils as viz_utils
from object_detection.builders import model_builder
%matplotlib inline

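Set pipeline_config to the config file edited at the beginning.
Set model_dir to one of the ckpt files written during TL. Normally you would pick the ckpt where the loss dipped, but as mentioned above the loss fluctuates so much that I could not single one out.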

pipeline_config = "./ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/pipeline.config"
# Path to the checkpoint (prefix only, without the .index extension)
model_dir = "./models/research/object_detection/my_data/save/ckpt-8"

# Load the model configuration
configs = config_util.get_configs_from_pipeline_file(pipeline_config)
model_config = configs['model']

# Build the model from the loaded configuration
detection_model = model_builder.build(model_config=model_config, is_training=False)

# Restore the weights from the checkpoint
ckpt = tf.compat.v2.train.Checkpoint(model=detection_model)
ckpt.restore(model_dir).expect_partial()  # model_dir selects which checkpoint number to load

Set label_map_path to the label info file exported by VoTT.

label_map_path = "./models/research/object_detection/my_data/tf_label_map.pbtxt"
label_map = label_map_util.load_labelmap(label_map_path)
categories = label_map_util.convert_label_map_to_categories(
    label_map,
    max_num_classes=label_map_util.get_max_label_map_index(label_map),
    use_display_name=True)
category_index = label_map_util.create_category_index(categories)
label_map_dict = label_map_util.get_label_map_dict(label_map, use_display_name=True)

def load_image_into_numpy_array(path):
  """Load an image file into a NumPy array.

  Converts the image to a NumPy array so it can be fed to the TensorFlow graph.
  By convention the array has shape (height, width, color channels).

  Args:
    path: path to the image file.

  Returns:
    uint8 numpy array with shape (height, width, 3).
  """
  img_data = tf.io.gfile.GFile(path, 'rb').read()
  image = Image.open(BytesIO(img_data))
  (im_width, im_height) = image.size
  return np.array(image.getdata()).reshape(
      (im_height, im_width, 3)).astype(np.uint8)

def get_keypoint_tuples(eval_config):
  """Return a tuple list of keypoint edges from the eval config.

  Args:
    eval_config: an eval config containing the keypoint edges

  Returns:
    a list of edge tuples, each in the format (start, end)
  """
  tuple_list = []
  kp_list = eval_config.keypoint_edge
  for edge in kp_list:
    tuple_list.append((edge.start, edge.end))
  return tuple_list

Use image_dir and image_path to specify the image file to run detection on.

image_dir = './data/source'
image_path = os.path.join(image_dir, '0009.jpg')
image_np = load_image_into_numpy_array(image_path)

# Things to try:
# Flip horizontally
# image_np = np.fliplr(image_np).copy()

# Convert image to grayscale
# image_np = np.tile(
#     np.mean(image_np, 2, keepdims=True), (1, 1, 3)).astype(np.uint8)

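# detect_fn is not defined anywhere in the listing above. The wrapper below is
# an assumed definition following the usual Object Detection API pattern
# (preprocess -> predict -> postprocess) on the restored detection_model.
@tf.function
def detect_fn(image):
  """Run detection on a batch of images."""
  image, shapes = detection_model.preprocess(image)
  prediction_dict = detection_model.predict(image, shapes)
  detections = detection_model.postprocess(prediction_dict, shapes)
  return detections, prediction_dict, tf.reshape(shapes, [-1])

input_tensor = tf.convert_to_tensor(
    np.expand_dims(image_np, 0), dtype=tf.float32)
detections, predictions_dict, shapes = detect_fn(input_tensor)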

label_id_offset = 1
image_np_with_detections = image_np.copy()

# Use keypoints if available in detections
keypoints, keypoint_scores = None, None
if 'detection_keypoints' in detections:
  keypoints = detections['detection_keypoints'][0].numpy()
  keypoint_scores = detections['detection_keypoint_scores'][0].numpy()

viz_utils.visualize_boxes_and_labels_on_image_array(
      image_np_with_detections,
      detections['detection_boxes'][0].numpy(),
      (detections['detection_classes'][0].numpy() + label_id_offset).astype(int),
      detections['detection_scores'][0].numpy(),
      category_index,
      use_normalized_coordinates=True,
      max_boxes_to_draw=200,
      min_score_thresh=.15,
      agnostic_mode=False,
      keypoints=keypoints,
      keypoint_scores=keypoint_scores,
      keypoint_edges=get_keypoint_tuples(configs['eval_config']))

plt.figure(figsize=(24,30))
plt.imshow(image_np_with_detections)
plt.show()
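
The plot is enough for a visual check, but the raw detections are often wanted as numbers too. The following is a minimal sketch in plain Python of what the score threshold, the label_id_offset and the normalized coordinates mean. to_pixel_boxes is a hypothetical helper, and it assumes boxes come as normalized [ymin, xmin, ymax, xmax] and that category_index maps a 1-based id to a dict with a 'name' key.

```python
def to_pixel_boxes(boxes, classes, scores, image_size, category_index,
                   min_score=0.15, label_id_offset=1):
    """Filter detections by score and convert boxes to pixel coordinates.

    boxes:      list of [ymin, xmin, ymax, xmax], normalized to 0..1
    classes:    list of 0-based class indices from the model
    scores:     list of confidence scores
    image_size: (width, height) in pixels
    Returns a list of dicts with name, score and (left, top, right, bottom).
    """
    width, height = image_size
    results = []
    for (ymin, xmin, ymax, xmax), cls, score in zip(boxes, classes, scores):
        if score < min_score:
            continue
        class_id = int(cls) + label_id_offset  # label map ids start at 1
        name = category_index.get(class_id, {}).get("name", "unknown")
        results.append({
            "name": name,
            "score": float(score),
            "box_px": (round(xmin * width), round(ymin * height),
                       round(xmax * width), round(ymax * height)),
        })
    return results
```

With the variables from the code above, the inputs would be `detections['detection_boxes'][0].numpy().tolist()`, the matching `detection_classes` and `detection_scores` entries, and `(image_np.shape[1], image_np.shape[0])` as image_size.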

That's all.
