
Running auto-annotation on video files with OpenCV's "CVAT" tool and generating a TFRecord-format dataset

Last updated at 2019-05-21, posted at 2019-05-20

１．Introduction

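How is everyone enjoying the annotation festival? Personally, I hate it so much it makes me sick.

This time I take on auto-annotation. You still have to convert the video to still images first, but this is a test of a dream-like tool: feed it a large batch of images in one go and the annotation completes automatically. Apparently VATIC, the tool from my previous article, has been adopted as an official OpenCV tool and has received major feature additions and improvements along the way.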

２．Environment

Ubuntu 16.04 (the officially recommended environment is Ubuntu 18.04)
CUDA 9.0
cuDNN 7.2
Docker
- Client:
  - Version: 18.09.6
  - API version: 1.39
  - Go version: go1.10.8
  - Git commit: 481bc77
  - Built: Sat May 4 02:35:27 2019
  - OS/Arch: linux/amd64
  - Experimental: false
- Server: Docker Engine - Community
  - Engine:
    - Version: 18.09.6
    - API version: 1.39 (minimum version 1.12)
    - Go version: go1.10.8
    - Git commit: 481bc77
    - Built: Sat May 4 01:59:36 2019
    - OS/Arch: linux/amd64
    - Experimental: false
NVIDIA Docker 2.0.3
Google Chrome

３．Procedure

３−１．Convert video file to still image

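First, convert the video file to still images with a tool called ffmpeg. CVAT can take a video directly, but if the final goal is to convert the annotation data to TFRecord format, at the time of writing the video has to be converted to still images beforehand. The rest of this article assumes that all work starts from the home directory.

The commands below assume a folder named Videos directly under the home directory, containing a video file named FreestyleFootball.mp4. I referred to the おちゃカメラ。 article summarizing ffmpeg usage and its command list (video resizing, still-image conversion, and frame interpolation). Thank you.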

.mp4_convert_to_.jpeg

cd ~
sudo apt install -y ffmpeg
mkdir -p Videos/img

ffmpeg \
-i Videos/FreestyleFootball.mp4 \
-ss 0 \
-t 30 \
-f image2 \
-vcodec mjpeg \
-qscale 1 -qmin 1 -qmax 1 \
-r 20 \
Videos/img/%06d.jpg

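Option	Description
-i	Input file
-ss	Start position (seconds) in the video from which to extract still images
-t	Length (seconds) of video to convert to still images
-f	Output format
-vcodec	Codec. Motion JPEG=mjpeg, PNG=png
-qscale	JPEG image quality (lower values mean higher quality)
-r	Number of images to extract per second (frame rate)
%06d.jpg	Output file name pattern. %06d.jpg produces sequentially numbered image files with six-digit names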

Execution_log

ffmpeg version 2.8.15-0ubuntu0.16.04.1 Copyright (c) 2000-2018 the FFmpeg developers
  built with gcc 5.4.0 (Ubuntu 5.4.0-6ubuntu1~16.04.10) 20160609
  configuration: --prefix=/usr --extra-version=0ubuntu0.16.04.1 --build-suffix=-ffmpeg --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --cc=cc --cxx=g++ --enable-gpl --enable-shared --disable-stripping --disable-decoder=libopenjpeg --disable-decoder=libschroedinger --enable-avresample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libmodplug --enable-libmp3lame --enable-libopenjpeg --enable-libopus --enable-libpulse --enable-librtmp --enable-libschroedinger --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxvid --enable-libzvbi --enable-openal --enable-opengl --enable-x11grab --enable-libdc1394 --enable-libiec61883 --enable-libzmq --enable-frei0r --enable-libx264 --enable-libopencv
  libavutil      54. 31.100 / 54. 31.100
  libavcodec     56. 60.100 / 56. 60.100
  libavformat    56. 40.101 / 56. 40.101
  libavdevice    56.  4.100 / 56.  4.100
  libavfilter     5. 40.101 /  5. 40.101
  libavresample   2.  1.  0 /  2.  1.  0
  libswscale      3.  1.101 /  3.  1.101
  libswresample   1.  2.101 /  1.  2.101
  libpostproc    53.  3.100 / 53.  3.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'git/Videos/FreestyleFootball.mp4':
  Metadata:
    major_brand     : mp42
    minor_version   : 0
    compatible_brands: isommp42
    creation_time   : 2016-08-14 07:51:03
  Duration: 00:03:49.41, start: 0.000000, bitrate: 2179 kb/s
    Stream #0:0(und): Video: h264 (Main) (avc1 / 0x31637661), yuv420p, 1280x720 [SAR 1:1 DAR 16:9], 2050 kb/s, 25 fps, 25 tbr, 90k tbn, 50 tbc (default)
    Metadata:
      creation_time   : 2016-08-14 07:51:03
      handler_name    : ISO Media file produced by Google Inc.
    Stream #0:1(und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 125 kb/s (default)
    Metadata:
      creation_time   : 2016-08-14 07:51:03
      handler_name    : ISO Media file produced by Google Inc.
Please use -q:a or -q:v, -qscale is ambiguous
[swscaler @ 0x126f0e0] deprecated pixel format used, make sure you did set range correctly
Output #0, image2, to 'git/Videos/img/%06d.jpg':
  Metadata:
    major_brand     : mp42
    minor_version   : 0
    compatible_brands: isommp42
    encoder         : Lavf56.40.101
    Stream #0:0(und): Video: mjpeg, yuvj420p(pc), 1280x720 [SAR 1:1 DAR 16:9], q=1-1, 200 kb/s, 20 fps, 20 tbn, 20 tbc (default)
    Metadata:
      creation_time   : 2016-08-14 07:51:03
      handler_name    : ISO Media file produced by Google Inc.
      encoder         : Lavc56.60.100 mjpeg
Stream mapping:
  Stream #0:0 -> #0:0 (h264 (native) -> mjpeg (native))
Press [q] to stop, [?] for help
Past duration 0.799995 too large
Past duration 0.999992 too large
frame=  600 fps=325 q=1.0 Lsize=N/A time=00:00:30.00 bitrate=N/A dup=0 drop=148    
video:122679kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown

600 JPEG images were generated under the img folder. 30 seconds x 20 frames per second = 600 images.
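The arithmetic above, and the same ffmpeg invocation, can be sketched in Python. This is a minimal sketch, not part of the original procedure: the helper names `expected_frame_count` and `build_ffmpeg_args` are hypothetical, and the paths are the ones assumed in this article. The command is only assembled here, not executed.

```python
from pathlib import Path


def expected_frame_count(duration_sec, fps):
    """Number of stills ffmpeg should write for -t duration_sec and -r fps."""
    return int(round(duration_sec * fps))


def build_ffmpeg_args(video, out_dir, start_sec=0, duration_sec=30, fps=20):
    """Assemble the ffmpeg command line used above as an argument list."""
    return [
        "ffmpeg",
        "-i", str(video),
        "-ss", str(start_sec),
        "-t", str(duration_sec),
        "-f", "image2",
        "-vcodec", "mjpeg",
        "-qscale", "1", "-qmin", "1", "-qmax", "1",
        "-r", str(fps),
        str(Path(out_dir) / "%06d.jpg"),
    ]


if __name__ == "__main__":
    home = Path.home()
    args = build_ffmpeg_args(home / "Videos" / "FreestyleFootball.mp4",
                             home / "Videos" / "img")
    print(" ".join(args))
    # 30 seconds at 20 frames per second
    print(expected_frame_count(30, 20))
```

To actually run the conversion from Python, the returned list could be passed to `subprocess.run(args, check=True)`. Comparing the number of files in Videos/img with `expected_frame_count` is a quick sanity check before uploading to CVAT.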

３−２．Constructing a CVAT execution environment

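To keep this procedure from breaking when the official repository is updated, I use a fork in my own repository. Those of you who follow me on Twitter may already know what happened after I ran the commands below: perhaps because of a poor match with my laptop, shutdown, reboot, and hard reset all became temporarily impossible. After letting the laptop battery drain to zero and powering it on again, it recovered normally.
The officially recommended environment is Ubuntu 18.04 (x86_64/amd64), but my laptop runs Ubuntu 16.04 (x86_64), so the OS may also be part of the problem. In any case, I took no special measures, and it now works normally without any trouble.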

Clone_CVAT_repository

git clone https://github.com/PINTO0309/cvat.git
cd cvat

sudo apt-get update
sudo apt-get install -y \
apt-transport-https ca-certificates curl gnupg-agent software-properties-common

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

sudo add-apt-repository \
"deb [arch=amd64] https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) stable"

sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io

sudo groupadd docker
sudo usermod -aG docker $USER
sudo apt-get install -y python3-pip
sudo -H pip3 install docker-compose

docker-compose \
-f docker-compose.yml \
-f components/cuda/docker-compose.cuda.yml \
-f components/openvino/docker-compose.openvino.yml \
-f components/tf_annotation/docker-compose.tf_annotation.yml up -d --build

docker exec -it cvat bash -ic 'python3 ~/manage.py createsuperuser'

３−３．Execution of automatic annotation

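Launch a browser and enter http://localhost:8080 in the address bar.
A refreshingly plain portal comes up, as shown in the figure below.

Click the Create New Task button and a dialog appears, as shown in the figure below. Enter the minimum required information.

I want to annotate people and balls, so I enter person and sports_ball in the Labels field, separated by a half-width (ASCII) space. When auto-annotating with Tensorflow, the label names accepted in the Labels field are limited to the ones listed below. Without Tensorflow auto-annotation, you can freely enter any number of labels of your own.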

Label_name_that_can_be_specified_when_performing_auto_annotation

'surfboard', 'car', 'skateboard', 'boat', 'clock', 'cat', 'cow', 'knife',
'apple', 'cup', 'tv', 'baseball_bat', 'book', 'suitcase', 'tennis_racket',
'stop_sign', 'couch', 'cell_phone', 'keyboard', 'cake', 'tie', 'frisbee',
'truck', 'fire_hydrant', 'snowboard', 'bed', 'vase', 'teddy_bear',
'toaster', 'wine_glass', 'traffic_light', 'broccoli', 'backpack', 'carrot',
'potted_plant', 'donut', 'umbrella', 'parking_meter', 'bottle', 'sandwich',
'motorcycle', 'bear', 'banana', 'person', 'scissors', 'elephant',
'dining_table', 'toothbrush', 'toilet', 'skis', 'bowl', 'sheep',
'refrigerator', 'oven', 'microwave', 'train', 'orange', 'mouse', 'laptop',
'bench', 'bicycle', 'fork', 'kite', 'zebra', 'baseball_glove', 'bus',
'spoon', 'horse', 'handbag', 'pizza', 'sports_ball', 'airplane',
'hair_drier', 'hot_dog', 'remote', 'sink', 'dog', 'bird', 'giraffe', 'chair'.
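Before submitting a task, the Labels field can be checked against the list above. The following is a minimal sketch, not part of CVAT: `SUPPORTED_LABELS` is copied from the list above, and `unsupported_labels` is a hypothetical helper name.

```python
# Label names copied from the list above (80 entries).
SUPPORTED_LABELS = frozenset([
    'surfboard', 'car', 'skateboard', 'boat', 'clock', 'cat', 'cow', 'knife',
    'apple', 'cup', 'tv', 'baseball_bat', 'book', 'suitcase', 'tennis_racket',
    'stop_sign', 'couch', 'cell_phone', 'keyboard', 'cake', 'tie', 'frisbee',
    'truck', 'fire_hydrant', 'snowboard', 'bed', 'vase', 'teddy_bear',
    'toaster', 'wine_glass', 'traffic_light', 'broccoli', 'backpack', 'carrot',
    'potted_plant', 'donut', 'umbrella', 'parking_meter', 'bottle', 'sandwich',
    'motorcycle', 'bear', 'banana', 'person', 'scissors', 'elephant',
    'dining_table', 'toothbrush', 'toilet', 'skis', 'bowl', 'sheep',
    'refrigerator', 'oven', 'microwave', 'train', 'orange', 'mouse', 'laptop',
    'bench', 'bicycle', 'fork', 'kite', 'zebra', 'baseball_glove', 'bus',
    'spoon', 'horse', 'handbag', 'pizza', 'sports_ball', 'airplane',
    'hair_drier', 'hot_dog', 'remote', 'sink', 'dog', 'bird', 'giraffe', 'chair',
])


def unsupported_labels(labels_field):
    """Return the space-separated labels that are not in SUPPORTED_LABELS."""
    return [name for name in labels_field.split() if name not in SUPPORTED_LABELS]


if __name__ == "__main__":
    # The Labels value used in this article; an empty list means all are usable.
    print(unsupported_labels("person sports_ball"))
    # "ball" is not in the list, so it is reported.
    print(unsupported_labels("person ball"))
```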

Click the Select Files button and select all of the still images created in ３−１．Convert video file to still image. In my case I had generated 600 JPEG stills, so I selected and opened all 600.

Click the Submit button.

Click the Run TF Annotation button shown in the figure below.

A warning message appears; click Ok.

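After a short wait the button label changes to Cancel TF Annotation [0%], and the progress percentage is updated in real time. Perhaps because a high-accuracy model is used, this takes quite a long time, so be patient.
For reference, annotating a 4-minute video of 5734 frames took 2 hours in my environment.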
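The timing above works out to roughly 1.26 seconds per frame. The sketch below uses that figure to estimate other job sizes. It assumes processing time scales linearly with the frame count on the same hardware, which the article does not verify, and the function names are hypothetical.

```python
def seconds_per_frame(frames, elapsed_sec):
    """Average processing time per frame."""
    return elapsed_sec / frames


def estimate_minutes(frames, sec_per_frame):
    """Estimated wall-clock minutes for a job, assuming linear scaling."""
    return frames * sec_per_frame / 60.0


if __name__ == "__main__":
    # Measured in this article: 5734 frames in 2 hours.
    rate = seconds_per_frame(5734, 2 * 60 * 60)
    print("{:.2f} s/frame".format(rate))
    # Rough estimate for the 600 stills generated in section 3-1.
    print("{:.1f} min for 600 frames".format(estimate_minutes(600, rate)))
```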

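Once processing finishes, click the http://localhost:8080/?id=x link to check the annotation results. So, how did it turn out...

Even though this is fully machine-driven auto-annotation, the accuracy is almost too good, lol.
Shocking.

Now click the browser's back arrow to return to the portal.

To export the annotated CVAT-format XML file, click the Dump Annotation button. An XML file named n_xxxx.xml should be downloaded, where n is the task ID and xxxx is the Task Name.

The figure below shows what the annotated CVAT-format XML file looks like. The key point is that <mode> is annotation. If it is not annotation, the task was created from a video file. In that case the data cannot be converted to TFRecord format, so start over from the first auto-annotation step.

Copy the n_xxxx.xml file to the top of the cvat folder, because it is used in the TFRecord conversion that follows.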
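The `<mode>` check can also be done programmatically. The following is a minimal sketch using only the Python standard library. `SAMPLE` is a hypothetical, heavily trimmed stand-in for a real dump (a real file carries far more metadata), and the code simply looks for the first `<mode>` element anywhere in the document rather than assuming its exact location.

```python
import xml.etree.ElementTree as ET


def annotation_mode(xml_text):
    """Return the text of the first <mode> element, or None if there is none."""
    root = ET.fromstring(xml_text)
    node = next(root.iter("mode"), None)
    if node is None or node.text is None:
        return None
    return node.text.strip()


# Hypothetical, trimmed-down example; not an actual CVAT dump.
SAMPLE = """<annotations>
  <meta><task><mode>annotation</mode></task></meta>
</annotations>"""

if __name__ == "__main__":
    mode = annotation_mode(SAMPLE)
    print(mode)
    if mode != "annotation":
        print("Not an image-based task; recreate the task from still images.")
```

For a downloaded file, read it first, for example with `open("n_xxxx.xml", encoding="utf-8").read()`, and pass the text to `annotation_mode`.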

Copy_XML_file_to_working_folder

cp ~/Downloads/n_xxxx.xml ~/cvat

３−４．Annotation data conversion from CVAT format to TFRecord format

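With that done, we borrow the power of the Tensorflow Object Detection API to convert from CVAT format to TFRecord format. As you will find if you try it, doing the rest of the work inside the CVAT Docker container gets very tedious. (sudo does not work, permissions are missing, and you trip over all sorts of other things that have nothing to do with the actual task.)
It is a matter of taste, but I recommend working on the host PC.

First, install the Tensorflow Object Detection API. I have already built three environments, in my previous article "Better late than never: using VATIC's automatic video tracking annotation all the way through conversion to TFRecord format [Docker edition]" and in two past articles, "I wanted to speed up the Edge TPU Accelerator even a little, so I transfer-learned MobileNetv2-SSD/MobileNetv1-SSD + MS-COCO on Pascal VOC and generated a .tflite (Docker edition, part 2)" and "I wanted to speed up the Edge TPU Accelerator even a little, so I transfer-learned MS-COCO on Pascal VOC and generated a .tflite (Google Colaboratory [GPU] edition, part 3)", so I do not need to repeat the work. Just in case, though, I reproduce the official procedure below with a few adjustments of my own. I have confirmed that it works in a real environment.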

Installation_procedure_of_"Tensorflow_Object_Detection_API"

sudo apt-get update
sudo apt-get install -y --no-install-recommends python3-pip python3-dev
sudo -H pip3 install -r requirements.txt
git clone https://github.com/tensorflow/models.git
cd models
sudo -H pip3 install --user Cython contextlib2 pillow lxml jupyter matplotlib
git clone https://github.com/cocodataset/cocoapi.git
cd cocoapi/PythonAPI
make
cd ../../research
cp -r ../cocoapi/PythonAPI/pycocotools .
protoc object_detection/protos/*.proto --python_out=.
export PYTHONPATH=$PYTHONPATH:`pwd`:`pwd`/slim

Now, at last, let's convert from CVAT format to TFRecord format (.tfrecord).
Run the commands below. (They start with cd ../.. to return to the cvat folder.)
Replace </path/to/cvat/xml>, </path/to/images>, and </path/to/output/directory> to match your own environment before running.

Convert_CVAT_format_to_TFRecord_format

cd ../..
mkdir outputs
sed -i "s%os.path.join(output_dir.absolute(),%os.path.join(str(output_dir.absolute()),%g" "utils/tfrecords/converter.py"
sed -i "s%os.path.join(output_dir,%os.path.join(str(output_dir),%g" "utils/tfrecords/converter.py"

python3 utils/tfrecords/converter.py \
--cvat-xml </path/to/cvat/xml> \
--image-dir </path/to/images> \
--output-dir </path/to/output/directory>

Convert_CVAT_format_to_TFRecord_format_sample

cd ../..
mkdir outputs
sed -i "s%os.path.join(output_dir.absolute(),%os.path.join(str(output_dir.absolute()),%g" "utils/tfrecords/converter.py"
sed -i "s%os.path.join(output_dir,%os.path.join(str(output_dir),%g" "utils/tfrecords/converter.py"

python3 utils/tfrecords/converter.py \
--cvat-xml n_xxxx.xml \
--image-dir ${HOME}/Videos/img \
--output-dir ./outputs

How_to_use_converter.py

usage: converter.py [-h] --cvat-xml FILE --image-dir DIRECTORY --output-dir
                    DIRECTORY [--train-percentage PERCENTAGE]
                    [--min-train NUM] [--attribute NAME]

Convert CVAT XML annotations to tfrecords format

optional arguments:
  -h, --help            show this help message and exit
  --cvat-xml FILE       input file with CVAT annotation in xml format
  --image-dir DIRECTORY
                        directory which contains original images
  --output-dir DIRECTORY
                        directory for output annotations in tfrecords format
  --train-percentage PERCENTAGE
                        the percentage of training data to total data
                        (default: 90)
  --min-train NUM       The minimum number of images above which the label is
                        considered (default: 10)
  --attribute NAME      The attribute name based on which the object can
                        identified

The training data file train.tfrecord and the validation data file eval.tfrecord were generated successfully. Training on a huge custom dataset, for example with a Tensorflow Lite pipeline, is now straightforward.
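One quick way to sanity-check the generated files is to count the records they contain. The sketch below reads the TFRecord framing directly (an 8-byte little-endian length, a 4-byte CRC of the length, the payload, and a 4-byte CRC of the payload) without importing TensorFlow. As a simplifying assumption it skips CRC verification, and `fake_record` is a hypothetical helper that only exists to build demo input with zeroed CRC fields.

```python
import io
import struct


def count_tfrecords(stream):
    """Count records in a binary stream laid out in TFRecord framing."""
    count = 0
    while True:
        header = stream.read(12)  # 8-byte length + 4-byte CRC of the length
        if not header:
            return count
        if len(header) < 12:
            raise ValueError("truncated record header")
        (length,) = struct.unpack("<Q", header[:8])
        payload = stream.read(length + 4)  # payload + 4-byte CRC of the payload
        if len(payload) < length + 4:
            raise ValueError("truncated record payload")
        count += 1


def fake_record(data):
    """Frame data like a TFRecord entry, with zeroed CRCs (illustration only)."""
    return struct.pack("<Q", len(data)) + b"\x00" * 4 + data + b"\x00" * 4


if __name__ == "__main__":
    demo = io.BytesIO(fake_record(b"abc") + fake_record(b"") + fake_record(b"hello"))
    print(count_tfrecords(demo))
```

For the real output, open the file in binary mode, for example `open("outputs/train.tfrecord", "rb")`, and pass the file object to `count_tfrecords`. The sum of the train and eval counts should match the number of annotated images that the converter kept.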

The contents of label_map.pbtxt are also generated properly, as shown in the figure below.
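To inspect a label map from a script, something like the following works. It is a minimal sketch that assumes the usual Tensorflow Object Detection API layout of `item { id: ... name: '...' }` blocks. `SAMPLE` is a hypothetical example, not the file produced in this article, and the ids the converter actually assigns may differ.

```python
import re

ITEM_RE = re.compile(r"item\s*\{(.*?)\}", re.DOTALL)
ID_RE = re.compile(r"\bid\s*:\s*(\d+)")
NAME_RE = re.compile(r"\bname\s*:\s*['\"]([^'\"]+)['\"]")


def parse_label_map(text):
    """Return {id: name} for every item block that has both fields."""
    mapping = {}
    for body in ITEM_RE.findall(text):
        id_match = ID_RE.search(body)
        name_match = NAME_RE.search(body)
        if id_match and name_match:
            mapping[int(id_match.group(1))] = name_match.group(1)
    return mapping


# Hypothetical example in the assumed layout; ids are illustrative.
SAMPLE = """
item {
  id: 1
  name: 'person'
}
item {
  id: 2
  name: 'sports_ball'
}
"""

if __name__ == "__main__":
    print(parse_label_map(SAMPLE))
```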

４．Finally

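Using auto-annotation, I succeeded in going all the way to a TFRecord-format dataset.
Generating a huge custom dataset from a large amount of video is no longer much of a chore.
Auto-annotation is the best!!

Since this is just object detection with a high-accuracy Faster R-CNN, I imagine many readers will react with a shrug.
Well, that's fine. As long as it's fun.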

５．Reference articles

https://github.com/opencv/cvat/blob/develop/cvat/apps/documentation/installation.md
https://github.com/opencv/cvat/blob/develop/utils/tfrecords/converter.md
https://photo-tea.com/p/17/ffmpeg-command-list/
