Object detection built from scratch

Last updated at 2022-07-03, posted at 2022-05-09

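Introduction

This article performs object detection with DeZero, the deep learning framework explained in "Deep Learning from Scratch 3: Framework Edition" (ゼロから作るDeep Learning 3 フレームワーク編).

The program and the dataset (images and annotations) used in this article can be downloaded here.

DeZero itself is similar to PyTorch, so rewriting the DeZero-specific parts lets the code run on PyTorch as well (PyTorch offers many more convenient features). I am not familiar with TensorFlow.

Here is an actual detection example.

To keep implementation and training simple, the detector built here is modeled on YOLO but is even simpler than YOLO v1 (2015).
Specifically, it has the following limitations.

The backbone is VGG16, which is dated (a limitation of the pretrained models available in DeZero)
Only one class can be detected (here, toy cars are extracted from photos)
Multiple objects cannot be detected within a single grid cell
The loss function is also simplified

So this article does not produce a state-of-the-art object detection model.

The environment consists of just the following.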

python 3.7.13
DeZero 0.0.13
numpy 1.21.5
cupy 6.0.0
pillow 9.0.1
matplotlib 3.5.1

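DeZero can be installed simply with pip install dezero.
CuPy is not required, but it shortens training time considerably, so it makes a big difference in efficiency.
The OS is Windows 10 and the GPU is a GeForce GTX 1060 6GB.

1. Loading the data into Python

Object detection requires images together with the positions of the objects in them.
It looks like this.

There are several formats for describing object positions (annotations), such as MS COCO. This article uses the Pascal VOC format, which is stored as XML.

The first important item is <path>...</path> on the fourth line from the top.
It holds the file path of the image that corresponds to this XML file.
Each XML file therefore corresponds one-to-one with an image.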

<annotation verified="yes">
    <folder>Annotation</folder>
    <filename>P1020101.jpg</filename>
    <path>car-PascalVOC-export/Annotations/P1020101.jpg</path>
    <source>
        <database>Unknown</database>
    </source>
    <size>
        <width>416</width>
        <height>416</height>
        <depth>3</depth>
    </size>
    <segmented>0</segmented>
    <object>
    <name>car</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox>
        <xmin>2.618705035971223</xmin>
        <ymin>112.23021582733813</ymin>
        <xmax>197.15107913669067</xmax>
        <ymax>209.9952190591277</ymax>
    </bndbox>
</object><object>
    <name>car</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox>
        <xmin>208.62353164343526</xmin>
        <ymin>61.35251798561151</ymin>
        <xmax>399.1654676258993</xmax>
        <ymax>164.1055231986286</ymax>
    </bndbox>
</object>
</annotation>

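Information about the objects in the image is written between <object> and </object>.
Each element contains the object's class name, <name>car</name> ("car" in this case), and coordinate information like the following.

<bndbox>
    <xmin>2.618705035971223</xmin>
    <ymin>112.23021582733813</ymin>
    <xmax>197.15107913669067</xmax>
    <ymax>209.9952190591277</ymax>
</bndbox>

This is the information for one object. Since there are two <object>...</object> elements, the image contains two objects (cars).
Python's xml.etree.ElementTree is used to parse this XML. Details of its usage are left to other resources. Here we build a class that extracts the path and the objects described above and returns them as a dictionary.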

import xml.etree.ElementTree as ET

class Anno_xml_to_List(object):
    def __init__(self, classes):
        self.classes = classes
    
    def __call__(self, xml_path):
        xml = ET.parse(xml_path).getroot()
        path = xml.find('path').text
        
        objects = []
        for obj in xml.iter('object'):
            #class name
            name = obj.find('name').text
            
            #coordinates of bounding_box
            bbox = obj.find('bndbox')
            bbs ={}
            for pt in ['xmin', 'ymin', 'xmax', 'ymax']:
                bbs[pt] = float(bbox.find(pt).text) - 1

            objects.append({'bndbox': bbs, 'name': name})
    
        return {'path':path, "object": objects}

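The returned objects value is a list containing one dictionary {'bndbox': bbs, 'name': name} per object.
bbs is a dictionary of the form {xmin: , ymin: , xmax: , ymax: }. Note that the code subtracts 1 from every coordinate, so the values are shifted by one pixel relative to the XML.
This Anno_xml_to_List class is used to extract the image file path and the coordinates (bounding boxes) from an XML file.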
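
As a minimal, self-contained sketch of the same parsing approach, the following uses only the standard library. The XML string and the helper name parse_voc are made up for illustration and are not part of the article's code. Unlike the class above, this sketch does not subtract 1 from the coordinates.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened annotation used only for this illustration.
SAMPLE_XML = """
<annotation>
    <path>images/sample.jpg</path>
    <object>
        <name>car</name>
        <bndbox>
            <xmin>10.0</xmin><ymin>20.0</ymin>
            <xmax>110.0</xmax><ymax>70.0</ymax>
        </bndbox>
    </object>
</annotation>
"""

def parse_voc(xml_text):
    """Return {'path': str, 'object': [{'bndbox': {...}, 'name': str}, ...]}."""
    root = ET.fromstring(xml_text.strip())
    objects = []
    for obj in root.iter("object"):
        bbox = obj.find("bndbox")
        # Read the four corner coordinates as floats.
        bbs = {key: float(bbox.find(key).text)
               for key in ("xmin", "ymin", "xmax", "ymax")}
        objects.append({"bndbox": bbs, "name": obj.find("name").text})
    return {"path": root.find("path").text, "object": objects}

parsed = parse_voc(SAMPLE_XML)
print(parsed["path"])
print(parsed["object"])
```

ET.fromstring parses from a string, while ET.parse(...).getroot() in the class above does the same from a file.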

The file layout is as follows.
The jpg files and xml files are in the same folder.


├─dezero_object_detection.ipynb
├─img_process.py
├─img_show.py
├─utils.py
└─car-PascalVOC-export
    └─Annotations
       ├─P1020098.jpg
       ├─P1020098.xml
       ├─P1020099.jpg
       ├─P1020099.xml
       ├─P1020100.jpg
       ├─P1020100.xml
       ├─P1020101.jpg
       ├─P1020101.xml

From this, glob is used to pull out only the list of xml files.

import glob
from utils import train_test_split

# collect the paths of the xml files
paths = glob.glob("car-PascalVOC-export/Annotations/*.xml")

# split into train and test data
train_paths, test_paths = train_test_split(paths)

transform_anno = Anno_xml_to_List([])

# take one path from the train data and output its target data
target = transform_anno(train_paths[2])

print(target["object"])

The output looks like this.

[{'bndbox': {'xmin': 1.6187050359712232, 'ymin': 111.23021582733813, 'xmax': 196.15107913669067, 'ymax': 208.9952190591277}, 'name': 'car'}, {'bndbox': {'xmin': 207.62353164343526, 'ymin': 60.35251798561151, 'xmax': 398.1654676258993, 'ymax': 163.1055231986286}, 'name': 'car'}]

The image and its bounding boxes are now loaded into Python.

The original images are 416×416, but larger images slow down training and inference, so they are resized to 224×224 here.
The function resize_xyxys_bnd resizes the bounding boxes. Since the class name is not used this time (the only object class is car), resize_xyxys_bnd returns only the bounding box coordinates.

def resize_xyxy_bnd(bnd_xyxy, ratio=416/224):
    bnd_xyxy = {
       "xmin": bnd_xyxy['xmin'] / ratio,
       "ymin": bnd_xyxy['ymin'] / ratio,
       "xmax": bnd_xyxy['xmax'] / ratio,
       "ymax": bnd_xyxy['ymax'] / ratio
      }
    return bnd_xyxy

def resize_xyxys_bnd(target_obj, ratio=416/224):
    return [resize_xyxy_bnd(bnd['bndbox'], ratio) for bnd in target_obj]

A function write_bndbox for displaying bounding boxes is also provided, so we use it.

from PIL import Image

# open the image with PIL
img = Image.open(target["path"])
img = img.resize((224, 224))

# resize the bounding boxes
bnd_xyxys = resize_xyxys_bnd(target["object"], ratio=416./224.)

# display the image
write_bndbox(img, bnd_xyxys)
print(bnd_xyxys)

The output is as follows.
[{'xmin': 0.8716104039845047, 'ymin': 59.893193137797454, 'xmax': 105.61981184283343, 'ymax': 112.53588718568413}, {'xmin': 111.79728626954206, 'ymin': 32.49750968456004, 'xmax': 214.3967902600996, 'ymax': 87.8260509531077}]

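Bounding boxes that match the 224×224 image are obtained.

2. Converting between xyxy and xywh

The bounding box coordinates obtained in section 1 are in the $(x_{min}, y_{min}, x_{max}, y_{max})$ form shown on the left of the figure above. For generating the targets described later, however, the $(x, y, w, h)$ form on the right is more convenient.

Here we write a function convert_xyxys_to_xywhs that takes $(x_{min}, y_{min}, x_{max}, y_{max})$ and outputs $(x, y, w, h)$, where $(x, y)$ is the center of the box.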

def convert_xyxy_to_xywh(bnd_xyxy):
    bnd_xywh = {
      "x": (bnd_xyxy['xmax'] + bnd_xyxy['xmin']) / 2,
      "y": (bnd_xyxy['ymax'] + bnd_xyxy['ymin']) / 2,
      "w": (bnd_xyxy['xmax'] - bnd_xyxy['xmin']),
      "h": (bnd_xyxy['ymax'] - bnd_xyxy['ymin'])
      }
    return bnd_xywh

def convert_xyxys_to_xywhs(bnd_xyxys):
    return [convert_xyxy_to_xywh(bnd) for bnd in bnd_xyxys]

Next, we write convert_xywhs_to_xyxys, which converts back to $(x_{min}, y_{min}, x_{max}, y_{max})$.

def convert_xywh_to_xyxy(bnd_xywh):
    bnd_xyxy = {
      "xmin": bnd_xywh['x'] - bnd_xywh['w'] / 2,
      "ymin": bnd_xywh['y'] - bnd_xywh['h'] / 2,
      'xmax': bnd_xywh['x'] + bnd_xywh['w'] / 2,
      'ymax': bnd_xywh['y'] + bnd_xywh['h'] / 2,      
      }
    return bnd_xyxy

def convert_xywhs_to_xyxys(bnd_xywhs):
    return [convert_xywh_to_xyxy(bnd) for bnd in bnd_xywhs]
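
A quick way to check the two conversions is a round trip on one box. The sketch below restates the same formulas under short hypothetical names (xyxy_to_xywh, xywh_to_xyxy) so that it runs on its own. The sample box is made up.

```python
def xyxy_to_xywh(b):
    """Corner form -> center form (center x, center y, width, height)."""
    return {"x": (b["xmin"] + b["xmax"]) / 2,
            "y": (b["ymin"] + b["ymax"]) / 2,
            "w": b["xmax"] - b["xmin"],
            "h": b["ymax"] - b["ymin"]}

def xywh_to_xyxy(b):
    """Center form -> corner form."""
    return {"xmin": b["x"] - b["w"] / 2,
            "ymin": b["y"] - b["h"] / 2,
            "xmax": b["x"] + b["w"] / 2,
            "ymax": b["y"] + b["h"] / 2}

box = {"xmin": 10.0, "ymin": 20.0, "xmax": 110.0, "ymax": 70.0}
center = xyxy_to_xywh(box)
restored = xywh_to_xyxy(center)

print(center)    # {'x': 60.0, 'y': 45.0, 'w': 100.0, 'h': 50.0}
print(restored)  # same values as box
```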

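3. Data augmentation

The source code for this section is mainly at the following link.

In object detection, bounding boxes must also be defined, so building training data takes more effort than for image classification. As a result, the number of images is often insufficient, which makes operations such as horizontal flipping, resizing, and filtering even more important.
These basic image operations can be done with PIL. For object detection, the bounding box coordinates must be transformed together with the image.

For example, for a horizontal flip with input $(x, y, w, h)$, $(y, w, h)$ stay the same and only $x$ is transformed as follows to obtain the new bounding box.

x = [image width (pixels)] - x

Expressed as a function, this becomes the following.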

def random_horizontal_flip(img, bnd_xywhs, p=0.5):
    """
    img : PIL image
    bnd_xywhs : list of dicts
    p : float
    """
    if (p < np.random.rand()):
        return img, bnd_xywhs
    #image width
    size = img.size[0]
    img = ImageOps.mirror(img)

    for i in range(len(bnd_xywhs)):
        bnd_xywhs[i]["x"] = size - bnd_xywhs[i]["x"]
    return img, bnd_xywhs

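First a random number between 0 and 1 is generated. If it is greater than p, the input is returned as is without flipping. Otherwise the image is mirrored and the x values in bnd_xywhs are updated in place.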
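
Because the function above edits the dictionaries in bnd_xywhs in place, the caller's list changes as well. The following is a hedged sketch of the coordinate part only, written so that it returns new dictionaries. flip_x_centers is a hypothetical helper, not part of the article's code, and the sample box is made up.

```python
def flip_x_centers(bnd_xywhs, img_width):
    """Mirror box centers horizontally; y, w and h are unchanged.

    Returns new dicts, leaving the input list untouched.
    """
    return [dict(b, x=img_width - b["x"]) for b in bnd_xywhs]

boxes = [{"x": 42.0, "y": 65.0, "w": 84.0, "h": 40.0}]
flipped = flip_x_centers(boxes, img_width=224)

print(flipped)  # [{'x': 182.0, 'y': 65.0, 'w': 84.0, 'h': 40.0}]
print(boxes)    # the original list is unchanged
```

Flipping twice returns the original center, which is a simple sanity check for this kind of transform.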

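Next is a function that randomly changes (shrinks) the image size.
Since the output image size must not change after resizing, a new solid gray image is created with PIL and the shrunken image is pasted onto it.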

def random_resize(img, bnd_xywhs, p=0.5, max_shrink=0.3):
    """
    img: PIL image
    bnd_xywhs: list of dict   
    p: float
        the event probability, from 0 to 1
    max_shrink: float
        maximum rate of shrinking
    """
    if (p < np.random.rand()):
        return img, bnd_xywhs
    
    x_size, y_size = img.size
    
    #determine the ratio of the image after resizing
    x_resize = 1 - (max_shrink * np.random.rand())
    y_resize = 1 - (max_shrink * np.random.rand())
    
    img = img.resize((int(x_resize * x_size),
                      int(y_resize * y_size)))
    
    #generate gray image and paste resized image
    img_res = Image.new('RGB', (x_size, y_size), (127, 127, 127))
    img_res.paste(img, (0, 0))
    
    # update the bounding boxes
    for i in range(len(bnd_xywhs)):
        bnd_xywhs[i]["x"] *= x_resize
        bnd_xywhs[i]["y"] *= y_resize
        bnd_xywhs[i]["w"] *= x_resize
        bnd_xywhs[i]["h"] *= y_resize
    
    return img_res, bnd_xywhs

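When the bounding boxes do not change, as with color adjustments, things are a bit simpler. The following function converts the image to monochrome with a given probability.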

def random_grayscale(img, p=0.2):
    """
    img: PIL image
    p : float
    """
    if (p < np.random.rand()):
        return img
    return img.convert("L").convert("RGB")

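We then write a function that bundles the data augmentations above. The helpers it calls that are not shown in this article (random_enhance, random_erasing, random_vertical_flip, random_gaussian_blur, random_sharpness, random_poster) are in the linked source code.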

def data_aug(img, bnd_xywhs):
    """
    img: PIL image
    bnd_xywhs: list of dicts
    """
    img = random_grayscale(img, p=0.2)
    img = random_enhance(img, p=0.4)
    img = random_erasing(img, p=0.5, max_width=32)
    img, bnd_xywhs = random_resize(img, bnd_xywhs, p=0.5)
    img, bnd_xywhs = random_horizontal_flip(img, bnd_xywhs, p=0.5)
    img, bnd_xywhs = random_vertical_flip(img, bnd_xywhs, p=0.5)
    img = random_gaussian_blur(img,p=0.5)
    img = random_sharpness(img,p=0.5)
    img = random_poster(img)
    return img, bnd_xywhs

Let's run this data augmentation. It artificially increases the number of training images, which is expected to improve detection accuracy.

bnd_xywhs = convert_xyxys_to_xywhs(bnd_xyxys)
img, bnd_xywhs = data_aug(img, bnd_xywhs)
bnd_xyxys = convert_xywhs_to_xyxys(bnd_xywhs)

write_bndbox(img, bnd_xyxys)

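An example of the augmentation looks like this.

4. Creating the training targets

The source code for section 4 onward is mainly at the following.

First, let's look at how the training targets differ between image classification and object detection.
In image classification, the whole image is examined and classified.
In object detection, the image is divided into grid cells and each grid cell is classified.
Here, each cell of a 7×7 grid is classified as either containing an object (car) or being background.

More specifically, only the grid cell that contains the center of a bounding box is set to 1 (car), and all other grid cells are set to 0 (background).

This alone, however, does not give the precise position of the object center. So the fine position within the grid cell is output in the same way.
Since a 224×224 image is divided into 7×7 grid cells, each cell is 32×32 pixels.
To keep the outputs between 0 and 1, the coordinates (in pixels) within the grid cell divided by 32 are used as the target.
In the figure below, the $x$ axis is $0.2$ and the $y$ axis is $0.1$.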
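
The split into a grid index and an in-cell offset can be sketched with divmod. encode_center is a hypothetical helper used only for this illustration, and the pixel center (38.4, 67.2) is a made-up value chosen so that the offsets come out at about 0.2 and 0.1.

```python
def encode_center(x, y, scale=32):
    """Split a pixel center into a grid index and an offset in [0, 1) per axis."""
    x_grid, x_rem = divmod(x, scale)   # which column, and pixels into that column
    y_grid, y_rem = divmod(y, scale)   # which row, and pixels into that row
    return int(x_grid), int(y_grid), x_rem / scale, y_rem / scale

# A center at pixel (38.4, 67.2) in a 224x224 image with 32x32 cells.
col, row, x_off, y_off = encode_center(38.4, 67.2)
print(col, row)                          # grid column 1, grid row 2
print(round(x_off, 3), round(y_off, 3))  # offsets of about 0.2 and 0.1
```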

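In addition to the position of the bounding box center, the bounding box size $(w_{t}, h_{t})$ is also output.

$$w_{t} = \sqrt{\frac{w}{224}}$$

$$h_{t} = \sqrt{\frac{h}{224}}$$

For a large object, a small error in (w, h) is tolerable, but for a small object the same error is a large deviation relative to the true size. Taking the square root emphasizes small objects.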
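
The effect of the square root can be checked numerically. In this sketch (size_target is a hypothetical helper and the pixel widths are made-up values), the same 4-pixel error produces a gap in target space that is roughly three times larger for the small box than for the large one.

```python
import math

def size_target(pixels, imsize=224):
    """Square-root size target, as in the formulas for w_t and h_t."""
    return math.sqrt(pixels / imsize)

# The same 4-pixel error applied to a small and to a large box width.
small_gap = size_target(20) - size_target(16)
large_gap = size_target(164) - size_target(160)

print(round(small_gap, 4), round(large_gap, 4))
```

A squared-error loss on these targets therefore penalizes the small-box error more heavily.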

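To summarize, each grid cell has

object center or not
position of the center within the grid cell (x-axis)
position of the center within the grid cell (y-axis)
bounding box size $w_{t}$
bounding box size $h_{t}$

that is, five numbers as its target.
So for each image, a target of shape $(5×7×7)$ is created.
Note that DeZero, PyTorch, and similar frameworks are channel-first, which is why the 5 comes first.
The figure below illustrates this.

Based on the above, we write a function that creates the target from $(x, y, w, h)$ data.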

def make_target(bnd_xywhs, scale=32., imsize=224.):
    """
    bnd_xywhs : list of dict
    scale : int or float
         number of pixels per cell side
    imsize : int or float
         number of pixels per image side
    """
    # create the 5*7*7 array
    target = np.zeros([5, int(imsize / scale), int(imsize / scale)],
                      dtype=np.float32)

    for bnd_xywh in bnd_xywhs:
        # compute which cell contains the center
        x_grid = bnd_xywh["x"] // scale
        y_grid = bnd_xywh["y"] // scale

        # compute the center position within the cell
        x_pos = (bnd_xywh["x"] - (x_grid * scale)) / scale
        y_pos = (bnd_xywh["y"] - (y_grid * scale)) / scale

        # compute (w_t, h_t)
        width = np.sqrt(bnd_xywh["w"] / imsize)
        height = np.sqrt(bnd_xywh["h"] / imsize)

        # x...axis_1, y...axis_0
        target[0, int(y_grid), int(x_grid)] = 1.
        target[1, int(y_grid), int(x_grid)] = x_pos
        target[2, int(y_grid), int(x_grid)] = y_pos
        target[3, int(y_grid), int(x_grid)] = width
        target[4, int(y_grid), int(x_grid)] = height

    return target

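The target array is first created with np.zeros, and only the grid cells that contain an object (center) are overwritten, so the returned target is mostly zeros. This is not a problem: apart from the "object center or not" channel, grid cells without an object are ignored during training.

We also write convert_pred_to_xywh, which converts a target back to $(x, y, w, h)$.
It returns the $(x, y, w, h)$ values for the grid cells whose "object center or not" channel exceeds a threshold.
Because it is used for the non-maximum suppression described later, it also outputs the value of the "object center or not" channel as a score.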

def convert_pred_to_xywh(predict, scale=32, imsize=224, thres=0.4):
    
    y_pos, x_pos = np.where(predict[0, :, :] > thres)
    scores = []
    xywhs = []
    for x, y in zip(x_pos, y_pos):
        scores.append(predict[0, y, x])
        
        # note that width and height were square-rooted in make_target
        bnd_xywh = {
            "x": (x + predict[1, y, x]) * scale,
            "y": (y + predict[2, y, x]) * scale,
            "w": (predict[3, y, x] ** 2) * imsize,
            "h": (predict[4, y, x] ** 2) * imsize
            }
        xywhs.append(bnd_xywh)
    return scores, xywhs
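
The decoding arithmetic for a single cell can be checked by hand. decode_cell below is a hypothetical helper that applies the same formulas to one cell, and the cell indices, offsets, and size targets are made-up values.

```python
def decode_cell(col, row, x_off, y_off, w_t, h_t, scale=32, imsize=224):
    """Invert the target encoding for a single grid cell."""
    return {"x": (col + x_off) * scale,   # cell origin plus in-cell offset
            "y": (row + y_off) * scale,
            "w": (w_t ** 2) * imsize,     # undo the square root
            "h": (h_t ** 2) * imsize}

# Cell in column 1, row 2 with offsets (0.25, 0.5) and size targets (0.5, 0.25).
decoded = decode_cell(1, 2, 0.25, 0.5, 0.5, 0.25)
print(decoded)  # {'x': 40.0, 'y': 80.0, 'w': 56.0, 'h': 14.0}
```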

Let's check the functions.
Printing the initial $(x, y, w, h)$ first shows that there are two objects (cars) in the photo.

print(bnd_xywhs)

[{'x': 42.703315165050604, 'y': 65.07760930164662, 'w': 84.00855889872452, 'h': 39.736460571586655}, {'x': 130.80460530175853, 'y': 45.412117574779984, 'w': 82.28529326112114, 'h': 41.763827599725765}]

Run make_target and print only the "object center or not" channel.

target = make_target(bnd_xywhs)
print(target[0, :, :])

[[0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 1. 0. 0.]
[0. 1. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0.]]

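Convert the target back to the original $(x, y, w, h)$. The leading [1.0, 1.0] are the scores. The two boxes come out in the reverse of the original order, but the values are the same up to float32 rounding.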

print(convert_pred_to_xywh(target))

([1.0, 1.0], [{'x': 130.8046052455902, 'y': 45.41211795806885, 'w': 82.2852860515045, 'h': 41.76382824414256}, {'x': 42.703314781188965, 'y': 65.0776093006134, 'w': 84.00855848438914, 'h': 39.73645990544492}])

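5. Building the network

We build a network that outputs $7×7×5$ for an input image of size $224×224×3$.
VGG16, which is used for image classification, is used here.
Networks trained on ImageNet for image classification capture image features well and are known to transfer to object detection tasks.
Such a network is sometimes called a backbone.

The VGG16 network is shown below.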

A review of deep learning in the study of materials degradation

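VGG16 has 13 convolutional layers and 3 fully connected layers, and ultimately classifies into the 1000 ImageNet classes.
We want 7×7×5 data, and conveniently the data after the 13 convolutional layers and the final max pooling (just before the fully connected layers) is 7×7×512.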
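
The 7×7×512 shape can be traced with the usual output-size formulas. This is a sketch that assumes the standard VGG16 layout (3×3 convolutions with stride 1 and padding 1, each block followed by 2×2 max pooling with stride 2). conv_out and pool_out are hypothetical helpers, not DeZero functions.

```python
def conv_out(size, kernel=3, stride=1, pad=1):
    """Output size of a convolution along one spatial axis."""
    return (size + 2 * pad - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    """Output size of max pooling along one spatial axis (no padding)."""
    return (size - kernel) // stride + 1

# VGG16 feature extractor: (number of 3x3 convs, output channels) per block.
blocks = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

size = 224
for n_convs, channels in blocks:
    for _ in range(n_convs):
        size = conv_out(size)   # 3x3, stride 1, pad 1 keeps the size
    size = pool_out(size)       # 2x2, stride 2 halves the size
    print(f"{size}x{size}x{channels}")
```

The five poolings halve 224 five times (112, 56, 28, 14, 7), which is why a 32×32 pixel grid cell lines up with one output position.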

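We branch off from there to output 7×7×5 data by adding two convolutional layers. Since the outputs should be between 0 and 1, sigmoid is used as the final activation function. The network is as follows.

dezero.models provides a VGG16 class, so we create a new class that inherits from it.
The missing convolutional layers are defined additionally in __init__, and forward is overridden so that 7×7×5 data is output.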


import dezero.functions as F
import dezero.layers as L
from dezero import optimizers
from dezero.models import VGG16, Model


class NET(VGG16):
    def __init__(self, pretrained=True):
        super().__init__(pretrained)
        self.conv6_1 = L.Conv2d(512, kernel_size=3, stride=1, pad=1)
        self.conv6_2 = L.Conv2d(5, kernel_size=3, stride=1, pad=1)

    def forward(self, x):
        # override VGG16's forward
        x = F.relu(self.conv1_1(x))
        x = F.relu(self.conv1_2(x))
        x = F.pooling(x, 2, 2)
        x = F.relu(self.conv2_1(x))
        x = F.relu(self.conv2_2(x))
        x = F.pooling(x, 2, 2)
        x = F.relu(self.conv3_1(x))
        x = F.relu(self.conv3_2(x))
        x = F.relu(self.conv3_3(x))
        x = F.pooling(x, 2, 2)
        x = F.relu(self.conv4_1(x))
        x = F.relu(self.conv4_2(x))
        x = F.relu(self.conv4_3(x))
        x = F.pooling(x, 2, 2)
        x = F.relu(self.conv5_1(x))
        x = F.relu(self.conv5_2(x))
        x = F.relu(self.conv5_3(x))
        x = F.pooling(x, 2, 2)
        x = F.relu(self.conv6_1(x))
        x = F.sigmoid(self.conv6_2(x))
        return x

net = NET(pretrained=True)
optimizer = optimizers.AdaGrad(lr=0.0001).setup(net)
net.to_gpu()

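6. Defining the loss function

Inference with the network from section 5 yields a 4-dimensional tensor of shape n×5×7×7 [pred] (n is the batch size).
The train_loader that appears in sections 7 and 8 outputs tensors of shape n×5×7×7 produced by make_target [targets].

The key point is that, for every loss except loss_0 (whether the grid cell contains an object center or not),
a mask is used so that no loss is computed for grid cells without an object center.

This prevents the network from learning the zero-filled values produced by make_target.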

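from dezero import Variable

def loss_fn(pred, targets):
    # channel 0 of the target is 1 only in cells that contain an object center
    mask = Variable(targets[:, 0, :, :])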
    
    #object or background
    loss_0 = F.binary_cross_entropy(pred[:, 0, :, :],
                                    targets[:, 0, :, :])
    #Positions in the grid
    loss_1 = F.mean_squared_error(pred[:, 1, :, :] * mask, 
                                  targets[:, 1, :, :])
    loss_2 = F.mean_squared_error(pred[:, 2, :, :] * mask, 
                                  targets[:, 2, :, :])
    #size of bounding box
    loss_3 = F.mean_squared_error(pred[:, 3, :, :] * mask,
                                  targets[:, 3, :, :])
    loss_4 = F.mean_squared_error(pred[:, 4, :, :] * mask,
                                  targets[:, 4, :, :])
    
    loss = (loss_0 * 0.1) + loss_1 + loss_2 + loss_3 + loss_4

    return loss

loss_0 is a binary classification (object center present or not), so binary_cross_entropy is used. The others are regression, so MSE is used.
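
The effect of the mask can be seen on a tiny NumPy example. This is only an illustration of the masking idea. masked_mse is a hypothetical helper, the arrays are made up, and the averaging here is NumPy's plain mean, which is not claimed to match the scaling of F.mean_squared_error.

```python
import numpy as np

def masked_mse(pred, target, mask):
    """Mean squared error where cells with mask == 0 contribute nothing."""
    return float(np.mean((pred * mask - target) ** 2))

# One 2x2 "channel": only the top-left cell holds an object center.
mask = np.array([[1., 0.], [0., 0.]])
target = np.array([[0.5, 0.], [0., 0.]])
pred = np.array([[0.7, 0.9], [0.3, 0.1]])

masked = masked_mse(pred, target, mask)
unmasked = float(np.mean((pred - target) ** 2))

print(masked)    # only the top-left error (0.7 vs 0.5) counts
print(unmasked)  # every cell is pulled toward 0, which the mask avoids
```

Multiplying the prediction by the mask zeroes it wherever the target is also zero, so those cells produce no error and no gradient.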

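7. Creating the dataloader

We combine the classes and functions built so far into a dataloader that loads images and bounding boxes and produces training or test data.
DeZero provides DataLoader, which makes it easy to build a dataloader. Its usage is similar to PyTorch's.
DeZero also has a function that converts a PIL image into a form that VGG16 can process, so we use that as well.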

from dezero import DataLoader

class MyDataset:
    def __init__(self, paths_xml, transform, train=True):
        self.transform = transform
        self.paths_xml = paths_xml
        self.train = train
        
    def __len__(self):
        return len(self.paths_xml)

    def __getitem__(self, index):
        transform_anno = Anno_xml_to_List([])
        img_anno = transform_anno(self.paths_xml[index])

        bnd_xyxys = resize_xyxys_bnd(img_anno["object"])
        bnd_xywhs = convert_xyxys_to_xywhs(bnd_xyxys)
        
        img = Image.open(img_anno["path"])
        img = img.resize((224, 224))
        if self.train:
            # img must already be resized at this point
            img, bnd_xywhs = data_aug(img, bnd_xywhs)
            
        # VGG16 preprocessing
        img = self.transform(img)
        target = make_target(bnd_xywhs)

        return img, target

train_set = MyDataset(train_paths, VGG16.preprocess, train=True)
train_loader = DataLoader(train_set, batch_size=10, shuffle=True)

test_set = MyDataset(test_paths, VGG16.preprocess, train=False)
test_loader = DataLoader(test_set, batch_size=10, shuffle=False)

# only when a GPU is available
train_loader.to_gpu()
test_loader.to_gpu()

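Here batch_size=10, so 10 images are trained per iteration.
If GPU memory is too small to train, reducing batch_size makes training possible.
Note, however, that training is said to become unstable when the batch size is extremely small.

8. Training

We now train using everything built so far.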

from dezero import test_mode

# record the losses
train_losses = []
test_losses = []

for epoch in range(200):
    # train data: learn with backward and the optimizer
    tmp_loss, tmp_data = 0., 0.
    for imgs, targets in train_loader:

        pred = net(imgs)
        loss = loss_fn(pred, targets)
        loss.backward()
        optimizer.update()
        net.cleargrads()
        
        tmp_loss += float(loss.data)
        tmp_data += imgs.shape[0]
        del(pred, loss)
        
    train_losses.append(tmp_loss / tmp_data)
    
    # test data: no learning here
    tmp_loss, tmp_data = 0., 0.
    for imgs, targets in test_loader:
        with test_mode():
            pred = net(imgs)
        loss = loss_fn(pred, targets)
        tmp_loss += float(loss.data)
        tmp_data += imgs.shape[0]
        del(pred, loss)
        
    test_losses.append(tmp_loss/tmp_data)
    
    print("epoch:",epoch, 
          " train:", np.round(train_losses[-1], 4), 
          "test:", np.round(test_losses[-1], 4))

# visualize the losses
plt.xlim(0, 200)
plt.plot(train_losses, label="train_loss")
plt.plot(test_losses, label="test_loss")
plt.legend()
plt.show()

The loss curves look like this.

9. Removing overlaps with non-maximum suppression

We visualize results on the test data. The pre-built functions convert_img_for_mat and show_heatmap are used here.
convert_img_for_mat reverses the VGG16 preprocessing so that the image displays properly in matplotlib.

net.to_cpu()
test_loader.to_cpu()

for imgs, targets in test_loader:
    pred = net(imgs)
    
    for i in range(imgs.shape[0]):
        tmp_img = convert_img_for_mat(imgs[i])
        show_heatmap(tmp_img, pred[i, 0, :, :].data, ticks=False)
    break

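The stronger the red, the higher the probability that an object center is present.

Here several cells claim "the center of the toy ambulance is here!"
Left as is, multiple bounding boxes would overlap on a single object. Non-maximum suppression removes these overlapping bounding boxes.
Non-maximum suppression first needs to decide whether bounding boxes overlap.
The overlap is measured by a value called IoU (intersection over union), which is computed from the coordinates of two bounding boxes as follows.

The x and y coordinates needed for the calculation can be obtained as follows.

The function that computes this IoU is as follows.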

def calc_iou(xyxy_1, xyxy_2):
    # return 0 when the boxes do not overlap
    if xyxy_1["xmax"] < xyxy_2["xmin"]:
        return 0.
    if xyxy_1["xmin"] > xyxy_2["xmax"]:
        return 0.
    if xyxy_1["ymax"] < xyxy_2["ymin"]:
        return 0.
    if xyxy_1["ymin"] > xyxy_2["ymax"]:
        return 0.
    
    # compute the area of each bounding box
    area_1 = (xyxy_1["xmax"] - xyxy_1["xmin"]) * (xyxy_1["ymax"] - xyxy_1["ymin"])
    area_2 = (xyxy_2["xmax"] - xyxy_2["xmin"]) * (xyxy_2["ymax"] - xyxy_2["ymin"])
    
    # compute the coordinates needed for the intersection area
    xmin = max(xyxy_1["xmin"], xyxy_2["xmin"])
    ymin = max(xyxy_1["ymin"], xyxy_2["ymin"])
    xmax = min(xyxy_1["xmax"], xyxy_2["xmax"])
    ymax = min(xyxy_1["ymax"], xyxy_2["ymax"])
    
    area_inter = (xmax - xmin) * (ymax - ymin)
    area_union = area_1 + area_2 - area_inter
    
    return area_inter / area_union
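
A worked example makes the formula concrete. The sketch below is a compact variant that clamps the intersection width and height at zero instead of using early returns. The name iou, the tuple layout (xmin, ymin, xmax, ymax), and the sample boxes are assumptions for illustration, and boxes are assumed to have positive area.

```python
def iou(a, b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax) tuples."""
    # Overlap along each axis, clamped at 0 when the boxes are disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Two 10x10 boxes offset by 5 pixels: intersection 25, union 175.
overlapping = iou((0, 0, 10, 10), (5, 5, 15, 15))
disjoint = iou((0, 0, 10, 10), (20, 20, 30, 30))

print(overlapping)  # 25 / 175, about 0.143
print(disjoint)     # 0.0
```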

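When the IoU exceeds the threshold, the bounding box with the lower score is the one that is removed.
The following function non_max_supression uses itertools to enumerate every pair of bounding boxes.
Deleting a bounding box from the list in the middle of the IoU calculations would shift the list indices.

So, although it is inefficient, all IoUs are computed first and only the bounding boxes to keep are extracted at the end.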

import itertools

def non_max_supression(scores, xyxys, thres=0.4):
    #bounding boxの数が0個 or 1個
    if len(xyxys) <= 1:
        return scores, xyxys
    
    #Create two numbers of indexes that do not allow duplicates
    array = [i for i in range(len(xyxys))]
    combs = itertools.combinations(array, 2)
    
    flags = [True for i in range(len(xyxys))]

    for i, j in combs:
        iou = calc_iou(xyxys[i], xyxys[j])
        # the box with the lower score is flagged False
        if iou > thres:
            if scores[i] > scores[j]:
                flags[j] = False
            else:
                flags[i] = False
    
    res_scores, res_xyxyxs = list(), list()
    
    #Delete those classified as False.
    for i, flag in enumerate(flags):
        if flag:
            res_scores.append(scores[i])
            res_xyxyxs.append(xyxys[i])
            
    return res_scores, res_xyxyxs
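
For comparison, here is a hedged sketch of the more common greedy formulation of non-maximum suppression. It is a different technique from the pairwise function above, not a rewrite of it: boxes are visited in descending score order and a box is dropped only if it overlaps a box that was actually kept. The names greedy_nms, iou, and box, and the sample boxes, are assumptions for illustration.

```python
def iou(a, b):
    """IoU of two boxes stored as dicts with xmin, ymin, xmax, ymax."""
    inter_w = max(0.0, min(a["xmax"], b["xmax"]) - max(a["xmin"], b["xmin"]))
    inter_h = max(0.0, min(a["ymax"], b["ymax"]) - max(a["ymin"], b["ymin"]))
    inter = inter_w * inter_h
    area_a = (a["xmax"] - a["xmin"]) * (a["ymax"] - a["ymin"])
    area_b = (b["xmax"] - b["xmin"]) * (b["ymax"] - b["ymin"])
    return inter / (area_a + area_b - inter)

def greedy_nms(scores, xyxys, thres=0.4):
    """Keep boxes in descending score order, dropping any box that overlaps
    an already kept box by more than thres."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(xyxys[i], xyxys[k]) <= thres for k in keep):
            keep.append(i)
    return [scores[i] for i in keep], [xyxys[i] for i in keep]

def box(xmin, ymin, xmax, ymax):
    return {"xmin": xmin, "ymin": ymin, "xmax": xmax, "ymax": ymax}

# A chain: the 1st box overlaps the 2nd, the 2nd overlaps the 3rd,
# but the 1st and 3rd barely overlap (IoU about 0.11).
boxes = [box(0, 0, 10, 10), box(4, 0, 14, 10), box(8, 0, 18, 10)]
scores = [0.9, 0.8, 0.7]

kept_scores, kept_boxes = greedy_nms(scores, boxes, thres=0.4)
print(kept_scores)  # [0.9, 0.7]
```

On this chain the greedy version keeps the first and third boxes. The pairwise function above would keep only the first, because the second box still suppresses the third even though the second is itself removed.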

Finally, we display some inference results on the test data.

net.to_cpu()
test_loader.to_cpu()

for imgs, targets in test_loader:
    pred = net(imgs)
    
    for i in range(imgs.shape[0]):
        scores, pred_bnds = convert_pred_to_xywh(pred[i, :, :, :].data, 
                                                 thres=0.3)
        pred_bnds = convert_xywhs_to_xyxys(pred_bnds)
        scores, pred_bnds = non_max_supression(scores, pred_bnds, thres=0.3)
        tmp_img = convert_img_for_mat(imgs[i])
        
        write_bndbox(tmp_img, pred_bnds, scores=scores, ticks=False)
        show_heatmap(tmp_img, pred[i, 0, :, :].data, ticks=False)
    break

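Conclusion

Object detection with DeZero is done. Apologies for the somewhat sensational title.
The detector built here is heavily simplified, so various refinements should improve it further.
Some of the functions written here are already provided in torchvision. From a practical standpoint, using torchvision is probably more efficient.

This course was also used as a reference for this article. It has almost no explanation of implementation, but it was very helpful on the theory side.

For those who want to build a higher-performance object detector, the following textbook should be a useful reference.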
