
Understanding YOLOX ④: Understanding Augmentation (Mosaic, MixUp, etc.)

Last updated at 2024-02-16, posted at 2024-02-15

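What is YOLOX

YOLOX, released in 2021, is a benchmark model for real-time object detection. There are many real-time object detection models, but because YOLOX is released under Apache License 2.0, which permits commercial use, it is presumably under active development at many companies. To understand YOLOX from several angles, this series digs into it with a focus on the implementation.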

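The road to understanding YOLOX

Understanding the YOLOX backbone (CSPDarknet)
Understanding the YOLOX neck/head (YOLOXPAFPN, YOLOXHead)
Understanding the YOLOX loss (four losses and their weighting)
⭐️ (This article) YOLOX augmentation (Mosaic, MixUp, etc.)
Theory and implementation of SimOTA in YOLOX
Performance differences between YOLOX models (Tiny, S, M, L, X)
Practical uses of YOLOX (real-world deployments, etc.)

This article covers item 4, augmentation. mmdetection is used for the visualizations.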

YOLOX augmentation

To visualize the YOLOX transforms one by one, this article uses mmdetection's browse_dataset.py.

1. The YOLOX preprocessing pipeline

In the mmdetection implementation, the training pipeline is laid out in the order Mosaic → RandomAffine → MixUp → YOLOXHSVRandomAug → RandomFlip (→ Resize → Pad). Let's look at each preprocessing step in turn while visualizing the images.

train_pipeline = [
    dict(type='Mosaic', img_scale=img_scale, pad_val=114.0),
    dict(
        type='RandomAffine',
        scaling_ratio_range=(0.1, 2),
        # img_scale is (width, height)
        border=(-img_scale[0] // 2, -img_scale[1] // 2)),
    dict(
        type='MixUp',
        img_scale=img_scale,
        ratio_range=(0.8, 1.6),
        pad_val=114.0),
    dict(type='YOLOXHSVRandomAug'),
    dict(type='RandomFlip', prob=0.5),
    # According to the official implementation, multi-scale
    # training is not considered here but in the
    # 'mmdet/models/detectors/yolox.py'.
    # Resize and Pad are for the last 15 epochs when Mosaic,
    # RandomAffine, and MixUp are closed by YOLOXModeSwitchHook.
    dict(type='Resize', scale=img_scale, keep_ratio=True),
    dict(
        type='Pad',
        pad_to_square=True,
        # If the image is three-channel, the pad value needs
        # to be set separately for each channel.
        pad_val=dict(img=(114.0, 114.0, 114.0))),
    dict(type='FilterAnnotations', min_gt_bbox_wh=(1, 1), keep_empty=False),
    dict(type='PackDetInputs')
]
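
The comment in the config says that Mosaic, RandomAffine, and MixUp are switched off for the last 15 epochs. As a rough sketch of that schedule (this is not mmdetection's hook itself), the hypothetical helper below lists which transforms would run at a given epoch. `active_transforms`, `max_epochs=300`, and the exact boundary epoch are assumptions made for illustration.

```python
def active_transforms(epoch, max_epochs=300, num_last_epochs=15):
    """Return the transform names applied at a 0-based epoch.

    max_epochs=300 is an assumed schedule length, and the boundary
    epoch is a simplification of the real mode-switch hook.
    """
    strong = ["Mosaic", "RandomAffine", "MixUp"]
    always = ["YOLOXHSVRandomAug", "RandomFlip", "Resize", "Pad"]
    # In the final epochs only the mild transforms remain.
    if epoch >= max_epochs - num_last_epochs:
        return always
    return strong + always


print(active_transforms(0))
print(active_transforms(290))
```

Under these assumptions the model sees the heavily mixed images for most of training and only lightly augmented images at the end.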

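2. Without transforms

Images in COCO2017 do not share a single resolution, and some are portrait or landscape, so each training image is resized and then padded to the target resolution (640, 640). The padded area is gray because pad_val is set to (114.0, 114.0, 114.0). I could not tell whether this value came from hyperparameter tuning or is simply a convention.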
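
To make the resize-then-pad step concrete, here is a minimal sketch that only computes the shapes involved. The helper name `resize_and_pad_shape` is hypothetical, and it assumes the padding is added on the right and bottom edges; it does not call mmdetection.

```python
def resize_and_pad_shape(width, height, target=640):
    """Compute the keep-ratio resized size and the padding to a square.

    Returns ((new_w, new_h), (pad_right, pad_bottom)).
    Assumes padding goes on the right and bottom edges.
    """
    # The longer side is scaled to `target`; the aspect ratio is kept.
    scale = min(target / width, target / height)
    new_w = int(width * scale + 0.5)
    new_h = int(height * scale + 0.5)
    # The remaining area would be filled with pad_val (114 gray).
    return (new_w, new_h), (target - new_w, target - new_h)


print(resize_and_pad_shape(640, 480))  # landscape image
print(resize_and_pad_shape(480, 640))  # portrait image
```

A landscape image ends up with a gray band along one edge and a portrait image along the other, which matches the gray margins described above.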

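3. All YOLOX transforms together

The YOLOX augmentations are hard to grasp intuitively. For example, images like the following are used as training data.

Below a man eating pizza there is a faint second man, and many cups and knives appear in a kitchen whose perspective is off. It could pass for a clumsy composite photo (or a ghost photo?).

The image is hard to read because mosaic and mixup are applied together, which makes it difficult for a human to interpret. The combination appears to have been adopted because it makes training more efficient.

The following are further samples of images fed to the model during YOLOX training. Perhaps because RandomAffine and the other transforms are also at work, it is surprising how crowded the training images are.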

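4. Mosaic only

The mosaic transform is a preprocessing technique that first appeared in YOLOv4. Because a single image lets the model learn several contexts at once, it improves robustness to scale and aspect ratio. On the other hand, it risks overfitting when the dataset is too small, and it can generate images with scale ratios that could not occur in the real world.

Mosaic implementation in mmdetection

Choose the center coordinates of the mosaic
Place the top-left image first, then randomly pick the remaining three images from the dataset and place them
Crop any sub-image that is larger than its mosaic patch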

mosaic transform

```python
@TRANSFORMS.register_module()
class Mosaic(BaseTransform):
    """Mosaic augmentation.

    Given 4 images, mosaic transform combines them into
    one output image. The output image is composed of the parts from each sub-
    image.

    .. code:: text

                        mosaic transform
                           center_x
                +------------------------------+
                |       pad        |  pad      |
                |      +-----------+           |
                |      |           |           |
                |      |  image1   |--------+  |
                |      |           |        |  |
                |      |           | image2 |  |
     center_y   |----+-------------+-----------|
                |    |   cropped   |           |
                |pad |   image3    |  image4   |
                |    |             |           |
                +----|-------------+-----------+
                     |             |
                     +-------------+

     The mosaic transform steps are as follows:

         1. Choose the mosaic center as the intersections of 4 images
         2. Get the left top image according to the index, and randomly
            sample another 3 images from the custom dataset.
         3. Sub image will be cropped if image is larger than mosaic patch

    Required Keys:

    - img
    - gt_bboxes (BaseBoxes[torch.float32]) (optional)
    - gt_bboxes_labels (np.int64) (optional)
    - gt_ignore_flags (bool) (optional)
    - mix_results (List[dict])

    Modified Keys:

    - img
    - img_shape
    - gt_bboxes (optional)
    - gt_bboxes_labels (optional)
    - gt_ignore_flags (optional)

    Args:
        img_scale (Sequence[int]): Image size after mosaic pipeline of single
            image. The shape order should be (width, height).
            Defaults to (640, 640).
        center_ratio_range (Sequence[float]): Center ratio range of mosaic
            output. Defaults to (0.5, 1.5).
        bbox_clip_border (bool, optional): Whether to clip the objects outside
            the border of the image. In some dataset like MOT17, the gt bboxes
            are allowed to cross the border of images. Therefore, we don't
            need to clip the gt bboxes in these cases. Defaults to True.
        pad_val (int): Pad value. Defaults to 114.
        prob (float): Probability of applying this transformation.
            Defaults to 1.0.
    """

    def __init__(self,
                 img_scale: Tuple[int, int] = (640, 640),
                 center_ratio_range: Tuple[float, float] = (0.5, 1.5),
                 bbox_clip_border: bool = True,
                 pad_val: float = 114.0,
                 prob: float = 1.0) -> None:
        assert isinstance(img_scale, tuple)
        assert 0 <= prob <= 1.0, 'The probability should be in range [0,1]. ' \
                                 f'got {prob}.'

        log_img_scale(img_scale, skip_square=True, shape_order='wh')
        self.img_scale = img_scale
        self.center_ratio_range = center_ratio_range
        self.bbox_clip_border = bbox_clip_border
        self.pad_val = pad_val
        self.prob = prob

    @cache_randomness
    def get_indexes(self, dataset: BaseDataset) -> int:
        """Call function to collect indexes.

        Args:
            dataset (:obj:`MultiImageMixDataset`): The dataset.

        Returns:
            list: indexes.
        """

        indexes = [random.randint(0, len(dataset)) for _ in range(3)]
        return indexes

    @autocast_box_type()
    def transform(self, results: dict) -> dict:
        """Mosaic transform function.

        Args:
            results (dict): Result dict.

        Returns:
            dict: Updated result dict.
        """
        if random.uniform(0, 1) > self.prob:
            return results

        assert 'mix_results' in results
        mosaic_bboxes = []
        mosaic_bboxes_labels = []
        mosaic_ignore_flags = []
        if len(results['img'].shape) == 3:
            mosaic_img = np.full(
                (int(self.img_scale[1] * 2), int(self.img_scale[0] * 2), 3),
                self.pad_val,
                dtype=results['img'].dtype)
        else:
            mosaic_img = np.full(
                (int(self.img_scale[1] * 2), int(self.img_scale[0] * 2)),
                self.pad_val,
                dtype=results['img'].dtype)

        # mosaic center x, y
        center_x = int(
            random.uniform(*self.center_ratio_range) * self.img_scale[0])
        center_y = int(
            random.uniform(*self.center_ratio_range) * self.img_scale[1])
        center_position = (center_x, center_y)

        loc_strs = ('top_left', 'top_right', 'bottom_left', 'bottom_right')
        for i, loc in enumerate(loc_strs):
            if loc == 'top_left':
                results_patch = copy.deepcopy(results)
```

With only mosaic applied, the result looks like this. The mosaic center varies from image to image, but four images are always laid out. Mosaic is a very simple transform that is easy to understand intuitively.
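
The three steps listed for the mosaic transform can be sketched as pure geometry. The code below is a simplified illustration written for this article, not the library's code: `mosaic_layout` is a hypothetical name, it copies no pixels, and it assumes every sub-image has already been resized to fit within `img_scale`.

```python
import random


def mosaic_layout(sub_sizes, img_scale=(640, 640),
                  center_ratio_range=(0.5, 1.5), rng=random):
    """Return (center, placements) for four sub-images on a 2W x 2H canvas.

    sub_sizes: four (w, h) sizes in the order top_left, top_right,
    bottom_left, bottom_right. Each placement is (paste_box, crop_box),
    both given as (x1, y1, x2, y2).
    """
    W, H = img_scale
    # Step 1: sample the mosaic center.
    cx = int(rng.uniform(*center_ratio_range) * W)
    cy = int(rng.uniform(*center_ratio_range) * H)
    names = ("top_left", "top_right", "bottom_left", "bottom_right")
    placements = {}
    for name, (w, h) in zip(names, sub_sizes):
        left = name.endswith("left")
        top = name.startswith("top")
        # Step 2: each image touches the center with one of its corners.
        x1, x2 = (max(cx - w, 0), cx) if left else (cx, min(cx + w, 2 * W))
        y1, y2 = (max(cy - h, 0), cy) if top else (cy, min(cy + h, 2 * H))
        pw, ph = x2 - x1, y2 - y1
        # Step 3: crop the sub-image, keeping the part next to the center.
        crop_x1 = w - pw if left else 0
        crop_y1 = h - ph if top else 0
        crop_box = (crop_x1, crop_y1, crop_x1 + pw, crop_y1 + ph)
        placements[name] = ((x1, y1, x2, y2), crop_box)
    return (cx, cy), placements


center, placements = mosaic_layout([(640, 480)] * 4, rng=random.Random(0))
print(center)
for name, (paste_box, crop_box) in placements.items():
    print(name, paste_box, crop_box)
```

Whatever center is sampled, all four quadrants receive an image, which is why four images are always visible in the visualizations.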

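5. MixUp only

A preprocessing technique proposed in 2017. It linearly blends two images chosen at random from the training dataset and blends their labels in the same way, aiming to make the model generalize better. It was originally developed for image classification but is also used for object detection and segmentation. It helps the model handle transitions between classes, but it also makes the training process harder to interpret, and depending on the dataset it can produce unrealistic images that risk confusing the model.

Place one image in the center and the second, randomly picked image at the top left
Determine the ground-truth label values from the mixing ratio (λ)
1. In a dog/cat dataset (dog label [1, 0], cat label [0, 1]) with λ = 0.6, the blended image gets the label [0.6, 0.4]. The image is generated so that it looks 60% like a dog and 40% like a cat

Looking at an actual image, it is hard to make out, but the woman in front is holding a cat. A cat in a box is also faintly visible. In a sense it is a miraculous image: a cat on top of a cat. The model has to recognize two cats, which looks difficult. In addition, fruit and bowls are scattered around the center left of the image.

Because MixUp is built from two images, each result has to be read with that in mind. Looking closely at the other images also shows two images overlapping. Even MixUp alone produces images that are hard to recognize intuitively. As long as the val results are good, let's not complain.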
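
The dog/cat example can be written out as a minimal sketch of the classification-style blend. The helper name `mix` is hypothetical, and the pixel values are made up for illustration; detection implementations may treat labels differently, for example by keeping the boxes of both images rather than blending them.

```python
def mix(values_a, values_b, lam):
    """Linearly blend two equal-length sequences with ratio lam."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(values_a, values_b)]


dog_label = [1.0, 0.0]
cat_label = [0.0, 1.0]
lam = 0.6

# Labels are blended with the same ratio as the pixels.
mixed_label = mix(dog_label, cat_label, lam)
print(mixed_label)

# The same formula applied to two tiny "images" (flattened pixel lists).
dog_pixels = [200.0, 180.0, 160.0]
cat_pixels = [50.0, 60.0, 70.0]
print(mix(dog_pixels, cat_pixels, lam))
```

With λ = 0.6 the dog contributes 60% and the cat 40% to both the pixels and the label.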

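6. RandomAffine only

RandomAffine applies ordinary geometric transforms such as rotation, so it is not an unusual technique, but in YOLOX scaling_ratio_range=(0.1, 2), which shrinks images to as little as 10% and enlarges them up to 2x. Another notable point is that the transform is applied with 100% probability.

Some bounding boxes in COCO2017 are quite small, so shrinking to 10% may leave some labels only a few pixels in size, and it is easy to predict that those become nearly impossible to recognize. Perhaps (0.1, 2) is simply the setting that gave the best accuracy when scaling_ratio_range was tuned.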
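
To get a feel for how small boxes fare at the 10% end of the range, the sketch below scales a box and checks it against the `min_gt_bbox_wh=(1, 1)` threshold from the pipeline's FilterAnnotations entry. The helper names are hypothetical, and the strict greater-than comparison is an assumption about how such a filter would behave.

```python
def scale_box(box, ratio):
    """Scale an (x1, y1, x2, y2) box about the origin by ratio."""
    return tuple(coord * ratio for coord in box)


def survives_filter(box, min_wh=(1, 1)):
    """Assumed rule: keep a box only if it is wider and taller than min_wh."""
    x1, y1, x2, y2 = box
    return (x2 - x1) > min_wh[0] and (y2 - y1) > min_wh[1]


medium_box = (100, 100, 132, 132)  # 32 x 32 px
small_box = (100, 100, 108, 108)   # 8 x 8 px

for box in (medium_box, small_box):
    shrunk = scale_box(box, 0.1)
    print(box, "->", shrunk, "kept:", survives_filter(shrunk))
```

Under these assumptions a 32 px box shrinks to about 3 px and is still kept as a label, while an 8 px box falls below 1 px and would be filtered out.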

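8. YOLOXHSV only

A preprocessing step that adjusts Hue, Saturation, and Value (brightness). It is meant to make the model more robust to different lighting conditions and color variations.

Original	After the HSV transform

It might be clearer to line up images with Hue varied from weak to strong, like a gradient, but knowing that YOLOX includes a transform that slightly changes Hue, Saturation, and Value seems sufficient, so I will skip that.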
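
For readers who want to see the idea on a single pixel, here is a minimal sketch using Python's standard `colorsys` module. It is not the YOLOXHSVRandomAug implementation: `shift_hsv` and `random_hsv_shift` are hypothetical helpers, values are normalized to [0, 1], and the maximum deltas are illustrative assumptions.

```python
import colorsys
import random


def shift_hsv(rgb, dh=0.0, ds=0.0, dv=0.0):
    """Shift one RGB pixel (floats in [0, 1]) in HSV space."""
    h, s, v = colorsys.rgb_to_hsv(*rgb)
    h = (h + dh) % 1.0               # hue wraps around the color wheel
    s = min(max(s + ds, 0.0), 1.0)   # saturation is clamped
    v = min(max(v + dv, 0.0), 1.0)   # value (brightness) is clamped
    return colorsys.hsv_to_rgb(h, s, v)


def random_hsv_shift(rgb, rng, max_dh=5 / 180, max_ds=30 / 255,
                     max_dv=30 / 255):
    """Apply small random deltas; the maxima are assumed for illustration."""
    return shift_hsv(rgb,
                     rng.uniform(-max_dh, max_dh),
                     rng.uniform(-max_ds, max_ds),
                     rng.uniform(-max_dv, max_dv))


pixel = (0.5, 0.25, 0.25)
print(shift_hsv(pixel, dv=0.1))                   # slightly brighter
print(random_hsv_shift(pixel, random.Random(0)))  # small random jitter
```

Small deltas like these change the color only slightly, which is consistent with the subtle difference between the original and transformed images.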

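Summary

Understanding the transforms used in YOLOX is very important for tuning. On the COCO2017 dataset, these preprocessing steps contribute to higher overall accuracy. That said, there is no guarantee they work as well when transfer learning on a different dataset. Personally, I find MixUp hard to reason about and think it may be worth boldly removing it as an experiment.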
