
[Paper Translation] A Close Reading of Mask R-CNN, Part 1

Last updated at 2022-02-07, posted at 2021-11-16

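Why I translated the paper into Japanese

I take part in Kaggle as a hobby. Many competitions involve instance segmentation, and Mask R-CNN is used so often for it that I translated the paper as a way of studying it. I normally read papers in English, but putting this one into Japanese deepened my understanding, and I hope the result is useful to others who study Mask R-CNN later. Part 1 covers the paper up to Section 3.1. Most of the overview and architecture of Mask R-CNN is described there; Section 4 onward presents the results. Part 2 is in preparation, and I intend to publish it within the year.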

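Before you read

The paper is linked here.
Where I supplement the text with outside information, I have added a marker.
To make the English and Japanese easy to compare within each section, each paragraph starts with a number (for example, ①).
The English original is given alongside each paragraph in a collapsed block so that the two can be compared. Expand it if you are curious. If you find mistranslations or unnatural wording, I would be grateful if you let me know. The English originals carry serial numbers.
Figures are placed in the order in which the text refers to them, so Figure 6, for example, appears before Figure 3.
Reading the Mask R-CNN paper assumes an understanding of R-CNN, Fast R-CNN, and Faster R-CNN.
This article (①) is in English, but it has no difficult vocabulary or sentence structure, and it gives a very readable overview of each technique. Reading the related papers afterwards should deepen your understanding, and I suggest reading this article after that.
For the history (lineage?) of instance segmentation, this Japanese article is a useful reference.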

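A request

I was born and raised in Japan and have studied English since compulsory education; this year I reached a TOEIC score of 890 (still studying...). I have some confidence in my English, but there are surely places where I fall short, so I would appreciate it if you pointed out any mistakes. I tried to keep the translation neither too free nor too literal. The translation and the background research took considerable effort, so if you find the article useful, an LGTM would be much appreciated m(_ _)m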

Abstract

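①We propose a conceptually simple, flexible, and general framework for object instance segmentation. Our method efficiently detects the objects in an image and, at the same time, generates a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch that predicts an object mask in parallel with the existing branch for bounding-box recognition. Mask R-CNN is easy to train, adds only a small overhead to Faster R-CNN, and runs at 5 fps. Moreover, Mask R-CNN generalizes easily to other tasks; for example, it can estimate human poses within the same framework. We report top results in all three tracks of the COCO challenges: instance segmentation, bounding-box object detection, and person keypoint detection. Without any extras, Mask R-CNN outperforms all existing single-model entries on every task, including the COCO 2016 challenge winners. We hope that our simple and effective method will serve as a solid baseline and help future research in instance-level recognition. The code is available at https://github.com/facebookresearch/Detectron.

English original 1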

``` We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without bells and whistles, Mask R-CNN outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition. Code has been made available at: https://github.com/facebookresearch/Detectron. ```

1.Introduction

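①The vision community has rapidly improved object detection and semantic segmentation results in a short period of time. In large part, these advances were driven by powerful baseline systems, such as Fast/Faster R-CNN [12, 36] for object detection and the Fully Convolutional Network (FCN) [30] for semantic segmentation. These methods are conceptually intuitive and offer flexibility and robustness together with fast training and inference. Our goal in this work is to develop a comparably enabling framework for instance segmentation.

English original 2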

``` The vision community has rapidly improved object detection and semantic segmentation results over a short period of time. In large part, these advances have been driven by powerful baseline systems, such as the Fast/Faster RCNN [12, 36] and Fully Convolutional Network (FCN) [30] frameworks for object detection and semantic segmentation, respectively. These methods are conceptually intuitive and offer flexibility and robustness, together with fast training and inference time. Our goal in this work is to develop a comparably enabling framework for instance segmentation. ```

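②Instance segmentation is challenging because it requires correctly detecting every object in an image while also precisely segmenting each instance. It therefore combines elements of two classical computer vision tasks. Object detection aims to classify individual objects and localize each one with a bounding box. Semantic segmentation aims to classify each pixel into a fixed set of categories without distinguishing object instances.¹ Given this, one might expect that a complex method is needed to achieve good results. However, we show that a surprisingly simple, flexible, and fast system can surpass the previous state-of-the-art instance segmentation results.

English original 3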

``` Instance segmentation is challenging because it requires the correct detection of all objects in an image while also precisely segmenting each instance. It therefore combines elements from the classical computer vision tasks of object detection, where the goal is to classify individual objects and localize each using a bounding box, and semantic segmentation, where the goal is to classify each pixel into a fixed set of categories without differentiating object instances.1 Given this, one might expect a complex method is required to achieve good results. However, we show that a surprisingly simple, flexible, and fast system can surpass prior state-of-the-art instance segmentation results. ```

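③Our method, called Mask R-CNN, extends Faster R-CNN [36] by adding a branch that predicts a segmentation mask on each Region of Interest (RoI), in parallel with the existing branch for classification and bounding-box regression (Figure 1).
See the note: About the Mask R-CNN architecture

Figure 1. The Mask R-CNN framework for instance segmentation.

English original 4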

``` Figure 1. The Mask R-CNN framework for instance segmentation. ```

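The mask branch is a small FCN applied to each RoI, and it predicts a segmentation mask pixel by pixel. Given the Faster R-CNN framework, Mask R-CNN is simple to implement and train, which makes a wide range of flexible architecture designs easy. In addition, the mask branch adds only a small computational overhead, which enables a fast system and rapid experimentation.

English original 5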

``` Our method, called Mask R-CNN, extends Faster R-CNN [36] by adding a branch for predicting segmentation masks on each Region of Interest (RoI), in parallel with the existing branch for classification and bounding box regression (Figure 1). The mask branch is a small FCN applied to each RoI, predicting a segmentation mask in a pixel-to-pixel manner. Mask R-CNN is simple to implement and train given the Faster R-CNN framework, which facilitates a wide range of flexible architecture designs. Additionally, the mask branch only adds a small computational overhead, enabling a fast system and rapid experimentation. ```

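④In principle, Mask R-CNN is an intuitive extension of Faster R-CNN, but constructing the mask branch properly is critical for good results. Most importantly, Faster R-CNN was not designed for pixel-to-pixel alignment between network inputs and outputs. This is most evident in how RoIPool [18, 12], the de facto core operation for attending to instances, performs coarse spatial quantization for feature extraction. To fix this misalignment, we propose a simple, quantization-free layer called RoIAlign, which faithfully preserves exact spatial locations. Although it looks like a minor change, RoIAlign has a large impact: it improves mask accuracy by a relative 10% to 50%, with larger gains under stricter localization metrics. Second, we found it essential to decouple mask prediction from class prediction. That is, we predict a binary mask for each class independently, without competition among classes, and rely on the network's RoI classification branch to predict the category. In contrast, FCNs usually perform per-pixel multi-class categorization, which couples segmentation and classification, and in our experiments this works poorly for instance segmentation.

English original 6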

``` In principle Mask R-CNN is an intuitive extension of Faster R-CNN, yet constructing the mask branch properly is critical for good results. Most importantly, Faster RCNN was not designed for pixel-to-pixel alignment between network inputs and outputs. This is most evident in how RoIPool [18, 12], the de facto core operation for attending to instances, performs coarse spatial quantization for feature extraction. To fix the misalignment, we propose a simple, quantization-free layer, called RoIAlign, that faithfully preserves exact spatial locations. Despite being a seemingly minor change, RoIAlign has a large impact: it improves mask accuracy by relative 10% to 50%, showing bigger gains under stricter localization metrics. Second, we found it essential to decouple mask and class prediction: we predict a binary mask for each class independently, without competition among classes, and rely on the network’s RoI classification branch to predict the category. In contrast, FCNs usually perform per-pixel multi-class categorization, which couples segmentation and classification, and based on our experiments works poorly for instance segmentation. ```

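⑤Without any extras, Mask R-CNN surpasses all previous state-of-the-art single-model results on the COCO instance segmentation task [28], including the heavily engineered entries from the 2016 competition winner. As a by-product, our method also excels on the COCO object detection task. In ablation experiments, we evaluate multiple basic instantiations, which lets us demonstrate the method's robustness and analyze the effects of its core factors.

English original 7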

``` Without bells and whistles, Mask R-CNN surpasses all previous state-of-the-art single-model results on the COCO instance segmentation task [28], including the heavily-engineered entries from the 2016 competition winner. As a by-product, our method also excels on the COCO object detection task. In ablation experiments, we evaluate multiple basic instantiations, which allows us to demonstrate its robustness and analyze the effects of core factors. ```

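⑥Our models run at about 200 ms per frame on a GPU, and training on COCO takes one to two days on a single machine with 8 GPUs. We believe that the fast training and test speeds, together with the framework's flexibility and accuracy, will benefit and ease future research on instance segmentation.

English original 8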

``` Our models can run at about 200ms per frame on a GPU, and training on COCO takes one to two days on a single 8-GPU machine. We believe the fast train and test speeds, together with the framework’s flexibility and accuracy, will benefit and ease future research on instance segmentation. ```

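⑦Finally, we show the generality of our framework through human pose estimation on the COCO keypoint dataset [28]. By viewing each keypoint as a one-hot binary mask, Mask R-CNN can be applied, with minimal modification, to detect instance-specific poses. Mask R-CNN surpasses the winner of the 2016 COCO keypoint competition while running at 5 fps. Mask R-CNN can therefore be seen more broadly as a flexible framework for instance-level recognition, and it can readily be extended to more complex tasks.

English original 9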

``` Finally, we showcase the generality of our framework via the task of human pose estimation on the COCO keypoint dataset [28]. By viewing each keypoint as a one-hot binary mask, with minimal modification Mask R-CNN can be applied to detect instance-specific poses. Mask R-CNN surpasses the winner of the 2016 COCO keypoint competition, and at the same time runs at 5 fps. Mask R-CNN, therefore, can be seen more broadly as a flexible framework for instance-level recognition and can be readily extended to more complex tasks. ```
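To make "viewing each keypoint as a one-hot binary mask" concrete, here is a minimal sketch. The function name `one_hot_keypoint_mask`, the grid size, and the keypoint position are hypothetical choices for illustration; the text above does not specify how the authors build their targets.

```python
def one_hot_keypoint_mask(m: int, row: int, col: int) -> list[list[int]]:
    """Return an m x m mask that is 1 only at the keypoint location."""
    if not (0 <= row < m and 0 <= col < m):
        raise ValueError("keypoint lies outside the m x m grid")
    mask = [[0] * m for _ in range(m)]
    mask[row][col] = 1  # exactly one pixel is labeled as foreground
    return mask


# Hypothetical example: a keypoint at row 1, column 2 of a 4 x 4 grid.
target = one_hot_keypoint_mask(4, 1, 2)
for line in target:
    print(line)
```

In this view, one such mask would be predicted per keypoint type, in the same way that the mask branch predicts one mask per class.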

⑧We have released code to facilitate future research.

English original 10

``` We have released code to facilitate future research. ```

Figure 2. Mask R-CNN results on the COCO test set. These results are based on ResNet-101 [19], achieve a mask AP of 35.7, and run at 5 fps. Masks are shown in color, together with the bounding box, category, and confidence.

English original 11

``` Figure 2. Mask R-CNN results on the COCO test set. These results are based on ResNet-101 [19], achieving a mask AP of 35.7 and running at 5 fps. Masks are shown in color, and bounding box, category, and confidences are also shown. ```

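Note 1

Following common terminology, we use "object detection" to mean detection via bounding boxes rather than masks, and "semantic segmentation" to mean per-pixel classification without distinguishing instances. We note, however, that instance segmentation is both semantic and a form of detection.

English original 12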

``` Following common terminology, we use object detection to denote detection via bounding boxes, not masks, and semantic segmentation to denote per-pixel classification without differentiating instances. Yet we note that instance segmentation is both semantic and a form of detection. ```

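About the Mask R-CNN architecture

Figure 1 alone probably does not tell you much. The paper explains the architecture later on, but mostly in words, so it may not click (it did not for me). This article (②) and this article (③) cover the details and illustrate the overall structure.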

2.Related Work

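①**R-CNN:** The Region-based CNN (R-CNN) approach [13] to bounding-box object detection attends to a manageable number of candidate object regions [42, 20] and evaluates a convolutional network [25, 24] independently on each RoI. R-CNN was extended [18, 12] to attend to RoIs on feature maps using RoIPool, which brought higher speed and better accuracy. Faster R-CNN [36] advanced this line of work by learning the attention mechanism with a Region Proposal Network (RPN). Faster R-CNN is flexible and robust to many follow-up improvements (for example, [38, 27, 21]) and is the current leading framework on several benchmarks.
See the note: About the attention mechanism

English original 13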

``` R-CNN: The Region-based CNN (R-CNN) approach [13] to bounding-box object detection is to attend to a manageable number of candidate object regions [42, 20] and evaluate convolutional networks [25, 24] independently on each RoI. R-CNN was extended [18, 12] to allow attending to RoIs on feature maps using RoIPool, leading to fast speed and better accuracy. Faster R-CNN [36] advanced this stream by learning the attention mechanism with a Region Proposal Network (RPN). Faster R-CNN is flexible and robust to many follow-up improvements (e.g., [38, 27, 21]), and is the current leading framework in several benchmarks. ```

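②**Instance Segmentation:** Driven by the effectiveness of R-CNN, many approaches to instance segmentation are based on segment proposals. Earlier methods [13, 15, 16, 9] relied on bottom-up segments [42, 2]. DeepMask [33] and the works that followed [34, 8] learn to propose segment candidates, which are then classified by Fast R-CNN. In these methods segmentation precedes recognition, which is slow and less accurate. Likewise, Dai et al. [10] proposed a complex multi-stage cascade that predicts segment proposals from bounding-box proposals and then classifies them. In contrast, our method predicts masks and class labels in parallel, which is simpler and more flexible.

English original 14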

``` Instance Segmentation: Driven by the effectiveness of RCNN, many approaches to instance segmentation are based on segment proposals. Earlier methods [13, 15, 16, 9] resorted to bottom-up segments [42, 2]. DeepMask [33] and following works [34, 8] learn to propose segment candidates, which are then classified by Fast R-CNN. In these methods, segmentation precedes recognition, which is slow and less accurate. Likewise, Dai et al. [10] proposed a complex multiple-stage cascade that predicts segment proposals from bounding-box proposals, followed by classification. Instead, our method is based on parallel prediction of masks and class labels, which is simpler and more flexible. ```

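③Most recently, Li et al. [26] combined the segment proposal system of [8] with the object detection system of [11] for "fully convolutional instance segmentation" (FCIS). The idea common to [8, 11, 26] is to predict a set of position-sensitive output channels fully convolutionally. These channels address object classes, boxes, and masks simultaneously, which makes the system fast. However, FCIS exhibits systematic errors on overlapping instances and creates spurious edges (Figure 6), which shows that it is challenged by the fundamental difficulties of segmenting instances.

English original 15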

``` Most recently, Li et al. [26] combined the segment proposal system in [8] and object detection system in [11] for “fully convolutional instance segmentation” (FCIS). The common idea in [8, 11, 26] is to predict a set of position-sensitive output channels fully convolutionally. These channels simultaneously address object classes, boxes, and masks, making the system fast. But FCIS exhibits systematic errors on overlapping instances and creates spurious edges (Figure 6), showing that it is challenged by the fundamental difficulties of segmenting instances. ```

Figure 6. FCIS+++ [26] (top) vs. Mask R-CNN (bottom, ResNet-101-FPN). FCIS exhibits systematic artifacts on overlapping objects.

English original 16

``` Figure 6. FCIS+++ [26] (top) vs. Mask R-CNN (bottom, ResNet-101-FPN). FCIS exhibits systematic artifacts on overlapping objects. ```

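④Another family of solutions to instance segmentation [23, 4, 3, 29] is driven by the success of semantic segmentation. Starting from per-pixel classification results (for example, FCN outputs), these methods attempt to cut the pixels of the same category into different instances. In contrast to the segmentation-first strategy of these methods, Mask R-CNN is based on an instance-first strategy. We expect that a deeper integration of both strategies will be studied in the future.

English original 17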

``` Another family of solutions [23, 4, 3, 29] to instance segmentation are driven by the success of semantic segmentation. Starting from per-pixel classification results (e.g., FCN outputs), these methods attempt to cut the pixels of the same category into different instances. In contrast to the segmentation-first strategy of these methods, Mask R-CNN is based on an instance-first strategy. We expect a deeper incorporation of both strategies will be studied in the future. ```

About the attention mechanism

Please see the Q&A on whether the RPN is an attention mechanism.

3.Mask R-CNN

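①Mask R-CNN is conceptually simple. Faster R-CNN has two outputs for each candidate object, a class label and a bounding-box offset, and to these we add a third branch that outputs the object mask. Mask R-CNN is thus a natural and intuitive idea. However, the additional mask output is distinct from the class and box outputs, and it requires extracting a much finer spatial layout of the object. Next, we introduce the key elements of Mask R-CNN, including pixel-to-pixel alignment, which is the main piece missing from Fast/Faster R-CNN.

English original 18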

``` Mask R-CNN is conceptually simple: Faster R-CNN has two outputs for each candidate object, a class label and a bounding-box offset; to this we add a third branch that outputs the object mask. Mask R-CNN is thus a natural and intuitive idea. But the additional mask output is distinct from the class and box outputs, requiring extraction of much finer spatial layout of an object. Next, we introduce the key elements of Mask R-CNN, including pixel-to-pixel alignment, which is the main missing piece of Fast/Faster R-CNN. ```

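②**Faster R-CNN:** We begin by briefly reviewing the Faster R-CNN detector [36]. Faster R-CNN consists of two stages. The first stage, called a Region Proposal Network (RPN), proposes candidate object bounding boxes. The second stage, which is in essence Fast R-CNN [12], extracts features from each candidate box using RoIPool and performs classification and bounding-box regression. The features used by the two stages can be shared for faster inference. We refer readers to [21] for the latest comprehensive comparisons between Faster R-CNN and other frameworks.

English original 19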

``` Faster R-CNN: We begin by briefly reviewing the Faster R-CNN detector [36]. Faster R-CNN consists of two stages. The first stage, called a Region Proposal Network (RPN), proposes candidate object bounding boxes. The second stage, which is in essence Fast R-CNN [12], extracts features using RoIPool from each candidate box and performs classification and bounding-box regression. The features used by both stages can be shared for faster inference. We refer readers to [21] for latest, comprehensive comparisons between Faster R-CNN and other frameworks. ```

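③**Mask R-CNN:** Mask R-CNN adopts the same two-stage procedure, with an identical first stage (the RPN). In the second stage, in parallel with predicting the class and box offset, Mask R-CNN also outputs a binary mask for each RoI. This contrasts with most recent systems, in which classification depends on the mask predictions (for example, [33, 10, 26]). Our approach follows the spirit of Fast R-CNN [12], which applies bounding-box classification and regression in parallel (and which turned out to greatly simplify the multi-stage pipeline of the original R-CNN [13]).

English original 20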

``` Mask R-CNN: Mask R-CNN adopts the same two-stage procedure, with an identical first stage (which is RPN). In the second stage, in parallel to predicting the class and box offset, Mask R-CNN also outputs a binary mask for each RoI. This is in contrast to most recent systems, where classification depends on mask predictions (e.g. [33, 10, 26]). Our approach follows the spirit of Fast R-CNN [12] that applies bounding-box classification and regression in parallel (which turned out to largely simplify the multi-stage pipeline of original R-CNN [13]). ```

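④Formally, during training we define a multi-task loss on each sampled RoI as

$$ L = L_{cls} + L_{box} + L_{mask} $$

The classification loss $L_{cls}$ and the bounding-box loss $L_{box}$ are identical to those defined in [12]. The mask branch has a $Km^2$-dimensional output for each RoI, which encodes $K$ binary masks of resolution $m \times m$, one for each of the $K$ classes. To this output we apply a per-pixel sigmoid, and we define $L_{mask}$ as the average binary cross-entropy loss. For an RoI associated with ground-truth class $k$, $L_{mask}$ is defined only on the $k$-th mask (the other mask outputs do not contribute to the loss).

English original 21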

``` Formally, during training, we define a multi-task loss on each sampled RoI as L = L_cls + L_box + L_mask. The classification loss L_cls and bounding-box loss L_box are identical as those defined in [12]. The mask branch has a Km^2-dimensional output for each RoI, which encodes K binary masks of resolution m × m, one for each of the K classes. To this we apply a per-pixel sigmoid, and define L_mask as the average binary cross-entropy loss. For an RoI associated with ground-truth class k, L_mask is only defined on the k-th mask (other mask outputs do not contribute to the loss). ```
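The definition of the mask loss can be sketched in a few lines of plain Python. This is an illustration only, not the authors' implementation: the nested-list layout (`mask_logits[class][row][col]`), the function name `mask_loss`, and the numerically stable form of the cross-entropy are my own assumptions.

```python
import math


def mask_loss(mask_logits, target, k):
    """Average per-pixel binary cross-entropy on the k-th mask only.

    mask_logits: K masks, each an m x m nested list of raw scores (logits).
    target:      m x m nested list of 0/1 ground-truth mask values.
    k:           index of the RoI's ground-truth class.
    """
    total, count = 0.0, 0
    for logit_row, target_row in zip(mask_logits[k], target):
        for z, t in zip(logit_row, target_row):
            # Stable form of -[t*log(sigmoid(z)) + (1-t)*log(1-sigmoid(z))].
            total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count


# Hypothetical example: K = 3 classes, 2 x 2 masks, all logits zero.
K, m = 3, 2
mask_logits = [[[0.0] * m for _ in range(m)] for _ in range(K)]
target = [[1, 0], [0, 1]]
loss = mask_loss(mask_logits, target, k=1)
print(loss)
```

With all logits at zero every pixel has probability 0.5, so the loss is log 2. Changing the logits of any class other than `k` leaves the value unchanged, which is the "other mask outputs do not contribute" property.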

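⑤Our definition of $L_{mask}$ allows the network to generate a mask for every class without competition among classes; we rely on the dedicated classification branch to predict the class label used to select the output mask. This decouples mask prediction from class prediction. It differs from the common practice when applying FCNs [30] to semantic segmentation, which typically uses a per-pixel softmax and a multinomial cross-entropy loss. In that case, masks across classes compete. In our case, with a per-pixel sigmoid and a binary loss, they do not. We show by experiments that this formulation is key to good instance segmentation results.

English original 22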

``` Our definition of Lmask allows the network to generate masks for every class without competition among classes; we rely on the dedicated classification branch to predict the class label used to select the output mask. This decouples mask and class prediction. This is different from common practice when applying FCNs [30] to semantic segmentation, which typically uses a per-pixel softmax and a multinomial cross-entropy loss. In that case, masks across classes compete; in our case, with a per-pixel sigmoid and a binary loss, they do not. We show by experiments that this formulation is key for good instance segmentation results. ```
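The difference between the two formulations can be checked numerically for a single pixel. The scores below are made-up values for illustration; the only point is that a softmax couples the classes while per-class sigmoids do not.

```python
import math


def softmax(scores):
    """Per-pixel softmax: class probabilities sum to 1, so classes compete."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoids(scores):
    """Per-pixel sigmoid: each class probability depends on its own score only."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]


# Hypothetical logits of one pixel for K = 3 classes, before and after
# the score of class 1 is raised.
before = [2.0, 2.0, -1.0]
after = [2.0, 5.0, -1.0]

coupled_before, coupled_after = softmax(before), softmax(after)
independent_before, independent_after = sigmoids(before), sigmoids(after)

print(coupled_before[0], coupled_after[0])          # class 0 drops under softmax
print(independent_before[0], independent_after[0])  # class 0 is unchanged
```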

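⑥**Mask Representation:** A mask encodes the spatial layout of an input object. Thus, unlike class labels or box offsets, which are inevitably collapsed into short output vectors by fully-connected (fc) layers, the spatial structure of a mask can be extracted naturally through the pixel-to-pixel correspondence that convolutions provide.

English original 23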

``` Mask Representation: A mask encodes an input object’s spatial layout. Thus, unlike class labels or box offsets that are inevitably collapsed into short output vectors by fully-connected (fc) layers, extracting the spatial structure of masks can be addressed naturally by the pixel-to-pixel correspondence provided by convolutions. ```

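⑦Specifically, we predict an $m \times m$ mask from each RoI using an FCN [30]. This allows each layer in the mask branch to maintain the explicit $m \times m$ spatial layout of the object without collapsing it into a vector representation that lacks spatial dimensions. Unlike previous methods that rely on fc layers for mask prediction [33, 34, 10], our fully convolutional representation requires fewer parameters and is more accurate, as the experiments demonstrate.

English original 24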

``` Specifically, we predict an m × m mask from each RoI using an FCN [30]. This allows each layer in the mask branch to maintain the explicit m × m object spatial layout without collapsing it into a vector representation that lacks spatial dimensions. Unlike previous methods that resort to fc layers for mask prediction [33, 34, 10], our fully convolutional representation requires fewer parameters, and is more accurate as demonstrated by experiments. ```

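⑧This pixel-to-pixel behavior requires our RoI features, which are themselves small feature maps, to be well aligned so that they faithfully preserve the explicit per-pixel spatial correspondence. This motivated us to develop the RoIAlign layer described next, which plays a key role in mask prediction.

English original 25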

``` This pixel-to-pixel behavior requires our RoI features, which themselves are small feature maps, to be well aligned to faithfully preserve the explicit per-pixel spatial correspondence. This motivated us to develop the following RoIAlign layer that plays a key role in mask prediction. ```

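⑨**RoIAlign:** RoIPool [12] is a standard operation for extracting a small feature map (for example, $7 \times 7$) from each RoI. RoIPool first quantizes a floating-point RoI to the discrete granularity of the feature map. The quantized RoI is then subdivided into spatial bins, which are themselves quantized, and finally the feature values covered by each bin are aggregated (usually by max pooling). Quantization is performed, for example, on a continuous coordinate $x$ by computing $[x/16]$, where 16 is the feature-map stride and $[\cdot]$ is rounding. Quantization is likewise performed when dividing into bins (for example, $7 \times 7$). These quantizations introduce misalignments between the RoI and the extracted features. This may not affect classification, which is robust to small translations, but it has a large negative effect on predicting pixel-accurate masks.
See the note: About the Gauss bracket

English original 26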

``` RoIAlign: RoIPool [12] is a standard operation for extracting a small feature map (e.g., 7×7) from each RoI. RoIPool first quantizes a floating-number RoI to the discrete granularity of the feature map, this quantized RoI is then subdivided into spatial bins which are themselves quantized, and finally feature values covered by each bin are aggregated (usually by max pooling). Quantization is performed, e.g., on a continuous coordinate x by computing [x/16], where 16 is a feature map stride and [·] is rounding; likewise, quantization is performed when dividing into bins (e.g., 7×7). These quantizations introduce misalignments between the RoI and the extracted features. While this may not impact classification, which is robust to small translations, it has a large negative effect on predicting pixel-accurate masks. ```
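As a small worked example of the misalignment, the sketch below quantizes one coordinate with stride 16 and measures how far the rounded cell is from the true position. The round-half-up rule and the sample coordinate 180 are my assumptions; the paper only says that $[\cdot]$ is rounding.

```python
import math

STRIDE = 16  # feature-map stride used in the text


def quantize(x: float) -> int:
    """[x/16]: map a continuous image coordinate to a feature-map cell index."""
    return math.floor(x / STRIDE + 0.5)  # round half up (assumed rounding rule)


def misalignment_in_pixels(x: float) -> float:
    """Distance, in input-image pixels, between the true and the quantized position."""
    return abs(quantize(x) - x / STRIDE) * STRIDE


x = 180.0
print(x / STRIDE, quantize(x), misalignment_in_pixels(x))  # 11.25 11 4.0
```

Under this rounding rule a single quantization step can shift a coordinate by up to half the stride, that is, 8 input pixels, and RoIPool quantizes again when it splits the RoI into bins.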

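⑩To address this, we propose an RoIAlign layer that removes the harsh quantization of RoIPool and properly aligns the extracted features with the input. The change we propose is simple: we avoid any quantization of the RoI boundaries or bins (that is, we use $x/16$ instead of $[x/16]$). We use bilinear interpolation [22] to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and we aggregate the results (using max or average). See Figure 3 for details. We note that the results are not sensitive to the exact sampling locations or to the number of sampled points, as long as no quantization is performed.

English original 27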

``` To address this, we propose an RoIAlign layer that removes the harsh quantization of RoIPool, properly aligning the extracted features with the input. Our proposed change is simple: we avoid any quantization of the RoI boundaries or bins (i.e., we use x/16 instead of [x/16]). We use bilinear interpolation [22] to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and aggregate the result (using max or average), see Figure 3 for details. We note that the results are not sensitive to the exact sampling locations, or how many points are sampled, as long as no quantization is performed. ```
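The description above can be turned into a short, unoptimized sketch. Several details are my assumptions rather than statements from the paper: feature values are taken to sit at integer grid coordinates, the sample points are placed at the centers of a regular 2 x 2 sub-grid in each bin, and the results are averaged. The names `bilinear` and `roi_align` are illustrative.

```python
def bilinear(feature, y, x):
    """Interpolate a 2D feature map at a continuous location (y, x)."""
    h, w = len(feature), len(feature[0])
    y = min(max(y, 0.0), h - 1.0)  # clamp to the valid grid
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feature, roi, out_size=2, samples=2, stride=16.0):
    """Average-pool bilinear samples per bin without quantizing any coordinate."""
    x1, y1, x2, y2 = (c / stride for c in roi)  # x/16, not [x/16]
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    output = []
    for by in range(out_size):
        row = []
        for bx in range(out_size):
            values = []
            for sy in range(samples):
                for sx in range(samples):
                    y = y1 + (by + (sy + 0.5) / samples) * bin_h
                    x = x1 + (bx + (sx + 0.5) / samples) * bin_w
                    values.append(bilinear(feature, y, x))
            row.append(sum(values) / len(values))
        output.append(row)
    return output


# Hypothetical 8 x 8 feature map whose value equals its column index.
feature = [[float(col) for col in range(8)] for _ in range(8)]
aligned = roi_align(feature, roi=(16.0, 16.0, 80.0, 80.0))
shifted = roi_align(feature, roi=(20.0, 20.0, 84.0, 84.0))
print(aligned)  # [[2.0, 4.0], [2.0, 4.0]]
print(shifted)  # [[2.25, 4.25], [2.25, 4.25]]
```

Shifting the RoI by 4 input pixels (a quarter of a feature cell) shifts every output by 0.25, so the sub-cell offset survives. With RoIPool-style rounding both RoIs would map to the same cells.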

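⑪As we show in §4.2, RoIAlign leads to large improvements. We also compare it with the RoIWarp operation proposed in [10]. Unlike RoIAlign, RoIWarp overlooked the alignment issue and was implemented in [10] so that it quantizes the RoI just as RoIPool does. So even though RoIWarp also adopts bilinear resampling motivated by [22], it performs on par with RoIPool in our experiments (see Table 2c for details), which demonstrates the crucial role of alignment.

English original 28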

``` RoIAlign leads to large improvements as we show in §4.2. We also compare to the RoIWarp operation proposed in [10]. Unlike RoIAlign, RoIWarp overlooked the alignment issue and was implemented in [10] as quantizing RoI just like RoIPool. So even though RoIWarp also adopts bilinear resampling motivated by [22], it performs on par with RoIPool as shown by experiments (more details in Table 2c), demonstrating the crucial role of alignment. ```

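⑫**Network Architecture:** To demonstrate the generality of our approach, we instantiate Mask R-CNN with multiple architectures. For clarity, we distinguish between (i) the convolutional backbone architecture used for feature extraction over an entire image, and (ii) the network head for bounding-box recognition (classification and regression) and mask prediction, which is applied separately to each RoI.

English original 29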

``` Network Architecture: To demonstrate the generality of our approach, we instantiate Mask R-CNN with multiple architectures. For clarity, we differentiate between: (i) the convolutional backbone architecture used for feature extraction over an entire image, and (ii) the network head for bounding-box recognition (classification and regression) and mask prediction that is applied separately to each RoI. ```

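Figure 3. **RoIAlign:** The dashed grid represents a feature map, the solid lines an RoI (with 2x2 bins in this example), and the dots the 4 sampling points in each bin. RoIAlign computes the value of each sampling point by bilinear interpolation from the nearby grid points on the feature map. No quantization is performed on any coordinate involved in the RoI, its bins, or the sampling points.

English original 30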

``` Figure 3. RoIAlign: The dashed grid represents a feature map, the solid lines an RoI (with 2×2 bins in this example), and the dots the 4 sampling points in each bin. RoIAlign computes the value of each sampling point by bilinear interpolation from the nearby grid points on the feature map. No quantization is performed on any coordinates involved in the RoI, its bins, or the sampling points. ```

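⑬We describe the backbone architecture using the nomenclature network-depth-features. We evaluate ResNet [19] and ResNeXt [45] networks with a depth of 50 or 101 layers. The original implementation of Faster R-CNN with ResNets [19] extracted features from the final convolutional layer of the 4th stage, which we call C4. A backbone with ResNet-50, for example, is written ResNet-50-C4. This is a common choice, used in [19, 10, 21, 39].

English original 31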

``` We denote the backbone architecture using the nomenclature network-depth-features. We evaluate ResNet [19] and ResNeXt [45] networks of depth 50 or 101 layers. The original implementation of Faster R-CNN with ResNets [19] extracted features from the final convolutional layer of the 4-th stage, which we call C4. This backbone with ResNet-50, for example, is denoted by ResNet-50-C4. This is a common choice used in [19, 10, 21, 39]. ```

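⑭We also explore another, more effective backbone recently proposed by Lin et al. [27], called a Feature Pyramid Network (FPN). FPN uses a top-down architecture with lateral connections to build an in-network feature pyramid from a single-scale input. Faster R-CNN with an FPN backbone extracts RoI features from different levels of the feature pyramid according to their scale, but otherwise the rest of the approach is similar to vanilla ResNet. Using a ResNet-FPN backbone for feature extraction with Mask R-CNN gives excellent gains in both accuracy and speed. For further details on FPN, we refer readers to [27].
See the note: About top-down and bottom-up
See the note: About the meaning of vanilla
See the note: About FPN

English original 32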

``` We also explore another more effective backbone recently proposed by Lin et al. [27], called a Feature Pyramid Network (FPN). FPN uses a top-down architecture with lateral connections to build an in-network feature pyramid from a single-scale input. Faster R-CNN with an FPN backbone extracts RoI features from different levels of the feature pyramid according to their scale, but otherwise the rest of the approach is similar to vanilla ResNet. Using a ResNet-FPN backbone for feature extraction with Mask RCNN gives excellent gains in both accuracy and speed. For further details on FPN, we refer readers to [27]. ```

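⑮For the network head, we closely follow the architectures presented in previous work and add a fully convolutional mask prediction branch to them. Specifically, we extend the Faster R-CNN box heads from the ResNet [19] and FPN [27] papers. Details are shown in Figure 4. The head on the ResNet-C4 backbone includes the 5th stage of ResNet (namely, the 9-layer 'res5' [19]), which is compute-intensive. For FPN, the backbone already includes res5, which allows a more efficient head that uses fewer filters.
See the note: About the fifth stage of ResNet

English original 33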

``` For the network head we closely follow architectures presented in previous work to which we add a fully convolutional mask prediction branch. Specifically, we extend the Faster R-CNN box heads from the ResNet [19] and FPN [27] papers. Details are shown in Figure 4. The head on the ResNet-C4 backbone includes the 5-th stage of ResNet (namely, the 9-layer ‘res5’ [19]), which is compute-intensive. For FPN, the backbone already includes res5 and thus allows for a more efficient head that uses fewer filters. ```

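⑯We note that our mask branches have a straightforward structure. More complex designs have the potential to improve performance, but they are not the focus of this work.

English original 34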

``` We note that our mask branches have a straightforward structure. More complex designs have the potential to improve performance but are not the focus of this work. ```

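Figure 4. **Head Architecture:** We extend two existing Faster R-CNN heads [19, 27]. The left and right panels show the heads for the ResNet C4 and FPN backbones, from [19] and [27] respectively, to which a mask branch is added. Numbers denote spatial resolution and channels. Arrows denote conv, deconv, or fc layers, as can be inferred from context (conv preserves the spatial dimension, while deconv increases it). All convs are 3x3 except the output conv, which is 1x1. Deconvs are 2x2 with stride 2, and we use ReLU [31] in hidden layers.
Left: 'res5' denotes ResNet's fifth stage, which for simplicity we altered so that the first conv operates on a 7x7 RoI with stride 1 (instead of 14x14 with stride 2 as in [19]).
Right: 'x4' denotes a stack of four consecutive convs.
See the note: About head architecture and backbone architecture

English original 35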

``` Figure 4. Head Architecture: We extend two existing Faster RCNN heads [19, 27]. Left/Right panels show the heads for the ResNet C4 and FPN backbones, from [19] and [27], respectively, to which a mask branch is added. Numbers denote spatial resolution and channels. Arrows denote either conv, deconv, or fc layers as can be inferred from context (conv preserves spatial dimension while deconv increases it). All convs are 3×3, except the output conv which is 1×1, deconvs are 2×2 with stride 2, and we use ReLU [31] in hidden layers. Left: ‘res5’ denotes ResNet’s fifth stage, which for simplicity we altered so that the first conv operates on a 7×7 RoI with stride 1 (instead of 14×14 / stride 2 as in [19]). Right: ‘×4’ denotes a stack of four consecutive convs. ```
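The layer sizes in the caption can be checked with the usual output-size formulas for convolution and transposed convolution. This is a sketch under stated assumptions: the 14x14 input resolution is not given in the caption text (I take it as an example value for the FPN head), and a padding of 1 for the 3x3 convs is assumed because the caption says that conv preserves the spatial dimension.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a standard convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Spatial output size of a transposed convolution (deconv)."""
    return (size - 1) * stride - 2 * padding + kernel


size = 14  # assumed RoI feature resolution entering the mask branch
for _ in range(4):  # 'x4': four consecutive 3x3 convs
    size = conv_out(size, kernel=3, padding=1)
after_convs = size
size = deconv_out(size, kernel=2, stride=2)  # 2x2 deconv with stride 2
after_deconv = size
size = conv_out(size, kernel=1)  # 1x1 output conv
print(after_convs, after_deconv, size)  # 14 28 28
```

Under these assumptions, the 2x2 stride-2 deconv is the only layer that changes the resolution, and it doubles it.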

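About the Gauss bracket

$[\cdot]$ is written in the Gauss bracket notation. Note that the paper itself describes $[\cdot]$ as rounding.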

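About top-down and bottom-up

A top-down approach works from the whole down to the parts. A bottom-up approach builds the whole up from the parts.
Reference ①: Top-down vs. bottom-up approaches
Reference ②: 人工知能（AI）はこれからが本来の意味での発展に向かう！（3）
Reference ③: PoseResnet : トップダウンで骨格検出を行う機械学習モデル

About the meaning of vanilla

The word apparently carries the sense of "plain" or "ordinary". "Vanilla ResNet" therefore means "the standard ResNet, without any modification".
Reference ①: vanillaってどんな意味？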

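About FPN

FPN has a pyramid shape, as in the figure excerpted from [27]. The idea seems to be that prediction is performed at each level on a representation of the image at a different scale.

About head architecture and backbone architecture

Roughly speaking, this paper splits the network into the backbone on the input side and the head on the output side.
For details, see "How do backbone and head architecture work in Mask R-CNN?".

About the fifth stage of ResNet

This refers to the nine layers of the res5 stage in [19].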

3.1Implementation Details

①We set the hyper-parameters following existing Fast/Faster R-CNN work [12, 36, 27]. Although these choices were made for object detection in the original papers [12, 36, 27], we found that our instance segmentation system is robust to them.

English original 36

``` We set hyper-parameters following existing Fast/Faster R-CNN work [12, 36, 27]. Although these decisions were made for object detection in original papers [12, 36, 27], we found our instance segmentation system is robust to them. ```

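②**Training:** As in Fast R-CNN, an RoI is considered positive if it has an IoU of at least 0.5 with a ground-truth box, and negative otherwise. The mask loss $L_{mask}$ is defined only on positive RoIs. The mask target is the intersection between an RoI and its associated ground-truth mask.

English original 37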

``` Training: As in Fast R-CNN, an RoI is considered positive if it has IoU with a ground-truth box of at least 0.5 and negative otherwise. The mask loss Lmask is defined only on positive RoIs. The mask target is the intersection between an RoI and its associated ground-truth mask. ```
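The positive/negative rule can be sketched as follows. Boxes are written as `(x1, y1, x2, y2)` with continuous coordinates, which is my convention for the example; the function names and the sample boxes are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_roi(roi, gt_boxes, threshold=0.5):
    """Positive if the best IoU with any ground-truth box is at least 0.5."""
    best = max((iou(roi, gt) for gt in gt_boxes), default=0.0)
    return "positive" if best >= threshold else "negative"


gt_boxes = [(0.0, 0.0, 10.0, 10.0)]
print(label_roi((0.0, 0.0, 10.0, 5.0), gt_boxes))    # IoU 0.5   -> positive
print(label_roi((5.0, 5.0, 15.0, 15.0), gt_boxes))   # IoU 25/175 -> negative
```

Only RoIs labeled positive would contribute to $L_{mask}$.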

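③We adopt image-centric training [12]. Images are resized so that their scale (shorter edge) is 800 pixels [27]. Each mini-batch has 2 images per GPU, and each image has $N$ sampled RoIs with a 1:3 ratio of positives to negatives [12]. $N$ is 64 for the C4 backbone (as in [12, 36]) and 512 for FPN (as in [27]). We train on 8 GPUs (so the effective mini-batch size is 16) for 160k iterations, with a learning rate of 0.02 that is decreased by a factor of 10 at the 120k-th iteration. We use a weight decay of 0.0001 and a momentum of 0.9. With ResNeXt [45], we train with 1 image per GPU and the same number of iterations, with a starting learning rate of 0.01.

English original 38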

``` We adopt image-centric training [12]. Images are resized such that their scale (shorter edge) is 800 pixels [27]. Each mini-batch has 2 images per GPU and each image has N sampled RoIs, with a ratio of 1:3 of positive to negatives [12]. N is 64 for the C4 backbone (as in [12, 36]) and 512 for FPN (as in [27]). We train on 8 GPUs (so effective minibatch size is 16) for 160k iterations, with a learning rate of 0.02 which is decreased by 10 at the 120k iteration. We use a weight decay of 0.0001 and momentum of 0.9. With ResNeXt [45], we train with 1 image per GPU and the same number of iterations, with a starting learning rate of 0.01. ```
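The schedule in this paragraph amounts to a single step decay. The sketch below restates the numbers from the text; whether the drop applies from iteration 120k itself or from the one after is not stated, so the `>=` comparison is an assumption, and the names are illustrative.

```python
def learning_rate(iteration: int, base_lr: float = 0.02,
                  decay_at: int = 120_000, factor: float = 0.1) -> float:
    """Step schedule: the base rate until decay_at, then one tenth of it."""
    return base_lr * factor if iteration >= decay_at else base_lr


NUM_GPUS = 8
IMAGES_PER_GPU = 2
effective_batch = NUM_GPUS * IMAGES_PER_GPU  # 16 images per iteration

# Positives and negatives among N = 64 sampled RoIs at a 1:3 ratio (C4 backbone).
positives_c4 = 64 // 4
negatives_c4 = 64 - positives_c4

print(effective_batch, learning_rate(0), learning_rate(150_000))
print(positives_c4, negatives_c4)  # 16 48
```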

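④The RPN anchors span 5 scales and 3 aspect ratios, following [27]. For convenient ablation, the RPN is trained separately and does not share features with Mask R-CNN unless specified otherwise. For every entry in this paper, the RPN and Mask R-CNN have the same backbone, so their features are shareable.
See the note: About anchor boxes

English original 39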

``` The RPN anchors span 5 scales and 3 aspect ratios, following [27]. For convenient ablation, RPN is trained separately and does not share features with Mask R-CNN, unless specified. For every entry in this paper, RPN and Mask R-CNN have the same backbones and so they are shareable. ```
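To illustrate what "5 scales and 3 aspect ratios" produces, the sketch below enumerates the 15 anchor shapes. The concrete values are assumptions on my part: the text above gives only the counts, and the scales 32 to 512 and ratios 1:2, 1:1, 2:1 are the values I recall from the FPN paper [27], so please check them against that paper.

```python
import math


def make_anchor_shapes(scales=(32, 64, 128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (scale, ratio, width, height) for every scale/ratio pair.

    Each anchor keeps the area scale**2; ratio is height / width.
    """
    shapes = []
    for scale in scales:
        area = float(scale * scale)
        for ratio in ratios:
            width = math.sqrt(area / ratio)
            height = width * ratio
            shapes.append((scale, ratio, width, height))
    return shapes


anchor_shapes = make_anchor_shapes()
print(len(anchor_shapes))  # 15
for scale, ratio, width, height in anchor_shapes[:3]:
    print(scale, ratio, round(width, 1), round(height, 1))
```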

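⑤**Inference:** At test time, the number of proposals is 300 for the C4 backbone (as in [36]) and 1000 for FPN (as in [27]). We run the box prediction branch on these proposals and then apply non-maximum suppression [14]. The mask branch is then applied to the 100 highest-scoring detection boxes. Although this differs from the parallel computation used in training, it speeds up inference and improves accuracy (because fewer, more accurate RoIs are used). The mask branch can predict $K$ masks per RoI, but we use only the $k$-th mask, where $k$ is the class predicted by the classification branch. The $m \times m$ floating-point mask output is then resized to the RoI size and binarized at a threshold of 0.5.
See the note: About non-maximum suppression

English original 40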

``` Inference: At test time, the proposal number is 300 for the C4 backbone (as in [36]) and 1000 for FPN (as in [27]). We run the box prediction branch on these proposals, followed by non-maximum suppression [14]. The mask branch is then applied to the highest scoring 100 detection boxes. Although this differs from the parallel computation used in training, it speeds up inference and improves accuracy (due to the use of fewer, more accurate RoIs). The mask branch can predict K masks per RoI, but we only use the k-th mask, where k is the predicted class by the classification branch. The m×m floating-number mask output is then resized to the RoI size, and binarized at a threshold of 0.5. ```
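The post-processing steps in this paragraph can be sketched as below. This is a simplified illustration: the NMS IoU threshold of 0.5 is my assumption (the text does not give one), the greedy NMS shown is the textbook variant, and the resizing of the mask to the RoI size is omitted.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5, top_k=100):
    """Greedy NMS; returns indices of kept boxes, best first, at most top_k."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep[:top_k]


def select_mask(mask_probs, predicted_class, threshold=0.5):
    """Keep only the mask of the predicted class k and binarize it."""
    return [[1 if p >= threshold else 0 for p in row]
            for row in mask_probs[predicted_class]]


# Hypothetical detections: the second box overlaps the first heavily.
boxes = [(0.0, 0.0, 10.0, 10.0), (1.0, 1.0, 11.0, 11.0), (20.0, 20.0, 30.0, 30.0)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # [0, 2]

# Hypothetical K = 2 mask probabilities for one RoI (1 x 2 pixels each).
mask_probs = [[[0.2, 0.9]], [[0.6, 0.4]]]
print(select_mask(mask_probs, predicted_class=1))  # [[1, 0]]
```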

⑥Note that because we compute masks only on the top 100 detection boxes, Mask R-CNN adds only a small overhead to its Faster R-CNN counterpart (for example, about 20% on typical models).

English original 41

``` Note that since we only compute masks on the top 100 detection boxes, Mask R-CNN adds a small overhead to its Faster R-CNN counterpart (e.g., ∼20% on typical models). ```

Figure 5. More results of Mask R-CNN on COCO test images, using ResNet-101-FPN and running at 5 fps, with 35.7 mask AP (Table 1).

English original 42

``` Figure 5. More results of Mask R-CNN on COCO test images, using ResNet-101-FPN and running at 5 fps, with 35.7 mask AP (Table 1) ```

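Table 1. Instance segmentation mask AP on COCO test-dev. MNC [10] and FCIS [26] are the winners of the COCO 2015 and 2016 segmentation challenges, respectively. Without any extras, Mask R-CNN outperforms the more complex FCIS+++, which includes multi-scale training and testing, horizontal-flip testing, and OHEM [38]. All entries are single-model results.

English original 43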

``` Table 1. Instance segmentation mask AP on COCO test-dev. MNC [10] and FCIS [26] are the winners of the COCO 2015 and 2016 segmentation challenges, respectively. Without bells and whistles, Mask R-CNN outperforms the more complex FCIS+++, which includes multi-scale train/test, horizontal flip test, and OHEM [38]. All entries are single-model results. ```

About anchor boxes

An explanation is given in 「ディープラーニングによる一般物体検出(4) – Faster R-CNN」.

About non-maximum suppression

A detailed explanation is given in 「Non-Maximum Suppressionを世界一わかりやすく解説する」.
