This is a paragraph-by-paragraph translation of "YOLACT++: Better Real-time Instance Segmentation" (arXiv, 3 Dec 2019), originally made into Japanese with Google Translate. It is mostly literal.

Abstract

We present a simple, fully-convolutional model for real-time (> 30 fps) instance segmentation that achieves competitive results on MS COCO evaluated on a single Titan Xp, which is significantly faster than any previous state-of-the-art approach. Moreover, we obtain this result after training on only one GPU. We accomplish this by breaking instance segmentation into two parallel subtasks: (1) generating a set of prototype masks and (2) predicting per-instance mask coefficients. Then we produce instance masks by linearly combining the prototypes with the mask coefficients. We find that because this process doesn’t depend on repooling, this approach produces very high-quality masks and exhibits temporal stability for free. Furthermore, we analyze the emergent behavior of our prototypes and show they learn to localize instances on their own in a translation variant manner, despite being fully-convolutional. We also propose Fast NMS, a drop-in 12 ms faster replacement for standard NMS that only has a marginal performance penalty. Finally, by incorporating deformable convolutions into the backbone network, optimizing the prediction head with better anchor scales and aspect ratios, and adding a novel fast mask re-scoring branch, our YOLACT++ model can achieve 34.1 mAP on MS COCO at 33.5 fps, which is fairly close to the state-of-the-art approaches while still running at real-time.

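We present a simple, fully convolutional model for real-time (> 30 fps) instance segmentation that achieves competitive results on MS COCO when evaluated on a single Titan Xp, significantly faster than any previous state-of-the-art approach. Moreover, we obtain this result after training on only one GPU. We accomplish this by breaking instance segmentation into two parallel subtasks:
(1) generating a set of prototype masks
(2) predicting per-instance mask coefficients
We then produce instance masks by linearly combining the prototypes with the mask coefficients. Because this process does not depend on repooling, the approach produces very high-quality masks and exhibits temporal stability at no extra cost. Furthermore, we analyze the emergent behavior of our prototypes and show that they learn to localize instances on their own in a translation-variant manner, despite being fully convolutional. We also propose Fast NMS, a drop-in replacement for standard NMS that is 12 ms faster with only a marginal accuracy penalty. Finally, by incorporating deformable convolutions into the backbone network, optimizing the prediction head with better anchor scales and aspect ratios, and adding a novel fast mask re-scoring branch, our YOLACT++ model achieves 34.1 mAP on MS COCO at 33.5 fps, which is fairly close to state-of-the-art approaches while still running in real time.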

1 Introduction

“Boxes are stupid anyway though, I’m probably a true believer in masks except I can’t get YOLO to learn them.”
– Joseph Redmon, YOLOv3 [1]

"Boxes are stupid anyway. I am probably a true believer in masks, except that I can't get YOLO to learn them."
– Joseph Redmon, YOLOv3 [1]

What would it take to create a real-time instance segmentation algorithm? Over the past few years, the vision community has made great strides in instance segmentation, in part by drawing on powerful parallels from the well-established domain of object detection. State-of-the-art approaches to instance segmentation like Mask R-CNN [2] and FCIS [3] directly build off of advances in object detection like Faster R-CNN [4] and R-FCN [5]. Yet, these methods focus primarily on performance over speed, leaving the scene devoid of instance segmentation parallels to real-time object detectors like SSD [6] and YOLO [1], [7]. In this work, our goal is to fill that gap with a fast, one-stage instance segmentation model in the same way that SSD and YOLO fill that gap for object detection.

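What would it take to build a real-time instance segmentation algorithm? Over the past few years, the vision community has made great progress in instance segmentation, in part by drawing strong parallels from the well-established field of object detection. State-of-the-art approaches to instance segmentation such as Mask R-CNN [2] and FCIS [3] build directly on advances in object detection such as Faster R-CNN [4] and R-FCN [5]. However, these methods focus mainly on accuracy rather than speed, so instance segmentation has no counterpart to real-time object detectors like SSD [6] and YOLO [1], [7]. In this paper, our goal is to fill that gap with a fast, one-stage instance segmentation model, in the same way that SSD and YOLO filled it for object detection.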

However, instance segmentation is hard—much harder than object detection. One-stage object detectors like SSD and YOLO are able to speed up existing two-stage detectors like Faster R-CNN by simply removing the second stage and making up for the lost performance in other ways. The same approach is not easily extendable, however, to instance segmentation. State-of-the-art two-stage instance segmentation methods depend heavily on feature localization to produce masks. That is, these methods “re-pool” features in some bounding box region (e.g., via RoI-pool/align), and then feed these now localized features to their mask predictor. This approach is inherently sequential and is therefore difficult to accelerate. One-stage methods that perform these steps in parallel like FCIS do exist, but they require significant amounts of post-processing after localization, and thus are still far from real-time.

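However, instance segmentation is hard, much harder than object detection. One-stage object detectors such as SSD and YOLO speed up existing two-stage detectors such as Faster R-CNN simply by removing the second stage and making up for the lost accuracy in other ways. The same approach does not extend easily to instance segmentation. State-of-the-art two-stage instance segmentation methods rely heavily on feature localization to produce masks. That is, they "re-pool" features within a bounding box region (e.g., via RoI-pool/align) and then feed these now-localized features to the mask predictor. This approach is inherently sequential and therefore difficult to accelerate. One-stage methods that perform these steps in parallel, such as FCIS, do exist, but they require significant post-processing after localization and are thus still far from real time.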

To address these issues, we propose YOLACT (You Only Look At CoefficienTs), a real-time instance segmentation framework that forgoes an explicit localization step. Instead, YOLACT breaks up instance segmentation into two parallel tasks: (1) generating a dictionary of non-local prototype masks over the entire image, and (2) predicting a set of linear combination coefficients per instance. Then producing a full-image instance segmentation from these two components is simple: for each instance, linearly combine the prototypes using the corresponding predicted coefficients and then crop with a predicted bounding box. We show that by segmenting in this manner, the network learns how to localize instance masks on its own, where visually, spatially, and semantically similar instances appear different in the prototypes.

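To address these issues, we propose YOLACT (You Only Look At CoefficienTs), a real-time instance segmentation framework that forgoes an explicit localization step. Instead, YOLACT splits instance segmentation into two parallel tasks:
(1) generating a dictionary of non-local prototype masks over the entire image
(2) predicting a set of linear combination coefficients per instance
Producing a full-image instance segmentation from these two components is then simple: for each instance, linearly combine the prototypes using the corresponding predicted coefficients, then crop with the predicted bounding box. We show that by segmenting in this manner, the network learns to localize instance masks on its own, with visually, spatially, and semantically similar instances appearing different in the prototypes.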

Moreover, since the number of prototype masks is independent of the number of categories (e.g., there can be more categories than prototypes), YOLACT learns a distributed representation in which each instance is segmented with a combination of prototypes that are shared across categories. This distributed representation leads to interesting emergent behavior in the prototype space: some prototypes spatially partition the image, some localize instances, some detect instance contours, some encode position-sensitive directional maps (similar to those obtained by hard-coding a position-sensitive module in FCIS [3]), and most do a combination of these tasks (see Figure 5).

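Moreover, since the number of prototype masks is independent of the number of categories (e.g., there can be more categories than prototypes), YOLACT learns a distributed representation in which each instance is segmented with a combination of prototypes shared across categories. This distributed representation leads to interesting emergent behavior in the prototype space: some prototypes spatially partition the image, some localize instances, some detect instance contours, some encode position-sensitive directional maps (similar to those obtained by hard-coding a position-sensitive module in FCIS [3]), and most do a combination of these tasks (see Figure 5).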

This approach also has several practical advantages. First and foremost, it’s fast: because of its parallel structure and extremely lightweight assembly process, YOLACT adds only a marginal amount of computational overhead to a one-stage backbone detector, making it easy to reach 30 fps even when using ResNet-101 [8]; in fact, the entire mask branch takes only ∼5 ms to evaluate. Second, masks are high-quality: since the masks use the full extent of the image space without any loss of quality from repooling, our masks for large objects are significantly higher quality than those of other methods (see Figure 9). Finally, it’s general: the idea of generating prototypes and mask coefficients could be added to almost any modern object detector.

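This approach also has several practical advantages.
1. First and foremost, it is fast. Thanks to its parallel structure and extremely lightweight assembly process, YOLACT adds only marginal computational overhead to a one-stage backbone detector, so it easily reaches 30 fps even with ResNet-101 [8]; in fact, the entire mask branch takes only about 5 ms to evaluate.
2. Second, the masks are high quality: because they use the full extent of the image space without any quality loss from repooling, our masks for large objects are of significantly higher quality than those of other methods (see Figure 9).
3. Finally, it is general: the idea of generating prototypes and mask coefficients can be added to almost any modern object detector.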

Interestingly, breaking up instance segmentation in this way is loosely related to the ventral (“what”) and dorsal (“where”) streams hypothesized to play a prominent role in human vision [9]. The linear coefficients and corresponding detection branch can be thought of as recognizing individual instances (“what”), while the prototype masks can be seen as localizing instances in space (“where”). This is closer to, albeit still far away from, human vision than the two-stage “localize-then-segment” type approaches.

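Interestingly, splitting instance segmentation in this way is loosely related to the visual pathways hypothesized to play a prominent role in human vision: the ventral stream, which recognizes "what" something is, and the dorsal stream, which recognizes "where" it is [9]. The linear coefficients and the corresponding detection branch can be thought of as recognizing individual instances ("what"), while the prototype masks can be seen as localizing instances in space ("where"). This is closer to human vision than two-stage "localize-then-segment" approaches, although still far from it.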

Our main contribution is the first real-time (> 30 fps) instance segmentation algorithm with competitive results on the challenging MS COCO dataset [10]. In addition, we analyze the emergent behavior of YOLACT’s prototypes and provide experiments to study the speed vs. performance trade-offs obtained with different backbone architectures, numbers of prototypes, and image resolutions. We also provide a novel Fast NMS approach that is 12 ms faster than traditional NMS with a negligible performance penalty. To further improve the performance of our model over our conference paper version [11], in Section 6, we propose YOLACT++. Specifically, we incorporate deformable convolutions [12], [13] into the backbone network, which provides more flexible feature sampling and strengthens its capability of handling instances with different scales, aspect ratios, and rotations. Furthermore, we optimize the prediction heads with better anchor scale and aspect ratio choices for larger object recall. Finally, we also introduce a novel fast mask re-scoring branch, which results in a decent performance boost with only marginal speed overhead. These improvements are validated in Tables 5, 6, and 7. Apart from these algorithm improvements over our conference paper [11], we also provide more qualitative results (Figure 8) and real-time bounding box detection results (Table 4).

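Our main contributions are:
1. The first real-time (> 30 fps) instance segmentation algorithm with competitive results on the challenging MS COCO dataset [10] (see Figure 1). In addition, we analyze the emergent behavior of YOLACT's prototypes and provide experiments that study the speed vs. accuracy trade-offs obtained with different backbone architectures, numbers of prototypes, and image resolutions.
2. A novel Fast NMS approach that is 12 ms faster than traditional NMS with a negligible accuracy penalty.
3. YOLACT++, proposed in Section 6 to further improve accuracy over our conference paper version [11]. Specifically, we incorporate deformable convolutions [12], [13] into the backbone network, which provides more flexible feature sampling and strengthens its ability to handle instances with different scales, aspect ratios, and rotations. Furthermore, we optimize the prediction heads with better anchor scale and aspect ratio choices for larger object recall. Finally, we introduce a novel fast mask re-scoring branch, which gives a decent accuracy boost with only marginal speed overhead.
These improvements are validated in Tables 5, 6, and 7. Apart from these algorithmic improvements over our conference paper [11], we also provide more qualitative results (Figure 8) and real-time bounding box detection results (Table 4).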

Fig. 1:
Speed-performance trade-off for various instance segmentation methods on COCO. To our knowledge, ours is the first real-time (above 30 FPS) approach with around 30 mask mAP on COCO test-dev.

Fig. 1:

Speed vs. accuracy trade-off for various instance segmentation methods on COCO. To our knowledge, ours is the first real-time (above 30 FPS) approach that reaches around 30 mask mAP on COCO test-dev.

2 Related Work

Instance Segmentation

Given its importance, a lot of research effort has been made to push instance segmentation accuracy. Mask-RCNN [2] is a representative two-stage instance segmentation approach that first generates candidate region-of-interests (ROIs) and then classifies and segments those ROIs in the second stage. Follow-up works try to improve its accuracy by e.g., enriching the FPN features [14] or addressing the incompatibility between a mask’s confidence score and its localization accuracy [15]. These two-stage methods require re-pooling features for each ROI and processing them with subsequent computations, which make them unable to obtain real-time speeds (30 fps) even when decreasing image size (see Table 2c).

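Given its importance, a great deal of research effort has gone into improving instance segmentation accuracy. Mask-RCNN [2] is a representative two-stage instance segmentation approach that first generates candidate regions of interest (ROIs) and then classifies and segments those ROIs in the second stage. Follow-up works try to improve its accuracy by, for example, enriching the FPN features [14] or addressing the incompatibility between a mask's confidence score and its localization accuracy [15]. These two-stage methods need to re-pool features for each ROI and process them with subsequent computations, so they cannot reach real-time speeds (30 fps) even when the image size is reduced (see Table 2c).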

One-stage instance segmentation methods generate position sensitive maps that are assembled into final masks with position-sensitive pooling [3], [16] or combine semantic segmentation logits and direction prediction logits [17]. Though conceptually faster than two-stage methods, they still require repooling or other non-trivial computations (e.g., mask voting). This severely limits their speed, placing them far from real-time. In contrast, our assembly step is much more lightweight (only a linear combination) and can be implemented as one GPU-accelerated matrix-matrix multiplication, making our approach very fast.

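One-stage instance segmentation methods either generate position-sensitive maps that are assembled into final masks with position-sensitive pooling [3], [16], or combine semantic segmentation logits with direction prediction logits [17]. Although conceptually faster than two-stage methods, they still require repooling or other non-trivial computations (e.g., mask voting). This severely limits their speed and places them far from real time. In contrast, our assembly step is much more lightweight (only a linear combination) and can be implemented as a single GPU-accelerated matrix-matrix multiplication, which makes our approach very fast.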

Finally, some methods first perform semantic segmentation followed by boundary detection [18], pixel clustering [19], [20], or learn an embedding to form instance masks [21], [22], [23], [24]. Again, these methods have multiple stages and/or involve expensive clustering procedures, which limits their viability for real-time applications.

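Finally, some methods first perform semantic segmentation and then apply boundary detection [18] or pixel clustering [19], [20], or learn an embedding to form instance masks [21], [22], [23], [24]. Again, these methods have multiple stages and/or involve expensive clustering procedures, which limits their viability for real-time applications.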

Real-time Instance Segmentation

While real-time object detection [1], [6], [7], [26], and semantic segmentation [27], [28], [29], [30], [31] methods exist, few works have focused on real-time instance segmentation. Straight to Shapes [32] and Box2Pix [33] can perform instance segmentation in real-time (30 fps on Pascal SBD 2012 [34], [35] for Straight to Shapes, and 10.9 fps on Cityscapes [36] and 35 fps on KITTI [37] for Box2Pix), but their accuracies are far from that of modern baselines. While [38] substantially improves instance segmentation accuracy over these prior methods, it runs only at 11 fps on Cityscapes. In fact, Mask R-CNN [2] remains one of the fastest instance segmentation methods on semantically challenging datasets like COCO [10] (13.5 fps on 550 px images; see Table 2c).

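While real-time object detection [1], [6], [7], [26] and semantic segmentation [27], [28], [29], [30], [31] methods exist, few works have focused on real-time instance segmentation. Straight to Shapes [32] and Box2Pix [33] can perform instance segmentation in real time (30 fps on Pascal SBD 2012 [34], [35] for Straight to Shapes, and 10.9 fps on Cityscapes [36] and 35 fps on KITTI [37] for Box2Pix), but their accuracy is far from that of modern baselines. [38] substantially improves instance segmentation accuracy over these prior methods, but runs at only 11 fps on Cityscapes. In fact, Mask R-CNN [2] remains one of the fastest instance segmentation methods on semantically challenging datasets like COCO [10] (13.5 fps on 550 px images; see Table 2c).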

Prototypes

Learning prototypes (aka vocabulary/codebook) has been extensively explored in computer vision. Classical representations include textons [39] and visual words [40], with advances made via sparsity and locality priors [41], [42], [43]. Others have designed prototypes for object detection [44], [45], [46]. Though related, these works use prototypes to represent features, whereas we use them to assemble masks for instance segmentation. Moreover, we learn prototypes that are specific to each image, rather than global prototypes shared across the entire dataset.

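Learning prototypes (also known as a vocabulary or codebook) has been explored extensively in computer vision. Classical representations include textons [39] and visual words [40], with advances made via sparsity and locality priors [41], [42], [43]. Others have designed prototypes for object detection [44], [45], [46]. Although related, these works use prototypes to represent features, whereas we use them to assemble masks for instance segmentation. Moreover, we learn prototypes that are specific to each image, rather than global prototypes shared across the entire dataset.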

3 YOLACT

Our goal is to add a mask branch to an existing one-stage object detection model in the same vein as Mask R-CNN [2] does to Faster R-CNN [4], but without an explicit feature localization step (e.g., feature repooling). To do this, we break up the complex task of instance segmentation into two simpler, parallel tasks that can be assembled to form the final masks. The first branch uses an FCN [47] to produce a set of image-sized “prototype masks” that do not depend on any one instance. The second adds an extra head to the object detection branch to predict a vector of “mask coefficients” for each anchor that encode an instance’s representation in the prototype space. Finally, for each instance that survives NMS, we construct a mask for that instance by linearly combining the work of these two branches.

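Our goal is to add a mask branch to an existing one-stage object detection model, in the same way that Mask R-CNN [2] does for Faster R-CNN [4], but without an explicit feature localization step (e.g., feature repooling). To do this, we break the complex task of instance segmentation into two simpler, parallel tasks whose results can be assembled to form the final masks.
1. The first branch uses an FCN [47] to produce a set of image-sized "prototype masks" that do not depend on any one instance.
2. The second adds an extra head to the object detection branch to predict, for each anchor, a vector of "mask coefficients" that encodes an instance's representation in the prototype space.
Finally, for each instance that survives NMS, we construct its mask by linearly combining the outputs of these two branches.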

Rationale

We perform instance segmentation in this way primarily because masks are spatially coherent; i.e., pixels close to each other are likely to be part of the same instance. While a convolutional (conv) layer naturally takes advantage of this coherence, a fully-connected (fc) layer does not. That poses a problem, since one-stage object detectors produce class and box coefficients for each anchor as an output of an fc layer. (To show that this is an issue, we develop an “fc-mask” model that produces masks for each anchor as the reshaped output of an fc layer. As our experiments in Table 2c show, simply adding masks to a one-stage model as fc outputs only obtains 20.7 mAP and is thus very much insufficient.) Two stage approaches like Mask R-CNN get around this problem by using a localization step (e.g., RoI-Align), which preserves the spatial coherence of the features while also allowing the mask to be a conv layer output. However, doing so requires a significant portion of the model to wait for a first-stage RPN to propose localization candidates, inducing a significant speed penalty.

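We perform instance segmentation in this way mainly because masks are spatially coherent; that is, pixels close to each other are likely to belong to the same instance. A convolutional (conv) layer naturally takes advantage of this coherence, but a fully-connected (fc) layer does not. This poses a problem, since one-stage object detectors produce class and box coefficients for each anchor as the output of an fc layer. (To show that this is an issue, we develop an "fc-mask" model that produces masks for each anchor as the reshaped output of an fc layer. As our experiments in Table 2c show, simply adding masks to a one-stage model as fc outputs obtains only 20.7 mAP and is thus far from sufficient.) Two-stage approaches like Mask R-CNN get around this problem with a localization step (e.g., RoI-Align), which preserves the spatial coherence of the features while also allowing the mask to be a conv layer output. However, doing so requires a significant portion of the model to wait for a first-stage RPN to propose localization candidates, which incurs a significant speed penalty.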

Thus, we break the problem into two parallel parts, making use of fc layers, which are good at producing semantic vectors, and conv layers, which are good at producing spatially coherent masks, to produce the “mask coefficients” and “prototype masks”, respectively. Then, because prototypes and mask coefficients can be computed independently, the computational overhead over that of the backbone detector comes mostly from the assembly step, which can be implemented as a single matrix multiplication. In this way, we can maintain spatial coherence in the feature space while still being one-stage and fast.

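Thus, we break the problem into two parallel parts, using fc layers, which are good at producing semantic vectors, and conv layers, which are good at producing spatially coherent masks, to produce the "mask coefficients" and "prototype masks", respectively. Then, because prototypes and mask coefficients can be computed independently, most of the computational overhead on top of the backbone detector comes from the assembly step, which can be implemented as a single matrix multiplication. In this way, we maintain spatial coherence in the feature space while remaining one-stage and fast.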

3.1 Prototype Generation

The prototype generation branch (protonet) predicts a set of k prototype masks for the entire image. We implement protonet as an FCN whose last layer has k channels (one for each prototype) and attach it to a backbone feature layer (see Figure 3 for an illustration). While this formulation is similar to standard semantic segmentation, it differs in that we exhibit no explicit loss on the prototypes. Instead, all supervision for these prototypes comes from the final mask loss after assembly.

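The prototype generation branch (protonet) predicts a set of k prototype masks for the entire image. We implement protonet as an FCN whose last layer has k channels (one per prototype) and attach it to a backbone feature layer (see Figure 3). This formulation is similar to standard semantic segmentation, but differs in that there is no explicit loss on the prototypes. Instead, all supervision for the prototypes comes from the final mask loss after assembly.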

Fig. 3: Protonet Architecture
The labels denote feature size and channels for an image size of 550 × 550. Arrows indicate 3 × 3 conv layers, except for the final conv which is 1 × 1. The increase in size is an upsample followed by a conv. Inspired by the mask branch in [2].

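Fig. 3: Protonet Architecture

The labels denote the feature size and number of channels for a 550 × 550 input image. Arrows indicate 3 × 3 conv layers, except for the final conv, which is 1 × 1. Each increase in size is an upsample followed by a conv. Inspired by the mask branch in [2].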

We note two important design choices: taking protonet from deeper backbone features produces more robust masks, and higher resolution prototypes result in both higher quality masks and better performance on smaller objects. Thus, we use FPN [48] because its largest feature layers (P3 in our case; see Figure 2) are the deepest. Then, we upsample it to one fourth the dimensions of the input image to increase performance on small objects.

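We note two important design choices:
1. taking protonet from deeper backbone features produces more robust masks, and
2. higher-resolution prototypes yield both higher-quality masks and better accuracy on smaller objects.
Thus,
1. we use FPN [48], because its largest feature layers (P3 in our case; see Figure 2) are the deepest, and
2. we upsample it to one fourth of the input image dimensions to improve accuracy on small objects.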
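The "one fourth of the input" figure can be checked with a little arithmetic. The sketch below is only an illustration under stated assumptions: it assumes the usual ResNet layout in which P3 sits after three stride-2 operations (a 7 × 7 stem conv, a 3 × 3 max pool, and one stride-2 stage), and that protonet applies a single 2× upsample. The helper `conv_out` and the variable names are hypothetical, not taken from the paper.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution or pooling layer along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

side = 550                                              # base image size from the text
side = conv_out(side, kernel=7, stride=2, padding=3)    # assumed stem conv      -> 275
side = conv_out(side, kernel=3, stride=2, padding=1)    # assumed max pool       -> 138
side = conv_out(side, kernel=3, stride=2, padding=1)    # assumed stride-2 stage -> 69
p3_side = side
proto_side = 2 * p3_side                                # assumed single 2x upsample

print(p3_side, proto_side, 550 / 4)  # 69 138 137.5
```

Under these assumptions the prototypes come out at 138 × 138, which is roughly a quarter of the 550 × 550 input along each side.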

Fig. 2: YOLACT Architecture
Blue/yellow indicates low/high values in the prototypes, gray nodes indicate functions that are not trained, and k = 4 in this example. We base this architecture off of RetinaNet [25] using ResNet-101 + FPN.

Fig. 2: YOLACT Architecture

Blue/yellow indicates low/high values in the prototypes, gray nodes indicate functions that are not trained, and k = 4 in this example. The architecture is based on RetinaNet [25] with ResNet-101 + FPN.

Finally, we find it important for the protonet’s output to be unbounded, as this allows the network to produce large, overpowering activations for prototypes it is very confident about (e.g., obvious background). Thus, we have the option of following protonet with either a ReLU or no nonlinearity. We choose ReLU for more interpretable prototypes.

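Finally, we find it important for protonet's output to be unbounded, as this allows the network to produce large, overpowering activations for prototypes it is very confident about (e.g., obvious background). We therefore have the option of following protonet with either a ReLU or no nonlinearity. We choose ReLU for more interpretable prototypes.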

3.2 Mask Coefficients

Typical anchor-based object detectors have two branches in their prediction heads: one branch to predict c class confidences, and the other to predict 4 bounding box regressors. For mask coefficient prediction, we simply add a third branch in parallel that predicts k mask coefficients, one corresponding to each prototype. Thus, instead of producing 4 + c coefficients per anchor, we produce 4+c+k.

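Typical anchor-based object detectors have two branches in their prediction heads:
1. one that predicts c class confidences
2. one that predicts 4 bounding box regressors
For mask coefficient prediction, we simply add a third branch in parallel that predicts k mask coefficients, one for each prototype.
Thus, instead of producing 4 + c coefficients per anchor, we produce 4 + c + k.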

Then for nonlinearity, we find it important to be able to subtract out prototypes from the final mask. Thus, we apply tanh to the k mask coefficients, which produces more stable outputs over no nonlinearity. The relevance of this design choice is apparent in Figure 2, as neither mask would be constructable without allowing for subtraction.

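Then, for the nonlinearity, we find it important to be able to subtract prototypes from the final mask. We therefore apply $\tanh$ to the k mask coefficients, which produces more stable outputs than using no nonlinearity. The relevance of this design choice is apparent in Figure 2, as neither mask could be constructed without allowing subtraction.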
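To make the 4 + c + k layout and the role of tanh concrete, here is a minimal NumPy sketch. It assumes c = 80 and k = 32 purely as example sizes, uses random numbers in place of real head outputs, and lays the three branch outputs out as one vector per anchor only for illustration (in the model they come from separate parallel branches).

```python
import numpy as np

c, k = 80, 32                      # example sizes only: c classes, k prototypes
num_anchors = 5                    # hypothetical handful of anchors

rng = np.random.default_rng(0)
raw = rng.normal(size=(num_anchors, 4 + c + k))   # stand-in for raw head outputs

box_regressors = raw[:, :4]                 # 4 bounding box regressors
class_confidences = raw[:, 4:4 + c]         # c class confidences
mask_coefficients = np.tanh(raw[:, 4 + c:]) # k coefficients squashed into (-1, 1)

# Negative coefficients are what allow a prototype to be subtracted.
print(mask_coefficients.min() < 0, mask_coefficients.max() > 0)
```

Because tanh is bounded and symmetric around zero, each coefficient can add or subtract its prototype, but never by more than the prototype's own magnitude.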

3.3 Mask Assembly

To produce instance masks, we combine the work of the prototype branch and mask coefficient branch, using a linear combination of the former with the latter as coefficients. We then follow this by a sigmoid nonlinearity to produce the final masks. These operations can be implemented efficiently using a single matrix multiplication and sigmoid:
$$
M=\sigma\left(P C^{T}\right)
$$
where P is an h×w×k matrix of prototype masks and C is an n×k matrix of mask coefficients for n instances surviving NMS and score thresholding. Other, more complicated combination steps are possible; however, we keep it simple (and fast) with a basic linear combination.

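To produce instance masks, we combine the outputs of the prototype branch and the mask coefficient branch, taking a linear combination of the former with the latter as coefficients. We then apply a sigmoid nonlinearity to produce the final masks. These operations can be implemented efficiently with a single matrix multiplication and a sigmoid:

$$
M=\sigma\left(P C^{T}\right)
$$
where $P \in \mathbb{R}^{h \times w \times k}$ is the matrix of $k$ prototype masks and $C \in \mathbb{R}^{n \times k}$ is the matrix of mask coefficients for the $n$ instances that survive NMS and score thresholding.

Other, more complicated combination steps are possible; however, we keep it simple (and fast) with a basic linear combination.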
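The assembly step $M=\sigma(PC^{T})$ can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function name `assemble_masks`, the 138 × 138 prototype size, k = 32, and the random inputs are all assumptions made for the example.

```python
import numpy as np

def assemble_masks(prototypes, coefficients):
    """Compute M = sigmoid(P C^T).

    prototypes:   (h, w, k) array of prototype masks P
    coefficients: (n, k) array of mask coefficients C
    returns:      (h, w, n) array with one mask per instance
    """
    h, w, k = prototypes.shape
    flat = prototypes.reshape(h * w, k)       # view P as an (h*w) x k matrix
    logits = flat @ coefficients.T            # the single matrix multiplication
    masks = 1.0 / (1.0 + np.exp(-logits))     # element-wise sigmoid
    return masks.reshape(h, w, -1)

rng = np.random.default_rng(0)
P = rng.random((138, 138, 32))                # hypothetical prototype activations
C = np.tanh(rng.normal(size=(5, 32)))         # hypothetical coefficients, 5 instances
M = assemble_masks(P, C)
print(M.shape)  # (138, 138, 5)
```

Flattening the spatial dimensions is what turns the per-pixel linear combination into one matrix product, which is why the step is cheap on a GPU.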

Losses

We use three losses to train our model: classification loss $L_{cls}$, box regression loss $L_{box}$ and mask loss $L_{mask}$ with the weights 1, 1.5, and 6.125 respectively. Both $L_{cls}$ and $L_{box}$ are defined in the same way as in [6]. Then to compute mask loss, we simply take the pixel-wise binary cross entropy between assembled masks $M$ and the ground truth masks $M_{gt}$:
$$
L_{mask} = \text{BCE}(M, M_{gt})
$$

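We use three losses to train our model:
1. classification loss $L_{cls}$
2. box regression loss $L_{box}$
3. mask loss $L_{mask}$
(with weights 1, 1.5, and 6.125, respectively).
Both $L_{cls}$ and $L_{box}$ are defined in the same way as in [6].
To compute the mask loss, we take the pixel-wise binary cross entropy between the assembled masks $M$ and the ground truth masks $M_{gt}$:

$$
L_{mask} = \text{BCE}(M, M_{gt})
$$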
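As a rough illustration of how the three terms combine, the sketch below implements a pixel-wise binary cross entropy and the weighted sum with the weights 1, 1.5, and 6.125 from the text. The mean reduction, the clipping constant `eps`, and the function names are assumptions of this example; the classification and box losses are passed in as plain numbers because their definitions come from [6].

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Pixel-wise binary cross entropy, averaged over pixels (assumed reduction)."""
    pred = np.clip(pred, eps, 1 - eps)   # avoid log(0)
    per_pixel = -(target * np.log(pred) + (1 - target) * np.log(1 - pred))
    return float(per_pixel.mean())

def total_loss(l_cls, l_box, l_mask, weights=(1.0, 1.5, 6.125)):
    """Weighted sum of the classification, box, and mask losses."""
    w_cls, w_box, w_mask = weights
    return w_cls * l_cls + w_box * l_box + w_mask * l_mask

gt = np.array([[0.0, 1.0], [1.0, 0.0]])
uncertain = np.full_like(gt, 0.5)        # a maximally unsure prediction
l_mask = bce(uncertain, gt)              # equals log(2) for every pixel
print(total_loss(l_cls=0.3, l_box=0.2, l_mask=l_mask))
```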

Cropping Masks

We crop the final masks with the predicted bounding box during evaluation. Specifically, we assign zero to pixels outside of the box region. During training, we instead crop with the ground truth bounding box, and divide $L_{mask}$ by the ground truth box area to preserve small objects in the prototypes.

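During evaluation, we crop the final masks with the predicted bounding box. Specifically, we assign zero to pixels outside the box region. During training, we instead crop with the ground truth bounding box and divide $L_{mask}$ by the ground truth box area to preserve small objects in the prototypes.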
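Cropping and the area normalization can be sketched as follows. The box convention used here (integer pixel coordinates `(x1, y1, x2, y2)` with exclusive upper bounds), the summed reduction before dividing by the area, and all function names are assumptions made for this illustration.

```python
import numpy as np

def crop_mask(mask, box):
    """Zero out every pixel outside the box (x1, y1, x2, y2), upper bounds exclusive."""
    x1, y1, x2, y2 = box
    cropped = np.zeros_like(mask)
    cropped[y1:y2, x1:x2] = mask[y1:y2, x1:x2]
    return cropped

def box_area(box):
    x1, y1, x2, y2 = box
    return (x2 - x1) * (y2 - y1)

def area_normalized_mask_loss(pred, target, gt_box, eps=1e-7):
    """Training-time variant: crop with the ground truth box, then divide by its area."""
    pred = np.clip(crop_mask(pred, gt_box), eps, 1 - eps)
    target = crop_mask(target, gt_box)
    per_pixel = -(target * np.log(pred) + (1 - target) * np.log(1 - pred))
    return float(per_pixel.sum() / box_area(gt_box))

mask = np.ones((6, 6))
box = (1, 2, 4, 5)                  # a 3 x 3 box
cropped = crop_mask(mask, box)
print(int(cropped.sum()))           # 9
```

Dividing by the box area keeps a small object's loss from being drowned out by large objects, which is the stated reason for the normalization.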

3.4 Emergent Behavior

Our approach might seem surprising, as the general consensus around instance segmentation is that because FCNs are translation invariant, the task needs translation variance added back in [3]. Thus methods like FCIS [3] and Mask R-CNN [2] try to explicitly add translation variance, whether it be by directional maps and position-sensitive repooling, or by putting the mask branch in the second stage so it does not have to deal with localizing instances. In our method, the only translation variance we add is to crop the final mask with the predicted bounding box. However, we find that our method also works without cropping for medium and large objects, so this is not a result of cropping. Instead, YOLACT learns how to localize instances on its own via different activations in its prototypes.

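Our approach might seem surprising, as the general consensus around instance segmentation is that, because FCNs are translation invariant, the task needs translation variance added back in [3]. Thus, methods like FCIS [3] and Mask R-CNN [2] try to add translation variance explicitly, whether through directional maps and position-sensitive repooling, or by placing the mask branch in the second stage so that it does not have to deal with localizing instances. In our method, the only translation variance we add is cropping the final mask with the predicted bounding box. However, we find that our method also works without cropping for medium and large objects, so this is not a result of cropping. Instead, YOLACT learns to localize instances on its own via different activations in its prototypes.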

To see how this is possible, first note that the prototype activations for the solid red image (image a) in Figure 5 are actually not possible in an FCN without padding. Because a convolution outputs to a single pixel, if its input everywhere in the image is the same, the result everywhere in the conv output will be the same. On the other hand, the consistent rim of padding in modern FCNs like ResNet gives the network the ability to tell how far away from the image’s edge a pixel is. Conceptually, one way it could accomplish this is to have multiple layers in sequence spread the padded 0’s out from the edge toward the center (e.g., with a kernel like [1,0]). This means ResNet, for instance, is inherently translation variant, and our method makes heavy use of that property (images b and c exhibit clear translation variance).

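To see how this is possible, first note that the prototype activations for the solid red image (image a) in Figure 5 are actually not possible in an FCN without padding. Because a convolution outputs to a single pixel, if its input is the same everywhere in the image, the result will be the same everywhere in the conv output. On the other hand, the consistent rim of padding in modern FCNs like ResNet gives the network the ability to tell how far a pixel is from the image's edge. Conceptually, one way it could do this is to have multiple layers in sequence spread the padded 0's from the edge toward the center (e.g., with a kernel like [1,0]). This means that ResNet, for instance, is inherently translation variant, and our method makes heavy use of that property (images b and c exhibit clear translation variance).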
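The padding argument can be reproduced on a toy example. The sketch below is an assumption-laden illustration rather than anything from the paper: it uses a 6 × 6 constant image, a 3 × 3 all-ones kernel, and a hand-written "valid" convolution (`conv2d_valid` is a hypothetical helper) to show that the output is uniform without padding, and that with zero padding the edge information moves one pixel further inward per layer.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Plain 2D cross-correlation with no padding."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.ones((6, 6))            # a "solid color" input
kernel = np.ones((3, 3))

no_pad = conv2d_valid(image, kernel)                  # every output equals 9
one_layer = conv2d_valid(np.pad(image, 1), kernel)    # zero padding: border differs
two_layers = conv2d_valid(np.pad(one_layer, 1), kernel)

print(np.unique(no_pad))                    # [9.]
print(one_layer[0, 0], one_layer[2, 2])     # 4.0 9.0
print(two_layers[1, 1], two_layers[2, 2])   # 64.0 81.0
```

After one padded layer only the outermost ring of pixels differs from the interior; after two, the ring one pixel further in differs as well, which is the "spreading" effect the paragraph describes.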

Fig. 5: Prototype Behavior
The activations of the same six prototypes (y axis) across different images (x axis). Prototypes 1-3 respond to objects to one side of a soft, implicit boundary (marked with a dotted line). Prototype 4 activates on the bottom-left of objects (for instance, the bottom left of the umbrellas in image d); prototype 5 activates on the background and on the edges between objects; and prototype 6 segments what the network perceives to be the ground in the image. These last 3 patterns are most clear in images d-f.

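Fig. 5: Prototype Behavior

The activations of the same six prototypes (y axis) across different images (x axis). Prototypes 1-3 respond to objects on one side of a soft, implicit boundary (marked with a dotted line). Prototype 4 activates on the bottom-left of objects (for instance, the bottom left of the umbrellas in image d); prototype 5 activates on the background and on the edges between objects; and prototype 6 segments what the network perceives to be the ground in the image. These last three patterns are clearest in images d-f.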

We observe many prototypes to activate on certain “partitions” of the image. That is, they only activate on objects on one side of an implicitly learned boundary. In Figure 5, prototypes 1-3 are such examples. By combining these partition maps, the network can distinguish between different (even overlapping) instances of the same semantic class; e.g., in image d, the green umbrella can be separated from the red one by subtracting prototype 3 from prototype 2.

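We observe that many prototypes activate on certain "partitions" of the image. That is, they only activate on objects on one side of an implicitly learned boundary. In Figure 5, prototypes 1-3 are such examples. By combining these partition maps, the network can distinguish between different (even overlapping) instances of the same semantic class; for example, in image d, the green umbrella can be separated from the red one by subtracting prototype 3 from prototype 2.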
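The idea of separating instances by subtracting one partition map from another can be shown on a one-dimensional toy row of pixels. Everything in this sketch is made up for illustration: the two objects, the boundary positions, and the names `proto_a` and `proto_b` do not correspond to actual prototypes in Figure 5.

```python
import numpy as np

x = np.arange(10)                                          # pixel positions in one row
objects = ((x >= 1) & (x <= 3)) | ((x >= 5) & (x <= 7))    # two instances of one class

# Two hypothetical partition prototypes: each fires on objects left of its boundary.
proto_a = (objects & (x < 8)).astype(float)   # boundary at x = 8: sees both objects
proto_b = (objects & (x < 4)).astype(float)   # boundary at x = 4: sees only the first

second_instance = proto_a - proto_b           # coefficients (+1, -1)
print(second_instance)  # [0. 0. 0. 0. 0. 1. 1. 1. 0. 0.]
```

With coefficients of +1 and -1 the combination keeps only the object lying between the two boundaries, which is why being able to subtract prototypes matters.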

Furthermore, being learned objects, prototypes are compressible. That is, if protonet combines the functionality of multiple prototypes into one, the mask coefficient branch can learn which situations call for which functionality. For instance, in Figure 5, prototype 2 is a partitioning prototype but also fires most strongly on instances in the bottom-left corner. Prototype 3 is similar but for instances on the right. This explains why in practice, the model does not degrade in performance even with as low as k = 32 prototypes (see Table 2b).

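Furthermore, because they are learned, prototypes are compressible. That is, if protonet combines the functionality of multiple prototypes into one, the mask coefficient branch can learn which situations call for which functionality. For instance, in Figure 5, prototype 2 is a partitioning prototype but also fires most strongly on instances in the bottom-left corner. Prototype 3 is similar, but for instances on the right. This explains why, in practice, the model's accuracy does not degrade even with as few as k = 32 prototypes (see Table 2b).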

On the other hand, increasing k is ineffective most likely because predicting coefficients is difficult. If the network makes a large error in even one coefficient, due to the nature of linear combinations, the produced mask can vanish or include leakage from other objects. Thus, the network has to play a balancing act to produce the right coefficients, and adding more prototypes makes this harder. In fact, we find that for higher values of k, the network simply adds redundant prototypes with small edge-level variations that slightly increase AP95, but not much else.

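On the other hand, increasing k is ineffective, most likely because predicting coefficients is difficult. Due to the nature of linear combinations, if the network makes a large error in even one coefficient, the produced mask can vanish or include leakage from other objects. The network therefore has to perform a balancing act to produce the right coefficients, and adding more prototypes makes this harder. In fact, we find that for higher values of k, the network simply adds redundant prototypes with small edge-level variations that slightly increase AP95, but not much else.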

4 Backbone Detector

For our backbone detector we prioritize speed as well as feature richness, since predicting these prototypes and coefficients is a difficult task that requires good features to do well. Thus, the design of our backbone detector closely follows RetinaNet [25] with an emphasis on speed.

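For our backbone detector, we prioritize speed as well as feature richness, since predicting these prototypes and coefficients is a difficult task that requires good features. The design of our backbone detector therefore closely follows RetinaNet [25], with an emphasis on speed.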

YOLACT Detector

We use ResNet-101 [8] with FPN [48] as our default feature backbone and a base image size of 550 × 550.

We use ResNet-101 [8] with FPN [48] as our default feature backbone, with a base image size of 550 × 550.

We do not preserve aspect ratio in order to get consistent evaluation times per image. Like RetinaNet, we modify FPN by not producing P2 and producing P6 and P7 as successive 3 × 3 stride 2 conv layers starting from P5 (not C5) and place 3 anchors with aspect ratios [1, 1/2, 2] on each. The anchors of P3 have areas of 24 pixels squared, and every subsequent layer has double the scale of the previous (resulting in the scales [24, 48, 96, 192, 384]). For the prediction head attached to each Pi, we have one 3 × 3 conv shared by all three branches, and then each branch gets its own 3 × 3 conv in parallel. Compared to RetinaNet, our prediction head design (see Figure 4) is more lightweight and much faster. We apply smooth-L1 loss to train box regressors and encode box regression coordinates in the same way as SSD [6]. To train class prediction, we use softmax cross entropy with c positive labels and 1 background label, selecting training examples using OHEM [49] with a 3:1 neg:pos ratio. Thus, unlike RetinaNet we do not use focal loss, which we found not to be viable in our situation.

画像ごとに一貫した評価時間を得るために、アスペクト比を保持しません。 RetinaNetと同様に、P2を生成せずにFPNを修正し、P5（C5ではなく）から始まる連続3×3ストライド2 conv層としてP6およびP7を生成し、それぞれにアスペクト比[1、1/2、2]の3つのアンカーを配置します。 P3のアンカーの面積は24ピクセルの正方形で、後続のすべての層のスケールは前の2倍になります（スケール[24、48、96、192、384]）。各Piにアタッチされた予測ヘッドの場合、3つのブランチすべてで共有される1つの3×3 convがあり、各ブランチは独自の3×3 convを並行して取得します。 RetinaNetと比較して、予測ヘッドの設計（図4を参照）はより軽量で、はるかに高速です。smooth-L1 lossをボックス回帰のトレーニングに適用し、SSDと同じ方法でボックス回帰座標をエンコードします[6]。クラス予測をトレーニングするには、c個の正ラベルと1個のバックグラウンドラベルでsoftmax cross entropyを使用し、OHEM [49]を使用して3：1のネガティブ：ポジティブ比のトレーニング例を選択します。したがって、RetinaNetとは異なり、焦点損失を使用しません。
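The anchor layout described above can be written out as a short sketch. The scales and aspect ratios are the ones quoted in the text; the convention that a ratio means width / height and that each anchor keeps an area of scale², as well as the names `SCALES`, `ASPECT_RATIOS` and `anchor_shapes`, are assumptions made for illustration and are not taken from the authors' code.

```python
import math

# Values quoted in the text: one scale per FPN level and three aspect ratios.
SCALES = {"P3": 24, "P4": 48, "P5": 96, "P6": 192, "P7": 384}
ASPECT_RATIOS = [1.0, 0.5, 2.0]


def anchor_shapes(scale, ratios):
    """Return (width, height) pairs for one FPN level.

    Assumption: ratio = width / height, and every anchor keeps area scale**2.
    """
    shapes = []
    for ratio in ratios:
        root = math.sqrt(ratio)
        shapes.append((scale * root, scale / root))
    return shapes


anchors = {level: anchor_shapes(s, ASPECT_RATIOS) for level, s in SCALES.items()}
for level, shapes in anchors.items():
    print(level, [(round(w, 1), round(h, 1)) for w, h in shapes])
```

Each level ends up with three anchor shapes, and the scale doubles from one level to the next.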

Fig. 4: Head Architecture
We use a shallower prediction head than RetinaNet [25] and add a mask coefficient branch. This is for c classes, a anchors for feature layer Pi, and k prototypes. See Figure 3 for a key.

RetinaNet [25]よりも浅い予測ヘッドを使用し、マスク係数ブランチを追加します。これは、c個のクラス、特徴層Piのa個のアンカー、およびk個のプロトタイプの場合です。記号の凡例については、図3を参照してください。

With these design choices, we find that this backbone performs better and faster than SSD [6] modified to use ResNet-101 [8], with the same image size.

これらの設計上の選択により、このバックボーンは、同じ画像サイズでResNet-101 [8]を使用するように修正されたSSD [6]よりも優れた高速性を発揮します。

5 Other Improvements

We also discuss other improvements that either increase speed with little effect on performance or increase performance with no speed penalty.

また、精度にほとんど影響を与えずに速度を上げるか、速度を犠牲にすることなく精度を上げるその他の改善点についても説明します。

5.1 Fast NMS

After producing bounding box regression coefficients and class confidences for each anchor, like most object detectors we perform NMS to suppress duplicate detections. In many previous works [1], [2], [4], [6], [7], [25], NMS is performed sequentially. That is, for each of the c classes in the dataset, sort the detected boxes descending by confidence, and then for each detection remove all those with lower confidence than it that have an IoU overlap greater than some threshold. While this sequential approach is fast enough at speeds of around 5 fps, it becomes a large barrier for obtaining 30 fps (for instance, a 10 ms improvement at 5 fps results in a 0.26 fps boost, while a 10 ms improvement at 30 fps results in a 12.9 fps boost).

ほとんどの物体検出器のように、各アンカーのバウンディングボックス回帰係数とクラス信頼度を生成した後、NMSを実行して重複検出を抑制します。これまでの多くの研究[1]、[2]、[4]、[6]、[7]、[25]では、NMSは逐次的に実行されます。つまり、データセット内のc個のクラスごとに、検出されたボックスを信頼度の高い順に並べ替えてから、検出ごとに、それより信頼度が低く、IoUオーバーラップが閾値よりも大きいものをすべて削除します。この逐次的なアプローチは約5 fpsの速度であれば十分に高速ですが、30 fpsを達成するうえでは大きな障壁になります（たとえば、5 fpsで10ミリ秒改善しても0.26 fpsの向上にしかなりませんが、30 fpsで10ミリ秒改善すると12.9 fpsの向上になります）。
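The fps figures in the parenthetical follow from converting a frame rate to a per-frame time, subtracting the saved milliseconds, and converting back. A minimal sketch of that arithmetic (the function name `fps_after_saving` is hypothetical):

```python
def fps_after_saving(fps, saved_ms):
    """Frame rate after removing saved_ms milliseconds from every frame."""
    frame_ms = 1000.0 / fps
    return 1000.0 / (frame_ms - saved_ms)


boosts = {}
for fps in (5.0, 30.0):
    boosts[fps] = fps_after_saving(fps, 10.0) - fps
    print(f"{fps:.0f} fps: +{boosts[fps]:.2f} fps after saving 10 ms")
```

At 5 fps a frame takes 200 ms, so saving 10 ms barely matters; at 30 fps a frame takes about 33 ms, so the same 10 ms is almost a third of the budget.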

To fix the sequential nature of traditional NMS, we introduce Fast NMS, a version of NMS where every instance can be decided to be kept or discarded in parallel. To do this, we simply allow already-removed detections to suppress other detections, which is not possible in traditional NMS. This relaxation allows us to implement Fast NMS entirely in standard GPU-accelerated matrix operations.

従来のNMSの逐次的な性質を解消するために、NMSの1バージョンであるFast NMSを導入します。このバージョンでは、すべてのインスタンスについて保持するか破棄するかを並列に決定できます。これを行うには、既に削除された検出が他の検出を抑制することを単に許可します。これは、従来のNMSでは不可能です。この緩和により、Fast NMSを完全に標準のGPU加速行列演算で実装できます。

To perform Fast NMS, we first compute a c × n × n pairwise IoU matrix X for the top n detections sorted descending by score for each of c classes. Batched sorting on the GPU is readily available and computing IoU can be easily vectorized. Then, we remove detections if there are any higher-scoring detections with a corresponding IoU greater than some threshold t. We efficiently implement this by first setting the lower triangle and diagonal of X to 0:
$$
X_{kij} = 0, \quad \forall k, j, i \geq j,
$$
which can be performed in one batched triu call, and then taking the column-wise max:
$$
K_{kj} = \max_i (X_{kij}) \quad \forall k, j \tag{2}
$$
to compute a matrix K of maximum IoU values for each detection. Finally, thresholding this matrix with t (K < t) will indicate which detections to keep for each class.

Fast NMSを実行するには、最初にc個のクラスごとにスコアで降順にソートされた上位n個の検出について、c×n×nのペアごとのIoU行列Xを計算します。 GPUでのバッチソートはすぐに利用でき、IoUの計算は簡単にベクトル化できます。次に、対応するIoUがある閾値tより大きい、よりスコアの高い検出が存在する場合、その検出を削除します。これを効率的に実装するために、まずXの下三角と対角成分を0に設定します。
$$
X_{kij} = 0, \quad \forall k, j, i \geq j,
$$
これは1回のバッチ「triu」呼び出しで実行できます。次に列ごとの最大値を取得し、
$$
K_{kj} = \max_i (X_{kij}) \quad \forall k, j \tag{2}
$$
検出ごとの最大IoU値の行列Kを計算します。最後に、この行列をt（K < t）でしきい値処理すると、各クラスでどの検出を保持するかが示されます。
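The procedure above maps almost directly onto array operations. The following is a minimal NumPy sketch rather than the authors' GPU implementation: the function names, the (x1, y1, x2, y2) box format, and the example boxes are assumptions made for illustration.

```python
import numpy as np


def pairwise_iou(boxes):
    """IoU between all box pairs; boxes has shape (c, n, 4) as x1, y1, x2, y2."""
    x1 = np.maximum(boxes[:, :, None, 0], boxes[:, None, :, 0])
    y1 = np.maximum(boxes[:, :, None, 1], boxes[:, None, :, 1])
    x2 = np.minimum(boxes[:, :, None, 2], boxes[:, None, :, 2])
    y2 = np.minimum(boxes[:, :, None, 3], boxes[:, None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (boxes[..., 2] - boxes[..., 0]) * (boxes[..., 3] - boxes[..., 1])
    union = area[:, :, None] + area[:, None, :] - inter
    return inter / union


def fast_nms(boxes, scores, t=0.5):
    """boxes: (c, n, 4), scores: (c, n). Returns sorted boxes, sorted scores, keep mask."""
    order = np.argsort(-scores, axis=1)  # descending by score within each class
    scores = np.take_along_axis(scores, order, axis=1)
    boxes = np.take_along_axis(boxes, order[:, :, None], axis=1)
    X = np.triu(pairwise_iou(boxes), k=1)  # zero the lower triangle and diagonal
    K = X.max(axis=1)                      # column-wise max: K[k, j] = max_i X[k, i, j]
    return boxes, scores, K < t


# One class, four hypothetical boxes: A, B, C overlap in a chain, D is far away.
boxes = np.array([[[6, 0, 16, 10],     # C
                   [0, 0, 10, 10],     # A
                   [50, 50, 60, 60],   # D
                   [3, 0, 13, 10]]],   # B
                 dtype=float)
scores = np.array([[0.7, 0.9, 0.6, 0.8]])
sorted_boxes, sorted_scores, keep = fast_nms(boxes, scores)
print(sorted_scores[0], keep[0])
```

In this example C overlaps B above the threshold but not A. Traditional NMS would keep C, because B has already been removed by A; Fast NMS still lets B suppress C, which illustrates the relaxation that allows already-removed detections to suppress others.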

Because of the relaxation, Fast NMS has the effect of removing slightly too many boxes. However, the performance hit caused by this is negligible compared to the stark increase in speed (see Table 2a). In our code base, Fast NMS is 11.8 ms faster than a Cython implementation of traditional NMS while only reducing performance by 0.1 mAP. In the Mask R-CNN benchmark suite [2], Fast NMS is 15.0 ms faster than their CUDA implementation of traditional NMS with a performance loss of only 0.3 mAP.

緩和のため、Fast NMSはわずかに多くのボックスを削除する効果があります。ただし、これによる精度の低下は、速度の急激な増加と比較して無視できます（表2aを参照）。コードベースでは、Fast NMSは、従来のNMSのCython実装よりも11.8ミリ秒高速ですが、精度は0.1 mAPしか低下しません。 Mask R-CNNベンチマークスイート[2]では、Fast NMSは従来のNMSのCUDA実装よりも15.0ミリ秒高速で、精度損失はわずか0.3 mAPです。

5.2 Semantic Segmentation Loss

While Fast NMS trades a small amount of performance for speed, there are ways to increase performance with no speed penalty. One of those ways is to apply extra losses to the model during training using modules not executed at test time. This effectively increases feature richness while at no speed penalty.

Fast NMSは速度と引き換えに少量の精度を犠牲にしますが、速度を犠牲にすることなく精度を向上させる方法があります。これらの方法の1つは、テスト時に実行されないモジュールを使用して、トレーニング中にモデルに余分な損失を適用することです。これにより、速度が低下することなく、機能の豊富さが効果的に向上します。

Thus, we apply a semantic segmentation loss on our feature space using layers that are only evaluated during training. Note that because we construct the ground truth for this loss from instance annotations, this does not strictly capture semantic segmentation (i.e., we do not enforce the standard one class per pixel). To create predictions during training, we simply attach a 1x1 conv layer with c output channels directly to the largest feature map (P3) in our backbone. Since each pixel can be assigned to more than one class, we use sigmoid and c channels instead of softmax and c + 1. This loss is given a weight of 1 and results in a +0.4 mAP boost.

したがって、トレーニング中にのみ評価される層を使用して、特徴空間にセマンティックセグメンテーション損失を適用します。インスタンスアノテーションからこの損失のグランドトゥルースを構築するため、これはセマンティックセグメンテーションを厳密にキャプチャしないことに注意してください（つまり、ピクセルごとに標準の1つのクラスを強制しません）。トレーニング中に予測を作成するには、c出力チャネルを持つ1x1 convレイヤーをバックボーンの最大の機能マップ（P3）に直接接続します。各ピクセルは複数のクラスに割り当てることができるため、softmaxとc + 1の代わりにシグモイドとc個のチャンネルを使用します。この損失には1の重みが与えられ、+ 0.4 mAPブーストになります。
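As a rough illustration of why sigmoid with c channels is used here, the sketch below builds a per-class target from instance masks and evaluates a per-channel binary cross entropy. The helper names, the mean reduction, and the toy 4 × 4 masks are assumptions; in the actual model the logits would come from the 1x1 conv on P3 and the targets would be resized to that resolution.

```python
import numpy as np


def semantic_targets(instance_masks, class_ids, num_classes):
    """Union the instance masks of each class into a (c, H, W) target."""
    _, h, w = instance_masks.shape
    target = np.zeros((num_classes, h, w), dtype=np.float64)
    for mask, cls in zip(instance_masks, class_ids):
        target[cls] = np.maximum(target[cls], mask)
    return target


def sigmoid_bce(logits, target):
    """Mean binary cross entropy over c independent sigmoid channels."""
    # log(1 + exp(x)) is computed stably as logaddexp(0, x).
    return np.mean(np.logaddexp(0.0, logits) - target * logits)


# Two overlapping toy instances belonging to classes 0 and 2 (out of 3 classes).
masks = np.zeros((2, 4, 4))
masks[0, :, :3] = 1.0
masks[1, :, 1:] = 1.0
target = semantic_targets(masks, class_ids=[0, 2], num_classes=3)
loss = sigmoid_bce(np.zeros((3, 4, 4)), target)
print(target[:, 0, 1], loss)
```

The pixel at row 0, column 1 is positive for two classes at once, which a softmax over c + 1 mutually exclusive labels could not represent.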

6 YOLACT++

YOLACT, as introduced thus far, is viable for real-time applications and only consumes ∼1500 MB of VRAM even with a ResNet-101 backbone. We believe these properties make it an attractive model that could be deployed in low-capacity embedded systems.

これまでに紹介したように、YOLACTはリアルタイムアプリケーションで実行可能であり、ResNet-101バックボーンを使用しても、〜1500 MBのVRAMしか消費しません。これらの特性により、低容量の組み込みシステムに展開できる魅力的なモデルになると考えています。

We next explore several performance improvements to the original framework, while keeping the real-time demand in mind. Specifically, we first introduce an efficient and fast mask rescoring network, which re-ranks the mask predictions according to their mask quality. We then identify ways to improve the backbone network with deformable convolutions so that our feature sampling aligns better with instances, which results in a better backbone detector and more precise mask prototypes. We finally discuss better choices for the detection anchors to increase recall.

次に、リアルタイムの需要を念頭に置いて、元のフレームワークの精度の改善をいくつか検討します。具体的には、最初に効率的で高速なマスクリスコアリングネットワークを導入します。これは、マスクの品質に応じてマスク予測を再ランク付けします。次に、変形可能なコンボリューションを使用してバックボーンネットワークを改善する方法を特定し、特徴サンプリングがインスタンスとより整合するようにします。これにより、バックボーン検出器とマスクプロトタイプの精度が向上します。最後に、リコールを増やすための検出アンカーのより良い選択について説明します。

6.1 Fast Mask Re-Scoring Network

As indicated by Mask Scoring R-CNN [15], there is a discrepancy in the model’s classification confidence and the quality of the predicted mask (i.e., higher quality mask segmentations don’t necessarily have higher class confidences). Thus, to better correlate the class confidence with mask quality, Mask Scoring R-CNN adds a new module to Mask R-CNN that learns to regress the predicted mask to its mask IoU with ground-truth.

Mask Scoring　R-CNN [15]で示されているように、モデルの分類信頼度と予測マスクの品質には矛盾があります（つまり、高品質のマスクセグメンテーションは必ずしも高いクラス信頼度を持ちません）。したがって、クラスの信頼度とマスク品質をよりよく相関させるために、Mask Scoring R-CNNは、予測されたマスクをグラウンドトゥルースでマスクIoUに回帰することを学習する新しいモジュールをMask R-CNNに追加します。

Inspired by [15], we introduce a fast mask re-scoring branch, which rescores the predicted masks based on their mask IoU with ground-truth. Specifically, our Fast Mask Re-Scoring Network is a 6-layer FCN with ReLU non-linearity per conv layer and a final global pooling layer. It takes as input YOLACT’s cropped mask prediction (before thresholding) and outputs the mask IoU for each object category. We rescore each mask by taking the product between the predicted mask IoU for the category predicted by our classification head and the corresponding classification confidence (see Figure 6).

[15]に触発されて、高速マスクリスコアリングブランチを導入します。これは、グラウンドトゥルースとのマスクIoUに基づいて予測マスクを再スコアリングします。具体的には、Fast Mask Re-Scoring Networkは、conv層ごとにReLU非線形関数を持ち、最終的なグローバルプーリング層を持つ6層FCNです。 YOLACTのトリミングされたマスク予測（しきい値処理前）を入力として受け取り、各物体カテゴリのマスクIoUを出力します。分類ヘッドによって予測されたカテゴリの予測マスクIoUと対応する分類信頼度の間の積を取ることにより、各マスクを再スコア化します（図6参照）。
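The re-scoring rule itself is a single product. A small sketch with made-up numbers (the function name `rescore` and the detections are hypothetical) shows how it can reorder detections:

```python
def rescore(class_conf, predicted_class, mask_iou_per_class):
    """Mask score = classification confidence * predicted mask IoU for that class."""
    return class_conf * mask_iou_per_class[predicted_class]


# Hypothetical detections: (class confidence, predicted class, per-class mask IoU).
detections = [
    (0.90, 1, [0.10, 0.40, 0.20]),  # confident class, poor mask
    (0.70, 0, [0.85, 0.30, 0.10]),  # less confident class, good mask
]
scores = [rescore(*d) for d in detections]
ranking = sorted(range(len(detections)), key=lambda i: scores[i], reverse=True)
print(scores, ranking)
```

The second detection overtakes the first after re-scoring, even though its class confidence is lower.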

Fig. 6: Fast Mask Re-scoring Network Architecture
Our mask scoring branch consists of 6 conv layers with ReLU non-linearity and 1 global pooling layer. Since there is no feature concatenation nor any fc layers, the speed overhead is only ∼1 ms.

図6: Fast Mask Re-scoring Network Architecture

マスクスコアリングブランチは、ReLU非線形性を備えた6つのconv層と1つのグローバルプーリング層で構成されています。特徴の連結もfc層もないため、速度のオーバーヘッドは約1ミリ秒です。

Our method differs from Mask Scoring R-CNN [15] in the following important ways: (1) Our input is only the mask at the full image size (with zeros outside the predicted box region) whereas their input is the ROI repooled mask concatenated with the feature from the mask prediction branch, and (2) we don’t have any fc layers. These make our method significantly faster. Specifically, the speed overhead of adding the Fast Mask Re-Scoring branch to YOLACT is 1.2 ms, which changes the fps from 34.4 to 33 for our ResNet-101 model, while the overhead of incorporating Mask Scoring R-CNN’s module into YOLACT is 28 ms, which would change the fps from 34.4 to 17.5. The speed difference mainly comes from MS R-CNN’s usage of the ROI align operation, its fc layers, and the feature concatenation in the input.

この方法は、以下の重要な点でMask Scoring R-CNN [15]と異なります。
（1）我々の入力はフル画像サイズのマスク（予測ボックス領域の外側にゼロがある）のみであるのに対し、Mask Scoring R-CNNの入力はマスク予測ブランチの機能と連結されたROIリプールマスクです。
（2）我々にはfc層がありません。
これらにより、我々の手法は大幅に高速になります。具体的には、Fast Mask Re-ScoringブランチをYOLACTに追加する速度オーバーヘッドは1.2 msであり、ResNet-101モデルのfpsは34.4から33に変化しますが、Mask Scoring R-CNNのモジュールをYOLACTに組み込むオーバーヘッドは28 msであり、fpsは34.4から17.5に変化します。速度の違いは、主にMS R-CNNのROI align操作の使用、そのfc層、および入力の特徴連結に起因します。

6.2 Deformable Convolution with Intervals

Deformable Convolution Networks (DCNs) [12], [13] have proven to be effective for object detection, semantic segmentation, and instance segmentation due to its replacement of the rigid grid sampling used in conventional convnets with free-form sampling. We follow the design choice made by DCNv2 [13] and replace the 3x3 convolution layer in each ResNet block with a 3x3 deformable convolution layer for C3 to C5. Note that we do not use the modulated deformable modules because we can’t afford the inference time overhead that they introduce.

Deformable Convolution Networks（DCN）[12]、[13]は、従来のconvnetで使用されているリジッドグリッドサンプリングを自由形式サンプリングに置き換えたことにより、物体検出、セマンティックセグメンテーション、インスタンスセグメンテーションに効果的であることが証明されました。 DCNv2 [13]による設計選択に従い、各ResNetブロックの3x3畳み込み層をC3からC5の3x3deformable畳み込み層に置き換えます。変調された変形可能モジュールを使用しないことに注意してください。これは、それらが導入する推論時間のオーバーヘッドに余裕がないためです。

Adding deformable convolution layers into the backbone of YOLACT leads to a +1.8 mask mAP gain with a speed overhead of 8 ms. We believe the boost is due to: (1) DCN can strengthen the network’s capability of handling instances with different scales, rotations, and aspect ratios by aligning to the target instances. (2) YOLACT, as a single-shot method, does not have a re-sampling process. Thus, a better and more flexible sampling strategy is more critical to YOLACT than two-stage methods, such as Mask R-CNN because there is no way to recover sub-optimal samplings in our network. In contrast, the ROI align operation in Mask R-CNN can address this problem to some extent by aligning all objects to a canonical reference region.

YOLACTのバックボーンにdeformable畳み込み層を追加すると、8ミリ秒の速度オーバーヘッドで+1.8マスクmAPゲインが得られます。この向上は次の理由によるものと考えています。
（1）DCNは、ターゲットインスタンスに合わせることで、異なるスケール、回転、アスペクト比のインスタンスを処理するネットワークの機能を強化できます。
（2）YOLACTは、シングルショット方式として、再サンプリングプロセスがありません。したがって、ネットワーク内で準最適なサンプリングを回復する方法がないため、Mask R-CNNなどのtwo-stage手法よりも優れた柔軟なサンプリング戦略がYOLACTにとって重要です。対照的に、Mask R-CNNのROI整列操作は、すべてのオブジェクトを標準参照領域に整列させることにより、この問題にある程度対処できます。

Even though the performance boost is fairly decent when directly plugging in the deformable convolution layers following the design choice in [13], the speed overhead is quite significant as well (see Table 7). This is because there are 30 layers with deformable convolutions when using ResNet-101. To speed up our ResNet-101 model while maintaining its performance boost, we explore using less deformable convolutions. Specifically, we try having deformable convolutions in four different configurations: (1) in the last 10 ResNet blocks, (2) in the last 13 ResNet blocks, (3) in the last 3 ResNet stages with an interval of 3 (i.e., skipping two ResNet blocks in between; total 11 deformable layers), and (4) in the last 3 ResNet stages with an interval of 4 (total 8 deformable layers). Given the results, the DCN (interval=3) setting is chosen as the final configuration in YOLACT++, which cuts down the speed overhead by 5.2 ms to 2.8 ms and only has a 0.2 mAP drop compared to not having an interval.

[13]の設計選択に従って変形可能な畳み込み層を直接組み込むと精度はかなり向上しますが、速度のオーバーヘッドも非常に大きくなります（表7参照）。これは、ResNet-101を使用すると、変形可能な畳み込みを持つ層が30あるためです。精度の向上を維持しながらResNet-101モデルを高速化するために、より少ない数の変形可能な畳み込みの使用を検討します。具体的には、変形可能な畳み込みを次の4つの構成で試します。
（1）最後の10個のResNetブロック
（2）最後の13個のResNetブロック
（3）最後の3つのResNetステージで間隔3（つまり、間の2つのResNetブロックをスキップ。合計11の変形可能層）
（4）最後の3つのResNetステージで間隔4（合計8の変形可能層）
結果を踏まえ、DCN（interval = 3）設定をYOLACT++の最終構成として選択します。これにより、速度のオーバーヘッドが5.2ミリ秒削減されて2.8ミリ秒になり、間隔を設けない場合と比較して0.2 mAPの低下しかありません。
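The layer totals quoted for the interval configurations can be reproduced with a short count. ResNet-101 has 3, 4, 23 and 3 residual blocks in stages C2 to C5; the sketch assumes that the interval count restarts at the first block of each stage, which is an assumption chosen because it matches the totals in the text (30, 11 and 8), not a detail stated in the paper. The names below are hypothetical.

```python
# Residual blocks per stage in ResNet-101.
RESNET101_BLOCKS = {"C2": 3, "C3": 4, "C4": 23, "C5": 3}


def deformable_blocks(interval, stages=("C3", "C4", "C5")):
    """Indices of the blocks that get a deformable 3x3 conv, per stage.

    Assumption: counting restarts at block 0 of every stage.
    """
    return {
        stage: [i for i in range(RESNET101_BLOCKS[stage]) if i % interval == 0]
        for stage in stages
    }


totals = {}
for interval in (1, 3, 4):
    totals[interval] = sum(len(v) for v in deformable_blocks(interval).values())
    print(f"interval={interval}: {totals[interval]} deformable layers")
```

With an interval of 1 every block in C3 to C5 is deformable, which gives the 30 layers mentioned for the direct DCNv2-style replacement.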

6.3 Optimized Prediction Head

Finally, as YOLACT is based off of an anchor-based backbone detector, choosing the right hyper-parameters for the anchors, such as their scales and aspect ratios, is very important. We therefore revisit our anchor choice and compare with the anchor design of RetinaNet [25] and RetinaMask [50]. We try two variations: (1) keeping the scales unchanged while increasing the anchor aspect ratios from [1, 1/2, 2] to [1, 1/2, 2, 1/3, 3], and (2) keeping the aspect ratios unchanged while increasing the scales per FPN level by threefold ([1x, $2^{1/3}$x, $2^{2/3}$x]). The former and latter increase the number of anchors compared to the original configuration of YOLACT by 5/3x and 3x, respectively. As shown in Table 6, using 3 multi-scale anchors per FPN level (config 2) produces the best speed vs. performance trade off.

最後に、YOLACTはアンカーベースのバックボーン検出器に基づいているため、スケールやアスペクト比など、アンカーの適切なハイパーパラメーターを選択することが非常に重要です。したがって、アンカーの選択を再検討し、RetinaNet [25]およびRetinaMask [50]のアンカー設計と比較します。 2つのバリエーションを試します。
（1）アンカーアスペクト比を[1、1/2、2]から[1、1/2、2、1/3、3]に増やしながらスケールを変更せずに維持する
（2）FPNレベルあたりのスケールを3倍（[1x, $2^{1/3}$x, $2^{2/3}$x]）に増やしながらアスペクト比を変更せずに維持する
前者と後者では、YOLACTの元の構成に比べてアンカーの数がそれぞれ5/3倍と3倍増加します。表6に示すように、FPNレベルごとに3つのマルチスケールアンカーを使用すると（構成2）、最高の速度とパフォーマンスのトレードオフが得られます。
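The 5/3x and 3x factors are simply the number of (aspect ratio, scale) combinations per location relative to the original three anchors. A sketch with hypothetical names:

```python
BASE_RATIOS = [1, 1 / 2, 2]

# (aspect ratios, scale multipliers) for each configuration described in the text.
configs = {
    "YOLACT": (BASE_RATIOS, [1.0]),
    "config 1": ([1, 1 / 2, 2, 1 / 3, 3], [1.0]),
    "config 2": (BASE_RATIOS, [2 ** (i / 3) for i in range(3)]),
}


def anchors_per_location(ratios, scale_multipliers):
    """Every aspect ratio is paired with every scale multiplier."""
    return len(ratios) * len(scale_multipliers)


counts = {name: anchors_per_location(*cfg) for name, cfg in configs.items()}
for name, n in counts.items():
    print(f"{name}: {n} anchors per location ({n / counts['YOLACT']:.2f}x)")

# Scales on P3 under config 2, starting from the base scale of 24.
p3_scales = [24 * m for m in configs["config 2"][1]]
print([round(s, 1) for s in p3_scales])
```

Config 2 spaces the three scales evenly in log space, so the largest scale on one level stays below the base scale of the next level (48 for P4).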

7 Results

We report instance segmentation results on MS COCO [10] and Pascal 2012 SBD [35] using the standard metrics. For MS COCO, we train on train2017 and evaluate on val2017 and test-dev. We also report box detection results on MS COCO.

標準のメトリックを使用して、MS COCO [10]およびPascal 2012 SBD [35]のインスタンスセグメンテーション結果を報告します。 MS COCOについては、train2017でトレーニングし、val2017とtest-devで評価します。 MS COCOのボックス検出結果も報告します。

7.1 Implementation Details

We train all models with batch size 8 on one GPU using ImageNet [51] pretrained weights. We find that this is a sufficient batch size to use batch norm, so we leave the pretrained batch norm unfrozen but do not add any extra bn layers. We train with SGD for 800k iterations starting at an initial learning rate of $10^{-3}$ and divide by 10 at iterations 280k, 600k, 700k, and 750k, using a weight decay of $5 \times 10^{-4}$, a momentum of 0.9, and all data augmentations used in SSD [6]. For Pascal, we train for 120k iterations and divide the learning rate at 60k and 100k. We also multiply the anchor scales by 4/3, as objects tend to be larger. Training takes 4-6 days (depending on config) on one Titan Xp for COCO and less than 1 day on Pascal.

ImageNet [51]で事前トレーニング済みの重みを使用して、1つのGPUでバッチサイズ8ですべてのモデルをトレーニングします。これはバッチノルムを使用するのに十分なバッチサイズであることがわかったため、事前トレーニング済みのバッチノルムはフリーズしませんが、余分なbnレイヤーは追加しません。SGDで800k回の反復をトレーニングし、初期学習率$10^{-3}$から始めて、280k、600k、700k、750k回目の反復で10で割ります。重み減衰は$5 \times 10^{-4}$、モーメンタムは0.9とし、SSD [6]で使用されるすべてのデータ拡張を使用します。Pascalの場合、120k回の反復でトレーニングを行い、60kと100kで学習率を割ります。また、物体が大きい傾向があるため、アンカースケールに4/3を掛けます。トレーニングは、1台のTitan XpでCOCOでは4〜6日（構成によって異なります）、Pascalでは1日未満かかります。
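The COCO learning-rate schedule described above is a plain step schedule. A minimal sketch, assuming the rate drops exactly at the listed iterations and that there is no warm-up phase (neither detail is specified in the text); the function name is hypothetical:

```python
def learning_rate(iteration, base_lr=1e-3,
                  steps=(280_000, 600_000, 700_000, 750_000), gamma=0.1):
    """Step schedule: multiply by gamma each time a step iteration is reached."""
    passed = sum(1 for step in steps if iteration >= step)
    return base_lr * gamma ** passed


for it in (0, 280_000, 600_000, 700_000, 750_000, 799_999):
    print(f"iteration {it:>7}: lr = {learning_rate(it):.0e}")
```

For the Pascal setting, the same function could be called with `steps=(60_000, 100_000)`, under the assumption that the division factor there is also 10.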

7.2 Mask Results

We first compare YOLACT to state-of-the art methods on COCO’s test-dev set in Table 1. Because our main goal is speed, we compare against other single model results with no test-time augmentations. We report all speeds computed on a single Titan Xp, so some listed speeds may be faster than in the original paper.

まず、表1にあるCOCOのtest-devセットで、YOLACTを最先端の方法と比較します。主な目標は速度であるため、テスト時間の増加なしで他の単一モデルの結果と比較します。単一のTitan Xpで計算されたすべての速度を報告するため、リストされている速度の一部は、元の論文よりも速い場合があります。

YOLACT-550 offers competitive instance segmentation performance while at 3.8x the speed of the previous fastest instance segmentation method on COCO. We also note an interesting difference in where the performance of our method lies compared to others. Supporting our qualitative findings in Figure 9, the gap between YOLACT-550 and Mask R-CNN at the 50% overlap threshold is 9.5 AP, while it’s 6.6 at the 75% IoU threshold. This is different from the performance of FCIS, for instance, compared to Mask R-CNN where the gap is consistent (AP values of 7.5 and 7.6 respectively). Furthermore, at the highest (95%) IoU threshold, we outperform Mask R-CNN with 1.6 vs. 1.3 AP.

YOLACT-550は、競争力のあるインスタンスセグメンテーション精度を提供しますが、COCOの従来の最速のインスタンスセグメンテーション手法の3.8倍の速度です。また、この手法の精度が他の手法と比べてどこにあるかという興味深い違いにも注意してください。図9の定性的な結果を裏付けるように、50％のオーバーラップしきい値でのYOLACT-550とMask R-CNNのギャップは9.5 APですが、75％IoUのしきい値では6.6です。これは、たとえば、ギャップが一定であるマスクR-CNNと比較して、FCISの精度とは異なります（AP値はそれぞれ7.5および7.6）。さらに、最高（95％）のIoUしきい値では、1.6 AP対1.3 APでMask R-CNNよりも優れています。

We also report numbers for alternate model configurations in Table 1. In addition to our base 550 × 550 image size model, we train 400 × 400 (YOLACT-400) and 700 × 700 (YOLACT-700) models, adjusting the anchor scales accordingly ($s_x = s_{550} / 550 \cdot x$). Lowering the image size results in a large decrease in performance, demonstrating that instance segmentation naturally demands larger images. Then, raising the image size decreases speed significantly but also increases performance, as expected. In addition to our base backbone of ResNet-101 [8], we also test ResNet-50 and DarkNet-53 [1] to obtain even faster results. If higher speeds are preferable we suggest using ResNet-50 or DarkNet-53 instead of lowering the image size, as these configurations perform much better than YOLACT-400, while only being slightly slower.

また、代替モデル構成の数値を表1に報告します。ベース550×550画像サイズモデルに加えて、400×400（YOLACT-400）および700×700（YOLACT-700）モデルをトレーニングし、それに応じてアンカースケールを調整します。（sx = s550 / 550 ∗ x）。画像サイズを小さくすると、精度が大幅に低下します。これは、インスタンスセグメンテーションに自然に大きな画像が必要であることを示しています。次に、画像サイズを大きくすると速度が大幅に低下しますが、期待どおりに精度も向上します。 ResNet-101 [8]の基本バックボーンに加えて、ResNet-50およびDarkNet-53 [1]もテストして、より高速な結果を取得します。高速が望ましい場合、画像サイズを小さくする代わりにResNet-50またはDarkNet-53を使用することをお勧めします。これらの構成は、YOLACT-400よりもはるかに優れた精度を発揮しますが、わずかに遅いだけです。

The bottom two rows in Table 1 show the results of our YOLACT++ model with ResNet-50 and ResNet-101 backbones. With the proposed enhancements, YOLACT++ obtains a huge performance boost over YOLACT (5.9 mAP for the ResNet-50 model and 4.8 mAP for the ResNet-101 model) while maintaining high speed. In particular, our YOLACT++-ResNet-50 model runs at a real-time speed of 33.5 fps, which is 3.9x faster than Mask R-CNN, while its instance segmentation accuracy only falls behind by 1.6 mAP.

表1の下2行は、ResNet-50およびResNet-101バックボーンを使用したYOLACT ++モデルの結果を示しています。提案された拡張機能により、YOLACT ++は、高速を維持しながら、YOLACT（ResNet-50モデルの場合は5.9 mAP、ResNet-101モデルの場合は4.8 mAP）よりも大幅に精度が向上します。特に、YOLACT ++-ResNet-50モデルは、33.5 fpsのリアルタイム速度で動作し、Mask R-CNNよりも3.9倍高速ですが、インスタンスセグメンテーション精度は1.6 mAPだけ遅れています。

TABLE 1: MS COCO [10] Results
We compare to state-of-the-art methods for mask mAP and speed on COCO test-dev and include several ablations of our base model, varying backbone network and image size. We denote the backbone architecture with network-depth-features, where R and D refer to ResNet [8] and DarkNet [1], respectively. Our base model, YOLACT-550 with ResNet-101, is 3.9x faster than the previous fastest approach with competitive mask mAP. Our YOLACT++-550 model with ResNet-50 has the same speed while improving the performance of the base model by 4.3 mAP. Compared to Mask R-CNN, YOLACT++-R-50 is 3.9x faster and falls behind by only 1.6 mAP.

表１: MS COCO [10] Results

COCO test-devのmask mAPと速度の最先端の方法と比較し、さまざまなバックボーンネットワークと画像サイズのベースモデルのいくつかのアブレーションを含めます。 RとDがそれぞれResNet [8]とDarkNet [1]を参照するネットワーク-深さ-特徴を持ったバックボーンアーキテクチャを示します。基本モデルであるResNet-101を搭載したYOLACT-550は、競合するmask mAPを使用した従来の最速アプローチよりも3.9倍高速です。 ResNet-50を搭載したYOLACT ++-550モデルの速度は同じですが、ベースモデルの精度は4.3 mAP向上しています。マスクR-CNNと比較して、YOLACT ++-R-50は3.9倍速く、わずか1.6 mAPだけ遅れています。

Finally, we also train and evaluate our YOLACT ResNet-50 model on Pascal 2012 SBD in Table 3. YOLACT clearly outperforms popular approaches that report SBD performance, while also being significantly faster.

最後に、表3のPascal 2012 SBDでYOLACT ResNet-50モデルをトレーニングおよび評価します。YOLACTは、SBDの精度を報告する一般的なアプローチよりも明らかに優れていますが、大幅に高速です。

7.3 Mask Quality

Because we produce a final mask of size 138 × 138, and because we create masks directly from the original features (with no repooling to transform and potentially misalign the features), our masks for large objects are noticeably higher quality than those of Mask R-CNN [2] and FCIS [3]. For instance, in Figure 9, YOLACT produces a mask that cleanly follows the boundary of the arm, whereas both FCIS and Mask R-CNN have more noise. Moreover, despite being 5.9 mAP worse overall, at the 95% IoU threshold, our base model achieves 1.6 AP while Mask R-CNN obtains 1.3. This indicates that repooling does result in a quantifiable decrease in mask quality.

サイズ138×138の最終マスクを生成し、元の特徴から直接マスクを作成するため（特徴を変換し、位置ずれを起こす可能性のあるリプールを行わないため）、大きな物体に対する我々のマスクは、Mask R-CNN [2]およびFCIS [3]のマスクよりも著しく高品質です。たとえば、図9では、YOLACTは腕の境界をきれいにたどるマスクを生成しますが、FCISとMask R-CNNはどちらもノイズが多くなります。さらに、全体では5.9 mAP劣っているにもかかわらず、95％のIoUしきい値では、基本モデルは1.6 APを達成し、Mask R-CNNは1.3にとどまります。これは、リプールがマスク品質の定量的な低下をもたらすことを示しています。

Fig. 9: Mask Quality
Our masks are typically higher quality than those of Mask R-CNN [2] and FCIS [3] because of the larger mask size and lack of feature repooling.

図9: Mask Quality

マスクのサイズが大きく、特徴のリプールがないため、通常、マスクはMask R-CNN [2]およびFCIS [3]よりも高品質です。

7.4 Temporal Stability

Although we only train using static images and do not apply any temporal smoothing, we find that our model produces more temporally stable masks on videos than Mask R-CNN, whose masks jitter across frames even when objects are stationary. We believe our masks are more stable in part because they are higher quality (thus there is less room for error between frames), but mostly because our model is one-stage. Masks produced in two-stage methods are highly dependent on their region proposals in the first stage. In contrast for our method, even if the model predicts different boxes across frames, the prototypes are not affected, yielding much more temporally stable masks.

静止画像のみを使用してトレーニングを行い、時間的な平滑化は適用しませんが、モデルでは、物体が静止している場合でもフレーム間でマスクが揺れるMask R-CNNよりも、ビデオ上で時間的に安定したマスクが生成されることがわかります。マスクは品質が高いため（フレーム間のエラーの余地が少ないため）、マスクがより安定していると考えていますが、これは主にモデルがone-stageであるためです。 two-stageの方法で作成されたマスクは、最初の段階での領域の提案に大きく依存します。私たちの方法とは対照的に、モデルがフレーム間で異なるボックスを予測しても、プロトタイプは影響を受けず、はるかに時間的に安定したマスクが生成されます。

7.5 More Qualitative Results

Figure 7 shows many examples of adjacent people and vehicles, but not many for other classes. To further support that YOLACT is not just doing semantic segmentation, we include many more qualitative results for images with adjacent instances of the same class in Figure 8.

図7は、隣接する人々と乗り物の多くの例を示していますが、他のクラスの多くはありません。 YOLACTがセマンティックセグメンテーションを行っているだけではないことをさらにサポートするために、図8に同じクラスの隣接するインスタンスを持つ画像のより多くの定性的な結果を含めます。

Fig. 7: YOLACT evaluation results on COCO’s test-dev set. This base model achieves 29.8 mAP at 33.0 fps. All images have the confidence threshold set to 0.3.

図7: YOLACT

COCOのテスト開発セットのYOLACT評価結果。この基本モデルは、33.0 fpsで29.8 mAPを達成します。すべての画像の信頼性しきい値は0.3に設定されています。

Fig. 8: More YOLACT evaluation results on COCO’s test-dev set with the same parameters as before. To further support that YOLACT implicitly localizes instances, we select examples with adjacent instances of the same class.

図8: More YOLACT

以前と同じパラメーターを設定したCOCOのテスト開発セットの評価結果。 YOLACTが暗黙的にインスタンスを領域推定することをさらにサポートするために、同じクラスの隣接するインスタンスを持つ例を選択します。

For instance, in an image with two elephants (Figure 8 row 2, col 2), despite the fact that two instance boxes are overlapping with each other, their masks are clearly separating the instances. This is also clearly manifested in the examples of zebras (row 4, col 2) and birds (row 5, col 1).

たとえば、2頭の象の画像（図8の2行目、2列目）では、2つのインスタンスボックスが互いに重なり合っているにもかかわらず、マスクがインスタンスを明確に分離しています。これは、シマウマ（4行目、2列目）と鳥（5行目、1列目）の例でも明らかに現れています。

Note that for some of these images, the box doesn’t exactly crop off the mask. This is because for speed reasons (and because the model was trained in this way), we crop the mask at the prototype resolution (so one fourth the image resolution) with 1px of padding in each direction. On the other hand, the corresponding box is displayed at the original image resolution with no padding.

これらの画像の一部では、ボックスがマスクから正確に切り取られないことに注意してください。これは、速度の理由（およびモデルがこの方法でトレーニングされたため）で、各方向に1pxのパディングでプロトタイプ解像度（画像解像度の4分の1）でマスクをトリミングするためです。一方、対応するボックスは、パディングなしで元の画像解像度で表示されます。

7.6 Box Results

Since YOLACT produces boxes in addition to masks, we can also compare its object detection performance to other real-time object detection methods. Moreover, while our mask performance is real-time, we don’t need to produce masks to run YOLACT as an object detector. Thus, YOLACT is faster when run to produce boxes than when run to produce instance segmentations.

YOLACTはマスクに加えてボックスを生成するため、その物体検出精度を他のリアルタイムの物体検出方法と比較することもできます。さらに、マスクの精度はリアルタイムですが、YOLACTを物体検出器として実行するためにマスクを作成する必要はありません。したがって、YOLACTは、インスタンスセグメンテーションを生成するために実行する場合よりも、ボックスを生成するために実行する方が高速です。

In Table 4, we compare our performance and speed to various skews of YOLOv3 [1]. We are able to achieve similar detection results to YOLOv3 at similar speeds, while not employing any of the additional improvements in YOLOv2 and YOLOv3 like multi-scale training, optimized anchor boxes, cell-based regression encoding, and objectness score. Because the improvements to our detection performance in our observation come mostly from using FPN and training with masks (both of which are orthogonal to the improvements that YOLO makes), it is likely that we can combine YOLO and YOLACT to create an even better detector.

表4では、精度と速度をYOLOv3のさまざまなスキューと比較しています[1]。マルチスケールトレーニング、最適化されたアンカーボックス、セルベースの回帰エンコード、オブジェクトネススコアなど、YOLOv2およびYOLOv3の追加の改善を使用せずに、同様の速度でYOLOv3と同様の検出結果を達成できます。観測における検出性能の改善は、主にFPNの使用とマスクを使用したトレーニング（どちらもYOLOによる改善に直交する）によるものであるため、YOLOとYOLACTを組み合わせてさらに優れた検出器を作成できる可能性があります。

TABLE 4: Box Performance on COCO’s test-dev set.
For our method, timing is done without evaluating the mask branch. Both methods were timed on the same machine (using one Titan Xp). In each subgroup, we compare similar performing versions of our model to a corresponding YOLOv3 model. YOLOv3 doesn’t report all metrics for the 320 and 416 versions.

表4: Box Performance on COCO’s test-dev set.

この方法では、マスクブランチを評価せずにタイミングが実行されます。どちらの方法も同じマシン上でタイミングがとられました（1台のTitan Xpを使用）。各サブグループでは、モデルの同様の精度バージョンを対応するYOLOv3モデルと比較します。 YOLOv3は、320バージョンと416バージョンのすべての指標を報告しません。

Moreover, these detection results show that our mask branch takes only 6 ms in total to evaluate, which demonstrates how minimal our mask computation is.

さらに、これらの検出結果は、マスクブランチの評価に合計6ミリ秒しかかからないことを示しています。これは、マスクの計算が最小限であることを示しています。

7.7 YOLACT++ Improvements

Table 5 shows the contribution of each new component in our YOLACT++ model. The optimized anchor choice directly improves the recall of box prediction and boosts our backbone detector. The deformable convolutions help with better feature sampling by aligning the sampling positions with the instances of interest and better handles changes in scale, rotation, and aspect ratio. Importantly, with our exploration of using less deformable convolution layers, we can cut down their speed overhead significantly (from 8 ms to 2.8 ms) while keeping the performance almost the same (only 0.2 mAP drop) as compared to the original configuration proposed in [13]; see Table 7. With these two upgrades for object detection, YOLACT++ suffers less from localization failure and has finer mask predictions, as shown in Figure 10b, c, which together result in 3.4 mAP and 4.2 mAP boost for ResNet-101 and ResNet-50, respectively. In addition, the proposed fast mask re-scoring network re-ranks the mask predictions with the IoU based mask scores instead of solely relying on classification confidence. As a result, the under-estimated masks (masks with good quality but with low classification confidence) and over-estimated masks (masks with bad quality but with high classification confidence) are put into a more proper ranking as shown in Figure 10a. Our mask re-scoring method is also fast. Compared to incorporating MS R-CNN into YOLACT, it is 26.8 ms faster yet can still improve YOLACT by 1 mAP.

表5は、YOLACT++モデルの新しい各コンポーネントの貢献度を示しています。最適化されたアンカーの選択は、ボックス予測のリコールを直接改善し、バックボーン検出器を強化します。変形可能な畳み込みは、サンプリング位置を対象のインスタンスに揃えることで特徴のサンプリングを改善し、スケール、回転、およびアスペクト比の変化をより適切に処理します。重要なのは、より少ない数の変形可能な畳み込み層の使用を検討したことで、[13]で提案された元の構成と比較して、精度をほぼ同じ（わずか0.2 mAPの低下）に保ちながら、速度のオーバーヘッドを大幅に（8ミリ秒から2.8ミリ秒に）削減できることです。表7を参照してください。これら2つの物体検出のアップグレードにより、図10b、cに示すように、YOLACT++は領域推定の失敗の影響を受けにくくなり、マスク予測が精細になります。これらを合わせると、ResNet-101およびResNet-50でそれぞれ3.4 mAPおよび4.2 mAPの向上になります。さらに、提案された高速マスク再スコアリングネットワークは、分類信頼度のみに依存するのではなく、IoUベースのマスクスコアでマスク予測を再ランク付けします。その結果、図10aに示すように、過小評価されたマスク（品質は高いが分類信頼度が低いマスク）および過大評価されたマスク（品質が悪いが分類信頼度が高いマスク）は、より適切なランキングに配置されます。マスクの再スコアリング方法も高速です。 MS R-CNNをYOLACTに組み込むのに比べて26.8ミリ秒高速でありながら、YOLACTを1 mAP改善できます。

TABLE 5: YOLACT++ Improvements
Contribution to instance segmentation accuracy and speed overhead of each component of YOLACT++. Results on MS COCO val2017.

表5: YOLACT++ Improvements

YOLACT ++の各コンポーネントのインスタンスセグメンテーションの精度と速度のオーバーヘッドへの貢献。 MS COCO val2017の結果。

8 Discussion

Despite our masks being higher quality and having nice properties like temporal stability, we fall a bit behind state-of-the-art instance segmentation methods in overall performance, albeit while being much faster. Most errors are caused by mistakes in the detector: misclassification, box misalignment, etc. However, we have identified two typical errors caused by YOLACT’s mask generation algorithm.

私たちのマスクは高品質であり、時間的な安定性などの優れた特性を備えていますが、はるかに高速である一方で、全体的な精度では最先端のインスタンスセグメンテーション手法に少し後れをとっています。ほとんどのエラーは、誤分類やボックスの位置ずれなど、検出器のミスが原因です。ただし、YOLACTのマスク生成アルゴリズムに起因する典型的なエラーを2つ特定しました。

Localization Failure

If there are too many objects in one spot in a scene, the network can fail to localize each object in its own prototype. In these cases, it will output something closer to a foreground mask than an instance segmentation for some objects in the group; e.g., in the first image in Figure 7 (row 1 column 1), the blue truck under the red airplane is not properly localized.

シーン内の1つのスポットに物体が多すぎる場合、ネットワークは各物体を独自のプロトタイプに領域推定できないことがあります。これらの場合、グループ内の一部の物体のインスタンスセグメンテーションよりも前景マスクに近いものを出力します。たとえば、図7の最初の画像（行1列1）では、赤い飛行機の下の青いトラックが適切に領域推定されていません。

Our YOLACT++ model addresses this problem to some degree by introducing more anchors covering more scales and applying deformable convolutions in the backbone for better feature sampling. For example, there are higher confidence and more accurate box detections in Figure 10c using YOLACT++.

YOLACT++モデルは、より多くのスケールをカバーするアンカーを追加し、より良い特徴サンプリングのためにバックボーンに変形可能な畳み込みを適用することで、ある程度この問題に対処しています。たとえば、図10cでは、YOLACT++を使用することで、より信頼度が高く、より正確なボックス検出が得られています。

Leakage

Our network leverages the fact that masks are cropped after assembly, and makes no attempt to suppress noise outside of the cropped region. This works fine when the bounding box is accurate, but when it is not, that noise can creep into the instance mask, creating some “leakage” from outside the cropped region. This can also happen when two instances are far away from each other, because the network has learned that it doesn’t need to localize far away instances—the cropping will take care of it. However, if the predicted bounding box is too big, the mask will include some of the far away instance’s mask as well. For instance, Figure 7 (row 2 column 4) exhibits this leakage because the mask branch deems the three skiers to be far enough away to not have to separate them.

私たちのネットワークは、アセンブリ後にマスクがクロップされるという事実を利用しており、クロップされた領域の外側のノイズを抑制しようとはしません。これは、バウンディングボックスが正確な場合はうまく機能しますが、そうでない場合は、そのノイズがインスタンスマスクに入り込み、クロップされた領域の外側からの「漏れ」が生じる可能性があります。これは、2つのインスタンスが互いに遠く離れている場合にも発生する可能性があります。遠くのインスタンスはクロップが処理してくれるため、領域推定する必要がないことをネットワークが学習しているからです。ただし、予測されたバウンディングボックスが大きすぎる場合、マスクには遠くのインスタンスのマスクの一部も含まれます。たとえば、図7（行2列4）では、マスクブランチが3人のスキーヤーを分離する必要がないほど十分に離れているとみなしているため、この漏れが生じています。
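A toy example may make the leakage mechanism concrete. This is a minimal sketch under stated assumptions: the 2x8 grid, the already-thresholded `assembled` mask, and the helper names `crop` and `area` are hypothetical and are not taken from the YOLACT code base.

```python
def crop(mask, box):
    """Zero out everything outside box = (x0, y0, x1, y1), end-exclusive."""
    x0, y0, x1, y1 = box
    return [
        [v if (x0 <= x < x1 and y0 <= y < y1) else 0 for x, v in enumerate(row)]
        for y, row in enumerate(mask)
    ]


def area(mask):
    # Number of foreground pixels in a binary mask.
    return sum(sum(row) for row in mask)


# Assembled, thresholded mask for instance A on a 2x8 grid (toy data).
# Columns 0-2 belong to instance A. Columns 5-7 are a far-away instance B
# that the mask branch did not suppress, relying on the crop to remove it.
assembled = [
    [1, 1, 1, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 1, 1, 1],
]

tight_box = (0, 0, 3, 2)  # accurate box around instance A
loose_box = (0, 0, 6, 2)  # predicted box that is too wide

clean = crop(assembled, tight_box)
leaky = crop(assembled, loose_box)

print(area(clean))  # 6: only instance A survives the crop
print(area(leaky))  # 8: two pixels of instance B leak into the mask
```

The crop is the only step that separates the two instances here, so any error in the box translates directly into extra foreground pixels.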

Our YOLACT++ model partially mitigates these issues with a light-weight mask error down-weighting scheme, where masks exhibiting these errors will be ignored or ranked lower than higher quality masks. In Figure 10a, the leftmost giraffe’s mask has the best quality and with mask re-scoring, it is ranked highest with YOLACT++ whereas with YOLACT it is ranked 3rd among all detections in the image.

YOLACT++モデルは、軽量なマスクエラーの重み引き下げ方式によって、これらの問題を部分的に軽減します。この方式では、これらのエラーを示すマスクは無視されるか、より高品質なマスクよりも低くランク付けされます。図10aでは、左端のキリンのマスクの品質が最も高く、マスクのリスコアリングによってYOLACT++では最高位にランク付けされますが、YOLACTでは画像内のすべての検出の中で3位にランク付けされます。

Understanding the AP Gap

However, localization failure and leakage alone are not enough to explain the almost 6 mAP gap between YOLACT’s base model and, say, Mask R-CNN. Indeed, our base model on COCO has just a 2.5 mAP difference between its test-dev mask and box mAP (29.8 mask, 32.3 box), meaning our base model would only gain a few points of mAP even with perfect masks. Moreover, Mask R-CNN has this same mAP difference (35.7 mask, 38.2 box), which suggests that the gap between the two methods lies in the relatively poor performance of our detector and not in our approach to generating masks.

ただし、領域推定の失敗と漏れだけでは、YOLACTの基本モデルと Mask R-CNNの間のほぼ6 mAPのギャップを説明するには不十分です。実際、COCOのベースモデルには、test-devマスクとボックスmAP（29.8マスク、32.3ボックス）の差が2.5 mAPしかないため、完全なマスクを使用してもベースモデルはmAPの数ポイントしか獲得できません。さらに、Mask R-CNNにはこの同じmAPの違い（35.7マスク、38.2ボックス）があります。これは、2つの方法のギャップは、マスクを生成するアプローチではなく、検出器の比較的低いパフォーマンスにあることを示唆しています。

We further corroborate this hypothesis by upgrading our backbone detector in YOLACT++, where the mAP difference is still only 1.5 (34.6 mask, 36.1 box).

この仮説は、mAPの差がまだ1.5（34.6マスク、36.1ボックス）であるYOLACT ++のバックボーン検出器をアップグレードすることでさらに裏付けられています。
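The arithmetic behind this argument can be checked directly from the numbers quoted above. The sketch below uses only the mask and box mAP figures given in the text; the dictionary layout and the helper name `box_minus_mask` are illustrative choices.

```python
# Mask and box mAP values as quoted in the text.
results = {
    "YOLACT (base)": {"mask": 29.8, "box": 32.3},
    "Mask R-CNN": {"mask": 35.7, "box": 38.2},
    "YOLACT++": {"mask": 34.6, "box": 36.1},
}


def box_minus_mask(r):
    # Difference between box mAP and mask mAP for one model.
    return round(r["box"] - r["mask"], 1)


headroom = {name: box_minus_mask(r) for name, r in results.items()}

# Gap between Mask R-CNN and the YOLACT base model, for masks and for boxes.
mask_gap = round(results["Mask R-CNN"]["mask"] - results["YOLACT (base)"]["mask"], 1)
box_gap = round(results["Mask R-CNN"]["box"] - results["YOLACT (base)"]["box"], 1)

print(headroom)  # {'YOLACT (base)': 2.5, 'Mask R-CNN': 2.5, 'YOLACT++': 1.5}
print(mask_gap)  # 5.9
print(box_gap)   # 5.9
```

The box mAP gap between the two methods (5.9) is the same as the mask mAP gap (5.9). This is consistent with the claim that the difference comes from the detector and not from the mask generation step.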

9 Conclusion

We presented the first competitive single-stage real-time instance segmentation method. The key idea is to predict mask prototypes and per-instance mask coefficients in parallel, and linearly combine them to form the final instance masks. Extensive experiments on MS COCO and Pascal VOC demonstrated the effectiveness of our approach and contribution of each component. We also analyzed the emergent behavior of our prototypes to explain how YOLACT, even as an FCN, introduces translation variance for instance segmentation. Finally, with improvements to the backbone network, a better anchor design, and a fast mask re-scoring network, our YOLACT++ showed a significant boost compared to the original framework while still running at real-time.

競争力のある初のone-stageリアルタイムインスタンスセグメンテーション手法を提示しました。重要な考え方は、マスクのプロトタイプとインスタンスごとのマスク係数を並列に予測し、それらを線形結合して最終的なインスタンスマスクを形成することです。MS COCOおよびPascal VOCでの広範な実験により、私たちのアプローチの有効性と各コンポーネントの貢献が実証されました。また、プロトタイプの創発的な振る舞いを分析して、FCNであってもYOLACTがインスタンスセグメンテーションに並進変動性（translation variance）を導入する仕組みを説明しました。最後に、バックボーンネットワークの改善、より優れたアンカー設計、および高速マスクリスコアリングネットワークにより、YOLACT++は、リアルタイムで動作しながら、元のフレームワークと比べて大幅な向上を示しました。
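The key idea restated in the conclusion, linearly combining prototypes with per-instance coefficients, can be sketched in a few lines of plain Python. The toy 1x4 prototypes, the coefficient values, the sigmoid non-linearity, and the 0.5 threshold are assumptions of this sketch and not a reproduction of the released model.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def assemble(prototypes, coeffs):
    """Combine k prototype maps (each H x W) with k coefficients into one H x W mask."""
    h, w = len(prototypes[0]), len(prototypes[0][0])
    return [
        [
            sigmoid(sum(c * p[y][x] for p, c in zip(prototypes, coeffs)))
            for x in range(w)
        ]
        for y in range(h)
    ]


def binarize(mask, threshold=0.5):
    return [[int(v > threshold) for v in row] for row in mask]


# Two toy prototypes: one activates on the left half, one on the right half.
protos = [
    [[1.0, 1.0, 0.0, 0.0]],
    [[0.0, 0.0, 1.0, 1.0]],
]

# Each instance has its own coefficients over the shared prototypes.
left_instance = binarize(assemble(protos, [4.0, -4.0]))
right_instance = binarize(assemble(protos, [-4.0, 4.0]))

print(left_instance)   # [[1, 1, 0, 0]]
print(right_instance)  # [[0, 0, 1, 1]]
```

The prototypes are shared by all instances in the image. Only the small coefficient vector differs per instance, which is why the two subtasks can be computed in parallel.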
