Image recognition: comparing "Vision Transformer (ViT)" and "Visual Transformers"

Last updated at 2021-02-28. Posted at 2020-10-15.

Overview

The following "Vision Transformer (ViT)" paper seems to be attracting attention.

AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE

In my hurry, I mistakenly started reading
the following "Visual Transformers" paper instead:

Visual Transformers: Token-based Image Representation and Processing for Computer Vision

So I will compare the two.

Comparison

[Comparison 1] Representative figure

Vision Transformer (ViT)

The representative figure from the paper is quoted below.

Source: https://openreview.net/forum?id=YicbFdNTTy

Visual Transformers

The representative figure from the paper is quoted below.

Source: https://arxiv.org/pdf/2006.03677.pdf

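[Comparison 2] Relationship to CNNs (or rather, to fully connected layers?)

Problems with CNNs

I think one criticism of CNNs has been around from the very beginning: **do the fully connected layers ignore spatial relationships?**
The two papers respond to this as follows.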

Vision Transformer (ViT)

As the quote below shows, ViT uses position embeddings
to deliberately retain spatial position information.

Source: https://openreview.net/forum?id=YicbFdNTTy

Position embeddings are added to the patch embeddings to retain positional information. We explore
different 2D-aware variants of position embeddings (Appendix C.3) without any significant gains
over standard 1D position embeddings. The joint embedding serves as input to the encoder.
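
To make the quoted sentence concrete, here is a minimal NumPy sketch of adding 1D position embeddings to patch embeddings. It is only an illustration under my own assumptions: the helper names (`patchify`, `embed_with_positions`), the tiny sizes, and the random weights are all made up here, and the class token and the encoder itself are left out.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (gh, gw, p, p, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def embed_with_positions(image, patch_size, w_proj, pos_embed):
    """Project each patch linearly, then add one position vector per patch."""
    patches = patchify(image, patch_size)  # (N, p*p*C)
    patch_embed = patches @ w_proj         # (N, D)
    return patch_embed + pos_embed         # (N, D), the "joint embedding"

# Hypothetical toy sizes: a 32x32 RGB image, 16x16 patches, 8-dim embeddings.
rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32, 3))
patch_size, dim = 16, 8
num_patches = (32 // patch_size) ** 2
w_proj = rng.normal(size=(patch_size * patch_size * 3, dim)) * 0.02
# 1D position embeddings: one (normally learned) vector per patch index.
pos_embed = rng.normal(size=(num_patches, dim)) * 0.02

tokens = embed_with_positions(image, patch_size, w_proj, pos_embed)
print(tokens.shape)  # (4, 8)
```

Without `pos_embed`, shuffling the patches would only shuffle the rows of the output, so the position vectors are what let later layers tell the patches apart by location.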

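Visual Transformers

As the quote below shows, the paper treats capturing global spatial relationships as the problem,
and its replacement of the later layers of the network ("replace the last stage of convolutions", in the paper's words) can be read as its answer to that.

Source: https://arxiv.org/pdf/2006.03677.pdf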

Convolutions are not efficient for sparse, high-level semantic concepts: Low-level visual features such as corners and edges are densely distributed across images; as a result, convolutions are apt for image processing early in a neural network. However, as the receptive field increases deeper in the network, the number of potential patterns grows exponentially large. As a result, semantic concepts become increasingly sparse: each concept appears in small regions of a few images. Later convolutions become correspondingly inefficient.

In short: convolutions suit the dense, low-level features of early layers, but become inefficient for the sparse, high-level concepts of later layers.
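
As a rough illustration of what a "token-based" representation could look like in place of late convolutions, the sketch below pools a feature map into a handful of tokens with a spatial softmax. This is my own simplified reading, not a reproduction of the paper: the function names, shapes, and random weights are assumptions and should be checked against the paper before relying on them.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def tokenize(feature_map, w_attn):
    """Pool an (H, W, C) feature map into L tokens of size C.

    w_attn has shape (C, L): one attention map per token (assumed layout).
    """
    h, w, c = feature_map.shape
    x = feature_map.reshape(h * w, c)   # (HW, C), one row per pixel
    attn = softmax(x @ w_attn, axis=0)  # (HW, L), normalized over pixels
    return attn.T @ x                   # (L, C), weighted averages of pixels

# Hypothetical sizes: a 14x14 feature map with 32 channels, pooled to 8 tokens.
rng = np.random.default_rng(0)
feature_map = rng.normal(size=(14, 14, 32))
w_attn = rng.normal(size=(32, 8))
visual_tokens = tokenize(feature_map, w_attn)
print(visual_tokens.shape)  # (8, 32)
```

Each token is a weighted average over all pixels, so a sparse concept can be summarized by one token wherever it appears, instead of being searched for by a filter at every position.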

[Comparison 3] Relationship to CNNs

Vision Transformer (ViT)

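ViT is said not to use a CNN.
However,
as the figure below (left) shows, it ends up behaving much like a CNN, so whether it "uses" one or not may not mean very much. I do not fully understand this yet.
(Even for CNNs, people may well think that this part could be built in from the start rather than learned.)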
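
One way to see why the distinction may be blurry: a linear projection of non-overlapping patches gives the same numbers as a convolution whose kernel size and stride both equal the patch size. The NumPy sketch below checks this on toy data. The function names and sizes are hypothetical, and "convolution" here means the cross-correlation used in deep learning.

```python
import numpy as np

def patch_projection(image, patch_size, w_proj):
    """Flatten non-overlapping patches and project them with one matrix."""
    h, w, c = image.shape
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)   # (gh, gw, p, p, c)
    return patches.reshape(gh, gw, -1) @ w_proj  # (gh, gw, D)

def strided_conv(image, kernels, stride):
    """Naive valid convolution; kernels has shape (k, k, C, D)."""
    k = kernels.shape[0]
    h, w, _ = image.shape
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1
    out = np.zeros((out_h, out_w, kernels.shape[-1]))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i * stride:i * stride + k,
                           j * stride:j * stride + k, :]
            # Sum over height, width, and channels of the window.
            out[i, j] = np.tensordot(window, kernels,
                                     axes=([0, 1, 2], [0, 1, 2]))
    return out

# Hypothetical toy sizes: 32x32 RGB image, 16x16 patches, 8 output channels.
rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32, 3))
p, d = 16, 8
w_proj = rng.normal(size=(p * p * 3, d))
kernels = w_proj.reshape(p, p, 3, d)  # the same weights viewed as conv kernels

a = patch_projection(image, p, w_proj)
b = strided_conv(image, kernels, stride=p)
print(np.allclose(a, b))  # True
```

Seen this way, each column of the projection matrix is a 16x16 filter, so it is not too surprising if the learned filters look like those of a CNN's first layer.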