Weekly Transformer (for Image Recognition)

Posted at 2021-06-10

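Transformer papers (plus a few extras) are multiplying at a ferocious pace, so for now I am skimming them and leaving comments as I go. No guarantee that the content is correct.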

When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations
A paper from Google and UCLA. ViT and MLP-Mixer tend to fall into sharp local minima, so smoothing the loss landscape with SAM (a sharpness-aware optimizer) makes them work even better.
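
To make the SAM idea concrete, here is a minimal sketch of one update step in plain Python. It is my own toy version, not the paper's code: `sam_step`, `quadratic_grad`, and the values of `lr` and `rho` are all made up for illustration, and the loss is just a quadratic.

```python
import math


def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM-style update on a list of floats `w`.

    grad_fn(w) is a hypothetical helper returning the loss gradient at w.
    """
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g)) + 1e-12
    # Step 1: move to the approximate worst-case point inside an L2 ball of radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Step 2: take the gradient at that point and apply it to the ORIGINAL weights.
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


def quadratic_grad(w):
    # Gradient of the toy loss f(w) = sum(w_i^2); an assumption for this demo only.
    return [2.0 * wi for wi in w]


w = [1.0, -2.0]
for _ in range(50):
    w = sam_step(w, quadratic_grad)
```

The point is that each step costs two gradient evaluations: one to find the perturbed weights and one to compute the gradient that is actually applied.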

Understanding Robustness of Transformers for Image Classification
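A Google paper. Compares ResNet and ViT, mainly in terms of how robust they are to changes in the input. When trained on a large amount of data, the larger the model, the more ViT's accuracy pulls ahead of ResNet's. With medium-scale data there is no difference.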

Regularization in ResNet with Stochastic Depth
A theoretical study of Stochastic Depth, which is effective as a regularizer.
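
For reference, a rough sketch of what a Stochastic Depth residual block does, following the usual formulation (drop the residual branch at random during training, scale it by the survival probability at inference). The function and parameter names are mine, and vectors are plain Python lists to keep it dependency-free.

```python
import random


def stochastic_depth_block(x, residual_fn, survival_prob, training, rng=random):
    """Compute x + b * f(x), with b ~ Bernoulli(survival_prob) while training."""
    if training:
        if rng.random() < survival_prob:
            # Branch survives: ordinary residual connection.
            return [xi + fi for xi, fi in zip(x, residual_fn(x))]
        # Branch dropped: the block reduces to the identity.
        return list(x)
    # Inference: always keep the branch, scaled by its survival probability.
    return [xi + survival_prob * fi for xi, fi in zip(x, residual_fn(x))]
```

With `survival_prob=1.0` this is a plain residual block; lowering it for deeper layers is the common way to use it.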

Self-Damaging Contrastive Learning
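The unlabeled data used for contrastive learning is not evenly distributed: the amount of data is not the same across all classes. Building on SimCLR, they develop a network that is robust to imbalanced data.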

Learning from Noisy Labels with Deep Neural Networks: A Survey
A survey of learning with noisy labels.

Learnable Fourier Features for Multi-Dimensional Spatial Positional Encoding
A Google paper. On the natural-language side, making the position encoding a set of learnable Fourier coefficients improves accuracy and speeds up convergence.
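
A small sketch of the Fourier feature map, as far as I understand it: project the position with a frequency matrix, then concatenate cosines and sines. In the learnable version the matrix `W` would be trained; here it is just random numbers, and the MLP that I believe follows the features is left out. All names and sizes below are assumptions for illustration.

```python
import math
import random


def fourier_features(pos, W):
    """Map a position vector `pos` (length M) to 2 * len(W) features.

    W is a list of frequency vectors, each of length M. It stands in for the
    trainable matrix; nothing is actually learned in this sketch.
    """
    proj = [sum(p * w for p, w in zip(pos, row)) for row in W]
    scale = 1.0 / math.sqrt(2 * len(W))
    return [scale * math.cos(v) for v in proj] + [scale * math.sin(v) for v in proj]


rng = random.Random(0)
# 4 frequency vectors for 2-D (x, y) positions; both sizes are arbitrary choices.
W = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(4)]
feat = fourier_features([3.0, 5.0], W)
```

Because the position enters as a whole vector, the same map works for 1-D token indices and for 2-D pixel coordinates.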
＿＿＿＿＿＿＿＿＿

Self-supervised Pretraining of Visual Features in the Wild
A FAIR paper. A new self-supervised learning approach. Training a 1.3-billion-parameter model on 1 billion images with 512 GPUs reaches 84.2% on ImageNet.

Scaling Local Self-Attention for Parameter Efficient Visual Backbones
A Google paper. Develops HaloNet, a small-scale network that brings attention into a CNN. Better than EfficientNet.

Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length
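A paper from Huawei and Tsinghua University. ViT builds 16x16 patches, but easy images could use larger patches and hard ones should be cut more finely, so they propose Dynamic Vision Transformers (DVT), which start from the coarsest patches and move to finer ones until the image is recognized. The parameters from the coarse pass are reused, so it seems to end up small and more accurate.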
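
The coarse-to-fine loop can be sketched as an early-exit cascade. This is only my reading of the idea: `classify_with_early_exit`, the confidence `threshold`, and the stub models are hypothetical, and the reuse of features between stages is left out entirely.

```python
def classify_with_early_exit(image, models, threshold=0.9):
    """Run models from coarse to fine and stop once one is confident enough.

    models: list of callables, each returning a list of class probabilities.
    Returns (predicted_class, index_of_the_stage_that_answered).
    """
    for stage, model in enumerate(models):
        probs = model(image)
        confidence = max(probs)
        is_last = stage == len(models) - 1
        if confidence >= threshold or is_last:
            return probs.index(confidence), stage


# Stub models standing in for ViTs with few (coarse) and many (fine) tokens.
coarse = lambda image: [0.6, 0.4]
fine = lambda image: [0.05, 0.95]
prediction, used_stage = classify_with_early_exit(None, [coarse, fine])
```

Easy inputs exit at the cheap first stage, so the average cost drops while hard inputs still get the fine-grained model.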

Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer
A Tencent paper. A Transformer that improves accuracy by shuffling the tokens partway through the network. The code is to be released at some point.
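
My guess at what the shuffle looks like, reduced to 1-D: tokens are regrouped so that the next round of window attention sees tokens from different windows, in the same spirit as ShuffleNet's channel shuffle. This is a simplification under that assumption, not the paper's implementation, and `spatial_shuffle` is a made-up name.

```python
def spatial_shuffle(tokens, window_size):
    """Regroup tokens so each new window gathers tokens spaced num_windows apart."""
    n = len(tokens)
    assert n % window_size == 0, "token count must be a multiple of window_size"
    num_windows = n // window_size
    # View the sequence as (window_size, num_windows), transpose, then flatten.
    return [tokens[r * num_windows + c]
            for c in range(num_windows)
            for r in range(window_size)]


shuffled = spatial_shuffle(list(range(8)), window_size=4)
```

With 8 tokens and windows of 4, the original windows `[0, 1, 2, 3]` and `[4, 5, 6, 7]` become `[0, 2, 4, 6]` and `[1, 3, 5, 7]`, so each new window mixes tokens from both old ones.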

Scaling Vision Transformers
A Google paper. Investigates what happens as model size, training-set size, and training time are scaled up. With 2 billion parameters it reaches 90.45% top-1 on ImageNet.

ResT: An Efficient Transformer for Visual Recognition
Slightly modifies the attention part of ViT and makes the backbone downsample in a ResNet-like fashion, which reduces the overall computation.

Transformer in Convolutional Neural Networks
Splits the tokens into grids, computes attention within each grid, and then gradually merges the grids, making the network smaller and more accurate.

DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification
A paper from Tsinghua University and UCLA. A network that speeds up inference by dropping unimportant tokens as it goes.
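
Dropping tokens at inference can be sketched as a top-k selection over per-token importance scores. Where the scores come from (a small learned predictor, as I understand it) is not modeled here; `prune_tokens`, `scores`, and `keep_ratio` are illustrative names of my own.

```python
def prune_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring tokens, preserving their original order."""
    k = max(1, int(len(tokens) * keep_ratio))
    # Indices of the k largest scores.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    # Restore the original token order among the survivors.
    return [tokens[i] for i in sorted(top)]


kept = prune_tokens(["a", "b", "c", "d"], [0.1, 0.9, 0.5, 0.7], keep_ratio=0.5)
```

Applying this at several depths shrinks the sequence step by step, which is where the inference speedup comes from, since attention cost grows with the number of tokens.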

RegionViT: Regional-to-Local Attention for Vision Transformers
Combines tokens that look at fine detail with tokens that look at the big picture. Tested on a variety of datasets, the results match or beat the ViT-family SOTA.

Probing Word Translations in the Transformer and Trading Decoder for Encoder Layers
Machine translation models have an encoder and a decoder, but the decoder turns out not to be that important.
