Link Collection of Papers on Transformer-Based Image Recognition Models

Last updated at 2022-08-25. Posted at 2021-07-01.

Purpose

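Links to papers on image recognition models built with attention as their main component, listed in order of the date and time each paper was registered on arXiv.org. Where an official implementation has been released, a link to it is also included.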

2020

Visual Transformers

ViT(Vision Transformer)

DeiT(Data-efficient Image Transformers)

BoTNet(Bottleneck Transformers)

2021 Q1

T2T-ViT(Tokens-to-Token ViT)

CPVT(Conditional Position encoding Vision Transformer)

PVT(Pyramid Vision Transformer)

TNT(Transformer-in-Transformer)

Perceiver

ConViT(Convolutional Vision Transformer)

HVT(Hierarchical Visual Transformer)

DeepViT

HaloNet

Swin(Shifted Windows Transformer)

CvT(Convolutional vision Transformer)

ViL(Vision Longformer)

PiT(Pooling-based Vision Transformer)

CaiT(Class-Attention in Image Transformers)

2021 Q2

Visformer(Vision-friendly Transformer)

SiT(Self-supervised vision Transformers)

LocalViT(Locality Vision Transformers)

CCT(Compact Convolutional Transformer)

CoaT(Co-scale conv-attentional image Transformers)

MViTv1(Multiscale Vision Transformers)

ConTNet(Convolution-Transformer Network)

Twins

Conformer

NesT(Nested Transformers)

ResT(ResNet Transformer)

LIT(Less attention vIsion Transformer)

DVT(Dynamic Vision Transformers)

MSG-Transformer(Messenger Transformer)

TransCNN

Refiner(Refined Vision Transformer)

Shuffle Transformer

ViTAE(Vision Transformer Advanced by Exploring Intrinsic Inductive Bias)

CoAtNet(Convolution and self-Attention Net)

CAT(Cross Attention Vision Transformer)

V-MoE(Vision Mixture of Experts)

BEiT(Bidirectional Encoder representation Image Transformers)

XCiT(Cross-Covariance Image Transformers)

EsViT(Efficient Self-supervised Vision Transformers)

P2T(Pyramid Pooling Transformer)

CivT(Cross inductive bias vision Transformers)

VOLO(Vision Outlooker)

ViTAS

2021 Q3

AutoFormer

CSWin(Cross-Shaped Windows) Transformer

Focal Transformer

ViX(Vision Xformers)

GLiT(Global and Local Image Transformer)

ViP(Visual Parser)

CMT(CNNs Meet Transformers)

WideNet

DPT(Deformable Patch-based Transformer)

Evo-ViT

PSViT(ViT with token Pooling and attention Sharing)

Mobile-Former

ViT-ResNAS

UFO-ViT(Unit Force Operated Vision Transformer)

2021 Q4

MobileViT

UniNet

SOFT(Softmax-free Transformer)

Hybrid BYOL-ViT

SReT(Sliced Recursive Transformer)

Swin Transformer V2

SWAT(Spatial Structure Within and Among Tokens)

SiT(Self-slimmed Vision Transformer)

S3(Searching the Search Space)

AdaViT(Adaptive Vision Transformers)

Shunted-Transformer

MViTv2(Multiscale Vision Transformers)

ReViT(Resizable-ViT)

Locally-SAG-Vit

AdaViT

LVT(Lite Vision Transformer)

ELSA

SimViT(Simple ViT)

SPT and LSA

SPViT(Soft Token Pruning ViT)

Pale Transformer

2022 Q1

DAT(Deformable Attention Transformer)

ViT-Slim

PyramidTNT(Pyramid Transformer-in-Transformer)

QuadTree

UniFormer

CXV(Convolutional Xformers for Vision)

O-ViT(Orthogonal Vision Transformer)

MOA(Multi-resolution Overlapped Attention)

BOAT(Bilateral Local Attention Vision Transformer)

BViT(Broad Attention based Vision Transformer)

AlterNet

VAN(Visual Attention Network)

ViTAEv2(Vision Transformer Advanced by Exploring Inductive Bias v2)

PatchMerger

As-ViT(Auto-scaling Vision Transformers)

ViT-P

CF-ViT(Coarse-to-Fine Vision Transformer)

DGT(Dynamic Group Transformer)

WinfT(Window-Free Transformer)

EIT(Efficiently lead Inductive biases to ViT)

ScalableViT

FocalNet

DW-ViT(Dynamic Window Vision Transformer)

SepViT(Separable Vision Transformer)

2022 Q2

MaxViT(Multi-Axis ViT)

DaViT(Dual Attention Vision Transformers)

DeiT III

NAT(Neighborhood Attention Transformer)

ResT V2(ResNet Transformer V2)

ASF-former

EdgeViTs

TRT-ViT

SuperViT

iFormer(Inception Transformer)

LITv2(Less attention vIsion Transformer V2)

AdaptFormer

X-ViT

HiViT(Hierarchical Vision Transformer)

EfficientViT

EfficientFormer

MobileViTv2

PerViT(Peripheral Vision Transformer)

SP-ViT(Spatial Priors ViT)

EATFormer(Evolutionary Algorithm Transformer)

GC ViT(Global Context Vision Transformers)

EdgeNeXt

2022 Q3

LinGlos(Linear-complexity and Global-local interaction)

UniNet

LightViT

Dual-ViT

Next-ViT

XFormer

CETNet

XFormer

HorNet

Token Fusion
