Overview

The following article,
Can Transformers Thrive Without Attention?!
https://ai-scholar.tech/articles/transformer/mlp_transformer
introduced the paper
"Pay Attention to MLPs",
so I took a brief look at what the paper says.

Given the timing, this paper seems like a good topic for discussion.

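Contents of Pay Attention to MLPs

Key points

From the abstract: "... self-attention is not critical for Vision Transformers ..."
I think many people would react to this view with "that may well be true."
To begin with, language and images (computer vision) differ substantially in the following ways.

Words in a language carry meaning, and that meaning was created by people.
Images, by contrast, are not something people created.
Language is one-dimensional, whereas images are two-dimensional. That two-dimensional structure cannot easily be handled as if it were one-dimensional.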
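To make the last point concrete, here is a minimal sketch of the usual way a Vision Transformer style model turns a 2D image into a 1D token sequence: cut the image into square patches and flatten them in raster order. The function name `image_to_patch_sequence` and the toy 4x4 image are my own assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def image_to_patch_sequence(image, patch):
    """Flatten an (H, W, C) image into a (num_patches, patch*patch*C) sequence."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    rows, cols = h // patch, w // patch
    # Split each spatial axis into (patch index, offset inside the patch).
    x = image.reshape(rows, patch, cols, patch, c)
    # Put the two patch indices first, then flatten each patch into one token.
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(rows * cols, patch * patch * c)

# Toy example: a 4x4 single-channel image whose pixel values are 0..15.
image = np.arange(4 * 4).reshape(4, 4, 1)
tokens = image_to_patch_sequence(image, patch=2)
print(tokens.shape)  # (4, 4): four patches, four values each
print(tokens[1])     # [2 3 6 7]: the top-right patch
```

In the resulting sequence, the top-left patch (token 0) and the bottom-left patch (token 2) are vertical neighbours in the image but are no longer adjacent. This is one way to see why the 2D structure does not map cleanly onto a 1D order.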

Note that this paper does not discuss only Vision Transformers; it covers Transformers more broadly, including BERT-style language models.

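Quoting from the paper's conclusion:

(Excerpt) This adoption has enabled many impressive results especially in NLP. To date, it is still unclear what empowers such success: is it the feedforward nature of Transformers or is it the multi-head self-attention layers in Transformers?

The paper raises this question and then presents its findings. Naturally, it does not conclude that self-attention is important across the board.

To begin with, self-attention is probably effective in some respects, but it may be just one option among several rather than the decisive ingredient. At the very least, there are likely cases where it does not help much. I think many people have long suspected as much.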

Abstract and conclusion of the paper

Abstract

Transformers [1] have become one of the most important architectural innovations in deep learning and have enabled many breakthroughs over the past few years. Here we propose a simple network architecture, gMLP, based on MLPs with gating, and show that it can perform as well as Transformers in key language and vision applications. Our comparisons show that self-attention is not critical for Vision Transformers, as gMLP can achieve the same accuracy. For BERT, our model achieves parity with Transformers on pretraining perplexity and is better on some downstream NLP tasks. On finetuning tasks where gMLP performs worse, making the gMLP model substantially larger can close the gap with Transformers. In general, our experiments show that gMLP can scale as well as Transformers over increased data and compute.

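Abstract (DeepL translation)

Transformers [1] have become one of the most important architectural innovations in deep learning and have enabled many breakthroughs over the past few years. Here we propose gMLP, a simple network architecture based on MLPs with gating, and show that it can perform on par with Transformers in key language and vision applications. Because gMLP can achieve the same accuracy, this comparison shows that self-attention is not critical for Vision Transformers. For BERT, our model matches Transformers on pretraining perplexity and is better on some downstream NLP tasks. On finetuning tasks where gMLP performs worse, making the gMLP model substantially larger can close the gap with Transformers. In general, our experiments show that gMLP can scale as well as Transformers as data and compute increase.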

Translated with www.DeepL.com/Translator (free version).
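The abstract only says "MLPs with gating", so the following is a rough NumPy sketch of what one gMLP block looks like, as far as I understand the paper's description of its spatial gating unit: the hidden channels are split in two, one half is mixed across token positions by a learned n x n matrix, and the result multiplies the other half element-wise. All names (`gmlp_block`, `spatial_gating_unit`, the `params` dictionary) and the sizes are my own assumptions. The layer norm has no learned scale or offset and the channel projections have no bias, which are simplifications on my part.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Simplified layer norm over the channel axis (no learned scale/offset).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def spatial_gating_unit(z, w_spatial, b_spatial):
    # z: (n_tokens, d_ffn). Split the channels into two halves.
    u, v = np.split(z, 2, axis=-1)
    v = layer_norm(v)
    # Mix information ACROSS tokens with a static (n_tokens, n_tokens) matrix.
    v = w_spatial @ v + b_spatial[:, None]
    # Element-wise gating: one half modulates the other.
    return u * v

def gmlp_block(x, params):
    shortcut = x
    z = gelu(layer_norm(x) @ params["w_in"])                              # (n, d_ffn)
    z = spatial_gating_unit(z, params["w_spatial"], params["b_spatial"])  # (n, d_ffn / 2)
    return z @ params["w_out"] + shortcut                                 # (n, d_model)

# Hypothetical toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
n_tokens, d_model, d_ffn = 6, 8, 16
params = {
    "w_in": rng.normal(scale=0.1, size=(d_model, d_ffn)),
    "w_out": rng.normal(scale=0.1, size=(d_ffn // 2, d_model)),
    # Zero weights and unit bias make the gate start out as a pass-through.
    "w_spatial": np.zeros((n_tokens, n_tokens)),
    "b_spatial": np.ones(n_tokens),
}
x = rng.normal(size=(n_tokens, d_model))
y = gmlp_block(x, params)
print(y.shape)  # (6, 8)
```

The point to notice is that `w_spatial` is a fixed parameter that does not depend on the input, whereas in self-attention the token-mixing weights are recomputed from each input. The block also has no notion of queries, keys, or values.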

Conclusion

Since the seminal work of Vaswani et al. [1], Transformers have been widely adopted across NLP and computer vision. This adoption has enabled many impressive results especially in NLP. To date, it is still unclear what empowers such success: is it the feedforward nature of Transformers or is it the multi-head self-attention layers in Transformers? Our work suggests a simpler alternative to the multi-head self-attention layers in Transformers. We show that gMLPs, a simple variant of MLPs with gating, can be competitive with Transformers in terms of BERT’s pretraining perplexity and ViT’s accuracy. gMLPs are also comparable with Transformers in terms of the scalability over increased data and compute. As for BERT finetuning, we find gMLPs can achieve appealing results on challenging tasks such as SQuAD without self-attention, and can significantly outperform Transformers in certain cases. We also find the inductive bias in Transformer’s multi-head self-attention useful on downstream tasks that require cross-sentence alignment. However in those cases, making gMLP substantially larger closes the gap with Transformers. More practically, blending a small single-head self-attention into gMLP allows for an even better architecture without the need for increasing model size.

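Conclusion (DeepL translation)

Since the groundbreaking work of Vaswani et al. [1], Transformers have been widely adopted in NLP and computer vision. This adoption has produced many impressive results, especially in NLP. So far, it is still unclear what drives this success: is it the feedforward nature of Transformers, or is it their multi-head self-attention layers? This work proposes a simpler alternative to the multi-head self-attention layers in Transformers. We show that gMLP, a simple variant of MLPs with gating, can compete with Transformers in terms of BERT's pretraining perplexity and ViT's accuracy. gMLP is also comparable to Transformers in scalability as data and compute increase. As for BERT finetuning, we find that gMLP can achieve appealing results on challenging tasks such as SQuAD without self-attention, and can significantly outperform Transformers in certain cases. We also find that the inductive bias in the Transformer's multi-head self-attention is useful on downstream tasks that require cross-sentence alignment. In those cases, however, making gMLP substantially larger closes the gap with Transformers. More practically, blending a small single-head self-attention into gMLP yields an even better architecture without increasing model size.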
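For reference, the "small single-head self-attention" mentioned at the end of the conclusion can be pictured as ordinary scaled dot-product attention with one head and a small inner dimension. The sketch below shows only that generic mechanism. The function name, the weight names, and the toy sizes are my own assumptions, and exactly how the paper wires this module into the gMLP block is not reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(x, w_q, w_k, w_v, w_o):
    # x: (n_tokens, d_model). Project to queries, keys and values.
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # each (n_tokens, d_attn)
    scores = q @ k.T / np.sqrt(q.shape[-1])      # (n_tokens, n_tokens)
    weights = softmax(scores, axis=-1)           # input-dependent token mixing
    return (weights @ v) @ w_o, weights

# Hypothetical toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
n_tokens, d_model, d_attn, d_out = 5, 16, 4, 8
x = rng.normal(size=(n_tokens, d_model))
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(d_model, d_attn)) for _ in range(3))
w_o = rng.normal(scale=0.1, size=(d_attn, d_out))

out, weights = single_head_attention(x, w_q, w_k, w_v, w_o)
print(out.shape)      # (5, 8)
print(weights.shape)  # (5, 5): one row of mixing weights per token
```

Unlike a fixed token-mixing matrix, `weights` here is recomputed from `x` on every call, which is the kind of inductive bias the conclusion describes as useful for cross-sentence alignment.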

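Summary

Nothing in particular.
I have also written about attention myself in articles such as the one below (although it is simply a collection of pointers to other articles), in case it is useful.
N articles I found easy to follow on Self-Attention (N = 15)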
