
I Tried Reading Attention Is All You Need [Abstract & Introduction]

Posted at 2023-12-04


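As the title says, and late as it is, this article is my attempt to read the famous paper "Attention Is All You Need." I would like to offer a proper explanation, but great predecessors have already written easy-to-follow articles, and as a mere student I cannot hope to compete with them.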





The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.



The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder.



The best performing models also connect the encoder and decoder through an attention mechanism.

I was not familiar with this work, but I believe it refers to what is cited as [2] in Attention Is All You Need. (Embarrassingly, until I read Attention Is All You Need, I thought it was where the attention mechanism first appeared.)

This attention mechanism (called alignment in that paper) comes up again in a later chapter, so I will introduce it at this point.

$$
\left\{ \,
    \begin{aligned}
        & c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j \\
        & \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})} \\
        & e_{ij} = a(s_{i-1}, h_j)
    \end{aligned}
\right.
$$

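Here, $c_i$ is the context vector, $s_{i-1}$ is the previous hidden state of the decoder, $h_j$ is the encoder's vector for input position $j$, and $a$ is a feedforward neural network.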

The attention mechanism above is called additive attention, and it differs from the (dot-product) attention introduced in Attention Is All You Need.
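To make the difference concrete for myself, here is a minimal NumPy sketch of the three additive-attention equations for a single decoder step, next to the scaled dot-product form $\mathrm{softmax}(QK^\top/\sqrt{d_k})V$ from Attention Is All You Need. The article only states that $a$ is a feedforward network; the one-hidden-layer form $v_a^\top \tanh(W_a s_{i-1} + U_a h_j)$ used here is the form I understand paper [2] to use, so please treat it as an assumption. The function names, parameter names, and shapes are likewise my own choices for illustration, and the weights are random rather than trained.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def additive_attention(s_prev, H, W_a, U_a, v_a):
    """Additive attention for one decoder step i.

    s_prev: (n,)      previous decoder hidden state s_{i-1}
    H:      (T_x, m)  encoder vectors h_1 .. h_{T_x}
    W_a (d, n), U_a (d, m), v_a (d,): weights of the scoring network a
                      (assumed form, hypothetical names)
    """
    # e_ij = a(s_{i-1}, h_j) = v_a^T tanh(W_a s_{i-1} + U_a h_j)
    e = np.tanh(W_a @ s_prev + H @ U_a.T) @ v_a  # (T_x,)
    alpha = softmax(e)                            # (T_x,), sums to 1
    c = alpha @ H                                 # (m,), context vector c_i
    return c, alpha


def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V: scores come from dot products only."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))     # (n_q, n_k)
    return weights @ V, weights


# Toy usage with random (untrained) weights.
rng = np.random.default_rng(0)
T_x, n, m, d = 5, 4, 6, 8
H = rng.normal(size=(T_x, m))
s_prev = rng.normal(size=n)
W_a = rng.normal(size=(d, n))
U_a = rng.normal(size=(d, m))
v_a = rng.normal(size=d)

c, alpha = additive_attention(s_prev, H, W_a, U_a, v_a)
out, weights = scaled_dot_product_attention(
    rng.normal(size=(3, 4)), rng.normal(size=(5, 4)), H
)
```

The visible difference is where the score comes from: additive attention runs a small learned network over each pair $(s_{i-1}, h_j)$, while dot-product attention needs only one matrix product for all pairs.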


We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.

So the contribution of Attention Is All You Need is the Transformer architecture.


Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.
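In plain terms: experiments on two machine translation tasks showed that these models can be parallelized, need far less training time, and are also better in quality. The model reached 28.4 BLEU on the WMT 2014 English-to-German task, more than 2 BLEU above the best existing results, ensembles included. On the WMT 2014 English-to-French task, it reached a new single-model state-of-the-art BLEU score of 41.0 after 3.5 days of training on eight GPUs.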


  • Shorter training time (thanks to parallelization)
  • Improved accuracy




Recurrent neural networks, long short-term memory [12] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [29, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [31, 21, 13]. Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states $h_t$, as a function of the previous hidden state $h_{t-1}$ and the input for position $t$. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [18] and conditional computation [26], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.
Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 16]. In all but a few cases [22], however, such attention mechanisms are used in conjunction with a recurrent network.
In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.



Lines 1-2

Recurrent neural networks, long short-term memory [12] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [29, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [31, 21, 13].
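In plain terms: recurrent neural networks, especially long short-term memory networks [12] and gated recurrent networks [7], hold a firm place as the state-of-the-art approach to sequence modeling and transduction problems such as language modeling and machine translation [29, 2, 5]. Since then, many efforts have kept pushing the limits of recurrent language models and encoder-decoder architectures [31, 21, 13].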

This part covers the same ground as lines 1 and 2 of the Abstract.

Lines 3-4

Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states $h_t$, as a function of the previous hidden state $h_{t-1}$ and the input for position $t$.



This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples.

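The "inherently sequential nature" refers to the recurrent sequence models described just before. Presumably parallelization is blocked because each step cannot proceed until the result for the preceding input is available.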
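The sketch below is my own attempt to see this constraint in code. It assumes a plain tanh recurrent cell, $h_t = \tanh(W_h h_{t-1} + W_x x_t)$, which is only one possible choice for the "function of the previous hidden state and the input" mentioned in the paper. For contrast, it also includes a stripped-down self-attention over the same inputs with no learned projections, which is a simplification and not the Transformer layer itself. All names and shapes are hypothetical.

```python
import numpy as np


def rnn_hidden_states(X, W_h, W_x, h0):
    """Compute h_1 .. h_T with h_t = tanh(W_h h_{t-1} + W_x x_t)."""
    h = h0
    states = []
    for x_t in X:  # step t cannot start until step t-1 has produced h
        h = np.tanh(W_h @ h + W_x @ x_t)
        states.append(h)
    return np.stack(states)  # (T, d_h)


def self_attention_outputs(X):
    """Every position attends to every position of X in one shot."""
    scores = X @ X.T / np.sqrt(X.shape[-1])  # (T, T), a single matmul
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ X  # (T, d_x), no loop over time steps


# Toy usage with random (untrained) weights.
rng = np.random.default_rng(0)
T, d_x, d_h = 6, 3, 4
X = rng.normal(size=(T, d_x))
W_h = rng.normal(size=(d_h, d_h))
W_x = rng.normal(size=(d_h, d_x))

H = rnn_hidden_states(X, W_h, W_x, np.zeros(d_h))
Y = self_attention_outputs(X)
```

The `for` loop in `rnn_hidden_states` is the sequential part: its iterations depend on each other through `h`, so they cannot run side by side within one training example. In `self_attention_outputs` all rows come out of matrix products with no such dependency between positions.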

Lines 6-7

Recent work has achieved significant improvements in computational efficiency through factorization tricks [18] and conditional computation [26], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.




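Next time I plan to read on through the Background & Model Architecture chapters. I update this series in my spare time, so I cannot say when the next part will appear, but please look out for it if you are interested!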


I Tried Reading Attention Is All You Need [Background & Model Architecture (Part 1)]
