I Tried Reading "Attention Is All You Need" [Abstract & Introduction]

Posted at 2023-12-04, last updated at 2023-12-13

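If this is your first time here, nice to meet you. If not, thank you as always for reading.

As the title says, and late as it may be, this article is my attempt at reading the famous paper "Attention Is All You Need." I would like to offer a proper explanation, but many excellent authors have already written clear articles on it, and as a mere student I cannot hope to compete with them.
If you don't mind seeing how a student understands the paper, please take a look.

That's enough preamble, so let's get started!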

This article covers the Abstract and Introduction.
The translations of the English passages below were made with DeepL.

Abstract

First, the original text.

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.

Let's go through it piece by piece.

Sentence 1

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder.
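The main sequence transduction models are built on complex recurrent neural networks or convolutional neural networks that contain an encoder and a decoder.

A well-known model from around this time would be GNMT ('16), also published by Google, I suppose? GoogLeNet ('14) and ResNet ('15) may also be in view, although they address a different task (image classification), so convolutional sequence models such as ByteNet and ConvS2S, which the paper's Background section cites, seem the more likely referents.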

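Sentence 2

The best performing models also connect the encoder and decoder through an attention mechanism.
The best-performing models connect the encoder and decoder with an attention mechanism.

I wasn't aware of this, but I believe it refers to the work cited as [2] in Attention Is All You Need (Bahdanau et al., "Neural Machine Translation by Jointly Learning to Align and Translate"). (Embarrassingly, until I read Attention Is All You Need, I thought the attention mechanism first appeared in that paper.)

A later section of the paper refers to this attention mechanism (which [2] describes in terms of alignment), so I'll introduce it at this point.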

$$
\left\{ \,
    \begin{aligned}
        & c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j \\
        & \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})} \\
        & e_{ij} = a(s_{i-1}, h_j)
    \end{aligned}
\right.
$$

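Here $c_i$ is the context vector, $s_{i-1}$ is the previous decoder hidden state, $h_j$ is the encoder hidden state (annotation) for input position $j$, and $a$ is a feedforward neural network.

The attention mechanism above is known as additive attention, and it differs from the (scaled dot-product) attention introduced in Attention Is All You Need.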
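To make the three equations above concrete, here is a minimal pure-Python sketch of additive attention. It assumes the common parametrization $a(s_{i-1}, h_j) = v^\top \tanh(W_s s_{i-1} + W_h h_j)$; the names `W_s`, `W_h`, `v` and all numeric values are made up for illustration and are not taken from either paper's code.

```python
import math

def additive_score(s_prev, h_j, W_s, W_h, v):
    """e_ij = v . tanh(W_s s_{i-1} + W_h h_j), a one-hidden-layer feedforward net."""
    hidden = [
        math.tanh(
            sum(w * s for w, s in zip(row_s, s_prev))
            + sum(w * h for w, h in zip(row_h, h_j))
        )
        for row_s, row_h in zip(W_s, W_h)
    ]
    return sum(v_k * z for v_k, z in zip(v, hidden))

def softmax(scores):
    """alpha_ij = exp(e_ij) / sum_k exp(e_ik), shifted by the max for stability."""
    m = max(scores)
    exps = [math.exp(e - m) for e in scores]
    total = sum(exps)
    return [x / total for x in exps]

def context_vector(s_prev, annotations, W_s, W_h, v):
    """c_i = sum_j alpha_ij h_j, returned together with the weights alpha_ij."""
    scores = [additive_score(s_prev, h_j, W_s, W_h, v) for h_j in annotations]
    alphas = softmax(scores)
    dim = len(annotations[0])
    c = [sum(a * h[d] for a, h in zip(alphas, annotations)) for d in range(dim)]
    return c, alphas

# Hypothetical toy values: T_x = 3 annotations of dimension 2.
s_prev = [0.5, -0.2]
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_s = [[0.1, 0.2], [0.3, -0.1]]
W_h = [[0.5, -0.4], [0.2, 0.6]]
v = [1.0, -1.0]

c, alphas = context_vector(s_prev, annotations, W_s, W_h, v)
```

The weights `alphas` are positive and sum to 1, so `c` is a weighted average of the annotations $h_j$.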

Sentence 3

We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
We propose the Transformer, a new, simple network architecture based solely on attention mechanisms, doing away with recurrence and convolutions entirely.

So the contribution of Attention Is All You Need is the Transformer architecture.

Sentences 4 to 6

Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.
Experiments on two machine translation tasks showed that these models are more parallelizable and take far less time to train while also delivering better quality. Our model achieved 28.4 BLEU on the WMT 2014 English-to-German translation task, exceeding the best existing results, ensembles included, by more than 2 BLEU. On the WMT 2014 English-to-French translation task, our model reached a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training cost of the best models in the literature.

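The advantages of the Transformer appear to boil down to the following:

Shorter training time (thanks to parallelization)
Better translation quality (higher BLEU)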

Introduction

First, the original text.

Recurrent neural networks, long short-term memory [12] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [29, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [31, 21, 13]. Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states $h_t$, as a function of the previous hidden state $h_{t-1}$ and the input for position $t$. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [18] and conditional computation [26], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.
Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 16]. In all but a few cases [22], however, such attention mechanisms are used in conjunction with a recurrent network.
In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.

Let's go through it piece by piece.

Paragraph 1

Sentences 1 and 2

Recurrent neural networks, long short-term memory [12] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [29, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [31, 21, 13].
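Recurrent neural networks, in particular long short-term memory networks [12] and gated recurrent neural networks [7], have firmly established themselves as the state-of-the-art approach to sequence modeling and transduction problems such as language modeling and machine translation [29, 2, 5]. Since then, many efforts have continued to push the limits of recurrent language models and encoder-decoder architectures [31, 21, 13].

This part corresponds to sentences 1 and 2 of the Abstract.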

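Sentences 3 and 4

Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states $h_t$, as a function of the previous hidden state $h_{t-1}$ and the input for position $t$.
Recurrent models usually carry out their computation along the symbol positions of the input and output sequences. Aligning positions with steps in computation time, they produce a sequence of hidden states $h_t$ as a function of the previous hidden state $h_{t-1}$ and the input at position $t$.

This part appears to explain the basics of sequence models. For now it seems enough to understand that the next hidden state is determined from the previous hidden state and the input at the current position.
Many excellent authors have already written detailed explanations of sequence models, so I'll leave the details to them.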
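As a rough illustration of the recurrence described above, here is a minimal sketch with a scalar hidden state. The update rule `tanh(w_h * h_prev + w_x * x_t)` and the weights `w_h`, `w_x` are assumptions chosen for brevity, not the LSTM or GRU cells the paper cites.

```python
import math

def rnn_step(h_prev, x_t, w_h, w_x):
    """One recurrence step: h_t = tanh(w_h * h_{t-1} + w_x * x_t)."""
    return math.tanh(w_h * h_prev + w_x * x_t)

def run_rnn(inputs, h0=0.0, w_h=0.5, w_x=1.0):
    """Return the hidden states h_1..h_T for a sequence of scalar inputs."""
    hidden_states = []
    h = h0
    for x_t in inputs:
        # Step t needs h from step t-1, so the loop must run in order.
        h = rnn_step(h, x_t, w_h, w_x)
        hidden_states.append(h)
    return hidden_states

states = run_rnn([1.0, 0.0, -1.0])
```

Because each `h` feeds into the next call of `rnn_step`, changing the first input changes every later hidden state, which is why the positions cannot be computed independently.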

Sentence 5

This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples.
This inherently sequential nature prevents parallelization within a training example, and this becomes critical as sequences get longer, because memory constraints limit batching across examples.

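The "inherently sequential nature" refers to the recurrence described just above: step $t$ cannot start until the hidden state from step $t-1$ is available, which is presumably what prevents parallelization.
I don't fully understand the reason for the memory constraint (apologies for my lack of study). One plausible reading is that, since a single sequence cannot be parallelized, the remaining option is to batch many sequences together, and the memory needed per sequence grows with its length, so long sequences leave room for fewer examples per batch. I'll take it as given that such a problem exists and move on.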
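For contrast with the recurrence discussed above, the sketch below shows why attention does not have this ordering constraint. It is a simplified, pure-Python version of dot-product self-attention scaled by $\sqrt{d_k}$: as an assumption for brevity, the inputs themselves serve as queries, keys and values, with no learned projections, masking or multiple heads.

```python
import math

def softmax(scores):
    """Normalize scores into weights that sum to 1 (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend_one(query, keys, values):
    """Output for one position: softmax(q . k_j / sqrt(d_k)) weighted sum of values."""
    d_k = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys
    ]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * val[d] for w, val in zip(weights, values)) for d in range(dim)]

def self_attention(xs):
    """Each output depends only on the inputs xs, never on another output."""
    return [attend_one(x, xs, xs) for x in xs]

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs = self_attention(xs)

# The last position can be computed on its own, without the earlier outputs.
last_only = attend_one(xs[2], xs, xs)
```

Since no call to `attend_one` uses the result of another, the calls could in principle run in any order or all at once, unlike the steps of a recurrent model.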

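Sentences 6 and 7

Recent work has achieved significant improvements in computational efficiency through factorization tricks [18] and conditional computation [26], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.
Recent work has achieved large gains in computational efficiency through factorization tricks [18] and conditional computation [26], with the latter also improving model performance. The basic constraint of sequential computation nevertheless remains.

I wasn't familiar with the papers cited here, but in any case they apparently do not fundamentally resolve the sequential nature of the computation.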

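In closing

Thank you for reading this far.
This article is written by a fledgling undergraduate in my second year in a research lab, so if you find any mistakes, I would be grateful if you pointed them out. (Please go easy on me!)

Next time I plan to read on through the Background & Model Architecture sections. ~~I write these in my spare time, so I can't say when the next one will appear, but I hope you'll look out for it!~~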

Postscript

I have posted the next article. If you liked this one, please take a look!
I Tried Reading "Attention Is All You Need" [Background & Model Architecture (Part 1)]
