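Attention Is All You Need: Japanese Translation

Machine Learning

Posted at 2022-09-12

Preface

This is a rough machine translation into Japanese, made for sharing among colleagues. It falls short in many places, so consulting the original paper is recommended.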

Abstract

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder.
主流の系列変換モデルは、エンコーダーとデコーダーを含む複雑な再帰型または畳み込みニューラルネットワークに基づいています。

The best performing models also connect the encoder and decoder through an attention mechanism.
また、最高のパフォーマンスを発揮するモデルは、アテンションメカニズムを介してエンコーダとデコーダを接続します。

We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
私たちは、アテンションメカニズムのみに基づいて、再帰と畳み込みを完全に不要にする、新しいシンプルなネットワークアーキテクチャであるトランスフォーマーを提案します。

Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.
2 つの機械翻訳タスクの実験では、これらのモデルが優れた品質であると同時に、より並列化可能であり、トレーニングに必要な時間が大幅に短縮されていることが示されています。

Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU.
私たちのモデルは、WMT 2014 の英語からドイツ語への翻訳タスクで 28.4 BLEU を達成し、アンサンブルを含む既存の最良の結果を 2 BLEU 以上改善しています。

On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.
WMT 2014 の英語からフランス語への翻訳タスクでは、私たちのモデルは 8 つの GPU で 3.5 日間トレーニングした後、単一モデルとして新たな最高水準となる BLEU スコア 41.8 を達成しました。これは、文献中の最良のモデルのトレーニングコストのごく一部です。

We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
大規模なトレーニングデータと限られたトレーニングデータの両方で英語の構成素解析 (constituency parsing) にうまく適用することにより、Transformer が他のタスクにもよく一般化することを示します。

∗Equal contribution. Listing order is random.
∗同等の貢献。掲載順はランダムです。

Jakob proposed replacing RNNs with self-attention and started the effort to evaluate this idea.
Jakob は、RNN を自己注意に置き換えることを提案し、このアイデアを評価する取り組みを開始しました。

Ashish, with Illia, designed and implemented the first Transformer models and has been crucially involved in every aspect of this work.
Ashish は、Illia と共に最初の Transformer モデルを設計および実装し、この作業のあらゆる側面に深く関わってきました。

Noam proposed scaled dot-product attention, multi-head attention and the parameter-free position representation and became the other person involved in nearly every detail.
Noam は、スケーリングされたドット積の注意、マルチヘッドの注意、およびパラメーターのない位置表現を提案し、ほぼすべての詳細に関与するもう 1 人の人物になりました。

Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and tensor2tensor.
Niki は、オリジナルのコードベースと tensor2tensor で無数のモデルバリアントを設計、実装、調整、評価しました。

Llion also experimented with novel model variants, was responsible for our initial codebase, and efficient inference and visualizations.
Llion は、新しいモデルバリアントの実験も行い、最初のコードベースと効率的な推論と視覚化を担当しました。

Lukasz and Aidan spent countless long days designing various parts of and implementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating our research.
Lukasz と Aidan は、tensor2tensor のさまざまな部分を設計して実装し、以前のコードベースを置き換えて、結果を大幅に改善し、研究を大幅に加速するために数え切れないほどの長い日を費やしました。

1 Introduction

Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5].
リカレントニューラルネットワーク、特に長短期記憶 [13] およびゲート付きリカレント [7] ニューラルネットワークは、言語モデリングや機械翻訳などのシーケンスモデリングおよび変換問題における最先端のアプローチとしてしっかりと確立されています [35、2、5]。

Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15].
それ以来、リカレント言語モデルとエンコーダー/デコーダーアーキテクチャの境界を押し広げるために、多くの努力が続けられてきました [38, 24, 15]。

Recurrent models typically factor computation along the symbol positions of the input and output sequences.
再帰モデルは通常、入力シーケンスと出力シーケンスのシンボル位置に沿って計算を因数分解します。

Aligning the positions to steps in computation time, they generate a sequence of hidden states $h_t$, as a function of the previous hidden state $h_{t-1}$ and the input for position t.
位置を計算時間のステップに合わせて、前の隠れ状態 $h_{t-1}$ と位置 t の入力の関数として一連の隠れ状態 $h_t$ を生成します。

This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples.
この本質的にシーケンシャルな性質により、トレーニング例内での並列化が妨げられます。これは、メモリの制約によりサンプル間のバッチ処理が制限されるため、シーケンスの長さが長くなると重要になります。

Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter.
最近の研究では、因数分解のトリック [21] と条件付き計算 [32] によって計算効率が大幅に向上し、後者の場合はモデルのパフォーマンスも向上しています。

The fundamental constraint of sequential computation, however, remains.
ただし、逐次計算の基本的な制約は残ります。

Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19].
注意メカニズムは、さまざまなタスクにおける説得力のあるシーケンスモデリングおよび変換モデルの不可欠な部分になり、入力シーケンスまたは出力シーケンスの距離に関係なく依存関係をモデル化できます [2、19]。

In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network.
ただし、いくつかのケースを除いて [27]、そのような注意メカニズムはリカレントネットワークと組み合わせて使用されます。

In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output.
本研究では、再帰を排し、代わりにアテンションメカニズムのみに依存して入力と出力の間のグローバルな依存関係を捉えるモデルアーキテクチャである Transformer を提案します。

The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
Transformer は、大幅に多くの並列化を可能にし、8 つの P100 GPU でわずか 12 時間のトレーニングを行うだけで、翻訳品質の新しい最先端に達することができます。

2 Background

The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU [16], ByteNet [18] and ConvS2S [9], all of which use convolutional neural networks as basic building block, computing hidden representations in parallel for all input and output positions.
シーケンシャル計算を削減するという目標は、Extended Neural GPU [16]、ByteNet [18]、および ConvS2S [9] の基盤も形成します。これらはすべて、畳み込みニューラルネットワークを基本的なビルディングブロックとして使用し、すべての入力位置と出力位置の隠れ表現を並列に計算します。

In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet.
これらのモデルでは、任意の 2 つの入力位置または出力位置からの信号を関連付けるのに必要な操作の数は、位置間の距離に応じて増加します。ConvS2S では直線的に、ByteNet では対数的に増加します。

This makes it more difficult to learn dependencies between distant positions [12].
これにより、離れた位置間の依存関係を学習することがより困難になります [12]。

In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2.
Transformer では、これが一定数の操作に削減されます。ただし、アテンションで重み付けされた位置を平均化するために実効的な解像度が低下するという代償を伴います。この影響は、セクション 3.2 で説明する Multi-Head Attention によって打ち消します。

Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.
イントラアテンションと呼ばれることもある自己注意は、シーケンスの表現を計算するために、単一のシーケンスのさまざまな位置を関連付ける注意メカニズムです。

Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations [4, 27, 28, 22].
自己注意は、読解、抽象的要約、テキスト含意、タスクに依存しない文表現の学習など、さまざまなタスクでうまく使用されています [4、27、28、22]。

End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks [34].
エンドツーエンドのメモリネットワークは、系列に整列した再帰ではなく再帰的なアテンションメカニズムに基づいており、簡単な言語の質問応答および言語モデリングタスクでうまく機能することが示されています [34]。

To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.
ただし、私たちの知る限りでは、Transformer は、系列に整列した RNN や畳み込みを使用せずに入力と出力の表現を計算するために自己注意に完全に依存する最初の変換モデルです。

In the following sections, we will describe the Transformer, motivate self-attention and discuss its advantages over models such as [17, 18] and [9].
以下のセクションでは、Transformer について説明し、自己注意を採用する動機を述べ、[17、18] や [9] などのモデルに対するその利点について議論します。

3 Model Architecture

Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 35].
競合するニューラルシーケンス変換モデルのほとんどは、エンコーダーデコーダー構造を持っています [5、2、35]。

Here, the encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence of continuous representations z = (z1, ..., zn).
ここで、エンコーダはシンボル表現 (x1, ..., xn) の入力シーケンスを連続表現 z = (z1, ..., zn) のシーケンスにマッピングします。

Given z, the decoder then generates an output sequence (y1, ..., ym) of symbols one element at a time.
z が与えられると、デコーダはシンボルの出力シーケンス (y1、...、ym) を一度に 1 要素ずつ生成します。

At each step the model is auto-regressive [10], consuming the previously generated symbols as additional input when generating the next.
各ステップで、モデルは自己回帰 [10] であり、次の生成時に以前に生成されたシンボルを追加の入力として消費します。

The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.
Transformer は、図 1 の左半分と右半分にそれぞれ示されているように、エンコーダーとデコーダーの両方にスタックされた自己注意層とポイント単位の完全に接続された層を使用して、この全体的なアーキテクチャに従います。

3.1 Encoder and Decoder Stacks

Encoder

The encoder is composed of a stack of N = 6 identical layers.
エンコーダーは、N = 6 の同一レイヤーのスタックで構成されます。

Each layer has two sub-layers.
各レイヤーには 2 つのサブレイヤーがあります。

The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.
1 つ目はマルチヘッドのセルフアテンションメカニズムで、2 つ目は単純な位置ごとに完全に接続されたフィードフォワードネットワークです。

We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1].
2 つのサブレイヤーのそれぞれの周囲に残差接続 [11] を使用し、その後にレイヤーの正規化 [1] を使用します。

That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself.
つまり、各サブレイヤーの出力は LayerNorm(x + Sublayer(x)) です。ここで、Sublayer(x) はサブレイヤー自体によって実装される関数です。

To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension dmodel = 512.
これらの残留結合を容易にするために、モデル内のすべてのサブレイヤーと埋め込みレイヤーは、次元 dmodel = 512 の出力を生成します。
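The sub-layer wrapper LayerNorm(x + Sublayer(x)) can be sketched for a single position as follows. This is a minimal illustration, not the authors' implementation: the `eps` value, the omission of the learned gain and bias of layer normalization, and the toy sub-layer are all assumptions made here.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_block(x, sublayer):
    """Compute LayerNorm(x + Sublayer(x)) for one position."""
    y = sublayer(x)
    # The residual sum only works if the sub-layer keeps the dimension (d_model).
    assert len(y) == len(x), "the residual sum needs matching dimensions"
    return layer_norm([a + b for a, b in zip(x, y)])

# Hypothetical toy sub-layer that doubles each component.
out = residual_block([1.0, 2.0, 3.0, 4.0], lambda v: [2.0 * t for t in v])
```

The dimension check in `residual_block` is the reason every sub-layer and the embedding layers share the same output size.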

Decoder

The decoder is also composed of a stack of N = 6 identical layers.
デコーダーも、N = 6 の同一レイヤーのスタックで構成されています。

In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack.
各エンコーダー層の 2 つのサブレイヤーに加えて、デコーダーは、エンコーダースタックの出力に対してマルチヘッドアテンションを実行する 3 番目のサブレイヤーを挿入します。

Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization.
エンコーダーと同様に、各サブレイヤーの周囲に残差接続を使用し、その後にレイヤーの正規化を行います。

We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions.
また、デコーダスタック内の自己注意サブレイヤを変更して、位置が後続の位置に注意を向けないようにします。

This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.
このマスキングは、出力の埋め込みが 1 位置だけオフセットされているという事実と相まって、位置 i の予測が i より小さい位置の既知の出力にのみ依存できることを保証します。

3.2 Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors.
Attention 関数は、クエリと一連のキーと値のペアを出力にマッピングするものとして説明できます。ここで、クエリ、キー、値、および出力はすべてベクトルです。

The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
出力は、値の加重合計として計算されます。各値に割り当てられた加重は、対応するキーを使用したクエリの互換性関数によって計算されます。

3.2.1 Scaled Dot-Product Attention

We call our particular attention "Scaled Dot-Product Attention" (Figure 2).
私たちは、本論文で用いるアテンションを「Scaled Dot-Product Attention」と呼びます (図 2)。

The input consists of queries and keys of dimension dk, and values of dimension dv.
入力は次元 dk のクエリとキー、および次元 dv の値で構成されます。

We compute the dot products of the query with all keys, divide each by √dk, and apply a softmax function to obtain the weights on the values.
すべてのキーを使用してクエリの内積を計算し、それぞれを √dk で割り、softmax 関数を適用して値の重みを取得します。

In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix Q.
実際には、一連のクエリに対してアテンション関数を同時に計算し、行列 Q にまとめます。

The keys and values are also packed together into matrices K and V .
キーと値は、行列 K と V にもまとめられます。

We compute the matrix of outputs as:
出力の行列を次のように計算します。

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
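The computation described above (dot products of each query with all keys, division by √dk, softmax, then a weighted sum of the values) can be sketched in plain Python. The function names and the list-of-rows matrix layout are assumptions chosen for this illustration; a real implementation would use optimized matrix multiplication.

```python
import math

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q: n_q x d_k, K: n_k x d_k, V: n_k x d_v, each given as a list of rows."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Compatibility of this query with every key, scaled by 1/sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [1.0, 0.0]]   # two identical keys receive equal weight
V = [[1.0, 2.0], [3.0, 4.0]]
out = scaled_dot_product_attention(Q, K, V)
```

Because both keys match the query equally, the output is the plain average of the two value rows.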

The two most commonly used attention functions are additive attention [2], and dot-product (multiplicative) attention.
最も一般的に使用される 2 つのアテンション関数は、加算アテンション [2] と内積 (乗法) アテンションです。

Dot-product attention is identical to our algorithm, except for the scaling factor of 1/√dk .
内積注意は、スケーリング係数 1/√dk を除いて、私たちのアルゴリズムと同じです。

Additive attention computes the compatibility function using a feed-forward network with a single hidden layer.
Additive Attention は、単一の隠れ層を持つフィードフォワードネットワークを使用して互換性関数を計算します。

While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.
この 2 つは理論的な複雑さは似ていますが、ドット積アテンションは、高度に最適化された行列乗算コードを使用して実装できるため、実際にははるかに高速でスペース効率が高くなります。

While for small values of dk the two mechanisms perform similarly, additive attention outperforms dot product attention without scaling for larger values of dk [3].
dk の値が小さい場合、この 2 つのメカニズムは同様に機能しますが、dk の値が大きい場合は、加法的注意がスケーリングなしの内積注意よりも優れています [3]。

We suspect that for large values of dk, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients.
dk の値が大きい場合、ドット積の大きさが大きくなり、softmax 関数が非常に小さい勾配を持つ領域に押し込まれると思われます。

To counteract this effect, we scale the dot products by 1/√dk.
この影響を打ち消すために、内積を 1/√dk でスケーリングします。
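A small simulation may make the motivation for the 1/√dk factor more concrete. It assumes, purely for illustration, that the components of a query and a key are independent draws with mean 0 and variance 1; under that assumption the dot product has variance close to dk, and dividing by √dk brings it back to about 1. The trial count and seed are arbitrary choices.

```python
import math
import random

def dot_product_variance(d_k, trials=2000, scale=1.0, seed=0):
    """Empirical variance of scale * (q . k) for random q, k of dimension d_k."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        samples.append(scale * sum(a * b for a, b in zip(q, k)))
    mean = sum(samples) / trials
    return sum((s - mean) ** 2 for s in samples) / trials

raw = dot_product_variance(64)                                # roughly d_k = 64
scaled = dot_product_variance(64, scale=1.0 / math.sqrt(64))  # roughly 1
```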

3.2.2 Multi-Head Attention

Instead of performing a single attention function with dmodel-dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values h times with different, learned linear projections to dk, dk and dv dimensions, respectively.
dmodel 次元のキー、値、およびクエリを使用して単一のアテンション関数を実行する代わりに、それぞれ dk、dk、および dv 次元への異なる学習済み線形射影を使用して、クエリ、キー、および値を h 回線形射影することが有益であることがわかりました。

On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding dv-dimensional output values.
これらの射影されたクエリ、キー、および値の各バージョンで、アテンション関数を並行して実行し、dv 次元の出力値を生成します。

These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2.
これらは連結され、再び射影され、最終的な値が得られます (図 2 参照)。

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.
マルチヘッドアテンションにより、モデルは異なる位置にある異なる表現部分空間からの情報に共同で注意を向けることができます。

With a single attention head, averaging inhibits this.
アテンションヘッドが 1 つの場合、平均化によってこれが抑制されます。

In this work we employ h = 8 parallel attention layers, or heads.
この作業では、h = 8 の平行な注意層、またはヘッドを使用します。

For each of these we use dk = dv = dmodel/h = 64.
これらのそれぞれについて、dk = dv = dmodel/h = 64 を使用します。

Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.
各ヘッドの次元が削減されるため、総計算コストは、完全な次元を持つ単一ヘッドの注意のコストと同様です。
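The claim that the cost stays similar to single-head attention can be checked with simple arithmetic on the projection sizes. The sketch below counts only projection weights and ignores biases; the names W_Q, W_K, W_V and W_O in the comments follow common notation and are assumptions of this illustration.

```python
d_model, h = 512, 8
d_k = d_v = d_model // h  # 64 per head

# Query, key and value projections for one head.
per_head = d_model * d_k + d_model * d_k + d_model * d_v
# All heads, plus the output projection applied to the concatenated heads.
multi_head = h * per_head + (h * d_v) * d_model
# One full-width head with the same kind of output projection.
single_head = 3 * d_model * d_model + d_model * d_model
```

Both counts come out equal, which matches the statement that reducing each head's dimension keeps the total cost comparable.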

3.2.3 Applications of Attention in our Model

The Transformer uses multi-head attention in three different ways:
Transformer は、マルチヘッドアテンションを 3 つの異なる方法で使用します。

In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder.
- 「エンコーダーデコーダーアテンション」レイヤーでは、クエリは前のデコーダーレイヤーから取得され、メモリキーと値はエンコーダーの出力から取得されます。

This allows every position in the decoder to attend over all positions in the input sequence.
これにより、デコーダ内のすべての位置が入力シーケンス内のすべての位置に対応できるようになります。

This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as [38, 2, 9].
これは、[38、2、9] などのシーケンスからシーケンスへのモデルにおける典型的なエンコーダー/デコーダーの注意メカニズムを模倣しています。

The encoder contains self-attention layers.
エンコーダーには自己注意レイヤーが含まれています。

In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder.
セルフアテンションレイヤーでは、すべてのキー、値、クエリが同じ場所 (この場合は、エンコーダーの前のレイヤーの出力) から取得されます。

Each position in the encoder can attend to all positions in the previous layer of the encoder.
エンコーダーの各位置は、エンコーダーの前のレイヤーのすべての位置に対応できます。

Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position.
同様に、デコーダ内のセルフアテンション層により、デコーダ内の各位置が、その位置までのデコーダ内のすべての位置に対応できるようになります。

We need to prevent leftward information flow in the decoder to preserve the auto-regressive property.
自己回帰特性を維持するために、デコーダーでの左方向の情報フローを防止する必要があります。

We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections.
不正な接続に対応する softmax の入力のすべての値をマスクアウト (-∞ に設定) することにより、スケーリングされた内積注意の内部でこれを実装します。

See Figure 2.
図 2 を参照してください。
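The masking step can be illustrated by setting the softmax inputs for later positions to −∞, so that their weights become exactly zero. This is a minimal sketch with hypothetical helper names, operating on a square matrix of raw scores where row i holds the scores of position i against every position j.

```python
import math

def causal_mask_scores(scores):
    """Replace scores[i][j] with -inf wherever j > i (a later position)."""
    return [[s if j <= i else -math.inf for j, s in enumerate(row)]
            for i, row in enumerate(scores)]

def softmax(row):
    m = max(row)  # the diagonal entry is always finite, so m is finite
    exps = [math.exp(v - m) for v in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = [softmax(row) for row in causal_mask_scores(scores)]
```

With uniform raw scores, the first position attends only to itself, the second splits its weight over two positions, and the third over all three.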

3.3 Position-wise Feed-Forward Networks

In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically.
注意サブレイヤーに加えて、エンコーダーとデコーダーの各レイヤーには、完全に接続されたフィードフォワードネットワークが含まれており、各位置に個別かつ同一に適用されます。

This consists of two linear transformations with a ReLU activation in between.
これは、ReLU アクティベーションを間に挟んだ 2 つの線形変換で構成されます。

While the linear transformations are the same across different positions, they use different parameters from layer to layer.
線形変換は異なる位置で同じですが、レイヤーごとに異なるパラメーターを使用します。

Another way of describing this is as two convolutions with kernel size 1.
これを説明する別の方法は、カーネルサイズ 1 の 2 つの畳み込みです。

The dimensionality of input and output is dmodel = 512, and the inner-layer has dimensionality dff = 2048.
入力と出力の次元は dmodel = 512 で、内層の次元は dff = 2048 です。
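Two linear transformations with a ReLU in between can be written as FFN(x) = max(0, xW1 + b1)W2 + b2. The sketch below applies this to a single position using toy sizes (2 and 3 instead of 512 and 2048); the weight values and helper names are made up for illustration.

```python
def matvec(x, W, b):
    """Compute xW + b for a row vector x and a matrix W given as a list of rows."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def ffn(x, W1, b1, W2, b2):
    hidden = [max(0.0, v) for v in matvec(x, W1, b1)]  # ReLU activation
    return matvec(hidden, W2, b2)

# Toy parameters: input/output size 2, inner size 3.
W1 = [[1.0, -1.0, 0.0], [0.0, 1.0, 1.0]]
b1 = [0.0, 0.0, -1.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.5, 0.5]
out = ffn([1.0, 2.0], W1, b1, W2, b2)
```

Applying `ffn` independently to every position with the same parameters is what "position-wise" refers to.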

3.4 Embeddings and Softmax

Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension dmodel.
他のシーケンス変換モデルと同様に、学習した埋め込みを使用して、入力トークンと出力トークンを次元 dmodel のベクトルに変換します。

We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities.
また、通常の学習線形変換とソフトマックス関数を使用して、デコーダ出力を予測された次のトークン確率に変換します。

In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to [30].
このモデルでは、[30] と同様に、2 つの埋め込みレイヤーとプレソフトマックス線形変換の間で同じ重み行列を共有します。

In the embedding layers, we multiply those weights by √dmodel.
埋め込みレイヤーでは、これらの重みに √dmodel を掛けます。

3.5 Positional Encoding

Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence.
モデルには再帰も畳み込みも含まれていないため、モデルがシーケンスの順序を利用できるようにするには、シーケンス内のトークンの相対位置または絶対位置に関する情報を注入する必要があります。

To this end, we add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks.
この目的のために、エンコーダーとデコーダーのスタックの下部にある入力埋め込みに「位置エンコーディング」を追加します。

The positional encodings have the same dimension dmodel as the embeddings, so that the two can be summed.
位置エンコーディングは埋め込みと同じ次元 dmodel を持つため、2 つを合計できます。

There are many choices of positional encodings, learned and fixed [9].
位置エンコーディングには多くの選択肢があり、学習され、固定されています [9]。

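In this work, we use sine and cosine functions of different frequencies:
この作業では、異なる周波数の正弦関数と余弦関数を使用します。

$$PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d_{\text{model}}}\right)$$

$$PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d_{\text{model}}}\right)$$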

where pos is the position and i is the dimension.
ここで、pos は位置、i は次元です。

That is, each dimension of the positional encoding corresponds to a sinusoid.
つまり、位置エンコーディングの各次元は正弦波に対応します。

The wavelengths form a geometric progression from 2π to 10000 · 2π.
波長は、2π から 10000 · 2π までの等比数列を形成します。

We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.
この関数を選択したのは、この関数を使用すると、モデルが相対的な位置に対応することを簡単に学習できると仮定したためです。これは、任意の固定オフセット k に対して、$PE_{pos+k}$ を $PE_{pos}$ の線形関数として表すことができるためです。

We also experimented with using learned positional embeddings [9] instead, and found that the two versions produced nearly identical results (see Table 3 row (E)).
また、代わりに学習した位置埋め込み [9] を使用して実験したところ、2 つのバージョンがほぼ同じ結果を生成することがわかりました (表 3 の行 (E) を参照)。

We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.
正弦波バージョンを選択したのは、モデルがトレーニング中に発生したものよりも長いシーケンス長に外挿できる可能性があるためです。
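The sinusoidal encoding from the paper, with sin(pos / 10000^(2i/dmodel)) in dimension 2i and the matching cosine in dimension 2i+1, can be sketched as below. The `shift` helper is a hypothetical addition that illustrates the linearity claim: the encoding at pos + k is obtained from the encoding at pos by rotating each (sin, cos) pair, using only the angle-addition identities.

```python
import math

def positional_encoding(pos, d_model):
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])  # dimensions 2i and 2i+1
    return pe

def shift(pe, k, d_model):
    """Obtain PE(pos + k) from PE(pos) with a fixed linear map per (sin, cos) pair."""
    out = []
    for i in range(d_model // 2):
        w = 1.0 / (10000 ** (2 * i / d_model))
        s, c = pe[2 * i], pe[2 * i + 1]
        out.extend([s * math.cos(k * w) + c * math.sin(k * w),
                    c * math.cos(k * w) - s * math.sin(k * w)])
    return out

d_model = 8  # toy size instead of 512
shifted = shift(positional_encoding(5, d_model), 3, d_model)
direct = positional_encoding(8, d_model)
```

The coefficients in `shift` depend only on the offset k, not on pos, which is the property the text relies on.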

Table 1: Maximum path lengths, per-layer complexity and minimum number of sequential operations for different layer types.
表 1: さまざまなレイヤータイプの最大パス長、レイヤーごとの複雑さ、および連続操作の最小数。

n is the sequence length, d is the representation dimension, k is the kernel size of convolutions and r the size of the neighborhood in restricted self-attention.
n はシーケンスの長さ、d は表現次元、k は畳み込みのカーネルサイズ、r は制限された自己注意の近傍のサイズです。

4 Why Self-Attention

In this section we compare various aspects of self-attention layers to the recurrent and convolutional layers commonly used for mapping one variable-length sequence of symbol representations (x1, ..., xn) to another sequence of equal length (z1, ..., zn), with $x_i, z_i \in \mathbb{R}^d$, such as a hidden layer in a typical sequence transduction encoder or decoder.
このセクションでは、自己注意層のさまざまな側面を、シンボル表現の可変長シーケンス (x1, ..., xn) を同じ長さの別のシーケンス (z1, ..., zn) ($x_i, z_i \in \mathbb{R}^d$) にマッピングするために一般的に使用される再帰層および畳み込み層と比較します。典型的なシーケンス変換エンコーダーまたはデコーダーの隠れ層がその例です。

Motivating our use of self-attention we consider three desiderata.
自己注意の使用を動機付けるために、私たちは 3 つの必要性を考えています。

One is the total computational complexity per layer.
1 つは、レイヤーごとの計算の複雑さの合計です。

Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required.
もう 1 つは、必要な順次操作の最小数によって測定される、並列化できる計算の量です。

The third is the path length between long-range dependencies in the network.
3 つ目は、ネットワーク内の長距離依存関係間のパスの長さです。

Learning long-range dependencies is a key challenge in many sequence transduction tasks.
長期的な依存関係を学習することは、多くの配列変換タスクにおける重要な課題です。

One key factor affecting the ability to learn such dependencies is the length of the paths forward and backward signals have to traverse in the network.
このような依存関係を学習する能力に影響を与える重要な要因の 1 つは、ネットワーク内を通過する必要がある前方および後方信号のパスの長さです。

The shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies [12].
入力シーケンスと出力シーケンスの位置の任意の組み合わせ間のこれらのパスが短いほど、長期的な依存関係を学習しやすくなります [12]。

Hence we also compare the maximum path length between any two input and output positions in networks composed of the different layer types.
したがって、異なる層タイプで構成されるネットワーク内の任意の 2 つの入力位置と出力位置の間の最大パス長も比較します。

As noted in Table 1, a self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires O(n) sequential operations.
表 1 に示すように、自己注意層はすべての位置を一定数の連続して実行される操作で接続しますが、再帰層は O(n) の連続操作を必要とします。

In terms of computational complexity, self-attention layers are faster than recurrent layers when the sequence length n is smaller than the representation dimensionality d, which is most often the case with sentence representations used by state-of-the-art models in machine translations, such as word-piece [38] and byte-pair [31] representations.
計算の複雑さの点では、シーケンスの長さ n が表現の次元数 d よりも小さい場合、自己注意層は再帰層よりも高速です。これは、ワードピース [38] やバイトペア [31] 表現など、機械翻訳の最先端のモデルで使用される文の表現ではほとんどの場合に当てはまります。
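The comparison can be made concrete with the per-layer complexities the paper lists in Table 1: O(n² · d) for self-attention and O(n · d²) for a recurrent layer. The sketch below drops constant factors, and the sentence length n = 50 is a hypothetical value chosen only for illustration.

```python
def self_attention_ops(n, d):
    return n * n * d      # O(n^2 * d) per layer

def recurrent_ops(n, d):
    return n * d * d      # O(n * d^2) per layer

n, d = 50, 512  # hypothetical sentence length; d matches dmodel of the base model
ratio = recurrent_ops(n, d) / self_attention_ops(n, d)  # simplifies to d / n
```

The ratio simplifies to d / n, so self-attention has the lower operation count whenever n < d.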

To improve computational performance for tasks involving very long sequences, self-attention could be restricted to considering only a neighborhood of size r in the input sequence centered around the respective output position.
非常に長いシーケンスを含むタスクの計算パフォーマンスを向上させるために、自己注意は、それぞれの出力位置を中心とした入力シーケンス内のサイズ r の近傍のみを考慮するように制限できます。

This would increase the maximum path length to O(n/r).
これにより、最大パス長が O(n/r) に増加します。

We plan to investigate this approach further in future work.
このアプローチについては、今後の作業でさらに調査する予定です。

A single convolutional layer with kernel width k < n does not connect all pairs of input and output positions.
カーネル幅が k < n の単一の畳み込み層では、入力位置と出力位置のすべてのペアが接続されるわけではありません。

Doing so requires a stack of O(n/k) convolutional layers in the case of contiguous kernels, or O(log_k(n)) in the case of dilated convolutions [18], increasing the length of the longest paths between any two positions in the network.
これを行うには、連続カーネルの場合は O(n/k) 層、拡張畳み込み [18] の場合は O(log_k(n)) 層の畳み込み層のスタックが必要であり、ネットワーク内の任意の 2 つの位置間の最長パスの長さが増加します。

Convolutional layers are generally more expensive than recurrent layers, by a factor of k.
一般に、畳み込み層は再帰層よりも k 倍高価です。

Separable convolutions [6], however, decrease the complexity considerably, to O(k · n · d + n · d²).
ただし、分離可能な畳み込み [6] では、複雑さが大幅に減少し、O(k · n · d + n · d²) になります。

Even with k = n, however, the complexity of a separable convolution is equal to the combination of a self-attention layer and a point-wise feed-forward layer, the approach we take in our model.
ただし、k = n の場合でも、分離可能な畳み込みの複雑さは、モデルで採用するアプローチである、自己注意レイヤーとポイントワイズフィードフォワードレイヤーの組み合わせと同じです。

As side benefit, self-attention could yield more interpretable models.
副次的な利点として、自己注意により、より解釈可能なモデルが得られる可能性があります。

We inspect attention distributions from our models and present and discuss examples in the appendix.
モデルからの注意分布を調べ、付録で例を示して説明します。

Not only do individual attention heads clearly learn to perform different tasks, many appear to exhibit behavior related to the syntactic and semantic structure of the sentences.
個々のアテンションヘッドはさまざまなタスクを実行することを明確に学習するだけでなく、多くが文の構文的および意味的構造に関連する行動を示すように見えます。

5 Training

This section describes the training regime for our models.
このセクションでは、モデルのトレーニング体制について説明します。

5.1 Training Data and Batching

We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs.
約 450 万の文のペアで構成される標準の WMT 2014 英語-ドイツ語データセットでトレーニングを行いました。

Sentences were encoded using byte-pair encoding [3], which has a shared source-target vocabulary of about 37000 tokens.
文は、ソースとターゲットで共有される約 37000 トークンの語彙を持つバイトペアエンコーディング [3] を使用してエンコードされました。

For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary [38].
英仏については、3,600 万文からなる、大幅に大規模な WMT 2014 英仏データセットを使用し、トークンを 32,000 語のワードピース語彙に分割しました [38]。

Sentence pairs were batched together by approximate sequence length.
文のペアは、おおよそのシーケンスの長さによってまとめられました。

Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens.
各トレーニングバッチには、約 25000 のソーストークンと 25000 のターゲットトークンを含む文のペアのセットが含まれていました。

5.2 Hardware and Schedule

We trained our models on one machine with 8 NVIDIA P100 GPUs.
8 つの NVIDIA P100 GPU を搭載した 1 台のマシンでモデルをトレーニングしました。

For our base models using the hyperparameters described throughout the paper, each training step took about 0.4 seconds.
このホワイトペーパー全体で説明されているハイパーパラメーターを使用する基本モデルでは、各トレーニングステップに約 0.4 秒かかりました。

We trained the base models for a total of 100,000 steps or 12 hours.
基本モデルを合計 100,000 ステップまたは 12 時間トレーニングしました。

For our big models (described on the bottom line of Table 3), step time was 1.0 seconds.
大きなモデル (表 3 の一番下の行に記載) では、ステップ時間は 1.0 秒でした。

The big models were trained for 300,000 steps (3.5 days).
大きなモデルは、300,000 ステップ (3.5 日) トレーニングされました。
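The quoted training times follow from the step counts and step times given above, as a quick arithmetic check shows (rounded values only, so small differences from the stated 12 hours and 3.5 days are expected).

```python
base_hours = 100_000 * 0.4 / 3600   # base model: about 11.1 hours
big_days = 300_000 * 1.0 / 86400    # big model: about 3.47 days
```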

5.3 Optimizer

We used the Adam optimizer [20] with β1 = 0.9, β2 = 0.98 and ε = 10⁻⁹.
β1 = 0.9、β2 = 0.98、および ε = 10⁻⁹ で Adam オプティマイザ [20] を使用しました。

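We varied the learning rate over the course of training, according to the formula:
次の式に従って、トレーニングの過程で学習率を変化させました。

$$lrate = d_{\text{model}}^{-0.5} \cdot \min\left(step\_num^{-0.5},\; step\_num \cdot warmup\_steps^{-1.5}\right)$$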

This corresponds to increasing the learning rate linearly for the first warmup_steps training steps, and decreasing it thereafter proportionally to the inverse square root of the step number.
これは、最初の warmup_steps トレーニングステップで学習率を直線的に増加させ、その後、ステップ数の逆平方根に比例して減少させることに対応します。

We used warmup_steps = 4000.
warmup_steps = 4000 を使用しました。
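The schedule described above (linear warmup, then decay proportional to the inverse square root of the step number) corresponds to the paper's formula lrate = dmodel^(−0.5) · min(step_num^(−0.5), step_num · warmup_steps^(−1.5)). A minimal sketch, assuming steps are counted from 1:

```python
def lrate(step_num, d_model=512, warmup_steps=4000):
    """Learning rate at a given training step (step_num must be >= 1)."""
    return d_model ** -0.5 * min(step_num ** -0.5,
                                 step_num * warmup_steps ** -1.5)

peak = lrate(4000)          # the two branches meet at step == warmup_steps
halfway_up = lrate(2000)    # linear warmup: half the peak
later = lrate(16000)        # inverse-sqrt decay: half the peak again
```

With dmodel = 512 and 4000 warmup steps, the peak rate is roughly 7e-4.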

5.4 Regularization

We employ three types of regularization during training:
トレーニング中に 3 種類の正則化を使用します。

Residual Dropout

We apply dropout [33] to the output of each sub-layer, before it is added to the sub-layer input and normalized.
サブレイヤーの入力に追加して正規化する前に、各サブレイヤーの出力にドロップアウト [33] を適用します。

In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks.
さらに、エンコーダースタックとデコーダースタックの両方で、埋め込みと位置エンコーディングの合計にドロップアウトを適用します。

For the base model, we use a rate of Pdrop = 0.1.
基本モデルでは、Pdrop = 0.1 のレートを使用します。

Table 2: The Transformer achieves better BLEU scores than previous state-of-the-art models on the English-to-German and English-to-French newstest2014 tests at a fraction of the training cost.
表 2: Transformer は、英語からドイツ語および英語からフランス語への newstest2014 テストで、以前の最先端のモデルよりも優れた BLEU スコアを、わずかなトレーニングコストで達成しています。

Label Smoothing

During training, we employed label smoothing of value εls = 0.1 [36].
トレーニング中、値 εls = 0.1 のラベル平滑化を採用しました [36]。

This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.
モデルがより不確かになるよう学習するため、これはパープレキシティを悪化させますが、精度と BLEU スコアは向上します。
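Label smoothing replaces the one-hot training target with a softened distribution. The sketch below assumes one common formulation, mixing the one-hot target with a uniform distribution over the vocabulary; the text does not say which exact variant was used, so treat this as an illustration rather than the paper's implementation.

```python
def smooth_labels(target_index, vocab_size, eps=0.1):
    """Mix the one-hot target with a uniform distribution over the vocabulary."""
    uniform = eps / vocab_size
    dist = [uniform] * vocab_size
    dist[target_index] += 1.0 - eps
    return dist

# Toy vocabulary of 5 tokens, correct token at index 2.
dist = smooth_labels(target_index=2, vocab_size=5)
```

Because the target never puts full probability on one token, the model is pushed away from fully confident predictions, which is consistent with the worse perplexity noted above.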

6 Results

6.1 Machine Translation

On the WMT 2014 English-to-German translation task, the big transformer model (Transformer (big) in Table 2) outperforms the best previously reported models (including ensembles) by more than 2.0 BLEU, establishing a new state-of-the-art BLEU score of 28.4.
WMT 2014 の英語からドイツ語への翻訳タスクでは、ビッグトランスフォーマーモデル (表 2 の Transformer (big)) は、以前に報告された最高のモデル (アンサンブルを含む) を 2.0 BLEU 以上上回り、新たな最高水準となる BLEU スコア 28.4 を確立しています。

The configuration of this model is listed in the bottom line of Table 3.
このモデルの構成は、表 3 の一番下の行にリストされています。

Training took 3.5 days on 8 P100 GPUs.
トレーニングには、8 つの P100 GPU で 3.5 日かかりました。

Even our base model surpasses all previously published models and ensembles, at a fraction of the training cost of any of the competitive models.
私たちの基本モデルでさえ、以前に公開されたすべてのモデルとアンサンブルを上回り、競合モデルのトレーニングコストのほんの一部です。

On the WMT 2014 English-to-French translation task, our big model achieves a BLEU score of 41.0, outperforming all of the previously published single models, at less than 1/4 the training cost of the previous state-of-the-art model.
WMT 2014 の英語からフランス語への翻訳タスクでは、ビッグモデルは 41.0 の BLEU スコアを達成し、以前に公開されたすべての単一モデルを上回りました。トレーニングコストは、以前の最先端モデルの 1/4 未満です。

The Transformer (big) model trained for English-to-French used dropout rate Pdrop = 0.1, instead of 0.3.
英語からフランス語への翻訳用にトレーニングされた Transformer (big) モデルでは、ドロップアウト率として 0.3 ではなく Pdrop = 0.1 を使用しました。

For the base models, we used a single model obtained by averaging the last 5 checkpoints, which were written at 10-minute intervals.
基本モデルには、10 分間隔で書き込まれた最後の 5 つのチェックポイントを平均して得られた単一のモデルを使用しました。

For the big models, we averaged the last 20 checkpoints.
大きなモデルについては、最後の 20 個のチェックポイントを平均しました。
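Checkpoint averaging can be sketched as an element-wise mean over saved parameter sets. The dictionary layout (parameter name mapped to a flat list of floats) is a hypothetical representation chosen for this example; real checkpoints hold tensors.

```python
def average_checkpoints(checkpoints):
    """Average parameter values element-wise across checkpoints.

    Each checkpoint is a dict mapping a parameter name to a flat list of floats.
    """
    n = len(checkpoints)
    return {name: [sum(vals) / n
                   for vals in zip(*(c[name] for c in checkpoints))]
            for name in checkpoints[0]}

ckpts = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}, {"w": [5.0, 6.0]}]
avg = average_checkpoints(ckpts)
```

For the base models this would be applied to the last 5 checkpoints, and for the big models to the last 20.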

We used beam search with a beam size of 4 and length penalty α = 0.6 [38].
ビームサイズ 4、長さペナルティ α = 0.6 のビームサーチを使用しました [38]。

These hyperparameters were chosen after experimentation on the development set.
これらのハイパーパラメータは、開発セットでの実験の後に選択されました。

We set the maximum output length during inference to input length + 50, but terminate early when possible [38].
推論中の最大出力長を入力長 + 50 に設定しますが、可能な場合は早期に終了します [38]。

Table 2 summarizes our results and compares our translation quality and training costs to other model architectures from the literature.
表 2 は、結果を要約し、翻訳の品質とトレーニングコストを文献の他のモデルアーキテクチャと比較しています。

We estimate the number of floating point operations used to train a model by multiplying the training time, the number of GPUs used, and an estimate of the sustained single-precision floating-point capacity of each GPU (footnote 5).
モデルのトレーニングに使用される浮動小数点演算の数は、トレーニング時間、使用される GPU の数、および各 GPU の持続的な単精度浮動小数点容量の推定値を乗算することによって推定されます (脚注 5)。
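The estimate described above is a simple product, sketched below. The training time (3.5 days) and GPU count (8) come from the text; the sustained throughput of 9.5 TFLOPS per P100 is an assumed value used for illustration and is not stated in this section.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def estimate_training_flops(days, num_gpus, sustained_tflops):
    """Training FLOPs = time (s) x number of GPUs x sustained FLOP/s per GPU."""
    return days * SECONDS_PER_DAY * num_gpus * sustained_tflops * 1e12


# Big model: 3.5 days on 8 P100 GPUs.
# 9.5 TFLOPS per GPU is an assumption made for this example.
big_model_flops = estimate_training_flops(3.5, 8, 9.5)
print(f"{big_model_flops:.2e}")
```

Under these assumptions the product comes out on the order of 10^19 floating point operations.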

6.2 Model Variations

To evaluate the importance of different components of the Transformer, we varied our base model in different ways, measuring the change in performance on English-to-German translation on the development set, newstest2013.
Transformer のさまざまなコンポーネントの重要性を評価するために、基本モデルをさまざまな方法で変更し、開発セット newstest2013 での英語からドイツ語への翻訳のパフォーマンスの変化を測定しました。

We used beam search as described in the previous section, but no checkpoint averaging.
前のセクションで説明したようにビームサーチを使用しましたが、チェックポイントの平均化は使用しませんでした。

We present these results in Table 3.
これらの結果を表 3 に示します。

In Table 3 rows (A), we vary the number of attention heads and the attention key and value dimensions, keeping the amount of computation constant, as described in Section 3.2.2.
表 3 の行 (A) では、セクション 3.2.2 で説明したように、計算量を一定に保ちながら、アテンションヘッドの数とアテンションキーと値の次元を変化させています。

While single-head attention is 0.9 BLEU worse than the best setting, quality also drops off with too many heads.
シングルヘッドのアテンションは最高の設定よりも 0.9 BLEU 悪いですが、ヘッドが多すぎると品質も低下します。
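One way to read "keeping the amount of computation constant" is that the per-head key and value dimensions shrink as the number of heads grows, so that heads times per-head dimension stays equal to the model dimension. The sketch below assumes d_k = d_v = d_model / h and a base model dimension of 512; both are assumptions here, since neither value is restated in this section.

```python
def per_head_dim(d_model, num_heads):
    """Per-head key/value dimension, assuming d_k = d_v = d_model / h."""
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    return d_model // num_heads


# Assumed base model dimension of 512.
for heads in (1, 4, 8, 16, 32):
    d_k = per_head_dim(512, heads)
    # heads * d_k is the same for every setting.
    print(heads, d_k, heads * d_k)
```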

Table 3: Variations on the Transformer architecture.
表 3: Transformer アーキテクチャのバリエーション。

Unlisted values are identical to those of the base model.
記載されていない値は、ベースモデルの値と同じです。

All metrics are on the English-to-German translation development set, newstest2013.
すべての指標は、英語からドイツ語への翻訳の開発セット newstest2013 で測定したものです。

Listed perplexities are per-wordpiece, according to our byte-pair encoding, and should not be compared to per-word perplexities.
記載されているパープレキシティは、私たちのバイトペアエンコーディングに基づくワードピース単位の値であり、単語単位のパープレキシティと比較すべきではありません。

In Table 3 rows (B), we observe that reducing the attention key size dk hurts model quality.
表 3 の行 (B) では、アテンションキーのサイズ dk を小さくするとモデルの品質が低下することがわかります。

This suggests that determining compatibility is not easy and that a more sophisticated compatibility function than dot product may be beneficial.
これは、互換性を判断することは容易ではなく、内積よりも洗練された互換性関数が有益である可能性があることを示唆しています。

We further observe in rows (C) and (D) that, as expected, bigger models are better, and dropout is very helpful in avoiding over-fitting.
行 (C) と (D) では、予想どおり、モデルが大きいほど優れており、ドロップアウトはオーバーフィッティングを回避するのに非常に役立ちます。

In row (E) we replace our sinusoidal positional encoding with learned positional embeddings [9], and observe nearly identical results to the base model.
行 (E) では、正弦波による位置エンコーディングを、学習された位置埋め込み [9] に置き換えましたが、基本モデルとほぼ同じ結果が得られました。
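For reference, the sinusoidal encoding that row (E) replaces can be sketched as follows. This assumes the formulation from the paper's positional encoding section, PE(pos, 2i) = sin(pos / 10000^(2i / d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model)); the function name is hypothetical, and a learned embedding would instead be a trainable table of the same shape.

```python
import math


def sinusoidal_positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of fixed sinusoidal encodings."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")
    table = []
    for pos in range(max_len):
        row = [0.0] * d_model
        # even_index plays the role of 2i in the formula.
        for even_index in range(0, d_model, 2):
            angle = pos / (10000 ** (even_index / d_model))
            row[even_index] = math.sin(angle)
            row[even_index + 1] = math.cos(angle)
        table.append(row)
    return table
```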

6.3 English Constituency Parsing

To evaluate if the Transformer can generalize to other tasks we performed experiments on English constituency parsing.
Transformer が他のタスクに一般化できるかどうかを評価するために、英語の句構造解析 (constituency parsing) に関する実験を行いました。

This task presents specific challenges: the output is subject to strong structural constraints and is significantly longer than the input.
このタスクには特定の課題があります。アウトプットは強い構造的制約を受け、インプットよりも大幅に長くなります。

Furthermore, RNN sequence-to-sequence models have not been able to attain state-of-the-art results in small-data regimes [37].
さらに、RNN の sequence-to-sequence モデルは、データが少ない状況では最先端の結果を達成できていません [37]。

We trained a 4-layer transformer with dmodel = 1024 on the Wall Street Journal (WSJ) portion of the Penn Treebank [25], about 40K training sentences.
Penn Treebank [25] の Wall Street Journal (WSJ) 部分で、約 40K のトレーニングセンテンスで、dmodel = 1024 の 4 層トランスフォーマーをトレーニングしました。

We also trained it in a semi-supervised setting, using the larger high-confidence and BerkeleyParser corpora with approximately 17M sentences [37].
また、約 1,700 万文からなる、より大規模な高信頼度コーパスと BerkeleyParser コーパス [37] を使用して、半教師あり設定でもトレーニングしました。

We used a vocabulary of 16K tokens for the WSJ only setting and a vocabulary of 32K tokens for the semi-supervised setting.
WSJ のみの設定では 16K トークンの語彙を使用し、半教師あり設定では 32K トークンの語彙を使用しました。

We performed only a small number of experiments to select the dropout, both attention and residual (section 5.4), learning rates and beam size on the Section 22 development set; all other parameters remained unchanged from the English-to-German base translation model.
セクション 22 の開発セットで、ドロップアウト (アテンションと残差の両方、セクション 5.4)、学習率、およびビームサイズを選択するために、少数の実験のみを実行しました。他のすべてのパラメーターは、英語からドイツ語への基本翻訳モデルから変更されていません。

During inference, we increased the maximum output length to input length + 300.
推論中に、最大出力長を入力長 + 300 に増やしました。

We used a beam size of 21 and α = 0.3 for both WSJ only and the semi-supervised setting.
WSJ のみの設定と半教師あり設定の両方で、ビームサイズ 21 と α = 0.3 を使用しました。

Our results in Table 4 show that despite the lack of task-specific tuning our model performs surprisingly well, yielding better results than all previously reported models with the exception of the Recurrent Neural Network Grammar [8].
表 4 の結果は、タスク固有の調整がないにもかかわらず、モデルが驚くほどうまく機能し、Recurrent Neural Network Grammar [8] を除いて、以前に報告されたすべてのモデルよりも優れた結果をもたらすことを示しています。

In contrast to RNN sequence-to-sequence models [37], the Transformer outperforms the BerkeleyParser [29] even when training only on the WSJ training set of 40K sentences.
RNN sequence-to-sequence モデル [37] とは対照的に、Transformer は、40K センテンスの WSJ トレーニングセットのみでトレーニングした場合でも、BerkeleyParser [29] より優れています。

7 Conclusion

In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.
本研究では、完全にアテンションに基づく最初のシーケンス変換モデルである Transformer を提示し、エンコーダー/デコーダーアーキテクチャで最も一般的に使用される再帰層をマルチヘッド自己アテンションに置き換えました。

For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers.
翻訳タスクの場合、Transformer は再帰層または畳み込み層に基づくアーキテクチャよりも大幅に高速にトレーニングできます。

On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art.
WMT 2014 の英語からドイツ語への翻訳タスクと WMT 2014 の英語からフランス語への翻訳タスクの両方で、新しい最先端技術を達成しました。

In the former task our best model outperforms even all previously reported ensembles.
前者のタスクでは、以前に報告されたすべてのアンサンブルよりも、最良のモデルが優れています。

We are excited about the future of attention-based models and plan to apply them to other tasks.
私たちは注意ベースのモデルの将来に期待を寄せており、それらを他のタスクに適用することを計画しています。

We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video.
Transformer をテキスト以外の入力および出力モダリティを含む問題に拡張し、画像、オーディオ、ビデオなどの大量の入力と出力を効率的に処理するためのローカルで制限された注意メカニズムを調査する予定です。

Making generation less sequential is another research goal of ours.
生成をより逐次的でないものにすることも、私たちの研究目標の 1 つです。

The code we used to train and evaluate our models is available at https://github.com/tensorflow/tensor2tensor.
モデルのトレーニングと評価に使用したコードは、https://github.com/tensorflow/tensor2tensor で入手できます。

Acknowledgements

We are grateful to Nal Kalchbrenner and Stephan Gouws for their fruitful comments, corrections and inspiration.
Nal Kalchbrenner と Stephan Gouws には、有益なコメント、修正、およびインスピレーションを与えていただき、感謝しています。
