
Japanese translation of the TensorFlow 2.0 Beta tutorial "Transformer model for language understanding"

Posted at 2019-07-16

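Introduction

This is the Japanese translation I wrote while working through the TensorFlow 2.0 Beta tutorial "Transformer model for language understanding".
I tried to make the text read as naturally as possible in Japanese, so I am leaving it here for reference. Note that the section names are not translated.
This article contains only the translation, but the edited notebook shows the English text and the Japanese translation side by side.

Original: transformer.ipynb (Colab / GitHub)
Japanese translation: transformer.ipynb (Colab / GitHub)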

Transformer model for language understanding

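This tutorial trains a Transformer model to translate Portuguese to English. It is an advanced example that assumes knowledge of text generation and attention.
The core idea behind the Transformer model is self-attention: the ability to attend to different positions of the input sequence in order to compute a representation of that sequence. The Transformer creates stacks of self-attention layers, which are explained below in the sections Scaled dot product attention and Multi-head attention.
A Transformer model handles variable-sized input using stacks of self-attention layers instead of RNNs or CNNs. This general architecture has a number of advantages:

It makes no assumptions about the temporal/spatial relationships across the data. This is ideal for processing a set of objects (for example, StarCraft units).
Layer outputs can be calculated in parallel, instead of in series like an RNN.
Distant items can affect each other's output without passing through many RNN steps or convolution layers (see Scene Memory Transformer for example).
It can learn long-range dependencies. This is a challenge in many sequence tasks.

The downsides of this architecture are:

For a time series, the output for a time step is calculated from the entire history instead of only the inputs and the current hidden state. This may be less efficient.
If the input does have a temporal/spatial relationship, like text, some positional encoding must be added. Otherwise the model will effectively see a bag of words. (Translator's note: I could not translate this sentence well. Either it describes having to look at a bag of words as a negative in itself, or a negation meaning "cannot look at it effectively" was left out, I think...)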
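The "bag of words" drawback can be checked directly. The following is a minimal NumPy sketch, not part of the tutorial, assuming a single attention head with Q = K = V and no learned weights (the function name `self_attention` is introduced here for illustration). It shows that shuffling the input tokens only shuffles the output rows in the same way, so without a positional encoding the order of the tokens is invisible to the layer.

```python
import numpy as np

def self_attention(x):
    """Plain single-head self-attention with Q = K = V = x (no learned weights)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))        # 5 token embeddings of depth 8
perm = np.array([3, 0, 4, 1, 2])        # an arbitrary reordering of the tokens

out = self_attention(tokens)
out_shuffled = self_attention(tokens[perm])

# Shuffling the input only shuffles the output rows the same way:
# each token's representation does not depend on where it sits in the sequence.
order_is_invisible = np.allclose(out[perm], out_shuffled)
print(order_is_invisible)
```

Adding a position-dependent vector to each embedding before the attention step breaks this symmetry, which is the role of the positional encoding.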

After training the model in this notebook, you will be able to input a Portuguese sentence and get back its English translation.

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

!pip install tensorflow-gpu==2.0.0-beta1
import tensorflow_datasets as tfds
import tensorflow as tf

import time
import numpy as np
import matplotlib.pyplot as plt

Setup input pipeline

Use TFDS to load the Portuguese-English translation dataset from the TED Talks Open Translation Project.
This dataset contains approximately 50000 training examples, 1100 validation examples, and 2000 test examples.

examples, metadata = tfds.load(
    'ted_hrlr_translate/pt_to_en',
    with_info=True,
    as_supervised=True)
train_examples, val_examples = examples['train'], examples['validation']

Create a custom subword tokenizer from the training dataset.

tokenizer_en = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    (en.numpy() for pt, en in train_examples), target_vocab_size=2**13)

tokenizer_pt = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    (pt.numpy() for pt, en in train_examples), target_vocab_size=2**13)

sample_string = 'Transformer is awesome.'

tokenized_string = tokenizer_en.encode(sample_string)
print(f'Tokenized string is {tokenized_string}')

original_string = tokenizer_en.decode(tokenized_string)
print(f'The original string: {original_string}')

assert original_string == sample_string

If a word is not in its dictionary, the tokenizer encodes the string by breaking it into subwords.

for ts in tokenized_string:
    print (f'{ts} ----> {tokenizer_en.decode([ts])}')

BUFFER_SIZE = 20000
BATCH_SIZE = 64

Add a start token and an end token to the input and the target.

def encode(lang1, lang2):
    lang1 = [tokenizer_pt.vocab_size] + tokenizer_pt.encode(
        lang1.numpy()) + [tokenizer_pt.vocab_size + 1]

    lang2 = [tokenizer_en.vocab_size] + tokenizer_en.encode(
        lang2.numpy()) + [tokenizer_en.vocab_size + 1]

    return lang1, lang2

Note: To keep this example small and relatively fast, drop examples longer than 40 tokens.

MAX_LENGTH = 40

Aside: On Colab, training ran out of RAM and did not finish unless I lowered MAX_LENGTH to 34.

def filter_max_length(x, y, max_length=MAX_LENGTH):
    return tf.logical_and(tf.size(x) <= max_length, tf.size(y) <= max_length)

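Operations inside .map() run in graph mode and receive a graph tensor that does not have a numpy attribute. The tokenizer expects a string or Unicode symbol that it can encode into integers. Hence, the encoding has to run inside a tf.py_function, which receives an eager tensor whose numpy attribute contains the string value.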

def tf_encode(pt, en):
    return tf.py_function(encode, [pt, en], [tf.int64, tf.int64])

train_dataset = train_examples.map(tf_encode)
train_dataset = train_dataset.filter(filter_max_length)
# cache the dataset to memory to get a speedup while reading from it.
train_dataset = train_dataset.cache()
train_dataset = train_dataset.shuffle(BUFFER_SIZE).padded_batch(
    BATCH_SIZE, padded_shapes=([-1], [-1]))
train_dataset = train_dataset.prefetch(tf.data.experimental.AUTOTUNE)

val_dataset = val_examples.map(tf_encode)
val_dataset = val_dataset.filter(filter_max_length).padded_batch(
    BATCH_SIZE, padded_shapes=([-1], [-1]))

pt_batch, en_batch = next(iter(val_dataset))
pt_batch, en_batch

Positional encoding

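Since this model does not contain any recurrence or convolution, a positional encoding is added to give the model information about the relative position of the words in the sentence.
The positional encoding vector is added to the embedding vector. Embeddings represent a token in a d-dimensional space where tokens with similar meaning are closer to each other. But the embeddings do not encode the relative position of words in a sentence. So after adding the positional encoding, words are closer to each other based on the similarity of their meaning and of their position in the sentence, in the d-dimensional space.
See the notebook on positional encoding to learn more about it. The formula for calculating the positional encoding is as follows: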

\begin{align}
& \Large{PE_{(pos, 2i)}\quad = sin(pos / 10000^{2i / d_{model}})} \\
& \Large{PE_{(pos, 2i+1)} = cos(pos / 10000^{2i / d_{model}})}
\end{align}

def get_angles(pos, i, d_model):
    angle_rates = 1 / np.power(10000, (2 * (i//2)) / np.float32(d_model))
    return pos * angle_rates

def positional_encoding(position, d_model):
    angle_rads = get_angles(
        np.arange(position)[:, np.newaxis],
        np.arange(d_model)[np.newaxis, :],
        d_model)

    # apply sin to even indices in the array; 2i
    sines = np.sin(angle_rads[:, 0::2])

    # apply cos to odd indices in the array; 2i+1
    cosines = np.cos(angle_rads[:, 1::2])

    pos_encoding = np.concatenate([sines, cosines], axis=-1)

    pos_encoding = pos_encoding[np.newaxis, ...]

    return tf.cast(pos_encoding, dtype=tf.float32)

pos_encoding = positional_encoding(50, 512)
print(pos_encoding.shape)

plt.pcolormesh(pos_encoding[0], cmap='RdBu')
plt.xlabel('Depth')
plt.xlim((0, 512))
plt.ylabel('Position')
plt.colorbar()
plt.show()
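The formula places sines at even indices and cosines at odd indices, whereas the positional_encoding function above concatenates all sines followed by all cosines. Both contain the same values, only in a different column order. As a point of comparison, here is a minimal NumPy sketch of the interleaved layout exactly as the formula writes it. The function name `interleaved_positional_encoding` is introduced here for illustration, and the sketch assumes d_model is even.

```python
import numpy as np

def interleaved_positional_encoding(num_positions, d_model):
    """PE[pos, 2i] = sin(angle), PE[pos, 2i+1] = cos(angle), as in the formula."""
    pos = np.arange(num_positions)[:, np.newaxis]          # (num_positions, 1)
    i = np.arange(d_model // 2)[np.newaxis, :]             # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)      # (num_positions, d_model / 2)

    pe = np.zeros((num_positions, d_model), dtype=np.float32)
    pe[:, 0::2] = np.sin(angles)   # even columns
    pe[:, 1::2] = np.cos(angles)   # odd columns
    return pe

pe = interleaved_positional_encoding(50, 512)
print(pe.shape)      # (50, 512)
print(pe[0, :4])     # position 0: sin(0) = 0 and cos(0) = 1 alternate
```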

Masking

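Mask all the pad tokens in the batch of sequences so that the model does not treat padding as input. The mask indicates where the pad value 0 is present: it outputs 1 at those locations and 0 everywhere else.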

def create_padding_mask(seq):
    seq = tf.cast(tf.math.equal(seq, 0), tf.float32)
  
    # add extra dimensions so that we can add the padding
    # to the attention logits.
    return seq[:, tf.newaxis, tf.newaxis, :]  # (batch_size, 1, 1, seq_len)

x = tf.constant([[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]])
create_padding_mask(x)

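The look-ahead mask is used to mask the future tokens in a sequence. In other words, the mask indicates which entries should not be used.
This means that to predict the third word, only the first and second words are used. Similarly, to predict the fourth word, only the first, second, and third words are used, and so on.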

def create_look_ahead_mask(size):
    mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    return mask  # (seq_len, seq_len)

x = tf.random.uniform((1, 3))
temp = create_look_ahead_mask(x.shape[1])
temp
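To make the two mask shapes concrete, here is a minimal NumPy sketch that mirrors the two TensorFlow functions above and combines them with an element-wise maximum, which is how the decoder's self-attention mask is built later in the training code. The function names are illustrative re-implementations, not TensorFlow APIs.

```python
import numpy as np

def padding_mask(seq):
    """1.0 where the token id is the pad value 0, shape (batch, 1, 1, seq_len)."""
    return (seq == 0).astype(np.float32)[:, np.newaxis, np.newaxis, :]

def look_ahead_mask(size):
    """1.0 strictly above the diagonal: position i may not see positions > i."""
    return np.triu(np.ones((size, size), dtype=np.float32), k=1)

seq = np.array([[7, 6, 0, 0],
                [1, 2, 3, 0]])

pad = padding_mask(seq)                      # (2, 1, 1, 4)
ahead = look_ahead_mask(seq.shape[1])        # (4, 4)
combined = np.maximum(pad, ahead)            # broadcasts to (2, 1, 4, 4)

print(ahead)
print(combined[0, 0])
```

In the combined mask an entry is 1 if the position is either padding or in the future, so a single mask hides both.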

Scaled dot product attention

The attention function used by the Transformer takes three inputs: Q (query), K (key), and V (value). The equation used to calculate the attention weights is:

$$\Large{Attention(Q, K, V) = softmax_k(\frac{QK^T}{\sqrt{d_k}}) V} $$

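The dot-product attention is scaled by a factor of the square root of the depth. This is done because for large values of depth the dot product grows large in magnitude, pushing the softmax function into regions where it has small gradients and resulting in a very "hard" softmax.
For example, suppose the components of Q and K are independent with mean 0 and variance 1. Their matrix product then has mean 0 and variance dk. Hence the square root of dk is used for scaling (and not any other number), because the matrix product of Q and K should have mean 0 and variance 1 to give a gentler softmax.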
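The variance argument can be checked numerically. This is a minimal NumPy sketch, assuming the components of the query and key vectors are drawn independently from a standard normal distribution. The sample size and d_k = 64 are arbitrary choices made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
num_samples = 20_000

# Components drawn independently with mean 0 and variance 1.
q = rng.standard_normal((num_samples, d_k))
k = rng.standard_normal((num_samples, d_k))

dots = np.sum(q * k, axis=-1)          # one dot product per sample
scaled = dots / np.sqrt(d_k)

print(dots.var())      # close to d_k
print(scaled.var())    # close to 1
```

Each dot product is a sum of d_k independent terms with variance 1, so its variance is d_k, and dividing by the square root of d_k brings it back to 1.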
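The mask is multiplied by -1e9 (close to negative infinity). This is done because the mask is added to the scaled matrix product of Q and K and is applied immediately before the softmax. The goal is to zero out these cells, and large negative inputs to a softmax produce outputs that are nearly zero.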
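The effect of adding mask * -1e9 before the softmax can be seen on a single row of logits. A minimal NumPy sketch with made-up logit values:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.5, 3.0])
mask = np.array([0.0, 0.0, 1.0, 1.0])       # 1 marks positions to hide

weights = softmax(logits + mask * -1e9)
print(weights)
```

The two masked positions receive a weight of zero, and the remaining weight is shared between the two visible positions only, even though the largest raw logit was one of the masked ones.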

def scaled_dot_product_attention(q, k, v, mask):
    """Calculate the attention weights.
    q, k, v must have matching leading dimensions.
    k, v must have matching penultimate dimension, i.e.: seq_len_k = seq_len_v.
    The mask has different shapes depending on its type(padding or look ahead) 
    but it must be broadcastable for addition.

    Args:
      q: query shape == (..., seq_len_q, depth)
      k: key shape == (..., seq_len_k, depth)
      v: value shape == (..., seq_len_v, depth_v)
      mask: Float tensor with shape broadcastable 
          to (..., seq_len_q, seq_len_k). Defaults to None.

    Returns:
      output, attention_weights
    """

    # (..., seq_len_q, seq_len_k)
    matmul_qk = tf.matmul(q, k, transpose_b=True)

    # scale matmul_qk
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

    # add the mask to the scaled tensor.
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)  

    # softmax is normalized on the last axis (seq_len_k) so that the scores
    # add up to 1.
    # (..., seq_len_q, seq_len_k)
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)

    # (..., seq_len_q, depth_v)
    output = tf.matmul(attention_weights, v)

    return output, attention_weights

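As the softmax normalization is done over K, its values decide the amount of importance given to Q.

The output represents the product of the attention weights and the V (value) vectors. This ensures that the words you want to focus on are kept as they are and the irrelevant words are flushed out.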

def print_out(q, k, v):
    temp_out, temp_attn = scaled_dot_product_attention(q, k, v, None)
    print ('Attention weights are:')
    print (temp_attn)
    print ('Output is:')
    print (temp_out)

np.set_printoptions(suppress=True)

temp_k = tf.constant([[10,0,0],
                      [0,10,0],
                      [0,0,10],
                      [0,0,10]], dtype=tf.float32)  # (4, 3)

temp_v = tf.constant([[   1,0],
                      [  10,0],
                      [ 100,5],
                      [1000,6]], dtype=tf.float32)  # (4, 2)

# This `query` aligns with the second `key`,
# so the second `value` is returned.
temp_q = tf.constant([[0, 10, 0]], dtype=tf.float32)  # (1, 3)
print_out(temp_q, temp_k, temp_v)

# This query aligns with a repeated key (third and fourth), 
# so all associated values get averaged.
temp_q = tf.constant([[0, 0, 10]], dtype=tf.float32)  # (1, 3)
print_out(temp_q, temp_k, temp_v)

# This query aligns equally with the first and second key, 
# so their values get averaged.
temp_q = tf.constant([[10, 10, 0]], dtype=tf.float32)  # (1, 3)
print_out(temp_q, temp_k, temp_v)

Pass all the queries together.


# (3, 3)
temp_q = tf.constant([[0, 0, 10], [0, 10, 0], [10, 10, 0]], dtype=tf.float32)
print_out(temp_q, temp_k, temp_v)

Multi-head attention

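Multi-head attention consists of four parts:

Linear layers and split into heads
Scaled dot-product attention
Concatenation of heads
Final linear layer

Memo: about "heads"
Because the counts match, you could simply read "head = attention", but the nuance seems to be that "head" is used in the sense of "a place or thing that thinks". If the "attending" effect of attention is regarded as a kind of "thinking", then running several attentions corresponds to thinking with several (people's) heads, and I suppose that idea is why the terms multi-head attention and single-head attention use the word "head".

Each multi-head attention block gets three inputs: Q (query), K (key), and V (value). These are put through linear (Dense) layers and split up into multiple heads.
The scaled dot-product attention defined above is applied to each head (broadcast for efficiency). An appropriate mask must be used in the attention step. The attention output of each head is then concatenated (using tf.transpose and tf.reshape) and put through a final Dense layer.
Instead of one single attention head, Q, K, and V are split into multiple heads, because this allows the model to jointly attend to information at different positions from different representational spaces. After the split each head has a reduced dimensionality, so the total computation cost is the same as single-head attention with full dimensionality.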
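The split into heads and the later concatenation are only a reshape and a transpose. The following minimal NumPy sketch, with small illustrative sizes (d_model = 16, 4 heads), shows the shapes involved and that merging the heads recovers the original tensor. The function `merge_heads` is a name introduced here and does not appear in the tutorial code.

```python
import numpy as np

def split_heads(x, num_heads):
    """(batch, seq_len, d_model) -> (batch, num_heads, seq_len, depth)."""
    batch, seq_len, d_model = x.shape
    depth = d_model // num_heads
    x = x.reshape(batch, seq_len, num_heads, depth)
    return x.transpose(0, 2, 1, 3)

def merge_heads(x):
    """(batch, num_heads, seq_len, depth) -> (batch, seq_len, d_model)."""
    batch, num_heads, seq_len, depth = x.shape
    return x.transpose(0, 2, 1, 3).reshape(batch, seq_len, num_heads * depth)

x = np.arange(2 * 5 * 16, dtype=np.float32).reshape(2, 5, 16)
heads = split_heads(x, num_heads=4)

print(heads.shape)                            # (2, 4, 5, 4)
print(np.array_equal(merge_heads(heads), x))  # True: the split loses nothing
```

Each head sees a depth of d_model / num_heads, which is why the total cost matches one head at full dimensionality.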

class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model

        assert d_model % self.num_heads == 0

        self.depth = d_model // self.num_heads

        self.wq = tf.keras.layers.Dense(d_model)
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)

        self.dense = tf.keras.layers.Dense(d_model)

    def split_heads(self, x, batch_size):
        """Split the last dimension into (num_heads, depth).
        Transpose the result such that the shape is (batch_size, num_heads, seq_len, depth)
        """
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])
    
    def call(self, v, k, q, mask):
        batch_size = tf.shape(q)[0]

        q = self.wq(q)  # (batch_size, seq_len, d_model)
        k = self.wk(k)  # (batch_size, seq_len, d_model)
        v = self.wv(v)  # (batch_size, seq_len, d_model)

        q = self.split_heads(q, batch_size)  # (batch_size, num_heads, seq_len_q, depth)
        k = self.split_heads(k, batch_size)  # (batch_size, num_heads, seq_len_k, depth)
        v = self.split_heads(v, batch_size)  # (batch_size, num_heads, seq_len_v, depth)

        # scaled_attention.shape == (batch_size, num_heads, seq_len_q, depth)
        # attention_weights.shape == (batch_size, num_heads, seq_len_q, seq_len_k)
        scaled_attention, attention_weights = scaled_dot_product_attention(
            q, k, v, mask)

        # (batch_size, seq_len_q, num_heads, depth)
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])

        # (batch_size, seq_len_q, d_model)
        concat_attention = tf.reshape(
            scaled_attention,
            (batch_size, -1, self.d_model))

        # (batch_size, seq_len_q, d_model)
        output = self.dense(concat_attention)

        return output, attention_weights

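Create a MultiHeadAttention layer to try it out. At each location in the sequence y, MultiHeadAttention runs all 8 attention heads across all other locations in the sequence and returns a new vector of the same length at each location.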

temp_mha = MultiHeadAttention(d_model=512, num_heads=8)
y = tf.random.uniform((1, 60, 512))  # (batch_size, encoder_sequence, d_model)
out, attn = temp_mha(y, k=y, q=y, mask=None)
out.shape, attn.shape

Point wise feed forward network

The point wise feed forward network consists of two fully connected layers with a ReLU activation in between.

def point_wise_feed_forward_network(d_model, dff):
    return tf.keras.Sequential([
        tf.keras.layers.Dense(dff, activation='relu'),  # (batch_size, seq_len, dff)
        tf.keras.layers.Dense(d_model)  # (batch_size, seq_len, d_model)
    ])

sample_ffn = point_wise_feed_forward_network(512, 2048)
sample_ffn(tf.random.uniform((64, 50, 512))).shape

Encoder and decoder

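The Transformer model follows the same general pattern as a standard sequence-to-sequence model with attention.

The input sentence is passed through N encoder layers, which generate an output for each word/token in the sequence.
The decoder attends to the encoder's output and to its own input (self-attention) to predict the next word.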

Encoder layer

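Each encoder layer consists of the following sublayers:

Multi-head attention (with padding mask)
Point wise feed forward network

Each of these sublayers has a residual connection around it, followed by a layer normalization. Residual connections help avoid the vanishing gradient problem in deep networks.
The output of each sublayer is LayerNorm(x + Sublayer(x)), and the normalization is done on the d_model (last) axis. There are N encoder layers in the Transformer.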
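The pattern LayerNorm(x + Sublayer(x)) can be written out in a few lines. This is a minimal NumPy sketch under simplifying assumptions: the layer normalization has no learned scale or offset, and `toy_sublayer` is a hypothetical stand-in for the attention or feed forward sublayer.

```python
import numpy as np

def layer_norm(x, epsilon=1e-6):
    """Normalize over the last axis (no learned scale or offset in this sketch)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + epsilon)

def add_and_norm(x, sublayer):
    """LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 8))                 # (batch, seq_len, d_model)
w = rng.normal(size=(8, 8))

def toy_sublayer(t):
    return np.maximum(t @ w, 0.0)              # stand-in for attention or the FFN

out = add_and_norm(x, toy_sublayer)
print(out.shape)
print(out.mean(axis=-1).round(6))              # each position has mean ~0
```

The shape is unchanged by the sublayer, which is what makes the residual addition possible and lets the layers be stacked N times.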

class EncoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super(EncoderLayer, self).__init__()

        self.mha = MultiHeadAttention(d_model, num_heads)
        self.ffn = point_wise_feed_forward_network(d_model, dff)

        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)

    def call(self, x, training, mask):

        attn_output, _ = self.mha(x, x, x, mask)  # (batch_size, input_seq_len, d_model)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)  # (batch_size, input_seq_len, d_model)

        ffn_output = self.ffn(out1)  # (batch_size, input_seq_len, d_model)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)  # (batch_size, input_seq_len, d_model)

        return out2

sample_encoder_layer = EncoderLayer(512, 8, 2048)

sample_encoder_layer_output = sample_encoder_layer(
    tf.random.uniform((64, 43, 512)), False, None)

sample_encoder_layer_output.shape  # (batch_size, input_seq_len, d_model)

Decoder layer

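Each decoder layer consists of the following sublayers:

Masked multi-head attention (with look-ahead mask and padding mask)
Multi-head attention (with padding mask). V (value) and K (key) receive the encoder output as inputs. Q (query) receives the output from the masked multi-head attention sublayer.
Point wise feed forward network

Each of these sublayers has a residual connection around it, followed by a layer normalization. The output of each sublayer is LayerNorm(x + Sublayer(x)), and the normalization is done on the d_model (last) axis.
There are N decoder layers in the Transformer.
As Q receives the output from the decoder's first attention block and K receives the encoder output, the attention weights represent the importance given to the decoder's input based on the encoder's output. In other words, the decoder predicts the next word by looking at the encoder output and self-attending to its own output. See the demonstration above in the Scaled dot product attention section.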
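The shapes make the roles of Q, K, and V in the decoder's second attention block easier to see. This is a minimal NumPy sketch that leaves out the learned projections, the multiple heads, and the mask. The sequence lengths (7 source positions, 4 target positions) are arbitrary values chosen for the illustration.

```python
import numpy as np

def attention(q, k, v):
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
d_model = 16
enc_output = rng.normal(size=(7, d_model))     # 7 source (Portuguese) positions
dec_hidden = rng.normal(size=(4, d_model))     # 4 target (English) positions so far

# Q comes from the decoder; K and V both come from the encoder output.
out, weights = attention(q=dec_hidden, k=enc_output, v=enc_output)

print(out.shape)               # (4, 16): one vector per target position
print(weights.shape)           # (4, 7): one row of source weights per target position
print(weights.sum(axis=-1))    # each row sums to 1
```

Each row of the weight matrix says how strongly one target position draws on each source position, which is what the attention plots later in the article visualize.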

class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super(DecoderLayer, self).__init__()

        self.mha1 = MultiHeadAttention(d_model, num_heads)
        self.mha2 = MultiHeadAttention(d_model, num_heads)

        self.ffn = point_wise_feed_forward_network(d_model, dff)

        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)
        self.dropout3 = tf.keras.layers.Dropout(rate)
    
    
    def call(self, x, enc_output, training, look_ahead_mask, padding_mask):
        # enc_output.shape == (batch_size, input_seq_len, d_model)

        attn1, attn_weights_block1 = self.mha1(x, x, x, look_ahead_mask)  # (batch_size, target_seq_len, d_model)
        attn1 = self.dropout1(attn1, training=training)
        out1 = self.layernorm1(attn1 + x)

        attn2, attn_weights_block2 = self.mha2(
            enc_output, enc_output, out1, padding_mask)  # (batch_size, target_seq_len, d_model)
        attn2 = self.dropout2(attn2, training=training)
        out2 = self.layernorm2(attn2 + out1)  # (batch_size, target_seq_len, d_model)

        ffn_output = self.ffn(out2)  # (batch_size, target_seq_len, d_model)
        ffn_output = self.dropout3(ffn_output, training=training)
        out3 = self.layernorm3(ffn_output + out2)  # (batch_size, target_seq_len, d_model)

        return out3, attn_weights_block1, attn_weights_block2

sample_decoder_layer = DecoderLayer(512, 8, 2048)

sample_decoder_layer_output, _, _ = sample_decoder_layer(
    tf.random.uniform((64, 50, 512)),
    sample_encoder_layer_output,
    False, None, None)

sample_decoder_layer_output.shape  # (batch_size, target_seq_len, d_model)

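Encoder

The Encoder consists of:

Input embedding
Positional encoding
N encoder layers

The input is put through an embedding, which is summed with the positional encoding. The output of this summation is the input to the encoder layers. The output of the encoder is the input to the decoder.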

class Encoder(tf.keras.layers.Layer):
    def __init__(
            self,
            num_layers,
            d_model,
            num_heads,
            dff,
            input_vocab_size,
            rate=0.1):

        super(Encoder, self).__init__()

        self.d_model = d_model
        self.num_layers = num_layers

        self.embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)
        self.pos_encoding = positional_encoding(input_vocab_size, self.d_model)

        self.enc_layers = [EncoderLayer(d_model, num_heads, dff, rate) 
                           for _ in range(num_layers)]

        self.dropout = tf.keras.layers.Dropout(rate)

    def call(self, x, training, mask):

        seq_len = tf.shape(x)[1]

        # adding embedding and position encoding.
        x = self.embedding(x)  # (batch_size, input_seq_len, d_model)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]

        x = self.dropout(x, training=training)

        for i in range(self.num_layers):
            x = self.enc_layers[i](x, training, mask)

        return x  # (batch_size, input_seq_len, d_model)

sample_encoder = Encoder(
    num_layers=2, d_model=512, num_heads=8, dff=2048, input_vocab_size=8500)

sample_encoder_output = sample_encoder(
    tf.random.uniform((64, 62)),
    training=False,
    mask=None)

print(sample_encoder_output.shape)  # (batch_size, input_seq_len, d_model)

Decoder

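The Decoder consists of:

Output embedding
Positional encoding
N decoder layers

The target is put through an embedding, which is summed with the positional encoding. The output of this summation is the input to the decoder layers. The output of the decoder is the input to the final linear layer.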

class Decoder(tf.keras.layers.Layer):
    def __init__(
            self,
            num_layers,
            d_model,
            num_heads,
            dff,
            target_vocab_size,
            rate=0.1):

        super(Decoder, self).__init__()

        self.d_model = d_model
        self.num_layers = num_layers

        self.embedding = tf.keras.layers.Embedding(target_vocab_size, d_model)
        self.pos_encoding = positional_encoding(target_vocab_size, self.d_model)

        self.dec_layers = [DecoderLayer(d_model, num_heads, dff, rate)
                           for _ in range(num_layers)]
        self.dropout = tf.keras.layers.Dropout(rate)

    def call(self, x, enc_output, training, look_ahead_mask, padding_mask):

        seq_len = tf.shape(x)[1]
        attention_weights = {}

        x = self.embedding(x)  # (batch_size, target_seq_len, d_model)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]

        x = self.dropout(x, training=training)

        for i in range(self.num_layers):
            x, block1, block2 = self.dec_layers[i](
                x, enc_output, training, look_ahead_mask, padding_mask)

            attention_weights[f'decoder_layer{i+1}_block1'] = block1
            attention_weights[f'decoder_layer{i+1}_block2'] = block2

        # x.shape == (batch_size, target_seq_len, d_model)
        return x, attention_weights

sample_decoder = Decoder(
    num_layers=2, d_model=512, num_heads=8, dff=2048, target_vocab_size=8000)

output, attn = sample_decoder(
    tf.random.uniform((64, 26)),
    enc_output=sample_encoder_output, 
    training=False,
    look_ahead_mask=None,
    padding_mask=None)

output.shape, attn['decoder_layer2_block2'].shape

Create the Transformer

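The Transformer consists of the encoder, the decoder, and a final linear layer. The output of the decoder is the input to the linear layer, and the output of the linear layer is returned.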

class Transformer(tf.keras.Model):
    def __init__(
            self,
            num_layers,
            d_model,
            num_heads,
            dff,
            input_vocab_size,
            target_vocab_size,
            rate=0.1):
        
        super(Transformer, self).__init__()

        self.encoder = Encoder(
            num_layers, d_model, num_heads, dff, input_vocab_size, rate)

        self.decoder = Decoder(
            num_layers, d_model, num_heads, dff, target_vocab_size, rate)

        self.final_layer = tf.keras.layers.Dense(target_vocab_size)

    def call(
            self,
            inp,
            tar,
            training,
            enc_padding_mask,
            look_ahead_mask,
            dec_padding_mask):

        # (batch_size, inp_seq_len, d_model)
        enc_output = self.encoder(inp, training, enc_padding_mask)

        # dec_output.shape == (batch_size, tar_seq_len, d_model)
        dec_output, attention_weights = self.decoder(
            tar, enc_output, training, look_ahead_mask, dec_padding_mask)

        # (batch_size, tar_seq_len, target_vocab_size)
        final_output = self.final_layer(dec_output)

        return final_output, attention_weights

sample_transformer = Transformer(
    num_layers=2, d_model=512, num_heads=8, dff=2048,
    input_vocab_size=8500, target_vocab_size=8000)

temp_input = tf.random.uniform((64, 62))
temp_target = tf.random.uniform((64, 26))

fn_out, _ = sample_transformer(
    temp_input, temp_target,
    training=False,
    enc_padding_mask=None,
    look_ahead_mask=None,
    dec_padding_mask=None)

fn_out.shape  # (batch_size, tar_seq_len, target_vocab_size)

Set hyperparameters

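To keep this example small and relatively fast, the values of num_layers, d_model, and dff have been reduced.
The values used in the base Transformer model were num_layers = 6, d_model = 512, dff = 2048. See the paper for all the other versions of the Transformer.
Note: By changing the values below, you can get the model that achieved state of the art on many tasks.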

num_layers = 4
d_model = 128
dff = 512
num_heads = 8

input_vocab_size = tokenizer_pt.vocab_size + 2
target_vocab_size = tokenizer_en.vocab_size + 2
dropout_rate = 0.1

Optimizer

Use the Adam optimizer with a custom learning rate scheduler that follows the formula in the paper.

$$\Large{lrate = d_{model}^{-0.5} * min(step{\_}num^{-0.5}, step{\_}num * warmup{\_}steps^{-1.5})}$$
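Evaluating the formula at a few steps shows its shape: a linear warm-up followed by a decay proportional to the inverse square root of the step number. This is a minimal plain-Python sketch. The function name `transformer_lrate` is introduced here, and the defaults d_model = 128 and warmup_steps = 4000 are taken from the values used elsewhere in this article.

```python
def transformer_lrate(step_num, d_model=128, warmup_steps=4000):
    """Learning rate from the formula above; step_num must be >= 1."""
    return d_model ** -0.5 * min(step_num ** -0.5,
                                 step_num * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 16000, 40000):
    print(step, transformer_lrate(step))

# The two arguments of min() are equal at step_num == warmup_steps,
# which is where the schedule peaks.
peak = transformer_lrate(4000)
```

With these values the peak is (d_model * warmup_steps) ** -0.5, roughly 0.0014, and the rate has halved again by step 16000.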

class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, d_model, warmup_steps=4000):
        super(CustomSchedule, self).__init__()

        self.d_model = d_model
        self.d_model = tf.cast(self.d_model, tf.float32)

        self.warmup_steps = warmup_steps

    def __call__(self, step):
        arg1 = tf.math.rsqrt(step)
        arg2 = step * (self.warmup_steps ** -1.5)

        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)

learning_rate = CustomSchedule(d_model)

optimizer = tf.keras.optimizers.Adam(
    learning_rate, beta_1=0.9, beta_2=0.98, epsilon=1e-9)

temp_learning_rate_schedule = CustomSchedule(d_model)

plt.plot(temp_learning_rate_schedule(tf.range(40000, dtype=tf.float32)))
plt.ylabel("Learning Rate")
plt.xlabel("Train Step")

Loss and metrics

Since the target sequences are padded, it is important to apply a padding mask when calculating the loss.

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)

    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask

    return tf.reduce_mean(loss_)

train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(
    name='train_accuracy')
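One detail of loss_function above is worth spelling out: after masking, tf.reduce_mean still divides by the total number of positions, padded ones included. A possible alternative, which is an assumption of this note and not something the tutorial text prescribes, is to divide by the number of real tokens instead. The following minimal NumPy sketch uses made-up per-token losses to show the difference.

```python
import numpy as np

# Per-token losses for a batch of 2 sequences padded to length 4.
per_token_loss = np.array([[0.5, 1.5, 0.0, 0.0],
                           [1.0, 2.0, 3.0, 0.0]])
real = np.array([[11, 12, 0, 0],
                 [21, 22, 23, 0]])        # 0 is the pad id

mask = (real != 0).astype(per_token_loss.dtype)
masked = per_token_loss * mask

mean_over_all_positions = masked.mean()              # what reduce_mean computes
mean_over_real_tokens = masked.sum() / mask.sum()    # divides by 5, not 8

print(mean_over_all_positions)   # 1.0
print(mean_over_real_tokens)     # 1.6
```

With the first form the reported loss depends on how much padding a batch happens to contain. The second form is independent of the padding.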

Training and checkpointing

transformer = Transformer(
    num_layers, d_model, num_heads, dff,
    input_vocab_size, target_vocab_size, dropout_rate)

def create_masks(inp, tar):
    # Encoder padding mask
    enc_padding_mask = create_padding_mask(inp)

    # Used in the 2nd attention block in the decoder.
    # This padding mask is used to mask the encoder outputs.
    dec_padding_mask = create_padding_mask(inp)

    # Used in the 1st attention block in the decoder.
    # It is used to pad and mask future tokens in the input received by 
    # the decoder.
    look_ahead_mask = create_look_ahead_mask(tf.shape(tar)[1])
    dec_target_padding_mask = create_padding_mask(tar)
    combined_mask = tf.maximum(dec_target_padding_mask, look_ahead_mask)

    return enc_padding_mask, combined_mask, dec_padding_mask

Create the checkpoint path and the checkpoint manager. These are used to save a checkpoint every n epochs.

checkpoint_path = "./checkpoints/train"

ckpt = tf.train.Checkpoint(transformer=transformer, optimizer=optimizer)

ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)

# if a checkpoint exists, restore the latest checkpoint.
if ckpt_manager.latest_checkpoint:
    ckpt.restore(ckpt_manager.latest_checkpoint)
    print('Latest checkpoint restored!!')

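The target is divided into tar_inp and tar_real. tar_inp is passed as an input to the decoder. tar_real is that same input shifted by 1: at each location in tar_inp, tar_real contains the next token that should be predicted.
For example, with sentence = "SOS A lion in the jungle is sleeping EOS":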

tar_inp = "SOS A lion in the jungle is sleeping"
tar_real = "A lion in the jungle is sleeping EOS"
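The same split can be written with list slicing. This is a minimal plain-Python sketch in which whitespace-separated words stand in for token ids:

```python
sentence = "SOS A lion in the jungle is sleeping EOS".split()

tar_inp = sentence[:-1]    # what the decoder is fed
tar_real = sentence[1:]    # what the decoder should predict

for fed, expected in zip(tar_inp, tar_real):
    print(f"{fed:>8} -> {expected}")
```

Each printed pair is one training target: given everything up to and including the left-hand token, the model should predict the right-hand one. This is the same slicing that train_step applies with tar[:, :-1] and tar[:, 1:].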

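The Transformer is an auto-regressive model: it makes predictions one part at a time and uses its output so far to decide what to do next.
During training this example uses teacher forcing (as in the text generation tutorial). Teacher forcing means passing the true output to the next time step regardless of what the model predicts at the current time step.
As the Transformer predicts each word, self-attention allows it to look at the previous words in the input sequence to better predict the next word.
To prevent the model from peeking at the expected output, the model uses a look-ahead mask.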

EPOCHS = 20

# The @tf.function trace-compiles train_step into a TF graph for faster
# execution. The function specializes to the precise shape of the argument
# tensors. To avoid re-tracing due to the variable sequence lengths or variable
# batch sizes (the last batch is smaller), use input_signature to specify
# more generic shapes.

train_step_signature = [
    tf.TensorSpec(shape=(None, None), dtype=tf.int64),
    tf.TensorSpec(shape=(None, None), dtype=tf.int64),
]

@tf.function(input_signature=train_step_signature)
def train_step(inp, tar):
    tar_inp = tar[:, :-1]
    tar_real = tar[:, 1:]

    enc_padding_mask, combined_mask, dec_padding_mask = create_masks(inp, tar_inp)

    with tf.GradientTape() as tape:
        predictions, _ = transformer(
            inp, tar_inp,
            True,
            enc_padding_mask,
            combined_mask,
            dec_padding_mask)
        loss = loss_function(tar_real, predictions)

    gradients = tape.gradient(loss, transformer.trainable_variables)    
    optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))

    train_loss(loss)
    train_accuracy(tar_real, predictions)

Portuguese is used as the input language and English is the target language.

for epoch in range(EPOCHS):
    start = time.time()

    train_loss.reset_states()
    train_accuracy.reset_states()
  
    # inp -> portuguese, tar -> english
    for (batch, (inp, tar)) in enumerate(train_dataset):
        train_step(inp, tar)
    
        if batch % 500 == 0:
            print(f'Epoch {epoch + 1} Batch {batch} '
                  f'Loss {train_loss.result():.4f} '
                  f'Accuracy {train_accuracy.result():.4f}')
      
    if (epoch + 1) % 5 == 0:
        ckpt_save_path = ckpt_manager.save()
        print(f'Saving checkpoint for epoch {epoch + 1} at {ckpt_save_path}')
    
    print(f'Epoch {epoch + 1} Loss {train_loss.result():.4f} '
          f'Accuracy {train_accuracy.result():.4f}')

    print(f'Time taken for 1 epoch: {time.time() - start} secs\n')

Evaluate

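The following steps are used for evaluation:

Encode the input sentence using the Portuguese tokenizer (tokenizer_pt). Then add the start and end tokens so that the input is equivalent to what the model was trained with. This is the encoder input.
The decoder input is the start token == tokenizer_en.vocab_size.
Calculate the padding masks and the look-ahead masks.
The decoder then outputs the predictions by looking at the encoder output and its own output (self-attention).
Select the last word and calculate its argmax.
Concatenate the predicted word to the decoder input and pass it to the decoder.
In this approach, the decoder predicts the next word based on the previous words it predicted.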
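The steps above amount to a greedy decoding loop. This minimal plain-Python sketch shows only the control flow. The function `toy_next_token_scores` is a hypothetical stand-in for the trained model, and the token ids 100 and 101 are made-up start and end ids rather than the ones the tokenizers produce.

```python
START, END = 100, 101          # hypothetical start/end token ids
MAX_STEPS = 10

def toy_next_token_scores(encoder_input, decoder_input):
    """Hypothetical stand-in for the trained model.

    Returns one score per vocabulary id for the next position. This toy
    simply copies the encoder input token by token and then emits END.
    """
    scores = [0.0] * (END + 1)
    position = len(decoder_input) - 1            # tokens generated so far
    if position < len(encoder_input):
        scores[encoder_input[position]] = 1.0
    else:
        scores[END] = 1.0
    return scores

def greedy_decode(encoder_input):
    output = [START]                              # decoder input starts with START
    for _ in range(MAX_STEPS):
        scores = toy_next_token_scores(encoder_input, output)
        predicted_id = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if predicted_id == END:
            break
        output.append(predicted_id)               # feed the prediction back in
    return output

print(greedy_decode([5, 9, 2]))    # [100, 5, 9, 2]
```

The loop stops either when the end token is predicted or when the step limit is reached, which corresponds to MAX_LENGTH in the evaluate function below.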

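Note: The model used here has less capacity in order to keep the example relatively fast, so its predictions may be less accurate. To reproduce the results in the paper, change the hyperparameters above and use the entire dataset with the base Transformer model or Transformer XL.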

def evaluate(inp_sentence):
    start_token = [tokenizer_pt.vocab_size]
    end_token = [tokenizer_pt.vocab_size + 1]

    # inp sentence is portuguese, hence adding the start and end token
    inp_sentence = start_token + tokenizer_pt.encode(inp_sentence) + end_token
    encoder_input = tf.expand_dims(inp_sentence, 0)

    # as the target is english, the first word to the transformer should be the
    # english start token.
    decoder_input = [tokenizer_en.vocab_size]
    output = tf.expand_dims(decoder_input, 0)

    for i in range(MAX_LENGTH):
        enc_padding_mask, combined_mask, dec_padding_mask = create_masks(
            encoder_input, output)

        # predictions.shape == (batch_size, seq_len, vocab_size)
        predictions, attention_weights = transformer(
            encoder_input,
            output,
            False,
            enc_padding_mask,
            combined_mask,
            dec_padding_mask)

        # select the last word from the seq_len dimension
        predictions = predictions[: ,-1:, :]  # (batch_size, 1, vocab_size)

        predicted_id = tf.cast(tf.argmax(predictions, axis=-1), tf.int32)

        # return the result if the predicted_id is equal to the end token
        if tf.equal(predicted_id, tokenizer_en.vocab_size+1):
            return tf.squeeze(output, axis=0), attention_weights

        # concatenate the predicted_id to the output
        # which is given to the decoder as its input.
        output = tf.concat([output, predicted_id], axis=-1)

    return tf.squeeze(output, axis=0), attention_weights

def plot_attention_weights(attention, sentence, result, layer):
    fig = plt.figure(figsize=(16, 8))

    sentence = tokenizer_pt.encode(sentence)

    attention = tf.squeeze(attention[layer], axis=0)

    for head in range(attention.shape[0]):
        ax = fig.add_subplot(2, 4, head + 1)

        # plot the attention weights
        ax.matshow(attention[head][:-1, :], cmap='viridis')

        fontdict = {'fontsize': 10}

        ax.set_xticks(range(len(sentence) + 2))
        ax.set_yticks(range(len(result)))

        ax.set_ylim(len(result)-1.5, -0.5)

        ax.set_xticklabels(
            ['<start>']+[tokenizer_pt.decode([i]) for i in sentence]+['<end>'], 
            fontdict=fontdict, rotation=90)

        ax.set_yticklabels(
            [tokenizer_en.decode([i]) for i in result
             if i < tokenizer_en.vocab_size],
            fontdict=fontdict)

        ax.set_xlabel(f'Head {head + 1}')

    plt.tight_layout()
    plt.show()

def translate(sentence, plot=''):
    result, attention_weights = evaluate(sentence)

    predicted_sentence = tokenizer_en.decode(
        [i for i in result if i < tokenizer_en.vocab_size])  

    print(f'Input: {sentence}')
    print(f'Predicted translation: {predicted_sentence}')

    if plot:
        plot_attention_weights(attention_weights, sentence, result, plot)

translate("este é um problema que temos que resolver.")
print("Real translation: this is a problem we have to solve .")

translate("os meus vizinhos ouviram sobre esta ideia.")
print("Real translation: and my neighboring homes heard about this idea .")

translate("vou então muito rapidamente partilhar convosco algumas histórias de algumas coisas mágicas que aconteceram.")
print("Real translation: so i 'll just share with you some stories very quickly of some magical things that have happened .")

You can pass different layers and attention blocks of the decoder to the plot parameter.

translate("este é o primeiro livro que eu fiz.", plot='decoder_layer4_block2')
print("Real translation: this is the first book i've ever done.")

Summary

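In this tutorial, you learned about positional encoding, multi-head attention, the importance of masking, and how to create a Transformer.
Try using a different dataset to train the Transformer. You can also create the base Transformer or Transformer XL by changing the hyperparameters above. You can also use the layers defined here to create BERT and train state-of-the-art models. Furthermore, you can implement beam search to get better predictions.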

Closing remarks

I changed the code slightly (line breaks, f-strings, and so on), but I hope that at least the Japanese translation is useful.
