I want to mask vector-valued tokens in TensorFlow

Last updated at 2022-11-26, posted at 2022-10-10

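Introduction

TensorFlow provides a masking mechanism so that time-series data of varying lengths can be handled.

The case in the reference link deals with sequence data like the following, obtained after converting strings to indices.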

[
  [71, 1331, 4231],
  [73, 8, 3215, 55, 927],
  [83, 91, 1, 645, 1253, 927],
]

In this example, each element (token) of a sequence is a single number.
However, the data I want to handle this time looks, for example, like this:

[
    [[323,2], [52,0], [1,26]],
    [[727,2], [1,2], [131,2], [93,0], [867,220]],
    [[523,2], [764,2], [52,2], [0,2], [111,58], [1,242]],
]

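Here, each token in a sequence is a vector.
I looked into how to mask data like this token by token.

Experiment

First, following the reference link, I mask the example above.
The procedure appears to be: 1. padding, then 2. masking.

I will work through it while modifying the source code in places.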

import tensorflow as tf

raw_inputs = [
    [[323,2], [52,0], [1,26]],
    [[727,2], [1,2], [131,2], [93,0], [867,220]],
    [[523,2], [764,2], [52,2], [0,2], [111,58], [1,242]],
]

padded_inputs = tf.keras.preprocessing.sequence.pad_sequences(
    raw_inputs, padding="post"
)
print(padded_inputs)

# Output (reformatted for readability)
[[[323   2], [ 52   0], [  1  26], [  0   0], [  0   0], [  0   0]]
 [[727   2], [  1   2], [131   2], [ 93   0], [867 220], [  0   0]]
 [[523   2], [764   2], [ 52   2], [  0   2], [111  58], [  1 242]]]

Padding filled in tokens (vectors of [0, 0]) so that all sequences have the same length.
The padding looks fine.
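
To make the padding step concrete, here is a minimal NumPy sketch of what post-padding does when every token is a vector. The helper name `pad_post` is hypothetical and is not part of TensorFlow; it only mimics the behavior seen in the output above, assuming all tokens have the same size.

```python
import numpy as np

def pad_post(sequences, value=0):
    """Pad variable-length sequences of equal-sized vector tokens at the end.

    Hypothetical helper for illustration only (not a TensorFlow API).
    """
    max_len = max(len(seq) for seq in sequences)
    token_dim = len(sequences[0][0])
    # Start from an array filled with the padding value, then copy each sequence in.
    padded = np.full((len(sequences), max_len, token_dim), value, dtype=np.int32)
    for i, seq in enumerate(sequences):
        padded[i, :len(seq)] = seq
    return padded

raw_inputs = [
    [[323, 2], [52, 0], [1, 26]],
    [[727, 2], [1, 2], [131, 2], [93, 0], [867, 220]],
    [[523, 2], [764, 2], [52, 2], [0, 2], [111, 58], [1, 242]],
]

padded = pad_post(raw_inputs)
print(padded.shape)            # (3, 6, 2)
print(padded[0, 3:].tolist())  # [[0, 0], [0, 0], [0, 0]]
```

The shorter sequences receive trailing [0, 0] tokens, and the longest sequence is left unchanged.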

Now let's look at masking.

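from tensorflow.keras import layers

masking_layer = layers.Masking()
# I don't fully understand the following lines.
# expand_dims + tile turn the (3, 6, 2) input into shape (3, 6, 2, 2).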
unmasked_embedding = tf.cast(
    tf.tile(tf.expand_dims(padded_inputs, axis=-1), [1, 1, 1, padded_inputs.shape[-1]]),
    tf.float32
)

masked_embedding = masking_layer(unmasked_embedding)
print(masked_embedding._keras_mask)

# Output (reformatted for readability)
tf.Tensor(
[[[ True  True],   [ True False],   [ True  True],   [False False],   [False False],   [False False]]
 [[ True  True],   [ True  True],   [ True  True],   [ True False],   [ True  True],   [False False]]
 [[ True  True],   [ True  True],   [ True  True],   [False  True],   [ True  True],   [ True  True]]
], shape=(3, 6, 2), dtype=bool)

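At first glance this looks fine, but there is a problem.

Look at the second token from the left in the first sequence: its mask is [True, False].
Used as is, the masking here is evaluated element by element, so a 0 inserted by padding cannot be distinguished from a meaningful 0 that was in the data from the start. That is why this happens.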
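
The shapes explain this result. The sketch below reproduces the computation in NumPy, assuming the `Masking` layer marks a position as valid when any value along the last axis differs from the mask value (0 by default), as its documentation describes. After `expand_dims` and `tile`, the last axis holds copies of a single element, so the check runs per element rather than per token.

```python
import numpy as np

# First padded sequence from the example: 3 real tokens, 3 padding tokens.
padded = np.array([[[323, 2], [52, 0], [1, 26], [0, 0], [0, 0], [0, 0]]])

# Same reshaping as expand_dims + tile: (1, 6, 2) -> (1, 6, 2, 2).
tiled = np.tile(padded[..., np.newaxis], (1, 1, 1, padded.shape[-1]))
print(tiled.shape)  # (1, 6, 2, 2)

# Assumed Masking rule: valid if any value on the last axis is non-zero.
elementwise_mask = np.any(tiled != 0, axis=-1)
print(elementwise_mask.shape)           # (1, 6, 2)
print(elementwise_mask[0, 1].tolist())  # [True, False]: the real 0 in [52, 0] is masked
```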

Is there some way to mask only the tokens that were inserted by padding?

Improvement

The tokens inserted by padding are vectors whose elements are all 0.
So it should be enough to detect vectors that are all zeros.

For that, I decided to use NumPy's any function as follows.

import numpy as np

np.any(padded_inputs, axis=2)

# Output
array([[ True,  True,  True, False, False, False],
       [ True,  True,  True,  True,  True, False],
       [ True,  True,  True,  True,  True,  True]])

This gives a token-level mask (the same as the mask result in the reference link).
Since each token here is a vector of size 2, I add one axis and expand the data.

c = np.any(padded_inputs, axis=2)
tf.cast(tf.tile(tf.expand_dims(c, axis=-1), [1, 1, padded_inputs.shape[-1]]), tf.bool)

# Output
<tf.Tensor: shape=(3, 6, 2), dtype=bool, numpy=
array([
[[ True,  True], [ True,  True], [ True,  True], [False, False], [False, False], [False, False]], 
[[ True,  True], [ True,  True], [ True,  True], [ True,  True], [ True,  True], [False, False]], 
[[ True,  True], [ True,  True], [ True,  True], [ True,  True], [ True,  True], [ True,  True]]
])>

This time, only the padded items appear to be masked correctly.
Note: if there is a simpler way, please let me know.
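
One possibly simpler route, sketched here under the assumption that `layers.Masking` reduces over the last axis only: pass the padded `(3, 6, 2)` tensor to the layer directly, without the `expand_dims`/`tile` step. The padded values are written out by hand so the snippet does not depend on the earlier code, and `compute_mask` is called explicitly instead of reading `_keras_mask`.

```python
import numpy as np
import tensorflow as tf

padded_inputs = np.array([
    [[323, 2], [52, 0], [1, 26], [0, 0], [0, 0], [0, 0]],
    [[727, 2], [1, 2], [131, 2], [93, 0], [867, 220], [0, 0]],
    [[523, 2], [764, 2], [52, 2], [0, 2], [111, 58], [1, 242]],
], dtype="float32")

masking_layer = tf.keras.layers.Masking(mask_value=0.0)
# A token is masked only when every component equals mask_value.
token_mask = masking_layer.compute_mask(tf.constant(padded_inputs))
print(token_mask.shape)  # (3, 6)
print(token_mask.numpy().tolist())
```

If this holds in your TensorFlow version, the result matches `np.any(padded_inputs, axis=2)`. Its `(batch, timesteps)` shape is also the form shown in the reference link, so the extra tiling to `(3, 6, 2)` may not be needed.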

Finished version

While I was at it, I wrote a custom layer that combines padding and masking.

class MyPaddingMaskingLayer(layers.Layer):
    def call(self, inputs):
        # Pad every sequence at the end to the length of the longest one.
        padded_inputs = tf.keras.preprocessing.sequence.pad_sequences(
            inputs, padding="post"
        )
        return padded_inputs

    def compute_mask(self, inputs, mask=None):
        padded_inputs = tf.keras.preprocessing.sequence.pad_sequences(
            inputs, padding="post"
        )
        # True for tokens with at least one non-zero component.
        c = np.any(padded_inputs, axis=2)
        # Repeat the token-level mask across the vector dimension.
        mask = tf.cast(tf.tile(tf.expand_dims(c, axis=-1), [1, 1, padded_inputs.shape[-1]]), tf.bool)
        return mask

Let's give this layer the data before padding.

raw_inputs = [
    [[323,2], [52,0], [1,26]],
    [[727,2], [1,2], [131,2], [93,0], [867,220]],
    [[523,2], [764,2], [52,2], [0,2], [111,58], [1,242]],
]
padmask_layer = MyPaddingMaskingLayer()

masked_embedding = padmask_layer(raw_inputs)
print(padmask_layer.compute_mask(masked_embedding))

# Output
<tf.Tensor: shape=(3, 6, 2), dtype=bool, numpy=
array([
[[ True,  True], [ True,  True], [ True,  True], [False, False], [False, False], [False, False]], 
[[ True,  True], [ True,  True], [ True,  True], [ True,  True], [ True,  True], [False, False]], 
[[ True,  True], [ True,  True], [ True,  True], [ True,  True], [ True,  True], [ True,  True]]
])>

compute_mask seems to return the correct result, so this looks usable.
I haven't tried it on real data yet, so I plan to experiment with that later.
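
One case worth checking on real data: a genuine token that happens to be `[0, 0]` would be indistinguishable from padding under the all-zero rule. A minimal NumPy sketch of an alternative, assuming the original sequence lengths are still available before padding, derives the mask from those lengths instead of from the values. The sample data and variable names are illustrative.

```python
import numpy as np

# The second sequence contains a real [0, 0] token at position 1.
raw_inputs = [
    [[323, 2], [52, 0], [1, 26]],
    [[727, 2], [0, 0], [131, 2], [93, 0], [867, 220]],
]

lengths = np.array([len(seq) for seq in raw_inputs])
max_len = lengths.max()

# Position t of sequence i is valid when t < lengths[i].
length_mask = np.arange(max_len)[np.newaxis, :] < lengths[:, np.newaxis]
print(length_mask.tolist())
# [[True, True, True, False, False], [True, True, True, True, True]]
```

Here the real `[0, 0]` token stays unmasked, because the mask depends only on where each sequence ends.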
