
Tips for Handling Variable-Length Input in Deep Learning Frameworks

Posted at 2017-08-26, last updated at 2017-10-31

Introduction

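There are a few recurring patterns for working with variable-length matrices in natural language processing and similar fields. I feel like I reimplement them every time, so I am summarizing them here as a memo.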

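This article shows implementations in Chainer and Tensorflow, the two frameworks I use most.
(Note: this code was not copied from production. It was reimplemented from scratch for this post and has not been tested.)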

Notes on Chainer

Chainer seems to recommend managing variable-length data as a list of Variables rather than as Variable + length, which is the approach used below. Concrete examples are L.NStepLSTM and F.pad_sequence.

Note

The code below assumes that the following imports have been made.

Chainer

import chainer
import chainer.functions as F
import numpy as np

Tensorflow

import tensorflow as tf
import numpy as np


sess = tf.InteractiveSession()

Main text

Padding

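Many deep learning frameworks do not directly support computation on variable-length matrices, because they rely on the parallelism of GPUs and CPUs. Instead, we pad: every sequence is extended to the maximum length, and the positions beyond the actual sequence length are filled with some value.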

Note that this step is often done by hand during data preparation rather than inside the deep learning framework.

X = [np.array([1, 2]),
     np.array([11, 12, 13, 14]),
     np.array([21])]

# int32, assuming the values are word IDs
x = np.zeros([3, 4], dtype=np.int32)

for i, xi in enumerate(X):
    x[i, :len(xi)] = xi

print(x)
# [[ 1  2  0  0]
#  [11 12 13 14]
#  [21  0  0  0]]

When using Chainer's L.EmbedID, it is better to pad with -1 instead of 0 and use L.EmbedID(..., ignore_label=-1).
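
The padding loop above can be wrapped into a small reusable helper. The following is a minimal NumPy-only sketch, not part of the original article: the name `pad_sequences` and its `pad_value` parameter are hypothetical, and returning the lengths alongside the padded array is an assumption made so that the masking steps later on have what they need.

```python
import numpy as np


def pad_sequences(seqs, pad_value=0, dtype=np.int32):
    """Pad a list of 1-D sequences into a (batch, max_len) array.

    Hypothetical helper for illustration. Returns the padded array
    together with the original length of every sequence.
    """
    lengths = np.array([len(s) for s in seqs], dtype=np.int32)
    max_len = int(lengths.max()) if len(seqs) > 0 else 0
    # Start from an array filled with the padding value, then copy data in.
    padded = np.full((len(seqs), max_len), pad_value, dtype=dtype)
    for i, s in enumerate(seqs):
        padded[i, :len(s)] = s
    return padded, lengths


X = [np.array([1, 2]), np.array([11, 12, 13, 14]), np.array([21])]

# Pad with -1 so that the result matches ignore_label=-1.
x, length = pad_sequences(X, pad_value=-1)
print(x)
# [[ 1  2 -1 -1]
#  [11 12 13 14]
#  [21 -1 -1 -1]]
print(length)
# [2 4 1]
```

Keeping `length` next to the padded array is what makes the Variable + length style of the following sections possible.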

Masking

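When doing sum pooling and the like, the out-of-sequence positions created by the padding above are masked with 0. (Do not trust masking too much, though; see the Appendix.) This kind of computation can be implemented with where.

(where picks the first value where the condition is True and the second value where it is False, which is what makes it work as a mask.)

Written step by step: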

Chainer

x = chainer.Variable(np.arange(1, 7).reshape(2, 3))
print(x)
# variable([[1 2 3]
#           [4 5 6]])

length = np.array([3, 2], dtype=np.int32)
print(length)
# [3 2]

xp = chainer.cuda.get_array_module(x.data)
mask = xp.tile(xp.arange(x.shape[-1]).reshape(1, -1), (x.shape[0], 1))
print(mask)
# [[0 1 2]
#  [0 1 2]]

mask = mask < length.reshape(-1, 1)
print(mask)
# [[ True  True  True]
#  [ True  True False]]

padding = xp.zeros(x.shape, dtype=x.dtype)
print(padding)
# [[0 0 0]
#  [0 0 0]]

z = F.where(mask, x, padding)
print(z)
# variable([[1 2 3]
#           [4 5 0]])

In Tensorflow, sequence_mask is convenient.

Tensorflow

x = tf.constant(np.arange(1, 7).reshape(2, 3).astype(np.float32))
length = tf.constant(np.array([3, 2], dtype=np.int32))

mask = tf.sequence_mask(length, tf.shape(x)[-1])
padding = tf.fill(tf.shape(x), 0.0)
z = tf.where(mask, x, padding)
print(z.eval())
# [[ 1.  2.  3.]
#  [ 4.  5.  0.]]

A Chainer (or rather, NumPy) version of sequence_mask

Chainer

def sequence_mask(length, max_num=None):
    # length: numpy/cupy integer array of sequence lengths
    xp = chainer.cuda.get_array_module(length)
    if max_num is None:
        max_num = int(xp.max(length))
    # positions 0 .. max_num - 1, compared against length with a new
    # trailing axis; the result has shape length.shape + (max_num,)
    perms = xp.arange(max_num)
    return perms < length[..., None]
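
To make the sum pooling use case mentioned at the start of this section concrete, here is a minimal NumPy-only sketch, not taken from the original article. It restates `sequence_mask` in plain NumPy so that it runs without Chainer, and the example tensor and the mean pooling step are assumptions added for illustration.

```python
import numpy as np


def sequence_mask(length, max_num=None):
    """NumPy-only mask: True where position < length."""
    length = np.asarray(length)
    if max_num is None:
        max_num = int(length.max())
    return np.arange(max_num) < length[..., None]


# (batch=2, time=3, feature=2); the second sequence has only 2 valid steps.
x = np.arange(12, dtype=np.float32).reshape(2, 3, 2)
length = np.array([3, 2], dtype=np.int32)

mask = sequence_mask(length, x.shape[1])      # shape (2, 3)
masked = np.where(mask[..., None], x, 0.0)    # zero out the padded steps
sum_pooled = masked.sum(axis=1)               # shape (2, 2)
mean_pooled = sum_pooled / length[:, None]    # divide by the true lengths

print(sum_pooled)   # rows: [6, 9] and [14, 16]
print(mean_pooled)  # rows: [2, 3] and [7, 8]
```

Dividing by `length` rather than by the padded size is the point of carrying the lengths around: a plain `mean(axis=1)` would count the padded steps as well.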

Reshape

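In deep learning we often handle rank-2 matrices of shape minibatch size × features, so many frameworks provide plenty of functions that take such matrices as input. To benefit from those functions, a minibatch size × sequence length × features tensor is reshaped into a rank-2 matrix of shape (minibatch size * sequence length) × features before processing.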

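However, when the tensor is relatively sparse (that is, mostly padding), the computation spent on padded positions is wasted. With some indexing effort, the amount of work can be reduced.
(I have not tested this, but when the tensor is not sparse, the extra memory allocation may actually make things slower, so be careful.)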

In Chainer, this can be done as follows.

Chainer

# WARNING: not checked for inputs whose rank is not 3

x = chainer.Variable(np.arange(18).astype(np.float32).reshape(3, 3, 2))
length = np.array([2, 3, 1], dtype=np.int32)
w = chainer.Variable(np.ones([2, 3], dtype=np.float32))

# sequence_mask is defined above
mask = sequence_mask(length, x.shape[length.ndim])
print(mask)
# [[ True  True False]
#  [ True  True  True]
#  [ True False False]]

x_reshaped = F.get_item(x, mask)
print(x_reshaped)
# [[  0.   1.]
#  [  2.   3.]
#  [  6.   7.]
#  [  8.   9.]
#  [ 10.  11.]
#  [ 12.  13.]]

y_reshaped = F.matmul(x_reshaped, w)
print(y_reshaped)
# [[  1.   1.   1.]
#  [  5.   5.   5.]
#  [ 13.  13.  13.]
#  [ 17.  17.  17.]
#  [ 21.  21.  21.]
#  [ 25.  25.  25.]]

# append one all-zero row; index -1 below points at it
pad_shape = [[0, 0] for _ in range(y_reshaped.ndim)]
pad_shape[length.ndim - 1][1] = 1
y_reshaped = F.pad(y_reshaped, pad_shape, 'constant', constant_values=0.)
print(y_reshaped)
# variable([[  1.,   1.,   1.],
#           [  5.,   5.,   5.],
#           [ 13.,  13.,  13.],
#           [ 17.,  17.,  17.],
#           [ 21.,  21.,  21.],
#           [ 25.,  25.,  25.],
#           [  0.,   0.,   0.]])


idx_size = np.prod(mask.shape)
inv_idx = np.ones([idx_size], dtype=np.int32) * -1
inv_idx[np.nonzero(mask.flat)[0]] = np.arange(x_reshaped.shape[0]).astype(np.int32)
print(inv_idx)
# [ 0  1 -1  2  3  4  5 -1 -1]

y = F.reshape(F.get_item(y_reshaped, inv_idx), list(x.shape[:length.ndim + 1]) + [-1])
print(y)
# [[[  1.   1.   1.]
#   [  5.   5.   5.]
#   [  0.   0.   0.]]
# 
#  [[ 13.  13.  13.]
#   [ 17.  17.  17.]
#   [ 21.  21.  21.]]
# 
#  [[ 25.  25.  25.]
#   [  0.   0.   0.]
#   [  0.   0.   0.]]]

In Tensorflow, this can be done as follows.

Tensorflow

# WARNING: not checked for inputs whose rank is not 3
x = tf.constant(np.arange(18).astype(np.float32).reshape(3, 3, 2))
length = tf.constant(np.array([2, 3, 1], dtype=np.int32))
w = tf.constant(np.ones([2, 3], dtype=np.float32))

mask = tf.sequence_mask(length, tf.shape(x)[tf.rank(length)])
print(mask.eval())
# [[ True  True False]
#  [ True  True  True]
#  [ True False False]]

x_reshaped = tf.boolean_mask(x, mask)
print(x_reshaped.eval())
# [[  0.   1.]
#  [  2.   3.]
#  [  6.   7.]
#  [  8.   9.]
#  [ 10.  11.]
#  [ 12.  13.]]

y_reshaped = tf.matmul(x_reshaped, w)
print(y_reshaped.eval())
# [[  1.   1.   1.]
#  [  5.   5.   5.]
#  [ 13.  13.  13.]
#  [ 17.  17.  17.]
#  [ 21.  21.  21.]
#  [ 25.  25.  25.]]

idx = tf.to_int32(tf.where(mask))
print(idx.eval())
# [[0 0]
#  [0 1]
#  [1 0]
#  [1 1]
#  [1 2]
#  [2 0]]

shape = tf.concat([tf.shape(x)[:-1], tf.shape(y_reshaped)[-1:]], 0)
print(shape.eval())
# [3 3 3]

y = tf.scatter_nd(idx, y_reshaped, shape)
print(y.eval())
# [[[  1.   1.   1.]
#   [  5.   5.   5.]
#   [  0.   0.   0.]]
# 
#  [[ 13.  13.  13.]
#   [ 17.  17.  17.]
#   [ 21.  21.  21.]]
# 
#  [[ 25.  25.  25.]
#   [  0.   0.   0.]
#   [  0.   0.   0.]]]
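
The same pack, compute, and scatter-back idea can be checked without either framework. The following is a minimal NumPy-only sketch, not from the original article, that reuses the example values above. The names `x_packed` and `y_packed` are hypothetical, and boolean-mask assignment stands in for the `inv_idx` and `scatter_nd` steps.

```python
import numpy as np

x = np.arange(18, dtype=np.float32).reshape(3, 3, 2)  # (batch, time, feature)
length = np.array([2, 3, 1], dtype=np.int32)
w = np.ones([2, 3], dtype=np.float32)

# (3, 3) validity mask: True where the time step is inside the sequence.
mask = np.arange(x.shape[1]) < length[:, None]

x_packed = x[mask]          # (6, 2): only the valid time steps
y_packed = x_packed.dot(w)  # (6, 3): the matmul runs on 6 rows instead of 9

# Scatter the results back; padded positions stay zero.
y = np.zeros(x.shape[:-1] + (w.shape[1],), dtype=y_packed.dtype)
y[mask] = y_packed

print(y.shape)  # (3, 3, 3)
print(y[0])     # rows: [1, 1, 1], [5, 5, 5], [0, 0, 0]
```

Boolean indexing visits elements in row-major order both when reading `x[mask]` and when writing `y[mask]`, which is why the packed rows land back in their original positions.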

Implementing softmax

Consider applying softmax along the last axis of a given matrix. This situation arises in the permutation probability distribution of ListNet and in attention computations.

The softmax formula:
$$
y_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}
$$

x = np.random.random([2, 3]).astype(np.float32)
# array([[ 0.44715771,  0.85983515,  0.08915455],
#        [ 0.02465274,  0.63411605,  0.01340247]], dtype=float32)

length = np.array([3, 2], dtype=np.int32)

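We want to compute softmax using only the valid region (highlighted in blue in the original figure): with the lengths above, that is all of row 0 and the first two elements of row 1.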

Incidentally, applying a mask before or after the softmax does not work.

Chainer

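# Bad example 1: zero out the padded input before softmax
x_ = np.copy(x)
x_[1, 2] = 0.
print(F.softmax(x_))
# variable([[ 0.31153342,  0.47068265,  0.21778394],
#           [ 0.26211682,  0.48214924,  0.25573397]])

# Bad example 2: zero out the padded output after softmax
y = F.softmax(x)
y.data[1, 2] = 0.  # a Variable has no item assignment; write into its array
print(y)
# variable([[ 0.31153342,  0.47068265,  0.21778394],
#           [ 0.26121548,  0.48049128,  0.0       ]])
# Clearly wrong: the second row does not sum to 1.0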

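The reason is very simple. In example 1, $\exp(0) = 1 \neq 0$, so the zeroed element still receives probability mass. In example 2, x[1, 2] still contributes to the denominator.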

For softmax, masking is done by exploiting the fact that $\exp(-\infty) = 0$.

Chainer

def masked_softmax(x, length):
    """
    Softmax over the last dimension of x.

    Args:
         x (chainer.Variable): Values to be passed to softmax
         length (numpy.ndarray or cupy.ndarray):
             Number of valid items along the last dimension of x
    """
    assert x.ndim - 1 == length.ndim
    xp = chainer.cuda.get_array_module(x.data)
    x_shape = x.shape
    x = F.reshape(x, (-1, x_shape[-1]))
    # mask: (B, T)
    mask = xp.tile(xp.arange(x.shape[-1]).reshape(1, -1), (x.shape[0], 1))
    mask = mask < length.reshape(-1, 1)
    padding = xp.ones(x.shape, dtype=x.dtype) * -np.inf
    z = F.where(mask, x, padding)
    return F.reshape(F.softmax(z), x_shape)


print(masked_softmax(chainer.Variable(x), length))
# variable([[ 0.31153342,  0.47068265,  0.21778394],
#           [ 0.35218161,  0.64781839,  0.        ]])

Tensorflow

def masked_softmax(x, length):
    """
    Softmax over the last dimension of x.

    Args:
         x (tf.Tensor): Values to be passed to softmax
         length (tf.Tensor): Number of valid items along the last dimension of x
    """
    mask = tf.sequence_mask(length, tf.shape(x)[-1])
    padding = tf.fill(tf.shape(x), -np.inf)
    z = tf.where(mask, x, padding)
    return tf.nn.softmax(z, dim=-1)


print(masked_softmax(
    tf.constant(x),
    tf.constant(length)).eval())
# [[ 0.31153342,  0.47068265,  0.21778394],
#  [ 0.35218161,  0.64781839,  0.        ]]
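
For reference, the $\exp(-\infty) = 0$ trick can be reproduced in plain NumPy. This is a minimal sketch under stated assumptions rather than the article's implementation: subtracting the row maximum is an addition for numerical stability, and the function assumes every length is at least 1.

```python
import numpy as np


def masked_softmax(x, length):
    """Softmax over the last axis, ignoring positions >= length.

    Illustrative NumPy sketch; assumes every entry of length is >= 1.
    """
    mask = np.arange(x.shape[-1]) < length[..., None]
    z = np.where(mask, x, -np.inf)         # padded positions become -inf
    z = z - z.max(axis=-1, keepdims=True)  # stabilise; the row max is finite
    e = np.exp(z)                          # exp(-inf) == 0 at padded positions
    return e / e.sum(axis=-1, keepdims=True)


x = np.array([[0.44715771, 0.85983515, 0.08915455],
              [0.02465274, 0.63411605, 0.01340247]], dtype=np.float32)
length = np.array([3, 2], dtype=np.int32)

y = masked_softmax(x, length)
print(y)              # second row is approximately [0.352, 0.648, 0.0]
print(y.sum(axis=1))  # every row sums to 1
```

If a row had length 0, every entry would be `-inf` and the result would be `nan`, so such rows need separate handling.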

Appendix:

Do not over-trust masking

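In deep learning frameworks, once a division by zero occurs, the gradient becomes inf or nan even if where is used. So the attitude of "an unstable computation is fine as long as I mask it afterwards" does not work.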

Suppose we have a network like the following.

$$
e = f_0(x) \\
w = f_1(e)
$$

By the chain rule, this gives:
$$
\frac{\partial w}{\partial x} = \frac{\partial w}{\partial e}\frac{\partial e}{\partial x}
$$

In automatic differentiation, this is (roughly) implemented as follows.

x.grad = e.grad * g(f_0, e, x)

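Here, g(f_0, e, x) is the partial derivative expressed in terms of $f_0$ and its input and output. In other words, whatever gradient e.grad arrives from upstream, x.grad becomes inf or nan if the partial derivative of $f_0$ is inf or nan. Trying this in Tensorflow and Chainer: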

Tensorflow

sess = tf.InteractiveSession()

x = tf.constant(0.0)

t = x
e = 1. / x
w = tf.where(True, t, e)

print(w.eval())  # 0.0
print(tf.gradients(w, x)[0].eval())  # nan

Chainer

x = chainer.Variable(np.array([0.0], dtype=np.float32))
t = x
e = 1. / x
w = chainer.functions.where(np.array([True]), t, e)

w.grad = np.array([1.0], np.float32)
w.backward(retain_grad=True)

print(w)  # 0.
print(x.grad)  # nan
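
The nan can be traced by hand with the chain rule. The following NumPy-only sketch is not from the original article. It assumes that where passes a gradient of 0 to the branch it did not select and multiplies it by the local derivative of 1/x. The `x_safe` substitution at the end is a commonly used workaround shown as an assumption, not something the article proposes.

```python
import numpy as np

x = np.float32(0.0)
# Gradient reaching e = 1/x: where() did not select this branch, so it is 0.
upstream = np.float32(0.0)

with np.errstate(divide="ignore", invalid="ignore"):
    local = -np.float32(1.0) / (x * x)  # d(1/x)/dx at x = 0 is -inf
    grad_naive = upstream * local       # 0 * -inf is nan

print(local)       # -inf
print(grad_naive)  # nan

# Assumed workaround: replace the input with a harmless value *before* the
# unstable op at the positions that will be masked out anyway.
x_safe = np.float32(1.0) if x == 0 else x
local_safe = -np.float32(1.0) / (x_safe * x_safe)  # finite: -1
grad_safe = upstream * local_safe                  # 0 * -1 is zero

print(grad_safe == 0)  # True
```

Because the gradients of both branches are summed into x.grad, a single nan contribution is enough to make the whole gradient nan, which matches the framework outputs above.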
