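Introduction

When training a classifier, there are broadly two ways to mitigate class imbalance in the dataset.
* Rebalance the training data itself by adjusting the number of samples per class
* Control the values that are backpropagated during training

This post summarizes what I wanted to understand when taking the second approach with Chainer.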

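Motivation

Concretely, the softmax_cross_entropy function takes an argument called class_weight, which lets you change how strongly each class is trained.
For example, in two-class classification you can train class '1' twice as strongly as class '0'.
But when a class is given a weight of 2, in what sense is it learned "twice as strongly"? I wondered about that, so I looked into it.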

Environment

Mac OS X 10.10.5 (Yosemite)
Python 2.7.13
numpy @1.12.1
Chainer 1.24.0

Effect on the loss

First, let's read the Chainer documentation.

chainer.functions.softmax_cross_entropy(x, t, normalize=True, cache_score=True, class_weight=None, ignore_label=-1, reduce='mean')
...
・ class_weight (numpy.ndarray or cupy.ndarray) – An array that contains constant weights that will be multiplied with the loss values along with the second dimension. The shape of this array should be (x.shape[1],). If this is not None, each class weight class_weight[i] is actually multiplied to y[:, i] that is the corresponding log-softmax output of x and has the same shape as x before calculating the actual loss value.

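In other words, the weights are broadcast to the shape of x and multiplied into $\log{(Softmax(x))}$, which is computed before the loss itself. I see.

Now let's check at exactly which step class_weight is multiplied in.
First, look at the forward function in softmax_cross_entropy.py.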

chainer/functions/loss/softmax_cross_entropy.py

    def forward_cpu(self, inputs):
        x, t = inputs
        if chainer.is_debug():
            self._check_input_values(x, t)

        log_y = log_softmax._log_softmax(x, self.use_cudnn)
        if self.cache_score:
            self.y = numpy.exp(log_y)
        if self.class_weight is not None:
            shape = [1 if d != 1 else -1 for d in six.moves.range(x.ndim)]
            log_y *= _broadcast_to(self.class_weight.reshape(shape), x.shape)
        log_yd = numpy.rollaxis(log_y, 1)
        log_yd = log_yd.reshape(len(log_yd), -1)
        log_p = log_yd[numpy.maximum(t.ravel(), 0), numpy.arange(t.size)]

        log_p *= (t.ravel() != self.ignore_label)
        if self.reduce == 'mean':
            # deal with the case where the SoftmaxCrossEntropy is
            # unpickled from the old version
            if self.normalize:
                count = (t != self.ignore_label).sum()
            else:
                count = len(x)
            self._coeff = 1.0 / max(count, 1)

            y = log_p.sum(keepdims=True) * (-self._coeff)
            return y.reshape(()),
        else:
            return -log_p.reshape(t.shape),
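
The `shape` list built in the class_weight branch of forward_cpu is easy to misread, so here is a minimal sketch that evaluates the same list comprehension on its own. The helper name `class_weight_shape` is made up for illustration, and the built-in `range` stands in for `six.moves.range`.

```python
def class_weight_shape(ndim):
    """Rebuild the shape list that forward_cpu passes to class_weight.reshape()."""
    # Axis 1 (the class axis) gets -1 so reshape infers the number of classes;
    # every other axis gets length 1 so it can be broadcast against x.shape.
    return [1 if d != 1 else -1 for d in range(ndim)]


print(class_weight_shape(2))  # [1, -1]
print(class_weight_shape(4))  # [1, -1, 1, 1]
```

So for a 2-D input of shape (batch, classes), class_weight is reshaped to (1, classes) and then repeated along the batch axis.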

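The point to note is line 11 of forward_cpu, where class_weight is broadcast to the shape of x and multiplied into the result of log(Softmax(x)).
log_y *= _broadcast_to(self.class_weight.reshape(shape), x.shape)
In other words, in the cross-entropy loss $L = -\sum_{k} t_{k} \log{(Softmax(x)_{k})}$, each $\log{(Softmax(x)_{k})}$ is multiplied by its weight before the terms are summed.
Here $k$ is the class index and $t$ is the one-hot target.
Written as a formula, the computation is $L = -\sum_{k} t_{k} ClassWeight_{k} \log{(Softmax(x)_{k})}$.
This matches the documentation.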
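
To make the weighted formula concrete, here is a minimal sketch for a single sample that uses only the Python standard library, so it runs without Chainer. The function names are my own, not Chainer's, and the code only mirrors the order of operations in forward_cpu (scale the log-softmax per class, then pick the target entry).

```python
import math


def log_softmax(x):
    """Return log(softmax(x)) for a list of scores, computed stably."""
    m = max(x)
    log_z = m + math.log(sum(math.exp(v - m) for v in x))
    return [v - log_z for v in x]


def weighted_cross_entropy(x, t, class_weight=None):
    """Loss for one sample with target class index t."""
    log_y = log_softmax(x)
    if class_weight is not None:
        # Scale every class entry first, as forward_cpu does.
        log_y = [w * v for w, v in zip(class_weight, log_y)]
    # The one-hot target keeps only the entry of the true class.
    return -log_y[t]


x = [1.0, 0.0]
print(weighted_cross_entropy(x, 1))              # about 1.3133
print(weighted_cross_entropy(x, 1, [1.0, 2.0]))  # about 2.6265
```

Because the target is one-hot, only the weight of the true class matters for a given sample; changing the weight of class '0' leaves the loss of a class '1' sample untouched.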

To confirm this, let's experiment interactively.

>>> import numpy as np
>>> import chainer
>>> x = np.array([[1, 0]]).astype(np.float32)
>>> t = np.array([1]).astype(np.int32)
>>> # train class '1' with twice the weight
>>> cw = np.array([1, 2]).astype(np.float32)
>>> sce_nonweight = chainer.functions.loss.softmax_cross_entropy.SoftmaxCrossEntropy()
>>> sce_withweight = chainer.functions.loss.softmax_cross_entropy.SoftmaxCrossEntropy(class_weight=cw)
>>> loss_nonweight = sce_nonweight(x, t)
>>> loss_withweight = sce_withweight(x, t)
>>> loss_nonweight.data
array(1.31326162815094, dtype=float32)
>>> loss_withweight.data
array(2.62652325630188, dtype=float32)

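The loss value is indeed doubled.

So what we know so far is that the class_weight weighting is reflected directly in the output loss: each sample's loss is multiplied by the weight of its target class.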

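Effect on backpropagation

So what effect does the weight have during training, that is, in backpropagation?
What I want to check is what happens to $y-t$, the value backpropagated from softmax_cross_entropy, where $y$ is the softmax output and $t$ is the one-hot target.
My guess is that $y-t$ is simply multiplied by the weight, but let's check Chainer's implementation first.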

chainer/functions/loss/softmax_cross_entropy.py

    def backward_cpu(self, inputs, grad_outputs):
        x, t = inputs
        gloss = grad_outputs[0]
        if hasattr(self, 'y'):
            y = self.y.copy()
        else:
            y = log_softmax._log_softmax(x, self.use_cudnn)
            numpy.exp(y, out=y)
        if y.ndim == 2:
            gx = y
            gx[numpy.arange(len(t)), numpy.maximum(t, 0)] -= 1
            if self.class_weight is not None:
                shape = [1 if d != 1 else -1 for d in six.moves.range(x.ndim)]
                c = _broadcast_to(self.class_weight.reshape(shape), x.shape)
                c = c[numpy.arange(len(t)), numpy.maximum(t, 0)]
                gx *= _broadcast_to(numpy.expand_dims(c, 1), gx.shape)
            gx *= (t != self.ignore_label).reshape((len(t), 1))
        else:
            # in the case where y.ndim is higher than 2,
            # we think that a current implementation is inefficient
            # because it yields two provisional arrays for indexing.
            n_unit = t.size // len(t)
            gx = y.reshape(y.shape[0], y.shape[1], -1)
            fst_index = numpy.arange(t.size) // n_unit
            trd_index = numpy.arange(t.size) % n_unit
            gx[fst_index, numpy.maximum(t.ravel(), 0), trd_index] -= 1
            if self.class_weight is not None:
                shape = [1 if d != 1 else -1 for d in six.moves.range(x.ndim)]
                c = _broadcast_to(self.class_weight.reshape(shape), x.shape)
                c = c.reshape(gx.shape)
                c = c[fst_index, numpy.maximum(t.ravel(), 0), trd_index]
                c = c.reshape(y.shape[0], 1, -1)
                gx *= _broadcast_to(c, gx.shape)
            gx *= (t != self.ignore_label).reshape((len(t), 1, -1))
            gx = gx.reshape(y.shape)
        if self.reduce == 'mean':
            gx *= gloss * self._coeff
        else:
            gx *= gloss[:, None]
        return gx, None

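Lines 9 to 17 of backward_cpu are where $y-t$ is computed. As expected, the class_weight entry of each sample's target class is picked out, broadcast across that sample's row, and multiplied into the backpropagated value.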
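
The same factor falls out of differentiating the weighted loss by hand. The following is a short derivation sketch in my own notation, which is an assumption and not taken from the Chainer source: $y = Softmax(x)$, $w_{k}$ is class_weight[k], $t$ is the one-hot target, $c$ is the index of the true class, and $\delta_{kj}$ is the Kronecker delta.

$$
L = -\sum_{k} t_{k} w_{k} \log{y_{k}}, \qquad \frac{\partial \log{y_{k}}}{\partial x_{j}} = \delta_{kj} - y_{j}
$$

$$
\frac{\partial L}{\partial x_{j}} = -\sum_{k} t_{k} w_{k} (\delta_{kj} - y_{j}) = -t_{j} w_{j} + y_{j} \sum_{k} t_{k} w_{k} = w_{c} (y_{j} - t_{j})
$$

The last step uses the fact that $t_{j}$ is non-zero only for $j = c$, so $t_{j} w_{j} = t_{j} w_{c}$ and $\sum_{k} t_{k} w_{k} = w_{c}$. The whole row $y - t$ is therefore scaled by the weight of the true class, which is what the indexing of `c` by `t` in backward_cpu does.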

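We can also see that gloss is multiplied in at the end. gloss is grad_outputs[0], which is the grad member of the Variable class, here the grad of the loss variable.
So it should be enough to check the initial value of grad. Let's look.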

>>> loss_nonweight.backward()
>>> loss_withweight.backward()
>>> loss_nonweight.grad
array(1.0, dtype=float32)
>>> loss_withweight.grad
array(1.0, dtype=float32)

It would be a problem if this were not the case, but the initial backpropagated value is $\frac{\partial L}{\partial L} = 1$. So this result looks right.

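For completeness, there is one more multiplied parameter besides gloss, namely _coeff. In batch training it simply holds the reciprocal of the batch size (more precisely, with normalize=True, of the number of labels that are not ignore_label), so it is the member that turns the sum into a mean. In this experiment it is 1.
The same _coeff is also multiplied in when the loss is computed.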
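
To see what _coeff does once a batch contains more than one sample, here is a minimal standard-library sketch of the reduce='mean', normalize=True path as I read it from the forward_cpu listing. `sample_loss` and `mean_loss` are hypothetical helper names, not Chainer functions.

```python
import math


def sample_loss(x, t, class_weight):
    """-class_weight[t] * log(softmax(x))[t] for one sample."""
    m = max(x)
    log_z = m + math.log(sum(math.exp(v - m) for v in x))
    return -class_weight[t] * (x[t] - log_z)


def mean_loss(xs, ts, class_weight, ignore_label=-1):
    """Batch loss: weighted per-sample losses summed, then scaled by coeff."""
    kept = [(x, t) for x, t in zip(xs, ts) if t != ignore_label]
    coeff = 1.0 / max(len(kept), 1)  # plays the role of self._coeff
    return coeff * sum(sample_loss(x, t, class_weight) for x, t in kept)


xs = [[1.0, 0.0], [1.0, 0.0]]
ts = [0, 1]
print(mean_loss(xs, ts, [1.0, 1.0]))  # about 0.8133
print(mean_loss(xs, ts, [1.0, 2.0]))  # about 1.4699
```

Note that the divisor is the sample count, not the sum of the weights, so in a mixed batch the mean loss is not simply doubled; only the contribution of the class '1' sample is.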

So the weights defined in class_weight do seem to scale the training signal proportionally, as first guessed.
Now for a slightly hacky experiment.

>>> sce_nonweight.backward_cpu((x,t),[loss_nonweight.grad])
(array([[ 0.7310586, -0.7310586]], dtype=float32), None)
>>> sce_withweight.backward_cpu((x,t),[loss_withweight.grad])
(array([[ 1.4621172, -1.4621172]], dtype=float32), None)

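Since chainer.functions.softmax(x).data is array([[ 0.7310586 , 0.26894143]], dtype=float32), the unweighted backpropagated value is indeed $y - t$.
The weighted backpropagated value is exactly twice that, as expected. All good.

In conclusion, the class_weight weighting is reflected proportionally in the backpropagated value as well.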
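
As an independent cross-check that does not touch Chainer internals, the gradient can be compared against finite differences. This is a minimal standard-library sketch with function names of my own choosing; it assumes the single-sample loss $-ClassWeight_{t} \log{(Softmax(x)_{t})}$ discussed earlier.

```python
import math


def softmax(x):
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]


def loss(x, t, w):
    """Weighted cross entropy for one sample with target index t."""
    return -w[t] * math.log(softmax(x)[t])


def analytic_grad(x, t, w):
    """w[t] * (softmax(x) - onehot(t)), the value backward_cpu returns here."""
    y = softmax(x)
    return [w[t] * (y[j] - (1.0 if j == t else 0.0)) for j in range(len(x))]


def numeric_grad(x, t, w, eps=1e-6):
    """Central finite differences of the loss with respect to each x[j]."""
    grad = []
    for j in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[j] += eps
        lo[j] -= eps
        grad.append((loss(hi, t, w) - loss(lo, t, w)) / (2 * eps))
    return grad


x, t = [1.0, 0.0], 1
print(analytic_grad(x, t, [1.0, 1.0]))  # about [0.7311, -0.7311]
print(analytic_grad(x, t, [1.0, 2.0]))  # about [1.4621, -1.4621]
print(numeric_grad(x, t, [1.0, 2.0]))   # close to the line above
```

The analytic values agree with the backward_cpu outputs shown above, and the numerical gradient of the weighted loss matches them to within the finite-difference error.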

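Conclusion

The class_weight argument of softmax_cross_entropy as implemented in Chainer is

* multiplied directly into the loss when it is computed
* multiplied directly into the backpropagated value output by softmax_cross_entropy

That is what I found.

I am not sure who will benefit from this, but I hope it is useful.
If anything here is wrong, I would appreciate it if you let me know.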

Looking into the class_weight argument of Chainer's softmax_cross_entropy function
