
L2-constrained Softmax Loss for Discriminative Face Verification [4. Proposed Method] [Paper, DeepL Translation]

Paper reading

Posted at 2020-12-01 (last updated at 2020-12-01)

This article is mostly a memo for my own use.
It is almost entirely a DeepL translation.
I would appreciate it if you pointed out any mistakes.

Source
L2-constrained Softmax Loss for Discriminative Face Verification
Rajeev Ranjan, Carlos D. Castillo, Rama Chellappa

Previous: [3. Motivation]
Next: [5. Results]

4. Proposed Method

Translation

The proposed $L_2$-softmax loss is given by Equation 3:

$$\rm{minimize} \ \ - \frac{1}{M} \sum_{i=1}^{M} \log \frac{e^{W_{y_i}^{T}f(x_i)+b_{y_i}}}{\sum_{j=1}^{C}e^{W_{j}^{T}f(x_i)+b_{j}}} \tag{3}$$

$$\rm{subject \ to} \ \ ||f(x_i)||_2 = \alpha, \ \forall i = 1, 2, \ldots, M, $$

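Here, $x_i$ is an input image in a mini-batch of size $M$, $y_i$ is the corresponding class label, $f(x_i)$ is the feature descriptor obtained from the penultimate layer of the DCNN, $C$ is the number of subject classes, and $W$ and $b$ are the weights and bias of the last layer of the network, which acts as the classifier. This formulation adds an $L_2$-constraint to the regular softmax loss defined in Equation 1. The effectiveness of this constraint is demonstrated on MNIST [17] data.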
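As a rough illustration of Equation 3, the following NumPy sketch rescales each feature to norm $\alpha$ and then evaluates the usual softmax cross-entropy. The function name `l2_softmax_loss`, the array shapes, and the random toy data are assumptions made for this example and do not come from the paper.

```python
import numpy as np

def l2_softmax_loss(features, labels, W, b, alpha):
    """Mean softmax cross-entropy with every feature rescaled to L2 norm alpha.

    features: (M, D) array, labels: (M,) integer array,
    W: (D, C) weights, b: (C,) bias. Assumes no feature row is all zeros.
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    constrained = alpha * features / norms               # ||f(x_i)||_2 == alpha
    logits = constrained @ W + b                         # shape (M, C)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# Toy data (hypothetical): 8 samples, 2-D features, 10 classes.
rng = np.random.default_rng(0)
M, D, C = 8, 2, 10
features = rng.normal(size=(M, D))
labels = rng.integers(0, C, size=M)
W = rng.normal(size=(D, C))
b = np.zeros(C)

loss = l2_softmax_loss(features, labels, W, b, alpha=16.0)
# Rescaling the raw features leaves the loss unchanged, because only the
# direction of each feature survives the constraint.
loss_rescaled = l2_softmax_loss(5.0 * features, labels, W, b, alpha=16.0)
```

The comparison between `loss` and `loss_rescaled` reflects the point of the constraint: the raw feature norm no longer influences the loss.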

4.1. MNIST Example

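We study the effect of the $L_2$-softmax loss on the MNIST dataset [17]. We use the deeper and wider version of LeNet described in [33], in which the output of the last hidden layer is restricted to 2 dimensions for easy visualization. In the first setup, the network is trained end-to-end with the regular softmax loss for digit classification with 10 classes. In the second setup, an $L_2$-normalize layer and a scale layer are added on top of the 2-dimensional features to enforce the $L_2$-constraint of Equation 3 (see Section 4.2 for details). Figure 3 shows the 2-D features of the different classes for the MNIST test set, which contains 10,000 digit images. Each lobe in the figure represents the 2-D features of one digit class. The features for the second setup were taken before the $L_2$-normalize layer.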

Figure 3. Visualization of the 2-dimensional features for the MNIST digit classification test set. (a) Softmax loss. (b) $L_2$-softmax loss.

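There are two clear differences between the features learned with the two setups discussed above. First, the intra-class angular variance, which can be estimated from the average width of the lobes of each class, is large with the regular softmax loss. In contrast, the features obtained with the $L_2$-softmax loss have lower intra-class angular variability and form thinner lobes. Second, the feature magnitudes are much larger with the softmax loss (ranging up to 150), because larger feature norms yield a higher probability for a correctly classified class. The feature norm, on the other hand, has minimal effect on the $L_2$-softmax loss, since every feature is normalized onto a circle of fixed radius before the loss is computed. Hence the network focuses on pulling features of the same class together and separating features of different classes in the normalized, or angular, space. Table 1 lists the accuracy of the two setups on the MNIST test set. The $L_2$-softmax loss performs better, reducing the error by more than 15%. Note that these accuracies are lower than those of a typical DCNN because only 2-dimensional features are used for classification.

Table 1. Accuracy (%) on the MNIST test set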

4.2. Implementation Details

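This section details how the $L_2$-constraint of Equation 3 is implemented within a DCNN framework. The constraint is enforced by adding an $L_2$-normalize layer followed by a scale layer, as shown in Figure 4.

Figure 4. An $L_2$-normalize layer and a scale layer are added to constrain the feature descriptor to lie on a hypersphere of radius $\alpha$.

This module is added right after the penultimate layer of the DCNN, which acts as the feature descriptor. The $L_2$-normalize layer normalizes the input feature $\mathbf{x}$ to a unit vector (Equation 4). The scale layer scales that unit vector to a fixed radius given by the parameter $\alpha$ (Equation 5). In total, only one scalar parameter ($\alpha$) is introduced, and it can be trained together with the other parameters of the network.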

$$\mathbf{y} = \frac{\mathbf{x}}{||\mathbf{x}||_2} \tag{4}$$

$$\mathbf{z} = \alpha \cdot \mathbf{y} \tag{5}$$
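The forward pass of Equations 4 and 5 can be sketched in a few lines of NumPy. The function name `l2_normalize_scale` and the small `eps` guard against division by zero are assumptions added for this example; the paper does not specify them.

```python
import numpy as np

def l2_normalize_scale(x, alpha, eps=1e-12):
    """Apply Equation 4 (unit normalization) and Equation 5 (scaling) row-wise.

    eps is a hypothetical safeguard for all-zero rows, not part of the paper.
    """
    norm = np.sqrt(np.sum(x * x, axis=-1, keepdims=True))
    y = x / np.maximum(norm, eps)   # Equation 4: y = x / ||x||_2
    z = alpha * y                   # Equation 5: z = alpha * y
    return y, z

x = np.array([[3.0, 4.0], [0.5, -1.2]])
y, z = l2_normalize_scale(x, alpha=16.0)
# Every row of y lies on the unit circle and every row of z on the circle
# of radius alpha, regardless of the norm of the input row.
```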

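The module is fully differentiable and can be used in end-to-end training of the network. At test time the module is redundant, because the features are normalized to unit length anyway when the cosine similarity is computed. At training time, the gradients are backpropagated through the $L_2$-normalize layer and the scale layer, and the gradient with respect to the scaling parameter $\alpha$ is computed with the chain rule, as follows.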

$$\frac{\partial l}{\partial y_i} = \frac{\partial l}{\partial z_i} \cdot \alpha$$

$$\frac{\partial l}{\partial \alpha} = \sum_{j=1}^{D} \frac{\partial l}{\partial z_j} \cdot y_j$$

$$\frac{\partial l}{\partial x_i} = \sum_{j=1}^{D} \frac{\partial l}{\partial y_j} \cdot \frac{\partial y_j}{\partial x_i} \tag{6}$$

$$\frac{\partial y_i}{\partial x_i} = \frac{||x||_{2}^{2} - x_{i}^2}{||x||_{2}^{3}}$$

$$\frac{\partial y_j}{\partial x_i} = \frac{-x_i \cdot x_j}{||x||_{2}^{3}} \quad (j \neq i)$$
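The chain-rule expressions can be checked numerically. The sketch below assumes the standard Jacobian of the normalization, $\partial y_j / \partial x_i = (\delta_{ij} ||x||_2^2 - x_i x_j) / ||x||_2^3$, and compares the analytic gradients with central finite differences. The linear toy loss $l = \sum_k c_k z_k$ and all function names are assumptions made for this example.

```python
import numpy as np

def forward(x, alpha):
    """Return (y, z) for a single feature vector x."""
    y = x / np.linalg.norm(x)
    return y, alpha * y

def backward(x, alpha, grad_z):
    """Propagate dl/dz back to dl/dx and dl/dalpha."""
    norm = np.linalg.norm(x)
    y = x / norm
    grad_y = grad_z * alpha              # dl/dy_i = dl/dz_i * alpha
    grad_alpha = np.sum(grad_z * y)      # dl/dalpha = sum_j dl/dz_j * y_j
    # jac[j, i] = dy_j/dx_i = (delta_ij * ||x||^2 - x_i * x_j) / ||x||^3
    jac = (np.eye(x.size) * norm**2 - np.outer(x, x)) / norm**3
    grad_x = jac.T @ grad_y              # dl/dx_i = sum_j dl/dy_j * dy_j/dx_i
    return grad_x, grad_alpha

# Hypothetical toy loss l = c . z, so that dl/dz = c.
rng = np.random.default_rng(0)
x = rng.normal(size=5)
c = rng.normal(size=5)
alpha = 4.0

def loss(x, alpha):
    return np.dot(c, forward(x, alpha)[1])

grad_x, grad_alpha = backward(x, alpha, c)

# Central finite differences as an independent check.
h = 1e-6
num_grad_x = np.array([
    (loss(x + h * e, alpha) - loss(x - h * e, alpha)) / (2 * h)
    for e in np.eye(x.size)
])
num_grad_alpha = (loss(x, alpha + h) - loss(x, alpha - h)) / (2 * h)
```

One consequence of this Jacobian is that `grad_x` is always orthogonal to `x`: changing only the length of the input feature does not change the loss.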

4.3. Bounds on Parameter α

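The scaling parameter $\alpha$ plays a crucial role in the performance of the $L_2$-softmax loss. There are two ways to enforce the $L_2$-constraint: 1) keep $\alpha$ fixed throughout training, or 2) let the network learn $\alpha$. The second way is elegant and always improves over the regular softmax loss. However, the $\alpha$ learned by the network is high, which relaxes the $L_2$-constraint. The softmax classifier, which tries to increase the feature norm in order to minimize the overall loss, increases $\alpha$ instead, which gives it more freedom to fit the easy samples. Hence the $\alpha$ learned by the network forms an upper bound for the parameter. Better performance is obtained by fixing $\alpha$ to a lower constant value.
On the other hand, with a very low value of $\alpha$ the training does not converge. For instance, $\alpha = 1$ performs very poorly on the LFW [14] dataset, with an accuracy of 86.37% (see Figure 7). The reason is that a hypersphere of small radius $\alpha$ has limited surface area for embedding features of the same class close together and features of different classes far apart.
Here we formulate a theoretical lower bound on $\alpha$. Assuming the number of classes $C$ is lower than twice the feature dimension $D$, the classes can be distributed on a hypersphere of dimension $D$ such that any two class centers are at least $90^\circ$ apart. Figure 5(a) shows this case for $C = 4$ class centers distributed on a circle of radius $\alpha$. The classifier weights ($W_i$) are assumed to be unit vectors pointing in the direction of their respective class centers, and the bias term is ignored. The average softmax probability $p$ of correctly classifying a feature is given by Equation 7: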

$$p = \frac{e^{W_{i}^{T} X_{i}}}{\sum_{j=1}^{4} e^{W_{j}^{T} X_{i}}} $$

$$= \frac{e^{\alpha}}{e^{\alpha} + 2 + e^{-\alpha}} \tag{7}$$

Ignoring the term $e^{-\alpha}$ and generalizing to $C$ classes, the average probability becomes:

$$p= \frac{e^\alpha}{e^\alpha+C-2} \tag{8}$$

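Figure 5(b) plots the probability score as a function of the parameter $\alpha$ for various numbers of classes $C$. It shows that achieving a given classification probability (say $p = 0.9$) requires a higher $\alpha$ for larger $C$. Given the number of classes $C$ of a dataset, the lower bound on $\alpha$ needed to achieve a probability score of $p$ can be obtained from Equation 9.

$$\alpha_{low} = \log \frac{p(C-2)}{1-p} \tag{9}$$

Figure 5. (a) 2-D visualization of the assumed distribution of features. (b) Variation of the softmax probability with respect to $\alpha$ for different numbers of classes $C$.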
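The lower bound is easy to evaluate directly. The sketch below implements Equation 8 and its inverse for $\alpha$; the function names and the class counts in the loop are assumptions chosen only for illustration.

```python
import math

def softmax_prob(alpha, num_classes):
    """Equation 8: average probability of the correct class."""
    return math.exp(alpha) / (math.exp(alpha) + num_classes - 2)

def alpha_lower_bound(p, num_classes):
    """Solve Equation 8 for alpha: alpha_low = log(p * (C - 2) / (1 - p))."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if num_classes <= 2:
        raise ValueError("the bound is only defined for more than 2 classes")
    return math.log(p * (num_classes - 2) / (1.0 - p))

# The bound grows with the number of classes for a fixed target p = 0.9.
for num_classes in (10, 100, 1000, 10000):
    bound = alpha_lower_bound(0.9, num_classes)
    print(num_classes, round(bound, 2), round(softmax_prob(bound, num_classes), 3))
```

Substituting the returned bound back into `softmax_prob` recovers the target probability, which is a quick sanity check that the two expressions are inverses of each other.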

Original text

The proposed $L_2$-softmax loss is given by Equation 3

$$\rm{minimize} \ \ - \frac{1}{M} \sum_{i=1}^{M} \log \frac{e^{W_{y_i}^{T}f(x_i)+b_{y_i}}}{\sum_{j=1}^{C}e^{W_{j}^{T}f(x_i)+b_{j}}} \tag{3}$$

$$\rm{subject \ to} \ \ ||f(x_i)||_2 = \alpha, \ \forall i = 1, 2, \ldots, M, $$

where $x_i$ is the input image in a mini-batch of size $M$, $y_i$ is the corresponding class label, $f(x_i)$ is the feature descriptor obtained from the penultimate layer of DCNN, $C$ is the number of subject classes, and $W$ and $b$ are the weights and bias for the last layer of the network which acts as a classifier. This equation adds an additional $L_2$-constraint to the regular softmax loss defined in Equation 1. We show the effectiveness of this constraint using MNIST [17] data.

4.1. MNIST Example

We study the effect of $L_2$-softmax loss on the MNIST dataset [17]. We use a deeper and wider version of LeNet mentioned in [33], where the last hidden layer output is restricted to 2 dimensions for easy visualization. For the first setup, we train the network end-to-end using the regular softmax loss for digit classification with number of classes $= 10$. For the second setup, we add an $L_2$-normalize layer and a scale layer to the 2-dimensional features, which enforces the $L_2$-constraint described in Equation 3 (see Section 4.2 for details). Figure 3 depicts the 2-D features for different classes for the MNIST test set containing 10,000 digit images. Each of the lobes shown in the figure represents the 2-D features of a unique digit class. The features for the second setup were obtained before the $L_2$-normalize layer.

Figure 3. Visualization of 2-dimensional features for MNIST digit classification test set using (a) Softmax Loss. (b) $L_2$-Softmax Loss

We find two clear differences between the features learned using the two setups discussed above. First, the intra-class angular variance is large when using the regular softmax loss, which can be estimated by the average width of the lobes for each class. On the other hand, the features obtained with $L_2$-softmax loss have lower intra-class angular variability, and are represented by thinner lobes. Second, the magnitudes of the features are much higher with the softmax loss (ranging up to 150), since larger feature norms result in a higher probability for a correctly classified class. In contrast, the feature norm has minimal effect on the $L_2$-softmax loss since every feature is normalized to a circle of fixed radius before computing the loss. Hence, the network focuses on bringing the features from the same class closer to each other and separating the features from different classes in the normalized or angular space. Table 1 lists the accuracy obtained with the two setups on the MNIST test set. $L_2$-softmax loss achieves a higher performance, reducing the error by more than 15%. Note that these accuracy numbers are lower compared to a typical DCNN since we are using only 2-dimensional features for classification.

Table 1. Accuracy on MNIST test set in (%)

4.2. Implementation Details

Here, we provide the details of implementing the $L_2$-constraint described in Equation 3 in the framework of DCNNs. The constraint is enforced by adding an $L_2$-normalize layer followed by a scale layer as shown in Figure 4.

Figure 4. We add an $L_2$-normalize layer and a scale layer to constrain the feature descriptor to lie on a hypersphere of radius $\alpha$.

This module is added just after the penultimate layer of DCNN which acts as a feature descriptor. The $L_2$-normalize layer normalizes the input feature $x$ to a unit vector given by Equation 4. The scale layer scales the input unit vector to a fixed radius given by the parameter $\alpha$ (Equation 5). In total, we just introduce one scalar parameter ($\alpha$) which can be trained along with the other parameters of the network.

$$\mathbf{y} = \frac{\mathbf{x}}{||\mathbf{x}||_2} \tag{4}$$

$$\mathbf{z} = \alpha \cdot \mathbf{y} \tag{5}$$

The module is fully differentiable and can be used in the end-to-end training of the network. At test time, the proposed module is redundant, since the features are eventually normalized to unit length while computing the cosine similarity. At training time, we backpropagate the gradients through the $L_2$-normalize layer and the scale layer, as well as compute the gradients with respect to the scaling parameter $\alpha$ using the chain rule as given below.

$$\frac{\partial l}{\partial y_i} = \frac{\partial l}{\partial z_i} \cdot \alpha$$

$$\frac{\partial l}{\partial \alpha} = \sum_{j=1}^{D} \frac{\partial l}{\partial z_j} \cdot y_j$$

$$\frac{\partial l}{\partial x_i} = \sum_{j=1}^{D} \frac{\partial l}{\partial y_j} \cdot \frac{\partial y_j}{\partial x_i} \tag{6}$$

$$\frac{\partial y_i}{\partial x_i} = \frac{||x||_{2}^{2} - x_{i}^2}{||x||_{2}^{3}}$$

$$\frac{\partial y_j}{\partial x_i} = \frac{-x_i \cdot x_j}{||x||_{2}^{3}} \quad (j \neq i)$$

4.3. Bounds on Parameter α

The scaling parameter $\alpha$ plays a crucial role in deciding the performance of $L_2$-softmax loss. There are two ways to enforce the $L_2$-constraint: 1) by keeping $\alpha$ fixed throughout the training, and 2) by letting the network learn the parameter $\alpha$. The second way is elegant and always improves over the regular softmax loss. But the $\alpha$ parameter learned by the network is high, which results in a relaxed $L_2$-constraint. The softmax classifier, aimed at increasing the feature norm for minimizing the overall loss, increases the $\alpha$ parameter instead, allowing it more freedom to fit to the easy samples. Hence, $\alpha$ learned by the network forms an upper bound for the parameter. A better performance is obtained by fixing $\alpha$ to a lower constant value.
On the other hand, with a very low value of $\alpha$, the training doesn't converge. For instance, $\alpha = 1$ performs very poorly on the LFW [14] dataset, achieving an accuracy of 86.37% (see Figure 7). The reason is that a hypersphere with small radius ($\alpha$) has limited surface area for embedding features from the same class together and those from different classes far from each other.
Here, we formulate a theoretical lower bound on $\alpha$. Assuming the number of classes $C$ to be lower than twice the feature dimension $D$, we can distribute the classes on a hypersphere of dimension $D$ such that any two class centers are at least $90^\circ$ apart. Figure 5(a) represents this case for $C = 4$ class centers distributed on a circle of radius $\alpha$. We assume the classifier weights ($W_i$) to be a unit vector pointing in the direction of their respective class centers. We ignore the bias term. The average softmax probability $p$ for correctly classifying a feature is given by Equation 7

$$p = \frac{e^{W_{i}^{T} X_{i}}}{\sum_{j=1}^{4} e^{W_{j}^{T} X_{i}}} $$

$$= \frac{e^{\alpha}}{e^{\alpha} + 2 + e^{-\alpha}} \tag{7}$$

Ignoring the term $e^{-\alpha}$ and generalizing it for $C$ classes, the average probability becomes:

$$p= \frac{e^\alpha}{e^\alpha+C-2} \tag{8}$$

Figure 5(b) plots the probability score as a function of the parameter $\alpha$ for various numbers of classes $C$. We can infer that to achieve a given classification probability (say $p = 0.9$), we need to have a higher $\alpha$ for larger $C$. Given the number of classes $C$ for a dataset, we can obtain the lower bound on $\alpha$ to achieve a probability score of $p$ by using Equation 9.

$$\alpha_{low} = \log \frac{p(C-2)}{1-p} \tag{9}$$

Figure 5. (a) 2-D visualization of the assumed distribution of features (b) Variation in softmax probability with respect to $\alpha$ for different numbers of classes $C$
