Machine learning

Stochastic gradient descent


SGD

(stochastic gradient descent)

Apply steepest descent on a per-minibatch basis to update the parameters $\theta$.

\begin{aligned}

\theta &\leftarrow \theta - \alpha g_{\theta}
\end{aligned}


  • $g_{\theta}$: gradient of the loss function (smaller is better) with respect to the parameters $\theta$ for the current input

  • $\alpha$: learning rate; something like $10^{-2}$ works.
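
As a concrete reference, a minimal NumPy sketch of one SGD step (the function and variable names are mine; g stands for the minibatch gradient $g_{\theta}$):

```python
import numpy as np

def sgd_step(theta, g, alpha=1e-2):
    """One SGD step: theta <- theta - alpha * g."""
    return theta - alpha * g

# usage: g is the gradient of the minibatch loss w.r.t. theta
theta = np.zeros(3)
g = np.array([0.1, -0.2, 0.3])
theta = sgd_step(theta, g)
```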


Momentum

\begin{aligned}

\tilde{m} &\leftarrow \beta \tilde{m} + g_{\theta} \\
\theta &\leftarrow \theta - \alpha \tilde{m}
\end{aligned}


  • $\beta$: momentum (inertia) parameter.
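
A minimal NumPy sketch of the momentum update, with the state $\tilde{m}$ passed in and returned explicitly (the default $\beta = 0.9$ is an illustrative choice, not from the text above):

```python
import numpy as np

def momentum_step(theta, m, g, alpha=1e-2, beta=0.9):
    """Momentum SGD: m <- beta * m + g, then theta <- theta - alpha * m."""
    m = beta * m + g
    theta = theta - alpha * m
    return theta, m

# m is initialized to zeros with the same shape as theta
theta, m = np.zeros(3), np.zeros(3)
```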


NesterovAG

(Nesterov accelerated gradient)


  • Yoshua Bengio, Nicolas Boulanger-Lewandowski, Razvan Pascanu. 2012. Advances in optimizing recurrent networks. arXiv:1212.0901

\begin{aligned}

m_{t} &\leftarrow \gamma m_{t-1} + g_{\theta_{t-1} - \alpha \gamma m_{t-1}} \\
\theta_{t} &\leftarrow \theta_{t-1} - \alpha m_{t}
\end{aligned}

Reparameterize as $\hat{\theta}_{t} \overset{d}{=} \theta_{t} - \alpha \gamma m_{t}$, i.e. track the lookahead parameters:

\begin{aligned}

m &\leftarrow \gamma m + g_{\hat{\theta}} \\
\hat{\theta} &\leftarrow \hat{\theta} - \alpha (g_{\hat{\theta}} + \gamma m)
\end{aligned}

Keras's SGD(nesterov=True) corresponds to this reparameterized form.
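
A minimal NumPy sketch of the reparameterized form above; g must be the gradient evaluated at $\hat{\theta}$, and the default $\gamma = 0.9$ is an illustrative choice:

```python
import numpy as np

def nesterov_step(theta_hat, m, g, alpha=1e-2, gamma=0.9):
    """Nesterov momentum in the reparameterized (Keras-style) form."""
    m = gamma * m + g                                # m <- gamma * m + g
    theta_hat = theta_hat - alpha * (g + gamma * m)  # lookahead update
    return theta_hat, m
```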


AdaGrad

\begin{aligned}

\tilde{v} &\leftarrow \tilde{v} + g_{\theta}^{2} \\
\theta &\leftarrow \theta - \frac{\alpha}{\sqrt{\tilde{v}} + \epsilon} g_{\theta}
\end{aligned}


  • $\tilde{v}$ is the accumulated sum of squared gradients, initialized to $0$.

  • $\epsilon$ is a small constant for numerical stability; something like $10^{-8}$ works.
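
A minimal NumPy sketch ($\epsilon$ follows the note above; the default $\alpha = 10^{-2}$ is an illustrative choice):

```python
import numpy as np

def adagrad_step(theta, v, g, alpha=1e-2, eps=1e-8):
    """AdaGrad: accumulate squared gradients and scale the step per coordinate."""
    v = v + g ** 2
    theta = theta - alpha / (np.sqrt(v) + eps) * g
    return theta, v

# v is initialized to zeros with the same shape as theta
```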


RMSProp

\begin{aligned}

\tilde{v} &\leftarrow \beta \tilde{v} + (1 - \beta) g_{\theta}^{2} \\
\theta &\leftarrow \theta - \frac{\alpha}{\sqrt{\tilde{v}} + \epsilon} g_{\theta}
\end{aligned}

$\beta$ is the decay rate that discounts the contribution of past gradients.

Initialize $\tilde{v} = 0$. Typical parameters are around $\alpha = 10^{-3}$, $\beta = 0.9$, $\epsilon = 10^{-8}$.
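
A minimal NumPy sketch with the defaults listed above:

```python
import numpy as np

def rmsprop_step(theta, v, g, alpha=1e-3, beta=0.9, eps=1e-8):
    """RMSProp: exponential moving average of squared gradients."""
    v = beta * v + (1.0 - beta) * g ** 2
    theta = theta - alpha / (np.sqrt(v) + eps) * g
    return theta, v
```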


RMSPropGraves


  • Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv:1308.0850

\begin{aligned}

\tilde{m} &\leftarrow \beta \tilde{m} + (1 - \beta) g_{\theta} \\
\tilde{v} &\leftarrow \beta \tilde{v} + (1 - \beta) g_{\theta}^{2} \\
\hat{m} &\leftarrow \gamma \hat{m} + \frac{1}{\sqrt{\tilde{v} - \tilde{m}^{2}} + \epsilon} g_{\theta} \\
\theta &\leftarrow \theta - \alpha \hat{m}
\end{aligned}

Typical parameters are around $\beta = 0.95$, $\gamma = 0.9$, $\alpha = 0.0001$, $\epsilon = 0.0001$.

Gradients at the output layer are clipped to the range $(-100, 100)$, and the gradient of each parameter is clipped to $(-10, 10)$.


AdaDelta


  • Matthew D. Zeiler. 2012. ADADELTA: an adaptive learning rate method. arXiv:1212.5701.

\begin{aligned}

r &\leftarrow \gamma r + (1 - \gamma) g_{\theta}^{2} \\
v &\leftarrow \frac{\sqrt{s + \epsilon}}{\sqrt{r + \epsilon}} g_{\theta} \\
\theta &\leftarrow \theta - \alpha v \\
s &\leftarrow \gamma s + (1 - \gamma) v^{2}
\end{aligned}

Initialize $r = 0$ and $s = 0$ ($v$ is recomputed every step). Typical parameters are around $\alpha = 1$, $\gamma = 0.95$, $\epsilon = 10^{-8}$.
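
A minimal NumPy sketch with the defaults listed above (the accumulators $r$ and $s$ are the only state):

```python
import numpy as np

def adadelta_step(theta, r, s, g, alpha=1.0, gamma=0.95, eps=1e-8):
    """AdaDelta: step size is the ratio of RMS(update) to RMS(gradient)."""
    r = gamma * r + (1.0 - gamma) * g ** 2       # EMA of squared gradients
    v = np.sqrt(s + eps) / np.sqrt(r + eps) * g  # candidate update
    theta = theta - alpha * v
    s = gamma * s + (1.0 - gamma) * v ** 2       # EMA of squared updates
    return theta, r, s
```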


Adam


  • Diederik Kingma, Jimmy Ba. 2015. Adam: a method for stochastic optimization. arXiv:1412.6980.

\begin{aligned}

\tilde{m} &\leftarrow \beta_{1} \tilde{m} + (1 - \beta_{1}) g_{\theta} \\
\tilde{v} &\leftarrow \beta_{2} \tilde{v} + (1 - \beta_{2}) g_{\theta}^{2} \\
m &\leftarrow \frac{\tilde{m}}{1 - \beta_{1}^{t}} \\
v &\leftarrow \frac{\tilde{v}}{1 - \beta_{2}^{t}} \\
\theta &\leftarrow \theta - \alpha \frac{m}{\sqrt{v} + \epsilon}
\end{aligned}

Here $t$ is the iteration count. Typical parameters are around $\alpha = 10^{-3}$, $\beta_{1} = 0.9$, $\beta_{2} = 0.999$, $\epsilon = 10^{-8}$.
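
A minimal NumPy sketch with the defaults listed above ($t$ is the 1-based iteration count):

```python
import numpy as np

def adam_step(theta, m, v, g, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: bias-corrected first and second moment estimates."""
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g ** 2
    m_hat = m / (1.0 - beta1 ** t)   # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```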


Adamax

🚧


Fixing weight decay regularization


  • Ilya Loshchilov, Frank Hutter. 2017. Fixing weight decay regularization in Adam. arXiv:1711.05101.


    • Decouple the weight decay computation from the SGD update itself




SGDW

(SGD with momentum and weight decay)

\begin{aligned}

\tilde{m} &\leftarrow \beta \tilde{m} + g_{\theta} \\
\theta &\leftarrow (1 - \alpha \delta) \theta - \alpha \tilde{m}
\end{aligned}


  • $\delta$: weight decay


AdamW

\begin{aligned}

\tilde{m} &\leftarrow \beta_{1} \tilde{m} + (1 - \beta_{1}) g_{\theta} \\
\tilde{v} &\leftarrow \beta_{2} \tilde{v} + (1 - \beta_{2}) g_{\theta}^{2} \\
m &\leftarrow \frac{\tilde{m}}{1 - \beta_{1}^{t}} \\
v &\leftarrow \frac{\tilde{v}}{1 - \beta_{2}^{t}} \\
\theta &\leftarrow (1 - \alpha \delta) \theta - \alpha \frac{m}{\sqrt{v} + \epsilon}
\end{aligned}
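
A minimal NumPy sketch of the AdamW update; the weight-decay default $\delta = 10^{-2}$ is only an illustrative choice:

```python
import numpy as np

def adamw_step(theta, m, v, g, t, alpha=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, delta=1e-2):
    """AdamW: Adam with weight decay applied to theta directly (decoupled)."""
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g ** 2
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    theta = (1.0 - alpha * delta) * theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```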


SMORMS3

\begin{aligned}

\beta &\overset{d}{=} \frac{1}{s + 1} \\
v &\leftarrow \beta v + (1 - \beta) g_{\theta} \\
r &\leftarrow \beta r + (1 - \beta) g_{\theta}^{2} \\
x &\overset{d}{=} \frac{v^{2}}{r + \epsilon} \\
\theta &\leftarrow \theta - \frac{\min\{\alpha, x\}}{\sqrt{r} + \epsilon} g_{\theta} \\
s &\leftarrow 1 + \left( 1 - x \right) s
\end{aligned}

Initialize $s = 1$, $v = 0$, $r = 0$ (each a vector with the same shape as $\theta$). Typical parameters are around $\alpha = 10^{-3}$, $\epsilon = 10^{-16}$.
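
A minimal NumPy sketch with the initial values and defaults listed above:

```python
import numpy as np

def smorms3_step(theta, s, v, r, g, alpha=1e-3, eps=1e-16):
    """SMORMS3: the per-coordinate memory s sets the averaging window."""
    beta = 1.0 / (s + 1.0)
    v = beta * v + (1.0 - beta) * g
    r = beta * r + (1.0 - beta) * g ** 2
    x = v ** 2 / (r + eps)
    theta = theta - np.minimum(alpha, x) / (np.sqrt(r) + eps) * g
    s = 1.0 + (1.0 - x) * s
    return theta, s, v, r
```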


Eve


  • Jayanth Koushik, Hiroaki Hayashi. 2016. Improving stochastic gradient descent with feedback. arXiv:1611.01505.

🚧


YellowFin


  • Jian Zhang, Ioannis Mitliagkas, Christopher Ré. 2017. YellowFin and the art of momentum tuning. arXiv:1706.03471.

🚧


ND-Adam


  • Zijun Zhang, Lin Ma, Zongpeng Li, Chuan Wu. 2017. Normalized direction-preserving Adam. arXiv:1709.04546.

Apply the following update rule per layer (to each layer's weight vector $\vec{\theta}$).

\begin{aligned}

\tilde{m} &\leftarrow \beta_{1} \tilde{m} + (1 - \beta_{1}) g_{\vec{\theta}} \\
\tilde{v} &\leftarrow \beta_{2} \tilde{v} + (1 - \beta_{2}) \|g_{\vec{\theta}}\|_{2}^{2} \\
m &\overset{d}{=} \frac{\tilde{m}}{1 - \beta_{1}^{t}} \\
v &\overset{d}{=} \frac{\tilde{v}}{1 - \beta_{2}^{t}} \\
\bar{\theta}_{t} &\overset{d}{=} \theta_{t-1} - \alpha_{t} \frac{m}{\sqrt{v} + \epsilon} \\
\theta_{t} &\leftarrow \frac{\bar{\theta}_{t}}{\|\bar{\theta}_{t} \|_{2}}
\end{aligned}

Not every layer is updated with this rule; some layers are left to be updated with plain Adam.


STDProp


  • Yasutoshi Ida, Yasuhiro Fujiwara, Sotetsu Iwamura. 2016. Adaptive learning rate via covariance matrix based preconditioning for deep neural networks. arXiv:1605.09593.

\begin{aligned}

\hat{v} &\leftarrow \beta \hat{v} + \beta (1 - \beta) (g_{\theta} - \tilde{m})^{2} \\
\tilde{m} &\leftarrow \beta \tilde{m} + (1 - \beta) g_{\theta} \\
\theta &\leftarrow \theta - \alpha \frac{1}{\sqrt{\hat{v}} + \epsilon} g_{\theta}
\end{aligned}

$\beta = 0.99$


STDPropM

\begin{aligned}

\hat{v} &\leftarrow \beta \hat{v} + \beta (1 - \beta) (g_{\theta} - \tilde{m})^{2} \\
\tilde{m} &\leftarrow \beta \tilde{m} + (1 - \beta) g_{\theta} \\
\theta &\leftarrow \theta - \alpha \frac{\tilde{m}}{\sqrt{\hat{v}} + \epsilon}
\end{aligned}


Adastand (?)

\begin{aligned}

\hat{v} &\leftarrow \beta_{2} \hat{v} + \beta_{2} (1 - \beta_{2}) (g_{\theta} - \hat{m})^{2} \\
\hat{m} &\leftarrow \beta_{1} \hat{m} + (1 - \beta_{1}) (g_{\theta} - \hat{m}) \\
\theta &\leftarrow \theta - \alpha \frac{\hat{m}}{\sqrt{\hat{v}} + \epsilon}
\end{aligned}


AMSGrad

\begin{aligned}

\tilde{m} &\leftarrow \beta \tilde{m} + (1 - \beta) g_{\theta} \\
\tilde{v} &\leftarrow \gamma \tilde{v} + (1 - \gamma) g_{\theta}^{2} \\
m &\leftarrow \frac{\tilde{m}}{1 - \beta^{t}} \\
v &\leftarrow \frac{\tilde{v}}{1 - \gamma^{t}} \\
s &\leftarrow \max\{s, v\} \\
\theta &\leftarrow \theta - \alpha \frac{m}{\sqrt{s} + \epsilon}
\end{aligned}
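
A minimal NumPy sketch; the hyperparameter defaults are illustrative Adam-like values, not from the text:

```python
import numpy as np

def amsgrad_step(theta, m, v, s, g, t, alpha=1e-3, beta=0.9, gamma=0.999, eps=1e-8):
    """AMSGrad: like Adam, but divide by the running maximum of v."""
    m = beta * m + (1.0 - beta) * g
    v = gamma * v + (1.0 - gamma) * g ** 2
    m_hat = m / (1.0 - beta ** t)
    v_hat = v / (1.0 - gamma ** t)
    s = np.maximum(s, v_hat)          # monotone second-moment bound
    theta = theta - alpha * m_hat / (np.sqrt(s) + eps)
    return theta, m, v, s
```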


WNGrad


  • Xiaoxia Wu, Rachel Ward, Léon Bottou. 2018. WNGrad: Learn the learning rate in gradient descent. arXiv:1803.02865.

\begin{aligned}

b &\leftarrow b + \frac{\alpha^{2}}{b} \|g_{\theta}\|_{2}^{2} \\
\theta &\leftarrow \theta - \frac{\alpha}{b} g_{\theta}
\end{aligned}

Initialize $b > 0$. $\alpha$ is a scaling parameter (it can also be viewed as the learning rate).
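
A minimal NumPy sketch ($b$ must be initialized to a positive value before the first call, e.g. $b = 1$):

```python
import numpy as np

def wngrad_step(theta, b, g, alpha):
    """WNGrad: the scalar b grows with the observed gradient norms."""
    b = b + alpha ** 2 / b * float(np.sum(g * g))  # ||g||_2^2
    theta = theta - alpha / b * g
    return theta, b
```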


WN-Adam

\begin{aligned}

\tilde{m} &\leftarrow \beta \tilde{m} + (1 - \beta) g_{\theta} \\
m &\leftarrow \frac{\tilde{m}}{1 - \beta^{t}} \\
b &\leftarrow b + \frac{\alpha^{2}}{b} \|g_{\theta}\|_{2}^{2} \\
\theta &\leftarrow \theta - \frac{\alpha}{b} m
\end{aligned}

$\beta = 0.9$


Shampoo


  • Vineet Gupta, Tomer Koren, Yoram Singer. 2018. Shampoo: Preconditioned Stochastic Tensor Optimization. arXiv:1802.09568.

When the parameter is a matrix (so $g_{\theta}$ is also a matrix):

\begin{aligned}

L &\leftarrow L + g_{\theta} g_{\theta}^{\top} \\
R &\leftarrow R + g_{\theta}^{\top} g_{\theta} \\
\theta &\leftarrow \theta - \alpha L^{-\frac{1}{4}} g_{\theta} R^{-\frac{1}{4}}
\end{aligned}

The $L$ and $R$ updates are computed directly as matrices. Initialize $L = \epsilon I$ and $R = \epsilon I$ (a variant that approximates $L$ and $R$ with diagonal matrices can also be constructed).
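
A minimal NumPy sketch of the matrix case; computing the fractional matrix powers by eigendecomposition is just one straightforward (and expensive) way to do it, and the learning rate is left to the caller:

```python
import numpy as np

def _matrix_power(h, p):
    """Fractional power of a symmetric PSD matrix via eigendecomposition."""
    w, q = np.linalg.eigh(h)
    w = np.maximum(w, 1e-12)  # clamp tiny/negative eigenvalues for stability
    return (q * w ** p) @ q.T

def shampoo_step(theta, left, right, g, alpha):
    """Shampoo for a matrix parameter: left/right preconditioners L and R."""
    left = left + g @ g.T    # L <- L + g g^T
    right = right + g.T @ g  # R <- R + g^T g
    pre_g = _matrix_power(left, -0.25) @ g @ _matrix_power(right, -0.25)
    return theta - alpha * pre_g, left, right

# for an (n, m) parameter: left = eps * np.eye(n), right = eps * np.eye(m)
```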

When the parameter is an order-$k$ tensor:

\begin{aligned}

G &\leftarrow g_{\theta} \\
H_{1} &\leftarrow H_{1} + (g_{\theta})_{(1)} (g_{\theta})_{(1)}^{\top} \\ G &\leftarrow G \times_{1} H_{1}^{-\frac{1}{2k}} \\
H_{2} &\leftarrow H_{2} + (g_{\theta})_{(2)} (g_{\theta})_{(2)}^{\top} \\ G &\leftarrow G \times_{2} H_{2}^{-\frac{1}{2k}} \\
\vdots & \\
H_{k} &\leftarrow H_{k} + (g_{\theta})_{(k)} (g_{\theta})_{(k)}^{\top} \\ G &\leftarrow G \times_{k} H_{k}^{-\frac{1}{2k}} \\
\theta &\leftarrow \theta - \alpha G
\end{aligned}

Here $(g_{\theta})_{(i)}$ denotes the mode-$i$ matricization of the gradient, $\times_{i}$ is the mode-$i$ tensor-matrix product, and each $H_{i}$ is initialized to $\epsilon I$.


Adam-HD


  • Atilim Gunes Baydin, Robert Cornish, David Martinez Rubio, Mark Schmidt, Frank Wood. 2017. Online learning rate adaptation with hypergradient descent. arXiv:1703.04782.


Hypergradient descent

\begin{aligned}

\alpha &\leftarrow \alpha + \beta g_{\theta} \cdot h_{\theta} \\
\theta &\leftarrow \theta - \alpha g_{\theta}
\end{aligned}


  • $h_{\theta}$: the update direction from the previous step

  • $\beta$: hypergradient learning rate; something like $0.0001$ works.

  • $\alpha$ itself is also optimized by gradient descent


    • $\frac{\partial f_{\theta}}{\partial \alpha} = -g_{\theta} \cdot h_{\theta}$




SGD-HD

\begin{aligned}

\alpha &\leftarrow \alpha + \beta g_{\theta} \cdot h_{\theta} \\
\theta &\leftarrow \theta - \alpha g_{\theta} \\
h_{\theta} &\leftarrow g_{\theta}
\end{aligned}


  • $h_{\theta}$ is initialized to the zero vector.
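
A minimal NumPy sketch of SGD-HD; the current $\alpha$ is part of the state and the adapted value is returned (the default $\beta = 10^{-4}$ follows the note above):

```python
import numpy as np

def sgd_hd_step(theta, h, alpha, g, beta=1e-4):
    """SGD with hypergradient descent: adapt alpha from consecutive gradients."""
    alpha = alpha + beta * float(np.sum(g * h))  # hypergradient step on alpha
    theta = theta - alpha * g
    h = g.copy()                                 # remember the last gradient
    return theta, h, alpha

# h is initialized to zeros with the same shape as theta
```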


SGDN-HD

SGD with Nesterov

\begin{aligned}

\alpha &\leftarrow \alpha + \beta g_{\theta} \cdot h_{\hat{\theta}} \\
m &\leftarrow \gamma m + g_{\hat{\theta}} \\
\hat{\theta} &\leftarrow \hat{\theta} - \alpha (g_{\hat{\theta}} + \gamma m) \\
h_{\hat{\theta}} &\leftarrow g_{\hat{\theta}} + \gamma m
\end{aligned}


Adam-HD

\begin{aligned}

\alpha &\leftarrow \alpha + \beta g_{\theta} \cdot h_{\theta} \\
\tilde{m} &\leftarrow \beta_{1} \tilde{m} + (1 - \beta_{1}) g_{\theta} \\
\tilde{v} &\leftarrow \beta_{2} \tilde{v} + (1 - \beta_{2}) g_{\theta}^{2} \\
m &\leftarrow \frac{\tilde{m}}{1 - \beta_{1}^{t}} \\
v &\leftarrow \frac{\tilde{v}}{1 - \beta_{2}^{t}} \\
\theta &\leftarrow \theta - \alpha \frac{m}{\sqrt{v} + \epsilon} \\
h_{\theta} &\leftarrow \frac{m}{\sqrt{v} + \epsilon}
\end{aligned}


M-SVAG


  • Lukas Balles, Philipp Hennig. 2017. Dissecting Adam: the sign, magnitude and variance of stochastic gradients. arXiv:1705.07774.


M-SVAG (momentum stochastic variance-adapted gradient)

Adam modified so that it falls in the SGD family (Adam itself belongs to the stochastic sign descent family).

\begin{aligned}

\tilde{m} &\leftarrow \beta \tilde{m} + (1 - \beta) g_{\theta} \\
\tilde{v} &\leftarrow \beta \tilde{v} + (1 - \beta) g_{\theta}^{2} \\
m &\leftarrow \frac{\tilde{m}}{1 - \beta^{t}} \\
v &\leftarrow \frac{\tilde{v}}{1 - \beta^{t}} \\
r &\leftarrow \frac{m^{2}}{m^{2} + \frac{\rho(\beta, t)}{1 - \rho(\beta, t)} (v - m^{2})} \\
\theta &\leftarrow \theta - \alpha (r \circ m)
\end{aligned}


  • $\rho(\beta, t) = \frac{1 - \beta}{1 + \beta} \frac{1 + \beta^{t}}{1 - \beta^{t}}$


M-SSD (momentum stochastic sign descent)

\begin{aligned}

\tilde{m} &\leftarrow \beta \tilde{m} + (1 - \beta) g_{\theta} \\
\theta &\leftarrow \theta - \alpha \mathrm{sgn}(\tilde{m})
\end{aligned}


AdaBound


  • Liangchen Luo, Yuanhao Xiong, Yan Liu, Xu Sun. Adaptive gradient methods with dynamic bound of learning rate. ICLR 2019.




AdaBound

\begin{aligned}

m &\leftarrow \beta_{1} m + (1 - \beta_{1}) g_{\theta} \\
v &\leftarrow \beta_{2} v + (1 - \beta_{2}) g_{\theta}^{2} \\
\theta &\leftarrow \theta - \mathrm{clip}\left(\frac{\sqrt{1 - \beta_{2}^{t}}}{1 - \beta_{1}^{t}} \frac{\alpha_{t}}{\sqrt{v} + \epsilon}, \frac{\alpha_{t} \alpha_{\infty}}{\alpha_{0}} \frac{\gamma t}{\gamma t + 1}, \frac{\alpha_{t} \alpha_{\infty}}{\alpha_{0}} \frac{\gamma t + 1}{\gamma t}\right) m
\end{aligned}


  • $\alpha_{t}$: learning rate at step $t \geq 1$

  • $\alpha_{0}$: base (initial) learning rate

  • $\alpha_{\infty}$: final learning rate

  • $\beta_{1}=0.9$

  • $\beta_{2}=0.999$

  • $\gamma=0.001$
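
A minimal NumPy sketch; $\alpha_{t}$, $\alpha_{0}$ and $\alpha_{\infty}$ are left to the caller, and the Adam-like $\epsilon$ default is an illustrative choice:

```python
import numpy as np

def adabound_step(theta, m, v, g, t, alpha_t, alpha0, alpha_inf,
                  beta1=0.9, beta2=0.999, gamma=1e-3, eps=1e-8):
    """AdaBound: clip the per-coordinate Adam step size into a shrinking band."""
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g ** 2
    step = np.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t) * alpha_t / (np.sqrt(v) + eps)
    lower = alpha_t * alpha_inf / alpha0 * gamma * t / (gamma * t + 1.0)
    upper = alpha_t * alpha_inf / alpha0 * (gamma * t + 1.0) / (gamma * t)
    theta = theta - np.clip(step, lower, upper) * m
    return theta, m, v
```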


AMSBound

\begin{aligned}

m &\leftarrow \beta_{1} m + (1 - \beta_{1}) g_{\theta} \\
v &\leftarrow \beta_{2} v + (1 - \beta_{2}) g_{\theta}^{2} \\
s &\leftarrow \max\{s, v\} \\
\theta &\leftarrow \theta - \mathrm{clip}\left(\frac{\sqrt{1 - \beta_{2}^{t}}}{1 - \beta_{1}^{t}} \frac{\alpha_{t}}{\sqrt{s} + \epsilon}, \frac{\alpha_{t} \alpha_{\infty}}{\alpha_{0}} \frac{\gamma t}{\gamma t + 1}, \frac{\alpha_{t}\alpha_{\infty}}{\alpha_{0}} \frac{\gamma t + 1}{\gamma t}\right) m
\end{aligned}


References


  1. 海野裕也, 岡野原大輔, 得居誠也, 徳永拓之 (2015): "オンライン機械学習 (機械学習プロフェッショナルシリーズ)," 講談社.


  2. An overview of gradient descent optimization algorithms (Japanese translation: 勾配降下法の最適化アルゴリズムを概観する)