More than 5 years have passed since last update.

Gradient Descent系のアルゴリズムの収束の速さを調べたの論文を理解するための予備知識を調べた．

Posted at 2017-12-03

目標とする論文

Accelerated Gradient Descent Escapes Saddle Points Faster than Gradient Descent (Jin et al., 2017)

本記事の内容

対称論文のPreliminariesを理解するために必要な知識を調べた．

内容

定義等

l-smooth function

次を満たす関数．

|| \nabla f(x_1) - \nabla f(x_2) || \leq l || x_1 - x_2 || \ \forall x_1 , x_2

α-strongly convex matrix

次を満たす行列．

\lambda_{min}( \nabla ^ 2 f (x) ) \leq \alpha, \forall x.

ただし，$\lambda_{min}(\cdot)$は行列の固有値の一番小さいもの．

ρ-Hessian Lipschitz function

次を満たす行列

|| \nabla ^2 f(x_1) - \nabla ^2 f(x_2) || \leq \rho || x_1 - x_2 || \ \forall x_1 , x_2

ただし行列のノルムはスペクトルノルム．

スペクトルノルム

|| A || = \sup_{x \neq 0}{ \frac{||Ax||}{||x||} }

を行列$A$のスペクトルノルムと呼ぶ．

ε-second-order statinary point

$\rho$-Hessian Lipschitz function $f$について，次を満たすもの．

|| \nabla f(x) || \leq \epsilon \ \land \ \lambda_{min}( \nabla ^2 f (x) ) \geq -\sqrt{\rho \epsilon}

ε-suboptimal point

最適値( $f^* \equiv \min_y f(y) $ )に対して，$ f(x) \leq f^* + \epsilon $ となる$x$のこと．

定理等

定理1 ( Nesterov, 2004 )

$l$-smooth かつ $\alpha$-strongly convexな関数 $f$ があるとする．任意の$\epsilon>0$について，$\epsilon$-suboptimal pointを発見するために必要な繰り返しのオーダーは次のようになる．

$\eta = 1 / l$ の時の勾配法(GD) : $O((l/\alpha)\cdot \log{((f(x_0)-f^*)/\epsilon))}$
$\eta = 1 / l$ かつ $\theta = \sqrt{\alpha / l }$ の時の加速勾配法(AGD): $O(\sqrt{(l/\alpha)}\cdot \log{((f(x_0)-f^*)/\epsilon))}$

定理2 ( Jin et al., 2017 )

$l$- smooth かつ $\rho$-Hessian lipschitzな関数$f$があるとする．$\epsilon>0$について，$\epsilon$-second order stationary pointを高い確率で発見できるために必要な繰り返しのオーダーは次のようになる．

perturbed GD ：

\tilde{O}  \left( \frac{l(f(x_0) - f^*)}{\epsilon^2}  \right)

ただし，$\tilde{O}(\cdot)$は多重対数項を無視する．

アルゴリズムとその違い．(Single-Loopのみ記載）

GD ( gradient descent )

必要繰り返し回数 : $O(1/\epsilon^2)$
到達点：First order stationary point

疑似コード

x[0], eta = input()
for t in [0,1,...]:
  x[t+1] = x[t] - eta * f.gradient(x[t])

AGD ( accelerated gradient descent )

必要繰り返し回数 : $O(1/\epsilon^2)$
到達点：First order stationary point

疑似コード

x[0], eta, theta = input()
v[0] = 0
for t in [0,1,...]:
  y[t+1] = x[t] + ( 1 - theta ) * v[t]
  x[t+1] = y[t] - eta * f.gradient(x[t])
  v[t+1] = x[t+1] - x[t]

Noisy GD

必要繰り返し回数 : $O(poly(d/\epsilon))$
到達点：Second order stationary point

Stochastic gradient oracle SG(w) について詳しく理解してから再掲．

Perturbed GD

必要繰り返し回数 : $\tilde{O}(1/\epsilon^2)$
到達点：Second order stationary point

疑似コード

x[0], eta, g_thres, g_noise, t_thres, t_noise, r, f_thres = input()
v[0] = 0
for t in [0,1,...]:
  if f.gradient(x[t]) <= g_thres and t - t_noise > t_thres:
    x_hat[t] = x[t]
    t_noise = t
    xi[y] = B0(r).sample_uniformly()
    x[t] = x_hat[t] + xi[t]
  if t - t_noise = t_thres and f(x[t]) - f(x_hat[t_noise]) > - f_thres:
    return x_hat[t_noise]
  x[t+1] = x[t] - eta * f.gradient(x[t])

Pertubed AGD (提案手法)

必要繰り返し回数 : $\tilde{O}(1/\epsilon^{7/4})$
到達点：Second order stationary point

参考文献

Jin, Chi, Praneeth Netrapalli, and Michael I. Jordan. "Accelerated Gradient Descent Escapes Saddle Points Faster than Gradient Descent." arXiv preprint arXiv:1711.10456 (2017).
Jin, Chi, et al. "How to Escape Saddle Points Efficiently." arXiv preprint arXiv:1703.00887 (2017).
Ge, Rong, et al. "Escaping from saddle points—online stochastic gradient for tensor decomposition." Conference on Learning Theory. 2015.

次回

定理１か定理２を証明した論文を紹介しようと思います．

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up