[Review] Gradient Descent Variants ~ Optimisation Algorithms ~

Posted at 2018-08-12

Introduction

Gradient descent optimisation algorithms, while increasingly popular, are often used as black-box optimisers, especially when the actual implementation comes from a deep learning library. Practical explanations of their strengths and weaknesses are hard to come by. This article aims to provide the reader with intuitions about the behaviour of the different algorithms so that we can put them to use.

Original Paper: https://arxiv.org/pdf/1609.04747.pdf

Gradient Descent

Gradient descent is a way to minimise an objective function $J(\theta)$, parametrised by a model's parameters $\theta \in \mathbb{R}^d$, by updating the parameters in the direction opposite to the gradient of the objective function $\nabla_{\theta}J(\theta)$ with regard to the parameters. The learning rate $\eta$ determines the size of the steps we take to reach a (local) minimum.
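
As a minimal sketch of this update rule (the function, initial value, and learning rate below are illustrative choices, not from the paper), consider minimising $J(\theta) = \theta^2$, whose gradient is $2\theta$:

example
theta = 5.0  # illustrative initial parameter
eta = 0.1    # learning rate
for step in range(50):
  grad = 2 * theta            # analytic gradient of J(theta) = theta^2
  theta = theta - eta * grad  # step in the opposite direction of the gradient
print(theta)  # approaches 0, the minimum of J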

Variants

  1. Batch Gradient Descent
  2. Stochastic Gradient Descent
  3. Mini-Batch Gradient Descent

1. Batch Gradient Descent

Vanilla gradient descent, aka batch gradient descent, computes the gradient of the cost function w.r.t. the parameters $\theta$ for the entire training dataset at once.

\theta = \theta - \eta\nabla_{\theta}J(\theta)

Since we need to calculate the gradient over the entire dataset to perform a single update, batch gradient descent is much slower than the other algorithms explained later, and it cannot be used for online learning, i.e. updating the model with new examples on the fly.

pseudocode
for i in range(epochs):
  # gradient of the loss over the entire training dataset
  params_grad = evaluate_gradient(loss_func, data, params)
  # single update step, scaled by the learning rate
  params = params - learning_rate * params_grad
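
The pseudocode leaves `evaluate_gradient` and `loss_func` abstract. As a hedged sketch of what they might look like, here is one possible pair for mean-squared-error linear regression; these definitions are an assumption for illustration, not part of the original article:

example
import numpy as np

def loss_func(params, X, y):
  # mean squared error of a linear model (illustrative choice)
  return np.mean((X @ params - y) ** 2)

def evaluate_gradient(loss_func, data, params):
  # analytic MSE gradient; assumes `data` is an (X, y) pair.
  # loss_func is accepted only to mirror the pseudocode's signature.
  X, y = data
  return 2 * X.T @ (X @ params - y) / len(y)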

2. Stochastic Gradient Descent

SGD, in contrast, performs a parameter update for each training example $x^{(i)}$ and label $y^{(i)}$.

\theta = \theta - \eta\nabla_{\theta}J(\theta; x^{(i)}; y^{(i)})

Because it computes the error/gradient and updates the parameters for each row of the dataset, it can also be used for online learning. However, these very frequent updates have high variance, which causes the objective function to fluctuate heavily, as shown below. On the other hand, this fluctuation also enables the algorithm to jump to new and potentially better local optima.

(Figure: the objective function fluctuating heavily under SGD's high-variance updates)

pseudocode
import numpy as np

for i in range(epochs):
  # shuffle so that examples are visited in a random order each epoch
  np.random.shuffle(data)
  for row in data:
    # gradient of the loss for a single training example
    params_grad = evaluate_gradient(loss_func, row, params)
    params = params - learning_rate * params_grad

3. Mini-Batch Gradient Descent

Mini-batch gradient descent finally takes the best of both worlds and performs an update for every mini-batch of $n$ training examples.

\theta = \theta - \eta\nabla_{\theta}J(\theta; x^{(i:i+n)}; y^{(i:i+n)})

This brings the benefits shown below.
It...
1. Reduces the variance of the parameter updates, which can lead to more stable convergence.
2. Can make use of the highly optimised matrix operations common to state-of-the-art deep learning libraries. In practice, common mini-batch sizes range between 50 and 256.

pseudocode
for i in range(epochs):
  np.random.shuffle(data)
  # iterate over mini-batches of 50 examples each
  for batch in get_batches(data, batch_size=50):
    params_grad = evaluate_gradient(loss_func, batch, params)
    params = params - learning_rate * params_grad
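
The `get_batches` helper is not defined in the original pseudocode; a minimal sketch of what it could be (an assumption for illustration):

example
def get_batches(data, batch_size=50):
  # yield consecutive mini-batches; assumes `data` supports slicing,
  # e.g. a list or a numpy array of rows
  for start in range(0, len(data), batch_size):
    yield data[start:start + batch_size]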

Challenges

  • Choosing a proper learning rate is difficult: too small a value makes convergence painfully slow, while too large a value can make the loss fluctuate or even diverge.
  • Learning rate schedules try to adjust the learning rate during training, but the schedule has to be defined in advance and cannot adapt to the characteristics of the dataset.
  • The same learning rate applies to all parameter updates, even though we may want to take differently sized steps for different parameters.

In the following, we will outline some algorithms that are widely used by the deep learning community to deal with the aforementioned challenges.

Advanced Techniques

  1. Momentum Gradient Descent
  2. Nesterov accelerated Gradient
  3. Adagrad
  4. Adadelta
  5. RMSProp
  6. Adam
  7. AdaMax
  8. Nadam

1. Momentum Gradient Descent

SGD has trouble navigating ravines, i.e. areas where the surface curves much more steeply in one dimension than in another, which are common around local optima. In such ravines, SGD oscillates across the steep slopes and makes only hesitant progress along the bottom towards the local optimum.
(Figure: SGD oscillating across the slopes of a ravine, making slow progress along the bottom)

In contrast, momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations, as can be seen below.
(Figure: SGD with momentum, showing reduced oscillation and faster progress towards the optimum)

v_t = \gamma v_{t-1} + \eta\nabla_{\theta}J(\theta)\\
\theta = \theta - v_t

where the momentum term $\gamma$ is usually set to around 0.9.
Essentially, we can think of momentum as accumulated velocity. When we push a ball down a hill, it accumulates momentum as it rolls downhill, becoming faster and faster on the way (until it reaches its terminal velocity). The same effect applies to our parameter updates. As a result, we gain faster convergence and reduced oscillation.
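
A minimal sketch of the two momentum equations, reusing the names from the pseudocode above (the gamma and eta values are illustrative):

example
gamma = 0.9                # momentum term
eta = 0.01                 # learning rate (illustrative value)
v = np.zeros_like(params)  # velocity, v_0 = 0

for i in range(epochs):
  grad = evaluate_gradient(loss_func, data, params)  # gradient at the current position
  v = gamma * v + eta * grad                         # v_t = gamma * v_{t-1} + eta * grad
  params = params - v                                # theta = theta - v_t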

2. Nesterov accelerated Gradient

NAG (Nesterov accelerated gradient) gives us a smarter update method, one that is aware of the direction in which it is heading and is hence able to slow down before the hill slopes up again.

v_t = \gamma v_{t-1} + \eta\nabla_{\theta}J(\theta - \gamma v_{t-1})\\
\theta = \theta - v_t

Computing $\theta - \gamma v_{t-1}$ gives us an approximation of the next position of the parameters, so we effectively compute the gradient not with regard to our current parameters but with regard to their approximate future position.
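
The only change from momentum is where the gradient is evaluated; a sketch reusing the names above (again an illustration, not the paper's reference code):

example
for i in range(epochs):
  lookahead = params - gamma * v                        # approximate future position of the parameters
  grad = evaluate_gradient(loss_func, data, lookahead)  # gradient at the look-ahead point
  v = gamma * v + eta * grad                            # v_t = gamma * v_{t-1} + eta * grad
  params = params - v                                   # theta = theta - v_t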
