# Overview of different Optimizers for neural networks

Source: Deep Learning on Medium

In this post we will start by understanding the objective of machine learning algorithms and how gradient descent helps achieve that objective. We will then look at the role of optimizers in neural networks and explore different optimizers: Momentum, Nesterov, Adagrad, Adadelta, RMSProp, Adam and Nadam.

### Objective of Machine Learning algorithm

The goal of machine learning and deep learning is to reduce the difference between the predicted output and the actual output. This difference is measured by the cost function (C), also called the loss function. Simple models such as linear regression have convex cost functions, but the cost functions of deep neural networks are generally non-convex.

Our goal is to minimize the cost function by finding optimized values for the weights. We also need to ensure that the algorithm generalizes well, so that it makes good predictions on data it has not seen before.

To achieve this we run multiple iterations, adjusting the weights at each iteration so that the cost moves toward its minimum. This is gradient descent.

Gradient descent is an iterative optimization algorithm used in machine learning to reduce the cost function. A lower cost helps models make more accurate predictions.

The gradient indicates the direction of steepest increase. As we want to find the minimum point in the valley, we need to go in the opposite direction of the gradient. We update the parameters in the negative gradient direction to minimize the loss.
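
To make the update rule concrete, here is a minimal sketch of plain gradient descent on a single parameter. The cost function `(theta - 3)^2`, the learning rate and the function names are assumptions chosen for illustration; they are not from the original post.

```python
def gradient_descent(grad, theta, lr=0.1, steps=100):
    """Plain gradient descent on a single scalar parameter."""
    for _ in range(steps):
        # Step in the direction opposite to the gradient.
        theta = theta - lr * grad(theta)
    return theta


def cost_grad(theta):
    """Gradient of the hypothetical cost C(theta) = (theta - 3)**2."""
    return 2.0 * (theta - 3.0)


theta_min = gradient_descent(cost_grad, theta=0.0)
print(theta_min)  # ends very close to 3.0, the minimum of the cost
```

Each iteration shrinks the distance to the minimum by a constant factor here, because the cost is a simple quadratic.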

The common types of gradient descent are batch gradient descent, stochastic gradient descent (SGD) and mini-batch gradient descent.

### Role of an optimizer

Optimizers update the weight parameters to minimize the loss function. The loss function acts as a guide to the terrain, telling the optimizer whether it is moving in the right direction to reach the bottom of the valley, the global minimum.

### Types of Optimizers

#### Momentum

Momentum is like a ball rolling downhill. The ball gains momentum as it rolls down the hill.

Momentum helps accelerate gradient descent (GD) on surfaces that curve more steeply in one direction than in another. It also dampens oscillations.

To update the weights, momentum uses the gradient of the current step together with an exponentially decaying accumulation of the gradients from the previous time steps. This helps us move faster towards convergence.

Convergence is faster when we apply the momentum optimizer to such unevenly curved surfaces.
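
The momentum update can be sketched as follows. This is a minimal illustration, not a reference implementation: the elongated bowl `0.5 * (x**2 + 25 * y**2)`, the momentum coefficient `gamma` and the learning rate are assumed values.

```python
def momentum_descent(grad, theta, lr=0.02, gamma=0.9, steps=300):
    """Gradient descent with momentum on a list of parameters."""
    velocity = [0.0] * len(theta)
    for _ in range(steps):
        g = grad(theta)
        # Velocity is a decaying sum of past gradients plus the current one.
        velocity = [gamma * v + lr * gi for v, gi in zip(velocity, g)]
        theta = [t - v for t, v in zip(theta, velocity)]
    return theta


def bowl_grad(theta):
    """Gradient of the hypothetical cost 0.5 * (x**2 + 25 * y**2)."""
    x, y = theta
    return [x, 25.0 * y]


theta_final = momentum_descent(bowl_grad, [4.0, 4.0])
print(theta_final)  # both coordinates end close to 0
```

The cost is 25 times steeper along `y` than along `x`, which is the kind of surface where the accumulated velocity helps along the shallow direction.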

#### Nesterov accelerated gradient (NAG)

Nesterov accelerated gradient (NAG) is like a ball rolling down the hill that knows when to slow down before the slope of the hill rises again.

We calculate the gradient not with respect to the current position but with respect to the approximate future position. We evaluate the gradient at this look-ahead position and then update the weights.

NAG is like going down the hill while being able to look ahead. This way we can make our descent faster. It works slightly better than standard Momentum.
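
A minimal sketch of the look-ahead idea is shown here. The only change from a standard momentum update is where the gradient is evaluated. The test surface and the hyper-parameter values are assumptions for illustration.

```python
def nag_descent(grad, theta, lr=0.02, gamma=0.9, steps=300):
    """Nesterov accelerated gradient on a list of parameters."""
    velocity = [0.0] * len(theta)
    for _ in range(steps):
        # Look ahead to where the current velocity would carry us.
        lookahead = [t - gamma * v for t, v in zip(theta, velocity)]
        g = grad(lookahead)  # gradient at the future position
        velocity = [gamma * v + lr * gi for v, gi in zip(velocity, g)]
        theta = [t - v for t, v in zip(theta, velocity)]
    return theta


def bowl_grad(theta):
    """Gradient of the hypothetical cost 0.5 * (x**2 + 25 * y**2)."""
    x, y = theta
    return [x, 25.0 * y]


theta_final = nag_descent(bowl_grad, [4.0, 4.0])
print(theta_final)  # both coordinates end close to 0
```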

We need to tune the learning rate in Momentum and NAG, which is an expensive process.

#### Adagrad

Adagrad is well suited to sparse data, as found in large-scale neural networks. GloVe word embeddings were trained with Adagrad, since infrequent words require larger updates and frequent words require smaller updates.

For SGD, Momentum and NAG we update all parameters θ at once, using the same learning rate η. In Adagrad we use a different learning rate for every parameter θ at every time step t.

In the denominator of the Adagrad update we accumulate the sum of the squares of the past gradients. Each term is positive, so the sum keeps growing, making the effective learning rate so small that the algorithm is no longer able to learn. Adadelta, RMSProp and Adam try to resolve Adagrad’s radically diminishing learning rates.
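
The per-parameter scaling can be sketched as below. This is an illustrative version under assumed settings: the cost surface, the learning rate of 0.5 and the small constant `eps` are not from the original post.

```python
import math


def adagrad(grad, theta, lr=0.5, eps=1e-8, steps=500):
    """Adagrad: each parameter gets its own shrinking learning rate."""
    accum = [0.0] * len(theta)  # running sum of squared gradients per parameter
    for _ in range(steps):
        g = grad(theta)
        accum = [a + gi * gi for a, gi in zip(accum, g)]
        # Divide the learning rate by the root of the accumulated squares.
        theta = [t - lr / (math.sqrt(a) + eps) * gi
                 for t, a, gi in zip(theta, accum, g)]
    return theta, accum


def bowl_grad(theta):
    """Gradient of the hypothetical cost 0.5 * (x**2 + 25 * y**2)."""
    x, y = theta
    return [x, 25.0 * y]


theta_final, accum = adagrad(bowl_grad, [4.0, 4.0])
print(theta_final)  # both coordinates end close to 0
print(accum)        # the steeper y direction has accumulated far more
```

Because `accum` only ever grows, the effective step size `lr / sqrt(accum)` can only shrink, which is the diminishing learning rate described above.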

#### Adadelta

• Adadelta reduces Adagrad’s diminishing learning rate by restricting the window of accumulated past gradients to some fixed size w. Rather than storing w previous squared gradients, it keeps a running average, which at time t depends only on the previous average and the current gradient
• In Adadelta we do not need to set a default learning rate, because the update uses the ratio of the running RMS of the previous parameter updates to the running RMS of the gradients
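
A rough sketch of the Adadelta update on a single parameter is given here, assuming the usual formulation with a decay rate `rho` and a small constant `eps`. The cost `theta**2` and all values are chosen only for illustration. Note that no learning rate appears anywhere.

```python
import math


def adadelta(grad, theta, rho=0.95, eps=1e-6, steps=50):
    """Adadelta on a single scalar parameter; returns the visited values."""
    avg_sq_grad = 0.0   # running average of squared gradients
    avg_sq_delta = 0.0  # running average of squared parameter updates
    history = [theta]
    for _ in range(steps):
        g = grad(theta)
        avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g * g
        # Ratio of RMS of past updates to RMS of gradients replaces the learning rate.
        delta = -math.sqrt(avg_sq_delta + eps) / math.sqrt(avg_sq_grad + eps) * g
        avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta * delta
        theta += delta
        history.append(theta)
    return history


# Hypothetical cost C(theta) = theta**2, with gradient 2 * theta.
history = adadelta(lambda th: 2.0 * th, theta=1.0)
print(history[0], history[-1])  # the parameter moves from 1.0 toward 0
```

With these settings the first steps are small, because the running average of past updates starts at zero and has to build up.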

#### RMSProp

• RMSProp is Root Mean Square Propagation. It was devised by Geoffrey Hinton.
• In RMSProp the learning rate is adjusted automatically, with a different effective learning rate for each parameter.
• RMSProp divides the learning rate by the square root of an exponentially decaying average of squared gradients
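
The RMSProp points above can be sketched as follows. The decay rate of 0.9, the learning rate and the test surface are assumed values for illustration.

```python
import math


def rmsprop(grad, theta, lr=0.01, decay=0.9, eps=1e-8, steps=1000):
    """RMSProp on a list of parameters."""
    avg_sq_grad = [0.0] * len(theta)
    for _ in range(steps):
        g = grad(theta)
        # Exponentially decaying average of squared gradients, per parameter.
        avg_sq_grad = [decay * a + (1 - decay) * gi * gi
                       for a, gi in zip(avg_sq_grad, g)]
        theta = [t - lr / (math.sqrt(a) + eps) * gi
                 for t, a, gi in zip(theta, avg_sq_grad, g)]
    return theta


def bowl_grad(theta):
    """Gradient of the hypothetical cost 0.5 * (x**2 + 25 * y**2)."""
    x, y = theta
    return [x, 25.0 * y]


theta_final = rmsprop(bowl_grad, [1.0, 1.0])
print(theta_final)  # both coordinates end near 0
```

Unlike a running sum, the decaying average forgets old gradients, so the effective learning rate does not shrink toward zero over time.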

#### Adam

• Adam (Adaptive Moment Estimation) is another method that calculates an individual adaptive learning rate for each parameter from estimates of the first and second moments of the gradients.
• Adam can be viewed as a combination of Adagrad, which works well on sparse gradients, and RMSProp, which works well in on-line and non-stationary settings.
• Adam is computationally efficient and has very little memory requirement
• Adam is one of the most popular gradient descent optimization algorithms

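The Adam algorithm first updates the exponential moving averages of the gradient ($m_t$) and the squared gradient ($v_t$), which are estimates of the first and second moments of the gradient.

The hyper-parameters $\beta_1, \beta_2 \in [0, 1)$ control the exponential decay rates of these moving averages, where $g_t$ is the gradient at time step $t$:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

The moving averages are initialized as 0, so the moment estimates are biased towards 0, especially during the initial time steps. This initialization bias is easily counteracted, giving the bias-corrected estimates:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

Finally, we update the parameters, where $\eta$ is the learning rate and $\epsilon$ is a small constant that prevents division by zero:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$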
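
Putting the Adam steps together, a minimal sketch might look like this. The default values for `beta1`, `beta2` and `eps` follow the ones commonly quoted from the Adam paper; the learning rate, the step count and the test surface are assumptions for illustration.

```python
import math


def adam(grad, theta, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Adam on a list of parameters."""
    m = [0.0] * len(theta)  # moving average of gradients (first moment)
    v = [0.0] * len(theta)  # moving average of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
        # Bias correction counteracts the zero initialization of m and v.
        m_hat = [mi / (1 - beta1 ** t) for mi in m]
        v_hat = [vi / (1 - beta2 ** t) for vi in v]
        theta = [th - lr * mh / (math.sqrt(vh) + eps)
                 for th, mh, vh in zip(theta, m_hat, v_hat)]
    return theta


def bowl_grad(theta):
    """Gradient of the hypothetical cost 0.5 * (x**2 + 25 * y**2)."""
    x, y = theta
    return [x, 25.0 * y]


theta_final = adam(bowl_grad, [1.0, 1.0])
print(theta_final)  # both coordinates end near 0
```

Only two extra lists the size of the parameters are stored, which is why the memory requirement stays small.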

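#### Nadam

• Nadam (Nesterov-accelerated Adaptive Moment Estimation) combines Adam with Nesterov momentum
• The learning process is accelerated by combining the exponentially decaying moving average of the previous gradients with the current gradient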
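
As a rough sketch of how the moving average of previous gradients and the current gradient can be combined in one step, here is a Nadam-style update. It assumes the commonly used simplified formulation, in which the bias-corrected first moment is blended with the current gradient; the hyper-parameter values and the test surface are illustrative assumptions.

```python
import math


def nadam(grad, theta, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Nadam-style update (simplified form) on a list of parameters."""
    m = [0.0] * len(theta)
    v = [0.0] * len(theta)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
        m_hat = [mi / (1 - beta1 ** t) for mi in m]
        v_hat = [vi / (1 - beta2 ** t) for vi in v]
        # Nesterov-style blend of the moving average and the current gradient.
        blended = [beta1 * mh + (1 - beta1) * gi / (1 - beta1 ** t)
                   for mh, gi in zip(m_hat, g)]
        theta = [th - lr * b / (math.sqrt(vh) + eps)
                 for th, b, vh in zip(theta, blended, v_hat)]
    return theta


def bowl_grad(theta):
    """Gradient of the hypothetical cost 0.5 * (x**2 + 25 * y**2)."""
    x, y = theta
    return [x, 25.0 * y]


theta_final = nadam(bowl_grad, [1.0, 1.0])
print(theta_final)  # both coordinates end near 0
```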

In the diagram below we can see how different optimizers converge to the minimum. Adagrad, Adadelta and RMSProp head off immediately in the right direction and converge. Momentum and NAG are led off-track, evoking the image of a ball rolling down the hill, although NAG corrects itself quickly.

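#### References

Diederik P. Kingma, Jimmy Ba. Adam: A Method for Stochastic Optimization.

CS231n: Convolutional Neural Networks for Visual Recognition (course notes). http://cs231n.github.io/neural-networks-3/

Sebastian Ruder. An overview of gradient descent optimization algorithms. https://arxiv.org/pdf/1609.04747.pdf

Yann LeCun, Léon Bottou, Genevieve B. Orr, Klaus-Robert Müller. Efficient BackProp. http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf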