Gradient Descent with Momentum

Source: Deep Learning on Medium

Hi,

This article covers an optimization technique in deep learning called gradient descent with momentum.

It explains what momentum is and why it is a useful optimization technique. Before that, we look at the exponential moving average (EMA).

Exponential Moving Average (EMA):

An exponential moving average (EMA) is a type of moving average (MA) that places a greater weight and significance on the most recent data points. It is also referred to as the exponentially weighted moving average. For price data, an EMA reacts more strongly to recent price changes than a simple moving average (SMA), which applies an equal weight to all observations in the period.

EMA(1) = β * EMA(0) + (1 - β) * Price(1)

EMA(2) = β * EMA(1) + (1 - β) * Price(2), and so on,

where β is a weight parameter that ranges between 0 and 1.

  • The EMA is a moving average that places a greater weight and significance on the most recent data points.
  • Like all moving averages, this technical indicator is used to produce buy and sell signals based on crossovers and divergences from the historical average.
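The recurrence above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `exponential_moving_average`, the `initial` argument (standing in for EMA(0)) and the sample prices are assumptions made for the example, not something taken from the original formulas.

```python
def exponential_moving_average(values, beta=0.9, initial=0.0):
    """Return the running EMA for each value, starting from `initial` (EMA(0))."""
    ema = initial
    history = []
    for value in values:
        # EMA(t) = beta * EMA(t-1) + (1 - beta) * value(t)
        ema = beta * ema + (1 - beta) * value
        history.append(ema)
    return history


# Hypothetical prices, with EMA(0) set to the first price.
prices = [10.0, 12.0, 11.0, 13.0]
print(exponential_moving_average(prices, beta=0.5, initial=prices[0]))
# prints [10.0, 11.0, 11.0, 12.0]
```

With β = 0.5 each new value counts as much as the whole history; a β closer to 1 makes the average react more slowly.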

This method is generally used on time series data such as prices. We will apply the same idea to gradient descent with momentum.


First, let us look at the problem that occurs with plain gradient descent.

Consider an example where we are trying to optimize a cost function which has elliptical contours like the ones below, and the red dot denotes the position of the local optimum (minimum).

We start gradient descent from point ‘A’ and after one iteration we may end up at point ‘B’, on the other side of the ellipse. Another step may then take us to point ‘C’. With each iteration of gradient descent, we move towards the local optimum with up and down oscillations. If we use a larger learning rate, the vertical oscillations will have a higher magnitude. So these vertical oscillations slow down gradient descent and prevent us from using a much larger learning rate.
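To make the oscillation concrete, here is a small sketch on a made-up cost function. The cost f(w) = 0.5 * (w[0]**2 + 25 * w[1]**2), the starting point and the learning rate are all assumptions chosen so that the contours are ellipses stretched along the horizontal axis; they are not from the original example.

```python
def grad(w):
    """Gradient of the assumed cost f(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)."""
    return [w[0], 25.0 * w[1]]


def gradient_descent(w, learning_rate, steps):
    """Plain gradient descent; returns every visited point."""
    path = [list(w)]
    for _ in range(steps):
        g = grad(w)
        w = [w[0] - learning_rate * g[0], w[1] - learning_rate * g[1]]
        path.append(list(w))
    return path


# w[0] is the horizontal coordinate, w[1] the vertical one.
path = gradient_descent([-10.0, 1.0], learning_rate=0.07, steps=5)
for horizontal, vertical in path:
    print(f"horizontal={horizontal:8.4f}  vertical={vertical:8.4f}")
```

In this sketch the vertical coordinate changes sign at every step while the horizontal coordinate only creeps towards zero, which is the zig-zag pattern described above.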


In the vertical direction we want slow learning.

In the horizontal direction we want fast learning.


To achieve this, we introduce momentum into gradient descent.

The diagram above shows that the gradients swing up and down in the vertical direction. We want slower learning in that direction, so we cancel out the oscillating part of the gradients dw and db by applying an exponentially weighted average.

By using the exponentially weighted average values of dw and db, we tend to average out the oscillations in the vertical direction closer to zero as they are in both directions (positive and negative).

In the horizontal direction, on the other hand, all the derivatives point to the right, so the average in the horizontal direction will still be pretty big. This lets our algorithm take a more direct path towards the local optimum while damping out the vertical oscillations. For this reason, the algorithm reaches the local optimum in fewer iterations.
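A quick way to see this averaging effect is to smooth two made-up gradient sequences: one that flips sign at every step (like the vertical direction) and one that always points the same way (like the horizontal direction). The helper name `smooth` and both sequences are assumptions for illustration only.

```python
def smooth(gradients, beta=0.9):
    """Return the final exponentially weighted average of a gradient sequence."""
    v = 0.0
    for g in gradients:
        v = beta * v + (1 - beta) * g
    return v


vertical = [1.0, -1.0] * 25   # sign flips at every step
horizontal = [1.0] * 50       # always points in the same direction

print("vertical average:  ", smooth(vertical))
print("horizontal average:", smooth(horizontal))
```

The averaged vertical gradient ends up close to zero, while the averaged horizontal gradient stays close to 1.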

How to Implement?

During backward propagation, we use dw and db to update our parameters W and b as follows:

W = W - learning rate * dw

b = b - learning rate * db

In momentum, instead of using dw and db directly at each iteration, we take the exponentially weighted averages of dw and db.

Vdw = β * Vdw + (1 - β) * dw

Vdb = β * Vdb + (1 - β) * db

Here,

dw = acceleration,

Vdw = velocity.

Combining the two gives momentum. The analogy comes from physics: if you drop a coin into a bowl, it picks up acceleration and velocity, and the resulting momentum carries it towards the center (minimum) point.

β (beta) is another hyperparameter, called momentum, that ranges from 0 to 1. It sets the weight between the average of the previous values and the current value when calculating the new weighted average.

After calculating exponentially weighted averages, we will update our parameters.

W = W - learning rate * Vdw

b = b - learning rate * Vdb
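Putting the two steps together, a minimal sketch of the update loop could look like the following. It runs on an assumed toy cost f(w) = 0.5 * (w[0]**2 + 25 * w[1]**2) with a single parameter vector `w` in place of W and b; the function names, the starting point and the learning rate of 0.09 are hypothetical choices for the demonstration.

```python
def grad(w):
    """Gradient of the assumed cost f(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)."""
    return [w[0], 25.0 * w[1]]


def plain_descent(w, learning_rate, steps):
    """W = W - learning_rate * dw"""
    for _ in range(steps):
        g = grad(w)
        w = [w_i - learning_rate * g_i for w_i, g_i in zip(w, g)]
    return w


def momentum_descent(w, learning_rate, beta=0.9, steps=200):
    """Vdw = beta * Vdw + (1 - beta) * dw, then W = W - learning_rate * Vdw."""
    v = [0.0 for _ in w]  # velocity starts at zero
    for _ in range(steps):
        g = grad(w)
        v = [beta * v_i + (1 - beta) * g_i for v_i, g_i in zip(v, g)]
        w = [w_i - learning_rate * v_i for w_i, v_i in zip(w, v)]
    return w


start = [-10.0, 1.0]
print("plain:   ", plain_descent(start, learning_rate=0.09, steps=200))
print("momentum:", momentum_descent(start, learning_rate=0.09, steps=200))
```

On this toy cost, a learning rate of 0.09 makes plain gradient descent blow up in the steep (vertical) direction, while the momentum version settles near the minimum at [0, 0]. Other cost functions and learning rates can of course behave differently.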

How to select beta?

  • A higher momentum (β) smooths the updates more, because it gives more weight to the past gradients.
  • The recommended default value is β = 0.9, but if required it can be tuned between 0.8 and 0.999.
  • Momentum takes past gradients into account to smooth out the steps of gradient descent. It can be applied with batch gradient descent, mini-batch gradient descent or stochastic gradient descent.
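A common rule of thumb for reading these values is that an exponentially weighted average with parameter β behaves roughly like an average over the last 1 / (1 - β) gradients. The sketch below just evaluates that approximation for the range mentioned above; the helper name `effective_window` is an assumption, and the result is a rough guide rather than an exact window.

```python
def effective_window(beta):
    """Rule-of-thumb number of recent gradients the average spans: 1 / (1 - beta)."""
    return 1.0 / (1.0 - beta)


for beta in (0.8, 0.9, 0.999):
    print(f"beta={beta}: roughly the last {effective_window(beta):.0f} gradients")
```

So β = 0.9 averages over about 10 recent gradients, while β = 0.999 averages over about 1000, which is much smoother but also slower to react.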

Hope this is helpful!

Source: Deep Learning, Andrew Ng