Gradient Descent with Momentum

Original article was published on Deep Learning on Medium

Gradient descent with momentum usually converges faster than standard gradient descent. The basic idea is to calculate an exponentially weighted average of your gradients and then use that average, instead of the raw gradient, to update your weights.

How does it work?

Consider an example where we are trying to optimize a cost function that has elliptical contours, and the red dot denotes the location of the local optimum (minimum).

We start gradient descent from point ‘A’ and, after one iteration, end up at point ‘B’ on the other side of the ellipse. Another step of gradient descent may then end at point ‘C’. With each iteration we step towards the local optimum, but with oscillations up and down. If we use a higher learning rate, the vertical oscillations become larger. These vertical oscillations therefore slow down gradient descent and prevent us from using a much higher learning rate.

By using exponentially weighted averages of dW and db, we average the oscillations in the vertical direction towards zero, because they alternate between positive and negative values. In the horizontal direction, by contrast, all the derivatives point to the right, so the average in that direction stays quite large. This lets the algorithm take a straighter path towards the local optimum while damping out the vertical oscillations. As a result, the algorithm reaches the local optimum in fewer iterations.
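
The averaging effect described above can be checked on made-up numbers. The sketch below is only an illustration: the two gradient sequences (one that flips sign every step, one that always points the same way) and the helper name ewa are assumptions, not part of the original article.

```python
def ewa(values, beta=0.9):
    """Exponentially weighted average of a sequence, starting from 0."""
    v = 0.0
    averaged = []
    for g in values:
        v = beta * v + (1 - beta) * g
        averaged.append(v)
    return averaged

# Hypothetical gradient components over 50 iterations (illustrative only).
vertical = [(-1.0) ** t for t in range(50)]  # flips sign on every step
horizontal = [1.0] * 50                      # always points the same way

v_vertical = ewa(vertical)
v_horizontal = ewa(horizontal)

# The oscillating component is averaged close to zero,
# while the consistent component stays close to 1.
print(abs(v_vertical[-1]), v_horizontal[-1])
```

With β = 0.9 the averaged vertical component stays small while the averaged horizontal component approaches the raw gradient value, which is the behaviour the paragraph above relies on.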


We use dW and db, computed during backward propagation, to update our parameters W and b as follows:

W = W - learning rate * dW

b = b - learning rate * db

In momentum, instead of using dW and db directly at each iteration, we take their exponentially weighted averages:

VdW = β * VdW + (1 - β) * dW

Vdb = β * Vdb + (1 - β) * db

Here β (beta) is an additional hyperparameter called momentum, ranging from 0 to 1. It sets the weighting between the average of previous values and the current value when calculating the new weighted average.

We’ll update our parameters after calculating the exponentially weighted averages.

W = W - learning rate * VdW

b = b - learning rate * Vdb
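
The two update rules can be compared side by side with a minimal sketch. Everything specific here is an assumption chosen for illustration: the toy cost J(W, b) = 0.5 * (W² + 25 * b²), whose contours are elongated ellipses, the starting point (10, 1), the learning rate 0.1, and the function names. It is not the article's own implementation.

```python
def grad(W, b):
    # Gradient of the hypothetical cost J(W, b) = 0.5 * (W**2 + 25 * b**2).
    # The b direction is much steeper, so it plays the "vertical" role.
    return W, 25.0 * b

def plain_gd(W, b, learning_rate, steps):
    for _ in range(steps):
        dW, db = grad(W, b)
        W = W - learning_rate * dW
        b = b - learning_rate * db
    return W, b

def momentum_gd(W, b, learning_rate, beta, steps):
    VdW, Vdb = 0.0, 0.0  # the averages start at zero
    for _ in range(steps):
        dW, db = grad(W, b)
        # Exponentially weighted averages of the gradients.
        VdW = beta * VdW + (1 - beta) * dW
        Vdb = beta * Vdb + (1 - beta) * db
        # Update the parameters with the averages, not the raw gradients.
        W = W - learning_rate * VdW
        b = b - learning_rate * Vdb
    return W, b

W_plain, b_plain = plain_gd(10.0, 1.0, learning_rate=0.1, steps=300)
W_mom, b_mom = momentum_gd(10.0, 1.0, learning_rate=0.1, beta=0.9, steps=300)

print("plain:   ", W_plain, b_plain)
print("momentum:", W_mom, b_mom)
```

On this particular toy cost, a learning rate of 0.1 is too high for the plain update: the steep b direction overshoots further on every step and blows up. The momentum version with the same learning rate settles near the minimum at (0, 0). Other cost functions and learning rates will behave differently.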

How to choose Beta?

  • A higher momentum (beta) gives more weight to past gradients and therefore smooths the updates more.
  • The suggested default is β = 0.9, but it can be tuned between 0.8 and 0.999 if needed.
  • Momentum takes past gradients into account to smooth out the gradient steps. It can be used with batch gradient descent, mini-batch gradient descent or stochastic gradient descent.
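
One way to get a feel for the trade-off when choosing β is to measure how long the weighted average takes to catch up with a gradient that suddenly becomes constant. This is a rough sketch: the helper steps_to_reach, the constant gradient of 1.0 and the 90% threshold are illustrative assumptions.

```python
def steps_to_reach(fraction, beta):
    """Steps until the average of a constant gradient of 1.0 reaches `fraction`."""
    v, steps = 0.0, 0
    while v < fraction:
        v = beta * v + (1 - beta) * 1.0
        steps += 1
    return steps

for beta in (0.5, 0.9, 0.99):
    print(beta, steps_to_reach(0.9, beta))
```

Larger values of β need many more steps to reach the same fraction. That is the other side of heavier smoothing: the average reacts more slowly to a change in the gradient.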


Deep Learning Specialization by Andrew Ng