Different Optimization Methods in Deep Learning

Source: Deep Learning on Medium

Contour lines of a landscape, http://www.land-navigation.com/terrain-features.html

A technical guide to train neural networks more efficiently

In this article we will go over different optimization techniques that help train neural networks faster and more accurately.

They are —

  1. Batch Gradient Descent
  2. Mini-Batch Gradient Descent
  3. Stochastic Gradient Descent
  4. Gradient Descent with Momentum
  5. Gradient Descent with RMSProp
  6. Adam

Note: each section will have its code snippet; the final comprehensive code will be at the end of the article.

Batch Gradient Descent

Say we have “m” examples in our dataset. When you take gradient steps with respect to all “m” examples on each step, it is called Batch Gradient Descent; in common usage this is what is simply meant by Gradient Descent.

So, the pseudo code for the Batch Gradient Descent algorithm will be —

X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    # Forward propagation
    a, caches = forward_propagation(X, parameters)
    # Compute cost
    cost = compute_cost(a, Y)
    # Backward propagation
    grads = backward_propagation(a, caches, parameters)
    # Update parameters
    parameters = update_parameters(parameters, grads)

Now, “vectorization” can speed up training, but if “m” is very large, say 5 million, Gradient Descent will still be slow. You have to loop through all 5 million examples just to take the first step towards the minimum (the first update of the weights and biases), then loop through all of them again to take the second step, and so on. In short, it is computationally very expensive even when all the operations are vectorized.

So, we need to find out another method which can speed up the training process.

Mini-Batch Gradient Descent & Stochastic Gradient Descent

Say you split your whole training set of 5 million examples into smaller training sets called “mini-batches”, each with 1,000 examples. In total you will have 5,000 mini-batches.

With this notation, let’s see what one pass through the training set, called one epoch, looks like. In mini-batch gradient descent we process one mini-batch at a time, taking one gradient step per mini-batch, until all the mini-batches are covered. So, in our case each gradient step uses 1,000 examples, and one epoch consists of 5,000 such steps. We then repeat this for, say, 1,000 epochs so that gradient descent converges towards the global minimum (very similar to “num_iterations” in the batch case).

Below is the pseudo code for Mini-Batch Gradient Descent —
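A minimal runnable sketch of the idea, using a simple one-parameter linear model in place of the network’s forward_propagation / backward_propagation helpers (the toy model and data shapes are illustrative assumptions, not the article’s actual network):

```python
import numpy as np

def minibatch_gd(X, Y, batch_size=1000, lr=0.01, num_epochs=100):
    """Mini-batch gradient descent on a toy linear least-squares model.

    X: (n_features, m) inputs, Y: (1, m) targets. The per-step logic
    mirrors the batch pseudo code above, but each update uses only one
    mini-batch instead of all m examples.
    """
    m = X.shape[1]
    W = np.zeros((1, X.shape[0]))                   # toy model's parameters
    for epoch in range(num_epochs):
        for t in range(0, m, batch_size):           # loop over mini-batches
            Xb = X[:, t:t + batch_size]
            Yb = Y[:, t:t + batch_size]
            A = W @ Xb                              # "forward propagation"
            dW = (A - Yb) @ Xb.T / Xb.shape[1]      # "backward propagation"
            W = W - lr * dW                         # parameter update
    return W
```

Note that one epoch now yields m / batch_size gradient steps instead of a single one.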

Mini-batch gradient descent learns much faster than normal (batch) gradient descent and is used by many practitioners.

Now the question is how to select the mini-batch size. There are three ranges of values —

  1. If the mini-batch size is “m”, i.e. all the examples are used for each gradient step, it is the same as Batch Gradient Descent.
  2. If the mini-batch size is 1, i.e. a single example (x⁽¹⁾, y⁽¹⁾) is used for each gradient step, we call it Stochastic Gradient Descent.
  3. If the mini-batch size lies between these extremes, i.e. each gradient step uses a mini-batch of a few examples and the next step uses the next mini-batch, it is known as Mini-Batch Gradient Descent.

Let’s see how the two extremes, “Batch” and “Stochastic”, behave on a cost function. With Batch, learning proceeds very systematically and each step marches towards the global minimum. However, each step (one pass over all the examples) takes too long, so learning is very time-consuming.

With Stochastic, learning is very haphazard: some steps move towards the global minimum and others deviate from it. It never truly converges and is very noisy. It is also an inefficient way of training, because you process only one example per step and lose all the speed-up you get from vectorization.

So, mini-batch is a balance between the two and in practice is the fastest of the three methods. Typically the mini-batch size is a power of 2, such as 64, 128, 256, 512 or 1024, as this aligns well with the way data is laid out in computer memory.

Here is the implementation of mini-batch gradient descent.
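A sketch of the batching step, assuming inputs of shape (n_x, m) with examples in columns, as in the batch pseudo code above (the function name and signature are an assumption in the style of that code):

```python
import numpy as np

def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    """Shuffle (X, Y) and partition the columns into mini-batches.

    X: (n_x, m) inputs, Y: (1, m) labels. X and Y are shuffled with the
    same permutation so each example keeps its label. The last
    mini-batch may be smaller when m is not an exact multiple of
    mini_batch_size.
    """
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)                 # random shuffle of the columns
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    mini_batches = []
    for k in range(0, m, mini_batch_size):
        mini_batch_X = X_shuf[:, k:k + mini_batch_size]
        mini_batch_Y = Y_shuf[:, k:k + mini_batch_size]
        mini_batches.append((mini_batch_X, mini_batch_Y))
    return mini_batches
```

Each epoch then simply iterates over the returned list and takes one gradient step per (mini_batch_X, mini_batch_Y) pair.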

Gradient Descent with Momentum

The basic idea here is to calculate an exponentially weighted average of your gradients and use that to update the weights. It smooths out the steps of gradient descent because you are averaging out its oscillations.

Here’s what’s happening with the pseudo code —
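Per parameter, each step can be sketched as follows (a minimal scalar sketch; real implementations apply it element-wise to every weight and bias array):

```python
def momentum_step(W, dW, v_dW, alpha=0.01, beta=0.9):
    """One gradient-descent-with-momentum update for a single parameter.

    v_dW is the exponentially weighted average of past gradients; the
    parameter moves along this smoothed gradient instead of the raw
    (oscillating) gradient dW.
    """
    v_dW = beta * v_dW + (1 - beta) * dW  # smooth the gradient
    W = W - alpha * v_dW                  # step along the smoothed direction
    return W, v_dW
```

For example, with α = 0.1 and β = 0.9, a first gradient of 2.0 gives a velocity of 0.2 and moves W from 1.0 to 0.98.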

Here both α and β are hyperparameters. The most common value of β is 0.9, which roughly means we are averaging over the last 10 gradients (since 1/(1−β) = 1/0.1 = 10).

Here is the code how to initialize the momentum parameters and then updating the weights —
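A sketch of the initialization and the update, assuming the parameters are stored in a dictionary as W1, b1, W2, b2, … (this dictionary layout is an assumption carried over from the batch pseudo code above):

```python
import numpy as np

def initialize_velocity(parameters):
    """Create zero velocity arrays matching each W / b parameter."""
    L = len(parameters) // 2                       # number of layers
    v = {}
    for l in range(1, L + 1):
        v["dW" + str(l)] = np.zeros_like(parameters["W" + str(l)])
        v["db" + str(l)] = np.zeros_like(parameters["b" + str(l)])
    return v

def update_parameters_with_momentum(parameters, grads, v, beta=0.9,
                                    learning_rate=0.01):
    """Momentum update: smooth each gradient, then step along it."""
    L = len(parameters) // 2
    for l in range(1, L + 1):
        v["dW" + str(l)] = beta * v["dW" + str(l)] + (1 - beta) * grads["dW" + str(l)]
        v["db" + str(l)] = beta * v["db" + str(l)] + (1 - beta) * grads["db" + str(l)]
        parameters["W" + str(l)] -= learning_rate * v["dW" + str(l)]
        parameters["b" + str(l)] -= learning_rate * v["db" + str(l)]
    return parameters, v
```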

Gradient Descent with RMSProp

RMSProp stands for Root Mean Squared Propagation and can also speed up gradient descent. It keeps an exponentially weighted average of the squares of the derivatives and uses it when updating the weights. The squaring of the derivatives is an element-wise operation.
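One RMSProp update per parameter can be sketched as follows (a minimal sketch; ε is a small constant added for numerical stability, and the default values are common choices rather than part of the original snippet):

```python
import numpy as np

def rmsprop_step(W, dW, s_dW, alpha=0.01, beta=0.9, epsilon=1e-8):
    """One RMSProp update for a single parameter array.

    s_dW is the exponentially weighted average of the element-wise
    squared derivatives; dividing the step by its square root shrinks
    updates in directions where the gradient oscillates strongly.
    """
    s_dW = beta * s_dW + (1 - beta) * np.square(dW)  # average of dW ** 2
    W = W - alpha * dW / (np.sqrt(s_dW) + epsilon)   # scaled update
    return W, s_dW
```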

The objective here is this: with normal (batch or mini-batch) gradient descent, training moves in two directions at each step. Visualize the contours of the objective function and the way learning proceeds: it moves both horizontally and vertically. The horizontal movement is towards the global minimum, while the vertical movement is oscillation. We want the horizontal movement to be fast (i.e. fast updates of the weights, since in this illustration the weight vector lies along the horizontal axis) and the vertical movement to be slow (i.e. damped oscillations, corresponding to slow updates of the bias, which lies along the vertical axis).

Looking at the slope, it is much steeper in the vertical direction, so db is large, while dw, whose slope is much shallower in the horizontal direction, is small. Hence S_dw, the running average of dw², is also small, and since the weight update divides by its square root √S_dw, the weight updates become larger. By the same reasoning, √S_db is large, so the bias updates become smaller and the oscillations are damped. This is what the mathematical formulation above expresses.

The net effect is that the trajectory deviates much less in the vertical direction while moving steadily in the horizontal direction, so we can use a larger learning rate α and speed up training. Note that we labelled “w” as the x-axis and “b” as the y-axis purely for illustration; in practice these are high-dimensional vectors, and the horizontal and vertical directions can each be a mix of many parameter dimensions.

Since we square the derivatives and then divide by the square root in the end, the method is called Root Mean Squared Prop (RMSProp).

Adam optimization algorithm

Adam is an acronym for Adaptive Moment Estimation (we will see below what these moments are). It combines Momentum and RMSProp into a single algorithm. This is what the mathematical formulation looks like —
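A per-parameter sketch of the update, combining the momentum and RMSProp steps above plus a bias correction for the early iterations (the default values follow common recommendations and are assumptions, not taken from the original snippet):

```python
import numpy as np

def adam_step(W, dW, v_dW, s_dW, t, alpha=0.01, beta1=0.9,
              beta2=0.999, epsilon=1e-8):
    """One Adam update for a single parameter array.

    v_dW is the first moment (momentum term), s_dW the second moment
    (RMSProp term); t is the 1-based iteration count used for the
    bias correction, which compensates for the zero initialization.
    """
    v_dW = beta1 * v_dW + (1 - beta1) * dW              # momentum term
    s_dW = beta2 * s_dW + (1 - beta2) * np.square(dW)   # RMSProp term
    v_corr = v_dW / (1 - beta1 ** t)                    # bias correction
    s_corr = s_dW / (1 - beta2 ** t)
    W = W - alpha * v_corr / (np.sqrt(s_corr) + epsilon)
    return W, v_dW, s_dW
```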

So, we can see that there are quite a few hyperparameters, some of which need to be tuned. They are: the learning rate α (needs to be tuned), β₁ (a common default is 0.9), β₂ (a common default is 0.999) and ε (a common default is 10⁻⁸).

The exponentially weighted average of dw is called the first moment, as it estimates the mean of the derivatives. The average of dw² is called the second moment, as it estimates the mean of the squares of the derivatives. Hence the name Adaptive Moment Estimation.

Here is the implementation of Adam —
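A sketch of the full initialization and update over a parameter dictionary (same assumed W1/b1 layout as in the momentum sketch above):

```python
import numpy as np

def initialize_adam(parameters):
    """Zero first- and second-moment arrays for every W / b parameter."""
    L = len(parameters) // 2
    v, s = {}, {}
    for l in range(1, L + 1):
        v["dW" + str(l)] = np.zeros_like(parameters["W" + str(l)])
        v["db" + str(l)] = np.zeros_like(parameters["b" + str(l)])
        s["dW" + str(l)] = np.zeros_like(parameters["W" + str(l)])
        s["db" + str(l)] = np.zeros_like(parameters["b" + str(l)])
    return v, s

def update_parameters_with_adam(parameters, grads, v, s, t,
                                learning_rate=0.01, beta1=0.9,
                                beta2=0.999, epsilon=1e-8):
    """Adam update: momentum + RMSProp moments with bias correction."""
    L = len(parameters) // 2
    for l in range(1, L + 1):
        for p in ("dW" + str(l), "db" + str(l)):
            v[p] = beta1 * v[p] + (1 - beta1) * grads[p]
            s[p] = beta2 * s[p] + (1 - beta2) * np.square(grads[p])
            v_corr = v[p] / (1 - beta1 ** t)       # bias-corrected moments
            s_corr = s[p] / (1 - beta2 ** t)
            key = ("W" if p.startswith("dW") else "b") + str(l)
            parameters[key] -= learning_rate * v_corr / (np.sqrt(s_corr) + epsilon)
    return parameters, v, s
```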

Now that we have learnt all these optimization techniques, let’s use them to build a small binary classifier on the data below — a model that classifies the “red” and “blue” dots.

So, we build a 3-layer neural network and train it using each of the above optimization techniques. For the entire code implementation, please see the section below where I have the link to the code.

Here is the result —

Gradient Descent:

Adam:

In terms of the accuracy —

Source Code

The entire code of this article can be found here —

If you want, you can read the Adam research paper here.


References

  1. Deep Learning Specialization by Andrew Ng & team.