Optimization Algorithms — from mini-batch, weighted averages to ADAM, part 1


This article is a note to myself for future reference, and for anyone who may be interested. I thank deeplearning.ai’s 2nd course of the Deep Learning Specialization for the knowledge imparted.

Deep learning methods work on huge amounts of data, so training can be slow. But with good optimization algorithms there is a good opportunity to speed up the training process. In this article I would like to begin with mini-batch gradient descent.

BATCH AND MINI-BATCH GRADIENT DESCENT:

Let’s say we have m training examples. Usually they are stacked as the columns of a matrix:

X = [x(1), x(2), ..., x(m)] , dimensions (n, m)

and as for the labels,

Y = [y(1), y(2), ..., y(m)] , dimensions (1, m)

Surely vectorization would help in speeding up the computation

Y = W . X + b (where X is the matrix of m examples stacked horizontally as columns, and W and b are the weights and bias)

if m is relatively small. But what if m is in the millions or more? Even vectorization wouldn’t be fast enough. Why?
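To make this concrete, here is a minimal NumPy sketch (not from the course; the shapes and names are made up for illustration) comparing a per-example loop with a single vectorized matrix product:

import numpy as np

n, m = 3, 10000                    # hypothetical feature count and number of examples
X = np.random.randn(n, m)          # m examples stacked as columns, shape (n, m)
W = np.random.randn(1, n)          # weights, shape (1, n)
b = 0.5                            # bias

# one example at a time (slow: a Python-level loop over m examples)
Y_loop = np.zeros((1, m))
for i in range(m):
    Y_loop[:, i] = W @ X[:, i] + b

# vectorized: all m examples in a single matrix product
Y_vec = W @ X + b                  # shape (1, m)

assert np.allclose(Y_loop, Y_vec)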

This is because gradient descent (batch gradient descent, that is) requires processing the entire training set before taking a single step. But what if we could make slight improvements, i.e. make gradient descent progress even without fully processing the entire batch of training examples? This is where mini-batch gradient descent comes to the rescue.

The training set can be split into smaller sibling training sets. These smaller sets are called mini-batches. For example, let’s say m = 5,000,000.

Then, if the training set is split into mini-batches of 1,000 examples each, we would have 5,000 mini-batches.

X{1} = [x(1) ... x(1000)]

X{2} = [x(1001) ... x(2000)]

where X{1} is the first mini-batch, X{2} is the second mini-batch, and so on.
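As a quick sketch (hypothetical, with smaller numbers so it actually runs), slicing the t-th mini-batch out of the stacked matrices looks like this:

import numpy as np

n, m = 10, 10000                   # hypothetical sizes (much smaller than 5,000,000)
mini_batch_size = 1000
X = np.random.randn(n, m)
Y = np.random.randint(0, 2, size=(1, m))

t = 1                              # second mini-batch, i.e. X{2} in the notation above
X_t = X[:, t * mini_batch_size:(t + 1) * mini_batch_size]   # shape (n, 1000)
Y_t = Y[:, t * mini_batch_size:(t + 1) * mini_batch_size]   # shape (1, 1000)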

Similarly, Y can also be split into mini-batches. Batch gradient descent runs on the entire training set, whereas mini-batch gradient descent runs on one mini-batch (X{t}, Y{t}) at a time. So, to run mini-batch gradient descent:

for t = 1...5000   (as we have 5000 mini-batches)
    Z[1] = W[1] X{t} + b[1]   (t denotes the mini-batch: a vectorized computation over its 1000 examples; [1] denotes the first layer)
    A[1] = g[1](Z[1])
    .
    .
    .
    A[L] = g[L](Z[L])
(the computation taking place layer by layer can also be vectorized, as in the sketch below)
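A minimal NumPy sketch of this forward pass on a single mini-batch could look as follows. It is only an illustration under my own assumptions: the parameter-dictionary layout ('W1', 'b1', ...), the ReLU hidden activations and the sigmoid output are choices I made, not something fixed by the article.

import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_pass(X_t, parameters):
    """Forward-propagate one mini-batch X_t of shape (n, 1000) through L layers.

    parameters is assumed to hold 'W1', 'b1', ..., 'WL', 'bL'.
    """
    L = len(parameters) // 2          # number of layers
    A = X_t
    for l in range(1, L + 1):
        Z = parameters['W' + str(l)] @ A + parameters['b' + str(l)]
        # hidden layers use ReLU, the output layer uses a sigmoid
        A = sigmoid(Z) if l == L else relu(Z)
    return A                          # A[L]: predictions for the whole mini-batch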

Now, the cost for this scenario becomes:

J{t} = (1/1000) * Σ L(ŷ(i), y(i)), summing i over the examples in mini-batch t

where 1000 denotes the number of examples in one mini-batch, and t denotes the mini-batch. If regularization is used there would be an additional Frobenius-norm term, but we will get to that part in another article.

Now, backpropagation can be used to compute the gradients, and it is done with respect to J{t} (i.e. using only X{t}, Y{t}). The weight update rule then becomes:

W[l] := W[l] − α dW[l]
b[l] := b[l] − α db[l]

(where α is the learning rate and l indexes the layer).

The above forward and backward propagation over all the mini-batches amounts to one pass through the training set, which is called one epoch. With m = 5,000,000, batch gradient descent lets you take only one gradient descent step per pass through the entire training set, while in mini-batch gradient descent one epoch amounts to taking 5,000 gradient descent steps (as there are 5,000 mini-batches).
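Putting forward propagation, the cost J{t}, backpropagation and the update rule together, here is a hedged sketch of one epoch of mini-batch gradient descent for a single-layer (logistic regression) model. The function name, the cross-entropy loss and the assumption that m divides evenly into mini-batches are my choices for illustration, not the article’s code.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_epoch(X, Y, W, b, mini_batch_size=1000, learning_rate=0.01):
    """One epoch of mini-batch gradient descent for logistic regression."""
    m = X.shape[1]
    num_batches = m // mini_batch_size            # assumes m divides evenly, for brevity
    costs = []
    for t in range(num_batches):
        # X{t}, Y{t}: the t-th mini-batch
        X_t = X[:, t * mini_batch_size:(t + 1) * mini_batch_size]
        Y_t = Y[:, t * mini_batch_size:(t + 1) * mini_batch_size]

        # forward pass and the mini-batch cost J{t} (cross-entropy)
        A = sigmoid(W @ X_t + b)
        J_t = -np.mean(Y_t * np.log(A) + (1 - Y_t) * np.log(1 - A))
        costs.append(J_t)

        # backpropagation: gradients of J{t} with respect to W and b
        dZ = A - Y_t
        dW = (dZ @ X_t.T) / mini_batch_size
        db = np.mean(dZ)

        # one gradient descent step per mini-batch
        W = W - learning_rate * dW
        b = b - learning_rate * db
    return W, b, costs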

Intuition as to why mini-batch gradient descent works:

In batch gradient descent the cost is expected to decrease on every iteration. If the cost goes up even a bit in an iteration, then something is wrong and needs to be corrected (the learning rate, for example).

Whereas, in mini-batch gradient descent, when we plot the cost against the iteration number, we find that the cost does not decrease on every iteration. This is because each iteration trains on X{t} and Y{t}, i.e. each iteration sees a different mini-batch. So the plot of J{t} looks noisier.

Of course, the trend is downwards, but a little noisy. The noise, i.e. the oscillations, may be due to some mini-batches being harder to train on than others, which in turn bumps up the cost. For example, X{1}, Y{1} could be an easy mini-batch, whereas X{2}, Y{2} could be a hard one that temporarily increases the cost.

Choosing the mini-batch size is very important.

If mini-batch size = m (batch gradient descent)
If mini-batch size = 1 (stochastic gradient descent)(here every example is its own mini-batch)

Picture a contour plot of the cost. Blue lines would indicate batch gradient descent, which takes less noisy steps toward the minimum. Violet lines would denote the steps taken by stochastic gradient descent, which are noisier but cheap per step. The risk in stochastic gradient descent is that one bad training example could send it in the wrong direction. Stochastic gradient descent also never converges: it heads toward the minimum and then wanders around that region, but never settles at the minimum.

In practice, the mini-batch size would be somewhere between 1 and m. This is because batch gradient descent takes too long on every iteration if the dataset is large. On the other hand, the main disadvantage of stochastic gradient descent is not the noise, which can be reduced with a lower learning rate, but the fact that we lose the speedup from vectorization since we process a single training example at a time. So choosing something in between gives faster learning.

OK, so that’s the end of the first part. I am attaching a code snippet to compute the mini-batches. Please leave your feedback; this is my first article on Medium.

import math
import numpy as np

def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    """
    Creates a list of random minibatches from (X, Y)

    Arguments:
    X -- input data, of shape (input size, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
    mini_batch_size -- size of the mini-batches, integer
    seed -- seed for the random shuffle, integer

    Returns:
    mini_batches -- list of synchronous (mini_batch_X, mini_batch_Y)
    """
    np.random.seed(seed)
    m = X.shape[1]                      # number of training examples
    mini_batches = []

    # Step 1: Shuffle (X, Y)
    permutation = list(np.random.permutation(m))
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation].reshape((1, m))

    # Step 2: Partition (shuffled_X, shuffled_Y). Minus the end case.
    num_complete_minibatches = math.floor(m / mini_batch_size)  # number of mini-batches of size mini_batch_size in your partitioning
    for k in range(0, num_complete_minibatches):
        mini_batch_X = shuffled_X[:, k * mini_batch_size:(k + 1) * mini_batch_size]
        mini_batch_Y = shuffled_Y[:, k * mini_batch_size:(k + 1) * mini_batch_size]
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)

    # Handling the end case (last mini-batch < mini_batch_size)
    if m % mini_batch_size != 0:
        mini_batch_X = shuffled_X[:, num_complete_minibatches * mini_batch_size:]
        mini_batch_Y = shuffled_Y[:, num_complete_minibatches * mini_batch_size:]
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)

    return mini_batches
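As a usage sketch (toy shapes, hypothetical loop body), the function slots into a training loop like this; passing a new seed each epoch means every epoch sees a different random partition:

import numpy as np

X = np.random.randn(12, 500)                     # hypothetical: 12 features, 500 examples
Y = np.random.randint(0, 2, size=(1, 500))

for epoch in range(10):
    mini_batches = random_mini_batches(X, Y, mini_batch_size=64, seed=epoch)
    for mini_batch_X, mini_batch_Y in mini_batches:
        # forward propagation, cost, backpropagation and the parameter
        # update for this mini-batch would go here, as described above
        pass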