Learning Rate Tuning and Optimization

Learning rate is one of the most important hyperparameters to tune when optimizing a neural network.

Once we initialize our neural network and calculate the error, the next step is to calculate the partial derivative of the error with respect to the weights (we use the chain rule so we can compute this gradient for the weights at every layer). This tells us in which direction we should update our weights in order to reduce the network's error. How large a step we take in that direction is controlled by a hyperparameter we set, known as the learning rate.

The size of each update is obtained by multiplying the partial derivative (the gradient) by the learning rate.
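As a minimal sketch, a single gradient-descent update of this kind might look like the snippet below. The values and variable names (weights, grad, learning_rate) are purely illustrative, and the gradient is assumed to have already been computed by backpropagation:

```python
import numpy as np

learning_rate = 0.01

weights = np.array([0.5, -1.2, 0.3])   # current weights of one layer (illustrative)
grad = np.array([0.8, -0.1, 0.4])      # dError/dWeights from the chain rule (illustrative)

# The step size is the gradient scaled by the learning rate;
# we move against the gradient to reduce the error.
weights = weights - learning_rate * grad
print(weights)
```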

Consider the diagram below; our goal is to reach the lowest point of the curve. If we set a very high learning rate we overshoot the minimum, whereas if the learning rate is too low we take an extremely long time to reach it. Keeping the learning rate very small also creates another issue: the optimization can get stuck in a local minimum, as shown in the figure below.

Therefore, if we choose the wrong learning rate we may never reach the global minimum, which leads to poor network performance.

The advantage of a high learning rate is that we get close to the global minimum quickly; the advantage of a small learning rate is that we can settle into a more precise solution for the given dataset.

One efficient way to get the best of both is learning rate decay. In this approach we start training with a high learning rate so that we approach the global minimum quickly, and as training progresses we repeatedly decrease the learning rate after a certain number of epochs.

By doing this we get the advantages of both a high and a low learning rate, i.e. faster and more accurate convergence on a given dataset. A sketch of such a schedule follows.
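Below is a minimal sketch of step-based learning rate decay, assuming we drop the rate by a fixed factor every few epochs. The initial rate, decay factor, and epoch counts are illustrative values, and the training call is a hypothetical placeholder:

```python
initial_lr = 0.1      # high starting rate (illustrative)
decay_factor = 0.5    # multiply the rate by this factor at each drop (illustrative)
drop_every = 10       # number of epochs between drops (illustrative)
num_epochs = 50

for epoch in range(num_epochs):
    # Rate shrinks geometrically as training progresses.
    lr = initial_lr * (decay_factor ** (epoch // drop_every))
    # train_one_epoch(model, data, lr)   # hypothetical training step
    if epoch % drop_every == 0:
        print(f"epoch {epoch:2d}: learning rate = {lr:.5f}")
```

Frameworks typically provide built-in schedulers for this pattern, but the idea is the same: a large rate early for speed, a small rate later for precision.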

Source: Deep Learning on Medium