Cyclical Learning Rates — The ultimate guide for choosing learning rates for Neural Networks

Source: Deep Learning on Medium


In this quick yet important post, we will discuss a phenomenal technique for choosing learning rates, described by Leslie N. Smith in his paper Cyclical Learning Rates for Training Neural Networks.

Learning Rate

It is one of the most important hyper-parameters for training a neural network and is the key to effective and fast training. The learning rate decides how much of the loss gradient is applied to the current weights to move them in the direction of lower loss.

new_weight = current_weight - learning_rate * gradient
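As a minimal sketch of this update rule in plain Python (the numeric values here are purely illustrative):

```python
# One plain gradient-descent step: move the weight against the gradient,
# scaled by the learning rate. Values are illustrative, not from real training.
current_weight = 0.8
gradient = 0.5          # dLoss/dWeight at the current weight
learning_rate = 0.1

new_weight = current_weight - learning_rate * gradient
print(new_weight)       # 0.75
```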

NOTE: For the rest of the article, I will use LR instead of learning rate.

Source: Jeremy Jordan’s blogpost

What is Cyclical Learning Rate?

A technique to set, and cyclically vary, the LR during training.

This methodology trains the neural network with an LR that changes cyclically for each batch, instead of a non-cyclic LR that is either constant or changes only once per epoch. The learning rate schedule varies between two bounds.

When using a cyclical LR, we have to specify two things:
 1) The bounds between which the learning rate will vary — base_lr and max_lr.
 2) The step_size — the number of iterations in which the learning rate goes from one bound to the other.

Why does it work?

We have always learnt that the LR should keep decreasing as training progresses, so that we converge over time.

In CLR, we vary the LR between a lower and an upper threshold. The reasoning is that periodically raising the learning rate within each cycle helps the optimiser escape saddle points or local minima if it runs into one. If a saddle point sits on a wide plateau, a low learning rate will probably never generate gradient steps large enough to get out of it, making the loss difficult to minimise.

Objective

Pick a learning rate and change it on each iteration (batch) so that the training process is performant, which means:

  1. Achieve the maximum possible accuracy, for the best prediction results.
  2. Speed up the training process by achieving the above in the minimum number of epochs.

Important Terms

Epoch 
One epoch is completed when the entire dataset has been passed forward and backward through the neural network exactly once.

Batch Size
Number of training examples to utilise in one iteration.

Batch or Iteration
For a training set of 1000 examples with a batch size of 20, one epoch takes 50 iterations (batches) to complete.

Cycle
The number of iterations for the learning rate to go from the lower bound to the upper bound and back to the lower bound.

Step size
Number of iterations to complete half of a cycle.
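Putting these terms together, the triangular schedule can be written down directly. This sketch follows the formulation in Smith's paper; the function and variable names are my own choices:

```python
import math

def triangular_lr(iteration, base_lr, max_lr, step_size):
    """Triangular CLR: the LR climbs linearly from base_lr to max_lr
    over step_size iterations, then descends back, and repeats."""
    # Which cycle we are in (cycles are numbered from 1).
    cycle = math.floor(1 + iteration / (2 * step_size))
    # x goes 1 -> 0 -> 1 across a cycle, so (1 - x) is the triangular shape.
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)

# With step_size=4, the LR peaks at iteration 4 and is back at base_lr at 8.
lrs = [triangular_lr(i, 0.001, 0.006, 4) for i in range(9)]
```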

Setting base_lr and max_lr

As we increase the learning rate, the loss will initially decrease, but at some point it will start increasing again. Note the LR at which the loss first starts to decrease, and the LR at which it stagnates or starts to rise. These are good values to set as base_lr and max_lr.

Alternatively, you can note the LR at which accuracy peaks, and use that as max_lr. Set base_lr to 1⁄3 or 1⁄4 of this.
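To produce the loss-versus-LR plot described above, a common approach is an LR range test: train for a short while, increasing the LR each batch, and record the loss at each LR. A framework-agnostic sketch of the exponentially spaced schedule such a test would sweep over (names are my own; hooking it into an actual training loop is left to your framework):

```python
def lr_range_test_schedule(lr_start, lr_end, num_iterations):
    """Exponentially spaced LRs from lr_start to lr_end, one per batch.

    Train one batch at each LR in order, record the loss, and then
    plot loss against LR to read off base_lr and max_lr.
    """
    growth = (lr_end / lr_start) ** (1 / (num_iterations - 1))
    return [lr_start * growth ** i for i in range(num_iterations)]

# e.g. sweep from 1e-5 to 1e-1 over 100 batches
schedule = lr_range_test_schedule(1e-5, 1e-1, 100)
```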

Variations of CLR

Besides the triangular profile used above, Leslie Smith also suggested some other forms of CLR.

Triangular2: Here the LR amplitude (the gap between base_lr and max_lr) is halved after every cycle.

Exponential Range: Here the LR amplitude decays exponentially with each iteration.

Conclusion

Cyclical Learning Rate is an amazing technique for setting and controlling learning rates when training a neural network, achieving maximum accuracy in a very efficient way.
