The Learning Rate Black Magic

Original article can be found here (source): Deep Learning on Medium

Evaluation of the Learning Rate Finder Technique

The choice of the most important hyperparameter of deep models, the learning rate, has long been considered “more an art than a science” [1] and has relied mainly on trial and error. Indeed, selecting a good learning rate has historically been one of the many challenges in training deep neural networks, that is, until the Learning Rate Range Test (LRRT) was proposed in 2015 [2] and popularized by fast.ai’s deep learning library as the Learning Rate Finder (LRFinder) [3]. In this post, we evaluate the reliability and usefulness of this technique.

Intuitively, the learning rate measures how much the model can ‘learn’ from a new mini-batch of training data, that is, how much we update the model weights with the information coming from each new mini-batch. The higher the learning rate, the bigger the steps we take along the trajectory toward the minimum of the loss function, where the best model parameters are (Figure 1).

Figure 1. Illustration of how learning rates affect the navigation through the loss function during the deep neural networks optimization process. Source: https://www.jeremyjordan.me/nn-learning-rate/
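
To make the step-size intuition concrete, here is a minimal sketch on a toy problem of our own choosing (it is not part of the experiments in this post): plain gradient descent on f(w) = w². The function name `gradient_descent` and the three learning rates are illustrative assumptions.

```python
def gradient_descent(lr, w0=1.0, steps=20):
    """Minimize the toy loss f(w) = w**2; return the final distance from the minimum."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w      # derivative of w**2
        w = w - lr * grad   # the learning rate scales every update
    return abs(w)


# A tiny rate barely moves, a moderate rate converges, a large rate diverges.
for lr in (0.01, 0.4, 1.1):
    print(f"lr={lr}: |w| after 20 steps = {gradient_descent(lr):.3g}")
```

On this toy loss, each step multiplies w by (1 - 2·lr), so any learning rate above 1.0 makes the iterates grow instead of shrink.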

Learning Rate Range Test Overview

The LRRT consists of, at most, one epoch of training iterations, where the learning rate is increased at every mini-batch of data.

Figure 2. Typical behavior of the training loss during the Learning Rate Range Test.

During the process, the learning rate goes from a very small value to a very large value (e.g. from 1e-7 to 100), causing the training loss to start with a plateau, descend to some minimum value, and eventually explode. This typical behavior can be displayed on a plot and used to select an appropriate range for the learning rate, specifically in the region where the loss is decreasing (Figure 2).
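
The sweep itself can be sketched as a per-mini-batch schedule. The text above only says that the learning rate is increased at every mini-batch; the exponential (log-spaced) growth used here is an assumption on our part, and `lr_schedule` is a hypothetical helper rather than a function from fast.ai or Keras.

```python
def lr_schedule(start_lr=1e-7, end_lr=100.0, num_iters=100):
    """Return one learning rate per mini-batch, evenly spaced on a log scale.

    Assumes num_iters >= 2 and exponential growth between the two bounds.
    """
    ratio = end_lr / start_lr
    return [start_lr * ratio ** (i / (num_iters - 1)) for i in range(num_iters)]


lrs = lr_schedule()
print(f"first lr = {lrs[0]:.1e}, last lr = {lrs[-1]:.1e}")
```

In a real run, each value would be set on the optimizer before the corresponding mini-batch, and the training loss recorded afterwards.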

The recommended minimum learning rate is the value where the loss decreases the fastest (the most negative gradient of the loss curve), while the recommended maximum learning rate is one tenth of the learning rate at which the loss reaches its minimum.

Why not simply the learning rate at the minimum of the loss? Why 10 times less? Because what we actually plot is a smoothed version of the loss, so the learning rate at the minimum of the plotted curve is likely to be too large and to make the loss diverge during training.
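
The two selection rules can be sketched as follows. This is a minimal illustration, not the fast.ai or Keras code: `suggest_lr_range` is a hypothetical helper, the slope is measured against log10 of the learning rate (an assumption, since the plot uses a log axis), and the loss values are made up.

```python
import math


def suggest_lr_range(lrs, smoothed_losses):
    """Return (min_lr, max_lr) from a smoothed LRRT curve."""
    # Slope of the loss with respect to log10(lr) between consecutive points.
    slopes = [
        (smoothed_losses[i + 1] - smoothed_losses[i])
        / (math.log10(lrs[i + 1]) - math.log10(lrs[i]))
        for i in range(len(lrs) - 1)
    ]
    # Minimum rate: left end of the interval with the most negative slope.
    steepest = min(range(len(slopes)), key=lambda i: slopes[i])
    # Maximum rate: one tenth of the rate at which the smoothed loss is lowest.
    lowest = min(range(len(smoothed_losses)), key=lambda i: smoothed_losses[i])
    return lrs[steepest], lrs[lowest] / 10.0


# Made-up curve: plateau, descent, minimum, explosion.
example_lrs = [1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0]
example_losses = [2.3, 2.3, 2.0, 1.2, 1.0, 5.0]
min_lr, max_lr = suggest_lr_range(example_lrs, example_losses)
print(min_lr, max_lr)
```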

Let’s start by reproducing one of the experiments that fast.ai published in this notebook and presented in this blog post [4]. The experiment trains a ResNet-56 on the CIFAR-10 dataset with a batch size of 512, using Stochastic Gradient Descent (SGD) with momentum as the optimizer. We perform the same experiment using both the fast.ai LRRT implementation and a Keras implementation.

The Importance of Smoothing

First of all, the actual training loss recorded during the process should be smoothed in order for the plot to be readable and to filter out some noise. But what smoothing should we use?

A classic smoothing is an exponentially weighted average of the loss (see formula (1) and formula (2) for an unbiased version). We set the smoothing parameter β=0.98 as done in fast.ai. In some implementations, a different moving average is used, without zero as the starting value or bias correction (formula (3) below).
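
As a minimal sketch of this smoothing, assuming the usual definitions: the biased average starts at zero and is updated as avg = β·avg + (1 − β)·loss, and the unbiased version divides it by (1 − β^i) at step i. The helper name `smooth_losses` is ours.

```python
def smooth_losses(losses, beta=0.98):
    """Return (biased, unbiased) exponentially weighted averages of `losses`."""
    biased, unbiased = [], []
    avg = 0.0  # the running average starts at zero
    for i, loss in enumerate(losses, start=1):
        avg = beta * avg + (1.0 - beta) * loss
        biased.append(avg)
        unbiased.append(avg / (1.0 - beta ** i))  # bias correction at step i
    return biased, unbiased


# With a constant loss, the correction recovers the true value immediately,
# while the biased average is still climbing up from zero.
biased, unbiased = smooth_losses([2.0] * 50)
print(f"biased[0]={biased[0]:.3f}, unbiased[0]={unbiased[0]:.3f}")
```

With β = 0.98 the average effectively looks back over roughly 1 / (1 − β) = 50 mini-batches, which is why the smoothed curve lags behind the raw loss when the loss changes quickly.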

The biased moving average and the unbiased average (orange and red in Figure 3, respectively) produce curves with a similar shape and thus recommend similar learning rates. However, due to the high value of β, both smoothing procedures result in curves that are quite different from the original loss, especially for high learning rates; we wonder whether this could impair the LRFinder recommendation.

Figure 3. Different smoothed versions of the training loss in the Learning Rate Range Test.

Dependence on Initialization

A first striking observation when using the LRFinder is that the loss can be quite different as we change the initial weights of the model.

We perform 20 runs of the LRFinder, changing the initial weights by means of the he_uniform initialization (called kaiming_uniform in PyTorch and fast.ai) every time. The median smoothed loss and recommended range, as well as the 10% and 90% quantiles of both the loss and the range values, are displayed in Figure 4 for Keras and Figure 5 for fast.ai.

Figure 4. Learning Rate Range Test results on 20 runs with different initial weights using Keras.
Figure 5. Learning Rate Range Test results on 20 runs with different initial weights using fast.ai.
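
The aggregation over runs can be sketched as below. This is an illustration under our own assumptions, not the code from the repository: `quantile` uses simple linear interpolation, `summarize_runs` is a hypothetical helper, and the input curves are synthetic.

```python
def quantile(values, q):
    """Linear-interpolation quantile of `values` for 0 <= q <= 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def summarize_runs(runs):
    """Given equal-length loss curves, return per-step (q10, median, q90)."""
    return [
        (quantile(step, 0.1), quantile(step, 0.5), quantile(step, 0.9))
        for step in zip(*runs)  # zip(*runs) groups the runs by iteration
    ]


# Synthetic example: 11 flat curves with losses 0.0, 1.0, ..., 10.0.
synthetic_runs = [[float(r)] * 3 for r in range(11)]
summary = summarize_runs(synthetic_runs)
print(summary[0])
```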

Although the median values of the suggested learning rate range (0.1, 2.) are of the same order as those published by fast.ai (0.08, 3.), reported in Figure 6, we found lower suggested learning rates with the current fast.ai implementation (5e-4, 0.4).

Figure 6. Learning Rate Range Test result published by fast.ai in their notebook and blog post [4].

Because of the large variation of the LRFinder plots, it might happen that for a different weight initialization, we arrive at different conclusions. For instance, for two different weight initializations, say A and B, we get different non-overlapping suggested ranges, as shown in Figure 7.

In this case, if we take the maximum learning rate found for the model with initial weights A and use it to train the model with initial weights B, the loss is likely to diverge. When we actually do so, the model overfits. One important recommendation, then, is to use the same initial weights for the LRFinder and the subsequent model training.

Figure 7. The LRFinder suggests a different range for the same model with different initial weights.
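
One simple way to keep the initial weights identical is to fix the random seed of the initializer. The sketch below assumes the standard he_uniform definition (uniform in [-limit, limit] with limit = sqrt(6 / fan_in)) and uses plain Python lists instead of framework tensors; the function and variable names are hypothetical.

```python
import math
import random


def he_uniform(fan_in, fan_out, seed):
    """Draw a fan_in x fan_out matrix uniformly from [-limit, limit]."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    limit = math.sqrt(6.0 / fan_in)
    return [
        [rng.uniform(-limit, limit) for _ in range(fan_out)]
        for _ in range(fan_in)
    ]


weights_for_lr_finder = he_uniform(64, 32, seed=0)
weights_for_training = he_uniform(64, 32, seed=0)  # same seed, same weights
weights_other_run = he_uniform(64, 32, seed=1)     # a different initialization

print(weights_for_lr_finder == weights_for_training)
print(weights_for_lr_finder == weights_other_run)
```

In a deep learning framework, the equivalent would be to save the model weights before the LRRT and reload them before training, since the test itself modifies the weights.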

LRFinder: Black Magic or Sound Science?

Although we can produce similar results to fast.ai in the LRFinder experiments, such high suggested learning rates are quite different from what is used in state-of-the-art contexts to actually train similar models with SGD. The LRFinder recommends a maximum learning rate of 2.0, while the usual value is around 0.1.

Furthermore, if we look at the unsmoothed training and validation loss during the LRRT displayed in Figure 8, it doesn’t seem safe to use such large learning rates. The smoothing applied might be masking the actual behavior of the training loss in the rightmost region of the plot, where it becomes unstable due to the high learning rates. Note that displaying the validation loss requires running a full validation pass at every batch, which makes the LRRT impractical: in this case, it took about 50 minutes instead of 1.5 minutes on a GTX 970 GPU.

Figure 8. Unsmoothed training and validation loss during the Learning Rate Range Test.

On the other hand, selecting the minimum learning rate as the one where the loss has the steepest descent seems reasonable and safe. The LRFinder is then a useful tool to automatically select a sensible learning rate for an arbitrary model.

The Relation With the Batch Size

The LRFinder confirms a common deep learning practice: for a larger batch size, we should pick a higher learning rate, as illustrated in Figure 9. The intuition behind this practice is that, with a larger batch size, we use more data and get a more accurate estimate of the gradient for a single weight update.

For this reason, we can be more confident about the estimated direction of the loss descent, and can therefore afford a higher learning rate. Note that the curves in Figure 9 seem to suggest that the loss decreases less when increasing the batch size, but this is merely an effect of the smoothing.

Figure 9. The Learning Rate Range Test recommends picking a larger learning rate for a larger batch size.
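
The intuition about gradient accuracy can be checked on a toy simulation. This is a sketch under strong assumptions of our own: each per-example gradient is the true value 1.0 plus unit Gaussian noise, and `gradient_estimate_spread` is a hypothetical helper, unrelated to the ResNet-56 experiments.

```python
import random
import statistics


def gradient_estimate_spread(batch_size, trials=2000, seed=0):
    """Standard deviation of the mini-batch mean of noisy per-example gradients."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        # Toy per-example gradients: true value 1.0 plus unit Gaussian noise.
        batch = [1.0 + rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        estimates.append(sum(batch) / batch_size)
    return statistics.stdev(estimates)


# The spread of the gradient estimate shrinks as the batch grows.
for batch_size in (16, 64, 256):
    print(batch_size, round(gradient_estimate_spread(batch_size), 3))
```

Under these assumptions the spread falls roughly as one over the square root of the batch size, which is the sense in which a larger batch gives a more trustworthy descent direction.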

Takeaways About the LRFinder

The LRFinder is a useful tool to receive a recommended learning rate to use for a specific setting, but it must be used with some caution. It is essential to set the same initial weights in the LRRT and in the subsequent model training. We should never assume that the found learning rates are the best for any model initialization.

Further, setting a narrower range than what is recommended is safer and could reduce the risk of divergence due to very high learning rates. The use of advanced dynamic learning rate policies (such as cyclical learning rates and warm restarts) further reduces the dependency on the learning rates, and thus the risk of divergence, but that is its own separate topic.

The code to reproduce the LRRT experiments with different initial weights (Figures 4 and 5) is available in this repository.