 FAST AI JOURNEY: PART 1. SPECIAL LESSON.

Source: Deep Learning on Medium

Documenting my fast.ai journey: PAPER REVIEW. A DISCIPLINED APPROACH TO NEURAL NETWORK HYPER-PARAMETERS: PART 1 — LEARNING RATE, BATCH SIZE, MOMENTUM, AND WEIGHT DECAY.

For the Special Lesson Project, I decided to dive into Leslie N. Smith’s 1cycle Learning Rate Policy, introduced in the 2018 paper A disciplined approach to neural network hyper-parameters: Part 1 — learning rate, batch size, momentum, and weight decay.

Our objective here is to understand the 1cycle Learning Rate Policy presented in the paper, and to describe its contents using the notions we have learned during the course. We will also take a look at this blog post, which also explains the concepts exposed in the paper.

1. Introduction.

We have seen during the course that the 1cycle Learning Rate Policy allows us to train our Neural Networks more easily.

To use it, first of all, we need to find an optimal learning rate. This can be obtained by using the Learning Rate Range Test, a method introduced in the 2015 paper Cyclical Learning Rates for Training Neural Networks. The method sets two learning rate values: a minimum and a maximum.

Then it runs SGD, starting with the minimum learning rate, which is increased by a small factor after every mini-batch until it reaches our maximum rate.

During this process we have to monitor our losses, and once we have gone over a large enough range of learning rates, we can plot the losses we have found against the learning rates. Our optimal learning rate will be the value straight before the minimum of the loss curve, which is where the loss is still improving.

If you want to dig deeper into how to find an optimal learning rate, check out this post.
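To make the range test concrete, here is a minimal sketch of the schedule of learning rates it sweeps through. The function name and defaults are my own, and I use a geometric (multiplicative) sweep so the test covers several orders of magnitude; the original paper describes increasing the rate by a small amount after every mini-batch.

```python
def range_test_lrs(lr_min=1e-5, lr_max=10.0, num_steps=100):
    """Learning rates for a Learning Rate Range Test.

    Produces one rate per mini-batch, growing from lr_min to lr_max by
    a constant multiplicative factor so the sweep spans several orders
    of magnitude. (Names and defaults here are illustrative.)
    """
    factor = (lr_max / lr_min) ** (1.0 / (num_steps - 1))
    return [lr_min * factor ** i for i in range(num_steps)]

lrs = range_test_lrs()
# Train with lrs[i] on mini-batch i while recording the loss, then plot
# loss against learning rate and pick a value just before the loss
# minimum, where the loss is still clearly decreasing.
```

In practice you would stop the sweep early once the loss blows up, since everything past that point is noise.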

2. Policy Application.

Now we can apply the 1cycle Learning Rate Policy, following the three steps detailed in the paper.

2.1. Cyclical Learning Rates (CLR).

First of all, after having found our optimal learning rate, we set it as our maximum learning rate, and the minimum can be a value 10 times lower. Next, we run a cycle of two steps of equal length.

In the first half of the cycle we go from the lowest rate to the highest one, and in the second half we decrease our learning rate until annihilation, i.e., we keep decreasing it below the minimum from which we started.

In practice, this means that at the end of the cycle our learning rate will reach a value roughly 1,000 times lower than the maximum learning rate.

Note that we increase the rate linearly and take at least as much time descending as climbing. The author argues that the high rates during the middle of the cycle act as a regularization method, which helps avoid overfitting. They prevent the model from stepping into a steep area of the loss function and push it toward a much flatter minimum. This way, at the end of training, as the rate comes back down, the model can descend into a steeper local minimum inside that flatter region we found earlier.
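The schedule described above can be sketched as a single function of the iteration number. This is a minimal illustration, assuming linear ramps and an annihilation phase occupying the last 10% of training; the parameter names and defaults are mine, not the paper's.

```python
def one_cycle_lr(it, total_iters, lr_max, div=10.0, final_div=1000.0,
                 pct_annihilate=0.1):
    """1cycle learning rate at iteration `it` (0-based).

    Two linear steps of equal length form the cycle
    (lr_max/div -> lr_max -> lr_max/div), then the rate is annihilated
    toward lr_max/final_div over the final fraction of training.
    """
    lr_min = lr_max / div
    cycle_iters = int(total_iters * (1 - pct_annihilate))
    step = cycle_iters // 2
    if it < step:
        # first half of the cycle: climb from lr_min to lr_max
        return lr_min + (lr_max - lr_min) * it / step
    if it < cycle_iters:
        # second half: descend back from lr_max to lr_min
        return lr_max - (lr_max - lr_min) * (it - step) / step
    # annihilation: keep decaying below lr_min, toward lr_max/final_div
    frac = (it - cycle_iters) / max(1, total_iters - cycle_iters)
    return lr_min + (lr_max / final_div - lr_min) * frac
```

For example, with `lr_max=1.0` over 1,000 iterations, the rate starts at 0.1, peaks at 1.0 at iteration 450, returns to 0.1 by iteration 900, and then decays toward 0.001.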

Finally, the cycle’s large learning rates will allow us to achieve super-convergence. This is the phenomenon the author describes in which the network trains in far fewer iterations than usual, with a constant low difference between the validation loss and the training loss that is maintained until the annihilation of the learning rates.

For a visual explanation, take a look at the following Figure from the paper. The dataset that was used is CIFAR-10, and the experiments used the resnet-56 architecture.

Figure 5a is an example of super-convergence. The author notes that training finished in 10,000 iterations using learning rates that went as high as 3.0, instead of needing 80,000 iterations with a constant initial learning rate of 0.1.

Keep in mind that, since our learning rates become large, we need to reduce other forms of regularization to compensate for the regularization effect of large learning rates.

Observe that Figure 5b shows that weight decay values of 1e−4 or smaller allow the use of learning rates as high as 3.0, while a higher weight decay makes it impossible to train the network with such a large learning rate.

2.2. Cyclical Momentum (CM).

The author notes that, since momentum and learning rate are closely related, their optimal values depend on each other. Remember that momentum’s purpose is to accelerate the network’s training. Therefore, while we are increasing our learning rate in the first half of the cycle, we have to decrease our momentum at the same time, and then increase it again during the second half of the cycle.

The author notes, though, that in the case of momentum, a method similar to the Learning Rate Range Test described above is not useful. The experiments showed that a momentum range test produced a loss that kept decreasing, with an accuracy that kept increasing, as the momentum rose from 0.7 to 1. Hence, the author was not able to discern an optimal momentum value.

The author recommends testing momentum values in the range of 0.9 to 0.99, and picking the two best-performing values as our minimum and maximum momentum. Unlike with the learning rate, at the end of the cycle we keep our momentum locked at the maximum value we chose.

On a final note, the decreasing momentum is what allows the learning rate to become larger at the start and middle of training. The author notes that a constant momentum such as 0.9 would only speed up training, acting as a pseudo-increasing learning rate.
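The cyclical momentum schedule mirrors the learning rate cycle, and can be sketched the same way. The 0.85/0.95 bounds below are common defaults (fastai uses similar values), not values mandated by the paper, and the function name and annihilation fraction are my own.

```python
def one_cycle_momentum(it, total_iters, mom_max=0.95, mom_min=0.85,
                       pct_annihilate=0.1):
    """Cyclical momentum at iteration `it`, inverse of the 1cycle LR:
    descend from mom_max to mom_min while the learning rate climbs,
    climb back while it descends, then stay locked at mom_max during
    the final annihilation phase.
    """
    cycle_iters = int(total_iters * (1 - pct_annihilate))
    step = cycle_iters // 2
    if it < step:
        # learning rate is climbing, so momentum descends
        return mom_max - (mom_max - mom_min) * it / step
    if it < cycle_iters:
        # learning rate is descending, so momentum climbs back
        return mom_min + (mom_max - mom_min) * (it - step) / step
    # annihilation phase: momentum held at its maximum
    return mom_max
```

In a training loop, you would evaluate both schedules at each iteration and write the values into the optimizer’s parameter groups before each step.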

For a visual explanation, take a look at the following Figure from the paper. As before, the dataset is CIFAR-10 and the experiments used the resnet-56 architecture.

Figure 8a shows a test of cyclical momentum. It is combined with the Cyclical Learning Rate method, going from 0.1 to 1.0 over 9,000 iterations, then all the way down to 0.001 at iteration 18,000, with the inverse schedule applied to the momentum. In the figure we can observe that increasing the learning rate while decreasing the momentum performs better than a constant momentum, or than increasing the momentum in conjunction with the learning rate.

Finally, Figure 8b shows the validation accuracy for training resnet-56 on the CIFAR-10 dataset with the 1cycle learning rate schedule and cyclical momentum. Each curve is the average of four runs at the same batch size, and the four curves correspond to total batch sizes (TBS) of 128, 256, 512, and 1024.

2.3. Batch Size, Weight Decay, and Dropout.

Last but not least, to achieve better results, we must also set these hyper-parameters.

For the Batch Size, the author recommends using the largest value that fits in the memory available to the GPU, since larger batch sizes enable the use of larger learning rates.

In the case of Weight Decay, the author’s recommended strategy is to run the Learning Rate Range Test using different values of weight decay. The author considers this grid search method sound, since the differences can be observed early in training, i.e., the validation loss early in training is enough to determine a good Weight Decay value. We should always pick the largest value that lets us train the network at a high maximum learning rate.

The author notes that another grid search method for finding the Weight Decay would be to run a cycle at a middle value of Weight Decay, for example 1e−3, and save a snapshot once the loss has reached its minimum and the accuracy has stopped improving.

Then, we could use this snapshot to restart runs with different values of Weight Decay, for example 1e−2 and 3e−3. This can save time in searching for the best weight decay value.
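The selection rule from the grid search above, "pick the largest weight decay that still trains at a high maximum learning rate", can be sketched as a tiny helper. The function name and the result structure are illustrative: the values would be read off range-test plots like Figure 5b, not computed automatically.

```python
def largest_trainable_wd(results, lr_needed):
    """Pick the largest weight decay that still trains at a high max LR.

    `results` maps each candidate weight decay to the highest learning
    rate at which its range-test loss was still improving (values read
    off the range-test plots). Returns None if no candidate supports
    the required learning rate.
    """
    ok = [wd for wd, lr_ok in results.items() if lr_ok >= lr_needed]
    return max(ok) if ok else None

# Example in the spirit of Figure 5b: suppose wd=1e-4 trained up to
# lr 3.0 while wd=1e-3 only reached 0.5 before the loss diverged.
choice = largest_trainable_wd({1e-4: 3.0, 1e-3: 0.5}, lr_needed=3.0)
```

Here `choice` is 1e−4: it is the largest candidate that still allows the high maximum learning rate we want to use.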

For a visual explanation, take a look at the following Figure from the paper. The dataset used here is CIFAR-100: similar to CIFAR-10 but with 100 classes, so instead of 5,000 training images per class there are only 500 training images and 100 testing images per class. Also note that the experiments used the resnet-56 architecture.