Original article was published by Akash Shastri on Deep Learning on Medium
Why I use Fastai and you should too.
This is part 1 of a multipart series: The things I love the most about my favorite deep learning library, fastai.
This episode: Learning rate (LR)
LR before fastai
Before fastai, the usual way to find the best LR was to train a model fully, until the desired metric was reached, with different optimizers at different LRs, and then pick whichever combination of LR and optimizer worked best. This is an OK technique, but it is computationally expensive.
Note: As I was introduced to fastai early in my deep learning career, I do not know a lot about how things are done without or before fastai. Please let me know if this is inaccurate, and take this section with a grain of salt.
The fastai way
The fastai approach to LRs is influenced by Leslie Smith's paper. There are three main components: finding a good LR to train with (explained in the LR find section), reducing the LR as training progresses (explained in the LR annealing section), and a few caveats for transfer learning (explained in the discriminative LR section) and one cycle training (part of LR annealing).
What should our learning rate be?
This is an important question to ask, as the learning rate is what drives the parameters of our model to optimal solutions. Too low and learning will take too long. Too high and the model will not learn at all. We need a learning rate in a range of values that drives the parameters to convergence at a reasonable pace.
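To make the "too low / too high" point concrete, here is a minimal sketch on a toy problem (not from the original article): plain gradient descent on f(w) = w², whose gradient is 2w. The function name and the three learning rates are illustrative assumptions.

```python
def gradient_descent(lr, steps=50, w=5.0):
    """Minimise f(w) = w**2 (gradient 2*w) with a fixed learning rate."""
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

# Too low: w barely moves. Reasonable: w ends up near 0. Too high: w blows up.
for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr}: final w = {gradient_descent(lr):.4g}")
```

With the tiny LR the parameter is still far from the optimum after 50 steps, with the moderate LR it is very close to 0, and with the large LR every step overshoots by more than it corrects, so the value grows instead of shrinking.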
LR find is fastai's approach to finding a good learning rate. It starts with a very low LR, trains one mini-batch at this LR, and calculates the loss. The next mini-batch is trained at an incrementally higher LR, and this process continues until we reach an LR where the model clearly diverges.
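The procedure can be sketched in plain Python. This is only an illustration of the idea under stated assumptions, not fastai's implementation: the "model" is a single weight fitting y = 3x, the LR grows exponentially between two hypothetical bounds, and "clearly diverges" is taken to mean the loss exceeding 4 times the best loss seen so far.

```python
import random

def lr_range_test(start_lr=1e-5, end_lr=10.0, num_steps=100, seed=0):
    """Train one mini-batch per LR, raising the LR each step; return (lr, loss) pairs."""
    rng = random.Random(seed)
    w = 0.0                                   # single weight; the target weight is 3
    mult = (end_lr / start_lr) ** (1 / (num_steps - 1))
    lr, best, history = start_lr, float("inf"), []
    for _ in range(num_steps):
        batch = [rng.uniform(-1, 1) for _ in range(16)]
        # mean squared error of the prediction w*x against the target 3*x
        loss = sum((w * x - 3 * x) ** 2 for x in batch) / len(batch)
        grad = sum(2 * (w * x - 3 * x) * x for x in batch) / len(batch)
        history.append((lr, loss))
        best = min(best, loss)
        if loss > 4 * best:                   # assumed divergence criterion
            break
        w -= lr * grad                        # one SGD step at the current LR
        lr *= mult                            # raise the LR for the next mini-batch
    return history

history = lr_range_test()
print(f"recorded {len(history)} (lr, loss) points")
```

The recorded pairs are what gets plotted as the LR-versus-loss curve.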
LR is plotted against loss, and we can see our graph below. There is a certain LR where loss is minimum, after which increasing LR any more will worsen loss.
Picking the LR where loss is lowest is a mistake: at that point the loss has stopped decreasing, so the LR is already on the verge of being too high. We need the LR at which loss decreases the fastest, which is the steepest part of the graph. Fastai implements a function that suggests both the steepest point and the minimum-loss LR divided by 10 (which is also a good LR to train with).
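As a rough sketch of those two suggestions, the function below takes a recorded LR/loss curve and returns the minimum-loss LR divided by 10 and the LR where the loss falls fastest. The function name and the sample numbers are made up for illustration; real curves are noisy and usually need smoothing first, which this sketch skips.

```python
import math

def suggest_lrs(lrs, losses):
    """Return (LR at minimum loss / 10, LR where loss drops fastest)."""
    i_min = min(range(len(losses)), key=losses.__getitem__)
    lr_min_div_10 = lrs[i_min] / 10
    # slope of loss with respect to log10(lr) between consecutive points
    slopes = [
        (losses[i + 1] - losses[i]) / (math.log10(lrs[i + 1]) - math.log10(lrs[i]))
        for i in range(len(lrs) - 1)
    ]
    i_steep = min(range(len(slopes)), key=slopes.__getitem__)  # most negative slope
    return lr_min_div_10, lrs[i_steep]

# Hypothetical curve: flat, then a sharp drop, then divergence.
lrs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
losses = [2.30, 2.28, 2.10, 1.20, 0.90, 3.50]
lr_min_div_10, lr_steep = suggest_lrs(lrs, losses)
print(lr_min_div_10, lr_steep)
```

On this made-up curve the loss is lowest at an LR of 0.1, so the first suggestion is about 0.01, while the steepest drop starts at 1e-3.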
This method is computationally cheaper and faster, as we need only 1 epoch (not even that) to find the optimal LR, whereas traditionally we would train 1 epoch for each logical range of LR (definitely more than 1 epoch in total).
The next step of the puzzle is LR annealing. Initially, our parameters are imperfect. They are very far from optimal parameters. But as we train, the parameters get closer to the optimal values.
When our parameters are very far from the optimum (at the beginning), we want to take larger, more imprecise steps in the general direction of the optimum. But as we get closer, we don't want to take large steps and accidentally overshoot. Instead, we want to take smaller steps to arrive at a precise set of parameters.
This is analogous to golf. You don't attempt a hole in one (you could, but that's more luck than skill). You simply try to lob the ball as far as possible in the general direction of the hole. As you get closer, you change clubs to gain more precision and control, and look for smaller steps, each inching you closer to the hole.
This is why at the beginning of training, we want large learning rates, that push us hard and fast towards optimal parameters, but as we get closer, we want to lower the learning rate. As our loss decreases, we want to take smaller steps, and hence use a smaller LR.
This process of altering LR during training is called LR decay or LR annealing. The figure below shows the large initial steps contrasted with the smaller final steps.
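One common way to anneal is along a cosine curve. The sketch below is an assumption about one reasonable schedule, not the only option; the start and end LRs are arbitrary illustrative values.

```python
import math

def cosine_anneal(start_lr, end_lr, progress):
    """Interpolate from start_lr to end_lr as progress goes from 0 to 1."""
    return end_lr + (start_lr - end_lr) * (1 + math.cos(math.pi * progress)) / 2

total_steps = 10
schedule = [cosine_anneal(1e-2, 1e-4, step / (total_steps - 1)) for step in range(total_steps)]
for step, lr in enumerate(schedule):
    print(f"step {step}: lr = {lr:.6f}")
```

The LR starts at the large value, changes slowly at first, drops faster in the middle of training, and flattens out again as it approaches the small final value.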
Fit one cycle
The current general consensus on the difficulty of gradient descent (GD) based training is that the optimizer can get stuck at saddle points. This is different from what we used to think was the main issue (local minima). Leslie Smith shows that increasing the LR helps escape saddle points and reach a good region of the loss function, after which we once again reduce the LR for the reasons explained in LR annealing. However, there is one final step to the fit one cycle approach, as explained by Sylvain Gugger: reducing the LR to one-hundredth of the minimum LR for the last few iterations. This final phase is also known as annihilation.
The steps are simple. Pick an LR as explained in the LR find section of this article; that is the maximum acceptable LR. We also choose a minimum LR, which Sylvain suggests should be one-tenth of the maximum LR. We then cycle between these two values, where the cycle length is slightly less than the total number of epochs. In the last part, we reduce the LR to one-hundredth of the minimum LR.
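The schedule described in these steps can be sketched as a piecewise-linear function. This follows the description in this article rather than fastai's actual implementation; `cycle_frac` (the share of training spent in the cycle) is a hypothetical parameter introduced for illustration.

```python
def one_cycle_lr(step, total_steps, max_lr, cycle_frac=0.9):
    """LR at a given step: min -> max -> min, then decay to min/100."""
    min_lr = max_lr / 10                 # minimum LR is one-tenth of the maximum
    final_lr = min_lr / 100              # final phase ends at one-hundredth of the minimum
    cycle_steps = round(total_steps * cycle_frac)
    half = cycle_steps / 2
    if step < half:                      # first half of the cycle: ramp up
        return min_lr + (max_lr - min_lr) * step / half
    if step < cycle_steps:               # second half of the cycle: ramp down
        return max_lr - (max_lr - min_lr) * (step - half) / half
    tail = total_steps - cycle_steps     # remaining steps: final decay
    frac = (step - cycle_steps) / max(tail - 1, 1)
    return min_lr + (final_lr - min_lr) * frac

lrs = [one_cycle_lr(s, 100, 1e-2) for s in range(100)]
print(f"start={lrs[0]:.5f} peak={max(lrs):.5f} end={lrs[-1]:.7f}")
```

With a maximum LR of 1e-2 over 100 steps, the LR starts at 1e-3, peaks at 1e-2 halfway through the cycle, returns towards 1e-3, and ends at about 1e-5.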
fit_one_cycle(learn:Learner, cyc_len:int, max_lr:Union[float, Collection[float], slice]=slice(None, 0.003, None), moms:Point=(0.95, 0.85), div_factor:float=25.0, pct_start:float=0.3, final_div:float=None, wd:float=None, callbacks:Optional[Collection[Callback]]=None, tot_epochs:int=None, start_epoch:int=None)
Two main parameters are used when scheduling LR, which is what we are doing in the one cycle policy: momentum and step size. For more information and further reading on how LR is scheduled, check here.
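To show how momentum fits in, here is a small sketch built from the defaults in the signature above (`moms=(0.95, 0.85)`, `pct_start=0.3`, `div_factor=25.0`). It assumes momentum falls while the LR rises during the first `pct_start` of training and climbs back afterwards, and that the starting LR is `max_lr / div_factor`. The linear interpolation is a simplification chosen for readability, not necessarily what fastai does internally.

```python
def momentum_at(progress, moms=(0.95, 0.85), pct_start=0.3):
    """Momentum at a training progress in [0, 1]: high -> low -> high."""
    high, low = moms
    if progress < pct_start:             # LR is rising, so momentum falls
        return high + (low - high) * progress / pct_start
    # LR is falling, so momentum rises back to its starting value
    return low + (high - low) * (progress - pct_start) / (1 - pct_start)

max_lr, div_factor = 0.003, 25.0
start_lr = max_lr / div_factor           # assumed starting LR for the cycle
print(f"start_lr = {start_lr:.6f}")
for p in (0.0, 0.15, 0.3, 0.65, 1.0):
    print(f"progress {p:.2f}: momentum = {momentum_at(p):.3f}")
```

Momentum and LR move in opposite directions: when the steps are large, less momentum is carried over, and when the steps shrink again, momentum is restored.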
Discriminative LR (during Transfer learning)
Transfer learning is the process of using a neural network trained for one task to do a different task after minimal training. This is extremely useful, for reasons I will explain now.
Zeiler and Fergus published an amazing paper called Visualizing and Understanding Convolutional Networks, in which they visualize what the different layers of a neural network learn. From the image below, we can see that the first layer recognizes basic lines, colors, and color gradients. The second layer recognizes more complex shapes like edges and circles, and by the third layer, the network starts recognizing patterns.
Now consider that you have a cat detector, which can recognize a cat, and you want to make a bear detector. A cat is much more similar to a bear than random noise is, and most layers of a cat detector have already learned very useful parameters. So we only need to fine-tune the model and change the last few layers.
And fastai is perfect for doing this. The fastai approach to transfer learning is basically 2 steps.
- We first freeze our model, meaning stop gradient calculations for earlier layers, and train only the last 2 layers.
- We then unfreeze the model, so gradients flow back all the way. However, we already know that the earlier layers don’t need a lot of learning, as they have already learned important parameters (common attributes like lines for images). So we need lower learning rates for earlier layers and higher ones for later layers. Fastai has this functionality built-in.
The 4 lines of code to do transfer learning with fastai are:
# assumes `from fastai.vision.all import *` and an existing DataLoaders object `dls`
# load a model of your choice, pretrained on ImageNet (default)
learn = cnn_learner(dls, resnet34, metrics=error_rate)
# train only the last few layers of the model (the rest is frozen)
learn.fit_one_cycle(3, 3e-3)
# unfreeze the model, so we can train all layers
learn.unfreeze()
# use an lr_max slice: the lowest LR goes to the first layers,
# and the LR increases for later layers
learn.fit_one_cycle(12, lr_max=slice(1e-6,1e-4))
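To picture what the `slice(1e-6, 1e-4)` passed as `lr_max` means, the sketch below spreads a slice across a number of layer groups with even multiplicative spacing. The helper `spread_lrs` is hypothetical and only illustrates the idea; the exact spacing and the number of layer groups depend on the library and the model.

```python
def spread_lrs(lr_slice, n_groups):
    """Spread LRs from lr_slice.start (earliest layers) to lr_slice.stop (last layers)."""
    start, stop = lr_slice.start, lr_slice.stop
    if n_groups == 1:
        return [stop]
    ratio = (stop / start) ** (1 / (n_groups - 1))   # constant multiplier between groups
    return [start * ratio ** i for i in range(n_groups)]

group_lrs = spread_lrs(slice(1e-6, 1e-4), 3)
for i, lr in enumerate(group_lrs):
    print(f"layer group {i}: lr = {lr:.1e}")
```

With three layer groups, the earliest layers train at roughly 1e-6, the middle group at roughly 1e-5, and the final layers at roughly 1e-4, so the layers that already hold useful general features change the least.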
The graph below plots loss against LR after the model is unfrozen. As we can see, there is no steep descent in loss. This is in line with our idea that the initial layers have already learned useful information, hence the low loss even at the start and the gradual decline.
The method for selecting the LR remains the same: the minimum-loss LR divided by 10, or the point of steepest descent.