Deep Learning; Personal Notes Part 1 Lesson 2

This blog post series will be updated as I take a second pass through the fast.ai lessons. These are my personal notes; I strive to understand things clearly and explain them well. Nothing new here, only livening up this blog.

Lesson 1 review

We used three lines of code to build an image classifier

How the data is organised: under PATH there should be a train folder and a valid folder. Under each of these there is one folder per classification label (here, cats and dogs), containing the corresponding images.

The training output: [epoch number , training loss, validation loss,accuracy]

0      0.157528   0.228553   0.927593

Choosing a good learning rate

The learning rate decides how quickly we zoom/hone in on a solution. What we are basically doing is trying to find the minimum of a function that might have very many parameters.

We start at a random point and find the gradient to determine which way is up and which is down. The distance we step towards the minimum is proportional to the gradient: the steeper the slope, the further away we probably are. At each point we take the gradient and multiply it by a number, the learning rate (or step size).

If the learning rate is too small, it will take a very long time to get to the bottom. If the learning rate is too big, it could oscillate away from the bottom. If you are training a neural net and find that the loss is shooting off to infinity, the learning rate is too high.
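To make the three cases concrete, here is a minimal sketch of plain gradient descent on the toy function f(w) = w². The function, the starting point and the three learning rates are assumptions chosen for illustration; none of this is fast.ai code.

```python
def gradient_descent(lr, steps=20, w=5.0):
    """Minimise f(w) = w**2 with plain gradient descent and return the final loss."""
    for _ in range(steps):
        grad = 2 * w          # derivative of w**2 at the current point
        w = w - lr * grad     # step against the gradient, scaled by the learning rate
    return w ** 2

# Too small, reasonable, and too big (all three values are illustrative).
for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr}: final loss {gradient_descent(lr):.4g}")
```

With the tiny rate the loss has barely moved after 20 steps, with the middle rate it is close to zero, and with the large rate every step overshoots the minimum and the loss grows.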

We use a learning rate finder (learn.lr_find) to find an appropriate learning rate. With each mini-batch (the number of images we look at each time, so that we use the parallel processing power of the GPU effectively; generally 64 or 128 images at a time) we gradually increase the learning rate multiplicatively. Eventually the learning rate becomes so big that the loss starts getting worse.

We plot the learning rate against the loss, find the lowest point, then go back by one order of magnitude and pick that as our learning rate, since that is where the loss is still clearly decreasing. Here that is 0.01.
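The selection rule can be sketched in a few lines. This is not how learn.lr_find is implemented; lr_schedule, suggest_lr and the loss values are hypothetical names and made-up numbers, used only to show the "lowest point, then back one order of magnitude" idea.

```python
import math

def lr_schedule(start_lr, end_lr, n_batches):
    """Learning rates that grow multiplicatively from start_lr to end_lr."""
    ratio = (end_lr / start_lr) ** (1 / (n_batches - 1))
    return [start_lr * ratio ** i for i in range(n_batches)]

def suggest_lr(lrs, losses):
    """Find the learning rate with the lowest loss and step back one order of magnitude."""
    best = min(range(len(losses)), key=losses.__getitem__)
    return lrs[best] / 10

lrs = lr_schedule(1e-5, 10, 7)                         # one learning rate per mini-batch
losses = [0.90, 0.88, 0.70, 0.35, 0.20, 0.60, 3.00]   # made-up losses for illustration
print(suggest_lr(lrs, losses))
```

In practice you would read the value off the plot rather than compute it, because real loss curves are noisy.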

Math notations in python

Learning rate is the key number to set. Fast ai picks the rest of the hyper parameters for you. There are some more things we can tweak to get slightly better results.

This learning rate finder technique sits on top of the Adam optimiser. Momentum and Adam are ways of improving gradient descent.

The most important thing you can do to make the model better is to give it more data. Since these models have millions of parameters, if you train them for a while, they start to “overfit”.

Overfitting — The model starts to see the specific details of the images in the training set rather than learning something general that can be transferred to the validation set.

We can either collect more data or use Data augmentation.

Data Augmentation

This refers to randomly changing the images in ways that shouldn’t impact their interpretation, such as horizontal flipping, zooming, and rotating.

We can do this by passing aug_tfms (augmentation transforms) to tfms_from_model, with a list of functions to apply that randomly change the image however we wish. For photos that are largely taken from the side (e.g. most photos of dogs and cats, as opposed to photos taken from the top down, such as satellite imagery) we can use the pre-defined list of functions transforms_side_on. We can also specify random zooming of images up to specified scale by adding the max_zoom parameter.

You build a data class 6 times and each time you plot the same cat. Let’s look at some cat pictures of data augmentation.

You want to use different types of data augmentation for different types of image (flip horizontally, flip vertically, zoom in, zoom out, vary contrast and brightness, and many more). For example, when recognising letters and digits you don’t want to flip horizontally, as they would take on a different meaning. You don’t want to flip vertically for cats and dogs, as the images are mostly upright. For icebergs in satellite images you may want to flip them upside down, as it doesn’t matter which side the satellite was on when taking the image.

transforms_side_on, for images taken from the side, slightly varies the photos: it zooms, rotates them slightly, and varies contrast and brightness.

It is not exactly creating new data, but allows the convolutional neural net to learn how to recognize cats or dogs from somewhat different angles.
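As a toy illustration of what "randomly changing the image" means, the sketch below treats an image as a list of pixel rows and flips it horizontally half of the time. The helper names hflip and random_augment are hypothetical; fast.ai's own transforms work on real image arrays and do much more.

```python
import random

def hflip(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]

def random_augment(image, rng):
    """Flip horizontally with probability 0.5; otherwise return an unchanged copy."""
    return hflip(image) if rng.random() < 0.5 else [row[:] for row in image]

cat = [[1, 2, 3],
       [4, 5, 6]]
rng = random.Random(0)
# Six "views" of the same cat, each either the original or its mirror image.
versions = [random_augment(cat, rng) for _ in range(6)]
```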

tfms contains the data augmentation.

The data object includes augmentation. Initially the augmentations don’t do anything because of precompute=True.

In the above picture, each layer has activations that look for things like the middle of flowers or the eyeballs of birds (circled in red). The later layers of a convolutional neural network have activations (which are basically numbers) that say, for example, that in this picture the eyeball of a bird is at this location with this level of confidence (probability).

We have a pre-trained network that has learnt to recognise features (certain kinds of things). We take the second-to-last layer, which has all the information needed to recognise these things, for example the level of “eyeballness”, “fluffy earness” etc. For every image we save these activations, and we call them pre-computed activations. We can then create a new classifier that takes advantage of them: we can quickly train a simple linear model on the pre-computed activations. That is what precompute=True means.

This is why when you train your model for the first time, it takes longer — it is pre-computing these activations.

Even though we are trying to show a different version of the cat each time, we had already pre-computed the activations for a particular version of the cat i.e. we are not re-calculating the activations with the altered version. When precompute=True data augmentation does not work. We have to set it to learn.precompute=False for data augmentation to work.
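The effect can be mimicked with a small cache. In this sketch (an illustration only, with hypothetical names backbone and activations, not fast.ai internals) the "frozen layers" are a toy function, and the stored result is reused whenever precompute is on, so an augmented image never reaches them.

```python
calls = {"backbone": 0}  # counts how often the expensive frozen layers run
cache = {}               # image id -> stored activations

def backbone(image):
    """Stand-in for the frozen pre-trained layers: one 'activation' per pixel column."""
    calls["backbone"] += 1
    return [sum(col) for col in zip(*image)]

def activations(image_id, image, precompute):
    """Return activations, reusing the stored ones when precompute is on."""
    if not precompute:
        return backbone(image)
    if image_id not in cache:
        cache[image_id] = backbone(image)
    return cache[image_id]

cat = [[1, 2, 3], [4, 5, 6]]
flipped = [row[::-1] for row in cat]  # an augmented version of the same image

first = activations("cat.0", cat, precompute=True)
again = activations("cat.0", flipped, precompute=True)    # augmentation is ignored
fresh = activations("cat.0", flipped, precompute=False)   # augmentation takes effect
```

The second call returns the stored activations of the original cat, which is why augmentation only starts to matter once precompute is switched off.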

The bad news is that accuracy is not improving. The good news is that the training loss (trn_loss), a way of measuring the error of this model, is getting better, i.e. the error is decreasing. The validation loss (val_loss) is not decreasing, but we are not overfitting. Overfitting would mean that the training loss is much lower than the validation loss. In other words, when your model is doing a much better job on the training set than on the validation set, your model is not generalizing.

What is that cycle_len parameter?

cycle_len=1 This enables stochastic gradient descent with restarts (SGDR). The basic idea is as you get closer and closer to the spot with the minimal loss, you may want to start decreasing the learning rate (taking smaller steps) in order to get to exactly the right spot. The idea of decreasing your learning rate as you train is called learning rate annealing. This is helpful because as we get closer to the optimal weights, we want to take smaller steps.

Stepwise annealing: you train a model with a certain learning rate for a while and, when it stops improving, manually drop the learning rate. You then pick another learning rate and repeat the process, all very manually.

Cosine annealing: this turns out to be a better approach. Simply pick some kind of functional form; a really good one turns out to be one half of the cosine curve, which starts with a high learning rate at the beginning, then drops quickly as you get closer.
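One common way to write the half-cosine curve is lr(t) = lr_max / 2 · (1 + cos(π t / T)). The sketch below assumes that form, with the minimum learning rate taken as 0; cosine_anneal is a hypothetical helper, not a fast.ai function.

```python
import math

def cosine_anneal(lr_max, iteration, n_iterations):
    """Learning rate following half a cosine curve from lr_max down towards 0."""
    progress = iteration / n_iterations
    return lr_max / 2 * (1 + math.cos(math.pi * progress))

# One learning rate per mini-batch over a run of 100 iterations.
schedule = [cosine_anneal(0.01, i, 100) for i in range(100)]
```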

During training it is possible for gradient descent to get stuck at local minima rather than the global minimum.

At local minima the loss is worse and with a slightly different data set it won’t generalize. At global minimum the model will generalize better.

Note that annealing is not necessarily the same as restarts

We are not starting from scratch each time, but we are ‘jumping’ a bit to ensure we are in the best minima.

However, we may find ourselves in a part of the weight space that isn’t very resilient — that is, small changes to the weights may result in big changes to the loss. We want to encourage our model to find parts of the weight space that are both accurate and stable. Therefore, from time to time we increase the learning rate (this is the ‘restarts’ in ‘SGDR’), which will force the model to jump to a different part of the weight space if the current area is “spikey”. Here’s a picture of how that might look if we reset the learning rates 3 times (in this paper they call it a “cyclic LR schedule”):

By increasing the learning rate suddenly, gradient descent may “hop” out of the local minima and find its way toward the global minimum. Doing this is called stochastic gradient descent with restarts (SGDR), an idea shown to be highly effective in this paper.

The number of epochs between resetting the learning rate is set by cycle_len, and the number of times this happens is referred to as the number of cycles, and is what we’re actually passing as the 2nd parameter to fit(). So here’s what our actual learning rates looked like:

The learning rate is restored to its original value after each epoch.

The learning rate is reset at the start of each epoch to the original value you entered as a parameter, then decreases again over the epoch as described above in cosine annealing.

Each time the learning rate drops to its minimum point (every 100 iterations in the figure above), we call this a cycle.

Can we get the same effect by using random starting points? Before SGDR was created, people used to create “ensembles” where they would relearn a whole new model ten times in the hope that one of them would end up being better. In SGDR, once we get close enough to the optimal and stable area, resetting does not actually “reset”; the weights keep getting better. So SGDR will give you better results than just randomly trying a few different starting points.

We pick 1e-2 (0.01) as the highest learning rate for SGD to use. The learning rate changes with every single mini-batch. How often it is reset is defined by the cycle_len parameter: cycle_len=1 means it is reset after every epoch.
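Putting cosine annealing and restarts together gives a saw-tooth of half-cosine curves. This is a sketch under the assumption that the rate decays towards 0 inside each cycle and jumps back to the maximum at the start of the next one; sgdr_schedule and batches_per_epoch are illustrative names, not fast.ai parameters.

```python
import math

def sgdr_schedule(lr_max, n_cycles, cycle_len, batches_per_epoch):
    """One learning rate per mini-batch: cosine decay within a cycle, reset at each new cycle."""
    per_cycle = cycle_len * batches_per_epoch
    lrs = []
    for _ in range(n_cycles):
        for i in range(per_cycle):
            lrs.append(lr_max / 2 * (1 + math.cos(math.pi * i / per_cycle)))
    return lrs

# 3 cycles of 1 epoch each, assuming 100 mini-batches per epoch.
lrs = sgdr_schedule(lr_max=1e-2, n_cycles=3, cycle_len=1, batches_per_epoch=100)
```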

Our main goal is to generalize and not end up in the narrow optima. In this method, are we keeping track of the minima and averaging them and ensembling them? We are not currently doing that but if you wanted it to generalize even better, you can save the weights right before the resets and take the average. But for now, we are just going to pick the last one. (at the 1000 iteration)

There is a parameter called cycle_save_name which you can add as well as cycle_len, which will save a set of weights at the end of every learning rate cycle and then you can ensemble them.

Our validation loss isn’t improving much, so there’s probably no point further training the last layer on its own.

Saving and loading the model

From time to time, save your weights, passing a filename such as 224_lastlayer.

Pre-computed activations and resized images are saved in the data folder in tmp files. Deleting the tmp folder is the fast.ai equivalent of turning it off and on again.

Models are saved in the models folder when the model is saved.

What if you wanted to retrain a model from scratch? There is generally no reason to delete the pre-computed activations, because they do not depend on any training.

Fine-tuning and differential learning rate annealing

So far nothing we have done has changed the pre-trained filters. We have used a pre-trained model that knows how to find edges and gradients (layer 1), corners and curves (layer 2), then repeating patterns and text (layer 3) and eventually eyeballs (layers 4 and 5). We have not retrained any of those activations, or more specifically the weights in the convolutional kernels. All we have done is add some new layers on top and learn how to mix and match pre-trained features.

Images like satellite images, CT scans, etc have totally different kinds of features all together compared to ImageNet images, so you want to re-train many layers. For dogs and cats, images are similar to what the model was pre-trained with, but we still may find it is helpful to slightly tune some of the later layers.

Now that we have a good final layer trained, we can try fine-tuning the other layers. To tell the learner that we want to unfreeze the remaining layers, just call unfreeze(). This tells the learner we want to start changing the convolutional filters.


A frozen layer is a layer that is not trained, i.e. its weights are not updated.

unfreeze() unfreezes all the layers.

Layer one, which detects edges and gradients, and layer two, which detects curves and corners, don’t need much learning; they barely need to change, while the much later layers do. This holds generally when fine-tuning for other image recognition tasks.

What we do is create an array of learning rates.


The earlier layers (as we’ve seen) have more general-purpose features. Therefore we would expect them to need less fine-tuning for new datasets. For this reason we will use different learning rates for different layers: the first few layers will be at 1e-4 for basic geometric features and layers closest to the pixels, the middle layers at 1e-3 for the middle sophisticated convolutional layers, and 1e-2 as before for the layers we add on top (fully connected layers). We refer to this as differential learning rates, although there’s no standard name for this technique in the literature that we’re aware of.
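The idea can be sketched as an SGD step in which every layer group has its own learning rate. The weights and gradients below are toy numbers and sgd_step is a hypothetical helper; the real optimiser is handled by the library.

```python
def sgd_step(layer_groups, grads, lrs):
    """Update each layer group with its own learning rate."""
    return [
        [w - lr * g for w, g in zip(weights, group_grads)]
        for weights, group_grads, lr in zip(layer_groups, grads, lrs)
    ]

lrs = [1e-4, 1e-3, 1e-2]              # early layers, middle layers, newly added layers
layer_groups = [[1.0], [1.0], [1.0]]  # toy weights, one per group
grads = [[1.0], [1.0], [1.0]]         # identical gradients, to isolate the effect of lr
updated = sgd_step(layer_groups, grads, lrs)
```

With identical gradients, the early group moves 100 times less than the newly added layers, which is exactly the intent: general-purpose features are nudged, task-specific ones are trained.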

Why 3? Actually they are 3 ResNet blocks but for now, think of it as a group of layers.

How are differential learning rates different from grid search? There is no similarity to grid search. Grid search is where you try to find the best hyperparameter; for a learning rate, for example, it tries a lot of learning rates to find which is best. With differential learning rates, the entire training run uses a different learning rate for each group of layers.

What if I have bigger images than the model was trained with? With this library and the modern architectures we are using, we can use any size we like.

Can we unfreeze just specific layers? We are not doing it yet, but if you wanted, you can call learn.unfreeze_to(n), which will unfreeze layers from layer n onwards. It almost never helps because, using differential learning rates, the optimizer can learn just as much as it needs to. The one place it is helpful is if you are using a really big, memory-intensive model and you are running out of GPU memory: the fewer layers you unfreeze, the less memory and time it takes.

Note: you can’t unfreeze one specific layer.

Earlier we said 3 is the number of epochs, but it is actually cycles. In this case learn is doing 3 cycles of 1 epoch.(cycle_len=1)

If cycle_len=2, it will do 3 cycles where each cycle is 2 epochs (i.e. 6 epochs).

Then why did it run 7 epochs? It is because of cycle_mult=2, which doubles the length of each cycle (1 epoch + 2 epochs + 4 epochs = 7 epochs).
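The arithmetic is easy to check: cycle i (counting from 0) lasts cycle_len × cycle_mult^i epochs. A small sketch, with total_epochs as a hypothetical helper name:

```python
def total_epochs(n_cycle, cycle_len, cycle_mult=1):
    """Return the length of each cycle in epochs and their sum."""
    lengths = [cycle_len * cycle_mult ** i for i in range(n_cycle)]
    return lengths, sum(lengths)

print(total_epochs(3, 1))      # 3 cycles of 1 epoch
print(total_epochs(3, 2))      # 3 cycles of 2 epochs
print(total_epochs(3, 1, 2))   # cycle length doubles each time
```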

Using differential learning rates, we have a model that is 99.05% accurate.

If the cycle length is too short, it starts going down to find a good spot, then pops out, goes down again trying to find a good spot and pops out, and never actually gets to find a good spot. Earlier on, you want it to do that because it is trying to find a spot that is smoother, but later on, you want it to do more exploring. That is why cycle_mult=2 seems to be a good approach.

We are introducing more and more hyper parameters having told you that there are not many. You can get away with just choosing a good learning rate, but then adding these extra tweaks helps get that extra level-up without any effort. In general, good starting points are:

  • n_cycle=3, cycle_len=1, cycle_mult=2
  • n_cycle=3, cycle_len=2 (no cycle_mult)

Why do smoother surfaces correlate to more generalized networks?

The X-axis shows how good the model is at recognizing dogs vs. cats as you change this particular parameter. To be generalizable means that we want it to work when we give it a slightly different dataset. A slightly different dataset may have a slightly different relationship between this parameter and how cat-like vs. dog-like an image is. It may instead look like the red line. In other words, if we end up at X or Z, the model is not going to do a good job on this slightly different dataset, whereas if we end up at Y, it will still do a good job on the red dataset.

Let’s take a look at pictures we predicted incorrectly

When we do the validation set, all of our inputs to our model must be square. The GPU does not go very quickly if you have different dimensions for different images. It needs to be consistent so that every part of the GPU can do the same thing.

To make it square, we just pick out the square in the middle. As you can see below, it is understandable why this picture was classified incorrectly.

The dog’s head was not identified.

We will use Test Time Augmentation (TTA). At inference or test time, it makes predictions not just on the images in your validation set, but also on a number of randomly augmented versions of them (by default, it uses the original image along with 4 randomly augmented versions). It then takes the average prediction from these images and uses that as the final prediction. To use TTA on the validation set, we can use the learner’s TTA() method.
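The averaging step can be sketched without any deep learning library. Everything here is a toy assumption: toy_predict stands in for a trained model and toy_augment for the random transforms; only the "original plus 4 augmented versions, then average" logic mirrors the description above.

```python
import random

def tta_predict(predict, image, augment, n_aug=4, seed=0):
    """Average class probabilities over the original image and n_aug augmented copies."""
    rng = random.Random(seed)
    versions = [image] + [augment(image, rng) for _ in range(n_aug)]
    all_probs = [predict(v) for v in versions]
    n_classes = len(all_probs[0])
    return [sum(p[c] for p in all_probs) / len(all_probs) for c in range(n_classes)]

def toy_predict(image):
    """Hypothetical model: probability of 'dog' grows with the mean pixel value."""
    pixels = [px for row in image for px in row]
    p_dog = sum(pixels) / len(pixels)
    return [1 - p_dog, p_dog]

def toy_augment(image, rng):
    """Hypothetical augmentation: shift brightness slightly, clipped to [0, 1]."""
    shift = rng.uniform(-0.1, 0.1)
    return [[min(1.0, max(0.0, px + shift)) for px in row] for row in image]

image = [[0.6, 0.7], [0.8, 0.5]]
probs = tta_predict(toy_predict, image, toy_augment)  # [P(cat), P(dog)]
```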

The accuracy improved to 99.25%. The neural net sees multiple augmentations of the same picture, which makes the accuracy go up.

NOTE: TTA is for the validation/test set. When training, we are not doing TTA.

Why not add a border or padding to make it square? It does not help much with neural nets, as the image of the cat itself does not change. Zooming would work. Reflection padding, where you add borders on the outside that reflect the image and so make it bigger, works well with satellite imagery. Generally speaking, when using TTA plus data augmentation, the best thing to do is to use images that are as large as possible. If you crop, you tend to lose, for example, the dog’s face.

Data augmentation for non-image datasets? No one seems to know. It seems like it would be helpful, but there are very few examples. In natural language processing, people have tried replacing words with synonyms, for instance, but on the whole the area is under-researched and under-developed.

Can we use a sliding window to generate other images for example generate 3 image parts from one picture of a dog? For training that would not be better because we would not get much better variations, because you have like three standard ways you are giving it to look at the data. You want to give it as many ways to look at the data. Having fixed crop locations plus random contrast, brightness, rotation changes might be better for TTA.

Analyzing results

Confusion Matrix

A quick way to evaluate a classification algorithm is to use a confusion matrix. It helps identify which classes you are having trouble with.

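# probs: class probabilities per image, shape (n_images, n_classes)
preds = np.argmax(probs, axis=1)  # predicted class index for each image
probs = probs[:,1]  # keep only the probability of class 1
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y, preds)  # rows: true classes, columns: predicted classes
plot_confusion_matrix(cm, data.classes)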

We have 987 cats that we predicted correctly and 13 that we got wrong, and 993 dogs that we predicted correctly and 7 that we got wrong.
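To see where those four numbers come from, here is a hand-rolled sketch of the counting that a confusion matrix does. The labels are reconstructed from the counts quoted above (0 = cat, 1 = dog is an assumed encoding), and confusion_counts is a hypothetical helper, not the sklearn function.

```python
def confusion_counts(y_true, y_pred, n_classes):
    """cm[i][j] counts items whose true class is i and predicted class is j."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

# Rebuild labels matching the quoted counts (0 = cat, 1 = dog).
y_true = [0] * 1000 + [1] * 1000
y_pred = [0] * 987 + [1] * 13 + [1] * 993 + [0] * 7

cm = confusion_counts(y_true, y_pred, n_classes=2)
for row in cm:
    print(row)
```

The diagonal holds the correct predictions; the off-diagonal cells show which class was mistaken for which.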

Steps to train a world-class image classifier

  1. Enable data augmentation (side_on or top_down depending on what you are doing), and precompute=True.
  2. Use lr_find() to find the highest learning rate where the loss is still clearly improving.
  3. Train the last layer from precomputed activations for 1–2 epochs.
  4. Turn off precompute (precompute=False), which allows us to use data augmentation, and train for 2–3 epochs with cycle_len=1.
  5. Unfreeze all layers.
  6. Set earlier layers to a 3x-10x lower learning rate than the next higher layer. Rule of thumb: 10x for ImageNet-like images, 3x for satellite or medical imaging.
  7. Use lr_find() again (Note: if you call lr_find having set differential learning rates, it prints out the learning rate of the last layers.)
  8. Train the full network with cycle_mult=2 until over-fitting.

Let’s do it again

This challenge is to determine the breed of a dog in an image.

Use the kaggle CLI to download the data. It is an unofficial Kaggle command line tool, useful for downloading data when using cloud VM instances such as AWS or Paperspace. Make sure you accept the competition rules before using the CLI. If you log in to Kaggle through another account, you have to use the forgotten-password flow and choose the third option to set up a new password and link your two accounts.

$ kg download -u <username> -p <password> -c dog-breed-identification -f <name of file>

Here dog-breed-identification is the name of the competition; you can find it at the end of the competition URL, after the /c/ part.

Once the file download is complete, we can extract the files using following commands.

# To extract .7z files
7z x -so <file_name>.7z | tar xf -
# To extract .zip files
unzip <file_name>.zip

structure of the dogbreeds folder

This is different from our previous dataset. Instead of a train folder with a separate folder for each breed of dog, it has a CSV file with the correct labels.

The imports

from fastai.imports import *
from fastai.torch_imports import *
from fastai.transforms import *
from fastai.conv_learner import *
from fastai.model import *
from fastai.dataset import *
from fastai.sgdr import *
from fastai.plots import *
PATH = "data/dogbreeds/"
sz = 224
arch = resnext101_64
bs = 58

We will read the CSV file with Pandas, which is used for structured data analysis.

label_csv = f'{PATH}labels.csv'
n = len(list(open(label_csv))) - 1  # header is not counted (-1)
val_idxs = get_cv_idxs(n)  # random 20% data for validation set
val_idxs
# array([2882, 4514, 7717, ..., 8922, 6774, 37])
len(val_idxs)  # 20% of 10222

n = len(list(open(label_csv)))-1 : Open CSV file, create a list of rows, then take the length. -1 because the first row is a header. Hence n is the number of images/rows we have.

val_idxs = get_cv_idxs(n) : “get cross validation indexes” — this will return, by default, random 20% of the rows (indexes) to use as a validation set. You can also send val_pct to get a specific percentage e.g val_idxs = get_cv_idxs(n, val_pct=1.0) gets 100%, but 20% is the default.
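Conceptually, picking validation indexes is just "shuffle the row numbers and keep a fixed share". The sketch below shows that idea with the standard library; cv_idxs is a hypothetical stand-in and makes no claim about how get_cv_idxs is implemented internally.

```python
import random

def cv_idxs(n, val_pct=0.2, seed=42):
    """Pick a random val_pct share of the row indexes 0..n-1 for validation."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    idxs = list(range(n))
    rng.shuffle(idxs)
    return idxs[:int(val_pct * n)]

val_idxs = cv_idxs(10222)
print(len(val_idxs))
```

Fixing the seed matters: if the split changed between runs, validation scores from different experiments would not be comparable.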

This consists of image name or id and the label.

Below is a pandas frame to group how many dogs are of the different breeds.

There are 120 rows representing 120 breeds.

Going through the steps:

tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
data = ImageClassifierData.from_csv(PATH, 'train', f'{PATH}labels.csv', test_name='test', # we need to specify where the test set is if you want to submit to Kaggle competitions
val_idxs=val_idxs, suffix='.jpg', tfms=tfms, bs=bs)

Enabling data augmentation;

tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)

Call tfms_from_model and pass aug_tfms=transforms_side_on, since these are probably side-on photos.

max_zoom: we will zoom into the image by up to 1.1 times.

data = ImageClassifierData.from_csv(PATH, 'train', f'{PATH}labels.csv', test_name='test', val_idxs=val_idxs, suffix='.jpg', tfms=tfms, bs=bs)

ImageClassifierData.from_csv: last time we used from_paths (which says the names of the folders are the names of the labels), but since the labels are in a CSV file we call from_csv instead and pass f'{PATH}labels.csv', the CSV file that contains the labels. PATH is the folder that contains all the data, and the train folder contains the training data. test_name specifies where the test set is, in case you submit to Kaggle later.

val_idxs: there is no validation folder, but we still want to track how good our performance is locally. This separates out images and puts them in a validation set.

suffix='.jpg': file names have .jpg at the end, but the CSV file does not. So we set suffix so it knows the full file names.

Get the training data set in the data object using trn_ds which contains the file names. Below is an example of a filename (fnames).

img = PIL.Image.open(PATH + data.trn_ds.fnames[0]); img

We need to check the size of the images; if they are too large or too small we need to know how to deal with them. Most ImageNet models are trained on either 224 by 224 or 299 by 299 images.

Create a dictionary comprehension:

size_d = {k: PIL.Image.open(PATH + k).size for k in data.trn_ds.fnames}

Go through all the files and create a dictionary that maps the name of the file to the size of that file.

row_sz, col_sz = list(zip(*size_d.values()))

Takes the dictionary and turns it to rows and columns. Then turn them into numpy arrays as shown below

row_sz = np.array(row_sz); col_sz = np.array(col_sz)

Here are the first five row sizes

array([500, 500, 500, 500, 500])

Plotting with matplotlib: images against the number of pixels.

From the histogram, most images are around 500 pixels.

Plotting only those with fewer than 1000 pixels to zoom in on the diagram:

4599 images lie within 451 pixels.

How many images should be in the validation set? The size of the validation set depends on the size of your dataset. It should not always be 20%. If you train the same model multiple times and you are getting very different validation set results, then your validation set is too small.

The dog seems to be at the centre and to take up most of the frame, so we don’t need cropping. This would be different for medical imaging, where the tumor might be on one side of the frame and zooming would be required.

Initial Model

Here are the regular two lines of code. When we start working with a new dataset, we want everything to go super fast. So we made it possible to specify the size and start with something like 64, which will run fast. Later, we will use bigger images and bigger architectures, at which point you may run out of GPU memory. If you see a CUDA out of memory error, the first thing you need to do is restart the kernel, then make the batch size smaller.


We will use a pre-computed classifier.

We get 91% accuracy for the 120 classes, with no data augmentation or unfreezing.

Let’s turn pre-compute off and run a few more epochs.

The accuracy improved to 92%.

An epoch is one pass through the data; a cycle is however many epochs you said are in a cycle. It is the learning rate going from what you asked for down to zero. In this case, since cycle_len=1, epochs and cycles are the same.

Let’s save the model as '224_pre'.

Increase image size

If you trained a model on smaller size images, you can then call learn.set_data and pass in a larger size dataset. That is going to take your model, however it has been trained so far, and it is going to let you continue to train on larger images.

learn.set_data(get_data(299, bs))

Starting training on small images for a few epochs, then switching to bigger images, and continuing training is an amazingly effective way to avoid overfitting.

set_data doesn’t change the model at all. It just gives it new data to train with.

Validation loss (0.239) is much lower than training loss (0.297). This is a sign of underfitting. cycle_len=1 may be too short: the learning rate is getting reset before it has had a chance to zoom in properly.

Let’s add cycle_mult=2 (i.e. the 1st cycle is 1 epoch, the 2nd cycle is 2 epochs, and the 3rd cycle is 4 epochs, for 7 epochs in total).

The validation loss and training loss are getting closer and smaller and almost the same. We are on the right track.

Test Time Augmentation(TTA)

Other things to try:

  • Running one more cycle of 2 epochs
  • Unfreezing; in this case, training convolutional layers did not help in the slightest since the images actually came from ImageNet.
  • Remove validation set and just re-run the same steps, and submit that. This lets us use 100% of the data.

How do we deal with an unbalanced dataset? This dataset is not totally balanced (each breed has between 60 and 100 images), but it is not unbalanced enough to give it a second thought. A recent paper says the best way to deal with a very unbalanced dataset is to make copies of the rare cases.
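"Making copies of the rare cases" can be sketched as repeating row indexes until every class is as large as the biggest one. The breed names and counts below are made up, and oversample is a hypothetical helper; the paper mentioned above may differ in its details.

```python
from collections import Counter

def oversample(labels):
    """Return row indexes with rare classes repeated until every class matches the largest."""
    target = max(Counter(labels).values())
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    balanced = []
    for idxs in by_class.values():
        # Cycle through the rows of this class until it reaches the target size.
        balanced.extend(idxs[k % len(idxs)] for k in range(target))
    return balanced

labels = ["pug"] * 6 + ["beagle"] * 2   # toy, deliberately unbalanced labels
balanced = oversample(labels)
print(Counter(labels[i] for i in balanced))
```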

Difference between precompute=True and unfreeze?

We started with a pre-trained network which was finding activations with rich features. We added a couple of layers on the end of it, which start out random. With everything frozen and precompute=True, all we are learning is the layers we have added. With precompute=True, we actually pre-calculate how much each image looks like the things those activations detect, so data augmentation does not do anything because we are showing exactly the same activations each time.

We then set precompute=False, which means we are still only training the last layers we added, because the rest is frozen, but data augmentation is now working because it is actually going through and recalculating all of the activations from scratch. Then finally, we unfreeze, which is saying “okay, now you can go ahead and change all of these earlier convolutional filters”.

Why not just set precompute=False from the beginning? The only reason to have precompute=True is that it is much faster, 10 or more times. If you are working with quite a large dataset, it can save quite a bit of time. There is never an accuracy reason to use precompute=True.

Minimum version that would get you good results;

  1. Use lr_find() to find the highest learning rate where the loss is still clearly improving.
  2. Train last layer with data augmentation (i.e. precompute=False) for 2–3 epochs with cycle_len=1 (By default everything is frozen from the start)
  3. Unfreeze all layers.
  4. Set earlier layers to 3x-10x lower learning rate than next higher layer. (Use differential learning rates)
  5. Train full network with cycle_mult=2 until over-fitting.

Does reducing the batch size only affect the speed of training? Mostly, yes. If you show it fewer images each time, it calculates the gradient from fewer images, so the gradient is less accurate. In other words, knowing which direction to go and how far to go in that direction is less accurate. So as you make the batch size smaller, you make training more volatile. It impacts the optimal learning rate that you would need to use, but in practice, dividing the batch size by 2 vs. 4 does not seem to change things very much. If you change the batch size by a lot, you can re-run the learning rate finder to check whether the best learning rate has changed.

What would you have done if the dog was off in the corner or tiny? This will be covered in Part 2, but there is a technique that allows you to figure out roughly which parts of an image most likely have the interesting things in them. Then you can crop out that part.
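The "more volatile" part can be illustrated with a toy simulation. Assume every image contributes a noisy gradient around a true value of 1.0 (a made-up Gaussian model, not real training); the mini-batch gradient is their mean, and its spread shrinks as the batch grows.

```python
import random
import statistics

def gradient_estimates(batch_size, n_batches=200, seed=0):
    """Mean of noisy per-image 'gradients' (true value 1.0) for many mini-batches."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gauss(1.0, 1.0) for _ in range(batch_size))
        for _ in range(n_batches)
    ]

for bs in (4, 64):
    spread = statistics.stdev(gradient_estimates(bs))
    print(f"batch size {bs}: spread of the gradient estimate {spread:.3f}")
```

In this toy model the spread falls roughly with the square root of the batch size, which is one way to see why a small batch gives a noisier direction for each step.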

Further Improvements

  1. Assuming the size of images you were using is smaller than the average size of images you have been given, you can increase the size. As we have seen before, you can increase it during training.
  2. Using better architecture. There are different ways of putting together what size convolutional filters and how they are connected to each other. Different architectures have different number of layers, size of kernels, number of filters, etc.

We used ResNet34. It does not have too many parameters and works well with small datasets. ResNext50 takes twice as long and 2–4 times more memory than ResNet34.

Ran into RuntimeError: cuda runtime error (2) : out of memory. Try restarting your kernel and using a smaller batch size; I used 10.

Using ResNext50, we achieved 99.65% accuracy.

Source: Deep Learning on Medium