Deep Learning ver3 Lesson 3

Source: Deep Learning on Medium

Refer notes: Lesson 1, Lesson 2

This article is compiled from my notes on Lesson 3 of Deep Learning for Coders ver3.

· Caution: Update the fastai library and course material before running any code.

Deploying models to production is as important as developing them. Zeit is one deployment platform suggested by Jeremy, and the guidelines are available on the course website (to be made available after the course is over).

Great applications created by students:

· ‘Which car’ by Edward Ross: Identifies 400 Australian cars from images. Because the model is deployed as a web app, it works on mobile phones too.
· Guitar classifier by C Werner.
· Hummingbird classifier for Trinidad and Tobago.
· Face image to dynamic smiley
· American sign language translator
· ‘Your city from space’ by Henri Palacci: Identifies the country based on satellite image.
· ‘Time Series learning with CNN’ by Ignacio Oguiza. Refer ‘’ for details on encoding time series as images for use with CNNs.

Multi-label Prediction with the Planet Amazon Dataset (Kaggle)

· Multi-label data: each image is labelled not with a single label but with multiple labels, like ‘Cloudy+Rainforest’, ‘Dry+Desert’, etc.
· Data comes from Kaggle: Kaggle provides API tool to download data.
· Data Structure: csv files with tags for images.
· The trickiest step is creating the DataBunch object: the Data Block API in fastai provides a way to build a DataBunch to your requirements. Refer to the docs for guidelines.
· PyTorch objects to know:
 — Dataset: defines indexing (__getitem__) and length (__len__).
 — DataLoader: takes a dataset, grabs items at random, combines them into a mini-batch and puts it on the GPU for modelling.
 — DataBunch: combines the training data loader and the validation data loader.
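These three roles can be sketched in plain Python (a toy illustration of the interfaces only; `ToyDataset` and `toy_loader` are made-up names, not the real PyTorch/fastai classes):

```python
import random

class ToyDataset:
    """Minimal Dataset: defines __getitem__ (indexing) and __len__ (length)."""
    def __init__(self, items, labels):
        self.items, self.labels = items, labels
    def __getitem__(self, i):
        return self.items[i], self.labels[i]
    def __len__(self):
        return len(self.items)

def toy_loader(dataset, batch_size):
    """Minimal DataLoader: grabs items in random order and yields mini-batches.
    (A real DataLoader would also collate the batch and move it to the GPU.)"""
    idxs = list(range(len(dataset)))
    random.shuffle(idxs)
    for start in range(0, len(idxs), batch_size):
        yield [dataset[i] for i in idxs[start:start + batch_size]]

# A DataBunch then simply bundles a training loader and a validation loader.
```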

· Parts needed to build a DataBunch:
 — ImageFileList folder: where to find the data.
 — Labels: from the csv file.
 — Random split / test folder.
 — Datasets: how to convert to datasets.
 — Transforms: data transformations / augmentation.
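Chaining those parts together with the Data Block API looks roughly like this (a sketch in fastai v1 style, not runnable here; exact method names changed across library versions, and `path`, `train_v2.csv` and `tfms` are assumptions):

```python
# Sketch only -- requires fastai v1 and the Planet dataset on disk.
src = (ImageFileList.from_folder(path)            # where to find the data
       .label_from_csv('train_v2.csv', sep=' ')   # labels: space-separated tags in the csv
       .random_split_by_pct(0.2))                 # random train/validation split
data = (src.datasets()                            # convert to datasets
        .transform(tfms, size=128)                # transforms / augmentation
        .databunch().normalize(imagenet_stats))   # combine into a DataBunch
```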

· Image transformations & augmentation:
 — Flipping: horizontal flipping suits dog/cat classification data; for satellite images, vertical flipping is also applied.
 — Warping: images can be warped by perspective changes, so warping should be included for ordinary photos. It is not applicable to satellite images or astronomical maps, which have no perspective distortion.

· create_cnn: the basic requirements are data, an architecture and metrics:
 — Metric: it does not affect the training of the model; it is simply printed so the user can track performance during training.

· Threshold in fbeta: in multi-label classification, the argmax function cannot be used, since argmax picks only the single index with the highest probability. Instead, a probability threshold is set, and all indices (and their respective labels) with probability above the threshold are returned.
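The idea behind thresholding can be shown in plain Python (a toy illustration with made-up class names and probabilities; real fastai code applies the threshold to tensors inside metrics like `accuracy_thresh`):

```python
def labels_above_thresh(probs, classes, thresh=0.2):
    """Return every class whose predicted probability exceeds the threshold.
    Unlike argmax, this can return zero, one, or many labels."""
    return [c for p, c in zip(probs, classes) if p > thresh]

classes = ['agriculture', 'clear', 'cloudy', 'haze', 'primary']
probs = [0.85, 0.60, 0.05, 0.10, 0.90]
print(labels_above_thresh(probs, classes))  # ['agriculture', 'clear', 'primary']
```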

· To create customized functions, Python partial functions are used: refer ‘’.
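`functools.partial` fixes some arguments of a function and returns a new function. A generic example (`power`, `square` and `cube` are made-up names for illustration):

```python
from functools import partial

def power(base, exponent):
    return base ** exponent

square = partial(power, exponent=2)  # a new function with exponent fixed to 2
cube = partial(power, exponent=3)

print(square(5), cube(2))  # 25 8
```

In the lesson this pattern appears as `acc_02 = partial(accuracy_thresh, thresh=0.2)`, which creates an accuracy metric with the probability threshold preset.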

· Process of training the model: train the model with layers frozen > run the LR finder > unfreeze all layers > train using a range of LRs (see Lessons 1 & 2 for details on discriminative LRs for different layers).

· How to choose the LR when the LR finder plot is flat: use a learning rate around 10 times lower than the point where the loss starts rising steeply.
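The recipe above can be sketched in fastai v1 style (illustrative pseudocode; assumes `data` is a DataBunch, and the LR values are placeholders you would read off the finder plot):

```python
learn = create_cnn(data, models.resnet50, metrics=[acc_02, f_score])
learn.fit_one_cycle(5, slice(0.01))        # 1. train with the body frozen
learn.lr_find()                            # 2. run the LR finder
learn.recorder.plot()                      #    inspect the plot
learn.unfreeze()                           # 3. unfreeze all layers
learn.fit_one_cycle(5, slice(1e-5, 1e-3))  # 4. discriminative LRs: lower for early layers
```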

Courtesy: Jeremy Howard & Rachel Thomas

· Progressive resizing: the model is initially trained on small images, then on progressively larger ones. This helps the model generalize better and makes use of transfer learning. However, a new DataBunch object must be created every time the image size changes.
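In practice, progressive resizing means rebuilding the DataBunch at a larger size, swapping it into the existing learner, and continuing training (a sketch in fastai v1 style; `src`, `tfms` and the sizes are assumptions):

```python
# Continue from a learner trained at size=64: rebuild the data at 128px.
data = src.transform(tfms, size=128).databunch().normalize(imagenet_stats)
learn.data = data     # swap in the new, larger-image DataBunch
learn.freeze()        # retrain the head first, then unfreeze as before
learn.fit_one_cycle(5, slice(1e-2))
```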

CamVid video image segmentation

· Data source: ‘’
· The description at site reads: “The Cambridge-driving Labeled Video Database (CamVid) is the first collection of videos with object class semantic labels, complete with metadata. The database provides ground truth labels that associate each pixel with one of 32 semantic classes.”


· The images are labelled with per-pixel colour codes. Such images are encountered in medical video frames/images, self-driving cars, etc.
· The mask files are provided as arrays of integers; use ‘open_mask’ instead of ‘open_image’ to open them.
· A codes text file provides the label corresponding to each integer code.

· Accuracy for pixel-level data: void pixels are ignored.
· The model architecture used is U-Net.
· The other steps are the same as in the previous model.
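Ignoring void pixels in the accuracy can be illustrated in plain Python (a toy version; the lesson's real metric works on tensors, and the void code actually comes from the codes file rather than the `-1` assumed here):

```python
VOID = -1  # assumed code for 'void' pixels

def pixel_accuracy(pred, target, void_code=VOID):
    """Fraction of non-void pixels predicted correctly."""
    pairs = [(p, t) for p, t in zip(pred, target) if t != void_code]
    return sum(p == t for p, t in pairs) / len(pairs)

pred   = [0, 1, 2, 1]
target = [0, 1, 1, -1]  # last pixel is void and is ignored
print(pixel_accuracy(pred, target))  # 2 of 3 non-void pixels correct
```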

Cycles vs Loss. Image courtesy: Jeremy Howard & Rachel Thomas

A plot of loss over cycles shows that the loss initially increases, then goes down, settling into a flat, stable minimum.

‘Fit 1 cycle’: varying the LR helps the coefficient values escape a non-optimal valley

This happens because ‘fit one cycle’ is used to fit the model to the data. Initially the learning rate is low, so the weights can get stuck in a poor local minimum. The high-LR part of ‘fit 1 cycle’ lets the optimizer jump over that hurdle, at the cost of a temporary increase in loss, before ultimately settling into a better minimum.

· If memory is an issue, lower-precision floating-point numbers can be used (e.g. 16-bit: append to_fp16() to the Learner). Surprisingly, this can even improve performance, as the reduced precision can curb overfitting and help the model generalize better.

IMDB Reviews Dataset

· The dataset contains movie reviews as text data; the task is to classify them as positive or negative.
· To create the DataBunch, the following steps are followed:
 — Tokenization: split the text into tokens, each representing a single linguistic concept.
 — Numericalization: convert tokens to numbers (indices into a vocabulary).
· Data Block API: it gives much more flexibility than the default methods.
· Model training: the model is not trained from scratch; instead a pretrained model, trained on a bigger dataset to guess the next word, is used.
· Next, the same steps as in the previous models are repeated. (Details to be covered in Lesson 4.)
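The two preprocessing steps can be illustrated in plain Python (a toy tokenizer; fastai's real tokenizer handles punctuation, casing and special tokens far more carefully):

```python
def tokenize(text):
    # naive tokenization: lowercase and split on whitespace
    return text.lower().split()

def numericalize(tokens, vocab):
    # map each token to its index in the vocabulary (stoi: string-to-int)
    stoi = {tok: i for i, tok in enumerate(vocab)}
    return [stoi[t] for t in tokens]

tokens = tokenize("This movie was great")
vocab = sorted(set(tokens))         # ['great', 'movie', 'this', 'was']
print(numericalize(tokens, vocab))  # [2, 1, 3, 0]
```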

Behind the scenes in Deep Learning

· The layers of the network are linear, connected to each other by activation functions.
· If the activation functions are linear too, the whole network collapses into one linear function.
· Introducing a non-linear activation function makes the network non-linear.
· Universal Approximation Theorem: any continuous function can be approximated by a combination of linear layers and non-linear activation functions.
· Earlier, sigmoid was the most-used activation function. Currently, ReLU (Rectified Linear Unit) or leaky ReLU is generally used.
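Both activations are simple to define (plain-Python versions for scalars; the leaky slope of 0.01 is a typical but arbitrary choice):

```python
def relu(x):
    # zero for negative inputs, identity for positive inputs
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # like ReLU, but lets a small signal through for negative inputs
    return x if x > 0 else slope * x

print(relu(-3), relu(2))              # 0.0 2
print(leaky_relu(-3), leaky_relu(2))  # -0.03 2
```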