fast.ai V2 Lesson 1 Synopsis (TL;DR)

Whether you have already watched Lesson 1, are about to watch it, have bookmarked the ‘fast.ai’ website, or have just come across the course, this blog can serve as a good abstract to look at, or as a way to prepare yourself for what you’ll gain from Lesson 1.

I’ve added some extra information towards the end of the blog for those who are new to Data Science. If you are one of them, keep reading until the end. I have tried to decompose things into small modules, so that each paragraph talks about a particular topic from the video. If you find any loose ends, don’t worry: the first lesson is just meant to provide a general idea.

fast.ai uses a top-down approach, which means that to learn Deep Learning, instead of first taking a statistics course, a mathematics course, a programming course and then finally a Deep Learning course, we dive straight into the models and pick up the necessary topics as they are needed.

Now coming back to Lesson 1. The first lesson is like a typical first chapter of every course textbook. Chapter 1: THE INTRODUCTION.

Yes! I am talking about that chapter which, half the time, ends up being the last chapter of the book we actually read. (I do that too. High five!) Well, Lesson 1 of fast.ai is that introduction chapter.

The gist of Lesson 1 can broadly be divided into 4 categories.

  1. General Deep Learning intro
  2. The topics of later lessons
  3. Setting up Virtual Machines (V.M.)
  4. A bit about the code, the language this code is built on and other details

General Intro to Deep Learning:

The advantage Deep Learning (DL) has over Machine Learning (ML) algorithms is that ML requires a lot of feature engineering. ML also has various algorithms to choose from, and for each data set we have to check which one works best. DL makes life easier by using deep neural networks (hence the name Deep Learning).

Neural Networks (NN) look something like this.

©http://neuralnetworksanddeeplearning.com/images/tikz12.png

Deep neural networks, on the other hand, look something like this.

©https://i.stack.imgur.com/1bCQl.png

Forget about the complexity; just see how beautiful the patterns are.

So the advantage Deep Learning has is threefold: it is infinitely flexible, it offers all-purpose parameter fitting, and it is fast & scalable.

Given any data, a neural network should theoretically be able to fit it approximately, provided we give it sufficient parameters. This is known as the Universal Approximation Theorem. But this breaks down on real datasets, because real data is noisy and a single layer would need an impractical number of parameters. For this reason we build deep networks by layering one layer over another, and it works like a charm. So, Flexibility — CHECK
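
To see what “layering one over another” looks like in code, here is a minimal sketch in PyTorch (the library fastai is built on); the layer sizes are arbitrary and only for illustration.

```python
import torch.nn as nn

# Each Linear + ReLU pair is one layer; stacking several of them is
# what turns a shallow network into a *deep* one.
model = nn.Sequential(
    nn.Linear(28 * 28, 256), nn.ReLU(),  # layer 1
    nn.Linear(256, 128), nn.ReLU(),      # layer 2
    nn.Linear(128, 10),                  # output layer
)

# The "sufficient parameters" the Universal Approximation Theorem asks for:
n_params = sum(p.numel() for p in model.parameters())
print(n_params)  # roughly 235,000 for this toy configuration
```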

The goal of a deep neural network (DNN) is to fit its parameters to the data at hand as well as possible. Randomly searching for the best parameters will not work; the better way to get good parameters is to repeatedly move them in the direction that reduces the loss. Gradient descent gives us this direction: for any problem, the negative gradient points towards lower loss. So, All Purpose Parameter Fitting — CHECK
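
In code, “follow the direction which reduces the loss” boils down to one update rule: step along the negative gradient. A tiny sketch with a made-up one-parameter loss:

```python
import torch

# Toy loss: L(w) = (w - 3)^2, which is minimised at w = 3
w = torch.tensor(0.0, requires_grad=True)

for step in range(50):
    loss = (w - 3) ** 2
    loss.backward()            # compute dL/dw
    with torch.no_grad():
        w -= 0.1 * w.grad      # move along the negative gradient (0.1 is the step size)
        w.grad.zero_()

print(w.item())  # very close to 3.0
```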

Though deep neural nets take considerable time to train, the prediction time is minimal. A huge shout-out to GPUs, which have played a major role over the past decade in cutting down the time taken, both during training and during prediction.

©https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/

Almost all frameworks built for Deep Learning research, like TensorFlow, Theano and PyTorch, make use of these GPUs. Fast & Scalable — CHECK

Convolutional Neural Networks (CNNs) are one class of such Deep Learning algorithms. We will be working with CNNs in the first couple of fast.ai lectures. CNNs work on a concept called convolution (obviously!). Convolution is a mathematical operation that combines two functions and shows how the shape of one is modified by the other.
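
For images, that boils down to sliding a small kernel over the input and taking an element-wise multiply-and-sum at every position. A minimal sketch in NumPy, with a made-up 4x4 “image” and a simple vertical-edge kernel (deep learning libraries skip the kernel flip, so strictly speaking this is cross-correlation):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution: slide the kernel over the image and take a weighted sum."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)

kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)  # responds to vertical edges

print(conv2d(image, kernel))  # non-zero wherever the kernel covers the edge
```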

‘fast.ai’ structure:

Lesson 1 — Introduction (Boring? I know. But keep reading. There are things that might interest you.)

Lesson 2 — Image classification on different data sets. We consider various data sets, learn different DL techniques, and classify images.

Lesson 3 — Predictive Analytics. From structured data, we will try to predict sales or the weather, detect fraudulent behavior in an account, and explore other cool applications.

Lesson 4 — Natural Language Processing (NLP). Understanding what is present in a given piece of text, or classifying which context the text belongs to. Problems like these will be covered in Lesson 4.

Lesson 5 — Collaborative Filtering and Recommender Systems. This will be a very interesting class (and personally my favorite) in which we discuss how to recommend a movie or a book based on a user’s previous watch list. Netflix, Prime Video and other such sites use recommender systems.

Lesson 6 — Text Generation. Think of an algorithm which can provide you with dialogue for your play in Shakespeare’s style. You can learn how to do that in this lesson.

Lesson 7 — Image Segmentation. In the last lesson, you’ll learn how to train a model to find where a cat or a dog is in a given picture. With that, the first part of fast.ai ends.

Setting up Virtual Machines:

We have discussed how Deep Learning uses GPUs to handle complicated tasks. But not everyone has a laptop or desktop with a good GPU, and buying a system with one is a big investment. To overcome this, people started using virtual machines: servers that are open to public use at a nominal cost.

Some of the virtual machines that come with fastai already set up are:

  • Crestle: Easy to use. Can switch between CPU and GPU to reduce the overall cost.
  • Paperspace: Cheaper and faster.

There are other VM providers besides these two:

  • Google Cloud Platform: It gives you $300 in credits and a one-year free subscription. So Yay! But you need to pay once you’re done with those $300 in credits.
  • Amazon Web Services: Has a lot of variants to choose from, some of which are really cheap and almost free. AWS also has a one-year free tier.
  • Google Colab: My favorite. Why, you ask? It’s absolutely free. What’s the catch? Every time you request a server from Google, you get access to one for a limited time period. After that the connection terminates, the environment refreshes, and access is given to another user. So if you try to reconnect, you have to re-install all the libraries.

I could elaborate more on how to set these virtual machines up, but that would take us off topic. Jeremy covers this in Lesson 1, and by just watching the first 12 minutes you can easily set up Crestle or Paperspace without any issues.

If you face issues setting up a VM, just use Google Colab. All you need for this is a working Google account. After opening a Colab notebook, go to Runtime > Change runtime type > Hardware accelerator > GPU in the toolbar, and you are good to go.

Other Miscellaneous topics:

The lr_find() function. Setting the right learning rate is crucial for learning the right parameters. In the fastai library, the lr_find() function starts with a very small learning rate and increases it multiplicatively after every iteration. After a while the loss starts increasing, at which point lr_find() stops and we can plot the loss against the learning rate.

The region where the loss is still falling steeply indicates a good learning rate. In the lesson’s example, it is around 1e-2.
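
For reference, this is roughly how lr_find() is called with the course-era fastai library (the 0.7 API); the dogs-vs-cats path and resnet34 architecture follow Lesson 1’s example, so treat this as a sketch rather than copy-paste-ready code.

```python
from fastai.conv_learner import *

PATH = "data/dogscats/"   # Lesson 1 layout: train/ and valid/ folders, one subfolder per class
sz = 224                  # image size expected by resnet34

data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(resnet34, sz))
learn = ConvLearner.pretrained(resnet34, data, precompute=True)

learn.lr_find()      # increase the learning rate each iteration until the loss blows up
learn.sched.plot()   # plot loss vs learning rate and pick the steeply falling region
learn.fit(1e-2, 3)   # then train with the chosen learning rate
```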

The fastai library. fastai is an open source library created by Jeremy Howard and Rachel Thomas. It is built on PyTorch and uses Python 3.6 (beware: not 3.5, so some of the syntax might look new. Don’t you worry, child!). To use the library, clone the git repository, change into the fastai directory, and install the environment by typing conda env update.

That almost covers everything in the first lesson. But if you are a newbie to Deep Learning or Machine Learning, then there are a few things that you should know.

Most importantly, the parameters. I have been talking about parameters the whole time, but what are these parameters? Let’s say we have data which is already segregated into two classes, and the job of a model is to classify each data point into its respective bin. The parameters are the values that decide which point goes into which class.

The loss function is the next important thing you should know about. The loss function acts like a quality check for the parameters: if the parameters are the values that classify each data point, then the loss function tells us how good those parameters are for the given data.

So training a model is nothing but finding the parameters which give the least loss. We start with one randomly chosen set of parameters and update them after every iteration to reduce the loss. And, as discussed earlier, we use gradient descent to decide the right direction of change.
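
Putting those three ideas together (parameters, a loss function, and gradient-descent updates), here is a from-scratch sketch in NumPy that fits a straight line; the data and numbers are invented purely for illustration.

```python
import numpy as np

# Made-up data that roughly follows y = 2x + 1
np.random.seed(0)
x = np.random.rand(100)
y = 2 * x + 1 + 0.05 * np.random.randn(100)

# The parameters: a slope and an intercept, initialised randomly
w, b = np.random.randn(), np.random.randn()
lr = 0.1  # the learning rate (more on this below)

for step in range(1000):
    y_pred = w * x + b                      # prediction with the current parameters
    loss = np.mean((y_pred - y) ** 2)       # loss function: how bad are the parameters?
    grad_w = np.mean(2 * (y_pred - y) * x)  # gradient of the loss w.r.t. w
    grad_b = np.mean(2 * (y_pred - y))      # gradient of the loss w.r.t. b
    w -= lr * grad_w                        # step in the negative-gradient direction
    b -= lr * grad_b

print(w, b, loss)  # w close to 2, b close to 1, loss near the noise floor
```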

To give an analogy: if you are in a car at the top of a mountain and the aim is to drive down, gradient descent is the direction in which your car travels. Speaking of driving downhill, there is one thing we do, or rather MUST do, while driving downhill. What might that be? Take a wild guess... Got it? Yes! We use the brakes. No sane person drives down wildly without friction’s help (except in the case of a brake failure, of course). While updating the parameters using gradient descent, we apply a special kind of brakes, popularly known as the learning rate.

The learning rate provides that smooth transition from random parameters to a state-of-the-art model. Setting the learning rate too high is like accelerating the car downhill: let alone reaching the foothills, the acceleration will throw you off the road. Likewise, a model with too high a learning rate will not converge to the minimum loss; the loss increases and the model stops learning.

Setting the learning rate too small is like riding the brakes all the way down. The model will reach the minimum for sure, but it takes a lot of time for that to happen, and we cannot train a model for an indefinite amount of time. So using the right learning rate is always very important.
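
Both failure modes are easy to see on a made-up one-parameter loss, L(w) = w^2, whose gradient is 2w:

```python
def final_w(lr, steps=50, w=5.0):
    """Run plain gradient descent on L(w) = w**2 and return where w ends up."""
    for _ in range(steps):
        w -= lr * 2 * w
    return w

print(final_w(1.1))     # too high: w explodes, the loss gets worse every step
print(final_w(0.0001))  # too low: w has barely moved away from 5 after 50 steps
print(final_w(0.1))     # about right: w is essentially at the minimum, 0
```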

Finally, the parameter initialization. We always set the parameters randomly and then update them using gradient descent. The rule of thumb for initialization: symmetry does not work. Our brains are all the same; the first idea we get for initialization is a very simple one, “use all zeros.” Nope, that won’t work: with all-zero weights every unit computes the same thing and receives the same update, so the network never breaks out of that symmetry.

A good way to initialize is to take random numbers from a Gaussian distribution and scale them: take random numbers and multiply them by sqrt(2/n), where n is the number of inputs feeding into that layer (its fan-in).
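
In code, that scheme (often called He or Kaiming initialization) looks like this per layer; the layer sizes below are arbitrary.

```python
import numpy as np

def he_init(n_in, n_out):
    """Gaussian weights scaled by sqrt(2 / n_in), where n_in is the layer's fan-in."""
    return np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)

w1 = he_init(784, 256)  # first layer sees 784 inputs
w2 = he_init(256, 10)   # second layer sees 256 inputs

print(w1.std(), np.sqrt(2.0 / 784))  # the two values should roughly match
```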

That ends the summary of fast.ai V2 Lesson 1. As I said in the beginning, the underlying idea of fastai is to learn the necessary things when they are needed. I respect that idea, and so I have given a general introduction to every concept without diving into the nuances. Once you understand how neural nets work, I can give you a better picture using the mathematics, which is on the way.

If there is anything that I have missed, or anything that you would like to know more about, please mention it in the comments. I am always open to positive criticism and discussion.

Happy learning. Cheers!
