Deep Learning Part 1 — Lesson 1: My Personal Notes

Summary: setting up the needed tools, a simple code example that recognizes dogs and cats, and a brief explanation of the idea behind that code.



Jupyter notebook shortcuts (e = edit mode | v = view mode, which Jupyter officially calls command mode):

Run cell — shift + enter (e,v)

Show all shortcuts — h (v)

Editing mode to view mode — esc (e)

View mode to editing mode — enter (v)

Add cell above — a (v)

Add cell below — b (v)

Jupyter notebook tips:

When you don’t remember a function's name, you can just hit tab after the dot and get a list of options.

To see the arguments of a method, hit shift + tab.

To read the documentation of a method, hit shift + tab two or three times. You can also open the same window by writing ? at the beginning of the line.

To see the source code, write ?? at the beginning of the line.


You can download data used below from here.

Get data:

PATH = "data/dogscats/"
files = !ls {PATH}valid/cats | head

Then we can have a look at the data:

img = plt.imread(f'{PATH}valid/cats/{files[0]}')
plt.imshow(img)
OUTPUT: a random cat picture from the data

We can also look at the raw data:

img.shape
OUTPUT: (198, 179, 3) — the height, the width, and the three RGB channels

img[:4,:4]
OUTPUT: array([[[29,20,23],
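To make the shape order concrete, here is a tiny NumPy sketch with a synthetic array of the same shape as the image above (the pixel value is taken from the output shown; the rest of the array is just zeros for illustration):

```python
import numpy as np

# A synthetic "image": 198 rows (height) x 179 columns (width) x 3 RGB channels.
# NumPy stores images as (height, width, channels), not (width, height, ...).
img = np.zeros((198, 179, 3), dtype=np.uint8)
img[0, 0] = [29, 20, 23]   # top-left pixel, as in the raw output above

print(img.shape)           # (198, 179, 3)
print(img[:1, :1])         # the top-left pixel's RGB values
```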

We try to build a model that can look at just the pixel values (like above) and predict whether the image shows a dog or a cat. This task comes from a 2013 Kaggle competition; at the time, the state of the art was about 80% accuracy.

The following code uses the fastai library to predict whether the object in the picture is a dog or a cat.

arch = resnet34   # pretrained ResNet-34 architecture
data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz))   # sz = image size, set earlier in the notebook
learn = ConvLearner.pretrained(arch, data, precompute=True)
learn.fit(0.01, 5)   # learning rate 0.01, 5 epochs
100% 5/5 [00:09<00:00, 1.98s/it]
epoch   trn_loss   val_loss   accuracy
0       0.082848   0.023114   0.992
1       0.042392   0.026235   0.9915
2       0.039483   0.029074   0.988
3       0.035312   0.024955   0.9915
4       0.028535   0.034177   0.991

CPU times: user 10 s, sys: 4.73 s, total: 14.8 s
Wall time: 9.91 s

So these four lines of code can tell whether an image shows a dog or a cat with about 99% accuracy. From this you can see how fast this field is moving, and why you should not follow tutorials that are more than three years old.

The course's teaching method is top-down, which means that over the lectures we will find out what these four lines of code really do and how we could write them without a library. fastai is a library built on top of PyTorch, which is itself a library :D. Basically, PyTorch makes it easier for us to build deep learning models, and fastai is designed to make PyTorch a lot easier to use. PyTorch is written by Facebook.

I have written an article where I explain how to use TensorFlow. TensorFlow’s idea is basically the same as PyTorch’s, but it is written by Google and I personally prefer it. If you are not familiar with TensorFlow, I recommend reading my article. In this course Jeremy uses PyTorch, so you might learn a few of its functions, but that doesn’t mean you have to switch to it after finishing the course.


Analyzing the data and the results

data.val_y  # the labels for the validation data

These were just 1s and 0s. Next we want to know whether 0 means a cat or a dog; in fastai, data.classes lists the class names in label order.


We currently have only two objects, data and learn. To see our predictions we can run this code:

log_preds = learn.predict()
log_preds[:3]   # first 3 predictions
array([[ -0.00002, -11.07446],
       [ -0.00138,  -6.58385],
       [ -0.00083,  -7.09025]], dtype=float32)

The first number in every row is the prediction for cat and the second for dog. The numbers are log probabilities, so we can turn them into probabilities with the following code.

probs = np.exp(log_preds[:,1])
array([1.55e-05, 1.38e-03, 8.33e-04])
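The conversion from log probabilities back to probabilities is plain NumPy, so here is a self-contained sketch using the three predictions above (no fastai needed; argmax is an extra step I added to show how to pick the winning class):

```python
import numpy as np

# Log probabilities for three images: column 0 = cat, column 1 = dog
log_preds = np.array([[-0.00002, -11.07446],
                      [-0.00138,  -6.58385],
                      [-0.00083,  -7.09025]], dtype=np.float32)

probs = np.exp(log_preds[:, 1])    # probability of "dog" for each image
preds = np.argmax(log_preds, 1)    # index of the most likely class per row

print(probs)   # all tiny -> each image is almost certainly a cat
print(preds)   # [0 0 0] -> class 0 (cat) every time
```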

All the dog probabilities were close to zero, which means that all three pictures were cats.


Course structure:

  • CNN image intro
  • Structured neural net intro
  • Language RNN intro
  • Collaborative filtering intro
  • Collaborative filtering in depth
  • Structured neural net in depth
  • CNN image in depth
  • Language RNN in depth


A great machine learning algorithm is an infinitely flexible function, has an all-purpose way of fitting its parameters, and is fast and scalable. All deep learning algorithms satisfy these three requirements, and the neural network is the most important deep learning algorithm.

Neural network

Every node in the hidden layers contains a linear function followed by a non-linear function. The universal approximation theorem says that this kind of function can approximate any given function as long as you add enough parameters, so it is proved to be infinitely flexible.

Gradient descent optimizes the parameters in the direction where the loss gets lower. I have written a great article about this too, so read it first and then continue here. Gradient descent is the best way to fulfill the second requirement, all-purpose parameter fitting.

[Image: gradient descent]

The third requirement is speed and scalability. We achieve that by giving our neural networks more hidden layers; with more layers, the model can scale to very different kinds of problems.

A GPU is important in machine learning because it computes matrix multiplications far faster than a CPU. Basically any GPU is good as long as it is made by Nvidia! I’m not saying this because Nvidia paid me; I’m saying it because AMD and other GPUs are not designed for machine learning, so the most popular and fastest machine learning libraries (like TensorFlow and PyTorch) only run on Nvidia GPUs. This will probably change in the future, but for now an Nvidia GPU is the only GPU you should consider buying. Jeremy explained at the beginning how to rent a cloud GPU, but I think it is cheaper to buy your own.

[Image: growing use of deep learning at Google]

Deep learning is one of the most amazing things right now. Big companies are starting to use it, and there are a lot of opportunities for new startups. Knowing how to build a deep learning model is a very powerful skill, and with it you can make billions of dollars or change the world like never before.


Now let’s talk about what happened when the code learned to classify dogs and cats. The algorithm used a method called a convolutional neural network (CNN), whose key piece is the convolution.

The picture above shows what a CNN does. It takes a 3×3 area (it can be any size) and multiplies those pixel values element-wise with the kernel values, which in this case are 0, -1, 0, -1, 5, -1, 0, -1, 0. Finally it adds all the products together, producing one number that is placed in the output. This way a 3×3 area in the input image becomes a single value in the output image.
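The multiply-and-sum step can be sketched in plain NumPy. This is an illustrative toy, not fastai's or PyTorch's optimized implementation, and the 5×5 "image" is just made-up numbers:

```python
import numpy as np

# The kernel from the text: 0,-1,0, -1,5,-1, 0,-1,0 (a sharpen kernel)
kernel = np.array([[ 0, -1,  0],
                   [-1,  5, -1],
                   [ 0, -1,  0]])

def conv2d(img, k):
    """Valid 2-D convolution (cross-correlation, as CNNs use it):
    slide the kernel over the image, multiply element-wise, and sum."""
    kh, kw = k.shape
    out_h = img.shape[0] - kh + 1
    out_w = img.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = (img[i:i+kh, j:j+kw] * k).sum()
    return out

img = np.arange(25, dtype=float).reshape(5, 5)  # toy 5x5 "image"
print(conv2d(img, kernel))   # 3x3 output: one number per 3x3 patch

# A kernel with different values detects vertical edges instead:
edge_kernel = np.array([[-1, 0, 1],
                        [-1, 0, 1],
                        [-1, 0, 1]])
print(conv2d(img, edge_kernel))
```

Note how the 5×5 input shrinks to a 3×3 output: each output value summarizes one 3×3 patch of the input.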

You can play yourself with this demo here.

You might wonder what the point of this is. Well, kernels can detect shapes, which helps with object detection. For example, if we set different values in our kernel, we can see that it now detects vertical edges.

Now we have convolutional layer but with only this we can’t predict anything.

Non-linear function can give better predictions than linear function

Basically, in every node we first compute a linear function (ax + b) and then feed the result into a non-linear function. ReLU is one example of a non-linear function.

[Image: the ReLU function]

ReLU just replaces all negative values with zero and leaves the others unchanged. This little function, combined with a linear function, already has changed the world and will keep changing it. It is one of the most important pieces to understand in deep learning and artificial intelligence.
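A single node (linear function followed by ReLU) fits in a few lines of NumPy. The parameters a and b below are arbitrary values I picked for illustration; training would learn them:

```python
import numpy as np

def relu(x):
    # Replace negative values with zero; keep the rest unchanged
    return np.maximum(x, 0)

# One "node": a linear function a*x + b followed by the ReLU non-linearity
a, b = 2.0, -3.0           # illustrative parameters, not learned values
x = np.array([-2.0, 0.0, 1.0, 1.5, 4.0])

linear_out = a * x + b     # [-7, -3, -1, 0, 5]
node_out = relu(linear_out)

print(node_out)            # [0. 0. 0. 0. 5.]
```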

A great article about neural networks here.

In a linear function we have two parameters, a and b, which we can modify to change the shape of the function. We could just randomly test different numbers and try to find the ones that make the loss (which is just how far the points are from the function) as close to zero as possible. A better way is to calculate the gradient, which tells us in which direction to move the parameters. So if you haven’t read my article about gradient descent, read it now! Below is a picture that represents gradient descent; try to explain to yourself what is happening in it.

[Image: gradient descent]


That’s it. Now you should understand how the image classification happens: you just combine these three things (convolutional layers, linear + non-linear functions, and gradient descent). Below is an image showing how simple convolutional layers can detect objects.


For a long time, setting the step size parameter (the first parameter in the fit function) was a huge problem. Luckily, in 2015 researchers developed a technique that solves it, described in the paper Cyclical Learning Rates for Training Neural Networks. In fastai it is available through the function lr_find(). The idea is that we start at a very small learning rate and keep roughly doubling it every step until it becomes far too big. At the beginning the learning rate is small, so we take tiny steps; then we take bigger and bigger steps until the steps are way too big.

After doing this, we plot the learning rate against the loss and get a plot that should look something like this.

[Image: loss plotted against learning rate]

Then we find the point where the loss is dropping fastest. In our case it might be somewhere close to 1e-2.

You can try the lr_find() method yourself:

lrf = learn.lr_find()

It starts training the model normally but stops once the learning rate has grown too big, recording the loss at each rate along the way. We can then plot the loss against the learning rate (in fastai, learn.sched.plot()).
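The idea behind the learning-rate finder can be sketched without fastai. This toy version takes one gradient step at each learning rate, doubling the rate every step, and stops when the loss blows up; a simple quadratic loss stands in for a real model, and the doubling schedule and blow-up threshold are my simplifications of what the real lr_find does:

```python
import numpy as np

def loss(w):
    return ((w - 5.0) ** 2).mean()   # toy quadratic loss, minimum at w = 5

def grad(w):
    return 2 * (w - 5.0)

w = np.zeros(3)
lr = 1e-5
history = []                      # (learning rate, loss after one step)
while True:
    w_try = w - lr * grad(w)      # one gradient step at this rate
    l = loss(w_try)
    history.append((lr, l))
    if l > 4 * loss(w):           # loss exploded -> stop the search
        break
    w, lr = w_try, lr * 2         # accept the step and double the rate

for rate, l in history:
    print(f"lr={rate:.0e}  loss={l:.4f}")
```

Plotting this history is exactly the loss-vs-learning-rate curve described above: the loss falls while the rate is reasonable and shoots up once the steps get too big.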


The second parameter in the fit function is the number of epochs. An epoch is just one full pass through the data. A question people often ask is how many epochs we should run. Jeremy advised running as many as you want, but at some point too many epochs start making the accuracy worse, and then you should stop adding more.
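"Stop when accuracy starts getting worse" is usually implemented as early stopping. Here is a minimal sketch of that logic; the validation losses below are made-up numbers for illustration, not real training output:

```python
# Early stopping sketch: keep training while validation loss improves,
# stop once it has not improved for `patience` epochs in a row.
# These per-epoch validation losses are invented for the demo.
val_losses = [0.90, 0.55, 0.40, 0.33, 0.31, 0.32, 0.35, 0.41]

patience = 2
best_loss = float("inf")
best_epoch = 0
for epoch, loss in enumerate(val_losses):
    if loss < best_loss:
        best_loss, best_epoch = loss, epoch   # new best -> keep going
    elif epoch - best_epoch >= patience:
        print(f"stopping at epoch {epoch}; best was epoch {best_epoch} "
              f"(val_loss {best_loss})")
        break
```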


Source: Deep Learning on Medium