Source: Deep Learning on Medium
This blog is a continuation of PyTorch on Google Colab. You can check my last blog here.
How to read these blogs → you will find a Google Colab link down below; open it in a new tab and execute every cell as we go along.
Basic Outline of this Article:-
- Generating Data
- Feedforward network with PyTorch and autograd
- PyTorch’s NN (Functional, Linear, Sequential & PyTorch’s Optim)
- Using CUDA
Before we start generating data, here are some common imports that we will need in our code. We will understand their use gradually over the course of this blog.
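The import cell itself is only visible in the notebook; a likely set for what follows (the exact list is an assumption based on the rest of the post) would be:

```python
# Common imports for this walkthrough: torch for tensors and autograd,
# numpy for arrays, matplotlib for plots, sklearn for data generation/splitting.
import torch
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
```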
To generate data we use make_blobs, provided by sklearn.datasets. Here is how we do it.
make_blobs takes arguments like
- n_samples :- number of data points you want.
- centers:- The number of centers to generate, or the fixed center locations. In simple words, it determines the labels of the data points we generate. For example, if you print labels you'll see 4 distinct values (0, 1, 2, 3), one for each of the 4 centers.
- n_features:- The number of features for each sample
- random_state: to regenerate same data
and it returns
- X (array of shape [n_samples, n_features]): the generated samples. Here we reference it as data.
- Y (array of shape [n_samples]): the integer labels for the cluster membership of each sample. Here we reference it as label.
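As a sketch of that call (the exact n_samples used in the notebook isn't shown, so 1000 is an assumption):

```python
from sklearn.datasets import make_blobs

# 1000 points with 2 features each, grouped around 4 centers;
# random_state makes the data reproducible across runs.
data, labels = make_blobs(n_samples=1000, centers=4, n_features=2, random_state=0)
print(data.shape)    # (1000, 2)
print(set(labels))   # {0, 1, 2, 3}
```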
Scatter plot for these data points
Here you can see we have blob-shaped data points. We will divide these data points into train and test sets and try to classify them using a feedforward neural network (FNN).
You can use different dataset generators like make_classification, etc. You can find them all here.
So to summarise, we have 2 input features and 4 classes. Before we move on, let's split these points into train and test sets.
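A minimal sketch of the split, assuming sklearn's train_test_split with its default 25% test size (the notebook's exact settings aren't shown):

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

data, labels = make_blobs(n_samples=1000, centers=4, n_features=2, random_state=0)

# Hold out 25% of the points (the default) for evaluation;
# stratify keeps the class balance identical in both splits.
X_train, X_test, Y_train, Y_test = train_test_split(
    data, labels, stratify=labels, random_state=0)
print(X_train.shape, X_test.shape)  # (750, 2) (250, 2)
```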
Using torch tensors and autograd
Till now our data has been NumPy arrays; we have to convert it into PyTorch tensors to perform the different torch operations. We can do this by
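A small sketch of the conversion (the arrays here are random stand-ins for the split produced above):

```python
import torch
import numpy as np

# Stand-in arrays; in the notebook these come from the train/test split.
X_train = np.random.randn(750, 2)
Y_train = np.random.randint(0, 4, size=750)

# Inputs as float, labels as long (class indices) for the loss function later.
X_train = torch.from_numpy(X_train).float()
Y_train = torch.from_numpy(Y_train).long()
print(X_train.dtype, Y_train.dtype)  # torch.float32 torch.int64
```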
Here is a picture of the network that we will try to implement. It has two input features and 4 output classes, just like our data. A few names in this figure are wrong (they do not match the code), so here are the names we will use in the code:
a1:- first layer's linear part
h1:- first layer's non-linear part
and in the 2nd layer, a2 is the linear part and h2 is the softmax part.
In deep learning, keeping track of matrix sizes is very important; if a size goes wrong, it messes up the whole algorithm. Our input matrix has size (N, 2): N for the number of inputs and 2 for the number of features. Each weight matrix has as many rows as there are nodes in the previous layer and as many columns as there are nodes in the current layer. So the first layer's weight matrix will be of size (2, 2), and the weight matrix for the next layer will be of size (2, 4).
So here we do a matrix multiplication of the input data and the weights, add the bias, and on top of that introduce non-linearity with the sigmoid function. In the next layer we do the same matrix multiplication of the previous layer's output and the weights, add the bias, but on top of that we apply the softmax operation. In softmax we take each number, exponentiate it, and divide it by the sum of all the exponentials. The main advantage of softmax is the range of the output probabilities: each lies between 0 and 1, and they sum to one. We will learn about it in detail in a later blog.
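Putting the two layers together, a minimal sketch of that forward pass (weight shapes as described above; the seed and sample input are illustrative):

```python
import torch

torch.manual_seed(0)
# Weights sized (nodes_in, nodes_out) as described: (2,2) then (2,4).
w1 = torch.randn(2, 2, requires_grad=True)
b1 = torch.zeros(2, requires_grad=True)
w2 = torch.randn(2, 4, requires_grad=True)
b2 = torch.zeros(4, requires_grad=True)

def model(x):
    a1 = torch.matmul(x, w1) + b1                    # first layer: linear part
    h1 = a1.sigmoid()                                # first layer: non-linearity
    a2 = torch.matmul(h1, w2) + b2                   # second layer: linear part
    h2 = a2.exp() / a2.exp().sum(-1).unsqueeze(-1)   # second layer: softmax
    return h2

y_hat = model(torch.randn(5, 2))
print(y_hat.shape)    # torch.Size([5, 4])
print(y_hat.sum(-1))  # each row sums to 1
```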
To find the softmax we exponentiate each element of the matrix, then add along the rows (sum(-1)). That gives us a tensor of shape [2], but we need one of shape [2, 1]; we can achieve this by unsqueezing along the column dimension (unsqueeze(-1)). Then we can divide each element by its row's sum.
If this is too confusing for you, run this code; it'll give you better insight.
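Here is a small stand-alone illustration of those three steps on a 2×4 matrix:

```python
import torch

a2 = torch.tensor([[1.0, 2.0, 3.0, 4.0],
                   [1.0, 1.0, 1.0, 1.0]])
e = a2.exp()            # exponentiate every element
s = e.sum(-1)           # sum along rows -> shape [2]
s = s.unsqueeze(-1)     # reshape to [2, 1] so the division broadcasts per row
softmax = e / s
print(softmax.sum(-1))  # tensor([1., 1.])
```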
Functions for loss, Accuracy, and Backpropagation
Here we define our loss function. To be precise, we use the cross-entropy loss function. We will cover cross-entropy in another blog; in one line, we take the negative log of the probability predicted for the ground-truth class, and then take the mean over all values.
Now we can define the accuracy function as well. Accuracy just finds the position of the maximum probability in each row of the matrix and compares it with the actual output: if they match we count 1, else 0. The results form an array, so we take the mean of all values.
argmax returns the index of the maximum value in a row or column of a matrix.
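A sketch of both functions as described (the names loss_fn and accuracy follow the surrounding text; the sample tensors are illustrative):

```python
import torch

def loss_fn(y_hat, y):
    # Negative log of the probability assigned to the true class, averaged.
    return -y_hat[range(y_hat.shape[0]), y].log().mean()

def accuracy(y_hat, y):
    # argmax along the last dim gives the position of the max probability per row.
    pred = torch.argmax(y_hat, dim=-1)
    return (pred == y).float().mean()

y_hat = torch.tensor([[0.7, 0.1, 0.1, 0.1],
                      [0.1, 0.6, 0.2, 0.1]])
y = torch.tensor([0, 2])
print(accuracy(y_hat, y))  # tensor(0.5000): first prediction right, second wrong
```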
Now it's time to write the training loop for our model, but first let's initialize the parameters for both layers.
We are using He initialization for the weights here; what that is, again, we'll learn in another blog. We also set requires_grad=True so that torch knows to compute gradients for these variables.
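A sketch of that initialization (for these layers fan_in = 2, so the He scale sqrt(2/fan_in) happens to equal 1):

```python
import torch

torch.manual_seed(0)
# He initialization: scale by sqrt(2 / fan_in), fan_in being the layer's inputs.
# requires_grad makes autograd track gradients for these tensors.
weights1 = (torch.randn(2, 2) * (2 / 2) ** 0.5).requires_grad_(True)
bias1 = torch.zeros(2, requires_grad=True)
weights2 = (torch.randn(2, 4) * (2 / 2) ** 0.5).requires_grad_(True)
bias2 = torch.zeros(4, requires_grad=True)
print(weights1.requires_grad)  # True
```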
Now, the learning loop.
We define the learning rate and the number of epochs, convert our x_train to float and y_train to long (y_train holds class indices), and also define arrays to store the loss and accuracy after each epoch; these will be useful for drawing graphs.
For each epoch, we predict y (represented by y_hat in the code) using our model, calculate the loss between the predicted and actual y, then backpropagate with loss.backward() and do some bookkeeping of the loss and accuracy.
Now that our gradients are computed, we update our parameters (weights and biases) and then also set the grads to zero for the next epoch.
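The loop described above can be sketched like this (random stand-in data; the lr and epoch count are assumptions, as the notebook's values aren't shown):

```python
import torch

torch.manual_seed(0)
X_train = torch.randn(100, 2)              # stand-in for the blob data
Y_train = torch.randint(0, 4, (100,))

w1 = torch.randn(2, 2, requires_grad=True)
b1 = torch.zeros(2, requires_grad=True)
w2 = torch.randn(2, 4, requires_grad=True)
b2 = torch.zeros(4, requires_grad=True)

def model(x):
    h1 = (x @ w1 + b1).sigmoid()
    a2 = h1 @ w2 + b2
    return a2.exp() / a2.exp().sum(-1).unsqueeze(-1)

def loss_fn(y_hat, y):
    return -y_hat[range(y_hat.shape[0]), y].log().mean()

lr, epochs = 0.2, 1000
loss_arr = []
for _ in range(epochs):
    y_hat = model(X_train)
    loss = loss_fn(y_hat, Y_train)
    loss_arr.append(loss.item())      # bookkeeping for the graph
    loss.backward()                   # compute gradients
    with torch.no_grad():             # update parameters, then reset grads
        for p in (w1, b1, w2, b2):
            p -= lr * p.grad
            p.grad.zero_()

print(loss_arr[0], loss_arr[-1])
```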
We are plotting loss (red color) and accuracy (blue color) graphs
It ends with printing loss before and after training
Loss before training 1.5456441640853882
Loss after training 0.19288592040538788
PyTorch Modules: NN and Optim
We have seen how to write a feedforward network using PyTorch tensors and existing PyTorch operations. Now we'll look at some additional features.
torch.nn.functional – it has many different functions, but here we'll use its cross-entropy instead of our own loss function. First, import it
and then change this line
loss = loss_fn(y_hat, Y_train) to this
loss = F.cross_entropy(y_hat, Y_train) in our training loop; everything else remains unchanged. The benefit is that we don't need to write our own loss function; PyTorch takes care of it.
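One thing worth knowing about F.cross_entropy: it expects raw scores (it applies log-softmax internally), so when using it the model should output the linear scores rather than softmax probabilities. A tiny sketch:

```python
import torch
import torch.nn.functional as F

y_hat = torch.randn(5, 4)          # raw scores (logits) for 5 samples, 4 classes
y = torch.tensor([0, 1, 2, 3, 0])
loss = F.cross_entropy(y_hat, y)   # log-softmax + negative log-likelihood in one call
print(loss.item())
```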
Now we would like to use torch.nn. We will write a class for our model that inherits the properties of nn.Module.
So in this FirstNetwork class, we first call the __init__ function of the superclass. We want our parameters (weights and biases) initialized as before, but this time we wrap them in nn.Parameter so our model knows these are its parameters; then the same forward function handles forward propagation.
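A sketch of such a class (the exact initialization used in the notebook isn't visible, so plain random weights are used here):

```python
import torch
import torch.nn as nn

class FirstNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        torch.manual_seed(0)
        # nn.Parameter registers each tensor with the module,
        # so fn.parameters() can find them later.
        self.weights1 = nn.Parameter(torch.randn(2, 2))
        self.bias1 = nn.Parameter(torch.zeros(2))
        self.weights2 = nn.Parameter(torch.randn(2, 4))
        self.bias2 = nn.Parameter(torch.zeros(4))

    def forward(self, x):
        h1 = (x @ self.weights1 + self.bias1).sigmoid()
        a2 = h1 @ self.weights2 + self.bias2
        return a2.exp() / a2.exp().sum(-1).unsqueeze(-1)  # softmax

fn = FirstNetwork()
out = fn(torch.randn(5, 2))
print(sum(1 for _ in fn.parameters()))  # 4 registered parameters
```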
Now that our model is ready we’ll write our fit function
So now our fit function is also more compact. For each epoch, we predict y by passing the data through fn (fn will be an instance of the FirstNetwork class), then calculate the loss, do some bookkeeping for our graph, and then backpropagate.
Here, when we update our parameters, we don't need to name every weight and bias: they are all stored in fn.parameters(), and we just loop through it. After that, we set our gradients to zero for the next epoch.
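A sketch of this compact fit function, exercised here on a stand-in nn.Linear model with synthetic, linearly separable labels so the loss visibly drops:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fit(fn, X, Y, epochs=200, lr=0.2):
    loss_arr = []
    for _ in range(epochs):
        y_hat = fn(X)                      # forward pass through the model
        loss = F.cross_entropy(y_hat, Y)   # loss on raw scores
        loss_arr.append(loss.item())       # bookkeeping for the graph
        loss.backward()
        with torch.no_grad():
            for p in fn.parameters():      # every registered weight and bias
                p -= lr * p.grad
                p.grad.zero_()
    return loss_arr

torch.manual_seed(0)
X = torch.randn(100, 2)
Y = (X[:, 0] > 0).long() + 2 * (X[:, 1] > 0).long()  # 4 separable classes
losses = fit(nn.Linear(2, 4), X, Y)
print(losses[0], losses[-1])  # the loss decreases over training
```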
Now we need to instantiate our FirstNetwork class and then call the fit function.
It will return a graph of loss and accuracy, and print the value of the loss before and after training.
Loss before training 1.4111980199813843
Loss after training 0.9939236044883728
Now let's get into more of the abstraction provided by PyTorch. With this method we'll see that we don't need to initialize weights and biases for each layer separately: just give the dimensions, and nn.Linear does it for you.
So here we are initializing linear layers such as lin2. As I said, nn.Linear internally takes care of all the parameters, and in the forward pass we don't need to write the linear operation ourselves; it does that for us, making life so much easier.
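A sketch of the nn.Linear version (the class name FirstNetwork_v1 is an assumption, following the earlier naming):

```python
import torch
import torch.nn as nn

class FirstNetwork_v1(nn.Module):
    def __init__(self):
        super().__init__()
        torch.manual_seed(0)
        self.lin1 = nn.Linear(2, 2)  # weight (2,2) and bias created internally
        self.lin2 = nn.Linear(2, 4)  # weight (2,4) and bias created internally

    def forward(self, x):
        a1 = self.lin1(x)            # linear op applied for us
        h1 = a1.sigmoid()
        return self.lin2(h1)         # raw scores; F.cross_entropy handles softmax

fn = FirstNetwork_v1()
out = fn(torch.randn(5, 2))
print(out.shape)  # torch.Size([5, 4])
```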
Our fit function will remain the same. We need to initialize this class and call that fit function
Now let's change our fit function a little by using the optimizers provided by torch. We'll discuss the different optimizers in another blog, but in short, optimizers update our weight parameters in a way that decreases the loss.
First, start with importing optim
And fit function using optim is
Here we are using stochastic gradient descent; it takes the parameters of the network and the learning rate. After doing backpropagation we call opt.step(), which updates all the parameters, and then opt.zero_grad() to set the gradients to zero for the next epoch.
And then again we can instantiate our model class and call this version of fit function
Can we make the model class and fit function smaller?
Let's recap what we have been doing for our network: first we initialized parameters, then performed the different operations in the forward function. Now we can do that in just one step using nn.Sequential.
Let’s see the code first and then I’ll explain it down below
We define a net variable in which we chain the transformations our data will go through, and wrap it all up in nn.Sequential. And that's it; now we just need to instantiate this class and call our fit function.
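A sketch of the Sequential version (the class name FirstNetwork_v2 is an assumption, continuing the earlier naming):

```python
import torch
import torch.nn as nn

class FirstNetwork_v2(nn.Module):
    def __init__(self):
        super().__init__()
        torch.manual_seed(0)
        # All the transformations the data goes through, in order.
        self.net = nn.Sequential(
            nn.Linear(2, 2),
            nn.Sigmoid(),
            nn.Linear(2, 4))

    def forward(self, x):
        return self.net(x)

fn = FirstNetwork_v2()
print(fn(torch.randn(5, 2)).shape)  # torch.Size([5, 4])
```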
Another version of the fit function, where we pass the model, loss function, and optimizer as arguments. In this function, the training loop computes the loss in just 4 lines of code.
This time, along with instantiating our model, we also need to define our loss function and optimizer, and then pass them as arguments to the fit function. Here is how we do this.
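A sketch of this final version, where the model, loss function, and optimizer all come in as arguments (names and defaults are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim

def fit_v3(X, Y, model, loss_fn, opt, epochs=200):
    for _ in range(epochs):
        loss = loss_fn(model(X), Y)  # forward pass + loss
        loss.backward()
        opt.step()
        opt.zero_grad()
    return loss.item()

torch.manual_seed(0)
X = torch.randn(100, 2)
Y = (X[:, 0] > 0).long() + 2 * (X[:, 1] > 0).long()
model = nn.Sequential(nn.Linear(2, 2), nn.Sigmoid(), nn.Linear(2, 4))
loss_fn = F.cross_entropy
opt = optim.SGD(model.parameters(), lr=0.5)
final_loss = fit_v3(X, Y, model, loss_fn, opt)
print(final_loss)
```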
Here we had a very simple network, but in real life the dimensions of hidden layers are very large; the parameter count is usually in the millions. To compute all those parameters we need to move things to the GPU.
Moving things to GPU
So to do computation on GPU our variables, models everything should be on GPU
Here is an example of our previous network but all calculation on GPU
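A sketch of the move, with a CPU fallback so the snippet runs anywhere (the notebook assumes a CUDA runtime):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Everything involved in the computation must live on the same device;
# fall back to CPU when no GPU is present.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Sequential(nn.Linear(2, 2), nn.Sigmoid(), nn.Linear(2, 4)).to(device)
X = torch.randn(100, 2).to(device)
Y = torch.randint(0, 4, (100,)).to(device)

loss = F.cross_entropy(model(X), Y)  # computed entirely on `device`
print(loss.device)
```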
We discussed this in our last blog. The time taken to run 1000 epochs was only
Let's up our game and increase the size of the hidden layer; we will make it 1024*4. We have to change our model class accordingly.
And when we instantiate the class and run the fit function, it took only about 1 second on the GPU, while on the CPU it took almost 25 seconds. That's the power of the GPU.
In this blog, I tried to explain all the different levels of abstraction provided by PyTorch. Which level you need depends on your use case.
You can find the full notebook here
This article is part of a series I am writing. If you wish to receive more, connect with me on the social media links mentioned below.
I hope you find this article useful, and I could use some claps to boost my confidence for the upcoming articles. If we are meeting for the first time: Hi, I am Vaibhav, and if you wish to connect with me, I am active on LinkedIn and Twitter.
Poka Poka :)