Your First FNN in PyTorch (2)

Source: Deep Learning on Medium

This blog is a continuation of PyTorch on Google Colab. You can check my last blog here.

How to read these blogs → You can find a link to the Google Colab notebook down below; open it in a new tab and execute every cell as we go along.

Basic Outline of this Article:-

  • Generating Data
  • Feedforward network with PyTorch and autograd
  • PyTorch’s NN (Functional, Linear, Sequential & PyTorch’s Optim)
  • Using CUDA

Before we start generating data, here are some common imports that we will need in our code.

We will understand their use gradually as we go through this blog.

Generating Data

To generate data we are using make_blobs, provided by sklearn.datasets. Here is how we do it

make_blobs takes arguments like

  • n_samples :- number of data points you want.
  • centers:- The number of centers to generate, or the fixed center locations. In simple words, it determines the labels of the data points we generate. For example, if you print the labels you’ll see there are 4 values (0,1,2,3), one for each of the 4 centers
  • n_features:- The number of features for each sample
  • random_state:- a seed, so that the same data is regenerated on every run

and it returns

  • X (array of shape [n_samples, n_features]) : The generated samples. Here we referenced it as data
  • Y (array of shape [n_samples]) : The integer labels for cluster membership of each sample. Here we referenced it as the label.
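The gist for this step is not reproduced here, so below is a minimal sketch of what the call might look like. The values n_samples=1000 and random_state=0 are my assumptions for illustration; only centers=4 and n_features=2 follow from the text.

```python
import numpy as np
from sklearn.datasets import make_blobs

# 1000 points with 2 features each, grouped around 4 centers.
# n_samples and random_state are illustrative choices, not the article's values.
data, labels = make_blobs(n_samples=1000, centers=4, n_features=2, random_state=0)

print(data.shape, labels.shape)   # (1000, 2) (1000,)
print(np.unique(labels))          # the four cluster labels 0, 1, 2, 3
```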

Scatter plot for these data points

Here you can see blob-shaped clusters of data points. We will divide these points into train and test sets and try to classify them using an FNN.
You can use different datasets like make_moons, make_classification etc. You can find all of them here.

So to summarise we have 2 input features and 4 classes

Before we move on, let’s split these points into train and test sets
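One way to do the split is sklearn's train_test_split. This is a sketch under my own assumptions: the variable names, the stratify option, and the default 75/25 split are illustrative, not necessarily what the notebook uses.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

data, labels = make_blobs(n_samples=1000, centers=4, n_features=2, random_state=0)

# stratify=labels keeps the class proportions the same in both sets.
# With the default test_size, 25% of the points go to the test set.
X_train, X_test, Y_train, Y_test = train_test_split(
    data, labels, stratify=labels, random_state=0
)

print(X_train.shape, X_test.shape)   # (750, 2) (250, 2)
```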

Using torch tensors and autograd

Until now our data are NumPy arrays. We have to convert them into PyTorch tensors to perform torch operations, and we can do this with map
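As a rough sketch, the conversion with map could look like this. The small random arrays are stand-ins I made up so the snippet runs on its own; in the notebook they would come from the train/test split.

```python
import numpy as np
import torch

# Stand-in NumPy arrays (hypothetical values, just for this sketch).
X_train = np.random.rand(6, 2)
Y_train = np.array([0, 1, 2, 3, 0, 1])
X_test = np.random.rand(2, 2)
Y_test = np.array([2, 3])

# map applies torch.tensor to each array in turn.
X_train, Y_train, X_test, Y_test = map(
    torch.tensor, (X_train, Y_train, X_test, Y_test)
)

print(type(X_train), X_train.shape)
```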

Here is the picture of the network that we will try to implement. It has two input features and 4 output classes, just like our data.

In this picture, a few names are wrong (meaning they do not match the code). So here are the names that we will use in the code:

a1:- first layer’s linear part

h1:- first layer non-linear part

and in the 2nd layer, a2 is the linear part and h2 is the softmax part.

In deep learning, keeping track of matrix sizes is very important; if a size is wrong, it messes up our whole algorithm. Our input matrix has size (N,2): N is the number of inputs and 2 is the number of features. Similarly, each weight matrix has as many rows as there are nodes in the previous layer and as many columns as there are nodes in the current layer. So the first layer's weight matrix has size (2,2), and the weight matrix for the next layer has size (2,4).

So here we do a matrix multiplication of the input data and the weights, add the bias, and on top of that introduce non-linearity with the sigmoid function. In the next layer, we do the same matrix multiplication of the previous layer's output and the weights and add the bias, but on top of that we apply the softmax operation. Softmax takes the individual numbers, exponentiates them, and divides each by the sum of all the exponentials. The main advantage of softmax is the range of its outputs: each value lies between 0 and 1, and the values sum to one, so they can be read as probabilities. We will learn it in detail in a later blog.
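The forward pass described above can be sketched as follows. The function signature and the random test input are my assumptions; the names a1, h1, a2 and h2 follow the naming given earlier.

```python
import torch

def model(x, weights1, bias1, weights2, bias2):
    a1 = torch.matmul(x, weights1) + bias1           # (N, 2) @ (2, 2) -> (N, 2)
    h1 = a1.sigmoid()                                # non-linear part of layer 1
    a2 = torch.matmul(h1, weights2) + bias2          # (N, 2) @ (2, 4) -> (N, 4)
    h2 = a2.exp() / a2.exp().sum(-1).unsqueeze(-1)   # softmax over the 4 classes
    return h2

torch.manual_seed(0)
x = torch.randn(5, 2)   # 5 made-up inputs with 2 features each
out = model(x, torch.randn(2, 2), torch.zeros(2), torch.randn(2, 4), torch.zeros(4))

print(out.shape)     # torch.Size([5, 4])
print(out.sum(-1))   # every row sums to 1, up to floating-point error
```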

To compute softmax we exponentiate each element of the matrix and then add them along each row (sum(-1)). For a matrix with 2 rows this gives a tensor of shape (2), but we need one of shape (2,1); we can achieve this by adding a trailing dimension with unsqueeze(-1). Then we can divide each element by the sum of its row.

If this is too confusing, run this code; it will give you better insight
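A small experiment along these lines, with a made-up 2×3 matrix, shows what sum(-1) and unsqueeze(-1) do to the shapes:

```python
import torch

a = torch.tensor([[1.0, 2.0, 3.0],
                  [0.0, 0.0, 0.0]])   # 2 rows, 3 columns (made-up values)

e = a.exp()
row_sum = e.sum(-1)
print(row_sum.shape)                  # torch.Size([2])
print(row_sum.unsqueeze(-1).shape)    # torch.Size([2, 1])

# (2, 3) divided by (2, 1) broadcasts, so each row is divided by its own sum.
softmax = e / row_sum.unsqueeze(-1)
print(softmax.sum(-1))                # each row sums to 1
```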

Functions for loss, Accuracy, and Backpropagation

Here we will define our loss function. To be more precise, we are using the cross-entropy loss. We will learn about cross entropy in another blog; in one line, we take the negative log of the probability predicted for the ground-truth class, and then take the mean over all samples.

Now we can define the accuracy function as well. Accuracy finds the position of the maximum probability in each row of the matrix and compares it with the actual label: a match counts as 1, a mismatch as 0. These values form an array, and we take its mean.
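A minimal sketch of such a loss function, assuming y_hat holds softmax probabilities and y holds integer class indices (the example values are made up):

```python
import torch

def loss_fn(y_hat, y):
    # For each row, pick the probability assigned to the true class,
    # then average the negative logs.
    picked = y_hat[torch.arange(y.shape[0]), y]
    return -picked.log().mean()

y_hat = torch.tensor([[0.70, 0.10, 0.10, 0.10],
                      [0.25, 0.25, 0.25, 0.25]])
y = torch.tensor([0, 2])

loss = loss_fn(y_hat, y)
print(loss)   # the mean of -log(0.70) and -log(0.25)
```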

argmax returns the index of the maximum value in a row or column of a matrix.
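The accuracy function could be sketched like this; the probabilities and labels are made-up values chosen so that two out of three predictions are right.

```python
import torch

def accuracy(y_hat, y):
    pred = torch.argmax(y_hat, dim=1)   # index of the largest probability per row
    return (pred == y).float().mean()   # fraction of rows where it matches the label

y_hat = torch.tensor([[0.7, 0.1, 0.1, 0.1],
                      [0.1, 0.2, 0.6, 0.1],
                      [0.4, 0.3, 0.2, 0.1]])
y = torch.tensor([0, 2, 1])

acc = accuracy(y_hat, y)
print(acc)   # tensor(0.6667)
```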

Now it is time to write the training loop for our model, but first let’s initialize the parameters for both layers

We are using He initialization for the weights here; what that is, we’ll again learn in another blog. We also set requires_grad=True so that torch knows to compute gradients with respect to these variables.
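Here is one way the initialization might look. I am assuming the common form of He initialization, standard-normal values scaled by sqrt(2 / fan_in); the exact scaling in the notebook may differ.

```python
import math
import torch

torch.manual_seed(0)

def he_init(fan_in, fan_out):
    # Assumed form of He initialization: N(0, 1) scaled by sqrt(2 / fan_in).
    w = torch.randn(fan_in, fan_out) * math.sqrt(2.0 / fan_in)
    return w.requires_grad_()   # mark the tensor so autograd tracks it

weights1 = he_init(2, 2)                      # layer 1: 2 inputs -> 2 hidden units
bias1 = torch.zeros(2, requires_grad=True)

weights2 = he_init(2, 4)                      # layer 2: 2 hidden units -> 4 classes
bias2 = torch.zeros(4, requires_grad=True)

print(weights1.shape, weights2.shape)
```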

Now the learning loop

We define the learning rate and the number of epochs, convert X_train to float and Y_train to long (Y_train holds class indices), and also define arrays to store the loss and accuracy after each epoch. These will be useful for drawing graphs.

For each epoch, we predict y (called y_hat in the code) using our model, calculate the loss from the predicted y and the actual y, run backpropagation with loss.backward(), and do the bookkeeping of loss and accuracy

Now that our gradients are computed, we update our parameters (weights and biases) and then set the grads to zero for the next epoch.
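Putting the pieces together, the loop could be sketched as below. The toy cluster data, the learning rate and the number of epochs are my assumptions so the snippet runs on its own; the loss values it prints will not match the ones quoted in this article.

```python
import torch

torch.manual_seed(0)

# Toy stand-in for the blob data: 4 clusters of 2-D points (hypothetical).
centers = torch.tensor([[-2.0, -2.0], [-2.0, 2.0], [2.0, -2.0], [2.0, 2.0]])
Y_train = torch.arange(4).repeat_interleave(50)          # 200 labels, 0..3
X_train = centers[Y_train] + 0.5 * torch.randn(200, 2)   # 200 points

weights1 = torch.randn(2, 2, requires_grad=True)
bias1 = torch.zeros(2, requires_grad=True)
weights2 = torch.randn(2, 4, requires_grad=True)
bias2 = torch.zeros(4, requires_grad=True)

def model(x):
    h1 = (x @ weights1 + bias1).sigmoid()
    a2 = h1 @ weights2 + bias2
    return a2.exp() / a2.exp().sum(-1).unsqueeze(-1)

def loss_fn(y_hat, y):
    return -y_hat[torch.arange(y.shape[0]), y].log().mean()

learning_rate, epochs = 0.2, 500   # illustrative values
loss_arr = []
for epoch in range(epochs):
    y_hat = model(X_train)
    loss = loss_fn(y_hat, Y_train)
    loss.backward()                # compute gradients
    loss_arr.append(loss.item())
    with torch.no_grad():          # the updates themselves must not be tracked
        for p in (weights1, bias1, weights2, bias2):
            p -= learning_rate * p.grad
            p.grad.zero_()         # reset for the next epoch

print("Loss before training", loss_arr[0])
print("Loss after training", loss_arr[-1])
```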

We are plotting loss (red color) and accuracy (blue color) graphs

It ends with printing loss before and after training

Loss before training 1.5456441640853882 
Loss after training 0.19288592040538788

PyTorch Modules: NN and Optim

We have seen how to write a feedforward network using PyTorch tensors and existing PyTorch operations. Now we’ll see some additional features

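torch.nn.functional: it has many different functions, but we’ll use its cross-entropy instead of our own loss function. First import that

and then change this line loss = loss_fn(y_hat, Y_train) to this loss = F.cross_entropy(y_hat, Y_train) in our training loop; the rest of the loop stays unchanged. The benefit is that we don’t need to write our own loss function, PyTorch can take care of it. Keep in mind that F.cross_entropy applies log-softmax internally, so it is designed to receive raw scores; if the model already applies softmax, the loss values will differ from those of our handwritten loss_fn.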
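As a quick check of how F.cross_entropy relates to a handwritten loss, here is a small sketch on made-up scores. It passes the raw scores (what we called a2) to F.cross_entropy, because that function applies log-softmax itself.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
scores = torch.randn(5, 4)            # raw scores before softmax (made-up values)
y = torch.tensor([0, 1, 2, 3, 0])

# Handwritten version: softmax, then mean negative log of the true-class probability.
probs = scores.exp() / scores.exp().sum(-1).unsqueeze(-1)
manual = -probs[torch.arange(5), y].log().mean()

# Built-in version: log-softmax is applied internally.
builtin = F.cross_entropy(scores, y)

print(manual, builtin)                # the two values agree
```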

Using nn.Parameter

Now we would like to use nn.Parameter


And then we will write a class for our model that will inherit the properties of the nn.Module class

So in this FirstNetwork class, we first call the __init__ function of the super class. We want our parameters (weights and biases) initialized as before, but this time we wrap them in nn.Parameter so our model knows these are its parameters. Then comes the same function for forward propagation.
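A sketch of what such a class might look like. The attribute names and the sqrt(2 / fan_in) scaling are my assumptions; the point is that anything wrapped in nn.Parameter gets registered with the module.

```python
import math
import torch
import torch.nn as nn

class FirstNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        torch.manual_seed(0)
        scale = math.sqrt(2.0 / 2)   # assumed He-style scaling with fan_in = 2
        # Wrapping tensors in nn.Parameter registers them with the module.
        self.weights1 = nn.Parameter(torch.randn(2, 2) * scale)
        self.bias1 = nn.Parameter(torch.zeros(2))
        self.weights2 = nn.Parameter(torch.randn(2, 4) * scale)
        self.bias2 = nn.Parameter(torch.zeros(4))

    def forward(self, x):
        h1 = (x @ self.weights1 + self.bias1).sigmoid()
        a2 = h1 @ self.weights2 + self.bias2
        return a2.exp() / a2.exp().sum(-1).unsqueeze(-1)

fn = FirstNetwork()
names = [name for name, _ in fn.named_parameters()]
print(names)   # ['weights1', 'bias1', 'weights2', 'bias2']
```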

Now that our model is ready we’ll write our fit function

So now our fit function is also more compact. For each epoch, we predict y by passing X_train to fn (fn will be an instance of the FirstNetwork class), then calculate the loss, do some bookkeeping for our graph, and then run backpropagation.

When updating our parameters here, we don’t need to list all the weights and biases. They are all stored in fn.parameters(), so we just loop through it. After that, we set our gradients to zero for the next epoch.
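The update step with fn.parameters() can be sketched as follows. To keep the snippet short and runnable I use a tiny stand-in module (TinyNet, a hypothetical name) and random placeholder data rather than the full FirstNetwork.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
X_train = torch.randn(64, 2)            # placeholder data for this sketch
Y_train = torch.randint(0, 4, (64,))

class TinyNet(nn.Module):               # hypothetical stand-in for FirstNetwork
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.randn(2, 4))
        self.b = nn.Parameter(torch.zeros(4))

    def forward(self, x):
        return x @ self.w + self.b      # raw scores; F.cross_entropy adds log-softmax

def fit(fn, epochs=200, learning_rate=0.1):
    loss_arr = []
    for epoch in range(epochs):
        loss = F.cross_entropy(fn(X_train), Y_train)
        loss_arr.append(loss.item())
        loss.backward()
        with torch.no_grad():
            for param in fn.parameters():   # every registered weight and bias
                param -= learning_rate * param.grad
        fn.zero_grad()                      # clear gradients for the next epoch
    return loss_arr

loss_arr = fit(TinyNet())
print("Loss before training", loss_arr[0])
print("Loss after training", loss_arr[-1])
```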

Now we need to initialize our FirstNetwork class and then call fit function

It will produce a graph of the loss and accuracy, and print the value of the loss before and after training.

Loss before training 1.4111980199813843
Loss after training 0.9939236044883728

Using nn.Linear

Now let’s get into more abstraction provided by PyTorch. With this method, we don’t need to initialize the weights and biases for each layer separately. Just specify the dimensions and it will do it for you

So here we are initializing lin1 and lin2. As I said, nn.Linear internally takes care of all the parameters, and in the forward pass we don’t need to write the linear operation either; it does that for you, making life so much easier
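A sketch of the model with nn.Linear. The class name FirstNetwork_v1 is my placeholder; lin1 and lin2 are the names used above.

```python
import torch
import torch.nn as nn

class FirstNetwork_v1(nn.Module):     # class name is illustrative
    def __init__(self):
        super().__init__()
        self.lin1 = nn.Linear(2, 2)   # 2 inputs -> 2 hidden units
        self.lin2 = nn.Linear(2, 4)   # 2 hidden units -> 4 classes

    def forward(self, x):
        h1 = self.lin1(x).sigmoid()
        a2 = self.lin2(h1)
        return a2.exp() / a2.exp().sum(-1).unsqueeze(-1)

fn = FirstNetwork_v1()
out = fn(torch.randn(3, 2))
n_params = sum(p.numel() for p in fn.parameters())

print(out.shape)   # torch.Size([3, 4])
print(n_params)    # 18 = (2*2 + 2) + (2*4 + 4)
```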

Our fit function will remain the same. We need to initialize this class and call that fit function

Using Optim

Now let’s change our fit function a little by using the optimizers provided by torch. We’ll discuss the different optimizers in another blog, but in short, optimizers update our parameters in such a way that our loss decreases.

First, start with importing optim

And fit function using optim is

Here we are using stochastic gradient descent. It takes the parameters of the network and the learning rate. After backpropagation we call opt.step(), which updates all the parameters, and then opt.zero_grad() to set the gradients to zero for the next epoch.
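Roughly, a fit function built on optim.SGD could look like this. The name fit_v1, the placeholder data and the single nn.Linear stand-in model are assumptions made to keep the sketch self-contained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim

torch.manual_seed(0)
X_train = torch.randn(64, 2)             # placeholder data
Y_train = torch.randint(0, 4, (64,))
fn = nn.Linear(2, 4)                     # stand-in model that outputs raw scores

def fit_v1(epochs=200, learning_rate=0.1):
    opt = optim.SGD(fn.parameters(), lr=learning_rate)
    loss_arr = []
    for epoch in range(epochs):
        loss = F.cross_entropy(fn(X_train), Y_train)
        loss_arr.append(loss.item())
        loss.backward()
        opt.step()        # update every parameter handed to the optimizer
        opt.zero_grad()   # reset gradients for the next epoch
    return loss_arr

loss_arr = fit_v1()
print("Loss before training", loss_arr[0])
print("Loss after training", loss_arr[-1])
```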

And then again we can instantiate our model class and call this version of fit function


Can we make model class and fit function smaller?

Let’s recap what we have done so far for our network. First we initialized the parameters and then did different operations in the forward function. Now we can do that in just one step using nn.Sequential

Let’s see the code first and then I’ll explain it down below

We are defining a net variable that holds the sequence of transformations our data will go through, wrapped up in nn.Sequential. And that’s it; now we just need to instantiate this class and call our fit function

Here is another version of the fit function, where we pass the model, loss function, and optimizer as arguments
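A sketch of the model written with nn.Sequential. The class name FirstNetwork_v2 is a placeholder of mine; net is the variable mentioned above.

```python
import torch
import torch.nn as nn

class FirstNetwork_v2(nn.Module):   # class name is illustrative
    def __init__(self):
        super().__init__()
        # The data flows through these transformations in order.
        self.net = nn.Sequential(
            nn.Linear(2, 2),
            nn.Sigmoid(),
            nn.Linear(2, 4),
            nn.Softmax(dim=-1),
        )

    def forward(self, x):
        return self.net(x)

fn = FirstNetwork_v2()
out = fn(torch.randn(3, 2))
print(out.shape)   # torch.Size([3, 4])
```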

In this function, we are calculating the loss in just 4 lines of code.

This time, along with instantiating our model, we also need to define our loss function and optimizer and then pass them as arguments to the fit function. Here is how we do this

Here we had a very simple network, but in real life networks are very large, usually with millions of parameters. So to train all those parameters we need to move things to the GPU.
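Here is a sketch of that setup. The name fit_v2, the toy cluster data and the hyperparameters are my assumptions. I also leave softmax out of the model here, since F.cross_entropy applies log-softmax itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim

def fit_v2(x, y, model, opt, loss_fn, epochs=1000):
    for epoch in range(epochs):
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
        opt.zero_grad()
    return loss.item()

torch.manual_seed(0)
# Toy stand-in data: 4 clusters of 2-D points (hypothetical).
centers = torch.tensor([[-2.0, -2.0], [-2.0, 2.0], [2.0, -2.0], [2.0, 2.0]])
y = torch.arange(4).repeat_interleave(50)
x = centers[y] + 0.5 * torch.randn(200, 2)

model = nn.Sequential(nn.Linear(2, 2), nn.Sigmoid(), nn.Linear(2, 4))
loss_fn = F.cross_entropy
opt = optim.SGD(model.parameters(), lr=0.2)

before = loss_fn(model(x), y).item()
after = fit_v2(x, y, model, opt, loss_fn)
print("Loss before training", before)
print("Loss after training", after)
```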

Moving things to GPU

To do computation on the GPU, our variables and models should all be on the GPU

Here is an example of our previous network, but with all calculations on the GPU
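Moving things to the GPU could be sketched like this. The data and hyperparameters are placeholders, and the fallback to the CPU is my addition so the snippet still runs on a machine without CUDA.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim

# Fall back to the CPU when no GPU is present so the sketch still runs.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

torch.manual_seed(0)
X_train = torch.randn(64, 2).to(device)             # placeholder data
Y_train = torch.randint(0, 4, (64,)).to(device)

# .to(device) moves all of the model's parameters to that device.
model = nn.Sequential(nn.Linear(2, 2), nn.Sigmoid(), nn.Linear(2, 4)).to(device)
opt = optim.SGD(model.parameters(), lr=0.2)

for epoch in range(100):
    loss = F.cross_entropy(model(X_train), Y_train)
    loss.backward()
    opt.step()
    opt.zero_grad()

print(next(model.parameters()).device, loss.item())
```

Both the data and the model have to live on the same device; mixing a CPU tensor with a GPU model raises an error.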

We discussed this in our last blog. The time taken to run 1000 epochs was only 0.74 sec

Let’s up our game, and increase the size of the hidden layer. We will make it 1024*4. We have to change our model class

When we instantiated the class and ran the fit function on it, it took only 1 sec; when I ran it on the CPU, it took almost 25 sec. So that’s the power of the GPU.

In this blog, I tried to explain the different levels of abstraction provided by PyTorch. Which level of abstraction you need depends on your use case.

You can find the full notebook here

This article is part of a series that I am writing. If you wish to receive more, connect with me on the social media links mentioned below.

I hope you find this article useful, and I could use some claps to boost my confidence for the upcoming articles. If we are meeting for the first time: hi, I am Vaibhav, and if you wish to connect with me I am active on LinkedIn and Twitter.

Poka Poka :)