Intro to Deep Learning with PyTorch - Part 2

Source: Deep Learning on Medium

This post continues the earlier one, Intro to Deep Learning with PyTorch - Part 1.

Course outline:

This course comes with 8 lessons and one lab. The 8 lessons are:

  1. Introduction to neural networks: you will learn the concepts behind deep learning and how we train deep neural networks with backpropagation.
  2. Talking PyTorch with Soumith Chintala: Soumith Chintala, the creator of PyTorch, talks about the past, present and future of PyTorch.
  3. Introduction to PyTorch: you will learn how to build deep neural networks with PyTorch and build a state-of-the-art model that classifies dog and cat images using pre-trained networks.
  4. Convolutional neural networks: you will learn about convolutional neural networks, a powerful architecture for solving computer vision problems.
  5. Style transfer: you will use trained networks to transfer the style of one image to another and implement a style transfer model.
  6. Recurrent neural networks: you will learn how recurrent neural networks learn from sequences of data such as time series, and build a recurrent neural network that learns from text and generates new text one character at a time.
  7. Sentiment prediction with RNNs: you will build and train a recurrent network that can classify the sentiment of movie reviews.
  8. Deploying PyTorch models: you will learn how to use PyTorch’s hybrid frontend to convert models from PyTorch to C++ for use in production.

In the earlier post we went through Lesson 1, Introduction to Neural Networks, which introduced several concepts: linear boundaries, higher dimensions, perceptrons, neural networks, perceptrons as logical operators, the perceptron algorithm, error functions, discrete vs. continuous predictions, the softmax function, one-hot encoding, cross-entropy, multi-class cross-entropy, perceptron vs. gradient descent, neural network architecture, feedforward and backpropagation.

In this post we will cover Lesson 2, Talking PyTorch with Soumith Chintala, and Lesson 3, Intro to PyTorch.

Lesson 2: Talking PyTorch with Soumith Chintala

Soumith Chintala is the creator of PyTorch. You can follow him on Twitter.

In this lesson he tells the story of PyTorch: how it originated, how it is designed, what it is used for, and where it is headed. The lesson is split into these parts:

  1. Origins of PyTorch.
  2. Debugging and Designing PyTorch.
  3. From Research to Production.
  4. Cutting-edge Applications in PyTorch.
  5. PyTorch and the Facebook Product.
  6. The Future of PyTorch.

Now let's look at Lesson 3, Intro to PyTorch.

Lesson 3: Intro to PyTorch

I’ll first give you a basic introduction to PyTorch, where we’ll cover tensors — the main data structure of PyTorch. I’ll show you how to create tensors, how to do simple operations, and how tensors interact with NumPy.

Then you’ll learn about a module called autograd that PyTorch uses to calculate gradients for training neural networks. Autograd, in my opinion, is amazing. It does all the work of backpropagation for you by calculating the gradients at each operation in the network which you can then use to update the network weights.

Next you’ll use PyTorch to build a network and run data forward through it. After that, you’ll define a loss and an optimization method to train the neural network on a dataset of handwritten digits. You’ll also learn how to test that your network is able to generalize through validation.

However, you’ll find that your network doesn’t work too well with more complex images. You’ll learn how to use pre-trained networks to improve the performance of your classifier, a technique known as transfer learning.

PyTorch is a framework for building and training neural networks. In a lot of ways, PyTorch tensors behave like the arrays you love from NumPy. These NumPy arrays, after all, are just tensors. PyTorch takes these tensors and makes it simple to move them to GPUs for the faster processing needed when training neural networks.

Neural Networks

Deep Learning is based on artificial neural networks which have been around in some form since the late 1950s. The networks are built from individual parts approximating neurons, typically called units or simply “neurons.” Each unit has some number of weighted inputs. These weighted inputs are summed together (a linear combination) then passed through an activation function to get the unit’s output.

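Mathematically, this looks like

$$
y = f(w_1 x_1 + w_2 x_2 + b) = f\left(\sum_i w_i x_i + b\right)
$$

where $f$ is the activation function, $x_i$ are the inputs, $w_i$ are the weights and $b$ is the bias. With vectors, the weighted sum is the dot/inner product of two vectors:

$$
h = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix} \cdot \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix}
$$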


It turns out neural network computations are just a bunch of linear algebra operations on tensors, a generalization of matrices. A vector is a 1-dimensional tensor, a matrix is a 2-dimensional tensor, and an array with three indices is a 3-dimensional tensor (RGB color images, for example). The fundamental data structure for neural networks is the tensor, and PyTorch (as well as pretty much every other deep learning framework) is built around tensors.
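To make the idea of tensor dimensions concrete, here is a small plain-Python sketch (no PyTorch involved). The shape helper is a hypothetical function written only for this illustration, and it assumes the nested lists are regular and non-empty.

```python
def shape(nested):
    """Return the shape of a regular, non-empty nested list as a tuple."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

vector = [1.0, 2.0, 3.0]                      # 1-dimensional: one index
matrix = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # 2-dimensional: row, column
# A tiny 2x2 "RGB image": height, width, color channel
image = [[[255, 0, 0], [0, 255, 0]],
         [[0, 0, 255], [255, 255, 255]]]

vector_shape = shape(vector)
matrix_shape = shape(matrix)
image_shape = shape(image)
print(vector_shape, matrix_shape, image_shape)
```

The number of entries in each shape tuple is the number of indices you need to pick out a single value, which is what "dimension" means for a tensor.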

import torch  # import the PyTorch library

def activation(x):
    """Sigmoid activation function, which we met in Part 1."""
    return 1 / (1 + torch.exp(-x))

### Generating some data
torch.manual_seed(7)  # set the random seed so things are predictable

# Features are 5 random normal variables
features = torch.randn((1, 5))

# Weights for our data, same shape as features
weights = torch.randn_like(features)

# Bias term
bias = torch.randn((1, 1))

Applying the activation function to the weighted sum of the features plus the bias gives the output:

y = activation(torch.sum(features * weights) + bias)

You can also do the multiplication and sum in the same operation using a matrix multiplication. However, both features and weights have the same shape, (1, 5), so we need to change the shape of weights to get the matrix multiplication to work.

Using the .view() method, we can reshape weights to have five rows and one column with something like weights.view(5, 1).

y = activation(torch.mm(features, weights.view(5, 1)) + bias)

That’s how you can calculate the output for a single neuron. The real power of this algorithm happens when you start stacking these individual units into layers and stacks of layers, into a network of neurons.

torch.manual_seed(7) # Setting the random seed
# Features are 3 random normal variables
features = torch.randn((1, 3))
# Define the size of each layer in our network
n_input = features.shape[1] # input units
n_hidden = 2 # hidden units
n_output = 1 # output units
W1 = torch.randn(n_input, n_hidden) # Weights from inputs to hidden
W2 = torch.randn(n_hidden, n_output) # Weights from hidden to output
B1 = torch.randn((1, n_hidden)) # bias for hidden layer
B2 = torch.randn((1, n_output)) # bias for output layer

Applying the activation function at each layer of this multilayer network looks like this:

h = activation(torch.mm(features, W1) + B1)
output = activation(torch.mm(h, W2) + B2)

I got the output tensor([[ 0.3171]]).

Neural networks with PyTorch

Deep learning networks tend to be massive, with dozens or hundreds of layers. That’s where the term “deep” comes from. PyTorch has a module, nn, that provides a convenient way to build large neural networks efficiently.

# Importing necessary packages
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import numpy as np
import torch
import helper
import matplotlib.pyplot as plt

Now we’re going to build a larger network that can solve a (formerly) difficult problem: identifying text in an image. Here we’ll use the MNIST dataset, which consists of greyscale handwritten digits. Each image is 28×28 pixels.

Our goal is to build a neural network that can take one of these images and predict the digit in the image.

We get our data through the torchvision package.

from torchvision import datasets, transforms

# Define a transform to normalize the data
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5,), (0.5,))])

# Download and load the training data
trainset = datasets.MNIST('~/.pytorch/MNIST_data/', download=True, train=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)

We have the training data loaded into trainloader, and we make that an iterator with iter(trainloader). Later, we’ll use this to loop through the dataset for training, like

for image, label in trainloader:
    # do things with images and labels
    pass

For now, we grab just the first batch to check out the data:

dataiter = iter(trainloader)  # iterator over batches
images, labels = next(dataiter)
print(type(images))
print(images.shape)

output: <class 'torch.Tensor'>
torch.Size([64, 1, 28, 28])

So there are 64 images per batch, 1 color channel, and 28×28 pixels per image. This is how to display one image:

plt.imshow(images[1].numpy().squeeze(), cmap='Greys_r');

First, let’s try to build a simple network for this dataset using weight matrices and matrix multiplications. Then, we’ll see how to do it using PyTorch’s nn module.

Our images are 28×28 2D tensors, so we need to convert them into 1D vectors. Thinking about sizes, we need to convert the batch of images with shape (64, 1, 28, 28) to have a shape of (64, 784), where 784 is 28 times 28. This is typically called flattening: we flatten the 2D images into 1D vectors.

def activation(x):
    return 1 / (1 + torch.exp(-x))

# Flatten the input images
inputs = images.view(images.shape[0], -1)

# Create parameters
w1 = torch.randn(784, 256)
b1 = torch.randn(256)
w2 = torch.randn(256, 10)
b2 = torch.randn(10)

# Hidden layer with sigmoid activation, then the raw output scores
h = activation(torch.mm(inputs, w1) + b1)
out = torch.mm(h, w2) + b2
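The shape bookkeeping for this network can be checked by hand. Below is a minimal plain-Python sketch that only tracks shapes, not values. The matmul_shape helper is a hypothetical function made up for this illustration, not part of PyTorch.

```python
def matmul_shape(a, b):
    """Shape of a 2-D matrix product, or an error if the inner sizes differ."""
    if a[1] != b[0]:
        raise ValueError(f"cannot multiply {a} by {b}")
    return (a[0], b[1])

batch_shape = (64, 1, 28, 28)
# Flattening keeps the batch size and merges the remaining dimensions
flat_shape = (batch_shape[0], batch_shape[1] * batch_shape[2] * batch_shape[3])
hidden_shape = matmul_shape(flat_shape, (784, 256))
out_shape = matmul_shape(hidden_shape, (256, 10))
print(flat_shape, hidden_shape, out_shape)

# Two (1, 5) tensors cannot be matrix-multiplied without reshaping one of them
try:
    matmul_shape((1, 5), (1, 5))
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

The last few lines show why the single-neuron example needed weights.view(5, 1): the inner sizes of the two shapes have to agree.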

Now we have 10 outputs for our network. We want to pass in an image to our network and get out a probability distribution over the classes that tells us the likely class(es) the image belongs to.

For an untrained network, the probability for each class is roughly the same. The network hasn’t seen any data yet, so it returns something close to a uniform distribution with equal probabilities for each class.

To calculate this probability distribution, we often use the softmax function:

$$
\sigma(x_i) = \frac{e^{x_i}}{\sum_{k=1}^{K} e^{x_k}}
$$

This squishes each input $x_i$ between 0 and 1 and normalizes the values so that they sum to one, giving a proper probability distribution.
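Here is a minimal plain-Python sketch of the softmax function, just to show the arithmetic. It is not how PyTorch implements it. Subtracting the largest score before exponentiating is a common numerical-stability trick that I am assuming here, and it does not change the result.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    # Subtracting the max leaves the result unchanged but avoids overflow in exp
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

equal = softmax([0.0] * 10)          # all scores equal -> uniform distribution
skewed = softmax([2.0, 1.0, 0.1])    # the largest score gets the largest probability
print(equal[0], sum(skewed))
```

When all ten scores are equal, every class gets probability 0.1, which is the "no idea" situation of an untrained network.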

Building networks with PyTorch

PyTorch provides a module nn that makes building networks much simpler.

from torch import nn

class Network(nn.Module):
    def __init__(self):
        super().__init__()

        # Inputs to hidden layer linear transformation
        self.hidden = nn.Linear(784, 256)
        # Output layer, 10 units - one for each digit
        self.output = nn.Linear(256, 10)

        self.sigmoid = nn.Sigmoid()       # sigmoid activation
        self.softmax = nn.Softmax(dim=1)  # softmax output

    def forward(self, x):
        # Pass the input tensor through each of our operations
        x = self.hidden(x)
        x = self.sigmoid(x)
        x = self.output(x)
        x = self.softmax(x)

        return x

# Create the network and look at its text representation
model = Network()
print(model)

You can define the network somewhat more concisely and clearly using the torch.nn.functional module.

import torch.nn.functional as F

class Network(nn.Module):
    def __init__(self):
        super().__init__()

        # Inputs to hidden layer linear transformation
        self.hidden = nn.Linear(784, 256)
        # Output layer, 10 units - one for each digit
        self.output = nn.Linear(256, 10)

    def forward(self, x):
        # Hidden layer with sigmoid activation
        # (torch.sigmoid replaces the deprecated F.sigmoid)
        x = torch.sigmoid(self.hidden(x))
        # Output layer with softmax activation
        x = F.softmax(self.output(x), dim=1)

        return x

Activation functions

So far we’ve only been looking at the sigmoid and softmax activations, but in general any function can be used as an activation function. The only requirement is that for a network to approximate a non-linear function, the activation functions must be non-linear. Two more common activation functions are Tanh (hyperbolic tangent) and ReLU (rectified linear unit).

In practice, ReLU is the most commonly used activation function for hidden layers.
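To compare these activation functions side by side, here is a small plain-Python sketch that evaluates each one on single numbers. The function names are my own and only stand in for what the corresponding PyTorch functions compute element-wise.

```python
import math

def sigmoid(x):
    """Squashes any real number into the range (0, 1)."""
    return 1 / (1 + math.exp(-x))

def tanh(x):
    """Squashes any real number into the range (-1, 1)."""
    return math.tanh(x)

def relu(x):
    """Passes positive values through and clips negative values to zero."""
    return max(0.0, x)

for x in (-2.0, 0.0, 2.0):
    print(x, round(sigmoid(x), 4), round(tanh(x), 4), relu(x))
```

All three are non-linear, which is the property the network needs. ReLU is also very cheap to compute, since it is just a comparison with zero.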

On your own:

Now let's create a network with 784 input units, a hidden layer with 128 units and a ReLU activation, then a hidden layer with 64 units and a ReLU activation, and finally an output layer with a softmax activation as shown above. You can use a ReLU activation with the nn.ReLU module or the F.relu function.

class Network(nn.Module):
    def __init__(self):
        super().__init__()

        # Define the layers: 128, 64 and 10 units
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 64)
        # Output layer, 10 units - one for each digit
        self.fc3 = nn.Linear(64, 10)

    def forward(self, x):
        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)
        x = F.relu(x)
        x = self.fc3(x)
        x = F.softmax(x, dim=1)

        return x

model = Network()

The weights and biases are tensors attached to the layer you defined. You can get them with model.fc1.weight and model.fc1.bias, for instance.

Printing them gives:

Parameter containing:
tensor([[-2.3278e-02, -1.2170e-03, -1.1882e-02, ..., 3.3567e-02,
4.4827e-03, 1.4840e-02],
[ 4.8464e-03, 1.9844e-02, 3.9791e-03, ..., -2.6048e-02,
-3.5558e-02, -2.2386e-02],
[-1.9664e-02, 8.1722e-03, 2.6729e-02, ..., -1.5122e-02,
2.7632e-02, -1.9567e-02],
[-3.3571e-02, -2.9686e-02, -2.1387e-02, ..., 3.0770e-02,
1.0800e-02, -6.5941e-03],
[ 2.9749e-02, 1.2849e-02, 2.7320e-02, ..., -1.9899e-02,
2.7131e-02, 2.2082e-02],
[ 1.3992e-02, -2.1520e-02, 3.1907e-02, ..., 2.2435e-02,
1.1370e-02, 2.1568e-02]])
Parameter containing:
tensor(1.00000e-02 *
[-1.3222, 2.4094, -2.1571, 3.2237, 2.5302, -1.1515, 2.6382,
-2.3426, -3.5689, -1.0724, -2.8842, -2.9667, -0.5022, 1.1381,
1.2849, 3.0731, -2.0207, -2.3282, 0.3168, -2.8098, -1.0740,
-1.8273, 1.8692, 2.9404, 0.1783, 0.9391, -0.7085, -1.2522,
-2.7769, 0.0916, -1.4283, -0.3267, -1.6876, -1.8580, -2.8724,
-3.5512, 3.2155, 1.5532, 0.8836, -1.2911, 1.5735, -3.0478,
-1.3089, -2.2117, 1.5162, -0.8055, -1.3307, -2.4267, -1.2665,
0.8666, -2.2325, -0.4797, -0.5448, -0.6612, -0.6022, 2.6399,
1.4673, -1.5417, -2.9492, -2.7507, 0.6157, -0.0681, -0.8171,
-0.3554, -0.8225, 3.3906, 3.3509, -1.4484, 3.5124, -2.6519,
0.9721, -2.5068, -3.4962, 3.4743, 1.1525, -2.7555, -3.1673,
2.2906, 2.5914, 1.5992, -1.2859, -0.5682, 2.1488, -2.0631,
2.6281, -2.4639, 2.2622, 2.3632, -0.1979, 0.7160, 1.7594,
0.0761, -2.8886, -3.5467, 2.7691, 0.8280, -2.2398, -1.4602,
-1.3475, -1.4738, 0.6338, 3.2811, -3.0628, 2.7044, 1.2775,
2.8856, -3.3938, 2.7056, 0.5826, -0.6286, 1.2381, 0.7316,
-2.4725, -1.2958, -3.1543, -0.8584, 0.5517, 2.8176, 0.0947,
-1.6849, -1.4968, 3.1039, 1.7680, 1.1803, -1.4402, 2.5710,
-3.3057, 1.9027])

These are actually autograd Variables, so we need to get back the actual tensors with model.fc1.weight.data. Once we have the tensors, we can fill them with zeros (for biases) or random normal values.

# Set biases to all zeros
model.fc1.bias.data.fill_(0)
# Sample from random normal with standard dev = 0.01
model.fc1.weight.data.normal_(std=0.01)

Forward pass

Now that we have a network, let’s see what happens when we pass in an image.

# Grab some data 
dataiter = iter(trainloader)
images, labels = next(dataiter)
# Resize images into a 1D vector, new shape is (batch size, color channels, image pixels) 
images.resize_(64, 1, 784)
# or images.resize_(images.shape[0], 1, 784) to automatically get batch size
# Forward pass through the network
img_idx = 0
ps = model.forward(images[img_idx,:])
img = images[img_idx]
helper.view_classify(img.view(1, 28, 28), ps)

As you can see above, our network has basically no idea what this digit is. It’s because we haven’t trained it yet, all the weights are random!

Training Neural Networks

The network we built in the previous part isn’t so smart, it doesn’t know anything about our handwritten digits. Neural networks with non-linear activations work like universal function approximators. There is some function that maps your input to the output. For example, images of handwritten digits to class probabilities. The power of neural networks is that we can train them to approximate this function, and basically any function given enough data and compute time.

We train the network by showing it examples of real data, then adjusting the network parameters so that it approximates this function. To find these parameters, we need to know how poorly the network is predicting the real outputs. For this we calculate a loss function (also called the cost), a measure of our prediction error. For example, the mean squared loss is often used in regression and binary classification problems.
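As a quick illustration of a loss function, here is a plain-Python sketch of the mean squared loss. I am assuming the plain mean of the squared errors. Some texts add an extra factor of 1/2, which only rescales the value. The predictions and targets are made-up numbers.

```python
def mse(predictions, targets):
    """Mean of the squared differences between predictions and targets."""
    n = len(predictions)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n

targets = [1.0, 0.0, 1.0, 1.0]
good = mse([0.9, 0.1, 0.8, 0.7], targets)   # predictions close to the targets
bad = mse([0.2, 0.9, 0.4, 0.1], targets)    # predictions far from the targets
print(good, bad)
```

The worse the predictions, the larger the loss, so training comes down to finding the parameters that make this number as small as possible.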


For single layer networks, gradient descent is straightforward to implement. However, it’s more complicated for deeper, multilayer neural networks like the one we’ve built.

Training multilayer networks is done through backpropagation. Mathematically, this is really just calculating the gradient of the loss with respect to the weights using the chain rule.
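To see the chain rule at work, here is a minimal plain-Python sketch for a single sigmoid unit with a squared-error loss. The numbers are arbitrary and chosen only for illustration. The finite-difference check at the end is my own addition to confirm that the chain-rule gradient is right.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def loss_fn(w, b, x, target):
    """Squared error of a single sigmoid unit."""
    y = sigmoid(w * x + b)
    return (y - target) ** 2

w, b, x, target = 0.5, -0.2, 1.5, 1.0

# Forward pass, keeping the intermediate values
z = w * x + b
y = sigmoid(z)

# Backward pass: multiply the local derivatives along the chain
dloss_dy = 2 * (y - target)
dy_dz = y * (1 - y)          # derivative of the sigmoid
dz_dw = x
grad_w = dloss_dy * dy_dz * dz_dw

# Numerical check with a central finite difference
eps = 1e-6
numeric = (loss_fn(w + eps, b, x, target) - loss_fn(w - eps, b, x, target)) / (2 * eps)
print(grad_w, numeric)
```

In a deeper network the chain simply gets longer: each layer contributes one more local derivative to the product.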

Losses in PyTorch

Let’s start by seeing how we calculate the loss with PyTorch. Through the nn module, PyTorch provides losses such as the cross-entropy loss (nn.CrossEntropyLoss). You’ll usually see the loss assigned to criterion. As noted in the last part, with a classification problem such as MNIST, we’re using the softmax function to predict class probabilities. With a softmax output, you want to use cross-entropy as the loss. To actually calculate the loss, you first define the criterion then pass in the output of your network and the correct labels.

There is something really important to note here. The documentation for nn.CrossEntropyLoss says:

This criterion combines nn.LogSoftmax() and nn.NLLLoss() in one single class.

The input is expected to contain scores for each class.

This means we need to pass in the raw output of our network into the loss, not the output of the softmax function.
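To show what "log-softmax followed by negative log-likelihood" means for a single example, here is a plain-Python sketch. It only illustrates the arithmetic and is not PyTorch's implementation. The logits are made-up scores.

```python
import math

def log_softmax(logits):
    """Log of the softmax probabilities, computed in a numerically stable way."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_total for v in logits]

def cross_entropy(logits, label):
    """Negative log-probability of the correct class, from raw scores."""
    return -log_softmax(logits)[label]

logits = [2.0, 0.5, -1.0]
loss_right = cross_entropy(logits, 0)   # correct class has the highest score
loss_wrong = cross_entropy(logits, 2)   # correct class has the lowest score

# Equal scores over 10 classes give a loss of ln(10), about 2.30
uniform_loss = cross_entropy([0.0] * 10, 3)
print(loss_right, loss_wrong, uniform_loss)
```

Notice that the function takes raw scores, not probabilities. Applying softmax first and then passing the result in would apply softmax twice.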

import torch
from torch import nn
import torch.nn.functional as F
from torchvision import datasets, transforms

# Define a transform to normalize the data (MNIST has a single channel)
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5,), (0.5,))])

# Download and load the training data
trainset = datasets.MNIST('~/.pytorch/MNIST_data/', download=True, train=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)

# Build a feed-forward network with ReLU activations between the layers
model = nn.Sequential(nn.Linear(784, 128),
                      nn.ReLU(),
                      nn.Linear(128, 64),
                      nn.ReLU(),
                      nn.Linear(64, 10))

# Define the loss
criterion = nn.CrossEntropyLoss()

# Get our data
images, labels = next(iter(trainloader))
# Flatten images
images = images.view(images.shape[0], -1)

# Forward pass, get our logits
logits = model(images)
# Calculate the loss with the logits and the labels
loss = criterion(logits, labels)
print(loss)

output: tensor(2.2810)


Now that we know how to calculate a loss, how do we use it to perform backpropagation? Torch provides a module, autograd, for automatically calculating the gradients of tensors.

x = torch.randn(2, 2, requires_grad=True)
print(x)

output: tensor([[ 0.7652, -1.4550], [-1.2232, 0.1810]])

y = x**2
print(y)

output: tensor([[ 0.5856, 2.1170], [ 1.4962, 0.0328]])
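To get a feel for what a gradient of these values is, here is a plain-Python sketch (no autograd involved). I am assuming that y is reduced to a single number with a mean, which is my own choice for illustration, since a gradient needs a scalar to differentiate. The gradient of mean(x**2) with respect to each entry is then 2 * x_i / n, and a finite-difference estimate confirms it.

```python
def mean_of_squares(values):
    return sum(v ** 2 for v in values) / len(values)

# The four entries of x printed above, flattened into a list
x = [0.7652, -1.4550, -1.2232, 0.1810]

# Analytic gradient of mean(x**2) with respect to each entry: 2 * x_i / n
analytic = [2 * v / len(x) for v in x]

# Finite-difference estimate of the same gradient, one entry at a time
eps = 1e-6
numeric = []
for i in range(len(x)):
    up = x[:i] + [x[i] + eps] + x[i + 1:]
    down = x[:i] + [x[i] - eps] + x[i + 1:]
    numeric.append((mean_of_squares(up) - mean_of_squares(down)) / (2 * eps))

print(analytic)
print(numeric)
```

Autograd does the analytic part for you: it records the operations applied to a tensor and walks back through them to produce the gradient.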

Training the network!

There’s one last piece we need to start training: an optimizer that we’ll use to update the weights with the gradients. We get these from PyTorch’s optim package. For example, we can use stochastic gradient descent with optim.SGD.

from torch import optim

# Optimizers require the parameters to optimize and a learning rate
optimizer = optim.SGD(model.parameters(), lr=0.01)

print('Initial weights - ', model[0].weight)

images, labels = next(iter(trainloader))
images.resize_(64, 784)

# Clear the gradients, do this because gradients are accumulated
optimizer.zero_grad()

# Forward pass, then backward pass, then update weights
output = model(images)
loss = criterion(output, labels)
loss.backward()
print('Gradient -', model[0].weight.grad)

# Take an update step with the optimizer
optimizer.step()


Initial weights -  Parameter containing:
tensor([[ 3.5691e-02, 2.1438e-02, 2.2862e-02, ..., -1.3882e-02,
-2.3719e-02, -4.6573e-03],
[-3.2397e-03, 3.5117e-03, -1.5220e-03, ..., 1.4400e-02,
2.8463e-03, 2.5381e-03],
[ 5.6122e-03, 4.8693e-03, -3.4507e-02, ..., -2.8224e-02,
-1.2907e-02, -1.5818e-02],
[-1.4372e-02, 2.3948e-02, 2.8374e-02, ..., -1.5817e-02,
3.2719e-02, 8.5537e-03],
[-1.1999e-02, 1.9462e-02, 1.3998e-02, ..., -2.0170e-03,
1.4254e-02, 2.2238e-02],
[ 3.9955e-04, 4.8263e-03, -2.1819e-02, ..., 1.2959e-02,
-4.4880e-03, 1.4609e-02]])
Gradient - tensor(1.00000e-02 *
[[-0.2609, -0.2609, -0.2609, ..., -0.2609, -0.2609, -0.2609],
[-0.0695, -0.0695, -0.0695, ..., -0.0695, -0.0695, -0.0695],
[ 0.0514, 0.0514, 0.0514, ..., 0.0514, 0.0514, 0.0514],
[ 0.0967, 0.0967, 0.0967, ..., 0.0967, 0.0967, 0.0967],
[-0.1878, -0.1878, -0.1878, ..., -0.1878, -0.1878, -0.1878],
[ 0.0281, 0.0281, 0.0281, ..., 0.0281, 0.0281, 0.0281]])
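The update that plain gradient descent applies to each weight is small enough to write out by hand. Here is a sketch in plain Python, assuming the basic rule with no momentum or weight decay. The sgd_step function and the numbers are hypothetical and only illustrate the rule.

```python
def sgd_step(weights, grads, lr):
    """Return new weights after one gradient descent update: w - lr * grad."""
    return [w - lr * g for w, g in zip(weights, grads)]

weights = [0.5, -0.3, 0.8]
grads = [0.2, -0.1, 0.0]      # hypothetical gradients, for illustration only
updated = sgd_step(weights, grads, lr=0.01)
print(updated)
```

Each weight moves a little in the direction opposite to its gradient, and a weight with zero gradient stays where it is.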

Training for real

Now we’ll put this algorithm into a loop so we can go through all the images. Some nomenclature: one pass through the entire dataset is called an epoch. So here we’re going to loop through trainloader to get our training batches. For each batch, we’ll do a training pass where we calculate the loss, do a backward pass, and update the weights.

model = nn.Sequential(nn.Linear(784, 128),
                      nn.ReLU(),
                      nn.Linear(128, 64),
                      nn.ReLU(),
                      nn.Linear(64, 10),
                      nn.LogSoftmax(dim=1))

criterion = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.003)

epochs = 5
for e in range(epochs):
    running_loss = 0
    for images, labels in trainloader:
        # Flatten MNIST images into a 784 long vector
        images = images.view(images.shape[0], -1)

        # Training pass: clear old gradients, forward, backward, update
        optimizer.zero_grad()

        output = model(images)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
    print(f"Training loss: {running_loss/len(trainloader)}")


Training loss: 1.8959971736234897
Training loss: 0.8684300759644397
Training loss: 0.537974218426864
Training loss: 0.43723612014990626
Training loss: 0.39094475933165945
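The same loop structure (epochs, batches, forward pass, backward pass, update) can be sketched without PyTorch on a toy problem. This is only an illustration under my own assumptions: a made-up one-feature dataset, a single sigmoid unit, and hand-written gradients instead of autograd.

```python
import math
import random

random.seed(0)

# Hypothetical toy dataset: the label is 1 when the feature is positive
features = [random.uniform(-2, 2) for _ in range(64)]
data = [(x, 1.0 if x > 0 else 0.0) for x in features]
batches = [data[i:i + 8] for i in range(0, len(data), 8)]

w, b, lr = 0.0, 0.0, 0.5
epoch_losses = []

for epoch in range(5):
    running_loss = 0.0
    for batch in batches:
        grad_w = grad_b = batch_loss = 0.0
        for x, target in batch:
            p = 1 / (1 + math.exp(-(w * x + b)))    # forward pass
            p = min(max(p, 1e-12), 1 - 1e-12)       # keep log() finite
            batch_loss += -(target * math.log(p) + (1 - target) * math.log(1 - p))
            grad_w += (p - target) * x              # backward pass
            grad_b += p - target
        n = len(batch)
        w -= lr * grad_w / n                        # weight update
        b -= lr * grad_b / n
        running_loss += batch_loss / n
    epoch_losses.append(running_loss / len(batches))
    print(f"Training loss: {epoch_losses[-1]}")
```

As in the PyTorch version, the average loss per epoch should fall as the weights improve.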

Now let's check out its predictions.

%matplotlib inline
import helper

images, labels = next(iter(trainloader))
img = images[0].view(1, 784)

# Turn off gradients to speed up this part
with torch.no_grad():
    logps = model(img)

# Output of the network are log-probabilities, need to take exponential for probabilities
ps = torch.exp(logps)
helper.view_classify(img.view(1, 28, 28), ps)

I am a beginner in this too, and your likes motivate me to write the next parts. This is the first time learning has felt this tough to me, especially deep learning. So please hit like and share this with your friends.


The references are the ones mentioned in Part 1.