PyTorch Crash Course, Part 2

Source: Deep Learning on Medium


From Deep Learning with PyTorch by Eli Stevens and Luca Antiga

In part one, we learned about PyTorch and its component parts. Now let’s take a closer look and see what it can do. In this article, we explore some of PyTorch’s capabilities by playing with pre-trained networks.

Playing with pre-trained networks

Computer vision, a field that deals with making computers gain a high-level understanding of digital images or videos, is certainly one of the fields most impacted by the advent of deep learning, for a variety of reasons. The need for classifying or interpreting the content of natural images was there, huge datasets became available, and new constructs, such as convolutional layers, came about and started to run quickly on GPUs with unprecedented accuracy. All this, combined with the motivation of the Internet giants to understand pictures shot by millions of users through their mobile devices and managed on said giants’ platforms, made for quite the perfect storm.

Throughout this process of development, academic competitions have been one of the main playing fields where researchers at institutions and companies have challenged each other. Among others, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) has gained popularity since its inception in 2010. ILSVRC is an image classification and object detection competition based on a subset of the ImageNet dataset, which is maintained by Stanford University. ImageNet is a massive dataset of over fourteen million images, all labeled with a hierarchy of nouns coming from WordNet, which is in turn a large lexical database of the English language. The steady improvement of the algorithms submitted to ILSVRC is a testament to the rise of deep learning in computer vision.

The ILSVRC uses a training set of 1.2 million images, labeled with one thousand classes. The competition comprises a few tasks, which can vary from year to year, such as image classification (telling what object categories the image contains), object localization (identifying where objects are in images), object detection (identifying and labeling objects in images), scene classification (classifying a situation in an image), and scene parsing (segmenting an image into regions associated with semantic categories, such as cow, house, cheese, hat). In particular, the image classification task consists of taking an image as input and producing a list of five labels out of one thousand total categories, ranked by confidence, describing the content of the image.

Figure 1. A small sample of ImageNet images and related annotations

A pre-trained network that recognizes the subject of an image

As our first foray into deep learning, we’ll now run a state-of-the-art deep neural network that was pre-trained on the ImageNet classification task. This allows us to get accustomed to the mechanics of obtaining and running a neural network on real-world data, and to visualize and evaluate its outputs.

Several pre-trained networks can be accessed through source code repositories. It’s common for researchers to publish their source code along with their papers, and often the code comes with weights which are obtained by training the model on a reference dataset. Were we to create a new web-service with image recognition capabilities, learning how to run one or more pre-trained models using PyTorch could be everything we need.


As discussed, we’ll now equip ourselves with a network trained on ImageNet. To do this, we’ll take a look at the TorchVision project, which conveniently enables access to datasets, like ImageNet, models and utilities for getting up to speed with computer vision applications in PyTorch. We can install it using conda:

$ conda install torchvision -c pytorch

The torchvision module contains a few of the best performing neural network architectures for computer vision, such as AlexNet, ResNet and Inception v3. Let’s load up and run a residual network, ResNet for short, which won the ImageNet classification, detection and localization competitions, among others, in 2015.

The pre-defined models can be found in torchvision.models.

# In[1]:  
from torchvision import models

We can take a look at the models

# In[2]:
dir(models)

# Out[2]:

The capitalized names refer to classes that implement a number of popular models. They differ in their architecture, in the arrangement of the operations occurring between the input and the output.

The AlexNet architecture won the 2012 ImageNet Large-Scale Visual Recognition Challenge by a large margin, with a top five test error rate (i.e. correct label must be in the top five predictions) of 15.4%. By comparison, the second-best submission, not based on a deep network, trailed at 26.2%.
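To make the top-five criterion concrete, here is a minimal sketch in plain Python. The scores and labels are made-up toy values, and `top5_error` is a hypothetical helper name, not part of any library.

```python
def top5_error(score_rows, true_labels, k=5):
    """Fraction of examples whose true label is NOT among the k highest scores."""
    misses = 0
    for scores, truth in zip(score_rows, true_labels):
        # Indices of the k largest scores, best first.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        if truth not in ranked:
            misses += 1
    return misses / len(true_labels)


# Toy data (made up): 3 examples, 8 classes each.
score_rows = [
    [0.1, 0.9, 0.3, 0.2, 0.8, 0.7, 0.6, 0.0],  # top-5 classes: 1, 4, 5, 6, 2
    [0.5, 0.4, 0.3, 0.2, 0.1, 0.0, 0.9, 0.8],  # top-5 classes: 6, 7, 0, 1, 2
    [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],  # top-5 classes: 7, 6, 5, 4, 3
]
true_labels = [2, 5, 3]  # only the second example misses its top five

error = top5_error(score_rows, true_labels)
print(f"top-5 error: {error:.3f}")
```

A prediction counts as correct as long as the true label appears anywhere among the five best guesses, which is why top-five error is always at most as large as top-one error.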

Figure 2. The AlexNet architecture.

We can create an instance of AlexNet easily:

# In[3]:  
alexnet = models.AlexNet()

At this point alexnet is an object that can run the AlexNet architecture. It’s not essential for us to understand the details of this architecture for now. For the time being, this is an opaque object that can be called like a function. By providing alexnet with some input data (we’ll see shortly what this input data should be), we’ll run a forward pass through the network. The input runs through the first set of neurons, whose outputs are fed to the next set of neurons, all the way to the output. Practically speaking, assuming we have an input object of the right type, we can run the forward pass with output = alexnet(input).

If we did that, though, we’d have fed data through the whole network to produce… garbage! This is because the network is uninitialized: its weights haven’t been trained on anything, so the network is a blank (or rather, random) slate. We’d need to either train it from scratch or load weights from a prior training, which we’ll do now.
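A quick way to see the "random slate" is to build the same small architecture twice under different random seeds and feed both the same input. The sketch below uses a tiny stand-in network rather than AlexNet itself (an assumption made to keep it fast), and `make_net` is a hypothetical helper.

```python
import torch
from torch import nn


def make_net(seed):
    """Build a tiny untrained network; its weights depend only on the seed."""
    torch.manual_seed(seed)
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))


x = torch.ones(1, 4)  # one dummy input with four features

with torch.no_grad():  # no gradients are needed for a forward pass
    out_a = make_net(seed=0)(x)
    out_b = make_net(seed=1)(x)

# Same architecture, same input, different random weights: different scores.
print(out_a)
print(out_b)
same = torch.allclose(out_a, out_b)
```

Neither output means anything: the scores only reflect whatever random values the weights happened to start with.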

To this end, let’s look at the models module. The uppercase names correspond to classes that implement popular architectures for computer vision. The lowercase names, on the other hand, are functions that instantiate models with a pre-defined number of layers and units, and optionally download and load pre-trained weights into them. Note that there’s nothing fundamental about using one of these functions. They make it convenient to instantiate the model with a number of layers and units that matches how pre-trained networks were built.
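The split between architecture classes and convenience functions can be sketched as follows. `TinyNet` and `tinynet_small` are hypothetical names invented for this illustration. They mirror the pattern described above and are not torchvision code.

```python
import torch
from torch import nn


class TinyNet(nn.Module):
    """Hypothetical architecture class: the caller chooses depth and width."""

    def __init__(self, num_layers, hidden_units, num_classes=10):
        super().__init__()
        layers = []
        in_features = 16
        for _ in range(num_layers):
            layers += [nn.Linear(in_features, hidden_units), nn.ReLU()]
            in_features = hidden_units
        layers.append(nn.Linear(in_features, num_classes))
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return self.body(x)


def tinynet_small(state_dict=None):
    """Hypothetical factory: fixes the configuration, optionally loads weights."""
    model = TinyNet(num_layers=2, hidden_units=32)
    if state_dict is not None:
        model.load_state_dict(state_dict)
    return model


reference = tinynet_small()                     # stands in for a trained model
clone = tinynet_small(reference.state_dict())   # same configuration, same weights

x = torch.ones(1, 16)
with torch.no_grad():
    outputs_match = torch.equal(reference(x), clone(x))
print(outputs_match)
```

The factory adds no capability of its own. It only guarantees that the architecture is built with the exact configuration the saved weights expect.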

Using the resnet101 function, we’ll now instantiate a 101-layer convolutional neural network. To put things in perspective, before the advent of residual networks in 2015, achieving stable training at such depths was considered extremely hard. Residual networks pulled a trick that made it possible, and by doing so beat several benchmarks in one sweep that year.
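The trick in question is the skip (or shortcut) connection: each block adds its input back onto its output, so the layers only have to learn a correction to the identity. A minimal sketch follows. `ResidualBlock` is a simplified, hypothetical block (no batch normalization, no downsampling), not the Bottleneck module that torchvision uses.

```python
import torch
from torch import nn


class ResidualBlock(nn.Module):
    """Simplified residual block: output = relu(x + F(x))."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.relu = nn.ReLU()

    def forward(self, x):
        residual = self.conv2(self.relu(self.conv1(x)))
        return self.relu(x + residual)  # the skip connection is the "x +" term


block = ResidualBlock(channels=4)
x = torch.randn(1, 4, 8, 8)

with torch.no_grad():
    y = block(x)
    # With the convolution weights zeroed, the residual vanishes and the
    # block passes its input through unchanged (up to the final ReLU).
    block.conv1.weight.zero_()
    block.conv2.weight.zero_()
    y_identity = block(x)

print(y.shape)
```

Because the input always has a direct path to the output, stacking many such blocks does not make it harder for the signal to get through.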

Let’s create an instance of the network now. We’ll pass an argument that instructs the function to download the weights of a ResNet101 trained on the ImageNet dataset, with 1.2 million images and one thousand categories.

# In[4]:
resnet = models.resnet101(pretrained=True)

It’s downloading. While we’re staring at the download progress, we can take a minute to appreciate that ResNet101 sports 44.5 million parameters — a lot of parameters to optimize iteratively through stochastic gradient descent.
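Counting parameters is a one-liner in PyTorch. The sketch below applies it to a single fully connected layer mapping 2048 features to 1000 classes rather than to the full model, so that nothing has to be downloaded. `count_parameters` is a hypothetical helper name.

```python
from torch import nn


def count_parameters(model):
    """Total number of elements across all parameter tensors of a module."""
    return sum(p.numel() for p in model.parameters())


# A weight matrix of 2048 x 1000 entries plus 1000 biases.
fc = nn.Linear(in_features=2048, out_features=1000)
n_params = count_parameters(fc)
print(n_params)  # 2048 * 1000 + 1000 = 2049000
```

The same helper can be called on any module, including a complete network.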


Done? Let’s take a peek at what a ResNet101 looks like. We can do that by printing the value of the returned model.

# In[5]:
resnet

# Out[5]:
(conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace)
(maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
(layer1): Sequential(
(0): Bottleneck(
(avgpool): AvgPool2d(kernel_size=7, stride=1, padding=0)
(fc): Linear(in_features=2048, out_features=1000, bias=True)

If we scroll down, we’ll see a lot of those Bottleneck modules repeating one after the other, containing convolutions and other modules. This is the anatomy of a typical deep neural network for computer vision: a more or less sequential cascade of filters and non-linear functions, ending with a last layer (fc) producing scores for each of the one thousand output classes (out_features).

The resnet variable can be called like a function, taking one or more images as input and producing, for each image, a score for each of the one thousand ImageNet classes. Before we can do that, we must pre-process any input image to ensure that it has the right size and that its values (its colors) sit roughly in the same numerical range. To do that, the torchvision module provides transforms, which allow us to quickly define pipelines of basic pre-processing functions:

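# In[6]:
from torchvision import transforms
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )])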

In this case, we defined a preprocess function that scales the input image so that its shorter side is 256 pixels, crops the image to 224×224 around the center, transforms it to a tensor (a PyTorch multidimensional array, in this case a 3D array with color channel, height, and width), and normalizes its RGB (red, green, blue) components so that they have the defined means and standard deviations. These need to match what was presented to the network during training if we want the network to produce meaningful answers.
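The normalization step is simple per-channel arithmetic: subtract the channel mean, then divide by the channel standard deviation. Here is a sketch with plain tensor operations, using the same means and standard deviations as above on a made-up, uniformly gray "image". It illustrates the arithmetic and is not the torchvision implementation.

```python
import torch

mean = torch.tensor([0.485, 0.456, 0.406])
std = torch.tensor([0.229, 0.224, 0.225])

# Made-up image: 3 channels, 2x2 pixels, every value 0.5 (mid-gray in [0, 1]).
img_t = torch.full((3, 2, 2), 0.5)

# Reshape the statistics to (3, 1, 1) so they broadcast over height and width.
normalized = (img_t - mean[:, None, None]) / std[:, None, None]

# One value per channel, since every pixel in a channel is identical here.
per_channel = normalized[:, 0, 0]
print(per_channel)
```

After this step each channel is centered near zero with roughly unit spread, which is the range the network saw during training.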

We can now grab a picture of our favorite dog, say bobby.jpg. We can load the image using Pillow, an image manipulation module for Python:

# In[7]:
from PIL import Image
img = Image.open("bobby.jpg")

If we were following along from a Jupyter notebook, we’d do the following to see the picture inline where the <PIL.JpegImagePlugin… is below:

# In[8]:
img

# Out[8]:
<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=1280x720 at 0x2074868AEF0>

Otherwise we can invoke the show method, which opens a window with a viewer:

img.show()


Figure 3. Bobby, our special input image.

We can now pass the image through our pre-processing pipeline

# In[9]:
img_t = preprocess(img)

and shape up the input tensor in a way that the network expects.

# In[10]:
import torch
batch_t = torch.unsqueeze(img_t, 0)
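The network expects a batch of images, so even a single image needs a leading batch dimension. Here is a small sketch with a zero-filled stand-in for the pre-processed image (an assumption, so that no image file is needed):

```python
import torch

img_t = torch.zeros(3, 224, 224)      # stand-in for one pre-processed RGB image
batch_t = torch.unsqueeze(img_t, 0)   # insert a new dimension at position 0

print(img_t.shape)    # torch.Size([3, 224, 224])
print(batch_t.shape)  # torch.Size([1, 3, 224, 224])
```

The data is unchanged. Only the shape gains a first dimension of size one, meaning "a batch containing one image".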


We’re now ready to run our model. Before doing that, we need to put the network in eval mode because we’ll be querying it for answers, rather than training it.

# In[11]:
resnet.eval()

# Out[11]:
(conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace)
(maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
(layer1): Sequential(
(0): Bottleneck(
(avgpool): AvgPool2d(kernel_size=7, stride=1, padding=0)
(fc): Linear(in_features=2048, out_features=1000, bias=True)

If we forget to do the above, several pre-trained models won’t produce meaningful answers because of how some modules, like batch normalization, work internally.
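Batch normalization shows why the mode matters. In training mode the layer normalizes with the statistics of the current batch. In eval mode it uses the running statistics accumulated during training. A minimal sketch on a freshly created layer, whose running mean and variance are still at their defaults of 0 and 1 (the input values are made up):

```python
import torch
from torch import nn

bn = nn.BatchNorm1d(num_features=2)
x = torch.tensor([[10.0, 200.0],
                  [12.0, 220.0],
                  [14.0, 240.0]])

with torch.no_grad():
    bn.eval()
    out_eval = bn(x)    # uses running mean 0 and variance 1: values barely change

    bn.train()
    out_train = bn(x)   # uses this batch's own mean and variance

print(out_eval)
print(out_train)
```

The same layer and the same input give very different outputs depending on the mode, so a model left in training mode at inference time can produce scores that depend on whatever else is in the batch.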

Because eval has been set, we can now run the model.

# In[12]:
out = resnet(batch_t)
out

# Out[12]:
tensor([[ -3.4803, -1.6618, -2.4515, -3.2662, -3.2466, -1.3611,
-2.0465, -2.5112, -1.3043, -2.8900, -1.6862, -1.3055,
2.8674, -3.7442, 1.5085, -3.2500, -2.4894, -0.3354,
0.1286, -1.1355, 3.3969, 4.4584]])

A staggering set of operations involving 44.5 million parameters has happened, producing a vector of one thousand scores, one per ImageNet class. That didn’t take long, did it?

Figure 4. The inference process.

Let’s see what happened. The input image is first pre-processed into a torch.FloatTensor holding three color channels, each a 2D matrix of a specific size. That processed input is then passed through the pre-trained network to obtain class scores, which map one-to-one onto class labels. The highest score corresponds to the most likely class according to the weights. The output is contained in a torch.FloatTensor with one thousand elements, each representing the score associated with one class.
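The raw scores are not probabilities, but a softmax turns them into values between 0 and 1 that sum to 1, which are easier to read as confidences. This step is not part of the walkthrough above, so treat it as an optional sketch on made-up scores:

```python
import torch

scores = torch.tensor([[2.0, 1.0, 0.1]])       # made-up scores for three classes
probabilities = torch.softmax(scores, dim=1)   # normalize across the class dimension

print(probabilities)
print(probabilities.sum(dim=1))                # each row sums to 1
```

Softmax preserves the ranking of the scores, so the most likely class is the same before and after.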

We now need to find out the label of the class that received the highest score. To do this, we’ll load a text file listing the labels in the same order they were presented to the network during training, then pick out the label at the index that produced the highest score from the network.

Let’s load the file containing the one thousand labels for the ImageNet dataset classes [REF Lasagne repository]:

# In[13]:
with open('imagenet_classes.txt') as f:
    labels = f.readlines()

At this point we need to find out the index corresponding to the maximum score in the out tensor we obtained above. We can do that using the max function in PyTorch, which outputs the maximum value in a tensor as well as the indices where that maximum value occurred:

# In[14]:  
_, index = torch.max(out, 1)
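On a small made-up tensor, the two return values of `torch.max` along a dimension look like this:

```python
import torch

out = torch.tensor([[0.5, 3.2, -1.0, 2.7]])   # one row of made-up scores
values, index = torch.max(out, 1)             # reduce over dimension 1 (the classes)

print(values)  # tensor([3.2000])
print(index)   # tensor([1])

best = index.item()   # convert the one-element tensor to a plain Python int
```

Both results keep the batch dimension, so there is one maximum and one index per row of `out`.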

We can now use the index to access the label (index is a one-element PyTorch tensor rather than a plain number; forgive the ceremony for getting the value of the index):

# In[15]:
labels[index[0]]

# Out[15]:
'golden retriever\n'

Oh-oh, who’s a good boy?

Because the model produced scores, we can also find out which classes came second best, third best, and so on. To do this, we can use the sort function, which sorts the values in ascending or descending order and also provides the indices of the sorted values in the original array:

# In[16]:
_, indices = torch.sort(out, descending=True)
[labels[idx] for idx in indices[0]][:5]

# Out[16]:
['golden retriever\n',
'Labrador retriever\n',
'cocker spaniel, English cocker spaniel, cocker\n',
'tennis ball\n']
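The same ranking can be tried on a toy example. The labels and scores below are invented for illustration, and `torch.topk` is shown alongside `torch.sort` as a shortcut when only the first few entries are needed.

```python
import torch

labels = ["cat\n", "dog\n", "ball\n", "car\n"]   # invented labels, one per class
out = torch.tensor([[1.5, 4.0, 2.5, -0.5]])      # invented scores for one image

# Full ranking, best first, then keep the first three labels.
_, indices = torch.sort(out, descending=True)
top3_sorted = [labels[idx] for idx in indices[0]][:3]

# torch.topk returns only the k best values and their indices.
_, top_idx = torch.topk(out, k=3, dim=1)
top3_topk = [labels[idx] for idx in top_idx[0]]

print(top3_sorted)
print(top3_topk)
```

Both approaches give the same labels in the same order. Sorting ranks every class, while `topk` stops at the requested number.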

We see that the first three are dogs, after which things start to get funny. Time to play: we could go ahead and interrogate our network with random images and see what it comes up with. How successful the network is largely depends on whether the subjects are well represented in the training set. If we present an image containing a subject outside the training set, it’s quite possible that the network comes up with a wrong answer with high confidence. It’s quite useful to experiment and get a feel for how a model reacts to unseen data.

We’ve just run a network that won an image classification competition in 2015! It learned to recognize our dog from examples of dogs, together with a ton of other real-world subjects.

In part three, we will see how different architectures can achieve other kinds of tasks, starting with image generation.

For more information about the book, check it out on liveBook here.

About the authors:
 Eli Stevens has worked in Silicon Valley for the past 15 years as a software engineer, and the past 7 years as Chief Technical Officer of a startup making medical device software. Luca Antiga is co-founder and CEO of an AI engineering company located in Bergamo, Italy, and a regular contributor to PyTorch.

Originally published at