Style Transfer of Images with CNN in PyTorch

Source: Deep Learning on Medium

This post describes style transfer of images. Style transfer means transferring the style of one image onto another. The process takes two images, a content image and a style image, and the goal is to apply the style of the style image to the content image. It extracts the content from the content image and the style from the style image, and combines them into a new, third image.

To better understand style transfer, this post covers the following topics:

· Transfer Learning

· VGG19 Pre-trained network

· Content Loss

· Style and Content Separation

· Gram Matrix

· Style Loss

· Total Style Transfer Loss

· Implementation of Style Transfer in PyTorch

Transfer Learning

Transfer learning is a machine learning method in which a neural network developed for one task is reused for a new task. For example, a VGG network trained to identify many classes of images can be reused to train our own network to identify our own specific categories of images.

Transfer learning only works in deep learning if the model features learned from the first task are general.

In transfer learning, we first train a base network on a base dataset and task, and then we repurpose the learned features, or transfer them, to a second target network to be trained on a target dataset and task. This process will tend to work if the features are general, meaning suitable to both base and target tasks, instead of specific to the base task.

(Reference —

An example of transfer learning in PyTorch:

I. From TorchVision Models use ‘resnet’ pretrained model

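from torch import nn, optim
from torchvision import models

model = models.resnet50(pretrained=True)

print(model)  # prints the model summary (abridged below)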



(conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)

(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)

(relu): ReLU(inplace)

(maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)

(layer1): …….

(layer2): ……..

(layer3): …….

(layer4): …….

(avgpool): AvgPool2d(kernel_size=7, stride=1, padding=0)

(fc): Linear(in_features=2048, out_features=1000, bias=True)


The model summary above shows the entire network structure, with a series of convolution layers that extract the features. The idea is to reuse the same structure up to feature extraction for our own images and to train a different fully connected classifier on the extracted features.

II. Now Use our own classifier instead of fully connected layer at the end

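for param in model.parameters():
    param.requires_grad = False  # freeze the pre-trained feature extractor

classifier = nn.Sequential(nn.Linear(2048, 512),
                           nn.ReLU(),  # assumed: the original line here was lost
                           nn.Linear(512, 2),  # Identify two classes of images
                           nn.LogSoftmax(dim=1))  # assumed: NLLLoss expects log-probabilities

model.fc = classifier  # Own network instead of the ResNet fully connected layer

criterion = nn.NLLLoss()

optimizer = optim.Adam(model.fc.parameters(), lr=0.003)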

III. Now the above model can be trained with our own images for classification

Style and Content Separation

A generic CNN architecture looks like this:

Input Image -> Convolution Layer -> Pooling Layer -> Fully Connected Layer. A ConvNet arranges its neurons in three dimensions (width, height, depth). Every layer of a ConvNet transforms a 3D input volume into a 3D output volume of neuron activations. The input layer holds the image, so its width and height are the dimensions of the image and its depth is 3 (red, green and blue channels). The output of the last layers before the fully connected layers is the feature representation of the image. This feature representation is flattened and fed through the fully connected layers. At this depth, the feature representation captures the content of the image rather than its textures and colors.

To get a representation of style, a feature space designed to capture texture, color, curvature, etc. is used. This space looks at the spatial correlations within a layer. The correlations tell how similar or dissimilar the feature maps are, which gives information about the color and texture of the image.

Since it is possible to get both the style and the content of images, we will see later how to combine them to create a new image with style transfer.

VGG19 Pre-trained network

The approach defined in the paper (see the references at the end) is outlined here using a pre-trained VGG19 network.

This network accepts a color image as input and passes it through a series of convolution and pooling layers, followed by 3 fully connected layers that classify the image. Between the 5 pooling layers there are stacks of 2 or 4 convolution layers. Each pooling layer downsamples the image, and the depth generally increases after each pooling layer.
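
The layer layout described above can be sketched in plain Python. The configuration list below is the standard VGG19 layout (channel counts for convolutions, 'M' for max pooling), and the sketch assumes that every convolution is immediately followed by a ReLU, as in the sequential features part of torchvision's VGG19. The helper name name_vgg19_layers is hypothetical and only serves this illustration.

```python
# Standard VGG19 layout: numbers are conv output channels, 'M' is a max-pool layer.
VGG19_CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M',
             512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M']


def name_vgg19_layers(cfg=VGG19_CFG):
    """Map sequential layer indices to names such as 'conv4_2' or 'pool3'.

    Assumes every convolution is followed by a ReLU, so each conv entry
    occupies two consecutive indices (conv, relu) and each 'M' occupies one.
    """
    names = {}
    index, stack, conv_in_stack = 0, 1, 0
    for entry in cfg:
        if entry == 'M':
            names[str(index)] = f'pool{stack}'
            index += 1
            stack += 1          # a pooling layer closes the current stack
            conv_in_stack = 0
        else:
            conv_in_stack += 1
            names[str(index)] = f'conv{stack}_{conv_in_stack}'
            names[str(index + 1)] = f'relu{stack}_{conv_in_stack}'
            index += 2
    return names


layer_names = name_vgg19_layers()
conv_layers = {i: n for i, n in layer_names.items() if n.startswith('conv')}
# 16 convolution layers plus 3 fully connected layers give the "19" in VGG19
print(len(conv_layers), 'convolution layers')
for i in ('0', '5', '10', '19', '21', '28'):
    print(i, layer_names[i])
```

The indices printed at the end are the ones used later in this post to pick the style layers (conv1_1 to conv5_1) and the content layer (conv4_2).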

Content Loss

Style transfer creates a new image that takes its content from one image and its style from another. Both the content image and the style image are passed through the VGG19 network. When the content image is fed through, its content representation is extracted from a deep convolution layer, before the fully connected layers. The output of this layer is the representation of the content image. For the style image, the style is extracted from several other intermediate convolution layers. With both the content and the style representations, we can create a new image from the two. How do we create that new image? We can start with either the content image or the style image and slowly add the other representation to it until the loss is minimized.

In the paper, the content representation is taken from the output of the second convolution layer in the 4th stack (conv4_2). When we form the new image, we compare its content representation, also taken at conv4_2, with that of the content image. The two should be the same even if the style of the target differs. The loss between the two is the content loss.

L Content = Mean Squared Difference between two representations
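
As a worked example of this formula, the sketch below computes the mean squared difference between two tiny, made-up feature representations in plain Python so that the arithmetic stays visible. The function name content_loss and the numbers are illustrative assumptions, not values from a real network.

```python
def content_loss(target_features, content_features):
    """Mean squared difference between two equally sized flat feature lists."""
    if len(target_features) != len(content_features):
        raise ValueError("feature representations must have the same size")
    squared = [(t - c) ** 2 for t, c in zip(target_features, content_features)]
    return sum(squared) / len(squared)


# Made-up activations standing in for flattened conv4_2 outputs.
content = [0.0, 1.0, 2.0, 3.0]
target = [0.0, 1.0, 2.0, 5.0]   # differs from the content in one position

print(content_loss(content, content))  # identical representations give 0.0
print(content_loss(target, content))   # (5 - 3)^2 / 4 = 1.0
```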

Our goal is to minimize this loss along with Style loss.

Gram Matrix

The style representation of an image relies on the correlations between the features within individual layers. Similarities can be found at multiple layers. By considering the correlations in multiple layers of different sizes, a multi-scale style representation of the image is obtained. The style representation is captured from the output of the first convolution layer in each of the 5 stacks (conv1_1, conv2_1, conv3_1, conv4_1 and conv5_1). The correlations at each layer are given by the Gram matrix.

G = V · V^T

To calculate the Gram matrix, the feature maps of a layer are flattened and stacked into one matrix V, with one row per feature map. This matrix multiplied by its transpose gives the Gram matrix.

Suppose a layer has 8 feature maps, each of size 4*4. Flattening them gives a matrix of 8*16. Multiplying it by its transpose gives a Gram matrix of 8*8. This is similar to a covariance matrix between the features.

Now we have a Gram matrix representing the style information in a layer, so we can calculate the style loss.
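
The example with 8 feature maps of 4*4 can be checked with a small plain-Python sketch. The feature values are arbitrary placeholders and the helper name gram_matrix_2d is hypothetical; the sketch only illustrates the shapes involved and the symmetry of the result.

```python
def gram_matrix_2d(features):
    """Multiply a (maps x positions) matrix by its transpose.

    Entry [i][j] is the dot product of flattened feature map i with map j.
    """
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in features]
            for row_i in features]


num_maps, height, width = 8, 4, 4
# Arbitrary placeholder activations: one flattened 4x4 map (16 values) per row.
flattened = [[(m + 1) * 0.1 * (p % 3) for p in range(height * width)]
             for m in range(num_maps)]

gram = gram_matrix_2d(flattened)
print(len(flattened), 'x', len(flattened[0]))   # 8 x 16 input matrix
print(len(gram), 'x', len(gram[0]))             # 8 x 8 Gram matrix
```

Each entry of the result compares two feature maps over all spatial positions, which is why the spatial layout disappears and only the co-occurrence of features remains.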

Style Loss

To calculate the style loss between the style image and the target image, the mean squared distance between the style and target Gram matrices is used.

With VGG19, the Gram matrices formed at all 5 style layers are used.

Ss: list of Gram matrices for the style image

Ts: list of Gram matrices for the target image

L-Style = a * Sum (W * Square(Ts - Ss))

a: a constant that accounts for the number of values in each layer

W: style weights
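
The formula can be sketched in plain Python for two hypothetical layers with tiny 2*2 Gram matrices. All numbers are made up for illustration, the function names are hypothetical, and dividing by the number of values in each layer is assumed to play the role of the constant a.

```python
def mean_squared_difference(target_gram, style_gram):
    """Mean of squared element-wise differences between two Gram matrices."""
    diffs = [(t - s) ** 2
             for t_row, s_row in zip(target_gram, style_gram)
             for t, s in zip(t_row, s_row)]
    return sum(diffs) / len(diffs)


def style_loss(target_grams, style_grams, style_weights, layer_sizes):
    """Weighted sum over layers of the Gram-matrix distance.

    layer_sizes[layer] is the number of values in that layer's feature
    tensor; dividing by it plays the role of the constant `a`.
    """
    total = 0.0
    for layer, weight in style_weights.items():
        distance = mean_squared_difference(target_grams[layer], style_grams[layer])
        total += weight * distance / layer_sizes[layer]
    return total


# Two hypothetical layers with tiny 2x2 Gram matrices.
Ss = {'conv1_1': [[4.0, 2.0], [2.0, 4.0]], 'conv2_1': [[1.0, 0.0], [0.0, 1.0]]}
Ts = {'conv1_1': [[2.0, 2.0], [2.0, 2.0]], 'conv2_1': [[1.0, 0.0], [0.0, 1.0]]}
W = {'conv1_1': 1.0, 'conv2_1': 0.5}          # illustrative style weights
sizes = {'conv1_1': 2 * 4, 'conv2_1': 2 * 4}  # depth * (height * width), assumed

print(style_loss(Ts, Ss, W, sizes))  # only conv1_1 differs: (4 + 0 + 0 + 4) / 4 / 8 = 0.25
print(style_loss(Ss, Ss, W, sizes))  # identical styles give 0.0
```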

Total Style Transfer Loss

Now, with the content and style losses, we can calculate the total loss. We add the two losses and use backpropagation to minimize the total by iteratively changing our target image to match the content of the content image and the style of the style image.

The content and style losses are calculated differently, so their values are on different scales. To balance them, we apply constant weights: Alpha is the content weight and Beta is the style weight.

Total ST Loss = Alpha * Content-Loss + Beta * Style-Loss

As Beta gets larger, the style representation of the target image matches the style image more closely.
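
A minimal sketch of this weighting, using made-up loss values, shows how a larger Beta makes the style term dominate the total. The function name total_style_transfer_loss and the numbers are assumptions for illustration; here the raw style loss is simply assumed to be numerically much smaller than the content loss.

```python
def total_style_transfer_loss(content_loss, style_loss, alpha=1.0, beta=1e6):
    """Weighted sum of the two losses: alpha * content + beta * style."""
    return alpha * content_loss + beta * style_loss


# Made-up loss values, chosen only to illustrate the effect of beta.
content_loss, style_loss = 2.0, 3e-6

for beta in (1e4, 1e6, 1e8):
    total = total_style_transfer_loss(content_loss, style_loss, alpha=1.0, beta=beta)
    style_share = beta * style_loss / total   # fraction of the total due to style
    print(f'beta={beta:.0e}  total={total:.2f}  style share={style_share:.2%}')
```

Because the function only uses multiplication and addition, it would work unchanged with PyTorch tensors in place of the floats.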

Implementation of Style Transfer in PyTorch

Now that we have seen the concept and math behind style transfer, let’s implement these concepts in PyTorch.

Import the required PyTorch Modules

from PIL import Image

import matplotlib.pyplot as plt

import numpy as np

import torch

import torch.optim as optim

from torchvision import transforms, models

Load VGG19 pre-trained model

# VGG19 contains two parts, features and classifier
# Features is the part of the network with convolution and max pool layers
# Classifier is the part of the network with 3 fully connected layers and the classifier output
# Only the features part is needed here, so the layer names are '0', '1', ...
vgg = models.vgg19(pretrained=True).features

# freeze all VGG parameters since we’re only optimizing the target image
for param in vgg.parameters():
    param.requires_grad_(False)

# move the model to GPU, if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
vgg.to(device)

Load images

def load_image(img_path, max_size=400, shape=None):
    ''' Load in and transform an image, making sure the image
        is <= 400 pixels in the x-y dims.'''
    image = Image.open(img_path).convert('RGB')
    # large images will slow down processing
    if max(image.size) > max_size:
        size = max_size
    else:
        size = max(image.size)
    if shape is not None:
        size = shape
    # resize, convert to a tensor and normalize with the ImageNet mean and std
    in_transform = transforms.Compose([
        transforms.Resize(size),
        transforms.ToTensor(),
        transforms.Normalize((0.485, 0.456, 0.406),
                             (0.229, 0.224, 0.225))])
    # discard the transparent, alpha channel (that's the :3) and add the batch dimension
    image = in_transform(image)[:3,:,:].unsqueeze(0)
    return image

# helper function for un-normalizing an image
# and converting it from a Tensor image to a NumPy image for display
def im_convert(tensor):
    """ Display a tensor as an image. """
    image = tensor.to("cpu").clone().detach()
    image = image.numpy().squeeze()
    image = image.transpose(1,2,0)
    image = image * np.array((0.229, 0.224, 0.225)) + np.array((0.485, 0.456, 0.406))
    image = image.clip(0, 1)
    return image

# load in content and style image, using shape parameter to make both content and style of same shape to make processing easier
content = load_image("C:\\Users\\vprayagala2\\Pictures\\Content_Img.jpg",shape=[400,400]).to(device)
style = load_image("C:\\Users\\vprayagala2\\Pictures\\Style_Img.jpg", shape=[400,400]).to(device)
# display the images
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))
# content and style ims side-by-side
ax1.imshow(im_convert(content))
ax2.imshow(im_convert(style))

Style and Content Feature Extraction

# Print VGG19 network to understand the layers
print(vgg)

The layer numbers and names can be correlated with those presented in the VGG19 architecture diagram.

def get_features(image, model, layers=None):
    """ Run an image forward through a model and get the features for
        a set of layers. Default layers are for VGGNet matching Gatys et al (2016)
    """
    ## Need the layers for the content and style representations of an image
    # As mentioned, conv4_2 is content representation
    # Conv1_1 thru conv5_1 is for style representation
    # The keys are layer indices in the convolutional (features) part of VGG19
    if layers is None:
        layers = {'0': 'conv1_1',
                  '5': 'conv2_1',
                  '10': 'conv3_1',
                  '19': 'conv4_1',
                  '21': 'conv4_2',  ## content representation is output of this layer
                  '28': 'conv5_1'}
    features = {}
    x = image
    # model._modules is a dictionary holding each module in the model
    for name, layer in model._modules.items():
        x = layer(x)
        if name in layers:
            features[layers[name]] = x
    return features

# get content and style features only once before training
content_features = get_features(content, vgg)
style_features = get_features(style, vgg)

Gram Matrix

The output of every convolutional layer is a Tensor with dimensions associated with the batch_size, a depth d and some height and width (h, w). The Gram matrix of a convolutional layer can be calculated as follows:

Get the depth, height, and width of a tensor using batch_size, d, h, w = tensor.size()

Reshape that tensor so that the spatial dimensions are flattened

Calculate the gram matrix by multiplying the reshaped tensor by its transpose

Note: You can multiply two matrices using torch.mm(matrix1, matrix2).

def gram_matrix(tensor):
    """ Calculate the Gram Matrix of a given tensor
        Gram Matrix:
    """
    # get the batch_size, depth, height, and width of the Tensor
    _, d, h, w = tensor.size()
    # reshape so we're multiplying the features for each channel
    # (this assumes a batch size of 1)
    tensor = tensor.view(d, h * w)
    # calculate the gram matrix
    gram = torch.mm(tensor, tensor.t())
    return gram

Calculate the Style Transfer Loss, minimize it and Improve the Target Image

Now that we have all the information, we can calculate the style transfer loss.

Individual Layer Style Weights

Below, you are given the option to weight the style representation at each relevant layer. It’s suggested that you use a range between 0–1 to weight these layers. By weighting earlier layers (conv1_1 and conv2_1) more, you can expect to get larger style artifacts in your resulting, target image. Should you choose to weight later layers, you’ll get more emphasis on smaller features. This is because each layer is a different size and together they create a multi-scale style representation!

Content and Style Weight

Just like in the paper, we define an alpha (content weight) and a beta (style weight). This ratio will affect how stylized your final image is. It’s recommended that you leave the content weight = 1 and set the style weight to achieve the ratio you want.
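
The snippet that defines these weights is not shown in this post, so the sketch below only illustrates what such a definition might look like. The per-layer numbers are assumptions chosen to lie in the suggested 0 to 1 range and to weight earlier layers more, the content weight is kept at 1 as recommended, and the style weight is one of the Beta values mentioned at the end of this post.

```python
# Per-layer style weights, each in the suggested 0-1 range.
# These particular numbers are illustrative assumptions, not the author's values.
style_weights = {'conv1_1': 1.0,
                 'conv2_1': 0.8,
                 'conv3_1': 0.5,
                 'conv4_1': 0.3,
                 'conv5_1': 0.1}

content_weight = 1     # alpha, kept at 1 as recommended above
style_weight = 1e6     # beta, one of the values tried in this post

# Sanity checks: every style layer has a weight, and all weights lie in [0, 1].
style_layers = ['conv1_1', 'conv2_1', 'conv3_1', 'conv4_1', 'conv5_1']
assert set(style_weights) == set(style_layers)
assert all(0.0 <= w <= 1.0 for w in style_weights.values())
print('alpha : beta =', content_weight / style_weight)
```

Changing style_weight while leaving content_weight at 1 is the simplest way to adjust the ratio between the two.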

# gram matrices of the style image, computed once before training
style_grams = {layer: gram_matrix(style_features[layer]) for layer in style_features}
# start the target as a copy of the content image and optimize its pixels
# (assumed initialization; the original definition is not shown in this post)
target = content.clone().requires_grad_(True)
# style_weights, content_weight and style_weight are the weights described
# above and must be defined before this loop

# for displaying the target image, intermittently
show_every = 400
# iteration hyperparameters
optimizer = optim.Adam([target], lr=0.003)
steps = 2000  # decide how many iterations to update your image (5000)
for ii in range(1, steps+1):
    # get the features from your target image
    target_features = get_features(target, vgg)
    # the content loss
    content_loss = torch.mean((target_features['conv4_2'] - content_features['conv4_2'])**2)
    # the style loss
    # initialize the style loss to 0
    style_loss = 0
    # then add to it for each layer's gram matrix loss
    for layer in style_weights:
        # get the "target" style representation for the layer
        target_feature = target_features[layer]
        target_gram = gram_matrix(target_feature)
        _, d, h, w = target_feature.shape
        # get the "style" style representation
        style_gram = style_grams[layer]
        # the style loss for one layer, weighted appropriately
        layer_style_loss = style_weights[layer] * torch.mean((target_gram - style_gram)**2)
        # add to the style loss
        style_loss += layer_style_loss / (d * h * w)
    # calculate the *total* loss
    total_loss = content_weight * content_loss + style_weight * style_loss
    # update your target image
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    # display intermediate images and print the loss
    if ii % show_every == 0:
        print('Total loss: ', total_loss.item())
        plt.imshow(im_convert(target))
        plt.show()
# display content and final, target image
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))
ax1.imshow(im_convert(content))
ax2.imshow(im_convert(target))

The code was run on Google Colab to make use of GPU hardware. The completed code is shared at the link below. You can experiment with various Beta values to see how the style is captured in the target image. I experimented with Beta values of 1e6 and 1e8. The sample output below is with a Beta of 1e8.

Sample output


· Paper on Style Transfer —

· Udacity PyTorch Nanodegree

· Stanford CNN for Visual Recognition —