Neural Style Transfer — A high-level approach

Original article was published by Daniel Deutsch on Deep Learning on Medium

Transfer learning and style transfer

Another important concept is the use of a pre-trained network, most often VGG-19. It is noteworthy that this is an instance of so-called “transfer learning”.

We have 2 concepts here to distinguish:

  1. Transfer learning
  2. Style transfer

Though both use the word “transfer”, they are quite different from an implementation standpoint.

1. Transfer learning

The concept itself is extremely interesting and powerful: it lets us create new solutions by building on established models.

For a fantastic introduction, I can recommend this article:

It is crucial to understand how it is used in the context of style transfer.

In short, we can say

Transfer learning and domain adaptation refer to the situation where what has been learned in one setting … is exploited to improve generalization in another setting

This is especially useful in computer vision, as the computation and training of such models are quite resource-hungry. Being able to use a model that has been trained on a huge dataset, and whose weights are now freely available, is very convenient for individual experimentation.

You can use transfer learning as:

  1. direct use of a pre-trained model
  2. feature extraction of pre-trained models
  3. fine-tuning: retraining the weights of the last layer(s) of a pre-trained model

In our case, we will use the second approach: feature extraction, where the output of a layer prior to the output layer is used as input for further processing.
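As a sketch of what feature extraction looks like in practice, the following uses torchvision's VGG-19 and truncates it at an intermediate convolutional layer. Note this is an illustrative assumption of the setup, not the article's original code; the weights are left uninitialized here to avoid a download, whereas real style transfer would load the pre-trained ImageNet weights.

```python
# Minimal feature-extraction sketch with torchvision's VGG-19.
import torch
from torchvision import models

# Convolutional part only (no classifier head); eval mode, random weights.
vgg = models.vgg19().features.eval()

# Truncate the network: keep everything up to (and including) conv4_2,
# which sits at index 21 of VGG-19's feature extractor.
feature_extractor = torch.nn.Sequential(*list(vgg.children())[:22])

image = torch.rand(1, 3, 64, 64)  # a dummy RGB image batch
with torch.no_grad():
    features = feature_extractor(image)

# The output is a volume of feature maps, not class scores.
print(features.shape)
</imports>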

2. Style Transfer

From the original paper:

Conceptually most closely related are methods using texture transfer to achieve artistic style transfer. However, these previous approaches mainly rely on non-parametric techniques to directly manipulate the pixel representation of an image. In contrast, by using Deep Neural Networks trained on object recognition, we carry out manipulations in feature spaces that explicitly represent the high level content of an image.

This means that the specialty of the deep learning approach is to extract the style of an image not by mere pixel observation of the style picture, but from the feature representations of a pre-trained model. So, in essence, to discover the style of an image, we

  1. process the style image by analyzing its pixels
  2. feed this information to the layers of a pre-trained model, which “understands”/classifies the provided input as objects

How this is done is explored in the section “Style cost”.

Style and content

The basic idea is to transfer the style of one image to the content of another.

Therefore we need to understand two things:

  1. What is the content of an image
  2. What is the style of an image

Loosely speaking, the content of an image is what we humans identify as objects in it: a car, a bridge, houses, etc. Style is harder to define, and it depends heavily on the image: the overall texture, color choices, contrast, and so on.

Those definitions need to be expressed in a mathematical way to be implemented in the world of machine learning.

Cost calculation

First, why cost/loss calculation? It is important to understand that in this context the cost is simply the difference between the original and the generated image. There are multiple ways to calculate it (mean squared error, Euclidean distance, etc.). By minimizing the differences between the images, we are able to transfer styles.
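As a toy illustration of “cost as the difference between two images”, here is the mean squared error over the pixels of two tiny (hypothetical) 2×2 grayscale images:

```python
import numpy as np

original = np.array([[0.0, 0.5], [1.0, 0.25]])
generated = np.array([[0.1, 0.5], [0.8, 0.25]])

# Mean squared error: average of the squared per-pixel differences.
mse = np.mean((original - generated) ** 2)
print(mse)  # → 0.0125
```

A perfect reconstruction would give a cost of exactly zero; minimizing this value drives the generated image toward the target.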

When we start out with a large loss, the style transfer is not very good: styles are transferred, but the result looks rough and unintuitive. With each cost-minimization step, we move toward a better blend of style and content and ultimately a better resulting image.

As we can see, the central element of this process is the loss calculation. There are three costs that need to be calculated:

  1. Content cost
  2. Style cost
  3. Total (variation) cost

Those steps are in my opinion the hardest to understand, so let’s dive into it one by one.

Always keep in mind that we are comparing the original input with the generated image. Those differences are the cost. And this cost we want to minimize.

It is important to keep this in mind, because other differences (between feature maps, between style representations) are also calculated along the way.

Content cost

What is content cost? As we found out before, we define the content of an image by its objects: things that we as humans can recognize.

Having understood the structure of a CNN, it becomes apparent that toward the end of the network we can access a layer that represents the objects (the content) quite well. Going through the pooling layers, we lose the stylistic parts of the image, but for capturing the content, this is desired.

Now the feature maps in higher layers of the CNN are activated in the presence of different objects. So if two images have the same content, they should have similar activations in the higher layers.

That is the premise for defining the cost function.
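The content cost can be sketched as follows: assuming we already have the activations of one high convolutional layer for the content image and for the generated image (random placeholders here), the cost is just their mean squared difference.

```python
import numpy as np

rng = np.random.default_rng(0)
a_content = rng.standard_normal((512, 14, 14))    # activations, content image
a_generated = rng.standard_normal((512, 14, 14))  # activations, generated image

def content_cost(a_c, a_g):
    # Mean squared difference between the two activation volumes.
    return np.mean((a_c - a_g) ** 2)

print(content_cost(a_content, a_generated))
print(content_cost(a_content, a_content))  # identical content → zero cost
```

If two images produce the same activations in that layer, the content cost is zero, which matches the premise above: same content, similar activations.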

The following image helps to understand how the layer is rolled out to be prepared for calculations (which are not covered in this article):

from Aditya Gupta’s article, under MIT License

Style cost

Now it is getting sophisticated.

Make sure to understand the difference between the style of an image and the style loss of an image; the two calculations are different. One detects the “style representation” (texture, colors, etc.), the other compares the style of the original image with the style of the generated image.

The total style cost is calculated in two steps:

  1. the style cost across the convolutional layers, identifying the style of the style image:

a. getting the feature vectors from a convolutional layer

b. comparing those vectors with one another (finding their correlations)

  2. the style cost between the original (the original style image!) and the generated image.

To find the style, the correlations are captured by multiplying the feature map by its transpose, resulting in the Gram matrix.

from Aditya Gupta’s article, under MIT License

Luckily, the CNN provides us with multiple layers to choose from for capturing style. By comparing several layers and their correlations, we can identify the style of an image.

So instead of using a layer’s raw output, we use the Gram matrix of the feature maps of an individual layer to identify the style of an image.

The first cost is the difference between those Gram matrices, i.e., the difference in correlations. The second cost is again the difference between the original image and the generated one. This is, in essence, the “style transfer”.
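The Gram matrix and the resulting per-layer style cost can be sketched as follows, again assuming the layer activations are given (random placeholders stand in for real VGG features):

```python
import numpy as np

rng = np.random.default_rng(1)

def gram_matrix(activations):
    # activations: (channels, height, width). Unroll each feature map into a
    # vector, then correlate the maps by multiplying with the transpose.
    c, h, w = activations.shape
    f = activations.reshape(c, h * w)
    return f @ f.T  # (channels, channels) Gram matrix

def style_cost(a_style, a_generated):
    # Difference between the Gram matrices of the style image and the
    # generated image, for one layer.
    return np.mean((gram_matrix(a_style) - gram_matrix(a_generated)) ** 2)

a_style = rng.standard_normal((64, 8, 8))
a_gen = rng.standard_normal((64, 8, 8))
print(style_cost(a_style, a_gen))
print(style_cost(a_style, a_style))  # same style representation → zero cost
```

In a full implementation this per-layer cost would be computed for several layers and summed with per-layer weights; the single-layer version above shows the core idea.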

Total variation cost

It acts as a regularizer that improves smoothness in the generated image. It was not used in the original paper, but it improves the results. In essence, it penalizes differences between neighboring pixels, smoothing out artifacts in the generated image.
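A minimal sketch of this idea: sum the squared differences between neighboring pixels, horizontally and vertically (shown here for a grayscale image for simplicity).

```python
import numpy as np

def total_variation_cost(img):
    # img: (height, width) grayscale image.
    dh = img[1:, :] - img[:-1, :]   # vertical neighbor differences
    dw = img[:, 1:] - img[:, :-1]   # horizontal neighbor differences
    return np.sum(dh ** 2) + np.sum(dw ** 2)

smooth = np.ones((4, 4))                    # perfectly flat image
noisy = np.arange(16.0).reshape(4, 4) % 2   # alternating 0/1 pixels

print(total_variation_cost(smooth))  # → 0.0
print(total_variation_cost(noisy))
```

A flat image has zero total variation, while a noisy, high-contrast one is penalized, which is exactly the smoothing pressure we want on the generated image.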




Daniel is an entrepreneur, software developer, and business law graduate. He has worked at various IT companies, tax advisory, management consulting, and at the Austrian court.

His knowledge and interests currently revolve around programming machine learning applications and all its related aspects. To the core, he considers himself a problem solver of complex environments, which is reflected in his various projects.

Don’t hesitate to get in touch if you have ideas, projects, or problems.
