Image Style Transfer using Convolutional Neural Networks

Source: Deep Learning on Medium

By Utkarsh Verma

Image style transfer is an interesting and artistic application of Deep Learning in which you can transfer the style of one image onto another. Here, "style" means the finer details of an image: brush strokes, edges, patterns and patches, and so on. Just as many artists had a signature way of stylising their artwork, image style transfer lets us produce amusing results by crossing over the styles and works of different artists.

Style transfer of Hokusai’s Wave on an image of a cat.

With the ground-breaking results produced by convolutional neural networks, they have been used extensively in almost all Computer Vision applications. However, the hidden layers of ConvNets were long seen as a black box, and what I consider a major breakthrough in clearing up this black-box mist was achieved by Gatys et al. in their publication. They demonstrated experimentally that the feature maps in the deeper hidden layers contain rich information about the content of the image. They also established that the style information contained in an image can be extracted from the correlations between the feature maps of the first convolutional layers in the conv-stacks.

Convolutional Layer Stacks in VGG19

I’ll take the help of a diagram to explain this better. Looking at the architecture, the stacks are named ‘conv1’, ‘conv2’, etc., and their individual layers are referred to by their position within the stack; for example, the first layer of the third stack is called ‘conv3_1’.

Content Loss

Now, because of the pooling layers, the height and width of the feature maps keep shrinking as we go deeper into the network. This gradually eliminates unimportant details, leaving only the content, i.e. the complex feature information, in the dense layers deep in the network. So, in order to preserve the content information (since only the style is being transferred from the other image), the feature maps deep in the architecture are taken as targets, and we take the mean squared difference between them as the loss function,

The content loss is set to be the mean squared difference between the dense features of target and content images.

where C_target is the convolutional feature map of the resultant (target) image and C_content is that of the image whose content is to be preserved. In the paper, the ‘conv4_2’ feature map is used for the content loss.
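The content loss itself is only a few lines; a minimal sketch, assuming the conv4_2 feature maps have already been extracted:

```python
import torch

def content_loss(target_feat, content_feat):
    # Mean squared difference between the two conv4_2 feature maps.
    return torch.mean((target_feat - content_feat) ** 2)

c = torch.randn(1, 512, 28, 28)   # stand-in for the content image's conv4_2 map
t = c.clone()                     # the target starts as a clone of the content image
loss = content_loss(t, c)         # zero at the start, grows as the target is stylised
```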

So, the content loss is sorted! Let’s move on to the style loss. It’s a bit tricky, but hold on.

Style Loss

To transfer the style from the style image, we take the first conv layer of each stack, which has dimensions d*w*h (where d is the depth, i.e. the number of feature maps, and w and h are their width and height), and flatten it out into a 2-D matrix with d rows and w*h columns.

Multiplying this matrix by its transpose, we get what is known as a Gram matrix. Each cell of the matrix now represents, in a position-independent form, the correlation between a pair of feature maps in the layer. It is very interesting to notice that the dimensions of the final Gram matrix depend only on the number of feature maps, not on their width or height. The Gram matrices are calculated for the first conv layer of each of the five stacks. Then the weighted mean squared difference between these sets of Gram matrices is set to be the style loss,

Style Loss

where S_target is a Gram matrix of the target image and S_style is the corresponding Gram matrix of the style image. Note that each layer contributes differently to the style of the image, with the contribution generally weighted less as we go deeper into the architecture.
This sums up the style loss.
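The Gram-matrix construction and the resulting style loss can be sketched together. The per-layer weights below are illustrative assumptions (the paper only requires that each layer be weighted; decreasing weights with depth match the behaviour described above):

```python
import torch

def gram_matrix(feature_map):
    # feature_map: (batch, d, h, w) from a conv layer; flatten to (d, w*h)
    b, d, h, w = feature_map.shape
    flat = feature_map.view(b * d, h * w)
    return flat @ flat.t()            # (d, d): depends only on the number of maps

# Illustrative per-layer weights, decreasing with depth (assumed values).
style_weights = {'conv1_1': 1.0, 'conv2_1': 0.8, 'conv3_1': 0.5,
                 'conv4_1': 0.3, 'conv5_1': 0.1}

def style_loss(target_feats, style_grams, weights=style_weights):
    # Weighted mean squared difference between target and style Gram matrices.
    loss = 0.0
    for name, w in weights.items():
        t = target_feats[name]
        _, d, h, wd = t.shape
        diff = gram_matrix(t) - style_grams[name]
        loss += w * torch.mean(diff ** 2) / (d * h * wd)   # normalise by map size
    return loss
```

Because the Gram matrix is d × d, it can be compared between images even when their spatial sizes differ; the division by d*h*w keeps the layers’ contributions on a comparable scale.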

Final Step

Now, in the final step, we form the total loss as a linear combination of the content and style losses and backpropagate it, updating the pixels of the target image rather than the network’s weights. The ratio of the style-loss coefficient to the content-loss coefficient is typically on the order of 10,000. Since how much style we want transferred is a matter of choice, the number of epochs/training duration is arbitrary.

Essentially, to sum up the entire flow: we took a content image, a style image, and a clone of the content image (the target). We constructed a loss function that preserves the content from the content image but, because of the much larger multiplication factor on the style loss, is dominated by the style image. This leads to the creation of an interesting masterpiece from your very own code.
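The whole flow above can be sketched as a minimal, self-contained optimisation loop. A single random conv layer stands in for VGG19 here so the example runs quickly; the point is the loss structure (alpha * content + beta * style) and the fact that the loop optimises the target image’s pixels, not any network weights.

```python
import torch

torch.manual_seed(0)

# Stand-in feature extractor (a real run would use pretrained VGG19 layers).
extractor = torch.nn.Conv2d(3, 8, 3, padding=1)
for p in extractor.parameters():
    p.requires_grad_(False)          # network weights stay frozen

def gram(f):
    b, d, h, w = f.shape
    flat = f.view(b * d, h * w)
    return flat @ flat.t() / (d * h * w)

content = torch.rand(1, 3, 32, 32)   # dummy content image
style = torch.rand(1, 3, 32, 32)     # dummy style image
target = content.clone().requires_grad_(True)   # target: a clone of the content

content_feat = extractor(content).detach()
style_gram = gram(extractor(style)).detach()

alpha, beta = 1.0, 1e4               # style/content ratio on the order of 10,000
opt = torch.optim.Adam([target], lr=0.01)
for step in range(50):               # duration is arbitrary: more steps, more style
    opt.zero_grad()
    feat = extractor(target)
    loss = alpha * torch.mean((feat - content_feat) ** 2) \
         + beta * torch.mean((gram(feat) - style_gram) ** 2)
    loss.backward()                  # gradients flow into the target image's pixels
    opt.step()
final_loss = loss.item()
```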

You can find the code here.