Towards Fast Neural Style Transfer



The seminal paper on Neural Style Transfer by Gatys et al. [1] demonstrates a remarkable characteristic of Deep Convolutional Neural Networks: the representations learned by successive layers of parametric convolutions can be separated into ‘content’ and ‘style’. The fundamental idea behind Style Transfer is that DCNNs pre-trained on tasks such as ImageNet classification can be used as descriptor networks. An image is passed through a pre-trained DCNN such as VGG [2], and the intermediate feature activations are used to fuse the ‘style’ of one image with the ‘content’ of another. Deriving the loss function from a pre-trained network’s feature activations is the foundational idea behind Neural Style Transfer.

Despite these amazing results, implementing Neural Style Transfer according to [1] requires a slow iterative optimization process. An output image is initialized (for example from random noise or from the content image) and passed through the pre-trained VGG. The ReLU activations from layers 1 through 5 are made non-localized by computing their Gram matrices, and these form the style representation. The feature activations at layer 4 form the content representation. The pixels of the output image are then optimized via backpropagation until its style and content representations match those of the target style and target content images.
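To make this concrete, here is a minimal sketch of that iterative procedure in PyTorch, assuming torchvision’s pre-trained VGG-19. The layer indices, optimizer, step count, and loss weights are illustrative choices rather than the exact settings of [1] (which, for instance, uses L-BFGS rather than Adam):

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

# Illustrative indices into torchvision's vgg19().features:
# relu1_1, relu2_1, relu3_1, relu4_1, relu5_1 for style; relu4_2 for content.
STYLE_LAYERS = [1, 6, 11, 20, 29]
CONTENT_LAYER = 22

vgg = vgg19(pretrained=True).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def features(x, layers):
    """Run x through VGG and collect activations at the requested layer indices."""
    feats = {}
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in layers:
            feats[i] = x
        if i >= max(layers):
            break
    return feats

def gram(feat):
    """Gram matrix of a BxCxHxW activation: inner products between flattened feature maps."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def gatys_style_transfer(content_img, style_img, steps=300, alpha=1.0, beta=1e4):
    """content_img, style_img: ImageNet-normalized 1x3xHxW tensors."""
    # The pixels of the output image are the parameters being optimized
    # (initialized here from the content image; [1] also starts from white noise).
    x = content_img.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=0.02)
    target_c = features(content_img, [CONTENT_LAYER])[CONTENT_LAYER].detach()
    target_g = {i: gram(f).detach() for i, f in features(style_img, STYLE_LAYERS).items()}
    for _ in range(steps):
        opt.zero_grad()
        feats = features(x, STYLE_LAYERS + [CONTENT_LAYER])
        content_loss = F.mse_loss(feats[CONTENT_LAYER], target_c)
        style_loss = sum(F.mse_loss(gram(feats[i]), target_g[i]) for i in STYLE_LAYERS)
        (alpha * content_loss + beta * style_loss).backward()
        opt.step()
    return x.detach()
```

Every stylization therefore requires hundreds of forward and backward passes through VGG, which is exactly the cost the feed-forward approach below removes.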

Results from Ulyanov et al.’s [3] faster style transfer with feed-forward networks

This article presents a paper from Ulyanov et al. [3] that speeds up Neural Style Transfer by training feed-forward networks so that only a single forward pass is needed to stylize an image; the full citation is given in the references below.

Quick Interesting Statistics from the Paper

  • Training took about 2 hours on an NVIDIA Tesla K40 GPU
  • Stylizing an image takes about 20 ms
  • Generating a 256 × 256 sample requires about 170 MB of memory
  • The authors found the best results when training with just 16 content images

Network Architecture

I think one of the most useful ways to understand a new Deep Learning paper is to look at the architecture it uses, so that is where this article will begin exploring the technique.

Multi-Scale Generator Architecture for Feed-Forward Style Transfer [3]

There are many parts to this architecture. Firstly, it is a multi-scale architecture, similar to what is used in LAPGAN or Progressively-Growing GANs. Each z represents a random noise input at a different spatial resolution: during training, a noise sample z is drawn consisting of K tensors (in the picture above, K = 5). Each noise tensor is passed through three convolutional layers and then joined with the coarser scale below it via an upsampling and concatenation operation. The picture above shows the configuration for texture synthesis only. Unfortunately, the authors did not provide an additional figure for style transfer, but the change is a small one: the content image y is downsampled to match the resolution of each noise tensor in z and concatenated with it. For example, the 4×4×c noise tensor is concatenated with a 4×4×3 downsampled content image, the 8×8×c tensor with an 8×8×3 one, and so on.

Another interesting detail is the use of 1×1 convolutions towards the end of the network. These preserve the spatial resolution while reducing the depth of the feature maps, so that the output goes from an H×W×C tensor to an H×W×3 RGB image (H = height, W = width, C = channels).
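As a rough PyTorch sketch of this generator (for the style-transfer variant just described; the channel widths, normalization, activation, and sigmoid output are illustrative choices rather than the paper’s exact configuration, and the noise is drawn inside forward() for simplicity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBlock(nn.Module):
    """Three 3x3 convolutions, standing in for the conv blocks at each scale."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        layers = []
        for i in range(3):
            layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                       nn.BatchNorm2d(out_ch),
                       nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

class MultiScaleGenerator(nn.Module):
    """Coarse-to-fine generator: at every scale a noise tensor (concatenated with a
    downsampled copy of the content image) is convolved, then upsampled and joined
    with the next finer scale; a final 1x1 convolution maps the features to RGB."""
    def __init__(self, scales=5, width=8, noise_channels=3):
        super().__init__()
        self.scales = scales
        self.noise_channels = noise_channels
        in_ch = noise_channels + 3                        # noise z_k + downsampled content y
        self.enter = nn.ModuleList(ConvBlock(in_ch, width) for _ in range(scales))
        self.merge = nn.ModuleList(ConvBlock(2 * width, width) for _ in range(scales - 1))
        self.to_rgb = nn.Conv2d(width, 3, kernel_size=1)  # 1x1 conv: HxWxC -> HxWx3

    def forward(self, content):
        b, _, h, w = content.shape
        x = None
        for k in range(self.scales):                      # coarsest resolution first
            sh, sw = h >> (self.scales - 1 - k), w >> (self.scales - 1 - k)
            z = torch.rand(b, self.noise_channels, sh, sw, device=content.device)
            y = F.interpolate(content, size=(sh, sw), mode='bilinear', align_corners=False)
            cur = self.enter[k](torch.cat([z, y], dim=1))
            if x is None:
                x = cur
            else:
                x = F.interpolate(x, size=(sh, sw), mode='nearest')  # upsample coarser result
                x = self.merge[k - 1](torch.cat([x, cur], dim=1))    # join and convolve
        return torch.sigmoid(self.to_rgb(x))
```

Calling `MultiScaleGenerator()(content)` on a 1×3×256×256 content image returns a 1×3×256×256 stylized image in a single forward pass.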

Perhaps more interesting than the architecture, however, is the loss function used for this task. It comprises two parts, a style loss and a content loss, each derived from the intermediate activations of a pre-trained Deep Convolutional Neural Network (in this case VGG-19).

The style loss is made non-localized through the computation of the Gram matrix:

Gram Matrix Equation

This takes the inner product between the feature maps at each layer. For example, if the output of a convolutional layer is 50×50×64, each of the 64 feature maps is flattened into a vector of length 2,500 and the inner product is taken between every pair of feature maps, producing a 64×64 Gram matrix. The style loss is then the difference between the Gram matrices of the generated image and those of the style image, summed over the chosen layers.
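Since the equation images do not reproduce here, the standard formulation (following Gatys et al. [1], with F^l denoting the matrix whose rows are the flattened feature maps of layer l; the per-layer weights w_l are a common convention rather than something stated in this article) can be written as:

```latex
G^{l}_{ij} = \sum_{k} F^{l}_{ik} F^{l}_{jk},
\qquad
\mathcal{L}_{\mathrm{style}} = \sum_{l} w_{l}
  \left\lVert G^{l}(\hat{x}) - G^{l}(x_{\mathrm{style}}) \right\rVert_{F}^{2}
```

where \hat{x} is the generated image and x_style is the style target.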

In contrast, the content loss is computed directly on the feature activations: it measures the differences between the generated and content images across the spatial locations of each feature map. The two loss functions therefore capture very different kinds of information. For Style Transfer the two losses are combined into a single objective and weighted against each other with a parameter alpha.
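Written in the same notation (again a reconstruction rather than the article’s own equation; placing alpha on the content term is one common convention), the content loss and the combined objective are roughly:

```latex
\mathcal{L}_{\mathrm{content}} =
  \left\lVert F^{l}(\hat{x}) - F^{l}(x_{\mathrm{content}}) \right\rVert_{2}^{2},
\qquad
\mathcal{L} = \mathcal{L}_{\mathrm{style}} + \alpha \, \mathcal{L}_{\mathrm{content}}
```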

This loss function is not much different from what was proposed by Gatys et al. [1]; what is interesting is that here it is used to train a feed-forward network rather than to optimize the pixels of a single image.
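To make that difference concrete, here is a rough training-step sketch (reusing the `features`, `gram`, `STYLE_LAYERS` and `CONTENT_LAYER` helpers from the earlier Gatys sketch; the optimizer and weighting are again illustrative, not the paper’s exact settings). The crucial change from [1] is that the gradients of the same perceptual loss now update the generator’s weights rather than the pixels of a single image:

```python
import torch.nn.functional as F

def train_step(generator, optimizer, content_batch, style_grams, alpha=1.0):
    """One update of the feed-forward generator under the style + content loss.
    style_grams: {layer index: Gram matrix of the style image, precomputed once}."""
    optimizer.zero_grad()
    out = generator(content_batch)                                 # single forward pass
    feats = features(out, STYLE_LAYERS + [CONTENT_LAYER])
    target_c = features(content_batch, [CONTENT_LAYER])[CONTENT_LAYER].detach()
    style_loss = 0.0
    for i in STYLE_LAYERS:
        g = gram(feats[i])
        style_loss = style_loss + F.mse_loss(g, style_grams[i].expand_as(g))
    content_loss = F.mse_loss(feats[CONTENT_LAYER], target_c)
    loss = style_loss + alpha * content_loss
    loss.backward()                                                # gradients flow into the generator's weights
    optimizer.step()
    return loss.item()
```

Once trained, stylizing a new image is just a call like `generator(content_img)`, a single forward pass, which is what makes the roughly 20 ms per image figure quoted above possible.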

It is very interesting to see how Neural Style Transfer can be accomplished with a single forward pass of a Deep Neural Network. Later works highlight that this approach is limited to the set of styles it was trained on, and that the quality of the results is not always as high as that of Gatys et al. [1]. It will be interesting to see how Neural Style Transfer algorithms develop further. Thanks for reading!

More Results from This Paper [3]

References

[1] Leon A. Gatys, Alexander S. Ecker, Matthias Bethge. A Neural Algorithm of Artistic Style. 2015.

[2] Karen Simonyan, Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. 2014.

[3] Dmitry Ulyanov, Vadim Lebedev, Andrea Vedaldi, Victor Lempitsky. Texture Networks: Feed-forward Synthesis of Textures and Stylized Images. 2016.