Original article can be found here (source): Deep Learning on Medium
We start with a messy canvas by generating a random image. Let’s call it G.
Content: what we paint
Content has to do with what we’re painting rather than how it’s painted. Different artists can paint a cat in different styles, but the same object appears in each image.
We run our content image (let’s call it C) through VGG-19 and collect its activations at a layer that’s neither too deep nor too shallow in our network. If the layer is too shallow, the activations stay close to the raw pixel values of our image.
Let’s use a layer in the middle, like the 4_2 layer (the second convolutional layer of the 4th block).
We then pass our noise image G forward through the same network and collect its activations at the same layer.
We want to minimize the difference between the content of our content image and the content of our generated image. We can measure this difference using a loss function like mean squared error. We’ll call it J(G).
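Written out, one common squared-error form of this loss is the one below. It follows the original neural style transfer paper by Gatys et al. and is assumed here, since the text only says "mean square error". The sum runs over every feature map i and every position j in layer l:

```latex
J(G) = \frac{1}{2} \sum_{i,j} \left( F^{l}_{ij} - P^{l}_{ij} \right)^{2}
```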
Where l is a layer of the network, F is the content image, and P is the generated image. F^l and P^l are the feature representations of the images at layer l.
Then we check how similar their content is by comparing the activations of C and G; if the activations are similar, then the two images contain similar content.
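As a rough illustration of that comparison, here is a minimal NumPy sketch. The arrays are made-up stand-ins for activations (not real VGG-19 output), the function name `content_loss` is hypothetical, and the half-sum-of-squares scaling is an assumption; a plain mean would work just as well for ranking images.

```python
import numpy as np

def content_loss(content_features, generated_features):
    """Half the sum of squared differences between two activation arrays."""
    diff = generated_features - content_features
    return 0.5 * np.sum(diff ** 2)

# Stand-in activations: 4 feature maps of 3x3 each (random numbers, not VGG-19 output).
rng = np.random.default_rng(0)
content_acts = rng.random((4, 3, 3))
noise_acts = rng.random((4, 3, 3))

# Identical activations give a loss of exactly zero.
print(content_loss(content_acts, content_acts))
# Different activations give a positive loss.
print(content_loss(content_acts, noise_acts))
```

The closer the second number gets to zero during optimization, the more the generated image shares the content of the content image at that layer.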
Style: how we paint
Van Gogh can paint hundreds of different objects but all in a similar style.
To extract style, we’ll use the highlighted layers.
How does our network know what style is?
Our CNN looks for style by keeping an eye out for consistent relationships between texture, colours and lines across multiple layers, each of which produces a set of feature maps.
We look for correlations between the activations of different feature maps. Say the red feature map represents colour and the yellow one represents texture. If they’re strongly correlated, then wherever that specific texture is found in the image we’ll also find that correlated colour (e.g., where there are straight lines in the image, there is also the colour blue).
Sometimes a layer will pick up on one feature, but that doesn’t mean the entire image has that feature. To ensure the entire image has that general style, we check multiple layers for that pattern.
To find correlations between feature maps, we take the dot product of their activations. Doing this for every pair of channels in a layer gives us a Gram matrix.
If an entry of the Gram matrix is large, then the two feature maps it compares are correlated.
The Gram matrix finds correlations between feature maps regardless of what part of the image it’s looking at. This is important because style runs throughout the image, not just in one specific area.
We’ll write the entries of our Gram matrix as G_kk’, where k and k’ are the two feature maps being compared.
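A small NumPy sketch may make the Gram matrix more concrete. The three 2x2 "feature maps" below are invented for illustration, and `gram_matrix` is a hypothetical helper name. Entry [k, k'] is the dot product of flattened map k with flattened map k'.

```python
import numpy as np

def gram_matrix(feature_maps):
    """Gram matrix of a (channels, height, width) activation array."""
    k = feature_maps.shape[0]
    flat = feature_maps.reshape(k, -1)  # one row per feature map
    return flat @ flat.T                # entry [k, k'] = dot product of maps k and k'

maps = np.array([
    [[1.0, 0.0], [1.0, 0.0]],  # map 0
    [[2.0, 0.0], [2.0, 0.0]],  # map 1: same pattern as map 0, but stronger
    [[0.0, 3.0], [0.0, 3.0]],  # map 2: active only where maps 0 and 1 are silent
])
G = gram_matrix(maps)
print(G)

# Shuffling pixel positions (the same way in every map) leaves the Gram matrix
# unchanged, which is why it captures style regardless of location.
perm = np.random.default_rng(0).permutation(4)
shuffled = maps.reshape(3, -1)[:, perm].reshape(3, 2, 2)
print(np.allclose(gram_matrix(shuffled), G))
```

Maps 0 and 1 fire in the same places, so their shared entry is large; map 2 never fires where they do, so its entries with them are zero.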
We measure the difference in style between our images by applying the mean squared error loss function to the Gram matrices of the style image and the generated image.
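Here is a minimal sketch of that style comparison over several layers. The activations are random stand-ins rather than VGG-19 output, the function names are hypothetical, and both the 1 / (4 K² N²) scaling and the equal per-layer weights are assumptions borrowed from the Gatys et al. formulation rather than something the article specifies.

```python
import numpy as np

def gram_matrix(feature_maps):
    """Gram matrix of a (channels, height, width) activation array."""
    k = feature_maps.shape[0]
    flat = feature_maps.reshape(k, -1)
    return flat @ flat.T

def layer_style_loss(style_maps, generated_maps):
    """Scaled squared difference between the two Gram matrices of one layer."""
    k, height, width = style_maps.shape
    n = height * width
    diff = gram_matrix(generated_maps) - gram_matrix(style_maps)
    return np.sum(diff ** 2) / (4.0 * k**2 * n**2)

def style_loss(style_layers, generated_layers):
    """Average the per-layer style losses (equal layer weights assumed)."""
    losses = [layer_style_loss(s, g) for s, g in zip(style_layers, generated_layers)]
    return sum(losses) / len(losses)

# Stand-in activations for two layers of different sizes.
rng = np.random.default_rng(0)
style_layers = [rng.random((4, 8, 8)), rng.random((8, 4, 4))]
generated_layers = [rng.random((4, 8, 8)), rng.random((8, 4, 4))]

print(style_loss(style_layers, style_layers))      # same activations: zero loss
print(style_loss(style_layers, generated_layers))  # different activations: positive loss
```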
Overall loss = Content loss + Style loss
The goal of the algorithm is to minimize the loss functions of both style and content.
We combine the loss functions for style and content to get an overall loss function, which tells us how good our baby image is.
But notice how we calculated their loss functions differently. To balance this out, we scale them with separate weights (alpha and beta). Typically we’d make the weight for style much heavier than the weight for content.
We can tweak these weights to favour content or style. Say we make our content weight significantly heavier than our style weight; then our resulting image would have much less style and focus more on content.
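In symbols, the weighted sum can be written as follows. The subscripted names J_content, J_style and J_total are labels introduced here for illustration; the article itself only names the weights alpha and beta.

```latex
J_{\text{total}}(G) = \alpha \, J_{\text{content}}(G) + \beta \, J_{\text{style}}(G)
```

Only the ratio of alpha to beta matters for the final image: doubling both weights doubles the loss everywhere but leaves the image that minimizes it unchanged.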
To minimize this loss, we’ll use gradient descent. We use backpropagation to find the gradient of our loss function with respect to the pixels of our image, then change the image step by step until we minimize our loss.
We keep on repeating this process until we can’t get our loss function to be any smaller; this is when we have a good baby image.
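A single step of this process can be written as the update below. The step size η (the learning rate) is a symbol assumed here, not one the article defines, and J_total stands for the overall loss.

```latex
G \leftarrow G - \eta \, \frac{\partial J_{\text{total}}}{\partial G}
```

Note that the network's weights stay fixed; only the pixels of G change.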
Executing Neural Style Transfer
- Find the activations of our content image at the 4_2 layer when passing it through VGG-19
- Find the activations and Gram matrices at multiple layers of our style image when passing it through VGG-19
- Generate a random image
- Run the random image through VGG-19, repeating the first two steps for the generated image, then repeatedly run backpropagation until the loss function is minimized
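The steps above can be sketched end to end on a toy problem. To keep the example self-contained, VGG-19 is replaced by a hypothetical stand-in: a single fixed linear layer (like a 1x1 convolution with no nonlinearity), so the gradient can be written by hand instead of relying on a deep learning framework. The weights, image sizes, alpha, beta, and the step-halving rule are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for VGG-19: maps 3 colour channels to 4 feature maps.
W = rng.normal(scale=0.5, size=(4, 3))

def features(image):
    """image: (3, n_pixels) -> feature maps: (4, n_pixels)."""
    return W @ image

def gram(F):
    return F @ F.T

def loss_and_grad(image, P, A, alpha, beta):
    """Total loss and its gradient with respect to the pixels of `image`."""
    F = features(image)
    k, n = F.shape
    content_diff = F - P
    gram_diff = gram(F) - A
    content = 0.5 * np.sum(content_diff ** 2)
    style = np.sum(gram_diff ** 2) / (4.0 * k**2 * n**2)
    # Gradient with respect to the feature maps, then back through W to the pixels.
    dF = alpha * content_diff + beta * (gram_diff @ F) / (k**2 * n**2)
    return alpha * content + beta * style, W.T @ dF

n_pixels = 16
content_img = rng.random((3, n_pixels))
style_img = rng.random((3, n_pixels))

P = features(content_img)      # content activations
A = gram(features(style_img))  # style Gram matrix
G_img = rng.random((3, n_pixels))  # random starting image

alpha, beta, lr = 1.0, 100.0, 0.1  # arbitrary illustrative values
loss, grad = loss_and_grad(G_img, P, A, alpha, beta)
history = [loss]
for _ in range(200):
    candidate = G_img - lr * grad
    cand_loss, cand_grad = loss_and_grad(candidate, P, A, alpha, beta)
    if cand_loss < loss:
        G_img, loss, grad = candidate, cand_loss, cand_grad
    else:
        lr *= 0.5  # the step overshot; retry with a smaller one
    history.append(loss)

print(f"loss went from {history[0]:.4f} to {history[-1]:.4f}")
```

With a real network the hand-written gradient would be replaced by a framework's automatic differentiation, but the loop has the same shape: compute the loss, get the gradient with respect to the image, and nudge the pixels.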