Original article can be found here (source): Deep Learning on Medium
Image Enhancement: “Teaching AI To Fill-In The Pixels”
In this article, we discuss and implement image-to-image translation for Super-Resolution. We then cover the best tips and tricks for improving the results.
I started my Data Science journey with a fascination for raw data processing power and the potential for doing good with the insights gained from the data. In this day and age, we are producing more data than any living organism can process, so we have to turn to our own creations to help us in this regard. More than that, this field is the key to the project of my life. It’s incredible how bits complement atoms: the perfect symbiosis between the real world and the digital world, extending beyond their individual limitations.
Computer Vision (CV), specifically Generative Adversarial Networks (GANs) and image-to-image translation, has fascinated me for a while. For the past 6 months I have been learning, building and playing with some algorithms, namely:
The first three I have had a lot of fun with and shared my experience with you in previous articles. I saved the best for last.
Without further ado, let us get into it.
Image Transformation task
Many classic problems can be framed as image transformation tasks, where a system receives some input image and transforms it into an output image.
I believe the landmark work on Super-Resolution is the 2016 paper Perceptual Losses for Real-Time Style Transfer and Super-Resolution. This paper was a game-changer because it significantly improved not only the quality of the results but also the efficiency of both Style Transfer and Super-Resolution. The authors forgot to include segmentation, but that would probably have made the title too big and complicated.
There are many ways to do Image Super-Resolution; we are going to focus on a couple of techniques and improve on them.
The previous approach was to train a feed-forward convolutional neural network in a supervised manner (providing inputs and labels) using a per-pixel loss function, which directly measures the difference between output and ground-truth images. This approach is efficient at test time, since it only requires a forward pass through the trained network. However, per-pixel losses don’t capture the finer differences between output and ground-truth images, which the authors call perceptual differences.
But before we move on to perceptual loss, let’s go into the details of pixel loss, because it’s fairly simple to implement.
This kind of problem can be formulated as a supervised learning problem, where we provide our algorithm with inputs and labels. In this case both the input and the ground truth are images, but the input is a 2x, 4x or 8x lower-resolution version of the ground-truth image.
There are many ways of creating a data pipeline using the various deep learning frameworks, but I’m sticking to fastai v1 because for me it’s the easiest and most concise. It just makes more sense; Jeremy and the fastai community really nailed the library development.
To build our pipeline, we use an ImageImageList, which is suitable for image-to-image translation tasks such as Image Super-Resolution, colorisation and de-noising.
It is as simple as:
We create a get_data() function because we are going to do progressive resizing and rescaling, so we need a way to create our dataloader on the fly. The beauty of fastai’s ImageImageList is that besides transform(), which applies transforms to the inputs, we also have transform_y(), where we can set the transformations we want for the ground-truth images. Here, the ground-truth size is the size of the input image multiplied by the scale.
With the foundation we built, we can easily change the size and scale of our data.
It’s as easy as:
data = get_data(bs, (sz_lr, sz_lr*scale))
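For readers who don’t use fastai, the idea behind the paired pipeline can be sketched in plain PyTorch. This is not the fastai implementation: `PairedSRDataset` and `get_loader` are hypothetical names, random tensors stand in for real images, and bilinear resizing is an assumed way of producing the low-resolution inputs.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader


class PairedSRDataset(Dataset):
    """Hypothetical dataset: each item is (low-res input, high-res target)."""

    def __init__(self, images, sz_lr, scale):
        self.images = images                  # tensor of shape (N, 3, H, W)
        self.sz_lr, self.scale = sz_lr, scale

    def __len__(self):
        return len(self.images)

    def __getitem__(self, i):
        img = self.images[i].unsqueeze(0)     # add a batch dim for interpolate
        sz_hr = self.sz_lr * self.scale       # target size = input size * scale
        hr = F.interpolate(img, size=(sz_hr, sz_hr),
                           mode="bilinear", align_corners=False)
        lr = F.interpolate(hr, size=(self.sz_lr, self.sz_lr),
                           mode="bilinear", align_corners=False)
        return lr.squeeze(0), hr.squeeze(0)


def get_loader(images, bs, sz_lr, scale):
    """Build a loader on the fly, so size and scale can change between stages."""
    return DataLoader(PairedSRDataset(images, sz_lr, scale),
                      batch_size=bs, shuffle=True)


images = torch.rand(8, 3, 128, 128)           # stand-in for a real image set
loader = get_loader(images, bs=4, sz_lr=32, scale=2)
lr_batch, hr_batch = next(iter(loader))
print(lr_batch.shape, hr_batch.shape)
```

Calling `get_loader` again with a larger `sz_lr` or `scale` gives the next stage of progressive resizing without touching the rest of the code.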
A typical ConvNet used for a task such as image classification extracts features by passing the image through a series of convolutional layers followed by batch norm, ReLU and max-pool layers; the pooling reduces the spatial size of the features. After this, it has an Adaptive Average Pooling layer just before the fully connected layers that output the classification.
The Adaptive Average Pooling layer transforms variable-size input features into a vector of fixed size. This is helpful because it allows a model to be trained on images of one size, say 256×256, and do inference on images of any size, like 400×500.
For an image-to-image translation task, we want everything before the Adaptive Average Pooling layer, because that layer throws away too much geometrical information.
For this task, we don’t want our model to downsample (reduce the size of) our input image, because it would be too computationally intensive to train an upsampling path to resize it back to the original size and then upscale it to the set scale. It’s not impossible, but it’s not practical for this kind of simple model; we will see in a later section how to overcome this.
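A quick way to see both points is to run PyTorch’s `nn.AdaptiveAvgPool2d` on feature maps of different sizes. The channel count and spatial sizes below are arbitrary values chosen for illustration.

```python
import torch
import torch.nn as nn

pool = nn.AdaptiveAvgPool2d(1)            # always pool down to a 1x1 map

small = torch.rand(1, 512, 8, 8)          # features from a smaller image
large = torch.rand(1, 512, 13, 16)        # features from a larger, non-square image

out_small = pool(small).flatten(1)        # shape (1, 512)
out_large = pool(large).flatten(1)        # shape (1, 512)
print(out_small.shape, out_large.shape)
```

Both inputs end up as a 512-element vector, which is what a classifier head needs, but every spatial position has been averaged away, which is exactly what an image-to-image model cannot afford.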
So, we have to build a custom model that will do all the processing but will keep the input size constant throughout the network and will have an upsampling block that will do 2x or 4x scaling.
Since we want to do a lot of computation without reducing the size or changing the image much, ResNet blocks suit this task: the input and output are very similar, so each block only has to learn a small correction on top of the identity. For this network, we will have 1 conv -> 8 ResNet blocks.
In the previous article, we used transposed convolutions, but here we will use a technique called Pixel Shuffle and a weight initialization technique called ICNR, which helps eliminate checkerboard artifacts. We will go in depth on this topic in a later section.
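As a rough sketch of such a network in plain PyTorch, assuming 3x3 convolutions and 32 feature channels (neither is stated in the text): `ResBlock` and `SRResNet` are hypothetical names, and this is not necessarily the exact model used in the notebook.

```python
import torch
import torch.nn as nn


class ResBlock(nn.Module):
    """Two 3x3 convs with an identity shortcut; the spatial size is unchanged."""

    def __init__(self, nf):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(nf, nf, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(nf, nf, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)


class SRResNet(nn.Module):
    """1 conv -> 8 ResNet blocks -> pixel-shuffle upsampling block."""

    def __init__(self, scale=2, nf=32, n_blocks=8):
        super().__init__()
        self.head = nn.Conv2d(3, nf, 3, padding=1)
        self.blocks = nn.Sequential(*[ResBlock(nf) for _ in range(n_blocks)])
        self.upsample = nn.Sequential(
            nn.Conv2d(nf, 3 * scale ** 2, 3, padding=1),  # scale^2 values per pixel
            nn.PixelShuffle(scale),                       # rearrange them spatially
        )

    def forward(self, x):
        return self.upsample(self.blocks(self.head(x)))


model = SRResNet(scale=2)
out = model(torch.rand(1, 3, 32, 32))
print(out.shape)
```

The input keeps its size through the head and all residual blocks; only the final block changes the resolution.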
Finally, we just create a learner object that joins everything related to the model, data, loss and optimization function under one object.
After training it for a little bit.
As you can see, the only thing our model learnt with the per-pixel loss is how to smooth a rather blocky image. That’s partly because we are asking it to minimize the mean squared error of the pixels, so it averages over the plausible pixel values to approximate the ground truth and is incapable of learning fine details. Now let’s see what a perceptual loss has to offer.
Some work has shown that high-quality images can be generated using perceptual loss functions based not on differences between low-level pixel information but on differences between high-level image feature representations (intermediate activations) extracted from a pretrained convolutional neural network. This perceptual loss measures image similarities more robustly than per-pixel losses, and at test time the transformation networks run in real time.
In plain English, a perceptual loss works like this: we pass both the image our model produced and the ground-truth image through a frozen pretrained network without its classification head, and extract the intermediate activations (or features; they mean the same thing) from different blocks of that network. We then compare these activations using a loss function like L1 or L2, and use the loss value to update the weights of the Generator network. We are sticking with L1 loss.
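To make the averaging effect of pixel loss concrete, here is a toy sketch with made-up 1-D "images". Suppose the same low-resolution input could have come from either of two sharp edges; the prediction that minimizes the expected squared error is their blurry average.

```python
import torch

# Two equally plausible sharp ground truths for the same low-res input.
edge_a = torch.tensor([0., 0., 1., 1.])
edge_b = torch.tensor([0., 1., 1., 1.])


def expected_mse(pred):
    """Average MSE over the two equally likely targets."""
    return 0.5 * ((pred - edge_a) ** 2).mean() + 0.5 * ((pred - edge_b) ** 2).mean()


blurry = (edge_a + edge_b) / 2            # [0.0, 0.5, 1.0, 1.0]

print(expected_mse(blurry).item())        # lower error ...
print(expected_mse(edge_a).item())        # ... than committing to a sharp edge
```

Committing to either sharp edge costs more on average than predicting the smooth in-between, so a model trained only on this loss is rewarded for blur.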
The way we extract activations from a model in PyTorch is by using hooks.
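A minimal sketch of that idea, using PyTorch’s `register_forward_hook`. The `FeatureLoss` class and the `layer_ids` parameter are hypothetical names, and a small randomly initialized network stands in for the pretrained one so the example runs without downloading weights; in practice you would pass the convolutional part of a pretrained network such as VGG.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureLoss(nn.Module):
    """L1 distance between activations captured by forward hooks."""

    def __init__(self, feat_net, layer_ids):
        super().__init__()
        self.feat_net = feat_net.eval()
        for p in self.feat_net.parameters():
            p.requires_grad_(False)           # the feature network stays frozen
        self.feats = []
        self.hooks = [feat_net[i].register_forward_hook(self._save)
                      for i in layer_ids]

    def _save(self, module, inputs, output):
        self.feats.append(output)             # called on every forward pass

    def _features(self, x):
        self.feats = []                       # fresh list for this pass
        self.feat_net(x)
        return self.feats

    def forward(self, pred, target):
        pred_feats = self._features(pred)
        with torch.no_grad():                 # no gradients needed for the target
            target_feats = self._features(target)
        return sum(F.l1_loss(p, t) for p, t in zip(pred_feats, target_feats))


torch.manual_seed(0)
feat_net = nn.Sequential(                     # stand-in for a pretrained network
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
)
loss_fn = FeatureLoss(feat_net, layer_ids=[1, 4])   # hook the two ReLU outputs

pred = torch.rand(2, 3, 32, 32, requires_grad=True)
target = torch.rand(2, 3, 32, 32)
loss = loss_fn(pred, target)
loss.backward()                               # gradients reach the prediction
print(loss.item())
```

The gradient flows through the frozen network back to `pred`, which in a real setup is the Generator’s output, so only the Generator’s weights get updated.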
We pre-train our model with pixel loss at 2x scale, then fine-tune our 4x-scale model using progressive resizing and rescaling.
Bag of tricks
The following are tips and tricks that I tested and that improved the results in terms of performance and quality.
Weighted Residual Connections(WRC)
There is an entire subfield about how to help models converge faster (super-convergence). One tip I picked up from the fastai 18 course is that you can weight the residual connection: a factor less than 1 helps stabilize training. You can train a subnet to learn these weights, or you can set them manually.
“The weighted residual network is able to learn to combine residuals from different layers effectively and efficiently. The proposed models enjoy a consistent improvement over accuracy and convergence with increasing depths from 100+ layers to 1000+ layers” — Weighted Residuals for Very Deep Networks paper
Setting the weight to 1 is the same as a normal residual connection
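A possible sketch of a weighted residual block in PyTorch. `WeightedResBlock`, `res_scale` and `learnable` are hypothetical names, the default of 0.1 is just an illustrative value, and the learnable variant here is a single scalar parameter rather than a full subnet.

```python
import torch
import torch.nn as nn


class WeightedResBlock(nn.Module):
    """Residual block whose residual branch is multiplied by a weight."""

    def __init__(self, nf, res_scale=0.1, learnable=False):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(nf, nf, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(nf, nf, 3, padding=1),
        )
        scale = torch.tensor(float(res_scale))
        if learnable:
            self.res_scale = nn.Parameter(scale)      # trained with the model
        else:
            self.register_buffer("res_scale", scale)  # fixed, set manually

    def forward(self, x):
        # res_scale = 1 gives back an ordinary residual connection
        return x + self.res_scale * self.body(x)


x = torch.rand(1, 16, 8, 8)
block = WeightedResBlock(16, res_scale=0.1)
out = block(x)
print(out.shape)
```

With a small weight, each block starts out close to the identity, which is one intuition for why training is more stable.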
PixelShuffle and ICNR
PixelShuffle and ICNR are the current best-practice techniques for eliminating checkerboard artifacts in fully convolutional architectures.
ICNR was introduced in the Checkerboard artifact free sub-pixel convolution paper.
PixelShuffle, or sub-pixel convolution, is a specific implementation of a deconvolution layer that can be interpreted as a standard convolution in low-resolution space followed by a periodic shuffling operation, as shown in the figure above. Sub-pixel convolution has the advantage over standard resize convolutions that, at the same computational complexity, it has more parameters and thus better modelling power. Sub-pixel convolution is constrained to not allow deconvolution overlap, which is an improvement over transposed convolutions. However, it suffers from checkerboard artifacts following random initialization, and that is where ICNR comes into the picture.
- ICNR (Initialization to Convolution NN Resize)
In this weight initialization scheme, we initialize the convolution before the shuffle so that all the filters feeding the same block of output sub-pixels share the same weights. Right after initialization, the layer therefore behaves like a nearest-neighbour (NN) resize, which eliminates the checkerboard artefacts caused by random initialization and produces clean outputs.
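Here is a small sketch of ICNR for a convolution that feeds `nn.PixelShuffle`. The helper name `icnr_` is hypothetical, Kaiming initialization of the shared sub-kernel is an assumed choice, and the bias is zeroed so that the check at the end holds exactly.

```python
import torch
import torch.nn as nn


def icnr_(weight, scale=2, init=nn.init.kaiming_normal_):
    """Fill a conv weight so every group of scale^2 filters is identical."""
    out_ch, in_ch, kh, kw = weight.shape
    assert out_ch % (scale ** 2) == 0
    sub = torch.zeros(out_ch // scale ** 2, in_ch, kh, kw)
    init(sub)                                   # one random kernel per group
    with torch.no_grad():
        # PixelShuffle reads consecutive groups of scale^2 channels,
        # so repeat each kernel scale^2 times along the channel axis.
        weight.copy_(sub.repeat_interleave(scale ** 2, dim=0))
    return weight


scale = 2
conv = nn.Conv2d(8, 3 * scale ** 2, 3, padding=1)
icnr_(conv.weight, scale)
nn.init.zeros_(conv.bias)

x = torch.rand(1, 8, 5, 5)
up = nn.PixelShuffle(scale)(conv(x))            # shape (1, 3, 10, 10)

# Nearest-neighbour upsampling of one value per block, for comparison.
nearest = (up[..., ::scale, ::scale]
           .repeat_interleave(scale, dim=-2)
           .repeat_interleave(scale, dim=-1))
print(torch.allclose(up, nearest, atol=1e-6))
```

At initialization every 2x2 block of the output holds a single value, so the layer starts as a nearest-neighbour resize and there is no checkerboard pattern for training to undo.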
Expanded Loss Function
We can expand our perceptual loss function and make it more powerful by adding the following terms:
- Per-Pixel loss
- Gram loss
- Total Variation
This produces really amazing results.
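The Gram and Total Variation terms can be sketched as follows. The function names, the Gram normalization and the weights `w_pixel`, `w_gram` and `w_tv` are assumptions for illustration, and random tensors stand in for real activations and images; in the full loss these terms are added to the feature loss.

```python
import torch
import torch.nn.functional as F


def gram_matrix(feat):
    """Channel-by-channel correlations of an activation map."""
    n, c, h, w = feat.shape
    flat = feat.reshape(n, c, h * w)
    return flat @ flat.transpose(1, 2) / (c * h * w)


def gram_loss(pred_feat, target_feat):
    return F.l1_loss(gram_matrix(pred_feat), gram_matrix(target_feat))


def total_variation(img):
    """Mean absolute difference between neighbouring pixels."""
    dh = (img[..., 1:, :] - img[..., :-1, :]).abs().mean()
    dw = (img[..., :, 1:] - img[..., :, :-1]).abs().mean()
    return dh + dw


# Stand-ins for a prediction, a target and one pair of activations.
pred, target = torch.rand(2, 3, 32, 32), torch.rand(2, 3, 32, 32)
feat_p, feat_t = torch.rand(2, 16, 8, 8), torch.rand(2, 16, 8, 8)

w_pixel, w_gram, w_tv = 1.0, 5e3, 1e-4          # hypothetical weights
extra = (w_pixel * F.l1_loss(pred, target)
         + w_gram * gram_loss(feat_p, feat_t)
         + w_tv * total_variation(pred))
print(extra.item())
```

The Gram term compares texture statistics rather than exact positions, while Total Variation penalizes noisy outputs; how much each should count is something to tune.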
You can find the notebook with the normal loss here:
And expanded loss here:
The UNet architecture consists of two paths: a downsampling path (left side) and an upsampling path (right side), with the former passing its activations across to the latter.
The main improvement is the U-shaped architecture: to produce better results, the high-resolution features from the downsampling path are combined (concatenated) with the equivalent upsampled output block, so that a successive convolution layer can learn to assemble a more precise and detailed output based on this information.
This architecture is famous in the image segmentation field because it can produce fine segmentation masks, so it can work just as well in other image-to-image translation tasks such as Image Super-Resolution.
Finally, we replace our previous Generator net with a UNet and couple it with our expanded perceptual loss, and boom: we get state-of-the-art results or, if not, something pretty close.
There are 3 key takeaways:
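A deliberately tiny sketch of the cross connection in PyTorch, with a single downsampling step and a single upsampling step. `TinyUNet` is a hypothetical name, and a real UNet has several levels and usually a pretrained encoder.

```python
import torch
import torch.nn as nn


class TinyUNet(nn.Module):
    """One level down, one level up, with a concatenated skip connection."""

    def __init__(self, nf=16):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, nf, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.enc2 = nn.Sequential(nn.Conv2d(nf, nf * 2, 3, padding=1), nn.ReLU())
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        # The decoder conv sees upsampled features plus the skipped ones.
        self.dec1 = nn.Sequential(nn.Conv2d(nf * 2 + nf, nf, 3, padding=1), nn.ReLU())
        self.out = nn.Conv2d(nf, 3, 1)

    def forward(self, x):
        skip = self.enc1(x)                        # high-resolution features
        bottom = self.enc2(self.down(skip))        # half-resolution features
        merged = torch.cat([self.up(bottom), skip], dim=1)
        return self.out(self.dec1(merged))


model = TinyUNet()
out = model(torch.rand(1, 3, 32, 32))
print(out.shape)
```

The convolution after `torch.cat` is the "successive convolution layer" from the text: it gets both the coarse upsampled context and the fine detail that downsampling would otherwise have lost.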
- When doing Image Super-Resolution it’s much better to have our loss function compare the activations from various blocks of a pretrained network because it captures finer details and helps the model produce better results instead of comparing pixels directly.
- You can experiment by combining various techniques from multiple papers and produce results not seen before.
- UNet is a versatile architecture; it has improved image segmentation, object detection and many other fields.