Source: Deep Learning on Medium
This article describes the techniques used to train a deep learning model for image improvement, image restoration, inpainting and super resolution. It utilises many techniques taught in the Fastai course and makes use of the Fastai software library. This method of training a model is based upon methods and research by very talented AI researchers; I’ve credited them where I have been able to in the information and techniques.
As far as I’m aware, some of the techniques I’ve applied to the training data are unique at this point with these learning methods (as of February 2019), and only a handful of researchers are using all these techniques together, most of whom are likely to be Fastai researchers/students.
Super resolution is the process of upscaling and/or improving the details within an image. Often a low resolution image is taken as an input and upscaled to a higher resolution output. The details in the high resolution output are filled in where they are essentially unknown.
Super resolution is essentially what you see in films and series like CSI where someone zooms into an image and it improves in quality and the details just appear.
I first heard about ‘AI super resolution’ in early 2018 on the excellent YouTube channel Two Minute Papers, which features fantastic short reviews of the latest AI papers (often longer than two minutes). At the time it seemed like magic and I couldn’t understand how it was possible, definitely living up to the Arthur C. Clarke quote “any sufficiently advanced technology is indistinguishable from magic”. Little did I think that less than a year on I would be training my own super resolution model and writing about it.
This is part of a series of articles I am writing as part of my ongoing learning and research in Artificial Intelligence and Machine Learning. I’m a software engineer and analyst for my day job aspiring to be an AI researcher and Data Scientist.
I’ve written this in part to reinforce my own knowledge and understanding; hopefully it will also be of help and interest to others. I’ve tried to keep the majority of this in as much plain English as possible so that hopefully it will make sense to anyone with a familiarity with machine learning, with some more in depth technical details and links to associated research. These topics and techniques have been quite challenging to understand and it’s taken me many months to experiment and write this. If you don’t agree with what I’ve written or think it’s just wrong, please do contact me as it’s a continual learning process and I would appreciate feedback.
Below is an example of a low resolution image with super resolution performed upon it to improve it:
The problem that deep learning based super resolution is trying to solve is that traditional algorithmic upscaling methods lack fine detail and cannot remove defects and compression artifacts. For humans who carry out these tasks manually it is a very slow and painstaking process.
The benefit is gaining a higher quality image where that quality never existed or has been lost. This could be valuable in many areas, or even life saving in medical applications.
Another use case is for compression in transfer between computer networks. Imagine if you only had to send a 256×256 pixel image where a 1024×1024 pixel image is needed.
In the set of images below there are five images:
- The lower resolution input image to be upscaled
- The input image upscaled by nearest neighbour interpolation
- The input image upscaled by bi-linear interpolation, which is what your Internet browser would typically use
- The input image upscaled and improved by this model’s prediction
- The target image or ground truth, which was downscaled to create the lower resolution input.
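As a point of reference for the interpolation baselines above, here is a minimal sketch of nearest-neighbour upscaling in plain Python (illustrative only; real pipelines use optimised library routines):

```python
def upscale_nearest(img, scale=2):
    """Nearest-neighbour upscaling: each source pixel simply becomes a
    scale x scale block. No new detail is created, which is why the
    result looks blocky compared to a model's prediction.
    img is a list of rows, each row a list of pixel values."""
    out = []
    for row in img:
        wide_row = [p for p in row for _ in range(scale)]   # repeat across
        out.extend([list(wide_row) for _ in range(scale)])  # repeat down
    return out

upscale_nearest([[1, 2], [3, 4]])
# -> [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
```

Bi-linear interpolation instead fills new pixels with weighted averages of their neighbours, which is smoother but still cannot invent detail.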
The objective is to improve the low resolution image to be as good as (or better than) the target, known as the ground truth, which in this situation is the original image we downscaled into the low resolution image.
To accomplish this a mathematical function takes the low resolution image that lacks details and hallucinates the details and features onto it. In doing so the function finds detail potentially never recorded by the original camera.
This mathematical function is known as the model and the upscaled image is the model’s prediction.
There are potential ethical concerns with this, mentioned at the end of this article once the model and its training have been explained.
Image repair and inpainting
Models that are trained for super resolution should also be useful for repairing defects in an image (JPEG compression, tears, folds and other damage) as the model has a concept of what certain features should look like, for example materials, fur or even an eye.
Image inpainting is the process of retouching an image to remove unwanted elements, such as a wire fence. For training it is common to cut out sections of the image and train the model to replace the missing parts based on prior knowledge of what should be there. Image inpainting is usually a very slow process when carried out manually by a skilled human.
Super resolution and inpainting seem to be often regarded as separate and different tasks. However, if a mathematical function can be trained to create additional detail that’s not in an image, then it should be capable of repairing defects and gaps in the image as well. This assumes those defects and gaps exist in the training data for their restoration to be learnt by the model.
GANs for Super resolution
Most deep learning based super resolution models are trained using Generative Adversarial Networks (GANs).
One of the limitations of GANs is that they are effectively a lazy approach as their loss function, the critic, is trained as part of the process and not specifically engineered for this purpose. This could be one of the reasons many models are only good at super resolution and not image repair.
Many deep learning super resolution methods can’t be applied universally to all types of image and almost all have their weaknesses. For example a model trained for the super resolution of animals may not be good for the super resolution of human faces.
The model trained with the methods detailed in this article seemed to perform well across varied datasets including human features, indicating that a universal model effective at upscaling any category of image may be possible.
Examples of X2 super resolution
Following are ten examples of X2 super resolution (doubling the image size) from the same model trained on the Div2K dataset, 800 high resolution images of a variety of subject matter categories.
Example one from a model trained on varied categories of image. During early training I had found that images with humans in them improved the least and took on a more artistic smoothing effect. However this version of the model, trained on a generic category dataset, has managed to improve this image well: look closely at the added detail in the face, the hair, the folds of the clothes and all of the background.
Example two from a model trained on varied categories of image. The model has added detail to the trees, the roof and the building windows. Again impressive results.
Example three from a model trained on varied categories of image. During training models on different datasets, I had found human faces to have the least pleasing results; however the model here, trained on varied categories of images, has managed to improve the details in the face. Look at the detail added to the hair: this is very impressive.
Example four from a model trained on varied categories of image. The detail added to the pick-axes, the ice, the folds in the jacket and the helmet are impressive here:
Example five from a model trained on varied categories of image. The improvement of the flowers is really impressive here, as is the detail on the bird’s feathers and wings:
Example six from a model trained on varied categories of image. The model has managed to add detail to the people’s hands, the food, the floor and all the objects. This is really impressive:
Example seven from a model trained on varied categories of image. The model has brought the fur into focus and kept the background blurred:
Example eight from a model trained on varied categories of image. The model has done well to sharpen up the lines between the windows:
Example nine from a model trained on varied categories of image. The detail of the fur really seems to have been imagined by the model.
Example ten from a model trained on varied categories of image. The sharpening around the lines of the structure and the lights is really impressive.
This model’s predictions after performing super resolution
All the images above were improvements made on validation image sets during or at the end of training.
The trained model has been used to create upscaled images of up to 1.7 megapixels; these are a few of the best examples:
In this first example a 256 pixel square image saved at high JPEG quality (95) is inputted into the model, which upscales it to a 1024 pixel square image, performing X4 super resolution:
The image sets above don’t necessarily do the prediction justice; view the full size PDF on my public Google Drive folder:
In the next example a 512 pixel image saved at low JPEG quality (30) is inputted into the model, which upscales it to a 1024 pixel square image, performing X2 super resolution on a lower quality source image. Here the model’s prediction, I believe, looks better than the target ground truth image, which is amazing:
The image sets above don’t necessarily do the prediction justice; view the full size PDF on my public Google Drive folder:
In very basic terms this model:
- Takes in an image as an input
- Passes it through a trained mathematical function which is a type of neural network
- Outputs an image of the same size or larger that is an improvement over the input.
This builds on the techniques suggested in the Fastai course by Jeremy Howard and Rachel Thomas. It uses the Fastai software library, the PyTorch deep learning platform and the CUDA parallel computation API.
The Fastai software library breaks down a lot of barriers to getting started with complex deep learning. As it is open source it’s easy to customise and replace elements of your architecture to suit your prediction tasks, if needed. This image generator model is built on top of the Fastai U-Net learner.
This method uses the following, each of which is explained further below:
- A U-Net architecture with cross connections similar to a DenseNet
- An encoder based on ResNet-34, with a corresponding decoder
- Pixel Shuffle upscaling with ICNR initialisation
- Transfer learning from pretrained ImageNet models
- A loss function based on activations from a VGG-16 model, pixel loss and gram matrix loss
- Learning rate annealing
- Progressive resizing
This model, or mathematical function, has over 40 million parameters or coefficients allowing it to attempt to perform super resolution.
Residual Networks (ResNet)
ResNet is a Convolutional Neural Network (CNN) architecture made up of a series of residual blocks (ResBlocks), described below, with the skip connections differentiating ResNets from other CNNs.
When first devised, ResNet won that year’s ImageNet competition by a significant margin as it addressed the vanishing gradient problem, where as more layers are added training slows and accuracy doesn’t improve or even gets worse. It is the network’s skip connections that accomplish this feat.
These are shown in the diagram below and explained in more detail as each ResBlock within the ResNet is described.
Residual blocks (ResBlocks) and dense blocks
Convolutional networks can be substantially deeper, more accurate, and more efficient to train if they contain shorter connections between layers close to the input and those close to the output.
If you visualise the loss surface (the search space for the varying loss of the model’s prediction), it looks like a series of hills and valleys, as the left hand image in the diagram below shows. The lowest loss is the lowest point. Research has shown that a smaller optimal network can be missed by the training process even if it is an exact part of a bigger network, because the loss surface is too hard to navigate. This means that adding layers to the model can make its predictions worse.
A solution that’s been very effective is to add cross connections between layers of the network allowing large sections to be skipped if needed. This creates a loss surface that looks like the image on the right. This is much easier for the model to be trained with optimal weights to reduce the loss.
Each ResBlock has two connections from its input, one going through a series of convolutions, batch normalisation and linear functions and the other skipping over that series of convolutions and functions. These are known as identity, cross or skip connections. The tensor outputs of both connections are added together.
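As an illustration, a minimal ResBlock might be sketched in PyTorch like this (a simplified version of the idea; the real ResNet blocks also handle strides and channel changes):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """Two 3x3 convolutions with batch normalisation; the input is
    added back to the output via the identity/skip connection."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(ch)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(ch)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)  # tensor addition of the skip connection

x = torch.randn(2, 8, 16, 16)
print(ResBlock(8)(x).shape)  # torch.Size([2, 8, 16, 16])
```

Because the block can learn to output near-zero and let the skip connection pass the input through, stacking many of them does not make training harder.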
Densely Connected Convolutional Networks and DenseBlocks
Where a ResBlock provides an output that is a tensor addition, this can be changed to tensor concatenation. With each cross/skip connection the network becomes more dense; the ResBlock then becomes a DenseBlock and the network becomes a DenseNet.
This allows the computation to skip over larger and larger parts of the architecture.
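A minimal DenseBlock might be sketched as follows (my own simplified illustration; real DenseNets also add bottleneck and transition layers):

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Like a ResBlock, but the skip connection concatenates rather
    than adds, so the channel count grows with every block."""
    def __init__(self, ch_in, ch_out):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(ch_in, ch_out, 3, padding=1, bias=False),
            nn.BatchNorm2d(ch_out),
            nn.ReLU())

    def forward(self, x):
        # output channels: ch_in + ch_out (the network gets denser)
        return torch.cat([x, self.convs(x)], dim=1)

x = torch.randn(2, 8, 16, 16)
print(DenseBlock(8, 16)(x).shape)  # torch.Size([2, 24, 16, 16])
```

The growing channel dimension is what makes DenseBlocks memory hungry, as the next paragraph notes.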
Due to the concatenation, DenseBlocks consume a lot of memory compared to other architectures, but are very well suited to smaller datasets.
A U-Net is a convolutional neural network architecture that was developed for biomedical image segmentation. U-Nets have been found to be very effective for tasks where the output is of a similar size to the input and needs that amount of spatial resolution. This makes them very good for creating segmentation masks and for image processing/generation such as super resolution.
When convolutional neural networks are used with images for classification, the image is taken and downsampled into one or more classifications using a series of stride-two convolutions, reducing the grid size each time.
To be able to output a generated image of the same size as the input, or larger, there needs to be an upsampling path to increase the grid size. This makes the network layout resemble a U shape, hence U-Net: the downsampling/encoder path forms the left hand side of the U and the upsampling/decoder path forms the right hand part of the U.
For the upsampling/decoder path, several transposed convolutions accomplish this, each adding pixels between and around the existing pixels. Essentially the reverse of the downsampling path is carried out. The options for the upsampling algorithms are discussed further on.
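For example, a stride-two transposed convolution doubles the grid size, as this minimal sketch with PyTorch's `nn.ConvTranspose2d` shows:

```python
import torch
import torch.nn as nn

# a stride-2 transposed convolution in the decoder path doubles the grid
up = nn.ConvTranspose2d(in_channels=64, out_channels=32,
                        kernel_size=2, stride=2)

x = torch.randn(1, 64, 16, 16)   # a 16x16 feature map
print(up(x).shape)               # torch.Size([1, 32, 32, 32])
```

Each such layer halves the distance back to the input resolution, mirroring the stride-two convolutions of the encoder.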
Note that this model’s U-Net based architecture also has cross connections, detailed further on; these weren’t part of the original U-Net architecture.
The original research is available here: https://arxiv.org/abs/1505.04597
Upsampling/ transposed convolutions
Each upsample in the decoder/upsampling part of the network (right hand part of the U) needs to add pixels around the existing pixels and also in-between the existing pixels to eventually reach the desired resolution.
This process can be visualised as below from the paper “A guide to convolution arithmetic for deep learning” where zeros are added between the pixels. The blue pixels are the original 2×2 pixels being expanded to 5×5 pixels. 2 pixels of padding around the outside are added and also a pixel between each pixel. In this example all new pixels are zeros (white).
This could be improved with some simple initialisation of the new pixels using a weighted average of their neighbours (bi-linear interpolation), as otherwise the model’s learning is made unnecessarily harder.
Instead, to initially expand the image to the size of the output, this model uses an improved method known as pixel shuffle, or sub-pixel convolution, with ICNR initialisation, which fills the gaps between the pixels much more effectively. This is described in the paper “Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network”.
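A sketch of pixel shuffle with ICNR initialisation (my own simplified implementation of the idea, not the Fastai library code): the convolution weights are initialised so that every group of scale² output channels is identical, meaning the shuffled output starts out as a nearest-neighbour upscale with no checkerboard artifacts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def icnr_init(weight, scale=2):
    """ICNR: initialise one sub-kernel, then repeat it scale**2 times so
    that pixel shuffle initially behaves like nearest-neighbour upsampling."""
    out_ch, in_ch, kh, kw = weight.shape
    sub = torch.empty(out_ch // scale ** 2, in_ch, kh, kw)
    nn.init.kaiming_normal_(sub)
    weight.data.copy_(sub.repeat_interleave(scale ** 2, dim=0))

scale = 2
conv = nn.Conv2d(8, 4 * scale ** 2, 3, padding=1, bias=False)
icnr_init(conv.weight, scale)

x = torch.randn(1, 8, 16, 16)
out = F.pixel_shuffle(conv(x), scale)  # shape (1, 4, 32, 32)
# every 2x2 block in the shuffled output starts out constant
print(torch.allclose(out[..., ::2, ::2], out[..., 1::2, 1::2], atol=1e-5))
```

Subsequent training then moves the repeated kernels apart so real sub-pixel detail can be learnt.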
After the representations for these new pixels are added, the subsequent convolutions improve the detail within them as the path continues through the decoder side of the network.
U-Nets and fine image detail
When using only a U-Net architecture the predictions tend to lack fine detail; to help address this, cross or skip connections can be added between blocks of the network.
Rather than adding a skip connection every two convolutions as in a ResBlock, the skip connections cross from the same-sized part of the downsampling path to the upsampling path. These are the grey lines shown in the diagram above.
The original pixels are concatenated with the final ResBlock via a skip connection, allowing the final computation to take place with awareness of the original pixels inputted into the model. As a result, all of the fine details of the input image are at the top of the U-Net, with the input mapped almost directly to the output.
The outputs of the U-Net blocks are concatenated, making them more similar to DenseBlocks than ResBlocks. However, there are stride-two convolutions that reduce the grid size back down, which also helps to keep memory usage from growing too large.
ResNet-34 is a 34 layer ResNet architecture; this is used as the encoder in the downsampling section of the U-Net (the left half of the U).
The Fastai U-Net learner, when provided with an encoder architecture, will automatically construct the decoder side of the U-Net architecture, in this case transforming the ResNet-34 encoder into a U-Net with cross connections.
For the model to perform super resolution, it vastly speeds up training to use a pretrained model, so that the model has starting knowledge of the kinds of features that need to be detected and improved. Using a model and weights that have been pretrained on ImageNet is almost ideal. The pretrained ResNet-34 for PyTorch is available from Kaggle: https://www.kaggle.com/pytorch/resnet34
The loss function is based upon the research in the paper “Perceptual Losses for Real-Time Style Transfer and Super-Resolution” and the improvements shown in the Fastai course (v3).
This paper focuses on feature losses (called perceptual loss in the paper). The research did not use a U-Net architecture, as U-Nets were not yet well known in the machine learning community at that time.
The model used here is trained with a loss function similar to the paper’s, using VGG-16 but also combined with pixel mean squared error loss and gram matrix loss. This has been found to be very effective by the Fastai team.
VGG is another CNN architecture devised in 2014, the 16 layer version is utilised in the loss function for training this model.
The VGG model, a network pretrained on ImageNet, is used to evaluate the generator model’s loss. Normally it would be used as a classifier to tell you what the image is: for example, is this a person, a dog or a cat?
The head of the VGG model is ignored and the loss function uses the intermediate activations in the backbone of the network, which represent the feature detections. The head and backbone of networks are described a little more in the training section further on.
Those activations can be found by looking through the VGG model to find all the max pooling layers. These are where the grid size changes and features are detected.
Heatmaps visualising the activations for varied images can be seen in the image below. This shows examples of varied features detected in the different layers of network.
The training of this super resolution model uses the loss function based on the VGG model’s activations. The loss function remains fixed throughout the training unlike the critic part of a GAN.
The feature map at this layer has 256 channels of 28 by 28, used to detect features such as fur, an eyeball, wings or a type of material, among many others. The activations at the same layer for the (target) original image and the generated image are compared using mean squared error or least absolute error (L1); these are the feature losses. This loss function uses L1 error.
This allows the loss function to know what features are in the target ground truth image and to evaluate how well the model’s prediction’s features match these rather than only comparing pixel difference.
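Putting those pieces together, the combined loss might be sketched like this (a simplified version of the idea; the weighting factors here are illustrative assumptions, not the values used in training):

```python
import torch
import torch.nn.functional as F

def gram_matrix(feats):
    """Channel-by-channel correlations of a feature map, capturing style
    and texture rather than exact spatial position."""
    b, c, h, w = feats.shape
    f = feats.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def combined_loss(pred, target, pred_feats, target_feats,
                  w_feat=0.01, w_gram=500.0):  # illustrative weights
    loss = F.mse_loss(pred, target)  # pixel loss
    for pf, tf in zip(pred_feats, target_feats):
        loss = loss + w_feat * F.l1_loss(pf, tf)  # feature loss
        loss = loss + w_gram * F.l1_loss(gram_matrix(pf), gram_matrix(tf))
    return loss

pred, target = torch.rand(1, 3, 32, 32), torch.rand(1, 3, 32, 32)
feats_p = [torch.rand(1, 64, 16, 16)]   # stand-ins for VGG activations
feats_t = [torch.rand(1, 64, 16, 16)]
loss = combined_loss(pred, target, feats_p, feats_t)
```

In training, `pred_feats` and `target_feats` would be the VGG activations of the prediction and the ground truth at each hooked layer.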
The training process begins with a model as described above: a U-Net based on the ResNet-34 architecture pretrained on ImageNet using a loss function based on the VGG-16 architecture pretrained on ImageNet combined with pixel loss and a gram matrix.
Fortunately, in most super resolution applications an almost infinite amount of training data can be created. If a set of high resolution images is acquired, these can be encoded/resized into smaller images, giving a training set of low resolution and high resolution image pairs. The prediction from our model can then be evaluated against the high resolution image.
The actions taken in this method of creating the training data are what the model learns to fit (reversing the process).
The training data can be further augmented by:
- Randomly reducing the quality of the image within bounds
- Taking random crops
- Flipping the image horizontally
- Randomly adding noise
- Randomly punching small holes into the image
- Randomly adding overlaid text or symbols
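Two of the degradations above (random noise and punched holes) can be sketched as a tensor transform (a hypothetical helper of my own for illustration; the actual pipeline used Fastai's transforms):

```python
import torch

def degrade(img, noise_std=0.05, n_holes=2, hole_size=8):
    """Randomly add Gaussian noise and punch small square holes.
    img: float tensor (channels, height, width) with values in [0, 1]."""
    out = (img + torch.randn_like(img) * noise_std).clamp(0, 1)
    _, h, w = out.shape
    for _ in range(n_holes):
        top = torch.randint(0, h - hole_size, (1,)).item()
        left = torch.randint(0, w - hole_size, (1,)).item()
        out[:, top:top + hole_size, left:left + hole_size] = 0  # punched hole
    return out

low_res = torch.ones(3, 64, 64)
damaged = degrade(low_res)
print(damaged.shape)  # torch.Size([3, 64, 64])
```

Applied with randomised parameters per image, each degradation becomes a problem the model must learn to reverse.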
Changing the quality reduction and noise randomly for each image improved the resulting model, allowing it to learn how to correct all of these different forms of image degradation.
Feature and quality improvement
The U-Net based model enhances the details and features in the upscaled image, generating an improved image through the function’s approximately 40 million parameters.
Training the head and the backbone of the model
The model’s architecture is split into three parts: the stem of the backbone, the main backbone and the head.
The head comprises the last layers of the model, which need the most adaptation when using a pretrained network for knowledge transfer, as the model needs to learn to do something different with its pretrained knowledge. The model, for example, will want to re-purpose the feature detection knowledge of its layers.
Freeze the backbone, train the head
The weights in the backbone of the network are frozen so that initially only the weights in the head are being trained. A learning rate finder is run for 100 iterations and plots a graph of loss against learning rate; a point around the steepest downward slope towards the minimum loss is selected as the maximum learning rate.
The fit one cycle policy is used to vary the learning rate and momentum, described in detail in Leslie Smith’s paper.
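As an aside, recent PyTorch has a built-in scheduler implementing this policy, `torch.optim.lr_scheduler.OneCycleLR` (the training here used Fastai's `fit_one_cycle`; this is an equivalent sketch): the learning rate warms up from a low value to the chosen maximum and then anneals back down.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # stand-in model
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
sched = torch.optim.lr_scheduler.OneCycleLR(opt, max_lr=0.1, total_steps=100)

lrs = []
for _ in range(100):
    opt.step()  # in real training: loss.backward() would precede this
    lrs.append(sched.get_last_lr()[0])
    sched.step()

# warm-up then annealing: starts low, peaks at max_lr, ends lower still
print(f"start {lrs[0]:.4f}, peak {max(lrs):.4f}, end {lrs[-1]:.6f}")
```

The low warm-up avoids early divergence while the high peak helps the model traverse the loss surface quickly before the final annealing settles it into a minimum.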
Unfreeze the backbone
The weights of the entire model are then unfrozen and the model is trained with much smaller learning rates, usually between one and two orders of magnitude less, and learning rate annealing is used to reduce the learning rate as the loss stops improving. This fine-tunes the model without risking the loss of much of the accuracy already gained.
It’s faster to train on larger numbers of smaller images initially and then scale up the network and training images. Upscaling and improving an image from 64px by 64px to 128px by 128px is a much easier task than performing the same operation on a larger image, and much quicker across a large dataset. This is called progressive resizing; it also helps the model to generalise better as it sees many more different images and is less likely to overfit.
This progressive resizing approach is based on excellent research from Nvidia with progressive GANs: https://research.nvidia.com/sites/default/files/pubs/2017-10_Progressive-Growing-of/karras2018iclr-paper.pdf . It was also the approach Fastai used to beat the Tech giants at training on ImageNet: https://www.fast.ai/2018/08/10/fastai-diu-imagenet/
The process is to train with small images in larger batches; once the loss decreases to an acceptable level, a new model is created that accepts larger images, transferring the learning from the model trained on smaller images.
PyTorch is very well suited for this as it allows the network to grow, with a new upscaling convolution initialised with ICNR to cope with the larger output. All the layers other than this new layer are frozen and training is then continued.
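The freezing step can be sketched as follows (a toy stand-in model of my own; in practice Fastai manages this when the learner is rebuilt for the larger size):

```python
import torch.nn as nn

# toy stand-in for the trained smaller-scale network
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1))

# freeze everything learnt so far
for p in model.parameters():
    p.requires_grad = False

# add a new upscaling convolution (to be followed by a PixelShuffle(2)),
# freshly initialised - only its weights will now train
model.add_module("new_up", nn.Conv2d(3, 3 * 4, 3, padding=1))

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # ['new_up.weight', 'new_up.bias']
```

An optimiser built from only the trainable parameters then updates just the new layer until it catches up, after which everything is unfrozen again.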
With each new model the learning rate is reduced slightly before starting training. As the training image size increases the batch size has to be decreased to avoid running out of memory, as each batch contains larger images with four times as many pixels each.
Note that the defects in the input image have been randomly added to improve the restorative properties of the model and to help it generalise better.
Examples from the validation set are shown here at some of the progressive sizes:
After one cycle of 10 epochs with the backbone weights frozen and another cycle of 10 epochs with them unfrozen, once the loss and image predictions became acceptable the image size used for training was doubled to 128px by 128px, with the predicted/generated images improved to 256px by 256px.
The same two cycles of training, first with the backbone weights frozen and then unfrozen, were then repeated with the training images doubled again to 256px by 256px and the predicted/generated images improved to 512px by 512px.
Continued training would improve the quality of super resolution on larger input images; however, the batch size has to keep shrinking to fit within memory constraints, training time increases, and the limits of my training infrastructure were reached.
All training was carried out on a Nvidia Tesla K80 GPU with 12GB RAM.
The above images from the progressive resizing section of training show how effective deep learning based super resolution is at improving detail, removing watermarks and defects, and inpainting missing details.
The next three image predictions, based on images from the Div2K dataset, all had super resolution performed on them by the same trained model, showing that a deep learning super resolution model might be universally applicable.
Note: these are from the actual Div2K training set, although that set was split into my own training and validation datasets and the model did not see these images during training. There are further examples from the actual Div2K validation set further on.
Left: 256 x 256 pixel input. Middle: 512 x 512 prediction from the model. Right: 512 x 512 pixel ground truth target. Looking at the vents on front of the train, the detail improvement is clear and very close to the ground truth target.
Left: 256 x 256 pixel input. Middle: 512 x 512 prediction from the model. Right: 512 x 512 pixel ground truth target. The feature improvement in the image prediction below here is quite amazing. During my early training attempts I had almost concluded that super resolution of human features would be too complex a task.
Left: 256 x 256 pixel input. Middle: 512 x 512 prediction from the model. Right: 512 x 512 pixel ground truth target. Notice how the white “Fire Exit” text and the paneling lines have been improved.
Super resolution on the Div2K validation set
Examples of super resolution from the official Div2K validation set. A PDF version is available here: https://drive.google.com/open?id=1ylselPp__emdYwIHpMlhw4fxjN_LybkQ
Super resolution on the Oxford 102 Flowers dataset
The super resolution results from a separate model trained on a dataset of flower images are, I think, quite outstanding; many of the model’s predictions actually look sharper than the ground truth, having truly performed super resolution upon the validation set (images not seen during training).
Super resolution on the Oxford-IIIT Pet dataset
The examples below, from a separate trained model upscaling low resolution images of dogs, are very impressive, again from the validation set: creating finer details of fur, sharpening eyes and noses and really improving the features in the images. Most of the upscaled images are close to being as good as the ground truth and certainly much better than the bilinearly upscaled images.
These results I believe are impressive; the model must have developed a ‘knowledge’ of what a group of pixels represented in the original subject of the photograph/image.
It knows that certain areas are blurred and knows to reconstruct a blurred background.
The model couldn’t do this if it hadn’t matched the feature activations of the loss function well. Effectively the model has reverse engineered what features would match those pixels in order to satisfy the activations in the loss function.
For a type of restoration to be learnt by the model, it must be present in the training data as a problem to solve. When holes were punched into the input images of a model that hadn’t been trained on them, the model had no idea what to do with them and left them unchanged.
The features, or at least similar ones, that need to be hallucinated onto the image must be present in the training set. If the model is trained on animals, then it’s not likely to perform well on a completely different dataset category such as room interiors or flowers.
The results of super resolution from models trained on close-up human faces weren’t particularly convincing, although some examples in the Div2K training set did see good improvements in features. Especially in X4 super resolution, although features are sharpened more than with nearest neighbour interpolation, they take on an almost drawn/artistic effect. For very low resolution images or those with a lot of compression artifacts this may still be preferable. This is an area I plan to continue to explore.
U-Net deep learning based super resolution trained using loss functions such as these can perform very well for super resolution including:
- Upscaling low resolution images to higher resolution images
- Improving the quality of an image while maintaining the resolution
- Removing watermarks
- Removing damage from images
- Removing JPEG and other compression artifacts
- Colourising greyscale images (another work in progress)
For a type of restoration to be learnt by the model, it must be present in the training data as a problem to solve. When holes were punched into the input images of a trained model, the model had no idea what to do with them and left them unchanged; whereas when punched holes were added to the training data, they were restored well by the trained model.
All of the examples of super resolution on images shown here were predictions from the models I have trained.
I plan to move the model into a production web application, then possibly into a mobile web application.
I will be publishing my source code and trained models once some refinements and a little refactoring have been made.
I am in the process of training on larger subsets of the ImageNet dataset, which contains many categories, to produce an effective universal super resolution model that performs well on any category of image. I am also training on greyscale versions of the same datasets used here, where the model colourises the images.
I plan to try model architectures such as ResNet-50 and also a ResNet backbone with an Inception stem.
By hallucinating details that aren’t really there, in uses such as security footage, aerial photography or similar, generating an image from a low resolution source might take it further from the original real subject matter.
Imagine if facial features were changed subtly, but enough for facial recognition to identify a person who wasn’t actually there, or an aerial photo changed just enough that a building is recognised by another algorithm as being something other than it is. Diverse training data should help avoid this, although as super resolution methods improve it remains a concern, as does the lack of diverse training data used historically in the machine learning research community.
Thank you to the Fastai team, without your courses and your software library I doubt I would have been able to carry out these experiments and learn about these techniques.