Original article was published on Deep Learning on Medium
Super Resolution Convolutional Neural Network- An Intuitive Guide
Extracting high resolution images from low resolution images is a classical problem in computer vision. The SRCNN paper, published in 2015, was a major improvement over pre-existing solutions because it was simple, efficient and provided an end-to-end solution. This blog post aims to give you an intuitive understanding of it.
This blogpost assumes that you have a basic understanding of how convolutional neural networks work.
Modelling the problem
Given a low resolution input image Y, the goal of the algorithm is to learn a function F such that F(Y) is as similar as possible to the ground truth high resolution image X. (In SRCNN, Y is first upscaled to the target size with bicubic interpolation, so the network only needs to restore detail, not change the image dimensions.)
Problem Solving Approach
Before SRCNN came about, a pre-existing method called sparse coding was used for image restoration. Its pipeline involved extracting overlapping patches from the image, mapping each patch into a higher resolution space, and then aggregating the resulting high resolution vectors to restore the image. This pipeline was complex and relied on involved mathematical techniques. The SRCNN authors were able to model the same pipeline cleanly in the form of a CNN.
Patch Extraction
This step involves cropping out overlapping patches from the image and converting each patch into a high dimensional vector for further processing. It can be thought of as a window sliding over each pixel of the image, with each window converted into a high dimensional vector.
Non Linear Mapping
This step maps each of the extracted high dimensional vectors non-linearly onto another high dimensional vector. Keep in mind that these latter vectors are the ones that will form part of the high resolution image.
Restoration
The above vectors are aggregated into a high resolution image, thus restoring the input image.
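The patch extraction step above can be illustrated directly: a convolution layer implicitly performs it, turning the 9 x 9 window around each pixel position into one feature vector per filter. A minimal NumPy sketch (the explicit loop is for illustration only; image and filter values here are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((21, 21))      # single-channel toy image
filters = rng.standard_normal((64, 9, 9))  # 64 filters of size 9 x 9

h, w = image.shape
out = np.zeros((h - 8, w - 8, 64))         # "valid" convolution output
for i in range(h - 8):
    for j in range(w - 8):
        patch = image[i:i + 9, j:j + 9]    # the sliding 9 x 9 window
        # each output position holds a 64-dim vector describing this patch
        out[i, j] = (filters * patch).sum(axis=(1, 2))

print(out.shape)  # (13, 13, 64): one 64-dim vector per patch position
```

In other words, "patch extraction plus projection to a high dimensional vector" and "convolution with many filters" are the same operation, which is what lets SRCNN fold the first stage of the sparse-coding pipeline into a single layer.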
- Patch Extraction: 64 filters of size 9 x 9 x 3 were used to perform the first phase of the pipeline, patch extraction. The first layer can be expressed as:
F1(Y) = W1 ∗ Y + B1
where Y is the image,
W1 corresponds to the 64 filters being used,
B1 corresponds to the biases being used.
- Non Linear Mapping: The feature maps from the first layer were passed through a ReLU activation. 32 filters of size 64 x 1 x 1 were then used to map these activations to a higher resolution space, with another ReLU applied to ensure the mapping was non-linear. It can be expressed as:
F2(Y) = max (0, W2 ∗ F1(Y) + B2)
where Y is the image,
W2 corresponds to the 32 filters being used,
B2 corresponds to the biases being used.
- Restoration: The high resolution maps were then passed through 3 filters (since the image is composed of 3 channels) of size 5 x 5 in order to aggregate the high resolution mappings from the previous layer. This mapping was linear and can be expressed as:
F(Y) = W3 ∗ F2(Y) + B3
where Y is the image,
W3 corresponds to the 3 filters being used,
B3 corresponds to the biases being used.
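Putting the three layers together gives a remarkably small network. A minimal PyTorch sketch of the 9-1-5 architecture described above (the padding is my addition to keep the output the same size as the input; the original paper used no padding, so its outputs were slightly smaller):

```python
import torch
import torch.nn as nn

class SRCNN(nn.Module):
    """Sketch of the three-layer SRCNN described above.

    The input is assumed to be a low resolution image already upscaled
    (e.g. by bicubic interpolation) to the target size.
    """

    def __init__(self):
        super().__init__()
        # Patch extraction: 64 filters of size 9 x 9 x 3 -> F1(Y) = W1 * Y + B1
        self.patch_extraction = nn.Conv2d(3, 64, kernel_size=9, padding=4)
        # Non-linear mapping: 32 filters of size 64 x 1 x 1 -> F2(Y)
        self.mapping = nn.Conv2d(64, 32, kernel_size=1)
        # Restoration: 3 filters of size 5 x 5 -> F(Y) = W3 * F2(Y) + B3 (linear)
        self.restoration = nn.Conv2d(32, 3, kernel_size=5, padding=2)
        self.relu = nn.ReLU()

    def forward(self, y):
        f1 = self.relu(self.patch_extraction(y))   # max(0, W1 * Y + B1)
        f2 = self.relu(self.mapping(f1))           # max(0, W2 * F1(Y) + B2)
        return self.restoration(f2)                # no activation: linear mapping

# A 3-channel 33 x 33 input (the paper trained on small sub-images)
y = torch.randn(1, 3, 33, 33)
x_hat = SRCNN()(y)
print(x_hat.shape)  # torch.Size([1, 3, 33, 33])
```

Note how each bullet above becomes exactly one `Conv2d` call, with the ReLUs supplying the non-linearity for the first two stages only.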
The input to the model was a standard low resolution image, and during training its target output was the equivalent high resolution image. Mean squared error was used as the loss to be minimised; the error in this case was simply the difference between the ground truth high resolution image and the generated high resolution image.
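Training therefore reduces to ordinary gradient descent on the MSE between the network output and the ground truth. A hedged sketch of one training loop (random tensors stand in for a real paired dataset; the layer stack mirrors the 9-1-5 architecture, and the learning rate is illustrative, not the paper's schedule):

```python
import torch
import torch.nn as nn

# Stand-in for the SRCNN layers (padding added to keep sizes equal)
model = nn.Sequential(
    nn.Conv2d(3, 64, 9, padding=4), nn.ReLU(),
    nn.Conv2d(64, 32, 1), nn.ReLU(),
    nn.Conv2d(32, 3, 5, padding=2),
)
loss_fn = nn.MSELoss()  # the loss minimised in the paper
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

low_res = torch.randn(8, 3, 33, 33)   # dummy bicubic-upscaled inputs Y
high_res = torch.randn(8, 3, 33, 33)  # dummy ground-truth targets X

for step in range(5):
    optimizer.zero_grad()
    predicted = model(low_res)
    loss = loss_fn(predicted, high_res)  # mean of (F(Y) - X)^2
    loss.backward()
    optimizer.step()
```

Because the whole pipeline is one differentiable network, this single loss trains all three stages jointly, which is exactly the end-to-end property the authors emphasise.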
As mentioned above, SRCNN condensed the complex algorithmic pipelines used earlier into a single convolutional network, thus providing an end-to-end solution to the problem.
A win for deep learning
The authors were able to demonstrate how deep learning could solve a classical computer vision problem, which was impactful at the time.
With a moderate number of filters, the network was able to achieve fast processing speeds even on CPUs.
The authors were also able to demonstrate that more diverse datasets and additional layers could improve the performance of the network.
On comparing SRCNN with state-of-the-art methods like SC (Sparse Coding), A+ (Adjusted Anchored Neighbourhood Regression) and ANR (Anchored Neighbourhood Regression) in terms of the PSNR (peak signal-to-noise ratio) metric, SRCNN turned out to be a clear winner. It rightly proved to be a simple, robust and accurate solution.

I hope this post provided a good insight into this algorithm. I highly encourage you to go through the original paper and to further research other solutions to this classical computer vision problem.
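As a final aside, the PSNR metric used in those comparisons is simple to compute from the MSE between the reconstruction and the ground truth. A small sketch for 8-bit images:

```python
import numpy as np

def psnr(ground_truth, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB for images with the given peak value."""
    mse = np.mean((ground_truth.astype(np.float64)
                   - reconstruction.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

# A reconstruction off by 1 at every pixel of an 8-bit image:
a = np.full((16, 16), 100, dtype=np.uint8)
b = np.full((16, 16), 101, dtype=np.uint8)
print(round(psnr(a, b), 2))  # 48.13
```

Higher PSNR means a reconstruction closer to the ground truth, which is why minimising MSE during training directly pushes this metric up.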
Link to the original paper and code : https://paperswithcode.com/paper/image-super-resolution-using-deep