Super Resolution explained — Ep1

Source: Deep Learning on Medium

Today we are surrounded by cameras and surveillance devices almost everywhere. From crystal-clear DSLR photographs to blurry microscope images, images come in a wide variety of sizes and qualities. Camera hardware is improving quickly to enable high resolution capture on every device. But in my opinion, it has yet to catch up with software, which can intelligently scale up the resolution of an image artificially.

That brings us to the need for image super resolution. We cannot deploy high resolution cameras everywhere. Devices like robots, remotely operated systems or even security cameras may not produce sharp images. Whenever we need to process poor quality images from such cameras, we might need to do a lot of things such as magnifying, de-noising, or de-blurring. However, getting an enlarged image that exactly matches what the scene should look like is hard.

Thus we are always in search of methods that produce an enlarged image as realistically as possible. While there exist many classical interpolation methods, with the advancements in Computer Vision and Deep Learning, neural networks have almost completely taken over from them. In a series of articles we will discuss some of the well-known methods of image super resolution (henceforth referred to as SR).

Before we start discussing the algorithms, let us get acquainted with the exact problem and the metrics.

The Problem

Suppose you have an image X of size (m × n × c) pixels. We want to see how it would look in reality after scaling it up k times. This means we want the output image Y to be (km × kn × c) pixels in size. The number of channels, c, remains the same. The task in SR is to develop a mapping F such that F(X) = Y.

However, the catch lies in the fact that we need to create more data from less: scaling by k along both height and width means every input pixel has to account for k² output pixels. Another problem is that lower resolution images are more susceptible to noise. Thus generating a perfect, generalized mapping is always difficult.
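To make the shapes concrete, here is a minimal NumPy sketch. It uses nearest-neighbour repetition as a deliberately naive stand-in for the mapping F (it is not a super resolution method), and the function name `naive_upscale` is made up for illustration.

```python
import numpy as np

def naive_upscale(X: np.ndarray, k: int) -> np.ndarray:
    """Repeat every pixel k times along height and width (nearest neighbour)."""
    # X has shape (m, n, c); the channel axis is left untouched.
    return np.repeat(np.repeat(X, k, axis=0), k, axis=1)

X = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3)  # m=2, n=3, c=3
Y = naive_upscale(X, k=4)
print(X.shape, "->", Y.shape)  # (2, 3, 3) -> (8, 12, 3)
print(Y.size // X.size)        # 16 output values per input value, i.e. k**2
```

Every output value here is a copy of an input value, so no new detail is created. An SR method has to replace those copies with plausible content.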

Consider a low resolution image X. Its corresponding desired output image is called the ground truth image. Let us denote it as Y_true.


The quality of our mapping is evaluated by the quality of its output. The closer the output of a mapping is to the desired output, the better the mapping.

For a given mapping F, let F(X) = Y_pred. Using the two images Y_true and Y_pred we define the metrics of evaluation.

Some metrics are used to measure the quality of an algorithm while others can also be used to guide the training such that maximizing or minimizing a metric is the final goal. Such a metric is referred to as the loss function. Each algorithm might use a different training setup with a variety of loss functions.

1. Mean Squared Error (MSE):

Mean Squared Error, or MSE, is the average of the squared error. For our images of size km × kn, for each pixel (i, j) we find the norm of the difference between Y_true and Y_pred across all channels. This is squared and averaged over the total number of pixels. That is, the MSE is found by the equation below.

$$\mathrm{MSE} = \frac{1}{km \cdot kn} \sum_{i=1}^{km} \sum_{j=1}^{kn} \left\lVert Y_{true}(i,j) - Y_{pred}(i,j) \right\rVert^{2}$$

It is commonly used as a loss function when training the networks. The mean squared error is the simplest measure of correspondence that takes the size of the image/data into account.
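As a rough sketch, the description above can be written in NumPy as follows. The function name `mse` and the toy arrays are illustrative assumptions. Note that many libraries also average over the channels, which only changes the value by a constant factor of c.

```python
import numpy as np

def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Squared channel-wise norm of the error, averaged over all pixels."""
    # Cast to float first so that uint8 subtraction cannot wrap around.
    diff = y_true.astype(np.float64) - y_pred.astype(np.float64)
    per_pixel = np.sum(diff ** 2, axis=-1)   # squared norm across channels
    return float(per_pixel.mean())           # average over the km x kn pixels

y_true = np.zeros((4, 4, 3), dtype=np.uint8)
y_pred = np.full((4, 4, 3), 2, dtype=np.uint8)
print(mse(y_true, y_pred))  # 12.0: each pixel contributes 2**2 over 3 channels
```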

2. Peak Signal to Noise Ratio (PSNR):

The peak signal to noise ratio, also referred to as PSNR, compares the peak (maximum possible) signal value with the power of the error in the image. It is directly related to MSE as:

$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{R^{2}}{\mathrm{MSE}}\right)$$

where R represents the maximum range of the data. For an image, which is usually 8 bit, the maximum value is 255 while the minimum is 0. The value of R is then their difference, that is 255.

PSNR is expressed in decibels (dB), hence the logarithm of the ratio. A higher PSNR means a lower error.
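A small sketch of the computation, assuming 8-bit data (R = 255) and an MSE averaged over every value, which is the common convention and coincides with the per-pixel definition above for single-channel images. The name `psnr` and the parameter `data_range` are illustrative.

```python
import numpy as np

def psnr(y_true: np.ndarray, y_pred: np.ndarray, data_range: float = 255.0) -> float:
    """PSNR in dB; data_range plays the role of R in the text."""
    diff = y_true.astype(np.float64) - y_pred.astype(np.float64)
    err = np.mean(diff ** 2)            # assumption: averaged over every value
    if err == 0:
        return float("inf")             # identical images: no noise at all
    return float(10.0 * np.log10(data_range ** 2 / err))

clean = np.full((8, 8), 100, dtype=np.uint8)
noisy = clean.copy()
noisy[0, 0] = 110                       # a single pixel is off by 10
print(round(psnr(clean, noisy), 2))
```

Because the MSE sits in the denominator, halving the error raises the PSNR by about 3 dB.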

3. Mean Opinion Score (MOS):

Mean Opinion Scores are basically user ratings of the same image as processed by different algorithms. Multiple users rate these images and the average rating is reported as the MOS. The rating is usually on a scale from 1 to 5, but other ranges can also be used.
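The averaging itself is simple, as the sketch below shows. The ratings and the method names are made up purely for illustration and do not come from any study.

```python
from statistics import mean

# Hypothetical 1-to-5 ratings given by five viewers to the same image
# as produced by two upscaling methods (made-up numbers).
ratings = {
    "method_a": [2, 3, 2, 3, 2],
    "method_b": [4, 4, 3, 5, 4],
}

mos = {method: mean(scores) for method, scores in ratings.items()}
print(mos)
```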

Some papers in recent years, like EnhanceNet, explained that PSNR is not a sufficient metric. Every pixel carries the same importance irrespective of its position, and numerical closeness does not necessarily translate into photo-realistic closeness. Hence MOS is used alongside it.

In this article we will discuss the first method, SRCNN. It was one of the first to formally use deep learning to solve the problem of SR.

Approach I: SRCNN

SRCNN, short for Super Resolution Convolutional Neural Network, was among the first few papers to use Deep Learning, especially CNNs, to solve the problem of SR. It was developed by a team from MMLAB at CUHK. The main paper is listed as reference [1].

Note: The notation in the paper interchanges X and Y with respect to our definition. Using X for the input is more intuitive, hence the change.


The method used in SRCNN has 3 sub-networks/layers. It does not perform any up-scaling inside the neural network. Instead, it uses classical interpolation to upscale the image and then improves the quality of the result with a CNN.

Prior to passing through the convolutional layers, the small image (x) is enlarged by the required factor using bicubic/spline interpolation. This image (X) is the input (still referred to as “Low Resolution”) and the “Ground Truth” high resolution image Y (also called Y_true) is the desired output, such that a mapping F(X) = Y is possible.

The figure below gives an overview of the network. It will be further explained later. Low resolution now refers to the interpolated image.

Figure: An overview of the neural network pipeline. The input is the interpolated image and the output is the high resolution image. Each layer of the CNN serves a very specific purpose [1].

Step I — Patch Extraction

From the large (interpolated) image, the first layer acts as a patch extractor such that each (c × f1 × f1) dimensional region of the image is mapped to a single 1 × 1 point in the n1-dimensional feature space. This is done by passing a convolutional layer (weights W1, bias B1) with n1 filters of kernel size f1 over the image. The mapping is activated with a Rectified Linear Unit (ReLU, max(0, x)). We refer to the entire mapping as F1(X). Mathematically the convolution is written as:

$$F_1(X) = \max(0,\; W_1 * X + B_1)$$

Here W1 has size (c × f1 × f1 × n1), B1 is an n1-dimensional vector, and * denotes convolution.

This gives us n1 different feature maps for the low resolution image, which can now be mapped further towards a finer quality image.
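The patch-to-point idea can be sketched with explicit loops in NumPy. This is a slow illustration rather than a practical implementation. The filter layout (c, f1, f1, n1) mirrors the convention this article uses for W3 and is an assumption, as are the function name and the random weights. The values f1 = 9 and n1 = 64 are the ones used later in the article.

```python
import numpy as np

def patch_extraction(X, W1, B1):
    """F1(X) = max(0, W1 * X + B1) with 'same' zero padding.

    X: (H, W, c) image, W1: (c, f1, f1, n1) filters, B1: (n1,) bias.
    As in most deep learning code, the kernel is not flipped.
    """
    c, f1, _, n1 = W1.shape
    H, W, _ = X.shape
    p = f1 // 2                                   # assumes an odd kernel size
    padded = np.pad(X, ((p, p), (p, p), (0, 0)))  # zero padding
    out = np.empty((H, W, n1))
    for i in range(H):
        for j in range(W):
            patch = padded[i:i + f1, j:j + f1, :]          # (f1, f1, c)
            # Each filter turns the whole patch into one number.
            out[i, j] = np.einsum("ijc,cijn->n", patch, W1) + B1
    return np.maximum(out, 0.0)                   # ReLU

rng = np.random.default_rng(0)
X = rng.random((12, 12, 1))                       # tiny single-channel image
W1 = rng.normal(scale=0.01, size=(1, 9, 9, 64))   # f1 = 9, n1 = 64
B1 = np.zeros(64)
F1 = patch_extraction(X, W1, B1)
print(F1.shape)  # (12, 12, 64)
```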

Step II — Non Linear Mapping

The first layer extracts an n1-dimensional feature for each patch. In this layer we map these features F1(X) to another feature map F2(F1(X)) using a convolutional layer with weights W2, bias B2 and kernel size f2. This is also activated using a ReLU. Mathematically,

$$F_2(F_1(X)) = \max(0,\; W_2 * F_1(X) + B_2)$$

Here W2 has size (n1 × f2 × f2 × n2), which is n1 × 1 × 1 × n2 when f2 = 1, and B2 is n2-dimensional. Each of the output n2-dimensional vectors is a representation of a high resolution patch, which is ready for reconstruction.
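With a 1 × 1 kernel this layer reduces to a matrix product at every pixel, which the following sketch shows. The function name and the random inputs are assumptions for illustration, and the shortcut is only valid for f2 = 1.

```python
import numpy as np

def nonlinear_mapping(F1, W2, B2):
    """F2 = max(0, W2 * F1 + B2) for the special case f2 = 1.

    F1: (H, W, n1) features, W2: (n1, 1, 1, n2) filters, B2: (n2,) bias.
    """
    n1, f2, _, n2 = W2.shape
    assert f2 == 1, "this shortcut only holds for 1 x 1 kernels"
    # A 1 x 1 convolution never looks at neighbouring pixels, so it is a
    # matrix product applied independently at every spatial position.
    return np.maximum(F1 @ W2.reshape(n1, n2) + B2, 0.0)

rng = np.random.default_rng(0)
F1 = rng.random((12, 12, 64))                     # n1 = 64
W2 = rng.normal(scale=0.01, size=(64, 1, 1, 32))  # n2 = 32
B2 = np.zeros(32)
F2 = nonlinear_mapping(F1, W2, B2)
print(F2.shape)  # (12, 12, 32)
```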

We can add more such layers to improve the accuracy, but the computation cost vs. accuracy trade-off may not be favorable.

Step III — Reconstruction

In this step we want to generate the final image. The reconstruction layer convolves over the feature map of the previous layer to change the depth to c, the number of channels. It uses weights W3 and bias B3, with the filter size set to f3. Thus:

$$F(X) = W_3 * F_2(F_1(X)) + B_3$$

Here W3 has size (n2 × f3 × f3 × c), and B3 is a c-dimensional vector. This brings the image back to c channels.

Note: In all convolution operations in this algorithm, the image is padded in the ‘same’ configuration, i.e. zeros are padded around the image such that the convolution output dimensions equal the input dimensions.

All such patches are then combined to make the final image. In regions where one pixel is part of more than one patch, the average of the values of that pixel across all such patches is taken as the final value.
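The averaging of overlapping patches can be sketched as below. The helper name, the patch positions and the toy values are hypothetical. The idea is to keep a running sum and a coverage count per pixel and divide at the end.

```python
import numpy as np

def average_overlapping_patches(patches, positions, image_shape):
    """Paste patches at (row, col) positions and average where they overlap."""
    total = np.zeros(image_shape, dtype=np.float64)   # running sum per pixel
    count = np.zeros(image_shape, dtype=np.float64)   # patches covering a pixel
    for patch, (r, c) in zip(patches, positions):
        h, w = patch.shape[:2]
        total[r:r + h, c:c + w] += patch
        count[r:r + h, c:c + w] += 1
    # Pixels that no patch covers are left at zero instead of dividing by 0.
    return np.divide(total, count, out=np.zeros_like(total), where=count > 0)

# Two 2 x 2 patches that share the middle column of a 2 x 3 image.
patches = [np.full((2, 2), 1.0), np.full((2, 2), 3.0)]
positions = [(0, 0), (0, 1)]
merged = average_overlapping_patches(patches, positions, (2, 3))
print(merged)
```

In this toy case the shared middle column ends up halfway between the two patch values.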


The above 3-layer network makes up SRCNN, with each layer driven by a different intuition. For training this network, we use the parameters n1 = 64, n2 = 32, f1 = 9, f2 = 1 and f3 = 5.

This choice is inspired by earlier sparse coding approaches. The values are however experimental and can be changed depending on the dataset or the convergence graphs of the loss function.
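Putting the three layers together with these parameters gives the forward pass sketched below in plain NumPy. The helper names, the weight layout (in channels, f, f, out channels) and the small random initialization are assumptions for the demo. A real implementation would use a deep learning framework and trained weights.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv_same(x, w, b):
    """'Same' zero-padded convolution. x: (H, W, cin), w: (cin, f, f, cout)."""
    f = w.shape[1]
    p = f // 2                                        # assumes odd f
    padded = np.pad(x, ((p, p), (p, p), (0, 0)))
    windows = sliding_window_view(padded, (f, f), axis=(0, 1))  # (H, W, cin, f, f)
    return np.einsum("hwcij,cijn->hwn", windows, w) + b

def init_params(c=1, n1=64, n2=32, f1=9, f2=1, f3=5, seed=0):
    rng = np.random.default_rng(seed)
    shapes = [(c, f1, f1, n1), (n1, f2, f2, n2), (n2, f3, f3, c)]
    # Small random weights and zero biases: an assumption for the demo.
    return [(rng.normal(scale=1e-3, size=s), np.zeros(s[-1])) for s in shapes]

def srcnn_forward(X, params):
    (W1, B1), (W2, B2), (W3, B3) = params
    F1 = np.maximum(conv_same(X, W1, B1), 0.0)        # patch extraction
    F2 = np.maximum(conv_same(F1, W2, B2), 0.0)       # non-linear mapping
    return conv_same(F2, W3, B3)                      # reconstruction, no ReLU

params = init_params()
X = np.random.default_rng(1).random((16, 16, 1))      # already interpolated input
Y_pred = srcnn_forward(X, params)
n_weights = sum(W.size for W, _ in params)
print(Y_pred.shape, n_weights)  # (16, 16, 1) 8032
```

For a single-channel image the three layers hold 5184 + 2048 + 800 = 8032 weights (biases excluded), which shows how small the network is.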

Loss Function — Mean Squared Error

For each patch Xi, we evaluate F(Xi) = Yi for the given network. This gives us the Yi’s of the different patches in the images. We want to update the weights such that they minimize the loss function, i.e. the optimization objective.

In this network configuration, Mean Squared Error (MSE) as defined above is used as the optimization objective. That is, the mean squared error between the generated and expected images should be as small as possible.

Optimizer — Gradient Descent

We want to update the weights to reduce the loss. This is done by moving the weights in the direction of steepest descent of the loss, i.e. opposite to its gradient with respect to the weights. This method is called gradient descent.

When we update the weights using gradient descent after each sample, the optimization method is called stochastic gradient descent. If we update them after every small batch of data points, it is called mini-batch gradient descent, while batch gradient descent uses the whole dataset for each update.
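The update rule is easiest to see on a toy problem. The sketch below fits a line with an MSE loss instead of training SRCNN. The data, learning rate and batch size are hypothetical choices. Setting `batch_size = 1` gives the per-sample variant and `batch_size = len(x)` gives one update per pass over the full dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((256, 1))
y = 3.0 * x + 0.5                      # toy data with known weight and bias

w, b = 0.0, 0.0                        # parameters to learn
lr, batch_size = 0.3, 32               # hypothetical hyperparameters

for epoch in range(200):
    order = rng.permutation(len(x))    # reshuffle the samples every epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        err = (w * x[idx] + b) - y[idx]          # prediction error on the batch
        grad_w = 2.0 * np.mean(err * x[idx])     # d(MSE)/dw
        grad_b = 2.0 * np.mean(err)              # d(MSE)/db
        w -= lr * grad_w               # step against the gradient
        b -= lr * grad_b

print(round(float(w), 2), round(float(b), 2))
```

The learned values should end up close to the 3.0 and 0.5 used to generate the data.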

Dataset — T91, Set5, Set14

The T91 dataset contains 91 images. Set5 and Set14 contain 5 and 14 images respectively. We use T91 for training and the other two for testing.

We extract small patches from the images and run the network on them. We experiment with values of k = 2, 3 and 4.

We can also use data from the BSDS dataset for training, and even augment the existing data by flipping or rotating the images/patches.
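Flip and rotation augmentation can be sketched with NumPy alone. The function name `augment` is an assumption, and the sketch limits itself to 90-degree rotations and mirroring so that no pixel values have to be interpolated.

```python
import numpy as np

def augment(patch: np.ndarray) -> list:
    """Return the 8 flipped/rotated variants of an (H, W, c) patch."""
    variants = []
    for quarter_turns in range(4):
        rotated = np.rot90(patch, k=quarter_turns, axes=(0, 1))
        variants.append(rotated)
        variants.append(np.flip(rotated, axis=1))   # horizontal mirror
    return variants

patch = np.arange(2 * 3 * 1).reshape(2, 3, 1)
augmented = augment(patch)
print(len(augmented), augmented[2].shape)  # 8 (3, 2, 1)
```

When training, the same transformation has to be applied to the input patch and to its ground truth patch so that the pair stays aligned.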


Multiple implementations of the algorithm can be found on GitHub. The SRCNN page itself provides implementations in MATLAB and Caffe.

Explanations of code and helpers can also be found at


[1] Chao Dong, Chen Change Loy, Kaiming He, Xiaoou Tang, “Image Super-Resolution Using Deep Convolutional Networks”,

Stay tuned for more content! Let me know your thoughts on the article in the comments.