Original article was published on Deep Learning on Medium
The simplest and earliest network designs are Linear Networks. Their architecture is a single chain of layers without any skip connections or multiple branches. The up-sampling operation in linear networks is performed either early or late in the pipeline (early up-sampling vs. late up-sampling).
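The practical consequence of the early vs. late choice can be sketched with back-of-envelope arithmetic (the layer sizes and 4× factor below are illustrative assumptions, not values from the article): a convolution's cost scales with the spatial resolution at which the layer operates.

```python
# Back-of-envelope comparison of per-layer convolution cost for early
# vs. late up-sampling designs. All numbers here are toy assumptions.
# Cost of a conv layer ~ H * W * k^2 * C_in * C_out multiply-adds.

def conv_cost(h, w, k=3, c_in=64, c_out=64):
    return h * w * k * k * c_in * c_out

scale = 4            # 4x super-resolution
lr_h, lr_w = 64, 64  # low-resolution input size

# Early up-sampling: the input is interpolated first, so every layer
# operates at HR resolution.
early = conv_cost(lr_h * scale, lr_w * scale)

# Late up-sampling: layers operate at LR resolution and the up-sampling
# happens at the end of the network.
late = conv_cost(lr_h, lr_w)

print(early // late)  # 16, i.e. scale^2 times more work per layer
```

The quadratic gap in per-layer cost is why late up-sampling dominates recent designs.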
To avoid vanishing gradients and to enable very deep networks, Residual Networks use skip connections in their design. The network learns the residual, i.e. the high-frequency difference between the input and the ground truth. Such networks are categorized into Single-stage and Multi-stage networks based on the number of stages.
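The core idea can be shown in a few lines of numpy (random arrays stand in for images; this is a conceptual sketch, not any specific network): the layers only predict the residual, and the skip connection adds the input back, so the network never has to re-learn the low frequencies already present in the up-sampled input.

```python
import numpy as np

# Sketch of global residual learning: the network predicts only the
# residual (high-frequency detail) between the bicubic-upsampled input
# and the ground truth; a skip connection adds the input back.

rng = np.random.default_rng(0)
upsampled_lr = rng.random((8, 8))                       # stand-in for a bicubic-upsampled LR image
ground_truth = upsampled_lr + 0.05 * rng.standard_normal((8, 8))

target_residual = ground_truth - upsampled_lr           # what the layers learn

# At inference, a perfect residual predictor reconstructs HR exactly:
reconstruction = upsampled_lr + target_residual
assert np.allclose(reconstruction, ground_truth)
```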
The main motivation behind Recursive Networks is to break down the harder SR problem into a set of simpler ones. Recursive networks either employ recursively connected convolutional layers or recursively linked units.
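The defining trait of such designs is weight sharing across recursions, which a toy linear map makes concrete (the map and recursion count below are arbitrary illustrations): effective depth grows while the parameter count stays that of a single unit.

```python
import numpy as np

# Sketch of a recursive unit: the SAME weights are applied repeatedly,
# so depth grows without adding parameters (in the spirit of
# DRCN/DRRN-style designs; the linear map here is a toy stand-in).

rng = np.random.default_rng(1)
shared_weight = 0.9 * np.eye(4)     # one shared "layer"
x = rng.random(4)

def recursive_unit(x, w, recursions):
    for _ in range(recursions):
        x = w @ x                   # identical weights at every recursion
    return x

deep_out = recursive_unit(x, shared_weight, recursions=16)

# Parameter count is that of ONE unit regardless of recursion depth:
print(shared_weight.size)  # 16
```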
Progressive Reconstruction Designs:
For large scaling factors, predicting the outcome in a single step is sometimes not feasible for CNN algorithms. For such large factors, the algorithms instead predict in multiple steps, i.e. 2× followed by 4× and so on.
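The multi-step idea can be sketched as repeated 2× stages (nearest-neighbour repetition below is a crude stand-in for a learned up-sampling stage; real progressive designs refine the image after each step):

```python
import numpy as np

# Sketch of progressive reconstruction: an 8x output is built as three
# successive 2x steps rather than one 8x jump. Each nearest-neighbour
# step stands in for a learned up-sampling + refinement stage.

def upsample_2x(img):
    return img.repeat(2, axis=0).repeat(2, axis=1)

lr = np.arange(16.0).reshape(4, 4)

sr = lr
for _ in range(3):          # 2x -> 4x -> 8x
    sr = upsample_2x(sr)    # a learned refinement would follow each step

print(sr.shape)  # (32, 32)
```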
Densely Connected Networks:
Densely Connected Networks are inspired by DenseNet architecture for image classification. They combine hierarchical cues available along the network depth to achieve high flexibility and richer feature representations.
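Dense connectivity reduces to one operation, concatenating every earlier feature map into each layer's input, which the following numpy sketch illustrates (channel counts and the random "layer" are toy assumptions):

```python
import numpy as np

# Sketch of dense connectivity: each "layer" receives the concatenation
# of ALL preceding feature maps along the channel axis, so features
# from every depth stay directly accessible.

rng = np.random.default_rng(2)
growth = 4                                   # channels added per layer

def toy_layer(x, out_channels, rng):
    # stand-in for a conv: random channel mixing down to out_channels
    w = rng.random((out_channels, x.shape[0]))
    return np.tensordot(w, x, axes=1)

features = [rng.random((8, 16, 16))]         # initial 8-channel input
for _ in range(3):
    dense_input = np.concatenate(features, axis=0)   # all previous maps
    features.append(toy_layer(dense_input, growth, rng))

print(np.concatenate(features, axis=0).shape[0])  # 8 + 3*4 = 20 channels
```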
Multi-branch networks aim to obtain a diverse set of features at multiple context scales. Such complementary information is then fused to obtain better HR reconstructions. This design also enables a multi-path signal flow, leading to better information exchange during the forward and backward passes of training. Multi-branch designs are becoming common in several other computer vision tasks as well.
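A minimal sketch of the branch-and-fuse pattern, with moving averages of different window sizes standing in for branches with different receptive fields (the windows and fusion-by-sum are illustrative choices):

```python
import numpy as np

# Sketch of a multi-branch design: parallel paths process the same
# input at different context scales, and the complementary features
# are fused (here by summation).

def smooth(x, k):
    # toy "branch": moving average with window k (stand-in for convs
    # with a receptive field of k)
    kernel = np.ones(k) / k
    return np.convolve(x, kernel, mode="same")

rng = np.random.default_rng(4)
x = rng.random(32)

branch_small = smooth(x, 3)     # fine-scale context
branch_large = smooth(x, 9)     # coarse-scale context

fused = branch_small + branch_large      # fuse complementary information
assert fused.shape == x.shape
```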
The previously discussed network designs consider all spatial locations and channels to be uniformly important for SR. In several cases, it helps to selectively attend to only a few features at a given layer. Attention-based models allow this flexibility by assuming that not all features are essential for super-resolution; rather, they have varying importance. Coupled with deep networks, recent attention-based models have shown significant improvements for SR.
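Channel attention, the variant popularized for SR by RCAN-style modules, can be sketched as pool-gate-rescale (the random gating matrix below stands in for the small learned bottleneck network a real module would use):

```python
import numpy as np

# Sketch of channel attention: global average pooling summarizes each
# channel, a gating function maps the summary to per-channel weights
# in (0, 1), and the feature maps are rescaled so that informative
# channels are emphasized.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
features = rng.random((16, 8, 8))            # (channels, H, W)

descriptor = features.mean(axis=(1, 2))      # global average pooling
# Toy gate: a learned bottleneck MLP would sit here in a real network.
w = rng.standard_normal((16, 16)) * 0.1
weights = sigmoid(w @ descriptor)            # per-channel importance

attended = features * weights[:, None, None] # rescale each channel
assert attended.shape == features.shape
```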
Multiple-degradation Handling Networks:
All the designs so far assume bicubic degradation. In reality, however, this may not be a realistic assumption, as multiple degradations can occur simultaneously.
Generative Adversarial Networks:
Generative Adversarial Networks have two components, namely a generator and a discriminator. The generator creates SR images that the discriminator cannot distinguish from real HR images, i.e. it cannot tell whether an image is a real HR image or an artificially super-resolved output. In this manner, HR images with better perceptual quality are generated. The corresponding PSNR values are generally degraded, which highlights the problem that the prevalent quantitative measures in the SR literature do not capture the perceptual soundness of the generated HR outputs.
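The opposing objectives can be sketched with binary cross-entropy on toy discriminator scores (the scores below are made-up numbers, not measured values):

```python
import numpy as np

# Sketch of the adversarial objective: the discriminator D outputs a
# probability that an image is a real HR sample; the generator is
# trained to push D's score on its super-resolved outputs toward 1.

def bce(p, label):
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

d_real = 0.9     # D's score on a real HR image (toy value)
d_fake = 0.2     # D's score on a generated SR image (toy value)

# Discriminator objective: real -> 1, fake -> 0
d_loss = bce(d_real, 1.0) + bce(d_fake, 0.0)
# Generator objective: the fake should be classified as real (label 1)
g_loss = bce(d_fake, 1.0)

# A confident discriminator implies a large generator loss, and vice
# versa: the two losses pull against each other during training.
assert g_loss > bce(d_real, 1.0)
```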
This section compares more than 30 state-of-the-art algorithms on six challenging datasets: Set5, Set14, BSD100, Urban100, DIV2K, and Manga109.
Number of Parameters:
The algorithms were evaluated on the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) measures. Methods with direct reconstruction perform one-step upsampling from the LR to HR space, while progressive reconstruction predicts HR images in multiple upsampling steps. Depth represents the number of convolutional and transposed convolutional layers in the longest path from input to output for 4× SR. Global residual learning (GRL) indicates that the network learns the difference between the ground truth HR image and the upsampled (i.e. using bicubic interpolation or learned filters) LR images. Local residual learning (LRL) stands for the local skip connections between intermediate convolutional layers.
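PSNR, the primary metric in these comparisons, follows directly from the per-pixel mean squared error; a minimal implementation (assuming 8-bit images, so a peak value of 255) looks like this:

```python
import numpy as np

# PSNR as used in SR comparisons: derived from per-pixel MSE against
# the ground truth, with the peak set to the maximum pixel value
# (255 for 8-bit images). Higher is better; identical images give
# infinite PSNR.

def psnr(reference, estimate, peak=255.0):
    mse = np.mean((reference.astype(np.float64)
                   - estimate.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

ref = np.full((8, 8), 128.0)
noisy = ref + 25.5            # uniform error of 25.5 gray levels

print(round(psnr(ref, noisy), 2))  # 20.0
```

SSIM, by contrast, compares local luminance, contrast, and structure statistics rather than raw pixel differences, which is why the two metrics can disagree.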
As one can notice, methods that perform late upsampling have considerably lower computational cost compared to methods that perform upsampling earlier in the network pipeline.
DRLN achieves the best PSNR and SSIM performance for 2× and 3×, and ESRGAN for 4×. However, it is difficult to declare one algorithm a clear winner, as many factors are involved, such as network complexity, network depth, training data, patch size for training, number of feature maps, etc. A fair comparison is only possible by keeping all these parameters consistent.
A visual comparison between a few of the state-of-the-art algorithms that aim to improve the PSNR of the images is shown below.
The output of the GAN-based algorithms, which are perceptually driven and aim to enhance the visual quality of the generated outputs, is shown below.
As one can notice, outputs are generally crisp, but the corresponding PSNR values are relatively lower compared to methods that optimize pixel-level loss measures.
For higher magnification levels, the artifacts in the images produced by these algorithms become more visible.
It is clear from the images that most of the state-of-the-art algorithms struggle to reproduce textures in highly magnified versions of the images.
Choice of Network Loss:
The most popular choices for the network loss in convolutional neural networks for image super-resolution are the mean squared error (l2) and the mean absolute error (l1). Generative adversarial networks (GANs) additionally employ a perceptual (adversarial) loss on top of pixel-level losses such as the MSE. The initial CNN methods were trained using the l2 loss; however, the trend has shifted towards l1 more recently, as the mean absolute difference (l1) has been shown to be more robust than l2. The reason is that l2 puts more emphasis on larger errors, while l1 gives a more balanced error distribution.
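The robustness argument reduces to the shape of the gradients, which a small numerical sketch makes concrete (the residual values below are made up to include one outlier):

```python
import numpy as np

# Why l1 can be more robust than l2: the l2 gradient grows linearly
# with the error, so a few large outlier residuals dominate the
# update, whereas the l1 gradient has constant magnitude regardless
# of the error size.

errors = np.array([0.1, 0.2, 5.0])   # per-pixel residuals, one outlier

l2_grads = 2.0 * errors              # gradient of e^2 w.r.t. e
l1_grads = np.sign(errors)           # gradient of |e| w.r.t. e

# Under l2, the 5.0 outlier contributes 25x the gradient of the 0.2
# residual; under l1, every residual contributes equally.
outlier_dominance_l2 = l2_grads[2] / l2_grads[1]   # ~25.0
outlier_dominance_l1 = l1_grads[2] / l1_grads[1]   # 1.0
```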
Contrary to the claim made in SRCNN that network depth does not contribute to better performance and sometimes even degrades quality, VDSR first showed that deeper networks yield better PSNR and image quality. EDSR further established this claim by increasing the number of convolutional layers to nearly four times that of VDSR. Recently, RCAN employed more than four hundred convolutional layers to enhance image quality. The current batch of CNNs incorporates ever more convolutional layers to build deeper networks that improve image quality and numbers, and this trend has remained dominant in deep SR since the inception of SRCNN.
Overall, skip connections have played a vital role in improving SR results. These connections can be broadly categorized into four main types: global, local, recursive, and dense connections. VDSR first utilized global residual learning (GRL) and showed an enormous performance improvement over SRCNN. Further, DRRN and DRCN demonstrated the effectiveness of recursive connections. Recently, EDSR and RCAN employed local residual learning (LRL), i.e. local connections, while keeping global residual learning (GRL) as well. Similarly, RDN and ESRGAN employed dense connections along with global ones. Modern CNNs continue to innovate and introduce other types of connections between different layers and modules.
Single-image super-resolution is a challenging research problem with important real-life applications. The phenomenal success of deep learning approaches has resulted in rapid growth in deep convolutional network-based techniques for image super-resolution. A diverse set of approaches have been proposed with exciting innovations in network architectures and learning methodologies. Through extensive quantitative and qualitative comparisons, we note the following trends in the existing art: (a) GAN-based approaches generally deliver visually pleasing outputs while the reconstruction error based methods more accurately preserve spatial details in an image, (b) for the case of high magnification rates (8× or above), the existing models generally deliver sub-optimal results, (c) the top-performing methods generally have higher computational complexity and are deeper than their counterparts, (d) residual learning has been a major contributing factor for performance improvement due to its signal decomposition that makes the learning task easier. Overall, we note that the SR performance has been greatly enhanced in recent years with a corresponding increase in the network complexity. Remarkably, the state-of-the-art approaches still suffer from limitations that restrict their application to key real-world scenarios (e.g., inadequate metrics, high model complexity, inability to handle real-life degradations). We hope this article will attract new efforts towards the solution of these crucial problems.