Source: Deep Learning on Medium
Travel websites thrive on high-quality, engaging imagery. Hotel images are a powerful tool for creating an exceptional customer experience. They give the viewer important information about the property and inspire the viewer’s imagination. Expedia receives millions of images from our users and hotel partners. We want to make sure the images displayed to our customers are of very high quality and legally compliant. Not all images meet these minimum requirements.
When we start with a small-size image we can either add black borders to compensate for the empty space in a gallery display or stretch the image to fit using standard upsizing. This often results in noticeable pixelation and artifacts. A pixelated image does not create the exceptional experience we want our customers to have. In this work, we investigate solutions to the problem of generating high-quality images from small-size images in a commercial setting. We developed a machine learning model to upscale images more dynamically in a way that preserves as much of the images’ quality as possible while allowing for a much larger display.
We did an extensive literature survey and found numerous interesting approaches. However, most prior work focuses on small image sizes, that is, upscaling a tiny image (say ~100px) by a factor of four. Our business needs us to generate images at higher resolutions than that. We were tasked with generating images of at least 2000px, four times greater than the sizes typically reported in the literature. Such large images are popularly known as high-definition images; in this post, we refer to them as high-resolution (HR) images. A large portion of the older images in our inventory do not meet this minimum resolution criterion, so we need to upscale them. This poses a challenge. State-of-the-art models that were validated on tiny images (100px) have not been shown to handle larger images. In this resolution space (2000px), avoiding minor pixelation artifacts is extremely difficult: as the image is scaled up, so is any pixelation. In addition, large images are more susceptible to the object consistency problem because of the larger number of pixels and the long-range dependencies across different regions of the image. These challenges encouraged us to do a deep dive into the super-resolution literature and propose an efficient solution.
Single Image Super-Resolution (SISR) is a well-researched problem with broad commercial relevance. The classical super-resolution techniques (bilinear, bicubic, etc.) cannot provide perceptually appealing images in many cases. Developing a deep learning model to solve the SISR problem appears to be a more general and robust solution. We studied a few deep learning approaches that could potentially solve this problem. In order to evaluate these ideas, we needed thousands of paired high-resolution and low-resolution images for training the model.
We had a large repository of images with diverse scenes, objects, and localities from all over the world. Next, we needed pairs of high-resolution and low-resolution (LR) images in order to train our model.
For our first dataset (training) we used 20k HR (~2000 px) images and synthetically created the corresponding LR pairs for them by applying down-sampling. For our second dataset (testing) we had more than 3000 images of varying resolutions. We needed to achieve some minimum score on our testing dataset in order to get sign-off. Please note our test sets were original small-sized images from hotel partners and not synthetically generated.
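The synthetic pairing step can be sketched as follows. This is a minimal illustration, not the pipeline we actually ran: the post does not specify the down-sampling kernel or the scale factor, so the block-averaging filter, the factor of 4, and the helper name `make_lr` are all assumptions made for the example.

```python
import numpy as np

def make_lr(hr: np.ndarray, scale: int = 4) -> np.ndarray:
    """Down-sample an (H, W, C) image by averaging scale x scale blocks.

    Block averaging is an assumed stand-in for whatever down-sampling
    filter is used in practice (bicubic is a common alternative).
    """
    h, w, c = hr.shape
    # Crop so that both sides divide evenly by the scale factor.
    h, w = h - h % scale, w - w % scale
    blocks = hr[:h, :w].reshape(h // scale, scale, w // scale, scale, c)
    return blocks.mean(axis=(1, 3))

rng = np.random.default_rng(0)
hr = rng.random((2000, 1500, 3), dtype=np.float32)  # stand-in for a ~2000px HR photo
lr = make_lr(hr, scale=4)
print(hr.shape, "->", lr.shape)  # (2000, 1500, 3) -> (500, 375, 3)
```

Each (LR, HR) pair produced this way gives the model an input and a ground-truth target. Because the degradation is synthetic, it may not match real low-resolution uploads, which is one reason to test on original small images instead.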
Earlier super-resolution (SR) models were mainly based on sparse coding. More recently, deep learning approaches have produced favorable results in many computer vision tasks. There are many interesting deep learning frameworks for SR, such as models based on Convolutional Neural Networks (CNNs) and models based on Generative Adversarial Networks (GANs), but here we focus on the models of particular relevance to our work: Super-Resolution GAN (SRGAN¹) and Self-Attention GAN (SAGAN²).
Fine-tuning on a pre-trained SRGAN:
As a baseline, we applied a pre-trained SRGAN model⁵ to our test set of original smaller images. This model was trained on the RAISE dataset, which comprises HR images from a large-scale resolution space (8,156 images ranging from 2500 to 4000px). Thus, we had reason to believe it would be more compatible with our target resolution space than models trained on the smaller images in other publicly available datasets such as ImageNet. However, we observed considerable pixelation and blue patches in some regions of the output images. Hence we decided to fine-tune this model with Expedia’s high-quality images. After only 10 epochs of training with 11.5k Expedia images, we saw significant improvements, and almost all the blue patches were gone. Given this result, we think the earlier artifacts from the pre-trained model were due to domain differences between the RAISE and Expedia datasets.
We further analyzed a few thousand images manually and observed some ringing artifacts along structures with long-range dependencies, such as walls, desks, and pool edges. We needed to extend the model to reduce the object inconsistency³ that we saw with the fine-tuning approach. To address this issue, we added attention so that the model can attend to the salient parts of the image.
Self-Attention is expensive to fit in GPU memory for large-size images:
The idea of including an attentional component in the SR task is to capture long-range, multilevel dependencies across image regions that are far apart and unseen by convolution kernels. However, the amount of memory required to store the correlation matrix (i.e., attention map) of SAGAN’s self-attention layer is prohibitively large for large-scale images¹. For instance, an input of size 500*500px has 250k spatial positions, so its flattened correlation matrix has 250k*250k entries, which is very costly to store in memory.
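To put a number on that cost, here is a small back-of-the-envelope calculation. It assumes one 32-bit float per attention entry and a single attention map per image; the function name is made up for illustration, and the 125*125 case is included only to show how quickly the cost falls as the spatial size shrinks.

```python
def attention_map_bytes(height: int, width: int, bytes_per_entry: int = 4) -> int:
    """Memory for a dense (H*W) x (H*W) attention map, assuming float32 entries."""
    n = height * width  # number of spatial positions
    return n * n * bytes_per_entry

full = attention_map_bytes(500, 500)
small = attention_map_bytes(125, 125)  # each side reduced by a factor of 4

print(f"{full / 1e9:.0f} GB")    # 250 GB
print(f"{small / 1e9:.2f} GB")   # 0.98 GB
print(full // small)             # 256
```

The attention map grows with the fourth power of the side length, so reducing each side by a factor of 4 cuts the memory by 4^4 = 256.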
To address this memory issue, we came up with the idea of flexible self-attention (FSA), which essentially uses pooling and un-pooling to get a smaller attention map. Our FSA layer adds attention to the model without exploding memory for large-scale images. We wrap the SAGAN self-attention layer with max-pooling and then resize the output to match the shape of the input, as shown in Figure 1. Since the input and output feature maps are the same size, the FSA layer can be inserted between any two convolutional layers. This wrapping reduces the size of the attention map, enabling us to perform attention on large images on GPUs such as the Nvidia Tesla K80.
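The pool, attend, resize pattern described above can be sketched in NumPy. This is a simplified illustration under several assumptions, not our production layer: the projections are plain matrices rather than 1x1 convolutions, the resize step is nearest-neighbor repetition, the pooling factor is 4, and the scalar `gamma` stands in for a learned skip-connection weight. All function and parameter names are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def max_pool(x, p):
    """Max-pool an (H, W, C) feature map with a p x p window (H, W divisible by p)."""
    h, w, c = x.shape
    return x.reshape(h // p, p, w // p, p, c).max(axis=(1, 3))

def upsample_nearest(x, p):
    """Resize back by repeating each spatial position p times along both axes."""
    return x.repeat(p, axis=0).repeat(p, axis=1)

def self_attention(x, wq, wk, wv):
    """Dense self-attention over all spatial positions of an (H, W, C) map."""
    h, w, c = x.shape
    flat = x.reshape(h * w, c)                # (N, C), N = H * W
    q, k, v = flat @ wq, flat @ wk, flat @ wv
    attn = softmax(q @ k.T, axis=-1)          # (N, N) attention map
    return (attn @ v).reshape(h, w, c), attn

def flexible_self_attention(x, wq, wk, wv, gamma, pool=4):
    pooled = max_pool(x, pool)                         # shrink before attention
    attended, attn = self_attention(pooled, wq, wk, wv)
    restored = upsample_nearest(attended, pool)        # back to the input shape
    return x + gamma * restored, attn                  # weighted skip connection

rng = np.random.default_rng(0)
c = 16
x = rng.standard_normal((64, 64, c))
wq = rng.standard_normal((c, c // 8)) * 0.1
wk = rng.standard_normal((c, c // 8)) * 0.1
wv = rng.standard_normal((c, c)) * 0.1
y, attn = flexible_self_attention(x, wq, wk, wv, gamma=0.1, pool=4)
print(y.shape, attn.shape)  # (64, 64, 16) (256, 256)
```

In this toy setting the attention map has 256 x 256 entries instead of the 4096 x 4096 that unpooled attention over a 64 x 64 map would need, while the output keeps the input shape so the layer can sit between two convolutions.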
We then trained our model with 20k images and observed a significant improvement in the SSIM (structural similarity) score on our validation set. This provided a strong signal that adding attention to the model improves the structural consistency of the output images. This can also be seen in Figures 3 and 4.
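For readers unfamiliar with SSIM, the following sketch computes a simplified, single-window version of the score. It is an illustration only: standard SSIM averages the same expression over local sliding windows, and the constants below follow the commonly used defaults (0.01 and 0.03 times the data range). The function name and the noisy test image are assumptions for the example, not our evaluation code.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """SSIM computed once over the whole image (no sliding window)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (0.01 * data_range) ** 2   # stabilizes the luminance term
    c2 = (0.03 * data_range) ** 2   # stabilizes the contrast/structure term
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    )

rng = np.random.default_rng(0)
ref = rng.random((128, 128))                                    # stand-in reference image
noisy = np.clip(ref + 0.2 * rng.standard_normal(ref.shape), 0, 1)

print(global_ssim(ref, ref))     # identical images score 1 (up to rounding)
print(global_ssim(ref, noisy))   # degraded image scores lower
```

A score closer to 1 means the output preserves more of the reference image's luminance, contrast, and structure.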
Proposed Attention Model:
The A-SRGAN architecture extends SRGAN¹ with a Flexible Self-Attention (FSA) layer inspired by SAGAN². Figure 2 shows our model architecture. Note that the Learnable Sum operation refers to the weighted skip connection from SAGAN. In each layer, the weights are normalized using spectral normalization. The generator and discriminator networks of A-SRGAN are shown with their corresponding kernel size (k), number of feature maps (n), and stride (s).
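Spectral normalization divides a layer's weights by their largest singular value, which is usually estimated by power iteration. The sketch below shows that idea in NumPy under stated assumptions: the kernel is flattened to a matrix with one row per output channel, and the iteration count, seed, and function name are chosen for the example rather than taken from our training code.

```python
import numpy as np

def spectral_normalize(w, n_iter=100, seed=0):
    """Scale w so that its largest singular value is approximately 1."""
    mat = w.reshape(w.shape[0], -1)         # (out_channels, everything else)
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(mat.shape[0])
    for _ in range(n_iter):                 # power iteration
        v = mat.T @ u
        v /= np.linalg.norm(v) + 1e-12
        u = mat @ v
        u /= np.linalg.norm(u) + 1e-12
    sigma = u @ mat @ v                     # estimate of the top singular value
    return w / sigma, sigma

rng = np.random.default_rng(1)
kernel = rng.standard_normal((8, 4, 3, 3))  # (out, in, k, k) convolution kernel
normalized, sigma = spectral_normalize(kernel)

top = np.linalg.svd(normalized.reshape(8, -1), compute_uv=False)[0]
print(sigma, top)  # top should be close to 1
```

Deep learning frameworks typically run only one or a few power-iteration steps per training update and carry the vector `u` over between updates, which keeps the overhead small.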
Two examples of high-resolution images generated by our model are provided below. We zoomed in on a small crop of each image, since these images are at least HD. You need to zoom in 4–6x while viewing on a computer screen in order to see how the model is performing.