How ‘Copy-and-Paste’ is embedded in CNNs for Image Inpainting — Review: Shift-Net: Image Inpainting via Deep Feature Rearrangement

Hello everyone:) Welcome back!! Today, we will dive into a more specific deep image inpainting technique, Deep Feature Rearrangement. This technique combines the advantages of modern data-driven CNNs with those of the conventional copy-and-paste inpainting method. Let’s learn and enjoy together!

Recall

This is my fifth post related to deep image inpainting. In my first post, I introduced the objective of image inpainting and the first GAN-based image inpainting method. In my second post, we went through an improved version of that method, in which a texture network is employed to enhance the local texture details. In my third post, we dived into a milestone in deep image inpainting whose proposed network architecture can be regarded as a standard network design for image inpainting. In my fourth post, we had a quick revision and skimmed through a variant/improved version of the standard inpainting network. If you are new to this topic, I highly recommend reading the previous posts first. I hope that you can have a full picture of the progress in recent deep image inpainting. I have tried my best to tell the story:)

Motivation

Figure 1. Qualitative comparison of inpainting results by different methods. (a) Input (b) Conventional method (based on copy-and-paste) (c) First GAN-based method, Context Encoder (d) Proposed method. Extracted from [1]

As I mentioned in my previous posts, a conventional way to fill in the missing parts of an image is to search for the most similar image patches and then directly copy-and-paste these patches onto the missing parts (i.e. the copy-and-paste method). This method offers good local details as we directly paste other image patches onto the missing parts. However, the patches may not perfectly fit the context of the entire image, which may lead to poor global consistency. Take Figure 1(b) as an example: the local texture details of the filled region are good, but the region is not consistent with the non-missing parts (i.e. valid pixels).

On the other hand, deep learning-based methods focus on the context of the entire image. Fully-connected layers or dilated convolutional layers are used to capture the context of the entire image, and the models are trained using L1 loss to ensure pixel-wise reconstruction accuracy. Therefore, filled images offered by deep learning approaches have better global consistency. However, L1 loss leads to blurry inpainting results, even though adversarial loss (GAN loss) can be used to enhance the sharpness of the filled pixels. Take Figure 1(c) as an example: the filled region is more consistent with the non-missing region, but it is blurry.

So, the authors of this paper would like to combine the advantages of the conventional “Copy-and-Paste” method (good local details) and the modern deep learning approach (good global consistency).

Introduction

In image inpainting, we want a completed image with good visual quality. Therefore, we need both a correct global semantic structure and fine detailed textures. A correct global semantic structure means that the generated pixels and the valid pixels should be consistent; in other words, when we fill in an image, its context has to be maintained. Fine detailed textures mean that the generated pixels should be realistic-looking and as sharp as possible.

In the previous section, we mentioned that conventional “Copy-and-Paste” methods can offer fine detailed textures, while recent deep learning approaches can provide a much more correct global semantic structure. So, this paper introduces a shift-connection layer that achieves deep feature rearrangement, embedding the concept of “Copy-and-Paste” inside the network. Figure 1(d) shows the inpainting results offered by the proposed method.

Solution (in short)

A guidance loss is proposed to encourage their network (Shift-Net) to learn to fill in the missing parts during the decoding process. Apart from that, a shift-connection layer is suggested to match the decoded feature inside the missing region to the encoded feature outside the missing region; each matched location of the encoded feature outside the missing region is then shifted to the corresponding location inside the missing region. This captures information about the most similar local image patches found outside the missing region, and this information is concatenated to the decoded feature for further reconstruction.

Contributions

As mentioned, a shift-connection layer is proposed to embed the concept of copy-and-paste in modern CNNs such that their proposed model can offer inpainting results with both correct global semantic structure and fine detailed textures.

Apart from the standard L1 and adversarial losses, they also propose a guidance loss to train their Shift-Net in an end-to-end data-driven manner.

Approach

Figure 2. Network architecture of Shift-Net. The shift-connection layer is added at resolution of 32×32. Extracted from [1]

Figure 2 shows the network architecture of Shift-Net. Without the shift-connection layer, this is a very standard U-Net structure with skip connections. Note that the encoded feature is concatenated to the corresponding layer of the decoded feature. This kind of skip connection is useful for low-level vision tasks, including image inpainting, in terms of both better local visual details and reconstruction accuracy.
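To make the skip-connection idea concrete, here is a minimal PyTorch sketch of my own (not the authors’ exact architecture; the layer counts and channel widths are assumptions) showing an encoder-decoder in which each decoder stage is concatenated with the encoder feature at the same resolution:

```python
# A minimal U-Net-style encoder-decoder with skip connections (illustration only).
# Assumes 3-channel 256x256 inputs, as used in the paper's experiments.
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: each block halves the spatial resolution.
        self.enc1 = nn.Sequential(nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2))
        self.enc2 = nn.Sequential(nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2))
        self.enc3 = nn.Sequential(nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.LeakyReLU(0.2))
        # Decoder: each block doubles the resolution; its input is the previous
        # decoder feature concatenated with the encoder feature of the same size.
        self.dec3 = nn.Sequential(nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU())
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(128 + 128, 64, 4, stride=2, padding=1), nn.ReLU())
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(64 + 64, 3, 4, stride=2, padding=1), nn.Tanh())

    def forward(self, x):
        e1 = self.enc1(x)                       # 64 x 128 x 128
        e2 = self.enc2(e1)                      # 128 x 64 x 64
        e3 = self.enc3(e2)                      # 256 x 32 x 32
        d3 = self.dec3(e3)                      # 128 x 64 x 64
        d2 = self.dec2(torch.cat([d3, e2], 1))  # 64 x 128 x 128  (skip connection)
        d1 = self.dec1(torch.cat([d2, e1], 1))  # 3 x 256 x 256   (skip connection)
        return d1
```

Shift-Net follows this same U-Net pattern, with the shift-connection layer inserted at the 32×32 resolution as shown in Figure 2.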

Guidance loss

The guidance loss is proposed to train their Shift-Net. Simply speaking, this loss calculates the difference between the decoded feature of the input masked image inside the missing region and the encoded feature of the ground truth inside the missing region.

Let’s define the problem first. Let Ω be the missing region and Ω(bar) be the valid region (i.e. non-missing region). For a U-Net with L layers, ϕ_l(I) represents the encoded feature of the l-th layer and ϕ_{L-l}(I) represents the decoded feature of the (L-l)-th layer. Our final objective is to recover I^gt (ground truth), thus we can expect that ϕ_l(I) and ϕ_{L-l}(I) contain almost all the information in ϕ_l(I^gt). If we consider y ∈ Ω, (ϕ_l(I))_y should be 0 (i.e. the encoded feature of the missing region in an input masked image at the l-th layer is zero). So, (ϕ_{L-l}(I))_y should contain the information of (ϕ_l(I^gt))_y (i.e. the decoded feature of the missing region in an input masked image at the (L-l)-th layer should be equal to the encoded feature of the missing region in the ground truth image at the l-th layer). This means that the decoding process should fill in the missing region.

Equation 1 shows the relationship between (ϕ_{L-l}(I))_y and (ϕ_l(I^gt))_y. Note that for x ∈ Ω(bar) (i.e. the non-missing region), they assume that (ϕ_l(I))_x is almost the same as (ϕ_l(I^gt))_x. Hence, the guidance loss is only defined on the missing region. By concatenating ϕ_l(I) and ϕ_{L-l}(I) as shown in Figure 2, almost all the information in ϕ_l(I^gt) can be obtained.
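In this notation, the guidance loss can be written roughly as (my own reconstruction from the description above, since the equation image is not reproduced here):

\mathcal{L}_g = \sum_{y \in \Omega} \left\| \left(\phi_{L-l}(I)\right)_y - \left(\phi_l(I^{gt})\right)_y \right\|_2^2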

Figure 3. Visualisation of features learned by Shift-Net. (a) Input (the lighter region indicates the missing region) (b) visualisation of (ϕ_l(I^gt))_y (c) visualisation of (ϕ_{L-l}(I))_y (d) visualisation of (ϕ^shift_{L-l}(I))_y

To further show the relationship between (ϕ_{L-l}(I))_y and (ϕ_l(I^gt))_y, the authors visualise the features learned by their Shift-Net as shown in Figure 3. Comparing Figure 3(b) and (c), we can see that (ϕ_{L-l}(I))_y is a reasonable estimation of (ϕ_l(I^gt))_y, but the estimation is too blurry. This leads to blurry inpainting results without fine texture details. This problem is solved by their proposed shift-connection layer, and the result is shown in Figure 3(d). So, let’s talk about the shift operation.

For readers who are interested in their visualisation method, please refer to their paper or their github page. The visualisation method is only used to show the learned features, so I will not cover it here.

Shift-connection layer

Personally, I would say this is the core idea of this paper. Recall that ϕ_l(I) and ϕ_{L-l}(I) are assumed to contain almost all the information in ϕ_l(I^gt). From the previous section, we can see that (ϕ_{L-l}(I))_y can be a reasonable estimation of (ϕ_l(I^gt))_y, but it is not sharp enough. Let’s see how the authors make use of the feature outside the missing region to further enhance the blurry estimation inside the missing region.

Simply speaking, Equation 4 above finds, for each decoded feature inside the missing region, the most similar encoded feature outside the missing region. This is a cosine-similarity operation: for each (ϕ_{L-l}(I))_y with y ∈ Ω, we find its nearest neighbour among the (ϕ_l(I))_x with x ∈ Ω(bar). The output x*(y) represents the coordinates of the matched feature location, and we can obtain a shift vector u_y = x*(y) - y. Note that this shift operation can be formulated as a convolutional layer. I will talk about this in detail in my next post.
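For reference, the matching step can be written roughly as follows (my reconstruction of Equation 4 from the verbal description, so the paper’s exact notation may differ):

x^*(y) = \arg\max_{x \in \bar{\Omega}} \frac{\left\langle \left(\phi_{L-l}(I)\right)_y,\ \left(\phi_l(I)\right)_x \right\rangle}{\left\| \left(\phi_{L-l}(I)\right)_y \right\|_2 \, \left\| \left(\phi_l(I)\right)_x \right\|_2}, \qquad u_y = x^*(y) - y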

After getting the shift vector, we can rearrange the spatial locations of (ϕ_l(I))_x and then concatenate the result with ϕ_l(I) and ϕ_{L-l}(I) to further enhance the estimation of (ϕ_l(I^gt))_y. The spatial rearrangement of (ϕ_l(I))_x is as follows,
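(reconstructed here from the verbal description, as the original equation image is not shown)

\left(\phi^{shift}_{L-l}(I)\right)_y = \left(\phi_l(I)\right)_{x^*(y)} = \left(\phi_l(I)\right)_{y + u_y}, \qquad y \in \Omega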

Verbally, for each decoded feature inside the missing region, after finding the most similar encoded feature outside the missing region, we form another set of feature maps based on the shift vector. This set of feature maps contains the information about the nearest encoded features outside the missing region to the decoded features inside the missing region. All the related information is then combined as shown in Figure 2 for further reconstruction.
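To make the shift operation more tangible, here is a minimal PyTorch sketch of how I read the matching-and-copy step (the function name, tensor shapes and the handling of locations outside the hole are my own assumptions; as noted above, the authors’ implementation formulates the search as a convolutional layer):

```python
# A sketch (not the authors' implementation) of the shift-connection:
# for every decoder feature vector inside the hole, find its cosine-similarity
# nearest neighbour among encoder features outside the hole, then copy that
# encoder feature into a new "shifted" feature map.
import torch
import torch.nn.functional as F

def shift_connection(dec_feat, enc_feat, mask):
    """dec_feat, enc_feat: (C, H, W) decoder/encoder features at the same
    resolution; mask: (H, W) bool tensor, True inside the missing region."""
    C, H, W = enc_feat.shape
    dec = F.normalize(dec_feat.reshape(C, -1), dim=0)   # unit-norm feature columns
    enc = F.normalize(enc_feat.reshape(C, -1), dim=0)
    inside = mask.reshape(-1)                           # hole locations y
    outside = ~inside                                   # valid locations x
    # Cosine similarity between every y inside the hole and every x outside it.
    sim = dec[:, inside].t() @ enc[:, outside]          # (#inside, #outside)
    nn_idx = sim.argmax(dim=1)                          # nearest neighbour per y
    # Build the shifted feature map: copy the matched encoder features into the hole.
    shifted = enc_feat.reshape(C, -1).clone()
    outside_coords = outside.nonzero(as_tuple=True)[0]
    shifted[:, inside] = enc_feat.reshape(C, -1)[:, outside_coords[nn_idx]]
    return shifted.reshape(C, H, W)
```

In Shift-Net, the resulting feature maps are concatenated with ϕ_l(I) and ϕ_{L-l}(I), as shown in Figure 2, for further reconstruction.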

Here I would like to highlight some points about the shift-connection layer. i) The conventional “Copy-and-Paste” method operates in the pixel or image-patch domain, while the shift-connection layer operates in the deep feature domain. ii) The deep features are learned from a large amount of training data, and all the components are trained in an end-to-end data-driven manner. Hence, the advantages of both “Copy-and-Paste” and CNNs are inherited.

Loss Function

Their loss function is very standard. Apart from the guidance loss introduced above, they also employ the L1 loss and the standard adversarial loss. The overall loss function is as follows,
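(reconstructed from the description; the paper’s exact weighting scheme may differ)

\mathcal{L} = \mathcal{L}_{\ell_1} + \lambda_g \mathcal{L}_g + \lambda_{adv} \mathcal{L}_{adv}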

λ_g and λ_adv are used to control the importance of the guidance loss and the adversarial loss respectively. In their experiments, these two hyper-parameters are set to 0.01 and 0.002.

If you are familiar with the training process of CNNs, you may notice that the shift operation is a kind of manual modification of the feature maps. Therefore, we have to modify the calculation of the gradient with respect to the l-th layer feature F_l = ϕ_l(I). Based on Equation 5, the relationship between ϕ^shift_{L-l}(I) and ϕ_l(I) can be written as follows,
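(schematically, with the feature maps flattened spatially; my reconstruction, not a verbatim copy)

\phi^{shift}_{L-l}(I) = P\,\phi_l(I)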

where P is a {0, 1} shift matrix with exactly one element equal to 1 in each row; the position of that 1 indicates the location of the nearest neighbour. Therefore, the gradient with respect to ϕ_l(I) is computed as,
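(again schematically; the exact grouping of the three terms in the paper may differ)

\frac{\partial \mathcal{L}}{\partial F_l} = \left.\frac{\partial \mathcal{L}}{\partial F_l}\right|_{\text{encoder path}} + \frac{\partial \mathcal{L}}{\partial F^{skip}_l} + P^{\top}\,\frac{\partial \mathcal{L}}{\partial \phi^{shift}_{L-l}(I)}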

where F^skip_l represents F_l after the skip connection, and F^skip_l = F_l. All three terms can be directly computed, except that we have to multiply the last term by the transpose of the shift matrix P in order to ensure that the gradient is correctly back-propagated.

You may find this part a bit difficult to understand, as we have to modify the computation of the gradient. For readers who are interested in how the authors actually do the implementation, I highly recommend visiting their github page. If you do not understand this part, it doesn’t matter as long as you catch the core idea of the shift operation. Here, their shift operation is a kind of hard assignment. This means that each decoded feature in the missing region can only have one single nearest neighbour outside the missing region. This is why the shift matrix P is in the form of {0, 1} and why we have to modify the computation of the gradient. Later on, a similar shift operation was proposed with soft assignment. In that case, all neighbours outside the missing region are assigned weights indicating their closeness to each decoded feature inside the missing region, and we do not need to modify the computation of the gradient because the operation is completely differentiable. I will talk about this in detail in my next post:)

Experiments

The authors evaluate their model on two datasets, namely Paris StreetView [2] and six scenes from Places365-Standard [3]. Paris StreetView contains 14,900 training images and 100 testing images. For Places365, there are 1.6 million training images from 365 scenes; six scenes are selected for the evaluation, each with 5,000 training images, 900 testing images, and 100 validation images. For both datasets, they resize each image such that its smaller dimension is 350, then randomly crop a 256×256 sub-image as the input to their model.

For training, they use Adam optimiser with a learning rate of 0.0002 and beta_1 = 0.5. The batch size is set to 1 and the total number of training epochs is 30. Note that flipping is adopted as data augmentation. They claim that around one day is required to train their Shift-Net on a Nvidia Titan X Pascal GPU.
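For concreteness, these optimiser settings correspond to something like the following sketch (the generator here is only a placeholder, and β_2 is the common Adam default rather than a value stated in the post):

```python
# Sketch of the reported training configuration (not the authors' code).
import torch
import torch.nn as nn

generator = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))  # placeholder for the Shift-Net generator
optimizer = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
# Reported: batch size 1, 30 epochs, random flips for augmentation,
# roughly one day of training on a Nvidia Titan X (Pascal).
```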

Figure 4. Visual comparison of inpainting results on Paris StreetView dataset. (a) Input (b) Content-Aware Fill (copy-and-paste method) (c) Context Encoder (d) Multi-scale Neural Patch Synthesis (MNPS) (e) Shift-Net. Extracted from [1]

Figure 4 shows the visual comparison of state-of-the-art approaches on the Paris StreetView dataset. Content-Aware Fill (Figure 4(b)) is the conventional method which utilises the concept of copy-and-paste. You can see that it offers fine local texture details but a wrong global semantic structure. Figure 4(c) and (d) are the results of Context Encoder and Multi-scale Neural Patch Synthesis respectively. We have reviewed these two methods previously. You can see that the results of Context Encoder have a correct global semantic structure but are blurry. MNPS provides better results than Context Encoder, but we can still easily observe some artifacts in the filled region. In contrast, Shift-Net offers inpainting results with both a correct global semantic structure and fine local texture details. The results are shown in Figure 4(e); please zoom in for a better view.

Figure 5. Visual comparison of inpainting results on Places dataset. (a) Input (b) Content-Aware Fill (copy-and-paste method) (c) Context Encoder (d) Multi-scale Neural Patch Synthesis (MNPS) (e) Shift-Net. Extracted from [1]

Figure 5 shows the qualitative comparison of state-of-the-art approaches on Places dataset. Similar observations are made, please zoom in for a better view of the local texture details.

Table 1. Quantitative comparison of state-of-the-art approaches. Extracted from [1]

Table 1 lists quantitative evaluation results on the Paris StreetView dataset. The proposed Shift-Net offers the best PSNR, SSIM, and mean L2 loss. As mentioned in my previous posts, these numbers measure pixel-wise reconstruction accuracy (objective evaluation); they do not necessarily reflect the visual quality of the inpainting results.

Figure 6. Examples of filling random regions. From top to bottom: Input, Content-Aware Fill, and Shift-Net. Extracted from [1]

Figure 6 shows some examples of filling randomly located regions using Content-Aware Fill and the proposed Shift-Net. Shift-Net is able to handle such random regions with good visual quality. Please zoom in for a better view of the local texture details.