NeuroNuggets: Cut-and-Paste in Deep Learning

Source: Deep Learning on Medium


…Many people think that authors just cut and paste from real life into books. It doesn’t work quite that way.

― Paul Fleischman

As the CVPR in Review posts (there were five: GANs for computer vision, pose estimation and tracking for humans, synthetic data, domain adaptation, and face synthesis) have finally dried up, we again turn to our usual stuff. In the NeuroNugget series, we usually talk about specific ideas in deep learning and try to bring you up to speed on each. We have had some pretty general and all-encompassing posts here, but it is often both fun and instructive to dive deeper into something very specific. So we will devote some NeuroNuggets to reviewing a few recent papers that share a common thread.

And today, this thread is… cut-and-paste! And not the kind we all do from other people’s GitHub repositories. In computer vision, this idea is often directly related to synthetic data, as cutting and pasting sometimes proves to be a fertile middle ground between real data and going fully synthetic. But let’s not get ahead of ourselves…

Naive Cut-and-Paste as Data Augmentation

We have talked in great detail about object detection and segmentation, two of the main problems of computer vision. To solve them, models need training data, the more the merrier. In modern computer vision, training data is always in short supply, so researchers always use various data augmentation techniques to enlarge the dataset.

The point of data augmentation is to introduce various modifications of the original image that do not change the ground truth labels you have or change them in predictable ways. Common augmentation techniques include, for instance, moving and rotating the picture and changing its color histogram in predictable ways:

Image source

Or changing the lighting conditions and image parameters that basically reduce to applying various Instagram filters:

Image source

Notice how in terms of individual pixels, the pictures change completely, but we still have a very predictable and controllable transformation of what the result should be. If you know where the cat was in the original image, you know exactly where it is in the rotated-and-cropped one; and Instagram filters usually don’t change the labels at all.

Data augmentation is essential to reduce overfitting and effectively extend the dataset for free; it is usually silently understood in all modern computer vision applications and implemented in standard deep learning libraries (see, e.g., keras.preprocessing.image).

Cutting and pasting sounds like a wonderful idea in this regard: why not cut out objects from images and paste them onto different backgrounds? The problem, of course, is that it is hard to cut and paste an object in a natural way; we will return to this problem later in this post. However, last year (2017) has seen a few papers that claimed that you don’t really have to be terribly realistic to make the augmentation work.

The easiest and most straightforward approach was taken by Rao and Zhang in their paper “Cut and Paste: Generate Artificial Labels for Object Detection” (appeared on ICVIP 2017). They simply took an object detection dataset (VOC07 and VOC12), cut out objects according to their ground truth labels and pasted them onto images with different backgrounds. Like this:

Source: (Rao, Zhang, 2017)

Then they trained with these images, using cut-and-paste like usual augmentation. Even with this very naive approach, they claimed to noticeably improve the results of standard object detection networks like YOLO and SSD. More importantly, they claimed to reduce common error modes of YOLO and SSD. The picture below shows the results after training on the left; and indeed, wrong labels decrease and bounding boxes significantly improve in many cases:

Source: (Rao, Zhang, 2017)

A similar but slightly less naive approach to cutting and pasting was introduced, also in 2017, by researchers from the Carnegie Mellon University. In “Cut, Paste and Learn: Surprisingly Easy Synthesis for Instance Detection” (ICCV 2017), Dwibedi et al. use the same basic idea but instead of just placing whole bounding boxes they go for segmentation masks. Here is a graphical overview of their approach:

Source: (Dwibedi et al., 2017)

Basically, they take a set of images of the objects they want to recognize, collect a set of background scenes, and then paste objects into the scene. Interestingly, they are recognizing grocery items in indoor environments, just like we did in our first big project on synthetic data.

Dwibedi et al. claim that it is not really important to place objects in realistic ways globally but important to achieve local realism. That is, modern object detectors do not care as much to have a Coke bottle on the counter rather than on the floor; however, it is important to blend the object as realistically as possible into the local background. To this purpose, Dwibedi et al. consider several differ blending algorithms for pasting images:

Source: (Dwibedi et al., 2017)

They then make blending another dimension of data augmentation, another factor of variability in order to make the detector robust against boundary artifacts. Together with other data augmentation techniques, it proves highly effective; “All Blend” in the table below means that all versions of blending for the same image are included in the training set:

Source: (Dwibedi et al., 2017)

This also serves as evidence for the point about the importance of local realism. Here are some sample synthetic images Dwibedi et al. come up with:

Source: (Dwibedi et al., 2017)

As you can see, there is indeed little global realism here: objects are floating in the air with no regard to the underlying scene. However, here is how the accuracy improves when you go from real data to real+synthetic:

Source: (Dwibedi et al., 2017)

Note that all of these improvements have been achieved in a completely automated way. The only thing Dwibedi et al. need to make their synthetic dataset is a set of images for that would be easy to segment (in their case, they have photos of objects on a plain background). Then it is all in the hands of neural networks and algorithms: a convolutional network predicts segmentation masks, an algorithm does augmentation for the objects, and then blending algorithms make local patches more believable, so the entire pipeline is fully automated. Here is a general overview of what algorithms constitute this pipeline:

Source: (Dwibedi et al., 2017)

Smarter Augmentation: Pasting with Regard to Geometry

We have seen that even very naive pasting of objects can help improve object detection by making what is essentially synthetic data. The next step in this direction would be to actually try to make the pasted objects consistent with the geometry and other properties of the scene.

Here we begin with a special case: text localization, i.e., object detection specifically for text appearing on an image. That is, you want to take a picture with some text on it and output bounding boxes for the text instances regardless of their form, font, and color, like this:

Image source

This is a well-known problem that has been studied for decades, but here we won’t go into too many details on how to solve it. The point is, in 2016 (the oldest paper in this post, actually) researchers from the University of Oxford proposed an approach to blending synthetic text into real images in a way coherent with the geometry of the scene. In “Synthetic Data for Text Localisation in Natural Images”, Gupta et al. use a novel modification of a fully convolutional regression network (FCRN) to predict bounding boxes, but the main novelty lies in synthetic data generation.

They first sample text and a background image (scraped from Google Image Search, actually). Then the image goes through several steps:

  • first, through a contour detection algorithm called gPb-UCM; proposed in (Arbelaez, Fowlkes, 2011), it does not contain any neural networks and is based on classical computer vision techniques (oriented gradient of histograms, multiscale cue combination, watershed transform etc.), so it is very fast to apply but still produces results that are sufficiently good for this application;
  • out of the resulting regions, Gupta et al. choose those that are sufficiently large and have sufficiently uniform textures: they are suitable for text placement;
  • to understand how to rotate the text, they estimate a depth map (with a state-of-the-art CNN), fit a planar facet to the region in question (with the RANSAC algorithm), and then add the text, blending it in with Poisson editing.

Here is a graphical overview of these steps, with sample generated images on the bottom:

Source: (Gupta et al., 2016)

As a result, Gupta et al. manage to produce very good text placement that blends in with the background scene; their images are not realistic only in the sense that we might not expect text to appear in these places at all, otherwise they are perfectly fine:

Source: (Gupta et al., 2016)

With this synthetic dataset, Gupta et al. report significantly improved results in text localization.

In “Synthesizing Training Data for Object Detection in Indoor Scenes”, Georgakis et al. from the George Mason University and University of North Carolina at Chapel Hill applied similar ideas to pasting objects into scenes rather than just text. Their emphasis is on blending the objects into scenes in a way consistent with the scene geometry and meaning. To do this, Georgakis et al.:

  • use the BigBIRD dataset (Big Berkeley Instance Recognition Dataset) that contains 600 different views for every object in the dataset; this lets the authors blend real images of various objects rather than do the 3D modeling required for a purely synthetic approach;
  • use an approach by Taylor & Cowley (2012) to parse the scene, which again uses the above-mentioned RANSAC algorithm (at some point, we really should start a NonNeuroNuggets series to explain some classical computer vision ideas — they are and will remain a very useful tool for a long time) to extract the planar surfaces from the indoor scene: counters, tables, floors and so on;
  • combine this extraction of supporting surfaces with a convolutional network by Mousavian et al. (2012) that combines semantic segmentation and depth estimation; semantic segmentation lets the model understand which surfaces are indeed supporting surfaces where objects can be placed;
  • then depth estimation and positioning of the extracted facets are combined to understand the proper scale and position of the objects on a given surface.

Here is an illustration of this process, which the authors call selective positioning:

Source: (Georgakis et al., 2017)

Here (a) and (e) show the original scene and its depth map, (b) and © show semantic segmentation results with predictions for counters and tables highlighted on ©, (f) is the result of plane extraction, and (g) are estimated supporting surfaces; they all combine to find regions for object placement shown on (d), and then the object is properly scaled and blended on (h) to obtain the final result (i). Here are some more examples to show that the approach indeed works quite well:

Source: (Georgakis et al., 2017)

Georgakis et al. train and compare Faster R-CNN and SSD with their synthetic dataset. Here is one of the final tables:

Source: (Georgakis et al., 2017)

We won’t go into the full details, but it basically shows that, as always, you can get excellent results on synthetic data by training on synthetic data, which is useless, and you don’t get good results on real data by training purely on this kind of synthetic data. But if you throw together real and synthetic then yes, there is a noticeable improvement compared to using just the real dataset. Since this is still just a form of augmentation and thus is basically free (provided that you have a dataset of different views of your objects), why not?

Cutting and Pasting for Segmentation… with GANs

Finally, the last paper in our review is a quite different animal. In this paper recently released by Google, Remez et al. (2018) are actually solving the instance segmentation problem with cut-and-paste, but they are not trying to prepare a synthetic dataset to train a standard segmentation model. Rather, they are using cut-and-paste as an internal quality metric for segmentations: a good segmentation mask will produce a good image with a pasted object. In the image below, a bad mask (a) leads to an unconvincing image (b), and a good mask © produces a much better image (d), although the ground truth (e) is better still:

Source: (Remez et al., 2018)

How does the model decide which images are “convincing”? With an adversarial architecture, of course! In the model pipeline shown below, the generator is actually doing the segmentation, and the discriminator judges how well the pasted image is by trying to distinguish it from real images:

Source: (Remez et al., 2018)

The idea is simple and brilliant: only a very good segmentation mask will result in a convincing fake, hence the generator learns to produce good masks… even without any labeled training data for segmentation! The whole pipeline only requires the bounding boxes for objects to cut out.

But you still have to paste objects intelligently. There are several important features required to make this idea work. Let’s go through them one by one.

  1. Where do we paste? One can either paste uniformly at random points of the image or try to take into account the scene geometry and be smart about it, like in the papers above. Here, Remez et al. find that yes, pasting objects in a proper scale and place in the scene does help. And no wonder; in the picture below, first look on the left and see how long it takes you to spot the pasted objects. Then look on the right, where they have been pasted uniformly at random. Where will the discriminator’s job be easier?
Source: (Remez et al., 2018)

2. There are a couple of degenerate corner cases that formally represent a very good solution but are actually useless. For example, the generator could learn to “cut off” all or none of the pixels in the image and thus make the result indistinguishable from real… because it is real! To discourage from choosing all pixels, the discriminator simply receives a larger viewpoint, seeing, so to speak, the bigger picture, so this strategy ceases to work. To discourage from choosing no pixels, the authors introduce an additional classification network that attempts to classify the object of interest and the corresponding loss function. Now, if the object has not been cut, classification will certainly fail, incurring a large penalty.

3. Sometimes, cutting only a part of the segmentation mask still results in a plausible object. This is characteristic for modular structures like buildings; for example, in these satellite images some of the masks are obviously incomplete but the resulting cutouts will serve just fine:

Source: (Remez et al., 2018)

To fix this, the authors set up another adversarial game, now trying to distinguish the background resulting from cutting out the object and the background resulting from the same cut elsewhere in the scene. This is basically yet another term in the loss function; modern GANs often tend to grow pretty complicated loss functions, and maybe someday we will explore them in more details.

The authors compare their resulting strategy with some other pretrained baselines; while they, of course, lose to fully supervised methods (with access to ground truth segmentation masks in the training set), they come out ahead against the baselines. It is actually pretty cool that you can get segmentation masks like this with no effort for segmentation type labeling:

Source: (Remez et al., 2018)

There are failure cases too, of course. Usually they happen when the result is still realistic enough even with the incorrect mask. Here are some characteristic examples:

Source: (Remez et al., 2018)

This work is a very interesting example of a growing trend towards data-independent methods in deep learning. More and more often, researchers find ways around the need to label huge datasets, and deep learning gradually learns to do away with the hardships of data labeling. We are not quite there yet but I hope that someday we will be. Until next time!

Sergey Nikolenko
Chief Research Officer, Neuromation