Background removal with U²-Net


Removing the background of an image is an old problem with many applications. For my use case, I needed to erase the background of a photo of a centered piece of furniture, shot in various settings, while leaving the object itself intact.

The image we are going to use for our experiment

While in most cases this task can be achieved with classic computer vision algorithms like image thresholding (using OpenCV[1] for example), some images can prove to be very difficult without specific pre- or post-processing. If the object has a color very similar to the background, it can be very challenging to find a clean contour due to weak edges or shadows.

An example of an image with weak edges and shadows.
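For reference, the classic baseline looks something like the minimal OpenCV sketch below. The input path and the threshold value are placeholders, and the mask assumes a bright background; it is exactly this kind of image, with weak edges and shadows, that trips it up.

```python
import cv2

# Classic baseline: threshold the grayscale image to build a binary mask.
# "chair.jpg" and the threshold of 200 are placeholders; on easy images
# Otsu's method (cv2.THRESH_OTSU) can pick the threshold automatically.
image = cv2.imread("chair.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Assuming a bright background: pixels brighter than the threshold become 0
# (background), darker pixels become 255 (foreground).
_, mask = cv2.threshold(gray, 200, 255, cv2.THRESH_BINARY_INV)

# Keep only the foreground pixels.
foreground = cv2.bitwise_and(image, image, mask=mask)
cv2.imwrite("foreground_threshold.jpg", foreground)
```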

Conversely, if the background is of a characteristic color different from the object, it is easy to construct a mask to get rid of the background. This is the principle behind Chroma Keying, using green and blue screens to erase the background and replace it with another scene, a widely used effect in the entertainment industry.
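As an illustration, a rough chroma-key sketch with OpenCV might look like this; the HSV bounds below are typical values for a green screen and would need tuning to the actual lighting, and the input path is a placeholder.

```python
import cv2
import numpy as np

# Chroma keying sketch: mask out pixels whose color falls within a
# "green screen" range. The HSV bounds are rough placeholders.
image = cv2.imread("green_screen_shot.jpg")  # placeholder path
hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)

lower_green = np.array([40, 80, 80])
upper_green = np.array([80, 255, 255])
background_mask = cv2.inRange(hsv, lower_green, upper_green)

# Invert the mask to keep the subject and drop the green background.
foreground = cv2.bitwise_and(image, image, mask=cv2.bitwise_not(background_mask))
cv2.imwrite("keyed.jpg", foreground)
```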

I don’t have the luxury of a green screen, but luckily for me, thanks to the advances in deep learning over the last decade, this problem is being explored again with new models designed specifically for this task.

The specific subtask related to this problem is called Salient Object Detection.

Sounds nice, but what does Saliency mean exactly?

Here is Wikipedia’s definition of a Saliency map:

In computer vision, a saliency map is an image that shows each pixel’s unique quality. The goal of a saliency map is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze.

In layman’s terms, saliency is the ability to stand out from the rest of the image. The saliency map allows us to distinguish the important part of the image (usually an object in the foreground) from the rest (the background).

Perfect for our use case!

Since I wanted a quick solution to the problem, making my own dataset was out of the question, so I looked for a pretrained model that would perform well enough for my needs without further fine-tuning.

A lot of different models have recently been released for this task. I chose U²-Net[2] because it performed well according to benchmarks and the code was easy to modify for inference on my own images.

The name is derived from the well-known U-Net[3], from which it borrows its general architecture. I won’t delve too deep into the subject, but both networks are ‘U-shaped’: a succession of convolutional and downsampling layers goes down to a low point, at which a succession of convolutional and upsampling layers begins, going back up to the original input shape.

The main difference for U²-Net is that the ‘layers’ are themselves ‘U-structures’ with their own downsampling and upsampling layers.

Network architectures of U-Net (left) vs U²-Net (right)
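To make the ‘U-shape’ idea concrete, here is a toy PyTorch block (not the actual U²-Net implementation) with a single downsampling stage, a single upsampling stage, and a skip connection; in U²-Net, each stage of the outer U is itself such a nested U-structure with more levels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy illustration only: a tiny "U-shaped" module with one downsampling
# and one upsampling stage, plus a skip connection back to the input scale.
class TinyUBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.enc = nn.Conv2d(ch, ch, 3, padding=1)       # convolution before downsampling
        self.mid = nn.Conv2d(ch, ch, 3, padding=1)       # bottom of the "U"
        self.dec = nn.Conv2d(ch * 2, ch, 3, padding=1)   # convolution after upsampling

    def forward(self, x):
        e = F.relu(self.enc(x))
        m = F.relu(self.mid(F.max_pool2d(e, 2)))                      # downsample
        u = F.interpolate(m, size=e.shape[2:], mode="bilinear",
                          align_corners=False)                        # upsample back
        return F.relu(self.dec(torch.cat([e, u], dim=1)))             # skip connection
```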

Now, after some slight modifications to the original code, we can visualize the saliency map straight out of the model:
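Here is a minimal inference sketch, assuming the U2NET class and the pretrained u2net.pth weights from the official repository are available locally; the repository’s own test script differs slightly (it uses its own dataloader and mean/std normalization), and the input path is a placeholder.

```python
import numpy as np
import torch
from PIL import Image

from model import U2NET  # U2NET class from the official U²-Net repository (assumed on the path)

# Load the pretrained weights (u2net.pth, downloaded from the repository).
net = U2NET(3, 1)
net.load_state_dict(torch.load("u2net.pth", map_location="cpu"))
net.eval()

# Resize to the 320x320 input used by the repo and scale to [0, 1].
# The repo's test script additionally applies mean/std normalization;
# this is a simplified version.
image = Image.open("chair.jpg").convert("RGB")  # placeholder path
x = np.asarray(image.resize((320, 320)), dtype=np.float32) / 255.0
x = torch.from_numpy(x).permute(2, 0, 1).unsqueeze(0)

with torch.no_grad():
    outputs = net(x)

# The network returns several side outputs; the first (fused) one is the
# final saliency map. Rescale it to [0, 1] and resize back to the original size.
pred = outputs[0][0, 0].numpy()
pred = (pred - pred.min()) / (pred.max() - pred.min() + 1e-8)
saliency = np.array(Image.fromarray((pred * 255).astype(np.uint8)).resize(image.size))
Image.fromarray(saliency).save("saliency.png")
```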

Each individual pixel is a float between 0 and 1, which means that this saliency map does handle transparency!

Without any further modification, we can apply the saliency map as a mask on our image with a simple element-wise multiplication using numpy.
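Continuing the sketch above, the mask is applied by broadcasting the saliency values over the color channels; `image` and `saliency` are assumed to come from the previous snippet.

```python
import numpy as np
from PIL import Image

# `image` and `saliency` come from the previous snippet: the original RGB image
# and the saliency map resized to the same resolution (uint8, 0-255).
rgb = np.asarray(image, dtype=np.float32)
alpha = saliency.astype(np.float32) / 255.0  # back to floats in [0, 1]

# Element-wise multiplication, broadcasting the single-channel mask over
# the three color channels. Background pixels fade to black.
masked = (rgb * alpha[..., None]).astype(np.uint8)
Image.fromarray(masked).save("masked.jpg")

# Alternatively, keep the saliency map as an alpha channel for real transparency.
rgba = np.dstack([rgb, alpha * 255.0]).astype(np.uint8)
Image.fromarray(rgba, mode="RGBA").save("masked.png")
```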

Result of applying the U²-Net saliency map as a mask

The model performs very well. We can still see a bit of the floor through the chair, but that is not a problem in our case. In general the result is near perfect, but there may be some artifacts left, especially in the case of objects with ‘gaps’ in them like this chair.

This result is obtained without any pre- or post-processing of the image. You could, for example, round high values up to 1 and low values down to 0 to avoid the ‘fuzziness’ around the object, or just sharpen the mask instead.
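A rough post-processing sketch along those lines, reusing `alpha` from the previous snippet; the 0.9 and 0.1 cut-offs are arbitrary placeholders worth tuning.

```python
# Snap confident predictions to fully opaque / fully transparent to remove
# the fuzzy halo around the object. `alpha` is the [0, 1] mask from above.
alpha_sharp = alpha.copy()
alpha_sharp[alpha_sharp >= 0.9] = 1.0
alpha_sharp[alpha_sharp <= 0.1] = 0.0
```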

The model also comes with a smaller architecture (u2netp), which weighs only 4.7 MB compared to 176 MB for the original, with similar results!
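Assuming the same repository layout, swapping in the lighter variant is just a matter of importing the other class and loading the matching weights:

```python
import torch
from model import U2NETP  # lightweight variant shipped in the same repository

# Same 3-channel input and 1-channel saliency output, ~4.7 MB of weights
# (u2netp.pth) instead of 176 MB.
small_net = U2NETP(3, 1)
small_net.load_state_dict(torch.load("u2netp.pth", map_location="cpu"))
small_net.eval()
```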

Such a small footprint comes in handy when we want to deploy the model to a cloud-based platform that restricts model size. For example, AWS Lambda, which lets you run a highly scalable API without setting up your own server, has strict limits on RAM usage and disk space.

Here are the results on a sample of images in different settings: