Modeling a Background Clean-up Deep Learning Model — Part I

Source: Deep Learning on Medium

This series of articles aims to give an idea of how to build a deep learning model that removes the background from portrait images.

In Part I of the series, the focus will be on the following:

  • Describing the task at hand.
  • How Deep Learning emerged as a good choice for the task.
  • A brief introduction to Deep Neural Networks.
  • An overview of our selected architecture behind the background clean-up model.


Foreground/background subtraction is one of the major problems in the field of computer vision and image processing. It involves identifying the foreground (in this case a human) and separating it from the background, usually with the aim of replacing the background with a plain white colour.

Sample Background Clean-up from our model

In our case, however, the idea is used to further optimize our capture process. We want to be able to capture images against different backgrounds and still ensure that they all end up with a consistent white background. This spares clients from worrying about noisy backgrounds and also makes our data capture uniform.

Deep Learning to The Rescue.

While researching approaches, we first tried simple thresholding in CIELUV. CIELUV is a color space designed for perceptual uniformity, i.e., the distance between two colors in the space roughly matches how different they look to a human observer. Hence, we can set a particular color as a target and look for pixels that are very close to it. However, this works well only for images with less noisy backgrounds and fails badly on images whose background color is close to human skin tones.
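The thresholding idea can be sketched in a few lines. This is a minimal illustration and not our production code: the sRGB-to-XYZ matrix and D65 white point are the standard published values, while the function names, the white target colour and the threshold of 10 are assumptions chosen for the example.

```python
import numpy as np

# D65 reference white and the matrix that maps linear sRGB to CIE XYZ.
WHITE_XYZ = np.array([0.95047, 1.0, 1.08883])
RGB_TO_XYZ = np.array([
    [0.4124564, 0.3575761, 0.1804375],
    [0.2126729, 0.7151522, 0.0721750],
    [0.0193339, 0.1191920, 0.9503041],
])


def _chromaticity(xyz):
    """Return the (u', v') chromaticity coordinates for XYZ values."""
    x, y, z = xyz[..., 0], xyz[..., 1], xyz[..., 2]
    denom = x + 15.0 * y + 3.0 * z
    denom = np.where(denom == 0.0, 1.0, denom)  # avoid 0/0 for pure black
    return 4.0 * x / denom, 9.0 * y / denom


def rgb_to_luv(rgb):
    """Convert an (H, W, 3) uint8 sRGB image to CIELUV."""
    c = rgb.astype(np.float64) / 255.0
    linear = np.where(c <= 0.04045, c / 12.92, ((c + 0.055) / 1.055) ** 2.4)
    xyz = linear @ RGB_TO_XYZ.T
    y = xyz[..., 1] / WHITE_XYZ[1]
    lightness = np.where(y > (6 / 29) ** 3, 116.0 * np.cbrt(y) - 16.0, (29 / 3) ** 3 * y)
    u, v = _chromaticity(xyz)
    u_n, v_n = _chromaticity(WHITE_XYZ)
    return np.stack(
        [lightness, 13 * lightness * (u - u_n), 13 * lightness * (v - v_n)], axis=-1
    )


def background_mask(rgb, target_rgb, threshold=10.0):
    """True where a pixel lies within `threshold` CIELUV units of the target colour."""
    target = rgb_to_luv(np.array(target_rgb, dtype=np.uint8).reshape(1, 1, 3))
    distance = np.linalg.norm(rgb_to_luv(rgb) - target, axis=-1)
    return distance < threshold


# Toy image: near-white background with a red "subject" in the middle.
image = np.full((4, 4, 3), 250, dtype=np.uint8)
image[1:3, 1:3] = (200, 30, 30)
mask = background_mask(image, target_rgb=(255, 255, 255))
```

On this toy image the mask marks the twelve near-white pixels as background and leaves the red square alone. A skin-coloured wall behind a person would sit close to the skin pixels in CIELUV, which is exactly where a single threshold stops being useful.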

Since there is no single way to characterize the colour of the clothes an individual will be wearing, this approach also did not help with extracting the individual's clothes along with the individual. Given the setbacks encountered with thresholding, we turned to deep learning.

Deep learning has seen many applications in computer vision and image processing, including image classification, image color restoration, pixel reconstruction, pose estimation, photo description, object detection and object segmentation.

We built on the concept of object segmentation to implement a background removal model that separates the human foreground from the background and ultimately places it on a white background.

Simple and Intuitive Understanding of Deep Learning

To get an intuition of the principles that guide the background removal model, it’s important to introduce the concepts of Machine learning and Deep Learning.

In the simplest form, machine learning algorithms are mathematical functions designed to learn some of their variables from data. Once these learned variables reach optimal values, the function can make correct predictions on data that belongs to a distribution similar to that of the training data.

Below is a simple machine learning function, for example a linear one:

Y = aX + b

Usually, X and Y are the training data and the prediction label respectively, and a and b are the learnable values. After training for a while on different pairs of X and Y, a and b reach optimal values such that, for any new X that belongs to the distribution of the training data, the function predicts a Y that is expected to be close to the true value for that X.
This approach of training a function has often been shown to give impressive results, and it has been applied in areas such as recommendation services and stock market prediction.
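To make the idea of learnable values concrete, here is a minimal sketch that assumes the linear form Y = aX + b and learns a and b by gradient descent on the mean squared error. The training data, learning rate and step count are made-up values for illustration only.

```python
def fit_line(xs, ys, learning_rate=0.1, steps=2000):
    """Learn a and b in Y = a*X + b by gradient descent on mean squared error."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Difference between the current prediction and the label for each sample.
        errors = [(a * x + b) - y for x, y in zip(xs, ys)]
        grad_a = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b


# Toy training data generated from Y = 2X + 1 (hypothetical values).
xs = [i / 10.0 for i in range(-10, 11)]
ys = [2.0 * x + 1.0 for x in xs]
a, b = fit_line(xs, ys)
prediction = a * 0.5 + b  # predict Y for a new X inside the training range
```

After training, a and b end up very close to 2 and 1, so the prediction for X = 0.5 is close to 2.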

This basic idea is what is extended into what we know as Deep Learning.

Deep Learning is a subset of Machine Learning that uses a special and quite complex mathematical function known as a Neural Network. The term deep is attributed to these functions because multiple hidden layers (as seen in the figure above) are stacked together to build a strong, accurate network.
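As a rough sketch of what stacking hidden layers means, the snippet below passes a batch of inputs through a small fully connected network. The layer sizes, the ReLU activation and the random weights are assumptions for illustration; in a trained network the weights and biases would be the learned values.

```python
import numpy as np


def relu(x):
    """Element-wise rectified linear unit."""
    return np.maximum(x, 0.0)


def forward(x, layers):
    """Pass x through each (weights, bias) pair; ReLU on hidden layers only."""
    for index, (weights, bias) in enumerate(layers):
        x = x @ weights + bias
        if index < len(layers) - 1:
            x = relu(x)
    return x


rng = np.random.default_rng(0)
sizes = [3, 8, 8, 2]  # input, two hidden layers, output (arbitrary sizes)
layers = [
    (rng.normal(scale=0.5, size=(n_in, n_out)), np.zeros(n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]
batch = rng.normal(size=(5, 3))  # five made-up input samples
outputs = forward(batch, layers)  # one row of two output values per sample
```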

Overview of the Deep Learning Components Behind Background Cleanup Model

Much research has gone into developing different Neural Network architectures for many applications. For background removal, our focus was on architectures built for object segmentation. Object segmentation involves locating and extracting objects in a digital image by predicting the class of every pixel, with the goal of simplifying the image into a representation that is more meaningful.

There are basically two groups of this approach: Instance Segmentation and Semantic Segmentation. Our background removal application uses semantic segmentation to classify each pixel of an image as either background or foreground (human). Example architectures for semantic segmentation include U-Net, SegNet and Fully Convolutional Networks (FCN) with CRF.
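Once a semantic segmentation model has produced a score per class for every pixel, turning that into a clean white background is straightforward. The sketch below assumes a score map of shape (height, width, 2) with background at index 0 and foreground at index 1; the class ids, function names and the toy scores are hypothetical.

```python
import numpy as np

BACKGROUND, FOREGROUND = 0, 1  # hypothetical class ids


def scores_to_mask(scores):
    """Pick the highest-scoring class for every pixel of an (H, W, C) score map."""
    return np.argmax(scores, axis=-1)


def whiten_background(image, mask):
    """Return a copy of the image with every background pixel set to white."""
    result = image.copy()
    result[mask == BACKGROUND] = 255
    return result


# A 2x2 toy image and made-up per-pixel scores (background, foreground).
image = np.array([[[10, 20, 30], [40, 50, 60]],
                  [[70, 80, 90], [100, 110, 120]]], dtype=np.uint8)
scores = np.array([[[0.9, 0.1], [0.2, 0.8]],
                   [[0.3, 0.7], [0.6, 0.4]]])
mask = scores_to_mask(scores)
cleaned = whiten_background(image, mask)
```

Here the top-left and bottom-right pixels are classified as background and become white, while the other two keep their original colours.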

For this discussion, our architecture has 4 basic components, all of which represent a series of mathematical functions containing values to be learned.

  1. Deep Convolutional Neural Network
    Intuitively, a standard Deep Convolutional Neural Network consists of multiple layers of a convolution function. Each layer slides filters (N x N arrays) over the pixels of an image, which is usually represented as a multi-dimensional array, to generate features that characterize the image. Stacking several of these convolution layers hierarchically has been shown to generate complex features that can adequately classify a pixel as background or foreground.
  2. Atrous Convolution
    Atrous Convolution, also known as Dilated Convolution, is a variation on the standard convolution. It introduces a dilation rate that deliberately spaces out the kernel elements as the kernel moves over the image arrays, so that the same kernel captures a wider field of view when generating the complex features that represent the image.
  3. Atrous Spatial Pyramid Pooling
    This layer extends Atrous Convolution by applying a series of Atrous Convolutions with different dilation rates to the same feature map and fusing the results. Its main aim is to capture objects at different scales in the image.
  4. Decoder Neural Network
    The final component, which was added in a much later version of our architecture, is a decoding layer, which is often an Up-sampling layer. An Up-sampling layer converts a low-resolution image into a high-resolution one while trying to keep as much information from the low-resolution image as possible. It is used to bring the result from Atrous Spatial Pyramid Pooling back to the size of the input and to refine the foreground edges. The specific approach used is Bilinear Up-sampling.
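The four components above can be illustrated with a simplified, single-channel sketch. It is a toy under stated assumptions, not our actual network: the kernels are fixed rather than learned, the pyramid fuses its branches by plain averaging (real implementations typically concatenate the branches and mix them with another convolution), and the up-sampling uses the corner-aligned variant of bilinear interpolation. All function names are hypothetical.

```python
import numpy as np


def conv2d(feature, kernel, rate=1):
    """Single-channel 'same' convolution for odd kernels; rate > 1 makes it atrous.

    As in most deep learning libraries, the kernel is not flipped.
    """
    k = kernel.shape[0]
    pad = rate * (k - 1) // 2
    padded = np.pad(feature, pad)  # zero padding keeps the output size
    height, width = feature.shape
    out = np.zeros(feature.shape, dtype=np.float64)
    for i in range(k):
        for j in range(k):
            # Kernel taps are spaced `rate` pixels apart.
            window = padded[i * rate:i * rate + height, j * rate:j * rate + width]
            out += kernel[i, j] * window
    return out


def aspp(feature, kernels, rates):
    """Run one atrous convolution per rate on the same map and average the results."""
    branches = [conv2d(feature, kern, rate) for kern, rate in zip(kernels, rates)]
    return np.mean(branches, axis=0)


def bilinear_upsample(feature, out_height, out_width):
    """Resize a 2-D map with corner-aligned bilinear interpolation."""
    in_height, in_width = feature.shape
    rows = np.linspace(0, in_height - 1, out_height)
    cols = np.linspace(0, in_width - 1, out_width)
    r0 = np.floor(rows).astype(int)
    c0 = np.floor(cols).astype(int)
    r1 = np.minimum(r0 + 1, in_height - 1)
    c1 = np.minimum(c0 + 1, in_width - 1)
    wr = (rows - r0)[:, None]  # vertical interpolation weights
    wc = (cols - c0)[None, :]  # horizontal interpolation weights
    top = feature[r0][:, c0] * (1 - wc) + feature[r0][:, c1] * wc
    bottom = feature[r1][:, c0] * (1 - wc) + feature[r1][:, c1] * wc
    return top * (1 - wr) + bottom * wr


feature = np.zeros((5, 5))
feature[2, 2] = 1.0  # a single bright pixel
kernel = np.ones((3, 3))
standard = conv2d(feature, kernel, rate=1)  # responds within a 3x3 area
atrous = conv2d(feature, kernel, rate=2)    # same 9 weights, 5x5 reach
fused = aspp(feature, [kernel, kernel], rates=[1, 2])
upsampled = bilinear_upsample(fused, 10, 10)
```

With a dilation rate of 2, the same nine kernel weights respond to the bright pixel from up to two pixels away, which is the wider field of view described above without any extra values to learn.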

It is also important to mention that several base models can be used with our selected architecture. Recommended base models include Xception, ResNet-101 and the mobile-friendly MobileNet. Architectural changes introduced in MobileNet make its accuracy nearly comparable to that of the other models despite it having fewer learnable parameters, which is what makes it suitable for mobile devices.

To conclude this part of the article, it is important to review the pipeline involved in any machine learning project, which also applies to training a deep neural network. A typical ML pipeline involves the following:

  • Obtaining the dataset (both the data itself, X, and the corresponding labels, Y) on which your network or model will learn.
  • Preprocessing the data.
  • Building the network or model.
  • Running the training on this data.
  • Evaluating the model.
  • Deploying the model.
  • Monitoring the model.
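For the evaluation step, one commonly used measure for segmentation masks is intersection over union (IoU). This article does not state which metric we used, so treat the sketch below as an assumption; the function name and the toy masks are hypothetical.

```python
import numpy as np


def intersection_over_union(predicted, truth):
    """IoU between two foreground masks of the same shape (non-zero = foreground)."""
    predicted = np.asarray(predicted, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    union = np.logical_or(predicted, truth).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return float(np.logical_and(predicted, truth).sum() / union)


# Toy masks: 1 marks a foreground pixel.
truth = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 0]])
predicted = np.array([[1, 1, 1], [1, 0, 0], [0, 0, 0]])
score = intersection_over_union(predicted, truth)
```

The two masks share three foreground pixels out of five marked in either mask, so the score is 0.6. A score of 1.0 would mean the predicted mask matches the label exactly.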

In part II of this series, the focus will be on a further description of the model architecture, the structure and why our selected architecture is suitable for image segmentation.

Are you a machine learning guru who’d like to share more insights, or a dilettante who would like to learn more? Leave your comments below and I’d be more than happy to respond. Thanks.