Everything you need to know about Auto-DeepLab: Google’s latest on Segmentation

Source: Deep Learning on Medium

Every previous version of DeepLab (v1, v2, v3, and v3+) advanced the state of the art in semantic segmentation when it was published. Meanwhile, Neural Architecture Search (NAS) had produced networks that beat human-designed ones on image recognition. The natural next step was to apply NAS to a dense image prediction problem such as semantic segmentation. This story (article 2 of the series “Everything you need to know about …”) is a review of the paper “Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation”, presented at CVPR 2019.


  1. Introduction
  2. Architecture Search Space
  3. Getting to the model
  4. Experimental Results
  5. Further Reading


Introduction

Semantic segmentation was first tackled with image-classification CNNs such as AlexNet, VGG, Inception, and ResNet, and later with improved versions such as Wide ResNet, Xception, and ResNeXt. All of these, however, are generic feature extractors pressed into service for the problem at hand. They were followed by models designed specifically for segmentation, for instance the DeepLab series, which proposed the use of atrous convolutions, and PSPNet; these topped the segmentation benchmarks on several datasets when they were published. Most models in use today rely on ImageNet pre-training, whereas the model proposed here produces competitive results without such aid.

To reduce the compute needed for the search, the authors use differentiable NAS instead of older techniques based on evolutionary algorithms or reinforcement learning. The architecture is thus read off from a continuous latent space optimized by gradient descent, rather than found by a completely random search.
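As a rough illustration of what "differentiable" means here, the sketch below relaxes a discrete choice between operations into a softmax-weighted mixture, which is the general idea behind differentiable NAS. It is a toy under stated assumptions, not the paper's implementation: the scalar "operations" and the names `CANDIDATE_OPS`, `mixed_op`, and `discretize` are hypothetical stand-ins.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy candidate operations acting on a scalar "feature".
# Hypothetical stand-ins for real layers such as convolutions or pooling.
CANDIDATE_OPS = {
    "identity": lambda x: x,
    "double": lambda x: 2.0 * x,
    "zero": lambda x: 0.0,
}


def mixed_op(x, alphas):
    """Continuous relaxation: softmax-weighted sum of every candidate's output.

    `alphas` holds one architecture parameter per candidate. Because the
    output is a smooth function of `alphas`, they can be trained by
    gradient descent alongside the ordinary network weights.
    """
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, CANDIDATE_OPS.values()))


def discretize(alphas):
    """After the search, keep only the candidate with the largest parameter."""
    names = list(CANDIDATE_OPS)
    best = max(range(len(alphas)), key=lambda i: alphas[i])
    return names[best]
```

With equal parameters the mixture is a plain average of the candidates; as training pushes one parameter up, the mixture approaches that single operation, and `discretize` picks it as the final choice.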

Further, NAS has previously been used to search for a CNN cell that is then repeated to form the whole network. In other words, it has operated only at the level of the cell interior and not on the network structure. Here, the authors propose using NAS to search the network-level structure as well, alongside the usual task of searching the cell.

Architecture Search Space

Cell-Level Search Space

To search for a cell, we first need to define one. Here, a cell is a fully convolutional module consisting of B blocks. Each block is a two-branch structure defined by a 5-tuple (I₁, I₂, O₁, O₂, C), where I₁, I₂ ∈ Iᵢˡ are the inputs chosen from Iᵢˡ, the set of all possible inputs for block i in layer l; O₁, O₂ are the layer types chosen for the two branches; and C is the method used to combine the two branch outputs into the block’s output Hᵢˡ. The cell’s output tensor Hˡ is simply the concatenation of the block outputs. The set of possible input tensors for a block contains the outputs of the previous two cells, Hˡ⁻¹ and Hˡ⁻². In addition, the output of block j in a cell can be fed to every block i of the same cell where i > j.
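The 5-tuple and the input set can be written down as a small data structure. This is only a sketch of the definition above: the class `Block`, the function `candidate_inputs`, the string labels for tensors, and the placeholder operation names are all hypothetical, chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Block:
    """One block of a cell, i.e. the 5-tuple (I1, I2, O1, O2, C)."""
    input_1: str   # I1: tensor read by the first branch
    input_2: str   # I2: tensor read by the second branch
    op_1: str      # O1: layer type applied on the first branch
    op_2: str      # O2: layer type applied on the second branch
    combine: str   # C: how the two branch outputs are merged


def candidate_inputs(layer: int, block_index: int) -> List[str]:
    """Tensors that block `block_index` (1-based) of cell `layer` may read.

    The set holds the outputs of the previous two cells plus the outputs
    of all earlier blocks (j < i) in the same cell.
    """
    previous_cells = [f"H^{layer - 1}", f"H^{layer - 2}"]
    earlier_blocks = [f"H^{layer}_{j}" for j in range(1, block_index)]
    return previous_cells + earlier_blocks


# Example: the third block of cell 5 can read four different tensors.
inputs = candidate_inputs(layer=5, block_index=3)
example_block = Block(inputs[0], inputs[2], "op_a", "op_b", "add")
```

Note how the input set grows by one tensor per block: the first block of a cell can only read from the two previous cells, while the fifth block has six candidates.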

Further, the types of branches that will be considered in every block are the following:

Depthwise-separable convolutions were popularized by Xception and MobileNet, and atrous convolutions by DeepLab.

Network-Level Search Space

The network-level search space is needed because this is a dense image prediction problem: networks for such problems typically start from a high-resolution image, reduce the spatial resolution somewhere along the way, and then bring it back up to the original resolution. Here, the authors restrict the total downsampling factor to 32. The spatial resolution can change only by a factor of 2 from one layer to the next. Further, the total number of layers, i.e. the number of cells stacked, is set to 12.
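To get a feel for how large this network-level space is, the constraints above can be turned into a small counting routine. The sketch assumes that the first cell is reached from a downsampling factor of 4 and that each layer may keep the factor, halve it, or double it within {4, 8, 16, 32}; the starting point and the names `RATES`, `next_rates`, and `count_paths` are assumptions made for illustration.

```python
RATES = (4, 8, 16, 32)  # allowed downsampling factors


def next_rates(s):
    """Downsampling factors reachable from `s` in one layer.

    The factor can be halved, kept, or doubled, as long as the result
    stays inside the allowed set.
    """
    return [r for r in (s // 2, s, s * 2) if r in RATES]


def count_paths(num_layers=12, start=4):
    """Count network-level paths, grouped by their final downsampling factor.

    Dynamic programming over layers: counts[s] is the number of distinct
    paths that sit at factor `s` after the layers processed so far.
    """
    counts = {start: 1}
    for _ in range(num_layers):
        new_counts = {}
        for s, n in counts.items():
            for r in next_rates(s):
                new_counts[r] = new_counts.get(r, 0) + n
        counts = new_counts
    return counts


if __name__ == "__main__":
    per_final_rate = count_paths()
    print(per_final_rate)
    print("total paths:", sum(per_final_rate.values()))
```

Each path through this grid is one candidate network backbone, which is why the network level is searched jointly with the cell rather than enumerated by hand.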

The figure below is a schematic diagram of the latent space. On the left is the latent space of the whole network: a grid of spatial downsampling factor versus layer number. During architecture search, an ASPP module, which is used for segmentation, is attached at every possible final downsampling factor (s = 4, 8, 16, 32).

On the right is the latent space for a cell. It contains five blocks with all possible connections. These connections are explained in the next section.