Review DeepLabv3 (Semantic Segmentation)

Original article was published on Deep Learning on Medium

  • (a): With Atrous Spatial Pyramid Pooling (ASPP), the network can encode multi-scale contextual information.
  • (b): With an Encoder-Decoder Architecture, the location/spatial information is recovered. Encoder-decoder architectures have proven useful in work such as FPN, DSSD, SegNet, and U-Net for a variety of purposes.
  • (c): DeepLabv3+ makes use of both (a) and (b).

1. Atrous Separable Convolution

1.1. Atrous Convolution

Atrous Convolution with Different Rates r

Atrous Convolution

  • For each location i on the output y and a filter w, atrous convolution is applied over the input feature map x where the atrous rate r corresponds to the stride with which we sample the input signal.
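
Written out for a one-dimensional signal, the operation described above takes the following form in Chen et al. (2017). The index k over the filter taps is notation assumed here, since it does not appear in the text. Setting r = 1 recovers standard convolution.

```latex
y[i] = \sum_{k} x[i + r \cdot k] \, w[k]
```
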
An atrous 2D convolution using a 3×3 kernel with an atrous rate of 2 and no padding. From “An Introduction to Different Types of Convolutions in Deep Learning” by Paul-Louis Pröve
The top network is a regular CNN, while the bottom one applies atrous convolutions with r > 1 in cascade, giving an output_stride of 16.
Parallel modules with atrous convolutions. From Chen, L.-C., Papandreou, G., Schroff, F., & Adam, H., 2017
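
As a rough illustration of atrous convolution, the sketch below applies a 1D filter while sampling the input every `rate` positions, with no padding. The function name `atrous_conv1d` and the toy signal are illustrative assumptions, not code from the paper.

```python
def atrous_conv1d(x, w, rate=1):
    """Compute y[i] = sum_k x[i + rate * k] * w[k] with no padding."""
    # A filter with len(w) taps spread `rate` apart covers this many inputs.
    span = (len(w) - 1) * rate + 1
    n_out = len(x) - span + 1
    if n_out <= 0:
        raise ValueError("input is shorter than the dilated filter")
    return [
        sum(x[i + rate * k] * w[k] for k in range(len(w)))
        for i in range(n_out)
    ]


# Toy data (assumed for illustration): a ramp and a difference filter.
signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]

dense = atrous_conv1d(signal, kernel, rate=1)    # taps at i, i+1, i+2
dilated = atrous_conv1d(signal, kernel, rate=2)  # taps at i, i+2, i+4
```

With the same three weights, the rate-2 filter covers five input samples instead of three, which is how atrous convolution enlarges the field of view without adding parameters.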

1.2. Atrous Separable Convolution

Depthwise Separable Convolution Using Atrous Convolution

  • (a) and (b), Depthwise Separable Convolution: It factorizes a standard convolution into a depthwise convolution followed by a point-wise convolution (i.e., a 1×1 convolution), which drastically reduces computational complexity.
  • This is introduced in MobileNetV1.
  • (c) Atrous Depthwise Convolution: Atrous convolution is supported in the depthwise convolution. It is found to significantly reduce the computational complexity of the proposed model while maintaining similar (or better) performance.
  • Combining the atrous depthwise convolution with the point-wise convolution gives the Atrous Separable Convolution.
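
To make the complexity reduction concrete, here is a minimal sketch that counts multiply-adds for a standard convolution and for its depthwise separable factorization. The layer sizes are hypothetical, chosen only for illustration. The atrous rate does not appear in either count because it changes where the filter taps are sampled, not how many there are.

```python
def standard_conv_cost(k, c_in, c_out, out_h, out_w):
    """Multiply-adds of a standard k x k convolution."""
    return k * k * c_in * c_out * out_h * out_w


def separable_conv_cost(k, c_in, c_out, out_h, out_w):
    """Multiply-adds of a depthwise k x k convolution plus a 1x1 point-wise one."""
    depthwise = k * k * c_in * out_h * out_w   # one k x k filter per input channel
    pointwise = c_in * c_out * out_h * out_w   # 1x1 convolution mixes the channels
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 256 -> 256 channels, 32x32 output map.
k, c_in, c_out, out_h, out_w = 3, 256, 256, 32, 32
standard = standard_conv_cost(k, c_in, c_out, out_h, out_w)
separable = separable_conv_cost(k, c_in, c_out, out_h, out_w)

# The ratio simplifies to 1 / c_out + 1 / k**2.
ratio = separable / standard
print(f"separable / standard = {ratio:.3f}")
```

For a 3×3 kernel the ratio is dominated by the 1/9 term, so the separable version needs roughly 8 to 9 times fewer multiply-adds when the number of output channels is large.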

2. Encoder-Decoder Architecture

DeepLabv3+ Extends DeepLabv3
  • For the task of image classification, the spatial resolution of the final feature maps is usually 32 times smaller than the input image resolution and thus output stride = 32.
  • For the task of semantic segmentation, feature maps at this resolution are too coarse.
  • One can adopt output stride = 16 (or 8) for denser feature extraction by removing the striding in the last one (or two) block(s) and applying the atrous convolution correspondingly.
  • Additionally, DeepLabv3 augments the Atrous Spatial Pyramid Pooling module, which probes convolutional features at multiple scales by applying atrous convolution with different rates, with the image-level features.
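
The stride-for-rate swap described above can be sketched as follows. This is a simplified, stage-level model under stated assumptions: each backbone stage is represented only by the stride at its entry, and the function name `plan_output_stride` and the five-stage layout are hypothetical.

```python
def plan_output_stride(stage_strides, output_stride):
    """Return a (stride, atrous_rate) pair per stage for a target output stride."""
    plan = []
    current_stride = 1
    rate = 1
    for stride in stage_strides:
        if current_stride * stride > output_stride:
            # Striding here would overshoot the target: keep the resolution
            # and dilate this stage (and later ones) to compensate.
            rate *= stride
            plan.append((1, rate))
        else:
            current_stride *= stride
            plan.append((stride, rate))
    return plan


# Assumed classification backbone: five stride-2 stages, output stride 32.
backbone = [2, 2, 2, 2, 2]

plan_32 = plan_output_stride(backbone, 32)
plan_16 = plan_output_stride(backbone, 16)
plan_8 = plan_output_stride(backbone, 8)
```

Under these assumptions, `plan_16` removes the stride of the last stage and gives it rate 2, while `plan_8` removes the stride of the last two stages and gives them rates 2 and 4.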

2.2. Proposed Decoder

  • The encoder features are first bilinearly upsampled by a factor of 4 and then concatenated with the corresponding low-level features.
  • A 1×1 convolution is applied to the low-level features before concatenation to reduce the number of channels, since the corresponding low-level features usually contain a large number of channels (e.g., 256 or 512), which may outweigh the importance of the rich encoder features.
  • After the concatenation, we apply a few 3×3 convolutions to refine the features followed by another simple bilinear upsampling by a factor of 4.
  • This works much better than bilinearly upsampling by 16× directly.
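
The decoder steps can be followed as a trace of tensor shapes. The sketch below does no real computation; it only tracks (height, width, channels). The 512×512 input, the 48 reduced channels, the 256 refinement channels, the 21 classes, and the final classifier step are assumptions made for illustration and are not stated in the text above.

```python
def decoder_shapes(input_hw, output_stride=16, low_level_stride=4,
                   encoder_channels=256, low_level_channels=256,
                   reduced_channels=48, refine_channels=256, num_classes=21):
    """Trace (height, width, channels) through a DeepLabv3+-style decoder."""
    h, w = input_hw
    if h % output_stride or w % output_stride:
        raise ValueError("input size must be divisible by output_stride")

    # Encoder output and the low-level backbone features.
    encoder = (h // output_stride, w // output_stride, encoder_channels)
    low_level = (h // low_level_stride, w // low_level_stride, low_level_channels)

    # Step 1: bilinear upsampling of the encoder features (4x for 16 -> 4).
    factor = output_stride // low_level_stride
    upsampled = (encoder[0] * factor, encoder[1] * factor, encoder_channels)

    # Step 2: a 1x1 convolution shrinks the low-level channels.
    reduced = (low_level[0], low_level[1], reduced_channels)

    # Step 3: concatenate along the channel axis.
    concatenated = (reduced[0], reduced[1], encoder_channels + reduced_channels)

    # Step 4: 3x3 convolutions refine the features; an assumed classifier
    # layer then maps them to per-class scores.
    refined = (reduced[0], reduced[1], refine_channels)
    scores = (reduced[0], reduced[1], num_classes)

    # Step 5: final bilinear upsampling back to the input resolution.
    output = (scores[0] * low_level_stride, scores[1] * low_level_stride,
              num_classes)

    return {"encoder": encoder, "upsampled": upsampled, "reduced": reduced,
            "concatenated": concatenated, "refined": refined, "output": output}


shapes = decoder_shapes((512, 512))
for name, shape in shapes.items():
    print(f"{name:>12}: {shape}")
```

The trace shows why the channel reduction matters: after the 1×1 convolution the low-level features contribute only a small share of the concatenated channels, so they do not drown out the encoder features.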


The authors propose an approach that updates prior DeepLab versions by adding batch normalization and image-level features to the atrous spatial pyramid pooling (ASPP) layers. As a result, the network can extract dense feature maps that capture long-range context, which improves segmentation performance. Their proposed model outperformed the state-of-the-art models on the PASCAL VOC 2012 semantic image segmentation benchmark.

From Chen, L.-C., Papandreou, G., Schroff, F., & Adam, H., 2017

DeepLab: Deep Labelling for Semantic Image Segmentation, Differences Between Versions

DeepLab is a state-of-the-art deep learning model for semantic image segmentation, where the goal is to assign semantic labels (e.g., person, dog, cat and so on) to every pixel in the input image. The current implementation includes the following features:

  1. DeepLabv1 : They use atrous convolution to explicitly control the resolution at which feature responses are computed within Deep Convolutional Neural Networks.
  2. DeepLabv2 : They use atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales with filters at multiple sampling rates and effective fields-of-views.
  3. DeepLabv3 : They augment the ASPP module with image-level features to capture longer-range information. They also include batch normalization parameters to facilitate training. In particular, they apply atrous convolution to extract output features at different output strides during training and evaluation, which efficiently enables training BN at output stride = 16 and attains high performance at output stride = 8 during evaluation.
  4. DeepLabv3+: They extend DeepLabv3 to include a simple yet effective decoder module to refine the segmentation results, especially along object boundaries. Furthermore, in this encoder-decoder structure one can arbitrarily control the resolution of the extracted encoder features by atrous convolution to trade off precision and runtime.