Neural Architecture Search on Semantic Segmentation

(1) The first attempt at semantic segmentation using NAS was presented by Liang-Chieh Chen et al. from Google [5]. Since DeepLabv3+ has achieved remarkable results, the DeepLab team shifted its emphasis towards automatic architecture search. Building on the same encoder-decoder structure as DeepLab, the work seeks a more efficient decoder to replace ASPP on top of an existing small encoder backbone. A recursive search space, called the Dense Prediction Cell (DPC), is built to encode multi-scale context information. An efficient random search method is applied as the search strategy. The work achieves 82.7% mean IoU (mIoU) on the Cityscapes dataset and 87.9% mIoU on the PASCAL VOC 2012 dataset. Nevertheless, the high computational cost (370 GPUs for a week) limits the applicability of this approach.

(2) Also from the Google DeepLab team, Auto-DeepLab [6] is much more efficient to search: it takes only 3 days on a single GPU. This is due to the efficient search strategy DARTS [9], a gradient-based approach from CMU and Google. As opposed to the previous work, Auto-DeepLab focuses on searching the encoder and employs ASPP as the decoder. In the experiments, Auto-DeepLab shows performance similar to DeepLabv3+ with fewer FLOPs and parameters. Details of DARTS and Auto-DeepLab will be discussed later in this article.

(3) The Customizable Architecture Search (CAS) [4], a joint work of USTC and the JD AI lab, designs three different types of cells: the normal cell, the reduction cell and a novel multi-scale cell inspired by ASPP. DARTS is used as the search strategy. It is worth mentioning that CAS validates not only the accuracy but also the GPU time, CPU time, FLOPs and number of parameters when evaluating the searched cells. These performance measures are very useful in real-time scenarios. It reports 72.3% mIoU on Cityscapes at a remarkable speed of 108 fps on an NVIDIA Titan Xp GPU.

(4) The group from the University of Adelaide has put a lot of effort into light-weight neural networks in recent years. In [7], the authors shift the research focus towards NAS to realize real-time semantic segmentation under limited resources. In the proposed Fast NAS, the decoder is searched based on reinforcement learning (RL). Since RL is notoriously time-consuming, knowledge distillation and Polyak averaging are leveraged to speed up convergence, which is the major contribution of this work.

(5) Graph-guided Architecture Search (GAS) [8], a more recent work from SenseTime, presents a novel search mechanism to efficiently search for a light-weight model. Different from the aforementioned approaches, it leverages Graph Convolutional Networks (GCN) to search the connection between each pair of nodes. Analogous to CAS [4], GAS also takes computational cost such as latency into account in the optimization. It reports 73.3% mIoU on Cityscapes at a speed of 102 fps on a Titan Xp GPU.

To summarize, NAS-based networks have achieved considerable success and many research labs have shifted their focus towards them. This demonstrates the feasibility of applying NAS to semantic segmentation for higher performance and accuracy. In particular, the search strategy DARTS makes it possible to build NAS-based networks for researchers without abundant GPU resources. Considering the availability of source code and the search cost, we implemented DARTS and Auto-DeepLab and introduce them in detail below.

3. DARTS: Differentiable Architecture Search

Usually the search space is discrete, consisting of a set of candidates. In [9], Liu et al. formulate the search space via a continuous relaxation and search the architecture in a differentiable manner. Thereby, the architecture can be optimized via gradient descent with respect to the performance estimate. Therefore, DARTS does not depend heavily on computational resources. In extensive experiments, DARTS is successfully applied to build CNN and RNN models for image classification and language modelling. Thanks to the public PyTorch source code at https://github.com/quark0/darts, we can better understand how DARTS and NAS work. It is a good foundation for building our own NAS networks.

As mentioned above, the majority of NAS models seek repeatable cells and stack them in a predefined manner. The cell is represented as a directed acyclic graph (DAG). Generally speaking, a DAG cell consists of an ordered sequence of N nodes. Each node x(i) is a latent representation and is computed as a combination of all of its predecessors. Each directed edge (i, j) is associated with an operation o(i,j) that transforms node x(i) on the way to node x(j). Each cell has two input nodes and one output node. There are many possible ways to connect one node to another. The figure below illustrates such a cell with 5 nodes (4 intermediate nodes and an output node) from [9]. This cell k has two input nodes from the previous cells, marked as c_{k-2} and c_{k-1}. Node c_{k} indicates the output of the cell, defined as the depthwise concatenation of all the intermediate nodes. The operations on the edges are “max_pool_3x3”, “skip_connect”, etc. There are many candidate operations on each edge. Searching for the best architecture is equivalent to selecting the optimal operation connecting each pair of nodes in the cell.

Reduction cell from DARTS: cell k consists of 4 intermediate nodes as well as an output node
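Written out, as in [9], each intermediate node sums the transformed outputs of all of its predecessors, and the cell output concatenates the intermediate nodes:

x(j) = Σ_{i<j} o(i,j)(x(i)),    c_k = concat of all intermediate nodes x(j)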

In DARTS, the connections between nodes are parameterized by weights, and the optimal operation on each edge can then be obtained from a softmax, specifically by selecting the candidate with the largest probability. This is exactly what makes the architecture representation a continuous relaxation. In [9], two types of cells are designed in the search space, the normal cell and the reduction cell. The cells are stacked in a predefined manner, where the reduction cells are placed at the 1/3 and 2/3 locations of the network. That is to say, the outer network architecture is fixed, while the inner cell structure changes during the search. Namely, the search learns the operation parameters W as well as the architecture weights Alpha on the edges. Based on the paper and the source code, we summarize the workflow of DARTS and illustrate the pipeline below.

Pipeline of DARTS

To start searching the network, the first step is to initialize the optimizer, the loss, the outer network architecture and the search space. In the search space, several basic operations are defined, such as 5×5 dilated convolution, 3×3 max pooling and skip connections. Afterwards, the normal cell and the reduction cell are learned based on these operations. In the source code, the operations and the candidate primitives are defined in operations.OPS and genotypes.PRIMITIVES. The parameters of the operations are denoted as W, and the weights Alpha between each pair of nodes are defined as arch_parameters() = {alphas_normal, alphas_reduce} in the source code.
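Concretely, during the search an edge (i, j) does not pick a single operation but computes a softmax(Alpha)-weighted sum of all candidates, o_bar(i,j)(x) = Σ_o softmax(α(i,j))_o · o(x). The sketch below illustrates this mixture for a single edge; it loosely follows the MixedOp module in the public code, but the reduced primitive list and the operation definitions here are simplified stand-ins, not the repo's exact implementations.

import torch
import torch.nn as nn
import torch.nn.functional as F

# A reduced set of candidate operations for one edge (the repo's
# genotypes.PRIMITIVES contains more candidates, including a zero op).
PRIMITIVES = ['skip_connect', 'max_pool_3x3', 'sep_conv_3x3']

OPS = {
    'skip_connect': lambda C: nn.Identity(),
    'max_pool_3x3': lambda C: nn.MaxPool2d(3, stride=1, padding=1),
    'sep_conv_3x3': lambda C: nn.Sequential(   # depthwise + pointwise conv
        nn.Conv2d(C, C, 3, padding=1, groups=C, bias=False),
        nn.Conv2d(C, C, 1, bias=False),
        nn.BatchNorm2d(C),
        nn.ReLU(inplace=True),
    ),
}

class MixedOp(nn.Module):
    """One edge of the cell: a softmax-weighted mixture of all candidates."""
    def __init__(self, C):
        super().__init__()
        self.ops = nn.ModuleList([OPS[name](C) for name in PRIMITIVES])

    def forward(self, x, alpha_edge):
        # alpha_edge: raw architecture logits for this edge, shape [len(PRIMITIVES)]
        weights = F.softmax(alpha_edge, dim=-1)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

# Architecture parameters: one row of logits per edge, learned by gradient descent.
num_edges, C = 14, 16
alphas_normal = nn.Parameter(1e-3 * torch.randn(num_edges, len(PRIMITIVES)))
edge = MixedOp(C)
out = edge(torch.randn(2, C, 32, 32), alphas_normal[0])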

In the search procedure, the training data is split into two separate sets: the training split is used to train the operation parameters W, and the validation split is used to evaluate the network and find the optimal Alpha. The search procedure is mainly implemented in architect.step in the code. The optimization objective is to minimize the validation loss with respect to the architecture parameters Alpha, where the network weights W are themselves obtained by minimizing the training loss for the current architecture. Namely, updating Alpha seeks the optimal operation on each edge. The objective function is defined in Eq (3) of the paper:
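With α denoting the architecture parameters (Alpha) and w the operation weights (W), this is a bilevel optimization problem:

min_α L_val(w*(α), α),    subject to    w*(α) = argmin_w L_train(w, α)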

To reduce the search time, the idea is to approximate w*(α) by adapting w using only a single training step. However, updating Alpha (Eq (7) in the paper) is a bit complex because it involves a second-order derivative. The relevant implementation can be found in the source code, e.g. architect._hessian_vector_product, and is not expanded here. After updating the model parameters W and Alpha, the best edges are selected mainly based on the softmax. The work retains the top-k strongest operations (those with the largest probabilities after taking the softmax) among all non-zero candidate operations collected from all previous nodes. The paper gives two reasons why only non-zero operations are selected. The first is to enable comparison with existing models, where exactly k non-zero operations are required. Besides, increasing the logits of the zero operation only affects the scale of the resulting node representations and does not affect the final classification outcome due to the presence of batch normalization. W and Alpha are learned iteratively until the search procedure terminates.
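For completeness, the approximation behind that code can be written down (following Eq (7) of the paper and the finite-difference trick that accompanies it): the architecture gradient is approximated with one virtual step on w using learning rate ξ,

∇_α L_val(w*(α), α) ≈ ∇_α L_val(w', α) − ξ ∇²_{α,w} L_train(w, α) ∇_{w'} L_val(w', α),    where w' = w − ξ ∇_w L_train(w, α)

and the expensive second-order term is evaluated by finite differences with a small scalar ε,

∇²_{α,w} L_train(w, α) ∇_{w'} L_val(w', α) ≈ (∇_α L_train(w⁺, α) − ∇_α L_train(w⁻, α)) / (2ε),    where w± = w ± ε ∇_{w'} L_val(w', α),

which is what architect._hessian_vector_product computes.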

Extensive experiments are conducted to validate DARTS; we take image classification on CIFAR-10 as an example. DARTS is compared with the manually designed DenseNet (Huang et al., CVPR 2017), which reports strong results on image classification, and with other popular NAS frameworks, e.g. AmoebaNet and ENAS. In the experimental results, DARTS is competitive in accuracy while being far cheaper computationally. It takes about 4 GPU days to search the model using only 7 operations, which is much faster than NASNet based on reinforcement learning. To date, DARTS is widely employed in NAS works owing to its simple, flexible and efficient framework. It is a good choice to start with DARTS when GPU resources are limited.

4. Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation

Drawing inspiration from DARTS, the Google DeepLab team proposed Auto-DeepLab to automatically search the encoder while keeping the decoder fixed. Since the search code has not been released to the public, we implement the NAS backbone following https://github.com/tensorflow/models/tree/master/research/deeplab and evaluate the provided model architecture. In this article, we would like to introduce the core idea of Auto-DeepLab and share our experimental results.

In DARTS, the outer network architecture is pre-defined and only the inner cell structure is searched. In contrast, Auto-DeepLab relaxes the network architecture and searches the outer structure as well. Following the encoder-decoder structure, Auto-DeepLab aims to search the encoder, followed by an ASPP block. Two types of DAG cells are defined in the search space, the normal cell and the reduction cell, as well as several basic operations, e.g. identity, atrous convolution and depthwise separable convolution. The reduction cell is used to change the spatial resolution: it can be doubled, halved or kept the same. In the semantic segmentation task, the smallest spatial resolution is downsampled by a factor of 32, i.e. the largest downsample rate is s = 32.

The figure below illustrates the search space for the outer network (upper) and the inner cell (lower). Auto-DeepLab defines 12 cells in the model, where each cell has five nodes as well as two extra input nodes and an output node. In the upper figure, the first two stem nodes are fixed to reduce the spatial resolution. Auto-DeepLab relaxes the outer network structure; consequently, there are several options for which type of cell to use next. As the gray arrows in the figure (upper) illustrate, the next cell might be either a normal cell at the same spatial resolution, or a reduction cell with twice larger or twice smaller spatial resolution, within the allowed range. We still denote the parameters of the operations as W and the weights on the edges as Alpha. Analogous to Alpha, we denote the scalars representing the connections between cells as Beta. From the cell structure (lower), the output of a cell is indeed a combination of three cells: the cell at the current spatial resolution, the cell at twice smaller spatial resolution and the cell at twice larger spatial resolution. Beta values can be interpreted as “transition probabilities” between different cells, thus the goal is to find the path with the maximum probability from start to end. For an intuitive view, we highlight the three selected nodes in blue, yellow and green. The output of the cell (blue shadow) is determined by the three cells with different scalars:
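Restated from the Auto-DeepLab paper, with ^sH^l denoting the feature map at layer l and downsample rate s:

^sH^l = β_{s/2→s}^l · Cell(^{s/2}H^{l−1}, ^sH^{l−2}; α) + β_{s→s}^l · Cell(^sH^{l−1}, ^sH^{l−2}; α) + β_{2s→s}^l · Cell(^{2s}H^{l−1}, ^sH^{l−2}; α)

so each Beta scalar weights the contribution of one of the three candidate predecessors shown in the figure.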

Search space for the outer network
Cell structure

In the optimization, the operation parameters W and the architecture parameters {Alpha, Beta} are updated iteratively. All the weights are non-negative. When decoding the final architecture, the cell structure is obtained from Alpha by taking the softmax and keeping the strongest operations, as in DARTS, while the best outer network architecture is decoded from Beta using the classic Viterbi algorithm, i.e. by finding the path with the maximum total transition probability. The figure (upper) illustrates the network architecture searched on the Cityscapes dataset; the corresponding backbone is defined as [0, 0, 0, 1, 2, 1, 2, 2, 3, 3, 2, 1], where each entry encodes the downsample rate level of the corresponding cell. In the model, there are two paths into each cell node, as illustrated in the figure (lower) below.
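As an illustration of this decoding step (a minimal sketch under our own assumptions, not the Auto-DeepLab implementation; the random transition scores stand in for the learned Beta values), a Viterbi pass over 12 layers and 4 downsample levels could look like this:

import numpy as np

def viterbi_decode(log_beta):
    """Find the most probable sequence of downsample levels.

    log_beta: array of shape [num_layers, num_levels, num_levels], where
        log_beta[l, i, j] is the log-probability of moving from level i at
        layer l-1 to level j at layer l (-inf for forbidden jumps).
    Returns one level index per layer.
    """
    num_layers, num_levels, _ = log_beta.shape
    score = np.full(num_levels, -np.inf)
    score[0] = 0.0                              # start at the highest resolution
    backptr = np.zeros((num_layers, num_levels), dtype=int)

    for l in range(num_layers):
        cand = score[:, None] + log_beta[l]     # [from_level, to_level]
        backptr[l] = cand.argmax(axis=0)
        score = cand.max(axis=0)

    # Backtrack from the best final level.
    path = [int(score.argmax())]
    for l in range(num_layers - 1, 0, -1):
        path.append(int(backptr[l, path[-1]]))
    return path[::-1]

# Toy example: 12 layers, 4 levels (downsample rates 4, 8, 16, 32),
# transitions allowed only between neighbouring levels.
rng = np.random.default_rng(0)
log_beta = np.full((12, 4, 4), -np.inf)
for i in range(4):
    for j in (i - 1, i, i + 1):
        if 0 <= j < 4:
            log_beta[:, i, j] = np.log(rng.uniform(0.1, 1.0, size=12))
print(viterbi_decode(log_beta))                 # prints one level index per layer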

Searched neural network architecture based on Cityscapes
Node structure

Extensive experiments are conducted to validate Auto-DeepLab on three commonly used datasets: PASCAL VOC 2012, Cityscapes and ADE20K. We summarize some results compared with DeepLabv3+; more explanations of the experiments can be found in the paper. Auto-DeepLab is denoted as Auto-DeepLab-L when 48 filters are used. The Cityscapes inputs have a size of 769 × 769, and the input size for ADE20K and PASCAL VOC 2012 is 513 × 513. Generally speaking, Auto-DeepLab attains performance very similar to the best DeepLabv3+ while being 2.33 times faster. Moreover, Auto-DeepLab can be trained from scratch.