Source: Deep Learning on Medium

(1) The first attempt at semantic segmentation using NAS was presented by Liang-Chieh Chen et al. from Google [5]. Since DeepLabv3+ had achieved remarkable results, the DeepLab team shifted its emphasis towards automatic architecture search. Building on the same encoder-decoder structure as DeepLab, the work aims to find a more efficient decoder to replace ASPP on top of an existing small encoder backbone. A recursive search space, called the Dense Prediction Cell (DPC), is built to encode multi-scale context information. An efficient random search method is applied as the search strategy. The work achieves 82.7% mean IoU (mIoU) on the Cityscapes dataset and 87.9% mIoU on the PASCAL VOC 2012 dataset. Nevertheless, the high computational cost (370 GPUs for a week) limits the applicability of this approach.

(2) Also from the Google DeepLab team, Auto-DeepLab [6] is much more efficient to search: it takes only 3 days on a single GPU. This is due to its efficient search strategy, DARTS [9], a gradient-based approach from CMU and Google. As opposed to the previous work, Auto-DeepLab focuses on searching the encoder and employs ASPP as the decoder. In the experiments, Auto-DeepLab shows performance similar to DeepLabv3+ but with fewer FLOPs and parameters. Details of DARTS and Auto-DeepLab will be discussed later in this article.

(3) The Customizable Architecture Search (CAS) [4], a joint work of USTC and the JD AI Lab, designs three different types of cells: the normal cell, the reduction cell and a novel multi-scale cell inspired by ASPP. DARTS is used as the search strategy. It is worth mentioning that it validates not only the accuracy but also the GPU time, CPU time, FLOPs and number of parameters when evaluating a searched cell. These performance measures are highly useful in real-time scenarios. It reports 72.3% mIoU on Cityscapes at a remarkable speed of 108 FPS on an NVIDIA Titan XP GPU.

(4) The group from the University of Adelaide has put considerable effort into light-weight neural networks in recent years. In [7], the authors shift their research focus towards NAS to realize real-time semantic segmentation under limited resources. In the proposed Fast NAS, the decoder is searched based on reinforcement learning (RL). Since RL is notoriously time-consuming, knowledge distillation and Polyak averaging are leveraged to speed up convergence, which is the major contribution of this work.

(5) Graph-guided Architecture Search (GAS) [8], a more recent work from SenseTime, presents a novel search mechanism to efficiently search for a light-weight model. Different from the aforementioned approaches, it leverages Graph Convolutional Networks (GCN) to search the connections between each pair of nodes. Analogous to CAS [4], GAS also takes computational costs such as latency into account during optimization. It reports 73.3% mIoU on Cityscapes at a speed of 102 FPS on a Titan XP GPU.

To summarize, NAS-based networks have achieved considerable success, and many research labs have shifted their focus towards them. This demonstrates the feasibility of applying NAS to semantic segmentation for higher performance and accuracy. In particular, the DARTS search strategy makes it possible to build NAS-based networks for researchers without large GPU resources. Considering the availability of source code and the computational cost of searching, we implement DARTS and Auto-DeepLab and introduce them in detail below.

**3. DARTS: Differentiable Architecture Search**

Usually the search space is discrete, with a set of candidates. In [9], Liu et al. formulate the search space as a continuous relaxation and search the architecture in a differentiable manner, whereby the architecture can be optimized via gradient descent with respect to the performance estimation. Therefore, DARTS does not depend heavily on computational resources. In extensive experiments, DARTS is successfully applied to build CNN and RNN models for image classification and language modelling. Thanks to the public PyTorch source code at https://github.com/quark0/darts, we can better understand how DARTS, and NAS in general, works. It is a good foundation for building our own NAS networks.
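To make the idea of differentiable search concrete, here is a minimal toy sketch of the alternating gradient updates DARTS performs, using the first-order approximation. Everything here is a placeholder for illustration: `ToyModel`, the optimizers and the random data are not the DARTS implementation, which additionally supports a second-order gradient approximation.

```python
import torch
import torch.nn.functional as F

class ToyModel(torch.nn.Module):
    """Toy stand-in: `w` plays the role of the operation parameters W,
    `alpha` the architecture weights Alpha (here just a learned bias)."""
    def __init__(self):
        super().__init__()
        self.w = torch.nn.Linear(4, 3)
        self.alpha = torch.nn.Parameter(torch.zeros(3))

    def forward(self, x):
        return self.w(x) + self.alpha

def search_step(model, w_opt, alpha_opt, train_batch, val_batch):
    """One alternating search step (first-order approximation of the
    bilevel objective): update Alpha on the validation split, then
    update W on the training split."""
    x_train, y_train = train_batch
    x_val, y_val = val_batch

    # 1) architecture step: minimize validation loss w.r.t. Alpha
    alpha_opt.zero_grad()
    F.cross_entropy(model(x_val), y_val).backward()
    alpha_opt.step()

    # 2) weight step: minimize training loss w.r.t. W
    w_opt.zero_grad()
    F.cross_entropy(model(x_train), y_train).backward()
    w_opt.step()

torch.manual_seed(0)
model = ToyModel()
w_opt = torch.optim.SGD(model.w.parameters(), lr=0.1)
alpha_opt = torch.optim.Adam([model.alpha], lr=0.01)
train_batch = (torch.randn(16, 4), torch.randint(0, 3, (16,)))
val_batch = (torch.randn(16, 4), torch.randint(0, 3, (16,)))
search_step(model, w_opt, alpha_opt, train_batch, val_batch)
```

Note that each optimizer only touches its own parameter group, which is how W and Alpha are kept on separate update schedules.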

As mentioned above, the majority of NAS models search for repeatable cells and stack them in a predefined manner. A cell is represented as a directed acyclic graph (DAG). Generally speaking, a DAG cell consists of an ordered sequence of N nodes. Each node x(i) is a latent representation and is computed as a combination of all its predecessors. Each directed edge (i, j) is associated with some operation o(i,j) transforming node i to node j. Each cell has two input nodes and one output node. There are many possible connections from one node to an adjacent node. The figure below, from [9], illustrates such a cell with 5 nodes (4 intermediate nodes and an output node). This *cell k* has two input nodes from the previous cells, marked c_{k-2} and c_{k-1}. Node c_{k} denotes the output of the cell, defined as the depthwise concatenation of all the intermediate nodes. The operations on the edges are “max_pool_3x3”, “skip_connect”, etc. There are many possible operations on each edge, so searching for the best architecture is equivalent to selecting the optimal operation connecting each pair of nodes in the cell.
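The cell structure above can be sketched in a few lines of PyTorch. This is a simplified illustration, not the authors' implementation: every edge here carries a single fixed 3×3 convolution, where DARTS would place a searched operation such as max_pool_3x3.

```python
import torch
import torch.nn as nn

class ToyCell(nn.Module):
    """Minimal DAG cell sketch: each intermediate node sums one operation
    applied to every predecessor; the cell output concatenates all
    intermediate nodes along the channel dimension."""

    def __init__(self, channels, num_nodes=4):
        super().__init__()
        self.num_nodes = num_nodes
        # one operation per directed edge (i -> j); a plain 3x3 conv here
        self.ops = nn.ModuleDict()
        for j in range(num_nodes):
            for i in range(j + 2):  # predecessors: 2 inputs + earlier nodes
                self.ops[f"{i}->{j}"] = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, c_prev_prev, c_prev):
        states = [c_prev_prev, c_prev]  # the two cell inputs c_{k-2}, c_{k-1}
        for j in range(self.num_nodes):
            # node x(j) combines all its predecessors via the edge operations
            s = sum(self.ops[f"{i}->{j}"](h) for i, h in enumerate(states))
            states.append(s)
        # output c_k: depthwise concatenation of the intermediate nodes
        return torch.cat(states[2:], dim=1)

x = torch.randn(1, 8, 16, 16)
cell = ToyCell(channels=8)
out = cell(x, x)
print(out.shape)  # torch.Size([1, 32, 16, 16]): 4 nodes x 8 channels
```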

In DARTS, the connections between nodes are parameterized as weights, and the optimal operation on each edge is obtained via a softmax, specifically by selecting the path with the largest probability. This is exactly how the architecture is made continuously relaxed. In [9], two typical cells are designed in the search space: the normal cell and the reduction cell. The cells are stacked in a predefined manner, with the reduction cells placed at 1/3 and 2/3 of the network depth. That is to say, the outer network architecture is fixed, but the inner cell structure changes during the search. Namely, the search optimizes the operation parameters **W** as well as the weights **Alpha** on the edges. Based on the paper and the source code, we summarize the workflow of DARTS and illustrate the pipeline below.
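The continuous relaxation of a single edge can be sketched as follows. This is a hedged illustration under simplifying assumptions: `MixedOp`, `make_candidates` and the three-operation candidate set are ours, while the real search space in the paper is larger (dilated and separable convolutions, etc.).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_candidates(C):
    """A small illustrative subset of candidate operations on one edge."""
    return nn.ModuleList([
        nn.MaxPool2d(3, stride=1, padding=1),       # max_pool_3x3
        nn.Identity(),                              # skip_connect
        nn.Conv2d(C, C, 3, padding=1, bias=False),  # conv_3x3
    ])

class MixedOp(nn.Module):
    """Continuous relaxation of one edge: the output is a softmax-weighted
    sum over all candidate operations, making the architecture weights
    Alpha differentiable."""

    def __init__(self, C):
        super().__init__()
        self.candidates = make_candidates(C)
        # one architecture weight per candidate operation on this edge
        self.alpha = nn.Parameter(1e-3 * torch.randn(len(self.candidates)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.candidates))

    def discretize(self):
        # after the search: keep only the operation with largest probability
        return self.candidates[int(self.alpha.argmax())]

edge = MixedOp(C=8)
y = edge(torch.randn(1, 8, 16, 16))
print(y.shape)  # torch.Size([1, 8, 16, 16])
```

The `discretize` step mirrors how the final architecture is read out: once the search converges, only the highest-probability operation on each edge survives.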

Starting the search, the first step is to initialize the optimizer, the loss, the outer network architecture and the search space. In the search space, several basic operations are defined, such as dilated conv 5×5, max pooling 2×2 and concat layers. Afterwards, the normal cell and the reduction cell are learned based on these operations. According to the source code, the operations are defined in **operations.OPS** and the candidate operation names in **genotypes.PRIMITIVES**. The operation parameters are denoted as **W**, and the weights between each pair of nodes are defined as **Alpha**, exposed as *arch_parameters() = [alphas_normal, alphas_reduce]* in the source code.

In the search procedure, the training data is split into two disjoint parts: the train split is used to train the operation parameters **W**, and the validation split is used to evaluate the network and find the optimal **Alpha**. The search procedure is mainly implemented in **Architect.step** in the code. The optimization objective is to minimize the loss on the validation split with respect to **Alpha**, given the current network parameters **W**. Namely, updating **Alpha** seeks the optimal operation on each edge. The objective function is defined in Eq. (3) of the paper: