Source: Deep Learning on Medium

### 1 Introduction

Deep convolutional neural network (DCNN) models have achieved high accuracy on visual scene understanding tasks, and these benchmarks can be pushed further by increasing network depth and width at the cost of speed and power. But large networks are notoriously problematic for semantic segmentation and other computationally heavy tasks. Real-world applications such as autonomous/driverless vehicles, automated industrial and healthcare robots, and augmented reality in medicine, the military, or navigation are latency-sensitive and demand distributed edge processing and local, lower-end analytics. Convolution factorization has been identified as an effective tactic for dealing with the computational complexity of wide networks. We used a convolutional module called the efficient spatial pyramid (ESP), based on convolution factorization, to semantically segment non-small cell lung cancer (with or without IV contrast) referred for curative-intent radiotherapy. ESP is an efficient framework that can be easily deployed on edge devices, which push the frontiers of computation to their logical extremes under resource-constrained environments. This makes ESPNet a network that is fast, small, low-power, and low-latency while preserving semantic segmentation accuracy.

**2 Model:**

The architecture is as demonstrated in Fig. 1. Its layer-wise composition is as follows: (a) The standard convolution layer is decomposed into a point-wise convolution (which also reduces dimensionality) and a spatial pyramid of dilated convolutions, forming the efficient spatial pyramid (ESP) module. (b) Skip-connections between input and output enhance data flow. Dilated convolutional layers are denoted as (# input channels, effective kernel size, # output channels). The effective spatial dimensions of a dilated convolutional kernel are nₖ × nₖ, where nₖ = (n − 1)2^(k−1) + 1, k = 1, …, K. Note that only n × n pixels participate in the dilated convolutional kernel.
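As a quick sanity check on this formula, the effective kernel size can be computed directly (a minimal sketch; the helper name is ours, not the paper's):

```python
# Effective kernel size of a dilated convolution: n_k = (n - 1) * 2**(k - 1) + 1,
# where n is the base kernel size and k indexes the pyramid branch.
def effective_kernel_size(n: int, k: int) -> int:
    return (n - 1) * 2 ** (k - 1) + 1

# For a 3x3 kernel (n = 3) across K = 4 pyramid branches:
sizes = [effective_kernel_size(3, k) for k in range(1, 5)]
print(sizes)  # [3, 5, 9, 17]
```

Even though the effective footprint grows exponentially with k, each branch still learns only n × n weights.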

**a. ESP module:**

If we look at the basic module of an ESP, it is based on a convolution factorization principle that decomposes a standard convolution into two steps: (1) point-wise convolutions and (2) spatial pyramid of dilated convolutions.

**(1)Point-wise convolutions:**

Point-wise convolutions, also known as 1×1 convolutions, have the following features:

- Reduce or increase the dimensionality
- Apply nonlinearity again after convolution
- Can be considered as “feature pooling”

Assuming we have an input of size 32×32×100, where 100 is the number of feature channels, applying 20 1×1 convolutional filters yields a 32×32×20 output.

The function of the point-wise convolution is to apply the 1×1 convolution to the image so as to project high dimensional feature maps onto lower dimensional spaces.
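A minimal sketch of this projection, matching the 32×32×100 example above: because a 1×1 convolution touches no spatial neighborhood, it reduces to a per-pixel matrix multiply along the channel axis (the random weights are purely illustrative):

```python
import numpy as np

# Point-wise (1x1) convolution as a per-pixel linear projection.
rng = np.random.default_rng(0)
fmap = rng.standard_normal((32, 32, 100))   # H x W x M feature map
weights = rng.standard_normal((100, 20))    # M x N: one 1x1 filter per column

# No spatial neighborhood is involved, so the convolution is just a
# matrix multiply over the channel axis at every pixel.
out = fmap @ weights
print(out.shape)  # (32, 32, 20)
```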

#### (2) Spatial pyramid of dilated convolutions:

This is the second step in the module. First, let us understand what dilated convolutions are.

**(a) Dilations:**

**Standard Convolution (Left), Dilated Convolution (Right)**

Dilated convolution follows the standard convolution equation, except that the summation runs over positions satisfying s + lt = p, where p is the output position, s indexes input positions, t indexes kernel positions, and l is the dilation factor. The operation therefore skips l − 1 points between consecutive sampled inputs.

**When l=1, it is standard convolution.**

**When l>1, it is dilated convolution.**

**Dilated Convolution (l=2)**

The effect of the receptive field with respect to the dilation factor is as shown.
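The skipping behavior can be sketched with a minimal 1-D dilated convolution (a toy implementation with valid padding, not the paper's code):

```python
import numpy as np

# Minimal 1-D dilated convolution, following the summation s + l*t = p:
# output[p] = sum_t x[p + l*t] * w[t], with l the dilation factor.
def dilated_conv1d(x, w, l=1):
    n = len(w)
    reach = (n - 1) * l  # effective kernel span minus one
    return np.array([sum(x[p + l * t] * w[t] for t in range(n))
                     for p in range(len(x) - reach)])

x = np.arange(10, dtype=float)
w = np.array([1.0, 1.0, 1.0])
print(dilated_conv1d(x, w, l=1))  # l=1: standard conv, adjacent samples
print(dilated_conv1d(x, w, l=2))  # l=2: samples x[p], x[p+2], x[p+4]
```

With l = 1 the kernel covers 3 adjacent samples; with l = 2 it spans 5 samples while still using only 3 weights.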

The spatial pyramid of dilated convolutions then re-samples these low-dimensional feature maps simultaneously using K n × n dilated convolutional kernels, each with a dilation rate of 2^(k−1), k = {1, …, K}. This factorization drastically slashes the number of parameters and the memory requirement of the ESP module while preserving a large effective receptive field of (n − 1)2^(K−1) + 1. Each dilated kernel learns weights with a different receptive field, so the result resembles a spatial pyramid, hence the name spatial pyramid of dilated convolutions. The module takes an input feature map Fᵢ ∈ R^(W×H×M) and applies N kernels K ∈ R^(m×n×M) to produce an output feature map Fₒ ∈ R^(W×H×N), where W and H are the width and height of the feature map, m and n are the width and height of the kernel, and M and N are the numbers of input and output feature channels.

For simplicity, assume m = n. A standard convolutional kernel then learns n²MN parameters, which grow multiplicatively with the spatial dimensions of the n × n kernel and the number of input (M) and output (N) channels.
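To make the savings concrete, here is a back-of-the-envelope comparison (the sizes are illustrative assumptions; the ESP count follows from the point-wise reduction to N/K channels plus K dilated n × n branches described in the width-divider steps):

```python
# Parameter counts: a standard n x n convolution learns n^2 * M * N weights;
# the ESP factorization (point-wise reduce M -> N/K, then K dilated n x n
# branches of width N/K) learns M*N/K + n^2*N^2/K. Sizes below are examples.
n, M, N, K = 3, 128, 128, 4

standard = n * n * M * N                    # one dense n x n kernel bank
esp = (M * N) // K + (n * n * N * N) // K   # reduce step + K pyramid branches
print(standard, esp, standard / esp)        # 147456 40960 3.6
```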

**Width divider K:** Introducing the width-divider hyper-parameter K, we reduce the computational cost by uniformly shrinking the dimensionality of the feature maps across each ESP module in the network.

Reduce: Given K, the ESP module uses point-wise convolution to reduce the feature maps from M-dimensional to N/K-dimensional space.

Split: The low-dimensional feature maps are split across K parallel branches.

Transform: Each of the K parallel branches then processes the feature maps simultaneously with n × n dilated convolutional kernels, each branch using a different dilation rate 2^(k−1), k = {1, …, K}.

Merge: The outputs of the K parallel dilated convolutional branches are concatenated to produce an N-dimensional output feature map.
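The four steps above can be sketched end-to-end in plain NumPy (a toy, randomly initialized module for shape-checking only, not the paper's implementation):

```python
import numpy as np

def dilated_conv2d_same(x, w, l):
    # x: H x W x d input, w: n x n x d x d kernel; zero-padded "same" output.
    n = w.shape[0]
    pad = (n - 1) * l // 2
    H, W, _ = x.shape
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((H, W, w.shape[3]))
    for i in range(n):
        for j in range(n):
            # Shifted slice times the (d x d) weight matrix for tap (i, j).
            out += xp[i * l:i * l + H, j * l:j * l + W] @ w[i, j]
    return out

def esp_module(x, K=4, n=3, N=16, seed=0):
    rng = np.random.default_rng(seed)
    M, d = x.shape[2], N // K
    reduce_w = rng.standard_normal((M, d)) * 0.1        # Reduce: M -> N/K
    low = x @ reduce_w                                  # point-wise conv
    branches = []
    for k in range(1, K + 1):                           # Split + Transform
        wk = rng.standard_normal((n, n, d, d)) * 0.1
        branches.append(dilated_conv2d_same(low, wk, l=2 ** (k - 1)))
    return np.concatenate(branches, axis=2)             # Merge: K * d = N

x = np.ones((16, 16, 8))    # toy H x W x M input
y = esp_module(x)
print(y.shape)  # (16, 16, 16)
```

Note this sketch omits the skip-connection and the hierarchical feature fusion of the full module; it only demonstrates the Reduce/Split/Transform/Merge data flow.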

**b. Model Architecture**

The path from ESPNet-A to ESPNet is as shown. Red and green boxes represent the modules responsible for down-sampling and up-sampling operations, respectively. Spatial level l is indicated to the left of every module in (a). Each module is denoted as (# input channels, # output channels). Here, Conv-n represents an n × n convolution.

#### ⍺:

The hyper-parameter ⍺ was introduced to build computationally efficient networks without changing the network topology; it controls the depth of the model. The ESP module is repeated ⍺ₗ times at spatial level l. CNNs require more memory at higher spatial levels (l = 0 and l = 1) because of the larger feature-map dimensions there, so to stay memory-efficient, neither the ESP nor the convolutional modules are repeated at these levels. At the later levels they can be repeated to gain depth, so we have taken α₂ = 2 and α₃ = 8.
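A minimal sketch of how ⍺ shapes depth (the layer name below is a hypothetical stand-in, not ESPNet code):

```python
# alpha_l = number of stacked ESP modules at spatial level l; levels 0 and 1
# are never repeated to save memory, matching the values used in the text.
alpha = {0: 1, 1: 1, 2: 2, 3: 8}

def build_level(level: int) -> list:
    # One placeholder entry per stacked ESP module at this level.
    return [f"EspModule(level={level})" for _ in range(alpha.get(level, 1))]

for l in range(4):
    print(l, len(build_level(l)))  # 1, 1, 2, 8 modules per level
```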

#### 3 ESPNet: Performance reality check

1. Comparing ESPNet with semantic segmentation networks built on pre-trained backbones (VGG: FCN-8s and SegNet; ResNet: DeepLab-v2 and PSPNet; SqueezeNet: SQNet) and with networks trained from scratch (ENet and ERFNet) shows that ESPNet is 2% more accurate than ENet, while running 1.27× and 1.16× faster on a desktop and a laptop, respectively.

2. ESPNet suffers from lower class-wise accuracy, meaning it does not perform well on classes that belong to the same category; for example, a rider can be confused with a person. However, ESPNet delivers good category-wise accuracy: it had 8% lower category-wise mIOU than PSPNet while learning 180× fewer parameters.

3. ERFNet has better semantic segmentation accuracy than ESPNet but is also bulkier, with 5.5× more parameters; it is hence 5.44× larger, consumes more power, and has a higher battery discharge rate.

**4 References:**

1. https://arxiv.org/pdf/1803.06815.pdf

2. https://towardsdatascience.com/review-dilated-convolution-semantic-segmentation-9d5a5bd768f5