Source: Deep Learning on Medium

## Highway Networks: Inspired by LSTM, Using a Gating Function, More Than 1000 Layers

In this story, **Highway Networks** is briefly presented. This is a 2015 work. At that time, it was found that very deep neural networks are difficult to optimize, though exactly why remained an open problem. (Later, this was largely attributed to the vanishing gradient problem.) Inspired by Long Short-Term Memory (LSTM), the authors **make use of a gating function to adaptively transform or bypass the signal so that the network can go deeper.** A deep network with more than 1000 layers can also be optimized. I chose to present this paper so that I can introduce the gating function.

Highway Networks was initially presented at the **2015 ICML Deep Learning Workshop** and published as a **2015 arXiv** tech report with over **600 citations**. Later on, it was extended and published at **2015 NIPS** with over **500 citations**. (SH Tsang @ Medium)

### Outline

1. **Highway Networks**
2. **Results**
3. **Analyses**

### 1. Highway Networks

#### 1.1. Plain Network

- Before talking about Highway Networks, let's start with a plain network that consists of *L* layers, where the *l*-th layer (omitting the layer index) computes:

*y* = *H*(*x*, *WH*)

- where *x* is the input, *WH* is the weight, *H* is the transform function followed by an activation function, and *y* is the output. Similarly, for the *i*-th unit:

*yi* = *Hi*(*x*)

- We compute *yi* and pass it to the next layer.
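As a minimal sketch of such a plain layer (the ReLU choice, shapes, and variable names are my own assumptions, not from the paper):

```python
import numpy as np

def plain_layer(x, W_H, b_H):
    """One plain layer: y = H(x) = activation(W_H @ x + b_H), here with ReLU."""
    return np.maximum(0.0, W_H @ x + b_H)

# Toy usage: one 4-unit layer applied to a 4-dimensional input.
rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W_H = rng.standard_normal((4, 4))
b_H = np.zeros(4)
y = plain_layer(x, W_H, b_H)  # y is then fed to the next layer
```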

#### 1.2. Highway Network

- In a highway network, two non-linear transforms *T* and *C* are introduced:

*y* = *H*(*x*, *WH*) · *T*(*x*, *WT*) + *x* · *C*(*x*, *WC*)

- where *T* is the **Transform Gate** and *C* is the **Carry Gate**.
- In particular, *C* = 1 − *T*:

*y* = *H*(*x*, *WH*) · *T*(*x*, *WT*) + *x* · (1 − *T*(*x*, *WT*))

- We can have the conditions below for particular *T* values:

- **When *T* = 0**, we pass the input directly as the output, which creates an information highway. That's why it is called a Highway Network!
- When *T* = 1, we use the non-linearly activated, transformed input as the output.
- Here, in contrast to the *i*-th unit in a plain network, the authors introduce the concept of a *block*. For the *i*-th block, there is a **block state** *Hi*(*x*) and a **transform gate output** *Ti*(*x*). The corresponding **block output** *yi* is:

*yi* = *Hi*(*x*) · *Ti*(*x*) + *xi* · (1 − *Ti*(*x*))

- which is connected to the next layer.
- Formally, *T*(*x*) is the sigmoid function:

*T*(*x*) = σ(*WT*·*x* + *bT*)

- Recall that the sigmoid function caps its output between 0 and 1: when the input is very negative, the output approaches 0; when it is very large, the output approaches 1. **Therefore, by learning *WT* and *bT*, the network can adaptively pass *H*(*x*) or just pass *x* to the next layer.**
- The authors claim that this allows a simple initialization scheme for *WT* which is independent of the nature of *H*: *bT* can be initialized with a negative value (e.g. −1, −3, etc.) so that the network is initially biased toward carry behavior.
- The above idea is inspired by LSTM, as the authors mention. (LSTM is a very famous module, mainly used in Natural Language Processing (NLP).)
- **Stochastic Gradient Descent (SGD) did not stall for networks with more than 1000 layers.** However, exact results for these have not been provided.
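The block output above can be sketched in a few lines of NumPy (a toy forward pass under my own assumptions: tanh for *H*, per-unit gates; this is not the authors' code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_H, b_H, W_T, b_T):
    """Block output: y = H(x) * T(x) + x * (1 - T(x)), with carry gate C = 1 - T."""
    H = np.tanh(W_H @ x + b_H)      # block state H(x)
    T = sigmoid(W_T @ x + b_T)      # transform gate T(x)
    return H * T + x * (1.0 - T)

# With a strongly negative gate bias b_T, T ≈ 0 and the layer
# mostly carries its input through unchanged (the "highway").
rng = np.random.default_rng(0)
dim = 5
x = rng.standard_normal(dim)
W_H = rng.standard_normal((dim, dim))
W_T = rng.standard_normal((dim, dim))
y = highway_layer(x, W_H, np.zeros(dim), W_T, np.full(dim, -20.0))
# y is very close to x because the carry gate dominates
```

This also illustrates the negative-bias initialization mentioned above: a very negative *bT* closes the transform gate, so the layer starts out near the identity.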

### 2. Results

#### 2.1. MNIST

- The first layer is a fully connected plain layer, followed by 9, 19, 49, or 99 fully connected plain or highway layers. Finally, the network output is produced by a softmax layer.
- All networks are thin: **each layer has 50 blocks for highway networks** and **71 units for plain networks**, yielding roughly identical numbers of parameters (about 5000) per layer.
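A quick back-of-the-envelope check of that parameter parity (my own arithmetic, biases ignored): a 50-block highway layer carries two 50×50 weight matrices, *WH* and *WT*, while a 71-unit plain layer carries a single 71×71 matrix:

```python
highway_params = 2 * 50 * 50  # W_H and W_T for a 50-block highway layer
plain_params = 71 * 71        # single weight matrix for a 71-unit plain layer
print(highway_params, plain_params)  # prints 5000 5041 — roughly equal
```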

- As shown above, the errors obtained by Highway Networks are always smaller than those of the Plain Networks.

- **10-layer convolutional highway networks** are trained on MNIST, using two architectures, each with 9 convolutional layers followed by a softmax output. The **number of filter maps (width) was set to 16 and 32** for all the layers.
- Compared with Maxout and DSN, **Highway Networks obtained similar accuracy but with far fewer parameters.** (If interested, please visit my review on NoC for a very brief introduction to Maxout.)

#### 2.2. CIFAR-10 & CIFAR-100

- FitNet cannot optimize the networks directly when they are deep; it needs two-stage training. **By using the gating function, Highway Networks can optimize deep networks directly. In particular, Highway B obtains the highest accuracy with 19 layers.**
- Though Highway C is inferior to Highway B, it can still be optimized directly thanks to the gating function.

- Here, the fully connected layers used in the networks of the previous experiment are replaced with a convolutional layer with a receptive field of size one and a global average pooling layer. **Highway Networks obtain comparable performance on CIFAR-10 and the highest accuracy on CIFAR-100.**

### 3. Analyses

- The above figure shows the bias, the mean activity over all training samples, and the activity for a single random sample for each transform gate respectively. Block outputs for the same single sample are displayed in the last column.
- For the CIFAR-100 network, **the biases increase with depth**, forming a gradient. The strong negative biases at low depths are not used to shut down the gates, but to make them more selective. This makes the transform gate activity for a single example (column 3) very sparse.
- For the CIFAR-100 case, most transform gates are active on average, while they show very selective activity for a single example. This implies that for each sample only a few blocks perform a transformation, but different blocks are utilized by different samples.

- For MNIST digits 0 and 7, substantial differences can be seen within the first 15 layers.
- For CIFAR class numbers 0 and 1, the differences are sparser and spread out over all layers.

- By lesioning, it is meant to manually set all the transform gates of a layer to 0, forcing it to simply copy its input. As shown above, for each layer, the network is evaluated on the full training set with the gates of that layer closed.
- For MNIST (left), it can be seen that the **error rises significantly if any one of the early layers is lesioned, but layers 15 to 45 seem to have close to no effect on the final performance.** About 60% of the layers don't learn to contribute to the final result, likely because **MNIST is a simple dataset that doesn't require much depth.**
- For CIFAR-10, which is a relatively complex dataset, the error rises more significantly.
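Lesioning can be sketched with a toy highway layer (my own minimal reconstruction, not the authors' code): forcing the transform gate of a layer to 0 reduces it to an identity map, so the next layer receives the input unchanged:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_H, b_H, W_T, b_T, lesioned=False):
    """Highway layer; lesioned=True forces T = 0, so y = x (pure carry)."""
    if lesioned:
        return x
    H = np.tanh(W_H @ x + b_H)
    T = sigmoid(W_T @ x + b_T)
    return H * T + x * (1.0 - T)

rng = np.random.default_rng(0)
dim = 4
x = rng.standard_normal(dim)
W = rng.standard_normal((dim, dim))
y_lesioned = highway_layer(x, W, np.zeros(dim), W, np.zeros(dim), lesioned=True)
# y_lesioned equals x exactly: the lesioned layer just copies its input
```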

By looking at Highway Networks, we can learn about the **gating function using the sigmoid**. I hope I can review Recurrent Highway Networks in the future.

#### References

[2015] [arXiv]

Highway Networks

[2015] [NIPS]

Training Very Deep Networks

#### My Previous Reviews

**Image Classification:** [LeNet] [AlexNet] [ZFNet] [VGGNet] [SPPNet] [PReLU-Net] [STN] [DeepImage] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet]

**Object Detection:** [OverFeat] [R-CNN] [Fast R-CNN] [Faster R-CNN] [DeepID-Net] [R-FCN] [ION] [MultiPathNet] [NoC] [G-RMI] [TDM] [SSD] [DSSD] [YOLOv1] [YOLOv2 / YOLO9000] [YOLOv3] [FPN] [RetinaNet] [DCN]

**Semantic Segmentation:** [FCN] [DeconvNet] [DeepLabv1 & DeepLabv2] [SegNet] [ParseNet] [DilatedNet] [PSPNet] [DeepLabv3]

**Biomedical Image Segmentation:** [CUMedVision1] [CUMedVision2 / DCAN] [U-Net] [CFS-FCN] [U-Net+ResNet]

**Instance Segmentation:** [DeepMask] [SharpMask] [MultiPathNet] [MNC] [InstanceFCN] [FCIS]

**Super Resolution:** [SRCNN] [FSRCNN] [VDSR] [ESPCN] [RED-Net] [DRCN] [DRRN] [LapSRN & MS-LapSRN]