Source: Deep Learning on Medium
Highway Networks, Inspired By LSTM, Using Gating Function, More Than 1000 Layers.
In this story, Highway Networks is briefly reviewed. This is a 2015 work. At that time, very deep neural networks were found to be difficult to optimize, and why they were difficult to optimize was still an open problem. (In hindsight, it is probably due to the vanishing gradient problem.) Inspired by Long Short-Term Memory (LSTM), the authors use gating functions to adaptively transform or bypass the signal so that the network can go deeper. Even a deep network with more than 1000 layers can be optimized. I chose to present this paper so that I can introduce the gating function.
Highway Networks was first presented at the 2015 ICML Deep Learning Workshop and published as a 2015 arXiv tech report with over 600 citations. It was later extended and published in 2015 NIPS with over 500 citations. (SH Tsang @ Medium)
1. Highway Networks
1.1. Plain Network
- Before talking about Highway Networks, let's start with a plain network consisting of L layers, where the l-th layer (omitting the layer index) computes: $y = H(x, W_H)$
- where x is the input, W_H is the weight, H is the transform function followed by an activation function, and y is the output. For the i-th unit: $y_i = H_i(x)$
- We compute y_i and pass it to the next layer.
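To make the plain layer concrete, here is a minimal sketch in pure Python. The tanh activation, the bias term `b_H`, the layer sizes, and the random weights are all my own assumptions for illustration; the text above only says that H is a transform followed by an activation function.

```python
import math
import random

def plain_layer(x, W_H, b_H):
    """y = H(x, W_H): affine transform followed by tanh (activation is an assumption)."""
    return [math.tanh(sum(w * xj for w, xj in zip(row, x)) + b)
            for row, b in zip(W_H, b_H)]

# Hypothetical sizes and weights, chosen only for illustration.
random.seed(0)
n_in, n_out = 4, 3
W_H = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
b_H = [0.0] * n_out

x = [1.0, -2.0, 0.5, 3.0]
y = plain_layer(x, W_H, b_H)  # y[i] is the output of the i-th unit
```

Each entry `y[i]` plays the role of y_i and would be fed to the next layer.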
1.2. Highway Network
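- In a highway network, two non-linear transforms T and C are introduced: $y = H(x, W_H) \cdot T(x, W_T) + x \cdot C(x, W_C)$
- where T is the Transform Gate and C is the Carry Gate.
- In particular, C = 1 - T: $y = H(x, W_H) \cdot T(x, W_T) + x \cdot (1 - T(x, W_T))$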
- This gives the following behavior at the two extreme values of T:
- When T=0, the input is passed directly as the output ($y = x$), which creates an information highway. That’s why it is called a Highway Network!
- When T=1, the non-linearly transformed input is used as the output ($y = H(x, W_H)$).
- Here, in contrast to the i-th unit in a plain network, the authors introduce the concept of a block. The i-th block has a block state H_i(x) and a transform gate output T_i(x), and the corresponding block output y_i is: $y_i = H_i(x) \cdot T_i(x) + x_i \cdot (1 - T_i(x))$
- which is connected to the next layer.
- Formally, T(x) is defined with the sigmoid function σ: $T(x) = \sigma(W_T^T x + b_T)$
- Recall that the sigmoid function bounds its output between 0 and 1. For large negative inputs the output approaches 0, and for large positive inputs it approaches 1. Therefore, by learning W_T and b_T, the network can adaptively pass H(x) or just pass x to the next layer.
- The authors claim that this allows a simple initialization scheme for W_T which is independent of the nature of H.
- bT can be initialized with a negative value (e.g. -1, -3 etc.) such that the network is initially biased towards carry behavior.
- The above idea is inspired by LSTM, as the authors mention. (LSTM is a very famous module mainly used in Natural Language Processing (NLP).)
- Stochastic Gradient Descent (SGD) did not stall for networks with more than 1000 layers. However, the exact results have not been provided.
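The gating described in this section can be sketched in a few lines of pure Python. This is only an illustration under my own assumptions: tanh is used for H, the weights are hand-picked, and the gate weights are set to zero so that the bias alone decides the gate value. The very large biases (-30 and +30) are used only to show the two extreme cases; the initialization mentioned above uses mild values such as -1 or -3.

```python
import math

def sigmoid(z):
    """Logistic function, bounded between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, x):
    """Compute W x + b for a matrix W stored as a list of rows."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]

def highway_layer(x, W_H, b_H, W_T, b_T):
    """y_i = H_i(x) * T_i(x) + x_i * (1 - T_i(x)); input and output sizes must match."""
    h = [math.tanh(v) for v in affine(W_H, b_H, x)]  # block state H(x); tanh is an assumption
    t = [sigmoid(v) for v in affine(W_T, b_T, x)]    # transform gate T(x)
    return [hi * ti + xi * (1.0 - ti) for hi, ti, xi in zip(h, t, x)]

# Hypothetical 3-block layer with hand-picked weights.
x = [0.5, -1.0, 2.0]
W_H = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2], [-0.5, 0.1, 0.1]]
b_H = [0.0, 0.0, 0.0]
W_T = [[0.0] * 3 for _ in range(3)]  # zero gate weights, so b_T alone sets T

carry = highway_layer(x, W_H, b_H, W_T, [-30.0] * 3)     # T close to 0: output is close to x
transform = highway_layer(x, W_H, b_H, W_T, [30.0] * 3)  # T close to 1: output is close to H(x)
initial_gate = sigmoid(-3.0)  # gate value right after initializing b_T = -3 (about 0.047)
```

With b_T = -3 and small gate weights, each block starts out transforming only about 5% of the signal and carrying the rest, which is the "biased towards carry behavior" mentioned above.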
2. Results
2.1. MNIST
- The first layer is a fully connected plain layer followed by 9, 19, 49, or 99 fully connected plain or highway layers. Finally, the network output is produced by a softmax layer.
- All networks are thin: each layer has 50 blocks for highway networks and 71 units for plain networks, yielding roughly identical numbers of parameters (5000) per layer.
- As shown above, the errors obtained by highway networks are always smaller than those obtained by plain networks.
- 10-layer convolutional highway networks on MNIST are trained, using two architectures, each with 9 convolutional layers followed by a softmax output. The number of filter maps (width) was set to 16 and 32 for all the layers.
- Compared with Maxout and DSN, Highway Networks obtained similar accuracy with far fewer parameters. (If interested, please visit my review on NoC for a very brief introduction of Maxout.)
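The "roughly identical numbers of parameters (5000) per layer" quoted for the thin fully connected networks can be checked with a little arithmetic. The sketch below assumes square layers (input width equals output width) with bias terms, and counts a highway layer as two weight matrices (one for H, one for T); these counting conventions are my assumption, not something stated above.

```python
def plain_layer_params(units):
    """Weights plus biases of one square fully connected layer."""
    return units * units + units

def highway_layer_params(blocks):
    """A highway layer has two such affine maps: one for H and one for the gate T."""
    return 2 * (blocks * blocks + blocks)

plain_count = plain_layer_params(71)      # 71*71 + 71 = 5112
highway_count = highway_layer_params(50)  # 2 * (50*50 + 50) = 5100
```

Both counts land near 5000, which is why 71 plain units are compared against 50 highway blocks.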
2.2. CIFAR-10 & CIFAR-100
- FitNet cannot optimize the networks directly when they are deep. It needs two-stage training.
- By using gating functions, highway networks can optimize deep networks directly. In particular, Highway B obtains the highest accuracy with 19 layers.
- Although Highway C is inferior to Highway B, it can still be optimized directly thanks to the gating function.
- Here, the fully connected layer used in the networks in the previous experiment is replaced with a convolutional layer with a receptive field of size one and a global average pooling layer.
- The highway network obtains comparable performance on CIFAR-10 and the highest accuracy on CIFAR-100.
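The replacement of the fully connected layer by a convolution with a receptive field of size one plus global average pooling can be sketched as follows. The tensor layout (channels, height, width), the tiny sizes, and the weights are hypothetical and chosen only so the numbers are easy to follow.

```python
def conv1x1(feature_maps, W, b):
    """1x1 convolution: feature_maps is [C_in][H][W], W is [C_out][C_in]."""
    c_in = len(feature_maps)
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    return [[[sum(W[o][c] * feature_maps[c][i][j] for c in range(c_in)) + b[o]
              for j in range(width)]
             for i in range(height)]
            for o in range(len(W))]

def global_average_pool(feature_maps):
    """Average each channel over all spatial positions, giving one value per channel."""
    return [sum(sum(row) for row in fm) / (len(fm) * len(fm[0]))
            for fm in feature_maps]

# Hypothetical input: 2 channels of size 2x2, mapped to 3 class scores.
features = [[[1.0, 2.0], [3.0, 4.0]],
            [[0.0, 1.0], [0.0, 1.0]]]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
b = [0.0, 0.0, 0.5]

scores = global_average_pool(conv1x1(features, W, b))  # one score per class
```

The 1x1 convolution mixes channels at every spatial position, and the pooling collapses the spatial dimensions, so the result has one value per output channel regardless of the input resolution. These scores would then go to the softmax.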
- The above figure shows the bias, the mean activity over all training samples, and the activity for a single random sample for each transform gate respectively. Block outputs for the same single sample are displayed in the last column.
- For the CIFAR-100 network, the biases increase with depth, forming a gradient. The strong negative biases at low depths are not used to shut down the gates, but to make them more selective. This makes the transform gate activity for a single example (column 3) very sparse.
- For the CIFAR-100 case, most transform gates are active on average, while they show very selective activity for the single example. This implies that for each sample only a few blocks perform transformation but different blocks are utilized by different samples.
- For MNIST digits 0 and 7, substantial differences can be seen within the first 15 layers.
- For CIFAR class numbers 0 and 1, the differences are sparser and spread out over all layers.
- Lesioning means manually setting all the transform gates of a layer to 0, forcing it to simply copy its inputs. As shown above, for each layer, the network is evaluated on the full training set with the gates of that layer closed.
- For MNIST (left) it can be seen that the error rises significantly if any one of the early layers is removed, but layers 15-45 seem to have close to no effect on the final performance. About 60% of the layers don’t learn to contribute to the final result, likely because MNIST is a simple dataset that doesn’t require much depth.
- For CIFAR-10, which is a relatively complex dataset, the error rises more significantly when a layer is removed.
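The lesioning procedure can be illustrated with a small sketch: close the transform gates of one layer at a time and rerun the forward pass. Everything here is hypothetical (a 5-layer stack of width 4, random weights, tanh for H, b_T = -1), and it only compares outputs for a single input, whereas the experiment above measures the error on the full training set.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, x):
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]

def highway_layer(x, params, gate_closed=False):
    """Highway layer; with gate_closed=True all gates are forced to T = 0, so y = x."""
    if gate_closed:
        return list(x)
    W_H, b_H, W_T, b_T = params
    h = [math.tanh(v) for v in affine(W_H, b_H, x)]  # tanh is an assumption
    t = [sigmoid(v) for v in affine(W_T, b_T, x)]
    return [hi * ti + xi * (1.0 - ti) for hi, ti, xi in zip(h, t, x)]

def forward(x, layers, lesioned=None):
    """Run the stack; the layer with index `lesioned` (if any) just copies its input."""
    for idx, params in enumerate(layers):
        x = highway_layer(x, params, gate_closed=(idx == lesioned))
    return x

def random_layer(n, rng):
    """Hypothetical random weights; b_T = -1 biases the layer towards carrying."""
    W_H = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
    W_T = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
    return (W_H, [0.0] * n, W_T, [-1.0] * n)

rng = random.Random(0)
layers = [random_layer(4, rng) for _ in range(5)]
x = [0.3, -0.7, 1.2, 0.1]

baseline = forward(x, layers)
lesioned_outputs = [forward(x, layers, lesioned=k) for k in range(len(layers))]
```

Closing the gates of a layer is equivalent to removing that layer from the stack, which is why the analysis above speaks of layers being "removed". In a real experiment one would compare the training error of each lesioned network against the baseline.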
By studying the Highway Network, we learn how a sigmoid-based gating function works. I hope to review Recurrent Highway Networks in the future.
Training Very Deep Networks
My Previous Reviews
[LeNet] [AlexNet] [ZFNet] [VGGNet] [SPPNet] [PReLU-Net] [STN] [DeepImage] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet]