I’ll explain Deep Layer Aggregation (DLA), a neural network architecture that explores how best to aggregate layers across a network. Experimentally, this technique shows improvements in memory usage and performance over baseline ResNet, ResNeXt, and DenseNet in classification tasks.
Deep Layer Aggregation is an umbrella term for two different structures: Iterative Deep Aggregation (IDA) and Hierarchical Deep Aggregation (HDA). Currently, most skip connections (which aggregate features between layers) are rather shallow. IDA and HDA combine layers in a deeper way.
To understand why this structure might be an improvement, we first have to understand the current state of the art in aggregating layers. In existing structures, the aggregating “skip connections” are usually “shallow” and fuse features by simple one-step concatenations, as shown in (b).
What Are Skip Connections and Why Are They Important?
They’re pretty simple: a skip connection is just a concatenation operation. In the above figure, the skip connection from “pool 4” ends in a layer that is the concatenation of “pool 4” and some of “conv 7”.
These Skip Connections are important because:
1) You want your network to learn a combination of low- and high-level features.
2) You want to train deeper networks. Short skip connections, like those in ResNet, connect to earlier layers in the network, help propagate the gradient, and fight the vanishing gradient problem in very deep networks. (In ResNet the skip is a summation rather than a concatenation, which is a minor difference.)
3) Long Skip Connections can help recover spatial information that might be lost during downsampling. This is especially important in segmentation because, to label pixels in the final image, the network has to consider the lower-level features.
4) Improve Convergence Time. This paper found that having both long and short skip connections improved convergence time as opposed to only having one type of connection.
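To make the difference between the two fusion styles concrete, here is a minimal sketch in plain Python. It is only an illustration: feature maps are modeled as lists of channels, the `pool4` and `conv7` values are made up, and the function names `concat_skip` and `additive_skip` are hypothetical, not taken from any library.

```python
from typing import List

# Toy feature map: a list of channels, each channel a flat list of activations.
FeatureMap = List[List[float]]


def concat_skip(skip: FeatureMap, main: FeatureMap) -> FeatureMap:
    """Concatenative skip: stack the channels of both inputs (channel count grows)."""
    size = len(skip[0])
    if any(len(channel) != size for channel in skip + main):
        raise ValueError("spatial sizes must match before concatenation")
    return skip + main


def additive_skip(skip: FeatureMap, main: FeatureMap) -> FeatureMap:
    """Additive (ResNet-style) skip: element-wise sum, channel count unchanged."""
    if len(skip) != len(main):
        raise ValueError("channel counts must match for summation")
    return [[a + b for a, b in zip(cs, cm)] for cs, cm in zip(skip, main)]


# Made-up activations: 2 channels with 2 spatial positions each.
pool4 = [[1.0, 2.0], [3.0, 4.0]]
conv7 = [[0.5, 0.5], [1.0, 1.0]]

fused_concat = concat_skip(pool4, conv7)  # 4 channels: both inputs kept side by side
fused_sum = additive_skip(pool4, conv7)   # 2 channels: inputs merged element-wise
```

Concatenation keeps both sets of features and leaves it to the next layer to mix them, while summation keeps the channel count fixed, which is one reason it suits very deep residual stacks.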
Now that we have established why Skip Connections are useful, what DLA intends to do is improve on their structure.
IDA focuses on fusing resolutions and scales.
Aggregation in IDA is iterative: it starts at the shallowest, smallest scale and then progressively merges deeper, larger scales.
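The iterative merge can be sketched as a left fold over the stages. This is a toy illustration under stated assumptions, not the paper's implementation: `toy_node` is a hypothetical stand-in for a learned aggregation node, and the stages are assumed to have already been resized to the same shape, a step a real network would handle with upsampling and projection.

```python
from functools import reduce
from typing import Callable, List, Sequence

Feature = List[float]


def toy_node(a: Feature, b: Feature) -> Feature:
    """Hypothetical stand-in for a learned aggregation node: ReLU of the mean."""
    return [max(0.0, (x + y) / 2.0) for x, y in zip(a, b)]


def ida(
    stages: Sequence[Feature],
    node: Callable[[Feature, Feature], Feature] = toy_node,
) -> Feature:
    """Fold the stages shallowest first: node(node(s1, s2), s3), and so on."""
    if not stages:
        raise ValueError("need at least one stage")
    return reduce(node, stages)


# Made-up stage outputs, ordered from shallowest to deepest.
stages = [[1.0, -2.0], [3.0, 2.0], [5.0, 4.0]]
fused = ida(stages)
```

Because the fold starts with the shallowest stage, those early features pass through the most aggregation nodes and so get refined the most before reaching the output.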
HDA focuses on merging features from all modules and channels.
Unlike IDA, which combines layers sequentially, HDA uses a tree-like structure to combine layers that span more of the feature hierarchy. Notice how the output of an aggregation node feeds into the input of the next block, which preserves features from previous layers.
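The tree idea can be sketched with a small recursive function. This is a deliberately simplified binary tree, not the exact wiring from the paper (which also routes intermediate outputs to deeper aggregation nodes). The names `hda`, `toy_block`, and `toy_node` are hypothetical, and scalars stand in for feature maps.

```python
from typing import Callable, List

Block = Callable[[float], float]
Node = Callable[[List[float]], float]


def hda(x: float, depth: int, block: Block, node: Node) -> float:
    """Simplified hierarchical aggregation over a binary tree of blocks."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    if depth == 1:
        left = block(x)
        right = block(left)
    else:
        left = hda(x, depth - 1, block, node)
        # The aggregated output of the left subtree feeds the next subtree.
        right = hda(left, depth - 1, block, node)
    return node([left, right])


calls: List[float] = []


def toy_block(v: float) -> float:
    """Hypothetical stand-in for a convolutional block; records its input."""
    calls.append(v)
    return v + 1.0


def toy_node(inputs: List[float]) -> float:
    """Hypothetical stand-in for a learned aggregation node: plain mean."""
    return sum(inputs) / len(inputs)


out = hda(0.0, 2, toy_block, toy_node)
```

In this sketch a tree of depth 2 runs four blocks, and the third block receives the aggregated value from the first subtree rather than a raw block output, which is the "aggregation feeds the next block" behavior described above.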
Combined HDA and IDA
This is an example that uses both HDA and IDA. The use of HDA and IDA is architecturally independent, meaning it can be added on to any current or future network.
Source: Deep Learning on Medium