In Between ResNets and DenseNets


What are ResNets? What are DenseNets? Why are these architectures so popular? In this blog post I’ll introduce these two architectures for Deep Convolutional Neural Networks, known as DCNNs, and explain why they are important.

Background

The initial intuition, that accuracy keeps improving as the network gets deeper, is not necessarily true: as the network gets deeper, performance and accuracy issues emerge. One of the biggest problems is the vanishing gradient problem. A DCNN is trained by computing the derivatives of the training loss function with respect to the parameters and propagating them backwards through the layers. However, these gradients tend to approach zero as more layers are added, so the loss signal reaching the early layers becomes weaker and weaker, which makes the network hard to train. Normalization techniques can mitigate this problem for networks of moderate depth, but not for very deep ones.
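
To build some intuition for vanishing gradients, here is a small illustration I wrote for this post (it is not from either paper): it stacks sigmoid layers and prints the gradient magnitude that reaches the first layer as the depth grows.

```python
# Illustration only (not from the ResNet or DenseNet papers): the gradient
# reaching the first layer of a deep sigmoid stack shrinks as depth grows.
import torch
import torch.nn as nn

def first_layer_grad_norm(depth: int) -> float:
    torch.manual_seed(0)
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(16, 16), nn.Sigmoid()]
    net = nn.Sequential(*layers)

    x = torch.randn(8, 16)
    loss = net(x).pow(2).mean()              # a dummy loss, just to backpropagate
    loss.backward()
    return net[0].weight.grad.norm().item()  # gradient magnitude at the first layer

for depth in (5, 20, 50):
    print(depth, first_layer_grad_norm(depth))  # the norm collapses toward zero
```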

The ResNet and DenseNet architectures try to deal with this problem, and in my opinion they are breakthroughs in terms of the performance and practical use of DCNN architectures.

Residual Network — ResNets

The first architecture is the Residual Network model, known as ResNet (Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, 2016), which tries to deal with this problem. The researchers introduced residual connections into the network [figure 1]. Residual connections, also known as shortcut connections, simply take the output of layer t and add it to the output of layer t+2, skipping the layers in between. Instead of learning unreferenced functions, the model learns residual functions with reference to the layer inputs: the hypothesis function changes from H(x) = F(x) to H(x) = F(x) + x. This means a building block is defined as y = F(x, {Wᵢ}) + x, where x and y are the input and output vectors of the layers considered, and F(x, {Wᵢ}) is the residual mapping to be learned.

Figure 1. Residual learning: a building block.
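
As a minimal sketch of such a building block, here is my own simplified PyTorch version (not the authors' code), assuming the input and output of F have the same shape so the identity shortcut can be added directly:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Simplified residual block computing y = F(x, {W_i}) + x."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        shortcut = x                            # the identity (shortcut) connection
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))         # F(x, {W_i}): the residual mapping
        return F.relu(out + shortcut)           # add the shortcut, then activate

# Example: ResidualBlock(64)(torch.randn(1, 64, 32, 32)) keeps the same shape.
```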

In the paper, several experiments were conducted. One of them compares a plain network, i.e. a network without the shortcut connections, with the residual network. The models were trained on a few well-known datasets with different numbers of layers. The result was that in the plain networks the error rates increase as the number of layers increases, while in ResNets the error rates decrease as the number of layers increases [figure 2].

Figure 2. Training on CIFAR-10. Dashed lines denote training error, and bold lines denote testing error. Left: plain networks. The error of plain-110 is higher than 60% and not displayed. Right: ResNets.

Another experiment compares a variety of methods with different numbers and kinds of layers, again trained on well-known datasets. The result was that the ResNet with 110 layers achieved the lowest error!

This was a huge step in the DCNN world, because the other leading architectures were less than half as deep, and ResNets kept their good accuracy while growing much deeper. Moreover, based on this architecture the researchers won 1st place in several tracks of the ILSVRC & COCO 2015 competitions: ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

Densely Connected Convolutional Networks — DenseNets

The second architecture is Densely Connected Convolutional Networks, known as DenseNets (Huang, G., Liu, Z., Weinberger, K.Q., van der Maaten, L., 2017). The researchers try to deal with a few problems, including the vanishing gradient problem.

DenseNets is a family of architectures. The base architecture introduces a new connectivity pattern between the layers [figure 3]: each layer is connected to all subsequent layers, so an L-layer network has L(L+1)/2 direct connections. This means that the i-th layer receives the feature-maps of all preceding layers as input: Xᵢ = Hᵢ([X₀, X₁, …, Xᵢ₋₁]), where Hᵢ is a composite function of three operations: Batch Normalization, ReLU and a 3×3 convolution (I won’t explain these operations in this post, but I encourage you to look them up if you are interested!).

Figure 3. A 5-layer dense block with a growth rate of k = 4. Each layer takes all preceding feature-maps as input.
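
Here is a minimal PyTorch sketch of a dense block, again my own simplified version rather than the authors' code, using the BN-ReLU-3×3-Conv composite function Hᵢ described above and a growth rate k:

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """H_i: BatchNorm -> ReLU -> 3x3 Conv, producing k (growth_rate) feature-maps."""
    def __init__(self, in_channels: int, growth_rate: int):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv = nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        return self.conv(self.relu(self.bn(x)))

class DenseBlock(nn.Module):
    """Each layer receives all preceding feature-maps: X_i = H_i([X_0, ..., X_{i-1}])."""
    def __init__(self, num_layers: int, in_channels: int, growth_rate: int):
        super().__init__()
        self.layers = nn.ModuleList(
            DenseLayer(in_channels + i * growth_rate, growth_rate)
            for i in range(num_layers)
        )

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # concatenate every preceding feature-map along the channel axis
            out = layer(torch.cat(features, dim=1))
            features.append(out)
        return torch.cat(features, dim=1)
```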

Another problem DenseNets tries to address is the number of parameters, which it reduces substantially. In many other architectures the layers are wide and each layer must relearn redundant feature-maps, which makes the parameter count much larger. DenseNets explicitly separates information that is added to the network from information that is preserved: the layers are very narrow, adding only a small set of feature-maps to the “collective knowledge” of the network while keeping the remaining feature-maps unchanged. So how do they do it?

Other architectures do not limit the number of feature-maps each layer receives. In DenseNets, however, Hᵢ produces k feature-maps, so the i-th layer has k₀ + k×(i−1) input feature-maps, where k₀ is the number of channels in the input. Each layer therefore adds only k feature-maps of its own to this state, and this growth rate k is kept small (e.g. k = 12). This limitation is what makes the layers narrow. At the end of the last dense block, a global average pooling is performed and then a softmax classifier is attached.
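
A tiny sketch of this bookkeeping, with hypothetical numbers (k₀ = 16, k = 12, six layers) and a 10-class head chosen just for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical values for illustration: k0 input channels, growth rate k = 12.
k0, k, num_layers = 16, 12, 6
for i in range(1, num_layers + 1):
    print(f"layer {i} receives {k0 + k * (i - 1)} input feature-maps")

# After the last dense block: global average pooling, then a softmax classifier.
final_channels = k0 + k * num_layers
feature_maps = torch.randn(1, final_channels, 8, 8)          # stand-in activations
pooled = F.adaptive_avg_pool2d(feature_maps, 1).flatten(1)   # global average pooling
logits = nn.Linear(final_channels, 10)(pooled)               # 10 classes, e.g. CIFAR-10
probs = F.softmax(logits, dim=1)
```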

The next architecture is DenseNet-B (Bottleneck layers). The main goal is to reduce model complexity and size. To accomplish this, a new phase was added before Hᵢ, in which three operations are applied: Batch Normalization, ReLU, and a 1×1 convolution. In addition, the network is divided into multiple dense blocks, and the layers between them are called transition layers: a 1×1 convolution followed by a 2×2 average pooling is used as the transition layer between two adjacent dense blocks [figure 4].

Figure 4. A deep DenseNet with three dense blocks. The layers between two adjacent blocks are referred to as transition layers and change feature-map sizes via convolution and pooling.
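
Here is a minimal PyTorch sketch of a bottleneck layer and a plain transition layer, once more my own simplified version; the 4×k width of the 1×1 bottleneck follows the choice reported in the paper:

```python
import torch.nn as nn

class BottleneckDenseLayer(nn.Module):
    """DenseNet-B layer: BN-ReLU-1x1 Conv (bottleneck), then BN-ReLU-3x3 Conv.
    The 1x1 convolution squeezes the input down to 4 * growth_rate channels
    before the more expensive 3x3 convolution."""
    def __init__(self, in_channels: int, growth_rate: int):
        super().__init__()
        inter_channels = 4 * growth_rate
        self.bottleneck = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, inter_channels, kernel_size=1, bias=False),
        )
        self.conv = nn.Sequential(
            nn.BatchNorm2d(inter_channels), nn.ReLU(inplace=True),
            nn.Conv2d(inter_channels, growth_rate, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        return self.conv(self.bottleneck(x))

class TransitionLayer(nn.Module):
    """Between two dense blocks: 1x1 convolution followed by 2x2 average pooling."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        return self.pool(self.conv(x))
```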

Another architecture is DenseNet-C. The goal here is to improve model compactness by reducing the number of feature-maps at the transition layers. How do they do it? If a dense block produces m feature-maps, the transition layer generates ⌊θm⌋ output feature-maps, where 0 < θ ≤ 1 is called the compression factor. When θ = 1, the number of feature-maps across transition layers remains unchanged; a DenseNet with θ < 1 is referred to as DenseNet-C.

When both the bottleneck layers and transition layers with θ < 1 are used, the architecture is called DenseNet-BC.
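
The compression step fits naturally into the transition layer described above. Here is a small sketch, assuming θ = 0.5 as in the paper's experiments:

```python
import math
import torch.nn as nn

def compressed_transition(in_channels: int, theta: float = 0.5) -> nn.Sequential:
    """DenseNet-C style transition: keep only floor(theta * m) of the m incoming
    feature-maps (theta is the compression factor), then downsample by 2."""
    out_channels = math.floor(theta * in_channels)
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
        nn.AvgPool2d(kernel_size=2, stride=2),
    )

# Example: a dense block ending with 256 feature-maps and theta = 0.5
# gives a transition layer that outputs 128 feature-maps.
transition = compressed_transition(256, theta=0.5)
```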

The experiments presented in the paper empirically demonstrate the effectiveness of DenseNets on several benchmark datasets, comparing them with other well-known architectures, especially ResNets and their variants. The DenseNet-C and DenseNet-BC architectures are evaluated with θ = 0.5.

In the main experiment, both basic DenseNets and DenseNet-BC were trained with different depths L and growth rates k on different datasets, and compared to other models, including ResNet variants. The results [table 1] indicate that the various DenseNet configurations achieve the lowest error rates.

Table 1. Error rates (%) on the CIFAR and SVHN datasets. k denotes the network’s growth rate. Results that surpass all competing methods are bold and the overall best results are blue.

In another experiment, they compare the parameter efficiency of all the DenseNet and ResNet variants. The results show that DenseNet-BC is consistently the most parameter-efficient DenseNet variant. Additionally, to achieve the same level of accuracy, DenseNet-BC requires only around a third of the parameters of ResNets.

To conclude, here are some of the advantages of this family of architectures:

  1. Alleviating the vanishing gradient problem: as evidence, networks of up to 250 layers were trained with good accuracy.
  2. Encouraging feature reuse: the same feature-maps are passed on to all subsequent layers and reused [figure 3].
  3. Substantially reducing the number of parameters: each layer adds only a small number of feature-maps, bounded by the growth rate (e.g. k = 12).

I hope you enjoyed this post and learned something! If you notice an error or have something you want to share with me, please drop a comment below!