Original article was published by Abheer Bandodker on Deep Learning on Medium
ResNet Architecture Explained
Prior to the advent of the ResNet architecture, researchers were unable to train deep neural networks with a large number of layers. This was primarily attributed to the vanishing gradient problem during backpropagation. The existing architectures could not update kernel values efficiently once the number of layers exceeded a certain threshold.
As we can see in the graph below, the training and testing errors are higher for the 56-layer model than for the 20-layer model.
This restricted researchers from using deep architectures with many layers. However, the smaller networks could not distinguish between similar objects with sophisticated attributes or many features. Using smaller architectures prevented the extraction of features of features from an image, limiting the learning capability of the network.
In December 2015, Microsoft Research introduced a new architecture known as the Residual Network, widely known as ResNet. With this architecture, researchers were able to build deep neural networks with many more layers.
The primary idea behind the residual network is to keep the number of parameters that actually need to change during training as low as possible. This keeps the training and testing error from growing as the depth of the network increases.
The skip connections give the network extra degrees of freedom: a residual block can initially act as an identity mapping, bypassing its convolution layers until they learn something useful. As a result, the extraction of features of features does not have to happen during the initial epochs. For instance, during the first few epochs only the major features, such as eyes, nose and lips, may be extracted.
The extraction of features of features takes place only when the later layers need to distinguish different attributes within a certain feature. For instance, when the feature "eye" is extracted, the network also needs to extract features within it, such as the eyeball, retina and eyelashes. However, it may not need features within a feature such as the nose. So, as and when the need arises to extract features within features, the residual branch alongside the skip connection is used.
This in turn reduces the effective number of parameters that must be trained during backpropagation, mitigating the vanishing gradient problem.
During the initial few epochs the model can rely mainly on the identity connection X, effectively bypassing the weight or convolution layers with 512 kernels. However, when the need arises, it also passes the data through the layers with 512 kernels to extract features of features. The output of the weight layers, F(X), is then added element-wise to the identity X, giving F(X) + X.
Since both outputs F(X) and X are added during forward propagation, the same routes are used to train the model during backward propagation. If the residual branch has learnt nothing yet, then F(X) ≈ 0 and the block simply passes X through. In that case the kernel values along the F(X) route change very little, which is a common scenario in the initial epochs.
When the gradient flows through the skip connection during backpropagation, the kernel values of all layers except those in the bypassed branch receive the strongest updates. This methodology reduces the effective number of parameters trained at each step, as a single epoch does not have to change everything at once.
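Why the shortcut keeps gradients alive can be seen in a toy one-dimensional sketch. The scalar weight `w` and the functions below are illustrative only, not from the paper: a real residual branch F(X) is a stack of convolutions, but the "+ 1" term in the gradient appears for the same reason.

```python
# Toy scalar residual block: y = F(x) + x, with the residual branch F(x) = w * x.

def forward(x, w):
    fx = w * x       # residual branch F(x)
    return fx + x    # identity shortcut adds x back

def grad_wrt_x(w):
    # dy/dx = w + 1: the "+ 1" comes from the identity path,
    # so the gradient never vanishes even when w is tiny.
    return w + 1

print(forward(2.0, 0.0))   # w = 0: the block passes x through unchanged -> 2.0
print(grad_wrt_x(0.0))     # gradient is still 1.0 thanks to the shortcut
```

With w = 0 the block is a pure identity mapping, which is exactly the "skip" behaviour described above; a plain (non-residual) layer with w = 0 would instead kill both the output and the gradient.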
As you can see in the image, an input of 200 by 200 pixels is fed into the model. It passes through the L1 layer (with 64 kernels of shape 7 * 7), then the L2 and L3 layers (with 64 kernels of shape 3 * 3 each). A pooling operation is performed thereafter.
The L4, L5 and L6 layers (with 128 kernels of shape 3 * 3 each) can be bypassed through the skip connection.
The skip connection passes the data directly to the L7 layer (with 512 kernels of shape 5 * 5) and then to the L8 layer (with 512 kernels of shape 5 * 5). The data then goes to the fully connected layer (FC 1000) with 1000 neurons.
If we train the model for, say, 100 epochs, then during the early epochs the skip connection around L4, L5 and L6 dominates, and the model mostly adjusts the kernel values of the other layers. Once those kernels have been learnt to some degree, the residual branch through L4, L5 and L6 starts contributing as well.
Furthermore, this reduces the training time as well as the effective number of parameters while increasing the depth of the network, mitigating the vanishing gradient problem.
Experiments with ResNet
The figure shows three architectures: on the left, the VGG-19 model with 19.6 billion FLOPs; in the middle, a 34-layer plain network with 3.6 billion FLOPs; and on the right, a 34-layer residual network, also with 3.6 billion FLOPs.
It can be seen that the baseline plain network requires only 3.6 billion FLOPs, about 18% of the 19.6 billion FLOPs of VGG-19.
In the 34-layer residual network the researchers added identity shortcuts, also known as skip connections, as discussed above. The dotted lines denote shortcuts where the dimensions increase, which are handled either by zero-padding or by a 1 * 1 projection convolution. Plain identity shortcuts add neither extra parameters nor extra computation, yet they increase the predictive capability of the architecture.
The researchers of the paper first evaluated the 18-layer plain baseline against the 34-layer plain baseline. The results show that the 18-layer model has lower error rates than the 34-layer model, as shown in the image, where the thin lines depict the training error and the bold ones the validation error.
After testing the performance of the plain models with 18 and 34 layers, the researchers tested the same architectures with identity shortcuts, or skip connections. The results are shown below.
As we can see, the error rate for ResNet-34 is lower than for ResNet-18. This showed that, with identity shortcuts or skip connections, deeper neural networks can be trained to extract features of features.
The error rates for the plain baseline models and ResNet models were as follows.
There are five standard versions of the ResNet architecture, namely ResNet-18, ResNet-34, ResNet-50, ResNet-101 and ResNet-152, with 18, 34, 50, 101 and 152 layers respectively. The image below depicts the architectures.
ResNet-18 and ResNet-34 use a 3 * 3 convolution twice in each residual block. However, ResNet-50, ResNet-101 and ResNet-152 use a 1 * 1 convolution, followed by a 3 * 3 convolution, followed by another 1 * 1 convolution (the so-called bottleneck block). Both block designs extract similar kinds of features, but the number of parameters required is far lower in the 1–3–1 format than in the 3–3 format. The 1–3–1 order starts with 1 * 1 kernels that reduce the number of channels before the expensive 3 * 3 convolution, and each extra convolution adds a non-linearity, allowing the architecture to capture more non-linear structure.
The 1–3–1 format also does not reduce the dimensions as much as the 3–3 format, so less information is lost. The calculation is shown below.
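The parameter saving of the bottleneck design can be checked directly. The sketch below counts convolution weights only (biases and batch-norm parameters ignored), assuming 256 input and output channels and a squeeze to 64 channels, as in the ResNet-50 bottleneck blocks:

```python
# Weight count of one convolution: kernel_height * kernel_width * in_ch * out_ch.
def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

# 3-3 basic block (ResNet-18/34 style): two 3x3 convolutions, 256 -> 256 channels.
basic = 2 * conv_params(3, 256, 256)

# 1-3-1 bottleneck block (ResNet-50/101/152 style): squeeze to 64 channels,
# do the 3x3 convolution cheaply, then expand back to 256 channels.
bottleneck = (conv_params(1, 256, 64)
              + conv_params(3, 64, 64)
              + conv_params(1, 64, 256))

print(basic)       # 1,179,648 weights
print(bottleneck)  # 69,632 weights, roughly 17x fewer
```

The exact channel counts (256 and 64) match the conv4-style stage of the bottleneck networks, but the point holds for any stage: the 1 * 1 convolutions make the 3 * 3 convolution operate on far fewer channels.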
Reading the architecture
The convolution shown below depicts that the 3–3 convolution phase runs twice using 64 kernels. So, in total there are 2 * 2 = 4 convolution layers in the phase.
The convolution shown below depicts that the 3–3 convolution phase runs thrice using 64 kernels. So, in total there are 3 * 2 = 6 convolution layers in the phase.
The convolution shown below depicts that the 1–3–1 convolution phase runs thrice using 64, 64 and 256 kernels respectively. So, in total there are 3 * 3 = 9 convolution layers in the phase.
The convolution shown below depicts that the 1–3–1 convolution phase runs 23 times using 256, 256 and 1024 kernels respectively. So, in total there are 23 * 3 = 69 convolution layers in the phase.
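The per-phase arithmetic above follows one simple rule: layers in a phase = (times the block repeats) * (convolutions per block). A minimal sketch, with the function name being my own:

```python
# Convolution layers contributed by one phase of the architecture table.
def phase_layers(repeats, convs_per_block):
    return repeats * convs_per_block

print(phase_layers(2, 2))    # 3-3 block run twice   -> 4 layers
print(phase_layers(3, 2))    # 3-3 block run thrice  -> 6 layers
print(phase_layers(3, 3))    # 1-3-1 block thrice    -> 9 layers
print(phase_layers(23, 3))   # 1-3-1 block 23 times  -> 69 layers
```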
Counting the number of layers
The diagram shown below is the architecture of ResNet-18. Let us understand how to count the layers as shown in the diagram.
Conv1 = 1 layer
conv2.x = 4 layers
conv3.x = 4 layers
conv4.x = 4 layers
conv5.x = 4 layers
FC (one fully connected network layer) = 1 layer
Summation of the layers =
1+ 4 + 4 + 4 + 4 + 1 = 18
As we can see, there are a total of 18 layers.
Similarly, the number of layers can be counted for ResNet-34, ResNet-50, ResNet-101 and ResNet-152.
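The same counting scheme works for every variant: the initial 7 * 7 convolution and the final fully connected layer contribute one layer each, and each phase contributes its repeats times its convolutions per block. A sketch using the repeat counts from the architecture table:

```python
# Total depth = initial conv + all residual-phase convolutions + FC layer.
def total_layers(phase_repeats, convs_per_block):
    return 1 + sum(r * convs_per_block for r in phase_repeats) + 1

print(total_layers([2, 2, 2, 2], 2))   # ResNet-18  -> 18
print(total_layers([3, 4, 6, 3], 2))   # ResNet-34  -> 34
print(total_layers([3, 4, 6, 3], 3))   # ResNet-50  -> 50
print(total_layers([3, 4, 23, 3], 3))  # ResNet-101 -> 101
print(total_layers([3, 8, 36, 3], 3))  # ResNet-152 -> 152
```

Note that ResNet-34 and ResNet-50 share the same repeat pattern; the jump from 34 to 50 layers comes purely from swapping the two-convolution basic block for the three-convolution bottleneck block.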
Summary and conclusion
ResNet reduces the effective number of parameters to be trained through skip connections, or identity shortcuts. It also makes backward propagation more effective, since gradients can flow directly through the shortcuts. The identity shortcut works somewhat like a dropout layer in that it induces a certain amount of regularization, making the model more robust.
The training time is shorter, or the model converges faster, because gradients do not have to pass through every activation function during backward propagation; the identity shortcut provides a direct path.
ResNet-50 is the most widely used variant. It has a top-1 error of 20.74 on the ImageNet validation set, while the deeper ResNet-152 achieves a top-1 error of 19.38. However, ResNet-50 requires 3.8 * 10⁹ FLOPs compared to 11.3 * 10⁹ FLOPs for ResNet-152. As we can see, ResNet-50 consumes only about 33.63% of the computing resources that ResNet-152 requires. This trade-off between accuracy and cost makes it one of the most widely used ResNet architectures.
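The 33.63% figure follows directly from the two FLOP counts reported in the ResNet paper:

```python
# FLOP counts from the ResNet paper (He et al., 2015).
flops_resnet50 = 3.8e9
flops_resnet152 = 11.3e9

# Fraction of ResNet-152's compute that ResNet-50 needs.
ratio = flops_resnet50 / flops_resnet152
print(f"{ratio:.2%}")  # -> 33.63%
```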
A very special thanks to Sudhanshu Kumar Sir who helped me to understand the architecture.