Source: Deep Learning on Medium

# Understanding and implementation of Residual Networks(ResNets)

*Residual learning framework to ease the training of networks that are substantially deeper than those used previously.*

This article is primarily based on the research paper “**Deep Residual Learning for Image Recognition**” published by Microsoft Research [Link to the research paper] and the Convolutional Neural Networks course by Andrew Ng.

**The problem of very deep neural networks:**

In recent years, neural networks have become deeper, with state-of-the-art networks going from just a few layers (e.g., AlexNet) to over a hundred layers.

- One of the major benefits of a very deep network is that it can represent very complex functions.
- However, a huge barrier to training them is vanishing gradients: very deep networks often have a gradient signal that goes to zero quickly, thus making gradient descent prohibitively slow.
- More specifically, during gradient descent, as we backprop from the final layer back to the first layer, we multiply by a weight matrix at each step. If the gradients are small, the large number of multiplications can shrink the gradient exponentially quickly to zero (or, in rare cases, grow it exponentially and “explode” to very large values).
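This exponential shrinkage can be seen in a toy NumPy experiment. The matrix size, weight scale, and depth below are arbitrary choices for illustration, not values from the paper:

```python
import numpy as np

# Toy illustration of vanishing gradients: backprop through many layers
# multiplies the gradient by a weight matrix at every step.
rng = np.random.default_rng(0)
grad = np.ones(64)                        # gradient signal at the final layer
W = 0.04 * rng.standard_normal((64, 64))  # weights with small magnitude

norms = [np.linalg.norm(grad)]
for _ in range(50):                       # backprop through 50 layers
    grad = W.T @ grad                     # one step of the chain rule
    norms.append(np.linalg.norm(grad))

# the gradient norm shrinks roughly geometrically toward zero
print(norms[0], norms[25], norms[-1])
```

With the small weights above the norm collapses toward zero; with larger weights the same loop would explode instead.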

The regular networks like VGG-16 are called “plain” networks.

In plain networks, as the number of layers increases from 20 to 56 (as shown below), even after thousands of iterations, training error is worse for the 56-layer network than for the 20-layer network.

In theory, we expect that adding depth should only help, but in practice the deeper network has higher training error, and thus higher test error.

**Why is this happening and how could we fix it?**

When deeper networks are able to start converging, a *degradation* problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly.

Using deeper networks is **degrading** the performance of the model. The Microsoft Research paper tries to solve this problem using a **deep residual learning framework.**

**Solution:** Residual Block / Identity block

The idea is that instead of letting the layers learn the underlying mapping directly, we let the network fit a residual mapping. So, instead of the original mapping, say H(x), let the network fit F(x) := H(x) − x, which gives H(x) = F(x) + x.

The approach is to add a **shortcut** (also called a **skip connection**) that allows information to flow more easily from one layer to a later layer: alongside the normal CNN flow, the input is carried past one or more layers and added back in.

A Residual Block:
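A minimal sketch of this block in NumPy, with fully connected layers standing in for the conv layers and random placeholder weights rather than a trained model:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    """Compute relu(F(x) + x), where F(x) = W2 @ relu(W1 @ x)."""
    fx = W2 @ relu(W1 @ x)   # the residual mapping F(x)
    return relu(fx + x)      # the shortcut adds the input back

rng = np.random.default_rng(42)
x = rng.standard_normal(8)
W1 = rng.standard_normal((8, 8))
W2 = np.zeros((8, 8))        # if F(x) == 0, the block reduces to relu(x)

out = residual_block(x, W1, W2)
```

Note how easy the identity mapping is here: driving `W2` to zero makes `F(x) = 0`, so the block simply passes its input through the final ReLU.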

**Two takeaways from the residual block:**

- Adding new layers should not hurt the model’s performance: if those layers are not useful, regularisation can drive their weights toward zero, so the block falls back to the identity mapping and is effectively skipped.
- If the new layers are useful, their weights or kernels will be non-zero even in the presence of regularisation, and model performance could increase slightly.

Therefore, because of the “skip connection” / “residual connection”, adding new layers should not decrease the performance of the model, and it could increase it slightly.

By stacking these ResNet blocks on top of each other, you can form a very deep network. Having ResNet blocks with the shortcut also makes it very easy for one of the blocks to learn an identity function. This means that you can stack on additional ResNet blocks with little risk of harming training set performance.
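A rough NumPy sketch of this effect (layer sizes and weight scales are arbitrary assumptions): pushing a signal forward through 50 plain layers versus 50 residual blocks shows the shortcut keeping the signal alive:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
Ws = [0.05 * rng.standard_normal((16, 16)) for _ in range(50)]

plain, resid = x.copy(), x.copy()
for W in Ws:
    plain = relu(W @ plain)          # plain network: signal shrinks layer by layer
    resid = relu(W @ resid) + resid  # residual block: shortcut preserves the signal

print(np.linalg.norm(plain), np.linalg.norm(resid))
```

The plain stack squashes the activation toward zero, while the residual stack stays at a usable magnitude, which is the forward-pass mirror of the gradient story above.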

Two main types of blocks are used in a ResNet, depending mainly on whether the input/output dimensions are the same or different.

**1. The identity block** — same as the one we saw above. The identity block is the standard block used in ResNets, and corresponds to the case *where the input activation has the same dimension as the output activation*.
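Because the shortcut is a plain element-wise addition, it only works when the input and output activations have the same shape. A small NumPy sketch (the shapes are chosen purely for illustration):

```python
import numpy as np

x = np.zeros((32, 32, 64))   # an activation of shape (H, W, channels)
fx = np.zeros((32, 32, 64))  # F(x) from the block's conv layers, same shape
out = fx + x                 # valid: identical shapes, identity block applies

fx_bad = np.zeros((16, 16, 128))  # e.g. after a stride-2, channel-doubling conv
try:
    fx_bad + x               # shapes (16, 16, 128) and (32, 32, 64) don't match
    raised = False
except ValueError:
    raised = True            # NumPy refuses the addition
```

When the dimensions do differ, the shortcut itself needs a transformation to match them, which is what the second block type handles.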