ResNets, DenseNets & UNets

Source: Deep Learning on Medium

Photo by Pietro Jeng on Unsplash


The most important question is: how do we train deep convolutional networks? The quest to answer this question led to the invention of ResNets. In the traditional approach, we make a convolutional neural network deeper simply by increasing the number of layers.

Based on what we know about convolutional neural networks, adding layers should lower the training error, since a deeper network can represent everything a shallower one can. But this is not what happened!

Image source — Deep Residual Learning for Image Recognition Research Paper
  • The deeper 56-layer network is shown in red, and the shallower 20-layer network is shown in yellow.
  • We can easily observe that the deeper network has a higher training error, which we did not expect. Because it is the training error that is higher, this is not a case of overfitting.

This puzzle led Kaiming He and his co-authors to investigate and publish their results. Let us understand the idea.

Deep Residual Learning

Image source — Deep Residual Learning for Image Recognition Research Paper

In the plain network above, the output H(x) is the combined effect of two weight layers and two ReLU non-linearities applied to the input x. Suppose we could turn a shallow plain net into a deeper one in which the added layers, at worst, pass their input through unchanged. The deeper network would then retain at least the behaviour of the shallow network. This idea led to ResNets.

Image source — Deep Residual Learning for Image Recognition Research Paper
  • In a ResNet block, the output is the sum of F(x) and x, where x is carried forward by the identity function.
  • This helps with the vanishing-gradient problem by giving the gradient an alternative path to flow through. The identity function also lets a higher layer perform at least as well as a lower layer, and not worse.
  • After every two weight layers and one ReLU operation, we add the block's input (the identity) to the output of those layers. Thus, a deep 56-layer network can retain at least the effect of a 20-layer network: the 36 added layers only need to pass the result of the first 20 layers through.
  • When back-propagation goes through the identity path, the gradient is multiplied only by 1. This preserves the signal and avoids any loss of information.
  • In the image, for any number of added layers, even if the added layers drive their own gradients to zero, the output must retain the effect of the 20-layer network, because that effect has been added back through the identity.
  • The researchers ran exactly this experiment, and to everyone's surprise, the deeper residual network performed better.
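
The gradient argument in the bullets above can be checked with a few lines of arithmetic. This is a minimal sketch, not the paper's experiment: it assumes a toy chain of scalar layers that all share one local derivative (`local_grad` is a made-up value), and compares the back-propagated gradient with and without an identity path.

```python
# Toy illustration (an assumption, not the paper's setup): every layer is a
# scalar function with the same local derivative.

def plain_chain_gradient(local_grad: float, depth: int) -> float:
    """Gradient through `depth` plain layers: a product of local derivatives."""
    grad = 1.0
    for _ in range(depth):
        grad *= local_grad
    return grad


def residual_chain_gradient(local_grad: float, depth: int) -> float:
    """Gradient through `depth` residual layers y = x + F(x).

    d(x + F(x))/dx = 1 + F'(x), so every factor includes the 1 that comes
    from the identity path.
    """
    grad = 1.0
    for _ in range(depth):
        grad *= 1.0 + local_grad
    return grad


if __name__ == "__main__":
    depth = 36  # the layers added on top of the 20-layer network
    # Hypothetical local derivative of 0.5: the plain gradient shrinks towards zero.
    print(plain_chain_gradient(0.5, depth))
    # Even if the added layers contribute nothing (F'(x) = 0), the identity
    # path alone keeps the gradient at exactly 1.0.
    print(residual_chain_gradient(0.0, depth))
```

In the plain chain, a dead layer (local derivative 0) wipes out the gradient entirely, while in the residual chain the same layer simply leaves it unchanged.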

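In other words, instead of asking the stacked layers to learn the desired mapping H(x) directly, the block lets them learn the residual F(x) = H(x) - x and then computes

    output = F(x) + x

An identity connection is also known as a skip connection.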
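
To make the skip connection concrete, here is a minimal sketch of a residual block on plain Python lists. It is an illustration under stated assumptions rather than the paper's implementation: `linear` is a hypothetical stand-in for a convolution, and the names `residual_block`, `w1` and `w2` are invented for this example.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def relu(v: Vector) -> Vector:
    """Element-wise ReLU."""
    return [max(0.0, a) for a in v]


def linear(weights: Matrix, v: Vector) -> Vector:
    """Matrix-vector product; stands in for a convolution ("weight layer")."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def residual_block(x: Vector, w1: Matrix, w2: Matrix) -> Vector:
    """Compute relu(F(x) + x), where F(x) = W2 * relu(W1 * x)."""
    f_x = linear(w2, relu(linear(w1, x)))
    # The skip connection: add the unchanged input to the layer output.
    return relu([f + a for f, a in zip(f_x, x)])


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    zeros = [[0.0] * 3 for _ in range(3)]
    # With all-zero weights F(x) = 0, so the block passes x through unchanged.
    print(residual_block(x, zeros, zeros))  # [1.0, 2.0, 3.0]
```

The input here is non-negative, so the final ReLU leaves it untouched. The point is that a block whose weights have collapsed to zero still behaves like the identity, which is why extra blocks need not hurt.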

What happened was that they won ImageNet that year (ILSVRC 2015), and they won it easily. ResNet has been revolutionary.

Image Source — Visualizing the Loss Landscape of Neural Nets

In the research paper named in the image caption above, we can clearly see the effect of ResNet blocks on the loss landscape. Without skip connections, the loss surface is very bumpy, which is why optimization can get stuck in a valley and fail to improve further. With skip connections, on the other hand, the loss surface is much smoother, and therefore training can reach a much better optimum.

DenseNets

A DenseNet is similar to a ResNet, but rather than adding the identity to the convolution output, we concatenate the identity connection with the convolution output. In this way, the input is carried unchanged into the block's output.

Because we keep concatenating the identity connection, the output grows larger with every layer. This makes DenseNets memory intensive. Thus, DenseNets are generally not used for very deep networks, but they are definitely worth trying on smaller networks.
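
The memory cost is easy to see by counting channels. The sketch below is only bookkeeping under stated assumptions: the starting width of 64 channels and the 32 new channels per layer are illustrative values, and `dense_block_channels` is a hypothetical helper, not part of any library.

```python
from typing import List


def dense_block_channels(in_channels: int, growth_rate: int, num_layers: int) -> List[int]:
    """Channel count after each layer of a dense block.

    Every layer produces `growth_rate` new channels, which are concatenated
    with everything that came before instead of being added to it.
    """
    channels = [in_channels]
    for _ in range(num_layers):
        channels.append(channels[-1] + growth_rate)
    return channels


def residual_block_channels(in_channels: int, num_layers: int) -> List[int]:
    """With addition (ResNet style), the channel count stays constant."""
    return [in_channels] * (num_layers + 1)


if __name__ == "__main__":
    print(dense_block_channels(64, 32, 6))  # [64, 96, 128, 160, 192, 224, 256]
    print(residual_block_channels(64, 6))   # [64, 64, 64, 64, 64, 64, 64]
```

After only six layers the concatenated tensor is four times as wide as the input, whereas the additive version stays at 64 channels.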

DenseNets are very helpful in image segmentation, where we need to keep the original pixels available in order to reconstruct the picture.


Image Source — fastai

U-Net has two main components: a downsampling path and an upsampling path.

Let us understand these two components using the above image.

— Down Sampling

  • U-Net starts with a one-channel input image of 572 by 572 pixels.
  • They apply two 3×3 convolutions without padding, leading to 64 channels and feature maps of 568 by 568. Each unpadded 3×3 convolution loses two pixels per dimension (572 → 570 → 568).
  • Then, they apply 2×2 max-pooling, which halves the size, i.e., the feature maps are now 284 by 284.
  • These steps are repeated, doubling the number of channels each time, until we get feature maps of 28 by 28 with 1024 channels.
  • This process is known as downsampling.
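
The sizes along the downsampling path can be reproduced with simple arithmetic. This sketch assumes the layout of the original U-Net figure (two unpadded 3×3 convolutions per level, 2×2 max-pooling between levels, 64 channels at the first level); the function name and its parameters are invented for illustration.

```python
from typing import List, Tuple


def unet_encoder_shapes(size: int = 572, first_channels: int = 64,
                        levels: int = 5) -> List[Tuple[int, int]]:
    """Return (channels, spatial size) at the end of each encoder level.

    Assumes each level applies two unpadded 3x3 convolutions (each removes
    2 pixels) and that levels are separated by 2x2 max-pooling.
    """
    shapes = []
    channels = first_channels
    for level in range(levels):
        if level > 0:
            size //= 2      # 2x2 max-pooling halves the spatial size
            channels *= 2   # the channel count doubles at each new level
        size -= 2 * 2       # two unpadded 3x3 convolutions, 2 pixels each
        shapes.append((channels, size))
    return shapes


if __name__ == "__main__":
    for channels, size in unet_encoder_shapes():
        print(f"{channels:5d} channels, {size} x {size}")
```

Under these assumptions the path runs 568, 280, 136, 64, 28 pixels per side while the channels go 64, 128, 256, 512, 1024.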

This is only one part of the U-Net structure. Let us build an intuition for upsampling.

— Up Sampling

Image source —
  • One way to upsample is the transposed convolution, also known as a stride-half convolution or deconvolution.
  • It creates larger images from smaller ones.
  • There is a helpful research paper on this concept.
  • You have a 2×2 input; the blue squares in the image are the 2×2 input.
  • You add not only 2 pixels of padding all around the outside, but also a pixel of padding between every pair of neighbouring pixels.
  • Now if we slide a 3×3 kernel across this enlarged 7×7 image in the usual way, we end up going from a 2×2 input to a 5×5 output.
  • If you added only one pixel of padding around the outside, the enlarged image would be 5×5 and you would end up with a 3×3 output.
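
The zero-insertion step can be written out directly. The following is a minimal sketch assuming a square input and a stride-1 convolution with no extra padding; `zero_stuff` and `conv_output_size` are hypothetical helper names used only here.

```python
from typing import List

Grid = List[List[float]]


def zero_stuff(image: Grid, outer_padding: int = 2, inner_zeros: int = 1) -> Grid:
    """Insert zeros between pixels of a square image and zero-pad the border."""
    n = len(image)
    step = inner_zeros + 1
    stretched = (n - 1) * step + 1          # size after inserting inner zeros
    size = stretched + 2 * outer_padding    # size after padding both sides
    grid = [[0.0] * size for _ in range(size)]
    for i, row in enumerate(image):
        for j, value in enumerate(row):
            grid[outer_padding + i * step][outer_padding + j * step] = value
    return grid


def conv_output_size(input_size: int, kernel: int = 3) -> int:
    """Size after a stride-1 convolution with no extra padding."""
    return input_size - kernel + 1


if __name__ == "__main__":
    enlarged = zero_stuff([[1.0, 2.0], [3.0, 4.0]])
    print(len(enlarged))                    # 7
    print(conv_output_size(len(enlarged)))  # 5
    real = sum(1 for row in enlarged for v in row if v != 0.0)
    print(real, "of", len(enlarged) ** 2, "cells hold real pixels")
```

Only 4 of the 49 cells in the enlarged grid carry real data, so many kernel positions see nothing but zeros. The `outer_padding` argument can be varied to see how the output size changes.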

This is how people used to do it. But it is not a great way to reconstruct larger images from smaller ones. There are several reasons for saying this: for example, we are adding a lot of black (zero-valued) pixels to the image, which dilute the actual pixels, and we may end up with something entirely different from what we expected.

Nowadays, people often upsample in a different way, using nearest-neighbour interpolation.

Image source — fastai
  • So you can do a nearest-neighbour interpolation, which copies each pixel into a 2×2 block, and then a stride-one conv. Now the computation in the upper-left 4×4 uses no zeros, unlike in the previous approach.
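
Nearest-neighbour upsampling is just pixel repetition, as this small sketch on nested lists shows. The function name and the default `scale` of 2 are assumptions made for the example; the stride-one convolution that would follow is left out.

```python
from typing import List

Grid = List[List[float]]


def nearest_neighbour_upsample(image: Grid, scale: int = 2) -> Grid:
    """Repeat every pixel `scale` times along both axes."""
    out = []
    for row in image:
        # Repeat each value horizontally...
        wide_row = [value for value in row for _ in range(scale)]
        # ...then repeat the widened row vertically.
        for _ in range(scale):
            out.append(list(wide_row))
    return out


if __name__ == "__main__":
    for row in nearest_neighbour_upsample([[1.0, 2.0], [3.0, 4.0]]):
        print(row)
    # [1.0, 1.0, 2.0, 2.0]
    # [1.0, 1.0, 2.0, 2.0]
    # [3.0, 3.0, 4.0, 4.0]
    # [3.0, 3.0, 4.0, 4.0]
```

Every cell of the enlarged image holds a real pixel value, so the convolution that follows never averages over padding zeros in the interior.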

There is one more way, known as Bilinear Interpolation.

Image source — fastai
  • It means that instead of copying A to all those different cells, you take a weighted average of the cells around it.
  • If you are looking at what should go in the red cell, it is about 3 A's, 2 C's, 1 D, and 2 B's, and you take the average.
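
A weighted average of the four surrounding cells can be sketched as follows. This is one possible formulation, not the method of any particular library: it assumes output corners are mapped exactly onto input corners, and real implementations differ in how they map coordinates. The function names are invented for this example.

```python
from typing import List

Grid = List[List[float]]


def bilinear_sample(image: Grid, y: float, x: float) -> float:
    """Weighted average of the four pixels surrounding the point (y, x)."""
    h, w = len(image), len(image[0])
    y0, x0 = int(y), int(x)  # top-left neighbour (coordinates are non-negative)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0  # fractional distance from the top-left neighbour
    top = image[y0][x0] * (1 - dx) + image[y0][x1] * dx
    bottom = image[y1][x0] * (1 - dx) + image[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def bilinear_upsample(image: Grid, out_h: int, out_w: int) -> Grid:
    """Resize so that output corners coincide with input corners (an assumption)."""
    h, w = len(image), len(image[0])
    out = []
    for i in range(out_h):
        y = i * (h - 1) / (out_h - 1) if out_h > 1 else 0.0
        row = []
        for j in range(out_w):
            x = j * (w - 1) / (out_w - 1) if out_w > 1 else 0.0
            row.append(bilinear_sample(image, y, x))
        out.append(row)
    return out


if __name__ == "__main__":
    # A = 0, B = 10 on the top row; C = 20, D = 30 on the bottom row.
    for row in bilinear_upsample([[0.0, 10.0], [20.0, 30.0]], 3, 3):
        print(row)
    # [0.0, 5.0, 10.0]
    # [10.0, 15.0, 20.0]
    # [20.0, 25.0, 30.0]
```

The centre value 15.0 is the plain average of A, B, C and D, while cells closer to one corner weight that corner more heavily.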

Anytime you look at a picture on your computer screen and change its size, it is most likely doing bilinear interpolation.

Now that we have some background on upsampling, let us understand what Olaf Ronneberger and his co-authors did.

  • They used skip connections, as I explained above for ResNets.
  • Rather than adding a skip connection that skips every two convolutions as in ResNets, they added skip connections where the grey lines are in the above image.
  • In other words, they concatenated a skip connection from each stage of the downsampling path to the same-sized stage in the upsampling path. So basically, these are like dense blocks.
  • As a result, the last couple of layers have nearly the input pixels themselves coming into their computation.
  • These grey arrows are called cross-connections.
  • That makes it super handy for resolving the fine details in segmentation tasks, because all of the fine details are still available.
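
The cross-connection can be followed through as shape bookkeeping. The sketch below assumes the sizes shown in the original U-Net figure for the first upsampling step (a 64×64 encoder feature map with 512 channels, cropped to match a 56×56 upsampled map with 512 channels); the helper names are hypothetical.

```python
from typing import Tuple

Shape = Tuple[int, int, int]  # (channels, height, width)


def center_crop_shape(shape: Shape, target_hw: Tuple[int, int]) -> Shape:
    """Shape after cropping a feature map around its centre to `target_hw`."""
    channels, height, width = shape
    target_h, target_w = target_hw
    if target_h > height or target_w > width:
        raise ValueError("cannot crop to a larger size")
    return (channels, target_h, target_w)


def concat_channels(a: Shape, b: Shape) -> Shape:
    """Shape after concatenating two feature maps along the channel axis."""
    if a[1:] != b[1:]:
        raise ValueError("spatial sizes must match before concatenation")
    return (a[0] + b[0], a[1], a[2])


if __name__ == "__main__":
    encoder_features = (512, 64, 64)  # saved on the downsampling path
    upsampled = (512, 56, 56)         # the 28x28 bottom layer after 2x upsampling
    cropped = center_crop_shape(encoder_features, upsampled[1:])
    print(concat_channels(cropped, upsampled))  # (1024, 56, 56)
```

Because the features are concatenated rather than added, the convolutions that follow see both the coarse upsampled signal and the finer encoder detail side by side.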

So those are my thoughts on ResNets, DenseNets, and U-Nets. I would also recommend that readers explore the fastai library.