Intro to Neural ODEs: Part 1 — ResNets

Recently, I found the website depthfirstlearning.com, which has deep dives on new and interesting machine learning topics. I followed the Neural ODE curriculum and would like to share a brief look at some of the things I learned.

Neural ODEs are a relatively new (circa 2018) and exciting class of neural network models. Before we can understand them, however, we need to take a step back and understand where they come from. This leads us to the topic of today’s post: ResNets.

ResNets, or residual neural networks, were introduced in 2015 in the paper Deep Residual Learning for Image Recognition (https://arxiv.org/abs/1512.03385). They were designed to solve a significant problem with very deep neural networks: as these models were made deeper and deeper, their performance actually decreased. One would think that more layers and greater model capacity would improve accuracy. Instead, the deeper models performed worse, and not simply because of overfitting: their error was higher even on the training set. Simply stacking more layers had hit a limit. The ResNet provided a way past it, allowing much deeper networks to be trained with state-of-the-art performance.

A ResNet is made up of individual units called residual blocks, and these blocks hold the key difference between a ResNet and other models. Along with the usual transformations found in a neural network (convolutions, linear maps, etc.), a residual block also adds its input directly to the output of those transformations. This is called a shortcut connection and can be seen in the figure below.

Residual Block (https://arxiv.org/abs/1512.03385)
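To make the idea concrete, here is a minimal sketch of a residual block in PyTorch. This is my own illustration rather than code from the paper, and it assumes the simple case where the block keeps the number of channels (and spatial size) unchanged:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A basic residual block: output = ReLU(F(x) + x)."""

    def __init__(self, channels):
        super().__init__()
        # F(x): two 3x3 convolutions with batch norm, as in the basic block
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # Shortcut connection: add the input directly to the transformed output
        return self.relu(out + x)
```

The only thing separating this from an ordinary pair of convolutional layers is the `out + x` term, which is the shortcut connection.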

As seen in the figure, the input to the residual block is added directly to the output of the block’s layer transformations. This shortcut connection helps because, in the worst case, a residual block can learn to do nothing at all: if the weights of its transformations are driven to zero, the block simply passes its input through unchanged. What this means is that a deeper ResNet should not perform worse than its shallower counterpart, since the extra residual blocks can collapse to identity functions and leave the network behaving like one with fewer layers. If the added depth cannot easily hurt, the network should perform at least as well, if not better. In practice, ResNets do perform better, achieving some of the highest accuracies on many image classification benchmarks.
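As a quick sanity check of that worst-case argument, the snippet below (again just an illustrative sketch, using a simplified residual branch without batch norm) zeroes out the branch’s weights and confirms that the block reduces to the identity on a non-negative input:

```python
import torch
import torch.nn as nn

# A simplified residual branch F(x): two convolutions with a ReLU in between.
branch = nn.Sequential(
    nn.Conv2d(16, 16, kernel_size=3, padding=1, bias=False),
    nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, padding=1, bias=False),
)

# Zero out every parameter in the branch, so that F(x) = 0.
with torch.no_grad():
    for p in branch.parameters():
        p.zero_()

x = torch.rand(1, 16, 8, 8)        # a non-negative dummy feature map
out = torch.relu(branch(x) + x)    # residual block: ReLU(F(x) + x)
print(torch.allclose(out, x))      # True: the block acts as the identity
```

With the branch zeroed, the shortcut is all that remains, so adding this block to a network changes nothing; any useful transformation the block does learn is then pure gain.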