# Explaining ResNet with Crayons

I’ve been trying to figure out ResNet and I keep reading articles which explain it in the same way and I just didn’t quite get it! As such, I’ve decided to try to explain it using basic drawings on my laptop trackpad (I didn’t actually use crayons, sorry).

Before I begin let me briefly explain the concept of a ResNet block: In a traditional convolutional network, you might have a long string of convolutional layers (with batch norms and pooling layers mixed in), but with ResNet, we cluster our network into blocks of three convolutional layers. What makes these blocks unique is that we add the output of the previous block onto the result of our current block.
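As a rough sketch of that idea (plain Python with numpy, where simple element-wise functions stand in for the three conv layers of a real block), the skip connection looks like:

```python
import numpy as np

def resnet_block(x, layers):
    """Run x through a stack of layers, then add the block's
    input back onto the result (the skip connection)."""
    out = x
    for layer in layers:
        out = layer(out)
    return out + x  # output = R(x) + x

# Toy "layers" standing in for a block's three conv layers.
layers = [lambda v: 2.0 * v, lambda v: v - 1.0, lambda v: 0.5 * v]

x = np.array([1.0, 2.0, 3.0])  # output of the previous block
y = resnet_block(x, layers)    # R(x) + x
```

The only difference from a plain stack of layers is that one final `+ x`; everything else is an ordinary feed-forward pass.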
For example, suppose we want a ResNet block to produce a result Y. To produce it using a traditional conv net, we would have something like this:

F(x) = Y

Where F(x) could be a single layer or a set of layers. For a ResNet block, we would do this:

R(x) + x = Y

Where R(x) is a set of three convolutional layers, the traditional makeup of a ResNet block. This adding of x forces R(x) to model the difference, or residual, between Y and x:

R(x) = Y – x
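In numbers (a hypothetical toy example, treating x and Y as small numpy arrays rather than real feature maps): the residual is simply whatever is left over after subtracting the input from the target, and adding it back recovers Y exactly.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])  # output of the previous block
Y = np.array([1.5, 2.0, 4.0])  # target output of this block

residual = Y - x  # R(x) = Y - x: the part the block must learn

assert np.allclose(residual + x, Y)  # R(x) + x = Y
```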

For a more technical and in-depth explanation of ResNet construction, check out this blog post by Michael Dietz.
So why is the ResNet block useful? As promised, I’ll try to explain it with drawings:
Suppose you are trying to model (draw) the following shape:

And your initial state (i.e. the output of the previous block/layer) is:

With a traditional conv net your problem might look something like this:

So we need to write some function (or, in my example, draw some picture) which matches the “model” image on the right-hand side as best we can.

If we were to model the same function using a ResNet block, our function would look something like this:

Subtracting our starting image (the first half of the squiggle) from both sides gives us:

Which simplifies to:

Why is this better than the conv function? It’s better because the ResNet function is simpler: we only need to draw half the shape rather than replicating the entire thing. You might be wondering why that’s such a big deal since, in the non-ResNet function, we still had the knowledge of how to draw the first half of the squiggle as a starting point. The difference is that, with the non-ResNet function, we still had to draw both halves of the image rather than just drawing the missing segment, which exposes us to more human error or, if you were actually training a network, the approximation error which accompanies any attempt to fit a complex pattern. To use another analogy, we all have the knowledge necessary to perform basic arithmetic, but we’d still rather avoid it, because every time we do it in our heads, there’s a small chance that we’ll get it wrong.
Just as we might make mistakes every time we attempt to solve a problem, our neural network may make mistakes when trying to fit an underlying pattern. The more complex that pattern is, the harder a time the network will have fitting it.

In this way, each ResNet block only models whatever differs between the target pattern/image and the starting point. To take this to an extreme, suppose our starting activations were a perfect match for our target activations, i.e. Y = x in F(x) = Y. In a traditional conv net, we would have to hope that F(x) is trained to be an identity operation. With the ResNet block, we already add the initial state x in R(x) + x = Y, so rather than training R(x) to be an identity, we can just zero it out, which is a much easier pattern to fit.
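The identity case is easy to see concretely (again a toy sketch, with a single zeroed-out linear layer standing in for R and a small numpy array standing in for the activations): with zero weights, the residual block passes x through untouched, while a plain layer with zero weights wipes the signal out entirely.

```python
import numpy as np

x = np.array([1.0, -2.0, 0.5])  # starting activations, already matching the target
W = np.zeros((3, 3))            # the layer's weights, zeroed out

plain_out = W @ x       # plain layer: zero weights destroy the input
resnet_out = W @ x + x  # residual block: zero weights give the identity for free

assert np.allclose(resnet_out, x)    # output == input, no training needed
assert np.allclose(plain_out, 0.0)   # the plain net lost everything
```

For the plain net to behave as an identity, W would instead have to be trained all the way to the identity matrix.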
In reality we probably won’t have many cases where our starting point is a perfect match for the target, but in cases where we have some partial match, we may be able to zero out some weights rather than training them to perfectly reproduce whatever segments of our pattern we’ve already figured out.