Source: Deep Learning on Medium
Much of computer vision's success comes from convolutions: convolutions and convolutional neural networks have enabled many incredible advances in the field.
The goal of this post is to understand and implement a toy convolution example, from scratch in NumPy. From my own experience, I believe that doing this can solidify your understanding of convolutions and of how every component works.
For reference, here is the complete runnable code:
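(The original gist is not reproduced here, so below is a reconstruction that is consistent with the line references used throughout this post — treat it as a sketch; the original may have differed in small details.)

```python
import numpy as np

padding = 1
strides = 2
image = np.array([[10, 10, 10, 0, 0, 0, 0],
                  [10, 10, 10, 0, 0, 0, 0],
                  [10, 10, 10, 0, 0, 0, 0],
                  [10, 10, 10, 0, 0, 0, 0],
                  [10, 10, 10, 0, 0, 0, 0],
                  [10, 10, 10, 0, 0, 0, 0],
                  [10, 10, 10, 0, 0, 0, 0],
                  [10, 10, 10, 0, 0, 0, 0]])
filt = np.array([[1, 0, -1],
                 [1, 0, -1],
                 [1, 0, -1]])
image = np.pad(image, padding, mode='constant')
out_h = (image.shape[0] - filt.shape[0]) // strides + 1
out_w = (image.shape[1] - filt.shape[1]) // strides + 1
convolved = np.zeros((out_h, out_w))
for y in range(0, image.shape[0] - filt.shape[0] + 1, strides):
    for x in range(0, image.shape[1] - filt.shape[1] + 1, strides):
        value = np.sum(image[y:y + 3, x:x + 3] * filt)
        convolved[y // strides, x // strides] = value
```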
Let’s break it down…
Part 1 (Lines 5–15): Inputs and Weights
First off, we need to define our inputs and weights. Just like any other neural network, we use these inputs and weights to calculate the output.
Because we are dealing with images, our inputs are just the pixel values of the image, which are defined on lines 5–12. Displaying our array as an image yields this:
Probably not the most exciting image, but it works as a toy example. Comparing this with our code, we see that it makes sense, as the 0s correspond with the black part of the image, and the 10s correspond with the white part of the image.
Other than the inputs, the weights are the other data needed in a neural network. In convolutional neural networks, the weights are also called filters. Essentially, a filter is just a square array with a shape of (n, n). In our code, the filter we create is a 3 by 3 array with 1s in the first column, 0s in the middle column, and -1s in the last column.
However, note that these numbers were not chosen randomly; this particular filter is known to detect vertical edges in an image. The idea of having a neural network detect edges is fundamental to computer vision's success.
Also remember that since we are dealing with weights, these numbers are the ones that are going to be changed by backpropagation. However, since the focus of this post is just the forward pass of convolutions, the code presented does not cover backpropagation of convolutions.
Part 2 (Lines 17–23): Convolutions
Now that the inputs and weights are defined, we can move on to convolutions. Essentially, convolutions are a set of operations that combines the inputs and weights to form a feature representation of the input. And when I say that convolutions are a set of operations, I really do mean that this is all convolutions are. Just as in a normal feed-forward neural network we matrix-multiply the inputs and weights to produce the outputs of a layer, convolutions are a way to produce an output based on the inputs and weights.
Looking back at our code, the operations on lines 16 through 23 don't exactly look easy to understand. However, convolutions aren't nearly as complicated as the code makes them look, and only some of those lines actually perform the convolution itself. So, let's start with something very simple.
Convolutions: A Single Step (Line 22)
Let's say that we wanted some way of combining the top left 3 by 3 corner of the image with the filter. In doing this, we could represent this 3 by 3 corner's features as a single number. To do this, we need an operation that takes two 3 by 3 arrays and combines them into a single number. The code on line 22 does all of this in a single line:
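That line boils down to something like the following self-contained sketch (`image` and `filt` are the arrays from part 1, written compactly here):

```python
import numpy as np

image = np.array([[10, 10, 10, 0, 0, 0, 0]] * 8)  # the example image
filt = np.array([[1, 0, -1]] * 3)                 # the edge-detecting filter

# combine a 3 by 3 corner of the image with the 3 by 3 filter into one number
value = np.sum(image[0:3, 0:3] * filt)
print(value)  # 0
```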
Recommendation: Open a Jupyter notebook and run lines 5–15, so that you can play with the example image and filter. That's what I did when I developed the code.
However, this is still pretty complicated. Let’s break this code down even further.
Start by running this code:
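Something like this (sketched here with the example image defined inline so it runs on its own):

```python
import numpy as np

image = np.array([[10, 10, 10, 0, 0, 0, 0]] * 8)  # the example image from part 1

# slice out the 3 by 3 corner of the image that is all 10s
image[0:3, 0:3]
```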
You should see this as the output:
```
array([[10, 10, 10],
       [10, 10, 10],
       [10, 10, 10]])
```
As you should have figured out by now, what we're doing here is slicing the top left 3 by 3 corner of the original image, resulting in a 3 by 3 array of all 10s.
Now run this code:
(I realize that this is just the filter we created in the complete code, but it is still useful to visualize it anyway.)
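Presumably the snippet just displayed the filter array, along these lines:

```python
import numpy as np

filt = np.array([[1, 0, -1],
                 [1, 0, -1],
                 [1, 0, -1]])
filt  # displaying the filter in a notebook
```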
This should be the output:
```
array([[ 1,  0, -1],
       [ 1,  0, -1],
       [ 1,  0, -1]])
```
Notice that again, this is a 3 by 3 array.
As I said before, we need an operation that combines two 3 by 3 arrays into one single number. The two previous lines of code that you ran are the two 3 by 3 arrays that we want to combine. One is the filter and the other is a 3 by 3 corner of the original image. Now that we know the “inputs”, we need to actually use them to produce an output, which should be a single number. There are two steps to do this.
Step 1: Perform an element-wise multiplication. The easiest way to combine two same-sized arrays is an element-wise multiplication. The code to do this is as follows:
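A sketch of that step, with the corner slice and the filter from before:

```python
import numpy as np

image = np.array([[10, 10, 10, 0, 0, 0, 0]] * 8)
filt = np.array([[1, 0, -1]] * 3)

# element-wise multiplication of the two 3 by 3 arrays
product = image[0:3, 0:3] * filt
product
```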
(Yes, this is literally the same as multiplying two numbers)
After running the code, you should see this output:
```
array([[ 10,   0, -10],
       [ 10,   0, -10],
       [ 10,   0, -10]])
```
If it’s still unclear what happens during this element-wise multiplication, think about it like this:
```
array([[10 * 1, 10 * 0, 10 * -1],
       [10 * 1, 10 * 0, 10 * -1],
       [10 * 1, 10 * 0, 10 * -1]])
```
Step 2: Sum up all the values in the array. Since we want a single number as the output, not an array, we need to sum up all the values in the array from the previous step. In case you're unfamiliar with NumPy, there is a NumPy function that does this for us.
Combining the code from both steps yields this:
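That function is np.sum, which adds every element of an array together:

```python
import numpy as np

image = np.array([[10, 10, 10, 0, 0, 0, 0]] * 8)
filt = np.array([[1, 0, -1]] * 3)

product = image[0:3, 0:3] * filt  # the array from step 1
np.sum(product)                   # adds every element together
```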
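A sketch of the combined line, using the same names as before:

```python
import numpy as np

image = np.array([[10, 10, 10, 0, 0, 0, 0]] * 8)
filt = np.array([[1, 0, -1]] * 3)

# step 1 (element-wise multiply) and step 2 (sum) in one line
value = np.sum(image[0:3, 0:3] * filt)
print(value)  # 0
```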
When this code is run, you should see a single number: 0.
Generalizing this code: In this example, the filter has a shape of (3, 3). However, in many cases the filter won't be (3, 3) but (n, n) in shape. In that case, instead of taking a 3 by 3 slice, we take an n by n slice, so our code has to be modified to this:
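A sketch of the generalized line, where y and x mark the top left corner of the current slice (fixed to the image's corner here for illustration):

```python
import numpy as np

image = np.array([[10, 10, 10, 0, 0, 0, 0]] * 8)
filt = np.array([[1, 0, -1]] * 3)

n = filt.shape[0]  # the filter is (n, n); here n = 3
y, x = 0, 0        # top left corner of the current n by n slice
value = np.sum(image[y:y + n, x:x + n] * filt)
```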
Notice that this code is (almost) exactly the same as the code at the beginning of this subsection. By now, you should understand what this line of code does. However, this is only a single “step” of convolutions. The next subsection uses what we did in this subsection to generalize and complete convolutions.
Complete Convolutions (Lines 17–23)
Essentially, the complete convolutions process is not really all that different from what we have already done. In the complete version of convolutions, we repeat the single step convolution from the previous subsection for every single 3 by 3 section in the image and put all those numbers into an array. So, all we’re really doing is wrapping a “for” loop around the code in the previous subsection. Here is the complete convolutions code:
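A self-contained sketch of that code (stride 1, no padding — those come later):

```python
import numpy as np

image = np.array([[10, 10, 10, 0, 0, 0, 0]] * 8)  # the 8 by 7 example image
filt = np.array([[1, 0, -1]] * 3)                 # the 3 by 3 edge filter

# the output is smaller than the input: (8 - 3 + 1) by (7 - 3 + 1) = 6 by 5
out_h = image.shape[0] - filt.shape[0] + 1
out_w = image.shape[1] - filt.shape[1] + 1
convolved = np.zeros((out_h, out_w))

# repeat the single step for every 3 by 3 section of the image
for y in range(out_h):
    for x in range(out_w):
        convolved[y, x] = np.sum(image[y:y + 3, x:x + 3] * filt)
```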
Just like before, let’s break this complicated code down into smaller chunks.
Step 1: Repeating for every 3 by 3 section in our image. Like I said before, we need to be able to take every 3 by 3 section in our image and do what we did in the previous subsection on it. Here is the code that does this:
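A sketch of those loops, with the print command mentioned below already added:

```python
import numpy as np

image = np.array([[10, 10, 10, 0, 0, 0, 0]] * 8)
filt = np.array([[1, 0, -1]] * 3)

# six 3 by 3 windows fit vertically, five fit horizontally
for y in range(6):
    for x in range(5):
        value = np.sum(image[y:y + 3, x:x + 3] * filt)
        print(value)  # 30 numbers in total
```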
We need two “for” loops, one for the vertical dimension, and the second for the horizontal dimension. Since we can fit six 3 by 3 arrays on the vertical dimension, and five 3 by 3 arrays on the horizontal dimension, we have to calculate 30 numbers that are going to be used in the final output array. By adding a “print” command for every iteration of the loop, we can see those 30 numbers.
The 3 lines in this step are the same as lines 4–6 of this subsection's code (lines 20–22 of the complete code). While we now have all the numbers needed to perform the complete convolution, we still need to put them into a final output array.
Step 2: Packaging our numbers. Even though we have done all the calculations needed to complete the convolution, the output should not be a list of numbers but an array of them. Since there are two parts to doing this, let's start with the first 3 lines of our code (lines 17–19 of the complete code):
Back in step 1's code, we used 6 and 5 as the bounds of our "for" loops, resulting in 30 numbers. However, there will not always be this many numbers, because image and filter arrays come in different sizes. It turns out, these two numbers are also the sizes of the output array, one per axis. So, we need a formula for computing this shape.
In this case, the output size along each axis is the image size minus the filter size, plus one, as shown in lines 1 and 2. This should make sense, since we are fitting an array of size 3 into a size of 8 or 7. When we do this for both axes, we get the dimensions of the output array, as well as the number of times we have to loop on lines 4 and 5 (lines 20 and 21 of the complete code).
Now that we have the dimensions of the output array, we can initialize it, which is what line 3 does (line 19 of the complete code). We are creating an empty array that will eventually store the output values we calculated in step 1.
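Those first 3 lines look something like this:

```python
import numpy as np

image = np.array([[10, 10, 10, 0, 0, 0, 0]] * 8)
filt = np.array([[1, 0, -1]] * 3)

# output size along each axis: image size minus filter size, plus one
out_h = image.shape[0] - filt.shape[0] + 1  # 8 - 3 + 1 = 6
out_w = image.shape[1] - filt.shape[1] + 1  # 7 - 3 + 1 = 5
convolved = np.zeros((out_h, out_w))        # empty 6 by 5 output array
```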
Notice that the index values of the "for" loops are exactly the positions where we have to put the calculated values. So, we can add a line inside the nested "for" loops that stores the calculated value (line 6) into the initialized array (line 3), as done on line 7.
Now that all the components of the code have been covered, let’s package all of this up into some final ready-to-use code, which should look very similar to the complete code as presented at the beginning of this post.
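Packaged up, the final code looks something like this (the same sketch as before, now printing the result; dtype=int is an assumption here so the printout shows whole numbers):

```python
import numpy as np

image = np.array([[10, 10, 10, 0, 0, 0, 0]] * 8)  # the 8 by 7 example image
filt = np.array([[1, 0, -1]] * 3)                 # the 3 by 3 edge filter

out_h = image.shape[0] - filt.shape[0] + 1
out_w = image.shape[1] - filt.shape[1] + 1
convolved = np.zeros((out_h, out_w), dtype=int)

for y in range(out_h):
    for x in range(out_w):
        convolved[y, x] = np.sum(image[y:y + 3, x:x + 3] * filt)

print(convolved)
```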
When this code is run, the “convolved” array should contain the output to our convolution example. You should see this when you print the “convolved” array:
```
array([[ 0, 30, 30,  0,  0],
       [ 0, 30, 30,  0,  0],
       [ 0, 30, 30,  0,  0],
       [ 0, 30, 30,  0,  0],
       [ 0, 30, 30,  0,  0],
       [ 0, 30, 30,  0,  0]])
```
At the beginning of this part, I said that convolutions are supposed to produce a feature representation of an image. How does this array represent our image’s features?
This is the image representation of the “convolved” array. Notice that the white part of the image is surrounded by two black parts, essentially creating an “edge”. Also, looking back at our original image, there is a line that separates the black and white parts. Essentially, this output array was able to represent that line in the original image as a feature. Just like in a neural network, this is how a convolutional neural network is able to learn features about the image.
By now, all of the above code should make sense. However, some key concepts are still missing from this code. Every difference between this code and the original code is going to be covered in the next two parts.
Part 3: Padding
In the previous part, we covered the idea of convolutions and wrote code that was able to detect an edge in an image, which is the fundamental idea behind convolutions and why they work so well. However, there are 2 modifications to convolutions that act as solutions to some of the problems of the plain convolution algorithm described in the previous part.
Let’s start with padding.
What is padding?
From the version of convolutions that we know so far, we would use every section of the image that was the shape of the filter. So, we would start at the top left corner and move all the way down to the bottom right section. This way, we would be able to detect features of the entire image, using our filter.
However, notice that at the outermost sections of the image, the filter seems to "ignore" those pixels more than the innermost pixels. So, features at the outer sections of the image are not detected nearly as well as ones in the center. To see why, consider the example image from part 1, and "scan" our 3 by 3 filter over it, which is essentially what we do in convolutions. If we consider every pixel, notice that a corner pixel is used in only one window, as opposed to a pixel towards the center of the image, which can appear in up to nine windows (three positions along each axis).
Since we want features at the outer parts of the image to be detected, we need some way to include those pixels as much as the inner pixels are included. Padding does this for us.
Essentially, padding is a technique that enlarges our image in a way so that the filter can detect everything in the image, so as to solve the problem described above. When I say “enlarge”, I mean that we add more fake pixels that are not actually part of the image but serve as a “pad” around the image.
Now that you have an idea of what padding is, let’s implement it in code.
(The modifications are noted with a + at the front of the line)
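A sketch of the padded version (since runnable code can't carry diff markers, the new lines are tagged with a + in their comments instead):

```python
import numpy as np

padding = 1  # + how many layers of zeros to put around the image
image = np.array([[10, 10, 10, 0, 0, 0, 0]] * 8)
filt = np.array([[1, 0, -1]] * 3)
image = np.pad(image, padding, mode='constant')  # + surround the image with 0s

out_h = image.shape[0] - filt.shape[0] + 1
out_w = image.shape[1] - filt.shape[1] + 1
convolved = np.zeros((out_h, out_w))
for y in range(out_h):
    for x in range(out_w):
        convolved[y, x] = np.sum(image[y:y + 3, x:x + 3] * filt)
```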
Using a single NumPy function, we can implement padding in 1 line of code. This function surrounds our original image with 0s, which is exactly what we want. When we perform padding, we need one parameter: how many layers of padding to surround the image with. This is set on line 3, which we can change if we want more than one layer of padding.
Everything else in the code stays the same, since all the other parameters are based on the variable image, so enlarging the image is automatically accounted for by the other variables.
From here, we only have one other technique used in convolutions. This will be covered in the next part.
Part 4: Strides
The final thing we need to cover is something called strides. Just like padding, this is a mere modification of the original convolutions presented in part 2.
What is the stride?
In part 2, I said that when performing convolutions, we take every subsection of the image and use these to perform convolutions. Essentially, we were moving over one step at a time until there were no more subsections of the image. We call this a stride of 1. However, if we wanted to take every other 3 by 3 subsection of the image, this would be a stride of 2, since we would be moving over a step of 2 before we take another subsection.
Let’s implement this in code.
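A sketch of the strided version (again, the changed lines are tagged with a + in their comments):

```python
import numpy as np

padding = 1
strides = 2  # + take every other window instead of every window
image = np.array([[10, 10, 10, 0, 0, 0, 0]] * 8)
filt = np.array([[1, 0, -1]] * 3)
image = np.pad(image, padding, mode='constant')

# + divide by the stride (integer division), then add one
out_h = (image.shape[0] - filt.shape[0]) // strides + 1
out_w = (image.shape[1] - filt.shape[1]) // strides + 1
convolved = np.zeros((out_h, out_w))

# + jump by the stride amount instead of the default step of 1
for y in range(0, image.shape[0] - filt.shape[0] + 1, strides):
    for x in range(0, image.shape[1] - filt.shape[1] + 1, strides):
        value = np.sum(image[y:y + 3, x:x + 3] * filt)
        convolved[y // strides, x // strides] = value  # + rescale the indices
```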
Since there is quite a bit changing here, I will explain how every + sign contributes to our implementation of strides.
Line 4: This is going to be our variable that stores what stride we want. Since we want to change our stride without going into our code and looking for every place we used that number, we will assign this parameter here, just like we did with padding.
Change 1 (Lines 16–17): When changing the stride, the shape of the output array also changes. It turns out, the shape of the output array is the same as before, except we divide by the stride (using integer division), then add 1. This is shown in lines 16 and 17.
Change 2 (Lines 20–21): Essentially, these "for" loops control how many times we perform convolutions on subsections of the image. Since we no longer want every subsection, we make the loops jump by the stride amount instead of the default step of 1 in a "for" loop. This is shown in lines 20 and 21.
Change 3 (Line 23): The final change that completes our implementation of strides is on the final line. Instead of plugging the x and y variables from our "for" loops straight in, we have to divide them by the stride, since x and y were scaled by that factor in the loops.
Notice that the code that I just presented is an exact replica of the one that I showed at the beginning of this post! It turns out, this is all there really is to know about basic convolutions.
In this post, I covered most of what there is to know about basic convolutions, and how they contribute to computer vision. I used a very code-focused approach, and hopefully it solidified your understanding of convolutions. Going from here, I would recommend that you try to reimplement this completely from scratch, from memory. This should help tremendously and solidify your knowledge even more. When doing this, do one thing at a time: code the basic unmodified version of convolutions first, then go back and add padding and strides.
What I hope you take from this post is that coding these concepts from scratch in NumPy can be a very powerful way of learning them. It can also help when working with frameworks, since you know what's going on behind the scenes, as opposed to using a black box.
Note: If you have anything to say, any and all feedback is appreciated. Also, if you enjoyed the post, feel free to follow me on Twitter.