Source: Deep Learning on Medium
Learn the foundations of convolutional neural networks for computer vision and build a CNN with TensorFlow
Recent advances in deep learning have made computer vision applications leap forward: from unlocking our mobile phone with our face, to safer self-driving cars.
Convolutional neural networks (CNNs) are the architecture behind computer vision applications. In this post, you will learn about the foundations of CNNs and computer vision, such as the convolution operation, padding, strided convolutions and pooling layers. Then, we will use TensorFlow to build a CNN for image recognition.
As the name suggests, the convolution operation is the building block of a convolutional neural network.
Now, in the field of computer vision, an image can be expressed as a matrix of RGB values. This concept was actually introduced in an earlier post.
To complete the convolution operation, we need an image and a filter.
Therefore, let’s consider the 6×6 matrix below as a part of an image:
And the filter will be the following matrix:
Then, the convolution involves sliding the filter over the image matrix and, at each position, summing the element-wise products of the filter values and the overlapping image values, which generates a 4×4 output matrix.
This is very hard to put in words, but here is a nice animation that explains the convolution:
Performing this on the image matrix above and using the filter defined above, you should get the following resulting matrix:
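Since the figures above are not embedded in this extract, the operation can be sketched in NumPy. The 6×6 image (light on the left, dark on the right) and the 3×3 vertical edge filter below are illustrative values chosen to reproduce the classic edge-detection example, not necessarily the exact ones from the original figure:

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid convolution: slide the kernel over the image and
    sum the element-wise products at each position."""
    n, f = image.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

# A 6x6 image: light pixels (10) on the left, dark pixels (0) on the right
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)

# A 3x3 vertical edge detector
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)

print(convolve2d(image, kernel))
# every row is [0, 30, 30, 0]: the bright band marks the vertical edge
```

The large values in the middle columns of the 4×4 output sit exactly where the image transitions from light to dark, which is the edge the filter detects.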
How do you interpret the output layer?
Well, each value indicates how light or dark a pixel is (a positive value means light, a negative value means dark), so you can interpret the output layer as:
Therefore, it seems that this particular filter is responsible for detecting vertical edges in images!
How do you choose the right filter?
This is a natural question, as you might realize that there is an infinite number of possible filters you can apply to an image.
It turns out that the exact values in your filter matrix can be trainable parameters based on the model’s objective. Therefore, you can either choose a filter that has worked for your specific application, or you can use backpropagation to determine the best values for your filter that will yield the best outcome.
Padding in computer vision
Previously, we have seen that a 3×3 filter convolved with a 6×6 image results in a 4×4 matrix. This is because there are 4×4 possible positions for the filter to fit in a 6×6 image.
Therefore, after each convolution step, the image shrinks, meaning that only a limited number of convolutions can be performed before the image becomes too small. Furthermore, pixels situated in the corners of the image are used only once, which results in a loss of information for the neural network.
In order to solve both problems stated above, padding is used. Padding consists of adding a border around the input image as shown below:
As you can see, the added border is usually filled with zeros. Now, the corner pixels of the image are used multiple times to calculate the output, effectively preventing information loss. It also allows us to keep the shape of the input matrix in the output.
Considering our 6×6 input image, if we add a padding of 1, we get an 8×8 matrix. Applying a 3×3 filter, this will result in a 6×6 output.
A simple equation can help us figure out the shape of the output: with an n×n input, an f×f filter and a padding of p, the output is an (n + 2p - f + 1)×(n + 2p - f + 1) matrix.
To reiterate, we have:
- 6×6 input
- padding of 1
- 3×3 filter
Thus, the output shape will be: 6+2(1)-3+1 = 6. Therefore, the output will be a 6×6 matrix, just like the input image!
Padding is not always required. However, when it is used, it is usually so that the output has the same size as the input image. This gives rise to two types of convolutions.
When no padding is applied, this is called a “valid convolution”. Otherwise, it is termed a “same convolution”. To determine the padding size required to keep the dimensions of the input image, simply set the formula above equal to n. Solving for p, you should get: p = (f - 1)/2.
You might have noticed that f must be odd for the padding to be a whole number. Hence, it is a convention in the field of computer vision to use odd-sized filters.
Strided convolutions
Previously, we have seen a convolution with a stride of 1. This means that the filter was moving horizontally and vertically by 1 pixel.
A strided convolution is when the stride is greater than 1. In the animation below, the stride is 2:
Now, taking into account the stride s, the formula to calculate the shape of the output matrix becomes: (n + 2p - f)/s + 1.
As a convention, if the formula above does not yield a whole number, then we round down to the nearest integer.
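Putting the padding and stride formulas together, a small helper (the function name is my own) computes the output size, rounding down as per the convention above:

```python
def conv_output_size(n, f, p=0, s=1):
    """Output size of an n x n input convolved with an f x f filter,
    with padding p and stride s: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3))       # valid convolution: 4
print(conv_output_size(6, 3, p=1))  # same convolution: 6
print(conv_output_size(7, 3, s=2))  # strided convolution: 3
```

With p = 0 and s = 1 this reduces to the n - f + 1 shrinkage seen earlier, and with p = (f - 1)/2 it returns n, confirming the “same convolution” padding rule.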
Pooling layers
Pooling layers are another way to reduce the size of the representation in order to speed up computation, and they make the detected features more robust.
Pooling is best explained with an image. Below is an example of max pooling:
As you can see, we chose a 2×2 filter with a stride of 2. This is equivalent to dividing the input into 4 identical squares; we then take the maximum value of each square and use it in the output.
Average pooling can also be performed, but it is less popular than max pooling.
You can think of pooling as a way to prevent overfitting, since we are removing some features from the input image.
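Since the pooling figure is not embedded here, the same idea can be sketched in NumPy; the 4×4 input values below are illustrative:

```python
import numpy as np

def max_pool(x, f=2, s=2):
    """Max pooling: take the maximum of each f x f window, moving with stride s."""
    n = (x.shape[0] - f) // s + 1
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            out[i, j] = x[i * s:i * s + f, j * s:j * s + f].max()
    return out

x = np.array([[1, 3, 2, 1],
              [4, 6, 6, 8],
              [3, 1, 1, 0],
              [1, 2, 2, 4]], dtype=float)

print(max_pool(x))
# [[6. 8.]
#  [3. 4.]]
```

Each of the four 2×2 quadrants contributes a single value (its maximum) to the output, halving each spatial dimension.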
Why use a convolutional neural network?
We now have a strong foundational knowledge of convolutional neural networks. However, why do deep learning practitioners use them?
Unlike fully connected layers, convolutional layers have a much smaller set of parameters to learn. This is due to:
- parameter sharing
- sparsity of connections
Parameter sharing refers to the fact that one feature detector, such as a vertical edge detector, will be useful in many parts of the image. Then, the sparsity of connections refers to the fact that each output value depends on only a few input features.
Considering the above example of max pooling, the top left value for the output depends solely on the top left 2×2 square from the input image.
Therefore, we can train on smaller datasets and greatly reduce the number of parameters to learn, making CNNs a great tool for computer vision tasks.
Building a CNN with TensorFlow
Enough with the theory, let’s code a CNN for hand sign recognition. We revisit a previous project to see if a CNN will perform better.
As always, the full notebook is available here.
Step 1: Preprocess the images
After importing the required libraries and assets, we load the data and preprocess the images:
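The original code gist is not embedded in this extract, so here is a minimal sketch of the preprocessing step, assuming the usual setup for this exercise: RGB images with integer pixel values in [0, 255] and integer class labels to be one-hot encoded (the function name and 6-class default are my own):

```python
import numpy as np

def preprocess(X_train_orig, Y_train_orig, X_test_orig, Y_test_orig, n_classes=6):
    """Scale pixel values to [0, 1] and one-hot encode the integer labels."""
    X_train = X_train_orig / 255.0
    X_test = X_test_orig / 255.0
    # reshape(-1) accepts labels stored either as (m,) or (1, m)
    Y_train = np.eye(n_classes)[Y_train_orig.reshape(-1)]
    Y_test = np.eye(n_classes)[Y_test_orig.reshape(-1)]
    return X_train, Y_train, X_test, Y_test
```

Normalizing the inputs and one-hot encoding the targets are the two standard preprocessing steps before feeding images to a softmax classifier.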
Step 2: Create placeholders
Then, we create placeholders for the features and the target:
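A sketch of this step, written in TF 1.x graph style via `tf.compat.v1` so it also runs on TF 2.x (the function name and the 64×64×3, 6-class dimensions used later are assumptions matching a typical hand-sign dataset):

```python
import tensorflow as tf
tf.compat.v1.disable_eager_execution()  # use TF 1.x-style graphs

def create_placeholders(n_H, n_W, n_C, n_y):
    """Placeholders for a batch of images and their one-hot labels.
    The batch dimension is left as None so any batch size can be fed."""
    X = tf.compat.v1.placeholder(tf.float32, shape=[None, n_H, n_W, n_C], name="X")
    Y = tf.compat.v1.placeholder(tf.float32, shape=[None, n_y], name="Y")
    return X, Y
```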
Step 3: Initialize parameters
We then initialize our parameters using Xavier initialization:
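A sketch of the parameter initialization, again in `tf.compat.v1` style. The filter shapes (4×4×3×8 and 2×2×8×16) are typical choices for this exercise, not necessarily the author's exact values; Xavier initialization is TensorFlow's Glorot uniform initializer:

```python
import tensorflow as tf
tf.compat.v1.disable_eager_execution()

def initialize_parameters():
    """Two filter banks with Xavier (Glorot uniform) initialization.
    Filter shapes are [f, f, in_channels, out_channels]."""
    init = tf.compat.v1.glorot_uniform_initializer()
    W1 = tf.compat.v1.get_variable("W1", [4, 4, 3, 8], initializer=init)
    W2 = tf.compat.v1.get_variable("W2", [2, 2, 8, 16], initializer=init)
    return {"W1": W1, "W2": W2}
```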
Step 4: Define forward propagation
Now, we define the forward propagation step, which is really the architecture of our CNN. We will use a simply 3-layer network with 2 convolutional layers and a final fully-connected layer:
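A sketch of the forward pass: two CONV → RELU → MAXPOOL blocks followed by a flatten and a fully connected layer. The pool sizes and the 6 output classes are assumptions matching the exercise; the final layer has no activation because the softmax is folded into the cost function:

```python
import tensorflow as tf
tf.compat.v1.disable_eager_execution()

def forward_propagation(X, parameters):
    """CONV2D -> RELU -> MAXPOOL -> CONV2D -> RELU -> MAXPOOL -> FLATTEN -> FC."""
    W1, W2 = parameters["W1"], parameters["W2"]
    Z1 = tf.nn.conv2d(X, W1, strides=[1, 1, 1, 1], padding="SAME")
    P1 = tf.nn.max_pool2d(tf.nn.relu(Z1), [1, 8, 8, 1], [1, 8, 8, 1], "SAME")
    Z2 = tf.nn.conv2d(P1, W2, strides=[1, 1, 1, 1], padding="SAME")
    P2 = tf.nn.max_pool2d(tf.nn.relu(Z2), [1, 4, 4, 1], [1, 4, 4, 1], "SAME")
    # Flatten the pooled feature maps, then a fully connected layer to 6 classes
    n_flat = int(P2.shape[1] * P2.shape[2] * P2.shape[3])
    F = tf.reshape(P2, [-1, n_flat])
    W3 = tf.compat.v1.get_variable(
        "W3", [n_flat, 6], initializer=tf.compat.v1.glorot_uniform_initializer())
    b3 = tf.compat.v1.get_variable(
        "b3", [6], initializer=tf.compat.v1.zeros_initializer())
    return tf.matmul(F, W3) + b3  # raw logits; softmax is applied in the cost
```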
Step 5: Compute cost
Finally, we define a function that will compute the cost:
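A sketch of the cost: the mean softmax cross-entropy between the network's logits and the one-hot labels, which is the standard choice for multi-class classification:

```python
import tensorflow as tf
tf.compat.v1.disable_eager_execution()

def compute_cost(Z3, Y):
    """Mean softmax cross-entropy between logits Z3 and one-hot labels Y."""
    return tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(logits=Z3, labels=Y))
```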
Step 6: Combine all functions into a model
Now, we combine all the functions above into a single CNN network. We will use mini-batch gradient descent for training:
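Since the original gist is not embedded here, below is a compact sketch that wires the previous steps into one graph and trains it. It uses mini-batch Adam rather than plain gradient descent (a common choice for this exercise); the hyperparameter defaults are illustrative:

```python
import numpy as np
import tensorflow as tf
tf.compat.v1.disable_eager_execution()

def model(X_train, Y_train, X_test, Y_test,
          learning_rate=0.009, num_epochs=100, minibatch_size=64):
    """Build the CNN graph, train with mini-batch Adam, and report accuracies."""
    tf.compat.v1.reset_default_graph()
    (m, n_H, n_W, n_C), n_y = X_train.shape, Y_train.shape[1]
    X = tf.compat.v1.placeholder(tf.float32, [None, n_H, n_W, n_C])
    Y = tf.compat.v1.placeholder(tf.float32, [None, n_y])

    # Parameters with Xavier (Glorot uniform) initialization
    init = tf.compat.v1.glorot_uniform_initializer()
    W1 = tf.compat.v1.get_variable("W1", [4, 4, n_C, 8], initializer=init)
    W2 = tf.compat.v1.get_variable("W2", [2, 2, 8, 16], initializer=init)

    # Forward propagation: CONV -> RELU -> POOL, twice, then FLATTEN -> FC
    A1 = tf.nn.max_pool2d(tf.nn.relu(tf.nn.conv2d(X, W1, [1, 1, 1, 1], "SAME")),
                          [1, 8, 8, 1], [1, 8, 8, 1], "SAME")
    A2 = tf.nn.max_pool2d(tf.nn.relu(tf.nn.conv2d(A1, W2, [1, 1, 1, 1], "SAME")),
                          [1, 4, 4, 1], [1, 4, 4, 1], "SAME")
    n_flat = int(A2.shape[1] * A2.shape[2] * A2.shape[3])
    F = tf.reshape(A2, [-1, n_flat])
    W3 = tf.compat.v1.get_variable("W3", [n_flat, n_y], initializer=init)
    b3 = tf.compat.v1.get_variable("b3", [n_y],
                                   initializer=tf.compat.v1.zeros_initializer())
    Z3 = tf.matmul(F, W3) + b3

    cost = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(logits=Z3, labels=Y))
    optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate).minimize(cost)
    accuracy = tf.reduce_mean(
        tf.cast(tf.equal(tf.argmax(Z3, 1), tf.argmax(Y, 1)), tf.float32))

    with tf.compat.v1.Session() as sess:
        sess.run(tf.compat.v1.global_variables_initializer())
        for epoch in range(num_epochs):
            perm = np.random.permutation(m)  # reshuffle mini-batches each epoch
            for k in range(0, m, minibatch_size):
                idx = perm[k:k + minibatch_size]
                sess.run(optimizer, {X: X_train[idx], Y: Y_train[idx]})
        train_acc = sess.run(accuracy, {X: X_train, Y: Y_train})
        test_acc = sess.run(accuracy, {X: X_test, Y: Y_test})
    return train_acc, test_acc, {"W1": W1, "W2": W2, "W3": W3, "b3": b3}
```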
Great! Now, we can run our model and see how it performs:
_, _, parameters = model(X_train, Y_train, X_test, Y_test)
In my case, I trained the CNN on a laptop using only CPU, and I got a pretty bad result. If you train the CNN on a desktop with a better CPU and GPU, you will surely get better results than I did.
Congratulations! You now have a very good knowledge about CNNs and the field of computer vision. Although there is much more to learn, the more advanced techniques use the concepts introduced here as building blocks.
In the next post, I will introduce residual networks with Keras!