CNNs: The Key to Computer Vision
Whether it’s an orange, a pen, or a phone, we as humans are excellent at recognizing and identifying objects. Telling the difference between a car and a piano might be trivial for us, but computers don’t have it quite that easy.
Computers can’t naturally identify objects in the physical world. At the lowest level, all a computer works with is binary ones and zeros. However, like most things in the known universe, it isn’t quite that simple.
Using Machine Learning, or ML for short, we have been increasingly able to teach our computers how to do tasks previously limited to humans, whether it’s playing video games, generating art, or even classifying images. The act of computers identifying and classifying images is often referred to as computer vision, and the field exploded onto the scene in 2012 with the creation of AlexNet, a Convolutional Neural Network.
What really is a CNN?
Although the name doesn’t lend itself to simplicity, Convolutional Neural Networks (CNNs) like AlexNet operate on some fairly straightforward premises. First and foremost, there’s the concept of a convolution, which is the combination of two functions to create a third function. To better understand this, let’s start out with the example of a 5 pixel by 5 pixel image that’s in grayscale (black and white) being passed through the network.
We can represent this image as a two-dimensional array, where each value is between 0 and 1 and represents how dark or light that pixel is. This is the first function for the convolution, known as the input. The second is a filter, which we can view as a scanner that slides over the original input. In the case of the gif, the filter is the 3×3 array being scanned across the input.
At each position, each of the filter’s nine values is multiplied by the corresponding value on the image, and the nine products are summed. The resulting value is recorded in the output of the convolutional layer, which is visualized as the red values. Two important parameters of the filter are the stride and the kernel size. The stride is how many pixels the filter moves between positions. In our example the stride is one, meaning the scanner stops at every position it can fit. The kernel size is the length/width of the filter, which is three in our example.
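To make the multiply-and-sum step concrete, here is a minimal sketch in plain Python. The function name convolve2d and the pixel and filter values are made up for illustration, and the filter is applied without flipping it, which is how deep learning libraries typically implement a "convolution".

```python
def convolve2d(image, kernel, stride=1):
    """Slide a square kernel over the image (no padding) and return the feature map."""
    k = len(kernel)  # kernel size, e.g. 3 for a 3x3 filter
    out_h = (len(image) - k) // stride + 1
    out_w = (len(image[0]) - k) // stride + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            top, left = r * stride, c * stride
            # Multiply each filter value by the pixel under it, then sum the products.
            total = sum(
                image[top + di][left + dj] * kernel[di][dj]
                for di in range(k)
                for dj in range(k)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map


# Hypothetical 5x5 grayscale image (values between 0 and 1) and 3x3 filter.
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = convolve2d(image, kernel, stride=1)
for row in feature_map:
    print(row)
# [4, 3, 4]
# [2, 4, 3]
# [2, 3, 4]
```

With a stride of one and no padding, the 3×3 filter only fits in nine positions on the 5×5 image, so the output here is 3×3.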
This new gif shows a feature map (the output of applying the filter to the input) that has the same size as the input. This is usually what we want in our convolutional layers, and it is the result of padding, a border of zero values added around the image. Padding lets the filter be centered on the edge pixels as well, so the size of the array stays constant from one layer to the next.
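The effect of padding can be checked with the usual output-size rule: for an input of width n, kernel size k, padding p, and stride s, the output width is (n + 2p - k) / s + 1. With n = 5, k = 3, p = 1, and s = 1 that gives 5, the same as the input. The sketch below illustrates this in plain Python; the helper names and the pixel values are hypothetical, and it assumes a square filter with an odd kernel size.

```python
def pad_with_zeros(image, pad):
    """Return a copy of the image with `pad` rows/columns of zeros on every side."""
    width = len(image[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]
    for row in image:
        padded.append([0] * pad + list(row) + [0] * pad)
    padded += [[0] * width for _ in range(pad)]
    return padded


def convolve2d(image, kernel):
    """Stride-1 convolution without padding (filter is not flipped)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [
        [
            sum(image[r + di][c + dj] * kernel[di][dj]
                for di in range(k) for dj in range(k))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def convolve2d_same(image, kernel):
    """Pad with zeros first so the feature map keeps the input's size."""
    pad = (len(kernel) - 1) // 2  # 1 for a 3x3 filter
    return convolve2d(pad_with_zeros(image, pad), kernel)


# Hypothetical 5x5 image and 3x3 filter.
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

same_map = convolve2d_same(image, kernel)
print(len(same_map), len(same_map[0]))  # 5 5
```

The inner 3×3 block of the padded result matches what the filter produces without padding; the extra ring of values comes from positions where the filter overlaps the zero border.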
Features and Convolutional Layers
Now, let’s backtrack a little bit and return to the idea of features, which is what makes CNNs so special. Features are specific parts of images; for a car, a feature might be its wheels, windows, or exhaust. In a CNN, we can combine multiple convolutions, each with its own filter, to form a convolutional layer, which can detect multiple features that distinguish images from each other. In the early layers the computer may detect simple features such as edges or circles, and the features increase in complexity as more filters and layers are added.
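As a rough illustration of how different filters pick out different features, the sketch below applies two hand-written 3×3 edge filters to the same image. The filter values, names, and the image are assumptions chosen for the example; in a real CNN the filter values are learned during training rather than written by hand.

```python
def convolve2d(image, kernel):
    """Stride-1 convolution without padding (filter is not flipped)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [
        [
            sum(image[r + di][c + dj] * kernel[di][dj]
                for di in range(k) for dj in range(k))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Hypothetical image: dark on the left, bright on the right (one vertical edge).
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Two illustrative filters; each one responds to a different kind of edge.
filters = {
    "vertical_edge": [
        [-1, 0, 1],
        [-1, 0, 1],
        [-1, 0, 1],
    ],
    "horizontal_edge": [
        [-1, -1, -1],
        [0, 0, 0],
        [1, 1, 1],
    ],
}

# A convolutional layer produces one feature map per filter.
feature_maps = {name: convolve2d(image, k) for name, k in filters.items()}

for name, fmap in feature_maps.items():
    print(name, fmap)
# vertical_edge [[3, 3, 0], [3, 3, 0], [3, 3, 0]]
# horizontal_edge [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
```

The vertical-edge filter gives large values where the image goes from dark to bright, while the horizontal-edge filter stays at zero because this image has no horizontal edge. Stacking many such feature maps is what lets later layers build up more complex features.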