Source: Deep Learning on Medium
Introducing Convolutional Neural Networks
Convolutional Neural Networks, a.k.a. convnets or CNNs, are the superstars of neural networks in Deep Learning. These networks can perform relatively complex tasks on images, sounds, texts, videos, etc. The first successful convolutional networks were developed in the late 1990s by Professor Yann LeCun at Bell Labs.
In MNIST-like classification problems, multilayer perceptrons (MLPs) are widely used and provide very good results (high accuracy), and the training time remains reasonable. However, for larger datasets with larger images, the number of parameters of the model grows very quickly; training becomes longer and performance poorer.
Explosion of the number of parameters
MLPs are very effective at solving several categories of problems such as classification, estimation, approximation, recognition, etc. However, they quickly reach their limits, especially with very high-dimensional input data.
Let’s say we want to create a Deep Learning model to perform a cat-versus-dog classification task. The model is trained on a dataset of 420x420px grayscale images; after training, we want the model to say whether an image shows a dog or a cat. A first approach would be to use a multilayer perceptron with, for example, a first hidden layer of 128 neurons. Since each neuron of that layer receives the 420×420 pixel values as input and assigns a weight to each pixel value, we get 420×420 = 176,400 weights per neuron, multiplied by 128 neurons, to which we finally add 128 biases, one per neuron. This tiny model already has to learn more than 22 million parameters, for only 128 neurons in a single intermediate layer. If we want a powerful model, it will take several additional layers and more neurons per layer; the number of parameters explodes. For example, a 3-layer model (128, 256, 1 neurons) requires almost 23 million parameters. Such a model needs a huge amount of data to be trained, and since there are far more parameters to adjust, the training phase lasts longer and the performance drops. Let’s say it clearly: even with sufficient computing power, it would be very hard to train this model correctly, and this is not the only problem to deal with.
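The arithmetic above can be checked in a few lines of Python; this is just a quick sketch (the helper name `mlp_params` is purely illustrative):

```python
def mlp_params(input_size, layer_sizes):
    # Count weights + biases of a fully connected network:
    # each neuron gets one weight per input plus one bias.
    total, prev = 0, input_size
    for n in layer_sizes:
        total += prev * n + n
        prev = n
    return total

# Single hidden layer of 128 neurons on a 420x420 grayscale image:
print(mlp_params(420 * 420, [128]))          # 22,579,328 parameters
# The 3-layer model (128, 256, 1 neurons):
print(mlp_params(420 * 420, [128, 256, 1]))  # 22,612,609 parameters
```

Both counts confirm the figures quoted above: over 22 million parameters for one hidden layer, almost 23 million for three.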
Spatial degradation of the image
Training a multilayer perceptron requires providing the input data as a vector (a one-dimensional array). If the input is an image matrix, it has to be flattened into a vector. However, in natural images there are very strong links between neighboring pixels; by flattening the image we lose this information, and it becomes harder for the network to interpret the data during training. This justifies the interest in another neural network architecture for this kind of problem. Let’s begin by presenting what a convolution consists of.
The convolution algorithm
A convolution is a kind of product of a filter (also called a kernel) with an image matrix, used to extract some predetermined characteristics from it. Literally speaking, we use a convolution filter to “filter” the image and display only what really matters to us. The image under consideration is a matrix, and the filters are also matrices, generally 3×3 or 5×5. Let’s see how convolution works with the following kernel,
The 6x6px matrix represents an image. At the beginning, the convolution kernel, here the 3×3 matrix, is positioned on the top-left corner of the image matrix, so that it covers part of it. We then take the element-wise product of the two overlapping blocks and finally sum these products; the result corresponds to one pixel of the output image.
Then we move the convolution kernel horizontally to the right by one pixel, compute a new element-wise product and sum it up to get a new coefficient of the output image.
Once at the end of a line, the kernel takes a vertical stride down and starts again from the left; we iterate until the kernel has covered the whole image matrix. It is important to note that the kernel always remains within the initial matrix, without overflowing.
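The sliding procedure described above can be sketched in a few lines of NumPy (a toy illustration, not an optimized implementation; note that what Deep Learning calls “convolution” is technically a cross-correlation, since the kernel is not flipped):

```python
import numpy as np

def convolve2d(image, kernel):
    # Slide the kernel over the image; at each position, take the
    # element-wise product of the overlapping blocks and sum it.
    n, p = image.shape[0], kernel.shape[0]
    m = n - p + 1  # output size without padding
    out = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            out[i, j] = np.sum(image[i:i + p, j:j + p] * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)  # toy 6x6 "image"
kernel = np.ones((3, 3)) / 9.0                    # 3x3 averaging filter
print(convolve2d(image, kernel).shape)            # (4, 4)
```

As in the animation, a 6x6px input with a 3×3 kernel yields a 4x4px output.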
For sure, we cannot use just any filter: the coefficients of our kernel depend on the features we want the filter to highlight. Let’s see the result of a convolution with some well-known filters.
Vertical Sobel filter
Its action is to highlight the vertical lines of the object. Applied to the initial image on the left side, here is the result,
Horizontal Sobel filter
This time the goal is to highlight the horizontal contours of the image. Here is the result applied to the left-side image,
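To make this concrete, here is a small sketch applying both Sobel kernels to a toy image containing a single vertical edge, reusing the slide-and-sum procedure from earlier:

```python
import numpy as np

# Standard 3x3 Sobel kernels; the horizontal one is the transpose
# of the vertical one.
sobel_vertical = np.array([[-1, 0, 1],
                           [-2, 0, 2],
                           [-1, 0, 1]])
sobel_horizontal = sobel_vertical.T

def apply_filter(image, kernel):
    # Slide-and-sum, exactly as in the convolution walkthrough above.
    m = image.shape[0] - kernel.shape[0] + 1
    return np.array([[np.sum(image[i:i + 3, j:j + 3] * kernel)
                      for j in range(m)] for i in range(m)])

# Toy image with one sharp vertical edge: dark left half, bright right half.
image = np.zeros((6, 6))
image[:, 3:] = 255.0

print(apply_filter(image, sobel_vertical).max())    # 1020.0: strong response
print(apply_filter(image, sobel_horizontal).max())  # 0.0: no horizontal edge
```

The vertical filter responds strongly where the edge lies, while the horizontal filter sees nothing, which is exactly the selectivity described above.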
The actions of these filters can be combined to perform more complex operations. Several other well-known filters can be used directly this way, depending on the task to be solved: the average filter, the Gaussian filter, etc. Before the emergence of Deep Learning, human experts had to calculate and determine the right filters to use in order to perform specific image processing actions: face detection, photo editing, Snapchat-like filters, etc. Now, with Deep Learning, determining these filters is done automatically by learning: the model finds, according to the problem to solve, the right filters from the training data. In a cat-versus-dog classification problem, for example, the filters will highlight the characteristics that matter for the classification: the shape of the ears, the shape of the eyes, the shape of the muzzle, the contours, etc.
Convolutional Neural Networks
Convolutional neural networks, also known as CNNs or convnets, use the convolution technique introduced above to build models that solve a wide variety of problems by training on a dataset. Let’s look at the details of a convolutional network in a classical cat-versus-dog classification problem.
Deep Learning approach for convolution
In this classification problem, we have two categories, namely dog and cat. Classifying an image into one of these categories depends on singular characteristics such as the shape of the skull, the shape of the ears, the shape of the eyes, etc. Only these characteristics matter for this classification task; the other information is not important to us.
The ideal would thus be, starting from an image, to extract the main characteristics relevant to the classification problem by using appropriate filters. In a Deep Learning context, it is up to the model to determine these filters by training on the dataset.
Training starts with randomly initialized values for the kernel (the filter); during training, these values are updated through gradient backpropagation. Since we are talking about Deep Learning, we can guess that several convolution layers will be stacked one after the other to increase the model’s performance.
Padding and edge effect
Let’s take again the example of the convolution animated above and look at the dimensions of the output image, also called the feature map. The input image is a 6x6px matrix and the filter is a 3×3 matrix; we can see that the feature map (output matrix) is 4x4px. Generally speaking, the size of the feature map (without padding) is

m = n - p + 1

where n represents the dimension of the input image and p the size of the filter. For example, for an initial 120x120px image and a 5×5 kernel, the feature map is 116x116px. Note that convolution reduces the size of the input image. If we want to output a feature map with the same size as the input image, we have to add zeros around the input image before the convolution; the input image is “padded with zeros”, hence the name of this operation, padding. Let’s look at an example.
Case of the 6x6px matrix with a 3×3 filter.
We want the feature map to have the same dimensions as the input image, namely 6x6px, with a 3×3 convolution filter. In the equation given above we then have p = 3 and m = 6, so n = 6 + 3 - 1 = 8. Thus the input image must be 8x8px, which means we must add zeros all around the original image matrix to reach a size of 8x8px; hence the following matrix,
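With NumPy, this zero-padding is a single call to `np.pad`; a minimal sketch:

```python
import numpy as np

image = np.arange(36).reshape(6, 6)  # toy 6x6 image

# Add one row/column of zeros on every side: 6x6 -> 8x8, so that a 3x3
# convolution without padding outputs a 6x6 feature map again.
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)
print(padded.shape)  # (8, 8)
```

The original pixels sit untouched in the center; only the border is filled with zeros.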
Padding is not mandatory, and it is often omitted; it may seem anecdotal, but it has a real utility. Look at the edge pixels, the topmost-left one for example: it sees the convolution kernel only once, while most other pixels see it several times. The edge pixels therefore have less influence on the feature map. To limit this side effect, we “pad” the original input image with zeros, so that the pixels at the edges are not under-represented.
During training, we know that the coefficients of the filters are updated; these can be negative, as we saw above with the Sobel filter, so the coefficients of the feature map can hold large negative values during training. Since these values represent pixel levels, and are therefore positive, we can apply a function that replaces the negative values with zeros and keeps the positive values as they are. This is an activation function called ReLU. Keep in mind that there are other activation functions; the idea is essentially the same as with ReLU, to keep the feature map values reasonable.
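ReLU is essentially a one-liner; a minimal sketch:

```python
import numpy as np

def relu(x):
    # Replace negative values with zero, keep positive values as they are.
    return np.maximum(x, 0)

feature_map = np.array([[-3.0, 1.5],
                        [0.0, -0.5]])
print(relu(feature_map))  # negatives become 0, positives unchanged
```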
As a reminder, a convolution applies a filter or kernel to an input image; we then get a feature map that highlights characteristics or “features” of the input image: outlines, spots, shapes, etc. Each filter has a simple and precise task to achieve. So, to solve our classification problem (cat or dog), we will have to use several filters; by combining the features highlighted by those filters, such as the shape of the ears, the eyes and the contours, our model will be able to learn to distinguish a dog from a cat. We therefore have to choose the number of convolutions to perform, hence the number of filters to use, knowing that the more filters we have, the more details are extracted for the classification and the more parameters the model has to learn, but the better the performance tends to be (greater accuracy). After that, we have to decide whether to use padding and which activation function to use. This defines a convolution layer. How many filters to use? 64, 128, 256 … Which activation function? ReLU, sigmoid, tanh … With or without padding?
Let’s say we have a convolution layer with 128 filters, for example; it produces one feature map per filter, so 128 feature maps in total for one input image. These feature maps represent different pieces of information contained in the image, so we can see them as different channels of this image. Just as an RGB image contains three channels (red, green and blue), a convolution layer with 128 filters outputs a single feature map with 128 channels. If the same image goes through another layer of 256 filters, it comes out as a 256-channel feature map, and so on.
In a natural image, there is a strong local correlation between pixels. This means that if you know a pixel is red, it is very likely that its four closest pixels are also shades of red. In a grayscale image, if a pixel has an intensity of 180, for example, its nearest pixels will also have intensities around 180. It is therefore possible to reduce the dimensions of the image by keeping only one local representative per block of pixels; this is called pooling. Generally speaking, we take the pixel with the maximum intensity (the highest value) in the block (max pooling); we can also average the intensities of the pixels in the block (average pooling) and keep that as the representative.
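A 2×2 max-pooling pass can be sketched with a reshape trick in NumPy (purely illustrative; in practice, frameworks provide this as a ready-made layer):

```python
import numpy as np

def max_pool(feature_map, size=2):
    # Keep the largest value of each non-overlapping size x size block,
    # halving the height and width of the feature map (for size=2).
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % size, :w - w % size]
    return trimmed.reshape(h // size, size, w // size, size).max(axis=(1, 3))

x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 1],
              [0, 0, 5, 6],
              [1, 2, 7, 8]])
print(max_pool(x))  # [[4 2]
                    #  [2 8]]
```

Each entry of the output is the maximum of one 2×2 block of the input, so a 4×4 map becomes 2×2.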
As can be seen above, pooling halves each dimension (height and width) of the input image. One could think that pooling strongly degrades the initial image by representing a block of pixels by only one; but in fact, although the output image (feature map) is half the size, it retains the main characteristics of the input image.
For example, let’s apply max pooling to the feature map outputted by the horizontal Sobel filter, “pooled” images have been enlarged for comparison.
Note that even after two pooling operations, the contours are still clearly visible, although the image is less rich in detail. Pooling is not just about resizing; it also keeps only the meaningful features of the input image.
Typically, in a convolutional network, there is a sequence of operations: convolution, pooling, convolution, pooling, and so on. By repeating these operations many times, we finally get feature maps containing only the characteristics that are meaningful for the problem to solve. We can now exploit the power of a multilayer perceptron to achieve the classification task.
Flatten the feature map
To finally classify the image into a category, say cat or dog, we set up a multilayer perceptron (MLP) on top of the last convolution layer. The previous convolution and pooling operations have greatly reduced the size of the input image, keeping only the characteristics that are meaningful for the classification. Since an MLP requires input vectors (one-dimensional arrays, or 1d arrays), we need to “flatten” the output feature maps. The MLP therefore receives the small feature maps as a 1d array and chooses the corresponding category based on them.
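As a sketch, suppose the last pooling layer outputs 128 feature maps of size 3x3 (these sizes are chosen arbitrarily for illustration); flattening them gives the vector the MLP receives:

```python
import numpy as np

# 128 feature maps of size 3x3, stored channels-last as in most frameworks.
feature_maps = np.zeros((3, 3, 128))

vector = feature_maps.flatten()  # the 1d array fed to the MLP classifier
print(vector.shape)  # (1152,)
```

3 × 3 × 128 = 1,152 values: a far smaller input for the MLP than the original image's 176,400 pixels.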
Now you know all about the key steps in convolutional neural networks. If you want to dive deeper into convnets, you can read the following paragraphs.
A more advanced view of convolution
In the previous description, we said that a convolution multiplies a sliding matrix (the kernel or filter) with an input image matrix. Although this explanation is widespread in the Deep Learning community, the actual picture is slightly different, though not much more complex. Let’s take the example of our first convolution (the beautiful gif) with a 3×3 kernel and a 6x6px matrix. Using the formula above, the feature map is 4x4px.
The value of the top-left pixel of the feature map depends directly both on the pixel values of the input image and on the values of the convolution kernel. According to the convolution algorithm, the value of this pixel is

value = w1·x1 + w2·x2 + … + w9·x9

where the wi are the coefficients of the convolution kernel and the xi the coefficients of the matrix in the green box. As a reminder, here is the convolution kernel,
Above, we have for example w1 = w2 = w5 = 2 and w9 = 1, and also x1 = 0, x2 = 1, x5 = 1. Note the weighted sum in the expression of the pixel value: it is as if we had a neuron where the xi were the inputs and the wi, the coefficients of the kernel, were the weights of those inputs. The neuron computes the weighted sum, which is then the value of the pixel in the feature map, here 5. The framed zone is called the receptive field of the neuron; in green above is the receptive field of the top-left neuron.
When the kernel moves over the input image during convolution, the values of the xi change but the wi, the weights of the kernel, remain the same; they are only updated by gradient backpropagation. In particular, all these neurons share the same weights wi, the coefficients of the convolution kernel, because they look for the same characteristics in the input image, for example the presence of an eye or a contour. We can also add a shared bias b to those neurons. The activation function mentioned earlier is actually applied to these neurons.
As can be seen, the receptive field of the second neuron partly overlaps that of the first neuron, and so on. Since the receptive fields of these neurons overlap, if the desired characteristic is translated (up, down, left or right), it will necessarily fall into the receptive field of another neuron and be detected by it. We say that we have translation invariance. Without going into too much detail, biological work carried out in the 1960s allowed scientists to identify the same structure in the visual cortex of some animals. In particular, the first cells involved in the vision process identify very simple geometric shapes: lines, circles, spots, etc. More advanced cells then use this information to identify complex patterns. Similarly, in convolutional neural networks, the first convolution layers detect general geometric shapes: corners, circles, spots, outlines, etc. The following layers then combine these elements to identify more specific shapes: skull, eye, ear, etc.
Starting again from the formula above, we know that a convolution on an image of size n×n with a p×p filter outputs, without padding, a feature map of size m×m, with m = n - p + 1. Since each pixel of the feature map is linked to one neuron, we use m² neurons, and each of them observes a field of size p×p with p² weights (one per filter coefficient) and one bias. Since these parameters are shared, we finally have only p² + 1 parameters for the m² neurons.
For example, with a 420x420px grayscale input image and a 5×5 kernel, we get a feature map of size m = 420 - 5 + 1 = 416, hence 416 × 416 = 173,056 neurons for one convolution! With only 26 (= 5×5 + 1) parameters to learn.
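The contrast with the MLP from the beginning of the article can be checked numerically; a small sketch (the helper name `conv_layer_params` is purely illustrative):

```python
def conv_layer_params(kernel_size, n_filters=1, in_channels=1):
    # Weight sharing: one kernel plus one bias per filter, no matter how
    # many positions (neurons) the kernel visits on the image.
    return n_filters * (kernel_size ** 2 * in_channels + 1)

n, p = 420, 5
m = n - p + 1
print(m * m)                 # 173056 neurons in the feature map
print(conv_layer_params(p))  # only 26 parameters to learn
```

Compare this with the 22+ million parameters of the single MLP layer on the same image.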
This is the main point of interest of convolutional neural networks: to perform a convolution, in other words to detect a pattern, the neurons share the same synaptic weights and possibly a bias, which greatly reduces the number of parameters to learn.
The purpose of this article was to introduce convolutional neural networks and their main advantages. In general, these networks provide excellent results for classification and recognition tasks. They are also used to interpret sound, text and video data. If the problem to solve involves looking for a pattern in a sequence, then convolutional networks are good candidates. In a future article, we will examine in detail how the successive layers of a convolutional network evolve during the training phase; we will also see how to make these networks speak using heatmaps.