Convolutional Neural Network: The backbone of image classification

Source: Deep Learning on Medium

Many of us have come across the term “Convolutional Neural Network” (CNN) while reading about image classification. CNNs have proven to yield some of the best results on image classification problems. In this article we will talk about the basics of CNNs, and why and how they are used for image classification. In any image classification problem, our main goal is to capture the important features that help us identify the object in the image. For example, to identify a box we first need to identify its edges. To identify a person, we need to find the face, body, legs and hands. CNNs are well suited to extracting these important features. Before we look at the overall architecture of a CNN, let us understand a few of its building blocks.


A convolution is used to extract important features from an image. For example, it may be used to identify horizontal or vertical lines. The convolution (also called a kernel or filter) is generally a 2-dimensional matrix of size k x k that moves over an n x n image as a sliding window, with k < n. Consider the figure below to understand it better.

Convolution (3*3) for a (5*5) image

The image is of size (5 x 5) and the convolution is of size (3 x 3). Consider moving the convolution over the image from the top left corner to the bottom right corner, column by column and row by row (the blue, black and red squares are all positions of the convolution). At each position we multiply each image value under the window by the corresponding convolution weight and add all the products together to get the value of one cell in the output. Since a 3 x 3 window fits in 3 positions along each side of a 5 x 5 image, the output is a 3 x 3 matrix.
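The sliding-window computation described above can be sketched in plain Python. This is a minimal illustration, not an optimized implementation: the function name `convolve2d`, the sample image and the vertical-edge kernel are all made up for the example.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding."""
    n, k = len(image), len(kernel)
    out_size = n - k + 1  # number of positions the window fits along one side
    output = []
    for row in range(out_size):
        out_row = []
        for col in range(out_size):
            # Multiply overlapping cells element by element, then sum the products.
            total = sum(
                image[row + i][col + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            out_row.append(total)
        output.append(out_row)
    return output


# Hypothetical 5 x 5 image: two dark columns (0) followed by three bright ones (1).
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Hypothetical 3 x 3 kernel that responds to a dark-to-bright vertical edge.
kernel = [[-1, 0, 1] for _ in range(3)]

result = convolve2d(image, kernel)
for row in result:
    print(row)  # each row prints as [3, 3, 0]
```

The output is large where the window straddles the edge and zero where the window sees only bright pixels, which is how a convolution picks out a feature.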

Here we are moving the convolution by 1 column at a time (the stride), but we can vary the size of the stride. With a larger stride, some portions of the image may not be covered by the convolution. If you consider a convolution of size (4 x 4) and a stride of length 2 on the 5 x 5 image, the window fits only once along each side, so the last row and column are never covered. To prevent this, we pad the border of the image so that the convolution can move over all parts of the original image. If we add a padding of length 1 as in the image below, then we can convolve over the whole image.
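To see the coverage problem concretely, here is a small sketch that pads an image with zeros and lists where the window can start along one side. The helper names `pad` and `window_starts` are hypothetical, and zero is assumed as the padding value.

```python
def pad(image, p):
    """Surround a square image with a border of p zeros on every side."""
    width = len(image) + 2 * p
    padded = [[0] * width for _ in range(p)]           # top border
    for row in image:
        padded.append([0] * p + list(row) + [0] * p)   # left and right borders
    padded.extend([0] * width for _ in range(p))       # bottom border
    return padded


def window_starts(n, k, s):
    """Offsets along one side where a k-wide window fits inside an n-wide image."""
    return list(range(0, n - k + 1, s))


# 4 x 4 convolution with stride 2 on the 5 x 5 image: only one position fits,
# so it covers columns 0-3 and never reaches column 4.
print(window_starts(5, 4, 2))  # [0]

# After padding by 1 the image is 7 x 7, and two positions fit. Together they
# cover padded columns 0-5, which includes every original column (1-5).
print(window_starts(7, 4, 2))  # [0, 2]

print(pad([[1]], 1))  # [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
```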

Padding of size 1 for a 5 x 5 image

In general, for an image of size (n x n), a convolution or kernel of size (k x k), stride (s) and padding (p), the output is of size [(n+2p-k)/s+1, (n+2p-k)/s+1], where the division is rounded down when it is not exact.
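The formula is easy to check with a few lines of Python. This is a minimal sketch; the function name `output_size` is made up for illustration, and the integer division reflects rounding down when the window does not fit an exact number of times.

```python
def output_size(n, k, s=1, p=0):
    """Side length of the output for an n x n image, k x k kernel, stride s, padding p."""
    return (n + 2 * p - k) // s + 1


print(output_size(5, 3))            # 3: the 3 x 3 output from the first example
print(output_size(5, 4, s=2))       # 1: a 4 x 4 kernel with stride 2 fits only once
print(output_size(5, 4, s=2, p=1))  # 2: padding of 1 lets the window move once more
```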

Feature Maps

Let us assume that for an (n x n) image, we want to detect the face in the image. To do so, we have to identify the left eye, right eye, left eyebrow, right eyebrow, nose and lips. We need to use different convolutions to extract different features, so the weights in the convolutions will be different for each feature. Applying one convolution to the image produces one output matrix, and the feature map is the collection of these outputs, one per convolution, which together gather all the features from the image.

Feature Map

Consider the figure above. In this, we apply 8 filters to the 5 x 5 image on the left and thereby get a feature map of depth 8.
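The depth of the feature map can be illustrated by applying several filters to the same image and stacking the outputs. In this sketch the 3 x 3 filter size and the random weights are assumptions made for the example (in a trained CNN the weights are learned), and `convolve2d` is a hypothetical helper.

```python
import random


def convolve2d(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding."""
    n, k = len(image), len(kernel)
    out_size = n - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(k) for j in range(k))
            for c in range(out_size)
        ]
        for r in range(out_size)
    ]


random.seed(0)  # fixed seed so the example is repeatable

# Hypothetical 5 x 5 image with random pixel intensities.
image = [[random.random() for _ in range(5)] for _ in range(5)]

# 8 filters of assumed size 3 x 3, each with its own weights.
filters = [
    [[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
    for _ in range(8)
]

# One output matrix per filter; stacked together they form the feature map.
feature_map = [convolve2d(image, f) for f in filters]

shape = (len(feature_map), len(feature_map[0]), len(feature_map[0][0]))
print(shape)  # (8, 3, 3): depth 8, one 3 x 3 output per filter
```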


Pooling is an extremely important stage of Convolutional Neural Networks. Consider the 2 images below. Both represent the number ‘1’, but the image on the right is tilted. To handle such variations, we introduce pooling, which keeps the important features while making the output less sensitive to their exact position.

Pooling CNN

Pooling uses a pool of size (m x m) that moves over the image as a sliding window and summarizes the values in each window. There are 2 types of pooling: max pooling and average pooling. Max pooling takes the maximum of the values in the window, and average pooling takes their average.

Max and Average pooling on the same image

The advantage of pooling is that it reduces the dimensions drastically while keeping the important features. But this is also a disadvantage, since there is a significant loss of data. Despite this, pooling has proved to yield extremely good results, and hence a convolution layer is usually followed by a pooling layer.
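Both kinds of pooling can be sketched with one small function. This is an illustration under assumptions: the windows are taken to be non-overlapping (the pool moves by m cells at a time), and the names `pool2d` and `average` as well as the sample values are made up for the example.

```python
def pool2d(image, m, reduce):
    """Apply `reduce` to each non-overlapping m x m window of a square image."""
    n = len(image)
    output = []
    for row in range(0, n - m + 1, m):
        out_row = []
        for col in range(0, n - m + 1, m):
            # Collect the m * m values under the current window.
            window = [image[row + i][col + j] for i in range(m) for j in range(m)]
            out_row.append(reduce(window))
        output.append(out_row)
    return output


def average(values):
    return sum(values) / len(values)


# Hypothetical 4 x 4 image.
image = [
    [1, 3, 2, 4],
    [5, 7, 6, 8],
    [9, 2, 1, 0],
    [4, 6, 3, 5],
]

max_pooled = pool2d(image, 2, max)
avg_pooled = pool2d(image, 2, average)

print(max_pooled)  # [[7, 8], [9, 5]]
print(avg_pooled)  # [[4.0, 5.0], [5.25, 2.25]]
```

The 16 input values shrink to 4 in both cases, which shows both the reduction in dimensions and the loss of data mentioned above.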

Fully Connected

After the pooling layer we have a matrix of size (f x (c x c)), where f represents the total number of filters and c represents the side length of the pooled image. We then flatten this matrix into a [f*c*c, 1] matrix, where each cell of the original matrix forms a row in the new matrix. This can then be fed into a neural network as we discussed here.
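The flattening step can be sketched as follows. The values of f and c and the function name `flatten` are assumptions chosen for illustration, and the cell values are just labels that make the ordering easy to follow.

```python
def flatten(volume):
    """Turn an f x c x c nested list into an (f*c*c) x 1 column, one cell per row."""
    return [[value] for channel in volume for row in channel for value in row]


f, c = 2, 3  # hypothetical: 2 filters, 3 x 3 pooled images

# Label each cell as filter*100 + row*10 + column so its origin stays visible.
pooled = [
    [[ch * 100 + r * 10 + col for col in range(c)] for r in range(c)]
    for ch in range(f)
]

column = flatten(pooled)

print(len(column), len(column[0]))  # 18 1, i.e. a [f*c*c, 1] matrix
print(column[:4])                   # [[0], [1], [2], [10]]
```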

Fully Connected Layer

Convolutional Neural Network Full Architecture

Consider the MNIST dataset. Each image is of size 28 x 28 pixels. When we apply a convolution of size (5 x 5) with n1 filters, we get an output of size (n1 x 24 x 24). We then perform max pooling to reduce the dimensions to (n1 x 12 x 12). These two steps together make up one convolution layer. We can stack any number of convolution layers on top of one another until we get to the Fully Connected Layer (fc_3 in the figure below). There we apply a ‘ReLU’ activation function to introduce non-linearity. The result is the input to the Fully Connected Neural Network (fc_4). Now we can train the model and compute the results. Refer to my blog on how I achieved higher accuracy using CNN for the MNIST dataset.
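The dimensions quoted above can be traced step by step. This sketch assumes stride 1 and no padding for the convolution, a 2 x 2 max pool that halves each side, and an arbitrary n1 = 32; none of these values are given in the text, and the helper names are hypothetical.

```python
def conv_output_size(n, k, s=1, p=0):
    """Side length after a k x k convolution with stride s and padding p."""
    return (n + 2 * p - k) // s + 1


def pool_output_size(n, m):
    """Side length after non-overlapping m x m pooling."""
    return n // m


n1 = 32    # assumed number of filters; the text leaves n1 unspecified
size = 28  # MNIST images are 28 x 28

size = conv_output_size(size, k=5)
print((n1, size, size))  # (32, 24, 24) after the 5 x 5 convolution

size = pool_output_size(size, m=2)
print((n1, size, size))  # (32, 12, 12) after max pooling

flattened = n1 * size * size
print(flattened)         # 4608 values go into the fully connected layer
```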


Convolutional Neural Networks may seem a little complicated, but understanding them will help you tackle complex image classification problems. There are a lot of image classification datasets available on Kaggle, and you can play with the data to understand the power of CNNs in more detail.

Thanks for reading. Leave your comments here and let me know if you have any questions on this. You can also message me on LinkedIn.