A Layman’s Guide to Deep Convolutional Neural Networks

Original article was published on Artificial Intelligence on Medium

Picture Credits — @franckinjapan [Unsplash]

Deep Learning Foundations

A fast-track, non-mathematical and practical introduction to Convolutional Neural Networks with PyTorch

This post is part of the Medium-based series ‘A Layman’s Guide to Deep Learning’, which I plan to publish incrementally. The target audience is beginners with basic programming skills, preferably in Python.


This post assumes you have a basic understanding of deep neural networks, a.k.a. feed-forward neural networks. The previous post in the series, A Layman’s guide to Deep Neural Networks, covers them in detail; reading it first is highly recommended for a better understanding of this post.


‘Computer Vision’ as a field has evolved to new heights with the advent of deep learning. The ability to help machines comprehend the details of an image in a more intuitive way than siloed pixel values or a few hand-crafted features has brought a paradigm shift in the field. Today, a breadth of cutting-edge computer vision applications are used in everyday consumer products as well as in enterprise and industrial tech products. We have all benefited enormously in our daily lives from the recent advances of deep learning in computer vision, even if you may not have recognized the field at work in some products. A few notable examples are the auto-pilot mode in Tesla, the Face ID unlock, Animoji and advanced camera features in iPhones, the bokeh effect (portrait mode) in your smartphone camera, and the filters in Snapchat and Facebook Messenger.

The fundamental idea in the field of Computer Vision starts with a very simple problem, identifying what exists in an image. As it turns out, the problem is an extremely difficult task to solve, though we humans find it extremely easy.

Images are represented in their digital form as a 3-dimensional matrix of pixel values, i.e. length x breadth x the number of channels (RGB). Extracting information from this 3-D matrix is not so straightforward.
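To make the 3-dimensional layout concrete, here is a minimal sketch in plain Python. The pixel values are made up for illustration, and nested lists stand in for the array type a real library would use.

```python
# A tiny 2 x 3 RGB "image": height x width x channels, values in 0..255.
# The pixel values are made up purely for illustration.
image = [
    [[255, 0, 0], [0, 255, 0], [0, 0, 255]],        # row 0: red, green, blue
    [[0, 0, 0], [128, 128, 128], [255, 255, 255]],  # row 1: black, grey, white
]

height = len(image)
width = len(image[0])
channels = len(image[0][0])
shape = (height, width, channels)

# Reading one pixel gives its three channel intensities.
red, green, blue = image[0][1]
```

Here `shape` is `(2, 3, 3)`: two rows, three columns and three colour channels per pixel.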

Let’s start with some history

Legacy Computer Vision solutions

In earlier times, computer vision problems saw a fair amount of success once machine learning was embraced. These problems were primarily solved by using hand-crafted features and a traditional machine learning algorithm like SVM (Support Vector Machines). The hand-crafted features were properties of images derived using various other algorithms. A common example of such features is the presence of edges and corners. A basic edge detector algorithm works by finding areas where the image intensity suddenly changes, i.e. where there is a huge difference in the values of nearby pixels. Several such basic properties and a few more sophisticated features were derived using a combination of algorithms, and the results were then fed to a supervised ML algorithm.
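The basic edge detector described above can be sketched in a few lines. This is only an illustrative toy, assuming a grayscale image stored as nested lists and an arbitrary threshold; the function name `detect_edges` is hypothetical, and real detectors are considerably more sophisticated.

```python
def detect_edges(image, threshold):
    """Mark positions where a pixel differs sharply from its right-hand neighbour."""
    edges = []
    for row in image:
        edge_row = []
        for left, right in zip(row, row[1:]):
            edge_row.append(1 if abs(right - left) > threshold else 0)
        edges.append(edge_row)
    return edges

# Made-up grayscale image: dark on the left, bright on the right.
image = [
    [10, 12, 11, 200, 205],
    [9, 11, 10, 198, 201],
    [11, 10, 12, 202, 199],
]

edges = detect_edges(image, threshold=50)
```

Each output row marks a single 1 at the position where the dark region meets the bright one, which is exactly the kind of hand-crafted feature that was then passed on to an algorithm such as SVM.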

This approach works, but the results are not very encouraging. Firstly, the amount of effort required to hand-craft features is overwhelming; moreover, it requires a fair amount of domain knowledge and is very specific to the use-case. For example, hand-crafted features created for detecting bone fractures in an X-ray image may not be useful for identifying the name written on a delivery package.

Fig 1 — Illustration — Legacy Computer Vision Solutions using Machine Learning

To reduce the effort of hand-crafting features, we could think about representing an image in a tabular form, i.e. with each pixel value transformed into a feature. The results, however, are very disappointing. Barely any information about the image gets captured by the network/ML algorithm, resulting in poor performance.

Fig 2 — Illustration — Flattening image to tabular data for ML solutions
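One way to see why the flattened, tabular form loses so much is to flatten two images of the same object in slightly different positions. The sketch below uses two made-up 3 x 3 grayscale images of a vertical bar; the helper name `flatten` is just for illustration.

```python
def flatten(image):
    """Turn a height x width grid of pixels into one flat feature row."""
    return [pixel for row in image for pixel in row]

# Two made-up 3 x 3 grayscale images showing the same vertical bar,
# shifted by one column.
bar_left = [
    [255, 0, 0],
    [255, 0, 0],
    [255, 0, 0],
]
bar_middle = [
    [0, 255, 0],
    [0, 255, 0],
    [0, 255, 0],
]

row_a = flatten(bar_left)
row_b = flatten(bar_middle)

# Count the feature columns where both rows are "on" at the same time.
shared_on = sum(1 for a, b in zip(row_a, row_b) if a > 0 and b > 0)
```

Although both images show the same bar, the two nine-column rows do not share a single active column, so an algorithm that treats each column as an independent feature sees them as unrelated.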

From the above problem-space, we can infer one important detail: feature extraction for images is unavoidable but difficult to get right.

Here are a few examples that might help you understand why computer vision-based tasks are difficult to solve. For simplicity, let’s assume we are trying to predict whether an image contains a cat, as a binary problem.

Refer to the two images shown below; based on pixel values, these two images will have completely different representations in the digital format. Since a pixel value only represents the colour that each pixel should output, the raw representation carries no intuitive semantic meaning.

Picture Credits — Alexander London https://unsplash.com/photos/mJaD10XeD7w

Similarly, a cat can often be pictured with poor background separation. Have a look at the images below; traditional hand-crafted features would mostly yield futile results here.

Picture Credits — MIKHAIL VASILYEV, KiVEN Zhao,Hannah Troupe @ Unsplash.com

Also, the sheer number of different ways in which a cat can be pictured makes the process even more difficult. The poses below are just a few of them.

Picture Credits — Marko Blažević, Timo Volz, Willian Justen de Vasconcellos @ Unsplash.com

These issues, when extrapolated to a more generic use-case, say detecting multiple objects within an image, increase the complexity dramatically.

Clearly, neither a tabular representation of pixels, nor handcrafted features that detect specific properties, nor a combination of both is the best means to solve computer vision problems.

So what could be a better way to solve computer vision problems?

Historically, we have seen that hand-crafted features, though effort-intensive, do solve the problem to a certain extent. But the process is very expensive and needs a lot of domain knowledge for each particular problem.

What if we could automate feature extraction?

Fortunately, this is possible, and it finally brings our topic of discussion, the ‘Convolutional Neural Network’ (CNN), to the forefront. CNNs provide a superior means to solve computer vision problems using a generic, scalable and sustainable approach that can be applied across domains with no prior knowledge of the domain. The creation of hand-crafted features is no longer required, as the network itself learns to extract powerful features given enough training and data.

Deep convolutional neural networks were popularized by Krizhevsky, Sutskever and Hinton, whose network achieved state-of-the-art performance in the ImageNet classification challenge in 2012. This research then revolutionized the field of computer vision. You can read more on the original paper published here.

A Deeper look at Deep Convolutional Neural Networks

A generic architecture for a CNN is shown below. Details might be a bit abstract at the moment, but hold on for a bit longer; we will soon get into the details of each component individually. The feature extraction component in this architecture is the combination of ‘convolution + pooling’. You might have noticed that this component is repeated, and in most modern architectures you will see it repeated several times in a hierarchy. These feature extractors first extract low-level features (say edges and lines), then mid-level features such as shapes or combinations of several low-level features, and eventually high-level features, say an ear, nose or eyes in the cat detection example. Finally, these layers are flattened and connected to the output layer using an activation (similar to feed-forward neural networks).

Fig 3— Illustration for Convolution Neural Network

But first, let’s start with the basics

Let’s take a moment to understand how a human brain generally recognizes objects through vision. In a simplified way, our brain receives signals from the retina about the visual it perceived from the external world. At first, edges are detected; these edges then help in detecting curves and then more complex patterns like shapes and so on. The hierarchical orchestration of neural activity from small edges to lines, curves, complex shapes and even more complex shapes finally helps in identifying a specific object. Of course, this is a highly simplified view of the process; the human brain simultaneously processes far more complex operations.

Similarly, in convolutional neural networks, we have feature learning in the early layers, where very basic features are learned. The ‘deep’ in a ‘Deep CNN’ refers to the number of layers in the network. It is common to have 5–10 or even more feature learning layers in a regular CNN. Modern architectures used in cutting-edge applications have networks that are more than 50–100 layers deep. The working of the CNN is fairly similar to the over-simplified working of a human brain in recognizing visual components through the visual cortex.

Let’s get into the specifics of CNN building blocks

We will first start with understanding what convolution is.

‘Convolution’ is an operation borrowed from the field of signal processing. In the realm of deep learning, it basically means sliding a kernel or filter (a small matrix) along the length and breadth of an image (a larger matrix) and, at each position, multiplying the overlapping values element by element and summing the results. The animation below demonstrates convolving a [3×3] filter/kernel over a [5×5] image. The result of the convolution operation is a smaller image of size [3×3].

Source — https://giphy.com/gifs/blog-daniel-keypoints-i4NjAwytgIRDW
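The sliding multiply-and-sum can be written out directly. The following is a minimal sketch in plain Python, assuming stride 1 and no padding; the image and kernel values are illustrative and the function name `convolve2d` is hypothetical.

```python
def convolve2d(image, kernel):
    """Slide kernel over image (stride 1, no padding); multiply and sum."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for a in range(k):
                for b in range(k):
                    # Element-wise product of the kernel and the patch under it.
                    total += image[i + a][j + b] * kernel[a][b]
            row.append(total)
        output.append(row)
    return output

# Illustrative 5 x 5 binary image and 3 x 3 kernel.
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = convolve2d(image, kernel)
```

The result is the 3 x 3 matrix `[[4, 3, 4], [2, 4, 3], [2, 3, 4]]`. Strictly speaking, deep learning libraries do not flip the kernel as the signal-processing definition does, so this operation is a cross-correlation, but the name convolution has stuck.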

This multiply-and-sum operation is, in essence, the foundation of feature extraction. With the right values in a kernel, we can extract notable features from an image. An example of this operation on a real image is demonstrated below. We can see that the original image remains the same when the kernel is the identity kernel (a 1 at the centre and 0 everywhere else). However, when we use different kernels, the results resemble those of classic image processing techniques such as edge detection, smoothing and sharpening.

Source — Wikipedia
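The effect of different kernel values can be checked on a toy image. The sketch below assumes 3 x 3 kernels and zero-padding of one pixel so that the output keeps the input size; the image is made up, the helper name `apply_kernel` is hypothetical, and the edge kernel is one commonly used choice among many.

```python
def apply_kernel(image, kernel):
    """Apply a 3 x 3 kernel with zero-padding so the output keeps the input size."""
    h, w = len(image), len(image[0])
    # Surround the image with a one-pixel border of zeros.
    padded = [[0] * (w + 2)] + [[0] + row + [0] for row in image] + [[0] * (w + 2)]
    return [
        [
            sum(padded[i + a][j + b] * kernel[a][b] for a in range(3) for b in range(3))
            for j in range(w)
        ]
        for i in range(h)
    ]

identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
edge = [[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]]  # a commonly used edge-detection kernel

# Made-up image: a bright 2 x 2 square on a dark background.
image = [
    [0, 0, 0, 0],
    [0, 9, 9, 0],
    [0, 9, 9, 0],
    [0, 0, 0, 0],
]

unchanged = apply_kernel(image, identity)
outlined = apply_kernel(image, edge)
```

With the identity kernel the output equals the input. The edge kernel's values sum to zero, so perfectly flat regions map to 0 and only places where the intensity changes produce a response.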

That completes one part of the story; the next part of the same component is pooling. A pooling layer reduces the spatial size of the image representation, which in turn reduces the number of parameters and the computation in the rest of the network. Max pooling, the most common variant, simply takes the maximum value within each window of a defined kernel size. The visual below is a simple pooling example. A pooling operation using a non-overlapping kernel of size [10 x 10] is performed on the output from convolution (another matrix) of size [20 x 20]. The end outcome is a [2 x 2] matrix.

Source — http://deeplearning.stanford.edu/tutorial/supervised/Pooling/
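Max pooling is simple enough to sketch in plain Python. The example below assumes non-overlapping windows (stride equal to the window size) and uses made-up feature-map values; `max_pool` is a hypothetical helper name.

```python
def max_pool(feature_map, size):
    """Non-overlapping max pooling: keep the largest value in each size x size block."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            block = [feature_map[i + a][j + b] for a in range(size) for b in range(size)]
            row.append(max(block))
        pooled.append(row)
    return pooled

# Made-up 4 x 4 feature map pooled with a 2 x 2 window.
feature_map = [
    [1, 3, 2, 1],
    [4, 6, 5, 0],
    [7, 2, 9, 8],
    [1, 0, 3, 4],
]
pooled = max_pool(feature_map, size=2)

# The 20 x 20 -> 2 x 2 case from the text, using a 10 x 10 window.
big = [[r * 20 + c for c in range(20)] for r in range(20)]
big_pooled = max_pool(big, size=10)
```

The small example yields `[[6, 5], [7, 9]]`, and the 20 x 20 input pooled with a 10 x 10 window collapses to a 2 x 2 matrix, matching the sizes quoted above.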

Using a combination of convolution layers and pooling layers (usually max pooling), we create a basic building block of the CNN. Performing a convolution + pooling operation reduces the size of the original input, depending on the kernel and pooling size. When we perform convolution on the input image using one kernel, we get one feature map. In a CNN, it is common to use several kernels within a single convolution unit. The figure below highlights the feature maps extracted from n kernels during convolution.

Fig 4— Illustration CNN architecture

Repeating this process multiple times results in deeper convolutional neural networks. Each layer helps in extracting features from the previous layer. The hierarchical organization of these layers helps in incrementally learning features, from small edges to more complex features built from the lower-level ones, and on to high-level features that capture enough information for the network to predict accurately.

The last convolution layer is connected to a fully connected layer, which applies the relevant activation for predicting the outcome: sigmoid for a binary outcome, or softmax for non-binary (multi-class) outcomes.
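The two output activations mentioned above are short formulas. Here is a minimal sketch using only the standard library; the raw scores passed in are made up for illustration.

```python
import math

def sigmoid(score):
    """Squash one raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-score))

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Binary case: one score, one probability (e.g. "is this a cat?").
cat_probability = sigmoid(0.0)

# Multi-class case: one score per class, one probability per class.
class_probabilities = softmax([2.0, 1.0, 0.1])
predicted_class = class_probabilities.index(max(class_probabilities))
```

A raw score of 0 maps to a probability of 0.5 under sigmoid, while softmax spreads the probability mass over all classes and the largest score wins.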

The entire architecture as discussed can be simplified as shown below —

Fig 5— Illustration simplified CNN architecture

So far, we have ignored several important aspects in a sophisticated CNN architecture. But this was intentional to keep things simple and help you cover the very basic idea of a building block in a CNN.

A few additional key concepts that are important are:

  • Stride — In simple words, stride can be defined as the amount by which a filter shifts. When we discussed the sliding of the filter over the input image, we assumed that the movement was just 1 unit in the intended direction. We can, however, control the sliding movement with a number of our choice (though it is common to use 1). Based on the need for the use-case, we can choose a more appropriate one. Larger strides often help in reducing computation, generalizing learning of features, etc.
  • Padding — We also saw that applying convolution reduces the size of the feature map when compared to the size of the input image. Zero-padding is a generic way to control the shrinkage of dimension after applying filters larger than 1×1 and avoiding information loss at the boundaries.

To illustrate the concepts of padding and stride, the visuals below are very handy.

  1. Padding with no strides (blue is input, green is output)

Source and Credits: https://github.com/vdumoulin/conv_arithmetic

  2. No padding with strides (blue is input, green is output)

Source and Credits: https://github.com/vdumoulin/conv_arithmetic
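The effect of stride and padding on the output size follows a simple formula: floor((input + 2 x padding - kernel) / stride) + 1 along each dimension. The sketch below encodes it; the function name `conv_output_size` is hypothetical and the sizes are chosen only as examples.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output width (or height) of a convolution along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

# A 5-wide input with a 3-wide kernel, as in the earlier convolution example.
no_padding = conv_output_size(5, 3)               # shrinks to 3
same_padding = conv_output_size(5, 3, padding=1)  # padding of 1 keeps the size at 5
strided = conv_output_size(5, 3, stride=2)        # stride of 2 shrinks it to 2
```

This makes the trade-off visible: padding preserves the spatial size, while a larger stride reduces it (and with it the amount of computation).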

A few other important aspects that we have not touched on until now are batch normalization layers and dropout layers. Both are relevant and important topics for CNNs. Today, we would mostly define a convolutional unit as a combination of (Convolution + Max Pooling + Batch Normalization), instead of just the first two. Batch normalization is a technique that makes very deep neural networks easier to train by standardizing the inputs to a layer for each mini-batch. Standardizing the inputs helps stabilize the learning process and can thereby considerably reduce the number of training epochs required to train deep networks.
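The standardization step at the heart of batch normalization can be sketched as follows. This is a simplified illustration for a single feature across one mini-batch; it deliberately leaves out the learnable scale and shift parameters and the running statistics that real batch normalization layers also maintain. The values are made up.

```python
import math

def batch_normalize(values, eps=1e-5):
    """Standardize one feature across a mini-batch to zero mean and unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    # eps guards against division by zero when the variance is tiny.
    return [(v - mean) / math.sqrt(variance + eps) for v in values]

# One feature observed for a made-up mini-batch of four samples.
batch = [2.0, 4.0, 6.0, 8.0]
normalized = batch_normalize(batch)

new_mean = sum(normalized) / len(normalized)
new_variance = sum((v - new_mean) ** 2 for v in normalized) / len(normalized)
```

After the transformation the feature has a mean of (approximately) 0 and a variance of (approximately) 1, regardless of the scale it started on.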

Dropout, on the other hand, is a regularization technique that helps reduce overfitting and improve generalization.
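As a rough sketch of the idea, dropout randomly zeroes a fraction of the activations during training. The version below is the common "inverted" variant, which rescales the surviving values so their expected total stays the same; the drop probability, seed and activations are arbitrary choices for illustration.

```python
import random

def dropout(activations, drop_probability, rng):
    """Inverted dropout: zero each value with drop_probability, rescale the rest."""
    keep_probability = 1.0 - drop_probability
    return [
        a / keep_probability if rng.random() < keep_probability else 0.0
        for a in activations
    ]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
activations = [1.0] * 1000
dropped = dropout(activations, drop_probability=0.5, rng=rng)

zeroed = sum(1 for d in dropped if d == 0.0)
kept_values = {d for d in dropped if d != 0.0}
```

Roughly half of the activations end up as zero and the rest are doubled. Dropout is applied only while training; at prediction time all activations are used as they are.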

Connecting the dots

Now that we have a fair understanding of the basic building blocks within a convolutional neural network, I am sure you have several questions about the finer details. The most important questions that might have crossed your mind would be ‘How do we decide on what filters to use?’, ‘How many filters?’, etc.

Let’s tackle these questions one by one

How do we decide on what filters to use?

Well, the answer to this question is simple. We initialize the filters with random values sampled from a normal or other distribution. The idea might be a bit confusing and difficult to reason about, but it works out well. In the process of training, the network incrementally learns the filters that extract the information most useful for accurately predicting the label. This is where the magic happens, and we effectively eliminate the process of hand-crafting features. The network takes care of crafting appropriate filters to extract the best features, given sufficient training and data.

How many filters do we use in each convolution unit?

There is no standard number. The size and the number of filters are hyperparameters that can be tuned. A general rule of thumb is to use filters with odd sizes, say 3×3, 5×5 or 7×7. Also, smaller filters are mostly preferred over larger ones, but this comes with a trade-off that is best settled through empirical validation.

How does the network learn?

This would be similar to the feed-forward neural networks we studied in our previous blog. We use the backpropagation technique for the network to update the weights in the filters and thus learn the ground features in the image. The learning process helps the network discover the optimum filters that can best extract the most information from the input images.
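To see a filter being learned from a random starting point, here is a small self-contained sketch. It is not full backpropagation through a network: it trains a single 2 x 2 filter with hand-written gradient descent so that its output matches that of a known target filter. The image, target filter, learning rate and step count are all arbitrary choices for illustration.

```python
import random

def convolve2d(image, kernel):
    """Stride-1, no-padding convolution for a square image and square kernel."""
    k = len(kernel)
    size = len(image) - k + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(k) for b in range(k))
            for j in range(size)
        ]
        for i in range(size)
    ]

def mean_squared_error(predicted, target):
    diffs = [(p - t) ** 2 for p_row, t_row in zip(predicted, target)
             for p, t in zip(p_row, t_row)]
    return sum(diffs) / len(diffs)

# Made-up 4 x 4 grayscale image with intensities between 0 and 1.
image = [
    [0.0, 0.2, 0.9, 1.0],
    [0.1, 0.3, 0.8, 0.9],
    [0.0, 0.2, 1.0, 0.8],
    [0.1, 0.1, 0.9, 1.0],
]
# The filter we pretend not to know: a simple vertical-edge detector.
target_kernel = [[-1.0, 1.0], [-1.0, 1.0]]
target = convolve2d(image, target_kernel)

# Start from small random values, as described above.
rng = random.Random(0)
kernel = [[rng.gauss(0.0, 0.1) for _ in range(2)] for _ in range(2)]
initial_loss = mean_squared_error(convolve2d(image, kernel), target)

learning_rate = 0.1
for step in range(500):
    predicted = convolve2d(image, kernel)
    n = len(predicted) * len(predicted[0])
    for a in range(2):
        for b in range(2):
            # Gradient of the mean squared error with respect to kernel[a][b].
            gradient = sum(
                2 * (predicted[i][j] - target[i][j]) * image[i + a][j + b]
                for i in range(len(predicted)) for j in range(len(predicted[0]))
            ) / n
            kernel[a][b] -= learning_rate * gradient

final_loss = mean_squared_error(convolve2d(image, kernel), target)
```

The error after training is lower than the error of the random starting filter. In a real CNN, backpropagation computes the same kind of gradient automatically for every filter in every layer, with the prediction error on the labels as the quantity being minimized.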

The above image was a simple 2D one, but most images are 3D; how does the network work for a 3D image?

The illustration used 2D images for simplicity. Most images we use would be 3-dimensional, with (RGB) colour channels. In this case, everything remains the same except for the dimensions of the kernel. The kernel would now be 3-dimensional, with the 3rd dimension equal to the number of channels, say 5x5x3 for the 3 colour channels in the input image (R, G and B).
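A sketch of the multi-channel case is shown below. It assumes the image is laid out as height x width x channels and the kernel as k x k x channels; the values are made up, and `convolve_channels` is a hypothetical helper name. Note that one 3-dimensional kernel still produces a single 2D feature map, because the products are summed across the channels too.

```python
def convolve_channels(image, kernel):
    """image: H x W x C, kernel: k x k x C. Returns one 2D feature map."""
    k = len(kernel)
    channels = len(kernel[0][0])
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [
        [
            sum(
                image[i + a][j + b][c] * kernel[a][b][c]
                for a in range(k) for b in range(k) for c in range(channels)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# Made-up 6 x 6 RGB image where every pixel is (1, 2, 3).
image = [[[1, 2, 3] for _ in range(6)] for _ in range(6)]
# A 5 x 5 x 3 kernel of ones, matching the 3 colour channels.
kernel = [[[1, 1, 1] for _ in range(5)] for _ in range(5)]

feature_map = convolve_channels(image, kernel)
```

Each output value is 25 positions times (1 + 2 + 3) = 150, and the 6 x 6 input shrinks to a 2 x 2 feature map.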

What is the difference between Convolutional Neural Networks and Deep Convolutional Neural Nets?

They are the same thing; ‘deep’ here refers to the number of layers in the architecture. Most modern CNN architectures are 30–100 layers deep.

Do we need a GPU for training a CNN?

Not mandatory, but recommended. Leveraging GPUs can make training neural networks dramatically faster, by as much as roughly 50x. Also, the Kaggle and Google Colab platforms provide a GPU-based environment for free (with weekly usage caps).

Alright, basics done; show me a tangible example

Let’s engage in a practical example that demonstrates the construction of a ConvNet using PyTorch.

We will revisit all the topics that we touched upon in the content above.

First, let’s import all required packages. We will import the required utility tools, neural net core modules and a few external modules from Scikit-learn for evaluating the network performance.

Next, let’s load our dataset into memory. For this example, I am using the MNIST csv dataset available on Kaggle. You can find the complete dataset here on Kaggle.

Fig 6— Output from the above code snippet

Now that we have loaded our dataset, let us transform it into a PyTorch-friendly abstraction.

Fig 7— Output from the above code snippet

We will now define the CNN architecture, along with additional functions that will help us with evaluation and generating predictions.

Fig 8— Output from the above code snippet
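As a rough sketch of how the building blocks discussed earlier fit together in PyTorch, consider the small network below. It is not necessarily the exact network behind the results in this post: the class name `SmallConvNet`, the layer sizes and the dropout rate are all assumptions chosen for illustration, and the input is assumed to be 28 x 28 grayscale digits with 10 classes.

```python
import torch
import torch.nn as nn

class SmallConvNet(nn.Module):
    """Hypothetical two-block CNN for 28 x 28 grayscale digits (10 classes)."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # 1x28x28 -> 16x28x28
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 16x14x14
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # -> 32x14x14
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 32x7x7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.25),
            nn.Linear(32 * 7 * 7, 10),                    # one raw score per digit
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SmallConvNet()
model.eval()  # switch dropout off and use running batch-norm statistics
with torch.no_grad():
    scores = model(torch.zeros(4, 1, 28, 28))  # a dummy batch of 4 blank images
output_shape = tuple(scores.shape)
```

A batch of 4 images produces a 4 x 10 tensor of raw scores, one per digit class. A softmax over those scores (usually folded into the loss function during training) turns them into class probabilities.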

Finally, let’s train our model.

Fig 9— Output from the above code snippet

We now have a simple model trained for 5 epochs. For most use-cases, you would need >30 epochs to get great performance. Let us now calculate the accuracy on the validation dataset and plot the confusion matrix.

Fig 10— Output from the above code snippet
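For intuition about what these two metrics contain, they can be computed by hand in a few lines. The labels below are made up for a 3-class toy problem and are not results from the model above; scikit-learn's `accuracy_score` and `confusion_matrix` compute the same quantities.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true label."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)

def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix

# Made-up labels for a 3-class problem, purely for illustration.
true_labels = [0, 0, 1, 1, 2, 2, 2, 1]
predicted_labels = [0, 1, 1, 1, 2, 0, 2, 1]

acc = accuracy(true_labels, predicted_labels)
matrix = confusion_matrix(true_labels, predicted_labels, num_classes=3)
```

The diagonal of the matrix counts correct predictions per class, while the off-diagonal entries show which classes get confused with which.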