Original article was published on Deep Learning on Medium

**Understanding Convolutional Neural Network**

**Introduction:**

**Convolutional Neural Networks (CNNs)** sound like a peculiar combination of biology and math with a little CS sprinkled in, but these networks play an important role in the field of **Computer Vision.** 2012 was the first year that neural nets rose to prominence, when Alex Krizhevsky used them to win that year's **ImageNet competition** (the annual Olympics of computer vision), dropping the classification error record from 26% to 15%. CNNs have wide applications in image and video recognition, recommender systems, and Natural Language Processing.

**Architecture:**

A CNN is composed of convolution layers with non-linear activation functions such as ReLU or tanh, followed by one or more fully connected (affine) layers. This architecture is analogous to the connectivity pattern of neurons in the human brain and was biologically inspired by the visual cortex, which has small regions of cells that are sensitive to specific regions of the visual field. The input to a convolution layer is an **m × m × r** image, where **m** is the height and width of the image and **r** is the number of channels (e.g. r = 3 for an RGB image). The convolution layer has k filters (or kernels) of size **n × n × q**, where **n** is smaller than the dimension of the image and **q** can either equal the number of channels **r** or be smaller, and may vary from kernel to kernel. The kernels give the network its locally connected structure: each kernel is convolved with the image to produce one of k feature maps, each of size **(m − n + 1) × (m − n + 1)**.

One more thing to take care of during the convolution operation is the stride: the number of units by which the kernel shifts after each operation.
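The convolution and stride mechanics above can be sketched as a naive NumPy loop (a minimal single-channel illustration, not an efficient implementation; the function name and sizes are just for demonstration):

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Naive valid convolution of a square 2-D image with a 2-D kernel."""
    m = image.shape[0]
    n = kernel.shape[0]
    out = (m - n) // stride + 1          # output size; m - n + 1 when stride is 1
    feature_map = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            # element-wise product of the kernel with one image patch, summed
            patch = image[i*stride:i*stride+n, j*stride:j*stride+n]
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((2, 2))
print(conv2d(image, kernel).shape)            # stride 1 -> (3, 3), i.e. m - n + 1
print(conv2d(image, kernel, stride=2).shape)  # stride 2 -> (2, 2)
```

A larger stride shifts the kernel further each step, so the feature map shrinks accordingly.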

All these filters are initialized randomly and become the parameters that the network learns subsequently. As we go deeper into the network, the filters of later convolution layers take dot products with the outputs of earlier ones, so they combine the smaller coloured pieces or edges into larger patterns.

**Parameter Sharing and local Connectivity**

Parameter sharing means that all neurons in a particular feature map share the same weights, and local connectivity means that each neuron is connected only to a subset of the input image (unlike a fully connected neural network, where every neuron is connected to every input). Together these reduce the number of parameters in the whole system and make the computation more efficient.
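A quick back-of-the-envelope comparison shows how much parameter sharing saves. The layer sizes here are hypothetical (a CIFAR-sized 32×32×3 input, ten 5×5 filters, and a 1000-unit dense layer), chosen only to illustrate the gap:

```python
# Parameter sharing: one conv filter's weights are reused at every
# spatial position, so the count depends only on the kernel, not the image.
in_h, in_w, channels = 32, 32, 3           # hypothetical RGB input
k, num_filters = 5, 10                     # 5x5 kernels, 10 feature maps

conv_params = num_filters * (k * k * channels + 1)   # +1 bias per filter
fc_params = (in_h * in_w * channels) * 1000 + 1000   # dense layer, 1000 units

print(conv_params)  # 760
print(fc_params)    # 3073000
```

A few hundred shared weights versus millions of dense ones is the efficiency the paragraph above describes.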

The objective of the convolution operation is to **extract features** from the input image. The first convolution layer is responsible for capturing low-level features such as edges, colour, and gradient orientation; with added layers, the architecture adapts to high-level features as well.

**Padding**

Sometimes the filter does not fit the input image perfectly, and there are two common ways to handle this. With **valid padding**, no padding is added: the part of the image where the filter does not fit is dropped, and the convolved feature is reduced in dimensionality compared to the input. With **same padding**, the image is padded with zeros around the border (**zero padding**) so that the output has the same spatial dimensions as the input.
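The effect of padding on output size follows a simple formula. This sketch (with illustrative sizes, assuming stride 1 and a square kernel) shows valid versus same padding:

```python
import numpy as np

def output_size(m, n, pad=0, stride=1):
    """Spatial size of a conv output: (m - n + 2*pad) // stride + 1."""
    return (m - n + 2 * pad) // stride + 1

m, n = 28, 3
print(output_size(m, n, pad=0))  # valid padding: 26, the output shrinks
print(output_size(m, n, pad=1))  # zero-pad by 1: 28, same size as the input

image = np.ones((28, 28))
padded = np.pad(image, pad_width=1)  # zero padding around the border
print(padded.shape)                  # (30, 30)
```

For a kernel of size n, padding by (n − 1)/2 on each side keeps the output the same size as the input.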

**Non Linearity(ReLU)**

ReLU stands for Rectified Linear Unit, a non-linear operation whose output is f(x) = max(0, x).

It is used to introduce non-linearity into our ConvNet, since the real-world data we want our ConvNet to learn is non-linear. Other non-linear functions such as tanh or sigmoid can also be used instead of ReLU, but most practitioners use ReLU because it usually performs better than the alternatives.
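The f(x) = max(0, x) rule above is a one-liner in NumPy (a minimal sketch applied element-wise to a sample array):

```python
import numpy as np

def relu(x):
    """Element-wise f(x) = max(0, x): negatives are zeroed, positives pass through."""
    return np.maximum(0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))  # [0.  0.  0.  1.5 3. ]
```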

**Pooling Layer**

This layer is responsible for reducing the spatial size of the convolved feature, which decreases the computational power required to process the data. It is also useful for extracting dominant features that are invariant to rotation and position, which helps in training the model effectively. Spatial pooling, also called subsampling or downsampling, can be of different types:

i) Max Pooling

ii) Average Pooling

Max pooling returns the maximum value from the portion of the image covered by the kernel. It also acts as a noise suppressant: it discards the noisy activations altogether, performing de-noising along with dimensionality reduction. Average pooling, on the other hand, returns the average of all the values from the portion of the image covered by the kernel; it simply performs dimensionality reduction as a noise-suppressing mechanism. Max pooling therefore usually performs better than average pooling, although in some cases average pooling comes out ahead.
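Both pooling types can be sketched in a few lines of NumPy (a minimal illustration assuming a square input and non-overlapping windows, i.e. stride equal to the window size):

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping 2-D pooling: split into size x size blocks, reduce each."""
    h, w = x.shape
    blocks = x[:h - h % size, :w - w % size].reshape(
        h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))    # max pooling: keep strongest activation
    return blocks.mean(axis=(1, 3))       # average pooling: smooth the window

x = np.array([[1., 3., 2., 1.],
              [4., 6., 5., 0.],
              [1., 2., 9., 8.],
              [3., 0., 7., 6.]])
print(pool2d(x, mode="max"))   # [[6. 5.] [3. 9.]]
print(pool2d(x, mode="avg"))   # [[3.5 2. ] [1.5 7.5]]
```

Note how max pooling keeps only the dominant activation in each 2×2 window, while average pooling blends all four values.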

**Fully Connected Layer**

Now that we can detect these high-level features, the icing on the cake is attaching a fully connected layer to the end of the network. This layer takes an input volume (whatever the output of the preceding conv, ReLU, or pooling layer is) and outputs an N-dimensional vector, where N is the number of classes the program has to choose from. For example, for a digit classification program, N would be 10 since there are 10 digits. Each number in this N-dimensional vector represents the probability of a certain class: if the resulting vector for a digit classifier is [0 .1 .1 .75 0 0 0 0 0 .05], this represents a 10% probability that the image is a 1, a 10% probability that it is a 2, a 75% probability that it is a 3, and a 5% probability that it is a 9.

The fully connected (affine) layer works by looking at the output of the previous layer and determining which high-level features most strongly correlate with a particular class. It holds weights such that, when you compute the products between the weights and the previous layer's activations, you get the correct probabilities for the different classes.
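A minimal sketch of such a layer, with made-up sizes (128 flattened features, 10 classes) and random weights standing in for learned ones; the softmax at the end is one common way to turn the affine layer's scores into probabilities, though the article above does not name it:

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.standard_normal(128)        # flattened output of the last conv/pool layer
W = rng.standard_normal((10, 128)) * 0.01  # one row of weights per class (untrained here)
b = np.zeros(10)

logits = W @ features + b                            # affine layer: one score per class
probs = np.exp(logits) / np.sum(np.exp(logits))      # softmax: scores -> probabilities

print(probs.shape)            # (10,)
print(round(probs.sum(), 6))  # 1.0, a valid distribution over the 10 classes
print(int(np.argmax(probs)))  # index of the most likely class
```

After training, the weight rows would encode which features correlate with which class, as described above.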

Now let's dive into how to train a convolutional neural network.

**Training**

You might have many questions while going through this article: how do the kernels or filters in the first conv layer know to look for edges and curves? How does the fully connected layer know which activation maps to look at? How do the filters in each layer know what values to have? Don't worry about all these. The way the model adjusts its filter values (or weights) is through a training process called **backpropagation**.

Before diving into backpropagation, we must first take a step back and talk about what a neural network needs in order to work. At the moment we were born, our minds were fresh: we didn't know what a cat, dog, or bird was. In a similar way, before training starts, the CNN's weights or filter values are randomly initialized. The filters don't know to look for edges and curves, and the filters in the higher layers don't know to look for paws and beaks. As we grew older, however, our parents and teachers showed us different pictures and gave us corresponding labels. This idea of being given an image and a label is the training process that CNNs go through. Let's say we have a training set with thousands of images of dogs, cats, and birds, and each image carries a label of what animal it is.

Now let's get back to backpropagation. Training the model has three steps: the forward pass, the backward pass, and the weight update. During the **forward pass**, we take a training image and pass it through the whole network. On our first training example, since all of the weights and filter values were randomly initialized, the output will be random and won't make any sense: the network, with its current weights, isn't able to pick out the low-level features and thus can't reach any reasonable conclusion about the classification.

Using the prediction from the forward pass, a loss is computed using a loss function, which compares the predicted output with the expected output.

Let's say the variable L is equal to the loss value. Initially, the loss might be extremely high for the first few training images. Now let's think about this intuitively: suppose there is a game where you have to throw a dart (your model) at the centre of a target board (the expected value), and the further your dart lands from the centre, the more money you have to pay. The money you give away is your loss, and you definitely want to minimize it, so your aim is to decrease the distance between your hitting point and the centre. Similarly, the model must adjust itself to minimize the loss and predict accurate values. Mathematically, this is just an optimization problem in calculus: we want to find the set of weights that returns the minimum possible loss.
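As a concrete example of a loss function (the article does not name one, so cross-entropy, the usual choice for classification, is assumed here), using the probability vector from the digit-classifier example above:

```python
import numpy as np

def cross_entropy(predicted, target_index):
    """Loss for one example: -log of the probability given to the true class."""
    return -np.log(predicted[target_index])

# Predicted probabilities over 10 digit classes; suppose the true label is "3".
predicted = np.array([0., .1, .1, .75, 0., 0., 0., 0., 0., .05])
print(round(cross_entropy(predicted, 3), 4))  # -log(0.75), about 0.2877
```

A confident correct prediction (probability near 1) gives a loss near zero, while a confident wrong one gives a huge loss, which is exactly the "money you pay" in the dart analogy.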

The mathematical equivalent of this is **dL/dW**, where W are the weights at a particular layer. What we want to do is perform a **backward pass** through the network, determining which weights contributed most to the loss and finding ways to adjust them so that the loss decreases. Once we compute this derivative, we go to the last step, the **weight update**: we take all the weights of the filters and update them so that they change in the direction opposite to the gradient. The **learning rate** is a parameter chosen by the programmer. A high learning rate means bigger steps are taken in the weight updates, so it may take less time for the model to converge on an optimal set of weights; however, a learning rate that is too high could result in jumps that are too large and not precise enough to reach the optimal point.
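The weight-update rule can be sketched on a toy one-dimensional problem (a deliberately simple stand-in for a real network: one weight, a loss whose gradient we can write by hand):

```python
# Toy gradient descent: minimise L(w) = (w - 3)**2, whose gradient is
# dL/dw = 2*(w - 3), so the true optimum is w = 3.
w = 0.0
learning_rate = 0.1
for _ in range(100):
    grad = 2 * (w - 3)          # "backward pass": compute dL/dW
    w -= learning_rate * grad   # weight update: step opposite to the gradient

print(round(w, 4))  # 3.0, the loss-minimising weight
```

In a real CNN the same rule `W -= learning_rate * dL/dW` is applied to every filter weight, with the gradients supplied by backpropagation; too large a learning rate would make `w` overshoot 3 and oscillate instead of settling.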

The process of forward pass, loss computation, backward pass, and parameter update is one training iteration. The program repeats this process for a fixed number of iterations over each set of training images (commonly called a batch). Once the parameter update on the last training example is finished, the network should hopefully be trained well enough that the weights of its layers are tuned correctly.

That covers the basics of Convolutional Neural Networks. In the next article we will go through code comparing the accuracy of a model without convolutions against a model that uses them, and we will also discuss hyperparameter optimization.