Source: Deep Learning on Medium

What is Convolution?
In convolution we almost always have a 3X3 matrix which we multiply element-wise with a 3X3 block of pixels. As we can see in the image below, we have a 3X3 filter (kernel), which is the set of numbers below the square blocks, and the actual pixel values (0–255), which are inside the square blocks along with their color. This element-wise multiplication is the convolution. After the multiplication we take the sum of all the elements (shown as ‘+’ in the image below). The sum is 51, which becomes the value of the pixel on the right. The part where we multiply and add the values is the convolution layer. The part that takes the sum as input and gives out another value is called the activation layer. The fancy term for this is ReLU (Rectified Linear Unit).
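The multiply-then-sum-then-activate step above can be sketched in a few lines of NumPy. The patch values and the kernel here are made up for illustration, not the ones from the image:

```python
import numpy as np

# Hypothetical 3x3 patch of pixel values (0-255) and a made-up 3x3 kernel.
patch = np.array([[12, 200, 31],
                  [48,  55, 90],
                  [ 7, 130, 22]])
kernel = np.array([[ 0, -1,  0],
                   [-1,  4, -1],
                   [ 0, -1,  0]])

# Element-wise multiplication followed by a sum: one convolution output value.
conv_value = np.sum(patch * kernel)

# The activation (ReLU) simply clips negative sums to zero.
activated = max(conv_value, 0)
print(conv_value, activated)  # -248 0
```

Note that the convolution itself can produce negative values; it is the ReLU activation that maps them to zero.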

In the following image you can see a kernel converting a 3X3 block of pixels into the red and green image on the right.

The image on the left will help you understand the convolution layer further. The input layer is a 5X5 matrix. The kernel, or filter, is a 3X3 matrix. The filter is multiplied element-wise with a same-sized 3X3 patch of the input matrix, as shown, and the resulting products are summed to give one value of the convolved feature. The convolved feature is evidently smaller than the input matrix: it is 3X3 because the filter can take only 3 steps horizontally and 3 steps vertically while passing over every element of the input matrix.
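Sliding the filter over every valid position gives the full convolved feature. A minimal sketch, with a made-up 5X5 input and an all-ones filter standing in for the ones in the image:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid convolution: slide the kernel over every patch it fits on."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1   # 5 - 3 + 1 = 3 steps vertically
    out_w = image.shape[1] - kw + 1   # 5 - 3 + 1 = 3 steps horizontally
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise multiply one patch with the kernel, then sum.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25).reshape(5, 5)   # hypothetical 5x5 input
kernel = np.ones((3, 3))              # hypothetical 3x3 filter
result = conv2d(image, kernel)
print(result.shape)  # (3, 3)
```

The 3X3 output shape falls directly out of the arithmetic: a 3X3 filter fits in 5 − 3 + 1 = 3 positions along each axis.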

For a larger input matrix we would require more than one filter. In PyTorch these filters are stored not as separate matrices but as a single higher-dimensional tensor. When we apply these filters to the input matrix we get a convolved layer, which is smaller in size than its previous layer. You can stack multiple convolution layers, where each filter is applied to the previous convolved layer.
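You can see this higher-dimensional storage directly in PyTorch. A sketch with arbitrary channel counts (8 filters over a 3-channel image is an assumption, not something fixed by the article):

```python
import torch
import torch.nn as nn

# A conv layer with 3 input channels and 8 filters, each 3x3.
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)

# All 8 filters live in one 4-D weight tensor:
# (out_channels, in_channels, kernel_height, kernel_width)
print(conv.weight.shape)  # torch.Size([8, 3, 3, 3])

x = torch.randn(1, 3, 32, 32)   # a fake 32x32 RGB image
y = conv(x)
print(y.shape)  # torch.Size([1, 8, 30, 30]) -- smaller than the input
```

The output is 30X30 rather than 32X32 for the same reason as before: a 3X3 filter fits in 32 − 3 + 1 = 30 positions along each axis.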

What comes after convolution? A concept called maxpooling. Maxpooling, in simple terms, means taking a small matrix from the input and keeping only its maximum value, as shown.

Maxpooling is used to halve the resolution of the image or of the convolved layer we previously calculated. It takes a 2X2 block from the input and returns its maximum value. To reduce the resolution it therefore makes sense to move 2 “strides” between blocks. At the end we get a 2X2 matrix from an input layer of 4X4, thus halving its resolution.
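A minimal sketch of 2X2 maxpooling with stride 2, using a made-up 4X4 input in place of the one in the image:

```python
import numpy as np

def maxpool2x2(x):
    """Take the max of each non-overlapping 2x2 block (stride 2)."""
    h, w = x.shape
    # Split the matrix into 2x2 blocks, then take the max within each block.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

layer = np.array([[1, 3, 2, 4],
                  [5, 6, 7, 8],
                  [3, 2, 1, 0],
                  [1, 2, 3, 4]])
pooled = maxpool2x2(layer)
print(pooled)
# [[6 8]
#  [3 4]]
```

Each value in the 2X2 output is the maximum of one 2X2 block of the 4X4 input, so the resolution is halved along both axes.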

After maxpooling, random weights are assigned to the values in the matrix. These maxpool activations and their weights are multiplied together (a matrix product) to get a vector, which is then trained using SGD to better classify the image. This is called the fully connected layer.
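In PyTorch terms this flatten-then-matrix-product step looks like the sketch below. The shapes (8 channels of 2X2 activations, 5 output classes) are assumptions for illustration:

```python
import torch
import torch.nn as nn

# Suppose maxpooling left us with 8 channels of 2x2 activations.
pooled = torch.randn(1, 8, 2, 2)

flat = pooled.flatten(start_dim=1)   # shape (1, 32): one long vector
fc = nn.Linear(32, 5)                # random weights, later trained with SGD
scores = fc(flat)                    # matrix product: one score per class
print(scores.shape)  # torch.Size([1, 5])
```

`nn.Linear` initializes its weights randomly, matching the article's description; training with SGD then adjusts them to classify better.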

The output of convolution/pooling is flattened into a single vector of values, each representing evidence that a certain feature belongs to a label. For example, if the image is of a cat, features representing things like whiskers or fur should produce high scores for the label “cat”.

Let us assume we are predicting one of these 5 things, viz. cat, dog, plane, fish, building. The output from the fully connected layer will be a row of 5 raw values. We want to convert these outputs into probabilities: they should all lie between 0 and 1 and add up to 1. For this we need an activation function. An activation function is a function we apply to an activation value; in other words, it is a non-linear function which takes in one value and spits out another value. Here we will use the softmax activation function.

Why do we need this?

Softmax always spits out numbers between 0 and 1, and all the values add up to 1. This is not strictly necessary; you could make the model learn these properties on its own. But when you put a constraint on the values the predictor can take, the neural network does a better job: the constraint makes it easier to learn.

How does softmax work?

First of all we will remove all the negative values in the output row, so we will ‘exp’ all the values. Softmax returns the probabilities of the values you want to predict, and it tends to give a large probability to one particular value. Softmax is always applied to the last layer.

This is the softmax function. The softmax value for one output is the exp of that output divided by the sum of the exps of all the output values: softmax(x_i) = exp(x_i) / Σ_j exp(x_j).
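The formula is short enough to implement directly. A sketch on hypothetical scores for the 5 classes from earlier (cat, dog, plane, fish, building); the numbers are made up:

```python
import numpy as np

def softmax(x):
    """exp of each output divided by the sum of exps of all the outputs."""
    e = np.exp(x - np.max(x))   # subtracting the max avoids overflow;
    return e / e.sum()          # it does not change the result

# Hypothetical raw outputs from the fully connected layer.
out = np.array([2.0, 1.0, 0.1, -1.0, 0.5])
p = softmax(out)

print(p.round(3))   # every value is between 0 and 1
print(p.sum())      # the values add up to 1 (up to floating point)
print(p.argmax())   # 0 -- the largest raw output dominates
```

The last line illustrates the “singling out” behaviour mentioned above: exponentiation stretches the gap between the largest output and the rest, so one label ends up with a clearly higher probability.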

One limitation of softmax is that it should not be used for multi-label classification, i.e. predicting objects that belong to more than one label simultaneously. The reason is that softmax tends to single out one label by giving it a much higher probability than the others, which makes it unsuitable when several labels can be correct at once.