Understanding Convolution Neural Networks -Part I

Source: Deep Learning on Medium

In this article I will explain the main building blocks of Convolutional Neural Networks (CNNs) and then proceed to build a CNN from scratch. CNNs have proven to work well in image classification, segmentation, object detection and similar tasks. They tend to perform well compared to fully connected neural networks while using far fewer parameters. Let’s define some terms that are used throughout this article.

f = filter size

n_filters = number of filters

p = padding

s = stride

Why CNNs?

Figure 1.1 Convolution operation with 10 filters of 3 x 3.

In Figure 1.1, let’s say the input image is 32 x 32 x 3, the filter size is 3 and the number of filters is 10. With no padding and a stride of 1, the output volume would be of size 30 x 30 x 10. Each filter spans all 3 input channels, so the total number of parameters in this convolution layer is [3 x 3 x 3 + 1] x 10 = 280. If this were a fully connected layer instead, the number of weights would be [32 x 32 x 3] x [30 x 30 x 10] ≃ 27.6 million. That’s a lot of parameters to train, and it is computationally expensive. A convolution layer has a small number of parameters due to parameter sharing and sparsity of connections. To see what parameter sharing means, imagine a filter that is responsible for detecting vertical edges: during the convolution operation that same filter slides over the whole image looking for vertical edges. A feature detector that is useful in one part of the image is probably useful in another part as well, so the weights are shared. Sparsity of connections means that in each layer, each output value depends on only a small number of inputs, as shown in Figure 1.2. Together these properties make CNNs less prone to overfitting and allow them to be trained with smaller training sets.
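The parameter counts above can be checked with a few lines of Python. This is only a sketch: the helper names `conv_params` and `dense_params` are made up for illustration, and the convolution count assumes each filter spans all 3 channels of the 32 x 32 x 3 input (3 x 3 x 3 weights plus one bias per filter).

```python
def conv_params(f, in_channels, n_filters):
    """Weights plus one bias per filter for a convolution layer."""
    return (f * f * in_channels + 1) * n_filters


def dense_params(n_inputs, n_outputs):
    """Weights of a fully connected layer (biases left out, as in the text)."""
    return n_inputs * n_outputs


# 32 x 32 x 3 input, ten 3 x 3 filters, 30 x 30 x 10 output volume
conv = conv_params(f=3, in_channels=3, n_filters=10)
dense = dense_params(32 * 32 * 3, 30 * 30 * 10)

print(conv)   # 280
print(dense)  # 27648000
```

The fully connected layer needs roughly 100,000 times as many weights to produce an output of the same size.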

Figure 1.2 Convolution of image I with filter K [Ihab Sami Mohamed, 2017]

In Figure 1.2, during the convolution operation, the value 4 in the output image depends only on the 9 values in the red window of the input image. This is known as sparsity of connections.

Convolution Layer

In a convolution layer, the convolution filter slides over the image and, at each position, computes the element-wise product of the filter with the pixels under the convolution window and sums the products into one output value. The convolution operation shrinks the image, and pixels at the border are used far less often than pixels in the middle (a corner pixel is used only once), because fewer windows overlap them. This may lead to loss of information from the edges of the image. To avoid this problem, padding is used. The hyperparameters of a convolution layer are padding, stride and filter size.
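As a rough illustration of the operation described above, here is a minimal NumPy sketch of a convolution with no padding and a stride of 1 on a single-channel image. The function name `conv2d_valid` and the example values are assumptions made for this sketch. As is common in deep learning, the filter is not flipped, so strictly speaking this is a cross-correlation.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide `kernel` over a 2-D `image` with no padding and stride 1."""
    n_h, n_w = image.shape
    f_h, f_w = kernel.shape
    out = np.zeros((n_h - f_h + 1, n_w - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Element-wise product of the window with the filter, then sum
            window = image[i:i + f_h, j:j + f_w]
            out[i, j] = np.sum(window * kernel)
    return out


# Hypothetical 5 x 5 image and a simple vertical edge filter
image = np.arange(25, dtype=float).reshape(5, 5)
vertical_edge = np.array([[1.0, 0.0, -1.0]] * 3)

result = conv2d_valid(image, vertical_edge)
print(result.shape)  # (3, 3)
```

The 5 x 5 input shrinks to 3 x 3, which is the shrinking effect that padding is meant to counter.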

Figure 1.3 Convolution operation ‘x’ of input image with filter of 3 x 3.

In Figure 1.3, a 3 x 3 section of the input image is convolved with a 3 x 3 filter by multiplying the pixel values element-wise at each location and summing the products. Valid convolution means no padding, and same convolution means padding such that the output size is the same as the input size. By convention in computer vision, the filter size is usually odd. For same convolution (with a stride of 1), p = (f - 1) / 2, where f is the filter size. Strided convolution means that you move the convolution window over the image by s rows and s columns at a time.
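The effect of padding and stride on the output size follows the standard formula floor((n + 2p - f) / s) + 1 for an n x n input. The formula is not stated in the article itself, and the helper names below are assumptions for illustration.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output height/width for an n x n input: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1


def same_padding(f):
    """Padding that preserves the input size at stride 1 (odd f only)."""
    if f % 2 == 0:
        raise ValueError("same padding as (f - 1) / 2 needs an odd filter size")
    return (f - 1) // 2


print(conv_output_size(32, 3))                     # valid convolution: 30
print(conv_output_size(32, 3, p=same_padding(3)))  # same convolution: 32
print(conv_output_size(7, 3, s=2))                 # stride 2: 3
```

This also shows why odd filter sizes are convenient: (f - 1) / 2 is a whole number only when f is odd.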


Padding adds zeros on the borders of the image. For example, if an image of shape (n x n) is padded with an amount p, the result is an image of shape (n + 2p) x (n + 2p), as shown in Figure 1.4.

Figure 1.4 Image of 3 x 3 is padded with amount 2.

Padding helps to preserve information at the edges that may otherwise be lost during the convolution operation. It allows you to use a convolution layer without necessarily shrinking the height and width of the volumes. This is important for building deeper networks, since otherwise the height and width would shrink at every layer. An important special case is the same convolution, in which the height and width are exactly preserved after one layer. Padding also helps keep more of the information at the border of an image: without it, very few values at the next layer would be affected by pixels at the edges of the image.
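Zero padding can be sketched in NumPy with `np.pad`. The wrapper name `zero_pad` is an assumption for this example, and the 3 x 3 image of ones simply mirrors the setup of Figure 1.4 (a 3 x 3 image padded with amount 2).

```python
import numpy as np


def zero_pad(image, p):
    """Add p rows/columns of zeros on every side of a 2-D image."""
    return np.pad(image, pad_width=p, mode="constant", constant_values=0)


image = np.ones((3, 3))
padded = zero_pad(image, 2)

print(padded.shape)  # (7, 7)
```

The original pixels sit untouched in the centre of the padded array, surrounded by a border of zeros that is p pixels wide.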


Stride is how far the convolution filter moves before computing the next value. If an image is (n x n), the stride is 2 and the filter is (f x f), the filter moves over the image in steps of 2 pixels in both the horizontal and vertical directions. In Figure 1.5, the filter computes the element-wise product of pixels with the red window, then shifts 2 pixels horizontally and convolves with the blue window and then the green window. It then shifts 2 pixels vertically and continues until the end of the image.

Figure 1.5 Convolution operation with a filter of 3 x 3 and stride of 2.
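A strided convolution can be sketched by multiplying the output index by the stride to find the top-left corner of each window. The function name `conv2d_strided` is hypothetical, and the 7 x 7 input is an assumption, since the image size in Figure 1.5 is not given in the text.

```python
import numpy as np


def conv2d_strided(image, kernel, s=1):
    """Convolve a 2-D image with a kernel, moving s pixels per step."""
    n_h, n_w = image.shape
    f_h, f_w = kernel.shape
    out_h = (n_h - f_h) // s + 1
    out_w = (n_w - f_w) // s + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Top-left corner of the window jumps by s in each direction
            top, left = i * s, j * s
            window = image[top:top + f_h, left:left + f_w]
            out[i, j] = np.sum(window * kernel)
    return out


image = np.arange(49, dtype=float).reshape(7, 7)
kernel = np.ones((3, 3))

out = conv2d_strided(image, kernel, s=2)
print(out.shape)  # (3, 3)
```

With a stride of 2, the three windows in each row start at columns 0, 2 and 4, matching the red, blue and green windows described above.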

ReLU Layer

This layer applies the ReLU (rectified linear unit) activation function, which returns the input directly if it is greater than zero and returns zero otherwise:

ReLU(Z) = max(0, Z)

where Z is the input.

Figure 1.6 ReLU activation on 3 x 3 input image.

ReLU is the default choice for hidden layers. It makes learning faster because its gradient is exactly 1 for every positive input, so it does not saturate the way sigmoid or tanh do; gradients that shrink towards zero slow down training.
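In NumPy, ReLU and its gradient can be sketched in one line each. The function names and the 3 x 3 example values are assumptions for illustration, not the values from Figure 1.6.

```python
import numpy as np


def relu(z):
    """Element-wise max(0, z)."""
    return np.maximum(0, z)


def relu_grad(z):
    """Gradient of ReLU: 1 where z > 0, else 0 (0 is used at z == 0)."""
    return (z > 0).astype(float)


z = np.array([[-2.0, 0.0, 3.0],
              [1.5, -0.5, 4.0],
              [0.0, -7.0, 2.0]])

activated = relu(z)
grads = relu_grad(z)
print(activated)
print(grads)
```

Negative entries become zero while positive entries pass through unchanged, and the gradient for every positive entry is 1.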

Max Pool layer

Max pooling computes the maximum of each (f x f) window in an (n x n) image. Filter size f and stride s are the hyperparameters of this layer.

Figure 1.7 Maxpooling with filter size of 2 x 2 and stride 2.

The pooling (POOL) layer reduces the height and width of the input. It helps reduce computation and makes feature detectors more invariant to the position of a feature in the input. The idea behind max pooling is that a large value in a window signals that some feature was detected there, and taking the maximum preserves that signal in the output. For example, if a vertical edge is detected in the upper left 2 x 2 window, the maximum value preserves this feature. If the feature is not detected, the maximum is still small, which reflects that it was absent in the first place. One interesting thing to note is that the max pool layer has no parameters for backpropagation to train.
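A minimal sketch of max pooling on a single-channel input is shown below. The function name `max_pool2d` and the 4 x 4 example values are assumptions for illustration and are not taken from Figure 1.7.

```python
import numpy as np


def max_pool2d(x, f=2, s=2):
    """Take the maximum of each f x f window, moving s pixels per step."""
    n_h, n_w = x.shape
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * s:i * s + f, j * s:j * s + f]
            out[i, j] = np.max(window)
    return out


x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 0],
              [7, 2, 9, 8],
              [1, 0, 3, 4]], dtype=float)

pooled = max_pool2d(x, f=2, s=2)
print(pooled.shape)  # (2, 2)
```

Each 2 x 2 block is reduced to its largest value, so the 4 x 4 input becomes [[6, 5], [7, 9]], and the function contains no weights to learn.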

Flatten layer

It performs ‘flattening’ on the input volume: it converts the multi-dimensional array of features into a vector that can be fed into a fully connected neural network classifier. The flatten layer does not have any parameters to learn.
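Flattening amounts to a reshape. The sketch below assumes a batch-first layout of (m, height, width, channels); both that layout and the example shape are assumptions, not something the article specifies.

```python
import numpy as np


def flatten(volume_batch):
    """Reshape (m, h, w, c) feature volumes into (m, h * w * c) vectors."""
    m = volume_batch.shape[0]
    return volume_batch.reshape(m, -1)


# Hypothetical batch of 8 feature volumes of size 15 x 15 x 10
batch = np.zeros((8, 15, 15, 10))
flat = flatten(batch)

print(flat.shape)  # (8, 2250)
```

Each example keeps its own row, so the result can be passed straight to a fully connected layer that expects 2250 input features.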


Mohamed, Ihab Sami. Detection and tracking of pallets using a laser rangefinder and machine learning techniques. Master’s thesis, European Master on Advanced Robotics Plus (EMARO+), University of Genova, 2017.