The Architecture behind CNN and VGGNet-16

Source: Deep Learning on Medium

By Adityam Ghosh

When it comes to machine learning, artificial neural networks (ANNs) perform really well. Because of their non-linear activation functions, ANNs can be applied to a wide range of pattern-finding problems. They are used in classification tasks on images, audio and text, and in regression tasks such as time-series analysis. Different ANNs serve different purposes: for example, to predict the next word in a sentence we employ a Recurrent Neural Network (RNN), typically an LSTM, while for image classification we employ a Convolutional Neural Network (CNN).

But before diving into convolutional neural networks, let me first describe the basic building blocks of a neural network.

A neural network basically consists of three kinds of layers:

  • Input Layer: It’s the layer through which we give input to our model. The number of neurons in this layer is equal to the total number of features in our data (the number of pixels in the case of an image).
  • Hidden Layer: The input from the input layer is then fed into the hidden layer. There can be many hidden layers, depending on our model and data size. Each hidden layer can have a different number of neurons, which is often, though not necessarily, greater than the number of features. The output of each layer is computed by multiplying the output of the previous layer by that layer’s learnable weight matrix, adding the learnable biases, and then applying an activation function, which makes the network non-linear.
  • Output Layer: The output from the last hidden layer is then fed into a function such as sigmoid or softmax, which converts the raw score for each class into a probability for that class.

Feedforward and Backpropagation

The data is fed into the model and the output of each layer is computed in turn; this step is called feed-forward. We then calculate the error using a loss function such as cross-entropy or squared error. After that, we backpropagate through the model, computing the derivative of the loss with respect to every weight and bias, and use gradient descent to update the parameters with those derivatives. This step is called backpropagation, and together with gradient descent it is used to minimize the loss (error) function.
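To make the feed-forward, loss, backpropagation and update cycle concrete, here is a minimal sketch for the simplest possible case: a single sigmoid neuron trained with cross-entropy loss. The function name `train_logistic`, the toy data, the learning rate and the number of epochs are all my own illustrative assumptions, not part of any library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=200):
    """Train one sigmoid neuron with plain (full-batch) gradient descent."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=X.shape[1])
    b = 0.0
    losses = []
    eps = 1e-12  # avoids log(0)
    for _ in range(epochs):
        # Feed-forward: compute the predicted probabilities
        p = sigmoid(X @ w + b)
        # Cross-entropy loss averaged over the batch
        loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
        losses.append(loss)
        # Backpropagation: derivative of the loss w.r.t. w and b
        grad_z = (p - y) / len(y)
        grad_w = X.T @ grad_z
        grad_b = grad_z.sum()
        # Gradient descent: step against the gradient
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, losses

# Toy data (assumed for illustration): the label is 1 when x0 + x1 > 0
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w, b, losses = train_logistic(X, y)
accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == (y == 1))
print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}, accuracy {accuracy:.2f}")
```

In a deeper network the same idea applies layer by layer: the chain rule carries `grad_z` backwards through each layer to get the gradient for that layer’s weights.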

[Figure: gradient descent]

Basic Python code for a neural network with two hidden layers is as follows:

W1, W2, W3, b1, b2 and b3 are the parameters learned using gradient descent.
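A minimal NumPy sketch of the forward pass of such a network could look like the following. It reuses the parameter names W1, W2, W3, b1, b2, b3 from above; the layer sizes (784, 128, 64, 10), the sigmoid hidden activations and the helper names `init_params` and `forward` are assumptions I picked for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the row-wise max for numerical stability
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def init_params(n_in, n_h1, n_h2, n_out, seed=0):
    """Small random weights and zero biases for a 2-hidden-layer network."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(scale=0.1, size=(n_in, n_h1)), "b1": np.zeros(n_h1),
        "W2": rng.normal(scale=0.1, size=(n_h1, n_h2)), "b2": np.zeros(n_h2),
        "W3": rng.normal(scale=0.1, size=(n_h2, n_out)), "b3": np.zeros(n_out),
    }

def forward(X, p):
    """Feed-forward pass: two hidden layers, then a softmax output layer."""
    h1 = sigmoid(X @ p["W1"] + p["b1"])      # hidden layer 1
    h2 = sigmoid(h1 @ p["W2"] + p["b2"])     # hidden layer 2
    return softmax(h2 @ p["W3"] + p["b3"])   # class probabilities

# Assumed sizes: 784 input features, 128 and 64 hidden neurons, 10 classes
params = init_params(784, 128, 64, 10)
X = np.random.default_rng(1).normal(size=(5, 784))  # a batch of 5 random inputs
probs = forward(X, params)
print(probs.shape)
```

Each row of `probs` holds one probability per class, and the rows sum to 1 because of the softmax.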

Convolutional Neural Networks

Convolutional neural networks, or ConvNets, are neural networks that share their parameters across spatial positions. Imagine you have an image. It can be represented as a cuboid with a width and height (the dimensions of the image) and a depth (as images generally have red, green, and blue channels).

Now imagine taking a small patch of this image and running a small neural network on it with, say, k outputs, and representing them vertically. Now slide that neural network across the whole image; as a result, we will get another image with a different width, height, and depth. Instead of just the R, G and B channels we now have more channels, but a smaller width and height. This operation is called convolution. If the patch size is the same as that of the image, it is a regular neural network. Because the patch is small and the same weights are reused at every position, we have far fewer weights.
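To see how much the small patch saves, here is a rough parameter count comparing a fully connected layer with a convolution layer. The 32 x 32 x 3 image, the 3 x 3 patch and k = 12 outputs are illustrative assumptions, and both helper functions are hypothetical names of mine.

```python
def dense_params(height, width, channels, k):
    """Weights + biases when every pixel connects to each of k outputs."""
    return height * width * channels * k + k

def conv_params(patch, channels, k):
    """Weights + biases for k shared filters of size patch x patch x channels."""
    return patch * patch * channels * k + k

dense = dense_params(32, 32, 3, 12)   # 36876
conv = conv_params(3, 3, 12)          # 336
print(dense, conv)

# A patch as large as the image gives back the regular (dense) layer
print(conv_params(32, 3, 12) == dense)  # True
```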

Now let’s talk about a bit of mathematics which is involved in the whole convolution process.

  • Convolution layers consist of a set of learnable filters (the patches described above). Every filter has a small width and height and the same depth as the input volume (3 if the input is an RGB image). For example, if we have to run a convolution on an image of dimension 34 x 34 x 3, possible filter sizes are a x a x 3, where ‘a’ can be 3, 5, 7, etc. (preferably odd), but small compared to the image dimension.
  • During the forward pass, we slide each filter across the whole input volume step by step, where the size of each step is called the stride (which can be 2, 3 or even 4 for high-dimensional images), and compute the dot product between the filter weights and the patch of the input volume.
  • As we slide our filters we get a 2-D output for each filter. Stacking these outputs together gives an output volume whose depth is equal to the number of filters. The network learns all the filters.
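The three steps above can be sketched as a deliberately slow, loop-based convolution in NumPy. This is only meant to show the mechanics, not how a real framework implements it; the function name `conv2d_naive`, the channels-last layout and the choice of 8 filters are my assumptions.

```python
import numpy as np

def conv2d_naive(x, filters, bias, stride=1, pad=0):
    """Convolve an (H, W, C) volume with K filters of shape (f, f, C).

    filters has shape (K, f, f, C); the result has shape (H_out, W_out, K).
    """
    k, f, _, c = filters.shape
    assert c == x.shape[2], "filter depth must match input depth"
    # Zero-pad only the two spatial dimensions
    x = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    h_out = (x.shape[0] - f) // stride + 1
    w_out = (x.shape[1] - f) // stride + 1
    out = np.zeros((h_out, w_out, k))
    for i in range(h_out):
        for j in range(w_out):
            r, s = i * stride, j * stride
            patch = x[r:r + f, s:s + f, :]
            # Dot product of the patch with every filter at once -> K values
            out[i, j, :] = np.tensordot(filters, patch, axes=([1, 2, 3], [0, 1, 2])) + bias
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(34, 34, 3))      # the 34 x 34 x 3 example from above
filters = rng.normal(size=(8, 3, 3, 3))   # 8 filters of size 3 x 3 x 3 (assumed)
bias = np.zeros(8)

out = conv2d_naive(image, filters, bias)
print(out.shape)  # (32, 32, 8)
```

The depth of the output (8) equals the number of filters, and with stride 1 and no padding the 34 x 34 image shrinks to 32 x 32.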

Layers used to build the ConvNet

Let’s take an example by running a ConvNet on an image of dimension 32 x 32 x 3.

  1. Input Layer: This layer holds the raw input image with width 32, height 32 and depth 3.
  2. Convolution Layer: This layer computes the output volume by computing the dot product between each filter and each image patch. Suppose we use a total of 12 filters for this layer; with padding chosen so that the width and height are preserved, we’ll get an output volume of dimension 32 x 32 x 12.
  3. Activation Function Layer: This layer applies an element-wise activation function to the output of the convolution layer. Some common activation functions are ReLU: max(0, x), Sigmoid: 1/(1+e^-x), Tanh, Leaky ReLU, etc. The size of the volume remains unchanged, hence the output volume will have dimension 32 x 32 x 12.
  4. Pool Layer: This layer is periodically inserted in the ConvNet. Its main function is to reduce the size of the volume, which makes the computation faster, reduces memory use and also helps prevent overfitting. Two common types of pooling layers are max pooling and average pooling. If we use a max pool with 2 x 2 filters and stride 2, the resulting volume will be of dimension 16 x 16 x 12.

The output width and height of a convolution or pooling layer are given by the following formula:

output size = [(n + 2p - f)/s + 1] x [(n + 2p - f)/s + 1]
n = input height/width
p = padding amount
f = filter height/width
s = stride value

  5. Fully-Connected Layer: This layer is a regular neural network layer which takes its input from the previous layer, computes the class scores and outputs a 1-D array of size equal to the number of classes.
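As a rough check of the sizes quoted in the list above, the sketch below implements the output-size formula and pushes a random 32 x 32 x 12 volume through ReLU, 2 x 2 max pooling and a fully connected layer. The helper names are mine, the floor division in `conv_output_size` is an assumption about how a non-integer result is rounded, and the 10 output classes are assumed (as in CIFAR-10).

```python
import numpy as np

def conv_output_size(n, f, p=0, s=1):
    """Output height/width for input size n, filter f, padding p, stride s."""
    return (n + 2 * p - f) // s + 1

def relu(x):
    return np.maximum(0, x)

def max_pool(x, f=2, s=2):
    """Max pooling over the spatial dimensions of an (H, W, C) volume."""
    h_out = conv_output_size(x.shape[0], f, 0, s)
    w_out = conv_output_size(x.shape[1], f, 0, s)
    out = np.zeros((h_out, w_out, x.shape[2]))
    for i in range(h_out):
        for j in range(w_out):
            window = x[i * s:i * s + f, j * s:j * s + f, :]
            out[i, j, :] = window.max(axis=(0, 1))
    return out

# 3 x 3 filter with padding 1 and stride 1 keeps 32 -> 32
print(conv_output_size(32, f=3, p=1, s=1))  # 32
# 2 x 2 pooling with stride 2 halves 32 -> 16
print(conv_output_size(32, f=2, p=0, s=2))  # 16

rng = np.random.default_rng(0)
conv_out = rng.normal(size=(32, 32, 12))   # stand-in for the convolution output
activated = relu(conv_out)                 # still 32 x 32 x 12
pooled = max_pool(activated)               # 16 x 16 x 12
flat = pooled.reshape(-1)                  # 16 * 16 * 12 = 3072 values

# Fully connected layer: one score per class (10 classes assumed)
W_fc = rng.normal(scale=0.01, size=(flat.size, 10))
b_fc = np.zeros(10)
scores = flat @ W_fc + b_fc
print(activated.shape, pooled.shape, scores.shape)
```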


Structure of VGGNet

VGGNet is one of the classic convolutional neural networks, and a pre-trained version of it is available in Keras. To try out some experimentation, I personally broke down the architecture and created a model almost identical in architecture to VGGNet. Since I trained it on the CIFAR-10 dataset, I removed the last convolution layer, so my model has just one convolution layer fewer than the original VGGNet-16 model.
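To give a feel for what this looks like on 32 x 32 x 3 CIFAR-10 images, here is a small pure-Python sketch that traces the output shape and parameter count through the convolutional part of such a network. It uses the standard VGG-16 layout (3 x 3 convolutions with padding 1, 2 x 2 max pooling with stride 2) and assumes the removed layer is the final 512-channel convolution; the names `VARIANT_CFG` and `trace` are hypothetical, and the fully connected head is left out.

```python
# Standard VGG-16 convolutional layout: numbers are the output channels of
# 3x3 convolutions (stride 1, padding 1); "M" is a 2x2 max pool with stride 2.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]

# Assumed variant: drop the final 512-channel convolution, keep the last pool
VARIANT_CFG = VGG16_CFG[:-2] + ["M"]

def trace(cfg, size=32, channels=3):
    """Return (layer type, output shape, parameter count) for each layer."""
    rows = []
    for layer in cfg:
        if layer == "M":
            size = (size - 2) // 2 + 1            # pooling halves width/height
            rows.append(("maxpool", (size, size, channels), 0))
        else:
            params = 3 * 3 * channels * layer + layer   # weights + biases
            channels = layer                             # padding 1 keeps the size
            rows.append(("conv3x3", (size, size, channels), params))
    return rows

rows = trace(VARIANT_CFG)
for name, shape, params in rows:
    print(f"{name:8s} {str(shape):16s} {params:>10,d}")

n_conv = sum(1 for name, _, _ in rows if name == "conv3x3")
total_params = sum(params for _, _, params in rows)
print(n_conv, "convolution layers,", total_params, "convolution parameters")
```

With these assumptions the variant has 12 convolution layers instead of 13, and a 32 x 32 input is reduced to a 1 x 1 x 512 volume before the fully connected layers.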

Code Download and Suggestions:

Link to download the full code:

Your valuable input on this R&D is highly appreciated. Please share your thoughts in the comment box, or you can reach me directly at

Thanks for reading :) !!