Convolutional Neural Networks and Capsule Networks Part-1

Hey everyone! In this post I would like to tell you about CNNs, the full architecture, layers involved and basic implementation via Keras. In my next post I will continue to describe the issues with CNNs and how Capsule Nets came into the picture.

Convolutional Neural Nets are biologically inspired variants of Multilayer Perceptrons. The visual cortex contains a complex arrangement of cells. Each of these cells is sensitive to a small sub-region of the visual field, called a receptive field. The sub-regions are tiled to cover the entire visual field. These cells act as local filters over the input space and are well-suited to exploit the strong spatially local correlation present in natural images.

Image classification is what we use them for: given an image, we output the class it belongs to. We feed in the image as an array of pixel values, and the deep network classifies it for us: is it a bird, a bus, a human, and so on. Since it's a deep network, the data passes through a series of layers. The layers we use here are Convolutional layers, Pooling layers and Fully-Connected layers. Now let's understand what exactly is going on within these layers.

Convolutional Layer

As the name suggests, we have to convolve over the input image. Let's consider an image of dimension 32*32*3. Now we choose a filter size, let's say 5*5*3. Think of the filter as a flashlight (a parallel beam) that we slide over this 32*32*3 input, covering each portion of the image in turn. This beam is our filter (5*5*3). Keep in mind that the depth of the filter must be the same as the depth of the input image. The area the flashlight is shining on at any moment is what we call the receptive field.

Our 5*5*3 filter has weights of the same dimension associated with it, plus a bias parameter. To perform a forward pass through this layer, we slide the whole filter across the width and height of the input with a certain stride. The stride is simply the number of pixels the filter moves at each step. For example, with stride = 1 the filter moves 1 pixel at a time across the width or height. As we move the filter, it creates a 2D activation map, also called a feature map. Look at the images below.

5*5 filter convolving over 32*32 input

What we do now is compute W.x+b, i.e. multiply each weight by the input value under it in the receptive field, sum the products and add the bias. Doing this at every position generates a 28*28*1 activation map. Similarly, if we use two filters, each of size 5*5*3, we get a 28*28*2 activation map.

These filters are basically used to detect features such as edges or curves in an image. The more filters we use, the more detailed the information we extract from the image. For example, one filter may be a simple straight-edge detector, another may identify outward curves, and yet another may identify inward curves.
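To make the W.x+b step concrete, here is a minimal sketch of a single filter sliding over an input, written in plain Python so that every loop is visible. The function name `conv2d_single_filter` and the random image and weights are only assumptions for illustration; a real framework does the same thing far more efficiently.

```python
import random

def conv2d_single_filter(image, weights, bias, stride=1):
    """Slide one f x f x depth filter over an H x W x depth image (no padding)."""
    height, width, depth = len(image), len(image[0]), len(image[0][0])
    f = len(weights)
    out_h = (height - f) // stride + 1
    out_w = (width - f) // stride + 1
    activation_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = bias
            # W.x + b over the current receptive field
            for di in range(f):
                for dj in range(f):
                    for c in range(depth):
                        total += weights[di][dj][c] * image[i * stride + di][j * stride + dj][c]
            row.append(total)
        activation_map.append(row)
    return activation_map

random.seed(0)
# Hypothetical 32*32*3 image and 5*5*3 filter filled with random values
image = [[[random.random() for _ in range(3)] for _ in range(32)] for _ in range(32)]
weights = [[[random.uniform(-1, 1) for _ in range(3)] for _ in range(5)] for _ in range(5)]

activation_map = conv2d_single_filter(image, weights, bias=0.1)
print(len(activation_map), len(activation_map[0]))  # 28 28
```

Running one such filter gives one 28*28 map; stacking the maps of several filters gives the depth of the output volume.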

Actual features identified by convolutional layers.

Till now, we had an input image of 32*32*3 and a filter of size 5*5*3, which gave us an activation map of 28*28*1. Now let's say we want the output to have the same width and height as the input, i.e. 32*32*1. To do that, we pad the border of the input with zeros; with a 5*5 filter and stride 1, 2 pixels of zeros on each side are enough. Zero padding thus lets us control the spatial size of the output.

Let's say we have an input of size W, receptive field size F, stride S and zero-padding P. Then ((W - F + 2P)/S) + 1 gives the height and width of the output activation map. This puts a constraint on the stride: it must be chosen so that the filter tiles the input exactly rather than running past its edge. For example, for W=10, P=0, F=3 and S=2 the output size comes out to be 4.5, which is not possible as the output size must be an integer. So we must check the value of the stride.
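The output-size formula is easy to check in code. The helper below is a small sketch (the name `conv_output_size` is my own, not from any library) that also rejects stride values which do not fit the input.

```python
def conv_output_size(w, f, s=1, p=0):
    """Return ((W - F + 2P) / S) + 1, or raise if the stride does not fit."""
    span = w - f + 2 * p
    if span % s != 0:
        raise ValueError(f"stride {s} does not fit: ({w} - {f} + 2*{p}) / {s} is not an integer")
    return span // s + 1

print(conv_output_size(32, 5))        # 28
print(conv_output_size(32, 5, p=2))   # 32, zero padding keeps the input size
print(conv_output_size(10, 3, s=1))   # 8

try:
    conv_output_size(10, 3, s=2)      # would give 4.5
except ValueError as err:
    print("invalid:", err)
```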

Sharing of Parameters

For a 32*32*3 input image and a 5*5*3 filter size, there are weights associated with each filter. Let's say we have 20 filters. The activation map will then be of dimension 28*28*20, which means 28*28*20 = 15680 neurons, each with 5*5*3 = 75 weights and a bias. If every neuron kept its own weights, that would be roughly 15680*75 = 1176000 parameters in the first convolutional layer alone. And here we took a small input dimension. For larger inputs these numbers grow very quickly, and the model becomes very large.

What we do instead is keep a single set of weights for each depth slice of the activation map. Here we have 20 filters, so we keep 20 sets of 5*5*3 weights: 20*5*5*3 = 1500 weights plus 20 bias parameters (1 with each filter), 1520 parameters in total. This way we drastically reduce the number of parameters, because all neurons in the same depth slice share the same parameters.
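As a quick sanity check of the arithmetic, here is a sketch that counts the parameters with and without sharing, assuming the 5*5*3 filter and 20 filters from the running example. The variable names are just for illustration.

```python
# Running example: 32*32*3 input, 5*5*3 filters, 20 filters, stride 1, no padding
out_h, out_w, num_filters = 28, 28, 20
weights_per_filter = 5 * 5 * 3

neurons = out_h * out_w * num_filters
# Without sharing, every neuron would own its 75 weights (biases left out, as in the rough count)
without_sharing = neurons * weights_per_filter
# With sharing, each filter owns one set of weights plus one bias
with_sharing = num_filters * weights_per_filter + num_filters

print(neurons)          # 15680
print(without_sharing)  # 1176000
print(with_sharing)     # 1520
```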

Activation Function

Now, after we get our activation map, we apply an activation function to it. The activation function adds non-linearity to the system. There are many non-linear functions like sigmoid, tanh etc., but we use ReLU here. ReLU stands for rectified linear unit, f(x)=max(0,x). ReLU also helps mitigate the vanishing gradient problem. Moreover, the network often learns faster with ReLU than with the others.
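In code, ReLU is a one-liner applied to every value of the activation map. The tiny 2*2 map below is made up purely for illustration.

```python
def relu(x):
    """Rectified linear unit: f(x) = max(0, x)."""
    return max(0.0, x)

# Hypothetical activation map with some negative responses
activation_map = [[-2.0, 0.5], [3.0, -0.1]]
activated = [[relu(v) for v in row] for row in activation_map]
print(activated)  # [[0.0, 0.5], [3.0, 0.0]]
```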

Pooling Layer

It is common to put pooling layers after some of the convolutional layers. This is generally done to reduce the spatial size of the representation. A pooling layer takes each activation map output by a convolutional layer and prepares a condensed version of its features. The idea is that the rough position of a feature relative to other features matters more than its exact location, so the size can be reduced without losing much.

Maxpool operation with 2*2 filter and stride=2

Here a simple maxpool operation is applied with a 2*2 filter and stride=2. Only the largest value in each 2*2 window is kept, so 75% of the activations are discarded. The image below shows downsampling of a 224*224*64 input to 112*112*64 after the maxpool operation.
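The maxpool operation can be sketched in a few lines of plain Python. The function name `max_pool2d` and the 4*4 input are assumptions for illustration; the same function applied to each of the 64 depth slices of a 224*224 input would give 112*112 maps.

```python
def max_pool2d(activation_map, size=2, stride=2):
    """Keep the largest value in each size x size window of a 2D activation map."""
    out_h = (len(activation_map) - size) // stride + 1
    out_w = (len(activation_map[0]) - size) // stride + 1
    return [
        [
            max(
                activation_map[i * stride + di][j * stride + dj]
                for di in range(size)
                for dj in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

feature_map = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
print(max_pool2d(feature_map))  # [[6, 8], [3, 4]]
```

Of the 16 input values only 4 survive, which is where the 75% figure comes from.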

Downsampling of an image

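Similarly, we can use average pooling or L2-norm pooling. By reducing the number of activations, and with them the parameters needed in later layers, pooling also helps control overfitting in the network.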

Fully-Connected Layer

As the name suggests, a fully connected layer connects every output of the previous layer (activated neurons or maxpool outputs) to every output class. It combines the features extracted by the previous layers and returns, for each class, a score that is typically turned into a probability that the input belongs to that class. The output dimension of the final fully-connected layer is the same as the number of classes. During training, this calculated output is compared with the actual output (the true label) to compute the loss.

This is LeNet, a network used to classify images for digit recognition. This is how all these layers are generally stacked to get the desired classified output.

In my next post I would like to tell you some more things about CNNs and their implementation with Keras, and then we will discuss what the problems with CNNs are and why we use capsule nets.

In this post I used a lot of references and images from the CS231n course by Stanford and the Neural Networks and Deep Learning book by Michael Nielsen. Both of them are highly recommended.
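Here is a minimal sketch of what a final fully-connected layer does, under a few assumptions of mine: the scores are turned into probabilities with softmax and the loss is cross-entropy, which is a common choice but not the only one. The feature vector and weights are made-up numbers for illustration.

```python
import math

def fully_connected(inputs, weights, biases):
    """One score per class: dot product of the inputs with that class's weights, plus bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_class):
    """Loss is small when the true class gets a high probability."""
    return -math.log(probs[true_class])

# Hypothetical flattened features from the previous layers, and 3 output classes
features = [0.5, 1.0, -0.5, 2.0]
weights = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
biases = [0.0, 0.0, 0.0]

probs = softmax(fully_connected(features, weights, biases))
predicted = probs.index(max(probs))
print(predicted)                    # 1
print(cross_entropy(probs, 1))      # loss if the true class is 1
```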
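To see how the shapes change as the layers are stacked, here is a small sketch that traces them through the network. I am assuming the classic LeNet-5 sizes (32*32*1 input, 5*5 convolutions with 6 and 16 filters, 2*2 pooling with stride 2, then fully-connected layers of 120, 84 and 10 units); the figure may use slightly different numbers.

```python
def out_size(size, f, stride=1, pad=0):
    """Spatial output size of a conv or pooling layer."""
    return (size - f + 2 * pad) // stride + 1

# (kind, filter size, number of filters) - assumed classic LeNet-5 layout
layers = [
    ("conv", 5, 6),
    ("pool", 2, None),
    ("conv", 5, 16),
    ("pool", 2, None),
]

size, depth = 32, 1
shapes = [(size, size, depth)]
for kind, f, num_filters in layers:
    if kind == "conv":
        size, depth = out_size(size, f), num_filters
    else:
        size = out_size(size, f, stride=2)  # pooling keeps the depth
    shapes.append((size, size, depth))

flattened = size * size * depth
fc_sizes = [120, 84, 10]  # the last layer has one output per digit class

print(shapes)     # [(32, 32, 1), (28, 28, 6), (14, 14, 6), (10, 10, 16), (5, 5, 16)]
print(flattened)  # 400
```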