Beyond CNN: A Brief Intro to Capsule Networks

Original article was published on Artificial Intelligence on Medium

“Inside of every problem lies an opportunity.”

-Robert Kiyosaki

I was assigned a binary image classification task, for which I used a CNN, as it performs better than other approaches (at the time of writing), and I was able to get an F1-score of 97% for both classes. I then searched for other options and revisited Capsule Nets. This post shares a brief intro to CapsuleNets.


The Capsule Network is a type of artificial neural network that is able to model hierarchical relationships. The approach tries to create a multi-layer visual parsing system with a tree-like structure, mimicking the biological neural structure of the human visual system.

Problems/Limitations with CNN

ConvNets, or CNNs, are a highly promising and widely used type of neural network. They are the first preference for image-related tasks and have given many state-of-the-art results in image classification; in some cases they can achieve better results than humans. But they have some limitations, which are discussed in detail below.

Problem 1: ConvNets are translation Invariant

This means that the network only predicts the presence of the object, without its spatial information. For example, consider a cat-identification CNN: it can predict the cat in both images in fig. 1. But the prediction does not include the spatial information, namely that the cat in image 1 is near the right and the cat in image 2 is near the left, as in fig. 2.

Fig.1. Predictions from cat CNN model (Translation Invariance)[1]

Fig. 2. Predictions expected from Translation Equivariance[1]
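The difference between the two figures can be illustrated with a toy example. The sketch below is not a real CNN: it assumes a hypothetical 2x2 "cat" pattern, a single hand-written filter, and a global max as a stand-in for pooling. The detection score is the same wherever the pattern sits (invariance), while the location of the strongest response moves with it (equivariance).

```python
import numpy as np

def toy_feature_map(image, kernel):
    """Valid cross-correlation of a 2D image with a 2D kernel (no padding)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Hypothetical 2x2 pattern, used both as the "cat" and as its detector.
pattern = np.array([[1.0, 2.0],
                    [3.0, 4.0]])

# Same object placed near the left and near the right of a 6x6 image.
left = np.zeros((6, 6))
left[2:4, 0:2] = pattern
right = np.zeros((6, 6))
right[2:4, 4:6] = pattern

fm_left = toy_feature_map(left, pattern)
fm_right = toy_feature_map(right, pattern)

# Invariant summary: a global max keeps only "how strongly was it detected".
score_left = float(fm_left.max())
score_right = float(fm_right.max())

# Equivariant summary: the position of the peak follows the object.
pos_left = tuple(int(i) for i in np.unravel_index(fm_left.argmax(), fm_left.shape))
pos_right = tuple(int(i) for i in np.unravel_index(fm_right.argmax(), fm_right.shape))

print(score_left, score_right)  # identical scores
print(pos_left, pos_right)      # different positions
```

A classifier that only sees the two scores cannot tell the images apart, which is the behaviour in fig. 1; keeping the peak positions as well corresponds to fig. 2.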

This seems good and can be considered useful for the robustness of the model. But a problem arises when we try to identify objects whose features hold spatial relationships. A good example is a face: if we pass in a bunch of randomly assembled face parts, the CNN will detect it as a face, just as it will the actual face, as in fig. 3. A Capsule Network, however, would be able to identify that the face parts are not in the correct positions and predict correctly, as in fig. 4.

Fig. 3. The predictions from CNN.[1]

Fig.4. The predictions from CapsuleNets.[1]

Problem 2: CNNs require a lot of data to generalize

A CNN has a large number of filter weights to train, so a large amount of data is necessary for it to generalize.

Problem 3: Pooling layers discard valuable information

Pooling layers reduce the information passed to the next layer. They also add invariance to the model by discarding spatial information. CapsuleNet instead calculates a pose to establish the relationship between smaller and larger features.
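To make the information loss concrete, here is a small sketch, assuming plain 2x2 max pooling with stride 2 and two made-up 4x4 feature maps. The maps place their activations at different positions inside each pooling window, yet they pool to exactly the same output, so the next layer cannot distinguish them.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a 2D array with even side lengths."""
    h, w = x.shape
    # Split into (h/2, 2, w/2, 2) blocks and take the max inside each block.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Two made-up feature maps: same values, different positions within windows.
a = np.array([[9, 0, 0, 0],
              [0, 0, 0, 7],
              [0, 0, 5, 0],
              [0, 3, 0, 0]])
b = np.array([[0, 0, 0, 7],
              [0, 9, 0, 0],
              [0, 3, 0, 0],
              [0, 0, 0, 5]])

pooled_a = max_pool_2x2(a)
pooled_b = max_pool_2x2(b)

print(pooled_a)
print(np.array_equal(pooled_a, pooled_b))  # True: the positions are gone
```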

How CapsuleNet addresses the issue

The approach behind CapsuleNet can be described as ‘Inverse Graphics’: it can be thought of as the inverse of how computer graphics are rendered. The normal flow is shown in fig. 5, where a mesh object is converted into pixels on a screen.

Fig.5. Simplified Computer Graphics Rendering Process.[1]

So, by inverse we mean that we want to get the approximate pose of the whole object by multiplying the 2D image by the inverse of the transform matrix. Thus the network can learn how to inversely render an image: it can look at an image and try to predict the instantiation parameters for it.
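The idea can be sketched with ordinary 2D homogeneous coordinates. This is only an illustration under simplifying assumptions: the pose is a known rotation plus translation, and the helper `pose_matrix` and the triangle are made up for the example. A capsule network has to estimate the pose from pixels instead of being handed the matrix.

```python
import numpy as np

def pose_matrix(angle_rad, tx, ty):
    """3x3 homogeneous transform: rotate by angle_rad, then translate by (tx, ty)."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0.0, 0.0, 1.0]])

# Canonical object: three corner points of a triangle, one point per column,
# written in homogeneous coordinates (last row is all ones).
canonical = np.array([[0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0],
                      [1.0, 1.0, 1.0]])

pose = pose_matrix(np.pi / 6, tx=2.0, ty=-1.0)

# Graphics direction: object + pose -> points on the "screen".
rendered = pose @ canonical

# Inverse graphics direction: screen points + inverse pose -> canonical object.
recovered = np.linalg.inv(pose) @ rendered

print(np.allclose(recovered, canonical))  # True
```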

This requires one more component in the standard loss function: an additional reconstruction loss is added to encourage the capsules to encode the instantiation parameters of the input digit.
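As a rough sketch of how the two loss components combine, the NumPy code below follows the formulation in the paper [2], not anything specific to this article: a margin loss on the capsule lengths (with m+ = 0.9, m- = 0.1 and a down-weighting of 0.5 for absent classes) plus a sum-of-squared-differences reconstruction loss scaled by 0.0005. The function names and array shapes are assumptions made for the example.

```python
import numpy as np

def margin_loss(lengths, targets, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Per-example margin loss.

    lengths: (batch, classes) lengths of the output capsule vectors.
    targets: (batch, classes) one-hot labels.
    """
    present = targets * np.maximum(0.0, m_pos - lengths) ** 2
    absent = lam * (1.0 - targets) * np.maximum(0.0, lengths - m_neg) ** 2
    return np.sum(present + absent, axis=1)

def reconstruction_loss(images, reconstructions):
    """Per-example sum of squared pixel differences."""
    n = len(images)
    diff = images.reshape(n, -1) - reconstructions.reshape(n, -1)
    return np.sum(diff ** 2, axis=1)

def total_loss(lengths, targets, images, reconstructions, recon_weight=0.0005):
    """Margin loss plus a small reconstruction term, averaged over the batch."""
    per_example = (margin_loss(lengths, targets)
                   + recon_weight * reconstruction_loss(images, reconstructions))
    return float(per_example.mean())

# Tiny usage example with made-up numbers: one 2x2 "image", two classes.
lengths = np.array([[0.5, 0.5]])
targets = np.array([[1.0, 0.0]])
images = np.zeros((1, 2, 2))
reconstructions = np.ones((1, 2, 2))
loss = total_loss(lengths, targets, images, reconstructions)
```

The small weight keeps the reconstruction term from dominating the classification term during training.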

Dynamic Routing

Routing is the process of passing information to another layer. Currently, CNNs perform routing via pooling layers, mostly max pooling layers, which causes information loss.

In CapsuleNet, each capsule tries to send its information to the capsule in the layer above that is best at dealing with it (fig. 6). As the paper [2] puts it: “Using an iterative routing process, each active capsule will choose a capsule in the layer above to be its parent in the tree. For the higher levels of a visual system, this iterative process will be solving the problem of assigning parts to wholes.”

They use a coupling strength c between each capsule in layer L and each capsule in layer L+1. So, while normal forward propagation passes information using standard weights, given by z = W*a, with the coupling strength it can be represented as z = c*W*a (with c < 1).

Fig. 6. Dynamic Routing
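The routing idea described above can be sketched in NumPy. This is a minimal, unbatched sketch of routing-by-agreement as described in the paper [2], assuming the prediction vectors W*a have already been computed; the names `u_hat`, `squash` and `dynamic_routing` and the choice of three iterations are assumptions for the example, and training of W is left out entirely.

```python
import numpy as np

def squash(s, eps=1e-9):
    """Shrink each vector's length into [0, 1) while keeping its direction."""
    sq_norm = np.sum(s ** 2, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_routing(u_hat, num_iterations=3):
    """Route lower-level capsules to upper-level capsules.

    u_hat: (num_lower, num_upper, dim) predictions W*a made by each lower
           capsule for each upper capsule.
    Returns the upper capsule outputs v (num_upper, dim) and the coupling
    strengths c (num_lower, num_upper) used to compute them.
    """
    num_lower, num_upper, _ = u_hat.shape
    b = np.zeros((num_lower, num_upper))  # routing logits, start uniform
    for _ in range(num_iterations):
        # Each lower capsule splits its output across the upper capsules.
        c = softmax(b, axis=1)
        # z = c * W * a, summed over all lower capsules.
        s = np.sum(c[:, :, None] * u_hat, axis=0)
        v = squash(s)
        # Agreement: raise the logit where prediction and output align.
        b = b + np.sum(u_hat * v[None, :, :], axis=-1)
    return v, c

# Usage with random predictions: 6 lower capsules, 3 upper capsules, dim 4.
rng = np.random.default_rng(0)
u_hat = rng.normal(size=(6, 3, 4))
v, c = dynamic_routing(u_hat)
```

Because the coupling strengths come from a softmax, each row of c sums to 1, which is why every individual c is below 1 whenever there is more than one capsule in the layer above.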

Fig. 7 below effectively displays the difference between a traditional neuron and a capsule. The main difference is that a capsule is vector in, vector out, whereas a neuron is scalar in, scalar out.

Fig. 7. Difference between the capsule and neuron.[3]
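The contrast can also be written out side by side. The sketch below is an illustration with assumed shapes and names (`neuron`, `capsule`): the neuron maps scalars to one scalar through a weighted sum and a ReLU, while the capsule maps input vectors to one output vector through a matrix transform, a coupling-weighted sum and a squashing non-linearity, with the coupling strengths taken as given.

```python
import numpy as np

def neuron(x, w, b):
    """Scalar in, scalar out: ReLU of a weighted sum of scalar inputs."""
    return max(0.0, float(np.dot(w, x) + b))

def capsule(u, W, c, eps=1e-9):
    """Vector in, vector out.

    u: (n, d_in) input vectors from n lower capsules.
    W: (n, d_out, d_in) one transform matrix per input capsule.
    c: (n,) coupling strengths (assumed given here).
    """
    u_hat = np.einsum('noi,ni->no', W, u)   # transform each input vector
    s = np.sum(c[:, None] * u_hat, axis=0)  # weighted sum of vectors
    sq_norm = np.dot(s, s)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)  # squash

# Usage with made-up numbers.
a = neuron(np.array([1.0, 2.0]), np.array([0.5, 1.0]), b=0.1)

rng = np.random.default_rng(0)
u = rng.normal(size=(3, 4))
W = rng.normal(size=(3, 5, 4))
v = capsule(u, W, c=np.array([0.2, 0.3, 0.5]))
```

Here `a` is a single number, while `v` is a 5-dimensional vector whose length stays below 1 and whose direction can carry properties such as pose.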

Pros and Cons[6]:

Pros:

1. Good performance is achieved on smaller datasets, e.g. MNIST

2. Easier to interpret more robust images

3. Keeps all the information (pose, texture, location, etc.)

Cons:

1. Doesn’t perform as well on larger datasets (e.g. CIFAR10). The extra information in the images overwhelms the network.

2. Routing-by-agreement algorithm requires more time to compute


  • CapsuleNets maintain the majority of the information in an image.
  • The dynamic routing algorithm makes sure that each capsule is connected to its best-matching parent, thus reducing noise in the network.
  • CapsuleNets are still in the development phase.
  • They don’t have great performance on large datasets.