Review: Dynamic Routing Between Capsules

Original article was published by Vinh Quang Tran on Deep Learning on Medium

Link to paper:

The paper introduces an implementation of capsule networks that uses an iterative routing-by-agreement mechanism: a lower-level capsule prefers to send its output to higher-level capsules whose activity vectors have a large scalar product with the prediction coming from the lower-level capsule.


The human visual system uses a sequence of fixation points to ensure that only a tiny fraction of the optic array is processed at the highest resolution. For a single fixation, a parse tree is carved out of a fixed multilayer network whose layers are divided into small groups of neurons called “capsules”, and each node in the parse tree corresponds to an active capsule. Through an iterative process, each active capsule chooses a capsule in the layer above to be its parent in the tree. This process solves the problem of assigning parts to wholes.

For the activity vector of each active capsule:

  • Its length represents the probability that the entity exists in the image.
  • Its orientation represents the entity's instantiation parameters, such as pose (position, size, orientation), deformation, velocity, albedo, hue, texture, etc.


Since the output of a capsule is a vector, it is possible to use a powerful dynamic routing mechanism to ensure that the output is sent to an appropriate parent. For each possible parent, the capsule computes a “prediction vector” by multiplying its own output by a weight matrix. If this prediction vector has a large scalar product with the output of a possible parent, the coupling coefficient for that parent is increased and the coefficients for the other parents are decreased. This increases the contribution the capsule makes to that parent, which further increases the scalar product of the capsule's prediction with the parent's output. The authors argue that this is far more effective than max-pooling, which lets neurons in one layer attend only to the most active feature detector in a local pool in the layer below. Also, unlike max-pooling, capsules do not throw away information about the precise location of the entity or its pose.

Calculating vector inputs and outputs of a capsule:

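Squash function: $\mathbf{v}_j = \frac{\|\mathbf{s}_j\|^2}{1 + \|\mathbf{s}_j\|^2} \frac{\mathbf{s}_j}{\|\mathbf{s}_j\|}$, where v_j is the vector output of capsule j and s_j is its total input.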

Because the length of the activity vector represents the probability that an entity exists in the image, it has to lie between 0 and 1. The squash function ensures that short vectors are shrunk to almost zero length and long vectors are shrunk to a length slightly below 1.
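The shrinking behaviour described above can be checked numerically. The following is a minimal NumPy sketch of the squash function; the function signature and the small `eps` term (added to avoid dividing by zero) are assumptions made for illustration, not details from the paper.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Shrink vector(s) s to a length in [0, 1) while keeping the direction."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    scale = sq_norm / (1.0 + sq_norm)           # target length of the output
    return scale * s / np.sqrt(sq_norm + eps)   # eps is an assumed safety term

short = squash(np.array([0.1, 0.0]))    # input length 0.1
long_ = squash(np.array([10.0, 0.0]))   # input length 10

print(np.linalg.norm(short))  # roughly 0.0099, close to 0
print(np.linalg.norm(long_))  # roughly 0.9901, slightly below 1
```

A vector of length 0.1 ends up with length about 0.01, while a vector of length 10 ends up with length about 0.99, and the direction is unchanged in both cases.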

Except for the first layer of capsules, the total input to a capsule is a weighted sum over all prediction vectors from the capsules in the previous layer.

Total input to a capsule: $\mathbf{s}_j = \sum_i c_{ij} \hat{\mathbf{u}}_{j|i}$, where c_ij are the coupling coefficients.

These prediction vectors are produced by multiplying the output of a capsule in the layer below by a weight matrix.

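Prediction vectors: $\hat{\mathbf{u}}_{j|i} = \mathbf{W}_{ij} \mathbf{u}_i$, where u_i is the output of capsule i in the layer below and W_ij is the weight matrix.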
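To make the shapes concrete, here is a small NumPy sketch of how prediction vectors and the weighted sum could be computed for one layer. The layer sizes, the random inputs, and the uniform coupling coefficients are all hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, d_in, d_out = 6, 3, 8, 4   # hypothetical sizes for illustration

u = rng.normal(size=(n_in, d_in))                 # outputs of the lower-level capsules
W = rng.normal(size=(n_in, n_out, d_out, d_in))   # one weight matrix per (i, j) pair

# Prediction vectors: u_hat[i, j] = W[i, j] @ u[i]
u_hat = np.einsum('ijkl,il->ijk', W, u)           # shape (n_in, n_out, d_out)

# Total input to each parent j, here with uniform coupling coefficients
c = np.full((n_in, n_out), 1.0 / n_out)
s = np.einsum('ij,ijk->jk', c, u_hat)             # shape (n_out, d_out)

print(u_hat.shape, s.shape)  # (6, 3, 4) (3, 4)
```

Each lower-level capsule produces a separate prediction for every possible parent, so the prediction tensor has one vector per (child, parent) pair.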

The coupling coefficients c_ij are determined by the iterative dynamic routing process. The coefficients between capsule i and all the capsules in the layer above sum to 1 and are determined by a softmax function whose initial logits b_ij are the log prior probabilities that capsule i should be coupled to capsule j.

Coupling coefficients: $c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}$.
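As a quick illustration, the softmax over parents can be sketched as below. The function name `coupling_coefficients`, the example sizes, and the zero initial logits are assumptions for this sketch; the max-subtraction is a standard numerical-stability trick that does not change the result.

```python
import numpy as np

def coupling_coefficients(b):
    """Softmax over the parent axis (axis 1) of the logits b[i, j]."""
    b = b - b.max(axis=1, keepdims=True)   # stability shift, result unchanged
    e = np.exp(b)
    return e / e.sum(axis=1, keepdims=True)

b = np.zeros((4, 3))                 # 4 lower capsules, 3 parents, no preference
c_uniform = coupling_coefficients(b)
print(c_uniform[0])                  # each entry equals 1/3

b[0, 2] = 2.0                        # capsule 0 now favours parent 2
c_biased = coupling_coefficients(b)
print(c_biased[0], c_biased[0].sum())  # largest weight on parent 2, sum is 1
```

Whatever the logits are, each lower-level capsule distributes a total weight of 1 across its possible parents.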

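The logits b_ij are then iteratively refined by adding the scalar product between the parent's output and the prediction vector: $b_{ij} \leftarrow b_{ij} + \hat{\mathbf{u}}_{j|i} \cdot \mathbf{v}_j$.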

Routing algorithm for CapsNet
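The routing procedure can be sketched in NumPy as follows. This is an illustrative sketch rather than the authors' code: the function names, the `eps` term, the zero-initialized logits, and the toy prediction vectors are assumptions, and three iterations is used as the default because that is the setting reported in the paper.

```python
import numpy as np

def squash(s, eps=1e-8):
    sq_norm = np.sum(s ** 2, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def softmax(b, axis):
    e = np.exp(b - b.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def route(u_hat, num_iters=3):
    """Routing by agreement on predictions u_hat[i, j, :] (child i, parent j)."""
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))                    # initial logits
    for _ in range(num_iters):
        c = softmax(b, axis=1)                     # coupling coefficients
        s = np.einsum('ij,ijk->jk', c, u_hat)      # total input per parent
        v = squash(s)                              # parent outputs
        b = b + np.einsum('ijk,jk->ij', u_hat, v)  # agreement update
    return v, c

# Toy case: all four children agree on parent 0 and disagree on parent 1.
u_hat = np.zeros((4, 2, 2))
u_hat[:, 0, :] = [1.0, 0.0]
u_hat[:, 1, :] = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]

v, c = route(u_hat)
print(c[:, 0])                    # coupling to parent 0 grows above 0.8
print(np.linalg.norm(v, axis=1))  # parent 0 is long, parent 1 stays near 0
```

In the toy case the predictions for parent 1 cancel each other out, so its output stays at zero length, while the agreeing predictions for parent 0 reinforce each other over the iterations.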

Margin loss for digit existence:

The top-level capsule for an object class should have a long instantiation vector if and only if that object is present in the image. To allow for multiple classes in one image, the authors use a separate margin loss for each capsule:

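Margin loss for each capsule k: $L_k = T_k \max(0, m^+ - \|\mathbf{v}_k\|)^2 + \lambda (1 - T_k) \max(0, \|\mathbf{v}_k\| - m^-)^2$, where T_k = 1 if an object of class k is present, m+ = 0.9, m- = 0.1, and λ = 0.5.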

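This pushes the length of capsule k's output vector to be at least 0.9 if an object of class k is present and at most 0.1 if it is absent. The factor λ down-weights the loss for absent classes so that early training does not shrink the activity vectors of all the capsules.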

The total loss is the sum of the losses of all object capsules.
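A minimal sketch of the margin loss for a single image is shown below, using the constants quoted above. The function name, argument layout, and example capsule lengths are made up for illustration.

```python
import numpy as np

def margin_loss(lengths, targets, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Sum of per-class margin losses for one image.

    lengths: capsule output lengths ||v_k||, shape (num_classes,)
    targets: T_k, 1.0 where class k is present and 0.0 otherwise
    """
    present = targets * np.maximum(0.0, m_pos - lengths) ** 2
    absent = lam * (1.0 - targets) * np.maximum(0.0, lengths - m_neg) ** 2
    return np.sum(present + absent)

targets = np.array([1.0, 0.0, 0.0])
good = margin_loss(np.array([0.95, 0.05, 0.05]), targets)
bad = margin_loss(np.array([0.95, 0.05, 0.30]), targets)
print(good)  # 0.0: every length is on the right side of its margin
print(bad)   # about 0.02: class 2 is absent but its capsule has length 0.3
```

In the second case only the third capsule is penalized: it exceeds the 0.1 margin by 0.2, giving 0.5 × 0.2² = 0.02.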

CapsNet architecture for MNIST

A simple CapsNet architecture consists of two convolutional layers and one fully connected layer.
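The layer sizes can be worked out with a few lines of arithmetic. The hyperparameters below (28×28 input, 9×9 kernels, stride 1 then stride 2, 32 channels of 8D primary capsules, ten 16D digit capsules) are those reported in the paper rather than stated in this review, and the helper `conv_out` is a hypothetical name; it assumes no padding.

```python
def conv_out(size, kernel, stride):
    """Output width of a convolution without padding."""
    return (size - kernel) // stride + 1

conv1 = conv_out(28, kernel=9, stride=1)        # first convolutional layer
primary = conv_out(conv1, kernel=9, stride=2)   # PrimaryCaps grid width
n_primary = primary * primary * 32              # number of 8D primary capsules
n_routing_weights = n_primary * 10 * 8 * 16     # one 16x8 matrix per (i, j) pair

print(conv1, primary, n_primary, n_routing_weights)  # 20 6 1152 1474560
```

Under these assumptions, 1152 primary capsules each make a prediction for all 10 digit capsules, which is where most of the routing weights sit.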

CapsNet achieved state-of-the-art performance on MNIST after just a few training epochs. After training for about 6-7 epochs with this implementation, CapsNet reached about 99% accuracy on the test set. Further training brought only negligible improvement.

Regularization by reconstruction

Decoder structure to reconstruct digit from DigitCaps layer.

The authors use a reconstruction loss to encourage the digit capsules to encode the instantiation parameters of the input digit. The decoder learns to reconstruct the image by minimizing the sum of squared differences between the reconstructed image and the input image. The total loss is the sum of the margin loss and the reconstruction loss, with the reconstruction loss scaled down by a factor of 0.0005 so that it does not dominate the margin loss during training.
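The effect of the scaling factor is easy to see with a small example. In this sketch the function name, the flattened 784-pixel image, and the example values are assumptions for illustration; only the 0.0005 weight comes from the text.

```python
import numpy as np

def total_loss(margin, image, reconstruction, recon_weight=0.0005):
    """Margin loss plus the down-weighted sum of squared pixel differences."""
    recon = np.sum((reconstruction - image) ** 2)
    return margin + recon_weight * recon

image = np.zeros(784)                 # flattened 28x28 input (hypothetical)
reconstruction = np.full(784, 0.1)    # every pixel off by 0.1
loss = total_loss(0.02, image, reconstruction)
print(loss)  # about 0.02392: 0.02 margin + 0.0005 * 7.84 reconstruction
```

Without the weight, the reconstruction term here would be 7.84 and would swamp a margin loss of 0.02; with it, the term contributes about 0.004.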

An example of image reconstruction.


On datasets whose backgrounds are highly varied (such as CIFAR-10), CapsNet performs poorly compared to other state-of-the-art architectures.