Original article was published by Vinh Quang Tran on Deep Learning on Medium

# Review: Dynamic Routing Between Capsules

Link to paper: https://arxiv.org/pdf/1710.09829.pdf

The paper introduced an implementation of Capsule Networks that uses an iterative routing-by-agreement mechanism: a lower-level capsule prefers to send its output to higher-level capsules whose activity vectors have a big scalar product with the prediction coming from the lower-level capsule.

**Motivation**:

The human visual system uses a sequence of fixation points to ensure that only a tiny fraction of the optic array is processed at the highest resolution. For a single fixation, a parse tree is carved out of a fixed network whose layers are divided into small groups of neurons called “capsules”, and each node in the parse tree corresponds to an active capsule. Through an iterative process, each active capsule chooses a capsule in the layer above to be its parent. This process solves the problem of assigning parts to wholes.

For the activity vector of each active capsule:

- Its length represents the probability that an entity exists in the image.
- Its orientation represents the entity’s instantiation parameters, such as pose (position, size, orientation), deformation, velocity, albedo, hue, and texture.

**Idea**:

Since the output of a capsule is a vector, a powerful dynamic routing mechanism can be used to ensure that the output is sent to an appropriate parent. For each possible parent, the capsule computes a “prediction vector” by multiplying its own output by a weight matrix. If this prediction vector has a large scalar product with the output of a possible parent, the coupling coefficient for that parent is increased and the coefficients for the other parents are decreased. This increases the contribution the capsule makes to that parent, which in turn further increases the scalar product of the capsule’s prediction with the parent’s output. This is much more effective than max-pooling, which lets neurons in one layer attend only to the most active feature detector in the previous layer. Also, unlike max-pooling, capsules don’t throw away information about the precise location of the entity or its pose.

**Calculating vector inputs and outputs of a capsule**:

Because the length of the activity vector represents the probability that an entity exists in the image, it has to be between 0 and 1. A non-linear “squashing” function ensures that short vectors are shrunk to almost zero length and long vectors to a length slightly below 1, while the direction of the vector is left unchanged.
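
As a minimal sketch, the squashing non-linearity from the paper, v = (||s||² / (1 + ||s||²)) · (s / ||s||), can be written in plain Python. The helper names `squash` and `length` and the small `eps` guard against division by zero are assumptions made for this illustration, not something the paper prescribes.

```python
import math

def length(v):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def squash(s, eps=1e-9):
    """Rescale s so its length lies in [0, 1) while keeping its direction.

    eps is an illustrative safeguard so a zero vector does not divide by zero.
    """
    sq_norm = sum(x * x for x in s)
    scale = sq_norm / (1.0 + sq_norm) / (math.sqrt(sq_norm) + eps)
    return [scale * x for x in s]

short = squash([0.01, 0.0])   # very short input vector
long_ = squash([30.0, 40.0])  # long input vector (length 50)

print(length(short))  # close to 0
print(length(long_))  # slightly below 1
```

The short vector ends up with a length near zero and the long one with a length just under 1, which is what lets the length be read as a probability.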

For all but the first layer of capsules, the total input to a capsule is a weighted sum over all prediction vectors from the capsules in the layer below, with the coupling coefficients c_ij as the weights.

These prediction vectors are produced by multiplying the output of a capsule in the layer below by a weight matrix.
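
The two steps above can be sketched in plain Python as follows. The nested-list layout (`W[i][j]` is the weight matrix from lower capsule `i` to upper capsule `j`), the function names, and the toy numbers are all assumptions chosen for illustration; a real implementation would use a tensor library.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def prediction_vectors(W, u):
    """u_hat[i][j] = W[i][j] @ u[i]: what lower capsule i predicts for parent j."""
    return [[mat_vec(W_ij, u_i) for W_ij in W_i] for W_i, u_i in zip(W, u)]

def total_input(c, u_hat, j):
    """s_j = sum over i of c[i][j] * u_hat[i][j]."""
    dim = len(u_hat[0][j])
    return [sum(c[i][j] * u_hat[i][j][d] for i in range(len(u_hat)))
            for d in range(dim)]

# Toy setup: 2 lower capsules, 2 upper capsules, all vectors 2-dimensional.
u = [[1.0, 0.0], [0.0, 2.0]]
W = [
    [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]],  # from capsule 0
    [[[2.0, 0.0], [0.0, 2.0]], [[1.0, 1.0], [0.0, 0.0]]],  # from capsule 1
]
c = [[0.5, 0.5], [0.5, 0.5]]  # uniform coupling coefficients

u_hat = prediction_vectors(W, u)
s0 = total_input(c, u_hat, 0)
s1 = total_input(c, u_hat, 1)
print(s0, s1)
```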

The coupling coefficients c_ij are determined by the iterative dynamic routing process. The coefficients between capsule i and all the capsules in the layer above sum to 1; they are computed by a softmax over logits b_ij, whose initial values are the log prior probabilities that capsule i should be coupled to capsule j.

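The logits b_ij are then updated in each routing iteration by adding the scalar product of the prediction vector û_(j|i) and the output v_j of capsule j: b_ij ← b_ij + û_(j|i) · v_j.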
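
A minimal sketch of the whole routing loop in plain Python is given below. It starts from logits of zero (a uniform prior, as in the paper's routing procedure) and runs three iterations. The function names, the nested-list layout of `u_hat`, and the toy prediction vectors are assumptions made for this example.

```python
import math

def vec_length(v):
    return math.sqrt(sum(x * x for x in v))

def squash(s, eps=1e-9):
    """Shrink s to a length in [0, 1) without changing its direction."""
    sq_norm = sum(x * x for x in s)
    scale = sq_norm / (1.0 + sq_norm) / (math.sqrt(sq_norm) + eps)
    return [scale * x for x in s]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(b - m) for b in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(u_hat, num_iterations=3):
    """u_hat[i][j] is the prediction of lower capsule i for upper capsule j."""
    n_lower, n_upper, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_upper for _ in range(n_lower)]  # routing logits
    for _ in range(num_iterations):
        c = [softmax(b_i) for b_i in b]  # each row of c sums to 1
        v = []
        for j in range(n_upper):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_lower))
                   for d in range(dim)]
            v.append(squash(s_j))
        for i in range(n_lower):
            for j in range(n_upper):
                # agreement: scalar product of prediction and parent output
                b[i][j] += sum(p * q for p, q in zip(u_hat[i][j], v[j]))
    return v, c

# Both lower capsules agree on parent 0 and contradict each other on parent 1.
u_hat = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, -1.0]],
]
v, c = route(u_hat)
print(c)  # coupling shifts toward parent 0
print([vec_length(v_j) for v_j in v])
```

In this toy case the predictions for parent 0 agree, so its output grows and the coupling coefficients toward it rise above 0.5, while the conflicting predictions for parent 1 cancel out.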

**Margin loss for digit existence**:

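The top-level capsule for an object class should have a long instantiation vector if and only if that object is present in the image. To allow for multiple classes in one image, the authors use a separate margin loss L_k for each class capsule k:

L_k = T_k · max(0, m+ − ||v_k||)² + λ · (1 − T_k) · max(0, ||v_k|| − m−)²

where T_k = 1 if an object of class k is present and 0 otherwise, m+ = 0.9, m− = 0.1, and λ = 0.5 down-weights the loss for absent classes.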

This pushes the length of capsule k's output vector to be at least 0.9 if an object of class k is present, and at most 0.1 if it is not.

The total loss is the sum of the losses of all object capsules.
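
As a small sketch of the margin loss in plain Python, using the paper's constants (m+ = 0.9, m− = 0.1, and a down-weighting factor λ = 0.5 for absent classes). The function name, the `present` set argument, and the example lengths are assumptions made for illustration.

```python
def margin_loss(lengths, present, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Sum of per-class margin losses.

    lengths[k] is the length of the output vector of class capsule k;
    present is the set of class indices that appear in the image.
    """
    total = 0.0
    for k, v_len in enumerate(lengths):
        if k in present:
            # present class: penalise lengths below m_plus
            total += max(0.0, m_plus - v_len) ** 2
        else:
            # absent class: penalise lengths above m_minus, down-weighted
            total += lam * max(0.0, v_len - m_minus) ** 2
    return total

confident = margin_loss([0.95, 0.05, 0.05], present={0})
confused = margin_loss([0.95, 0.05, 0.30], present={0})
print(confident, confused)
```

The first call incurs no loss because every length is on the right side of its margin; in the second, the absent class 2 has a length of 0.30 and contributes 0.5 · (0.30 − 0.10)².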

**CapsNet architecture for MNIST**

CapsNet achieved state-of-the-art performance on MNIST after just a few training epochs. After training for about 6-7 epochs with this implementation, CapsNet was able to achieve about 99% accuracy on the test set. Further training brought only negligible improvement.

**Regularization by reconstruction**

The authors used a reconstruction loss to encourage the digit capsules to encode the instantiation parameters of the input digit. A decoder learns to reconstruct the image by minimizing the sum of squared differences between the reconstructed image and the input image. The total loss is the sum of the margin loss (computed from the L2 norms of the capsule outputs) and the reconstruction loss. However, to prevent the reconstruction loss from dominating the margin loss, it is scaled down by a factor of 0.0005.
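
The way the two terms are combined can be sketched in plain Python as follows. The function names, the flat-list image representation, and the toy values are assumptions made for this example; only the 0.0005 scaling factor and the sum of squared differences come from the paper.

```python
def reconstruction_loss(reconstruction, image):
    """Sum of squared differences between two flattened images."""
    return sum((r - x) ** 2 for r, x in zip(reconstruction, image))

def total_loss(margin, reconstruction, image, recon_weight=0.0005):
    """Margin loss plus the scaled-down reconstruction loss."""
    return margin + recon_weight * reconstruction_loss(reconstruction, image)

image = [0.0, 1.0, 1.0, 0.0]           # toy 4-pixel "image"
reconstruction = [0.0, 0.5, 1.0, 0.5]  # imperfect decoder output
loss = total_loss(0.02, reconstruction, image)
print(loss)
```

With these toy values the squared error is 0.5, so the reconstruction term adds only 0.00025 to a margin loss of 0.02, which shows how strongly the scaling keeps it from dominating.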

**Drawbacks**:

On datasets whose backgrounds are much too varied (like CIFAR-10), CapsNet performs poorly compared to other state-of-the-art architectures.