Three instructive and complementary Capsule Network tutorials

Medium hosts a number of excellent capsule network tutorials. Here are three complementary posts that make for a rewarding reading experience:

1) “Understanding Hinton’s Capsule Networks”

Max Pechyonkin has published the first three parts in his excellent and highly popular series on capsule networks.

Part 1 begins with the example of randomly assembled face parts to illustrate how convolutional neural networks fail to take into account hierarchies of object parts. The author then proceeds with an explanation of inverse graphics: computer graphics generates images from abstract descriptions of objects. Capsule networks attempt to perform the reverse process and map visual information to a hierarchical representation of objects. A brief discussion of why it took decades to implement this new architecture rounds off the first part of the tutorial.

Part 2 describes the building block of capsule networks. Capsules predict the presence of an entity and its pose in vectorial form. While the length of the vector is interpreted as the probability of the presence of the entity, the orientation corresponds to the pose. The author compares CapsNets with traditional feed-forward networks and explains how three out of four computational steps in the former have analogues in the latter. The section on the Matrix Multiplication of Input Vectors is particularly helpful as it explains how the relationship between objects (e.g., a face) and their parts (e.g., a mouth, a nose, etc.) is encoded mathematically.
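To make the matrix multiplication step concrete, here is a minimal NumPy sketch of my own (it is not taken from the tutorial). The dimensions 8 and 16, the random values, and the names `u_i`, `W_ij` and `presence_and_pose` are illustrative assumptions. Note that the length of a capsule vector can only be read as a probability once it has been squashed to lie between 0 and 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical output of a lower-level capsule i (e.g. a "mouth"), 8D.
u_i = rng.normal(size=8)

# Hypothetical learned matrix relating part i to whole j (e.g. a "face").
# It maps the 8D part vector to a 16D prediction for the higher-level capsule.
W_ij = rng.normal(size=(16, 8))

# Prediction vector: what capsule i expects capsule j's output to be.
u_hat = W_ij @ u_i


def presence_and_pose(v):
    """Split a capsule vector into its length and its unit direction."""
    length = np.linalg.norm(v)
    return length, v / length


length, direction = presence_and_pose(u_hat)
print(u_hat.shape, length)
```

The direction carries the pose information, while the length (after squashing) stands in for how confident the capsule is that the entity is present.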

Part 3 describes the dynamic routing algorithm employed in CapsNets to decide how to send a capsule’s output to the relevant higher-level capsules. The short but information-packed pseudocode from the original paper is explained step by step. Handwritten figures support the text throughout the tutorial.
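For readers who prefer code to pseudocode, the routing procedure can be sketched in a few lines of NumPy. This is my own simplified reading of the routing-by-agreement loop in Sabour et al., not the tutorial’s code: the function names, the default of three iterations, and the small `eps` added for numerical stability are assumptions.

```python
import numpy as np


def squash(s, eps=1e-9):
    """Shrink vectors to a length in [0, 1) while keeping their direction."""
    sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)


def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def dynamic_routing(u_hat, num_iterations=3):
    """Route prediction vectors u_hat of shape (num_lower, num_upper, dim)."""
    num_lower, num_upper, _ = u_hat.shape
    b = np.zeros((num_lower, num_upper))           # routing logits
    for _ in range(num_iterations):
        c = softmax(b, axis=1)                     # coupling coefficients
        s = np.einsum("ij,ijd->jd", c, u_hat)      # weighted sum per upper capsule
        v = squash(s)                              # upper-capsule outputs
        b = b + np.einsum("ijd,jd->ij", u_hat, v)  # reward agreement
    return v, c


rng = np.random.default_rng(0)
u_hat = rng.normal(size=(6, 3, 4))   # 6 lower capsules, 3 upper capsules, 4D
v, c = dynamic_routing(u_hat)
print(v.shape, c.sum(axis=1))
```

Each lower-level capsule distributes its output over the higher-level capsules (the rows of `c` sum to 1), and predictions that agree with the current output of a higher-level capsule receive a larger share in the next iteration.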

2) “Uncovering the Intuition behind Capsule Networks and Inverse Graphics”

The first part of Tanay Kothari’s long-form tutorial explains the Sabour et al. paper on capsule networks from 2017. I would recommend reading this complementary post once you have gone through Max Pechyonkin’s series. It emphasizes different aspects and is written in a different style.

The tutorial includes a discussion of the difference between invariance and equivariance. To the extent that a computer vision system is translationally invariant, a translation of the input does not result in a change to the output. A cat is a cat, no matter whether it is positioned on the left side or the right side of an image. In a translationally equivariant system, a translation of the input leads to an equivalent change to the output. The author suggests that computer vision needs to move beyond translational invariance and achieve viewpoint invariance to deal with real-life 3D images.
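The distinction is easy to check numerically. The toy example below is my own and uses a 1D signal rather than an image; the `smooth` function (a circular three-tap sum) is a hypothetical stand-in for a convolutional layer, and the global maximum stands in for an invariant readout.

```python
import numpy as np

signal = np.array([0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0])
shifted = np.roll(signal, 2)          # translate the input by two positions


def smooth(x):
    """Circular 3-tap moving sum, a simple translation-equivariant operation."""
    return np.roll(x, 1) + x + np.roll(x, -1)


# Invariance: the global maximum ignores where the peak is located.
invariant = signal.max() == shifted.max()

# Equivariance: shifting the input shifts the output by the same amount.
equivariant = np.array_equal(smooth(shifted), np.roll(smooth(signal), 2))

print(invariant, equivariant)
```

The invariant readout discards the position altogether, whereas the equivariant operation preserves it in a predictable way.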

Max-pooling can be thought of as a crude form of routing. Section 4 uses an example of a 2×2 kernel to forcefully point out just how crude it is: “If MaxPool was a messenger between the two layers, what it tells the second layer is that ‘We saw a high-6 somewhere in the top left corner and a high-8 somewhere in the top right corner.’”
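A short sketch shows what is lost. The two 4×4 inputs below are made up for illustration (they are not the numbers from the tutorial), and `max_pool_2x2` is a hypothetical helper: the 6 and the 8 sit in different positions within their 2×2 blocks, yet the pooled outputs are identical.

```python
import numpy as np


def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling on a 2D array with even side lengths."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))


a = np.array([[6, 1, 2, 8],
              [0, 2, 1, 3],
              [1, 0, 4, 2],
              [3, 2, 1, 0]])

b = np.array([[1, 0, 8, 1],
              [2, 6, 3, 2],
              [3, 1, 0, 2],
              [0, 2, 4, 1]])

pooled_a = max_pool_2x2(a)
pooled_b = max_pool_2x2(b)
print(pooled_a)
print(np.array_equal(pooled_a, pooled_b))
```

The next layer only learns that a 6 and an 8 occurred somewhere in the top two blocks; where exactly they occurred is no longer recoverable.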

One way in which this tutorial complements Max Pechyonkin’s series is through its introduction to pose matrices. A pose matrix encodes the translation, scale and rotation of an object in a 4×4 matrix (or, alternatively, as a 3×4 matrix, with the last row left out). Lower-level parts of an object are related to higher-level parts through these pose matrices. This is memorably illustrated using a tree in which the nodes correspond to the body parts of Mr. Bean.

When presented with the pose for the mouth, you can estimate the pose for the entire face. The same holds for other relationships between parts. Knowing the pose for the left ear too provides clues about the pose for the face. The tutorial explains how incorporating these relationships between parts makes the neural network more robust to distortions in images.
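The relationship between part poses and the pose of the whole can be sketched with homogeneous 4×4 matrices. This is my own illustration under simplifying assumptions: rotation is restricted to the z-axis, all numbers are made up, and `pose`, `mouth_in_face` and `ear_in_face` are hypothetical names for the fixed part-whole relationships that a network would have to learn.

```python
import numpy as np


def pose(tx, ty, tz, scale=1.0, angle_z=0.0):
    """Build a 4x4 pose matrix from translation, uniform scale and z-rotation."""
    c, s = np.cos(angle_z), np.sin(angle_z)
    m = np.eye(4)
    m[:3, :3] = scale * np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    m[:3, 3] = [tx, ty, tz]
    return m


# Pose of the face relative to the viewer (unknown to the part capsules).
face_pose = pose(2.0, 1.0, 5.0, scale=1.5, angle_z=np.pi / 6)

# Assumed fixed poses of the parts relative to the face.
mouth_in_face = pose(0.0, -0.4, 0.0, scale=0.3)
ear_in_face = pose(-0.5, 0.1, 0.0, scale=0.2, angle_z=np.pi / 2)

# What the part capsules observe.
mouth_pose = face_pose @ mouth_in_face
ear_pose = face_pose @ ear_in_face

# Each part votes for the face pose by undoing its own relative pose.
face_from_mouth = mouth_pose @ np.linalg.inv(mouth_in_face)
face_from_ear = ear_pose @ np.linalg.inv(ear_in_face)

print(np.allclose(face_from_mouth, face_from_ear))
```

When the parts are arranged as they would be in a real face, their votes coincide. In a distorted or scrambled face the votes would disagree, which is the signal that routing by agreement exploits.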

Over time, certain examples have become popular within the community. Many tutorials use the example of face parts to explain capsule networks. Aurélien Géron has popularized the use of houses and sailboats in tutorials. Tanay Kothari explains the dynamic routing algorithm using the digits 4 and 7. Given that these two digits have overlapping features, distinguishing them is more challenging than it may seem.

Finally, this tutorial mentions the reconstruction loss that is part of the cost function used in Sabour et al. and the application of coordinate addition to further enhance the performance of capsule networks.

3) “A Visual Representation of Capsule Connections in Dynamic Routing Between Capsules”

Once you’ve read the first two tutorials, I would strongly encourage you to check out this post by Mike Ross. It provides one of the best visualizations of capsule networks that I have come across. (For those who prefer to start with the mechanics and then proceed to the intuition, this may in fact be the best starting point.)

Remarkably, the post provides both a high-level overview and many of the details in a single visualization. Equipped with the conceptual understanding conveyed by the first two tutorials, it allows you to learn about the computational steps involved in a CapsNet. It should also serve as a useful guide for those who would like to implement the architecture.

The particular capsule network architecture used for the MNIST classification task consists of two parts: PrimaryCaps and DigitCaps. There are 32 primary capsules and one capsule for each digit.

The very first operation in a CapsNet is the convolution used in traditional CNNs. 256 convolutional units produce 36 scalars each. A primary capsule contains 8 convolutional units. The input to a capsule is reshaped as 36 8D vectors. The two parts, PrimaryCaps and DigitCaps, are fully connected. In other words, there is a connection from each primary capsule to each digit capsule. A vector is first transformed with the newly introduced squashing function (to ensure a length between 0 and 1, while preserving orientation) and then multiplied by a weight matrix. The result of this process is a total of 32 × 36 × 10 = 11,520 vectors.
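The shape bookkeeping described above can be traced in NumPy. This is a sketch of my own with random values in place of trained weights. The 6×6 grid (36 positions) follows from the 36 scalars per unit, and the 16D size of each digit capsule vector is taken from Sabour et al. rather than from the text above; the reshape order is one plausible grouping of units into capsules, not necessarily the one in the visualization.

```python
import numpy as np


def squash(s, eps=1e-9):
    """Shrink vectors to a length in [0, 1) while keeping their direction."""
    sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)


rng = np.random.default_rng(0)

# 256 convolutional units, each producing a 6x6 map (36 scalars).
conv_out = rng.normal(size=(256, 6, 6)).astype(np.float32)

# Group the units into 32 primary capsules of 8 units each, then read off
# one 8D vector per capsule and grid position: 32 * 36 = 1152 vectors.
u = conv_out.reshape(32, 8, 36).transpose(0, 2, 1).reshape(-1, 8)
u = squash(u)

# One weight matrix per (primary vector, digit capsule) pair, mapping 8D to 16D.
W = rng.normal(scale=0.1, size=(1152, 10, 16, 8)).astype(np.float32)

# Prediction vectors: 1152 * 10 = 11,520 vectors of dimension 16.
u_hat = np.einsum("ijkd,id->ijk", W, u)

num_predictions = u_hat.shape[0] * u_hat.shape[1]
print(u.shape, u_hat.shape, num_predictions)
```

These 11,520 prediction vectors are exactly what the dynamic routing algorithm takes as its input.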

The right-hand part of the diagram illustrates the dynamic routing algorithm. Special attention should be paid to the vectors û_1151,0 and û_1151,1. In the illustrated case, only the coupling weight with respect to the capsule for digit 0 is large, so most of the activity is routed to the output for that particular capsule.