It may seem that every year a new type of neural network is announced to pique our imaginations, but the last big, truly novel neural network to arrive on the scene was the Generative Adversarial Network (GAN), created by Ian Goodfellow in 2014, almost four years ago. Much of the research following the creation of a new network focuses first on discovering applications for which it shows aptitude, then on how to optimize or extend it, and finally on how to mix it with other types of neural networks to create ensembles. 2016, for instance, was unavoidably the year of the GAN, with a truly dizzying number of derivative networks for all manner of applications hitting arXiv: CycleGANs, DCGANs, Conditional GANs, and many more. Late 2017, though, saw perhaps the first truly lifelike mimicry of faces with Nvidia’s progressive growing of GANs approach, which achieves incredible fidelity and, as a side effect, permits very large (1024×1024) images to be faithfully generated — I highly recommend the short paper introducing it.
So although the field of deep learning is white hot right now in terms of academic and corporate interest, with an overwhelming amount of stellar research and production-ready applications generated each year, it is still a huge event when a completely new, non-derivative type of neural network is created and announced. In late October, Geoffrey Hinton, one of the fathers of modern AI (he was among the first researchers to demonstrate the use of backpropagation to train neural networks, alongside other important contributions like dropout and deep belief networks), announced that he and a team of researchers had finally implemented a previously only theoretical type of neural network, the capsule network, in an implementation dubbed CapsNet.
The capsule network is historically different from other types of neural networks (which have generally been used in different, noncompeting problem domains) because it was created to address a perceived fundamental deficiency in another type of neural network, the Convolutional Neural Network (sometimes referred to as a ConvNet but more often regrettably abbreviated to CNN, just like another network you may have heard of…). CNNs are, largely, image classifiers. They train on huge labeled image datasets like CIFAR-10 and ImageNet and, at inference time, produce a probability between 0 and 1 for each class; the class closest to 1 is the “best guess.”
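As a toy illustration of that final classification step (not specific to any particular architecture), the raw scores a classifier’s last layer produces are typically converted into probabilities with a softmax; the scores below are invented:

```python
import numpy as np

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical raw scores from a classifier's final layer for three classes
logits = np.array([2.0, 0.5, -1.0])
probs = softmax(logits)
best_guess = int(np.argmax(probs))  # the class whose probability is closest to 1
```

The probabilities are relative: a network can be forced to pick a “best guess” even when no class fits the input well.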
CNNs have several known problems. They are susceptible to adversarial examples: distortions to an image that would not trick a human (e.g., a negative of the image) but cause the network to misclassify it. They also require far more training data than most people’s intuition suggests should be needed to recognize an object. More importantly, they tend to search for component parts of an image without necessarily caring about the orientation of each part, and without the ability to detect the spatial relationships between parts. A CNN may know, for instance, that a face contains two eyes and a mouth, so it may classify both of the following as a face, even though one is merely a collection of facial features:
This brings us to the issue of translation variance. CNNs are translation invariant (due in part to a feature called max pooling, present in most CNNs, which downsamples pixel clusters between convolutional layers). This is both good and bad. It means a CNN can detect whether an image contains a frog no matter where the frog appears in the image. Contrast this with the possibility that a CNN trained only on images containing frogs in the images’ top-left quadrants would only be able to infer the presence of frogs in input images with frogs in that same quadrant. (That would be bad!) However, this invariance comes with no additional information about the position of the detected feature (i.e., its translation from some norm), nor anything approximating the spatial relationships between features, unless higher-order layers in the network have already composed the features as part of a larger whole. This certainly happens in a CNN (a “dog” is a collection of shapes, edges, colors, and patterns that resolve to “dog”), but it is not possible for more complex scenes, especially if we want the neural network to infer “there is a dog, a human, and a bone in midair in the scene: ergo, the dog and human are playing fetch” in every scene, from any angle, where this is true.
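A minimal sketch of what 2×2 max pooling does, and why it discards position, using NumPy (the “images” are invented toy activation maps):

```python
import numpy as np

def max_pool_2x2(image):
    """Downsample by taking the max of each non-overlapping 2x2 block."""
    h, w = image.shape
    return image.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# The same strong activation at two different positions...
a = np.array([[9, 0, 0, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 0]], dtype=float)
b = np.array([[0, 0, 0, 0],
              [0, 9, 0, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 0]], dtype=float)

# ...pools to an identical output: the feature is detected either way,
# but its exact position within the 2x2 window is thrown away.
pooled_a = max_pool_2x2(a)
pooled_b = max_pool_2x2(b)
```

Both inputs pool to the same 2×2 map, which is exactly the trade: robustness to small translations in exchange for losing where, precisely, the feature was.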
NB: By training a network well on images with the same label that are rotated, resized, and displaced, a network can also become rotation- and scale-invariant. These are useful traits, but a rotation-aware yet spatially unaware network will succumb to even more cases where parts of a presumed “whole” present in the image increase the final probability of a match, even when a part is completely upside down relative to the others.
I hesitate to say that there is anything fundamentally wrong with a convolutional-max-pooling approach (though Hinton has done just that), as the network model itself is still relatively new (AlexNet inaugurated widespread research in 2012 after achieving a 15% error rate at the annual ImageNet competition), and research into modifications and extensions of the CNN is still, when put into perspective, in its infancy. For instance, one could conceptually imagine a CNN that trained on three-dimensional models or point-cloud data from 3D cameras and somehow mapped this onto 2D images during inference (though, again quoting Hinton: “Convolutional nets in 3-D are a nightmare”). So we could find that the CNN, given orders of magnitude more computational power and some breakthrough tweaks or additions, is the right tool for the job. Put another way, however, there are known shortcomings to CNNs that tinkering on the margins may never address, and for which we currently have few viable workarounds.
Enter the capsule network, which aims to overcome the limitations of CNNs and, more broadly, to reset the field of deep learning so that the processes of neural-network learning more closely resemble how human brains actually work. For Hinton, this is the quickest and most viable strategy for arriving at artificial general intelligence (AGI).
Whereas convolutional layers learn higher-level features cumulatively layer by layer (points become edges, edges become shapes, and finally a collection resolves by way of a nonlinear activation function into a high probability for some label like “fish” or “helicopter”), allowing a network to “build” an image (or confidence in an image) over time, a capsule network starts off with, fittingly, the capsule: “a group of neurons whose activity vector represents the instantiation parameters of a certain kind of entity,” like an entire object or part of an object. This means its collection of neurons encodes information like hue, position, orientation, velocity, texture, and more. Most importantly, this approach preserves pose, a concept from computer graphics that encodes rotation and translation and which allows the network to enforce spatial relationships between component parts (neurons, capsules, or both). What a capsule network does is look at an image and try to reverse engineer it into a hierarchical data organization — what Hinton calls inverse rendering.
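The “squash” nonlinearity from the CapsNet paper illustrates how one activity vector can carry both existence and pose: the vector’s length is compressed into [0, 1) and read as the probability that the entity is present, while its direction, which encodes the instantiation parameters, is preserved. A sketch in NumPy:

```python
import numpy as np

def squash(v, eps=1e-8):
    """Squash an activity vector: length lands in [0, 1), direction is unchanged.
    Length ~ probability the entity exists; orientation ~ its pose parameters."""
    norm_sq = np.dot(v, v)
    scale = norm_sq / (1.0 + norm_sq)
    return scale * v / (np.sqrt(norm_sq) + eps)

long_vec = squash(np.array([3.0, 4.0]))   # length 5 -> squashed near 1 ("present")
short_vec = squash(np.array([0.1, 0.0]))  # tiny length -> squashed near 0 ("absent")
```

Note that a contrast with a scalar neuron: a ReLU unit outputs one number, while a squashed capsule outputs a whole vector whose magnitude and direction mean different things.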
Essentially, when an image is passed into a capsule network, lower-level matching capsules activate around discrete features. The example I have seen most often is a face. If the “nose” and “ear” capsules are activated (they activate when parts of the image appear to contain a nose and/or an ear, at any orientation and pose), their outputs are multiplied by affine transformation matrices (recall that a capsule is a group of neurons represented by an activity vector) instead of scalar weights, encoding information about their pose, and are sent as inputs to the next, higher-order capsule (if routed correctly, a “face” capsule).
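A sketch of that step, with invented capsule names and dimensions: each lower-level capsule’s prediction for a higher-level capsule is a matrix–vector product with a learned transformation matrix (random stand-ins here), rather than a scalar-weighted activation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: two lower-level capsules ("nose", "ear"), each an 8-D
# activity vector, predicting the pose of a 16-D higher-level "face" capsule.
u_nose = rng.standard_normal(8)
u_ear = rng.standard_normal(8)

# One learned transformation matrix per (lower, higher) capsule pair encodes
# the part-whole spatial relationship; random values stand in for training.
W_nose_face = rng.standard_normal((16, 8))
W_ear_face = rng.standard_normal((16, 8))

# Each capsule's "prediction" for the face capsule's pose is a full
# matrix-vector product, not a single scalar weight times an activation.
u_hat_nose = W_nose_face @ u_nose
u_hat_ear = W_ear_face @ u_ear
```

Because the matrices encode relative pose, a nose detected upside down predicts an upside-down face — which is how spatial consistency between parts gets enforced.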
But how do these collaborating capsules know which higher-order capsule they belong to? Well, their pose-encoded vectors become, effectively, “predictions,” which search for a higher-order capsule with an output activity vector akin to their own. When multiple capsules “agree” around a capsule, that capsule is “selected” as active. This process (which is more involved than described here) is called dynamic routing or routing by agreement and is essential to the CapsNet approach. Notably, routing is not learned by backpropagation: it happens iteratively during the forward pass, and the number of iterations is a hyperparameter of the model (the transformation matrices themselves are still trained with backpropagation). So whereas crude pooling routes information in every major CNN, dynamic routing is the workhorse of the capsule network. Routing by selection of similar output vectors also aligns more closely with “Hebbian” principles of neuronal firing — neurons within a chain of firing tend to have similar-strength output signatures — i.e., what is actually occurring in biological neurons in the brain.
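A simplified sketch of routing by agreement, loosely following Sabour et al.’s procedure (the shapes and data here are invented; a real implementation adds batching and learned biases):

```python
import numpy as np

def squash(v, eps=1e-8):
    """Squash vectors along the last axis: length in [0, 1), direction kept."""
    norm_sq = np.sum(v * v, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * v / (np.sqrt(norm_sq) + eps)

def route(u_hat, iterations=3):
    """Routing by agreement over predictions u_hat of shape
    (num_lower_capsules, num_higher_capsules, dim)."""
    n_lower, n_higher, _ = u_hat.shape
    b = np.zeros((n_lower, n_higher))  # routing logits, start out uniform
    for _ in range(iterations):        # iteration count is a hyperparameter
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # coupling coefficients
        s = (c[..., None] * u_hat).sum(axis=0)  # weighted sum per higher capsule
        v = squash(s)                           # candidate higher-capsule outputs
        # "Agreement": dot product between each prediction and each output;
        # predictions that align with an output strengthen their route to it.
        b += np.einsum('ijk,jk->ij', u_hat, v)
    return v

# Three lower-level capsules voting among two candidate higher-level capsules,
# all with 4-D activity vectors; random data stands in for real predictions.
rng = np.random.default_rng(1)
u_hat = rng.standard_normal((3, 2, 4))
v = route(u_hat)
```

Each iteration sharpens the coupling coefficients toward the higher-level capsule whose output the predictions already agree on, which is the “selection” described above.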
CapsNet has shown error reduction on the MNIST dataset, but results well below best in class on the CIFAR-10 dataset (an “easier” dataset than ImageNet, on which it has not yet been tested). Its major benefit is that it takes fewer image examples to learn, largely by making equivariant feature detection possible, which may make it more robust in practice. However, iterative dynamic routing is expensive, and CapsNet currently takes longer to train than equivalently performant CNNs. So far, it is not clear whether CapsNet is better than the network model it aims to replace, and which Hinton so clearly views as a dead end. Certainly, it is not uniformly better.
That said, I began this blog post highlighting how far GANs have come in the few years they’ve been in existence. By replacing pooling with routing by agreement, capsule networks are not just a different type of neural network but almost an entirely different class unto themselves — which makes them both a fascinating entry point for research and perhaps also an oddity not necessarily deserving of research commitment on the same scale as proven networks now in their prime. If the model is promising and researchers pursue it with the urgency they have pursued other networks, we could see it develop over the next half-decade into something truly competitive with the CNN, and prove out models of deep learning that mimic the brain, pointing us toward a more naturalistic and less data-intensive future for the pursuit of general AI.
 For instance, several thousand examples of “spoon” plus a transfer-learning scheme may be necessary for a network to identify spoon/not spoon with human confidence, as opposed to the tens of examples a child needs to start to grasp what a spoon looks like, or at least that it differs from a fork in shape. However, the brain is a complex and still-mysterious organ, with cognitive apparatuses built up over many millions of years of evolution — it is sensible to assume that the total amount of information processed throughout the development of sentient, perceptive life on Earth in order to evolve such finely calibrated, quick-learning intelligences is likewise mammoth.