Stacked Capsule Autoencoders

Original article was published on Artificial Intelligence on Medium

A look into the future of object detection in images and videos using Unsupervised Learning and a limited amount of training data.

(Source: Boeing’s New Space Capsule)


Over the last few years, Geoffrey Hinton and a team of researchers have been working on a revolutionary new type of neural network based on Capsules.

One of the main motivations behind this work is that current neural networks such as Convolutional Neural Networks (CNNs) achieve state-of-the-art accuracy in computer vision tasks such as object detection only when given a large amount of data.

One of the main reasons why models like CNNs require such a large amount of data is their inability to capture orientational and spatial relationships between the different elements which compose an image. In fact, one of the main techniques used to improve CNN performance is Data Augmentation. When applying Data Augmentation, we create additional data from the original images (for example by rotating, cropping or flipping them), which helps our model learn in a more general way what characterises different objects. In this way, our model is more likely to recognise the same object even when it is seen from a different perspective (Figure 1).

Figure 1: Viewing the same object from a different perspective can cause misconception [1].
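
As a rough illustration of the Data Augmentation idea described above, the following minimal sketch uses NumPy to create rotated, flipped and cropped variants of a single image. The function name `augment` and the `crop` parameter are hypothetical and chosen only for this example; real pipelines usually apply such transformations randomly and at a much larger scale.

```python
import numpy as np


def augment(image: np.ndarray, crop: int = 2) -> list:
    """Return a few augmented variants of an (H, W) or (H, W, C) image.

    Illustrative only: `augment` and `crop` are assumptions made for
    this sketch, not part of any particular library.
    """
    h, w = image.shape[:2]
    return [
        np.rot90(image, k=1),                   # rotate by 90 degrees
        np.fliplr(image),                       # mirror left/right
        np.flipud(image),                       # mirror up/down
        image[crop:h - crop, crop:w - crop],    # central crop
    ]


# A toy 8x8 "image" whose pixel values make the transformations easy to check.
image = np.arange(64, dtype=np.float32).reshape(8, 8)
variants = augment(image)
```

Each variant shows the same content under a different transformation, which is exactly the kind of extra example a CNN needs because it does not model viewpoint explicitly.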

CNNs detect objects by first identifying edges and shapes in an image and then combining them together. This approach, however, does not take into account the spatial hierarchies which make up the overall image, and therefore requires large datasets in order to perform well (which in turn increases the computational cost of training the model).


Geoffrey Hinton's approach of using Capsules instead closely follows the principle of Inverse Graphics. According to Hinton, every time our brain processes a new object, its representation does not depend on the viewing angle. Therefore, in order to create models able to perform object recognition as well as our brain does, we need to capture the hierarchical relationships between the different parts which compose an object and relate them to a coordinate frame.

This can be achieved by basing our network on a structure called a Capsule. A capsule is a data structure which holds, in vector form, all the main information about the feature we are detecting. Its main constituents are:

  1. A logistic unit which represents whether a shape exists in an image.
  2. A matrix representing the pose of the shape.
  3. A vector embedding other information such as colour, deformations, etc…
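
The three constituents listed above can be sketched as a small Python data structure. This is only an illustration under stated assumptions: the class name `Capsule`, the helper `make_capsule`, and the choice of a 3x3 homogeneous 2D transform as the pose matrix are all made up for this example and are not taken from any specific capsule implementation.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Capsule:
    presence: float        # output of the logistic unit, in (0, 1)
    pose: np.ndarray       # pose matrix (assumed here: 3x3 homogeneous 2D transform)
    features: np.ndarray   # other information such as colour or deformation


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + np.exp(-x))


def make_capsule(presence_logit, angle, tx, ty, features) -> Capsule:
    """Hypothetical helper: build a capsule from a logit, a rotation and a shift."""
    c, s = np.cos(angle), np.sin(angle)
    pose = np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0.0, 0.0, 1.0]])
    return Capsule(
        presence=float(sigmoid(presence_logit)),
        pose=pose,
        features=np.asarray(features, dtype=float),
    )


# A shape that is probably present, rotated by 90 degrees and shifted right by 1.
cap = make_capsule(presence_logit=2.0, angle=np.pi / 2, tx=1.0, ty=0.0,
                   features=[0.8, 0.1, 0.1])
```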

Hinton's research team has proposed different approaches for creating Capsule Networks during the last few years.

In Capsule Networks, the different neurons compete with each other in order to find agreeing parts which compose objects in images. Three different approaches can be used to measure agreement between capsules:

  • Using cosine distance as a measure of agreement.
  • Expectation-Maximization.
  • Mixture Models.
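
To make the first option more concrete, here is a minimal sketch of agreement measured through cosine similarity (that is, one minus the cosine distance). It assumes each lower-level capsule casts a "vote" vector for a higher-level capsule and compares each vote with the mean vote. The function names and the use of the mean as a consensus are assumptions made for illustration, not the routing procedure of any specific paper.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray, eps: float = 1e-9) -> float:
    """Cosine of the angle between two vectors (1 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def agreement_scores(votes: np.ndarray) -> np.ndarray:
    """Score each vote by its cosine similarity to the mean vote.

    votes has shape (n_capsules, dim). Using the mean as the consensus
    is an assumption of this sketch.
    """
    consensus = votes.mean(axis=0)
    return np.array([cosine_similarity(v, consensus) for v in votes])


# Two capsules roughly agree; the third one points the opposite way.
votes = np.array([[1.0, 0.0],
                  [0.9, 0.1],
                  [-1.0, 0.2]])
scores = agreement_scores(votes)
```

Votes that point in a similar direction to the consensus receive a score close to 1, while the dissenting vote receives a negative score and could be down-weighted.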

As shown in Figure 2, by basing our system's understanding of objects on geometric relationships, we can enable our model to reliably detect an object (even if captured from a different point of view or under different lighting conditions) after seeing just one instance of it during training (no Data Augmentation needed).

Figure 2: Object Detection from different points of view [2]
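
The reason geometric relationships help with viewpoint changes can be illustrated with a small sketch. It assumes poses are 3x3 homogeneous 2D transforms and that the observed pose of a part is the object's pose composed with a fixed object-to-part relationship; the names `pose`, `observe` and `recover_relation` are hypothetical. Changing the viewpoint changes what is observed, but the recovered object-part relationship stays the same.

```python
import numpy as np


def pose(angle: float, tx: float, ty: float) -> np.ndarray:
    """3x3 homogeneous transform: rotation by `angle`, then translation."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0.0, 0.0, 1.0]])


# Fixed relationship between an object and one of its parts
# (for example, the part sits one unit above the object's centre).
part_in_object = pose(0.0, 0.0, 1.0)


def observe(object_in_view: np.ndarray) -> np.ndarray:
    """Pose of the part as seen by the viewer."""
    return object_in_view @ part_in_object


def recover_relation(object_in_view: np.ndarray, part_in_view: np.ndarray) -> np.ndarray:
    """Undo the viewpoint to get the object-part relationship back."""
    return np.linalg.inv(object_in_view) @ part_in_view


view_a = pose(0.0, 2.0, 3.0)            # object seen upright
view_b = pose(np.pi / 3, -1.0, 0.5)     # same object, rotated and shifted

rel_a = recover_relation(view_a, observe(view_a))
rel_b = recover_relation(view_b, observe(view_b))
```

Here `rel_a` and `rel_b` are equal even though the two observations differ, which is the kind of viewpoint-independent knowledge a capsule model aims to store.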

Stacked Capsule Networks

One of the additions in the 2019 approach to creating capsule networks is the ability to perform object detection in an unsupervised way, which removes the need to label our data.

This model architecture can be divided into three main stages:

  • Constellation Autoencoder (CCAE): at this stage, an autoencoder model is trained in an unsupervised way to maximise the likelihood of the part capsules.
  • Part Capsule Autoencoder (PCAE): our input images are divided into constituent parts and the poses of these parts are inferred.
  • Object Capsule Autoencoder (OCAE): the created parts are organised together with their corresponding poses to recreate objects.
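
To give an intuition for what "maximising the likelihood of the parts" can mean, the sketch below scores a set of 2D part positions under a mixture of Gaussians whose centres are the positions predicted by object capsules. This is a simplified illustration: the isotropic Gaussians, the fixed `sigma`, the use of normalised presences as mixing weights and the function name `part_log_likelihood` are all assumptions of this example, not the exact objective of the model.

```python
import numpy as np


def part_log_likelihood(parts, predicted_means, presences, sigma=0.5) -> float:
    """Log-likelihood of 2D part positions under a Gaussian mixture.

    parts:           (M, 2) observed part positions
    predicted_means: (K, 2) positions predicted by K object capsules
    presences:       (K,) positive presence values, normalised to mixing weights
    """
    parts = np.asarray(parts, dtype=float)
    means = np.asarray(predicted_means, dtype=float)
    weights = np.asarray(presences, dtype=float)
    weights = weights / weights.sum()

    # Squared distance between every part and every predicted position: (M, K).
    sq_dist = ((parts[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
    # Log-density of an isotropic 2D Gaussian with standard deviation sigma.
    log_gauss = -sq_dist / (2.0 * sigma ** 2) - np.log(2.0 * np.pi * sigma ** 2)
    log_mix = np.log(weights)[None, :] + log_gauss

    # Numerically stable log-sum-exp over the K mixture components.
    top = log_mix.max(axis=1, keepdims=True)
    per_part = top[:, 0] + np.log(np.exp(log_mix - top).sum(axis=1))
    return float(per_part.sum())


parts = [[0.0, 0.0], [1.0, 1.0]]
good = part_log_likelihood(parts, [[0.0, 0.0], [1.0, 1.0]], [0.5, 0.5])
bad = part_log_likelihood(parts, [[3.0, 3.0], [4.0, 4.0]], [0.5, 0.5])
```

Predictions that land close to the observed parts give a higher log-likelihood (`good` is larger than `bad`), so training to increase this value pushes the capsules to explain the parts they see, without any labels.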

Combining these three stages, we get our final Stacked Capsule Network. The whole process is summarised in the workflow in Figure 3.