Paper Anatomy — FactorVAE (Part 1)

Original article was published on Deep Learning on Medium

Explaining this paper from ICML 2018


Learning a Disentangled Representation means being able to identify the salient factors of variation in the data and store them independently.
For example, considering human faces, a salient factor of variation in the data could be the color of the skin.
Being able to store this factor independently from the other factors means different things depending on the latent space:

  • in the case of a Euclidean Space, it could mean orthogonality
  • in the case of a Probabilistic Model, it means a Factorized Latent PDF
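
To make the second bullet concrete: for a latent vector z = (z_1, ..., z_d), a factorized latent PDF is one that can be written as a product of one-dimensional PDFs. The symbol d for the number of latent dimensions is only notation chosen here.

```latex
p(z) = \prod_{j=1}^{d} p(z_j)
```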

Disentangled Representation Definition — Bengio 2013

a representation where a change in one dimension corresponds to a change in one factor of variation, while being relatively invariant to changes in other factors.

Advantages of pursuing a Disentangled Representation

It is believed, and there is also some empirical evidence, that a Disentangled Representation should improve abstract reasoning.

A key aspect is that there is a tradeoff between disentanglement and reconstruction quality.

Why work with visual data

We focus on image data, where the effect of factors of variation is easy to visualise.

Paper Elements


In particular, we assume that the data has been generated from a fixed number of independent factors of variation.

The Dataset is the result of a generative process that is unknown, but it is possible to make assumptions about it

In fact, assumptions about the underlying factors are key

Notably, semi-supervised approaches that require implicit or explicit knowledge about the true underlying factors of the data have excelled at disentangling.
However, ideally we would like to learn these in an unsupervised manner,

So the goal is to move closer to a more unsupervised learning of a disentangled representation

due to the following reasons:
1. Humans are able to learn factors of variation unsupervised (Perry et al., 2010).
2. Labels are costly as obtaining them requires a human in the loop.
3. Labels assigned by humans might be inconsistent or leave out the factors that are difficult for humans to identify.

1. aiming at true intelligence means aiming at learning the way humans do

2. labelling is a major bottleneck

3. human labels are not precise and induce a bias in the training set

Review of previous works

β-VAE (Higgins et al., 2016) is a popular method for unsupervised disentangling based on the Variational Autoencoder (VAE) framework

One important work that inspired this one is Beta VAE

It uses a modified version of the VAE objective with a larger weight (β > 1) on the KL divergence between the variational posterior and the prior, and has proven to be an effective and stable method for disentangling.

Beta VAE defines a way to achieve Disentangled Representation Learning in the context of VAE, working on the Objective Function (more details in the Beta VAE paper, "beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework")
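
To see what the larger weight on the KL term does, here is a minimal sketch of the per-image loss in plain Python. It assumes the usual VAE setup (a Gaussian encoder with diagonal covariance and a standard normal prior), for which the KL divergence has a closed form. The function names, the placeholder value `recon_nll` and the choice `beta=4.0` are illustrative assumptions, not taken from the paper.

```python
import math

def gaussian_kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))

def beta_vae_loss(recon_nll, mu, logvar, beta=4.0):
    """Loss to minimise for one image: reconstruction NLL + beta * KL."""
    return recon_nll + beta * gaussian_kl_to_standard_normal(mu, logvar)

# Hypothetical encoder output for one image with a 2-dimensional latent code.
mu, logvar = [1.0, 0.0], [0.0, 0.0]
recon_nll = 10.0  # placeholder reconstruction term, not a real measurement

kl = gaussian_kl_to_standard_normal(mu, logvar)
loss_vae = beta_vae_loss(recon_nll, mu, logvar, beta=1.0)   # plain VAE
loss_beta = beta_vae_loss(recon_nll, mu, logvar, beta=4.0)  # beta-VAE
print(kl, loss_vae, loss_beta)
```

With the same encoder output, the only difference between the two losses is how strongly the KL term is weighted against the reconstruction term.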

One drawback of β-VAE is that reconstruction quality (compared to VAE) must be sacrificed in order to obtain better disentangling.

Here arises the tradeoff between Disentangled Representation and Reconstruction

Purpose of the Paper

The goal of our work is to obtain a better trade-off between disentanglement and reconstruction, allowing to achieve better disentanglement without degrading reconstruction quality.

Improving the Factors Disentanglement vs Reconstruction Quality Tradeoff

In this work, we analyse the source of this trade-off and propose FactorVAE, which augments the VAE objective with a penalty that encourages the marginal distribution of representations to be factorial without substantially affecting the quality of reconstructions.

Here the strategy is explained: find a better objective function.
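
Written out in the notation of the paper, the new objective is the plain VAE objective minus a weighted penalty on the marginal distribution q(z). Here γ is the weight of the penalty, N the number of images and d the number of latent dimensions (the symbol d is only notation chosen here).

```latex
\frac{1}{N} \sum_{i=1}^{N} \Big[ \mathbb{E}_{q(z \mid x^{(i)})}\big[ \log p(x^{(i)} \mid z) \big]
  - \mathrm{KL}\big( q(z \mid x^{(i)}) \,\|\, p(z) \big) \Big]
  - \gamma \, \mathrm{KL}\big( q(z) \,\|\, \bar{q}(z) \big),
\qquad \bar{q}(z) = \prod_{j=1}^{d} q(z_j)
```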

This penalty is expressed as a KL divergence between this marginal distribution and the product of its marginals, and is optimised using a discriminator network following the divergence minimisation view of GANs
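
To train such a discriminator, one needs samples from the marginal distribution and samples from the product of its marginals. The paper obtains the latter by shuffling each latent dimension independently across a batch of codes. The sketch below illustrates that shuffling step in plain Python; it is not the authors' code, and the toy batch and the function name are assumptions made for the example.

```python
import random

def permute_dims(z_batch, rng):
    """Shuffle each latent dimension independently across the batch.

    The result keeps every per-dimension marginal intact but breaks the
    dependence between dimensions, approximating the product of marginals.
    """
    batch_size = len(z_batch)
    latent_dim = len(z_batch[0])
    permuted = [[0.0] * latent_dim for _ in range(batch_size)]
    for j in range(latent_dim):
        order = list(range(batch_size))
        rng.shuffle(order)  # a fresh permutation for every dimension
        for i, src in enumerate(order):
            permuted[i][j] = z_batch[src][j]
    return permuted

rng = random.Random(0)
# Toy batch in which the two latent dimensions are perfectly dependent.
z_batch = [[float(i), 10.0 * i] for i in range(6)]
z_perm = permute_dims(z_batch, rng)
for original, shuffled in zip(z_batch, z_perm):
    print(original, shuffled)
```

The discriminator is then trained to tell rows of `z_batch` apart from rows of `z_perm`; the better it can do so, the further the codes are from being factorial.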


Our experimental results show that this approach achieves better disentanglement than β-VAE for the same reconstruction quality

Anticipation of results: better disentanglement without paying a fee in terms of reduced reconstruction quality.

We also point out the weaknesses in the disentangling metric of Higgins et al. (2016), and propose a new metric that addresses these shortcomings.

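Theoretical reason for this result: the Beta VAE Objective Function is suboptimal, because the heavier weight on the KL term also penalizes the information that the latent code keeps about the image.

The metric is a separate issue: the authors of Factor VAE claim that their new metric fixes the weaknesses of the metric of Higgins et al. (2016).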

Alternative Generative Models: GAN

A popular alternative to β-VAE is InfoGAN

Let’s explore the works related to the GAN world, as an alternative to VAE.

Understanding Beta VAE

Let’s extend the math framework and be more clear than the paper

Math Framework

So let’s move forward with the paper

So the key idea is to learn to map the underlying factors of the generative model onto the axes of the Latent Euclidean Space, which is equivalent, in a probabilistic framework, to using a factorized variational approximation.

Marginal Posterior

This is the marginal posterior, which consists of marginalizing the image away from the variational posterior by integrating over the dataset
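
In symbols, with q(z | x) the variational posterior, p_data(x) the data distribution and N the number of images in the dataset, the marginal posterior can be written as:

```latex
q(z) = \mathbb{E}_{p_{\mathrm{data}}(x)}\big[ q(z \mid x) \big]
     = \frac{1}{N} \sum_{i=1}^{N} q(z \mid x^{(i)})
```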

In practice, during training the marginalization is performed over a batch rather than over the entire dataset, but the idea is the same

Beta VAE Objective Function
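
For reference, this is the objective in the notation used by the FactorVAE paper, where p(z) is the prior, q(z | x) the variational posterior, p(x | z) the decoder, N the number of images, and β > 1:

```latex
\frac{1}{N} \sum_{i=1}^{N} \Big[ \mathbb{E}_{q(z \mid x^{(i)})}\big[ \log p(x^{(i)} \mid z) \big]
  - \beta \, \mathrm{KL}\big( q(z \mid x^{(i)}) \,\|\, p(z) \big) \Big]
```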

The first term is related to the reconstruction, as it takes into account the probability of observing the same image that was given as input to the encoder

The second term is a regularizer: it is the one acting on the representation, so it is the one we are interested in understanding better

As you can see, it pushes the variational posterior to be as similar as possible to the prior

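In fact, the variational posterior q(z|x) being, by design, a Gaussian with a diagonal covariance for each single image does not mean that the representation as a whole is factorized like the prior (otherwise the regularizer would not make sense): once the image is marginalized away, the dimensions of z can still depend on each other. To be like the prior, the latent dimensions need to be statistically independent, so the Covariance Matrix of the marginal posterior has to be as diagonal as possible.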
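
A small numerical sketch of this point, using only the standard library. The "encoder" here is a made-up toy: for each image it outputs a Gaussian with independent noise per dimension, but both means follow the same hidden factor `t`. All names and numbers are assumptions chosen for illustration.

```python
import random

def covariance(xs, ys):
    """Population covariance of two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

rng = random.Random(0)
sigma = 0.1  # per-image posterior standard deviation, same for both dimensions
z1, z2 = [], []
for _ in range(5000):
    t = rng.uniform(-2.0, 2.0)  # hypothetical factor that varies across images
    mu = (t, t)                 # toy encoder: both means follow the same factor
    # Independent noise per dimension: diagonal covariance for each image.
    z1.append(rng.gauss(mu[0], sigma))
    z2.append(rng.gauss(mu[1], sigma))

# Correlation between the two dimensions after marginalizing over images.
corr = covariance(z1, z2) / (covariance(z1, z1) * covariance(z2, z2)) ** 0.5
print(round(corr, 3))
```

Each per-image posterior is diagonal, yet the correlation across the whole set of codes is close to 1, so the marginal posterior is far from factorized.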