Original article was published on Deep Learning on Medium

# Paper Anatomy — FactorVAE (Part 1)

Explaining this paper from ICML 2018

# Introduction

Learning a Disentangled Representation means being able to identify the salient factors of variation in the data and store them independently.

For example, considering human faces, a salient factor of variation in the data could be skin color.

Being able to store this factor independently from the other factors means different things depending on the latent space:

- in the case of a Euclidean Space, it could mean orthogonality
- in the case of a Probabilistic Model, it means a Factorized Latent PDF
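
As a sketch of the second case, a factorized latent density is one that splits into a product of per-dimension densities. The symbols $q(z)$ for the latent density and $d$ for the number of latent dimensions are notation assumed here for illustration:

```latex
q(z) = \prod_{j=1}^{d} q(z_j)
```

Under this form, knowing the value of one latent dimension tells you nothing about the others.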

**Disentangled Representation Definition — Bengio 2013**

a representation where a change in one dimension corresponds to a change in one factor of variation, while being relatively invariant to changes in other factors.

**Advantages of pursuing a Disentangled Representation**

It is believed, and there is also some empirical evidence, that a Disentangled Representation should improve abstract reasoning.

A key aspect is that there is a tradeoff between disentanglement and reconstruction quality.

**Why work with visual data**

We focus on image data, where the effect of factors of variation is easy to visualise.

# Paper Elements

**Assumptions**

In particular, we assume that the data has been generated from a fixed number of independent factors of variation.

The Dataset is the result of an unknown generative process, but it is possible to make assumptions about it.

In fact, assumptions about the underlying factors are key.

Notably, semi-supervised approaches that require implicit or explicit knowledge about the true underlying factors of the data have excelled at disentangling.

So the goal is to move closer to a more unsupervised way of learning a disentangled representation.

However, ideally we would like to learn these in an unsupervised manner, due to the following reasons:

1. Humans are able to learn factors of variation unsupervised (Perry et al., 2010).

2. Labels are costly as obtaining them requires a human in the loop.

3. Labels assigned by humans might be inconsistent or leave out the factors that are difficult for humans to identify.

In short:

1. aiming at true intelligence means aiming to learn the way humans do

2. labelling is a major bottleneck

3. human labels are imprecise and introduce a bias into the training set

# Review of previous works

β-VAE (Higgins et al., 2016) is a popular method for unsupervised disentangling based on the Variational Autoencoder (VAE) framework

One important work that inspired this one is Beta VAE

It uses a modified version of the VAE objective with a larger weight (β > 1) on the KL divergence between the variational posterior and the prior, and has proven to be an effective and stable method for disentangling.

Beta VAE defines a way to achieve Disentangled Representation Learning in the context of VAE by working on the Objective Function (more details in the Beta VAE paper, "beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework").
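
The modified objective described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code: it assumes a diagonal Gaussian posterior and a standard normal prior (the usual VAE setup), the function names `gaussian_kl` and `beta_vae_loss` are hypothetical, and the default `beta=4.0` is an arbitrary value chosen only to satisfy β > 1.

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims.

    Assumes a diagonal Gaussian posterior and a standard normal prior.
    """
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

def beta_vae_loss(recon_nll, mu, logvar, beta=4.0):
    """Negative beta-VAE objective, averaged over the batch.

    recon_nll: per-sample reconstruction negative log-likelihood, shape (batch,)
    mu, logvar: encoder outputs, shape (batch, latent_dim)
    beta: weight on the KL term; beta = 1 recovers the plain VAE loss
    """
    kl = gaussian_kl(mu, logvar)
    return float(np.mean(recon_nll + beta * kl))
```

Setting `beta=1.0` gives the standard VAE loss, so the only difference is how strongly the posterior is pulled towards the prior.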

One drawback of β-VAE is that reconstruction quality (compared to VAE) must be sacrificed in order to obtain better disentangling.

This is where the **tradeoff** between Disentangled Representation and Reconstruction arises.

**Purpose of the Paper**

The goal of our work is to obtain a better trade-off between disentanglement and reconstruction, allowing to achieve better disentanglement without degrading reconstruction quality.

In short: improve the tradeoff between factor disentanglement and reconstruction quality.

In this work, we analyse the source of this trade-off and propose FactorVAE, which augments the VAE objective with a penalty that encourages the marginal distribution of representations to be factorial without substantially affecting the quality of reconstructions.

Here the strategy is explained: find a better objective function.
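
As a sketch of what this better objective looks like, here it is written in notation close to the FactorVAE paper's. The weight $\gamma$ on the new penalty and the shorthand $\bar{q}(z)$ for the product of marginals are symbols this article has not introduced, so treat the formula as a reading aid to be checked against the paper:

```latex
\frac{1}{N} \sum_{i=1}^{N} \Big[
  \mathbb{E}_{q(z \mid x^{(i)})} \big[ \log p(x^{(i)} \mid z) \big]
  - \mathrm{KL}\big( q(z \mid x^{(i)}) \,\|\, p(z) \big)
\Big]
- \gamma \, \mathrm{KL}\big( q(z) \,\|\, \bar{q}(z) \big),
\qquad
\bar{q}(z) = \prod_{j=1}^{d} q(z_j)
```

The bracketed part is the usual VAE objective. Only the last term is new, and it is zero exactly when the marginal $q(z)$ is factorial.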

This penalty is expressed as a KL divergence between this marginal distribution and the product of its marginals, and is optimised using a discriminator network following the divergence minimisation view of GANs

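In other words, the penalty measures how far the marginal distribution of the latent code is from being factorized. This KL divergence cannot be computed directly, so a discriminator is trained to tell samples of the marginal apart from samples of the product of its marginals, and its output is used to estimate the divergence.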
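
The two ingredients of this discriminator-based estimate can be sketched as follows. This is a hedged illustration in NumPy, not the authors' implementation: the function names are hypothetical, the discriminator itself is left out, and `d_probs` is assumed to hold its outputs, i.e. the probability that each latent sample came from the marginal rather than from the product of its marginals.

```python
import numpy as np

def permute_dims(z, rng):
    """Shuffle each latent dimension independently across the batch.

    The rows of the result approximate samples from the product of
    marginals, because every column keeps its own marginal distribution
    while the dependence between columns is broken.
    """
    z_perm = np.empty_like(z)
    for j in range(z.shape[1]):
        z_perm[:, j] = z[rng.permutation(z.shape[0]), j]
    return z_perm

def kl_estimate_from_discriminator(d_probs):
    """Density-ratio estimate of the KL penalty.

    d_probs: discriminator outputs D(z) in (0, 1) for samples z drawn
    from the marginal. If D is close to optimal, log(D / (1 - D))
    approximates the log density ratio, and its mean approximates the KL.
    """
    d_probs = np.asarray(d_probs, dtype=float)
    return float(np.mean(np.log(d_probs) - np.log1p(-d_probs)))
```

A discriminator that cannot distinguish the two sets of samples outputs 0.5 everywhere, which yields an estimate of zero, as expected for a latent code that is already factorial.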

Our experimental results show that this approach achieves better disentanglement than β-VAE for the same reconstruction quality

Preview of the results: better disentanglement without paying a price in reconstruction quality.

We also point out the weaknesses in the disentangling metric of Higgins et al. (2016), and propose a new metric that addresses these shortcomings.

Theoretical reason for the better trade-off: the Beta VAE Objective Function is suboptimal.

As for the metric, the authors of FactorVAE claim that their new metric fixes the weaknesses of the earlier one.

**Alternative Generative Models: GAN**

A popular alternative to β-VAE is InfoGAN

Let’s explore the works related to the GAN world, as an alternative to VAE.

# Understanding Beta VAE

Let’s extend the math framework and be clearer than the paper.

So let’s move forward with the paper