Generating new faces with Variational Autoencoders

Source: Deep Learning on Medium




Deep generative models such as Generative Adversarial Networks (GANs) are gaining tremendous popularity in both industry and academic research. In fact, Yann LeCun, the father of the Convolutional Neural Network, described GANs as “the most interesting idea in the last 10 years in Machine Learning.” The idea of a computer program generating new human faces or new animals can be quite exciting. Deep generative models take a slightly different approach from supervised learning, which we shall discuss shortly.

This quick and concise tutorial covers the basics of Deep Generative Modelling with Variational Autoencoders. I am assuming that you are fairly familiar with the concepts of Convolutional Neural Networks and representation learning. If not, I would recommend watching Andrej Karpathy’s CS231n lecture videos as they are, in my honest opinion, the best resource for learning CNNs on the internet. You can also find the lecture notes for the course here.

This example demonstrates the process of building and training a VAE using Keras to generate new faces. We shall be using the CelebFaces Attributes (CelebA) Dataset from Kaggle and Google Colab for training the VAE model.

Generative Models

If you’re beginning to explore the field of Generative Deep Learning, a Variational Autoencoder (VAE) is an ideal place to start. The VAE architecture is intuitive and simple to understand. Unlike a discriminative model such as a CNN classifier, a generative model attempts to learn the underlying distribution of the data rather than classifying the data into one of many categories. A well-trained CNN classifier would be highly accurate in differentiating an image of a car from that of a house. However, this does not accomplish our objective of generating images of cars and houses.

A discriminative model learns to capture useful information from the data and utilise that information to classify a new data point into one of two or more categories. From a probabilistic perspective, a discriminative model estimates the probability 𝑝(𝑦|𝑥), where 𝑦 is the category or class and 𝑥 is the data point. In other words, it estimates the probability of a data point 𝑥 belonging to the category 𝑦. For example, it estimates the probability that an image shows a car rather than a house.

A generative model learns the underlying distribution of the data that explains how the data was generated. In essence, it mimics the underlying distribution and allows us to sample from it to generate new data. It can be defined as estimating the probability 𝑝(𝑥), where 𝑥 is the data point. It estimates the probability of observing the data point 𝑥 in the distribution.
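As a toy illustration of "learn the distribution, then sample from it", the sketch below fits a one-dimensional Gaussian to a handful of made-up numbers and draws new values from it. The data and the function names are hypothetical and have nothing to do with images; a VAE pursues the same idea in a much higher-dimensional space.

```python
import random
import statistics


def fit_gaussian(data):
    """Estimate the mean and standard deviation of 1-D data."""
    return statistics.mean(data), statistics.stdev(data)


def sample_gaussian(mean, std, n, seed=0):
    """Draw n new values from the fitted Gaussian."""
    rng = random.Random(seed)
    return [rng.gauss(mean, std) for _ in range(n)]


# Hypothetical observations (made up for this illustration).
observed = [162.0, 170.5, 168.2, 175.1, 159.8, 181.3]
mean, std = fit_gaussian(observed)
new_points = sample_gaussian(mean, std, n=5)
print(f"mean={mean:.1f}, std={std:.1f}")
print(new_points)
```

The new points were never in the observed data, yet they are plausible under the fitted distribution, which is exactly what we want from generated faces.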

Simple Autoencoder

Before delving into a Variational Autoencoder, it is crucial to analyse a simple Autoencoder.

A simple or Vanilla Autoencoder consists of two neural networks: an Encoder and a Decoder. The Encoder is responsible for converting an image into a compact, lower dimensional vector (or latent vector). This latent vector is a compressed representation of the image. The Encoder therefore maps an input from the higher dimensional input space to the lower dimensional latent space. This is similar to the feature-extraction part of a CNN classifier. In a CNN classifier, this latent vector would subsequently be fed into a softmax layer to compute individual class probabilities. In an Autoencoder, however, this latent vector is fed into the Decoder. The Decoder is a different neural network that tries to reconstruct the image, thereby mapping from the lower dimensional latent space to the higher dimensional output space. The Encoder and Decoder perform mappings in opposite directions, as shown in the image img-1.

img-1 (Source :

Consider the following analogy to understand this better. Imagine you’re playing a game with your friend over the phone. The rules of the game are simple. You are presented with a number of different cylinders. Your task is to describe the cylinders to your friend who will then attempt to recreate them out of modelling clay. You are forbidden from sending pictures. How will you convey this information?

Since any cylinder can be described with two parameters, its height and diameter, the most efficient strategy would be to estimate these two measures and convey them to your friend. Your friend, upon receiving this information, can then reconstruct the cylinder. In this example, you are performing the function of an Encoder by condensing visual information into two quantities. Your friend, on the other hand, is performing the function of a Decoder by using this condensed information to recreate the cylinder.
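The analogy can be written down as a toy sketch. The dictionary keys and the two function names below are hypothetical, chosen only to mirror the game: the "encoder" keeps two numbers and the "decoder" rebuilds a cylinder from them.

```python
import math


def encode(cylinder):
    """Encoder: keep only the two numbers needed to describe the cylinder."""
    return (cylinder["height"], cylinder["diameter"])


def decode(latent):
    """Decoder: rebuild a cylinder description from the two numbers."""
    height, diameter = latent
    volume = math.pi * (diameter / 2) ** 2 * height
    return {"height": height, "diameter": diameter, "volume": volume}


# A hypothetical cylinder; the colour is detail the encoder throws away.
original = {"height": 10.0, "diameter": 4.0, "colour": "red"}
latent = encode(original)
rebuilt = decode(latent)
print(latent)
print(rebuilt)
```

Note that the colour never reaches the decoder: anything not captured in the latent vector is lost, which is why the choice of latent size matters.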


Downloading the dataset

The dataset can be downloaded directly into your Google Colab environment using the Kaggle API as shown below. You can refer to this post on Medium for more details.

Upload the kaggle.json file downloaded from your registered Kaggle account.

!pip install -U -q kaggle
!mkdir -p ~/.kaggle

Download the dataset from Kaggle.

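!cp kaggle.json ~/.kaggle/
# Restrict the token's permissions so that only your user can read it
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d jessicali9530/celeba-dataset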

Defining the project structure.

CAUTION (for Colab users) : Do not attempt to explore the directory using the viewer on the left as the page becomes unresponsive owing to the large size of the dataset.



Since the dataset is quite large, we shall create an ImageDataGenerator object and use its flow_from_directory method to stream data directly from disk rather than loading the entire dataset into memory. The ImageDataGenerator can also apply various transformations dynamically for image augmentation, which is particularly useful in the case of small datasets.

I highly encourage you to refer to the documentation to understand the various parameters of the flow_from_directory method.

Model Architecture

Building the Encoder

The architecture of the Encoder, as shown below, consists of a stack of convolutional layers followed by a dense (fully connected) layer which outputs a vector of size 200.

NOTE : The combination of padding = ‘same’ and stride = 2 will produce an output tensor half the size of the input tensor in both height and width. The depth (number of channels) is not affected by these settings; it is equal to the number of filters in the layer.
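The halving can be checked with a little arithmetic: with padding = ‘same’, the spatial output size of a convolution is the input size divided by the stride, rounded up. The sketch below applies this rule repeatedly; the 128-pixel input and the four layers are assumptions made for illustration, not necessarily the values used in this tutorial.

```python
import math


def conv_same_output_size(input_size, stride):
    """Spatial output size of a convolution with padding='same'."""
    return math.ceil(input_size / stride)


# Hypothetical 128x128 input passed through four stride-2 convolutions.
size = 128
sizes = [size]
for _ in range(4):
    size = conv_same_output_size(size, stride=2)
    sizes.append(size)
print(sizes)  # [128, 64, 32, 16, 8]
```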

Building the Decoder

Recall that it is the function of the Decoder to reconstruct the image from the latent vector. Therefore, the decoder must be defined so that the size of the activations increases gradually through the network. This can be achieved with either the UpSampling2D layer or the Conv2DTranspose layer.

Here, the Conv2DTranspose layer is employed. When used with stride = 2 and padding = ‘same’, this layer produces an output tensor double the size of the input tensor in both height and width.
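The doubling follows from the output-size rule for a transposed convolution with padding = ‘same’, where the output size is the input size multiplied by the stride. The sketch below is only that arithmetic; the starting size of 8 and the four layers are assumptions made for illustration.

```python
def conv_transpose_same_output_size(input_size, stride):
    """Spatial output size of a transposed convolution with padding='same'."""
    return input_size * stride


# Hypothetical 8x8 activation map passed through four stride-2 layers.
size = 8
sizes = [size]
for _ in range(4):
    size = conv_transpose_same_output_size(size, stride=2)
    sizes.append(size)
print(sizes)  # [8, 16, 32, 64, 128]
```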

NOTE : The Decoder, in this example, is defined to be a mirror image of the encoder, which is not mandatory.

Attaching the Decoder to the Encoder

Compilation and Training

The loss function used is a simple Root Mean Squared Error (RMSE). The target output is the same batch of images that was fed to the model at its input layer. The Adam optimizer minimises the RMSE incurred by encoding the batch of images into their respective latent vectors and subsequently decoding them to reconstruct the images.
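To make the loss concrete, here is a minimal sketch of RMSE on plain Python lists. It is an illustration of the formula rather than the Keras loss used for training, and the pixel values are made up.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical pixel intensities scaled to [0, 1].
original = [0.0, 0.5, 1.0, 0.25]
reconstruction = [0.1, 0.4, 0.9, 0.25]
loss = rmse(original, reconstruction)
print(round(loss, 4))
```

A perfect reconstruction gives a loss of zero, and the loss grows as the reconstructed pixels drift away from the originals.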

The ModelCheckpoint Keras callback saves the model weights for reuse. It overwrites the file with a fresh set of weights after every epoch.

NOTE : If you’re using Google Colab, either download the weights to disk or mount your Google Drive.

TIP : Here’s a really useful tip I found on Reddit (by arvind1096) — to prevent Google Colab from disconnecting due to a timeout issue, execute the following JS function in the Google Chrome console.

function ClickConnect(){console.log("Working");document.querySelector("colab-toolbar-button#connect").click()}setInterval(ClickConnect,60000)


The first step is to generate a new batch of images using the ImageDataGenerator defined in the ‘Data’ section at the top. The images are returned as an array and the number of images is equal to BATCH_SIZE.

The first row shows images directly from the dataset and the second row shows images that have been passed through the Autoencoder. Evidently, the model has learned to encode and decode (reconstruct) fairly well.

NOTE : One reason the images lack sharpness is the RMSE loss, which averages out the differences between individual pixel values. Generative Adversarial Networks, by contrast, produce much sharper images.


Adding noise vectors sampled from a standard normal distribution to the image encodings

It can be observed that the images start to get distorted when a bit of noise is added to their encodings. One possible reason is that the model did not ensure that the space around the encoded values (the latent space) was continuous.

Attempting to generate images from latent vectors sampled from a standard normal distribution

It is evident that latent vectors sampled from a standard normal distribution cannot be used to generate new faces. This suggests that the latent vectors generated by the model are not centred/symmetrical around the origin. It also strengthens our inference that the latent space is not continuous.

Since we do not have a definite distribution to sample latent vectors from, it is unclear as to how we can generate new faces. We observed that adding a bit of noise to the latent vector does not produce new faces. We can encode and decode images but that does not meet our objective.

Building on this thought, wouldn’t it be great if we could generate new faces from latent vectors sampled from a standard normal distribution? This is essentially what a Variational Autoencoder is capable of.

Variational Autoencoder

Variational Autoencoders tackle most of the problems discussed above. They are trained to generate new faces from latent vectors sampled from a standard normal distribution. While a Simple Autoencoder learns to map each image to a fixed point in the latent space, the Encoder of a Variational Autoencoder (VAE) maps each image to a z-dimensional normal distribution that training keeps close to the standard normal distribution.

How can a simple Autoencoder be modified to make the encoder map to a z-dimensional normal distribution?

Any z-dimensional normal distribution can be represented by a mean vector 𝜇 and a covariance matrix Σ.

Therefore, in the case of a VAE, the encoder should output a mean vector (𝜇) and a covariance matrix (Σ) to map to a normal distribution, right?

Yes, however, there are slight modifications to be made to the covariance matrix Σ.

Modification 1) It is assumed that there is no correlation between the elements of the latent vector. Hence, the off-diagonal elements, which represent covariance, are all zero. The covariance matrix is therefore a diagonal matrix.

As the diagonal elements of this matrix represent variance, it is simply represented as 𝜎², a z-dimensional variance vector.

Modification 2) Recall that the variance can only take non-negative values. So that the output of the encoder can remain unbounded, the encoder outputs the mean vector and the logarithm of the variance vector rather than the variance itself. The logarithm can take any value in the range (−∞,∞), which makes training easier as the outputs of a neural network are naturally unbounded. The variance is recovered by exponentiating this output.
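As a small sketch of why the logarithm helps: whatever real number the network emits as log-variance, the recovered standard deviation exp(log_var / 2) is always positive. The function name and the sample values are illustrative only.

```python
import math


def log_var_to_std(log_var):
    """Convert log-variance values to standard deviations."""
    return [math.exp(0.5 * lv) for lv in log_var]


# Unbounded network outputs: negative, zero and positive log-variances.
log_var = [-2.0, 0.0, 3.0]
std = log_var_to_std(log_var)
print(std)
```

A log-variance of 0 corresponds to a standard deviation of exactly 1, which is the value the standard normal distribution requires.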

Now, how is it ensured that the Encoder maps to a standard normal distribution (i.e., with a mean of 0 and a standard deviation of 1)?

Enter KL divergence.

KL divergence measures the extent to which one probability distribution differs from another. The KL divergence between a distribution with mean 𝜇 and standard deviation 𝜎, and the standard normal distribution, takes the form:

𝐷_KL = ½ Σ (𝜇² + 𝜎² − log 𝜎² − 1)

where the sum runs over the z elements of the latent vector.

By slightly modifying the loss function to include the KL divergence loss in addition to the RMSE loss, the VAE is forced to ensure that the encodings are very similar to a multivariate standard normal distribution. Since a multivariate standard normal distribution has a zero mean, it is centered around the origin. Mapping each image to a standard normal distribution as opposed to a fixed point ensures that the latent space is continuous and the latent vectors are centered around the origin.
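For a diagonal normal distribution compared against the standard normal, the KL divergence has the closed form ½ Σ (𝜇² + 𝜎² − log 𝜎² − 1). The sketch below evaluates it in plain Python from a mean vector and a log-variance vector; the function name and inputs are illustrative, and a real training loop would compute the same quantity on tensors.

```python
import math


def kl_divergence(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over the latent dimensions."""
    return 0.5 * sum(
        m ** 2 + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var)
    )


# An encoding that is already standard normal incurs no penalty.
print(kl_divergence([0.0, 0.0], [0.0, 0.0]))  # 0.0
# An encoding that has drifted away from it is penalised.
print(kl_divergence([1.0, -2.0], [0.5, -1.0]))
```

The penalty is zero only when every mean is 0 and every variance is 1, so minimising it pulls the encodings towards the origin.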

Take a look at the image img-2 for a better understanding.

img-2 (Source :

If the encoder maps to 𝜇 and 𝜎 instead of a z-dimensional latent vector, what is the input given to the Decoder during training?

The input to the Decoder, as shown in img-2, is a vector sampled from the normal distribution represented by the output of the Encoder, 𝜇 and 𝜎. This sampling can be done as follows:

𝑧 = 𝜇 + 𝜎 ⊙ 𝜀

where ⊙ denotes element-wise multiplication and 𝜀 is sampled from a multivariate standard normal distribution.
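A minimal sketch of this sampling step, assuming the encoder outputs a mean vector and a log-variance vector as described earlier. It draws 𝜀 from a standard normal distribution and computes 𝜇 + 𝜎𝜀 for each latent dimension; the function name and the example values are hypothetical.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * epsilon with epsilon ~ N(0, 1) per dimension."""
    z = []
    for m, lv in zip(mu, log_var):
        sigma = math.exp(0.5 * lv)     # standard deviation from log-variance
        epsilon = rng.gauss(0.0, 1.0)  # noise from the standard normal
        z.append(m + sigma * epsilon)
    return z


rng = random.Random(42)
mu = [0.5, -1.0, 0.0]
log_var = [0.0, -2.0, 1.0]
z = sample_latent(mu, log_var, rng)
print(z)
```

Because the randomness lives entirely in 𝜀, the result is a simple deterministic function of 𝜇 and 𝜎, which is what allows gradients to flow back into the encoder during training.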

How do we generate new faces?

Since the KL divergence ensures that the encoder maps as close to a standard normal distribution as possible, we can sample from a z-dimensional standard normal distribution and feed it to the decoder to generate new images.

Does the decoder require any modification?

No, the decoder remains the same. It is identical to that of a Simple Autoencoder.

VAE Model Architecture

The encoder architecture has minor changes. It now has two outputs : mu (mean) and log_var (logarithm of variance) as shown below.

Building the Decoder

Since the Decoder remains the same, the Decoder architecture of the Simple Autoencoder is reused.

Attaching the Decoder to the Encoder

Compilation and Training

The loss function is the sum of the RMSE loss and the KL divergence. The RMSE loss is multiplied by a weight known as the loss factor. If the loss factor is too high, the KL term is effectively ignored and the drawbacks of a Simple Autoencoder start to appear. If it is too low, the quality of the reconstructed images will be poor. Hence the loss factor is a hyperparameter that needs to be tuned.
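The trade-off can be seen with a few numbers. In the sketch below, the loss values and the candidate loss factors are made up purely to show how the weight shifts the balance between the two terms; they are not results from this model.

```python
def vae_loss(reconstruction_loss, kl_loss, loss_factor):
    """Weighted sum: loss_factor scales the reconstruction (RMSE) term only."""
    return loss_factor * reconstruction_loss + kl_loss


# Hypothetical per-batch values, chosen only to show the effect of the weight.
rmse_loss = 0.25
kl_loss = 4.0
for loss_factor in (1, 100, 10000):
    total = vae_loss(rmse_loss, kl_loss, loss_factor)
    share = loss_factor * rmse_loss / total
    print(f"loss_factor={loss_factor}: total={total}, reconstruction share={share:.2f}")
```

With a very large loss factor the reconstruction term accounts for almost all of the loss, so the optimizer has little incentive to keep the encodings close to a standard normal distribution.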


The reconstruction process is the same as that of the Simple Autoencoder.

The reconstruction results are quite similar to those of a Simple Autoencoder.

Generating new faces from random vectors sampled from a standard normal distribution.

The VAE is evidently capable of producing new faces from vectors sampled from a standard normal distribution. The fact that a neural network is capable of generating new faces from random noise shows how powerful it is in performing extremely complex mappings!

As it is impossible to visualize a 200-dimensional vector, some of the elements of the latent vector are individually visualized to see if they are close to a standard normal distribution.

It is observed that the first 50 elements of the z-dimensional latent vector are distributed very similarly to a standard normal distribution. The addition of the KL divergence term is therefore justified.
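Alongside the plots, a quick numerical check is to compute the mean and standard deviation of each latent dimension across many encoded images; values near 0 and 1 respectively are consistent with a standard normal distribution. In the sketch below, randomly generated vectors stand in for real encodings (an assumption, since the trained encoder is not available here), and the function name is hypothetical.

```python
import random
import statistics


def latent_statistics(latent_vectors):
    """Return (mean, standard deviation) for every latent dimension."""
    columns = zip(*latent_vectors)
    return [(statistics.mean(c), statistics.pstdev(c)) for c in columns]


# Stand-in for encoder outputs: 2000 vectors with 3 latent dimensions each.
rng = random.Random(0)
encodings = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(2000)]
stats = latent_statistics(encodings)
for index, (mean, std) in enumerate(stats):
    print(f"dimension {index}: mean={mean:.3f}, std={std:.3f}")
```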


Even though a Variational Autoencoder solves some of the major drawbacks of a Simple Autoencoder, the images it produces are quite soft. Current state-of-the-art generative models are Generative Adversarial Networks (GANs). I am currently in the process of making a tutorial for GANs as well, so stay tuned!

The notebook for this tutorial can be accessed from my GitHub repository.

For diving deeper into generative modelling, I would recommend Ian Goodfellow’s NIPS 2016 Tutorial: Generative Adversarial Networks.

Feel free to connect with me on LinkedIn and GitHub.

Thanks for reading!