Deep Generative Models

A generative model is a powerful way of learning a data distribution using unsupervised learning, and this family of models has achieved tremendous success in just a few years. All types of generative models aim to learn the true data distribution of the training set so as to generate new data points with some variations. But it is not always possible to learn the exact distribution of our data, either implicitly or explicitly, so we try to model a distribution that is as similar as possible to the true data distribution. For this, we can leverage the power of neural networks to learn a function that brings the model distribution close to the true distribution.

Two of the most commonly used and efficient approaches are Variational Autoencoders (VAE) and Generative Adversarial Networks (GAN). VAE aims at maximizing the lower bound of the data log-likelihood and GAN aims at achieving an equilibrium between Generator and Discriminator. In this blogpost, I will be explaining the working of VAE and GANs and the intuition behind them.

Variational Autoencoder

I am assuming that the reader is already familiar with the working of a vanilla autoencoder. We know that we can use an autoencoder to encode an input image into a much lower-dimensional representation which can store latent information about the input data distribution. But in a vanilla autoencoder, the encoded vector can only be mapped back to the corresponding input using a decoder. It certainly can’t be used to generate similar images with some variability.

To achieve this, the model needs to learn the probability distribution of the training data. VAE is one of the most popular approaches to learning a complicated data distribution, such as that of images, using neural networks in an unsupervised fashion. It is a probabilistic graphical model rooted in Bayesian inference, i.e., the model aims to learn the underlying probability distribution of the training data so that it can easily sample new data from that learned distribution. The idea is to learn a low-dimensional latent representation of the training data, called latent variables (variables which are not directly observed but are rather inferred through a mathematical model), which we assume to have generated our actual training data. These latent variables can store useful information about the type of output the model needs to generate. The probability distribution of the latent variables z is denoted by P(z). A Gaussian distribution is selected as the prior P(z) so that new data points can be sampled easily at inference time.

Now the primary objective is to model the data with some parameters which maximize the likelihood of the training data X. In short, we are assuming that a low-dimensional latent vector has generated our data x (x ∈ X) and that we can map this latent vector to data x using a deterministic function f(z;θ), parameterized by θ, which we need to learn (see fig. 1). Under this generative process, our aim is to maximize the probability of each data point in X, which is given as,

Pθ(X) = ∫ Pθ(X, z) dz = ∫ Pθ(X|z) Pθ(z) dz    (1)

Here, f(z;θ) has been replaced by a distribution Pθ(X|z).

The intuition behind this maximum likelihood estimation is that if the model can generate training samples from these latent variables, then it can also generate similar samples with some variations. In other words, if we sample a large number of latent variables from P(z) and generate x from these variables, then the generated x should match the data distribution Pdata(x). Now we have two questions which we need to answer. How do we capture the distribution of the latent variables, and how do we integrate Equation 1 over all the dimensions of z?

Obviously it is a tedious task to manually specify the relevant information we would like to encode in the latent vector to generate the output image. Rather, we rely on neural networks to compute z, with just the assumption that this latent vector can be well approximated by a normal distribution so that we can sample easily at inference time. If we have a normal distribution of z in n-dimensional space, then it is always possible to generate any kind of distribution using a sufficiently complicated function, and the inverse of this function can be used to learn the latent variables themselves.

In Equation 1, the integration is carried out over all the dimensions of z and is therefore intractable. It can be estimated using Monte Carlo integration, but that is not easy to implement in practice. So we follow another approach to approximately maximize Pθ(X) in Equation 1. The idea of VAE is to infer P(z) using P(z|X), which we don’t know. We infer P(z|X) using a method called variational inference, which is basically an optimization problem in Bayesian statistics. We first model P(z|X) using a simpler distribution Q(z|X) which is easy to find, and we try to minimize the difference between P(z|X) and Q(z|X), measured by the KL-divergence, so that our hypothesis is close to the true distribution. This is followed by a lot of mathematical equations which I will not be explaining here, but you can find them in the original paper. I must say that those equations are not very difficult to understand once you get the intuition behind VAE.
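To make the Monte Carlo remark concrete, here is a minimal sketch in plain Python that estimates the integral in Equation 1 by averaging P(x|z) over samples from the prior. The toy model is an assumption made purely for illustration: z is one-dimensional, f(z) = a * z, and P(x|z) is a Gaussian with standard deviation `noise_std`, chosen because the exact marginal is known and can be compared against. All function names are hypothetical.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian N(mean, std^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def monte_carlo_marginal(x, a, noise_std, num_samples, rng):
    """Estimate P(x) = E_{z ~ N(0, 1)}[P(x | z)] by averaging over prior samples."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)                     # z ~ P(z)
        total += gaussian_pdf(x, a * z, noise_std)  # P(x | z) with f(z) = a * z
    return total / num_samples


def exact_marginal(x, a, noise_std):
    """Closed-form marginal of the toy model: x ~ N(0, a^2 + noise_std^2)."""
    return gaussian_pdf(x, 0.0, math.sqrt(a * a + noise_std * noise_std))


rng = random.Random(0)
estimate = monte_carlo_marginal(x=1.0, a=2.0, noise_std=0.5, num_samples=100_000, rng=rng)
print(estimate, exact_marginal(1.0, 2.0, 0.5))
```

In one dimension the estimate lands close to the exact value. With a high-dimensional z and image data, most prior samples contribute almost nothing to the average, which is the practical reason for sampling z from Q(z|X) instead.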

The final objective function of the VAE is:
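In the notation of this post, the standard form of this objective from the VAE literature can be written as follows. This is a reconstruction that matches the terms discussed in the surrounding text, with Q(z|X) as the approximate posterior:

```latex
\log P(X) - D_{KL}\left[\, Q(z|X) \,\|\, P(z|X) \,\right]
  = \mathbb{E}_{z \sim Q(z|X)}\left[ \log P(X|z) \right]
  - D_{KL}\left[\, Q(z|X) \,\|\, P(z) \,\right]
```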

The above equation has a very nice interpretation. The term Q(z|X) is basically our encoder net, z is our encoded representation of data x (x ∈ X), and P(X|z) is our decoder net. So in the above equation our goal is to maximize the log-likelihood of our data under some error given by DKL[Q(z|X) || P(z|X)]. It can easily be seen that VAE is trying to maximize a lower bound of log(P(X)): P(z|X) is not tractable, but the KL-divergence term is >= 0, so the quantity we optimize can never exceed log(P(X)). This is the same as maximizing E[log P(X|z)] and minimizing DKL[Q(z|X) || P(z)]. Maximizing E[log P(X|z)] is a maximum likelihood estimation and is modeled using the decoder net. As I said earlier, we want our latent representation to be close to Gaussian, and hence we assume P(z) to be N(0, I). Following this assumption, Q(z|X) should also be close to this distribution. If we assume that it is a Gaussian with parameters μ(X) and Σ(X), the error due to the difference between these two distributions, i.e., P(z) and Q(z|X), measured by the KL-divergence has a closed-form solution.
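For reference, the KL-divergence between a Gaussian Q(z|X) = N(μ(X), Σ(X)) and the prior N(0, I) has the following well-known closed form, where k (a symbol introduced here) is the dimensionality of z:

```latex
D_{KL}\left[\, \mathcal{N}(\mu(X), \Sigma(X)) \,\|\, \mathcal{N}(0, I) \,\right]
  = \frac{1}{2} \left( \operatorname{tr}\,\Sigma(X) + \mu(X)^{\top} \mu(X) - k - \log \det \Sigma(X) \right)
```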

Considering that we are optimizing the variational lower bound, our optimization function is:

E[log P(X|z)] − DKL[Q(z|X) ‖ P(z)], where the second term has the closed-form solution for Gaussians discussed above.
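The two terms of this objective can be sketched as a per-example loss in plain Python. Several things here are assumptions for illustration rather than details from this post: Σ(X) is taken to be diagonal and parameterized by its log-variance, the decoder is treated as a unit-variance Gaussian (so the reconstruction term reduces to half the squared error, up to a constant), and the function names are hypothetical.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL[N(mu, diag(exp(log_var))) || N(0, I)] in closed form."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def reconstruction_loss(x, x_hat):
    """Negative log-likelihood of x under a unit-variance Gaussian decoder,
    up to an additive constant: half the squared error."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(x, x_hat))


def vae_loss(x, x_hat, mu, log_var):
    """Negative of the variational lower bound for one example (to be minimized)."""
    return reconstruction_loss(x, x_hat) + kl_to_standard_normal(mu, log_var)


# Perfect reconstruction with Q(z|X) equal to the prior: both terms vanish.
print(vae_loss([0.2, 0.8], [0.2, 0.8], [0.0, 0.0], [0.0, 0.0]))
# Shifting the mean of Q away from zero is penalized by the KL term.
print(kl_to_standard_normal([1.0, 0.0], [0.0, 0.0]))
```

Minimizing this loss is equivalent to maximizing the lower bound. The KL term pulls Q(z|X) towards the prior, while the reconstruction term pulls it towards whatever encoding reproduces x best.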

Hence, our loss function will contain two terms. The first is the reconstruction loss between input and output, and the second is the KL-divergence term. Now we can train the network using the backpropagation algorithm. But there is a problem: the first term depends not only on the parameters of P but also on the parameters of Q, since z is sampled from Q(z|X), and this dependency doesn’t appear explicitly in the above equation. So how do we backpropagate through the layer where we sample z randomly from the distribution Q(z|X), i.e. N(μ(X), Σ(X)), so that P can decode it? Gradients can’t flow through random nodes. We use the reparameterization trick (see fig. 2) to make the network differentiable. We sample from N(μ(X), Σ(X)) by first sampling ε ∼ N(0, I) and then computing z = μ(X) + Σ^(1/2)(X) ∗ ε.
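The sampling step z = μ(X) + Σ^(1/2)(X) ∗ ε can be sketched in a few lines of plain Python. As an assumption for this sketch, Σ(X) is diagonal and represented by its log-variance `log_var`, so Σ^(1/2) is simply exp(0.5 * log_var) per dimension. The function name is hypothetical.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z ~ N(mu, diag(exp(log_var))) as a deterministic function of
    (mu, log_var) plus external noise eps ~ N(0, I)."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    z = [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, log_var, eps)]
    return z, eps


rng = random.Random(0)
mu = [1.0, -2.0]
log_var = [math.log(0.25), math.log(4.0)]  # standard deviations 0.5 and 2.0
samples = [reparameterize(mu, log_var, rng)[0] for _ in range(50_000)]
means = [sum(s[i] for s in samples) / len(samples) for i in range(2)]
print(means)  # empirical means should be close to mu
```

Because the randomness now lives entirely in ε, z is an ordinary differentiable function of μ and the standard deviation: the derivative of z with respect to μ is 1 and with respect to the standard deviation is ε, so gradients can reach the encoder parameters.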

This is shown very clearly in Figure 2. Note that the feedforward step is identical for both of these networks (left and right), but gradients can only backpropagate through the right network.

At inference time, we can simply sample z from N(0, I) and feed it to the decoder net to generate a new data point. Since we are optimizing the variational lower bound, the quality of the generated images is somewhat poor compared to state-of-the-art techniques like Generative Adversarial Networks.

The best thing about VAE is that it learns both a generative model and an inference model. Although both VAE and GANs are very exciting approaches to learning the underlying data distribution using unsupervised learning, GANs yield better results than VAE. In VAE, we optimize the variational lower bound, whereas in GAN there is no such assumption. In fact, GANs don’t deal with any explicit probability density estimation. The failure of VAE to generate sharp images implies that the model is not able to learn the true posterior distribution. VAE and GAN mainly differ in the way they are trained. Let’s now dive into Generative Adversarial Networks.

Generative Adversarial Networks

Yann LeCun says that adversarial training is the coolest thing since sliced bread. Seeing the popularity of Generative Adversarial Networks and the quality of the results they produce, I think most of us would agree with him. Adversarial training has completely changed the way we teach neural networks to do a specific task. Unlike Variational Autoencoders, Generative Adversarial Networks don’t work with any explicit density estimation. Instead, they are based on a game-theoretic approach, with the objective of finding a Nash equilibrium between two networks, the Generator and the Discriminator. The idea is to sample from a simple distribution such as a Gaussian and then learn to transform this noise into the data distribution using universal function approximators such as neural networks.

This is achieved by adversarial training of these two networks. A generator model G learns to capture the data distribution, and a discriminator model D estimates the probability that a sample came from the data distribution rather than the model distribution. Basically, the task of the Generator is to generate natural-looking images, and the task of the Discriminator is to decide whether an image is fake or real. This can be thought of as a minimax two-player game in which the performance of both networks improves over time. In this game, the generator tries to fool the discriminator by generating images that look as real as possible, and the discriminator tries not to be fooled by the generator by improving its discriminative capability. The image below shows the basic architecture of a GAN.

We define a prior P(z) on the input noise variables, and the generator then maps this noise to the data distribution using a complex differentiable function with parameters θg. In addition to this, we have another network called the Discriminator, which takes an input x and, using another differentiable function with parameters θd, outputs a single scalar value denoting the probability that x comes from the true data distribution Pdata(x). The objective function of the GAN is defined as:
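In the notation used here, the standard minimax objective from the original GAN formulation can be written as follows. This is a reconstruction that matches the behaviour of D(x) and D(G(z)) described in the surrounding text:

```latex
\min_{G} \max_{D} \; V(D, G)
  = \mathbb{E}_{x \sim P_{data}(x)}\left[ \log D(x) \right]
  + \mathbb{E}_{z \sim P(z)}\left[ \log\left( 1 - D(G(z)) \right) \right]
```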

In the above equation, if the input to the Discriminator comes from the true data distribution, then D(x) should output 1 to maximize the objective function w.r.t. D, whereas if the image has been generated by the Generator, then D(G(z)) should output 1 to minimize the objective function w.r.t. G. The latter basically implies that G should generate images realistic enough to fool D. We maximize the above function w.r.t. the parameters of the Discriminator using gradient ascent and minimize it w.r.t. the parameters of the Generator using gradient descent. But there is a problem in optimizing the generator objective. At the start of the game, when the generator hasn’t learned anything, the gradient is usually very small, and when it is doing very well, the gradients are very high (see Fig. 4). But we want the opposite behaviour. We therefore maximize E[log D(G(z))] rather than minimizing E[log(1 − D(G(z)))].
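The difference between the two generator objectives can be checked numerically. The sketch below looks at the slope of each objective as a function of the discriminator output d = D(G(z)), which is one way to read the curves described above; the function names are hypothetical.

```python
def saturating_grad(d):
    """Slope of log(1 - d) with respect to d: the original generator objective."""
    return -1.0 / (1.0 - d)


def non_saturating_grad(d):
    """Slope of log(d) with respect to d: the alternative generator objective."""
    return 1.0 / d


# d close to 0: the discriminator easily rejects generated samples (early training).
# d close to 1: the generator is fooling the discriminator.
for d in (0.01, 0.5, 0.99):
    print(d, abs(saturating_grad(d)), abs(non_saturating_grad(d)))
```

With log(1 − D(G(z))) the slope is smallest exactly when the generator is doing badly, whereas log D(G(z)) gives its largest slope in that regime, which is the behaviour we want early in training.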

The training process consists of the simultaneous application of stochastic gradient descent to the Discriminator and the Generator. While training, we alternate between k steps of optimizing D and one step of optimizing G on a mini-batch. Training stops when the Discriminator is unable to distinguish pg from pdata, i.e. D(x; θd) = ½, which happens when pg = pdata.
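The stopping condition D(x) = ½ can be illustrated with a known result from the original GAN analysis: for a fixed generator, the best possible discriminator is D*(x) = pdata(x) / (pdata(x) + pg(x)). The sketch below evaluates this formula for one-dimensional Gaussians, which are an assumption chosen only to keep the example small; the function names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian N(mean, std^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def optimal_discriminator(p_data, p_g):
    """D*(x) = p_data(x) / (p_data(x) + p_g(x)) for a fixed generator."""
    return p_data / (p_data + p_g)


xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
# Generator still off target: p_g is centred at 2 while the data is centred at 0.
early = [optimal_discriminator(gaussian_pdf(x, 0.0, 1.0), gaussian_pdf(x, 2.0, 1.0)) for x in xs]
# Generator matches the data distribution exactly.
matched = [optimal_discriminator(gaussian_pdf(x, 0.0, 1.0), gaussian_pdf(x, 0.0, 1.0)) for x in xs]
print(early)
print(matched)
```

While the two distributions differ, the optimal discriminator is confident in regions where one density dominates. Once pg equals pdata, it can do no better than output ½ everywhere.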

One of the earliest GAN models employing convolutional neural networks was DCGAN, which stands for Deep Convolutional Generative Adversarial Networks. This network takes as input 100 random numbers drawn from a uniform distribution and outputs an image of the desired shape. The network consists of many convolutional, deconvolutional and fully connected layers, and uses the deconvolutional layers to map the input noise to the desired output image. Batch normalization is used to stabilize the training of the network. ReLU activation is used in the generator for all layers except the output layer, which uses tanh, and Leaky ReLU is used for all layers in the Discriminator. The network was trained using mini-batch stochastic gradient descent, and the Adam optimizer with tuned hyperparameters was used to accelerate training. The results of the paper were quite interesting. The authors showed that the generators have interesting vector arithmetic properties, which we can use to manipulate images in the way we want.

One of the most widely used variations of GANs is the conditional GAN (cGAN), which is constructed by simply adding a conditional vector alongside the noise vector (see Fig. 7). Prior to cGAN, we were generating images randomly from random samples of noise z. What if we want to generate an image with some desired features? Is there any way to give the model this extra information about what type of image we want to generate? The answer is yes, and the conditional GAN is the way to do that. By conditioning the model on additional information which is provided to both the generator and the discriminator, it is possible to direct the data generation process. Conditional GANs are used in a variety of tasks such as text-to-image generation, image-to-image translation, automated image tagging, etc. A unified structure of both networks is shown in the diagram below.
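One common way to add the conditional vector alongside the noise vector is plain concatenation with a one-hot class label. The sketch below shows this for the generator input; the 10-class setting, the 100-dimensional uniform noise and the function names are assumptions for illustration, and the discriminator input would be extended with the same condition vector in the same way.

```python
import random


def one_hot(label, num_classes):
    """Encode an integer class label as a one-hot list of floats."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def conditional_input(noise, label, num_classes):
    """Concatenate the noise vector z with the one-hot condition vector y."""
    return list(noise) + one_hot(label, num_classes)


rng = random.Random(0)
z = [rng.uniform(-1.0, 1.0) for _ in range(100)]  # 100-dimensional noise vector
g_input = conditional_input(z, label=3, num_classes=10)
print(len(g_input))  # noise dimensions plus condition dimensions
```

Keeping z fixed and changing only the label then asks the generator for a different class with the same underlying noise, which is what makes the generation process controllable.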

One of the cool things about GANs is that they can be trained even with small training data. The results of GANs are indeed promising, but the training procedure is not trivial, especially when it comes to setting the hyperparameters of the network. Moreover, GANs are difficult to optimize as they don’t converge easily. Of course there are some tips and tricks to hack GANs, but they may not always help. You can find some of these tips here. Also, we don’t have any criteria for the quantitative evaluation of the results, other than checking whether the generated images are perceptually realistic or not.


Deep learning models are achieving human-level performance in supervised learning, but the same is not true for unsupervised learning. Nevertheless, deep learning scientists are working hard to improve the performance of unsupervised models. In this blogpost, we saw how two of the most famous unsupervised learning frameworks for generative models actually work. We got to know the problems with Variational Autoencoders and why adversarial networks are better at producing realistic images. But GANs have problems of their own, such as stabilizing their training, which is still an active area of research. However, GANs are really powerful, and they are currently being used in a variety of tasks such as high-quality image (see this video) and video generation, text-to-image translation, image enhancement, reconstruction of 3D models of objects from images, music generation, cancer drug discovery, etc. Besides this, many deep learning researchers are also working to unify these two models and get the best of both. Seeing the increasing rate of advancement of deep learning, I believe that GANs will open many closed doors in Artificial Intelligence, such as semi-supervised learning and reinforcement learning. In the next few years, generative models are going to be very helpful for graphic design, the design of attractive user interfaces, etc. It may also be possible to generate natural language text using Generative Adversarial Networks.

Deep Generative Models was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

Source: Deep Learning on Medium