Auto-Encoders And The Battle Of Generations

Source: Deep Learning on Medium

The building architecture for an auto-encoder which reproduces an MNIST digit

Let's split our understanding into two parts, using an analogy: you are at the airport with a suitcase, and you have to clear the security checks before you board your flight.

1. The Encoder: Think of an encoder as two security guards who only let through the things in your suitcase that are relevant and useful, and throw out everything irrelevant. This reduces the weight of your suitcase significantly, while making sure that you don't lose everything and can still travel with what remains.

Now take a concrete case: given the MNIST dataset, we feed a 784-sized input (a flattened 28×28 image) into a neural network, and it outputs a vector much smaller than 784, let's say of size 2. This vector of size 2 is called the latent variable.

So What Are Latent Variables Anyway?

In statistics, latent variables (from Latin: present participle of lateo (“lie hidden”), as opposed to observable variables) are variables that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured).

In simple terms, latent variables are projections of the original vectors into a vector space with far fewer dimensions than the original one. We can use these lower-dimensional projections to infer the content of the original vectors. This sounds like Principal Component Analysis, and in spirit it is! The difference is that PCA is restricted to linear projections, while here a neural network learns the most important attributes of the data and condenses them into a smaller dimension.
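
To make the PCA comparison concrete, here is a minimal NumPy sketch that projects 784-dimensional vectors onto their top 2 principal directions and maps them back. The random array only stands in for flattened MNIST images, and the function name `pca_project` is made up for this illustration; it is not part of the auto-encoder built below.

```python
import numpy as np

def pca_project(data, n_components=2):
    """Project the rows of `data` onto their top principal directions."""
    mean = data.mean(axis=0)
    centered = data - mean
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    latent = centered @ components.T            # shape (n_samples, n_components)
    reconstructed = latent @ components + mean  # linear map back to 784 dimensions
    return latent, reconstructed

rng = np.random.default_rng(0)
fake_images = rng.random((100, 784))  # placeholder for flattened 28x28 digits
latent, reconstructed = pca_project(fake_images)
print(latent.shape, reconstructed.shape)
```

The `latent` array plays the role of the size-2 latent variable, and `reconstructed` is what a purely linear "decoder" would give back.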

So you can think of latent variables as a filtered version of the original input that contains only the relevant information, which will help us later in this blog with the decoder part.

Now let's see some code to implement our encoder on the obvious starter dataset, MNIST.


import keras
from keras.layers import Input, Conv2D, Conv2DTranspose, Dense, Flatten
from keras.models import Model

# Input layer: one 28x28 grayscale MNIST image (784 pixels, kept 2D for the convolutions)
image_input = Input(shape=(28, 28, 1))
# Four convolution layers with 3x3 filters each; the two strided layers shrink 28 -> 14 -> 7
x = Conv2D(32, (3, 3), padding='same', activation='relu', name='Conv_1', strides=(1, 1))(image_input)
x = Conv2D(64, (3, 3), padding='same', activation='relu', name='Conv_2', strides=(2, 2))(x)
x = Conv2D(64, (3, 3), padding='same', activation='relu', name='Conv_3', strides=(2, 2))(x)
x = Conv2D(64, (3, 3), padding='same', activation='relu', name='Conv_4', strides=(1, 1))(x)
x = Flatten()(x)
# We output a latent vector of size 2 from a dense layer
latent_output = Dense(2, name='encoder_output')(x)
encoder = Model(image_input, latent_output)

Now that our encoder is in place, we can move forward with our decoder.

2. The Decoder: As the name suggests, a decoder just decodes whatever the encoder outputs. Now what do I mean by that?

Let's say Dr Decoder is looking for evidence at a crime scene. He must use small, minute pieces of evidence to solve a big case. In our case he needs to decode a latent variable of size 2 and make an accurate guess of which digit (0 to 9) it represents; in other words, he needs to reconstruct the original image from the latent variable. Now who is Dr Decoder anyway?

You guessed it right! Dr Decoder is yet another neural network, which learns to reconstruct the original image from the latent variable. Let's look at the architecture.


from keras.layers import Reshape

# The decoder takes a latent vector of size 2 as its input
latent_input = Input(shape=(2,))
output = Dense(7*7*64, name='decoder_dense')(latent_input)
x = Reshape(target_shape=(7, 7, 64))(output)
# A transposed convolution upsamples its input: zeros are padded around and in between
# the input values so that the size grows, and the kernel is then applied to the result.
x = Conv2DTranspose(64, (3, 3), padding='same', activation='relu', name='Conv_1', strides=(1, 1))(x)
x = Conv2DTranspose(64, (3, 3), padding='same', activation='relu', name='Conv_2', strides=(2, 2))(x)
x = Conv2DTranspose(32, (3, 3), padding='same', activation='relu', name='Conv_3', strides=(2, 2))(x)
# Sigmoid keeps each output pixel in [0, 1], which the binary cross-entropy loss expects
output = Conv2DTranspose(1, (3, 3), padding='same', activation='sigmoid', name='Conv_4', strides=(1, 1))(x)
decoder = Model(latent_input, output)
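
The spatial sizes in the encoder and decoder can be checked by hand. With padding='same', a strided convolution produces ceil(size / stride) values per side, and a transposed convolution produces size * stride. The sketch below only does this arithmetic for the strides used above; the helper names are hypothetical and no Keras layers are built.

```python
import math

def conv_same_size(size, stride):
    """Output side length of a 'same'-padded convolution."""
    return math.ceil(size / stride)

def conv_transpose_same_size(size, stride):
    """Output side length of a 'same'-padded transposed convolution."""
    return size * stride

strides = [1, 2, 2, 1]  # strides of Conv_1 .. Conv_4 in both networks

encoder_sizes = []
size = 28  # MNIST images are 28x28
for stride in strides:
    size = conv_same_size(size, stride)
    encoder_sizes.append(size)

decoder_sizes = []
size = 7  # the decoder reshapes its dense output to 7x7x64
for stride in strides:
    size = conv_transpose_same_size(size, stride)
    decoder_sizes.append(size)

print(encoder_sizes, decoder_sizes)
```

This is why the decoder starts from a 7*7*64 dense layer: two stride-2 layers take the encoder from 28 down to 7, and two stride-2 transposed layers take the decoder from 7 back up to 28.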

Now that both the encoder and the decoder are set up, along with the link between them, we can train the entire architecture end to end: the encoder takes in the input image and produces the latent variables, and the decoder takes these latent variables and reconstructs the image from them.

We train it with the “binary_crossentropy” loss and the Adam optimizer, with a learning rate of 0.001, for 20 epochs.
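
For reference, the per-pixel binary cross-entropy can be written out in NumPy. This is a sketch of the formula rather than the Keras implementation; the clipping constant `eps`, the function name and the toy pixel values are assumptions made for this example.

```python
import numpy as np

def binary_crossentropy(x, x_hat, eps=1e-7):
    """Mean binary cross-entropy between target pixels x and predictions x_hat."""
    x_hat = np.clip(x_hat, eps, 1 - eps)  # avoid log(0)
    return float(np.mean(-(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))))

x = np.array([0.0, 1.0, 1.0, 0.0])     # target pixels
good = np.array([0.1, 0.9, 0.8, 0.2])  # reconstruction close to the target
bad = np.array([0.9, 0.1, 0.2, 0.8])   # reconstruction far from the target

loss_good = binary_crossentropy(x, good)
loss_bad = binary_crossentropy(x, bad)
print(loss_good, loss_bad)
```

The closer reconstruction gets the smaller loss, which is the signal that pushes the decoder output towards the input image during training.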

Now let's visualize the output!

Encoder input on the top and decoder output at the bottom

We notice that our auto-encoder has learnt fairly fine-grained features for reconstructing the image. It cannot reconstruct the image perfectly, because we chose a latent variable of dimension 2, which is very small, and the decoder tries its best to reconstruct the image from so little information. Increasing the size of the latent variable gives much better results.

But can we generate digits that are not in our dataset, or that are quite different from those in our dataset?

What does generation even mean? That would be your first question.

“Suppose we have a dataset containing images of horses. We may wish to build a model that can generate a new image of a horse that has never existed but still looks real because the model has learned the general rules that govern the appearance of a horse”.

In the case of MNIST, we need to learn a pixel distribution that is very close to the distribution of the MNIST dataset but not identical to it, so that we can sample from the learnt distribution and generate new digits.

Generated Distribution vs True data distribution

Where do we sample the data points from? That would be your next question.

Well, the latent space that the encoder has learnt over training is the distribution that we sample from to generate new digits.

Let's visualize the latent space that the encoder learnt, using the test data!

Now if I sample a point at random from this space, do you think I will be able to generate anything different from the original dataset?

There are 2 main problems that prevent auto-encoders from generating new images:

  1. Unequal cluster size: If we sample a point at random, the result is much more likely to look like a 0 or a 1 than like the other digits, because their clusters cover the majority of the latent space. This makes it highly unlikely that we generate images with much variation.
  2. Gaps between clusters: Sampling a point at random from the latent space might give a vector that the decoder turns into nothing but random noise. As you can see, there are huge gaps between the clusters, and a point sampled from there results in noise. The latent points do not form a continuous distribution, so the decoder has no idea how to decode latent variables that fall in the blank space.

These two problems rule out variation and random sampling from the latent space as a way to create new images. Only images that the decoder has seen before can be reconstructed, because it knows how to map them from the latent space.

Variational Auto-encoders come to the rescue!

All the problems that we had with the simple auto-encoder can be solved with a Variational Auto-encoder. Let's see how!

A Variational auto-encoder has two probabilistic parts:

  1. Probabilistic Decoder
  2. Probabilistic Encoder


In a simple auto-encoder, each input was mapped to a single fixed point in the latent space. This made sure that there were no variations. But what if we could produce a distribution for each latent variable and let the decoder sample from these distributions? Then we get variation in our outputs, because every sample we draw is stochastic.

Latent variable type for Auto-encoder vs Variational Auto-encoder

In mathematical terms, we want to learn a distribution P(Z|X) over the latent variables Z, where X is our input.

What does it mean to learn a distribution?

It means nothing but learning the parameters of the distribution, which are its mean and variance. How do we learn these? Yet again, through a neural network!

A neural network outputs the mean and variance of the multivariate distribution we are trying to learn. Once we have the mean and variance, we can sample our latent vectors and feed them into the decoder.

According to conditional probability (Bayes' rule), we have
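
Written with the symbols used in this post, Bayes' rule for the posterior over the latent variables takes the following form. This is the standard statement of the rule, given here as a reference:

```latex
P(Z \mid X) = \frac{P(X \mid Z)\, P(Z)}{P(X)}
```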

We assume that the prior P(Z) is a standard Normal distribution, Z~N(0,1).

But can we really solve this?

Not directly. By the law of total probability, the denominator P(X) is obtained by integrating over every possible value of Z. Because Z can have many dimensions, this means integrating over every one of them, which makes the solution intractable.
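
The troublesome term is P(X). By the law of total probability it is an integral over all latent values. A sketch for a d-dimensional Z, where d is a symbol introduced here only for illustration:

```latex
P(X) = \int P(X \mid Z)\, P(Z)\, dZ
     = \int \cdots \int P(X \mid z_1, \dots, z_d)\, P(z_1, \dots, z_d)\, dz_1 \cdots dz_d
```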

Can we think of some trick here?

Yes! Let's take another Gaussian distribution Q(Z|X), which has a tractable form, and use it to approximate P(Z|X). This is called Variational Inference, and we find Q(Z|X) by solving an optimization problem.

Whenever two distributions are involved and we need to measure how similar they are, we go for the KL-Divergence, which measures the difference between two distributions. Here we need to find the KL-Divergence between Q(Z|X) and P(Z|X) and minimize it.

Also, the parameters of the neural network produce the parameters of the distribution Q(Z|X). Remember that!

So now our objective function that we need to minimize becomes:
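
In symbols, and assuming the usual variational-inference convention of putting the approximation Q first, the objective can be sketched as:

```latex
\min_{Q} \; D_{KL}\big(Q(Z \mid X) \,\|\, P(Z \mid X)\big)
  = \min_{Q} \; \mathbb{E}_{Z \sim Q(Z \mid X)}\big[\log Q(Z \mid X) - \log P(Z \mid X)\big]
```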

Working through this takes more space than we have here, so the complete derivation is left to the linked reference.

After all the steps, the final loss function that we get is:
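
A standard way to write this loss, consistent with the two parts described next, is the negative evidence lower bound. It is given here as a sketch in the notation of this post:

```latex
\mathcal{L} = -\,\mathbb{E}_{Z \sim Q(Z \mid X)}\big[\log P(X \mid Z)\big]
            + D_{KL}\big(Q(Z \mid X) \,\|\, P(Z)\big)
```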

Here the first part of the loss is the reconstruction loss, which is equivalent to the (negative) expected log-likelihood of P(X|Z).

The second part is the KL-Divergence between Q(Z|X) and P(Z).

What are the significance and roles of these loss terms?

  1. The reconstruction loss ensures that our model is able to decode the latent variables back into the original input. This is just like the simple auto-encoder above, where inputs get mapped to distinct clusters of latent variables to produce distinct digits. It only helps to describe the inputs; it does not help to generate new samples.
  2. The most important part in terms of generation is the second part, the KL-Divergence. The KL term acts as a regularizer that stops the latent variables from forming widely separated clusters and stops any latent variable from having zero variance. In fact, it penalizes the model for doing so. It makes sure that the encoder does not map the data points into distinct parts of the space with zero variance. This matters because otherwise we would face the same problem of a non-continuous latent space, where generation cannot take place because there is no interpolation between latent points to produce something different.

Let's see what happens when either one of the terms is absent, and when both are present!

In the first case, with only the reconstruction loss, we see the same discontinuous latent space.

In the second case, with only the KL term, the latent space does not learn to reconstruct the original data. Every point seems to have the same characteristics, and sampling any point from this space only results in noise, as there is no clear distinction between points.

The third case, with both terms, is the ideal one: a smooth, continuous yet distinct latent space with proper interpolation, which helps us generate new samples.

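from keras import backend as K

def vae_loss(x, x_decoded_mean):
    # Reconstruction term: binary cross-entropy averaged over the pixels, scaled by the image size
    xent_loss = keras.losses.binary_crossentropy(K.flatten(x), K.flatten(x_decoded_mean))
    xent_loss *= 28 * 28
    # KL term: mean_v and std_v are the encoder outputs; std_v is used as the log-variance
    kl_loss = -0.5 * K.sum(1 + std_v - K.square(mean_v) - K.exp(std_v), axis=1)
    return xent_loss + kl_loss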

Here we multiply our xent_loss (the reconstruction loss) by 28*28, which acts as a weighting factor (it is a hyper-parameter), so that the KL loss does not totally dominate it.
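
The KL term in vae_loss is the closed form of the KL-Divergence between a diagonal Gaussian and N(0,1). Below is a NumPy sketch of the same expression, where `log_var` plays the role of std_v; the function name and the toy inputs are made up for this example.

```python
import numpy as np

def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, exp(log_var)) || N(0, 1) ), summed over the latent dimensions."""
    return -0.5 * np.sum(1 + log_var - np.square(mean) - np.exp(log_var), axis=1)

mean = np.array([[0.0, 0.0],    # already matches the prior
                 [1.0, -1.0]])  # mean shifted away from the prior
log_var = np.zeros((2, 2))      # log-variance 0 means variance 1

kl = kl_to_standard_normal(mean, log_var)
print(kl)
```

The first row matches the prior exactly and gets a penalty of 0, while the second row is penalized for moving its mean away from the origin. This is the pressure that keeps the latent clusters from drifting far apart.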

But how do we train this complex model with backpropagation?

During forward propagation we calculate mean and variance as stated above.

After some mathematical operations, we expand our loss function into several other easy-to-calculate terms, as shown in the linked derivation.

But can we backpropagate at all? I mean, can I write my output as a differentiable function of my input?

The answer is no, because of the randomness that the sampling step between the encoder and the decoder brings.

How do we solve this? Here is another trick.

It is called the reparameterization trick: instead of sampling directly from the learnt mean and variance, I draw an epsilon from a fixed distribution, epsilon~N(0,1), which supplies the randomness, and compute the latent vector as the mean plus the standard deviation times epsilon.

Now we can backpropagate through the mean and variance, as we have moved the stochastic part into epsilon.

def sampling(args):
    mu, log_var = args
    # epsilon carries all the randomness; mu and log_var stay differentiable
    epsilon = K.random_normal(shape=K.shape(mu), mean=0., stddev=1.)
    return mu + K.exp(log_var / 2) * epsilon
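
The same computation can be tried outside Keras. Below is a NumPy sketch under the assumption, as in sampling, that the network outputs the log-variance; `sample_latent` and the numbers fed into it are hypothetical.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Reparameterized sample: z = mu + sigma * epsilon, with epsilon ~ N(0, 1)."""
    epsilon = rng.standard_normal(mu.shape)
    return mu + np.exp(log_var / 2) * epsilon

rng = np.random.default_rng(42)
mu = np.full((10000, 2), 3.0)                # pretend the encoder predicted mean 3
log_var = np.full((10000, 2), np.log(0.25))  # variance 0.25, i.e. std 0.5
z = sample_latent(mu, log_var, rng)
print(z.mean(), z.std())
```

The samples end up centred on the predicted mean with the predicted spread, even though the only random draw is epsilon, which does not depend on any network parameter.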

Encoder and Decoder Architectures


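from keras.layers import BatchNormalization, Dropout, Flatten, Lambda

input_shape = (28, 28, 1)  # MNIST images: 28x28 pixels, one channel
image_input = Input(shape=input_shape)
x = Conv2D(32, (3, 3), padding='same', activation='relu', name='Conv_1', strides=(1, 1))(image_input)
x = BatchNormalization()(x)
x = Dropout(0.3)(x)
x = Conv2D(64, (3, 3), padding='same', activation='relu', name='Conv_2', strides=(2, 2))(x)
x = BatchNormalization()(x)
x = Dropout(0.3)(x)
x = Conv2D(64, (3, 3), padding='same', activation='relu', name='Conv_3', strides=(2, 2))(x)
x = BatchNormalization()(x)
x = Dropout(0.3)(x)
# Assumption: this listing breaks off after "Conv2D(64, (3, 3),", so the remaining
# arguments are taken from Conv_4 of the plain auto-encoder above.
x = Conv2D(64, (3, 3), padding='same', activation='relu', name='Conv_4', strides=(1, 1))(x)
x = Flatten()(x)
# Defining the mean and variance vectors for the distribution
mean_v = Dense(2, name='mean_output')(x)
# std_v is treated as the log-variance by sampling() and vae_loss()
std_v = Dense(2, name='variance_output')(x)
# Sampling layer in the form of a Lambda layer
encoder_output = Lambda(sampling, name='encoder_output')([mean_v, std_v])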


from keras.layers import Reshape

latent_input = Input(shape=(2,))
output = Dense(7*7*64, name='decoder_dense')(latent_input)
x = Reshape(target_shape=(7, 7, 64))(output)
x = Conv2DTranspose(64, (3, 3), padding='same', activation='relu', name='Conv_1', strides=(1, 1))(x)
x = Conv2DTranspose(64, (3, 3), padding='same', activation='relu', name='Conv_2', strides=(2, 2))(x)
x = Conv2DTranspose(32, (3, 3), padding='same', activation='relu', name='Conv_3', strides=(2, 2))(x)
# Sigmoid keeps each output pixel in [0, 1], which the binary cross-entropy loss expects
output = Conv2DTranspose(1, (3, 3), padding='same', activation='sigmoid', name='Conv_4', strides=(1, 1))(x)
decoder = Model(latent_input, output)

Combining to form a single model


We train our model for just 10 epochs with the Adam optimizer.

Let's see our results by generating some images, feeding the decoder samples from N(0,1).

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

n = 15  # figure with 15x15 digits
digit_size = 28
# linearly spaced coordinates on the unit square were transformed
# through the inverse CDF (ppf) of the Gaussian to produce values
# of the latent variables z, since the prior of the latent space
# is Gaussian
z1 = norm.ppf(np.linspace(0.01, 0.99, n))
z2 = norm.ppf(np.linspace(0.01, 0.99, n))
z_grid = np.dstack(np.meshgrid(z1, z2))
x_pred_grid = decoder.predict(z_grid.reshape(n*n, 2)) \
    .reshape(n, n, digit_size, digit_size)
plt.figure(figsize=(10, 10))
plt.imshow(np.block(list(map(list, x_pred_grid))), cmap='gray')
plt.show()
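
To see which latent values that grid covers, the same inverse-CDF spacing can be reproduced with only the standard library. This sketch assumes Python 3.8 or newer for statistics.NormalDist; the variable names are chosen for this illustration.

```python
from statistics import NormalDist

n = 15
std_normal = NormalDist(mu=0.0, sigma=1.0)

# Evenly spaced probabilities between 0.01 and 0.99, like np.linspace(0.01, 0.99, n)
probs = [0.01 + i * (0.99 - 0.01) / (n - 1) for i in range(n)]
# The inverse CDF turns them into latent coordinates that follow the Gaussian prior
z_values = [std_normal.inv_cdf(p) for p in probs]

print(round(z_values[0], 3), round(z_values[-1], 3))
```

The coordinates run from roughly -2.33 to 2.33 and are packed more densely around 0, where the prior puts most of its mass.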

We see a region of our latent space decoded into various samples of images that are not a part of our original data set.

However, I am guessing that these pictures do not make the generation of new images really clear to you!

So let's create fake celebrity faces!

I am not going to extend this blog to showcase that, as it has already been a long one. However, I will put my GitHub repository link in here.

Hope you guys liked it!


1. NPTEL lectures by Mitesh Khapra

2. Ahlad Kumar's lectures on YouTube

3. Jeremy Jordan, for the images