Source: Deep Learning on Medium

#### A deep learning perspective

Autoencoders are symmetric networks used for unsupervised learning, where the output units are connected back to the input units. The output layer has the same size as the input layer because its purpose is to reconstruct its own inputs rather than predict a dependent target value [1]. The difference between a vanilla autoencoder and a variational autoencoder is that the latter is useful for generative modeling: its latent space is continuous, which allows random sampling and interpolation. For more information about the differences, see [2].

### Parts of an Autoencoder

- Encoder
- Code
- Decoder
- Loss Function

**Encoder** [*q_θ*(*z*∣*x*)]: its input is a datapoint *x*, its output is a hidden representation *z*, and it has weights and biases *θ*.

**Decoder** [*p_ϕ*(*x*∣*z*)]: its input is the representation *z*, it outputs the parameters of the probability distribution of the data, and it has weights and biases *ϕ*.
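To make the two halves concrete, here is a minimal NumPy sketch of one encoder/decoder pass. All dimensions and weights are hypothetical placeholders; a real VAE would use a deep-learning framework with nonlinear layers, and the encoder shown outputs the mean and log-variance of a Gaussian *q_θ*(*z*∣*x*), sampled via the reparameterization trick:

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, W_mu, W_logvar):
    """q_theta(z|x): map a datapoint x to the parameters (mean,
    log-variance) of a Gaussian over the latent code z."""
    mu = x @ W_mu
    log_var = x @ W_logvar
    return mu, log_var

def decoder(z, W_dec):
    """p_phi(x|z): map a latent code z to the parameters of the data
    distribution (here, Bernoulli probabilities, one per pixel)."""
    logits = z @ W_dec
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid -> values in (0, 1)

# Toy sizes (illustrative only): 8-dimensional input, 2-dimensional latent space.
x = rng.random(8)
W_mu, W_logvar = rng.normal(size=(8, 2)), rng.normal(size=(8, 2))
W_dec = rng.normal(size=(2, 8))

mu, log_var = encoder(x, W_mu, W_logvar)
# Reparameterization: z = mu + sigma * eps, with eps ~ N(0, I).
z = mu + np.exp(0.5 * log_var) * rng.standard_normal(2)
x_recon = decoder(z, W_dec)
```

Note that the decoder returns distribution parameters (per-pixel Bernoulli probabilities), not a reconstructed image directly, matching the description above.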

**Loss Function**: the negative log-likelihood with a regularizer. For the *i*-th datapoint it is

l_i(θ, ϕ) = −E_{z∼q_θ(z∣x_i)}[log p_ϕ(x_i∣z)] + KL(q_θ(z∣x_i) ∥ p(z))

The first term is the reconstruction loss, or expected negative log-likelihood of the *i*-th datapoint. The expectation is taken with respect to the encoder's distribution over the representations. The second term is a regularizer: the **Kullback-Leibler divergence** between the encoder's distribution *q_θ*(*z*∣*x*) and *p*(*z*). This divergence measures how much information is lost (in units of nats) when using *q* to represent *p*; it is one measure of how close *q* is to *p* [3]. The KL divergence is defined as the relative entropy between the probability density functions *q* and *p*.
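Under the common modeling choices of a diagonal-Gaussian *q_θ*(*z*∣*x*) and a standard-normal prior *p*(*z*), both terms of the loss have simple forms. A sketch under those assumptions (the numbers are arbitrary, for illustration only):

```python
import numpy as np

def reconstruction_nll(x, p):
    """Negative log-likelihood of binary data x under a Bernoulli decoder
    with per-pixel probabilities p (single-sample estimate of the expectation)."""
    eps = 1e-7  # avoid log(0)
    return -np.sum(x * np.log(p + eps) + (1 - x) * np.log(1 - p + eps))

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), in nats (closed form)."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

x = np.array([1.0, 0.0, 1.0])   # a tiny binary "image" with 3 pixels
p = np.array([0.9, 0.2, 0.8])   # decoder output p_phi(x|z)
mu = np.array([0.5, -0.5])      # encoder output: mean of q_theta(z|x)
log_var = np.array([0.0, 0.0])  # encoder output: log-variance (unit variance)

loss = reconstruction_nll(x, p) + kl_to_standard_normal(mu, log_var)
```

The KL term vanishes exactly when the encoder outputs the prior itself (mu = 0, log_var = 0), which is what makes it act as a regularizer pulling *q* toward *p*(*z*).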

#### A probabilistic perspective

This process of teasing out a mapping from input to hidden representation is called *representation learning*. With the Bayesian perspective, the encoder becomes a variational *inference network*, mapping observed inputs to (approximate) posterior distributions over latent space, and the decoder becomes a *generative network*, capable of mapping arbitrary latent coordinates back to distributions over the original data space [4].

Whereas a vanilla autoencoder is deterministic, a Variational Autoencoder is stochastic — a mashup of:

- a *probabilistic encoder* *q_θ*(*z*∣*x*)
- a *generative decoder* *p_ϕ*(*x*∣*z*)

In other words, a VAE represents a directed *probabilistic graphical model*, in which approximate inference is performed by the encoder and optimized alongside an easy-to-sample generative decoder. These complementary halves are also known as the *inference* (or *recognition*) *network* and the *generative network*.

We can write the joint probability of the model as *p*(*x*,*z*)=*p*(*x*∣*z*)*p*(*z*). The generative process is as follows.

The latent variables are drawn from a prior *p*(*z*). The data *x* have a likelihood *p*(*x*∣*z*) that is conditioned on latent variables *z*. The model defines a joint probability distribution over data and latent variables: *p*(*x*,*z*). We can decompose this into the likelihood and prior: *p*(*x*,*z*)=*p*(*x*∣*z*)*p*(*z*). For black and white digits, the likelihood is Bernoulli distributed.
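This generative process can be sketched directly: draw *z* from the prior, then draw each pixel from the Bernoulli likelihood. The decoder weights below are random placeholders, so the samples are not meaningful digits; the point is only the two-step sampling structure:

```python
import numpy as np

rng = np.random.default_rng(1)
W_dec = rng.normal(size=(2, 4))  # placeholder decoder weights (2-d latent, 4 pixels)

# 1. Draw latent variables from the prior p(z) = N(0, I).
z = rng.standard_normal(2)

# 2. The likelihood p(x|z) is Bernoulli, with probabilities given by the decoder.
probs = 1.0 / (1.0 + np.exp(-(z @ W_dec)))

# 3. Draw a black-and-white "digit": x ~ Bernoulli(probs).
x = rng.binomial(1, probs)
```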

The goal is to infer good values of the latent variables given observed data, or to calculate the posterior *p*(*z*∣*x*). According to Bayes' theorem:

*p*(*z*∣*x*) = *p*(*x*∣*z*)*p*(*z*) / *p*(*x*)

The denominator *p*(*x*) is called the evidence, and it can be calculated by marginalizing out the latent variables:

*p*(*x*) = ∫ *p*(*x*∣*z*)*p*(*z*) d*z*

Computing this integral exactly requires evaluating it over all configurations of the latent variables, which is intractable in general. As a consequence, we need to approximate the posterior distribution.
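One way to build intuition for the evidence integral is a toy one-dimensional model where it is tractable: with a standard-normal prior and a unit-variance Gaussian likelihood, the marginal is known in closed form, and a naive Monte Carlo estimate (sampling *z* from the prior) can be checked against it. All parameters here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy 1-D model: p(z) = N(0, 1), p(x|z) = N(z, 1), observed x = 1.0.
x = 1.0

def normal_pdf(v, mean, var):
    return np.exp(-0.5 * (v - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Monte Carlo estimate of the evidence: p(x) ≈ mean_i p(x | z_i), z_i ~ p(z).
z_samples = rng.standard_normal(100_000)
evidence_mc = normal_pdf(x, z_samples, 1.0).mean()

# Exact evidence: marginalizing the N(0,1) prior against the N(z,1)
# likelihood gives x ~ N(0, 2).
evidence_exact = normal_pdf(x, 0.0, 2.0)
```

In one dimension many samples land where the likelihood is large, so the estimate converges quickly; in high-dimensional latent spaces almost no prior samples explain *x*, which is part of why approximate inference is needed instead.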

**Variational inference** approximates the posterior with a family of distributions *q_λ*(*z*∣*x*). The variational parameter *λ* indexes the family of distributions. How can we know how well our variational posterior *q_λ*(*z*∣*x*) approximates the true posterior *p*(*z*∣*x*)? We can use the Kullback-Leibler divergence:

KL(*q_λ*(*z*∣*x*) ∥ *p*(*z*∣*x*)) = E_*q*[log *q_λ*(*z*∣*x*)] − E_*q*[log *p*(*x*,*z*)] + log *p*(*x*)

We can define the ELBO (Evidence Lower BOund) as:

ELBO(*λ*) = E_*q*[log *p*(*x*,*z*)] − E_*q*[log *q_λ*(*z*∣*x*)]

Minimizing the Kullback-Leibler divergence is equivalent to maximizing the ELBO, because log *p*(*x*) = ELBO(*λ*) + KL(*q_λ*(*z*∣*x*) ∥ *p*(*z*∣*x*)) and the evidence log *p*(*x*) does not depend on *λ*. Unlike the KL divergence, the ELBO can be computed, so this makes approximate posterior inference computationally feasible [3]. For more information about *variational inference* you can visit [6].
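The identity log *p*(*x*) = ELBO + KL(*q* ∥ *p*(*z*∣*x*)) can be verified exactly in a toy discrete model where the evidence is tractable. Here the latent *z* is binary and all probabilities are arbitrary numbers chosen for illustration:

```python
import numpy as np

# Toy model: z in {0, 1}, with prior p(z) and likelihood p(x|z) for one fixed x.
prior = np.array([0.6, 0.4])       # p(z)
likelihood = np.array([0.2, 0.7])  # p(x|z) for the observed x
joint = likelihood * prior         # p(x, z) = p(x|z) p(z)

evidence = joint.sum()             # p(x) = sum_z p(x|z) p(z)
posterior = joint / evidence       # exact p(z|x), available in this tiny model

q = np.array([0.5, 0.5])           # an arbitrary variational distribution q(z|x)

# ELBO = E_q[log p(x, z)] - E_q[log q(z|x)]
elbo = np.sum(q * (np.log(joint) - np.log(q)))
# KL(q || p(z|x))
kl = np.sum(q * (np.log(q) - np.log(posterior)))
```

Since the KL term is non-negative, the ELBO is indeed a lower bound on log *p*(*x*), and it becomes tight exactly when *q* equals the true posterior.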

### Resources

1. Zocca, V.; Spacagna, G.; Slater, D.; Roelants, P. *Python Deep Learning*, 2017
2. Intuitively Understanding Variational Autoencoders: http://bit.ly/2TW3gNX
3. Tutorial — What is a variational autoencoder? http://bit.ly/2v7PBrD
4. Introducing Variational Autoencoders (in Prose and Code): http://bit.ly/2AuM89j
5. Under the hood of the Variational Autoencoder (in Prose and Code): http://bit.ly/2ABQhIq
6. Variational Inference: A Review for Statisticians: https://arxiv.org/abs/1601.00670