Original article was published by Branislav Holländer on Deep Learning on Medium
Autoencoders: Overview of Research and Applications
Since the early days of machine learning, researchers have tried to learn good representations of data in an unsupervised manner. The hypothesis underlying this effort is that disentangled representations translate well to downstream supervised tasks. For example, if a person is told that a Tesla is a car and already has a good representation of what a car looks like, they can probably pick out a photo of a Tesla among photos of houses without ever having seen one.
Most early representation learning ideas revolve around linear models such as factor analysis, Principal Components Analysis (PCA) or sparse coding. Since these approaches are linear, they may not be able to find disentangled representations of complex data such as images or text. Especially in the context of images, simple transformations such as change of lighting may have very complex relationships to the pixel intensities. Therefore, there is a need for deep non-linear encoders and decoders, transforming data into its hidden (hopefully disentangled) representation and back.
Autoencoders are neural network models designed to learn complex non-linear relationships between data points. Usually, autoencoders consist of multiple neural network layers and are trained to reconstruct the input at the output (hence the name autoencoder). In this post, I will try to give an overview of the various types of autoencoders developed over the years and their applications.
What Autoencoders Do
In general, the assumption behind autoencoders is that highly complex input data can be described much more succinctly if we correctly take into account the geometry of the data points. Consider, for instance, the so-called “swiss roll” manifold depicted in Figure 1. Although the data originally lies in 3-D space, it can be described more compactly by “unrolling” the roll and laying it out on the floor (2-D). Note that a linear transformation of the swiss roll is not able to unroll the manifold. However, autoencoders are able to learn the (possibly very complicated) non-linear transformation function.
A simple way to make the autoencoder learn a low-dimensional representation of the input is to constrain the number of nodes in the hidden layer. Since the autoencoder now has to reconstruct the input using a restricted number of nodes, it will try to learn the most important aspects of the input and ignore the slight variations (i.e. noise) in the data.
In order to implement an undercomplete autoencoder, at least one hidden fully-connected layer is required. Most autoencoder architectures nowadays actually employ multiple hidden layers in order to make the architecture deeper. Empirically, deeper architectures are able to learn better representations and achieve better generalization. It is also customary to use the same number and size of layers in the encoder and decoder, making the architecture symmetric.
Undercomplete autoencoders do not necessarily need to use any explicit regularization term, since the network architecture already provides such regularization. However, we should nevertheless be careful about the actual capacity of the model in order to prevent it from simply memorizing the input data. One regularization option is to bind the parameters of the encoder and decoder together by simply using the transpose of the encoder weight matrix in the corresponding layer in the decoder.
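To make the weight-tying idea concrete, here is a minimal NumPy sketch of the forward pass of a single-hidden-layer undercomplete autoencoder. The layer sizes, the tanh activation and all names (`W`, `encode`, `decode`) are my own assumptions for illustration, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

n_inputs, n_hidden = 8, 3  # hidden layer smaller than the input (undercomplete)

# One weight matrix shared by encoder and decoder (tied weights).
W = rng.normal(scale=0.1, size=(n_hidden, n_inputs))
b_enc = np.zeros(n_hidden)
b_dec = np.zeros(n_inputs)

def encode(x):
    """Map inputs of shape (batch, n_inputs) to codes of shape (batch, n_hidden)."""
    return np.tanh(x @ W.T + b_enc)

def decode(h):
    """Map codes back to input space using the transpose of the encoder weights."""
    return h @ W + b_dec

x = rng.normal(size=(5, n_inputs))
h = encode(x)
x_hat = decode(h)

# Mean squared reconstruction error, the quantity training would minimize.
reconstruction_error = np.mean((x - x_hat) ** 2)
```

Because the decoder reuses `W`, the model has roughly half as many weight parameters as an untied autoencoder of the same shape.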
Applications of undercomplete autoencoders include compression, recommendation systems as well as outlier detection. Outlier detection works by checking the reconstruction error of the autoencoder: if the autoencoder is able to reconstruct the test input well, it is likely drawn from the same distribution as the training data. If the reconstruction is bad, however, the data point is likely an outlier, since the autoencoder didn’t learn to reconstruct it properly.
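The outlier test can be sketched in a few lines. As a stand-in for a trained network, this example assumes a linear autoencoder whose tied weights are obtained from an SVD (the optimal linear autoencoder spans the same subspace as PCA). The synthetic data and the 99th-percentile threshold are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data lying close to a 2-D plane inside 10-D space.
basis = rng.normal(size=(2, 10))
train = rng.normal(size=(500, 2)) @ basis + 0.01 * rng.normal(size=(500, 10))
mean = train.mean(axis=0)

# Stand-in for training: take the top-2 principal directions as tied weights.
_, _, vt = np.linalg.svd(train - mean, full_matrices=False)
W = vt[:2]  # shape (2, 10)

def reconstruction_error(x):
    """Squared reconstruction error per row of x."""
    code = (x - mean) @ W.T   # encode
    x_hat = code @ W + mean   # decode
    return np.sum((x - x_hat) ** 2, axis=-1)

# Hypothetical threshold: 99th percentile of the training errors.
threshold = np.percentile(reconstruction_error(train), 99)

inlier = rng.normal(size=2) @ basis   # lies on the training plane
outlier = 3.0 * rng.normal(size=10)   # generic point off the plane

is_outlier = reconstruction_error(np.stack([inlier, outlier])) > threshold
```

In a real application the threshold would be tuned on held-out data, since it directly trades false alarms against missed outliers.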
Since convolutional neural networks (CNN) perform well at many computer vision tasks, it is natural to consider convolutional layers for an image autoencoder. Usually, pooling layers are used in convolutional autoencoders alongside convolutional layers to reduce the size of the hidden representation layer. The hidden layer is often preceded by a fully-connected layer in the encoder and it is reshaped to a proper size before the decoding step. Since the output of the convolutional autoencoder has to have the same size as the input, we have to resize the hidden layers. In principle, we can do this in two ways:
- Upsampling the hidden layer before every convolutional layer, e.g. with bilinear interpolation, or
- Using specialized transposed convolution layers to perform a trainable form of upsampling.
The second option is more principled and usually provides better results. However, it also increases the number of parameters of the network and may not be suitable for all kinds of problems, especially if there is not enough training data available.
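The difference between the two options is easiest to see in one dimension. The sketch below is only illustrative: it uses nearest-neighbour repetition instead of bilinear interpolation to keep the code short, and `transposed_conv1d` is a hand-written helper, not a library function.

```python
import numpy as np

def upsample_nearest(x, factor=2):
    """Fixed (non-trainable) upsampling: repeat every element `factor` times."""
    return np.repeat(x, factor)

def transposed_conv1d(x, kernel, stride=2):
    """Trainable upsampling: each input value stamps a scaled copy of `kernel`
    into the output, and consecutive stamps are shifted by `stride`."""
    out = np.zeros((len(x) - 1) * stride + len(kernel))
    for i, value in enumerate(x):
        out[i * stride : i * stride + len(kernel)] += value * kernel
    return out

x = np.array([1.0, 2.0, 3.0])

fixed = upsample_nearest(x)  # [1, 1, 2, 2, 3, 3]

# With the kernel [1, 1] the transposed convolution reproduces the fixed upsampling.
learned = transposed_conv1d(x, kernel=np.array([1.0, 1.0]))

# A different kernel gives a smoother, overlapping result; in a network these
# kernel values would be learned by backpropagation.
smooth = transposed_conv1d(x, kernel=np.array([0.5, 1.0, 0.5]))
```

The kernel entries are exactly the extra parameters mentioned above: the fixed upsampling has none, while the transposed convolution has one per kernel tap (and per channel pair in a real layer).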
Convolutional autoencoders are frequently used in image compression and denoising. In the case of denoising, the network is called a denoising autoencoder and it is trained differently from the standard autoencoder: instead of reconstructing its own input, the network receives an input corrupted by an appropriate noise signal (e.g. Gaussian noise) and is trained to reconstruct the original, uncorrupted input.
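The only change relative to a standard autoencoder is which tensor goes in and which one the output is compared against. A small sketch, assuming pixel values in [0, 1] and additive Gaussian noise; the function names and the placeholder "model" are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_denoising_pair(clean, noise_std=0.1):
    """Return (corrupted input, clean target) for one training batch."""
    noisy = clean + rng.normal(scale=noise_std, size=clean.shape)
    return np.clip(noisy, 0.0, 1.0), clean

def denoising_loss(autoencoder, clean):
    """Feed the noisy image in, but measure the error against the clean image."""
    noisy, target = make_denoising_pair(clean)
    return np.mean((autoencoder(noisy) - target) ** 2)

images = rng.uniform(size=(4, 28, 28))  # placeholder batch of images

def identity(x):
    """Trivial stand-in for a model; it performs no denoising at all."""
    return x

loss = denoising_loss(identity, images)
```

A network that merely copies its input, like `identity` here, keeps a non-zero loss, which is why this training scheme discourages the trivial solution.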
Convolutional autoencoders may also be used in image search applications, since the hidden representation often carries semantic meaning. Therefore, similarity search on the hidden representations yields better results than similarity search on the raw image pixels. It is also significantly faster, since the hidden representation is usually much smaller.
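A similarity search over hidden representations can be as simple as the following sketch. It assumes the codes have already been computed by some encoder (random vectors stand in for them here) and uses cosine similarity, which is one common choice among several.

```python
import numpy as np

def nearest_neighbours(query_code, database_codes, k=3):
    """Return indices of the k database codes most similar to the query."""
    q = query_code / np.linalg.norm(query_code)
    db = database_codes / np.linalg.norm(database_codes, axis=1, keepdims=True)
    similarity = db @ q  # cosine similarity to every database entry
    return np.argsort(-similarity)[:k]

rng = np.random.default_rng(0)
codes = rng.normal(size=(1000, 32))             # placeholder for encoder outputs
query = codes[42] + 0.01 * rng.normal(size=32)  # slightly perturbed copy of item 42

top = nearest_neighbours(query, codes)
```

Each comparison here touches 32 numbers instead of every pixel of an image, which is where the speed advantage comes from.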
As I already mentioned, undercomplete autoencoders use an implicit regularization by constricting the size of the hidden layers compared to the input and output. Sparse autoencoders now introduce an explicit regularization term for the hidden layer. Therefore, the restriction that the hidden layer must be smaller than the input is lifted and we may even think of overcomplete autoencoders with hidden layer sizes that are larger than the input, but optimal in some other sense.
For example, we might introduce an L1 penalty on the hidden layer to obtain a sparse distributed representation of the data distribution. This will force the autoencoder to select only a few nodes in the hidden layer to represent the input data. Note that this penalty is qualitatively different from the usual L2 or L1 penalties introduced on the weights of neural networks during training. In this case we restrict the hidden layer values instead of the weights. In contrast to weight decay, this procedure is not quite as theoretically founded, with no clear underlying probabilistic description. However, it is an intuitive idea and it works very well in practice.
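In code, the distinction is simply which array the L1 norm is applied to. A minimal sketch, where the penalty weight `l1_weight` and the function name are assumptions for illustration:

```python
import numpy as np

def sparse_autoencoder_loss(x, x_hat, h, l1_weight=1e-3):
    """Reconstruction error plus an L1 penalty on the hidden ACTIVATIONS h."""
    reconstruction = np.mean((x - x_hat) ** 2)
    # L1 norm of each sample's hidden activations, averaged over the batch.
    sparsity = np.mean(np.sum(np.abs(h), axis=1))
    return reconstruction + l1_weight * sparsity

# Tiny worked example: perfect reconstruction, activations (1, -2, 0).
x = np.zeros((1, 4))
x_hat = np.zeros((1, 4))
h = np.array([[1.0, -2.0, 0.0]])

loss = sparse_autoencoder_loss(x, x_hat, h)  # 0 + 1e-3 * (1 + 2 + 0)
```

A weight penalty would instead sum over the entries of the weight matrices and would not depend on the input batch at all.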
Another penalty we might use is the KL-divergence. In this case, we introduce a sparsity parameter ρ (typically something like 0.005 or another very small value) that denotes the desired average activation of a neuron over a collection of samples. In our case, ρ is assumed to be the parameter of a Bernoulli distribution describing the average activation. We also calculate ρ̂_j, the observed average activation of hidden neuron j over the training examples. The KL-divergence between the two Bernoulli distributions, summed over the hidden neurons, is given by:

$$\sum_{j=1}^{s_2} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = \sum_{j=1}^{s_2} \left[ \rho \log \frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_j} \right]$$

where s₂ is the number of neurons in the hidden layer. This is a differentiable function and may be added to the loss function as a penalty.
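The KL sparsity penalty can be computed directly from a batch of hidden activations. The sketch below assumes sigmoid activations in (0, 1), so that the batch average can be read as a Bernoulli parameter. The clipping constant `eps` is my addition for numerical safety.

```python
import numpy as np

def kl_sparsity_penalty(hidden, rho=0.005, eps=1e-8):
    """Sum over hidden units of KL(Bernoulli(rho) || Bernoulli(rho_hat_j)).

    hidden: activations in (0, 1) with shape (batch, s2).
    """
    # Average activation of each hidden unit over the batch.
    rho_hat = np.clip(hidden.mean(axis=0), eps, 1.0 - eps)
    kl = (rho * np.log(rho / rho_hat)
          + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))
    return kl.sum()

on_target = kl_sparsity_penalty(np.full((10, 4), 0.005))  # average equals rho
too_active = kl_sparsity_penalty(np.full((10, 4), 0.5))   # average far above rho
```

The penalty vanishes when every unit's average activation equals ρ and grows as the averages drift away from it.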
An interesting approach to regularizing autoencoders is given by the assumption that for very similar inputs, the hidden representations should also be similar. We can enforce this assumption by requiring that the derivative of the hidden layer activations with respect to the input is small. This makes sure that small variations of the input are mapped to small variations in the hidden layer. The name contractive autoencoder comes from the fact that we are trying to contract a small cluster of inputs to a small cluster of hidden representations.
Specifically, we include a term in the loss function which penalizes the (squared) Frobenius norm (matrix L2-norm) of the Jacobian of the hidden activations w.r.t. the inputs:

$$\| J_h(x) \|_F^2 = \sum_{i,j} \left( \frac{\partial h_j(x)}{\partial x_i} \right)^2$$

Here, h_j denotes the hidden activations, x_i the inputs and ||·||_F the Frobenius norm.
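For a single sigmoid hidden layer the Jacobian penalty has a cheap closed form, because the Jacobian is just the weight matrix with each row scaled by h_j (1 - h_j). The sketch below assumes exactly that architecture (h = sigmoid(W x + b)) and checks the shortcut against the explicitly built Jacobian. All names are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def contractive_penalty(x, W, b):
    """Squared Frobenius norm of dh/dx for h = sigmoid(W @ x + b), one input x."""
    h = sigmoid(W @ x + b)
    # dh_j/dx_i = h_j * (1 - h_j) * W[j, i], so the sum factorizes per hidden unit.
    return np.sum((h * (1.0 - h)) ** 2 * np.sum(W ** 2, axis=1))

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))
b = rng.normal(size=4)
x = rng.normal(size=6)

# Reference value: build the full (4, 6) Jacobian and sum its squared entries.
h = sigmoid(W @ x + b)
jacobian = (h * (1.0 - h))[:, None] * W
explicit = np.sum(jacobian ** 2)

penalty = contractive_penalty(x, W, b)
```

For deeper encoders no such shortcut exists in general, and the Jacobian is usually obtained through automatic differentiation.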
Variational Autoencoders (VAEs)
The crucial difference between variational autoencoders and other types of autoencoders is that VAEs view the hidden representation as a latent variable with its own prior distribution. This gives them a proper Bayesian interpretation. Variational autoencoders are generative models with properly defined prior and posterior data distributions.
More specifically, the variational autoencoder models the joint probability of the input data and the latent representation as p(x, z) = p(x|z) p(z). The generative process is defined by drawing a latent variable from p(z) and passing it through the decoder given by p(x|z). As with the other autoencoder types, the decoder is a learned parametric function.
In order to find the optimal hidden representation of the input (the encoder), we have to calculate p(z|x) = p(x|z) p(z) / p(x) according to Bayes’ Theorem. The issue with applying this formula directly is that the denominator requires us to marginalize over the latent variables. In other words, we have to compute the integral over all possible latent variable configurations. This is usually intractable. Instead, we turn to variational inference.
In variational inference, we use an approximation q(z|x) of the true posterior p(z|x). q(z|x) is explicitly designed to be tractable. In our case, q will be modeled by the encoder function of the autoencoder. To train the variational autoencoder, we want to maximize the following objective function:

$$\mathcal{L} = \sum_{i=1}^{m} \left[ \mathbb{E}_{q(z|x_i)}\left[ \log p(x_i|z) \right] - \mathrm{KL}\left( q(z|x_i) \,\|\, p(z) \right) \right]$$

We may recognize the first term as the expected log-likelihood of the decoder, estimated in practice with n samples drawn from the approximate posterior q (the encoder). The second term is new for variational autoencoders: it pulls the variational posterior q towards the prior p(z), using the KL-divergence as a measure. Furthermore, q is chosen such that it factorizes over the m training samples, which makes it possible to train using stochastic gradient descent. While this is intuitively understandable, you may also derive this objective rigorously. If you are familiar with Bayesian inference, you may also recognize it as the Evidence Lower BOund (ELBO), so maximizing it maximizes a lower bound on the log-likelihood of the data.
We usually choose a simple distribution as the prior p(z). In many cases, it is simply a univariate Gaussian distribution with mean 0 and variance 1 for each hidden unit, leading to a particularly simple closed form of the KL-divergence. q is also usually chosen as a Gaussian distribution, univariate or multivariate.
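For a diagonal Gaussian q with mean μ and variance σ² per hidden unit and a standard normal prior, the KL term has the well-known closed form -½ Σ (1 + log σ² - μ² - σ²). A small sketch, assuming the encoder outputs the log-variance (a common convention, not something the text above prescribes):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over the latent dimensions."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=-1)

# q equal to the prior: the divergence is zero.
kl_same = kl_to_standard_normal(np.zeros(2), np.zeros(2))

# Unit variance but the mean shifted by 1 in one dimension: KL = 0.5 * 1**2.
kl_shifted = kl_to_standard_normal(np.array([1.0, 0.0]), np.zeros(2))
```

Working with the log-variance keeps the variance positive without any explicit constraint on the encoder output.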
The only thing remaining to discuss now is how to train the variational autoencoder, since the loss function involves sampling from q. The sampling operation is not differentiable. Luckily, the distribution we are trying to sample from is continuous. This allows us to use a trick: instead of backpropagating through the sampling process, we let the encoder generate the parameters of the distribution (in the case of the Gaussian, simply the mean μ and the standard deviation σ). Then we generate a sample ε from the unit Gaussian and shift and rescale it with the generated parameters:

$$z = \mu + \sigma \odot \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)$$

Since we do not need to calculate gradients w.r.t. ε and all other derivatives are well-defined, we are done. This is called the reparameterization trick. Note that the reparameterization trick works for many continuous distributions, not just for Gaussians. Unfortunately, though, it doesn’t work for discrete distributions such as the Bernoulli distribution.
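The trick itself is a one-liner. In this sketch the encoder outputs are replaced by constant arrays, and I assume the encoder produces the log-variance rather than σ directly. NumPy does not compute gradients, so the example only shows that the shifted and rescaled noise has the intended distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """Draw z ~ N(mu, sigma^2) as a deterministic function of (mu, log_var) and noise."""
    eps = rng.standard_normal(size=mu.shape)  # the only source of randomness
    sigma = np.exp(0.5 * log_var)
    return mu + sigma * eps

# Pretend the encoder produced mean 2.0 and standard deviation 0.5 for every sample.
mu = np.full((100_000, 1), 2.0)
log_var = np.full((100_000, 1), np.log(0.25))

z = reparameterize(mu, log_var)
```

In an automatic differentiation framework, gradients flow through `mu` and `sigma` while `eps` is treated as a constant input.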
After training, we have two options: (i) forget about the encoder and generate new samples from the data distribution by sampling latent vectors from the prior and running them through the trained decoder, or (ii) run an input sample through the encoder, the sampling stage and the decoder. If we choose the first option, we will get unconditioned samples from the latent space prior. With the second option, we will get posterior samples conditioned on the input.
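The two usage modes differ only in where the latent vector comes from. In the sketch below, random linear maps stand in for a trained encoder and decoder. The names and sizes are hypothetical, and the outputs are meaningless beyond their shapes.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, data_dim = 2, 5

# Random linear maps as placeholders for trained networks.
W_dec = rng.normal(size=(latent_dim, data_dim))
W_mu = rng.normal(size=(data_dim, latent_dim))
W_log_var = 0.1 * rng.normal(size=(data_dim, latent_dim))

def encoder(x):
    """Return the parameters (mean, log-variance) of q(z|x)."""
    return x @ W_mu, x @ W_log_var

def decoder(z):
    return z @ W_dec

def generate(n_samples):
    """Option (i): sample z from the prior N(0, I) and decode; no encoder involved."""
    z = rng.standard_normal(size=(n_samples, latent_dim))
    return decoder(z)

def reconstruct(x):
    """Option (ii): encode x, sample z from q(z|x), then decode."""
    mu, log_var = encoder(x)
    z = mu + np.exp(0.5 * log_var) * rng.standard_normal(size=mu.shape)
    return decoder(z)

new_samples = generate(3)
x = rng.normal(size=(4, data_dim))
x_posterior = reconstruct(x)
```

Calling `reconstruct` twice on the same input gives different outputs, because a fresh latent sample is drawn each time.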
This already motivates the main application of VAEs: generating new images or sounds similar to the training data. When generating images, one usually uses a convolutional encoder and decoder and a dense latent vector representation. Multiple different versions of variational autoencoders appeared over the years, including Beta-VAEs, which aim to learn particularly disentangled representations, VQ-VAEs, which overcome the limitation of not being able to use discrete distributions, and conditional VAEs, which generate outputs conditioned on a certain label (such as faces with a moustache or glasses). See Figure 3 for an example output of a recent variational autoencoder incarnation.
Although variational autoencoders have fallen out of favor lately due to the rise of other generative models such as GANs, they still retain some advantages, such as the explicit form of the prior distribution.
Autoencoders form a very interesting group of neural network architectures with many applications in computer vision, natural language processing and other fields. Although there are certainly other classes of models used for representation learning nowadays, such as siamese networks, autoencoders remain a good option for a variety of problems and I still expect a lot of improvements in this field in the near future.