Source: Deep Learning on Medium

While researching variational autoencoders (VAEs), I came across the term variational inference. I started to dig deeper into its meaning, and the more I understood it, the more I loved it. Motivated by the beauty of variational Bayesian methods, I decided to write this short introduction in the hope that it motivates other people to read more about them. I also think it provides a nice summary of the important ideas needed to understand the math of VAEs. So, let’s navigate the amazing world of Bayesian variational methods.

The core of Bayesian statistics revolves around the following idea: we have some parameter θ that we are interested in. We express our prior belief or knowledge about the value of θ using a prior distribution, P(θ). The prior distribution captures everything we know about θ before seeing any data. Then we observe some data, Y, that leads us to update our belief about the value of θ. We express the updated belief in the form of a posterior distribution, P(θ|Y). Using Bayes’s formula, we can express the posterior as follows:

$$P(\theta \mid Y) = \frac{P(Y \mid \theta)\, P(\theta)}{P(Y)}$$

The term P(Y|θ) is known as the likelihood function, which can be read informally as how likely we are to observe the data Y for a given value of the parameter θ. P(θ) is the prior and P(Y) is the probability of the data, commonly known as the evidence. We can express P(Y) as the marginal of the joint distribution of the data and the parameter, P(Y, θ). Thus, Bayes’s formula can be rewritten as follows:

$$P(\theta \mid Y) = \frac{P(Y \mid \theta)\, P(\theta)}{\int P(Y, \theta)\, d\theta}$$

In many practical applications, a problem arises when we try to compute the evidence, because the integral over θ is usually hard or impractical to compute. So, instead of computing the exact solution, we use approximations. Two main classes of methods are commonly used: Markov chain Monte Carlo (MCMC) methods and variational Bayesian methods.
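To make the role of the evidence concrete, here is a minimal sketch in which the integral is still easy: a single coin-bias parameter θ on a grid. The uniform prior, the data (7 heads, 3 tails), and all names are assumptions chosen for illustration, not part of the original discussion.

```python
# Grid approximation of Bayes's formula for one parameter theta in (0, 1).
# Assumed setup: uniform prior, Bernoulli likelihood, 7 heads and 3 tails.

def likelihood(theta, heads=7, tails=3):
    """P(Y | theta) for one particular sequence of coin flips."""
    return theta ** heads * (1.0 - theta) ** tails

def prior(theta):
    """Uniform prior density on (0, 1)."""
    return 1.0

n_grid = 10_000
width = 1.0 / n_grid
grid = [(k + 0.5) * width for k in range(n_grid)]  # midpoints of the cells

# Joint P(Y, theta) at every grid point, then the evidence as its integral.
joint = [likelihood(t) * prior(t) for t in grid]
evidence = sum(joint) * width

# Posterior density: joint divided by the evidence.
posterior = [j / evidence for j in joint]
posterior_mean = sum(t * p for t, p in zip(grid, posterior)) * width

print("evidence P(Y):", evidence)
print("posterior mean of theta:", posterior_mean)
```

With one parameter, ten thousand grid points are plenty. With d parameters the same grid would need `n_grid ** d` points, which is one way to see why the evidence quickly becomes impractical to compute.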

Variational inference is based on the following idea: instead of computing the difficult-to-compute posterior, we come up with a family of “nicer” or “simpler” distributions, L, that we can work with. Next, we choose a member of that family that is a good approximation to the posterior. Let’s put it more formally. Assume we have a distribution Q(θ) that is a member of L; we need a member that is close enough to the posterior. So, the question now is how we can say whether a member of L is a good or bad approximation to the posterior, and what we mean by close enough. To do so, we need a function that measures how similar, or dissimilar, two distributions are. One commonly used measure is the Kullback-Leibler (KL) divergence from the member Q(θ) to the posterior (strictly speaking it is not a metric, because it is not symmetric). So, we can frame these ideas in the following formula:

$$Q^{*}(\theta) = \arg\min_{Q \in L} \mathrm{KL}\big(Q(\theta)\,\|\,P(\theta \mid Y)\big)$$

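The previous equation means that we search over the members Q(θ) of L (in practice, over the parameters that define Q) in order to minimize the KL divergence between our approximation and the posterior. However, we still have the same problem: the posterior inside the KL divergence involves the evidence, P(Y), which we cannot compute. It seems like we are moving in a circle, but are we? Well, let’s look more closely at the KL divergence:

$$\mathrm{KL}\big(Q(\theta)\,\|\,P(\theta \mid Y)\big) = \int Q(\theta)\, \log \frac{Q(\theta)}{P(\theta \mid Y)}\, d\theta$$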

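which, if you think about it, is the expectation of the log difference between Q(θ) and P(θ|Y) with respect to Q(θ), that is:

$$\mathrm{KL}\big(Q(\theta)\,\|\,P(\theta \mid Y)\big) = \mathbb{E}_{Q(\theta)}\big[\log Q(\theta) - \log P(\theta \mid Y)\big]$$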
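As a quick numerical check of this expectation view, the sketch below estimates a KL divergence by averaging log Q(θ) - log P(θ) over samples drawn from Q, and compares it with the known closed form for two Gaussians. Both distributions and their parameters are assumptions picked for illustration; here P merely stands in for a posterior, which in a real problem we would not be able to evaluate.

```python
import math
import random

def log_normal_pdf(x, mean, std):
    """Log density of N(mean, std^2) at x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def kl_normal(mean_a, std_a, mean_b, std_b):
    """Closed-form KL( N(mean_a, std_a^2) || N(mean_b, std_b^2) )."""
    return (math.log(std_b / std_a)
            + (std_a ** 2 + (mean_a - mean_b) ** 2) / (2 * std_b ** 2)
            - 0.5)

# Assumed example distributions: Q is the approximation, P plays the posterior.
q_mean, q_std = 0.0, 1.0
p_mean, p_std = 1.0, 1.5

# Monte Carlo estimate of E_Q[log Q(theta) - log P(theta)].
rng = random.Random(0)
n_samples = 100_000
total = 0.0
for _ in range(n_samples):
    theta = rng.gauss(q_mean, q_std)  # theta ~ Q
    total += log_normal_pdf(theta, q_mean, q_std) - log_normal_pdf(theta, p_mean, p_std)
kl_monte_carlo = total / n_samples

kl_exact = kl_normal(q_mean, q_std, p_mean, p_std)
kl_reversed = kl_normal(p_mean, p_std, q_mean, q_std)

print("KL(Q || P), Monte Carlo:", kl_monte_carlo)
print("KL(Q || P), closed form:", kl_exact)
print("KL(P || Q), closed form:", kl_reversed)
```

The last line shows that swapping the arguments gives a different number, so the order of Q and P in the KL divergence matters.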

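Now, using Bayes’s formula in the form P(θ|Y) = P(Y, θ) / P(Y), we get:

$$\mathrm{KL}\big(Q(\theta)\,\|\,P(\theta \mid Y)\big) = \mathbb{E}_{Q(\theta)}\big[\log Q(\theta)\big] - \mathbb{E}_{Q(\theta)}\big[\log P(Y, \theta) - \log P(Y)\big]$$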

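Let’s simplify the second term. Since log P(Y) does not depend on θ and Q(θ) integrates to 1, it can be taken out of the expectation:

$$\mathbb{E}_{Q(\theta)}\big[\log P(Y, \theta) - \log P(Y)\big] = \mathbb{E}_{Q(\theta)}\big[\log P(Y, \theta)\big] - \log P(Y)$$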

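Now, let’s put everything together:

$$\mathrm{KL}\big(Q(\theta)\,\|\,P(\theta \mid Y)\big) = \mathbb{E}_{Q(\theta)}\big[\log Q(\theta)\big] - \mathbb{E}_{Q(\theta)}\big[\log P(Y, \theta)\big] + \log P(Y)$$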

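Let’s rearrange the terms by moving log P(Y) to the other side and then multiplying both sides by -1:

$$\log P(Y) - \mathrm{KL}\big(Q(\theta)\,\|\,P(\theta \mid Y)\big) = \mathbb{E}_{Q(\theta)}\big[\log P(Y, \theta)\big] - \mathbb{E}_{Q(\theta)}\big[\log Q(\theta)\big]$$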

Let’s look at the last equation for a moment. We actually have something really interesting here: the right-hand side does not contain the evidence, P(Y), which was causing the problem before. That is a good thing, because at least we can compute that side of the equation. On the left-hand side, we have the log of the evidence minus the KL divergence. What do we know about that left-hand side? First, log P(Y) is just a constant: it does not depend on Q(θ). What else? The KL divergence is never negative, so we have a constant minus a non-negative value, and hence the constant is always at least as large as the whole left-hand side, i.e. log P(Y) ≥ log P(Y) - KL(Q(θ) || P(θ|Y)). Okay, nice observation, but what does it mean? Let’s take a moment and think about it …

So, this means that although we cannot compute the KL divergence directly, we can compute constant - KL. And here comes the nice part: we can control the value of the constant - KL term by adjusting Q(θ) on the right-hand side of the equation. One might ask why this is nice. Well, because the constant is fixed, the only way to make constant - KL bigger is to make KL smaller, and the only way to make constant - KL smaller is to make KL bigger. That is really interesting; let it sink in for a moment.
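The relationship between the constant, the KL divergence, and the computable right-hand side can be verified on a toy problem where everything is a finite sum. The sketch below assumes a coin bias θ restricted to three values, a uniform prior, and 7 heads out of 10 flips; these choices and the candidate distributions are hypothetical, picked only so that the exact posterior is available for comparison.

```python
import math

# Assumed toy model: theta can only take three values, with a uniform prior.
thetas = [0.2, 0.5, 0.8]
prior = [1 / 3, 1 / 3, 1 / 3]
heads, tails = 7, 3

# Joint P(Y, theta), evidence P(Y), and exact posterior P(theta | Y).
joint = [p * t ** heads * (1 - t) ** tails for p, t in zip(prior, thetas)]
evidence = sum(joint)
posterior = [j / evidence for j in joint]
log_evidence = math.log(evidence)

def kl(q, p):
    """KL(q || p) for two discrete distributions on the same support."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def right_hand_side(q):
    """E_q[log P(Y, theta)] - E_q[log q(theta)]; needs the joint, not the evidence."""
    return sum(qi * (math.log(ji) - math.log(qi)) for qi, ji in zip(q, joint) if qi > 0)

# A few candidate approximations Q, including the exact posterior itself.
candidates = {
    "uniform": [1 / 3, 1 / 3, 1 / 3],
    "skewed": [0.1, 0.3, 0.6],
    "exact posterior": posterior,
}

for name, q in candidates.items():
    print(f"{name}: constant - KL = {log_evidence - kl(q, posterior):.6f}, "
          f"right-hand side = {right_hand_side(q):.6f}, KL = {kl(q, posterior):.6f}")
```

For every candidate the two printed quantities agree, and the candidate with the smallest KL divergence has the largest right-hand side, which never exceeds log P(Y).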

The term constant - KL is known as the evidence lower bound (ELBO). Notice that the ELBO is a functional: it takes a function as input, here the distribution Q(θ), which is a p.d.f. over θ, and returns a number, the value of the ELBO.

Okay, let’s recap. Our initial aim was to minimize the KL divergence between our approximation and the posterior, but we cannot compute it because we cannot compute the evidence. So we came up with a second quantity, the ELBO, that we can compute. The ELBO equals a constant minus the KL divergence, log P(Y) - KL. So, if we maximize the ELBO, we are indirectly minimizing the KL divergence. Voilà: we have found a nice way to make the KL divergence smaller without computing the difficult-to-compute integral. Now, the ELBO is a functional, and how do we increase or decrease a functional? Well, we use the calculus of variations, hence the name variational.
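The recap can be turned into a small end-to-end sketch. It assumes a conjugate model chosen purely because its exact posterior is known: prior θ ~ N(0, 1), observations y_i ~ N(θ, 1), and a Gaussian approximation Q = N(m, s²). The data values are made up, and the optimizer is ordinary gradient ascent on the two parameters of Q rather than a calculus-of-variations derivation.

```python
import math

# Hypothetical observations, assumed drawn from N(theta, 1) with prior theta ~ N(0, 1).
data = [1.2, 0.7, 1.9, 1.4, 0.3]
n = len(data)
total = sum(data)
sum_sq = sum(y * y for y in data)

def elbo(m, s):
    """Closed-form ELBO for Q = N(m, s^2): E_Q[log P(Y, theta)] - E_Q[log Q(theta)]."""
    expected_log_prior = -0.5 * math.log(2 * math.pi) - 0.5 * (m * m + s * s)
    expected_log_lik = sum(
        -0.5 * math.log(2 * math.pi) - 0.5 * ((y - m) ** 2 + s * s) for y in data
    )
    entropy = 0.5 * math.log(2 * math.pi * math.e * s * s)  # equals -E_Q[log Q]
    return expected_log_prior + expected_log_lik + entropy

# Gradient ascent on (m, s), starting from the prior N(0, 1).
# The step size is small enough here that s stays positive throughout.
m, s = 0.0, 1.0
learning_rate = 0.05
for _ in range(500):
    grad_m = total - (n + 1) * m        # d ELBO / d m
    grad_s = 1.0 / s - (n + 1) * s      # d ELBO / d s
    m += learning_rate * grad_m
    s += learning_rate * grad_s

# Exact answers, available only because this particular model is conjugate.
posterior_mean = total / (n + 1)
posterior_std = math.sqrt(1.0 / (n + 1))
log_evidence = (-0.5 * n * math.log(2 * math.pi) - 0.5 * math.log(n + 1)
                - 0.5 * (sum_sq - total ** 2 / (n + 1)))

print("fitted Q:        mean", m, "std", s)
print("exact posterior: mean", posterior_mean, "std", posterior_std)
print("ELBO at optimum:", elbo(m, s), "log evidence:", log_evidence)
```

Because the exact posterior is itself Gaussian in this example, the family contains it, the KL divergence can be driven to zero, and the ELBO reaches log P(Y). With a family that does not contain the posterior, the ELBO would stop strictly below log P(Y).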

Before we end, let’s take a few seconds to talk about the family of distributions, L, that we are going to use. The most common choice is the mean-field variational family, where the distribution is factorized into independent distributions, one for each parameter, that is:

$$Q(\theta) = \prod_{i} q_i(\theta_i)$$

Usually, the individual factors (the qs) are chosen to be members of the exponential family of distributions.
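A mean-field distribution is easy to work with in code because every factor can be handled on its own. The sketch below assumes three parameters with independent Gaussian factors; the factor parameters are hypothetical values, and the Gaussian is used as one familiar member of the exponential family.

```python
import math
import random

def log_normal_pdf(x, mean, std):
    """Log density of N(mean, std^2) at x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

# Hypothetical variational parameters: one (mean, std) pair per parameter theta_i.
factors = [(0.0, 1.0), (2.0, 0.5), (-1.0, 2.0)]

def log_q(theta):
    """log Q(theta) is the sum of the per-factor log densities."""
    return sum(log_normal_pdf(t, mean, std) for t, (mean, std) in zip(theta, factors))

def sample_q(rng):
    """Draw theta from Q by sampling each factor independently."""
    return [rng.gauss(mean, std) for mean, std in factors]

rng = random.Random(0)
theta = sample_q(rng)
print("sample from Q:", theta)
print("log Q(sample):", log_q(theta))
```

The price of this convenience is that a factorized Q cannot represent dependence between parameters, even when the true posterior has it.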

I hope that this short introduction motivates readers to dig deeper into variational Bayesian methods. We have just scratched the surface, and there is a lot more to be learned and discovered about this amazing world. Finally, I would like to add some links to references I found really helpful:

Variational Inference: A Review for Statisticians

https://www.youtube.com/watch?v=DYRK0-_K2UU, a really interesting talk by Dr. Tamara Broderick