In a previous post, we discussed the relevance of Monte Carlo methods in the deep learning ecosystem as an alternative to more traditional Las Vegas techniques. Both fall under the umbrella of randomized methods, but Las Vegas techniques focus on producing an exact answer, while Monte Carlo methods produce an approximate answer drawn from a probability distribution. The efficiency of Monte Carlo techniques when operating on large, multi-dimensional datasets has made them a favorite of deep learning practitioners. From sampling data to regularization and optimization techniques, Monte Carlo methods have become an important building block of modern deep learning solutions.

Several Monte Carlo techniques have been widely implemented in modern deep learning platforms. The best-known member of the family brings Markov chains into the world of random sampling and goes by the name of Markov Chain Monte Carlo (MCMC).

The main objective of MCMC models is to obtain information about distributions using Markov-chain random walks. That is a fancy way of saying that MCMC techniques can learn the fundamental attributes of a probability distribution without sampling all its members. Reading this, you might be confused. Isn’t the role of Monte Carlo methods to draw examples from a distribution? If so, how are MCMCs any different?

The main difference of MCMC methods comes from their use of Markov chains to generate samples through a sequential process. While standalone Monte Carlo methods can generate samples from a distribution, there are many scenarios in which there is no tractable method to draw exact samples from it. Markov chains complement traditional Monte Carlo methods with a model in which each random sample is used as a stepping stone to generate the next one (hence the *chain*). A unique benefit of the chain is that, although each new sample depends on the one before it, it does *not* depend on any samples before the previous one (this is the “Markov” property).
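The stepping-stone structure of a chain can be illustrated with a minimal sketch (assuming a simple Gaussian random walk, which is not part of the original text): each new state is generated using only the current state, never any earlier history.

```python
import random

def random_walk_chain(n_steps, start=0.0, step_size=1.0, seed=42):
    """Generate a Markov chain where each sample depends only on the previous one."""
    rng = random.Random(seed)
    chain = [start]
    for _ in range(n_steps):
        # The next state is drawn using only the current state (the Markov property):
        # no sample further back in the chain influences this draw.
        chain.append(chain[-1] + rng.gauss(0.0, step_size))
    return chain

walk = random_walk_chain(1000)
```

Each element of `walk` was produced from its immediate predecessor plus a random perturbation, which is exactly the chain structure MCMC exploits.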

**MCMC in Action**

Let’s use a classic example from the machine learning literature to illustrate the value of MCMC models. Suppose that a professor is interested in learning the average test score in a student population. While the mean test score is unknown, the lecturer knows that the scores are normally distributed with a standard deviation of 15. So far, the lecturer has observed the test score of a single student: 100. One can use MCMC to draw samples from the *target distribution*, in this case the posterior, which represents the probability of each possible value of the population mean given the observed data.

In order to draw samples from the distribution of test scores, MCMC starts with an initial guess: just one value that could plausibly be drawn from the distribution. Let’s assume this initial guess is 110. MCMC is then used to produce a chain of new samples from this initial guess. Each new sample is produced by two simple steps: first, a *proposal* for the new sample is created by adding a small random perturbation to the most recent sample; second, this proposal is either accepted as the new sample or rejected (in which case the old sample is retained). By continuously repeating this process, an MCMC model produces a series of samples whose distribution closely approximates the target distribution.
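The propose/accept-reject loop above can be sketched as a random-walk Metropolis sampler for the professor's problem. This is a hedged illustration, not the post's own code: it assumes a flat prior on the mean (so the posterior is proportional to the Gaussian likelihood of the single observed score, 100, with standard deviation 15), a Gaussian proposal of width 10, and the initial guess of 110 from the text.

```python
import math
import random

def log_posterior(mu, data=(100.0,), sigma=15.0):
    # Flat prior on mu (an assumption), so the log-posterior is just the
    # Gaussian log-likelihood of the observed scores, up to a constant.
    return sum(-0.5 * ((x - mu) / sigma) ** 2 for x in data)

def metropolis(n_samples, start=110.0, step=10.0, seed=0):
    rng = random.Random(seed)
    current = start
    samples = [current]
    for _ in range(n_samples):
        # Step 1: propose by perturbing the most recent sample.
        proposal = current + rng.gauss(0.0, step)
        # Step 2: accept with probability min(1, p(proposal) / p(current)),
        # computed in log space; otherwise retain the old sample.
        if math.log(rng.random()) < log_posterior(proposal) - log_posterior(current):
            current = proposal
        samples.append(current)
    return samples

chain = metropolis(20000)
burned = chain[2000:]  # discard early "burn-in" samples near the initial guess
est_mean = sum(burned) / len(burned)
```

After burn-in, the chain's sample mean should hover near the observed score of 100, and the spread of the samples should approximate the posterior standard deviation of 15.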