Probability Learning VI: Hidden Markov Models

Source: Deep Learning on Medium

Hello again friends! This is post number six of our Probability Learning series, listed here in case you have missed any of the previous articles:

I deeply encourage you to read them, as they are fun and full of useful information about probabilistic Machine Learning. However, if you don't want to read them, that is absolutely fine: this article can be understood on its own, with only a little knowledge of probability.

Also, do not fear: I will not include any complex math in this article. Its intention is to lay out the theoretical background of Hidden Markov Models, show how they can be used, and talk about some of their applications.

We have already met Reverend Bayes, and today we are going to meet another very influential individual in the world of game theory and probability. This is none other than Andréi Márkov, the guy who put the Markov in Hidden Markov Models, Markov Chains…

Hidden Markov Models are a branch of the probabilistic Machine Learning world that is very useful for solving problems involving sequences, like Natural Language Processing or Time Series problems. In a moment we will see just why this is, but first, let's get to know Markov a little bit.

So, who was this Andrei Markov?

Image of Andrei Markov

Andrei Markov (1856–1922) was a Russian mathematician who taught probability theory at the University of St Petersburg, and was also a very politically active individual. He worked on continued fractions, the central limit theorem, and other mathematical endeavours; however, he will mostly be remembered for his work on probability theory, specifically on the study of stochastic processes and the Markov Chains that we will discuss in just a moment.

Markov Chains: Probabilistic Sequences

Let's start with the most basic element of Markov's proposal: the Markov Chain. In probability theory, a Markov Chain or Markov Model is a special type of discrete stochastic process in which the probability of an event occurring depends only on the immediately preceding event.

The underlying assumption is that the future is independent of the past given the present. In other words, if we know the present state or value of a system or variable, we do not need any past information to try to predict the future states or values.

Example of a two state Markov Chain

Markov Chains are generally defined by a set of states and the transition probabilities between those states. In the example above, a two-state Markov Chain is displayed: we have states A and B and four transition probabilities: from A back to A, from A to B, from B to A, and from B back to B. These transition probabilities are usually represented in the form of a matrix, called the Transition Matrix (sometimes also called the Markov Matrix).

One possible transition matrix for this chain

The element ij is the probability of transitioning from state j to state i. In some cases the transposed notation is used, so that element ij represents the probability of going from state i to state j. Because of this, I have added the 'to' and 'from' labels just to clarify. Overall, the system would look something like this:
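As a quick sketch, here is how such a transition matrix could be represented in Python with NumPy. The numbers are purely illustrative (not taken from the figure), and I use the 'from row, to column' convention:

```python
import numpy as np

# Illustrative transition matrix for the two-state chain above
# (hypothetical numbers, not the ones in the figure).
# Convention: T[i, j] = P(going from state i to state j).
states = ["A", "B"]
T = np.array([
    [0.8, 0.2],  # from A: P(A -> A), P(A -> B)
    [0.4, 0.6],  # from B: P(B -> A), P(B -> B)
])

# From any state we must go somewhere, so every row sums to 1.
assert np.allclose(T.sum(axis=1), 1.0)

# Probability of transitioning from A to B:
print(T[states.index("A"), states.index("B")])  # 0.2
```

With the transposed convention, you would instead check that every column sums to 1 and read the same probability as `T[j, i]`.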

Diagram of a Markov Chain with the transition probabilities

How do we calculate these probabilities? The answer is one that you've probably heard a million times already: from data.

Imagine the states we have in our Markov Chain are Sunny and Rainy. To calculate the transition probabilities from one to another we just have to collect some data that is representative of the problem that we want to address, count the number of transitions from one state to another, and normalise the measurements. The following figure shows how this would be done for our example.

Calculations of transition probabilities from data
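In code, the counting-and-normalising procedure described above might look like the following sketch, where the sequence of daily observations is made up purely for illustration:

```python
from collections import Counter

# A toy sequence of daily weather observations (made-up data,
# just to illustrate the procedure).
days = ["Sunny", "Sunny", "Rainy", "Rainy", "Sunny",
        "Sunny", "Sunny", "Rainy", "Sunny", "Sunny"]

# Count every consecutive (today, tomorrow) transition.
counts = Counter(zip(days, days[1:]))

# Normalise: divide each count by the total number of
# transitions leaving that state.
states = ["Sunny", "Rainy"]
transition = {}
for s in states:
    total = sum(c for (frm, _), c in counts.items() if frm == s)
    for t in states:
        transition[(s, t)] = counts[(s, t)] / total if total else 0.0

print(transition)
```

For this toy sequence there are six transitions out of Sunny (four to Sunny, two to Rainy) and three out of Rainy (two to Sunny, one to Rainy), so for example P(Sunny → Sunny) comes out as 4/6.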

At the moment Markov Chains look just like any other state machine, in which we have states and transitions in between them. However, later in this article we will see just how special they are.

Okay, now that we know what a Markov Chain is, and how to calculate the transition probabilities involved, let's carry on and learn about Hidden Markov Models.

Hidden Markov Models: Discovering the unknown

Hidden Markov Models are probabilistic models that attempt to find the value (or the probability) of certain hidden variables based on some other observed variables. These variables are commonly referred to as the hidden states and the observed states.

The state of a system might only be partially observable, or not observable at all, and we might have to infer its characteristics based on another fully observable system or variable.

Imagine, using the previous example, that we add the following information. Every day, there is a probability that we get a phone call from our best friend, John, who lives on a different continent, and this probability depends on the weather conditions of that day. Using the latter information (whether we get a phone call or not: the observed variable) we would like to infer the former (the weather on the continent where John lives: the hidden variable).

Hidden and observed variables for our problem

The probabilities shown here, which define how likely John is to call us on a given day depending on that day's weather, are called emission probabilities. They define the probability of seeing a certain observed variable given a certain value of the hidden variables.
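A minimal sketch of what these emission probabilities might look like in Python, using hypothetical numbers rather than the ones in the figure:

```python
# Hypothetical emission probabilities: P(observation | hidden state).
# Rows condition on the weather (hidden state); the observations are
# whether John calls or not.
emission = {
    "Sunny": {"Call": 0.8, "NoCall": 0.2},
    "Rainy": {"Call": 0.3, "NoCall": 0.7},
}

# Given any weather state, John either calls or he doesn't,
# so each row must sum to 1.
for state, probs in emission.items():
    assert abs(sum(probs.values()) - 1.0) < 1e-9

print(emission["Rainy"]["NoCall"])  # 0.7
```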

Knowing these probabilities, along with the transition probabilities we calculated before and the prior probabilities of the hidden variables (how likely it is to be sunny or rainy), we could try to find out what the weather was during a certain period of time, knowing on which days John gave us a phone call.

Let's see how we would solve this problem with simple statistics. Imagine John did not phone us for two days in a row. What is the most likely weather scenario? For this, we first need to calculate the prior probabilities (that is, the probability of it being sunny or rainy before any actual observation), which we obtain from the same data as the transition probabilities.

Calculations of prior probabilities
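The priors are just the relative frequency of each state in the data. Reusing the same made-up daily observations from the transition-counting sketch:

```python
from collections import Counter

# Made-up daily observations, purely for illustration.
days = ["Sunny", "Sunny", "Rainy", "Rainy", "Sunny",
        "Sunny", "Sunny", "Rainy", "Sunny", "Sunny"]

# Prior probability of each state = its share of the observed days.
priors = {state: count / len(days)
          for state, count in Counter(days).items()}

print(priors)  # {'Sunny': 0.7, 'Rainy': 0.3}
```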

Now we are ready to solve our problem: for two days in a row, we did not get a single sign that John is alive. What is the most likely weather scenario? As we can see in the image below, we have four possible situations to consider: sunny followed by sunny, sunny followed by rainy, rainy followed by sunny and, lastly, rainy followed by rainy.
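Plugging in some illustrative numbers (hypothetical, not the exact figures from the article's images), the four scenarios can be scored by brute force: each score is the prior of the first day's weather, times the probability of no call that day, times the transition to the second day's weather, times the probability of no call again.

```python
from itertools import product

# Hypothetical model parameters, for illustration only.
priors = {"Sunny": 0.7, "Rainy": 0.3}
transition = {("Sunny", "Sunny"): 0.8, ("Sunny", "Rainy"): 0.2,
              ("Rainy", "Sunny"): 0.4, ("Rainy", "Rainy"): 0.6}
no_call = {"Sunny": 0.2, "Rainy": 0.7}  # P(no call | weather)

# Joint probability of each two-day weather scenario together
# with observing "no call" on both days.
scores = {}
for w1, w2 in product(priors, repeat=2):
    scores[(w1, w2)] = (priors[w1] * no_call[w1]
                        * transition[(w1, w2)] * no_call[w2])

best = max(scores, key=scores.get)
print(best, scores[best])  # ('Rainy', 'Rainy') 0.0882...
```

With these particular numbers, two silent days make rainy-then-rainy the most likely scenario, which matches the intuition that John prefers to call when the sun is out. For longer sequences this brute-force enumeration blows up exponentially, which is exactly why algorithms like Viterbi exist.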