A Crash Course in Markov Decision Processes, the Bellman Equation, and Dynamic Programming

Original article was published by Andre Ye on Artificial Intelligence on Medium

Source: Pixabay

A Crash Course in Markov Decision Processes, the Bellman Equation, and Dynamic Programming

An intuitive introduction to reinforcement learning

Take a moment to locate the nearest major city around you.

If you were to travel there now, which mode of transportation would you use? You may take a car, a bus, or a train. Perhaps you’ll ride a bike, or even purchase an airplane ticket.

Regardless of how you get to the location, you incorporate probability into the decision-making process. Perhaps there’s a 70 percent chance of rain or a car crash, which can cause traffic jams. If your bike tire is old, it may pop – this is certainly a large probabilistic factor.

On the other hand, there are deterministic costs — for instance, the cost of gas or an airplane ticket — as well as deterministic rewards — like substantially faster travel times taking an airplane (depending on the location) or perhaps a free Uber/Lyft ride.

These types of problems – in which an agent must balance probabilistic and deterministic rewards and costs – appear commonly in decision-making. Markov Decision Processes are used to model these types of optimization problems and can be applied furthermore to more complex tasks in Reinforcement Learning.

Defining Markov Decision Processes

To illustrate a Markov Decision process, consider a dice game:

  • Each round, you can either continue or quit.
  • If you quit, you receive $5 and the game ends.
  • If you continue, you receive $3 and roll a 6-sided die. If the die comes up as 1 or 2, the game ends. Otherwise, the game continues onto the next round.

Clearly, there is a trade-off here. For one, we can trade a deterministic gain of $2 for the chance to roll a dice and continue onto the next round.

Let’s establish some vocabulary, which we will use to create an MDP to model this game:

  • A state is a status the agent (decision-maker) can hold. In the dice game, the agent can either be in the game or out of the game.
  • An action is a movement the agent can choose. These move the agent between states, with certain penalties or rewards.
  • Transition probabilities describe the probability of ending up in a state s’ (s prime) given an action a. These will be often denoted as a function P(s, a, s’) that outputs the probability of ending up in s’ given current state s and action a.
    For example, P(s=playing the game, a=choose to continue playing, s’=not playing the game) is ⅓, since there is a two-sixths (one-third) chance of losing the dice roll.
  • Rewards are given depending on the action. The reward for continuing the game is 3, whereas the reward for quitting is $5. The ‘overall’ reward is to be optimized.

We can formally describe a Markov Decision Process as m = (S, A, P, R, gamma), where:

  • S represents the set of all states.
  • A represents the set of possible actions.
  • P represents the transition probabilities.
  • R represents the rewards.
  • Gamma is known as the discount factor. More on this later.

The goal of the MDP m is to find a policy, often denoted pi, that yields the optimal long-term reward. Policies are simply a mapping of each state s to a distribution of actions a. For each state s, the agent should take action a with a certain probability. Alternatively, policies can also be deterministic (i.e. the agent will take action a in state s).

Our Markov Decision Process would look like this. An agent traverses the graph’s two states by making decisions and following probabilities.

Something important to mention is the Markov Property, which applies not only to Markov Decision Processes but anything Markov-related (like a Markov Chain). It states that the next state can be determined solely by the current state — no ‘memory’ is necessary. This applies to how the agent traverses the Markov Decision Process, but note that optimization methods use previous learning to fine-tune policies. This is not a violation of the Markov property, which only applies to the traversal of an MDP.

The Bellman Equation

The Bellman Equation is one central to Markov Decision Processes. It outlines a framework for determining the optimal expected reward at a state s by answering the question, “what is the maximum reward an agent can receive if they make the optimal action now and for all future decisions?”

Although versions of the Bellman Equation can become fairly complicated, fundamentally most of them can be boiled down to this form:

It is a relatively common-sense idea, put into formulaic terms. Notice the role gamma — which is between 0 or 1 (inclusive) — plays in determining the optimal reward. If gamma is set to 0, the V(s’) term is completely canceled out and the model only cares about the immediate reward. On the other hand, if gamma is set to 1, the model weights potential future rewards just as much as it weights immediate rewards. The optimal value of gamma is usually somewhere between 0 and 1, such that the value of farther-out rewards has diminishing effects.

Let’s use the Bellman equation to determine how much money we could receive in the dice game. We can choose between two choices, so our expanded equation will look like max(choice 1’s reward, choice 2’s reward). Choice 1 — quitting — yields a reward of 5. On the other hand, choice 2 yields a reward of 3, plus a two-thirds chance of continuing to the next stage, in which the decision can be made again (we are calculating by expected return). We add a discount factor gamma in front of terms indicating the calculating of s’ (the next state).

This equation is recursive, but inevitably it will converge to one value, given that the value of the next iteration decreases by ⅔, even with a maximum gamma of 1. At some point, it will jut not be profitable to continue staying. Let’s calculate four iterations of this, with a gamma of 1 to keep things simple and to calculate the total long-term optimal reward.

At each step, we can either quit and receive an extra $5 in expected value, or stay and receive an extra $3 in expected value. Each new round, the expected value is multiplied by two-thirds, since there is a two-thirds probability of continuing, even if the agent chooses to stay.

Here, the decimal values are computed, and we find that (with our current number of iterations) we can expect to reap $7.8 if we follow the best choices.

Obviously, we may find that we can actually reap more rewards if we continue computing expected values. Here, we are limiting ourselves

Here, we calculated the best profit manually. If you wanted to compute this with a program, however, you would need to use a specialized data structure. Moreover, in order to be efficient, we don’t want to calculate each expected value independently, but in relation with previous ones. The solution: Dynamic Programming.

Dynamic Programming

In fact, Richard Bellman of the Bellman Equation coined the term Dynamic Programming, and it’s used to compute problems that can be broken down into subproblems. For example, the expected value for choosing Stay > Stay > Stay > Quit can be found by calculating the value of Stay > Stay > Stay first. These pre-computations would be stored in a two-dimensional array, where the row represents either the state [In] or [Out], and the column represents the iteration.

We can write rules that relate each cell in the table to a previously precomputed cell (this diagram doesn’t include gamma).

Then, the solution is simply the largest value in the array after computing enough iterations. In this case, because each iteration only needs the previous calculated value, we do not need a grid to store the values at all, and can simply store a current maximum and see if it is surpassed.

Through dynamic programming, we are able to compute any number of iterations — in the tens of thousands, if necessary. This can be helpful, in this case, for finding the maximum possible reward in an MDP, but also as useful structures to evaluate and determine policies.

There are generally two policy improvement methods. Both follow the same rough structure, but differ in policy evaluation, which determines the reward one can expect to receive by following a policy.

  • Policy iteration is when the value of each policy is repeatedly evaluated until it converges at some point. This value is then used to update the policy to produce a better value. This method is more accurate, but also takes more time and computing resources to calculate.
  • Value iteration is when we only run through the policy once and use that value to update the policy. This method is very quick to calculate but can also be inaccurate and have wide variance — sampling only one possibility from the MDP. The assumption is that after enough evaluation-update cycles, the policy will converge and perform well generally.

There is no right or wrong method to iteration. In fact, these two methods represent extremes; in many scenarios it is infeasible to do either and some intermediate number of iterations.


  • A Markov Decision Process (MDP) is used to model decisions that can have both probabilistic and deterministic rewards and punishments.
  • MDPs have five core elements: S, which is the set of possible states for an agent to be in; A, which is a set of possible actions an agent can take at a particular state; R, which are the rewards for making an action A at state S; P, which are the probabilities for transitioning to a new state S’ after taking action A at original state S; and gamma, which controls how far-looking the Markov Decision Process agent will be.
  • All Markov Processes, including Markov Decision Processes, must follow the Markov Property, which states that the next state can be determined purely by the current state.
  • The Bellman Equation determines the maximum reward an agent can receive if they make the optimal decision at the current state and at all following states. It defines the value of the current state recursively as being the maximum possible value of the current state reward, plus the value of the next state.
  • Dynamic programming utilizes a grid structure to store previously computed values and builds upon them to compute new values. It can be used to efficiently calculate the value of a policy and to solve not only Markov Decision Processes, but many other recursive problems.
  • Policies can be updated through policy iteration and value iteration, which represent different approaches to evaluating a policy before it is updated.