# Reinforcement Learning: Solving MDPs Using Dynamic Programming (Part 3)

Source: Deep Learning on Medium

# Reinforcement Learning: Solving Markov Decision Process using Dynamic Programming

The previous two stories were about understanding the Markov Decision Process and defining the Bellman Equation for the optimal policy and value function. In this one, we are going to talk about how these Markov Decision Processes are solved. But before that, we will define the notion of solving a Markov Decision Process and then look at the different Dynamic Programming algorithms that help us solve them. So, before we start, let’s look at what we are going to talk about in this story:

• What is Dynamic Programming?
• Notion of solving a Markov Decision Process
• What is Policy Evaluation?
• Dynamic Programming Algorithm 1: Policy Iteration
• Modified Policy Iteration
• Dynamic Programming Algorithm 2: Value Iteration

So, make yourself a coffee and get fresh as the difference between ordinary and extraordinary is that little extra. 🤙

## What is Dynamic Programming?

Dynamic Programming is a lot like the divide-and-conquer approach in that it breaks a problem down into sub-problems, but with one key difference: instead of solving the sub-problems independently (as in divide and conquer), the result of one sub-problem is reused in similar sub-problems. The idea of dynamic programming is that you don’t need to re-solve a problem you have already solved. A problem can be solved using dynamic programming if it satisfies two properties:

• Optimal Substructure: This means that a problem can be divided into sub-problems, and if we find optimal solutions to those sub-problems, then we can combine them into an optimal solution for the overall problem.
• Overlapping Sub-problems: As said before, Dynamic Programming is used when a problem has similar sub-problems. The solution of one sub-problem is saved and then reused to solve similar sub-problems.
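These two properties are easy to see in a small example outside of reinforcement learning. Here is a minimal sketch in Python (Fibonacci is my illustration, not from the original story): the recursion has optimal substructure (fib(n) is built from fib(n-1) and fib(n-2)) and overlapping sub-problems (both branches need fib(n-2)), so caching each result, which is exactly the dynamic programming idea, turns exponential work into linear work.

```python
from functools import lru_cache

@lru_cache(maxsize=None)          # cache: each sub-problem is solved only once
def fib(n: int) -> int:
    """Naive recursion would re-solve fib(k) exponentially many times;
    saving sub-problem results (dynamic programming) avoids that."""
    if n < 2:                     # base cases: fib(0) = 0, fib(1) = 1
        return n
    # optimal substructure: fib(n) is assembled from optimal answers
    # to the smaller, overlapping sub-problems fib(n-1) and fib(n-2)
    return fib(n - 1) + fib(n - 2)

print(fib(50))  # instant with caching; infeasible without it
```

The same caching idea is what value functions give us in an MDP: a state's value, once computed, is reused wherever that state appears as a successor.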

But wait, does a Markov Decision Process have these two properties?

For the first property, we have the Bellman Equation. But how?

Recall that the Bellman equation for the value function is defined as:

v(s) = E[ R_{t+1} + γ · v(S_{t+1}) | S_t = s ]

Remember what the Bellman Equation says: the value of a state is equal to the immediate reward our agent gets for leaving the state plus the (discounted) value of the next state. So, the equation breaks down the process of finding the value function of a state by dividing it into sub-problems.

Secondly, for overlapping sub-problems, we have Value functions. How?

See, a value function already stores how good a particular state is, so we don’t need to recompute the value of that state again and again. For example, suppose there are two states (s[1] and s[2]), we are in state s[1], and we have already computed the value of state s[2]; then, when calculating the value of state s[1], we will not re-compute the value of state s[2]. (The value of s[1] relies on the value of s[2].)

## Notion of Solving Markov Decision Process

What do we mean when we say we are going to solve a Markov Decision Process?

It means that we are going to find the value function that satisfies the Bellman Equation. Once we have found the optimal value function, we can use it to find the optimal policy.

This process is broken down into two parts:

1. Prediction: Suppose we have a Markov Decision Process defined by (S, A, P, R, γ) and some given policy π. Our job in prediction is to find the value function of that policy. We find it by evaluating the policy using the Bellman Expectation Equation.
2. Control: This process involves optimizing the value function we calculated during the prediction process. Here we find the optimal value function and the optimal policy for our Markov Decision Process.

With that being said, let’s talk about our first building block that is, Policy Evaluation.

## Policy Evaluation

Synchronous Backups: This means that in every iteration we back up (update) the value of every state in our Markov Decision Process, using the state values from the previous iteration.

Policy Evaluation means finding out how much reward our agent is going to get by following a particular policy π. But how do we evaluate a policy π?

We evaluate a given policy π using the Bellman Expectation Equation. Recall that the Bellman Expectation Equation tells us how much reward our agent is going to get by following a policy π, and is defined as:

vπ(s) = Σ_a π(a|s) · ( R(s, a) + γ · Σ_{s′} P(s′|s, a) · vπ(s′) )

Now, to evaluate a policy we are going to use this equation in an iterative fashion. Meaning, in each iteration we compute a new value for every state by backing up the values of its successor states from the previous iteration. Let’s understand this with an example:

Look at the following backup diagram:

We know that the value of the root node s is given by the reward we get for taking action a plus the discounted value of the successor state s′. What we are doing is using the Bellman Expectation Equation in an iterative fashion for dynamic programming. That is, to compute the value of state s at iteration k+1, we use the values of the leaves (the states s′ here) from the previous iteration k. This happens for every state in every iteration. Then, in the next iteration, we use the values we just computed, v[k+1], to compute the values of the following iteration, v[k+2].

Note: initially, we start with the value of every state initialized to zero.

Let’s understand this with the help of a grid world:

In this grid world, we get a reward of -1 for each transition we make (each action we take). The actions we can take are up, down, right and left. Our agent follows a uniform random policy π, with an equal probability of 0.25 given to each action.

Now, let’s track policy evaluation with this Grid world:

As said before, we start off by initializing the value of each state to zero:

Then, on the next iteration, we apply the Bellman Expectation Equation in the iterative fashion described above and end up with every non-terminal state's value being -1 (recall that -1 is the reward we get for taking any action):

Let’s take the state highlighted in red, for example. First, the reward for going from this state to any other is -1. Next, the value of each next state from the previous iteration is 0, as we initialized. So, applying the Bellman Expectation Equation gives us a value of -1 for this state. Since we are using synchronous backups, we consider every state our agent can go to (four here), each weighted by the policy's probability of 0.25.

Let’s look at what happens in the next iteration:

Again, let’s see how the value of the state in red becomes -2. We are doing the same thing: first, we get a reward of -1 for leaving the state, and then, taking the value of each next state from the previous iteration (which is -1), the backed-up value for each action is -2. Since it’s a synchronous backup, we do this for every state our agent can land on and average over the four equally likely actions, which gives this state a value of -2.

Let’s talk about the states next to terminal states:

Let’s talk about the state in red. From this state we can also take four actions, but the difference is that the value of the terminal state is always 0. So three of the actions yield a backed-up value of -1 + (-1) = -2, while the action into the terminal state yields -1 + 0 = -1. Adding these up gives -7, and averaging over the four equally likely actions (as this is a synchronous backup under the uniform random policy) gives this state a value of -7/4 = -1.75.

Now, if we continue the iteration process, we end up with the value of states as follows:

In the last iteration, we have the true value function of the random policy for each of these states (we are going to talk about convergence later).
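The walkthrough above can be reproduced in a few lines of Python. This is a sketch under stated assumptions: a standard 4×4 grid with terminal states in two opposite corners (the layout the walkthrough's numbers correspond to), deterministic moves, a reward of -1 per transition, and the uniform random policy; the helper names are my own.

```python
import numpy as np

N = 4                                         # 4x4 grid world
TERMINALS = {(0, 0), (N - 1, N - 1)}          # assumed: terminals in opposite corners
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(s, a):
    """Deterministic transition; moving off the grid leaves the agent in place."""
    r, c = s[0] + a[0], s[1] + a[1]
    return (r, c) if 0 <= r < N and 0 <= c < N else s

def policy_evaluation(gamma=1.0, sweeps=1000, tol=1e-6):
    v = np.zeros((N, N))                      # start with every state's value at 0
    for _ in range(sweeps):
        v_new = np.zeros_like(v)              # synchronous backup: read old values only
        for r in range(N):
            for c in range(N):
                if (r, c) in TERMINALS:
                    continue                  # terminal states keep value 0
                # uniform random policy: probability 0.25 per action,
                # reward -1 for every transition
                v_new[r, c] = sum(0.25 * (-1 + gamma * v[step((r, c), a)])
                                  for a in ACTIONS)
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    return v_new

v = policy_evaluation()
print(np.round(v))
```

Running this reproduces the early sweeps from the walkthrough (all -1 after one sweep, -1.75 next to the terminals after two) and then converges to the random policy's value function.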

This concludes our Iterative Evaluation Process. Now, what if we behave greedily to these values of the states?

Acting greedily means that, in each state, our agent chooses the action leading to the successor state with the highest value.

It turns out that if we start acting greedy w.r.t the values of the states we will eventually end up with an optimal policy.
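Acting greedily over the evaluated values can be sketched as a one-step lookahead. This is an illustrative sketch reusing the 4×4 grid-world assumptions from the example above (layout and helper names are mine): in each state it picks the action whose backed-up value, reward plus value of the next state, is largest.

```python
import numpy as np

N = 4
TERMINALS = {(0, 0), (N - 1, N - 1)}
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(s, a):
    r, c = s[0] + a[0], s[1] + a[1]
    return (r, c) if 0 <= r < N and 0 <= c < N else s  # walls keep us in place

def greedy_policy(v, gamma=1.0):
    """One-step lookahead: in every state pick the action whose backed-up
    value (reward of -1 plus gamma * value of the next state) is largest."""
    policy = {}
    for r in range(N):
        for c in range(N):
            if (r, c) in TERMINALS:
                continue
            policy[(r, c)] = max(
                ACTIONS,
                key=lambda a: -1 + gamma * v[step((r, c), ACTIONS[a])])
    return policy

# converged values of the uniform random policy from the walkthrough
v = np.array([[  0, -14, -20, -22],
              [-14, -18, -20, -20],
              [-20, -20, -18, -14],
              [-22, -20, -14,   0]], dtype=float)

pi = greedy_policy(v)
print(pi[(0, 1)], pi[(1, 0)])  # both head straight into the top-left terminal
```

Note that even though these values belong to the random policy, acting greedily on them already yields shortest-path behaviour near the terminals, which is exactly the improvement step the next section builds on.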

Now, keeping both of these points in mind, let’s look at our first Dynamic Programming algorithm that is Policy Iteration.

## Policy Iteration

This algorithm breaks the process described earlier into two parts.

First, we evaluate a policy, say π, using the Bellman Expectation Equation as described in the previous section; then we act greedily with respect to this evaluated value function, which gives us an improved policy. We iterate this process until we converge to the optimal value function.

The idea of Policy Iteration consists of two steps:

• Policy Evaluation (as described earlier)
• Acting greedily with respect to the evaluated value function, which yields a policy better than (or equal to) the previous one

We iterate these two processes until they converge to the optimal value function and the optimal policy.

After evaluating a policy (calculating the value function using the Bellman Expectation Equation in an iterative way), we act greedily with respect to that value function, and because we are acting greedily, our policy becomes deterministic.

Let’s understand this in more depth. (In this figure, the up arrows indicate policy evaluation and the down arrows indicate acting greedily with respect to the value function.)

What this figure conveys is that we start with the value of each state initialized to zero and an arbitrary policy, and then we evaluate this policy using the Bellman Expectation Equation as described earlier and act greedily with respect to the value function, which gives us a new policy. Then we evaluate this new policy and again act greedily with respect to the new value function. We continue this process until we find the optimal value function and the optimal policy for our Markov Decision Process.
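The evaluate-then-act-greedily loop can be sketched as follows. This is again an illustration under the same grid-world assumptions (4×4 grid, deterministic moves, reward -1 per step, helper names mine). The evaluation sweeps are capped, which keeps evaluation finite even for a starting policy that never reaches a terminal state; truncating evaluation like this is also the idea behind Modified Policy Iteration from the outline.

```python
import numpy as np

N = 4
TERMINALS = {(0, 0), (N - 1, N - 1)}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
STATES = [(r, c) for r in range(N) for c in range(N) if (r, c) not in TERMINALS]

def step(s, a):
    r, c = s[0] + a[0], s[1] + a[1]
    return (r, c) if 0 <= r < N and 0 <= c < N else s

def evaluate(policy, gamma=1.0, sweeps=100):
    """Policy evaluation: iterate the Bellman Expectation Equation for a
    deterministic policy, with a cap on the number of sweeps."""
    v = np.zeros((N, N))
    for _ in range(sweeps):
        v_new = np.zeros_like(v)
        for s in STATES:
            v_new[s] = -1 + gamma * v[step(s, policy[s])]
        if np.max(np.abs(v_new - v)) < 1e-9:
            break
        v = v_new
    return v_new

def improve(v, gamma=1.0):
    """Policy improvement: act greedily with respect to the evaluated values."""
    return {s: max(ACTIONS, key=lambda a: -1 + gamma * v[step(s, a)])
            for s in STATES}

def policy_iteration():
    policy = {s: ACTIONS[0] for s in STATES}   # arbitrary initial policy: always 'up'
    for _ in range(50):                        # outer cap; stabilizes in a few rounds
        v = evaluate(policy)
        new_policy = improve(v)
        if new_policy == policy:               # policy stable -> it is optimal
            break
        policy = new_policy
    return policy, v

policy, v = policy_iteration()
print(v)   # optimal values: minus the number of steps to the nearest terminal
```

Once the greedy policy stops changing, it is greedy with respect to its own value function, which is exactly the condition for optimality.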

Now, let’s dive a bit deeper into how we improve our policy.

Acting greedily means picking the action a that yields the maximum value in a particular state s. We can define this as:

π′(s) = argmax_a qπ(s, a)

Recall, qπ(s, a) tells us how good it is to take action a in state s, or how much value we are going to get by taking action a in state s. Taking the arg-max over qπ(s, a) means that we pick the action a in state s that maximizes the q-value qπ(s, a).
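The q-values behind this arg-max can be computed from vπ by a one-step lookahead. A minimal sketch, again assuming the deterministic grid-world transitions and reward of -1 used earlier (the converged random-policy values are taken from the walkthrough):

```python
import numpy as np

N = 4
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(s, a):
    r, c = s[0] + a[0], s[1] + a[1]
    return (r, c) if 0 <= r < N and 0 <= c < N else s

def q_values(s, v, gamma=1.0):
    """q_pi(s, a) = immediate reward + gamma * v_pi(s') for each action a
    (deterministic transitions, reward -1 everywhere, as in the grid world)."""
    return {name: -1 + gamma * v[step(s, off)] for name, off in ACTIONS.items()}

# converged values of the uniform random policy from the walkthrough
v = np.array([[  0, -14, -20, -22],
              [-14, -18, -20, -20],
              [-20, -20, -18, -14],
              [-22, -20, -14,   0]], dtype=float)

q = q_values((0, 1), v)
print(q)                   # 'left' (into the terminal) has the largest q-value
print(max(q, key=q.get))   # → 'left'
```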

Well, does this make our policy better or not?

So, we are asking whether taking the greedy action [π′(s)] is better than just following our policy π all along. Yes: taking the first step greedily [π′(s)] and then following policy π is better than or equal to just following policy π all along.

This can be expressed as:

qπ(s, π′(s)) = max_a qπ(s, a) ≥ Σ_a π(a|s) · qπ(s, a) = vπ(s)