Reinforcement Learning : Solving MDPs using Dynamic Programming (Part 3)

Source: Deep Learning on Medium

#Solving Markov Decision Processes

Reinforcement Learning: Solving Markov Decision Process using Dynamic Programming

Previous two stories were about understanding Markov-Decision Process and Defining the Bellman Equation for Optimal policy and value Function. In this one, we are going to talk about how these Markov Decision Processes are solved. But before that, we will define the notion of solving Markov Decision Process and then, look at different Dynamic Programming Algorithms that helps us solve them. So, before we start, let’s look at what we are going to talk about in this story:

  • What is Dynamic Programming?
  • Notion of solving Markov- Decision Process
  • What is Policy Evaluation?
  • Dynamic Programming Algorithm 1: Policy Iteration
  • Modified Policy Iteration
  • Dynamic Programming Algorithm 2: Value Iteration

So, make yourself a coffee and get fresh as the difference between ordinary and extraordinary is that little extra. 🤙

What is Dynamic Programming?

Dynamic Programming is a lot like divide and conquer approach which is breaking down a problem into sub-problems but the only difference is instead of solving them independently (like in divide and conquer), results of a sub-problem are used in similar sub-problems. The idea of dynamic programming is that you don’t need to solve a problem you have already solved. A problem can be solved using dynamic programming if it satisfies two properties:

  • Optimal Substructure: This means that a problem can be divided into sub-problems and if we find optimal solutions to those sub-problems, then we can use this optimal solution to find an optimal solution for the overall problem.
  • Overlapping Sub-problems: Being said before, Dynamic Programming is used if a problem has similar sub-problems. The solution of one sub-problem is saved and then used to solve similar sub-problems.

But wait, do Markov Decision Process have these two properties?

For the first property, there we have a Bellman Equation. But, how?

Recall, Bellman equation is defined as follows:

Bellman Equation

Remember what the Bellman Equation said, that the value of a state is equal to the immediate reward our agent gets leaving the state plus the value of the next state. So, the equation is breaking down the process of finding the value function of a state by dividing it into sub-problems.

Secondly, for overlapping sub-problems, we have Value functions. How?

See, value functions have already stored how good a particular state is so we don’t need to recompute the value of that state again and again. For example, suppose there are two states(s[1] and s[2]) and we are in state s[1] and we have already computed the value of state s[2] so when calculating the value of state s[1] we will not re-compute the value of state s[2]. (The value of s[1] relies on state s[2])

Notion of Solving Markov Decision Process

What we mean when we say we are going to solve a Markov Decision Process?

It means that we are going to find the Value Function that will satisfy the Bellman Equation. Once we have found the optimal value function, then we can use it to find the optimal policy.

This process is broken down into two parts:

  1. Prediction: Suppose we have a Markov Decision Process defined as (S, A, P, R, gamma) and given some policy π. Our job in prediction is to find the value function. We find the value function by evaluating a policy using the Bellman Expectation Equation.
  2. Control: This process involves optimizing the value function, we calculated during the prediction process. We find optimal value function and optimal policy for our Markov Decision Process.

With that being said, let’s talk about our first building block that is, Policy Evaluation.

Policy Evaluation

Synchronous Backups: It means that we are going to take into consideration every state in our Markov Decision Process for calculating the value of a state.

Policy Evaluation means how much reward our agent is going to get by following a particular policy π. But how we evaluate a policy π?

We evaluate a given policy π using Bellman Expectation Equation. Recall, Bellman Expectation Equation tells how much reward our agent is going to get following a policy π and is defined as:

Bellman Expectation Equation

Now, to evaluate a policy we are going to use this equation in an iterative fashion. Meaning, we will calculate the value of the next state by backing up the value of the current state from the previous iteration. Lets understand this with an example:

Look at the following backup diagram:

Backup Diagram to Calculate Value function of the root Node

We know that the value of the root node [s] is given by the reward we got by taking the action [a] plus the value of the state [s’]. Now, what we are doing is we are using the Bellman Expectation equation in an iterative fashion for dynamic programming. That is, we will use the value of the leaves (state [s’] here) which is from the previous iteration, when we were computing the value of state s, to calculate the value of the state [s]. This happens in every iteration. Now, for the second iteration, we are going to use the value of state we computed in the previous iteration that is v [k+1] to compute the value of the next state, that is, v [k+2].

Note that: Initially we start with our value function of each state initialize to zero.

Lets understand this with a help of a grid world:

Small Grid World

In this Grid World, we get a reward of -1 for each transition we make (actions we take). And the actions that we can take are up, down, right and left. Our agent is following some random policy π with an equal weight-age of 0.25 given to each action.

Now, let’s track policy evaluation with this Grid world:

As said before we start off with initializing the value of each state to be zero:

Start with value be zero of each state

Then on next iteration: We will use the Bellman Expectation Equation as described above as an iterative fashion and we end up with each state value function be -1 (recall, -1 was the reward we got for taking every action):

First Iteration using Bellman Expectation Equation

Let’s take state highlighted in red for example: First, the reward of going from this state to another is -1. Now, the value of the next state from the previous iteration which is 0 as initialized by us. So, applying the Bellman Expectation Equation gives us the value of -1 for this state. Since we are using synchronous backup so we are going to consider every state our agent can go, that is, four here.

Let’s look at what happens in the next iteration:

Second iteration Values of state

Again, let’s see how the value of the state is -2 (in red): So, again we are doing the same thing. First, we will get a reward of -1 for leaving the state and then taking our value of the next state from the previous iteration (which is -1), then the value of this state becomes -2. And since it’s synchronous backup so we are doing it for every state our agent can land on and normalizing it, which gives us the value of that state being -2.

Let’s talk about the states next to terminal states:

Terminal state on Second iteration

Let’s talk about the state in red: So, from this also we are able to take four actions, but the only difference is that the value of the terminal state is always 0. So for every other step from this state yields a value of -2 but in the terminal state, it yields a value of -1. Adding them up which gives -7 and then normalizing it by 4 (as this is a synchronous backup) gives the value of this state to be -1.75.

Now, if we continue the iteration process, we end up with the value of states as follows:

Iteration Number Third
Iteration Number Ten
Iterating till it converges to Optimal Value function

In the last iteration, we have Optimal Value Function for each of these states (we are going to talk about convergence later).

This concludes our Iterative Evaluation Process. Now, what if we behave greedily to these values of the states?

Acting greedy means that out of all the states our agent can go, it chooses the one with greater value than the rest.

It turns out that if we start acting greedy w.r.t the values of the states we will eventually end up with an optimal policy.

Now, keeping both of these points in mind, let’s look at our first Dynamic Programming algorithm that is Policy Iteration.

Policy Iteration

This algorithm breaks the process as described earlier in two parts.

First, we will evaluate a Policy say π using Bellman Expectation Equation as described in the previous section and then we are going to act greedy to this evaluated value function to find us an Optimal Policy. We are going to iterate this process until we get our true value function.

Idea of Policy Iteration is in two steps:

  • Policy Evaluation (as described earlier)
Value Function Calculation
  • Acting greedy to the evaluated Value Function which yields a policy better than the previous one
Acting greedy to this function

We iterate these two processes until it converges to Optimal Value Function and Optimal Policy.

When after evaluating a policy (calculating Value function using Bellman Expectation Equation in an iterative way), we act greedy to that value function and because we are acting greedy it makes our policy deterministic.

Lets understand this in more depth :( In this Figure Up Arrows indicate Policy evaluation and Down arrow indicates that we are acting greedy to the value function)

Idea of Policy Iteration

What this Figure conveys is that we start with random values of each state (with zeroes) and a random policy and then we evaluate this policy using the Bellman Expectation equation as described earlier and then act greedy to the value function which gives us a new policy. Then we again evaluate this new policy and then again act greedy to the new value function. We continue this process until we find optimal Value function and the optimal policy for our Markov Decision Process.

Now, let’s dive a bit deeper into how we improve our policy..

Acting greedy means that pick an action, say a that would yield us maximum value in a particular state s. We can define this as:

Greedy Policy

Recall, qπ (s, a) tells us how good it is to take action an in state s or how much value we are going to get by taking action an in state s. Taking arg-max over qπ (s, a) means that we pick actions to say an in state s that maximizes the q-value [qπ (s, a)] or yield us maximum q-value [qπ (s, a)].

Well, does this make our policy better or not?

So, we are asking that whether taking the greedy action [π’(s)] is better than just following our policy π all along. Yes, taking first step greedy [π’(s)] and then following policy π is better or equal to just following policy π all along.

This can be expressed as:

Here, the left side first takes our greedy step [π’(s)] and then follows our policy π. Which is equal to taking the best action possible (which has maximum q-value). The right-hand side is just following a policy π all along, which is equal to saying the value function of a state. Now, you might be thinking why it is better than or equal to, than just following policy π all along?

See, when taking our greedy step (π’(s)) in the next step we are choosing the best action that can be taken from that state s. Now, on the right side, we are not taking any greedy step or deterministic step so there are chances that our agent might select the best action or might end up selecting the action which yields less value than the greedy one.

The above idea can be applied to every step:

Policy Improvement Theorem

So, this policy gets better on each step and the only time it stops is when π becomes our optimal policy. If improvement stops, then the greedy action value is equal to the current value of the state:

Improvement stops

So, if the value function Vπ (s) is equal to the greedy value function qπ (s, π’(s)) then Bellman Optimal Equation is Satisfied:

Bellman Optimal Equation

We have talked about above equation in much detail in part 2. So if this is satisfied, then our policy π is the optimal policy.

So, this is all about Policy iteration. First, we evaluate our policy using Bellman Expectation Equation and then act greedy to this evaluated value function which we have shown improves our policy over iteration.

Modified Policy Iteration

This is based on the idea that whether we need to wait until our policy evaluation needs to converge at Vπ. Should we stop this early?

If you remember in policy evaluation, we first evaluate policy and then act greedy respect for it. It turns out that after a few iterations we already have achieved our optimal policy. We don’t need to wait for our value function to converge. So, we introduce early stopping conditions.

One way to know when to stop is if our Value function during evaluation gets updated by little value. So, we can say if the value function gets updated by a very little amount, say less than an epsilon then we can stop this process.

Another way is we can manually stop the process of policy evaluation after k-iterations.

Now, let’s look at another Dynamic Programming algorithm: Value Iteration.

Principle of Optimality :

Before we look into it, let’s redefine our definition of Optimal Behavior. So, an Optimal policy can be subdivided into two components:

• An Optimal first step, say A

• And then following Optimal policy from the next state, say

a policy is said to be optimal if it behaves optimally after we have taken one optimal step, say action A.

Value Iteration

In Policy iteration, we saw that first, we evaluate a policy i.e find value function and then act greedy to it to construct a new policy. But, in Value Iteration we do not compute policy for every intermediate value function. Once, we have found the optimal value function, then we compute the optimal policy from our final optimal value function.

Value iteration is the special case for Policy Iteration. When the process of policy evaluation is stopped after one step.

For value Iteration, Bellman Optimal Equation is used. We start off with some random Value Function (say zero) and then we iterate this process using Bellman Optimal Equation. That is, we compute the new value function of a state by plugging in the values of the previous iteration in the Bellman Optimal Equation. We iterate this process until it converges to Optimal Value Function and then computes optimal policy from this optimal value function.

We use Bellman Optimal Equation in an iterative way for value iteration. This can be expressed as:

Value Iteration Update Rule

Let’s understand it with an example:

Backup Diagram for Value Iteration

Since we are using synchronous backup, so each state in our Markov Decision Process gets to be the root. Like before we first compute the value of root(s) using the Bellman Optimal equation [talked about in detail in the previous blog]. Then at iteration k+1, the value of state s is updated by plugging in the values from the previous iteration V [k (s’)]. So, we use the Bellman Optimal Equation in an iterative way. Doing this way, our value function converges to optimal value function.

This can be expressed as:

Bellman Optimal Equation

So, this is all about Value iteration.

Synchronous Dynamic Programming Algorithms

So, summing it all what we saw earlier, here is a summary of David Silver Lecture:

Summary of Algorithms