Reinforcement Learning: Temporal Difference Learning — Part 1

Over the last articles we have moved more and more from theory into practice. In the last two articles, Monte Carlo methods were used to solve the prediction problem and the control problem in reinforcement learning.

Following up on the Monte Carlo Methods, in this article we will look at another method called Temporal Difference (TD) Learning.

TD learning is a central and novel idea in reinforcement learning. It can be seen as a combination of the other two core elements, Monte Carlo Methods (MC) and Dynamic Programming (DP).

Like Monte Carlo methods, TD can learn from raw experience without a model of the environment. Like DP, TD updates its estimates based in part on other learned estimates, without waiting for the final outcome as MC methods do.

Just like Monte Carlo Methods, TD Methods are discussed in two articles.

In this first part, I want to cover the TD prediction problem, the TD error, the advantages of TD prediction, and the optimality of TD(0).

So then, let’s start with the prediction problem …

TD for the prediction problem

TD learning uses experience to solve the prediction problem, just like Monte Carlo methods. Both learning methods update the value V of a nonterminal state s that occurs in that experience. However, they differ in how they estimate the target value. (What is the value function?)

MC prediction update: V(S_t) <- V(S_t) + alpha * [G_t - V(S_t)]

Monte Carlo methods need to wait until the end of the episode to determine the increment to V(S_t) because only then is the return G_t known, whereas TD methods just have to wait until the next time step. At time t+1 they immediately form a target and make a useful update using the observed reward R_(t+1) and the estimate V(S_(t+1)).

TD(0) prediction update: V(S_t) <- V(S_t) + alpha * [R_(t+1) + gamma * V(S_(t+1)) - V(S_t)]

In the equations shown above, you can nicely see the difference in the choice of target value for the update step. MC uses the return G_t as the target, whereas the target for TD, in the case of TD(0), is R_(t+1) + gamma * V(S_(t+1)). Because it only uses the value of the next state to calculate the target, it is called one-step TD or TD(0).
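To make the update rule concrete, here is a minimal sketch of tabular TD(0) prediction in Python. This is only a sketch, not the exact code from the repo; the Gym-style environment interface (reset()/step()) and the policy(state) callable are assumptions for illustration.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0) prediction of v_pi for a fixed policy.

    Assumes a Gym-style env (reset()/step(action)) and a
    policy(state) -> action callable.
    """
    V = defaultdict(float)  # state-value estimates, initialized to 0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            # TD(0) target: R_(t+1) + gamma * V(S_(t+1)); terminal states count as 0
            target = reward + (0.0 if done else gamma * V[next_state])
            # move V(S_t) a step of size alpha towards the target
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```

The only difference from a constant-alpha MC update is the target: the return G_t is replaced by R_(t+1) + gamma * V(S_(t+1)), so the update can already be made at time t+1.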

There are other methods, like n-step TD or TD(lambda), that build on top of this simple one-step TD update, but these will be introduced in later articles.

TD and MC updates are referred to as sample updates because they involve looking ahead to a sample successor state, using the value of the successor and the reward along the way to compute a backed-up value, and then updating the value of the original state accordingly. Sample updates differ from the expected updates of DP methods in that they are based on a single sample successor rather than on a complete distribution of all possible successors.

Important to notice is that the quantity

delta_t = R_(t+1) + gamma * V(S_(t+1)) - V(S_t)

in the TD(0) update is an error, measuring the difference between the estimated value V(S_t) and the better estimate R_(t+1) + gamma * V(S_(t+1)). This quantity is therefore called the TD error, and it arises in various forms everywhere in the world of reinforcement learning.

It should also be kept in mind that the TD error depends on the next state and the next reward, which means it is actually only available at the next time step.
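Written as a small helper function (a sketch with hypothetical naming, not code from the repo), it becomes clear that the TD error can only be evaluated once the transition to S_(t+1) has been observed:

```python
def td_error(V, state, reward, next_state, gamma=1.0, done=False):
    """delta_t = R_(t+1) + gamma * V(S_(t+1)) - V(S_t), with terminal states valued 0."""
    return reward + (0.0 if done else gamma * V.get(next_state, 0.0)) - V.get(state, 0.0)
```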

As in the Monte Carlo methods article, I implemented the basic algorithms for TD learning as well. The results and the implementation can be found in the GitHub repo:

The image below shows the predicted state values of the Gridworld environment after training:

Advantages of TD prediction

So now, if you read and worked through the code of the last article about MC, you know both methods, MC and TD, for prediction. It would be interesting to know where the advantages of each method lie. TD methods learn their estimates in part on the basis of other estimates, which means they bootstrap. But is this bootstrapping a good thing to do, learning a guess from another guess?

If we compare the update steps, TD lends itself more naturally to learning in an online way, since we only have to wait for the next time step and can then directly update the value of the previous state. As we know, MC methods have to wait until the episode has finished, because only then is the total return known.

If you think about it, this is a crucial point for a lot of tasks. Imagine that one episode in your task / environment is very long, which means all learning is heavily delayed and drawn out. And now consider environments with continuing tasks: they have no episodes at all.

These are significant difficulties for MC methods. It should also be remembered that some MC methods have to ignore or discount episodes on which experimental actions are taken, which can greatly slow down learning. All of this makes TD methods much handier, since they learn from every single transition.

But what about the problem of TD methods learning a guess from a guess? Are TD methods robust and solid? Do they converge?
Happily, if you visit the notebook on GitHub for the TD(0) implementation or look closely at the results posted above, we can say yes, TD(0) methods do converge! For any fixed policy pi, TD(0) has been proved to converge to v_pi.

A logical next question would be: which method, MC or TD, converges/learns faster?

Answering this question in general is not possible at the moment: it is an open question in the sense that no one has been able to prove mathematically that one method converges faster than the other. It is even difficult to phrase this question in the most appropriate formal way.

Still, in practice TD methods have usually been found to converge faster than MC methods on stochastic tasks. Also consider the better applicability of TD methods to tasks with long episodes or continuing tasks!

Optimality of TD(0)

To understand the optimality of TD(0) methods, their differences from MC, and their faster convergence, let's consider the example of batch updating.

First, what is batch updating? Imagine we have interacted with an environment for 3 episodes. For each learning step we treat all experience collected so far as one batch and present it to the algorithm repeatedly until the value function converges. Only after such a learning step is one new episode played and added to our memory. Based on our example, the learning procedure would be:
– learn from the batch ep1, ep2, ep3
– play ep4
– learn from the batch ep1, ep2, ep3, ep4
– play ep5
– …
It is called batch updating because updates to the value function are only made after processing each complete batch of training data. That is, the value function is changed only once, by the sum of all increments.
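A rough sketch of such a batch TD(0) learning step could look as follows, assuming each episode is stored as a list of (state, reward, next_state, done) transitions; the function name and the convergence check are illustrative choices, not the code from the repo.

```python
from collections import defaultdict

def batch_td0(episodes, alpha=0.01, gamma=1.0, tol=1e-6):
    """Batch TD(0): sweep repeatedly over all stored episodes, changing the
    value function only once per sweep, by the sum of all increments.

    `episodes` is a list of episodes, each a list of
    (state, reward, next_state, done) transitions.
    """
    V = defaultdict(float)
    while True:
        increments = defaultdict(float)
        for episode in episodes:
            for state, reward, next_state, done in episode:
                target = reward + (0.0 if done else gamma * V[next_state])
                increments[state] += alpha * (target - V[state])
        # apply the summed increments in one go
        for state, inc in increments.items():
            V[state] += inc
        if max(abs(inc) for inc in increments.values()) < tol:
            break
    return dict(V)
```

An outer training loop would then alternate between calling batch_td0 on all episodes collected so far and playing one more episode. Batch MC works the same way, except that the target in the inner loop is the observed return G_t instead of R_(t+1) + gamma * V(S_(t+1)).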

Under this batch updating, MC and TD(0) each converge deterministically to a single answer, but to different answers. Below you can see the results comparing MC and TD(0) under batch updating on a random walk prediction problem, based on their RMS error with respect to the true value function v_pi.

We know that MC uses sample averages of the actual returns experienced after visiting each state s. These are optimal estimates in the sense that they minimize the RMS error on the training data.
But how is it possible that TD(0) performed better than this optimal method? The answer is that MC methods are only optimal in a limited way, and TD methods are optimal in a way that is more relevant to predicting returns. To understand this more clearly, let's consider the following prediction example:

Imagine we have 8 episodes of experience, given as (state, reward) pairs of a Markov reward process, and we want to estimate the value function.
Ep1: (A,0), (B,0)
Ep2: (B,1)
Ep3: (B,1)
Ep4: (B,1)
Ep5: (B,1)
Ep6: (B,1)
Ep7: (B,1)
Ep8: (B,0)
Calculating the value for state B seems intuitively easy: V(B) = ¾. But what would be the value of A? There are two reasonable answers.

The answer for the TD method would be ¾ as well: we have already decided that B has value ¾, and from state A we always move directly to B with a reward of 0, therefore A must have the value ¾ as well. One way of viewing this answer is that it is based on first modeling the Markov process (shown below) and then computing the correct estimates given the model, which indeed in this case gives V(A) = ¾. You can also look at the update step of TD methods.

The answer that MC would give is 0. We have seen A once, and the total return that followed it until the episode ended was 0; therefore the estimate is V(A) = 0. This is also the answer that gives the minimum RMS error on the training data. Despite the fact that MC reaches an error of 0 on the training data, we still consider the first answer (TD) as better: if the process is Markov, we expect the first answer to produce a lower error on future data, even though the MC answer is better on the existing data.
Batch MC methods always find the estimates that minimize the mean-squared error on the training set, whereas batch TD(0) always finds the estimates that would be exactly correct for the maximum-likelihood model of the Markov process.
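Both answers can be checked with a few lines of code. The sketch below (again an illustration with hypothetical names, reusing the batch_td0 sketch from above) encodes the eight episodes as transitions and compares the batch MC estimate, the sample average of observed returns, with the batch TD(0) estimate.

```python
from collections import defaultdict

# The eight episodes as (state, reward, next_state, done) transitions:
# Ep1 is A -> B with reward 0, then B terminates with reward 0;
# Eps 2-7 are B terminating with reward 1; Ep8 is B terminating with reward 0.
episodes = (
    [[("A", 0, "B", False), ("B", 0, None, True)]]
    + [[("B", 1, None, True)] for _ in range(6)]
    + [[("B", 0, None, True)]]
)

# Batch MC answer: average the (undiscounted) returns that followed each visit.
returns = defaultdict(list)
for episode in episodes:
    rewards = [r for _, r, _, _ in episode]
    for t, (state, _, _, _) in enumerate(episode):
        returns[state].append(sum(rewards[t:]))  # return G_t from time t onwards
V_mc = {s: sum(g) / len(g) for s, g in returns.items()}
print(V_mc)                 # {'A': 0.0, 'B': 0.75}

# Batch TD(0) answer: V(A) inherits V(B), because every observed transition
# from A goes to B with reward 0.
print(batch_td0(episodes))  # both A and B end up at about 0.75
```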

Given such a model, we can compute the estimate of the value function that would be exactly correct if the model were exactly correct. This is called the certainty-equivalence estimate because it is equivalent to assuming that the estimate of the underlying process was known with certainty rather than being approximated. In general, batch TD(0) converges to the certainty-equivalence estimate.
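As a sketch of the certainty-equivalence idea (with hypothetical names, assuming the same transition format as above), one can build the maximum-likelihood model from the batch, i.e. the empirical transition probabilities and expected one-step rewards, and then evaluate that model by simple iteration:

```python
from collections import defaultdict

def certainty_equivalence_values(episodes, gamma=1.0, sweeps=1000):
    """Value estimates that would be exactly correct if the maximum-likelihood
    model built from the observed transitions were exactly correct."""
    next_counts = defaultdict(lambda: defaultdict(int))  # next_counts[s][s'] (None = terminal)
    reward_sum = defaultdict(float)                      # total reward observed when leaving s
    visits = defaultdict(int)                            # number of times s was left
    for episode in episodes:
        for state, reward, next_state, done in episode:
            visits[state] += 1
            reward_sum[state] += reward
            next_counts[state][None if done else next_state] += 1

    # Iterative policy evaluation on the estimated model:
    # V(s) = E[R] + gamma * sum_s' P(s'|s) * V(s')
    V = defaultdict(float)
    for _ in range(sweeps):
        for s in visits:
            expected_next = sum(
                (count / visits[s]) * (0.0 if s2 is None else gamma * V[s2])
                for s2, count in next_counts[s].items()
            )
            V[s] = reward_sum[s] / visits[s] + expected_next
    return dict(V)
```

For the eight episodes above this gives V(B) = 6/8 = ¾ and, since every observed transition from A leads to B with reward 0, V(A) = 0 + V(B) = ¾, exactly the answer that batch TD(0) converges to.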

I hope this helps you understand why TD methods converge more quickly than MC methods. In batch form, TD(0) is faster than MC methods because it computes the true certainty-equivalence estimate.

With that I want to close the first part on Temporal Difference Learning. I hope you were able to learn something or gain new insights. If you want to be informed about the next part on Temporal Difference Learning or want to support my work in general, you are welcome to follow me
on Medium | GitHub | LinkedIn.