Fundamentals of Reinforcement Learning: Illustrating Online Learning through Temporal Differences

Source: Deep Learning on Medium

Fundamentals of Reinforcement Learning: Illustrating Online Learning through Temporal Differences

Welcome to GradientCrescent’s special series on reinforcement learning. This series will serve to introduce some of the fundamental concepts in reinforcement learning using digestible examples, primarily obtained from the” Reinforcement Learning” text by Sutton et. al, and the University of Alberta’s “Fundamentals of Reinforcement Learning” course. Note that code in this series will be kept to a minimum- readers interested in implementations are directed to the official course, or our Github. The secondary purpose of this series is to reinforce (pun intended) my own learning in the field.


Over the course of our articles covering the fundamentals ofreinforcement learning at GradientCrescent, we’ve studied both model-based and sample-based approaches to reinforcement learning. Briefly, the former class is characterized by requiring knowledge of the complete probability distributions of all possible state transitions, and can exemplified by Markovian Decision Processes. In contrast, sample-based learning methods allow for the determination of state values simply through repeated observations, eliminating the need for transition dynamics. In our last article, we discussed the applications of Monte Carlo approaches in determining the values of different states and actions simply through environmental sampling.

More generally, Monte Carlo approaches belong to the offline family of learning approaches, insofar as to allow updates to the values of states only when the terminal state is reached, or at the end of an episode. While this approach may seem sufficient for many controlled or simulated environments, it would be woefully inadequate for applications requiring rapid changes, such as in the training of autonomous vehicles. The use of offline learning for such applications could possibly result in an accident, as a delay in updating the state values would result in an unacceptable loss of life or property.

Different agents playing Breakout via OpenAI’s gym utilizing DQN, an online-learning approach based on TD.

As such, the majority of reinforcement learning algorithms in use today are classified as online learning. In other words, the values of states and actions is continuously updated throughout time through sets of estimates. This is also known as temporal difference learning, and is the foundation of more advanced algorithms that are used to train agents tackling game environments such as those observed in the OpenAI Atari gyms.

Temporal Difference Learning

Just as in Monte Carlo, Temporal Difference Learning (TD) is a sampling-based method, and as such does not require knowledge of the model in order to estimate its value functions. However, unlike Monte Carlo approaches, TD is an online method, relying on intra-episode updates with incremental timesteps. At the core of temporal difference learning is a incremental update function (“bootstrapping”) featuring a TD error, shown in red at the bottom :

Monte Carlo (top) versus Temporal Difference (bottom) update functions.

Notice the introduction of the two different timesteps (t and t+1) in the TD update function.The TD error contains the sum of the return at the next timestep and the current estimate for state St+1, with the value of the previous state St subtracted from this sum. Essentially, we update the estimate of a state with another estimate obtained at a later time-step, in a facsimile gradient descent observed previously for neural networks.

How does this work in practice? Consider the sequence of States (S), Actions (A), and (Rewards)

Due to this cyclic behavior, we can update the value of the previous state as soon as we reach the next state. Defined more formally,

Temporal Difference Learning versus Monte Carlo

To best illustrate the difference between online versus offline learning, consider consider the case of predicting the duration of trip home from the office, introduced in the Reinforcement Learning Course at the University of Alberta. At each location or state named below, the predicted remaining time is shown within the circle in minutes, with the actual true elapsed time shown in between each state.

Locations, predictions, and time elapsed on a trip home from the office

Let’s use a Monte Carlo approach first. We’ll wait for the episode to finish in order to acquire the actual total elapsed time of the trip, and then update each of our states by updating the value with the return. So in the case of the departure point (leave), we can accumulate our return (with no discount factor in this case), and update its value at the end of the episode as follows:

Similarly, we can update the next state’s estimate of 35 by using the real elapsed return of 38:

We then repeat this process for each of the intermediate destinations in turn, and achieve a final update as follows:

As such, our state values now better reflect the actual elapsed time of our trip. However, note how we had to wait until the end of our journey to perform our updates. What if our trip featured an objective of fetching a package at a particular time at an intermediate destination? A delayed update may result in significant delays.

Let’s repeat our estimation using temporal difference analysis. Using our new estimate and actual time elapsed, we can update our previous initial estimate using temporal difference analysis. Starting from the office (t), estimate that it will take us 30 minutes to reach home. However, after reaching the “exit” state in five minutes (t+1), we observe that we’re behind schedule, and so update the time remaining from timestep t to 40 minutes.

Continuing on, we take 15 minutes to reach “exit highway”, from which we estimate it’ll take another 15 minutes to reach home. As this is faster than we expected, we can then use this incremental return to update our previous estimate:

We can then repeat this for our entire trip.

Comparing the two approaches, it’s clear that temporal difference analysis allows access into intra-episode optimization, increasing the reactivity of our agent to better converge to finding optimal policies in a minimum amount of time. In a autonomous driving application, this would allow us to monitor and evaluate the performance of an agent at a much earlier point in time, and allow us to make adjustments more rapidly, preventing unnecessary accidents from exploration.

That wraps up this introduction to Temporal Difference Analysis. In our next tutorial, we’ll build upon this foundation and introduce TD for state-action values through SARSA and Q-learning, the latter a well-known online learning approach used to optimize agent policy in a variety of game environments.

We hope you enjoyed this article, and hope you check out the many other articles on GradientCrescent covering applied AI. To stay up to date with the latest updates on GradientCrescent, please consider following the publication.


Sutton et. al, Reinforcement Learning

White et. al, Fundamentals of Reinforcement Learning, University of Alberta

Silva et. al, Reinforcement Learning, UCL