Source: Deep Learning on Medium

# Illustrating Online Learning through Temporal Differences

## Fundamentals of Reinforcement Learning

# Introduction

Over the course of our articles covering the fundamentals of reinforcement learning at GradientCrescent, we’ve studied both model-based and sample-based approaches to reinforcement learning. Briefly, the former class requires knowledge of the complete probability distributions of all possible state transitions, and is exemplified by dynamic programming on Markov Decision Processes. In contrast, sample-based learning methods estimate state values purely from repeated observations, eliminating the need for known transition dynamics. In our last article, we discussed how Monte Carlo approaches determine the values of different states and actions simply by sampling the environment.

More generally, Monte Carlo approaches belong to the offline family of learning approaches, in that they update the values of states only once the terminal state is reached, that is, at the end of an episode. While this may be sufficient for many controlled or simulated environments, it would be woefully inadequate for applications that require rapid adaptation, such as the training of autonomous vehicles. In such a setting, delaying updates to the state values until an episode ends could mean that the agent fails to correct its behavior before an accident occurs, with an unacceptable loss of life or property.
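To make the end-of-episode constraint concrete, here is a minimal sketch of a constant-step-size, every-visit Monte Carlo update. The function name `mc_update`, the state labels, and the values of `alpha` and `gamma` are illustrative assumptions rather than anything prescribed by this series. The point to notice is that the function needs the complete episode before it can change a single value.

```python
def mc_update(values, episode, alpha=0.1, gamma=1.0):
    """Every-visit Monte Carlo update, applied only after an episode ends.

    `episode` is a list of (state, reward) pairs, where `reward` is the
    reward received after leaving `state`. All names are hypothetical.
    """
    g = 0.0  # return, accumulated backwards from the final step
    for state, reward in reversed(episode):
        g = reward + gamma * g              # return observed from this state
        v = values.get(state, 0.0)          # unseen states start at 0
        values[state] = v + alpha * (g - v)  # move estimate toward the return
    return values


# A toy three-step episode that ends with a reward of 1.
values = mc_update({}, [("A", 0.0), ("B", 0.0), ("C", 1.0)])
```

Because the return `g` is only known once the episode has terminated, no state value can change while the agent is still acting.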

As such, the majority of reinforcement learning algorithms in use today are classified as online learning. In other words, the values of states and actions are continuously updated over time using successive estimates. Temporal difference learning is the classic example of this approach, and it is the foundation of more advanced algorithms used to train agents in game environments such as the Atari environments in OpenAI Gym.

# Temporal Difference Learning

Just as in Monte Carlo, Temporal Difference (TD) learning is a sampling-based method, and as such does not require knowledge of the model in order to estimate its value functions. However, unlike Monte Carlo approaches, TD is an online method, relying on intra-episode updates at incremental timesteps. At the core of temporal difference learning is an incremental update of the value of a state *St* that relies on another estimate (“bootstrapping”) and features a TD error, the bracketed term below:

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$

Here α is the step size and γ is the discount factor.

Notice the introduction of the two different timesteps (*t* and *t+1*) in the TD update function. The TD error contains the sum of the reward received at the next timestep and the (discounted) current estimate for state *St+1*, with the current estimate of state *St* subtracted from this sum. Essentially, we update the estimate of a state with another estimate obtained at a later timestep, in a manner reminiscent of the gradient descent updates observed previously for neural networks.
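The update described above can be sketched in a few lines of Python. This is a minimal tabular illustration, not a definitive implementation: the function name `td0_update`, the dictionary-based value table, the toy transitions, and the choices `alpha=0.1` and `gamma=1.0` are all assumptions made for the example.

```python
def td0_update(values, state, reward, next_state,
               alpha=0.1, gamma=1.0, terminal=False):
    """Apply one TD(0) update to `values` in place and return the TD error.

    Implements V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)).
    All names here are hypothetical and chosen for illustration.
    """
    v = values.get(state, 0.0)
    # The value of a terminal state is defined to be zero.
    v_next = 0.0 if terminal else values.get(next_state, 0.0)
    td_error = reward + gamma * v_next - v
    values[state] = v + alpha * td_error
    return td_error


# Toy chain A -> B -> C -> (terminal), with a reward of 1 on the last step.
transitions = [
    ("A", 0.0, "B", False),
    ("B", 0.0, "C", False),
    ("C", 1.0, None, True),
]

values = {}
for _ in range(2):  # replay the same episode twice
    for state, reward, next_state, terminal in transitions:
        # The update happens immediately after each step, mid-episode.
        td0_update(values, state, reward, next_state, terminal=terminal)
```

After the first pass only `C` has moved (to 0.1), since it is the only state that saw a reward. On the second pass `B` picks up part of that value through the estimate for `C`, which is the bootstrapping effect: information flows backwards one step at a time, without waiting for a full return.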

How does this work in practice? Consider the sequence of states (S), actions (A), and rewards (R)