Fundamentals of Reinforcement Learning: Navigating Cliffworld with SARSA and Q-learning.

Source: Deep Learning on Medium

Fundamentals of Reinforcement Learning: Navigating Cliffworld with SARSA and Q-learning.

Welcome to GradientCrescent’s special series on reinforcement learning. This series will serve to introduce some of the fundamental concepts in reinforcement learning using digestible examples, primarily obtained from the” Reinforcement Learning” text by Sutton et. al, and the University of Alberta’s “Fundamentals of Reinforcement Learning” course. Note that code in this series will be kept to a minimum- readers interested in implementations are directed to the official course, or our Github.


Over the course of our articles covering the fundamentals of reinforcement learning at GradientCrescent, we’ve studied both model-based and sample-based approaches to reinforcement learning. Briefly, the former class is characterized by requiring knowledge of the complete probability distributions of all possible state transitions, and can be exemplified by Markovian Decision Processes. In contrast, sample-based learning methods allow for the determination of state values simply through repeated observations, without the need for transition dynamics. Within this domain, we’ve covered both Monte Carlo and Temporal Difference learning. Briefly, the two can be separated by the frequency of state-value updates: while a Monte Carlo approach requires that an episode be finished for a round of updates to take place, Temporal Difference approaches update intra-episode incrementally, using old estimations of state-values together with discounted rewards to generate new updates.

The rapid reactivity of TD or “online” learning approaches makes them suitable for highly dynamic environments, as the values of states and actions is continuously updated throughout time through sets of estimates. Perhaps most notably, TD is the foundation of Q-learning, a more advanced algorithm used to train agents tackling game environments such as those observed in the OpenAI Atari gyms, and the focus of this article.

Our previous policy-based Pong model, trained over 5000 episodes with a binary action space.

Beyond TD: SARSA & Q-learning

Recall that in Temporal Difference learning, we observed that an agent behaves cyclically in an environment, through sequence of States (S), Actions (A), and (Rewards).

Due to this cyclic behavior, we can update the value of the previous state as soon as we reach the next state. However, we can expand the scope of our training to include state-action values, just as we did with Markov Decision Processes prior. This is generally known as SARSA, an on-policy TD control algorithm for estimating action values. For a given policy, we continuously estimate the action-values across incremental timesteps, while using this knowledge to modify the policy towards an action-value greedy alternative. This is similar to vanilla TD control — we can compare the state-action and state-value TD update equations:

We can define SARSA more formally as follows.

Notice how the update process requires an initial action to be evaluated within the network. This is an intra-episode data generation step. During your own implementations, it may be more computationally efficient to run the environment for the first two tasks of the loop over many steps, in order to build up a large buffer of data.

Q-learning takes this a step further by forcing a selection of the action with the highest action value during an update, in a similar way to what’s observed with Bellman Optimality Equations. In practice, Q-learning can be thought of as an off-policy approach to TD, as the algorithm learns the optimal action values independent of the current policy being followed. We can inspect SARSA and Q-learning next to the Bellman and Bellman Optimality Equations we’ve covered previously, below:

We can define the Q-learning estimation process more formally.

You may be wondering about how ensure complete exploration of our state-action space, given the need to constantly select actions for a state with the highest existing action-value. In theory, we could be avoiding the optimal action simply by failing to evaluate it in the first place. To encourage exploration, we can use a decaying e-greedy policy, essentially forcing the agent to select an apparent sub-optimal action in order to learn more about its value, at a certain percentage of the time. By introducing a decaying process, we can limit exploration once all of the states have been evaluated, after which we’ll permanently select the optimal actions for each state.

We’ve covered an implementation of Q-learning on Atari’s Miss Pacman, but to keep this article more theoretical, let’s take a look at how SARSA and Q-learning approach the problem of navigating a Windy Gridworld.

Our previous implementation of OpenAI’s Pac-Miss Pac-man environment using deep Q-learning.

Cliffworld: Comparing SARSA & Q-learning

We’ve covered the Gridworld environment before in our Dynamic Programming article. Our new Cliffworld looks slightly different, and is shown below.

Essentially, we modify this original Gridworld into a (4 x 12) grid space with a start (S) and goal (G) positions at two opposite ends of the environment. Agents are only allowed to move a single grid at a time in one of the four cardinal directions, where every movement yields a reward of -1. Moreover, part of the bottom row is now taken up with a cliff, where a step into the area would yield a reward of -100, and an immediate teleport back into the start position.

So how do SARSA and Q-learning compare in this environment? Recall that Q-learning is off policy, and hence assumes its policy is already optimal. With an epsilon-greedy policy, Q-learning will eventually explore all of the possible actions while always attempting to choose the optimal action by action value, in order to achieve maximal reward. This means that it will attempt to follow right alongside the cliff (shown in red). However, as it is following an epsilon-greedy policy, it would occasionally fall off through random steps into the cliff area.

In contrast, an agent following SARSA must evaluate the actions available at each step under the constraint of their likelihood in the current policy. This means that an agent following SARSA will use the policy first to observe the likelihood of taking an action, evaluate the reward of that action, and then commit to another following action to update the Q-value of its current state-action pair accordingly. Alternatively, we can do this by taking an expected value, essentially taking a weighted sum of the transition dynamics multiplied by the value of each of state-action value, in a form known as Expected SARSA.

An idealized evaluation of Expected SARSA.

The benefits of using Expected SARSA to explicit expected value of the state-action values is in improving the convergence time: regular SARSA must follow a current policy, and hence may still choose a sub-optimal action, relying on multiple episodes of policy evaluation to eventually choose the optimal action. This results in a large variance in the update targets being observed. On the other hand, expected SARSA provides an averaged, more optimal target for convergence, eliminating this problem. This also results in the SARSA agent following the more conservative, blue path, within the Cliffworld environment.

So how do SARSA and Q-learning compare in performance within the Cliffworld environment? Let’s plot the episodic reward for both approaches.

Naturally, as the SARSA agent takes the longer path, it suffers from slightly higher penalization. However, as the epsilon-greedy policy of the Q-learning agent forces it to take occasional steps into the cliff area, this punishment averages out to reduce its performance. To address this, we could introduce a decaying epsilon-greedy policy, where the probability of taking a random action decreases with a higher number of episodes, allowing for Q-learning to converge to the same performance as that of the SARSA agent.

That wraps up this introduction to Q-learning. In our next article, we’ll move on from the world of Atari to tackling one of the most well known FPS games in the world. Stay tuned!

We hope you enjoyed this article, and hope you check out the many other articles on GradientCrescent, covering applied and theoretical aspects of AI. To stay up to date with the latest updates on GradientCrescent, please consider following the publication.


Sutton et. al, Reinforcement Learning

White et. al, Fundamentals of Reinforcement Learning, University of Alberta

Silva et. al, Reinforcement Learning, UCL