(Shallow?) Reinforcement Learning

Original article can be found here (source): Artificial Intelligence on Medium

(Shallow?) Reinforcement Learning

Recent feats such as AlphaGo’s victory over the world’s best Go player have brought reinforcement learning (RL) to the spotlight. However, what is RL and how does it achieve such remarkable results?

In this first article, we will explore the Monte Carlo Control Method (not the deep kind) which, despite being elegantly simple, is the basis upon which some of the most advanced RL is built.

The Basics

RL problems consist of (at least) 2 entities: The agent and the environment, as shown in the figure below. The environment gives the agent a state (also called an observation). The agent then chooses an action based on the provided state and applies it to the environment. The environment then replies to the action by giving the agent a reward (a score for the action).

Photo by Kelly Sikkema on Unsplash

For example, consider a kid (the agent) playing a game (the environment) for the first time. The kid starts by seeing the game screen containing all its elements (the state) and decides on an action to take. To which the game scores him (the reward) and the process reprises until the game ends (we consider environments with a clear termination episodic). After enough repetitions, the kid will start to understand how his actions influence the environment and (assuming he is a competitive child) choose the actions that maximizes his score.

An RL agent, attempts to do exactly the same. However, unlike a human child, computers don’t yet possess innate intelligence. So how does the computer learn what the best actions should be? For the remainder of the text we will distill this problem and re-derive an algorithm that provides a solution.

Framing the Problem

Monte Carlo Learning is a subfield of RL that is well suited for solving finite (limited number of states) Markov Decision Processes (MDPs). An MDP is just a class of problems for which knowledge of the current state provides sufficient information to decide on an optimal action to take. Chess is an example of a problem that can be framed as an MDP. The board state at any point in time contains all the information required to determine the next action.

Let’s consider the problem depicted below. We have a household floor cleaning robot (the agent) that has a battery that can be either LOW or HIGH in charge. The robot can be either CLEANING or DOCKED. Thus, the robot’s environment consists of 4 possible states: {CLEANING, HIGH}, {CLEANING, LOW}, {DOCKED, HIGH}, {DOCKED, LOW}. The robot controller always chooses between 3 actions: CLEAN, DOCK or WAIT, and the environment provides the agent with a reward for its action. However, since our battery sensor is not precise and our robot may get lost, our transitions are stochastic (probabilistic in nature).

Markov Decision Process (MDP) for floor cleaning robot.

For instance, if the robot is currently in the state of CLEANING with HIGH battery and chooses to continue cleaning it has an 80% chance of receiving a reward of 1 and continuing on the same state. At this point, it becomes important to introduce some notation. We express a probability (from 0 to 1) of a new state s′ and reward r given a previous state s and action a as: p(s′,r | s,a). Therefore, we could write our opening sentence as: p({cleaning,high},1 | {cleaning,high},clean)=0.8.

By inspection of the expected rewards from each path, it is possible to guess that an optimal solution to this environment is likely to be in the form of the blue paths shown below.

However, the agent has no information about the environment’s probabilities prior to interacting with the environment. So how can we train an agent to operate on this environment? In other words: How do we find a policy (represented as π) that the agent can follow (that maps a state to an action) such that it maximizes total reward?

Let the Agent Explore!

What if we start with an agent that simply chooses actions at random? Below is an example of the results from such an agent running over an episode consisting of 50 steps (the first 10 of which are shown below):

S0: Docked, High
A0: Wait
R1: 0
S1: Docked, High
A1: Wait
R2: 0
S2: Docked, High
A2: Wait
R3: 0
S4: Docked, High
A4: Wait
R5: 0
S5: Docked, High
A5: Clean
R6: 1
S6: Cleaning, Low
A6: Clean
R7: -3
S7: Docked, Low
A7: Clean
R8: -3
S8: Docked, Low
A8: Clean
R9: -3
S9: Docked, Low

How can we make use of the information we have just collected? To start, we can represent this data using a table. Where the rows will correspond to the states and the columns to actions. The entries would be the weighted-average expected return from taking the given action while in the given state under the current operating policy. With some notation: Q(s,a​)←Q(s​,a​)+α(G​−Q(s​,a​)). Where α is our weighting factor for the average and is often called the learning rate since a large alpha means the network learns faster as recent observations have a greater impact. The table below is termed a Q-table.

Q-table for 1 episode with 50 time steps and α=0.001.

If we repeat this process for enough episodes we would expect our table to tend towards the actual returns. However, for sufficiently large problems this random exploration can take too long to converge. What if we use the information already in the table to focus our exploration? Instead of always picking random actions we pick the actions that would maximize our return (termed greedy actions) based on the current values in the table. Under this proposed method, when running episode 2 if the agent were in the cleaning-high state it would always opt to dock. We know from the suggested solution in the previous section that this is likely not an optimal solution. It suggests, that this new approach may be susceptible to local minimas. What if we start with random exploration but over time (as more data is collected and our Q-table tends towards better approximations of the expected returns) we become more greedy?

This final idea is the concept behind the 𝜀-greedy algorithm where we pick the action in the table that maximizes our expected return with 1 — 𝜀 + 𝜀/n and the values that don’t maximize our return with 𝜀/n probability. We want to explore more in the beginning (when our Q-table is unlikely to be a good representation of the rewards) and narrow our exploration as our approximations improves with a larger number of episodes. A method to achieve this is to have an 𝜀 that decays (becomes smaller) with increasing number of episodes but is limited by some bound so that our algorithm is never fully greedy such as 𝜀 ← min(𝜀 * 𝜀_decay^num_episodes, 𝜀_min_bound). This notion is often referred to as greedy in the limit with infinite exploration (GLIE). Using this algorithm and running for 1000 episodes we end up with the Q table below.

Q-table for 1000 episodes with 50 time steps, α=0.001, 𝜀=1, 𝜀_decay=0.999 and 𝜀_min_bound=0.001.

Observe how choosing greedy actions based on this table is our optimal (blue) policy as shown earlier in the MDP figure.


We have derived a simple reinforcement learning algorithm based on Monte Carlo Control. However, what if the problem we are attempting to solve is not episodic but instead a continuing task (one that has no clear end)? In the next article we will explore Temporal-Difference (TD) Methods that are useful for this class of problems.


Udacity’s course in deep reinforcement learning provides a solid introduction to reinforcement learning.

The scripts used to generate the Q table in the example can be found here.