Summary of algorithms in Stanford Machine Learning (CS229) Part III

Source: Deep Learning on Medium

Reinforcement learning in general

In supervised learning, we try to make output mimic label y given input x. In this setting, the labels gave an unambiguous “right answer” for each input x. In reinforcement learning, since it involves a sequence of decision making, it hard to provide this type of explicit supervision. Instead we provide a reward function which indicates to the learning algorithm when it is doing well or poorly. You can think of our learning algorithm as an agent and it interacts with the environment. It knows its state and learns the action at each state, it can get reward from the environment. The end goal of our learning algorithm is to maximize expected cumulative reward.

Markov decision process (MDP)

MDP is defined by:

  • A set of states S. S can be finite or infinite
  • A set of actions A. A can be finite or infinite
  • Psa, the state transition probabilities. Psa gives the distribution over what states we will transition to if we take action a in state s
  • r is called discount factor which is in the range of 0 and 1
  • reward function R, which takes state or state and action and outputs a real value

The dynamics of an MDP is that we start in some state s0, choose action a0, based on Ps0a0, we transit to state s1 and so on. Depending on our reward function, the total payoff is:


Our goal in reinforcement learning is to choose actions over time so as to maximize the expected value of the total payoff. Intuitively, in order to maximize total payoff, we want to accrue positive rewards as soon as possible. and postpone negative rewards as long as possible.

Some terminologies that will be used often are:

  • A episodic task means the interaction between agent and environment ends at some time step T.
  • A continuing task means the interaction between agent and environment continues forever without limit.
  • A deterministic policy is a mapping that maps state to action. It is denoted as π : S -> A. In other words, our action at state s is: a = π(s).
  • A stochastic policy is a mapping from state and action to a probability denoted as: π : S, A-> [0,1]. The probability represents that the agent takes action a while in state s.

For each state, the value function (state value function) is the expected return if the agent started in a state and follow a policy for all time steps:

The action-value function is the expected return if the agent started in a state and a action following a policy:

Given a fixed policy π, its value function satisfies Bellman equation:

The first term in Equation (4) is an immediate reward while the second term is the sum of future discounted rewards starting in s’ where s’ is distributed according to Ps pi(s). Bellman equation can also be derived from Equation (3) directly:

Bellman equation can be used to solve value function for finite state MDP. Because we can write down bellman equation at every state, then we will have a linear system with |S| (number of states) number of equations and unknowns. So we can solve the value function at each state.

We can also define the optimal value function as:

So optimal value function is the one that gives the best sum of discounted rewards. There is also a version of Bellman’s equation for optimal value function:

We also define optimal policy as:

For every state and policy, we have:

The first term is optimal value function, the second term is value function for optimal policy. By definition, they are equal. The inequality is based on Equation (8). An interesting property about optimal policy is that it is the optimal policy for all the states s. This means we can use the same optimal policy no matter what the initial state of our MDP is.

Policy iteration and value iteration

Policy iteration and value iteration are used for solving finite-state MDP. Finite-state MDP means the number of states is finite.

The outline of policy iteration is as follows:

The goal of step 2.1 is to compute the value function for every state given policy π. A naive way is to solve |S| linear equations like we mentioned above. The other way is to use iterative policy evaluation which estimate V with the one-step dynamics of MDP.

The python implementation is as follows:

As can be seen, in order to compute V(s) for any state s, all we need to do is look at every action that we can take at state s and every next state s’ we can get after this action and sum up the current reward plus expected future reward like we did in Bellman’s equation. Because we only look at one-step further, this is why iterative policy is called one-step dynamics of MDP. The algorithm ends when the change of V(s) is smaller than the threshold theta we set before.

In terms of step 2.2, we update policy by going through all the actions in a state and pick the action (or all the actions) that gives the highest action-value. The policy improvement algorithm is outlined as follows:

The python implementation is as follows:

With iterative policy evaluation and policy improvement, the code to implement policy iteration is as follows:

There is another version of policy iteration called truncated policy iteration. The only difference is the stop criterion. Instead of waiting till the end where policy does not change, the truncated policy iteration set a limit to the number of iterations and stop updating once the total number of iterations reaches the threshold. The rest keeps the same.

That’s all about policy iteration. Value iteration can be viewed as a simplified and greedy version of policy iteration. It is outlined as follows:

If you compare it with Algorithm 1: Iterative policy evaluation, you will find the difference is only for computing V(s). In iterative policy evaluation, it looks at every action and next state given s, but in value iteration it only takes the action that lead to maximum action value. The python implementation is as follows:

Notice there are two ways to update value for each state (and action), one is called synchronous update where we compute new V(s) for every state and then do update for all states at the same time; the other is called asynchronous update where we update V(s) one at a time. It turns out both update method can let V converge to V*. In practice value iteration is used more often since it is faster than policy iteration.

Both policy iteration and value iteration are algorithms for model-based learning. Model based learning means we find optimal policy by maximizing total expected reward with known state transition and reward. In the next section we are going to talk about model-free algorithm (Monte Carlo and TD learning) which take exploratory actions and use gained experience to estimate state-value or action-value function.

Monte Carlo Learning

The policy iteration and value iteration we discussed above assumes that we know the state transition probabilities and reward. However, in most cases, we have to learn state transition probabilities and rewards directly from data. This is where Monte Carlo method of reinforcement learning comes in.

Notice that Monte Carlo method only applies to episodic MDPs since the episode has to terminate before we can calculate any returns.

Same as policy iteration and value iteration as we discussed above, Monte Carlo method also has two steps. Step 1 is to learn action-value function Q(s,a) from episodes of experience under a policy π; step 2 is to update policy based on action-value function AKA policy improvement.

There are 2 common ways to learn action-value function Q(s,a). First is called First visit Monte Carlo which is outlined as follows:

To understand the first visit Monte Carlo algorithm better, let’s do a simple example: Let’s assume our states are S = {M, N}, actions are A = {+, -}, discount factor is 1. Episode 1 is (M, +, -2), (N, -, 0), (N, -, 2), terminate. Episode 2 is (N, -, 1), (M, +, -3), (N,-, 5), terminate.

We want to compute Q(M, +). By the first time we saw (M, +) in episode 1, the expected return = -2 + 0 + 2 = 0. By the first time we saw (M, +) in episode 2, the expected return = -3 + 5 = 2. We take the average of 0 and 2 and finally our evaluation for Q(M, +) = 1. What Algorithm 4 does is that it maintains a running mean. Instead of holding expected return for each episode, it updates Q(s, a) on the fly. The formula is also easy to come up with:

So in order to find mean at kth iteration, all we need is the mean of (k-1)^th iteration, the current value at k and the k itself. So algorithm 4 keeps track of k by using N(S, A), G_t is like x_k in Equation (9), Q(S,A) is like u_k and u_{k-1} in Equation (9). These are all about first visit Monte Carlo.

Every visit Monte Carlo is almost the same as first visit Monte Carlo except that if the same state and action has appeared multiple times, we record the expected return G_t every time and average all of them together and set it to be Q(state, action). This time, we cannot use running mean but the idea is the same.

To update policy, we can do what outlined in Algorithm 2, basically take π(s) to be the action or actions a that gives maximum Q(s,a). A better way is to consider the tradeoff between exploration and exploitation in reinforcement learning. Exploration means to explore new actions that are different from our greedy option; exploitation means take all the information we have and apply greedy algorithm (take maximum) to determine our policy.

A good method to combine exploration and exploitation is called epsilon-greedy algorithm. The idea is that when determining policy, there is a chance (1-epsilon) that we execute based on policy π, there is a chance epsilon that we act randomly to explore the state space. Mathematically this is:

Combining policy evaluation and policy improvement together, we can outline Monte Carlo learning algorithm as follows:

Temporal-Difference (TD) Learning

The idea of TD learning is that instead of waiting till the end of episode to update value (expected return, action-value etc), it updates prediction at every step. It can also work in both episodic and continuous tasks.

One popular TD learning algorithm is called Sarsa (or Sarsa(0)). The algorithm can be outlined as follows:

Notice the update of Q(s,a)uses one-step dynamic like we did in Equation (5). Since we want update of prediction at every step and we cannot accumulate the sum of expected return in the continuous task. Hyper-parameter alpha controls the weight of update. If alpha = 1, we update old Q using new Q directly. If alpha = 0, we do not update Q anymore.

To understand Sarsa, we can look at a simple example. Say we have a sequence S0, A0, R1, S1, A1 | R2, S2, A2 | R3, S3, A3 |, …

We start at state S0, A0 can be a random choice and we gain reward R1 and reaches states S1. We pick another random action A1. Now we update Q(S0, A0) and policy π using epsilon-greedy. Based on the new policy, we can determine A2 and actions further. We will also update Q(S_t, A_t) and policy π after we have seen S_{t+1} and A_{t+1}. The name “Sarsa” comes from the first part of the sequence S, A, R, S, A.

This is all about Sarsa. Another popular algorithm similar to Sarsa is called Sarsamax AKA Q-learning. It can be outlined as follows:

To compare Sarsa and Q-learning. Sarsa follows the policy that it is learning. So in practice, it is more suitable for online-learning. The bad thing is that Q values might be affected by exploration since we during exploration we might fall into a bad action. Q-learning uses policy π to choose action and another greedy policy (take max) to update Q value. It is bad for online-learning because there is a disconnection between the policy we learned and the policy we are following. But the good thing is that Q value will not be affected by exploration.

There is another algorithm called expected Sarsa which updates Q(s,a) as follows:

Basically we use a weighted average with weight being the probability each action can be taken at state St.

Reinforcement learning in continuous space

In many applications, the state and/or action are continuous. For instance, in sweeping robot, its state which is the position in the environment is continuous rather than discrete. We cannot use algorithms we discussed above to solve this kind of problem directly since we all not able to store infinite number of V(s) or Q(s,a).

There are two main strategies to deal with continuous space. One is discretization, the other is function approximation.

Discretization means we set some intervals and all the values inside each interval is deemed as one value. We can also do non-uniform discretization. For instance, we can increase the granularity in a grid. After discretization, we can convert continuous space into discrete so that we can apply algorithms we discussed above. One drawback of discretization is curse of dimensionality. As the number of dimensions increases, our discretized states will grow exponentially. Usually discretization works well for 1d and 2d problems but it rarely works for problems higher than 6d.

For function approximation, a simple strategy is to use linear approximation. The idea is to introduce weight w and then we can approximate state value function and action value function as follows:

We can also apply kernels to the feature vector x(s) in order to learn non-linear features like we talked about in SVM.

We can update v(s,w) and q(s,a,w) using gradient descent. This is similar to what we did in supervised learning. More specifically, using the idea of first visit Monte Carlo algorithm as we described in algorithm 4 above, we can outline Monte Carlo with function approximation as follows:

The update rule for state value function is similar expect using v(s) rather than q(s,a):

Following the same idea, we can turn Sarsa into continuous task using linear approximation as follows:

Notice that the update rule of w for Sarsa uses one-step dynamic rather than sum of expected returns. Q-learning for continuous tasks is similar to Sarsa except the update rule:

Useful Resources and reference:

[CS229 official website]