Notice there are two ways to update the value for each state (and action). One is the synchronous update, where we compute the new V(s) for every state and then update all states at the same time; the other is the asynchronous update, where we update V(s) one state at a time. It turns out both update methods let V converge to V*. In practice, value iteration is used more often since it is faster than policy iteration.

Both policy iteration and value iteration are model-based learning algorithms. Model-based learning means we find the optimal policy by maximizing the total expected reward with known state transitions and rewards. In the next section we will talk about model-free algorithms (Monte Carlo and TD learning), which take exploratory actions and use the gained experience to estimate the state-value or action-value function.

The policy iteration and value iteration we discussed above assume that we know the state transition probabilities and rewards. However, in most cases we have to learn the state transition probabilities and rewards directly from data. This is where the Monte Carlo method of reinforcement learning comes in.

Notice that the Monte Carlo method only applies to episodic MDPs, since an episode has to terminate before we can calculate any returns.

Like policy iteration and value iteration as discussed above, the Monte Carlo method also has two steps. Step 1 is to learn the action-value function Q(s,a) from episodes of experience under a policy π; step 2 is to update the policy based on the action-value function, AKA policy improvement.

There are two common ways to learn the action-value function Q(s,a). The first is called first-visit Monte Carlo, which is outlined as follows:

To understand the first-visit Monte Carlo algorithm better, let's do a simple example. Assume our states are S = {M, N}, actions are A = {+, -}, and the discount factor is 1. Episode 1 is (M, +, -2), (N, -, 0), (N, -, 2), terminate. Episode 2 is (N, -, 1), (M, +, -3), (N, -, 5), terminate.

We want to compute Q(M, +). From the first time we see (M, +) in episode 1, the return is -2 + 0 + 2 = 0. From the first time we see (M, +) in episode 2, the return is -3 + 5 = 2. We take the average of 0 and 2, so our final estimate is Q(M, +) = 1. What Algorithm 4 does is maintain a running mean: instead of holding the return for each episode, it updates Q(s, a) on the fly. The formula is also easy to come up with:
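In symbols, if u_k denotes the mean of the first k returns x_1, …, x_k, the incremental-mean identity (the form Equation (9) takes) is:

```latex
u_k = \frac{1}{k}\sum_{i=1}^{k} x_i
    = u_{k-1} + \frac{1}{k}\left(x_k - u_{k-1}\right)
```

So each new sample only nudges the old mean by a fraction 1/k of the difference between the new value and the old mean.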

So in order to find the mean at the kth iteration, all we need are the mean at the (k-1)th iteration, the current value x_k, and k itself. Algorithm 4 keeps track of k using N(S, A); G_t plays the role of x_k in Equation (9), and Q(S, A) plays the role of u_k and u_{k-1} in Equation (9). That is all for first-visit Monte Carlo.
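The toy example above can be sketched in code. This is a minimal first-visit Monte Carlo evaluation with a running mean, using the two episodes from the example (variable names are illustrative, not from the original algorithm listing):

```python
from collections import defaultdict

# Toy episodes from the example above: lists of (state, action, reward).
episode1 = [("M", "+", -2), ("N", "-", 0), ("N", "-", 2)]
episode2 = [("N", "-", 1), ("M", "+", -3), ("N", "-", 5)]

Q = defaultdict(float)   # running mean of returns, Q(s, a)
N = defaultdict(int)     # visit counts N(s, a)
gamma = 1.0              # discount factor, as in the example

for episode in (episode1, episode2):
    # Compute the return G_t following each time step.
    G = 0.0
    returns = [0.0] * len(episode)
    for t in reversed(range(len(episode))):
        G = episode[t][2] + gamma * G
        returns[t] = G
    # First-visit: only the first occurrence of (s, a) in the episode counts.
    seen = set()
    for t, (s, a, _) in enumerate(episode):
        if (s, a) in seen:
            continue
        seen.add((s, a))
        N[(s, a)] += 1
        # Incremental mean: Q <- Q + (G - Q) / N, matching Equation (9).
        Q[(s, a)] += (returns[t] - Q[(s, a)]) / N[(s, a)]

print(Q[("M", "+")])  # averages the returns 0 and 2 -> 1.0
```

Running it reproduces the hand calculation: Q(M, +) = 1.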

Every-visit Monte Carlo is almost the same as first-visit Monte Carlo, except that if the same state-action pair appears multiple times in an episode, we record the return G_t at every occurrence and average all of them together to get Q(state, action). The same running-mean trick still applies; we simply increment N(s, a) at every visit rather than only the first.

To update the policy, we can do what is outlined in Algorithm 2: take π(s) to be the action (or actions) a that maximizes Q(s,a). A better way is to consider the tradeoff between exploration and exploitation in reinforcement learning. Exploration means trying new actions that differ from our greedy option; exploitation means using all the information we have and acting greedily (taking the maximum) to determine our policy.

A good method to combine exploration and exploitation is the epsilon-greedy algorithm. The idea is that when determining the policy, there is a chance 1 - epsilon that we act according to the greedy policy π, and a chance epsilon that we act randomly to explore the state space. Mathematically this is:
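A minimal sketch of epsilon-greedy action selection (the function and dictionary layout here are illustrative, not from the original algorithm listings):

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon):
    """Pick the greedy action with probability 1 - epsilon,
    a uniformly random action with probability epsilon.

    Q is a dict mapping (state, action) -> estimated value.
    """
    if random.random() < epsilon:
        return random.choice(actions)  # explore
    # exploit: act greedily with respect to the current Q estimates
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

# With epsilon = 0 the choice is purely greedy:
Q = {("M", "+"): 1.0, ("M", "-"): -0.5}
print(epsilon_greedy_action(Q, "M", ["+", "-"], epsilon=0.0))  # -> "+"
```

Note that epsilon is usually decayed over time so the agent explores early and exploits later.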

Combining policy evaluation and policy improvement together, we can outline Monte Carlo learning algorithm as follows:

Temporal-Difference (TD) Learning
The idea of TD learning is that instead of waiting until the end of an episode to update values (returns, action-values, etc.), it updates its prediction at every step. It can therefore work for both episodic and continuing tasks.

One popular TD learning algorithm is called Sarsa (or Sarsa(0)). The algorithm can be outlined as follows:

Notice that the update of Q(s,a) uses the one-step dynamics like we did in Equation (5), since we want to update the prediction at every step and cannot accumulate the sum of returns in a continuing task. The hyper-parameter alpha controls the weight of the update: if alpha = 1, we replace the old Q with the new target directly; if alpha = 0, we never update Q.
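The one-step Sarsa(0) update can be sketched as a small helper (a hypothetical function, with Q stored as a dict keyed by (state, action)):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """One Sarsa(0) step: Q(s,a) += alpha * (r + gamma * Q(s',a') - Q(s,a))."""
    td_target = r + gamma * Q.get((s_next, a_next), 0.0)
    td_error = td_target - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return Q

Q = {("s0", "a0"): 0.0, ("s1", "a1"): 1.0}
sarsa_update(Q, "s0", "a0", r=2.0, s_next="s1", a_next="a1", alpha=0.5, gamma=1.0)
print(Q[("s0", "a0")])  # 0 + 0.5 * (2 + 1 - 0) = 1.5
```

With alpha = 1 the new value would be the target itself (3.0); with alpha = 0 it would stay at 0, matching the discussion above.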

To understand Sarsa, we can look at a simple example. Say we have a sequence S0, A0, R1, S1, A1 | R2, S2, A2 | R3, S3, A3 |, …

We start at state S0; A0 can be a random choice, we gain reward R1, and we reach state S1. We pick another random action A1. Now we update Q(S0, A0) and the policy π using epsilon-greedy. Based on the new policy, we can determine A2 and later actions. In general, we update Q(S_t, A_t) and the policy π after we have seen S_{t+1} and A_{t+1}. The name "Sarsa" comes from the first part of the sequence: S, A, R, S, A.

That is all for Sarsa. Another popular algorithm similar to Sarsa is called Sarsamax, AKA Q-learning. It can be outlined as follows:

To compare Sarsa and Q-learning: Sarsa follows the policy that it is learning, so in practice it is more suitable for online learning. The downside is that the Q values might be affected by exploration, since during exploration we might take a bad action. Q-learning uses the policy π to choose actions but a separate greedy policy (taking the max) to update the Q values. It is worse for online learning because there is a disconnect between the policy we learn and the policy we follow, but the upside is that the Q values are not affected by exploration.

There is another algorithm called expected Sarsa, which updates Q(s,a) as follows:

Basically, we use a weighted average, with the weight of each action being the probability that it is taken at state S_t under the current policy.
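The expected-Sarsa target can be sketched like this (a hypothetical helper; `policy_probs` maps each action to its probability under π at the next state):

```python
def expected_sarsa_update(Q, s, a, r, s_next, policy_probs, alpha, gamma):
    """Expected Sarsa: the TD target averages Q(s', .) weighted by pi(. | s')."""
    expected_q = sum(p * Q.get((s_next, b), 0.0)
                     for b, p in policy_probs.items())
    td_error = r + gamma * expected_q - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return Q

Q = {("s1", "+"): 2.0, ("s1", "-"): 0.0}
expected_sarsa_update(Q, "s0", "+", r=1.0, s_next="s1",
                      policy_probs={"+": 0.5, "-": 0.5}, alpha=1.0, gamma=1.0)
print(Q[("s0", "+")])  # target = 1 + (0.5*2 + 0.5*0) = 2.0
```

Compared with Sarsa, this removes the variance from sampling A' while still following the policy in expectation.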

Reinforcement learning in continuous space
In many applications, the state and/or action space is continuous. For instance, for a sweeping robot, its state, namely its position in the environment, is continuous rather than discrete. We cannot directly use the algorithms we discussed above to solve this kind of problem, since we are not able to store an infinite number of V(s) or Q(s,a) values.

There are two main strategies to deal with continuous space. One is discretization, the other is function approximation.

Discretization means we set some intervals, and all the values inside each interval are treated as one value. We can also do non-uniform discretization; for instance, we can increase the granularity of the grid in regions that matter more. After discretization, we have converted the continuous space into a discrete one, so we can apply the algorithms discussed above. One drawback of discretization is the curse of dimensionality: as the number of dimensions increases, the number of discretized states grows exponentially. Usually discretization works well for 1-d and 2-d problems, but it rarely works for problems beyond roughly 6 dimensions.

For function approximation, a simple strategy is linear approximation. The idea is to introduce a weight vector w; then we can approximate the state-value and action-value functions as follows:
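With a feature vector x(s) for states (or x(s, a) for state-action pairs), the standard linear forms are:

```latex
\hat{v}(s, \mathbf{w}) = \mathbf{x}(s)^{\top}\mathbf{w}, \qquad
\hat{q}(s, a, \mathbf{w}) = \mathbf{x}(s, a)^{\top}\mathbf{w}
```

Learning then reduces to adjusting the finite-dimensional weight vector w instead of storing a value for every state.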

We can also apply kernels to the feature vector x(s) in order to learn non-linear features, like we discussed for SVMs.

We can update v(s, w) and q(s, a, w) using gradient descent, similar to what we did in supervised learning. More specifically, using the idea of the first-visit Monte Carlo algorithm described in Algorithm 4 above, we can outline Monte Carlo with function approximation as follows:
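For the linear case, the gradient of q(s, a, w) = x(s, a)·w with respect to w is just x(s, a), so the Monte Carlo gradient step toward the sampled return G_t can be sketched as (a hypothetical helper, plain Python lists for vectors):

```python
def mc_linear_update(w, x, G, alpha):
    """One gradient Monte Carlo step for linear q(s, a, w) = x . w:
    w <- w + alpha * (G - x.w) * x, since grad_w(x.w) = x."""
    q_hat = sum(wi * xi for wi, xi in zip(w, x))
    return [wi + alpha * (G - q_hat) * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0]
w = mc_linear_update(w, x=[1.0, 2.0], G=3.0, alpha=0.1)
print(w)  # q_hat = 0, so w = [0.1*3*1, 0.1*3*2] = [0.3, 0.6]
```

This is the function-approximation analogue of the tabular running-mean update: the prediction error (G - q_hat) replaces (G - Q), and the feature vector decides which weights move.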

The update rule for the state-value function is similar, except it uses v(s) rather than q(s,a):

Following the same idea, we can adapt Sarsa to continuous spaces using linear approximation as follows:

Notice that the update rule for w in Sarsa uses the one-step dynamics rather than the sum of returns. Q-learning for continuous spaces is similar to Sarsa except for the update rule:
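The Q-learning variant of the linear update replaces Q(s', a') in the target with the max over next actions. A minimal sketch (hypothetical helper; `next_features` holds x(s', a') for each available action a'):

```python
def q_learning_linear_update(w, x_sa, r, next_features, alpha, gamma):
    """Q-learning with linear approximation:
    w <- w + alpha * (r + gamma * max_a' x(s',a').w - x(s,a).w) * x(s,a)."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    max_next_q = max(dot(w, x) for x in next_features) if next_features else 0.0
    td_error = r + gamma * max_next_q - dot(w, x_sa)
    return [wi + alpha * td_error * xi for wi, xi in zip(w, x_sa)]

w = [0.5, 0.0]
w = q_learning_linear_update(w, x_sa=[1.0, 0.0], r=1.0,
                             next_features=[[1.0, 1.0], [0.0, 2.0]],
                             alpha=0.5, gamma=1.0)
print(w)  # td_error = 1 + 0.5 - 0.5 = 1.0, so w = [1.0, 0.0]
```

For Sarsa one would instead pass the single feature vector of the action actually taken at s', preserving the on-policy/off-policy distinction discussed earlier.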

Useful resources and references:

[CS229 official website](http://cs229.stanford.edu/syllabus-autumn2018.html)