Model-Free Reinforcement Learning Algorithms (Part-1)


Recently I started a reinforcement learning course named Move37 by Siraj Raval, and I have just finished the third week, which covers Monte Carlo (MC) reinforcement learning methods. Below I summarise the MC prediction methods used in reinforcement learning; in the next parts I will cover MC control and temporal-difference (TD) learning methods, each with code examples. So let’s dive in.

Prerequisite: In order to understand the concepts discussed in this article, one should have at least a basic idea of reinforcement learning (RL), Markov decision processes (MDPs), Bellman equations and Dynamic Programming (DP). For a quick overview of these topics, one can go through episodes 1–4 of this blog post series.

After learning about RL, MDPs and DP, one understands that the central idea in RL is to find an optimal policy, i.e. a policy that maximises the expected cumulative reward returned by the environment (env). So every RL algorithm (algo) has two tasks: finding an optimal policy, i.e. the best way to behave in an MDP (called control in RL jargon), and determining how effective a given policy is, i.e. calculating the value functions of the MDP under that policy (called prediction or policy evaluation in RL jargon). Evaluating the effectiveness of policies is important because only then can one judge whether a policy is optimal.

While applying these algos there can be different constraints on the agent and the MDP. For example, if we are training an RL agent to fly a helicopter, we don’t know exactly from which direction the wind will blow in the next moment, i.e. we don’t know the dynamics of the MDP (or env), that is its transition probabilities and rewards, or the dynamics are known but too large to use except through samples. Algos that deal with such constraints are called model-free algos (free of the env’s dynamics), as opposed to model-based algos, which require a completely known MDP. DP methods such as policy iteration and value iteration are popular for solving (i.e. finding an optimal policy for) completely known MDPs or envs.

In real-life problems we usually don’t know the dynamics of the env, so model-free algos are the most useful and practical. Model-free algos work on episodes sampled from the env, i.e. sequences of sampled (state, action, reward) tuples. The two types of model-free algos we will cover, here and in the next blog post, are MC and TD learning respectively. In this blog post we will first discuss MC prediction methods and then MC control methods.
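To make the idea of a sampled episode concrete, here is a minimal Python sketch. The env interface (reset() returning a state, step(action) returning a next state, a reward and a done flag) and the names used are my own assumptions for illustration, not code from the course:

```python
def sample_episode(env, policy):
    """Roll out one complete episode under `policy`.

    Assumes a simple env interface: env.reset() -> state and
    env.step(action) -> (next_state, reward, done). Returns the episode
    as a list of (state, action, reward) tuples ending at the terminal state.
    """
    episode = []
    state = env.reset()
    done = False
    while not done:
        action = policy(state)                      # policy maps state -> action
        next_state, reward, done = env.step(action)
        episode.append((state, action, reward))
        state = next_state
    return episode
```

Model-free methods only ever touch the env through rollouts like this one; they never read the transition probabilities or the reward function directly.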

Monte Carlo Prediction (or MC Policy Evaluation)

In prediction algorithms our main aim is to find the value function for a given policy. DP achieves this through iterative application of the Bellman expectation equation. MC prediction achieves it by
a) sampling complete episodes (i.e. episodes whose final state is the terminal state) from the env, and then
b) calculating the average of the sampled returns for each state over the different episodes.

MC prediction comes in two varieties, illustrated by the toy example below:
i) First-visit MC: here, to find a state’s return we consider the state only the first time it is encountered in each episode.
ii) Every-visit MC: here, we consider a state every time it is encountered in each episode.
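To see the difference, consider a made-up toy episode (not from the course) with no discounting (γ = 1) in which one state is visited twice:

```python
# A made-up episode, format (state, action, reward); gamma = 1.
# State "A" is visited at t = 0 and again at t = 2.
episode = [("A", "x", 1), ("B", "y", 0), ("A", "x", 2)]

# Returns from each time step: G_0 = 1 + 0 + 2 = 3, G_1 = 0 + 2 = 2, G_2 = 2.
returns = [3, 2, 2]

# First-visit MC uses only the return from A's first occurrence in this episode:
v_first_visit = returns[0]                     # 3
# Every-visit MC averages the returns from both occurrences of A:
v_every_visit = (returns[0] + returns[2]) / 2  # 2.5
```

Over many episodes both estimates converge to the true value Vπ(A); they only differ in which visits contribute to the running average.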

For further clarification you can check out the lecture slides from David Silver’s RL course. The underlying idea is this: if V(s) is our estimate of the value of state s and Vπ(s) is the true state-value function, then

V(s) = S(s) / N(s)

where S(s) is the sum of the returns observed for state s over all episodes and N(s) is the number of times state s was visited (counted according to the first-visit or every-visit rule), i.e. V(s) is a simple mean of the sampled returns. By the law of large numbers we know that

V(s) → Vπ(s) as N(s) → ∞

i.e. if we visit each state a sufficient number of times, we get an unbiased estimate of the state-value function.

Another important concept here is the incremental update of the mean. The basic idea is that the mean of k numbers equals the mean of the first k−1 numbers plus 1/k times the difference between the kth number and that previous mean. In our prediction algorithm this translates to the update

N(S_t) ← N(S_t) + 1
V(S_t) ← V(S_t) + (1/N(S_t)) · (G_t − V(S_t))

where G_t is the return following the visit to state S_t.
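Written out explicitly, the incremental-mean identity used above follows from a couple of lines of algebra (my own restatement of the standard derivation):

```latex
\mu_k = \frac{1}{k}\sum_{i=1}^{k} x_i
      = \frac{1}{k}\left(x_k + \sum_{i=1}^{k-1} x_i\right)
      = \frac{1}{k}\left(x_k + (k-1)\,\mu_{k-1}\right)
      = \mu_{k-1} + \frac{1}{k}\left(x_k - \mu_{k-1}\right)
```

Replacing μ with V(S_t), x_k with the newly observed return G_t, and k with the visit count N(S_t) gives exactly the update above.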
Below is the code for first-visit MC prediction:
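This is a minimal sketch, reusing the hypothetical sample_episode(env, policy) helper from earlier in the post; the function and variable names are illustrative rather than the exact code used in the course:

```python
from collections import defaultdict

def first_visit_mc_prediction(env, policy, num_episodes, gamma=1.0):
    """Estimate V(s) for a fixed policy using first-visit Monte Carlo."""
    returns_sum = defaultdict(float)   # S(s): sum of first-visit returns
    visit_count = defaultdict(int)     # N(s): number of first visits
    V = defaultdict(float)             # V(s): current value estimate

    for _ in range(num_episodes):
        episode = sample_episode(env, policy)    # list of (state, action, reward)

        # Compute the return G_t for every time step by scanning backwards.
        G = 0.0
        returns = []
        for _, _, reward in reversed(episode):
            G = reward + gamma * G
            returns.append(G)
        returns.reverse()                        # now returns[t] == G_t

        # First-visit rule: use only the return from the first occurrence
        # of each state in this episode.
        seen = set()
        for t, (state, _, _) in enumerate(episode):
            if state in seen:
                continue
            seen.add(state)
            returns_sum[state] += returns[t]
            visit_count[state] += 1
            V[state] = returns_sum[state] / visit_count[state]

    return V
```

Switching to every-visit MC only requires dropping the `seen` bookkeeping, so that every occurrence of a state contributes its return to the running average.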