Original article was published by Maxim Volgin on Artificial Intelligence on Medium

# The many flavours of Reinforcement Learning

Somehow, I had the misconception that Reinforcement Learning (RL) is all about control, probably because it is usually explained to lay people in terms of the carrot and the stick. Indeed, many RL methods are about improving policies, i.e. about learning to *act* more efficiently or profitably in a particular situation. However, plenty of others are about *predicting* the (estimated) benefit of being in a particular situation.

Another misconception is that because there are so many algorithms in RL, some (probably the newer ones) must be better than (most) others. In fact, RL is a very broad field applicable to very different kinds of problems, and many of its algorithms were developed for the specific problems they solve best. It often does not even make sense to compare two methods with each other, because they serve entirely different purposes or are designed for very different operating conditions. Moreover, some of them are capable of working together, each solving one part of the problem.

Different applications of RL often have contradictory goals and assumptions, so when reading about a particular RL method one has to know exactly which problem is being solved and what is being assumed about the environment. In fact, I was still pretty much confused about a number of things even after a few months of formal study. In RL, understanding the context is crucial for understanding a particular algorithm.

So let us proceed with the differences in goals and assumptions. Do we know the environment we operate in? Does it always behave the same, or is there a measure of randomness about it? Can we observe it directly, or do we only get glimpses of it and must decide for ourselves what they mean? Can we easily distinguish between different states, or is it largely up to us how to define them? How can the effect of our algorithm’s actions be estimated? Is there a natural end to the activity, or is it ongoing, so that we need to establish time boundaries at which to take measurements? How many states and actions can or should we define? Does that number pose additional computational difficulties to be solved? Can or should we model the environment at all, or is it so volatile that modelling does not even make sense? And finally, do we need our algorithm to act, or just to estimate how well we are doing?

There are good, solid and precise algorithms that lead straight into computational hell when applied to a somewhat larger environment. There are somewhat rash and biased algorithms that can efficiently deal with a lot of uncertainty. And there are plenty of others too.

There are three major classes of RL algorithms: planning, model-based and model-free. To use a planning algorithm, we need an MDP (Markov Decision Process) model of the environment ready, i.e. the transition function `T(S,A)` and the reward function `R(S,A)` must be known upfront. In model-based algorithms, a model of the environment is learned and then used to gain extra experience by simulation. Model-free algorithms obviously get by without any attempt to model the environment. An alternative solution is experience replay, where past experiences are stored and may be replayed either at random or prioritised in different ways, which has a profound effect on the training process.
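Planning with a fully known model can be sketched with value iteration. The tiny two-state MDP below is made up for illustration; the point is that `T` and `R` are given upfront and the algorithm only backs values up through them, never interacting with a real environment.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: T[s, a, s'] is the transition
# probability and R[s, a] the expected immediate reward -- both known
# upfront, as a planning algorithm requires.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9  # discount rate

# Value iteration: repeatedly back up V(s) through the known model.
V = np.zeros(2)
for _ in range(1000):
    # Q[s, a] = R(s, a) + gamma * sum_s' T(s, a, s') * V(s')
    Q = R + gamma * (T @ V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)  # greedy policy w.r.t. the converged values
```

Note that no samples are ever drawn; a model-free algorithm would instead have to estimate these same quantities from interaction alone.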

RL algorithms can be value-based, where state or state-action values are estimated and possibly used to derive a policy, or they can learn a policy directly by optimising an objective function.

Another major division lies between on-policy and off-policy algorithms. On-policy means that the policy providing the experience and the policy being optimised are one and the same; off-policy obviously means the opposite. Algorithms involving experience replay are by definition off-policy. Policy gradient algorithms are by definition on-policy.
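A minimal uniform replay buffer makes the off-policy point concrete: the transitions it hands back were generated by older versions of the policy, so whatever learns from them is not learning from its own current behaviour. The class and its capacity are illustrative, not a reference implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Uniform experience replay (a sketch)."""

    def __init__(self, capacity=10_000):
        # deque with maxlen silently evicts the oldest transition
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling; prioritised variants would weight
        # transitions e.g. by the magnitude of their TD error.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

# Usage: store a transition, then draw a training batch.
buffer = ReplayBuffer(capacity=100)
buffer.push("s0", "a0", 1.0, "s1", False)
batch = buffer.sample(1)
```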

And there are also online and offline algorithms, i.e. those that learn at each time step and those that must complete a trajectory before reflecting upon it and updating themselves.

Function approximation, such as regular Machine Learning (ML) or Deep Learning (DL) methods, can help with evaluating feedback and generalising from sampled feedback. DL primarily helps with huge and/or continuous state and/or action spaces; occasionally, however, it also adds something unique to the overall design, as in the duelling architecture.

Major differences between algorithms are usually obvious at first glance from their respective update (or ‘backup’ in RL lingo) functions. If it is a policy gradient algorithm, the only thing of interest there is the objective function `J(θ)`. Otherwise, it typically consists of the value function being updated, which can tell you whether it is a planning method (state value function `V(S)`) or not (state-action value function `Q(S,A)`). Other typical components of the update (or ‘backup’) function are the learning rate, the discount rate, and the error with its sub-component, the target, which is the most interesting of them all.

The target is by definition an estimate of value, hence the difference between the target and the actual value is the error. The target can be defined in a number of ways, and that is where value-based algorithms differ the most. One common way to define the target is bootstrapping, i.e. making an estimate based on yet another estimate, which in practice means using the value function instead of the actual reward. Bootstrapping inevitably introduces bias, but reduces variance. Actor-Critic algorithms, which combine policy gradient and value-based algorithms, use bootstrapping by definition.

The so-called deadly triad, i.e. the combination of function approximation, off-policy learning and bootstrapping, tends to lead to divergence and should therefore be avoided.

It’s time for some target practice! For the TD (Temporal Difference) algorithm, the target is defined as `r + 𝛾V(s')`, and for SARSA as `r + 𝛾Q(s',a')`; both are on-policy, and the main difference is in the value function being used (state vs state-action). For Q-learning, it is `r + 𝛾maxQ(s',a')`, where the maximal state-action value is used instead of the one suggested by the policy, making it an off-policy algorithm. The Monte Carlo target `R + 𝛾R' + 𝛾²R'' + ...` reveals that it is an offline algorithm that does not use bootstrapping.
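The four targets can be written out side by side for a tabular case. The value tables, the sampled transition, and the episode rewards below are all made-up stand-ins; only the formulas themselves come from the definitions above.

```python
gamma = 0.9
V = {"s1": 2.0}                                   # state values (illustrative)
Q = {("s1", "left"): 1.0, ("s1", "right"): 3.0}   # state-action values (illustrative)

r, s_next = 1.0, "s1"   # sampled reward and next state
a_next = "left"         # the action the current policy would take in s_next

# TD(0): bootstraps on the state value of s'.
td_target = r + gamma * V[s_next]
# SARSA: bootstraps on Q of the action the policy actually picks (on-policy).
sarsa_target = r + gamma * Q[(s_next, a_next)]
# Q-learning: bootstraps on the maximal Q, regardless of the policy (off-policy).
q_target = r + gamma * max(Q[(s_next, a)] for a in ("left", "right"))
# Monte Carlo: the full discounted return of a finished episode -- no
# bootstrapping, hence offline. Rewards R, R', R'' are made up here.
R0, R1, R2 = 1.0, 0.5, 2.0
mc_return = R0 + gamma * R1 + gamma**2 * R2
```

Note how SARSA and Q-learning differ by a single term: following the policy’s action versus taking the max is exactly the on-policy/off-policy split discussed earlier.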