Source: Deep Learning on Medium

# Fundamentals of Reinforcement Learning: Markov Decision Processes, Policies, & Value Functions

*Welcome to the second article in GradientCrescent’s special series on reinforcement learning. This series will serve as to introduce some of the fundamental concepts in reinforcement learning, primarily using Sutton’s Reinforcement Learning Textbook and the University of Alberta’s Reinforcement Learning course as source material. This series will focus on learning of concepts over their demonstrations, serving to reinforce (pun intended) my own learning in the field.*

**Introduction**

In our** ****last article**, we introduced and discussed the K-armed bandit problem, one of the fundamental pillars in reinforcement learning. While simple in nature and highly intuitive, it exhibits weaknesses that make it unsuitable for the majority of real-world situations:

- A bandit-based system does not possess any information about the environment, or
*state,*that it is in, being only concerned on the rewards obtained from actions. - The long-impacts of any decision are not considered, as only the short-term gain is maximized.
- The possible changes in reward distribution in response to changes to a situation and or actions are not considered.

**Markovian Decision Processes tackle these shortcomings by allowing the system to analyse the environment it is in, and the effects of individual actions on rewards, to decide on the best course of action to follow**. Let’s illustrate this with a simple example. Consider the rabbit below, moving on a single axis. While the rabbit is closer to the carrot, choosing to eat it would place it at risk of being eaten by the lynx, instead of choosing the safer broccoli slightly further away. What should the rabbit do to maximize its reward?

**Markov Decision Processes**

We can break down the example above into a few key components.

- The Agent
- The Environment
- The Action
- The Reward

The agent represents the actor in a RL scenario. It’s job is to interact with the environment (represented for a particular time step as a state) via various actions to maximize the reward. This can be done in an episodic or continuous manner. We can visualize this at a theoretical-level:

For a episodic activity, at a time step t, the agent is given information about the environment (state S), and uses that information to select an action. At t+1, we transition to state S (t+1) as a result of that action, and the agent receives a reward R (t+1). This cycle then continues on in a loop until a terminal state, R (T) is reached. The aforementioned set of events can be described as a episode episode.

It’s easy to use the analogy of a simple arcade game here, such as Pong or Super Mario Brothers, where the inherent use of lives define the game as episodic, with the rewards tracked by a score counter as one progresses through the game. The rewards in a episodic event can be described as a cumulative sum of individual rewards:

In contrast, a **continuous event **is described by constant perpetual interactions without terminal states, resulting in a cumulative reward that would eventually reach infinity. To avoid this, we can introduce a **discount rate** (commonly used in the financial industry for judging the present value of future revenues), which is a numerical factor that assigns a lower weight to future rewards. As rewards received infinite timesteps away approach 0, we can achieve a finite result:

Naturally, shifting the value of gamma also changes the relative nearsightedness of our system in terms of maximizing cumulative reward.

You may be wondering about how our examples relate to Markovian decision processes. As mentioned, both episodic and continuing tasks can be described through a series of states. A state follows a *Markov Property ***if it contains information about the entire history of previous agent-environment interactions, without the need to invoke previous states**. Essentially, this means that probability of achieving State St and reward Rt for example, depends solely on their immediate preceding counterparts, S (t-1) and A (t-1). **Any preceding information concerning state history can be disregarded.**

Before carrying on, we take the relationship described above and formally define the Markov Decision Process mathematically:

Where **t** represents a environmental timestep, **p** & **Pr** represent probability, **s** & **s’** represent the old and new states, **a** the actions taken, and **r** the state-specific reward.

The MDP framework is a considerable abstraction of the problem of goal-directed learning from interaction. It proposes that **whatever the details of the sensory, memory, and control apparatus, and whatever objective one is trying to achieve, any problem of learning goal-directed behavior can be reduced to three signals passing back and forth between an agent and its environment**: one signal to represent the choices made by the agent (the actions), one signal to represent the basis on which the choices are made (the states), and one signal to define the agent’s goal (the rewards). This framework may not be sufficient to represent all decision-learning problems usefully, but it has proved to be widely useful and applicable.

**Policies and Value functions**

Given an agent in an initial state presented with multiple possible courses of action, how do we define the probability of the agent preferring certain actions over others. In other words, **how do we define the behaviour of an agent?**

**This can be done through policies, which describe the distribution of actions available given a state, accompanied by the probability of choosing each of these actions given a certain state**. As such, different policies will lead to different actions, and hence different cumulative overall rewards. Taking the original example of a rabbit, you could argue that going left or right could be formulated as following two different policies, with different end rewards.

In order to identify the optimum policy, we need to be able to quantify the amount of reward an action or state would return to us. **To achieve this, we can use value functions, which summarize over all possible future scenarios by averaging over specific returns when following a particular policy**. More specifically,** state-value functions describe the expected sum of future rewards given a state when following a policy **π.

We can take a step further and define an action-value function, which **describes the expected returns for a state s, then taking an action a, and then following a policy **π.:

The relationship two value functions can be intuitively understood**. The state value function for a state s is defined as the sum of the action value functions of all outgoing actions (a), multiplied by the probabilities assigned to selecting each of the actions (which are defined in the policy).**

**How do we estimate these value functions? **This is based on a combination of repeated statistical observations, experience, and intuition. Classical Monte Carlo-based approaches, involving averaging over many random samples of actual returns for each particular action or state, can be done to ascertain the value functions via convergence. Naturally, if the number of states is excessive, it may become less practical to keep averages for each state (particularly if we are tracking separate averages for each action within a state), demanding us to come up with parameterized functions as substitutes.

Chess is a great analogy here. For every possible situation on a board (which can be thought of as states), the value of a situation can be derived by analyzing the positions of the opponent’s pieces and those of one’s pieces. Given a situation, we can also estimate the quality of all of our future moves in helping us achieve a high probability of winning (which serve as the expected reward), given that we follow a particular strategy (which serves as a policy). Each move has an inherent value derived from years of observational analysis

**The fundamental relationship that defines value functions within reinforcement learning is the relationship between a value of a state and that of its possible successor states that can be quantified via Bellman equations. **The derivation of these equations is beyond the scope of this tutorial (readers are encouraged to consult Sutton’s *Reinforcement Learning *textbook for a formal discussion), **but essentially allow the value of a state to be composed of the immediate reward of achieving the said state, along with the discounted value of achieving successor states.**

Naturally, tracking all of the different relationships between states and state-actions can be complicated to keep track of, facilitating the creation of *backup *diagrams. **These diagrams allow us to visualize the relationship between our states, actions and rewards, facilitating the maximization of overall reward within the constraints of the MDP.** We’ve presented one below from David Silva’s excellent lecture, representing the different actions a student could take on a particular evening.

Our diagram represents states as circles, with transition between our states illustrated with arrows. We’ve specified a policy value of 0.5 here, meaning that there is a 50–50% chance of choosing whichever action from a certain state (apart from the starting state at the bottom, which features probabilities). The numbers in red are the calculated state-value functions, encapsulating the relative value of being in a particular state by** taking** **into account all possible rewards arising from achieving the state, along with the discounted value of future rewards.**

**Optimal Policies**

**Given a system with several policies, how do we compare our policies against one another? **Generally, the value of a policy is better than another when its cumulative expected reward is better or the same as that of its counterpart across all states. This means that while the policies may share a same value at a particular state, one must be consistently more higher valued than the other in order to establish policy superiority. Optimal policies exhibit the highest possible value in any state. In an MDP, there will always be at least one optimal policy.

While the value functions of a system with three policies may still be feasible to calculate, more complex problems can be tackled through the Bellman Optimality Equations. These rely on the postulate that an optimal policy must also exhibit optimal state-value and action-value functions. More formally, **the value of a state under an optimal policy must equal the expected return for the best action from that state**.

Recall that for a given MDP there will always exist an optimal deterministic policy- one that will always select an optimal action in every state. Such a policy will always assign probability one to the action yielding the highest value, and probability 0 for all other actions. Hence, we can replace the sum over **A** with max(**a**). As such, we can simplify our equation to taking the maximum value over all possible actions **a**.

That wraps up this introduction to the Markov Decision Processes. In our next tutorial, we’ll implement what we’ve learned and tackle the automation of classic Pong with Reinforcement Learning.

We hope you enjoyed this article. To stay up to date with the latest updates to GradientCrescent, please consider following the publication.

**References**

Sutton et. al, Reinforcement Learning

White et. al, Fundamentals of Reinforcement Learning, University of Alberta

Silva et. al, Reinforcement Learning, UCL