DRL 02: Formalization of a Reinforcement Learning Problem

Original article was published on Deep Learning on Medium

Agent-Environment interaction in a Markov Decision Process

(Image by author)

Today we start with the second post in the series “Deep Reinforcement Learning Explained”. As we announced in the first post, one of the main goals of this series is its practical orientation; however, we need some theoretical background before we start coding. In this post we will explore specific assumptions and abstractions in their strict mathematical form. Don’t panic; your patience will be rewarded!

Here we will introduce the mathematical representation and notation of the concepts presented in the previous post, which will be used repeatedly in this series. The reader will learn to represent these kinds of problems using a mathematical framework known as Markov Decision Processes (MDP), which allows us to model virtually any complex Environment.

Often, the dynamics of the environments are hidden and inaccessible to the Agent, however, as we will see in future posts, DRL agents do not need to know the precise MDP of a problem to learn robust behaviors. But knowing about MDP is important for the reader because agents are commonly designed with the assumption that an MDP, even if inaccessible, is running under the hood.

Markov Decision Processes

In more formal terms, almost all Reinforcement Learning problems can be framed as Markov Decision Processes (MDP). For the moment we can consider that an MDP consists basically of five elements <S,A,T,R,γ>, where the symbols mean:

  • S — a set of states
  • A — a set of actions
  • T — transition function
  • R — reward function
  • γ — discounting factor

Let’s describe each of them.


A state is a unique and self-contained configuration of the problem. The set of all possible states is called the state space. There are special states, such as starting states and terminal states.

In our Frozen-Lake example, used in the previous post, the state space of the Environment is composed of 16 states:

print("State space: ", env.observation_space)

State space:  Discrete(16)

In the Frozen-Lake Environment, for instance, there is only one starting state (which is state 0) and five terminal states (states 5,7,11,12 and 15):

Figure 1: State space of the Frozen-Lake Environment. (Image by author)

All states in an MDP have the “Markov” property, referring to the fact that the future depends only on the current state, not on the history: the probability of the next state, given the current state, is the same as if you gave it the entire history of interactions. In other words, the future and the past are conditionally independent given the present, since the current state encapsulates all the statistics we need to decide the future.


At each state, the Environment makes available a set of actions, an action space, from which the Agent will choose an action. The Agent influences the Environment through these actions and the Environment may change states as a response to the action taken by the Agent. The Environment makes the set of all available actions known in advance.

In the Frozen-Lake Environment, there are four available actions in all states: LEFT, DOWN, RIGHT, or UP:

print("Action space: ", env.action_space)

Action space:  Discrete(4)

Now that we have presented the states and actions, we can revisit the “Markov” property. The probability of the next state St+1, given the current state St and current action At at a given time t, is the same as if you gave it the entire history of interactions. That is, the probability of moving from one state to another on two separate occasions, given the same action, is the same regardless of all previous states or actions encountered before that point.

In the Frozen-Lake example, we know that from state 2 the Agent can only transition to state 1, 3, 6, or 2, and this is true regardless of whether the Agent’s previous state was 1, 3, 6, or 2. That is, we don’t need the history of states visited by the Agent for anything.

Transition Function

Which state the Agent will arrive in (i.e., how the Environment changes its state) is decided by the transition function, denoted by T. Depending on the Environment, the response to an action can be either deterministic or stochastic.


Imagine a version of Frozen-Lake where the surface is not slippery. We can pass the argument is_slippery=False to create the Environment in deterministic mode:

env = gym.make("FrozenLake-v0", is_slippery=False)

In this case, the probability at time t of the next state St+1, given the current state St and action At, is always 1. In other words, it is a deterministic Environment where there is always a single possible next state for an action. Here we can consider the transition function as a simple lookup table, a two-dimensional (2D) matrix. In our Frozen-Lake example we can obtain it with env.env.P, which outputs the function as a dictionary:

0: {0: [(1.0, 0, 0.0, False)],
1: [(1.0, 4, 0.0, False)],
2: [(1.0, 1, 0.0, False)],
3: [(1.0, 0, 0.0, False)]},
1: {0: [(1.0, 0, 0.0, False)],
1: [(1.0, 5, 0.0, True)],
2: [(1.0, 2, 0.0, False)],
3: [(1.0, 1, 0.0, False)]},
14: {0: [(1.0, 13, 0.0, False)],
1: [(1.0, 14, 0.0, False)],
2: [(1.0, 15, 1.0, True)],
3: [(1.0, 10, 0.0, False)]},
15: {0: [(1.0, 15, 0, True)],
1: [(1.0, 15, 0, True)],
2: [(1.0, 15, 0, True)],
3: [(1.0, 15, 0, True)]}

In this output, env.env.P returns all the states (many removed for clarity; go to the notebook for the complete output), where each state maps to a dictionary of all possible actions (0, 1, 2, 3) from that state. Each action in turn contains a list, where each element is a tuple showing the probability of the transition, the next state, the reward, and whether the episode terminates there (done=True if the next state is a HOLE or the GOAL).
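Reading this dictionary is straightforward. As a small sketch (P_det below is the state-0 fragment of the non-slippery output shown above; the helper function name is mine, for illustration), in the deterministic case each action has a single entry with probability 1.0, so the next state can be looked up directly:

```python
# Sketch: deterministic lookup in the P dictionary. P_det is the state-0
# fragment of the non-slippery env.env.P output shown above.

P_det = {
    0: {0: [(1.0, 0, 0.0, False)],
        1: [(1.0, 4, 0.0, False)],
        2: [(1.0, 1, 0.0, False)],
        3: [(1.0, 0, 0.0, False)]},
}

def next_state(P, s, a):
    prob, s2, reward, done = P[s][a][0]  # single entry when deterministic
    return s2

print(next_state(P_det, 0, 1))  # 4: Down from state 0 lands on state 4
```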

For example, in this non-slippery Environment, if we execute the sequence of actions (which I call the “good plan”) indicated in the following figure, the Agent will arrive safely at the end:

Figure 2: “good plan” for the Frozen-Lake example. (Image by author)

We can check with the following code that it is a plan that allows the Agent to achieve the objectives:

actions = {'Left': 0, 'Down': 1, 'Right': 2, 'Up': 3}
good_plan = (2 * ['Right']) + (3 * ['Down']) + ['Right']

env = gym.make("FrozenLake-v0", is_slippery=False)
env.reset()
for a in good_plan:
    new_state, reward, done, info = env.step(actions[a])
    if done:
        break
Here the Environment reacts deterministically to the Agent’s actions but, as we saw in the previous post, the original Environment reacts stochastically to simulate slipping.


We said that which state the Agent arrives in is decided by the transition function. In a stochastic Environment, at time t the transition function T maps a transition tuple (St, At, St+1) to the probability p of transitioning from a source state St to a target state St+1 when taking action At.

Now, to capture all the details of the Environment and its possible reactions to the Agent’s actions, the transition function can no longer be represented as a 2D matrix, as in the deterministic case. Here we need a 3D matrix with dimensions source state, action, and target state, where each element indicates the probability of transitioning from a source state St to a target state St+1 given action At.

In order to verify that we are really talking about a 3D matrix, we could obtain the transition function as before, but now for the slippery Environment:

env = gym.make("FrozenLake-v0")
env.env.P
0: {0: [(0.3333333333333333, 0, 0.0, False),
(0.3333333333333333, 0, 0.0, False),
(0.3333333333333333, 4, 0.0, False)],
1: [(0.3333333333333333, 0, 0.0, False),
(0.3333333333333333, 4, 0.0, False),
(0.3333333333333333, 1, 0.0, False)],
2: [(0.3333333333333333, 4, 0.0, False),
(0.3333333333333333, 1, 0.0, False),
(0.3333333333333333, 0, 0.0, False)],
3: [(0.3333333333333333, 1, 0.0, False),
(0.3333333333333333, 0, 0.0, False),
(0.3333333333333333, 0, 0.0, False)]},
14: {0: [(0.3333333333333333, 10, 0.0, False),
(0.3333333333333333, 13, 0.0, False),
(0.3333333333333333, 14, 0.0, False)],
1: [(0.3333333333333333, 13, 0.0, False),
(0.3333333333333333, 14, 0.0, False),
(0.3333333333333333, 15, 1.0, True)],
2: [(0.3333333333333333, 14, 0.0, False),
(0.3333333333333333, 15, 1.0, True),
(0.3333333333333333, 10, 0.0, False)],
3: [(0.3333333333333333, 15, 1.0, True),
(0.3333333333333333, 10, 0.0, False),
(0.3333333333333333, 13, 0.0, False)]},

15: {0: [(1.0, 15, 0, True)],
1: [(1.0, 15, 0, True)],
2: [(1.0, 15, 0, True)],
3: [(1.0, 15, 0, True)]}

This transition function implements a 33.3% chance of transitioning to the intended cell (state) and a 66.6% chance of transitioning in one of the two perpendicular directions. There is also a chance of staying in the same state when the Agent bounces off a wall.
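The 3D-matrix view of this dictionary can be made concrete in code. Here is a minimal sketch that packs a gym-style P dictionary into a matrix T[s][a][s2] of transition probabilities; the two-state, two-action P below is a made-up fragment in the same (prob, next_state, reward, done) format, not the real Frozen-Lake dynamics:

```python
# Sketch: packing a gym-style P dictionary into a 3D transition matrix
# T[s][a][s2] = probability of landing in s2 from s when taking action a.
# P is a hypothetical 2-state, 2-action fragment for illustration.

P = {
    0: {0: [(0.5, 0, 0.0, False), (0.5, 1, 0.0, False)],
        1: [(1.0, 1, 0.0, True)]},
    1: {0: [(1.0, 1, 0.0, True)],
        1: [(1.0, 1, 0.0, True)]},
}

n_states, n_actions = 2, 2
T = [[[0.0] * n_states for _ in range(n_actions)] for _ in range(n_states)]
for s, actions in P.items():
    for a, transitions in actions.items():
        for prob, s2, reward, done in transitions:
            T[s][a][s2] += prob  # += because a next state can be listed twice

print(T[0][0])  # [0.5, 0.5]
```

Each slice T[s][a] sums to 1, as a probability distribution over next states must.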

A visual representation of the Environment can be helpful: a graph with nodes representing the states and edges, labeled with probabilities (and rewards), representing possible transitions from state to state. For simplicity and clarity, I have included in the image below only the transition function for the actions of state 14. This subset of states illustrates all possible transitions without too much clutter.

Figure 3: Part of transition diagram for state 14. (Image by author)

Now, if we execute the same “good plan” sequence of actions several times, we can see that it does not behave deterministically, giving very different results. We will come back to this example in future posts.
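We can see the same phenomenon in a self-contained sketch. The dynamics below are a hypothetical 3-state chain written in the same (prob, next_state, reward, done) format as env.env.P (not the real Frozen-Lake dictionary), and the helper name run_plan is mine; the point is only that a fixed plan yields different outcomes on different runs:

```python
import random

# Sketch: executing the same fixed plan many times in a stochastic
# Environment described by a gym-style P dictionary (toy 3-state chain).

P = {
    0: {0: [(0.8, 1, 0.0, False), (0.2, 0, 0.0, False)]},
    1: {0: [(0.8, 2, 1.0, True), (0.2, 1, 0.0, False)]},
    2: {0: [(1.0, 2, 0.0, True)]},
}

def run_plan(P, plan, start=0, seed=None):
    """Follow a fixed action sequence; return the total reward collected."""
    rng = random.Random(seed)
    s, total = start, 0.0
    for a in plan:
        transitions = P[s][a]
        prob, s, r, done = rng.choices(
            transitions, weights=[t[0] for t in transitions])[0]
        total += r
        if done:
            break
    return total

returns = [run_plan(P, [0, 0, 0, 0], seed=i) for i in range(1000)]
print(sum(returns) / len(returns))  # fraction of runs that reached the goal
```

Some runs collect the reward and some do not, even though the plan never changes.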

Reward Function

Once an action is taken, the Environment delivers a reward as a measure of the goodness of transitions, using a reward function (or reward probability). It is just a scalar value the Agent obtains at each time step (or every fixed number of time steps) from the Environment (it can be positive or negative, large or small). The purpose of the reward is to tell our Agent how well it has behaved.

The reward gives the Agent feedback about its success, and that reward should reinforce the Agent’s behaviour in a positive or negative way. Nevertheless, it only reflects the success of the Agent’s recent activity, not all the successes achieved by the Agent so far. What an Agent is trying to achieve is the largest accumulated reward over its sequence of actions, so we need another measure of feedback, the return. But before introducing the return, let me introduce another component of the MDP, the discount factor.

Discount factor

The task the Agent is trying to solve may or may not have a natural ending. Tasks that have a natural ending, such as a game, are called episodic tasks. The sequence of time steps from the beginning to the end of an episodic task is called an episode. The Frozen-Lake Environment presents episodic tasks, because there are terminal states; there is a clear goal and failure states. Conversely, tasks that do not have a natural end are called continuing tasks, such as learning forward motion.

Because of the possibility of infinite sequences of time steps, we need a way to discount the value of rewards over time; that is, we need a way of telling the Agent that getting +1’s is better sooner than later. So we commonly use a positive real value less than one to exponentially discount the value of future rewards. The further into the future we receive the reward, the less valuable it is in the present.

This number is called the discount factor, or gamma denoted by γ. The discount factor adjusts the importance of rewards over time. The later we receive rewards, the less attractive they are to present calculations.


We introduced in post 1 that Agents are often designed to maximize the return, denoted by G. Now that we have introduced gamma, we can see how the return is calculated from gamma and the rewards. For every episode, we define the return at time t as:

Gt = Rt+1 + γ·Rt+2 + γ²·Rt+3 + … = Σ γᵏ·Rt+k+1 (sum over k = 0, 1, 2, …)

For every time step t, we calculate the return Gt as a sum of subsequent rewards, where more distant rewards are multiplied by the discount factor raised to the power of the number of steps we are away. If gamma equals 1, the return is just the sum of all subsequent rewards. If gamma equals 0, the return is just the immediate reward, without any subsequent rewards. These extreme values are useful only in corner cases; most of the time gamma is set to something in between, such as 0.9 or 0.99.
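As a quick sketch, the recursion Gt = Rt+1 + γ·Gt+1 gives a one-pass way to compute the return from a list of rewards (the reward list and function name below are mine, for illustration):

```python
# Sketch: computing the return G for a list of rewards and a discount gamma,
# using the recursion G_t = r_{t+1} + gamma * G_{t+1}.

def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):  # work backwards from the end of the episode
        g = r + gamma * g
    return g

print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.9**2, i.e. ~0.81
print(discounted_return([0.0, 0.0, 1.0], gamma=1.0))  # plain sum: 1.0
```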

The objective of a decision-making Agent

Let’s see how the Agent makes the decision to meet its objectives: to find a sequence of actions that will maximize the return G during an episode. In other words, the Agent must have a plan, a sequence of actions from the START state to the GOAL state.

We previously introduced a “good plan” for the Frozen-Lake example, which intuitively seems the best. But when we run it in a stochastic environment, even the best of plans can fail, because the actions taken will not always work the way we intend. Remember that in the Frozen-Lake Environment, unintended action effects have an even higher probability: 66.6% vs. 33.3%.

In particular, due to the environment’s stochasticity, our Agent may land on a cell not covered by our plan. What then? What we need is a plan for every possible state, a “universal plan”, a Policy, that covers all possible states.


The policy is the strategy (e.g., some set of rules) that the Agent employs to determine the next action based on the current state. Typically denoted by 𝜋(𝑎|𝑠), a policy is a function that determines the next action a to take given a state s.

The policy 𝜋(𝑎|𝑠) is defined as a probability, not as a concrete action. In other words, it is a stochastic policy: a probability distribution over the actions that an Agent can take at a given state.

During the learning process, the policy may change as the Agent gains more experience. For example, the Agent may start from a random policy, where the probability of all actions is uniform; meanwhile, the Agent will hopefully learn to optimize its policy toward reaching the optimal policy.
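A uniform random starting policy like the one just described can be sketched in a few lines (n_actions = 4 matches Frozen-Lake’s four moves; the function names are mine, for illustration):

```python
import random

# Sketch: a uniform random policy pi(a|s): the same probability for every
# action, in every state. This is the usual starting point before learning.

n_actions = 4

def random_policy(state):
    # pi(a|s) is a distribution over actions; here it is uniform.
    return [1.0 / n_actions] * n_actions

def sample_action(policy, state, rng=random):
    probs = policy(state)
    return rng.choices(range(len(probs)), weights=probs)[0]

print(sample_action(random_policy, state=0))  # one of 0, 1, 2, 3
```

Learning then amounts to reshaping these probabilities away from uniform and toward the actions that yield higher returns.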

Even for fairly simple Environments we can have a variety of policies, so we need a method to automatically find optimal ones. For the moment I will introduce two new concepts, the state-value function and the action-value function, that help the Agent learn and find optimal policies.

State-value function

The state-value function, also referred to as the value function, or even the V-function, measures the goodness of each state: how good or bad it is to be in a particular state, based on the return 𝐺 obtained when following a policy 𝜋. It is usually denoted by V𝜋(𝑠).

However, the return G is not very useful in practice, as it is defined for each specific episode and can vary widely, even for the same state. But if we go to the extreme and calculate the mathematical expectation of the return for a state (by averaging over a large number of episodes), we get a much more useful value for V𝜋(𝑠). In future posts we will use it and go into further detail.
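This averaging idea can be sketched directly. The environment below is a hypothetical 3-state chain in the same (prob, next_state, reward, done) format as env.env.P (not the real Frozen-Lake dynamics), the policy is fixed, and the helper name rollout is mine:

```python
import random

# Sketch: estimating V(s) for a fixed policy by averaging discounted returns
# over many episodes (a Monte Carlo idea we will meet again in future posts).

P = {
    0: {0: [(0.5, 1, 0.0, False), (0.5, 0, 0.0, False)]},
    1: {0: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)]},
}

def rollout(P, policy, s, gamma, rng):
    """Play one episode from state s; return its discounted return G."""
    g, discount, done = 0.0, 1.0, False
    while not done:
        transitions = P[s][policy(s)]
        prob, s, r, done = rng.choices(
            transitions, weights=[t[0] for t in transitions])[0]
        g += discount * r
        discount *= gamma
    return g

rng = random.Random(0)
policy = lambda s: 0          # only one action exists in this toy chain
n = 2000
v0 = sum(rollout(P, policy, 0, 0.9, rng) for _ in range(n)) / n
print(round(v0, 2))           # close to the exact value 0.45/0.55 ≈ 0.82
```

With enough episodes the average converges to the expectation, which is exactly the V𝜋(𝑠) defined above.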

Action-value function

Moreover, we can extend the definition of the state-value function to state-action pairs, defining a value for each state-action pair: the action-value function, also known as the Q-function. It is usually denoted by Q𝜋(𝑠, 𝑎) and refers to the expected return 𝐺 when the Agent is at state 𝑠 and takes action 𝑎, thereafter following the policy 𝜋.
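As a tiny sketch of how Q values are used once we have them (the numbers below are invented for illustration, and greedy_action is my own helper name), a tabular Q-function can be stored as a dictionary, and a policy can act greedily with respect to it:

```python
# Sketch: a tabular action-value function Q(s, a) stored as a dictionary,
# and the greedy policy it induces. The values for state 0 are made up.

Q = {(0, 0): 0.1, (0, 1): 0.5, (0, 2): 0.3, (0, 3): 0.0}

def greedy_action(Q, state, n_actions=4):
    # Pick the action with the highest estimated value in this state.
    return max(range(n_actions), key=lambda a: Q.get((state, a), 0.0))

print(greedy_action(Q, 0))  # 1, the action with the highest Q value
```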

Estimating the state-value function and action-value function is an essential component of Reinforcement Learning methods. In future posts in this series we will go into more detail and cover different ways of calculating and estimating these functions.

We have reached the end of this post! A post that doesn’t pretend to be a complete theoretical framework, only to introduce the minimum formalism to get started; we will introduce new formulations as we need them throughout the series. See you in the next post!