Original article was published on Deep Learning on Medium

# DRL 02: Formalization of a Reinforcement Learning Problem

**Agent-Environment interaction in a Markov Decision Process**

Today we start with the second post in the series “Deep Reinforcement Learning Explained”. As we announced in the first post, one of the main aspects of this series is its orientation to practice; however, we need some theoretical knowledge before we start coding. In this post we will explore specific assumptions and abstractions in their strict mathematical form. Don’t panic; your patience will be rewarded!

Here we will introduce the reader to the mathematical representation and notation of the concepts presented in the previous post, which will be used repeatedly in this series. The reader will learn to represent these kinds of problems using a mathematical framework known as **Markov Decision Processes** (MDP), which allows us to model virtually any complex Environment.

Often, the dynamics of the Environment are hidden and inaccessible to the Agent; however, as we will see in future posts, DRL agents do not need to know the precise MDP of a problem to learn robust behaviors. But knowing about MDPs is important for the reader, because agents are commonly designed under the assumption that an MDP, even if inaccessible, is running under the hood.

# Markov Decision Processes

In more formal terms, almost all Reinforcement Learning problems can be framed as **Markov Decision Processes** (MDP). For the moment we can consider that an MDP consists basically of five elements **⟨S, A, T, R, γ⟩**, where the symbols mean:

— a set of states *S*
— a set of actions *A*
— a transition function *T*
— a reward function *R*
— a discounting factor *γ*

Let’s describe each of them.
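As a minimal sketch of this five-element tuple, we can write it as a plain Python structure. The class and field names below are our own illustration, not part of any library:

```python
from typing import NamedTuple, Callable, Set

# Hypothetical container for the five MDP elements <S, A, T, R, gamma>;
# the names are illustrative, not from any particular framework.
class MDP(NamedTuple):
    states: Set[int]                           # S: set of states
    actions: Set[int]                          # A: set of actions
    transition: Callable[[int, int], int]      # T: (state, action) -> next state
    reward: Callable[[int, int, int], float]   # R: (state, action, next state) -> reward
    gamma: float                               # discount factor in [0, 1]

# A tiny two-state example: action 1 moves from state 0 to state 1.
mdp = MDP(
    states={0, 1},
    actions={0, 1},
    transition=lambda s, a: 1 if a == 1 else s,
    reward=lambda s, a, s2: 1.0 if s2 == 1 else 0.0,
    gamma=0.99,
)
```

Note that here the transition is written as a deterministic function; in general, as we will see below, it can be stochastic.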

# States

A **state** is a unique and self-contained configuration of the problem. The set of all possible states is called the **state space**. There are special states, such as starting states and terminal states.

In our Frozen-Lake example, used in the previous post, the state space of the Environment is composed of 16 states:

```python
print("State space: ", env.observation_space)
# Output: State space: Discrete(16)
```

In the Frozen-Lake Environment, for instance, there is only one starting state (state 0) and five terminal states (states 5, 7, 11, 12 and 15).
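We can recover these special states without gym at all, just from the standard 4×4 map layout (`S`tart, `F`rozen, `H`ole, `G`oal); the snippet below is our own sketch, not part of the gym API:

```python
# The standard 4x4 Frozen-Lake layout, flattened row by row:
# S = start, F = frozen (safe), H = hole, G = goal.
FROZEN_LAKE_MAP = "SFFF" "FHFH" "FFFH" "HFFG"

# Terminal states are the holes and the goal.
terminal_states = [s for s, cell in enumerate(FROZEN_LAKE_MAP) if cell in "HG"]
start_state = FROZEN_LAKE_MAP.index("S")

print(terminal_states)  # [5, 7, 11, 12, 15]
print(start_state)      # 0
```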

All states in an MDP have the “Markov” property, referring to the fact that the future depends only on the current state, not on the history: the probability of the next state, given the current state, is the same as if you gave it the entire history of interactions. In other words, the future and the past are **conditionally independent** given the present, since the current state encapsulates all the statistics we need to decide the future.
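Written out, the Markov property says that conditioning on the full history adds nothing beyond the current state:

```latex
P\,[\,S_{t+1} \mid S_t\,] \;=\; P\,[\,S_{t+1} \mid S_1, S_2, \dots, S_t\,]
```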

# Actions

At each state, the Environment makes available a set of actions, an **action space**, from which the Agent will choose an **action**. The Agent influences the Environment through these actions and the Environment may change states as a response to the action taken by the Agent. The Environment makes the set of all available actions known in advance.

In the Frozen-Lake Environment, there are four available actions in all states: UP, DOWN, RIGHT, or LEFT:

```python
print("Action space: ", env.action_space)
# Output: Action space: Discrete(4)
```

Now that we have presented states and actions, we can revisit the “Markov” property. The probability of the next state *Sₜ₊₁*, given the current state *Sₜ* and current action *Aₜ* at a given time *t*, is the same as if you gave it the entire history of interactions. In other words, the probability of moving from one state to another on two separate occasions, given the same action, is the same regardless of all previous states or actions encountered before that point.

In the Frozen-Lake example, we know that from state 2 the Agent can only transition to state 1, 3, 6, or 2, and this is true regardless of whether the Agent’s previous state was 1, 3, 6, or 2. That is, you don’t need the history of states visited by the Agent for anything.
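We can sanity-check this claim with a small pure-Python sketch (no gym required; the `step` helper below is our own deterministic approximation of the non-slippery grid, not the gym implementation):

```python
# Deterministic moves on the 4x4 grid; actions follow the article's coding.
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3

def step(state, action):
    """Next state on a 4x4 grid, clipping at the borders (our own helper)."""
    row, col = divmod(state, 4)
    if action == LEFT:
        col = max(col - 1, 0)
    elif action == DOWN:
        row = min(row + 1, 3)
    elif action == RIGHT:
        col = min(col + 1, 3)
    elif action == UP:
        row = max(row - 1, 0)
    return row * 4 + col

# From state 2 the reachable next states are always {1, 2, 3, 6},
# no matter how the Agent arrived at state 2.
reachable = {step(2, a) for a in (LEFT, DOWN, RIGHT, UP)}
print(reachable)  # {1, 2, 3, 6}
```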

# Transition Function

Which state the Agent will arrive in (that is, how the Environment changes its state) is decided by the **transition function**, denoted by **T**. Depending on the Environment, transitions can be either deterministic or stochastic.

## Deterministic

Imagine a version of Frozen-Lake whose surface is not slippery. We can create this Environment in a deterministic mode with the argument `is_slippery=False`:

```python
env = gym.make("FrozenLake-v0", is_slippery=False)
```

In this case, the probability of the next state *Sₜ₊₁*, given the current state *Sₜ* and action *Aₜ*, is always 1. In other words, it is a deterministic Environment, where there is always a single possible next state for an action. In this case we can consider the transition function as a simple lookup table, a two-dimensional (2D) matrix. In our Frozen-Lake example we can obtain it with `env.env.P`, which outputs the function as a dictionary:

```python
{0: {0: [(1.0, 0, 0.0, False)],
     1: [(1.0, 4, 0.0, False)],
     2: [(1.0, 1, 0.0, False)],
     3: [(1.0, 0, 0.0, False)]},
 1: {0: [(1.0, 0, 0.0, False)],
     1: [(1.0, 5, 0.0, True)],
     2: [(1.0, 2, 0.0, False)],
     3: [(1.0, 1, 0.0, False)]},
 ...
 14: {0: [(1.0, 13, 0.0, False)],
      1: [(1.0, 14, 0.0, False)],
      2: [(1.0, 15, 1.0, True)],
      3: [(1.0, 10, 0.0, False)]},
 15: {0: [(1.0, 15, 0, True)],
      1: [(1.0, 15, 0, True)],
      2: [(1.0, 15, 0, True)],
      3: [(1.0, 15, 0, True)]}}
```

In this output, `env.P` returns all the states (many removed for clarity; go to the notebook for the complete output), where each state contains a dictionary that maps each possible action (0, 1, 2, 3) from that state to the next state if we take that action. Furthermore, each action maps to a list, where each element of the list is a tuple containing the probability of transitioning into the next state, the next state, the reward, and whether the episode terminates there (done = `True` if the next state is a HOLE or the GOAL).
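As a quick illustration of this tuple structure, we can hand-copy one entry from the output above (rather than calling gym) and unpack it:

```python
# One entry of the deterministic transition dictionary shown above:
# from state 14, action 2 (RIGHT) leads to state 15 with reward 1.0.
entry = {14: {2: [(1.0, 15, 1.0, True)]}}

prob, next_state, reward, done = entry[14][2][0]
print(prob, next_state, reward, done)  # 1.0 15 1.0 True
```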

For example, in this “unfrozen” (not slippery) Environment, if we execute the sequence of actions (a plan I call the “good plan”) indicated in the following figure, the Agent will arrive safely at the goal:

We can check with the following code that this plan allows the Agent to achieve the objective:

```python
actions = {'Left': 0, 'Down': 1, 'Right': 2, 'Up': 3}
good_plan = (2 * ['Right']) + (3 * ['Down']) + ['Right']

env = gym.make("FrozenLake-v0", is_slippery=False)
env.reset()
env.render()

for a in good_plan:
    new_state, reward, done, info = env.step(actions[a])
    env.render()
    if done:
        break
```
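If gym is not at hand, the same walk-through can be simulated in pure Python against the 4×4 map; the grid step below is our own sketch of the non-slippery dynamics, not the gym implementation:

```python
# Pure-Python walk-through of the "good plan" on the non-slippery 4x4 lake.
MAP = "SFFF" "FHFH" "FFFH" "HFFG"   # S=start, F=frozen, H=hole, G=goal
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}  # Left, Down, Right, Up

actions = {'Left': 0, 'Down': 1, 'Right': 2, 'Up': 3}
good_plan = (2 * ['Right']) + (3 * ['Down']) + ['Right']

state, reward = 0, 0.0
for name in good_plan:
    dr, dc = MOVES[actions[name]]
    row, col = divmod(state, 4)
    # Move, clipping at the grid borders.
    state = min(max(row + dr, 0), 3) * 4 + min(max(col + dc, 0), 3)
    if MAP[state] in "HG":            # terminal: hole or goal
        reward = 1.0 if MAP[state] == "G" else 0.0
        break

print(state, reward)  # 15 1.0
```

The plan traverses states 0 → 1 → 2 → 6 → 10 → 14 → 15, ending at the GOAL with reward 1.0, which matches the `(1.0, 15, 1.0, True)` entry we saw in `env.P`.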