Introduction to Reinforcement Learning

When we are born, the first thing we do is interact with an environment completely unknown to us. Our actions lead to consequences that we internalise by way of experience.

This process of learning has been distilled into a family of algorithms collectively referred to as reinforcement learning, algorithms that are now driving the artificial intelligence revolution forward, and more specifically the field of machine learning.

2010 — DeepMind is founded (Google would go on to acquire it in 2014 for a reported sum of around $500 million); the company becomes the leading reference among firms dedicated to R&D in the field of artificial intelligence.

2014 — DeepMind develops software based on reinforcement learning capable of learning to play Atari video games at a level comparable to humans.

The results of DeepMind’s Deep Q-learning algorithm applied to the Atari game Breakout

2016 — The AlphaGo program, developed by DeepMind, beats world champion Lee Sedol at the strategy game Go.

2017 — Within just 24 hours of self-play training, DeepMind’s AlphaZero program reaches a superhuman level of play in chess, its Japanese counterpart shogi, and Go, defeating a world-champion program in each case.

2019 — Another program developed by DeepMind, AlphaStar, succeeds in besting a professional player of one of the most complex strategy video games: StarCraft II.

This technique seeks to optimise the results of a problem through trial and error.

Every reinforcement learning task is composed of an agent and an environment.

  • The agent can be defined as an entity with the capacity to deduce and memorise. Its mission is to train itself in the environment until it reaches optimal performance.
  • The environment represents the problem to be solved; the context the agent will interact with — in other words, its source of information. The environment is always structured as a series of alternatives.

With each alternative, the agent takes an action and receives a reward in exchange. Its familiarity with the environment grows more complete as it accumulates rewards. The learning process is complete once the agent is able to identify which sequence of actions, among all those possible, yields the greatest cumulative reward.
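
To make this loop concrete, here is a minimal sketch in Python; the Environment and Agent interfaces used here (reset, done, actions, step, choose_action, remember) are illustrative assumptions rather than anything defined in the article.

```python
def run_episode(env, agent):
    """Generic agent-environment interaction: the agent acts, the environment
    answers with a new state and a reward, and the agent stores the experience."""
    state = env.reset()                                    # start in the initial state
    total_reward = 0
    while not env.done():
        action = agent.choose_action(state, env.actions(state))
        next_state, reward = env.step(action)              # consequence of the action
        agent.remember(state, action, reward, next_state)  # accumulate experience
        total_reward += reward
        state = next_state
    return total_reward                                    # cumulative reward of the episode
```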

The image below models the environment as a burning floor inside a building composed of 9 cells (alternatives). The agent is represented by a person whose objective is to reach the door (the green cell), traversing the least number of cells without getting burned.

In each cell, the agent can take a step (action) towards any other adjacent cell: to the left, right, forward, or backwards.

The following graphic shows the floor annotated with the possible actions the agent can take from each cell. As one might expect, the walls block movement in their direction.
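
As an illustration of how those moves could be encoded, the sketch below assumes the nine cells are numbered 1 to 9, left to right and top to bottom; the walls shown in the figure would simply remove some entries from this mapping.

```python
# Possible moves in a 3x3 grid of cells numbered 1..9 (row by row), ignoring
# the interior walls of the figure (those would delete some of these moves).
def neighbours(cell, width=3, height=3):
    row, col = divmod(cell - 1, width)
    moves = []
    if col > 0:
        moves.append(cell - 1)       # step to the left
    if col < width - 1:
        moves.append(cell + 1)       # step to the right
    if row > 0:
        moves.append(cell - width)   # step forward
    if row < height - 1:
        moves.append(cell + width)   # step backwards
    return moves

actions = {cell: neighbours(cell) for cell in range(1, 10)}
# e.g. actions[8] -> [7, 9, 5]; the article later notes that from cell 8
# only cells 7 and 9 are actually reachable, so a wall would remove 5.
```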

Differences and Similarities with Supervised Learning

  • Reinforcement learning also amounts to a classification problem, but its objective is to identify the most suitable action among all possible actions, taking into account the state of the environment at that moment.
  • Reinforcement learning is based on experimentation rather than imitation in order to make its predictions. It is therefore not necessary to label all of the training data; instead, the label is replaced by the results gathered as the agent interacts with the environment, in the form of positive or negative experiences.
  • Reinforcement learning approaches the problem from the opposite direction: it first tries the different solutions the problem admits and compares them, building a model from the best solution found. It thereby avoids the need to try out different predictive models to see which one best fits the task.

The Markov Decision Process

The MDP (Markov Decision Process) provides a mathematical framework for modeling the interaction between agent and environment.

Every MDP is composed of a set of states S, which correspond to the alternatives (cells) of the environment described above.

Each state has a finite set of actions A, and each action implies a movement between two states.

This transition is stochastic, which is to say that the criterion the agent uses to choose a given action is unpredictable. It can, however, be approximated by way of the probability of occurrence attributed to each action.

In other words, in each state the total probability of taking some action (100%) is distributed among all of the possible actions of that state.

Thus, if we structure our environment in the form of a Markov chain and choose any state, for example number 8, the sum of its movement probabilities will be equal to one: P₈₇ + P₈₉ = 1.
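
For instance, the probabilities attached to state 8 could be stored as follows; the 50/50 split is purely an illustrative assumption, since the text only requires that the probabilities sum to one.

```python
# Probability of moving from state 8 to each reachable state (illustrative values).
P = {8: {7: 0.5, 9: 0.5}}

assert abs(sum(P[8].values()) - 1.0) < 1e-9   # P87 + P89 = 1
```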

The policy 𝝅 will thus be a function whose input variable is the agent’s current state and whose output is the next action to be taken.

Put differently, it is the criterion or strategy the agent uses to select an action in each state.
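
A policy can therefore be written as an ordinary function from the current state to an action. A minimal sketch, assuming the transition probabilities are stored in a dictionary like the one above:

```python
import random

def policy(state, action_probabilities):
    """pi(state) -> action: sample an action according to its assigned probability."""
    actions = list(action_probabilities[state].keys())
    weights = list(action_probabilities[state].values())
    return random.choices(actions, weights=weights)[0]

# Example: with P = {8: {7: 0.5, 9: 0.5}}, policy(8, P) returns 7 or 9.
```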

In an MDP, the transition between states is ordered by discrete time: t = 0, 1, 2, … That is to say, at the initial moment (t) the agent is in a given state of the environment; at the following moment (t+1), having executed an action, it has moved to the next state:
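
Using standard MDP notation (a conventional shorthand, not taken from the article), one such step can be written as:

$$S_t \xrightarrow{A_t} S_{t+1}, \qquad t = 0, 1, 2, \dots$$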

The following diagram represents all of the possible sequences of states that the agent can pass through in the example described above, supposing that it starts in state (cell) 8 and stops when it reaches state 1. The asterisks in the image indicate that the branches of that state have already appeared at another point in the diagram; the only thing that differs between the two representations is the moment at which they split.

As we discussed above, the agent’s objective is to determine which of its future paths (which sequence of states) is the most suitable. To do so, in each new state the agent receives a reward, which it uses to evaluate the action selected in the previous state:

We can think of this process as an arbitrary function that maps state-action pairs to rewards.

This reward is represented in the form of a number, and its magnitude is proportional to the extent to which the agent approaches (positive number) or moves away (negative number) from the ideal solution.
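
In symbols, this is simply a mapping from state-action pairs to real-valued rewards:

$$R : S \times A \to \mathbb{R}$$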

The following list establishes an order of priorities for all of the possible events that the agent must consider to correctly resolve the problem. Each event may occur in one or more states (cells). Each state is assigned the reward associated with its respective event:

  1. Reach the exit (state 1) → +100 as a reward (a positive reward)
  2. Burning oneself (states 7 and 8) → -10 as a reward (a negative reward)
  3. Select the shortest route (all other states) → -1 as a reward (a negative reward) for each cell passed through. If the agent weren’t penalised for each step between cells, it would have no way of knowing that its route is not the shortest one, and it would end up taking an excessive number of steps and/or retracing steps it had already taken.

Keep in mind that the values assigned to the rewards are subjective; what matters is not their exact value, but rather the relationship between them. The following image summarises the distribution of rewards described in the list above for each state of the environment.
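
The same distribution can also be written down directly as a small table of state rewards, consistent with the priority list above:

```python
# Reward obtained when entering each state (cell), following the priority list.
rewards = {1: +100,               # reaching the exit
           7: -10, 8: -10}        # burning cells
for s in (2, 3, 4, 5, 6, 9):      # every other cell: -1 per step taken
    rewards[s] = -1
```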

Once the MDP modeling is complete, the environment is ready for the agent to begin learning, in other words, to internalise the distribution of rewards.

For example, let’s suppose that just after starting (state 8), the agent decides to take a step to the left. Upon reaching state 7, it will realise that it chose poorly (given the presence of fire) and will continue its path to the next state.

In this way, the agent can calculate the accumulated reward of all possible itineraries by summing up the immediate reward of each of their state-action pairs.

For example, for a given route that begins in time t and continues until reaching the final state of the sequence in time T, the cumulative reward would be the sum of all the rewards in its states.
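
Written out, this cumulative reward (usually called the return) for a route that starts at time t and ends at the final time T is simply:

$$G_t = R_{t+1} + R_{t+2} + \dots + R_T$$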

The agent’s objective is not to maximise the immediate reward (that of the next action), but rather the cumulative reward (Gt) over each of the possible combinations of actions.

In the figure below, the agent compares two of the seemingly optimal state-action sequences; although the left-hand path is shorter, the presence of fire dramatically reduces its cumulative reward, leaving the other path as the optimal solution to the problem.
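
As a purely hypothetical illustration with the rewards defined earlier (the exact path lengths depend on the figure), compare a short route that crosses a burning cell with a slightly longer route that avoids it:

```python
# Hypothetical reward sequences for two routes that end at the exit (+100).
short_but_burning = [-10, -1, +100]     # crosses a burning cell, then exits
longer_but_safe   = [-1, -1, -1, +100]  # one extra step, but no fire

print(sum(short_but_burning))   # 89
print(sum(longer_but_safe))     # 97 -> the longer route has the higher return
```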

However, if we stop to think for a moment, when the environment is finite it is likely that the agent will arrive at the final state before having completely familiarised itself with all of the state-action pairs. Because of this, it will need several episodes (sessions) of learning before it has internalised the entire distribution of rewards. Tasks of this type are referred to as episodic tasks.

Reinforcement learning problems can often have hundreds, thousands, or even an infinite number of states (continuous tasks). In these cases, finding the optimal solution can be an excessively large task, if not endless.

Because of this, in order to accelerate the algorithm’s convergence (in other words, to make it easier for the agent to find the route that maximises the cumulative reward), it is good practice to reduce the impact a reward has on the agent the further that reward lies from the initial state.

From the agent’s point of view, we could interpret this as a type of fatigue that grows as the agent continues taking actions. Or, said differently, a motivation to find the solution as quickly as possible.

Mathematically, this is achieved by multiplying each reward by a discount factor gamma (0 < 𝛾 < 1) raised to a power that grows with the number of states passed through, so that more distant rewards carry less weight. How quickly the algorithm converges depends on the value assigned to 𝛾: the smaller it is, the more the agent favours immediate rewards and the faster the values converge.

As such, the cumulative reward would be:
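
A standard way of writing this discounted cumulative reward (in an episodic task, the sum simply truncates at the final time T) is:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$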