Deep Reinforcement Learning — A Simple Approach to a Fascinating Area of AI Research



Agents mimicking human intelligence by continuously learning from their environment

October 12, 2018 by Stacy Stanford, Roberto Iriondo

Figure 1: Robots take actions that change the state of the world, using a reward system to plan actions that maximize cumulative reward. | Credits: Assistant Professor Anca Dragan, UC Berkeley

Deep reinforcement learning (DRL) is a type of machine learning and a branch of AI research. It allows machine learning agents (programs, algorithms, models, etc.) to learn by interacting with an environment in pursuit of a goal, guided by a reward system (a reward or objective function), so that the agent can determine the ideal behavior within a specific context and maximize its cumulative reward.

For instance, when playing video games we work towards achieving a goal, whether it is leveling up our character, advancing through the game, or winning it. Likewise, in DRL an agent works within its environment by making mistakes and learning from them through a reward function rather than direct human supervision.

These goal-oriented algorithms can start from a blank slate and, under the right conditions, reach superhuman performance (e.g., AlphaGo). Imagine that you are trying to motivate a child, either by providing a reward for completing a goal (e.g., candy) or by punishment (e.g., a time-out). In the same way, a DRL agent is penalized when it makes the wrong decision and rewarded when it makes the right one; this process is reinforcement learning.

World-renowned machine learning expert Professor Manuela Veloso at Carnegie Mellon University is a big fan of deep reinforcement learning, from the most basic Q-learning to any of its variations. Her favorite algorithm is (Deep) Q-Learning, which uses convolutional neural networks to recognize an agent's state, such as the environment the soccer player is in (see Figure 2).

Figure 2: Deep Q-Learning in action | Perfecting the art of free kicks in FIFA 18 with Tensorflow. | Credits: Chintan Trivedi
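To make this concrete, below is a minimal sketch of the tabular Q-learning update that Deep Q-Learning builds on. The states, actions, and hyperparameter values are purely illustrative assumptions and are not taken from Trivedi's FIFA project; Deep Q-Learning replaces the lookup table with a convolutional neural network that maps the observed state (e.g., game pixels) to a value per action.

```python
# Minimal tabular Q-learning sketch with made-up states, actions, and hyperparameters.
# Deep Q-Learning swaps the Q table for a convolutional neural network over pixels.
import random
from collections import defaultdict

states = ["standing", "running", "kicking"]        # hypothetical player states
actions = ["move_forward", "move_back", "shoot"]   # hypothetical actions

alpha, gamma, epsilon = 0.1, 0.9, 0.1  # learning rate, discount factor, exploration rate
Q = defaultdict(float)                 # Q[(state, action)] -> estimated long-term value

def choose_action(state):
    # Epsilon-greedy policy: explore occasionally, otherwise act on current estimates.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state):
    # Q-learning update: nudge the estimate toward reward + discounted best future value.
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```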

The soccer player above follows a Markov decision process (MDP): a stochastic process in which the future state depends only on the agent's current state, not on its history. In a simplistic way, imagine that the soccer player above can only do three things: run, stand/wait, or kick the ball. Each of these three actions/decisions corresponds to a state.

Figure 3: Markov chain with a soccer player in FIFA | Credits: Roberto Iriondo

The next state the soccer player will move into depends solely on his current state. It does not matter if the player has been kicking for the past five minutes; if at this moment he is standing still, that is all we need to know. The point is simple: since previous states do not determine the player's next action, we can predict the next state by looking only at the transition probabilities from the current one. In other words, we can compute the chance that the player moves from the standing state to the running state, or to the kicking state.
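As an illustration, here is a toy Markov chain over the three player states; the transition probabilities are invented for the example and only demonstrate that the next state is sampled from the current state alone.

```python
# Toy Markov chain over the three player states (transition probabilities are made up).
# The next state depends only on the current state, which is the Markov property.
import random

# transition[s][s'] = probability of moving from state s to state s'
transition = {
    "standing": {"standing": 0.5, "running": 0.4, "kicking": 0.1},
    "running":  {"standing": 0.2, "running": 0.5, "kicking": 0.3},
    "kicking":  {"standing": 0.7, "running": 0.2, "kicking": 0.1},
}

def next_state(current):
    # Sample the next state from the row of the transition table for the current state.
    probs = transition[current]
    return random.choices(list(probs), weights=list(probs.values()))[0]

state = "standing"
for _ in range(5):
    state = next_state(state)
    print(state)
```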

In this case, deep reinforcement learning builds on the Markov decision process (MDP), whose quantities are updated as the agent interacts with the environment in real time. Every MDP consists of five components:

1. Finite set of states (s)

This set of states can be the ones we talked about above: standing, running, and kicking. However, each project will have its own set of states, such as whether a door is open or closed, or the geographic position of something.

2. Finite set of actions (a)

Actions are precisely what they sound like: the moves that, in our case, the soccer player can carry out in each state. Here an action might be pressing a button to move the soccer player from one state to another, for example pressing forward or backward to make the player leave the standing state for the running state, or the kicking state.

3. Probability model (P)

This is the likelihood that an action takes the soccer player from his current state to a given future state.

4. Reward (R)

The reward is a numeric value that the soccer player receives after performing an action in a state. When this number is positive, the action taken in that state was beneficial; the soccer player's main objective is to perform the sequence of actions that maximizes the cumulative reward on the way to the final goal.

5. Discount factor (γ)

The discount factor is a scalar number smaller than 1. It weights the total future reward the agent will receive: with a discount factor of 0.9, a reward received several steps in the future is multiplied by 0.9 once for each step, so rewards count for less the further away they lie, while a value close to 1 still makes the agent care about the long term.

In other words, the soccer player will prefer to follow paths that guarantee a greater cumulative reward over the course of the game, instead of only the immediate reward; a small sketch of these five components and the discounted return is shown below.
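The sketch below lists the five MDP ingredients for the soccer example and computes a discounted return. All numbers are illustrative assumptions, chosen only to show how a discount factor below 1 shrinks rewards the further they lie in the future.

```python
# The five MDP ingredients for the soccer example, with illustrative (made-up) values.
states = ["standing", "running", "kicking"]         # s: finite set of states
actions = ["move_forward", "move_back", "shoot"]    # a: finite set of actions

# P: probability model, (state, action) -> {next_state: probability}
P = {("standing", "move_forward"): {"running": 0.9, "standing": 0.1}}

# R: reward for taking an action in a state (kicking and shooting pays the most here)
R = {("standing", "move_forward"): 1.0, ("kicking", "shoot"): 10.0}

gamma = 0.9  # discount factor

def discounted_return(rewards, gamma):
    # G = r_0 + gamma*r_1 + gamma^2*r_2 + ... , so later rewards are weighted less.
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# A reward of 10 received three steps from now is worth 10 * 0.9**3 = 7.29 today:
# the agent still values future reward, just less than an immediate one.
print(discounted_return([0.0, 0.0, 0.0, 10.0], gamma))
```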

Chintan Trivedi's publication mentions that while his ML model has not mastered all of the chosen game situations, it has learned some of them very well, thanks to DRL. Switching from supervised learning to DRL simplified the collection of training data and allowed the agent to learn to handle the chosen states extremely well; further findings regarding his research can be found in his publication.

Where to learn more about DRL?

The content we brought here only scratches the surface of deep reinforcement learning. It would be impossible to cover all the concepts in this field in a single article, but we hope that we have at least illustrated a small part of the process for you. Here are some tips if you want to go deeper into the subject:

Free course from the School of AI: this course was created by Siraj Raval to teach deep reinforcement learning for free. The content of the course is very straightforward and will give you a good overview of the subject.

DRL by Sebastian Thrun: this course was created by Sebastian Thrun, the founder of Udacity, more than five years ago and is still kept up to date. To give you an idea, the thunderous success of this course, with more than 50,000 students enrolled in less than two weeks, was the reason Thrun founded Udacity.

Deep Reinforcement Learning and Control, Katerina Fragkiadaki, Tom Mitchell: this page contains a lot of great information, along with readings on DRL. Professors Fragkiadaki and Mitchell are both experts in machine learning; furthermore, Professor Mitchell has recently been named CMU's Interim Dean of the School of Computer Science, which is considered #1 in the world for AI and machine learning research.

DRL course for Python on Udemy: I have not actually taken this course, but a professional I look up to recommended it and, after examining the instructor's qualifications, I think it is worth checking out. The course focuses on teaching the theoretical and practical concepts of RL using Python.

Deep Reinforcement Learning by Ruslan Salakhutdinov: Professor Salakhutdinov is an expert in deep learning, machine learning, and large-scale optimization. His main goal is to understand the computational and statistical principles required for discovering structure in large amounts of data. He is currently on leave from the Machine Learning Department | @CarnegieMellon to lead Apple's AI Research Lab.

Final notes

It helps to think of deep reinforcement learning as an algorithm in action: imagine that the algorithm is learning to play FIFA by maneuvering the player through its states, trying to get the soccer player to score the maximum number of goals (points). To do that, you can have different instances of the soccer player executing in parallel, working towards the same goal and exploring all the possible game states. These soccer players are reward-seeking agents that maximize the rewards achieved during those game instances, building heat-maps (maps indicating how the maximum reward of a set of actions is achieved).

Nevertheless, while DRL is a great choice for simple games, it seems to fail when it encounters unfamiliar situations, so adjusting the objective function with a supervised learning approach helps address these issues. As Dr. Dragan mentioned during her lecture at CMU, she trained a DRL agent whose objective function was to score the most points during the game, and found during training that the agent discovered a loop in the course that let it rack up the most points rather than aim to win the race. This surprised the researchers, since the agent did exactly what was specified, scoring the most points, without really competing in or winning the game.

Figure 4: The agent figures out a loop during the race that maximizes points scored, rather than participating in and winning the race. | Credits: Assistant Professor Anca Dragan, UC Berkeley

References & Acknowledgements:

Anca Dragan, Trust & Transparency on Ethics & AI | https://www.youtube.com/watch?v=21Ev8FGFIbo

An Introduction to Markov Decision Processes, Bob Givan | Rice CS

Tree Based Hierarchical Reinforcement Learning | http://reports-archive.adm.cs.cmu.edu/anon/2002/CMU-CS-02-169.pdf
