Reinforcement Learning Using a Single Demonstration

Source: Deep Learning on Medium

Go to the profile of Or Rivlin

Learning from Demonstration

Reinforcement learning bears a lot of promise for the future; recent achievements have shown its ability in solving problems at super human level, like playing board games, predicting the structure of proteins based on their genetic sequence and playing real-time strategy games on a professional level. These achievements demonstrate the capability to discover new strategies that are superior to those we humans can devise, which is an exciting prospect.

Another possible use of RL is automating human decision making, in cases where human performance is good enough. In this setting we would like our agent to imitate the strategies employed by a human expert, which provides our agent with demonstrations of the “right” way to do the task. This can be very handy in several situations, for example:

a) When we cannot formulate a reward function, but we can query an expert to guide the learning process or have logged data of the expert’s actions and their consequences. A practical example of this might be attempting to learn a “help desk” style policy, where our agent needs to interact with costumers and help solve their problems. Since we cannot easily model the reward functions of customers, we cannot employ standard RL techniques, but we might be able to leverage human expert logs to learn a policy.

b) When we have a “standard” type of learning problem, but it is just too difficult to learn from scratch and our algorithms fails completely. Having demonstrations of an expert could help give our learning algorithm a major boost in the right direction

An important point to note is that the expert does not necessarily have to be human. Perhaps we have an optimization or search algorithm that can do the job, but it is too slow for our real-time application, and we need to approximate it using a learned neural network policy. In this setting, the search algorithm can provide expert demonstrations for the learning algorithm throughout the process.

In some sense, learning from human demonstrations is more challenging, since having a human sit down throughout the learning process and provide advice when necessary is generally not practical, unlike the case of our search algorithm. Human expert time is most likely expensive, and so we will generally have only a small quantity of such data.

Imitation Learning

So how can we use demonstrations given by an expert to learn a policy? An obvious and naïve approach would be to use the demonstrations as labeled data and use supervised learning to try and predict the actions taken by the expert at different states. Supervised learning is much better understood than RL, and if we could predict the actions well, it stands to reason that our agent will perform relatively well, or at least somewhat close to the expert.

Sadly however, this approach fails very miserably in many cases when used on its own. The state-space in most realistic problems is very large, and the number of states for which we might have a demonstration is a tiny fraction of that. Since policy learning inherently deals with multi-step processes, and our naïve supervised learning approach learns a response for single states, a small divergence at the beginning of an episode could have a compounding effect that takes our agent to states the likes of which it has never observed and has no clue how to act in.


Another problem with this Behavior Cloning is that our learning process does not optimize for our desired metric, which is the accumulated return from episodes, but rather minimizes some notion of distance between model predictions and expert moves. Obviously, if our model predicts the moves perfectly it will also produce the same returns, but how does prediction error translate to difference in the returns? Small errors might have a large impact on the agent’s performance on our task.

To have any hope of using this approach successfully we must have a very large amount of data that covers a very broad range of states, something we might not have. Even then, behavior cloning is often used as initialization for another reinforcement learning algorithm, to speed up the process of learning the basics. For example, the original AlphaGo trained its policy network using a dataset of 30M human expert moves gathered from online games, and the recent AlphaStar agent was initialized using professional player games, before both were further trained using RL algorithms.

Learning from a Single Demonstration

Several papers in recent years investigated the option of using human demonstrations to help agents learn in difficult problems. A notoriously hard problem that is frequently used as a benchmark is Montezuma’s Revenge, which is an Atari game with extremely sparse and delayed rewards that most standard RL algorithms fail to make even small progress in.

A very interesting paper on the subject was published by researchers from Google’s DeepMind called “Playing Hard Exploration Games by Watching YouTube”. In this paper the researchers gathered many YouTube videos of human players playing the game, trained a neural network to predict the time difference between different frames of the same episode. This produced a meaningful embedding of game states from different sources with visual variations like slightly different coloring and resolution, which made it possible to “plant” fictitious rewards along the embedding trajectories from which the agent can learn that it is on the right track.

While this approach makes a very nice use of an abundant resource — YouTube videos, it might not be applicable in many problems for which very little data exists. Researchers from OpenAI tackled this problem in a recent paper called “Learning Montezuma’s Revenge from a Single Demonstration”. The solution is very simple: given a state-action trajectory provided by an expert, restart the agent at the end of the trajectory and let it learn on its own using some RL algorithm, and then progressively restart it earlier in the trajectory until eventually you restart it from the beginning and let it learn from there.

The idea behind it is that by restarting it at the end of the trajectory we place it adjacent to a reward, so close in fact that a standard RL algorithm will have no trouble finding it. When the agent has learned to find the reward satisfactorily, we restart it at an earlier part of the state-action trajectory given by the expert, and let it learn from there.

To get some intuition as to why that is such a good idea, let’s look at a simple toy problem that the authors provide in the paper; the Blind Cliff Walk problem. In this problem the agent must navigate across a one-dimensional cliff to get back to safety, using either of two actions. The first action moves it forward along the cliff and the second one causes it to fall and die. We assume a tabular setting where the agent cannot generalize between states, and must learn a table that specifies an action for each state.


The agent receives a reward only upon reaching the goal state, and so must initially explore its environment using random actions. However, the expected number of actions required to achieve the reward is exponential in the length of the cliff, making it unpractical beyond very short lengths. The authors show that by having a single successful demonstration and using the proposed method, the time to solve this task becomes quadratic in the length of the cliff, which is a huge improvement.


The authors note that this is very much like Dynamic Programming, in which we typically solve problems from the end backwards, and bootstrap our solutions of the later stages to help quickly solve the earlier stages. In Dynamic Programming we actually observe a very similar reduction of computational complexity for problems such as shortest path in graphs.


The researchers used this method on the infamous Montezuma’s Revenge and obtained state of the art results at the time, beating the score of the DeepMind paper by a substantial margin and using far fewer data to do so.

An obvious drawback to this approach is that it requires having the ability to restart our agent at different states along a prescribed trajectory, which means we must either have free control of the environment or that it be deterministic such that taking the same sequence of actions will always lead to the same state. However, if the method is applicable, it has two major advantages:

1) It requires very little data, and as was demonstrated in the paper even a single trajectory could be enough to solve difficult problems.

2) It optimizes directly for the return, as the trajectories are used only to initialize the agent and from there on it learns using standard RL. This makes it possible in principle for the agent to actually perform better than the expert demonstrator.

I always like it when simple ideas work very well, and this is a great example of such a case. Check out the paper.

Disclaimer: the views expressed in this article are those of the author and do not reflect those of IBM.