Source: Deep Learning on Medium
DeepMind and OpenAI are two artificial intelligence (AI) companies at the center of advancements in reinforcement learning (RL). From AlphaGo to OpenAI Five in Dota 2, both DeepMind and OpenAI have been pushing the boundaries of RL applications to surpass humans in complex cognitive tasks. Last week, the two research powerhouses teamed up on a new paper that proposes a method for training RL agents in ways that enable them to achieve superhuman performance.
Titled “Reward learning from human preferences and demonstrations in Atari”, the new research paper introduces a training model that combines human feedback and reward optimization to maximize the knowledge of RL agents. The core thesis of the paper addresses one of the major limitations of modern RL applications. Most successful RL applications operate in environments, such as multi-player games, with well-established reward models that can be hardcoded into the RL agent. However, many tasks that we face in real life present sparse or poorly defined rewards. Consider the task of finding an object in a house. In that setting, it’s hard to predetermine how to reward the agent after a specific action, such as searching under the bed without finding the target object. Should we send the agent to the next adjacent room or to the opposite side of the house? In those situations, humans rely on intuition to solve complex tasks, but we understand too little about intuition to replicate it in RL agents. As a result, RL relies on the next best thing: human feedback.
Inverse reinforcement learning (IRL), or imitation learning, is an RL variation that focuses on learning a reward function from human feedback. While IRL overcomes some of the limitations of traditional RL in environments with sparse rewards, it has some basic scalability limitations, as it requires domain experts to train the agents. Additionally, if an RL agent is simply imitating a human judge, how can it ever exceed human performance?
Reward Learning from Human Preferences
To address some of the limitations of RL and IRL models, DeepMind and OpenAI proposed a method that combines human feedback and RL optimization to achieve superhuman performance in RL tasks. Instead of assuming a specific reward model, the proposed technique learns a reward function by leveraging two main feedback channels:
1) Demonstrations: several trajectories of human behavior on the task.
2) Preferences: the human compares pairwise short trajectory segments of the agent’s behavior and prefers those that are closer to the intended goal.
In a typical environment, the demonstrations are available from the beginning of the experiment, while preferences are collected dynamically during training. Step 1 allows the RL agent to approximate the behavior of the human trainer, while step 2 optimizes a reward function inferred from the preferences and demonstrations. In that sense, step 2 offers a window to surpass human performance in RL tasks.
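To make the preference channel concrete, a common way to turn pairwise comparisons into a reward-learning signal is a Bradley-Terry model: the probability that a human prefers one segment over another is a softmax over the summed predicted rewards of each segment, and the reward model is trained to minimize the cross-entropy against the human's choice. The sketch below is illustrative, assuming this standard formulation; the function names are not from the paper's code.

```python
import math

def preference_probability(rewards_seg1, rewards_seg2):
    """P[segment 1 preferred] = exp(sum r1) / (exp(sum r1) + exp(sum r2)),
    where r1, r2 are the reward model's per-step estimates for each clip."""
    s1, s2 = sum(rewards_seg1), sum(rewards_seg2)
    m = max(s1, s2)  # subtract the max before exponentiating, for stability
    e1, e2 = math.exp(s1 - m), math.exp(s2 - m)
    return e1 / (e1 + e2)

def preference_loss(rewards_seg1, rewards_seg2, human_prefers_seg1):
    """Cross-entropy between the model's preference and the human label;
    minimizing this over many labeled pairs trains the reward estimator."""
    p = preference_probability(rewards_seg1, rewards_seg2)
    return -math.log(p) if human_prefers_seg1 else -math.log(1.0 - p)
```

Intuitively, if the human prefers segment 1, the loss pushes the reward model to assign segment 1 a higher total reward than segment 2.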
The reward learning with human preferences model has two main components:
a) A Deep Q-Learning network that learns an action-value function from a given set of observations. The action-value function is learned both from demonstrations and from the preference feedback. During the pretraining phase, the agent learns action-values only from expert demonstrations. During training, the agent’s own experience is added to the training data.
b) A convolutional neural network (CNN) that takes an observation as input and outputs an estimate of the corresponding reward. The CNN is trained using the human preferences and serves as the learned reward function.
Using this simple architecture, the reward learning with human preferences model can not only match the human performance captured by the Deep Q-Learning network but also use the CNN to optimize beyond it. In other words, this model allows an RL agent to learn not only from demonstrations (like traditional imitation learning) but also from experience.
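The interaction between the two components can be sketched as a Q-learning update in which the learned reward model, rather than the environment, supplies the reward signal. The snippet below is a minimal illustration of that idea using a tabular Q-table in place of the paper's deep Q-network; `reward_model` stands in for the trained reward CNN, and the two-action space is an assumption for brevity.

```python
def q_update(q, state, action, next_state, reward_model,
             alpha=0.1, gamma=0.99, actions=(0, 1)):
    """One Q-learning step driven by the *learned* reward, not the true one.

    q            -- dict mapping (state, action) to the current Q estimate
    reward_model -- callable (state, action) -> estimated reward (stand-in
                    for the preference-trained reward CNN)
    """
    r_hat = reward_model(state, action)  # estimated reward for this step
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    target = r_hat + gamma * best_next   # standard bootstrapped target
    key = (state, action)
    q[key] = q.get(key, 0.0) + alpha * (target - q.get(key, 0.0))
    return q[key]
```

Because the agent optimizes the inferred reward rather than copying trajectories, its policy can in principle exceed the performance of the demonstrator.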
Learning to Play Atari Games from Scratch
Atari games are a classic benchmarking environment for RL models, as they are incredibly diverse but also come with well-known reward functions. For that reason, most RL agents for Atari games have the reward function specified in the model itself. But what happens if the RL agent doesn’t have access to the reward function? How would it handle the diversity of the Atari environments? This is the challenge that DeepMind and OpenAI decided to tackle by testing the reward learning with human preferences model on the Arcade Learning Environment, an open source framework for designing AI agents that can play Atari 2600 games. The experiments used four fundamental setups:
1) Imitation Learning (first baseline): Learning purely from the demonstrations without reinforcement learning. In this setup, no preference feedback is provided to the agent.
2) No Demos (second baseline): Learning from preferences without expert demonstrations.
3) Demos + Preferences: Learning from both preferences and expert demonstrations.
4) Demos + Preferences + Autolabels: Learning from preferences and expert demonstrations, with additional preferences automatically gathered by preferring demo clips to clips from the initial trajectories.
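The "autolabels" setup in 4) can be sketched as a simple pairing rule: each demonstration clip is matched against a clip from the agent's initial trajectories, and a synthetic preference label is generated that always prefers the demonstration. The function and clip representation below are illustrative assumptions, not the paper's implementation.

```python
def make_auto_labels(demo_clips, initial_agent_clips):
    """Generate synthetic preference labels: demo clips are always
    preferred over clips from the agent's initial trajectories.

    Returns a list of (clip_a, clip_b, a_preferred) triples in the same
    shape a human labeler would produce, so they can be mixed freely
    with real preference labels when training the reward model.
    """
    labels = []
    for demo, agent in zip(demo_clips, initial_agent_clips):
        labels.append((demo, agent, True))  # True: first clip preferred
    return labels
```

These automatic labels are cheap to produce and supplement the human-provided preferences without requiring additional annotator time.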
The different configurations were evaluated on nine Atari games: Beamrider, Breakout, Enduro, Pong, Q*bert, Seaquest, Hero, Montezuma’s Revenge, and Private Eye. The last three games were chosen specifically for their exploration difficulty. The results of the experiments were remarkable: after 50 million steps and a full schedule of 6,800 labels, the reward learning with human preferences model outperformed the alternatives in all games except Private Eye, a game environment notoriously favorable to imitation. Not surprisingly, the experiments showed that most configurations achieved their best performance when full feedback and demonstrations were available. Even more remarkable was the fact that the reward function learned by the agents, in most cases, aligns with the true reward function of the game and, in some cases, outperforms it.
The work of OpenAI and DeepMind showed that combining demonstrations and preferences is an efficient way to guide RL agents in environments with sparse or poorly defined rewards. The experiments showed that even small amounts of preference feedback helped RL models outperform traditional imitation learning techniques. This is another example of how combining humans and machines leads to better AI.