Original article can be found here (source): Artificial Intelligence on Medium
Deep reinforcement learning (DRL) has been at the center of some of the most important artificial intelligence (AI) breakthroughs of the last decade. Because it learns through interaction with an environment, DRL is regularly applied to real-world scenarios such as self-driving vehicles, which operate in highly complex environments. Those requirements are pushing DRL research toward agents that can generalize knowledge of their environment without extensive trial and error. DeepMind and Google recently open sourced Dreamer, a reinforcement learning agent that learns a world model from images and uses it to learn long-sighted behaviors.
The world of DRL can be divided into two main groups: model-free and model-based. In its most basic form, model-free reinforcement learning focuses on mastering specific tasks by learning, directly from rewards, which action to take for a given sensory input. This approach has been the foundation behind systems such as DeepMind’s DQN, which mastered Atari games. Model-free reinforcement learning typically requires a large number of simulated training sessions to map sensory inputs to actions, which often limits its usefulness for long-term planning.
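To make the idea of mapping rewards to actions concrete, here is a minimal sketch of tabular Q-learning, the classic model-free update that DQN scales up with a neural network. The five-state chain environment, the function name, and the hyperparameters are all assumptions invented for illustration; none of them come from DQN or Dreamer.

```python
import random


def q_learning_chain(n_states=5, episodes=500, alpha=0.1, gamma=0.9,
                     epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical 1-D chain.

    Action 0 moves left, action 1 moves right. Reaching the last state
    gives a reward of 1 and ends the episode.
    """
    rng = random.Random(seed)
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]  # q[state][action]
    for _ in range(episodes):
        state = 0
        while state != goal:
            # Epsilon-greedy: mostly exploit, sometimes explore.
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state = max(0, state - 1) if action == 0 else state + 1
            reward = 1.0 if next_state == goal else 0.0
            # The terminal state has no future value.
            future = 0.0 if next_state == goal else gamma * max(q[next_state])
            # Model-free update: no model of the chain is ever learned,
            # only the value of each (state, action) pair.
            q[state][action] += alpha * (reward + future - q[state][action])
            state = next_state
    return q
```

Note how the agent only discovers that moving right pays off by repeatedly trying it; that reliance on trial and error is what the paragraph above refers to.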
Model-based reinforcement learning is the best-known alternative to model-free architectures. Planning with a model of the environment is a core ingredient of systems such as DeepMind’s AlphaGo. In contrast with model-free approaches, model-based reinforcement learning attempts to have agents learn how the world behaves in general and select actions based on predicted long-term outcomes. These learned representations are known as “world models” and are a fundamental element of model-based DRL. Not surprisingly, model-based agents are well suited to the kind of longer-term planning required in strategy games.
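As a rough illustration of selecting actions based on long-term outcomes, the sketch below fits a tiny tabular model from logged transitions and then plans by looking ahead inside that model. The three-state environment and the names `fit_model` and `plan` are hypothetical; real world models are learned neural networks, not lookup tables.

```python
def fit_model(transitions):
    """Learn a deterministic tabular model from (state, action, next_state, reward) tuples."""
    return {(s, a): (s2, r) for s, a, s2, r in transitions}


def plan(model, state, actions, horizon, gamma=0.9):
    """Return (best_return, best_first_action) by exhaustive look-ahead in the model."""
    if horizon == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for action in actions:
        if (state, action) not in model:
            continue  # the model has never seen this transition
        next_state, reward = model[(state, action)]
        future, _ = plan(model, next_state, actions, horizon - 1, gamma)
        value = reward + gamma * future
        if value > best_value:
            best_value, best_action = value, action
    if best_action is None:
        return 0.0, None
    return best_value, best_action


# Hypothetical experience: "a" pays 1 now but leads to a dead end,
# "b" pays nothing now but leads to a state where "a" pays 10.
experience = [
    (0, "a", 1, 1.0), (0, "b", 2, 0.0),
    (1, "a", 1, 0.0), (1, "b", 1, 0.0),
    (2, "a", 2, 10.0), (2, "b", 2, 0.0),
]
model = fit_model(experience)
short_sighted = plan(model, 0, ["a", "b"], horizon=1)  # (1.0, "a")
long_sighted = plan(model, 0, ["a", "b"], horizon=2)   # (9.0, "b")
```

With a one-step horizon the planner grabs the immediate reward; with two steps it gives that up for the larger delayed reward, without taking a single extra action in the real environment.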
One of the main challenges for the mainstream adoption of model-based DRL has been the ability to generalize to long-horizon tasks. In real-world scenarios, DRL agents regularly interact with complex environments and face situations they have never seen before. Handling them requires building representations of the world from past experience that enable generalization to novel situations. Although there has been notable progress in this area, long-term planning in model-based DRL remains very expensive from a computational standpoint.
Google’s Dreamer is a DRL agent that can learn long-horizon behaviors in a given environment. One of its main innovations is that the agent learns a world model directly from images. From there, Dreamer leverages its world model to efficiently learn behaviors via backpropagation through model predictions. I know this all sounds a bit surreal, so let’s dig into the details.
From an architecture standpoint, Dreamer follows the same overall structure as other model-based DRL methods. Functionally, it is based on three fundamental steps. In the first step, the agent learns a world model from a dataset of past experience, learning to encode observations and actions into compact latent states. The second step focuses on learning value and actor networks: Dreamer predicts state values, and actions that maximize those future value predictions, by propagating gradients back through imagined trajectories. Finally, the third step is environment interaction: the agent encodes the history of the episode to compute the current model state and predict the next action to execute in the environment.
Let’s dive deeper into some of those steps.
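The three steps can be sketched as a simple training loop. This is only a skeleton under stated assumptions: the function name, the callables it takes, and the way updates and interaction are interleaved are hypothetical placeholders, not the interface of the open source implementation.

```python
def dreamer_style_loop(collect_episode, update_world_model, update_behavior,
                       iterations, updates_per_iteration):
    """Skeleton of a three-step model-based loop (all arguments are hypothetical callables).

    collect_episode()           -> one episode of real experience
    update_world_model(dataset) -> step 1: fit the world model to past experience
    update_behavior(dataset)    -> step 2: improve actor and value in imagination
    """
    dataset = [collect_episode()]  # seed experience to learn from
    for _ in range(iterations):
        for _ in range(updates_per_iteration):
            update_world_model(dataset)    # step 1: learn the world model
            update_behavior(dataset)       # step 2: learn behaviors from predictions
        dataset.append(collect_episode())  # step 3: act in the environment
    return dataset
```

The design point worth noticing is that the expensive part, real environment interaction, happens once per outer iteration, while the model and behavior updates can run many times on data that has already been collected.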
Step 1: Learning the World Model
To build an accurate world model, Dreamer leverages another innovative project from Google and DeepMind. Google’s Deep Planning Network (PlaNet) is a purely model-based reinforcement learning algorithm that solves control tasks from images by planning efficiently in a learned latent space. In other words, PlaNet learns about an environment from images and uses that knowledge for long-term planning in image-based control tasks. To plan efficiently, PlaNet introduces a latent dynamics model: a compact representation of the “latent states” behind an image, which capture quantities such as the positions or velocities of objects. Instead of predicting the next image directly from the current image, as other image-based planning models do, PlaNet predicts the next latent state, and that state is then used to predict future images.
Dreamer leverages the PlaNet world model to predict outcomes based on a sequence of compact model states that are computed from the input images, instead of directly predicting from one image to the next. It automatically learns to produce model states that represent concepts helpful for predicting future outcomes, such as object types, positions of objects, and the interaction of the objects with their surroundings. Dreamer learns this world model from sequences of images, actions, and rewards drawn from the agent’s dataset of past experience.
One of the benefits of using PlaNet is computational efficiency. Because it predicts ahead in compact model states rather than in image space, Dreamer is able to predict thousands of sequences in parallel on a single GPU, which also facilitates generalization.
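The difference between predicting images and predicting latent states is easier to see in a toy example. In the sketch below, an "image" is a row of pixels with a single bright pixel marking an object, and the latent state is just its position and velocity. Everything here is hand-written and hypothetical (`encode`, `latent_step`, `decode`, `predict_ahead`); in PlaNet the encoder, dynamics, and decoder are learned neural networks.

```python
def encode(prev_image, image):
    """Compress two consecutive frames into a latent state (position, velocity)."""
    position = image.index(1)
    velocity = position - prev_image.index(1)
    return (position, velocity)


def latent_step(latent):
    """Predict the next latent state without touching any pixels."""
    position, velocity = latent
    return (position + velocity, velocity)


def decode(latent, width):
    """Render a latent state back into an image, only when one is needed."""
    position, _ = latent
    image = [0] * width
    if 0 <= position < width:
        image[position] = 1
    return image


def predict_ahead(prev_image, image, steps):
    """Roll the dynamics forward in latent space, then decode the final state."""
    latent = encode(prev_image, image)
    for _ in range(steps):
        latent = latent_step(latent)
    return latent, decode(latent, len(image))
```

Each prediction step updates two numbers instead of a whole image, which is the intuition behind planning in a compact latent space.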
Step 2: Behavioral Learning
One of the challenges of model-based DRL is predicting long-term outcomes without incurring heavy computational costs. Dreamer overcomes this challenge by learning a value network and an actor network via backpropagation through the predictions of its world model. The agent efficiently learns the actor network to predict successful actions by propagating gradients of rewards backwards through predicted state sequences. This allows Dreamer to learn how small changes to its actions affect the rewards predicted in the future, so it can refine the actor network in the direction that increases the rewards the most.
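Here is a minimal sketch of what propagating reward gradients backwards through predicted states can look like, written with a hand-derived backward pass so it runs without any deep learning library. The assumptions are substantial: the state is a single number, the "learned" dynamics are simply `next_state = state + action`, the actor has one parameter `theta`, and the value network is left out entirely. All names are hypothetical.

```python
def imagine(theta, start, goal, horizon):
    """Roll the actor forward inside the (assumed) model; return states and total reward."""
    states, total_reward = [start], 0.0
    for _ in range(horizon):
        state = states[-1]
        action = theta * (goal - state)            # one-parameter actor
        next_state = state + action                # stand-in for learned dynamics
        total_reward += -(next_state - goal) ** 2  # reward: stay close to the goal
        states.append(next_state)
    return states, total_reward


def reward_gradient(theta, start, goal, horizon):
    """Gradient of the imagined total reward with respect to theta (reverse pass)."""
    states, _ = imagine(theta, start, goal, horizon)
    grad_theta, grad_state = 0.0, 0.0
    for t in reversed(range(horizon)):
        state, next_state = states[t], states[t + 1]
        # Gradient reaching next_state: later rewards plus its own reward.
        grad_next = grad_state - 2.0 * (next_state - goal)
        # next_state = state + theta * (goal - state)
        grad_theta += grad_next * (goal - state)  # d next_state / d theta
        grad_state = grad_next * (1.0 - theta)    # d next_state / d state
    return grad_theta


def train_actor(theta=0.2, start=0.0, goal=1.0, horizon=5, lr=0.02, steps=200):
    """Gradient ascent on imagined reward; no real environment steps are taken."""
    for _ in range(steps):
        theta += lr * reward_gradient(theta, start, goal, horizon)
    return theta
```

In this toy setup the best policy is `theta = 1`, which reaches the goal in one step, and `train_actor()` moves toward it using only imagined trajectories. A framework with automatic differentiation would compute the same gradient without the manual backward loop.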
Step 3: Environment Interaction
Dreamer was evaluated on a benchmark of different tasks with continuous actions. The benchmark included diverse challenges such as difficult-to-predict collisions, sparse rewards, chaotic dynamics, small but relevant objects, high degrees of freedom, and 3D perspectives.
The results of the benchmark were compared against other state-of-the-art model-based DRL models, including PlaNet. Dreamer outperformed all the alternatives while also using fewer environment interactions.
Dreamer is a very intriguing project that offers some perspective on how model-based DRL agents can master long-term tasks. The new agent provides tangible improvements over competitors and showed strong performance in mastering control tasks from image inputs. Google and DeepMind open sourced the initial implementation of Dreamer on GitHub.