Source: Deep Learning on Medium
Have you ever heard the old phrase “curiosity killed the cat”? We need more of that in artificial intelligence.
Curiosity is a basic element of human cognition, so pervasive that we barely notice it. Philosopher and psychologist William James (1899) called curiosity “the impulse towards better cognition,” meaning the desire to understand what you know you do not yet understand. Curiosity is particularly noticeable in babies, who are drawn towards objects in order to form concepts of the world. Recreating curiosity in artificial intelligence (AI) is a key area of focus in several disciplines, such as reinforcement learning (RL). Just a few days ago, I wrote about advancements in the field of RL curiosity. AI research powerhouse OpenAI has also been active in this area and recently published a paper outlining a new method, called Random Network Distillation, for building curiosity in RL agents.
Noisy-TVs and the Challenges of AI Curiosity
The challenges of recreating curiosity in RL systems are fundamentally driven by our ignorance of its cognitive foundation. Despite curiosity being a pervasive cognitive skill, we understand very little about it from the neuroscience and psychology standpoints. The classic idea for inducing curiosity in RL agents is to add rewards that are tied to the exploration of new states. Just as a baby is attracted to a new object, an RL agent receives a reward for exploring the environment. While this approach seems conceptually sound, it is vulnerable to one of the most famous RL dilemmas, known as the “noisy-TV problem”.
The noisy-TV problem describes a scenario in which an RL agent gets stuck pursuing bad rewards in an environment. Imagine an RL agent that is placed in a 3D maze. There is a precious goal somewhere in the maze which would give a large reward. Now, the agent is also given a remote control to a TV and can switch the channels. Every switch shows a random image (say, from a fixed set of images). Curiosity formulations that optimize surprise would rejoice, because the result of the channel-switching action is unpredictable. In other words, the randomness of the environment will make the agent stay in front of the TV forever instead of trying to solve the target task.
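To make the failure mode concrete, here is a minimal sketch in Python with NumPy. Everything in it is an assumption made for illustration: the two action names, the toy observation vectors, and the running-mean forward model are hypothetical stand-ins, not the setup used by OpenAI. The curiosity bonus is simply the squared error of the forward model's prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

N_IMAGES = 8   # size of the fixed set of TV images (assumed)
OBS_DIM = 16   # length of each toy observation vector (assumed)
tv_images = rng.normal(size=(N_IMAGES, OBS_DIM))
corridor_view = rng.normal(size=OBS_DIM)  # what the agent sees after "walk"


def step(action):
    """Next observation in a hypothetical two-action environment."""
    if action == "walk":
        return corridor_view                       # deterministic outcome
    return tv_images[rng.integers(N_IMAGES)]       # "switch_channel": random image


# Tabular forward model: a running mean of the next observation per action.
prediction = {"walk": np.zeros(OBS_DIM), "switch_channel": np.zeros(OBS_DIM)}
counts = {"walk": 0, "switch_channel": 0}
surprise = {"walk": [], "switch_channel": []}

for _ in range(500):
    for action in prediction:
        obs = step(action)
        error = float(np.mean((obs - prediction[action]) ** 2))
        surprise[action].append(error)             # curiosity bonus = prediction error
        counts[action] += 1
        prediction[action] += (obs - prediction[action]) / counts[action]

late_walk = float(np.mean(surprise["walk"][-100:]))
late_tv = float(np.mean(surprise["switch_channel"][-100:]))
print(f"late surprise, walking:   {late_walk:.4f}")
print(f"late surprise, TV remote: {late_tv:.4f}")
```

In this toy setting the bonus for walking vanishes once the outcome has been seen, while the bonus for switching channels stays high no matter how long the agent watches, which is exactly the incentive that keeps a surprise-seeking agent glued to the TV.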
To address challenges such as the noisy-TV problem, OpenAI introduced a method that discriminates “bad rewards” from “good rewards” in order to incentivize the agent to explore a given environment.
Random Network Distillation
When observing RL agents in complex environments prone to randomness, OpenAI identified four major causes of prediction errors:
1. Amount of Training Data: Prediction error is high where few similar examples were seen by the predictor (epistemic uncertainty).
2. Stochasticity: Prediction error is high because the target function is stochastic (aleatoric uncertainty). Stochastic transitions are a source of such error for forward dynamics prediction.
3. Model Misspecification: Prediction error is high because necessary information is missing, or the model class is too limited to fit the complexity of the target function.
4. Learning Dynamics: Prediction error is high because the optimization process fails to find a predictor in the model class that best approximates the target function.
Looking at this list in more detail, factor 1 can be seen as a way to quantify the novelty of an experience, which makes it a useful source of error. Factors 2 and 3 are typically associated with the noisy-TV problem, because they reward the RL agent for prediction errors that do not shrink as it gathers more experience.
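One way to see why factor 2 is harmful is the standard decomposition of expected squared error, written here in LaTeX. This is a textbook identity rather than something taken from the OpenAI paper, and the symbols are assumptions for illustration: y is the stochastic prediction target (for example, the next observation), x is the input, and \hat{f}(x) is the model's prediction.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2 \mid x\big]
  = \underbrace{\big(\mathbb{E}[y \mid x] - \hat{f}(x)\big)^2}_{\text{gap to the conditional mean}}
  + \underbrace{\operatorname{Var}(y \mid x)}_{\text{noise in the target (factor 2)}}
```

Loosely speaking, factors 1, 3, and 4 all show up in the first term, and only the part caused by factor 1 signals novelty. The second term does not depend on the model at all, so no amount of training removes it, and a reward based on the total error keeps paying out wherever the environment is noisy.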
Random Network Distillation (RND) tries to avoid the previous challenges by using a model that predicts the output of a fixed and randomly initialized neural network on the next state, given the next state itself. The RND architecture uses two neural networks: a fixed and randomly initialized target network, which sets the prediction problem, and a predictor network trained on data collected by the agent. The target network maps an observation to an embedding, and the predictor network is trained by gradient descent to minimize, with respect to its parameters, the expected mean squared error between its output and the target network's output. This process distills a randomly initialized neural network into a trained one, in which the prediction error is expected to be higher for novel states dissimilar to the ones the predictor has been trained on.
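Written as a formula, the training objective described above looks roughly as follows. The notation is chosen here for illustration: f is the fixed target network, \hat{f}(\cdot;\theta) is the predictor with parameters \theta, x is an observation collected by the agent, and r^{\text{int}} is a label for the resulting exploration bonus.

```latex
\theta^{*} = \arg\min_{\theta}\; \mathbb{E}_{x}\,\big\| \hat{f}(x;\theta) - f(x) \big\|^{2},
\qquad
r^{\text{int}}(x) = \big\| \hat{f}(x;\theta) - f(x) \big\|^{2}
```

Only \theta is updated; the target f never changes, so the same observation always maps to the same embedding.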
The main intuition behind RND is that the RL agent’s predictions of the output of a randomly initialized neural network will be less accurate in novel states than in states the agent has visited frequently. The advantage of using a synthetic prediction problem is that we can make it deterministic and keep it inside the class of functions the predictor can represent, by giving the predictor the same architecture as the target network. These choices let RND avoid factors 2 and 3, making it resilient to the noisy-TV problem.
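The intuition can be checked with a small NumPy sketch. This is a toy illustration under stated assumptions, not OpenAI's implementation: the network sizes, the one-hidden-layer tanh architecture, the learning rate, and the two clusters of synthetic observations are all made up for the example, and function names such as `intrinsic_reward` and `train_step` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, HIDDEN, EMBED = 8, 32, 16   # toy sizes, chosen for illustration


def init_network():
    """Random weights for a one-hidden-layer MLP: obs -> tanh -> embedding."""
    return {"W1": rng.normal(size=(OBS_DIM, HIDDEN)) / np.sqrt(OBS_DIM),
            "W2": rng.normal(size=(HIDDEN, EMBED)) / np.sqrt(HIDDEN)}


def forward(net, obs):
    hidden = np.tanh(obs @ net["W1"])
    return hidden, hidden @ net["W2"]


def intrinsic_reward(predictor, target, obs):
    """Per-observation mean squared error between predictor and target."""
    _, pred = forward(predictor, obs)
    _, goal = forward(target, obs)
    return np.mean((pred - goal) ** 2, axis=1)


def train_step(predictor, target, obs, lr=0.05):
    """One gradient-descent step on the predictor; the target stays fixed."""
    hidden, pred = forward(predictor, obs)
    _, goal = forward(target, obs)
    d_pred = 2.0 * (pred - goal) / pred.size        # gradient of the mean squared error
    d_hidden = (d_pred @ predictor["W2"].T) * (1.0 - hidden ** 2)
    predictor["W2"] -= lr * hidden.T @ d_pred
    predictor["W1"] -= lr * obs.T @ d_hidden


target = init_network()      # fixed and randomly initialized, never trained
predictor = init_network()   # same architecture, different random weights

# Two clusters of toy observations: one the agent keeps visiting, one it has never seen.
visited = rng.normal(size=OBS_DIM) + 0.1 * rng.normal(size=(256, OBS_DIM))
novel = rng.normal(size=OBS_DIM) + 0.1 * rng.normal(size=(256, OBS_DIM))

before = intrinsic_reward(predictor, target, visited).mean()
for _ in range(500):
    train_step(predictor, target, visited)

after_visited = intrinsic_reward(predictor, target, visited).mean()
after_novel = intrinsic_reward(predictor, target, novel).mean()
print(f"visited states, before training: {before:.4f}")
print(f"visited states, after training:  {after_visited:.4f}")
print(f"novel states, after training:    {after_novel:.4f}")
```

Because the target is a deterministic function of the observation, the error on familiar observations can be driven down by training, while unfamiliar observations keep a larger error and therefore a larger bonus. In a full agent this bonus would be combined with the environment's reward; that part is left out of the sketch.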
RND in Action
The OpenAI team evaluated RND across a number of Atari games such as Gravitar, Montezuma’s Revenge, Pitfall!, Private Eye, Solaris, and Venture. In most cases, RND was able to achieve state-of-the-art performance and, in some cases, outperform established methods such as Proximal Policy Optimization (PPO).
The following visualization illustrates RND in an episode of Montezuma’s Revenge. Notice how the rewards are correlated with the agent’s ability to explore the environment further and find the torch in a small number of steps.
Incentivizing efficient exploration and fostering curiosity is one of the essential challenges for expanding RL into mainstream scenarios. Methods like RND are certainly a step in the right direction. An initial implementation of RND is available on GitHub.