A Learning Engine for Embodied AI

Source: Deep Learning on Medium

This essay introduces a deep learning framework for creating autonomous agents within a game simulator.

An Agent Learning a Match to Sample Task using Neurostudio

Modern game engines, with their high-fidelity representations of physics and visual environments, have emerged as viable simulators for training many kinds of embodied AI. The term embodied AI is used here to distinguish between types of machine learning algorithms. An embodied AI has a visual representation within an environment, be it real or virtual, and is optimizing something subject to that persona. A spam classifier, for instance, is not embodied AI, as it lacks any kind of visible identity within its environment. A spam classifier being used to animate an emoticon, however, could be said to be embodied AI, since the optimizer now has a physical representation within spacetime.

The tools for creating such embodied AI can be thought of as a new type of software, or what I call here the “Learning Engine”. While some examples of these already exist, most deep learning tools for creating embodied agents are still in their infancy and require significant expertise to deploy. In this essay I introduce a deep learning framework called Neurostudio, which leverages Unreal Engine’s visual scripting language, Blueprints, to create autonomous learning agents in a few simple steps that a novice can learn.


Why use a video game engine for training embodied AI in the first place? One of the enduring hurdles in deploying next-gen AI techniques such as reinforcement learning is the lack of suitable environments in which training can take place. A variety of research has noted the difficulty of using baseline reality for training embodied agents. Often the physical constraints of the AI, such as the electronic actuators of a robot, prohibit the high-volume training schedules required to make a technique like deep reinforcement learning yield good results.

One alternative is to use a simulator for training the agent. From there the agent can either continue to exist within the simulator (as in the case of a video game NPC) or its learning can be transferred out to “brick and mortar” reality, as in the case of a robot. In either event, there are several advantages to using a simulator to train an embodied agent. For one, when things go catastrophically wrong, the results are typically less onerous than when they go catastrophically wrong in physical reality. A simulation of a robot running amok is a very different beast than an 800-pound cacophony of flying arms and legs. What’s more, one may often wish the agent to remain permanently within the simulator, as in the case of video game characters.

The system I describe here combines a deep neural network with a reinforcement learning algorithm to enable self-learning embodied agents. Using this system, an agent can easily be made to perform any combination of actions involving objects in its environment simply by changing the reward function in the AI Character Controller. This is the chief advantage of reinforcement learning: instead of having to hand-craft behaviors, one specifies a behavior to be rewarded and lets the agent learn by itself the necessary steps to achieve that reward. In essence, this is how one might teach a dog to perform a trick using food rewards.
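To make the idea concrete, here is a minimal sketch of what a reward function might look like. The function name, parameters, and values are invented for illustration; they are not Neurostudio's actual API.

```python
# Hypothetical sketch: the reward function is the only piece that changes
# when we want the agent to learn a different behavior. All names here
# are illustrative, not drawn from Neurostudio.

def reward(agent_position, target_position, touched_target):
    """Reward shaping for a 'go touch the target' task."""
    if touched_target:
        return 1.0           # full reward for completing the task
    # small shaping signal: being closer to the target is better
    distance = abs(agent_position - target_position)
    return -0.01 * distance  # mild penalty proportional to distance
```

Retraining with a different function (say, one that rewards keeping distance) would produce the opposite behavior with no other code changes, which is what "specify the reward, not the behavior" means in practice.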

The same approach can also be used to train an NPC, a virtual assistant or even a robot. A variety of intentional behaviors can be acquired through this system, including pathfinding, NPC attack and defense, and many types of human behavior. State-of-the-art implementations include those used to defeat best-in-class human players at chess, Go and multiplayer strategy video games.


Reinforcement learning has emerged as a promising avenue for training embodied agents because such agents almost always exist in relation to some concept of time and action. Reinforcement learning differs from other branches of AI in that it specifically addresses learning in settings where an agent can take actions in relation to time and its environment. Time in this case could be a series of game moves or training epochs; in one way or another there is a temporal space in which things are occurring. The AI is not simply iterating over a frozen set of labeled data, as in the case of supervised learning techniques.

As soon as an agent exists in relation to time and a changing environment, complexity becomes an overriding issue. In reinforcement learning techniques such as tabular Q-learning, the algorithm must keep track of all combinations of environmental changes, actions and rewards in a table. Depending on the complexity of the environment and the actions available to the agent, this table can become astronomically large. Even a few interacting environmental elements can quickly lead to combinatorial explosion.
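The growth of that table is easy to quantify. As a hedged illustration (the variable counts are invented, not taken from any particular game), a tabular learner needs one entry per state-action pair, and the number of states multiplies with every independent environment variable:

```python
# Illustrative only: a tabular Q-learner stores one value per
# (state, action) pair. With n independent environment variables,
# each taking v possible values, the state space is v ** n.

def q_table_size(num_variables, values_per_variable, num_actions):
    num_states = values_per_variable ** num_variables
    return num_states * num_actions

# A toy world: 2 variables, 10 values each, 4 actions -> 400 entries.
small = q_table_size(2, 10, 4)

# A modestly richer world: 10 variables, 10 values each, 4 actions
# -> 40 billion entries, already impractical to enumerate.
large = q_table_size(10, 10, 4)
```

Adding just eight more tracked variables takes the table from hundreds of entries to tens of billions, which is the combinatorial explosion described above.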

Humans also use a variation of reinforcement learning to acquire many skills, and you may be tempted to ask how many environmental variables we ourselves can track before complexity becomes an issue. Visually speaking, the answer is surprisingly few: between four and eight, depending on whether the objects are moving or stationary. This makes sense given what we know about reinforcement learning and combinatorial explosion; beyond eight objects, the spatial resolution of attention would be bogged down. Four dynamic objects can combine in 24 different ways, 8 objects in 40,320 ways! Beyond that we get into numbers that are all but meaningless from a human perspective: 12 objects can combine in roughly 479 million unique ways. At any given time while playing a video game, for instance, we are typically abstracting from thousands of pixels down to just 4 or 5 different objects that we are keeping track of.
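The counts above are simply orderings of distinct objects, which a quick check confirms:

```python
import math

# n distinct objects can be ordered (combined) in n! different ways.
for n in (4, 8, 12):
    print(n, math.factorial(n))
# 4 -> 24, 8 -> 40,320, 12 -> 479,001,600 (~479 million)
```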

How then do we abstract from all those pixel combinations to get just a handful of meaningful features? In humans, evolution did all the hard work for us, and in biology this talent is called latent inhibition. In this process familiar objects lose their salience and we tune them out over time. The veil of forgetfulness is indeed a merciful one. While a computer could conceivably track more features than a human depending on its processing power, combinatorial explosion can tax even modern day supercomputers. Fortunately for computer scientists, “neural networks” can help solve such issues of complexity.

Deep neural networks are most often associated with the field of AI called “supervised learning”, which requires someone to provide a database of labeled training data for the algorithm to learn patterns from. The amazing thing about deep neural networks is that they can take noisy, non-linear, non-uniform data, like a picture of a cat, and abstract it down to the few features that are essential for categorizing it. This is how the spam classifier in your email inbox works, and how Netflix creates recommendations based on the movies you liked or disliked. Such classifiers are increasingly common in software and have received a shot in the arm thanks to deep neural networks, which are more powerful than earlier statistical methods like logistic regression and Bayesian classifiers because they excel at finding patterns buried beneath layers of complexity. Classifiers are powerful tools, but unlike reinforcement learning, they typically require labeled data to train on.

In 2015, researchers at the company DeepMind hit upon the idea of using deep neural networks to abstract from the raw screen pixels of Atari video games so that a reinforcement learning algorithm could be applied to playing them. As mentioned, deep neural networks have the ability to take very noisy and large datasets and detect patterns within them, and the screen of an Atari video game can be thought of as just such a dataset. By using the Atari console’s screen as the training input for the neural network, they reduced the complexity of all the pixel combinations to a small handful of features tied to the different moves a player could make, usually just 4 or 5 (Mnih et al., 2015). Were they simply lucky that none of the games required more than a handful of actions to win? Not at all; because these games were designed for humans, they had a limited action and environment space. An Atari video game programmed for space aliens capable of learning from 10,000 different action combinations would be another matter entirely. But in situations where most of the game features are just decoration and have no significance for the actions one should take, a deep neural network can reduce the complexity to something manageable by a reinforcement learning algorithm.

What is true of Atari is also true of board games like Go. This ancient Chinese pastime was long thought to be beyond mastery by computers due to the inherent complexity of the game. As Go experts were fond of reminding us, there are more possible board combinations in a game of Go than the number of quarks that have ever existed in the universe since the beginning of time. But as in Atari video games, many of the board positions in a Go game are not pertinent to play at any given turn. They are like that pixel in the far corner of the screen that isn’t important until it indicates that an enemy is headed your way. “Deep reinforcement learning”, that is, the combination of deep neural networks and reinforcement learning, proved just as effective at mastering Go as it did at Atari video games. In 2016 and 2017, AlphaGo, a Go-playing deep reinforcement learning system developed by DeepMind, defeated several of the world’s leading human Go players, and its successor AlphaGo Zero went on to defeat the best Go-playing artificial intelligence, including AlphaGo itself.

One of the key fallacies people fall into when thinking of a game like Go is the assumption that complex games require a complex type of learning. In fractal geometry, bewildering patterns of seemingly infinite variation can be derived from simple formulas. Evolution, which has produced myriad life forms of overwhelming complexity, is guided by an equally simple learning rule: the mistake. The same learning equation that allows for the mastery of Tic-Tac-Toe can produce mastery of a game like Go. In both games, reinforcement learning can discover the key associations that are pertinent to winning. Which isn’t to say there are not more complex ways to address learning. Deep Blue, the IBM supercomputer that defeated Garry Kasparov at chess in 1997, was a gargantuan program with thousands of hand-coded scenarios built into it by chess experts and programmers. But such complex programs, in the end, are far less robust and powerful than a simple algorithm like Q-learning. For one, they weave in the experiential bias of the humans who coded them. When the Atari deep reinforcement learning algorithm was developed at DeepMind, it discovered a way to rack up points in the game of Breakout using a trick that was previously unknown to most human players. Had it been programmed solely from human experience, it would likely never have produced such “alien” moves. The strength of reinforcement learning is that, playing by itself, it can try out millions of moves that nobody in the history of the game has ever thought to try.

Many expert commentators looked at AlphaZero, the chess-playing reinforcement learning algorithm, and saw a more advanced version of Deep Blue, and thus failed to realize that they were looking at a completely different kind of AI, one with far different implications. Because reinforcement learning mimics one of the ways humans learn, the same algorithm that can be used to master Go can be used to master cooking an omelet or folding the laundry. When you first start learning to fold the laundry, you make mistakes: the sleeves don’t line up, your creases lack precision. Through repetition, or in the words of computer science, iteration, you slowly learn the correct moves necessary to get you to the goal state, the perfectly folded shirt. In such a manner, many human activities can be “gamified” and turned into reinforcement learning problems.

Combining neural networks with reinforcement learning allows one to take an algorithm such as Q-learning and scale up the environment and action space in which the agent can learn. At the extreme, the environment in which the agent learns can be the raw pixel input of the screen. This mirrors the way mammals such as humans and dogs learn from the contents of their visual field.

The Neurostudio learning engine is based upon the deep reinforcement learning algorithm called Deep Q-learning, commonly abbreviated DQN (for Deep Q-Network). A basic understanding of its cousin, tabular Q-learning, will be helpful for understanding Deep Q-learning. At the link below you can find an introduction to tabular Q-learning in which the agent solves a simple match-to-sample puzzle task: https://www.unrealengine.com/marketplace/artificial-intelligence-q-learning

The method by which Q-learning works is sometimes called backward induction. Imagine a hiker who has lost the trail and is trying to find their way back to camp. First they randomly pick a direction to move in. Scrambling over a rock, they observe whether this action seemed to get them closer to their goal. From that new vantage point they decide how to value their previous action. Was climbing over the rock a good decision or a bad one? In effect, we are learning a guess from a guess. First one makes a guess about the future (given my new state, how easy will it be to reach camp?), and then one makes a guess about the value of the last action based upon that first guess (I got a lot closer to camp by climbing over this rock, so my last action was a good one). We commonly know this as trial-and-error learning, or associative learning. In tabular Q-learning, this takes the mathematical form

Q(state, action) = Reward(state, action) + Gamma * Max[Q(next state, all actions)]

Where Reward(state, action) is any reward gleaned by the current action and Gamma is a fixed discount rate between 0 and 1 which controls how much the agent values present rewards vs. future ones. The term Max[Q(next state, all actions)], known as the Bellman update, makes the present value of the state-action pair dependent on the best action that can be taken from the next state. This is the “look ahead” part of the equation. In such a way, a reward that one believes will happen in the future can be chained backward to the steps that got one there.

To combine Q-learning with a neural network, the neural network first predicts values corresponding to each action available to the agent given its current state. Initially these predictions are simply random noise. However, after the agent receives a reward, it can apply the Q-learning equation and use the difference between the predicted value of a state-action pair and its updated value to form an error term. This error term can then be used to train the neural network. Think of the neural network simply as a tool for minimizing prediction error, and the Q-learning equation as what supplies those prediction errors.

After many training episodes the neural network gradually improves its estimate of every state action pair the agent can encounter within the simulator. To make a strategic move the agent has only to run the network forward given its present state and take the action that is predicted to have the highest value.
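As a toy illustration of that training loop, the sketch below stands a single linear layer in for the deep network (all sizes, names, and values are invented). The Q-learning equation supplies a target, the gap from the network's prediction forms the error term, and a gradient step shrinks it:

```python
import random

# Toy DQN-style loop with a linear "network" in place of a deep one.
random.seed(0)
NUM_FEATURES, NUM_ACTIONS = 3, 2
GAMMA, LEARNING_RATE = 0.9, 0.1

# weights[a][f]: weight from state feature f to the Q-value of action a,
# initialized to small random noise (so early predictions are noise too).
weights = [[random.uniform(-0.1, 0.1) for _ in range(NUM_FEATURES)]
           for _ in range(NUM_ACTIONS)]

def predict(state):
    # Forward pass: one Q-value estimate per available action.
    return [sum(w * s for w, s in zip(weights[a], state))
            for a in range(NUM_ACTIONS)]

def train_step(state, action, reward, next_state):
    # Q-learning target: reward + gamma * best predicted next value.
    target = reward + GAMMA * max(predict(next_state))
    error = target - predict(state)[action]   # the prediction error
    # Gradient step: nudge only the taken action's weights.
    for f in range(NUM_FEATURES):
        weights[action][f] += LEARNING_RATE * error * state[f]
    return error

# Repeated updates on the same rewarded transition shrink the error.
state, next_state = [1.0, 0.0, 0.5], [0.0, 1.0, 0.5]
errors = [abs(train_step(state, 0, 1.0, next_state)) for _ in range(50)]
```

Running the network forward with `predict` and picking the highest-valued action is then exactly the "strategic move" described above; a real DQN differs mainly in using a deep network and refinements such as replay memory.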

While I have gone into some detail explaining the workings of DQN, to use Neurostudio one doesn’t necessarily need to understand the details of how learning takes place. Rather, one only needs to be clear about which elements in the environment the agent is learning from, the actions the agent can take, and the rewards it can receive. Just as one doesn’t need to understand brain science to teach a dog to do tricks, one does need to understand what capabilities the dog has, what it finds rewarding, and when to deliver those rewards to maximize learning.

That said, an understanding of deep learning principles such as overfitting can be helpful when selecting the parameters of the neural network. Training neural networks is as much an art as a science, since many parameters of the network can affect learning.

While deep reinforcement learning allows agents to learn in complex, real-world environments, they often do so more slowly than tabular Q-learning agents. It can also be more difficult to “tune” the agent’s behavior given the larger number of parameters that affect the neural network.

On the other hand, using neural networks allows agents to generalize their strategies: they will apply a learned strategy to a wide variety of environmental states that resemble the one in which they were initially trained. This allows them to act intelligently even in completely new environments. In contrast, with tabular Q-learning the learned strategy would only be invoked in the exact environment state for which the agent has a past association. This ability to generalize learning to new tasks and across new environments makes deep reinforcement learning one of the most powerful machine learning frameworks presently available and a natural starting point for a Learning Engine. I expect such Learning Engines to expand the horizons of both real and virtual agents and usher in a new era of embodied AI.