Original article can be found here (source): Artificial Intelligence on Medium
Deep-Q Networks (2 of 2): Superhuman Pong
Using Deep Learning combined with Q-learning to make the computer play Pong at superhuman levels
If you’re coming from my previous article, welcome! The basics you learned there are going to prove crucial to get to the juicy rewards today: Deep Q Learning. If you haven’t read my article on basic Q-learning, I highly suggest you do so, because it will give you all the foundation you need.
Ready? Awesome. Let’s get straight to it.
What’s wrong with Q-learning?
Q-learning is really good at learning what it has to do. With the optimal Q-table, every possible state-action pair is accounted for; the agent knows exactly what to do in any given scenario. However, in a more complex game like Pong, the agent might never actually experience every state. We know what a Q-learning agent does in an unseen scenario — it takes a random action.
But what if that scenario was similar to another situation that the agent had experienced? It wouldn’t matter, since the Q-table is absolute: it accounts for only the scenarios it has seen. This makes Q-learning unable to handle more complex environments with complex goals.
How can neural networks help with that? Aren’t they absolute too?
Deep neural nets are, in essence, function approximators. They take an input, bring it to a boil, mix well, throw it in the oven for 20 minutes and spit out a set of freshly baked outputs (how it works isn’t super important for now).
We can use a neural network to approximate our Q function to make it more robust and powerful. The key to all this is that neural networks can make inferences on data they’ve never seen, so the agent can take a reasonably good action in unfamiliar situations! They can also handle more complex input; for this project, the only input the network gets is the image of each frame of the game.
Simple is beautiful, but the devil’s in the details
The concept behind the DQN is simple and elegant: take the concept of Q-values and just slap a neural net on it!
That seems pretty easy! Why is this still new?
In actuality, this problem is much more difficult than it looks, and I’ll get into why a little later. Let’s start with the architecture of the DQN.
For this project I decided to use PyTorch (instead of TensorFlow+Keras, which I used previously). I have 2 convolution layers and 2 fully-connected (or dense) layers. For more information about convolutions and CNNs, check out another article of mine here!
The output layer has n_actions outputs to give each action its own Q-value. Our RL “environment” uses OpenAI’s gym, along with wrappers provided in OpenAI’s baselines repo and AtariPreprocessing. Essentially these wrappers make it easier for our agent to interact with the environment.
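The architecture described above might look something like this in PyTorch. This is a sketch, not the article’s exact code: the filter counts (16/32) and hidden size (256) are my assumptions, chosen in the spirit of the original DQN paper.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """2 convolution layers + 2 fully-connected layers, one Q-value per action."""
    def __init__(self, n_actions, in_channels=4):  # 4 stacked 84x84 frames
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4),  # 84x84 -> 20x20
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),           # 20x20 -> 9x9
            nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),
            nn.ReLU(),
            nn.Linear(256, n_actions),  # output layer: one Q-value per action
        )

    def forward(self, x):
        return self.fc(self.conv(x))

net = DQN(n_actions=6)  # Pong's full gym action space has 6 actions
out = net(torch.zeros(1, 4, 84, 84))
print(out.shape)  # torch.Size([1, 6])
```

A batch of stacked frames goes in, and a vector of Q-values (one per action) comes out.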
The Atari environments output a 210×160 RGB image; this is a huge image for our agent. Most of the information isn’t even useful, so I down-sampled the image. In my code, I used AtariPreprocessing instead of reinventing the wheel, but I also provided the baselines implementation of some of those wrappers to better explain what they do.
AtariPreprocessing scales all the images down to 84×84 and makes them grayscale.
If I were to give you that single image of Pong, what information would you have? You would have:
- The score
- The position of the enemy
- Your position
- The position of the ball
But if I were to ask which direction the ball was moving, you wouldn’t be able to tell me. So instead of passing each frame individually to the agent, AtariPreprocessing instead stacks frames together to give the agent more information. For example:
- With 2 frames, I could derive the velocity of the ball
- With 3 frames, I could derive its acceleration (if any)
- With 4 frames, I could…well…we just add the 4th frame for security:)
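The frame-stacking idea itself is simple enough to sketch with a deque and NumPy (AtariPreprocessing and the baselines wrappers handle this for you; the class below is just an illustration):

```python
import collections
import numpy as np

class FrameStack:
    """Keeps the last k preprocessed frames and returns them as one observation."""
    def __init__(self, k=4):
        self.k = k
        self.frames = collections.deque(maxlen=k)

    def reset(self, first_frame):
        # Fill the stack with copies of the first frame so the shape is stable.
        for _ in range(self.k):
            self.frames.append(first_frame)
        return self.observation()

    def step(self, frame):
        self.frames.append(frame)  # the oldest frame falls off automatically
        return self.observation()

    def observation(self):
        return np.stack(self.frames, axis=0)  # shape: (k, 84, 84)

stack = FrameStack(k=4)
obs = stack.reset(np.zeros((84, 84), dtype=np.uint8))
print(obs.shape)  # (4, 84, 84)
```

The agent then sees a (4, 84, 84) block instead of a single frame, which is where the velocity and acceleration information comes from.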
Not only does this give the agent more information, it essentially quadruples our memory efficiency. That’s huge for a model that’s going to have to train and backpropagate at every step. Other optimizations include:
- NoopReset — take a random amount of no-op (no action) actions when the environment is reset
- FireOnResetEnv — this is critical. I lost about 3 hours of training time because I forgot to add this one. Some Atari games (like Pong) require the player to hit the “fire” button every time the game is reset. The agent can sometimes learn this on its own, but not having this wrapper greatly hinders its learning.
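The fire-on-reset logic is just a wrapper that presses FIRE after every reset. Here is a toy illustration of the idea; TinyEnv is a made-up stand-in for an Atari environment, there only to show the wrapper’s behavior:

```python
class TinyEnv:
    """Made-up stand-in for an Atari env that needs FIRE pressed to start."""
    FIRE = 1
    def __init__(self):
        self.started = False
    def reset(self):
        self.started = False
        return 0  # initial observation
    def step(self, action):
        if action == self.FIRE:
            self.started = True
        return 0, 0.0, False, {}  # obs, reward, done, info

class FireResetEnv:
    """Wrapper that presses FIRE immediately after every reset."""
    def __init__(self, env):
        self.env = env
    def reset(self):
        self.env.reset()
        obs, _, _, _ = self.env.step(self.env.FIRE)  # press fire on reset
        return obs
    def step(self, action):
        return self.env.step(action)

env = FireResetEnv(TinyEnv())
env.reset()
print(env.env.started)  # True
```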
Let’s take a look at the agent:
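A sketch of what such an agent might look like; the names Experience and play_step are my reconstruction from the description, not the article’s exact code:

```python
import collections
import random

# One transition: what the agent saw, did, and got back.
Experience = collections.namedtuple(
    "Experience", ["state", "action", "reward", "done", "new_state"])

class Agent:
    def __init__(self, env, buffer, n_actions, epsilon=1.0):
        self.env = env
        self.buffer = buffer          # the experience replay buffer
        self.n_actions = n_actions
        self.epsilon = epsilon        # exploration rate
        self.state = env.reset()

    def play_step(self, q_values_fn):
        # Epsilon-greedy: explore with probability epsilon, else act greedily.
        if random.random() < self.epsilon:
            action = random.randrange(self.n_actions)
        else:
            action = max(range(self.n_actions),
                         key=lambda a: q_values_fn(self.state)[a])
        new_state, reward, done, _ = self.env.step(action)
        self.buffer.append(
            Experience(self.state, action, reward, done, new_state))
        self.state = self.env.reset() if done else new_state
        return reward, done
```

Every step gets recorded as an Experience and pushed into the buffer, which is the part that matters below.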
Pretty simple, right? All of this is fairly similar (other than the OOP) to basic Q-learning, except for that Experience thing. Let’s leave the code behind (not for long, though) and analyze.
Neural Networks don’t play well with Q-Learning
It turns out DQNs are more than just slapping a neural network on the Q function. In a regular neural network, such as an image classifier (the classic dogs v cats), the data is not correlated. The model might get fed a picture of a dog, a picture of a cat and another unrelated picture of a different cat.
In contrast, the data that the DQN gets is highly correlated. If the agent’s playing Mario and is moving to the right, it’s probably going to keep moving to the right. The DQN might then think “Everything’s great and I’m moving to the right, so I should keep doing that!” The moving-right data is going to be over-represented in the DQN’s learning (this is a common theme), and then it’s going to fall right in the pit.
Another difficulty in DQN is the idea of non-stationary distribution. In normal classification tasks (like dogs v cats) the training data is fixed and its distribution is immutable. If I start out with 25,000 images of cats and 25,008 images of dogs, those numbers aren’t going to change as the network trains. In contrast, the DQN’s data distribution changes as it trains. Take Pac-Man as an example. If the policy dictates that the best action is to move down and left, the agent’s experiences will be dominated by those in the bottom-left of the map. This again leads to the over-representation or over-emphasis on a certain subset of data.
The idea of the Experience Replay fixes this. Every time the agent “experiences” something, we store it in the replay_memory.
In the beginning, when the agent hasn’t experienced anything yet, we let it take a bunch of random actions to populate its memory. When it comes time to train our neural net, we randomly sample from this replay_memory instead of using only what immediately happened. This de-correlates the data in time, so the agent gets a healthy distribution of experiences.
Experience Replay is what allows Deep-Q-Learning to be possible.
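A minimal replay buffer is just a bounded deque plus uniform random sampling (a sketch of the idea, assuming a deque-backed design):

```python
import collections
import random

class ReplayBuffer:
    def __init__(self, capacity):
        # Old experiences are discarded automatically once capacity is reached.
        self.buffer = collections.deque(maxlen=capacity)

    def append(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        # Uniform random sampling de-correlates the batch in time.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=1000)
for i in range(100):
    buf.append(i)
batch = buf.sample(32)
print(len(batch))  # 32
```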
The effects of non-stationary distribution are mitigated by what is called a DDQN, or Double Deep-Q-Network. The idea is fairly simple: have 2 models. One model (net) will be used to select actions, and the other (target_net) will be used to get Q-values for that action. The first model is going to get trained every step, but for stability target_net isn’t going to get trained; it’s just going to be used in calculating future Q-values. Every so often we take the weights from net and update target_net with the same weights.
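In PyTorch that periodic weight sync is a one-liner. A sketch, assuming net and target_net are two instances of the same nn.Module subclass (nn.Linear stands in for the real DQN here, and SYNC_INTERVAL is an assumed hyperparameter):

```python
import torch
import torch.nn as nn

net = nn.Linear(4, 2)         # stand-in for the trained online network
target_net = nn.Linear(4, 2)  # stand-in for the frozen copy

# Every SYNC_INTERVAL training steps, copy the online weights into the
# frozen copy; between syncs, target_net never changes.
target_net.load_state_dict(net.state_dict())
```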
At a loss for words
How does the DQN calculate loss?
Remember that Q function from classic Q-learning? We calculate the loss using something very similar.
The classic one looked like this:
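In LaTeX form, the classic Q-learning update rule is:

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```

where α is the learning rate and γ the discount factor; the bracketed error term is exactly what the DQN turns into a loss.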
This is where the DDQN shows its usefulness: if the Q-Network calculated loss against itself, it would be very unstable and at high risk of overfitting. By using the target network as an approximation of the “ground truth,” the agent becomes much more robust.
If, for example, the Q-Network chose an action, say up, with a probability of 60%, and the target network said down, the Q-Network needs to not only update all the weights for down, but also tweak all the other weights to lower the probabilities of all the other actions. If the target network said up as well, then we would want to adjust all the weights in the Q-Network to raise that probability. This loss is backpropagated through the network whenever the model is trained. Here’s what it looks like in code:
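A hedged reconstruction of the DQN loss in PyTorch: the online net’s Q-values for the actions actually taken are pushed toward r + γ·max Q from the frozen target net. Variable names here are my assumptions, not the original code.

```python
import torch
import torch.nn as nn

GAMMA = 0.99  # discount factor (assumed value)

def calc_loss(batch, net, target_net):
    states, actions, rewards, dones, next_states = batch

    # Q(s, a) from the online network, only for the actions that were taken
    q_values = net(states).gather(1, actions.unsqueeze(-1)).squeeze(-1)

    # max_a' Q(s', a') from the target network; no gradient flows through it
    with torch.no_grad():
        next_q = target_net(next_states).max(1)[0]
        next_q[dones] = 0.0  # terminal states have no future value

    expected_q = rewards + GAMMA * next_q
    return nn.MSELoss()(q_values, expected_q)
```

Note the torch.no_grad() block: the target network contributes to the loss but is never updated by backpropagation, which is exactly the stability trick described above.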
Achieving gamer status
Even though Deep-Q-Learning might seem foreign, it’s similar to Q-learning in one key way — it’s model-free.
Model-free? But we have neural networks? What about those?
Model-free in this context means the agent learns directly from experience rather than from a hand-built model of the game’s rules: I can take this code and have it train and play another game without changing a single line.
This is what truly makes AlphaZero so special. Its model-free approach allowed it to conquer Go, chess, shogi and more. Model-free RL is like giving an emperor a top-class army instead of winning them a war: the army can adapt to new situations and excel.
Adaptation and learning have long been staples of the human psyche. Deep-Q-Networks are showing creativity, ingenuity and elegance that even humans cannot replicate. While still largely confined to the field of games, Deep RL is poised to take AI into situations where no one tells the agent what to do: it just has to do it.
- Deep-Q-Learning seems simple!
- Using neural networks as function approximators for the Q function of classic Q-learning, agents can make inferences in unseen situations
- DQNs take game states as input and give Q-values as output
- Deep-Q-Learning is hard.
- Because of high data correlation and dynamic data distribution, neural nets and Q-learning do not agree >:(
- Deep-Q-Learning gets harder, then easier.
- By implementing Experience Replay and DDQN architecture, neural networks and Q-learning are finally friends again!
- Experience Replay stores a certain amount of the agent’s experiences and is randomly sampled to provide de-correlated and well-distributed batches.
- DDQN uses 2 neural networks — one for taking actions and the other for calculating future Q-values.
- DQNs represent a massive step forward in AI, and because of their model-free nature they can be applied in many different scenarios.
Before you go
Thank you so much for reading this article! If you want to reach out to me, you can contact me at email@example.com. Stay safe!