Deep Reinforcement Learning for Ping Pong

Original article can be found here (source): Deep Learning on Medium

· In this post you will implement an AI program that can learn from any environment by giving rewards or punishment according to its actions respectively. Basically, we are encouraging the AI to do desired thing by giving it rewards if it acts in a desired way else punishment, how are we doing it? Follow along..!

· For the demonstration purpose we have used the game Ping Pong, provided by OpenAI’s library, as the environment for our AI. The AI gets control of one of the sliders only (green slider in our case). All the programs are done in python(Of course 😊😉).

· We will use a simple Neural Network with an input layer, a hidden layer and an output layer. It takes in input as image of current state and outputs the probability of moving the slider Up or Down and decision is made according to highest probability. This is called a policy network and has proven powerful in our case.

· By the end of this post, you’ll be able to do the following:

1. Write a Neural Network from scratch.

2. Implement a Policy Gradient with Reinforcement Learning.

3. Build an AI for Pong that can beat the computer that’s coded algorithmically to follow the ball with a speed limit for maximum speed of slider.

4. Use OpenAI gym.


The code and the idea are all tightly based on Andrej Karpathy’s blog post.

Prerequisites and Background Reading

To follow along, you’ll need to know the following:

· Basic Python

· Neural Network design and backpropagation

· Calculus and Linear Algebra

Great! Let’s get started.

First of all, what are Neural Networks?

To put it down simply, Neural Networks are universal function approximators that can approximate any function, meaning if we know few inputs and their corresponding outputs, we can map the inputs to the outputs and find approximately what the function was (i.e. what is the best function that describes given data). So, if the neural network is given any new input X, it can predict new output Y using this approximated function. For our case we have to find a function that takes in input an image i.e. array of float values and spits out output as correct probability of taking the slider Up (Conventionally, we have taken output as probability of slider going up, if you want the output to be probability of going down, you can do it by making changes in corresponding reward function).


A single neuron (variable) does one simple task takes in the input from previous neurons, evaluates their weighted the sum and applies a sigmoid function to the sum. A layer is combination of such neurons, that take input from previous layer, and passes the output to the next layer. Together, knowing the correct weights (for finding the weighted mean) these neurons can approximate any function with mind blowing accuracy.

The correct weights to the neural networks can be found out by:

1) Gradient Descent

2) Genetic Algorithm

We will use gradient descent in our case.

Structure of Neural Network :

The Neural Network we trained contains 3 layers input layer, one hidden layer and an output layer.

Training Time!!!:

Initially random weights are assigned to the network (i.e. it’s a random function). The input is given the image array of size 210 x 160 x 3 of the current state. The input image is first pre-processed to a grayscale image the colours of the ball paddles and the balls are white and the background becomes dark and then its cropped to only the middle portion of size 80 x 80. Since a single frame cannot show motion of the ball and paddles, we pass in difference of 2 consecutive frames to the model. This array is flattened (i.e. rows of the array are stacked on top of each other to form a vector) hence we get input vector of 784 dimension from input array of size 28×28 and forms the input layer of the network. The hidden layer consists of 200 neurons and the final the output layer consists of a single neuron, that gives the probability of moving the slider Up.

Initially, weights are going to be random and hence the model will output random probability of moving UP, let’s say it gives 70% probability of going UP.

Hence, we move UP. The problem is here we don’t whether moving UP is the right thing to do at that moment. For all the coming next frames, we assume that probabilities given by the model were right until it eventually losses or wins (A player wins a game when he scores 20 points earlier than other player). If finally, it wins the game, we can say that all the moves that it has been taking since start were right and we encourage such moves in future by giving it a positive reward and updating weights positively according to the gradient descent. If its losses a game, we can say that most of the moves that it did for this case were wrong and we discourage such moves in future by giving it a negative reward and updating weights negatively by the gradient descent. We iteratively apply the above process for thousands of gameplays exploring different possible scenarios until eventually model learns the correct weights to be used in order to maximise the

rewards every time thereby improving the model after every gameplay and coming closer to the ideal weights for the given game.

The algorithm we just used is known as policy gradient. It is one of the important algorithms of reinforcement learning.

And your pong playing AI bot is ready to take over the world.!

NOTE: Here, we are considering only one output(i.e. probability of taking slider UP and 1-p(UP) as probability of going down this is less efficient as at any instant the Neural Network will either go up or down but it wot stay at a place even if it’s the right thing to do

Lets gather some insight…

First problem we will face in this is to decide the hyperparameters such as learning rate, decay rate, no. of neurons for the Hidden layer etc. We didn’t have enough computing power in order to try and run (Trial and Error) the model for different values of hyperparameters. Our current model uses values similar to Andrej’s Blog.

So Finally…

After almost 10,000 games our Neural Network/ AI Agent/ Game Bot shows a pretty good results beating the hardcoded AI(red) almost every game.

If we want to see what the model is actually learning, we take a look at final weights of the bot in first layer unflatten that and visualize.

This is what actually our Network Learnt

The black and white trails are due to the motion of the ball as we are taking difference between 2 consecutive frames. As we can see model learns few of the common states of the game and reacts accordingly to maximize the reward in every game.

But Wait. There is more to it than what we have seen above. Like Documented Code for the above Model is available here:





4) Awesome Blog Post by Andrej Karpathy:

Next challenge:

Try making a Pong game in which both players are controlled by your bot, this way your bots will compete against each other and try to better than each other. Such bots if applied for other games can defeat any human at any time.

Gym doesn’t provide 2 player controls so you can take the code for game from here:

Reference Code for the challenge:

Check out my GitHub repositories for much such ML codes:

Happy Hacking!!!