
Source: Deep Learning on Medium

Fundamentals of Reinforcement Learning: Automating Pong Using a Policy Model, an Implementation in Keras


Reinforcement learning is currently one of the hottest topics within AI, with numerous publicized achievements in game-based systems. Whether in traditional board games such as Go or Chess, or in complex strategy games such as StarCraft, agents trained using reinforcement learning have rapidly matched or exceeded human-level performance. The field stands distinct from traditional supervised or unsupervised learning, relying on experimental observation to modify an agent’s behaviour so that it operates optimally within an environment. This constant self-optimization can be applied to various industries, with large economic implications, and has brought deep learning a step closer to achieving true artificial general intelligence.

Not even QWOP is safe from deep learning.

In our last article, we covered the theoretical aspects behind Markov Decision Processes, an approach that allows a system to make decisions by estimating the value of available states and actions. Recall that a state satisfies the Markov property if it captures all the relevant information from the history of previous agent-environment interactions, so that previous states need not be consulted. In other words, each state (or action) encapsulates its current and future value within itself.
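
As a reminder, the Markov property is usually written as below, where $S_t$ denotes the state at time step $t$. The notation is the standard textbook one and is an assumption here, as it is not taken from the previous article:

```latex
\mathbb{P}\left[S_{t+1} \mid S_t\right] = \mathbb{P}\left[S_{t+1} \mid S_1, S_2, \ldots, S_t\right]
```

Read aloud: the next state depends on the current state only, and knowing the earlier states adds nothing.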

To better illustrate these concepts, let’s put theory into practice and build an RL-powered instance of Atari’s classic video game Pong. Pong can be viewed as a classic reinforcement learning problem, as we have an agent within a fully-observable environment, executing actions that yield differing rewards, and using the magnitude of collected reward to self-optimize. Furthermore, Pong serves as a good example of an MDP, as every state (defined here as the position of the agents and the ball within the gamespace) can be evaluated without reference to the states that preceded it. Note that we won’t be explaining the theory behind our approach in this article. The reader is kindly directed to our previous article for a detailed treatment of the topic.

While it would be possible to code a simple Pong simulation from scratch, we can bypass such busywork by using OpenAI’s helpful gym environment, which allows for the simulation of Atari and other publishers’ game libraries directly. The library provides an interface to the methods and variables within the games, in addition to standardized benchmark AIs for testing purposes.


How do we go about creating our own policy network capable of playing Pong? Where do we obtain our data? Within the context of reinforcement learning, our variables are self-generated through observations, rather than given to us as a set of labels. To better visualize this, let’s take a look at our implementation pipeline:

  1. We feed each input frame into a freshly compiled, untrained network, and use the predicted binary probability to choose an action to take in the game. We continue doing this for the duration of the episode, accumulating the X, Y, and reward data into separate arrays.
  2. After the episode is finished, we train our network using the X, Y, and reward data, so that the behaviour our model learns is motivated by reward.
  3. With enough training, we aim to create a model capable of maximizing reward.

Let’s continue on by defining our variables:

  • Our x-variable represents the difference between two frames, which gives the state of the game to the network as an input. This is generated by the game.
  • Our y-variable represents the action our network predicted for that input. Essentially, this helps ensure that our network responds in a repeatable way to a given state. This is generated by the predictions of our network.
  • Reward is the third variable to consider for our model: it assists in judging the suitability of different actions, punishing or rewarding our agents for undesired and desired behavior. This is generated by the game.

We will be implementing our policy model in Python using the Keras and OpenAI gym libraries, executed within a Google Colaboratory instance. Naturally, all of our code can be found within the GradientCrescent Github. Our implementation is based on the work of Trazzi, den Bakken, and community resources. Particular thanks have to be given to Conygham et al., whose guides on real-time visualization of the OpenAI gym environment have been crucial.

We begin by installing the necessary packages to access the OpenAI gym as well as visualization packages.

!pip install gym pyvirtualdisplay > /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1
!apt-get update > /dev/null 2>&1
!apt-get install cmake > /dev/null 2>&1
!pip install --upgrade setuptools 2>&1
!pip install ez_setup > /dev/null 2>&1
!pip install gym[atari] > /dev/null 2>&1

Let’s begin by setting up our OpenAI gym environment for Pong, a self-contained instance of the game that facilitates interfacing with the permissible actions within the game. There are three actions an agent (player) can take within the Pong environment, each assigned an integer: remaining stationary (0), vertical translation up (2), and vertical translation down (3). For simplicity, we’ll be considering only the latter two actions in our model.
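
To make the two-action setup concrete, here is a minimal sketch of how a single probability can be turned into one of the two moves. The helper name `sample_action` and the seeded generator are assumptions made for illustration. Only the action codes 2 and 3 come from the text above.

```python
import random

# Action codes as described in the text: 2 moves the paddle up, 3 moves it down.
UP_ACTION = 2
DOWN_ACTION = 3

def sample_action(prob_up, rng=random):
    """Pick UP with probability prob_up, otherwise DOWN (hypothetical helper)."""
    return UP_ACTION if rng.random() < prob_up else DOWN_ACTION

rng = random.Random(0)  # seeded so the run is repeatable
actions = [sample_action(0.8, rng) for _ in range(10_000)]
share_up = actions.count(UP_ACTION) / len(actions)
print(f"share of UP actions: {share_up:.2f}")
```

Sampling, rather than always taking the more likely action, keeps the agent exploring moves it currently considers unlikely.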

We’ll also declare some parameters here. The gamma parameter discounts the value of rewards across time, essentially determining how many moves before a win or a loss still receive credit or blame for the outcome. We’ll be discussing this in more detail shortly.
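
To get a feel for what a gamma of 0.99 means, the short sketch below prints the weight given to a reward that arrives a number of steps after an action. The helper `steps_until_below` is a hypothetical name used only for this illustration.

```python
gamma = 0.99  # same value the article uses

# Weight given to a reward that arrives n steps after an action.
for n in (1, 10, 50, 100, 500):
    print(f"{n:>3} steps back: weight {gamma ** n:.3f}")

def steps_until_below(gamma, threshold):
    """Count how many steps it takes for gamma**n to drop below threshold."""
    n, weight = 0, 1.0
    while weight >= threshold:
        weight *= gamma
        n += 1
    return n

half_life = steps_until_below(gamma, 0.5)
print("steps until the weight halves:", half_life)
```

With gamma set to 0.99, the weight halves after about 69 steps, so actions taken a few hundred frames before a point is scored receive very little credit or blame.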

import numpy as np
import gym

# gym initialization
env = gym.make("Pong-v0")
observation = env.reset()
prev_input = None

# Declaring the two actions that can happen in Pong for an agent: move up or move down.
# 0 means staying still. Note that these codes are pre-defined by the package.
UP_ACTION = 2
DOWN_ACTION = 3

# Hyperparameters. Gamma controls how strongly future rewards are discounted
gamma = 0.99

# initialization of variables used in the main loop
x_train, y_train, rewards = [], [], []
reward_sum = 0
episode_nb = 0

Let’s inspect a frame of Pong, the basic unit of data within the gamespace, using the pyplot library. In this mode, the game’s agents run on OpenAI’s own Pong AIs; we’ll be replacing the rightmost agent with our own policy model.

# Let's take a look at the game in action.
import matplotlib.pyplot as plt

env = gym.make("Pong-v0")  # environment info
observation = env.reset()

# The ball is released after 20 frames
for i in range(22):
    if i > 20:
        # show the frame once the ball is in play
        plt.imshow(observation)
        plt.show()
    # get the next observation
    observation, _, _, _ = env.step(1)

Running this should yield you a frame of gameplay:

Frame capture of the state of the OpenAI environment at the moment the ball is released.

If desired, we could continue plotting more frames to observe the flow of the game by altering the range of the loop, but we’ll skip that for now. However, note that this frame contains too many details irrelevant to the actual gameplay itself: for example, the tracking of the rewards is done internally by the OpenAI gym environment, so the on-screen score is not needed. We’ll preprocess our frame by cropping it and reducing it to a single channel, before flattening the output into a 1-dimensional vector for our neural network.

def prepro(I):
    """ prepro 210x160x3 frame into 6400 (80x80) 1D float vector """
    I = I[35:195]  # crop
    I = I[::2, ::2, 0]  # downsample by factor of 2
    I[I == 144] = 0  # erase background (background type 1)
    I[I == 109] = 0  # erase background (background type 2)
    I[I != 0] = 1  # everything else (paddles, ball) just set to 1
    return I.astype(float).ravel()  # np.float was removed in recent NumPy releases

# Show preprocessed
obs_preprocessed = prepro(observation).reshape(80, 80)
plt.imshow(obs_preprocessed, cmap='gray')
plt.show()

Preprocessed version of the previous frame
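
The 80x80 figure follows directly from the crop and the downsampling. The sketch below checks the arithmetic on a synthetic all-zero array standing in for a real frame, so it runs without the gym environment. The variable names are chosen for illustration only.

```python
import numpy as np

# Synthetic stand-in for an Atari frame: 210 rows, 160 columns, 3 colour channels.
frame = np.zeros((210, 160, 3), dtype=np.uint8)

cropped = frame[35:195]             # keep rows 35..194 -> 160 rows
downsampled = cropped[::2, ::2, 0]  # every second pixel, first channel only
flat = downsampled.astype(float).ravel()

print(cropped.shape, downsampled.shape, flat.shape)
```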

Note that we won’t be using individual frames as our inputs, but rather the difference between two frames, obtainable via a simple subtraction. This is to capture the translation and velocity of our actors within our environment.
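
A toy example may help show why the subtraction carries motion information. The one-row "frames" and the helper name `frame_difference` below are assumptions made for illustration, not part of the original code.

```python
import numpy as np

def frame_difference(cur, prev):
    """Return cur - prev, or zeros for the very first frame (hypothetical helper)."""
    if prev is None:
        return np.zeros_like(cur)
    return cur - prev

# Toy 1x5 "frames": a single ball pixel moving one step to the right.
frame_a = np.array([0.0, 1.0, 0.0, 0.0, 0.0])
frame_b = np.array([0.0, 0.0, 1.0, 0.0, 0.0])

diff = frame_difference(frame_b, frame_a)
print(diff)  # -1 where the ball was, +1 where it is now
```

A single frame only says where the ball is, whereas the signed difference also says where it came from.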

Earlier, we mentioned how a measure of reward is needed to judge the performance of our agent. But how do we define reward? Simply noting down the reward our agent earns at a specific frame of the game would yield an array similar to the following:

[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -1.0 …]

This is not very useful, as we’ve no idea about the significance of the actions preceding the reward, and the number of actions preceding a reward may vary. To address this, we can introduce a discount function, which utilizes a decay rate (gamma, defined earlier) to distribute the normalized earned reward across a number of preceding frames:

def discount_rewards(r, gamma):
    """ take 1D float array of rewards and compute discounted reward """
    r = np.array(r, dtype=float)
    discounted_r = np.zeros_like(r)
    running_add = 0
    # walk backwards so that each frame accumulates the rewards that follow it
    for t in reversed(range(0, r.size)):
        if r[t] != 0: running_add = 0  # if the game ended (in Pong), reset
        running_add = running_add * gamma + r[t]
        discounted_r[t] = running_add
    discounted_r -= np.mean(discounted_r)  # normalizing the result
    discounted_r /= np.std(discounted_r)   # idem using standard deviation
    return discounted_r

This results in the reward (or punishment, in this case) being spread across a number of frames, allowing us to judge the appropriateness of our actions:

[-1.22 -1.23 -1.24 -1.25 -1.26 -1.27 -1.28 -1.29 -1.3 -1.31 -1.32 -1.33 -1.34 -1.36 -1.37 -1.38 -1.39 …]
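
The normalized values above hide the geometric decay somewhat, so here is a small worked example without the normalization step. The function name `discounted_returns` and the gamma of 0.5 are assumptions, the latter chosen only to make the arithmetic easy to follow by hand.

```python
def discounted_returns(rewards, gamma):
    """Discounted reward for each step, without the normalization step."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        if rewards[t] != 0:
            running = 0.0  # a point was scored: start a new rally
        running = running * gamma + rewards[t]
        out[t] = running
    return out

# gamma = 0.5 is chosen only to make the arithmetic easy to follow.
example = discounted_returns([0, 0, 0, -1, 0, 0, 1], gamma=0.5)
print(example)
# [-0.125, -0.25, -0.5, -1.0, 0.25, 0.5, 1.0]
```

Each frame receives the reward that ends its rally, halved once for every step separating the two. The reset keeps the point won at the end from leaking into the rally that was lost before it.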

We will use these rewards as sample weights when training our network, encouraging actions that led to reward and discouraging those that led to punishment.
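
To see why a signed weight can both encourage and discourage an action, consider a single sample under a weighted binary cross-entropy loss. This is a simplified sketch that assumes the per-sample loss is simply multiplied by its weight and ignores how a framework averages over the batch. The function names are hypothetical.

```python
import math

def weighted_bce(y, p, weight):
    """Binary cross-entropy for one sample, scaled by a sample weight."""
    return -weight * (y * math.log(p) + (1 - y) * math.log(1 - p))

def grad_wrt_p(y, p, weight):
    """Derivative of the weighted loss with respect to the predicted probability p."""
    return -weight * (y / p - (1 - y) / (1 - p))

# The network output p = 0.6 for UP, and UP (label y = 1) was the action taken.
for weight in (+1.0, -1.0):
    g = grad_wrt_p(1, 0.6, weight)
    direction = "raise" if g < 0 else "lower"
    print(f"weight {weight:+.1f}: gradient descent will {direction} p")
```

A positive weight makes the chosen action more likely in the same state, while a negative weight makes it less likely, which is the behaviour the discounted rewards are meant to produce.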

With our variables identified, let’s define our model, a two-layer densely-connected network. Convolutional models are also possible (and feature far fewer parameters), but as our images are pixel-based, they can be represented by 1D arrays and are hence compatible with a fully-connected neural network.

# import necessary modules from keras
from keras.layers import Dense
from keras.layers import Reshape
from keras.layers import Conv2D
from keras.layers import Flatten
from keras.layers import InputLayer
from keras.models import Sequential
import keras
from keras.optimizers import Adam

# The 80 * 80 input dimension comes from the pre-processing of the raw pixels made by Karpathy
# (the only important pixels are the ball and the paddles).
# Input here represents the difference in pixels between one frame and another,
# giving you the direction of the agents and the ball.
model = Sequential()

# hidden layer takes a pre-processed frame as input, and has 200 units.
# Simple architecture: one hidden layer of 200 units, one output unit
model.add(Dense(units=200, input_dim=80*80, activation='relu', kernel_initializer='glorot_uniform'))

# output layer: we use a sigmoid here to get a value between 0 and 1, the probability of ACTION UP
model.add(Dense(units=1, activation='sigmoid', kernel_initializer='RandomNormal'))

# compile the model using traditional Machine Learning losses and optimizers
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
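
As a rough check on the size of this network, the trainable parameters of the two dense layers can be counted by hand. The helper `dense_params` is a hypothetical name. Calling `model.summary()` in Keras should report the same totals.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully-connected layer."""
    return n_inputs * n_units + n_units

hidden = dense_params(80 * 80, 200)  # 6400 inputs -> 200 units
output = dense_params(200, 1)        # 200 units -> 1 sigmoid output
total = hidden + output
print(hidden, output, total)
# 1280200 201 1280401
```

Nearly all of the roughly 1.28 million parameters sit in the first layer, which is why a convolutional front end would be so much smaller.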

With the model compiled, let’s begin training our network. We’ll start with data collection, first preparing our inputs via the aforementioned preprocessing function. Next, we feed our input into our current network and predict an action, representing translation up or down. We then append our x and y variables onto our datasets, in preparation for training. Our actions are also fed into the Pong environment, from which a reward value is obtained and stored.

history = []
observation = env.reset()
prev_input = None

# main training loop
while True:
    # preprocess the observation; the network input is the difference between two frames
    cur_input = prepro(observation)
    x = cur_input - prev_input if prev_input is not None else np.zeros(80 * 80)
    prev_input = cur_input

    # forward the policy network and sample action according to the probability distribution
    proba = model.predict(np.expand_dims(x, axis=1).T)
    action = UP_ACTION if np.random.uniform() < proba else DOWN_ACTION
    y = 1 if action == UP_ACTION else 0  # 0 and 1 are our labels

    # log the input and label to train later
    x_train.append(x)
    y_train.append(y)

    # do one step in our environment; the reward is returned by the OpenAI gym environment
    observation, reward, done, info = env.step(action)
    rewards.append(reward)
    reward_sum += reward

Once an episode is finished, we can report the total reward acquired, and then fit our model on the training data to generate an association between the states observed and the actions taken. As mentioned, our training samples are weighted by their relative contribution to our rewards:

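    # (still inside the main training loop)
    if done:
        history.append(reward_sum)
        print('At the end of episode', episode_nb, 'the total reward was :', reward_sum)

        # stopping condition (the body of this check is missing from the source; break is assumed)
        if episode_nb >= 3000 and reward_sum >= -12:
            break

        # increment episode number
        episode_nb += 1

        # training: fit on the episode's inputs and labels, weighted by the discounted rewards
        model.fit(x=np.vstack(x_train), y=np.vstack(y_train), verbose=1,
                  sample_weight=discount_rewards(rewards, gamma))

        # Reinitialization
        x_train, y_train, rewards = [], [], []
        observation = env.reset()
        reward_sum = 0
        prev_input = None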

We can view the evolution of our reward distribution across game episodes easily through the history variable.
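
One possible way to inspect the history is sketched below. It assumes `history` is a plain list holding one total reward per episode. The helpers `moving_average` and `plot_history` are hypothetical names, and the 50-episode window is an arbitrary choice.

```python
def moving_average(values, window):
    """Average of each run of `window` consecutive values (hypothetical helper)."""
    if window <= 0 or window > len(values):
        return []
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

def plot_history(history, window=50):
    """Plot raw episode rewards and their moving average."""
    import matplotlib.pyplot as plt  # imported here so moving_average has no dependencies
    plt.plot(history, alpha=0.3, label="episode reward")
    smoothed = moving_average(history, window)
    plt.plot(range(window - 1, window - 1 + len(smoothed)), smoothed,
             label=f"{window}-episode average")
    plt.xlabel("episode")
    plt.ylabel("total reward")
    plt.legend()
    plt.show()

print(moving_average([-21, -21, -20, -19, -18], window=3))
```

Calling `plot_history(history)` after training would draw both curves. Episode rewards in Pong are noisy, so the smoothed line tends to show the trend more clearly than the raw values.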


Training for 5000 episodes would yield you a reward distribution similar to the one below:

To evaluate our results within the confines of the Colaboratory environment, we can record an entire episode and display it within a virtual display using a wrapper based on the IPython library:

import gym
from gym import logger as gymlogger
from gym.wrappers import Monitor
gymlogger.set_level(40)  # error only
import tensorflow as tf
import numpy as np
import random
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import math
import glob
import io
import base64
from IPython.display import HTML
from IPython import display as ipythondisplay
from pyvirtualdisplay import Display

# Start a virtual display so the environment can render inside Colaboratory
display = Display(visible=0, size=(1400, 900))
display.start()

# Utility functions to enable video recording of gym environment and displaying it
# To enable video, just do "env = wrap_env(env)"
def show_video():
    mp4list = glob.glob('video/*.mp4')
    if len(mp4list) > 0:
        mp4 = mp4list[0]
        video = io.open(mp4, 'r+b').read()
        encoded = base64.b64encode(video)
        ipythondisplay.display(HTML(data='''<video alt="test" autoplay
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
    else:
        print("Could not find video")

def wrap_env(env):
    env = Monitor(env, './video', force=True)
    return env

# Evaluate model on OpenAI gym
env = wrap_env(gym.make('Pong-v0'))
observation = env.reset()
prev_input = None
done = False

while not done:
    # set input to network to be difference image
    cur_input = prepro(observation)
    x = cur_input - prev_input if prev_input is not None else np.zeros(80 * 80)
    prev_input = cur_input

    # Sample an action (policy)
    proba = model.predict(np.expand_dims(x, axis=1).T)
    action = UP_ACTION if np.random.uniform() < proba else DOWN_ACTION

    # Return action to environment and extract
    # next observation, reward, and status
    observation, reward, done, info = env.step(action)

# Close the environment so the recording is written, then display it
env.close()
show_video()


We trained our model for nearly 5000 episodes, evaluating performance regularly.

Let’s start by taking a look at a model trained for 200 episodes:

The final score? 21–0. That’s pretty terrible, but understandable given the short training time. Let’s compare this to a model trained for 5000 episodes:

Our agent has improved in responsiveness and precision, although it’s still a long way off from being capable of defeating the benchmark system. The final score was 21–3.

All in all, our results are acceptable given the short training times. A systematic study on the Pong environment by Phon-Amnuaisuk suggested that roughly 10000 episodes of data are needed to generate an agent capable of achieving a win.

Summary of average scores observed from the networks with 100, 200, and 400 hidden nodes (single hidden layer), with the learning rate set at 0.001

That wraps up this article. In our next tutorial, we’ll explore the OpenAI gym library some more, and tackle a few new environments. After that, we’ll move on to explore a more complex and violent classic video game environment, so stay tuned.

We hope you enjoyed this article. To stay up to date with the latest updates to GradientCrescent, please consider following the publication.


Trazzi et al., FloydHub

den Bakken et al., “Python Deep Learning Cookbook”, Packt

Conygham et al., Star-AI