Automating Pac-man with Deep Q-learning: An Implementation in Tensorflow.

Source: Deep Learning on Medium

Automating Pac-man with Deep Q-learning: An Implementation in Tensorflow.

Fundamentals of Reinforcement Learning


Over the course of our articles covering the fundamentals of reinforcement learning at GradientCrescent, we’ve studied both model-based and sample-based approaches to reinforcement learning. Briefly, the former class is characterized by requiring knowledge of the complete probability distributions of all possible state transitions, and can be exemplified by Markovian Decision Processes. In contrast, sample-based learning methods allow for the determination of state values simply through repeated observations, without the need for transition dynamics. Within this domain, we’ve covered both Monte Carlo and Temporal Difference learning. Briefly, the two can be separated by the frequency of state-value updates: while a Monte Carlo approach requires that an episode be finished for a round of updates to take place, Temporal Difference approaches update intra-episode incrementally, using old estimations of state-values together with discounted rewards to generate new updates.

The rapid reactivity of TD or “online” learning approaches makes them suitable for highly dynamic environments, as the values of states and actions is continuously updated throughout time through sets of estimates. Perhaps most notably, TD is the foundation of Q-learning, a more advanced algorithm used to train agents tackling game environments such as those observed in the OpenAI Atari gyms, and the focus of this article.

Our previous policy-based Pong model, trained over 5000 episodes with a binary action space.

Going Beyond TD: SARSA & Q-learning

Recall that in Temporal Difference learning, we observed that an agent behaves cyclically in an environment, through sequence of States (S), Actions (A), and (Rewards).

Due to this cyclic behavior, we can update the value of the previous state as soon as we reach the next state. However, we can expand the scope of our training to include state-action values, just as we did with Markov Decision Processes prior. This is generally known as SARSA. Let’s compare the state-action and state-value TD update equations:

Q-learning takes this a step further by forcing a selection of the action with the highest action value during an update, in a similar way to what’s observed with Bellman Optimality Equations. We can inspect SARSA and Q-learning next to the Bellman and Bellman Optimality Equations, below:

You may be wondering about how ensure complete exploration of our state-action space, given the need to constantly select actions for a state with the highest existing action-value. In theory, we could be avoiding the optimal action simply by failing to evaluate it in the first place. To encourage exploration, we can use a decaying e-greedy policy, essentially forcing the agent to select an apparent sub-optimal action in order to learn more about its value, at a certain percentage of the time. By introducing a decaying process, we can limit exploration once all of the states have been evaluated, after which we’ll permanently select the optimal actions for each state.

As we’ve tackled Pong before with a MDP-based model, let’s take what we’ve learned about Q-learning and apply it to a game of Atari’s Ms. Pac-man.


Our Google Colaboratory implementation is written in Python utilizing Tensorflow Core, and can be found on the GradientCrescent Github. It’s based on that by Ravichandiran et. al, but upgraded to be compatible with Tensorflow 2.0, and significantly expanded to facilitate improved visualization and explanations. As the implementation for this approach is quite convoluted, let’s summarize the order of actions required:

  1. We define our Deep Q-learning neural network. This is a CNN that takes in-game screen images and outputs the probabilities of each of the actions, or Q-values, in the Ms-Pacman gamespace. To acquire a tensor of probabilitieses, we do not include any activation function in our final layer.
  2. As Q-learning require us to have knowledge of both the current and next states, we need to start with data generation. We feed preprocessed input images of the game space, representing initial states s, into the network, and acquire the initial probability distribution of actions, or Q-values. Before training, these values will appear random and sub-optimal.
  3. With our tensor of probabilities, we then select the action with the current highest probability using the argmax() function, and use it to build an epsilon greedy policy.
  4. Using our policy, we’ll then select the action a, and evaluate our decision in the gym environment to receive information on the new state s’, the reward r, and whether the episode has been finished.
  5. We store this combination of information in a buffer in the list form <s,a,r,s’,d>, and repeat steps 2–4 for a preset number of times to build up a large enough buffer dataset.
  6. Once step 5 has finished,we move to generate our target y-values, R’ and A’, that are required for the loss calculation. While the former is simply discounted from R, we obtain the A’ by feeding S’ into our network.
  7. With all of our components in place, we can then calculate the loss to train our network.
  8. Once training has finished, we’ll evaluate the performance of our agent graphically and through a demonstration.

Let’s get started. With Tensorflow 2 on the horizon for Colaboratory environments, we’ve converted our code to be TF2 compliant, using the new compat package. Note that this code is not TF2 native.

Let’s by importing all of the necessary packages, including the OpenAI gym environments and Tensorflow core.

import numpy as npimport gymimport tensorflow as tffrom tensorflow.contrib.layers import flatten, conv2d, fully_connectedfrom collections import deque, Counterimport randomfrom datetime import datetime

Next, we define a preprocessing function to crop the images from our gym environment and convert them into one-dimensional tensors. We’ve seen this before in our Pong automation implementation.

def preprocess_observation(obs): # Crop and resize the image img = obs[1:176:2, ::2] # Convert the image to greyscale img = img.mean(axis=2) # Improve image contrast img[img==color] = 0 # Next we normalize the image from -1 to +1 img = (img — 128) / 128–1 return img.reshape(88,80,1)

Next, let’s initialize the gym environment, and inspect a few screens of gameplay, and also understand the 9 actions available within the gamespace. Naturally, this information is not available to our agent.

env = gym.make(“MsPacman-v0”)n_outputs = env.action_space.nprint(n_outputs)print(env.env.get_action_meanings())observation = env.reset()import tensorflow as tfimport matplotlib.pyplot as pltfor i in range(22): if i > 20: plt.imshow(observation), _, _, _ = env.step(1)

You should observe the following:

We can take this chance to compare our original and preprocessed input images:

Next, let’s define our model, a deep Q-network. This is essentially a three layer convolutional network that takes preprocessed input images, flattens and feeds them to a fully-connected layer, and outputs the probabilities of taking each action in the game space. As previously mentioned, there’s no activation layer here, as the presence of one would result in a binary output distribution.

def q_network(X, name_scope):# Initialize layers initializer = tf.compat.v1.keras.initializers.VarianceScaling(scale=2.0) with tf.compat.v1.variable_scope(name_scope) as scope: # initialize the convolutional layers layer_1 = conv2d(X, num_outputs=32, kernel_size=(8,8), stride=4, padding=’SAME’, weights_initializer=initializer) tf.compat.v1.summary.histogram(‘layer_1’,layer_1) layer_2 = conv2d(layer_1, num_outputs=64, kernel_size=(4,4), stride=2, padding=’SAME’, weights_initializer=initializer) tf.compat.v1.summary.histogram(‘layer_2’,layer_2) layer_3 = conv2d(layer_2, num_outputs=64, kernel_size=(3,3), stride=1, padding=’SAME’, weights_initializer=initializer) tf.compat.v1.summary.histogram(‘layer_3’,layer_3) flat = flatten(layer_3) fc = fully_connected(flat, num_outputs=128, weights_initializer=initializer) tf.compat.v1.summary.histogram(‘fc’,fc) #Add final output layer output = fully_connected(fc, num_outputs=n_outputs, activation_fn=None, weights_initializer=initializer) tf.compat.v1.summary.histogram(‘output’,output) vars = {[len(]: v for v in tf.compat.v1.get_collection(key=tf.compat.v1.GraphKeys.TRAINABLE_VARIABLES,} #Return both variables and outputs togetherreturn vars, output

Let’s also take this chance to define our hyperparameters for our model and training process

num_episodes = 800batch_size = 48input_shape = (None, 88, 80, 1)#Recall shape is img.reshape(88,80,1)learning_rate = 0.001X_shape = (None, 88, 80, 1)discount_factor = 0.97global_step = 0copy_steps = 100steps_train = 4start_steps = 2000

Recall, that Q-learning requires us to select actions with the highest action values. To ensure that we still visit every single possible state-action combination, we’ll have our agent follow an epsilon-greedy policy, with an exploration rate of 5%. We’ll set a this exploration rate to decay with time, as we eventually assume all combinations have already been explored — any exploration after that point would simply result in the forced selection of sub-optimal actions.

epsilon = 0.5eps_min = 0.05eps_max = 1.0eps_decay_steps = 500000#def epsilon_greedy(action, step): p = np.random.random(1).squeeze() #1D entries returned using squeeze epsilon = max(eps_min, eps_max — (eps_max-eps_min) * step/eps_decay_steps) #Decaying policy with more steps if np.random.rand() < epsilon: return np.random.randint(n_outputs) else: return action

Recall from the equations above, that the update function for Q-learning requires the following:

  • The current state s
  • The current action a
  • The reward following the current action r
  • The next state s’
  • The next action a’

To supply these parameters in meaningful quantities, we need to evaluate our current policy following a set of parameters and store all of the variables in a buffer, from which we’ll draw data in minibatches during training.. This is unlike in our previous implementation in Pong, where we used an incremental approach. Let’s go ahead and create our buffer and a simple sampling function:

buffer_len = 20000#Buffer is made from a deque — double ended queueexp_buffer = deque(maxlen=buffer_len)def sample_memories(batch_size): perm_batch = np.random.permutation(len(exp_buffer))[:batch_size] mem = np.array(exp_buffer)[perm_batch] return mem[:,0], mem[:,1], mem[:,2], mem[:,3], mem[:,4]

Next, let’s copy the weight parameters of our original network into a target network. This dual-network approach allows us to generate data during the training process using an existing policy while still optimizing our parameters for the next policy iteration.

# we build our Q network, which takes the input X and generates Q values for all the actions in the statemainQ, mainQ_outputs = q_network(X, ‘mainQ’)# similarly we build our target Q network, for policy evaluationtargetQ, targetQ_outputs = q_network(X, ‘targetQ’)copy_op = [tf.compat.v1.assign(main_name, targetQ[var_name]) for var_name, main_name in mainQ.items()]copy_target_to_main =*copy_op)

Finally, we’ll also define our loss. This is simply the squared difference of our target action (with the highest action value) and our predicted action. We’ll use an ADAM optimizer to minimize our loss during training.

# define a placeholder for our output i.e actiony = tf.compat.v1.placeholder(tf.float32, shape=(None,1))# now we calculate the loss which is the difference between actual value and predicted valueloss = tf.reduce_mean(input_tensor=tf.square(y — Q_action))# we use adam optimizer for minimizing the lossoptimizer = tf.compat.v1.train.AdamOptimizer(learning_rate)training_op = optimizer.minimize(loss)init = tf.compat.v1.global_variables_initializer()loss_summary = tf.compat.v1.summary.scalar(‘LOSS’, loss)merge_summary = tf.compat.v1.summary.merge_all()file_writer = tf.compat.v1.summary.FileWriter(logdir, tf.compat.v1.get_default_graph())

With all of our code defined, let’s run our network and go over the training process. We’ve defined most of this in the initial summary, but let’s recall for posterity.

  • For each epoch, we feed an input image into our network to generate a probability distribution of the available actions, before using an epsilon-greedy policy to select the next action
  • We then input this into the gym environment, and obtain information on the next state and accompanying rewards, and store this into our buffer.
  • After our buffer is large enough, we sample the next states into our network in order to obtain the next action. We also calculate the next reward by discounting the current one
  • We generate our target y-values through the Q-learning update function, and train our network.
  • By minimizing the training loss, we update the network weight parameters to output improved state-action values for the next policy.
with tf.compat.v1.Session() as sess: # for each episode history = [] for i in range(num_episodes): done = False obs = env.reset() epoch = 0 episodic_reward = 0 actions_counter = Counter() episodic_loss = [] # while the state is not the terminal state while not done: # get the preprocessed game screen obs = preprocess_observation(obs) # feed the game screen and get the Q values for each action,  actions = mainQ_outputs.eval(feed_dict={X:[obs], in_training_mode:False}) # get the action action = np.argmax(actions, axis=-1) actions_counter[str(action)] += 1 # select the action using epsilon greedy policy action = epsilon_greedy(action, global_step) # now perform the action and move to the next state, next_obs, receive reward next_obs, reward, done, _ = env.step(action) # Store this transition as an experience in the replay buffer! Quite important exp_buffer.append([obs, action, preprocess_observation(next_obs), reward, done]) # After certain steps we move on to generating y-values for Q network with samples from the experience replay buffer if global_step % steps_train == 0 and global_step > start_steps: o_obs, o_act, o_next_obs, o_rew, o_done = sample_memories(batch_size) # states o_obs = [x for x in o_obs] # next states o_next_obs = [x for x in o_next_obs] # next actions next_act = mainQ_outputs.eval(feed_dict={X:o_next_obs, in_training_mode:False}) #discounted reward for action: these are our Y-values y_batch = o_rew + discount_factor * np.max(next_act, axis=-1) * (1-o_done) # merge all summaries and write to the file mrg_summary = merge_summary.eval(feed_dict={X:o_obs, y:np.expand_dims(y_batch, axis=-1), X_action:o_act, in_training_mode:False}) file_writer.add_summary(mrg_summary, global_step) # To calculate the loss, we run the previously defined functions mentioned while feeding inputs train_loss, _ =[loss, training_op], feed_dict={X:o_obs, y:np.expand_dims(y_batch, axis=-1), X_action:o_act, in_training_mode:True}) episodic_loss.append(train_loss) # after some interval we copy our main Q network weights to target Q network if (global_step+1) % copy_steps == 0 and global_step > start_steps: obs = next_obs epoch += 1 global_step += 1 episodic_reward += reward history.append(episodic_reward)print(‘Epochs per episode:’, epoch, ‘Episode Reward:’, episodic_reward,”Episode number:”, len(history))

Once training is complete, we can plot the reward distribution against incremental episodes. The first 550 episodes (roughly 2 hours) looks something like this:

After an additional 800 episodes, this converges into the following:

To evaluate our results within the confinement of the Colaboratory environment, we can record an entire episode and display it within a virtual display using a wrapped based on the IPython library:

“””Utility functions to enable video recording of gym environment and displaying it. To enable video, just do “env = wrap_env(env)””“”def show_video(): mp4list = glob.glob(‘video/*.mp4’) if len(mp4list) > 0: mp4 = mp4list[0] video =, ‘r+b’).read() encoded = base64.b64encode(video) ipythondisplay.display(HTML(data=’’’<video alt=”test” autoplay loop controls style=”height: 400px;”> <source src=”data:video/mp4;base64,{0}” type=”video/mp4" /> </video>’’’.format(encoded.decode(‘ascii’)))) else: print(“Could not find video”) 
def wrap_env(env):
env = Monitor(env, ‘./video’, force=True) return env

We then run a new session of our environment using our model, and record it.

#Evaluate model on openAi GYMobservation = env.reset()new_observation = observationprev_input = Nonedone = Falsewith tf.compat.v1.Session() as sess: while True: if True: #set input to network to be difference image obs = preprocess_observation(observation) # feed the game screen and get the Q values for each action actions = mainQ_outputs.eval(feed_dict={X:[obs], in_training_mode:False}) # get the action action = np.argmax(actions, axis=-1) actions_counter[str(action)] += 1 # select the action using epsilon greedy policy action = epsilon_greedy(action, global_step) env.render() observation = new_observation # now perform the action and move to the next state, next_obs, receive reward new_observation, reward, done, _ = env.step(action) if done: #observation = env.reset() break env.close() show_video()

You should observe something a few rounds of the game! Here’s a couple of episodes we recorded.

Not bad for a model trained in a few hours, scoring well above 400. In particular, it seems our agent performs quite well when directly chased by a ghost, but is still poor at anticipating incoming ones, probably as it hasn’t had enough experience observing their movements yet.

That wraps up this introduction to Q-learning. In our next article, we’ll move on from the world of Atari to tackling one of the most well known FPS games in the world. Stay tuned!

We hope you enjoyed this article, and hope you check out the many other articles on GradientCrescent, covering applied and theoretical aspects of AI. To stay up to date with the latest updates on GradientCrescent, please consider following the publication.


Sutton et. al, Reinforcement Learning

White et. al, Fundamentals of Reinforcement Learning, University of Alberta

Silva et. al, Reinforcement Learning, UCL

Ravichandiran et. al, Hands-On Reinforcement Learning with Python