Automating Doom with Deep Q-Learning: An Implementation in Tensorflow.

Original article was published on Deep Learning on Medium

Automating Doom with Deep Q-Learning: An Implementation in Tensorflow.

Fundamentals of Reinforcement Learning

Introduction

Online learning methods are a dynamic family of reinforcement learning algorithms behind many of the achievements in general AI over the past decade. Belonging to the sample-based learning class of reinforcement learning approaches, online learning methods allow for the determination of state values simply through repeated observations, eliminating the need for transition dynamics. Unlike their offline counterparts, online learning approaches allow for the incremental updates of the values of states and actions during an environmental episode, allowing for constant, incremental performance improvements to be observed.

Beyond Temporal Difference learning (TD), we’ve discussed the theory and practical implementations of Q-learning, an evolution of TD designed to allow for incremental estimations and improvement of state-action values. Q-learning has been made famous as becoming the backbone of reinforcement learning approaches to simulated game environments, such as those observed in OpenAI’s gyms. As we’ve already covered theoretical aspects of Q-learning in past articles, they will not be repeated here.

The rapid intra-episodic responsiveness that characterize online learning approaches such as TD make them suitable for highly dynamic environments,where the values of states and actions is continuously updated throughout time through sets of estimates. Perhaps most notably, TD is the foundation of Q-learning, a more advanced algorithm used to train agents tackling game environments such as those observed in the OpenAI Atari gyms, as covered in some of our previous implementations.

In this article, we’ll explore how Q-learning can be applied to training an agent to play the classic FPS game Doom, through the use of the open-source OpenAI gym wrapper library Vizdoomgym. We’ll guide you through setting up your first agents, and lay the foundations of future work.

An evolution of a timeless classic. We’ll be focusing on the 1993 version here.

Beyond TD: SARSA & Q-learning

Recall that in Temporal Difference learning, we observed that an agent behaves cyclically in an environment, through sequence of States (S), Actions (A), and (Rewards).

During TD, we can update the value of the previous state as soon as we reach the next state. We can further expand the scope of our model to include state-action values, in an approach termed SARSA, an on-policy TD control algorithm for estimating action values. During SARSA, we continuously estimate the action-values across incremental timesteps under a given policy, while using this knowledge to modify the policy towards an action-value greedy alternative.

Let’s compare the state-action and state-value TD update equations:

Q-learning differs from SARSA by forcing a selection of the action with the current highest action value during an update, in a similar approach to what’s observed with Bellman Optimality Equations. We can inspect SARSA and Q-learning next to the Bellman and Bellman Optimality Equations, below:

You may be wondering about how ensure complete exploration of our state-action space, given the need to constantly select actions for a state with the highest existing action-value. In theory, we could be avoiding the optimal action simply by failing to evaluate it in the first place. To encourage exploration, we can use a decaying e-greedy policy, essentially forcing the agent to select an apparent sub-optimal action at a decaying probability in order to learn more about its value. By using a decay value, we can limit exploration once all of the states have been evaluated, after which we’ll permanently select the optimal actions for each state.

With our theory covered, let’s get started on our implementation.

Implementation

Our Google Colaboratory implementation is written in Python utilizing Tensorflow Core, and can be found on the GradientCrescent Github. Readers following our publication will find the code similar to our previous implementations for Atari environments. As the implementation for this approach is quite convoluted, let’s summarize the order of actions required:

1. We define our Deep Q-learning neural network. This is a CNN that takes in-game screen images and outputs the probabilities of each of the actions, or Q-values, in the Ms-Pacman gamespace. To acquire a tensor of probabilitieses, we do not include any activation function in our final layer.

2. As Q-learning require us to have knowledge of both the current and next states, we need to start with data generation. We feed preprocessed input images of the game space, representing initial states s, into the network, and acquire the initial probability distribution of actions, or Q-values. Before training, these values will appear random and sub-optimal.

3. With our tensor of probabilities, we then select the action with the current highest probability using the argmax() function, and use it to build an epsilon greedy policy.

4. Using our policy, we’ll then select the action a, and evaluate our decision in the gym environment to receive information on the new state s’, the reward r, and whether the episode has been finished.

5. We store this combination of information in a buffer in the list form <s,a,r,s’,d>, and repeat steps 2–4 for a preset number of times to build up a large enough buffer dataset.

6. Once step 5 has finished,we move to generate our target y-values, R’ and A’, that are required for the loss calculation. While the former is simply discounted from R, we obtain the A’ by feeding S’ into our network.

7. With all of our components in place, we can then calculate the loss to train our network.

8. Once training has finished, we’ll evaluate the performance of our agent under a new game episode, and record the performance.

Let’s get started. With Tensorflow 2 in use for Colaboratory environments, we’ve converted our code to be TF2 compliant, using the new compat package. Note that this code is not TF2 native.

Let’s by importing all of the necessary packages, including the OpenAI and Vizdoomgym environments and Tensorflow Core:

import gym
import vizdoomgym
!pip install tensorflow==1.15
import numpy as np
import tensorflow as tf
from tensorflow.contrib.layers import flatten, conv2d, fully_connected
from collections import deque, Counter
import random
from datetime import datetime

Next, we define a preprocessing function to normalize and resize the observations from our gym environment and convert them into one-dimensional tensors.

from skimage.color import rgb2gray
from skimage import transform
#prepro (240, 320, 3) uint8 frame into 30x40 1D float vector
color = np.array([240, 320, 74]).mean()
def preprocess_observation(obs):


img =obs/255.0
img[img==color] = 0
img_gray = rgb2gray(img)
preprocessed_frame = transform.resize(img_gray, [60,80])

return preprocessed_frame

Next, let’s initialize the gym environment. We’ll be using Vizdoomgym’s health gathering scenario, where the objective is to collect as many health tokens as possible in order to stay alive while navigating a square room with a damaging acidic floor.

env = gym.make(‘VizdoomHealthGathering-v0’)
n_outputs = env.action_space.n
print(n_outputs)
observation = env.reset()import tensorflow as tf
import matplotlib.pyplot as plt
for i in range(22):

if i > 20:
print(observation.shape)
plt.imshow(observation)
plt.show()
observation, _, _, _ = env.step(1)

We can inspect a screen of gameplay, and also view the 3 actions available within the gamespace, namely to turn left, right, or to move forward. Naturally, this information is not available to our agent.

Raw observation input

We can take this chance to compare our original and preprocessed input images:

Preprocessed image input

Next, we introduce input frame stacking and frame composition into our preprocessing pipeline. These methods serve to provide a temporal and motion reference for our inputs, respectively.

We apply frame composition by by taking two of our input frames, and returning an element-wise maximum summation maxframe of the two. These composed frames are then stored in a deque or “stack”, which automatically removes older entries as new entries are introduced.

stack_size = 4 # We stack 4 composite frames in total
stacked_frames = deque([np.zeros((60,80), dtype=np.int) for i in range(stack_size)], maxlen=4)
def stack_frames(stacked_frames, state, is_new_episode):
# Preprocess frame
frame = preprocess_observation(state)

if is_new_episode:
# Clear our stacked_frames
stacked_frames = deque([np.zeros((60,80), dtype=np.int) for i in range(stack_size)], maxlen=4)

# Because we’re in a new episode, copy the same frame 4x, apply elementwise maxima
maxframe = np.maximum(frame,frame)
stacked_frames.append(maxframe)
stacked_frames.append(maxframe)
stacked_frames.append(maxframe)
stacked_frames.append(maxframe)



# Stack the frames
stacked_state = np.stack(stacked_frames, axis=2)

else:
#Since deque append adds t right, we can fetch rightmost element
maxframe=np.maximum(stacked_frames[-1],frame)
# Append frame to deque, automatically removes the oldest frame
stacked_frames.append(maxframe)
# Build the stacked state (first dimension specifies different frames)
stacked_state = np.stack(stacked_frames, axis=2)

return stacked_state, stacked_frames

Next, let’s define our model, a deep Q-network. This is essentially a three layer convolutional network that takes preprocessed input obserations, with the generated flattened output fed to a fully-connected layer, generating probabilities of taking each action in the game space as an output. Note there are no activation layers here, as the presence of one would result in a binary output distribution.

tf.compat.v1.reset_default_graph()
def q_network(X, name_scope):

# Initialize layers
initializer = tf.compat.v1.keras.initializers.VarianceScaling(scale=2.0)
with tf.compat.v1.variable_scope(name_scope) as scope: # initialize the convolutional layers
layer_1 = conv2d(X, num_outputs=32, kernel_size=(8,8), stride=4, padding=’SAME’, weights_initializer=initializer)
tf.compat.v1.summary.histogram(‘layer_1’,layer_1)

layer_2 = conv2d(layer_1, num_outputs=64, kernel_size=(4,4), stride=2, padding=’SAME’, weights_initializer=initializer)
tf.compat.v1.summary.histogram(‘layer_2’,layer_2)

layer_3 = conv2d(layer_2, num_outputs=64, kernel_size=(3,3), stride=1, padding=’SAME’, weights_initializer=initializer)
tf.compat.v1.summary.histogram(‘layer_3’,layer_3)

# Flatten the result of layer_3 before feeding to the fully connected layer
flat = flatten(layer_3)
# Insert fully connected layer
fc = fully_connected(flat, num_outputs=128, weights_initializer=initializer)
tf.compat.v1.summary.histogram(‘fc’,fc)
#Add final output layer
output = fully_connected(fc, num_outputs=n_outputs, activation_fn=None, weights_initializer=initializer)
tf.compat.v1.summary.histogram(‘output’,output)
# Vars will store the parameters of the network such as weights
vars = {v.name[len(scope.name):]: v for v in tf.compat.v1.get_collection(key=tf.compat.v1.GraphKeys.TRAINABLE_VARIABLES, scope=scope.name)}
#Return both variables and outputs together
return vars, output

Let’s also take this chance to define our hyperparameters for our model and training process. Note that the X_shape is now (None, 60, 80, 4), on account of our stacked frames.

num_episodes = 1000
batch_size = 48
input_shape = (None, 60, 80, 1)learning_rate = 0.002
#Modified for composite stacked frames
X_shape = (None, 60, 80, 4)
discount_factor = 0.99
global_step = 0
copy_steps = 100
steps_train = 4
start_steps = 2000

Recall, that Q-learning requires us to select actions with the highest action-values. To ensure that we still visit every single possible state-action combination, we’ll have our agent follow a decaying epsilon-greedy policy, with an exploration rate of 5%.

epsilon = 0.5
eps_min = 0.05
eps_max = 1.0
eps_decay_steps = 500000
def epsilon_greedy(action, step):
p = np.random.random(1).squeeze() #1D entries returned using squeeze
epsilon = max(eps_min, eps_max — (eps_max-eps_min) * step/eps_decay_steps) #Decaying policy with more steps
if p< epsilon:
return np.random.randint(n_outputs)
else:
return action

Recall from the equations above, that the update function for Q-learning requires the following:

  • The current state s
  • The current action a
  • The reward following the current action r
  • The next state s’
  • The next action a’

To supply these parameters in meaningful quantities, we need to evaluate our current policy following a set of parameters and store all of the variables in a buffer, from which we’ll draw data in minibatches during training. Let’s go ahead and create our buffer and a simple sampling function:

buffer_len = 20000
exp_buffer = deque(maxlen=buffer_len)
def sample_memories(batch_size):
perm_batch = np.random.permutation(len(exp_buffer))[:batch_size]
mem = np.array(exp_buffer)[perm_batch]
return mem[:,0], mem[:,1], mem[:,2], mem[:,3], mem[:,4]

Next, let’s copy the weight parameters of our original network into a target network. This dual-network approach allows us to generate data during the training process using an existing policy while still optimizing our parameters for the next policy iteration, reducing loss oscillations.

# we build our Q network, which takes the input X and generates Q values for all the actions in the state
mainQ, mainQ_outputs = q_network(X, ‘mainQ’)
# similarly we build our target Q network, for policy evaluation
targetQ, targetQ_outputs = q_network(X, ‘targetQ’)
copy_op = [tf.compat.v1.assign(main_name, targetQ[var_name]) for var_name, main_name in mainQ.items()]
copy_target_to_main = tf.group(*copy_op)

Finally, we’ll also define our loss. This is simply the squared difference of our target action (with the highest action value) and our predicted action. We’ll use an ADAM optimizer to minimize our loss during training.

# define a placeholder for our output i.e action
y = tf.compat.v1.placeholder(tf.float32, shape=(None,1))
# now we calculate the loss which is the difference between actual value and predicted value
loss = tf.reduce_mean(input_tensor=tf.square(y — Q_action))
# we use adam optimizer for minimizing the loss
optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(loss)
init = tf.compat.v1.global_variables_initializer()
loss_summary = tf.compat.v1.summary.scalar(‘LOSS’, loss)
merge_summary = tf.compat.v1.summary.merge_all()
file_writer = tf.compat.v1.summary.FileWriter(logdir, tf.compat.v1.get_default_graph())

With all of our code defined, let’s run our network and go over the training process. We’ve defined most of this in the initial summary, but let’s recall for posterity.

  • For each epoch, we feed an input image stack into our network to generate a probability distribution of the available actions, before using an epsilon-greedy policy to select the next action
  • We then input this into the network, and obtain information on the next state and accompanying rewards, and store this into our buffer. We update our stack and repeat this process over a number of pre-defined steps.
  • After our buffer is large enough, we feed the next states into our network in order to obtain the next action. We also calculate the next reward by discounting the current one
  • We generate our target y-values through the Q-learning update function, and train our network.
  • By minimizing the training loss, we update the network weight parameters to output improved state-action values for the next policy.
with tf.compat.v1.Session() as sess:
init.run()
# for each episode
history = []
for i in range(num_episodes):
done = False
obs = env.reset()
epoch = 0
episodic_reward = 0
actions_counter = Counter()
episodic_loss = []
# First step, preprocess + initialize stack
obs,stacked_frames= stack_frames(stacked_frames,obs,True)
# while the state is not the terminal state
while not done:
#Data generation using the untrained network
# feed the game screen and get the Q values for each action
actions = mainQ_outputs.eval(feed_dict={X:[obs], in_training_mode:False})
# get the action
action = np.argmax(actions, axis=-1)
actions_counter[str(action)] += 1
# select the action using epsilon greedy policy

action = epsilon_greedy(action, global_step)
# now perform the action and move to the next state, next_obs, receive reward
next_obs, reward, done, _ = env.step(action)
#Updated stacked frames with new episode
next_obs, stacked_frames = stack_frames(stacked_frames, next_obs, False)
# Store this transition as an experience in the replay buffer!
exp_buffer.append([obs, action, next_obs, reward, done])
# After certain steps, we train our Q network with samples from the experience replay buffer
if global_step % steps_train == 0 and global_step > start_steps:

o_obs, o_act, o_next_obs, o_rew, o_done = sample_memories(batch_size)
# states
o_obs = [x for x in o_obs]
# next states
o_next_obs = [x for x in o_next_obs]
# next actions
next_act = mainQ_outputs.eval(feed_dict={X:o_next_obs, in_training_mode:False})
# discounted reward: these are our Y-values
y_batch = o_rew + discount_factor * np.max(next_act, axis=-1) * (1-o_done)
# merge all summaries and write to the file
mrg_summary = merge_summary.eval(feed_dict={X:o_obs, y:np.expand_dims(y_batch, axis=-1), X_action:o_act, in_training_mode:False})
file_writer.add_summary(mrg_summary, global_step)
# To calculate the loss, we run the previously defined functions mentioned while feeding inputs
train_loss, _ = sess.run([loss, training_op], feed_dict={X:o_obs, y:np.expand_dims(y_batch, axis=-1), X_action:o_act, in_training_mode:True})
episodic_loss.append(train_loss)
# after some interval we copy our main Q network weights to target Q network
if (global_step+1) % copy_steps == 0 and global_step > start_steps:
copy_target_to_main.run()
obs = next_obs
epoch += 1
global_step += 1
episodic_reward += reward
next_obs=np.zeros(obs.shape)
exp_buffer.append([obs, action, next_obs, reward, done])
obs= env.reset()
obs,stacked_frames= stack_frames(stacked_frames,obs,True)
history.append(episodic_reward)
print(‘Epochs per episode:’, epoch, ‘Episode Reward:’, episodic_reward,”Episode number:”, len(history))

Once training is complete, we can plot the reward distribution against incremental episodes. The first 1000 episodes are shown below:

This result is acceptable, with the beginnings of improvement being visible, given how the original Vizdoom paper suggests that hundreds of thousands of episodes would be needed for a more significant improvement to be observed.

img_array=[]
with tf.compat.v1.Session() as sess:
init.run()
observation, stacked_frames = stack_frames(stacked_frames, observation, True)

while True:
# feed the game screen and get the Q values for each action
actions = mainQ_outputs.eval(feed_dict={X:[observation], in_training_mode:False})
# get the action
action = np.argmax(actions, axis=-1)
actions_counter[str(action)] += 1
# select the action using epsilon greedy policy
action = epsilon_greedy(action, global_step)
environment.render()
new_observation, stacked_frames = stack_frames(stacked_frames, new_observation, False)

observation = new_observation
# now perform the action and move to the next state, next_obs, receive reward
new_observation, reward, done, _ = environment.step(action)

img_array.append(new_observation)
if done:
#observation = env.reset()
break

environment.close()

Finally, we can take our list of frames and feed it to the scikit-video library to generate a video sequence output for inspection:

from random import choice
import cv2
from google.colab.patches import cv2_imshow
import numpy as np
import skvideo.io
out_video = np.empty([len(img_array), 240, 320, 3], dtype = np.uint8)
out_video = out_video.astype(np.uint8)
for i in range(len(img_array)):
frame = img_array[i]
out_video[i] = frame
# Writes the the output image sequences in a video file
skvideo.io.vwrite(“/content/doom.mp4”, out_video)

Let’s view our agent in action!

Note how the agent pauses when it spots a health pack, before proceeding to move towards it. Just for fun, we also trained an agent for the basic scenario, where the objective is to hit the monster as soon as possible. While we achieved a best time of ca. 1.3 seconds, we’ll show one of the earlier episodes below.

That wraps up this implementation on Q-learning. In our next article, we’ll move on to tackling more complex Doom scenarios with more advanced Q-learning approahes, and switch our code over to Pytorch, one of the most popular languages in AI research.

We hope you enjoyed this article, and hope you check out the many other articles on GradientCrescent, covering applied and theoretical aspects of AI. To stay up to date with the latest updates on GradientCrescent, please consider following the publication and following our Github repository.

References

Sutton et. al, Reinforcement Learning

White et. al, Fundamentals of Reinforcement Learning, University of Alberta

Silva et. al, Reinforcement Learning, UCL

Makarov et. al, Deep Reinforcement Learning with VizDoom First-Person Shooter, HSE

Kempka et. al, ViZDoom: A Doom-based AI Research Platform for Visual Reinforcement Learning, PUT

Ravichandiran et. al, Hands-On Reinforcement Learning with Python