Walking through original DQN paper

Source: Deep Learning on Medium

Walking through original DQN paper

In this article, I will share my insights and observations, which I got while going through Deep Q-Learning Network (DQN) paper and
and implementing it to play Atari Breakout videogame.

It’s worth mentioning from the start that we will need a DL framework and the first option would be to use TensorFlow. I’ll just grab it for now, but I’ll keep in mind that later I will probably like to try PyTorch as well, so, it’s better to hide TF things behind some abstractions.

Also, I will be using gym.

So, DL is mentioned right at the start of the article which says that we will use deep neural network to approximate a Q-function, a function, which tells us how good it is to take different actions in different states.

In the paper they present this formula of optimal Q*(s, a) function:

which is indeed:

the maximum sum of rewards r, discounted by y at each timestep,
achievable by a behavior policy π = P(a|s), after making an
observation (s) and taking an action (a)

An expectation notion is used because policy π, in general, can select actions in states in a non-deterministic way, using some distribution P(a|s).

The paper deals with playing Atari video games, so the state would be the picture of the game situation (enemy in the top left, player a bit lower, obstacles here and there). For all those games number of states will be very very large (but the number of actions will be fixed and small — go left, right, up, down, etc).
That’s why we can not calculate Q-function for every possible state and action, no PC memory would hold that.

And thus, we introduce another function Q(s, a, θ), which will approximate ideal Q(s, a)-function and have fewer parameters θ, then the number of states of the game.
This function Q(s, a, θ) has a fixed number of parameters θ, which will be changed (learned) and then we will be able to pass the current game picture into that function and it will tell us what our best action is.

As a Q-function approximator, a very good function approximator is chosen — a neural network. θ parameters are network weights. And our goal is to train this network in such a way that it would output values very close to an optimal Q*(s, a) values. Using those values we would be able to construct an optimal policy, for that we would just grab max Q(a, s, θ) for each state (s).

The overall approach will be the same old Generalized Policy Iteration.

We start from some policy and Q-function
Until we think we are done:
We get a sample reward
We update our Q-function estimation, using new observation
We create new epsilon-greedy policy from updated Q-function

Details on how exactly we perform each step will follow.

Tackling the instability of RL with non-linear approximators

Okay, after presenting Q*(s, a) formula authors write one very important piece on the causes for RL with non-linear (that’s our case!) function approximators being unstable and their approach to tackle those causes.

They state three main reasons for RL with non-linear function approximators to be unstable:

1 The correlations present in the sequence of observations. That is, the next data point (the observation) is correlated with previous data points, obviously, there are rules to the game flow and it can not arbitrarily jump from state to state! Regarding this, the authors of the paper state that:

Learning directly from consecutive samples is inefficient, owing to the strong correlation between the samples; randomizing the samples breaks these correlations and reduces the variance of the updates

2. Small updates to Q may significantly change the policy and therefore change the data distribution. That means that the target of our learning algorithms is moving, which causes some trouble for the learning algorithm

if the maximizing action is to move left then the training samples will be dominated by samples from left-hand side; if the maximizing action then switches to the right then the training distribution will also switch. It’s easy to see how unwanted feedback loops may arise and the parameters could get stuck in a poor local minimum or even diverge

3. Correlations between the action-values Q and the target values (r +γ * maxₐQ (s, a)). Of course, there is a correlation, because we use target values to update our “ongoing” estimate for Q.

in standard online Q-learning , where an update increases Q(sᵢ, aᵢ) often also increases Q(sᵢ₊₁, a) for all a hence also increases the target γ, possibly leading to oscillations or divergence of the policy

To address these issues authors propose two key ideas.

Experience replay

The idea is to store last N transitions (state, action, reward, next_state) in a buffer and on each step, when we are about to update our Q-function estimation, in our case a neural network, we extract randomly M samples from that buffer and update it using that batch. This should fight with items (1) and (2) above

removing correlations in the observation sequence and smoothing over changes in the data distribution

class ReplayMemory:

def __init__(self):
self.buffer = []
self.replay_memory_init_size = 500
self.replay_memory_size = 500000

def append(self, transition):
if len(self.buffer) == self.replay_memory_size:

def reset_state(self, env, state_processor):
# Some stuff here...

def init_replay_memory(self,

print("Populating replay memory...")
# generate some sample transitions here

def sample(self, sample_size):
return random.sample(self.buffer, sample_size)

Separate network for the current estimator and target estimator

Authors propose to maintain 2 separate neural networks, one is updated every step, another is used to estimate target (r + γ * maxₐQ(s, a)) and kept untouched. Every N steps we copy weights from the current estimator to the target estimator. This adds a delay between the time an update is made to Q and the time the update affects target γ, making divergence and oscillations much more unlikely (trying to cope with (3) above).

Learning process

Each step we make a transition from the current state (s) using action (a) to the new state (s’) and get a reward (r). Then we

  • Put this new transaction (s, a, r, s’) into our replay buffer.
  • Sample N transactions from the buffer.
  • Calculate the target (using our “fixed weights” network) for each transaction in the batch
  • Use target Q-values it to get loss value and run optimizer on our “updatable weights” network
  • Each K steps we copy weights from “updatable weights” network (θ) to “fixed weights” network (θ⁻)

ε-greedy policy

How do determine which action (a) we take? We use so-called ε-greedy policy to generate data:

def make_epsilon_greedy_policy(estimator, actions_count):

def policy_fn(observation, epsilon):
A = np.ones(actions_count, dtype=float) * epsilon / actions_count
q_values = estimator.predict(np.expand_dims(observation, 0))[0]
best_action = np.argmax(q_values)
A[best_action] += (1.0 - epsilon)
return np.random.choice(len(A), p=A)

return policy_fn

We take Q values for the current state (s) for all possible actions. Then we pick the best action with prob (1-ε+ε/actions_count) and all others with equal probability (ε/actions_count). As this policy is random we are able sometimes to take action, which is not optimal in the current situation, but could lead to greater Return later as learning progresses.

Network architecture

Network architecture is not that complicated and can be described easily with python code. The input is 4 RGB frames of shape 84, 84 each. The output is Q-values for all possible actions in the current state.

self.input = tf.placeholder(shape=[None, 84, 84, 4], dtype=tf.uint8, name="X")
X = tf.to_float(self.input) / 255.0
# Three convolutional layers
conv1 = tf.layers.conv2d(X, 32, 8, 4, activation=tf.nn.relu)
conv2 = tf.layers.conv2d(conv1, 64, 4, 2, activation=tf.nn.relu)
conv3 = tf.layers.conv2d(conv2, 64, 3, 1, activation=tf.nn.relu)

# Fully connected layers
flattened = tf.layers.flatten(conv3)
fc1 = tf.layers.dense(flattened, 512)
self.q_values_for_all_actions = tf.layers.dense(fc1, actions_count)

Q-function estimator update

For our Q-function neural network to learn we must provide the optimizer with the loss function.

That’s the formula from the paper:

Where (r) is the reward we got by going from (s), using (a), to (s’), and γ is some coefficient which is intended to regulate how important are rewards from the distant future for our policy. θ are the weights of “updatable weights” network and θ⁻ are the weights from the “fixed weights” network, which is used to calculate target Q.

Expectation in this formula just means that take those samples from buffer and weight them by probability of drawing them, which is in our case equal for all the samples and this formula boils down to the mean of squared differences between current Q-value for state (s) and action (a) and desired (target) value, which is (r + γ * maxₐ’Q(s’, a’, θ⁻)).

In python and TF it’s just:

# Calculate the loss
self.losses = tf.squared_difference(self.target_q_for_selected_actions, self.q_values_for_selected_actions)
self.loss = tf.reduce_mean(self.losses)

As optimizer RMSProp is used as follows (all the hyperparameters are taken ‘as is’ from the paper)

self.optimizer = tf.train.RMSPropOptimizer(0.00025, 0.99, 0.0, 1e-6)
self.train_op = self.optimizer.minimize(self.loss, global_step=tf.train.get_global_step())

Details of implementation

Also, there are lots of details, which contribute to the algorithm’s performance.

Frame Downsampling

Atari 2600 games run 60 MHz and each frame is 210×160 with 128-bit pallette. To reduce the computational and memory needs of the DQN algorithm, each frame is downsampled to 84×84 and converted to grayscale.

The reasonable way of implementing the downsampling is to use a sequence of TF operations, which then can be run on GPU.

class TensorFlowFrameDownsampler(object):

def __init__(self, sess):
self.sess = sess

with tf.variable_scope("state_processor"):
self.input_state = tf.placeholder(shape=[210, 160, 3], dtype=tf.uint8)
self.output = tf.image.rgb_to_grayscale(self.input_state)
self.output = tf.image.crop_to_bounding_box(self.output, 34, 0, 160, 160)
self.output = tf.image.resize_images(
self.output, [84, 84], method=tf.image.ResizeMethod.NEAREST_NEIGHBOR)
self.output = tf.squeeze(self.output)

def process(self, state):
return self.sess.run(self.output, {self.input_state: state})

Another piece of implementing that downsampling is a gym.Wrapper, which actually uses TensorFlowFrameDownsampler.

class DownsampleFrameWrapper(gym.Wrapper):

def __init__(self, env, downsample_frame_fn):
gym.Wrapper.__init__(self, env)
self.downsample_frame_fn = downsample_frame_fn

def reset(self):
ob = self.env.reset()
return self.downsample_frame_fn(ob)

def step(self, action):
ob, reward, done, info = self.env.step(action)
return self.downsample_frame_fn(ob), reward, done, info

Frame Stacking

Also each state for the algorithm is not just one frame but 4 previous frames stacked, so the state is a 84x84x4 tensor. I’ve also implemented a separate gym.Wrapper for that.

class FrameStackWrapper(gym.Wrapper):
def __init__(self, env, k):
gym.Wrapper.__init__(self, env)
self.k = k
self.frames = deque([], maxlen=k)

def reset(self):
ob = self.env.reset()
for _ in range(self.k):
return self._get_ob()

def step(self, action):
ob, reward, done, info = self.env.step(action)
return self._get_ob(), reward, done, info

def _get_ob(self):
assert len(self.frames) == self.k
return np.stack(self.frames, axis=2)

Reward clipping

While we evaluated our agents on unmodified games, we made one change to the reward structure of the games during training only. As the scale of scores varies greatly from game to game, we clipped all positive rewards at 1 and all negative rewards at -1, leaving 0 rewards unchanged. Clipping the rewards in this manner limits the scale of the error derivatives and makes it easier to use the same learning rate across multiple games.

As I targeted just Breakout where the rewards were already 1.0 for every destroyed enemy I didn’t need the clipping.

Epsilon decay

Training of the DQN agent was done with decaying epsilon like this

class EpsilonDecaySchedule:

def __init__(self, start, end, steps):
self.steps = steps
self.epsilons = np.linspace(start, end, self.steps)

def next_epsilon(self, steps_taken_so_far):
return self.epsilons[min(steps_taken_so_far, self.steps-1)]
self.epsilon_decay_schedule = EpsilonDecaySchedule(1.0, 0.1, 1e6)

Frame skipping

From the paper:

More precisely, the agent sees and selects actions on every kth frame instead of every frame, and its last action is repeated on skipped frames. Because running the emulator forward for one step requires much less computation than having the agent select an action, this technique allows the agent to play roughly k times more games without significantly increasing the runtime. We use k=4 for all games

So the authors of the paper used frame skipping to play k times more games (episodes) in roughly the same amount of time. In my case, I decided not to implement it.

Results stated in the paper

It is specifically emphasized that DQN-agent was able to learn how to play all the games with no prior knowledge, using only pixels, and for all the games the same NN-architecture was used.

Our DQN method outperforms the best existing reinforcement learning methods on 43 of the games without incorporating any of the additional prior knowledge about Atari 2600 games used by other approaches (for example, refs 12, 15). Furthermore, our DQN agent performed at a level that was comparable to that of a professional human games tester across the set of 49 games, achieving more than 75%of the human score on more than half of the games (29 games)

We next examined the representations learned by DQN that underpinned the successful performance of the agent in the context of the game Space Invaders, by using a technique developed for the visualization of high-dimensional data called ‘t-SNE’. As expected, the t-SNE algorithm tends to map the DQN representation of perceptually similar states to nearby points. Interestingly,we also found instances in which the t-SNE algorithm generated similar embeddings for DQN representations of states that are close in terms of expected reward but perceptually dissimilar, consistent with the notion that the network is able to learn representations that support adaptive behaviour from high-dimensional sensory inputs.

Amount of training stated:

We trained for a total of 50 million frames (that is, around 38 days of game experience in total) and used a replay memory of 1 million most recent frames.

Results in my case

In my case, I ran approx 5000 episodes of Breakout with the replay memory size of 500K.

To get an understanding of what was going on, I outputted return and length for every episode with the help of the TensorBoard.

Episode length
Episode return

The graphs show the episode length and reward are increasing as training progresses up until 1M steps. Good news!

After 1M steps it seems like there is no real progress in training, as episode length oscillates around 1000 steps and return around 20. Either this is the limit of my current implementation or I made an error somewhere and learning process got stuck in some local minimum, at the moment its hard to tell.

Anyway, here are some good skills in playing Breakout:

Now I need to dig deeper to solve it completely!