Deep Q Learning for the CartPole

Source: Deep Learning on Medium

Deep Q Learning for the CartPole

The purpose of this post is to introduce the concept of Deel Q Learning and use it to solve the CartPole environment from the OpenAI Gym.

The post will consist of the following components:

  1. Open AI Gym Environment Intro
  2. Random Baseline Strategy
  3. Deep Q Learning
  4. Deep Q Learning with Replay Memory
  5. Double Deep Q Learning
  6. Soft Updates


The CartPole environment consists of a pole which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. The state space is represented by four values: cart position, cart velocity, pole angle, and the velocity of the tip of the pole. The action space consists of two actions: moving left or moving right. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.

The cell below plots a bunch of example frames from the environment:

# Demonstration
env = gym.envs.make("CartPole-v1")
def get_screen():
''' Extract one step of the simulation.'''
screen = env.render(mode='rgb_array').transpose((2, 0, 1))
screen = np.ascontiguousarray(screen, dtype=np.float32) / 255.
return torch.from_numpy(screen)
# Speify the number of simulation steps
num_steps = 2
# Show several steps
for i in range(num_steps):
plt.imshow(get_screen().cpu().permute(1, 2, 0).numpy(),
plt.title('CartPole-v0 Environment')

Dependent on the number of episodes, the output will look something like this:

Untrained Agent

As we can see, the agent is untrained yet, so it cannot make more than a couple of steps. We will soon explore some of the strategies that will drastically improve performance. But first, let’s define the plotting function that will help us analyze results.:

def plot_res(values, title=''): 
''' Plot the reward curve and histogram of results over time.'''
# Update the window after each episode

# Define the figure
f, ax = plt.subplots(nrows=1, ncols=2, figsize=(12,5))
ax[0].plot(values, label='score per run')
ax[0].axhline(195, c='red',ls='--', label='goal')
x = range(len(values))
# Calculate the trend
z = np.polyfit(x, values, 1)
p = np.poly1d(z)
ax[0].plot(x,p(x),"--", label='trend')

# Plot the histogram of results
ax[1].axvline(195, c='red', label='goal')
ax[1].set_xlabel('Scores per Last 50 Episodes')

The resulting plot consists of two subplots. The first one plots the total reward the agent accumulates over time, while the other plot shows a histogram of the agent’s total rewards for the last 50 episodes. We’ll see some of the graphs when we’ll analyze our strategies.

Baseline Random Model

Before implementing any deep learning approaches, I wrote a simple strategy where the action is sampled randomly from the action space. This approach will serve as a baseline for other strategies and will make it easier to understand how to work with the agent using the Open AI Gym environment.

def random_search(env, episodes, 
title='Random Strategy'):
""" Random search strategy implementation."""
final = []
for episode in range(episodes):
state = env.reset()
done = False
total = 0
while not done:
# Sample random actions
action = env.action_space.sample()
# Take action and extract results
next_state, reward, done, _ = env.step(action)
# Update reward
total += reward
if done:
# Add to the final reward
return final

One environment step returns several values, such as the next_state, reward, and whether the simulation is done. The plot below represents the total accumulated reward over 150 episodes (simulation runs):

Random Strategy

The plot above presents the random strategy. As expected, it’s impossible to solve the environment using this approach. The agent is not learning from their experience. Despite being lucky sometimes (getting a reward of almost 75), their average performance is as low as 10 steps.

Deep Q Learning

The main idea behind Q-learning is that we have a function 𝑄:𝑆𝑡𝑎𝑡𝑒×𝐴𝑐𝑡𝑖𝑜𝑛→ℝ, which can tell the agent what actions will result in what rewards. If we know the value of 𝑄, it is possible to construct a policy that maximizes rewards:

𝜋(𝑠)=argmax𝑎 𝑄(𝑠,𝑎)

However, in the real world, we don’t have access to full information, that’s why we need to come up with ways of approximating 𝑄. One traditional method is creating a lookup table where the values of 𝑄 are updated after each of the agent’s actions. However, this approach is slow and does not scale to large action and state spaces. Since neural networks are universal function approximators, I will train a network that can approximate 𝑄.

The DQL class implementation consists of a simple neural network implemented in PyTorch that has two main methods — predict and update. The network takes the agent’s state as an input and returns the 𝑄 values for each of the actions. The maximum 𝑄 value is selected by the agent to perform the next action:

class DQL():
''' Deep Q Neural Network class. '''
def __init__(self, state_dim, action_dim, hidden_dim=64, lr=0.05):
self.criterion = torch.nn.MSELoss()
self.model = torch.nn.Sequential(
torch.nn.Linear(state_dim, hidden_dim),
torch.nn.Linear(hidden_dim, hidden_dim*2),
torch.nn.Linear(hidden_dim*2, action_dim)
self.optimizer = torch.optim.Adam(self.model.parameters(), lr)
def update(self, state, y):
"""Update the weights of the network given a training sample. """
y_pred = self.model(torch.Tensor(state))
loss = self.criterion(y_pred, Variable(torch.Tensor(y)))
def predict(self, state):
""" Compute Q values for all actions using the DQL. """
with torch.no_grad():
return self.model(torch.Tensor(state))

The q_learning function is the main loop for all the algorithms that follow.
It has many parameters, namely:

env represents the Open Ai Gym environment that we want to solve (CartPole.)
episodesstand for the number of games we want to play.
gammais a discounting factor that is multiplied by future rewards to dampen these rewards’ effect on the agent. It is designed to make future rewards worth less than immediate rewards.
epsilonrepresents the proportion of random actions relative to actions that are informed by existing “knowledge” that the agent accumulates during the episode. This strategy is called “Greedy Search Policy.” Before playing the game, the agent doesn’t have any experience, so it is common to set epsilon to higher values and then gradually decrease its value.
eps_decay indicates the speed at which the epsilon decreases as the agent learns. 0.99 comes from the original DQN paper.

I will explain other parameters later on when we will get to the corresponding agents.

def q_learning(env, model, episodes, gamma=0.9, 
epsilon=0.3, eps_decay=0.99,
replay=False, replay_size=20,
title = 'DQL', double=False,
n_update=10, soft=False):
"""Deep Q Learning algorithm using the DQN. """
final = []
memory = []
for episode in range(episodes):
if double and not soft:
# Update target network every n_update steps
if episode % n_update == 0:
if double and soft:

# Reset state
state = env.reset()
done = False
total = 0

while not done:
# Implement greedy search policy
if random.random() < epsilon:
action = env.action_space.sample()
q_values = model.predict(state)
action = torch.argmax(q_values).item()

# Take action and add reward to total
next_state, reward, done, _ = env.step(action)

# Update total and memory
total += reward
memory.append((state, action, next_state, reward, done))
q_values = model.predict(state).tolist()

if done:
if not replay:
q_values[action] = reward
# Update network weights
model.update(state, q_values)
if replay:
# Update network weights using replay memory
model.replay(memory, replay_size, gamma)
# Update network weights using the last step only
q_values_next = model.predict(next_state)
q_values[action] = reward + gamma * torch.max(q_values_next).item()
model.update(state, q_values)
state = next_state

# Update epsilon
epsilon = max(epsilon * eps_decay, 0.01)
plot_res(final, title)
return final

The most straightforward agent updates its Q-values based on its most recent observation. It doesn’t have any memory, but it learns by first exploring the environment and then gradually decreasing its epsilon value to make informed decisions. Let’s evaluate the performance of such an agent:

Deep Q Learning

The graph above shows that the performance of the agent has significantly improved. It got to 175 steps, which, as we’ve seen before, is impossible for a random agent. The trend line is also positive, and we can see that the performance increases over time. At the same time, the agent didn’t manage to get above the goal line after 150 epochs, and its average performance is still around 15 steps, so there is enough room for improvement.

Replay Memory

The approximation of 𝑄 using one sample at a time is not very effective. The graph above is a nice illustration of that. The network managed to achieve a much better performance compared to a random agent. However, it couldn’t get to the threshold line of 195 steps. I implemented experience replay to improve network stability and make sure previous experiences are not discarded but used in training.

Experience replay stores the agent’s experiences in memory. Batches of experiences are randomly sampled from memory and are used to train the neural network. Such learning consists of two phases — gaining experience and updating the model. The size of the replay controls the number of experiences that are used for the network update. Memory is an array that stores the agent’s state, reward, and action, as well as whether the action finished the game and the next state.

# Expand DQL class with a replay function.
class DQN_replay(DQN):
def replay(self, memory, size, gamma=0.9):
""" Add experience replay to the DQN network class. """
# Make sure the memory is big enough
if len(memory) >= size:
states = []
targets = []
# Sample a batch of experiences from the agent's memory
batch = random.sample(memory, size)

# Extract information from the data
for state, action, next_state, reward, done in batch:
# Predict q_values
q_values = self.predict(state).tolist()
if done:
q_values[action] = reward
q_values_next = self.predict(next_state)
q_values[action] = reward + gamma * torch.max(q_values_next).item()
targets.append(q_values)self.update(states, targets)
DQL with Replay

As expected, the neural network with the replay seems to be much more robust and smart compared to its counterpart that only remembers the last action. After approximately 60 episodes, the agent managed to achieve the winning threshold and remain at this level. It also managed to achieve the highest reward possible — 500.

Double Deep Q Learning

Traditional Deep Q Learning tends to overestimate the reward, which leads to unstable training and lower quality policy. Let’s consider the equation for the Q value:

The Bellman Equation. Source: Link

The last part of the equation takes the estimate of the maximum value. This procedure results in systematic overestimation, which introduces a maximization bias. Since Q-learning involves learning estimates from estimates, such overestimation is especially worrying.

To avoid such a situation, I will define a new target network. The Q values will be taken from this new network, which is meant to reflect the state of the main DQN. However, it doesn’t have identical weights because it’s only updated after a certain number of episodes. This idea has been first introduced in Hasselt et al., 2015.
The addition of the target network might slow down the training since the target network is not continuously updated. However, it should have a more robust performance over time.

n_update from the q_learning loop specifies the interval after which the target network should be updated.

class DQN_double(DQN):
def __init__(self, state_dim, action_dim, hidden_dim, lr):
super().__init__(state_dim, action_dim, hidden_dim, lr) = copy.deepcopy(self.model)

def target_predict(self, s):
''' Use target network to make predicitons.'''
with torch.no_grad():

def target_update(self):
''' Update target network with the model weights.'''

def replay(self, memory, size, gamma=1.0):
''' Add experience replay to the DQL network class.'''
if len(memory) >= size:
# Sample experiences from the agent's memory
data = random.sample(memory, size)
states = []
targets = []
# Extract datapoints from the data
for state, action, next_state, reward, done in data:
q_values = self.predict(state).tolist()
if done:
q_values[action] = reward
# The only difference between the simple replay is in this line
# It ensures that next q values are predicted with the target network.
q_values_next = self.target_predict(next_state)
q_values[action] = reward + gamma * torch.max(q_values_next).item()
targets.append(q_values)self.update(states, targets)
Double DQL with Replay

Double DQL with replay has outperformed the previous version and has consistently performed above 300 steps. The performance also seems to be a bit more stable, thanks to the separation of action selection and evaluation. Finally, let’s explore the last modification to the DQL agent.

Soft Target Update

The method used to update the target network implemented above was introduced in the original DQN paper. In this section, we will explore another well-established method of updating the target network weights. Instead of updating weights after a certain number of steps, we will incrementally update the target network after every run using the following formula:

target_weights = target_weights * (1-TAU) + model_weights * TAU

where 0 < TAU < 1

This method of updating the target network is known as “soft target network updates” and was introduced in Lillicrap et al., 2016. This method implementation is shown below:

class DQN_double_soft(DQN_double):
def target_update(self, TAU=0.1):
''' Update the targer gradually. '''
# Extract parameters
model_params = self.model.named_parameters()
target_params =

updated_params = dict(target_params)
for model_name, model_param in model_params:
if model_name in target_params:
# Update parameter
updated_params[model_name].data.copy_((TAU)* + (1-TAU)*target_params[model_param].data)
DDQL with Soft Update

The network with soft target updates performed quite well. However, it doesn’t seem to be better than hard weight updates.

This gif illustrates the performance of a trained agent:

Trained Agent


The implementation of the experience replay and the target network have significantly improved the performance of a Deep Q Learning agent in the Open AI CartPole environment. Some other modifications to the agent, such as Dueling Network Architectures (Wang et al., 2015), can be added to this implementation to improve the agent’s performance. The algorithm is also generalizable to other environments. Feel free to test how well it solves other tasks!

Link to the notebook:


(1) Reinforcement Q-Learning from Scratch in Python with OpenAI Gym. (2019). Retrieved 9 December 2019, from

(2) Paszke, A., (2019). Reinforcement Learning (DQN) tutorial. Retrieved from:

(3) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., … & Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.

(4) Van Hasselt, H., Guez, A., & Silver, D. (2016, March). Deep reinforcement learning with double q-learning. In Thirtieth AAAI conference on artificial intelligence.

(5) Wang, Z., Schaul, T., Hessel, M., Van Hasselt, H., Lanctot, M., & De Freitas, N. (2015). Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581.

(6) Double DQN Implementation to Solve OpenAI Gym’s CartPole v-0. (2019). Medium. Retrieved 20 December 2019, from