The CartPole is a binary classification problem
The dimension of the CartPole observation space is 4, since four features form the input: the cart coordinate, the cart velocity, the pole's angle from vertical, and its derivative (the pole's "falling" velocity). The CartPole is a binary classification problem because at each time step the agent chooses between moving left and moving right. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every time step that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.
The environment CartPole-v0 is considered solved if the average reward is greater than 195 over 100 consecutive trials; CartPole-v1 is considered solved if the average reward is greater than 475 over 100 consecutive trials.
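The "solved" criterion can be written as a short check over the trailing episode rewards. This is a minimal sketch of my own (the helper name `is_solved` is not from the article):

```python
def is_solved(episode_rewards, threshold, window=100):
    """Return True if the mean reward over the last `window`
    consecutive episodes exceeds `threshold`."""
    if len(episode_rewards) < window:
        return False
    return sum(episode_rewards[-window:]) / window > threshold

# CartPole-v0 uses threshold 195; CartPole-v1 uses 475.
print(is_solved([200.0] * 100, 195))  # True
print(is_solved([100.0] * 100, 195))  # False
```

During training one would call this after every episode on the running reward history and stop when it returns True.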
Training experiments for CartPole-v0 and CartPole-v1
Here are the results of my experiments with DQN and Double DQN, obtained during training on CartPole-v0 and CartPole-v1. For all cases, LEARNING_RATE = 0.001. The greedy parameter ε is decreased over training down to its minimal value ε_m; for both CartPole-v0 and CartPole-v1, we put ε_m = 0.01.

1. DQN, CartPole-v0: reward 195 is achieved in episode 962.
2. DQN, CartPole-v1: reward 475 is achieved in episode 1345.

For CartPole-v0 we put Mε = 200; for CartPole-v1, we put Mε = 150. Recall that Mε is the number of the episode at which ε reaches the minimal value ε_m.

3. Double DQN, CartPole-v0: reward 195 is achieved in episode 612.
4. Double DQN, CartPole-v1: reward 475 is achieved in episode 1030.
Choice of hyperparameter Mε
If Mε is set too large, then the choice of action will for a long time be performed under conditions of high probability of exploration (ε > ε_m). In other words, for a long time the choice of action will be carried out without the information accumulated in the neural network. This means that, choosing between moving left and moving right, we can be mistaken in half the cases for a very long time.
If Mε is set too small, then the choice of action will for a long time be performed under conditions of high probability of exploitation (ε = ε_m). This can be very bad in the early stages of neural network training, because the choice of action using argmax will be made from a neural network that is still very crude; then in many cases the chosen action will be mistaken.
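To make the role of Mε concrete, here is a sketch of my own of an ε schedule that reaches ε_m at episode Mε and stays there. The linear decay and the starting value of 1.0 are assumptions for illustration; the article's exact schedule may differ:

```python
def epsilon(episode, m_eps, eps_start=1.0, eps_min=0.01):
    """Linearly anneal epsilon from eps_start down to eps_min
    over the first m_eps episodes, then hold it at eps_min."""
    if episode >= m_eps:
        return eps_min
    return eps_start - (eps_start - eps_min) * episode / m_eps

# With M_eps = 200 (the CartPole-v0 setting), epsilon hits its floor
# at episode 200 and exploration stays at 1% thereafter.
print(epsilon(0, 200))    # 1.0
print(epsilon(200, 200))  # 0.01
print(epsilon(500, 200))  # 0.01
```

A larger Mε stretches the decaying segment (more random actions for longer); a smaller Mε collapses it (early reliance on a still-crude network).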
In developing the DQN and Double DQN algorithms, three mechanisms were introduced in the fight against correlations and overestimation: (1) a pair of target and local networks, (2) the experience replay mechanism, and (3) decoupling action selection from action evaluation. These mechanisms were developed with substantial use of two interrelated neural networks.
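Mechanism (3) can be sketched in PyTorch as follows. The network and variable names here are illustrative, not the article's code: in Double DQN the local network selects the next action via argmax, while the target network evaluates that action:

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for the two interrelated networks;
# the article's architecture may differ.
q_local = nn.Linear(4, 2)   # selects the next action
q_target = nn.Linear(4, 2)  # evaluates the selected action

next_states = torch.randn(64, 4)  # a batch of 64 next states
rewards = torch.randn(64, 1)
dones = torch.zeros(64, 1)        # 1.0 where the episode ended
gamma = 0.99

with torch.no_grad():
    # Decoupling: selection by the local net, evaluation by the target net.
    best_actions = q_local(next_states).argmax(dim=1, keepdim=True)  # [64, 1]
    q_next = q_target(next_states).gather(1, best_actions)          # [64, 1]
    q_targets = rewards + gamma * q_next * (1 - dones)

print(q_targets.shape)  # torch.Size([64, 1])
```

In plain DQN both the argmax and the evaluation would use the same (target) network, which is what produces the overestimation that Double DQN mitigates.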
Appendix. A bit about PyTorch tensors
The PyTorch context manager
no_grad() excludes the enclosed computations from the gradient calculation. It is used when we are confident that back-propagation will not be performed, and it reduces memory consumption; see
get_action(). A similar effect is obtained with the
detach() function. The
with statement makes clear which code
no_grad() applies to. The call optimizer.zero_grad()
clears the old gradients from the last step (otherwise the gradients would simply be accumulated from all previous backward() passes).
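These three points can be demonstrated on a toy tensor (the variable names are mine; `x.grad.zero_()` below plays the role that `optimizer.zero_grad()` plays for all of a model's parameters):

```python
import torch

x = torch.ones(3, requires_grad=True)

# Inside no_grad(), operations are excluded from the gradient calculation.
with torch.no_grad():
    y = x * 2
print(y.requires_grad)  # False

# detach() has a similar effect on a single tensor.
z = (x * 2).detach()
print(z.requires_grad)  # False

# Gradients accumulate across backward() passes unless they are cleared.
x.sum().backward()
x.sum().backward()
print(x.grad)           # tensor([2., 2., 2.]) -- accumulated over two passes

x.grad.zero_()          # optimizer.zero_grad() does this for all parameters
print(x.grad)           # tensor([0., 0., 0.])
```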
The function view() returns a new tensor with the same data as the original tensor but a different shape. If we try to remove view(1,1) from
get_action(), we get action tensors of different shapes in the two branches (explore/exploit) of
get_action(); then in the
learn() function we get
batch.action consisting of tensors of various shapes. This is a failure. The function
view(1,1) changes the shape of the action tensor to [1,1]; the arguments
1,1 give the number of elements in each dimension. For example, view(1,1,1,1,1) produces a tensor of shape [1,1,1,1,1].
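A small demonstration of the shape change (the action value here is arbitrary):

```python
import torch

action = torch.tensor(1)    # a 0-dimensional tensor
print(action.shape)         # torch.Size([])

action = action.view(1, 1)  # one element in each of two dimensions
print(action.shape)         # torch.Size([1, 1])

# view(1, 1, 1, 1, 1): one element in each of five dimensions.
print(torch.tensor(1).view(1, 1, 1, 1, 1).shape)  # torch.Size([1, 1, 1, 1, 1])
```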
torch.cat concatenates the given tuple of tensors into a single tensor. For example, in
learn(),
batch.state is a tuple of 64 tensors, each of shape [1,4]. The function
torch.cat transforms this tuple into a single tensor
states of shape [64,4].
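A minimal sketch of this transformation, with random data standing in for batch.state:

```python
import torch

# A tuple of 64 tensors, each of shape [1, 4], standing in for batch.state.
batch_state = tuple(torch.randn(1, 4) for _ in range(64))

# torch.cat joins them along dimension 0 into a single [64, 4] tensor.
states = torch.cat(batch_state)
print(states.shape)  # torch.Size([64, 4])
```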
Why do we use
reshape(-1) to find the
Q_targets_next tensor (see Table 2)? In the
learn() function we compare two tensors:
Q_targets and Q_expected. If we do not use the
reshape function, then, by Table 3, these tensors have different shapes and the comparison fails.
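The shape mismatch can be reproduced with stand-in tensors (the shapes below are illustrative; the article's Tables 2 and 3 give the actual ones):

```python
import torch

q_targets_next = torch.randn(64, 1)  # stand-in, shape [64, 1]
q_expected = torch.randn(64)         # stand-in, shape [64]

# The shapes [64, 1] and [64] differ, so an elementwise comparison or loss
# would silently broadcast to [64, 64] or fail, instead of matching pairwise.
print(q_targets_next.shape, q_expected.shape)

# reshape(-1) flattens to a 1-D tensor, making the shapes match.
q_targets_next = q_targets_next.reshape(-1)
print(q_targets_next.shape)  # torch.Size([64])
```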
For other Deep Reinforcement Learning projects, see my GitHub repository. For the interrelations between the Bellman equation and neural networks, see my previous paper. The same article provides more tips on PyTorch.
 V. Mnih et al., Playing Atari with Deep Reinforcement Learning (2013), arXiv:1312.5602
 H. van Hasselt et al., Deep Reinforcement Learning with Double Q-learning (2015), arXiv:1509.06461
 A. Karpathy, Deep Reinforcement Learning: Pong from Pixels (2016), karpathy.github.io
 Rubik's Code, Introduction to Double Q-Learning (2020), rubikscode.net
 S. Karagiannakos, Taking Deep Q Networks a Step Further (2018), TheAISummer
 V. Mnih et al., Human-level Control through Deep Reinforcement Learning (2015), Nature
 R. Stekolshchik, How Does the Bellman Equation Work in Deep RL? (2020), TowardsDataScience
 C. Yoon, Double Deep Q-Networks (2019), TowardsDataScience
 S. Thrun and A. Schwartz, Issues in Using Function Approximation for Reinforcement Learning (1993), Carnegie Mellon University, The Robotics Institute
 F. Mutsch, CartPole with Q-Learning — First Experiences with OpenAI Gym (2017), muetsch.io
 T. Seno, Welcome to Deep Reinforcement Learning Part 1: DQN (2017), TowardsDataScience