A pair of interrelated neural networks in DQN


The CartPole is a binary classification problem

The dimension of the CartPole observation space is 4, since there are 4 features that form the input: the cart position, the cart velocity, the pole's angle from vertical, and the pole's angular velocity (how fast the pole is "falling"). The CartPole task can be viewed as a binary classification problem because at each time step the agent chooses between two actions: push the cart to the left or to the right. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every time step that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.
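
These numbers are easy to verify directly (a sketch, assuming the classic Gym API in which reset() returns only the state and step() returns a 4-tuple):

import gym

env = gym.make('CartPole-v0')
print(env.observation_space.shape)   # (4,): cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space.n)            # 2: push left or push right

state = env.reset()
next_state, reward, done, info = env.step(env.action_space.sample())   # reward is +1 per step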

The environment CartPole-v0 is considered solved when the average reward over 100 consecutive trials exceeds 195; CartPole-v1 is considered solved when the average reward over 100 consecutive trials exceeds 475.
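
In a training loop this criterion can be checked, for example, as follows (scores_window and is_solved are illustrative names, not the actual code):

from collections import deque
import numpy as np

scores_window = deque(maxlen=100)          # total rewards of the last 100 episodes

def is_solved(scores_window, threshold=195.0):   # use 475.0 for CartPole-v1
    # "Solved" means: the average reward over the last 100 consecutive episodes
    # exceeds the threshold.
    return len(scores_window) == 100 and np.mean(scores_window) > threshold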

Training experiments for CartPole-v0 and CartPole-v1

Here are the results of my experiments with DQN and Double DQN, obtained while training on CartPole-v0 and CartPole-v1. In all cases, LEARNING_RATE = 0.001, and the exploration parameter ε of the ε-greedy policy is decayed from 1 to its minimal value ε_m = 0.01.

CartPole with DQN

For both CartPole-v0 and CartPole-v1, we set M_ε = 50.

  1. DQN, CartPole-v0, reward 195 is achieved in episode 962.
  2. DQN, CartPole-v1, reward 475 is achieved in episode 1345.

CartPole with Double DQN

For CartPole-v0 we set M_ε = 200; for CartPole-v1 we set M_ε = 150. Recall that M_ε is the number of the episode at which ε reaches its minimal value ε_m.

3. Double DQN, CartPole-v0, reward 195 is achieved in episode 612.

4. Double DQN, CartPole-v1, reward 475 is achieved in episode 1030.

Choice of the hyperparameter M_ε

If M_ε is set too large, then for a long time actions will be chosen under conditions of a high probability of exploration (ε > ε_m). In other words, for a long time actions will be chosen without using the information accumulated in the neural network. This means that, when choosing between moving left or right, we can be wrong in roughly half of the cases for a very long time.

If M_ε is set too small, then actions will very quickly start being chosen under conditions of a high probability of exploitation (ε = ε_m). This can be very bad in the early stages of training, because the action chosen with argmax comes from a neural network that is still very crude; in many cases the chosen action will be wrong.
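
For reference, here is a minimal sketch of an ε schedule that decays linearly from 1 to ε_m over the first M_ε episodes (the schedule actually used in the training code may differ, e.g. exponential decay; the names below are illustrative):

EPS_START = 1.0      # initial value of ε
EPS_MIN = 0.01       # ε_m, the minimal value of ε
M_EPS = 150          # M_ε: the episode at which ε reaches ε_m

def epsilon_by_episode(i_episode):
    # Linear decay from EPS_START to EPS_MIN over the first M_EPS episodes;
    # from episode M_EPS onward, ε stays at EPS_MIN.
    frac = min(i_episode / M_EPS, 1.0)
    return EPS_START + frac * (EPS_MIN - EPS_START)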

Conclusion

In developing the DQN and Double DQN algorithms, three mechanisms were introduced to fight correlations and overestimation: (1) target and local networks, (2) the experience replay mechanism, and (3) decoupling action selection from action evaluation. All of these mechanisms rely substantially on the use of two interrelated neural networks.
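
To make mechanism (3) concrete, here is a minimal sketch of the Double DQN target computation, in which the local network selects the next action and the target network evaluates it (the function and variable names are illustrative, not the exact code):

import torch

def double_dqn_targets(local_net, target_net, rewards, next_states, dones, gamma=0.99):
    # Decoupling: the local network selects the best next action (argmax),
    # while the target network evaluates that action.
    with torch.no_grad():
        best_actions = local_net(next_states).argmax(dim=1, keepdim=True)     # selection
        next_q = target_net(next_states).gather(1, best_actions).reshape(-1)  # evaluation
    return rewards + gamma * next_q * (1.0 - dones)    # dones is a float 0/1 tensor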

Appendix. A bit about PyTorch tensors

with torch.no_grad()

The PyTorch context manager torch.no_grad() disables gradient computation for the operations performed inside it. It is used when we are sure that back-propagation will not be performed, and it reduces memory consumption; see get_action(). A similar effect can be achieved with the detach() function. The with statement gives a cleaner form of the corresponding try...finally block.
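
Below is a minimal sketch of an ε-greedy get_action() that uses torch.no_grad() in its greedy branch; the signature and variable names are illustrative and may differ from the actual code:

import random
import torch

def get_action(state, local_net, eps, n_actions=2):
    # ε-greedy action selection (a sketch; the article's get_action() may differ).
    if random.random() > eps:
        state_t = torch.tensor(state, dtype=torch.float32).unsqueeze(0)   # shape [1, 4]
        with torch.no_grad():                  # no gradient tracking, less memory
            q_values = local_net(state_t)      # shape [1, n_actions]
        return q_values.argmax(dim=1).view(1, 1)                          # greedy branch
    return torch.tensor([[random.randrange(n_actions)]], dtype=torch.long)  # random branch

Note that both branches return a tensor of shape [1, 1], which is exactly what the view(1,1) discussion below is about.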

optim.zero_grad()

clears the old gradients from the previous step (otherwise the gradients would just be accumulated across all loss.backward() calls).
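
A self-contained toy example of a single optimization step (the network and data here are placeholders, not the CartPole model):

import torch
import torch.nn as nn
import torch.optim as optim

net = nn.Linear(4, 2)                              # toy network: 4 inputs, 2 outputs
optimizer = optim.Adam(net.parameters(), lr=0.001)

x, target = torch.randn(64, 4), torch.randn(64, 2)

optimizer.zero_grad()                              # clear the gradients of the previous step
loss = nn.functional.mse_loss(net(x), target)
loss.backward()                                    # compute fresh gradients
optimizer.step()                                   # update the weights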

view(1,1)

This function returns a new tensor with the same data as the original tensor but a different shape. If we try to remove view(1,1) in get_action(), the action tensor gets different shapes in the two branches of get_action(). Then, in the learn() function, batch.action consists of tensors of various shapes, which leads to a failure. The function view(1,1) changes the shape from tensor([a]) to tensor([[a]]). The arguments 1,1 specify the number of elements in each dimension; for example, view(1,1,1,1,1) produces
tensor([[[[[a]]]]]).
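
A quick illustration of these shapes:

import torch

a = torch.tensor([7])            # shape [1]:              tensor([7])
b = a.view(1, 1)                 # shape [1, 1]:           tensor([[7]])
c = a.view(1, 1, 1, 1, 1)        # shape [1, 1, 1, 1, 1]:  tensor([[[[[7]]]]])
print(b.shape, c.shape)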

torch.cat

Concatenates the given tuple of tensors into a single tensor. For example, in the learn() function, batch.state is a tuple of 64 tensors of shape [1, 4]. The function torch.cat transforms this tuple into a single tensor states of shape [64, 4] as follows:

states = torch.cat(batch.state)
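
A self-contained illustration, with random data standing in for batch.state:

import torch

batch_state = tuple(torch.randn(1, 4) for _ in range(64))   # 64 tensors of shape [1, 4]
states = torch.cat(batch_state)                              # concatenation along dim 0
print(states.shape)                                          # torch.Size([64, 4])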

reshape(-1)

Why do we use reshape(-1) to compute the Q_targets_next tensor (see Table 2)? In the learn() function we compare two tensors: Q_targets.unsqueeze(1) and Q_expected. If we do not apply reshape, then, as Table 3 shows, these tensors have different shapes and the comparison fails.

Table 3. Shapes of the tensors compared in the learn() function
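
A sketch of the shape arithmetic with random tensors in place of the real ones (the variable names and the MSE loss are assumptions; the actual learn() function may differ):

import torch
import torch.nn.functional as F

BATCH_SIZE, GAMMA = 64, 0.99
next_q = torch.randn(BATCH_SIZE, 1)      # max Q-values of the target network, shape [64, 1]
rewards = torch.randn(BATCH_SIZE)        # shape [64]
dones = torch.zeros(BATCH_SIZE)          # shape [64]
Q_expected = torch.randn(BATCH_SIZE, 1)  # shape [64, 1]

Q_targets_next = next_q.reshape(-1)      # [64, 1] -> [64]; without this, the next line
                                         # would broadcast to a [64, 64] tensor
Q_targets = rewards + GAMMA * Q_targets_next * (1 - dones)          # shape [64]
loss = F.mse_loss(Q_expected, Q_targets.unsqueeze(1))               # both operands are [64, 1]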

For other Deep Reinforcement Learning projects, see my GitHub repository. For the interrelations between the Bellman equation and neural networks, see my previous article [7]. The same article provides more tips on PyTorch.

References

[1] V. Mnih et al., Playing Atari with Deep Reinforcement Learning (2013), arXiv:1312.5602

[2] H. van Hasselt et al., Deep Reinforcement Learning with Double Q-learning (2015), arXiv:1509.06461

[3] A. Karpathy, Deep Reinforcement Learning: Pong from Pixels (2016), karpathy.github.io

[4] Rubik's Code, Introduction to Double Q-Learning (2020), rubikscode.net

[5] S. Karagiannakos, Taking Deep Q Networks a step further (2018), TheAISummer

[6] V. Mnih et al., Human-level control through deep reinforcement learning (2015), Nature

[7] R. Stekolshchik, How does the Bellman equation work in Deep RL? (2020), TowardsDataScience

[8] C. Yoon, Double Deep Q-Networks (2019), TowardsDataScience

[9] S. Thrun and A. Schwartz, Issues in Using Function Approximation for Reinforcement Learning (1993), Carnegie Mellon University, The Robotics Institute

[10] F. Mutsch, CartPole with Q-Learning — First experiences with OpenAI Gym (2017), muetsch.io

[11] T. Seno, Welcome to Deep Reinforcement Learning Part 1: DQN (2017), TowardsDataScience

[12] DQN Part 1: Vanilla Deep Q Networks, TowardsDataScience, https://towardsdatascience.com/dqn-part-1-vanilla-deep-q-networks-6eb4a00febfb