3. Double DQN, CartPole-v0, reward 195 is achieved in episode 612.

4. Double DQN, CartPole-v1, reward 475 is achieved in episode 1030.

Choice of hyperparameter Mε

If Mε is set too large, then for a long time ε remains high (> ε_m), i.e., the agent acts with a high probability of exploration. In other words, for a long time actions are chosen without using the information accumulated in the neural network. This means that, choosing between moving left or right, we can be wrong in about half the cases for a very long time.

If Mε is set too small, then ε quickly drops to its minimum value (= ε_m), i.e., the agent acts with a high probability of exploitation. This can be very bad in the early stages of training, because the action chosen by argmax comes from a neural network that is still very crude; in many cases the chosen action will be wrong.

Conclusion
In developing the DQN and Double DQN algorithms, three mechanisms were introduced to combat correlations and overestimation: (1) separate target and local networks, (2) the experience replay mechanism, and (3) decoupling action selection from action evaluation. All three rely heavily on two interrelated neural networks.

Appendix. A bit about PyTorch tensors
with torch.no_grad()

The PyTorch context manager no_grad() excludes the enclosed operations from gradient computation. It is used when we are sure that back-propagation will not be performed. Disabling gradient tracking reduces memory consumption, see get_action().

A similar effect can be achieved with the detach() function. The with statement is Python's concise way of writing code that would otherwise need try...finally blocks: it guarantees that gradient tracking is restored when the block exits.
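A minimal sketch of both effects, using a stand-in linear layer in place of the article's Q-network (the layer sizes here are assumptions for illustration):

```python
import torch
import torch.nn as nn

net = nn.Linear(4, 2)        # stand-in for the local Q-network (assumed sizes)
state = torch.randn(1, 4)    # a CartPole-like state of 4 features

# Inside no_grad(), the forward pass builds no autograd graph,
# so action selection costs no gradient memory.
with torch.no_grad():
    q_values = net(state)
assert q_values.requires_grad is False

# detach() achieves a similar effect on an already-computed tensor:
q_detached = net(state).detach()
assert q_detached.requires_grad is False
```

Outside the with block, gradient tracking is automatically back on, which is exactly the try...finally guarantee mentioned above.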

optim.zero_grad()

This call clears the old gradients from the last step; otherwise the gradients would simply accumulate across successive loss.backward() calls.
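The accumulation can be demonstrated on a single scalar parameter; this toy example is mine, not the article's training loop:

```python
import torch

w = torch.ones(1, requires_grad=True)
optim = torch.optim.SGD([w], lr=0.1)

# Two backward() calls without zero_grad(): gradients accumulate.
(2 * w).sum().backward()
(2 * w).sum().backward()
assert w.grad.item() == 4.0   # 2 + 2, accumulated

optim.zero_grad()             # clears the stale gradients
(2 * w).sum().backward()
assert w.grad.item() == 2.0   # the fresh gradient only
```

This is why zero_grad() is called once per training step, before loss.backward().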

view(1,1)

This function returns a new tensor with the same data as the original tensor but a different shape. If we remove view(1,1) from get_action(), the two branches of get_action() return action tensors of different shapes.

Then in the learn() function, batch.action consists of tensors of various shapes, and the computation fails. The function view(1,1) changes the shape from tensor([a]) to tensor([[a]]).

The parameters 1,1 give the number of elements in each dimension. For example, view(1,1,1,1,1) produces tensor([[[[[a]]]]]).
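A quick sketch of the shape change described above (the value 5 is an arbitrary stand-in for an action index):

```python
import torch

a = torch.tensor([5])        # shape [1], like one branch of get_action()
b = a.view(1, 1)             # shape [1, 1], matching the other branch
assert a.shape == (1,) and b.shape == (1, 1)
assert b[0, 0] == a[0]       # same data, only the shape differs

c = a.view(1, 1, 1, 1, 1)    # tensor([[[[[5]]]]])
assert c.shape == (1, 1, 1, 1, 1)
```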

torch.cat

This function concatenates the given tuple of tensors into a single tensor. For example, in the learn() function, batch.state is a tuple of 64 tensors of shape [1,4]. torch.cat transforms this tuple into a single tensor states of shape [64,4] as follows:

states = torch.cat(batch.state)
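The same transformation can be reproduced with random stand-in states (the batch size 64 and state size 4 are taken from the article; the random data is mine):

```python
import torch

# 64 state tensors of shape [1, 4], as stored in the replay buffer
batch_state = tuple(torch.randn(1, 4) for _ in range(64))

states = torch.cat(batch_state)
assert states.shape == (64, 4)   # one tensor, stacked along dimension 0
```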
reshape(-1)

Why do we use reshape(-1) to compute the Q_targets_next tensor (see Table 2)? In the learn() function we compare two tensors: Q_targets.unsqueeze(1) and Q_expected. Without the reshape, Table 3 shows that these tensors have different shapes, and the comparison fails.

Table 3 Shapes of tensors compared in learn() function
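A sketch of the shape bookkeeping, assuming the shapes shown above (the random values and the discount factor 0.99 are my stand-ins for the article's actual data):

```python
import torch

BATCH = 64  # batch size used in the article's learn() function

q_next = torch.randn(BATCH, 1)           # target-network output, shape [64, 1]
Q_targets_next = q_next.reshape(-1)      # flattened to shape [64]

rewards = torch.randn(BATCH)
Q_targets = rewards + 0.99 * Q_targets_next   # shape [64]
Q_expected = torch.randn(BATCH, 1)            # shape [64, 1]

# unsqueeze(1) turns [64] into [64, 1], so both sides of the loss match:
assert Q_targets.unsqueeze(1).shape == Q_expected.shape
```

Without reshape(-1), Q_targets would inherit the shape [64, 1] and unsqueeze(1) would then produce [64, 1, 1], which no longer matches Q_expected.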
For other Deep Reinforcement Learning projects, see my github directory. For the interrelations between the Bellman equation and neural networks, see my previous paper. The same article provides more tips on PyTorch.
