Training Dual Reinforcement Learning Agents to learn to play Tennis

Source: Deep Learning on Medium

In the previous post on Reinforcement Learning, I went over value-based methods for learning to act in an environment. In this follow-up post, I will explore another family of Reinforcement Learning methods, called Actor Critic models, which combine value-based models with policy-based models to learn the environment more effectively.

We will train two separate agents to learn how to play a game of Table Tennis, interacting with each other to pass a ball back and forth without dropping it. The topics covered in this post are:

  1. Background of Actor Critic Models
  2. The DDPG Reinforcement Learning agent
  3. Implementation
  4. Training and Results
Table Tennis playing with Trained Agents

Actor Critic Models

In the previous post, we discussed Value Based models. In a nutshell, with a Value Based model we train a Deep Q Network that learns a mapping from states to action values (playing the role the Q Table plays in tabular methods), and then act greedily to maximize value. However, Value Based models may not always be optimal, which is why Policy Based models are preferred in some cases.

In a Policy Based model, instead of learning state-action values, the model learns a probability distribution over all possible actions directly. An action with a lower estimated value can still be selected with some probability, whereas a purely greedy value-based model would never choose it. One of the best known Policy Based models is the REINFORCE algorithm.
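As a toy illustration of the difference (all the numbers below are made up), a policy-based model outputs action probabilities directly and samples from them, rather than greedily taking the argmax of learned values:

```python
import numpy as np

def softmax(logits):
    # Stabilise the exponential before normalising to probabilities.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.5])  # raw policy scores for 3 actions
probs = softmax(logits)

# Policy-based selection: sample an action in proportion to its probability,
# so lower-scoring actions still get chosen occasionally.
action = rng.choice(len(probs), p=probs)
```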

Actor Critic models are a combination of Value Based and Policy Based models. An Actor model (the policy side) learns to map states to actions, and a Critic model (the value side) evaluates the actions selected by the Actor by learning the value of state-action pairs.
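The split can be sketched with two tiny stand-in networks (single linear layers here, purely for illustration, not the project's actual architecture): the actor maps a state to an action, and the critic scores the state-action pair:

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim = 8, 2

# Hypothetical single-layer stand-ins for the two networks.
W_actor = rng.normal(size=(action_dim, state_dim)) * 0.1
W_critic = rng.normal(size=(1, state_dim + action_dim)) * 0.1

def actor(state):
    # Policy side: map a state to an action (tanh keeps it in [-1, 1]).
    return np.tanh(W_actor @ state)

def critic(state, action):
    # Value side: score the (state, action) pair with a single Q estimate.
    return (W_critic @ np.concatenate([state, action])).item()

state = rng.normal(size=state_dim)
action = actor(state)            # actor proposes an action
q_value = critic(state, action)  # critic evaluates that proposal
```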

Actor Critic Model

DDPG Learning Algorithm

The DDPG Algorithm combines ideas from two existing Reinforcement Learning algorithms: DQN (Deep Q Networks) and DPG (Deterministic Policy Gradients).

DDPG builds on the actor-critic formulation of the DPG algorithm, but unlike DPG, the Deep Deterministic Policy Gradient algorithm uses neural networks to approximate both the policy (Actor) and value (Critic) functions. The advantage of DDPG is that it can handle high-dimensional, continuous action spaces, where the non-linear relationships between actions and states are better learned by neural networks.

The DDPG Algorithm

The DDPG Algorithm also takes several things from the DQN algorithm. As shown in the pseudocode above, DDPG uses a Replay Buffer to sample experience tuples in batches and continuously retrain the models. The Replay Buffer helps because consecutive experiences from an environment are strongly correlated; sampling random batches breaks that correlation and lets the model learn from more independent data.
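A minimal replay buffer along these lines might look like the following (class and field names are my own, not necessarily those in the project's code):

```python
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", "state action reward next_state done")

class ReplayBuffer:
    """Fixed-size memory; uniform random sampling breaks the temporal
    correlation of sequential experience."""

    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)  # old entries fall off the end

    def add(self, *args):
        self.memory.append(Experience(*args))

    def sample(self, batch_size):
        # Draw a random batch without replacement.
        return random.sample(self.memory, batch_size)

buffer = ReplayBuffer()
for t in range(50):
    buffer.add([t], [0.0], 0.1, [t + 1], False)

batch = buffer.sample(8)
```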

The second idea borrowed from DQN is Target Networks. The DDPG Algorithm keeps a separate set of target networks, one each for the Actor and the Critic. These target networks slowly track the training (local) networks, acting as a time-delayed copy. Using the target networks for action-value estimation makes the results more stable and convergence more consistent.
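The "slowly track" part is the soft update, which can be sketched as follows, assuming each network's weights are held in a plain dictionary and using an illustrative blending factor `tau` (the post's actual value may differ):

```python
import numpy as np

TAU = 1e-3  # illustrative value, not necessarily the one used in the post

def soft_update(local_params, target_params, tau=TAU):
    # Nudge each target parameter a small step toward its local counterpart:
    # theta_target <- tau * theta_local + (1 - tau) * theta_target
    for name in target_params:
        target_params[name] = tau * local_params[name] + (1 - tau) * target_params[name]
    return target_params

local = {"w": np.ones((2, 2))}
target = {"w": np.zeros((2, 2))}
target = soft_update(local, target)  # target moves only slightly toward local
```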


How the Agent is implemented

For this project, instead of training a single agent to learn the environment, a multi-agent setup was used. Two separate DDPG agents, each with its own Actor and Critic, learn the Table Tennis environment together. However, to speed up convergence, the Replay Buffer is shared between the two agents, so both can take advantage of all prior experience.

In this part of the post, I will go over the implementation, walk through the code for the DDPG Algorithm, and explain how training is done. Code for the implementation of the Actor and Critic models themselves can be found on GitHub here.

There is a lot of code here, but it is largely self-explanatory and follows the pseudo-code above. The Agent has an Actor and a Critic.

The purpose of the Agent is to implement two methods: Act and Learn. Act selects an action for a given state, and Learn updates the models using stored experience. Apart from the models, we instantiate a Replay Buffer, a noise generator, and separate target networks for both the Actor and the Critic.
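The Act side can be sketched like this. The Ornstein-Uhlenbeck process shown is the exploration noise commonly paired with DDPG; the parameter values and names here are illustrative, not necessarily the ones used in this project:

```python
import numpy as np

rng = np.random.default_rng(0)

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated exploration noise."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.state = np.full(size, mu)

    def sample(self):
        # Drift back toward the mean, plus a random diffusion term.
        dx = self.theta * (self.mu - self.state) + self.sigma * rng.normal(size=len(self.state))
        self.state = self.state + dx
        return self.state

noise = OUNoise(size=2)

def act(state, actor_forward, add_noise=True):
    # Deterministic policy output plus exploration noise,
    # clipped to the environment's valid action range.
    action = actor_forward(state)
    if add_noise:
        action = action + noise.sample()
    return np.clip(action, -1.0, 1.0)

# Placeholder actor that always outputs zeros, just to exercise the method.
a = act(np.zeros(4), lambda s: np.zeros(2))
```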

As the step method shows, each new experience tuple is stored in the Replay Buffer. In the learn method, the Actor and Critic models are updated with experiences sampled from the buffer. During learning, the Bellman equation is used to compute the Critic's training targets, and the Actor is then updated to maximize the Critic's estimated value of the actions it proposes. Finally, a soft update is applied to the target networks, blending in a small fraction of the local networks' weights.
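The Bellman backup behind the Critic's training target can be written out as a small worked example (`GAMMA` is an illustrative discount factor, and the reward and Q numbers are fabricated):

```python
import numpy as np

GAMMA = 0.99  # illustrative discount factor

def critic_targets(rewards, q_next, dones, gamma=GAMMA):
    # Bellman backup: y = r + gamma * Q_target(s', a'),
    # with the bootstrap term zeroed out at terminal states.
    return rewards + gamma * q_next * (1.0 - dones)

rewards = np.array([0.1, 0.0, 0.1])
q_next  = np.array([1.0, 2.0, 3.0])  # target-network estimates for s'
dones   = np.array([0.0, 0.0, 1.0])  # third transition ends the episode

y = critic_targets(rewards, q_next, dones)
# y[2] is just the reward, since the episode terminated there.
```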

Multi Agent Training

We are training two separate agents, so to make sure they learn together, multiple agents are instantiated and a shared Replay Buffer is passed to both. In training, for each episode, the agents act on states returned by the environment. It is an iterative process, and scores are accumulated at each time step. At the end of an episode, the environment is reset to its starting position.
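The setup can be sketched as follows, with dummy agents and rewards standing in for the real environment; the point is only that both agents write into one shared replay memory while per-agent scores accumulate each time step:

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(0)
shared_buffer = deque(maxlen=100_000)  # one replay memory for both agents

class Agent:
    """Minimal stand-in for a DDPG agent (illustrative only)."""

    def __init__(self, buffer):
        self.buffer = buffer

    def act(self, state):
        return np.clip(rng.normal(size=2), -1, 1)  # placeholder policy

    def step(self, experience):
        self.buffer.append(experience)  # both agents append here

agents = [Agent(shared_buffer), Agent(shared_buffer)]

# One illustrative episode: each agent acts on its own observation,
# scores accumulate per time step, and all experience is pooled.
states = [np.zeros(4), np.zeros(4)]
scores = np.zeros(2)
for t in range(20):
    for i, agent in enumerate(agents):
        action = agent.act(states[i])
        reward = 0.01  # dummy reward
        agent.step((states[i], action, reward))
        scores[i] += reward
```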

To meet the win condition, we track the maximum of the two agents' scores over 100 consecutive episodes. When the average of that maximum score over the window reaches 0.5, the environment is considered solved.
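That stopping criterion can be sketched with a rolling window of the last 100 episode scores (the per-episode scores fed in below are fabricated for illustration):

```python
from collections import deque

scores_window = deque(maxlen=100)  # most recent 100 episode scores

def record_episode(agent_scores):
    # The episode score is the better of the two agents' scores.
    scores_window.append(max(agent_scores))
    # Solved once the window is full and its average reaches 0.5.
    full = len(scores_window) == scores_window.maxlen
    return full and sum(scores_window) / len(scores_window) >= 0.5

solved = False
for episode in range(150):
    solved = record_episode([0.6, 0.4])  # dummy per-agent scores
```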

Training and Results

To train the Agent, the following Hyperparameter configurations were set:

Hyperparameter Configurations

The learning rates and hidden layer sizes for both networks had to be tuned for training to converge. Adding Batch Normalization layers to the two models vastly improved results.

With this configuration, the target average reward was reached in 2,328 episodes.

Training results

As the results above show, the agents' score started to pick up after about 1,500 episodes and jumped sharply around 500 episodes later. I let the agents keep running after reaching the target, and the score developed some periodicity.

Playing with the Agent

With training complete, the agents were left to play the game against each other. As seen here, they are able to hit the ball back and forth without letting it fall. It is by no means perfect, but for agents that learned the environment from scratch, the play is impressive.


All code for this project is available on GitHub at

Licensing, Authors, Acknowledgements

Credit to Udacity for providing the data and environment. You can find the Licensing for the data and other descriptive information from Udacity. This code is free to use.