Proximal Policy Optimisation in PyTorch with Recurrent models

Source: Deep Learning on Medium


Proximal Policy Optimisation (PPO) is a policy gradient technique that is relatively straightforward to implement and can develop policies that maximise reward for a wide class of problems [1].

Like other policy gradient methods, PPO can optimise recurrent neural network policies. This is very useful because in many environments the observations do not represent the complete state of the system, but a recurrent model can infer the state from a series of observations. In other words, Partially Observable Markov Decision Processes (POMDPs) can be solved as though they were Markov Decision Processes (MDPs). This approach was used by OpenAI Five to beat the world champions in Dota 2 [2].

At a much smaller scale the utility of this approach is demonstrated in the partially observable environment [3] below:


Training Iteration: 6830, Reward: 328

To further test the approach I tried some more contrived examples. Many of the Gym environments provide observations that include velocity terms, so they are fully observable and can be solved using the observations as states. To test the LSTM’s ability to learn the state, I masked the velocity terms. The results can be seen below:

CartPole-v1 (Masked velocity terms in observations)

Training Iteration: 420, Reward: 500

Pendulum-v0 (Masked velocity terms in observations)

Training Iteration: 660, Reward: -475

LunarLander-v2 (Masked velocity terms in observations)

Training Iteration: 3490, Reward: 280

LunarLanderContinuous-v2 (Masked velocity terms in observations)

Training Iteration: 4480, Reward: 269

BipedalWalker-v2 (Velocity terms not masked)

Training Iteration: 2160, Reward: 310
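
The velocity masking used for the masked results above can be sketched as a small helper. This is a minimal sketch rather than the code from the notebooks: the name `mask_velocity` and the index tuple are hypothetical, it assumes that "masking" means zeroing the entries, and it assumes the classic CartPole observation layout of cart position, cart velocity, pole angle and pole angular velocity.

```python
import numpy as np

# Assumed CartPole layout: [cart position, cart velocity, pole angle, pole angular velocity].
CARTPOLE_VELOCITY_INDICES = (1, 3)


def mask_velocity(observation, velocity_indices):
    """Return a copy of the observation with the velocity entries zeroed."""
    masked = np.array(observation, dtype=np.float32)  # np.array copies by default
    masked[..., list(velocity_indices)] = 0.0  # also works on a batch of observations
    return masked


full_observation = np.array([0.02, -1.3, 0.05, 0.7], dtype=np.float32)
partial_observation = mask_velocity(full_observation, CARTPOLE_VELOCITY_INDICES)
```

The policy only ever sees `partial_observation`, so any velocity information has to be recovered from the sequence of positions and angles.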

Further justification

Hopefully, the above examples demonstrate that recurrent models can help in environments where partial observability is a problem. However, I would go further: I think there are good reasons to use them as the default general method for solving environments:

1. The model can compute the state from the observations for us

Many environments consist of or include a dynamic mechanical system. For such an environment to be an MDP, the observations must include both the position and velocity terms, which is exactly what Gym environments like CartPole, Pendulum and LunarLander provide. Without a recurrent model, and using position-only observations directly as the state, only low scores can be achieved because the policy can’t know whether to brake or accelerate for a given position. For this simple case we know the answer, but as the complexity of the observations grows it becomes increasingly difficult to design an appropriate state and easier to let the model compute it for us. An example of this is BipedalWalkerHardcore, where [3] explains that the walker needs to remember pits it may have seen in previous frames to avoid falling into them.

2. Sometimes the observations are controlled by the agent itself

As the agent moves around in a partially observable world, it may need to learn where to look to get the observations needed for optimal actions.

3. The model can protect us from our ignorance of the problem

Even if we think we have designed a state from the observations that reduces the problem to an MDP, we can make mistakes: we may lack relevant domain knowledge, there may be a bug, or the complexity of the state may be beyond our understanding. For me, this is both intimidating and exciting because the recurrent model can automatically learn to address this for us.
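
To make the first point concrete, here is a small illustration of why a single position is ambiguous while two consecutive positions are not. It is only a hand-written finite difference with hypothetical names and a hypothetical time step; a recurrent policy is never given this formula and has to learn something with a similar effect from the observation history.

```python
def estimate_velocity(previous_position, current_position, dt):
    """Finite-difference velocity from two consecutive position observations."""
    return (current_position - previous_position) / dt


dt = 0.5  # hypothetical time between observations

# Two situations with the same current position (0.0) but opposite motion.
moving_right = estimate_velocity(previous_position=-1.0, current_position=0.0, dt=dt)
moving_left = estimate_velocity(previous_position=1.0, current_position=0.0, dt=dt)
```

A policy that only sees the current position cannot tell these two cases apart, although the right action (brake or accelerate) differs between them.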

Noteworthy Implementation Details

Most of the relevant algorithm details can be obtained from [1] so I’ll focus on some of the engineering details here:

1. Gym Vector implementation

The standard Gym environment runs a single environment at a time. Other RL implementations I’ve seen work around this by running separate processes in parallel. This works but consumes a lot of resources, especially when running inference on multiple separate copies of the model. Gym now features a nice solution to this problem: the environment can be wrapped so that multiple instances of the same environment run together, either synchronously in the same process or asynchronously using multiple processes. Running the policy and critic once on a batch of 32 observations (one per parallel environment) is much faster than running 32 models in parallel, each doing inference on a batch of 1.
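
The synchronous variant of this idea can be sketched with a toy stand-in. The classes below are hypothetical and are not Gym's API (Gym's own wrappers live in `gym.vector`, for example `SyncVectorEnv` and `AsyncVectorEnv`); they only show how stepping several environments in one process yields a single stacked batch of observations for the policy and critic.

```python
import numpy as np


class CounterEnv:
    """Hypothetical toy environment: observes its step count, ends after `horizon` steps."""

    def __init__(self, horizon):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return np.array([0.0], dtype=np.float32)

    def step(self, action):
        self.t += 1
        observation = np.array([float(self.t)], dtype=np.float32)
        reward = float(action)
        done = self.t >= self.horizon
        return observation, reward, done


class SyncEnvBatch:
    """Steps several environments in one process and stacks the results."""

    def __init__(self, envs):
        self.envs = envs

    def reset(self):
        return np.stack([env.reset() for env in self.envs])

    def step(self, actions):
        observations, rewards, dones = [], [], []
        for env, action in zip(self.envs, actions):
            observation, reward, done = env.step(action)
            if done:
                # Restart finished environments so the batch keeps its size.
                observation = env.reset()
            observations.append(observation)
            rewards.append(reward)
            dones.append(done)
        return (
            np.stack(observations),
            np.array(rewards, dtype=np.float32),
            np.array(dones),
        )


env_batch = SyncEnvBatch([CounterEnv(horizon=2), CounterEnv(horizon=3)])
batch_observations = env_batch.reset()  # shape (num_envs, obs_dim)
```

With 32 such environments the model sees one `(32, obs_dim)` array per step instead of 32 separate arrays of shape `(1, obs_dim)`.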

2. Google Colab

The model training and testing were conducted on Google’s Colab. The main reason is that Colab generously offers a free GPU. It’s not without limitations, however: in my experience it will not allow continuous GPU usage for more than about 12 hours, and there are other problems such as disconnections. For training jobs that take multiple days, like BipedalWalkerHardcore-v2, I therefore had to add the ability to save checkpoints to Google Drive and resume from them.
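
A minimal sketch of such save and resume logic in PyTorch is shown below. The function names, the dictionary keys and the temporary-file step are assumptions made for illustration, not the notebook's actual code; the checkpoint path is whatever location in the mounted Drive you choose.

```python
import os

import torch
from torch import nn


def save_checkpoint(path, model, optimiser, iteration):
    """Save model and optimiser state together with the current iteration."""
    temporary_path = path + ".tmp"
    torch.save(
        {
            "model": model.state_dict(),
            "optimiser": optimiser.state_dict(),
            "iteration": iteration,
        },
        temporary_path,
    )
    # Move the file into place only after the save has finished, so an
    # interrupted save leaves the previous checkpoint untouched.
    os.replace(temporary_path, path)


def load_checkpoint(path, model, optimiser):
    """Restore state if a checkpoint exists and return the iteration to resume from."""
    if not os.path.exists(path):
        return 0  # nothing saved yet: start from scratch
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint["model"])
    optimiser.load_state_dict(checkpoint["optimiser"])
    return checkpoint["iteration"]
```

Calling `load_checkpoint` once at the start of the notebook and `save_checkpoint` every few iterations is enough to survive a disconnection with only a small amount of lost work.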

3. State Initialisation

When capturing a trajectory for training, it is easy to initialise the LSTM hidden state and cell state to zero and then iterate through a complete episode, allowing the LSTM to update its own state at each step. At training time, we split the episode into chunks, usually around 8 steps long. We could also initialise the hidden state and cell state of each chunk to zero, but under these conditions the pi / pi_old ratio can be quite different from 1 even with identical policy parameters. The problem shrinks with longer training sequence lengths like 16 or 32, because the initial state has less influence on the final state, and conversely it gets worse for shorter sequence lengths. To mitigate this we can save the LSTM states at rollout time and use them for initialisation at training time. For the first training step in each PPO iteration, these saved hidden and cell states are exact, but as the parameters change they become less accurate. However, they are still a much more accurate initialisation than zero and so produce more accurate gradients, which allows very short sequence lengths to be used.
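
The effect can be reproduced with a bare `nn.LSTM`. The sketch below is an illustration under assumptions, not the notebook's code: the sizes and helper names are made up, and the mismatch in LSTM outputs stands in for the mismatch in the pi / pi_old ratio, since the policy head is omitted. With unchanged parameters, replaying a chunk from the saved state reproduces the rollout outputs, while replaying it from a zero state does not.

```python
import torch
from torch import nn

torch.manual_seed(0)
obs_dim, hidden_dim, episode_len, chunk_len = 3, 16, 32, 8

lstm = nn.LSTM(obs_dim, hidden_dim)             # expects (sequence, batch, feature)
episode = torch.randn(episode_len, 1, obs_dim)  # one recorded episode, batch of 1


def zero_state():
    return (torch.zeros(1, 1, hidden_dim), torch.zeros(1, 1, hidden_dim))


# Rollout: step through the episode, saving the state seen *before* each step.
saved_states, rollout_outputs = [], []
state = zero_state()
with torch.no_grad():
    for t in range(episode_len):
        saved_states.append(state)
        output, state = lstm(episode[t:t + 1], state)
        rollout_outputs.append(output)
rollout_outputs = torch.cat(rollout_outputs)


def replay_chunk(start, initial_state):
    """Recompute the LSTM outputs for one training chunk from a given initial state."""
    output, _ = lstm(episode[start:start + chunk_len], initial_state)
    return output


def max_error(start, initial_state):
    """Largest absolute difference between replayed and rollout outputs."""
    target = rollout_outputs[start:start + chunk_len]
    return (replay_chunk(start, initial_state) - target).abs().max().item()


start = 16  # a chunk from the middle of the episode
saved_init_error = max_error(start, saved_states[start])  # floating-point noise only
zero_init_error = max_error(start, zero_state())          # visibly larger
```

In a real training loop only `saved_states[start]` for each chunk start needs to be stored alongside the observations, and `replay_chunk` is where gradients would flow.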

4. Reward shaping

The received rewards can have a big impact on how PPO solves a problem. In the BipedalWalker and BipedalWalkerHardcore environments reward is given for forward motion, a penalty is applied for torque and a large negative reward (-100) for falling over. I used the same approach as in [4] to limit the reward to a minimum of -1 for the walker environments. The difference in behavior is quite dramatic:

Unshaped reward:

  1. Walker learns to adopt a very stable stance and remains there until the end of the episode.
  2. Walker slowly learns to shuffle forward.
  3. Walker’s gait slowly becomes less conservative and faster moving.

Shaped reward:

  1. Walker quickly learns to fall forward.
  2. The walker’s fall becomes longer and longer as it learns to “catch” itself with its legs.
  3. Walker no longer falls and has found a high-speed gait.

While PPO can solve the problem with either reward, the unshaped reward takes a lot longer and the resulting gait is less satisfying because it is so conservative.
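
The reward floor described above amounts to a single clamp. The helper below is a minimal sketch with a hypothetical name; it assumes the shaping is applied to each step's reward before it is stored for training.

```python
import numpy as np


def shape_reward(reward, minimum=-1.0):
    """Limit the reward from below; works on a scalar or a batch of rewards."""
    return np.maximum(reward, minimum)


step_rewards = np.array([0.3, -0.05, -100.0])  # the last entry is the fall penalty
shaped_rewards = shape_reward(step_rewards)
```

Small torque penalties and forward-motion rewards pass through unchanged; only the -100 for falling is reduced to -1.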

Implementation links & Setup

The implementation consists of two Notebooks. One for training a policy in a chosen environment: recurrent_ppo.ipynb

and one for generating a test video: test_recurrent_ppo.ipynb

The notebooks expect the following directories in Google Drive:




The checkpoints directory is used for saving the model and optimiser parameters so a model can be resumed or tested. The logs directory is used for saving the TensorBoard logs and the videos directory is used for saving the video output from the test notebook.

The saved checkpoints used to generate the results are included here: checkpoints. These could be added to Google cloud to resume training from, to test with, or even to use for transfer learning.

15 Minutes of Trained BipedalWalkerHardcore-v2

This video consists of about 70 episodes. Notice that it still fails fairly often. The mean reward was about 160 when I lost patience training it.


The PPO algorithm together with a recurrent model is a really powerful combination, capable of producing good results on a wide range of problems and of solving both MDPs and POMDPs in continuous and discrete action spaces.


[1] Proximal Policy Optimization Algorithms

[2] Dota 2 with Large Scale Deep Reinforcement Learning

[3] Recurrent Deterministic Policy Gradient Method for Bipedal Locomotion on Rough Terrain Challenge

[4] PPO with LSTM and Parallel processing