Writing Fast Deep Q Learning Pipelines on Commodity Hardware

Original article was published by llucid on Deep Learning on Medium


The state of the art in Deep Reinforcement Learning has been ramping up in scale over time, and it's becoming ever more difficult to reproduce it on commodity hardware.

Previous works have shown that with enough optimization, patience, and time, we can get pretty close. To this end, I began to study how to write efficient training pipelines with the goal of zero downtime: the GPU must never stall waiting for data (on both the training and inference ends), and the pipeline must be able to sustain the full throughput the system is capable of.

How NOT to Write Pipelines

Most DeepRL baselines are written in a synchronous form.

There are benefits to this:
1) Simplicity: the baseline is just that: a baseline. It typically keeps the code as simple as possible to get the salient points across. Writing fast code often means making code less readable and more error-prone.
2) Benchmarks: keeping things sequential makes it easy to run good apples-to-apples comparisons of how many [samples | episodes | training-steps | <insert-metric-here> ] algorithms take compared to one another. When each stage is instead left to run as fast as it can, disparities can be due merely to how fast one stage processes data compared to another.

But the problem is performance: inference, learning, and environment rollouts all block each other because there are data dependencies between them. Prior art has shown these dependencies can be weakened enough to let us run all three as separate processes, each simply executing as fast as it can.
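To make the blocking concrete, here is a minimal sketch of the synchronous form (the function names `env_step`, `infer`, and `train_step` are hypothetical stand-ins, not from any particular baseline):

```python
import random
from collections import deque

def infer(obs):
    """Stand-in for a forward pass through the Q-network."""
    return 0

def env_step(action):
    """Stand-in for one environment step: (next_obs, reward, done)."""
    return random.random(), random.random(), False

def train_step(batch):
    """Stand-in for one gradient update; returns a dummy loss."""
    return sum(reward for _, reward, _ in batch) / len(batch)

replay = deque(maxlen=10_000)
obs = 0.0
for step in range(100):
    action = infer(obs)                        # inference blocks the env step...
    next_obs, reward, done = env_step(action)  # ...the env blocks training...
    replay.append((obs, reward, next_obs))
    if len(replay) >= 32:
        batch = random.sample(replay, 32)      # ...and training blocks the next rollout
        loss = train_step(batch)
    obs = next_obs
```

Each stage sits idle while the others run: the GPU stalls during environment steps, and the environment stalls during gradient updates.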

Isolating the Learner Process: APEX

One exemplar of isolating the training process is Horgan et al.'s APEX DQN.

  1. The replay memory is instantiated asynchronously. It exposes an API that lets other processes add transitions to it and sample from it.
  2. The Inference & Environment steps still run sequentially in their own process (the “Actor Process”). They queue data into the replay process.
  3. The Training process is instantiated asynchronously too. It runs in an infinite loop, simply getting minibatches from the Replay as fast as it can and training on them.
  4. The Training (or “Learner”) process pipes new network parameters back to the actor process periodically.
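The four steps above can be sketched as three loops wired together by queues. This is a simplified sketch, not the reference implementation: threads stand in for APEX's separate processes, all names are hypothetical, and the "gradient step" is a dummy computation.

```python
import random
import threading
import queue
from collections import deque

transitions = queue.Queue()            # actor  -> replay
minibatches = queue.Queue(maxsize=4)   # replay -> learner
new_params = queue.Queue()             # learner -> actor

def actor(n_steps):
    params = 0
    for t in range(n_steps):
        while not new_params.empty():
            params = new_params.get()          # 4. pick up fresh weights
        transitions.put((t, random.random()))  # 2. queue data into the replay

def replay(n_batches):
    memory = deque(maxlen=10_000)
    served = 0
    while served < n_batches:
        while not transitions.empty():
            memory.append(transitions.get())   # 1. add from the actor
        if len(memory) >= 32:
            minibatches.put(random.sample(memory, 32))  # 1. serve samples
            served += 1

def learner(n_batches, losses):
    for i in range(n_batches):
        batch = minibatches.get()              # 3. get batches as fast as possible
        losses.append(sum(r for _, r in batch) / len(batch))  # dummy update
        if i % 10 == 0:
            new_params.put(i)                  # 4. pipe parameters back

losses = []
threads = [
    threading.Thread(target=actor, args=(1000,)),
    threading.Thread(target=replay, args=(50,)),
    threading.Thread(target=learner, args=(50, losses)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The key property is that no loop waits on another except through the queues: the actor never waits for a gradient step, and the learner never waits for a rollout.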

This lets us treat training the same way we do in standard deep learning, and the same pipelining tricks that are standard there apply here too:

To ensure the GPU never stalls waiting for data, we parallelize data loading in the Trainer process: minibatches from the Replay are asynchronously copied to the GPU while the training step runs, so they are always ready for use. We use the same trick to get parameters off the GPU without slowing training down.
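A minimal sketch of that prefetching trick, using a background thread and a small bounded queue as a double buffer (all names here are hypothetical; in a real PyTorch pipeline the fetch would also kick off a pinned-memory, `non_blocking=True` host-to-device copy):

```python
import random
import threading
import queue

def fetch_batch():
    # Stand-in for sampling from the replay. In a real pipeline this would
    # also start the asynchronous copy to the GPU, so the transfer overlaps
    # with the current training step.
    return [random.random() for _ in range(32)]

def prefetcher(out_q, n_batches):
    for _ in range(n_batches):
        out_q.put(fetch_batch())   # runs ahead of the training loop

N = 20
buf = queue.Queue(maxsize=2)       # at most 2 batches in flight
threading.Thread(target=prefetcher, args=(buf, N), daemon=True).start()

losses = []
for _ in range(N):
    batch = buf.get()              # batch is (usually) already waiting
    losses.append(sum(batch) / len(batch))  # stand-in for the training step
```

The bounded queue is what makes this a pipeline rather than a buffer blowout: the prefetcher stays at most two batches ahead, so memory use is constant while the training loop never waits on a fetch.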

[Future Note]: We can also use tricks like virtualization to expand the capacity of the replay memory, and pipeline it aggressively so we don’t lose throughput, but that’s a story for another day.

While this is a huge faff, it leads to a significant speedup in wall-clock time:

Orange: Synchronous Pipeline, Pink: APEX with 3 actor workers

The speedup comes both from the increase in training speed and from how quickly we gather data. With aggressive pre-fetching and tuned batch sizes, we can already max out GPU utilization for training, but there are still improvements to be made on the inference side.