Reinforcement Learning frameworks

Original article was published by Jordi TORRES.AI on Artificial Intelligence on Medium

Proximal Policy Optimization using RLlib-Ray

This is post number 20 in the Deep Reinforcement Learning Explained series, devoted to Reinforcement Learning frameworks.

So far, in previous posts, we have been looking at a basic representation of the corpus of RL algorithms (although we have skipped several) that have been relatively easy to program. But from now on, we need to consider both the scale and complexity of the RL algorithms. In this scenario, programming a Reinforcement Learning implementation from scratch can become tedious work with a high risk of programming errors.

To address this, the RL community began to build frameworks and libraries that simplify the development of RL algorithms, both by creating new pieces and, especially, by enabling the combination of various algorithmic components. In this post, we will give a general presentation of those frameworks and solve the previous CartPole problem using the PPO algorithm with RLlib, an open-source Python library built on the Ray framework.


But before continuing, as a motivational example, let’s remember that in the previous post we presented REINFORCE, a Monte Carlo variant of a policy gradient algorithm in Reinforcement Learning. The method collects samples of an episode using its current policy and directly updates the policy parameters. Since one full trajectory must be completed with the current policy to construct a sample, REINFORCE is an on-policy algorithm.

However, there are some limitations associated with the REINFORCE algorithm. Although we cannot go into more detail, we can highlight three main issues:

  1. The update process is very inefficient. We run the policy once, update once, and then throw away the trajectory.
  2. The gradient estimate is very noisy. By chance, the collected trajectory may not be representative of the policy.
  3. There is no clear credit assignment. A trajectory may contain many good and bad actions, and whether these actions are reinforced depends only on the final total return.
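
To make issue 1 concrete, here is a minimal, self-contained sketch (a toy illustration, not the code from the previous post) of the REINFORCE cycle on a hypothetical two-action problem: run the current policy once, update once, and throw the trajectory away:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy one-step "environment": action 1 yields reward 1, action 0 yields 0.
theta = np.zeros(2)   # policy parameters (action preferences)
alpha = 0.1           # learning rate

for episode in range(200):
    # 1. Run the current policy once to collect a (one-step) trajectory.
    probs = softmax(theta)
    action = rng.choice(2, p=probs)
    reward = float(action == 1)
    # 2. Update once with the policy-gradient estimate:
    #    grad log pi(a) = onehot(a) - probs  for a softmax policy.
    theta += alpha * reward * (np.eye(2)[action] - probs)
    # 3. ...and throw the trajectory away: nothing is reused.

print(softmax(theta)[1])  # probability of the rewarding action, now close to 1
```

Each trajectory is used for exactly one gradient step and then discarded, which is precisely the sample inefficiency that the algorithms below try to mitigate.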

As we already advanced in the previous post, a proposal that addresses these limitations is the PPO algorithm, introduced in the paper “Proximal Policy Optimization Algorithms” by John Schulman et al. (2017) at OpenAI. But understanding the PPO algorithm requires a more complex mathematical treatment, and its programming becomes more convoluted than that of REINFORCE. And this will happen with all the algorithms that we present from now on in this series.
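
Without going into the full derivation, the heart of PPO is a clipped surrogate objective that caps how much a single update can change the policy. Here is a minimal numpy sketch of that objective (an illustration of the formula from the paper, not RLlib’s implementation):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, epsilon=0.2):
    """Clipped surrogate objective of Schulman et al. (2017).

    ratio:     pi_theta(a|s) / pi_theta_old(a|s) for the sampled actions
    advantage: estimated advantages for those actions
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    # Taking the minimum makes the objective a pessimistic bound: the
    # policy gains nothing by moving the ratio outside [1 - eps, 1 + eps].
    return np.minimum(unclipped, clipped).mean()

# With a positive advantage, pushing the ratio to 2.0 is capped at 1.2:
print(ppo_clip_objective(np.array([2.0]), np.array([1.0])))  # 1.2
```

The clipping discourages destructively large policy updates, addressing the noisy-gradient problem of plain REINFORCE.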

But actually, although we cannot avoid having to understand a specific algorithm well to judge its suitability as a solution to a specific problem, its programming can be greatly simplified with the new Reinforcement Learning frameworks and libraries that the research community is creating and sharing.

Reinforcement Learning frameworks

Before presenting these RL frameworks, let’s put them in context a bit.

Learning from interactions instead of examples

In the last several years, the pattern-recognition side has been the focus of much of the work and discussion in the Deep Learning community. We use powerful supercomputers to process large labeled data sets (with expert-provided outputs for the training set) and apply gradient-based methods that find patterns in those data sets, which can be used to make predictions or to find structure in the data.

This contrasts with the fact that an important part of our knowledge of the world is acquired through interaction, without an external teacher telling us what the outcomes of every single action we take will be. Humans are able to discover solutions to new problems from interaction and experience, acquiring knowledge about the world by actively exploring it.

For this reason, current approaches study the problem of learning from interaction with simulated environments through the lens of Deep Reinforcement Learning (DRL), a computational approach to goal-directed learning from interaction that does not rely on expert supervision. That is, a Reinforcement Learning Agent must interact with an Environment to generate its own training data.

This motivates interacting with multiple instances of an Environment in parallel to generate more experience to learn from, faster. This has led to the widespread use of increasingly large-scale distributed and parallel systems in RL training, which introduces numerous engineering and algorithmic challenges that the frameworks we are talking about help address.

Open source to the rescue

In recent years, frameworks such as TensorFlow and PyTorch (we have spoken extensively about both in this blog) have arisen to help turn pattern recognition into a commodity, making deep learning something that practitioners can more easily try and use.

A similar pattern is beginning to play out in the Reinforcement Learning arena. We are beginning to see the emergence of many open-source libraries and tools that address this, both by helping to create new pieces (instead of writing from scratch) and, above all, by enabling the combination of various prebuilt algorithmic components. These Reinforcement Learning frameworks help engineers by providing higher-level abstractions of the core components of an RL algorithm. In short, this makes code easier to develop and more comfortable to read, and improves efficiency.

In this post, I provide some notes about the most popular RL frameworks available. I think readers will benefit from using code from an already-established framework or library. At the time of writing this post, these are the most important ones I can mention (and I’m sure I’ll leave some out):

Deciding which of the RL frameworks listed here to use depends on your preferences and on what exactly you want to do with it. The reader can follow the links for more information.

RLlib: Scalable Reinforcement Learning using Ray

I have personally opted for RLlib, based on Ray, for several reasons that I will explain below.

Growth of computing requirements

Deep Reinforcement Learning algorithms involve a large number of simulations, adding another multiplicative factor to the computational complexity of Deep Learning itself. This is mostly required by the algorithms we have not yet seen in this series, such as distributed actor-critic methods or multi-agent methods, among others.

Moreover, finding the best model often requires hyperparameter tuning, searching among various hyperparameter settings, which can be costly. All this entails the need for the high computing power provided by supercomputers based on distributed systems of heterogeneous servers (with multi-core CPUs and hardware accelerators such as GPUs or TPUs).

Two years ago, when I debuted as an author on Medium, I explained what this type of infrastructure is like in the article “Supercomputing”. In Barcelona, we now have a supercomputer, MareNostrum 4, with a computing power of 13 Petaflops.

Barcelona Supercomputing Center will host a new supercomputer next year, MareNostrum 5, which will multiply the computational power by a factor of 17.

The current supercomputer MareNostrum 4 is divided into two differentiated hardware blocks: a general-purpose block and a block based on an IBM system designed especially for Deep Learning and Artificial Intelligence applications.

In terms of hardware, this part of MareNostrum consists of a 54-node cluster based on IBM POWER9 processors and NVIDIA V100 GPUs, running Linux and interconnected by a 100 Gbit/s Infiniband network. Each node is equipped with 2 IBM POWER9 processors with 20 physical cores each and 512 GB of memory. Each of these POWER9 processors is connected to two NVIDIA V100 (Volta) GPUs with 16 GB of memory, for a total of 4 GPUs per node.

How can this hardware fabric be managed efficiently?

System Software Stack

Accelerating Reinforcement Learning with distributed and parallel systems introduces several challenges in managing the parallelization and distribution of a program’s execution. To address this growing complexity, new layers of software have begun to be proposed, stacked on existing ones, in an attempt to keep the different components of the system’s layered software stack logically separate.

Thanks to this key abstraction, we can focus on the different software components that today’s supercomputers incorporate in order to perform complex tasks. I like to mention that Daniel Hillis, who co-founded Thinking Machines Corporation, the company that developed the parallel Connection Machine, says that the hierarchical structure of abstraction is our most important tool for understanding complex systems, because it lets us focus on a single aspect of a problem at a time.

And this is the case of RLlib, the framework I opted for, which follows this divide-and-conquer philosophy with a layered design of the software stack.

Software stack of RLlib

This hierarchical structure of abstraction that allows this functional abstraction is fundamental because it lets us manipulate information without worrying about its underlying representation. Daniel Hillis says that once we figure out how to accomplish a given function, we can put the mechanism inside a “black box” or a “building block” and stop thinking about it. The function embodied by the building block can be used over and over, without reference to the details of what’s inside.


In short, parallel and distributed computing is a staple of Reinforcement Learning algorithms. We need to leverage multiple cores and accelerators (on multiple machines) to speed up RL applications, and Python’s multiprocessing module is not the solution. Some RL frameworks, like Ray, can handle this challenge excellently.

On the official project page, Ray is defined as a fast and simple framework for building and running distributed applications:

  1. Providing simple primitives for building and running distributed applications.
  2. Enabling end users to parallelize single-machine code, with few or no code changes.
  3. Including a large ecosystem of applications, libraries, and tools on top of the core Ray to enable complex applications.

Ray Core provides simple primitives for application building. On top of Ray Core, besides RLlib, there are other libraries for solving problems in machine learning: Tune (Scalable Hyperparameter Tuning), RaySGD (Distributed Training Wrappers), and Ray Serve (Scalable and Programmable Serving).


RLlib is an open-source library for reinforcement learning that offers both high scalability and a unified API for a variety of applications. RLlib natively supports TensorFlow, TensorFlow Eager, and PyTorch, but most of its internals are framework agnostic.

At present, this library already has extensive documentation (see the API documentation), offering a large number of built-in algorithms in addition to allowing the creation of custom algorithms.

The key concepts in RLlib are Policies, Samples, and Trainers. In a nutshell, Policies are Python classes that define how an agent acts in an environment. All data interchange in RLlib is in the form of Sample batches that encode one or more fragments of a trajectory. Trainers are the boilerplate classes that put the above components together, managing the algorithm configuration, the optimizer, the training metrics, the workflow of the parallel execution components, etc.

Later in this series, when we have advanced further into distributed and multi-agent algorithms, we will present these key components of RLlib in more detail.

TensorFlow or PyTorch

In a previous post, TensorFlow vs. PyTorch: The battle continues, I showed that the battle between deep learning heavyweights TensorFlow and PyTorch is fully underway. And in this regard, the option taken by RLlib, allowing users to seamlessly switch between TensorFlow and PyTorch for their reinforcement learning work, also seems very appropriate.

To allow users to easily switch between TensorFlow and PyTorch as a backend in RLlib, RLlib includes the “framework” trainer config. For example, to switch to the PyTorch version of an algorithm, we can specify {"framework":"torch"}. Internally, this tells RLlib to try to use the torch version of a policy for an algorithm (check out the examples of PPOTFPolicy vs. PPOTorchPolicy).

Coding PPO with RLlib

Now we will show a toy example to get you started: solving OpenAI Gym’s CartPole Environment with the PPO algorithm using RLlib.

The entire code of this post can be found on GitHub and can be run as a Google Colab notebook using this link.

Since we are executing our examples in Colab, we need to restart the runtime after installing the ray package and uninstalling pyarrow.

The various algorithms you can access are available through ray.rllib.agents. Here you can find a long list of different implementations in both PyTorch and TensorFlow to begin playing with.

If you want to use PPO you can run the following code:

import ray
from ray.rllib.agents.ppo import PPOTrainer, DEFAULT_CONFIG

The ray.init() command starts all of the relevant Ray processes. This must be done before we instantiate any RL agents, for instance the PPOTrainer object in our example:

ray.init()  # start all the relevant Ray processes

config = DEFAULT_CONFIG.copy()
config["num_gpus"] = 1  # in order to use the GPU
agent = PPOTrainer(config, 'CartPole-v0')

We can pass many hyperparameters into the config object, specifying how the network and the training procedure should be configured. Changing hyperparameters is as easy as passing a dictionary of configurations to the config argument. A quick way to see what’s available is to print agent.config, which lists the options available for your chosen algorithm, for example:

'num_workers': 2,
'num_envs_per_worker': 1,
'rollout_fragment_length': 200,
'num_gpus': 0,
'train_batch_size': 4000,

Once we have specified our configuration, calling the train() method on our trainer object will update the agent and return the training output as a dictionary, here called result:

result = agent.train()

All the algorithms follow the same basic naming convention: the lowercase abbreviation of the algorithm names the module, and the uppercase abbreviation followed by Trainer names the agent class. For instance, if you want to try a DQN instead, you can call:

from ray.rllib.agents.dqn import DQNTrainer, DEFAULT_CONFIG
agent = DQNTrainer(config=DEFAULT_CONFIG, env='CartPole-v0')

The simplest way to programmatically compute actions from a trained agent is to use the compute_action() method:

action = agent.compute_action(state)

This method preprocesses and filters the observation before passing it to the agent policy. Here is a simple example of a function to watch the Agent, using compute_action():

def watch_agent(env):
    state = env.reset()
    rewards = []
    img = plt.imshow(env.render(mode='rgb_array'))
    for t in range(2000):
        action = agent.compute_action(state)
        img.set_data(env.render(mode='rgb_array'))
        state, reward, done, _ = env.step(action)
        rewards.append(reward)
        if done:
            print("Reward:", sum([r for r in rewards]))
            break
    env.close()

Using the watch_agent function, we can compare the behavior of the Agent before and after training, running multiple updates by calling the train() method a given number of times:

for i in range(10):
    result = agent.train()
    print(f'Mean reward: {result["episode_reward_mean"]:4.1f}')

The last line of code shows how we can monitor the training loop by printing information included in the dictionary returned by the train() method.

Before training
After training


Obviously, this is a toy implementation of a simple algorithm, intended to show this framework very briefly. The real value of the RLlib framework lies in its use on large infrastructures, executing inherently parallel and, at the same time, complex algorithms, where writing the code from scratch is totally unfeasible.

As I said, I opted for RLlib after taking a look at all the other frameworks mentioned above. The reasons are diverse, and some have already been presented in this post. Add that, for me, it is relevant that it has already been included in major cloud offerings such as AWS and AzureML. Or that there is a driving company like ANYSCALE, which has already raised 20 million dollars and organizes the Ray Summit conference, to be held online this week (September 30 through October 1) with great speakers (such as our friend Oriol Vinyals ;-). I could add more context, but for me, just as important as the above reasons is the fact that great researchers from the University of California, Berkeley are involved, including the visionary Ion Stoica, whom I met through Spark, and they clearly got it right!

See you in the next!