Source: Deep Learning on Medium
In-depth look into PRL — the new reinforcement learning framework in Python
After an introduction (in my previous blog post) to the concept behind the People's Reinforcement Learning library, I want to describe each part of the library in detail and show how these parts work together. I am going to go through each submodule of PRL.
The gray code fragments under the chapter names are the submodule names of the discussed library fragments.
The Environment class is the base class for PRL environments. It preserves OpenAI Gym's API (documentation here) with some minor exceptions.
Environment wraps an OpenAI Gym environment and has placeholders for the state, action and reward transformer objects. It also tracks the history of the currently played episode. Example classes inheriting from Environment are provided in the library.
Transformers can use the history of an episode to transform the current observation from the gym environment into the state representation required by the agent. Transformer objects can be implemented as stateless functions or stateful objects, depending on what is more convenient for the user. The states, actions and rewards stored in the episode history are kept in raw form (before any transformations).
Only environments with discrete action spaces and observations of type numpy.ndarray are supported in the current version of PRL.
State transformers are used by the environment to change observations into the state representation expected by the function approximators used by the agent. To build your own transformer, all you need to do is create a class inheriting from the prl.core.StateTransformer class and implement the transform and reset methods. The transform method takes two arguments: a state in the form of a numpy.ndarray and the episode history. It returns a state also represented by a numpy.ndarray. The reset method resets the transformer state (if any) between episodes. You can make your transformation fittable to the data by implementing a fit method in the same manner as in the scikit-learn library.
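As a sketch, here is what a state transformer following that recipe could look like. PRL itself is not imported: the base class below is only a stand-in for prl.core.StateTransformer, and the frame-stacking behaviour is an illustrative choice, not something the library prescribes.

```python
import numpy as np

# Stand-in for prl.core.StateTransformer; it only mirrors the
# transform/reset interface described above.
class StateTransformer:
    def transform(self, state, history):
        raise NotImplementedError

    def reset(self):
        raise NotImplementedError

# Hypothetical transformer: stacks the current observation with the
# previous one, a common way to inject short-term history into the state.
class StackLastTwoTransformer(StateTransformer):
    def __init__(self):
        self._previous = None

    def transform(self, state, history):
        if self._previous is None:
            self._previous = np.zeros_like(state)
        stacked = np.concatenate([self._previous, state])  # pure NumPy, as advised below
        self._previous = state
        return stacked

    def reset(self):
        # Clear per-episode state between episodes.
        self._previous = None

transformer = StackLastTwoTransformer()
s = transformer.transform(np.array([1.0, 2.0]), history=None)  # first call pads with zeros
```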
It is important to use only NumPy functions while implementing transformers, for the sake of performance. More advanced Python programmers can use the Numba library to speed up their transformers even more. A very good and deep NumPy tutorial can be found here.
You can check the performance of your transformations by importing and using the prl.utils.time_logger object at the end of your program.
If you want to make your own reward shaping transformer, you need to inherit from prl.core.RewardTransformer and perform steps analogous to the state transformer case.
While implementing an action transformer, you need to inherit your class from the action transformer base class and implement the transform, reset and (optionally) fit methods. After that you have to assign a gym.Space object to the action_space attribute, because it cannot be inferred automatically from the class implementation alone.
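A minimal sketch of such an action transformer, with the base class and gym.spaces.Discrete stubbed out so the snippet is self-contained; the action mapping itself is made up for illustration.

```python
# Minimal stand-in for gym.spaces.Discrete; in PRL you would assign a
# real gym.Space instance to action_space.
class Discrete:
    def __init__(self, n):
        self.n = n

class CoarseActionTransformer:
    """Maps a small discrete action set {0, 1} onto a larger one {0, 2}."""

    # Must be set explicitly: the action space cannot be inferred
    # from the class implementation alone.
    action_space = Discrete(2)

    def transform(self, action, history):
        return int(action) * 2

    def reset(self):
        pass  # no per-episode state to clear

t = CoarseActionTransformer()
raw_action = t.transform(1, history=None)
```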
Classes for storage are created for easy management of the training history.
The History class is used to keep the episodes' history. You can get actions, states, rewards and done flags from it. It also gives the user methods to prepare an array of returns, count total rewards, or sample a batch for neural network training. You can concatenate two History objects using the in-place add operator (+=).
Because appending to a numpy.ndarray (used to store the data) is a very expensive operation, the History object allocates somewhat bigger arrays in advance and doubles their size when the arrays are full. You can set the initial length of a History object during initialization; Environment, for example, does this for you.
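The pre-allocation-and-doubling strategy that History uses internally can be sketched in plain NumPy. The class and method names here are mine, not PRL's; the point is the amortized O(1) append.

```python
import numpy as np

class GrowableBuffer:
    """Append into a pre-allocated ndarray, doubling its size on overflow,
    mirroring the allocation strategy described for History."""

    def __init__(self, initial_length=8):
        self._data = np.empty(initial_length, dtype=np.float64)
        self._size = 0

    def append(self, value):
        if self._size == len(self._data):
            # Array is full: allocate a twice-as-large array and copy over.
            bigger = np.empty(2 * len(self._data), dtype=self._data.dtype)
            bigger[: self._size] = self._data
            self._data = bigger
        self._data[self._size] = value
        self._size += 1

    def values(self):
        # Expose only the filled prefix, not the spare capacity.
        return self._data[: self._size]

buf = GrowableBuffer(initial_length=2)
for reward in [1.0, 0.0, 1.0]:  # third append triggers one doubling
    buf.append(reward)
```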
Memory is a class similar to History, created to be used as a replay buffer. It does not have to keep complete episodes, so its API has fewer methods than History. Its length is constant and is set during object initialization. You can't concatenate two Memory objects or calculate total rewards. This object is used by the DQN agent as a replay buffer for experience replay.
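The role Memory plays for DQN — a fixed-capacity buffer with uniform batch sampling — can be sketched generically like this (a toy illustration of the technique, not PRL's actual implementation):

```python
import random

class ReplayMemory:
    """Fixed-capacity circular buffer with uniform batch sampling,
    the experience-replay pattern described above."""

    def __init__(self, capacity):
        self._capacity = capacity
        self._transitions = []
        self._cursor = 0

    def add(self, transition):
        if len(self._transitions) < self._capacity:
            self._transitions.append(transition)
        else:
            # Overwrite the oldest entry once the buffer is full.
            self._transitions[self._cursor] = transition
        self._cursor = (self._cursor + 1) % self._capacity

    def sample(self, batch_size):
        return random.sample(self._transitions, batch_size)

memory = ReplayMemory(capacity=3)
for step in range(5):  # real transitions would be (state, action, reward, next_state)
    memory.add(step)
batch = memory.sample(2)  # only the 3 most recent transitions survive
```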
Function approximators (FA) are created to deliver a unified API for any kind of function approximator used by RL algorithms. An FA has two methods — predict and a training method — and is implemented in PyTorch for now.
The PyTorch implementation of a function approximator needs three arguments to initialize: a PytorchNet object, a loss and an optimizer. Losses and optimizers can be imported directly from PyTorch.
The PytorchNet class is similar to torch.nn.Module but with an additional method.
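To keep this post self-contained without pulling in PyTorch, here is a toy NumPy model exposing the same predict/train split; in PRL the network would be a PytorchNet and the loss and optimizer would come straight from PyTorch. All names below are illustrative.

```python
import numpy as np

class LinearApproximator:
    """Toy function approximator with the unified predict/train interface
    described above, using a plain linear model and MSE gradient steps."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = np.zeros(n_features)
        self.learning_rate = learning_rate

    def predict(self, x):
        return x @ self.weights

    def train(self, x, target):
        # One gradient step on mean squared error.
        error = self.predict(x) - target
        gradient = x.T @ error / len(x)
        self.weights -= self.learning_rate * gradient

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 2))
target = x @ np.array([1.0, -2.0])  # ground-truth weights to recover
fa = LinearApproximator(n_features=2)
for _ in range(200):
    fa.train(x, target)  # weights converge toward [1.0, -2.0]
```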
Some neural network and loss implementations used for RL problems are kept in this module. These are standard torch.nn.Modules, and you can learn more about them in this great tutorial.
You can pass some callbacks to the agent's train method to control and supervise the training. Some of the implemented callbacks are: TensorboardLogger (more about this logger can be found here), which logs training statistics to tensorboardX.
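The callback pattern itself can be sketched as follows. The hook name on_iteration_end and the training-loop shape are assumptions for illustration, not necessarily PRL's actual callback API.

```python
# A callback collects statistics; the training loop invokes a hook on
# every callback after each iteration.
class IterationLogger:
    def __init__(self):
        self.rewards = []

    def on_iteration_end(self, iteration, total_reward):
        self.rewards.append(total_reward)

def train(n_iterations, callbacks):
    for i in range(n_iterations):
        total_reward = float(i)  # placeholder for a real training iteration
        for callback in callbacks:
            callback.on_iteration_end(i, total_reward)

logger = IterationLogger()
train(3, callbacks=[logger])  # logger.rewards now holds one entry per iteration
```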
Loggers and profiling
Users and agents have access to five loggers. Most of them are used automatically by agents, environments, transformers or function approximators. These loggers are:
time_logger – this logger monitors the execution time of many functions and methods. You can print this object to generate a report of execution times. If you want to profile your own function, decorate it with the prl.utils.timeit decorator; from then on, the execution time of this function will be logged.
memory_logger – this logger is used to monitor RAM usage (currently unused).
agent_logger – this logger is used to monitor agent training statistics.
nn_logger – all statistics from neural network training are stored in this logger. It is important to pass a distinct id argument to each network during initialization when training an agent with many networks. This id is used as a key in the logger.
misc_logger – a logger for user statistics. They are captured by the TensorboardLogger and plotted in the browser. You can log only numbers (ints or floats) under a string key.
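A minimal re-implementation of the idea behind the prl.utils.timeit decorator mentioned above — the real one logs into time_logger, while this self-contained sketch uses its own dictionary:

```python
import functools
import time
from collections import defaultdict

# Shared log of execution times, keyed by function name; printing its
# contents at the end of a program yields a simple timing report.
execution_times = defaultdict(list)

def timeit(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            execution_times[func.__name__].append(time.perf_counter() - start)
    return wrapper

@timeit
def transform(state):
    return [2 * value for value in state]

transform([1, 2, 3])
transform([4, 5, 6])  # execution_times["transform"] now has two entries
```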
And finally, the agents! Thanks to the classes above, the agent implementations in PRL are simple and compact. While implementing an agent, all you need to do is implement the train_iteration method, which is the basic step in agent training (e.g. one step in the environment for DQN, or some number of complete episodes for the REINFORCE agent). You can also implement setup and cleanup methods (such as post_train_cleanup) if needed; they are called before and after the main training loop.
The act method is called by the agent when making one step in the environment. The agent also has methods inherited from the base Agent class, such as test, which can be used to evaluate the agent during training, and train, which should be used only to initialize training from outside the agent.
Example agent code looks like this:
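A hypothetical skeleton of such an agent, with a stub base class and toy environment standing in for PRL's; the pre_train_setup name and the ToyEnv interface are assumptions made for illustration, based on the methods described above.

```python
import random

class Agent:
    """Stub mirroring the base-class interface described above; PRL's real
    Agent also wires in callbacks and loggers."""

    def __init__(self, environment):
        self.env = environment

    def train(self, n_iterations):
        # Called from outside the agent to initialize training.
        self.pre_train_setup()
        for _ in range(n_iterations):
            self.train_iteration()
        self.post_train_cleanup()

    def pre_train_setup(self):
        pass  # optional hook before the main training loop

    def post_train_cleanup(self):
        pass  # optional hook after the main training loop

class RandomAgent(Agent):
    """Picks uniformly random actions — the simplest possible agent."""

    def act(self, state):
        return random.randrange(self.env.n_actions)

    def train_iteration(self):
        # One environment step per training iteration (DQN-style granularity).
        action = self.act(self.env.state)
        self.env.step(action)

class ToyEnv:
    """Trivial environment stub that only counts steps."""
    n_actions = 2
    state = 0

    def __init__(self):
        self.steps_taken = 0

    def step(self, action):
        self.steps_taken += 1

agent = RandomAgent(ToyEnv())
agent.train(n_iterations=5)
```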
As you can see, it is very simple and self-explanatory. We have implemented some of the most popular RL agents. This is the list of them:
- Random Agent
- Cross-Entropy Agent
- REINFORCE Agent
- Actor-Critic Agent
- A2C Agent with many advantage functions
- DQN Agent
There are many examples of the use of every element of the library in the
examples/ folder in the repository, and we encourage you to look at them to get a better understanding of the PRL framework. Let's look at one more complicated example. We hope it will be self-explanatory after this blog post.
If you encounter any problems with the library, documentation, or this tutorial, or you want to contribute to the project, please write us an email at piotr.tempczyk [at] opium.sh. For now we have suspended development of the library and moved our resources to new projects, but feel free to use this library, develop your own framework using parts of ours, or join us and contribute yourself. If you use our code or ideas in your tools, we only ask that you mention in your README that you were inspired by the PRL framework or that you borrowed some ideas (or code) from us 🙂
There were many people involved in this project. This is the list of the most important of them:
Project Lead: Piotr Tempczyk
If you enjoyed this post, please hit the clap button below and follow our publication for more interesting articles about ML & AI.