NEW: extended documentation available at https://rlpyt.readthedocs.io (as of 27 Jan 2020)
View the Change Log
Modular, optimized implementations of common deep RL algorithms in PyTorch, with unified infrastructure supporting all three major families of model-free algorithms: policy gradient, deep Q-learning, and Q-function policy gradient. Intended to be a high-throughput code base for small- to medium-scale research (large-scale meaning something like OpenAI Dota with hundreds of GPUs). Key capabilities/features include:
observation, prev_action, prev_reward.
Policy Gradient: A2C, PPO.
Replay Buffers (supporting both DQN + QPG): non-sequence and sequence (for recurrent) replay, n-step returns, uniform or prioritized replay, full-observation or frame-based buffer (e.g. for Atari, stores only unique frames to save memory, reconstructs multi-frame observations).
Deep Q-Learning: DQN + variants: Double, Dueling, Categorical (up to Rainbow minus Noisy Nets), Recurrent (R2D2-style). Coming soon: Implicit Quantile Networks?
Q-Function Policy Gradient: DDPG, TD3, SAC. Coming soon: Distributional DDPG?
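The frame-based buffer idea mentioned above can be illustrated with a minimal sketch. This is not rlpyt's implementation: the class name `FrameBuffer`, its methods, and the choice to pad early steps by repeating the first frame are all assumptions made for illustration, and episode boundaries are ignored. The point is only that each step stores one new frame, and the multi-frame observation is rebuilt on read.

```python
class FrameBuffer:
    """Hypothetical sketch: store one new frame per step, rebuild stacks on read."""

    def __init__(self, n_frames):
        self.n_frames = n_frames  # frames per reconstructed observation
        self.frames = []          # only the newest frame of each time step

    def append(self, newest_frame):
        self.frames.append(newest_frame)

    def observation(self, t):
        # Rebuild the multi-frame observation that ends at time step t.
        # Assumption: steps before the start repeat the first stored frame.
        idxs = [max(0, t - k) for k in reversed(range(self.n_frames))]
        return [self.frames[i] for i in idxs]


buf = FrameBuffer(n_frames=4)
for frame in ["a", "b", "c", "d", "e", "f"]:  # strings stand in for image frames
    buf.append(frame)

late_obs = buf.observation(5)   # stack ending at the last stored step
early_obs = buf.observation(1)  # stack near the start, padded with the first frame
```

Six stored frames serve six four-frame observations here, whereas a full-observation buffer would hold twenty-four frames for the same data.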
Follow the installation instructions below, and then get started in the examples folder. Example scripts are ordered by increasing complexity.
Rlpyt introduces new object classes
namedarraytuple for easier organization of collections of numpy arrays / torch tensors. (see
namedarraytuple is essentially a namedtuple which exposes indexed or sliced read/writes into the structure. For example, consider writing into a (possibly nested) dictionary of arrays:
    for k, v in src.items():
        if isinstance(dest[k], dict):
            ...  # recurse into the nested dictionary
        else:
            dest[k][slice_or_indexes] = v
This code is replaced by the following:
    dest[slice_or_indexes] = src
Importantly, this syntax looks the same whether dest and src are individual numpy arrays or arbitrarily-structured collections of arrays (the structures of dest and src must match, or src can be a single value, or None is an empty placeholder). Rlpyt uses this data structure extensively: different elements of training data are organized with the same leading dimensions, making it easy to interact with desired time- or batch-dimensions.
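To make the comparison concrete, here is a minimal sketch of the recursive dictionary write that the single-line assignment replaces. The helper `write_nested` is hypothetical and not part of rlpyt, the field names are illustrative, and plain Python lists stand in for numpy arrays so the sketch has no dependencies. It follows the rules stated above: matching structures, a single value shared by every field, or None as an empty placeholder.

```python
def write_nested(dest, loc, src):
    """Write src into every leaf of dest at index/slice loc.

    Hypothetical helper for illustration only; not rlpyt's API.
    """
    for key, dest_value in dest.items():
        # src is either a dict with matching structure or one value for all fields.
        value = src[key] if isinstance(src, dict) else src
        if value is None:
            continue  # None acts as an empty placeholder: nothing to write
        if isinstance(dest_value, dict):
            write_nested(dest_value, loc, value)  # recurse into nested structure
        else:
            dest_value[loc] = value


# Buffers sharing a leading time dimension of length 4 (lists stand in for arrays).
dest = {
    "observation": {"joint": [0.0] * 4, "image": [None] * 4},
    "reward": [0.0] * 4,
}

# Write a single time step with a matching structure.
src = {"observation": {"joint": 0.5, "image": "frame_2"}, "reward": 1.0}
write_nested(dest, 2, src)

# Write a slice of time steps, skipping the observation via a None placeholder.
write_nested(dest, slice(0, 2), {"observation": None, "reward": [7.0, 8.0]})
```

With a namedarraytuple, each call to the helper would collapse to an assignment of the form dest[loc] = src, and the caller would not need to know how deeply the structure is nested.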
This is also intended to support environments with multi-modal observations or actions. For example, rather than flattening joint-angle and camera-image observations into one observation vector, the environment can store them as-is into a namedarraytuple for the observation, and in the forward method of the model, observation.image can be fed into the desired layers. Intermediate infrastructure code doesn’t change.
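The routing described above might look like the following sketch. Everything here is an assumption made for illustration: a plain namedtuple stands in for a namedarraytuple, `ToyModel` stands in for a torch module, the field names `joint` and `image` are hypothetical, and simple arithmetic stands in for real layers.

```python
from collections import namedtuple

# Hypothetical observation structure; the field names are illustrative only.
Observation = namedtuple("Observation", ["joint", "image"])


class ToyModel:
    """Stand-in for a torch module; plain Python keeps the sketch dependency-free."""

    def forward(self, observation, prev_action, prev_reward):
        # Each modality goes to its own "layers"; neither is flattened beforehand.
        image_feature = sum(observation.image) / len(observation.image)  # e.g. a conv stack
        joint_feature = [2.0 * angle for angle in observation.joint]     # e.g. an MLP
        return joint_feature + [image_feature]


def pass_through(observation):
    """Intermediate infrastructure: forwards the structure without inspecting it."""
    return observation


obs = Observation(joint=[0.5, -0.25], image=[0.0, 1.0, 1.0, 0.0])
features = ToyModel().forward(pass_through(obs), prev_action=None, prev_reward=0.0)
```

Only the environment and the model know about the individual fields; adding a third modality would change those two places and leave `pass_through` untouched.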
Overall the code is stable, but it is still under development and changes may occur. We are open to suggestions and contributions for other established algorithms to add, or for other developments that support more use cases. Please see our simple contribution guidelines.
This package does not include its own visualization, as the logged data is compatible with previous editions (see below). For more features, use https://github.com/vitchyr/viskit.
Clone this repository to the local machine.
Install the anaconda environment appropriate for the machine.
    conda env create -f linux_[cpu|cuda9|cuda10].yml
    source activate rlpyt

    # A
    export PYTHONPATH=path_to_rlpyt:$PYTHONPATH

    # B
    pip install -e .
Hint: for easy access, add the following to your
~/.bashrc (might substitute
alias rlpyt="source activate rlpyt; cd path_to_rlpyt"
For more discussion, please see the white paper on arXiv. If you use this repository in your work or otherwise wish to cite it, please make reference to the white paper.
The class types perform the following roles:
algorithm; manages the training loop and logging of diagnostics.
environment interaction to collect training data, can initialize parallel workers.
environments (and maybe operates
agent) and records samples, attached to
sampler; trained by the
algorithm. Interface to
agents and defines related formulas for use in loss function, attached to the
agent (e.g. defines a loss function and performs gradient descent).
This code is a revision and extension of accel_rl, which explored scaling RL in the Atari domain using Theano. Scaling results were recorded here: A. Stooke & P. Abbeel, "Accelerated Methods for Deep Reinforcement Learning". For an insightful study of batch-size scaling across deep learning including RL, see S. McCandlish et al., "An Empirical Model of Large-Batch Training".
Accel_rl was inspired by rllab (the logger here is nearly a direct copy). Rlpyt follows the rllab interfaces: agents output action, agent_info, and environments output observation, reward, done, env_info. In general in rlpyt, agent inputs/outputs are torch tensors, and environment inputs/outputs are numpy arrays, with conversions handled automatically.
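The interfaces above can be sketched as a short interaction loop. The classes `CountdownEnv` and `ConstantAgent`, the method names, and the info field names are all hypothetical and written only for illustration. Plain Python values stand in for the torch tensors and numpy arrays, so the automatic conversions are not shown, and the info structures are assumed to be namedtuples with fixed fields.

```python
from collections import namedtuple

# Hypothetical info structures; the field names are illustrative only.
EnvInfo = namedtuple("EnvInfo", ["step_count"])
AgentInfo = namedtuple("AgentInfo", ["value"])


class CountdownEnv:
    """Toy environment: the episode ends after `horizon` steps."""

    def __init__(self, horizon=3):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return float(self.t)  # initial observation

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.t >= self.horizon
        # The same EnvInfo fields are returned at every step.
        return float(self.t), reward, done, EnvInfo(step_count=self.t)


class ConstantAgent:
    """Toy agent that always picks action 1."""

    def step(self, observation, prev_action, prev_reward):
        return 1, AgentInfo(value=0.0)  # action, agent_info


env, agent = CountdownEnv(), ConstantAgent()
observation, prev_action, prev_reward = env.reset(), 0, 0.0
samples = []
done = False
while not done:
    action, agent_info = agent.step(observation, prev_action, prev_reward)
    observation, reward, done, env_info = env.step(action)
    samples.append((action, reward, done, env_info, agent_info))
    prev_action, prev_reward = action, reward
```

Because every step returns the same fields, the recorded samples have a uniform shape and could be written into pre-allocated buffers without per-step checks.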
env_info rather than a
dict. This makes for easier data recording but does require the same fields to be output at every environment step. An environment wrapper is provided. Wrappers are also provided for Gym spaces to convert to rlpyt spaces (notably
Thanks for support / mentoring from Pieter Abbeel, the Fannie & John Hertz Foundation, NVIDIA, Max Jaderberg, OpenAI, and the BAIR community. And thanks in advance to any contributors!
Happy reinforcement learning!