Stable Baselines3 Versions

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.

v0.9.0

3 years ago

Breaking Changes:

  • Removed the device keyword argument of policies; use policy.to(device) instead, as shown in the sketch below. (@qxcv)
  • Renamed BaseClass.get_torch_variables -> BaseClass._get_torch_save_params and BaseClass.excluded_save_params -> BaseClass._excluded_save_params
  • Renamed saved items tensors to pytorch_variables for clarity
  • make_atari_env, make_vec_env and set_random_seed must now be imported from their submodules (and not directly from stable_baselines3.common):
from stable_baselines3.common.cmd_util import make_atari_env, make_vec_env
from stable_baselines3.common.utils import set_random_seed
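
Since the device keyword was removed from policies, a model's policy can be moved explicitly after construction. A minimal sketch (the environment id and device string are arbitrary choices for illustration):

from stable_baselines3 import PPO

# Create the model as usual; the policy no longer takes a `device` argument
model = PPO("MlpPolicy", "CartPole-v1")

# Move the policy to the desired device explicitly
model.policy.to("cpu")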

New Features:

  • Added unwrap_vec_wrapper() to common.vec_env to extract VecEnvWrapper if needed
  • Added StopTrainingOnMaxEpisodes to the callback collection (@xicocaio); see the sketch after this list
  • Added device keyword argument to BaseAlgorithm.load() (@liorcohen5)
  • Callbacks have access to rollout collection locals as in SB2. (@PartiallyTyped)
  • Added get_parameters and set_parameters for accessing/setting parameters of the agent
  • Added actor/critic loss logging for TD3. (@mloo3)
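
A usage sketch combining some of these additions (the environment id, hyperparameters and save path are arbitrary):

from stable_baselines3 import A2C
from stable_baselines3.common.callbacks import StopTrainingOnMaxEpisodes

# Stop training after (at most) 100 episodes
callback = StopTrainingOnMaxEpisodes(max_episodes=100, verbose=1)

model = A2C("MlpPolicy", "CartPole-v1")
model.learn(total_timesteps=10_000, callback=callback)
model.save("a2c_cartpole")

# New `device` keyword of load(), plus the parameter access helpers
loaded = A2C.load("a2c_cartpole", device="cpu")
params = loaded.get_parameters()  # dict of state dicts (policy, optimizers, ...)
loaded.set_parameters(params)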

Bug Fixes:

  • Fixed a bug where the environment was reset twice when using evaluate_policy
  • Fixed logging of clip_fraction in PPO (@diditforlulz273)
  • Fixed a bug where CUDA support was wrongly checked when passing the GPU index, e.g., device="cuda:0" (@liorcohen5); see the sketch after this list
  • Fixed a bug where the random seed was not properly set on CUDA when passing a GPU index
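
For reference, a minimal sketch exercising the fixed code paths (the environment id and episode count are arbitrary, and the GPU index is only used if CUDA is available):

import torch

from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = PPO("MlpPolicy", "CartPole-v1", device=device)

# evaluate_policy no longer resets the environment twice
mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=5)
print(f"mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")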

Others:

  • Improved typing coverage of VecEnv
  • Fixed type annotation of make_vec_env (@ManifoldFR)
  • Removed AlreadySteppingError and NotSteppingError that were not used
  • Fixed typos in SAC and TD3
  • Reorganized functions for clarity in BaseClass (save/load functions close to each other, private functions at top)
  • Clarified docstrings on what is saved and loaded to/from files
  • Simplified save_to_zip_file function by removing duplicate code
  • The library version is now stored along with saved models
  • DQN loss is now logged

Documentation:

  • Added StopTrainingOnMaxEpisodes details and example (@xicocaio)
  • Updated custom policy section (added custom feature extractor example)
  • Re-enabled sphinx_autodoc_typehints
  • Updated doc style for type hints and removed duplicated type hints

v0.8.0

3 years ago

Breaking Changes:

  • AtariWrapper and other Atari wrappers were updated to match SB2 ones
  • save_replay_buffer now takes the file path as argument instead of the folder path (@tirafesi)
  • Refactored the Critic class for TD3 and SAC; it is now called ContinuousCritic and has an additional n_critics parameter
  • SAC and TD3 now accept an arbitrary number of critics (e.g. policy_kwargs=dict(n_critics=3)) instead of only two, as shown in the sketch below
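
A minimal sketch of the new critic configuration and the file-path argument (environment id, timestep budget and path are arbitrary):

from stable_baselines3 import SAC

# SAC/TD3 critics now use ContinuousCritic; the number of critics is configurable
model = SAC("MlpPolicy", "Pendulum-v0", policy_kwargs=dict(n_critics=3))
model.learn(total_timesteps=1_000)

# save_replay_buffer now expects a file path, not a folder
model.save_replay_buffer("sac_replay_buffer.pkl")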

New Features:

  • Added DQN algorithm (@Artemis-Skade); see the sketch after this list
  • Buffer dtype is now set according to action and observation spaces for ReplayBuffer
  • Added a warning (shown when psutil is available) if allocating a buffer may exceed the system's available memory
  • Saving models now automatically creates the necessary folders and raises appropriate warnings (@PartiallyTyped)
  • Refactored opening paths for saving and loading to use strings, pathlib or io.BufferedIOBase (@PartiallyTyped)
  • Added DDPG algorithm as a special case of TD3.
  • Introduced BaseModel abstract parent for BasePolicy, which critics inherit from.
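
A minimal sketch of the two newly added algorithms (environment ids and timestep budgets are arbitrary):

from stable_baselines3 import DDPG, DQN

# DQN on a discrete-action environment
dqn_model = DQN("MlpPolicy", "CartPole-v1", verbose=1)
dqn_model.learn(total_timesteps=10_000)

# DDPG (implemented as a special case of TD3) on a continuous-action environment
ddpg_model = DDPG("MlpPolicy", "Pendulum-v0", verbose=1)
ddpg_model.learn(total_timesteps=5_000)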

Bug Fixes:

  • Fixed a bug in the close() method of SubprocVecEnv, causing wrappers further down in the wrapper stack to not be closed. (@NeoExtended)
  • Fixed the target for updating Q-values in SAC: the entropy term was not conditioned on terminal states
  • Use cloudpickle.load instead of pickle.load in CloudpickleWrapper. (@shwang)
  • Fixed a bug with orthogonal initialization when bias=False in custom policy (@rk37)
  • Fixed approximate entropy calculation in PPO and A2C. (@andyshih12)
  • Fixed DQN target network sharing feature extractor with the main network.
  • Fixed storing correct dones in on-policy algorithm rollout collection. (@andyshih12)
  • Fixed number of filters in final convolutional layer in NatureCNN to match original implementation.

Others:

  • Refactored off-policy algorithms to share the same .learn() method
  • Split the collect_rollout() method for off-policy algorithms
  • Added _on_step() for off-policy base class
  • Optimized replay buffer size by removing the need for a separate next_observations numpy array
  • Optimized polyak updates (1.5-1.95x speedup) through in-place operations (@PartiallyTyped)
  • Switched to black codestyle and added make format, make check-codestyle and commit-checks
  • Ignored errors from newer pytype version
  • Added a check when using gSDE
  • Removed codacy dependency from Dockerfile
  • Added common.sb2_compat.RMSpropTFLike optimizer, which corresponds more closely to the TensorFlow implementation of RMSprop; see the sketch after this list
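
A minimal sketch of the TF-like optimizer (the environment id, timestep budget and optimizer_kwargs are arbitrary illustrations):

from stable_baselines3 import A2C
from stable_baselines3.common.sb2_compat.rmsprop_tf_like import RMSpropTFLike

# Use the TF-like RMSprop to match SB2's A2C behaviour more closely
model = A2C(
    "MlpPolicy",
    "CartPole-v1",
    policy_kwargs=dict(optimizer_class=RMSpropTFLike, optimizer_kwargs=dict(eps=1e-5)),
)
model.learn(total_timesteps=10_000)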

Documentation:

  • Updated notebook links
  • Fixed a typo in the section of Enjoy a Trained Agent, in RL Baselines3 Zoo README. (@blurLake)
  • Added Unity reacher to the projects page (@koulakis)
  • Added PyBullet colab notebook
  • Fixed typo in PPO example code (@joeljosephjin)
  • Fixed typo in custom policy doc (@RaphaelWag)

v0.7.0

3 years ago

Breaking Changes:

  • The render() method of VecEnvs now only accepts one argument: mode

  • Created new file common/torch_layers.py, similar to SB refactoring

    • Contains all PyTorch network layer definitions and feature extractors: MlpExtractor, create_mlp, NatureCNN
  • Renamed BaseRLModel to BaseAlgorithm (along with its off-policy and on-policy variants)

  • Moved on-policy and off-policy base algorithms to common/on_policy_algorithm.py and common/off_policy_algorithm.py, respectively.

  • Moved PPOPolicy to ActorCriticPolicy in common/policies.py

  • Moved PPO (algorithm class) into OnPolicyAlgorithm (common/on_policy_algorithm.py), to be shared with A2C

  • Moved the following functions from BaseAlgorithm:

    • _load_from_file to load_from_zip_file (save_util.py)
    • _save_to_file_zip to save_to_zip_file (save_util.py)
    • safe_mean to safe_mean (utils.py)
    • check_env to check_for_correct_spaces (utils.py; renamed to avoid confusion with the environment checker tools)
  • Moved static function _is_vectorized_observation from common/policies.py to common/utils.py under name is_vectorized_observation.

  • Removed the {save,load}_running_average functions of VecNormalize in favor of load/save (see the sketch after this list).

  • Removed use_gae parameter from RolloutBuffer.compute_returns_and_advantage.
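
A minimal sketch of the save/load replacement for the running averages of VecNormalize (environment id and path are arbitrary):

import gym

from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

venv = VecNormalize(DummyVecEnv([lambda: gym.make("CartPole-v1")]))
# ... train a model on `venv` ...

# Normalization statistics are now saved/restored with save()/load()
venv.save("vec_normalize.pkl")
venv = VecNormalize.load("vec_normalize.pkl", DummyVecEnv([lambda: gym.make("CartPole-v1")]))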

Bug Fixes:

  • Fixed render() method for VecEnvs
  • Fixed seed() method for SubprocVecEnv
  • Fixed loading on GPU for testing when using gSDE and deterministic=False
  • Fixed register_policy to allow re-registering the same policy for the same sub-class (i.e. assigning the same value to the same key).
  • Fixed a bug where the gradient was passed when using gSDE with PPO/A2C; this does not affect SAC

Others:

  • Re-enabled the unsafe fork start method in the tests (it had been disabled because it caused a deadlock with tensorflow)
  • Added a test for seeding SubprocVecEnv and rendering
  • Fixed reference in NatureCNN (pointed to older version with different network architecture)
  • Fixed comments saying "CxWxH" instead of "CxHxW" (same style as in torch docs / commonly used)
  • Added further comments on registering/getting policies ("MlpPolicy", "CnnPolicy").
  • Renamed progress (a value going from 1 at the start of training to 0 at the end) to progress_remaining.
  • Added policies.py files for A2C/PPO, which define MlpPolicy/CnnPolicy (renamed ActorCriticPolicies).
  • Added some missing tests for VecNormalize, VecCheckNan and PPO.

Documentation:

  • Added a paragraph on "MlpPolicy"/"CnnPolicy" and policy naming scheme under "Developer Guide"
  • Fixed second-level listing in changelog

v0.6.0

4 years ago

Breaking Changes:

  • Removed State-Dependent Exploration (SDE) support for TD3
  • Methods were renamed in the logger (see the sketch after this list):
    • logkv -> record, writekvs -> write, writeseq -> write_sequence
    • logkvs -> record_dict, dumpkvs -> dump
    • getkvs -> get_log_dict, logkv_mean -> record_mean
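
A minimal sketch using the renamed functions, assuming the module-level logger in stable_baselines3.common (the keys and values are arbitrary):

from stable_baselines3.common import logger

logger.record("train/foo", 1.0)       # was logger.logkv(...)
logger.record_mean("train/bar", 2.0)  # was logger.logkv_mean(...)
logger.dump()                         # was logger.dumpkvs(...)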

New Features:

  • Added env checker (sync with Stable Baselines); a usage sketch follows this list
  • Added VecCheckNan and VecVideoRecorder (Sync with Stable Baselines)
  • Added determinism tests
  • Added cmd_util and atari_wrappers
  • Added support for MultiDiscrete and MultiBinary observation spaces (@rolandgvc)
  • Added MultiCategorical and Bernoulli distributions for PPO/A2C (@rolandgvc)
  • Added support for logging to tensorboard (@rolandgvc)
  • Added VectorizedActionNoise for continuous vectorized environments (@PartiallyTyped)
  • Log evaluation in the EvalCallback using the logger
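
A minimal sketch of the env checker and tensorboard logging (the environment id, timestep budget and log directory are arbitrary):

import gym

from stable_baselines3 import A2C
from stable_baselines3.common.env_checker import check_env

# Validate a (custom) environment against the expected Gym interface
env = gym.make("CartPole-v1")
check_env(env)

# Tensorboard logging is enabled by passing a log directory
model = A2C("MlpPolicy", env, tensorboard_log="./a2c_cartpole_tensorboard/")
model.learn(total_timesteps=10_000)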

Bug Fixes:

  • Fixed a bug that prevented a model trained on CPU from being loaded on GPU
  • Fixed the version number, which had a newline included
  • Fixed a segfault in the docker image caused by FakeImageEnv by reducing the screen size
  • Fixed sde_sample_freq, which was not taken into account for SAC
  • Passed the logger module to BaseCallback; otherwise callbacks cannot write to the one used by the algorithms

Others:

  • Renamed to Stable-Baselines3
  • Added Dockerfile
  • Synced VecEnvs with Stable-Baselines
  • Updated requirement: gym>=0.17
  • Added .readthedocs.yml file
  • Added flake8 and make lint command
  • Added Github workflow
  • Added a warning when passing both train_freq and n_episodes_rollout to off-policy algorithms

Documentation:

  • Added most documentation (adapted from Stable-Baselines)
  • Added link to CONTRIBUTING.md in the README (@kinalmehta)
  • Added gSDE project and updated docstrings accordingly
  • Fixed TD3 example code block