SLM Lab Versions

Modular Deep Reinforcement Learning framework in PyTorch. Companion library of the book "Foundations of Deep Reinforcement Learning".

v3.2.0

5 years ago

Eval rework

#275 #278 #279 #280

This release adds an eval mode that mirrors OpenAI Baselines: spawn 2 environments, 1 for training and 1 for eval. In the same process (blocking), run training as usual, then at each ckpt run an episode on the eval env and update the stats.
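A minimal sketch of this blocking flow, assuming the classic Gym API and using placeholder names (run_eval_episode, ckpt_frequency, policy) rather than the lab's actual components:

    import gym

    train_env = gym.make('CartPole-v1')  # env used for training
    eval_env = gym.make('CartPole-v1')   # second env reserved for eval

    def run_eval_episode(env, policy):
        '''Run one blocking episode on the eval env and return its total reward.'''
        state, done, total_reward = env.reset(), False, 0.0
        while not done:
            state, reward, done, _ = env.step(policy(state))
            total_reward += reward
        return total_reward

    policy = lambda state: eval_env.action_space.sample()  # random stand-in policy
    ckpt_frequency = 1000

    for t in range(10_000):
        # ... one training step on train_env goes here ...
        if t % ckpt_frequency == 0:
            # at ckpt: evaluate in the same process (blocking), then resume training
            eval_return = run_eval_episode(eval_env, policy)
            # in the lab, stats from this episode feed body.eval_df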

The logic for the stats is the same as before, except that the original body.df is now split into two: body.train_df and body.eval_df. The eval df uses the main env stats except for t and reward, which reflect progress on the eval env. Correspondingly, session analysis also produces both versions of the data.

Data from body.eval_df is used to generate session_df, session_graph, session_fitness_df, whereas the data from body.train_df is used to generate a new set of trainsession_df, trainsession_graph, trainsession_fitness_df for debugging.

The previous process-based eval functionality is kept, but is now treated as parallel_eval. This can be useful for more robust checkpointing and eval.

Refactoring

#279

  • purge useless computations
  • properly and efficiently gather and organize all update variable computations.

This also speeds up run time by 2x. For Atari Beamrider with DQN on a V100 GPU, manual benchmark measurement gives 110 FPS when training every 4 frames, while eval achieves 160 FPS. This translates to 10M frames in roughly 24 hours (110 frames/s × 86,400 s ≈ 9.5M frames).

v3.1.1

5 years ago

Docker image kengz/slm_lab:v3.0.0 released

Add Retro Eval

  • #270 add retro eval mode to rerun failed online eval sessions. Use the command yarn retro_eval data/reinforce_cartpole_2018_01_22_211751
  • #272 #273 fix eval saving 0 index to eval_session_df causing trial analysis to break; add reset_index for safety

fix Boltzmann spec

  • #271 change Boltzmann spec to use Categorical instead of the wrong Argmax
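For context, a Boltzmann (softmax) policy samples from a Categorical distribution over the temperature-scaled action preferences rather than taking an argmax; a minimal sketch (the pdparam values and tau are illustrative):

    import torch
    from torch.distributions import Categorical

    def boltzmann_action(pdparam, tau=1.0):
        '''Sample an action from a softmax (Boltzmann) distribution over pdparam / tau.'''
        probs = torch.softmax(pdparam / tau, dim=-1)
        return Categorical(probs=probs).sample().item()

    # e.g. action preferences for 4 discrete actions
    action = boltzmann_action(torch.tensor([1.0, 2.0, 0.5, 0.1]), tau=1.2)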

misc

  • #273 update the colorlover package to the proper pip release after they fixed a division error
  • #274 remove unused torchvision package to lighten build

v3.1.0

5 years ago

v3.1.0: L1 fitness norm, code and spec refactor, online eval

Docker image kengz/slm_lab:v3.1.0 released

L1 fitness norm (breaking change)

  • change fitness vector norm from L2 to L1 for intuitiveness and non-extreme values
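For reference, the two norms compared on a hypothetical fitness vector (the component values are purely illustrative):

    import numpy as np

    fitness_vector = np.array([0.9, 0.2, 0.5])   # illustrative components only
    l1 = np.linalg.norm(fitness_vector, ord=1)   # 1.6: plain sum of magnitudes
    l2 = np.linalg.norm(fitness_vector, ord=2)   # ~1.05: dominated by the largest component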

code and spec refactor

  • #254 PPO cleanup: remove hack and restore minimization scheme
  • #255 remove the use_gae and use_nstep params; infer them from lam and num_step_returns
  • #260 fix decay start_step offset, add unit tests for rate decay methods
  • #262 make epi start from 0 instead of 1 for code logic consistency
  • #264 switch max_total_t, max_epi to max_tick and max_tick_unit for directness; retire graph_x in favor of the unit above
  • #266 add Atari fitness std, fix CUDA coredump issue
  • #269 update gym, remove box2d hack

Online Eval mode

#252 #257 #261 #267 Evaluation sessions run during training on a subprocess. This does not interfere with the training process: multiple subprocesses are spawned to do independent evaluation, each appending to an eval file, and at the end a final eval finishes, plots all the graphs, and saves all the eval data.

  • enabled by the meta spec key 'training_eval' (see the spec sketch after this list)
  • configure NUM_EVAL_EPI in analysis.py
  • update enjoy and eval mode syntax. see README.
  • change ckpt behavior to use e.g. tag ckpt-epi10-totalt1000
  • add new eval mode to lab. runs on a checkpoint file. see below
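A hedged sketch of the relevant slice of a spec, written here as a Python dict; training_eval is the flag named above, while the surrounding keys are assumptions:

    # illustrative only: the exact spec layout may differ
    spec = {
        'meta': {
            'training_eval': True,  # spawn eval Sessions on a subprocess at each ckpt
            'max_session': 4,
            'max_trial': 1,
        },
    }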

Eval Session

  • add a proper eval Session which loads from the ckpt like above and does not interfere with existing files. It can be run from the terminal, and it is also used by the internal eval logic, e.g. with the command python run_lab.py data/dqn_cartpole_2018_12_20_214412/dqn_cartpole_t0_spec.json dqn_cartpole eval@dqn_cartpole_t0_s2_ckpt-epi10-totalt1000
  • when an eval session is done, it averages all of the episodes it ran and appends a row to eval_session_df.csv
  • after that it deletes the ckpt files it just used (to prevent large storage use)
  • then it runs a trial analysis to update eval_trial_graph.png and an accompanying trial_df computed as the average of all session_dfs

How eval mode works

  • checkpoint saves the models using a scheme that records the epi and total_t. This allows one to eval using the ckpt model
  • after creating ckpt files, if spec.meta.training_eval is set in train mode, a subprocess is launched using the ckpt prepath to run an eval Session, in the same way as above: python run_lab.py data/dqn_cartpole_2018_12_20_214412/dqn_cartpole_t0_spec.json dqn_cartpole eval@dqn_cartpole_t0_s2_ckpt-epi10-totalt1000 (see the sketch after this list)
  • the eval session runs as above. ckpt now runs at the starting timestep, at each ckpt timestep, and at the end
  • the main Session waits for the final eval session and its final eval trial to finish before closing, to ensure that other processes such as zipping wait for them.
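A minimal sketch of how the checkpoint step could spawn such a non-blocking eval Session; the command is the one shown above, and the variable names are illustrative:

    import subprocess

    # launch an independent eval Session on the ckpt just saved, without
    # blocking the training process
    cmd = (
        'python run_lab.py '
        'data/dqn_cartpole_2018_12_20_214412/dqn_cartpole_t0_spec.json '
        'dqn_cartpole eval@dqn_cartpole_t0_s2_ckpt-epi10-totalt1000'
    )
    eval_proc = subprocess.Popen(cmd.split())

    # ... before the main Session closes, wait for the final eval to finish:
    eval_proc.wait()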

Example eval trial graph:

[image: dqn_cartpole_t0_ckpt-eval_trial_graph]

v3.0.0

5 years ago

V3: PyTorch 1.0, faster Neural Network, Variable Scheduler

Docker image kengz/slm_lab:v3.0.0 released

PRs included #240 #241 #239 #238 #244 #248

PyTorch 1.0 and parallel CUDA

  • switch to PyTorch 1.0 with various improvements and parallel CUDA fix

new Neural Network API (breaking changes)

To accommodate more advanced features, all the networks have been reworked with better spec and code design, faster operations, and added features

  • single-tail networks now use a single tail instead of a list for fast output computation (the for loop was slow)
  • use PyTorch optim.lr_scheduler for learning rate decay; retire the old methods (see the sketch after this list)
  • more efficient spec format for network, clip_grad, lr_scheduler_spec
  • fix and add proper generalization for ConvNet and RecurrentNet
  • add full basic network unit tests
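For reference, the PyTorch scheduler pattern this moves to looks roughly like the following; StepLR and the hyperparameters are just an example, not necessarily the lab's chosen scheduler:

    import torch
    import torch.nn as nn
    import torch.optim as optim

    net = nn.Linear(4, 2)                            # stand-in for a lab network
    optimizer = optim.Adam(net.parameters(), lr=1e-3)
    # example: decay the learning rate by 0.9 every 1000 optimizer steps
    lr_scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.9)

    for step in range(5000):
        optimizer.zero_grad()
        loss = net(torch.randn(8, 4)).pow(2).mean()
        loss.backward()
        optimizer.step()
        lr_scheduler.step()                          # advance the decay schedule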

DQN

  • rewrite the DQN loss for a 2x speedup and code simplicity; extend it to SARSA (see the sketch after this list)
  • retire MultitaskDQN for HydraDQN
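A hedged sketch of the kind of vectorized DQN loss this refers to; the batch layout and network handles are placeholders, and the lab's actual implementation may differ:

    import torch
    import torch.nn.functional as F

    def dqn_loss(net, target_net, batch, gamma=0.99):
        '''Vectorized DQN TD loss: no Python loop over the batch.'''
        states, actions, rewards, next_states, dones = batch
        q = net(states).gather(-1, actions.long().unsqueeze(-1)).squeeze(-1)
        with torch.no_grad():
            next_q = target_net(next_states).max(dim=-1)[0]
            target = rewards + gamma * (1 - dones) * next_q
        return F.mse_loss(q, target)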

Memory

  • add OnpolicyConcatReplay
  • standardize preprocess_state logic in onpolicy memories

Variable Scheduler (breaking spec changes)

  • implement a variable decay class VarScheduler similar to PyTorch's LR scheduler; it uses the clock with flexible scheduling units epi or total_t (see the sketch after this list)
  • unify VarScheduler to use standard clock.max_tick_unit specified from env
  • retire action_policy_update, update agent spec to explore_var_spec
  • replace entropy_coef with entropy_coef_spec
  • replace clip_eps with clip_eps_spec (PPO)
  • update all specs
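A hedged sketch of what such a variable scheduler can look like, using linear decay over clock ticks; the class and field names are illustrative, not the lab's exact VarScheduler:

    class LinearVarScheduler:
        '''Linearly decay a variable (e.g. explore_var or entropy_coef) over clock ticks.'''
        def __init__(self, start_val, end_val, start_step, end_step):
            self.start_val, self.end_val = start_val, end_val
            self.start_step, self.end_step = start_step, end_step

        def update(self, tick):
            if tick < self.start_step:
                return self.start_val
            frac = min(1.0, (tick - self.start_step) / (self.end_step - self.start_step))
            return self.start_val + frac * (self.end_val - self.start_val)

    # e.g. epsilon decaying from 1.0 to 0.1 between ticks 0 and 10000
    epsilon = LinearVarScheduler(1.0, 0.1, 0, 10000).update(tick=5000)  # 0.55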

Math util

  • move decay methods to math_util.py
  • move math_util.py from algorithm/ to lib/

env max tick (breaking spec changes)

  • spec/variable renamings:
    • max_episode to max_epi
    • max_timestep to max_t
    • save_epi_frequency to save_frequency
    • training_min_timestep to training_start_step
  • allow env to stop based on max_epi as well as max_total_t. propagate clock unit usage
  • introduce max_tick, max_tick_unit properties to env and clock from above
  • allow save_frequency to use the same units accordingly
  • update Pong and Beamrider to use max_total_t as end-condition
  • update ray from 0.3.1 to 0.5.3 to address broken GPU with pytorch 1.0.0
  • to fix CUDA not being discovered in Ray workers, CUDA devices have to be set manually in the Ray remote function due to poor design

Improved logging and Enjoy mode

#243 #245

  • Best-model checkpointing measured using the reward_ma
  • Early termination if the environment is solved
  • the method for logging the learning rate to the session dataframe needed to be updated after the move to PyTorch lr_scheduler
  • also removed training_net from the mean learning rate reported in the session dataframe, since its learning rate doesn't change
  • update naming scheme to work with enjoy mode
  • unify and simplify prepath methods
  • info_space now uses a ckpt to load the ckpt model. Example usage: yarn start pong.json dqn_pong enjoy@data/dqn_cartpole_2018_12_02_124127/dqn_cartpole_t0_s0_ckptbest
  • update agent load and policy to properly set variables to end_val in enjoy mode
  • random-seed env as well

Working Atari

#242 The Atari benchmark had been failing, but the root cause has finally been discovered and fixed: wrong image preprocessing. Several factors can contribute, and we are doing ablation studies to check against the old code:

  • image normalization lowers the input values by a factor of ~255, and the resulting loss is too small for the optimizer
  • black frames in the stack at the beginning timesteps
  • wrong image permutation

PR #242 introduces:

  • a global environment preprocessor in the form of env wrappers borrowed from OpenAI Baselines, in env/wrapper.py
  • a TransformImage wrapper to do the proper image transform: grayscale, downsize, and reshape from w,h,c to the PyTorch format c,h,w (see the sketch after this list)
  • a FrameStack which uses LazyFrames for efficiency to replace the agent-specific Atari stack frame preprocessing. This simplifies the Atari memories
  • update convnet to use the honest shape (c,h,w) without extra transforms, and remove its expensive image axis permutation since the input is now in the right shape
  • update Vizdoom to produce (c,h,w) shape consistent with convnet input expectation
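A minimal sketch of the image transform described above, assuming OpenCV is available; the actual wrapper code in env/wrapper.py may differ:

    import cv2
    import numpy as np

    def transform_image(frame, w=84, h=84):
        '''Grayscale, downsize, and move channels first for PyTorch convnets.'''
        gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)                   # drop color channels
        small = cv2.resize(gray, (w, h), interpolation=cv2.INTER_AREA)   # downsize
        return small[np.newaxis, :, :]                                   # shape (1, h, w)

    # e.g. a raw 210x160 RGB Atari frame
    obs = transform_image(np.zeros((210, 160, 3), dtype=np.uint8))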

Tuned parameters will be obtained and released next version.

Attached is a quick training curve on Pong with DQN, where the solution average is +18:

[image: fast_dqn_pong_t0_s0_session_graph]

v2.2.0

5 years ago

Add VizDoom environment

#222 #224

  • add new OnPolicyImageReplay and ImageReplay memories
  • add VizDoom environment, thanks to @joelouismarino

Add NN Weight Initialization functionality

#223 #225

  • allow specification of NN weight init function in spec, thanks to @mwcvitkovic

Update Plotly to v3

#221

  • move to v3 to allow Python-based (instead of bash) image saving for stability

Fixes

  • #207 fix PPO loss function broken during refactoring
  • #217 fix multi-device CUDA parallelization in grad assignment

v2.1.2

5 years ago

Benchmark

  • #177 #183 zip experiment data file for easy upload
  • #178 #186 #188 #194 add benchmark spec files
  • #193 add benchmark standard data to compute fitness
  • #196 add benchmark mode

Reward scaling

  • #175 add environment-specific reward scaling

HydraDQN

  • #175 HydraDQN works on cartpole and 2dball using reward scaling. spec committed

Add code of conduct

  • #199 add a code of conduct file for community

Misc

  • #172 add MA reward to dataframe
  • #174 refactor session parallelization
  • #196 add sys args to run lab
  • #198 add train@ mode

v2.1.1

5 years ago

Enable Distributed CUDA

#170 Fix the long-standing issue of PyTorch distributed training with spawn multiprocessing failing because Lab classes are not picklable. The class is now wrapped in an mp_runner and passed as mp.Process(target=mp_runner, args), so the classes don't get cloned from memory when spawning the process; they are passed in from outside instead.
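A minimal runnable sketch of the pattern described above; DummySession and mp_runner are stand-ins for the lab's actual classes and runner:

    import torch.multiprocessing as mp

    class DummySession:
        '''Stand-in for a Lab Session; only its constructor args must be picklable.'''
        def __init__(self, spec):
            self.spec = spec

        def run(self):
            print('running session with spec', self.spec)

    def mp_runner(session_cls, spec):
        # construct the session inside the child process, so the instance itself
        # never needs to be pickled when using the spawn start method
        session_cls(spec).run()

    if __name__ == '__main__':
        mp.set_start_method('spawn')
        proc = mp.Process(target=mp_runner, args=(DummySession, {'name': 'demo'}))
        proc.start()
        proc.join()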

DQN replace method fix

#169 DQN target network replacement was going in the wrong direction; fix it.
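For clarity, standard DQN periodically replaces the target network with a copy of the online network; a minimal sketch of that direction:

    import torch.nn as nn

    net = nn.Linear(4, 2)          # online (training) network
    target_net = nn.Linear(4, 2)   # target network used for bootstrapping

    # replacement direction: target <- online
    target_net.load_state_dict(net.state_dict())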

AtariPrioritizedReplay

#170 #171 Add a quick AtariPrioritizedReplay via some multiple-inheritance black magic with PrioritizedReplay and AtariReplay
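A hedged sketch of the multiple-inheritance pattern; the class bodies here are stand-ins, while the real PrioritizedReplay and AtariReplay live in the lab's memory module:

    class PrioritizedReplay:
        '''Stand-in: priority-based sampling on top of a base replay memory.'''
        def sample_idxs(self, batch_size):
            print('prioritized sampling of', batch_size, 'indices')

    class AtariReplay:
        '''Stand-in: Atari-specific frame preprocessing and storage.'''
        def preprocess_state(self, state):
            print('atari preprocessing')

    class AtariPrioritizedReplay(PrioritizedReplay, AtariReplay):
        '''The method resolution order resolves sampling via PrioritizedReplay
        and falls back to AtariReplay for the Atari-specific methods.'''
        pass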

v2.1.0

5 years ago

This release optimizes the RAM consumption and memory sampling speed after stress-testing with Atari. RAM growth is curbed, and replay memory RAM usage is now near theoretical optimality.

Thanks to @mwcvitkovic for providing major help with this release.

Remove DataSpace history

#163

  • debug and fix memory growth (cause: data space saving history)
  • remove history saving altogether, along with mdp data; remove aeb add_single. This changes the API.
  • create body.df to track data efficiently; this is the API replacement for the above.

Optimize Replay Memory RAM

#163 first optimization, halves replay RAM

  • make the memory state numpy storage float16 to accommodate a big memory size: with half a million max_size, virtual memory goes from 200GB to 50GB
  • memory index sampling for training with a large size is very slow; add a method fast_uniform_sampling to speed it up

#165 second optimization, halves replay RAM again to the theoretical minimum

  • do not save next_states for replay memories due to redundancy
  • replace them with the sentinel self.latest_next_states during sampling (see the sketch after this list)
  • a 1 million max_size Atari replay now consumes 50GB instead of 100GB (it was 200GB before the float16 downcasting in #163)
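A hedged sketch of how a replay buffer can avoid storing next_states; the field names, including the singular latest_next_state, are illustrative rather than the lab's exact attributes:

    import numpy as np

    class CompactReplay:
        '''Store states in float16 and reconstruct next_states from the following slot.'''
        def __init__(self, max_size, state_shape):
            self.max_size = max_size
            self.states = np.zeros((max_size, *state_shape), dtype=np.float16)
            self.head = -1                 # index of the most recently written transition
            self.latest_next_state = None  # sentinel: only the newest next_state is kept

        def add(self, state, next_state):
            self.head = (self.head + 1) % self.max_size
            self.states[self.head] = state
            self.latest_next_state = next_state

        def next_state_at(self, idx):
            # the next_state of transition idx is the state stored in the next slot,
            # except for the newest transition, which uses the sentinel
            if idx == self.head:
                return self.latest_next_state
            return self.states[(idx + 1) % self.max_size]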

Add OnPolicyAtariReplay

#164

  • add OnPolicyAtariReplay memory so that policy based algorithms can be applied to the Atari suite.

Misc

  • #157 allow usage as a python module via pip install -e . or python setup.py install
  • #160 guard lab default.json creation on first install
  • #161 fix agent save method, improve logging
  • #162 split logger by session for easier debugging
  • #164 fix N-Step-returns calculation
  • #166 fix a weird pandas casting issue that caused the process to hang
  • #167 uninstall unused tensorflow and tensorboard that come with Unity ML-Agents. rebuild Docker image.
  • #168 rebuild Docker and CI images

v2.0.0

5 years ago

This major v2.0.0 release addresses user feedback on usability and feature requests:

  • makes the singleton case (single-agent-env) default
  • adds CUDA GPU support for all algorithms (except for distributed)
  • adds distributed training to all algorithms (à la A3C)
  • optimizes compute, fixes some computation bugs

Note that this release is backward-incompatible with v1.x and earlier.

v2.0.0 makes components independent of the framework so they can be used outside of SLM-Lab for development and production, and improves usability.

Singleton Mode as Default

#153

  • the singleton case (single-agent-env-body) is now the default. Any implementation needs only to worry about the singleton case. It uses the Session in the lab.
  • the space case (multi-agent-env-body) is now an extension of the singleton case. Simply add space_{method} to handle the space logic. It uses the SpaceSession in the lab.
  • make components more independent from framework
  • major logic simplification to improve usability. Simplify the AEB and init sequences. remove post_body_init()
  • make network update and grad norm check more robust

CUDA support

#153

  • add the attribute Net.cuda_id for device assignment (on a per-network basis), and auto-calculate the cuda_id from the trial and session index to distribute jobs
  • enable CUDA and add GPU support for all algorithms, except for distributed (A3C, DPPO etc.)
  • properly and automatically assign tensors to CUDA depending on whether a GPU is available and desired (see the sketch after this list)
  • run unit tests on machine with GTX 1070
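A minimal sketch of placing tensors on a per-network cuda_id only when a GPU is available and desired; the helper name is illustrative:

    import torch

    def get_device(cuda_id, use_gpu=True):
        '''Use the requested GPU only if CUDA is available and desired, else fall back to CPU.'''
        if use_gpu and torch.cuda.is_available():
            return torch.device(f'cuda:{cuda_id}')
        return torch.device('cpu')

    device = get_device(cuda_id=0)
    x = torch.zeros(4, device=device)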

Distributed Training

#153 #148

  • add distributed key to meta spec
  • enable distributed training using PyTorch multiprocessing. Create a new DistSession class which acts as the worker.
  • In distributed training, the Trial creates the global networks for the agents, then passes them to and spawns DistSessions. Effectively, the semantics of a session change from being a disjoint copy to being a training worker (see the sketch after this list).
  • make distributed training usable for both the singleton (single-agent) and space (multi-agent) cases.
  • add distributed cases to unit tests
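A hedged, runnable sketch of the A3C-style setup described above, where a global network is shared via share_memory() and worker sessions are spawned; dist_session is a stand-in for the real DistSession:

    import torch
    import torch.nn as nn
    import torch.multiprocessing as mp

    def dist_session(rank, global_net):
        '''Stand-in worker: a real DistSession would train against the shared net.'''
        out = global_net(torch.randn(1, 4))
        print(f'worker {rank} sees shared params, output shape {tuple(out.shape)}')

    if __name__ == '__main__':
        global_net = nn.Linear(4, 2)   # built once by the Trial
        global_net.share_memory()      # workers update the same underlying storage
        workers = [mp.Process(target=dist_session, args=(rank, global_net))
                   for rank in range(2)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()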

State Normalization

#155

  • add state normalization using a running mean and std: state = (state - mean) / std (see the sketch after this list)
  • apply to all algorithms
  • TODO: conduct a large-scale systematic study of the effect of state normalization vs. no normalization
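A minimal sketch of such a running normalizer using Welford's online algorithm; the class name and details are illustrative, not the lab's exact implementation:

    import numpy as np

    class RunningNorm:
        '''Track a running mean and std online, and normalize states with them.'''
        def __init__(self, shape):
            self.count = 0
            self.mean = np.zeros(shape)
            self.m2 = np.zeros(shape)  # sum of squared deviations from the mean

        def update(self, x):
            self.count += 1
            delta = x - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (x - self.mean)

        def normalize(self, state):
            self.update(state)
            std = np.sqrt(self.m2 / max(self.count, 1)) + 1e-8
            return (state - self.mean) / std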

Bug Fixes and Improvements

#153

  • save() and load() now include network optimizers
  • refactor set_manual_seed to util
  • rename StackReplay to ConcatReplay for clarity
  • improve network training check of weights and grad norms
  • introduce BaseEnv as base class to OpenAIEnv and UnityEnv
  • optimize computations, major refactoring
  • update Dockerfile and release

Misc

  • #155 add state normalization using running mean and std
  • #154 fix A2C advantage calculation for Nstep returns
  • #152 refactor SIL implementation using multi-inheritance
  • #151 refactor Memory module
  • #150 refactor Net module
  • #147 update grad clipping, norm check, multicategorical API
  • #156 fix multiprocessing on devices with CUDA when CUDA is not being used
  • #156 fix multi policy arguments to be consistent, and add missing state append logic

v1.1.2

5 years ago

This release adds PPOSIL and fixes some small issues with continuous actions and the PPO ratio computation.

Implementations

  • #145 implement PPOSIL; improve debug logging
  • #143 add Arch installer, thanks to @angel-ayala

Bug Fixes

  • #138 kill hanging Electron processes used for plotting
  • #145 fix a wrong graph update sequence in PPO causing the ratio to be 1; fix continuous action output construction; add guards
  • #146 fix continuous actions and add full tests