Modular Deep Reinforcement Learning framework in PyTorch. Companion library of the book "Foundations of Deep Reinforcement Learning".
#275 #278 #279 #280
This release adds an eval mode that matches the OpenAI Baselines behavior. Spawn 2 environments, 1 for training and 1 for eval. In the same process (blocking), run training as usual, then at each checkpoint, run an episode on the eval env and update the stats.
The logic for the stats is the same as before, except the original `body.df` is now split into two: `body.train_df` and `body.eval_df`. The eval df uses the main env stats except for `t, reward`, which reflect progress on the eval env. Correspondingly, session analysis also produces both versions of the data.
Data from `body.eval_df` is used to generate `session_df, session_graph, session_fitness_df`, whereas the data from `body.train_df` is used to generate a new set of `trainsession_df, trainsession_graph, trainsession_fitness_df` for debugging.
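A minimal sketch of the blocking checkpoint-eval described above, with hypothetical `agent` and `eval_env` stand-ins (the lab's real method names may differ):

```python
def ckpt_eval(agent, eval_env):
    """Sketch: at a checkpoint, play one episode on the eval env."""
    state = eval_env.reset()
    done, total_reward, t = False, 0.0, 0
    while not done:
        action = agent.act(state)  # act with the checkpointed policy
        state, reward, done, info = eval_env.step(action)
        total_reward += reward
        t += 1
    # only t and reward come from the eval env; the rest of the row
    # reuses the main (train) env stats, per the df split above
    return {'t': t, 'reward': total_reward}
```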
The previous process-based eval functionality is kept, but is now called `parallel_eval`. This can be useful for more robust checkpointing and eval.
#279
This also speeds up the run time by 2x. For Atari BeamRider with DQN on a V100 GPU, manual benchmark measurement gives 110 FPS for training every 4 frames, while eval achieves 160 FPS. This translates to 10M frames in roughly 24 hours.
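(Sanity check on the arithmetic: 10,000,000 frames / 110 frames per second ≈ 91,000 s ≈ 25 hours, consistent with the roughly-24-hour figure.)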
Example retro-eval usage:

```
yarn retro_eval data/reinforce_cartpole_2018_01_22_211751
```
- fix `eval_session_df` causing trial analysis to break; add `reset_index` for safety
- `use_gae` and `use_nstep` params are now inferred from `lam`, `num_step_returns` (see the sketch after this list)
- fix the `start_step` offset; add unit tests for rate decay methods
- rename `max_total_t`, `max_epi` to `max_tick` and `max_tick_unit` for directness; retire `graph_x` for the unit above
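The inference rule could look like this sketch, assuming the params live in the algorithm spec (hypothetical key access):

```python
# infer the estimator choice from which spec keys are present:
# `lam` implies GAE, `num_step_returns` implies n-step returns
use_gae = algorithm_spec.get('lam') is not None
use_nstep = algorithm_spec.get('num_step_returns') is not None
```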
#252 #257 #261 #267
Evaluation sessions during training on a subprocess. This does not interfere with the training process; it spawns multiple subprocesses to run independent evaluations, which append to an eval file, and at the end a final eval finishes, plots all the graphs, and saves all the data for eval.
- add `'training_eval'` to the spec meta (see below)
- add `NUM_EVAL_EPI` in `analysis.py`
- update the `enjoy` and `eval` mode syntax; see README
- checkpoint names now follow the form `ckpt-epi10-totalt1000`
- add `eval` mode to the lab; it runs on a checkpoint file, see below:

```
python run_lab.py data/dqn_cartpole_2018_12_20_214412/dqn_cartpole_t0_spec.json dqn_cartpole eval@dqn_cartpole_t0_s2_ckpt-epi10-totalt1000
```

- this produces `eval_session_df.csv`, `eval_trial_graph.png`, and an accompanying `trial_df` as the average of all `session_df`s
- checkpoints record `epi` and `total_t`; this allows one to eval using the ckpt model
- with `spec.meta.training_eval` in train mode, a subprocess will launch using the ckpt prepath to run an eval Session, in the same way as above (see the sketch after this list):

```
python run_lab.py data/dqn_cartpole_2018_12_20_214412/dqn_cartpole_t0_spec.json dqn_cartpole eval@dqn_cartpole_t0_s2_ckpt-epi10-totalt1000
```

Example eval trial graph:
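The `spec.meta.training_eval` launch could look roughly like this sketch, which shells out to the same CLI shown above (illustrative only; the actual lab code may spawn the Session differently):

```python
import subprocess

def launch_training_eval(spec_file, spec_name, ckpt_prename):
    # non-blocking: training continues while the eval Session runs
    cmd = ['python', 'run_lab.py', spec_file, spec_name, f'eval@{ckpt_prename}']
    return subprocess.Popen(cmd)
```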
PRs included #240 #241 #239 #238 #244 #248
To accommodate more advanced features and improvements, all the networks have been improved with better spec and code design, faster operations, and added features:

- use `optim.lr_scheduler` for learning rate decay; retire the old methods. Add `clip_grad`, `lr_scheduler_spec`
- add `OnpolicyConcatReplay`; fix the `preprocess_state` logic in onpolicy memories
- add `VarScheduler`, similar to pytorch's LR scheduler. It uses the clock with flexible scheduling units `epi` or `total_t`, with `clock.max_tick_unit` specified from the env (see the sketch after this list)
- update `action_policy_update`; update the agent spec to `explore_var_spec`
- replace `entropy_coef` with `entropy_coef_spec`
- replace `clip_eps` with `clip_eps_spec` (PPO)
- move `math_util.py` from `algorithm/` to `lib/`
- rename `max_episode` to `max_epi`, `max_timestep` to `max_t`, `save_epi_frequency` to `save_frequency`, `training_min_timestep` to `training_start_step`
- add `max_epi` as well as `max_total_t`; propagate the clock unit usage
- add `max_tick, max_tick_unit` properties to env and clock from the above
- update `save_frequency` to use the same units accordingly
- use `max_total_t` as the end-condition
- upgrade `0.3.1` to `0.5.3` to address broken GPU with pytorch 1.0.0
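A minimal sketch of a `VarScheduler`-style linear decay, assuming an interface analogous to pytorch's LR scheduler as described above (the real class likely supports more decay methods and reads the tick from the clock):

```python
class LinearVarScheduler:
    """Linearly anneal a variable (e.g. an exploration var) over clock ticks."""
    def __init__(self, start_val, end_val, start_step, end_step):
        self.start_val, self.end_val = start_val, end_val
        self.start_step, self.end_step = start_step, end_step

    def update(self, tick):
        # tick is the clock value in the configured unit (epi or total_t)
        span = max(self.end_step - self.start_step, 1)
        frac = (tick - self.start_step) / span
        frac = min(max(frac, 0.0), 1.0)  # clamp progress to [0, 1]
        return self.start_val + frac * (self.end_val - self.start_val)
```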
#243 #245
- add `ckpt` for loading a ckpt model. Example usage: `yarn start pong.json dqn_pong enjoy@data/dqn_cartpole_2018_12_02_124127/dqn_cartpole_t0_s0_ckptbest`
- `end_val` in enjoy mode

#242
The Atari benchmark had been failing, but the root cause has finally been discovered and fixed: wrong image preprocessing. This can be due to several factors, and we are doing ablation studies to check against the old code:
- Image normalization causes the input values to be scaled down by ~255, and the resultant loss is too small for the optimizer.
PR #242 introduces:
- `env/wrapper.py`
- `TransformImage` to do the proper image transform: grayscale, downsize, and reshape from (w, h, c) to the PyTorch format (c, h, w) (see the sketch after this list)
- `FrameStack`, which uses `LazyFrames` for efficiency, to replace the agent-specific Atari stack-frame preprocessing. This simplifies the Atari memories.

Tuned parameters will be obtained and released in the next version.
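An illustrative sketch of `TransformImage` (not the exact SLM-Lab code), assuming OpenAI Gym's `ObservationWrapper` API; pixels stay as raw `uint8`, in line with the normalization finding above. `FrameStack` with `LazyFrames` follows the standard OpenAI Baselines pattern of stacking frame references instead of copies.

```python
import cv2
import gym
import numpy as np

class TransformImage(gym.ObservationWrapper):
    def __init__(self, env, w=84, h=84):
        super().__init__(env)
        self.w, self.h = w, h
        # (a full wrapper would also update self.observation_space)

    def observation(self, frame):
        frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)  # grayscale
        frame = cv2.resize(frame, (self.w, self.h))      # downsize
        return frame[np.newaxis, :, :]                   # (c, h, w), uint8
```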
Attached is a quick training curve on Pong with DQN, where the solution average is +18:
#222 #224
- add `OnPolicyImageReplay` and `ImageReplay` memories #223 #225
- #221 add `train@` mode

#170
Fix the long-standing pytorch + distributed `spawn` multiprocessing failure, caused by Lab classes not being pickleable. The class is now wrapped in an `mp_runner` function passed as `mp.Process(target=mp_runner, args)`, so the classes don't get cloned from memory when spawning a process; they are constructed from arguments passed in from outside.
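A sketch of the pattern, with hypothetical `session_cls` and `spec` arguments: only picklable objects cross the process boundary, and the unpicklable Lab object is constructed inside the child.

```python
import torch.multiprocessing as mp

def mp_runner(session_cls, spec):
    # construct the Lab object here, inside the spawned child,
    # instead of pickling a live instance from the parent
    session = session_cls(spec)
    return session.run()

def spawn_session(session_cls, spec):
    p = mp.Process(target=mp_runner, args=(session_cls, spec))
    p.start()
    p.join()
```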
#169 DQN target network replacement was in the wrong direction. Fix that.
#170 #171
Add a quick `AtariPrioritizedReplay` via some multiple-inheritance black magic with `PrioritizedReplay` and `AtariReplay`:
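The composition is roughly this (illustrative stubs, not the real classes); Python's MRO resolves any overlapping methods left to right:

```python
class PrioritizedReplay:
    def sample_idxs(self, batch_size):
        ...  # priority-weighted index sampling

class AtariReplay:
    def preprocess_state(self, state):
        ...  # Atari-specific frame preprocessing

# inherits prioritized sampling and Atari preprocessing in one class
class AtariPrioritizedReplay(PrioritizedReplay, AtariReplay):
    pass
```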
This release optimizes the RAM consumption and memory sampling speed after stress-testing with Atari. RAM growth is curbed, and replay memory RAM usage is now near theoretical optimality.
Thanks to @mwcvitkovic for providing major help with this release.
#163
- retire `add_single`. This changes the API.
- add `body.df` to track data efficiently as a replacement. This is the API replacement for the above.

#163 first optimization, halves replay RAM:
- use `float16` to accommodate big memory sizes; at half a million `max_size`, virtual memory goes from 200GB to 50GB
- add `fast_uniform_sampling` to speed up sampling

#165 second optimization, halves replay RAM again to the theoretical minimum (see the sketch after this list):
- stop storing `next_states` in replay memories due to redundancy
- use `self.latest_next_states` during sampling
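A sketch of the two storage tricks together (hypothetical buffer layout, not the lab's actual class). The 4x drop from ~200GB to ~50GB is consistent with moving from numpy's 8-byte `float64` default to 2-byte `float16`; dropping `next_states` then halves what remains:

```python
import numpy as np

class Replay:
    """Sketch of the storage trick; not the lab's actual class."""
    def __init__(self, max_size, state_dim):
        # float16 storage: 2 bytes per value instead of numpy's 8-byte default
        self.states = np.zeros((max_size, state_dim), dtype=np.float16)
        self.latest_next_states = None  # newest next_state has no slot yet
        self.head = -1                  # index of the most recent state

    def sample_next_states(self, idxs):
        # states[i + 1] is next_states[i], so next_states is never stored
        next_states = self.states[(idxs + 1) % len(self.states)].astype(np.float32)
        # the newest transition's next state is patched in from the side
        next_states[idxs == self.head] = self.latest_next_states
        return next_states
```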
#164
- install via `pip install -e .` or `python setup.py install`
- `default.json` creation on first install

This major v2.0.0 release addresses user feedback on usability and feature requests:
Note that this release is backward-incompatible with v1.x and earlier.

v2.0.0: make components independent of the framework so they can be used outside of SLM-Lab for development and production, and improve usability. Backward-incompatible with v1.x.
- #153 `Session` in lab: `space_{method}` to handle the space logic; uses the `SpaceSession` in lab with `post_body_init()`
- #153 `Net.cuda_id` for device assignment (per-network basis); auto-calculate the `cuda_id` by trial and session index to distribute jobs (see the sketch after this list)
- #153 #148 add a `distributed` key to the meta spec and a `DistSession` class which acts as the worker. `Trial` creates the global networks for agents, then passes them to and spawns `DistSession`s. Effectively, the semantics of a session change from being a disjoint copy to being a training worker.
- #155 state normalization: `state = (state - mean) / std`
- #153 `save()` and `load()` now include network optimizers
- add `set_manual_seed` to util
- rename `StackReplay` to `ConcatReplay` for clarity
- add `BaseEnv` as base class to `OpenAIEnv` and `UnityEnv`
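A plausible sketch of the `cuda_id` auto-calculation (the exact formula in the lab may differ): flatten (trial, session) into a job index and round-robin across available devices.

```python
import torch

def auto_cuda_id(trial_index, session_index, num_sessions):
    if not torch.cuda.is_available():
        return None  # fall back to CPU
    job_index = trial_index * num_sessions + session_index
    return job_index % torch.cuda.device_count()
```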
This release adds PPOSIL, fixes some small issues with continuous actions, and fixes the PPO ratio computation.
- #145 implement PPOSIL; improve debug logging
- #143 add Arch installer, thanks to @angel-ayala
- #138 kill hanging Electron processes from plotting
- #145 fix the PPO wrong graph-update sequence causing the ratio to always be 1 (see the sketch after this list); fix continuous action output construction; add guards
- #146 fix continuous actions and add full tests
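For context on the ratio bug: PPO's ratio r = π_new(a|s) / π_old(a|s) must be computed against a frozen old policy; if the old policy is refreshed at the wrong point in the update sequence, r is identically 1 and the clipped objective does nothing. A minimal sketch of the correct computation (assumed tensor inputs, not the lab's code):

```python
import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    # old_log_probs must come from the policy *before* this update round
    ratio = torch.exp(log_probs - old_log_probs.detach())
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(surr1, surr2).mean()
```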