ChainerRL Versions

ChainerRL is a deep reinforcement learning library built on top of Chainer.

v0.8.0

4 years ago

Announcement

This release will probably be the final major update under the ChainerRL name. The development team plans to switch the backend from Chainer to PyTorch and continue development as open-source software.

Important enhancements

Important bugfixes

  • The bug that the update of CategoricalDoubleDQN was the same as that of CategoricalDQN is fixed.
  • The bug that batch training with N-step or episodic replay buffers did not work is fixed.
  • The bug that weight normalization in PrioritizedReplayBuffer with normalize_by_max == 'batch' was wrong is fixed (see the sketch below for the setting involved).
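For reference, a minimal sketch of a buffer configured with the setting the last fix touches. The constructor arguments and their values below are illustrative assumptions drawn from typical usage, not recommendations from this release.

```python
import chainerrl

# Hedged sketch: capacity and the alpha/beta schedule are placeholder values;
# normalize_by_max='batch' is the code path addressed by the fix above.
rbuf = chainerrl.replay_buffer.PrioritizedReplayBuffer(
    capacity=10 ** 6,
    alpha=0.6,
    beta0=0.4,
    betasteps=2 * 10 ** 5,
    normalize_by_max='batch',
)
```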

Important destructive changes

All updates

Enhancements

  • Recurrent DQN families with a new interface (#436)
  • Recurrent and batched TRPO (#446)
  • Add Soft Actor-Critic agent (#457)
  • Code to collect demonstrations from an agent. (#468)
  • Monitor with ContinuingTimeLimit support (#491)
  • Fix B007: Loop control variable not used within the loop body (#502)
  • Double IQN (#503)
  • Fix B006: Do not use mutable data structures for argument defaults. (#504)
  • Splits Replay Buffers into separate files in a replay_buffers module (#506)
  • Use chainer.grad in ACER (#511)
  • Prioritized Double IQN (#518)
  • Add policy loss to TD3's logged statistics (#524)
  • Adds checkpoint frequencies for serial and batch Agents. (#525)
  • Add a deterministic mode to IQN for stable tests (#529)
  • Use Link.cleargrads instead of Link.zerograds in REINFORCE (#536)
  • Use cupyx.scatter_add instead of cupy.scatter_add (#537)
  • Avoid cupy.zeros_like with numpy.ndarray (#538)
  • Use get_device_from_id since get_device is deprecated (#539)
  • Releases trained models for all reproduced agents (#565)

Documentation

  • Typo fix in Replay Buffer Docs (#507)
  • Fixes typo in docstring for AsyncEvaluator (#508)
  • Improve the algorithm list on README (#509)
  • Add Explorers to Documentation (#514)
  • Fixes syntax errors in ReplayBuffer docs. (#515)
  • Adds policies to the documentation (#516)
  • Adds demonstration collection to experiments docs (#517)
  • Adds List of Batch Agents to the README (#543)
  • Add documentation for Q-functions and some missing details in docstrings (#556)
  • Add comment on environment version difference (#582)
  • Adds ChainerRL Bibtex to the README (#584)
  • Minor Typo Fix (#585)

Examples

  • Rename examples directories (#487)
  • Adds training times for reproduced Mujoco results (#497)
  • Adds additional information to Grasping Example README (#501)
  • Fixes a comment in PPO example (#521)
  • Rainbow Scores (#546)
  • Update train_a3c.py (#547, thanks @xinyuewang1!)
  • Update train_a3c.py (#548, thanks @xinyuewang1!)
  • Improves formatting of IQN training times (#549)
  • Corrects Scores in Examples (#552)
  • Removes GPU option from README (#564)
  • Releases trained models for all reproduced agents (#565)
  • Add an example script for RoboschoolAtlasForwardWalk-v1 (#577)
  • Corrects Rainbow Results (#580)
  • Adds proper A3C scores (#581)

Testing

  • Add CI configs (#478)
  • Specify ubuntu 16.04 for Travis CI and modify a dependency accordingly (#520)
  • Remove a trailing space of DoubleIQN (#526)
  • Add a deterministic mode to IQN for stable tests (#529)
  • Fix import error when chainer==7.0.0b3 (#531)
  • Make test_monitor.py work on flexCI (#533)
  • Improve parameter distributions used in TestGaussianDistribution (#540)
  • Increase flexCI's time limit to 20min (#550)
  • Decrease the number of decimal digits required to 4 (#554)
  • Use attrs<19.2.0 with pytest (#569)
  • Run slow tests with flexCI (#575)
  • Typo fix in CI comment. (#576)
  • Adds time to DDPG Tests (#587)
  • Fix CI errors due to pyglet, zipp, mock, and gym (#592)

Bugfixes

  • Fix a bug in batch_recurrent_experiences regarding next_action (#528)
  • Fix ValueError in SARSA with GPU (#534)
  • fix function call (#541)
  • Pass env_id to replay_buffer methods to fix batch training (#558)
  • Fixes Categorical Double DQN Error. (#567)
  • Fix weight normalization inside prioritized experience replay (#570)

v0.7.0

4 years ago

Important enhancements

Important bugfixes

  • The bug that some examples use the same random seed across envs for env.seed is fixed.
  • The bug that batch training with n-step return and/or recurrent models is not successful is fixed.
  • The bug that examples/ale/train_dqn_ale.py uses LinearDecayEpsilonGreedy even when NoisyNet is used is fixed.
  • The bug that examples/ale/train_dqn_ale.py does not use the value specified by --noisy-net-sigma is fixed.
  • The bug that chainerrl.links.to_factorized_noisy does not work correctly with chainerrl.links.Sequence is fixed.

Important destructive changes

  • chainerrl.experiments.train_agent_async now requires eval_n_steps (number of timesteps for each evaluation phase) and eval_n_episodes (number of episodes for each evaluation phase) to be explicitly specified, with one of them being None.
  • examples/ale/dqn_phi.py is removed.
  • chainerrl.initializers.LeCunNormal is removed. Use chainer.initializers.LeCunNormal instead.

All updates

Enhancements

  • Rainbow (#374)
  • Make copy_param support scalar parameters (#410)
  • Enables batch DDPG agents to be trained. (#416)
  • Enables asynchronous time-based evaluations of agents. (#420)
  • Removes obsolete dqn_phi file (#424)
  • Add Branched and use it to simplify train_ppo_batch_gym.py (#427)
  • Remove LeCunNormal since Chainer has it from v3 (#428)
  • Precompute log probability in PPO (#430)
  • Recurrent PPO with a stateless recurrent model interface (#431)
  • Replace Variable.data with Variable.array (again) (#434)
  • Make IQN work with tuple observations (#435)
  • Add VectorStackFrame to reduce memory usage in train_dqn_batch_ale.py (#443)
  • DDPG example that reproduces the TD3 paper (#452)
  • TD3 agent (#453)
  • update requirements.txt and setup.py for gym (#461)
  • Support gym>=0.12.2 by stopping to use underscore methods in gym wrappers (#462)
  • Add warning about numpy 1.16.0 (#476)

Documentation

  • Link to abstract pages on ArXiv (#409)
  • fixes typo (#412)
  • Fixes file path in grasping example README (#422)
  • Add links to references (#425)
  • Fixes minor grammar mistake in A3C ALE example (#432)
  • Add explanation of examples/atari (#437)
  • Link to chainer/chainer, not pfnet/chainer (#439)
  • Link to chainer/chainer(rl), not pfnet/chainer(rl) (#440)
  • fix & add docstring for FCStateQFunctionWithDiscreteAction (#441)
  • Fixes a typo in train_agent_batch Documentation. (#444)
  • Adds Rainbow to main README (#447)
  • Fixes Docstring in IQN (#451)
  • Improves Rainbow README (#458)
  • very small fix: add missing doc for eval_performance. (#459)
  • Adds IQN Results to readme (#469)
  • Adds IQN to the documentation. (#470)
  • Adds reference to mujoco folder in the examples README (#474)
  • Fixes incorrect comment. (#490)

Examples

  • Rainbow (#374)
  • Create an IQN example aimed at reproducing the original paper and its evaluation protocol. (#408)
  • Benchmarks DQN example (#414)
  • Enables batch DDPG agents to be trained. (#416)
  • Fixes scores for Demon Attack (#418)
  • Set observation_space of kuka env correctly (#421)
  • Fixes error in setting explorer in DQN ALE example. (#423)
  • Add Branched and use it to simplify train_ppo_batch_gym.py (#427)
  • A3C Example for reproducing paper results. (#433)
  • PPO example that reproduces the "Deep Reinforcement Learning that Matters" paper (#448)
  • DDPG example that reproduces the TD3 paper (#452)
  • TD3 agent (#453)
  • Apply noisy_net_sigma parameter (#465)

Testing

  • Use Python 3.6 in Travis CI (#411)
  • Increase tolerance of TestGaussianDistribution.test_entropy since sometimes it failed (#438)
  • make FrameStack follow original spaces (#445)
  • Split test_examples.sh (#472)
  • Fix Travis error (#492)
  • Use Python 3.6 for ipynb (#493)

Bugfixes

  • bugfix (#360, thanks @corochann!)
  • Fixes error in setting explorer in DQN ALE example. (#423)
  • Make sure the agent sees when episodes end (#429)
  • Pass env_id to replay buffer methods to correctly support batch training (#442)
  • Add VectorStackFrame to reduce memory usage in train_dqn_batch_ale.py (#443)
  • Fix a bug of unintentionally using same process indices (#455)
  • Make cv2 dependency optional (#456)
  • fix ScaledFloatFrame.observation_space (#460)
  • Apply noisy_net_sigma parameter (#465)
  • Match EpisodicReplayBuffer.sample with ReplayBuffer.sample (#485)
  • Make to_factorized_noisy work with sequential links (#489)

v0.6.0

5 years ago

Important enhancements

  • Implicit Quantile Network (IQN; https://arxiv.org/abs/1806.06923) agent is added: chainerrl.agents.IQN.
  • Training DQN and its variants with N-step returns is supported.
  • Resetting the env with done=False via the info dict is supported. When env.step returns an info dict with info['needs_reset']=True, the env is reset. This feature is useful for implementing a continuing env (see the sketch after this list).
  • Evaluation with a fixed number of timesteps is supported (except for async training). This evaluation protocol is popular in Atari benchmarks.
    • examples/atari/dqn now implements the same evaluation protocol as the Nature DQN paper.
  • An example script of training a DoubleDQN agent for a PyBullet-based robotic grasping env is added: examples/grasping.
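To illustrate the needs_reset protocol mentioned above, here is a minimal sketch of a continuing env that never sets done=True and instead requests a reset through the info dict. The class, its spaces, and the step limit are hypothetical; chainerrl.wrappers.ContinuingTimeLimit appears to provide a ready-made wrapper for the same pattern.

```python
import gym
import numpy as np


class ToyContinuingEnv(gym.Env):
    """Hypothetical continuing task: done is always False, and a reset is
    requested only through info['needs_reset']."""

    def __init__(self, max_steps=1000):
        self.observation_space = gym.spaces.Box(
            low=-1.0, high=1.0, shape=(3,), dtype=np.float32)
        self.action_space = gym.spaces.Discrete(2)
        self._max_steps = max_steps
        self._t = 0

    def reset(self):
        self._t = 0
        return self.observation_space.sample()

    def step(self, action):
        self._t += 1
        obs = self.observation_space.sample()
        reward = 0.0
        done = False  # the task itself never terminates
        # Ask the training loop to reset the env without treating this
        # transition as terminal.
        info = {'needs_reset': self._t >= self._max_steps}
        return obs, reward, done, info
```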

Important bugfixes

  • The bug that PPO's obs_normalizer was not saved is fixed.
  • The bug that NonbiasWeightDecay didn't work with newer versions of Chainer is fixed.
  • The bug that argv argument was ignored by chainerrl.experiments.prepare_output_dir is fixed.

Important destructive changes

  • train_agent_with_evaluation and train_agent_batch_with_evaluation now require eval_n_steps (number of timesteps for each evaluation phase) and eval_n_episodes (number of episodes for each evaluation phase) to be explicitly specified, with one of them being None (see the sketch after this list).
  • train_agent_with_evaluation's max_episode_len argument is renamed to train_max_episode_len.
  • ReplayBuffer.sample now returns a list of lists of N experiences to support N-step returns.
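A sketch of the new call convention, adapted from the quickstart: exactly one of eval_n_steps and eval_n_episodes is given, the other is None, and the old max_episode_len keyword becomes train_max_episode_len. Here agent and env are assumed to be an already constructed ChainerRL agent and gym env, and the numeric values are placeholders.

```python
import chainerrl

chainerrl.experiments.train_agent_with_evaluation(
    agent, env,
    steps=2000,                  # total number of training steps
    eval_n_steps=None,           # evaluate by episodes rather than timesteps,
    eval_n_episodes=10,          # so exactly one of the two is None
    eval_interval=1000,          # evaluate every 1000 training steps
    train_max_episode_len=200,   # renamed from max_episode_len
    outdir='result')
```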

All updates

Enhancements

  • Implicit quantile networks (IQN) (#288)
  • Adds N-step learning for DQN-based agents. (#317)
  • Replaywarning (#321)
  • Close envs in async training (#343)
  • Allow envs to send a 'needs_reset' signal (#356)
  • Changes variable names in train_agent_with_evaluation (#358)
  • Use chainer.dataset.concat_examples in batch_states (#366)
  • Implements Time-based evaluations (#367)

Documentation

  • Add long description for pypi (#357, thanks @ljvmiranda921!)
  • A small change to the installation documentation (#369)
  • Adds a link to the ChainerRL visualizer from the main repository (#370)
  • adds implicit quantile networks to readme (#393)
  • Fix DQN.update's docstring (#394)

Examples

  • Grasping example (#371)
  • Adds Deepmind Scores to README in DQN Example (#383)

Testing

  • Fix TestTrainAgentAsync (#363)
  • Use AbnormalExitCodeWarning for nonzero exitcode warnings (#378)
  • Avoid random test failures due to asynchronousness (#380)
  • Drop hacking (#381)
  • Avoid gym 0.11.0 in Travis (#396)
  • Stabilize and speed up A3C tests (#401)
  • Reduce ACER's test cases and maximum timesteps (#404)
  • Add tests of IQN examples (#405)

Bugfixes

  • Avoid UnicodeDecodeError in setup.py (#365)
  • Save and load obs_normalizer of PPO (#377)
  • Make NonbiasWeightDecay work again (#390)
  • bug fix (#391, thanks @tappy27!)
  • Fix episodic training of DDPG (#399)
  • Fix PGT's training (#400)
  • Fix ResidualDQN's training (#402)

v0.5.0

5 years ago

Important enhancements

  • Batch synchronized training using multiple environment instances and a single GPU is supported for some agents (see the sketch after this list):
    • A2C (added as chainerrl.agents.A2C)
    • PPO
    • DQN and other agents that inherit from DQN, except SARSA
  • examples/ale/train_dqn_ale.py now follows the "Tuned DoubleDQN" setting by default and supports prioritized experience replay as an option
  • examples/atari/train_dqn.py is added as a basic example of applying DQN to Atari.
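A sketch of the env side of batch synchronized training, assuming chainerrl.envs.MultiprocessVectorEnv accepts a list of env-constructing callables; the env id and the number of workers are arbitrary.

```python
import functools

import gym

import chainerrl


def make_env(idx, seed=0):
    # One independent env instance (and seed) per worker process.
    env = gym.make('CartPole-v0')
    env.seed(seed + idx)
    return env


# A single vectorized env object; agents with batch support (A2C, PPO, and
# DQN-based agents) can then be trained against it with the
# train_agent_batch_* helpers while sharing one GPU.
vec_env = chainerrl.envs.MultiprocessVectorEnv(
    [functools.partial(make_env, idx) for idx in range(8)])
```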

Important bugfixes

  • A bug in chainerrl.agents.CategoricalDQN that degraded performance is fixed
  • A bug in atari_wrappers.LazyFrames that unnecessarily increased memory usage is fixed

Important destructive changes

  • chainerrl.replay_buffer.PrioritizedReplayBuffer and chainerrl.replay_buffer.PrioritizedEpisodicReplayBuffer are updated:
    • become FIFO (First In, First Out), reducing memory usage in Atari games
    • compute priorities more closely following the paper
  • The eval_explorer argument of chainerrl.experiments.train_agent_* is dropped (use chainerrl.wrappers.RandomizeAction for evaluation-time epsilon-greedy; see the sketch after this list)
  • The interface of chainerrl.agents.PPO has changed significantly
  • Support of Chainer v2 is dropped
  • Support of gym<0.9.7 is dropped
  • Support of loading chainerrl<=0.2.0's replay buffer is dropped
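With eval_explorer gone, evaluation-time epsilon-greedy is expressed as an env wrapper instead. A minimal sketch, assuming the wrapper takes a random_fraction argument; the 0.05 value is an arbitrary example.

```python
import gym

import chainerrl

# With probability random_fraction the wrapper replaces the agent's action
# with a uniformly random one, mimicking evaluation-time epsilon-greedy.
eval_env = chainerrl.wrappers.RandomizeAction(
    gym.make('CartPole-v0'), random_fraction=0.05)
```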

All updates

Enhancements

  • A2C (#149, thanks @iory!)
  • Add wrappers to cast observations (#160)
  • Fix on flake8 3.5.0 (#214)
  • Use ()-shaped array for scalar loss (#219)
  • FIFO prioritized replay buffer (#277)
  • Update Policy class to inherit ABCMeta (#280, thanks @uidilr!)
  • Batch PPO Implementation (#295, thanks @ljvmiranda921!)
  • Mimic the details of prioritized experience replay (#301)
  • Add ScaleReward wrapper (#304)
  • Remove GaussianPolicy and obsolete policies (#305)
  • Make random access queue sampling code cleaner (#309)
  • Support gym==0.10.8 (#324)
  • Batch A2C/PPO/DQN (#326)
  • Use RandomizeAction wrapper instead of Explorer in evaluation (#328)
  • remove duplicate lines (typo) (#329, thanks @monado3!)
  • Merge consecutive with statements (#333)
  • Use Variable.array instead of Variable.data (#336)
  • Remove code for Chainer v2 (#337)
  • Implement getitem for ActionValue (#339)
  • Count updates of DQN (#341)
  • Move Atari Wrappers (#349)
  • Render wrapper (#350)

Documentation

  • fixes minor typos (#306)
  • fixes typo (#307)
  • Typos (#308)
  • fixes readme typo (#310)
  • Adds partial list of paper implementations with links to the main README (#311)
  • Adds another paper to list (#312)
  • adds some instructions regarding testing for potential contributors (#315)
  • Remove duplication of DQN in docs (#334)
  • nit on grammar of a comment: (#354)

Examples

  • Tuned DoubleDQN with prioritized experience replay (#302)
  • adds some descriptions to parseargs arguments (#319)
  • Make clip_eps positive (#340)
  • updates env in ddpg example (#345)
  • Examples (#348)

Testing

  • Fix Travis CI errors (#318)
  • Parse Chainer version with packaging.version (#322)
  • removes tests for old replay buffer (#347)

Bugfixes

  • Fix the error caused by inexact delta_z (#314)
  • Stop caching the result of numpy.concatenate in LazyFrames (#332)

v0.4.0

5 years ago

Important enhancements

  • TRPO (trust region policy optimization) is added: chainerrl.agents.TRPO.
  • C51 (categorical DQN) is added: chainerrl.agents.CategoricalDQN.
  • NoisyNet is added: chainerrl.links.FactorizedNoisyLinear and chainerrl.links.to_factorized_noisy (see the sketch after this list).
  • Python 3.7 is supported
  • Examples were improved in terms of logging and random seed setting
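A sketch of converting an ordinary Q-function into a NoisyNet one with to_factorized_noisy. The Q-function class and its constructor arguments are assumptions based on other parts of ChainerRL, and the observation/action sizes are arbitrary.

```python
import chainerrl

# Hypothetical discrete-action Q-function: 4-dimensional observations and
# 2 actions with two hidden layers of 64 units.
q_func = chainerrl.q_functions.FCStateQFunctionWithDiscreteAction(
    ndim_obs=4, n_actions=2, n_hidden_channels=64, n_hidden_layers=2)

# Replace the Linear links inside q_func with FactorizedNoisyLinear in place,
# so exploration comes from the learned noise rather than epsilon-greedy.
chainerrl.links.to_factorized_noisy(q_func)
```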

Important destructive changes

  • The async module is renamed to async_ for Python 3.7 support, since async is a reserved keyword as of Python 3.7.

All updates

Enhancements

  • TRPO agent (#204)
  • Use numpy random (#206)
  • Add gpus argument for chainerrl.misc.set_random_seed (#207)
  • More check on nesting AttributeSavingMixin (#208)
  • show error message (#210, thanks @corochann!)
  • Add an option to set whether the agent is saved every time the score is improved (#213)
  • Make tests check exit status of subprocesses (#215)
  • make ReplayBuffer.load() compatible with v0.2.0. (#216, thanks @mr4msm!)
  • Add requirements-dev.txt (#222)
  • Align act and act_and_train's signature to the Agent interface (#230, thanks @lyx-x!)
  • Support dtype arg of spaces.Box (#231)
  • Set outdir to results and add help strings (#248)
  • Categorical DQN (C51) (#249)
  • Remove DiscreteActionValue.sample_epsilon_greedy_actions (#259)
  • Remove DQN.compute_q_values (#260)
  • Enable to change batch_states in PPO (#261, thanks @kuni-kuni!)
  • Remove unnecessary declaration and substitution of 'done' in the train_agent function (#271, thanks @uidilr!)

Documentation

  • Update the contribution guide to use pytest (#220)
  • Add docstring to ALE and fix seed range (#234)
  • Fix docstrings of DDPG (#241)
  • Update the algorithm section of README (#246)
  • Add CategoricalDQN to README (#252)
  • Remove unnecessary comments from examples/gym/train_categorical_dqn_gym.py (#255)
  • Update README.md of examples/ale (#275)

Examples

  • Fix OMP_NUM_THREADS setting (#235)
  • Improve random seed setting in ALE examples (#239)
  • Improve random seed setting for all examples (#243)
  • Use gym and atari wrappers instead of chainerrl.envs.ale (#253)
  • Remove unused args from examples/ale/train_categorical_dqn_ale.py and examples/ale/train_dqn_ale.py (#256)
  • Remove unused --profile argument (#258)
  • Hyperlink DOI against preferred resolver (#266, thanks @katrinleinweber!)

Testing

  • Fix import chainer.testing.condition (#200)
  • Use pytest (#209)
  • Fix PCL tests (#211)
  • Test loading v0.2.0 replay buffers (#217)
  • Use assertRaises instead of expectedFailure (#218)
  • Improve travis script (#242)
  • Run autopep8 in travis ci (#247)
  • Switch autopep8 and hacking (#257)
  • Use hacking 1.0 (#262)
  • Fix a too long line of PPO (#264)
  • Update to hacking 1.1.0 (#274)
  • Add tests of DQN's loss functions (#279)

Bugfixes

  • gym 0.9.6 is not working with python2 (#226)
  • Tiny fix: argument passing in SoftmaxDistribution (#228, thanks @lyx-x!)
  • Add docstring to ALE and fix seed range (#234)
  • except both Exception and KeyboardInterrupt (#250, thanks @uenoku!)
  • Switch autopep8 and hacking (#257)
  • Modify async to async_ to support Python 3.7 (#286, thanks @mmilk1231!)
  • Noisy network fixes (#287, thanks @seann999!)

v0.3.0

6 years ago

Important enhancements

  • Both Chainer v2 and v3 are now supported
  • PPO (Proximal Policy Optimization) has been added: chainerrl.agents.PPO
  • Replay buffers have been made faster

Important destructive changes

  • Episodic replay buffers' __len__ now counts the number of transitions, not episodes
  • ALE's grayscale conversion formula has been corrected
  • FCGaussianPolicyWithFixedCovariance now has a nonlinearity before the last layer

All updates

Enhancements

  • Add RMSpropAsync and NonbiasWeightDecay to optimizers/__init__.py (#113)
  • Use init_scope (#116)
  • Remove ALE dependency (#121)
  • Support environments without git command (#124)
  • Add PPO agent (#126)
  • add .gitignore (#127, thanks @knorth55!)
  • Use faster queue for replay buffers (#131)
  • Use F.matmul instead of F.batch_matmul (#141)
  • Add a utility function to draw a computational graph (#166)
  • Improve MLPBN (#171)
  • Improve StateActionQFunctions (#172)
  • Improve deterministic policies (#173)
  • Fix InvertGradients (#185)
  • Remove unused functions in DQN (#188)
  • Warn about negative exit code of child processes (#194)

Documentation

  • Add animation gifs (#107)
  • Synchronize docs version with package version (#111)
  • Add logo (#136)
  • [policies/gaussian_policy] Improve docstring (#140, thanks @iory!)
  • Improve docstrings (#142)
  • Fix a typo (#146)
  • Fix a broken link to travis ci (#153)
  • Add PPO to README as an implemented algorithm (#168)
  • Improve the docstring of AdditiveGaussian (#170)
  • Add docstring on eval_max_episode_len (#177)
  • Add docstring to DuelingDQN (#187)
  • Suppress Sphinx' warning in the docstring of PCL (#198)

Examples

  • fix typo (#122)
  • Use Chain.init_scope in the quick start (#148)
  • Draw computational graphs in train_dqn_ale.py (#192)
  • Draw computational graphs in train_dqn_gym.py (#195)
  • Draw computational graphs in train_a3c_ale.py (#197)

Testing

  • Add CHAINER_VERSION config to CI (#143)
  • Specify --outdir on 2nd test (#154)
  • Return dict for info of env.step (#162)
  • Fix import error in tests (#180)
  • Mark TestBiasCorrection as slow (#181)
  • Add tests for SingleActionValue (#191)

Bugfixes

  • Fix save/load in EpisodicReplayBuffer (#130)
  • Fix REINFORCE's missing initialization of t (#133)
  • Fix episodic buffer __len__ (#155)
  • Remove duplicated import of explorers (#163)
  • Fix missing nonlinearity before the last layer (#165)
  • Use bytestrings to write git outputs (#178)
  • Patches to envs.ALE (#182)
  • Fix QuadraticActionValue and add tests (#190)

v0.2.0

6 years ago

Enhancements:

  • Agents
    • REINFORCE #81
  • Training helper functions
    • Hook functions #85
    • Add more columns to scores.txt: episodes, max and min #78
    • Improve naming of the output directories #72 #77
    • Use logger instead of print #60
    • Make train_agent_async's eval_interval optional #93
  • Misc
    • Use Gumbel-Max trick for categorical sampling on GPU #88 #104
    • Remove test arguments from links (use chainer.config instead) #100

Fixes:

  • Fix argument names #86
  • Fix option names #71
  • Fix the issue that average_loss is not updated #95

Dependency changes:

  • Switch to Chainer v2 #100

Changes that can affect performance:

  • train_agent_async won't decay learning rate by default any more. Use hook functions instead.
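A sketch of what the replacement looks like, assuming chainerrl.experiments.LinearInterpolationHook takes (total_steps, start_value, stop_value, setter) and that the train_agent_* functions accept hooks through their hook arguments (step_hooks or global_step_hooks, depending on the function); the step count and learning rate are placeholders.

```python
import chainerrl


def lr_setter(env, agent, value):
    # Hypothetical setter: how the learning rate is reached depends on the
    # agent; here the agent is assumed to expose its optimizer directly.
    agent.optimizer.lr = value


# Linearly anneal the learning rate from 7e-4 to 0 over the whole run, then
# pass [lr_decay_hook] to the training function's hook argument.
lr_decay_hook = chainerrl.experiments.LinearInterpolationHook(
    8 * 10 ** 7, 7e-4, 0, lr_setter)
```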

v0.1.0

7 years ago

Enhancements:

  • SARSA #39
  • Boltzmann explorer #40
  • ACER for continuous actions #29
  • PCL #45 #57
  • Prioritized Replay #44 #57

Fixes:

  • Fix spelling: s/updator/updater/ #48