Packaged deep reinforcement learning algorithms in TensorFlow 2.x
Install swig using apt or brew, depending on your OS, then install the package:
pip install git+https://github.com/unsignedrant/rlalgorithms-tf2
Notes:
To use Atari environments or run the tests, you need to install the Atari ROMs, as described by atari-py:
mkdir Roms
wget http://www.atarimania.com/roms/Roms.rar
unrar e -r Roms.rar Roms
python -m atari_py.import_roms Roms
ale-import-roms --import-from-pkg atari_py.atari_roms
Verify installation
rlalgorithms-tf2
OUT:
rlalgorithms-tf2 1.0.1
Usage:
rlalgorithms-tf2 <command> <agent> [options] [args]
Available commands:
train Train given an agent and environment
play Play a game given a trained agent and environment
tune Tune hyperparameters given an agent, hyperparameter specs, and environment
Use rlalgorithms-tf2 <command> to see more info about a command
Use rlalgorithms-tf2 <command> <agent> to see more info about command + agent
rlalgorithms-tf2 is a TensorFlow-based mini-library that facilitates experimentation with existing reinforcement learning algorithms, as well as the implementation of new ones. It provides well-tested components that can be easily modified or extended. The available selection of algorithms can be used directly or through the command line.
Visualization of training is supported, as well as many other features provided by wandb.
All agents support multiple environments, whose operations are conducted inside the TensorFlow graph. This boosts training speed without the overhead of creating a process per environment. Atari and other environments that return images are wrapped in LazyFrames, which significantly lowers memory usage.
There are two kinds of replay buffers available, ReplayBuffer1 and ReplayBuffer2 (used in the examples below from rlalgorithms-tf2.utils.buffers). Both support a maximum size and an initial size, and are usually combined with LazyFrames for memory efficiency; a minimal setup is sketched below.
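A minimal per-environment buffer setup, following the ACER and DQN examples later in this document (sizes are illustrative, and the package is assumed to import with underscores):
from rlalgorithms_tf2.utils.buffers import ReplayBuffer1
from rlalgorithms_tf2.utils.common import create_envs

envs = create_envs('PongNoFrameskip-v4', 4)
# one buffer per environment, each with a maximum size, an initial size and a batch size
buffers = [
    ReplayBuffer1(10000, initial_size=1000, batch_size=32) for _ in range(len(envs))
]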
All features are available through the command line. For more details, check the command line options listed further below.
The command line tuning interface is based on optuna, which provides many hyperparameter features and types. Three types are currently used by rlalgorithms-tf2:
Categorical:
rlalgorithms-tf2 tune <agent> --env <env> --interesting-param <val1> <val2> <val3> # ...
Int / log uniform:
rlalgorithms-tf2 tune <agent> --env <env> --interesting-param <min-val> <max-val>
In both examples, if --interesting-param is not specified, it keeps its default value, or a fixed value if only one value is specified. Some visualization options are also available through optuna.visualization.matplotlib:
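For example, an existing study can be loaded and plotted with optuna's own API (a minimal sketch; the study name and storage URL are placeholders matching the tuning example further below):
import matplotlib.pyplot as plt
import optuna
from optuna.visualization.matplotlib import plot_optimization_history, plot_param_importances

# load a study previously created by the tune command / optuna.create_study
study = optuna.load_study(study_name='ppo-example', storage='sqlite:///ppo-example.db')
plot_optimization_history(study)  # objective value per trial
plot_param_importances(study)  # relative hyperparameter importances
plt.show()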
Early stopping is applied when a plateau is reached a pre-specified number of times without any improvement; on each plateau, the learning rate is reduced by a pre-determined factor. To activate these features, pass (see the example below):
--divergence-monitoring-steps <train-steps-at-which-should-monitor>
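For example (illustrative values, combined with the monitoring flags listed in the command line options):
rlalgorithms-tf2 train a2c --env PongNoFrameskip-v4 --preprocess --target-reward 19 --divergence-monitoring-steps 500000 --plateau-reduce-factor 0.9 --plateau-reduce-patience 10 --early-stop-patience 3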
 | A2C | ACER | DDPG | DQN | PPO | TD3 | TRPO |
---|---|---|---|---|---|---|---|
Discrete | Yes | Yes | No | Yes | Yes | No | Yes |
Continuous | Yes | No | Yes | No | Yes | Yes | Yes |
Main components are covered using pytest.
To facilitate experimentation and eliminate redundancy, all agents support loading models by passing either --model <model.cfg>, or --actor-model <actor.cfg> and --critic-model <critic.cfg>. If no models are passed, the default ones will be loaded. A typical model.cfg file would look like:
[convolutional-0]
filters=32
size=8
stride=4
activation=relu
initializer=orthogonal
gain=1.4142135
[convolutional-1]
filters=64
size=4
stride=2
activation=relu
initializer=orthogonal
gain=1.4142135
[convolutional-2]
filters=64
size=3
stride=1
activation=relu
initializer=orthogonal
gain=1.4142135
[flatten-0]
[dense-0]
units=512
activation=relu
initializer=orthogonal
gain=1.4142135
common=1
[dense-1]
initializer=orthogonal
gain=0.01
output=1
[dense-2]
initializer=orthogonal
gain=1.0
output=1
This should generate a Keras model similar to the one below, with output units 6 and 1 respectively:
Model: "model"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_1 (InputLayer) [(None, 84, 84, 1)] 0
__________________________________________________________________________________________________
conv2d (Conv2D) (None, 20, 20, 32) 2080 input_1[0][0]
__________________________________________________________________________________________________
conv2d_1 (Conv2D) (None, 9, 9, 64) 32832 conv2d[0][0]
__________________________________________________________________________________________________
conv2d_2 (Conv2D) (None, 7, 7, 64) 36928 conv2d_1[0][0]
__________________________________________________________________________________________________
flatten (Flatten) (None, 3136) 0 conv2d_2[0][0]
__________________________________________________________________________________________________
dense (Dense) (None, 512) 1606144 flatten[0][0]
__________________________________________________________________________________________________
dense_1 (Dense) (None, 6) 3078 dense[0][0]
__________________________________________________________________________________________________
dense_2 (Dense) (None, 1) 513 dense[0][0]
==================================================================================================
Total params: 1,681,575
Trainable params: 1,681,575
Non-trainable params: 0
__________________________________________________________________________________________________
Notes
common=1 marks a layer to be reused by the following layers, which means dense-1 and dense-2 are called on the output of dense-0.
Supported initializers are orthogonal and glorot_uniform; to add more, you'll have to modify rlalgorithms-tf2.utils.common.ModelReader.initializers.
output=1 marks a layer as an output, which will be appended to the outputs of the resulting tf.keras.Model.
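Such a .cfg can be turned into a model with ModelReader, following the usage shown in the examples below (a sketch only; the package is assumed to import with underscores, and model.cfg is a placeholder path):
from rlalgorithms_tf2.utils.common import ModelReader

# output units 6 and 1 correspond to the two output=1 dense layers in the config above
model = ModelReader(
    'model.cfg',
    output_units=[6, 1],
    input_shape=(84, 84, 1),
    optimizer='adam',
).build_model()
model.summary()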
Saving training history is available for further benchmarking / visualizing results.
This is achieved by specifying --history-checkpoint <history.parquet>, which results in a .parquet file that is updated at the end of each episode. A sample data point will have these columns:
mean_reward: most recent mean of agent episode rewards.
best_reward: most recent best of agent episode rewards.
episode_reward: most recent episode reward.
step: most recent agent step.
time: training elapsed time.
This enables producing plots similar to the ones below, using rlalgorithms-tf2.utils.common.plot_history.
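plot_history is the library helper; as a rough manual equivalent, the saved .parquet can also be plotted directly with pandas and matplotlib (column names as listed above, file path is a placeholder):
import matplotlib.pyplot as plt
import pandas as pd

history = pd.read_parquet('history.parquet')
# mean reward vs. agent step, one row per finished episode
plt.plot(history['step'], history['mean_reward'])
plt.xlabel('step')
plt.ylabel('mean_reward')
plt.show()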
All operation results are reproducible by passing --seed <some-seed>, or seed=some_seed to the agent constructor.
Gameplay visual output can be saved as .jpg frames by passing --frame-dir <some-dir> to the play command.
Weights are saved in .tf format by specifying --checkpoints <ckpt1.tf> <ckpt2.tf>. To resume training, --weights <ckpt1.tf> <ckpt2.tf> should load the weights saved earlier. If --history-checkpoint <ckpt.parquet> is specified, the file is looked for and, if found, further training history will be saved to the same ckpt.parquet, and the agent metrics will be updated with the most recent ones contained in the history file.
All agents / commands are available through the command line.
rlalgorithms-tf2 <command> <agent> [options] [args]
Note: unless called from the command line with --weights passed, all models passed to agents in code should be loaded with weights beforehand when resuming training or playing.
Through command line
rlalgorithms-tf2 train a2c --env PongNoFrameskip-v4 --n-env 16 --target-reward 19 --preprocess
Through direct importing
import rlalgorithms_tf2  # the package name is assumed to import with underscores
from rlalgorithms_tf2 import A2C
from rlalgorithms_tf2.utils.common import ModelReader, create_envs
envs = create_envs('PongNoFrameskip-v4')
model = ModelReader(
rlalgorithms_tf2.agents['a2c']['model']['cnn'][0],
output_units=[6, 1],
input_shape=envs[0].observation_space.shape,
optimizer='adam',
).build_model()
agent = A2C(envs, model)
Then either max_steps or target_reward should be specified to start training:
agent.fit(target_reward=19)
Through command line
rlalgorithms-tf2 play a2c --env PongNoFrameskip-v4 --preprocess --weights <trained-a2c-weights> --render
Through direct importing
import rlalgorithms_tf2
from rlalgorithms_tf2 import A2C
from rlalgorithms_tf2.utils.common import ModelReader, create_envs
envs = create_envs('PongNoFrameskip-v4')
model = ModelReader(
rlalgorithms_tf2.agents['a2c']['model']['cnn'][0],
output_units=[6, 1],
input_shape=envs[0].observation_space.shape,
optimizer='adam',
).build_model()
model.load_weights(
'/path/to/trained-weights.tf'
).expect_partial()
agent = A2C(envs, model)
agent.play(render=True)
Save frames
agent.play(frame_dir='/path/to/frame-dir/')
or
rlalgorithms-tf2 play a2c --frame-dir /path/to/frame-dir/
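The saved frames can then be stitched into a clip with any standard tool; for instance, a quick sketch using imageio (not part of this library; paths are placeholders):
import glob

import imageio

# assumes frame filenames sort in playback order
frames = [imageio.imread(f) for f in sorted(glob.glob('/path/to/frame-dir/*.jpg'))]
imageio.mimsave('gameplay.gif', frames)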
Notes
Due to an issue with tensorflow that causes occasional memory leaks when trials are run consecutively using:
study.optimize(objective, n_trials=100)
the current implementation runs trials in separate processes that are killed after each trial to release the resources. Therefore, you may find the suggested non-command-line example different from optuna's docs.
There are hyperparameters that accept min and max values, and others that accept n values. To know which is which, check the hp_type column in the help menu table: categorical takes any number of values, otherwise min and max are expected.
For more info about how the optimization algorithms work under the hood, you may want to check optuna docs.
Tuning from later stages of training is available by passing --weights <weights1.tf> <weights2.tf>, which loads the agent's respective model weights, and tuning starts from there.
Only the selected hyperparameters are tuned; the rest keep their default values and will not be tuned, or can be given a single fixed value with --flag <val>.
Also, due to the tensorflow issue mentioned above, tensorflow logging is silenced using the TF_CPP_MIN_LOG_LEVEL environment variable, to prevent each trial process from displaying the same import log messages over and over.
Through command line
!TF_CPP_MIN_LOG_LEVEL=3 rlalgorithms-tf2 tune ppo --env PongNoFrameskip-v4 --study ppo-carnival --storage sqlite:///ppo-carnival.db --trial-steps 500000 --n-trials 100 --warmup-trials 3 --preprocess --n-envs 16 32 --lr 1e-5 1e-2 --opt-epsilon 1e-7 1e-4 --gamma 0.9 0.999 --entropy-coef 0.01 0.2 --n-steps 16 32 64 128 --lam 0.7 0.99
Through direct importing
import os

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

from concurrent.futures import ProcessPoolExecutor, as_completed

import numpy as np
import optuna
import tensorflow as tf
from tensorflow.keras.optimizers import Adam

import rlalgorithms_tf2  # the package name is assumed to import with underscores
from rlalgorithms_tf2 import PPO
from rlalgorithms_tf2.utils.common import ModelReader, create_envs


def get_hparams(trial):
    return {
        'n_steps': int(
            trial.suggest_categorical('n_steps', [2 ** i for i in range(2, 11)])
        ),
        'learning_rate': trial.suggest_loguniform('learning_rate', 1e-5, 1e-2),
        'epsilon': trial.suggest_loguniform('epsilon', 1e-7, 1e-1),
        'entropy_coef': trial.suggest_loguniform('entropy_coef', 1e-8, 2e-1),
        'n_envs': int(
            trial.suggest_categorical('n_envs', [2 ** i for i in range(4, 7)])
        ),
        'grad_norm': trial.suggest_uniform('grad_norm', 0.1, 10.0),
        'lam': trial.suggest_loguniform('lam', 0.65, 0.99),
        'clip_norm': trial.suggest_loguniform('clip_norm', 0.01, 10),
    }


def optimize_agent(trial):
    hparams = get_hparams(trial)
    envs = create_envs('BreakoutNoFrameskip-v4', hparams['n_envs'])
    optimizer = Adam(
        hparams['learning_rate'],
        epsilon=hparams['epsilon'],
    )
    model_cfg = rlalgorithms_tf2.agents['ppo']['model']['cnn'][0]
    model = ModelReader(
        model_cfg,
        output_units=[envs[0].action_space.n, 1],
        input_shape=envs[0].observation_space.shape,
        optimizer=optimizer,
    ).build_model()
    model.compile(optimizer)
    agent = PPO(
        envs,
        model,
        entropy_coef=hparams['entropy_coef'],
        grad_norm=hparams['grad_norm'],
        n_steps=hparams['n_steps'],
        lam=hparams['lam'],
        clip_norm=hparams['clip_norm'],
        trial=trial,
        quiet=True,
    )
    steps = 500000
    agent.fit(max_steps=steps)
    current_rewards = np.around(np.mean(agent.total_rewards), 2)
    if not np.isfinite(current_rewards):
        current_rewards = 0
    return current_rewards


def run_trial():
    optuna.logging.set_verbosity(optuna.logging.ERROR)
    tf.get_logger().setLevel('ERROR')
    study = optuna.create_study(
        study_name='ppo-example',
        storage='sqlite:///ppo-example.db',
        load_if_exists=True,
        direction='maximize',
    )
    optuna.logging.set_verbosity(optuna.logging.INFO)
    study.optimize(optimize_agent, n_trials=1)


if __name__ == '__main__':
    # each trial runs in its own short-lived process to release tensorflow resources
    for _ in range(100):
        with ProcessPoolExecutor(1) as executor:
            future_trials = [executor.submit(run_trial)]
            for future_trial in as_completed(future_trials):
                future_trial.result()
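After (or during) the runs, the best result found so far can be read back from the same storage (a minimal sketch using optuna's API; names match the example above):
import optuna

study = optuna.load_study(study_name='ppo-example', storage='sqlite:///ppo-example.db')
print(study.best_value)  # best mean reward reached by a completed trial
print(study.best_params)  # hyperparameters of the best trial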
Note: not all the flags listed below are available at once. To see which ones apply to the command you passed, use:
rlalgorithms-tf2 <command>
or
rlalgorithms-tf2 <command> <agent>
which lists the command and agent options combined.
Flags (Available for all agents)
flags | help | default | hp_type |
---|---|---|---|
--checkpoints | Path(s) to new model(s) to which checkpoint(s) will be saved during training | - | - |
--display-precision | Number of decimals to be displayed | 2 | - |
--divergence-monitoring-steps | Steps after which, plateau and early stopping are active | - | - |
--early-stop-patience | Minimum plateau reduces to stop training | 3 | - |
--gamma | Discount factor | 0.99 | log_uniform |
--history-checkpoint | Path to .parquet file to save training history | - | - |
--log-frequency | Log progress every n games | - | - |
--plateau-reduce-factor | Factor multiplied by current learning rate when there is a plateau | 0.9 | - |
--plateau-reduce-patience | Minimum non-improvements to reduce lr | 10 | - |
--quiet | If specified, no messages by the agent will be displayed to the console | - | - |
--reward-buffer-size | Size of the total reward buffer, used for calculating the mean reward value to be displayed | 100 | - |
--seed | Random seed | - | - |
flags | help | default | hp_type |
---|---|---|---|
--beta1 | Beta1 passed to a tensorflow.keras.optimizers.Optimizer | 0.9 | log_uniform |
--beta2 | Beta2 passed to a tensorflow.keras.optimizers.Optimizer | 0.999 | log_uniform |
--env | gym environment id | - | - |
--lr | Learning rate passed to a tensorflow.keras.optimizers.Optimizer | 0.0007 | log_uniform |
--max-frame | If specified, max & skip will be applied during preprocessing | - | categorical |
--n-envs | Number of environments to create | 1 | categorical |
--opt-epsilon | Epsilon passed to a tensorflow.keras.optimizers.Optimizer | 1e-07 | log_uniform |
--preprocess | If specified, states will be treated as atari frames and preprocessed accordingly | - | - |
--weights | Path(s) to model(s) weight(s) to be loaded by agent output_models | - | - |
flags | help |
---|---|
--max-steps | Maximum number of environment steps, when reached, training is stopped |
--monitor-session | Wandb session name |
--target-reward | Target reward when reached, training is stopped |
flags | help | default |
---|---|---|
--action-idx | Index of action output by agent.model | 0 |
--frame-delay | Delay between rendered frames | 0 |
--frame-dir | Path to directory to save game frames | - |
--frame-frequency | If --frame-dir is specified, save frames every n frames. | 1 |
--render | If specified, the gameplay will be rendered | - |
flags | help | default |
---|---|---|
--n-jobs | Parallel trials | 1 |
--n-trials | Number of trials to run | 1 |
--non-silent | tensorflow, optuna and agent are silenced at trial start to avoid repetitive import messages at each trial start, unless this flag is specified | - |
--storage | Database url | - |
--study | Name of optuna study | - |
--trial-steps | Maximum steps for a trial | 500000 |
--warmup-trials | warmup trials before pruning starts | 5 |
flags | help | default | hp_type |
---|---|---|---|
--buffer-batch-size | Replay buffer batch size | 32 | categorical |
--buffer-initial-size | Replay buffer initial size | - | int |
--buffer-max-size | Maximum replay buffer size | 10000 | int |
General notes
--model <model.cfg>, or --actor-model <actor.cfg> and --critic-model <critic.cfg>, are optional, which means that if not specified, the default model(s) will be loaded, so you don't have to worry about it.
Atari environments need the --preprocess flag for image preprocessing.
--checkpoints <checkpoint1.tf> <checkpoint2.tf> should be specified for the model(s) to be saved. The number of passed checkpoints should match the number of models the agent accepts.
The same applies to --weights <weights1.tf> <weights2.tf>: they should match the number of agent models.
seed=some_seed should be passed to the agent constructor and to the ModelReader constructor if specified from code. From the command line, all you need is to pass --seed <some-seed>.
history_checkpoint=some_history.parquet should be specified to the agent constructor, or alternatively use --history-checkpoint <some-history.parquet>. If the history checkpoint exists, training metrics will automatically resume from where they left off.
flags | help | default | hp_type |
---|---|---|---|
--entropy-coef | Entropy coefficient for loss calculation | 0.01 | log_uniform |
--grad-norm | Gradient clipping value passed to tf.clip_by_value() | 0.5 | log_uniform |
--model | Path to model .cfg file | - | - |
--n-steps | Transition steps | 5 | categorical |
--value-loss-coef | Value loss coefficient for value loss calculation | 0.5 | log_uniform |
Command line
rlalgorithms-tf2 train a2c --env PongNoFrameskip-v4 --target-reward 19 --n-envs 16 --preprocess --checkpoints a2c-pong.tf
OR
rlalgorithms-tf2 train a2c --env BipedalWalker-v3 --target-reward 100 --n-envs 16 --checkpoints a2c-bipedal-walker.tf
Non-command line
from tensorflow.keras.optimizers import Adam
import rlalgorithms_tf2
from rlalgorithms_tf2 import A2C
from rlalgorithms_tf2.utils.common import ModelReader, create_envs
envs = create_envs('PongNoFrameskip-v4', 16)
model_cfg = rlalgorithms_tf2.agents['a2c']['model']['cnn'][0]
optimizer = Adam(learning_rate=7e-4)
model = ModelReader(
model_cfg,
output_units=[envs[0].action_space.n, 1],
input_shape=envs[0].observation_space.shape,
optimizer=optimizer,
).build_model()
agent = A2C(envs, model, checkpoints=['a2c-pong.tf'])
agent.fit(target_reward=19)
And for BipedalWalker-v3, the only difference is that you have to specify preprocess=False to create_envs().
For ACER, transitions are sampled in n-steps; therefore buffer-batch-size is set to 1.
flags | help | default | hp_type |
---|---|---|---|
--delta | delta param used for trust region update | 1 | log_uniform |
--ema-alpha | Moving average decay passed to tf.train.ExponentialMovingAverage() | 0.99 | log_uniform |
--entropy-coef | Entropy coefficient for loss calculation | 0.01 | log_uniform |
--epsilon | epsilon used in gradient updates | 1e-06 | log_uniform |
--grad-norm | Gradient clipping value passed to tf.clip_by_value() | 10 | log_uniform |
--importance-c | Importance weight truncation parameter. | 10.0 | log_uniform |
--model | Path to model .cfg file | - | - |
--n-steps | Transition steps | 20 | categorical |
--replay-ratio | Lam value passed to np.random.poisson() | 4 | categorical |
--trust-region | True by default; if this flag is specified, trust region updates will be used | - | - |
--value-loss-coef | Value loss coefficient for value loss calculation | 0.5 | log_uniform |
Command line
rlalgorithms-tf2 train acer --env PongNoFrameskip-v4 --target-reward 19 --n-envs 16 --preprocess --checkpoints acer-pong.tf --buffer-max-size 5000 --buffer-initial-size 500 --trust-region
Non-command line
from tensorflow.keras.optimizers import Adam
import rlalgorithms_tf2
from rlalgorithms_tf2 import ACER
from rlalgorithms_tf2.utils.buffers import ReplayBuffer1
from rlalgorithms_tf2.utils.common import ModelReader, create_envs
envs = create_envs('PongNoFrameskip-v4', 16)
buffers = [
ReplayBuffer1(5000, initial_size=500, batch_size=1) for _ in range(len(envs))
]
model_cfg = rlalgorithms_tf2.agents['acer']['model']['cnn'][0]
optimizer = Adam(learning_rate=7e-4)
model = ModelReader(
model_cfg,
output_units=[envs[0].action_space.n, envs[0].action_space.n],
input_shape=envs[0].observation_space.shape,
optimizer=optimizer,
).build_model()
agent = ACER(envs, model, buffers, checkpoints=['acer-pong.tf'])
agent.fit(target_reward=19)
Note: the number of iterations per train step is determined by the agent unless --gradient-steps is specified.
flags | help | default | hp_type |
---|---|---|---|
--actor-model | Path to actor model .cfg file | - | - |
--critic-model | Path to critic model .cfg file | - | - |
--gradient-steps | Number of iterations per train step | - | int |
--step-noise-coef | Coefficient multiplied by noise added to actions to step | 0.1 | log_uniform |
--tau | Value used for syncing target model weights | 0.005 | log_uniform |
Command line
rlalgorithms-tf2 train ddpg --env BipedalWalker-v3 --target-reward 100 --n-envs 16 --checkpoints ddpg-actor-bipedal-walker.tf ddpg-critic-bipedal-walker.tf --buffer-max-size 1000000 --buffer-initial-size 25000 --buffer-batch-size 100
Non-command line
from tensorflow.keras.optimizers import Adam
import rlalgorithms_tf2
from rlalgorithms_tf2 import DDPG
from rlalgorithms_tf2.utils.buffers import ReplayBuffer2
from rlalgorithms_tf2.utils.common import ModelReader, create_envs
envs = create_envs('BipedalWalker-v3', 16, preprocess=False)
buffers = [
ReplayBuffer2(62500, slots=5, initial_size=1560, batch_size=8)
for _ in range(len(envs))
]
actor_model_cfg = rlalgorithms_tf2.agents['ddpg']['actor_model']['ann'][0]
critic_model_cfg = rlalgorithms_tf2.agents['ddpg']['critic_model']['ann'][0]
optimizer = Adam(learning_rate=7e-4)
actor_model = ModelReader(
actor_model_cfg,
output_units=[envs[0].action_space.shape[0]],
input_shape=envs[0].observation_space.shape,
optimizer=optimizer,
).build_model()
critic_model = ModelReader(
critic_model_cfg,
output_units=[1],
input_shape=envs[0].observation_space.shape[0] + envs[0].action_space.shape[0],
optimizer=optimizer,
).build_model()
agent = DDPG(
envs,
actor_model,
critic_model,
buffers,
checkpoints=['ddpg-actor-bipedal-walker.tf', 'ddpg-critic-bipedal-walker.tf'],
)
agent.fit(target_reward=100)
flags | help | default | hp_type |
---|---|---|---|
--double | If specified, DDQN will be used | - | - |
--epsilon-decay-steps | Number of steps for epsilon-start to reach epsilon-end | 150000 | int |
--epsilon-end | Epsilon end value (minimum exploration rate) | 0.02 | log_uniform |
--epsilon-start | Starting epsilon value, which is used to control random exploration. It should be decremented and adjusted according to implementation needs | 1.0 | log_uniform |
--model | Path to model .cfg file | - | - |
--target-sync-steps | Sync target models every n steps | 1000 | int |
Command line
rlalgorithms-tf2 train dqn --env PongNoFrameskip-v4 --target-reward 19 --n-envs 3 --lr 1e-4 --preprocess --checkpoints dqn-pong.tf --buffer-max-size 50000 --buffer-initial-size 10000 --max-frame
Non-command line
from tensorflow.keras.optimizers import Adam
import rlalgorithms_tf2
from rlalgorithms_tf2 import DQN
from rlalgorithms_tf2.utils.buffers import ReplayBuffer1
from rlalgorithms_tf2.utils.common import ModelReader, create_envs
envs = create_envs('PongNoFrameskip-v4', 3, max_frame=True)
buffers = [
ReplayBuffer1(16666, initial_size=3333, batch_size=10) for _ in range(len(envs))
]
model_cfg = rlalgorithms_tf2.agents['dqn']['model']['cnn'][0]
optimizer = Adam(learning_rate=7e-4)
model = ModelReader(
model_cfg,
output_units=[envs[0].action_space.n],
input_shape=envs[0].observation_space.shape,
optimizer=optimizer,
).build_model()
agent = DQN(envs, model, buffers, checkpoints=['dqn-pong.tf'])
agent.fit(target_reward=19)
Note: if you need a DDQN, specify double=True in the agent constructor, or pass the --double flag from the command line, as in the sketch below.
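For example, with the same setup as above, only the constructor call changes:
agent = DQN(envs, model, buffers, double=True, checkpoints=['dqn-pong.tf'])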
flags | help | default | hp_type |
---|---|---|---|
--advantage-epsilon | Value added to estimated advantage | 1e-08 | log_uniform |
--clip-norm | Clipping value passed to tf.clip_by_value() | 0.1 | log_uniform |
--entropy-coef | Entropy coefficient for loss calculation | 0.01 | log_uniform |
--grad-norm | Gradient clipping value passed to tf.clip_by_value() | 0.5 | log_uniform |
--lam | GAE-Lambda for advantage estimation | 0.95 | log_uniform |
--mini-batches | Number of mini-batches to use per update | 4 | categorical |
--model | Path to model .cfg file | - | - |
--n-steps | Transition steps | 128 | categorical |
--ppo-epochs | Gradient updates per training step | 4 | categorical |
--value-loss-coef | Value loss coefficient for value loss calculation | 0.5 | log_uniform |
Command line
rlalgorithms-tf2 train ppo --env PongNoFrameskip-v4 --target-reward 19 --n-envs 16 --preprocess --checkpoints ppo-pong.tf
or
rlalgorithms-tf2 train ppo --env BipedalWalker-v3 --target-reward 200 --n-envs 16 --checkpoints ppo-bipedal-walker.tf
Non-command line
from tensorflow.keras.optimizers import Adam
import rlalgorithms_tf2
from rlalgorithms_tf2 import PPO
from rlalgorithms_tf2.utils.common import ModelReader, create_envs
envs = create_envs('PongNoFrameskip-v4', 16)
model_cfg = rlalgorithms_tf2.agents['ppo']['model']['cnn'][0]
optimizer = Adam(learning_rate=7e-4)
model = ModelReader(
model_cfg,
output_units=[envs[0].action_space.n, 1],
input_shape=envs[0].observation_space.shape,
optimizer=optimizer,
).build_model()
agent = PPO(envs, model, checkpoints=['ppo-pong.tf'])
agent.fit(target_reward=19)
Note: the number of iterations per train step is determined by the agent unless --gradient-steps is specified.
flags | help | default | hp_type |
---|---|---|---|
--actor-model | Path to actor model .cfg file | - | - |
--critic-model | Path to critic model .cfg file | - | - |
--gradient-steps | Number of iterations per train step | - | int |
--noise-clip | Target noise clipping value | 0.5 | log_uniform |
--policy-delay | Delay after which, actor weights and target models will be updated | 2 | categorical |
--policy-noise-coef | Coefficient multiplied by noise added to target actions | 0.2 | log_uniform |
--step-noise-coef | Coefficient multiplied by noise added to actions to step | 0.1 | log_uniform |
--tau | Value used for syncing target model weights | 0.005 | log_uniform |
Command line
rlalgorithms-tf2 train td3 --env BipedalWalker-v3 --target-reward 300 --n-envs 16 --checkpoints td3-actor-bipedal-walker.tf td3-critic1-bipedal-walker.tf td3-critic2-bipedal-walker.tf --buffer-max-size 1000000 --buffer-initial-size 100 --buffer-batch-size 100
Non-command line
from tensorflow.keras.optimizers import Adam
import rlalgorithms_tf2
from rlalgorithms_tf2 import TD3
from rlalgorithms_tf2.utils.buffers import ReplayBuffer2
from rlalgorithms_tf2.utils.common import ModelReader, create_envs
envs = create_envs('BipedalWalker-v3', 16, preprocess=False)
buffers = [
ReplayBuffer2(62500, slots=5, initial_size=1560, batch_size=8)
for _ in range(len(envs))
]
actor_model_cfg = rlalgorithms_tf2.agents['td3']['actor_model']['ann'][0]
critic_model_cfg = rlalgorithms_tf2.agents['td3']['critic_model']['ann'][0]
optimizer = Adam(learning_rate=7e-4)
actor_model = ModelReader(
actor_model_cfg,
output_units=[envs[0].action_space.shape[0]],
input_shape=envs[0].observation_space.shape,
optimizer=optimizer,
).build_model()
critic_model = ModelReader(
critic_model_cfg,
output_units=[1],
input_shape=envs[0].observation_space.shape[0] + envs[0].action_space.shape[0],
optimizer=optimizer,
).build_model()
agent = TD3(
envs,
actor_model,
critic_model,
buffers,
checkpoints=[
'td3-actor-bipedal-walker.tf',
'td3-critic1-bipedal-walker.tf',
'td3-critic2-bipedal-walker.tf',
],
)
agent.fit(target_reward=100)
flags | help | default | hp_type |
---|---|---|---|
--actor-iterations | Actor optimization iterations per train step | 10 | int |
--actor-model | Path to actor model .cfg file | - | - |
--advantage-epsilon | Value added to estimated advantage | 1e-08 | log_uniform |
--cg-damping | Gradient conjugation damping parameter | 0.001 | log_uniform |
--cg-iterations | Gradient conjugation iterations per train step | 10 | - |
--cg-residual-tolerance | Gradient conjugation residual tolerance parameter | 1e-10 | log_uniform |
--clip-norm | Clipping value passed to tf.clip_by_value() | 0.1 | log_uniform |
--critic-iterations | Critic optimization iterations per train step | 3 | int |
--critic-model | Path to critic model .cfg file | - | - |
--entropy-coef | Entropy coefficient for loss calculation | 0 | log_uniform |
--fvp-n-steps | Value used to skip every n-frames used to calculate FVP | 5 | int |
--grad-norm | Gradient clipping value passed to tf.clip_by_value() | 0.5 | log_uniform |
--lam | GAE-Lambda for advantage estimation | 1.0 | log_uniform |
--max-kl | Maximum KL divergence used for calculating Lagrange multiplier | 0.001 | log_uniform |
--mini-batches | Number of mini-batches to use per update | 4 | categorical |
--n-steps | Transition steps | 512 | categorical |
--ppo-epochs | Gradient updates per training step | 4 | categorical |
--value-loss-coef | Value loss coefficient for value loss calculation | 0.5 | log_uniform |
Command line
rlalgorithms-tf2 train trpo --env PongNoFrameskip-v4 --target-reward 19 --n-envs 16 --checkpoints trpo-actor-pong.tf trpo-critic-pong.tf --preprocess --lr 1e-3
or
rlalgorithms-tf2 train trpo --env BipedalWalker-v3 --target-reward 200 --n-envs 16 --checkpoints trpo-actor-pong.tf trpo-critic-pong.tf --lr 1e-3
Non-command line
from tensorflow.keras.optimizers import Adam
import rlalgorithms_tf2
from rlalgorithms_tf2 import TRPO
from rlalgorithms_tf2.utils.common import ModelReader, create_envs
envs = create_envs('PongNoFrameskip-v4', 16)
actor_model_cfg = rlalgorithms_tf2.agents['trpo']['actor_model']['cnn'][0]
critic_model_cfg = rlalgorithms_tf2.agents['trpo']['critic_model']['cnn'][0]
optimizer = Adam()
actor_model = ModelReader(
actor_model_cfg,
output_units=[envs[0].action_space.n],
input_shape=envs[0].observation_space.shape,
optimizer=optimizer,
).build_model()
critic_model = ModelReader(
critic_model_cfg,
output_units=[1],
input_shape=envs[0].observation_space.shape,
optimizer=optimizer,
).build_model()
agent = TRPO(
envs,
actor_model,
critic_model,
checkpoints=[
'trpo-actor-pong.tf',
'trpo-critic-pong.tf',
],
)
agent.fit(target_reward=19)
Distributed under the MIT License. See LICENSE for more information.
Give a ⭐️ if this project helped you!
Project link: https://github.com/unsignedrant/rlalgorithms-tf2