Packaged deep reinforcement learning algorithms in TensorFlow 2.x
Install swig using apt or brew, depending on your OS, then install the package:
pip install git+https://github.com/unsignedrant/rlalgorithms-tf2
Notes:
To use Atari environments or run the tests, you need to install the Atari ROMs, as described by atari-py:
mkdir Roms
wget http://www.atarimania.com/roms/Roms.rar
unrar e -r Roms.rar Roms
python -m atari_py.import_roms Roms
ale-import-roms --import-from-pkg atari_py.atari_roms
Verify installation
rlalgorithms-tf2
OUT:
rlalgorithms-tf2 1.0.1
Usage:
rlalgorithms-tf2 <command> <agent> [options] [args]
Available commands:
train Train given an agent and environment
play Play a game given a trained agent and environment
tune Tune hyperparameters given an agent, hyperparameter specs, and environment
Use rlalgorithms-tf2 <command> to see more info about a command
Use rlalgorithms-tf2 <command> <agent> to see more info about command + agent
rlalgorithms-tf2 is a TensorFlow-based mini-library that facilitates experimentation with existing reinforcement learning algorithms, as well as the implementation of new ones. It provides well-tested components that can be easily modified or extended. The available selection of algorithms can be used directly or through the command line.
Visualization of training is supported, as well as many other features provided by wandb.
All agents support multiple environments, whose operations are conducted inside the TensorFlow graph. This boosts training speed without the overhead of creating a process per environment. Atari and other environments that return images are wrapped in LazyFrames, which significantly lowers memory usage.
There are two kinds of replay buffers available, ReplayBuffer1 and ReplayBuffer2 (used in the examples below from rlalgorithms-tf2.utils.buffers). Both support a maximum size and an initial size, and are usually combined with LazyFrames for memory efficiency; a minimal setup is sketched below.
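A minimal per-environment buffer setup, following the ACER and DQN examples later in this document (sizes are illustrative, and the package is assumed to import with underscores):
from rlalgorithms_tf2.utils.buffers import ReplayBuffer1
from rlalgorithms_tf2.utils.common import create_envs

envs = create_envs('PongNoFrameskip-v4', 4)
# one buffer per environment, each with a maximum size, an initial size and a batch size
buffers = [
    ReplayBuffer1(10000, initial_size=1000, batch_size=32) for _ in range(len(envs))
]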
All features are available through the command line. For more details, check the command line options listed further below.
The command line tuning interface is based on optuna, which provides many hyperparameter features and types. Three types are currently used by rlalgorithms-tf2:
Categorical:
rlalgorithms-tf2 tune <agent> --env <env> --interesting-param <val1> <val2> <val3> # ...
Int / log uniform:
rlalgorithms-tf2 tune <agent> --env <env> --interesting-param <min-val> <max-val>
In both examples, if --interesting-param is not specified, it keeps its default value, or a fixed value if only one value is specified. Some visualization options are also available through optuna.visualization.matplotlib:
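For example, an existing study can be loaded and plotted with optuna's own API (a minimal sketch; the study name and storage URL are placeholders matching the tuning example further below):
import matplotlib.pyplot as plt
import optuna
from optuna.visualization.matplotlib import plot_optimization_history, plot_param_importances

# load a study previously created by the tune command / optuna.create_study
study = optuna.load_study(study_name='ppo-example', storage='sqlite:///ppo-example.db')
plot_optimization_history(study)  # objective value per trial
plot_param_importances(study)  # relative hyperparameter importances
plt.show()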
Early stopping is applied when a plateau is reached a pre-specified number of times without any improvement; on each plateau, the learning rate is reduced by a pre-determined factor. To activate these features, pass (see the example below):
--divergence-monitoring-steps <train-steps-at-which-should-monitor>
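For example (illustrative values, combined with the monitoring flags listed in the command line options):
rlalgorithms-tf2 train a2c --env PongNoFrameskip-v4 --preprocess --target-reward 19 --divergence-monitoring-steps 500000 --plateau-reduce-factor 0.9 --plateau-reduce-patience 10 --early-stop-patience 3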
 | A2C | ACER | DDPG | DQN | PPO | TD3 | TRPO |
---|---|---|---|---|---|---|---|
Discrete | Yes | Yes | No | Yes | Yes | No | Yes |
Continuous | Yes | No | Yes | No | Yes | Yes | Yes |
Main components are covered using pytest.
To facilitate experimentation and eliminate redundancy, all agents support loading models by passing either --model <model.cfg>, or --actor-model <actor.cfg> and --critic-model <critic.cfg>. If no models are passed, the default ones will be loaded. A typical model.cfg file would look like:
[convolutional-0]
filters=32
size=8
stride=4
activation=relu
initializer=orthogonal
gain=1.4142135
[convolutional-1]
filters=64
size=4
stride=2
activation=relu
initializer=orthogonal
gain=1.4142135
[convolutional-2]
filters=64
size=3
stride=1
activation=relu
initializer=orthogonal
gain=1.4142135
[flatten-0]
[dense-0]
units=512
activation=relu
initializer=orthogonal
gain=1.4142135
common=1
[dense-1]
initializer=orthogonal
gain=0.01
output=1
[dense-2]
initializer=orthogonal
gain=1.0
output=1
This should generate a Keras model similar to the one below, with output units 6 and 1 respectively:
Model: "model"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_1 (InputLayer) [(None, 84, 84, 1)] 0
__________________________________________________________________________________________________
conv2d (Conv2D) (None, 20, 20, 32) 2080 input_1[0][0]
__________________________________________________________________________________________________
conv2d_1 (Conv2D) (None, 9, 9, 64) 32832 conv2d[0][0]
__________________________________________________________________________________________________
conv2d_2 (Conv2D) (None, 7, 7, 64) 36928 conv2d_1[0][0]
__________________________________________________________________________________________________
flatten (Flatten) (None, 3136) 0 conv2d_2[0][0]
__________________________________________________________________________________________________
dense (Dense) (None, 512) 1606144 flatten[0][0]
__________________________________________________________________________________________________
dense_1 (Dense) (None, 6) 3078 dense[0][0]
__________________________________________________________________________________________________
dense_2 (Dense) (None, 1) 513 dense[0][0]
==================================================================================================
Total params: 1,681,575
Trainable params: 1,681,575
Non-trainable params: 0
__________________________________________________________________________________________________
Notes
common=1 marks a layer to be reused by the following layers, which means dense-1 and dense-2 are called on the output of dense-0.
Supported initializers are orthogonal and glorot_uniform; to add more, you'll have to modify rlalgorithms-tf2.utils.common.ModelReader.initializers.
output=1 marks a layer as an output, which will be appended to the outputs of the resulting tf.keras.Model.
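Such a .cfg can be turned into a model with ModelReader, following the usage shown in the examples below (a sketch only; the package is assumed to import with underscores, and model.cfg is a placeholder path):
from rlalgorithms_tf2.utils.common import ModelReader

# output units 6 and 1 correspond to the two output=1 dense layers in the config above
model = ModelReader(
    'model.cfg',
    output_units=[6, 1],
    input_shape=(84, 84, 1),
    optimizer='adam',
).build_model()
model.summary()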
Saving training history is available for further benchmarking / visualizing results.
This is achieved by specifying --history-checkpoint <history.parquet>, which results in a .parquet file that is updated at the end of each episode. A sample data point will have these columns:
mean_reward: most recent mean of agent episode rewards.
best_reward: most recent best of agent episode rewards.
episode_reward: most recent episode reward.
step: most recent agent step.
time: training elapsed time.
This enables producing plots similar to the ones below, using rlalgorithms-tf2.utils.common.plot_history.
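plot_history is the library helper; as a rough manual equivalent, the saved .parquet can also be plotted directly with pandas and matplotlib (column names as listed above, file path is a placeholder):
import matplotlib.pyplot as plt
import pandas as pd

history = pd.read_parquet('history.parquet')
# mean reward vs. agent step, one row per finished episode
plt.plot(history['step'], history['mean_reward'])
plt.xlabel('step')
plt.ylabel('mean_reward')
plt.show()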
All operation results are reproducible by passing --seed <some-seed>, or seed=some_seed to the agent constructor.
Gameplay visual output can be saved as .jpg frames by passing --frame-dir <some-dir> to the play command.
Weights are saved in .tf format by specifying --checkpoints <ckpt1.tf> <ckpt2.tf>. To resume training, --weights <ckpt1.tf> <ckpt2.tf> should load the weights saved earlier. If --history-checkpoint <ckpt.parquet> is specified, the file is looked for and, if found, further training history will be saved to the same ckpt.parquet, and the agent metrics will be updated with the most recent ones contained in the history file.
All agents / commands are available through the command line.
rlalgorithms-tf2 <command> <agent> [options] [args]
Note: unless called from the command line with --weights passed, all models passed to agents in code should be loaded with weights beforehand when resuming training or playing.
Through command line
rlalgorithms-tf2 train a2c --env PongNoFrameskip-v4 --n-env 16 --target-reward 19 --preprocess
Through direct importing
import rlalgorithms_tf2  # the package name is assumed to import with underscores
from rlalgorithms_tf2 import A2C
from rlalgorithms_tf2.utils.common import ModelReader, create_envs
envs = create_envs('PongNoFrameskip-v4')
model = ModelReader(
rlalgorithms_tf2.agents['a2c']['model']['cnn'][0],
output_units=[6, 1],
input_shape=envs[0].observation_space.shape,
optimizer='adam',
).build_model()
agent = A2C(envs, model)
Then either max_steps or target_reward should be specified to start training:
agent.fit(target_reward=19)
Through command line
rlalgorithms-tf2 play a2c --env PongNoFrameskip-v4 --preprocess --weights <trained-a2c-weights> --render
Through direct importing
import rlalgorithms_tf2
from rlalgorithms_tf2 import A2C
from rlalgorithms_tf2.utils.common import ModelReader, create_envs
envs = create_envs('PongNoFrameskip-v4')
model = ModelReader(
rlalgorithms_tf2.agents['a2c']['model']['cnn'][0],
output_units=[6, 1],
input_shape=envs[0].observation_space.shape,
optimizer='adam',
).build_model()
model.load_weights(
'/path/to/trained-weights.tf'
).expect_partial()
agent = A2C(envs, model)
agent.play(render=True)
Save frames
agent.play(frame_dir='/path/to/frame-dir/')
or
rlalgorithms-tf2 play a2c --frame-dir /path/to/frame-dir/
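The saved frames can then be stitched into a clip with any standard tool; for instance, a quick sketch using imageio (not part of this library; paths are placeholders):
import glob

import imageio

# assumes frame filenames sort in playback order
frames = [imageio.imread(f) for f in sorted(glob.glob('/path/to/frame-dir/*.jpg'))]
imageio.mimsave('gameplay.gif', frames)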
Notes
Due to an issue with tensorflow that causes occasional memory leaks when trials are run consecutively using:
study.optimize(objective, n_trials=100)
the current implementation runs trials in separate processes that are killed after each trial to release the resources. Therefore, you may find the suggested non-command-line example different from optuna's docs.
There are hyperparameters that accept min and max values, and others that accept n values. To know which is which, check the hp_type column in the help menu table: categorical takes any number of values, otherwise min and max are expected.
For more info about how the optimization algorithms work under the hood, you may want to check optuna docs.
Tuning from later stages of training is available by passing --weights <weights1.tf> <weights2.tf>, which loads the agent's respective model weights, and tuning starts from there.
Only the selected hyperparameters are tuned; the rest keep their default values and will not be tuned, or can be given a single fixed value with --flag <val>.
Also, due to the tensorflow issue mentioned above, tensorflow logging is silenced using the TF_CPP_MIN_LOG_LEVEL environment variable, to prevent each trial process from displaying the same import log messages over and over.
Through command line
!TF_CPP_MIN_LOG_LEVEL=3 rlalgorithms-tf2 tune ppo --env PongNoFrameskip-v4 --study ppo-carnival --storage sqlite:///ppo-carnival.db --trial-steps 500000 --n-trials 100 --warmup-trials 3 --preprocess --n-envs 16 32 --lr 1e-5 1e-2 --opt-epsilon 1e-7 1e-4 --gamma 0.9 0.999 --entropy-coef 0.01 0.2 --n-steps 16 32 64 128 --lam 0.7 0.99
Through direct importing
import os

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

from concurrent.futures import ProcessPoolExecutor, as_completed

import numpy as np
import optuna
import tensorflow as tf
from tensorflow.keras.optimizers import Adam

import rlalgorithms_tf2  # the package name is assumed to import with underscores
from rlalgorithms_tf2 import PPO
from rlalgorithms_tf2.utils.common import ModelReader, create_envs


def get_hparams(trial):
    return {
        'n_steps': int(
            trial.suggest_categorical('n_steps', [2 ** i for i in range(2, 11)])
        ),
        'learning_rate': trial.suggest_loguniform('learning_rate', 1e-5, 1e-2),
        'epsilon': trial.suggest_loguniform('epsilon', 1e-7, 1e-1),
        'entropy_coef': trial.suggest_loguniform('entropy_coef', 1e-8, 2e-1),
        'n_envs': int(
            trial.suggest_categorical('n_envs', [2 ** i for i in range(4, 7)])
        ),
        'grad_norm': trial.suggest_uniform('grad_norm', 0.1, 10.0),
        'lam': trial.suggest_loguniform('lam', 0.65, 0.99),
        'clip_norm': trial.suggest_loguniform('clip_norm', 0.01, 10),
    }


def optimize_agent(trial):
    hparams = get_hparams(trial)
    envs = create_envs('BreakoutNoFrameskip-v4', hparams['n_envs'])
    optimizer = Adam(
        hparams['learning_rate'],
        epsilon=hparams['epsilon'],
    )
    model_cfg = rlalgorithms_tf2.agents['ppo']['model']['cnn'][0]
    model = ModelReader(
        model_cfg,
        output_units=[envs[0].action_space.n, 1],
        input_shape=envs[0].observation_space.shape,
        optimizer=optimizer,
    ).build_model()
    model.compile(optimizer)
    agent = PPO(
        envs,
        model,
        entropy_coef=hparams['entropy_coef'],
        grad_norm=hparams['grad_norm'],
        n_steps=hparams['n_steps'],
        lam=hparams['lam'],
        clip_norm=hparams['clip_norm'],
        trial=trial,
        quiet=True,
    )
    steps = 500000
    agent.fit(max_steps=steps)
    current_rewards = np.around(np.mean(agent.total_rewards), 2)
    if not np.isfinite(current_rewards):
        current_rewards = 0
    return current_rewards


def run_trial():
    optuna.logging.set_verbosity(optuna.logging.ERROR)
    tf.get_logger().setLevel('ERROR')
    study = optuna.create_study(
        study_name='ppo-example',
        storage='sqlite:///ppo-example.db',
        load_if_exists=True,
        direction='maximize',
    )
    optuna.logging.set_verbosity(optuna.logging.INFO)
    study.optimize(optimize_agent, n_trials=1)


if __name__ == '__main__':
    # each trial runs in its own short-lived process to release tensorflow resources
    for _ in range(100):
        with ProcessPoolExecutor(1) as executor:
            future_trials = [executor.submit(run_trial)]
            for future_trial in as_completed(future_trials):
                future_trial.result()
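After (or during) the runs, the best result found so far can be read back from the same storage (a minimal sketch using optuna's API; names match the example above):
import optuna

study = optuna.load_study(study_name='ppo-example', storage='sqlite:///ppo-example.db')
print(study.best_value)  # best mean reward reached by a completed trial
print(study.best_params)  # hyperparameters of the best trial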
Note: not all the flags listed below are available at once. To see which ones apply to the command you passed, use:
rlalgorithms-tf2 <command>
or
rlalgorithms-tf2 <command> <agent>
which lists the command and agent options combined.
Flags (Available for all agents)
flags | help | default | hp_type |
---|---|---|---|
--checkpoints | Path(s) to new model(s) to which checkpoint(s) will be saved during training | - | - |
--display-precision | Number of decimals to be displayed | 2 | - |
--divergence-monitoring-steps | Steps after which, plateau and early stopping are active | - | - |
--early-stop-patience | Minimum plateau reduces to stop training | 3 | - |
--gamma | Discount factor | 0.99 | log_uniform |
--history-checkpoint | Path to .parquet file to save training history | - | - |
--log-frequency | Log progress every n games | - | - |
--plateau-reduce-factor | Factor multiplied by current learning rate when there is a plateau | 0.9 | - |
--plateau-reduce-patience | Minimum non-improvements to reduce lr | 10 | - |
--quiet | If specified, no messages by the agent will be displayed to the console | - | - |
--reward-buffer-size | Size of the total reward buffer, used for calculating the mean reward value to be displayed | 100 | - |
--seed | Random seed | - | - |
flags | help | default | hp_type |
---|---|---|---|
--beta1 | Beta1 passed to a tensorflow.keras.optimizers.Optimizer | 0.9 | log_uniform |
--beta2 | Beta2 passed to a tensorflow.keras.optimizers.Optimizer | 0.999 | log_uniform |
--env | gym environment id | - | - |
--lr | Learning rate passed to a tensorflow.keras.optimizers.Optimizer | 0.0007 | log_uniform |
--max-frame | If specified, max & skip will be applied during preprocessing | - | categorical |
--n-envs | Number of environments to create | 1 | categorical |
--opt-epsilon | Epsilon passed to a tensorflow.keras.optimizers.Optimizer | 1e-07 | log_uniform |
--preprocess | If specified, states will be treated as atari frames and preprocessed accordingly | - | - |
--weights | Path(s) to model(s) weight(s) to be loaded by agent output_models | - | - |
flags | help |
---|---|
--max-steps | Maximum number of environment steps, when reached, training is stopped |
--monitor-session | Wandb session name |
--target-reward | Target reward when reached, training is stopped |
flags | help | default |
---|---|---|
--action-idx | Index of action output by agent.model | 0 |
--frame-delay | Delay between rendered frames | 0 |
--frame-dir | Path to directory to save game frames | - |
--frame-frequency | If --frame-dir is specified, save frames every n frames. | 1 |
--render | If specified, the gameplay will be rendered | - |
flags | help | default |
---|---|---|
--n-jobs | Parallel trials | 1 |
--n-trials | Number of trials to run | 1 |
--non-silent | tensorflow, optuna and agent are silenced at trial start to avoid repetitive import messages at each trial start, unless this flag is specified | - |
--storage | Database url | - |
--study | Name of optuna study | - |
--trial-steps | Maximum steps for a trial | 500000 |
--warmup-trials | warmup trials before pruning starts | 5 |
flags | help | default | hp_type |
---|---|---|---|
--buffer-batch-size | Replay buffer batch size | 32 | categorical |
--buffer-initial-size | Replay buffer initial size | - | int |
--buffer-max-size | Maximum replay buffer size | 10000 | int |
General notes
--model <model.cfg>, or --actor-model <actor.cfg> and --critic-model <critic.cfg>, are optional, which means that if not specified, the default model(s) will be loaded, so you don't have to worry about it.
Atari environments need the --preprocess flag for image preprocessing.
--checkpoints <checkpoint1.tf> <checkpoint2.tf> should be specified for the model(s) to be saved. The number of passed checkpoints should match the number of models the agent accepts.
The same applies to --weights <weights1.tf> <weights2.tf>: they should match the number of agent models.
seed=some_seed should be passed to the agent constructor and to the ModelReader constructor if specified from code. From the command line, all you need is to pass --seed <some-seed>.
history_checkpoint=some_history.parquet should be specified to the agent constructor, or alternatively use --history-checkpoint <some-history.parquet>. If the history checkpoint exists, training metrics will automatically resume from where they left off.
flags | help | default | hp_type |
---|---|---|---|
--entropy-coef | Entropy coefficient for loss calculation | 0.01 | log_uniform |
--grad-norm | Gradient clipping value passed to tf.clip_by_value() | 0.5 | log_uniform |
--model | Path to model .cfg file | - | - |
--n-steps | Transition steps | 5 | categorical |
--value-loss-coef | Value loss coefficient for value loss calculation | 0.5 | log_uniform |
Command line
rlalgorithms-tf2 train a2c --env PongNoFrameskip-v4 --target-reward 19 --n-envs 16 --preprocess --checkpoints a2c-pong.tf
OR
rlalgorithms-tf2 train a2c --env BipedalWalker-v3 --target-reward 100 --n-envs 16 --checkpoints a2c-bipedal-walker.tf
Non-command line
from tensorflow.keras.optimizers import Adam
import rlalgorithms_tf2
from rlalgorithms_tf2 import A2C
from rlalgorithms_tf2.utils.common import ModelReader, create_envs
envs = create_envs('PongNoFrameskip-v4', 16)
model_cfg = rlalgorithms_tf2.agents['a2c']['model']['cnn'][0]
optimizer = Adam(learning_rate=7e-4)
model = ModelReader(
model_cfg,
output_units=[envs[0].action_space.n, 1],
input_shape=envs[0].observation_space.shape,
optimizer=optimizer,
).build_model()
agent = A2C(envs, model, checkpoints=['a2c-pong.tf'])
agent.fit(target_reward=19)
And for BipedalWalker-v3, the only difference is that you have to specify preprocess=False to create_envs().
For ACER, transitions are sampled in n-steps; therefore buffer-batch-size is set to 1.
flags | help | default | hp_type |
---|---|---|---|
--delta | delta param used for trust region update | 1 | log_uniform |
--ema-alpha | Moving average decay passed to tf.train.ExponentialMovingAverage() | 0.99 | log_uniform |
--entropy-coef | Entropy coefficient for loss calculation | 0.01 | log_uniform |
--epsilon | epsilon used in gradient updates | 1e-06 | log_uniform |
--grad-norm | Gradient clipping value passed to tf.clip_by_value() | 10 | log_uniform |
--importance-c | Importance weight truncation parameter. | 10.0 | log_uniform |
--model | Path to model .cfg file | - | - |
--n-steps | Transition steps | 20 | categorical |
--replay-ratio | Lam value passed to np.random.poisson() | 4 | categorical |
--trust-region | True by default; if this flag is specified, trust region updates will be used | - | - |
--value-loss-coef | Value loss coefficient for value loss calculation | 0.5 | log_uniform |
Command line
rlalgorithms-tf2 train acer --env PongNoFrameskip-v4 --target-reward 19 --n-envs 16 --preprocess --checkpoints acer-pong.tf --buffer-max-size 5000 --buffer-initial-size 500 --trust-region
Non-command line
from tensorflow.keras.optimizers import Adam
import rlalgorithms_tf2
from rlalgorithms_tf2 import ACER
from rlalgorithms_tf2.utils.buffers import ReplayBuffer1
from rlalgorithms_tf2.utils.common import ModelReader, create_envs
envs = create_envs('PongNoFrameskip-v4', 16)
buffers = [
ReplayBuffer1(5000, initial_size=500, batch_size=1) for _ in range(len(envs))
]
model_cfg = rlalgorithms_tf2.agents['acer']['model']['cnn'][0]
optimizer = Adam(learning_rate=7e-4)
model = ModelReader(
model_cfg,
output_units=[envs[0].action_space.n, envs[0].action_space.n],
input_shape=envs[0].observation_space.shape,
optimizer=optimizer,
).build_model()
agent = ACER(envs, model, buffers, checkpoints=['acer-pong.tf'])
agent.fit(target_reward=19)
Note: the number of iterations per train step is determined by the agent unless --gradient-steps is specified.
flags | help | default | hp_type |
---|---|---|---|
--actor-model | Path to actor model .cfg file | - | - |
--critic-model | Path to critic model .cfg file | - | - |
--gradient-steps | Number of iterations per train step | - | int |
--step-noise-coef | Coefficient multiplied by noise added to actions to step | 0.1 | log_uniform |
--tau | Value used for syncing target model weights | 0.005 | log_uniform |
Command line
rlalgorithms-tf2 train ddpg --env BipedalWalker-v3 --target-reward 100 --n-envs 16 --checkpoints ddpg-actor-bipedal-walker.tf ddpg-critic-bipedal-walker.tf --buffer-max-size 1000000 --buffer-initial-size 25000 --buffer-batch-size 100
Non-command line
from tensorflow.keras.optimizers import Adam
import rlalgorithms_tf2
from rlalgorithms_tf2 import DDPG
from rlalgorithms_tf2.utils.buffers import ReplayBuffer2
from rlalgorithms_tf2.utils.common import ModelReader, create_envs
envs = create_envs('BipedalWalker-v3', 16, preprocess=False)
buffers = [
ReplayBuffer2(62500, slots=5, initial_size=1560, batch_size=8)
for _ in range(len(envs))
]
actor_model_cfg = rlalgorithms_tf2.agents['ddpg']['actor_model']['ann'][0]
critic_model_cfg = rlalgorithms_tf2.agents['ddpg']['critic_model']['ann'][0]
optimizer = Adam(learning_rate=7e-4)
actor_model = ModelReader(
actor_model_cfg,
output_units=[envs[0].action_space.shape[0]],
input_shape=envs[0].observation_space.shape,
optimizer=optimizer,
).build_model()
critic_model = ModelReader(
critic_model_cfg,
output_units=[1],
input_shape=envs[0].observation_space.shape[0] + envs[0].action_space.shape[0],
optimizer=optimizer,
).build_model()
agent = DDPG(
envs,
actor_model,
critic_model,
buffers,
checkpoints=['ddpg-actor-bipedal-walker.tf', 'ddpg-critic-bipedal-walker.tf'],
)
agent.fit(target_reward=100)
flags | help | default | hp_type |
---|---|---|---|
--double | If specified, DDQN will be used | - | - |
--epsilon-decay-steps | Number of steps for epsilon-start to reach epsilon-end | 150000 | int |
--epsilon-end | Epsilon end value (minimum exploration rate) | 0.02 | log_uniform |
--epsilon-start | Starting epsilon value, which is used to control random exploration. It should be decremented and adjusted according to implementation needs | 1.0 | log_uniform |
--model | Path to model .cfg file | - | - |
--target-sync-steps | Sync target models every n steps | 1000 | int |
Command line
rlalgorithms-tf2 train dqn --env PongNoFrameskip-v4 --target-reward 19 --n-envs 3 --lr 1e-4 --preprocess --checkpoints dqn-pong.tf --buffer-max-size 50000 --buffer-initial-size 10000 --max-frame
Non-command line
from tensorflow.keras.optimizers import Adam
import rlalgorithms_tf2
from rlalgorithms_tf2 import DQN
from rlalgorithms_tf2.utils.buffers import ReplayBuffer1
from rlalgorithms_tf2.utils.common import ModelReader, create_envs
envs = create_envs('PongNoFrameskip-v4', 3, max_frame=True)
buffers = [
ReplayBuffer1(16666, initial_size=3333, batch_size=10) for _ in range(len(envs))
]
model_cfg = rlalgorithms_tf2.agents['dqn']['model']['cnn'][0]
optimizer = Adam(learning_rate=7e-4)
model = ModelReader(
model_cfg,
output_units=[envs[0].action_space.n],
input_shape=envs[0].observation_space.shape,
optimizer=optimizer,
).build_model()
agent = DQN(envs, model, buffers, checkpoints=['dqn-pong.tf'])
agent.fit(target_reward=19)
Note: if you need a DDQN, specify double=True in the agent constructor, or pass the --double flag from the command line, as in the sketch below.
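For example, with the same setup as above, only the constructor call changes:
agent = DQN(envs, model, buffers, double=True, checkpoints=['dqn-pong.tf'])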
flags | help | default | hp_type |
---|---|---|---|
--advantage-epsilon | Value added to estimated advantage | 1e-08 | log_uniform |
--clip-norm | Clipping value passed to tf.clip_by_value() | 0.1 | log_uniform |
--entropy-coef | Entropy coefficient for loss calculation | 0.01 | log_uniform |
--grad-norm | Gradient clipping value passed to tf.clip_by_value() | 0.5 | log_uniform |
--lam | GAE-Lambda for advantage estimation | 0.95 | log_uniform |
--mini-batches | Number of mini-batches to use per update | 4 | categorical |
--model | Path to model .cfg file | - | - |
--n-steps | Transition steps | 128 | categorical |
--ppo-epochs | Gradient updates per training step | 4 | categorical |
--value-loss-coef | Value loss coefficient for value loss calculation | 0.5 | log_uniform |
Command line
rlalgorithms-tf2 train ppo --env PongNoFrameskip-v4 --target-reward 19 --n-envs 16 --preprocess --checkpoints ppo-pong.tf
or
rlalgorithms-tf2 train ppo --env BipedalWalker-v3 --target-reward 200 --n-envs 16 --checkpoints ppo-bipedal-walker.tf
Non-command line
from tensorflow.keras.optimizers import Adam
import rlalgorithms_tf2
from rlalgorithms_tf2 import PPO
from rlalgorithms_tf2.utils.common import ModelReader, create_envs
envs = create_envs('PongNoFrameskip-v4', 16)
model_cfg = rlalgorithms_tf2.agents['ppo']['model']['cnn'][0]
optimizer = Adam(learning_rate=7e-4)
model = ModelReader(
model_cfg,
output_units=[envs[0].action_space.n, 1],
input_shape=envs[0].observation_space.shape,
optimizer=optimizer,
).build_model()
agent = PPO(envs, model, checkpoints=['ppo-pong.tf'])
agent.fit(target_reward=19)
Note: the number of iterations per train step is determined by the agent unless --gradient-steps is specified.
flags | help | default | hp_type |
---|---|---|---|
--actor-model | Path to actor model .cfg file | - | - |
--critic-model | Path to critic model .cfg file | - | - |
--gradient-steps | Number of iterations per train step | - | int |
--noise-clip | Target noise clipping value | 0.5 | log_uniform |
--policy-delay | Delay after which, actor weights and target models will be updated | 2 | categorical |
--policy-noise-coef | Coefficient multiplied by noise added to target actions | 0.2 | log_uniform |
--step-noise-coef | Coefficient multiplied by noise added to actions to step | 0.1 | log_uniform |
--tau | Value used for syncing target model weights | 0.005 | log_uniform |
Command line
rlalgorithms-tf2 train td3 --env BipedalWalker-v3 --target-reward 300 --n-envs 16 --checkpoints td3-actor-bipedal-walker.tf td3-critic1-bipedal-walker.tf td3-critic2-bipedal-walker.tf --buffer-max-size 1000000 --buffer-initial-size 100 --buffer-batch-size 100
Non-command line
from tensorflow.keras.optimizers import Adam
import rlalgorithms_tf2
from rlalgorithms_tf2 import TD3
from rlalgorithms_tf2.utils.buffers import ReplayBuffer2
from rlalgorithms_tf2.utils.common import ModelReader, create_envs
envs = create_envs('BipedalWalker-v3', 16, preprocess=False)
buffers = [
ReplayBuffer2(62500, slots=5, initial_size=1560, batch_size=8)
for _ in range(len(envs))
]
actor_model_cfg = rlalgorithms_tf2.agents['td3']['actor_model']['ann'][0]
critic_model_cfg = rlalgorithms_tf2.agents['td3']['critic_model']['ann'][0]
optimizer = Adam(learning_rate=7e-4)
actor_model = ModelReader(
actor_model_cfg,
output_units=[envs[0].action_space.shape[0]],
input_shape=envs[0].observation_space.shape,
optimizer=optimizer,
).build_model()
critic_model = ModelReader(
critic_model_cfg,
output_units=[1],
input_shape=envs[0].observation_space.shape[0] + envs[0].action_space.shape[0],
optimizer=optimizer,
).build_model()
agent = TD3(
envs,
actor_model,
critic_model,
buffers,
checkpoints=[
'td3-actor-bipedal-walker.tf',
'td3-critic1-bipedal-walker.tf',
'td3-critic2-bipedal-walker.tf',
],
)
agent.fit(target_reward=100)
flags | help | default | hp_type |
---|---|---|---|
--actor-iterations | Actor optimization iterations per train step | 10 | int |
--actor-model | Path to actor model .cfg file | - | - |
--advantage-epsilon | Value added to estimated advantage | 1e-08 | log_uniform |
--cg-damping | Gradient conjugation damping parameter | 0.001 | log_uniform |
--cg-iterations | Gradient conjugation iterations per train step | 10 | - |
--cg-residual-tolerance | Gradient conjugation residual tolerance parameter | 1e-10 | log_uniform |
--clip-norm | Clipping value passed to tf.clip_by_value() | 0.1 | log_uniform |
--critic-iterations | Critic optimization iterations per train step | 3 | int |
--critic-model | Path to critic model .cfg file | - | - |
--entropy-coef | Entropy coefficient for loss calculation | 0 | log_uniform |
--fvp-n-steps | Value used to skip every n-frames used to calculate FVP | 5 | int |
--grad-norm | Gradient clipping value passed to tf.clip_by_value() | 0.5 | log_uniform |
--lam | GAE-Lambda for advantage estimation | 1.0 | log_uniform |
--max-kl | Maximum KL divergence used for calculating Lagrange multiplier | 0.001 | log_uniform |
--mini-batches | Number of mini-batches to use per update | 4 | categorical |
--n-steps | Transition steps | 512 | categorical |
--ppo-epochs | Gradient updates per training step | 4 | categorical |
--value-loss-coef | Value loss coefficient for value loss calculation | 0.5 | log_uniform |
Command line
rlalgorithms-tf2 train trpo --env PongNoFrameskip-v4 --target-reward 19 --n-envs 16 --checkpoints trpo-actor-pong.tf trpo-critic-pong.tf --preprocess --lr 1e-3
or
rlalgorithms-tf2 train trpo --env BipedalWalker-v3 --target-reward 200 --n-envs 16 --checkpoints trpo-actor-pong.tf trpo-critic-pong.tf --lr 1e-3
Non-command line
from tensorflow.keras.optimizers import Adam
import rlalgorithms_tf2
from rlalgorithms_tf2 import TRPO
from rlalgorithms_tf2.utils.common import ModelReader, create_envs
envs = create_envs('PongNoFrameskip-v4', 16)
actor_model_cfg = rlalgorithms_tf2.agents['trpo']['actor_model']['cnn'][0]
critic_model_cfg = rlalgorithms_tf2.agents['trpo']['critic_model']['cnn'][0]
optimizer = Adam()
actor_model = ModelReader(
actor_model_cfg,
output_units=[envs[0].action_space.n],
input_shape=envs[0].observation_space.shape,
optimizer=optimizer,
).build_model()
critic_model = ModelReader(
critic_model_cfg,
output_units=[1],
input_shape=envs[0].observation_space.shape,
optimizer=optimizer,
).build_model()
agent = TRPO(
envs,
actor_model,
critic_model,
checkpoints=[
'trpo-actor-pong.tf',
'trpo-critic-pong.tf',
],
)
agent.fit(target_reward=19)
Distributed under the MIT License. See LICENSE for more information.
Give a ⭐️ if this project helped you!
Project link: https://github.com/unsignedrant/rlalgorithms-tf2