keras-adamw Versions

Keras/TF implementation of AdamW, SGDW, NadamW, Warm Restarts, and Learning Rate multipliers

v1.38a

2 years ago

Adds a DOI for citation purposes

v1.38

3 years ago
  • Fixed 'L1' object has no attribute 'l2' in TF 2.3.1 (and the converse 'L2' / 'l1' case - i.e. any non-l1_l2 regularizer object); see the sketch after this list
  • Moved testing to TF2.3.1
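
For context, a minimal illustration of the regularizer objects involved (layer sizes and coefficients here are arbitrary placeholders):

```python
from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import l1, l2, l1_l2

# Pure-l1 and pure-l2 regularizer objects lack the other penalty's attribute,
# which is what previously raised the AttributeError in TF 2.3.1;
# l1_l2 objects carry both attributes and were unaffected.
layer_a = Dense(16, kernel_regularizer=l1(1e-5))
layer_b = Dense(16, kernel_regularizer=l2(1e-4))
layer_c = Dense(16, kernel_regularizer=l1_l2(1e-5, 1e-4))
```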

v1.37

3 years ago

control_dependencies moved from tensorflow.python.ops to tensorflow.python.framework.ops; for backwards compatibility, the code was edited to use the public tf.control_dependencies instead.
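
A minimal sketch of the backwards-compatible pattern (the variable and op below are arbitrary placeholders, not the library's actual internals):

```python
import tensorflow as tf

counter = tf.Variable(0.0)
increment = counter.assign_add(1.0)

# The public tf.control_dependencies works across supported TF versions,
# regardless of whether the implementation lives in tensorflow.python.ops
# or tensorflow.python.framework.ops:
with tf.control_dependencies([increment]):
    snapshot = tf.identity(counter)
```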

Further, TF 2.3.0 isn't compatible with Keras 2.3.1 and earlier; later Keras versions are untested, but development proceeds with tf.keras.

v1.36

3 years ago

Existing code normalized as: norm = sqrt(batch_size / total_iterations), where total_iterations = (number of fits per epoch) * (number of epochs in restart). However, the number of fits per epoch = total_samples / batch_size, so norm = sqrt(batch_size² / (total_samples * epochs)) = batch_size * sqrt(1 / (total_samples * epochs)): norm scales linearly with batch_size, which differs from the authors' sqrt scaling.
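
A quick numeric check of that scaling, using arbitrary illustrative numbers (32,000 training samples, 10 epochs per restart):

```python
import numpy as np

samples_per_epoch, epochs = 32000, 10

def old_norm(batch_size):
    # total_iterations as computed by the old code:
    total_iterations = (samples_per_epoch / batch_size) * epochs
    return np.sqrt(batch_size / total_iterations)

def authors_norm(batch_size):
    # sqrt(b / (B*T)) per the authors' formula
    return np.sqrt(batch_size / (samples_per_epoch * epochs))

print(old_norm(64) / old_norm(32))          # 2.0    -- linear in batch_size
print(authors_norm(64) / authors_norm(32))  # ~1.414 -- sqrt(batch_size)
```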

Users who never changed batch_size throughout training are unaffected. (λ = λ_norm * sqrt(b / (B*T)); λ_norm is what we pick, our "guess". The idea of normalization is that if our guess works well for batch_size=32, it'll also work well for batch_size=16 - but if batch_size is never changed, then performance is only affected by the guess.)

Main change here, closing #52.

Updating existing code: for a choice of λ_norm that previously worked well, multiply it by sqrt(batch_size). Ex: Dense(bias_regularizer=l2(1e-4)) --> Dense(bias_regularizer=l2(1e-4 * sqrt(32))).
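
A minimal sketch of that migration (layer width and coefficient values are placeholders):

```python
import numpy as np
from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import l2

batch_size = 32  # the batch size the old, well-working lambda_norm was tuned with

# Before (pre-v1.36 normalization):
#   Dense(64, bias_regularizer=l2(1e-4))
# After: rescale the previously-tuned coefficient by sqrt(batch_size)
layer = Dense(64, bias_regularizer=l2(1e-4 * np.sqrt(batch_size)))
```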

v1.35

3 years ago

FEATURE: autorestart option which automatically handles Warm Restarts by resetting t_cur=0 after total_iterations iterations.

  • Defaults to True if use_cosine_annealing=True, else False
  • Must use use_cosine_annealing=True if using autorestart=True
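
A minimal usage sketch, assuming the constructor arguments documented in the README (the model, loss, and total_iterations value are placeholders):

```python
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from keras_adamw import AdamW

model = Sequential([Dense(4, input_shape=(8,))])

# With use_cosine_annealing=True, autorestart defaults to True, so t_cur is
# reset to 0 automatically once total_iterations iterations have elapsed.
optimizer = AdamW(model=model, use_cosine_annealing=True, total_iterations=24)
model.compile(optimizer, loss='mse')
```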

Updated README and example.py.

v1.32

3 years ago

BUGFIXES:

  • Last weight in network would be updated with t_cur one update ahead, desynchronizing it from all other weights
  • AdamW in keras (optimizers.py, optimizers_225.py) weight updates were not mediated by eta_t, so cosine annealing had no effect.

FEATURES:

  • Added lr_t to tf.keras optimizers to track the "actual" learning rate externally; use K.eval(model.optimizer.lr_t) to get the "actual" learning rate for a given t_cur and iterations (see the sketch after this list)
  • Added lr_t vs. iterations plot to README, and source code in example.py
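
A minimal sketch of the lr_t query described above (assumes a compiled model using one of the tf.keras optimizers and at least one completed train step):

```python
import tensorflow.keras.backend as K

# "Actual" learning rate for the current t_cur and iterations,
# i.e. lr as modulated by eta_t (cosine annealing):
lr_actual = K.eval(model.optimizer.lr_t)
print("effective lr:", lr_actual)
```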

MISC:

  • Added test_updates to ensure all weights update synchronously, and that eta_t is first applied to weights as-is and then updated according to t_cur
  • Fixes #47

v1.31

3 years ago

BUGFIXES:

  • SGDW with momentum=0 would break due to variable scoping issues; the rewritten code is correct and should run a little faster. Files affected: optimizers_v2.py, optimizers_225tf.py

MISC:

  • Added test case for SGDW(momentum=0)
  • Added control test for SGDW(momentum=0) vs SGD(momentum=0) (sketched after this list)
  • tests/import_selection.py -> tests/backend.py
  • test_optimizers.py can now run as __main__ without manually changing paths / working directories
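
A rough sketch of such a control test; toy data, layer sizes, and exact constructor arguments are placeholders (see the README for the real signatures):

```python
import numpy as np
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import SGD
from keras_adamw import SGDW

x = np.random.randn(32, 8).astype('float32')
y = np.random.randn(32, 4).astype('float32')

def make_model():
    return Sequential([Dense(4, input_shape=(8,))])

m_sgd, m_sgdw = make_model(), make_model()
m_sgdw.set_weights(m_sgd.get_weights())  # identical starting weights

m_sgd.compile(SGD(learning_rate=0.01, momentum=0.0), loss='mse')
# With momentum=0 and no weight decays configured, SGDW should reduce to plain SGD:
m_sgdw.compile(SGDW(learning_rate=0.01, momentum=0.0), loss='mse')

m_sgd.fit(x, y, epochs=1, batch_size=8, shuffle=False, verbose=0)
m_sgdw.fit(x, y, epochs=1, batch_size=8, shuffle=False, verbose=0)

# Final weights should match up to floating-point noise
for w1, w2 in zip(m_sgd.get_weights(), m_sgdw.get_weights()):
    assert np.allclose(w1, w2, atol=1e-6)
```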

v1.30

3 years ago

FEATURES:

  • Compatibility with TF 2.2 (other versions still compatible, but no longer tested)
  • eta_t now behaves deterministically, updating after t_cur (previously, behavior was semi-random)
  • Lots of code cleanup

USAGE NOTES:

  • t_cur should now be set to -1 instead of 0 to reset eta_t to 0 (see the sketch after this list)
  • t_cur should now be set at iters == total_iterations - 2; explanation here
  • total_iterations must now be > 1, instead of only > 0
  • total_iterations <= 1 will force weight_decays and lr_multipliers to None
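
A minimal sketch of the manual reset described in the first two notes (assumes a compiled model using one of these optimizers; the iteration bookkeeping is up to the training loop):

```python
import tensorflow.keras.backend as K

# Manual Warm Restart (what autorestart later automates): inside the training
# loop, once iters == total_iterations - 2, reset t_cur to -1 (not 0):
K.set_value(model.optimizer.t_cur, -1)
```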

FIXES:

  • Optimizers will no longer zero layer penalties if weight decays cannot be applied (i.e. total_iterations is not > 1)
  • eta_t is now properly updated as a tf.Variable, instead of being an update tf.Tensor
  • Testing didn't actually include Eager execution in the last version - it now does

BREAKING:

  • utils_225tf.py removed
  • utils_common.py removed
  • optimizers_tfpy.py removed
  • utils.py code is now that of utils_225tf.py
  • utils_common.py merged with utils.py
  • self.batch_size is now an int, instead of tf.Variable

MISC:

  • tests: /test_optimizers, /test_optimizers_225, /test_optimizers_225tf, test_optimizers_v2, test_optimizers_tfpy removed
  • All tests now done in single file: tests/test_optimizers.py
  • _update_t_cur_eta_t and _update_t_cur_eta_t_apply_lr_mult added to utils.py
  • Updated examples.py and related parts in README

v1.23

4 years ago

BUGFIX:

  • l1 was being decayed as l2, and vice versa; formula now correct

FEATURES:

  • Performance boost due to including only nonzero decays (l1, l2) in calculations
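
A hedged sketch of the idea behind both the bugfix and the performance change above: l1 decay proportional to the weight's sign, l2 decay proportional to the weight itself, and zero penalties skipped entirely. The exact scaling and sign conventions in the library may differ:

```python
import numpy as np

def apply_decoupled_decay(w, eta_t, l1=0.0, l2=0.0):
    """Illustrative decoupled weight decay on a weight array `w`."""
    decay = np.zeros_like(w)
    if l1 != 0:                      # only nonzero penalties enter the computation
        decay += l1 * np.sign(w)     # l1: constant-magnitude shrink toward zero
    if l2 != 0:
        decay += l2 * w              # l2: proportional shrink toward zero
    return w - eta_t * decay
```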

MISC:

  • Renamed functions in utils_common, and removed an unused kwarg in get_weight_decays

v1.21

4 years ago

FEATURES:

  • from keras_adamw import now accounts for TF 1.x + Keras 2.3.x case
  • model and zero_penalties now show up in optimizer constructor input signatures, making them clearer and more Pythonic
  • Each optimizer now has its own full docstring, instead of deferring to help(AdamW)

BREAKING:

  • model is no longer to be passed as first positional argument, but as a later one, or a keyword argument (model=model)
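
A minimal sketch of the updated call (model, layer sizes, and loss are placeholders; other constructor arguments omitted):

```python
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from keras_adamw import AdamW

model = Sequential([Dense(4, input_shape=(8,))])

# Before: AdamW(model, ...)  -- model as the first positional argument
# After:  pass model as a keyword argument (or a later positional one):
optimizer = AdamW(model=model, zero_penalties=True)
model.compile(optimizer, loss='mse')
```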

BUGFIXES:

  • name defaults corrected; many were "AdamW" even for non-AdamW optimizers - though no bugs were encountered as a result

MISC:

  • __init__ wrapper moved inside of __init__ to avoid overriding input signature