
Distributed K-FAC Preconditioner for PyTorch

v0.4.1

2 years ago

Major Changes

  • Critical bug fix in BaseKFACPreconditioner (#48)

Minor Changes

  • Updated issue templates so that a tag is no longer included in the title (#46).
  • Example training scripts now collect environment information for easier debugging (#47).

v0.4.0

2 years ago

Complete refactor of kfac-pytorch

See Pull Requests #38, #40, #41, and #42.

DevOps changes

  • kfac now requires torch>=1.8 and Python>=3.7
  • tox is now used for testing environments and automation
  • pre-commit configuration updated; major changes include a preference for single quotes, mypy, and flake8 plugins
  • Switched to setup.cfg for package metadata and tox/flake8/mypy/coverage configuration
  • Added requirement-dev.txt, which contains all dependencies needed to run the test suite

Code quality and testing

  • Complete type annotations for all code
    • Passes mypy
  • Separated testing utilities and unit tests into testing/ and tests/ respectively
  • Extensive unit test suite that achieves 100% code coverage
  • New testing utilities include wrappers for simulating distributed environments and small test models
  • Added end-to-end training tests
    • A small unit test (run with pytest) that checks the loss decreases when training with K-FAC (a sketch follows this list)
    • An MNIST integration test (not run with pytest) that verifies training with K-FAC reaches higher accuracy than training without it
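
As a rough illustration of the loss-decrease check, a sketch is given below. The tiny model, random data, and default KFACPreconditioner arguments are illustrative assumptions, not the actual test code from tests/.

```python
import torch

from kfac.preconditioner import KFACPreconditioner


def test_loss_decreases_with_kfac() -> None:
    """Train a tiny MLP on random data and check that the loss decreases."""
    torch.manual_seed(0)
    model = torch.nn.Sequential(
        torch.nn.Linear(8, 16),
        torch.nn.ReLU(),
        torch.nn.Linear(16, 1),
    )
    x = torch.randn(64, 8)
    y = torch.randn(64, 1)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    # Default preconditioner arguments are assumed here; the real test may
    # configure the preconditioner differently.
    preconditioner = KFACPreconditioner(model)
    criterion = torch.nn.MSELoss()

    losses = []
    for _ in range(20):
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        preconditioner.step()  # precondition gradients before the optimizer step
        optimizer.step()
        losses.append(loss.item())

    assert losses[-1] < losses[0]
```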

kfac package improvements

  • KFAC layers separated from PyTorch module wrappers
    • KFACBaseLayer handles general K-FAC computations and communications for an arbitrary layer
    • ModuleHelper implementations provide a unified interface for interacting with supported PyTorch modules
      • Provides methods that return the sizes of the layer's factors so they can be determined prior to training
      • Provides methods for getting the current gradients, updating the gradients, and computing the factors from the intermediate data
    • Each KFACBaseLayer instance is passed a ModuleHelper instance corresponding to the module in the model being preconditioned
  • Removed broken LSTM/RNN/Embedding layer support
  • Module registration utilities moved out of the preconditioner class and into the kfac.layers.register module
  • Replaced the comm module with the distributed module, which provides a more exhaustive set of distributed communication utilities
    • All communication ops now return futures for their results to allow more aggressive asynchronous communication (illustrated after this list)
    • Added allreduce bucketing for factor allreduce (closes #32)
    • Added get_rank and get_world_size methods to enable K-FAC training when torch.distributed is not initialized
  • Enum types moved to enums module for convenience with type annotations
  • KFACBaseLayer is now agnostic of its placement
    • I.e., the KFACBaseLayer expects some other object to correctly execute its operations according to some placement strategy.
    • This change was made to allow other preconditioner implementations to use the math/communication operations provided by the KFACBaseLayer without being beholden to a particular placement strategy.
  • Created the BaseKFACPreconditioner which provides the minimal set of functionality for preconditioning with K-FAC
    • Provides state dict saving/loading, a step() method, hook registration to KFACBaseLayer, and some small bookkeeping functionality
    • The BaseKFACPreconditioner takes as input already registered KFACBaseLayers and an initialized WorkAssignment object.
    • This change was made to factor out strategy-specific details from the core preconditioning functions, with the goal of enabling preconditioner implementations that interact more closely with other frameworks such as DeepSpeed
    • Added reset_batch() to clear the staged factors for the batch in the case of a bad batch of data (e.g., if the gradients overflowed)
    • memory_usage() includes the intermediate factors accumulated for the current batch
    • state_dict now includes K-FAC hyperparameters and steps in addition to factors
  • Added KFACPreconditioner, a subclass of BaseKFACPreconditioner, that implements the full functionality described in the KAISA paper (a usage sketch follows this list).
  • New WorkAssignment interface that provides a schematic for the methods needed by BaseKFACPreconditioner to determine where to perform computations and communications
    • Added the KAISAAssignment implementation that provides the KAISA gradient worker fraction-based strategy
  • K-FAC hyperparameter schedule changes
    • The old, inflexible KFACParamScheduler was replaced with a LambdaParamScheduler modeled on PyTorch's LambdaLR scheduler
    • BaseKFACPreconditioner can be passed functions that return the current K-FAC hyperparameters rather than static float values
  • All printing is now done via logging, and BaseKFACPreconditioner takes an optional loglevel parameter (closes #33)
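
To illustrate the futures-based communication pattern mentioned above, the sketch below uses torch.distributed directly (async_op=True plus Work.get_future()). It shows the general pattern only, not the internal kfac.distributed API, and it assumes a process group has already been initialized with a backend that supports get_future() (e.g., gloo or nccl).

```python
import torch
import torch.distributed as dist

# An asynchronous allreduce returns a handle whose future resolves to the
# reduced tensor, so factor communication can overlap with other work.
factor = torch.randn(64, 64)
handle = dist.all_reduce(factor, async_op=True)
future = handle.get_future().then(
    # The future's value is a list containing the reduced tensor.
    lambda fut: fut.value()[0] / dist.get_world_size(),
)

# ... do unrelated work while the allreduce is in flight ...

averaged_factor = future.wait()
```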
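
A rough usage sketch of the new preconditioner API is below. The keyword names (factor_update_steps, inv_update_steps, damping), the zero-argument callable hyperparameter, and the overflow-handling pattern around reset_batch() are illustrative assumptions based on this changelog, not a verbatim reference.

```python
import logging

import torch

from kfac.preconditioner import KFACPreconditioner

model = torch.nn.Linear(32, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Hyperparameters may be given as callables that are re-evaluated during
# training instead of static floats; the keyword names are assumptions.
preconditioner = KFACPreconditioner(
    model,
    factor_update_steps=10,
    inv_update_steps=100,
    damping=lambda: 0.003,  # a real schedule would close over the training step
    loglevel=logging.INFO,
)

criterion = torch.nn.CrossEntropyLoss()
x = torch.randn(16, 32)
y = torch.randint(0, 10, (16,))

for step in range(5):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    if torch.isfinite(loss):
        preconditioner.step()  # update factors/inverses and precondition grads
        optimizer.step()
    else:
        preconditioner.reset_batch()  # discard staged factors from a bad batch
```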

Example script changes

  • Added examples/requirements.txt
  • Usage instructions for examples moved to examples/README.md
  • Updated examples to use the new kfac API
  • Examples are now properly type annotated
  • Removed non-working language model example

Other changes + future goals

  • Removed a lot of content from the README that should eventually be moved to a wiki
    • Previously, the README was quite verbose and made it difficult to find the important content
  • Updated README examples, publications, and development instructions
  • Future changes include:
    • GitHub Actions for running code formatting, unit tests, and integration tests
    • Issue/PR templates
    • Badges in the README
    • A wiki

v0.3.2

2 years ago

README and package dependency updates.

v0.3.1

2 years ago

v1.0.0

3 years ago