MPI Parallel framework for training deep learning models built in Theano
Theano-MPI is a Python framework for the distributed training of deep learning models built in Theano. It implements data parallelism in several ways, e.g., Bulk Synchronous Parallel, Elastic Averaging SGD and Gossip SGD. This project is an extension of theano_alexnet, aiming to scale the training framework up to more than 8 GPUs and across nodes. Please take a look at this technical report for an overview of the implementation details. To cite our work, please use the following bibtex entry.
@article{ma2016theano,
title = {Theano-MPI: a Theano-based Distributed Training Framework},
author = {Ma, He and Mao, Fei and Taylor, Graham~W.},
journal = {arXiv preprint arXiv:1605.08325},
year = {2016}
}
Theano-MPI is compatible with models built in different framework libraries, e.g., Lasagne, Keras and Blocks, as long as their model parameters can be exposed as Theano shared variables. Theano-MPI also comes with a light-weight layer library for building customized models. See the wiki for a quick guide on building customized neural networks based on it. Check out the examples of building Lasagne VGGNet, Wasserstein GAN, LS-GAN and Keras Wide-ResNet.
Theano-MPI depends on the following libraries and packages. We provide some guidance on installing them in the wiki.
Once all dependencies are ready, one can clone Theano-MPI and install it as follows:
$ python setup.py install [--user]
To accelerate the training of Theano models in a distributed way, Theano-MPI needs to identify two components in your code: a model definition (your ModelClass) and a data definition (your DataClass).
It is recommended to organize your model and data definitions in the following way:

launch_session.py or launch_session.cfg
models/*.py
  - __init__.py
  - modelfile.py : defines your customized ModelClass
data/*.py
  - dataname.py : defines your customized DataClass

Your ModelClass in modelfile.py should at least have the following attributes and methods:

- self.params : a list of Theano shared variables, i.e., the trainable model parameters
- self.data : an instance of your customized DataClass defined in dataname.py
- self.compile_iter_fns : a method, your way of compiling train_iter_fn and val_iter_fn
- self.train_iter : a method, your way of using your train_iter_fn
- self.val_iter : a method, your way of using your val_iter_fn
- self.adjust_hyperp : a method, your way of adjusting hyperparameters, e.g., the learning rate
- self.cleanup : a method, the necessary model and data clean-up steps

Your DataClass in dataname.py should at least have the following attributes:

- self.n_batch_train : an integer, the number of training batches to go through in an epoch
- self.n_batch_val : an integer, the number of validation batches to go through during validation

After your model definition is complete, you can choose the desired way of sharing parameters among model instances.
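To make the required interface concrete, here is a minimal hypothetical skeleton of a ModelClass/DataClass pair. Plain-Python placeholders stand in for the Theano parts; in a real model, self.params would hold theano.shared variables and compile_iter_fns would compile Theano functions. The constructor signatures and placeholder bodies are illustrative, not the framework's exact API.

```python
# Hypothetical skeleton of the ModelClass/DataClass interface expected
# by Theano-MPI. Placeholder lambdas stand in for compiled Theano functions.

class DataClass(object):
    def __init__(self):
        self.n_batch_train = 100  # training batches per epoch
        self.n_batch_val = 10     # batches per validation pass


class ModelClass(object):
    def __init__(self, config=None):
        self.data = DataClass()
        self.params = []          # trainable parameters (Theano shared variables)

    def compile_iter_fns(self):
        # compile self.train_iter_fn and self.val_iter_fn here
        self.train_iter_fn = lambda *inputs: 0.0  # placeholder returning a loss
        self.val_iter_fn = lambda *inputs: 0.0

    def train_iter(self, count, recorder=None):
        return self.train_iter_fn()  # one training step

    def val_iter(self, count, recorder=None):
        return self.val_iter_fn()    # one validation step

    def adjust_hyperp(self, epoch):
        pass  # e.g., decay the learning rate on a schedule

    def cleanup(self):
        pass  # release data sources, close files, etc.
```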
Below is an example launch config file for training a customized ModelClass on two GPUs.
# launch_session.cfg
RULE=BSP
MODELFILE=models.modelfile
MODELCLASS=ModelClass
DEVICES=cuda0,cuda1
Then you can launch the training session by calling the following command:
$ tmlauncher -cfg=launch_session.cfg
Alternatively, you can launch sessions within python as shown below:
# launch_session.py
from theanompi import BSP
rule=BSP()
# modelfile: the relative path to the model file
# modelclass: the class name of the model to be imported from that file
rule.init(devices=['cuda0', 'cuda1'] ,
modelfile = 'models.modelfile',
modelclass = 'ModelClass')
rule.wait()
More examples can be found here.
Training (+communication) time per 5120 images in seconds: [allow_gc = True, using nccl32 on copper]
Model | 1GPU | 2GPU | 4GPU | 8GPU |
---|---|---|---|---|
AlexNet-128b | 20.50 | 10.35+0.78 | 5.13+0.54 | 2.63+0.61 |
GoogLeNet-32b | 63.89 | 31.40+1.00 | 15.51+0.71 | 7.69+0.80 |
VGG16-16b | 358.29 | 176.08+13.90 | 90.44+9.28 | 55.12+12.59 |
VGG16-32b | 343.37 | 169.12+7.14 | 86.97+4.80 | 43.29+5.41 |
ResNet50-64b | 163.15 | 80.09+0.81 | 40.25+0.56 | 20.12+0.57 |
More details on the benchmark can be found in this notebook.
Test your single-GPU model with theanompi/models/test_model.py before trying a data-parallel rule.
You may want to use the helper functions in /theanompi/lib/opt.py to construct optimizers, in order to avoid the common pitfalls mentioned in (#22) and get better convergence.
Binding cores according to your NUMA topology may give better performance. Try the -bind option with the launcher (requires the hwloc dependency).
Using the launcher script is the preferred way to start training. Launching from within Python currently causes core-binding problems, especially on NUMA systems.
Shuffling training examples before asynchronous training makes the loss surface a lot smoother as the model converges.
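A quick sketch of this shuffling tip, assuming the training set fits in memory as NumPy arrays (the array names and shapes here are illustrative):

```python
import numpy as np

# Shuffle examples and labels with one shared permutation so pairs stay aligned.
rng = np.random.RandomState(42)          # fixed seed for reproducibility
examples = np.arange(10).reshape(5, 2)   # toy data: 5 examples, 2 features each
labels = np.array([0, 1, 2, 3, 4])

perm = rng.permutation(len(labels))      # one permutation applied to both arrays
examples, labels = examples[perm], labels[perm]

# Row i of `examples` still corresponds to labels[i] after shuffling.
```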
Some known bugs and possible enhancements are listed in Issues. We welcome all kinds of participation (bug reports, discussion, pull requests, etc.) in improving the framework.
© Contributors, 2016-2017. Licensed under an ECL-2.0 license.