Distributed Deep Learning on AWS Using CloudFormation (CFN), MXNet and TensorFlow
AWS CloudFormation, which creates and configures Amazon Web Services resources with a template, simplifies the process of setting up a distributed deep learning cluster. The AWS CloudFormation Deep Learning template uses the Amazon Deep Learning AMI (which provides MXNet, TensorFlow, Caffe, Theano, Torch, and CNTK frameworks) to launch a cluster of EC2 instances and other AWS resources needed to perform distributed deep learning.
With this template, we continue with our mission to make distributed deep learning easy. AWS CloudFormation creates all resources in the customer account.
We've updated the AWS CloudFormation Deep Learning template to add some exciting new features and capabilities.
Introduce AWS DLAMI Conda v16.0 - For developers who want pre-installed pip packages of deep learning frameworks in separate virtual environments, the Conda-based AMI is available in Ubuntu and Amazon Linux. Check AWS DLAMI Official webpage for more details.
We now support 11 AWS regions- us-east-1, us-west-2, eu-west-1, us-east-2, ap-southeast-2, ap-northeast-1, ap-northeast-2, ap-south-1, eu-central-1,ap-southeast-1, us-west-1.
We now support g3 and c5 instances, and remove g2 instances.
Update MXNet submodule to 1.3.0.
We now support 10 AWS regions - us-east-1, us-west-2, eu-west-1, us-east-2, ap-southeast-2, ap-northeast-1, ap-northeast-2, ap-south-1, eu-central-1,ap-southeast-1.
We now support p3 instances.
We now support 5 AWS regions - us-east-1, us-east-2, us-west-2, eu-west-1 and ap-southeast-2.
We've enhanced the AWS CloudFormation Deep Learning template with automation that continues stack creation even if the provisioned number of worker instances falls short of the desired count. In the previous version of the template, if one of the worker instances failed to be provisioned, for example, if it a hit account limit, AWS CloudFormation rolled back the stack and required you to adjust your desired count and restart the stack creation process. The new template includes a function that automatically adjusts the count down and proceeds with setting up the rest of the cluster (stack).
We now support creating a cluster of CPU Amazon EC2 instance types.
We've also added Amazon Elastic File System (Amazon EFS) support for the cluster created with the template.
We now support creating a cluster of instances running Ubuntu. See the Ubuntu Deep Learning AMI.
The following architecture diagram shows the EC2 cluster infrastructure.
The Amazon Deep Learning template creates a stack that contains the following resources:
The startup script enables SSH forwarding on all hosts. Enabling SSH agent forwarding is essential because frameworks such as MXNet use SSH for communication between master and worker instances during distributed training.
The startup script on the master polls the master SQS queue for messages confirming that Auto Scaling setup is complete. The Lambda function sends two messages, one when the master Auto Scaling group is successfully set up, and a second when either the requested capacity is satisfied or when instances fail to launch on the worker Auto Scaling group. When instance launch fails on the worker Auto Scaling group, the Lambda function modifies the desired capacity to the number of instances that have been successfully launched.
Upon receiving messages on the Amazon SQS master queue, the setup script on the master configures all of the necessary worker metadata (IP addresses of the workers, GPU count, etc.,) and broadcasts the metadata on the worker SQS queue. Upon receiving this message, the startup script on the worker instances that are polling the SQS worker queue configure this metadata on the workers.
The following environment variables are set up on all the instances:
$DEEPLEARNING_WORKERS_PATH: The file path that contains the list of workers
$DEEPLEARNING_WORKERS_COUNT: The total number of workers
$DEEPLEARNING_WORKER_GPU_COUNT: The number of GPUs on the instance
$EFS_MOUNT: The directory where Amazon EFS is mounted
To set up a deep learning AWS CloudFormation stack, follow Using the AWS CloudFormation Deep Learning Template.
To demonstrate how to run distributed training using MXNet and Tensorflow frameworks, we use the standard CIFAR-10 model. CIFAR-10 is a sufficiently complex network that benefits from a distributed setup and that can be quickly trained on such a setup.
Follow Step 3 in Using the AWS CloudFormation Deep Learning Template.
Note: This could take a few minutes.
git clone https://github.com/awslabs/deeplearning-cfn $EFS_MOUNT/deeplearning-cfn && \
cd $EFS_MOUNT/deeplearning-cfn && \
#
#fetches dmlc/mxnet and tensorflow/models repos as submodules
git submodule update --init $EFS_MOUNT/deeplearning-cfn/examples/tensorflow/models && \
git submodule update --init $EFS_MOUNT/deeplearning-cfn/examples/incubator-mxnet && \
cd $EFS_MOUNT/deeplearning-cfn/examples/incubator-mxnet && \
git submodule update --init $EFS_MOUNT/deeplearning-cfn/examples/incubator-mxnet/3rdparty/dmlc-core
# We also need to pull latest dmlc-core code to make it work in conda environment
cd $EFS_MOUNT/deeplearning-cfn/examples/incubator-mxnet/3rdparty/dmlc-core
git fetch origin master
git checkout -b master origin/master
The following example shows how to run CIFAR-10 with data parallelism on MXNet. Note the use of the DEEPLEARNING_* environment variables.
#terminate all running Python processes across workers
while read -u 10 host; do ssh -o "StrictHostKeyChecking no" $host "pkill -f python" ; \
done 10<$DEEPLEARNING_WORKERS_PATH
#navigate to the MXNet image-classification example directory \
cd $EFS_MOUNT/deeplearning-cfn/examples/incubator-mxnet/example/gluon/
#run the CIFAR10 distributed training example in mxnet conda environment(Change mxnet_p36 to mxnet_p27 if you wnat to run in python27 environment)\
../../tools/launch.py -n $DEEPLEARNING_WORKERS_COUNT -H $DEEPLEARNING_WORKERS_PATH "source activate mxnet_p36 && python image_classification.py --dataset cifar10 --model vgg11 --epochs 1 --kvstore dist_device_sync"
We were able to run the training for 100 epochs in 25 minutes on 2 P2.8x EC2 instances and achieve a training accuracy of 92%.
These steps summarize how to get started. For more information about running distributed training on MXNet, see Run MXNet on Multiple Devices.
Horovod is a distributed training framework to make distributed deep learning easy and fast. You can get more information about the advantages of using Horovod to do distributed training with Tensorflow or other framework. Starting from DLAMI Conda v19.0, it comes with example pre-configured scripts to show how you can train model with Tensorflow and Horovod on multi gpus/machines.
The following example shows how to train a ResNet-50 model with synthetic data.
cd ~/examples/horovod/tensorflow
vi hosts
You should be able to see localhost slots=8
in your hosts file. The number of slots means how many GPUs you want to use in that machine to train your model. Also, you should append your worker nodes to the hosts file, and assign GPU number to it. To know how many GPUs available in your instance, run nvidia-smi
. After the change, your hosts file should look like
localhost slots=<#GPUs>
<worker node private ip> slots=<#GPUs>
......
You can easily calculate the number of GPUs you'll use to train the model by summing up the slots available on each machine. Note that: the argument passed to the train_synthetic.sh script below is passed to -np parameter of mpirun. The -np argument represents the total number of processes and the slots argument in hostfile represents the split of those processes per machine.
Then, just run
./train_synthetic.sh 24
or replace 24 with number of GPUs you use.