Distributed, mixed-precision training with PyTorch
Kindratenko, Volodymyr, Dawei Mu, Yan Zhan, John Maloney, Sayed Hadi Hashemi, Benjamin Rabe, Ke Xu, Roy Campbell, Jian Peng, and William Gropp. "HAL: Computer System for Scalable Deep Learning." In Practice and Experience in Advanced Research Computing, pp. 41-48. 2020. https://doi.org/10.1145/3311790.3396649.
We will focus on:
Apex
We will cover the following training methods for PyTorch:
torch.nn.DataParallel
torch.nn.DistributedDataParallel
Apex
TensorBoard logging under distributed training context
We will cover the following use cases:
See RESULTS.md for more details.
The ImageNet dataset used throughout this tutorial is located at /home/shared/imagenet/raw/
Apex (not required for PyTorch >= 1.6.0)
$ git clone https://github.com/NVIDIA/apex
$ cd apex
$ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
PyTorch 1.6.0 and later include automatic mixed precision natively: use torch.cuda.amp instead of apex.amp. See https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/ for more details.
Apex (Recommended)
References:
Mixed-precision training: the majority of the network uses FP16 arithmetic, while potentially unstable operations are automatically cast to FP32.
Key points:
Advantages:
imagenet_ddp_mixprec.py is deprecated. Use imagenet_ddp_apex.py.
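Since apex.amp is deprecated in favor of the built-in torch.cuda.amp, the core mixed-precision loop looks roughly like this (the tiny model and random data are placeholders for illustration; amp is enabled only when a GPU is present, so the sketch also runs on CPU):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# GradScaler handles dynamic loss scaling; both it and autocast become
# no-ops when CUDA is unavailable, so this sketch also runs on CPU.
use_amp = torch.cuda.is_available()
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

for _ in range(3):
    x = torch.randn(8, 10)
    y = torch.randint(0, 2, (8,))
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=use_amp):
        loss = nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()  # scale the loss to avoid FP16 underflow
    scaler.step(optimizer)         # unscale gradients, then optimizer.step()
    scaler.update()                # adjust the loss scale for the next step
```

The scale/step/update triple is what Apex's opt levels automate; here it is explicit.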
python -m torch.distributed.launch --nproc_per_node=4 imagenet_ddp_apex.py -a resnet50 --b 208 --workers 20 --opt-level O2 /home/shared/imagenet/raw/
To run your program on 2 nodes with 4 GPUs each, you will need to open 2 terminals and run a slightly different command on each node.
Node 0:
python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr="192.168.100.11" --master_port=8888 imagenet_ddp_apex.py -a resnet50 --b 208 --workers 20 --opt-level O2 /home/shared/imagenet/raw/
--nproc_per_node: number of GPUs on the current node; each process is bound to a single GPU
--node_rank: rank of the current node, an int between 0 and --nnodes - 1
--master_addr: IP address of the master node of your choice (type: str)
--master_port: an open port number on the master node (type: int). If you don't know one, use 8888
--workers: number of data-loading workers on the current node. These are separate from the processes that run the program on each GPU; the total number of processes = number of data-loading workers + number of GPUs (one process per GPU)
-b: per-GPU batch size. For a 16 GB GPU, 224 is the maximum. It needs to be a multiple of 8 to make use of Tensor Cores. If you are using TensorBoard logging, you need to use a slightly smaller batch size!
Node 1:
python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=1 --master_addr="192.168.100.11" --master_port=8888 imagenet_ddp_apex.py -a resnet50 --b 208 --workers 20 --opt-level O2 /home/shared/imagenet/raw/
During mixed-precision training you may see messages like: Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0. This is expected behavior: the dynamic loss scaler skips the optimizer step and reduces the loss scale whenever an FP16 overflow is detected.
After you Ctrl-C on each compute node, some processes may still be alive. To clean up all Python processes on the current node, use: pkill -9 python
Use cases:
This is the most basic training method, with no data parallelism at all.
python imagnet_nd.py -a resnet50 /home/shared/imagenet/raw/
The default learning rate schedule starts at 0.1 and decays by a factor of 10 every 30 epochs. This is appropriate for ResNet and models with batch normalization, but too high for AlexNet and VGG. Use 0.01 as the initial learning rate for AlexNet or VGG:
python imagnet_nd.py -a alexnet --lr 0.01 /home/shared/imagenet/raw/
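The schedule above ("decay by a factor of 10 every 30 epochs") can be sketched as a small helper; the optimizer here is any object with a param_groups list, as in PyTorch:

```python
def adjust_learning_rate(optimizer, epoch, initial_lr=0.1):
    """Set the learning rate to initial_lr decayed by 10x every 30 epochs."""
    lr = initial_lr * (0.1 ** (epoch // 30))
    for param_group in optimizer.param_groups:
        param_group["lr"] = lr
    return lr
```

Call it once at the start of each epoch, before that epoch's training loop.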
Use cases:
References:
torch.nn.DataParallel
is easier to use (just wrap the model and run your training script). However, because it uses one process to compute the model weights and then distributes them to each GPU on the current node during each batch, networking quickly becomes a bottleneck and GPU utilization is often very low. Furthermore, it requires that all the GPUs be on the same node and it doesn't work with Apex for mixed-precision training.
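A minimal sketch of the wrap-and-run pattern (the model and batch are placeholders; with no visible GPUs, DataParallel simply falls back to running the underlying module):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
# DataParallel splits each input batch along dim 0 across all visible GPUs
# on this node, runs the replicas in parallel, and gathers the outputs.
model = nn.DataParallel(model)
output = model(torch.randn(8, 10))  # one forward pass over an 8-sample batch
```

Nothing else in the training script changes, which is the appeal of this method despite its drawbacks.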
Use cases:
References:
torch.nn.DistributedDataParallel
is the recommended way of doing distributed training in PyTorch. It is proven to be significantly faster than torch.nn.DataParallel
for single-node multi-GPU data parallel training. The nccl
backend is currently the fastest and the recommended backend for distributed training; this applies to both single-node and multi-node training.
Multiprocessing with DistributedDataParallel duplicates the model on each GPU on each compute node. The GPUs can all be on the same node or spread across multiple nodes. If you have 2 compute nodes with 4 GPUs each, you have a total of 8 model replicas. Each replica is controlled by one process and handles a portion of the input data. Every process does identical tasks, and each process communicates with all the others. During the backwards pass, gradients from each node are averaged. Only gradients are passed between the processes/GPUs, so that network communication is less of a bottleneck.
During training, each process loads its own minibatches from disk and passes them to its GPU. Each GPU does its own forward pass, and then the gradients are all-reduced across the GPUs. Gradients for each layer do not depend on previous layers, so the gradient all-reduce is calculated concurrently with the backwards pass to further alleviate the networking bottleneck. At the end of the backwards pass, every node has the averaged gradients, ensuring that the model weights stay synchronized.
All this requires that the multiple processes, possibly on multiple nodes, are synchronized and can communicate. PyTorch does this through its distributed.init_process_group
function. This function needs to know where to find process 0, so that all the processes can sync up, and the total number of processes to expect. Each individual process also needs to know the total number of processes, its rank within them, and which GPU to use. It's common to call the total number of processes the world size
. Finally, each process needs to know which slice of the data to work on, so that the batches are non-overlapping. PyTorch provides torch.utils.data.distributed.DistributedSampler
to accomplish this.
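The pieces above (init_process_group, the DDP wrapper, and DistributedSampler) fit together roughly as follows. This is shown as a single-process world on CPU with the gloo backend so the sketch is self-contained; a real run uses nccl, one process per GPU, the real rank and world size, and a real master address and port:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Every process must agree on where to find process 0 (the master) and on
# the total number of processes (the world size).
dist.init_process_group(
    backend="gloo",                       # use "nccl" on GPU clusters
    init_method="tcp://127.0.0.1:23456",  # master address and an open port
    rank=0,
    world_size=1,
)

model = DDP(nn.Linear(10, 2))  # gradients are all-reduced during backward()

dataset = TensorDataset(torch.randn(32, 10), torch.randint(0, 2, (32,)))
# DistributedSampler hands each rank a non-overlapping slice of the dataset;
# it reads the rank and world size from the initialized process group.
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=8, sampler=sampler)

sampler.set_epoch(0)  # call every epoch so shuffling differs between epochs
for x, y in loader:
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()

dist.destroy_process_group()
```

In a real multi-process run this same script executes once per GPU; only the rank (and possibly the device) differs between processes.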
python imagenet_ddp.py -a resnet50 --dist-url 'tcp://MASTER_IP:MASTER_PORT' --dist-backend 'nccl' --world-size 1 --rank 0 --desired-acc 0.75 /home/shared/imagenet/raw/
To run your program on 4 nodes with 4 GPUs each, you will need to open 4 terminals and run a slightly different command on each node.
Node 0:
python imagenet_ddp.py -a resnet50 --dist-url 'tcp://MASTER_IP:MASTER_PORT' --dist-backend 'nccl' --world-size 4 --rank 0 --desired-acc 0.75 /home/shared/imagenet/raw/
MASTER_IP: IP address of the master node of your choice
MASTER_PORT: an open port number on the master node. If you don't know one, use 8888
--world-size: equals the number of compute nodes you are using
--rank: rank of the current node, an int between 0 and --world-size - 1
--desired-acc: desired accuracy at which to stop training
--workers: number of data-loading workers on the current node. These are separate from the processes that run the program on each GPU; the total number of processes = number of data-loading workers + number of GPUs (one process per GPU)
Node 1:
python imagenet_ddp.py -a resnet50 --dist-url 'tcp://MASTER_IP:MASTER_PORT' --dist-backend 'nccl' --world-size 4 --rank 1 --desired-acc 0.75 /home/shared/imagenet/raw/
Node 2:
python imagenet_ddp.py -a resnet50 --dist-url 'tcp://MASTER_IP:MASTER_PORT' --dist-backend 'nccl' --world-size 4 --rank 2 --desired-acc 0.75 /home/shared/imagenet/raw/
Node 3:
python imagenet_ddp.py -a resnet50 --dist-url 'tcp://MASTER_IP:MASTER_PORT' --dist-backend 'nccl' --world-size 4 --rank 3 --desired-acc 0.75 /home/shared/imagenet/raw/
TensorBoard
TensorBoard is now a built-in module in PyTorch: torch.utils.tensorboard
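In a distributed run, every process would otherwise write its own event files, so a common pattern is to create a SummaryWriter only on rank 0 (the RANK environment variable and the temporary log directory here are illustrative):

```python
import os
import tempfile
from torch.utils.tensorboard import SummaryWriter

rank = int(os.environ.get("RANK", "0"))  # global rank in a distributed run
log_dir = tempfile.mkdtemp()             # placeholder for a real log directory

# Only rank 0 logs; the other ranks keep writer=None and skip logging calls.
writer = SummaryWriter(log_dir) if rank == 0 else None
for step in range(3):
    if writer is not None:
        writer.add_scalar("train/loss", 1.0 / (step + 1), global_step=step)
if writer is not None:
    writer.close()
```

Remember the note above about batch size: the writer adds some GPU-side overhead, so leave a little headroom when logging is enabled.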
Horovod
References:
Shout-out to all the references, blogs, code samples... used in this tutorial!