Elastic Deep Learning (EDL) for deep learning frameworks on Kubernetes
Computing resources on clouds such as Amazon AWS and Baidu Cloud are multi-tenant, and training and inference of deep learning models with elastic resources will become common on the cloud. We propose Elastic Deep Learning (EDL), which makes training and inference of deep learning models on the cloud easier and more efficient.
EDL is now an incubation-stage project of the LF AI Foundation.
The EDL package supports Python 2.7/3.6/3.7. You can install it with `pip install paddle_edl`, but we highly recommend using it in our Docker image:
```shell
docker pull hub.baidubce.com/paddle-edl/paddle_edl:latest-cuda9.0-cudnn7
nvidia-docker run -it --name paddle_edl hub.baidubce.com/paddle-edl/paddle_edl:latest-cuda9.0-cudnn7 /bin/bash
```
Install Paddle Serving and download the teacher model:

```shell
pip install paddle-serving-server-gpu
cd example/distill/resnet
wget --no-check-certificate https://paddle-edl.bj.bcebos.com/distill_teacher_model/ResNeXt101_32x16d_wsl_model.tar.gz
tar -zxf ResNeXt101_32x16d_wsl_model.tar.gz
```
Start the teacher model as a service:

```shell
python -m paddle_serving_server_gpu.serve \
    --model ResNeXt101_32x16d_wsl_model \
    --mem_optim \
    --port 9898 \
    --gpu_ids 1
```
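Before launching the student, you may want to confirm the teacher service is accepting connections on port 9898. A minimal check with a hypothetical helper (not part of EDL), assuming the service was started on localhost as above:

```python
import socket

def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Check the teacher service started above.
if port_is_open("127.0.0.1", 9898):
    print("teacher service is up")
else:
    print("teacher service is not reachable yet")
```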
Then start the student training, distilling from the teacher service:

```shell
python -m paddle.distributed.launch --selected_gpus 0 \
    ./train_with_fleet.py \
    --model=ResNet50_vd \
    --data_dir=./ImageNet \
    --use_distill_service=True \
    --distill_teachers=127.0.0.1:9898
```
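Under the hood, service-based distillation trains the student against the teacher's soft predictions rather than only the hard labels. A framework-agnostic sketch of the idea in NumPy (the temperature and loss form are illustrative assumptions, not EDL's exact formulation):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, temperature=4.0):
    """Cross-entropy between teacher soft targets and student predictions."""
    t = softmax(teacher_logits, temperature)  # soft targets from the teacher service
    s = softmax(student_logits, temperature)  # student predictions
    return float(-(t * np.log(s + 1e-12)).sum(axis=-1).mean())

# Toy batch: 2 samples, 4 classes.
teacher = np.array([[5.0, 1.0, 0.5, 0.1], [0.2, 4.0, 0.3, 0.1]])
agree    = distill_loss(teacher.copy(), teacher)  # student matches the teacher
disagree = distill_loss(-teacher, teacher)        # student contradicts the teacher
print(agree < disagree)  # → True: agreement gives a lower loss
```

In EDL's setup, the teacher logits come back from the serving endpoint given by `--distill_teachers`, so the expensive teacher forward pass runs on separate, elastic resources.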
To run distillation on clusters, please refer to Run EDL distillation training.
Performance benchmark on industrial cluster
mode | teacher resource | student resource | total batch size | acc1 | acc5 | speed (img/s)
---|---|---|---|---|---|---
pure training | None | 8 * V100 | 256 | 77.1 | 93.5 | 1828
teacher and student on the same GPUs | 8 * V100 | 8 * V100 | 256 | 79.0 | 94.3 | 656
EDL service distillation | 40 * P4 | 8 * V100 | 256 | 79.0 | 94.5 | 1514
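Reading the table: serving the teacher on separate P4 cards recovers the accuracy gain of joint teacher/student training while keeping most of the pure-training throughput. A quick calculation from the numbers above:

```python
# Throughput (img/s) from the benchmark table above.
pure, joint, edl = 1828, 656, 1514

print(f"joint training keeps {joint / pure:.0%} of pure-training speed")   # 36%
print(f"EDL distillation keeps {edl / pure:.0%} of pure-training speed")   # 83%
print(f"EDL distillation is {edl / joint:.2f}x faster than joint training")  # 2.31x
```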
Start the demo job server, which simulates resource changes on the nodes:

```shell
cd example/demo/collective

node_ips="127.0.0.1"
python -u paddle_edl.demo.collective.job_server_demo \
    --node_ips ${node_ips} \
    --pod_num_of_node 8 \
    --time_interval_to_change 900 \
    --gpu_num_of_node 8
```
```shell
# set the ImageNet data path
export PADDLE_EDL_IMAGENET_PATH=<your path>
# set the checkpoint path
export PADDLE_EDL_FLEET_CHECKPOINT_PATH=<your path>

mkdir -p resnet50_pod
unset http_proxy https_proxy

# running under EDL
export PADDLE_RUNING_ENV=PADDLE_EDL
export PADDLE_JOB_ID="test_job_id_1234"
export PADDLE_POD_ID="not set"

python -u paddle_edl.demo.collective.job_client_demo \
    --log_level 20 \
    --package_sh ./resnet50/package.sh \
    --pod_path ./resnet50_pod \
    ./train_pretrain.sh
```
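Elastic training leans on checkpoints: when pods are added or removed, a restarted worker resumes from the latest checkpoint under `PADDLE_EDL_FLEET_CHECKPOINT_PATH` before rejoining the job. A minimal, framework-agnostic sketch of "find the latest checkpoint" (the `step_<N>` directory layout is an assumption for illustration, not EDL's actual on-disk format):

```python
import os
import re

def latest_checkpoint(ckpt_dir):
    """Return the path of the highest-numbered 'step_<N>' checkpoint, or None.

    Assumes checkpoints are saved as subdirectories named step_<N>;
    the real layout used by EDL/fleet may differ.
    """
    if not os.path.isdir(ckpt_dir):
        return None
    steps = []
    for name in os.listdir(ckpt_dir):
        m = re.fullmatch(r"step_(\d+)", name)
        if m:
            steps.append((int(m.group(1)), name))
    if not steps:
        return None
    _, name = max(steps)  # compare by step number, not lexicographically
    return os.path.join(ckpt_dir, name)
```

Comparing by the parsed integer matters: lexicographic order would rank `step_99` above `step_100`.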
model | dataset | GPU cards | total batch size | acc1 | acc5
---|---|---|---|---|---
ResNet50 | ImageNet | 16 * V100 | 1024 | 75.5 | 92.8
The whole example is here