AWS Deep Learning Containers (DLCs) are a set of Docker images for training and serving models in TensorFlow, TensorFlow 2, PyTorch, and MXNet. Deep Learning Containers provide optimized environments with TensorFlow and MXNet, Nvidia CUDA (for GPU instances), and Intel MKL (for CPU instances) libraries and are available in the Amazon Elastic Container Registry (Amazon ECR).
The AWS DLCs are used in Amazon SageMaker as the default vehicles for your SageMaker jobs such as training, inference, and transforms. They have also been tested for machine learning workloads on Amazon EC2, Amazon ECS, and Amazon EKS.
For the list of available DLC images, see Available Deep Learning Containers Images. You can find more information on the images available in SageMaker here.
This project is licensed under the Apache-2.0 License.
smdistributed.dataparallel and smdistributed.modelparallel are released under the AWS Customer Agreement.
This section describes the setup to build and test the DLCs on Amazon SageMaker, EC2, ECS, and EKS.
As a running example, we build an MXNet GPU Python 3 training container.
export ACCOUNT_ID=<YOUR_ACCOUNT_ID>
export REGION=us-west-2
export REPOSITORY_NAME=beta-mxnet-training
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin $ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com
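If the repository does not exist in your account yet, you can create it first (a minimal sketch using the AWS CLI and the variables exported above):
aws ecr create-repository --repository-name $REPOSITORY_NAME --region $REGION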
python3 -m venv dlc
source dlc/bin/activate
pip install -r src/requirements.txt
bash src/setup.sh mxnet
The paths to the Dockerfiles follow a specific pattern, e.g., mxnet/training/docker/<version>/<python_version>/Dockerfile.<cpu|gpu>.
These paths are specified by the buildspec.yml residing at <framework>/<training|inference>/buildspec.yml, i.e. mxnet/training/buildspec.yml in this example. If you want to build the Dockerfile for a particular version, or introduce a new version of the framework, recreate the folder structure as per the above and modify the buildspec.yml file to specify the version of the Dockerfile you want to build.
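For example, to introduce a hypothetical MXNet 1.7.0, you could recreate the structure by starting from the existing 1.6.0 Dockerfiles (a sketch; the paths are illustrative):
# Create the folder structure for the new version
mkdir -p mxnet/training/docker/1.7.0/py3
# Seed the new version with the previous version's Dockerfiles, then adapt them
cp mxnet/training/docker/1.6.0/py3/Dockerfile.cpu mxnet/training/docker/1.7.0/py3/
cp mxnet/training/docker/1.6.0/py3/Dockerfile.gpu mxnet/training/docker/1.7.0/py3/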
python src/main.py --buildspec mxnet/training/buildspec.yml --framework mxnet
The above step should take a while to complete the first time you run it, since it has to download all base layers and create intermediate layers for the first time. Subsequent runs should be much faster.
To build a specific image rather than all of them, pass the image type, device type, and Python version, for example:
python src/main.py --buildspec mxnet/training/buildspec.yml \
--framework mxnet \
--image_types training \
--device_types cpu \
--py_versions py3
The supported values for these arguments are:
--image_types <training/inference>
--device_types <cpu/gpu>
--py_versions <py2/py3>
For example, to build the MXNet GPU Python 3 training container:
python src/main.py --buildspec mxnet/training/buildspec.yml \
--framework mxnet \
--image_types training \
--device_types gpu \
--py_versions py3
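Once a build completes, you can verify that the images exist locally (the reference filter pattern is illustrative; adjust it to your repository name):
docker images --filter "reference=*mxnet-training*"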
# mxnet/training/buildspec.yml
1 account_id: &ACCOUNT_ID <set-$ACCOUNT_ID-in-environment>
2 region: &REGION <set-$REGION-in-environment>
3 framework: &FRAMEWORK mxnet
4 version: &VERSION 1.6.0  # <--- Change this to 1.7.0
................
# mxnet/training/buildspec.yml
41 images:
42 BuildMXNetCPUTrainPy3DockerImage:
43 <<: *TRAINING_REPOSITORY
...................
49 docker_file: !join [ docker/, *VERSION, /, *DOCKER_PYTHON_VERSION, /Dockerfile., *DEVICE_TYPE ]
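The !join directive concatenates the YAML anchors into the Dockerfile path. As an illustration (not tool output), for the CPU training image above with the version bumped to 1.7.0:
# *VERSION = 1.7.0, *DOCKER_PYTHON_VERSION = py3, *DEVICE_TYPE = cpu
# docker_file resolves to: docker/1.7.0/py3/Dockerfile.cpu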
If a Dockerfile expects a file to be present in the build context, for example:
# deep-learning-containers/mxnet/training/docker/1.6.0/py3
COPY README-context.rst README.rst
then README-context.rst needs to first be copied into the build context. You can do this by adding the artifact in the framework buildspec file under the context key:
# mxnet/training/buildspec.yml
19 context:
20     README.xyz:                    # <--- Object name (can be anything)
21         source: README-context.rst # <--- Path of the file to be copied
22         target: README.rst         # <--- Name for the object in the build context
The context can be provided at three levels: at the top of the buildspec (shared by all images), under the training/inference context (shared by all training or inference images), or under an individual image:
19 context:
.................
23 training_context: &TRAINING_CONTEXT
24 README.xyz:
25 source: README-context.rst
26 target: README.rst
...............
41 images:
42 BuildMXNetCPUTrainPy3DockerImage:
43 <<: *TRAINING_REPOSITORY
.......................
50 context:
51 <<: *TRAINING_CONTEXT
52 README.xyz:
53 source: README-context.rst
54 target: README.rst
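After rebuilding, you can confirm the artifact made it into the image (a hedged check; the image tag is a placeholder, and the file's path inside the container depends on the Dockerfile's WORKDIR):
docker run --rm $ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/beta-mxnet-training:<tag> cat README.rst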
The following steps outline how to add a package to your image. For more information on customizing your container, see Building AWS Deep Learning Containers Custom Images.
# mxnet/training/docker/1.6.0/py3/Dockerfile.gpu
139 RUN ${PIP} install --no-cache --upgrade \
140 keras-mxnet==2.2.4.2 \
...........................
159 ${MX_URL} \
160 awscli
to
139 RUN ${PIP} install --no-cache --upgrade \
140 keras-mxnet==2.2.4.2 \
...........................
160 awscli \
161 octopush
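After rebuilding the image with the change above, a quick sanity check is to import the new package inside the container (octopush is the placeholder package from the example; the image URI and tag are placeholders too):
docker run --rm $ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/beta-mxnet-training:<tag> python -c "import octopush"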
As part of iterating on your PR, it is sometimes helpful to run your tests locally to avoid using too many extraneous resources or waiting for a build to complete. Testing is supported using pytest.
Similar to building locally, testing locally requires access to a personal/team AWS account. To test:
Either on an EC2 instance with the deep-learning-containers repo cloned, or on your local machine, make sure the images you want to test are available locally (you will likely need to pull them from ECR). Then change directory into the cloned folder and install the test requirements:
cd deep-learning-containers/
pip install -r src/requirements.txt
pip install -r test/requirements.txt
In a shell, export the environment variable DLC_IMAGES as a space-separated list of the ECR URIs to be tested. Set CODEBUILD_RESOLVED_SOURCE_VERSION to a unique identifier that you can use to identify the resources your tests spin up, and set PYTHONPATH to the absolute path of the src/ folder. For example (change the repository names to the ones set up in your account):
export DLC_IMAGES="$ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/pr-pytorch-training:training-gpu-py3 $ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/pr-mxnet-training:training-gpu-py3"
export PYTHONPATH=$(pwd)/src
export CODEBUILD_RESOLVED_SOURCE_VERSION="my-unique-test"
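If the images under test are not yet present locally, you can pull everything in DLC_IMAGES in one go (a small convenience sketch; it assumes you are already logged in to ECR as shown earlier):
for image in $DLC_IMAGES; do
    docker pull "$image"
done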
Our pytest framework expects test/dlc_tests to be the root directory, so change into that directory in your shell:
cd test/dlc_tests
To run all tests (in parallel) associated with your image for a given platform, use the following command:
# EC2
pytest -s -rA ec2/ -n=auto
# ECS
pytest -s -rA ecs/ -n=auto
# EKS
cd ../
export TEST_TYPE=eks
python test/testrunner.py
Remove -n=auto to run the tests sequentially.
To run a specific test file, provide the full path to the test file
pytest -s ecs/mxnet/training/test_ecs_mxnet_training.py
To run a specific test function (in this example, the CPU DGL ECS test), modify the command like so:
pytest -s ecs/mxnet/training/test_ecs_mxnet_training.py::test_ecs_mxnet_training_dgl_cpu
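You can also select tests with pytest's standard -k keyword-expression flag (the expression below is illustrative; match it against the actual test names in the suite):
pytest -s ecs/mxnet/training/ -k "dgl and cpu"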
To run SageMaker local mode tests, launch a CPU or GPU EC2 instance with the latest Deep Learning AMI, clone your fork of the repo, and log in to ECR:
git clone https://github.com/{github_account_id}/deep-learning-containers/
cd deep-learning-containers && git checkout {branch_name}
aws ecr get-login-password --region ${aws_region} | docker login --username AWS --password-stdin ${aws_id}.dkr.ecr.${aws_region}.amazonaws.com
To test MXNet training images:
cd test/sagemaker_tests/mxnet/training/
pip3 install -r requirements.txt
python3 -m pytest -v integration/local --region us-west-2 \
--docker-base-name ${aws_account_id}.dkr.ecr.us-west-2.amazonaws.com/mxnet-training \
--tag 1.6.0-cpu-py36-ubuntu18.04 --framework-version 1.6.0 --processor cpu \
--py-version 3
For TensorFlow inference images, run the equivalent from the TensorFlow inference test directory:
python3 -m pytest -v integration/local \
--docker-base-name ${aws_account_id}.dkr.ecr.us-west-2.amazonaws.com/tensorflow-inference \
--tag 1.15.2-cpu-py36-ubuntu16.04 --framework-version 1.15.2 --processor cpu
To run SageMaker remote tests on your account, please set up the following prerequisite:
Create an IAM role named SageMakerRole in the account and attach the AmazonSageMakerFullAccess managed policy to it.
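A minimal sketch of creating such a role with the AWS CLI (the trust policy simply lets SageMaker assume the role; adapt it to your account's requirements):
aws iam create-role --role-name SageMakerRole \
    --assume-role-policy-document '{
      "Version": "2012-10-17",
      "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "sagemaker.amazonaws.com"},
        "Action": "sts:AssumeRole"
      }]
    }'
aws iam attach-role-policy --role-name SageMakerRole \
    --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
Then, for example, to run the MXNet training MNIST test: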
cd test/sagemaker_tests/mxnet/training/
pip3 install -r requirements.txt
pytest integration/sagemaker/test_mnist.py \
--region us-west-2 --docker-base-name mxnet-training \
--tag training-gpu-py3-1.6.0 --framework-version 1.6.0 --aws-id {aws_id} \
--instance-type ml.p3.8xlarge
For TensorFlow inference images:
python3 -m pytest test/integration/sagemaker/test_tfs.py --registry {aws_account_id} \
--region us-west-2 --repo tensorflow-inference --instance-types ml.c5.18xlarge \
--tag 1.15.2-py3-cpu-build --versions 1.15.2
To run SageMaker benchmark tests on your account, perform the following steps:
Create a file named sm_benchmark_env_settings.config in the deep-learning-containers/ folder with the following contents (the commented lines are optional, for benchmarking multiple images):
export DLC_IMAGES="<image_uri_1-you-want-to-benchmark-test>"
# export DLC_IMAGES="$DLC_IMAGES <image_uri_2-you-want-to-benchmark-test>"
# export DLC_IMAGES="$DLC_IMAGES <image_uri_3-you-want-to-benchmark-test>"
export BUILD_CONTEXT=PR
export TEST_TYPE=benchmark-sagemaker
export CODEBUILD_RESOLVED_SOURCE_VERSION=$USER
export REGION=us-west-2
To apply these settings, run:
source sm_benchmark_env_settings.config
To test all images for multiple frameworks, run:
pip install -r requirements.txt
python test/testrunner.py
To test one individual framework image type, run:
# Assuming that the cwd is deep-learning-containers/
cd test/dlc_tests
pytest benchmark/sagemaker/<framework-name>/<image-type>/test_*.py
The scripts and model resources used in these tests are located at:
deep-learning-containers/test/dlc_tests/benchmark/sagemaker/<framework-name>/<image-type>/resources/
Note: SageMaker does not support tensorflow_inference py2 images.