KubeDL Versions

Run your deep learning workloads on Kubernetes more easily and efficiently.

v0.5.0

1 year ago

Features

This release brings several major features that help cluster admins manage workloads more easily and run them more efficiently.

  1. Enable data caching across different jobs and decouple lifecycle between job and cache system.
  2. Introduce job-coordinator to schedule and admit jobs in multi-tenant queues.
  3. Introduce a new workload named ElasticBatch job, which abstracts offline inference jobs.

v0.4.3

1 year ago

Feature

  1. Implement the elastic training protocol (easyscale) on PyTorch.
  2. Add fault tolerance driven by AIMaster.

Bugfix

  1. Fix sync determination in elastic training.
  2. Fix a gang scheduling deadlock caused by unexpected configuration.

v0.4.2

1 year ago

Changes Since v0.4.1

API Improvements

  • Introduce the job queue API (job-level queueing will be implemented in the next release).

Workloads

  • Support the torch-elastic distributed communication style on both the normal container network and the host network.
  • Upgrade vendor to k8s 1.21 for performance improvements and other optimizations.

v0.4.1

2 years ago

Version v0.4.1 is a stable release, which introduces a lot of stability fixes, API improvements and code optimizations.

Changes Since v0.4.0

API Improvements

  • Introduce modelPath, description, imageTag to Model/ModelVersion specification.
  • Introduce CacheBackend to integrate with cloud native distributed cache systems for training jobs.
  • Introduce Notebook to enable Jupyter virtual environment capability.

Workloads

  • Bug fixes of MPIJob implementations.
  • Bug fixes of Cron scheduling.

Runtime & Dashboard

  • Optimize error and stack-tracing messages.
  • Support volcano gang scheduler protocol.
  • Remove authentication from the dashboard backend.
  • Set TerminationMessageFallbackToLogsOnError as the default termination message policy.
  • Scale in extra pods/services when the expected replicas decrease.
  • Refactor to improve code reusability and robustness.
  • Support failover by failed reasons.
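The scale-in item above can be pictured with a small sketch. This is not KubeDL's actual controller code; it assumes replicas are indexed from 0 and that any replica whose index is at or above the new expected count is "extra". The function name `pods_to_scale_in` is hypothetical.

```python
def pods_to_scale_in(existing_indices, expected_replicas):
    """Return the replica indices that exceed the expected replica count.

    Hypothetical helper: replicas are assumed to be indexed 0..N-1, so any
    index >= expected_replicas is treated as extra and eligible for removal.
    """
    return sorted(i for i in existing_indices if i >= expected_replicas)


# A worker group shrinks from 5 replicas to 3, so indices 3 and 4 are extra.
extra = pods_to_scale_in({0, 1, 2, 3, 4}, expected_replicas=3)
print(extra)
```

Removing the highest indices first keeps the remaining replicas contiguous, which is a common convention for indexed workers, though the real operator may choose differently.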

v0.4.0

2 years ago

  • Add KubeDL model and inference support
  • Add KubeDL dashboard
  • Add volcano as another gang scheduler backend
  • Miscellaneous bug fixes

v0.3.0

3 years ago

Features

Change the CRD definition to the training.kubedl.io group
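For readers updating manifests after this change, the sketch below shows where the group name ends up in a job object. Only the group `training.kubedl.io` comes from the release note; the version `v1alpha1`, the kind `TFJob`, and the job name are assumptions used for illustration and should be checked against the installed CRDs.

```python
import json

# From the release note: the API group that CRDs moved to.
API_GROUP = "training.kubedl.io"


def api_version(group, version):
    """Build the apiVersion string Kubernetes expects: "<group>/<version>"."""
    return f"{group}/{version}"


manifest = {
    # Assumption: "v1alpha1" is illustrative; use the version your CRDs serve.
    "apiVersion": api_version(API_GROUP, "v1alpha1"),
    # Assumption: "TFJob" stands in for whichever workload kind is installed.
    "kind": "TFJob",
    "metadata": {"name": "example-job"},  # hypothetical job name
}

print(json.dumps(manifest, indent=2))
```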

v0.2.0

3 years ago

Features

  • Support MPI training job.
  • Add hostnetwork mode.
  • Support attaching TensorBoard to a running TensorFlow job.
  • Support Mars job persistency.
  • Expose available webservice addresses in Mars job status.

v0.1.0

3 years ago

v0.1.0 is the first formal release of KubeDL. It includes the following stable features:

  • Support running prevalent ML/DL workloads in a single operator.
  • Support submitting a job with artifacts synced from a remote source such as GitHub without rebuilding the image.
  • Support advanced scheduling features such as gang scheduling with pluggable backend schedulers.
  • Provide unified Prometheus metrics for different types of DL jobs, such as job launch delay and the current number of pending/running jobs.
  • Support job metadata persistency with a pluggable storage backend such as MySQL.
  • Enable specific workload type according to the installed CRDs automatically or through the startup flags explicitly.
  • Provide a modular architecture that can be easily extended to more types of DL/ML workloads with shared libraries; see how to add a custom job workload.
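Gang scheduling, mentioned in the list above, means a job's pods are placed all together or not at all. The sketch below illustrates only that all-or-nothing rule; it is not the logic of KubeDL or of any backend scheduler, and `gang_admit`, `free_slots`, and `min_members` are hypothetical names.

```python
def gang_admit(free_slots, min_members):
    """All-or-nothing admission: place every member of the gang or none.

    Returns the number of pods to place. Placing a partial gang is avoided
    because the placed pods would hold resources while waiting for the rest.
    """
    if free_slots >= min_members:
        return min_members  # the whole gang fits, so place it together
    return 0  # not enough room: place nothing and keep the slots free


# A job needing 8 workers is held back when only 4 slots are free,
# and admitted as a unit once 8 slots are available.
print(gang_admit(free_slots=4, min_members=8))
print(gang_admit(free_slots=8, min_members=8))
```

Without this rule, two large jobs could each grab half of the cluster and wait on each other indefinitely, which is the kind of deadlock gang schedulers are meant to prevent.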

The official image, docker.io/kubedl/kubedl:v0.1.0, is hosted on Docker Hub.