MPI Operator Versions

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)

v0.5.0

3 weeks ago

Changes since v0.4.0

  • Features:
    • Add support for MPICH (#562, @sheevy)
    • New field runLauncherAsWorker adds the launcher pod to the hostfile as a worker (#612, @kuizhiqing)
    • Add PodGroup minResources calculation for volcano integration (#566, @lowang-bh)
  • Bug fixes:
    • Fix panic when using PodGroups and PriorityClasses (#561, @tenzen-y)
    • Fix installation of mpijob Python module (#579, @vsoch)
    • Fix hostfile when jobs in different namespaces have the same name (#622, @kuizhiqing)
  • Cleanups:
    • Upgrade k8s libraries to v1.29 (#633, @tenzen-y)
    • Fail the mpi-operator binary if access to API is denied (#619, @emsixteeen)
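
As a rough illustration of the two hostfile-related changes above (#612 and #622), the sketch below builds a hostfile in Open MPI's "host slots=N" format. The function name, its parameters, and the <pod>.<job>.<namespace>.svc naming pattern are assumptions made for illustration; the operator's real naming scheme may differ.

```python
def build_hostfile(job_name, namespace, workers, slots, run_launcher_as_worker=False):
    """Return hostfile text with one 'host slots=N' line per MPI host.

    Hypothetical helper: the DNS naming pattern below is an assumption,
    not the operator's documented format.
    """
    hosts = []
    if run_launcher_as_worker:
        # With runLauncherAsWorker enabled, the launcher pod is listed as a host too.
        hosts.append(f"{job_name}-launcher.{job_name}.{namespace}.svc")
    for index in range(workers):
        hosts.append(f"{job_name}-worker-{index}.{job_name}.{namespace}.svc")
    # Qualifying each host with its namespace keeps same-named jobs distinct.
    return "".join(f"{host} slots={slots}\n" for host in hosts)
```

In this sketch, two jobs that share a name but live in different namespaces produce different host entries, which is the kind of collision the #622 fix addresses.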

Acknowledgments

Thank you to all the contributors (in no particular order): @sheevy @alculquicondor @terrytangyuan @tenzen-y @kuizhiqing @lowang-bh @vsoch @emsixteeen @wang-mask @benash @yeahdongcn @xhejtman @pheianox @lianghao208

v0.4.0

1 year ago

Changes since v0.3.0

  • Breaking changes
    • Removed v1 operator. If you want to use MPIJob v1, you can use the training-operator.
  • Support for suspending semantics. Third party controllers can leverage the suspend field to implement queuing and preemption for an MPIJob.
  • Support for the coscheduling plugins of the scheduler-plugins.
  • The operator supports multi-architecture (amd64, aarch64, and ppc64le).
  • Bug fixes
    • Fix support for elastic Horovod.
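
To make the suspend semantics above more concrete, here is a minimal sketch of how a third-party queueing controller might toggle the field on an MPIJob manifest held as a plain dictionary. It assumes the field lives at spec.runPolicy.suspend in the v2beta1 API; the helper name with_suspend and the sample manifest are hypothetical.

```python
import copy


def with_suspend(mpijob, suspended):
    """Return a copy of an MPIJob manifest with runPolicy.suspend set.

    Assumption: the suspend flag is located under spec.runPolicy.
    """
    updated = copy.deepcopy(mpijob)
    run_policy = updated.setdefault("spec", {}).setdefault("runPolicy", {})
    run_policy["suspend"] = bool(suspended)
    return updated


# Minimal manifest skeleton; replica specs are omitted for brevity.
job = {
    "apiVersion": "kubeflow.org/v2beta1",
    "kind": "MPIJob",
    "metadata": {"name": "pi", "namespace": "default"},
    "spec": {"runPolicy": {}},
}

queued = with_suspend(job, True)        # the controller holds the job in a queue
admitted = with_suspend(queued, False)  # and later admits it to run
```

Working on a copy leaves the original manifest untouched, which makes it easier to compare the desired and current states before sending an update.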

Acknowledgments

Special thanks to @tenzen-y for multiple contributions. Thank you to all the contributors (in no particular order): @mimowo @adilhusain-s @davidLif @ArangoGutierrez @shaowei-su @ggaaooppeenngg @pugangxa @HeGaoYuan @Dimss @alculquicondor @terrytangyuan

v0.3.0

2 years ago

Release v0.3.0

  • Scalability improvements
    • Worker start up no longer issues requests to kube-apiserver.
    • Dropped kubectl-delivery init container, reducing stress on kube-apiserver.
  • Support for Intel MPI.
  • Support for runPolicy (ttlSecondsAfterFinished, activeDeadlineSeconds, backoffLimit) by using a k8s Job for the launcher.
  • Samples for plain MPI applications.
  • Production readiness improvements:
    • Increased coverage throughout unit, integration and E2E tests.
    • More robust API validation.
    • Revisited v2beta1 MPIJob API.
    • Using fully-qualified label names, consistent with other Kubeflow operators.
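
The runPolicy item above relies on the launcher being a Kubernetes Job, whose batch/v1 spec has fields with the same three names (the TTL field is spelled ttlSecondsAfterFinished in the Job API). The sketch below shows one way such a pass-through could look. It is an illustration only, not the operator's actual code, and the helper name launcher_job_spec is hypothetical.

```python
# Fields that exist on both the MPIJob runPolicy and the batch/v1 Job spec.
RUN_POLICY_JOB_FIELDS = (
    "ttlSecondsAfterFinished",
    "activeDeadlineSeconds",
    "backoffLimit",
)


def launcher_job_spec(run_policy, pod_template):
    """Build a Job spec dict for the launcher from a runPolicy dict.

    Hypothetical helper: only fields the user actually set are copied,
    so Kubernetes defaults apply to the rest.
    """
    spec = {"template": pod_template}
    for field in RUN_POLICY_JOB_FIELDS:
        value = run_policy.get(field)
        if value is not None:
            spec[field] = value
    return spec
```

For example, launcher_job_spec({"backoffLimit": 3}, template) yields a spec containing only the template and backoffLimit, leaving the other two fields to their Kubernetes defaults.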

v0.2.3

3 years ago

Enhancements

  • Added support for RH OCP4.1 and RH OCP4.2
  • Added additional installation methods
  • Added support for Go Modules and removed vendor directories
  • Added default ephemeral storage for init container
  • Overwrite NVIDIA env vars to avoid using GPUs on launcher
  • Added health check and callbacks around various leader election phases
  • Honor user-specified worker command
  • Exposed main container name as a configurable field
  • Added RunPolicy to MPIJobSpec that reuses kubeflow/common spec
  • Allow specifying the name of the gang scheduler and the priority for the pod group
  • Added error log when pod spec does not have any containers
  • Switched to use distroless images
  • Refactored the kubectl-delivery to improve the launcher performance
  • Added Prometheus metrics for job monitoring
  • Added experimental version of v1 MPIJob controller and APIs
  • Support Volcano as a scheduler
  • Switched to use pods for launcher job and statefulset workers
  • Switched to use klog for logging
  • More consistent labels with other Kubeflow operators

Fixes

  • Fixed nil pointer exceptions that could accidentally restart the pod
  • Updated status to running only when launcher is active and all workers are ready
  • Fixed the incorrect namespace for initializing informers and endpoints of leader election
  • Fixed issue in v1 controller's CRD existence check
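
The status fix above amounts to a simple condition, sketched here under the assumption that the controller knows whether the launcher is active and how many workers are ready. The function and parameter names are hypothetical.

```python
def is_running(launcher_active, ready_workers, worker_replicas):
    """Report a job as running only when the launcher is active
    and every requested worker is ready."""
    return bool(launcher_active) and ready_workers == worker_replicas
```

Under this rule, a job with an active launcher but only 3 of 4 workers ready would not yet be marked as running.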

Documentation

v0.2.2

4 years ago

  • Added default resource requirements for init container
  • Merged multiple deployment configuration files into a single YAML file
  • Switched to use JobStatus from kubeflow/common
  • Launcher and workers are now created together

v0.2.1

4 years ago

  • Switched Dockerfiles and examples to use the v1alpha2 MPI Operator.

v0.2.0

4 years ago

API Changes

  • Add v1alpha2 version of the MPI Operator, with an API spec more consistent with other Kubeflow operators
  • Support ActiveDeadlineSeconds in MPIJobSpec
  • Support custom resource types other than GPUs
  • Remove launcherOnMaster field

Enhancements

  • Support gang scheduling
  • Add StartTime and CompletionTime in job status
  • Add leader election
  • Switch to use pod group for gang scheduling
  • Add example on Apache MXNet using v1alpha1 version of the MPI Operator

v0.1.0

5 years ago

Initial release of the MPI Operator.