Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
Thank you to all the contributors (in no particular order): @sheevy @alculquicondor @terrytangyuan @tenzen-y @kuizhiqing @lowang-bh @vsoch @emsixteeen @wang-mask @benash @yeahdongcn @xhejtman @pheianox @lianghao208
Special thanks to @tenzen-y for multiple contributions.
Thank you to all the contributors (in no particular order): @mimowo @adilhusain-s @davidLif @ArangoGutierrez @shaowei-su @ggaaooppeenngg @pugangxa @HeGaoYuan @Dimss @alculquicondor @terrytangyuan
runPolicy (ttlSecondsAfterFinish, activeDeadlineSeconds, backoffLimit) by using a k8s Job for the launcher.
JobStatus from kubeflow/common
v1alpha2 MPI Operator.
ActiveDeadlineSeconds in MPIJobSpec
launcherOnMaster field
StartTime and CompletionTime in job status
Initial release of the MPI Operator.