Tf Operator Versions

Distributed ML Training and Fine-Tuning on Kubernetes

v1.8.0-rc.0

3 weeks ago

New features

Bug fixes

Misc

v1.7.0

6 months ago

Breaking Changes

New features

Bug fixes

  • Fix a bug where XGBoostJob's Running condition isn't updated when the job is resumed #1866 (tenzen-y)
  • Set a Running condition when the XGBoostJob is completed and doesn't have a Running condition #1789 (tenzen-y)
  • Avoid depending on the local env when installing the code-generators #1810 (tenzen-y)

Misc

v1.7.0-rc.0

9 months ago

Breaking Changes

New features

Bug fixes

  • Fix a bug where XGBoostJob's Running condition isn't updated when the job is resumed #1866 (tenzen-y)
  • Set a Running condition when the XGBoostJob is completed and doesn't have a Running condition #1789 (tenzen-y)
  • Avoid depending on the local env when installing the code-generators #1810 (tenzen-y)

Misc

v1.6.0

1 year ago

Note: Because scheduler-plugins changed its API group from sigs.k8s.io to x-k8s.io, future releases of the training operator (v1.7+) will not support scheduler-plugins v0.24.x or lower. Related: #1773

Note: The latest Python SDK (version 1.6) does not support earlier training operator versions. The minimum required training operator version is v1.6.0. Related: #1702
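
The two compatibility notes above can be turned into a quick pre-upgrade check. The following is a minimal sketch, assuming version strings of the form `vMAJOR.MINOR.PATCH` with an optional pre-release suffix. The helper names `parse_version`, `sdk_supported`, and `scheduler_plugins_supported` are hypothetical and are not part of the training operator or its SDK. Pre-release suffixes such as `-rc.0` are ignored, so a release candidate compares equal to its final release here.

```python
import re

# Matches 'v1.6.0', '1.7.0-rc.0', or a series such as 'v0.24.x'.
_VERSION_RE = re.compile(r"^v?(\d+)\.(\d+)(?:\.(\d+))?")


def parse_version(text):
    """Return (major, minor, patch); a missing or non-numeric patch becomes 0."""
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized version: {text!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def sdk_supported(operator_version):
    """Python SDK 1.6 requires training operator v1.6.0 or newer (see #1702)."""
    return parse_version(operator_version) >= (1, 6, 0)


def scheduler_plugins_supported(operator_version, plugins_version):
    """Training operator v1.7+ does not support scheduler-plugins v0.24.x or lower (see #1773)."""
    if parse_version(operator_version)[:2] < (1, 7):
        # Assumption: the notes state no restriction for operators before v1.7.
        return True
    return parse_version(plugins_version)[:2] > (0, 24)
```

For example, `scheduler_plugins_supported("v1.7.0", "v0.24.9")` evaluates to `False`, while the same plugin version paired with `"v1.6.0"` passes the check.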

New Features

Bug fixes

Misc

Closed issues:

  • The default value for CleanPodPolicy is inconsistent. #1753
  • HPA support for PyTorch Elastic #1751
  • Bug: allowance of non DNS-1035 compliant PyTorchJob names results in service creation failures and missing state #1745
  • paddle-operator can not get podgroup status(inqueue) with volcano when enable gang #1729
  • *job API(master) cannot compatible with old job #1725
  • Support coscheduling plugin #1722
  • Number of worker threads used by the controller can't be configured #1706
  • Conformance: Training tests #1698
  • PyTorch and MPI Operator pulls hardcoded initContainer #1696
  • PaddlePaddle Training: why can't find pods #1694
  • Training-operator pod CrashLoopBackOff in K8s v1.23.6 with kubeflow1.6.1 #1693
  • [SDK] Create unify client for all Training Job types #1691
  • Support Kubernetes v1.25 #1682
  • panic happened when add podgroup watch #1679
  • OnDependentUpdateFunc for Job will panic when enable volcano scheduler #1678
  • There is no clusterrole of "MPI Jobs" in kubeflow 1.5. #1670
  • Change Kubernetes version for test #1665
  • Support for multiplatform container image (amd64 and arm64) #1664
  • Training Operator pod failed to start on OCP 4.10.30 with error "memory limit too low" #1661
  • After setting hostNetwork to true, mpi does not work #1657
  • What is the purpose of /examples/pytorch/elastic/etcd.yaml #1655
  • When will MPIJob support v2beta1 version? #1653
  • Kubernetes HPA doesn't work with elastic PytorchJob #1645
  • training-operator can not get podgroup status(inqueue) with volcano when enable gang #1630
  • Training operator fails to create HPA for TorchElastic jobs #1626
  • Release v1.5.0 tracking #1622
  • upgrade client-go #1599
  • training-operator may need to monitor PodGroup #1574
  • Error: invalid memory address or nil pointer dereference #1553
  • The pytorchJob training is slow #1532
  • pytorch elastic scheduler error #1504
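
Issue #1745 in the list above concerns PyTorchJob names that are not DNS-1035 compliant. A name can be checked on the client side before a job is submitted. This is a minimal sketch, not the operator's own validation: the function name `is_dns1035_label` is hypothetical, and the rule encoded is the general DNS-1035 label format (lowercase alphanumerics or `-`, starting with a letter, ending with an alphanumeric, at most 63 characters).

```python
import re

# DNS-1035 label: starts with a lowercase letter, ends with a lowercase
# letter or digit, and contains only lowercase letters, digits, or '-'.
_DNS1035_LABEL = re.compile(r"[a-z]([-a-z0-9]*[a-z0-9])?")
_MAX_LABEL_LENGTH = 63


def is_dns1035_label(name):
    """Return True if `name` is a valid DNS-1035 label."""
    if len(name) > _MAX_LABEL_LENGTH:
        return False
    return _DNS1035_LABEL.fullmatch(name) is not None
```

A name such as `pytorch-dist-mnist` passes, while names that start with a digit, contain uppercase letters or underscores, or end with `-` are rejected.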

v1.6.0-rc.1

1 year ago

Note: Because scheduler-plugins changed its API group from sigs.k8s.io to x-k8s.io, future releases of the training operator (v1.7+) will not support scheduler-plugins v0.24.x or lower.

Merged pull requests:

Closed issues:

  • The default value for CleanPodPolicy is inconsistent. #1753
  • HPA support for PyTorch Elastic #1751
  • Bug: allowance of non DNS-1035 compliant PyTorchJob names results in service creation failures and missing state #1745
  • paddle-operator can not get podgroup status(inqueue) with volcano when enable gang #1729
  • *job API(master) cannot compatible with old job #1725
  • Support coscheduling plugin #1722
  • Number of worker threads used by the controller can't be configured #1706
  • Conformance: Training tests #1698
  • PyTorch and MPI Operator pulls hardcoded initContainer #1696
  • PaddlePaddle Training: why can't find pods #1694
  • Training-operator pod CrashLoopBackOff in K8s v1.23.6 with kubeflow1.6.1 #1693
  • [SDK] Create unify client for all Training Job types #1691
  • Support Kubernetes v1.25 #1682
  • panic happened when add podgroup watch #1679
  • OnDependentUpdateFunc for Job will panic when enable volcano scheduler #1678
  • There is no clusterrole of "MPI Jobs" in kubeflow 1.5. #1670
  • Change Kubernetes version for test #1665
  • Support for multiplatform container image (amd64 and arm64) #1664
  • Training Operator pod failed to start on OCP 4.10.30 with error "memory limit too low" #1661
  • After setting hostNetwork to true, mpi does not work #1657
  • What is the purpose of /examples/pytorch/elastic/etcd.yaml #1655
  • When will MPIJob support v2beta1 version? #1653
  • Kubernetes HPA doesn't work with elastic PytorchJob #1645
  • training-operator can not get podgroup status(inqueue) with volcano when enable gang #1630
  • Training operator fails to create HPA for TorchElastic jobs #1626
  • Release v1.5.0 tracking #1622
  • upgrade client-go #1599
  • training-operator may need to monitor PodGroup #1574
  • Error: invalid memory address or nil pointer dereference #1553
  • The pytorchJob training is slow #1532
  • pytorch elastic scheduler error #1504

v1.6.0-rc.0

1 year ago

v1.6.0-rc.0 release

v1.5.0

1 year ago

Full Changelog

New Features

Bug Fixes

Misc

v1.5.0-rc.0

1 year ago

Full Changelog

Closed issues:

  • MPIJob worker still running when NotEnoughResources with enable-gang-scheduling==true? #1617
  • unable to fetch TFJob when I use client.go run tfjob #1612
  • Pytorchjob dist-mnist no training logs #1601
  • kubectl get tfjob -o yaml, but not status output #1598
  • missing image in tf_job_design_doc.md #1590
  • Labels in Python client are out of date #1587
  • PyTorchJob Pods "Not Ready" After Completing Training #1577
  • cannot use "github.com/go-openapi/spec".Schema{...} (type "github.com/go-openapi/spec".Schema) as type "k8s.io/kube-openapi/pkg/validation/spec".Schema in field value #1576
  • PyTorchJob: OnFailure Policy won't handle pod failure gracefully #1570
  • pytorchjob doesn't have status.startTIme. #1566
  • Optional-test-infra Deprecation Notice - Training #1561
  • Should we update MPIJob unit test CleanPodPolicy field? #1555
  • --enable-gang-scheduling=true doesn't work for MPIJob #1548
  • PyTorchJob fails when creating a task with a different namespace but the same name #1543
  • Reconcile PyTorchJob error: PyTorchJob.status.replicaStatuses: Invalid value: "null" after enable-gang-scheduling #1538
  • Job TTLs not working #1533
  • Support PodGroup in scheduler-plugins/coscheduling #1518
  • support elastic training #1515
  • Modified the configuration of RootLogger #1514
  • Add checking import order in CI #1510
  • Scale down of pytorchJob cause workers pod to restart #1509
  • Support label selector based success/failure conditions #1507
  • [feat] Support SuccessPolicy in PyTorchJob #1505
  • pytorch elastic scheduler error #1504
  • Could you add the example of MPIJob in this repository #1502
  • [Feature] Create a Informer/ClientSet for PyTorch Jobs #1499
  • [feature] Make init container injection logic available to all jobs #1498
  • Roadmaps for 1.4 release #1496
  • [bug] (MpiJob)Init container KubectlDeliveryImage should remain the ability that it can be specified from container parameters or environment variables. #1494
  • Reconcile PyTorch Job error Operation cannot be fulfilled on pytrochjobs.kubeflow.org #1492
  • Python PytorchJob: no attribute openapi_types for example code #1481
  • PyTorch DistributedDataParallel training with multi nodes #1475
  • Installing kubeflow-training breaks import for other kubeflow packages (katib, fairing, etc.) #1471
  • Deprecate ksonnet and use python/golang to submit jobs #1468
  • Help Wanted in ParameterServerStrategy Example. #1459
  • Bug: SomeTimes Coredumped using tfjob #1456
  • [question] PyTorchJob MNIST example training speed #1454
  • tfjob status not match when EnableDynamicWorker set true #1452
  • training-operator set scheduler error #1447
  • [sdk]: Replace TableLogger component in the SDK for better support with ipykernel>=6.x #1446
  • SDK: wait_for_job reports typeError #1445
  • Update prometheus monitoring doc #1443
  • Master branch should provide a nightly image #1433
  • Clean up test folder before testing #1429
  • Clean up TF specific docs #1424
  • [feature] Support SchedulingPolicy in PyTorchJob #1414
  • Hyperlinks in the "Overview" section is incorrect/not found #1411
  • add workqueue metric #1407
  • Validation fails for MXJob Tune example #1402
  • Rate exceeded for aws ecr image #1400
  • change layout to follow the standard of kubebuilder? #1397
  • [example] kubeflow/tf-dist-mnist-test:1.0 is missing in v1.2-branch examples/v1/dist-mnist #1393
  • Update kubeflow/website for 1.4 release #1392
  • Cut beta release of tf-operator for 1.4 release #1385
  • "invalid memory address or nil pointer dereference" #1382
  • some questions about job sync #1379
  • Provides a default Grafana dashboard #1376
  • [feature] Support different PS/worker types #1369
  • Need to copy all (mainly pytorch) framework's example dir to tf-operator/examples #1366
  • Add more CRD validations markers to block invalid job on client apply #1363
  • Update presubmit and post submit job triggers #1354
  • Optimize post submit jobs flow #1353
  • Enable leader election in controller manager using controllermanagerconfig #1350
  • Support mpi jobs in universal operator #1345
  • post-submit job failure in master branch #1343
  • Improve observability of universal operator #1340
  • Best practice to organize main.go and Dockerfile? #1333
  • Should training operator keep clientset in the same repository? #1332
  • Test image has incorrect tag? #1329
  • Prepare e2e tests for all frameworks #1323
  • Reduce e2e replica-restart-policy-tests running time #1319
  • Improve logs structure by consolidating libs from controller runtime and controllers #1313
  • Enable tests for all frameworks #1311
  • [bug] The pod will be recreated until the expectation expires #1306
  • Upgrade CRDs to apiextensions.k8s.io/v1 #1304
  • Add role details as new columns to kubectl get jobs output for CRD. #1301
  • How to handle long pending pods in a TF-job? #1282
  • Could you release a new version of Python SDK #1279
  • Update swagger.json schema for TFJobSpec to include RunPolicy #1278
  • Not able to pass environment variable from tfjob to pod #1273
  • v1_time.py is not generated by hack/python-sdk/gen-sdk.sh #1271
  • Add a step to upload artifact #1258
  • [feature] Support multi port in TFJob #1251
  • [feat] Add scale subresource #1220
  • Pod get re-created after it exited and get garbage collected #1186
  • Clean up vendor dependencies #1162

Merged pull requests:

v1.4.0

2 years ago

Full Changelog

Merged pull requests:

Closed issues:

  • Question: What is the recommended way for Data Scientists to run a distributed training job #1535
  • Restore KUBEFLOW_NAMESPACE options #1522
  • Improve test coverage #1497
  • swagger.json missing Pytorchjob.Spec.ElasticPolicy #1483
  • [bug] Missing init container in PyTorchJob #1482
  • PytorchJob DDP training will stop if I delete a worker pod #1478
  • Write down e2e failure debug process #1467
  • How can i add the Priorityclass to the TFjob? #1466
  • github.com/go-logr/zapr.(*zapLogger).Error #1444
  • Display coverage % in GitHub actions list #1442
  • Add Go test to CI #1436
  • Podgroup is constantly created and deleted after tfjob is success or failure #1426
  • Cut official release of 1.3.0 #1425
  • Add "not maintained" notice to other operator repos #1423
  • Fail to install tf-operator in minikube because of the version of kubectl/kustomize #1381
  • Python SDK for Kubeflow Training Operator #1380
  • Rename this repo #1348
  • Universal Operator Phase III: Graduate operator to production grade #1318

v1.4.0-rc.0

2 years ago

Full Changelog

Features and improvements:

  • Display coverage % in GitHub actions list #1442
  • Add Go test to CI #1436

Fixed bugs:

  • [bug] Missing init container in PyTorchJob #1482
  • Fail to install tf-operator in minikube because of the version of kubectl/kustomize #1381

Closed issues:

  • Restore KUBEFLOW_NAMESPACE options #1522
  • Improve test coverage #1497
  • swagger.json missing Pytorchjob.Spec.ElasticPolicy #1483
  • PytorchJob DDP training will stop if I delete a worker pod #1478
  • Write down e2e failure debug process #1467
  • How can i add the Priorityclass to the TFjob? #1466
  • github.com/go-logr/zapr.(*zapLogger).Error #1444
  • Podgroup is constantly created and deleted after tfjob is success or failure #1426
  • Cut official release of 1.3.0 #1425
  • Add "not maintained" notice to other operator repos #1423
  • Python SDK for Kubeflow Training Operator #1380

Merged pull requests: