KubeRay Versions

A toolkit to run Ray applications on Kubernetes

v1.1.0

1 month ago

Highlights

  • RayJob improvements

  • Structured logging

    • In KubeRay v1.1.0, KubeRay logs are emitted in JSON format, and each log message includes context information such as the custom resource’s name and reconcileID. Users can therefore select the logs associated with a specific RayCluster, RayJob, or RayService CR by filtering on its name.
  • RayService improvements

    • Refactored the health check mechanism to improve stability.
    • Deprecated deploymentUnhealthySecondThreshold and serviceUnhealthySecondThreshold to avoid unintentionally preparing a new RayCluster custom resource.
  • TPU multi-host PodSlice support

    • The KubeRay team is actively working with the Google GKE and TPU teams on integration. The required changes in KubeRay have already been completed. The GKE team will complete some tasks on their side this week or next. Then, users should be able to use multi-host TPU PodSlice with a static RayCluster (without autoscaling).
  • Stop publishing images on DockerHub; instead, we will only publish on Quay.
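As an illustration of the structured logging highlight above, here is a minimal sketch of selecting the log lines that belong to one custom resource. The field layout is an assumption made for this example: each JSON log line is assumed to carry a key named after the CR kind (for example `RayCluster`) holding an object with a `name` field, alongside `reconcileID`. Check the actual operator output before relying on these names; `filter_logs_by_cr` and the sample lines are hypothetical.

```python
import json


def filter_logs_by_cr(lines, cr_kind, cr_name):
    """Return parsed JSON log records whose context names the given CR.

    Assumption: a record stores its CR context under a key equal to the
    CR kind, e.g. {"RayCluster": {"name": "...", "namespace": "..."}}.
    """
    matches = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Skip anything that is not a JSON log line.
            continue
        if not isinstance(record, dict):
            continue
        context = record.get(cr_kind)
        if isinstance(context, dict) and context.get("name") == cr_name:
            matches.append(record)
    return matches


# Hypothetical sample lines; the keys are illustrative, not confirmed output.
SAMPLE_LINES = [
    '{"level":"info","msg":"reconciling","RayCluster":{"name":"cluster-a","namespace":"default"},"reconcileID":"id-1"}',
    '{"level":"info","msg":"reconciling","RayCluster":{"name":"cluster-b","namespace":"default"},"reconcileID":"id-2"}',
    '{"level":"info","msg":"reconciling","RayJob":{"name":"cluster-a","namespace":"default"},"reconcileID":"id-3"}',
    'plain text line that is not JSON',
]

selected = filter_logs_by_cr(SAMPLE_LINES, "RayCluster", "cluster-a")
```

In practice the input lines would come from the operator Pod's log stream; matching on both the kind and the name avoids mixing up a RayCluster and a RayJob that share a name.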

RayJob

RayJob state machine refactor

  • [RayJob][Status][1/n] Redefine the definition of JobDeploymentStatusComplete (#1719, @kevin85421)
  • [RayJob][Status][2/n] Redefine ready for RayCluster to avoid using HTTP requests to check dashboard status (#1733, @kevin85421)
  • [RayJob][Status][3/n] Define JobDeploymentStatusInitializing (#1737, @kevin85421)
  • [RayJob][Status][4/n] Remove some JobDeploymentStatus and updateState function calls (#1743, @kevin85421)
  • [RayJob][Status][5/n] Refactor getOrCreateK8sJob (#1750, @kevin85421)
  • [RayJob][Status][6/n] Redefine JobDeploymentStatusComplete and clean up K8s Job after TTL (#1762, @kevin85421)
  • [RayJob][Status][7/n] Define JobDeploymentStatusNew explicitly (#1772, @kevin85421)
  • [RayJob][Status][8/n] Only a RayJob with the status Running can transition to Complete at this moment (#1774, @kevin85421)
  • [RayJob][Status][9/n] RayJob should not pass any changes to RayCluster (#1776, @kevin85421)
  • [RayJob][10/n] Add finalizer to the RayJob when the RayJob status is JobDeploymentStatusNew (#1780, @kevin85421)
  • [RayJob][Status][11/n] Refactor the suspend operation (#1782, @kevin85421)
  • [RayJob][Status][12/n] Resume suspended RayJob (#1783, @kevin85421)
  • [RayJob][Status][13/n] Make suspend operation atomic by introducing the new status Suspending (#1798, @kevin85421)
  • [RayJob][Status][14/n] Decouple the Initializing status and Running status (#1801, @kevin85421)
  • [RayJob][Status][15/n] Unify the codepath for the status transition to Suspended (#1805, @kevin85421)
  • [RayJob][Status][16/n] Refactor Running status (#1807, @kevin85421)
  • [RayJob][Status][17/n] Unify the codepath for status updates (#1814, @kevin85421)
  • [RayJob][Status][18/n] Control the entire lifecycle of the Kubernetes submitter Job using KubeRay (#1831, @kevin85421)
  • [RayJob][Status][19/n] Transition to Complete if the K8s Job fails (#1833, @kevin85421)
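The status names in the refactor series above suggest a small state machine. The sketch below is one simplified reading of those PR titles, not the controller's actual code: the status names come from the list, but the transition table is an assumption for illustration (in particular the edges out of New, Initializing, and Suspended), and later additions such as JobDeploymentStatusFailed are left out.

```python
from enum import Enum


class JobDeploymentStatus(Enum):
    NEW = "New"
    INITIALIZING = "Initializing"
    RUNNING = "Running"
    SUSPENDING = "Suspending"
    SUSPENDED = "Suspended"
    COMPLETE = "Complete"


S = JobDeploymentStatus

# Assumed transition table, inferred from the PR titles only.
ALLOWED_TRANSITIONS = {
    S.NEW: {S.INITIALIZING},
    S.INITIALIZING: {S.RUNNING, S.SUSPENDING},
    # [8/n]: only a Running RayJob can transition to Complete.
    S.RUNNING: {S.COMPLETE, S.SUSPENDING},
    # [13/n]: Suspending is an intermediate status that makes suspend atomic.
    S.SUSPENDING: {S.SUSPENDED},
    # [12/n]: a suspended RayJob can be resumed (assumed to restart from New).
    S.SUSPENDED: {S.NEW},
    S.COMPLETE: set(),
}


def can_transition(src, dst):
    """Return True if the assumed table allows moving from src to dst."""
    return dst in ALLOWED_TRANSITIONS[src]
```

Keeping the transitions in a single table mirrors the goal stated in [17/n] of unifying the codepath for status updates: every status change can be checked in one place.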

Others

  • [Refactor] Remove global utils.GetRayXXXClientFuncs (#1727, @rueian)
  • [Feature] Warn Users When Updating the RayClusterSpec in RayJob CR (#1778, @Yicheng-Lu-llll)
  • Add apply configurations to generated client (#1818, @astefanutti)
  • RayJob: inject RAY_DASHBOARD_ADDRESS environment variable for user provided submitter templates (#1852, @andrewsykim)
  • [Bug] Submitter K8s Job fails even though the RayJob has a JobDeploymentStatus Complete and a JobStatus SUCCEEDED (#1919, @kevin85421)
  • add toleration for GPUs in sample pytorch RayJob (#1914, @andrewsykim)
  • Add a sample RayJob to fine-tune a PyTorch lightning text classifier with Ray Data (#1891, @andrewsykim)
  • rayjob controller: refactor environment variable check in unit tests (#1870, @andrewsykim)
  • RayJob: don't delete submitter job when ShutdownAfterJobFinishes=true (#1881, @andrewsykim)
  • rayjob controller: update EndTime to always be the time when the job deployment transitions to Complete status (#1872, @andrewsykim)
  • chore: remove ConfigMap from ray-job.kueue-toy-sample.yaml (#1976, @kevin85421)
  • [Kueue] Add a sample YAML for Kueue toy sample (#1956, @kevin85421)
  • [RayJob] Support ActiveDeadlineSeconds (#1933, @kevin85421)
  • [Feature][RayJob] Support light-weight job submission (#1893, @kevin85421)
  • [RayJob] Add JobDeploymentStatusFailed Status and Reason Field to Enhance Observability for Flyte/RayJob Integration (#1942, @Yicheng-Lu-llll)
  • [RayJob] Refactor Rayjob E2E Tests to Use Server-Side Apply (#1927, @Yicheng-Lu-llll)
  • [RayJob] Rewrite RayJob envtest (#1916, @kevin85421)
  • [Chore][RayJob] Remove the TODO of verifying the schema of RayJobInfo because it is already correct (#1911, @rueian)
  • [RayJob] Set missing CPU limit (#1899, @kevin85421)
  • [RayJob] Set the timeout of the HTTP client from 2 mins to 2 seconds (#1910, @kevin85421)
  • [Feature][RayJob] Support light-weight job submission with entrypoint_num_cpus, entrypoint_num_gpus and entrypoint_resources (#1904, @rueian)
  • [RayJob] Improve dashboard client log (#1903, @kevin85421)
  • [RayJob] Validate whether runtimeEnvYAML is a valid YAML string (#1898, @kevin85421)
  • [RayJob] Add additional print columns for RayJob (#1895, @andrewsykim)
  • [Test][RayJob] Transition to Complete if the JobStatus is STOPPED (#1871, @kevin85421)
  • [RayJob] Inject RAY_SUBMISSION_ID env variable for user provided submitter template (#1868, @kevin85421)
  • [RayJob] Transition to Complete if the JobStatus is STOPPED (#1855, @kevin85421)
  • [RayJob][Kueue] Move limitation check to validateRayJobSpec (#1854, @kevin85421)
  • [RayJob] Validate RayJob spec (#1813, @kevin85421)
  • [Test][RayJob] Kueue happy-path scenario (#1809, @kevin85421)
  • [RayJob] Delete the Kubernetes Job and its Pods immediately when suspending (#1791, @rueian)
  • [Feature][RayJob] Remove the deprecated RuntimeEnv from CRD. Use RuntimeEnvYAML instead. (#1792, @rueian)
  • [Bug][RayJob] Avoid nil pointer dereference (#1756, @kevin85421)
  • [RayJob]: Add RayJob with RayCluster spec e2e test (#1636, @astefanutti)

Logging

  • Support json structured logging (#1912, @andrewsykim)
  • [Structure Logging][1/n] Make the format of the controller name consistent (#1938, @kevin85421)
  • [Structure Logging][2/n] Add context to each log message (#1945, @kevin85421)
  • [structure logging][3/n] Remove verbosity (#1953, @kevin85421)
  • [Refactor][1/n] Replace logrus with logr to keep logging consistent (#1835, @rueian)
  • [Refactor] Remove any unnecessary logger (#1894, @kevin85421)

RayService

Health-check mechanism refactor

  • [RayService][Health-Check][1/n] Offload the health check responsibilities to K8s and RayCluster (#1656, @kevin85421)
  • [RayService][Health-Check][2/n] Remove the hotfix to prevent unnecessary HTTP requests (#1658, @kevin85421)
  • [RayService][Health-Check][3/n] Update the definition of HealthLastUpdateTime for DashboardStatus (#1659, @kevin85421)
  • [RayService][Health-Check][4/n] Remove the health check for Ray Serve applications. (#1660, @kevin85421)
  • [RayService][Health-Check][5/n] Remove unused variable deploymentUnhealthySecondThreshold (#1664, @kevin85421)
  • [RayService][Health-Check][6/n] Remove ServiceUnhealthySecondThreshold (#1665, @kevin85421)
  • [RayService][Health-Check][7/n] Remove LastUpdateTime from multiple places (#1666, @kevin85421)
  • [RayService][Health-Check][8/n] Add readiness / liveness probes (#1674, @kevin85421)

Others

  • [Refactor] Define the value type of the concurrent map explicitly to avoid type conversion (#1789, @kevin85421)
  • [Refactor] Rename EnableAgentService to EnableServeService (#1673, @kevin85421)
  • [Refactor][RayService] Use ServeServiceNameForRayService to get the k8s svc name for a RayService (#1931, @rueian)
  • [RayService] Refactor to Rely More on RayService Status in RayService E2E Tests (#1928, @Yicheng-Lu-llll)
  • [RayService] Add New Status: NumServeEndpoints (#1901, @Yicheng-Lu-llll)
  • [RayService] Avoid Duplicate Serve Service (#1867, @Yicheng-Lu-llll)
  • [RayService][Bug] Serve Service May Select Pods That Are Actually Unready for Serving Traffic (#1856, @Yicheng-Lu-llll)
  • [RayService] Deprecate the built-in ingress support of RayService (#1843, @kevin85421)
  • [RayService][Status][1/n] Remove DashboardStatus (#1839, @kevin85421)
  • [RayService][Hotfix] Hotfix for Flaky Zero Downtime Rollout Test (#1837, @Yicheng-Lu-llll)
  • [RayService][Status][2/n] Remove WaitForDashboard (#1840, @kevin85421)
  • [RayService][HA] Fix flaky tests (#1823, @kevin85421)
  • [RayService] Move HTTP Proxy's Health Check to Readiness Probe for Workers (#1808, @Yicheng-Lu-llll)
  • [RayService] Fixed issue where the custom serve port is not reflected in the serve health check for worker Pods (#1816, @Yicheng-Lu-llll)
  • [RayService] Remove everything related to Ray Serve V1 API (#1790, @kevin85421)
  • [RayService] Unify multi-app and single-app codepath (#1787, @architkulkarni)
  • [RayService] Remove serve v1 API (#1779, @architkulkarni)
  • [RayService] Allow updating WorkerGroupSpecs without rolling out new cluster (#1734, @architkulkarni)
  • [RayService] Use DashboardPort for RayService instead of DashboardAgentPort (#1742, @architkulkarni)
  • [rayservice] Remove dagdriver from ray_v1alpha1_rayservice.yaml (#1649, @zcin)
  • Fix Log to indicate we are Using DashboardPort in RayService (#2001, @Yicheng-Lu-llll)
  • [RayService] fix kubebuilder printcolumn annotations for RayService (#1981, @andrewsykim)
  • [RayService] Address Recent Flakiness in RayService Zero Downtime Rollout Test (#1979, @Yicheng-Lu-llll)

RayCluster

  • [GCS FT] Enhance observability of redis cleanup job (#1709, @evalaiyc98)
  • [Feature] Support for overwriting the generated ray start command with a user-specified container command (#1704, @kevin85421)
  • Support suspension of RayClusters (#1711, @andrewsykim)
  • fix: validate RayCluster name with validating webhook (#1732, @davidxia)
  • [Hotfix][Bug] suspend is not a stateless operation (#1741, @kevin85421)
  • chore: remove HeadGroupSpec.Replicas from raycluster_types.go (#1589, @davidxia)
  • chore: remove all deprecated HeadGroupSpec.replicas (#1588, @davidxia)
  • Add volcano taskSpec annotations to pod (#1754, @Tongruizhe)
  • [Nit] Remove redundant code snippet (#1810, @evalaiyc98)
  • [Chore] Improve the appearance of compute resources status in the output of kubectl describe (#1802, @kevin85421)
  • [Refactor][GCS FT] Use DeleteAllOf to delete cluster pods before cleaning up redis (#1785, @rueian)
  • [Feature][GCS FT] Best-effort redis cleanup job (#1766, @rueian)
  • feat: show RayCluster's total resources (#1748, @davidxia)
  • [Feature] Adding RAY_CLOUD_INSTANCE_ID as unique id for Ray node (#1759, @kevin85421)
  • [Refactor] Use RAYCLUSTER_DEFAULT_REQUEUE_SECONDS_ENV as timeout of status check in tests (#1755, @rueian)
  • Check existing pods for suspended RayCluster before calling DeleteCollection (#1745, @andrewsykim)
  • [Refactor][RayCluster] Replace RayClusterReconciler.Log with LogConstructor (#1952, @rueian)
  • ray-operator: disallow pod creation in namespaces outside of RayCluster namespace (#1951, @andrewsykim)
  • [Bug][GCS FT] Clean up the Redis key before the head Pod is deleted (#1989, @kevin85421)
  • [Refactor][RayCluster] Make ray.io/group=headgroup be constant (#1970, @rueian)
  • [Feature][autoscaler v2] Set RAY_NODE_TYPE_NAME when starting ray node (#1973, @kevin85421)
  • [Refactor][RayCluster] RayClusterHeadPodsAssociationOptions and RayClusterWorkerPodsAssociationOptions (#2023, @rueian)

Helm charts

  • introduce batch.jobs rules for multiple namespace role (#1707, @riccardomc)
  • Add common containerEnv section to Helm Chart (#1932, @chainlink)
  • Update securityContext values.yaml for kuberay-operator to safe defaults. (#1896, @vinayakankugoyal)
  • RayCluster Helm: Make volumeMounts and volumes optional for workers (#1689, @calizarr)
  • Exposing min/max replica counts for default worker group (#1963, @sercanCyberVision)
  • [Fix][Helm chart] Move service.headService -> head.headService in values.yaml (#1998, @jjaniec)
  • Add seccompProfile.type=RuntimeDefault to kuberay-operator. (#1955, @vinayakankugoyal)
  • [Bug] Reconciler error when changing the value of nameOverride in values.yaml of helm installation for Ray Cluster (#1966, @chrisxstyles)

TPU

  • Add NumOfHosts to WorkerGroupSpec (CRD change only) (#1834, @richardsliu)
  • [Refactor][Multi-host] Create a function to associate RayCluster and the headless svc (#1948, @kevin85421)
  • TPU Multi-Host Support (#1913, @ryanaoleary)
  • Build Headless Service for Multi-Host TPU Worker Pods (#1920, @ryanaoleary)
  • [TPU] Add envtests for multi-host (#1950, @kevin85421)
  • Add NumOfHosts to RayCluster helm-chart template (#1969, @ryanaoleary)
  • Add v4 TPU manifests samples (#1968, @richardsliu)
  • Add missing labels on RayCluster TPU manifests (#1987, @richardsliu)

KubeRay API Server

  • removed serve v1 support (#1825, @blublinsky)
  • Fixing Python client handling of env from (#1845, @blublinsky)
  • Enhancements to e2e test, adding Autoscaling (#1765, @blublinsky)
  • added support for secure API server build (#1749, @blublinsky)
  • add autoscaler support (#1699, @blublinsky)
  • fixed JobSubmission API (#1717, @blublinsky)
  • fixed some bugs in e2e-test (#1682, @blublinsky)
  • Increased time precision using uint (#1675, @blublinsky)
  • Fixed the issue with jobSubmitter resources (#1676, @blublinsky)
  • Adding capability to create ray cluster with serve support -clean (#1672, @blublinsky)
  • Added Job submission support to the API server (#1639, @blublinsky)
  • Add end to end tests to apiserver (#1460, @z103cb)
  • Flip Min and max replicas for apiserver workerNodeSpec (#1638, @tedhtchang)
  • Added security to the API server (#1677, @blublinsky)

CI

  • Clean up WorkersToDelete field during the CI test (#1763, @Yicheng-Lu-llll)
  • [Bug] Clean up WorkersToDelete after the scaling process finishes (#1747, @kevin85421)
  • Upgrade dependencies to address CVEs (#1865, @ChristianZaccaria)
  • run ./hack/update-codegen.sh in generate make target (#1848, @andrewsykim)
  • fix applyconfiguration generated code (#1847, @andrewsykim)
  • Improve flexibility in RayCluster yaml test (#1812, @evalaiyc98)
  • Bump tj-actions/verify-changed-files from 11.1 to 17 in /.github/workflows (#1795, @dependabot[bot])
  • Only build/push Multi Arch images when merging to master (#1764, @Yicheng-Lu-llll)
  • Upgrade to address High CVEs (#1731, @ChristianZaccaria)
  • Publish Multi Arch images (#1716, @tedhtchang)
  • [test] Upgrade envtest to latest version (#1720, @astefanutti)
  • Bump golang.org/x/net from 0.14.0 to 0.17.0 in /experimental (#1701, @dependabot[bot])
  • Upgrade Kubernetes dependencies to v0.28.3 and Golang to 1.20 (#1648, @astefanutti)
  • chore: mark generated files as such (#1663, @davidxia)
  • Update kind version. (#1957, @vinayakankugoyal)
  • [Refactor] Rewrite RayCluster envtest (#1949, @kevin85421)
  • Make KubeRay Operator Image FIPS compliant (#1633, @anishasthana)
  • [CI] Fix image release pipeline (#1878, @kevin85421)
  • [CI] Do not load Ray into kind cluster (#1863, @architkulkarni)
  • [CI] stream operator logs from kind in go e2e tests (#1793, @rueian)
  • [CI] Fix variable initializations used in test case declarations (#1775, @rueian)
  • [CI] Stop to publish new images to DockerHub (#1702, @kevin85421)
  • [CI] Skip the flaky compatibility test test_detached_actor until https://github.com/ray-project/ray/issues/41343 (#1694, @rueian)
  • [CI]: Kuberay operator e2e tests (#1575, @astefanutti)
  • [CI] Don't need to publish the security proxy image (#1885, @kevin85421)
  • Remove generate target from build/test targets (#1874, @andrewsykim)
  • [CI] Fix apiserver test in image-release process (#1880, @kevin85421)
  • [CI] Stop publishing images to DockerHub (#1926, @Yicheng-Lu-llll)
  • [CI] Don't push new images to DockerHub (#1923, @kevin85421)
  • [CI] Use quay as the default image registry (#1939, @kevin85421)
  • [Refactor][envtest] Centralize all helpers in envtest for better DX (#1977, @rueian)
  • Use standard golang image as build image and distroless image as base image for kuberay operator. (#1967, @vinayakankugoyal)
  • [CI] Pin crd-ref-docs to v0.0.10 (#1988, @kevin85421)
  • ray-operator: parameterize Test_ShouldDeletePod (#2000, @MadhavJivrajani)
  • [Test][RayCluster] Test redis cleanup job in the e2e compatibility test (#2026, @rueian)
  • Bump google.golang.org/protobuf from 1.32.0 to 1.33.0 in /experimental (#1992, @dependabot[bot])
  • Bump google.golang.org/protobuf from 1.32.0 to 1.33.0 in /cli (#1993, @dependabot[bot])

Documentation

  • [Doc] Improve DEVELOPMENT.md by adding more guidance (#1794, @rueian)
  • [Ray 2.9.0 Release] Update Ray versions from 2.8.0 to 2.9.0 (#1770, @architkulkarni)
  • docs: add comment explaining util.go:calculatePodResource() (#1767, @davidxia)
  • Fix typo in DEVELOPMENT.md (#1698, @kevin85421)
  • chore: Update K8s compatibility (#1696, @kevin85421)
  • [Doc] Add deprecations to ServiceUnhealthySecondThreshold and DeploymentUnhealthySecondThreshold (#1688, @rueian)
  • [Doc] Add blogs and talks to readme (#1691, @architkulkarni)
  • Update feature-request.yml (#1907, @anyscalesam)
  • Update bug-report.yml (#1906, @anyscalesam)
  • [Doc] Support consistency check for API reference in CI (#1655, @rudeigerc)
  • [Doc] Support CRD docs generation (#1625, @rudeigerc)
  • [Doc] Update release docs (#1621, @kevin85421)
  • Post release 1.0.0 (#1651, @kevin85421)
  • Update CHANGELOG for v1.0.0 (#1650, @kevin85421)
  • [release][v1.1.0] Improve release doc and update KubeRay API server chart's repository (#1960, @kevin85421)
  • add best practices for ray cluster on ACK from Alibaba Cloud blog (#1985, @kadisi)

Others

  • Use a default user agent 'kuberay-operator' instead of the default user-agent from controller-runtime (#1982, @andrewsykim)
  • [Telemetry] KubeRay version and CRD (#2024, @kevin85421)
  • chore: improve coverage for util.go:CheckAllPodsRunning() (#1929, @davidxia)
  • Fixes to shorten generated Route name with consideration for namespace (#1883, @neilisaur)
  • [Bug] Fix rebase error (#1897, @kevin85421)
  • Refactor to Ensure Consistent Use of CRDType (#1892, @Yicheng-Lu-llll)
  • Fix versioning in sample manifests (#1857, @andrewsykim)
  • [Feature] Split ray.io/originated-from into ray.io/originated-from-cr-name and ray.io/originated-from-crd (#1864, @kevin85421)
  • Add ray.io/originated-from labels (#1830, @rueian)
  • Add structured config and default sidecar container configuration (#1822, @andrewsykim)
  • [CRD] Sync v1alpha1 CRD with v1 CRD (#1788, @kevin85421)
  • Revert "[CRD] Delete CRD v1alpha1 (#1771)" (#1784, @kevin85421)
  • [CRD] Delete CRD v1alpha1 (#1771, @kevin85421)
  • chore: add kuberay- name prefix to validating webhook Service (#1729, @davidxia)
  • chore webhook: change K8s annotations to use kuberay-operator (#1730, @davidxia)
  • [Refactor] Move constant.go from common to utils to avoid circular dependency (#1726, @kevin85421)
  • Update overwrite-container-cmd example (#1722, @kevin85421)
  • [Refactor] Standardize all k8s.io/api/core/v1 imports as corev1 (#1721, @rueian)
  • [Bug] Avoid assigning an entry to a map that is nil (#1715, @kevin85421)
  • [Feature] Override the block option of rayStartParams to true (#1718, @rueian)
  • Set imagePullPolicy in manager.yaml (#1710, @evalaiyc98)
  • fix operator: remove unused mutating and conversion webhook configs (#1705, @davidxia)
  • updated python client (#1700, @blublinsky)
  • Add flag leader-election-namespace (#1624, @chenk008)
  • feat: add all three CRDs to the all category (#1683, @davidxia)
  • chore: Remove the sanity check YAML for Quay (#1695, @kevin85421)
  • [Post Ray 2.8.0 Release] Update Ray versions to Ray 2.8.0 (#1678, @kevin85421)
  • Add validating webhook (#1584, @davidxia)

v1.0.0

6 months ago

KubeRay is officially in General Availability!

  • Bump the CRD version from v1alpha1 to v1.
  • Relocate almost all documentation to the Ray website.
  • Improve RayJob UX.
  • Improve GCS fault tolerance.

GCS fault tolerance

  • [GCS FT] Improve GCS FT cleanup UX (#1592, @kevin85421)
  • [Bug][RayCluster] Fix RAY_REDIS_ADDRESS parsing with redis scheme and… (#1556, @rueian)
  • [Bug] RayService with GCS FT HA issue (#1551, @kevin85421)
  • [Test][GCS FT] End-to-end test for cleanup_redis_storage (#1422)(#1459) (#1466, @rueian)
  • [Feature][GCS FT] Clean up Redis once a GCS FT-Enabled RayCluster is deleted (#1412, @kevin85421)
  • Update GCS fault tolerance YAML (#1404, @kevin85421)
  • [GCS FT] Consider the case of sidecar containers (#1386, @kevin85421)
  • [GCS FT] Give readiness / liveness probes good default values (#1364, @kevin85421)
  • [GCS FT][Refactor] Redefine the behavior for deleting Pods and stop listening to Kubernetes events (#1341, @kevin85421)

CRD versioning

  • [CRD] Inject CRD version to the Autoscaler sidecar container (#1496, @kevin85421)
  • [CRD][2/n] Update from CRD v1alpha1 to v1 (#1482, @kevin85421)
  • [CRD][1/n] Create v1 CRDs (#1481, @kevin85421)
  • [CRD] Set maxDescLen to 0 (#1449, @kevin85421)

RayService

  • [Hotfix][Bug] Avoid unnecessary zero-downtime upgrade (#1581, @kevin85421)
  • [Feature] Add an example for RayService high availability (#1566, @kevin85421)
  • [Feature] Add a flag to make zero downtime upgrades optional (#1564, @kevin85421)
  • [Bug][RayService] KubeRay does not recreate Serve applications if a head Pod without GCS FT recovers from a failure. (#1420, @kevin85421)
  • [Bug] Fix the filename of text summarizer YAML (#1415, @kevin85421)
  • [serve] Change text ml yaml to use french in user config (#1403, @zcin)
  • [services] Add text ml rayservice yaml (#1402, @zcin)
  • [Bug] Fix flakiness of RayService e2e tests (#1385, @kevin85421)
  • Add RayService sample test (#1377, @Darren221)
  • [RayService] Revisit the conditions under which a RayService is considered unhealthy and the default threshold (#1293, @kevin85421)
  • [RayService][Observability] Add more logging about networking issues (#1282, @kevin85421)

RayJob

  • [Feature] Improve observability for flaky RayJob test (#1587, @kevin85421)
  • [Bug][RayJob] Fix FailedToGetJobStatus by allowing transition to Running (#1583, @architkulkarni)
  • [RayJob] Fix RayJob status reconciliation (#1539, @astefanutti)
  • [RayJob]: Always use target RayCluster image as default RayJob submitter image (#1548, @astefanutti)
  • [RayJob] Add default CPU and memory for job submitter pod (#1319, @architkulkarni)
  • [Bug][RayJob] Check dashboard readiness before creating job pod (#1381) (#1429, @rueian)
  • [Feature][RayJob] Use RayContainerIndex instead of 0 (#1397) (#1427, @rueian)
  • [RayJob] Enable job log streaming by setting PYTHONUNBUFFERED in job container (#1375, @architkulkarni)
  • Add field to expose entrypoint num cpus in rayjob (#1359, @shubhscoder)
  • [RayJob] Add runtime env YAML field (#1338, @architkulkarni)
  • [Bug][RayJob] RayJob with custom head service name (#1332, @kevin85421)
  • [RayJob] Add e2e sample yaml test for shutdownAfterJobFinishes (#1269, @architkulkarni)

RayCluster

  • [Enhancement] Remove unused variables in constant.go (#1474, @evalaiyc98)
  • [Enhancement] GPU RayCluster doesn't work on GKE Autopilot (#1470, @kevin85421)
  • [Refactor] Parameterize TestGetAndCheckServeStatus (#1450, @evalaiyc98)
  • [Feature] Make replicas optional for WorkerGroupSpec (#1443, @kevin85421)
  • use raycluster app's name as podgroup name key word (#1446, @lowang-bh)
  • [Refactor] Make port name variables consistent and meaningful (#1389, @evalaiyc98)
  • [Feature] Use image of Ray head container as the default Ray Autoscaler container (#1401, @kevin85421)
  • Update Autoscaler YAML for the Autoscaler tutorial (#1400, @kevin85421)
  • [Feature] Ray container must be the first application container (#1379, @kevin85421)
  • [release blocker][Feature] Only Autoscaler can make decisions to delete Pods (#1253, @kevin85421)
  • [release blocker][Autoscaler] Randomly delete Pods when scaling down the cluster (#1251, @kevin85421)

Helm charts

  • Remove miniReplicas in raycluster-cluster.yaml (#1473, @evalaiyc98)
  • Helm chart ray-cluster template reference fix (#1469, @chrisxstyles)
  • fix: Issue #1391 - Custom labels not being pulled in (#1398, @rxraghu)
  • Remove unnecessary kustomize in make helm (#1370, @shubhscoder)
  • [Feature] Allow RayCluster Helm chart to specify different images for different worker groups (#1352, @Darren221)
  • Allow manually creating init containers in Kuberay helm charts (#1287, @richardsliu)

KubeRay API Server

  • Added Python API server client (#1561, @blublinsky)
  • updating url use v1 (#1577, @blublinsky)
  • Fixed processing of job submitter (#1562, @blublinsky)
  • extended job APIs (#1537, @blublinsky)
  • fixed volumes test in cluster test (#1498, @blublinsky)
  • Add documentation for API Server monitoring (#1479, @blublinsky)
  • created HA example for API server (#1461, @blublinsky)
  • Numerous fixes to the API server to make RayJob APIs working (#1447, @blublinsky)
  • Updated API server documentation (#1435, @z103cb)
  • servev2 support for API server (#1419, @blublinsky)
  • replacement for https://github.com/ray-project/kuberay/pull/1312 (#1409, @blublinsky)
  • Updates to the apiserver swagger-ui (#1410, @z103cb)
  • implemented liveness/readiness probe for the API server (#1369, @blublinsky)
  • Operator support for openShift (#1371, @blublinsky)
  • Removed use of BUILD_FLAGS in apiserver makefile (#1336, @z103cb)
  • Api server makefile (#1301, @z103cb)

Documentation

  • [Doc] Update release docs (#1621, @kevin85421)
  • [Doc] Fix release doc format (#1578, @kevin85421)
  • Update kuberay mcad integration doc (#1373, @tedhtchang)
  • [Release][Doc] Add instructions to release Go modules. (#1546, @kevin85421)
  • [Post v1.0.0-rc.1] Reenable sample YAML tests for latest release and update some docs (#1544, @kevin85421)
  • Update operator development instruction (#1458, @tedhtchang)
  • doc: fix moved link (#1462, @hongchaodeng)
  • Fix mkDocs (#1448, @kevin85421)
  • Update Kuberay doc to version 1.0.0 rc.0 (#1441, @Yicheng-Lu-llll)
  • [Doc] Delete unused docs (#1440, @kevin85421)
  • [Post Ray 2.7.0 Release] Update Ray versions to Ray 2.7.0 (#1423, @GeneDer)
  • [Doc] Update README (#1433, @kevin85421)
  • [release] Redirect users to Ray website (#1431, @kevin85421)
  • [Docs] Update Security Guidance on Dashboard Ingress (#1413, @ijrsvt)
  • Update Volcano integration doc (#1380, @annajung)
  • [Doc] Add gke bucket yaml (#1372, @architkulkarni)
  • [RayJob] [Doc] Add real-world Ray Job use case tutorial for KubeRay (#1361, @architkulkarni)
  • Delete ray_v1alpha1_rayjob.batch-inference.yaml (#1360, @architkulkarni)
  • Documentation and example for running simple NLP service on kuberay (#1340, @gvspraveen)
  • Add a document for profiling (#1299, @Yicheng-Lu-llll)
  • Fix: Typo (#1295, @ArgonQQ)
  • [Post release v0.6.0] Update CHANGELOG.md (#1274, @kevin85421)
  • Release v0.6.0 doc validation (#1271, @kevin85421)
  • [Doc] Develop Ray Serve Python script on KubeRay (#1250, @kevin85421)
  • [Doc] Fix the order of comments in sample Job YAML file (#1242, @architkulkarni)
  • [Doc] Upload a screenshot for the Serve page in Ray dashboard (#1236, @kevin85421)
  • Fix typo (#1241, @mmourafiq)

CI

  • [Bug] Fix flaky sample YAML tests (#1590, @kevin85421)
  • Allow to install and remove operator via scripts (#1545, @jiripetrlik)
  • [CI] Create release tag for ray-operator Go module (#1574, @astefanutti)
  • [Test][Bug] Update worker replicas idempotently in rayjob autoscaler envtest (#1471) (#1543, @rueian)
  • Update Dockerfiles to address CVE-2023-44487 (HTTP/2 Rapid Reset) (#1540, @astefanutti)
  • [CI] Skip redis raycluster sample YAML test (#1465, @architkulkarni)
  • Revert "[CI] Skip redis raycluster sample YAML test" (#1490, @rueian)
  • Remove GOARCH in ray-operator/Dockerfile to support multi-arch images (#1442, @ideal)
  • Update Dockerfile to address closed CVEs (#1488, @anishasthana)
  • [CI] Update latest release to v1.0.0-rc.0 in tests (#1467, @architkulkarni)
  • [CI] Reenable rayjob sample yaml latest test (#1464, @architkulkarni)
  • Updating logrus and net packages in go.mod (#1495, @jbusche)
  • Allow E2E tests to run with arbitrary k8s cluster (#1306, @jiripetrlik)
  • Bump golang.org/x/net from 0.0.0-20210405180319-a5a99cb37ef4 to 0.7.0 in /proto (#1345, @dependabot[bot])
  • Bump golang.org/x/text from 0.3.5 to 0.3.8 in /proto (#1344, @dependabot[bot])
  • Bump go.mongodb.org/mongo-driver from 1.3.4 to 1.5.1 in /apiserver (#1407, @dependabot[bot])
  • Bump golang.org/x/sys from 0.0.0-20210510120138-977fb7262007 to 0.1.0 in /proto (#1346, @dependabot[bot])
  • Bump golang.org/x/net from 0.0.0-20210813160813-60bc85c4be6d to 0.7.0 in /cli (#1405, @dependabot[bot])
  • Bump github.com/emicklei/go-restful from 2.9.5+incompatible to 2.16.0+incompatible in /ray-operator (#1348, @dependabot[bot])
  • Bump golang.org/x/sys from 0.0.0-20211210111614-af8b64212486 to 0.1.0 in /cli (#1347, @dependabot[bot])
  • [CI] Remove RayService tests from compatibility-test.py (#1395, @kevin85421)
  • [CI] Remove extraPortMappings from kind configurations (#1366, @kevin85421)
  • [CI] Update latest ray version 2.5.0 -> 2.6.3 (#1320, @architkulkarni)
  • Bump the golangci-lint version in the api server makefile (#1342, @z103cb)
  • [CI] Refactor pipeline and test RayCluster sample yamls (#1321, @architkulkarni)
  • Update doc and base image for Go 1.19 (#1330, @tedhtchang)
  • Fix release actions (#1323, @anishasthana)
  • Upgrade to Go 1.19 (#1325, @kevin85421)
  • [CI] Run sample job YAML tests in buildkite (#1315, @architkulkarni)
  • [CI] Downgrade kind from v0.20.0 to v0.11.1 (#1313, @architkulkarni)
  • [CI] Publish KubeRay operator / apiserver images to Quay (#1307, @kevin85421)
  • [CI] Install kuberay operator in buildkite test (#1308, @architkulkarni)
  • [CI] Verify kubectl in kind-in-docker step (#1305, @architkulkarni)
  • [Quay] Sanity check for KubeRay repository setup (#1300, @kevin85421)
  • [CI] Only run test_ray_serve for Ray 2.6.0 and later (#1288, @kevin85421)
  • Update ray operator Dockerfile (#1213, @anishasthana)
  • [Golang] Remove go get (#1283, @ijrsvt)
  • Dependencies: Upgrade golang.org/x packages (#1281, @ijrsvt)
  • [CI] Add kind-in-Docker test to Buildkite CI (#1243, @architkulkarni)

Others

  • Fix: odd number of arguments (#1594, @chenk008)
  • [Feature][Observability] Scrape Autoscaler and Dashboard metrics (#1493, @kevin85421)
  • [Benchmark] KubeRay memory / scalability benchmark (#1324, @kevin85421)
  • Do not update pod labels if they haven't changed (#1304, @JoshKarpel)
  • Add Ray cluster spec for TPU pods (#1292, @richardsliu)
  • [Grafana][Observability] Embed Grafana dashboard panels into Ray dashboard (#1278, @kevin85421)
  • [Feature] Allow custom labels&annotations for kuberay operator (#1275) (#1276, @mariusp)

v0.6.0

9 months ago

Highlights

  • RayService

    • RayService starts to support Ray Serve multi-app API (#1136, #1156)
    • RayService stability improvements (#1231, #1207, #1173)
    • RayService observability (#1230)
    • RayService examples
      • [RayService] Stable Diffusion example (#1181, @kevin85421)
      • MobileNet example (#1175, @kevin85421)
    • RayService troubleshooting handbook (#1221)
  • RayJob refactoring (#1177)

  • Autoscaler stability improvements (#1251, #1253)

RayService

  • [RayService][Observability] Add more logging for RayService troubleshooting (#1230, @kevin85421)
  • [Bug] Long image pull time will trigger blue-green upgrade after the head is ready (#1231, @kevin85421)
  • [RayService] Stable Diffusion example (#1181, @kevin85421)
  • [RayService] Update docs to use multi-app (#1179, @zcin)
  • [RayService] Change runtime env for e2e autoscaling test (#1178, @zcin)
  • [RayService] Add e2e tests (#1167, @zcin)
  • [RayService][docs] Improve explanation for config file and in-place updates (#1229, @zcin)
  • [RayService][Doc] RayService troubleshooting handbook (#1221, @kevin85421)
  • [Doc] Improve RayService doc (#1235, @kevin85421)
  • [Doc] Improve FAQ page and RayService troubleshooting guide (#1225, @kevin85421)
  • [RayService] Add RayService alb ingress CR (#1169, @sihanwang41)
  • [RayService] Add support for multi-app config in yaml-string format (#1156, @zcin)
  • [rayservice] Add support for getting multi-app status (#1136, @zcin)
  • [Refactor] Remove Dashboard Agent service (#1207, @kevin85421)
  • [Bug] KubeRay operator fails to get serve deployment status due to 500 Internal Server Error (#1173, @kevin85421)
  • MobileNet example (#1175, @kevin85421)
  • [Bug] fix RayActorOptionSpec.items.spec.serveConfig.deployments.rayActorOptions.memory int32 data type (#1220, @kevin85421)

RayJob

  • [RayJob] Submit job using K8s job instead of checking Status and using DashboardHTTPClient (#1177, @architkulkarni)
  • [Doc] [RayJob] Add documentation for submitterPodTemplate (#1228, @architkulkarni)

Autoscaler

  • [release blocker][Feature] Only Autoscaler can make decisions to delete Pods (#1253, @kevin85421)
  • [release blocker][Autoscaler] Randomly delete Pods when scaling down the cluster (#1251, @kevin85421)

Helm

  • [Helm][RBAC] Introduce the option crNamespacedRbacEnable to enable or disable the creation of Role/RoleBinding for RayCluster preparation (#1162, @kevin85421)
  • [Bug] Allow zero replica for workers for Helm (#968, @ducviet00)
  • [Bug] KubeRay tries to create ClusterRoleBinding when singleNamespaceInstall and rbacEnable are set to true (#1190, @kevin85421)

KubeRay API Server

  • Add support for openshift routes (#1183, @blublinsky)
  • Adding API server support for service account (#1148, @blublinsky)

Documentation

  • [release v0.6.0] Update tags and versions (#1270, @kevin85421)
  • [release v0.6.0-rc.1] Update tags and versions (#1264, @kevin85421)
  • [release v0.6.0-rc.0] Update tags and versions (#1237, @kevin85421)
  • [Doc] Develop Ray Serve Python script on KubeRay (#1250, @kevin85421)
  • [Doc] Fix the order of comments in sample Job YAML file (#1242, @architkulkarni)
  • [Doc] Upload a screenshot for the Serve page in Ray dashboard (#1236, @kevin85421)
  • [Doc] GKE GPU cluster setup (#1223, @kevin85421)
  • [Doc][Website] Add complete document link (#1224, @yuxiaoba)
  • Add FAQ page (#1150, @Yicheng-Lu-llll)
  • [Doc] Add gofumpt lint instructions (#1180, @architkulkarni)
  • [Doc] Add helm update command to chart validation step in release process (#1165, @architkulkarni)
  • [Doc] Add git fetch --tags command to release instructions (#1164, @architkulkarni)
  • Add KubeRay related blogs (#1147, @tedhtchang)
  • [2.5.0 Release] Change version numbers 2.4.0 -> 2.5.0 (#1151, @ArturNiederfahrenhorst)
  • [Sample YAML] Bump ray version in pod security YAML to 2.4.0 (#1160, @architkulkarni)
  • Add instruction to skip unit tests in DEVELOPMENT.md (#1171, @architkulkarni)
  • Fix typo (#1241, @mmourafiq)
  • Fix typo (#1232, @mmourafiq)

CI

  • [CI] Add kind-in-Docker test to Buildkite CI (#1243, @architkulkarni)
  • [CI] Remove unnecessary release.yaml workflow (#1168, @architkulkarni)

Others

  • Pin operator version in single namespace installation (#1193) (#1210, @wjzhou)
  • RayCluster updates status frequently (#1211, @kevin85421)
  • Improve the observability of the init container (#1149, @Yicheng-Lu-llll)
  • [Ray Observability] Disk usage in Dashboard (#1152, @kevin85421)

v0.5.2

11 months ago

Changelog for v0.5.2

Highlights

The KubeRay 0.5.2 patch release includes the following improvements.

  • Allow specifying the entire headService and serveService YAML spec. Previously, only certain special fields such as labels and annotations were exposed to the user.
  • RayService stability improvements
    • RayService object’s Status is being updated due to frequent reconciliation (#1065, @kevin85421)
    • [RayService] Submit requests to the Dashboard after the head Pod is running and ready (#1074, @kevin85421)
    • Fix in HeadPod Service Generation logic which was causing frequent reconciliation (#1056, @msumitjain)
  • Allow watching multiple namespaces
    • [Feature] Watch CR in multiple namespaces with namespaced RBAC resources (#1106, @kevin85421)
  • Autoscaler stability improvements
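
The first highlight above, supplying a whole head Service spec rather than only labels and annotations, can be sketched as a manifest fragment. This is a minimal illustration, not the project's reference configuration. The field path spec.headGroupSpec.headService, the ray.io/v1alpha1 API version, and the helper name ray_cluster_with_head_service are assumptions made for this example, and the Pod template is omitted for brevity.

```python
import json


def ray_cluster_with_head_service(name, service_type="ClusterIP", annotations=None):
    """Build a RayCluster manifest fragment (as a dict) carrying a full head Service.

    Hypothetical helper for illustration only; field locations are assumptions.
    """
    return {
        "apiVersion": "ray.io/v1alpha1",  # assumed API version for this release line
        "kind": "RayCluster",
        "metadata": {"name": name},
        "spec": {
            "headGroupSpec": {
                # Assumed location: a complete Kubernetes Service object,
                # so any Service field (not just labels/annotations) can be set.
                "headService": {
                    "metadata": {"annotations": annotations or {}},
                    "spec": {"type": service_type},
                },
                "rayStartParams": {},
                # The head Pod template is intentionally left out of this fragment.
            }
        },
    }


manifest = ray_cluster_with_head_service(
    "demo", service_type="NodePort", annotations={"example.com/owner": "ml-team"}
)
print(json.dumps(manifest, indent=2))
```

A serveService block on a RayService would presumably follow the same pattern (a full Service object under the RayService spec), but check the CRD of the installed KubeRay version before relying on either field path.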

Contributors

We'd like to thank the following contributors for their contributions to this release:

@ByronHsu, @Yicheng-Lu-llll, @anishasthana, @architkulkarni, @blublinsky, @chrisxstyles, @dirtyValera, @ecurtin, @jasoonn, @jjyao, @kevin85421, @kodwanis, @msumitjain, @oginskis, @psschwei, @scarlet25151, @sihanwang41, @tedhtchang, @varungup90, @xubo245

Features

  • Add a flag to enable/disable worker init container injection (#1069, @ByronHsu)
  • Add a warning to discourage users from launching a KubeRay-incompatible autoscaler. (#1102, @kevin85421)
  • Add consistency check for deepcopy generated files (#1127, @varungup90)
  • Add kubernetes dependency in python client library (#998, @jasoonn)
  • Add support for pvcs to apiserver (#1118, @psschwei)
  • Add support for tolerations, env, annotations and labels (#1070, @blublinsky)
  • Align Init Container's ImagePullPolicy with Ray Container's ImagePullPolicy (#1080, @Yicheng-Lu-llll)
  • Connect Ray client with TLS using Nginx Ingress on Kind cluster (#729) (#1051, @tedhtchang)
  • Expose entire head pod Service to the user (#1040, @architkulkarni)
  • Exposing Serve Service (#1117, @kodwanis)
  • [Test] Add e2e test for sample RayJob yaml on kind (#935, @architkulkarni)
  • Parametrize ray-operator makefile (#1121, @anishasthana)
  • RayService object's Status is being updated due to frequent reconciliation (#1065, @kevin85421)
  • [Feature] Support suspend in RayJob (#926, @oginskis)
  • [Feature] Watch CR in multiple namespaces with namespaced RBAC resources (#1106, @kevin85421)
  • [RayService] Submit requests to the Dashboard after the head Pod is running and ready (#1074, @kevin85421)
  • feat: Rename instances of rayiov1alpha1 to rayv1alpha1 (#1112, @anishasthana)
  • ray-operator: Reuse contexts across ray operator reconcilers (#1126, @anishasthana)

Fixes

  • Fix CI (#1145, @kevin85421)
  • Fix config frequent update (#1014, @sihanwang41)
  • Fix for Sample YAML Config Test - 2.4.0 Failure due to 'suspend' Field (#1096, @Yicheng-Lu-llll)
  • Fix in HeadPod Service Generation logic which was causing frequent reconciliation (#1056, @msumitjain)
  • [Bug] Autoscaler doesn't support TLS (#1119, @chrisxstyles)
  • [Bug] Enable ResourceQuota by adding Resources for the health-check init container (#1043, @kevin85421)
  • [Bug] Fix null map handling in BuildServiceForHeadPod function (#1095, @architkulkarni)
  • [Bug] RayService restarts repeatedly with Autoscaler (#1037, @kevin85421)
  • [Bug] Service (Serve) changing port from 8000 to 9000 doesn't work (#1081, @kevin85421)
  • [Bug] autoscaler not working properly in rayjob (#1064, @Yicheng-Lu-llll)
  • [Bug] compatibility test for the nightly Ray image fails (#1055, @kevin85421)
  • [Bug] rayStartParams is required at this moment. (#1031, @kevin85421)
  • [Bug][Autoscaler] Operator does not remove workers (#1139, @kevin85421)
  • [Bug][Doc] fix the link error of operator document (#1046, @xubo245)
  • [Bug][GCS FT] Worker pods crash unexpectedly when gcs_server on head pod is killed (#1036, @kevin85421)
  • [Bug][breaking change] Unauthorized 401 error on fetching Ray Custom Resources from K8s API server (#1128, @kevin85421)
  • [Bug][k8s compatibility] k8s v1.20.7 ClusterIP svc is not updated under RayService (#1110, @kevin85421)
  • [Helm][ray-cluster] Fix parsing envFrom field in additionalWorkerGroups (#1039, @dirtyValera)

Documentation

  • [Doc] Copyedit dev guide (#1012, @architkulkarni)
  • [Doc] Update nav to include missing files and reorganize nav (#1011, @architkulkarni)
  • [Doc] Update version from 0.4.0 to 0.5.0 on remaining kuberay docs files (#1018, @architkulkarni)
  • [Doc][Website] Update KubeRay introduction and fix layout issues (#1042, @kevin85421)
  • [Docs][Website] One word typo fix in docs and README (#1068, @ecurtin)
  • Add a document to outline the default settings for rayStartParams in Kuberay (#1057, @Yicheng-Lu-llll)
  • Example Pod to connect Ray client to a remote Ray cluster with TLS enabled (#994, @tedhtchang)
  • [Post release v0.5.0] Update CHANGELOG.md (#1026, @kevin85421)
  • [Post release v0.5.0] Update release doc (#1028, @kevin85421)
  • [Post Ray 2.4 Release] Update Ray versions to Ray 2.4.0 (#1049, @jjyao)
  • [Post release v0.5.0] Remove block from rayStartParams (#1015, @kevin85421)
  • [Post release v0.5.0] Remove block from rayStartParams for python client and KubeRay operator tests (#1050, @Yicheng-Lu-llll)
  • [Post release v0.5.0] Remove serviceType (#1013, @kevin85421)
  • [Post v0.5.0] Remove init containers from YAML files (#1010, @kevin85421)
  • [Sample YAML] Bump ray version in pod security YAML to 2.4.0 (#1160) (#1161, @architkulkarni)
  • Kuberay 0.5.0 docs validation update docs for GCS FT (#1004, @scarlet25151)
  • Release v0.5.0 doc validation (#997, @kevin85421)
  • Release v0.5.0 doc validation part 2 (#999, @architkulkarni)
  • Release v0.5.0 python client library validation (#1006, @jasoonn)
  • [release v0.5.2] Update tags and versions to 0.5.2 (#1159, @architkulkarni)

v0.5.0

1 year ago

Highlights

The KubeRay 0.5.0 release includes the following improvements.

  • Interact with KubeRay via a Python client
  • Integrate KubeRay with Kubeflow to provide an interactive development environment.
  • Integrate KubeRay with Ray TLS authentication
  • Improve the user experience for KubeRay on AWS EKS
  • Fix some Kubernetes networking issues
  • Fix some stability bugs in RayJob and RayService

Contributors

The following individuals contributed to KubeRay 0.5.0. This list is alphabetical and incomplete.

@akanso @alex-treebeard @architkulkarni @cadedaniel @cskornel-doordash @davidxia @DmitriGekhtman @ducviet00 @gvspraveen @harryge00 @jasoonn @Jeffwan @kevin85421 @psschwei @scarlet25151 @sihanwang41 @wilsonwang371 @Yicheng-Lu-llll

Python client (alpha)(New!)

  • Alkanso/python client (#901, @akanso)
  • Reorganize python client library (#984, @jasoonn)

Kubeflow (New!)

  • [Feature][Doc] Kubeflow integration (#937, @kevin85421)
  • [Feature] Ray restricted podsecuritystandards for enterprise security and Kubeflow integration (#750, @kevin85421)

TLS authentication (New!)

  • [Feature] TLS authentication (#989, @kevin85421)

AWS EKS (New!)

  • [Feature][Doc] Access S3 bucket from Pods in EKS (#958, @kevin85421)

Kubernetes networking (New!)

  • Read cluster domain from resolv.conf or env (#951, @harryge00)
  • [Feature] Replace service name with Fully Qualified Domain Name (#938, @kevin85421)
  • [Feature] Add default init container in workers to wait for GCS to be ready (#973, @kevin85421)
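
The networking changes above (reading the cluster domain and addressing the head service by its fully qualified domain name) can be sketched as follows. This is a rough illustration of the idea, not the operator's actual code. The CLUSTER_DOMAIN variable name, the precedence of the environment over resolv.conf, the cluster.local default, and all function names are assumptions made for this example.

```python
import os

DEFAULT_CLUSTER_DOMAIN = "cluster.local"  # common Kubernetes default, assumed here


def cluster_domain_from_resolv_conf(text):
    """Return the cluster domain found in resolv.conf content, or None."""
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] != "search":
            continue
        for domain in fields[1:]:
            # Pod search paths usually contain an entry such as "svc.cluster.local".
            if domain.startswith("svc."):
                return domain[len("svc."):].rstrip(".")
    return None


def cluster_domain(env=None, resolv_conf_text=""):
    """Pick the cluster domain: environment first, then resolv.conf, then default."""
    env = os.environ if env is None else env
    return (
        env.get("CLUSTER_DOMAIN")  # hypothetical variable name
        or cluster_domain_from_resolv_conf(resolv_conf_text)
        or DEFAULT_CLUSTER_DOMAIN
    )


def service_fqdn(service, namespace, domain):
    """Compose the standard Kubernetes Service DNS name."""
    return f"{service}.{namespace}.svc.{domain}"


resolv = (
    "nameserver 10.96.0.10\n"
    "search default.svc.cluster.local svc.cluster.local cluster.local\n"
    "options ndots:5\n"
)
domain = cluster_domain(env={}, resolv_conf_text=resolv)
print(service_fqdn("raycluster-head-svc", "default", domain))
```

Using the fully qualified name avoids depending on the Pod's DNS search path, which is one plausible reason for preferring it over the bare service name.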

Observability

  • Fix issue with head pod not monitored by Prometheus under certain conditions (#963, @Yicheng-Lu-llll)
  • [Feature] Improve and fix Prometheus & Grafana integrations (#895, @kevin85421)
  • Add example and tutorial to explain how to create custom metrics for Prometheus (#914, @Yicheng-Lu-llll)
  • feat: enrich kubectl get output (#878, @davidxia)

RayCluster

  • Fix issue with operator OOM restart (#946, @wilsonwang371)
  • [Feature][Hotfix] Add observedGeneration to the status of CRDs (#979, @kevin85421)
  • Customize the Prometheus export port (#954, @Yicheng-Lu-llll)
  • [Feature] The default ImagePullPolicy should be IfNotPresent (#947, @kevin85421)
  • Inject the --block option to ray start command automatically (#932, @Yicheng-Lu-llll)
  • Inject cluster name as an environment variable into head and worker pods (#934, @Yicheng-Lu-llll)
  • Ensure container ports without names are also included in the head node service (#891, @Yicheng-Lu-llll)
  • fix: .status.availableWorkerReplicas (#887, @davidxia)
  • fix: only filter RayCluster events for reconciliation (#882, @davidxia)
  • refactor: remove redundant import in raycluster_controller.go (#884, @davidxia)
  • refactor: use equivalent, shorter Builder.Owns() method (#881, @davidxia)
  • [RayCluster controller] [Bug] Unconditionally reconcile RayCluster every 60s instead of only upon change (#850, @architkulkarni)
  • [Feature] Make head serviceType optional (#851, @kevin85421)
  • [RayCluster controller] Add headServiceAnnotations field to RayCluster CR (#841, @cskornel-doordash)

RayJob (alpha)

  • [Hotfix][release blocker][RayJob] Prevent HTTP client from submitting jobs before dashboard initialization completes (#1000, @kevin85421)
  • [RayJob] Propagate error traceback string when GetJobInfo doesn't return valid JSON (#943, @architkulkarni)
  • [RayJob][Doc] Fix RayJob sample config. (#807, @DmitriGekhtman)

RayService (alpha)

  • [RayService] Skip update events without change (#811, @sihanwang41)

Helm

  • Add rayVersion in the RayCluster chart (#975, @Yicheng-Lu-llll)
  • [Feature] Support environment variables for KubeRay operator chart (#978, @kevin85421)
  • [Feature] Add service account section in helm chart (#969, @ducviet00)
  • Update apiserver chart location in readme (#896, @psschwei)
  • add sidecar container option (#920, @akihikokuroda)
  • match selector of service to pod labels (#918, @akihikokuroda)
  • [Feature] Nodeselector/Affinity/Tolerations value to kuberay-apiserver chart (#879, @alex-treebeard)
  • [Feature] Enable namespaced installs via helm chart (#860, @alex-treebeard)
  • Remove unused fields from KubeRay operator and RayCluster charts (#839, @kevin85421)
  • [Bug] Remove an unused field (ingress.enabled) from KubeRay operator chart (#812, @kevin85421)
  • [helm] Add memory limits and resource documentation. (#789, @DmitriGekhtman)

CI

  • [Feature] Add python client test to action (#993, @jasoonn)
  • [CI][Buildkite] Fix the PATH issue (#952, @kevin85421)
  • [CI][Buildkite] An example test for Buildkite (#919, @kevin85421)
  • refactor: Fix flaky tests by using RetryOnConflict (#904, @Yicheng-Lu-llll)
  • Use k8sClient from client.New in controller test (#898, @Yicheng-Lu-llll)
  • [Bug] Fix flaky test: should be able to update all Pods to Running (#893, @kevin85421)
  • Enable test framework to install operator with custom config and put operator in a namespace with enforced PSS in security testing (#876, @Yicheng-Lu-llll)
  • Ensure all temp files are deleted after the compatibility test (#886, @Yicheng-Lu-llll)
  • Adding a test for the document for the Pod security standard (#866, @Yicheng-Lu-llll)
  • [Feature] Run config tests with the latest release of KubeRay operator (#858, @kevin85421)
  • [Feature] Define a general-purpose cleanup method for CREvent (#849, @kevin85421)
  • [Feature] Remove Docker container and NodePort from compatibility test (#844, @kevin85421)
  • Remove Docker from BasicRayTestCase (#840, @kevin85421)
  • [Feature] Move some functions from prototype test framework to a new utils file (#837, @kevin85421)
  • [CI] Add workflow to manually trigger release image push (#801, @DmitriGekhtman)
  • [CI] Pin go version in CRD consistency check (#794, @DmitriGekhtman)
  • [Feature] Improve the observability of integration tests (#775, @jasoonn)

Sample YAML files

  • Improve ray-cluster.external-redis.yaml (#986, @Yicheng-Lu-llll)
  • remove ray-cluster.getting-started.yaml (#987, @Yicheng-Lu-llll)
  • [Feature] Read Redis password from Kubernetes Secret (#950, @kevin85421)
  • [Ray 2.3.0] Update --redis-password for RayCluster (#929, @kevin85421)
  • [Bug] KubeRay does not work on M1 macs. (#869, @kevin85421)
  • [Post Ray 2.3 Release] Update Ray versions to Ray 2.3.0 (#925, @cadedaniel)
  • [Post Ray 2.2.0 Release] Update Ray versions to Ray 2.2.0 (#822, @DmitriGekhtman)

Documentation

  • Update contribution doc to show users how to reach out via slack (#936, @gvspraveen)
  • [Feature][Docs] Explain how to specify container command for head pod (#912, @kevin85421)
  • [post-0.4.0 KubeRay release] update proto version to 0.4.0 (#830, @scarlet25151)
  • [0.4.0 release] Update changelog for KubeRay 0.4.0 (#836, @DmitriGekhtman)
  • [Docs] Revise release note docs (#835, @DmitriGekhtman)
  • [release] Add release command and guidance for KubeRay cli (#834, @Jeffwan)
  • [Release] Add tools and docs for changelog generator (#833, @Jeffwan)
  • [Bug] error: git cmd when following docs (#831, @kevin85421)
  • [post-0.4.0 KubeRay release] Update KubeRay versions (#821, @DmitriGekhtman)
  • [Feature][Doc] End-to-end KubeRay operator development process on Kind (#826, @kevin85421)
  • [Release][Docs] Update release instructions (#819, @DmitriGekhtman)
  • [docs] Tweaks to main README, add basic API Server README. (#809, @DmitriGekhtman)
  • update docs for release v0.4.0 (#778, @scarlet25151)
  • [docs] Update KubeRay operator README. (#808, @DmitriGekhtman)
  • [Release] Update docs for release v0.4.0 (#779, @kevin85421)

v0.4.0

1 year ago

Highlights

The KubeRay 0.4.0 release includes the following improvements.

Contributors

The following individuals contributed to KubeRay 0.4.0. This list is alphabetical and incomplete.

@AlessandroPomponio @architkulkarni @Basasuya @DmitriGekhtman @IceKhan13 @asm582 @davidxia @dhaval0108 @haoxins @iycheng @jasoonn @Jeffwan @jianyuan @kaushik143 @kevin85421 @lizzzcai @orcahmlee @pcmoritz @peterghaddad @rafvasq @scarlet25151 @shrekris-anyscale @sigmundv @sihanwang41 @simon-mo @tbabej @tgaddair @ulfox @wilsonwang371 @wuisawesome

New features and integrations

  • [Feature] Support Volcano for batch scheduling (#755, @tgaddair)
  • KubeRay integration with MCAD (#598, @asm582)

Helm

These changes pertain to KubeRay's Helm charts.

  • [Bug] Remove an unused field (ingress.enabled) from KubeRay operator chart (#812, @kevin85421)
  • [helm] Add memory limits and resource documentation. (#789, @DmitriGekhtman)
  • [Helm] Expose security context in helm chart. (#773, @DmitriGekhtman)
  • [Helm] Clean up RayCluster Helm chart ahead of KubeRay 0.4.0 release (#751, @DmitriGekhtman)
  • [Feature] Expose initContainer image in RayCluster chart (#674, @kevin85421)
  • [Feature][Helm] Expose the autoscalerOptions (#666, @orcahmlee)
  • [Feature][Helm] Align the key of minReplicas and maxReplicas (#663, @orcahmlee)
  • Helm: add service type configuration to head group for ray-cluster (#614, @IceKhan13)
  • Allow annotations in ray cluster helm chart (#574, @sigmundv)
  • [Feature][Helm] Enable sidecar configuration in Helm chart (#604, @kevin85421)
  • [bugfix][apiserver helm]: Adding missing rbacenable value (#594, @dhaval0108)
  • [Bug] Modification of nameOverride will cause label selector mismatch for head node (#572, @kevin85421)
  • [Helm][minor] Make "disabled" flag for worker groups optional (#548, @kevin85421)
  • helm: Uncomment the disabled key for the default workergroup (#543, @tbabej)
  • Fix Helm chart default configuration (#530, @kevin85421)
  • helm-chart/ray-cluster: Allow setting pod lifecycle (#494, @ulfox)

CI

The changes in this section pertain to KubeRay CI, testing, and developer workflows.

  • [Feature] Improve the observability of integration tests (#775, @jasoonn)
  • [CI] Pin go version in CRD consistency check (#794, @DmitriGekhtman)
  • [Feature] Test sample RayService YAML to catch invalid or out of date one (#731, @jasoonn)
  • Replace kubectl wait command with RayClusterAddCREvent (#705, @kevin85421)
  • [Feature] Test sample RayCluster YAMLs to catch invalid or out of date ones (#678, @kevin85421)
  • [Bug] Misuse of Docker API and misunderstanding of Ray HA cause test_ray_serve flaky (#650, @jasoonn)
  • Configuration Test Framework Prototype (#605, @kevin85421)
  • Update tests for better Mac M1 compatibility (#654, @shrekris-anyscale)
  • [Bug] Update wait function in test_detached_actor (#635, @kevin85421)
  • [Bug] Misuse of Docker API and misunderstanding of Ray HA cause test_detached_actor flaky (#619, @kevin85421)
  • [Feature] Docker support for chart-testing (#623, @jasoonn)
  • [Feature] Optimize the wait functions in E2E tests (#609, @kevin85421)
  • [Feature] Running end-to-end tests on local machine (#589, @kevin85421)
  • [CI]use fixed version of gofumpt (#596, @wilsonwang371)
  • update test files before separating them (#591, @wilsonwang371)
  • Add reminders to avoid RBAC synchronization bug (#576, @kevin85421)
  • [Feature] Consistency check for RBAC (#577, @kevin85421)
  • [Feature] Sync for manifests and helm chart (#564, @kevin85421)
  • [Feature] Add a chart-test script to enable chart lint error reproduction on laptop (#563, @kevin85421)
  • [Feature] Add helm lint check in Github Actions (#554, @kevin85421)
  • [Feature] Add consistency check for types.go, CRDs, and generated API in GitHub Actions (#546, @kevin85421)
  • support ray 2.0.0 in compatibility test (#508, @wilsonwang371)

KubeRay Operator deployment

The changes in this section pertain to deployment of the KubeRay Operator.

  • Fix finalizer typo and re-create manifests (#631, @AlessandroPomponio)
  • Change Kuberay operator Deployment strategy type to Recreate (#566, @haoxins)
  • [Bug][Doc] Increase default operator resource requirements, improve docs (#727, @kevin85421)
  • [Feature] Sync logs to local file (#632, @Basasuya)
  • [Bug] label rayNodeType is useless (#698, @kevin85421)
  • Revise sample configs, increase memory requests, update Ray versions (#761, @DmitriGekhtman)

RayCluster controller

The changes in this section pertain to the RayCluster controller sub-component of the KubeRay Operator.

  • [autoscaler] Expose autoscaler container security context. (#752, @DmitriGekhtman)
  • refactor: log more descriptive info from initContainer (#526, @davidxia)
  • [Bug] Fail to create ingress due to the deprecation of the ingress.class annotation (#646, @kevin85421)
  • [kuberay] Fix inconsistent RBAC truncation for autoscaling clusters. (#689, @DmitriGekhtman)
  • [raycluster controller] Always honor maxReplicas (#662, @DmitriGekhtman)
  • [Autoscaler] Pass pod name to autoscaler, add pod patch permission (#740, @DmitriGekhtman)
  • [Bug] Shallow copy causes different worker configurations (#714, @kevin85421)
  • Fix duplicated volume issue (#690, @wilsonwang371)
  • [fix][raycluster controller] No error if head ip cannot be determined. (#701, @DmitriGekhtman)
  • [Feature] Set default appProtocol for Ray head service to tcp (#668, @kevin85421)
  • [Telemetry] Inject env identifying KubeRay. (#562, @DmitriGekhtman)
  • fix: correctly set GPUs in rayStartParams (#497, @davidxia)
  • [operator] enable bashrc before container start (#427, @Basasuya)
  • [Bug] Pod reconciliation fails if worker pod name is supplied (#587, @kevin85421)

Ray Jobs (alpha)

The changes pertain to the RayJob controller sub-component of the KubeRay Operator.

  • [Feature] [RayJobs] Use finalizers to implement stopping a job upon cluster deletion (#735, @kevin85421)
  • [ray job] support stop job after job cr is deleted in cluster selector mode (#629, @Basasuya)
  • [RayJob] Fix example misconfiguration. (#602, @DmitriGekhtman)
  • [operator] support clusterselector in job crd (#470, @Basasuya)

Ray Services (alpha)

The changes pertain to the RayService controller sub-component of the KubeRay Operator.

  • [RayService] Skip update events without change (#811, @sihanwang41)
  • [RayService] Track whether Serve app is ready before switching clusters (#730, @shrekris-anyscale)
  • [RayService] Compare cached hashed config before triggering update (#655, @shrekris-anyscale)
  • Disable async serve handler in Ray Service cluster. (#447, @iycheng)
  • [RayService] Revert "Disable async serve handler in Ray Service cluster (#447)" (#606, @shrekris-anyscale)
  • add support for rayserve in apiserver (#456, @scarlet25151)
  • Fix initial health check not obeying deploymentUnhealthySecondThreshold (#540, @jianyuan)

KubeRay API Server

  • [Bug][apiserver] fix apiserver create rayservice missing serve port (#734, @scarlet25151)
  • Support updating RayServices using the KubeRay API Server (#633, @scarlet25151)
  • [api server] enable job spec server (#416, @Basasuya)

Security

  • [Bug] client_golang used by KubeRay has a vulnerability (#728, @kevin85421)

Observability

  • feat: update RayCluster .status.reason field with pod creation error (#639, @davidxia)
  • feat: enrich RayCluster status with head IPs (#468, @davidxia)
  • config/prometheus: add metrics exporter for workers (#469, @ulfox)

Documentation

  • [docs] Updated Volcano integration documentation (#776, @tgaddair)
  • [0.4.0 Release] Minor doc improvements (#780, @DmitriGekhtman)
  • Update gcs-ft.md (#777, @wilsonwang371)
  • [Feature] Refactor test framework & test kuberay-operator chart with configuration framework (#759, @kevin85421)
  • fix docs: typo in README.md (#760, @davidxia)
  • [APIServer][Docs] Identify API server as community-managed and optional (#753, @DmitriGekhtman)
  • Add documentations for the release process of Helm charts (#723, @kevin85421)
  • [docs] Fix markdown in ray services (#712, @lizzzcai)
  • Cross-reference docs. (#703, @DmitriGekhtman)
  • Adding example of manually setting up NGINX Ingress (#699, @jasoonn)
  • [docs] State version requirement for kubectl (#702, @DmitriGekhtman)
  • Remove ray-cluster.without-block.yaml (#675, @kevin85421)
  • [doc] Add instructions about how to use SSL/TLS for redis connection. (#652, @iycheng)
  • [Feature][Docs] AWS Application Load Balancer (ALB) support (#658, @kevin85421)
  • [Feature][Doc] Explain that RBAC should be synchronized manually (#641, @kevin85421)
  • [doc] Reformat README.md (#599, @rafvasq)
  • [doc] Copy-Edit RayJob (#608, @rafvasq)
  • [doc] VS Code IDE setup (#613, @kevin85421)
  • [doc] Copy-Edit RayService (#607, @rafvasq)
  • fix mkdocs URL (#600, @asm582)
  • [doc] Add a tip on docker images (#586, @DmitriGekhtman)
  • Update ray-operator documentation and image version in ray-cluster.heterogeneous.yaml (#585, @jasoonn)
  • [Doc] Cannot build kuberay with Go 1.16 (#575, @kevin85421)
  • docs: Add instructions for working with Argo CD (#535, @haoxins)
  • Update Helm doc. (#531, @DmitriGekhtman)
  • Failure happened when install operator with kubectl apply (#525, @kevin85421)
  • fix examples: bad K8s log config causing logs to be lost (#501, @davidxia)
  • Helm instructions: kubectl apply -> kubectl create (#505, @DmitriGekhtman)
  • apiserver add new api docs (#498, @scarlet25151)

ray-cluster-chart-latest

1 year ago

A Helm chart for Kubernetes

kuberay-operator-chart-latest

1 year ago

A Helm chart for Kubernetes

kuberay-apiserver-chart-latest

1 year ago

A Helm chart for kuberay-apiserver

v0.3.0

1 year ago

v0.3.0 (2022-08-17)

RayService (new feature!)

  • [rayservice] Fix config names to match serve config format directly (#464, @edoakes)
  • Disable pin on head for serve controller by default in service operator (#457, @iycheng)
  • add wget timeout to probes (#448, @wilsonwang371)
  • Disable async serve handler in Ray Service cluster. (#447, @iycheng)
  • Add more env for RayService head or worker pods (#439, @brucez-anyscale)
  • RayCluster created by RayService set death info env for ray container (#419, @brucez-anyscale)
  • Add integration test for kuberay ray service and improve ray service operator (#415, @brucez-anyscale)
  • Fix a potential reconcile issue for RayService and allow config unhealth time threshold in CR (#384, @brucez-anyscale)
  • [Serve] Unify logger and add user facing events (#378, @simon-mo)
  • Improve RayService Operator logic to handle head node crash (#376, @brucez-anyscale)
  • Add serving service for users traffic with health check (#367, @brucez-anyscale)
  • Create a service for dashboard agent (#324, @brucez-anyscale)
  • Update RayService CR to integrate with Ray Nightly (#322, @brucez-anyscale)
  • RayService: zero downtime update and healthcheck HA recovery (#307, @brucez-anyscale)
  • RayService: Dev RayService CR and Controller logic (#287, @brucez-anyscale)
  • KubeRay: kubebuilder create RayService Controller and CR (#270, @brucez-anyscale)

RayJob (new feature!)

  • Properly convert unix time into meta time (#480, @pingsutw)
  • Fix nil pointer dereference (#429, @pingsutw)
  • Improve RayJob controller quality to alpha (#398, @Jeffwan)
  • Submit ray job after cluster is ready (#405, @pingsutw)
  • Add RayJob CRD and controller logic (#303, @harryge00)

Cluster Fault Tolerant (new feature!)

  • tune readiness probe timeouts (#411, @wilsonwang371)
  • enable ray external storage namespace (#406, @wilsonwang371)
  • Initial support for external Redis and GCS HA (#294, @wilsonwang371)

Autoscaler (new feature!)

  • [Autoscaler] Match autoscaler image to Ray head image for Ray >= 2.0.0 (#423, @DmitriGekhtman)
  • [autoscaler] Better defaults and config options (#414, @DmitriGekhtman)
  • [autoscaler] Make log file mount path more specific. (#391, @DmitriGekhtman)
  • [autoscaler] Flip prioritize-workers-to-delete feature flag (#379, @DmitriGekhtman)
  • Update autoscaler image (#371, @DmitriGekhtman)
  • [minor] Update autoscaler image. (#313, @DmitriGekhtman)
  • Provide override for autoscaler image pull policy. (#297, @DmitriGekhtman)
  • [RFC][autoscaler] Add autoscaler container overrides and config options for scale behavior. (#278, @DmitriGekhtman)
  • [autoscaler] Improve autoscaler auto-configuration, upstream recent improvements to Kuberay NodeProvider (#274, @DmitriGekhtman)

Operator

  • correct gcs ha to gcs ft (#482, @wilsonwang371)
  • Fix panic in cleanupInvalidVolumeMounts (#481, @MissiontoMars)
  • fix: worker node can't connect to head node service (#445, @pingsutw)
  • Add http resp code check for kuberay (#435, @brucez-anyscale)
  • Fix wrong ray start command (#431, @pingsutw)
  • fix controller: use Service's TargetPort (#383, @davidxia)
  • Generate clientset for new specs (#392, @Basasuya)
  • Add Ray address env. (#388, @DmitriGekhtman)
  • Add the support to replace evicted head pod (#381, @Jeffwan)
  • [Bug] Fix raycluster updatestatus list wrong label (#377, @scarlet25151)
  • Make replicas optional for the head spec. (#362, @DmitriGekhtman)
  • Add ray head service endpoints in status for expose raycluster's head node endpoints (#341, @scarlet25151)
  • Support KubeRay management labels (#345, @Jeffwan)
  • fix: bug in object store memory validation (#332, @davidxia)
  • feat: add EventReason type for events (#334, @davidxia)
  • minor refactor: fix camel-casing of unHealthy -> unhealthy (#333, @davidxia)
  • refactor: remove redundant imports (#317, @davidxia)
  • Fix GPU-autofill for rayStartParams (#328, @DmitriGekhtman)
  • ray-operator: add missing space in controller log messages (#316, @davidxia)
  • fix: use head group's ServiceAccount in autoscaler RoleBinding (#315, @davidxia)
  • fix typos in comments and help messages (#304, @davidxia)
  • enable force cluster upgrade (#231, @wilsonwang371)
  • fix operator: correctly set head pod service account (#276, @davidxia)
  • [hotfix] Fix Service account typo (#285, @DmitriGekhtman)
  • Rename RayCluster folder to Ray since the group is Ray (#275, @brucez-anyscale)
  • KubeRay: Relocate files to enable controller extension with Kubebuilder (#268, @brucez-anyscale)
  • fix: use configured RayCluster service account when autoscaling (#259, @davidxia)
  • suppress not found errors into regular logs (#222, @akanso)
  • adding label check (#221, @akanso)
  • Prioritize WorkersToDelete (#208, @sriram-anyscale)
  • Simplify k8s client creation (#179, @chenk008)
  • [ray-operator]Make log timestamp readable (#206, @chenk008)
  • bump controller-runtime to 0.11.1 and Kubernetes to v1.23 (#180, @chenk008)

APIServer

  • Add envs in cluster service api (#432, @MissiontoMars)
  • Expose swallowed detail error messages (#422, @Jeffwan)
  • fix: typo RAY_DISABLE_DOCKER_CPU_WRARNING -> RAY_DISABLE_DOCKER_CPU_WARNING (#421, @pingsutw)
  • Add hostPathType and mountPropagationMode field for apiserver (#413, @scarlet25151)
  • Fix ListAllComputeTemplates proto comments (#407, @MissiontoMars)
  • Enable DefaultHTTPErrorHandler and Upgrade grpc-gateway to v2 (#369, @Jeffwan)
  • Validate namespace consistency in the request when creating the cluster and the compute template (#365, @daikeshi)
  • Update compute template service url to include namespace path param (#363, @Jeffwan)
  • fix apiserver created raycluster metrics port missing and check (#356, @scarlet25151)
  • Support mounting volumes in API request (#346, @Jeffwan)
  • add standard label for the filtering of cluster (#342, @scarlet25151)
  • expose kubernetes events in apiserver (#343, @scarlet25151)
  • Update ray-operator version in the apiserver (#340, @pingsutw)
  • fix: typo worker_group_sepc -> worker_group_spec (#330, @davidxia)
  • Fix gpu-accelerator in template (#296, @armandpicard)
  • Add namespace scope to compute template operations (#244, @daikeshi)
  • Add namespace scope to list operation (#237, @daikeshi)
  • Add namespace scope for Ray cluster get and delete operations (#229, @daikeshi)

CLI

  • Cli: make namespace optional to adapt to ListAll operation (#361, @Jeffwan)

Deployment (kubernetes & helm)

  • sync up helm chart's role (#472, @scarlet25151)
  • helm-charts/ray-cluster: Allow extra workers (#451, @ulfox)
  • Update helm chart version to 0.3.0 (#461, @Jeffwan)
  • helm-chart/ray-cluster: allow head autoscaling (#443, @ulfox)
  • modify kuberay operator crds in kuberay operator chart and add apiserver chart (#354, @scarlet25151)
  • Warn explicitly against using kubectl apply to create RayCluster CRD. (#302, @DmitriGekhtman)
  • Sync crds to Helm chart (#280, @haoxins)
  • [Feature]Run kuberay in a single namespace (#258, @wilsonwang371)
  • fix duplicated port config and manager.yaml missing config (#250, @wilsonwang371)
  • manifests: Add live/ready probes (#243, @haoxins)
  • Helm: supports custom probe seconds (#239, @haoxins)
  • Add CD for helm charts (#199, @ddelange)

Build and Testing

  • Enable docker image push for release-0.3 branch (#462, @Jeffwan)
  • add new 8000 port forwarding in kind (#424, @wilsonwang371)
  • improve compatibility test stability (#418, @wilsonwang371)
  • improve test stability (#394, @wilsonwang371)
  • use more strict formatting (#385, @wilsonwang371)
  • fix flaky test issue (#370, @wilsonwang371)
  • provide more detailed information in case of test failures (#352, @wilsonwang371)
  • fix wrong kuberay image used by compatibility test (#327, @wilsonwang371)
  • add cluster nodes info test (#299, @wilsonwang371)
  • Fix the image name in deploy cmd (#293, @brucez-anyscale)
  • [CI]enable ci test to check ctrl plane health state (#279, @wilsonwang371)
  • [bugfix]update flaky test timeout (#254, @wilsonwang371)
  • Update format by running gofumpt (#236, @wilsonwang371)
  • Add unit tests for raycluster_controller reconcilePods function (#219, @Waynegates)
  • Support ray 1.12 (#245, @wilsonwang371)
  • add 1.11 to compatibility test and update comment (#217, @wilsonwang371)
  • run compatibility in parallel using multiple workflows (#215, @wilsonwang371)

Monitoring

  • add-state-machine-and-exposing-port (#319, @scarlet25151)
  • Install: Fix directory path for prometheus install.sh (#256, @Tomcli)
  • Fix Ray Operator prometheus config (#253, @Tomcli)
  • Emit prometheus metrics from kuberay control plane (#232, @Jeffwan)
  • Enable metrics-export-port by default and configure prometheus monitoring (#230, @scarlet25151)

Docs

  • [doc] Config and doc updates ahead of KubeRay 0.3.0/Ray 2.0.0 (#486, @DmitriGekhtman)
  • document the raycluster status (#473, @scarlet25151)
  • Clean up example samples (#434, @DmitriGekhtman)
  • Add ray state api doc link in ray service doc (#428, @brucez-anyscale)
  • [docs] Add sample configs with larger Ray pods (#426, @DmitriGekhtman)
  • Add RayJob docs and development docs (#404, @Jeffwan)
  • Add gcs ha doc into mkdocs (#402, @brucez-anyscale)
  • [minor] Add client and dashboard ports to ports in example configs. (#399, @DmitriGekhtman)
  • Add documentation for RayService (#387, @brucez-anyscale)
  • Fix broken links by creating referenced soft links (#335, @Jeffwan)
  • Support hosting swagger ui in apiserver (#344, @Jeffwan)
  • Remove autoscaler debug example to prevent confusion (#326, @Jeffwan)
  • Add a link to protobuf-grpc-service design page in proto doc (#310, @yabuchan)
  • update readme and address issue #286 (#311, @wilsonwang371)
  • docs fix: specify only Go 1.16 or 1.17 works right now (#261, @davidxia)
  • Add documention link in readme (#247, @simon-mo)
  • Use mhausenblas/mkdocs-deploy-gh-pages action for docs (#233, @Jeffwan)
  • Build KubeRay Github site (#216, @Jeffwan)

Contributors

Thank you to everyone who contributed to this release! ❤️

Users whose commits are in this release (alphabetically by user name): @akanso @armandpicard @Basasuya @brucez-anyscale @chenk008 @daikeshi @davidxia @ddelange @DmitriGekhtman @edoakes @haoxins @harryge00 @iycheng @Jeffwan @MissiontoMars @pingsutw @scarlet25151 @simon-mo @sriram-anyscale @Tomcli @ulfox @Waynegates @wilsonwang371 @yabuchan

A special shoutout to these folks who helped report, test, and review the code: @caitengwei @ericl @pcmoritz @wuisawesome

And thank you very much to everyone else not listed here who contributed in other ways, such as filing issues, giving feedback, testing fixes, and helping users in Slack. 🙏