Ray Versions Save

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

ray-2.10.0

3 weeks ago

Release Highlights

Ray 2.10 release brings important stability improvements and enhancements to Ray Data, with Ray Data becoming generally available (GA).

[Data] Ray Data becomes generally available with stability improvements in streaming execution, reading and writing data, better tasks concurrency control, and debuggability improvement with dashboard, logging and metrics visualization.
[RLlib] “New API Stack” officially announced as alpha for PPO and SAC.
[Serve] Added a default autoscaling policy set via num_replicas=”auto” (#42613).
[Serve] Added support for active load shedding via max_queued_requests (#42950).
[Serve] Added replica queue length caching to the DeploymentHandle scheduler (#42943).
- This should improve overhead in the Serve proxy and handles.
- max_ongoing_requests (max_concurrent_queries) is also now strictly enforced (#42947).
- If you see any issues, please report them on GitHub and you can disable this behavior by setting: RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE=0.
[Serve] Renamed the following parameters. Each of the old names will be supported for another release before removal.
- max_concurrent_queries -> max_ongoing_requests
- target_num_ongoing_requests_per_replica -> target_ongoing_requests
- downscale_smoothing_factor -> downscaling_factor
- upscale_smoothing_factor -> upscaling_factor
[Serve] WARNING: the following default values will change in Ray 2.11:
- Default for max_ongoing_requests will change from 100 to 5.
- Default for target_ongoing_requests will change from 1 to 2.
[Core] Autoscaler v2 is in alpha and can be tried out with Kuberay. It has improved observability and stability compared to v1.
[Train] Added support for accelerator types via ScalingConfig(accelerator_type).
[Train] Revamped the XGBoostTrainer and LightGBMTrainer to no longer depend on xgboost_ray and lightgbm_ray. A new, more flexible API will be released in a future release.
[Train/Tune] Refactored local staging directory to remove the need for local_dir and RAY_AIR_LOCAL_CACHE_DIR.

Ray Libraries

Ray Data

🎉 New Features:

Streaming execution stability improvement to avoid memory issue, including per-operator resource reservation, streaming generator output buffer management, and better runtime resource estimation (#43026, #43171, #43298, #43299, #42930, #42504)
Metadata read stability improvement to avoid AWS transient error, including retry on application-level exception, spread tasks across multiple nodes, and configure retry interval (#42044, #43216, #42922, #42759).
Allow tasks concurrency control for read, map, and write APIs (#42849, #43113, #43177, #42637)
Data dashboard and statistics improvement with more runtime metrics for each components (#43790, #43628, #43241, #43477, #43110, #43112)
Allow to specify application-level error to retry for actor task (#42492)
Add num_rows_per_file parameter to file-based writes (#42694)
Add DataIterator.materialize (#43210)
Skip schema call in DataIterator.to_tf if tf.TypeSpec is provided (#42917)
Add option to append for Dataset.write_bigquery (#42584)
Deprecate legacy components and classes (#43575, #43178, #43347, #43349, #43342, #43341, #42936, #43144, #43022, #43023)

💫 Enhancements:

Restructure stdout logging for better readability (#43360)
Add a more performant way to read large TFRecord datasets (#42277)
Modify ImageDatasource to use Image.BILINEAR as the default image resampling filter (#43484)
Reduce internal stack trace output by default (#43251)
Perform incremental writes to Parquet files (#43563)
Warn on excessive driver memory usage during shuffle ops (#42574)
Distributed reads for ray.data.from_huggingface (#42599)
Remove Stage class and related usages (#42685)
Improve stability of reading JSON files to avoid PyArrow errors (#42558, #42357)

🔨 Fixes:

Turn off actor locality by default (#44124)
Normalize block types before internal multi-block operations (#43764)
Fix memory metrics for OutputSplitter (#43740)
Fix race condition issue in OpBufferQueue (#43015)
Fix early stop for multiple Limit operators. (#42958)
Fix deadlocks caused by Dataset.streaming_split for job hanging (#42601)

📖 Documentation:

Revamp Ray Data documentation for GA (#44006, #44007, #44008, #44098, #44168, #44093, #44105)

Ray Train

🎉 New Features:

Add support for accelerator types via ScalingConfig(accelerator_type) for improved worker scheduling (#43090)

💫 Enhancements:

Add a backend-specific context manager for train_func for setup/teardown logic (#43209)
Remove DEFAULT_NCCL_SOCKET_IFNAME to simplify network configuration (#42808)
Colocate Trainer with rank 0 Worker for to improve scheduling behavior (#43115)

🔨 Fixes:

Enable scheduling workers with memory resource requirements (#42999)
Make path behavior OS-agnostic by using Path.as_posix over os.path.join (#42037)
[Lightning] Fix resuming from checkpoint when using RayFSDPStrategy (#43594)
[Lightning] Fix deadlock in RayTrainReportCallback (#42751)
[Transformers] Fix checkpoint reporting behavior when get_latest_checkpoint returns None (#42953)

📖 Documentation:

Enhance docstring and user guides for train_loop_config (#43691)
Clarify in ray.train.report docstring that it is not a barrier (#42422)
Improve documentation for prepare_data_loader shuffle behavior and set_epoch (#41807)

🏗 Architecture refactoring:

Simplify XGBoost and LightGBM Trainer integrations. Implemented XGBoostTrainer and LightGBMTrainer as DataParallelTrainer. Removed dependency on xgboost_ray and lightgbm_ray. (#42111, #42767, #43244, #43424)
Refactor local staging directory to remove the need for local_dir and RAY_AIR_LOCAL_CACHE_DIR. Add isolation between driver and distributed worker artifacts so that large files written by workers are not uploaded implicitly. Results are now only written to storage_path, rather than having another copy in the user’s home directory (~/ray_results). (#43369, #43403, #43689)
Split overloaded ray.train.torch.get_device into another get_devices API for multi-GPU worker setup (#42314)
Refactor restoration configuration to be centered around storage_path (#42853, #43179)
Deprecations related to SyncConfig (#42909)
Remove deprecated preprocessor argument from Trainers (#43146, #43234)
Hard-deprecate MosaicTrainer and remove SklearnTrainer (#42814)

Ray Tune

💫 Enhancements:

Increase the minimum number of allowed pending trials for faster auto-scaleup (#43455)
Add support to TBXLogger for logging images (#37822)
Improve validation of Experiment(config) to handle RLlib AlgorithmConfig (#42816, #42116)

🔨 Fixes:

Fix reuse_actors error on actor cleanup for function trainables (#42951)
Make path behavior OS-agnostic by using Path.as_posix over os.path.join (#42037)

📖 Documentation:

Minor documentation fixes (#42118, #41982)

🏗 Architecture refactoring:

Refactor local staging directory to remove the need for local_dir and RAY_AIR_LOCAL_CACHE_DIR. Add isolation between driver and distributed worker artifacts so that large files written by workers are not uploaded implicitly. Results are now only written to storage_path, rather than having another copy in the user’s home directory (~/ray_results). (#43369, #43403, #43689)
Deprecations related to SyncConfig and chdir_to_trial_dir (#42909)
Refactor restoration configuration to be centered around storage_path (#42853, #43179)
Add back NevergradSearch (#42305)
Clean up invalid checkpoint_dir and reporter deprecation notices (#42698)

Ray Serve

🎉 New Features:

Added support for active load shedding via max_queued_requests (#42950).
Added a default autoscaling policy set via num_replicas=”auto” (#42613).

🏗 API Changes:

Renamed the following parameters. Each of the old names will be supported for another release before removal.
- max_concurrent_queries to max_ongoing_requests
- target_num_ongoing_requests_per_replica to target_ongoing_requests
- downscale_smoothing_factor to downscaling_factor
- upscale_smoothing_factor to upscaling_factor
WARNING: the following default values will change in Ray 2.11:
- Default for max_ongoing_requests will change from 100 to 5.
- Default for target_ongoing_requests will change from 1 to 2.

💫 Enhancements:

Add RAY_SERVE_LOG_ENCODING env to set the global logging behavior for Serve (#42781).
Config Serve's gRPC proxy to allow large payload (#43114).
Add blocking flag to serve.run() (#43227).
Add actor id and worker id to Serve structured logs (#43725).
Added replica queue length caching to the DeploymentHandle scheduler (#42943).
- This should improve overhead in the Serve proxy and handles.
- max_ongoing_requests (max_concurrent_queries) is also now strictly enforced (#42947).
- If you see any issues, please report them on GitHub and you can disable this behavior by setting: RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE=0.
Autoscaling metrics (tracking ongoing and queued metrics) are now collected at deployment handles by default instead of at the Serve replicas (#42578).
- This means you can now set max_ongoing_requests=1 for autoscaling deployments and still upscale properly, because requests queued at handles are properly taken into account for autoscaling.
- You should expect deployments to upscale more aggressively during bursty traffic, because requests will likely queue up at handles during bursts of traffic.
- If you see any issues, please report them on GitHub and you can switch back to the old method of collecting metrics by setting the environment variable RAY_SERVE_COLLECT_AUTOSCALING_METRICS_ON_HANDLE=0
Improved the downscaling behavior of smoothing_factor for low numbers of replicas (#42612).
Various logging improvements (#43707, #43708, #43629, #43557).
During in-place upgrades or when replicas become unhealthy, Serve will no longer wait for old replicas to gracefully terminate before starting new ones (#43187). New replicas will be eagerly started to satisfy the target number of healthy replicas.
- This new behavior is on by default and can be turned off by setting RAY_SERVE_EAGERLY_START_REPLACEMENT_REPLICAS=0

🔨 Fixes:

Fix deployment route prefix override by default route prefix from serve run cli (#43805).
Fixed a bug causing batch methods to hang upon cancellation (#42593).
Unpinned FastAPI dependency version (#42711).
Delay proxy marking itself as healthy until it has routes from the controller (#43076).
Fixed an issue where multiplexed deployments could go into infinite backoff (#43965).
Silence noisy KeyError on disconnects (#43713).
Fixed the prometheus counter metrics emitted as gauge bug (#43795, #43901).
- All the serve counter metrics are emitted as counters with _total suffix. The old gauge metrics are still emitted for compatibility.

📖 Documentation:

Update serve logging config docs (#43483).
Added documentation for max_replicas_per_node (#42743).

RLlib

🎉 New Features:

The “new API stack” is now in alpha stage and available for PPO single- (#42272) and multi-agent and for SAC single-agent (#42571, #42570, #42568)
- ConnectorV2 API (#43669, #43680, #43040, #41074, #41212)
- Episode APIs (SingleAgentEpisode and MultiAgentEpisode) (#42009, #43275, #42296, #43818, #41631)
- EnvRunner APIs (SingleAgentEnvRunner and MultiAgentEnvRunner) (#41558, #41825, #42296, #43779)
In preparation of DQN on the new API stack: PrioritizedEpisodeReplayBuffer (#43258, #42832)

💫 Enhancements:

Old API Stack cleanups:
- Move SampleBatch column names (e.g. SampleBatch.OBS) into new class (Columns). (#43665)
- Remove old exec_plan API code. (#41585)
- Introduce OldAPIStack decorator (#43657)
- RLModule API: Add functionality to define kernel and bias initializers via config. (#42137)
Learner/LearnerGroup APIs:
- Replace Learner/LearnerGroup specific config classes (e.g. LearnerHyperparameters) with AlgorithmConfig. (#41296)
- Learner/LearnerGroup: Allow updating from Episodes. (#41235)
In preparation of DQN on the new API stack: (#43199, #43196)

🔨 Fixes:

New API Stack bug fixes: Fix policy_to_train logic (#41529), fix multi-APU for PPO on the new API stack. (#44001), Issue 40347: (#42090)
Other fixes: MultiAgentEnv would NOT call env.close() on a failed sub-env (#43664), Issue 42152 (#43317), issue 42396: (#43316), issue 41518 (#42011), issue 42385 (#43313)

📖 Documentation:

New API Stack examples: Self-play and league-based self-play (#43276), MeanStdFilter (for both single-agent and multi-agent) (#43274), Prev-actions/prev-rewards for multi-agent (#43491)
Other docs fixes and enhancements: (#43438, #41472, #42117, #43458)

Ray Core and Ray Clusters

Ray Core

🎉 New Features:

Autoscaler v2 is in alpha and can be tried out with Kuberay.
Introduced subreaper to prevent leaks of sub-processes created by user code. (#42992)

💫 Enhancements:

Ray state api get_task() now accepts ObjectRef (#43507)
Add an option to disable task tracing for task/actor (#42431)
Improved object transfer throughput. (#43434)
Ray client now compares the Ray and Python version for compatibility with the remote Ray cluster. (#42760)

🔨 Fixes:

Fixed several bugs for streaming generator (#43775, #43772, #43413)
Fixed Ray counter metrics emitted as gauge bug (#43795)
Fixed a bug where empty resource task doesn’t work with placement group (#43448)
Fixed a bug where CPU resource is not released for a blocked worker inside placement group (#43270)
Fixed GCS crashes when PG commit phase failed due to node failure (#43405)
Fixed a bug where Ray memory monitor prematurely kill tasks (#43071)
Fixed placement group resource leak (#42942)
Upgraded cloudpickle to 3.0 which fixes the incompatibility with dataclasses (#42730)

📖 Documentation:

Updated the doc for Ray accelerators support (#41849)

Ray Clusters

💫 Enhancements:

[spark] Add heap_memory param for setup_ray_cluster API, and change default value of per ray worker node config, and change default value of ray head node config for global Ray cluster (#42604)
[spark] Add global mode for ray on spark cluster (#41153)

🔨 Fixes:

[VSphere] Only deploy ovf to first host of cluster (#42258)

Thanks

Many thanks to all those who contributed to this release!

@ronyw7, @xsqian, @justinvyu, @matthewdeng, @sven1977, @thomasdesr, @veryhannibal, @klebster2, @can-anyscale, @simran-2797, @stephanie-wang, @simonsays1980, @kouroshHakha, @Zandew, @akshay-anyscale, @matschaffer-roblox, @WeichenXu123, @matthew29tang, @vitsai, @Hank0626, @anmyachev, @kira-lin, @ericl, @zcin, @sihanwang41, @peytondmurray, @raulchen, @aslonnie, @ruisearch42, @vszal, @pcmoritz, @rickyyx, @chrislevn, @brycehuang30, @alexeykudinkin, @vonsago, @shrekris-anyscale, @andrewsykim, @c21, @mattip, @hongchaodeng, @dabauxi, @fishbone, @scottjlee, @justina777, @surenyufuz, @robertnishihara, @nikitavemuri, @Yard1, @huchen2021, @shomilj, @architkulkarni, @liuxsh9, @Jocn2020, @liuyang-my, @rkooo567, @alanwguo, @KPostOffice, @woshiyyya, @n30111, @edoakes, @y-abe, @martinbomio, @jiwq, @arunppsg, @ArturNiederfahrenhorst, @kevin85421, @khluu, @JingChen23, @masariello, @angelinalg, @jjyao, @omatthew98, @jonathan-anyscale, @sjoshi6, @gaborgsomogyi, @rynewang, @ratnopamc, @chris-ray-zhang, @ijrsvt, @scottsun94, @raychen911, @franklsf95, @GeneDer, @madhuri-rai07, @scv119, @bveeramani, @anyscalesam, @zen-xu, @npuichigo

ray-2.9.3

1 month ago

This patch release contains fixes for Ray Core, Ray Data, and Ray Serve.

Ray Core

🔨 Fixes:

Fix protobuf breaking change by adding a compat layer. (#43172)
Bump up task failure logs to warnings to make sure failures could be troubleshooted (#43147)
Fix placement group leaks (#42942)

Ray Data

🔨 Fixes:

Skip schema call in to_tf if tf.TypeSpec is provided (#42917)
Skip recording memory spilled stats when get_memory_info_reply is failed (#42824)

Ray Serve

🔨 Fixes:

Fixing DeploymentStateManager qualifying replicas as running prematurely (#43075)

Thanks

Many thanks to all those who contributed to this release!

@rynewang, @GeneDer, @alexeykudinkin, @edoakes, @c21, @rkooo567

ray-2.9.2

2 months ago

This patch release contains fixes for Ray Core, Ray Data, and Ray Serve.

Ray Core

🔨 Fixes:

Fix out of disk test on release branch (https://github.com/ray-project/ray/pull/42724)

Ray Data

🔨 Fixes:

Fix failing huggingface test (https://github.com/ray-project/ray/pull/42727)
Fix deadlocks caused by streaming_split (https://github.com/ray-project/ray/pull/42601) (https://github.com/ray-project/ray/pull/42755)
Fix locality config not being respected in DataConfig (https://github.com/ray-project/ray/pull/42204 https://github.com/ray-project/ray/pull/42204) (https://github.com/ray-project/ray/pull/42722)
Stability & accuracy improvements for Data+Train benchmark (https://github.com/ray-project/ray/pull/42027)
Add retry for _sample_fragment during ParquetDatasource._estimate_files_encoding_ratio() (https://github.com/ray-project/ray/pull/42759) (https://github.com/ray-project/ray/pull/42774)
Skip recording memory spilled stats when get_memory_info_reply is failed (https://github.com/ray-project/ray/pull/42824) (https://github.com/ray-project/ray/pull/42834)

Ray Serve

🔨 Fixes:

Pin the fastapi & starlette version to avoid breaking proxy (https://github.com/ray-project/ray/pull/42740 https://github.com/ray-project/ray/pull/42740)
Fix IS_PYDANTIC_2 logic for pydantic<1.9.0 (https://github.com/ray-project/ray/pull/42704) (https://github.com/ray-project/ray/pull/42708)
fix missing message body for json log formats (https://github.com/ray-project/ray/pull/42729) (https://github.com/ray-project/ray/pull/42874)

Thanks

Many thanks to all those who contributed to this release!

@c21, @raulchen, @can-anyscale, @edoakes, @peytondmurray, @scottjlee, @aslonnie, @architkulkarni, @GeneDer, @Zandew, @sihanwang41

ray-2.9.1

2 months ago

This patch release contains fixes for Ray Core, Ray Data, and Ray Serve.

Ray Core

🔨 Fixes:

Adding debupgy as the ray debugger (#42311)
Fix task events profile events per task leak (#42248)
Make sure redis sync context and async context connect to the same redis instance (#42040)

Ray Data

🔨 Fixes:

[Data] Retry write if error during file clean up (#42326)

Ray Serve

🔨 Fixes:

Improve handling the websocket server disconnect scenario (#42130)
Fix pydantic config documentation (#42216)
Address issues under high network delays:
- Enable setting queue length response deadline via environment variable (#42001)
- Add exponential backoff for queue_len_response_deadline_s (#42041)

ray-2.9.0

3 months ago

Release Highlights

This release contains fixes for the Ray Dashboard. Additional context can be found here: https://www.anyscale.com/blog/update-on-ray-cves-cve-2023-6019-cve-2023-6020-cve-2023-6021-cve-2023-48022-cve-2023-48023
Ray Train has now upgraded support for spot node preemption -- allowing Ray Train to handle preemption node failures differently than application errors.
Ray is now compatible with Pydantic versions <2.0.0 and >=2.5.0, addressing a piece of user feedback we’ve consistently received.
The Ray Dashboard now has a page for Ray Data to monitor real-time execution metrics.
Streaming generator is now officially a public API (#41436, #38784). Streaming generator allows writing streaming applications easily on top of Ray via Python generator API and has been used for Ray Serve and Ray data for several releases. See the documentation for details.
We’ve added experimental support for new accelerators: Intel GPU (#38553), Intel Gaudi Accelerators (#40561), and Huawei Ascend NPU (#41256).

Ray Libraries

Ray Data

🎉 New Features:

Add the dashboard for Ray Data to monitor real-time execution metrics and log file for debugging (https://docs.ray.io/en/master/data/monitoring-your-workload.html).
Introduce concurrency argument to replace ComputeStrategy in map-like APIs (#41461)
Allow task failures during execution (#41226)
Support PyArrow 14.0.1 (#41036)
Add new API for reading and writing Datasource (https://github.com/ray-project/ray/issues/40296)
Enable group-by over multiple keys in datasets (#37832)
Add support for multiple group keys in map_groups (#40778)

💫 Enhancements:

Optimize OpState.outqueue_num_blocks (#41748)
Improve stall detection for StreamingOutputsBackpressurePolicy (#41637)
Enable read-only Datasets to be executed on new execution backend (#41466, #41597)
Inherit block size from downstream ops (#41019)
Use runtime object memory for scheduling (#41383)
Add retries to file writes (#41263)
Make range datasource streaming (#41302)
Test core performance metrics (#40757)
Allow ConcurrencyCapBackpressurePolicy._cap_multiplier to be set to 1.0 (#41222)
Create StatsManager to manage _StatsActor remote calls (#40913)
Expose max_retry_cnt parameter for BigQuery Write (#41163)
Add rows outputted to data metrics (#40280)
Add fault tolerance to remote tasks (#41084)
Add operator-level dropdown to ray data overview (#40981)
Avoid slicing too-small blocks (#40840)
Ray Data jobs detail table (#40756)
Update default shuffle block size to 1GB (#40839)
Log progress bar to data logs (#40814)
Operator level metrics (#40805)

🔨 Fixes:

Partial fix for Dataset.context not being sealed after creation (#41569)
Fix the issue that DataContext is not propagated when using streaming_split (#41473)
Fix Parquet partition filter bug (#40947)
Fix split read output blocks (#41070)
Fix BigQueryDatasource fault tolerance bugs (#40986)

📖 Documentation:

Add example of how to read and write custom file types (#41785)
Fix ray.data.read_databricks_tables doc (#41366)
Add read_json docs example for setting PyArrow block size when reading large files (#40533)
Add AllToAllAPI to dataset methods (#40842)

Ray Train

🎉 New Features:

Support reading Result from cloud storage (#40622)

💫 Enhancements:

Sort local Train workers by GPU ID (#40953)
Improve logging for Train worker scheduling information (#40536)
Load the latest unflattened metrics with Result.from_path (#40684)
Skip incrementing failure counter on preemption node died failures (#41285)
Update TensorFlow ReportCheckpointCallback to delete temporary directory (#41033)

🔨 Fixes:

Update config dataclass repr to check against None (#40851)
Add a barrier in Lightning RayTrainReportCallback to ensure synchronous reporting. (#40875)
Restore Tuner and Results properly from moved storage path (#40647)

📖 Documentation:

Improve torch, lightning quickstarts and migration guides + fix torch restoration example (#41843)
Clarify error message when trying to use local storage for multi-node distributed training and checkpointing (#41844)
Copy edits and adding links to docstrings (#39617)
Fix the missing ray module import in PyTorch Guide (#41300)
Fix typo in lightning_mnist_example.ipynb (#40577)
Fix typo in deepspeed.rst (#40320)

🏗 Architecture refactoring:

Remove Legacy Trainers (#41276)

Ray Tune

🎉 New Features:

Support reading Result from cloud storage (#40622)

💫 Enhancements:

Skip incrementing failure counter on preemption node died failures (#41285)

🔨 Fixes:

Restore Tuner and Results properly from moved storage path (#40647)

📖 Documentation:

Remove low value Tune examples and references to them (#41348)
Clarify when to use MLflowLoggerCallback and setup_mlflow (#37854)

🏗 Architecture refactoring:

Delete legacy TuneClient/TuneServer APIs (#41469)
Delete legacy Searchers (#41414)
Delete legacy persistence utilities (air.remote_storage, etc.) (#40207)

Ray Serve

🎉 New Features:

Introduce logging config so that users can set different logging parameters for different applications & deployments.
Added gRPC context object into gRPC deployments for user to set custom code and details back to the client.
Introduce a runtime environment feature that allows running applications in different containers with different images. This feature is experimental and a new guide can be found in the Serve docs.

💫 Enhancements:

Explicitly handle gRPC proxy task cancellation when the client dropped a request to not waste compute resources.
Enable async __del__ in the deployment to execute custom clean up steps.
Make Ray Serve compatible with Pydantic versions <2.0.0 and >=2.5.0.

🔨 Fixes:

Fixed gRPC proxy streaming request latency metrics to include the entire lifecycle of the request, including the time to consume the generator.
Fixed gRPC proxy timeout request status from CANCELLED to DEADLINE_EXCEEDED.
Fixed previously Serve shutdown spamming log files with logs for each event loop to only log once on shutdown.
Fixed issue during batch requests when a request is dropped, the batch loop will be killed and not processed any future requests.
Updating replica log filenames to only include POSIX-compliant characters (removed the “#” character).
Replicas will now be gracefully shut down after being marked unhealthy due to health check failures instead of being force killed.
- This behavior can be toggled using the environment variable RAY_SERVE_FORCE_STOP_UNHEALTHY_REPLICAS=1, but this is planned to be removed in the near future. If you rely on this behavior, please file an issue on github.

RLlib

🎉 New Features:

New API stack (in progress):
- New MultiAgentEpisode class introduced. Basis for upcoming multi-agent EnvRunner, which will replace RolloutWorker APIs. (#40263, #40799)
- PPO runs with new SingleAgentEnvRunner (w/o Policy/RolloutWorker APIs). CI learning tests added. (#39732, #41074, #41075)
- By default: PPO reverted to use old API stack by default, for now. Pending feature-completion of new API stack (incl. multi-agent, RNN support, new EnvRunners, etc..). (#40706)
Old API stack:
- APPO/IMPALA: Enable using 2 separate optimizers for policy and vs (and 2 learning rates) on the old API stack. (#40927)
- Added on_workers_recreated callback to Algorithm, which is triggered after workers have failed and been restarted. (#40354)

💫 Enhancements:

Old API stack and rllib_contrib cleanups: #40939, #40744, #40789, #40444, #37271

🔨 Fixes:

Restoring from a checkpoint from an older wheel (where AlgorithmConfig.rl_module_spec was NOT a “@property” yet) breaks when trying to load from this checkpoint. (#41157)
SampleBatch slicing crashes when using tf + SEQ_LENS + zero-padding. (#40905)
Other fixes: #39978, #40788, #41168, #41204

📖 Documentation:

Updated codeblocks in RLlib. (#37271)

Ray Core and Ray Clusters

Ray Core

🎉 New Features:

Streaming generator is now officially a public API (#41436, #38784). Streaming generator allows writing streaming applications easily on top of Ray via Python generator API and has been used for Ray serve and Ray data for several releases. See the documentation for details.
- As part of the change, num_returns=”dynamic” is planning to be deprecated, and its return type is changed from ObjectRefGenerator -> “DynamicObjectRefGenerator”
Add experimental accelerator support for new hardwares.
- Add experimental support for Intel GPU (#38553)
- Add experimental support for Intel Gaudi Accelerators (#40561)
- Add experimental support for Huawei Ascend NPU (#41256)
Add the initial support to run MPI based code on top of Ray.(#40917, #41349)

💫 Enhancements:

Optimize next/anext performance for streaming generator (#41270)
Make the number of connections and thread number of the object manager client tunable. (#41421)
Add __ray_call__ default actor method (#41534)

🔨 Fixes:

Fix NullPointerException cause by raylet id is empty when get actor info in java worker (#40560)
Fix a bug where SIGTERM is ignored to worker processes (#40210)
Fix mmap file leak. (#40370)
Fix the lifetime issue in Plasma server client releasing object. (#40809)
Upgrade grpc from 1.50.2 to 1.57.1 to include security fixes (#39090)
Fix the bug where two head nodes are shown from ray list nodes (#40838)
Fix the crash when the GCS address is not valid. (#41253)
Fix the issue of unexpectedly high socket usage in ray core worker processes. (#41121)
Make worker_process_setup_hook work with strings instead of Python functions (#41479)

Ray Clusters

💫 Enhancements:

Stability improvements for the vSphere cluster launcher
Better CLI output for cluster launcher

🔨 Fixes:

Fixed run_init for TPU command runner

📖Documentation:

Added missing steps and simplified YAML in top-level clusters quickstart
Clarify that job entrypoints run on the head node by default and how to override it

Dashboard

💫 Enhancements:

Improvements to the Ray Data Dashboard
- Added Ray Data-specific overview on jobs page, including a table view with Dataset-level metrics
- Added operator-level metrics granularity to drill down on Dataset operators
- Added additional metrics for monitoring iteration over Datasets

Docs

🎉 New Features:

Updated to Sphinx version 7.1.2. Previously, the docs build used Sphinx 4.3.2. Upgrading to a recent version provides a more modern user experience while fixing many long standing issues. Let us know how you like the upgrade or any other docs issues on your mind, on the Ray Slack #docs channel.

Thanks

Many thanks to all those who contributed to this release!

@justinvyu, @zcin, @avnishn, @jonathan-anyscale, @shrekris-anyscale, @LeonLuttenberger, @c21, @JingChen23, @liuyang-my, @ahmed-mahran, @huchen2021, @raulchen, @scottjlee, @jiwq, @z4y1b2, @jjyao, @JoshTanke, @marxav, @ArturNiederfahrenhorst, @SongGuyang, @jerome-habana, @rickyyx, @rynewang, @batuhanfaik, @can-anyscale, @allenwang28, @wingkitlee0, @angelinalg, @peytondmurray, @rueian, @KamenShah, @stephanie-wang, @bryanjuho, @sihanwang41, @ericl, @sofianhnaide, @RaffaGonzo, @xychu, @simonsays1980, @pcmoritz, @aslonnie, @WeichenXu123, @architkulkarni, @matthew29tang, @larrylian, @iycheng, @hongchaodeng, @rudeigerc, @rkooo567, @robertnishihara, @alanwguo, @emmyscode, @kevin85421, @alexeykudinkin, @michaelhly, @ijrsvt, @ArkAung, @mattip, @harborn, @sven1977, @liuxsh9, @woshiyyya, @hahahannes, @GeneDer, @vitsai, @Zandew, @evalaiyc98, @edoakes, @matthewdeng, @bveeramani

ray-2.8.1

4 months ago

Release Highlights

The Ray 2.8.1 patch release contains fixes for the Ray Dashboard.

Additional context can be found here: https://www.anyscale.com/blog/update-on-ray-cves-cve-2023-6019-cve-2023-6020-cve-2023-6021-cve-2023-48022-cve-2023-48023

Ray Dashboard

🔨 Fixes:

[core][state][log] Cherry pick changes to prevent state API from reading files outside the Ray log directory (#41520) [Dashboard] Migrate Logs page to use state api. (#41474) (#41522)

ray-2.8.0

5 months ago

Release Highlights

This release features stability improvements and API clean-ups across the Ray libraries.

In Ray Serve, we are deprecating the previously experimental DAG API for deployment graphs. Model composition will be supported through deployment handles providing more flexibility and stability. The previously deprecated Ray Serve 1.x APIs have also been removed. We’ve also added a new Java APIs that aligns with the Ray Serve 2.x APIs. More API changes in the release notes below.
In RLlib, we’ve moved 24 algorithms into rllib_contrib (still available within RLlib for Ray 2.8).
We’ve added support for PyTorch-compatible input files shuffling for Ray Data. This allows users to randomly shuffle input files for better model training accuracy. This release also features new Ray Data datasources for Databricks and BigQuery.
On the Ray Dashboard, we’ve added new metrics for Ray Data in the Metrics tab. This allows users to monitor Ray Data workload including real time metrics of cluster memory, CPU, GPU, output data size, etc. See the doc for more details.
Ray Core now supports profiling GPU tasks or actors using Nvidia Nsight. See the documentation for instructions.
We fixed 2 critical bugs raised by many kuberay / ML library users, including a child process leak issue from Ray worker that leaks the GPU memory (#40182) and an job page excessive loading time issue when Ray HA cluster restarts a head node (#40742)
Python 3.7 support is officially deprecated from Ray.

Ray Libraries

Ray Data

🎉 New Features:

Add support for shuffling input files (#40154)
Support streaming read of PyTorch dataset (#39554)
Add BigQuery datasource (#37380)
Add Databricks table / SQL datasource (#39852)
Add inverse transform functionality to LabelEncoder (#37785)
Add function arg params to Dataset.map and Dataset.flat_map (#40010)

💫Enhancements:

Hard deprecate DatasetPipeline (#40129)
Remove BulkExecutor code path (#40200)
Deprecate extraneous Dataset parameters and methods (#40385)
Remove legacy iteration code path (#40013)
Implement streaming output backpressure (#40387)
Cap op concurrency with exponential ramp-up (#40275)
Store ray dashboard metrics in _StatsActor (#40118)
Slice output blocks to respect target block size (#40248)
Drop columns before grouping by in Dataset.unique() (#40016)
Standardize physical operator runtime metrics (#40173)
Estimate blocks for limit and union operator (#40072)
Store bytes spilled/restored after plan execution (#39361)
Optimize sample_boundaries in SortTaskSpec (#39581)
Optimization to reduce ArrowBlock building time for blocks of size 1 (#38833)

🔨 Fixes:

Fix bug where _StatsActor errors with PandasBlock (#40481)
Remove deprecated do_write (#40422)
Improve error message when reading HTTP files (#40462)
Add flag to skip get_object_locations for metrics (#39884)
Fall back to fetch files info in parallel for multiple directories (#39592)
Replace deprecated .pieces with updated .fragments (#39523)
Backwards compatibility for Preprocessor that have been fit in older versions (#39173)
Removing unnecessary data copy in convert_udf_returns_to_numpy (#39188)
Do not eagerly free root RefBundles (#39016)

📖Documentation:

Remove out-of-date Data examples (#40127)
Remove unused and outdated source examples (#40271)

Ray Train

🎉 New Features:

Add initial support for scheduling workers on neuron_cores (#39091)

💫Enhancements:

Update PyTorch Lightning import path to support both pytorch_lightning and lightning (#39841, #40266)
Propagate driver DataContext to RayTrainWorkers (#40116)

🔨 Fixes:

Fix error propagation for as_directory if to_directory fails (#40025)

📖Documentation:

Update checkpoint hierarchy documentation for RayTrainReportCallbacks. (#40174)
Update Lightning RayDDPStrategy docstring (#40376)

🏗 Architecture refactoring:

Deprecate LightningTrainer, AccelerateTrainer, `TransformersTrainer (#40163)
Clean up legacy persistence mode code paths (#39921, #40061, #40069, #40168)
Deprecate legacy DatasetConfig (#39963)
Remove references to DatasetPipeline (#40159)
Enable isort (#40172)

Ray Tune

💫Enhancements:

Separate storage checkpoint index bookkeeping (#39927, #40003)
Raise an error if Tuner.restore() is called on an instance (#39676) 🏗 Architecture refactoring:
Clean up legacy persistence mode code paths (#39918, #40061, #40069, #40168, #40175, #40192, #40181, #40193)
Migrate TuneController tests (#39704)
Remove TuneRichReporter (#40169)
Remove legacy Ray Client tests (#40415)

Ray Serve

💫Enhancements:

The single-app configuration format for the Serve Config (i.e. the Serve Config without the ‘applications’ field) has been deprecated in favor of the new configuration format. Both single-app configuration and DAG API will be removed in 2.9.
The Serve REST API is now accessible through the dashboard port, which defaults to 8265.
Accessing the Serve REST API through the dashboard agent port (default 52365) is deprecated. The support will be removed in a future version.
Ray job error tracebacks are now logged in the job driver log for easier access when jobs fail during start up.
Deprecated single-application config file
Deprecated DAG API: InputNode and DAGDriver
Removed deprecated Deployment 1.x APIs: Deployment.deploy(), Deployment.delete(), Deployment.get_handle()
Removed deprecated 1.x API: serve.get_deployment and serve.list_deployments
New Java API supported (aligns with Ray Serve 2.x API)

🔨 Fixes:

The dedicated_cpu and detached options in serve.start() have been fully disallowed.
Error will be raised when users pass invalid gRPC service functions and fail early.
The proxy’s readiness check now uses a linear backoff to avoid getting stuck in an infinite loop if it takes longer than usual to start.
grpc_options on serve.start() was only allowing a gRPCOptions object in Ray 2.7.0. Dictionaries are now allowed to be used asgrpc_options in the serve.start() call.

RLlib

💫Enhancements:

rllib_contrib algorithms (A2C, A3C, AlphaStar #36584, AlphaZero #36736, ApexDDPG #36596, ApexDQN #36591, ARS #36607, Bandits #36612, CRR #36616, DDPG, DDPPO #36620, Dreamer(V1), DT #36623, ES #36625, LeelaChessZero #36627, MA-DDPG #36628, MAML, MB-MPO #36662, PG #36666, QMix #36682, R2D2, SimpleQ #36688, SlateQ #36710, and TD3 #36726) all produce warnings now if used. See here for more information on the rllib_contrib efforts. (36620, 36628, 3
Provide msgpack checkpoint translation utility to convert checkpoint into msgpack format for being able to move in between python versions (#38825).

🔨 Fixes:

Issue 35440 (JSON output writer should include INFOS #39632)
Issue 39453 (PettingZoo wrappers should use correct multi-agent dict spaces #39459)
Issue 39421 (Multi-discrete action spaces not supported in new stack #39534)
Issue 39234 (Multi-categorical distribution bug #39464) #39654, #35975, #39552, #38555

Ray Core and Ray Clusters

Ray Core

🎉 New Features:

Python 3.7 support is officially deprecated from Ray.
Supports profiling GPU tasks or actors using Nvidia Nsight. See the doc for instructions.
Ray on spark autoscaling is officially supported from Ray 2.8. See the REP for more details. 💫Enhancements:
IDLE node information in detail is available from ray status -v (#39638)
Adding a new accelerator to Ray is simplified with a new accelerator interface. See the in-flight REP for more details (#40286).
Typing_extensions is removed from a dependency requirement because Python 3.7 support is deprecated. (#40336)
Ray state API supports case insensitive match. (#34577)
ray start --runtime-env-agent-port is officially supported. (#39919)
Driver exit code is available fromjob info (#39675)

🔨 Fixes:

Fixed a worker leak when Ray is used with placement group because Ray didn’t handle SIGTERM properly (#40182)
Fixed an issue job page loading takes a really long time when Ray HA cluster restarts a head node (#40431)
[core] loosen the check on release object (#39570)
[Core] ray init sigterm (#39816)
[Core] Non Unit Instance fractional value fix (#39293)
[Core]: Enable get_actor_name for actor runtime context (#39347)
[core][streaming][python] Fix asyncio.wait coroutines args deprecated warnings #40292

📖Documentation:

The Ray streaming generator doc (alpha) is officially available at https://docs.ray.io/en/master/ray-core/ray-generator.html

Ray Clusters

💫Enhancements:

Enable GPU support for vSphere cluster launcher (#40667)

📖Documentation:

Setup RBAC by KubeRay Helm chart
KubeRay upgrade documentation
RayService high availability

🔨 Fixes:

Assorted fixes for vSphere cluster launcher (#40487, #40516, #40655)

Dashboard

🎉 New Features:

New metrics for ray data can be found in the Metrics tab. 🔨 Fixes:
Fix bug where download log button did not download all logs for actors.

Thanks

Many thanks to all who contributed to this release!

@scottjlee, @chappidim, @alexeykudinkin, @ArturNiederfahrenhorst, @stephanie-wang, @chaowanggg, @peytondmurray, @maxpumperla, @arvind-chandra, @iycheng, @JalinWang, @matthewdeng, @wfangchi, @z4y1b2, @alanwguo, @Zandew, @kouroshHakha, @justinvyu, @yuanchen8911, @vitsai, @hongchaodeng, @allenwang28, @caozy623, @ijrsvt, @omus, @larrylian, @can-anyscale, @joncarter1, @ericl, @lejara, @jjyao, @Ox0400, @architkulkarni, @edoakes, @raulchen, @bveeramani, @sihanwang41, @WeichenXu123, @zcin, @Codle, @dimakis, @simonsays1980, @cadedaniel, @angelinalg, @luv003, @JingChen23, @xwjiang2010, @rynewang, @Yicheng-Lu-llll, @scrivy, @michaelhly, @shrekris-anyscale, @xxnwj, @avnishn, @woshiyyya, @aslonnie, @amogkam, @krfricke, @pcmoritz, @liuyang-my, @jonathan-anyscale, @rickyyx, @scottsun94, @richardliaw, @rkooo567, @stefanbschneider, @kevin85421, @c21, @sven1977, @GeneDer, @matthew29tang, @RocketRider, @LaynePeng, @samhallam-reverb, @scv119, @huchen2021

ray-2.7.1

6 months ago

Release Highlights

Ray Serve:
- Added an application tag to the ray_serve_num_http_error_requests metric
- Fixed a bug where no data shows up on the Error QPS per Application panel in the Ray Dashboard
RLlib:
- DreamerV3: Bug fix enabling support for continuous actions.
Ray Train:
- Fix a bug where setting a local storage path on Windows errors (#39951)
Ray Tune:
- Fix a broken Trial.node_ip property (#40028)
Ray Core:
- Fixes a segfault when a streaming generator and actor cancel is used together
- Fix autoscaler sdk accidentally initialize ray worker leading to leaked driver showing up in the dashboard.
- Added a new user guide and fixes for the vSphere cluster launcher.
- Fixed a bug where ray start would occasionally fail with ValueError: acceleratorType should match v(generation)-(cores/chips).
Dashboard:
- Improvement on cluster page UI
- Fix a bug that overview page UI will crash

Ray Libraries

Ray Serve

🔨 Fixes:

Fixed a bug where no data shows up on the Error QPS per Application panel in the Ray Dashboard

RLlib

🔨 Fixes:

DreamerV3: Bug fix enabling support for continuous actions (39751).

Ray Core and Ray Clusters

🔨 Fixes:

Fixed Ray cluster stability on a high latency environment

Thanks

Many thanks to all those who contributed to this release!

@chaowanggg, @allenwang28, @shrekris-anyscale, @GeneDer, @justinvyu, @can-anyscale, @edoakes, @architkulkarni, @rkooo567, @rynewang, @rickyyx, @sven1977

ray-2.7.0

7 months ago

Release Highlights

Ray 2.7 release brings important stability improvements and enhancements to Ray libraries, with Ray Train and Ray Serve becoming generally available. Ray 2.7 is accompanied with a GA release of KubeRay.

Following user feedback, we are rebranding “Ray AI Runtime (AIR)” to “Ray AI Libraries”. Without reducing any of the underlying functionality of the original Ray AI runtime vision as put forth in Ray 2.0, the underlying namespace (ray.air) is consolidated into ray.data, ray.train, and ray.tune. This change reduces the friction for new machine learning (ML) practitioners to quickly understand and leverage Ray for their production machine learning use cases.
With this release, Ray Serve and Ray Train’s Pytorch support are becoming Generally Available -- indicating that the core APIs have been marked stable and that both libraries have undergone significant production hardening.
In Ray Serve, we are introducing a new backwards-compatible DeploymentHandle API to unify various existing Handle APIs, a high performant gRPC proxy to serve gRPC requests through Ray Serve, along with various stability and usability improvements.
In Ray Train, we are consolidating various Pytorch-based trainers into the TorchTrainer, reducing the amount of refactoring work new users needed to scale existing training scripts. We are also introducing a new train.Checkpoint API, which provides a consolidated way of interacting with remote and local storage, along with various stability and usability improvements.
In Ray Core, we’ve added initial integrations with TPUs and AWS accelerators, enabling Ray to natively detect these devices and schedule tasks/actors onto them. Ray Core also officially now supports actor task cancellation and has an experimental streaming generator that supports streaming response to the caller.

Take a look at our refreshed documentation and the Ray 2.7 migration guide and let us know your feedback!

Ray Libraries

Ray AIR

🏗 Architecture refactoring:

Ray AIR namespace: We are sunsetting the "Ray AIR" concept and namespace (#39516, #38632, #38338, #38379, #37123, #36706, #37457, #36912, #37742, #37792, #37023). The changes follow the proposal outlined in this REP.
Ray Train Preprocessors, Predictors: We now recommend using Ray Data instead of Preprocessors (#38348, #38518, #38640, #38866) and Predictors (#38209).

Ray Data

🎉 New Features:

In this release, we’ve integrated the Ray Core streaming generator API by default, which allows us to reduce memory footprint throughout the data pipeline (#37736).
Avoid unnecessary data buffering between Read and Map operator (zero-copy fusion) (#38789)
Add Dataset.write_images to write images (#38228)
Add Dataset.write_sql() to write SQL databases (#38544)
Support sort on multiple keys (#37124)
Support reading and writing JSONL file format (#37637)
Support class constructor args for Dataset.map() and flat_map() (#38606)
Implement streamed read from Hugging Face Dataset (#38432)

💫Enhancements:

Read data with multi-threading for FileBasedDataSource (#39493)
Optimization to reduce ArrowBlock building time for blocks of size 1 (#38988)
Add partition_filter parameter to read_parquet (#38479)
Apply limit to Dataset.take() and related methods (#38677)
Postpone reader.get_read_tasks until execution (#38373)
Lazily construct metadata providers (#38198)
Support writing each block to a separate file (#37986)
Make iter_batches an Iterable (#37881)
Remove default limit on Dataset.to_pandas() (#37420)
Add Dataset.to_dask() parameter to toggle consistent metadata check (#37163)
Add Datasource.on_write_start (#38298)
Remove support for DatasetDict as input into from_huggingface() (#37555)

🔨 Fixes:

Backwards compatibility for Preprocessor that have been fit in older versions (#39488)
Do not eagerly free root RefBundles (#39085)
Retry open files with exponential backoff (#38773)
Avoid passing local_uri to all non-Parquet data sources (#38719)
Add ctx parameter to Datasource.write (#38688)
Preserve block format on map_batches over empty blocks (#38161)
Fix args and kwargs passed to ActorPool map_batches (#38110)
Add tif file extension to ImageDatasource (#38129)
Raise error if PIL can't load image (#38030)
Allow automatic handling of string features as byte features during TFRecord serialization (#37995)
Remove unnecessary file system wrapping (#38299)
Remove _block_udf from FileBasedDatasource reads (#38111)

📖Documentation:

Standardize API references (#37015, #36980, #37007, #36982, etc)

Ray Train

🤝 API Changes

Ray Train and Ray Tune Checkpoints: Introduced a new train.Checkpoint class that unifies interaction with remote storage such as S3, GS, and HDFS. The changes follow the proposal in [REP35] Consolidated persistence API for Ray Train/Tune (#38452, #38481, #38581, #38626, #38864, #38844)
Ray Train with PyTorch Lightning: Moving away from the LightningTrainer in favor of the TorchTrainer as the recommended way of running distributed PyTorch Lightning. The changes follow the proposal outlined in [REP37] [Train] Unify Torch based Trainers on the TorchTrainer API (#37989)
Ray Train with Hugging Face Transformers/Accelerate: Moving away from the TransformersTrainer/AccelerateTrainer in favor of the TorchTrainer as the recommended way of running distributed Hugging Face Transformers and Accelerate. The changes follow the proposal outlined in [REP37] [Train] Unify Torch based Trainers on the TorchTrainer API (#38083, #38295)
Deprecated preprocessor arg to Trainer (#38640)
Removed deprecated Result.log_dir (#38794)

💫Enhancements:

Various improvements and fixes for the console output of Ray Train and Tune (#37572, #37571, #37570, #37569, #37531, #36993)
Raise actionable error message for missing dependencies (#38497)
Use posix paths throughout library code (#38319)
Group consecutive workers by IP (#38490)
Split all Ray Datasets by default (#38694)
Add static Trainer methods for getting tree-based models (#38344)
Don't set rank-specific local directories for Train workers (#38007)

🔨 Fixes:

Fix trainer restoration from S3 (#38251)

🏗 Architecture refactoring:

Updated internal usage of the new Checkpoint API (#38853, #38804, #38697, #38695, #38757, #38648, #38598, #38617, #38554, #38586, #38523, #38456, #38507, #38491, #38382, #38355, #38284, #38128, #38143, #38227, #38141, #38057, #38104, #37888, #37991, #37962, #37925, #37906, #37690, #37543, #37475, #37142, #38855, #38807, #38818, #39515, #39468, #39368, #39195, #39105, #38563, #38770, #38759, #38767, #38715, #38709, #38478, #38550, #37909, #37613, #38876, #38868, #38736, #38871, #38820, #38457)

📖Documentation:

Restructured the Ray Train documentation to make it easier to find relevant content (#37892, #38287, #38417, #38359)
Improved examples, references, and navigation items (#38049, #38084, #38108, #37921, #38391, #38519, #38542, #38541, #38513, #39510, #37588, #37295, #38600, #38582, #38276, #38686, #38537, #38237, #37016)
Removed outdated examples (#38682, #38696, #38656, #38374, #38377, #38441, #37673, #37657, #37067)

Ray Tune

🤝 API Changes

Ray Train and Ray Tune Checkpoints: Introduced a new train.Checkpoint class that unifies interaction with remote storage such as S3, GS, and HDFS. The changes follow the proposal in [REP35] Consolidated persistence API for Ray Train/Tune (#38452, #38481, #38581, #38626, #38864, #38844)
Removed deprecated Result.log_dir (#38794)

💫Enhancements:

Various improvements and fixes for the console output of Ray Train and Tune (#37572, #37571, #37570, #37569, #37531, #36993)
Raise actionable error message for missing dependencies (#38497)
Use posix paths throughout library code (#38319)
Improved the PyTorchLightning integration (#38883, #37989, #37387, #37400)
Improved the XGBoost/LightGBM integrations (#38558, #38828)

🔨 Fixes:

Fix hyperband r calculation and stopping (#39157)
Replace deprecated np.bool8 (#38495)
Miscellaneous refactors and fixes (#38165, #37506, #37181, #37173)

🏗 Architecture refactoring:

Updated internal usages of the new Checkpoint API (#38853, #38804, #38697, #38695, #38757, #38648, #38598, #38617, #38554, #38586, #38523, #38456, #38507, #38491, #38382, #38355, #38284, #38128, #38143, #38227, #38141, #38057, #38104, #37888, #37991, #37962, #37925, #37906, #37690, #37543, #37475, #37142, #38855, #38807, #38818, #39515, #39468, #39368, #39195, #39105, #38563, #38770, #38759, #38767, #38715, #38709, #38478, #38550, #37909, #37613, #38876, #38868, #38736, #38871, #38820, #38457)
Removed legacy TrialRunner/Executor (#37927)

Ray Serve

🎉 New Features:

Added keep_alive_timeout_s to Serve config file to allow users to configure HTTP proxy’s duration to keep idle connections alive when no requests are ongoing.
Added gRPC proxy to serve gRPC requests through Ray Serve. It comes with feature parity with HTTP while offering better performance. Also, replaces the previous experimental gRPC direct ingress.
Ray 2.7 introduces a new DeploymentHandle API that will replace the existing RayServeHandle and RayServeSyncHandle APIs in a future release. You are encouraged to migrate to the new API to avoid breakages in the future. To opt in, either use handle.options(use_new_handle_api=True) or set the global environment variable export RAY_SERVE_ENABLE_NEW_HANDLE_API=1. See https://docs.ray.io/en/latest/serve/model_composition.html for more details.
Added a new API get_app_handle that gets a handle used to send requests to an application. The API uses the new DeploymentHandle API.
Added a new developer API get_deployment_handle that gets a handle that can be used to send requests to any deployment in any application.
Added replica placement group support.
Added a new API serve.status which can be used to get the status of proxies and Serve applications (and their deployments and replicas). This is the pythonic equivalent of the CLI serve status.
A --reload option has been added to the serve run CLI.
Support X-Request-ID in http header

💫Enhancements:

Downstream handlers will now be canceled when the HTTP client disconnects or an end-to-end timeout occurs.
Ray Serve is now “generally available,” so the core APIs have been marked stable.
- serve.start and serve.run have a few small changes and deprecations in preparation for this, see https://docs.ray.io/en/latest/serve/api/index.html for details.
Added a new metric (ray_serve_num_ongoing_http_requests) to track the number of ongoing requests in each proxy
Add RAY_SERVE_MULTIPLEXED_MODEL_ID_MATCHING_TIMEOUT_S flag to wait until the model matching.
Reduce the multiplexed model id information publish interval.
Add Multiplex metrics into dashboard
Added metrics to track controller restarts and control loop progress
- https://github.com/ray-project/ray/pull/38177
- https://github.com/ray-project/ray/pull/38000
Various stability, flexibility, and performance enhancements to Ray Serve’s autoscaling.

🔨 Fixes:

Fixed a memory leak in Serve components by upgrading gRPC: https://github.com/ray-project/ray/issues/38591.
Fixed a memory leak due to asyncio.Events not being removed in the long poll host: https://github.com/ray-project/ray/pull/38516.
Fixed a bug where bound deployments could not be passed within custom objects: https://github.com/ray-project/ray/issues/38809.
Fixed a bug where all replica handles were unnecessarily broadcasted to all proxies every minute: https://github.com/ray-project/ray/pull/38539.
Fixed a bug where ray_serve_deployment_queued_queries wouldn’t decrement when clients disconnected: https://github.com/ray-project/ray/pull/37965.

📖Documentation:

Added docs for how to use keep_alive_timeout_s in the Serve config file.
Added usage and examples for serving gRPC requests through Serve’s gRPC proxy.
Added example for passing deployment handle responses by reference.
Added a Ray Serve Autoscaling guide to the Ray Serve docs that goes over basic configurations and autoscaling examples. Also added an Advanced Ray Serve Autoscaling guide that goes over more advanced configurations and autoscaling examples.
Added docs explaining how to debug memory leaks in Serve.
Added docs that explain how Serve cancels disconnected requests and how to handle those disconnections.

RLlib

🎉 New Features:

In Ray RLlib, we have implemented Google’s new DreamerV3, a sample-efficient, model-based, and hyperparameter hassle-free algorithm. It solves a wide variety of challenging reinforcement learning environments out-of-the-box (e.g. the MineRL diamond challenge), for arbitrary observation- and action-spaces as well as dense and sparse reward functions.

💫Enhancements:

Added support for Gymnasium 0.28.1 (#35698)
Dreamer V3 tuned examples and support for “XL” Dreamer models (#38461)
Added an action masking example for RL Modules (#38095)

🔨 Fixes:

Multiple fixes to DreamerV3 (#37979) (#38259) (#38461) (#38981)
Fixed TorchBinaryAutoregressiveDistribution.sampled_action_logp() returning probs not log probs. (#37240)
Fix a bug in Multi-Categorical distribution. It should use logp and not log_p. (#36814)
Index tensors in slate epsilon greedy properly so SlateQ does not fail on multiple GPUs (#37481)
Removed excessive deprecation warnings in exploration related files (#37404)
Fixed missing agent index in policy input dict on environment reset (#37544)

📖Documentation:

Added docs for DreamerV3 (#37978)
Added docs on torch.compile usage (#37252)
Added docs for the Learner API (#37729)
Improvements to Catalogs and RL Modules docs + Catalogs improvements (#37245)
Extended our metrics and callbacks example to showcase how to do custom summarisation on custom metrics (#37292)

Ray Core and Ray Clusters

Ray Core

🎉 New Features:

Actor task cancelation is officially supported.
The experimental streaming generator is now available. It means the yielded output is sent to the caller before the task is finished and overcomes the limitation from num_returns="dynamic" generator. The API could be used by specifying num_returns="streaming". The API has been used for Ray data and Ray serve to support streaming use cases. See the test script to learn how to use the API. The documentation will be available in a few days.

💫Enhancements:

Minimal Ray installation pip install ray doesn't require the Python grpcio dependency anymore.
[Breaking change] ray job submit now exits with 1 if the job fails instead of 0. To get the old behavior back, you may use ray job submit ... || true . (#38390)
[Breaking change] get_assigned_resources in pg will return the name of the original resources instead of formatted name (#37421)
[Breaking change] Every env var specified via ${ENV_VAR} now can be replaced. Previous versions only supported limited number of env vars. (#36187)
[Java] Update Guava package (#38424)
[Java] Update Jackson Databind XML Parsing (#38525)
[Spark] Allow specifying CPU / GPU / Memory resources for head node of Ray cluster on spark (#38056)

🔨 Fixes:

[Core] Internal gRPC version is upgraded from 1.46.6 to 1.50.2, which fixes the memory leak issue
[Core] Bind jemalloc to raylet and GCS (#38644) to fix memory fragmentation issue
[Core] Previously, when a ray is started with ray start --node-ip-address=..., the driver also had to specify ray.init(_node_ip_address). Now Ray finds the node ip address automatically. (#37644)
[Core] Child processes of workers are cleaned up automatically when a raylet dies (#38439)
[Core] Fix the issue where there are lots of threads created when using async actor (#37949)
[Core] Fixed a bug where tracing did not work when an actor/task was defined prior to calling ray.init: https://github.com/ray-project/ray/issues/26019
Various other bug fixes
- [Core] loosen the check on release object (#39570)
- [Core][agent] fix the race condition where the worker process terminated during the get_all_workers call #37953
- [Core]Fix PG leakage caused by GCS restart when PG has not been successfully remove after the job died (#35773)
- [Core]Fix internal_kv del api bug in client proxy mode (#37031)
- [Core] Pass logs through if sphinx-doctest is running (#36306)
- [Core][dashboard] Make intentional ray system exit from worker exit non task failing (#38624)
- [Core][dashboard] Add worker pid to task info (#36941)
- [Core] Use 1 thread for all fibers for an actor scheduling queue. (#37949)
- [runtime env] Fix Ray hangs when nonexistent conda environment is specified #28105 (#34956)

Ray Clusters

💫Enhancements:

New Cluster Launcher for vSphere #37815
TPU pod support for cluster launcher #37934

📖Documentation:

The KubeRay documentation has been moved to https://docs.ray.io/en/latest/cluster/kubernetes/index.html from its old location at https://ray-project.github.io/kuberay/.
New guide: GKE Ingress on KubeRay (39073)
New tutorial: Cloud storage from GKE on KubeRay #38858
New tutorial: Batch inference tutorial using KubeRay RayJob CR #38857
New benchmarks for RayService custom resource on KubeRay #38647
New tutorial: Text summarizer using NLP with RayService #38647

Thanks

Many thanks to all those who contributed to this release!

@simran-2797, @can-anyscale, @akshay-anyscale, @c21, @EdwardCuiPeacock, @rynewang, @volks73, @sven1977, @alexeykudinkin, @mattip, @Rohan138, @larrylian, @DavidYoonsik, @scv119, @alpozcan, @JalinWang, @peterghaddad, @rkooo567, @avnishn, @JoshKarpel, @tekumara, @zcin, @jiwq, @nikosavola, @seokjin1013, @shrekris-anyscale, @ericl, @yuxiaoba, @vymao, @architkulkarni, @rickyyx, @bveeramani, @SongGuyang, @jjyao, @sihanwang41, @kevin85421, @ArturNiederfahrenhorst, @justinvyu, @pleaseupgradegrpcio, @aslonnie, @kukushking, @94929, @jrosti, @MattiasDC, @edoakes, @PRESIDENT810, @cadedaniel, @ddelange, @alanwguo, @noahjax, @matthewdeng, @pcmoritz, @richardliaw, @vitsai, @Michaelvll, @tanmaychimurkar, @smiraldr, @wfangchi, @amogkam, @crypdick, @WeichenXu123, @darthhexx, @angelinalg, @chaowanggg, @GeneDer, @xwjiang2010, @peytondmurray, @z4y1b2, @scottsun94, @chappidim, @jovany-wang, @jaidisido, @krfricke, @woshiyyya, @Shubhamurkade, @ijrsvt, @scottjlee, @kouroshHakha, @allenwang28, @raulchen, @stephanie-wang, @iycheng

ray-2.6.3

8 months ago

The Ray 2.6.3 patch release contains fixes for Ray Serve, and Ray Core streaming generators.

Ray Core

🔨 Fixes:

[Core][Streaming Generator] Fix memory leak from the end of object stream object #38152 (#38206)

Ray Serve

🔨 Fixes:

[Serve] Fix serve run help message (#37859) (#38018)
[Serve] Decrement ray_serve_deployment_queued_queries when client disconnects (#37965) (#38020)

RLib

📖 Documentation:

[RLlib][docs] Learner API Docs (#37729) (#38137)