Ray Versions

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

ray-2.6.3

8 months ago

The Ray 2.6.3 patch release contains fixes for Ray Serve and for Ray Core streaming generators.

Ray Core

🔨 Fixes:

  • [Core][Streaming Generator] Fix memory leak from the end of object stream object #38152 (#38206)

Ray Serve

🔨 Fixes:

  • [Serve] Fix serve run help message (#37859) (#38018)
  • [Serve] Decrement ray_serve_deployment_queued_queries when client disconnects (#37965) (#38020)

RLlib

📖 Documentation:

  • [RLlib][docs] Learner API Docs (#37729) (#38137)

ray-2.6.2

8 months ago

The Ray 2.6.2 patch release contains a critical fix for Ray's logging setup, as well as fixes for Ray Serve, Ray Data, and Ray Jobs.

Ray Core

🔨 Fixes:

  • [Core] Pass logs through if sphinx-doctest is running (#36306) (#37879)
  • [cluster-launcher] Pick GCP cluster launcher tests and fix (#37797)

Ray Serve

🔨 Fixes:

  • [Serve] Apply request_timeout_s from Serve config to the cluster (#37884) (#37903)

Ray AIR

🔨 Fixes:

  • [air] fix pyarrow lazy import (#37670) (#37883)

ray-2.6.1

9 months ago

The Ray 2.6.1 patch release contains a critical fix for the cluster launcher, a compatibility update for the Ray Serve protobuf definition with Python 3.11, and documentation improvements.

⚠️ The cluster launcher in Ray 2.6.0 fails to start multi-node clusters. Please update to 2.6.1 if you plan to use the 2.6.0 cluster launcher.

Ray Core

🔨 Fixes:

  • [core][autoscaler] Fix env variable overwrite not able to be used if the command itself uses the env #37675

Ray Serve

🔨 Fixes:

  • [serve] Cherry-pick Serve enum to_proto fixes for Python 3.11 #37660

Ray AIR

📖 Documentation:

  • [air][doc] Update docs to reflect head node syncing deprecation #37475

ray-2.6.0

9 months ago

Release Highlights

  • Serve: Better streaming support -- in this release, support for HTTP streaming responses and WebSockets is on by default. Also, @serve.batch-decorated methods can stream responses (a streaming sketch follows this list).
  • Train and Tune: Users are now expected to provide a cloud storage or NFS path for distributed training or tuning jobs instead of a local path. Results written on different worker machines will no longer be synced directly to the head node; instead, Ray raises an error directing you to the recommended alternatives: cloud storage or NFS. Please see https://github.com/ray-project/ray/issues/37177 if you have questions.
  • Data: We are introducing a new streaming integration of Ray Data and Ray Train. This allows streaming data ingestion for model training and enables per-epoch data preprocessing. The DatasetPipeline API is also being deprecated in favor of Dataset with streaming execution.
  • RLlib: Public alpha release of the new multi-GPU Learner API, which is less complex and more powerful than our previous solution (blogpost). It is used by the PPO algorithm by default.
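
A minimal sketch of an HTTP streaming response from a Serve deployment (names are illustrative; assumes Ray 2.6+ with ray[serve] installed):

```python
from starlette.responses import StreamingResponse
from ray import serve

@serve.deployment
class Streamer:
    def __call__(self, request) -> StreamingResponse:
        def chunks():
            for i in range(5):
                # Each chunk is flushed to the client as soon as it is yielded.
                yield f"chunk {i}\n"
        return StreamingResponse(chunks(), media_type="text/plain")

app = Streamer.bind()
# serve.run(app); then `curl http://localhost:8000/` prints chunks as they arrive.
```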

Ray Libraries

Ray AIR

🎉 New Features:

  • Added support for restoring Results from local trial directories. (#35406)

💫 Enhancements:

  • [Train/Tune] Disable Train/Tune syncing to head node (#37142)
  • [Train/Tune] Introduce new console output progress reporter for Train and Tune (#35389, #36154, #36072, #35770, #36764, #36765, #36156, #35977)
  • [Train/Data] New Train<>Data streaming integration (#35236, #37215, #37383)

🔨 Fixes:

  • Pass on KMS-related kwargs for s3fs (#35938)
  • Fix infinite recursion in log redirection (#36644)
  • Remove temporary checkpoint directories after restore (#37173)
  • Removed actors that haven't been started shouldn't be tracked (#36020)
  • Fix bug in execution for actor re-use (#36951)
  • Cancel pg.ready() task for pending trials that end up reusing an actor (#35748)
  • Add case for Dict[str, np.array] batches in DummyTrainer read bytes calculation (#36484)

📖 Documentation:

  • Remove experimental features page, add github issue instead (#36950)
  • Fix batch format in dreambooth example (#37102)
  • Fix Checkpoint.from_checkpoint docstring (#35793)

🏗 Architecture refactoring:

  • Remove deprecated mlflow and wandb integrations (#36860, #36899)
  • Move constants from tune/results.py to air/constants.py (#35404)
  • Clean up a few checkpoint related things. (#35321)

Ray Data

🎉 New Features:

  • New streaming integration of Ray Data and Ray Train. This allows streaming data ingestion for model training, and enables per-epoch data preprocessing. (#35236)
  • Enable execution optimizer by default (#36294, #35648, #35621, #35952)
  • Deprecate DatasetPipeline (#35753)
  • Add Dataset.unique() (#36655, #36802) (see the sketch after this list)
  • Add option for parallelizing post-collation data batch operations in DataIterator.iter_batches() (#36842) (#37260)
  • Enforce strict mode batch format for DataIterator.iter_batches() (#36686)
  • Remove ray.data.range_arrow() (#35756)
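
A minimal sketch of the new Dataset.unique() API (illustrative data; result order is not guaranteed):

```python
import ray

ds = ray.data.from_items([{"x": 1}, {"x": 2}, {"x": 1}])
print(sorted(ds.unique("x")))  # [1, 2]
```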

💫 Enhancements:

  • Optimize block prefetching (#35568)
  • Enable isort for data directory (#35836)
  • Skip writing a file for an empty block in Dataset.write_datasource() (#36134)
  • Remove shutdown logging from StreamingExecutor (#36408)
  • Spread map task stages by default for arg size <50MB (#36290)
  • Read->SplitBlocks to ensure requested read parallelism is always met (#36352)
  • Support partial execution in Dataset.schema() with new execution plan optimizer (#36740)
  • Propagate iter stats for Dataset.streaming_split() (#36908)
  • Cache the computed schema to avoid re-executing (#37103)

🔨 Fixes:

  • Support sub-progress bars on AllToAllOperators with optimizer enabled (#34997)
  • Fix DataContext not propagated properly for Dataset.streaming_split() operator
  • Fix edge case in empty bundles with Dataset.streaming_split() (#36039)
  • Apply Arrow table indices mapping on HuggingFace Dataset prior to reading into Ray Data (#36141)
  • Fix issues with combining use of Dataset.materialize() and Dataset.streaming_split() (#36092)
  • Fix quadratic slowdown when locally shuffling tensor extension types (#36102)
  • Make sure progress bars always finish at 100% (#36679)
  • Fix wrong output order of Dataset.streaming_split() (#36919)
  • Fix the issue that StreamingExecutor is not shutdown when the iterator is not fully consumed (#36933)
  • Calculate stage execution time in StageStatsSummary from BlockMetadata (#37119)

📖 Documentation:

  • Standardize Data API ref (#36432, #36937)
  • Docs for working with PyTorch (#36880)
  • Split "Consuming data" guide (#36121)
  • Revise "Loading data" (#36144)
  • Consolidate Data user guides (#36439)

🏗 Architecture refactoring:

  • Remove simple blocks representation (#36477)

Ray Train

🎉 New Features:

  • LightningTrainer support DeepSpeedStrategy (#36165)

💫 Enhancements:

  • Unify Lightning and AIR CheckpointConfig (#36368)
  • Add support for custom pipeline class in TransformersPredictor (#36494)

🔨 Fixes:

  • Fix Deepspeed device ranks check in Lightning 2.0.5 (#37387)
  • Clear stale lazy checkpointing markers on all workers. (#36291)

📖 Documentation:

  • Migrate Ray Train code-block to testcode. (#36483)

🏗 Architecture refactoring:

  • Deprecate BatchPredictor (#36947, #37178)

Ray Tune

🔨 Fixes:

  • Optuna: Update distributions to use new APIs (#36704)
  • BOHB: Fix nested bracket processing (#36568)
  • Hyperband: Fix scheduler raising an error for good PENDING trials (#35338)
  • Fix param space placeholder injection for numpy/pandas objects (#35763)
  • Fix result restoration with Ray Client (#35742)
  • Fix trial runner/controller whitelist attributes (#35769)

📖 Documentation:

  • Remove missing example from Tune "Other examples" (#36691)

🏗 Architecture refactoring:

  • Remove tune/automl (#35557)
  • Remove hard-deprecated modules from structure refactor (#36984)
  • Remove deprecated mlflow and wandb integrations (#36860, #36899)
  • Move constants from tune/results.py to air/constants.py (#35404)
  • Deprecate redundant syncing related parameters (#36900)
  • Deprecate legacy modules in ray.tune.integration (#35160)

Ray Serve

💫 Enhancements:

  • Support for HTTP streaming response and WebSockets is now on by default.
  • @serve.batch-decorated methods can stream responses (a batching sketch follows this list).
  • @serve.batch settings can be reconfigured dynamically.
  • Ray Serve now uses “power of two random choices” routing. This improves enforcement of max_concurrent_queries and tail latencies under load.
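
For context, a minimal sketch of a @serve.batch-decorated method (names are illustrative; the batching parameters are examples, not defaults):

```python
from ray import serve

@serve.deployment
class BatchedEcho:
    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.1)
    async def handle_batch(self, inputs: list) -> list:
        # Invoked with a list of pending inputs; must return one output per input.
        return [s.upper() for s in inputs]

    async def __call__(self, request) -> str:
        # Callers pass a single element; Serve batches concurrent calls together.
        return await self.handle_batch((await request.body()).decode())
```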

🔨 Fixes:

  • Fixed a bug that previously made it impossible to use a custom module named “utils”.
  • Fixed a Serve downscaling issue by adding a new draining state to the HTTP proxy. HTTP proxies no longer take new requests when there are no replicas on the node, and ongoing requests are not interrupted when the node is downscaled. This also enables downscaling when requests use Ray’s object store, which previously blocked downscaling of the node.
  • Fixed non-atomic shutdown logic. Serve shutdown now runs in the background, does not require the client to wait for it to complete, and is not interrupted if the client is force-killed.

RLlib

🎉 New Features:

  • Public alpha release of the new multi-GPU Learner API, which is less complex and more powerful than the old training stack (blogpost). It is used by the PPO algorithm by default.
  • Added RNN support on the new RLModule API
  • Added a TF version of DreamerV3 (link). Comprehensive results will be published soon.
  • Added support for the torch 2.x compile method when sampling from the environment

💫 Enhancements:

  • Added an example of pretraining with BC and then fine-tuning with PPO (example)
  • RLlib deprecation notices (algorithm/, evaluation/, execution/, models/jax/) (#36826)
  • Enable eager_tracing=True by default. (#36556)

🔨 Fixes:

  • Fix bug in Multi-Categorical distribution. It should use logp and not log_p. (#36814)
  • Fix LSTM + Connector bug: StateBuffer restarting states on every in_eval() call. (#36774)

🏗 Architecture refactoring:

  • Multi-GPU Learner API

Ray Core

🎉 New Features:

  • [Core][Streaming Generator] Cpp interfaces and implementation (#35291)
  • [Core][Streaming Generator] Streaming Generator. Support Core worker APIs + cython generator interface. (#35324)
  • [Core][Streaming Generator] Streaming Generator. E2e integration (#35325)
  • [Core][Streaming Generator] Support async actor and async generator interface. (#35584)
  • [Core][Streaming Generator] Streaming Generator. Support the basic retry/lineage reconstruction (#35768)
  • [Core][Streaming Generator] Allow to raise an exception to avoid check failures. (#35766)
  • [Core][Streaming Generator] Fix a reference leak when a stream is deleted with out of order writes. (#35591)
  • [Core][Streaming Generator] Fix a reference leak when pinning requests are received after refs are consumed. (#35712)
  • [Core][Streaming Generator] Handle out of order report when retry (#36069)
  • [Core][Streaming Generator] Make it compatible with wait (#36071)
  • [Core][Streaming Generator] Remove busy waiting (#36070)
  • [Core][Autoscaler v2] add test for node provider (#35593)
  • [Core][Autoscaler v2] add unit tests for NodeProviderConfig (#35590)
  • [Core][Autoscaler v2] test ray-installer (#35875)
  • [Core][Autoscaler v2] fix too many values to unpack (expected 2) bug (#36231)
  • [Core][Autoscaler v2] Add idle time information to Autoscaler endpoint. (#36918)
  • [Core][Autoscaler v2] Cherry picks change to Autoscaler interface (#37407)
  • [Core][Autoscaler v2] Fix idle time duration when node resource is not updated periodically (#37121) (#37175)
  • [Core][Autoscaler v2] Fix pg id serialization with hex rather than binary for cluster state reporting #37132 (#37176)
  • [Core][Autoscaler v2] GCS Autoscaler V2: Add instance id to ray [3/x] (#35649)
  • [Core][Autoscaler v2] GCS Autoscaler V2: Add node type name to ray (#36714)
  • [Core][Autoscaler v2] GCS Autoscaler V2: Add placement group's gang resource requests handling [4/x] (#35970)
  • [Core][Autoscaler v2] GCS Autoscaler V2: Handle ReportAutoscalingState (#36768)
  • [Core][Autoscaler v2] GCS Autoscaler V2: Interface [1/x] (#35549)
  • [Core][Autoscaler v2] GCS Autoscaler V2: Node states and resource requests [2/x] (#35596)
  • [Core][Autoscaler v2] GCS Autoscaler V2: Support Autoscaler.sdk.request_resources [5/x] (#35846)
  • [Core][Autoscaler v2] Ray status interface [1/x] (#36894)
  • [Core][Autoscaler v2] Remove usage of grpcio from Autoscaler SDK (#36967)
  • [Core][Autoscaler v2] Update Autoscaler proto for default enum value (#36962)
  • [Core][Autoscalerv2] Update Autoscaler.proto / instance_manager.proto dependency (#36116)

💫 Enhancements:

  • [Core] Make some grpcio imports lazy (#35705)
  • [Core] Only instantiate gcs channels on driver (#36389)
  • [Core] Port GcsSubscriber to Cython (#35094)
  • [Core] Print out warning every 1s when sched_cls_id is greater than 100 (#35629)
  • [Core] Remove attrs dependency (#36270)
  • [Core] Remove dataclasses requirement (#36218)
  • [Core] Remove grpcio from Ray minimal dashboard (#36636)
  • [Core] Remove grpcio import from usage_lib (#36542)
  • [Core] remove import thread (#36293)
  • [Core] Remove Python grpcio from check_health (#36304)
  • [Core] Retrieve the token from GCS server [4/n] (#37003) (#37294)
  • [Core] Retry failed redis request (#35249)
  • [Core] Sending ReportWorkerFailure after the process died. (#35320)
  • [Core] Serialize auto-inits (#36127)
  • [Core] Support auto-init ray for get_runtime_context() (#35903)
  • [Core] Suppress harmless ObjectRefStreamEndOfStreamError when using asyncio (#37062) (#37200)
  • [Core] Unpin grpcio and make Ray run on mac M1 out of the box (#35932)
  • [Core] Add a better error message for health checking network failures (#36957) (#37366)
  • [Core] Add ClusterID to ClientCallManager [2/n] (#36526)
  • [Core] Add ClusterID token to GCS server [3/n] (#36535)
  • [Core] Add ClusterID token to GRPC server [1/n] (#36517)
  • [Core] Add extra metrics for workers (#36973)
  • [Core] Add get_worker_id() to runtime context (#35967)
  • [Core] Add logs for Redis leader discovery for observability. (#36108)
  • [Core] Add metrics for object size distribution in object store (#37005) (#37110)
  • [Core] Add resource idle time to resource report from node. (#36670)
  • [Core] Check that temp_dir must be absolute path. (#36431)
  • [Core] Clear CPU affinity for worker processes (#36816)
  • [Core] Delete object spilling dead code path. (#36286)
  • [Core] Don't drop rpc status in favor of reply status (#35530)
  • [Core] Feature flag actor task logs with off by default (#35921)
  • [Core] Graceful handling of returning bundles when node is removed (#34726)
  • [Core] Graceful shutdown in TaskEventBuffer destructor (#35857)
  • [Core] Guarantee the ordering of put ActorTaskSpecTable and ActorTable (#35683)
  • [Core] Introduce fail_on_unavailable option for hard NodeAffinitySchedulingStrategy (#36718) (see the sketch after this list)
  • [Core] Make “import ray” work without grpcio (#35737)
  • [Core][dashboard] Add task name in task log magic token (#35377)
  • [Core][deprecate run_function_on_all_workers 3/n] delete run_function_on_all_workers (#30895)
  • [Core][devex] Move ray/util build targets to separate build files (#36598)
  • [Core][logging][ipython] Fix log buffering when consecutive runs within ray log dedup window (#37134) (#37174)
  • [Core][Logging] Switch worker_setup_hook to worker_process_setup_hook (#37247) (#37463)
  • [Core][Metrics] Use Autoscaler-emitted metrics for pending/active/failed nodes. (#35884)
  • [Core][state] Record file offsets instead of logging magic token to track task log (#35572)
  • [CI] [Minimal install] Check python version in minimal install (#36887)
  • [CI] second try of fixing vllm example in CI #36712
  • [CI] skip vllm_example #36665
  • [CI][Core] Add more visibility into state api stress test (#36465)
  • [CI][Doc] Add windows 3.11 wheel support in doc and CI #37297 (#37302)
  • [CI][py3.11] Build python wheels on mac os for 3.11 (#36185)
  • [CI][python3.11] windows 3.11 wheel build
  • [CI][release] Add mac 3.11 wheels to release scripts (#36396)
  • [CI] Update state api scale test (#35543)
  • [Release Test] Fix dask on ray 1tb sort failure. (#36905)
  • [Release Test] Make the cluster name unique for cluster launcher release tests (#35801)
  • [Test] Deflakey gcs fault tolerance test in mac os (#36471)
  • [Test] Deflakey pubsub integration_test (#36284)
  • [Test] Change instance type to r5.8xlarge for dask_on_ray_1tb_sort (#37321) (#37409)
  • [Test] Move generators test to large (#35747)
  • [Test][Core] Handled the case where memories is empty for dashboard test (#35979)
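
A minimal sketch of the fail_on_unavailable option introduced above (assumes a running Ray cluster; with a hard affinity, the task fails fast instead of queueing forever if the pinned node cannot run it):

```python
import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

ray.init()

@ray.remote
def pinned_task():
    return "ran on the pinned node"

strategy = NodeAffinitySchedulingStrategy(
    node_id=ray.get_runtime_context().get_node_id(),  # pin to the current node
    soft=False,                # hard affinity: never spill to another node
    fail_on_unavailable=True,  # fail immediately if the node can't take the task
)
print(ray.get(pinned_task.options(scheduling_strategy=strategy).remote()))
```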

🔨 Fixes:

  • [Core] Fix GCS FD usage increase regression. (#35624)
  • [Core] Fix issues with worker churn in WorkerPool (#36766)
  • [Core] Fix proctitle for generator tasks (#36928)
  • [Core] Fix ray.timeline() (#36676)
  • [Core] Fix raylet memory leak in the wrong setup. (#35647)
  • [Core] Fix test_no_worker_child_process_leaks (#35840)
  • [Core] Fix the GCS crash when connecting to a redis cluster with TLS (#36916)
  • [Core] Fix the race condition where grpc requests are handled while c… (#37301)
  • [Core] Fix the recursion error when async actor has lots of deserialization. (#35494)
  • [Core] Fix the segfault from Opencensus upon shutdown (#36906) (#37311)
  • [Core] Fix the unnecessary logs (#36931) (#37313)
  • [Core] Add a special head node resource and use it to pin the serve controller to the head node (#35929)
  • [Core] Add debug log for serialized object size (#35992)
  • [Core] Cache schema and test (#37103) (#37201)
  • [Core] Fix 'ray stack' on macOS (#36100)
  • [Core] Fix a wrong metrics setup link from the doc. (#37312) (#37367)
  • [Core] Fix lint (#35844)(#36739)
  • [Core] Fix literalinclude path (#35660)
  • [Core] Fix microbenchmark (#35823)
  • [Core] Fix single_client_wait_1k perf regression (#35614)
  • [Core] Get rid of shared_ptr for GcsNodeManager (#36738)
  • [Core] Remove extra step in M1 installation instructions (#36029)
  • [Core] Remove unnecessary AsyncGetResources in NodeManager::NodeAdded (#36412)
  • [Core] Unskip test_Autoscaler_shutdown_node_http_everynode (#36420)
  • [Core] Unskip test_get_release_wheel_url for mac (#36430)

📖 Documentation:

  • [Doc] Clarify that session can also mean a ray cluster (#36422)
  • [Doc] Fix doc build on M1 (#35689)
  • [Doc] Fix documentation failure due to typing_extensions (#36732)
  • [Doc] Make doc code snippet testable [3/n] (#35407)
  • [Doc] Make doc code snippet testable [4/n] (#35506)
  • [Doc] Make doc code snippet testable [5/n] (#35562)
  • [Doc] Make doc code snippet testable [7/n] (#36960)
  • [Doc] Make doc code snippet testable [8/n] (#36963)
  • [Doc] Some instructions on how to size the head node (#36429)
  • [Doc] Fix doc for runtime-env-auth (#36421)
  • [Doc][dashboard][state] Promote state api and dashboard usage in Core user guide. (#35760)
  • [Doc][python3.11] Update mac os wheels built link (#36379)
  • [Doc] [typo] Rename acecelerators.md to accelerators.md (#36500)

Many thanks to all those who contributed to this release!

@ericl, @ArturNiederfahrenhorst, @sihanwang41, @scv119, @aslonnie, @bluecoconut, @alanwguo, @krfricke, @frazierprime, @vitsai, @amogkam, @GeneDer, @jovany-wang, @gjoliver, @simran-2797, @rkooo567, @shrekris-anyscale, @kevin85421, @angelinalg, @maxpumperla, @kouroshHakha, @Yard1, @chaowanggg, @justinvyu, @fantow, @Catch-Bull, @cadedaniel, @ckw017, @hora-anyscale, @rickyyx, @scottsun94, @XiaodongLv, @SongGuyang, @RocketRider, @stephanie-wang, @inpefess, @peytondmurray, @sven1977, @matthewdeng, @ijrsvt, @MattiasDC, @richardliaw, @bveeramani, @rynewang, @woshiyyya, @can-anyscale, @omus, @eax-anyscale, @raulchen, @larrylian, @Deegue, @Rohan138, @jjyao, @iycheng, @akshay-anyscale, @edoakes, @zcin, @dmatrix, @bryant1410, @WanNJ, @architkulkarni, @scottjlee, @JungeAlexander, @avnishn, @harisankar95, @pcmoritz, @wuisawesome, @mattip

ray-2.5.1

10 months ago

The Ray 2.5.1 patch release adds wheels for macOS for Python 3.11. It also contains fixes for multiple components, along with fixes for our documentation.

Ray Train

🔨 Fixes:

  • Don't error on eventual success when running with auto-recovery (#36266)

Ray Core

🎉 New Features:

  • Build Python wheels on Mac OS for Python 3.11 (#36373)

🔨 Fixes:

  • [Autoscaler] Fix a bug that can cause undefined behavior when clusters attempt to scale up aggressively. (#36241)
  • Fix mypy error: Module "ray" does not explicitly export attribute "remote" (#36356)

ray-2.5.0

10 months ago

The Ray 2.5 release focuses on a number of enhancements and improvements across the Ray ecosystem, including:

  • Training LLMs with Ray Train: New support for checkpointing distributed models, and PyTorch Lightning FSDP support to enable training large models with Ray Train’s LightningTrainer
  • LLM applications with Ray Serve & Core: New support for streaming responses and model multiplexing
  • Improvements to Ray Data: In 2.5, strict mode is enabled by default. This means that schemas are required for all Datasets, and standalone Python objects are no longer supported. Also, the default batch format is now NumPy, giving better performance for batch inference (see the sketch after this list).
  • RLlib enhancements: New support for multi-GPU training, along with ray-project/rllib-contrib, which hosts community-contributed algorithms
  • Core enhancements: Enabled lightweight resource broadcasting to improve reliability and scalability, and added many enhancements to Core reliability, logging, the scheduler, and worker processes.
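
A minimal sketch of the strict-mode NumPy default (illustrative; in strict mode, batches are dicts of NumPy arrays keyed by column name):

```python
import ray

ds = ray.data.range(8)  # a single "id" column in strict mode

def double(batch: dict) -> dict:
    batch["id"] = batch["id"] * 2  # batch["id"] is a NumPy array
    return batch

print(ds.map_batches(double).take_batch())  # {'id': array([ 0,  2,  4, ..., 14])}
```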

Ray Libraries

Ray AIR

💫 Enhancements:

  • Experiment restore stress tests (#33706)
  • Context-aware output engine
    • Add parameter columns to status table (#35388)
    • Context-aware output engine: Add docs, experimental feature docs, prepare default on (#35129)
    • Fix trial status at end (more info + cut off) (#35128)
    • Improve leaked mentions of Tune concepts (#35003)
    • Improve passed time display (#34951)
    • Use flat metrics in results report, use Trainable._progress_metrics (#35035)
    • Print experiment information at experiment start (#34952)
    • Print single trial config + results as table (#34788)
    • Print out worker ip for distributed train workers. (#33807)
    • Minor fix to print configuration on start. (#34575)
    • Check air_verbosity against None. (#33871)
    • Better wording for empty config. (#33811)
  • Flatten config and metrics before passing to mlflow (#35074)
  • Remote_storage: Prefer fsspec filesystems over native pyarrow (#34663)
  • Use filesystem wrapper to exclude files from upload (#34102)
  • GCE test variants for air_benchmark and air_examples (#34466)
  • New storage path configuration
    • Add RunConfig.storage_path to replace SyncConfig.upload_dir and RunConfig.local_dir. (#33463) (see the sketch after this list)
    • Use Ray storage URI as default storage path, if configured [no_early_kickoff] (#34470)
    • Move to new storage_path API in tests and examples (#34263)
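
A minimal sketch of the new storage_path configuration (the bucket name and trainable are hypothetical):

```python
from ray import tune
from ray.air import RunConfig, session

def trainable(config):  # hypothetical trainable
    session.report({"score": config["x"]})

tuner = tune.Tuner(
    trainable,
    param_space={"x": tune.grid_search([1, 2, 3])},
    run_config=RunConfig(
        name="my_experiment",
        storage_path="s3://my-bucket/ray-results",  # replaces SyncConfig.upload_dir
    ),
)
results = tuner.fit()
```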

🔨 Fixes:

  • Store unflattened metrics in _TrackedCheckpoint (#35658) (#35706)
  • Fix test_tune_torch_get_device_gpu race condition (#35004)
  • Deflake test_e2e_train_flow.py (#34308)
  • Pin deepspeed version for now to unblock ci. (#34406)
  • Fix AIR benchmark configuration link failure. (#34597)
  • Fix unused config building function in lightning MNIST example.

📖 Documentation:

  • Change doc occurrences of ray.data.Dataset to ray.data.Datastream (#34520)
  • DreamBooth example: Fix code for batch size > 1 (#34398)
  • Synced tabs in AIR getting started (#35170)
  • New Ray AIR link for try it out (#34924)
  • Correctly Render the Enumerate Numbers in convert_torch_code_to_ray_air (#35224)

Ray Data Processing

🎉 New Features:

  • Implement Strict Mode and enable it by default.
  • Add column API to Dataset (#35241)
  • Configure progress bars via DataContext (#34638)
  • Support using concurrent actors for ActorPool (#34253)
  • Add take_batch API for collecting data in the same format as iter_batches and map_batches (#34217)

💫 Enhancements:

  • Improve map batches error message for strict mode migration (#35368)
  • Improve docstring and warning message for from_huggingface (#35206)
  • Improve notebook widget display (#34359)
  • Implement some operator fusion logic for the new backend (#35178 #34847)
  • Use wait based prefetcher by default (#34871)
  • Implement limit physical operator (#34705 #34844)
  • Require compute spec to be explicitly spelled out #34610
  • Log a warning if the batch size is misconfigured in a way that would grossly reduce parallelism for actor pool. (#34594)
  • Add alias parameters to the aggregate function, and add quantile fn (#34358)
  • Improve repr for Arrow Table and pandas types (#34286 #34502)
  • Defer first block computation when reading a Datasource with schema information in metadata (#34251)
  • Improve handling of KeyboardInterrupt (#34441)
  • Validate aggregation key in Aggregate LogicalOperator (#34292)
  • Add usage tag for which block formats are used (#34384)
  • Validate sort key in Sort LogicalOperator (#34282)
  • Combine_chunks before chunking pyarrow.Table block into batches (#34352)
  • Use read stage name for naming Data-read tasks on Ray Dashboard (#34341)
  • Update path expansion warning (#34221)
  • Improve state initialization for ActorPoolMapOperator (#34037)

🔨 Fixes:

  • Fix ipython representation (#35414)
  • Fix bugs in handling of nested ndarrays (and other complex object types) (#35359)
  • Capture the context when the dataset is first created (#35239)
  • Cooperatively exit producer threads for iter_batches (#34819)
  • Autoshutdown executor threads when deleted (#34811)
  • Fix backpressure when reading directly from input datasource (#34809)
  • Fix backpressure handling of queued actor pool tasks (#34254)
  • Fix row count after applying filter (#34372)
  • Remove unnecessary setting of global logging level to INFO when using Ray Data (#34347)
  • Make sure the tf and tensor iteration work in dataset pipeline (#34248)
  • Fix '_unwrap_protocol' for Windows systems (#31296)

📖 Documentation:

  • Add batch inference object detection example (#35143)
  • Refine batch inference doc (#35041)

Ray Train

🎉 New Features:

  • Experimental support for distributed checkpointing (#34709)

💫 Enhancements:

  • LightningTrainer: Enable prog bar (#35350)
  • LightningTrainer enable checkpoint full dict with FSDP strategy (#34967)
  • Support FSDP Strategy for LightningTrainer (#34148)

🔨 Fixes:

  • Fix HuggingFace -> Transformers wrapping logic (#35276, #35284)
  • LightningTrainer always resumes from the latest AIR checkpoint during restoration. (#35617) (#35791)
  • Fix lightning trainer devices setting (#34419)
  • TorchCheckpoint: Specifying pickle_protocol in torch.save() (#35615) (#35790)

📖 Documentation:

  • Improve visibility of Trainer restore and stateful callback restoration (#34350)
  • Fix rendering of diff code-blocks (#34355)
  • LightningTrainer Dolly V2 FSDP Fine-tuning Example (#34990)
  • Update LightningTrainer MNIST example. (#34867)
  • LightningTrainer Advanced Example (#34082, #34429)

🏗 Architecture refactoring:

  • Restructure ray.train HuggingFace modules (#35270) (#35488)
  • rename _base_dataset to _base_datastream (#34423)

Ray Tune

🎉 New Features:

  • Ray Tune's new execution path is now enabled by default (#34840, #34833)

💫 Enhancements:

  • Make `Tuner.restore(trainable=...)` a required argument (#34982)
  • Enable tune.ExperimentAnalysis to pull experiment checkpoint files from the cloud if needed (#34461)
  • Add support for nested hyperparams in PB2 (#31502)
  • Release test for durable multifile checkpoints (#34860)
  • GCE variants for remaining Tune tests (#34572)
  • Add tune frequent pausing release test. (#34501)
  • Add PyArrow to ray[tune] dependencies (#34397)
  • Fix new execution backend for BOHB (#34828)

🔨 Fixes:

  • Set config on trial restore (#35000)
  • Fix test_tune_torch_get_device_gpu race condition (#35004)
  • Fix a typo in tune/execution/checkpoint_manager state serialization. (#34368)
  • Fix tune_scalability_network_overhead by adding --smoke-test. (#34167)
  • Fix lightning_gpu_tune_.* release test (#35193)

📖 Documentation:

  • Fix Tune tutorial (#34660)
  • Fix typo in Tune restore guide (#34247)

🏗 Architecture refactoring:

  • Use Ray-provided tabulate package (#34789)

Ray Serve

🎉 New Features:

  • Add support for JSON logging format. (#35118)
  • Add experimental support for model multiplexing. (#35399, #35326) (see the sketch after this list)
  • Added experimental support for HTTP StreamingResponses. (#35720)
  • Add support for application builders & arguments (#34584)
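
A minimal sketch of the experimental model multiplexing API (names are illustrative; load_model is a hypothetical loader keyed by model ID):

```python
from ray import serve

@serve.deployment
class Multiplexer:
    @serve.multiplexed(max_num_models_per_replica=2)
    async def get_model(self, model_id: str):
        return load_model(model_id)  # hypothetical loader

    async def __call__(self, request):
        # The model ID is taken from the "serve_multiplexed_model_id" request header.
        model_id = serve.get_multiplexed_model_id()
        model = await self.get_model(model_id)
        return model(request)
```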

💫 Enhancements:

  • Add more bucket sizes for histogram metrics. (#35242)
  • Add route information into the custom metrics. (#35246)
  • Add HTTPProxy details to Serve Dashboard UI (#35159)
  • Add status_code to http qps & latency (#35134)
  • Stream Serve logs across different drivers (#35070)
  • Add health checking for http proxy actors (#34944)
  • Better surfacing of errors in serve status (#34773)
  • Enable TLS on gRPCIngress if RAY_USE_TLS is on (#34403)
  • Wait until replicas have finished recovering (with timeout) to broadcast LongPoll updates (#34675)
  • Replace ClassNode and FunctionNode with Application in top-level Serve APIs (#34627)

🔨 Fixes:

  • Set app_msg to empty string by default (#35646)
  • Fix dead replica counts in the stats. (#34761)
  • Add default app name (#34260)
  • gRPC Deployment schema check & minor improvements (#34210)

📖 Documentation:

  • Clean up API reference and various docstrings (#34711)
  • Clean up RayServeHandle and RayServeSyncHandle docstrings & typing (#34714)

RLlib

🎉 New Features:

  • We are migrating approximately 25 of the 30 algorithms from RLlib into rllib_contrib. You can review the REP here. This release covers A3C and MAML.
  • APPO, IMPALA, and PPO have all been moved to the new Learner and RLModule stack.
  • RLModule now supports checkpointing. (#34717, #34760)

💫 Enhancements:

  • Introduce experimental larger than GPU train batch size feature for torch (#34189)
  • Change occurrences of "_observation_space_in_preferred_format" to "_obs_space_in_preferred_format" (#33907)
  • Add a flag to allow disabling initialize_loss_from_dummy_batch logit. (#34208)
  • Remove check specs from default Model forward code path to improve performance (#34877)
  • Remove some specs from encoders to smoothen dev experience (#34911)

🔨 Fixes:

  • Fix MultiCallbacks class: To be used only with utility function that returns a class to use in the config. (#33863)
  • Fix test backward compatibility test for RL Modules (#33857)
  • Don't serialize config in Policy states (unless needed for msgpack-type checkpoints). (#33865)
  • DM control suite wrapper fix: dtype of obs needs to be pinned to float32. (#33876)
  • In the Json_writer, convert all non-string keys to strings (#33896)
  • Fixed a bug with kl divergence calculation of torch.Dirichlet distribution within RLlib (#34209)
  • Change broken link in parameter_noise.py (#34231)
  • Fixed bug in restoring a gpu trained algorithm (#35024)
  • Fix IMPALA/APPO when using multi GPU setup and Multi-Agent Env (#35120)

📖 Documentation:

  • Add examples and docs for Catalog (#33898)

Ray Core and Ray Clusters

Ray Core

🎉 New Features:

  • Support both the sync and async actor generator interfaces (see the sketch below). (#35584 #35708 #35324 #35656 #35803 #35794 #35707)
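
A minimal sketch of an actor method used as a generator (illustrative; the number of yielded values must match num_returns):

```python
import ray

@ray.remote
class Counter:
    def count_up(self, n: int):
        for i in range(n):
            yield i  # each yielded value becomes its own object ref

counter = Counter.remote()
refs = counter.count_up.options(num_returns=3).remote(3)
print(ray.get(list(refs)))  # [0, 1, 2]
```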

💫 Enhancements:

  • [Scheduler] Introduce spill_on_unavailable option for soft NodeAffinitySchedulingStrategy (#34224)
  • [Data] Use wait based prefetcher by default (#34871)
  • [Reliability] During GCS restarts, grpc based resource broadcaster should only add ALIVE nodes during initialization (#35349)
  • [Reliability] Guarantee the ordering of put ActorTaskSpecTable and ActorTable (#35683) (#35718)
  • [Reliability] Graceful handling of returning bundles when node is removed (#34726) (#35542)
  • [Reliability] Task backend - marking tasks failed on worker death (#33818)
  • [Reliability] Task backend - Add worker dead info to failed tasks when job exits. (#34166)
  • [Logging] Make ray.get(timeout=0) to throw timeout error (#35126)
  • [Logging] Provide a good error message if the fractional resource precision is beyond 0.0001 (#34590)
  • [Logging] Add debug logs to show UpdateResourceUsage rpc source (#35062)
  • [Logging] Add actor_id as an attribute of RayActorError when the actor constructor fails (#34958)
  • [Logging] Worker startup hook (#34738)
  • [Worker] Partially addresses ray child process leaks by killing all child processes in the CoreWorker shutdown sequence. (#33976)
  • [Worker] Change worker niceness in job submission environment (#34727)
  • Shorten the membership checking time to 5 seconds. (#34769)
  • [Syncer] Remove spammy logs. (#34654)
  • [Syncer] Delete disconnected node view in ray syncer when connection is broken. (#35312)
  • [Syncer] Turn on ray syncer again. (#35116)
  • [Syncer] Start ray syncer reconnection after a delay (#35115)
  • Serialize requests in the redis store client. (#35123)
  • Reduce self alive check from 60s to 5s. (#34992)
  • Add object owner and copy metrics to node stats (#35119)
  • Start the synchronization connection after receiving all nodes info. (#34645)
  • Improve the workflow finding Redis leader. (#34108)
  • Make execute_after accept chrono (#35099)
  • Lazy import autoscaler + don't import opentelemetry unless setup hook (#33964)

🔨 Fixes:

  • [pubsub] Handle failures when publish fails. (#33115)
  • Convert gcs port read from env variable from str to int (#34482)
  • Fix download_wheels.sh wheel urls (#34616)
  • Fix ray start command output (#34081)
  • Fetch_local once for each object ref (#34884)
  • Combine_chunks before chunking pyarrow.Table block into batches (#34352)
  • Replace deprecated usage of get_runtime_context().node_id (#34874)
  • Fix std::move without std namespace (#34149)
  • Fix the recursion error when an async actor has lots of deserialization. (#35494) (#35532)
  • Fix async actor shutdown issue when exit_actor is used (#32407)
  • [Event] Fix incorrect event timestamp (#34402)
  • [Metrics] Fix shared memory is not displayed properly (#34301)
  • Fix GCS FD usage increase regression. (#35624) (#35738)
  • Fix raylet memory leak in the wrong setup. (#35647) (#35673)
  • Retry failed redis request (#35249) (#35481)
  • Add more messages when accessing a dead actor. (#34697)
  • Fix the placement group stress test regression. (#34192)
  • Mark Raylet unhealthy if GCS can't recognize it. (#34087)
  • Remove multiple core workers in one process 2/n (#34942)
  • Remove python 3.6 support (#34373 #34416)

📖 Documentation:

  • Make doc code snippet testable (#35274 #35057)
  • Revamp ray core api reference [1/n] (#34428)
  • Add Ray core fault tolerance guide for GCS and node (#33446)
  • Ray Debugging Doc Part 1 (OOM) (#34309)
  • Rewrite the placement group documentation (#33518)

Ray Clusters

💫 Enhancements:

  • [Docker] [runtime env] Bump boto3 version from 1.4.8 to 1.26.82, add pyOpenSSL and cryptography (#33273)
  • [Jobs] Fix race condition in supervisor actor creation and add timeout for pending jobs (#34223)
  • [Release test] [Cluster launcher] Add gcp minimal and full cluster launcher release test (#34878)
  • [Release test] [Cluster launcher] Add release test for aws example-full.yaml (#34487)

📖 Documentation:

  • [runtime env] Clarify conditions for local pip and conda requirements files (#34071)
  • [KubeRay] Provide GKE instructions in KubeRay example (#33339)
  • [KubeRay] Update KubeRay doc for release v0.5.0 (#34178)

Dashboard

💫 Enhancements:

  • Feature flag task logs recording (#34056)
  • Fix log proxy not loading non text/plain files. (#33870)
  • [no_early_kickoff] Make dashboard address connectable from remote nodes when not set to 127.0.0.1 (localhost) (#35027)
  • [state][job] Supporting job listing(getting) and logs from state API (#35124)
  • [state][ci] Fix stress_test_state_api_scale (#35332)
  • [state][dashboard][log] Fix subdirectory log getting (#35283)
  • [state] Push down filtering to GCS for listing/getting task from state api (#35109)(#34433)
  • [state] Task log - Improve log tailing from log_client and support tailing from offsets [2/4] (#28188)
  • [state] Use --err flag to query stderr logs from worker/actors instead of --suffix=err (#34300)
  • [state][no_early_kickoff] Make state api return results that are strongly typed (#34297)
  • [state] Efficient get/list actors with filters on some high-cardinality fields (#34348)
  • [state] Fix list nodes test in test_state_api.py (#34349)
  • [state] Add head node flag is_head_node to state API and GcsNodeInfo (#34299)
  • Make actor tasks' name default to <actor_repr>.<task_name> (#35371)
  • Task backend GC policy - worker update [1/3] (#34896)
  • [state] Support task logs from state API (#35101)

Known Issues

  • A bug in the Autoscaler can cause undefined behavior when clusters attempt to scale up aggressively. This is fixed in subsequent releases, as well as post-release on the 2.5.0 branch (#36482).

Many thanks to all those who contributed to this release!

@vitsai, @XiaodongLv, @justinvyu, @Dan-Yeh, @dependabot[bot], @alanwguo, @grimreaper, @yiwei00000, @pomcho555, @ArturNiederfahrenhorst, @maxpumperla, @jjyao, @ijrsvt, @sven1977, @Yard1, @pcmoritz, @c21, @architkulkarni, @jbedorf, @amogkam, @ericl, @jiafuzha, @clarng, @shrekris-anyscale, @matthewdeng, @gjoliver, @jcoffi, @edoakes, @ethanabrooks, @iycheng, @Rohan138, @angelinalg, @Linniem, @aslonnie, @zcin, @wuisawesome, @Catch-Bull, @woshiyyya, @avnishn, @jjyyxx, @jianoaix, @bveeramani, @sihanwang41, @scottjlee, @YQ-Wang, @mattip, @can-anyscale, @xwjiang2010, @fedassembly, @joncarter1, @robin-anyscale, @rkooo567, @DACUS1995, @simran-2797, @ProjectsByJackHe, @zen-xu, @ashahab, @larrylian, @kouroshHakha, @raulchen, @sofianhnaide, @scv119, @nathan-az, @kevin85421, @rickyyx, @Sahar-E, @krfricke, @chaowanggg, @peytondmurray, @cadedaniel

ray-2.4.0

1 year ago

Ray 2.4 - Generative AI and LLM support

Over the last few months, we have seen a flurry of innovative activity around generative AI models and large language models (LLM). To continue our effort to ensure Ray provides a pivotal compute substrate for generative AI workloads and addresses the challenges (as explained in our blog series), we have invested engineering efforts in this release to ensure that these open source LLM models and workloads are accessible to the open source community and performant with Ray.

This release includes new examples for training, batch inference, and serving with your own LLM.

Generative AI and LLM Examples

Ray Train enhancements

  • We're introducing the LightningTrainer, allowing you to scale your PyTorch Lightning workloads on Ray. As part of our continued effort toward seamless integration and ease of use, we have enhanced and replaced our existing, widely adopted ray_lightning integration to keep up with the latest changes in PyTorch Lightning.
  • We're releasing an AccelerateTrainer, allowing you to run HuggingFace Accelerate and DeepSpeed on Ray with minimal code changes. This Trainer integrates with the rest of the Ray ecosystem, including the ability to run distributed hyperparameter tuning with each trial being a distributed training job.

Ray Data highlights

  • Streaming execution is enabled by default, providing users with a more efficient data processing pipeline that can handle larger datasets and minimize memory consumption. Check out the docs here: (doc)
    • Note that this means data output may no longer preserve the original order in more cases. To retain the original ordering properties of Ray Data prior to 2.4, you can set the config ray.data.DatasetContext.get_current().execution_options.preserve_order = True.
  • We've implemented asynchronous batch prefetching of Dataset.iter_batches (doc), improving performance by fetching data in parallel while the main thread continues processing, thus reducing waiting time.
  • Support reading SQL databases (doc), enabling users to seamlessly integrate relational databases into their Ray Data workflows (see the sketch after this list).
  • Introduced support for reading WebDataset (doc), a common format for high-performance deep learning training jobs.
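
A minimal sketch of reading from a SQL database (a local SQLite file here; the database and table names are hypothetical):

```python
import sqlite3
import ray

def create_connection():
    return sqlite3.connect("example.db")  # hypothetical database file

# Each read task opens its own connection via the factory.
ds = ray.data.read_sql("SELECT title, year FROM movies", create_connection)
```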

Ray Serve highlights

  • Multi-app CLI & REST API support is now available, allowing users to manage multiple applications with different configurations within a single Ray Serve deployment. This simplifies deployment and scaling processes for users with multiple applications. (doc)
  • Enhanced logging and metrics for Serve applications, giving users better visibility into their application's performance and facilitating easier debugging and monitoring. (doc)

Other enhancements

Ray Libraries

Ray AIR

💫 Enhancements:

  • Add nightly test for alpa opt 30b inference. (#33419)
  • Add a sanity checking release test for Alpa and ray nightly. (#32995)
  • Add TorchDetectionPredictor (#32199)
  • Add artifact_location, run_name to MLFlow integration (#33641)
  • Add *path properties to Result and ResultGrid (#33410)
  • Make Preprocessor.transform lazy by default (#32872)
  • Make BatchPredictor lazy (#32510, #32796)
  • Use a configurable ray temp directory for the TempFileLock util (#32862)
  • Add collate_fn to iter_torch_batches (#32412)
  • Allow users to pass Callable[[torch.Tensor], torch.Tensor] to TorchVisionTransform (#32383)
  • Automatically move DatasetIterator torch tensors to correct device (#31753)

🔨 Fixes:

  • Fix use_gpu with HuggingFacePredictor (#32333)
  • Make Keras Callback raise DeprecationWarning (#33775)
  • Pin framework to tf in AIR rl offline trainer example (#33750)
  • Fix test_tracked_actor (#33075)
  • Label Checkpoint.from_checkpoint as developer API (#33094)
  • Don't make lock files when moving dirs (#32924)
  • Avoid inconsistency when creating a filesystem from a URI for the HDFS case (#30611)
  • Fix DatasetIterator backwards compatibility (#32526)
  • Fix CountVectorizer failing with big data (#32351)
  • Fix NoneType error loading TorchCheckpoint through from_uri. (#32386)
  • Fix dtype type hint in DLPredictor methods (#32198)
  • Allow None in set_preprocessor (#33088)

📖 Documentation:

  • Add dreambooth example + release test (#33025)
  • GPT-J fine tuning with DeepSpeed example (#33090)
  • GPT-J Serving Example (#33114)
  • Add object detection example (#31553)
  • Add computer vision guide (#32885)
  • Add Pytorch ResNet batch prediction example (#32470)
  • Large model inference examples (#32874)
  • Add data ingestion clarification for AIR converting existing pytorch code example (#32058)
  • Add BatchPredictor.from_checkpoint to docs (#32877)
  • Fix wording of Many model training guidance (#32319)
  • Add required dependencies for batch prediction notebook (#33897)
  • Added link to preprocessors in Ray AIR Key Concepts page (#33526)
  • Rewording in analyze_tuning_results.ipynb (#32671)

🏗 Architecture refactoring:

  • Deprecations for 2.4 (#33765)
  • Deprecate TensorflowCheckpoint.get_model model_definition parameter (#33776)

Ray Data Processing

🎉 New Features:

  • Enable streaming execution by default (#32493)
  • Support asynchronous batch prefetching of Dataset.iter_batches() (#33620)
  • Introduce Dataset.materialize() API (#34184)
  • Add Dataset.streaming_split() API (#32991)
  • Support reading SQL databases with ray.data.read_sql() (#33353)
  • Support reading WebDataset with ray.data.read_webdataset() (#33336)
  • Allow generator UDFs for Dataset.map_batches() and flat_map() (#32767) (see the sketch after this list)
  • Add collate_fn to Dataset.iter_torch_batches() (#32412)
  • Add support for ignore_missing_paths in reading Datasource (#33126)
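
A minimal sketch of a generator UDF with Dataset.map_batches() (illustrative; a generator UDF may yield several output batches per input batch):

```python
import ray

ds = ray.data.from_items([1, 2, 3, 4])

def duplicate(batch):
    # With Ray 2.4's default batch format, `batch` is a list of items.
    yield batch
    yield batch

print(ds.map_batches(duplicate).count())  # 8
```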

💫 Enhancements:

  • Deprecate Dataset.lazy() (#33812)
  • Deprecate Dataset.dataset_format() (#33437)
  • Deprecate Dataset.fully_executed() and Dataset.is_fully_executed() (#33342)
  • Deprecate Datasource.do_write() (#32015)
  • Add iter_rows to DatasetIterator (#33180)
  • Support optional tf_schema parameter in read_tfrecords() and write_tfrecords() (#32857)
  • Add Dataset.iter_batches(batch_format=None) support, which will yield batches in the current batch format with zero copies (#33562)
  • Add meta_provider parameter into read_images (#33791)
  • Add missing passthrough args to Dataset.read_images() (#32942)
  • Allow read_binary_files(output_arrow_format=True) to return Arrow format (#33780)
  • Improve performance of DefaultFileMetaProvider (#33117)
  • Improved naming of Ray Data map tasks for dashboard (#32585)
  • Support different numbers of blocks/rows per block in Dataset.zip() (#32795)
  • No preserve order by default for streaming execution (#32300)
  • Make write an operator as part of the execution plan (#32015)
  • Optimize metadata creation for RangeDatasource (#33712)
  • Add telemetry for Ray Data (#32896)
  • Make BatchPredictor lazy (#32510)
  • Make Preprocessor.transform lazy by default (#32872)
  • Allow users to pass Callable[[torch.Tensor], torch.Tensor] to TorchVisionTransform (#32383)
  • Automatically move DatasetIterator torch tensors to correct device (#31753)
  • Promote _create_strict_ragged_ndarray to public API (#31975)
  • Add support for string tensor columns in ArrowTensorArray and ArrowVariableShapedTensorArray (#32143)

🔨 Fixes:

  • Data layer performance/bug fixes and tweaks (#32744)
  • Clean up RAY_DATASET_FORCE_LOCAL_METADATA flag (#32483)
  • Fix Datasource write_results type (#33936)
  • Add objects GC in dataset iterator (#34030, #34141)
  • Fix to_pandas failure on datasets returned by from_spark() (#32968)
  • Fix zip stage to preserve order when executing the other side (#33649)
  • Fix _get_read_tasks to use NodeAffinitySchedulingStrategy (#33212)
  • Guard against using ipywidgets in google colab (#32841)
  • Fix from_items parallelism to create the expected number of blocks (#32821)
  • Always preserve order for the BulkExecutor (#32437)

📖 Documentation:

  • Fix Ragged Tensor Documentation (#33029)

Ray Train

🎉 New Features:

  • The LightningTrainer has been revamped
    • [Doc] LightningTrainer end-2-end starter example [no_early_kickoff] (#33494)
    • Lightning Trainer Release tests + docstring sample test (#33323)
    • Support metric logging and checkpointing for LightningTrainer (#33183)
    • Add LightningTrainer to support Pytorch Lightning DDP training <Part 1>. (#33161)
    • Add LightningPredictor to support batch prediction (#33196)
  • Implement AccelerateTrainer (#33269)
  • Add Trainer.restore API for train experiment-level fault tolerance (#31920) (see the sketch after this list)
  • Ray Train telemetry to collect the AIR trainer type (#33277)
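
A minimal sketch of experiment-level restoration with the new Trainer.restore API (the experiment path is hypothetical; arguments such as datasets may need to be re-specified on restore):

```python
from ray.train.torch import TorchTrainer

# Resume a previously interrupted run from its experiment directory.
trainer = TorchTrainer.restore("~/ray_results/my_torch_experiment")
result = trainer.fit()
```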

💫 Enhancements:

  • Recommend Trainer.restore on errors raised by trainer.fit() (#33610)
  • Improve lazy checkpointing (#32233)
  • Sort CUDA_VISIBLE_DEVICES (#33159)
  • Support returning multiple devices in train.torch.get_device() (#32893)
  • Set torch.distributed env vars (#32450)
  • Use the actual task name being executed for _RayTrainWorker__execute. (#33065)

🔨 Fixes:

  • Fix the import path for LightningTrainer to be compatible with Pytorch Lightning 2.0. (#34033)
  • Pin framework to tf in AIR rl offline trainer example (#33750)
  • Fix HF Trainer with DatasetIterator, handle device_map (#32955)
  • Fix failing test_torch_trainer (#32963)
  • Deflake test_gpu by sorting the devices (#33002)
  • (Bandaid) Mitigate OOMs on checkpointing (#33089)
  • Empty cache on the correct device (#33603)

📖 Documentation:

  • [Doc] Improve AIR Lightning APIs and docstrings (#33895)
  • Update quickstart example to use dataloader (#33050)
  • add intro content from training module (#32088)
  • Add back important Ray Train integration methods (for Torch/TF) (#32551)
  • Restructure API reference (#32360)
  • Add Pytorch ResNet finetuning starter example (#32936)

🏗 Architecture refactoring:

  • Hard deprecate Backend encode_data/decode_data (#33784)

Ray Tune

🎉 New Features:

  • Add new experimental execution path
    • Add TuneController (#33499)
    • Refactor TrialRunner to separate out executor calls (#33448)
    • Move trainable_kwargs generation to trial.py (#33249)
    • Move experiment state/checkpoint/resume management into a separate file (#32457)
    • Cache ready futures in RayTrialExecutor (#32093)
    • Use generic _ObjectCache for actor reuse (#33045)
    • Event manager part 2: Implementation (#31811)
  • Add new experimental output format
    • Add new console output code path (behind feature flag) (#33609)
    • Fix new output path with new execution path (#33880)
    • Fix OrderedDict import. (#33709)
    • Fix Ray Tune output v2 failures (#33697)
    • Update wording to "Logical Resource Usage". (#33312)
    • clean up tune/train result output (#32234)
    • Add unit test to experimental/output.py (#33767)
  • Ray Tune Telemetry to collect entrypoint and searchers/scheduler usage (#33740)

💫 Enhancements:

  • Experiment restore/resume
    • Allow re-specifying param space in Tuner.restore (#32317)
    • Replace reference values in a config dict with placeholders (#31927)
    • Add Tuner.can_restore(path) utility for checking if an experiment exists at a path/uri (#32003)
    • Update Tuner.restore usage to prepare for trainable becoming a required arg (#32912)
    • Fix resuming from cloud storage (+ test) (#32504)
  • Syncing to cloud storage
    • Sync trial artifacts to cloud (#32334)
    • Fix ensure directory in bucket path sync (#33692)
    • Sync less often and only wait at end of experiment (#32155)
    • Unrevert "Add more comprehensive support for remote_checkpoint_dir w/ url params (#32479)" (#32576)
    • Use on_experiment_end hook for the final wait of SyncCallback sync processes (#33390)
    • Cleanup path-related properties in experiment classes (#33370)
    • Update trainable remote_checkpoint_dir upon actor reuse (#32420)
    • Add use_threads=False in pyarrow syncing (#32256)
  • Better support for multi-tenancy
    • Prefix global object registry with job ID to avoid conflicts in multi tenancy (#33095)
    • Add test for multi-tenancy workaround and documentation to FAQ (#32560)
    • release test for nested air (tune) oom (#31768)
  • Fault tolerance improvements
    • Add Tune worker fault tolerance test (#33473)
    • Improve logging, unify trial retry logic, improve trial restore retry test. (#32242)
  • Integrations
    • [wandb] Wait for WandbLoggerCallback actors to finish uploading to wandb on experiment end (#33174)
    • [air] Aim logger (#32041)

🔨 Fixes:

  • ExperimentAnalysis: Ignore empty checkpoints but don't fail (#33770)
  • TrialRunner checkpointing shouldn't fail if ray.data.Dataset w/o lineage captured in trial config (#33565)
  • Evict object cache for actor re-use when search ended (#33593)
  • Raise warning (not an exception) if metadata file not found in checkpoint (#33123)
  • remove deep copy in trial.__getstate__ (#32624)
  • Fix "ValueError: I/O operation on closed file" (#31269)

📖 Documentation:

  • Restructure API reference (#32311)
  • Don't recommend tune.run API in logging messages when using the Tuner (#33642)
  • Split "Tune stopping and resuming" into two user guides (#33495)
  • Remove Ray Client references from Tune and Train docs/examples (#32299)
  • Add tune checkpoint user guide. (#33145)
  • improve log_to_file doc. (#32128)
  • Fix broken Tune links to overview and integration (#32442)

🏗 Architecture refactoring:

  • Deprecation cycle
    • Hard deprecate Tune MLflow/W&B mixin/callbacks (#33782)
    • Fix two tests after structure refactor deprecation (#32517)
    • Remove deprecated Resources class (#32490)
    • Structure refactor: Raise on import of old modules (#32486)

Ray Serve

🎉 New Features:

  • Multi-app support in the CLI and REST API. (#33347, #33490, #33300, #33216, #33013)

💫 Enhancements:

  • Add telemetry for lightweight config updates (#34039)
  • Deployment & replica info are automatically included in user-customized metrics. (#33451)
  • Add route and request ID to Ray Serve log entries. (#33365)
  • Add telemetry for common Serve usage patterns (#33505)
  • Add log_to_stderr option to logger and improve internal logging (#33597)
  • Make http retries tunable (#32532)
  • Extend configurable HTTP options (#33160)
  • Prevent mixing single/multi-app config deployment (#33340)
  • Expose FastAPI docs path (#32863)
  • Add http request latency (#32839)

🔨 Fixes:

  • Recover the pending actors during the controller failures (#33890)
  • Fix tensorarray to numpy conversion (#34115)
  • Allow app rename when redeploying config (#33385)
  • Fix traceback string for RayTaskErrors when deploying serve app (#33120)

📖 Documentation:

  • Add serve example documentation for object detection, stable diffusion, and more (#33164)

RLlib

🎉 New Features:

  • RLModule API is available in Alpha. See details here. PPO has been migrated to this API but in a limited mode.
  • Catalog API is revamped to be consistent with RLModule. See details here.

💫 Enhancements:

  • Default framework is now torch instead of tf. (#33603)
  • Hard deprecate the old rllib/agent folder (#33242)

🔨 Fixes:

  • [RLlib] Don't serialize config in Policy states (unless needed for msgpack-type checkpoints). (#33865)
  • [RLlib] Fix MultiCallbacks class: To be used only with utility function that returns a class to use in the config. (#33863)
  • [RLlib] DM control suite wrapper fix: dtype of obs needs to be pinned to float32. (#33876)
  • [RLlib] Fix apex dqn deprecated add_batch call (#33814)
  • [RLlib] AlgorithmConfig.update_from_dict needs to work for MultiCallbacks. (#33796)
  • [RLlib] Add dist_inputs to action sampler fn returns in TorchPolicyV2 (#33795)

📖 Documentation:

  • Rewrote the API documentation for better discoverability.
    • [RLlib][Docs] Restructure RLModule API page (#33363)
    • [RLlib][Docs] Restructure Replay buffer API page (#33359)
    • [RLlib][Docs] Restructure Utils API page (#33358)
    • [RLlib][Docs] Restructure Sampler's API page (#33357)
    • [RLlib][Docs] Restructure Modelv2's API page (#33356)
    • [RLlib][Docs] Restructure Algorithm's API page (#33345)
    • [RLlib][Docs] Restructure Policy's API page (#33344)
  • [RLlib] Fix Getting Started example never returning (#33140)

Ray Core

🎉 New Features:

  • Ray officially supports scaling up to 2000 nodes. See the scalability envelope for more details.
  • Ray introduces an experimental environment variable, RAY_preload_python_modules, to preload Python modules before tasks or actors are scheduled. This will eventually reduce the startup time of Ray workloads that import large libraries. Please try it out and share feedback in #ray-preload-modules-feedback in the Ray Slack. To enable it, configure the modules to preload via RAY_preload_python_modules=torch,tensorflow when starting Ray.

💫 Enhancements:

  • Mark raylet unhealthy if GCS can't recognize it. (#34216)
  • Improve the workflow finding Redis leader.(#34183)
  • Improve Redis related observability when failed. (#33842)
  • Improve the serialization error for tasks, actors and ray.put (#33660)
  • Experimental preload_python_modules flag for preloading modules in default_worker (#33572)
  • Remove actor deletion upon job termination (#31019)
  • Better support per worker gpu usage from the cluster view. (#33515)
  • Task backend - Profile events capping (#33321)
  • FIFO worker killing policy (#33430)
  • Write ray address even if ray node is started with --block (#32961)
  • Turn on light weight resource broadcasting. (#32625)
  • Add opt-in flag for Windows and OSX clusters, update ray start output to match docs (#32409)

🔨 Fixes:

  • Fix arm64 wheel builds (#34320)
  • Fix ray start command output (#34273)
  • Partially address Ray child process leaks by killing all child processes in the CoreWorker shutdown sequence. (#34181)
  • Ignore resource usage update from unknown node (#33619)
  • Fix keepalive in grpc client #33986
  • Autosummary class by default (#32983)
  • Fix non default dashboard_agent_listen_port not used when connected to the node (#33834)
  • Allow using local wheels to run release tests. (#32739)
  • Fix the error message when storage is not set. (#33581)
  • Fixing lint issue in benchmark_worker_startup (#33440)
  • Pin json-schema < 4.18 (#33412)
  • Fix demand leak when worker failed (#31175)
  • Remove some usage of deprecated runtime context apis (#33236)
  • Remove dead SchedulingResources class (#33250)
  • Release lock before sleeping (#33221)
  • Remove asyncio.ensure_future call in run_async_func_in_event_loop (#32932)
  • Upgrade gtest to 1.13 (#32858)
  • Update OpenCensus (#32553)
  • Remove usage_lib.LibUsageRecorder (#32806)
  • Fix the race condition in the new resource broadcasting. (#32798)
  • Task backend - disable verbose print. (#32764)
  • Building py37+cu118 and using cu116 in default ray-ml image (#32636)
  • Do not set flushing thread niceness for task backend (#32439)
  • Fix gRPC callback API destruction issues. (#32151)
  • Fix comments and a corner case in #32302 (#32323)
  • Script to compare perf metrics between releases (#32290)

📖Documentation:

  • Rewrite the placement group documentation (#34302)
  • Add tips on writing fault-tolerant Ray applications (#32191)
  • Remove docs referring to Ray Client. (#32209)
  • Improve the streaming_split pydoc (#33424)
  • Add doc link for logs dedup (#33879)

Ray Clusters

💫Enhancements:

  • Added end-to-end release tests for example AWS cluster launcher YAML files (#32670)

Dashboard

🎉 New Features:

  • Ray Serve releases its own dedicated dashboard! See the documentation for more details.
  • You can now access the error messages from every task and actor from the Ray dashboard.
  • Better out of memory debugging support. See the out of memory troubleshooting guide for more details.

🔨 Fixes:

  • Add the OOM failure graph (#34129)
  • Improve the existing OOM metrics (#33453)
  • Task backend - increase worker side GC limit to 100k (#33563)
  • Add device index to the GPU metrics (#33328)
  • Hide failed nodes by default. (#33455)
  • Add worker startup & initialization time to state API + use it for many_tasks (#31916)
  • Fix per component metrics bugs. (#33450)
  • Fix the object store size mismatch between the dashboard and metrics

Many thanks to all those who contributed to this release!

@zjf2012, @christy, @fyrestone, @avnishn, @scottjlee, @sijieamoy, @jjyao, @sven1977, @jamesclark-Zapata, @cadedaniel, @jovany-wang, @pcmoritz, @MaskRay, @csivanich, @augray, @wuisawesome, @Wendi-anyscale, @maxpumperla, @shawnpanda, @DmitriGekhtman, @yuduber, @gjoliver, @ju2ez, @clarkzinzow, @brycehuang30, @iycheng, @justinvyu, @dmatrix, @edoakes, @tmbdev, @scottsun94, @jianoaix, @cool-RR, @prrajput1199, @amogkam, @ckw017, @alanwguo, @architkulkarni, @chaowanggg, @AmeerHajAli, @stephanie-wang, @bewestphal, @matthew29tang, @dbczumar, @sihanwang41, @ericl, @soumitrak, @matthewdeng, @Catch-Bull, @peytondmurray, @XiaodongLv, @bveeramani, @YQ-Wang, @Linniem, @ProjectsByJackHe, @woshiyyya, @c21, @shrekris-anyscale, @zcin, @Yard1, @can-anyscale, @kouroshHakha, @robertnishihara, @richardliaw, @krfricke, @shomilj, @ArturNiederfahrenhorst, @ijrsvt, @GokuMohandas, @jbedorf, @xwjiang2010, @anydayeol, @clarng, @davidxia, @rickyyx, @Siraj-Qazi, @kira-lin, @scv119, @chengscott, @angelinalg, @rkooo567, @rshin, @deanwampler, @gramhagen, @larrylian, @WeichenXu123, @simonsays1980

ray-2.3.1

1 year ago

The Ray 2.3.1 patch release contains fixes for multiple components:

  • Ray Data Processing
  • Ray Serve
  • Ray Core
  • Dashboard

ray-2.3.0

1 year ago

Release Highlights

  • The streaming backend for Ray Datasets is in Developer Preview. It is designed to enable terabyte-scale ML inference and training workloads. Please contact us if you'd like to try it out on your workload, or you can find the preview guide here: https://docs.google.com/document/d/1BXd1cGexDnqHAIVoxTnV3BV0sklO9UXqPwSdHukExhY/edit
  • New Information Architecture (Beta): We’ve restructured the Ray dashboard to be organized around user personas and workflows instead of entities.
  • Ray-on-Spark is now available (Preview)! You can launch Ray clusters on Databricks and Spark clusters and run Ray applications. Check out the documentation to learn more.

Ray Libraries

Ray AIR

💫Enhancements:

  • Add set_preprocessor method to Checkpoint (#31721)
  • Rename Keras callback and its parameters to be more descriptive (#31627)
  • Deprecate MlflowTrainableMixin in favor of setup_mlflow() function (#31295)
  • W&B
    • Have train_loop_config logged as a config (#31901)
    • Allow users to exclude config values with WandbLoggerCallback (#31624)
    • Rename WandB save_checkpoints to upload_checkpoints (#31582)
    • Add hook to get project/group for W&B integration (#31035, #31643)
    • Use Ray actors instead of multiprocessing for WandbLoggerCallback (#30847)
    • Update WandbLoggerCallback example (#31625)
  • Predictor
    • Place predictor kwargs in object store (#30932)
    • Delegate BatchPredictor stage fusion to Datasets (#31585)
    • Rename DLPredictor.call_model tensor parameter to inputs (#30574)
    • Add use_gpu to HuggingFacePredictor (#30945)
  • Checkpoints
    • Various Checkpoint improvements (#30948)
    • Implement lazy checkpointing for same-node case (#29824)
    • Automatically strip "module." from state dict (#30705)
    • Allow user to pass model to TensorflowCheckpoint.get_model (#31203)

🔨 Fixes:

  • Fix and improve support for HDFS remote storage. (#31940)
  • Use specified Preprocessor configs when using stream API. (#31725)
  • Support nested Chain in BatchPredictor (#31407)

📖Documentation:

  • Restructure API References (#32535)
  • API Deprecations (#31777, #31867)
  • Various fixes to docstrings, documentation, and examples (#30782, #30791)

🏗 Architecture refactoring:

  • Use NodeAffinitySchedulingPolicy for scheduling (#32016)
  • Internal resource management refactor (#30777, #30016)

Ray Data Processing

🎉 New Features:

  • Lazy execution by default (#31286)
  • Introduce streaming execution backend (#31579)
  • Introduce DatasetIterator (#31470); see the sketch after this list
  • Add per-epoch preprocessor (#31739)
  • Add TorchVisionPreprocessor (#30578)
  • Persist Dataset statistics automatically to log file (#30557)
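
A minimal sketch of the new DatasetIterator (assuming the Dataset.iterator() entry point introduced alongside it; the batch size is illustrative):

    import ray

    ds = ray.data.range(1000)

    # DatasetIterator decouples consumption from the underlying dataset or
    # pipeline, enabling the same ingest code for bulk and streaming modes.
    it = ds.iterator()
    for batch in it.iter_batches(batch_size=128):
        pass  # consume the batch, e.g. feed it to a training step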

💫Enhancements:

  • Async batch fetching for map_batches (#31576)
  • Add informative progress bar names to map_batches (#31526)
  • Provide a size-in-bytes estimate for MongoDB blocks (#31930)
  • Add support for dynamic block splitting to actor pool (#31715)
  • Improve str/repr of Dataset to include execution plan (#31604)
  • Deal with nested Chain in BatchPredictor (#31407)
  • Allow MultiHotEncoder to encode arrays (#31365)
  • Allow specifying batch_size when reading Parquet files (#31165)
  • Add zero-copy batch API for ds.map_batches() (#30000); see the sketch after this list
  • Save texts in Arrow Table format for text datasets (#30963)
  • Return ndarray dicts for single-column tabular datasets (#30448)
  • Execute randomize_block_order eagerly if it's the last stage for ds.schema() (#30804)
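
A sketch of the zero-copy batch API on a toy dataset; with zero_copy_batch=True the UDF may receive read-only, zero-copy array views, so it must return new arrays rather than mutate its input:

    import numpy as np

    import ray

    ds = ray.data.from_numpy(np.arange(1000))

    def double(batch):
        # Allocate new output arrays; the input arrays may be read-only.
        return {name: col * 2 for name, col in batch.items()}

    ds = ds.map_batches(double, batch_format="numpy", zero_copy_batch=True)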

🔨 Fixes:

  • Don't drop first dataset when peeking DatasetPipeline (#31513)
  • Handle np.array(dtype=object) constructor for ragged ndarrays (#31670)
  • Emit warning when starting Dataset execution with no CPU resources available (#31574)
  • Fix a bug where input blocks were eagerly cleared (#31459)
  • Fix Imputer failing with categorical dtype (#31435)
  • Fix schema unification for Datasets with ragged Arrow arrays (#31076)
  • Fix Discretizers transforming ignored cols (#31404)
  • Fix to_tf when the input feature_columns is a list. (#31228)
  • Raise an error message if the user calls Dataset.__iter__ (#30575)

📖Documentation:

  • Refactor Ray Data API documentation (#31204)
  • Add seealso to map-related methods (#30579)

Ray Train

🎉 New Features:

  • Add option for per-epoch preprocessor (#31739)

💫Enhancements:

  • Change default NCCL_SOCKET_IFNAME to blacklist veth (#31824)
  • Introduce DatasetIterator for bulk and streaming ingest (#31470)
  • Clarify which RunConfig is used when there are multiple places to specify it (#31959)
  • Change ScalingConfig to be optional for DataParallelTrainers if already in Tuner param_space (#30920)

🔨 Fixes:

  • Use specified Preprocessor configs when using stream API. (#31725)
  • Fix off-by-one AIR Trainer checkpoint ID indexing on restore (#31423)
  • Force GBDTTrainer to use distributed loading for Ray Datasets (#31079)
  • Fix bad case in ScalingConfig->RayParams (#30977)
  • Don't raise TuneError on fail_fast="raise" (#30817)
  • Report only once in SklearnTrainer (#30593)
  • Ensure GBDT PGFs match passed ScalingConfig (#30470)

📖Documentation:

  • Restructure API References (#32535)
  • Remove Ray Client references from Train docs/examples (#32321)
  • Various fixes to docstrings, documentation, and examples (#29463, #30492, #30543, #30571, #30782, #31692, #31735)

🏗 Architecture refactoring:

  • API Deprecations (#31763)

Ray Tune

💫Enhancements:

  • Improve trainable serialization error (#31070)
  • Add support for Nevergrad optimizer with extra parameters (#31015)
  • Add timeout for experiment checkpoint syncing to cloud (#30855)
  • Move validate_upload_dir to Syncer (#30869)
  • Enable experiment restore from moved cloud uri (#31669)
  • Save and restore stateful callbacks as part of experiment checkpoint (#31957)

🔨 Fixes:

  • Do not default to reuse_actors=True when mixins are used (#31999)
  • Only keep cached actors if search has not ended (#31974)
  • Fix best trial in ProgressReporter with nan (#31276)
  • Make ResultGrid return cloud checkpoints (#31437)
  • Wait for final experiment checkpoint sync to finish (#31131)
  • Fix CheckpointConfig validation for function trainables (#31255)
  • Fix checkpoint directory assignment for new checkpoints created after restoring a function trainable (#31231)
  • Fix AxSearch save and nan/inf result handling (#31147)
  • Fix AxSearch search space conversion for fixed list hyperparameters (#31088)
  • Restore searcher and scheduler properly on Tuner.restore (#30893)
  • Fix progress reporter sort_by_metric with nested metrics (#30906)
  • Don't raise TuneError on fail_fast="raise" (#30817)
  • Fix duplicate printing when trial is done (#30597)

📖Documentation:

  • Restructure API references (#32449)
  • Remove Ray Client references from Tune docs/examples (#32321)
  • Various fixes to docstrings, documentation, and examples (#29581, #30782, #30571, #31045, #31793, #32505)

🏗 Architecture refactoring:

  • Deprecate passing a custom trial executor (#31792)
  • Move signal handling into separate method (#31004)
  • Update staged resources in a fixed counter for faster lookup (#32087)
  • Rename overwrite_trainable argument in Tuner restore to trainable (#32059)

Ray Serve

🎉 New Features:

  • Serve Python API to support multiple applications (#31589)
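
A hedged sketch of running two applications side by side (the serve.run argument names follow the multi-application API as it later stabilized; app names and routes are illustrative):

    from ray import serve

    @serve.deployment
    class Hello:
        def __call__(self, request):
            return "hello"

    @serve.deployment
    class World:
        def __call__(self, request):
            return "world"

    # Each application gets its own name and HTTP route prefix.
    serve.run(Hello.bind(), name="hello_app", route_prefix="/hello")
    serve.run(World.bind(), name="world_app", route_prefix="/world")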

💫Enhancements:

  • Add exponential backoff when retrying replicas (#31436)
  • Enable Log Rotation on Serve (#31844)
  • Use tasks/futures for asyncio.wait (#31608)
  • Change target_num_ongoing_requests_per_replica to positive float (#31378)

🔨 Fixes:

  • Upgrade deprecated calls (#31839)
  • Change Gradio integration to take a builder function to avoid serialization issues (#31619)
  • Add initial health check before marking a replica as RUNNING (#31189)

📖Documentation:

  • Document end-to-end timeout in Serve (#31769)
  • Document Gradio visualization (#28310)

RLlib

🎉 New Features:

  • Gymnasium is now supported. (Notes) See the sketch after this list.
  • Connectors are now activated by default (#31693, #30388, #31618, #31444, #31092)
  • Contribution of LeelaChessZero algorithm for playing chess in a MultiAgent env. (#31480)
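
For reference, gymnasium's episode API differs from classic gym in its reset() and step() signatures; a plain gymnasium sketch (independent of RLlib):

    import gymnasium as gym

    env = gym.make("CartPole-v1")
    obs, info = env.reset()  # gymnasium returns (obs, info) on reset

    # step() returns five values: terminated and truncated replace done.
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())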

💫Enhancements:

  • [RLlib] Error out if action_dict is empty in MultiAgentEnv. (#32129)
  • [RLlib] Upgrade tf eager code to no longer use experimental_relax_shapes (but reduce_retracing instead). (#29214)
  • [RLlib] Reduce SampleBatch counting complexity (#30936)
  • [RLlib] Use PyTorch vectorized max() and sum() in SampleBatch.__init__ when possible (#28388)
  • [RLlib] Support multi-gpu CQL for torch (tf already supported). (#31466)
  • [RLlib] Introduce IMPALA off_policyness test with GPU (#31485)
  • [RLlib] Properly serialize and restore StateBufferConnector states for policy stashing (#31372)
  • [RLlib] Clean up deprecated concat_samples calls (#31391)
  • [RLlib] Better support MultiBinary spaces by treating Tuples as superset of them in ComplexInputNet. (#28900)
  • [RLlib] Add backward compatibility to MeanStdFilter to restore from older checkpoints. (#30439)
  • [RLlib] Clean up some signatures for compute_actions. (#31241)
  • [RLlib] Simplify logging configuration. (#30863)
  • [RLlib] Remove native Keras Models. (#30986)
  • [RLlib] Convert PolicySpec to a readable format when converting to_dict(). (#31146)
  • [RLlib] Issue 30394: Add proper __str__() method to PolicyMap. (#31098)
  • [RLlib] Issue 30840: Option to only checkpoint policies that are trainable. (#31133)
  • [RLlib] Deprecate (delete) contrib folder. (#30992)
  • [RLlib] Better behavior if user does not specify stopping condition in RLLib CLI. (#31078)
  • [RLlib] PolicyMap LRU cache enhancements: Swap out policies (instead of GC'ing and recreating) + use Ray object store (instead of file system). (#29513)
  • [RLlib] AlgorithmConfig.overrides() to replace multiagent->policies->config and evaluation_config dicts. (#30879)
  • [RLlib] deprecation_warning(.., error=True) should raise ValueError, not DeprecationWarning. (#30255)
  • [RLlib] Add gym.spaces.Text serialization. (#30794)
  • [RLlib] Convert MultiAgentBatch to SampleBatch in offline_rl.py. (#30668)
  • [RLlib; Tune] Make Algorithm.train() return Tune-style config dict (instead of AlgorithmConfig object). (#30591)

🔨 Fixes:

  • [RLlib] Fix waterworld example and test (#32117)
  • [RLlib] Change Waterworld v3 to v4 and reinstate indep. MARL test case w/ pettingzoo. (#31820)
  • [RLlib] Fix OPE checkpointing. Save method name in configuration dict. (#31778)
  • [RLlib] Fix worker state restoration. (#31644)
  • [RLlib] Replace ordinary pygame imports by try_import_..(). (#31332)
  • [RLlib] Remove crude VR checks in agent collector. (#31564)
  • [RLlib] Fixed the 'RestoreWeightsCallback' example script. (#31601)
  • [RLlib] Issue 28428: QMix not working w/ GPUs. (#31299)
  • [RLlib] Fix using yaml files with empty stopping conditions. (#31363)
  • [RLlib] Issue 31174: Move all checks into AlgorithmConfig.validate() (even simple ones) to avoid errors when using tune hyperopt objects. (#31396)
  • [RLlib] Fix tensorflow_probability imports. (#31331)
  • [RLlib] Issue 31323: BC/MARWIL/CQL do work with multi-GPU (but config validation prevents them from running in this mode). (#31393)
  • [RLlib] Issue 28849: DT fails with num_gpus=1. (#31297)
  • [RLlib] Fix PolicyMap.__del__() to also remove a deleted policy ID from the internal deque. (#31388)
  • [RLlib] Use get_model_v2() instead of get_model() with MADDPG. (#30905)
  • [RLlib] Policy mapping fn can not be called with keyword arguments. (#31141)
  • [RLlib] Issue 30213: Appending RolloutMetrics to sampler outputs should happen after(!) all callbacks (such that custom metrics for last obs are still included). (#31102)
  • [RLlib] Make convert_to_torch tensor adhere to docstring. (#31095)
  • [RLlib] Fix convert to torch tensor (#31023)
  • [RLlib] Issue 30221: random policy does not handle nested spaces. (#31025)
  • [RLlib] Fix crashing remote envs example (#30562)
  • [RLlib] Recursively look up the original space from obs_space (#30602)

📖Documentation:

  • [RLlib; docs] Change links and references in code and docs to "Farama foundation's gymnasium" (from "OpenAI gym"). (#32061)

Ray Core and Ray Clusters

Ray Core

🎉 New Features:

  • Task Events Backend: Ray aggregates all submitted task information to provide better observability (#31840, #31761, #31278, #31247, #31316, #30934, #30979, #31207, #30867, #30829, #31524, #32157). This backend powers features like the task state API, advanced progress bar, and Ray timeline.

💫Enhancements:

  • Remote generators now work for Ray actors and Ray Client (#31700, #31710); see the sketch after this list.
  • Revamp the default scheduling strategy and improve worker startup performance by up to 8x for embarrassingly parallel workloads (#31934, #31868).
  • Worker code cleanup; allow workers to lazily bind to jobs (#31836, #31846, #30349, #31375).
  • A single Ray cluster can scale up to 2000 nodes and 20k actors (#32131, #30131, #31939, #30166, #30460, #30563).
  • Out-of-memory prevention enhancement is now GA, with more robust worker killing policies and a better user experience (#32217, #32361, #32219, #31768, #32107, #31976, #31272, #31509, #31230).
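
A minimal sketch of a remote generator task using the num_returns="dynamic" form (the same pattern now also extends to actors and Ray Client, per the note above):

    import ray

    @ray.remote(num_returns="dynamic")
    def split(n):
        for i in range(n):
            yield i  # each yielded value is stored as its own object

    # The task returns a generator of ObjectRefs, one per yielded value.
    ref_generator = ray.get(split.remote(3))
    print([ray.get(ref) for ref in ref_generator])  # [0, 1, 2]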

🔨 Fixes:

  • Improve garbage collection upon job termination (#32127, #31155)
  • Fix opencensus protobuf bug (#31632)
  • Support python 3.10 for runtime_env conda (#30970)
  • Fix crashes and memory leaks (#31640, #30476, #31488, #31917, #30761, #31018)

📖Documentation:

  • Deprecation (#31845, #31140, #31528)

Ray Clusters

🎉 New Features:

  • Ray-on-Spark is now available as Preview! (#28771, #31397, #31962)

💫Enhancements:

  • [observability] Better memory formatting for ray status and autoscaler (#32337)
  • [autoscaler] Add flag to disable periodic cluster status log. (#31869)

🔨 Fixes:

  • [observability][autoscaler] Ensure pending nodes is reset to 0 after scaling (#32085)
  • Make ~/.bashrc optional in cluster launcher commands (#32393)

📖Documentation:

  • Improvements to job submission
  • Remove references to Ray Client

Dashboard

🎉 New Features:

  • New Information Architecture (beta): We’ve restructured the Ray dashboard to be organized around user personas and workflows instead of entities. For developers, the jobs and actors tab will be most useful. For infrastructure engineers, the cluster tab may be more valuable.
  • Advanced progress bar: a task visualization that allows you to see the progress of all your Ray tasks
  • Timeline view: We’ve added a button to download detailed timeline data about your Ray job. You can then load it into the Perfetto open-source visualization tool to visualize the timeline data.
  • More metadata tables. You can now see placement groups, tasks, actors, and other information related to your jobs.

📖Documentation:

  • We’ve restructured the documentation to make the dashboard documentation more prominent.
  • We’ve improved the documentation around setting up Prometheus and Grafana for metrics.

Many thanks to all those who contributed to this release!

@minerharry, @scottsun94, @iycheng, @DmitriGekhtman, @jbedorf, @krfricke, @simonsays1980, @eltociear, @xwjiang2010, @ArturNiederfahrenhorst, @richardliaw, @avnishn, @WeichenXu123, @Capiru, @davidxia, @andreapiso, @amogkam, @sven1977, @scottjlee, @kylehh, @yhna940, @rickyyx, @sihanwang41, @n30111, @Yard1, @sriram-anyscale, @Emiyalzn, @simran-2797, @cadedaniel, @harelwa, @ijrsvt, @clarng, @pabloem, @bveeramani, @lukehsiao, @angelinalg, @dmatrix, @sijieamoy, @simon-mo, @jbesomi, @YQ-Wang, @larrylian, @c21, @AndreKuu, @maxpumperla, @architkulkarni, @wuisawesome, @justinvyu, @zhe-thoughts, @matthewdeng, @peytondmurray, @kevin85421, @tianyicui-tsy, @cassidylaidlaw, @gvspraveen, @scv119, @kyuyeonpooh, @Siraj-Qazi, @jovany-wang, @ericl, @shrekris-anyscale, @Catch-Bull, @jianoaix, @christy, @MisterLin1995, @kouroshHakha, @pcmoritz, @csko, @gjoliver, @clarkzinzow, @SongGuyang, @ckw017, @ddelange, @alanwguo, @Dhul-Husni, @Rohan138, @rkooo567, @fzyzcjy, @chaokunyang, @0x2b3bfa0, @zoltan-fedor, @Chong-Li, @crypdick, @jjyao, @emmyscode, @stephanie-wang, @starpit, @smorad, @nikitavemuri, @zcin, @tbukic, @ayushthe1, @mattip

ray-2.2.0

1 year ago

Release Highlights

Ray 2.2 is a stability-focused release, featuring stability improvements across many Ray components.

  • Ray Jobs API is now GA. The Ray Jobs API allows you to submit locally developed applications to a remote Ray Cluster for execution. It simplifies the experience of packaging, deploying, and managing a Ray application.
  • Ray Dashboard has received a number of improvements, such as the ability to see cpu flame graphs of your Ray workers and new metrics for memory usage.
  • The Out-Of-Memory (OOM) Monitor is now enabled by default. This will increase the stability of memory-intensive applications on top of Ray.
  • [Ray Data] Numerous users have reported out-of-memory or performance issues with Ray Data when input files are too large. In this release, we’re enabling dynamic block splitting by default, which addresses these issues by avoiding holding too much data in memory.

Ray Libraries

Ray AIR

🎉 New Features:

  • Add a NumPy first path for Torch and TensorFlow Predictors (#28917)

💫Enhancements:

  • Suppress "NumPy array is not writable" error in torch conversion (#29808)
  • Add node rank and local world size info to session (#29919)

🔨 Fixes:

  • Fix MLflow database integrity error (#29794)
  • Fix ResourceChangingScheduler dropping PlacementGroupFactory args (#30304)
  • Fix bug passing 'raise' to FailureConfig (#30814)
  • Fix reserved CPU warning if no CPUs are used (#30598)

📖Documentation:

  • Fix examples and docs to specify batch_format in BatchMapper (#30438)

🏗 Architecture refactoring:

  • Deprecate Wandb mixin (#29828)
  • Deprecate Checkpoint.to_object_ref and Checkpoint.from_object_ref (#30365)

Ray Data Processing

🎉 New Features:

  • Support all PyArrow versions released by Apache Arrow (#29993, #29999)
  • Add select_columns() to select a subset of columns (#29081); see the sketch after this list
  • Add write_tfrecords() to write TFRecord files (#29448)
  • Support MongoDB data source (#28550)
  • Enable dynamic block splitting by default (#30284)
  • Add from_torch() to create datasets from Torch datasets (#29588)
  • Add from_tf() to create datasets from TensorFlow datasets (#29591)
  • Allow setting batch_size in BatchMapper (#29193)
  • Support read/write from/to the local node file system (#29565)
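
A couple of the new Dataset methods in action on a toy tabular dataset (column names and the output path are illustrative):

    import pandas as pd

    import ray

    ds = ray.data.from_pandas(pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}))

    # Keep only a subset of columns.
    ds = ds.select_columns(["a"])

    # Write the result out as TFRecord files.
    ds.write_tfrecords("/tmp/records")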

💫Enhancements:

  • Add include_paths in read_images() to return image file path (#30007)
  • Print out Dataset statistics automatically after execution (#29876)
  • Cast tensor extension type to opaque object dtype in to_pandas() and to_dask() (#29417)
  • Encode number of dimensions in variable-shaped tensor extension type (#29281)
  • Fuse AllToAllStage and OneToOneStage with compatible remote args (#29561)
  • Change read_tfrecords() output from Pandas to Arrow format (#30390)
  • Handle all Ray errors in task compute strategy (#30696)
  • Allow nested Chain preprocessors (#29706)
  • Warn user if missing columns and support str exclude in Concatenator (#29443)
  • Raise ValueError if preprocessor column doesn't exist (#29643)

🔨 Fixes:

  • Support custom resource with remote args for random_shuffle() (#29276)
  • Support custom resource with remote args for random_shuffle_each_window() (#29482)
  • Add PublicAPI annotation to preprocessors (#29434)
  • Tensor extension column concatenation fixes (#29479)
  • Fix iter_batches() to not return empty batch (#29638)
  • Change map_batches() to fetch input blocks on-demand (#29289)
  • Change take_all() to not accept limit argument (#29746)
  • Convert between block and batch correctly for map_groups() (#30172)
  • Fix stats() call causing Dataset schema to be unset (#29635)
  • Raise error when batch_format is not specified for BatchMapper (#30366)
  • Fix ndarray representation of single-element ragged tensor slices (#30514)

📖Documentation:

  • Improve map_batches() documentation about execution model and UDF pickle-ability requirement (#29233)
  • Improve to_tf() docstring (#29464)

Ray Train

🎉 New Features:

  • Added MosaicTrainer (#29237, #29620, #29919)

💫Enhancements:

  • Fast fail upon single worker failure (#29927)
  • Optimize checkpoint conversion logic (#29785)

🔨 Fixes:

  • Propagate DatasetContext to training workers (#29192)
  • Show correct error message on training failure (#29908)
  • Fix prepare_data_loader with enable_reproducibility (#30266)
  • Fix usage of NCCL_BLOCKING_WAIT (#29562)

📖Documentation:

  • Deduplicate Train examples (#29667)

🏗 Architecture refactoring:

  • Hard deprecate train.report (#29613)
  • Remove deprecated Train modules (#29960)
  • Deprecate old prepare_model DDP args (#30364)

Ray Tune

🎉 New Features:

  • Make Tuner.restore work with relative experiment paths (#30363)
  • Tuner.restore from a local directory that has moved (#29920)

💫Enhancements:

  • with_resources takes in a ScalingConfig (#30259)
  • Keep resource specifications when nesting with_resources in with_parameters (#29740)
  • Add trial_name_creator and trial_dirname_creator to TuneConfig (#30123)
  • Add option to not override the working directory (#29258)
  • Only convert a BaseTrainer to Trainable once in the Tuner (#30355)
  • Dynamically identify PyTorch Lightning Callback hooks (#30045)
  • Make remote_checkpoint_dir work with query strings (#30125)
  • Make cloud checkpointing retry configurable (#30111)
  • Sync experiment-checkpoints more often (#30187)
  • Update generate_id algorithm (#29900)

🔨 Fixes:

  • Catch SyncerCallback failure with dead node (#29438)
  • Do not warn in BayesOpt w/ Uniform sampler (#30350)
  • Fix ResourceChangingScheduler dropping PGF args (#30304)
  • Fix Jupyter output with Ray Client and Tuner (#29956)
  • Fix tests related to TUNE_ORIG_WORKING_DIR env variable (#30134)

📖Documentation:

  • Add user guide for analyzing results (using ResultGrid and Result) (#29072)
  • Tune checkpointing and Tuner restore docfix (#29411)
  • Fix and clean up PBT examples (#29060)
  • Fix TrialTerminationReporter in docs (#29254)

🏗 Architecture refactoring:

  • Remove hard deprecated SyncClient/Syncer (#30253)
  • Deprecate Wandb mixin, move to setup_wandb() function (#29828)

Ray Serve

🎉 New Features:

  • Guard for high latency requests (#29534)
  • Java API Support (blog)

💫Enhancements:

  • Serve K8s HA benchmarking (#30278)
  • Add method info for http metrics (#29918)

🔨 Fixes:

  • Fix log format error (#28760)
  • Inherit previous deployment num_replicas (#29686)
  • Polish serve run deploy message (#29897)
  • Remove calls to get_event_loop for Python 3.10 compatibility

RLlib

🎉 New Features:

  • Fault tolerant, elastic WorkerSets: An asynchronous Ray actor manager class is now used inside all of RLlib’s Algorithms, adding fully flexible fault tolerance to rollout workers and workers used for evaluation. If one or more workers (which are Ray actors) fail - e.g. due to a spot instance going down - the RLlib Algorithm will now flexibly wait it out and periodically try to recreate the failed workers. In the meantime, only the remaining healthy workers are used for sampling and evaluation. (#29938, #30118, #30334, #30252, #29703, #30183, #30327, #29953)

💫Enhancements:

  • RLlib CLI: A new and enhanced RLlib command line interface (CLI) has been added, allowing for automatic downloads of example configuration files, python-based config files (defining an AlgorithmConfig object to use), better interoperability between training and evaluation runs, and much more. For a detailed overview of what has changed, check out the new CLI documentation. (#29204, #29459, #30526, #29661, #29972)
  • Checkpoint overhaul: Algorithm checkpoints and Policy checkpoints are now more cohesive and transparent. All checkpoints are now characterized by a directory (with files and possibly sub-directories), rather than a single pickle file. Both Algorithm and Policy classes now have a utility static method (from_checkpoint()) for directly instantiating instances from a checkpoint directory, without knowing the original configuration used or any other information (having the checkpoint is sufficient); see the sketch after this list. For a detailed overview, see here. (#28812, #29772, #29370, #29520, #29328)
  • A new metric for APPO/IMPALA/PPO has been added that measures off-policy’ness: the difference between the number of gradient updates the sampler policy has received thus far and the number of gradient updates the trained policy has received thus far. (#29983)
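
A sketch of the new restoration utilities (the checkpoint path is illustrative and must point to a directory produced by, e.g., Algorithm.save()):

    from ray.rllib.algorithms.algorithm import Algorithm
    from ray.rllib.policy.policy import Policy

    # Rebuild a full Algorithm purely from its checkpoint directory; no
    # original config object or other information is needed.
    algo = Algorithm.from_checkpoint("/tmp/my_algo_checkpoint")

    # Or recover just the contained policies (for an Algorithm checkpoint,
    # this returns a dict mapping policy IDs to Policy objects).
    policies = Policy.from_checkpoint("/tmp/my_algo_checkpoint")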

🏗 Architecture refactoring:

  • AlgorithmConfig classes: All of RLlib’s Algorithms, RolloutWorkers, and other important classes now use AlgorithmConfig objects under the hood, instead of python config dicts. It is no longer recommended (however, still supported) to create a new algorithm (or a Tune+RLlib experiment) using a python dict as configuration; see the sketch after this list. For more details on how to convert your scripts to the new AlgorithmConfig design, see here. (#29796, #30020, #29700, #29799, #30096, #29395, #29755, #30053, #29974, #29854, #29546, #30042, #29544, #30079, #30486, #30361)
  • Major progress was made on the new Connector API and making sure it can be used (tentatively) with the “config.rollouts(enable_connectors=True)” flag. It will be fully supported, across all of RLlib’s algorithms, in Ray 2.3. (#30307, #30434, #30459, #30308, #30332, #30320, #30383, #30457, #30446, #30024, #29064, #30398, #29385, #30481, #30241, #30285, #30423, #30288, #30313, #30220, #30159)
  • Progress was made on the upcoming RLModule/RLTrainer/RLOptimizer APIs. (#30135, #29600, #29599, #29449, #29642)
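
A sketch of the AlgorithmConfig style using PPO (hyperparameter values are illustrative):

    from ray.rllib.algorithms.ppo import PPOConfig

    # Typed, chainable config object instead of a raw python dict.
    config = (
        PPOConfig()
        .environment("CartPole-v1")
        .rollouts(num_rollout_workers=2)
        .training(lr=5e-5, train_batch_size=4000)
    )
    algo = config.build()
    print(algo.train()["episode_reward_mean"])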

🔨 Fixes:

  • Various bug fixes: #25925, #30279, #30478, #30461, #29867, #30099, #30185, #29222, #29227, #29494, #30257, #29798, #30176, #29648, #30331

Ray Core and Ray Clusters

Ray Core

💫Enhancements:

  • The Ray Jobs API has graduated from Beta to GA. This means Ray Jobs will maintain API backward compatibility.
  • Run Ray job entrypoint commands (“driver scripts”) on worker nodes by specifying entrypoint_num_cpus, entrypoint_num_gpus, or entrypoint_resources. (#28564, #28203) See the sketch after this list.
  • (Beta) OpenAPI spec for Ray Jobs REST API (#30417)
  • Improved the Ray health checking mechanism. The fix reduces how often the GCS mistakenly marks raylets as failed when it is overloaded. (#29346, #29442, #29389, #29924)
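
A sketch using the Python SDK (the cluster address and entrypoint are illustrative; entrypoint_num_cpus reserves CPUs for the driver script itself):

    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient("http://127.0.0.1:8265")

    job_id = client.submit_job(
        entrypoint="python script.py",
        runtime_env={"working_dir": "./"},
        entrypoint_num_cpus=1,  # schedule the driver where 1 CPU is free
    )
    print(job_id)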

🔨 Fixes:

  • Various fixes for hanging / deadlocking (#29491, #29763, #30371, #30425)
  • Set OMP_NUM_THREADS to the num_cpus required by the task/actor by default (#30496)
  • Make workers non-reusable by default if GPUs are involved (#30061)

📖Documentation:

  • General improvements of Ray Core docs, including design patterns and tasks.

Ray Clusters

💫Enhancements:

  • Stability improvements for Ray Autoscaler / KubeRay Operator integration. (#29933 , #30281, #30502)

Dashboard

🎉 New Features:

  • Additional improvements to the default metrics dashboard. We now have actor, placement group, and per-component memory usage breakdowns. You can see details in the documentation.
  • New profiling feature using py-spy under the hood. You can click buttons to see stack traces or CPU flame graphs of your workers.
  • Autoscaler and job events are available from the dashboard. You can also access the same data using ray list cluster-events.

🔨 Fixes:

  • Stability improvements to the dashboard
  • The dashboard now works on large-scale clusters! It has been tested with 250 nodes and 10K+ actors (which matches the Ray scalability envelope).
    • Smarter API fetching logic: we now wait for the previous API request to finish before sending a new one when polling for data.
    • Fix agent memory leak and high CPU usage.

💫Enhancements:

  • General improvements to the progress bar. You can now see progress bars for each task name if you drill into the job details.
  • More metadata is available in the jobs and actors tables.
  • There is now a feedback button embedded into the dashboard. Please submit any bug reports or suggestions!

Many thanks to all those who contributed to this release!

@shrekris-anyscale, @rickyyx, @scottjlee, @shogohida, @liuyang-my, @matthewdeng, @wjrforcyber, @linusbiostat, @clarkzinzow, @justinvyu, @zygi, @christy, @amogkam, @cool-RR, @jiaodong, @EvgeniiTitov, @jjyao, @ilee300a, @jianoaix, @rkooo567, @mattip, @maxpumperla, @ericl, @cadedaniel, @bveeramani, @rueian, @stephanie-wang, @lcipolina, @bparaj, @JoonHong-Kim, @avnishn, @tomsunelite, @larrylian, @alanwguo, @VishDev12, @c21, @dmatrix, @xwjiang2010, @thomasdesr, @tiangolo, @sokratisvas, @heyitsmui, @scv119, @pcmoritz, @bhavika, @yzs981130, @andraxin, @Chong-Li, @clarng, @acxz, @ckw017, @krfricke, @kouroshHakha, @sijieamoy, @iycheng, @gjoliver, @peytondmurray, @xcharleslin, @DmitriGekhtman, @andreichalapco, @vitrioil, @architkulkarni, @simon-mo, @ArturNiederfahrenhorst, @sihanwang41, @pabloem, @sven1977, @avivhaber, @wuisawesome, @jovany-wang, @Yard1