OpenUCX UCC Release Notes

Unified Collective Communication Library

v1.3.0 (April 18, 2024)

New Features and Enhancements

CL/HIER

  • Disable onesided alltoallv (#875)

TL/CUDA

  • Initialize remote CUDA scratch to NULL (#911)

TL/UCP

  • Enable hybrid alltoallv (#781)
  • Avoid copy in knomial scatter (#771)
  • Enable rank reordering for reduce_scatter, knomial allreduce, and ring allgather/allgatherv (#819)
  • Remove memcpy in last SRA step (#743)
  • Fix sparse pack in hybrid alltoallv (#825)
  • Fix recycle in hybrid alltoallv (#827)
  • Reorder ranks for SRA (#834)
  • Use ring allgather when reordering is needed (#879)
  • Use pipelining in SRA allreduce for CUDA (#873)
  • Poll for onesided alltoall completion (#876)
  • Add support for non-host buffers in Bruck alltoall (#852)
  • Add neighbor exchange allgather (#822)
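
Many of the TL/UCP changes above add or tune specific algorithms (SRA knomial allreduce, Bruck alltoall, neighbor exchange allgather). As a rough sketch of how a user might opt into one of them: UCC reads score strings such as UCC_TL_UCP_TUNE from the environment at init time. The exact tuning-string grammar and the algorithm names used below are assumptions to verify against the installed build; the library calls themselves are the public ucc.h API.

```c
/* Sketch (assumed tuning-string format): request TL/UCP's SRA knomial
 * allreduce and Bruck alltoall before initializing the library. */
#include <stdlib.h>
#include <ucc/api/ucc.h>

int main(void)
{
    /* Must be set before ucc_init() reads the environment. */
    setenv("UCC_TL_UCP_TUNE", "allreduce:@sra_knomial,alltoall:@bruck", 1);

    ucc_lib_config_h cfg;
    ucc_lib_params_t params = {
        .mask        = UCC_LIB_PARAM_FIELD_THREAD_MODE,
        .thread_mode = UCC_THREAD_SINGLE,
    };
    ucc_lib_h lib;

    if (ucc_lib_config_read(NULL, NULL, &cfg) != UCC_OK) {
        return 1;
    }
    ucc_status_t st = ucc_init(&params, cfg, &lib);
    ucc_lib_config_release(cfg);
    if (st != UCC_OK) {
        return 1;
    }
    /* ... create context/team and run collectives ... */
    ucc_finalize(lib);
    return 0;
}
```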

TL/SHARP

  • Enable bcast for any predefined datatype (#774)
  • Don't print team create error (#777)
  • Check that the data size is supported (#776)
  • Fix SHARP context cleanup (#843)

API

  • Remove duplicate get_version_string (#933)

TL/NCCL

  • Make team init non-blocking (#772)
  • Add CUDA managed memory to score (#793)
  • Make ncclGroupEnd non-blocking (#798)
  • Lazily initialize the NCCL communicator (#851)

TL/MLX5

  • Share ib_ctx and pd (#749)
  • Add registration cache (rcache) (#753)
  • Device memory and topology init (#780)
  • Add mcast interface (#784)
  • Alltoall part 1: collective init (#790)
  • Alltoall part 2: full collective (#802)
  • Revisit team and context init (#815)
  • Fix context create hang (#887)
  • Add librdmacm linkage (#910)

CORE

  • Fix score update when only the score is given (#779)
  • Coverity fixes (#809)
  • Additional Coverity fixes (#813)
  • Fix error handling in context create epilogue (#818)
  • Skip zero-size collectives (#787)

DOCS

  • Update NEWS for v1.2 (#791)
  • Update NEWS for v1.3 (#937)

BUILD and TEST

  • Update build system to enable UCC with ROCm 6.x (#906, #917)
  • Check op and datatype compatibility (#773)
  • Fix barrier test (#799)
  • Propagate HIP_CXXFLAGS to gtest and MPI tests (#803)

v1.3.0-rc1

Release candidate for v1.3.0; see the v1.3.0 notes above for the list of changes.

v1.2.0

This release includes numerous updates, bug fixes, and improvements across various components. The following is a summary of the changes based on the commit messages:

New Features and Enhancements

CL/HIER

  • Fixed single proc on node issue in alltoall (#658)
  • Implemented allreduce rab pipelined (#608)
  • Added bcast 2step algorithm (#620)
  • Fixed allreduce rab pipeline (#759)

TL/CUDA

  • Added support for CUDA 12
  • Fixed cache unmap issue (#642)
  • Implemented reduce scatter linear (#669)
  • Added algorithm selection based on topology (#688)
  • Fixed linear algorithms (#751)
  • Fixed pipelining in linear rs (#770)

TL/UCP

  • Added special service worker (#560)
  • Added scatterv (#663)
  • Added gatherv (#664)
  • Fixed running with npolls 0 (#695)
  • Added knomial allgather (#729)
  • Fixed bug for triggered colls (#757)
  • Added bruck alltoall (#756)
  • Added SLOAV alltoallv (#687)
  • Large message broadcast optimizations (#738)
  • Rank reordering in ring allgather for better locality (#69)

TL/SHARP

  • Fixed memory type check in allreduce (#662)
  • Added support for sharpv3 dt (#661)
  • Fixed assert check (#686)
  • Implemented SHARP OOB fixes (#746)
  • Fixed local rank when NODE SBGP not enabled (#760)
  • Prevented SHARP team creation when team max ppn > 1 (#761)

CORE

  • Fixed memory type score update (#650)
  • Fixed ucc parser build (#666)
  • Implemented ucc_pipeline_params (#675)
  • Changed log level of config_modify (#667)
  • Fixed timeout handle for triggered post (#679)

DOCS

  • Added User Guide (#720)

v1.2.0-rc1

Release candidate for v1.2.0; see the v1.2.0 notes above for the list of changes.

v1.1.0

Features

API

  • Added float128 and complex float32/64/128 data types
  • Added Active Sets based collectives to support dynamic groups as well as point-to-point messaging
  • Added ucc_team_get_attr interface
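
A minimal sketch of the new ucc_team_get_attr interface, querying a team's size and this process's endpoint. The attribute mask and field names (UCC_TEAM_ATTR_FIELD_SIZE / _EP) follow ucc.h's attribute-struct pattern and should be verified against the installed headers.

```c
/* Sketch: query team attributes via ucc_team_get_attr (added in v1.1.0). */
#include <inttypes.h>
#include <stdio.h>
#include <ucc/api/ucc.h>

static int print_team_info(ucc_team_h team)
{
    ucc_team_attr_t attr;
    /* Request only the fields we read below. */
    attr.mask = UCC_TEAM_ATTR_FIELD_SIZE | UCC_TEAM_ATTR_FIELD_EP;
    if (ucc_team_get_attr(team, &attr) != UCC_OK) {
        return -1;
    }
    printf("team size: %u, my ep: %" PRIu64 "\n", attr.size, attr.ep);
    return 0;
}
```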

Core

  • Config file support
  • Fixed component search

CL

  • Added split rail allreduce collective implementation
  • Enabled hierarchical alltoallv and barrier
  • Fixed cleanup bugs

TL

  • Added SELF TL supporting team size one

UCP

  • Added service broadcast
  • Added reduce_scatterv ring algorithm
  • Added k-nomial based gather collective implementation
  • Added one-sided get based algorithms

SHARP

  • Fixed SHARP OOB
  • Added SHARP broadcast

GPU Collectives (CUDA, NCCL TL and RCCL TL)

  • Added RCCL TL to support RCCL collectives
  • Added support for CUDA TL (intranode collectives for NVIDIA GPUs)
  • Added multiring allgatherv, alltoall, reduce-scatter, and reduce-scatterv in CUDA TL
  • Added topo-based ring construction in CUDA TL to maximize bandwidth
  • Added NCCL gather and scatter and their vector variants
  • Enabled using multiple streams for collectives
  • Added support for RCCL gather(v), scatter(v), broadcast, allgather(v), barrier, alltoall(v), and allreduce collectives
  • Added ROCm memory component
  • Adapted all GPU collectives to executor design
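
To illustrate what these GPU paths consume, here is a hedged sketch of an allreduce over CUDA device buffers using the standard UCC collective API; d_src/d_dst are assumed to be valid device allocations, and whether the CUDA, NCCL, or RCCL TL executes the operation is decided by UCC's scoring, not by this code.

```c
/* Sketch: allreduce on device memory; the GPU TL is chosen by UCC. */
#include <ucc/api/ucc.h>

static ucc_status_t allreduce_cuda(void *d_src, void *d_dst, uint64_t count,
                                   ucc_team_h team, ucc_context_h ctx)
{
    ucc_coll_args_t args = {
        .mask      = 0,
        .coll_type = UCC_COLL_TYPE_ALLREDUCE,
        .op        = UCC_OP_SUM,
        .src.info  = { .buffer   = d_src,
                       .count    = count,
                       .datatype = UCC_DT_FLOAT32,
                       .mem_type = UCC_MEMORY_TYPE_CUDA },
        .dst.info  = { .buffer   = d_dst,
                       .count    = count,
                       .datatype = UCC_DT_FLOAT32,
                       .mem_type = UCC_MEMORY_TYPE_CUDA },
    };
    ucc_coll_req_h req;
    ucc_status_t   st;

    if ((st = ucc_collective_init(&args, &req, team)) != UCC_OK) {
        return st;
    }
    if ((st = ucc_collective_post(req)) != UCC_OK) {
        ucc_collective_finalize(req);
        return st;
    }
    while ((st = ucc_collective_test(req)) == UCC_INPROGRESS) {
        ucc_context_progress(ctx); /* drive asynchronous progress */
    }
    ucc_collective_finalize(req);
    return st;
}
```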

Tests

  • Added tests for triggered collectives in perftests
  • Fixed bugs in multi-threading tests

Utils

  • Added CPU model and vendor detection
  • Several bug fixes in all components

v1.1.0-rc1

Release candidate for v1.1.0; see the v1.1.0 notes above for the list of changes.

v1.0.0

Features

API

  • Added Avg reduce operation
  • Added nonblocking team destroy option
  • Added user-defined datatype definitions
  • Added Bfloat16 type
  • Clarified semantics of core abstractions, including teams and contexts
  • Added timeout option

Core

  • Added coll scoring and selection support
  • Added support for Triggered collectives
  • Added support for timeouts in collectives
  • Added support for team create without ep in post
  • Added support for multithreaded context progress
  • Added support for nonblocking team destroy
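
For the nonblocking team destroy added here, a minimal polling sketch (variable names are illustrative):

```c
/* Sketch: ucc_team_destroy() may return UCC_INPROGRESS; the caller
 * completes it by progressing the context. */
#include <ucc/api/ucc.h>

static ucc_status_t destroy_team_nb(ucc_team_h team, ucc_context_h ctx)
{
    ucc_status_t st;
    while ((st = ucc_team_destroy(team)) == UCC_INPROGRESS) {
        ucc_context_progress(ctx);
    }
    return st; /* UCC_OK on success, error code otherwise */
}
```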

CL

  • Added support for hierarchical collectives
  • Added support for hierarchical allreduce collective operation
  • Added support for collectives based on one-sided communication routines

TL

  • Added SHARP TL

UCP

  • Added bcast SAG (scatter-allgather) algorithm for large messages
  • Added knomial-based reduce algorithm
  • Made allgather and alltoall agree with the API
  • Added SRA (scatter-reduce-allgather) knomial allreduce algorithm
  • Added pairwise alltoall and alltoallv algorithms
  • Added allgather and allgatherv ring algorithms
  • Added support for collective operations based on one-sided semantics
  • Added support for alltoall with one-sided transfer semantics
  • Bug fixes

SHARP

  • Added support for switch-based hardware collectives (SHARP)

NCCL

  • Added support for NCCL allreduce, alltoall, alltoallv, barrier, reduce, reduce-scatter, bcast, allgather, and allgatherv

Tests

  • Updated tests to test the newly added algorithms and operations

v1.0.0-rc2

Release candidate for v1.0.0; see the v1.0.0 notes above for the list of changes.

v0.1.0

This is an early release of the UCC API and its implementation. Major features in this release are detailed below.

Features

API

  • Added the UCC API supporting library, contexts, teams, collective operations, execution engine, memory types, and triggered operations

Core

  • Added implementations of the UCC abstractions: library, context, team, collective operations, execution engine, memory types, and triggered operations
  • Added support for CUDA and CPU memory types
  • Added support for configuring the UCC library and contexts
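
A minimal bootstrap sketch for this first release's library/context configuration flow; error-path cleanup is trimmed for brevity, and the shared context type is just one plausible choice.

```c
/* Sketch: read configuration from the environment, initialize the
 * library, and create a context. */
#include <ucc/api/ucc.h>

static int ucc_bootstrap(ucc_lib_h *lib, ucc_context_h *ctx)
{
    ucc_lib_config_h     lib_cfg;
    ucc_context_config_h ctx_cfg;
    ucc_lib_params_t     lib_params = {
        .mask        = UCC_LIB_PARAM_FIELD_THREAD_MODE,
        .thread_mode = UCC_THREAD_SINGLE,
    };
    ucc_context_params_t ctx_params = {
        .mask = UCC_CONTEXT_PARAM_FIELD_TYPE,
        .type = UCC_CONTEXT_SHARED,
    };

    if (ucc_lib_config_read(NULL, NULL, &lib_cfg) != UCC_OK) {
        return -1;
    }
    ucc_status_t st = ucc_init(&lib_params, lib_cfg, lib);
    ucc_lib_config_release(lib_cfg);
    if (st != UCC_OK) {
        return -1;
    }
    if (ucc_context_config_read(*lib, NULL, &ctx_cfg) != UCC_OK) {
        return -1;
    }
    st = ucc_context_create(*lib, &ctx_params, ctx_cfg, ctx);
    ucc_context_config_release(ctx_cfg);
    return (st == UCC_OK) ? 0 : -1;
}
```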

CL

  • Added support for collectives with source and destination buffers in either CPU or device (GPU) memory
  • Added support for UCC_THREAD_MULTIPLE
  • Added support for CUDA stream-based collectives

TL

  • Added support for send/receive-based collectives using UCX/UCP as a transport layer
  • Added support in the UCP TL for basic collectives including barrier, alltoall, alltoallv, broadcast, allgather, allgatherv, and allreduce
  • Added support for using NCCL as a transport layer
  • Added support in the NCCL TL for alltoall, alltoallv, allgather, allgatherv, allreduce, and broadcast
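
As a usage sketch for these first collectives, here is a barrier over an existing team (for example one set up as in the bootstrap sketch above), driven to completion with the test/progress loop:

```c
/* Sketch: post a barrier and poll it to completion. */
#include <ucc/api/ucc.h>

static ucc_status_t run_barrier(ucc_team_h team, ucc_context_h ctx)
{
    ucc_coll_args_t args = {
        .mask      = 0,
        .coll_type = UCC_COLL_TYPE_BARRIER,
    };
    ucc_coll_req_h req;
    ucc_status_t   st;

    if ((st = ucc_collective_init(&args, &req, team)) != UCC_OK) {
        return st;
    }
    if ((st = ucc_collective_post(req)) != UCC_OK) {
        ucc_collective_finalize(req);
        return st;
    }
    while ((st = ucc_collective_test(req)) == UCC_INPROGRESS) {
        ucc_context_progress(ctx);
    }
    ucc_collective_finalize(req);
    return st;
}
```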

Tests

  • Added support for unit testing (gtest) infrastructure
  • Added support for MPI tests

v0.1.0-rc1

Release candidate for v0.1.0; see the v0.1.0 notes above for the list of changes.