[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl
CUB 2.1.0

- Deprecated the CUB_USE_COOPERATIVE_GROUPS macro, as all supported CTK distributions provide CG. This macro will be removed in a future version of CUB.
- Added the new DeviceBatchMemcpy algorithm.
- Added the DeviceMergeSort::StableSortKeysCopy API. Thanks to David Wendt (@davidwendt) for this contribution.
- Updates to DeviceRadixSort. Thanks to Andy Adinets (@canonizer) for this contribution.
- Added the CUB_DISABLE_CDP macro for opting out of compiling CDP support in CUB algorithms.
- Updates to DeviceReduce.
- Fixed a bug in BlockShuffle resulting from an invalid thread offset. Thanks to @sjfeng1999 for this contribution.
- Fixed a bug in BlockRadixRank when used with blocks that are not a multiple of 32 threads.
- Updates to BlockRadixRank. Thanks to Andy Adinets (@canonizer) for this contribution.
- Fixed a bug in DeviceSegmentedSort when used with bool keys.
- Fixed a bug in DeviceReduce.
- Fixed a bug in DeviceRunLengthEncode when the first item is a NaN.
- Fixed a bug in WarpScanExclusive for vector types.
- Clarified the BlockReduceRaking docs for non-commutative operations. Thanks to Tobias Ribizel (@upsj) for this contribution.

CUB 1.17.2

CUB 1.17.2 is a minor bugfix release.

CUB 2.0.0
The CUB 2.0.0 major release adds a dependency on libcu++ and contains several breaking changes. These include new diagnostics when inspecting device-only lambdas from the host, an updated method of determining accumulator types for algorithms like Reduce and Scan, and a compile-time replacement for the runtime debug_synchronous debugging flags.

This release also includes several new features. DeviceHistogram now supports __half and better handles various edge cases. WarpReduce now performs correctly when restricted to a single-thread “warp”, and will use the __reduce_add_sync accelerated intrinsic (introduced with Ampere) when appropriate. DeviceRadixSort learned to handle the case where begin_bit == end_bit.

Several algorithms also have updated documentation, with a particular focus on clarifying which operations can and cannot be performed in-place.

Breaking Changes
- The following macros are no longer defined by default. They can be re-enabled by defining CUB_PROVIDE_LEGACY_ARCH_MACROS. These will be completely removed in a future release. A porting sketch follows this item.
  - CUB_IS_HOST_CODE: Replace with NV_IF_TARGET.
  - CUB_IS_DEVICE_CODE: Replace with NV_IF_TARGET.
  - CUB_INCLUDE_HOST_CODE: Replace with NV_IF_TARGET.
  - CUB_INCLUDE_DEVICE_CODE: Replace with NV_IF_TARGET.
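For illustration, a minimal sketch of porting from the legacy code-path macros to NV_IF_TARGET, which comes from libcu++'s <nv/target> header. The log_context function is a hypothetical example, not CUB code:

```cuda
#include <nv/target>
#include <cstdio>

// Hypothetical function that previously branched on CUB_IS_HOST_CODE /
// CUB_IS_DEVICE_CODE preprocessor blocks.
__host__ __device__ void log_context()
{
  NV_IF_TARGET(NV_IS_DEVICE,
               (printf("called from device code\n");),  // device pass
               (printf("called from host code\n");));   // host pass
}
```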
- CUB's CUDA Runtime support macros have been updated to support NV_IF_TARGET. They are now defined consistently across all host/device compilation passes. This should not affect most usages of these macros, but may require changes for some edge cases.
  - CUB_RUNTIME_FUNCTION: Execution space annotations for functions that invoke CUDA Runtime APIs.
    - Old behavior:
      - RDC enabled: defined to __host__ __device__.
      - RDC not enabled:
        - NVCC host pass: defined to __host__ __device__.
        - NVCC device pass: defined to __host__.
    - New behavior:
      - RDC enabled: defined to __host__ __device__.
      - RDC not enabled: defined to __host__.
  - CUB_RUNTIME_ENABLED: No change in behavior, but no longer used in CUB. Provided for legacy support only. Legacy behavior:
    - RDC enabled: macro is defined.
    - RDC not enabled:
      - NVCC host pass: macro is defined.
      - NVCC device pass: macro is not defined.
  - CUB_RDC_ENABLED: New macro, may be combined with NV_IF_TARGET to replace most usages of CUB_RUNTIME_ENABLED (see the sketch below). Behavior:
    - RDC enabled: macro is defined.
    - RDC not enabled: macro is not defined.
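A sketch of how CUB_RDC_ENABLED and NV_IF_TARGET might replace a CUB_RUNTIME_ENABLED guard around a device-side launch; the kernel and wrapper here are hypothetical:

```cuda
#include <cub/cub.cuh>  // defines CUB_RDC_ENABLED when RDC is on
#include <nv/target>

__global__ void child_kernel() {}

__host__ __device__ void maybe_launch_child()
{
  // Old: #ifdef CUB_RUNTIME_ENABLED, which differed between compilation
  // passes. New: CUB_RDC_ENABLED is defined iff relocatable device code is
  // enabled, consistently across all host/device passes.
#ifdef CUB_RDC_ENABLED
  NV_IF_TARGET(NV_IS_DEVICE, (child_kernel<<<1, 1>>>();));
#endif
}
```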
- A compile-time error is now emitted when a __device__-only lambda's return type is queried from host code (requires libcu++ ≥ 1.9.0), since the result of such a query is unreliable in the CUDA programming model. The error suggests the following work-arounds:
  - Use a named function object with a __device__-only implementation of operator().
  - Use a __host__ __device__ lambda.
  - Use cuda::proclaim_return_type (added in libcu++ 1.9.0; see the sketch below).
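A sketch of the third work-around, assuming nvcc's --extended-lambda mode; cuda::proclaim_return_type lives in libcu++'s <cuda/functional> header, and the wrapper function is hypothetical:

```cuda
#include <cub/cub.cuh>
#include <cuda/functional>  // cuda::proclaim_return_type (libcu++ >= 1.9.0)

void reduce_with_lambda(void* d_temp, size_t& temp_bytes,
                        const float* d_in, float* d_out, int num_items)
{
  // A plain [] __device__ lambda would now trigger the compile-time error,
  // because its return type cannot be queried reliably from host code.
  auto op = cuda::proclaim_return_type<float>(
      [] __device__(float a, float b) { return a + b; });

  cub::DeviceReduce::Reduce(d_temp, temp_bytes, d_in, d_out, num_items,
                            op, 0.0f);
}
```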
- The DeviceReduce algorithm now uses the result type of the binary reduction operator for accumulating intermediate results, following guidance from http://wg21.link/P2322R6 (see the sketch below). In addition to the behavioral changes, interfaces in the Dispatch*Reduce layer have changed:
  - DispatchReduce:
    - Now accepts an accumulator type as a template parameter.
    - Now uses the type of init as the initial value type instead of the output iterator's value type.
  - DispatchSegmentedReduce:
    - The same changes as DispatchReduce.
  - Thread operators now accept parameters of different types: Equality, Inequality, InequalityWrapper, Sum, Difference, Division, Max, ArgMax, Min, ArgMin.
  - ThreadReduce now accepts an accumulator type and uses a different type for prefix.
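A sketch of the new accumulator-type behavior: with a double initial value and cub::Sum over float input, intermediates now accumulate in double (the operator's result type) rather than in the output iterator's value type. Buffer names are placeholders:

```cuda
#include <cub/cub.cuh>

void sum_floats_in_double(const float* d_in, double* d_out, int num_items)
{
  // Per P2322R6, the accumulator type is the result of applying cub::Sum
  // to the double init and a float element, i.e. double.
  void*  d_temp     = nullptr;
  size_t temp_bytes = 0;

  // First call computes the required temporary storage size...
  cub::DeviceReduce::Reduce(d_temp, temp_bytes, d_in, d_out, num_items,
                            cub::Sum{}, 0.0);
  cudaMalloc(&d_temp, temp_bytes);

  // ...second call runs the reduction.
  cub::DeviceReduce::Reduce(d_temp, temp_bytes, d_in, d_out, num_items,
                            cub::Sum{}, 0.0);
  cudaFree(d_temp);
}
```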
- The DeviceScan, DeviceScanByKey, and DeviceReduceByKey algorithms now use the result type of the binary operator for accumulating intermediate results, also following guidance from http://wg21.link/P2322R6 (see the sketch below). Interfaces in the Dispatch layer have changed accordingly:
  - DispatchScan now accepts the accumulator type as a template parameter.
  - DispatchScanByKey now accepts the accumulator type as a template parameter.
  - DispatchReduceByKey now accepts the accumulator type as the last template parameter.
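The same idea applies to the scan family; a brief sketch with DeviceScan::ExclusiveScan, where the double initial value again makes the accumulator double:

```cuda
#include <cub/cub.cuh>

void exclusive_sum(const float* d_in, double* d_out, int num_items)
{
  void*  d_temp     = nullptr;
  size_t temp_bytes = 0;
  cub::DeviceScan::ExclusiveScan(d_temp, temp_bytes, d_in, d_out,
                                 cub::Sum{}, 0.0, num_items);
  cudaMalloc(&d_temp, temp_bytes);
  cub::DeviceScan::ExclusiveScan(d_temp, temp_bytes, d_in, d_out,
                                 cub::Sum{}, 0.0, num_items);
  cudaFree(d_temp);
}
```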
- Deprecated the debug_synchronous flags on device algorithms. The run-time flag no longer has any effect; define CUB_DEBUG_SYNC during compilation to enable these checks instead (see the sketch below).
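A sketch of the before/after, assuming an old-style call that passed the flag as the trailing argument:

```cuda
// CUB 1.x requested synchronous debugging per call at run time; that
// trailing flag is now a no-op:
//   cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n,
//                          stream, /* debug_synchronous = */ true);

// CUB 2.0 requests it for the whole translation unit at compile time:
#define CUB_DEBUG_SYNC  // or compile with: nvcc -DCUB_DEBUG_SYNC ...
#include <cub/cub.cuh>
```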
New Features
- Support for __half in DeviceHistogram.
- Support for single-threaded invocations of WarpReduce.
- Use of __reduce_add_sync hardware acceleration for WarpReduce on supported architectures (see the sketch below).
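A minimal WarpReduce sketch; the kernel is hypothetical. On supported architectures the integer Sum below can map to __reduce_add_sync, and a single-thread logical warp (shown in the comment) is now valid:

```cuda
#include <cub/cub.cuh>

__global__ void warp_sum_kernel(const int* d_in, int* d_out)
{
  using WarpReduce = cub::WarpReduce<int>;  // cub::WarpReduce<int, 1> now works too
  __shared__ typename WarpReduce::TempStorage temp_storage;

  int item = d_in[threadIdx.x];
  int sum  = WarpReduce(temp_storage).Sum(item);  // result valid in lane 0

  if (threadIdx.x == 0)
  {
    *d_out = sum;  // e.g. launched as warp_sum_kernel<<<1, 32>>>(...)
  }
}
```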
Bug Fixes
- Fixed handling of the begin_bit == end_bit case in DeviceRadixSort.
- Fixed DeviceHistogram::Even for a variety of edge cases (see the sketch after this list):
  - Bin ids are now computed correctly when mixing different types for SampleT and LevelT.
  - Bin ids are now computed correctly when LevelT is an integral type and the number of levels does not evenly divide the level range.
- Ensured that temp_storage_bytes is properly set in the AdjacentDifferenceCopy device algorithms.
- Removed excess calls to Debug in the AdjacentDifferenceCopy device algorithms.
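A sketch of the first fixed case: float samples (SampleT) binned against int levels (LevelT). Names are placeholders:

```cuda
#include <cub/cub.cuh>

void histogram_even(const float* d_samples, int* d_histogram, int num_samples)
{
  constexpr int num_levels  = 7;   // 6 evenly-sized bins
  const int     lower_level = 0;   // LevelT = int, SampleT = float
  const int     upper_level = 12;

  void*  d_temp     = nullptr;
  size_t temp_bytes = 0;
  cub::DeviceHistogram::HistogramEven(d_temp, temp_bytes, d_samples,
                                      d_histogram, num_levels, lower_level,
                                      upper_level, num_samples);
  cudaMalloc(&d_temp, temp_bytes);
  cub::DeviceHistogram::HistogramEven(d_temp, temp_bytes, d_samples,
                                      d_histogram, num_levels, lower_level,
                                      upper_level, num_samples);
  cudaFree(d_temp);
}
```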
Other Enhancements
- Replaced several usages of __CUDA_ARCH__ with <nv/target> to handle host/device code divergence.
- Added CUB_DISABLE_BF16_SUPPORT to avoid including the cuda_bf16.h header or using the __nv_bfloat16 type.
Documentation
- Documented in-place support for the DeviceScan algorithms.
- Documented in-place support for the DeviceHistogram algorithms.
- Documented in-place support for the DevicePartition algorithms.
- Documented in-place support for the Device*Sort algorithms.
- Documented in-place support for the DeviceReduce algorithms.
- Documented in-place support for the DeviceRunLengthEncode algorithms.
- Documented in-place support for the DeviceSelect algorithms.
- Fixed a typo in the WarpMergeSort documentation.
CUB 1.17.1

CUB 1.17.1 is a minor bugfix release.
- Ensured that temp_storage_bytes is properly set in the AdjacentDifferenceCopy device algorithms (see the sketch below).
- Removed excess calls to Debug in the AdjacentDifferenceCopy device algorithms.
- Fixed a bug in DeviceSegmentedSort.
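For context, a sketch of the two-phase pattern whose size query these fixes concern; the first call must set temp_storage_bytes correctly before any allocation happens. Names are placeholders:

```cuda
#include <cub/cub.cuh>

void subtract_left_copy(const int* d_in, int* d_out, int num_items)
{
  void*  d_temp     = nullptr;
  size_t temp_bytes = 0;

  // Query pass: d_temp is null, so only temp_bytes is written.
  cub::DeviceAdjacentDifference::SubtractLeftCopy(d_temp, temp_bytes,
                                                  d_in, d_out, num_items);
  cudaMalloc(&d_temp, temp_bytes);

  // Compute pass.
  cub::DeviceAdjacentDifference::SubtractLeftCopy(d_temp, temp_bytes,
                                                  d_in, d_out, num_items);
  cudaFree(d_temp);
}
```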
CUB 1.17.0

CUB 1.17.0 is the final minor release of the 1.X series. It provides a variety of bug fixes and miscellaneous enhancements, detailed below.
Known Issues
Several CUB device algorithms are documented to provide deterministic results (per device) for non-associative reduction operators (e.g. floating-point addition). Unfortunately, the implementations of these algorithms contain performance optimizations that violate this guarantee. The DeviceReduce::ReduceByKey and DeviceScan algorithms are known to be affected. We’re currently evaluating the scope and impact of correcting this in a future CUB release. See NVIDIA/cub#471 for details.
Bug Fixes
- Fixed DeviceSelect to work with discard iterators and mixed input/output types.
- Fixed an installation issue when CMAKE_INSTALL_LIBDIR contained nested directories. Thanks to @robertmaynard for this contribution.
- Fixed a bug that produced incorrect results from DeviceSegmentedSort on sm_61 and sm_70.
- Fixed DeviceSelect::Flagged so that flags are normalized to 0 or 1 (see the sketch after this list).
- Fixed overflow issues in DeviceRadixSort given num_items close to 2^32. Thanks to @canonizer for this contribution.
- Fixed a bug in BlockAdjacentDifference. Thanks to @MKKnorr for this contribution.
- Fixed a bug in DeviceSegmentedSort when launched via CDP.
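A sketch of DeviceSelect::Flagged related to the flag-normalization fix: char flags whose non-zero values count as selected, compacting float input into an output buffer. Names are placeholders:

```cuda
#include <cub/cub.cuh>

void compact_flagged(const float* d_in, const char* d_flags, float* d_out,
                     int* d_num_selected, int num_items)
{
  void*  d_temp     = nullptr;
  size_t temp_bytes = 0;
  cub::DeviceSelect::Flagged(d_temp, temp_bytes, d_in, d_flags, d_out,
                             d_num_selected, num_items);
  cudaMalloc(&d_temp, temp_bytes);
  cub::DeviceSelect::Flagged(d_temp, temp_bytes, d_in, d_flags, d_out,
                             d_num_selected, num_items);
  cudaFree(d_temp);
}
```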
Other Enhancements
- BlockDiscontinuity: Replaced recursive-template loop unrolling with #pragma unroll. Thanks to @kshitij12345 for this contribution.
- Replaced the deprecated TexRefInputIterator implementation with an alias to TexObjInputIterator. This fully removes all usages of the deprecated CUDA texture reference APIs from CUB.
- BlockAdjacentDifference: Replaced recursive-template loop unrolling with #pragma unroll. Thanks to @kshitij12345 for this contribution.
- The cub::DeviceAdjacentDifference API has been updated to use the new OffsetT deduction approach described in NVIDIA/cub#212.