Cuda Api Wrappers Versions Save

Thin, unified, C++-flavored wrappers for the CUDA APIs

v0.6.9-b3

2 months ago

Changes since v0.6.8:

  • #606 Can now copy directly to and from containers with contiguous storage - without going through pointers or specifying the size

Owning typed and untyped memory: unique_span and unique_region

  • #291 Added a unique_span<T> template class, combining the functionality of cuda::unique_ptr and cuda::span (and being somewhat similar to std::dynarray which almost made it into C++14). Many CUDA programs want to represent both the ownership of allocated memory, and the range of that memory for actual use, in the same variable - without the on-the-fly reallocation behavior of std::vector. This is now possible. Also implemented an untyped version of this, named unique_region.
  • #617 Replaced memory::external::mapped_region_t with memory::unique_region
  • #601 Added an empty() method to cuda::span (to match that of std::span - as it is no sometimes used)
  • #603 Use unique_span instead of our cuda::dynarray (which had been an std::vector under the hood), in various places in the API, especially RTC
  • #610 Return unique_span's from the cuda::rtc::program_output class methods which allocated their own buffers: The methods for getting the compilation log, the cubin data, the PTX and the LTO IR.

More robust memory regions (memory::region_t)

  • #594 Now using memory::region_t for mapped memory rather than a different, mapped-memory specific region class
  • #602 Make memory::region_t more constexpr-friendly
  • #604 memory::region_t's are now CUDA-independent, i.e. do not utilize any CUDA-specific definitions
  • #605 Can now construct const_region_t's from rvalue references to regions

Documentation & comments

  • #595 Correct the documentation for supports_memory_pools

Launch configuration & launch config builder changes

  • #596 Corrected a check against the associated device in the kernel-setting method of the launch config builder
  • #619, #618 Fixed launch configuration comparisons and now user defaulted comparison
  • #619 Fixed a bug in checking whether some CUDA-12-introduced launch config parameters are set

CUDA libraries and in-library, non-associated kernel support

  • #598 Corrected the API and implementation of get_attribute() and set_attribute() for library kernels

Internal refactoring

  • #607 Split off a detail/type_traits.hpp from types.hpp
  • #620 context::current::scoped_override_t now declared in current_context.hpp
  • #611 Reduced code repetition between context_t and primary_context_t.
  • #622 link::marshalled_options_t and link::option_t are now in the detail_ namespace - the user should typically never used
  • #624 Now collecting the log-related link options into a sub-structure of link::options_t
  • #625 Dropped specify_default_load_caching_mode from link::options_t, in favor of using an stdx::optional
  • #626 Now using optional's instead of bespoke constructs in pci_location_t
  • #628 Corrected the signature of context::current::peer_to_peer functions
  • #630 Moved program_base_t into the detail_ namespace
  • #632 Move rtc::marshalled_options_t and rtc::marshal() into the detail_ subnamespace - users should not need to use this themselves
  • #643 Moved memory::pool::ipc::ptr_handle_t out of ipc.hpp up into types.hpp (so that memory_pool.hpp doesn't depend on ipc.hpp)
  • #621 Renamed: link::fallback_strategy_t -> link::fallback_strategy_for_binary_code_t
  • #600 Now adhering to underscore suffix for proxy class field names

Other changes

  • #599 An invalid file, name_caching_program.hpp, had snuck into our code - removed it
  • #609 "Robustified" the buffers returned from cuda::rtc::program_output's various methods, so that they are all padded with an extra '\0' character past the end of the span's actual range. This is not necessarily and that data should hopefully not actually be reached, but - let's be
  • on the safe side.
  • #627 Dropped context-specification from host-memory allocator functions - it's not actually used
  • #629 Added a device ID field to the texture view object
  • #631 Dropped examples/rtc_common.hpp, which is no longer in which now
  • #638 Dropped the native_word_t type

v0.6.9-b2

2 months ago

Changes since v0.6.8:

  • #606 Can now copy directly to and from containers with contiguous storage - without going through pointers or specifying the size

Unique spans (unique_span)

  • #291 Added a unique_span<T> template class, combining the functionality of cuda::unique_ptr and cuda::span (and being somewhat similar to std::dynarray which almost made it into C++14). Many CUDA programs want to represent both the ownership of allocated memory, and the range of that memory for actual use, in the same variable - without the on-the-fly reallocation behavior of std::vector. This is now possible. Also implemented an untyped version of this, named unique_region.
  • #601 Added an empty() method to cuda::span (to match that of std::span - as it is no sometimes used)
  • #603 Use unique_span instead of our cuda::dynarray (which had been an std::vector under the hood), in various places in the API, especially RTC
  • #610 Return unique_span's from the cuda::rtc::program_output class methods which allocated their own buffers: The methods for getting the compilation log, the cubin data, the PTX and the LTO IR.

More robust memory regions (memory::region_t)

  • #594 Now using memory::region_t for mapped memory rather than a different, mapped-memory specific region class
  • #602 Make memory::region_t more constexpr-friendly
  • #604 memory::region_t's are now CUDA-independent, i.e. do not utilize any CUDA-specific definitions
  • #605 Can now construct const_region_t's from rvalue references to regions

Documentation & comments

  • #595 Correct the documentation for supports_memory_pools

Launch configuration & launch config builder changes

  • #596 Corrected a check against the associated device in the kernel-setting method of the launch config builder

CUDA libraries and in-library, non-associated kernel support

  • #598 Corrected the API and implementation of get_attribute() and set_attribute() for library kernels

Refactoring

  • #607 Split off a detail/type_traits.hpp from types.hpp

Other changes

  • #599 An invalid file, name_caching_program.hpp, had snuck into our code - removed it
  • #600 Adhere to underscore suffix for proxy class field names
  • #609 "Robustified" the buffers returned from cuda::rtc::program_output's various methods, so that they are all padded with an extra '\0' character past the end of the span's actual range. This is not necessarily and that data should hopefully not actually be reached, but - let's be on the safe side.

v0.6.8

2 months ago

Changes since v0.6.7:

Build process & build configuration changes

  • #583 Exported targets now make sure apps link against some system libraries CUDA depends on (to circumvent CMake bug 25665).
  • #567 The runtime-and-driver target now has driver-and-runtime as an alias
  • #590 Avoid compiled warnings about narrowing conversions and shadowing (when those warning flags are turned on)

Launch configuration & launch config builder changes

  • #564 Can now launch kernels with the full range of CUDA 12.x launch attributes, including remote memory sync domain, programmatic launch dependence and programmatic completion events (see descriptions in the CUDA Driver API documentation).
  • #484 Support for setting block cluster dimensions (as part of the support of CUDA 12.x launch attributes).
  • #577, #582 More extensive validation of launch configurations when building them with the launch config builder gadget.
  • #581 More robust comparison operators for dimension structures
  • #580 Launch config builder can now be told to the "use the maximum number of active blocks per multiprocessor".
  • #579 User can now set a target device on a launch configuration target device without setting a contextualized kernel, in a launch config builder
  • #578 Now using the launch config builder in more of the example programs
  • #569 Took care of unused validation function which was triggering a warning with newer compilers

CUDA libraries and in-library, non-associated kernel support

  • #565 Now supporting "CUDA libraries" - files or blocks of data in memory containing compiled kernels, which are not loaded immediately into modules within contexts; and contain device-and-context-independent compiled kernels. Both of these can now be represented and worked with.
  • #576 A module no longer holds the link options it was created with; those are not essential to its use, and at times are impossible to (re)create when obtaining a module from a library (which also doesn't hold its link options ).

Refactoring

  • #586 The poor man's span class now has its own file (detail/span.hpp)
  • #588 Some under-the-hood refactoring of host memory allocation functions
  • #589 Factored the cuda::memory::region_t class into its own file

Other changes

  • #593 Some work on the cuda::memory::copy_parameters_t structure
  • #592 Dropped the memory::managed::region_t and const_region_t and now just using memory::region_t and const_region_t everywhere
  • #591 Memory copy functions for spans and other work on memory copy functions
  • #587 Added a missing variant of memory-zero'ing
  • #585 You can now write cuda::memory::make_unique() - and it's assumed you mean device memory (you have to specify a device or device context though)
  • #575 Moved apriori_compiled_kernel_t into the kernel namespace, yielding kernel::apriori_compiled_it
  • #570, #573 Removed some redundant inclusions and definitions
  • #568 Fixed some breakage of kernel_t::set_attribute()
  • #566 Can now properly get and set properties on kernels (raw functions and handles)

Compatibility

  • #572 Fixed broken CUDA 9.x compatibility

v0.6.8b1

3 months ago

Changes since v0.6.7:

Build process & build configuration changes

  • #583 Exported targets now make sure apps link against some system libraries CUDA depends on (to circumvent CMake bug 25665).
  • #567 The runtime-and-driver target now has driver-and-runtime as an alias

Launch configuration & launch config builder changes

  • #564 Can now launch kernels with the full range of CUDA 12.x launch attributes, including remote memory sync domain, programmatic launch dependence and programmatic completion events (see descriptions in the CUDA Driver API documentation).
  • #577, #582 More extensive validation of launch configurations when building them with the launch config builder gadget.
  • #581 More robust comparison operators for dimension structures
  • #580 Launch config builder can now be told to the "use the maximum number of active blocks per multiprocessor".
  • #579 User can now set a target device on a launch configuration target device without setting a contextualized kernel, in a launch config builder
  • #578 Now using the launch config builder in more of the example programs
  • #569 Took care of unused validation function which was triggering a warning with newer compilers

CUDA libraries and in-library, non-associated kernel support

  • #565 Now supporting "CUDA libraries" - files or blocks of data in memory containing compiled kernels, which are not loaded immediately into modules within contexts; and contain device-and-context-independent compiled kernels. Both of these can now be represented and worked with.
  • #576 A module no longer holds the link options it was created with; those are not essential to its use, and at times are impossible to (re)create when obtaining a module from a library (which also doesn't hold its link options ).

Other changes

  • #575 Moved apriori_compiled_kernel_t into the kernel namespace, yielding kernel::apriori_compiled_it
  • #570, #573 Removed some redundant inclusions and definitions
  • #568 Fixed some breakage of kernel_t::set_attribute()
  • #566 Can now properly get and set properties on kernels (raw functions and handles)

Compatibility

  • #572 Fixed broken CUDA 9.x compatibility

  • Want to help me with testing 0.7? Drop me a line... (it will have CUDA execution graph support)
  • Have you tried this version and are satisfied with the changes? Thumb-up just below this line to let others know as well.

v0.6.7

4 months ago

Changes since v0.6.6:

Build process & build configuration changes

  • #555 : No longer enabling any languages when configuring the project - it is not strictly necessary at that point.
  • #557 : Now setting EXPORT_ALL_SYMBOLS as a target property rather than globally (relevant for Windows builds mostly).
  • #558 : Now avoiding incompatibility with CMake 3.25-3.28, by avoiding adding a dependency to a CUDA:: target, and not trying to recreate it if it already exists.
  • #562 Slightly streamlined the package config file (and no longer preprocessing it).
  • #563 Now enabling the C++ language before declaring the dependence (with find_package()) on the Threads library.

Compatibility

  • #506 : Windows builds no longer fail due to missing cudaProfiler.h
  • #504 : Build on Windows should now also succeed when using cooperative groups.

Other changes

  • #556 Moved the CompileWithWarnings.cmake module into the examples/ subfolder, which is the only place it's used.
  • #561 Dropped an unused function and method in the launch config builder source code and class, respectively, and added some orphaned validation logic when obtaining the grid and block dimensions, so as not to ignore the overall dimensions.

  • Want to help me with testing 0.7? Drop me a line... (it will have CUDA execution graph support)
  • Have you tried this version and are satisfied with the changes? Thumb-up just below this line to let others know as well.

v0.6.6

7 months ago

Changes since v0.6.4:

Functionality improvements

  • #545 Now checking, and throwing, errors due to cudaGetLastError() after kernel launches (mostly grid errors)
  • #547 When compiling in debug mode, now performing more launch configuration validity checks before launching a kernel
  • #549 Avoiding some excessive device property querying.

Bug fixes

  • #539, #544 NVRTC compilation logs now returned without a trailing nul ('\0') character.
  • #542 More robust use of namespace in the library's macros, so they don't trigger compilation errors regardless of the namespace of the code you use them in
  • #543 Now retrieving correct error strings again for Runtime-API-only errors
  • #550 Fixed a wrong side of comparison in some block configuration logic of the launch config builder
  • #553 Replaced inappropriate use of cbegin() and cend() in favor of begin() and end() in rtc::program::add_headers() code which may take inputs without these two methods.

Compatibility

  • #546 Resolved a build on Windows with rtc.hpp - fixed an overload resolution issue regarding compilation parameter marshalling.

Want to help me with testing 0.7? Drop me a line... (it will have CUDA execution graph support)

v0.6.4

9 months ago

Changes since v0.6.3:

Features

  • #528 Can now register memory areas with the CUDA driver as read-only (in addition to other parameters)
  • #519 Quick & somewhat-dirty support for the use of "external memory resources" (mostly Direct3D or NVSCI buffers and semaphores; but not including semaphores etc. for now)

Bug fixes

  • #535 copy_parameters_t::set_endpoint_untyped() now properly calling an inner set_endpoint_untyped()
  • #527 copy_parameters_t::set_single_context no longer mistakenly taking an endpoint_t parameter
  • #528 When mapping a region pair, not insisting on the same address on both the host and the device (which had made it impossible for this to succeed with older GPUs).
  • #521 Avoid a compiler compiler warning when overriding the context for the current scope
  • #520 Removed unnecessary uses of a context-wrapper scoped-context-override
  • #517 cuda::memory::typed_set() no longer mistakenly accepts values of size 8 (which have no special CUDA API call).
  • #516 Corrected types and casting in stream_t::enqueue_t::write_single_value()
  • #515 Resolved a case of missing includes when including only certain headers
  • #514 Now providing a definition of cuda::memory::managed::allocate(device, num_bytes) (which was declared but not defined)

Other changes

  • #522 Renamed synch to sync in multiple identifiers
  • #529 program_t::add_registered_globals() can now take any container of any string-like type.
  • #523 Now passing device_t's by const-ref in more cases, avoiding copying and enabling re-use of a reference to the primary context.
  • #521 Reduce boilerplate + avoid warning when overriding context for the current scope resolved-on-development task

Build issues

  • #504 Fixed build failure with cooperative_groups on GitHub Actions Windows runners and CUDA >= 11.7

Want to help me with testing 0.7? Drop me a line... (it will have CUDA execution graph support)

v0.6.4-rc1

10 months ago

Changes since v0.6.3:

Features

  • #528 Can now register memory areas with the CUDA driver as read-only (in addition to other parameters)
  • #519 Quick & somewhat-dirty support for the use of "external memory resources" (mostly Direct3D or NVSCI buffers and semaphores; but not including semaphores etc. for now)

Bug fixes

  • #527 copy_parameters_t::set_single_context no longer mistakenly taking an endpoint_t parameter
  • #528 When mapping a region pair, not insisting on the same address on both the host and the device (which had made it impossible for this to succeed with older GPUs).
  • #521 Avoid a compiler compiler warning when overriding the context for the current scope
  • #520 Removed unnecessary uses of a context-wrapper scoped-context-override
  • #517 cuda::memory::typed_set() no longer mistakenly accepts values of size 8 (which have no special CUDA API call).
  • #516 Corrected types and casting in stream_t::enqueue_t::write_single_value()
  • #515 Resolved a case of missing includes when including only certain headers
  • #514 Now providing a definition of cuda::memory::managed::allocate(device, num_bytes) (which was declared but not defined)

Other changes

  • #522 Renamed synch to sync in multiple identifiers
  • #529 program_t::add_registered_globals() can now take any container of any string-like type.
  • #523 Now passing device_t's by const-ref in more cases, avoiding copying and enabling re-use of a reference to the primary context.
  • #521 Reduce boilerplate + avoid warning when overriding context for the current scope resolved-on-development task

Build issues

  • #504 Fixed build failure with cooperative_groups on GitHub Actions Windows runners and CUDA >= 11.7

v0.6.3

1 year ago

Changes since v0.6.2:

  • Added a cuda::memory::pool_t proxy class
  • Memory pools are created using cuda::memory::pool::create() or via device methods
  • IPC: Can import (and export) pools and their allocations to/from other processes
  • #485 Moved the physical_allocation namespace up from memory::virtual_ into memory, as it is used also for memory pools

Bug fixes

  • #508 With CUDA >= 11.3, we no longer give up on creating a module from a compiled program just because no CUBIN is available - and try to use the PTX like with earlier CUDA versions.
  • #493 cuda::launch_config_builder_t::overall_size() now takes a cuda::grid::overall_dimension_t rather than a cuda::grid::dimension_t (not the same size).
  • #492 Avoiding inclusion of cooperative_groups.h from within the API headers

Other API changes

  • #511 Can now create CUDA runtime errors with fully-user-overriden what() message (probably not very interesting outside the API's internals).
  • #510 When NVRTC/PTX compiler complains about an invalid option, we now include the options passed in the thrown exceptions
  • #499 No longer exposing deprecated surface-related API functions with CUDA 12 and later.
  • #498 The launch config builder class now supports num_blocks() as an alias for the grid_size() method.
  • #488 cuda::memory::host::allocate() now returns a cuda::memory::region_t, for better consistency.
  • #486 Some changes to cuda::kernel::wrap.
  • Renamed cuda::memory::attribute_value_type_t -> cuda::memory::attribute_value_t
  • #483 It's now easier to convert memory::region_t's into typed spans.
  • #482 Improvements to the built-in cuda::span (which is used when std::span is unavailable) - making it somewhat more compatible with std::span
  • Make more comparison operators constexpr and noexcept

Compatibility

  • Wrappers now build correctly (again) with --std=c++20.
  • #501 Added a new NVRTC error code introduced in CUDA 12.1
  • #500 When using CUDA 12, use the term "LTO IR" rather than "NVVM" as appropriate
  • #494 Work around an MSVC issue with variadic template-templates
  • #491 Avoiding some warnings issued by MSVC
  • #480 Add example program built with each C++ version after 11 supported by the compiler

Build issues

  • Now requiring CMake version 3.25. You can download an up-to-date version from Kitware's website; it doesn't require any special installation.
  • #490 Switched from depending on CUDA::nvToolkitExt to depending on CUDA::nvtx, for CUDA versions 10.0 and above.

v0.6.2

1 year ago

Changes since v0.6.1:

The most significant change in this version regards the way callbacks/host functions are supported. This change is motivated mostly as preparation for the upcoming introduction of CUDA graph support (not in this version), which will impose some stricter constraints on callbacks - precluding the hack we have been using so far.

So far, a callback was any object invokable with an std::stream_t parameter. From now on, we support two kinds of callback:

  • A plain function - not a closure, which may be invoked with a pointer to an arbitrary type: cuda::stream_t::enqueue_t::host_function_call(Argument * user_data)
  • An object invokable with no parameters - a closure, to which one cannot provide any additional information: cuda::stream_t::enqueue_t::host_invokable(Invokable& invokable)

This lets us avoid the combination of heap allocation at enqueue and deallocation at launch - which works well enough for now, but will not be possible when the same callback needs to be invoked multiple times. Also, it was in contradiction of our presumption not to add layers of abstraction over what CUDA itself provides.

Of course, the release also has s the "usual" long list of minor fixes.

Changes to existing API

  • #473 Redesign of host function / callback enqueue and launch mechanism, see above
  • #459 cuda::kernel::get() now takes a device, not a kernel - since it can't really do anything useful for non-primary kernels (which is where apriori-compiled kernels are available)
  • #477 When creating a new program, we default to assuming it's CUDA C++ and do not require an explicit specification of that fact.

API additions

  • #468 Added a non-CUDA memory type enum value, and - can now check the memory type of any pointer without throwing an error.
  • #472 Can now pass cuda::memory::region_t's when enqueueing copy operations on streams (and thus also cuda::span<T>'s)
  • #466 Can now perform copies using cuda::memory::copy_parameters_t<N> (for N=2 or 3), a wrapper of the CUDA driver's richest parameters structure with multiple convenience functions, for maximum configurability of a copy operation. But - this structure is not currently "fool-proof", so use with care and initialize all relevant fields.
  • #463 Can now obtain a raw pointer's context and device without first wrapping it in a cuda::pointer_t
  • #452 Support an enqueuing a memory barrier on a stream (one of the "batch stream memory operations)
  • A method of the launch configuration builder for indicating no dynamic shared memory is used

Bug fixes

  • #475 device::get() no longer incorrectly marked as noexcept
  • #467 Array-to-raw-memory copy function now determines context for the target area, and a new variant of the function takes the content as a parameter.
  • #455 Add missing definition of allocate_managed()~ in context.hpp`
  • #453 Now actually setting the flags when enqueueing a flush_remote_writes() operation on a stream (this is one of the "batch stream memory operations)
  • #450 Fixed an allocation-without-release in cuda::memory::virtual::set_access_mode
  • #449 apriori_compiled_kernel_t::get_attribute() was missing an inline decoration
  • #448 cuda::profiling::mark::range_start() and range_end() were calling create_attributions() the wrong way

Cleanup and warning avoidance

  • #443 Aligned member initialization order(s) in array_t with their declaration order.

Compatibility

  • #462 Can now obtain a pointer's device in CUDA 9.x (not just 10.0 and later)
  • #304 Some CUDA 9.x incompatibilities have been fixed

Other changes

  • #471 Made a few more comparison operators constexpr