cuda-api-wrappers: Release Notes

Thin, unified, C++-flavored wrappers for the CUDA APIs

v0.6.2

1 year ago

Changes since v0.6.1:

The most significant change in this version regards the way callbacks/host functions are supported. This change is motivated mostly as preparation for the upcoming introduction of CUDA graph support (not in this version), which will impose some stricter constraints on callbacks - precluding the hack we have been using so far.

So far, a callback was any object invokable with a cuda::stream_t parameter. From now on, we support two kinds of callback:

  • A plain function - not a closure, which may be invoked with a pointer to an arbitrary type: cuda::stream_t::enqueue_t::host_function_call(Argument * user_data)
  • An object invokable with no parameters - a closure, to which one cannot provide any additional information: cuda::stream_t::enqueue_t::host_invokable(Invokable& invokable)

This lets us avoid the combination of heap allocation at enqueue and deallocation at launch - which works well enough for now, but will not be possible when the same callback needs to be invoked multiple times. It also contradicted our principle of not adding layers of abstraction over what CUDA itself provides.

Of course, the release also has the "usual" long list of minor fixes.

Changes to existing API

  • #473 Redesign of host function / callback enqueue and launch mechanism, see above
  • #459 cuda::kernel::get() now takes a device, not a context - since it can't really do anything useful for non-primary contexts (the primary context being where apriori-compiled kernels are available)
  • #477 When creating a new program, we default to assuming it's CUDA C++ and do not require an explicit specification of that fact.

API additions

  • #468 Added a non-CUDA memory type enum value; one can now check the memory type of any pointer without an error being thrown.
  • #472 Can now pass cuda::memory::region_t's when enqueueing copy operations on streams (and thus also cuda::span<T>'s)
  • #466 Can now perform copies using cuda::memory::copy_parameters_t<N> (for N=2 or 3), a wrapper of the CUDA driver's richest parameters structure with multiple convenience functions, for maximum configurability of a copy operation. However, this structure is not currently "fool-proof"; use it with care and initialize all relevant fields.
  • #463 Can now obtain a raw pointer's context and device without first wrapping it in a cuda::pointer_t
  • #452 Support enqueuing a memory barrier on a stream (one of the "batch stream memory operations")
  • A method of the launch configuration builder for indicating no dynamic shared memory is used

Bug fixes

  • #475 device::get() no longer incorrectly marked as noexcept
  • #467 Array-to-raw-memory copy function now determines the context for the target area, and a new variant of the function takes the context as a parameter.
  • #455 Add missing definition of allocate_managed() in context.hpp
  • #453 Now actually setting the flags when enqueueing a flush_remote_writes() operation on a stream (this is one of the "batch stream memory operations")
  • #450 Fixed an allocation-without-release in cuda::memory::virtual::set_access_mode
  • #449 apriori_compiled_kernel_t::get_attribute() was missing an inline decoration
  • #448 cuda::profiling::mark::range_start() and range_end() were calling create_attributions() the wrong way

Cleanup and warning avoidance

  • #443 Aligned member initialization order(s) in array_t with their declaration order.

Compatibility

  • #462 Can now obtain a pointer's device in CUDA 9.x (not just 10.0 and later)
  • #304 Some CUDA 9.x incompatibilities have been fixed

Other changes

  • #471 Made a few more comparison operators constexpr

v0.6.2-rc2

1 year ago

(Release candidate for v0.6.2; the changes are the same as those listed under v0.6.2 above.)

v0.6.1

1 year ago

Changes since v0.6:

Bug fixes

  • #442 Changed a no-longer-valid use of link::input_type_t in link.hpp which was triggering an error when building with C++17.
  • #438 Corrected the make_cuda_host_alloc_flags() function, which was bitwise-AND-ing instead of bitwise-OR-ing.

Other changes

  • #441 kernel_t::context() now uses wrap() and is noexcept
  • #436, #437 Now respecting the CUDA_NO_HALF preprocessor define: when it is defined, we neither define nor include half-precision-related code.

v0.6

1 year ago

Changes since v0.5.6:

PTX Compilation library

This version introduces a single major change: support for CUDA's standalone PTX compilation library.

Note: The CUDA driver already supports compilation of PTX code, but it has limited support for various compilation options; plus, it requires a driver to be loaded, i.e. it requires kernel involvement and a GPU on your system. This library does not.

Value-vs-reference issues

  • #430 : Now passing kernel-like objects by reference rather than by value where relevant in the kernel launch wrapper functions.
  • #433 : Now passing program name by value rather than by reference.

Other changes

  • #431 : The NVTX wrappers no longer depend on a thread support library
  • #436 : The wrapper library now respects CUDA_NO_HALF, for when you want to avoid CUDA defining the half type
  • #432 : Fixed some std:: namespace qualifications which should have been ::std:: and had snuck into the codebase recently (unqualified std:: can cause trouble with NVIDIA's cuda::std namespace).
  • #435 : Updated static data tables for the Ampere/Lovelace (8.x) and Hopper architectures.

v0.6-b2

1 year ago

(Beta release for v0.6; introduces the PTX compilation library support described under v0.6 above.)

v0.6-rc1

1 year ago

(Release candidate for v0.6; introduces the PTX compilation library support, plus the #430 pass-by-reference tweak, both described under v0.6 above.)

v0.5.6

1 year ago

Changes since v0.5.5:

New functionality

  • #423: Add an implementation of the surface and texture reference getters for modules (these return raw references; this library does not currently offer corresponding wrapper classes for such objects)

C++14-and-later compatibility fixes

  • #415: Resolved incompatibility of std::optional/std::experimental::optional with the internal poor_mans_optional
  • #416: corrected placement of inclusion of std::experimental::optional

Other changes

  • #428, #429 : Minor fixes and tweaks to CUDA array code (via the cuda::array_t class template)
  • #427, #406 : Stream and Event wrapper class instances are now non-copyable (you need to either move them or pass references/pointers to them)
  • #425, #426: Error and exception handling improvements (with a slight performance benefit)
  • #424 : Link options now passed by const-reference, not by value
  • #411: Add :: prefix to occurrences of std:: (these snuck in again in recent versions and potentially clash with NVIDIA's standard library constructs)
  • #413: Added missing intra-library #include directives which were masked when including all APIs, but not when including individual headers. Also, removed inappropriate inline decorators from declaration-only lines
  • #420: Internal renaming
  • #417: Rearranged internal placement of functionality in header files (files in cuda/api/ vs in cuda/api/multi_wrapper_impls).
  • #412: bandwidthtest now includes <iostream> on its own
  • #409: Moved pci_id_impl.hpp into the detail/ subfolder (and renamed it)

v0.5.5

1 year ago

Changes since v0.5.4:

Run-time compilation functionality

  • #397 : The NVRTC compilation options class now supports passing extra options to PTXAS, and also supports --dopt
  • #403 : The program builder class can now accept named header additions using std::string's for the name and/or header source (rather than only C-style const char* strings).

Bug fixes

  • #396 : scoped_existence_ensurer_t, the gadget for ensuring there is some current context (regardless of which one), will now make sure the driver has been initialized.
  • #395 : Can now start profiling with our NVTX component even if the driver has not yet been initialized.

Other changes

  • #400 : Added an alias for waiting/synchronizing on an event: You can now execute cuda::wait(my_event), not just cuda::synchronize(my_event).
  • #399 : time_elapsed_between() can now accept std::pair's of events.
  • #398 : Added another example program, the CUDA sample bandwidthtest
  • #401 : Made all stream enqueuing methods const (so you can now enqueue on a stream passed by const-reference).
  • #404 : Can now construct grid::overall_dimensions_t from a dim3 object, so that they're more interoperable with CUDA-related values you obtained elsewhere.

v0.5.5b2

1 year ago

(Beta release for v0.5.5; the changes are the same as those listed under v0.5.5 above.)

v0.5.4

1 year ago

Changes since v0.5.3:

  • #392 Made the NVTX and NVRTC wrappers usable in multiple translation units within the same executable
  • #393 Made the NVTX dependency on libdl (on Linux) explicit

Other changes

  • #394 Avoiding redundant cuInit() call when getting a device's name