Thin, unified, C++-flavored wrappers for the CUDA APIs
Changes since v0.6.1:
The most significant change in this version concerns the way callbacks/host functions are supported. This change is motivated mostly by preparation for the upcoming introduction of CUDA graph support (not in this version), which will impose some stricter constraints on callbacks - precluding the hack we have been using so far.
So far, a callback could be any object invokable with a `cuda::stream_t` parameter. From now on, we support two kinds of callback:

- `cuda::stream_t::enqueue_t::host_function_call(Argument * user_data)`
- `cuda::stream_t::enqueue_t::host_invokable(Invokable& invokable)`

This lets us avoid the combination of heap allocation at enqueue and deallocation at launch - which works well enough for now, but will not be possible when the same callback needs to be invoked multiple times. It was also in contradiction of our policy of not adding layers of abstraction over what CUDA itself provides.
Of course, the release also has the "usual" long list of minor fixes:
- `cuda::kernel::get()` now takes a device, not a context - since it can't really do anything useful with non-primary contexts (and it is in primary contexts that apriori-compiled kernels are available)
- Support for `cuda::memory::region_t`'s (and thus also `cuda::span<T>`'s) when enqueueing copy operations on streams
- `cuda::memory::copy_parameters_t<N>` (for N=2 or 3), a wrapper of the CUDA driver's richest parameters structure, with multiple convenience functions, for maximum configurability of a copy operation. But - this structure is not currently "fool-proof", so use it with care and initialize all relevant fields.
- `cuda::pointer_t`
- `device::get()` is no longer incorrectly marked as `noexcept`
- `allocate_managed()` in `context.hpp`
- The `flush_remote_writes()` operation on a stream (this is one of the "batch stream memory operations")
- `apriori_compiled_kernel_t::get_attribute()` was missing an `inline` decoration
- `cuda::profiling::mark::range_start()` and `range_end()` were calling `create_attributions()` the wrong way
- `constexpr`
Changes since v0.6:
- `link::input_type_t` in `link.hpp`, which was triggering an error when building with C++17
- The `make_cuda_host_alloc_flags()` function, which was bitwise-AND-ing instead of bitwise-OR-ing
- `kernel_t::context()` now uses `wrap()` and is `noexcept`
- Now respecting the `CUDA_NO_HALF` preprocessor define, and not defining nor including half-precision-related code when it is defined

Changes since v0.5.6:
This version introduces a single major change: support for compilation of PTX code.

Note: The CUDA driver already supports compilation of PTX code, but it has limited support for various compilation options; plus - it requires a driver to be loaded, i.e. it requires kernel involvement and a GPU on your system. This library does not.
There are also some minor tweaks, not related to PTX compilation:

- Support for the `CUDA_NO_HALF` preprocessor define, for when you want to avoid CUDA defining the `half` type
- Replaced `std::` with `::std::` namespace qualifications, which had snuck into the codebase recently (and which cause trouble with NVIDIA's `cuda::std` namespace)
Changes since v0.5.5:
- Replaced `std::optional`/`std::experimental::optional` with the internal `poor_mans_optional` (including in the `cuda::array_t` class template)
- Added the `::` prefix to occurrences of `std::` which snuck in again in recent versions (these potentially clash with NVIDIA's standard library constructs)
- Added missing `#include` directives which were masked when including all APIs, but not when including individual headers
- Removed inappropriate `inline` decorators from declaration-only lines (in `cuda/api/` vs. in `cuda/api/multi_wrapper_impls`)
- `bandwidthtest` now includes `<iostream>` on its own
- Moved `pci_id_impl.hpp` into the `detail/` subfolder (and renamed it)

Changes since v0.5.4:
- `--dopt`
- Can now pass `std::string`'s for the name and/or header source (rather than only C-style `const char*` strings)
- `scoped_existence_ensurer_t`, the gadget for ensuring there is some current context (regardless of which), will now make sure the driver has been initialized
- Can now `cuda::wait(my_event)`, not just `cuda::synchronize(my_event)`
- `time_elapsed_between()` can now accept `std::pair`'s of events
- `bandwidthtest`
- Made enqueue operations `const` (so you can now enqueue on a stream passed by const-reference)
- Can now construct `grid::overall_dimensions_t` from a `dim3` object, so that they're more interoperable with CUDA-related values you obtained elsewhere