Thin, unified, C++-flavored wrappers for the CUDA APIs
Changes since v0.6.8:

- Implemented a `unique_span<T>` template class, combining the functionality of `cuda::unique_ptr` and `cuda::span` (and being somewhat similar to `std::dynarray`, which almost made it into C++14). Many CUDA programs want to represent both the ownership of allocated memory, and the range of that memory for actual use, in the same variable - without the on-the-fly reallocation behavior of `std::vector`. This is now possible. Also implemented an untyped version of this, named `unique_region`.
- Replaced `memory::external::mapped_region_t` with `memory::unique_region`.
- Added an `empty()` method to `cuda::span` (to match that of `std::span` - as it is now sometimes used).
- Now using `unique_span` instead of our `cuda::dynarray` (which had been an `std::vector` under the hood) in various places in the API, especially RTC.
- Now returning `unique_span`'s from the `cuda::rtc::program_output` class methods which allocated their own buffers: the methods for getting the compilation log, the cubin data, the PTX and the LTO IR.
- Now using `memory::region_t` for mapped memory, rather than a different, mapped-memory-specific region class.
- Made `memory::region_t` more constexpr-friendly.
- `memory::region_t`'s are now CUDA-independent, i.e. they do not utilize any CUDA-specific definitions.
- Now supporting construction of `const_region_t`'s from rvalue references to regions.
- `supports_memory_pools`
- `get_attribute()` and `set_attribute()` for library kernels.
- Split `detail/type_traits.hpp` off from `types.hpp`.
- `context::current::scoped_override_t` is now declared in `current_context.hpp`, for both `context_t` and `primary_context_t`.
- `link::marshalled_options_t` and `link::option_t` are now in the `detail_` namespace - the user should typically never use these directly.
- Dropped `specify_default_load_caching_mode` from `link::options_t`, in favor of using an `stdx::optional`.
- `pci_location_t`
- `context::current::peer_to_peer` functions
- Moved `program_base_t` into the `detail_` namespace.
- Moved `rtc::marshalled_options_t` and `rtc::marshal()` into the `detail_` subnamespace - users should not need to use these themselves.
- Moved `memory::pool::ipc::ptr_handle_t` out of `ipc.hpp`, up into `types.hpp` (so that `memory_pool.hpp` doesn't depend on `ipc.hpp`).
- Renamed: `link::fallback_strategy_t` -> `link::fallback_strategy_for_binary_code_t`.
- A file, `name_caching_program.hpp`, had snuck into our code - removed it.
- `cuda::rtc::program_output`'s various methods now all pad their output with an extra `'\0'` character past the end of the span's actual range. This is not strictly necessary, and that data should hopefully not actually be reached, but - let's be on the safe side.
- `examples/rtc_common.hpp` no longer uses the `native_word_t` type.
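The core idea of `unique_span` - unique ownership plus a fixed extent with span-style access - can be sketched in ordinary host-side C++. This is a simplified, host-memory-only illustration (class name and details are made up for the sketch; the library's actual class also deals with CUDA memory kinds):

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Simplified, host-memory-only sketch of the unique_span idea:
// unique ownership of an allocation (like unique_ptr), plus the
// size and element access of a span - and, unlike std::vector,
// no reallocation or resizing after construction.
template <typename T>
class sketch_unique_span {
public:
    explicit sketch_unique_span(std::size_t size)
        : data_(std::make_unique<T[]>(size)), size_(size) {}

    // Owning, hence movable but not copyable
    sketch_unique_span(sketch_unique_span&&) noexcept = default;
    sketch_unique_span& operator=(sketch_unique_span&&) noexcept = default;
    sketch_unique_span(const sketch_unique_span&) = delete;
    sketch_unique_span& operator=(const sketch_unique_span&) = delete;

    T*          begin()       noexcept { return data_.get(); }
    T*          end()         noexcept { return data_.get() + size_; }
    T&          operator[](std::size_t i) { return data_[i]; }
    std::size_t size()  const noexcept { return size_; }
    bool        empty() const noexcept { return size_ == 0; }

private:
    std::unique_ptr<T[]> data_;
    std::size_t size_;
};
```

A `sketch_unique_span<float>(n)` owns its `n` elements for its whole lifetime, exactly like the range it exposes - the property which `unique_span` and `unique_region` provide for CUDA allocations.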
Changes since v0.6.7:
- Added `driver-and-runtime` as an alias.
- `detail/span.hpp`
- Moved the `cuda::memory::region_t` class into its own file.
- The `cuda::memory::copy_parameters_t` structure.
- No longer have separate `memory::managed::region_t` and `const_region_t`; now just using `memory::region_t` and `const_region_t` everywhere.
- `cuda::memory::make_unique()`: it's assumed you mean device memory (you have to specify a device or device context, though).
- Moved `apriori_compiled_kernel_t` into the `kernel` namespace, yielding `kernel::apriori_compiled_t`.
- `kernel_t::set_attribute()`
Changes since v0.6.6:

- Now depending (via `find_package()`) on the Threads library.
- `cudaProfiler.h`
- Moved the `CompileWithWarnings.cmake` module into the `examples/` subfolder, which is the only place it's used.

Changes since v0.6.4:
- Now checking `cudaGetLastError()` after kernel launches (mostly for grid errors).
- (`'\0'`) character.
- Avoiding `cbegin()` and `cend()` in favor of `begin()` and `end()` in `rtc::program::add_headers()`, whose code may take inputs without these two methods.
- `rtc.hpp` - fixed an overload resolution issue regarding compilation parameter marshalling.

Want to help me with testing 0.7? Drop me a line... (it will have CUDA execution graph support)
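The `cbegin()`/`cend()` point above is general C++ advice: generic code that calls the `.cbegin()`/`.cend()` member functions cannot accept range types which only provide `begin()`/`end()`. A standalone sketch (the type and function names here are made up for illustration, not taken from the library):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// A minimal range type offering only begin()/end() members - like some
// of the containers a generic API such as add_headers() may receive.
struct name_list {
    std::vector<std::string> names;
    auto begin() const { return names.begin(); }
    auto end()   const { return names.end(); }
    // Note: no cbegin()/cend() members.
};

// Iterating via begin()/end() (here, through a range-for) keeps this
// usable with name_list as well as with standard containers; calling
// range.cbegin()/range.cend() instead would fail to compile for name_list.
template <typename Range>
std::size_t count_elements(const Range& range)
{
    std::size_t count = 0;
    for (const auto& element : range) { (void) element; ++count; }
    return count;
}
```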
Changes since v0.6.3:
- `copy_parameters_t::set_endpoint_untyped()` now properly calling an inner `set_endpoint_untyped()`.
- `copy_parameters_t::set_single_context` no longer mistakenly taking an `endpoint_t` parameter.
- `cuda::memory::typed_set()` no longer mistakenly accepts values of size 8 (which have no special CUDA API call).
- `stream_t::enqueue_t::write_single_value()`
- Implemented `cuda::memory::managed::allocate(device, num_bytes)` (which was declared but not defined).
- Renamed `synch` to `sync` in multiple identifiers.
- `program_t::add_registered_globals()` can now take any container of any string-like type.
- Passing `device_t`'s by const-ref in more cases, avoiding copying and enabling re-use of a reference to the primary context.
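Accepting "any container of any string-like type", as `program_t::add_registered_globals()` now does, is a standard C++ templating technique. A standalone sketch of the idea (the function name and signature here are illustrative, not the library's):

```cpp
#include <cassert>
#include <list>
#include <string>
#include <vector>

// Collects names from any iterable container whose elements are
// convertible to std::string - std::vector<std::string>,
// std::list<const char*>, and so on. (Illustrative stand-in for an
// API like program_t::add_registered_globals().)
template <typename Container>
std::vector<std::string> collect_names(const Container& names)
{
    std::vector<std::string> result;
    for (const auto& name : names) {
        // Works for anything std::string can be constructed from
        result.emplace_back(name);
    }
    return result;
}
```

The single template parameter lets one overload serve every combination of container and string-like element type, instead of a fixed `std::vector<std::string>` parameter.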
Changes since v0.6.2:
- A `cuda::memory::pool_t` proxy class; create pools via `cuda::memory::pool::create()` or via device methods.
- Moved the `physical_allocation` namespace up from `memory::virtual_` into `memory`, as it is used also for memory pools.
- `cuda::launch_config_builder_t::overall_size()` now takes a `cuda::grid::overall_dimension_t` rather than a `cuda::grid::dimension_t` (not the same size).
- No longer including `cooperative_groups.h` from within the API headers.
- `what()` message (probably not very interesting outside the API's internals).
- Added `num_blocks()` as an alias for the `grid_size()` method.
- `cuda::memory::host::allocate()` now returns a `cuda::memory::region_t`, for better consistency.
- `cuda::kernel::wrap`
- Renamed: `cuda::memory::attribute_value_type_t` -> `cuda::memory::attribute_value_t`.
- Can now convert `memory::region_t`'s into typed spans.
- Improved `cuda::span` (which is used when `std::span` is unavailable), making it somewhat more compatible with `std::span`: more `constexpr` and `noexcept`, and usable with `--std=c++20`.
- Switched from depending on `CUDA::nvToolkitExt` to depending on `CUDA::nvtx`, for CUDA versions 10.0 and above.

Changes since v0.6.1:
The most significant change in this version regards the way callbacks/host functions are supported. This change is motivated mostly as preparation for the upcoming introduction of CUDA graph support (not in this version), which will impose some stricter constraints on callbacks - precluding the hack we have been using so far.
So far, a callback was any object invokable with a `cuda::stream_t` parameter. From now on, we support two kinds of callbacks:
- `cuda::stream_t::enqueue_t::host_function_call(Argument * user_data)`
- `cuda::stream_t::enqueue_t::host_invokable(Invokable& invokable)`
This lets us avoid the combination of heap allocation at enqueue and deallocation at launch - which works well enough for now, but will not be possible when the same callback needs to be invoked multiple times. Also, it contradicted our principle of not adding layers of abstraction over what CUDA itself provides.
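The distinction between the two callback kinds can be sketched in plain, CUDA-free C++. This is a simplified illustration only (the `fake_stream` type and its method names are invented for the sketch): one kind takes a plain function pointer plus a type-erased user-data pointer, in the style of the driver API itself; the other takes a reference to an invokable object, so no heap allocation happens at enqueue time and the same invokable can be enqueued repeatedly.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Kind 1: a plain function pointer with type-erased user data -
// the shape the CUDA driver itself accepts for host functions.
using host_function_ptr = void (*)(void* user_data);

// A stand-in "stream" which records pending calls and runs them in order.
struct fake_stream {
    std::vector<std::pair<host_function_ptr, void*>> pending;

    void enqueue_host_function_call(host_function_ptr fn, void* user_data) {
        pending.push_back({fn, user_data});
    }

    // Kind 2: an invokable object, passed by reference - the caller keeps
    // ownership, so nothing is heap-allocated at enqueue time, and the
    // same invokable may be enqueued (and invoked) multiple times.
    template <typename Invokable>
    void enqueue_host_invokable(Invokable& invokable) {
        // Adapt kind 2 to kind 1 via a stateless trampoline:
        pending.push_back({
            [](void* p) { (*static_cast<Invokable*>(p))(); },
            &invokable});
    }

    void drain() {
        for (auto& call : pending) { call.first(call.second); }
        pending.clear();
    }
};
```

The caller-owned-invokable approach is what makes repeated invocation workable: the stream never frees the callback, so enqueueing it a second time is safe.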
Of course, the release also has the "usual" long list of minor fixes.
- `cuda::kernel::get()` now takes a device, not a context - since it can't really do anything useful for non-primary contexts (which is where apriori-compiled kernels are available).
- Now accepting `cuda::memory::region_t`'s when enqueueing copy operations on streams (and thus also `cuda::span<T>`'s).
- `cuda::memory::copy_parameters_t<N>` (for N = 2 or 3): a wrapper of the CUDA driver's richest parameters structure, with multiple convenience functions, for maximum configurability of a copy operation. But - this structure is not currently "fool-proof", so use it with care and initialize all relevant fields.
- `cuda::pointer_t`
- `device::get()` no longer incorrectly marked as `noexcept`.
- `allocate_managed()` in `context.hpp`.
- The `flush_remote_writes()` operation on a stream (this is one of the "batch stream memory operations").
- `apriori_compiled_kernel_t::get_attribute()` was missing an `inline` decoration.
- `cuda::profiling::mark::range_start()` and `range_end()` were calling `create_attributions()` the wrong way.
- `constexpr`
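The "rich parameters structure with convenience setters" pattern behind `copy_parameters_t<N>` can be sketched in plain, CUDA-free C++. All names in this sketch are invented for illustration - it is not the wrapper's actual structure:

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// A simplified sketch of a dimension-templated copy-parameters structure:
// raw fields mirror a driver-level struct, while chainable setters make
// it harder to forget related fields.
template <std::size_t NumDimensions>
struct copy_params_sketch {
    std::array<std::size_t, NumDimensions> extents {};   // elements to copy per axis
    const void* source      = nullptr;
    void*       destination = nullptr;

    copy_params_sketch& set_extents(std::array<std::size_t, NumDimensions> e) {
        extents = e;
        return *this;
    }
    copy_params_sketch& set_endpoints(const void* src, void* dst) {
        source = src;
        destination = dst;
        return *this;
    }
    // Like the real structure, this is not fool-proof: the caller must
    // make sure every relevant field was actually initialized.
    bool fully_initialized() const {
        if (source == nullptr || destination == nullptr) { return false; }
        for (auto extent : extents) {
            if (extent == 0) { return false; }
        }
        return true;
    }
};
```

The chainable setters group related fields so a caller sets them together, but - as the release notes warn for the real `copy_parameters_t<N>` - nothing forces a complete initialization, which is why a validity check (or care from the caller) is still needed.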