Giddy - A lightweight GPU decompression library
(Originally presented in this mini-paper at the DaMoN 2017 workshop.)
For questions, requests or bug reports - either use the Issues page or email me.
Discrete GPUs are powerful beasts, with numerous cores and high-bandwidth memory, often capable of 10x the data-crunching throughput achievable on a CPU. Perhaps the main obstacle to utilizing them, however, is that data usually resides in main system memory, close to the CPU; to have the GPU process it, we must send it over a PCIe bus. Thus the CPU can potentially process in-memory data at (typically, in 2017) 30-35 GB/sec, while a discrete GPU can receive it at no more than 12 GB/sec.
One way of counteracting this handicap is compression. The GPU can afford to expend more effort than the CPU on decompressing data arriving over the bus; thus, if the data is available a priori in system memory and is amenable to compression, compressing it may increase the GPU's effective bandwidth more than it would the CPU's. For example, data that compresses 3:1 arrives over a 12 GB/sec bus at an effective 36 GB/sec.
Compression schemes come in many shapes and sizes, but it is customary to distinguish "heavy-weight" schemes (such as those based on Lempel-Ziv) from "lightweight" schemes, which involve only a small amount of computation per element and few accesses to the compressed data when decompressing any single element.
Giddy enables the use of lightweight compressed data on the GPU by providing decompressor implementations for a plethora of compression schemes.
Giddy comprises:
If this sounds a bit confusing, scroll down to the examples section.
The following compression schemes are currently included:
(Note: the Wiki pages for each of the schemes are just now being written.)
Additionally, two patching schemes are supported:
As these are "a posteriori" patching schemes, you apply them by simply decompressing with some base scheme, then running one of the two kernels - `data_layout::scatter` or `data_layout::compressed_indices_scatter` - on the initial decompression result. You will not find specific kernels, kernel wrappers or factory entries for the "combined" patched scheme, only for its components.
Note: The examples use the C++'ish CUDA API wrappers, making the host-side code somewhat clearer and shorter.
Suppose we are presented with compressed data with the following characteristics (for simplicity, assume it is already in GPU memory):
| Parameter | Value |
|---|---|
| Decompression scheme | Frame of Reference |
| Width of size/index type | 32 bits |
| Uncompressed data type | `int32_t` |
| Type of offsets from FOR value | `int16_t` |
| Segment length | (runtime variable) |
| Total length of compressed data | (runtime variable) |
In other words, we want to implement the following function:
```cpp
using size_type         = uint32_t; // assuming less than 2^32 elements
using uncompressed_type = int32_t;
using compressed_type   = int16_t;

void decompress_on_device(
    uncompressed_type*             __restrict__  decompressed,
    const compressed_type*         __restrict__  compressed,
    const model_coefficients_type* __restrict__  segment_model_coefficients,
    size_type                                    length,
    size_type                                    segment_length);
```
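As a reference for the intended semantics (this is not Giddy's code), the same transform can be written on the host. For this sketch we assume each segment's model coefficients reduce to a single per-segment reference value of the uncompressed type:

```cpp
#include <cstdint>

using size_type         = uint32_t;
using uncompressed_type = int32_t;
using compressed_type   = int16_t;
// Assumption for this sketch: the model is just the segment's reference value.
using model_coefficients_type = int32_t;

// Host-side reference for Frame-of-Reference decompression: element i is
// its segment's reference value plus the stored (narrow) offset.
void decompress_on_host(
    uncompressed_type*             decompressed,
    const compressed_type*         compressed,
    const model_coefficients_type* segment_model_coefficients,
    size_type                      length,
    size_type                      segment_length)
{
    for (size_type i = 0; i < length; ++i) {
        decompressed[i] =
            segment_model_coefficients[i / segment_length] + compressed[i];
    }
}
```

Note that every element can be computed independently, which is exactly what makes the scheme a good fit for a massively parallel GPU kernel.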
We can do this with Giddy in one of three ways.
The example code for this mode of use is found in `examples/src/direct_use_of_kernel.cu`.
In this mode, we:

1. Call the `resolve_launch_configuration()` function with the object we instantiated, obtaining a `launch_configuration_t` struct.
2. Launch the kernel - either with the API wrappers (passing them the `launch_configuration_t`) or the plain vanilla way, extracting the fields of the `launch_configuration_t`.
The example code for this mode of use is found in `examples/src/instantiation_of_wrapper.cu`.
Each decompression kernel has a corresponding thin wrapper class. An instance of the wrapper class has no state - no data members; we only use it for its vtable, i.e. its virtual methods, specific to the decompression scheme. Thus, in this mode of use, we:
1. Instantiate the wrapper class `cuda::kernels::decompression::frame_of_reference::kernel_t`.
2. Call its `resolve_launch_configuration()` method with the appropriate parameters, obtaining a `launch_configuration_t` structure.
3. Call `cuda::kernel::enqueue_launch()` with our wrapper instance, the launch configuration, and the arguments we need to pass to the kernel.

The example code for this mode of use is found in `examples/src/factory_provided_type_erased_wrapper.cu`.
The kernel wrappers are intended to allow a uniform interface for launching kernels. This uniformity is achieved by type erasure: the virtual methods of the wrappers' base class all take a map of `boost::any` objects, and it is up to the caller to pass the appropriate parameters in that map. Thus, in this mode, we:
1. Use the `cuda::registered::kernel_t` class' static method `produceSubclass()` to instantiate the specific wrapper relevant to our scenario (named `"decompression::frame_of_reference::kernel_t<4u, int, short, cuda::functors::unary::parametric_model::constant<4u, int> >"`). What we actually hold is an `std::unique_ptr` to such an instance.
2. Call the `resolve_launch_configuration()` method of our instance, obtaining a `launch_configuration_t` structure.
3. Call the `enqueue_launch()` method of our instance, along with the launch configuration structure we've just obtained.

No code is currently provided for compressing data - neither on the device nor on the host side. This is Issue #3.
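The type-erasure idea behind this mode of use can be illustrated in a self-contained way. The sketch below uses `std::any` rather than `boost::any`, and all class, method and key names are hypothetical - this is not Giddy's actual class hierarchy:

```cpp
#include <any>
#include <memory>
#include <string>
#include <unordered_map>

// A minimal type-erased kernel-wrapper interface: every wrapper accepts its
// (scheme-specific) arguments through a uniform map of std::any objects,
// and the caller is responsible for supplying the right keys and types.
struct kernel_base_t {
    virtual ~kernel_base_t() = default;
    virtual void enqueue_launch(
        const std::unordered_map<std::string, std::any>& args) const = 0;
};

struct frame_of_reference_kernel_t : kernel_base_t {
    void enqueue_launch(
        const std::unordered_map<std::string, std::any>& args) const override {
        // Recover the typed arguments from the type-erased map.
        auto* out    = std::any_cast<int*>(args.at("decompressed"));
        auto  length = std::any_cast<unsigned>(args.at("length"));
        for (unsigned i = 0; i < length; ++i) { out[i] = 0; } // stand-in work
    }
};

// A toy "factory"; the real library looks wrappers up by a name string.
std::unique_ptr<kernel_base_t> produce(const std::string& name) {
    if (name == "frame_of_reference") {
        return std::make_unique<frame_of_reference_kernel_t>();
    }
    return nullptr;
}
```

The price of this uniformity is that mismatched keys or types are only caught at runtime (`std::any_cast` throws), which is why the caller must pass parameters exactly as the specific wrapper expects.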
Some of the decompressors are well-optimized; some need more work. The most recent (and only) performance analysis is in the mini-paper mentioned above. Step-by-step instructions for measuring performance (using well-known data sets) are forthcoming.
This endeavor was made possible with the help of: