hipSYCL Versions

Implementation of SYCL and C++ standard parallelism for CPUs and GPUs from all vendors: The independent, community-driven compiler for C++-based heterogeneous programming models. Lets applications adapt themselves to all the hardware in the system - even at runtime!

v24.02.0

2 months ago

Maxing out SYCL performance

AdaptiveCpp 24.02 introduces multiple compiler improvements, making it one of the best SYCL compilers in the world - and in many cases the best - when it comes to extracting performance from the hardware.

If you are not using it already, try it now and perhaps save some compute time!

The following performance results have been obtained with AdaptiveCpp's generic single-pass compiler (--acpp-targets=generic).

Note: oneAPI by default compiles with -ffast-math, while AdaptiveCpp does not enable fast math by default. All benchmarks have been explicitly compiled with -fno-fast-math to align compiler behavior, except where noted otherwise.

[Figure: perf_2402_nvidia - performance results on NVIDIA GPUs]

[Figure: perf_2402_amd - performance results on AMD GPUs]

Note: oneAPI for AMD does not correctly round sqrt() calls even if -fno-fast-math is passed, using approximate builtins instead. This loss of precision can substantially skew benchmark results. AdaptiveCpp 24.02 correctly rounds math functions by default. To align precision and allowed compiler optimizations, AdaptiveCpp was also allowed to use approximate sqrt builtins for the AMD results.

[Figure: perf_2402_intel - performance results on Intel GPUs]

Note: AdaptiveCpp was running on the Intel GPU through OpenCL, while DPC++ was using its default backend Level Zero, which allows for more low-level control. Some of the differences may be explained by the different backend runtimes underneath the SYCL implementations.

World's fastest compiler for C++ standard parallelism offload

AdaptiveCpp 24.02 ships with the world's fastest compiler for offloading C++ standard parallelism constructs. This functionality was already part of 23.10; however, 24.02 includes multiple important improvements. It can substantially outperform vendor compilers, and is the world's only compiler that can demonstrate C++ standard parallelism offloading performance across Intel, NVIDIA and AMD hardware. Consider the following performance results for the CloverLeaf, TeaLeaf and miniBUDE benchmarks:

[Figure: apps_stdpar_normalized - normalized C++ standard parallelism application performance]

  • The green bars show AdaptiveCpp 24.02 speedup over NVIDIA nvc++ on an NVIDIA A100.
  • The red bars show AdaptiveCpp 24.02 speedup over AMD roc-stdpar on an AMD Instinct MI100.
  • The blue bars show AdaptiveCpp 24.02 speedup over Intel icpx -fsycl-pstl-offload=gpu on an Intel Data Center GPU Max 1550.
  • The dashed blue line indicates performance within +/- 20%.

In particular, note that AdaptiveCpp does not depend on the XNACK hardware feature to obtain performance on AMD GPUs. XNACK is an elusive feature that is not available on most consumer hardware, and usually not enabled on most production HPC systems.

New features: Highlights

  • No targets specification needed anymore! AdaptiveCpp now by default compiles with --acpp-targets=generic. This means that a simple compiler invocation such as acpp -o test -O3 test.cpp will create a binary that can run on Intel, NVIDIA and AMD GPUs. AdaptiveCpp 24.02 is the world's only SYCL compiler that does not require specifying compilation targets to generate a binary that can run "everywhere".
  • New JIT backend: Host CPU. --acpp-targets=generic can now also target the host CPU through the generic JIT compiler. This can lead to performance improvements over the old omp compiler; e.g. on AMD Milan, babelstream's dot benchmark was observed to improve from 280 GB/s to 380 GB/s. This also means that it is no longer necessary to target omp to run on the CPU: generic is sufficient, and will likely perform better. Not having to compile for omp explicitly can also reduce compile times noticeably (we observed e.g. ~15% for babelstream).
  • Persistent on-disk kernel cache: AdaptiveCpp 24.02 ships with an on-disk kernel cache for JIT compilations occurring when using --acpp-targets=generic. This can substantially reduce JIT overheads.
  • Automatic runtime specialization of kernels: When using --acpp-targets=generic, AdaptiveCpp can now automatically apply optimizations to kernels at JIT-time based on runtime knowledge. This can lead to noticeable speedups in some cases, although the full potential of this is expected to only become apparent with future AdaptiveCpp versions.
    • This means that achieving best possible performance might require running the application multiple times, as AdaptiveCpp will try to JIT-compile increasingly specialized kernels with each application run. This can be controlled using the ACPP_ADAPTIVITY_LEVEL environment variable. Set it to 0 to recover the old behavior. The default is currently 1. If you are running benchmarks, you may have to update your benchmarking infrastructure to run applications multiple times.
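As a sketch of the new default workflow (assuming acpp is installed and a supported GPU or CPU backend is available at runtime; the file and binary names here are placeholders):

```shell
# Compile once; no --acpp-targets needed, since generic is now the default.
# The resulting binary can run on Intel, NVIDIA and AMD GPUs as well as the host CPU.
acpp -O3 -o test test.cpp

# First run: kernels are JIT-compiled and stored in the persistent on-disk cache.
./test

# Subsequent runs reuse the cache; with ACPP_ADAPTIVITY_LEVEL=1 (the default),
# AdaptiveCpp may JIT increasingly specialized kernels across runs.
./test

# Disable runtime specialization to recover the old behavior, e.g. for benchmarking:
ACPP_ADAPTIVITY_LEVEL=0 ./test
```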

What's Changed in Detail

Full Changelog: https://github.com/AdaptiveCpp/AdaptiveCpp/compare/v23.10.0...v24.02.0


v23.10.0

6 months ago

Highlights

This release contains several major features, and introduces a major shift in the project's capabilities:

  • New project name: AdaptiveCpp. This release is the first release with the new name, and contains renamed user-facing components. This includes e.g. the renamed compiler (acpp), compiler flags (e.g. --acpp-targets), cmake integration and more. The old name is still supported for backward compatibility during a transitional period. For details on why this renaming occurred, see https://github.com/AdaptiveCpp/AdaptiveCpp/issues/1147
  • The world's first single-pass SYCL compiler (--acpp-targets=generic): This release is the first to contain our new single-pass compiler, the world's only SYCL compiler which does not need to parse the code multiple times to generate a binary. Instead, during the regular host compilation, LLVM IR for kernels is extracted and embedded in the binary. At runtime, this IR is then JIT-compiled to whatever is needed (currently PTX, amdgcn and SPIR-V are supported).
    • As such, this new compiler design is also the first SYCL compiler to introduce a unified code representation across backends
    • "Compile once, run anywhere" - the new design guarantees that every binary generated by acpp --acpp-targets=generic can directly be executed on all supported GPUs from Intel, NVIDIA and AMD. The new approach can dramatically reduce compile times, especially when many devices need to be targeted, since the code is still only parsed a single time.
    • See the paper for more details: https://dl.acm.org/doi/10.1145/3585341.3585351
  • The world's first SYCL implementation to support automatic offloading of C++ parallel STL algorithms (--acpp-stdpar). This heterogeneous programming model was until now primarily supported by NVIDIA's nvc++ for NVIDIA GPUs. AdaptiveCpp not only supports it for NVIDIA, AMD and Intel GPUs, but also conveniently allows generating a binary that can dispatch to all supported devices using the new single-pass compiler. See here for details on this new experimental feature: https://github.com/AdaptiveCpp/AdaptiveCpp/blob/develop/doc/stdpar.md
  • Production support for Intel GPUs through the new single-pass compiler
  • New OpenCL backend - this new backend supports targeting OpenCL SPIR-V devices, such as Intel's CPU and GPU OpenCL runtimes, bringing the total number of supported backends to five.
  • Many bug fixes and performance optimizations!

What's changed

The full list of changes is too long for release pages; please see the full changelog for a comprehensive list: https://github.com/AdaptiveCpp/AdaptiveCpp/compare/v0.9.4...v23.10.0


v23.10.0-alpha

6 months ago

This is a prerelease for the upcoming 23.10.0 to provide a testing target.


Full Changelog: https://github.com/AdaptiveCpp/AdaptiveCpp/compare/v0.9.4...v23.10.0-alpha

v0.9.4

1 year ago

This is a maintenance release, intended as a last stop before major additions. It therefore does not include major functionality already available on the develop branch such as the generic single-pass compiler.

Build instruction addendum

For build instructions and issues that were discovered post-release (e.g. about compatibility with LLVM versions that were not yet released at the time), please see here: https://github.com/AdaptiveCpp/AdaptiveCpp/wiki/Build-instructions-for-old-versions#hipsycl-094

What's Changed


Full Changelog: https://github.com/illuhad/hipSYCL/compare/v0.9.3...v0.9.4

v0.9.3

1 year ago

Highlights

  • Improved compatibility with new clang versions and ROCm clang
  • New extensions, e.g.
    • coarse grained events. These are zero-construction-cost events at the expense of lower synchronization performance, and hence a good match if the returned event of an operation is not expected to be used
    • queue priorities for in-order queues on certain backends
  • Added hip.explicit-multipass compilation flow
  • Multiple optimizations that can potentially reduce runtime overheads substantially
    • Use event pools in CUDA/HIP backends
    • Use an asynchronous garbage collector thread to clean up old DAG nodes, removing garbage collection from the kernel submission path
    • Use std::weak_ptr instead of std::shared_ptr to express dependencies in the DAG, making old DAG nodes and their associated events eligible earlier for reuse by the event pool
  • In-order queues map 1:1 to dedicated CUDA or HIP streams for more explicit scheduling control
  • Unified kernel cache and data format for all explicit multipass compilation flows (hipSYCL container format, HCF)
  • Manage hipSYCL runtime lifetime by refcounting all SYCL objects created by the user instead of just having a global object; this can resolve errors when terminating the program on some backends.
  • Simplify deployment when no std::filesystem is available
  • New tool: hipsycl-hcf-tool to inspect and edit HCF files
  • New tool: hipsycl-info to print information about detected devices.


New Contributors

Thank you to our first-time contributors!

Full Changelog: https://github.com/illuhad/hipSYCL/compare/v0.9.2...v0.9.3

v0.9.2

2 years ago

Changes compared to the previous release 0.9.1 (selection)

The following is an incomplete list of changes and improvements:

Highlights

  • Initial support for operating as a pure CUDA library for NVIDIA's proprietary nvc++ compiler, without any additional hipSYCL compiler magic. In this flow, LLVM is not required and new NVIDIA hardware can be targeted as soon as NVIDIA adds support in nvc++.
  • Initial support for dedicated compiler support in the CPU backend. These new compilation passes can greatly improve performance of nd_range parallel for kernels on CPU. This allows executing SYCL code efficiently on any CPU supported by LLVM.
  • Scoped parallelism API v2 for a more performance portable programming model
  • Reimplement explicit multipass support for clang >= 13. This allows targeting multiple backends simultaneously, and was previously only supported on clang 11. Kernel names in the binary are now always demangleable as __hipsycl_kernel<KernelNameT> or __hipsycl_kernel<KernelBodyT>.
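As an illustration of the new stable kernel naming (the kernel name type int here is hypothetical, and binutils' c++filt is assumed to be available), such symbols can be inspected with standard demangling tools:

```shell
# Demangle the Itanium-mangled type name of __hipsycl_kernel<int>
# ("16" is the length of the identifier "__hipsycl_kernel", "Ii...E" the
# template argument list). This should print: __hipsycl_kernel<int>
c++filt -t 16__hipsycl_kernelIiE
```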

SYCL support

  • Support for new SYCL 2020 features such as atomic_ref, device selector API, device aspect API and others
  • Support for SYCL 2020 final group algorithm interface
  • Add support for the profiling API
  • ... more

Extensions

  • Add initial support for multi-device queue hipSYCL extension to automatically distribute work across multiple devices
  • Add initial support for queue::get_wait_list() hipSYCL extension to allow barrier-like semantics at the queue level
  • Add accessor_variant extension which allows accessors to automatically optimize the internal data layout of the accessor object depending on how they were constructed. This can save registers on device without any changes needed by the user.
  • Add handler::update_device() extension in analogy to already existing update_host(). This can be e.g. used to prefetch data.
  • Complete buffer-USM interoperability API
  • Add support for explicit buffer policy extension and asynchronous buffers

See the documentation on extensions for more details.

Optimizations

  • Automatic work distribution across multiple streams
  • Fix a massive performance regression caused by a bug in the kernel cache of the Level Zero backend
  • Optimize CUDA backend to perform aggressive CUDA module caching in an explicit multipass scenario. This can greatly improve performance of the cuda.explicit-multipass compilation flow when multiple translation units are involved.
  • Several performance fixes and improvements in the hipSYCL runtime. Especially when spawning many tasks, performance can now be significantly better.
  • ... more

Bug fixes and other improvements

Yes, a lot of them :-)

v0.9.1

3 years ago

hipSYCL 0.9.1

-- This release is dedicated to the memory of Oliver M. Some things just end too soon.

New major features

  • Add new "explicit multipass" compilation model, allowing all of hipSYCL's backends to be targeted simultaneously. This means hipSYCL can now compile to a binary that can run on devices from multiple vendors. Details on the compilation flow can be found here: https://github.com/illuhad/hipSYCL/blob/develop/doc/compilation.md
  • Introduce plugin architecture for backends of the hipSYCL runtime. This means hipSYCL now looks for backend plugins at runtime, allowing an existing hipSYCL installation to be extended with support for additional hardware without changing the already installed components.
  • Initial, experimental support for Intel GPUs using Level Zero and SPIR-V
  • Introducing initial support for large portions of oneDPL using our fork at https://github.com/hipSYCL/oneDPL
  • hipSYCL is now also tested on Windows in CI, although Windows support is still experimental.

New features and extensions

  • Command group properties that can influence how kernels or other operations are scheduled or executed:
    • hipSYCL_retarget command group property. Execute an operation submitted to a queue on an arbitrary device instead of the one the queue is bound to.
    • hipSYCL_prefer_group_size<Dim> command group property. Provides a recommendation to hipSYCL which group size to choose for basic parallel for kernels.
    • hipSYCL_prefer_execution_lane command group property. Provides a hint to the runtime on which backend queue (e.g. CUDA stream) an operation should be executed. This can be used to optimize kernel concurrency or overlap of data transfers and compute in case the hipSYCL scheduler does not already automatically submit an optimal configuration.
  • Comprehensive interoperability framework between buffers and USM pointers. This includes extracting USM pointers from existing buffer objects, turning any buffer into a collection of USM pointers, as well as constructing buffer objects on top of existing USM pointers.
  • The hipSYCL_page_size buffer property can be used to enable data state tracking inside a buffer at a granularity below the buffer size. This can be used to allow multiple kernels to concurrently write to the same buffer as long as they access different hipSYCL data management pages. Unlike subbuffers, this also works with multi-dimensional strided memory accesses.
  • Synchronous sycl::mem_advise() as free function
  • handler::prefetch_host() and queue::prefetch_host() for a simpler mechanism of prefetching USM allocations to host memory.
  • Explicit buffer policies to make programmer intent clearer as well as asynchronous buffer types that do not block in the destructor, which can improve performance. For example, auto v = sycl::make_async_view(ptr, range) constructs a buffer that operates directly on the input pointer and does not block in the destructor.
  • The HIPSYCL_VISIBILITY_MASK environment variable can be used to select which backends should be loaded.

See https://github.com/illuhad/hipSYCL/blob/develop/doc/extensions.md for a list of all hipSYCL extensions with more details.

Optimizations and improvements

  • Hand-tuned optimizations for SYCL 2020 group algorithms
  • Automatic distribution of kernels across multiple CUDA/HIP streams
  • Improved support for newer ROCm versions
  • SYCL 2020 accessor deduction guides and host_accessor
  • Improve handling of Multi-GPU setups
  • Significant performance improvements for queue::wait()
  • Early DAG optimizations to improve handling of complex and large dependency graphs
  • Optimizations to elide unnecessary synchronization between DAG nodes

Bug fixes and other improvements

Yes, a lot of them!

v0.9.0

3 years ago

hipSYCL 0.9.0

hipSYCL 0.9 is packed with tons of new features compared to the older 0.8 series:

Support for key SYCL 2020 features

hipSYCL 0.9.0 introduces support for several key SYCL 2020 features, including:

  • Unified shared memory provides a pointer-based memory model as an alternative to the traditional buffer-accessor model
  • SYCL 2020 generalized backend model and backend interoperability provides generic mechanisms for interoperability between the underlying backend objects and SYCL
  • Queue shortcuts for kernel invocation and USM memory management functions
  • In-order queues to submit kernels in order when a task graph is not required
  • Unnamed kernel lambdas (requires building hipSYCL against clang >= 10)
  • Subgroups
  • Group algorithms for parallel primitives at work group and subgroup level (Note that the interface may change slightly with the release of SYCL 2020 final, optimization is ongoing)
  • Reductions provide a simple way to carry out arbitrary numbers of reduction operations across all work items of a kernel using either predefined or user-provided reduction operators (Note that the interface may change slightly with the release of SYCL 2020 final; optimization is ongoing). Currently only scalar reductions are supported; multiple simultaneous reductions are supported. In addition to the requirements of the SYCL specification, we also support reductions for the hierarchical and scoped parallelism models.
  • ... and more! See here for more information on the SYCL 2020 coverage of current hipSYCL: https://github.com/hipSYCL/featuresupport

Unique hipSYCL extensions

There are two new extensions in hipSYCL 0.9.0:

New runtime library

hipSYCL 0.9.0 is the first release containing the entirely rewritten, brand new runtime library, which includes features such as:

  • Single library for all backends (libhipSYCL-rt) instead of libraries for each backend (libhipSYCL_cpu, libhipSYCL_cuda etc)
  • Strict separation between backend-specific code and generic code, with a clear, simple interface for new backends, making it easy to add additional backends in the future
  • Multiple runtime backends can now be active at the same time and interact
  • SYCL interface is now header-only; bootstrap mode in syclcc is no longer required and has been removed. When building hipSYCL, only the runtime needs to be compiled which can be done with any regular C++ compiler. This should simplify the build process greatly.
  • Architecture supports arbitrary execution models in different backends - queue/stream based, task graphs etc.
  • CUDA and CPU backends do not depend on HIP API anymore. The CUDA backend now goes directly to CUDA without going through HIP, and the CPU backend goes directly to OpenMP without going through hipCPU. hipCPU and HIP submodules are no longer required and have been removed.
  • Strict separation between SYCL interface and runtime, making it easy to expose new features (e.g. SYCL 2020) in the SYCL interface by leveraging the SYCL runtime interfaces underneath.
  • For each operation, the SYCL interface can pass additional information to the runtime/scheduler using a hints framework. The device on which an operation is executed is just another hint for the runtime.
  • Support for lazy DAG execution (Note: Only partially activated by default)
  • Almost entirely callback-free execution model in CUDA/ROCm backends for potentially higher task throughput
  • New memory management system and improved multi-GPU support
    • manages arbitrary allocations on multiple devices
    • manages memory potentially below buffer granularity, using 3D page table to track invalid memory regions (not yet fully exposed)
  • Backend queues (e.g. CUDA streams) are maintained by the backend in a pool, the scheduler then distributes operations across the queues. No matter how many sycl::queues exist, compute/memory-overlap always works equally well. This means a sycl::queue is now nothing more than an interface to the runtime.
  • Vastly improved error handling. Proper implementation of async errors/error handlers. Task execution will be cancelled when an error is detected.
  • ROCm backend: Add support for 3D data transfers

syclcc and compilation improvements

  • new --hipsycl-targets flag that allows compiling for multiple targets and backends, e.g. syclcc --hipsycl-targets="omp;hip:gfx906,gfx900" compiles for the OpenMP backend as well as for Vega 20 (gfx906) and Vega 10 (gfx900). Note that simultaneous compilation for both NVIDIA and AMD GPUs is not supported due to clang limitations.
  • The compiler arguments and linker flags passed to backend compilers are now all exposed in cmake (and syclcc.json), giving the user more control to adapt the compilation flow to individual requirements. This can be helpful for uncommon setup scenarios where different flags may be required.

Performance improvements

  • New execution model for nd_range parallel for on CPU, bringing several orders of magnitude of performance improvement. Note that nd_range parallel for is inherently difficult to implement efficiently in library-only CPU backends, and basic parallel for or our scoped parallelism extension should be preferred if possible.

Fixes and other improvements

Yes, a lot of them :-)

v0.8.0

4 years ago

Note: hipSYCL 0.8.0 is deprecated, users are encouraged to use our package repositories instead

This is the release of hipSYCL 0.8.0. We provide the following packages:

  • hipSYCL-base provides the basic LLVM compiler stack that is needed in any case
  • hipSYCL-rocm provides a compatible ROCm stack that additionally allows hipSYCL to target AMD GPUs
  • hipSYCL provides the actual hipSYCL libraries, tools and headers

While we cannot provide matching CUDA packages for NVIDIA support due to legal reasons, scripts for installing a matching CUDA distribution as well as scripts to generate CUDA packages are provided. You will find further information in the readme here on github.

At the moment, Arch Linux, CentOS 7 and Ubuntu 18.04 packages are provided.

v0.8.0-rc1

4 years ago

This is a prerelease of hipSYCL 0.8.0. In particular, it serves to test new packages of the entire hipSYCL stack. We provide the following packages:

  • hipSYCL-base provides the basic LLVM compiler stack that is needed in any case
  • hipSYCL-rocm provides a compatible ROCm stack that additionally allows hipSYCL to target AMD GPUs
  • hipSYCL provides the actual hipSYCL libraries, tools and headers

While we cannot provide matching CUDA packages due to legal reasons, CUDA installation scripts will be provided for the actual hipSYCL 0.8.0 release.

At the moment, Arch Linux and Ubuntu 18.04 packages are provided.