ILGPU Versions

ILGPU JIT Compiler for high-performance .Net GPU programs

v1.2.0-beta1

2 years ago

This new beta release includes bug fixes and a significantly improved O2 optimization pipeline (get the ILGPU Nuget package and ILGPU Algorithms Nuget package).
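
As a quick reference (not part of the original notes), the improved O2 pipeline is opted into through the context builder; the following minimal sketch mirrors the Device API examples shown in the v1.0.0 notes further below:

// Sketch: enable the O2 optimization pipeline together with all compatible Cuda devices.
// Namespaces (ILGPU, ILGPU.Runtime, ILGPU.Runtime.Cuda) are omitted, as in the other snippets.
using var context = Context.Create(builder => builder.Optimize(OptimizationLevel.O2).Cuda());
using var accelerator = context.CreateCudaAccelerator(0);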

Changes

  • Reviewed ILGPU documentation (#750, #776).
  • Added Cuda ISA 7.5, ISA 7.6 and SM 8.7 (#778).
  • Added support to fold Shuffle and Broadcast operations (#764).
  • Improved performance by using uniform branches for NVIDIA GPUs (#765).
  • Improved LoopUnrolling to cover more cases (#766).
  • Improved inline PTX to support multiple output and by-ref parameters (#760).
  • Fixed issues with LibDevice integration (#784).
  • Fixed issue with unsigned nested conversions (#772, #774).
  • Fixed sample project target frameworks (#771).

Internal changes

  • Bump FluentAssertions from 6.5.1 to 6.6.0 in /Src (#785).
  • Reset baseline for 1.2.0 (#777).

Special thanks

Special thanks to @hokb, @jgiannuzzi, @MoFtZ and @Ruberik for their contributions to this release in the form of code, feedback, ideas and proposals. Furthermore, we would like to thank the entire ILGPU community (especially @Joey9801, @kilngod, @MPSQUARK, @NullandKale and @Yey007) for providing feedback, submitting issues and feature requests.

v1.1.0

2 years ago

This new release includes bug fixes, a huge set of new features (e.g. LibDevice integration, CudaFFT and NVML bindings) and a significantly improved O2 optimization pipeline (get the ILGPU Nuget package and ILGPU Algorithms Nuget package).

Changes

  • Bumped System.Reflection.Metadata from 6.0.0 to 6.0.1 (#767).
  • Added NVML bindings (#518).
  • Added CuFFT and CuFFTW bindings (#706).
  • Added NvJpeg image-decoding bindings (#716, #721).
  • Added LibDevice bindings to include highly optimized math functions on NVIDIA GPUs (#707).
  • Added FP16 support to CuBlas bindings (#658).
  • Added new alignment methods to views to improve performance (#684).
  • Added new global code scheduling transformation to O2 pipeline (#704, #734).
  • Improved debug view implementations of all array views (#647).
  • Improved automatic vectorization (#668).
  • Improved performance of dead-code elimination (#702).
  • Improved loop-invariant code motion transformation (#703).
  • Improved on-the-fly optimization of SetField operations (#671).
  • Improved on-the-fly optimization of LoadElementAddress operations (#733).
  • Fixed missing binding of accelerator instances during Cuda memcopy operations (#705).
  • Fixed exception handling in the case of missing assembly binding redirects (#775).
  • Fixed code-placement phase and invalid removal of DebugAssert values (#749).
  • Fixed race condition in CPUMultiprocessor during lazy initialization (#747).
  • Fixed inheritance to avoid removal of IOValue instances (#745).
  • Fixed issue with the same phi value being reused in a loop (#756).
  • Fixed issue with unique algorithm when running multiple iterations per group (#758).
  • Prevented unintentional initialization of the current Accelerator instance (#714).

Internal changes

  • Require .NET 6 for building and enable package validation (#729).
  • Bumped T4.Build from 0.2.3 to 0.2.4 (#767).
  • Bumped FluentAssertions from 6.5.0 to 6.5.1 (#748).
  • Bumped Microsoft.NET.Test.SDK from 17.0.0 to 17.1.0 (#752).
  • Fixed warnings in .NET 6 builds (#710).
  • Fixed missing struct constraint on TraversalSuccessorsProvider (#727).
  • Added ILGPU logos to logo folder (#717).

Special thanks

Special thanks to @Debiday, @jgiannuzzi, @MoFtZ and @Ruberik for their contributions to this release in the form of code, feedback, ideas and proposals. Furthermore, we would like to thank the entire ILGPU community (especially @Joey9801, @kilngod, @mikhail-khalizev, @MPSQUARK, @NullandKale, @RER009 and @Yey007) for providing feedback, submitting issues and feature requests.

v1.1.0-beta1

2 years ago

This new beta release includes bug fixes, a huge set of new features (e.g. LibDevice integration, CudaFFT and NVML bindings) and a significantly improved O2 optimization pipeline (get the ILGPU Nuget package and ILGPU Algorithms Nuget package).

Changes

  • Added NVML bindings (#518).
  • Added CuFFT and CuFFTW bindings (#706).
  • Added NvJpeg image-decoding bindings (#716, #721).
  • Added LibDevice bindings to include highly optimized math functions on NVIDIA GPUs (#707).
  • Added FP16 support to CuBlas bindings (#658).
  • Added new alignment methods to views to improve performance (#684).
  • Added new global code scheduling transformation to O2 pipeline (#704, #734).
  • Improved debug view implementations of all array views (#647).
  • Improved automatic vectorization (#668).
  • Improved performance of dead-code elimination (#702).
  • Improved loop-invariant code motion transformation (#703).
  • Improved on-the-fly optimization of SetField operations (#671).
  • Improved on-the-fly optimization of LoadElementAddress operations (#733).
  • Fixed missing binding of accelerator instances during Cuda memcopy operations (#705).
  • Fixed exception handling in the case of missing assembly binding redirects (#730).
  • Prevented unintentional initialization of the current Accelerator instance (#714).

Internal changes

  • Require .NET 6 for building and enable package validation (#729).
  • Fixed warnings in .NET 6 builds (#710).
  • Fixed missing struct constraint on TraversalSuccessorsProvider (#727).
  • Added ILGPU logos to logo folder (#717).

Special thanks

Special thanks to @Debiday, @jgiannuzzi, @MoFtZ and @Ruberik for their contributions to this release in the form of code, feedback, ideas and proposals. Furthermore, we would like to thank the entire ILGPU community (especially @Joey9801, @kilngod, @mikhail-khalizev, @MPSQUARK, @NullandKale, @RER009 and @Yey007) for providing feedback, submitting issues and feature requests.

v1.0.0

2 years ago

This new stable release offers major performance improvements and new APIs that simplify programming, improve productivity and reduce programming errors. It also includes a lot of amazing new features (see below and get the Nuget package).

General notes

  • We converted ILGPU into a monorepo project including ILGPU.Algorithms, ILGPU.Samples, the Wiki and enhanced documentation.
  • This version has some breaking changes compared to previous stable ILGPU versions (see below).

Breaking changes

  • The Memory API, involving the ArrayView and MemoryBuffer types, has been significantly improved to support explicit Stride information (see below).
  • All IndexX and LongIndexX types have been renamed to IndexXD and LongIndexXD to have a unified programming experience with respect to memory buffers and array views (see below).
  • The Device API has been redesigned to explicitly enable, filter and configure the available hardware accelerator devices (see below).
  • Parts of the Algorithms library have been refined to support the newly introduced stride types.

Changes

  • Added new Memory API to support explicit stride information (#421, #475, #483).
  • Added new Device API to enable, filter and configure the available hardware accelerator devices (#428).
  • Added support for OpenCL 3.0 API (#464).
  • Added support for inline PTX assembly instructions (#467).
  • Added support for multi-dimensional and static constant arrays (#479).
  • Added support for convenient profiling using ProfilingMarkers (#482).
  • Improved CPU runtime to support arbitrary Warp/Group/Multiprocessor configurations (#402, #484).
  • Improved error messages (#466).
  • Enabled folding of debug assertions in IRBuilder (#477).
  • Fixed Group helper methods for multi-dimensional kernels (#481).
  • Fixed invalid code generation of OpenCL kernels in the presence of constant switch conditions (#441).
  • Promoted .NET 5 to a default target framework (#529, #536).
  • Added new Array processing pipeline to have full support for nD-arrays (#513).
  • Added convenience overloads for AsNDView (#571).
  • Added support for zero-length SubView operations (#550).
  • Added Backend optimizations for CPU backend to re-enable support for enhanced shared memory allocations (see #567) (#574).
  • Added support for Cuda ISA 7.3 and 7.4 to support all latest drivers (#566).
  • Added UCE transformation to the backend optimization passes (#569).
  • Added VS integration of check styles to all projects and fixed style checking (#517, #511).
  • Added CPU builder method to register custom CPU devices (#507).
  • Added support for chaining EnableAlgorithms on Context builder instances (#515).
  • Improved performance of all tests by enabling aggressive caching (#522).
  • Improved hash codes of IndexND and LongIndexND types (#510).
  • Changed InvalidEntryPointIndexParameterOfWrongType error message to be more descriptive (#535).
  • Changed T4 DllImportSearchPath to LegacyBehavior (#514).
  • Fixed constant folding when converting unsigned integers (#549).
  • Fixed critical issue when swapping registers/variables in backends (#541).
  • Fixed invalid copies from and to sub views (#523).
  • Fixed and enhanced Stride and ArrayView types (#509).
  • Fixed regression in single-pass scan when performing multiple iterations (#525).
  • Fixed RadixSortProvider and ScanProvider test cases (#516).
  • Removed obsolete properties and methods (#524).

Repository Changes

  • Merged ILGPU.Samples into ILGPU repository (#538, #561, #563, #564, #565, #568).
  • Merged ILGPU.Algorithms into ILGPU repository.
  • Merged ILGPU Wiki into ILGPU repository (#537).
  • Merged external ILGPU v0.10.1 documents (#546).
  • Added information about symbols and source link to ReadMe file (#594).

CI Changes

  • Added badges for versions and CI (#534).
  • Skipped publishing NuGet packages on forks (#533).
  • Enabled selective builds on macOS, master and tags (#530).
  • Fixed a NuGet publishing bug in the CI pipeline (#572).
  • Restricted the package CI job to run only once (#527).
  • Configured clean test runs on pushes to master or tags without using caches (#526).
  • Added support for releasing preview builds via feedz.io (#521, #520).
  • Adapted CI for new ILGPU monorepo (#512).

Major internal changes

  • Added build support for .NET 5.0 (#446).
  • Added support for T4.Build to automatically transform T4 text templates during build (#431).
  • Restrict net47 unit tests to only run on CI builds (#465).
  • Avoid duplicate CI runs for pull requests made from the same repo (#485).
  • Updated InlineList implementation to reduce memory consumption (#478).
  • Fixed invalid assertion affecting successor blocks in frontend (#445).
  • Added missing struct type constraints (#532).
  • Applied general cleanup (#531).
  • Removed obsolete configurations from solutions (#599).
  • Prepared conditional compilation for future .NET frameworks (#592).
  • Updated .Net Framework version from v4.7 to v4.7.1 (#594).
  • Added 1.0.0 pre-release documentation (#602).
  • Added sample about inline PTX assembly instructions (#588).
  • Added sample about monitoring progress on Cuda accelerators (#593).
  • Added sample project for printf-like output in kernels (#600).
  • Added sample project for debug asserts in kernels (#600).
  • Added sample project for removing consecutive duplicate values (#600).
  • Added sample project for calculating histograms (#600).
  • Added sample project for fixed sized buffers (#600).
  • Added support for zero-length subviews of zero-length views (#585).
  • Guard against zero-length (CUDA and CL) allocations to enable allocations of zero bytes (#547, #610).
  • Simplified naming of GetAsPageLockedArray and AllocatePageLockedArray (#608).
  • Fixed transformation issues regarding many functions in kernel modules (without inlining) (#613).
  • Fixed invalid detection and processing of loops consisting of a single entry block (#607).
  • Fixed invalid conversion of LFA values in SSAStructureConstruction (affects array optimizations, #605).

Notes

  • We updated the versions of the .Net dependencies (#576, #577, #578, #579, #580, #581, #582, #583, #586, #591, #595 and #601).
  • We updated the required .Net Framework version (from v4.7 to v4.7.1) to benefit from the most recent dependency updates (#595).
  • We updated the ILGPU documentation and all samples to be compatible with this release (#584, #593, #600, #602).

Summary of the changes related to the new Memory API

The new API distinguishes between a coherent, strongly typed ArrayView<T> structure and its n-D versions ArrayViewXD<T, TStride>, which carry dimension-dependent stride information (The actual logic for computing element addresses is moved from the IndexXD types to the newly added StrideXD types). This allows developers to explicitly specify a particular stride of a view, reinterpret the data layout itself (by changing the stride), and perform compile-time optimizations based on explicitly typed stride information. Consequently, ILGPU's optimization pipeline is able to remove the overhead of these abstractions in most cases (except in rare use cases where strange-looking strides are used). It also makes all memory transfer-related operations explicit in terms of what memory layout the underlying data will have after an operation is performed.

In addition, it moves all copy-related methods to the ArrayView instances instead of exposing them on the memory buffers. This realizes a "separation of concerns": On the one hand, a MemoryBuffer holds a reference to the native memory area and controls its lifetime. On the other hand, ArrayView structures manage the contents of these buffers and make them available to the actual GPU kernels.

Example:

// Simple 1D allocation of 1024 longs with TStride = Stride1D.Dense (all elements are accessed contiguously in memory)
var t = accl.Allocate1D<long>(1024);

// Advanced 1D allocation of 1024 longs with TStride = Stride1D.General(2) (each memory access will skip 2 elements)
// -> allocates 1024 * 2 longs to be able to access all of them
var t1 = accl.Allocate1D<long, Stride1D.General>(1024, new Stride1D.General(2));

// Simple 1D allocation of 1024 longs using the array provided
var data1 = new long[1024];
var t2 = accl.Allocate1D(data1);

// Simple 2D allocation of 1024 * 1024 longs using the array provided with TStride = Stride2D.DenseX
// (all elements in X dimension are accessed contiguously in memory)
// -> this will *not* transpose the input buffer as the memory layout will be identical on CPU and GPU
var data2 = new long[1024, 1024];
var t3 = accl.Allocate2DDenseX(data2);

// Simple 2D allocation of 1024 * 1024 longs using the array provided, with TStride = Stride2D.DenseY
// (all elements in Y dimension are accessed contiguously in memory)
// -> this *will* transpose the input buffer to match the desired data layout
var data3 = new long[1024, 1024];
var t4 = accl.Allocate2DDenseY(data3);

The major changes/features of the new Memory API are:

  • Index1|Index2|Index3 types have been renamed to Index1D|Index2D|Index3D to match the naming scheme of ArrayViewXD and MemoryBufferXD types.
  • LongIndex1|LongIndex2|LongIndex3 types have been renamed to LongIndex1D|LongIndex2D|LongIndex3D to match the naming scheme of the ArrayViewXD and MemoryBufferXD types.
  • Separation of concerns between MemoryBuffer and ArrayView instances:
    • ArrayView... structures represent and manage the contents of buffers (or chunks of buffers).
    • MemoryBuffer... classes manage the lifetime of allocated memory chunks on a device.
  • The ILGPU.ArrayView intrinsic structure implements the newly added IContiguousArrayView interface that marks contiguous memory sections.
  • The ILGPU.Runtime.MemoryBuffer... classes implement the newly added IContiguousArrayView interface that marks contiguous memory sections.
  • Types implementing the IContiguousArrayView interface provide extension methods for initializing, copying from and to the memory region (not supported on accelerators).
  • This PR adds the notion of Strides. ILGPU contains built-in common strides for 1D, 2D and 3D views.
    • Stride1D.Dense represents contiguous chunks of memory that pack elements side by side.
    • Stride1D.General represents strides that skip a certain number of elements.
    • Stride2D.DenseX represents 2D strides that pack elements side by side in dimension X (transfers from and to views with this stride involve transpose operations).
    • Stride2D.DenseY represents 2D strides that pack elements in the Y dimension side by side.
    • Stride2D.General represents strides that skip a certain number of elements in the X and Y dimensions.
    • Stride3D.DenseXY represents 3D strides that pack elements in the X,Y dimension side by side (transfers from and to views with this stride involve transposition operations).
    • Stride3D.DenseZY represents 3D strides that pack elements in the Z,Y dimension side by side.
    • Stride3D.General represents strides that omit a certain number of elements in the X, Y and Z dimensions.
  • All ArrayViewXD types have been moved to the ILGPU.Runtime namespace.
  • None of the ArrayViewXD types implement IContiguousArrayView, as they support arbitrary stride information. Note that the ArrayView1D<T, Stride1D.Dense> specialization has an implicit conversion to ArrayView<T> (and vice versa) for convenience.
  • All CopyFromCPU and CopyToCPU methods are provided with additional hints as to whether they are transposing the input elements or keeping the original layout.
  • Note that GetAsXDArray(...) always returns elements in .Net standard layout for 1D, 2D and 3D arrays (this may result in transposing the input elements of the buffer on the CPU). Use view.AsContiguous().GetAsArray() to get the memory layout of the input buffer.
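
To make this separation of concerns more concrete, here is a minimal end-to-end sketch (not taken from the release notes) that allocates a buffer, copies data through its view, runs a kernel and copies the results back:

// Kernel working on a generic ArrayView<int>; the dense 1D view of the buffer
// converts implicitly to ArrayView<int> as described above.
static void ScaleKernel(Index1D index, ArrayView<int> view) =>
    view[index] *= 2;

using var context = Context.CreateDefault();
using var accelerator = context.CreateCPUAccelerator(0);

var input = new int[1024];

// The MemoryBuffer1D<int, Stride1D.Dense> instance owns the native memory...
using var buffer = accelerator.Allocate1D<int>(input.Length);

// ...while all copy operations are exposed on its views.
buffer.View.CopyFromCPU(input);

var kernel = accelerator.LoadAutoGroupedStreamKernel<Index1D, ArrayView<int>>(ScaleKernel);
kernel((int)buffer.Length, buffer.View);
accelerator.Synchronize();

var output = new int[input.Length];
buffer.View.CopyToCPU(output);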

This also affects the implementation of all IndexND types. We moved the index reconstruction functions from the index types to the individual stride implementations:

Old way:

Index2D index = <some_extent>.ReconstructIndex(index);

New way:

Index2D index = Stride2D.DenseX.ReconstructFromElementIndex(index, <some_extent>);
// .. or ..
Index2D index = Stride2D.DenseY.ReconstructFromElementIndex(index, <some_extent>);

Summary of the changes related to the new Device API

The new Device API removes the ContextFlags enumeration and implements the same functionality in an object-oriented way using a Context.Builder class. It offers a fluent-API-like configuration interface that makes it easy to set up a context:

// Enables all supported accelerators (default CPU accelerator only) and puts the context
// into auto-assertion mode via "AutoAssertions()". In other words, if a debugger is attached,
// the `Context` instance will turn on all assertion checks. This behavior is identical
// to the current implementation via new Context();
using var context = Context.CreateDefault();

// Turns on O2 and enables all compatible Cuda devices.
using var context = Context.Create(builder =>
{
    builder.Optimize(OptimizationLevel.O2).Cuda();
});

// Turns on all assertions, enables the IR verifier and enables all compatible OpenCL devices.
using var context = Context.Create(builder =>
{
    builder.Assertions().Verify().OpenCL();
});

// Turns on kernel source-line annotations, fast math using 32-bit float and enables
// *all* (even incompatible) OpenCL devices.
using var context = Context.Create(builder =>
{
    builder
        .DebugSymbols(DebugSymbolsMode.KernelSourceAnnotations)
        .Math(MathMode.Fast32BitOnly)
        .OpenCL(device => true);
});

// Selects an OpenCL device with a warp size of at least 32:
using var context = Context.Create(builder =>
{
    builder.OpenCL(device => device.WarpSize >= 32);
});

// Turns on all assertions in debug mode (same behavior like calling CreateDefault()):
using var context = Context.Create(builder =>
{
    builder.AutoAssertions();
});

// Turns on debug optimizations (level O0) and all assertions if a debugger is attached:
using var context = Context.Create(builder =>
{
    builder.AutoDebug();
});

// Turns on debug mode (optimization level O0, assertions and kernel debug information):
using var context = Context.Create(builder =>
{
    builder.Debug();
});

// Disable caching, enable conservative inlining and inline mutable static field values:
using var context = Context.Create(builder =>
{
    builder
        .Caching(CachingMode.Disabled)
        .Inlining(InliningMode.Conservative)
        .StaticFields(StaticFieldMode.MutableStaticFields);
});

// Turn on *all* CPU accelerators that simulate different hardware platforms:
using var context = Context.Create(builder => builder.CPU());

// Turn on an AMD-based CPU accelerator:
using var context = Context.Create(builder => builder.CPU(CPUDeviceKind.AMD));

Note that, by default, all debug symbols are automatically turned off unless a debugger is attached. If you want to turn on debug information in all cases, call builder.DebugSymbols(DebugSymbolsMode.Basic). At the same time, this PR introduces the notion of a Device, which replaces the implementation of AcceleratorId. This allows us to query detailed device information without explicitly instantiating an accelerator:

// Print all device information without instantiating a single accelerator
// (device context) instance.
using var context = Context.Create(...);
foreach (var device in context)
{
    // Print detailed accelerator information
    device.PrintInformation();

    // ...
}

Note that we removed the ability to call the accelerator constructors (e.g. new CudaAccelerator(...)) directly. Either use the CreateAccelerator methods defined in the Device classes or use one of the extension methods like CreateCudaAccelerator(...) of the Context class itself:

using var context = Context.Create(...);
foreach (var device in context)
{
    // Instantiate an accelerator instance on this device
    using Accelerator accel = device.CreateAccelerator();
    // ...
}

// Instantiate the 2nd Cuda accelerator (NOTE that this is the *2nd* Cuda device
// and *not* the 2nd device of your machine).
using CudaAccelerator cudaDevice = context.CreateCudaAccelerator(1);

// Instantiate the 1st OpenCL accelerator (NOTE that this is the *1st* OpenCL device
// and *not* the 1st device of your machine).
using CLAccelerator clDevice = context.CreateOpenCLAccelerator(0);

Context properties that expose types from other (ILGPU-internal) namespaces, which cannot or should not be covered by the API/ABI guarantees we want to give, have been made internal. To access these properties, use one of the available extension methods located in the corresponding namespaces:

using var context = ...

// OLD way
var internalIRContext = context.IRContext;

// NEW way:
// using namespace ILGPU.IR;
var internalIRContext = context.GetIRContext();

Using the Algorithms Library with the new Memory and Device APIs

To use the new version of the algorithms library with ILGPU v1.0.0, you need to initialize the library with the help of the new builder pattern:

// Enables all algorithm library features
using var context = Context.Create(builder =>
{
    builder.EnableAlgorithms();
});
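
Once EnableAlgorithms() has been called, kernels can rely on the functionality of the algorithms library. A minimal sketch (not part of the release notes) using the XMath helper class from ILGPU.Algorithms:

// Kernel using XMath, which requires EnableAlgorithms() on the context builder.
static void GaussKernel(Index1D index, ArrayView<float> view) =>
    view[index] = XMath.Exp(-view[index] * view[index]);

using var context = Context.Create(builder => builder.CPU().EnableAlgorithms());
using var accelerator = context.CreateCPUAccelerator(0);

using var buffer = accelerator.Allocate1D<float>(256);
var kernel = accelerator.LoadAutoGroupedStreamKernel<Index1D, ArrayView<float>>(GaussKernel);
kernel((int)buffer.Length, buffer.View);
accelerator.Synchronize();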

Improved CPU runtime to support arbitrary Warp/Group/Multiprocessor configurations

The new CPU runtime significantly improves the existing CPUAccelerator runtime by adding support for user-defined warp, group and multiprocessor configurations. It changes the internal functionality to simulate a single warp of at least 2 threads (which ensures that all shuffle-based/reduction-like algorithms can also be run on the CPU by default). At the same time, each virtual multiprocessor can only execute a single thread group at a time. Increasing the number of virtual multiprocessors allows the user to simulate multiple concurrent groups. Most use cases will not require more than a single multiprocessor in practice.

Note that all device-wide static Grid/Group/Atomic/Warp classes are fully supported to debug/simulate all ILGPU kernels on the CPU.

Note that a custom warp size must be a multiple of 2.

This PR adds a new set of static creation methods:

  • CreateDefaultSimulator(...) which creates a CPUAccelerator instance with 4 threads per warp, 4 warps per multiprocessor and a single multiprocessor (MaxGroupSize = 16).
  • CreateNvidiaSimulator(...) which creates a CPUAccelerator instance with 32 threads per warp, 32 warps per multiprocessor and a single multiprocessor (MaxGroupSize = 1024).
  • CreateAMDSimulator(...) which creates a CPUAccelerator instance with 32 threads per warp, 8 warps per multiprocessor and a single multiprocessor (MaxGroupSize = 256).
  • CreateLegacyAMDSimulator(...) which creates a CPUAccelerator instance with 64 threads per warp, 4 warps per multiprocessor and a single multiprocessor (MaxGroupSize = 256).
  • CreateIntelSimulator(...) which creates a CPUAccelerator instance with 16 threads per warp, 8 warps per multiprocessor and a single multiprocessor (MaxGroupSize = 128).
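
As an illustration (not part of the release notes), a simulated platform can also be selected through the context builder shown earlier; assuming that CPUDeviceKind.AMD corresponds to the CreateAMDSimulator configuration listed above, the resulting warp and group limits can be inspected on the accelerator instance:

// Sketch: simulate an AMD-like CPU device and inspect its configuration.
using var context = Context.Create(builder => builder.CPU(CPUDeviceKind.AMD));
using var accelerator = context.CreateCPUAccelerator(0);

// Expected for the AMD simulator described above: WarpSize = 32, MaxNumThreadsPerGroup = 256.
Console.WriteLine($"WarpSize = {accelerator.WarpSize}");
Console.WriteLine($"MaxNumThreadsPerGroup = {accelerator.MaxNumThreadsPerGroup}");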

Furthermore, this PR adds support for advanced debugging features that enable a "sequential-like" execution mode. In this mode, each thread of a group will run sequentially one after another until it hits a synchronization barrier or exits the kernel function. This allows users to conveniently debug larger thread groups consisting of concurrent threads without switching to single-threaded execution. This behavior can be controlled via the newly added CPUAcceleratorMode enum:

    /// <summary>
    /// The accelerator mode to be used with the <see cref="CPUAccelerator"/>.
    /// </summary>
    public enum CPUAcceleratorMode
    {
        /// <summary>
        /// The automatic mode uses <see cref="Sequential"/> if a debugger is attached.
        /// It uses <see cref="Parallel"/> if no debugger is attached to the
        /// application.
        /// </summary>
        /// <remarks>
        /// This is the default mode.
        /// </remarks>
        Auto = 0,

        /// <summary>
        /// If the CPU accelerator uses a simulated sequential execution mechanism. This
        /// is particularly useful to simplify debugging. Note that different threads for
        /// distinct multiprocessors may still run in parallel.
        /// </summary>
        Sequential = 1,

        /// <summary>
        /// A parallel execution mode that runs all execution threads in parallel. This
        /// reduces processing time but makes it harder to use a debugger.
        /// </summary>
        Parallel = 2,
    }

By default, all CPUAccelerator instances use the automatic mode (CPUAcceleratorMode.Auto) that switches to a sequential execution model as soon as a debugger is attached to the application.

Note that threads in the scope of multiple multiprocessors may still run in parallel.

Special thanks

Special thanks to @76creates, @conghuiw, @deng0, @GPSnoopy, @jgiannuzzi, @Joey9801, @ljubon, @MoFtZ, @Nnelg, @nullandkale and @sucrose0413 for their contributions to this release in the form of code, feedback, ideas and proposals. Furthermore, we would like to thank the entire ILGPU community (especially @faruknane, @mikhail-khalizev, @MPSQUARK, @Ruberik, @Yey007 and @yuryGotham) for providing feedback, submitting issues and feature requests.

v1.0.0-rc3

2 years ago

This final release candidate is a preview of the upcoming ILGPU stable release with a frozen API surface/feature level. It includes performance improvements and several bug fixes including critical patches for the internal loop optimization phases and cross-device peer accesses (get the ILGPU Nuget package and ILGPU Algorithms Nuget package).

Breaking Changes

  • Refined the API for building custom Atomic implementations to overcome performance limitations (#667).

Changes

  • Added explicit conversion methods for ArrayView and ArrayView1D (#666).
  • Improved Atomics performance (#667).
  • Fixed issue with enabling IO operations (#694).
  • Fixed invalid peer-access functionality (#675).
  • Fixed invalid address-space inference in the presence of generic view-based casts (#670).
  • Fixed critical issues in LoopUnrolling phases (#653, #657, #661).
  • Fixed invalid thread configuration in CPUDevice and CPUMultiprocessor classes (#665).
  • Fixed missing NotInsideKernel attributes on MemSet functions (#651).
  • Fixed missing binding of the current accelerator in the scope of profiling markers (#644).
  • Fixed radix sort on floating point data types (#643).

Repository Changes

  • Polished readme, build and license information (#650, #655).
  • Updated samples to new Atomic function API (#667).

Major internal changes

  • Bumped several test dependency packages (#659, #662).
  • Bumped SourceLink dependencies to v1.1.1 (#689, #690).
  • Bumped T4.Build version to v0.2.3 (#685).
  • Added automatic skipping of specific CPU tests on macOS runners (#669).

Special thanks

Special thanks to @MoFtZ, @jgiannuzzi, @deng0 and @conghuiw for their contributions to this release in the form of code, feedback, ideas and proposals. Furthermore, we would like to thank the entire ILGPU community for providing feedback, submitting issues and feature requests.

Full Changelog: https://github.com/m4rs-mt/ILGPU/compare/v1.0.0-rc2...v1.0.0-rc3

v1.0.0-rc2

2 years ago

This new release candidate is a preview of the upcoming ILGPU stable release with a frozen API surface/feature level. It includes bug fixes, new features and refined ILGPU Index/Stride, ScanExtensions, RadixSortExtensions and CuBlas APIs (get the ILGPU Nuget package and ILGPU Algorithms Nuget package).

Breaking Changes

  • Refined Index1D|Index2D|Index3D|LongIndex1D|LongIndex2D|LongIndex3D type API surface: removed multidimensional index reconstruction methods.
  • Added new multidimensional index reconstruction methods to Stride1D|Stride2D|Stride3D types.
  • Moved bounds checking from ArrayView1D|ArrayView2D|ArrayView3D types to Stride1D|Stride2D|Stride3D types.
  • Refined CuBlas API to be compatible with stride information.
  • Refined Scan and RadixSort APIs to be compatible with stride information.

Changes

  • Updated Docs to include links to samples (#618).
  • Updated CuBlas interface to work on views with stride information (#631).
  • Updated Algorithms.Scan implementation to work on arbitrary stride types (#632).
  • Updated Algorithms.RadixSort implementation to work on arbitrary stride types (#637).
  • Fixed generated call to ValueType.GetHashCode (#617).
  • Fixed invalid alignment of dynamic shared memory allocations (#630).
  • Fixed OutOfResources errors when emitting code with debug assertions turned on using the Cuda backend (#628).
  • Fixed race condition in WarpReductions.Reduce for CPU accelerators (#627).
  • Refined index reconstruction methods and fixed element index assertions (#629).
  • Refined bounds checks of CUDA and OpenCL APIs (#619).
  • Improved hash code of index types to avoid copyright issues (#622).
  • Ensured the Cuda accelerator is bound before calling CuBlas methods (#624).
  • Improved runtime performance of the CPU accelerator launcher (#626).

Repository Changes

  • Removed obsolete .gitignore entries (#634).
  • Adjusted copyright headers (based on #598) (#625).

Major internal changes

  • Allow building net471 target without Windows (#616).

Special thanks

Special thanks to @MoFtZ, @nullandkale, @jgiannuzzi, @Joey9801, @lostmsu and @kilngod for their contributions to this release in the form of code, feedback, ideas and proposals. Furthermore, we would like to thank the entire ILGPU community for providing feedback, submitting issues and feature requests.

Full Changelog: https://github.com/m4rs-mt/ILGPU/compare/v1.0.0-rc1...v1.0.0-rc2

v1.0.0-rc1

2 years ago

This new release candidate is a preview of the upcoming ILGPU stable release with a frozen API surface/feature level. It includes bug fixes, a lot of amazing new features and improved samples and documentation (get the ILGPU Nuget package and ILGPU Algorithms Nuget package).

Notes

  • We updated the versions of the .Net dependencies (#576, #577, #578, #579, #580, #581, #582, #583, #586, #591, #595 and #601).
  • We updated the required .Net Framework version (from v4.7 to v4.7.1) to benefit from the most recent dependency updates (#595).
  • We updated the ILGPU documentation and all samples to be compatible with the latest preview releases (#584, #593, #600, #602).

Changes

  • Updated .Net Framework version from v4.7 to v4.7.1 (#594).
  • Added 1.0.0 pre-release documentation (#602).
  • Added sample about inline PTX assembly instructions (#588).
  • Added sample about monitoring progress on Cuda accelerators (#593).
  • Added sample project for printf-like output in kernels (#600).
  • Added sample project for debug asserts in kernels (#600).
  • Added sample project for removing consecutive duplicate values (#600).
  • Added sample project for calculating histograms (#600).
  • Added sample project for fixed sized buffers (#600).
  • Added support for zero-length subviews of zero-length views (#585).
  • Guard against zero-length (CUDA and CL) allocations to enable allocations of zero bytes (#547, #610).
  • Simplified naming of GetAsPageLockedArray and AllocatePageLockedArray (#608).
  • Fixed transformation issues regarding many functions in kernel modules (without inlining) (#613).
  • Fixed invalid detection and processing of loops consisting of a single entry block (#607).
  • Fixed invalid conversion of LFA values in SSAStructureConstruction (affects array optimizations, #605).

Repository Changes

  • Added information about symbols and source link to ReadMe file (#594).

Major internal changes

  • Removed obsolete configurations from solutions (#599).
  • Prepared conditional compilation for future .NET frameworks (#592).

Special thanks

Special thanks to @MoFtZ, @nullandkale, @Joey9801, @jgiannuzzi and @sucrose0413 for their contributions to this release in the form of code, feedback, ideas and proposals. Furthermore, we would like to thank the entire ILGPU community for providing feedback, submitting issues and feature requests.

v1.0.0-beta3

2 years ago

This new beta offers significant performance improvements to the generated kernel programs and includes a lot of amazing new features (get the ILGPU Nuget package and ILGPU Algorithms Nuget package).

Notes

  • We converted ILGPU into a monorepo project including ILGPU.Algorithms, ILGPU.Samples, the Wiki and enhanced documentation.
  • This version has some breaking changes compared to previous stable ILGPU versions (see also Release v1.0.0-beta1).

Changes

  • Promoted .NET 5 to a default target framework (#529, #536).
  • Added new Array processing pipeline to have full support for nD-arrays (#513).
  • Added convenience overloads for AsNDView (#571).
  • Added support for zero-length SubView operations (#550).
  • Added Backend optimizations for CPU backend to re-enable support for enhanced shared memory allocations (see #567) (#574).
  • Added support for Cuda ISA 7.3 and 7.4 to support all latest drivers (#566).
  • Added UCE transformation to the backend optimization passes (#569).
  • Added VS integration of check styles to all projects and fixed style checking (#517, #511).
  • Added CPU builder method to register custom CPU devices (#507).
  • Added support for chaining EnableAlgorithms on Context builder instances (#515).
  • Improved performance of all tests by enabling aggressive caching (#522).
  • Improved hash codes of IndexND and LongIndexND types (#510).
  • Changed InvalidEntryPointIndexParameterOfWrongType error message to be more descriptive (#535).
  • Changed T4 DllImportSearchPath to LegacyBehavior (#514).
  • Fixed constant folding when converting unsigned integers (#549).
  • Fixed critical issue when swapping registers/variables in backends (#541).
  • Fixed invalid copies from and to sub views (#523).
  • Fixed and enhanced Stride and ArrayView types (#509).
  • Fixed regression in single-pass scan when performing multiple iterations (#525).
  • Fixed RadixSortProvider and ScanProvider test cases (#516).
  • Removed obsolete properties and methods (#524).

Repository Changes

  • Merged ILGPU.Samples into ILGPU repository (#538, #561, #563, #564, #565, #568).
  • Merged ILGPU.Algorithms into ILGPU repository.
  • Merged ILGPU Wiki into ILGPU repository (#537).
  • Merged external ILGPU v0.10.1 documents (#546).

CI Changes

  • Added badges for versions and CI (#534).
  • Skipped publishing NuGet packages on forks (#533).
  • Enabled selective builds on macOS, master and tags (#530).
  • Fixed a NuGet publishing bug in the CI pipeline (#572).
  • Restricted the package CI job to run only once (#527).
  • Configured clean test runs on pushes to master or tags without using caches (#526).
  • Added support for releasing preview builds via feedz.io (#521, #520).

Major internal changes

  • Adapted CI for new ILGPU monorepo (#512).
  • Added missing struct type constraints (#532).
  • Applied general cleanup (#531).

Special thanks

Special thanks to @MoFtZ, @Joey9801, @jgiannuzzi, @nullandkale, @76creates, @Nnelg and @ljubon for their contributions to this release in the form of code, feedback, ideas and proposals. Furthermore, we would like to thank the entire ILGPU community for providing feedback, submitting issues and feature requests.

v1.0.0-beta2

2 years ago

This new beta offers significant performance improvements to the generated kernel programs and includes a lot of amazing new features (get the Nuget package).

Please note that this version has some breaking changes compared to previous ILGPU versions. Refer to the v1.0-beta1 summary for more information.

v1.0-beta1

2 years ago

This new beta offers significant performance improvements to the generated kernel programs and includes a lot of amazing new features (get the Nuget package).

Please note that this version has some breaking changes compared to previous ILGPU versions.

Breaking changes

  • The Memory API, involving the ArrayView and MemoryBuffer types, has been significantly improved to support explicit Stride information (see below).
  • All IndexX and LongIndexX types have been renamed to IndexXD and LongIndexXD to have a unified programming experience with respect to memory buffers and array views (see below).
  • The Device API has been redesigned to explicitly enable, filter and configure the available hardware accelerator devices (see below).

Changes

  • Added new Memory API to support explicit stride information (#421, #475, #483).
  • Added new Device API to enable, filter and configure the available hardware accelerator devices (#428).
  • Added support for OpenCL 3.0 API (#464).
  • Added support for inline PTX assembly instructions (#467).
  • Added support for multi-dimensional and static constant arrays (#479).
  • Added support for convenient profiling using ProfilingMarkers (#482).
  • Improved CPU runtime to support arbitrary Warp/Group/Multiprocessor configurations (#402, #484).
  • Improved error messages (#466).
  • Enabled folding of debug assertions in IRBuilder (#477).
  • Fixed Group helper methods for multi-dimensional kernels (#481).
  • Fixed invalid code generation of OpenCL kernels in the presence of constant switch conditions (#441).

Summary of the changes related to the new Memory API

The new API distinguishes between a coherent, strongly typed ArrayView<T> structure and its n-D versions ArrayViewXD<T, TStride>, which carry dimension-dependent stride information (The actual logic for computing element addresses is moved from the IndexXD types to the newly added StrideXD types). This allows developers to explicitly specify a particular stride of a view, reinterpret the data layout itself (by changing the stride), and perform compile-time optimizations based on explicitly typed stride information. Consequently, ILGPU's optimization pipeline is able to remove the overhead of these abstractions in most cases (except in rare use cases where strange-looking strides are used). It also makes all memory transfer-related operations explicit in terms of what memory layout the underlying data will have after an operation is performed.

In addition, it moves all copy-related methods to the ArrayView instances instead of exposing them on the memory buffers. This realizes a "separation of concerns": On the one hand, a MemoryBuffer holds a reference to the native memory area and controls its lifetime. On the other hand, ArrayView structures manage the contents of these buffers and make them available to the actual GPU kernels.

Example:

// Simple 1D allocation of 1024 longs with TStride = Stride1D.Dense (all elements are accessed contiguously in memory)
var t = accl.Allocate1D<long>(1024);

// Advanced 1D allocation of 1024 longs with TStride = Stride1D.General(2) (each memory access will skip 2 elements)
// -> allocates 1024 * 2 longs to be able to access all of them
var t1 = accl.Allocate1D<long, Stride1D.General>(1024, new Stride1D.General(2));

// Simple 1D allocation of 1024 longs using the array provided
var data1 = new long[1024];
var t2 = accl.Allocate1D(data1);

// Simple 2D allocation of 1024 * 1024 longs using the array provided with TStride = Stride2D.DenseX
// (all elements in X dimension are accessed contiguously in memory)
// -> this will *not* transpose the input buffer as the memory layout will be identical on CPU and GPU
var data2 = new long[1024, 1024];
var t3 = accl.Allocate2DDenseX(data2);

// Simple 2D allocation of 1024 * 1024 longs using the array provided, with TStride = Stride2D.DenseY
// (all elements in Y dimension are accessed contiguously in memory)
// -> this *will* transpose the input buffer to match the desired data layout
var data3 = new long[1024, 1024];
var t4 = accl.Allocate2DDenseY(data3);

The major changes/features of the new Memory API are:

  • Index1|Index2|Index3 types have been renamed to Index1D|Index2D|Index3D to match the naming scheme of ArrayViewXD and MemoryBufferXD types.
  • LongIndex1|LongIndex2|LongIndex3 types have been renamed to LongIndex1D|LongIndex2D|LongIndex3D to match the naming scheme of the ArrayViewXD and MemoryBufferXD types.
  • Separation of concerns between MemoryBuffer and ArrayView instances:
    • ArrayView... structures represent and manage the contents of buffers (or chunks of buffers).
    • MemoryBuffer... classes manage the lifetime of allocated memory chunks on a device.
  • The ILGPU.ArrayView intrinsic structure implements the newly added IContiguousArrayView interface that marks contiguous memory sections.
  • The ILGPU.Runtime.MemoryBuffer... classes implement the newly added IContiguousArrayView interface that marks contiguous memory sections.
  • Types implementing the IContiguousArrayView interface provide extension methods for initializing, copying from and to the memory region (not supported on accelerators).
  • This PR adds the notion of Strides. ILGPU contains built-in common strides for 1D, 2D and 3D views.
    • Stride1D.Dense represents contiguous chunks of memory that pack elements side by side.
    • Stride1D.General represents strides that skip a certain number of elements.
    • Stride2D.DenseX represents 2D strides that pack elements side by side in dimension X (transfers from and to views with this stride involve transpose operations).
    • Stride2D.DenseY represents 2D strides that pack elements in the Y dimension side by side.
    • Stride2D.General represents strides that skip a certain number of elements in the X and Y dimensions.
    • Stride3D.DenseXY represents 3D strides that pack elements in the X,Y dimension side by side (transfers from and to views with this stride involve transposition operations).
    • Stride3D.DenseZY represents 3D strides that pack elements in the Z,Y dimension side by side.
    • Stride3D.General represents strides that omit a certain number of elements in the X, Y and Z dimensions.
  • All ArrayViewXD types have been moved to the ILGPU.Runtime namespace.
  • None of the ArrayViewXD types implement IContiguousArrayView, as they support arbitrary stride information. Note that the ArrayView1D<T, Stride1D.Dense> specialization has an implicit conversion to ArrayView<T> (and vice versa) for convenience.
  • All CopyFromCPU and CopyToCPU methods are provided with additional hints as to whether they are transposing the input elements or keeping the original layout.
  • Note that GetAsXDArray(...) always returns elements in .Net standard layout for 1D, 2D and 3D arrays (this may result in transposing the input elements of the buffer on the CPU). Use view.AsContiguous().GetAsArray() to get the memory layout of the input buffer.
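
A minimal 2D sketch (not taken from the release notes; the extent-based Allocate2DDenseX overload and the IntExtent property are assumed here) showing a kernel that works directly on an explicitly strided 2D view:

// Kernel operating on a 2D view with an explicit Stride2D.DenseX stride type.
static void AddOneKernel(Index2D index, ArrayView2D<float, Stride2D.DenseX> view) =>
    view[index] += 1.0f;

using var context = Context.CreateDefault();
using var accelerator = context.CreateCPUAccelerator(0);

using var buffer = accelerator.Allocate2DDenseX<float>(new Index2D(512, 256));
var kernel = accelerator
    .LoadAutoGroupedStreamKernel<Index2D, ArrayView2D<float, Stride2D.DenseX>>(AddOneKernel);
kernel(buffer.View.IntExtent, buffer.View);
accelerator.Synchronize();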

Summary of the changes related to the new Device API

The new Device API removes the ContextFlags enumeration and implements the same functionality in an object-oriented way using a Context.Builder class. It offers a fluent-API-like configuration interface that makes it easy to set up a context:

// Enables all supported accelerators (default CPU accelerator only) and puts the context
// into auto-assertion mode via "AutoAssertions()". In other words, if a debugger is attached,
// the `Context` instance will turn on all assertion checks. This behavior is identical
// to the current implementation via new Context();
using var context = Context.CreateDefault();

// Turns on O2 and enables all compatible Cuda devices.
using var context = Context.Create(builder =>
{
    builder.Optimize(OptimizationLevel.O2).Cuda();
});

// Turns on all assertions, enables the IR verifier and enables all compatible OpenCL devices.
using var context = Context.Create(builder =>
{
    builder.Assertions().Verify().OpenCL();
});

// Turns on kernel source-line annotations, fast math using 32-bit float and enables
// *all* (even incompatible) OpenCL devices.
using var context = Context.Create(builder =>
{
    builder
        .DebugSymbols(DebugSymbolsMode.KernelSourceAnnotations)
        .Math(MathMode.Fast32BitOnly)
        .OpenCL(device => true);
});

// Selects an OpenCL device with a warp size of at least 32:
using var context = Context.Create(builder =>
{
    builder.OpenCL(device => device.WarpSize >= 32);
});

// Turns on all assertions in debug mode (same behavior like calling CreateDefault()):
using var context = Context.Create(builder =>
{
    builder.AutoAssertions();
});

// Turns on debug optimizations (level O0) and all assertions if a debugger is attached:
using var context = Context.Create(builder =>
{
    builder.AutoDebug();
});

// Turns on debug mode (optimization level O0, assertions and kernel debug information):
using var context = Context.Create(builder =>
{
    builder.Debug();
});

// Disable caching, enable conservative inlining and inline mutable static field values:
using var context = Context.Create(builder =>
{
    builder
        .Caching(CachingMode.Disabled)
        .Inlining(InliningMode.Conservative)
        .StaticFields(StaticFieldMode.MutableStaticFields);
});

// Turn on *all* CPU accelerators that simulate different hardware platforms:
using var context = Context.Create(builder => builder.CPU());

// Turn on an AMD-based CPU accelerator:
using var context = Context.Create(builder => builder.CPU(CPUDeviceKind.AMD));

Note that, by default, all debug symbols are automatically turned off unless a debugger is attached. If you want to turn on debug information in all cases, call builder.DebugSymbols(DebugSymbolsMode.Basic). At the same time, this PR introduces the notion of a Device, which replaces the implementation of AcceleratorId. This allows us to query detailed device information without explicitly instantiating an accelerator:

// Print all device information without instantiating a single accelerator
// (device context) instance.
using var context = Context.Create(...);
foreach (var device in context)
{
    // Print detailed accelerator information
    device.PrintInformation();

    // ...
}

Note that we removed the ability to call the accelerator constructors (e.g. new CudaAccelerator(...)) directly. Either use the CreateAccelerator methods defined in the Device classes or use one of the extension methods like CreateCudaAccelerator(...) of the Context class itself:

using var context = Context.Create(...);
foreach (var device in context)
{
    // Instantiate an accelerator instance on this device
    using Accelerator accel = device.CreateAccelerator();
    // ...
}

// Instantiate the 2nd Cuda accelerator (NOTE that this is the *2nd* Cuda device
// and *not* the 2nd device of your machine).
using CudaAccelerator cudaDevice = context.CreateCudaAccelerator(1);

// Instantiate the 1st OpenCL accelerator (NOTE that this is the *1st* OpenCL device
// and *not* the 1st device of your machine).
using CLAccelerator clDevice = context.CreateOpenCLAccelerator(0);

Context properties that expose types from other (ILGPU-internal) namespaces, which cannot or should not be covered by the API/ABI guarantees we want to give, have been made internal. To access these properties, use one of the available extension methods located in the corresponding namespaces:

using var context = ...

// OLD way
var internalIRContext = context.IRContext;

// NEW way:
// using namespace ILGPU.IR;
var internalIRContext = context.GetIRContext();

Improved CPU runtime to support arbitrary Warp/Group/Multiprocessor configurations

The new CPU runtime significantly improves the existing CPUAccelerator runtime by adding support for user-defined warp, group and multiprocessor configurations. It changes the internal functionality to simulate a single warp of at least 2 threads (which ensures that all shuffle-based/reduction-like algorithms can also be run on the CPU by default). At the same time, each virtual multiprocessor can only execute a single thread group at a time. Increasing the number of virtual multiprocessors allows the user to simulate multiple concurrent groups. Most use cases will not require more than a single multiprocessor in practice.

Note that all device-wide static Grid/Group/Atomic/Warp classes are fully supported to debug/simulate all ILGPU kernels on the CPU.

Note that a custom warp size must be a multiple of 2.

This PR adds a new set of static creation methods:

  • CreateDefaultSimulator(...) which creates a CPUAccelerator instance with 4 threads per warp, 4 warps per multiprocessor and a single multiprocessor (MaxGroupSize = 16).
  • CreateNvidiaSimulator(...) which creates a CPUAccelerator instance with 32 threads per warp, 32 warps per multiprocessor and a single multiprocessor (MaxGroupSize = 1024).
  • CreateAMDSimulator(...) which creates a CPUAccelerator instance with 32 threads per warp, 8 warps per multiprocessor and a single multiprocessor (MaxGroupSize = 256).
  • CreateLegacyAMDSimulator(...) which creates a CPUAccelerator instance with 64 threads per warp, 4 warps per multiprocessor and a single multiprocessor (MaxGroupSize = 256).
  • CreateIntelSimulator(...) which creates a CPUAccelerator instance with 16 threads per warp, 8 warps per multiprocessor and a single multiprocessor (MaxGroupSize = 128).
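
For illustration (not part of the release notes), the following sketch launches an explicitly grouped kernel that uses the static Atomic and Group classes on the CPU accelerator; the group size of 16 matches the MaxGroupSize of the default simulator listed above:

// Explicitly grouped kernel using device-wide static classes; it runs unchanged on the CPU.
static void CountKernel(ArrayView<int> counter)
{
    // Every thread of every group increments the counter atomically.
    Atomic.Add(ref counter[0], 1);
    Group.Barrier();
}

using var context = Context.CreateDefault();
using var accelerator = context.CreateCPUAccelerator(0);

using var counter = accelerator.Allocate1D<int>(1);
counter.View.CopyFromCPU(new int[1]);

var kernel = accelerator.LoadStreamKernel<ArrayView<int>>(CountKernel);
kernel(new KernelConfig(4, 16), counter.View); // 4 groups of 16 threads each
accelerator.Synchronize();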

Furthermore, this PR adds support for advanced debugging features that enable a "sequential-like" execution mode. In this mode, each thread of a group will run sequentially one after another until it hits a synchronization barrier or exits the kernel function. This allows users to conveniently debug larger thread groups consisting of concurrent threads without switching to single-threaded execution. This behavior can be controlled via the newly added CPUAcceleratorMode enum:

    /// <summary>
    /// The accelerator mode to be used with the <see cref="CPUAccelerator"/>.
    /// </summary>
    public enum CPUAcceleratorMode
    {
        /// <summary>
        /// The automatic mode uses <see cref="Sequential"/> if a debugger is attached.
        /// It uses <see cref="Parallel"/> if no debugger is attached to the
        /// application.
        /// </summary>
        /// <remarks>
        /// This is the default mode.
        /// </remarks>
        Auto = 0,

        /// <summary>
        /// If the CPU accelerator uses a simulated sequential execution mechanism. This
        /// is particularly useful to simplify debugging. Note that different threads for
        /// distinct multiprocessors may still run in parallel.
        /// </summary>
        Sequential = 1,

        /// <summary>
        /// A parallel execution mode that runs all execution threads in parallel. This
        /// reduces processing time but makes it harder to use a debugger.
        /// </summary>
        Parallel = 2,
    }

By default, all CPUAccelerator instances use the automatic mode (CPUAcceleratorMode.Auto) that switches to a sequential execution model as soon as a debugger is attached to the application.

Note that threads in the scope of multiple multiprocessors may still run in parallel.

Major internal changes

  • Added build support for .NET 5.0 (#446).
  • Added support for T4.Build to automatically transform T4 text templates during build (#431).
  • Restrict net47 unit tests to only run on CI builds (#465).
  • Avoid duplicate CI runs for pull requests made from the same repo (#485).
  • Updated InlineList implementation to reduce memory consumption (#478).
  • Fixed invalid assertion affecting successor blocks in frontend (#445).

Special thanks

Special thanks to @MoFtZ, @Joey9801, @jgiannuzzi and @GPSnoopy for their contributions to this release in the form of code, feedback, ideas and proposals. Furthermore, we would like to thank the entire ILGPU community (especially @MPSQUARK, @Nnelg, @Ruberik, @Yey007, @faruknane, @mikhail-khalizev, @nullandkale and @yuryGotham) for providing feedback, submitting issues and feature requests.