CLBlast Versions

Tuned OpenCL BLAS

3 months ago

CLBlast version 1.6.2. Changes since previous release (version 1.6.1):

Fix a bug in the pre-processor that would cause issues on Arm GPUs
Fix DLL install directory in mingw
Modifications to the Python bindings (pyclblast)
- Convert float scalar values to cl_half for fp16 routines
- Amax/amin, max/min routines accept unsigned integer buffers for index
- Switch to pyproject.toml file for installing Python bindings
- Build Python bindings using CMake, adding Windows support
Generator script now always uses LF line endings, independent of the platform
Added tuned parameters for many devices (see doc/tuning.md)
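
The fp16 scalar conversion listed above can be pictured with the Python standard library alone. This is a minimal sketch, assuming cl_half holds an IEEE 754 binary16 bit pattern in 16 bits; the helper names float_to_half_bits and half_bits_to_float are hypothetical and are not part of pyclblast.

```python
import struct


def float_to_half_bits(value: float) -> int:
    """Round `value` to half precision and return its 16-bit pattern."""
    # 'e' packs an IEEE 754 binary16 value; 'H' reads the same two bytes
    # back as an unsigned 16-bit integer.
    return struct.unpack("<H", struct.pack("<e", value))[0]


def half_bits_to_float(bits: int) -> float:
    """Inverse of float_to_half_bits: decode a 16-bit pattern to a float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


if __name__ == "__main__":
    print(hex(float_to_half_bits(1.0)))   # 0x3c00
    print(hex(float_to_half_bits(-2.0)))  # 0xc000
```

Values that are not exactly representable in half precision are rounded, so a round trip through the two helpers does not always return the original float.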

10 months ago

CLBlast version 1.6.1. Changes since previous release (version 1.6.0):

11 months ago

CLBlast version 1.6.0. Changes since previous release (version 1.5.3):

Improved performance on Qualcomm Adreno GPUs:
- Unique database entries for specific Adreno devices
- Toggle OpenCL kernel compilation options for Adreno
- New preprocessor directive RELAX_WORKGROUP_SIZE
Fixed a bug in handling of #undef in CLBlast loop unrolling and array-to-register mapping functions
Fixed a bug in XAMAX/XAMIN routines related to inadvertently including the increment and offset in the result
Fixed a bug in XAMAX/XAMIN routines that would cause only the real part of a complex number to be taken into account
Fixed a bug that caused tests to not properly do integer-output testing (for XAMAX/XAMIN)
Fixed a minor issue with the expected input buffer size in the TRMV/TBMV/TPMV/TRSV routines
Fixed an issue with crashes on Android related to calling clReleaseProgram
Fixed two small issues in the plotting script
Fixed a documentation bug in the 'ld' requirements
Enabled GitHub Actions CI builds for testing and releasing
Various minor fixes and enhancements
Added tuned parameters for various devices (see doc/tuning.md)
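
The two XAMAX/XAMIN fixes above are easier to follow against a reference implementation. The sketch below is hypothetical and is not CLBlast code. It assumes the reference-BLAS convention of measuring a complex element by |re| + |im|, returns a 0-based index, and counts that index within the logical vector, so the offset and increment affect which elements are read but not the value returned.

```python
def iamax_reference(n, x, x_offset=0, x_inc=1):
    """Index of the first element with the largest magnitude.

    The index is relative to the logical n-element vector, not to the
    position inside the underlying buffer `x`.
    """
    best_index = 0
    best_magnitude = -1.0
    for i in range(n):
        element = x[x_offset + i * x_inc]
        # Use both the real and the imaginary part; for real inputs the
        # imaginary part is simply zero.
        magnitude = abs(element.real) + abs(element.imag)
        # Strict comparison keeps the first index when values tie.
        if magnitude > best_magnitude:
            best_index = i
            best_magnitude = magnitude
    return best_index


if __name__ == "__main__":
    buffer = [9.0, 1.0, 0.0, -5.0, 0.0, 3.0]
    # Logical vector is [1.0, -5.0, 3.0]; the answer is 1, not buffer index 3.
    print(iamax_reference(3, buffer, x_offset=1, x_inc=2))
```

Looking only at the real part would pick 3+0j over 2+2j, which is the kind of wrong answer the complex-number fix addresses.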

1 year ago

CLBlast version 1.5.3. Changes since previous release (version 1.5.2):

Fix a correctness issue with DGEMM on SM 7.5 Turing GPUs
Update cl.hpp to the new opencl.hpp header in the samples
Changed the complex sum routine to return the complex sum instead of the absolute complex sum
Various minor fixes and enhancements
Added tuned parameters for various devices (see doc/tuning.md)
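
The change to the complex sum routine can be illustrated by comparing the two quantities on a small vector. This is a sketch only: both function names are hypothetical, and the "absolute" variant assumes the BLAS-style |re| + |im| measure per element.

```python
def complex_sum(values):
    """Plain sum of the elements; the result is itself complex."""
    total = 0j
    for value in values:
        total += value
    return total


def absolute_complex_sum(values):
    """Sum of |re| + |im| over the elements; the result is real."""
    return sum(abs(value.real) + abs(value.imag) for value in values)


if __name__ == "__main__":
    data = [1 + 2j, -1 - 2j, 3 - 1j]
    print(complex_sum(data))           # (3-1j)
    print(absolute_complex_sum(data))  # 10.0
```

The two results differ whenever elements cancel each other or carry a sign.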

3 years ago

CLBlast version 1.5.2. Changes since previous release (version 1.5.1):

Changed XAMAX/XAMIN to be more likely to return the first rather than the last min/max index; updated the API docs
Added batched routines to pyclblast
Added CLBLAST_VERSION_MAJOR/MINOR/PATCH defines in headers to store version numbering
Several small improvements to the benchmark script (thanks to 'baryluk')
Fixed a bug in the caching when using a context with multiple devices
Fixed a bug in the tuners related to global workgroup size not being a multiple of the local
Various minor fixes and enhancements
Added tuned parameters for various devices (see doc/tuning.md)
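
The tuner fix above concerns an OpenCL constraint: unless non-uniform work-groups are available (OpenCL 2.0 and later), the global work size in each dimension must be evenly divisible by the local work size. A common remedy is to round the global size up to the next multiple and guard the extra work-items inside the kernel. The helper below is a hypothetical sketch of that rounding, not the tuner's actual code.

```python
def round_up_to_multiple(value: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is greater than or equal to `value`."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    return ((value + multiple - 1) // multiple) * multiple


if __name__ == "__main__":
    local_size = 64
    for problem_size in (1000, 1024, 1):
        global_size = round_up_to_multiple(problem_size, local_size)
        print(problem_size, "->", global_size)
```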

4 years ago

CLBlast version 1.5.1. Changes since previous release (version 1.5.0):

5 years ago

CLBlast version 1.5.0. Changes since previous release (version 1.4.1):

Added support for shuffle instructions for NVIDIA GPUs (thanks to 'tyler-utah')
Added an option to compile the Netlib API with static OpenCL device and context (-DNETLIB_PERSISTENT_OPENCL=ON)
Added a FAQ page to the documentation
The tuners now check beforehand for invalid local thread sizes and skip those completely
Made the tuning API (OverrideParameters) more flexible, disregarding superfluous parameters
Fixed an issue with conjugate transpose not being executed in certain cases for XOMATCOPY, among others
Fixed an issue with AMD GPUs and the new GEMMK == 1 kernel
Fixed an issue with the preprocessor and the new GEMMK == 1 kernel
Fixed an issue for unequal MWG and NWG and the new GEMMK == 1 kernel
Fixed an issue for certain parameters for AXPY's 'XaxpyFaster' kernel
Various minor fixes and enhancements
Added non-BLAS routines:
- SCONVGEMM/DCONVGEMM/HCONVGEMM (convolution as im2col followed by batched GEMM)
- SCOL2IM/DCOL2IM/CCOL2IM/ZCOL2IM/HCOL2IM (col2im transform as used in machine learning)
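
The idea behind "convolution as im2col followed by GEMM" can be shown in pure Python. The sketch below is illustrative and is not the CLBlast implementation. It assumes a single image with one channel, stride 1, no padding, and cross-correlation (the kernel is not flipped); the batched routines repeat the same steps per image. All function names are hypothetical.

```python
def im2col(image, kernel_h, kernel_w):
    """Unfold `image` into a matrix with one row per kernel position
    and one column per output pixel."""
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    rows = []
    for kh in range(kernel_h):
        for kw in range(kernel_w):
            rows.append([image[oh + kh][ow + kw]
                         for oh in range(out_h)
                         for ow in range(out_w)])
    return rows, out_h, out_w


def matmul(a, b):
    """Plain matrix product of two lists of rows (stands in for GEMM)."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def conv2d_via_gemm(image, kernel):
    """2D cross-correlation computed as flat_kernel x im2col(image)."""
    columns, out_h, out_w = im2col(image, len(kernel), len(kernel[0]))
    flat_kernel = [[value for row in kernel for value in row]]
    flat_output = matmul(flat_kernel, columns)[0]
    # Fold the flat result back into an out_h x out_w image.
    return [flat_output[r * out_w:(r + 1) * out_w] for r in range(out_h)]


if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    kernel = [[1, 0], [0, 1]]
    print(conv2d_via_gemm(image, kernel))  # [[6, 8], [12, 14]]
```

The unfolded matrix duplicates input pixels, trading extra memory for the ability to reuse a tuned matrix-multiplication kernel.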

5 years ago

CLBlast version 1.4.1 (bugfix release). Changes since previous release (version 1.4.0):

Fixed an access violation under Windows upon releasing the OpenCL program when the driver is already unloaded
Fixed an issue with double cl_program release in the CLBlast caching system
Added tuned parameters for various devices (see doc/tuning.md)

5 years ago

CLBlast version 1.4.0. Changes since previous release (version 1.3.0):

Added a Python interface to CLBlast, 'PyCLBlast'
Added CLBlast to Ubuntu PPA and macOS Homebrew package managers
Added an API to run the tuners programmatically without any I/O
Improved the performance potential by adding a second tunable GEMM kernel with 2D register tiling
Added support for Intel-specific subgroup shuffling extensions for faster GEMM on Intel GPUs
Re-added a local memory size constraint to the tuners
The routine tuners now automatically pick up tuning results from disk from the kernel tuners
Updated and reorganised the CLBlast documentation
Added a 'canary' region to check for overflows in the tuner and tests (inspired by clARMOR)
Added an option to test against and compare performance with Intel's MKL
Fixed an access violation when compiled with Visual Studio upon releasing the OpenCL program
Fixed incorrect releasing of the OpenCL program resulting in segfaults / access violations
Various minor fixes and enhancements
Added tuned parameters for various devices (see doc/tuning.md)
Added non-BLAS level-1 routines:
- SHAD/DHAD/CHAD/ZHAD/HHAD (Hadamard element-wise vector-vector product)
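
For reference, a Hadamard product multiplies two vectors element by element. The sketch below assumes the scaled form z = alpha * x * y + beta * z; that form and the function name are assumptions made for illustration and should be checked against the CLBlast API documentation.

```python
def hadamard_reference(alpha, x, y, beta, z):
    """Return alpha * x[i] * y[i] + beta * z[i] for every index i."""
    if not (len(x) == len(y) == len(z)):
        raise ValueError("x, y and z must have the same length")
    return [alpha * xi * yi + beta * zi for xi, yi, zi in zip(x, y, z)]


if __name__ == "__main__":
    result = hadamard_reference(2.0, [1, 2, 3], [4, 5, 6], 1.0, [1, 1, 1])
    print(result)  # [9.0, 21.0, 37.0]
```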

6 years ago

CLBlast version 1.3.0. Changes since previous release (version 1.2.0):

Re-designed and integrated the auto-tuner, no more dependency on CLTune
Made it possible to override the tuning parameters in the clients straight from JSON tuning files
Added OpenCL pre-processor to unroll loops and perform array-to-register promotions for compilers which don't do this themselves (ARM Mali) - greatly improves performance on these platforms
Added first tuners for the TRSV (block size) and TRSM (invert kernel) routines
Added an optional argument to the GEMM routine to provide a pre-allocated temporary buffer
Fixed an issue with a crashing/hanging AMD APP compiler with the TRSM routine (invert kernel)
Improved compilation time by splitting the tuning database into multiple compilation units
Various minor fixes and enhancements
Added tuned parameters for various devices (see README)
Added the RetrieveParameters function to the API to be able to inspect the tuning parameters
Added a strided-batched (not part of the BLAS standard) routine, faster but less generic compared to the existing xGEMMBATCHED routines:
- SGEMMSTRIDEDBATCHED/DGEMMSTRIDEDBATCHED/CGEMMSTRIDEDBATCHED/ZGEMMSTRIDEDBATCHED/HGEMMSTRIDEDBATCHED
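
The difference between the strided-batched and the generic batched interface comes down to how each matrix in the batch is located. In the sketch below, the generic variant takes an explicit offset per matrix, while the strided variant derives each offset as batch index times a fixed stride. This is a hypothetical pure-Python illustration, not the CLBlast API: it uses row-major, tightly packed matrices and omits alpha, beta, transposes and leading dimensions.

```python
def gemm_at(m, n, k, a, a_off, b, b_off, c, c_off):
    """C = A * B for one matrix triple stored at the given buffer offsets."""
    for i in range(m):
        for j in range(n):
            c[c_off + i * n + j] = sum(
                a[a_off + i * k + p] * b[b_off + p * n + j] for p in range(k))


def gemm_batched(m, n, k, a, a_offsets, b, b_offsets, c, c_offsets):
    """Generic batching: every matrix may live at an arbitrary offset."""
    for a_off, b_off, c_off in zip(a_offsets, b_offsets, c_offsets):
        gemm_at(m, n, k, a, a_off, b, b_off, c, c_off)


def gemm_strided_batched(m, n, k, a, a_stride, b, b_stride, c, c_stride,
                         batch_count):
    """Strided batching: offsets are implied by a fixed stride per buffer."""
    for batch in range(batch_count):
        gemm_at(m, n, k, a, batch * a_stride, b, batch * b_stride,
                c, batch * c_stride)


if __name__ == "__main__":
    a = [1, 2, 3, 4]  # two 1x2 matrices
    b = [5, 6, 7, 8]  # two 2x1 matrices
    c = [0, 0]        # two 1x1 results
    gemm_strided_batched(1, 1, 2, a, 2, b, 2, c, 1, batch_count=2)
    print(c)  # [17, 53]
```

Because the strided variant needs only three integers instead of three offset arrays, it is less flexible but avoids passing per-batch metadata.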