ArrayFire Versions

ArrayFire: a general purpose GPU library.

v3.9.0

7 months ago

Improvements

  • Add oneAPI backend #3296
  • Add support to directly access arrays on other devices #3447
  • Add asynchronous reduce all functions that return an af_array #3199
  • Add broadcast support #2871
  • Improve OpenCL CPU JIT performance #3257 #3392
  • Optimize thread/block calculations of several kernels #3144
  • Add support for fast math compilation when building ArrayFire #3334 #3337
  • Optimize performance of fftconvolve when using floats #3338
  • Add support for CUDA 12.1 and 12.2
  • Better handling of empty arrays #3398
  • Better handling of memory in linear algebra functions in OpenCL #3423
  • Better logging with JIT kernels #3468
  • Optimize memory manager/JIT interactions for small number of buffers #3468
  • Documentation improvements #3485
  • Optimize reorder function #3488

Fixes

  • Improve error messages when creating OpenCL contexts from devices #3257
  • Improvements to vcpkg builds #3376 #3476
  • Fix reduce by key when NaNs are present #3261
  • Fix error in convolve where the ndims parameter was forced to be equal to 2 #3277
  • Make constructors that accept dim_t explicit to avoid invalid conversions #3259
  • Fix error in randu when compiling against clang 14 #3333
  • Fix bug in OpenCL linear algebra functions #3398
  • Fix bug with thread local variables when device was changed #3420 #3421
  • Fix bug in qr related to uninitialized memory #3422
  • Fix bug in shift where the array had an empty middle dimension #3488

Contributions

Special thanks to our contributors: Willy Born, Mike Mullen

v3.8.3

1 year ago

Improvements

  • Add support for CUDA 12 #3352
  • Modernize documentation style and content #3351
  • memcpy performance improvements #3144
  • JIT performance improvements #3144
  • join performance improvements #3144
  • Improve support for Intel and newer Clang compilers #3334
  • CCache support on Windows #3257

Fixes

  • Fix OpenCL kernel generation issue with some locales #3294
  • Internal improvements
  • Fix leak in clfft on exit
  • Fix some cases where ndims was incorrectly used to calculate shape #3277
  • Fix issue when setDevice was not called in new threads #3269
  • Restrict initializer list to just fundamental types #3264

Contributions

Special thanks to our contributors: Carlo Cabrera, Guillaume Schmid, Willy Born, ktdq

v3.8.2

1 year ago

Improvements

  • Optimize JIT by removing some consecutive cast operations #3031
  • Add driver checks for CUDA 11.5 and 11.6 #3203
  • Improve the timing algorithm used for timeit #3185
  • Dynamically link against CUDA numeric libraries by default #3205
  • Add support for pruning CUDA binaries to reduce static binary sizes #3234 #3237
  • Remove unused cuDNN libraries from installations #3235
  • Add support to statically link NVRTC libraries after CUDA 11.5 #3236
  • Add support for compiling with ccache when building the CUDA backend #3241

Fixes

  • Fix issue with consecutive moddims operations in the CPU backend #3232
  • Better floating point comparisons for tests #3212
  • Fix several warnings and inconsistencies with doxygen and documentation #3226
  • Fix issue when passing empty arrays into join #3211
  • Fix default value for the AF_COMPUTE_LIBRARY when not set #3228
  • Fix missing symbol issue when MKL is statically linked #3244
  • Remove linking of OpenCL's library to the unified backend #3244

Contributions

Special thanks to our contributors: Jacob Kahn, Willy Born

v3.8.1

2 years ago

Improvements

  • moddims now uses a JIT approach for certain special cases - #3177
  • Embed Version Info in Windows DLLs - #3025
  • OpenCL device max parameter is now queried from device properties - #3032
  • JIT Performance Optimization: Unique funcName generation sped up - #3040
  • Improved readability of log traces - #3050
  • Use short function name in non-debug build error messages - #3060
  • SIFT/GLOH are now available as part of website binaries - #3071
  • Short-circuit zero elements case in detail::copyArray backend function - #3059
  • Speedup of kernel caching mechanism - #3043
  • Add short-circuit check for empty Arrays in JIT evalNodes - #3072
  • Performance optimization of indexing using dynamic thread block sizes - #3111
  • Starting with this release, ArrayFire uses the Intel MKL single dynamic library, which resolves many linking issues the unified library had when user applications used MKL themselves - #3120
  • Add shortcut check for zero elements in af_write_array - #3130
  • Speedup join by eliminating temp buffers for cascading joins - #3145
  • Added batch support for solve - #1705
  • Use pinned memory to copy device pointers in CUDA solve - #1705
  • Added package manager instructions to docs - #3076
  • CMake Build Improvements - #3027 , #3089 , #3037 , #3072 , #3095 , #3096 , #3097 , #3102 , #3106 , #3105 , #3120 , #3136 , #3135 , #3137 , #3119 , #3150 , #3138 , #3156 , #3139 , #1705 , #3162
  • CPU backend improvements - #3010 , #3138 , #3161
  • CUDA backend improvements - #3066 , #3091 , #3093 , #3125 , #3143 , #3161
  • OpenCL backend improvements - #3091 , #3068 , #3127 , #3010 , #3039 , #3138 , #3161
  • General(including JIT) performance improvements across backends - #3167
  • Testing improvements - #3072 , #3131 , #3151 , #3141 , #3153 , #3152 , #3157 , #1705 , #3170 , #3167
  • Update CLBlast to latest version - #3135 , #3179
  • Improved Otsu threshold computation helper in canny algorithm - #3169
  • Modified default parameters for fftR2C and fftC2R C++ API from 0 to 1.0 - #3178
  • Use appropriate MKL getrs_batch_strided API based on MKL Versions - #3181

Fixes

  • Fixed a bug in JIT kernel disk caching - #3182
  • Fixed stream used by thrust(CUDA backend) functions - #3029
  • Added workaround for a new cuSparse API that was introduced by CUDA in a patch release - #3057
  • Fixed const array indexing inside gfor - #3078
  • Handle zero elements in copyData to host - #3059
  • Fixed double free regression in OpenCL backend - #3091
  • Fixed an infinite recursion bug in NaryNode JIT Node - #3072
  • Added missing input validation check in sparse-dense arithmetic operations - #3129
  • Fixed bug in getMappedPtr in OpenCL due to invalid lambda capture - #3163
  • Fixed bug in getMappedPtr on Arrays that are not ready - #3163
  • Fixed edgeTraceKernel for CPU devices on OpenCL backend - #3164
  • Fixed windows build issue(s) with VS2019 - #3048
  • API documentation fixes - #3075 , #3076 , #3143 , #3161
  • CMake Build Fixes - #3088
  • Fixed the tutorial link in README - #3033
  • Fixed function name typo in timing tutorial - #3028
  • Fixed couple of bugs in CPU backend canny implementation - #3169
  • Fixed the reference count of arrays used in JIT operations. This relates to ArrayFire's internal memory bookkeeping; the behavior and accuracy of existing ArrayFire code were not affected, but the reference count is now optimal in these scenarios, which may reduce memory usage in some narrow cases - #3167
  • Added assert that checks if topk is called with a negative value for k - #3176
  • Fixed an issue where countByKey would give incorrect results for any n > 128 - #3175

Contributions

Special thanks to our contributors: HO-COOH, Willy Born, Gilad Avidov, Pavan Yalamanchili

v3.8.0

3 years ago

New Functions

  • Ragged max reduction - #2786
  • Initialization list constructor for array class - #2829 , #2987
  • New API for the following statistics functions: cov, var, and stdev - #2986
  • Bit-wise operator support for array and C API (af_bitnot) - #2865
  • allocV2 and freeV2 which return cl_mem on OpenCL backend - #2911
  • Move constructor and move assignment operator for Dim4 class - #2946

Improvements

  • Add f16 support for histogram - #2984
  • Update confidence connected components example for better illustration - #2968
  • Enable disk caching of OpenCL kernel binaries - #2970
  • Refactor extension of kernel binaries stored to disk .bin - #2970
  • Add minimum driver versions for CUDA toolkit 11 in internal map - #2982
  • Improve warning messages from run-time kernel compilation functions - #2996

Fixes

  • Fix bias factor of variance in var_all and cov functions - #2986
  • Fix a race condition in confidence connected components function for OpenCL backend - #2969
  • Safely ignore disk cache failures in CUDA backend for compiled kernel binaries - #2970
  • Fix randn by passing in correct values to Box-Muller - #2980
  • Fix rounding issues in Box-Muller function used for RNG - #2980
  • Fix problems in RNG for older compute architectures with fp16 - #2980 #2996
  • Fix performance regression of approx functions - #2977
  • Remove assert requiring signal/filter types to be the same - #2993
  • Fix checkAndSetDevMaxCompute when the device cc is greater than max - #2996
  • Fix documentation errors and warnings - #2973 , #2987
  • Add missing opencl-arrayfire interoperability functions in unified backend - #2981

Contributions

Special thanks to our contributors: P. J. Reed

v3.7.3

3 years ago

Improvements

  • Add f16 support for histogram - #2984
  • Update confidence connected components example for better illustration - #2968
  • Enable disk caching of OpenCL kernel binaries - #2970
  • Refactor extension of kernel binaries stored to disk .bin - #2970
  • Add minimum driver versions for CUDA toolkit 11 in internal map - #2982
  • Improve warning messages from run-time kernel compilation functions - #2996

Fixes

  • Fix bias factor of variance in var_all and cov functions - #2986
  • Fix a race condition in confidence connected components function for OpenCL backend - #2969
  • Safely ignore disk cache failures in CUDA backend for compiled kernel binaries - #2970
  • Fix randn by passing in correct values to Box-Muller - #2980
  • Fix rounding issues in Box-Muller function used for RNG - #2980
  • Fix problems in RNG for older compute architectures with fp16 - #2980 #2996
  • Fix performance regression of approx functions - #2977
  • Remove assert requiring signal/filter types to be the same - #2993
  • Fix checkAndSetDevMaxCompute when the device cc is greater than max - #2996
  • Fix documentation errors and warnings - #2973 , #2987
  • Add missing opencl-arrayfire interoperability functions in unified backend - #2981
  • Fix constexpr-related compilation error with VS2019 and Clang compilers - #3049

Contributions

Special thanks to our contributors: P. J. Reed

v3.8.rc

3 years ago

v3.8.0 Release Candidate

New Functions

  • Ragged max reduction - #2786
  • Initialization list constructor for array class - #2829 , #2987
  • New API for the following statistics functions: cov, var, and stdev - #2986
  • Bit-wise operator support for array and C API (af_bitnot) - #2865
  • allocV2 and freeV2 which return cl_mem on OpenCL backend - #2911
  • Move constructor and move assignment operator for Dim4 class - #2946

Improvements

  • Add f16 support for histogram - #2984
  • Update confidence connected components example for better illustration - #2968
  • Enable disk caching of OpenCL kernel binaries - #2970
  • Refactor extension of kernel binaries stored to disk .bin - #2970
  • Add minimum driver versions for CUDA toolkit 11 in internal map - #2982
  • Improve warning messages from run-time kernel compilation functions - #2996

Fixes

  • Fix bias factor of variance in var_all and cov functions - #2986
  • Fix a race condition in confidence connected components function for OpenCL backend - #2969
  • Safely ignore disk cache failures in CUDA backend for compiled kernel binaries - #2970
  • Fix randn by passing in correct values to Box-Muller - #2980
  • Fix rounding issues in Box-Muller function used for RNG - #2980
  • Fix problems in RNG for older compute architectures with fp16 - #2980 #2996
  • Fix performance regression of approx functions - #2977
  • Remove assert requiring signal/filter types to be the same - #2993
  • Fix checkAndSetDevMaxCompute when the device cc is greater than max - #2996
  • Fix documentation errors and warnings - #2973 , #2987
  • Add missing opencl-arrayfire interoperability functions in unified backend - #2981

Contributions

Special thanks to our contributors: P. J. Reed

v3.7.2

3 years ago

Improvements

  • Cache CUDA kernels to disk to improve load times (thanks to @cschreib-ibex) #2848
  • Statically link against CUDA libraries #2785
  • Make cuDNN an optional build dependency #2836
  • Improve support for different compilers and OS #2876 #2925 #2942 #2943 #2945
  • Improve performance of join and transpose on CPU #2849
  • Improve documentation #2816 #2821 #2846 #2918 #2928 #2947
  • Reduce binary size by using NVRTC and reducing template instantiations #2849 #2861 #2890
  • Improve reduceByKey performance on OpenCL by using builtin functions #2851
  • Improve support for Intel OpenCL GPUs #2855
  • Allow statically linking against MKL #2877 (sponsored by SDL)
  • Better support for older CUDA toolkits #2923
  • Add support for CUDA 11 #2939
  • Add support for ccache for faster builds #2931
  • Add support for the conan package manager on Linux #2875
  • Propagate build errors up the stack in AFError exceptions #2948 #2957
  • Improve runtime dependency library loading #2954
  • Improved cuDNN runtime checks and warnings #2960
  • Document af_memory_manager_* native memory return values #2911
  • Add support for cuDNN 8 #2963

Fixes

  • Fix crash when allocating large arrays #2827
  • Fix various compiler warnings #2827 #2849 #2872 #2876
  • Fix minor leaks in OpenCL functions #2913
  • Various continuous integration related fixes #2819
  • Fix zero padding with convolve2NN #2820
  • Fix af_get_memory_pressure_threshold return value #2831
  • Increased the max filter length for morph
  • Handle empty array inputs for LU, QR, and Rank functions #2838
  • Fix FindMKL.cmake script for sequential threading library #2840
  • Various internal refactoring #2839 #2861 #2864 #2873 #2890 #2891 #2913
  • Fix OpenCL 2.0 builtin function name conflict #2851
  • Fix error caused when releasing memory with multiple devices #2867
  • Fix missing set stacktrace symbol from unified API #2915
  • Fixed bugs in ReduceByKey #2957
  • Add clblast patch to handle custom context with multiple devices #2967

Contributions

Special thanks to our contributors: Corentin Schreiber, Jacob Kahn, Paul Jurczak, Christoph Junghans

v3.7.1

4 years ago

Improvements

  • Improve mtx download for test data #2742
  • Improve Documentation #2754 #2792 #2797
  • Remove verbose messages in older CMake versions #2773
  • Reduce binary size with the use of NVRTC #2790
  • Use texture memory to load LUT in orb and fast #2791
  • Add missing print function for f16 #2784
  • Add checks for f16 support in the CUDA backend #2784
  • Create a thrust policy to intercept temporary buffer allocations #2806

Fixes

  • Fix segfault on exit when ArrayFire is not initialized in the main thread
  • Fix support for CMake 3.5.1 #2771 #2772 #2760
  • Fix evalMultiple if the input array sizes aren't the same #2766
  • Fix error when AF_BACKEND_DEFAULT is passed directly to backend #2769
  • Workaround name collision with AMD OpenCL implementation #2802
  • Fix on-exit errors with the unified backend #2769
  • Fix check for f16 compatibility in OpenCL #2773
  • Fix matmul on Intel OpenCL when passing same array as input #2774
  • Fix CPU OpenCL blas batching #2774
  • Fix memory pressure in the default memory manager #2801

Contributions

Special thanks to our contributors: padentomasello, glavaux2

v3.7.0

4 years ago

Major Updates

  • Added the ability to customize the memory manager (thanks jacobkahn and flashlight) [#2461]
  • Added 16-bit floating point support for several functions [#2413] [#2583] [#2585] [#2587]
  • Added sumByKey, productByKey, minByKey, maxByKey, allTrueByKey, anyTrueByKey, countByKey [#2254]
  • Added confidence connected components [#2748]
  • Added neural network based convolution and gradient functions [#2359]
  • Added a padding function [#2682]
  • Added pinverse for pseudo inverse [#2279]
  • Added support for uniform ranges in approx1 and approx2 functions. [#2297]
  • Added support to write to preallocated arrays for some functions [#2599] [#2481] [#2328] [#2327]
  • Added meanvar function [#2258]
  • Added support for sparse-sparse arithmetic [#2312]
  • Added rsqrt function for reciprocal square root [#2500]
  • Added a lower level af_gemm function for general matrix multiplication [#2481]
  • Added a function to set the cuBLAS math mode for the CUDA backend [#2584]
  • Separate debug symbols into separate files [#2535]
  • Print stacktraces on errors [#2632]
  • Support move constructor for af::array [#2595]
  • Expose events in the public API [#2461]
  • Add setAxesLabelFormat to format labels on graphs [#2495]
  • Added deconvolution functions [#1881]

Improvements

  • Better error messages for systems with driver or device incompatibilities [#2678] [#2448] [#2761]
  • Optimized unified backend function calls [#2695]
  • Optimized anisotropic smoothing [#2713]
  • Optimized canny filter for CUDA and OpenCL [#2727]
  • Better MKL search script [#2738] [#2743] [#2745]
  • Better logging of different submodules in ArrayFire [#2670] [#2669]
  • Improve documentation [#2665] [#2620] [#2615] [#2639] [#2628] [#2633] [#2622] [#2617] [#2558] [#2326] [#2515]
  • Optimized af::array assignment [#2575]
  • Update the k-means example to display the result [#2521]

Fixes

  • Fix multi-config generators [#2736]
  • Fix access errors in canny [#2727]
  • Fix segfault in the unified backend if no backends are available [#2720]
  • Fix access errors in scan-by-key [#2693]
  • Fix sobel operator [#2600]
  • Fix an issue with the random number generator and s16 [#2587]
  • Fix issue with boolean product reduction [#2544]
  • Fix array_proxy move constructor [#2537]
  • Fix convolve3 launch configuration [#2519]
  • Fix an issue where the fft function modified the input array [#2520]
  • Added a workaround for the NVIDIA OpenCL runtime when Forge dependencies are missing [#2761]

Contributions

Special thanks to our contributors: @jacobkahn @WilliamTambellini @lehins @r-barnes @gaika @ShalokShalom