MaxMath Versions

A C# SIMD math library for use with Unity only, substantially extending Unity.Mathematics with new types and functions, using Unity.Burst.

2.3.5

1 year ago

Known Issues

half8 == and != operators don't conform to the IEEE 754 standard (compliant with Unity.Mathematics)
(s)byte, (u)short vector and (U)Int128 multiplication, division and modulo operations by compile time constants are not optimal
optimized (U)Int128 comparison operators didn't make it into this release
bool vectors generated from operations on non-(s)byte vectors do not generate the most optimal machine code possible, partly due to an LLVM performance regression, partly due to other compiler related difficulties
most vectorized function overloads don't yet communicate return value ranges to the compiler, so more efficient code paths that could be selected at compile time through compile-time-only value range checks are missed
AVX2 (s)byte32 all_dif lookup tables are currently far too large (kilobytes)

Fixes

(Issue #10) bool8/16/32 are now blittable when not used within an IJob

Additions

added comb(n, k) for scalar and vector integer types. This is known as the binomial coefficient or "n choose k". An optional Promise parameter can select an O(1) code path using the factorial formula. The standard approach runs in O(min(k, n - k)) time and cannot overflow unless the result itself overflows (which is not true of most solutions found online that claim it)
added perm(n, k) for scalar and vector integer types. This is known as "k-permutations of n". An optional Promise parameter can select an O(1) code path using the factorial formula. The standard approach runs in O(k) time and cannot overflow unless the result itself overflows
added nextgreater(x) for all types. For integer types, it is a wrapper function for addsaturated(x, 1). For floating point types, it returns the next greater representable floating point value(s), unless x is NaN or infinite. An optional Promise parameter allows for numerous optimizations.
added nextsmaller(x) for all types. For integer types, it is a wrapper function for subsaturated(x, 1). For floating point types, it returns the next smaller representable floating point value(s), unless x is NaN or infinite. An optional Promise parameter allows for numerous optimizations
added nexttoward(from, to) for all types, returning the next representable integer/floating point value(s) in a given direction, unless from is equal to to. For floating point types, from is returned if from is NaN or infinite. If to is NaN, NaN is returned. An optional Promise parameter allows for numerous optimizations.
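The overflow property described for comb and perm can be illustrated with a minimal scalar sketch. This is not the library's code: the class and method names are hypothetical, and the gcd-based reduction is just one known way to keep every intermediate value at or below the final result.

```csharp
using System;

// Hypothetical illustration, not MaxMath's implementation.
public static class CombinatoricsSketch
{
    // n choose k using min(k, n - k) loop iterations.
    // After iteration i, result holds C(n - k + i, i), which never exceeds the final value.
    public static ulong Comb(ulong n, ulong k)
    {
        if (k > n) return 0;
        if (k > n - k) k = n - k;              // C(n, k) == C(n, n - k)

        ulong result = 1;
        for (ulong i = 1; i <= k; i++)
        {
            ulong m = n - k + i;
            // C(m, i) = C(m - 1, i - 1) * m / i. Cancelling gcd(result, i) first
            // leaves a divisor that divides m exactly, so the only product formed
            // is the new (exact) binomial coefficient itself.
            ulong g = Gcd(result, i);
            result = checked((result / g) * (m / (i / g)));
        }
        return result;
    }

    // k-permutations of n using k multiplications; partial products only grow
    // towards the final result, so they overflow only if the result does.
    public static ulong Perm(ulong n, ulong k)
    {
        if (k > n) return 0;

        ulong result = 1;
        for (ulong i = 0; i < k; i++)
        {
            result = checked(result * (n - i));
        }
        return result;
    }

    private static ulong Gcd(ulong a, ulong b)
    {
        while (b != 0) { ulong t = a % b; a = b; b = t; }
        return a;
    }
}
```

The naive update `result * m / i` would overflow for inputs such as comb(67, 33) even though the result fits into 64 bits; the `checked` multiplications above throw only when the result itself does not fit.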

Improvements

improved performance of 64bit vectorized division thanks to a newly implemented and further optimized algorithm from a July 13th 2022 research paper, which replaces a vectorized loop (rather slow; up to 64 iterations; no instruction level parallelism outside the loop possible until the loop finished executing, following an almost certainly mispredicted branch) with straight line code. Due to "recent" improvements to divider circuits, this code path is inferior to hardware supported scalar division via element extraction for (u)long2, specifically, even when the quotient and/or remainder vector is in the middle of a dependency chain and even in tight loops, and is thus only implemented for (u)long3/4 types and only if compiling for AVX2
improved performance and reduced code size of up to (s)byte8 and every (u)short vector division if not compiling with FloatMode.Fast. Reduced constants possibly read from RAM in either case.
fixed performance regression of SIMD register <-> software abstraction conversions for types using up the entirety of a hardware register
lcm for (s)byte vectors with 8 elements or less: decreased code size by 20 or 28 bytes; removed 2 or 4 or 8 bytes of constant data read from RAM; reduced latency by 2 or 3 clock cycles
verified and increased the (u)long scalar- and vector intcbrt Promise.Unsafe0 range from [0, 1ul << 40] to [0, 1ul << 46], the code path of which is also possibly chosen at compile time
implemented optimized quarter{X} IEEE-754 comparison operators (without having to cast to float{X}). Vectorized half{X} comparisons are implemented in MaxMath.Intrinsics.Xse as well and used where appropriate. compareto function overloads for quarter{X} and half{X} were implemented.
reduced latency of add/subsaturated for scalar Int128s, scalar and vector longs as well as vector ints by about a third
replaced the call to BigInteger.ToString() in (U)Int128.ToString(null, null), and thus unnecessary heap allocations, with an optimized implementation
(u)short8 / and % operators now correctly check for SSE2 support rather than AVX2
removed aliased fixed size buffers from all types, also improving indexer operator performance if the index is a compile time constant (in some cases)

Changes

Burst compiled code that uses a Promise argument which is not a compile time constant will throw an exception in DEBUG, as it represents significant overhead instead of an optimization. This will currently not inform users of the name of the function but rather the Burst compiled job/function that threw it.

Fixed Oversights

added explicit type conversion operators for scalar floats and doubles to half8 and all quarter vectors (as well as scalar halfs to quarter vectors)

2.3.0

1 year ago

Known Issues

half8 == and != operators don't conform to the IEEE 754 standard (compliant with Unity.Mathematics)
(s)byte, (u)short vector and (U)Int128 multiplication, division and modulo operations by compile time constants are not optimal
optimized (U)Int128 comparison operators didn't make it into this release
using bool vectors generated from 256 bit input vectors like so: long4 x = select(a, b, >>> myLong4a < myLong4b <<<) (as an example) does not generate the most efficient machine code possible
unit tests for 64-bit bits_zerohigh functions fail 100% of the time because of a bug related to the managed debug implementation of intrinsics (reported)
unit tests for intrinsics code paths for all functions that use "(mm256_)shuffle_ps" or "(mm256_)blendv_ps" can fail semi-randomly due to a bug which changes the bit content of ints which would be NaN if dereferenced as a float and written back to memory (reported)
most vectorized function overloads don't yet communicate return value ranges to the compiler, so more efficient code paths that could be selected at compile time through compile-time-only value range checks are missed
(s)byte32 all_dif lookup tables are currently far too large (kilobytes)

Fixes

fixed quarter rounding behavior when casting a wider floating point type to a quarter to round towards the nearest representable value instead of truncating the mantissa

Additions

added namespace `MaxMath.Intrinsics` for users who want to use the math library through "high level" X86 intrinsics. Because users need to guard their intrinsics code with e.g. `if (Burst.Intrinsics.X86.Sse2.IsSse2Supported)` blocks and supported architectures vary (slightly) from function to function, these are considered unsafe, undocumented and unrecommended and only serve as an exposed layer of abstraction which is used internally anyway.

added flags enum `Promise`, with values `Nothing`, `Everything`, `NoOverflow`, `ZeroOrGreater`, `ZeroOrLess`, `NonZero` and `Unsafe` 0 through 3, as well as the composites `Positive` and `Negative`. This flags enum is only ever used as an optional parameter and offers faster, yet more unsafe code. Specifics vary between functions and sometimes even overloads but are documented accordingly. Optimizations are only ever to be added, not removed (= a ...promise... of never introducing breaking changes in this regard)
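As a rough sketch of how such an optional flags parameter can select a faster path: the member names below follow the list above, but the numeric values, the composition of `Positive`/`Negative`, the type name `PromiseSketch` and the `Factorial` helper are all assumptions made for illustration, not the library's definitions.

```csharp
using System;

// Hypothetical stand-in for the Promise enum; the bit values are assumed.
[Flags]
public enum PromiseSketch : byte
{
    Nothing       = 0,
    NoOverflow    = 1 << 0,
    ZeroOrGreater = 1 << 1,
    ZeroOrLess    = 1 << 2,
    NonZero       = 1 << 3,
    Unsafe0       = 1 << 4,
    Unsafe1       = 1 << 5,
    Unsafe2       = 1 << 6,
    Unsafe3       = 1 << 7,

    // Assumed composition of the composites
    Positive   = ZeroOrGreater | NonZero,
    Negative   = ZeroOrLess | NonZero,
    Everything = 0xFF
}

public static class PromiseDemo
{
    // Without a promise, the result is clamped to uint.MaxValue on overflow.
    // With NoOverflow, the range check is skipped; breaking the promise
    // silently yields a wrapped (wrong) value.
    public static uint Factorial(uint n, PromiseSketch promises = PromiseSketch.Nothing)
    {
        bool mayOverflow = (promises & PromiseSketch.NoOverflow) == 0;
        if (mayOverflow && n > 12)          // 12! is the largest factorial below 2^32
        {
            return uint.MaxValue;
        }

        uint result = 1;
        for (uint i = 2; i <= n; i++)
        {
            result *= i;
        }
        return result;
    }
}
```

A call such as `PromiseDemo.Factorial(n, PromiseSketch.NoOverflow)` passes the promise as a literal, which is what lets a compiler fold the flag test away.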

Other Additions

added factorial (for integer types) and gamma (floating point types) functions. factorial, when called without a Promise parameter, clamps the result to type.MaxValue in case of overflow
added erf(c), the (complementary) error function for floating point types
added (c)minmag and (c)maxmag functions, returning the (componentwise) minimum/maximum magnitude of two values or within a vector; equivalent to abs(x) > abs(y) ? x : y (maxmag) or abs(cmin(c)) > abs(cmax(c)) ? cmin(c) : cmax(c) (cmaxmag)
added (c)minmax and (c)minmaxmag functions which return both the (componentwise/columnwise) minimum and maximum (magnitude) as out parameters
added bitfield functions for scalar and vector integer types - small utility functions that pack several smaller integers into bigger ones
added copysign(x, y) functions for signed types, which is equivalent to return y < 0 ? nabs(x) : abs(x)
added (naive?) implementations of the scalar and vector float/double inverse hyperbolic functions asinh, acosh and atanh
added intlog10 functions (integer base ten logarithm)
added the bit test/bt family of functions for scalar and vector integer types. A testbit(POST_ACTION)((ref)x, i) function returns a boolean (vector), indicating whether the bit in x at index i is 1 and may (or may not) flip, set, or reset that bit afterwards
added a new category of type conversion functions with the suffix "unsafe". Added to(u)longunsafe and todoubleunsafe with a Promise parameter, allowing for up to two levels of optimization (vectorized 64bit int <-> 64 bit float is not hardware supported). Details in the XML documentation. Default double <-> (u)long conversion operators - apart from having their 4-element version improved - now check whether or not a safe range for unsafe conversions can be validated at compile time
added scalar/vectorized toquarterunsafe allowing for each type to be converted to a quarter type while specifying whether the input value will or will not overflow and/or is >= 0
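The scalar equivalences quoted above for maxmag and copysign can be written out as follows. This is only a sketch for `int`: the names are hypothetical, minmag is assumed to mirror maxmag, and comparisons go through the negative absolute value so that `int.MinValue` does not overflow.

```csharp
// Hypothetical scalar illustration of the equivalences given in the notes.
public static class MagnitudeSketch
{
    // Negative absolute value: unlike abs, it is representable for every int.
    public static int Nabs(int x) => x > 0 ? -x : x;

    // abs(x) > abs(y) ? x : y, expressed through Nabs (a smaller nabs means a larger magnitude)
    public static int MaxMag(int x, int y) => Nabs(x) < Nabs(y) ? x : y;

    // Assumed mirror image: abs(x) < abs(y) ? x : y
    public static int MinMag(int x, int y) => Nabs(x) > Nabs(y) ? x : y;

    // y < 0 ? nabs(x) : abs(x); abs(int.MinValue) wraps back to int.MinValue here
    public static int CopySign(int x, int y) => y < 0 ? Nabs(x) : unchecked(-Nabs(x));
}
```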

Improvements

improved performance of several vector operators and function overloads for types that use up an entire hardware register while having to be up-cast to a wider type considerably - surrounding boilerplate code uses a new "in-house" faster-than-hardware algorithm with its dependency chain latency having been reduced from x [0 <= x <= 3] + (9 or 10) clock cycles down to x + (0 or 1 or 3) + (1 or 3) clock cycles

massive performance improvements for all vector types that are not a total of 128 or 256 bits wide, respectively, either through the `Avx.[...]undefined[...]` compiler intrinsics or through controlled undefined behaviour, by declaring an uninitialized variable and using pointer syntax to force the C# compiler into trusting that the variable has been fully initialized; this cannot lead to memory access violations, since the variable is declared and thus enough space is reserved on the stack, before it is optimized away by LLVM and assigned a hardware register instead, with undefined upper elements. This allows for upper elements of hardware registers to be ignored during compilation. Unnecessarily emitted instructions like `movq xmm0, xmm0` (move the low 8 bytes from a register to the same register, zeroing out the upper 8 bytes, even though only the lower 8 bytes will be written back to memory) or far worse instruction sequences, for example when using vectors with 3 elements, are now (MOSTLY; there's still work to be done) omitted instead. Although most zero-upper-elements instruction( sequence)s only took a single clock cycle, they were always part of each dependency chain and could happen between almost each function call, including operators of course. The same improvements apply to `Unity.Mathematics` types when passed to `maxmath` functions.

improved performance throughout the library by effectively adding hundreds of thousands more `Unity.Burst.CompilerServices.Constant.IsConstantExpression` condition checks to many functions within the library. Most notably, algorithms whose total latency is dependent on the byte size of arguments may now perform much faster. Some but not yet all of these constant checks are exposed through a `Promise` parameter

Other Improvements

improved performance of scalar (u)short to (u)short2/3/4 conversion
reduced latency of all, any, first, last, count and bitmask functions for bool8/16/32 when used with an expression as the argument, such as all(x != y) - a way to force the compiler to omit unnecessary instructions was found
reduced latency of addsaturated for scalar unsigned integer types
reduced latency of float/double to (U)Int128 conversion
reduced latency of shl, shrl and shra and thus all functions using those - especially for: shl for (s)byte vectors of all sizes if compiling for SSE4 and 32 byte sized vectors if compiling for AVX2; shl for (u)short vectors of 4 or more elements if compiling for at least SSE4; shra for (u)long vectors if compiling for AVX2 and the vector containing the shift amounts is a compile time constant.
reduced long2/3/4 shra code size and latency by another 2 clock cycles if compiling for AVX2
reduced latency of variable rol/r vector functions beyond shl/r improvements and added an optional Promise parameter, allowing the caller to promise the rotation values are in a specific range
reduced latency of long2/3/4 "is negative checks" - mylong4 < 0/0 > mylong4 by 33% by doubling its code size. This further improves performance/adds to code size of functions in the library
reduced latency of (u)long2/3/4 isinrange functions
reduced latency of unsigned byte and ushort vector to float vector conversion. This also affects performance of (s)byte and (u)short vector intsqrt functions, as well as the respective % and / operators (byte2/3/4/8, all ushort vectors)
reduced (u)long vector intcbrt latency by ~45% and reduced code size by ~20% (roughly 150 bytes). For other integer vector types, the latency has been reduced by ~8 to ~15 clock cycles
added hidden and retroactively improved exp2 scalar and vector integer argument function overloads. These return exp2((float/double)x) or (float/double)(1 << x) in 3 instead of 6 to 7 clock cycles at best; they of course also work for negative input values i.e. reciprocals of powers of 2. The (u)int overloads convert to floats, the (u)long overloads convert to doubles; explicit integer to integer casting should (and sometimes has to) be used for optimal results. Additionally, these overloads contain an optional 'Promise' parameter, allowing for omission of clamping which is needed to ensure correct underflow/overflow behavior, as dictated by Unity's exp2 implementation. If you ever used the standard exp2 function by implicitly converting an int type to a float type, performance was improved by a factor of about 30x. This overload only "breaks" code that casts (u)long types to float types implicitly if the result is expected to be a float type. It is recommended to explicitly cast the (u)long type to a (u)int type in such a case
added ==, !=, <, >, <= and >= operators for UInt128 and signed long/int comparisons, as the expensive float conversion and comparison was previously used when, for instance, comparing a UInt128 to a constant int such as 1 or 0
implemented SIMD (u)int and (u)long division/modulo algorithms. (u)long performance gains are only noticeable under certain conditions; the (u)int performance gain is substantial (and unfortunately not used for (u)int2/3/4 by LLVM/Burst - these are now exposed as further div overloads and new mod functions). Functions other than operator overloads are also positively affected
added SSE2 fallback code for all (s)byte2/3/4 shuffles, e.g. myByte4.xzzw
added more SSE4 -> SSE2 fallback code instead of n * (vector element extraction code + scalar code + vector element insertion code), where viable (now - thanks to some specific performance improvements)
improved performance of double4 to (u)long4 conversions if compiling for AVX2
optimized each possible byte vector division/modulo operation by a scalar compile time constant. Many, if not most, were not even auto-vectorized, let alone optimized for SIMD instructions instead of general purpose register instructions, which were translated poorly if vectorized
replaced double precision (r)cbrt's math.pow(x, (-)1d/3d) call with an optimized implementation
reduced latency of float scalar- and vector (r)cbrt by ~1 + (1 or 2) * ~4 clock cycles, while also gaining a small amount of precision; reduced code size as well as the number of required compile time constants
reduced latency of float scalar- and vector (r)cbrt which handle negative inputs accurately (i.e. the new standard) by one clock cycle, making it just one clock cycle slower than the unsafe version, which now mostly just provides a somewhat considerable advantage with regard to code size
reduced (s)byte16 and (u)short16 all_dif lookup table size by 896 bytes (traded for an increase of 8 bytes in code size so this doesn't save RAM; it potentially reduces memory latency as well as register spilling onto the stack)
reduced (u)int8 t/lzcnt latency by ~10%, also positively affecting (u)int2/3/4/8 gcd and lcm performance, as it is part of a loop within gcd
reduced double and float to quarter conversion latency (15+ clock cycles down to 7, optimally (CPU dependent)), code size and the number of constants being used. This affects scalar and vector conversions; the scalar versions are now branch free.
added AVX2 -> SSSE3 -> SSE2 fallback code for (s)byte32 and (u)short16 all_dif functions
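The integer-argument exp2 overloads mentioned above return a power of two as a floating point value. One well-known way to do that without calling a transcendental function is to write the biased exponent straight into the IEEE 754 bit pattern. The sketch below assumes this technique and a simple clamping policy (results below the smallest normal float flush to zero); it is not taken from the library, and the names are hypothetical.

```csharp
using System;

// Hypothetical illustration: 2^x for an int x, built from the float exponent bits.
public static class Exp2Sketch
{
    public static float Exp2(int x)
    {
        // Clamp so that the biased exponent stays within [0, 255]:
        // -127 maps to the bit pattern of 0.0f, 128 maps to +infinity.
        // Omitting this clamp is the kind of shortcut a Promise could allow.
        int clamped = Math.Max(-127, Math.Min(128, x));

        // A float is sign (1 bit) | biased exponent (8 bits) | mantissa (23 bits);
        // with a zero mantissa and bias 127 the value is exactly 2^clamped.
        int bits = (clamped + 127) << 23;
        return BitConverter.Int32BitsToSingle(bits);
    }
}
```

Negative arguments give reciprocals of powers of two, for example `Exp2Sketch.Exp2(-2)` is 0.25f.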

Changes

Complete `avg` Overhaul

renamed avg overloads which calculate the average value of a vector itself to cavg for consistency reasons (max vs cmax, for instance)
32- and 64bit integer (c)avg calculations can no longer result in overflow of intermediate calculations and thus incorrect results (lower performance by default)
added Promise parameters to most (c)avg overloads. These can bring back the previous performance of 32- and 64bit integer overloads
reduced latency of signed 8/16 bit (c)avg overloads
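To illustrate why intermediate overflow matters for 32-bit averages, here is a small scalar sketch. The bitwise formulas are standard identities rather than the library's code, the names are hypothetical, and which rounding direction avg actually uses is not stated in these notes, so both variants are shown.

```csharp
// Hypothetical illustration of overflow-free averaging for uint.
public static class AverageSketch
{
    // The obvious formula: the sum wraps around when a + b >= 2^32.
    public static uint AvgNaive(uint a, uint b) => unchecked((a + b) / 2);

    // floor((a + b) / 2): shared bits count fully, differing bits count half.
    public static uint AvgFloor(uint a, uint b) => (a & b) + ((a ^ b) >> 1);

    // ceil((a + b) / 2): start from all set bits and remove half of the difference.
    public static uint AvgCeil(uint a, uint b) => (a | b) - ((a ^ b) >> 1);
}
```

For `a = b = uint.MaxValue` the naive version returns 0x7FFFFFFF, while both bitwise variants return `uint.MaxValue`.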

Other Changes

(U)Int128((u)long lo64, (u)long hi64) constructors are now public
the Int128 intsqrt overload now returns a ulong
replaced the optional float (r)cbrt bool parameter handleNegativeInput with a Promise parameter and removed it from the double overloads completely, with having its NonNegative flag set being a requirement for the faster version. This is first due to the introduction of the Promise type and thus for consistency reasons. Also, the optimized double implementation handles negative numbers for free, which is now the standard behavior.
replaced the optional intcbrt bool parameter handleNegativeInput with a Promise parameter for reasons mentioned above, also handling negative input values correctly by default
Bumped C# Dev Tools to version 1.0.8

Fixed Oversights

(Issue #5) .meta files are now included to allow for adding the repository to Unity projects via its github URL
added floorpow2 function overloads for scalar (u)int and (u)long types
added legitimately faster-than-hardware double scalar- and vector fastrcp and fastrsqrt overloads (substantially less accurate than FloatMode.Fast, FloatPrecision.Low 1d / x or 1d / sqrt(x))
the seven Bit Manipulation Instructions (functions with a bits_ prefix) now have their vector equivalents implemented as overloads

2.2.0

2 years ago

Known Issues

half8 == and != operators don't conform to the IEEE 754 standard - Unity has not yet reacted to my bug-report in regards to their "half" implementation
(s)byte, (u)short vector and (U)Int128 multiplication, division and modulo operations by compile time constants are not optimal. For (U)Int128, it requires a new Burst feature à la T Constant.ForceCompileTimeEvaluation<T, U>(Func<U, T> code)(proposed); Currently work is being done on (s)byte and (u)short vectors in this regard, which will beat any compiler. The current (tested) state of all optimizations possible is included in this version.
pow functions with compile time constant exponents currently do not handle many decimal numbers - math.rsqrt would often be used in those cases for optimal performance but it is actually slower when the Unity.Burst.FloatMode is set to anything but FloatMode.Fast. To guarantee optimal performance, compile time access to the current FloatMode would be needed (proposed)
double (r)cbrt functions are currently not optimized

Fixes

linked float8 rcp and rsqrt functions to Burst's FloatMode and FloatPrecision
short.MinValue / -1 now correctly overflows to short.MinValue when dividing a short16 vector by another short16 vector when compiling for AVX or higher
fixed scalar quarter to double conversion for when the quarter value is negative
fixed scalar half to quarter conversion for when the half value is negative
fixed vector quarter to ulong conversion for when a quarter value is negative
fixed (u)short8 to quarter8 conversion

Additions

Added saturation arithmetic to the library for all scalar- and vector types. Saturation arithmetic clamps the result of an operation to `type.MinValue` and `type.MaxValue` if under- or overflow occurs, respectively and has single-instruction hardware support for `(s)bytes` and `(u)shorts`. The included functions are:

addsaturated
subsaturated
mulsaturated
divsaturated (only clamps division of floating point types and signed division of, for instance, sbyte.MinValue (= -128) / -1 to sbyte.MaxValue (= 127), which would cause a hardware exception for ints and longs)
castsaturated (all types to all other types with a smaller range)
csumsaturated
cprodsaturated
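A minimal scalar sketch of the clamping behavior described above, using hypothetical names and plain C# rather than the library's SIMD code paths:

```csharp
using System;

// Hypothetical scalar illustration of saturation arithmetic.
public static class SaturationSketch
{
    // Compute in a wider type, then clamp to the range of int.
    public static int AddSaturated(int a, int b)
    {
        long wide = (long)a + b;
        if (wide > int.MaxValue) return int.MaxValue;
        if (wide < int.MinValue) return int.MinValue;
        return (int)wide;
    }

    // Unsigned bytes can only overflow upwards when added.
    public static byte AddSaturatedByte(byte a, byte b)
    {
        return (byte)Math.Min(a + b, byte.MaxValue);
    }

    // The only signed division that leaves the range is MinValue / -1.
    public static sbyte DivSaturated(sbyte a, sbyte b)
    {
        if (a == sbyte.MinValue && b == -1) return sbyte.MaxValue;
        return (sbyte)(a / b);
    }
}
```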

(U)Int128

added high performance (U)Int128 types with full library support, meaning: all operators and type conversions as well as all functions support these types. Most operations of both types, in Burst code, compile down to optimal machine code. Exceptions: 1) signed 64x64 bit to 128 bit multiplication 2) *, /, % and divrem functions with a scalar compile time constant argument (See: Known Issues 2)
added Random128 XOR-Shift pseudo random number generator for generating (U)Int128s

Cube Root

added high performance & accuracy (r)cbrt - (reciprocal) cube root functions for scalar and vector float- and double types based on a research paper from 2021. An optional bool parameter allows the caller to decide whether or not negative input values should be handled correctly (which is not the case with math.pow(x, 1f/3f)), which is set to false by default
added high performance intcbrt - integer cube root functions for all scalar and vector integer types. For signed integer types, an optional bool parameter allows the caller to decide whether or not negative input values should be handled correctly (which is not the case with math.pow(x, 1f/3f)), which is set to false by default
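For reference, an integer cube root with correct handling of negative inputs can be sketched with a plain binary search. This is far simpler and slower than what the notes describe, the names are hypothetical, and rounding negative results toward zero is an assumption.

```csharp
// Hypothetical illustration; a simple binary search, not the library's algorithm.
public static class IntCbrtSketch
{
    // Largest r with r^3 <= x.
    public static uint IntCbrt(uint x)
    {
        uint lo = 0;
        uint hi = 1625;                       // 1625^3 < 2^32 <= 1626^3
        while (lo < hi)
        {
            uint mid = (lo + hi + 1) / 2;
            if ((ulong)mid * mid * mid <= x)  // 64-bit product cannot overflow for mid <= 1625
            {
                lo = mid;
            }
            else
            {
                hi = mid - 1;
            }
        }
        return lo;
    }

    // Signed variant: take the root of the magnitude and restore the sign.
    public static int IntCbrtSigned(int x)
    {
        uint magnitude = x < 0 ? (uint)(-(long)x) : (uint)x;
        int root = (int)IntCbrt(magnitude);
        return x < 0 ? -root : root;
    }
}
```

By comparison, `Math.Pow(-8.0, 1.0 / 3.0)` in .NET returns NaN, which is the behavior the optional parameter lets callers avoid.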

Other Additions

added a log function to all scalar and vector float- and double types with a second parameter b, which is the logarithm's base
added reversebytes functions for all scalar- and vector types, which convert back and forth between big endian and little endian byte order, respectively. All of them (scalar, vector) compile down to single hardware instructions
added pow functions with scalar exponents for float and double scalars and vectors, with optimizations for selected constant exponents (not necessarily whole exponents)
added function overloads to all functions for scalar (s)bytes and (u)shorts in order to resolve function call resolution ambiguity which was already present in Unity.Mathematics, which may also improve performance in some cases
added a static readonly New property to RandomX XOR-Shift pseudo random generators. It calls Environment.TickCount internally (and is thus seeded somewhat randomly), makes sure it is non-zero and can be called from Burst native code
added fastrcp functions for float scalars and vectors, faster (and substantially less accurate) than FloatPrecision.Low, FloatMode.Fast Burst implementations
added fastrsqrt functions for float scalars and vectors, faster (and substantially less accurate) than FloatPrecision.Low, FloatMode.Fast Burst implementations
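As an illustration of what reversebytes computes for a 32-bit scalar, the sketch below spells out the byte swap with shifts and masks. The class and method names are hypothetical; `BinaryPrimitives.ReverseEndianness` from the .NET base class library is mentioned only as a reference point that computes the same value.

```csharp
using System.Buffers.Binary;

// Hypothetical illustration of a 32-bit byte order reversal.
public static class ReverseBytesSketch
{
    public static uint ReverseBytes(uint x)
    {
        return (x >> 24)                    // byte 3 -> byte 0
             | ((x >> 8) & 0x0000FF00u)     // byte 2 -> byte 1
             | ((x << 8) & 0x00FF0000u)     // byte 1 -> byte 2
             | (x << 24);                   // byte 0 -> byte 3
    }

    // Same result via the framework helper.
    public static uint ReverseBytesFramework(uint x)
    {
        return BinaryPrimitives.ReverseEndianness(x);
    }
}
```

Applying the function twice returns the original value, which is why the same call converts in both directions between big and little endian.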

Improvements

added AVX and AVX2 code for float8 sin, cos, tan, sincos, asin, acos, atan, atan2, sinh, cosh, tanh, pow, exp, exp2, exp10, log, log2, log10 and fmod (and the % operator)
optimized many /, %, * and divrem operations with a scalar compile time constant argument for (s)byte vectors (see 'Known Issues 2'), which were previously not optimized (...optimally/at all) by Burst.
added SSE2 fallback code for converting AVX vector types to SSE vector types and vice versa (for example: short16 (256 bit) to byte16 (128 bit))
scalar (s)byte and (u)short rol and ror functions now compile down to single hardware instructions
improved performance and/or reduced code size of nearly all vector comparison operations (==, > etc.)
improved performance of, and added SSE2 fallback code for, bitfield to boolean vector conversion (toboolX and thus also select(vector a, vector b, bitmask c))
improved performance of intpow functions in general and for when the exponent is a compile time constant
improved performance and reduced code size of compareto vector functions (especially for unsigned types)
added more optimizations to isdivisible
improved performance of intsqrt functions for (u)long and (s)byte scalar and vector types considerably
reduced code size of ispow2 vector functions
reduced code size of (s)byte vector-by-vector division
improved performance of Random64's (u)long4 generation if compiling for AVX2
improved performance of (s)byte matrix multiplication
reduced code size of (u)short- and up to (s)byte8 vector by vector division and divrem functions (and improved performance if compiling for SSE2 only)
reduced code size and improved performance of isinrange functions for (u)long vector types
reduced code size of ushort vector >= and <= operators for SSE2 fallback code by ~75%
improved performance and reduced code size of SSE2 down-casting fallback code

Changes

API BREAKING CHANGE: The various boolean to integer/floating point conversion functions (touint8/tof32 etc.) are now renamed to contain C# types in their names (tobyte/tofloat etc.)
API BREAKING CHANGE: If you use this library as intended, meaning you import it and Unity.Mathematics.math statically (using static MaxMath.maxmath;) and you use the pow functions with scalar bases and scalar exponents in those scripts, you will encounter the first ever function call resolution ambiguity. It is strongly recommended to always use the maxmath.pow function, because it optimizes any pow call enormously if the exponent is a compile time constant, which does NOT necessarily mean that such a call must declare the exponent as a literal value - the exponent may become a compile time constant due to constant propagation
quarter is now a readonly struct
quarter to sbyte, short, int and long conversions are now required to be declared explicitly
removed countbits(void* ptr, ulong bytes) from the library and added it to https://github.com/MrUnbelievable92/SIMD-Algorithms with more options

Fixed Oversights

(Issue #3) added constructor wrappers to the maxmath class analogous to Unity.Mathematics (byte4 myByte4 = (maxmath.)byte4(1, 2, 3, 4);)
added dsub - fused divide-subtract function for scalar and vector float types
added an optional bool fast = false parameter to dad, dsub, dadsub and dsubadd functions
added andnot function overloads for scalar and vector bool types
added implicit type conversions of scalar quarter values to half, float and double vectors
added all_eq and all_dif functions for vectors of size 2
added all_eq and all_dif functions for float and double vectors

2.1.2

3 years ago

Known Issues

half8 "equals" and "not equals" operators don't conform to the IEEE 754 standard - Unity has not yet reacted to my bug-report in regards to their "half" implementation

Fixes

fixed undefined behavior of "vshr" functions for vector types smaller than 128 bits
fixed SSE2 implementations of "vrol" and "vror" functions for the (u)short16 type

Additions

implemented Bmi1 and Bmi2 intrinsics as functions with a "bits_" prefix (except for "andn", which has already been implemented as "andnot")
added high performance and/or SIMD "isdivisible" functions for all integer vector types and scalar value types
added high performance and/or SIMD "intpow" - integer exponentiation - functions for (u)int, (u)long and all integer vector types
added high performance and/or SIMD "floorpow2" functions for all integer vector types
added "nabs" - negative absolute value functions for all non-boolean vector- and single value types
added "indexof(vector v, value x)" functions for all non-boolean vector types
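Integer exponentiation is commonly implemented with exponentiation by squaring, which needs a number of multiplications that grows with the bit length of the exponent rather than with its value. The scalar sketch below shows that textbook technique under hypothetical names; whether the library uses exactly this scheme is not stated in these notes.

```csharp
// Hypothetical illustration of exponentiation by squaring.
public static class IntPowSketch
{
    // Returns x^n; like ordinary integer multiplication, the result wraps on overflow.
    public static long IntPow(long x, uint n)
    {
        long result = 1;
        while (n != 0)
        {
            if ((n & 1) != 0)
            {
                result = unchecked(result * x);   // this bit of the exponent is set
            }
            x = unchecked(x * x);                 // x^1, x^2, x^4, x^8, ...
            n >>= 1;
        }
        return result;
    }
}
```

When the exponent is a compile time constant, the loop can be unrolled completely into a fixed multiplication chain.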

Improvements

aggressively optimized away global variables (shuffle masks) and thus memory access and usage where appropriate
improved performance of 256 bit vector subvector getters
added Sse2 fallback code for all (u)long2/3/4 operators
improved performance of multiplication, division and modulo operations for all (s)byte- and (u)short vector- and matrix types when dividing by a single non-compile time constant value
added overloads for (s)byte- and (u)short vectors' "divrem" functions with a scalar value as the divisor parameter, improving performance when it is a compile time constant
improved performance of "intsqrt" functions for most types

Changes

bump com.unity.burst to version 1.5

Fixed Oversights

added bitmask8 and bitmask16 functions for (s)byte and (u)short vector types, respectively

2.1.1

3 years ago

Known Issues

half8 "equals" and "not equals" operators don't conform to the IEEE 754 standard - Unity has not yet reacted to my bug-report in regards to their "half" implementation

Fixes

fixed a Burst compilation error triggered by "Sse4_1.blend_epi16" when compiling for SSE2, caused by fallback code not using a constant value for "imm8"
fixed incorrect CPU feature checks for quarter vector type-conversion code when compiling for SSE2
fixed "tzcnt" implementations (were completely broken)
fixed scalar (single value and C# fallback) "lzcnt" implementations for (s)byte and (u)short values and (u)long4 vectors

Additions

added "ulong countbits(void* ptr, ulong bytes)", which counts the number of 1-bits in a given block of memory, using Wojciech Mula's SIMD population count algorithm
added high performance and/or SIMD "gcd" a.k.a. greatest common divisor functions for (u)int, (u)long and all integer vector types, which always return unsigned types and vectors
added high performance and/or SIMD "lcm" a.k.a. least common multiple functions for (u)int, (u)long and all integer vector types, which always return unsigned types and vectors
added high performance and/or SIMD "intsqrt" - integer square root (floor(sqrt(x))) functions for all integer- and integer vector types, with the functions for signed integers and vectors throwing an ArgumentOutOfRangeException in case a value is negative
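A scalar sketch of the semantics described above, with hypothetical names and deliberately simple algorithms (Euclid's algorithm and a binary search). One plausible reason for the unsigned return types is that the magnitude of `int.MinValue` does not fit into an `int`; that reading is an assumption.

```csharp
using System;

// Hypothetical scalar illustration; not the library's SIMD implementations.
public static class GcdSketch
{
    // |v| as an unsigned value, valid even for int.MinValue.
    private static uint Magnitude(int v) => v < 0 ? (uint)(-(long)v) : (uint)v;

    public static uint Gcd(int a, int b)
    {
        uint x = Magnitude(a);
        uint y = Magnitude(b);
        while (y != 0)
        {
            uint t = x % y;
            x = y;
            y = t;
        }
        return x;
    }

    public static uint Lcm(int a, int b)
    {
        uint g = Gcd(a, b);
        // Divide before multiplying to keep the intermediate value small.
        return g == 0 ? 0 : Magnitude(a) / g * Magnitude(b);
    }

    // floor(sqrt(x)); negative inputs are rejected as described in the notes.
    public static int IntSqrt(int x)
    {
        if (x < 0) throw new ArgumentOutOfRangeException(nameof(x));

        int lo = 0;
        int hi = 46340;                        // 46340^2 <= int.MaxValue < 46341^2
        while (lo < hi)
        {
            int mid = (lo + hi + 1) / 2;
            if ((long)mid * mid <= x) lo = mid; else hi = mid - 1;
        }
        return lo;
    }
}
```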

Improvements

performance improvements of "avg" functions for signed integer vectors
added SIMD implementations of the "transpose" functions for all matrix types
added SSE4 and SSE2 fallback code for variable bitshifts ("shl", "shrl" and "shra")
added SSE2 fallback code for (s)byte vector-by-vector division and modulo operations
added SSE2 fallback code for "all_dif" for (s)byte16, (u)short8 and (u)int8 vectors
added SSE2 fallback code for typecasting, propagating through the entire library
added SSE2 fallback code for "addsub" and "subadd" functions
bitmask32 and bitmask64 now allow for masks to be up to 32 and 64 bits wide, respectively

Changes

renamed "BurstCompilerException" to "CPUFeatureCheckException"
"shl", "shrl" and "shra" now have undefined behavior when bitshifting any value outside of the interval [0, 8 * sizeof(integer_type) - 1] for performance reasons and because of differences between SSE, AVX and managed C#
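The managed C# side of that difference is easy to demonstrate: for 32-bit operands, C# uses only the low five bits of the shift count, whereas x86 vector shift instructions generally produce zero for out-of-range logical shift counts. The sketch below contrasts the two behaviors in plain scalar code; the helper names are hypothetical and `ZeroingShl` only imitates the hardware behavior.

```csharp
// Hypothetical illustration of two out-of-range shift behaviors.
public static class ShiftSketch
{
    // Managed C#: the count is masked with 31 for int operands.
    public static int ManagedShl(int value, int count) => value << count;

    // Imitation of a shift that yields zero for any count outside [0, 31].
    public static int ZeroingShl(int value, int count) => (uint)count > 31 ? 0 : value << count;
}
```

`ManagedShl(1, 32)` evaluates to 1 while `ZeroingShl(1, 32)` evaluates to 0, so code that relies on either result for out-of-range counts is not portable across code paths.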

Fixed Oversights

added "shl", "shrl" and "shra" (varying per element) functions for (s)byte and (u)short vectors
added "ror" and "rol" (varying per element) functions for (s)byte and (u)short vectors
added "compareto" functions for all vector types except half- and quarter vectors
added "all_dif" functions for (s)byte32 vectors
added vshr/l and vror/l functions for (s)byte32 and (u)short16 vectors

2.1.1 Hotfix

Fixes

fixed SSE2 "shl", "shrl" and "shra" implementations
fixed SSE2 "intsqrt" implementations

Improvements

improved performance of (s)byte2, -3, -4, -8, -16 and (u)short2, -3, -4, -8 "gcd" functions (and thus "lcm") when compiling for Avx2
improved performance of "tzcnt" and "lzcnt" implementations for all vector types if compiling for SSE4 or higher, propagating through a lot of the library

Fixed Oversights

Added documentation for RandomX methods

2.1.0

3 years ago

Known Issues

half8 "equals" and "not equals" operators don't conform to the IEEE 754 standard - Unity has not yet reacted to my bug-report in regards to their "half" implementation

Fixes

fixed a Burst compilation error triggered by "Sse4_1.blend_epi16" when compiling for SSE2, caused by fallback code not using a constant value for "imm8"
fixed incorrect CPU feature checks for quarter vector type-conversion code when compiling for SSE2
fixed "tzcnt" implementations (were completely broken)
fixed scalar (single value and C# fallback) "lzcnt" implementations for (s)byte and (u)short values and (u)long4 vectors

Additions

added "ulong countbits(void* ptr, ulong bytes)", which counts the number of 1-bits in a given block of memory, using Wojciech Mula's SIMD population count algorithm
added high performance and/or SIMD "gcd" a.k.a. greatest common divisor functions for (u)int, (u)long and all integer vector types, which always return unsigned types and vectors
added high performance and/or SIMD "lcm" a.k.a. least common multiple functions for (u)int, (u)long and all integer vector types, which always return unsigned types and vectors
added high performance and/or SIMD "intsqrt" - integer square root (floor(sqrt(x))) - functions for all integer- and integer vector types, with the functions for signed integers and vectors throwing an ArgumentOutOfRangeException in case a value is negative
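
As a reference for the "intsqrt" semantics, here is a minimal scalar sketch of floor(sqrt(x)) using the classic bit-by-bit method, plus a signed overload that throws for negative input. The names IntSqrtSketch and IntSqrt are hypothetical and the algorithm is a stand-in chosen for clarity; it is not claimed to be what MaxMath does internally.

```csharp
using System;

// Hypothetical scalar reference; not MaxMath's implementation.
public static class IntSqrtSketch
{
    // floor(sqrt(x)) computed one result bit at a time, using integers only.
    public static uint IntSqrt(uint x)
    {
        uint result = 0;
        uint bit = 1u << 30;            // highest power of four that fits in 32 bits
        while (bit > x) bit >>= 2;      // skip powers of four larger than x

        while (bit != 0)
        {
            if (x >= result + bit)
            {
                x -= result + bit;
                result = (result >> 1) + bit;
            }
            else
            {
                result >>= 1;
            }
            bit >>= 2;
        }
        return result;
    }

    // Signed overload: negative input is rejected, as described for the signed functions.
    public static int IntSqrt(int x)
    {
        if (x < 0) throw new ArgumentOutOfRangeException(nameof(x));
        return (int)IntSqrt((uint)x);
    }
}
```

For example, IntSqrtSketch.IntSqrt(17u) returns 4 and IntSqrtSketch.IntSqrt(-1) throws an ArgumentOutOfRangeException.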

Improvements

performance improvements of "avg" functions for signed integer vectors
added SIMD implementations of the "transpose" functions for all matrix types
added SSE4 and SSE2 fallback code for variable bitshifts ("shl", "shrl" and "shra")
added SSE2 fallback code for (s)byte vector-by-vector division and modulo operations
added SSE2 fallback code for "all_dif" for (s)byte16, (u)short8 and (u)int8 vectors
added SSE2 fallback code for typecasting, propagating through the entire library
added SSE2 fallback code for "addsub" and "subadd" functions
bitmask32 and bitmask64 now allow for masks to be up to 32 and 64 bits wide, respectively

Changes

renamed "BurstCompilerException" to "CPUFeatureCheckException"
"shl", "shrl" and "shra" now have undefined behavior when shifting by an amount outside of the interval [0, 8 * sizeof(integer_type) - 1], for performance reasons and because of differences between SSE, AVX and managed C#

Fixed Oversights

added "shl", "shrl" and "shra" (varying per element) functions for (s)byte and (u)short vectors
added "ror" and "rol" (varying per element) functions for (s)byte and (u)short vectors
added "compareto" functions for all vector types except half- and quarter vectors
added "all_dif" functions for (s)byte32 vectors
added vshr/l and vror/l functions for (s)byte32 and (u)short16 vectors

2.0.0

3 years ago

Re-Release Notes

Version 2.0.0 adds - for the first time - fallback procedures from Avx2 to Sse4, Sse2 and platform independent instruction sets, respectively, with some major optimizations for all of them
ARM and other instruction sets do NOT have optimized fallback procedures written for them, and there are no plans for it at this time. Burst/LLVM are good at recognizing the patterns in the code, though, and some of the code will be vectorized for other platforms (confirmed)

Known Issues

half8 "equals" and "not equals" operators don't conform to the IEEE 754 standard - Unity has not yet reacted to my bug-report in regards to their "half" implementation

Fixes

fixed incorrect bool4 subvector getters of the bool8 type

Improvements

removed "fixed" vector element access to improve performance in managed C#

Additions

added "shuffle(vector, vector, ShuffleComponent(, ShuffleComponent)(, ShuffleComponent)(, ShuffleComponent))" functions for (s)byte, (u)short, (u)long, quarter and half vectors

Changes

Bump com.unity.burst to version 1.4.4

Fixed Oversights

Added "addsub" function for floating point types, complementary to "subadd"
Added "addsub" and "subadd" functions for integer types

1.2.0

3 years ago

Known Issues

half8 "==" and "!=" operators don't conform to the IEEE 754 standard - Unity has not yet reacted to my bug-report in regards to their "half" implementation.

Fixes

Added preliminary safety cast to a float of the half value in toboolsafe() until Unity fixes their half '==' and '!=' operators according to IEEE 754

Additions

"quarter" precision floats and vectors

"quarter" is an 8-bit IEEE 754 1.3.4.-3 floating point value (1 sign bit, 3 exponent bits, 4 mantissa bits, exponent bias of 3), often called a "minifloat"
It has a very limited range of [-15.5, 15.5] with an epsilon of 0.015625. All integers, as well as i + 0.5, within that range can be represented as a quarter
Type conversion from - and to quarters also conforms to the IEEE 754 standard. In detail, casting to a quarter performs rounding according to a) its precision and b) whether or not the more precise value is closer to 0 or to quarter.Epsilon. NaN and +/- zero preservation, as well as preservation of/clamping to +/- infinity, was also implemented
"==" and "!=" operators for vectors conforming to the IEEE 754 standard were implemented (unlike, currently, Unity's "half" type). All the other boolean- and arithmetic operators were implemented for the base type only, which will return single precision results (for arithmetic operations). For vectors, quarter vectors are to be (implicitly) cast to single precision vectors first, until/if Unity changes their "half" implementation.
Type conversions from - and to all other single value and vector types were implemented
Full function implementation within the library was added, including: abs(), isnan(), isinf(), isfinite(), select(), as[s]byte/asquarter(), vrol/r(), vshl/r(), toboolsafe and toquartersafe
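
To make the stated range and epsilon concrete, the sketch below decodes an 8-bit pattern into a float. The bit layout is an assumption: it reads "1.3.4.-3" as 1 sign bit, 3 exponent bits, 4 mantissa bits and an exponent bias of 3, with the usual IEEE 754 conventions for subnormals, infinity and NaN. This reading reproduces the maximum of 15.5 and the epsilon of 0.015625 given above, but it is not taken from MaxMath's source. QuarterSketch and ToSingle are hypothetical names.

```csharp
using System;

// Hypothetical decoder for the assumed 1.3.4 layout with an exponent bias of 3.
public static class QuarterSketch
{
    public static float ToSingle(byte bits)
    {
        int sign     = (bits >> 7) & 1;      // bit 7
        int exponent = (bits >> 4) & 0b111;  // bits 6..4
        int mantissa = bits & 0b1111;        // bits 3..0

        float magnitude;
        if (exponent == 0b111)
        {
            // All exponent bits set: infinity if the mantissa is zero, NaN otherwise.
            magnitude = mantissa == 0 ? float.PositiveInfinity : float.NaN;
        }
        else if (exponent == 0)
        {
            // Subnormal: mantissa / 16 * 2^(1 - 3) = mantissa * 2^-6.
            magnitude = mantissa * 0.015625f;
        }
        else
        {
            // Normal: (1 + mantissa / 16) * 2^(exponent - 3).
            magnitude = (16 + mantissa) * 0.0625f * (1 << exponent) * 0.125f;
        }

        return sign == 1 ? -magnitude : magnitude;
    }
}
```

Under this assumed layout, the largest finite pattern 0x6F decodes to 15.5 and the smallest positive pattern 0x01 decodes to 0.015625.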

Fixed Oversights

Added missing type conversions from - and to half8 for (s)byte8, (u)short8 and (u)int8 vectors
Added missing type conversions from - and to half8 for booleans and boolean vectors
Added half "select" functions
Improved the performance of unsafe boolean-to-half/float/double functions
added (preliminary?) "abs", "isnan", "isinf" and "isfinite" for half and half vectors, eliminating unnecessary casting

1.1.0

3 years ago

Known Issues

half8 "equals" and "not equals" operators don't conform to the IEEE 754 standard - Unity has not yet reacted to my bug-report in regards to their "half" implementation.

Fixes

Fixed a bug where vshl-/vshr-ing a(n) (s)byte16 vector by 11 would return the vector itself

Changes

Changed the return type of count(boolx) to a uint instead of an int

Additions

RNG

Added/Modified 8, 16, 32 and 64 bit XOR-Shift pseudo random number generators:
They use the most efficient (Avx2) SIMD instructions to generate vectors with elements of the corresponding size in bytes. When compared to Unity.Mathematics, the performance is better since a) scalar multiplication of each generated value has been replaced by a single SIMD instruction and b) doubles are generated by Random64 instead of two 32-bit RNG iterations
Removed NextT(T max) from SIGNED integer and floating point types, since those will never generate negative numbers. One can either generate an unsigned integer and cast it to a signed value for free, or use the functions with min and max parameters, as both of these would be more clear in regards to what range the result will be in
Unity.Mathematics.Random is implicitly convertible to Random32 and vice versa. Safe and fast explicit type conversions between Random8, 16, 32 and 64 were added
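
For readers unfamiliar with XOR-Shift generators, here is a minimal scalar 32-bit sketch. It uses Marsaglia's classic 13/17/5 shift triple, which is an assumption for illustration; the constants and the vectorized code in Random8/16/32/64 may differ. XorShift32Sketch and its members are hypothetical names. NextInt shows the "cast an unsigned result to signed for free" approach mentioned above.

```csharp
using System;

// Hypothetical scalar XOR-Shift generator; not MaxMath's Random32.
public struct XorShift32Sketch
{
    private uint state;

    public XorShift32Sketch(uint seed)
    {
        // An all-zero state would stay zero forever, so it is replaced.
        state = seed != 0 ? seed : 1u;
    }

    // One XOR-Shift step: three shift-and-xor operations, no multiplication.
    public uint NextUInt()
    {
        state ^= state << 13;
        state ^= state >> 17;
        state ^= state << 5;
        return state;
    }

    // Reinterpreting the bits as signed costs nothing and covers the full int range.
    public int NextInt() => unchecked((int)NextUInt());
}
```

Seeded with 1, the first call to NextUInt() returns 270369.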

Shuffle

Added bool8, bool16 and bool32 subvector getters
Added setters for half8 subvectors
Added (s)byte32 subvector getters
Added setters for (s)byte8 and (s)byte16 subvectors
Added setters for (u)short8 subvectors
!!! Setters for (s)byte32, (u)short16, (u)int8 and float8 subvectors are implemented, but due to Unity.Burst related bugs, they are deactivated in the code. The issue was forwarded a month ago and should be fixed with Burst 1.5
Slightly improved the performance of a select few (s)byte2/3/4 vector shuffle getters and setters

1.0.1

3 years ago

Removed all Burst.AssumeRange compiler promises from function parameters, which would prevent some debug-checks from being executed in burst compiled code
... but added all missing "AssumeRange" attributes to return values of all functions
added missing dot product for byte and sbyte vectors
added the mathematical constants tau (2 * pi), phi (golden ratio), square root of 3, cube root of 2 and cube root of 3
Discovered and forwarded serious bugs with Unity.Mathematics' "half" type. Will not be changing my implementation in regards to half8 until they have fixed it (for reasons of API consistency): https://github.com/Unity-Technologies/Unity.Mathematics/issues/185
