A C# SIMD math library, for use with Unity only, that substantially extends Unity.Mathematics with new types and functions and is built on Unity.Burst.
- `half8` `==` and `!=` operators don't conform to the IEEE 754 standard (compliant with Unity.Mathematics)
- `(s)byte`, `(u)short` vector and `(U)Int128` multiplication, division and modulo operations by compile time constants are not optimal
- `(U)Int128` comparison operators didn't make it into this release
- `bool` vectors generated from operations on non-`(s)byte` vectors do not generate optimal machine code, partly due to an LLVM performance regression, partly due to other compiler related difficulties
- `(s)byte32` `all_dif` lookup tables are currently way too large (kilobytes)
- `bool8/16/32` are now blittable when not used within an `IJob`
- `comb(n, k)` for scalar- and vector integer types. This is known as the binomial coefficient or "n choose k". An optional `Promise` parameter can select an O(1) code path using the factorial formula, whereas the standard approach, which cannot ever overflow unless the result itself overflows (which is not true for most solutions found online that claim it), uses an O(min(k, n - k)) algorithm with respect to time
- `perm(n, k)` for scalar- and vector integer types. This is known as "k-permutations of n". An optional `Promise` parameter can select an O(1) code path using the factorial formula, whereas the standard approach, which cannot ever overflow unless the result itself overflows, uses an O(k) algorithm with respect to time
- `nextgreater(x)` for all types. For integer types, it is a wrapper function for `addsaturated(x, 1)`. For floating point types, it returns the next greater representable floating point value(s), unless x is NaN or infinite. An optional `Promise` parameter allows for numerous optimizations.
- `nextsmaller(x)` for all types. For integer types, it is a wrapper function for `subsaturated(x, 1)`. For floating point types, it returns the next smaller representable floating point value(s), unless x is NaN or infinite. An optional `Promise` parameter allows for numerous optimizations
- `nexttoward(from, to)` for all types, returning the next representable integer/floating point value(s) in a given direction, unless `from` is equal to `to`. For floating point types, `from` is returned if `from` is NaN or infinite. If `to` is NaN, NaN is returned. An optional `Promise` parameter allows for numerous optimizations.
- `(u)long2`, specifically, even when the quotient and/or remainder vector is in the middle of a dependency chain and even in tight loops, and is thus only implemented for `(u)long3/4` types and only if compiling for AVX2
- `(s)byte8` and every `(u)short` vector division if not compiling with `FloatMode.Fast`. Reduced constants possibly read from RAM in either case.
- `lcm` for `(s)byte` vectors with 8 elements or less: decreased code size by 20 or 28 bytes; removed 2 or 4 or 8 bytes of constant data read from RAM; reduced latency by 2 or 3 clock cycles
- `(u)long`
scalar- and vector `intcbrt` `Promise.Unsafe0` range from [0, 1ul << 40] to [0, 1ul << 46], the code path of which is also possibly chosen at compile time
- `quarter{X}` IEEE-754 comparison operators (without having to cast to `float{X}`). Vectorized `halfX` comparisons are implemented in `MaxMath.Intrinsics.Xse` as well and used where appropriate. `compareto` with `quarter{X}` and `half{X}` function overloads were implemented.
- `add/subsaturated` for scalar `Int128`s, scalar and vector `long`s as well as vector `int`s by about a third
- `(U)Int128.ToString(null, null)`'s call to `BigInteger.ToString()` and thus unnecessary heap allocations with an optimized implementation
- `(u)short8` `/` and `%` operators now correctly check for SSE2 support rather than AVX2
- `Promise` argument which is not a compile time constant will throw an exception in `DEBUG`, as it represents significant overhead instead of an optimization. This will currently not inform users of the name of the function but rather the Burst compiled job/function that threw it.
- `explicit` type conversion operators for scalar `float`s and `double`s to `half8` and all `quarter` vectors (as well as scalar `half`s to `quarter` vectors)
- `half8`
`==` and `!=` operators don't conform to the IEEE 754 standard (compliant with Unity.Mathematics)
- `(s)byte`, `(u)short` vector and `(U)Int128` multiplication, division and modulo operations by compile time constants are not optimal
- `(U)Int128` comparison operators didn't make it into this release
- `bool` vectors generated from 256 bit input vectors like so: `long4 x = select(a, b, >>> myLong4a < myLong4b <<<)` (as an example) do not generate the most efficient machine code possible
- `bits_zerohigh` functions fail 100% of the time because of a bug related to the managed debug implementation of intrinsics (reported)
- `int`s which would be NaN if dereferenced as a `float` and written back to memory (reported)
- `(s)byte32` `all_dif` lookup tables are currently way too large (kilobytes)
- `quarter` rounding behavior when casting a wider floating point type to a `quarter` to round towards the nearest representable value instead of truncating the mantissa
- `MaxMath.Intrinsics` for users who want to use the math library through "high level" X86 intrinsics. Because users need to guard their intrinsics code with e.g. `if (Burst.Intrinsics.X86.Sse2.IsSse2Supported)` blocks and supported architectures vary (slightly) from function to function, these are considered unsafe, undocumented and unrecommended and only serve as an exposed layer of abstraction which is used internally anyway.
- `Promise`, with values `Nothing`, `Everything`, `NoOverflow`, `ZeroOrGreater`, `ZeroOrLess`, `NonZero` and `Unsafe` 0 through 3 as well as the composites `Positive` and `Negative`. This flags enum is only ever used as an optional parameter and offers faster, yet more unsafe code. Specifics vary between functions and sometimes even overloads but are documented accordingly. Optimizations are only ever to be added, not removed (= a ...promise... of never introducing breaking changes in this regard)
- `factorial` (for integer types) and `gamma` (floating point types) functions. `factorial`, when called without a `Promise` parameter, clamps the result to `type.MaxValue` in case of overflow
- `erf(c)`, the (complementary) error function for floating point types
- `(c)minmag` and `(c)maxmag` functions, returning the (componentwise) minimum/maximum magnitude of two values or within a vector; equivalent to `abs(x) > abs(y) ? x : y` (`maxmag`) or `abs(cmin(c)) > abs(cmax(c)) ? cmin(c) : cmax(c)` (`cmaxmag`)
- `(c)minmax` and `(c)minmaxmag` functions which return both the (componentwise/columnwise) minimum and maximum (magnitude) as `out` parameters
- `bitfield` functions for scalar and vector integer types - small utility functions that pack several smaller integers into bigger ones
- `copysign(x, y)` functions for signed types, which is equivalent to `return y < 0 ? nabs(x) : abs(x)`
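
The equivalences quoted above for `copysign` and `maxmag` can be written out for plain scalar `int`s. This is only a sketch in standard C#, not the library's vectorized code: `ScalarSketch`, `NAbs`, `CopySign` and `MaxMag` are hypothetical names, and `NAbs` assumes that `nabs` means the negated absolute value.

```csharp
using System;

public static class ScalarSketch
{
    // Negated absolute value (assumed meaning of nabs): the result is always <= 0.
    // Unlike Math.Abs it cannot overflow, because every int has a
    // representable non-positive counterpart.
    public static int NAbs(int x) => x > 0 ? -x : x;

    // Mirrors the quoted expression "y < 0 ? nabs(x) : abs(x)".
    // Math.Abs(int.MinValue) throws an OverflowException, so that input is
    // deliberately left unhandled in this sketch.
    public static int CopySign(int x, int y) => y < 0 ? NAbs(x) : Math.Abs(x);

    // Mirrors "abs(x) > abs(y) ? x : y". Magnitudes are compared in 64 bits
    // so that int.MinValue does not overflow. On equal magnitudes y is returned.
    public static int MaxMag(int x, int y) =>
        Math.Abs((long)x) > Math.Abs((long)y) ? x : y;
}
```

For instance, `ScalarSketch.CopySign(5, -1)` yields `-5` and `ScalarSketch.MaxMag(-7, 3)` yields `-7`.
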
- `float`/`double` inverse hyperbolic functions `asinh`, `acosh` and `atanh`
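
For reference, the inverse hyperbolic functions have textbook closed forms in terms of the natural logarithm. The sketch below writes those identities with `System.Math`; it makes no claim about how the library computes them, and the naive formulas lose precision for very small or very large inputs. `InverseHyperbolicSketch` is a hypothetical name.

```csharp
using System;

public static class InverseHyperbolicSketch
{
    // asinh(x) = ln(x + sqrt(x^2 + 1)), defined for all real x.
    public static double Asinh(double x) => Math.Log(x + Math.Sqrt(x * x + 1.0));

    // acosh(x) = ln(x + sqrt(x^2 - 1)), defined for x >= 1.
    // For x < 1 the square root is NaN, so the result is NaN.
    public static double Acosh(double x) => Math.Log(x + Math.Sqrt(x * x - 1.0));

    // atanh(x) = 0.5 * ln((1 + x) / (1 - x)), defined for |x| < 1.
    public static double Atanh(double x) => 0.5 * Math.Log((1.0 + x) / (1.0 - x));
}
```
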
- `intlog10` functions (integer base ten logarithm)
- `bit test`/`bt` family of functions for scalar and vector integer types. A `testbit(POST_ACTION)((ref)x, i)` function returns a boolean (vector), indicating whether the bit in `x` at index `i` is 1 and may (or may not) flip, set, or reset that bit afterwards
- `to(u)longunsafe` and `todoubleunsafe` with a `Promise` parameter, allowing for up to two levels of optimization (vectorized 64 bit int <-> 64 bit float is not hardware supported). Details in the XML documentation. Default `double` <-> `(u)long` conversion operators - apart from having their 4-element version improved - now check whether or not a safe range for unsafe conversions can be validated at compile time
- `toquarterunsafe` allowing for each type to be converted to a quarter type while specifying whether the input value will or will not overflow and/or is >= 0
- `Avx.[...]undefined[...]` compiler intrinsics or through controlled undefined behaviour, by declaring an uninitialized variable and using pointer syntax to force the C# compiler into trusting that the variable has been fully initialized; this cannot lead to memory access violations, since the variable is declared and thus enough space is reserved on the stack, before it is optimized away by LLVM and assigned a hardware register instead, with undefined upper elements. This allows for upper elements of hardware registers to be ignored during compilation. Unnecessarily emitted instructions like `movq xmm0, xmm0` (move the low 8 bytes from a register to the same register, zeroing out the upper 8 bytes, even though only the lower 8 bytes will be written back to memory) or far worse instruction sequences, for example when using vectors with 3 elements, are now (MOSTLY; there's still work to be done) omitted instead. Although most zero-upper-elements instruction( sequence)s only took a single clock cycle, they were always part of each dependency chain and could happen between almost each function call, including operators of course. The same improvements apply to `Unity.Mathematics` types when passed to `maxmath` functions.
- `Unity.Burst.CompilerServices.Constant.IsConstantExpression` condition checks to many more functions within the library. Most notably, algorithms where the total latency is dependent on the byte size of arguments may now perform much faster. Some but not yet all of these constant checks are exposed through a `Promise` parameter
- `(u)short` to `(u)short2/3/4` conversion
- `all`, `any`, `first`, `last`, `count` and `bitmask` functions for `bool8/16/32` when used with an expression as the argument, such as `all(x != y)` - a way to force the compiler to omit unnecessary instructions was found
- `addsaturated`
for scalar unsigned integer types
- `float`/`double` to `(U)Int128` conversion
- `shl`, `shrl` and `shra` and thus all functions using those - especially for: `shl` for `(s)byte` vectors of all sizes if compiling for SSE4 and 32 byte sized vectors if compiling for AVX2; `shl` for `(u)short` vectors of 4 or more elements if compiling for at least SSE4; `shra` for `(u)long` vectors if compiling for AVX2 and the vector containing the shift amounts is a compile time constant.
- `long2/3/4` `shra` code size and latency by another 2 clock cycles if compiling for AVX2
- `rol/r` vector functions beyond `shl/r` improvements and added an optional `Promise` parameter, allowing the caller to promise the rotation values are in a specific range
- `long2/3/4` "is negative checks" - `mylong4 < 0`/`0 > mylong4` by 33% by doubling its code size. This further improves performance/adds to code size of functions in the library
- `(u)long2/3/4` `isinrange` functions
- `byte` and `ushort` vector to float vector conversion. This also affects performance of `(s)byte` and `(u)short` vector `intsqrt` functions, as well as the respective `%` and `/` operators (`byte2/3/4/8`, all `ushort` vectors)
- `(u)long` vector `intcbrt` latency by ~45% and reduced code size by ~20% (roughly 150 bytes). For other integer vector types, the latency has been reduced by ~8 to ~15 clock cycles
- `exp2`
scalar and vector integer argument function overloads. These return `exp2((float/double)x)` or `(float/double)(1 << x)` in 3 instead of 6 to 7 clock cycles at best; they of course also work for negative input values i.e. reciprocals of powers of 2. The `(u)int` overloads convert to `float`s, the `(u)long` overloads convert to `double`s; explicit integer to integer casting should (and sometimes has to) be used for optimal results. Additionally, these overloads contain an optional `Promise` parameter, allowing for omission of clamping which is needed to ensure correct underflow/overflow behavior, as dictated by Unity's `exp2` implementation. If you ever used the standard `exp2` function by implicitly converting an `int` type to a `float` type, performance was improved by a factor of about 30x. This overload only "breaks" code that casts `(u)long` types to `float` types implicitly if the result is expected to be a `float` type. It is recommended to explicitly cast the `(u)long` type to a `(u)int` type in such a case
- `==`, `!=`, `<`, `>`, `<=` and `>=` operators for `UInt128` and signed `long`/`int` comparisons, as the expensive float conversion and comparison was previously used when, for instance, comparing a UInt128 to a constant int such as 1 or 0
- `(u)int` and `(u)long` division/modulo algorithms. `(u)long` performance gains are only noticeable under certain conditions; the `(u)int` performance gain is substantial (and unfortunately not used for `(u)int2/3/4` by LLVM/Burst - these are now exposed as further `div` overloads and new `mod` functions). Other functions than operator overloads are positively affected
- `(s)byte2/3/4` shuffles, e.g. `myByte4.xzzw`
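
The integer `exp2` overloads described above return a power of two as a floating point value. The notes do not say how the library does this; one well-known way, assumed here purely for illustration, is to write the biased exponent straight into the IEEE 754 bit pattern. `Exp2Sketch` is a hypothetical name, and the clamping only stands in for the underflow/overflow handling the notes mention. It flushes subnormal results to zero, which may differ from the library.

```csharp
using System;

public static class Exp2Sketch
{
    // Builds 2^x as a float without converting x to float and calling exp2:
    // a float with a zero mantissa and biased exponent (x + 127) equals 2^x.
    public static float Exp2(int x)
    {
        // -127 maps to exponent bits 0, which encodes 0.0f in this sketch.
        // 128 maps to exponent bits 255 with a zero mantissa, i.e. +Infinity.
        int clamped = Math.Clamp(x, -127, 128);
        int bits = (clamped + 127) << 23;
        return BitConverter.Int32BitsToSingle(bits);
    }
}
```

Skipping the `Math.Clamp` call corresponds to what a caller would be promising with a `Promise` argument: that `x` is already inside the representable exponent range.
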
- `double4` to `(u)long4` conversions if compiling for AVX2
- `byte` vector division/modulo operation by a scalar compile time constant. Many, if not most, were not even auto-vectorized, let alone optimized for SIMD instructions instead of general purpose register instructions, which were translated poorly if vectorized
- `double` precision `(r)cbrt`'s `math.pow(x, (-)1d/3d)` call with an optimized implementation
- `float` scalar- and vector `(r)cbrt` by ~1 + (1 or 2) * ~4 clock cycles, while also gaining a small amount of precision; reduced code size, as well as the number of required compile time constants
- `float` scalar- and vector `(r)cbrt` which handle negative inputs accurately (i.e. the new standard) by one clock cycle, making it just one clock cycle slower than the unsafe version, mostly just providing a somewhat considerable advantage with regard to code size
- `(s)byte16` and `(u)short16` `all_dif` lookup table size by 896 bytes (traded for an increase of 8 bytes in code size so this doesn't save RAM; it potentially reduces memory latency as well as register spilling onto the stack)
- `(u)int8` `t/lzcnt` latency by ~10%, also positively affecting `(u)int2/3/4/8` `gcd` and `lcm` performance, as it is part of a loop within `gcd`
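
The last entry says a trailing/leading zero count sits inside the `gcd` loop. The notes do not name the algorithm; the binary GCD (Stein's algorithm) is the usual one with that shape, so the scalar sketch below assumes it. `GcdSketch` is a hypothetical name and this is not the library's vector code.

```csharp
using System;
using System.Numerics;

public static class GcdSketch
{
    // Binary GCD: divisions are replaced by shifts, and the number of
    // positions to shift comes from a trailing zero count in every iteration.
    public static uint Gcd(uint a, uint b)
    {
        if (a == 0) return b;
        if (b == 0) return a;

        // Power of two shared by both inputs; restored at the end.
        int shift = BitOperations.TrailingZeroCount(a | b);

        a >>= BitOperations.TrailingZeroCount(a); // a is odd from here on

        while (b != 0)
        {
            b >>= BitOperations.TrailingZeroCount(b); // make b odd
            if (a > b) { uint t = a; a = b; b = t; }  // keep a <= b
            b -= a;                                   // odd - odd: even or zero
        }

        return a << shift;
    }
}
```

Because the zero count runs once per iteration, any latency saved on it is multiplied by the iteration count, which is consistent with the knock-on effect on `gcd` and `lcm` noted above.
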
- `double` and `float` to `quarter` conversion latency (15+ clock cycles down to 7, optimally (CPU dependent)), code size and the number of constants being used. This affects scalar and vector conversions; the scalar versions are now branch free.
- `(s)byte32` and `(u)short16` `all_dif` functions

`avg` Overhaul

- `avg` overloads which calculate the average value of a vector itself to `cavg` for consistency reasons (`max` vs `cmax`, for instance)
- `(c)avg` calculations can no longer result in overflow of intermediate calculations and thus incorrect results (lower performance by default)
- `Promise` parameters to most `(c)avg` overloads. These can bring back the previous performance of 32- and 64bit integer overloads
- `(c)avg` overloads
- `(U)Int128((u)long lo64, (u)long hi64)` constructors are now `public`
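
The `(c)avg` entries above are about intermediate overflow. A scalar sketch of the problem and of one straightforward fix, widening the intermediate sum, follows. `AvgSketch` and its members are hypothetical names; the rounding here (towards zero) is just what C# integer division does and is not a statement about the library's rounding.

```csharp
using System;

public static class AvgSketch
{
    // Naive version: the 32 bit sum wraps around before the division,
    // so large inputs give a wrong result.
    public static int AvgNaive(int a, int b) => unchecked(a + b) / 2;

    // Widened version: the sum is formed in 64 bits and cannot overflow.
    public static int AvgWidened(int a, int b) => (int)(((long)a + b) / 2);

    // Average of all elements, again with a 64 bit accumulator.
    // An empty array is not handled and throws DivideByZeroException.
    public static int CAvg(int[] v)
    {
        long sum = 0;
        foreach (int e in v) sum += e;
        return (int)(sum / v.Length);
    }
}
```

The widened versions cost extra work, which matches the "lower performance by default" remark; a caller who can rule out overflow has no need for the widening.
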
- `Int128` `intsqrt` overload now returns a `ulong`
- `float` `(r)cbrt` `bool` parameter `handleNegativeInput` with a `Promise` parameter and removed it from the `double` overloads completely, with having its `NonNegative` flag set being a requirement for the faster version. This is first due to the introduction of the `Promise` type and thus for consistency reasons. Also, the optimized `double` implementation handles negative numbers for free, which is now the standard behavior.
- `intcbrt` `bool` parameter `handleNegativeInput` with a `Promise` parameter for reasons mentioned above, also handling negative input values correctly by default
- C# Dev Tools to version 1.0.8
- `.meta` files are now included to allow for adding the repository to Unity projects via its GitHub URL
- `floorpow2` function overloads for scalar `(u)int` and `(u)long` types
- `double` scalar- and vector `fastrcp` and `fastrsqrt` overloads (substantially less accurate than `FloatMode.Fast`, `FloatPrecision.Low` `1d / x` or `1d / sqrt(x)`)
- `bits_` prefix) now have their vector equivalents implemented as overloads
- `half8` `==` and `!=` operators don't conform to the IEEE 754 standard - Unity has not yet reacted to my bug report regarding their "half" implementation
- `(s)byte`, `(u)short` vector and `(U)Int128` multiplication, division and modulo operations by compile time constants are not optimal. For (U)Int128, it requires a new Burst feature à la `T Constant.ForceCompileTimeEvaluation<T, U>(Func<U, T> code)` (proposed); currently work is being done on `(s)byte` and `(u)short` vectors in this regard, which will beat any compiler. The current (tested) state of all optimizations possible is included in this version.
- `pow` functions with compile time constant exponents currently do not handle many decimal numbers - `math.rsqrt` would often be used in those cases for optimal performance but it is actually slower when the `Unity.Burst.FloatMode` is set to anything but `FloatMode.Fast`. To guarantee optimal performance, compile time access to the current `FloatMode` would be needed (proposed)
- `double` `(r)cbrt` functions are currently not optimized
- `float8` `rcp` and `rsqrt` functions to Burst's `FloatMode` and `FloatPrecision`
- `short.MinValue / -1` now correctly overflows to `short.MinValue` when dividing a `short16` vector by another `short16` vector when compiling for AVX or higher
- `quarter` to `double` conversion for when the `quarter` value is negative
- `half` to `quarter` conversion for when the `half` value is negative
- `quarter` to `ulong` conversion for when a `quarter` value is negative
- `(u)short8` to `quarter8` conversion
- `type.MinValue` and `type.MaxValue` if under- or overflow occurs, respectively and has single-instruction hardware support for `(s)bytes` and `(u)shorts`. The included functions are:
  - `addsaturated`
  - `subsaturated`
  - `mulsaturated`
  - `divsaturated` (only clamps division of floating point types and signed division of, for instance, `sbyte.MinValue` ( = -128) / `-1` to `sbyte.MaxValue` ( = 127), which would cause a hardware exception for `int`s and `long`s)
  - `castsaturated` (all types to all other types with a smaller range)
  - `csumsaturated`
  - `cprodsaturated`
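
The clamping behavior listed above can be illustrated for scalar `sbyte` values. This is a minimal sketch in plain C# that computes in `int` and clamps afterwards; it is not how the library's SIMD code works, and `SaturatedSketch` with its members are hypothetical names that only echo the function names in the list.

```csharp
using System;

public static class SaturatedSketch
{
    // Clamp an int result into the sbyte range [-128, 127].
    private static sbyte Clamp(int value) =>
        (sbyte)Math.Clamp(value, sbyte.MinValue, sbyte.MaxValue);

    // sbyte operands are promoted to int, so the intermediate cannot overflow.
    public static sbyte AddSaturated(sbyte a, sbyte b) => Clamp(a + b);
    public static sbyte SubSaturated(sbyte a, sbyte b) => Clamp(a - b);
    public static sbyte MulSaturated(sbyte a, sbyte b) => Clamp(a * b);

    // The only integer quotient that leaves the range is -128 / -1 = 128,
    // which is clamped to 127. Division by zero still throws.
    public static sbyte DivSaturated(sbyte a, sbyte b) => Clamp(a / b);

    // Narrowing conversion from a wider type.
    public static sbyte CastSaturated(int x) => Clamp(x);
}
```

For example, `SaturatedSketch.AddSaturated(100, 100)` gives `127` rather than wrapping around to `-56`, and `SaturatedSketch.DivSaturated(sbyte.MinValue, -1)` gives `127`.
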
- `(U)Int128` types with full library support, meaning: all operators and type conversions as well as all functions support these types. Most operations of both types, in Burst code, compile down to optimal machine code. Exceptions: 1) signed 64x64 bit to 128 bit multiplication 2) `*`, `/`, `%` and `divrem` functions with a scalar compile time constant argument (See: Known Issues 2)
- `Random128` XOR-Shift pseudo random number generator for generating `(U)Int128`s
- `(r)cbrt` - (reciprocal) cube root functions for scalar and vector `float`- and `double` types based on a research paper from 2021. An optional `bool` parameter allows the caller to decide whether or not negative input values should be handled correctly (which is not the case with `math.pow(x, 1f/3f)`), which is set to `false` by default
- `intcbrt` - integer cube root functions for all scalar and vector integer types. For signed integer types, an optional `bool` parameter allows the caller to decide whether or not negative input values should be handled correctly (which is not the case with `math.pow(x, 1f/3f)`), which is set to `false` by default
- `log` function to all scalar and vector `float`- and `double` types with a second parameter `b`, which is the logarithm's base
- `reversebytes` functions for all scalar- and vector types, which convert back and forth between big endian and little endian byte order, respectively. All of them (scalar, vector) compile down to single hardware instructions
- `pow` functions with scalar exponents for `float` and `double` scalars and vectors, with optimizations for selected constant exponents (not necessarily whole exponents)
- `(s)byte`s and `(u)short`s in order to resolve function call resolution ambiguity which was already present in `Unity.Mathematics`, which may also improve performance in some cases
- `New`
property to `RandomX` XOR-Shift pseudo random generators. It calls `Environment.TickCount` internally (and is thus seeded somewhat randomly), makes sure it is non-zero and can be called from Burst native code
- `fastrcp` functions for `float` scalars and vectors, faster (and substantially less accurate) than `FloatPrecision.Low`, `FloatMode.Fast` Burst implementations
- `fastrsqrt` functions for `float` scalars and vectors, faster (and substantially less accurate) than `FloatPrecision.Low`, `FloatMode.Fast` Burst implementations
- `float8` `sin`, `cos`, `tan`, `sincos`, `asin`, `acos`, `atan`, `atan2`, `sinh`, `cosh`, `tanh`, `pow`, `exp`, `exp2`, `exp10`, `log`, `log2`, `log10` and `fmod` (and the `%` operator)
- `/`, `%`, `*` and `divrem` operations with a scalar compile time constant argument for `(s)byte` vectors (see 'Known Issues 2'), which were previously not optimized (...optimally/at all) by Burst.
- `short16` (256 bit) to `byte16` (128 bit))
- `(s)byte` and `(u)short` `rol` and `ror` functions now compile down to single hardware instructions
- `==`, `>` etc.)
- `toboolX` and thus also `select(vector a, vector b, bitmask c)`);
- `intpow`
functions in general and for when the exponent is a compile time constant
- `compareto` vector functions (especially for unsigned types)
- `isdivisible`
- `intsqrt` functions for `(u)long` and `(s)byte` scalar and vector types considerably
- `ispow2` vector functions
- `(s)byte` vector-by-vector division
- `Random64`'s `(u)long4` generation if compiling for AVX2
- `(s)byte` matrix multiplication
- `(u)short`- and up to `(s)byte8` vector by vector division and `divrem` functions (and improved performance if compiling for SSE2 only)
- `isinrange` functions for `(u)long` vector types
- `ushort` vector `>=` and `<=` operators for SSE2 fallback code by ~75%
- `touint8`/`tof32` etc.) are now renamed to contain C# types in their names (`tobyte`/`tofloat` etc.)
- `Unity.Mathematics.math` statically (`using static MaxMath.maxmath;`) and you use the `pow` functions with scalar bases and scalar exponents in those scripts, you will encounter the first ever function call resolution ambiguity. It is strongly recommended to always use the `maxmath.pow` function, because it optimizes any `pow` call enormously if the exponent is a compile time constant, which does NOT necessarily mean that such a call must declare the exponent as a literal value - the exponent may become a compile time constant due to constant propagation
- `quarter` is now a `readonly struct`
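
The remark above about `pow` with a compile time constant exponent can be pictured with a scalar sketch. The idea is that specific exponents map to much cheaper operations, and that the checks themselves disappear once the compiler knows the exponent. Which exponents the library actually special-cases is not stated here, so the selection below is an assumption, and `PowSketch` is a hypothetical name. Plain C# without Burst gives no guarantee that the branches are folded away.

```csharp
using System;

public static class PowSketch
{
    // Dispatches a few exponents to cheaper operations and falls back to
    // MathF.Pow otherwise. Edge cases such as negative zero or negative
    // infinity can differ from MathF.Pow on the sqrt paths.
    public static float Pow(float x, float exponent)
    {
        if (exponent == 2f) return x * x;
        if (exponent == 0.5f) return MathF.Sqrt(x);
        if (exponent == -1f) return 1f / x;
        if (exponent == -0.5f) return 1f / MathF.Sqrt(x);
        return MathF.Pow(x, exponent);
    }
}
```

A call such as `PowSketch.Pow(v, e)` benefits even when `e` is not written as a literal, as long as constant propagation lets the compiler see its value at the call site.
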
- `quarter` to `sbyte`, `short`, `int` and `long` conversions are now required to be declared explicitly
- `countbits(void* ptr, ulong bytes)` from the library and added it to https://github.com/MrUnbelievable92/SIMD-Algorithms with more options
- `Unity.Mathematics` (`byte4 myByte4 = (maxmath.)byte4(1, 2, 3, 4);`)
- `dsub` - fused divide-subtract function for scalar and vector `float` types
- `bool fast = false` parameter to `dad`, `dsub`, `dadsub` and `dsubadd` functions
- `andnot` function overloads for scalar and vector `bool` types
- `quarter` values to `half`, `float` and `double` vectors
- `all_eq` and `all_dif` functions for vectors of size 2
- `all_eq` and `all_dif` functions for `float` and `double` vectors
- Added documentation for RandomX methods
- Added "addsub" function for floating point types, complementary to "subadd"
- Added "addsub" and "subadd" functions for integer types
- Added missing type conversions from and to half8 for (s)byte8, (u)short8 and (u)int8 vectors
- Added missing type conversions from and to half8 for booleans and boolean vectors
- Added half "select" functions
- Improved the performance of unsafe boolean-to-half/float/double functions
- Added (preliminary?) "abs", "isnan", "isinf" and "isfinite" for half and half vectors, eliminating unnecessary casting
- Added bool8, bool16 and bool32 subvector getters
- Added setters for half8 subvectors
- Added (s)byte32 subvector getters
- Added setters for (s)byte8 and (s)byte16 subvectors
- Added setters for (u)short8 subvectors
- !!! Setters for (s)byte32, (u)short16, (u)int8 and float8 subvectors are implemented, but due to Unity.Burst related bugs, they are deactivated in the code. The issue was forwarded a month ago and should be fixed with Burst 1.5
- Slightly improved the performance of a select few (s)byte2/3/4 vector shuffle getters and setters
- Removed all Burst.AssumeRange compiler promises from function parameters, which would prevent some debug-checks from being executed in Burst compiled code
- ... but added all missing "AssumeRange" attributes to return values of all functions
- Added missing dot product for byte and sbyte vectors
- Added the mathematical constants tau (2 * pi), phi (golden ratio), square root of 3, cube root of 2 and cube root of 3
- Discovered and forwarded serious bugs with Unity.Mathematics' "half" type. Will not be changing my implementation in regards to half8 until they have fixed it (for reasons of API consistency) https://github.com/Unity-Technologies/Unity.Mathematics/issues/185