
MLX: An array framework for Apple silicon

v0.7.0

2 months ago

Highlights

  • Performance improvements for attention ops:
    • No-copy broadcast matmul (benchmarks)
    • Fewer copies in reshape

Core

  • Faster broadcast + gemm
  • mx.linalg.svd (CPU only)
  • Fewer copies in reshape
  • Faster small reductions
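
A minimal sketch of the new CPU-only SVD; the exact return convention (U, S, Vt) and the stream keyword are assumptions:

    import mlx.core as mx

    a = mx.random.normal((6, 4))
    # SVD is CPU only in this release, so place the op on the CPU stream.
    u, s, vt = mx.linalg.svd(a, stream=mx.cpu)
    print(u.shape, s.shape, vt.shape)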

NN

  • nn.RNN, nn.LSTM, nn.GRU
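
A hedged sketch of the new recurrent layers, assuming a (batch, sequence, feature) layout and input/hidden sizes in the constructor:

    import mlx.core as mx
    import mlx.nn as nn

    # Illustrative sizes only.
    lstm = nn.LSTM(input_size=32, hidden_size=64)
    x = mx.random.normal((8, 16, 32))   # (batch, sequence, features)
    hidden, cell = lstm(x)              # per-step hidden and cell states (assumed return)
    print(hidden.shape)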

Bugfixes

  • Fix bug in depth traversal ordering
  • Fix two edge case bugs in compilation
  • Fix bug with modules with dictionaries of weights
  • Fix bug with scatter which broke MOE training
  • Fix bug with compilation kernel collision

v0.6.0

2 months ago

Highlights:

  • Faster quantized matrix-vector multiplies
  • mx.fast.scaled_dot_product_attention fused op

Core

  • Memory allocation API improvements
  • Faster GPU reductions for smaller sizes (between 2x and 7x)
  • mx.fast.scaled_dot_product_attention fused op
  • Faster quantized matrix-vector multiplications
  • Pickle support for mx.array
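
A sketch of the fused attention op; the (batch, heads, sequence, head_dim) layout and the scale keyword are assumptions based on typical SDPA APIs:

    import math
    import mlx.core as mx

    B, H, L, D = 1, 8, 128, 64          # batch, heads, sequence, head_dim (illustrative)
    q = mx.random.normal((B, H, L, D))
    k = mx.random.normal((B, H, L, D))
    v = mx.random.normal((B, H, L, D))
    out = mx.fast.scaled_dot_product_attention(q, k, v, scale=1.0 / math.sqrt(D))
    print(out.shape)  # expected (B, H, L, D)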

NN

  • Dilation on convolution layers

Bugfixes

  • Fix mx.topk
  • Fix reshape for zero sizes

v0.5.0

3 months ago

Highlights:

  • Faster convolutions.
    • Up to 14x faster for some common sizes.
    • See benchmarks

Core

  • mx.where properly handles inf
  • Faster and more general convolutions
    • Input and kernel dilation
    • Asymmetric padding
    • Support for cross-correlation and convolution
  • atleast_{1,2,3}d accept any number of arrays
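
A sketch of the more general convolutions; MLX uses a channels-last layout, and the dilation/padding keywords shown here are assumptions:

    import mlx.core as mx

    x = mx.random.normal((4, 32, 32, 3))   # (N, H, W, C_in), channels last
    w = mx.random.normal((8, 3, 3, 3))     # (C_out, kH, kW, C_in)
    # Kernel dilation with symmetric padding; asymmetric padding is also supported.
    y = mx.conv2d(x, w, stride=1, padding=1, dilation=2)
    print(y.shape)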

NN

  • nn.Upsample layer
    • Supports nearest neighbor and linear interpolation
    • Any number of dimensions
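
A sketch of nn.Upsample; the scale_factor and mode arguments are assumed from the description above:

    import mlx.core as mx
    import mlx.nn as nn

    up = nn.Upsample(scale_factor=2, mode="linear")   # or "nearest"
    x = mx.random.normal((1, 16, 16, 3))              # (N, H, W, C), channels last
    print(up(x).shape)                                # expected (1, 32, 32, 3)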

Optimizers

  • Linear schedule and schedule joiner:
    • Use for e.g. linear warmup + cosine decay
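
A hedged sketch of linear warmup followed by cosine decay; the names linear_schedule, cosine_decay, and join_schedules are assumptions about the new schedule utilities:

    import mlx.optimizers as optim

    # 100-step linear warmup to 1e-3, then cosine decay over 1000 steps (illustrative numbers).
    warmup = optim.linear_schedule(0.0, 1e-3, 100)
    cosine = optim.cosine_decay(1e-3, 1000)
    schedule = optim.join_schedules([warmup, cosine], [100])
    opt = optim.Adam(learning_rate=schedule)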

Bugfixes

  • arange throws on inf inputs
  • Fix CMake build with MLX
  • Fix logsumexp inf edge case
  • Fix grad of power w.r.t. to exponent edge case
  • Fix compile with inf constants
  • Fix temporary bug in convolution

v0.4.0

3 months ago

Highlights:

  • Partial shapeless compilation
    • Default shapeless compilation for all activations
    • Can be more than 5x faster than uncompiled versions
  • CPU kernel fusion

Core

  • CPU compilation
  • Shapeless compilation for some cases
    • mx.compile(function, shapeless=True)
  • Up to 10x faster scatter: benchmarks
  • mx.atleast_1d, mx.atleast_2d, mx.atleast_3d
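
A sketch of shapeless compilation; with shapeless=True the compiled graph is reused across input shapes rather than being recompiled per shape (the reuse behavior is assumed from the description above):

    import mlx.core as mx

    def gelu_ish(x):
        # Simple elementwise function, a good fit for shapeless compilation.
        return x * mx.sigmoid(1.702 * x)

    fast = mx.compile(gelu_ish, shapeless=True)
    print(fast(mx.random.normal((128,))).shape)
    print(fast(mx.random.normal((4, 256))).shape)   # reuses the same compiled graph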

Bugfixes

  • Bug with tolist with bfloat16 and float16
  • Bug with argmax on M3

v0.3.0

3 months ago

Highlights:

  • mx.fast subpackage
  • Custom mx.fast.rope up to 20x faster

Core

  • Support metadata with safetensors
  • Up to 5x faster scatter and 30% faster gather
  • 40% faster bfloat16 quantized matrix-vector multiplies
  • mx.fast subpackage with a fast RoPE
  • Context manager mx.stream to set the default device
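
A short sketch of the mx.stream context manager for setting the default device/stream within a block:

    import mlx.core as mx

    with mx.stream(mx.cpu):
        # Ops created here run on the CPU stream by default.
        a = mx.random.normal((256, 256))
        b = a @ a.T
    mx.eval(b)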

NN

  • Average and Max pooling layers for 1D and 2D inputs

Optimizers

  • Support schedulers for e.g. learning rates
  • A few basic schedulers:
    • optimizers.step_decay
    • optimizers.cosine_decay
    • optimizers.exponential_decay
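
A short sketch of passing one of the new schedulers as the learning rate; the step_decay signature (initial value, decay rate, step size) is an assumption:

    import mlx.optimizers as optim

    # Step decay: multiply the learning rate by 0.9 every 1000 steps (illustrative numbers).
    lr = optim.step_decay(1e-2, 0.9, 1000)
    opt = optim.SGD(learning_rate=lr)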

Bugfixes

  • Fix bug in remainder with negative numerators and integers
  • Fix bug with slicing into softmax
  • Fix quantized matmuls for sizes that are not multiples of 32

v0.2.0

3 months ago

Highlights:

  • mx.compile makes stuff go fast
    • Some functions are up to 10x faster (benchmarks)
    • Training models anywhere from 10% to twice as fast (benchmarks)
    • Simple syntax for compiling full training steps
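
A hedged sketch of compiling a full training step; the toy model, loss, and optimizer are illustrative, and the inputs/outputs state-capture pattern shown here follows later MLX documentation and is assumed to be what this release refers to:

    from functools import partial

    import mlx.core as mx
    import mlx.nn as nn
    import mlx.optimizers as optim

    model = nn.Linear(16, 1)              # toy model for illustration
    opt = optim.SGD(learning_rate=1e-2)

    def loss_fn(model, x, y):
        return nn.losses.mse_loss(model(x), y)

    # Capture model and optimizer state so the compiled step can update it.
    state = [model.state, opt.state]

    @partial(mx.compile, inputs=state, outputs=state)
    def step(x, y):
        loss, grads = nn.value_and_grad(model, loss_fn)(model, x, y)
        opt.update(model, grads)
        return loss

    x, y = mx.random.normal((32, 16)), mx.random.normal((32, 1))
    loss = step(x, y)
    mx.eval(loss, state)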

Core

  • mx.compile function transformation
  • Find devices properly for iOS
  • Up to 10x faster GPU gather
  • __abs__ overload for abs on arrays
  • loc and scale parameters for mx.random.normal

NN

  • Margin ranking loss
  • BCE loss with weights

Bugfixes

  • Fix for broken eval during function transformations
  • Fix mx.var to give inf with ddof >= nelem
  • Fix loading empty modules in nn.Sequential

v0.1.0

4 months ago

Highlights

  • Memory use improvements:
    • Gradient checkpointing for training with mx.checkpoint
    • Better graph execution order
    • Buffer donation

Core

  • Gradient checkpointing with mx.checkpoint
  • CPU only QR factorization mx.linalg.qr
  • Release Python GIL during mx.eval
  • Depth-based graph execution order
  • Lazy loading arrays from files
  • Buffer donation for reduced memory use
  • mx.diag, mx.diagonal
  • Breaking: array.shape is a Python tuple
  • GPU support for int64 and uint64 reductions
  • vmap over reductions and arg reduction:
    • sum, prod, max, min, all, any
    • argmax, argmin
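
A sketch of gradient checkpointing with mx.checkpoint, which recomputes the wrapped function's intermediates during the backward pass instead of storing them (the exact call pattern is assumed):

    import mlx.core as mx

    def block(x):
        # Intermediates of this block are recomputed during the backward pass
        # rather than kept in memory.
        return mx.tanh(x @ x.T)

    ckpt_block = mx.checkpoint(block)
    grad_fn = mx.grad(lambda x: ckpt_block(x).sum())
    print(grad_fn(mx.random.normal((64, 64))).shape)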

NN

  • Softshrink activation

Bugfixes

  • Comparisons with inf work, and fix mx.isinf
  • Bug fix with RoPE cache
  • Handle empty Matmul on the CPU
  • Negative shape checking for mx.full
  • Correctly propagate NaN in some binary ops
    • mx.logaddexp, mx.maximum, mx.minimum
  • Fix > 4D non-contiguous binary ops
  • Fix mx.log1p with inf input
  • Fix SGD to apply weight decay even with 0 momentum

v0.0.11

4 months ago

Highlights:

  • GGUF improvements:
    • Native quantizations Q4_0, Q4_1, and Q8_0
    • Metadata

Core

  • Support for reading and writing GGUF metadata
  • Native GGUF quantization (Q4_0, Q4_1, and Q8_0)
  • Quantize with group size of 32 (2x32, 4x32, and 8x32)

NN

  • Module.save_weights supports safetensors
  • nn.init package with several commonly used neural network initializers
  • Binary cross entropy and cross entropy losses can take probabilities as targets
  • Adafactor in nn.optimizers
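
A sketch of the new nn.init package and safetensors weight saving; the specific initializer name (glorot_uniform) and the extension-based format selection are assumptions:

    import mlx.core as mx
    import mlx.nn as nn

    # Initializers are callables applied to an array of the desired shape.
    init_fn = nn.init.glorot_uniform()
    w = init_fn(mx.zeros((128, 64)))
    print(w.shape)

    # Module.save_weights picks the format from the file extension (assumed).
    layer = nn.Linear(64, 64)
    layer.save_weights("layer.safetensors")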

Bugfixes

  • Fix isinf and friends for integer types
  • Fix array creation from lists of Python ints to int64, uint, and float32
  • Fix power VJP for 0 inputs
  • Fix out of bounds inf reads in gemv
  • Fix mx.arange crash on NaN inputs

v0.0.10

4 months ago

Highlights:

  • Faster matmul: up to 2.5x faster for certain sizes, benchmarks
  • Fused matmul + addition (for faster linear layers)

Core

  • Quantization supports sizes other than multiples of 32
  • Faster GEMM (matmul)
  • AddMM primitive (fused addition and matmul)
  • mx.isnan, mx.isinf, mx.isposinf, mx.isneginf
  • mx.tile
  • VJPs for scatter_min and scatter_max
  • Multi output split primitive
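
A short sketch exercising a few of the new ops; mx.addmm is assumed to be the Python entry point for the fused addition + matmul primitive:

    import mlx.core as mx

    a = mx.random.normal((32, 64))
    b = mx.random.normal((64, 16))
    bias = mx.zeros((16,))
    y = mx.addmm(bias, a, b)          # fused bias + a @ b (assumed signature)
    print(mx.isnan(y).any(), mx.tile(bias, 2).shape)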

NN

  • Losses: Gaussian negative log-likelihood

Misc

  • Performance enhancements for graph evaluation with lots of outputs
  • Default PRNG seed is based on the current time instead of a fixed 0
  • Primitive VJPs take the output as an input, reducing redundant work without the need for simplification
  • Booleans print in Python style (True/False) from Python

Bugfixes

  • Fix scatter with < 32-bit precision and integer overflow
  • Fix overflow with mx.eye
  • Report Metal out of memory issues instead of silent failure
  • Change mx.round to follow NumPy which rounds to even

v0.0.9

4 months ago

Highlights:

  • Initial (and experimental) GGUF support
  • Support Python buffer protocol (easy interoperability with NumPy, JAX, TensorFlow, PyTorch, etc.)
  • at[] syntax for scatter-style operations: x.at[idx].add(y) (also min, max, prod, etc.)

Core

  • Array creation from other mx.array objects (mx.array([x, y]))
  • Complete support for Python buffer protocol
  • mx.inner, mx.outer
  • mx.logical_and, mx.logical_or, and operator overloads
  • Array at syntax for scatter ops
  • Better support for in-place operations (+=, *=, -=, ...)
  • VJP for scatter and scatter add
  • Constants (mx.pi, mx.inf, mx.newaxis, …)
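
A sketch of the at[] scatter syntax and buffer-protocol interoperability with NumPy:

    import mlx.core as mx
    import numpy as np

    x = mx.zeros((5,))
    x = x.at[mx.array([0, 2])].add(1.0)   # scatter-style add without in-place aliasing
    print(np.array(x))                    # convert to NumPy via the buffer protocol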

NN

  • GLU activation
  • cosine_similarity loss
  • Cache for RoPE and ALiBi

Bugfixes / Misc

  • Fix data type with tri
  • Fix saving non-contiguous arrays
  • Fix graph retention for in-place state, and remove retain_graph
  • Multi-output primitives
  • Better support for loading devices