
MLX: An array framework for Apple silicon

v0.12.0

2 weeks ago

Highlights

  • Faster quantized matmul

Core

  • mx.synchronize to wait for computation dispatched with mx.async_eval (sketch below)
  • mx.radians and mx.degrees
  • mx.metal.clear_cache to return to the OS the memory held by MLX as a cache for future allocations
  • Change quantization to always represent 0 exactly (relevant issue)
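
A minimal sketch of how the new synchronization and cache-control calls fit together; the matmul workload is only illustrative:

    import mlx.core as mx

    a = mx.random.normal((1024, 1024))
    b = mx.random.normal((1024, 1024))

    c = a @ b
    mx.async_eval(c)        # dispatch the computation without blocking
    deg = mx.degrees(mx.radians(mx.array(90.0)))   # other work can proceed meanwhile
    mx.synchronize()        # block until the dispatched work has finished

    mx.metal.clear_cache()  # hand MLX's cached buffers back to the OS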

Bugfixes

  • Fixed quantization of a block with all 0s that produced NaNs
  • Fixed the len field in the buffer protocol implementation

v0.11.0

3 weeks ago

Core

  • mx.block_masked_mm for block-level sparse matrix multiplication (sketch below)
  • Shared events for synchronization and asynchronous evaluation
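
A sketch of block-level sparse matmul with mx.block_masked_mm; the 64x64 block size and the mask_out keyword are assumptions about the API:

    import mlx.core as mx

    a = mx.random.normal((256, 64))
    b = mx.random.normal((64, 256))

    # Boolean mask over 64x64 blocks of the 256x256 output, i.e. shape (4, 4)
    block_mask = mx.random.uniform(shape=(4, 4)) > 0.5

    out = mx.block_masked_mm(a, b, block_size=64, mask_out=block_mask)
    print(out.shape)  # (256, 256); masked-out blocks are zero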

NN

  • nn.QuantizedEmbedding layer
  • nn.quantize for quantizing modules (sketch below)
  • gelu_approx uses tanh for consistency with PyTorch
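
A sketch of quantizing a small module with nn.quantize; the group size and bit width shown are the assumed defaults, and TinyLM is just a stand-in model:

    import mlx.nn as nn

    class TinyLM(nn.Module):
        def __init__(self, vocab_size=1000, dims=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, dims)
            self.out = nn.Linear(dims, vocab_size)

        def __call__(self, x):
            return self.out(self.embed(x))

    model = TinyLM()
    # Swaps supported layers (e.g. Linear, Embedding) for quantized equivalents in place;
    # nn.QuantizedEmbedding can also be constructed directly.
    nn.quantize(model, group_size=64, bits=4)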

v0.10.0

1 month ago

Highlights

  • Improvements for LLM generation
    • Reshapeless quant matmul/matvec
    • mx.async_eval
    • Async command encoding

Core

  • Slightly faster reshapeless quantized gemms
  • Option for precise softmax (sketch below)
  • mx.metal.start_capture and mx.metal.stop_capture for GPU debug/profile
  • mx.expm1
  • mx.std
  • mx.meshgrid
  • CPU-only mx.random.multivariate_normal
  • mx.cumsum (and other scans) for bfloat
  • Async command encoder with explicit barriers / dependency management
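
A sketch exercising a few of the new ops; the precise keyword on mx.softmax is an assumption about the option's spelling:

    import mlx.core as mx

    x = mx.random.normal((8, 4096)).astype(mx.float16)

    # Softmax with higher-precision accumulation (assumed keyword)
    y = mx.softmax(x, axis=-1, precise=True)

    print(mx.std(x, axis=-1).shape)           # per-row standard deviation
    print(mx.expm1(mx.array(1e-4)))           # numerically stable exp(x) - 1
    xs, ys = mx.meshgrid(mx.arange(3), mx.arange(4))
    print(xs.shape, ys.shape)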

NN

  • nn.Upsample supports bicubic interpolation

Misc

  • Updated MLX Extension to work with nanobind

Bugfixes

  • Fix buffer donation in softmax and fast ops
  • Fix bug in layer norm vjp
  • Fix bug initializing from lists with scalars
  • Fix bug in indexing
  • Fix CPU compilation bug
  • Fix multi-output compilation bug
  • Fix stack overflow issues in eval and array destruction

v0.9.0

1 month ago

Highlights:

  • Fast partial RoPE (used by Phi-2)
  • Fast gradients for RoPE, RMSNorm, and LayerNorm

Core

  • More overhead reductions
  • Partial fast RoPE (speeds up Phi-2)
  • Better buffer donation for copy
  • Type hierarchy and issubdtype (sketch below)
  • Fast VJPs for RoPE, RMSNorm, and LayerNorm
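
A small sketch of the dtype hierarchy queries; the category names mirror NumPy and are assumed here:

    import mlx.core as mx

    print(mx.issubdtype(mx.float16, mx.floating))  # True
    print(mx.issubdtype(mx.int32, mx.integer))     # True
    print(mx.issubdtype(mx.float32, mx.integer))   # False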

NN

  • Module.set_dtype (sketch below)
  • Chaining in nn.Module (model.freeze().update(…))
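
A sketch of the new Module helpers; nn.Sequential is used only as a stand-in model and the update call is illustrative:

    import mlx.core as mx
    import mlx.nn as nn

    model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 8))

    # Cast the module's floating-point parameters
    model.set_dtype(mx.float16)

    # Module methods return the module, so calls can be chained
    model.freeze().update(model.parameters())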

Bugfixes

  • Fix set item bugs
  • Fix scatter vjp
  • Check shape integer overflow on array construction
  • Fix bug with module attributes
  • Fix two bugs for odd shaped QMV
  • Fix GPU sort for large sizes
  • Fix bug in negative padding for convolutions
  • Fix bug in multi-stream race condition for graph evaluation
  • Fix random normal generation for half precision

v0.8.0

1 month ago

Optimizers

  • Set minimum value in cosine decay scheduler

Bugfixes

  • Fix bug in multi-dimensional reduction

v0.7.0

1 month ago

Highlights

  • Perf improvements for attention ops:
    • No-copy broadcast matmul (benchmarks)
    • Fewer copies in reshape

Core

  • Faster broadcast + gemm
  • mx.linalg.svd (CPU only; sketch below)
  • Fewer copies in reshape
  • Faster small reductions
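
A sketch of the CPU-only SVD:

    import mlx.core as mx

    a = mx.random.normal((64, 32))

    # SVD currently runs on the CPU stream only
    U, S, Vt = mx.linalg.svd(a, stream=mx.cpu)
    print(U.shape, S.shape, Vt.shape)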

NN

  • nn.RNN, nn.LSTM, nn.GRU
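
A minimal sketch of one of the new recurrent layers; the positional (input size, hidden size) constructor arguments and the (batch, time, features) input layout are assumptions:

    import mlx.core as mx
    import mlx.nn as nn

    rnn = nn.GRU(32, 64)                # assumed: (input size, hidden size)

    x = mx.random.normal((8, 16, 32))   # (batch, time, features)
    h = rnn(x)                          # hidden state at every time step
    print(h.shape)                      # expected (8, 16, 64)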

Bugfixes

  • Fix bug in depth traversal ordering
  • Fix two edge case bugs in compilation
  • Fix bug with modules with dictionaries of weights
  • Fix bug with scatter which broke MOE training
  • Fix bug with compilation kernel collision

v0.6.0

2 months ago

Highlights:

  • Faster quantized matrix-vector multiplies
  • mx.fast.scaled_dot_product_attention fused op

Core

  • Memory allocation API improvements
  • Faster GPU reductions for smaller sizes (between 2 and 7x)
  • mx.fast.scaled_dot_product_attention fused op (sketch below)
  • Faster quantized matrix-vector multiplications
  • Pickle support for mx.array
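
A sketch of the fused attention op; the (batch, heads, sequence, head_dim) layout is assumed:

    import mlx.core as mx

    B, H, L, D = 1, 8, 128, 64
    q = mx.random.normal((B, H, L, D))
    k = mx.random.normal((B, H, L, D))
    v = mx.random.normal((B, H, L, D))

    out = mx.fast.scaled_dot_product_attention(q, k, v, scale=D ** -0.5)
    print(out.shape)  # (1, 8, 128, 64)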

NN

  • Dilation on convolution layers

Bugfixes

  • Fix mx.topk
  • Fix reshape for zero sizes

v0.5.0

2 months ago

Highlights:

  • Faster convolutions.
    • Up to 14x faster for some common sizes.
    • See benchmarks

Core

  • mx.where properly handles inf
  • Faster and more general convolutions
    • Input and kernel dilation
    • Asymmetric padding
    • Support for cross-correlation and convolution
  • atleast_{1,2,3}d accept any number of arrays

NN

  • nn.Upsample layer (sketch below)
    • Supports nearest neighbor and linear interpolation
    • Any number of dimensions
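
A sketch of the new layer, assuming MLX's channels-last (NHWC) convention for image-like inputs:

    import mlx.core as mx
    import mlx.nn as nn

    x = mx.random.normal((1, 8, 8, 3))              # (batch, height, width, channels)
    up = nn.Upsample(scale_factor=2, mode="linear")
    print(up(x).shape)                              # expected (1, 16, 16, 3)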

Optimizers

  • Linear schedule and schedule joiner (sketch below):
    • Use for e.g. linear warmup + cosine decay
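
A sketch of joining a linear warmup with a cosine decay; the helper argument orders are assumptions:

    import mlx.optimizers as optim

    warmup = optim.linear_schedule(0.0, 1e-3, 100)  # 0 -> 1e-3 over 100 steps
    cosine = optim.cosine_decay(1e-3, 1000)         # then decay over 1000 steps
    schedule = optim.join_schedules([warmup, cosine], [100])

    optimizer = optim.Adam(learning_rate=schedule)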

Bugfixes

  • arange throws on inf inputs
  • Fix CMake build with MLX
  • Fix logsumexp inf edge case
  • Fix grad of power w.r.t. the exponent edge case
  • Fix compile with inf constants
  • Fix bug with temporaries in convolution

v0.4.0

2 months ago

Highlights:

  • Partial shapeless compilation
    • Default shapeless compilation for all activations
    • Can be more than 5x faster than uncompiled versions
  • CPU kernel fusion

Core

  • CPU compilation
  • Shapeless compilation for some cases (sketch below)
    • mx.compile(function, shapeless=True)
  • Up to 10x faster scatter: benchmarks
  • mx.atleast_1d, mx.atleast_2d, mx.atleast_3d
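
A sketch of shapeless compilation; silu here is just an illustrative element-wise function:

    import mlx.core as mx

    def silu(x):
        return x * mx.sigmoid(x)

    # Compiled once and reused across input shapes without retracing
    fast_silu = mx.compile(silu, shapeless=True)

    print(fast_silu(mx.random.normal((16, 128))).shape)
    print(fast_silu(mx.random.normal((4, 32, 32))).shape)
    print(mx.atleast_2d(mx.array(1.0)).shape)  # (1, 1)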

Bugfixes

  • Fix bug in tolist with bfloat16 and float16
  • Fix bug in argmax on M3

v0.3.0

2 months ago

Highlights:

  • mx.fast subpackage
  • Custom mx.fast.rope up to 20x faster

Core

  • Support metadata with safetensors (sketch below)
  • Up to 5x faster scatter and 30% faster gather
  • 40% faster bfloat16 quantized matrix-vector multiplies
  • mx.fast subpackage with a fast RoPE
  • Context manager mx.stream to set the default device
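
A sketch of the default-stream context manager and of saving safetensors with metadata; the metadata keyword name and the file name are assumptions:

    import mlx.core as mx

    a = mx.random.normal((512, 512))

    # Ops inside the context run on the CPU stream by default
    with mx.stream(mx.cpu):
        b = a @ a

    mx.save_safetensors("weights.safetensors", {"a": a}, metadata={"framework": "mlx"})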

NN

  • Average and Max pooling layers for 1D and 2D inputs
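
A sketch of the new pooling layers on channels-last input; the kernel_size/stride arguments are assumptions:

    import mlx.core as mx
    import mlx.nn as nn

    x = mx.random.normal((1, 8, 8, 4))                       # (batch, height, width, channels)
    print(nn.MaxPool2d(kernel_size=2, stride=2)(x).shape)    # expected (1, 4, 4, 4)
    print(nn.AvgPool2d(kernel_size=2, stride=2)(x).shape)    # expected (1, 4, 4, 4)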

Optimizers

  • Support schedulers for e.g. learning rates
  • A few basic schedulers:
    • optimizers.step_decay
    • optimizers.cosine_decay
    • optimizers.exponential_decay
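
A sketch of passing one of the basic schedulers straight to an optimizer's learning rate; the argument order (initial value, decay rate, step size) is assumed:

    import mlx.optimizers as optim

    # Halve the learning rate every 1000 steps, starting from 1e-2
    lr = optim.step_decay(1e-2, 0.5, 1000)
    optimizer = optim.SGD(learning_rate=lr)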

Bugfixes

  • Fix bug in remainder with negative numerators and integers
  • Fix bug with slicing into softmax
  • Fix quantized matmuls with sizes that are not multiples of 32