
MLX: An array framework for Apple silicon

v0.7.0

2 months ago

Highlights

  • Performance improvements for attention ops:
    • No-copy broadcast matmul (benchmarks)
    • Fewer copies in reshape

Core

  • Faster broadcast + gemm
  • mx.linalg.svd (CPU only)
  • Fewer copies in reshape
  • Faster small reductions
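
A minimal sketch of the new CPU-only SVD; the exact return convention (U, S, Vt) and the stream keyword are assumptions:

    import mlx.core as mx

    a = mx.random.normal((6, 4))
    # SVD is CPU only in this release, so place the op on the CPU stream.
    u, s, vt = mx.linalg.svd(a, stream=mx.cpu)
    print(u.shape, s.shape, vt.shape)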

NN

  • nn.RNN, nn.LSTM, nn.GRU
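
A hedged sketch of the new recurrent layers, assuming a (batch, sequence, feature) layout and input/hidden sizes in the constructor:

    import mlx.core as mx
    import mlx.nn as nn

    # Illustrative sizes only.
    lstm = nn.LSTM(input_size=32, hidden_size=64)
    x = mx.random.normal((8, 16, 32))   # (batch, sequence, features)
    hidden, cell = lstm(x)              # per-step hidden and cell states (assumed return)
    print(hidden.shape)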

Bugfixes

  • Fix bug in depth traversal ordering
  • Fix two edge case bugs in compilation
  • Fix bug with modules with dictionaries of weights
  • Fix bug with scatter which broke MOE training
  • Fix bug with compilation kernel collision

v0.6.0

2 months ago

Highlights:

  • Faster quantized matrix-vector multiplies
  • mx.fast.scaled_dot_product_attention fused op

Core

  • Memory allocation API improvements
  • Faster GPU reductions for smaller sizes (between 2x and 7x)
  • mx.fast.scaled_dot_product_attention fused op
  • Faster quantized matrix-vector multiplications
  • Pickle support for mx.array
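
A sketch of the fused attention op; the (batch, heads, sequence, head_dim) layout and the scale keyword are assumptions based on typical SDPA APIs:

    import math
    import mlx.core as mx

    B, H, L, D = 1, 8, 128, 64          # batch, heads, sequence, head_dim (illustrative)
    q = mx.random.normal((B, H, L, D))
    k = mx.random.normal((B, H, L, D))
    v = mx.random.normal((B, H, L, D))
    out = mx.fast.scaled_dot_product_attention(q, k, v, scale=1.0 / math.sqrt(D))
    print(out.shape)  # expected (B, H, L, D)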

NN

  • Dilation on convolution layers

Bugfixes

  • Fix mx.topk
  • Fix reshape for zero sizes

v0.5.0

3 months ago

Highlights:

  • Faster convolutions.
    • Up to 14x faster for some common sizes.
    • See benchmarks

Core

  • mx.where properly handles inf
  • Faster and more general convolutions
    • Input and kernel dilation
    • Asymmetric padding
    • Support for cross-correlation and convolution
  • atleast_{1,2,3}d accept any number of arrays
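
A sketch of the more general convolutions; MLX uses a channels-last layout, and the dilation/padding keywords shown here are assumptions:

    import mlx.core as mx

    x = mx.random.normal((4, 32, 32, 3))   # (N, H, W, C_in), channels last
    w = mx.random.normal((8, 3, 3, 3))     # (C_out, kH, kW, C_in)
    # Kernel dilation with symmetric padding; asymmetric padding is also supported.
    y = mx.conv2d(x, w, stride=1, padding=1, dilation=2)
    print(y.shape)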

NN

  • nn.Upsample layer
    • Supports nearest neighbor and linear interpolation
    • Any number of dimensions
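
A sketch of nn.Upsample; the scale_factor and mode arguments are assumed from the description above:

    import mlx.core as mx
    import mlx.nn as nn

    up = nn.Upsample(scale_factor=2, mode="linear")   # or "nearest"
    x = mx.random.normal((1, 16, 16, 3))              # (N, H, W, C), channels last
    print(up(x).shape)                                # expected (1, 32, 32, 3)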

Optimizers

  • Linear schedule and schedule joiner:
    • Use for e.g. linear warmup + cosine decay
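
A hedged sketch of linear warmup followed by cosine decay; the names linear_schedule, cosine_decay, and join_schedules are assumptions about the new schedule utilities:

    import mlx.optimizers as optim

    # 100-step linear warmup to 1e-3, then cosine decay over 1000 steps (illustrative numbers).
    warmup = optim.linear_schedule(0.0, 1e-3, 100)
    cosine = optim.cosine_decay(1e-3, 1000)
    schedule = optim.join_schedules([warmup, cosine], [100])
    opt = optim.Adam(learning_rate=schedule)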

Bugfixes

  • arange throws on inf inputs
  • Fix CMake build with MLX
  • Fix logsumexp inf edge case
  • Fix grad of power w.r.t. to exponent edge case
  • Fix compile with inf constants
  • Fix temporary bug in convolution

v0.4.0

3 months ago

Highlights:

  • Partial shapeless compilation
    • Default shapeless compilation for all activations
    • Can be more than 5x faster than uncompiled versions
  • CPU kernel fusion

Core

  • CPU compilation
  • Shapeless compilation for some cases
    • mx.compile(function, shapeless=True)
  • Up to 10x faster scatter: benchmarks
  • mx.atleast_1d, mx.atleast_2d, mx.atleast_3d
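
A sketch of shapeless compilation; with shapeless=True the compiled graph is reused across input shapes rather than being recompiled per shape (the reuse behavior is assumed from the description above):

    import mlx.core as mx

    def gelu_ish(x):
        # Simple elementwise function, a good fit for shapeless compilation.
        return x * mx.sigmoid(1.702 * x)

    fast = mx.compile(gelu_ish, shapeless=True)
    print(fast(mx.random.normal((128,))).shape)
    print(fast(mx.random.normal((4, 256))).shape)   # reuses the same compiled graph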

Bugfixes

  • Bug with tolist with bfloat16 and float16
  • Bug with argmax on M3

v0.3.0

3 months ago

Highlights:

  • mx.fast subpackage
  • Custom mx.fast.rope up to 20x faster

Core

  • Support metadata with safetensors
  • Up to 5x faster scatter and 30% faster gather
  • 40% faster bfloat16 quantized matrix-vector multiplies
  • mx.fast subpackage with a fast RoPE
  • Context manager mx.stream to set the default device
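
A short sketch of the mx.stream context manager for setting the default device/stream within a block:

    import mlx.core as mx

    with mx.stream(mx.cpu):
        # Ops created here run on the CPU stream by default.
        a = mx.random.normal((256, 256))
        b = a @ a.T
    mx.eval(b)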

NN

  • Average and Max pooling layers for 1D and 2D inputs

Optimizers

  • Support schedulers for e.g. learning rates
  • A few basic schedulers:
    • optimizers.step_decay
    • optimizers.cosine_decay
    • optimizers.exponential_decay
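
A short sketch of passing one of the new schedulers as the learning rate; the step_decay signature (initial value, decay rate, step size) is an assumption:

    import mlx.optimizers as optim

    # Step decay: multiply the learning rate by 0.9 every 1000 steps (illustrative numbers).
    lr = optim.step_decay(1e-2, 0.9, 1000)
    opt = optim.SGD(learning_rate=lr)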

Bugfixes

  • Fix bug in remainder with negative numerators and integers
  • Fix bug with slicing into softmax
  • Fix quantized matmuls for sizes that are not multiples of 32

v0.2.0

3 months ago

Highlights:

  • mx.compile makes stuff go fast
    • Some functions are up to 10x faster (benchmarks)
    • Training models anywhere from 10% to twice as fast (benchmarks)
    • Simple syntax for compiling full training steps
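
A hedged sketch of compiling a full training step; the toy model, loss, and optimizer are illustrative, and the inputs/outputs state-capture pattern shown here follows later MLX documentation and is assumed to be what this release refers to:

    from functools import partial

    import mlx.core as mx
    import mlx.nn as nn
    import mlx.optimizers as optim

    model = nn.Linear(16, 1)              # toy model for illustration
    opt = optim.SGD(learning_rate=1e-2)

    def loss_fn(model, x, y):
        return nn.losses.mse_loss(model(x), y)

    # Capture model and optimizer state so the compiled step can update it.
    state = [model.state, opt.state]

    @partial(mx.compile, inputs=state, outputs=state)
    def step(x, y):
        loss, grads = nn.value_and_grad(model, loss_fn)(model, x, y)
        opt.update(model, grads)
        return loss

    x, y = mx.random.normal((32, 16)), mx.random.normal((32, 1))
    loss = step(x, y)
    mx.eval(loss, state)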

Core

  • mx.compile function transformation
  • Find devices properly for iOS
  • Up to 10x faster GPU gather
  • __abs__ overload for abs on arrays
  • loc and scale parameters for mx.random.normal

NN

  • Margin ranking loss
  • BCE loss with weights

Bugfixes

  • Fix for broken eval during function transformations
  • Fix mx.var to give inf with ddof >= nelem
  • Fix loading empty modules in nn.Sequential

v0.1.0

4 months ago

Highlights

  • Memory use improvements:
    • Gradient checkpointing for training with mx.checkpoint
    • Better graph execution order
    • Buffer donation

Core

  • Gradient checkpointing with mx.checkpoint
  • CPU only QR factorization mx.linalg.qr
  • Release Python GIL during mx.eval
  • Depth-based graph execution order
  • Lazy loading arrays from files
  • Buffer donation for reduced memory use
  • mx.diag, mx.diagonal
  • Breaking: array.shape is a Python tuple
  • GPU support for int64 and uint64 reductions
  • vmap over reductions and arg reduction:
    • sum, prod, max, min, all, any
    • argmax, argmin
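
A sketch of gradient checkpointing with mx.checkpoint, which recomputes the wrapped function's intermediates during the backward pass instead of storing them (the exact call pattern is assumed):

    import mlx.core as mx

    def block(x):
        # Intermediates of this block are recomputed during the backward pass
        # rather than kept in memory.
        return mx.tanh(x @ x.T)

    ckpt_block = mx.checkpoint(block)
    grad_fn = mx.grad(lambda x: ckpt_block(x).sum())
    print(grad_fn(mx.random.normal((64, 64))).shape)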

NN

  • Softshrink activation

Bugfixes

  • Comparisons with inf work, and fix mx.isinf
  • Bug fix with RoPE cache
  • Handle empty Matmul on the CPU
  • Negative shape checking for mx.full
  • Correctly propagate NaN in some binary ops
    • mx.logaddexp, mx.maximum, mx.minimum
  • Fix > 4D non-contiguous binary ops
  • Fix mx.log1p with inf input
  • Fix SGD to apply weight decay even with 0 momentum

v0.0.11

4 months ago

Highlights:

  • GGUF improvements:
    • Native quantizations Q4_0, Q4_1, and Q8_0
    • Metadata

Core

  • Support for reading and writing GGUF metadata
  • Native GGUF quantization (Q4_0, Q4_1, and Q8_0)
  • Quantize with group size of 32 (2x32, 4x32, and 8x32)

NN

  • Module.save_weights supports safetensors
  • nn.init package with several commonly used neural network initializers
  • Binary cross entropy and cross entropy losses can take probabilities as targets
  • Adafactor in nn.optimizers
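
A sketch of the new nn.init package and safetensors weight saving; the specific initializer name (glorot_uniform) and the extension-based format selection are assumptions:

    import mlx.core as mx
    import mlx.nn as nn

    # Initializers are callables applied to an array of the desired shape.
    init_fn = nn.init.glorot_uniform()
    w = init_fn(mx.zeros((128, 64)))
    print(w.shape)

    # Module.save_weights picks the format from the file extension (assumed).
    layer = nn.Linear(64, 64)
    layer.save_weights("layer.safetensors")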

Bugfixes

  • Fix isinf and friends for integer types
  • Fix array creation from lists of Python ints to int64, uint, and float32
  • Fix power VJP for 0 inputs
  • Fix out of bounds inf reads in gemv
  • Fix mx.arange crash on NaN inputs

v0.0.10

4 months ago

Highlights:

  • Faster matmul: up to 2.5x faster for certain sizes, benchmarks
  • Fused matmul + addition (for faster linear layers)

Core

  • Quantization supports sizes other than multiples of 32
  • Faster GEMM (matmul)
  • AddMM primitive (fused addition and matmul)
  • mx.isnan, mx.isinf, mx.isposinf, mx.isneginf
  • mx.tile
  • VJPs for scatter_min and scatter_max
  • Multi output split primitive
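
A short sketch exercising a few of the new ops; mx.addmm is assumed to be the Python entry point for the fused addition + matmul primitive:

    import mlx.core as mx

    a = mx.random.normal((32, 64))
    b = mx.random.normal((64, 16))
    bias = mx.zeros((16,))
    y = mx.addmm(bias, a, b)          # fused bias + a @ b (assumed signature)
    print(mx.isnan(y).any(), mx.tile(bias, 2).shape)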

NN

  • Losses: Gaussian negative log-likelihood

Misc

  • Performance enhancements for graph evaluation with lots of outputs
  • Default PRNG seed is based on the current time instead of a fixed 0
  • Primitive VJPs take the output as an input, reducing redundant work without the need for simplification
  • Booleans print in Python style (True/False) from Python

Bugfixes

  • Fix scatter with < 32-bit precision and integer overflow
  • Fix overflow with mx.eye
  • Report Metal out of memory issues instead of silent failure
  • Change mx.round to follow NumPy which rounds to even

v0.0.9

4 months ago

Highlights:

  • Initial (and experimental) GGUF support
  • Support Python buffer protocol (easy interoperability with NumPy, JAX, TensorFlow, PyTorch, etc.)
  • at[] syntax for scatter-style operations: x.at[idx].add(y) (also min, max, prod, etc.)

Core

  • Array creation from other mx.array objects (mx.array([x, y]))
  • Complete support for Python buffer protocol
  • mx.inner, mx.outer
  • mx.logical_and, mx.logical_or, and operator overloads
  • Array at syntax for scatter ops
  • Better support for in-place operations (+=, *=, -=, ...)
  • VJP for scatter and scatter add
  • Constants (mx.pi, mx.inf, mx.newaxis, …)
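
A sketch of the at[] scatter syntax and buffer-protocol interoperability with NumPy:

    import mlx.core as mx
    import numpy as np

    x = mx.zeros((5,))
    x = x.at[mx.array([0, 2])].add(1.0)   # scatter-style add without in-place aliasing
    print(np.array(x))                    # convert to NumPy via the buffer protocol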

NN

  • GLU activation
  • cosine_similarity loss
  • Cache for RoPE and ALiBi

Bugfixes / Misc

  • Fix data type with tri
  • Fix saving non-contiguous arrays
  • Fix graph retention for in-place state, and remove retain_graph
  • Multi-output primitives
  • Better support for loading devices