A faster alternative to Metal Performance Shaders, a reference implementation of modern GPU algorithms, and a step toward defragmenting the AI ecosystem.
Algorithms:
Programming Language | MFA Supports | MPSGraph Supports | PyTorch Supports |
---|---|---|---|
CPU C++ (metal-cpp) | ✅ | ❌ | ✅ |
GPU C++ (Indirect Command Buffers) | ✅ | ❌ | ❌ |
Swift (iPadOS, Playgrounds) | ✅ | ✅ | ❌ |
Swift (macOS, Xcode) | ✅ | ✅ | ✅ |
Predecessor to Swift | not tested | ✅ | ✅ |
Usage:
Install Xcode 14.2 at `/Applications/Xcode 14.2.app`, side by side with the existing Xcode installation at `/Applications/Xcode.app`. Then compile `libMetalFlashAttention.metallib` by running `swift build.swift`.
Alternatively, download a prebuilt `libMetalFlashAttention.metallib` from GitHub releases (see Releases below).
SGEMM, every square matrix from 1–1536:
HGEMM, every square matrix from 1–2048:
Scaling by square size:
Function Constant | Value |
---|---|
M_splits | 2 |
N_splits | 2 |
M_simd | Block M / M_splits |
N_simd | Block N / N_splits |
K_simd | Block K |
Precision | Block M | Block N | Block K |
---|---|---|---|
Float32 | 32 | 32 | 32 |
Float32 | 48 | 48 | 24 |
Float16 | 32 | 32 | 32 |
Float16 | 48 | 48 | 32 |
Size Start | Size End | Duplicate Commands/Encoder | Trials |
---|---|---|---|
1 | 190 | 256 | 16 |
192 | 254 | 128 | 16 |
256 | 382 | 64 | 16 |
384 | 510 | 32 | 16 |
512 | 766 | 16 | 16 |
768 | 1022 | 8 | 16 |
1024 | 1534 | 4 | 16 |
1536 | 2048 | 2 | 16 |
Setup:
Scaling by sequence length:
Scaling by head size:
64: every `roundUpToPowerOf2(D/64)` integers
Function Constant | Value |
---|---|
Q_trans | ❌ |
K_trans | ✅ |
V_trans | ❌ |
O_trans | ❌ |
R_splits | TBD |
R_simd | Block R / R_splits |
C_simd | Block C |
D_simd | $8 \times \left\lceil \frac{D}{8} \right\rceil$ |
Dense: Stable Diffusion XL outermost attention layer @ 512x512 (sequence length = 1024)
Dense: Stable Diffusion 2 outermost attention layer @ 512x512 (sequence length = 4096)
Dense: Stable Diffusion 1 outermost attention layer @ 512x512 (head size = 40)
Releases:
Prospective Future Goals: