CUDA Templates for Linear Algebra Subroutines
- New CuTe documentation covering `cute::Tensor<>` and MMA atoms, and an overhauled CuTe GEMM tutorial series. A small taste of the `Tensor` vocabulary appears below.
- Bug fix for an illegal memory access issue hit by Flash Attention tests in PyTorch. See #1138 for details.
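A minimal host-side sketch of the `cute::Tensor<>` vocabulary those docs cover; the shapes and values here are illustrative, and it assumes the CuTe headers are on the include path:

```cpp
#include <cute/tensor.hpp>

int main() {
  using namespace cute;

  float data[4 * 8] = {};

  // A rank-2 tensor over a raw pointer: shape (4, 8) with column-major strides (1, 4).
  auto t = make_tensor(&data[0], make_layout(make_shape(_4{}, _8{}),
                                             make_stride(_1{}, _4{})));

  t(2, 3) = 1.0f;      // the layout maps logical coordinate (2,3) to offset 2 + 3*4
  print(t.layout());   // prints (_4,_8):(_1,_4)
}
```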
- Changes involving the underlying `Params` structs.
- Renamed the default branch from `master` to `main` in this release.
- Namespaced the 3.x `device::GemmUniversalAdapter` and `kernel::GemmUniversal` types, allowing users to include both the 2.x and 3.x APIs in the same translation unit (see the sketch below). More information can be found in the 3.x backwards compatibility section.
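A minimal sketch of what that coexistence looks like; the 2.x alias below relies on the template's real defaults, while the 3.x kernel type is elided (`GemmKernel3x` is a placeholder name, since composing the kernel takes a page):

```cpp
#include "cutlass/gemm/device/gemm_universal.h"          // 2.x device-level API
#include "cutlass/gemm/device/gemm_universal_adapter.h"  // 3.x device-level adapter
#include "cutlass/gemm/kernel/gemm_universal.hpp"        // 3.x kernel-level API

// 2.x style: the device-level GemmUniversal, parameterized directly.
using Gemm2x = cutlass::gemm::device::GemmUniversal<
    float, cutlass::layout::RowMajor,     // A
    float, cutlass::layout::ColumnMajor,  // B
    float, cutlass::layout::RowMajor>;    // C/D

// 3.x style: a kernel::GemmUniversal (typically composed from collectives,
// e.g. via the CollectiveBuilder shown below) wrapped in the adapter:
//   using Gemm3x = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel3x>;
```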
- New `CollectiveBuilder` API, enabling support in the CUTLASS profiler. See the first sketch below.
- Stream-K, a new and more general way to do split-K. It can not only improve performance but also significantly reduce the number of tile sizes that need to be profiled to find the best one. See the second sketch below.
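A `CollectiveBuilder` sketch in the style of the 3.x examples; the exact template parameters vary by release, and the tile and cluster shapes here are illustrative. The `Auto` policies ask the builder to pick the stage count and kernel schedule:

```cpp
#include "cutlass/gemm/collective/collective_builder.hpp"

using TileShape    = cute::Shape<cute::_128, cute::_128, cute::_64>;  // CTA tile (M, N, K)
using ClusterShape = cute::Shape<cute::_1, cute::_2, cute::_1>;       // thread block cluster

// Builds the mainloop collective for an FP16 GEMM targeting Hopper.
using CollectiveMainloop = cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    cutlass::half_t, cutlass::layout::RowMajor,    8,  // A: element, layout, alignment
    cutlass::half_t, cutlass::layout::ColumnMajor, 8,  // B
    float,                                             // accumulator
    TileShape, ClusterShape,
    cutlass::gemm::collective::StageCountAuto,
    cutlass::gemm::collective::KernelScheduleAuto
  >::CollectiveOp;
```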
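And a conceptual sketch of the Stream-K scheduling idea (an illustration of the principle, not CUTLASS's implementation): rather than giving every output tile a fixed split factor, Stream-K flattens the MAC-loop iterations of all tiles into one range and deals contiguous spans to the CTAs, so any CTA count divides the work evenly:

```cpp
#include <cstdio>

int main() {
  int num_tiles = 7, iters_per_tile = 16, num_ctas = 4;
  int total = num_tiles * iters_per_tile;   // all MAC iterations, all tiles

  for (int cta = 0; cta < num_ctas; ++cta) {
    // Contiguous span of flattened iterations owned by this CTA.
    int begin = total * cta / num_ctas;
    int end   = total * (cta + 1) / num_ctas;
    printf("CTA %d: iterations [%d, %d) -> tiles %d..%d\n",
           cta, begin, end, begin / iters_per_tile, (end - 1) / iters_per_tile);
    // Tiles whose iterations straddle CTA boundaries are completed by a
    // lightweight cross-CTA reduction ("fixup") step.
  }
}
```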
- Fused multi-head attention kernel. It has two variants: one uses batched GEMM for fixed sequence lengths, and the other uses grouped GEMM for variable sequence lengths. Both variants need only a single kernel; the fused computation is sketched below.
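For reference, the computation both variants fuse into one kernel, written as a plain single-threaded C++ reference for a single head (naming and layout are illustrative):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Single-head reference: O = softmax(Q K^T / sqrt(d)) V.
// Q, K, V, O are [s x d], row-major.
void attention_ref(const float* Q, const float* K, const float* V,
                   float* O, int s, int d) {
  std::vector<float> row(s);
  float scale = 1.0f / std::sqrt(float(d));
  for (int i = 0; i < s; ++i) {
    float maxv = -INFINITY;                    // for a numerically stable softmax
    for (int j = 0; j < s; ++j) {
      float acc = 0.f;
      for (int k = 0; k < d; ++k) acc += Q[i*d + k] * K[j*d + k];
      row[j] = acc * scale;
      maxv = std::max(maxv, row[j]);
    }
    float sum = 0.f;
    for (int j = 0; j < s; ++j) { row[j] = std::exp(row[j] - maxv); sum += row[j]; }
    for (int k = 0; k < d; ++k) {              // O[i] = (softmax row) . V
      float acc = 0.f;
      for (int j = 0; j < s; ++j) acc += row[j] * V[j*d + k];
      O[i*d + k] = acc / sum;
    }
  }
}
```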
- Dual GEMM, which can fuse A x B and A x C into one kernel. The two GEMMs have no producer-consumer dependency (see the sketch below).
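An illustrative reference for what Dual GEMM fuses; because the two products are independent, a fused kernel can compute both while reading each element of A once:

```cpp
// D0 = A x B and D1 = A x C, all row-major.
void dual_gemm_ref(const float* A, const float* B, const float* C,
                   float* D0, float* D1, int M, int N, int K) {
  for (int m = 0; m < M; ++m)
    for (int n = 0; n < N; ++n) {
      float acc0 = 0.f, acc1 = 0.f;
      for (int k = 0; k < K; ++k) {
        float a = A[m*K + k];   // one read of A feeds both accumulators
        acc0 += a * B[k*N + n];
        acc1 += a * C[k*N + n];
      }
      D0[m*N + n] = acc0;
      D1[m*N + n] = acc1;
    }
}
```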
- Hopper improves double precision matrix multiplication by 2x compared to Ampere at iso-clocks. It is supported since CUDA 11.8.
- BLAS3 functions built on Hopper's new double precision matrix multiplication instructions.
- ELL Block Sparse GEMM, which uses an ELL matrix to describe the sparsity of the A matrix; the B and output matrices are still dense. The block size can be arbitrary (format sketch below).
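An illustrative sketch of blocked-ELL storage and the product it describes (these are not CUTLASS's types, just the format idea):

```cpp
#include <vector>

// Each block-row of A stores a fixed number of bs x bs blocks
// plus their block-column indices; -1 marks a padded (empty) slot.
struct BlockedEllMatrix {
  int block_size;               // bs; can be arbitrary
  int blocks_per_row;           // uniform non-zero block count per block-row
  std::vector<int>   col_idx;   // [block_rows * blocks_per_row]
  std::vector<float> values;    // [block_rows * blocks_per_row * bs * bs]
};

// D = A_ell * B with dense, row-major B [K x N] and D [M x N];
// assumes M and K are multiples of the block size.
void ell_gemm_ref(const BlockedEllMatrix& A, const float* B, float* D,
                  int M, int N) {
  int bs = A.block_size;
  for (int i = 0; i < M * N; ++i) D[i] = 0.f;
  for (int br = 0; br < M / bs; ++br)
    for (int s = 0; s < A.blocks_per_row; ++s) {
      int bc = A.col_idx[br * A.blocks_per_row + s];
      if (bc < 0) continue;     // padded slot, no work
      const float* blk = &A.values[(br * A.blocks_per_row + s) * bs * bs];
      for (int bi = 0; bi < bs; ++bi)
        for (int n = 0; n < N; ++n)
          for (int bk = 0; bk < bs; ++bk)
            D[(br*bs + bi) * N + n] += blk[bi*bs + bk] * B[(bc*bs + bk) * N + n];
    }
}
```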
- Optimized Group Conv for SingleGroup mode, which requires that the number of output channels per group be a multiple of the threadblock tile N (e.g., with tile N = 64, each group must produce 64, 128, ... output channels).
- Optimized DepthWise Conv. Two new modes are added
- Scripts to fuse multiple back-to-back GEMMs; the implementation was discussed in a GTC'22 Spring talk. A reference for the fused computation appears below.
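An illustrative reference for back-to-back GEMM fusion; the fused kernel keeps the intermediate product on-chip instead of round-tripping through global memory, and ReLU here stands in for whatever activation sits between the two GEMMs:

```cpp
#include <algorithm>
#include <vector>

// D = act(A x B0) x B1, row-major throughout.
void b2b_gemm_ref(const float* A, const float* B0, const float* B1, float* D,
                  int M, int N0, int N1, int K) {
  std::vector<float> inter(M * N0);   // lives in registers/shared memory when fused
  for (int m = 0; m < M; ++m)
    for (int n = 0; n < N0; ++n) {
      float acc = 0.f;
      for (int k = 0; k < K; ++k) acc += A[m*K + k] * B0[k*N0 + n];
      inter[m*N0 + n] = std::max(acc, 0.f);   // activation between the GEMMs
    }
  for (int m = 0; m < M; ++m)
    for (int n = 0; n < N1; ++n) {
      float acc = 0.f;
      for (int k = 0; k < N0; ++k) acc += inter[m*N0 + k] * B1[k*N1 + n];
      D[m*N1 + n] = acc;
    }
}
```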
- Updates and bugfixes from the community (thanks!). Big shout out to Meta's xFormers.
- Deprecation announcement: CUTLASS plans to deprecate the following: