CUDA Templates for Linear Algebra Subroutines
CUTLASS 2.10.0
CUTLASS Python now supports GEMM, Convolution, and Grouped GEMM for different data types as well as different epilogue flavors.
Optimizations for CUTLASS's Grouped GEMM kernel, which can move some scheduling to the host side where applicable.
Optimizations for GEMM+Softmax.
Grouped GEMM for Multihead Attention: a general MHA that does not require equal sequence length in every GEMM.
GEMM + Layer norm fusion for Ampere, which can fuse the layernorm into the GEMMs before and after it.
GEMM Epilogue Permutation Fusion, which can permute the GEMM output before storing it.
Grouped convolution targeting implicit GEMM: the first group convolution implementation in CUTLASS. It is an Analytical implementation, not an Optimized one.
Depthwise separable convolution: the first depthwise convolution in CUTLASS, also Analytical for now.
Standalone Layernorm and Pooling kernels.
Back-to-back GEMM enhancements.
Updates and bugfixes from the community (thanks!)
Bug fixes, performance tuning, and enhancements to documentation.
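As a rough illustration of what a Grouped GEMM computes, and of what moving scheduling to the host side can mean, here is a minimal pure-Python sketch. It is not CUTLASS code: the names `precompute_tiles` and `grouped_gemm`, and the tile sizes, are hypothetical and chosen only for this example. Each problem in the group has its own (M, N, K), and the host flattens the output tiles of all problems into one work list before any tile is computed.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return (a + b - 1) // b


def precompute_tiles(problem_sizes, tile_m, tile_n):
    """Host-side step (hypothetical): list every output tile of every problem.

    problem_sizes is a list of (M, N, K) tuples, one per group member.
    Returns a flat list of (problem_index, tile_row, tile_col).
    """
    work = []
    for g, (m, n, _k) in enumerate(problem_sizes):
        for tm in range(ceil_div(m, tile_m)):
            for tn in range(ceil_div(n, tile_n)):
                work.append((g, tm, tn))
    return work


def grouped_gemm(As, Bs, tile_m=2, tile_n=2):
    """Compute C[g] = A[g] @ B[g] for matrices given as nested lists.

    Unlike a batched GEMM, each pair may have a different (M, N, K).
    """
    sizes = [(len(a), len(b[0]), len(b)) for a, b in zip(As, Bs)]
    Cs = [[[0] * n for _ in range(m)] for m, n, _k in sizes]
    # Walk the precomputed work list; on a GPU each entry would be one tile
    # of work handed to a threadblock.
    for g, tm, tn in precompute_tiles(sizes, tile_m, tile_n):
        m, n, k = sizes[g]
        for i in range(tm * tile_m, min((tm + 1) * tile_m, m)):
            for j in range(tn * tile_n, min((tn + 1) * tile_n, n)):
                Cs[g][i][j] = sum(As[g][i][p] * Bs[g][p][j] for p in range(k))
    return Cs
```

Calling `grouped_gemm` with a 3x2 by 2x2 problem and a 1x1 by 1x3 problem processes both from the same flat tile list, which is the property that lets differently sized problems (for example, attention with unequal sequence lengths) share a single launch.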
CUTLASS 2.9.0
TF32x3: emulated single-precision using Tensor Cores
Mainloop fusion for Convolution: convolution with fused per-channel scale-bias-relu
Grouped GEMM: similar to batched GEMM with distinct problem size per group
Implicit GEMM Convolution fusion supports staging the first convolution's output accumulator in shared memory on Turing. This allows more flexible warp tile sizes and less register pressure.
Optimal performance using CUDA 11.5
Updates from the community (thanks!)
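The idea behind the TF32x3 item above can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not the CUTLASS kernel: TF32 conversion is modeled here as truncation of the low 13 mantissa bits of a binary32 value (hardware may round instead), accumulation uses Python's double precision rather than FP32, and the helper names are hypothetical. Each operand is split into a "big" TF32 part and a "small" TF32 residual, and three products replace one, with the small-times-small term dropped.

```python
import struct


def to_f32(x):
    """Round a Python float to the nearest IEEE-754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_tf32(x):
    """Model TF32 by clearing the low 13 mantissa bits (assumption: truncation)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFE000))[0]


def split_tf32(x):
    """Split a binary32 value into a big TF32 part and a small TF32 residual."""
    big = to_tf32(x)
    small = to_tf32(x - big)
    return big, small


def dot_tf32x1(a, b):
    """Dot product with each operand converted to TF32 once."""
    return sum(to_tf32(x) * to_tf32(y) for x, y in zip(a, b))


def dot_tf32x3(a, b):
    """Dot product using three TF32 products per element pair."""
    acc = 0.0
    for x, y in zip(a, b):
        xb, xs = split_tf32(x)
        yb, ys = split_tf32(y)
        # Add the small cross terms first, then the dominant big*big term;
        # the small*small term is dropped.
        acc += xs * yb + xb * ys + xb * yb
    return acc
```

Comparing both functions against a double-precision dot product of the same binary32 inputs shows the three-product version recovering most of the precision that a single TF32 conversion discards.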
Deprecation announcement: CUTLASS plans to deprecate the following:
cutlass::half_t
cmake .. -DCUTLASS_TEST_ENABLE_CACHED_RESULTS=OFF
CUTLASS 2.6.0
Dy
and w
analytic iterators and existing cutlass::conv::device::ImplicitGemmConvolution interface
CUTLASS 2.5 is a minor release contributing:
cutlass::logical_and
MmaMultistage for implicit GEMM convolution for NVIDIA Ampere architecture
MmaPipeline for implicit GEMM convolution for NVIDIA Volta and Turing architectures
CUTLASS 2.3
mma.sp.sync