Cutlass Versions Save

CUDA Templates for Linear Algebra Subroutines

1 year ago

CUTLASS 2.10.0

CUTLASS Python now supports GEMM, Convolution and Grouped GEMM for different data types as well as different epilogue flavors. Optimizations for CUTLASS's Grouped GEMM kernel. It can move some scheduling into the host side if applicable. Optimizations for GEMM+Softmax. Grouped GEMM for Multihead Attention is a general MHA that does not require equal sequence length in every GEMM. GEMM + Layer norm fusion for Ampere can fuse the layernorm into GEMMs before and after. GEMM Epilogue Permutation Fusion can permute the GEMM output before storing. Grouped convolution targeting implicit GEMM introduces the first group convolution implementation to CUTLASS. It is an Analytical implementation, not an Optimized. Depthwise separable convolution introduces the first depthwise convolution which is also Analytical for now. Standalone Layernorm and Pooling kernels. Back-to-back GEMM enhancements. Updates and bugfixes from the community (thanks!)

1 year ago

Bug fixes, performance tuning, and enhancements to documentation.

2 years ago

CUTLASS 2.9.0

First layer Convolution kernels specialized for small channel counts and reduced alignment
- Few channels specialization for reduced alignment capabilities
- Fixed channels further specialized when channel count perfectly matches the access vector size
- Unit tests
- Python-based instance emitter in the CUTLASS Library and support in the Profiler
BLAS3 operators accelerated by Tensor Cores
- Supported types: f32, cf32, f64, cf64
- HERK with emitter
- SYRK with emitter
- SYMM with emitter
- TRMM with emitter
- Unit tests
CUTLASS Python demonstrating JIT compilation of CUTLASS kernels and a Python-based runtime using CUDA Python
- Python-based runtime interoperable with existing emitters
GEMM + Softmax example
Optimal performance using CUDA 11.6u2
Updates and bugfixes from the community (thanks!)

2 years ago

TF32x3: emulated single-precision using Tensor Cores
- 45+ TFLOPs on NVIDIA A100
- GEMM SDK example (real)
- COMPLEX GEMM SDK example (complex)
- Implicit GEMM Convolution SDK example
Mainloop fusion for Convolution: convolution with fused per-channel scale-bias-relu
Grouped GEMM: similar to batched GEMM with distinct problem size per group
- SDK example with performance comparison with Batched Strided GEMM
- cutlass::gemm::device::GemmGrouped
Implicit GEMM Convolution fusion supports staging 1st convolution's output accumulator in the shared memory on Turing. This allows more flexible warp tile sizes and less regsiter pressue.
Optimal performance using CUDA 11.5
Updates from the community (thanks!)
Deprecation announcement: CUTLASS plans to deprecate the following:
- Maxwell and Pascal GPU architectures
- Ubuntu 16.04
- CUDA 10.2

2 years ago

2.7.0

Mainloop fusion for GEMM: summation over A or B
Strided DGRAD (optimized iterators)
Half-precision GELU_taylor activation functions
- Use these when accumulation and epilogue compute types are all cutlass::half_t
Tuning and bug fixes to fused GEMM + GEMM example
Support for smaller than 128b aligned Convolutions: see examples
Caching of results to accelerate Convolution unit tests
- Can be enabled or disabled by running cmake .. -DCUTLASS_TEST_ENABLE_CACHED_RESULTS=OFF
Corrections and bug fixes reported by the CUTLASS community
- Thank you for filing these issues!

2 years ago

Arbitrary padding and striding for CUTLASS Strided DGRAD Convolution operator (Analytic Iterators)
Tuning for GEMMs fused with partial reductions
Corrections and bug fixes reported by the CUTLASS community
- Thank you for filing these issues!

2 years ago

CUTLASS 2.6.0

Optimal performance when compiled with the CUDA 11.4 Toolkit
- Adopt the new L2 prefetch feature in cp.async and global load
Fused operators with GEMM and Convolution
- Fused broadcast in epilogue
- Fused partial reduction in epilogue
64b tensor strides and leading dimensions support for GEMMs
Affine rank=2 matrix layouts
- Row stride and column stride for matrices using cutlass::layout::AffineRank2
- Support FP64 tensor core and SIMT GEMM.
Batched GEMV preview implementation
New strided Dgrad implementation
- Accelerates over previous implementation by cutting down redundant math by 4x
- Support using new Dy and w analytic iterators and existing cutlass::conv::device::ImplicitGemmConvolution interface
Quaternion-valued GEMM and Convolution in single- and double-precision (targeting CUDA Cores)
- Updates to quaternion.h and functional.h
- SDK Example for GEMM and Convolution
- Unit tests for GEMM and Convolution
Many improvements to the epilogue.
- Provide an option to not fully unroll the epilogue to reduce the code size and improve the performance when using complicated elementwise operations
- Performance improvement for FP16 tensor core kernels
- Bug fixes
Enhanced Clang support and the combination of Clang 13 and CUDA 11.4 can build and run kernels from Pascal and Ampere.
Updated minimum CUDA Toolkit requirement to 10.2
- CUDA 11.4 Toolkit recommended
Corrections and bug fixes reported by the CUTLASS community
- Thank you for filing these issues!

3 years ago

CUTLASS 2.5 is a minor release contributing:

Tensor reductions
- m-to-n reductions of tensors with affine layout
- Specializations for reductions including contiguous dimension
- Specializations for reductions excluding contiguous dimension
- Custom reduction functors such as cutlass::logical_and
- Large tensor support, up to 2^63 elements (however, each dimension is limited to an extent of 2^31)
Optimizations for 3-D convolution
- Optimized tile iterators using precomputed delta table for 3-D convolution
- Full coverage of forward and backwards passes for 3D convolution
Fused Convolution+Convolution example
Corrections and bug fixes reported by the CUTLASS community
- Thank you for filing these issues!

3 years ago

Implicit GEMM convolution kernels supporting CUDA and Tensor Cores on NVIDIA GPUs
- Operators: forward (Fprop), backward data gradient (Dgrad), and backward weight gradient (Wgrad) convolution
- Data type: FP32, complex<FP32>, Tensor Float 32 (TF32), BFloat16 (BF16), Float16, Int4, Int8, Int32
- Spatial dimensions: 1-D, 2-D, and 3-D
- Layout: NHWC, NCxHWx
Implicit GEMM convolution components:
- Global memory iterators supporting Fprop, Dgrad, and Wgrad
- MmaMultistage for implicit GEMM convolution for NVIDIA Ampere architecture
- MmaPipeline for implicit GEMM convolution for NVIDIA Volta and Turing architectures
- Documentation describing Implicit GEMM Convolution algorithm and implementation

3 years ago

CUTLASS 2.3

NVIDIA Ampere Architecture features
- Sparse Tensor Core GEMM kernels:
  - Direct access to Sparse Tensor Cores and maximum performance via mma.sp.sync
- Fast SGEMM targeting GeForce RTX 30-series CUDA Cores
Minor Features:
- Activation functions such as GeLU and Sigmoid
- Small matrix and quaternion template classes in device code
- Floating-point constants
NVIDIA Ampere GPU Architecture examples and documentation:
- Tensor Float 32 and
- Sparse Tensor Cores
- Documentation added on CUTLASS efficient row-major epilogue