Fast inference engine for Transformer models
Tuned OpenCL BLAS
BLISlab: A Sandbox for Optimizing GEMM
The HPC toolbox: fused matrix multiplication, convolution, data-parallel...
Stretching GPU performance for GEMMs and tensor contractions
DBCSR: Distributed Block Compressed Sparse Row matrix library
Single file libraries for C/C++
Code for testing the native float16 matrix multiplication performance on...
Serial and parallel implementations of matrix multiplication