Serial and parallel implementations of matrix multiplication
This repository contains a number of serial and parallel benchmarks for matrix multiplication in C++. Matrix multiplication is a wonderful first operation to try your hand at optimizing for the following reasons:
And many more!
The benchmarks in this repository were written using Google Benchmark. For simplicity, all benchmarks assume square matrices of dimension N x N, where N is 384, 768, or 1152.
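As a point of reference, here is a minimal sketch of the kind of kernel a serial baseline measures: a textbook triple-loop multiplication of square N x N matrices. The function name `serial_mmul`, the `double` element type, and the flat row-major storage are assumptions made for illustration, not taken from the repository's sources.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the repository): computes C = A * B for
// square N x N matrices stored row-major in flat vectors.
std::vector<double> serial_mmul(const std::vector<double>& a,
                                const std::vector<double>& b,
                                std::size_t n) {
  // Both inputs must hold exactly N * N elements.
  assert(a.size() == n * n && b.size() == n * n);

  std::vector<double> c(n * n, 0.0);
  for (std::size_t i = 0; i < n; ++i) {      // row of A and C
    for (std::size_t j = 0; j < n; ++j) {    // column of B and C
      double sum = 0.0;
      for (std::size_t k = 0; k < n; ++k) {  // dot product along k
        sum += a[i * n + k] * b[k * n + j];
      }
      c[i * n + j] = sum;
    }
  }
  return c;
}
```

The inner loop walks B with a stride of N elements, which is the cache-unfriendly access pattern that blocked variants try to improve.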
The following section breaks down the benchmarks contained in each subdirectory.
serial_mmul_bench
parallel_mmul_bench
blocked_mmul_bench
blocked_aligned_mmul_bench
Same as blocked_mmul_bench, but using 64-byte aligned allocations to prevent blocks from spanning cache lines
parallel_blocked_mmul_bench
blocked_column_aligned_mmul_bench
parallel_blocked_column_mmul_bench
blocked_column_multi_output_aligned_mmul_bench
parallel_blocked_column_multi_output_mmul_bench
baseline_cuda_mmul
shmem_cuda_mmul
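To illustrate the idea behind the blocked and aligned variants listed above, here is a rough sketch of a blocked multiplication over 64-byte aligned buffers. The names `make_aligned_matrix` and `blocked_mmul`, the `double` element type, and the loop order are hypothetical choices for this example; the repository's own kernels may be organized differently. It assumes C++17 for `std::aligned_alloc`, which is not available on every toolchain (MSVC, for example, does not provide it).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>
#include <new>

// Releases memory obtained from std::aligned_alloc.
struct FreeDeleter {
  void operator()(double* p) const noexcept { std::free(p); }
};
using AlignedBuffer = std::unique_ptr<double[], FreeDeleter>;

// Hypothetical helper: allocates a zero-filled N x N matrix whose first
// element sits on a 64-byte boundary.
AlignedBuffer make_aligned_matrix(std::size_t n) {
  constexpr std::size_t kAlign = 64;
  std::size_t bytes = n * n * sizeof(double);
  // std::aligned_alloc expects a size that is a multiple of the alignment.
  bytes = (bytes + kAlign - 1) / kAlign * kAlign;
  if (bytes == 0) bytes = kAlign;
  void* raw = std::aligned_alloc(kAlign, bytes);
  if (raw == nullptr) throw std::bad_alloc{};
  assert(reinterpret_cast<std::uintptr_t>(raw) % kAlign == 0);
  double* p = static_cast<double*>(raw);
  std::fill(p, p + n * n, 0.0);
  return AlignedBuffer(p);
}

// Hypothetical kernel: accumulates A * B into C, one block x block tile at
// a time. C must be zero-initialized by the caller.
void blocked_mmul(const double* a, const double* b, double* c,
                  std::size_t n, std::size_t block) {
  assert(block > 0);
  for (std::size_t ii = 0; ii < n; ii += block) {
    const std::size_t i_end = std::min(ii + block, n);
    for (std::size_t kk = 0; kk < n; kk += block) {
      const std::size_t k_end = std::min(kk + block, n);
      for (std::size_t jj = 0; jj < n; jj += block) {
        const std::size_t j_end = std::min(jj + block, n);
        // Multiply one tile of A by one tile of B.
        for (std::size_t i = ii; i < i_end; ++i) {
          for (std::size_t k = kk; k < k_end; ++k) {
            const double a_ik = a[i * n + k];
            for (std::size_t j = jj; j < j_end; ++j) {
              c[i * n + j] += a_ik * b[k * n + j];
            }
          }
        }
      }
    }
  }
}
```

With 8-byte elements a 64-byte cache line holds 8 values, and 384, 768, and 1152 are all multiples of 8, so under these assumptions every row of an aligned matrix starts on a cache-line boundary. A block width that is also a multiple of 8 then keeps tile rows from straddling lines.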