Algorithms implemented in CUDA + resources about GPGPU
This is a curated list of examples of general-purpose computing on GPUs, together with related libraries and papers.
Vector addition - Simple, fast addition of one-dimensional vectors [CUDA]
Sum of elements in an array - Parallel sum of elements in an array [CUDA]
cuBlas SAXPY - Implementation of SAXPY with cuBlas [CUDA]
2D convolution - Naïve implementation of 2D convolution [CUDA]
Median filter - Median filter with arbitrary size kernel [CUDA]
Sobel edge-detection filter - Parallel implementation of the Sobel operator, which is used in image processing [CUDA]
K Means clustering - Fast Lloyd's K Means on GPU. Shared memory and a two-step reduction (partial and global) are used to find the cluster centers [CUDA]
Fuzzy C Means clustering - Fuzzy C Means. Shared memory and a two-step reduction (partial and global) are used to find the cluster centers [CUDA]
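The two-step reduction mentioned for the clustering examples can be illustrated with a small host-side C++ sketch. It is not taken from the linked implementations: the names `Partial`, `reduce_chunk` and `update_centers` are hypothetical, the points are one-dimensional for brevity, and each chunk stands in for what a CUDA thread block would accumulate in shared memory.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Per-cluster running sum and point count (1-D points for brevity).
struct Partial {
    std::vector<double> sum;
    std::vector<std::size_t> count;
};

// Step 1 (partial): reduce one chunk of points into per-cluster sums.
// Assumption: on a GPU this would be one thread block using shared memory.
Partial reduce_chunk(const std::vector<double>& points,
                     const std::vector<int>& labels,
                     std::size_t begin, std::size_t end, std::size_t k) {
    Partial p{std::vector<double>(k, 0.0), std::vector<std::size_t>(k, 0u)};
    for (std::size_t i = begin; i < end; ++i) {
        const std::size_t c = static_cast<std::size_t>(labels[i]);
        p.sum[c] += points[i];
        p.count[c] += 1;
    }
    return p;
}

// Step 2 (global): merge the per-chunk partials, then divide by the counts.
std::vector<double> update_centers(const std::vector<double>& points,
                                   const std::vector<int>& labels,
                                   std::size_t k, std::size_t chunk_size) {
    assert(points.size() == labels.size() && k > 0 && chunk_size > 0);

    std::vector<Partial> partials;
    for (std::size_t b = 0; b < points.size(); b += chunk_size) {
        const std::size_t e = std::min(points.size(), b + chunk_size);
        partials.push_back(reduce_chunk(points, labels, b, e, k));
    }

    Partial total{std::vector<double>(k, 0.0), std::vector<std::size_t>(k, 0u)};
    for (const Partial& p : partials) {
        for (std::size_t c = 0; c < k; ++c) {
            total.sum[c] += p.sum[c];
            total.count[c] += p.count[c];
        }
    }

    // Empty clusters keep a center of 0.0 in this sketch.
    std::vector<double> centers(k, 0.0);
    for (std::size_t c = 0; c < k; ++c) {
        if (total.count[c] > 0) {
            centers[c] = total.sum[c] / static_cast<double>(total.count[c]);
        }
    }
    return centers;
}
```

Calling `update_centers` with the current labels after each assignment step gives the new centers. Splitting the work this way keeps most of the accumulation local to a chunk, so only one small partial result per chunk has to be combined globally.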
CUDA is a parallel computing platform and programming model developed by NVIDIA for general computing on graphics processing units (GPUs).
Thrust is a powerful library of parallel algorithms and data structures. Thrust provides a flexible, high-level interface for GPU programming that greatly enhances developer productivity. Using Thrust, C++ developers can write just a few lines of code to perform GPU-accelerated sort, scan, transform, and reduction operations orders of magnitude faster than the latest multi-core CPUs. For example, the thrust::sort algorithm delivers 5x to 100x faster sorting performance than STL and TBB.
OpenCL is the open, royalty-free standard for cross-platform, parallel programming of diverse processors found in personal computers, servers, mobile devices and embedded platforms. OpenCL greatly improves the speed and responsiveness of a wide spectrum of applications in numerous market categories including gaming and entertainment titles, scientific and medical software, professional creative tools, vision processing, and neural network training and inferencing.
Boost.Compute is a GPU/parallel-computing library for C++ based on OpenCL. The core library is a thin C++ wrapper over the OpenCL API and provides access to compute devices, contexts, command queues and memory buffers. On top of the core library is a generic, STL-like interface providing common algorithms (e.g. transform(), accumulate(), sort()) along with common containers (e.g. vector<T>, flat_set<T>). It also features a number of extensions including parallel-computing algorithms (e.g. exclusive_scan(), scatter(), reduce()) and a number of fancy iterators (e.g. transform_iterator<>, permutation_iterator<>, zip_iterator<>).
PyCUDA lets you access NVIDIA's CUDA parallel computation API from Python.
PyOpenCL gives you easy, Pythonic access to the OpenCL parallel computation API.
OpenACC is a user-driven, directive-based, performance-portable parallel programming model designed for scientists and engineers interested in porting their codes to a wide variety of heterogeneous HPC hardware platforms and architectures with significantly less programming effort than required with a low-level model.
Hemi simplifies writing portable CUDA C/C++ code. With Hemi, you can write parallel kernels the way you write for loops, inline in your CPU code, and run them on your GPU.
CUDPP is the CUDA Data Parallel Primitives Library. CUDPP is a library of data-parallel algorithm primitives such as parallel-prefix-sum ("scan"), parallel sort and parallel reduction. Primitives such as these are important building blocks for a wide variety of data-parallel algorithms, including sorting, stream compaction, and building data structures such as trees and summed-area tables.
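To show why scan is described as a building block for stream compaction, here is a minimal sequential C++ sketch. It does not use the CUDPP API: `exclusive_prefix_sum` and `compact` are hypothetical helper names, and the flags are assumed to be 0 or 1. A GPU library would compute the scan in parallel, but the way the scan result is used is the same.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Exclusive prefix sum: out[i] = in[0] + ... + in[i-1], with out[0] = 0.
std::vector<std::size_t> exclusive_prefix_sum(const std::vector<std::size_t>& in) {
    std::vector<std::size_t> out(in.size());
    std::size_t running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;
        running += in[i];
    }
    return out;
}

// Stream compaction: keep data[i] where flags[i] == 1, preserving order.
// Assumption: every flag is either 0 or 1.
std::vector<int> compact(const std::vector<int>& data,
                         const std::vector<std::size_t>& flags) {
    assert(data.size() == flags.size());
    const std::vector<std::size_t> offsets = exclusive_prefix_sum(flags);

    // The last offset plus the last flag is the number of kept elements.
    const std::size_t total =
        data.empty() ? 0 : offsets.back() + flags.back();

    std::vector<int> out(total);
    for (std::size_t i = 0; i < data.size(); ++i) {
        // Each kept element knows its output slot independently of the others,
        // which is what makes this scatter step easy to run in parallel.
        if (flags[i] != 0) {
            out[offsets[i]] = data[i];
        }
    }
    return out;
}
```

The scan turns the keep/drop flags into output positions, so the final write needs no coordination between elements.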
Awesome CUDA by Erkaman is a list of useful libraries and resources for CUDA development
CUDA Awesome by gmarciani is a collection of awesome algorithms, implemented in CUDA