HLS-based Graph Processing Framework on FPGAs
ThunderGP for HBM-Enabled FPGA platforms has been released. Please see the branch.
ThunderGP has been published in ACM Transactions on Reconfigurable Technology and Systems (TRETS), as one of "Best Papers in FPGA 2021".
ThunderGP won third place in the 2020 Xilinx Adaptive Computing Developer Contest (top 9 out of 79 teams).
ThunderGP has been accepted to FPGA 2021. Read the paper.
ThunderGP is featured at Xilinx Apps and Libraries.
ThunderGP was presented at the XACC@NUS Workshop Series 2020: Reconfigurable Computing Systems. See Slides, Video/YouTube, Video/bilibili.
ThunderGP enables data scientists to enjoy the performance of FPGA-based graph processing without compromising programmability. To the best of our knowledge and experiments, it is the fastest HLS-based graph processing framework on FPGAs.
Two aspects allow ThunderGP to deliver superior performance. On the one hand, ThunderGP adopts an improved execution flow to better exploit the pipeline parallelism of the FPGA and to reduce the amount of data accessed from global memory. On the other hand, its memory accesses are highly optimized to fully utilize the memory bandwidth of the hardware platform.
On Xilinx multi-SLR FPGAs, ThunderGP runs at 250 MHz and delivers up to 6,400 MTEPS (million traversed edges per second), a 2.9x speedup over the state of the art.
ThunderGP currently provides seven built-in graph algorithms: PageRank (PR), Sparse Matrix-Vector Multiplication (SpMV), Breadth-First Search (BFS), Single Source Shortest Path (SSSP), Closeness Centrality (CC), ArticleRank (AR), and Weakly Connected Component (WCC).
The desired application can be built by passing the argument app=[algorithm] to the make command. The table below serves as a quick reference.
Argument | Accelerated algorithm |
---|---|
app=pr | PageRank (PR) |
app=spmv | Sparse Matrix-Vector Multiplication (SpMV) |
app=bfs | Breadth-First Search (BFS) |
app=sssp | Single Source Shortest Path (SSSP) |
app=cc | Closeness Centrality (CC) |
app=ar | ArticleRank (AR) |
app=wcc | Weakly Connected Component (WCC) |
```sh
$ git clone https://github.com/Xtra-Computing/ThunderGP.git
$ cd ThunderGP
$ vim ThunderGP.mk
$ # configure the DEVICE as DEVICES := xilinx_u250_xdma_201830_2; configure TARGETS := hw
$ make app=pr all # build the host executable and the FPGA bitstream. It takes time :)
# For execution on real hardware, the path to the graph dataset must be provided by the user.
$ ./host_graph_fpga_pr xclbin_pr/*.xclbin ./dataset/rmat-14-32.txt
```
Throughput (MTEPS) of the graph processing algorithms over various datasets on the VCU1525 platform:
Dataset | PR | SpMV | BFS | SSSP | CC | AR | WCC |
---|---|---|---|---|---|---|---|
R21 | 5,015 | 4,190 | 5,417 | 3,901 | 4,623 | 4,848 | 4,584 |
R24 | 4,599 | 3,781 | 3,437 | 3,430 | 4,339 | 4,486 | 4,328 |
G24 | 5,039 | 4,037 | 5,216 | 3,725 | 4,752 | 4,927 | 4,704 |
G25 | 4,464 | 3,615 | 4,660 | 3,343 | 4,344 | 4,389 | 4,356 |
WT | 2,884 | 2,874 | 2,717 | 2,427 | 2,776 | 2,833 | 2,776 |
MG | 4,454 | 3,883 | 4,939 | 3,699 | 4,077 | 4,285 | 4,088 |
PK | 4,001 | 3,729 | 4,251 | 3,169 | 3,833 | 3,909 | 3,716 |
WP | 3,030 | 2,994 | 3,112 | 2,491 | 2,993 | 2,931 | 2,894 |
LJ | 3,186 | 3,003 | 3,408 | 2,623 | 3,113 | 3,081 | 3,099 |
TW | 2,938 | 2,801 | 2,120 | 2,425 | 2,962 | 2,853 | 2,894 |
Throughput (MTEPS) of the graph processing algorithms over various datasets on the U250 platform:
Dataset | PR | SpMV | BFS | SSSP | CC | AR | WCC |
---|---|---|---|---|---|---|---|
R21 | 4,669 | 5,056 | 6,028 | 4,879 | 4,783 | 4,667 | 4,901 |
R24 | 4,732 | 4,946 | 5,897 | 4,285 | 4,939 | 4,732 | 4,988 |
G24 | 5,040 | 5,305 | 5,772 | 4,428 | 3,705 | 5,040 | 5,303 |
G25 | 4,978 | 4,072 | 4,974 | 3,864 | 3,661 | 4,984 | 5,254 |
WT | 2,251 | 2,938 | 2,630 | 2,583 | 2,369 | 2,253 | 2,405 |
MG | 3,756 | 4,195 | 4,949 | 4,378 | 3,914 | 3,737 | 3,891 |
PK | 3,630 | 4,372 | 4,629 | 3,927 | 3,865 | 3,662 | 3,841 |
WP | 3,255 | 3,652 | 4,058 | 3,417 | 3,341 | 3,259 | 3,432 |
LJ | 3,342 | 3,693 | 4,329 | 3,614 | 3,557 | 3,328 | 3,708 |
TW | 3,538 | 3,959 | 3,671 | 3,585 | 3,759 | 3,533 | 3,806 |
Benefiting from the high-level abstraction of HLS, our APIs natively support C/C++. ThunderGP provides three levels of APIs for implementation or further exploration.
The L1 and L2 APIs are for building the accelerators, while the L3 APIs are for the host program. Details are as below:
L1 is used to construct the basic modules to build the compute kernels and the dataflow.
L2 provides hooks for mapping graph processing algorithms.
L3 provides high-level APIs on the host side to deploy or control the graph processing accelerators. Since recent FPGAs usually consist of multiple super logic regions (SLRs), L3 also wraps the partition scheduling and memory management interfaces for multiple SLRs.
More details: ThunderGP APIs
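To give a flavour of how the L2 hooks map an algorithm onto the accelerator template, the sketch below shows how PageRank-style scatter, gather, and apply callbacks could look. The hook names (scatterCalc, gatherAccum, applyCalc), the fixed-point property type, and the scaling scheme are assumptions made purely for illustration; the exact hook signatures are documented on the ThunderGP APIs page linked above.

```cpp
#include <cstdint>

// Illustrative only: these hook names and types are assumptions, not the exact
// ThunderGP L2 API. They show the shape of the algorithm-specific callbacks
// the framework plugs into its Scatter/Gather/Apply pipeline for PageRank.
typedef uint32_t prop_t;  // fixed-point vertex property (scaled by 100)

// Scatter hook: compute the update value propagated along an edge from the
// source vertex property. PageRank ignores the edge weight; the source
// property is assumed to be rank/outDegree, pre-scaled in the Apply stage.
inline prop_t scatterCalc(prop_t srcProp, prop_t edgeWeight) {
    (void)edgeWeight;
    return srcProp;
}

// Gather hook: merge an incoming update into the destination vertex's
// on-chip accumulator (PageRank sums the contributions).
inline prop_t gatherAccum(prop_t accum, prop_t update) {
    return accum + update;
}

// Apply hook: produce the vertex property for the next iteration from the
// accumulated value, folding in the damping factor d = 0.85 and pre-scaling
// by the out-degree for the next Scatter stage.
inline prop_t applyCalc(prop_t accum, uint32_t outDegree) {
    const prop_t base = 15;                      // (1 - d) scaled by 100
    prop_t newRank = base + (85 * accum) / 100;  // d * accumulated rank
    return outDegree ? newRank / outDegree : newRank;
}
```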
The Gather-Apply-Scatter (GAS) model is widely used as the computation model of FPGA-based graph processing frameworks due to its extensibility to various graph processing algorithms. ThunderGP adopts a simplified version of the GAS model, following the work On-the-fly-data-shuffling-for-OpenCL-based-FPGAs. This model updates vertex properties by propagating values from source vertices to destination vertices. The input to the model is an unordered set of the graph's directed edges; an undirected edge is represented by a pair of directed edges.
The processing of each iteration mainly consists of three stages: Scatter, Gather, and Apply. In the Scatter stage, an update tuple is generated for the destination vertex of each input edge <src, dst, weight>. The update tuple has the format <dst, value>, where dst is the destination vertex of the edge and value is produced from the source vertex property and the edge weight.
As shown in the diagram above, the edges of one partition are streamed into the Scatter stage. For each edge, the property of the source vertex is fetched from global memory through the prefetching and caching modules, while the edge property (its weight) is streamed in from global memory at the same time. These two values then go through an algorithm-specific computation that returns an update to the property of the destination vertex; at the end of the Scatter stage, this update value is combined with the destination vertex of the edge to form an update tuple. The update tuples are streamed into the Shuffle stage, which dispatches them to the corresponding Gather processing engines (PEs). The Gather PEs accumulate the update values in local on-chip memory, which caches the properties of the destination vertices. After all edges of the partition have been processed, the data cached in the Gather PEs is written back to global memory, and the Apply stage calls an algorithm-specific function to update all vertices for the next iteration.
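To make this flow concrete, below is a minimal software sketch of one iteration over a single edge partition, following the Scatter, Shuffle, Gather, and Apply steps described above. It is plain C++ rather than the actual HLS kernel code, and the edge/tuple types, the number of Gather PEs, and the multiply-accumulate used as the algorithm-specific functions are illustrative assumptions.

```cpp
#include <cstdint>
#include <vector>

struct Edge  { uint32_t src, dst, weight; };     // <src, dst, weight>
struct Tuple { uint32_t dst; uint32_t value; };  // update tuple <dst, value>

constexpr int NUM_PE = 4;  // illustrative number of Gather PEs

// One iteration over a single edge partition, following the
// Scatter -> Shuffle -> Gather -> Apply flow described above.
void processPartition(const std::vector<Edge>& edges,
                      std::vector<uint32_t>& vertexProp)  // stands in for global memory
{
    // Per-PE on-chip buffers caching the destination-vertex accumulators.
    std::vector<std::vector<uint32_t>> accum(
        NUM_PE, std::vector<uint32_t>(vertexProp.size(), 0));

    // Scatter: fetch the source property and the edge weight, run the
    // algorithm-specific computation, and emit an update tuple.
    // Shuffle: dispatch the tuple to the PE that owns the destination vertex.
    for (const Edge& e : edges) {
        uint32_t srcProp = vertexProp[e.src];    // prefetched/cached in hardware
        Tuple t{e.dst, srcProp * e.weight};      // algorithm-specific scatter (assumed)
        int pe = t.dst % NUM_PE;                 // shuffle by destination vertex
        accum[pe][t.dst] += t.value;             // Gather: accumulate on-chip
    }

    // Write-back and Apply: merge the cached accumulators into global memory
    // and run the algorithm-specific apply function for the next iteration.
    for (size_t v = 0; v < vertexProp.size(); ++v) {
        uint32_t sum = 0;
        for (int pe = 0; pe < NUM_PE; ++pe) sum += accum[pe][v];
        vertexProp[v] = sum;                     // algorithm-specific apply (assumed)
    }
}
```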
If you use ThunderGP in your paper, please cite our work (full version).
@article{10.1145/3517141,
author = {Chen, Xinyu and Cheng, Feng and Tan, Hongshi and Chen, Yao and He, Bingsheng and Wong, Weng-Fai and Chen, Deming},
title = {ThunderGP: Resource-Efficient Graph Processing Framework on FPGAs with HLS},
year = {2022},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
issn = {1936-7406},
url = {https://doi.org/10.1145/3517141},
doi = {10.1145/3517141},
note = {Just Accepted},
journal = {ACM Trans. Reconfigurable Technol. Syst.},
month = {feb},
keywords = {FPGA; HBM; HLS; Graph Processing; Framework}
}
@inbook{10.1145/3431920.3439290,
author = {Chen, Xinyu and Tan, Hongshi and Chen, Yao and He, Bingsheng and Wong, Weng-Fai and Chen, Deming},
title = {ThunderGP: HLS-Based Graph Processing Framework on FPGAs},
year = {2021},
url = {https://doi.org/10.1145/3431920.3439290},
booktitle = {The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays},
pages = {69–80},
numpages = {12}
}