CHARM: Composing Heterogeneous Accelerators on Versal ACAP Architecture
Principal Investigator: Dr. Peipei Zhou, https://peipeizhou-eecs.github.io/
Ph.D. Students: Jinming Zhuang (Lead) and Zhuoping Yang
Faculty Collaborators: Drs. Jingtong Hu, Alex Jones, Deming Chen, and Jason Cong
Student Collaborators: Jason Lau and Hanchen Ye
AMD Collaborators: Stephen Neuendorffer, Jack Lo, and Kristof Denolf
ACM PDF: https://doi.org/10.1145/3543622.3573210
Author Version PDF: https://peipeizhou-eecs.github.io/publication/fpga23/
Author Version PDF: https://arxiv.org/pdf/2305.18698.pdf
ACM PDF: https://doi.org/10.1145/3626202.3637569
python project_setup.py
import numpy as np
from charm import *
#Define the left-hand-side (A) and right-hand-side (B) operands
A=np.random.rand(4096, 4096).astype(np.float32)
B=np.random.rand(4096, 4096).astype(np.float32)
#Create the object of the class charm
automm=charm(prj_dir)
#Launch charm dse to find optimized hardware configuration
Versal_config=automm.cdse(A,B)
#Launch charm automatic code generator to emit the code for AIE, PL and Host CPU
device='vck190' # Supported devices are vck190 and vck5000
automm.cacg(Versal_config,device)
#Run Vitis Compilation Flow
automm.build()
In this repo, we use general-purpose Matrix-Matrix Multiplication (GEMM) applications as an example and provide a detailed description of how to build a system-level design on the AMD Versal VCK190 platform. By going through this repo, users can learn about:
We provide an automatic code generation and compilation flow with which users can build the system on Versal step by step by changing the configuration files.
To play with the Charming Accelerators, the following software and hardware dependencies are required:
unzip xilinx_vck190_base_202110_1.zip
tar -xf xilinx-versal-common-v2021.1.tar.gz
cd xilinx-versal-common-v2021.1
sh sdk.sh
PLATFORM=${PATH}/xilinx_vck190_base_202110_1/xilinx_vck190_base_202110_1.xpfm
SYSROOT=${PATH}/sysroots/cortexa72-cortexa53-xilinx-linux
EDGE_COMMON_SW=${PATH}/xilinx-versal-common-v2021.1
source /opt/tools/xilinx/Vitis/2021.1/settings64.sh
source /opt/xilinx/xrt/setup.sh
unset LD_LIBRARY_PATH  # if needed
source ${PATH}/environment-setup-cortexa72-cortexa53-xilinx-linux
Users can generate the customized project by setting up the configuration file and directly running the following command:
./project_setup.sh ./config_files/input.cfg ${Project_DIR}
cd ${Project_DIR}
make all PLATFORM=${PATH} EDGE_COMMON_SW_PATH=${PATH} SYSROOT_PATH=${PATH}
After copying the SD card image to a micro SD card and booting the system, run the following commands to get the execution results. {M}, {K}, and {N} refer to the dimensions of the MM. To amortize the overhead of the API calls when running the kernel, users can specify the number of times ({iteration}) the MM is run; the program then reports the average throughput. To verify the correctness of the MM kernel, set {verify} to 1; otherwise set it to 0. For example, to run a 1024*1024*1024 MM for 100 iterations without verifying the result:
./hostexe mm_hw.xclbin 1024 1024 1024 100 0
cd /mnt/sd-mmcblk0p1
./hostexe mm_hw.xclbin {M} {K} {N} {iteration} {verify}
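As a rough illustration of how an average throughput can be derived from these arguments, the sketch below assumes the usual GEMM operation count of 2*M*K*N floating-point operations per run. The function name `gemm_gflops` and the `total_seconds` measurement are hypothetical and are not taken from the host code in this repo.

```python
def gemm_gflops(m, k, n, iterations, total_seconds):
    """Average throughput in GFLOPS for `iterations` runs of an M*K*N MM.

    Assumes each run performs 2*M*K*N floating-point operations
    (one multiply and one add per multiply-accumulate).
    """
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    total_flops = 2.0 * m * k * n * iterations
    return total_flops / total_seconds / 1e9


# Example: 100 iterations of 1024*1024*1024 that took a (hypothetical) 1.0 s in total.
example_gflops = gemm_gflops(1024, 1024, 1024, 100, 1.0)
print(f"{example_gflops:.2f} GFLOPS")
```

Averaging over many iterations spreads the fixed cost of each kernel launch over more work, which is why a larger {iteration} value gives a steadier number.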
In this part, we first introduce the overall MM tiling strategy, which includes four levels of tiling. In the later parts, we illustrate how we handle each of these levels.
Given a large Matrix Multiplication (MM) with size (M*K) * (K*N), referred to as M*K*N, the listing below shows four levels of tiling to handle this MM (from innermost to outermost):
The figure on the right visualizes the on-chip buffer level tiling. We refer to the MM calculated on a single AIE as the "Tile" level and to the MM unrolled across the AIE array as the "Batch" level. The strategy for mapping the tiled MM onto the AIE array is illustrated later.
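The four tiling levels can be sketched in NumPy as nested loop groups. This is only a functional model under stated assumptions: the parameter names `tile` (TI, TK, TJ), `array` (unroll factors across the AIE array), and `buf` (number of batches held in on-chip buffers) are illustrative, the small default sizes are chosen so the example runs quickly, and the hardware executes the "Batch" level in parallel rather than as a loop.

```python
import itertools

import numpy as np


def tiled_mm(lhs, rhs, tile=(4, 4, 4), array=(2, 2, 2), buf=(2, 1, 2)):
    """Functional model of a four-level tiled MM (illustrative parameter names)."""
    m, k = lhs.shape
    k_rhs, n = rhs.shape
    if k != k_rhs:
        raise ValueError("inner dimensions do not match")
    ti, tk, tj = tile                                             # level 1: single-AIE "Tile"
    bi, bk, bj = ti * array[0], tk * array[1], tj * array[2]      # level 2: "Batch"
    ci, ck, cj = bi * buf[0], bk * buf[1], bj * buf[2]            # level 3: on-chip buffers
    if m % ci or k % ck or n % cj:
        raise ValueError("matrix sizes must be multiples of the buffer-level tile")

    out = np.zeros((m, n), dtype=lhs.dtype)
    # Level 4: temporal loop over blocks fetched from off-chip memory.
    for i3, k3, j3 in itertools.product(
            range(0, m, ci), range(0, k, ck), range(0, n, cj)):
        # Level 3: sequential processing of batches held in on-chip buffers.
        for i2, k2, j2 in itertools.product(
                range(i3, i3 + ci, bi), range(k3, k3 + ck, bk), range(j3, j3 + cj, bj)):
            # Level 2: tiles of one batch (spatially unrolled on real hardware).
            for i1, k1, j1 in itertools.product(
                    range(i2, i2 + bi, ti), range(k2, k2 + bk, tk), range(j2, j2 + bj, tj)):
                # Level 1: the MM computed by a single AIE.
                out[i1:i1 + ti, j1:j1 + tj] += (
                    lhs[i1:i1 + ti, k1:k1 + tk] @ rhs[k1:k1 + tk, j1:j1 + tj])
    return out


rng = np.random.default_rng(0)
lhs = rng.random((32, 16))
rhs = rng.random((16, 32))
result = tiled_mm(lhs, rhs)
reference = lhs @ rhs
print(np.allclose(result, reference))
```

The model only checks that the loop decomposition covers every tile exactly once; it says nothing about performance or data movement cost.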
In this part, we demonstrate the coding style of calculating MM with size TI*TK*TJ in a single AIE which corresponds to the first level of tiling.
AIE is a very-long instruction word (VLIW) processor which can issue up to seven operations in parallel using one VLIW word:
The key challenge of programming a single AIE is keeping instructions issued back to back while working within the 32KB local memory and 2KB local registers of a single AIE (for integer data types there are additional 3KB accumulator registers).
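To make the local memory constraint concrete, the following back-of-the-envelope sketch estimates the buffer footprint of a TI*TK*TJ tile. It assumes fp32 operands (4 bytes each) and that the LHS, RHS, and output tiles are all double-buffered inside one AIE's local memory; this buffering scheme and the helper name `tile_footprint_bytes` are assumptions for illustration, not a description of the kernel in this repo.

```python
LOCAL_MEM_BYTES = 32 * 1024  # 32KB local memory of a single AIE


def tile_footprint_bytes(ti, tk, tj, bytes_per_elem=4, double_buffer=True):
    """Bytes needed to hold the LHS (ti*tk), RHS (tk*tj), and output (ti*tj) tiles."""
    lhs = ti * tk * bytes_per_elem
    rhs = tk * tj * bytes_per_elem
    out = ti * tj * bytes_per_elem
    copies = 2 if double_buffer else 1  # assumed ping-pong buffering
    return copies * (lhs + rhs + out)


need = tile_footprint_bytes(32, 32, 32)
print(need, need <= LOCAL_MEM_BYTES)
```

Under these assumptions a 32*32*32 fp32 tile needs 24576 bytes and fits, whereas doubling TI and TJ to 64 would not.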
We provide our source code that achieves 95% efficiency when calculating 32*32*32 MM in src/aie/mm_kernel0.cc. The visualization of the algorithm is shown below:
The insights of programming AIE are:
The tools for automatically generating the source code are under the "src_gen" folder.
ACG takes platform information and a user-specified design point as input and automatically generates the system-level design by launching the following four template-based components sequentially:
Kernel_Gen: Kernel_Gen is launched to generate both the single AI Engine (AIE) C code and the adaptive data flow (ADF) graph code in C++ for verifying the correctness of a single kernel design. MM kernels with the fp32 data type, in different shapes that fit in a single kernel, are supported in the current version.
AIE_ArrGen: AIE_ArrGen is launched to generate new ADF graph code that defines how packet-switch streams are connected to the AIE array, which contains 400 AIEs. Scaling out a single kernel that calculates a 32x32x32 MM with the fp32 data type to the AIE array is supported.
PL_Gen: Based on the AIE array created by AIE_ArrGen, PL_Gen is launched to generate PL streams, scheduling-controller C/C++ HLS modules that communicate with the AIE array and the PL on-chip buffers, and off-chip AXI data transfer modules that communicate with DDR. Different system-level designs, varying in on-chip buffer size and implementation option (BRAM or URAM), are supported for the fp32 data type.
Host_Gen: Host_Gen is launched to generate the system control logic running on the ARM CPU by using AMD XRT APIs.
Compilation
After code generation, the vendor tools, the AIE compiler and the V++ compiler, take the ADF graph and the HLS C/C++ code as input, respectively. Their outputs, libadf.a and kernel.xo, are linked into an xclbin file that contains the hardware information of the design for the target platform. The C++ compiler compiles the XRT API-based host code into an executable that runs on the CPU.
We provide a configuration file template at "./config_files/input.cfg". Users can specify the platform, data type, kernel type, and mapping strategy of each level in this file. The feasible options for each parameter are given in parentheses. The rules for using this configuration file are listed below:
Platform:VCK190;
DATA_TYPE:fp32;
KernelGen:1;
KRL_TYPE:0;
I:32;
K:32;
J:32;
AIEArrGen:1;
NUM_PACK:4;
A:6;
B:4;
C:16;
A_BRO:4;
C_BRO:3;
SysGen:1;
X:8;
Y:1;
Z:2;
LHS_BUFF:0;
RHS_BUFF:0;
OUT_BUFF:1;
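A minimal sketch of reading this `key:value;` format and sanity-checking a design point is shown below. The parser `parse_cfg` is hypothetical and is not the one used by project_setup.sh. The derived quantities rest on an assumed interpretation: that A, B, and C are the unroll factors across the AIE array (so A*B*C AIEs are used) and that X, Y, and Z count the batches held in on-chip buffers.

```python
CFG_TEXT = """
Platform:VCK190;
DATA_TYPE:fp32;
KernelGen:1;
KRL_TYPE:0;
I:32;
K:32;
J:32;
AIEArrGen:1;
NUM_PACK:4;
A:6;
B:4;
C:16;
A_BRO:4;
C_BRO:3;
SysGen:1;
X:8;
Y:1;
Z:2;
LHS_BUFF:0;
RHS_BUFF:0;
OUT_BUFF:1;
"""


def parse_cfg(text):
    """Parse 'key:value;' lines into a dict, converting integer values."""
    cfg = {}
    for raw in text.splitlines():
        line = raw.strip().rstrip(";")
        if not line or ":" not in line:
            continue  # skip blank or malformed lines
        key, value = (part.strip() for part in line.split(":", 1))
        cfg[key] = int(value) if value.isdigit() else value
    return cfg


cfg = parse_cfg(CFG_TEXT)

# Assumed interpretation: A*B*C AIEs are used out of the 400 in the array.
num_aie = cfg["A"] * cfg["B"] * cfg["C"]
# Assumed interpretation: MM size covered by the on-chip buffers (M, K, N).
onchip_mm = (cfg["I"] * cfg["A"] * cfg["X"],
             cfg["K"] * cfg["B"] * cfg["Y"],
             cfg["J"] * cfg["C"] * cfg["Z"])
print(num_aie, onchip_mm)
```

Under this reading, the template uses 384 of the 400 AIEs and covers a 1536*128*1024 MM per on-chip buffer load.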
We provide four applications under the example folder including BERT for natural language processing, NCF for recommendations, ViT for vision classification, MLP for multi-layer perceptron classification or regression. The expected throughput should be the same as the results shown in the following figure:
To quickly reproduce the results, we provide the pre-built object files of AIE, PL, and ARM CPU in the pre_built folder. Users can go to the corresponding folder and run the following command to create the sd card image for onboard execution.
make package EDGE_COMMON_SW_PATH=${PATH} SYSROOT_PATH=${PATH}
We acknowledge the support from the University of Pittsburgh New Faculty Start-up Grant, NSF awards #2213701, #2217003, and the support from CRISP, one of six SRC JUMP centers. We thank AMD/Xilinx for FPGA and software donation, and support from the AMD/Xilinx Center of Excellence at UIUC, the AMD/Xilinx Heterogeneous Accelerated Compute Cluster at UCLA, and the Center for Research Computing (CRC) at the University of Pittsburgh.
References:
[1] AIE Architecture(AM009 2021.1)
[2] AIE Instructions and APIs(UG1078 UG1529)
[3] AIE Coding Example(UG1079 2021.1)
[4] Versal Programming Environment(UG1076 2021.1)
[5] Introduction to FP32 programming of AIE