FPGA-based GZIP (deflate) data compressor. Input raw data and output standard GZIP format (as known as .gz file format). 基于 FPGA 的流式 GZIP 压缩器,用于通用数据压缩。输入原始数据,输出标准的 GZIP 格式,即常见的 .gz / .tar.gz 文件的格式。
An FPGA-based streaming GZIP (deflate) compressor supports both LZ77 and dynamic huffman. For universal lossless data compression. Input raw data and output the standard GZIP format (as known as .gz / .tar.gz file format).
AXI-stream input interface:
AXI-stream output interface:
o_tready=1
always, then the input interface will also have no back pressure, i.e. o_tready=1
always (even in the worst case).
Complies with deflate algorithm specification (RFC1951 [1]) and GZIP format specification (RFC1952 [2]).
deflate block :
LZ77 Compression:
Dynamic huffman coding:
The compression ratio of this design:
According to GZIP spec, calculates CRC32 of raw data and places it at GZIP footer.
The design source code is in RTL directory. In which gzip_compressor_top.v is the top module.
The in/out signals of gzip_compressor_top is as follows:
module gzip_compressor_top # (
parameter SIMULATION = 0 // 0:disable simulation assert (for normal use) 1: enable simulation assert (for simulation)
) (
input wire rstn, // asynchronous reset. 0:reset 1:normally use
input wire clk,
// input stream : AXI-stream slave, 1 byte width (thus do not need tkeep and tstrb)
output wire i_tready,
input wire i_tvalid,
input wire [ 7:0] i_tdata,
input wire i_tlast,
// output stream : AXI-stream master, 4 byte width
input wire o_tready,
output reg o_tvalid,
output reg [31:0] o_tdata,
output reg o_tlast,
output reg [ 3:0] o_tkeep // At the end of packet (tlast=1), tkeep may be 4'b0001, 4'b0011, 4'b0111, or 4'b1111. In other cases, tkeep can only be 4'b1111
);
rstn=0
initial
register, it is necessary to reset before use.
The input interface is a standard 8-bit wide AXI-stream slave.
i_tvalid
handshakes with i_tready
. Successfully input 1 data only when i_tvalid=1
and i_tready=1
simultaneously (see following figure).i_tdata
is one-byte input data.i_tlast
is the boundary mark of the package, i_tlast=1
means that the current byte is the last byte of a packet, and the next transmitted byte is the first byte of the next packet. Each packet is compressed into an independent GZIP data stream. _ __ __ __ __ __ __ __ __ __ __ __
clk \__/ \__/ \__/ \__/ \__/ \__/ \__/ \__/ \__/ \__/ \__/ \
_____________________________ _____
tvalid ___________/ \___________/ \________
_________________ ________________________________
tready \_________________/
_____ _______________________ _____
tdata XXXXXXXXXXXX__D1_X___________D2__________XXXXXXXXXXXXX__D3_XXXXXXXXX
The output interface is a standard 32 bit (4byte) wide AXI-stream master.
o_tvalid
handshakes with o_tready
. Successfully input one data only when i_tvalid=1
and i_tready=1
simultaneously (as same as input AXI-stream).o_tdata
is 4-byte output data. According to AXI-stream spec, o_tdata
is small endian, i.e. o_tdata[7:0]
is the earliest byte, o_tdata[31:24]
is the latest byte.o_tlast
is the boundary mark of the package. Each packet an GZIP data stream.o_tkeep
is byte-valid signal :
o_tkeep[0]=1
means o_tdata[7:0]
exist,o_tkeep[1]=1
means o_tdata[15:8]
exist,o_tkeep[2]=1
means o_tdata[23:16]
exist,o_tkeep[3]=1
means o_tdata[31:24]
exist.o_tlast=1
), o_tkeep
be 4'b0001
, 4'b0011
, or 4'b0111
o_tkeep=4’b1111
always.
The output data of AXI-stream satisfies GZIP format standard. After storing each AXI-stream packet in an .gz file, this file can be decompressed by compression softwares (7ZIP, WinRAR, etc.).
Tip: .gz is GZIP compressed files. More well-known files are .tar.gz. In fact, TAR packages multiple files into a single .tar file, and then compresses this .tar file to obtain a .tar.gz file. If a single file is compressed, it can be directly compressed into a .gz without TAR packaging. For example, compress data.txt to data.txt.gz
For example, a total of 987 handshakes were performed on the output AXI-stream interface, and the last handshake was o_tlast=1
, indicating that the 987 cycles of data are an independent GZIP stream. Assuming the last handshake o_tkeep=4'b0001
, then the last cycle only carries 1 byte of data, thus the GZIP stream contains a total of 986×4+1=3949 bytes. If you save these bytes into a. gz file, it should be:
.gz file 1st byte = o_tdata[7:0] of 1st cycle
.gz file 2nd byte = o_tdata[15:8] of 1st cycle
.gz file 3rd byte = o_tdata[23:16] of 1st cycle
.gz file 4th byte = o_tdata[31:24] of 1st cycle
.gz file 5th byte = o_tdata[7:0] of 2nd cycle
.gz file 6th byte = o_tdata[15:8] of 2nd cycle
.gz file 7th byte = o_tdata[23:16] of 2nd cycle
.gz file 8th byte = o_tdata[31:24] of 2nd cycle
......
.gz file 3945th byte = o_tdata[7:0] of 986th cycle
.gz file 3946th byte = o_tdata[15:8] of 986th cycle
.gz file 3947th byte = o_tdata[23:16] of 986th cycle
.gz file 3948th byte = o_tdata[31:24] of 986th cycle
.gz file 3949th byte = o_tdata[7:0] of 987th cycle
o_tready=1
always, then the input interface will also have no back pressure, i.e. o_tready=1
always (even in the worst case).
o_tready=1
, then we can ignore i_tready
. The module can runs in the stable maximum performance.
To evaluate the comprehensive performance of this design, I compare it with the following three deflate compression schemes:
LOWLUT=False
, COMPRESS=True
, DECOMPRESS=False
, MATCH10=True
,FAST=True
The data to be compressed for comparison is raw.hex located in Arty-example/python directory, with a size of 512kB.
The following table shows the comparison results. Note that ↑ represents the larger the better, ↓ represents the smaller the better.
indicators | Software | My design | HDL-deflate | HT-Deflate-FPGA |
---|---|---|---|---|
LZ77 ? | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
hash-based LZ77 search ? | :heavy_check_mark: | :heavy_check_mark: | :x: | :heavy_check_mark: |
dynamic literal code tree ? | :heavy_check_mark: | :heavy_check_mark: | :x: | :x: |
dynamic distance code tree ? | :heavy_check_mark: | :heavy_check_mark: | :x: | :x: |
dynamic length code tree ? | :heavy_check_mark: | :x: | :x: | :x: |
FPGA logic resource (↓) | - | 8200×LUT | 2854×LUT | ? (huge) |
FPGA memory resource (↓) | - | 25×BRAM36K | 8.5×BRAM36K | ? (huge) |
Clock frequency (↑) | - | 128 MHz | 95 MHz | ? |
Clock cycles consumed (↓) | - | 524288 | 1449901 | ? |
Input Bytes/cycle (↑) | - | 1.0 | 0.36 | ? |
Overall Performance (↑) | 44 MB/s | 128 MB/s | 35 MB/s | 10 GB/s level |
Power (↓) | High | Low | Low | High |
Compressed size (↓) | 287 kB | 299 kB | 395 kB | ? |
Compression ratio (↑) | 44% | 41% | 23% | between 23% & 41% :warning: |
:warning: The compression ratio of HT-Deflate-FPGA may larger than HDL-deflate , but smaller than My design. This conclusion is analyzed based on the algorithm characteristics it supports.
The implement result of gzip_compressor_top on multiple FPGAs:
FPGA series | FPGA part | logic | logic (%) | on-chip-mem | on-chip-mem (%) | frequency |
---|---|---|---|---|---|---|
Xilinx Artix-7 | xc7a35ticsg324-1L | 8218*LUT | 40% | 25*BRAM36K | 50% | 128 MHz |
Xilinx Zynq-7 | xc7z020clg484-1 | 8218*LUT | 16% | 25*BRAM36K | 18% | 128 MHz |
Xilinx Virtex-7 | xc7vx485tffg1761-1 | 8201*LUT | 3% | 25*BRAM36K | 3% | 160 MHz |
Xilinx ZU+ | xczu3eg-sbva484-2-e | 8180*LUT | 12% | 24*BRAM36K | 11% | 280 MHz |
Altera Cyclone4 | EP4CE55F23C8 | 16807*LE | 30% | 807398 bits | 34% | 74 MHz |
The simulation source code is in SIM directory. In which tb_gzip_compressor.v is the top module.
The diagram of this testbench is as follows:
Among them, the random data packet generator (tb_random_data_source.v) will generate four types of data packets with different characteristics (random bytes with uniform distribution, random bytes with non-uniform distribution, randomly continuously changing data, and sparse data), which will be sent to the design under test (gzip_compressor_top) for compression, and then tb_save_result_to_file.v module will store the compressed results into files. Each packet is stored into a independent .gz file.
tb_print_crc32 is responsible for calculating the CRC32 of the raw data and printing it. Note that the CRC32 is also calculated internally in gzip_compressor_top and filled into the footer. These two CRC32 calculators are independent (the former is only used for simulation to verify whether the CRC32 generated by the module under test is correct). You can compare the simulated printed CRC32 with the CRC32 in the GZIP file on your own.
You can follow the following steps to simulate using iverilog:
FILE_COUNT
in tb_random_data_source.v.OUT_FILE_PATH
in tb_save_result_to_file.v)
python check_gz_file.py .
The command means: check all .gz files in current directory (.
), i.e. sim_data.
In addition to iverilog, you can use other simulators. Just add all the .v files in RTL and SIM directories to the simulation project, and set tb_gzip_compressor.v to be the top-level file for simulation.
I provided an demo which runs on Arty development board (this demo is also a pure RTL design, which can be ported to other FPGAs).
The FPGA project receives UART (serial port) data, feeds it into the GZIP compressor, and sends the resulting GZIP stream through the serial port (UART: baud rate 115200, no parity).
On the computer, we run a Python program. The execution steps of the program are:
The serial port speed is much lower than the maximum performance that gzip_compressor_top can achieve, so this project is only used for demonstration. To achieve higher performance of the module, another high-speed communication interfaces need to be used.
The following figure is the block diagram of this demo.
Files related to the project:
fpga_uart_gz_file.py
)After programming the FPGA, open a command line in Arty_example/python directory and run the following command:
python fpga_uart_gz_file.py <file_name_to_compress>
For example, running the following command to compress raw.hex :
python fpga_uart_gz_file.py raw.hex
If the compression is successful, we can get raw.hex
without any errors.
基于 FPGA 的流式的 GZIP (deflate 算法) 压缩器。用于通用无损数据压缩:输入原始数据,输出标准的 GZIP 格式,即常见的 .gz / .tar.gz 文件的格式。
AXI-stream 输入接口
AXI-stream 输出接口
依照 deflate 算法规范 (RFC1951 [1]) 和 GZIP 格式规范 (RFC1952 [2]) 编写
deflate block :
LZ77 游程压缩:
动态 huffman 编码:
由于支持了以上功能,本设计的压缩率:
依照 GZIP 的规定,生成原始数据的 CRC32 放在 GZIP 的末尾,用于校验。
RTL 目录包含了 GZIP 压缩器的设计源码,其中的 gzip_compressor_top.v 是 IP 的顶层模块。
gzip_compressor_top 的输入输出信号如下
module gzip_compressor_top # (
parameter SIMULATION = 0 // 0:disable simulation assert (for normal use) 1: enable simulation assert (for simulation)
) (
input wire rstn, // asynchronous reset. 0:reset 1:normally use
input wire clk,
// input stream : AXI-stream slave, 1 byte width (thus do not need tkeep and tstrb)
output wire i_tready,
input wire i_tvalid,
input wire [ 7:0] i_tdata,
input wire i_tlast,
// output stream : AXI-stream master, 4 byte width
input wire o_tready,
output reg o_tvalid,
output reg [31:0] o_tdata,
output reg o_tlast,
output reg [ 3:0] o_tkeep // At the end of packet (tlast=1), tkeep may be 4'b0001, 4'b0011, 4'b0111, or 4'b1111. In other cases, tkeep can only be 4'b1111
);
initial
寄存器初始化的 FPGA 上,使用前必须复位。
输入接口是标准的 8-bit 位宽的 AXI-stream slave
i_tvalid
和 i_tready
构成握手信号,只有同时=1时才成功输入了1个数据 (如下图)。i_tdata
是1字节的输入数据。i_tlast
是包 (packet) 的分界标志,i_tlast=1
意味着当前传输的是一个包的末尾字节,而下一个传输的字节就是下一包的首字节。每个包会被压缩为一个独立的 GZIP 数据流。 _ __ __ __ __ __ __ __ __ __ __ __
clk \__/ \__/ \__/ \__/ \__/ \__/ \__/ \__/ \__/ \__/ \__/ \
_____________________________ _____
tvalid ___________/ \___________/ \________
_________________ ________________________________
tready \_________________/
_____ _______________________ _____
tdata XXXXXXXXXXXX__D1_X___________D2__________XXXXXXXXXXXXX__D3_XXXXXXXXX
输出接口是标准的 32-bit (4字节) 位宽的 AXI-stream master
o_tvalid
和 o_tready
构成握手信号,只有同时=1时才成功输出了1个数据 (类似输入接口) 。o_tdata
是4字节的输出数据。按照 AXI-stream 的规定,o_tdata
是小端序,o_tdata[7:0]
是最靠前的字节,o_data[31:24]
是最靠后的字节。o_tlast
是包的分界标志。每个包是一个独立的 GZIP 数据流。o_tkeep
是字节有效信号:
o_tkeep[0]=1
意为 o_tdata[7:0]
有效,否则无效o_tkeep[1]=1
意为 o_tdata[15:8]
有效,否则无效o_tkeep[2]=1
意为 o_tdata[23:16]
有效,否则无效o_tkeep[3]=1
意为 o_tdata[31:24]
有效,否则无效o_tlast=1
时) ,o_tkeep
才可能为 4'b0001, 4'b0011, 4'b0111
o_tkeep=4’b1111
AXI-stream 接口输出的是满足 GZIP 格式标准的数据,将每个 AXI-stream 包的数据独立存入一个 .gz 文件后,这个文件就可以被众多压缩软件 (7ZIP, WinRAR 等) 解压。
提示: .gz 是 GZIP 压缩文件的概念。更为人熟知的是 .tar.gz 。实际上 TAR 是把多个文件打包成一个 .tar 文件,然后再对这一个 .tar 文件进行 GZIP 压缩得到 .tar.gz 文件。如果对单个文件进行压缩,则可以不用 TAR 打包,直接压缩为一个 .gz 。例如 data.txt 压缩为 data.txt.gz
例如,AXI-stream 接口上一共成功握手了 987 次,最后一次握手时 o_tlast=1
,说明这 987 拍数据是一个独立的 GZIP 流。假设最后一次握手时 o_tkeep=4'b0001
,则最后一拍只携带1字节的数据,则该 GZIP 流一共包含 986×4+1=3949 字节。如果将这些字节存入 .gz 文件,则应该:
.gz 文件的第1字节 对应 第1拍的 o_tdata[7:0]
.gz 文件的第2字节 对应 第1拍的 o_tdata[15:8]
.gz 文件的第3字节 对应 第1拍的 o_tdata[23:16]
.gz 文件的第4字节 对应 第1拍的 o_tdata[31:24]
.gz 文件的第5字节 对应 第2拍的 o_tdata[7:0]
.gz 文件的第6字节 对应 第2拍的 o_tdata[15:8]
.gz 文件的第7字节 对应 第2拍的 o_tdata[23:16]
.gz 文件的第8字节 对应 第2拍的 o_tdata[31:24]
......
.gz 文件的第3945字节 对应 第986拍的 o_tdata[7:0]
.gz 文件的第3946字节 对应 第986拍的 o_tdata[15:8]
.gz 文件的第3947字节 对应 第986拍的 o_tdata[23:16]
.gz 文件的第3948字节 对应 第986拍的 o_tdata[31:24]
.gz 文件的第3949字节 对应 第987拍的 o_tdata[7:0]
o_tready
始终=1,则输入接口也一定无反压,也即 o_tready
始终=1 (即使在最坏情况下) 。
o_tready
始终=1 ,则可忽略 i_tready
信号,任何时候都可以让 i_tvalid=1
来输入一个字节。
为了评估本设计的综合表现,我将本设计与以下 3 种 deflate 压缩方案进行对比:
LOWLUT=False
, COMPRESS=True
, DECOMPRESS=False
, MATCH10=True
,FAST=True
用于对比的待压缩数据是 Arty-example/python 目录下的 raw.hex ,它的大小是 512kB 。
下表展示对比结果。在指标中,↑代表越大越好, ↓代表越小越好
评价指标 | 软件压缩 | 本设计 | HDL-deflate | HT-Deflate-FPGA |
---|---|---|---|---|
支持 LZ77 ? | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
支持哈希表搜索 LZ77 ? | :heavy_check_mark: | :heavy_check_mark: | :x: | :heavy_check_mark: |
支持动态 literal code tree ? | :heavy_check_mark: | :heavy_check_mark: | :x: | :x: |
支持动态 distance code tree ? | :heavy_check_mark: | :heavy_check_mark: | :x: | :x: |
支持动态 length code tree ? | :heavy_check_mark: | :x: | :x: | :x: |
FPGA逻辑资源 (↓) | - | 8200×LUT | 2854×LUT | ? (较大) |
FPGA存储资源 (↓) | - | 25×BRAM36K | 8.5×BRAM36K | ? (较大) |
时钟频率 (↑) | - | 128 MHz | 95 MHz | ? |
消耗时钟周期数 (↓) | - | 524288 | 1449901 | ? |
每周期输入字节数 (↑) | - | 1.0 | 0.36 | ? |
实际性能 (↑) | 44 MB/s | 128 MB/s | 35 MB/s | 10 GB/s 级别 |
功耗 (↓) | 高 | 低 | 低 | 高 |
压缩后大小 (↓) | 287 kB | 299 kB | 395 kB | ? |
压缩率 (↑) | 44% | 41% | 23% | 介于 23% 和 41% :warning: |
:warning: HT-Deflate-FPGA 的压缩率应该高于 HDL-deflate ,但低于本设计。这是根据它所支持的算法特性分析出来的。
gzip_compressor_top 在各种 FPGA 上实现的结果:
FPGA 系列 | FPGA 型号 | 逻辑资源 | 逻辑资源(%) | 片上存储 | 片上存储(%) | 最高频率 |
---|---|---|---|---|---|---|
Xilinx Artix-7 | xc7a35ticsg324-1L | 8218*LUT | 40% | 25*BRAM36K | 50% | 128 MHz |
Xilinx Zynq-7 | xc7z020clg484-1 | 8218*LUT | 16% | 25*BRAM36K | 18% | 128 MHz |
Xilinx Virtex-7 | xc7vx485tffg1761-1 | 8201*LUT | 3% | 25*BRAM36K | 3% | 160 MHz |
Xilinx ZU+ | xczu3eg-sbva484-2-e | 8180*LUT | 12% | 24*BRAM36K | 11% | 280 MHz |
Altera Cyclone4 | EP4CE55F23C8 | 16807*LE | 30% | 807398 bits | 34% | 74 MHz |
SIM 目录包含了 GZIP 压缩器的 testbench 源码。该 testbench 的框图如下:
其中随机数据包生成器 (tb_random_data_source.v) 会4种生成特性不同的数据包 (字节概率均匀分布的随机数据、字节概率非均匀分布的随机数据、随机连续变化的数据、稀疏数据) ,送入待测模块 (gzip_compressor_top) 进行压缩,然后 tb_save_result_to_file 模块会把压缩结果存入文件,每个独立的数据包存入一个独立的 .gz 文件。
tb_print_crc32 负责计算原始数据的 CRC32 并打印,注意:待测模块内部也会计算 CRC32 并封装入 GZIP 数据流,这两个 CRC32 计算器是独立的 (前者仅用于仿真,用来验证待测模块生成的 CRC32 是否正确)。你可以自行将仿真打印的 CRC32 与 生成的 GZIP 文件中的 CRC32 进行对比。
你可以按照以下步骤进行 iverilog 仿真:
FILE_COUNT
来修改数量。在10个文件的情况下,仿真一般要运行十几分钟才能结束。OUT_FILE_PATH
来修改存放的目录)
python check_gz_file.py .
以上命令意为:对当前目录 .
,也即 sim_data 目录下的所有 .gz 文件进行批量检查。
除了 iverilog ,你也可以用其它仿真器来仿真。只需要把 RTL 和 SIM 目录里的所有 .v 文件加入仿真工程,并以 tb_gzip_compressor.v 为仿真顶层进行仿真即可。
我提供了一个基于串口的 FPGA 部署运行示例,该工程跑在 Arty开发板 上 (该工程也全都是纯 RTL 设计,可以直接移植到其它 FPGA 型号上)。
该 FPGA 工程接收串口数据,将数据送入 GZIP 压缩器,并将得到的 GZIP 压缩数据流用串口发出去 (串口格式: 波特率115200, 无校验位)。
在电脑 (上位机) 上,编写了一个 python 程序,该程序的执行步骤是:
由于串口速度远小于 gzip_compressor_top 能达到的最高性能,因此该工程仅仅用于展示。要想让 gzip_compressor_top 发挥更高性能,需要用其它高速通信接口。
下图是该工程的框图。
有关该工程的文件:
fpga_uart_gz_file.py
)FPGA 工程烧录后,在 Arty-example/python 目录下打开命令行,运行以下命令:
python fpga_uart_gz_file.py <需要压缩的文件名>
例如,运行以下命令来压缩 raw.hex :
python fpga_uart_gz_file.py raw.hex
如果压缩成功,会得到 raw.hex.gz
文件,且不会打印报错信息。
[1] RFC 1951 : DEFLATE Compressed Data Format Specification version 1.3. https://www.rfc-editor.org/rfc/rfc1951
[2] RFC 1952 : GZIP file format specification version 4.3. https://www.rfc-editor.org/rfc/rfc1952
[3] HDL-deflate : https://github.com/tomtor/HDL-deflate
[4] HT-Deflate-FPGA : https://github.com/UCLA-VAST/HT-Deflate-FPGA