Cuckoo Index: A Lightweight Secondary Index Structure
NOTE This is not an officially supported Google product.
Cuckoo Index (CI) is a lightweight secondary index structure that represents the many-to-many relationship between keys and partitions of columns in a highly space-efficient way. At its core, CI associates variable-sized fingerprints in a Cuckoo filter with compressed bitmaps indicating qualifying partitions.
The problem of finding all partitions that possibly contain a given lookup key is traditionally solved by maintaining one filter (e.g., a Bloom filter) per partition that indexes all unique key values contained in this partition:
Partition 0:
A, B => Bloom filter 0
Partition 1:
B, C => Bloom filter 1
...
To identify all partitions containing a key, we need to probe all per-partition filters (which could be many). Since a Bloom filter may return false positives, there is a chance (e.g., 1%) that we falsely identify a negative partition as positive. In the above example, a lookup for key A may return Partition 0 (a true positive) and Partition 1 (a false positive). Depending on the storage medium, a false-positive partition can be very expensive (e.g., many milliseconds on disk).
Furthermore, secondary columns typically contain many duplicates (also across partitions). With the per-partition filter design, these duplicates may be indexed in multiple filters (in the worst case, in all of them). In the above example, the key B is redundantly indexed in Bloom filters 0 and 1.
Cuckoo Index addresses both of these drawbacks of per-partition filters.
Prepare the CSV dataset that you are going to use. One of the datasets we used was DMV Vehicle, Snowmobile, and Boat Registrations:
wget -c https://data.ny.gov/api/views/w4pv-hbkt/rows.csv -O Vehicle__Snowmobile__and_Boat_Registrations.csv
Add the file to the data dependencies in the BUILD.bazel file:
data = [
# Put your csv files here
"Vehicle__Snowmobile__and_Boat_Registrations.csv"
],
For footprint experiments, run the following command, specifying the path to the data file, the columns to test, and the tests to run:
bazel run -c opt --cxxopt="-std=c++17" :evaluate -- \
--input_csv_path="Vehicle__Snowmobile__and_Boat_Registrations.csv" \
--columns_to_test="City,Zip,Color" \
--test_cases="positive_uniform,positive_distinct,positive_zipf,negative,mixed" \
--output_csv_path="results.csv"
For lookup performance experiments, run the following command, specifying the path to the data file and the columns to test:
NOTE You might want to use fewer rows for lookup experiments as the benchmarks are quite time-consuming.
bazel run -c opt --cxxopt='-std=c++17' --dynamic_mode=off :lookup_benchmark -- \
--input_csv_path="Vehicle__Snowmobile__and_Boat_Registrations.csv" \
--columns_to_test="City,Zip,Color"
NOTE CMake support is community-based. The maintainers do not use CMake internally.
For further information have a look at the cmake README.
Please cite our VLDB 2020 paper if you use this code in your own work:
@article{cuckoo-index,
author = {Kipf, Andreas and Chromejko, Damian and Hall, Alexander and Boncz, Peter and Andersen, David},
title = {Cuckoo Index: A Lightweight Secondary Index Structure},
year = {2020},
issue_date = {September 2020},
publisher = {VLDB Endowment},
volume = {13},
number = {13},
issn = {2150-8097},
url = {https://doi.org/10.14778/3424573.3424577},
doi = {10.14778/3424573.3424577},
journal = {Proc. VLDB Endow.},
month = sep,
pages = {3559-3572},
numpages = {14}
}