# annbench: a lightweight benchmark for approximate nearest neighbor search

annbench is a simple benchmark for approximate nearest neighbor search algorithms in Python. The design of this repository is strongly influenced by the great project ann-benchmarks, which provides comprehensive and thorough benchmarks for various algorithms. In contrast, we aim to deliver more lightweight and straightforward scripts.
## Getting started

```bash
git clone https://github.com/matsui528/annbench.git
cd annbench
pip install -r requirements.txt

# conda install faiss-cpu -y -c pytorch  # If you'd like to try faiss, run this on anaconda
# conda install faiss-gpu -y -c pytorch  # or, if you have GPUs, install faiss-gpu

python download.py dataset=siftsmall        # The dataset is downloaded on ./dataset
python run.py dataset=siftsmall algo=annoy  # Indices are on ./interim. Results are on ./output
python plot.py                              # Plots are on ./result_img
```
## Run all algorithms on all datasets

```bash
# Downloading deep1m takes some hours
python download.py --multirun dataset=siftsmall,sift1m,deep1m

# This will take some hours
python run.py --multirun dataset=siftsmall,sift1m,deep1m algo=linear,annoy,ivfpq,hnsw,ivfpq4bit,scann,pq,pq4bit,hnsw_faiss,nsg

# Or, if you have GPUs:
# python run.py --multirun dataset=siftsmall,sift1m,deep1m algo=linear,annoy,ivfpq,hnsw,linear_gpu,ivfpq_gpu,ivfpq4bit,scann,pq,pq4bit,hnsw_faiss,nsg

python plot.py
```
## Config

All configurations are on `./conf`. You can edit the config files to change parameters.

## Download

Download a target dataset by `python download.py dataset=DATASET`. The data will be placed on `./dataset`. Several datasets can be downloaded at once by `python download.py --multirun dataset=DATASET1,DATASET2,DATASET3`. See hydra for more detailed APIs of multirun.
## Run

Evaluate a target algorithm (`ALGO`) with a target dataset (`DATASET`) by `python run.py dataset=DATASET algo=ALGO`. You can run multiple algorithms on multiple datasets by `python run.py --multirun dataset=DATASET1,DATASET2 algo=ALGO1,ALGO2`. The constructed indices are stored on `./interim`; they are reused for each run with different query parameters. The search results are written to `./output`. Note that each search is executed `num_trial=10` times and the average runtime is reported. You can change this by, e.g., `python run.py num_trial=5`.
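The averaging over trials can be pictured as follows. This is an illustrative sketch only; the function name `average_runtime` and its shape are assumptions for explanation, not annbench's actual code:

```python
import time

def average_runtime(search_fn, num_trial=10):
    """Run search_fn num_trial times and return the mean wall-clock runtime in seconds."""
    elapsed = []
    for _ in range(num_trial):
        t0 = time.perf_counter()
        search_fn()  # one full pass over the queries
        elapsed.append(time.perf_counter() - t0)
    return sum(elapsed) / len(elapsed)
```

Averaging over several trials smooths out timing noise from the OS scheduler and caches, which matters when a single search takes only microseconds.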
## Plot

Visualize the results by `python plot.py`. This script checks `./output` and generates figures for each dataset on `./result_img`. To turn off the annotation of query parameters, run `python plot.py with_query_param=false`. To change the image size, run, e.g., `python plot.py width=15 height=10`.
## Log

With hydra, whenever you run `run.py` or `plot.py`, the output files are saved on `./log` as well. For example, with `python run.py algo=annoy dataset=siftsmall`, the result file is saved on (1) `./output/siftsmall/annoy/result.yaml` and (2) `./log/2020-03-11/22-30-59/0/result.yaml`.
## Datasets

| dataset | dim | #base | #query | #train | misc |
|---|---|---|---|---|---|
| siftsmall | 128 | 10,000 | 100 | 25,000 | A toy dataset for hello-world |
| sift1m | 128 | 1,000,000 | 10,000 | 100,000 | |
| deep1m | 96 | 1,000,000 | 10,000 | 100,000 | The first 1M vectors of Deep1B. Helper scripts |
Note that `hnsw (hnswlib)` is the original implementation by the original authors, while `hnsw (faiss)` is a re-implementation by the faiss team.
## Add your own algorithm

To add your own algorithm, create a wrapper class on `./annbench/algo`. The class must inherit the `BaseANN` class; see `annoy.py` for an example. Then create a config file for the algorithm on `./conf/algo`.
## Add your own dataset

Similarly, to add your own dataset, create a dataset class that inherits `BaseDataset` on `./annbench/dataset`. A simple example is `siftsmall.py`. Then create a config file for the dataset on `./conf/dataset`.
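For illustration only, a toy dataset might expose base, query, and train vectors like this. The attribute names here are assumptions; see `siftsmall.py` and the `BaseDataset` class for the real interface:

```python
import numpy as np

class RandomDataset:
    """Illustrative toy dataset with random float32 vectors."""

    def __init__(self, dim=16, n_base=1000, n_query=10, n_train=100, seed=123):
        rng = np.random.default_rng(seed)
        self.base = rng.random((n_base, dim), dtype=np.float32)    # vectors to be indexed
        self.query = rng.random((n_query, dim), dtype=np.float32)  # search queries
        self.train = rng.random((n_train, dim), dtype=np.float32)  # training vectors (e.g., for codebooks)
```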
## Parameters of algorithms

Index-construction parameters are defined as, e.g., `param_index={"M": 8, "nlist": 100}`. With these parameters, one index (let us denote it `ivfpq(M=8, nlist=100)`) is created. This index is stored on the disk as `M8_nlist100.bin`, where the naming scheme is defined in the function `stringify_index_param`.
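One plausible implementation of such a naming scheme, consistent with the `M8_nlist100.bin` example above (a sketch for illustration, not necessarily annbench's exact code):

```python
def stringify_index_param(param):
    """Flatten index parameters into a filename stem, e.g. {"M": 8, "nlist": 100} -> "M8_nlist100"."""
    return "_".join(f"{k}{v}" for k, v in param.items())
```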
Here, a query parameter is defined as `param_query={"nprobe": [1, 2, 4, 8, 16]}`. In the search step, the index is first read from the disk into memory. Then the search runs five times, once for each `nprobe in [1, 2, 4, 8, 16]`. This creates five results (five pairs of (recall, runtime)). By connecting these results, one polyline is drawn on the final plot. Note that the query parameters should be sorted in ascending order; otherwise (e.g., `[1, 4, 2, 8, 16]`), the final graph would become weird.
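Since the polyline connects the results in the listed order, a quick sanity check for a hand-edited parameter list could look like this (a hypothetical helper, not part of annbench):

```python
def is_sorted_ascending(params):
    """Return True if query parameters increase monotonically, so the recall-runtime polyline does not zigzag."""
    return all(a < b for a, b in zip(params, params[1:]))
```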
## Specialized parameters

The default parameters of each algorithm are defined on `./conf/algo/`. These parameters are used for all datasets by default. If you'd like to specialize parameters for a specific dataset, you can define the specialized version on `./conf/specialized_param/`. For example, `ivfpq` is defined with `nlist=100` by default. You can set `nlist=1000` for the sift1m dataset by adding a config file on `./conf/specialized_param/`.
## Storing indices on another disk

If your local disk space is limited, you can store the constructed indices somewhere else, e.g., on a large HDD mounted on `SOMEWHERE/LARGE_HDD/`:

```bash
python run.py interim=SOMEWHERE/LARGE_HDD/interim
```
## Evaluating one algorithm with different query parameters

You can run, e.g., the `ivfpq` algorithm with different query parameters from the command line:

```bash
python run.py algo=ivfpq dataset=siftsmall param_query.nprobe=[1,5,25]
```
## Contribution

Feel free to open a pull request.