Fast and precise comparison of genomes and metagenomes (in the order of terabytes) on a typical personal laptop
BinDash is a command-line software for comparing genomes (including metagenomes and pangenomes) on a typical personal laptop. BinDash is based on Binwise Densified minhash for estimation of mutation rate between genomes. We implemented b-bit one-permutation rolling MinHash with optimal/faster/re-randomized densification. It is extremely fast and memory efficient. It can handle sequences consisting of terabytes of input data (gzipped or not, in fasta or fastq format).
The basic idea is: the Jaccard index as an accurate proxy of Average Nucleotide Identity(ANI) or mutation rate (1-ANI) according to equation:
$$ANI=1+\frac{1}{k}log\frac{2*J}{1+J}$$
if assuming a Poisson model, and
$$ANI=(\frac{2*J}{1+J})^{\frac{1}{k}}$$
if assuming a Binomial model. You can specify which model to use via --model option, see below.
Precompiled binaries on modern Linux can be found here. On MacOS, GNU GCC has to be installed first, we recommend the homebrew install (see below).
wget https://github.com/jianshu93/bindash/releases/download/v2.1/BinDash_Linux_x86-64_v2.0.tar.gz
tar -xzvf BinDash_Linux_x86-64_v2.0.tar.gz
./bindash --help
conda install bindash -c bioconda
brew tap jianshu93/bindash
brew update
brew install --cc=gcc-13 bindash
cd ${PROJECT_ROOT_DIRECTORY}
mkdir release && cd release
cmake -DCMAKE_BUILD_TYPE=Release ..
make # For Windows with MSYS Makefiles, the command might be "cd ../ && make" because out-of-source build may or may not be supported on this platform.
./bindash --help # to see a general help message
For MacOS, the native clang compiler cannot compile without compiling error. It is recommended to install gcc first as follows.
# install homebrew if not already done
brew update
brew install gcc@13
cd ${PROJECT_ROOT_DIRECTORY}
mkdir release && cd release
# Then run cmake as above (a GUI for cmake may also be available for MacOS)
CC="$(brew --prefix)/bin/gcc-13" CXX="$(brew --prefix)/bin/g++-13" cmake -DCMAKE_INSTALL_PREFIX=. ..
make
The folowing three commands show how to run BinDash:
bindash sketch --outfname=genomeA.sketch genomeA.fasta
bindash sketch --outfname=genomeB.sketch genomeB.fasta
bindash dist genomeA.sketch genomeB.sketch # print to stdout
### or if you have a list of genomes, one genome per line in a list file. All versus all
ls *.fasta > name.txt
bindash sketch --listfname=name.txt --outfname=genome_sketch
bindash dist --outfname=dist.txt genome_sketch
### query against database genomes
ls *_query.fasta > query.txt
ls *_db.fasta > db.txt
bindash sketch --listfname=query.txt --outfname=genome_query_sketch
bindash sketch --listfname=db.txt --outfname=genome_db_sketch
bindash dist --outfname=dist.txt genome_query_sketch genome_db_sketch
The output of "bindash dist" consists of several lines. Each line has these five tab-separated fields:
According to our experiments and theoretical analysis, sketch size (the skethchsize64 option) shoule be larger than 188 (that is actual sketch size is larger than ~12,000) to be accurate for genome pairs with ANI above 99.5%
By default, BinDash uses the optimally densified MinHash proposed by Shrivastava. This MinHash technique allows for efficient compression and fast comparison.
Basically, compression of a genome is done as follows.
All suggestions, comments, and feature requests are welcome.
Author: XiaoFei Zhao (cndfeifei AT hotmail DOT com)
License: Apache 2.0
Owen Kaser and Daniel Lemire, Strongly universal string hashing is fast, Computer Journal (2014) 57 (11): 1624-1638. http://arxiv.org/abs/1202.4961
Daniel Lemire, Owen Kaser: Recursive n-gram hashing is pairwise independent, at best, Computer Speech & Language, Volume 24, Issue 4, October 2010, Pages 698-710 http://arxiv.org/abs/0705.4676
Daniel Lemire, The universality of iterated hashing over variable-length strings, Discrete Applied Mathematics 160 (4-5), 2012. http://arxiv.org/abs/1008.1715
Muła, Wojciech, Nathan Kurz, and Daniel Lemire. "Faster population counts using AVX2 instructions." The Computer Journal 61, no. 1 (2018): 111-120. https://academic.oup.com/comjnl/article-abstract/61/1/111/3852071
Anshumali Shrivastava, Optimal Densification for Fast and Accurate Minwise Hashing. Proceedings of the 34th International Conference on Machine Learning, PMLR 70:3154-3163, 2017. http://proceedings.mlr.press/v70/shrivastava17a.html
Tung Mai et.al., On Densification for Minwise Hashing, Uncertainty in Artificial Intelligence, 2020. http://proceedings.mlr.press/v115/mai20a.html
Ping Li et.al., Re-randomized Densification for One Permutation Hashing and Bin-wise Consistent Weighted Sampling, 33rd Conference on Neural Information Processing Systems (NIPS), 2019. https://proceedings.neurips.cc/paper/2019/hash/9f067d8d6df2d4b8c64fb4c084d6c208-Abstract.html
XiaoFei Zhao; BinDash, software for fast genome distance estimation on a typical personal laptop. Bioinformatics, 2018. bty651, https://doi.org/10.1093/bioinformatics/bty651