A compressed, associative, exact, and weighted dictionary for k-mers.
This release of the library features a restructured public API for the dictionary and its supported queries.
kmer_id: contig information (contig_id and contig_size of the contig in which the k-mer lies), the relative (within the contig) identifier of the k-mer (named kmer_id_in_contig), and the orientation of the k-mer in the contig. For any positive query, 0 <= kmer_id_in_contig < contig_size holds true.
With this release the dictionary construction uses external memory to reduce RAM usage.
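The query fields and the invariant on kmer_id_in_contig can be illustrated with a small, language-agnostic sketch. This is not the library's API: the class name LookupResult, the helper is_consistent, the boolean encoding of the orientation, and the reading of contig_size as a number of k-mers are all assumptions made for illustration. Only the field names kmer_id, kmer_id_in_contig, contig_id, and contig_size come from the notes above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LookupResult:
    """Hypothetical container for the fields named in the release note."""
    kmer_id: int            # absolute identifier of the k-mer
    kmer_id_in_contig: int  # identifier relative to the containing contig
    contig_id: int          # identifier of the containing contig
    contig_size: int        # size of that contig (assumed: number of k-mers)
    forward: bool           # orientation; a bool encoding is an assumption


def is_consistent(res: LookupResult) -> bool:
    """Check the invariant 0 <= kmer_id_in_contig < contig_size."""
    return 0 <= res.kmer_id_in_contig < res.contig_size


# A made-up positive result that satisfies the invariant.
hit = LookupResult(kmer_id=1042, kmer_id_in_contig=17,
                   contig_id=3, contig_size=50, forward=True)
# A made-up result that violates it (relative id equals the contig size).
bad = LookupResult(kmer_id=1075, kmer_id_in_contig=50,
                   contig_id=3, contig_size=50, forward=True)
```

A check of this kind can be useful in tests that consume query results, since a relative identifier outside the contig bounds would indicate a bug in the caller or an incompatible index file.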
No major changes compared to the previous version (other than renaming variables for consistency with the papers), but we removed a (useless) serialised 4-byte integer from skew_index, so index binary files built with previous versions are not compatible with this library release.
This release adds a new tool called permute that re-orders (and possibly reverse-complements) the strings in an input (weighted) collection to minimize the number of runs in the abundances and, hence, optimize their encoding.
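To see why the order and orientation of the strings matter, the toy sketch below counts runs in the concatenated abundance sequence before and after a hand-picked rearrangement. It does not reproduce the algorithm used by permute. The helper names count_runs and concatenate and the abundance values are hypothetical. Reversing a string's abundance list models reverse-complementing the string, which reverses the order of its k-mers.

```python
from itertools import chain, groupby


def count_runs(abundances):
    """Number of maximal blocks of equal consecutive values."""
    return sum(1 for _ in groupby(abundances))


def concatenate(strings, order):
    """Concatenate per-string abundance lists.

    `order` is a list of (index, reverse) pairs: which string comes next
    and whether its abundance list is reversed first.
    """
    return list(chain.from_iterable(
        reversed(strings[i]) if rev else strings[i] for i, rev in order
    ))


# Made-up abundances of the k-mers of three strings A, B, C.
strings = [[3, 3, 1], [7, 7], [3, 1, 1]]

# Input order A, B, C gives 3 3 1 7 7 3 1 1.
runs_before = count_runs(
    concatenate(strings, [(0, False), (1, False), (2, False)]))

# Hand-picked order A, reversed C, B gives 3 3 1 1 1 3 7 7.
runs_after = count_runs(
    concatenate(strings, [(0, False), (2, True), (1, False)]))
```

In this example the rearrangement lowers the number of runs from 5 to 4, because the trailing 1 of A now meets the leading 1s of the reversed C.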
The abundances are encoded in O(r) space on top of the space for an SSHash dictionary, where r is the number of runs (i.e., maximal substrings formed by a single abundance value) in the abundances. The i-th abundance in the sequence, corresponding to the k-mer of identifier i, is retrieved in O(log r) time.
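One simple way to obtain these bounds is a run-length encoding with binary search over the run start positions, sketched below. This is only an illustration under that assumption: the notes do not describe the actual encoding used by the library, and the class name RunLengthAbundances is hypothetical. The structure keeps two arrays of length r and answers an access with one binary search.

```python
from bisect import bisect_right
from itertools import groupby


class RunLengthAbundances:
    """Run-length encoded sequence: O(r) space, O(log r) access."""

    def __init__(self, abundances):
        self.starts = []  # position where each run begins
        self.values = []  # abundance value of each run
        pos = 0
        for value, group in groupby(abundances):
            self.starts.append(pos)
            self.values.append(value)
            pos += sum(1 for _ in group)
        self.size = pos  # total number of abundances

    def __getitem__(self, i):
        """Return the i-th abundance via binary search on run starts."""
        if not 0 <= i < self.size:
            raise IndexError(i)
        # Last run whose start position is <= i.
        run = bisect_right(self.starts, i) - 1
        return self.values[run]


# Made-up abundance sequence with r = 4 runs.
abundances = [3, 3, 1, 1, 1, 3, 7, 7]
encoded = RunLengthAbundances(abundances)
```

Here encoded[i] returns the abundance of the k-mer with identifier i while storing only one start position and one value per run.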
This release adds a new feature: compressed abundances. The SSHash dictionary can now also store the abundances in highly compressed space.
First release.