Metacache Versions

memory efficient, fast & precise taxonomic classification system for metagenomic read mapping

v2.4.2

2 months ago

Improved sequence id extraction from filenames and sequence headers.

The default setting is now smarter: it first tries to find NCBI-style accession or accession.version identifiers, then GenBank identifiers, and finally falls back to the filename (without path and extension).

The new command line option -sequence-id-format <type> allows the user to select a preferred method for sequence id extraction. Available values for <type> are:

smart: (default), works as described above
ncbi: only use NCBI-style accession or accession.version identifiers
genbank: only use genbank identifiers
filename: only use filename (without path and extension)
leadingword: only use first contiguous stretch of non-whitespace characters
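
The fallback order of the smart setting can be sketched as follows. This is only an illustration in Python, not MetaCache's actual code: the two regular expressions, the choice to search only the header before falling back to the filename, the return of an empty string when a fixed format finds nothing, and the function name extract_sequence_id are all assumptions made for this example.

```python
import os
import re

# Assumed patterns for illustration; MetaCache's real expressions may differ.
# NCBI-style accession with optional ".version", e.g. NC_000913.3 or U00096.3
NCBI_ACCESSION = re.compile(
    r"\b(?:[A-Z]{2}_[A-Z]{0,4}|[A-Z]{1,6})\d{5,10}(?:\.\d+)?\b"
)
# GenBank-style numeric identifier, e.g. gi|556503834|
GENBANK_ID = re.compile(r"gi\|(\d+)")

FORMATS = ("smart", "ncbi", "genbank", "filename", "leadingword")


def extract_sequence_id(header, filename, fmt="smart"):
    """Return a sequence id according to the selected extraction format."""
    if fmt not in FORMATS:
        raise ValueError(f"unknown sequence id format: {fmt}")
    header = header.lstrip(">")

    if fmt == "leadingword":
        words = header.split()
        return words[0] if words else ""

    if fmt in ("smart", "ncbi"):
        match = NCBI_ACCESSION.search(header)
        if match:
            return match.group(0)

    if fmt in ("smart", "genbank"):
        match = GENBANK_ID.search(header)
        if match:
            return match.group(1)

    if fmt in ("smart", "filename"):
        # filename without path and extension
        return os.path.splitext(os.path.basename(filename))[0]

    # A fixed format found nothing; what MetaCache does here is not
    # stated in the release notes, so this sketch returns an empty id.
    return ""
```

With the default format, a header such as ">NC_000913.3 Escherichia coli" yields the accession, while a header without any recognizable identifier falls through to the filename stem.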

v2.4.1

2 months ago

fixed abundance table formatting

prevent scientific notation from being used for read counts
row showing unclassified reads had the taxon column missing, now shown with taxon "--"

v2.4.0

2 months ago

Changed handling of non-unique sequence IDs during database build

If a reference sequence is inserted whose ID (e.g. NCBI accession) is already present in the database, the newer sequence will now be inserted with a modified ID (an exclamation mark followed by a duplication counter is appended) and a warning will be printed to stderr.
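
A minimal sketch of this renaming scheme, written in Python for illustration only. The exact suffix layout (for example whether the counter starts at 1), the warning text, and the name register_sequence_id are assumptions; the release note only states that an exclamation mark and a duplication counter are appended.

```python
import sys


def register_sequence_id(seen, seq_id):
    """Record seq_id in `seen` and return the id actually stored.

    `seen` maps each stored id to the number of duplicates met so far.
    """
    if seq_id not in seen:
        seen[seq_id] = 0
        return seq_id

    # Duplicate: append "!" plus a counter (assumed to start at 1).
    seen[seq_id] += 1
    new_id = f"{seq_id}!{seen[seq_id]}"
    while new_id in seen:  # avoid clashing with an id that already looks like this
        seen[seq_id] += 1
        new_id = f"{seq_id}!{seen[seq_id]}"

    seen[new_id] = 0
    print(f"warning: duplicate sequence id '{seq_id}' stored as '{new_id}'",
          file=sys.stderr)
    return new_id
```

Inserting the same accession three times under these assumptions would store it as "NC_000913.3", "NC_000913.3!1" and "NC_000913.3!2".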

Added min/max length filter

A minimum and maximum length for reads can now be set with -min-readlen <#> and -max-readlen <#>. Reads with lengths outside of this range will not be processed, i.e., treated as if they were not present in the input file. How many reads were discarded and how many were processed is printed to stderr. The default behavior, that all reads will be processed, remains unchanged.
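
The filter semantics described above can be sketched like this. It is a Python illustration rather than the tool's implementation; the function name filter_by_length, the use of None for "no upper limit", and the wording of the summary line are assumptions.

```python
import sys


def filter_by_length(reads, min_len=0, max_len=None):
    """Keep reads with min_len <= len(read) <= max_len.

    With the defaults (no lower or upper limit) every read is kept,
    mirroring the unchanged default behavior.
    """
    kept = []
    discarded = 0
    for read in reads:
        too_short = len(read) < min_len
        too_long = max_len is not None and len(read) > max_len
        if too_short or too_long:
            discarded += 1  # treated as if it were not in the input
        else:
            kept.append(read)

    print(f"reads processed: {len(kept)}, reads discarded: {discarded}",
          file=sys.stderr)
    return kept, discarded
```

Discarded reads are simply absent from the returned list, so downstream steps never see them.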

Other changes

cleaned up some includes
updated dates
changed some aspects of default code formatting

v2.3.2

3 months ago

improved parsing of assembly_summary files with inconsistent headers

v2.3.1

1 year ago

fixed type mismatch bug that could prevent compilation with uint64_t for MC_TARGET_ID_TYPE / MC_WINDOW_ID_TYPE / DMC_KMER_TYPE
allow up to 10 alphanumeric characters in NCBI-style accession ids
GPU version: removed outdated CUDA 10.2 and CUB from documentation

v2.3.0

1 year ago

Removed compaction step from GPU version and sped up GPU queries. This also removes the dependency on CUB.
Set CUDA arch=native by default to automatically detect GPU architecture.
Fixed make with multiple MACROS (#34).

v2.2.3

1 year ago

Improved merge mode:

Added -out option
Recover from malformed input files (#33)
Show more output on verbose info level

v2.2.2

1 year ago

Fixed kmers on GPU for k != 16 (default was working correctly)
Fixed shown query parameters when running abundance estimation

v2.2.1

2 years ago

Fixed canonical kmer on GPU for k != 16 (default was working correctly)
Fixed merge mode
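
For readers unfamiliar with the term, a canonical k-mer is commonly defined as the smaller of a k-mer and its reverse complement, so that both strands map to the same value. The sketch below shows that definition for arbitrary k using a 2-bit encoding. It is a generic Python illustration, not the GPU code that was fixed; the release notes do not describe the bug itself, and the encoding (A=0, C=1, G=2, T=3) and function names are assumptions.

```python
# Assumed 2-bit encoding; with it, the complement of a base is 3 - code.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(kmer):
    """Pack a k-mer string into an integer, 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value


def reverse_complement(value, k):
    """Reverse complement of an encoded k-mer of length k."""
    rc = 0
    for _ in range(k):
        rc = (rc << 2) | (3 - (value & 3))  # complement the last base
        value >>= 2                         # and move on to the next one
    return rc


def canonical(value, k):
    """Smaller of the encoded k-mer and its reverse complement."""
    return min(value, reverse_complement(value, k))
```

Because the reverse complement depends on k (the number of 2-bit groups that are reversed), code that is only exercised with one value of k can hide errors for other values.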

v2.2.0

2 years ago

Fixed the NCBI genome download script (the ftp path can be empty for some genomes).
Changed the default data type for storing reference sequence ids from 16 to 32 bits in order to fit all complete bacterial, viral and archaeal genomes of the latest NCBI RefSeq releases.
Fixed the error message during the build process that should be reported when the number of sequences exceeds the supported number.
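
As a rough illustration of why the wider id type matters, the number of distinct values an unsigned integer can hold grows from 2^16 to 2^32. The Python snippet below only does that arithmetic; it assumes no values are reserved for special purposes, which may not hold for the actual tool.

```python
def max_distinct_ids(bits):
    """Number of distinct values representable with `bits` bits."""
    return 2 ** bits


# 16-bit ids: 65,536 values; 32-bit ids: 4,294,967,296 values
ids_16 = max_distinct_ids(16)
ids_32 = max_distinct_ids(32)
```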