COMBINE Lab Salmon Versions

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment

v1.2.1

4 years ago

This is a minor release, but it nonetheless adds a few important features and fixes an outstanding bug.

This release incorporates all of the improvements and additions of 1.2.0, which are significant and which are covered in detail here.

New features:

  • salmon learned a new command line option --mismatchSeedSkip. This option can be used to tune seeding sensitivity for selective-alignment. The default value is 5 and should work well in most cases, but it can be tuned if desired. After a k-mer hit is extended to a uni-MEM, the uni-MEM extension can terminate for one of 3 reasons: the end of the read, the end of the unitig, or a mismatch. If the extension ends because of a mismatch, this is likely the result of a sequencing error. To avoid looking up many k-mers that will likely fail to be located in the index, the search procedure skips ahead by mismatchSeedSkip bases until it either (1) finds another match or (2) is k bases past the mismatch position. A smaller value can increase sensitivity, while a larger value can speed up seeding.
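The skip-ahead behavior can be sketched in Python. This is an illustrative model of the documented logic only, not salmon's actual C++ seeding code: the function name is hypothetical, and the real implementation looks k-mers up in the index rather than comparing strings directly.

```python
def next_seed_position(read, ref, mismatch_pos, k, skip=5):
    """After a uni-MEM extension ends at a mismatch, hop forward in
    increments of `skip` until a k-mer matches again, or until we are
    k bases past the mismatch position (illustrative sketch only)."""
    pos = mismatch_pos + skip
    while pos + k <= len(read) and pos < mismatch_pos + k:
        if read[pos:pos + k] == ref[pos:pos + k]:
            return pos                      # (1) found another match
        pos += skip
    # (2) stop skipping: resume k bases past the mismatch
    return min(mismatch_pos + k, len(read) - k)
```

With a smaller `skip`, more candidate positions are probed (higher sensitivity); with a larger one, fewer lookups are performed (faster seeding).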

  • salmon learned about the environment variable SALMON_NO_VERSION_CHECK. If this environment variable is set (to either 1 or TRUE) then salmon will skip checking for an updated version, regardless of whether or not it is passed the --no-version-check flag on the command line. This makes it easy to e.g. set the environment variable to control this behavior for instances running on a cluster. This addresses issue 486, and we thank @cihanerkut for the suggestion.
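For cluster use, the variable can simply be exported in the job environment (e.g. `export SALMON_NO_VERSION_CHECK=1`). A minimal Python sketch of the documented decision logic follows; the function name and the exact case handling are assumptions, not salmon's source:

```python
import os

def should_check_version(cli_no_version_check=False):
    """Skip the remote version check if the --no-version-check flag was
    passed OR SALMON_NO_VERSION_CHECK is set to 1 or TRUE (sketch of
    the documented behavior; case-insensitivity is an assumption)."""
    env_val = os.environ.get("SALMON_NO_VERSION_CHECK", "").strip().upper()
    return not (cli_no_version_check or env_val in ("1", "TRUE"))
```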

Improvements:

  • This is a change in default behavior: as raised in issue 505, salmon would not index sequences with duplicate decoy entries unless the --keepDuplicates flag was passed; instead, it would refuse to build the index until the duplicate decoys were removed. Since indexing duplicate sequences does not make any sense, we have decided that duplicate decoy sequences will always be discarded, regardless of the status of the --keepDuplicates flag. This lifts from the user the burden of having to ensure that the decoy sequences are free of duplicates. The rule is: if a decoy sequence is a duplicate of any previously-observed sequence (whether that sequence is a decoy or a non-decoy target), it is discarded. The number of discarded duplicate decoys (if > 0) will be reported to the log. Thanks to @tamuanand for raising the issue that led to this improvement.
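The rule can be illustrated with a small sketch (a hypothetical helper, not salmon's indexer):

```python
def filter_duplicate_decoys(records, decoy_names):
    """Keep every non-decoy record; drop any decoy whose sequence
    duplicates a previously-observed sequence (decoy or not).
    Returns the kept records and the number of dropped decoys."""
    seen = set()
    kept, n_dropped = [], 0
    for name, seq in records:
        if name in decoy_names and seq in seen:
            n_dropped += 1          # duplicate decoy: always discarded
            continue
        seen.add(seq)
        kept.append((name, seq))
    return kept, n_dropped
```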

  • During the build process, salmon (and pufferfish) now check directly whether std::numeric_limits<__int128> is defined, and set the pre-processor flags accordingly. This should address an issue that was reported building under clang on OSX 10.15 (seemingly, earlier versions of the compiler turned on vendor-specific extensions under the -std=c++14 flag, while the newer version does not).

Bug fixes:

  • Addressed / fixed a possibly un-initialized variable (sopt.noSA) in argument parsing.

v1.2.0

4 years ago

Improvements and changes

Improvements


  • Extreme reduction in the intermediate disk space required when building the salmon index. This improvement is due to the changes implemented by @iminkin in TwoPaCo (which pufferfish, and hence salmon, uses for constructing the colored, compacted dBG) addressing the issue here. This means that for larger references, or references with many "contigs" (transcripts), the intermediate disk space requirements are reduced by up to 2 orders of magnitude!

  • Reduction in the memory required for indexing, especially when indexing with a small value of k. This improvement comes from (1) fixing a bug that was resulting in an unnecessarily-large allocation when "pufferizing" the output of TwoPaCo and (2) improving the storage of some intermediate data structures used during index construction. These improvements should help reduce the burden of constructing a decoy-aware index with small values of k. The issue of reducing the number of intermediate files created (which can hurt performance on NFS-mounted drives) is being worked on upstream, but is not yet resolved.

alevin

  • This release introduces support for the quantification of CITE-seq / feature barcoding based single-cell protocols! A full, end-to-end tutorial is soon-to-follow on the alevin-tutorial website.

New flags and options:


  • Salmon learned a new option (currently beta) --softclip: this flag allows soft-clipping at the beginning and end of reads when they are scored with selective-alignment. If used in conjunction with the --writeMappings flag, the CIGAR strings in the resulting SAM output will designate any soft-clipping that occurs at the beginning or end of the read. Note: to pass the selective-alignment filter, the read must still obtain a score of at least (maximum achievable score) * minScoreFraction, but soft-clipping allows omitting a poor-quality sub-alignment at the beginning or end of the read with no change to the score for the rest of the alignment (rather than forcing a negative score for these sub-alignments).

  • Salmon learned a new option --decoyThreshold <thresh>: for an alignment to an annotated transcript to be considered invalid, it must have an alignment score s such that s < (decoyThreshold * bestDecoyScore). A value of 1.0 means that any alignment strictly worse than the best decoy alignment will be discarded. A smaller value will allow reads to be allocated to transcripts even if they align strictly better to the decoy sequence. The previous behavior of salmon was to discard any mappings to annotated transcripts that were strictly worse than the best decoy alignment; this is equivalent to setting --decoyThreshold 1.0, which is the default.
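In code form, the validity test reads as follows (a sketch of the stated rule, not salmon's implementation):

```python
def transcript_alignment_valid(score, best_decoy_score, decoy_threshold=1.0):
    """An alignment to an annotated transcript is discarded exactly when
    score < decoy_threshold * best_decoy_score (per the release notes)."""
    return score >= decoy_threshold * best_decoy_score
```

With the default threshold of 1.0, a transcript alignment that ties the best decoy alignment is retained; only strictly worse alignments are discarded.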

  • Salmon learned a new option --minAlnProb <prob> (default 1e-5): when selective alignment is carried out on a read, each alignment A is assigned a probability given by $e^{-(scoreExp * (bestScore - score(A)))}$, where the default scoreExp is just 1.0. Depending on how much worse a given alignment is compared to the best alignment for a read, this can result in an exceedingly small alignment probability. The --minAlnProb option lets one set the alignment probability below which an alignment's probability will be truncated to 0. This allows skipping the alignments for a fragment that are unlikely to be true (and which could increase the difficulty of inference in some cases).
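The probability assignment and truncation can be sketched directly from the formula above (illustrative only):

```python
import math

def alignment_probabilities(scores, score_exp=1.0, min_aln_prob=1e-5):
    """prob(A) = exp(-score_exp * (best_score - score(A))), truncated
    to 0 when it falls below min_aln_prob (per the release notes)."""
    best = max(scores)
    probs = [math.exp(-score_exp * (best - s)) for s in scores]
    return [p if p >= min_aln_prob else 0.0 for p in probs]
```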

  • Salmon learned a new flag --disableChainingHeuristic: Passing this flag will turn off the heuristic of Li 2018 that is used to speed up the MEM chaining step, where the inner loop of the chaining algorithm is terminated after a small number of previous pointers for a given MEM have been found. Passing this flag can improve the sensitivity of alignment to sequences that are highly repetitive (especially those with overlapping repetition), but it can make the chaining step somewhat slower.

  • Salmon learned a new flag --auxTargetFile <file>. The file passed to this option should be a list of targets (i.e. sequences indexed during indexing, or aligned against in the provided BAM file) for which auxiliary models (sequence-specific, fragment-GC, and position-specific bias correction) should not be applied. The format of this file is one target name per line. Unlike decoy sequences, this list of sequences is provided to the quant command, and can differ between runs if so desired. Also unlike decoy sequences, the auxiliary targets will be quantified (e.g. they will have entries in quant.sf and can have reads assigned to them). To aid in metadata tracking of targets marked as auxiliary, the aux_info directory contains a new file, aux_target_ids.json, which lists the indices of the targets that were treated as "auxiliary" targets in the current run.
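Reading the file is straightforward; a minimal sketch (the helper name is hypothetical, and this does not cover aux_target_ids.json, whose exact JSON schema is not described here):

```python
def load_aux_targets(path):
    """Parse the --auxTargetFile format: one target name per line.
    Blank lines are skipped in this sketch."""
    with open(path) as fh:
        return {line.strip() for line in fh if line.strip()}
```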

  • The equivalence class output is now gzipped when written (and written to aux_info/eq_classes.txt.gz rather than aux_info/eq_classes.txt). To allow this behavior to be detected, an extra property gzipped is written to the eq_class_properties entry of aux_info/meta_info.json. Apart from being gzipped to save space, the format is unchanged; you can read the file through a gzip stream, or simply unzip it before reading.
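For example, in Python the file can be read through a gzip text stream (a minimal sketch; the actual line contents follow the unchanged eq-class format):

```python
import gzip

def read_eq_classes(path="aux_info/eq_classes.txt.gz"):
    """Read the gzip-compressed equivalence-class file as text lines;
    apart from compression, the format is unchanged from eq_classes.txt."""
    with gzip.open(path, "rt") as fh:
        return fh.read().splitlines()
```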

  • Added special handling for reading SAM files that were, themselves, produced by salmon. Specifically, when reading SAM files produced by salmon, the AS tag will be used to assign appropriate conditional probabilities to different mappings for a fragment (rather than looking for a CIGAR string, which is not computed).

  • The versionInfo.json file generated during indexing now records the specific version of salmon that was used to build the index. The indexVersion field is already a version identifier that is incremented when the index changes in a binary-incompatible way. However, the new field will allow one to know the exact salmon version that was used to build the index.

alevin

  • A couple of new flags have been added to support feature barcoding based quantification in the alevin framework.

    • index command
      • --features: Performs indexing on a tsv file instead of a regular FASTA reference file. The tsv file should contain the feature names, tab-separated from their nucleotide sequences.
    • alevin command
      • --featureStart: The start index (0 based) of the feature barcode in the R2 read file. (Typically 0 for CITE-seq and 10 for 10x feature barcoding).
      • --featureLength: The length of the feature barcode in the R2 read file. (Typically 15 for both CITE-seq and 10x feature barcoding).
      • --citeseq: Quantifies feature-barcoded data by aligning the barcodes (allowing 1-edit distance) instead of the mRNA reads.
      • No --tgMap is needed when using the --citeseq single-cell protocol.
  • --end 3 has been enabled in this release. It is useful for protocols where the UMI comes before the cellular barcode (CB). Note: --end 3 does not start subsequencing from the 3' end of the R1 file; alevin still counts the subsequence from the 5' end, but first samples the UMI instead of the CB. That is, --end 5 represents CB+UMI while --end 3 represents UMI+CB, and all sequence beyond |CB| + |UMI| bases is ignored, no matter what value is set for the --end flag.
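The CB/UMI ordering can be illustrated with a small parsing sketch (a hypothetical helper; alevin's real parsing lives in its C++ code):

```python
def split_cb_umi(r1, cb_len, umi_len, end=5):
    """Split the leading cb_len + umi_len bases of an R1 read.
    --end 5 means CB then UMI; --end 3 means UMI then CB. Both start
    from the 5' end, and bases beyond |CB| + |UMI| are ignored."""
    if end == 5:
        cb = r1[:cb_len]
        umi = r1[cb_len:cb_len + umi_len]
    else:  # end == 3: the UMI comes first
        umi = r1[:umi_len]
        cb = r1[umi_len:umi_len + cb_len]
    return cb, umi
```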

Bug fixes

  • Fixed an issue (upstream in pufferfish) that actually arises from bbhash. Specifically, the issue was unexpected behavior of bbhash during minimal perfect hash (MPHF) construction: bbhash may create temporary files during MPHF construction, and it was using the current working directory to do so, with no option to override this behavior. We have fixed this in our copy of the bbhash code, and the salmon index command will now use the provided output directory as temporary working space for bbhash. This issue has been reported upstream in bbhash as issue 19.

  • Fixed an issue with long target names (raised in issue 451) not being allowed in the index. Previously, in the pufferfish-based index, target names of length > 255 were clipped to 255 characters. While this is not normally a problem, pipelines that attempt to encode significant metadata in the target name may be affected by this limit. With this release, target names of up to 65,536 characters are supported. Thanks to @chilampoon for raising this issue.

  • Fixed an issue where the computed alignment score could be wrong (too high) when there were MEMs in the highest-scoring chain that overlapped in the query and the reference by different amounts. This was relatively infrequent, but has now been fixed. Thanks to @cdarby for reporting the issue and providing a test case to fix it!

  • Fixed an issue where, in rare situations, usage of the alignment cache could cause non-determinism in the score for certain alignments, which could result in small fluctuations in the number of assigned fragments. The fix involves both addressing a bug in ksw2, where an incorrect alignment score for global alignment could be returned in certain rare situations depending on how the bandwidth parameter is set, and also being more stringent about which alignments are inserted into the alignment cache and which mappings are searched for in it. Many thanks to @csoneson for raising this issue and finding a dataset containing enough of the corner cases to track down and fix the issue. Thanks to @mohsenzakeri for isolating the underlying cases and figuring out how to fix them.

alevin

  • The big feature hash generated when --dumpBfh is set contained reversed UMI sequences relative to those originally present. This was a legacy bug, introduced when shifting from the jellyfish-based 2-bit encoding to the AlevinKmer-class-based 2-bit encoding. It has been fixed in this release.

  • Fixed an issue where the --writeUnmappedNames did not work properly with alevin. This addresses issue 501.

Other notes

  • As raised in issue 500, the salmon executable, since v1.0.0, assumes the SSE4 instruction set. While this feature has been standard on processors for a long time, some older hardware may not have this feature set. This compile flag was removed from the pufferfish build in this release, as we noticed no speed regressions in its absence. However, please let us know if you notice any non-trivial speed regressions (which we do not expect).

v1.1.0

4 years ago

salmon 1.1.0 release notes

Note : This version contains some important fixes; please see below for detailed information.

Note : On our testing machines, this version of salmon was index-compatible with version 1.0.0. That is, it is likely that you need not re-build your index from what you built with 1.0.0. However, it is not clear that this compatibility is guaranteed by the cereal library. If you encounter difficulty loading a previously-built index, please consider re-building with the latest version before filing a bug report.

Note : If you want to build from source and use a version of the (header-only) cereal library already installed on your system, please make sure it is cereal v1.3.0. The current findCereal.cmake file does not support version restrictions, and we are working to improve this for proper automatic detection and enforcement of this constraint in future releases.

As always, a pre-compiled linux executable is included below and the latest release is available via Bioconda.

Improvements

  • SHA512 sums are now properly propagated forward to meta_info.json.

  • Bumped the included version of the cereal serialization library. The components used by salmon should be backward compatible in terms of reading output from the previous version (i.e. should not require index re-building).

  • The flag --keepFixedFasta was added to the index command. If this flag is passed, then a "fixed" version of the fasta file will be retained in the index directory. This file is created during indexing, but is normally deleted when indexing is complete. It contains the input fasta without duplicate sequences (unless --keepDuplicates was used), with the headers as understood by salmon, with N nucleotides replaced, etc.

  • Introduced a few small optimizations upstream (in pufferfish) to speed up selective-alignment; more are on the way (thanks to @mohsenzakeri).

Bug fixes

  • The bug described directly below led to the discovery of a different but related bug that could cause the extracted sequence used for bias correction to be incorrect. The code was assuming zero-initialization of memory, which was not necessarily occurring. Note: This bug affects runs performed under mapping-based mode (i.e. when the input was not coming from a BAM file) and when the --seqBias or --gcBias flags (or both) were used. Depending upon the initialization of the underlying memory, the bug may lead to unexpected results and diminished accuracy. The bug was present in versions 0.99.0 beta 1 through 1.0.0 (inclusive), and if you processed data using these versions in mapping-based mode with the flags mentioned above, we encourage you to reprocess this data with the newest version, just in case. We apologize for any inconvenience.

  • Fixed a bug that would occur when the input fasta file contained short sequences (<= length k) near the end of the file and bias correction (sequence-specific or fragment-GC) was enabled. The problem was particularly acute when the short sequence was immediately preceded by a very long target, and would cause inordinate warning messages to be printed to the log, slowing index loading considerably. Furthermore, it would provide sequence copies to short transcripts and decoy sequences even though they are not needed, resulting in unnecessary memory waste. The bug was due to a missing parenthesization needed to enforce the desired operator precedence. This fix should speed up index loading and reduce memory usage when using the --seqBias or --gcBias flags. Huge thanks to @mdshw5 for finding an input that triggered this behavior (which didn't show up in testing), and for helping to track down the cause.

  • Fixed a bug that could occur in computing the Beta function component of the chaining score with very long queries. This should not have shown up at all with Illumina-length reads, but nonetheless the adjustment conceptually corrects the scoring for all cases. Thanks @mohsenzakeri.

v1.0.0

4 years ago

This is a major stable release of salmon and brings a lot of exciting new features with extensive benchmarking in the latest preprint.

This new version of salmon is based on a fundamentally different indexing data structure (pufferfish) than the previous version. It also adopts a different mapping algorithm; a variant of selective-alignment. The new indexing data structure makes it possible to index the transcriptome as well as a large amount of "decoy" sequence in small memory. It also makes it possible to index the entire transcriptome and the entire genome in "reasonable" memory (currently ~18G in dense mode and ~14G in sparse mode, though these sizes may improve in the future), which provides a much more comprehensive set of potential decoy sequences. In the new index, the transcriptome and genome are on "equal footing", which helps to avoid bias toward either the transcriptome or the genome during mapping.

Note : To construct the ccDBG from the reference sequence, which is subsequently indexed with pufferfish, salmon makes use of (a very slightly modified version of) the TwoPaCo software. TwoPaCo implements a very efficient algorithm for building a ccDBG from a collection of reference sequences. One of the key parameters of TwoPaCo is the size of the Bloom filter used to record and filter possible junction k-mers. To ease the indexing procedure, salmon will attempt to automatically set a reasonable estimate for the Bloom filter size, based on an estimate of the number of distinct k-mers in the reference and using a default FPR of 0.1% over TwoPaCo's default 5 filters. To quickly obtain an estimate of the number of distinct k-mers, salmon makes use of (a very slightly modified version of) the ntCard software; specifically the nthll implementation.
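For intuition about the scale of the automatic sizing, the textbook Bloom-filter formula with a fixed number of hash functions can be applied; this is the standard formula shown purely for illustration, under the assumption that "5 filters" corresponds to 5 hash functions, and is not necessarily the exact computation TwoPaCo or salmon performs:

```python
import math

def bloom_filter_bits(n_distinct_kmers, fpr=0.001, num_hashes=5):
    """Standard Bloom-filter sizing: fpr ~= (1 - exp(-h*n/m))**h,
    solved for the number of bits m given h hash functions and
    n inserted elements (textbook formula, for intuition only)."""
    h = num_hashes
    return math.ceil(-h * n_distinct_kmers / math.log(1.0 - fpr ** (1.0 / h)))
```

At an FPR of 0.1% with 5 hash functions, this works out to roughly 17-18 bits per distinct k-mer.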

Changes since v0.99.0 beta2

A bug related to alevin index parsing has been fixed. Specifically, if the length of any one decoy target was less than the k-mer length, alevin would dump gene counts for decoy targets. Thanks to @csoneson for reporting this; it has been fixed in the latest stable release.

Changes since v0.99.0 beta1

Allow passing of explicit filter size to the indexing command via the -f parameter (default is to estimate required filter size using nthll).

Fix bug that prevented dumping SAM output, if requested, in alevin mode.

Correctly enabled strictFilter mode in alevin, improving single-cell mapping quality.

Changes since v0.14.1

The indexing methodology of salmon is now based on pufferfish. Thus, any previous indices need to be re-built. However, the new indexing methodology is considerably faster and more parallelizable than the previous approach, so providing multiple threads to the index command should make relatively short work of this task.

The new version of salmon adopts a new and modified selective-alignment algorithm that is, nonetheless, very similar to the selective-alignment algorithm described in Alignment and mapping methodology influence transcript abundance estimation. In this release of salmon, selective-alignment is enabled by default (and, in fact, mapping without selective-alignment is disabled). We may explore, in the future, ways to allow disabling selective-alignment under the new mapping approach, but at this point, it is always enabled.

As a consequence of the above, range factorization is enabled by default.

There is a new command-line flag --softclipOverhangs which allows reads that overhang the end of transcripts to be softclipped. The softclipped region will neither add to nor detract from the match score. This is more permissive than the default strategy, which would require the overhanging bases of the read to be scored as a deletion under the alignment.

There is a new command-line flag --hitFilterPolicy which determines the policy by which hits or chains of hits are filtered in selective alignment, prior to alignment scoring. Filtering hits after chaining (the default) is more sensitive, but more computationally intensive, because it performs the chaining dynamic program for all hits. Filtering before chaining is faster, but some true hits may be missed. The NONE option is not recommended, but is the most sensitive. It does not filter any chains based on score, though all methods only retain the highest-scoring chains per transcript for subsequent alignment score. The options are BEFORE, AFTER, BOTH and NONE.

There is a new command-line flag --fullLengthAlignment, which performs selective-alignment over the full length of the read, beginning from the (approximate) initial mapping location and using extension alignment. This is in contrast with the default behavior which is to only perform alignment between the MEMs in the optimal chain (and before the first and after the last MEM if applicable). The default strategy forces the MEMs to belong to the alignment, but has the benefit that it can discover indels prior to the first hit shared between the read and reference.

The -d/--dumpEqWeights flag now dumps the information associated with whichever type of factorization is being used for quantification (the default now is range-factorized equivalence classes). The --dumpEq flag now always dumps simple equivalence classes. This means that no associated conditional probabilities are written to the file, and if range-factorization is being used, then all of the range-factorized equivalence classes that correspond to the same transcript set are collapsed into a simple equivalence class label and the corresponding counts are summed.
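The collapsing step described for --dumpEq can be sketched as follows; the dictionary layout here is an illustrative stand-in, not the real file format:

```python
from collections import defaultdict

def collapse_to_simple(range_factorized):
    """Merge range-factorized classes that share a transcript set into
    one simple equivalence class: drop the conditional probabilities
    and sum the counts (per the --dumpEq description)."""
    simple = defaultdict(int)
    for (tx_set, _cond_probs), count in range_factorized.items():
        simple[tx_set] += count
    return dict(simple)
```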

There has been a change to the default behavior of the VB prior. The default VB prior is now evaluated on a per-transcript rather than per-nucleotide basis. The previous behavior is enabled by passing the --perNucleotidePrior option to the quant command.

Considerable improvements have been made to fragment length modeling in the case of single-end samples.

Alevin now contains a flag --quartzseq2 to support the Quartz-Seq2 protocol (thanks @dritoshi).

bug fix: When provided with the --dumpFeatures flag, alevin dumps featureDump.txt. The column header of this file was inconsistent with the values; this has been fixed, i.e., the ArborescenceCount field now occurs as the last column.

bug fix: The mtx output format overflowed the gene-count boundary when the total number of genes was exactly a multiple of 8. This has been fixed in the latest release.

The following command-line flags have been removed (since, given the new index, they no longer serve a useful function): --allowOrphansFMD, --consistentHits, --quasiCoverage.

v0.14.2

4 years ago

This release is a replica of v0.14.1, plus a hot-fix for the bug in alevin's output mtx format. Thanks @pinin4fjords for reporting this; it should fix https://github.com/COMBINE-lab/salmon/issues/431. NOTE: This release doesn't support the features in the 0.99 beta releases. If you are interested in the new indexing and mapping scheme described in the v0.99 release notes, please wait for v1.0.0 or try v0.99 beta2.

v0.99.0-beta2

4 years ago

This is the second beta version of the next major release of salmon.

This new version of salmon is based on a fundamentally different indexing data structure (pufferfish) than the previous version. It also adopts a different mapping algorithm; a variant of selective-alignment. The new indexing data structure makes it possible to index the transcriptome as well as a large amount of "decoy" sequence in small memory. It also makes it possible to index the entire transcriptome and the entire genome in "reasonable" memory (currently ~18G in dense mode and ~14G in sparse mode, though these sizes may improve in the future), which provides a much more comprehensive set of potential decoy sequences. In the new index, the transcriptome and genome are on "equal footing", which helps to avoid bias toward either the transcriptome or the genome during mapping.

Since it constitutes such a major change (and advancement) in the indexing and alignment methodology, we are releasing beta versions of this new release of salmon to give users the ability to try it out and to provide feedback before it becomes the "default" version you get via e.g. Bioconda. Since it is not currently possible to have both releases and "betas" in Bioconda, you can get the pre-compiled executables below, or build this version directly from the develop branch of the salmon repository.

Note : To construct the ccDBG from the reference sequence, which is subsequently indexed with pufferfish, salmon makes use of (a very slightly modified version of) the TwoPaCo software. TwoPaCo implements a very efficient algorithm for building a ccDBG from a collection of reference sequences. One of the key parameters of TwoPaCo is the size of the Bloom filter used to record and filter possible junction k-mers. To ease the indexing procedure, salmon will attempt to automatically set a reasonable estimate for the Bloom filter size, based on an estimate of the number of distinct k-mers in the reference and using a default FPR of 0.1% over TwoPaCo's default 5 filters. To quickly obtain an estimate of the number of distinct k-mers, salmon makes use of (a very slightly modified version of) the ntCard software; specifically the nthll implementation.

Changes since v0.99.0 beta1

  • Allow passing of explicit filter size to the indexing command via the -f parameter (default is to estimate required filter size using nthll).

  • Fix bug that prevented dumping SAM output, if requested, in alevin mode.

  • Correctly enabled strictFilter mode in alevin, improving single-cell mapping quality.

Changes since v0.14.1

  • The indexing methodology of salmon is now based on pufferfish. Thus, any previous indices need to be re-built. However, the new indexing methodology is considerably faster and more parallelizable than the previous approach, so providing multiple threads to the index command should make relatively short work of this task.

  • The new version of salmon adopts a new and modified selective-alignment algorithm that is, nonetheless, very similar to the selective-alignment algorithm described in Alignment and mapping methodology influence transcript abundance estimation. In this release of salmon, selective-alignment is enabled by default (and, in fact, mapping without selective-alignment is disabled). We may explore, in the future, ways to allow disabling selective-alignment under the new mapping approach, but at this point, it is always enabled.

  • As a consequence of the above, range factorization is enabled by default.

  • There is a new command-line flag --softclipOverhangs which allows reads that overhang the end of transcripts to be softclipped. The softclipped region will neither add to nor detract from the match score. This is more permissive than the default strategy, which would require the overhanging bases of the read to be scored as a deletion under the alignment.

  • There is a new command-line flag --hitFilterPolicy which determines the policy by which hits or chains of hits are filtered in selective alignment, prior to alignment scoring. Filtering hits after chaining (the default) is more sensitive, but more computationally intensive, because it performs the chaining dynamic program for all hits. Filtering before chaining is faster, but some true hits may be missed. The NONE option is not recommended, but is the most sensitive. It does not filter any chains based on score, though all methods only retain the highest-scoring chains per transcript for subsequent alignment score. The options are BEFORE, AFTER, BOTH and NONE.

  • There is a new command-line flag --fullLengthAlignment, which performs selective-alignment over the full length of the read, beginning from the (approximate) initial mapping location and using extension alignment. This is in contrast with the default behavior which is to only perform alignment between the MEMs in the optimal chain (and before the first and after the last MEM if applicable). The default strategy forces the MEMs to belong to the alignment, but has the benefit that it can discover indels prior to the first hit shared between the read and reference.

  • The -d/--dumpEqWeights flag now dumps the information associated with whichever type of factorization is being used for quantification (the default now is range-factorized equivalence classes). The --dumpEq flag now always dumps simple equivalence classes. This means that no associated conditional probabilities are written to the file, and if range-factorization is being used, then all of the range-factorized equivalence classes that correspond to the same transcript set are collapsed into a simple equivalence class label and the corresponding counts are summed.

  • There has been a change to the default behavior of the VB prior. The default VB prior is now evaluated on a per-transcript rather than per-nucleotide basis. The previous behavior is enabled by passing the --perNucleotidePrior option to the quant command.

  • Considerable improvements have been made to fragment length modeling in the case of single-end samples.

  • Alevin now contains a flag --quartzseq2 to support the Quartz-Seq2 protocol (thanks @dritoshi).

  • bug fix: When provided with the --dumpFeatures flag, alevin dumps featureDump.txt. The column header of this file was inconsistent with the values; this has been fixed, i.e., the ArborescenceCount field now occurs as the last column.

  • bug fix: The mtx output format overflowed the gene-count boundary when the total number of genes was exactly a multiple of 8. This has been fixed in the latest release.

  • The following command-line flags have been removed (since, given the new index, they no longer serve a useful function): --allowOrphansFMD, --consistentHits, --quasiCoverage.

v0.99.0-beta1

4 years ago

This is the first beta version of the next major release of salmon.

This new version of salmon is based on a fundamentally different indexing data structure (pufferfish) than the previous version. It also adopts a different mapping algorithm; a variant of selective-alignment. The new indexing data structure makes it possible to index the transcriptome as well as a large amount of "decoy" sequence in small memory. It also makes it possible to index the entire transcriptome and the entire genome in "reasonable" memory (currently ~18G in dense mode and ~14G in sparse mode, though these sizes may improve in the future), which provides a much more comprehensive set of potential decoy sequences. In the new index, the transcriptome and genome are on "equal footing", which helps to avoid bias toward either the transcriptome or the genome during mapping.

Since it constitutes such a major change (and advancement) in the indexing and alignment methodology, we are releasing beta versions of this new release of salmon to give users the ability to try it out and to provide feedback before it becomes the "default" version you get via e.g. Bioconda. Since it is not currently possible to have both releases and "betas" in Bioconda, you can get the pre-compiled executables below, or build this version directly from the develop branch of the salmon repository.

Note : To construct the ccDBG from the reference sequence, which is subsequently indexed with pufferfish, salmon makes use of (a very slightly modified version of) the TwoPaCo software. TwoPaCo implements a very efficient algorithm for building a ccDBG from a collection of reference sequences. One of the key parameters of TwoPaCo is the size of the Bloom filter used to record and filter possible junction k-mers. To ease the indexing procedure, salmon will attempt to automatically set a reasonable estimate for the Bloom filter size, based on an estimate of the number of distinct k-mers in the reference and using a default FPR of 0.1% over TwoPaCo's default 5 filters. To quickly obtain an estimate of the number of distinct k-mers, salmon makes use of (a very slightly modified version of) the ntCard software; specifically the nthill implementation.
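As a rough illustration of the kind of estimate involved, the standard Bloom filter sizing formula gives the number of bits needed to hold n items at a target false-positive rate. Salmon's internal heuristic (spread over TwoPaCo's filters) may differ in detail, and the k-mer count below is hypothetical:

```python
import math

def bloom_bits(n_items: int, fpr: float) -> int:
    """Bits for a Bloom filter holding n_items at false-positive rate fpr."""
    return math.ceil(-n_items * math.log(fpr) / (math.log(2) ** 2))

# e.g. a hypothetical 2.5 billion distinct k-mers at the 0.1% FPR noted above
n_kmers = 2_500_000_000
size_gb = bloom_bits(n_kmers, 0.001) / 8 / 1e9
print(f"~{size_gb:.1f} GB of Bloom filter")
```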

Changes since v0.14.1

  • The indexing methodology of salmon is now based on pufferfish. Thus, any previous indices need to be re-built. However, the new indexing methodology is considerably faster and more parallelizable than the previous approach, so providing multiple threads to the index command should make relatively short work of this task.

  • The new version of salmon adopts a new and modified selective-alignment algorithm that is, nonetheless, very similar to the selective-alignment algorithm described in Alignment and mapping methodology influence transcript abundance estimation. In this release of salmon, selective-alignment is enabled by default (and, in fact, mapping without selective-alignment is disabled). We may explore, in the future, ways to allow disabling selective-alignment under the new mapping approach, but at this point, it is always enabled.

  • As a consequence of the above, range factorization is enabled by default.

  • There is a new command-line flag --softclipOverhangs which allows reads that overhang the end of transcripts to be softclipped. The softclipped region will neither add to nor detract from the match score. This is more permissive than the default strategy, which would require the overhanging bases of the read to be scored as a deletion under the alignment.

  • There is a new command-line flag --hitFilterPolicy, which determines the policy by which hits or chains of hits are filtered in selective alignment, prior to alignment scoring. The options are BEFORE, AFTER, BOTH and NONE. Filtering hits after chaining (the default) is more sensitive but more computationally intensive, because it performs the chaining dynamic program for all hits. Filtering before chaining is faster, but some true hits may be missed. The NONE option is the most sensitive but is not recommended; it does not filter any chains based on score, though all policies retain only the highest-scoring chains per transcript for subsequent alignment scoring.

  • There is a new command-line flag --fullLengthAlignment, which performs selective-alignment over the full length of the read, beginning from the (approximate) initial mapping location and using extension alignment. This is in contrast with the default behavior which is to only perform alignment between the MEMs in the optimal chain (and before the first and after the last MEM if applicable). The default strategy forces the MEMs to belong to the alignment, but has the benefit that it can discover indels prior to the first hit shared between the read and reference.

  • The -d/--dumpEqWeights flag now dumps the information associated with whichever type of factorization is being used for quantification (the default now is range-factorized equivalence classes). The --dumpEq flag now always dumps simple equivalence classes. This means that no associated conditional probabilities are written to the file, and if range-factorization is being used, then all of the range-factorized equivalence classes that correspond to the same transcript set are collapsed into a simple equivalence class label and the corresponding counts are summed.

  • There has been a change to the default behavior of the VB prior. The default VB prior is now evaluated on a per-transcript rather than per-nucleotide basis. The previous behavior can be enabled by passing the --perNucleotidePrior option to the quant command.

  • Considerable improvements have been made to fragment length modeling in the case of single-end samples.

  • Alevin now contains a flag --quartzseq2 to support the Quartz-Seq2 protocol (thanks @dritoshi).

  • bug fix: When provided with the --dumpFeatures flag, alevin dumps featureDump.txt. The column headers of this file were inconsistent with the values; this has been fixed, and the ArborescenceCount field now occurs as the last column.

  • The following command-line flags have been removed (since, given the new index, they no longer serve a useful function): --allowOrphansFMD, --consistentHits, --quasiCoverage.

v0.14.1

4 years ago

This is primarily a bugfix release. For the recently-added features and capabilities, please refer to the 0.14.0 release notes.

The following bugs have been fixed in v0.14.1 :

  • If the number of skipped CBs is too high, the reported whitelist could sometimes include ids of skipped CBs. Thanks to @Ryan-Zhu for bringing this up; it has been fixed.
  • Multiple bugs in the --dumpMtx format, thanks to @Ryan-Zhu and @alexvpickering for pointing these out.
    • Expression values could sometimes be reported in scientific notation, which can break certain downstream parsers. This has been changed to fixed-precision decimal (the C++ default precision of 6).
    • The column ids were 0-indexed, while mtx assumes 1-indexing. This has been fixed so that the indices start from 1.
    • The reported expressions in the mtx file were in column-major format, which did not align with the binary-format counts and quants_mat_cols.txt. This has been fixed to report row-major order, which now aligns with the order of the genes as reported in quants_mat_cols.txt.
  • Alevin failed without reporting an error when the number of low-confidence cellular barcodes was < 1. Alevin has been fixed so that it does not perform intelligent whitelisting if the number of low-confidence cellular barcodes is < 200.
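For reference, a minimal sketch of the coordinate mtx layout these fixes concern, using only stdlib parsing and toy values (not real alevin output); note that entries in the file are 1-indexed:

```python
# Toy MatrixMarket coordinate file: 3x4 matrix with 2 nonzero entries.
mtx_text = """%%MatrixMarket matrix coordinate real general
3 4 2
1 1 7.5
3 4 0.25
"""

def parse_mtx(text):
    lines = [l for l in text.splitlines() if l and not l.startswith("%")]
    n_rows, n_cols, n_entries = map(int, lines[0].split())
    entries = {}
    for line in lines[1:]:
        r, c, v = line.split()
        r, c = int(r), int(c)               # 1-indexed in the file...
        assert 1 <= r <= n_rows and 1 <= c <= n_cols
        entries[(r - 1, c - 1)] = float(v)  # ...0-indexed in memory
    assert len(entries) == n_entries
    return (n_rows, n_cols), entries

shape, entries = parse_mtx(mtx_text)
print(shape, entries)
```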

Note : At least one issue has reported a depressed mapping rate with v0.14.1. It is recommended to upgrade to the latest version of salmon.

v0.14.0

4 years ago

Salmon 0.14.0 release notes

In addition to the changes and enhancements listed below, this release of salmon implements the decoy-aware selective-alignment strategy described in the manuscript Alignment and mapping methodology influence transcript abundance estimation. For reasons explored in depth in the manuscript, we recommend making use of this decoy-aware selective alignment strategy when not providing pre-aligned reads to salmon. Because of the changes required to implement this indexing strategy, salmon v0.14.0 is not compatible with the indices of previous versions, and so you must re-build the index for this version of salmon (which must be done anyway, if one is adding decoy sequence).

Adding decoy sequence to the salmon index.

Adding decoy sequence to the salmon index is simple, but salmon is specific about the manner in which the sequence is added. To ease this process, we have created a script that allows the automated creation of a decoy-enhanced transcriptome from a genome FASTA, transcriptome FASTA, and annotation GTF file. The script, as well as detailed instructions on how to run it and use its output, is provided in the SalmonTools repository.

Note: Because making effective use of the decoy sequence requires having accurate mapping scores, the decoys are only used when salmon is run with selective alignment (i.e. with the flags --validateMappings, --mimicBT2 or --mimicStrictBT2).

Detailed description of decoy requirements

It is not necessary to use the script we provide to extract decoy sequences, and if you'd like to add your own decoys to the file you wish to index, the process is fairly straightforward. All records for decoy sequence must come at the end of the FASTA file being indexed, and you must provide a file with all of the names (one name per line) of the records that should be treated as decoys (they need not be in the same order as in the FASTA file). Consider that you have the files txome.fa and decoys.fa, where decoys.fa contains the decoy sequences you want to add to your index. Also, assume that decoys.txt is the file containing the names of the decoy records. You can create valid input files as follows:

$ grep "^>" decoys.fa | cut -d ">" -f2 > decoys.txt
$ cat txome.fa decoys.fa > txome_combined.fa

Now, you can build the decoy-aware salmon index using the command:

$ salmon index -t txome_combined.fa -d decoys.txt -i combined_index

Changes to default behavior and new behavior

  • Dovetailing mappings and alignments are considered discordant and discarded by default --- this is the same behavior that is adopted by default in Bowtie2. This is a change from the older behavior of salmon where dovetailing mappings were considered concordant and counted by default. If you wish to consider dovetailing mappings as concordant (the previous behavior), you can do so by passing the --allowDovetail flag to salmon quant. Exotic library types (e.g. MU, MSF, MSR) are no longer supported. If you need support for such a library type, please submit a feature request describing the use-case.

  • The version check information is now written to stderr rather than stdout. This enables directly redirecting the SAM output, when using the -z/--writeMappings flag with the implicit argument that writes that output to stdout. NOTE: If you are having difficulty using the -z/--writeMappings flag to write output to a file (e.g using -z <file.sam> or --writeMappings <file.sam>), try using -z=<file.sam> or --writeMappings=<file.sam> instead --- this appears to be an issue with Boost's argument parsing library for flags that have implicit as well as default values.

  • Salmon now automatically detects, during indexing, if it believes that the transcriptome being indexed is in GENCODE format and the --gencode flag has not been passed. In this case, it issues a warning, since we generally recommend to use this flag when indexing GENCODE transcriptomes (to avoid the very long transcript names in the output). This implements feature request 366; thanks @alexvpickering.

  • The default setting for --numPreAuxModelSamples has been lowered from 1,000,000 to 5,000. This simply means that the basic models (and crucially the read alignment error model) will start being applied much earlier on in the online algorithm. This has very little effect on samples with a decent number of fragments, but can considerably improve estimates (especially in alignment-based mode) for samples with only a small number of fragments.

  • The definition of --consensusSlack has changed. Instead of being an absolute number, it is now a fractional value (between 0 and 1) that describes the fraction of "hits" (i.e. suffix array intervals) that a mapping may miss and still be considered valid for chaining.

Improvements and new flags for bulk mode

The flags below are either new, or only present since v0.13.0 and are therefore highlighted again below for completeness:

  • --mimicBT2 : This flag is a "meta-flag" that sets the parameters related to mapping and selective alignment to mimic alignment using Bowtie2 (with the flags --no-discordant and --no-mixed), but using the default scoring scheme and allowing both mismatches and indels in alignments.

  • --mimicStrictBT2 : This flag is a "meta-flag" that sets the parameters related to mapping and selective alignment to mimic alignment using Bowtie2 (with the flags suggested by RSEM), but using the default scoring scheme and allowing both mismatches and indels in alignments. These settings essentially disallow indels in the resulting alignments.

In addition to these "meta-flags", a few other flags have been introduced that can alter the behavior of mapping:

  • --recoverOrphans : This flag (which should only be used in conjunction with selective alignment), performs orphan "rescue" for reads. That is, if mappings are discovered for only one end of a fragment, or if the mappings for the ends of the fragment don't fall on the same transcript, then this flag will cause salmon to look upstream or downstream of the discovered mapping (anchor) for a match for the opposite end of the given fragment. This is done by performing "infix" alignment within the maximum fragment length upstream or downstream of the anchor mapping using edlib.

  • --hardFilter : This flag (which should only be used with selective alignment) turns off soft filtering and range-factorized equivalence classes, and removes all but the equally highest scoring mappings from the equivalence class label for each fragment. While we recommend using soft filtering (the default) for quantification, this flag can produce easier-to-understand equivalence classes if that is the primary object of study.

  • --skipQuant : Related to the above, this flag will stop execution before the actual quantification algorithm is run.

  • --bandwidth : This flag (which is only meaningful in conjunction with selective alignment), sets the bandwidth parameter of the relevant calls to ksw2's alignment function. This determines how wide an area around the diagonal in the DP matrix should be calculated.

  • --maxMMPExtension : This flag (which should only be used with selective alignment) limits the length that a mappable prefix of a fragment may be extended before another search along the fragment is started. Smaller values for this flag can improve the sensitivity of mapping, but could increase run time.

Through broad benchmarking across many samples, we have worked to considerably improve the selective-alignment algorithm and its sensitivity. We note that it is likely that selective alignment will be turned on by default in future releases, and we strongly encourage all users to make use of this feature and report their experiences with it. Along with the default selective alignment (enabled via --validateMappings), there are two "meta" flags that enable selective alignment parameters meant to mimic configurations in which users might be interested.

New information available in meta_info.json

  • The following fields have been added to meta_info.json:
    • num_valid_targets: The number of non-decoy targets in the index used for mapping.
    • num_decoy_targets: The number of decoy targets in the index used for mapping (only meaningful in mapping-based mode).
    • num_decoy_fragments: The number of fragments that were discarded from quantification because they best-aligned to a decoy target rather than a valid transcript.
    • num_dovetail_fragments : which denotes the number of fragments that have only dovetailing mappings. If the --allowDovetail flag was passed, these are counted toward quantification, otherwise they are discarded (but this number is still reported). This field only has a meaningful value in quasi-mapping mode (with or without selective alignment).
    • num_fragments_filtered_vm : which denotes the number of fragments that had a mapping to the transcriptome, but which were discarded because none of the mappings for the fragments exceeded the minimum selective alignment score. This field only has a meaningful value in conjunction with selective alignment (otherwise it is 0).
    • num_alignments_below_threshold_for_mapped_fragments_vm : which denotes the number of mappings discarded because they failed to reach the minimum selective alignment score, but for which the corresponding fragment had at least a single valid mapping. This field only has a meaningful value in conjunction with selective alignment (otherwise it is 0).
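These fields can be consumed programmatically; the JSON below is a hypothetical excerpt (real meta_info.json files contain many more keys, and the num_processed field and all values here are assumptions for illustration):

```python
import json

meta = json.loads("""{
  "num_valid_targets": 200000,
  "num_decoy_targets": 25,
  "num_decoy_fragments": 12345,
  "num_dovetail_fragments": 678,
  "num_fragments_filtered_vm": 910,
  "num_processed": 10000000
}""")

# Fraction of processed fragments discarded for best-aligning to a decoy
decoy_rate = meta["num_decoy_fragments"] / meta["num_processed"]
print(f"{decoy_rate:.4%} of fragments best-aligned to a decoy target")
```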

Improvements in single cell mode

  • Alevin supports decoy genomic alignments. NOTE: with the release of v0.14.0, any salmon index built with a previous version must be rebuilt.

  • The data of the file filtered_cb_frequency.txt, along with other features, will be dumped into the file featureDump.txt by default, i.e. you no longer need the --dumpFeatures flag to get CB-level features (except for raw_cb_frequency.txt). The list of features in the features file is as follows:

    • Cellular Barcode (CB) Sequence
    • Number of sequence corrected reads assigned to the CB
    • Number of mapped reads assigned to the CB
    • Number of deduplicated reads assigned to the CB
    • Mapping rate i.e. #mapped reads / #sequence corrected reads
    • Deduplication rate i.e. 1 - (#deduplicated reads / #mapped reads)
    • Mean / Max of the expressed gene quantification estimates.
    • Number of expressed genes.
    • Number of genes with the count estimates more than the mean.
    • Average Number of Reads deduplicated in each Arborescence
  • The command-line flag --dumpUmiGraph, along with the per-cell UMI graphs, also dumps the frequency of the number of reads used to deduplicate each arborescence. This is added as the last column of featureDump.txt, as tab-separated #Reads:#arborescence pairs.
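The two rates in the feature list above can be computed from the per-CB counts as follows; the counts and variable names here are toy values of ours, not alevin's column headers:

```python
# Toy per-CB counts
seq_corrected_reads = 1000
mapped_reads = 900
dedup_reads = 600

mapping_rate = mapped_reads / seq_corrected_reads  # #mapped reads / #sequence-corrected reads
dedup_rate = 1 - (dedup_reads / mapped_reads)      # 1 - (#deduplicated reads / #mapped reads)

print(mapping_rate, round(dedup_rate, 4))
```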

  • The binary output format of alevin, quants_mat.gz, has been changed into a sparse single-precision format. In practice we saw the file size reduced to as little as half the size of the original file.

  • A new command-line flag --dumpMtx has been added to dump the quants in the matrix-market-exchange (mtx) sparse format.

  • If errors are encountered in different stages of the alevin pipeline, then instead of the default error code of 1, the following four categories of error codes will be reported by alevin for automated debugging:

    • 1: Error while mapping reads and/or generic errors.
    • 64: Error in knee estimation / Cellular Barcode sequence correction.
    • 74: Error while deduplicating UMI and/or EM optimization.
    • 84: Error while intelligent whitelisting.
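A pipeline wrapper can dispatch on these exit codes, e.g. as sketched below; the category strings paraphrase the list above:

```python
# Documented alevin exit codes -> error category
ALEVIN_ERROR_CATEGORIES = {
    1: "read mapping and/or generic error",
    64: "knee estimation / cellular barcode sequence correction error",
    74: "UMI deduplication and/or EM optimization error",
    84: "intelligent whitelisting error",
}

def describe_exit(code: int) -> str:
    """Map an alevin exit code to a human-readable category."""
    if code == 0:
        return "success"
    return ALEVIN_ERROR_CATEGORIES.get(code, f"unrecognized exit code {code}")

print(describe_exit(74))  # UMI deduplication and/or EM optimization error
```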

Bug fixes, deprecations and removals

  • A bug in the quantmerge command (issue 356) that could cause the output of quantmerge to be truncated was fixed (the bug was first introduced in v0.13.0).

  • Added missing explicit initialization for a variable that could affect the initialization condition of the optimization; thanks @come-raczy.

  • The following developer (hidden) flags have been deprecated:

    • --dumpUmitoolsMap (permanently disabled)
    • --noSoftMap (Always assumed True)
    • --dumpBarcodeMap (permanently disabled)
    • --noBarcode (Always assumed False)
  • The following user flags have been deprecated:

    • --debug (Always assumed True)
    • --useCorrelation (permanently disabled)
    • --dumpCsvCounts (replaced by mtx output via the --dumpMtx flag)

v0.13.1

5 years ago

Salmon 0.13.1 release notes

Version 0.13.1 is a patch to 0.13.0. We describe the contents of the patch here, and repeat the v0.13.0 release notes again below for simplicity.

  • This version fixes a non-determinism bug introduced in v0.13.0 that could cause the mapping rate of orphaned mappings to fluctuate slightly between runs.

  • This version adds the --allowDovetail flag, which overrides the newly-default behavior of discarding dovetail mappings of paired-end reads. If passed this flag, salmon will not treat dovetailing mappings as discordant, and will count them toward quantification.

  • The following fields have been added to meta_info.json:

    • num_dovetail_fragments : which denotes the number of fragments that have only dovetailing mappings. If the --allowDovetail flag was passed, these are counted toward quantification, otherwise they are discarded (but this number is still reported). This field only has a meaningful value in quasi-mapping mode (with or without mapping validation).
    • num_fragments_filtered_vm : which denotes the number of fragments that had a mapping to the transcriptome, but which were discarded because none of the mappings for the fragments exceeded the minimum mapping validation score. This field only has a meaningful value in conjunction with mapping validation (otherwise it is 0).
    • num_alignments_below_threshold_for_mapped_fragments_vm : which denotes the number of mappings discarded because they failed to reach the minimum mapping validation score, but for which the corresponding fragment had at least a single valid mapping. This field only has a meaningful value in conjunction with mapping validation (otherwise it is 0).

Previous Salmon 0.13.0 release notes

Change to default behavior

Starting from this version of salmon, dovetailed mappings (see the Bowtie2 manual for a description) are not accepted by default using the built-in mapping (with or without --validateMappings). Moreover v0.13.0 has no flag to allow dovetail mappings. The --allowDovetail option has been added to v0.13.1 to enable this behavior, if desired.

Exotic library types (e.g. MU, MSF, MSR) are no longer supported. If you need support for such a library type, please submit a feature request describing the use-case.

Improvements and new flags

Again, there have been significant improvements to mapping validation. Through broad benchmarking across many samples, we have worked to considerably improve the algorithm and its sensitivity. We note that it is likely that mapping validation will be turned on by default in future releases, and we strongly encourage all users to make use of this feature and report their experiences with it.

Along with the default mapping validation (enabled via --validateMappings), there are two "meta" flags that enable mapping validation parameters meant to mimic configurations in which users might be interested.

  • --mimicBT2 : This flag is a "meta-flag" that sets the parameters related to mapping and mapping validation to mimic alignment using Bowtie2 (with the flags --no-discordant and --no-mixed), but using the default scoring scheme and allowing both mismatches and indels in alignments.

  • --mimicStrictBT2 : This flag is a "meta-flag" that sets the parameters related to mapping and mapping validation to mimic alignment using Bowtie2 (with the flags suggested by RSEM), but using the default scoring scheme and allowing both mismatches and indels in alignments. These settings essentially disallow indels in the resulting alignments.

In addition to these "meta-flags", a few other flags have been introduced that can alter the behavior of mapping:

  • --recoverOrphans : This flag (which should only be used in conjunction with mapping validation), performs orphan "rescue" for reads. That is, if mappings are discovered for only one end of a fragment, or if the mappings for the ends of the fragment don't fall on the same transcript, then this flag will cause salmon to look upstream or downstream of the discovered mapping (anchor) for a match for the opposite end of the given fragment. This is done by performing "infix" alignment within the maximum fragment length upstream or downstream of the anchor mapping using edlib.

  • --hardFilter : This flag (which should only be used with mapping validation) turns off soft filtering and range-factorized equivalence classes, and removes all but the equally highest scoring mappings from the equivalence class label for each fragment. While we recommend using soft filtering (the default) for quantification, this flag can produce easier-to-understand equivalence classes if that is the primary object of study.

  • --skipQuant : Related to the above, this flag will stop execution before the actual quantification algorithm is run.

  • --bandwidth : This flag (which is only meaningful in conjunction with mapping validation), sets the bandwidth parameter of the relevant calls to ksw2's alignment function. This determines how wide an area around the diagonal in the DP matrix should be calculated.

  • --maxMMPExtension : This flag (which should only be used with mapping validation) limits the length that a mappable prefix of a fragment may be extended before another search along the fragment is started. Smaller values for this flag can improve the sensitivity of mapping, but could increase run time.

The default setting for --numPreAuxModelSamples has been lowered from 1,000,000 to 5,000. This simply means that the basic models (and crucially the read alignment error model) will start being applied much earlier on in the online algorithm. This has very little effect on samples with a decent number of fragments, but can considerably improve estimates (especially in alignment-based mode) for samples with only a small number of fragments.

The definition of --consensusSlack has changed. Instead of being an absolute number, it is now a fractional value (between 0 and 1) that describes the fraction of "hits" (i.e. suffix array intervals) that a mapping may miss and still be considered valid for chaining.

Improvements and changes to alevin

  • With this release, alevin will dump summary statistics of a single-cell experiment into the file alevin_meta_info.json inside the aux folder of the output directory.

  • The EquivalenceClassBuilder object now has a single-cell SCRGValue templatization, which will marginally reduce the memory used by the object.

  • Salmon's --initUniform flag has been linked with alevin; if enabled through the command line (default false), it initializes the EM step with a uniform prior instead of with unique equivalence-class evidence.
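The two initializations can be contrasted on a toy three-transcript example; the unique-read counts below are hypothetical, and this is only a sketch of the idea, not alevin's actual EM code:

```python
n_txps = 3

# --initUniform: flat starting abundances
uniform_init = [1.0 / n_txps] * n_txps

# default: seed with unique equivalence-class evidence
unique_counts = {0: 50, 2: 150}  # transcript index -> uniquely assigned reads (toy)
total = sum(unique_counts.values())
evidence_init = [unique_counts.get(i, 0) / total for i in range(n_txps)]

print(uniform_init)   # three equal values summing to 1
print(evidence_init)  # [0.25, 0.0, 0.75]
```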

  • Alevin can directly consume the bfh file format generated using --dumpBfh. This provides an independent entry point into alevin's UMI deduplication step, instead of starting from the raw FASTQ files.

  • A bug in UMI deduplication step has been fixed. Previously the vertices in the maximum connected components of an arborescence were not being removed.

  • The custom mode of the single-cell protocol for alevin does not need an explicit protocol-specific command-line flag. However, the full triplet of --umiLength, --barcodeLength and --end command-line options has to be specified to enable custom mode.

  • Maximum allowable length of a barcode and/or the UMI has been set to 20 for the custom mode of a single cell experiment.

  • A new command line option --keepCBFraction has been added, which expects a value in the range (0, 1]. This parameter forces alevin to use the specified fraction of all observed cellular barcodes in the input reads, after sequence correction.

Bug fixes, deprecations and removals

  • Fixed a rare bug that could cause salmon and alevin to "hang" when many read files were provided as input and the number of records in a read file was a divisor of the mini-batch size. Thanks to @rbenel for finding a dataset that triggers this bug and reporting it in #329.

  • The --strictIntersect flag led to unnecessary complexity in the codebase, and it seems, was not really used by anyone, so it was removed to simplify and streamline the code.

  • The --useFSPD flag has been deprecated for many releases and was removed.