COMBINE Lab Salmon Versions

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment

v1.10.1

1 year ago

This release is a very minor update, intended entirely to address #835 (a problem raised by the Debian Med maintainers, who ran into build problems upstream). This release bumps the included version of the cereal headers in the corresponding pufferfish tag to v1.3.2 and also updates the version required by salmon to match (i.e. cereal v1.3.2). Since the version included in pufferfish in past releases, the cereal library has made 2 patch releases which, nonetheless, were not backwards compatible. This led to problems when mixing cereal v1.3.2 with v1.3.0. This release bumps everything to v1.3.2 to match the latest package in Debian testing. If salmon 1.10.0 is working fine for you, there's no need to update to this release (but obviously no harm in doing so). It adds no new features or bug fixes within salmon itself.

v1.10.0

1 year ago

Fixes

  • This release addresses a bug in deserializing the compact_vector (discovered by @jamshed) that could lead to undefined behavior. In fact, this bug was underlying a relatively rare but longstanding issue with the previous bioconda build of salmon where a segmentation fault could occur during indexing.

  • This release addresses #806, where several output counters used 32-bit values and could produce incorrect values if they exceeded the maximum representable 32-bit integer. These counters have been changed to be 64 bits wide. It is worth noting that this was an issue with the values in the output report, but not with the internal representations (i.e. the actual quantifications were not affected).

  • This release incorporates PR #817 by @Gaura that addresses an issue in the processing of some sci-seq3 data, where a read1 length of 33 or 34 would result in an error even though these are valid lengths. This resulted in salmon refusing to process this data; this has now been fixed (addresses #813).
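To illustrate the kind of counter problem described for #806, here is a minimal Python sketch. It is not salmon's code (salmon is written in C++); it emulates a fixed-width counter by masking. The helper name `add_wrapped` and the fragment count are hypothetical, and treating the counter as unsigned is an assumption, since the notes above do not say whether the affected counters were signed.

```python
def add_wrapped(counter: int, increment: int, bits: int) -> int:
    """Add to a fixed-width unsigned counter, wrapping around on overflow."""
    return (counter + increment) & ((1 << bits) - 1)

# Hypothetical number of fragments, larger than the 32-bit maximum (2**32 - 1).
true_count = 5_000_000_000

reported_32 = add_wrapped(0, true_count, bits=32)  # wraps around
reported_64 = add_wrapped(0, true_count, bits=64)  # fits comfortably

print(reported_32 == true_count)  # False: the 32-bit report is wrong
print(reported_64 == true_count)  # True: the 64-bit report is correct
```

Only the reported value is wrong in this sketch; the underlying quantity (`true_count`) is untouched, which mirrors the note above that the quantifications themselves were not affected.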

Improvements

  • Substantial refactoring has been made to parts of the mapping code to clean up redundant code and to make future additions easier.

  • Substantial improvements have been made to the CMake files to reduce the need for redundant copies of files and to propagate target properties more faithfully.

  • Several dependencies have been updated, including libstadenio and itlib.

Full Changelog: https://github.com/COMBINE-lab/salmon/compare/v1.9.0...v1.10.0

v1.9.0

1 year ago

New features

  • Salmon learned the ability to optionally write quality values in output SAM files. If the --writeQualities flag is passed to salmon when mappings are also being written (i.e. with --writeMappings=), then the SAM records for reads will contain the corresponding quality values. Note: You should not pass this flag to salmon if you are providing FASTA rather than FASTQ files as input; those files have no quality values, and so this flag is not compatible with FASTA input. Note: The default behavior remains to not write quality values, as they are not necessary for many downstream applications and they consume considerable extra space in the output. This addresses the feature request in #756; thanks to @A-N-Other for the suggestion.

Fixes

  • Addressing #748, raised by @taylorreiter: In single-end mode, all unmapped reads were being reported with the code u, including those mapped to decoys. This release fixes the output so that the proper code, d, is reported for those fragments best mapping to decoys.

Improvements

  • When salmon alevin was being run upstream of alevin-fry for generating a RAD file, it was possible for the file to be truncated if there was insufficient disk space for the output. This release of salmon adds a final check of the ofstream after the call to close to determine if the stream is in a bad state. This should lead to better error reporting and proper exit codes if the RAD output of salmon alevin is unexpectedly truncated. Thanks to @allyhawkins for helping to uncover this issue.

  • The use of multi-stage builds has greatly reduced the size of the Docker image to ~101MB (from ~1.38G); thanks to @kaczmarj for contributing this improvement.

  • Improvements to the documentation have been made and some typos fixed, thanks to @molecules.

Full Changelog: https://github.com/COMBINE-lab/salmon/compare/v1.8.0...v1.9.0

v1.8.0

2 years ago

New features & improvements

Note (June 7, 2022) : Updated release tarball to remove problematic libm that was causing illegal instruction on some architectures.

  • The index command now optionally accepts a flag -n/--no-clip that will disable homopolymer clipping during reference indexing.

  • Addressed an offset miscalculation; this results in further improved specificity in alevin's --sketch mode.

Fixes

  • No other particular bug fixes are noted for this release.

Notes

  • Legacy and deprecated Intel TBB functionality has now been removed, and salmon (and pufferfish, upon which it depends) has been updated to oneAPI TBB. The current release requires a recent version of the oneAPI TBB library (>= 2021.4.0).

Full Changelog: https://github.com/COMBINE-lab/salmon/compare/v1.7.0...v1.8.0

v1.7.0

2 years ago

New features & improvements

  • This release includes a refactoring and optimization of the mapping code in --sketch mode, further increasing speed; output should remain identical.

  • This release adds the --splitSeqV1 and --splitSeqV2 flags, which have been in the development release for a bit, as simple alternatives to custom geometry when processing SPLiT-seq data for alevin-fry or alevin processing.

Fixes

  • No particular bug fixes are noted for this release.

Other changes / enhancements

  • Explicitly check for valid value of k before calling out to the indexer. This leads to a more informative error message and exit if the user passes an unacceptable value of k.

Notes

  • The Intel TBB library used internally by salmon (and used as well in TwoPaCo, which is relied upon for compacted reference de Bruijn graph construction) has evolved into the oneAPI TBB. Recent releases of this library (2021.1 and forward) make certain backward-incompatible changes and therefore cannot be used to build salmon. We anticipate working toward replacing the deprecated and removed functions with the corresponding oneAPI replacements and idioms, hopefully in the next release of salmon. Therefore, we anticipate that this will be the last (or close to the last) salmon release to use (and be compatible with) the legacy Intel TBB library. Future releases will likely require a newer version of the oneAPI TBB library instead.

Full Changelog: https://github.com/COMBINE-lab/salmon/compare/v1.6.0...v1.7.0

v1.6.0

2 years ago

New features

  • This release introduces specific flags for two new single-cell protocols (which can either be processed using alevin or be used to produce a RAD file for alevin-fry). These new protocols are special because they mark the initial support within this framework for variable-length barcodes. In the next release, we hope to have an update to our generic barcode, UMI, and read geometry specification mini-language to expose this feature more generally there, but for the time being, these are implemented as new single-cell protocol flags. The new protocols supported are sci-RNA-seq3 and inDrop v2. These are exposed, respectively, with the --sciseq3 and --indropV2 flags. In addition to the custom geometry specification, the list of geometries / protocols with pre-specified flags has now been added to the documentation.

Fixes

  • This release fixes #691, where an extra : was present in the cmd_info.json file in rad and sketch mode where the salmon_version was recorded. Thanks to @allyhawkins for reporting this issue.

  • This release fixes a rare corner case in cell barcode rescue (recovering cell barcodes with an N) where, if a barcode could not be properly extracted, a rescue attempt would be made for the previous barcode, which could result in the wrong barcode / UMI pairing for that read. Thanks to @Gaura for finding this bug and for the PR to fix it.

Other changes / enhancements

Full Changelog: https://github.com/COMBINE-lab/salmon/compare/v1.5.2...v1.6.0

v1.5.2

2 years ago

This is a minor release and introduced no new features. However, this release addresses the issue raised in #688. Specifically, when run in RAD mode (i.e. with --rad or --sketch), salmon alevin did not output a cmd_info.json or meta_info.json file. While not strictly required for subsequent processing with alevin-fry, having this information can be useful for provenance tracking and bookkeeping. Now, both of these files are properly generated when running salmon alevin in RAD mode.

v1.5.1

2 years ago

Note: If you downloaded the pre-compiled linux binary from this release page for v1.5.1 before 19:47 UTC on June 14, please check your version with salmon -v. For a short period of time, the executable posted here was actually v1.5.0. Other distribution mechanisms (e.g. bioconda, docker hub, etc.) were not affected by this.

New features (in 1.5.0)

This release introduces an --ont flag, that is designed to improve quantification from Oxford Nanopore Technologies (ONT) long-reads (both cDNA and direct RNA). The main effect of this flag is twofold:

  • First, it enables an alignment error model designed to work with long-read alignments. Until this point, the recommendation when using salmon to quantify aligned long reads had been to disable the error model, since salmon's default error model is designed for short reads and did not work well with long read alignments. However, the error model enabled with the --ont flag is designed specifically for the alignment characteristics of long reads and should improve the quantification estimates produced for this data by providing a better estimate of the conditional probability of a read arising from a particular transcript given its alignment to that transcript (the testing for this feature has been done mostly using minimap2).

  • Second, it disables the length effect in the generative model when computing the conditional probability of observing a fragment given that it arises from a specific transcript. This is because in long-read sequencing, we do not expect to observe (i.e. sequence) multiple fragments from the same molecule, and thus we do not expect the transcript length to directly affect the observed fragment count. A consequence of this change is that the "EffectiveLength" of transcripts is not currently computed and used in the model in this mode, and this field in the output will be populated with a sentinel value of 100.

Other improvements (in 1.5.0)

  • When running alevin to generate a RAD file for alevin-fry (specifically when using --sketch mode), the sensitivity of mapping has been improved by allowing for reads that have only highly-repetitive seeds and map to a large number of loci.

  • It is no longer necessary to provide a transcript-to-gene --tgMap to the alevin command if alevin is being run with the --rad and/or --sketch flags.

  • Automatically detect and exit if alevin is run with an index including decoy sequences when using the --rad and/or --sketch flags. This functionality is not currently supported, and mapping against such an index can cause (cryptic) errors in downstream processing. Now, if such an index is passed when using these flags, an informative error message is printed and the program will exit with a return code of 1.

  • Support for the custom single-cell features (end, barcodeLength, umiLength) simultaneously with the --citeseq command-line flag has been dropped, although they can still be used independently. A user has to either use the --citeseq flag with predefined sets of features (CB: 16, UMI: 10) or use the umi-geometry, bc-geometry, read-geometry flags for a customized extraction of the barcode sequences. Note that in the geometry mode, the user has to explicitly provide keepCBFraction 1.0 and a tgMap file, while it's not necessary to provide either in citeseq-based mode.

Bug fixes

  • Fix an issue where the size of the representation used for the barcode length and UMI length when writing output to a RAD file was mistakenly linked. As most current protocols use a 32-bit integer for both, most runs are not affected.

  • Fix an issue where the barcode and UMI length may not be properly set when using the custom geometry format (addresses #670).

v1.4.0

3 years ago

salmon 1.4.0 : Thanksgiving release 🦃

Bug fixes

  • Fixed a very rare bug whereby, on certain operating systems, under certain types of system load, and with specific versions of the C++ standard library, the default random device would fail to produce a pseudorandom seed and would raise an exception. On these systems, "/dev/urandom" is explicitly substituted for the default random device. Unfortunately, it is not possible / easy to make the appropriate source changes at runtime. So, if you are experiencing this issue (which, again, looks to be exceedingly rare), it may be best to compile from source on the machine causing the issue.
  • salmon should now compile and run on ARM machines. It has been tested on an AWS aarch64 node (running Ubuntu 20.10), but presumably should work on many ARM machines. It is assumed that NEON intrinsics are available. This support for ARM was made immensely easier by SIMDe. Thanks to @mr-c and @BenLangmead for pointing out the SIMDe project and to @mr-c, @lh3 and the lead developer of SIMDe, @nemequ, who all gave useful advice on the initial expansion to ARM support.

Support for RAD file creation and the alevin-fry pipeline

  • --rad/--justAlign flag : Salmon/alevin 1.4.0 coincides with the initial release of alevin-fry, a flexible and efficient framework for single-cell quantification. Alevin-fry handles barcode-detection and quantification, providing the methods developed as part of alevin, as well as a number of other possibilities. Alevin-fry is computationally efficient, flexible, and very memory efficient, processing single-cell experiments in 2-3GB of memory (see more details in the poster introducing alevin-fry). Moving forward, we plan for alevin-fry to be the primary development platform for new single-cell quantification methods. Nonetheless, alevin-fry currently, and for the foreseeable future, will rely on alevin to perform the actual barcode / umi extraction, and mapping of sequencing reads. alevin communicates with alevin-fry via an intermediate binary file called a RAD (Reduced Alignment Data) file. To process data with alevin-fry (documentation available here), you must first map the reads to the reference transcriptome to generate a RAD file. This is done by running alevin as you would normally do, and by additionally passing the flag --rad or --justAlign. This flag will tell alevin to just align the reads and to write the appropriate information to a RAD file in the output directory (with a pre-determined name).

  • --sketch/--sketchMode flag : Alevin learned the --sketch/--sketchMode flag. This flag is currently relevant only in RAD mode. In fact, this flag currently implies RAD mode (that is, --sketch is currently the same as --rad --sketch). The --sketch flag is meant to prioritize mapping speed at the potential cost of reduced specificity. It turns off selective-alignment and instead maps the reads using a custom implementation of pseudoalignment [1] with structural constraints (PASC). This consists of executing the k-mer collecting part of a pseudoalignment [1] algorithm to collect potentially compatible targets for a fragment, represented by a series of "hits". The targets are then filtered to ensure that the collected hits are consistent in their orientation, and co-linear in their placement on the fragment and reference (these are the enforced structural constraints). This algorithm is distinct from the seeding step of selective alignment or the quasi-mapping algorithm, and prioritizes speed. For an overview of how --sketch mode affects downstream results, please check out our poster Accurate, efficient, and uncertainty-aware expression quantification of single-cell RNA-seq data.

  • --noWhitelist flag : Alevin learned the --noWhitelist flag. Passing this flag to alevin (in classic mode; this flag has no effect in RAD mode) stops the pipeline after UMI deduplication and quantification. The second-round intelligent whitelisting operation will not be performed.

  • generic barcode / umi / read geometry syntax : Alevin learned to support a generic syntax to specify the read sequence that should be used for barcodes, UMIs and the read sequence. The syntax allows one to specify how the pattern corresponding to the barcode, UMI, and read sequence should be pieced together, and the syntax is meant to be intuitive and general. For example, one can specify the 10Xv2 geometry in the following manner using the generic syntax:

    • --read-geometry 2[1-end] --bc-geometry 1[1-16] --umi-geometry 1[17-26]

    This specifies that the "sequence" read (the biological sequence to be aligned) comes from read 2, and it spans from the first index 1 (this syntax uses 1-based indexing) until the end of the read. Likewise, the barcode derives from read 1 and occupies positions 1-16, and the UMI comes from read 1 and occupies positions 17-26. The syntax can specify multiple ranges, and they will simply be concatenated together to produce the string. For example, one could specify --bc-geometry 1[1-8,16-23] to designate that the barcode should be taken from the substring in positions 1-8 of read 1 followed by the substring in positions 16-23 of read 1. It is even possible to have the string pieced together across both reads, but that functionality is only available if you are running with --rad or --sketch and preparing a RAD file for alevin-fry. If you are running classic alevin, the barcode must reside on a single read. The robust parsing of the flexible geometry syntax is made possible by the cpp-peglib project.

  • Alevin learned the ability to annotate output SAM files with the CB and UR tags. If you write a SAM file by running alevin with --writeMappings, then the resulting SAM file will have CB and UR tags in the alignment records to record the cell barcode and UMI for the fragment.

  • A new command-line flag --noWhitelist is added to explicitly disable the 'intelligent-whitelist' step performed by alevin. It helps with a still-unresolved issue on HPC systems running old CentOS, where alevin fails to gain access to virtual memory.
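As an illustration of the generic barcode / umi / read geometry syntax described above, the following is a minimal Python sketch of how a single-read specification such as 1[1-16] or 1[1-8,16-23] could be interpreted. It is not alevin's parser (which is built on cpp-peglib); the function names `parse_geometry` and `extract` and the toy reads are hypothetical, and specifications that piece a string together across both reads are not handled here.

```python
import re

def parse_geometry(spec: str):
    """Parse a spec such as '1[1-8,16-23]' into (read_number, [(start, stop), ...]).

    Positions are 1-based and inclusive; a stop of 'end' is returned as None.
    """
    match = re.fullmatch(r"([12])\[([^\]]+)\]", spec.strip())
    if match is None:
        raise ValueError(f"unrecognized geometry: {spec!r}")
    read_number = int(match.group(1))
    ranges = []
    for piece in match.group(2).split(","):
        start, _, stop = piece.partition("-")
        ranges.append((int(start), None if stop == "end" else int(stop)))
    return read_number, ranges

def extract(spec: str, read1: str, read2: str) -> str:
    """Concatenate the substrings that the spec selects from the chosen read."""
    read_number, ranges = parse_geometry(spec)
    read = read1 if read_number == 1 else read2
    # Convert 1-based inclusive positions to Python's 0-based half-open slices.
    return "".join(read[start - 1 : stop] for start, stop in ranges)

# Toy reads: a 16-base barcode, a 10-base UMI, then padding on read 1.
read1 = "ACGTACGTACGTACGT" + "TTTTTGGGGG" + "AAAA"
read2 = "CCCCGGGGTTTT"

barcode = extract("1[1-16]", read1, read2)
umi = extract("1[17-26]", read1, read2)
sequence = extract("2[1-end]", read1, read2)
split_barcode = extract("1[1-8,16-23]", read1, read2)
```

With these toy reads, the 10Xv2-style specification yields the first 16 bases of read 1 as the barcode, the next 10 as the UMI, and all of read 2 as the biological sequence.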

References

[1] Bray NL, Pimentel H, Melsted P, Pachter L. Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol. 2016;34(5):525-527.

v1.3.0

3 years ago

salmon 1.3.0 Release notes

  • Happy 4th of July ( :us: :fireworks: )

Bug fixes & improvements

:gift: Improvements

  • Fragments that best-map to decoys are now written in the output SAM file if the --writeMappings option is provided. In order to make filtering of decoy and non-decoy alignments easier, all alignments now include a tag in their SAM record. Alignments to a valid (non-decoy) target are tagged with XT:A:T, and those to decoys are tagged with XT:A:D. This allows easy filtering of decoy mappings. The conditions for a decoy mapping to be written to the file are as follows:

    1. There is no valid mapping to a non-decoy target. That is, all mappings to valid (non-decoy) targets must have alignment score < decoyThreshold * bestDecoyScore.
    2. Only best-scoring decoy alignments are written to file. Thus, if there are sub-optimal decoy alignments that are still better than alignments to valid targets, they will not appear in the output SAM file.
    3. If decoy alignments are written (condition 1 is satisfied), then all equally-best decoy alignments are written to file (i.e. a decoy fragment can still multi-map).
  • In the SAM file produced with the --writeMappings option, the header lines now include tags to designate each reference sequence as being a decoy or not. Sequence lines (@SQ lines) that correspond to valid targets contain the tag DS:T, while those corresponding to decoys contain the tag DS:D. Note: In alignment-based mode, salmon will not process SAM/BAM files with decoy entries (to avoid usage errors, since decoy alignment is not intended for quantification). So, if, for some reason, you are using a salmon-generated SAM file containing decoy sequences and alignment records, you must remove them before quantifying using alignment-based mode (i.e. removing all headers with DS:D and all alignment records with XT:A:D). Details about how to perform that transformation can be found here.

  • This release enables some considerable improvements to speed in the case of aligning poor quality reads. Specifically, this is enabled due to upstream changes in pufferfish implemented by @mohsenzakeri. Now, the aligner can exit early if it becomes clear at any point during alignment that a valid score cannot be obtained. This reduces the computation used to evaluate poor alignments that will not pass subsequent filtering (addresses #527 and #537).

  • Homopolymer seeds are now skipped during mapping and alignment. In pathological datasets, this could cause unnecessarily slow mapping without any improvements to the actual mapping rate (i.e. it could generate many poor mappings that would fail alignment). This change can speed up mapping in such datasets (addresses #527 and #537).

  • Three new filtering flags have been added to improve both sensitivity and speed. They determine how mappings are filtered at different stages. The previous behavior (that of salmon v1.0.0 through v1.2.1) can be obtained by setting --preMergeChainSubThresh 1.0, --postMergeChainSubThresh x, --orphanChainSubThresh x where x is (1.0 - --consensusSlack); by default this corresponds to x = 0.65.

    • --preMergeChainSubThresh : The threshold of sub-optimal chains, compared to the best chain on a given target, that will be retained and passed to the next phase of mapping. Specifically, if the best chain for a read (or read-end in paired-end mode) to target t has score X_t, then all chains for this read with score >= X_t * preMergeChainSubThresh will be retained and passed to subsequent mapping phases. This value must be in the range [0, 1]. Its default value is 0.75 for paired-end data and 1.0 for single-end data.
    • --postMergeChainSubThresh : The threshold of sub-optimal chain pairs, compared to the best chain pair on a given target, that will be retained and passed to the next phase of mapping. This is different than preMergeChainSubThresh, because this is applied to pairs of chains (from the ends of paired-end reads) after merging (i.e. after checking concordancy constraints etc.). Specifically, if the best chain pair to target t has score X_t, then all chain pairs for this read pair with score >= X_t * postMergeChainSubThresh will be retained and passed to subsequent mapping phases. This value must be in the range [0, 1]. The default value for this parameter is 0.9. Note: This option is only meaningful for paired-end libraries, and is ignored for single-end libraries.
    • --orphanChainSubThresh : This threshold sets a global sub-optimality threshold for chains corresponding to orphan mappings. That is, if the merging procedure results in no concordant mappings then only orphan mappings with a chain score >= orphanChainSubThresh * bestChainScore will be retained and passed to subsequent mapping phases. This value must be in the range [0, 1]. Unlike the --preMergeChainSubThresh and --postMergeChainSubThresh options, this threshold is global with respect to all orphan chains (not simply per-target). From that perspective, you can view it as overriding the value of --consensusSlack in the case of orphan mappings. Note: This option is only meaningful for paired-end libraries, and is ignored for single-end libraries.
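The per-target rule described for --preMergeChainSubThresh (retain every chain with score >= X_t * threshold) can be sketched as follows. This is an illustrative Python sketch only, not salmon's implementation; the function name `retain_chains`, the dictionary layout, and the scores are hypothetical, and chain scores are assumed to be positive with at least one chain per target.

```python
def retain_chains(chain_scores, sub_thresh):
    """Keep, per target, the chains whose score is >= best_score * sub_thresh.

    chain_scores maps a target name to the (non-empty) list of chain scores on it.
    """
    if not 0.0 <= sub_thresh <= 1.0:
        raise ValueError("threshold must be in the range [0, 1]")
    retained = {}
    for target, scores in chain_scores.items():
        best = max(scores)  # X_t: the best chain score on this target
        retained[target] = [s for s in scores if s >= best * sub_thresh]
    return retained

# Hypothetical chain scores for one read on two targets.
scores = {"t1": [100.0, 80.0, 60.0], "t2": [40.0, 39.0]}

paired_end = retain_chains(scores, 0.75)  # paired-end default stated above
single_end = retain_chains(scores, 1.0)   # single-end default stated above
```

With a threshold of 1.0 only the best-scoring chain(s) on each target survive, while 0.75 also lets near-optimal chains through to later mapping phases.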
  • The default --mismatchSeedSkip was changed from 5 to 3.

  • Updated the required LibGFF dependency to v2.0.0. If you already have this installed on your system, you can pass the hint to the location to cmake using -DLIB_GFF_PATH or -DGFF_ROOT.

  • Add the "CellRanger" standard tags, CB:Z and UR:Z tags to the alignment records reported by alevin if the user passes the --writeMappings flag when running alevin.

  • Moved from (deprecated) tbb::atomic<double> to std::atomic<double> throughout the codebase, including accounting for the lack of a compare_and_swap method on the latter.

  • Changed the default gap-open penalty to 6 (from 4). This makes any gap less preferred compared to a mismatch. Note: How to properly set the default scoring scheme, as well as how to set an ideal alignment quality threshold (i.e. what is the lowest quality alignment one should allow) is not a straightforward question. This change in default accords with our belief that gaps should be penalized more in typical data. However, the ideal settings for such parameters are certainly worthy of more in-depth study, and we are looking into both empirical and theoretical mechanisms for determining how these parameters can be best determined. To obtain the old (pre 1.3.0) scoring scheme, simply pass --go 4 on the command line. You can also experiment with even more stringent gap penalties by increasing --go for gap open (current default 6) and --ge for gap extend (current default 2).

  • Changed warning message color from yellow to magenta to make it readable on both light and dark backgrounds (addresses #541).

  • Emojis in release notes :smiley:.
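For the decoy tags described in the improvements above (DS:D on @SQ header lines and XT:A:D on alignment records), removing decoy entries from a text SAM file before alignment-based quantification could look roughly like the sketch below. This is only a sketch under stated assumptions: the function name `strip_decoys` and the toy SAM lines are hypothetical, the tags are assumed to appear as standalone tab-separated fields, and BAM input would first need to be converted to SAM text with a separate tool.

```python
def strip_decoys(sam_lines):
    """Yield SAM lines with decoy @SQ headers and decoy alignment records removed."""
    for line in sam_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "@SQ":
            if "DS:D" in fields[1:]:
                continue  # header line for a decoy reference
        elif not line.startswith("@"):
            # Optional tags start at the 12th column of an alignment record.
            if "XT:A:D" in fields[11:]:
                continue  # alignment to a decoy
        yield line

# Toy SAM content: one valid target, one decoy, one alignment to each.
sam = [
    "@HD\tVN:1.0\tSO:unsorted\n",
    "@SQ\tSN:tx1\tLN:1000\tDS:T\n",
    "@SQ\tSN:decoy1\tLN:5000\tDS:D\n",
    "r1\t0\ttx1\t10\t1\t4M\t*\t0\t0\tACGT\tIIII\tXT:A:T\n",
    "r2\t0\tdecoy1\t20\t1\t4M\t*\t0\t0\tACGT\tIIII\tXT:A:D\n",
]

kept = list(strip_decoys(sam))
```

Here `kept` retains the @HD line, the @SQ line for tx1, and the alignment of r1; the decoy header and the decoy alignment are dropped.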

:bug: Bug fixes

  • Improved selective-alignment speed in a pathological case involving isolated homopolymer MEM chains. Thanks to @red-plant for raising the issue (with reproducible data) in #527.

  • Custom barcode lengths for the --citeseq mode were disabled. This has been fixed in https://github.com/COMBINE-lab/salmon/issues/531 and the --citeseq single-cell protocol can be used along with --end --barcodeLength --umiLength triplets. Thanks to @rfarouni for reporting this.

  • The variance estimates reported by the --numCellBootstraps command in alevin were not corrected for bias. This has been corrected to report unbiased estimates by multiplying the variance matrix by n/(n-1).

  • Fixed linking order issue that could, on rare custom compiles of salmon, cause memory to be allocated by TBB and freed by jemalloc (resulting in a segfault). Thanks to @mathog and davidtgoldblatt for helping to track down and resolve this one!

  • Fixed an error (regression) that could cause an overhanging read in a read pair to be improperly not marked as a dovetail (when it is). This could result in assignment preference for transcripts where the dovetailing read overhangs the transcript start.

  • Fixed a bug that could occur in certain cases of between-mem alignment where too high of an alignment score could be attributed to a mapping. This could occur when there were overlapping MEMs in the chain on the reference (a bit uncommon), and when the size of the overlap was different on the read and reference. This bug has been fixed by properly adjusting the score in all cases.

  • The dynamic and asynchronous update of the fragment length distribution could cause fluctuations in fragment-level conditional probabilities within the set of alignments for a given fragment. For duplicate transcripts this could lead to an unexpected result where sequence-duplicate transcripts could be inferred to have unequal abundance. The current release addresses this behavior by employing a fragment length distribution cache to ensure there are no fluctuations in conditional fragment length probabilities among the set of alignments for a given fragment. Note: This behavior is expected only to have affected atypical salmon usage, as duplicate transcripts are collapsed / discarded by default during indexing.
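The bias correction mentioned above for the bootstrap variance estimates amounts to scaling a divide-by-n variance by the factor n/(n-1). The following Python sketch shows that relationship on a single vector of values; it is not alevin's code, the function name `unbiased_variance` and the sample values are hypothetical, and alevin applies the correction to a variance matrix rather than to one vector.

```python
import statistics

def unbiased_variance(samples):
    """Scale the biased (divide-by-n) variance by n / (n - 1)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    biased = statistics.pvariance(samples)  # population variance, divides by n
    return biased * n / (n - 1)

# Hypothetical bootstrap estimates for one gene in one cell.
bootstrap_counts = [10.0, 12.0, 9.0, 11.0, 13.0]

biased = statistics.pvariance(bootstrap_counts)
corrected = unbiased_variance(bootstrap_counts)
```

The corrected value agrees with `statistics.variance`, which divides by n - 1 directly.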