Luntergroup Octopus Versions

Bayesian haplotype-based mutation calling

v0.5.2-beta

5 years ago

This is a minor release with some bug fixes, and improvements to installation scripts.

Improvements

  • Makes RFQUAL a FORMAT field rather than an INFO field, so each sample gets an RFQUAL. [81b75ea3af852e809c789a8a924b3aa0f9791264]
  • Installation can now be to any location. Resolves #36. [18b36eaaf789d294ddfe3514fbb0d4e03c4eeccd]
  • Installation script can now be given htslib root location. Resolves #38. [93ba0000b56cad3c490d3e8ce5b400180a6a0e46]
  • Installation script now tries both cmake3 and cmake. Resolves #37. [30a3ffb11ad216e108efefcd70549130939655bb]
  • Installation script now properly downloads provided random forests. [3599999dbedce18520389f656a352edf866fd1c7]
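With RFQUAL moved from INFO to FORMAT, downstream scripts need to read it per sample rather than once per record. The following is a minimal sketch in plain Python, assuming a standard tab-separated VCF data line. The record, the sample names, and the helper name `per_sample_format_value` are made up for illustration and are not Octopus output.

```python
def per_sample_format_value(vcf_line, sample_names, key):
    """Return {sample: value} for one FORMAT key in a single VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")  # column 9 lists the FORMAT keys
    if key not in format_keys:
        return {}
    idx = format_keys.index(key)
    result = {}
    for name, sample_field in zip(sample_names, fields[9:]):
        values = sample_field.split(":")
        # Trailing FORMAT values may be dropped for a sample; report them as missing
        result[name] = values[idx] if idx < len(values) else "."
    return result


# Hypothetical two-sample record (values are invented)
line = "chr1\t100\t.\tA\tT\t50\tPASS\tDP=60\tGT:GQ:RFQUAL\t0|1:40:12.5\t0|0:35:3.1"
rfqual = per_sample_format_value(line, ["NORMAL", "TUMOUR"], "RFQUAL")
print(rfqual)
```

For real files a VCF library is a better choice than splitting strings by hand; the sketch only shows where the value now sits in the record.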

Bug fixes

  • Fixes bug in htslib float field extraction that could corrupt FORMAT and INFO values. [0078c5a1e2fbd1abd43a65315f4de216fbf4fa9b]
  • Fixes bug in the de novo mutation model that could lead to segmentation faults. [7d945f80a7ae4db3a5b107619801e76ec93e0033, ad996a28631fc3ce58912e20db38e7b424fa61b1]
  • Fixes bug that causes conflicting call exception due to calling variants in skipped regions. [01477d9a89111b01bb81f741d8f8f334e66f0a0b]
  • Fixes bug in de novo contamination measure that could cause segmentation faults. [f2e6610f76d4d6b5be68c9c278be0a16a7bbb162]

v0.5.1-beta

5 years ago

This is a minor release that resolves a few issues in the first Beta release - v0.5.0-beta.

Improvements

  • Adds support for allosomes in the trio calling model. [41b72b22663d91e08b63121ba2a7624485a99003]
  • Moves the RFQUAL random forest score to the FORMAT field, so there is now one score for each sample. [81b75ea3af852e809c789a8a924b3aa0f9791264]
  • Adds new measures: RTB, REB, BMC, BMF. [f1000d4410fe62e8b0b8bd0080d4720b81024710, 1a937f2f48aad986ea76b799f666155da4fccc08]
  • Improves temp directory cleanup on failed runs. [95016f138b27f8b683c511025d83ec539dc2cc0f]
  • Makes random forest training a little easier by adding default measure lists to training scripts and by allowing the argument forest to the --training-annotations option (renamed from --csr-train). [519be06afbc7ddc3c70b4a5da899a22d18391b5c, 106e3443c4ed5319d84f621b5b0eaf50c46db179]
  • Changes some of the UMI config settings to reduce runtimes (at a minor expense of accuracy). [36e47295df27ab19f51e2207eaee4842ab88ca32]

Interface changes

  • Renames --csr-train option to --training-annotations. [a1f8c45878ac1ca05c496f4b6b6c344c21a1ab10]
  • Adds version numbers to provided random forests. [3599999dbedce18520389f656a352edf866fd1c7]
  • Renames the RPB measure to RSB. [c52c6e8cb220e5db1171aa857141617b8aedf7c4]

Bug fixes

  • Resolves a libc++ bug where subnormal doubles are not parsed properly, causing errors when using random forest filtering. [dc137542403b7c9af73257151472936ccd5a0844]
  • Fixes a possible segmentation fault when using the MQD measure. [45b9b742d09cb037ffa605c719695ae22d94a066]
  • Fixes a VCF reading bug that could mangle INFO and FORMAT fields with multiple values. [0078c5a1e2fbd1abd43a65315f4de216fbf4fa9b]

v0.5.0-beta

5 years ago

This is the first beta release as most of the core features are reasonably mature. There have been various stability and runtime improvements, in addition to improvements to the core algorithm - including a completely new indel mutation model. Once again, the cancer calling model has received most attention, particularly for high depth ultra-low VAF tumour-only calling (e.g. UMI).

General

  • Overhaul of the indel mutation model which controls priors on germline, somatic, and de novo mutations. Gap open and extensions conditional on local repeat context and current gap length are modelled. [bd0eb24bfd09efacbadb306af3a0af15827b7015, 20f5d9ff1facdfee90b2c88b8b603986b0e01fce]
  • A brand new candidate variant generator! Named RepeatScanner, this generator looks for likely misaligned SNV runs in microsatellites and proposes indels. This can result in more biologically realistic calls in these regions. This generator is controlled with the --repeat-candidate-generator command line option. [2856c2e6b8a5683f07c19d3f40e1c2f3b467bacd, 2856c2e6b8a5683f07c19d3f40e1c2f3b467bacd]
  • Evidence BAMs for multi-sample input, including 'split' evidence BAMs. [face5fb7d7627154b1628f11a4aed64cd25a51ad, e56641c75c92cd104463fd3435a0fea0d3807793]
  • The way QUAL is calculated in the cancer and trio models has been improved. Previously, QUAL was the posterior probability that the called alt allele segregated and was classified correctly. This could lead to low QUAL scores if the classification was uncertain (e.g. in tumour-only samples). QUAL is now simply the posterior probability that the allele segregates. There is also a new annotation, PP, for all cancer caller calls and DENOVO trio calls, which is equivalent to the old QUAL. [905c96b7362ba2513c920e33d896751490cc32f0, 3b28e9fe85af4aef4408cb3b31c959408a0ba129, 0d1537b9012326d4e8e3d98d718e0f81ff73219e]
  • Candidate variant generators are now more sensitive to very low frequency variation (<1% VAF). [d3e36316c47c7d736fde48611baaef408f7078c9]
  • SOMATIC calls have a new annotation, MAP_VAF, which reports the maximum a posteriori VAF estimate.
  • New measures to use for threshold and random forest filtering. [11ff14faaa141ddb290dd31f6a2686adf5f51269]
  • Complete refactor of the core cancer caller genotype models results in some runtime improvements. [d3e5a5a0fc11e3462b63de8e7cc6c3c36080c006]
  • Better Variational Bayes seed generation for cancer genotypes, especially good news for lower frequency mutations. [2fadf78a51844d30ff464c72931075167cfa15d1]
  • Improved somatic model fitting for high ploidy somatic genotypes in cancer caller. [2d7573c334f880aecbb064b0e5994149eff10815]
  • Improved use of indexing in the individual caller results in ~5% speedup. [b6bba8a947d1f7c91044874479b8174b94540fdd, 16a3cc5a22f9405585fce7e6cca2b47ff609d977, 9c951d2c17dd92fca8ab1778b883fa45f825ffeb]
  • Better identification of messy regions that slow down calling. [5326835a901c0a542b792a51781bae4730bfaba3, 8208a204fd08c21f980d1fdbf1197168cf00fcfb]
  • The assembler now considers observed read strands and reduces the score of bubbles with high strand bias. [50da8040f04af3c3c9eea5180d2d635b6cf76125]
  • Filtering measures can now be parameterised by user input. [e1ab33090304ff6acd1109dcf2d92248f8354d47]
  • The way some measures consider ambiguous reads has been improved which can prevent some biases previously observed. [7e2f635a2a9745396b8f46f70aaed959d068fe06]
  • Adds support for calling chromosome Y in trios. [41b72b22663d91e08b63121ba2a7624485a99003]
  • Adds a "data profiler" that can be used to build a profile of polymorphisms and errors present in the data. Currently this only profiles indels. This feature is currently experimental and is primarily intended to be used to improve indel error models. [99ad1e94f1e506051dc8b61d6f623bdbef92184a]

Bug fixes

  • Fixes a bug that could lead to segmentation faults during haplotype generation. [1ecd74e7a45f2337426728e90bf5a3c90f52592a]
  • Fixes a problem reading lists of floats from VCF files that could result in garbage output (e.g. for VAF_CR). [e361f5065da83a9d1febabf4dcac9c7578dc3e8e]
  • Fixes a GCC 8 warning that caused a compile error. [58b51fd14b73bf5dbcd8f50a4d9704f39acf985f, 3733b09e643de92226010dd866006786fd609375]
  • Fixes some instances of compiler-based non-determinism that could produce different results between compilers. [d01819396161e76a14cc1605d63da2abf35901aa, e66169e5724ce3251fd3071a01a5d5e8e1db1599]

Interface changes

  • Adds command line option --max-vb-seeds which controls the maximum number of seeds the Variational Bayes based genotype model algorithms can use. [95c66a2ec89fe37adb8a4707d15b69bf17f25563]
  • Adds --split-bamout for split realigned BAMs. Split BAMs are no longer requested by specifying a prefix to --bamout. [34d8a89748cd363e967cea89774531efa73a9dbb]
  • The measure SC has been renamed to NC (Normal Contamination). [23497c3aaf0c93c9ca633f96778f8f74c4a5a4b3]
  • Adds --mask-tails for unconditionally masking bases of all read tails. [acfddaf1b5e910496b737f3dd6cab2667dadae4b]
  • Adds --tumour-germline-concentration which may be used to control shape of prior distribution on haplotype mixture frequency of tumour samples. Only really relevant to high depth tumour-only calling. [9f83ca6fce24ced6ea901845f3c474ecfc6a1867]
  • Renames --snv-denovo-mutation-rate to --denovo-snv-mutation-rate and --indel-denovo-mutation-rate to --denovo-indel-mutation-rate. [4b9d95f448ef1f8d2375947a58d664850a868c18]
  • Adds --repeat-candidate-generator to control new repeat candidate generator. [2856c2e6b8a5683f07c19d3f40e1c2f3b467bacd]
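Pipelines that still pass the old de novo mutation rate option names will need updating. The following is a minimal sketch of one way to rewrite an argument list, using only the two renames listed above. The helper name `migrate_args` and the example values are hypothetical, and support for the `--opt=value` form is assumed as a general command line convention rather than confirmed for Octopus.

```python
# Renames taken from the v0.5.0-beta interface changes above
RENAMED_OPTIONS = {
    "--snv-denovo-mutation-rate": "--denovo-snv-mutation-rate",
    "--indel-denovo-mutation-rate": "--denovo-indel-mutation-rate",
}


def migrate_args(args):
    """Rewrite old option names; handles '--opt value' and '--opt=value'."""
    migrated = []
    for arg in args:
        name, sep, value = arg.partition("=")
        # Unknown names (and plain values) pass through unchanged
        migrated.append(RENAMED_OPTIONS.get(name, name) + sep + value)
    return migrated


# Placeholder values, for illustration only
old = ["--snv-denovo-mutation-rate", "1e-8", "--indel-denovo-mutation-rate=1e-9"]
new = migrate_args(old)
print(new)
```

Options whose meaning changed rather than just their name (for example, filters that flipped their default) are deliberately left out, because a name mapping alone cannot migrate them safely.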

Miscellaneous

  • There is now a configs directory in the main project directory that contains pre-written configs for calling certain types of data. [9da036416ff2bd7a36f5f734aebbd391df7c48f4]

v0.4.1-alpha

5 years ago

This is a bug fix release that fixes a minor bug that crept into v0.4.0-alpha.

Bug fixes

  • Fixes a bug in v0.4.0-alpha where germline calls may be hard filtered when using threshold filtering.

v0.4.0-alpha

5 years ago

This is a major release with important new features, enhancements, and performance improvements.

New features

  • New polyclone calling model for bacterial and viral data.
  • New population calling model with Hardy-Weinberg priors.
  • Random forest filtering for germline and somatic variants using ranger.
  • Generate an 'evidence' BAM for single sample calling with the --bamout option. See the wiki page for details.
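For readers unfamiliar with the Hardy-Weinberg priors mentioned for the population calling model, the sketch below computes genotype priors from allele frequencies under Hardy-Weinberg equilibrium. It illustrates the textbook principle only and is not Octopus's implementation. The function name `hardy_weinberg_priors` and the allele frequencies are assumptions chosen for the example.

```python
from collections import Counter
from itertools import combinations_with_replacement
from math import factorial


def hardy_weinberg_priors(allele_freqs, ploidy=2):
    """Genotype priors under Hardy-Weinberg equilibrium.

    Each unordered genotype gets a multinomial probability in the allele
    frequencies, e.g. p^2, 2pq, q^2 for a diploid biallelic site.
    """
    priors = {}
    for genotype in combinations_with_replacement(sorted(allele_freqs), ploidy):
        coeff = factorial(ploidy)
        prob = 1.0
        for allele, count in Counter(genotype).items():
            coeff //= factorial(count)  # multinomial coefficient
            prob *= allele_freqs[allele] ** count
        priors[genotype] = coeff * prob
    return priors


# Made-up allele frequencies for a biallelic site
priors = hardy_weinberg_priors({"A": 0.9, "T": 0.1})
for genotype, prior in priors.items():
    print(genotype, round(prior, 4))
```

The priors over all genotypes sum to one whenever the allele frequencies do, which is a quick sanity check when experimenting with other ploidies.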

Calling improvements

  • The cancer caller can now model more than one somatic haplotype which improves calling sensitivity, and also allows somatic phasing. See cancer calling model wiki for more details.
  • Optimisation of the cancer model improves sensitivity for low frequency mutations.
  • New unified indel mutation model used for germline, de-novo, and somatic indel calling.
  • New filter measures. See wiki for full list.
  • Tumour-only calling now much faster and more accurate.
  • Uses variant prior model to deduplicate haplotypes for all models, resulting in more biologically realistic calls.
  • DENOVO and SOMATIC calls now get different filtering treatment to regular germline variants using threshold filters.

Interface changes

  • Added --forest-file and --somatic-forest-file for random forest filtering.
  • Added --somatics-only to report only SOMATIC variants.
  • Added --denovos-only to report only DENOVO variants.
  • Added --max-somatic-haplotypes which limits the number of somatic haplotypes that may be used by the cancer calling model.
  • --consider-reads-with-unmapped-segments --> --no-reads-with-unmapped-segments and --consider-reads-with-distant-segments --> --no-reads-with-distant-segments. These filters are now off by default.
  • --max-cancer-genotypes removed and replaced with --max-genotypes, which is also used by the polyclone calling model.
  • Added --max-clones option for specifying the maximum number of clones for the polyclone calling model.
  • Added --somatic-filter-expression, --denovo-filter-expression, and --refcall-filter-expression which may be used for hard filtering DENOVO and SOMATIC calls.

v0.3.3-alpha

6 years ago

This version brings new features, in addition to significant calling and runtime improvements.

New features

  • CSR filtering can be run on a user supplied octopus VCF file, without running calling (--filter-vcf command line option).
  • Micro-inversions and complex rearrangements are callable.

Calling improvements

  • Better handling of variants in tandem repeat regions. In particular, many cases that would previously have been called as a series of SNVs are now called as an insertion-deletion pair, which is more biologically plausible.
  • Improved the SNV error model to stop some true heterozygous SNVs being called as homozygous.

Runtime improvements

  • CSR filtering is fully parallelised. As with calling, this is activated with the --threads option. This resolves #13.

Bug fixes

  • Various fixes to the way haplotypes are reconstructed from VCF, which led to some edge cases being misclassified.

Interface changes

  • The helper Python install script install.py can now be given both a C++ and a C compiler, using the cxx_compiler and c_compiler options respectively.
  • Supplementary alignments are now filtered by default (--no-supplementary-alignments changes to --allow-supplementary-alignments).
  • Secondary alignments are now filtered by default (--no-secondary-alignments changes to --allow-secondary-alignments).

Other changes

  • htslib is now linked dynamically by default, which means its dependencies no longer need to be linked explicitly. This resolves #16. Be sure to clean any CMake caches before rebuilding (--clean with the Python install script).
  • .vcf.gz index files are now in the .tbi format, rather than .csi.

v0.3.2-alpha

6 years ago

This version brings bug fixes and some minor performance improvements.

Bug fixes

  • Fixes issue #11 where octopus hangs after calling variants.
  • Fixes issue #17 where contig names containing a colon could not be parsed.
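To illustrate why colons in contig names are awkward, the sketch below parses a region string while allowing ':' inside the contig name. It assumes a samtools-style `contig:begin-end` syntax and is not Octopus's actual parser. The function name `parse_region` and the example contig names are hypothetical.

```python
import re

# The coordinate part must look like 'begin-end' to be treated as a range
_RANGE = re.compile(r"^(\d+)-(\d+)$")


def parse_region(text):
    """Parse 'contig' or 'contig:begin-end'; the contig may itself contain ':'."""
    contig, sep, tail = text.rpartition(":")  # split on the LAST colon only
    match = _RANGE.match(tail) if sep else None
    if match is None:
        # No trailing range, so the whole string is the contig name
        return text, None, None
    return contig, int(match.group(1)), int(match.group(2))


print(parse_region("chr1:100-200"))
print(parse_region("HLA-A*01:01:100-200"))
print(parse_region("HLA-A*01:01"))
```

Splitting on the last colon and requiring a full `begin-end` suffix removes most of the ambiguity. A more robust parser would also check the candidate contig name against the reference index.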

Performance improvements

  • Gap open penalties are now more consistent in tandem repeats, which can improve calling performance in some cases.
  • Decreased the minimum probability cap for de novo mutation model which seems to result in more sensitive de novo and somatic mutation calls.

Interface changes

  • Somatic SNV and INDEL mutation rates are now specified separately via the command line.

v0.3.1-alpha

6 years ago

This release contains some runtime performance improvements, particularly for the tumour calling model. It also updates the requirements for GCC, CMake, and Boost.

Requirement changes

  • Updates CMake requirement to 3.9 so that IPO checks can be used.
  • Updates Boost requirement to 1.65 for bug fixes and better program option formatting.
  • Updates GCC requirement to 6.3 to avoid bug in 6.2.

Performance Improvements

  • Significantly improves runtime performance of tumour calling model.
  • Improves masking of noisy regions which can slow down calling.
  • Slightly improves CSR runtime performance.

Other changes

  • Fixes various warnings from new Clang and GCC compilers.
  • Can now build with compiler sanitizer flags.
  • Adds a Dockerfile.

v0.3-alpha

6 years ago

This is a major release that contains significant new features and improvements.

New features

  • Variant filtering: Octopus now has simple threshold-based filtering which is turned on by default. This can dramatically reduce the false positive rate in some datasets (e.g. Platinum genomes).
  • The population model now uses an independence-based genotype model. Although this doesn't offer true joint calling, it at least offers consistent output until such time as a proper model is implemented.
  • Somatic mutation calling is now significantly faster and more accurate due to model optimisation.

Bug fixes

  • Fixed a bug with haplotype filtering that could cause haplotypes not to be filtered, and also result in inconsistent results between runs.

Other changes

  • VCF records now include AC and AN INFO fields.
  • Added an official logo!
  • Protect called haplotypes from filtering when using holdouts.
  • Octopus will now always emit a call if the variant posterior is above the given threshold, even if the homozygous reference genotype is MAP.
  • The max QUAL is now 10000.
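The QUAL cap is easiest to see through the Phred scale that the VCF format uses for QUAL, namely -10 * log10 of the probability that the call is wrong. The following is a minimal sketch, assuming the cap is applied as a simple upper bound on the Phred-scaled value. How Octopus applies it internally is not stated here, and the function name `phred_qual` is hypothetical.

```python
import math

MAX_QUAL = 10000  # cap stated in the release notes above


def phred_qual(posterior, max_qual=MAX_QUAL):
    """Phred-scale a posterior probability, capped at max_qual."""
    error = 1.0 - posterior  # probability the call is wrong
    if error <= 0.0:
        # A posterior of exactly 1 would give infinite QUAL, so return the cap
        return float(max_qual)
    return min(-10.0 * math.log10(error), float(max_qual))


print(round(phred_qual(0.999), 2))
print(phred_qual(1.0))
```

A cap matters because posteriors computed in floating point can round to exactly 1, which would otherwise produce an infinite QUAL.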

v0.2.1-alpha

7 years ago

This release includes a new de novo mutation model that improves trio calling.

New features

  • A new de novo mutation model that includes context-dependent indel gap open and extension penalties, calculated using an exponential model. There are now two options that parametrise the model: snv-denovo-mutation-rate and indel-denovo-mutation-rate. Gap open and extension penalties are weighted based on context.

Bug fixes

  • Fixes a bug that could prevent a legacy VCF being made.
  • Corrects a region difference method that sometimes resulted in incorrect 'skip region' deduction, which could lead to an exception being thrown.
  • Fixes a bug that resulted in an incorrect trio model posterior probability.
  • Fixes some numerical overflow/underflow bugs that resulted in undefined behaviour.

Other changes

  • Increases max-joint-genotypes to 1,000,00.