Luntergroup Octopus

Bayesian haplotype-based mutation calling

v0.7.4

3 years ago

Bug fixes

Restores bug fixes from 0.7.2 that were accidentally reverted.

Improvements and modifications

QUAL scores less than 1 are now reported to 2 significant figures.
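
As a rough illustration of the QUAL change, the Python sketch below formats values below 1 to two significant figures. The helper name format_qual and the two-decimal formatting used for values of 1 or more are assumptions made for illustration; these notes only describe the behaviour for values below 1.

```python
def format_qual(qual: float) -> str:
    """Format a QUAL value for output (hypothetical helper, not Octopus code)."""
    if 0 < qual < 1:
        # 'g' formatting keeps two significant figures, e.g. 0.01234 -> "0.012".
        return f"{qual:.2g}"
    # Assumption: values of 1 or more keep two fixed decimal places.
    return f"{qual:.2f}"


print(format_qual(0.01234))
print(format_qual(0.5678))
```

Significant-figure formatting keeps very small QUAL values distinguishable, where fixed decimal places would round them all to the same string.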

v0.7.3

3 years ago

This release contains several minor improvements:

Replaces the old random forest training procedure with a Snakemake version. [e0c492251d85bc9926c2dba2f28c87159da909ad]
All annotations can now be requested even if they are not active. [9bb584cd7f92a3dcdeb93b009f7a82770dc64eb8]
Annotations are no longer aggregated by default when --disable-call-filtering is used. To aggregate annotations for forest training, the --aggregate-annotations option is added. [42f424870307861e6eb9cb8f53245f03357973f6]
Big runtime improvement to the cell calling model. [bc062d24c1e344a09ba42240e206f604ad8ab771]
Changes the default --max-genotype-combinations to 100,000 for trio and population calling. This improves runtime considerably but has little impact on accuracy. [18fcbcbc354479ebf0abe6a2d08e048041eb0918]
Trains a new germline random forest using much less training data overall but more trio data.
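
To see why a cap on joint genotype combinations matters for trio and population calling, the Python sketch below counts them under a simple assumption: each sample's genotype is an unordered multiset of `ploidy` haplotypes drawn from the candidate set, and the joint space is the product over samples. The function names are hypothetical, and how Octopus chooses which combinations to keep is not described in these notes.

```python
from math import comb


def genotype_count(num_haplotypes: int, ploidy: int = 2) -> int:
    """Number of unordered genotypes (multisets of size `ploidy`)."""
    return comb(num_haplotypes + ploidy - 1, ploidy)


def joint_combinations(num_haplotypes: int, num_samples: int = 3, ploidy: int = 2) -> int:
    """Size of the joint genotype space if every sample is considered independently."""
    return genotype_count(num_haplotypes, ploidy) ** num_samples


for h in (2, 4, 8, 16):
    print(h, joint_combinations(h))
```

Under these assumptions, a diploid trio with 16 candidate haplotypes already has over 2.5 million joint combinations, well above the new default of 100,000.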

v0.7.2

3 years ago

This is a minor bug fix release:

Fixes a segmentation fault in the cancer caller caused by adjacent phase blocks with different ploidies. [9bc21a34f980075a9746014a42eddd177c32f325]
Default to UTC time when no tz database found. [3dbd8cc33616129ad356e99a4dae82e4f6702250]
Prints annotations when --annotations specified with --help. [366fe044327504487e50f51224582e82d3fdda1e]
Fixes a bug causing some reads to be dropped when filtering long haplotype regions. [8f2cc87a731c48c38d71eab25a5a20f6fe82b84b]
Update the link in the README to the Nature Biotechnology paper! [b8baf136ba9c579c22fc8db8b69b66d350379ae1]

v0.7.1

3 years ago

This is a minor bug fix release:

Fixes underreporting of de novo mutations in trio mode [7d195dee033d178680b5d12b89c54ec5b4b4978e].
Improves QUAL precision for trio calls [b68c7cf73bbc9500e740d12c4e91783fc0420773].
Resolves some issues with read counting (e.g. AD) on * alleles [e0023adf0f1bf0f97af474a0a5016ea213f1cc62].
Fixes underreporting of more than one somatic haplotype in cancer mode [0c7d06bca5ea072e6b53eb72238624fc5c7bc103].
Improves coalescent model to allow 2 indel heterozygosity parameters along a haplotype [58db9346ab06216bc4478fa011ccf372f557565e].
Adds timestamp to VCF output [36ca0b908726486476292308a867dc6bc4e67edb].
Adds --architecture option to install.py that sets compiler march option [87faea93b08e465c74dd28c7452a66a74a644ed2].
Default minimum mapping quality filter set to 5.

v0.7.0

3 years ago

This is a major release since v0.6.3-beta and is the first non-beta release. Highlights include:

The pair HMM used for the core haplotype likelihood model has been completely re-written to support AVX2 and AVX-512 instruction sets. This can result in some nice performance improvements on machines supporting these instructions. Also, the HMM now supports variable band-widths and 32-bit integer scores, which is necessary to evaluate long reads.
Evidence BAMs are now annotated with supporting haplotype(s) and other information. Automatic 'splitting' by haplotype is gone, but a script is provided to do this.
Octopus is now paired and linked read aware! Reads are assumed paired by default, but can be assumed unpaired or linked with the --read-linkage option. This improves accuracy and phasing for most analyses.
Random forests now store the annotations used for training as meta information in the forest file, allowing different annotations to be used for different forests. Note that this change makes previous forest versions incompatible with this version. It also means that a modified ranger must be used for training (the main ranger package does not store variable names in the meta info).
Allele-level annotations (e.g. AD) are now supported; they can be requested with the --annotations option.
The phasing algorithm has been completely re-written to improve accuracy and to allow discontiguous phase sets, which can frequently occur in some analyses (e.g. linked reads, or somatic phasing).
Calling from PacBio CCS reads is now supported - although improvements are still needed, especially regarding runtime. See the PacBio CCS config.
The haplotype generator now supports 'backtracking' - where a block of partially resolved haplotypes is buffered, and then restored when downstream haplotypes have also been partially resolved. This can lead to long haplotypes much faster than keeping all haplotypes in the tree simultaneously. Backtracking is turned off by default, but can be enabled with the --backtrack-level option.
Mixing of distinct sample ploidies is now supported by the population calling model.
Overflows on QUAL and GQ have been reduced allowing for much greater ranges on these statistics.
The use of the * ALT allele has been brought in line with the updated VCF v4.3 specification. The --legacy option has therefore been removed.
New RFGQ_ALL INFO measure for random forest filtered runs - the empirical probability (Phred) of all genotypes being correct (derived from each FORMAT RFGQ). Use this for filtering tumour-normal calls etc.
Handling of ALT supplementary alignments (for GRCh38 etc) has been improved, resulting in better accuracy.
Polyploid calling is much faster, especially when the --max-genotypes option is used (recommended for anything over triploid).
The local re-assembler now automatically considers the average region depth when evaluating bubbles, resulting in fewer spurious candidate variants.
The local re-assembler no longer allows cyclic graphs by default, resulting in far fewer spurious candidates with very little loss in sensitivity. Cyclic graphs can be re-enabled with the --allow-cycles option.
Haplotypes (i.e. phased GT entries) are now reported in a consistent manner - always lexicographical (w.r.t. the implied haplotype). This breaks the previous rule that somatic haplotypes always appeared after germline ones - somatic haplotypes are now identified with the HSS FORMAT annotation.
The way genotypes are represented has been completely re-written, resulting in some nice runtime performance improvements for all calling models.
The way filtering measures are calculated has been re-written, resulting in a nice runtime performance improvement for filtering.
The way Octopus identifies 'uncallable' regions that tend to slow down analysis has been much improved, resulting in much better runtimes.
Automatic dependency installation in the installation script has been much improved, and is now the recommended way to install Octopus on all operating systems.
Many bug fixes.
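
The RFGQ_ALL measure described above combines per-sample RFGQ values into one Phred-scaled score. These notes do not give the formula, so the Python sketch below is only one plausible reading: it assumes each RFGQ is a Phred-scaled probability that the sample's genotype is wrong, and that errors are independent across samples. The function name combine_rfgq is hypothetical.

```python
import math


def combine_rfgq(rfgqs):
    """Phred-scaled probability that at least one genotype is wrong.

    Assumption (not from the release notes): each RFGQ is a Phred-scaled
    error probability and samples are independent.
    """
    if any(q <= 0 for q in rfgqs):
        return 0.0  # an error probability of 1 leaves no confidence at all
    # Sum log-probabilities of each genotype being correct for numerical stability.
    log_all_correct = sum(math.log1p(-(10 ** (-q / 10))) for q in rfgqs)
    p_any_wrong = -math.expm1(log_all_correct)
    if p_any_wrong <= 0:
        return math.inf
    return -10 * math.log10(p_any_wrong)


# Hypothetical tumour-normal pair, each with RFGQ 30.
print(round(combine_rfgq([30, 30]), 2))
```

Under these assumptions the combined score is always at most the smallest per-sample score, which is why a single threshold on it can filter tumour-normal calls.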

v0.6.3-beta

5 years ago

This release reduces runtime in the cancer and polyclone calling models by 20-25%, fixes a bug in the read deduplication algorithm (resulting in fewer false positive calls in PCR data, particularly for somatic calling), adds new read pre-processing options designed to mitigate systematic artefacts in 10X Genomics sequencing, and adds a new metric (RFQUAL_ALL) to filter somatic variant calls.

New features / interface changes

Adds command line options --mask-inverted-soft-clipping and --mask-3prime-shifted-soft-clipped-heads for masking 10X Genomics sequencing artefacts. [0b8fb935d93154b624f644940e0375f8c92b62c0, 6566fb2432cdac01fa43e0b217ae985c105993e9]

Improvements

Reduces runtime in the VariationalBayesMixtureMixtureModel used in the cancer and polyclone calling models by ~20-25% [d9cbcec3c24a9460d709b30ff0ea006d08e55491]
Switches multi-precision floating point arithmetic in the cancer calling model to use GMP library, resulting in a small speedup. This change adds a dependency to GMP. [e59be9dbf5cd768846b3de1170a185d67d3a06d3]

Bug fixes

Fixes the read deduplication algorithm so that reads with multiple duplicates are recognised. [01eb88d0acd3e07b33cea4210a1126cef0e0e407]
Fixes a bug in the NC and SMQ measures that could cause an exception to be thrown. [a1262e2a14d6c2efa7c7045a470cfe4b3dd7a209, 226b40a272cbd57721c29645b882637ac15474ef]
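
The deduplication fix above concerns groups of more than two duplicate reads. A minimal Python sketch of group-based deduplication follows. It assumes duplicates are reads sharing contig, start position, strand, and mate start, and that the read with the highest summed base quality is kept. Both the grouping key and the tie-break are assumptions for illustration, not Octopus's actual criteria.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    name: str
    contig: str
    start: int
    reverse: bool
    mate_start: int
    base_quality_sum: int


def deduplicate(reads):
    """Keep one read per duplicate group, however large the group is."""
    groups = defaultdict(list)
    for read in reads:
        # Hypothetical duplicate key; real tools may use other fields.
        key = (read.contig, read.start, read.reverse, read.mate_start)
        groups[key].append(read)
    # Keep the highest-quality read in each group.
    return [max(group, key=lambda r: r.base_quality_sum) for group in groups.values()]


reads = [
    Read("a", "chr1", 100, False, 300, 900),
    Read("b", "chr1", 100, False, 300, 950),
    Read("c", "chr1", 100, False, 300, 800),
    Read("d", "chr1", 150, True, 20, 700),
]
print([r.name for r in deduplicate(reads)])
```

Grouping all reads by key first, rather than comparing neighbours pairwise, is one way to make sure a third or fourth duplicate is not missed.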

v0.6.2-beta

5 years ago

This is a minor bug fix release.

Bug fixes

Fixes a bug that results in the cancer calling model throwing an exception when provided with a single candidate haplotype. [59bceefe7acdcfb610a12ea818e1354a2bb1cc42]
Fixes a bug that could cause a segmentation fault due to the haplotype leaf list becoming corrupted when removing regions from the tree. [ec7ee85224513968dd4ecf53ad5929fd3365a346]

v0.6.1-beta

5 years ago

This is a minor release that fixes some bugs and compilation issues, and adds better binary version logging.

Changes

The git branch and commit, and some system information, are now logged during compilation. This information is available with the --version command. [242dd00549cc27f6629619d5d3bf7c0866ff0c29, 36a6a82c28d99cd17f0e169fa28e876f28b0c82f, c6c397d1ca5d3c50a372c83d6a75ca491ef4e678]
Adds measure ADP for assigned sequence depth (i.e. reads assigned to a unique called allele). [702109ee85f362c6cce739aa0ff3a1be19e436a8]
Adds measures ADP and VL to default random forest measures. [a0359530077794cadb3723253f5fb783d9fca975]
Adds support for gzipped region files (for options --regions-file and --skip-regions-file) [ec41af47350067808c3c27a434563a948bf652dc]
Reads that cannot be assigned to a unique haplotype are assigned randomly to any of the supporting haplotypes for bam realignment (rather than always assigning to one of them). [cb3faf950a02f6577e21ae8f42b7b72cf6b4694b]
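
For scripts that prepare or inspect region files outside Octopus, the gzipped region file support above can be mirrored with a small reader that accepts both plain and gzipped input. The Python sketch below assumes one region per line and detects gzip by its two-byte magic number. The function names are hypothetical and this is not how Octopus itself reads the files.

```python
import gzip


def open_regions_file(path):
    """Open a region file as text, whether or not it is gzip-compressed."""
    with open(path, "rb") as raw:
        magic = raw.read(2)
    if magic == b"\x1f\x8b":  # gzip magic number
        return gzip.open(path, "rt")
    return open(path, "rt")


def read_regions(path):
    """Return the non-empty lines of a region file, one region per line (assumed layout)."""
    with open_regions_file(path) as handle:
        return [line.strip() for line in handle if line.strip()]
```

Checking the magic number rather than the file extension means a compressed file is still read correctly if it lacks a .gz suffix.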

Bug fixes

Corrects measures AD and AF calculations. [2a6a1065319a4cb0416f2162def18b362d9e09d2, 66e4466350edb5c7c17c2eab218e46f28699293b]
Adds check for overflow in SIMD pHMM method that could result in segfaults. [08c231049c40f26b6667e3027b37fd7a610cbfe5]

v0.6.0-beta

5 years ago

This release improves calling accuracy, includes more flexible error modelling, and adds annotations to filtered VCF and realigned BAM files.

Interface changes

The --training-annotations option is replaced with --annotations, which has slightly different behaviour (see below).
The --split-bamout option is removed as --bamout realignments now include tags.
Adds the option --full-bamout. [1147e8f72f0fe3613ace63580ee592677f2f8466]
Adds the option --refcall-block-merge-threshold for controlling refcall blocks.
Renames --extract-filtered-source-candidates to --use-filtered-source-candidates. [6972ffaf0f5ca88fb56e2b5e5f7a066462f64a37]

Improvements

Indel error models now include variable gap extension penalties and account for tetra-nucleotide tandem repeats. [8f40fc3d3e8feef5c078d45ec8e17b3ec1955946]
More built-in sequence error models to choose from, and custom error models (see wiki). [8f40fc3d3e8feef5c078d45ec8e17b3ec1955946]
Annotations can now be requested for filtered VCF files using the new --annotations option. [c75cbac60cdc27864b29097f1d608d89d31cbb68]
Reference calling now outputs calls in adaptive blocks using the new --refcall-block-merge-threshold option. [9127cf3f6f5a11f0268e14b9e51f177b9d1825d3]
Better handling of temporary BCF files in multithreaded mode helps prevent system errors due to too many open files (addresses issue #52). [42fa364aa3ca64ea25613458c8f6a45dbab5f34f]
Adds annotations to realigned evidence BAMs (see wiki). [c047e96978d40234314cf82bcf35f12d835c3b6f]

v0.5.3-beta

5 years ago

This is a minor release containing some bug fixes and an improved installation method.

Interface changes

Renames option --download to --download-forests in Python installation script. [166b6bea998242a821a23074f04253266976d297]
Adds command line option --temp-directory-prefix for setting name of temporary directory. [69ea9e73c41b73ffd6f93d9c1df9ba13588d21de]
Adds cell caller prototype (undocumented). [9f496d30e01e12146e5face4a010f42ee0407c23]

Improvements

The assembler now adjusts its support threshold depending on depth, which should prevent too many false candidates from high-depth samples. [d582cbc4b3880822b87dac74a81bd5ae5f75851c, 91930cb7231981a33524b75dcecd9aff582b8755]
Adds the --install-dependencies option to the Python installation script, which results in all dependencies being installed locally. [81c6535e47a4ba3d772e05b5afdfea83f884cd0d, 5ac69585f0da395baee1a73d9c7bd621bfd149a3]
Allows candidates only seen on one strand if there are only reads seen in that direction overlapping the region. Addresses #45. [44c5c2268d4802b6dda5669752e6f91292b6c659]
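
The single-strand rule above can be sketched as a small predicate in Python. The function and parameter names are hypothetical, and the sketch assumes the inputs are simple counts of candidate-supporting reads and of all overlapping reads on each strand. Any further thresholds Octopus applies are not described here.

```python
def allow_candidate(cand_fwd: int, cand_rev: int, overlap_fwd: int, overlap_rev: int) -> bool:
    """Decide whether a candidate passes the strand check.

    cand_*: reads supporting the candidate on each strand.
    overlap_*: all reads overlapping the region on each strand.
    """
    if cand_fwd > 0 and cand_rev > 0:
        return True  # seen on both strands
    if cand_fwd > 0:
        # Forward-only support is acceptable only if no reverse reads overlap.
        return overlap_rev == 0
    if cand_rev > 0:
        # Reverse-only support is acceptable only if no forward reads overlap.
        return overlap_fwd == 0
    return False  # no supporting reads at all


print(allow_candidate(3, 0, 10, 0))
print(allow_candidate(3, 0, 10, 8))
```

The intent is that a candidate is not penalised for strand bias when the data itself only contains reads in one direction.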

Bug fixes

Resolves exception when merging temporary files for contigs containing : (e.g. HLA-A*01:01:01:01). Resolves #44. [https://github.com/luntergroup/octopus/commit/37a3329239518a07b2b9ca62e28a7570d9773667]
Stops output of IUPAC ambiguity symbols, which are not permitted by VCF specification. Resolves #46. [da055138540dc06be2d66960d3cc0f812788ff70]
Should prevent an exception being thrown during filtering caused by a short haplotype for realignment (see issue raised in #41). [0942ddeb10e2a271d528da5e9867ce43637bace1]
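
For checking VCF output from other sources against the IUPAC rule mentioned above, a minimal Python sketch follows. It flags IUPAC ambiguity symbols in plain sequence alleles. The helper name is hypothetical, and how Octopus itself handles such bases internally is not described in these notes.

```python
from typing import List, Tuple

# IUPAC nucleotide ambiguity codes other than N.
IUPAC_AMBIGUITY = frozenset("RYSWKMBDHV")


def find_ambiguity_symbols(allele: str) -> List[Tuple[int, str]]:
    """Return (position, symbol) for each ambiguity code in a sequence allele."""
    # Symbolic alleles such as <DEL> and the * allele are not base sequences.
    if allele.startswith("<") or allele == "*":
        return []
    return [(i, base) for i, base in enumerate(allele.upper()) if base in IUPAC_AMBIGUITY]


print(find_ambiguity_symbols("ACRT"))
```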