DeepVariant Versions

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.

v1.6.1

1 month ago

In this release:

We fixed a bug in call_variants that caused the step to freeze in cases where there were no examples. This bug was observed and reported in https://github.com/google/deepvariant/issues/764, https://github.com/google/deepvariant/issues/769, https://github.com/google/deepsomatic/issues/8.
Updated libssw library from 1.2.4 to 1.2.5.
The same model files are used for v1.6.0 and v1.6.1 for all technologies.

v1.6.0

6 months ago

Improved support for haploid regions, chrX and chrY. Users can specify haploid regions with a flag. Updated case studies show usage and metrics.
Added pangenome workflow (FASTQ-to-VCF mapping with VG and DeepVariant calling). Case study demonstrates improved accuracy.
Substantial improvements to DeepTrio de novo accuracy by specifically training DeepTrio for this use case (for chr20 at 30x HG002-HG003-HG004, false negatives were reduced from 8 with DeepTrio v1.4 to 0, and false positives from 5 to 0).
We have added multiprocessing to postprocess_variants, which reduces runtime from 48 minutes to 30 minutes for Illumina WGS and from 56 minutes to 33 minutes for PacBio.
We have added new models trained with Complete Genomics data, and added case studies.
We have added NovaSeqX to the training data for the WGS model.
We have migrated our training and inference platform from Slim to Keras.
Force calling with approximate phasing is now available.

We are sincerely grateful to

@wkwan and @paulinesho for their contributions to the move to Keras.
@lucasbrambrink for enabling multiprocessing in postprocess_variants.
@msamman, @akiraly1 for their contributions.
PacBio: William Rowell (@williamrowell), Nathaniel Echols for their feedback and testing.
UCSC: Benedict Paten (@benedictpaten), Shloka Negi (@shlokanegi), Jimin Park (@jimin001), Mobin Asri (@mobinasri) for the feedback.

v1.5.0

1 year ago

New model datatype: --model_type ONT_R104 is a new option. Starting from v1.5, DeepVariant natively supports ONT R10.4 simplex and duplex data.
- For older ONT chemistry, please continue to use PEPPER-Margin-DeepVariant.
Incorporated PacBio Revio training data in DeepVariant PacBio model. In our evaluations this single model performs well on both Sequel II and Revio datatypes. Please use DeepVariant v1.5 and later for Revio data.
Incorporated Element Biosciences data in WGS models. We found that we could jointly train a short-read WGS model with both Illumina and Element data. Inclusion of Element data improves accuracy on Element without negative effect on Illumina. Please use the WGS model for best results on either Illumina or Element data.
Added vg/Giraffe-mapped BAMs to DeepVariant WGS training data (alongside existing BWA). We observed that a single model can be trained for strong results with both BWA and vg/Giraffe.
Improved DeepVariant WES model for 100bps exome sequencing thanks to user-reported issues (including https://github.com/google/deepvariant/issues/586 and https://github.com/google/deepvariant/issues/592).
Thanks to Tong Zhu from Nvidia for his suggestion to improve the logic for shuffling reads.
Thanks to Doron Shem-Tov (@doron-st) and Ilya Soifer (@ilyasoifer) from Ultima Genomics for adding new functionalities enabled by flags --enable_joint_realignment and --p_error.
Thanks to Dennis Yelizarov for improving Google-internal infrastructure for running make_examples.
Updated TensorFlow version to 2.11.0. Updated htslib version to 1.13.

v1.4.0

1 year ago

Simplified DeepVariant PacBio by introducing direct phasing. This means PacBio users who run DeepVariant no longer need to run DeepVariant+WhatsHap+DeepVariant. See PacBio case study for more information.
For Illumina WGS and WES, we added read insert size (insert_size) as an additional feature. This reduces errors by 4-10% for the Illumina WGS and WES models. Thanks @lucasbrambrink for implementing this feature.
Reduced the runtime of the postprocess_variants step by 10-30%. Thanks @moshewagner for optimizing the code.
Included experimental code which explores use of Keras for model architecture. This is not used in production methods, but may be informative to developers seeking examples of Keras applied to similar problems. Thanks @wkwan and @paulinesho for their contributions.
We did not include OpenVINO by default in the Docker images we released. Users can still build their own Docker images with the option turned on as needed.
Updated 2022-10-17: We have released an Illumina RNA-seq model and added an RNA-seq case study.

v1.3.0

2 years ago

Improved the DeepTrio PacBio models on PacBio Sequel II Chemistry v2.2 by including this data in the training dataset.
Improved call_variants speed for PacBio models (both DeepVariant and DeepTrio) by reducing the default window width from 221 to 199, without tradeoff on accuracy. Thanks to @lucasbrambrink for conducting the experiments to find a better window width for PacBio.
Introduced a new flag --normalize_reads in make_examples, which normalizes Indel candidates at the read level. This flag is useful to reduce rare cases where an indel variant is not left-normalized. This feature is mainly relevant to joint calling of large cohorts, or cases where read mappings have been surjected from one reference to another. It is currently set to False by default. To enable it, add --normalize_reads=true directly to the make_examples binary. If you’re using the run_deepvariant one-step approach, add --make_examples_extra_args="normalize_reads=true". Currently we don’t recommend turning this flag on for long reads due to potential runtime increase.
Added an --aux_fields_to_keep flag to the make_examples step, and set the default to only the auxiliary fields that DeepVariant currently uses. This reduces memory use for input BAM files that have large auxiliary fields that aren’t used in variant calling. Thanks to @williamrowell and @rhallPB for reporting this issue.
Reduced the frequency of logging in make_examples as well as call_variants to address the issue reported in https://github.com/google/deepvariant/issues/491.
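
The --make_examples_extra_args value shown in the --normalize_reads note above takes the form flag=value. As a minimal sketch, assuming several such pairs are joined with commas, the string could be assembled like this in Python. The helper name format_extra_args is hypothetical and not part of DeepVariant.

```python
def format_extra_args(flags):
    """Join a dict of flag names and values into 'name=value,name=value'.

    Hypothetical helper for illustration; not part of DeepVariant.
    Booleans are lowercased to match the 'normalize_reads=true' form above.
    """
    parts = []
    for name, value in flags.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        parts.append(f"{name}={value}")
    return ",".join(parts)


extra = format_extra_args({"normalize_reads": True})
option = f'--make_examples_extra_args="{extra}"'
print(option)
```

The resulting option string can then be appended to a run_deepvariant command line.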

v1.2.0

2 years ago

The DeepVariant v1.2 release contains the following major improvements:

A major code refactor for make_examples better modularizes common components between DeepVariant, DeepTrio, and potential future applications. This enables DeepTrio to inherit improvements such as --add_hp_channel (introduced to the DeepVariant PacBio model in v1.1; see blog), improving DeepTrio’s PacBio accuracy.
The DeepVariant PacBio model has substantially improved accuracy for PacBio Sequel II Chemistry v2.2, achieved by including this data in the training dataset.
We updated several dependencies: Python version to 3.8, TensorFlow version to 2.5.0, and GPU support version to CUDA 11.3 and cuDNN 8.2. The greater computational efficiency of these dependencies results in improvements to speed.
In the "training" mode for make_examples, we committed a fix (https://github.com/google/deepvariant/commit/4a11046de0ad86e36d2514af9f035c9cb34414bf) for an issue introduced in an earlier commit (https://github.com/google/deepvariant/commit/a4a654769f1454ea487ebf0a32d45a9f8779617b) where make_examples might generate fewer REF (class0) examples than expected.
Improvements to accuracy for Illumina WGS models for various, shorter read lengths. Thanks to the following contributors and their teams for the idea:
- Dr. Masaru Koido (The University of Tokyo and RIKEN)
- Dr. Yoichiro Kamatani (The University of Tokyo and RIKEN)
- Mr. Kohei Tomizuka (RIKEN)
- Dr. Chikashi Terao (RIKEN)

Additional detail for improvements in DeepVariant v1.2:

Improvements for training:

We augmented the training data for Illumina WGS model by adding BAMs with trimmed reads (125bps and 100bps) to improve our model’s robustness on different read lengths.

Improvements for make_examples: For more details on flags, run /opt/deepvariant/bin/make_examples --help.

Major refactoring to ensure useful features (such as --add_hp_channel) can be shared between DeepVariant and DeepTrio make_examples.
Add MED_DP (median of DP) in the gVCF output. See this section for more details.
New --split_skip_reads flag: if True, make_examples will split reads with large SKIP cigar operations into individual reads. Resulting read parts that are less than 15 bp are filtered out.
We now sort the realigned BAM output mentioned in this section when you use --emit_realigned_reads=true --realigner_diagnostics=/output/realigned_reads for make_examples. You will still need to run samtools index to get the index file, but no longer need to sort the BAM.
Added an experimental prototype for multi-sample make_examples.
- This is an experimental prototype for working with multiple samples in DeepVariant, a proof of concept enabled by the refactoring to join together DeepVariant and DeepTrio, generalizing the functionality of make_examples to work with multiple samples. Usage information is in multisample_make_examples.py, but note that this is experimental.
Improved logic for read allele counts calculation for sites with low base quality indels, which resulted in Indel accuracy improvement for PacBio models.
Improvements to the realigner code to fix certain uncommon edge cases.
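
The --split_skip_reads behavior described above can be illustrated with a small Python sketch that works on CIGAR strings alone. This is not the make_examples implementation: the function name, the min_skip threshold for what counts as a "large" skip, and the CIGAR-only representation are all assumptions made for illustration. Only the 15 bp minimum part length comes from the notes above.

```python
import re

# CIGAR operations that consume bases of the read itself.
QUERY_CONSUMING = set("MIS=X")


def split_on_skips(cigar, min_skip=1, min_part_bp=15):
    """Split a CIGAR string at SKIP (N) operations.

    min_skip is a hypothetical threshold for a "large" skip; the notes
    above do not state the value make_examples uses.
    Parts with fewer than min_part_bp read bases are dropped.
    """
    ops = [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]
    parts, current = [], []
    for length, op in ops:
        if op == "N" and length >= min_skip:
            # Close the current part at the skip; the skip itself is dropped.
            parts.append(current)
            current = []
        else:
            current.append((length, op))
    parts.append(current)

    kept = []
    for part in parts:
        read_bp = sum(length for length, op in part if op in QUERY_CONSUMING)
        if read_bp >= min_part_bp:
            kept.append("".join(f"{length}{op}" for length, op in part))
    return kept


print(split_on_skips("50M1000N10M2000N40M"))  # ['50M', '40M']
```

In this example the middle 10 bp part falls below the 15 bp minimum and is filtered out, leaving two read parts.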

Improvements for the one-step run_deepvariant: For more details on flags, run /opt/deepvariant/bin/run_deepvariant --help.

New --runtime_report which enables runtime report output to --logging_dir. This makes it easier for users to get the runtime by region report for make_examples.
New --dry_run flag is now added for printing out all commands to be executed, without running them. This is mentioned in the Quick Start section.
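
As a rough sketch of how the two flags above might be combined, the Python snippet below assembles a run_deepvariant command that requests a runtime report and only prints the commands it would run. All file paths are placeholders, and the --output_vcf flag and the boolean flag syntax are assumptions not spelled out in these notes; check run_deepvariant --help for the exact form.

```python
import shlex


def build_run_deepvariant_command(ref, reads, model_type, output_vcf, logging_dir):
    """Assemble a run_deepvariant command line as a list of arguments.

    Paths passed in are placeholders chosen by the caller.
    """
    return [
        "/opt/deepvariant/bin/run_deepvariant",
        f"--model_type={model_type}",
        f"--ref={ref}",
        f"--reads={reads}",
        f"--output_vcf={output_vcf}",    # assumed flag, not named in these notes
        f"--logging_dir={logging_dir}",  # runtime report is written here
        "--runtime_report",
        "--dry_run=true",                # print commands without running them
    ]


cmd = build_run_deepvariant_command(
    ref="/input/ref.fasta",
    reads="/input/sample.bam",
    model_type="WGS",
    output_vcf="/output/sample.vcf.gz",
    logging_dir="/output/logs",
)
print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so it can be passed to subprocess.run without shell quoting issues.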

v1.1.0

3 years ago

The v1.1 release introduces DeepTrio, which uses a model specifically trained to call a mother-father-child trio or parent-child duo. DeepTrio has superior accuracy compared to DeepVariant. Pre-trained models are available for Illumina WGS, Illumina exome, and PacBio HiFi.

In addition, DeepVariant v1.1 contains the following improvements:

Accuracy improvements on PacBio, reducing Indel errors by ~21% on the case study. This is achieved by adding an input channel which specifically encodes haplotype information, as opposed to only sorting by haplotype in v1.0. The flag is --add_hp_channel which is enabled by default for PacBio.
Speed improvements for long read data by more efficient handling of long CIGAR strings.
New functionality to add detailed logs for runtime of make_examples by genomic region, viewable in an interactive visualization.
We now fully withhold HG003 from all training, and report all accuracy evaluations on HG003. We continue to withhold chromosome20 from training in all samples.

New optional flags to increase speed:

A team at Intel has adapted DeepVariant to use the OpenVINO toolkit, which further accelerates TensorFlow applications. This further speeds up the call_variants stage by ~25% for any model when run in CPU mode on an Intel machine. DeepVariant runs of OpenVINO have the same accuracy and are nearly identical to runs without. Runs with OpenVINO are fully reproducible on OpenVINO.

To use OpenVINO, add the following flag to the DeepVariant command:

--call_variants_extra_args "use_openvino=true"

We thank Intel for their contribution, and acknowledge the extensive work their team put in, captured in https://github.com/google/deepvariant/pull/363.

v1.0.0

3 years ago

DeepVariant v1.0 releases new features and accuracy improvements sufficiently substantial to indicate a major version of v1.0. Compared to DeepVariant v0.10, these changes reduce Illumina WGS errors by 24%, exome errors by 19%, and PacBio errors by 52%.

Added ALT-aligned pileups, which creates additional input channels where reads are also aligned to the candidate ALT alleles. This is controlled by the flag --alt_aligned_pileup. --alt_aligned_pileup=diff_channels is now default for DeepVariant PacBio model. This substantially improves INDEL accuracy for PacBio data.
Added new flag --sort_by_haplotypes to optionally allow creating pileup images with reads sorted by haplotype. Haplotype sorting is based on the HP tag that must be present in input BAM, and --parse_sam_aux_fields needs to be set as well. This substantially improves INDEL accuracy for PacBio data.
The PacBio case study now includes instructions for two-pass calling, which allows users to take advantage of the --sort_by_haplotypes by phasing variants and the input reads. Accuracy metrics for both single pass calling and two-pass calling are shown. Users may choose whether to run a second time for higher accuracy.
Default of --min_mapping_quality in make_examples.py changed from 10 to 5. This improves accuracy of all models (WGS, WES, and PACBIO).
Included a new hybrid Illumina+PacBio model and documentation.
Added show_examples, a tool for showing examples as pileup image files, with documentation.
Cleaned up unused experimental flags: --sequencing_type_image and --custom_pileup_image.
Added --only_keep_pass flag to postprocess_variants.py to optionally only keep PASS calls in output VCF.
Addressed GitHub issues:
- Fixed the binarize function in modelling.py. (https://github.com/google/deepvariant/issues/286 fixed in https://github.com/google/deepvariant/commit/db87d77)
- Fixed quoting issues for --regions when using run_deepvariant.py. (https://github.com/google/deepvariant/issues/305 fixed in https://github.com/google/deepvariant/commit/fbacd35)
- Added --version to run_deepvariant.py. (https://github.com/google/deepvariant/issues/332 fixed in https://github.com/google/deepvariant/commit/f101492)
- Added --sample_name flag to postprocess_variants.py and applied it in run_deepvariant.py as well. (https://github.com/google/deepvariant/issues/334 fixed in https://github.com/google/deepvariant/commit/a81d629)
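
The idea behind --sort_by_haplotypes, described above, can be sketched in a few lines of Python. This is an illustration only, not DeepVariant's pileup code: the read representation and the choice to place reads without an HP tag first (as haplotype 0) are assumptions.

```python
def sort_reads_by_haplotype(reads):
    """Stable-sort reads by HP tag, keeping the original order within a haplotype.

    Each read is a (name, tags) pair, where tags is a dict of SAM aux fields.
    Assumption: reads without an HP tag are grouped first as haplotype 0.
    """
    return sorted(reads, key=lambda read: read[1].get("HP", 0))


reads = [
    ("read1", {"HP": 2}),
    ("read2", {}),
    ("read3", {"HP": 1}),
    ("read4", {"HP": 2}),
    ("read5", {"HP": 1}),
]
for name, tags in sort_reads_by_haplotype(reads):
    print(name, tags.get("HP", 0))
```

Because Python's sort is stable, reads that share a haplotype keep their input order, so only the grouping of rows in the pileup changes.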

v0.10.0

4 years ago

Update to Python3 and TensorFlow2: We use Python3.6, and pin to TensorFlow 2.0.0.
Improved PacBio model for amplified libraries: the PacBio HiFi training data now includes amplified libraries at both standard and high coverages. This provides a substantial accuracy boost to variant detection from amplified HiFi data.
Turned off ws_use_window_selector_model by default: This flag was turned on by default in v0.7.0. After the discussion in issue #272, we decided to turn this off to improve consistency and accuracy, at the trade-off of a 7% increase in runtime of the make_examples step. Users may add --make_examples_extra_args "ws_use_window_selector_model=true" to save some runtime at the expense of accuracy.

v0.9.0

4 years ago

In the v0.9.0 release, we introduce best practices for merging DeepVariant samples.
Added visualizations of variant output for visual QC and inspection.
Improved Indel accuracy for WGS and WES (error reduction of 36% on the WGS case study) by reducing Indel candidate generation threshold to 0.06.
Improved WES model accuracy by expanding training regions with a 100bp buffer around capture regions and additional training at lower exome coverages.
Improved performance for new PacBio Sequel II chemistry and CCS v4 algorithm by training on additional data.
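
To make the 100bp buffer around capture regions concrete, here is a minimal Python sketch that pads BED-style intervals and merges any that overlap after padding. How the training pipeline actually applied the buffer is not described in these notes, so the function name and the merge step are assumptions.

```python
def pad_and_merge(regions, buffer_bp=100):
    """Pad (chrom, start, end) intervals by buffer_bp and merge overlaps.

    Coordinates are 0-based, half-open, as in BED. Starts are clamped at 0;
    clamping ends to chromosome lengths is omitted for brevity.
    """
    padded = sorted(
        (chrom, max(0, start - buffer_bp), end + buffer_bp)
        for chrom, start, end in regions
    )
    merged = []
    for chrom, start, end in padded:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps or touches the previous interval: extend it.
            last = merged[-1]
            merged[-1] = (chrom, last[1], max(last[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


capture = [("chr20", 50, 300), ("chr20", 450, 600), ("chr20", 5000, 5200)]
print(pad_and_merge(capture))
# [('chr20', 0, 700), ('chr20', 4900, 5300)]
```

The first two capture regions are 150 bp apart, so their padded versions overlap and collapse into a single training region.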

Full release notes:

New documentation:

Added a tutorial for merging WES trio.
- Added recommended GLnexus parameters for merging WGS and WES data (also available as built-in presets in GLnexus v1.2.2+).
Visualization functionality and documentation: VCF stats report.

Changes to Docker images, code, and models:

Docker images now live in Docker Hub google/deepvariant in addition to gcr.io/deepvariant-docker/deepvariant.
For WES, added 100bps buffer to the capture regions when creating training examples.
For WES, increased training examples with lower coverage exomes, down to 30x.
For PACBIO, added training data for Sequel II v2 chemistry and samples processed with CCS v4 algorithm.
Loosened the restriction that the BAM files need to have exactly one sample_name. Now if there are multiple samples in the header, the first one is used. If there are none, a default is used.
Changes in realigner code. Realigner aligns reads to haplotypes first and then realigns them to the reference. With this change some of the haplotypes (with not enough read support) are now discarded. This results in fewer reads needing to be realigned. Theoretically, this fix should improve FP rate. It also helps to resolve a GitHub issue.
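
The relaxed sample name rule above (use the first sample if several are present, fall back to a default if none) can be sketched in Python over the text of a SAM header. This is an illustration rather than DeepVariant's code, and the default value "default" is a hypothetical placeholder.

```python
DEFAULT_SAMPLE_NAME = "default"  # hypothetical placeholder value


def sample_name_from_header(header_text, default=DEFAULT_SAMPLE_NAME):
    """Return the first SM value found in @RG header lines, or the default."""
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # @RG fields are tab-separated TAG:VALUE pairs.
        for field in line.split("\t")[1:]:
            if field.startswith("SM:"):
                return field[3:]
    return default


header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@RG\tID:rg1\tSM:HG002\n"
    "@RG\tID:rg2\tSM:HG003\n"
)
print(sample_name_from_header(header))
print(sample_name_from_header("@HD\tVN:1.6\n"))
```

The first call picks the sample from the first read group, and the second falls back to the placeholder because the header has no @RG line.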

Changes to flags:

Added --sample_name flag to run_deepvariant.py.
Reduced default for vsc_min_fraction_indels to 0.06 for Illumina data (WGS and WES mode) which increases sensitivity.
Expanded the use of --reads to take multiple BAMs in a comma-separated list.
Use --ref for CRAM by default. (Set --use_ref_for_cram to true by default)
Added support for BAM output for realigner debugging. See --realigner_diagnostics and --emit_realigned_reads flags in realigner.py.
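
As an illustration of the vsc_min_fraction_indels change listed above, the Python sketch below checks whether an indel allele's supporting read fraction clears a threshold. It is a simplification under stated assumptions: the function name is hypothetical, and any other criteria make_examples applies when generating candidates are omitted.

```python
def passes_indel_fraction(alt_reads, total_reads, min_fraction=0.06):
    """Return True if the indel allele fraction meets the threshold.

    Simplified check for illustration; other candidate criteria are omitted.
    """
    if total_reads <= 0:
        return False
    return alt_reads / total_reads >= min_fraction


# 3 supporting reads out of 40 is a fraction of 0.075.
print(passes_indel_fraction(3, 40))                    # True at 0.06
print(passes_indel_fraction(3, 40, min_fraction=0.1))  # False at a stricter, hypothetical 0.1
```

Lowering the threshold admits more low-fraction indel candidates for the model to classify, which is how the change increases sensitivity.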