DeepVariant Versions

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.

v0.8.0

5 years ago

With the v0.8.0 release, we introduce a new DeepVariant model for PacBio CCS data. This model can be run in the same manner as the Illumina WGS and WES models. For more details, see our manuscript with PacBio and our blog post.

This release also includes general improvements to DeepVariant and the Illumina WGS and WES models. These include:

  • New script that lets users run DeepVariant in one command. See Quick Start and the sketch after this list.
  • Improved accuracy for NovaSeq samples, especially PCR-Free ones, achieved by adding NovaSeq samples to the training data. See DeepVariant training data.
  • Improved accuracy for low coverage (30x and below), achieved by training on a broader mix of downsampled data. See DeepVariant training data.
  • Overall speed improvements that reduce runtime by ~24% on the WGS case study:
    • Speed improvements in querying SAM files and doing calculations with Reads and Ranges.
    • Fewer unnecessary copies when constructing De Bruijn graphs.
    • Less memory usage when writing BED, FASTQ, GFF, SAM, and VCF files.
    • Speed improvements in postprocess_variants when creating gVCFs - achieved by combining writing and merging for both VCF and gVCF.
  • Improved support for CRAM files, allowing the use of a provided reference file instead of the embedded reference. See the use_ref_for_cram flag below.
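
A minimal sketch of the one-command run, assuming the Docker image from the Quick Start; the image tag, file paths, and shard count are placeholders, and the authoritative flag set is documented in the Quick Start:

```bash
# Sketch only: image tag, paths, and shard count are placeholders.
# --model_type also accepts WES, and PACBIO for the new CCS model.
docker run -v "${PWD}":/input google/deepvariant:0.8.0 \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WGS \
  --ref=/input/ref.fasta \
  --reads=/input/sample.bam \
  --output_vcf=/input/sample.vcf.gz \
  --num_shards=4
```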

New optional flags (a usage sketch follows the list):

  • make_examples.py
    • use_ref_for_cram: Default is False (using the embedded reference in the CRAM file). If set to True, --ref will be used as the reference instead. See CRAM support section for more details.
    • parse_sam_aux_fields and use_original_quality_scores: Option to read base quality scores from the OQ tag; to use it, set both flags to true. The standard GATK process includes a recalibration stage where base quality scores are recalibrated with dedicated software, but DeepVariant produces slightly better accuracy when the original scores are used. The original scores are usually stored in a BAM file under the optional OQ tag, and this feature allows quality scores to be read from the OQ tag instead of the QUAL field.
    • min_base_quality: Allows users to try different thresholds for the minimum base quality score.
    • min_mapping_quality: Allows users to try different thresholds for the minimum mapping quality score.
  • call_variants.py
    • config_string: Allows users to specify the estimator session configuration through a flag when running on CPU or GPU, thanks to the contribution of @A-Tsai from ATGENOMIX in #159.
    • num_mappers: Allows users to modify the number of dataset mappers through a flag, thanks to the contribution of @fo40225 from National Taiwan University Hospital in #152.
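
The sketch below shows how the new flags might be combined with a typical run; the standard flags (--mode, --ref, --reads, --examples, --checkpoint, --outfile), paths, and threshold values are illustrative, and only the flags listed above are new in this release:

```bash
# Sketch only: paths and threshold values are illustrative.
python make_examples.py \
  --mode calling \
  --ref ref.fasta \
  --reads sample.cram \
  --examples examples.tfrecord.gz \
  --use_ref_for_cram \
  --parse_sam_aux_fields \
  --use_original_quality_scores \
  --min_base_quality 10 \
  --min_mapping_quality 10

# --num_mappers tunes the number of dataset mappers; --config_string (not shown)
# takes an estimator session configuration as text.
python call_variants.py \
  --examples examples.tfrecord.gz \
  --checkpoint model.ckpt \
  --outfile call_variants_output.tfrecord.gz \
  --num_mappers 8
```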

v0.7.2

5 years ago
  • Htslib updated to v1.9, fixing an outstanding CRAM issue.
  • Fix for the issue of non-deterministic output caused by changing the number of shards in the make_examples process.
  • Upgrade to TensorFlow v1.12.
  • Speed improvements in make_examples via the use of a flat_hash_map.
  • Speed improvements in call_variants.
  • The genotypes of low-quality (GQ < 20) homozygous reference calls are set to ./. instead of 0/0. The threshold is configurable via the --cnn_homref_call_min_gq flag in postprocess_variants.py (see the sketch after this list). This improves downstream cohort merging performance, based on our internal investigation described in the "Improved non-human variant calling using species-specific DeepVariant models" blog post.
  • Google Cloud Runner:
    • Localize BED region files (given via --region flag), fixing an outstanding issue.
    • Make worker logs available in case of a failure inside DeepVariant.
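
A short sketch of adjusting the new homozygous-reference threshold in postprocess_variants.py; the input/output flags and paths are illustrative, and 20 matches the default described above:

```bash
# Sketch only: paths are illustrative; 20 is the default threshold.
python postprocess_variants.py \
  --ref ref.fasta \
  --infile call_variants_output.tfrecord.gz \
  --outfile sample.vcf.gz \
  --cnn_homref_call_min_gq 20
```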

v0.7.1

5 years ago
  • Fix for postprocess_variants: the previous version crashed if the first shard contained no records.
  • Update the TensorFlow version dependency to 1.11.
  • Added support to build on Ubuntu 18.04.
  • Documentation changes: Move the commands in the WGS and WES Case Studies into scripts under scripts/ to make them easier to run.
  • Google Cloud runner:
    • Added batch_size in case users need to change it for the call_variants step.
    • Added logging_interval_sec to control how often worker logs are written to Google Cloud Storage.
    • Improved the use of call_variants: only one call_variants process runs on each machine for better performance, which reduces GPU cost and speeds up the step.

v0.7.0

5 years ago

This release includes numerous performance improvements that collectively reduce the runtime of DeepVariant by about 65%.

A few highlighted changes in this release:

  • Update the TensorFlow version to 1.9, built by default with Intel MKL support, speeding up call_variants by more than 3x compared to v0.6.
  • The components that use TensorFlow (both inference and training) can now be run on Cloud TPUs.
  • Extensive optimizations in make_examples which result in significant runtime improvements. For example, make_examples now runs more than 3 times faster in the WGS case study than in v0.6.
    • New realigner implementation (fast_pass_aligner.cc) with parameters re-tuned using Vizier for better accuracy and performance.
    • Changed the window selector to use a linear decision model for choosing realignment candidates. This can be controlled by the --ws_use_window_selector_model flag, which is now on by default.
    • Many micro-optimizations throughout the codebase.
  • Added a new training case study showing how to train and fine-tune DeepVariant models.
  • Added support for CRAM files (see the usage sketch after this list).
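
A hedged sketch of the two user-visible changes above, CRAM input and the window-selector model flag; the standard flags and paths are illustrative:

```bash
# Sketch only: paths and standard flags are illustrative.
# CRAM files can now be passed to --reads directly; setting the flag to false
# falls back to the previous window-selection behavior.
python make_examples.py \
  --mode calling \
  --ref ref.fasta \
  --reads sample.cram \
  --examples examples.tfrecord.gz \
  --ws_use_window_selector_model=false
```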

v0.6.1

6 years ago
  • Update the build scripts and header files so that DeepVariant builds successfully on Debian.
  • Include a script that demonstrates how to build the CLIF binary we released.
  • Update the GCP runner's default number of cores.
  • Small code fix: Fix the call_variants issue of crashing on empty shards.

v0.6.0

6 years ago

This release includes a new WGS model with a major accuracy improvement on PCR+ data, and a new WES model with a minor accuracy improvement.

A few important changes in this release:

  1. Changes in the training data for the WGS model:
    • Addition:
      • 3 replicates of HG001 (PCR+, HiSeqX) provided by DNAnexus
      • 2 replicates of HG001 (PCR+, NovaSeq) from BaseSpace public data.
    • Removal:
      • WES data (in v0.5.0 we trained our WGS model with WGS+WES data; this time we found that the WES data didn't help WGS accuracy, so we removed it)
  2. Improved training data labels. See haplotype_labeler.py
  3. For direct inputs/outputs from cloud storage, we no longer support direct file I/O (like gs://deepvariant) due to bugs in htslib. Instead, we recommend using gcsfuse to read/write data directly on GCS buckets. See “Inputs and Outputs” in the DeepVariant user guide.
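
A minimal sketch of the recommended gcsfuse setup; the bucket name and mount point are placeholders:

```bash
# Sketch only: bucket name and mount point are placeholders.
mkdir -p /mnt/deepvariant-data
gcsfuse my-deepvariant-bucket /mnt/deepvariant-data
# DeepVariant can then read inputs and write outputs under /mnt/deepvariant-data
# instead of using gs:// paths directly.
```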

v0.5.2

6 years ago

This release is a bugfix release for gVCF creation. See https://github.com/google/deepvariant/issues/58 for details.

v0.5.1

6 years ago

This release fixes issue #27 and adds support for creating the MIN_DP field in gVCF records.

v0.5.0

6 years ago
  1. Release two separate models for calling genome and exome sequencing data. Significant improvement of Indel F1 on exome data.

    • On exome sequencing data (HG002):
      • Indel F1 0.936959 --> 0.961724; SNP F1 0.998636 --> 0.998962
    • On whole genome sequencing data (HG002):
      • Indel F1 0.996632 --> 0.996684; SNP F1 0.999495 --> 0.999542
  2. Provide capability to produce gVCF files as output from DeepVariant [doc]: gVCF files are required as input for analyses that create a set of variants in a cohort of individuals, such as cohort merging or joint genotyping.

  3. Training data: All models are trained with a benchmarking-compatible strategy: That is, we never train on any data from the HG002 sample, or from chromosome 20 from any sample.

    • Whole genome sequencing model: We used training data from both genome sequencing and exome sequencing.

      • WGS data:
        • HG001: 1 from PrecisionFDA, and 8 replicates from Verily.
        • HG005: 2 from Verily.
      • WES data:
        • HG001: 11 HiSeq2500, 17 HiSeq4000, 50 NovaSeq.
        • HG005: 1 from Oslo University.

      In order to increase the diversity of the training data, we also used the downsample_fraction flag when making training examples (see the sketch after this list).

    • Whole exome sequencing model: We started from a trained WGS model as a checkpoint, then continued training only on the WES data above. We also used various downsample fractions for the training data.

  4. DeepVariant now provides deterministic output by rounding QUAL field to one digit past the decimal when writing to VCF.

  5. Update the model input data representation from 7 channels to 6.

    • Removal of "Op-Len" (CIGAR operation length) as a model feature. In our tests this makes the model more robust to input that has different read lengths.
    • Added an example for visualizing examples.
  6. Add a post-processing step to variant calls to eliminate rare inconsistent haplotypes [description].

  7. Expand the excluded contigs list to include common problematic contigs on GRCh38 [GitHub issue].

  8. It is now possible to run DeepVariant workflows on GCP with pre-emptible GPUs.
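
A hedged sketch of using downsample_fraction when generating training examples, as mentioned in item 3 above; the training-mode flags (--mode training, --truth_variants, --confident_regions), paths, and the fraction value are illustrative:

```bash
# Sketch only: paths and the 0.5 fraction are illustrative.
python make_examples.py \
  --mode training \
  --ref ref.fasta \
  --reads sample.bam \
  --truth_variants truth.vcf.gz \
  --confident_regions confident.bed \
  --examples training_examples.tfrecord.gz \
  --downsample_fraction 0.5
```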

v0.4.1

6 years ago

This fixes a problem with htslib_gcp_oauth when network access is unavailable.