DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
With the v0.8.0 release, we introduce a new DeepVariant model for PacBio CCS data. This model can be run in the same manner as the Illumina WGS and WES models. For more details, see our manuscript with PacBio and our blog post.
This release also includes general improvements to DeepVariant and the Illumina WGS and WES models. These include:
… use_ref_for_cram flag below.
New optional flags:
make_examples.py
use_ref_for_cram: Default is False (uses the reference embedded in the CRAM file). If set to True, --ref is used as the reference instead. See the CRAM support section for more details.
parse_sam_aux_fields and use_original_quality_scores: Option to read base quality scores from the OQ tag. To use this option, set both flags to true. The standard GATK process includes a recalibration stage in which base quality scores are recalibrated using special software. DeepVariant is slightly more accurate when the original scores are used. The original scores are usually stored in the BAM file under the optional OQ tag. This feature reads quality scores from the OQ tag instead of the QUAL field.
min_base_quality: Allows users to try different thresholds for the minimum base quality score.
min_mapping_quality: Allows users to try different thresholds for the minimum mapping quality score.
call_variants.py
config_string: Allows users to specify the estimator session configuration through a flag when running on CPU and GPU, thanks to the contribution of @A-Tsai from ATGENOMIX in #159.
num_mappers: Allows users to modify the number of dataset mappers through a flag, thanks to the contribution of @fo40225 from National Taiwan University Hospital in #152.
… ./. instead of 0/0. The threshold is configurable via the --cnn_homref_call_min_gq flag in postprocess_variants.py. This improves downstream cohort merging performance based on our internal investigation in the "Improved non-human variant calling using species-specific DeepVariant models" blog post.
… batch_size in case users need to change it for the call_variants step.
… logging_interval_sec to control how often worker logs are written to Google Cloud Storage.
call_variants: only one call_variants is run on each machine for better performance. This improved GPU cost and speed.
This release includes numerous performance improvements that collectively reduce the runtime of DeepVariant by about 65%.
A few highlighted changes in this release:
… call_variants runtime by more than 3x compared to v0.6.
… make_examples which result in significant runtime improvements. For example, make_examples now runs more than 3 times faster in the WGS case study than in v0.6.
… --ws_use_window_selector_model which is now on by default.
This release has a new WGS model with a major accuracy improvement on PCR+ data. We also released a new WES model with a minor accuracy improvement.
A few important changes in this release:
This release is a bugfix release for gVCF creation. See https://github.com/google/deepvariant/issues/58 for details.
This release fixes issue #27 and adds support for creating the MIN_DP field in gVCF records.
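As a rough illustration of what MIN_DP summarizes: a gVCF reference block collapses a run of adjacent positions into a single record, and MIN_DP reports the lowest read depth seen anywhere in that run. The sketch below is not DeepVariant's code. The function name `summarize_block` and the dictionary layout are hypothetical, and the caller is assumed to have already decided which consecutive positions belong to one block.

```python
from typing import Dict, List, Tuple


def summarize_block(positions: List[Tuple[int, int]]) -> Dict[str, int]:
    """Collapse consecutive (position, depth) pairs into one block summary.

    Hypothetical helper for illustration only: the input is assumed to be a
    non-empty, position-ordered run that forms a single gVCF block.
    """
    if not positions:
        raise ValueError("a block needs at least one position")
    coords = [pos for pos, _ in positions]
    depths = [depth for _, depth in positions]
    return {
        "start": coords[0],
        "end": coords[-1],
        # MIN_DP is the smallest depth observed across the whole block.
        "MIN_DP": min(depths),
    }


block = summarize_block([(1000, 31), (1001, 28), (1002, 35)])
print(block)
```

A downstream tool can use MIN_DP as a conservative depth for every position inside the block, since no position in the block had fewer reads.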
Release two separate models for calling genome and exome sequencing data. Significant improvement of Indel F1 on exome data.
Provide capability to produce gVCF files as output from DeepVariant. gVCF files are required as input for analyses that create a set of variants in a cohort of individuals, such as cohort merging or joint genotyping.
Training data: All models are trained with a benchmarking-compatible strategy: we never train on any data from the HG002 sample, or on chromosome 20 from any sample.
Whole genome sequencing model: We used training data from both genome sequencing and exome sequencing. To increase the diversity of the training data, we also used the downsample_fraction flag when making training examples.
Whole exome sequencing model: We started from a trained WGS model as a checkpoint, then continued training only on the WES data above. We also used various downsample fractions for the training data.
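The two ideas in the training-data notes above, holding out HG002 and chromosome 20, and downsampling reads to diversify coverage, can be sketched in a few lines of Python. This is only an illustration of the strategy, not DeepVariant's implementation: `usable_for_training`, `downsample`, and the contig spellings in `HELD_OUT_CONTIGS` are assumptions introduced here, and the real downsample_fraction flag may select reads differently.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")

HELD_OUT_SAMPLE = "HG002"
# Assumption: chromosome 20 may be spelled with or without the "chr" prefix.
HELD_OUT_CONTIGS = {"chr20", "20"}


def usable_for_training(sample: str, contig: str) -> bool:
    """Return True only for data outside the benchmarking hold-out set."""
    return sample != HELD_OUT_SAMPLE and contig not in HELD_OUT_CONTIGS


def downsample(reads: Sequence[T], fraction: float, seed: int = 0) -> List[T]:
    """Keep each read independently with probability `fraction`."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # seeded so the same subset is drawn each time
    return [read for read in reads if rng.random() < fraction]


print(usable_for_training("HG001", "chr1"))
print(usable_for_training("HG002", "chr1"))
print(len(downsample(range(1000), 0.5)))
```

Keeping the hold-out rule in one predicate makes it easy to check that no evaluation data leaks into training, whichever model (WGS or WES) is being trained.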
DeepVariant now provides deterministic output by rounding QUAL field to one digit past the decimal when writing to VCF.
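A minimal sketch of why rounding helps, assuming the run-to-run differences are small floating-point noise in the low-order digits (the release note does not state the cause). The helper name `format_qual` is hypothetical and is not taken from the DeepVariant code base.

```python
def format_qual(qual: float) -> str:
    """Render a QUAL value with exactly one digit after the decimal point."""
    return f"{qual:.1f}"


# Two values that differ only in low-order digits produce the same VCF text.
run_a = format_qual(37.123456)
run_b = format_qual(37.123457)
print(run_a, run_b, run_a == run_b)
```

Values that sit almost exactly on a rounding boundary could still land on different sides, so rounding reduces this kind of noise rather than ruling it out in principle.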
Update the model input data representation from 7 channels to 6.
Add a post-processing step to variant calls to eliminate rare inconsistent haplotypes.
Expand the excluded contigs list to include common problematic contigs on GRCh38.
It is now possible to run DeepVariant workflows on GCP with pre-emptible GPUs.
This fixes a problem with htslib_gcp_oauth when network access is unavailable.