DNA Seq Analysis

DNA sequencing analysis notes from Ming Tang

DNA-seq

Databases for variants

Important paper: DNA damage is a major cause of sequencing errors, directly confounding variant identification

However, in this study we show that false positive variants can account for more than 70% of identified somatic variations, rendering conventional detection methods inadequate for accurate determination of low allelic variants. Interestingly, these false positive variants primarily originate from mutagenic DNA damage which directly confounds determination of genuine somatic mutations. Furthermore, we developed and validated a simple metric to measure mutagenic DNA damage and demonstrated that mutagenic DNA damage is the leading cause of sequencing errors in widely-used resources including the 1000 Genomes Project and The Cancer Genome Atlas.

Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects

How to represent sequence variants

Sequence Variant Nomenclature from Human Genome Variation Society

dbSNP IDs are not unique?

Oh God, why are people still using dbSNP IDs as though they're unique identifiers?
— Daniel MacArthur (@dgmacarthur) July 27, 2016

The Evolving Utility of dbSNP

See a post: dbSNP (build 147) exceeds a ridiculous 150 million variants

In the early days of next-generation sequencing, dbSNP provided a vital discriminatory tool. In exome sequencing studies of Mendelian disorders, any variant already present in dbSNP was usually common, and therefore unlikely to cause rare genetic diseases. Some of the first high-profile disease gene studies therefore used dbSNP as a filter. Similarly, in cancer genomics, a candidate somatic mutation observed at the position of a known polymorphism typically indicated a germline variant that was under-called in the normal sample. Again, dbSNP provided an important filter.

Now, the presence or absence of a variant in dbSNP carries very little meaning. The database includes over 100,000 variants from disease mutation databases such as OMIM or HGMD. It also contains some appreciable number of somatic mutations that were submitted there before databases like COSMIC became available. And, like any biological database, dbSNP undoubtedly includes false positives.

Thus, while the mere presence of a variant in dbSNP is a blunt tool for variant filtering, dbSNP’s deep allele frequency data make it incredibly powerful for genetics studies: it can rule out variants that are too prevalent to be disease-causing, and prioritize ones that are rarely observed in human populations. This discriminatory power will only increase as ambitious large-scale sequencing projects like CCDG make their data publicly available.

Tips and lessons learned during my DNA-seq data analysis journey.

Allele frequency (AF)
Allele frequency, or gene frequency, is the proportion of a particular allele (variant of a gene) among all allele copies being considered. It can be formally defined as the percentage of all alleles at a given locus on a chromosome in a population gene pool represented by a particular allele. AF is affected by copy-number variation, which is common in cancers. Tools such as PyClone take tumor purity and copy-number data into account to calculate cancer cell fractions (CCFs).
"For SNVs, we are interested in genotype 0/1, 1/1 for tumor and 0/0 for normal. 1/1 genotype is very rare. It requires the same mutation occurs at the same place in two sister chromosomes which is very rare. One possible way to get 1/1 is deletion of one chromosome and duplication of the mutated chromosome". Quote from Siyuan Zheng.
"Mutect analysis on the TCGA samples finds around 5000 ~ 8000 SNVs per sample." Quote from Siyuan Zheng.
Cell lines might be contaminated or mislabeled. See The Great Big Clean-Up.
Tumor samples are not pure; you will always have stromal cells and infiltrating immune cells in the tumor bulk. Keep this in mind when you analyze the data.
Beware the devil of 0-based and 1-based coordinate systems! Make sure you know which system your file is using:

credit from Vince Buffalo. Also, read this post and this post
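A minimal Python sketch of the conversion, assuming the usual conventions: BED intervals are 0-based and half-open, while VCF and GFF positions are 1-based and closed. The function names are made up for illustration.

```python
def bed_to_one_based(start: int, end: int) -> tuple[int, int]:
    """0-based half-open [start, end) -> 1-based closed [start, end]."""
    return start + 1, end


def one_based_to_bed(start: int, end: int) -> tuple[int, int]:
    """1-based closed [start, end] -> 0-based half-open [start, end)."""
    return start - 1, end


def length_bed(start: int, end: int) -> int:
    """Number of bases in a 0-based half-open interval."""
    return end - start


def length_one_based(start: int, end: int) -> int:
    """Number of bases in a 1-based closed interval."""
    return end - start + 1


# The first 100 bases of a chromosome in both systems.
bed_interval = (0, 100)
one_based_interval = bed_to_one_based(*bed_interval)
print(bed_interval, one_based_interval)
```

Only the start changes between the two systems; the end stays the same, which is why off-by-one bugs are so easy to miss.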

Also read The UCSC Genome Browser Coordinate Counting Systems
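Coming back to the allele frequency and tumor purity tips above: here is a rough sketch of how a variant allele frequency (VAF) can be turned into a point estimate of the cancer cell fraction. It uses the commonly cited relation VAF = purity x multiplicity x CCF / (purity x tumor copy number + (1 - purity) x normal copy number). This is a simplification and is not how PyClone works internally (PyClone fits a Bayesian clustering model); the function and parameter names are my own.

```python
def variant_allele_frequency(alt_reads: int, ref_reads: int) -> float:
    """Fraction of reads supporting the alternate allele."""
    depth = alt_reads + ref_reads
    if depth == 0:
        raise ValueError("no reads cover this site")
    return alt_reads / depth


def cancer_cell_fraction(vaf: float, purity: float, tumor_cn: int = 2,
                         normal_cn: int = 2, multiplicity: int = 1) -> float:
    """Point estimate of the fraction of cancer cells carrying the mutation.

    purity       : fraction of cells in the sample that are tumor cells
    tumor_cn     : total copy number at the locus in tumor cells
    normal_cn    : total copy number at the locus in normal cells
    multiplicity : number of mutated copies per mutated tumor cell (assumed)
    """
    if not 0 < purity <= 1:
        raise ValueError("purity must be in (0, 1]")
    if multiplicity < 1:
        raise ValueError("multiplicity must be at least 1")
    average_copies = purity * tumor_cn + (1 - purity) * normal_cn
    ccf = vaf * average_copies / (purity * multiplicity)
    return min(ccf, 1.0)  # sampling noise can push the estimate above 1


vaf = variant_allele_frequency(alt_reads=30, ref_reads=70)
print(vaf, cancer_cell_fraction(vaf, purity=0.6))
```

In a diploid region, a clonal heterozygous mutation in a 60% pure tumor is expected at a VAF of about 0.3, not 0.5, which is why raw VAFs from impure samples should not be compared directly.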

Which human reference genome to use? by Heng Li

TL;DR: If you map reads to GRCh37 or hg19, use hs37-1kg:

ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz

If you map to GRCh37 and believe decoy sequences help with better variant calling, use hs37d5:

ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz

If you map reads to GRCh38 or hg38, use the following:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
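If you want this choice encoded in a pipeline config, a small lookup like the following could work. The URLs are the ones listed above; the function name, the dictionary keys and the decision to treat hg19 the same as GRCh37 for the decoy option are my own assumptions.

```python
# URLs copied from the TL;DR above; the keys are just labels for this sketch.
REFERENCE_FASTA = {
    "hs37-1kg": "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz",
    "hs37d5": "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz",
    "GRCh38_no_alt": "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz",
}


def pick_reference(build: str, use_decoy: bool = False) -> str:
    """Return the recommended FASTA URL for a genome build name."""
    name = build.lower()
    if name in {"grch37", "hg19"}:
        # Assumption: the decoy option is offered for both GRCh37 and hg19.
        return REFERENCE_FASTA["hs37d5" if use_decoy else "hs37-1kg"]
    if name in {"grch38", "hg38"}:
        # use_decoy is ignored here; only one GRCh38 file is recommended above.
        return REFERENCE_FASTA["GRCh38_no_alt"]
    raise ValueError(f"unknown genome build: {build}")


print(pick_reference("GRCh37", use_decoy=True))
```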

Reference Genome Components by GATK team.
Human genome reference builds - GRCh38/hg38 - b37 - hg19 by GATK team.

get the reference files and mapping index programmatically

Go Get Data from Aaron's lab.
Refgenie: a reference genome resource manager
genomepy

some useful tools for preprocessing

FastqPuri fastq quality assessment and filtering tool.
fastp A tool designed to provide fast all-in-one preprocessing for FastQ files. This tool is developed in C++ with multithreading supported to afford high performance. really promising, take a look!
A new tool bazam A read extraction and realignment tool for next generation sequencing data. Take a look!
bwa-mem2 gives exactly the same results as bwa-mem, 80% faster!

check sample swapping

somalier sample-swap checking directly on BAMs/CRAMs for cancer data

Mutation caller, structural variant caller

Sarek, a nextflow pipeline for variant calling
paper Making the difference: integrating structural variation detection tools
Mapping and characterization of structural variation in 17,795 deeply sequenced human genomes
GATK HaplotypeCaller Analysis of BWA (mem) mapped Illumina reads
NGS-DNASeq_GATK-session.pdf
GATK pipeline
An ensemble approach to accurately detect somatic mutations using SomaticSeq tool github page
A synthetic-diploid benchmark for accurate variant-calling evaluation A benchmark dataset from Heng Li. github repo
Strelka2: fast and accurate calling of germline and somatic variants paper: https://www.nature.com/articles/s41592-018-0051-x
lancet is a somatic variant caller (SNVs and indels) for short read data. Lancet uses a localized micro-assembly strategy to detect somatic mutation with high sensitivity and accuracy on a tumor/normal pair. paper: https://www.nature.com/articles/s42003-018-0023-9
needlestack an ultra-sensitive variant caller for multi-sample next generation sequencing data. This tool seems to be very useful for multi-region tumor sample analysis. paper
PerSVade: personalized structural variant detection in any species of interest
lumpy
wham
SV-Bay
Delly
Delly2

Delly is the best sv caller in the DREAM challenge https://www.synapse.org/#!Synapse:syn312572/wiki/70726

SV caller benchmark
Comprehensive evaluation of structural variation detection algorithms for whole genome sequencing https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1720-5
Comprehensive evaluation and characterisation of short read general-purpose structural variant calling software
Genotyping structural variants in pangenome graphs using the vg toolkit https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-1949-z
SVAFotate Annotate a (lumpy) structural variant (SV) VCF with allele frequencies (AFs) from large population SV cohorts (currently CCDG and/or gnomAD) with a simple command line tool.
Comprehensively benchmarking applications for detecting copy number variation Our results show that the sequencing depth can strongly affect CNV detection. Among the ten applications benchmarked, LUMPY performs best for both high sensitivity and specificity for each sequencing depth.
minigraph from Heng Li to call complex SVs.
Parliament2: Accurate structural variant calling at scale. by Fritz group in BCM. https://academic.oup.com/gigascience/article/9/12/giaa145/6042728
Brent Pedersen works on smoove, which improves upon lumpy.
COSMOS: Somatic Large Structural Variation Detector
Fusion And Chromosomal Translocation Enumeration and Recovery Algorithm (FACTERA)
VarDict: a novel and versatile variant caller for next-generation sequencing in cancer research. "We demonstrated that VarDict has improved sensitivity over Manta and equivalent sensitivity to Lumpy. SNP call rates are on par with MuTect, and VarDict is more sensitive and precise than Scalpel and other callers for insertions and deletions." See a post by Brad Chapman. Looks very promising.
Weaver: Allele-Specific Quantification of Structural Variations in Cancer Genomes. Paper
SVScore: An Impact Prediction Tool For Structural Variation
Prioritisation of Structural Variant Calls in Cancer Genomes simple_sv_annotation.py to annotate Lumpy and Manta SV calls.
Genome-wide profiling of heritable and de novo STR variations short tandem repeats.

SNV filtering

paper: Using high-resolution variant frequencies to empower clinical genome interpretation shiny App

Whole exome and genome sequencing have transformed the discovery of genetic variants that cause human Mendelian disease, but discriminating pathogenic from benign variants remains a daunting challenge. Rarity is recognised as a necessary, although not sufficient, criterion for pathogenicity, but frequency cutoffs used in Mendelian analysis are often arbitrary and overly lenient. Recent very large reference datasets, such as the Exome Aggregation Consortium (ExAC), provide an unprecedented opportunity to obtain robust frequency estimates even for very rare variants. Here we present a statistical framework for the frequency-based filtering of candidate disease-causing variants, accounting for disease prevalence, genetic and allelic heterogeneity, inheritance mode, penetrance, and sampling variance in reference datasets.
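To make the idea concrete, here is a sketch of the dominant-inheritance case as I understand the framework: the maximum credible population allele frequency is prevalence x maximum allelic contribution / (2 x penetrance), and the largest allele count still compatible with that frequency in a reference dataset is taken from a one-sided Poisson quantile. The function names and the example inputs are illustrative assumptions, not values from this README, so check the paper before relying on them.

```python
import math


def max_credible_af_dominant(prevalence: float,
                             max_allelic_contribution: float,
                             penetrance: float) -> float:
    """Upper bound on the population AF of a dominant disease-causing variant.

    max_allelic_contribution: largest share of cases explained by one variant.
    The factor 2 converts affected individuals to chromosomes.
    """
    return prevalence * max_allelic_contribution / (2 * penetrance)


def max_tolerated_allele_count(max_af: float, allele_number: int,
                               confidence: float = 0.95) -> int:
    """Smallest k with Poisson CDF(k; max_af * allele_number) >= confidence."""
    lam = max_af * allele_number
    if lam > 700:
        raise ValueError("lambda too large for this simple summation")
    term = math.exp(-lam)  # P(X = 0)
    cumulative = term
    k = 0
    while cumulative < confidence:
        k += 1
        term *= lam / k  # P(X = k) from P(X = k - 1)
        cumulative += term
    return k


# Illustrative inputs: a disease affecting 1 in 500 people, where no single
# variant explains more than 2% of cases, with 50% penetrance.
max_af = max_credible_af_dominant(1 / 500, 0.02, 0.5)
print(max_af, max_tolerated_allele_count(max_af, allele_number=121_412))
```

A candidate variant seen more often than the tolerated allele count in the reference dataset is too common to cause the disease under these assumptions.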

a new database called dbDSM A database of Deleterious Synonymous Mutation, a continually updated database that collects, curates and manages available human disease-related SM data obtained from published literature.
LncVar: a database of genetic variation associated with long non-coding genes

Annotation of the variants

Manual review of the called variants in IGV

Third generation sequencing for Structural variants (works on short reads as well!)

beautiful “Ribbon” viewer to visualize complicated SVs revealed by PacBio reads github page
Sniffles is a structural variation caller using third generation sequencing (PacBio or Oxford Nanopore). It detects all types of SVs using evidence from split-read alignments, high-mismatch regions, and coverage analysis.
splitThreader for visualizing structural variants. Finally a good visualizer!
New Genome Browser (NGB): a Web-based NGS data viewer with unique Structural Variations (SVs) visualization capabilities, high performance, scalability, and cloud data support. Looks very promising.

tools useful for everyday bioinformatics

bedtools one must know how to use it!
bedops as useful as bedtools.
valr provides tools to read and manipulate genome intervals and signals. (dplyr friendly!)
tidygenomics similar to GRanges but operate on dataframes!
InteractionSet useful for Hi-C, ChIA-PET. I used it for breakpoint clustering for structural variants
Paired Genomic Loci Tool Suite pgltools intersect can do breakpoint merging.
svtools Tools for processing and analyzing structural variants.
sveval Functions to compare an SV call set against a truth set.
Teaser A tool to benchmark mappers and different parameters within minutes.
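To give a feel for what comparing an SV call set against a truth set involves, here is a dependency-free sketch based on reciprocal overlap, a matching criterion often used in SV benchmarks. It is not the algorithm of sveval, svtools or any other tool listed here; the `SV` record, the 50% threshold and the greedy one-to-one matching are all assumptions for illustration.

```python
from typing import NamedTuple


class SV(NamedTuple):
    chrom: str
    start: int  # 0-based, half-open coordinates assumed
    end: int
    svtype: str


def reciprocal_overlap(a: SV, b: SV) -> float:
    """Overlap length divided by the longer of the two intervals."""
    if a.chrom != b.chrom:
        return 0.0
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a.end - a.start), overlap / (b.end - b.start))


def compare_to_truth(calls: list[SV], truth: list[SV],
                     min_ro: float = 0.5) -> tuple[float, float]:
    """Greedy one-to-one matching; returns (precision, recall)."""
    used = set()
    true_positives = 0
    for call in calls:
        for i, known in enumerate(truth):
            if i in used or call.svtype != known.svtype:
                continue
            if reciprocal_overlap(call, known) >= min_ro:
                used.add(i)
                true_positives += 1
                break
    precision = true_positives / len(calls) if calls else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


truth_set = [SV("chr1", 1000, 2000, "DEL"), SV("chr2", 5000, 9000, "DUP")]
call_set = [SV("chr1", 1100, 2100, "DEL"), SV("chr3", 100, 500, "DEL")]
print(compare_to_truth(call_set, truth_set))
```

Zero-length events such as insertions never pass an overlap test like this one, so real tools handle them separately (for example by breakpoint distance).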

A series of posts from Brad Chapman

Copy number variants

Interactive analysis and assessment of single-cell copy-number variations: Ginkgo
Copynumber Viewer
paper: Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives
bioconductor copy number work flow
paper: Assessing the reproducibility of exome copy number variations predictions
CNVkit A command-line toolkit and Python library for detecting copy number variants and alterations genome-wide from targeted DNA sequencing.
SavvyCNV: genome-wide CNV calling from off-target reads
dryclean Robust foreground detection in somatic copy number data https://www.biorxiv.org/content/10.1101/847681v2

Tools for visualization

New app gene.iobio
App here. I will definitely give it a try.
ASCIIGenome is a command-line genome browser running from terminal window and solely based on ASCII characters. Since ASCIIGenome does not require a graphical interface it is particularly useful for quickly visualizing genomic data on remote servers. The idea is to make ASCIIGenome the Vim of genome viewers.

Tools for vcf files

tools for pedigree files. It can determine sex from PED and VCF files. Developed by Brent Pedersen. I really like tools from Aaron Quinlan's lab.
cyvcf2 is a cython wrapper around htslib built for fast parsing of Variant Call Format (VCF) files
PyVCF - A Variant Call Format Parser for Python
VcfR: an R package to manipulate and visualize VCF format data
Varapp is an application to filter genetic variants, with a reactive graphical user interface. Powered by GEMINI.
varmatch: robust matching of small variant datasets using flexible scoring schemes
vcf-validator validate your VCF files!
BrowseVCF: a web-based application and workflow to quickly prioritize disease-causative variants in VCF files
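For real data, use one of the parsers above. Just to show what the genotype rule from the tips section (normal 0/0, tumor 0/1 or 1/1) looks like on raw VCF lines, here is a plain-Python sketch. It assumes a two-sample VCF with the normal sample in the first sample column and the tumor in the second, and that the caller writes a GT field for both; neither is guaranteed for every somatic caller.

```python
def genotype(sample_field: str, format_keys: list[str]) -> str:
    """Extract GT from one sample column, ignoring phasing and allele order."""
    values = dict(zip(format_keys, sample_field.split(":")))
    alleles = values.get("GT", "./.").replace("|", "/").split("/")
    return "/".join(sorted(alleles))


def somatic_candidates(vcf_lines, normal_index: int = 0, tumor_index: int = 1):
    """Yield (chrom, pos, ref, alt) where normal is 0/0 and tumor is 0/1 or 1/1."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and the header line
        fields = line.rstrip("\n").split("\t")
        format_keys = fields[8].split(":")
        samples = fields[9:]
        normal_gt = genotype(samples[normal_index], format_keys)
        tumor_gt = genotype(samples[tumor_index], format_keys)
        if normal_gt == "0/0" and tumor_gt in {"0/1", "1/1"}:
            yield fields[0], int(fields[1]), fields[3], fields[4]


# Made-up records for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNORMAL\tTUMOR",
    "chr1\t12345\t.\tC\tT\t.\tPASS\t.\tGT:AD\t0/0:30,0\t0/1:20,10",
    "chr1\t22222\t.\tG\tA\t.\tPASS\t.\tGT:AD\t0/1:15,14\t0/1:18,16",
]
print(list(somatic_candidates(example)))
```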

mutation signature

signeR
deconstructSigs
MutationalPatterns
sigminer: an easy-to-use and scalable toolkit for genomic alteration signature analysis and visualization in R

Tools for MAF files

TCGA has all the variants calls in MAF format. Please read a post by Cyriac Kandoth.

convert vcf to MAF: perl script by Cyriac Kandoth.
Once converted to MAF, one can use maftools for visualization: oncoplots (wrapping ComplexHeatmap), lollipop plots, mutational signatures, etc. Very cool, I just found it...
MutationalPatterns: an integrative R package for studying patterns in base substitution catalogues

Tools for bam files

VariantBam: Filtering and profiling of next-generational sequencing data using region-specific rules

Annotate and explore variants

Variant Effect Predictor: VEP
SNPEFF
vcfanno
myvariant.info tutorial
FunSeq2- A flexible framework to prioritize regulatory mutations from cancer genome sequencing
ClinVar
ExAC
vcf2db and GEMINI: a flexible framework for exploring genome variation from Quinlan lab.

Plotting

1. oncoprint
2. deconstructSigs aims to determine the contribution of known mutational processes to a tumor sample. By using deconstructSigs, one can: determine the weights of each mutational signature contributing to an individual tumor sample; plot the reconstructed mutational profile (using the calculated weights) and compare it to the original input sample.
3. Fast Principal Component Analysis of Large-Scale Genome-Wide Data
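The "reconstruct and compare" step is easy to picture in a few lines of Python. This is only a toy illustration, not the deconstructSigs API (which is an R package): the two signatures below are invented, use 6 channels instead of the usual 96 trinucleotide contexts, and the weights are given rather than fitted.

```python
import math


def reconstruct_profile(signatures: dict[str, list[float]],
                        weights: dict[str, float]) -> list[float]:
    """Weighted sum of signature profiles."""
    n_channels = len(next(iter(signatures.values())))
    profile = [0.0] * n_channels
    for name, weight in weights.items():
        for i, value in enumerate(signatures[name]):
            profile[i] += weight * value
    return profile


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """1.0 means the two profiles have the same shape."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical signatures over six substitution classes; each sums to 1.
toy_signatures = {
    "SigA": [0.5, 0.1, 0.1, 0.1, 0.1, 0.1],
    "SigB": [0.1, 0.1, 0.5, 0.1, 0.1, 0.1],
}
toy_weights = {"SigA": 0.7, "SigB": 0.3}
observed = [0.40, 0.10, 0.20, 0.10, 0.10, 0.10]  # made-up tumor profile

reconstructed = reconstruct_profile(toy_signatures, toy_weights)
print(reconstructed, cosine_similarity(observed, reconstructed))
```

A cosine similarity close to 1 suggests the chosen signatures and weights explain the sample well; a low value hints at a missing signature or a poor fit.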

Identify driver genes

MUFFINN: cancer gene discovery via network analysis of somatic mutation data

Intra-tumor heterogeneity

ESTIMATE
ABSOLUTE
THetA: Tumor Heterogeneity Analysis is an algorithm that estimates the tumor purity and clonal/subclonal copy number aberrations directly from high-throughput DNA sequencing data. The latest release is called THetA2 and includes a number of improvements over previous versions.
CIBERSORT is an analytical tool developed by Newman et al. to provide an estimation of the abundances of member cell types in a mixed cell population, using gene expression data
xcell is a webtool that performs cell type enrichment analysis from gene expression data for 64 immune and stroma cell types. xCell is a gene signatures-based method learned from thousands of pure cell types from various sources.
paper: Digitally deconvolving the tumor microenvironment
Comprehensive analyses of tumor immunity: implications for cancer immunotherapy by Shirley Liu's lab. TIMER: Tumor IMmune Estimation Resource A comprehensive resource for the clinical relevance of tumor-immune infiltrations
Reference-free deconvolution of DNA methylation data and mediation by cell composition effects. The R package's documentation is minimal... see tutorial here from the author. Brent Pedersen has a tool implementing the same method used by Houseman: celltypes450.
paper: Toward understanding and exploiting tumor heterogeneity
paper: The prognostic landscape of genes and infiltrating immune cells across human cancers from Alizadeh lab.
Robust enumeration of cell subsets from tissue expression profiles from Alizadeh lab, and the CIBERSORT tool
A series of posts on tumor evolution
mapscape bioc package MapScape integrates clonal prevalence, clonal hierarchy, anatomic and mutational information to provide interactive visualization of spatial clonal evolution.
cellscape bioc package Explores single cell copy number profiles in the context of a single cell tree

Tumor clonality and evolution

A step-by-step guide to estimate tumor clonality/purity from variant allele frequency data
densityCut: an efficient and versatile topological approach for automatic clustering of biological data. It can be used to cluster allele frequencies.
phyC: Clustering cancer evolutionary trees
CloneCNA: detecting subclonal somatic copy number alterations in heterogeneous tumor samples from whole-exome sequencing data
paper: Distinct evolution and dynamics of epigenetic and genetic heterogeneity in acute myeloid leukemia
paper: Visualizing Clonal Evolution in Cancer
An R package for inferring the subclonal architecture of tumors: sciclone
Inferring and visualizing clonal evolution in multi-sample cancer sequencing: clonevol
fishplot: Create timecourse "fish plots" that show changes in the clonal architecture of tumors
tools from OMICS tools website
PhyloWGS: Reconstructing subclonal composition and evolution from whole-genome sequencing of tumors.
SCHISM SubClonal Hierarchy Inference from Somatic Mutation

mutual exclusiveness of mutations

MEGSA: A powerful and flexible framework for analyzing mutual exclusivity of tumor mutations.
CoMet
DISCOVER co-occurrence and mutual exclusivity analysis for cancer genomics data.
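As a baseline for intuition only, the simplest pairwise check is a one-sided Fisher exact test on the 2x2 table of samples mutated in gene A and/or gene B. This is not what MEGSA, CoMEt or DISCOVER implement (they handle gene sets and other sources of bias); the function name is my own.

```python
from math import comb


def mutual_exclusivity_pvalue(both: int, only_a: int, only_b: int,
                              neither: int) -> float:
    """P(co-mutated samples <= observed) under independence.

    Uses the hypergeometric distribution with all margins fixed.
    A small value means the two genes co-occur less than expected.
    """
    n = both + only_a + only_b + neither
    a_total = both + only_a
    b_total = both + only_b
    denominator = comb(n, b_total)
    smallest = max(0, a_total + b_total - n)  # fewest possible co-mutations
    p_value = 0.0
    for k in range(smallest, both + 1):
        p_value += comb(a_total, k) * comb(n - a_total, b_total - k) / denominator
    return p_value


# Toy cohort: 20 samples, 10 with gene A mutated, 10 with gene B, none with both.
print(mutual_exclusivity_pvalue(both=0, only_a=10, only_b=10, neither=0))
```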

Mutation enrichment in pathways

PathScore: a web tool for identifying altered pathways in cancer data

Non-coding mutations

Large-scale Analysis of Variants in noncoding Annotations: LARVA

CRISPR

long reads

Quality Assessment Tools for Oxford Nanopore MinION data
Signal-level algorithms for MinION data

Single-cell DNA sequencing

A review paper 2016: Single-cell genome sequencing: current state of the science
Monovar: single-nucleotide variant detection in single cells
R2C2: Improving nanopore read accuracy enables the sequencing of highly-multiplexed full-length single-cell cDNA
sci-LIANTI, a high-throughput, high-coverage single-cell DNA sequencing method that combines single-cell combinatorial indexing (sci) and linear amplification via transposon insertion (LIANTI)

Open Source Agenda is not affiliated with "DNA Seq Analysis" Project. README Source: crazyhottommy/DNA-seq-analysis
