RNASeq pipeline
This pipeline performs the following tasks:
Use git clone:
git clone [email protected]:UMCUGenetics/RNASeq.git
Download the RNAseq pipeline. Make sure all dependencies are installed and the right paths are set in the pipeline (RNAseqAnalyse.pl) in the "Get options" section.
Generate genome indexes files using the instructions in section Generate genome indexes. The genome indexes are saved to disk and need only be generated once for each genome/annotation combination. Next to the files you had to collect to generate the genome indexes, you need:
perl RNAseqAnalyse.pl -input [/path/to/rundir] -outputDir [/path/to/outputdirname] -mail [email]
To see additional parameters, just type:
perl RNAseqAnalyse.pl
Create a directory where you want to store the indexes (e.g. /GENOMES/STAR/Homo_sapiens.GRCh37).
Collect the following files for your genome:
Run STAR:
STAR \
--runMode genomeGenerate \
--genomeDir /path/to/genomeDir \
--genomeFastaFiles /path/to/genome/fasta.fa \
--runThreadN 4 \
--sjdbGTFfile /path/to/annotations.gtf
RNA-seq reads are aligned to the reference genome using STAR, which was designed to specifically address many of the challenges of RNA-seq data mapping, and uses a novel strategy for spliced alignments.
The pipeline searches for fusion genes / chimeric junctions by default. The default minimum segment length is 1 and can be edited with option -chimSegmentMin.
Counting sequencing reads in features (genes/transcripts/exons) is done with htseq-count using the union mode.
The raw read counts table can be found in the directory:
The raw read counts are normalized using the DESeq method included in the DESeq Bioconductor package and is based on the hypothesis that most genes are not DE. A DESeq scaling factor for a given lane is computed as the median of the ratio, for each gene, of its read count over its geometric mean across all lanes. The underlying idea is that non-DE genes should have similar read counts across samples, leading to a ratio of 1. Assuming most genes are not DE, the median of this ratio for the lane provides an estimate of the correction factor that should be applied to all read counts of this lane to fulfill the hypothesis. By calling the estimateSizeFactors() and sizeFactors() functions in the DESeq Bioconductor package, this factor is computed for each lane, and raw read counts are divided by the factor associated with their sequencing lane.
RPKM is a method of quantifying gene expression from RNA sequencing data by normalizing for total read length and the number of sequencing reads. RPKMs are calculated using the RPKM() function included in the edgeR Bioconductor package. RPKM = [# of mapped reads]/([length of transcript]/1000)/([total reads]/10^6)
CAUTION: Make sure the RPKM normalization is actually necessary in your analysis. If you are comparing gene expression among samples only, there really is no reason to normalize by length as you will be dividing each gene among the samples by a constant (gene length). You only need to use RPKM when you are comparing transcript expression within one sample.
Differential expression analysis can be done only for standard designs, such as: sample1 test sample2 control sample3 test sample4 test sample5 control
DE analysis is done using the DESeq2 Bioconductor package. It takes the merged raw read counts (from HTseq-count) as an input and creates the following output inside the /
For the variant calling and filtering, GATK's HaplotypeCaller and VariantFiltration tool is used. SnpEff is used to add genetic variant annotation and effect predictions to the vcf. In case of human genome data, also the vcf will be annotated with the dbNSFP database using SnpSift.
Please contact Sander Boymans ([email protected]) if you want to add additional tools/scripts/options.