Xinglab CLAM Save

CLIP-seq Analysis of Multi-mapped reads

Project README

Latest Release Github All Releases

CLIP-seq Analysis of Multi-mapped reads

Table of Content

  1. Introduction
  2. Installation
  3. Input
  4. Usage
    4.1 preprocessor
    4.2 realigner
    4.3 peakcaller
    4.4 permutation_callpeak
    4.5 peak_annotator
    4.6 data_downloader
  5. Output
  6. An example
    6.1 Sample dataset
    6.2 Preprossing of sample dataset
    6.3 Read mapping
    6.4 PCR duplicate removal
    6.5 CLAM pre-processing
    6.6 Realigning
    6.7 Peak calling
    6.8 Annotation
    6.9 Detecting motif

Introduction

CLAM is a general toolkit for re-aligning multi-mapped reads in CLIP/RIP-seq data and calling peaks. For details, please read our NAR paper. Recently, we updated CLAM to version v1.2.0 with new features:

  1. Implemented new strandness tagging for anti-sense reads.
  2. Multi-replicate mode for peak calling and benchmarking.
  3. Annotating peaks to genomic region annotations.
  4. Install through pypi.

TOC

Installation

CLAM v1.2 works under Python 2/3. Please download the lastest release.

wget https://github.com/Xinglab/CLAM/releases/download/v1.2.0/CLAM-1.2.0.tar.gz
tar -zxvf CLAM-1.2.0.tar.gz && cd CLAM-1.2.0
python setup.py install

in your terminal and this will automatically install CLAM in your currently working python.

You should have already installed "pysam" using pip/conda for your python interpreter. If not, you can check the detailed requirements in the file "requirements.txt", or type

pip install -r requirements.txt

to install those requirements manually.

Or, a simpler way to install CLAM is through Pypi:

pip install --index-url https://test.pypi.org/simple/ --no-deps CLAM

TOC

Input

The input for CLAM is a sorted or unsorted BAM file of CLIP-seq alignment and a gene annotation file in GTF format.
In the case of RIP-seq or eCLIP, a BAM file for IP experiment and a BAM file for Control/input experiment are taken together as input.

Note: As in the released version v1.1, the read_gtf function had a bug and required the Gencode format GTF (i.e. last column of GTF matches gene_id "(xx)" ) to proceed the peak calling. This bug has been fixed in the github repository and has been fixed in later releases/patches.

TOC

Usage

CLAM is run through issueing subcommands. Currently there are four subcommands available: preprocessor, realigner, peakcaller permutation_callpeak and peak_annotator.

usage: CLAM [-h] [--version]
            {preprocessor,realigner,peakcaller,permutation_callpeak,peak_annotator,data_downloader}
            ...

CLAM -- CLip-seq Analysis of Multi-mapped reads

positional arguments:
  {preprocessor,realigner,peakcaller,permutation_callpeak,peak_annotator,data_downloader}
    preprocessor        CLAM Preprocessor: tag read alignments to specific
                        locus
    realigner           CLAM Realigner: realign multi-mapped reads using
                        expectation-maximization
    peakcaller          CLAM Peakcaller: negative binomial model-based peak
                        calling combining unique- and multi-reads
    permutation_callpeak
                        CLAM permutation peakcaller: call peaks using
                        permutation (as in v1.0.0)
    peak_annotator      CLAM peak annotator: assign peaks to genomic regions
    data_downloader     CLAM data downloader: download data of genomic regions

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit

For command line options of each sub-command, type: CLAM COMMAND -h

TOC

CLAM preprocessor

This subcommand (new since v1.1) will prepare the input files for CLAM pipeline. It looks for reads passing QC, splits the input bam file by sorting them into unique.sorted.bam and multi.sorted.bam, and adding an additional tag "RT" (short for Read Tag) to each alignment based which read tagger function the user supplied.

Note that you can also run CLAM realigner directly, which will call preprocessor and automatically determine if preprocessor has been called in the output folder.

Parameters:

-h, --help            show help message and exit
-i IN_BAM, --input IN_BAM
                      Input bam file
-o OUT_DIR, --out-dir OUT_DIR
                      Output folder
--read-tagger-method {median,start}
                      Read tagger method. Indicate method to tag a CLIP/RIP read to a particular locus; 
                      'median' tags read center and is recommended for RIP-seq; 
                      'start' tags read start site and is recommended for CLIP-seq.
--max-multihits MAX_HITS
                      The maximum hits allowed for multi-mapped reads;
                      default: 100
--max-tags MAX_TAGS   The maximum identical tags at given location; default:
                      -1, no filter. CLAM will collapse reads mapped to the identical locations up to MAX_TAGS reads. 
                      This is usually used for RIP-seq with strong PCR duplication issues.
--lib-type {sense,antisense,unstranded}
                      The expected read strandness with transcription
                      direction: sense, antisense, or unstranded;
                      default: sense

If you don't want to run realigner, you can also run peakcaller directly after preprocessor.

Example run:

CLAM preprocessor -i path/to/input/Aligned.out.bam -o path/to/clam/outdir/ --read-tagger-method median

TOC

CLAM realigner

This subcommand will run expectation-maxmization to assign the multi-mapped reads in a probablistic framework. More details about the EM model is described in our NAR paper.

Parameters:

-i IN_BAM, --input IN_BAM
                      Input bam file
-o OUT_DIR, --out-dir OUT_DIR
                      Output folder
--read-tagger-method {median,start}
                      Read tagger method. Indicate method to tag a CLIP/RIP read to a particular locus; 
                      'median' tags read center and is recommended for RIP-seq; 
                      'start' tags read start site and is recommended for CLIP-seq.
--max-multihits MAX_HITS
                      The maximum hits allowed for multi-mapped reads;
                      default: 100
--max-tags MAX_TAGS   The maximum identical tags at given location; default:
                      -1, no filter. CLAM will collapse reads mapped to the identical locations up to MAX_TAGS reads. 
                      This is usually used for RIP-seq with strong PCR duplication issues.
--retag               Retag the bam regardless when turned on; invalid when
                      no previous files found
--winsize WINSIZE     Local window size for em computations; default: 50
--lib-type {sense,antisense,unstranded}
                      The expected read strandness with transcription
                      direction: sense, antisense, or unstranded;
                      default: sense

Note when --retag is specified, realigner will re-run preprocessor regardless; otherwise, it will use the prepared files in outdir if available.

Example run:

CLAM realigner -i path/to/input/Aligned.out.bam -o path/to/clam/outdir/ --read-tagger-method start --retag

TOC

CLAM peakcaller

This subcommand (new since v1.1) will call peaks by looking for bins enriched with IP reads over control, specifying a Negative-binomial model on observed read counts.

Note we can specify both unique.sorted.bam (from preprocessor) and realigned.sorted.bam (from realigner) and separte the two file paths by a space, to call peaks using the combination of uniquely- and multi-mapped reads.

Alternatively, we can also only input unique.sorted.bam; this will allow CLAM to call peaks using only uniquely- mapped reads.

As a new feature in version 1.2.0, we implemented multi-replicate mode for peakcaller. Simply use comma to seperate bam files of replicates (Here, we assume there are two replicates named rep1 and rep2).

Parameters:

-h, --help            show help message and exit
-i IN_BAM [IN_BAM ...], --input IN_BAM [IN_BAM ...]
                      Filepaths for IP bam files, e.g ubam1,ubam2
                      mbam1,mbam2
-c CON_BAM [CON_BAM ...], --control-dir CON_BAM [CON_BAM ...]
                      Filepaths for control bam files
-o OUT_DIR, --out-dir OUT_DIR
                      Output folder
--gtf GTF_FP          GTF filepath
-p NTHREAD, --nthread NTHREAD
                      Number of threads; default: 8
-u, --unique-only     Call peaks using only unique-mapped reads when turned
                      on
--pool                Pool the read counts if provided with multiple
                      replicates; default: False
--min-clip-cov MIN_CLIP_COV
                      Minimum CLIP reads per gene to perform analysis;
                      default: 4
--qval-cutoff QVAL_CUTOFF
                      Cutoff for adjusted p-values; default: 0.05
--fold-change FOLD_CHANGE [FOLD_CHANGE ...]
                      Threasholds for signal range (fold change w/ control;
                      tag count w/o control); default: 2-inf
--normalize-lib       use total library size to normalize signal and
                      control, instead of gene-by-gene basis; default: False
-b BINSIZE, --binsize BINSIZE
                      Bin size for calling peaks; default: 50
--lib-type {sense,antisense,unstranded}
                      The expected read strandness with transcription
                      direction: sense, antisense, or unstranded;
                      default: sense

Example run:

CLAM peakcaller -i path/to/IP/rep1/unique.sorted.bam,path/to/IP/rep2/unique.sorted.bam \
path/to/IP/rep1/realigned.sorted.bam,path/to/IP/rep2/realigned.sorted.bam \
-c path/to/CTRL/unique.sorted.bam path/to/CTRL/realigned.sorted.bam \
-o path/to/peaks/outdir --lib_type sense --binsize 100 --lib-type unstranded \
--gtf path/to/gencode.v19.annotation.gtf

TOC

CLAM permutation_callpeak

This subcommand will call peaks using permutation by randomly placing reads along the gene. More details about the permutation procedure is described in our NAR paper.

Example run:

CLAM permutation_callpeak -i path/to/outdir/unique.sorted.bam path/to/outdir/realigned.sorted.bam \
-o path/to/peaks/outdir -p 8 \
--gtf path/to/gencode.v19.annotation.gtf

TOC

CLAM peak_annotator

This sumcommand will annotate peaks to genomic regions. It requires genomic region files to exist. If not, peak_annotator will call data_downloader automatically. Currently, we support annotation of human(version hg19 and hg38) and mouse(version mm10)

Example run:

CLAM peak_annotator path/to/peak/file/narrow_peak.unique.bed hg19 path/to/output/annotate.txt

TOC

CLAM data_downloader

This sumcommand will download prepared genomic annotation files to local system. Usually, it was called by peak_annotator automatically. You can also run it manually.

Example run:

CLAM data_downloader hg19

TOC

Output

The output of the re-aligner is "realigned.sorted.bam" (previously "assigned_multimapped_reads.bam" in v1.0), which is a customized BAM file following SAM format. Note that the re-aligned weights are stored in "AS:" tag, so please be aware and do not change/omit it. Output of re-aligner could also be seen as an intermediate file for CLAM pipeline.

The output of the peak-caller is a bed file following NarrowPeak format. It is a 10-column BED format file.

If you run permutation peak caller (as in v1.0), there will be only one output file called "narrow_peak.permutation.bed". Hence a peak with "combined" but no "unique" on the fifth column indicates this is a rescued peak; both "unique" and "combined" as common peak; or lost peak otherwise.

If you run model-based peak caller (since v1.1), depending on the specified paramter (whether you turned on --unique-only), the output will be either "narrow_peak.unique.bed" for peaks called using only uniquely-mapped reads; or "narrow_peak.combined.bed" for peaks called when adding realigned multi-mapped reads.

The output of peak-annotator is a BED file with extra columns output by BEDTools intersect command (-wa, -wb, -bed ). A more detailed description can be found here

TOC

An example

Sample dataset

A typical application of CLAM is to call peaks with CLIP-seq data. Here, we take an eCLIP seq dataset as an example. (A full Snakemake pipeline for analysing eCLIP data can be found here.)

In this demo, we focused on RBFOX2 eCLIP data from K562 cell line provided by Van et. al.

Van Nostrand, E. L., G. A. Pratt, A. A. Shishkin, C. Gelboin-Burkhart, M. Y. Fang, B. Sundararaman, S. M. Blue, T. B. Nguyen, C. Surka, K. Elkins, R. Stanton, F. Rigo, M. Guttman and G. W. Yeo (2016). "Robust transcriptome-wide discovery of RNA-binding protein binding sites with enhanced CLIP (eCLIP)." Nat Methods 13(6): 508-514.

First, download raw data from ENCODE:

wget https://www.encodeproject.org/files/ENCFF495WQA/@@download/ENCFF495WQA.fastq.gz
wget https://www.encodeproject.org/files/ENCFF492QZU/@@download/ENCFF492QZU.fastq.gz
wget https://www.encodeproject.org/files/ENCFF930TLO/@@download/ENCFF930TLO.fastq.gz
wget https://www.encodeproject.org/files/ENCFF462CMF/@@download/ENCFF462CMF.fastq.gz
wget https://www.encodeproject.org/files/ENCFF163QEA/@@download/ENCFF163QEA.fastq.gz
wget https://www.encodeproject.org/files/ENCFF942TPA/@@download/ENCFF942TPA.fastq.gz

A detaild list for each file:

File Name Cell Line Bait Group Run type Encode Accession Platform
ENCFF495WQA K562 RBFOX2 Control-Read 1 PE55nt ENCSR051IXX Illumina HiSeq 4000
ENCFF492QZU K562 RBFOX2 Control-Read 2 PE46nt ENCSR051IXX Illumina HiSeq 4000
ENCFF930TLO K562 RBFOX2 Replicate 1-Read 1 PE45nt ENCSR756CKJ Illumina HiSeq 4000
ENCFF462CMF K562 RBFOX2 Replicate 1-Read 2 PE46nt ENCSR756CKJ Illumina HiSeq 4000
ENCFF163QEA K562 RBFOX2 Replicate 2-Read 1 PE45nt ENCSR756CKJ Illumina HiSeq 4000
ENCFF942TPA K562 RBFOX2 Replicate 2-Read 2 PE46nt ENCSR756CKJ Illumina HiSeq 4000

TOC

Preprossing of sample dataset

When all raw reads were downloaded, use catadaptor to remove adaptor sequences. Before this, let's unzip all files and rename them:

ll | grep fastq | awk '{print $9}' | xargs gzip -d
mv ENCFF495WQA.fastq K562_RBFOX2_Inp_R1.fastq
mv ENCFF492QZU.fastq K562_RBFOX2_Inp_R2.fastq
mv ENCFF930TLO.fastq K562_RBFOX2_rep1_IP_R1.fastq
mv ENCFF462CMF.fastq K562_RBFOX2_rep1_IP_R2.fastq
mv ENCFF163QEA.fastq K562_RBFOX2_rep2_IP_R1.fastq
mv ENCFF942TPA.fastq K562_RBFOX2_rep2_IP_R2.fastq

To remove adaptors, each paired-end reads should go through 2 rounds of processing. Take control as an example:

cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 1 --quality-cutoff 6 -m 18 \
-a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT \
-A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA \
-A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA \
-A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC \
-A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC \
-A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT \
-o K562_RBFOX2_Inp_R1.adapterTrim.fastq -p K562_RBFOX2_Inp_R2.adapterTrim.fastq \
K562_RBFOX2_Inp_R1.fastq K562_RBFOX2_Inp_R2.fastq
cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 \
-A AACTTGTAGATCGGA -A AGGACCAAGATCGGA \
-A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG \
-A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG \
-A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG \
-A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC \
-A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT \
-o K562_RBFOX2_Inp_R1.adapterTrim.round2.fastq -p K562_RBFOX2_Inp_R2.adapterTrim.round2.fastq \
K562_RBFOX2_Inp_R1.adapterTrim.fastq K562_RBFOX2_Inp_R2.adapterTrim.fastq

TOC

Read mapping

After reads were processed, we will need to map reads to genome. Here, we use STAR to map reads to human genome, version GRCh37.75. For a more comprehensive introduction to STAR, click here.
Still, we take Control as an example:

STAR --genomeDir /full/path/to/STAR/index/folder \
--readFilesIn K562_RBFOX2_Inp_R1.adapterTrim.round2.fastq K562_RBFOX2_Inp_R2.adapterTrim.round2.fastq 
--outSAMtype BAM Unsorted \
--outFileNamePrefix ../star/K562_RBFOX2_Inp/ \
--alignEndsProtrude 15 ConcordantPair \
--outFilterMultimapNmax 100 \
--runThreadN 20 \
--outStd Log

TOC

PCR duplicate removal

As rRNAs and other repetitive RNA may cause bias when performing peak calling, we will need to clean the mapped reads.
Use BEDTOOLS to remove rRNAs (Optional. rRNA annotation can be exported from UCSC Table Browser).

bedtools intersect -f 0.90 -abam ../star/K562_RBFOX2_Inp/Aligned.out.bam -b /path/to/rRNA/annotation.bed -v > ../star/K562_RBFOX2_Inp/Aligned.out.mask_rRNA.bam
bedtools intersect -f 0.90 -abam ../star/K562_RBFOX2_rep1_IP/Aligned.out.bam -b /path/to/rRNA/annotation.bed -v > ../star/K562_RBFOX2_rep1_IP/Aligned.out.mask_rRNA.bam
bedtools intersect -f 0.90 -abam ../star/K562_RBFOX2_rep2_IP/Aligned.out.bam -b /path/to/rRNA/annotation.bed -v > ../star/K562_RBFOX2_rep2_IP/Aligned.out.mask_rRNA.bam

In CLIP data analysis, its important to remove PCR duplicates. Here we used our in-house code to remove PCR duplicates (The code can be found here). Them -m option indicates output log file of removed read number of each adaptor.

python2 collapse_duplicates.py -b star/K562_RBFOX2_Inp/Aligned.out.mask_rRNA.bam -o star/K562_RBFOX2_Inp/Aligned.out.mask_rRNA.dup_removed.bam -m star/K562_RBFOX2_Inp/dup_removal.metrics.txt
python2 collapse_duplicates.py -b star/K562_RBFOX2_rep1_IP/Aligned.out.mask_rRNA.bam -o star/K562_RBFOX2_rep1_IP/Aligned.out.mask_rRNA.dup_removed.bam -m star/K562_RBFOX2_rep1_IP/dup_removal.metrics.txt
python2 collapse_duplicates.py -b star/K562_RBFOX2_rep2_IP/Aligned.out.mask_rRNA.bam -o star/K562_RBFOX2_rep2_IP/Aligned.out.mask_rRNA.dup_removed.bam -m star/K562_RBFOX2_rep2_IP/dup_removal.metrics.txt

TOC

CLAM preprocessing

As one of the major feature of CLAM, multi-mapped reads were rescued by an EM procedure while peak calling (See our paper for more detail). Before peak calling, CLAM will seperate multi-mapped reads and uniquely mapped reads. This process can be omiited, if so, CLAM will still call the preprocessing module if it cannot find seperated BAM files.

CLAM preprocessor -i star/K562_RBFOX2_Inp/Aligned.out.mask_rRNA.dup_removed.bam -o clam/K562_RBFOX2_Inp --read-tagger-method start
CLAM preprocessor -i star/K562_RBFOX2_rep1_IP/Aligned.out.mask_rRNA.dup_removed.bam -o clam/K562_RBFOX2_rep1_IP --read-tagger-method start
CLAM preprocessor -i star/K562_RBFOX2_rep2_IP/Aligned.out.mask_rRNA.dup_removed.bam -o clam/K562_RBFOX2_rep2_IP --read-tagger-method start

TOC

Realigning

This step will realign multi-mapped reads to each putative genome location with a probability. CLAM realigner uses uniquely mapped reads to determine the probability weight of mutli-mapped reads for each genome locus.

CLAM realigner -i clam/K562_RBFOX2_Inp/multi.sorted.bam -o clam/K562_RBFOX2_Inp --winsize 50 --max-tags -1 --read-tagger-method start
CLAM realigner -i clam/K562_RBFOX2_rep1_IP/multi.sorted.bam -o clam/K562_RBFOX2_rep1_IP --winsize 50 --max-tags -1 --read-tagger-method start
CLAM realigner -i clam/K562_RBFOX2_rep2_IP/multi.sorted.bam -o clam/K562_RBFOX2_rep2_IP --winsize 50 --max-tags -1 --read-tagger-method start

TOC

Peak calling

This step will call peaks in multi-replicate mode of CLAM.

CLAM peakcaller -i clam/K562_RBFOX2_rep1_IP/unique.sorted.bam,clam/K562_RBFOX2_rep2_IP/unique.sorted.bam \
clam/K562_RBFOX2_rep1_IP/realigned.sorted.bam,clam/K562_RBFOX2_rep2_IP/realigned.sorted.bam \
-c clam/K562_RBFOX2_Inp/unique.sorted.bam clam/K562_RBFOX2_Inp/realigned.sorted.bam \
-o clam/peaks --binsize 100 \
--gtf path/to/gencode.v19.annotation.gtf

TOC

Annotation

Once peaks were called, the genome regions of peaks can be annotated by peak_annotator. If genomic region annotation file location is not in system environment or CLAM cannot find the files of the specific genome version in dedicated location, it will call data_downloder automatically.

CLAM peak_annotator -i clam/peaks/narrow_peak.combined.bed -g hg19 -o clam/peaks/annotate.txt

The header of result file will looks like:

## Annotation peaks to genomic regions all intersected genomic regions are presented.
## CLAM version: 1.2.0
## Column 1:  Peak chromosome
## Column 2:  Peak start
## Column 3:  Peak end
## Column 4:  Peak name
## Column 5:  Peak score
## Column 6:  Peak strand
## Column 7:  Peak signal value
## Column 8:  Peak pValue
## Column 9:  Peak qValue
## Column 10: Point-source called for this peak
## Column 11: Genomic region chromosome
## Column 12: Genomic region start
## Column 13: Genomic region end
## Column 14: Gene ID
## Column 15: Quality score
## Column 16: Genomic region strand
## Column 17: Genomic region type

TOC

Detecting motif

If you have HOMER installed, try to detect over-represented motifs using it.

findMotifsGenome.pl clam/peaks/narrow_peak.combined.bed hg19 motif/ -rna -len 5,6,7 -p 20 -size 100 -S 10

Ideally, you will get a list of motifs, including a 'GCAUG' motif.

HOMER motif

TOC

Open Source Agenda is not affiliated with "Xinglab CLAM" Project. README Source: Xinglab/CLAM
Stars
26
Open Issues
5
Last Commit
2 years ago
Repository
License

Open Source Agenda Badge

Open Source Agenda Rating