--- title : CAGE data analysis subtitle : MODHEP workshop author : Dave Tang job : PhD candidate framework : io2012 # {io2012, html5slides, shower, dzslides, ...} highlighter : highlight.js # {highlight.js, prettify, highlight} hitheme : tomorrow # widgets : [mathjax] # {mathjax, quiz, bootstrap} mode : selfcontained # {standalone, draft} knit : slidify::knit2slides github: user: davetang repo: cage_r --- ## Welcome! > * My name is Dave Tang; I was born in Hong Kong but raised in Papua New Guinea. > * I'm a PhD candidate in bioinformatics at the VU University Amsterdam. > * I also maintain a [blog](http://davetang.org/muse), where I write about bioinformatics. > * I'm here to talk about analysing CAGE data. *** Presenter notes Telling people I'm from PNG is usually a good ice breaker, in case they have nothing to ask me at all. --- ## About this presentation > * I had a lot of difficulty deciding what to present in this talk. > * I started off extremely ambiguous and believed that it is possible to show you how to analyse CAGE data in under two hours. > * Then I realised that this isn't possible. > * This presentation will provide an overview of analysing CAGE data > * There are links on each slide for those who want to learn more. *** Presenter notes Ask audience how many plan on analysing CAGE data. --- ## Slidify > * These slides were made using an R package called [Slidify](http://slidify.org/index.html). > * All the slides and the output are generated by R; it's like embedding analysis code in PowerPoint but also having PowerPoint perform the analysis. > * This is the first time I've used Slidify for a presentation. > * These slides can be viewed at > * If you're interested in Slidify, have a look at and the links within. --- ## Promoter A promoter is a region of DNA that initiates transcription Eukaryotic promoter

Source: *** Presenter notes Promoter regions also contain various transcription factor binding sites --- ## Cap Analysis Gene Expression (CAGE) CAGE protocol 1

Image source: me. *** Presenter notes The double strand protects the RNA from RNase I digestion. However, RNAs without a Cap, are not pulled down. --- ## CAGE CAGE protocol 2

For the full protocol see . *** Presenter notes cDNAs are released from the RNA and a 5' linker (that includes a barcode sequence and EcoP15I sequence) is ligated. Second-strand cDNA synthesis takes place and digestion with EcoP15I takes place. A 3′ linker that contains the 3′ Illumina primer sequence is ligated at the 3′ end. Amplification takes place and the tags are sequenced. --- ## Why do we perform CAGE? > * Quantify the expression of transcripts > * Study promoter structure and usage > * Discover new promoters > * Discover regulatory elements > * For CAGE versus RNA-Seq see *** Presenter notes Piero and Erik showed various CAGE case studies. --- ## A typical CAGE analysis pipeline > * The output of CAGE is a set of short nucleotide sequences and their counts. > * Typical steps in a CAGE analysis pipeline include: > * Tag extraction > * Quality control steps > * Mapping tags to a reference genome > * Tag clustering > * Tag cluster annotation > * Data analysis (e.g. differential expression analysis, co-expression networks, etc.) --- ## ENCODE CAGE data Download these BAM files into some directory: ``` path=http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRikenCage curl -O $path/wgEncodeRikenCageHelas3CellPapRawDataRep1.fastq.gz curl -O $path/wgEncodeRikenCageHelas3CellPapAlnRep1.bam curl -O $path/wgEncodeRikenCageHelas3CellPapAlnRep2.bam curl -O $path/wgEncodeRikenCageHelas3CytosolPapAlnRep1.bam curl -O $path/wgEncodeRikenCageHelas3CytosolPapAlnRep2.bam curl -O $path/wgEncodeRikenCageHelas3NucleusPapAlnRep1.bam curl -O $path/wgEncodeRikenCageHelas3NucleusPapAlnRep2.bam ``` --- ## Raw sequencing data A [fastq](http://en.wikipedia.org/wiki/FASTQ_format) file contains all the reads from the sequencer. ``` gunzip -c wgEncodeRikenCageHelas3CellPapRawDataRep1.fastq.gz | head -4 @HWUSI-EAS566_0007:5:1:965:4705#0/1 NAGCAGCAGGGGAGAGTGTCATGGAGGCCTACGAGC +HWUSI-EAS566_0007:5:1:965:4705#0/1 BYZZY\\[\[^^IXMVUUUX\]\\X[ZZ\]``^__^ ``` For each read, there are four lines of information. 1. The first line is the read name and contains information on the read 2. The second line is the raw sequence of the CAGE tag 3. The third line can be used to store optional notes 4. The fourth line denotes the quality of each base call, i.e. how sure the machine is that an A is an A. --- ## Tag extraction ``` gunzip -c wgEncodeRikenCageHelas3CellPapRawDataRep1.fastq.gz | head -4 @HWUSI-EAS566_0007:5:1:965:4705#0/1 NAGCAGCAGGGGAGAGTGTCATGGAGGCCTACGAGC +HWUSI-EAS566_0007:5:1:965:4705#0/1 BYZZY\\[\[^^IXMVUUUX\]\\X[ZZ\]``^__^ gunzip -c wgEncodeRikenCageHelas3CellPapRawDataRep1.fastq.gz | perl -nle 'if (($. - 2) % 4 == 0){ print substr($_,0,3)}' | sort | uniq -c | sort -k1,1rn | head -4 25596580 TAG 300377 TAT 158391 TCG 142412 CAG ``` The barcode for this library was TAG. The contains tools for tag extraction. See also , which uses Hidden Markov Models for tag extraction (and performs quality control steps). *** Presenter notes Since we know the barcode sequence, we can identify sequencing errors. --- ## Quality control ``` gunzip -c wgEncodeRikenCageHelas3CellPapRawDataRep1.fastq.gz | head -4 @HWUSI-EAS566_0007:5:1:965:4705#0/1 NAGCAGCAGGGGAGAGTGTCATGGAGGCCTACGAGC +HWUSI-EAS566_0007:5:1:965:4705#0/1 BYZZY\\[\[^^IXMVUUUX\]\\X[ZZ\]``^__^ ``` We would remove this read because it contains an ambiguous base call, especially since it's in the barcode. The quality scores range from 3 to 40:

3------------------------------------40

BCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefgh

Various different criteria to remove reads, e.g. remove reads with 10 base calls that have a quality less than 10, etc. The tool can be used to calculate and visualise read qualities. --- ## A note about quality scores ```{r, out.width='300px'} p <- function(q){ return(10^{-q/10}) } p(10) plot(3:37, 1-p(3:37), xaxt='n', type='l', xlab='Quality score', ylab='Probability correct') axis(1, 0:40); abline(h=0.9, v=10) ``` *** Presenter notes If a base has a quality of 10, there's a 10% probability that a read is incorrect. --- ## Mapping > * High-throughput sequencing results in millions to billions of reads > * A computational strategy known as "indexing" speeds up mapping algorithms. > * It works like a book index; an index of a large DNA sequence allows one to rapidly find shorter sequences embedded within it. See [How to map billions of short reads onto genomes](http://www.ncbi.nlm.nih.gov/pubmed/19430453). > * The confidence of mapping is indicated by the mapping quality, which uses the same scale as the base calling qualities. > * Pipelines may remove reads that have a low mapping quality. > * I have a post on [mapping qualities](http://davetang.org/muse/2011/09/14/mapping-qualities/). --- ## SAM/BAM A SAM/BAM file contains alignment information of a single read to a reference genome. ``` samtools view wgEncodeRikenCageHelas3CellPapAlnRep1.bam | head -3 HWUSI-EAS566_0009:3:16:8371:2418#0|TAG 16 chr1 16213 1 27M * 0 0 GCCATGCTCTGACAGTCTCAGTTGCAC fefffffffffffffdfefddffdddd NM:i:0 MD:Z:27 XP:Z:~~~~-----000077777,,,,9998: HWUSI-EAS566_0007:5:11:4550:12891#0|TAG 16 chr1 16445 17 27M * 0 0 AGTTTGAAAACCACTATTTTATGAACC fefffffdffffffeffdffffeefff NM:i:0 MD:Z:27 XP:Z:~~~~~~~~~~~~~~~~~~~~~~~LO?~ HWUSI-EAS566_0007:5:63:12130:4370#0|TAG 0 chr1 16446 17 27M * 0 0 GTTTGAAAACCACTATTTTATGAACCA fffffdfffffdffffffffffefffe NM:i:0 MD:Z:27 XP:Z:~~~~~~~~~~~~~~~~~~~~~~~~~L~ ``` If you are confused about the second column, have a look at . --- ## CAGE defined transcription starting sites CTSS

Image source: me. --- ## Tag cluster annotation Intersect/overlap tag clusters to transcript annotations UCSC Genome Browser screenshot

--- ## Intermission * Any questions so far? --- ## A few words on R before we begin > * I have been using R on and off for a couple of years and it takes a while to get used to. > * Honest confession: I am not very good with R (I keep a lot of [documentation](http://davetang.org/muse/r/) to make up for this). > * I think it is worthwhile learning because a lot of genomics analysis packages are available via the Bioconductor project. > * By default, data loaded into R is stored into memory; the bigger your data, the more memory will be used. --- ## Bioconductor > * From [Wikipedia](http://en.wikipedia.org/wiki/Bioconductor): Bioconductor is a free, open source and open development software project for the analysis and comprehension of genomic data generated by wet lab experiments in molecular biology. > * Provides state of the art software to analyse various genomic datasets. > * Has very well written guides, called vignettes, aimed at biologists. > * To learn more take a look at these [courses](http://bioconductor.org/help/course-materials/), which are provided by the Bioconductor team. *** Presenter notes It's pronounced "vin 'yet". --- ## Loading data into R Download the [data](https://dl.dropboxusercontent.com/u/15251811/pnas_expression.txt) for this [paper](http://www.ncbi.nlm.nih.gov/pubmed/19088194) (right-click on data and save as). ```{r} getwd() setwd("/Users/davetang/slidify_test/data/") data <- read.table("pnas_expression.txt", header=TRUE, sep="\t", stringsAsFactors = FALSE) head(data, 5) ``` --- ## Working with data frames ```{r} class(data) dim(data) data[1,] ``` --- ## Removing a column ```{r} data_subset <- data[,-9] dim(data_subset) head(data_subset) ``` --- ## Naming the rows ```{r} row.names(data_subset) <- data_subset[,1] data_subset <- data_subset[,-1] head(data_subset) ``` --- ## Plotting the rowSums ```{r} plot(density(log2(rowSums(data_subset))), xlab="Gene expression (log2)") ``` --- ## Independent filtering ```{r} dim(data_subset) data_subset <- subset(data_subset, rowSums(data_subset)>50) dim(data_subset) ``` See . --- ## Differential expression ```{r eval=FALSE} source("http://bioconductor.org/biocLite.R") biocLite("edgeR") ``` ```{r, message=FALSE} library(edgeR) group <- c(rep("Control",4), rep("Test",3)) d <- DGEList(counts = data_subset, group=group) #data normalisation d <- calcNormFactors(d, method="TMM") d <- estimateDisp(d) et <- exactTest(d) summary(de <- decideTestsDGE(et, p=0.05, adjust="BH")) ``` --- ## Differentially expressed genes ```{r, out.width='450px'} detags <- rownames(d)[as.logical(de)] plotSmear(et, de.tags=detags) abline(h = c(-2, 2), col = "blue") ``` --- ## Differentially expression genes ```{r} topTags(et, n=5) data_subset['ENSG00000151503',] ``` --- ## Saving the results ```{r} my_genes <- data_subset[row.names(topTags(et, n=100)),] head(my_genes, 3) write.table(my_genes, file = "my_genes.tsv", quote = FALSE, sep = "\t") ``` --- ## Intermission 2 * Any questions or comments? --- ## Correlation See Co-expression

Image source: adpated from . --- ## Pairwise comparisons ```{r} #a vs. b, a vs. c, b vs. c choose(3,2) choose(1000, 2) choose(30000, 2) ``` --- ## Pairwise comparisons ```{r} plot(1000:30000, log2(choose(1000:30000, 2)), type='l', xlab="n", ylab="Pairwise comparisons (log2)") ``` --- ## Hierarchical clustering ```{r, out.width='450px'} h <- hclust(dist(t(data_subset))) plot(h) ``` --- ## Distance measure See . ```{r} dist(t(data_subset)) ``` --- ## Heatmap See . ```{r, message=FALSE, out.width='400px'} #install.packages("gplots") library(gplots) heatmap.2(as.matrix(my_genes), scale="row", trace="none") ``` --- ## CAGEr The CAGEr package available on Bioconductor provides various methods for analysing CAGE data. To install the CAGEr package: ```{r eval=FALSE} source("http://bioconductor.org/biocLite.R") biocLite("CAGEr") ``` To load the CAGEr package: ```{r eval=FALSE} library(CAGEr) ``` *** Presenter notes I decided it wasn't feasible to perform a live demo because - CAGE data is huge - Loading the data is slow and computationally intensive - The vignette provides all the details --- ## What CAGEr can do > * Quality control steps > * CAGE normalisation > * CAGE tag clustering > * Analysis of promoter shape > * Expression clustering > * Analysis of promoter shifting > * Importantly, we can use the CAGEr package to prepare our data for further downstream analyses. > * For more information: . *** Presenter notes Michiel talked about level one and two tag clustering. --- ## Some more resources > * I highly recommend this book called [Bioinformatics Data Skills](http://shop.oreilly.com/product/0636920030157.do). I wrote a [book review](http://davetang.org/muse/2014/04/03/bioinformatics-data-skills/) on it (on my blog of course). > * My collection of links to various [introductory material related to bioinformatics](http://davetang.org/muse/primer/). > * My collection of links to [resources for learning about R](http://davetang.org/muse/r/). > * My collection of links to [courses or course material](http://davetang.org/muse/learn/) related to bioinformatics. > * Feel free [to send me an email](mailto:me@davetang.org?subject=Hello from the MODHEP workshop!) any time. > * Thank you for your attention!