{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# DSI Summer Workshops Series\n",
"\n",
"## June 21, 2018\n",
"\n",
"Peggy Lindner
\n",
"Center for Advanced Computing & Data Science (CACDS)
\n",
"Data Science Institute (DSI)
\n",
"University of Houston \n",
"plindner@uh.edu \n",
"\n",
"\n",
"Please make sure you have Jupyterhub running with support for R and all the required packages installed.\n",
"Data for this and other tutorials can be found in the github repsoitory for the Summer 2018 DSI Workshops\n",
"https://github.com/peggylind/Materials_Summer2018\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Computational Genomics with R\n",
"\n",
"Basis understanding of Genomic Data Analysis using R"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Goals\n",
"\n",
"* If you are not familiar with R, you will get the basics of R and divide right in to specialized uses of R for computational genomics.\n",
"* You will understand genomic intervals and operations on them, such as overlap\n",
"* You will be able to retrieve data and explore it\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](Images/DataAnalysis.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Some R Basics \n",
"### Packages and functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"library(stats)\n",
"ls(\"package:stats\") # functions in the package\n",
"ls() # objects in your R enviroment"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# get help on hist() function\n",
"?hist\n",
"help(\"hist\")\n",
"# search the word \"hist\" in help pages\n",
"help.search(\"hist\")\n",
"??hist"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Basic Computations in R"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"2 + 3 * 5 # Note the order of operations.\n",
"log(10) # Natural logarithm with base e\n",
"5^2 # 5 raised to the second power\n",
"3/2 # Division\n",
"sqrt(16) # Square root\n",
"abs(3-7) # Absolute value of 3-7\n",
"pi # The number\n",
"exp(2) # exponential function\n",
"# This is a comment line"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data Structures\n",
"#### Vectors"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x <- c(1, 3, 2, 10, 5) #create a vector x with 5 components\n",
"x\n",
"## [1] 1 3 2 10 5\n",
"y <- 1:5 #create a vector of consecutive integers y\n",
"y + 2 #scalar addition\n",
"## [1] 3 4 5 6 7\n",
"2 * y #scalar multiplication\n",
"## [1] 2 4 6 8 10\n",
"y^2 #raise each component to the second power\n",
"## [1] 1 4 9 16 25\n",
"2^y #raise 2 to the first through fifth power\n",
"## [1] 2 4 8 16 32\n",
"y #y itself has not been unchanged\n",
"## [1] 1 2 3 4 5\n",
"y <- y * 2\n",
"y #it is now changed\n",
"## [1] 2 4 6 8 10\n",
"r1 <- rep(1, 3) # create a vector of 1s, length 3\n",
"length(r1) #length of the vector\n",
"## [1] 3\n",
"class(r1) # class of the vector\n",
"## [1] \"numeric\"\n",
"a <- 1 # this is actually a vector length one"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Matrix"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x <- c(1, 2, 3, 4)\n",
"y <- c(4, 5, 6, 7)\n",
"m1 <- cbind(x, y)\n",
"m1\n",
"## x y\n",
"## [1,] 1 4\n",
"## [2,] 2 5\n",
"## [3,] 3 6\n",
"## [4,] 4 7\n",
"t(m1) # transpose of m1\n",
"## [,1] [,2] [,3] [,4]\n",
"## x 1 2 3 4\n",
"## y 4 5 6 7\n",
"dim(m1) # 2 by 5 matrix\n",
"## [1] 4 2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Data Frames"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"chr <- c(\"chr1\", \"chr1\", \"chr2\", \"chr2\")\n",
"strand <- c(\"-\",\"-\",\"+\",\"+\")\n",
"start<- c(200,4000,100,400)\n",
"end<-c(250,410,200,450)\n",
"mydata <- data.frame(chr,start,end,strand)\n",
"#change column names\n",
"names(mydata) <- c(\"chr\",\"start\",\"end\",\"strand\")\n",
"mydata # OR this will work too\n",
"mydata <- data.frame(chr=chr,start=start,end=end,strand=strand)\n",
"mydata"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Slicing and Dicing\n",
"\n",
"![](Images/slicingDataFrames.png)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"mydata[,2:4] # columns 2,3,4 of data frame\n",
"mydata[,c(\"chr\",\"start\")] # columns chr and start from data frame\n",
"mydata$start # variable start in the data frame\n",
"mydata[c(1,3),] # get 1st and 3rd rows\n",
"mydata[mydata$start>400,] # get all rows where start>400\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### List"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# example of a list with 4 components\n",
"# a string, a numeric vector, a matrix, and a scalar\n",
"w <- list(name=\"Fred\",\n",
" mynumbers=c(1,2,3),\n",
" mymatrix=matrix(1:4,ncol=2),\n",
" age=5.3)\n",
"w"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"w[[3]] # 3rd component of the list\n",
"w[[\"mynumbers\"]] # component named mynumbers in list\n",
"w$age"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Factors"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"features=c(\"promoter\",\"exon\",\"intron\")\n",
"f.feat=factor(features)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data types\n",
" * numeric\n",
" * logical\n",
" * character\n",
" * integer"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#create a numeric vector x with 5 components\n",
"x<-c(1,3,2,10,5)\n",
"x\n",
"#create a logical vector x\n",
"x<-c(TRUE,FALSE,TRUE)\n",
"x\n",
"# create a character vector\n",
"x<-c(\"sds\",\"sd\",\"as\")\n",
"x\n",
"class(x)\n",
"# create an integer vector\n",
"x<-c(1L,2L,3L)\n",
"x\n",
"class(x)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Reading and Writing Data\n",
"\n",
"Most of the genomics data sets are in the form of genomic intervals associated with a score. That means mostly the data will be in table format with columns denoting chromosome, start positions, end positions, strand and score. One of the popular formats is BED format used primarily by UCSC genome browser but most other genome browsers and tools will support BED format. We have all the annotation data in BED format. In R, you can easily read tabular format data with read.table() function."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"enh.df <- read.table(\"dataJune21th/subset.enhancers.hg18.bed\", header = FALSE) # read enhancer marker BED file\n",
"cpgi.df <- read.table(\"dataJune21th/subset.cpgi.hg18.bed\", header = FALSE) # read CpG island BED file\n",
"# check first lines to see how the data looks like\n",
"head(enh.df)\n",
"head(cpgi.df)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"write.table(cpgi.df,file=\"cpgi.txt\",quote=FALSE,\n",
" row.names=FALSE,col.names=FALSE,sep=\"\\t\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"save(cpgi.df,enh.df,file=\"mydata.RData\")\n",
"load(\"mydata.RData\")\n",
"# saveRDS() can save one object at a type\n",
"saveRDS(cpgi.df,file=\"cpgi.rds\")\n",
"x=readRDS(\"cpgi.rds\")\n",
"head(x)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One important thing is that with save() you can save many objects at a time and when they are loaded into memory with load() they retain their variable names. For example, in the above code when you use load(\"mydata.RData\") in a fresh R session, an object names “cpg.df” will be created. That means you have to figure out what name you gave it to the objects before saving them. On the contrary to that, when you save an object by saveRDS() and read by readRDS() the name of the object is not retained, you need to assign the output of readRDS() to a new variable (“x” in the above code chunk)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Plotting in R\n",
"Let us sample 50 values from normal distribution and do some plots."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# setting figure size in notebook\n",
"options(repr.plot.width = 4, repr.plot.height = 4)\n",
"# sample 50 values from normal distribution\n",
"# and store them in vector x\n",
"x<-rnorm(50)\n",
"hist(x) # plot the histogram of those values"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#let's add a title and change the color\n",
"hist(x,main=\"Hello histogram!!!\",col=\"red\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Scatterplot"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# randomly sample 50 points from normal distribution\n",
"y<-rnorm(50)\n",
"#plot a scatter plot\n",
"# control x-axis and y-axis labels\n",
"plot(x,y,main=\"scatterplot of random samples\",\n",
" ylab=\"y values\",xlab=\"x values\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Boxplot\n",
"\n",
"lowerWhisker=Q1-1.5[IQR] and upperWhisker=Q1+1.5*[IQR]\n",
"\n",
"In addition, outliers can be depicted as dots. In this case, outliers are the values that remain outside the whiskers."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
" boxplot(x,y,main=\"boxplots of random samples\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Barplot"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"perc=c(50,70,35,25)\n",
"barplot(height=perc,names.arg=c(\"CpGi\",\"exon\",\"CpGi\",\"exon\"),\n",
" ylab=\"percentages\",main=\"imagine %s\",\n",
" col=c(\"red\",\"red\",\"blue\",\"blue\"))\n",
"legend(\"topright\",legend=c(\"test\",\"control\"),fill=c(\"red\",\"blue\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" ## Saving plots\n",
" If you want to save your plots to an image file there are couple of ways of doing that. Normally, you will have to do the following:\n",
" 1. Open a graphics device\n",
" 2. Create the plot\n",
" 3. Close the graphics device"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pdf(\"myplot.pdf\",width=5,height=5)\n",
"plot(x,y)\n",
"dev.off()\n",
"\n",
" #Alternatively, you can first create the plot then copy the plot to a graphic device.\n",
"\n",
"plot(x,y)\n",
"dev.copy(pdf,\"myplot.pdf\",width=7,height=5)\n",
"dev.off()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Working with sequences, primarily DNA sequences, and genomic features.\n",
"\n",
"We will be using Bioconductor packages for this. \n",
"\n",
"Bioconductor represents a different strand of current development in R, separate from the Hadley Wickham tidyverse. Where Hadley emphasizes the data frame above all else, Bioconductor uses a great variety of data types. It's the very opposite of tidy!\n",
"\n",
"Nevertheless, Bioconductor is overwhelmingly *comprehensive*, and represents the most complete environment available for working with bioinformatic data currently available.\n",
"\n",
"Bioconductor packages usually have useful documentation in the form of \"vignettes\". These are readable on the Bioconductor website, or within R:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"source(\"http://bioconductor.org/biocLite.R\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Install a basic set of packages\n",
"biocLite()\n",
"\n",
"# Install further packages used in this tutorial\n",
"biocLite(c(\n",
" \"Biostrings\",\n",
" \"GenomicRanges\", \n",
" \"rtracklayer\",\n",
" \"motifRG\",\n",
" \"AnnotationHub\",\n",
" \"ggbio\"\n",
"))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"library(Biostrings) # Provides DNAString, DNAStringSet, etc\n",
"library(GenomicRanges) # Provides GRanges, etc\n",
"library(rtracklayer) # Provides import() and export()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### DNAString\n",
"\n",
"Package `Biostrings` offers classes for storing DNA strings, `DNAString`, amino acid sequences, `AAString`, or anything else in a `BString`. These are very like character strings, but a variety of biologically meaningful functions can be applied to them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"myseq <- DNAString(\"ACCATTGATTAT\")\n",
"myseq"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class(myseq)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"reverseComplement(myseq)\n",
"translate(myseq)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"subseq(myseq, 3,5)\n",
"myseq[3:5]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"as.character(myseq)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"methods(class=\"DNAString\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"?\"DNAString-class\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### DNAStringSet\n",
"\n",
"Often we want to work with a list of sequences, such as chromosomes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"myset <- DNAStringSet( list(chrI=myseq, chrII=DNAString(\"ACGTACGT\")) )\n",
"myset\n",
"\n",
"# A DNAStringSet is list-like\n",
"myset$chrII\n",
"# or myset[[\"chrII\"]]\n",
"# or myset[[2]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Loading files\n",
"\n",
"Loading sequences\n",
"\n",
"DNA sequences are generally stored in FASTA format, a simple text format. These can be loaded with `readDNAStringSet` from `Biostrings`. Let's load the genome of E. coli strain K-12, obtained from the Ensembl FTP site.\n",
"\n",
"```\n",
"### The start of the .fa file looks like this:\n",
"# >Chromosome dna:chromosome chromosome:GCA_000800765.1:Chromosome:1:4558660:1\n",
"# AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTC\n",
"# TGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGG\n",
"# TCACTAAATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTAC\n",
"# ACAACATCCATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGT\n",
"# AACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAGCCCGCACCTGACAGTGCGGG\n",
"# CTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAAGTTCGGCGGT\n",
"# ...\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"seqs <- readDNAStringSet(\"dataJune21th/gendata/Escherichia_coli_k_12.GCA_000800765.1.29.dna.genome.fa\")\n",
"seqs"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Our chromosome name is too verbose.\n",
"# Remove everything from the name after the first space.\n",
"names(seqs)\n",
"names(seqs) <- sub(\" .*\",\"\",names(seqs))\n",
"names(seqs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Genomic Intervals"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"[Bioconductor](http://bioconductor.org) project has a dedicated package called **GenomicRanges** to deal with genomic intervals. In this section, we will provide use cases involving operations on genomic intervals. The main reason we will stick to this package is that it provides tools to do overlap operations. However package requires that users operate on specific data types that are conceptually similar to a tabular data structure implemented in a way that makes overlapping and related operations easier. The main object we will be using is called GRanges object and we will also see some other related objects from the GenomicRanges package.\n",
"\n",
"#### How to create and manipulate a GRanges object"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n",
"gr=GRanges(seqnames=c(\"chr1\",\"chr2\",\"chr2\"),\n",
" ranges=IRanges(start=c(50,150,200),end=c(100,200,300)),\n",
" strand=c(\"+\",\"-\",\"-\")\n",
")\n",
"gr"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# subset like a data frame\n",
"gr[1:2,]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"gr=GRanges(seqnames=c(\"chr1\",\"chr2\",\"chr2\"),\n",
" ranges=IRanges(start=c(50,150,200),end=c(100,200,300)),\n",
" names=c(\"id1\",\"id3\",\"id2\"),\n",
" scores=c(100,90,50)\n",
")\n",
"# or add it later (replaces the existing meta data)\n",
"mcols(gr)=DataFrame(name2=c(\"pax6\",\"meis1\",\"zic4\"),\n",
" score2=c(1,2,3))\n",
"\n",
"gr=GRanges(seqnames=c(\"chr1\",\"chr2\",\"chr2\"),\n",
" ranges=IRanges(start=c(50,150,200),end=c(100,200,300)),\n",
" names=c(\"id1\",\"id3\",\"id2\"),\n",
" scores=c(100,90,50)\n",
")\n",
"\n",
"# or appends to existing meta data\n",
"mcols(gr)=cbind(mcols(gr),\n",
" DataFrame(name2=c(\"pax6\",\"meis1\",\"zic4\")) )\n",
"gr"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# elementMetadata() and values() do the same things\n",
"elementMetadata(gr)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"values(gr)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Getting genomic regions into R as GRanges objects\n",
"\n",
"There are multiple ways you can read in your genomic features into R and create a GRanges object. Most genomic interval data comes as a tabular format that has the basic information about the location of the interval and some other information. We already showed how to read BED files as data frame. Now we will show how to convert it to GRanges object."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# read CpGi data set\n",
"cpgi.df = read.table(\"dataJune21th/cpgi.hg19.chr21.bed\", header = FALSE,\n",
" stringsAsFactors=FALSE) \n",
"# remove chr names with \"_\"\n",
"cpgi.df =cpgi.df [grep(\"_\",cpgi.df[,1],invert=TRUE),]\n",
"\n",
"cpgi.gr=GRanges(seqnames=cpgi.df[,1],\n",
" ranges=IRanges(start=cpgi.df[,2],\n",
" end=cpgi.df[,3]))\n",
"\n",
"cpgi.gr"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sometimes pre-processing is necessary"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# read refseq file\n",
"ref.df = read.table(\"dataJune21th/refseq.hg19.chr21.bed\", header = FALSE,\n",
" stringsAsFactors=FALSE) \n",
"ref.gr=GRanges(seqnames=ref.df[,1],\n",
" ranges=IRanges(start=ref.df[,2],\n",
" end=ref.df[,3]),\n",
" strand=ref.df[,6],name=ref.df[,4])\n",
"# get TSS\n",
"tss.gr=ref.gr\n",
"# end of the + strand genes must be equalized to start pos\n",
"end(tss.gr[strand(tss.gr)==\"+\",]) =start(tss.gr[strand(tss.gr)==\"+\",])\n",
"# startof the - strand genes must be equalized to end pos\n",
"start(tss.gr[strand(tss.gr)==\"-\",])=end(tss.gr[strand(tss.gr)==\"-\",])\n",
"# remove duplicated TSSes ie alternative transcripts\n",
"# this keeps the first instance and removes duplicates\n",
"tss.gr=tss.gr[!duplicated(tss.gr),]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Reading the genomic features as text files and converting to GRanges is not the only way to create GRanges object. With the help of rtracklayer package we can directly import."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import.bed(\"dataJune21th/refseq.hg19.chr21.bed\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we will show how to use other packages to automatically obtain the data in GRanges format. But you will not be able to use these methods for every data set so it is good to now how to read data from flat files as well. First, we will use rtracklayer package to download data from UCSC browser. We will download CpG islands as GRanges objects."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"session <- browserSession()\n",
"genome(session) <- \"mm9\"\n",
"## choose CpG island track on chr12\n",
"query <- ucscTableQuery(session, track=\"CpG Islands\",table=\"cpgIslandExt\",\n",
" range=GRangesForUCSCGenome(\"mm9\", \"chr12\"))\n",
"## get the GRanges object for the track\n",
"track(query)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Finding regions that (does/does not) overlap with another set of regions\n",
"\n",
"This is one of the most common tasks in genomics. Usually, you have a set of regions that you are interested in and you want to see if they overlap with another set of regions or see how many of them overlap. A good example is transcription factor binding sites determined by ChIP-seq experiments. In these types of experiments and followed analysis, one usually ends up with genomic regions that are bound by transcription factors. One of the standard next questions would be to annotate binding sites with genomic annotations such as promoter,exon,intron and/or CpG islands. Below is a demonstration of how transcription factor binding sites can be annotated using CpG islands. First, we will get the subset of binding sites that overlap with the CpG islands. In this case, binding sites are ChIP-seq peaks.\n",
"\n",
"We can find the subset of peaks that overlap with the CpG islands using subsetByoverlaps() function. You will also see another way of converting data frames to GRanges."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pk1=read.table(\"dataJune21th/wgEncodeHaibTfbsGm12878Sp1Pcr1xPkRep1.broadPeak.gz\")\n",
"head(pk1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# convert data frame to GRanges\n",
"pk1.gr=makeGRangesFromDataFrame(pk1,\n",
" seqnames.field=c(\"V1\"),\n",
" start.field=c(\"V2\"),\n",
" end.field=c(\"V3\"))\n",
"# only peaks on chr21\n",
"pk1.gr=pk1.gr[seqnames(pk1.gr)==\"chr21\",]\n",
"# get the peaks that overlap with CpG\n",
"# islands\n",
"subsetByOverlaps(pk1.gr,cpgi.gr)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For each CpG island, we can count the number of peaks that overlap with a given CpG island with countOverlaps()."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#count the peaks that\n",
"# overlap with CpG islands\n",
"counts=countOverlaps(pk1.gr,cpgi.gr)\n",
"head(counts)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"findOverlaps() function can be used to see one-to-one overlaps between peaks and CpG islands. It returns a matrix showing which peak overlaps with which CpGi island."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"findOverlaps(pk1.gr,cpgi.gr)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another interesting thing would be to look at the distances to nearest CpG islands for each peak. In addition, just finding the nearest CpG island could also be interesting. Often times, you will need to find nearest TSS or gene to your regions of interest, and the code below is handy for doing that."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# find nearest CpGi to each TSS\n",
"n.ind=nearest(pk1.gr,cpgi.gr)\n",
"# get distance to nearest\n",
"dists=distanceToNearest(pk1.gr,cpgi.gr,select=\"arbitrary\")\n",
"dists"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Some Visualizations: "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# histogram of the distances to nearest TSS\n",
"dist2plot=mcols(dists)[,1]\n",
"hist(log10(dist2plot),xlab=\"log10(dist to nearest TSS)\",\n",
" main=\"Distances\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Tracks - aligning plots along chromosomes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"library(ggbio)\n",
"\n",
"df1 <- data.frame(time = 1:100, score = sin((1:100)/20)*10)\n",
"p1 <- qplot(data = df1, x = time, y = score, geom = \"line\")\n",
"df2 <- data.frame(time = 30:120, score = sin((30:120)/20)*10, value = rnorm(120-30 +1))\n",
"p2 <- ggplot(data = df2, aes(x = time, y = score)) + geom_line() + geom_point(size = 2, aes(color = value))\n",
"tracks(time1 = p1, time2 = p2) + xlim(1, 40) + theme_tracks_sunset()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Plotting genomic ranges\n",
"GRanges objects are essential for storing alignment or annotation ranges in R/Bioconductor. The following creates a sample GRanges object and plots its content."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"set.seed(1); N <- 100; gr <- GRanges(seqnames = sample(c(\"chr1\", \"chr2\", \"chr3\"), size = N, replace = TRUE), IRanges(start = sample(1:300, size = N, replace = TRUE), width = sample(70:75, size = N,replace = TRUE)), strand = sample(c(\"+\", \"-\"), size = N, replace = TRUE), value = rnorm(N, 10, 3), score = rnorm(N, 100, 30), sample = sample(c(\"Normal\", \"Tumor\"), size = N, replace = TRUE), pair = sample(letters, size = N, replace = TRUE))\n",
"autoplot(gr, aes(color = strand, fill = strand), facets = strand ~ seqnames)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Plotting coverage"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Mirrored coverage"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"autoplot(gr, aes(color = strand, fill = strand), facets = strand ~ seqnames, stat = \"coverage\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pos <- sapply(coverage(gr[strand(gr)==\"+\"]), as.numeric)\n",
"pos <- data.frame(Chr=rep(names(pos), sapply(pos, length)), Strand=rep(\"+\", length(unlist(pos))), Position=unlist(sapply(pos, function(x) 1:length(x))), Coverage=as.numeric(unlist(pos)))\n",
"neg <- sapply(coverage(gr[strand(gr)==\"-\"]), as.numeric)\n",
"neg <- data.frame(Chr=rep(names(neg), sapply(neg, length)), Strand=rep(\"-\", length(unlist(neg))), Position=unlist(sapply(neg, function(x) 1:length(x))), Coverage=-as.numeric(unlist(neg)))\n",
"covdf <- rbind(pos, neg)\n",
"p <- ggplot(covdf, aes(Position, Coverage, fill=Strand)) + \n",
"\t geom_bar(stat=\"identity\", position=\"identity\") + facet_wrap(~Chr)\n",
"p"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Circular genome plots"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ggplot(gr) + layout_circle(aes(fill = seqnames), geom = \"rect\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"More complex circular example"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"seqlengths(gr) <- c(400, 500, 700)\n",
"values(gr)$to.gr <- gr[sample(1:length(gr), size = length(gr))]\n",
"idx <- sample(1:length(gr), size = 50)\n",
"gr <- gr[idx]\n",
"ggplot() + layout_circle(gr, geom = \"ideo\", fill = \"gray70\", radius = 7, trackWidth = 3) +\n",
" layout_circle(gr, geom = \"bar\", radius = 10, trackWidth = 4,\n",
" aes(fill = score, y = score)) +\n",
" layout_circle(gr, geom = \"point\", color = \"red\", radius = 14,\n",
" trackWidth = 3, grid = TRUE, aes(y = score)) +\n",
" layout_circle(gr, geom = \"link\", linked.to = \"to.gr\", radius = 6, trackWidth = 1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Alignments and variants plot\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"library(rtracklayer); library(GenomicFeatures); library(Rsamtools); library(GenomicAlignments); library(VariantAnnotation)\n",
"\n",
"options(repr.plot.width = 6, repr.plot.height = 6)\n",
"ga <- readGAlignments(\"dataJune21th/plotdata/SRR064167.fastq.bam\", use.names=TRUE, param=ScanBamParam(which=GRanges(\"Chr5\", IRanges(4000, 8000))))\n",
"p1 <- autoplot(ga, geom = \"rect\")\n",
"p2 <- autoplot(ga, geom = \"line\", stat = \"coverage\")\n",
"vcf <- readVcf(file=\"dataJune21th/plotdata/varianttools_gnsap.vcf\", genome=\"ATH1\")\n",
"p3 <- autoplot(vcf[seqnames(vcf)==\"Chr5\"], type = \"fixed\") + xlim(4000, 8000) + theme(legend.position = \"none\", axis.text.y = element_blank(), axis.ticks.y=element_blank())\n",
"txdb <- makeTxDbFromGFF(file=\"dataJune21th/plotdata/TAIR10_GFF3_trunc.gff\", format=\"gff3\")\n",
"p4 <- autoplot(txdb, which=GRanges(\"Chr5\", IRanges(4000, 8000)), names.expr = \"gene_id\")\n",
"tracks(Reads=p1, Coverage=p2, Variant=p3, Transcripts=p4, heights = c(0.3, 0.2, 0.1, 0.35)) + ylab(\"\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Loading features\n",
"\n",
"Genome annotations are available in a variety of text formats such as GFF3 and GTF. They can be loaded with the `import` function from `rtracklayer`. This GTF file is also from Ensembl, and gives the locations of the genes in the genome, and features within them.\n",
"\n",
"```\n",
"### The start of the .gtf file looks like this:\n",
"# #!genome-build ASM80076v1\n",
"# #!genome-version GCA_000800765.1\n",
"# #!genome-date 2014-12\n",
"# #!genome-build-accession GCA_000800765.1\n",
"# #!genebuild-last-updated 2014-12\n",
"# Chromosome ena gene 190 255 . + . gene_id \"ER3413_4519\"; gene_version \"1\"; gene_name \"thrL\"; gene_source \"ena\"; gene_biotype \"protein_coding\";\n",
"# Chromosome ena transcript 190 255 . + . gene_id \"ER3413_4519\"; gene_version \"1\"; transcript_id \"AIZ54182\"; transcript_version \"1\"; gene_name \"thrL\"; gene_source \"ena\"; gene_biotype \"protein_coding\"; transcript_name \"thrL-1\"; transcript_source \"ena\"; transcript_biotype \"protein_coding\";\n",
"# Chromosome ena exon 190 255 . + . gene_id \"ER3413_4519\"; gene_version \"1\"; transcript_id \"AIZ54182\"; transcript_version \"1\"; exon_number \"1\"; gene_name \"thrL\"; gene_source \"ena\"; gene_biotype \"protein_coding\"; transcript_name \"thrL-1\"; transcript_source \"ena\"; transcript_biotype \"protein_coding\"; exon_id \"AIZ54182-1\"; exon_version \"1\";\n",
"# Chromosome ena CDS 190 252 . + 0 gene_id \"ER3413_4519\"; gene_version \"1\"; transcript_id \"AIZ54182\"; transcript_version \"1\"; exon_number \"1\"; gene_name \"thrL\"; gene_source \"ena\"; gene_biotype \"protein_coding\"; transcript_name \"thrL-1\"; transcript_source \"ena\"; transcript_biotype \"protein_coding\"; protein_id \"AIZ54182\"; protein_version \"1\";\n",
"# ...\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"features <- import(\"dataJune21th/gendata/Escherichia_coli_k_12.GCA_000800765.1.29.gtf\")\n",
"\n",
"# Optional: just retain the columns of metadata we need\n",
"mcols(features) <- mcols(features)[,c(\"type\",\"gene_name\",\"gene_id\")]\n",
"\n",
"features"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use these annotations to grab sequences from the genome."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"feat <- features[4,]\n",
"feat\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The metadata columns let us query the GRanges, for example for a particular gene."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"subset(features, gene_name == \"lacA\")\n",
"# Equivalently:\n",
"# features[features$gene_name == \"lacA\" & !is.na(features$gene_name),]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: `subset` is a generic R function. It is also similar to dplyr's `filter`. The second argument is special, in it you can refer to columns of the GRanges directly.\n",
"\n",
"We could also get all features of a particular type."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cds <- subset(features, type == \"CDS\")\n",
"cds\n",
"# Equivalently:\n",
"# features[features$type == \"CDS\",]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Further data types to explore\n",
"\n",
"**GRangesList, etc**: Many Bioconductor types have a List version -- `GRangesList`, `DNAStringSetList`, etc. For example the exons of a collection of genes could be naturally stored in a `GRangesList`. Most functions that work with `GRanges` will also worked with `GRangesList`, and operate on each list element separately.\n",
"\n",
"**TxDb**: `TxDb` objects represent the hierarchy of genes which contain transcripts which contain exons and CDS (CoDing Sequence) ranges. `TxDb` objects are provided by the `GenomicFeatures` package.\n",
"\n",
"**Seqinfo**: `GRanges` (and various other types) may have associated sequence information accessed with `seqinfo()`. This contains the names and lengths of the sequences the ranges may refer to, and whether they are circular. It allows for some error checking if present."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Finding a known motif\n",
"\n",
"AGGAGGU is the Shine-Dalgarno sequence, which assists binding of the ribosome to a transcript."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"vmatchPattern(\"AGGAGGT\", seqs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`vmatchPattern` is strand specific. If we want matches on the reverse strand we need to also:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"vmatchPattern(reverseComplement(DNAString(\"AGGAGGT\")), seqs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Demanding an exact match here is overly strict. `vmatchPattern` has arguments allowing inexact matches. Alternatively, there is a similar function for searching for a Position Weight Matrix pattern, `matchPWM`.\n",
"\n",
"The following will search both strands, allowing one mismatch, and produce the result in convenient GRanges form:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query <- DNAString(\"AGGAGGT\")\n",
"max.mismatch <- 1\n",
"\n",
"fwd <- vmatchPattern(query, seqs, max.mismatch=max.mismatch)\n",
"fwd <- as(fwd, \"GRanges\")\n",
"strand(fwd) <- \"+\"\n",
"rev <- vmatchPattern(reverseComplement(query), seqs, max.mismatch=max.mismatch)\n",
"rev <- as(rev, \"GRanges\")\n",
"strand(rev) <- \"-\"\n",
"\n",
"complete <- c(fwd, rev)\n",
"complete\n",
"\n",
"# Write to GFF file\n",
"export(complete, \"motif-matches.gff\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We might then view this in the IGV genome browser:\n",
"\n",
"![](Images/igvbrowser.png)\n",
"\n",
"[http://software.broadinstitute.org/software/igv/home]\n",
"\n",
"# De novo motif finding\n",
"\n",
"\n",
"Let's try to \"discover\" the Shine-Dalgarno sequence for ourselves."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Note: bacteria do not have introns\n",
"# In a eukaryote, you would need to merge CDS by transcript\n",
"\n",
"size <- 30\n",
"\n",
"initiation_regions <- flank(cds, size, start=TRUE)\n",
"initiation_seqs <- getSeq(seqs, initiation_regions)\n",
"names(initiation_seqs) <- initiation_regions$gene_id"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Look for any composition bias\n",
"library(seqLogo)\n",
"letter_counts <- consensusMatrix(initiation_seqs)\n",
"probs <- prop.table(letter_counts[1:4,], 2)\n",
"seqLogo(probs, ic.scale=FALSE)\n",
"seqLogo(probs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Next steps\n",
"\n",
"We've seen just the smallest part of what [Bioconductor](http://bioconductor.org/) has to offer in this space.\n",
"\n",
"* [Most downloaded Bioconctor packages](http://bioconductor.org/packages/stats/)\n",
"* [Bioconductor cheat sheet](https://github.com/mikelove/bioc-refcard/blob/master/README.Rmd)\n",
"* [COMBINE Bioconductor course from May 2017](https://combine-australia.github.io/2017-05-19-bioconductor-melbourne/)\n",
"* [Bioconductor's Stack Overflow-style support site](https://support.bioconductor.org/)\n",
"\n",
"Besides software, Bioconductor includes packages with data for model organisms, for example. The data is generally from these central repositories:\n",
"\n",
"* NCBI's Entrez Gene gene database and Refseq reference sequences\n",
"* The EBI's Ensembl genome browser\n",
"* The UCSC genome browser\n",
"\n",
"These organizations will generally obtain genome assemblies from the same ultimate sources. For example, all of the above use the Genome Reference Consortium's GRCh38 DNA sequence for homo sapiens. UCSC likes to call this \"hg38\" but it is the same DNA sequence. These DNA sequences serve as a common frame of reference. However the three organizations above will differ on their exact set of gene and transcript annotations, and all use a different gene and transcript ID system. These annotations are also revised more often than the underlying DNA sequences.\n",
"\n",
"This mess is partly due to American/European rivalry, and partly due to differing goals. The UCSC genome browser has always been about practicality and showing many lines of evidence. The others are more concerned with careful curation and standardization.\n",
"\n",
"Some example packages:\n",
"\n",
"### BSgenome.Hsapiens.UCSC.hg38 \n",
"\n",
"Biostrings genome, Homo sapiens, from the UCSC browser, version hg38.\n",
"\n",
"DNA for chromosomes, usable in the same way as the DNAStringSet used above.\n",
"\n",
"### TxDb.Hsapiens.UCSC.hg38.knownGene\n",
"\n",
"Transcript database, Homo sapiens, from UCSC browser, genome verison hg38, \"knownGene\" gene annotations.\n",
"\n",
"GRanges information for genes and transcripts, much as we loaded from a GTF file above.\n",
"\n",
"### org.Hs.eg.db\n",
"\n",
"Organism Homo sapiens, primary key is Entrez Gene, database.\n",
"\n",
"Translation of gene ids from various databases, assignment to GO terms, KEGG pathways, etc. Entrez Gene ids are used as the primary key.\n",
"\n",
"### biomaRt \n",
"\n",
"Access to BioMart data, on the internet -- translation of gene ids, gene sets, gene information, etc.\n",
"\n",
"### AnnotationHub\n",
"\n",
"AnnotationHub is a way to retrieve data from a more comprehensive set of organisms and data providers than the above styles of package. The retrieved data is returned in an appropriate Bioconductor type. If data is being updated over time (eg improved annotation of a genome), each version receives a unique ID in AnnotationHub, making it much easier to write reproducable analyses.\n",
"\n",
"AnnotationHub also provides access to experimental data which maps to locations on a genome, similar to the sorts of tracks you would load in the UCSC browser.\n",
"\n",
"Files are cached, so they will only be downloaded once.\n",
"\n",
"In the example below, the yeast genome and annotations are retrieved:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"library(AnnotationHub)\n",
"ah <- AnnotationHub()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# ah contains a large collection of records that can be retrieved\n",
"ah\n",
"length(ah)\n",
"colnames( mcols(ah) )\n",
"table( ah$rdataclass )"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# query() searches for terms in an unstructured way\n",
"records <- query(ah, c(\"Ensembl\", \"85\", \"Saccharomyces cerevisiae\"))\n",
"records"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mcols(records)\n",
"mcols(records)[,c(\"title\",\"rdataclass\")]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Having located records of interest,\n",
"# your R script can refer to the specific AH... record,\n",
"# so it always uses the same version of the data.\n",
"ah[[\"AH51399\"]]\n",
"sc_genome <- import( ah[[\"AH51399\"]] )\n",
"sc_granges <- ah[[\"AH51088\"]]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# More recent versions of Bioconductor also allow you to\n",
"# retrieve TxDb (and similar EnsDb) objects.\n",
"\n",
"\n",
"query(ah, c(\"OrgDb\", \"Saccharomyces cerevisiae\"))\n",
"sc_orgdb <- ah[[\"AH49589\"]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Tutorial based on input from:\n",
"\n",
"https://al2na.github.io/compgenr/\n",
"\n",
"https://monashbioinformaticsplatform.github.io/r-more/topics/sequences_and_features.html"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "R",
"language": "R",
"name": "ir"
},
"language_info": {
"codemirror_mode": "r",
"file_extension": ".r",
"mimetype": "text/x-r-source",
"name": "R",
"pygments_lexer": "r",
"version": "3.5.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}