---
title: "_longevityTools_: Connecting Drug- and Age-related Gene Expression Signatures"
author: "Authors: Thomas Girke, Danjuma Quarless, Tyler Backman, Kuan-Fu Ding, Jamison McCorrison, Nik Schork, Dan Evans"
date: "Last update: `r format(Sys.time(), '%d %B, %Y')`"
package: "`r pkg_ver('longevityTools')`"
output:
BiocStyle::html_document:
toc: true
toc_depth: 3
fig_caption: yes
fontsize: 14pt
bibliography: bibtex.bib
---
```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
options(width=100, max.print=1000)
knitr::opts_chunk$set(
eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")))
```
```{r setup, echo=FALSE, messages=FALSE, warnings=FALSE}
suppressPackageStartupMessages({
library(longevityTools)
library(ggplot2) })
```
# Introduction
This vignette is part of the NIA funded Longevity Genomics project. For more information on this project please visit its
website [here](http://www.longevitygenomics.org/projects/). The GitHub repository of the corresponding R package
is available here and the most recent version of this
vignette can be found here.
The project component covered by this vignette analyzes drug- and age-related
genome-wide expression data from public microarray and RNA-Seq experiments. One
of the main objective is the identification drug candidates modulating the
expression of longevity genes and pathways. For this, we compare age-related
expression signatures with those from drug treamtments. The age-related query
signatures are from recent publications such as Peters et al. [-@Peters2015-fc]
and Sood et al. [-@Sood2015-pb], while the drug-related reference signatures
are from the Connectivity Map (CMAP) and LINCS projects [@Lamb2006-uv].
[Back to Table of Contents]()
# Getting Started
## Installation
The R software for running [_`longevityTools`_](https://github.com/tgirke/longevityTools) can be downloaded from [_CRAN_](http://cran.at.r-project.org/). The _`longevityTools`_ package can be installed from the R console using the following _`biocLite`_ install command.
```{r install, eval=FALSE}
source("http://bioconductor.org/biocLite.R") # Sources the biocLite.R installation script
biocLite("tgirke/longevityTools", build_vignettes=FALSE) # Installs package from GitHub
```
[Back to Table of Contents]()
## Loading package and documentation
```{r documentation, eval=FALSE}
library("longevityTools") # Loads the package
library(help="longevityTools") # Lists package info
vignette(topic="longevityTools_eDRUG", package="longevityTools") # Opens vignette
```
[Back to Table of Contents]()
# Setup of working environment
## Import custom functions
Preliminary functions that are still under developement, or not fully tested and documented
can be imported with the `source` command from the `inst/extdata` directory of the package.
```{r source_fct, eval=TRUE}
fctpath <- system.file("extdata", "longevityTools_eDRUG_Fct.R", package="longevityTools")
source(fctpath)
```
[Back to Table of Contents]()
## Create expected directory structure
The following creates the directory structure expected by this workflow. Input data
will be stored in the `data` directory and results will be written to the `results` directory.
All paths are given relative to the present working directory of the user's R session.
```{r work_envir, eval=FALSE}
dir.create("data"); dir.create("data/CEL"); dir.create("results")
```
[Back to Table of Contents]()
# Data downloads
## Download data from Connectivity Map project site
The drug-related expression data are downloaded from the CMAP web site
[here](http://www.broadinstitute.org/cmap). The `getCmap` function downloads
the CMAP rank matrix along with the compound annotations, and `getCmapCEL`
downloads the corresponding 7,056 CEL files. The functions will write the
downloaded files to the `data` and `data/CEL` directories within the present
working directory of the user's R session. Since some of the raw data sets
are large, the functions will only rerun the download if the argument `rerun`
is assigned `TRUE`. If the raw data are not needed then users can skip this
time consuming download step and work with the preprocessed data
obtained in the next section.
```{r download_cmap, eval=FALSE}
getCmap(rerun=FALSE) # Downloads cmap rank matrix and compound annotation files
getCmapCEL(rerun=FALSE) # Download cmap CEL files. Note, this will take some time
```
[Back to Table of Contents]()
## Overview of CMAP data
The experimental design of the CMAP project is defined in the file
`cmap_instances_02.xls`. Note, this file required some cleaning in LibreOffice
(Excel would work for this too). After this it was saved as tab delimited txt
file named
[cmap_instances_02.txt](http://biocluster.ucr.edu/~tgirke/projects/longevity/cmap/data/cmap_instances_02.txt).
The following count statisitics are extracted from this file.
The panel of cell lines used by CMAP includes
[MCF7](http://www.broadinstitute.org/cmap/help_topics_linkified.jsp#mcf7),
[ssMCF7](http://www.broadinstitute.org/cmap/help_topics_linkified.jsp#mcf7),
[HL60](http://www.broadinstitute.org/cmap/help_topics_linkified.jsp#hl60),
[PC3](http://www.broadinstitute.org/cmap/help_topics_linkified.jsp#pc3) and
[SKMEL5](http://www.broadinstitute.org/cmap/help_topics_linkified.jsp#skmel5).
Each cell type was subjected to the following number of total treatments and number
of distinct drugs, respectively. The total number of drugs used by CMAP is 1,309.
```{r overview_cmap_drugs, eval=TRUE}
library(ggplot2); library(reshape2)
cmap <- read.delim("./data/cmap_instances_02.txt", check.names=FALSE)
df <- data.frame(table(cmap[, "cell2"]), as.numeric(table(cmap[!duplicated(paste0(cmap$cmap_name, cmap$cell2)),"cell2"])))
colnames(df) <- c("Cell_type", "Treatments", "Drugs")
df <- melt(df, id.vars=c("Cell_type"), variable.name = "Samples", value.name="Counts")
ggplot(df, aes(Cell_type, Counts, fill=Samples)) +
geom_bar(position="dodge", stat="identity") +
ggtitle("Number of treatments by cell types")
```
The number Affymetrix chip used in the experiments is plotted here for each of
the three chip types used by CMAP:
```{r overview_cmap_chip_type, eval=TRUE}
df <- data.frame(table(cmap$array3)); colnames(df) <- c("Chip_type", "Counts")
ggplot(df, aes(Chip_type, Counts)) +
geom_bar(position="dodge", stat="identity", fill="cornflowerblue") +
ggtitle("Number of chip types")
```
[Back to Table of Contents]()
# Pre-processing of CEL files
## Determine chip type from CEL files
The CMAP data set is based on three different Affymetrix chip types (HG-U133A,
HT_HG-U133A and U133AAofAv2). The following extracts the chip type information
from the CEL files and stores the result in an `rds` file with the path
`./data/chiptype.rds`. Users who skipped the download of the CEL files can
download this file [here](http://biocluster.ucr.edu/~tgirke/projects/longevity/cmap/data/chiptype.rds).
```{r get_cel_type, eval=FALSE}
celfiles <- list.files("./data/CEL", pattern=".CEL$")
chiptype <- sapply(celfiles, function(x) affxparser::readCelHeader(paste0("data/CEL/", x))$chiptype)
if(FALSE) saveRDS(chiptype, "./data/chiptype.rds") # if(FALSE) protects this line from accidental execution!
```
[Back to Table of Contents]()
## Normalization of CEL files
The follwoing processes the CEL files from each chip type separately using the
MAS5 normalization algorithm. The results will be written to 3 subdirectores
under `data` that are named after the chip type names. To save time, the
processing is parallelized with `BiocParallel` to run on 100 CPU cores of a
computer cluster with a scheduler (_e.g._ Torque). The number of CEL files from
each chip type are: 807 CEL files from HG-U133A, 6029 CEL files from
HT_HG-U133A, and 220 CEL files from U133AAofAv2. Note, these numbers are slightly
different than those reported in the `cmap_instances_02.txt` file. The MAS5 normalized data
sets can be downloaded here:
[HG-U133A](http://biocluster.ucr.edu/~tgirke/projects/longevity/cmap/data/HG-U133A/all_mas5exprs.rds),
[HT_HG-U133A](http://biocluster.ucr.edu/~tgirke/projects/longevity/cmap/data/HT_HG-U133A/all_mas5exprs.rds),
[U133AAofAv2](http://biocluster.ucr.edu/~tgirke/projects/longevity/cmap/data/U133AAofAv2/all_mas5exprs.rds).
```{r normalize_chips, eval=FALSE}
library(BiocParallel); library(BatchJobs); library(affy)
chiptype <- readRDS("./data/chiptype.rds")
chiptype_list <- split(names(chiptype), as.character(chiptype))
normalizeCel(chiptype_list, rerun=FALSE)
```
[Back to Table of Contents]()
## Combine results from same chip type in single data frame
```{r comb_chip_type_data, eval=FALSE}
chiptype_dir <- unique(readRDS("./data/chiptype.rds"))
combineResults(chiptype_dir, rerun=FALSE)
```
[Back to Table of Contents]()
## Clean-up of intermediate files
This deletes intermediate files. Before executing these lines, please make sure that this is what you want.
```{r cleanup1, eval=FALSE}
for(i in seq_along(chiptype_dir)) unlink(list.files(paste0("data/", chiptype_dir[i]), pattern="cellbatch", full.names=TRUE), recursive=TRUE)
unlink("data/CEL/*.CEL") # Deletes downloaded CEL files
```
[Back to Table of Contents]()
## Obtain annotation information
The following generates annotation information for the Affymetirx probe set
identifiers. Note, the three different Affymetrix chip types used by CMAP
share most probe set ids (>95%), meaning it is possible to combine the data
after normalization and use the same annotation package for all of them. The
annotation libraries for the chip types HG-U133A and HT_HG-U133A are
`hgu133a.db` and `hthgu133a.db`, respectively. However, there is no annotation
library (e.g. CDF) available for U133AAofAv2. The annotation file can be downloaded
from here: [`myAnnot.xls`](http://biocluster.ucr.edu/~tgirke/projects/longevity/cmap/results/myAnnot.xls).
```{r affyid_annotations, eval=FALSE, message=FALSE}
library(hgu133a.db)
myAnnot <- data.frame(ACCNUM=sapply(contents(hgu133aACCNUM), paste, collapse=", "),
SYMBOL=sapply(contents(hgu133aSYMBOL), paste, collapse=", "),
UNIGENE=sapply(contents(hgu133aUNIGENE), paste, collapse=", "),
ENTREZID=sapply(contents(hgu133aENTREZID), paste, collapse=", "),
ENSEMBL=sapply(contents(hgu133aENSEMBL), paste, collapse=", "),
DESC=sapply(contents(hgu133aGENENAME), paste, collapse=", "))
write.table(myAnnot, file="./results/myAnnot.xls", quote=FALSE, sep="\t", col.names = NA)
saveRDS(myAnnot, "./results/myAnnot.rds")
```
[Back to Table of Contents]()
# DEG analysis with `limma`
## Generate list of CEL names defining treatment vs. control comparisons
The `sampleList` function extracts the sample comparisons (contrasts) from the
CMAP annotation table and stores them as a list.
```{r cel_file_list, eval=FALSE}
cmap <- read.delim("./data/cmap_instances_02.txt", check.names=FALSE)
# comp_list <- sampleList(cmap, myby="CMP")
comp_list <- sampleList(cmap, myby="CMP_CELL")
```
[Back to Table of Contents]()
## Load normalized expression data
The following loads the MAS5 normalized expression data into a single `data.frame`.
To accelerate the import, the data is read from `rds` files.
```{r load_mas5_data, eval=FALSE}
chiptype_dir <- unique(readRDS("./data/chiptype.rds"))
df1 <- readRDS(paste0("data/", chiptype_dir[1], "/", "all_mas5exprs.rds"))
df2 <- readRDS(paste0("data/", chiptype_dir[2], "/", "all_mas5exprs.rds"))
df3 <- readRDS(paste0("data/", chiptype_dir[3], "/", "all_mas5exprs.rds"))
affyid <- rownames(df1)[rownames(df1) %in% rownames(df2)]; affyid <- affyid[affyid %in% rownames(df3)]
mas5df <- cbind(df1[affyid,], df2[affyid,], df3[affyid,])
```
## Transform probe set to gene level data
The next step generates gene level expression values. If genes are represented by several
probe sets then their mean intensities are used. This is necessary because
the U133 chip contains many genes with duplicated probe sets. Probe sets not matching
any gene are removed.
```{r mean_mas5_data, eval=FALSE}
myAnnot <- readRDS("./results/myAnnot.rds")
myAnnot <- myAnnot[as.character(myAnnot[,"ENTREZID"]) != "NA",]
mas5df <- mas5df[rownames(myAnnot),]
idlist <- tapply(row.names(myAnnot), as.character(myAnnot$ENTREZID), c)
mas5df <- t(sapply(names(idlist), function(x) colMeans(mas5df[idlist[[x]], ])))
```
[Back to Table of Contents]()
## DEG analysis with `limma`
The analysis of differentially expressed genes (DEGs) is performed with the `limma` package.
Genes meeting the chosen cutoff criteria are reported as DEGs (below set to FDR of 10% and
a minimum fold change of 2). The DEG matrix is written to a file named
[`degMA.xls`](http://biocluster.ucr.edu/~tgirke/projects/longevity/cmap/results/degMA.xls).
```{r deg_limma, eval=FALSE}
degList <- runLimma(df=log2(mas5df), comp_list, fdr=0.10, foldchange=1, verbose=TRUE, affyid=NULL)
write.table(degList$DEG, file="./results/degMA.xls", quote=FALSE, sep="\t", col.names = NA)
saveRDS(degList$DEG, "./results/degMA.rds") # saves binary matrix
saveRDS(degList, "./results/degList.rds") # saves entire degList
```
[Back to Table of Contents]()
## Number of DEGs across drug treatments
The following plots the number of drug treatments (y-axis) for increasing bin sizes (x-axis)
of DEGs.
```{r deg_distr, eval=TRUE, message=TRUE}
degMAgene <- readRDS("./results/degMA.rds")
y <- as.numeric(colSums(degMAgene))
interval <- table(cut(y, right=FALSE, dig.lab=5, breaks=c(0, 5, 10, 50, 100, 200, 500, 1000, 10000)))
df <- data.frame(interval); colnames(df) <- c("Bins", "Counts")
ggplot(df, aes(Bins, Counts)) +
geom_bar(position="dodge", stat="identity", fill="cornflowerblue") +
ggtitle("DEG numbers by bins")
```
[Back to Table of Contents]()
## Identify DEG overlaps with Peters et al. [-@Peters2015-fc]
Peters et al. [-@Peters2015-fc] reported 1,497 age-related gene expression
signatures. The `intersectStats` function computes their intersects with each
of the 3,318 drug-responsive DEG sets from CMAP. The result includes the
Jaccard index as a simple similarity metric for gene sets as well as the raw
and adjusted p-values based on the hypergeometric distribution expressing how
likely it is to obtain the observed intersect sizes just by chance. The
results for the 20 top scoring drugs are given below and the full data set is
written to a file named
[`degOL_PMID26490707.xls`](http://biocluster.ucr.edu/~tgirke/projects/longevity/cmap/results/degOL_PMID26490707.xls).
```{r deg_overlaps_PMID26490707, eval=TRUE}
PMID26490707 <- read.delim("./data/PMID26490707_S1.xls", comment="#")
myAnnot <- readRDS("./results/myAnnot.rds")
geneid <- as.character(PMID26490707$"NEW.Entrez.ID")
degMAgene <- readRDS("./results/degMA.rds") # Faster than read.delim()
degMAsub <- degMAgene[rownames(degMAgene) %in% geneid,]
degOL_PMID26490707 <- intersectStats(degMAgene, degMAsub)
write.table(degOL_PMID26490707, file="./results/degOL_PMID26490707.xls", quote=FALSE, sep="\t", col.names = NA)
sum(degOL_PMID26490707[,1] > 0) # Drugs with any overlap
degOL_PMID26490707[1:20,]
```
[Back to Table of Contents]()
## Identify DEG overlaps with Sood et al. [-@Sood2015-pb]
Sood et al. [-@Sood2015-pb] reported 150 age-related gene expression signatures.
The `intersectStats` function computes their intersects with each of the 3,318
drug-responsive DEG sets from CMAP. The result includes the Jaccard index as a simple
similarity metric for gene sets as well as the raw and adjusted p-values based on the
hypergeometric distribution expressing how likely it is to observe the observed intersect
sizes just by chance. The results for the 20 top scoring drugs are given below and the full
data set is written to a file named [`degOL_PMID26343147.xls`](http://biocluster.ucr.edu/~tgirke/projects/longevity/cmap/results/degOL_PMID26343147.xls).
```{r deg_overlaps_PMID26343147, eval=TRUE}
PMID26343147 <- read.delim("./data/PMID26343147_S1T1.xls", check.names=FALSE, comment="#")
myAnnot <- readRDS("./results/myAnnot.rds")
geneid <- as.character(myAnnot[rownames(myAnnot) %in% as.character(PMID26343147[,1]), "ENTREZID"])
geneid <- geneid[geneid!="NA"]
degMA <- readRDS("./results/degMA.rds") # Faster then read.delim()
degMA <- degMA[ , !is.na(colSums(degMA))] # Remove columns where DEG analysis was not possible
degMAsub <- degMA[geneid,]
degOL_PMID26343147 <- intersectStats(degMAgene, degMAsub)
write.table(degOL_PMID26343147, file="./results/degOL_PMID26343147.xls", quote=FALSE, sep="\t", col.names = NA)
sum(degOL_PMID26343147[,1] > 0) # Drugs with any overlap
degOL_PMID26343147[1:20,] # Top 20 scoring drugs
```
[Back to Table of Contents]()
## Drugs affecting known longevity genes
The following identifies CMAP drugs affecting the expression of the IGF1 or IGF1R genes.
The final result is written to a file named [`deg_IGF1.xls`](http://biocluster.ucr.edu/~tgirke/projects/longevity/cmap/results/deg_IGF1.xls).
```{r deg_queries, eval=TRUE}
genesymbols <- c("IGF1", "IGF1R")
geneids <- unique(as.character(myAnnot[myAnnot$SYMBOL %in% genesymbols,"ENTREZID"]))
names(geneids) <- unique(as.character(myAnnot[myAnnot$SYMBOL %in% genesymbols,"SYMBOL"]))
degList <- readRDS("./results/degList.rds")
df <- data.frame(row.names=colnames(degList$DEG), check.names=FALSE)
index <- which(colSums(degList$DEG[geneids,])>= 1)
for(i in seq_along(geneids)) {
tmp <- data.frame(DEG=degList$DEG[geneids[i],index], logFC=degList$logFC[geneids[i],index], FDR=degList$FDR[geneids[i],index])
colnames(tmp) <- paste0(names(geneids)[i], "_", colnames(tmp))
df <- cbind(df, tmp[rownames(df),] )
}
df <- df[names(index),]
write.table(df, file="./results/deg_IGF1.xls", quote=FALSE, sep="\t", col.names = NA)
```
Now the final `data.frame` can be sorted by increasing mean FDR values.
```{r sort_pvalue, eval=TRUE}
igfDF <- read.delim("./results/deg_IGF1.xls", row.names=1)
igfDF[order(rowMeans(igfDF[,c(3,6)])),][1:20,]
```
[Back to Table of Contents]()
## Plot structures of compounds
```{r plot_sdf, eval=TRUE}
library(ChemmineR)
mypath <- system.file("extdata", "longevitydrugs.sdf", package="longevityTools")
mypath <- "../inst/extdata/longevitydrugs.sdf"
sdfset <- read.SDFset(mypath)
data(sdfsample)
sdfsample
plot(sdfsample[1:4], print=FALSE)
```
[Back to Table of Contents]()
# Connectivity maps enrichment analysis
The connectivity maps approach is a rank-based enrichment method utilizing the KS test [@Lamb2006-uv].
It measures the similarities of expression signatures based on the enrichment of up- and
down-regulated genes at the top and bottom of sorted (ranked) gene lists.
## Query drug signatures
The following uses the 1,497 age-related gene expression signatures from Peters et al.
[-@Peters2015-fc] as a query against the CMAP signatures. The results are sorted by the
ES Distance and the top scoring 20 drugs are given below. The full result table is
written to a file named [`drugcmap2.xls`](http://biocluster.ucr.edu/~tgirke/projects/longevity/cmap/results/drugcmap2.xls).
```{r drug_enrichment, eval=TRUE, message=FALSE}
library(DrugVsDisease)
PMID26490707 <- read.delim("./data/PMID26490707_S1.xls", comment="#", check.names=FALSE)
data(drugRL)
PMID26490707sub <- PMID26490707[PMID26490707[,"NEW-Gene-ID"] %in% rownames(drugRL),]
PMID26490707sub <- PMID26490707sub[order(PMID26490707sub$Zscore, decreasing=TRUE),]
PMID26490707sub <- rbind(head(PMID26490707sub, 100), tail(PMID26490707sub, 100)) # Subsets to top 200 DEGs
testprofiles <- list(ranklist=matrix(PMID26490707sub$Zscore, dimnames=list(PMID26490707sub[,"NEW-Gene-ID"])),
pvalues=matrix(PMID26490707sub$P, dimnames=list(PMID26490707sub[,"NEW-Gene-ID"])))
drugcmap <- classifyprofile(data=testprofiles$ranklist, case="disease", signif.fdr=0.5, no.signif=20)
drugcmap2 <- classifyprofile(data=testprofiles$ranklist, case="disease",
pvalues=testprofiles$pvalues, cytoout=FALSE, type="dynamic",
dynamic.fdr=5, signif.fdr=5, adj="BH", no.signif=1000)
write.table(drugcmap2, file="./results/drugcmap2.xls", quote=FALSE, sep="\t", col.names = NA)
drugcmap2[[1]][1:20,]
```
[Back to Table of Contents]()
## Query disease signatures
The same query is performed against a reference set of disease expression signatures.
The results are sorted by the ES Distance and the top scoring 20 drugs are given below.
The full result table is written to a file named [`diseasecmap2.xls`](http://biocluster.ucr.edu/~tgirke/projects/longevity/cmap/results/diseasecmap2.xls).
```{r disease_enrichment, eval=TRUE, message=TRUE}
PMID26490707 <- read.delim("./data/PMID26490707_S1.xls", comment="#", check.names=FALSE)
data(diseaseRL)
PMID26490707sub <- PMID26490707[PMID26490707[,"NEW-Gene-ID"] %in% rownames(diseaseRL),]
testprofiles <- list(ranklist=matrix(PMID26490707sub$Zscore, dimnames=list(PMID26490707sub[,"NEW-Gene-ID"])),
pvalues=matrix(PMID26490707sub$P, dimnames=list(PMID26490707sub[,"NEW-Gene-ID"])))
diseasecmap <- classifyprofile(data=testprofiles$ranklist, case="drug", signif.fdr=0.5, no.signif=20)
diseasecmap2 <- classifyprofile(data=testprofiles$ranklist, case="drug",
pvalues=testprofiles$pvalues, cytoout=FALSE, type="dynamic",
dynamic.fdr=5, adj="BH", no.signif=100)
write.table(diseasecmap2, file="./results/diseasecmap2.xls", quote=FALSE, sep="\t", col.names = NA)
diseasecmap2[[1]][1:20,]
```
[Back to Table of Contents]()
# Age-drug network analysis
In progress...
[Back to Table of Contents]()
# Age-disease network analysis
In progress...
[Back to Table of Contents]()
# Funding
This project is funded by NIH grant U24AG051129 awarded by the National Intitute on Aging (NIA).
[Back to Table of Contents]()
# Version information
```{r sessionInfo}
sessionInfo()
```
[Back to Table of Contents]()
# References