--- title: "_longevityTools_: Connecting Drug- and Age-related Gene Expression Signatures" author: "Authors: Thomas Girke, Danjuma Quarless, Tyler Backman, Kuan-Fu Ding, Jamison McCorrison, Nik Schork, Dan Evans" date: "Last update: `r format(Sys.time(), '%d %B, %Y')`" package: "`r pkg_ver('longevityTools')`" output: BiocStyle::html_document: toc: true toc_depth: 3 fig_caption: yes fontsize: 14pt bibliography: bibtex.bib --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() options(width=100, max.print=1000) knitr::opts_chunk$set( eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")), cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE"))) ``` ```{r setup, echo=FALSE, messages=FALSE, warnings=FALSE} suppressPackageStartupMessages({ library(longevityTools) library(ggplot2) }) ``` # Introduction This vignette is part of the NIA funded Longevity Genomics project. For more information on this project please visit its website [here](http://www.longevitygenomics.org/projects/). The GitHub repository of the corresponding R package is available here and the most recent version of this vignette can be found here. The project component covered by this vignette analyzes drug- and age-related genome-wide expression data from public microarray and RNA-Seq experiments. One of the main objective is the identification drug candidates modulating the expression of longevity genes and pathways. For this, we compare age-related expression signatures with those from drug treamtments. The age-related query signatures are from recent publications such as Peters et al. [-@Peters2015-fc] and Sood et al. [-@Sood2015-pb], while the drug-related reference signatures are from the Connectivity Map (CMAP) and LINCS projects [@Lamb2006-uv].

[Back to Table of Contents]()

# Getting Started ## Installation The R software for running [_`longevityTools`_](https://github.com/tgirke/longevityTools) can be downloaded from [_CRAN_](http://cran.at.r-project.org/). The _`longevityTools`_ package can be installed from the R console using the following _`biocLite`_ install command. ```{r install, eval=FALSE} source("http://bioconductor.org/biocLite.R") # Sources the biocLite.R installation script biocLite("tgirke/longevityTools", build_vignettes=FALSE) # Installs package from GitHub ```

[Back to Table of Contents]()

## Loading package and documentation ```{r documentation, eval=FALSE} library("longevityTools") # Loads the package library(help="longevityTools") # Lists package info vignette(topic="longevityTools_eDRUG", package="longevityTools") # Opens vignette ```

[Back to Table of Contents]()

# Setup of working environment ## Import custom functions Preliminary functions that are still under developement, or not fully tested and documented can be imported with the `source` command from the `inst/extdata` directory of the package. ```{r source_fct, eval=TRUE} fctpath <- system.file("extdata", "longevityTools_eDRUG_Fct.R", package="longevityTools") source(fctpath) ```

[Back to Table of Contents]()

## Create expected directory structure The following creates the directory structure expected by this workflow. Input data will be stored in the `data` directory and results will be written to the `results` directory. All paths are given relative to the present working directory of the user's R session. ```{r work_envir, eval=FALSE} dir.create("data"); dir.create("data/CEL"); dir.create("results") ```

[Back to Table of Contents]()

# Data downloads ## Download data from Connectivity Map project site The drug-related expression data are downloaded from the CMAP web site [here](http://www.broadinstitute.org/cmap). The `getCmap` function downloads the CMAP rank matrix along with the compound annotations, and `getCmapCEL` downloads the corresponding 7,056 CEL files. The functions will write the downloaded files to the `data` and `data/CEL` directories within the present working directory of the user's R session. Since some of the raw data sets are large, the functions will only rerun the download if the argument `rerun` is assigned `TRUE`. If the raw data are not needed then users can skip this time consuming download step and work with the preprocessed data obtained in the next section. ```{r download_cmap, eval=FALSE} getCmap(rerun=FALSE) # Downloads cmap rank matrix and compound annotation files getCmapCEL(rerun=FALSE) # Download cmap CEL files. Note, this will take some time ```

[Back to Table of Contents]()

## Overview of CMAP data The experimental design of the CMAP project is defined in the file `cmap_instances_02.xls`. Note, this file required some cleaning in LibreOffice (Excel would work for this too). After this it was saved as tab delimited txt file named [cmap_instances_02.txt](http://biocluster.ucr.edu/~tgirke/projects/longevity/cmap/data/cmap_instances_02.txt). The following count statisitics are extracted from this file. The panel of cell lines used by CMAP includes [MCF7](http://www.broadinstitute.org/cmap/help_topics_linkified.jsp#mcf7), [ssMCF7](http://www.broadinstitute.org/cmap/help_topics_linkified.jsp#mcf7), [HL60](http://www.broadinstitute.org/cmap/help_topics_linkified.jsp#hl60), [PC3](http://www.broadinstitute.org/cmap/help_topics_linkified.jsp#pc3) and [SKMEL5](http://www.broadinstitute.org/cmap/help_topics_linkified.jsp#skmel5). Each cell type was subjected to the following number of total treatments and number of distinct drugs, respectively. The total number of drugs used by CMAP is 1,309. ```{r overview_cmap_drugs, eval=TRUE} library(ggplot2); library(reshape2) cmap <- read.delim("./data/cmap_instances_02.txt", check.names=FALSE) df <- data.frame(table(cmap[, "cell2"]), as.numeric(table(cmap[!duplicated(paste0(cmap$cmap_name, cmap$cell2)),"cell2"]))) colnames(df) <- c("Cell_type", "Treatments", "Drugs") df <- melt(df, id.vars=c("Cell_type"), variable.name = "Samples", value.name="Counts") ggplot(df, aes(Cell_type, Counts, fill=Samples)) + geom_bar(position="dodge", stat="identity") + ggtitle("Number of treatments by cell types") ``` The number Affymetrix chip used in the experiments is plotted here for each of the three chip types used by CMAP: ```{r overview_cmap_chip_type, eval=TRUE} df <- data.frame(table(cmap$array3)); colnames(df) <- c("Chip_type", "Counts") ggplot(df, aes(Chip_type, Counts)) + geom_bar(position="dodge", stat="identity", fill="cornflowerblue") + ggtitle("Number of chip types") ```

[Back to Table of Contents]()

# Pre-processing of CEL files ## Determine chip type from CEL files The CMAP data set is based on three different Affymetrix chip types (HG-U133A, HT_HG-U133A and U133AAofAv2). The following extracts the chip type information from the CEL files and stores the result in an `rds` file with the path `./data/chiptype.rds`. Users who skipped the download of the CEL files can download this file [here](http://biocluster.ucr.edu/~tgirke/projects/longevity/cmap/data/chiptype.rds). ```{r get_cel_type, eval=FALSE} celfiles <- list.files("./data/CEL", pattern=".CEL$") chiptype <- sapply(celfiles, function(x) affxparser::readCelHeader(paste0("data/CEL/", x))$chiptype) if(FALSE) saveRDS(chiptype, "./data/chiptype.rds") # if(FALSE) protects this line from accidental execution! ```

[Back to Table of Contents]()

## Normalization of CEL files The follwoing processes the CEL files from each chip type separately using the MAS5 normalization algorithm. The results will be written to 3 subdirectores under `data` that are named after the chip type names. To save time, the processing is parallelized with `BiocParallel` to run on 100 CPU cores of a computer cluster with a scheduler (_e.g._ Torque). The number of CEL files from each chip type are: 807 CEL files from HG-U133A, 6029 CEL files from HT_HG-U133A, and 220 CEL files from U133AAofAv2. Note, these numbers are slightly different than those reported in the `cmap_instances_02.txt` file. The MAS5 normalized data sets can be downloaded here: [HG-U133A](http://biocluster.ucr.edu/~tgirke/projects/longevity/cmap/data/HG-U133A/all_mas5exprs.rds), [HT_HG-U133A](http://biocluster.ucr.edu/~tgirke/projects/longevity/cmap/data/HT_HG-U133A/all_mas5exprs.rds), [U133AAofAv2](http://biocluster.ucr.edu/~tgirke/projects/longevity/cmap/data/U133AAofAv2/all_mas5exprs.rds). ```{r normalize_chips, eval=FALSE} library(BiocParallel); library(BatchJobs); library(affy) chiptype <- readRDS("./data/chiptype.rds") chiptype_list <- split(names(chiptype), as.character(chiptype)) normalizeCel(chiptype_list, rerun=FALSE) ```

[Back to Table of Contents]()

## Combine results from same chip type in single data frame ```{r comb_chip_type_data, eval=FALSE} chiptype_dir <- unique(readRDS("./data/chiptype.rds")) combineResults(chiptype_dir, rerun=FALSE) ```

[Back to Table of Contents]()

## Clean-up of intermediate files This deletes intermediate files. Before executing these lines, please make sure that this is what you want. ```{r cleanup1, eval=FALSE} for(i in seq_along(chiptype_dir)) unlink(list.files(paste0("data/", chiptype_dir[i]), pattern="cellbatch", full.names=TRUE), recursive=TRUE) unlink("data/CEL/*.CEL") # Deletes downloaded CEL files ```

[Back to Table of Contents]()

## Obtain annotation information The following generates annotation information for the Affymetirx probe set identifiers. Note, the three different Affymetrix chip types used by CMAP share most probe set ids (>95%), meaning it is possible to combine the data after normalization and use the same annotation package for all of them. The annotation libraries for the chip types HG-U133A and HT_HG-U133A are `hgu133a.db` and `hthgu133a.db`, respectively. However, there is no annotation library (e.g. CDF) available for U133AAofAv2. The annotation file can be downloaded from here: [`myAnnot.xls`](http://biocluster.ucr.edu/~tgirke/projects/longevity/cmap/results/myAnnot.xls). ```{r affyid_annotations, eval=FALSE, message=FALSE} library(hgu133a.db) myAnnot <- data.frame(ACCNUM=sapply(contents(hgu133aACCNUM), paste, collapse=", "), SYMBOL=sapply(contents(hgu133aSYMBOL), paste, collapse=", "), UNIGENE=sapply(contents(hgu133aUNIGENE), paste, collapse=", "), ENTREZID=sapply(contents(hgu133aENTREZID), paste, collapse=", "), ENSEMBL=sapply(contents(hgu133aENSEMBL), paste, collapse=", "), DESC=sapply(contents(hgu133aGENENAME), paste, collapse=", ")) write.table(myAnnot, file="./results/myAnnot.xls", quote=FALSE, sep="\t", col.names = NA) saveRDS(myAnnot, "./results/myAnnot.rds") ```

[Back to Table of Contents]()

# DEG analysis with `limma` ## Generate list of CEL names defining treatment vs. control comparisons The `sampleList` function extracts the sample comparisons (contrasts) from the CMAP annotation table and stores them as a list. ```{r cel_file_list, eval=FALSE} cmap <- read.delim("./data/cmap_instances_02.txt", check.names=FALSE) # comp_list <- sampleList(cmap, myby="CMP") comp_list <- sampleList(cmap, myby="CMP_CELL") ```

[Back to Table of Contents]()

## Load normalized expression data The following loads the MAS5 normalized expression data into a single `data.frame`. To accelerate the import, the data is read from `rds` files. ```{r load_mas5_data, eval=FALSE} chiptype_dir <- unique(readRDS("./data/chiptype.rds")) df1 <- readRDS(paste0("data/", chiptype_dir[1], "/", "all_mas5exprs.rds")) df2 <- readRDS(paste0("data/", chiptype_dir[2], "/", "all_mas5exprs.rds")) df3 <- readRDS(paste0("data/", chiptype_dir[3], "/", "all_mas5exprs.rds")) affyid <- rownames(df1)[rownames(df1) %in% rownames(df2)]; affyid <- affyid[affyid %in% rownames(df3)] mas5df <- cbind(df1[affyid,], df2[affyid,], df3[affyid,]) ``` ## Transform probe set to gene level data The next step generates gene level expression values. If genes are represented by several probe sets then their mean intensities are used. This is necessary because the U133 chip contains many genes with duplicated probe sets. Probe sets not matching any gene are removed. ```{r mean_mas5_data, eval=FALSE} myAnnot <- readRDS("./results/myAnnot.rds") myAnnot <- myAnnot[as.character(myAnnot[,"ENTREZID"]) != "NA",] mas5df <- mas5df[rownames(myAnnot),] idlist <- tapply(row.names(myAnnot), as.character(myAnnot$ENTREZID), c) mas5df <- t(sapply(names(idlist), function(x) colMeans(mas5df[idlist[[x]], ]))) ```

[Back to Table of Contents]()

## DEG analysis with `limma` The analysis of differentially expressed genes (DEGs) is performed with the `limma` package. Genes meeting the chosen cutoff criteria are reported as DEGs (below set to FDR of 10% and a minimum fold change of 2). The DEG matrix is written to a file named [`degMA.xls`](http://biocluster.ucr.edu/~tgirke/projects/longevity/cmap/results/degMA.xls). ```{r deg_limma, eval=FALSE} degList <- runLimma(df=log2(mas5df), comp_list, fdr=0.10, foldchange=1, verbose=TRUE, affyid=NULL) write.table(degList$DEG, file="./results/degMA.xls", quote=FALSE, sep="\t", col.names = NA) saveRDS(degList$DEG, "./results/degMA.rds") # saves binary matrix saveRDS(degList, "./results/degList.rds") # saves entire degList ```

[Back to Table of Contents]()

## Number of DEGs across drug treatments The following plots the number of drug treatments (y-axis) for increasing bin sizes (x-axis) of DEGs. ```{r deg_distr, eval=TRUE, message=TRUE} degMAgene <- readRDS("./results/degMA.rds") y <- as.numeric(colSums(degMAgene)) interval <- table(cut(y, right=FALSE, dig.lab=5, breaks=c(0, 5, 10, 50, 100, 200, 500, 1000, 10000))) df <- data.frame(interval); colnames(df) <- c("Bins", "Counts") ggplot(df, aes(Bins, Counts)) + geom_bar(position="dodge", stat="identity", fill="cornflowerblue") + ggtitle("DEG numbers by bins") ```

[Back to Table of Contents]()

## Identify DEG overlaps with Peters et al. [-@Peters2015-fc] Peters et al. [-@Peters2015-fc] reported 1,497 age-related gene expression signatures. The `intersectStats` function computes their intersects with each of the 3,318 drug-responsive DEG sets from CMAP. The result includes the Jaccard index as a simple similarity metric for gene sets as well as the raw and adjusted p-values based on the hypergeometric distribution expressing how likely it is to obtain the observed intersect sizes just by chance. The results for the 20 top scoring drugs are given below and the full data set is written to a file named [`degOL_PMID26490707.xls`](http://biocluster.ucr.edu/~tgirke/projects/longevity/cmap/results/degOL_PMID26490707.xls). ```{r deg_overlaps_PMID26490707, eval=TRUE} PMID26490707 <- read.delim("./data/PMID26490707_S1.xls", comment="#") myAnnot <- readRDS("./results/myAnnot.rds") geneid <- as.character(PMID26490707$"NEW.Entrez.ID") degMAgene <- readRDS("./results/degMA.rds") # Faster than read.delim() degMAsub <- degMAgene[rownames(degMAgene) %in% geneid,] degOL_PMID26490707 <- intersectStats(degMAgene, degMAsub) write.table(degOL_PMID26490707, file="./results/degOL_PMID26490707.xls", quote=FALSE, sep="\t", col.names = NA) sum(degOL_PMID26490707[,1] > 0) # Drugs with any overlap degOL_PMID26490707[1:20,] ```

[Back to Table of Contents]()

## Identify DEG overlaps with Sood et al. [-@Sood2015-pb] Sood et al. [-@Sood2015-pb] reported 150 age-related gene expression signatures. The `intersectStats` function computes their intersects with each of the 3,318 drug-responsive DEG sets from CMAP. The result includes the Jaccard index as a simple similarity metric for gene sets as well as the raw and adjusted p-values based on the hypergeometric distribution expressing how likely it is to observe the observed intersect sizes just by chance. The results for the 20 top scoring drugs are given below and the full data set is written to a file named [`degOL_PMID26343147.xls`](http://biocluster.ucr.edu/~tgirke/projects/longevity/cmap/results/degOL_PMID26343147.xls). ```{r deg_overlaps_PMID26343147, eval=TRUE} PMID26343147 <- read.delim("./data/PMID26343147_S1T1.xls", check.names=FALSE, comment="#") myAnnot <- readRDS("./results/myAnnot.rds") geneid <- as.character(myAnnot[rownames(myAnnot) %in% as.character(PMID26343147[,1]), "ENTREZID"]) geneid <- geneid[geneid!="NA"] degMA <- readRDS("./results/degMA.rds") # Faster then read.delim() degMA <- degMA[ , !is.na(colSums(degMA))] # Remove columns where DEG analysis was not possible degMAsub <- degMA[geneid,] degOL_PMID26343147 <- intersectStats(degMAgene, degMAsub) write.table(degOL_PMID26343147, file="./results/degOL_PMID26343147.xls", quote=FALSE, sep="\t", col.names = NA) sum(degOL_PMID26343147[,1] > 0) # Drugs with any overlap degOL_PMID26343147[1:20,] # Top 20 scoring drugs ```

[Back to Table of Contents]()

## Drugs affecting known longevity genes The following identifies CMAP drugs affecting the expression of the IGF1 or IGF1R genes. The final result is written to a file named [`deg_IGF1.xls`](http://biocluster.ucr.edu/~tgirke/projects/longevity/cmap/results/deg_IGF1.xls). ```{r deg_queries, eval=TRUE} genesymbols <- c("IGF1", "IGF1R") geneids <- unique(as.character(myAnnot[myAnnot$SYMBOL %in% genesymbols,"ENTREZID"])) names(geneids) <- unique(as.character(myAnnot[myAnnot$SYMBOL %in% genesymbols,"SYMBOL"])) degList <- readRDS("./results/degList.rds") df <- data.frame(row.names=colnames(degList$DEG), check.names=FALSE) index <- which(colSums(degList$DEG[geneids,])>= 1) for(i in seq_along(geneids)) { tmp <- data.frame(DEG=degList$DEG[geneids[i],index], logFC=degList$logFC[geneids[i],index], FDR=degList$FDR[geneids[i],index]) colnames(tmp) <- paste0(names(geneids)[i], "_", colnames(tmp)) df <- cbind(df, tmp[rownames(df),] ) } df <- df[names(index),] write.table(df, file="./results/deg_IGF1.xls", quote=FALSE, sep="\t", col.names = NA) ``` Now the final `data.frame` can be sorted by increasing mean FDR values. ```{r sort_pvalue, eval=TRUE} igfDF <- read.delim("./results/deg_IGF1.xls", row.names=1) igfDF[order(rowMeans(igfDF[,c(3,6)])),][1:20,] ```

[Back to Table of Contents]()

## Plot structures of compounds ```{r plot_sdf, eval=TRUE} library(ChemmineR) mypath <- system.file("extdata", "longevitydrugs.sdf", package="longevityTools") mypath <- "../inst/extdata/longevitydrugs.sdf" sdfset <- read.SDFset(mypath) data(sdfsample) sdfsample plot(sdfsample[1:4], print=FALSE) ```

[Back to Table of Contents]()

# Connectivity maps enrichment analysis The connectivity maps approach is a rank-based enrichment method utilizing the KS test [@Lamb2006-uv]. It measures the similarities of expression signatures based on the enrichment of up- and down-regulated genes at the top and bottom of sorted (ranked) gene lists. ## Query drug signatures The following uses the 1,497 age-related gene expression signatures from Peters et al. [-@Peters2015-fc] as a query against the CMAP signatures. The results are sorted by the ES Distance and the top scoring 20 drugs are given below. The full result table is written to a file named [`drugcmap2.xls`](http://biocluster.ucr.edu/~tgirke/projects/longevity/cmap/results/drugcmap2.xls). ```{r drug_enrichment, eval=TRUE, message=FALSE} library(DrugVsDisease) PMID26490707 <- read.delim("./data/PMID26490707_S1.xls", comment="#", check.names=FALSE) data(drugRL) PMID26490707sub <- PMID26490707[PMID26490707[,"NEW-Gene-ID"] %in% rownames(drugRL),] PMID26490707sub <- PMID26490707sub[order(PMID26490707sub$Zscore, decreasing=TRUE),] PMID26490707sub <- rbind(head(PMID26490707sub, 100), tail(PMID26490707sub, 100)) # Subsets to top 200 DEGs testprofiles <- list(ranklist=matrix(PMID26490707sub$Zscore, dimnames=list(PMID26490707sub[,"NEW-Gene-ID"])), pvalues=matrix(PMID26490707sub$P, dimnames=list(PMID26490707sub[,"NEW-Gene-ID"]))) drugcmap <- classifyprofile(data=testprofiles$ranklist, case="disease", signif.fdr=0.5, no.signif=20) drugcmap2 <- classifyprofile(data=testprofiles$ranklist, case="disease", pvalues=testprofiles$pvalues, cytoout=FALSE, type="dynamic", dynamic.fdr=5, signif.fdr=5, adj="BH", no.signif=1000) write.table(drugcmap2, file="./results/drugcmap2.xls", quote=FALSE, sep="\t", col.names = NA) drugcmap2[[1]][1:20,] ```

[Back to Table of Contents]()

## Query disease signatures The same query is performed against a reference set of disease expression signatures. The results are sorted by the ES Distance and the top scoring 20 drugs are given below. The full result table is written to a file named [`diseasecmap2.xls`](http://biocluster.ucr.edu/~tgirke/projects/longevity/cmap/results/diseasecmap2.xls). ```{r disease_enrichment, eval=TRUE, message=TRUE} PMID26490707 <- read.delim("./data/PMID26490707_S1.xls", comment="#", check.names=FALSE) data(diseaseRL) PMID26490707sub <- PMID26490707[PMID26490707[,"NEW-Gene-ID"] %in% rownames(diseaseRL),] testprofiles <- list(ranklist=matrix(PMID26490707sub$Zscore, dimnames=list(PMID26490707sub[,"NEW-Gene-ID"])), pvalues=matrix(PMID26490707sub$P, dimnames=list(PMID26490707sub[,"NEW-Gene-ID"]))) diseasecmap <- classifyprofile(data=testprofiles$ranklist, case="drug", signif.fdr=0.5, no.signif=20) diseasecmap2 <- classifyprofile(data=testprofiles$ranklist, case="drug", pvalues=testprofiles$pvalues, cytoout=FALSE, type="dynamic", dynamic.fdr=5, adj="BH", no.signif=100) write.table(diseasecmap2, file="./results/diseasecmap2.xls", quote=FALSE, sep="\t", col.names = NA) diseasecmap2[[1]][1:20,] ```

[Back to Table of Contents]()

# Age-drug network analysis In progress...

[Back to Table of Contents]()

# Age-disease network analysis In progress...

[Back to Table of Contents]()

# Funding This project is funded by NIH grant U24AG051129 awarded by the National Intitute on Aging (NIA).

[Back to Table of Contents]()

# Version information ```{r sessionInfo} sessionInfo() ```

[Back to Table of Contents]()

# References