---
title: "John Hopkins Genomic Data Science Capstone"
author: "by Cenk Celik"
date: "28/08/2020"
output:
  html_document: default
  pdf_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# The Big Question

A fundamental question in human biology is understanding how we develop from a fetus to a grown human being. One way to do this is to study how the human brain changes over time. In a recent study, Jaffe _et al._ measured gene expression from different individuals across the human lifespan. They were looking for genes that showed patterns of expression that changed over time as people aged. You can find the paper [here](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4281298/).
You don’t need to understand every sentence in this paper, but it will be a good idea to spend some time trying to read it.

The primary data generated by this group is transcriptome sequencing data (known as RNA-seq) from human post-mortem brains, sequenced on the Illumina platform. You have learned about this type of data in the Introduction to Genomic Technologies class and learned pieces of the analysis pipeline along the way. The specific question we will ask in the capstone project is:

> What is the difference between gene expression in fetal and adult brains?

As always in genomics, this dataset has unique challenges. For example, this is data from human brains and all samples are obtained after the death of the individual (also called post-mortem). It is very possible that brain expression changes after death and that this change is bigger the more time has passed after death (also called the post-mortem interval).

# What you will be doing in this capstone project?

In this project you will be completing the following tasks to answer the fundamental question we have posed:

1. You will read the original research article and begin to understand the fundamental scientific question you are studying.
2. You will download the raw sequence data and the meta-data (phenotypic information and technological information) from a public database, and pre-process the data by aligning it to the genome.
3. You will then perform a quality control on the alignments.
4. You will calculate expression measurements at the gene count level.
5. You will do an exploratory analysis to identify major features of the data and figure out which model to build.
6. You will fit statistical models to identify genes that are different between fetal brains and adult brains.
7. You will integrate you results with data from other projects to identify biological patterns.
8. You will document your analysis in detail to demonstrate reproducibility and turn in a brief write-up.

These steps will be evaluated with a combination of quizzes and peer grading to make sure you are progressing through the capstone.

# Why re-analysis?

Most datasets are either highly practical and proprietary - think of data collected by Google, Facebook, and Airbnb - or not very practical but freely available - like data from astronomical observatories. One of the most incredible things about the genomic data revolution is that the data are both highly practical for solving biological problems and very often freely available.

Many scientists spend their entire careers cleverly reusing data that has been generated by other people and made freely available on the internet. In this project, we will be focusing on the reanalysis of a publicly available data set. The question we will be answering is a fundamental question in biology that is an area of intense ongoing scientific investigation. The type of reanalysis you are doing is just like the type you would see happen in genomic labs all over the world.

# Why is this so hard?

We will guide you through the overall process and point you to useful tools, you may have to look up information, download data, or figure out how to put a pipeline together to complete the project. This will likely involve some frustration because there isn’t one right answer or step by step process. This is on purpose. The goal of this capstone project is to demonstrate you can figure out how to execute genomic data analyses on your own. As people who regularly work with and hire genomic data scientists, this independence and ability to solve problems is as important as any of the specific tools or techniques you have learned.

Just like in a real data analysis this doesn’t mean you are on your own. We expect you to complete your projects on your own, but should definitely ask for help and discuss with your classmates. We hope that you will get a lot out of the experience and come out confident in your ability to solve genomic problems with data science.

# Phenotype information
The data is deposited into the Short Read Archive as [BioProject PRJNA245228](http://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA245228). This project has 48 different samples, where a sample is a different person and library extraction method. The authors studied 3 different populations of RNA: “total”, “cytosol”, and “nucleus”. These populations refers to RNA present in different parts of the cell, “total” indicates that the RNA comes from both the cytosol and the nucleus. The authors did polyA+ selection to removed ribosomal RNA; this is the standard assay for RNA sequencing.

For “total” RNA samples, the authors studied 6 different age groups: fetal (<0 years), infant (0-1 years), child (1-10 years), adolescent (10-20 years), adult (20-50 years) and old, (50+ years), and each age group had 6 individuals.

The individual samples has names such as [Br5341C1_DLPFC_polyA_RNAseq_cytosol](http://www.ncbi.nlm.nih.gov/biosample/2999562), which indicates individual “Br5341C1” from the “DLPFC” project (the authors studied the human dorsolateral pre-frontal cortex) isolated using “polyA” from the “cytosol”. In the link you can see some additional phenotype information such as age (-0.4 years), sex (male), RIN number (10; this is a measure of RNA quality). The age is measured as weeks after conception converted to years before term (which is why the age is negative for this sample; it indicates a fetal sample).

# Getting the data from SRA
The previous section helps us to locate the samples. We now need to get the data out of SRA in the form of a set of FASTQ files.

On the [webpage](http://www.ncbi.nlm.nih.gov/biosample/2999518), in the top right corner there is a header called “Related Information”, with a link to SRA. Clicking on that link takes us to an SRA [page](http://www.ncbi.nlm.nih.gov/sra?LinkName=biosample_sra&from_uid=2999518). Here we see important information. First, internally in SRA this BioSample is called SRX683793; two Runs with ids SRR1554535 and SRR2071346 are associated with this sample. The ids beginning with SRX are called experiment ids and the ones beginning with SRR are called run ids. In this case it means that this particular sample was sequenced twice. Each sequencing run will give us a FASTQ file, and we will therefore have two FASTQ files associated with this sample.

Side note: internally in SRA all data is stored in a special SRA format, which is - frankly - irritating to deal with. But it allows us to potentially retrieve the data in FASTA, FASTQ and SRA formats. We want to get the FASTQ files.

Confusingly, the page you land at is labelled “download” but you can only download a pileup file. Instead click at the top, at the other (!) download tab. Here you get taken to a page where you can only enter experiment ids, beginning with SRX. Doing this, lands you on a page like [this](http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?exp=SRX683792&cmd=search&m=downloads&s=seq), which allows you to download FASTQs.

An alternative to the web interface is to use the sra toolkit, a command line utility you can find at http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software. There are several command line utilities for accessing SRA. A classic is the fastq-dump command; an example of using it is

> fastq-dump -v --gzip SRR1554534

This will output a gzipped FASTQ file. The utility only appears to support run ids, not experiment ids.

# Which samples to use?

As stated elsewhere, we will be doing a comparison between fetal and adult samples. But the samples themselves are quite big. The comparison will of course be better if you use the full 6 vs. 6 samples, but you can select down to only 3 vs. 3 and still get full credit.

The full list of fetal and adult samples are as follows:

### Fetal
* [R3452_DLPFC_polyA_RNAseq_total; SRX683795; SRR1554537, SRR2071348](http://www.ncbi.nlm.nih.gov/biosample/2999520)

* [R3462_DLPFC_polyA_RNAseq_total; SRX683796; SRR1554538, SRR2071349](http://www.ncbi.nlm.nih.gov/biosample/2999521)

* [R3485_DLPFC_polyA_RNAseq_total; SRX683799; SRR1554541, SRR2071352](http://www.ncbi.nlm.nih.gov/biosample/2999524)

* [R4706_DLPFC_polyA_RNAseq_total; SRX683824; SRR1554566, SRR2071377](http://www.ncbi.nlm.nih.gov/biosample/2999549)

* [R4707_DLPFC_polyA_RNAseq_total; SRX683825; SRR1554567, SRR2071378](http://www.ncbi.nlm.nih.gov/biosample/2999550)

* [R4708_DLPFC_polyA_RNAseq_total; SRX683826; SRR1554568, SRR2071379](http://www.ncbi.nlm.nih.gov/biosample/2999551)

### Adult

* [R2869_DLPFC_polyA_RNAseq_total; SRX683793; SRR1554535, SRR2071346](http://www.ncbi.nlm.nih.gov/biosample/2999518)

* [R3098_DLPFC_polyA_RNAseq_total; SRX683794; SRR1554536, SRR2071347](http://www.ncbi.nlm.nih.gov/biosample/2999519)

* [R3467_DLPFC_polyA_RNAseq_total; SRX683797; SRR1554539, SRR2071350](http://www.ncbi.nlm.nih.gov/biosample/2999522)

* [R3969_DLPFC_polyA_RNAseq_total; SRX683814; SRR1554556, SRR2071367](http://www.ncbi.nlm.nih.gov/biosample/2999539)

* [R4166_DLPFC_polyA_RNAseq_total; SRX683819; SRR1554561, SRR2071372](http://www.ncbi.nlm.nih.gov/biosample/2999544)

* [R2857 DLPFC polyA+ transcriptome; SRX683792; SRR1554534, SRR2071345](http://www.ncbi.nlm.nih.gov/biosample/2731373)

# Task 1: Instructions
The purpose of genomic data science is to answer fundamental questions in biology. Before starting on the data analysis process, the first step is always to understand the scientific question you are trying to answer. The first assessment in the capstone is to make sure you have read and understood the background on the question we are trying to answer. Please read the following paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4281298 and then answer the questions in the Task 1 Quiz.

I will not share the answers for the Task 1 quiz. I am sure you will be able to pass it if you have a look at the paper by Jaffe _et al._

# Task 2: Alignment
Once you have understood the problem, the next step is to obtain the raw data so that you can perform your analysis. To complete this assessment you need to
a. find the raw data,
b. download and align the data for a specific set of samples, and 
c. compile information on those samples that you will use in downstream analyses.

You will be comparing infant brains to adult brains for this project. Accessing the data is described in detail in the project description.

You should download the data and align the data with a spliced aligner (some options include _Tophat2_, _Hisat_, _STAR_, but there are others) You may do this with command line tools or with Galaxy. If you do not have the computing resources to do this on your computer use the Galaxy Cloud servers to perform the alignment. In the paper they aligned the reads to version hg19 of the human genome, you may choose to align to any version of the human genome you think is appropriate.

The second component of this assessment is to collect information about the samples we will be analyzing. Go to the SRA and collect as much information as possible about the each sample. At minimum collect information on the age of the person who the brain came from and the RNA-quality (RIN) of the sample. Store the data in a tab-delimited text file where each column corresponds to a variable and each row corresponds to one samples values for those variables. This is the phenotype table we will use later on.

## Answer:
I selected three samples for each category: foetal and adult. Using SRR accession numbers, I directly sent the raw data to Galaxy Main:

* Fetal sample accession numbers: SRR1554537, SRR1554566, SRR1554568

* Adult sample accession numbers: SRR1554536, SRR1554539, SRR1554534

Once the files have been uploaded on [Galaxy Main](https://usegalaxy.org), I, then, aligned the samples using *Hisat2* **(Galaxy Version 2.1.0+galaxy5)** to the reference sample **hg19**, with paired-end alignment. Galaxy Main returned a BAM file for each sample. Then, I proceeded with *FastQC* **(Galaxy Version 0.72+galaxy1)** to test the quality of alignments. The program returned two files for each sample: (1) Raw data, (2) a web page summarising the results.

Number of reads were in the range of 22,887,026 to 62,654,788. The GC content percentage were in the range of 46 to 51. All the alignment rates were above 99.54%.

Then I collected further information regarding the samples on this [link](https://www.ncbi.nlm.nih.gov/Traces/study1/?query_key=1&WebEnv=MCID_5f3a39744441971ac345c40e&o=acc_s%3Aa&s=SRR1554537,SRR1554536,SRR1554539,SRR1554534,SRR1554566,SRR1554568)

```{r}
file <- read.csv("SraRunTable_v1.csv")
phenotype_table = file
rownames(phenotype_table) = phenotype_table[,1]
phenotype_table[,1] = NULL
head(phenotype_table)
```
```{r}
write.table(phenotype_table, file="phenotype_data.txt", col.names=TRUE, row.names=TRUE)
```
Summaries for each age group:
```{r}
adult <- file[1:3,]
infant <- file[4:6,]

summary(adult)
summary(infant)
```

# Task 3: QC the Alignment
Now, you have aligned the data, the next step is to do some quality control to make sure that the data are in good shape. You can use _FastQC_ to perform these analyses. As an aside, the _FastQC_ team have recently started a very interesting blog on QC for genomics: [QCFail](https://sequencing.qcfail.com/).
In addition to the above questions use this [document](http://bioinfo-core.org/index.php/9th_Discussion-28_October_2010) to determine if any of the other QC metrics look strange for specific samples. If there are any major problems with specific samples, include a variable in your phenotype table as defined in the previous assessment.

## Answer:
I used _FastQC_ **(Galaxy Version 0.72+galaxy1)** on Galaxy Main.

A lot of groups have found that RNA-Seq libraries created with Illumina kits show this odd bias in the first ~10 bases of the run. This seems to be due to the 'random' primers which are used in the library generation, which may not be quite as random as we'd hope but the results seem to be OK.

1. Does the document appear to have an appropriate QC?
Yes, except for Adult sample 1 (SRR1554536). In this case the library was supposed to be human, which would produce a library with a median GC content of ~42%, however you can see that the median GC content for this sample is 44-46%. This may seem like a small shift, but GC profiles are remarkably stable and even a minor deviation indicates that there is a problem in the library.

2. Does the percentage of mapped reads seem appropriate?
Yes.

```{r table, echo=FALSE, message=FALSE, results='asis', warnings=FALSE}
tabl <- "
| SRR        | mapped reads (%) |
|------------|:----------------:|
| SRR1554534 |            99.95 |
| SRR1554536 |            99.67 |
| SRR1554537 |            99.56 |
| SRR1554539 |            99.56 |
| SRR1554566 |            99.56 |
| SRR1554568 |            99.55 |
"
cat(tabl)
```

### t-test:

```{r}
t.test(adult$alignment_rate, infant$alignment_rate)
t.test(adult$avg_qual_per_read, infant$avg_qual_per_read)
```

The _p_-values were 0.441 and 0.4226 for mapping rate and average quality score between the two groups, respectively, which indicate there was no significant different between the two groups.

# Task 4: Feature counts

Now that you have performed alignment and quality control, the next step is to calculate the abundance of every gene in every sample. This count is an approximation to the expression level for the gene. It is calculated by comparing the aligned reads to pre-specified positions in the genome that correspond to known genes. To calculate these counts you need a file that tells you the positions of the genes. You will need the “GTF file” describing the locations of the genes for the corresponding genome you aligned to during the alignment step. You can then use one of many types of software for obtaining the gene counts (*featureCounts*, *HTseq*, etc). The result should be a table that is formatted with one gene per row and one sample per column. The count for that gene in that sample is in the corresponding cell.

## Answer:

I used *featureCounts* **(Galaxy Version 1.6.4+galaxy1)** on Galaxy main to obtain gene positions as in GTF format for each sample.

> Here is the workflow on Galaxy main:

![](workflow_.png)

I downloaded each file and parsed in R to get feature counts table:

The packages I need:

```{r message=FALSE, warning=FALSE}
library('tidyverse')
library(org.Hs.eg.db)
library(annotate)
```

```{r}
files <- list.files(path = "/Users/cenkcelik/Projects/genomic_data_science_specialisation/8.genomic_data_science_capstone/FeatureCounts", pattern = "tabular$", full.names = TRUE)
list <- lapply(files, read.csv, sep = "\t")
```

Then, I merged the files by geneids:

```{r}
feature_counts <- Reduce(function(x, y) merge(x, y, by="Geneid"), list)
colnames(feature_counts) <- c("gene_id", "SRR1554534", "SRR1554536", "SRR1554537",  "SRR1554539", "SRR1554566", "SRR1554568")
feature_counts <- feature_counts[c("gene_id", "SRR1554534", "SRR1554536", "SRR1554537",  "SRR1554539", "SRR1554566", "SRR1554568")]
```

Using *org.Hs.eg*, I converted geneids to gene names:

```{r}
for (i in 1:nrow(feature_counts)){
  feature_counts[i,1] <- lookUp(toString(feature_counts[i,1]), 'org.Hs.eg', 'SYMBOL')
}
rownames(feature_counts) <- make.names(feature_counts[,1], unique=TRUE)
feature_counts[,1] <- NULL
feature_table <- feature_counts
```

To save the file as a delimited TXT file:

```{r}
write.table(feature_table, file="feature_counts.txt", sep='\t', row.names=TRUE, col.names=TRUE)
```

# Task 5: Exploratory Analysis

After summarising your genomic data the next step is to load the data into R for analysis with Bioconductor. Load the phenotype table and the gene expression matrix created in the previous steps into R (it may be a good idea to create a SummarizedExperiment object). The next step is to explore the data for important features. 
Hint: In this, and the next questions, it will be important to account for differences in sequencing depth between samples. The classic way to do this is to compute the number of reads mapped to a feature per million reads mapped.
Make a box-plot of the expression levels for each sample and assess whether any transformation should be applied. Then perform a principal components analysis on the data and plot the top two principal components in a scatter plot. Colour the points by variables you collected in your phenotype table object and see if there are any correlations between principal components and known variables, especially variables that are not the phenotype of interest. Keep these in mind in the next section.

## Answer:
I started to do exploratory data analysis by loading the required packages in R.

```{r message=FALSE, warning=FALSE}
library(GenomicRanges)
library(SummarizedExperiment)
library(edgeR)
```

Next, I removed lowly expressed genes:

```{r}
feature_table = feature_table[rowMeans(feature_table) > 10, ]
```

Then, I created a SummarizedExperiment object:

```{r}
col_data <- phenotype_table
row_data <- relist(GRanges(), vector("list", length=nrow(feature_table)))
se = SummarizedExperiment(assays = list(counts = feature_table), rowRanges = row_data, colData = col_data)
```

I made a box-plot to see gene expression levels for each sample:

```{r}
dge <- DGEList(counts = assay(se, "counts"), group = phenotype_table$age_group )
dge$samples <- merge(dge$samples, as.data.frame(colData(se)), by = 0)
boxplot(dge$counts, col = as.numeric(phenotype_table$sample))
```

It did not seem easy to interpret because of outliers, so I proceeded with log2 transformation:

```{r}
log2_dge_count = log2(dge$counts + 1)
boxplot(log2_dge_count)
```

The box-plot of transformed data looked better, even though adult samples had many outliers. Infant data had only few outliers.

Next, I performed Principle Component data analysis and colour-coded each sample by RIN:

```{r message=FALSE, warning=FALSE}
library(ggfortify)
count_pca <- prcomp(log2_dge_count, center=TRUE, scale=TRUE)
dat <- data.frame(X=count_pca$rotation[,1], Y=count_pca$rotation[,2], age_group <- phenotype_table$age_group, RIN=phenotype_table$RIN)

ggplot(dat, aes(x=X, y=Y, shape=age_group, color=RIN)) + geom_point(size=5) + xlab("PC1") + ylab("PC2")
```

When we only use RIN, the gene expression cannot be distinguished between adult and infant samples. However, if we colour-code the samples by age group, it is easier to distinguish.

```{r}
ggplot(dat, aes(x=X, y=Y, shape=age_group, color=age_group)) + geom_point(size=5) + xlab("PC1") + ylab("PC2")
```

# Task 6: Statistical Analysis

The next step is to perform a statistical analysis to detect genes that are differentially expressed. To do this, set up the null and alternative hypotheses. Be sure to identify which covariates you are going to adjust for in your analysis. Then use a linear model to test for genes that are differentially expressed between age groups adjusting for an appropriate set of covariates. You can use one of several packages (*DESeq*, *edgeR*, *limma*+*voom*, or *edge*) to perform this analysis. Use a correction to identify genes that are differentially expressed after accounting for multiple testing. Make a plot of the fold change (or effect size) for age in each linear model versus the *log10* *p*-value. This is called a volcano plot which you can use to see if there are any major patterns among the observed differential expression.

## Answer:
I will perform a statistical analysis to detect genes that are differentially expressed. I first defined the null hypothesis and alternative hypothesis:

> H<sub>0</sub>: mean gene expression level of each gene does not differentiate between adult and infant samples

> H<sub>1</sub>: mean gene expression level of each gene are different between adult and infant samples.

To test the hypotheses, I made linear models of gene expressions of adult and infant samples using *limma* package and applied log2 transformation with lowly expressed data removed.

```{r message=FALSE, warning=FALSE}
library(limma)
library(edge)

edata <- assay(se)
edata <- log2(as.matrix(edata) + 1)
edata <- edata[rowMeans(edata) > 10, ]

mod <- model.matrix(~ se$age_group)
fit_limma <- lmFit(edata, mod)
ebayes_limma <- eBayes(fit_limma)
limma_toptable = topTable(ebayes_limma, number = dim(edata)[1])
limma_table_output = limma_toptable[,c(1,4,5)]
```
To write the results in a tab-delimited text file:
```{r}
write.table(limma_table_output, file = "gene_expression.txt", sep='\t', row.names=TRUE, col.names=TRUE)
```

Then, I made a volcano plot using adjusted *p*-values where blue represents differentially expressed genes after *log10* transformation.

```{r}
with(limma_toptable, plot(logFC, -log10(adj.P.Val), pch=20, main="Volcano plot"))
with(subset(limma_toptable, adj.P.Val < 0.05), points(logFC, -log10(adj.P.Val), pch=10, col="blue"))
```
```{r}
print(paste0("Differentially expressed genes: ", sum(limma_table_output$adj.P.Val < 0.05)))
print(paste0("Differentially expressed genes that down-regulated from infant to adult: ", sum(limma_table_output$adj.P.Val < 0.05 & limma_table_output$logFC > 1)))
print(paste0("Differentially expressed genes that up-regulated from infant to adult: ", sum(limma_table_output$adj.P.Val < 0.05 & limma_table_output$logFC < -1)))
```

In summary, there were 5133 differentially expressed genes between infant and adult samples among which 4852 genes were down-regulated while 281 genes were upregulated.

# Task 7: Gene set analysis

In assessment 6 we have identified genes differentially expressed between foetal and adult brain. Now we will examine these results in a wider context. The roadmap epigenomics project has profiled both foetal and adult brain and other tissues such as liver for the promoter associated histone modification H3K4me3. Using this data, examine whether genes which are differentially associated between foetal and adult brain, are associated with changes in H3K4me3 in their promoters, between foetal and adult brain. As a sanity check, note that the list of differentially expressed genes is expressed in brain at some developmental time point. H3K4me3 is an epigenetic mark which is cell type dependent. We would expect that genes important for brain development are not expressed in the liver (say). Therefore, as a sanity check, examine whether promoters for the list of differentially expressed genes are marked by H3K4me3 in liver.

“Are there changes in H3K4me3 between foetal and adult brain over promoters for genes differentially expressed between foetal and adult brain?” and “Are promoters of genes differentially expressed between adult and foetal brain marked by H3K4me3 in liver”?
The latter question is a control question.

# Answer:

In this task, I will try to answer the two questions below:

1. Are there changes in H3K4me3 between infant and adult brain over promoters for genes differentially expressed between infant and adult brain?

2. Are promoters of genes differentially expressed between adult and infant brain marked by H3K4me3 in liver?

To answer these questions, I first downloaded infant and adult brain, prefrontal cortex and liver *narrowpeak* datasets from *AnnotationHub* and *TxDb.Hsapiens.UCSC.hg19.knownGene*.

```{r message=FALSE, warning=FALSE}
library(AnnotationHub)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

ah <- AnnotationHub()
ah <- subset(ah, species == "Homo sapiens")
ah_fetal <- query(ah, c("EpigenomeRoadMap", "H3K4me3", "E081"))
ah_adult <- query(ah, c("EpigenomeRoadMap", "H3K4me3", "E073"))
ah_liver <- query(ah, c("EpigenomeRoadMap", "H3K4me3", "E066"))

fetal_data <- ah_fetal[[2]]
adult_data <- ah_adult[[2]]
liver_data <- ah_liver[[2]]
```

To match gene ids with entrez gene ids:

```{r message=FALSE, warning=FALSE}
library(mygene)
dif_exp_genes = row.names(limma_toptable[limma_toptable$adj.P.Val < 0.05,])
dif_exp_gene_ids = queryMany(dif_exp_genes, scopes = "symbol", fields = "entrezgene", species = "human" )
```

To extract differentially expressed genes by their corresponding promoters:

```{r}
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
txdb_genes <- genes(txdb)
dif_expressed_promoters <- promoters(txdb_genes[dif_exp_gene_ids$entrezgene %in% txdb_genes$gene_id])
```

To find the overlap between differentially expressed gene promoters and *narrowPeak* datasets:

```{r}
adult_peak = length(subsetByOverlaps(adult_data, dif_expressed_promoters, ignore.strand=TRUE)) / length(adult_data)
fetal_peak = length(subsetByOverlaps(fetal_data, dif_expressed_promoters, ignore.strand=TRUE)) / length(fetal_data)
liver_peak = length(subsetByOverlaps(liver_data, dif_expressed_promoters, ignore.strand=TRUE)) / length(liver_data)
print(paste0("Differentially expressed gene in adult narrowpeaks: ", round(adult_peak, 3), " per cent"))
print(paste0("Differentially expressed gene in fetal narrowpeaks: ", round(fetal_peak, 3), " per cent"))
print(paste0("Differentially expressed gene in adult liver narrowpeaks: ", round(liver_peak, 3), " per cent"))

```

Therefore, 36% of promoters associated with differentially expressed gene overlap with infant brain narrowpeak data, whereas 24% and 19.8% overlap with adult brain narrowpeak data and adult liver narrowpeak data, respectively.

To calculate odds ratio for adult brain, infant brain and adult liver, I created a 2x2 overlap matrix:

```{r}
odds_ratio <- function(prom_counts, peak_counts, print=TRUE){
  overlapMat <- matrix(0, ncol = 2, nrow = 2)
  colnames(overlapMat) <- c("in.peaks", "out.peaks")
  rownames(overlapMat) <- c("in.promoters", "out.promoter")
  
  prom <- reduce(prom_counts, ignore.strand = TRUE)
  peaks <- reduce(peak_counts)
  both <- intersect(prom, peaks)  
  only.prom <- setdiff(prom, both)
  only.peaks <- setdiff(peaks, both)
  
  overlapMat[1,1] <- sum(width(both))
  overlapMat[1,2] <- sum(width(only.prom))
  overlapMat[2,1] <- sum(width(only.peaks))
  overlapMat[2,2] <- 1.5*10^9 - sum(overlapMat)
  
  oddsRatio <- overlapMat[1,1] * overlapMat[2,2] / (overlapMat[2,1] * overlapMat[1,2])
  return(oddsRatio)
}
```

And, results:

```{r}
print(paste0("Odds ratio in adult brain: ", round(odds_ratio(dif_expressed_promoters, adult_data), 2)))
print(paste0("Odds ratio in infant brain: ", round(odds_ratio(dif_expressed_promoters, fetal_data), 2)))
print(paste0("Odds ratio in adult liver: ", round(odds_ratio(dif_expressed_promoters, liver_data), 2)))
```
So, the odds ratio of infant brain sample is **18.28**, indicating the overlap between infant peaks and the promoters is about 18-fold enhanced than expected. The odds ratio of adult brain is slightly lesser than infant sample (**16**). The odds ratio of adult liver sample is more lesser than both infant and adult brain sample (**12.38**), indicating fewer infant and adult brain promoters’ differentially expressed genes were marked by H3K4me3 in the liver sample.

**This is the end of the analysis.**

```{r session_info}
devtools::session_info()
```