---
title: "Differential Expression with DESeq2"
output:
  learnr::tutorial:
    progressive: true
    version: 1.2
subtitle: |
  | ![](https://raw.githubusercontent.com/C-MOOR/clovis-biol11a/master/assets/images/cmoor_logo_notext.png){width=60%} ![](https://raw.githubusercontent.com/C-MOOR/clovis-biol11a/master/assets/images/Clovis_logo.jpg){width=38%}
runtime: shiny_prerendered
---

```{r setup, include=FALSE}
library(learnr)
tutorial_options(exercise.completion=TRUE)
knitr::opts_chunk$set(echo = FALSE)

library("DESeq2")
library("clusterProfiler")
library("org.Dm.eg.db")
library("tidyverse")

formatDESeq2Results <- function( x ) {
  df <- as.data.frame(x)
  df <- data.frame(rownames(df), df)
  colnames(df) <- c("GeneID", colnames(df)[-1])
  rownames(df) <- c()
  return(df)
}

runClusterProfiler <- function (x) {
  ids <- bitr( x$GeneID, "ENSEMBL", "ENTREZID", "org.Dm.eg.db" )
  kegg <- enrichKEGG(ids$ENTREZID, "dme", keyType="ncbi-geneid")
  return(kegg)
}

getClusterProfilerGenes <- function (x, i) {
  data.frame( x ) %>%
  filter( Description == i ) %>%
  pull( geneID ) %>%
  strsplit( "/" ) %>%
  unlist() %>%
  bitr( fromType="ENTREZID", toType="SYMBOL", OrgDb="org.Dm.eg.db") %>%
  pull( SYMBOL )
}

plotAcrossRegions <- function( x ) {
  df <- data.frame(counts(midgut)[x,], midgut_tsv$condition)
  colnames(df) <- c("counts", "region")
  df <- df[10:30,]
  df$region <- fct_relevel( factor( df$region ), "a1", "a2_3", "Cu", "LFCFe", "Fe", "p1", "p2_4" )
  ggplot( df ) +
    geom_bar( aes( region, counts ), stat="identity" ) +
    theme( axis.text.x = element_text( angle=90, vjust=0.5 ) )
}

download.file( "https://drive.google.com/uc?id=13wOmFWPOdmDRLFIaM_8Vu4aJKX99kXEh", "data.tgz", mode="wb" )
untar( "data.tgz" )
midgut_tsv <- read.table("https://raw.githubusercontent.com/C-MOOR/clovis-biol11a/master/assets/data/midgut.tsv", header=TRUE)
midgut <- DESeqDataSetFromHTSeqCount(midgut_tsv, "data/", design = ~condition)
midgut <- DESeq(midgut)
```

## Welcome {.splashpage}

![](https://raw.githubusercontent.com/C-MOOR/clovis-biol11a/master/assets/images/cmoor_logo_text_horizontal.png){width=100%}

![](https://raw.githubusercontent.com/C-MOOR/clovis-biol11a/master/assets/images/Clovis_logo_wide.jpg){width=100%}

```{r}
# Extract the tutorial version from the YAML data and store it so we can print it using inline r code below.  This can't be done directly inline because the code for extracting the YAML data uses backticks
tv <- rmarkdown::metadata$output$`learnr::tutorial`$version
```

#### Introduction: Comparing Data from Muliple RNA-seq Samples

In the previous lab, we looked at a single HTSeq file, which provides gene expression data for one sample.  In today's lab we will learn a hand-full of methods for looking at gene expression across the multiple samples from the *Drosophila* midgut.

#### Learning Goals

1. Calculate differential expression statistics using DESeq2
2. Understand the columns and rows of DESeq2 results
3. Use R to extract results for a single gene of interest
4. Use R to create a list of significantly different genes for different significance cutoffs.
5. Use clusterProfiler to carry out Gene Set Analysis and identify categories of genes that are significantly different between samples.
6. Use R to plot expression of a single gene across all regions.


#### Authors:
* [Dr. Katherine Cox](https://c-moor.github.io/portfolio/coxkatherine/)
* [Dr. Stephanie Coffman](https://c-moor.github.io/portfolio/coffmanstephanie/)


#### Version: `r tv`

## Introduction

In the previous lab, we looked at a single HTSeq file, which provides gene expression data for one sample. Let's consider an example from Marianes and Spradling et al. Recall that they compared many RNA-seq samples from different regions of the midgut in *Drosophilia melanogaster* (Figure 2A, from [Marianes and Spradling 2013](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3755342/))

![Figure 2A](https://raw.githubusercontent.com/C-MOOR/clovis-biol11a/master/assets/images/elife-00886-fig2A.jpg){ width=100% }
**Figure 2A** Diagram illustrating the different RNA-seq libraries constructed by Marianes and Spradling 2013 from the Drosophila midgut.

From an HTSeq file, we can learn that the *lab* gene is highly expressed in the Cu region of the *Drosophlia* midgut. But, does that mean the *lab* gene is important for cells of the Cu region? Maybe, maybe not. It could be that *lab* is expressed at that level across the whole midgut or possible even the whole fly. To really answer questions about gene expression, we need to compare more than one sample with each other. Which samples are compared depends on the scientific question being asked. In terms of the *lab* gene, Marianes and Spradling analyzed its expression in each region of the gut and determined that it is localized to the Cu region (Figure 2D, from [Marianes and Spradling 2013](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3755342/)). This is a clue that *lab* is important in the Cu region.

![Figure 2D](https://raw.githubusercontent.com/C-MOOR/clovis-biol11a/master/assets/images/elife-00886-fig2D.jpg){ width=100% }
**Figure 2D** Analysis of seven genes and their gene expression across each region of the gut.

In today's lab we will learn a hand-full of methods for looking at gene expression across the *Drosophila* midgut. As you work through the lab, think about scientific questions you could ask with the data, and what methods and samples would be useful for answering those questions.

## Scientific Notation

Many of the values reported by the progams for that analyze RNA-seq data are reported using scientific notation.  There is a brief explanations below if you need some review.  If you're comfortable with scientific notation, feel free to skip this section.  If you're not sure, you can always come back to this section later by clicking on the table of contents on the upper left.

### Expressing values in scientific notation

Scientific notation is a convenient method for expressing very large or very small numbers.  Instead of writing out long numbers, we just write out a small number, and then write down how many times that number has to be multiplied by 10 to get to the big number than we want to express.

For example, instead of writing out `1,200,000,000` (1 billion, two-hundred million), we just write `1.2 * 10^9` (1.2 times 10 to the 9th power, which is 1.2 times 1 billion).

For small numbers, we use 10 raised to a negative power.  For example, we could express the number `0.0000012` as `1.2 * 10^-6`.

People often use the letter "e" (short for exponent) as a shorthand for `10^`.  For example, `1.2 * 10^9` would be written `1.2e9`.

An easy way to figure out what the exponent should be for scientific notation is to count how many places you would need to move the decimal point in order to get the number you're aiming for.

The table below shows some more examples of numbers in scientific notation:

```{r}
t <- c("7580000000", "758000000", "75800000", "7580000", "758000", "75800", "7580", "758", "75.8", "7.58", "0.758", "0.0758", "0.00758", "0.000758", "0.0000758", "0.00000758", "0.000000758", "0.0000000758", "0.00000000758")
s <- data.frame(t, rep(7.58, times=19) * 10^seq(9,-9))
colnames(s) <- c("Standard Notation", "Scientific Notation")
s[5:14,]
```


### Interpreting Scientific Notation

When you read a number written in scientific notation...

* First look at the exponent
    + Is is positive or negative?
    + How big is it?
* Then look at the front number

```{r scientific-notation-big}
quiz(caption = "Comparing numbers written in scientific notation",
  question("Which is the biggest number?",
    answer("1e0"),
    answer("1e2"),
    answer("1e-2"),
    answer("1e7", correct=TRUE),
    answer("1e-9", message = "This exponent has the greatest absolute value, but it's negative, so this number is very small."),
    allow_retry=TRUE,
    random_answer_order = TRUE
    ),
 question("Which is the biggest number?",
    answer("7e0"),
    answer("9e2", message="Look at the exponent first, then the front number."),
    answer("3e-2"),
    answer("1e7", correct=TRUE),
    answer("4e-9", message = "This exponent has the greatest absolute value, but it's negative, so this number is very small."),
    allow_retry=TRUE,
    random_answer_order = TRUE
    ),
  question("Which is the biggest number?",
    answer("7e2"),
    answer("9e2", correct=TRUE),
    answer("9e-2", message="This exponent is negative, so this number is small."),
    answer("1e2"),
    answer("4e2"),
    allow_retry=TRUE,
    random_answer_order = TRUE
    ),
   question("Which is the biggest number?",
    answer("9e-9"),
    answer("9e7", correct=TRUE),
    answer("7e7"),
    answer("7e-9"),
    answer("9e-7"),
    allow_retry=TRUE,
    random_answer_order = TRUE
    )
)
```

```{r scientific-notation-small}
quiz(caption = "Comparing numbers written in scientific notation",
  question("Which is the smallest number?",
    answer("1e0", message="This exponent has the smallest absolute value, but negative exponents represent smaller numbers than an exponent of zero."),
    answer("1e2"),
    answer("1e-2"),
    answer("1e-7", correct=TRUE),
    answer("1e7"),
    allow_retry=TRUE,
    random_answer_order = TRUE
    ),
 question("Which is the smallest number?",
    answer("3e0", message="This exponent has the smallest absolute value, but negative exponents represent smaller numbers than an exponent of zero."),
    answer("3e2"),
    answer("3e-2"),
    answer("1e7", message="Look at the exponent first, then the front number."),
    answer("9e-9", correct=TRUE),
    allow_retry=TRUE,
    random_answer_order = TRUE
    ),
  question("Which is the smallest number?",
    answer("7e-2"),
    answer("9e-2"),
    answer("1e2", message="This exponent is positive, so this number is bigger than numbers with a negative exponent."),
    answer("1e-2", correct = TRUE),
    answer("4e-2"),
    allow_retry=TRUE,
    random_answer_order = TRUE
    ),
   question("Which is the smallest number?",
    answer("9e-9"),
    answer("9e7"),
    answer("7e7"),
    answer("7e-9", correct=TRUE),
    answer("9e-7"),
    allow_retry=TRUE,
    random_answer_order = TRUE
    )
)
```

## R cheat sheet

Here is a reminder of some basic R commands, for your convenience:

* `head(df)` prints the first few rows of the dataframe named `df`
* `tail(df)` prints the last few rows of the dataframe named `df`
* `summary(df)` prints some summary statistics for each column of the dataframe named `df`, such as min, mean, median, and max for numeric columns
* `log(v)`, `log(v, 10)` if `v` is a vector of numbers, these will return a vector containining the natural log or log10 of the items in `v`
* `plot(x,y)` plots the vector `x` against the vector `y`. Usually this will be a scatterplot, but R will try to guess a sensible plot if you give it data that doesn't fit in a scatter plot
* `hist(v)` is a specialized version of plot, which will generate a histogram of the items in the vector `v`

You can "filter" a dataframe (pick out only some of the rows and columns).  Some examples:

* `df$c` finds the column named `c` from at dataframe named `df` (keeps all rows)
* `filter(df, c>5)` finds all rows where the value in column `c` is greater than 5 (keeps all columns)
* `filter(df, c==5)` finds all rows where the value in column `c` is exactly equal to 5 (keeps all columns)
* `filter(df, c>5 & d<=10)` finds all rows where the value in column `c` is greater than 5 **and** the value in column `d` is less than or equal to 10 (keeps all columns)
* `filter(df, c=="some_name")` finds all rows where the value in column `c` is exactly "some_name"
* `filter(df, str_detect(c, "name"))` finds all rows where the value in column `c` contains "name" (things like "some_name", "other_name", as well as "name")

You can usually combine commands any way you like, using parentheses.  As long as the output of the inner command is appropriate input for the outer command, things will probably work fine.  Some examples:

* `plot(df$c, df$d)` plots columns of the dataframe named `df`, putting the column named `c` on the x axis and the column named `d` on the y axis
* `hist(log(v))` plots a histogram of the natural log of the values in the vector named `v`


## Calculate Differential Expression

### Introduction

Usually when you do an RNA-seq experiment, you want to find out what genes behave differently in different samples.  This is called "differential expression", because you're looking for *differences* in gene *expression*.

`DESeq2` is a program for calculating differential expression (the DE in DESeq2 stands for differential expression).  It starts with the read count data (for example, HTSeq files) for *all* the samples, then does a bunch of math to figure out which genes are the most different between samples.

Since setting up DESeq2 is a little complicated, we've gone ahead and run the commands in the background, so that you can focus on exploring the results and looking for patterns in the data.  The commands we used to run DESeq2 are shown below, in case you're curious, but don't worry about trying to figure these out right now.


```{r setup-DESeq2, echo=TRUE, eval=FALSE, message=FALSE}
midgut_tsv <- read.table("https://raw.githubusercontent.com/C-MOOR/clovis-biol11a/master/assets/data/midgut.tsv", header=TRUE)
midgut <- DESeqDataSetFromHTSeqCount(midgut_tsv, "data/", design = ~condition)
midgut <- DESeq(midgut)
```


Your job is to explore the results from DESeq2 and try to figure out what they mean biologically.  DESeq2 just reports statistical calculations, giving each gene scores for how different it is between samples.  It's the job of a scientist to look through that information and decide what it **means**.

### Looking at results from DESeq2

We have stored the results from DESeq2 in an object named `midgut`.  This is a special object made by DESeq2, with lots of dataframes inside it.  You can access the dataframe of results for a particular comparison using the `results()` command.  This process takes two steps:

1. First, use the `results()` command to get the relevant data.  The `results()` command needs to know which samples you want to compare.  Make sure to store this output in a variable.
  + In the example below, we have named this variable `temp` (short for temporary) because it's not in the right format yet.
2. Reformat the results to make them more useful.  We've made a special command that takes care of reformatting DESeq2 results, named `formatDESeq2Results()`.  Run this command on your temporary variable to convert in into a nicely formatted dataframe.  Store the dataframe in a new variable, and give it a descriptive name so you can remember what it is.
  + In the example below, we have named our dataframe `a1_vs_p1`, since it contains the data comparing region a1 to region p1.

Here's an example of this two step process for getting the data comparing samples from region "a1" to region "p1".  As the last command, we type out the name of our new dataframe so that it will print to the screen and we can see what it looks like.

```{r, echo=TRUE}
temp <- results(midgut, contrast = c("condition", "a1", "p1") )
a1_vs_p1 <- formatDESeq2Results(temp)
a1_vs_p1
```

Each row in this dataframe is a gene, and the columns contain a bunch of information about the statistics that DESeq2 calculated.  This data frame has a lot of numbers in it, but there are three columns that we're interested in:

- **baseMean**:  The mean expression level for this gene across all samples
- **log2FoldChange**:  How much higher this gene is expressed in the first region ("a1") than in the second region ("p1").  A positive number means the gene is higher in a1 than in p1, and a negative number means the gene is lower in a1 than in p1
- **padj**: (If this variable doesn't show up on your screen, click the little arrow at the top right of the table, next to the column names.  This will let you scroll through all the columns if they don't fit on your screen).  This is the "adjusted p-value" for the gene - basically, a score that helps you decide whether or not this gene is actually different between the regions.  A **high score** means the gene is probably **not different** between the samples.  A **low score** means that it probably is a **real difference**, and not a fluke in the data.  Scientists often use a score of 0.05 as a cutoff, so genes with a `padj` score of less than 0.05 would be considered "significantly different" between these samples.  This is important because measurements are never perfect, and we have to decide whether the differences we see are because the samples are actually different, or are just errors in measurement.

```{r intro-deseq-results}
quiz(caption = "What's in this dataframe?",
  question("Out of the first 10 genes shown above, which gene has the highest average gene expression?",
    answer("FBgn0000003"),
    answer("FBgn0000008"),
    answer("FBgn0000014"),
    answer("FBgn0000015"),
    answer("FBgn0000017"),
    answer("FBgn0000018"),
    answer("FBgn0000022"),
    answer("FBgn0000024", correct=TRUE),
    answer("FBgn0000028"),
    answer("FBgn0000032"),
    allow_retry=TRUE
    ),
  question("Out of the first 10 genes shown above, which gene(s) are more highly expressed in a1 than p1?  HINT: There are four - you must select all four to get the correct answer.",
    answer("FBgn0000003", correct=TRUE),
    answer("FBgn0000008"),
    answer("FBgn0000014"),
    answer("FBgn0000015"),
    answer("FBgn0000017"),
    answer("FBgn0000018"),
    answer("FBgn0000022", correct=TRUE),
    answer("FBgn0000024", correct=TRUE),
    answer("FBgn0000028", correct=TRUE),
    answer("FBgn0000032"),
   allow_retry=TRUE
   ),
  question('Out of the first 10 genes shown above, which gene(s) are likely to "really" be different between regions a1 and p1, if we use a padj cutoff of 0.05?  HINT: There are two - you must select both to get the correct answer.',
    answer("FBgn0000003"),
    answer("FBgn0000008"),
    answer("FBgn0000014", correct=TRUE),
    answer("FBgn0000015"),
    answer("FBgn0000017"),
    answer("FBgn0000018"),
    answer("FBgn0000022"),
    answer("FBgn0000024", correct=TRUE),
    answer("FBgn0000028"),
    answer("FBgn0000032"),
   allow_retry=TRUE
   )
)
```

### What does "NA" mean?
For some genes you will see a value of "NA" instead of a number.  NA stands for "not available".  This means that for some reason R does not have a value for this gene.  This might happen, for example, if a gene is not expressed at all in a particular set of samples - R can't calculate the `log2FoldChange` because it would involve dividing by zero.  Likewise, sometime R does not calculate a `padj` value because the preliminary calculations show that it's not worth it - it's already clear that the gene won't have a good `padj` score.

### Try it yourself: look at a different pair of samples

Use the `results()` and `formatDESeq2Results()` commands in the box below to look at a comparison between region "a1" and region "Cu".  You can copy the commands from the box above (replace "a1" and "p1" with the samples you want to compare).

```{r deseq-results, exercise=TRUE}
```

```{r intro-deseq-results-2}
quiz(caption = "Examine another comparison between regions",
   question("Out of the first 10 genes shown above, which gene has the highest average gene expression?",
    answer("FBgn0000003"),
    answer("FBgn0000008"),
    answer("FBgn0000014"),
    answer("FBgn0000015"),
    answer("FBgn0000017"),
    answer("FBgn0000018"),
    answer("FBgn0000022", message="Pay attention to the exponent (scientific notation)"),
    answer("FBgn0000024", correct=TRUE),
    answer("FBgn0000028"),
    answer("FBgn0000032"),
    allow_retry=TRUE
   ),
  question("Are these values the same or different than the gene expression values in the comparison between regions a1 and p1",
   answer("Same", correct=TRUE, message = "The baseMean is the average score across all samples, so it will be the same no matter what samples we're looking at"),
   answer("Different"),
   allow_retry=TRUE
   ),
  question("Out of the first 10 genes, which gene are more highly expressed in a1 than Cu?  HINT: There are four - you must select all four to get the correct answer.",
    answer("FBgn0000003", correct=TRUE),
    answer("FBgn0000008", correct=TRUE),
    answer("FBgn0000014"),
    answer("FBgn0000015"),
    answer("FBgn0000017"),
    answer("FBgn0000018"),
    answer("FBgn0000022"),
    answer("FBgn0000024", correct=TRUE),
    answer("FBgn0000028", correct=TRUE),
    answer("FBgn0000032"),
    allow_retry=TRUE
   ),
  question("Are these values the same or different than the log2FoldChange values in the comparison between regions a1 and p1?",
   answer("Same"),
   answer("Different", correct = TRUE, message="The log2FoldChange describes the difference between the two specific samples you chose to look at using the `results()` command"),
    allow_retry=TRUE
   ),
  question('Out of the first 10 genes shown above, which gene(s) are likely to "really" be different between regions a1 and Cu, if we use a padj cutoff of 0.05?  HINT: There are four - you must select all four to get the correct answer.',
    answer("FBgn0000003"),
    answer("FBgn0000008", correct=TRUE),
    answer("FBgn0000014", correct=TRUE),
    answer("FBgn0000015"),
    answer("FBgn0000017", correct=TRUE),
    answer("FBgn0000018"),
    answer("FBgn0000022"),
    answer("FBgn0000024", correct=TRUE),
    answer("FBgn0000028"),
    answer("FBgn0000032"),
   allow_retry=TRUE
   ),
    question("Are these values the same or different than the padj values in the comparison between regions a1 and p1?",
   answer("Same"),
   answer("Different", correct = TRUE, "The padj describes the difference between the two specific samples you chose to look at using the `results()` command"),
   allow_retry=TRUE
   )
)
```

## Get results for a single gene

The `results()` command gives information about *all* genes.  That's a lot of data!  We can use some R commands to focus on smaller sets of those genes that we think are interesting.

To look up a particular gene that you're interested in, you can select it by name using the `filter()` command.  It's important that the name you use *exactly matches* the name of the gene in the results dataframe.  R isn't like Google, it can't guess what you meant if you spell it wrong or even capitalize it differently.  The following code finds the gene "FBgn0000251" in the results for a1 vs. p1.  The first two commands are just getting the results, like we practiced earlier.  The filter command is looking through the column GeneID for any thing that exactly matches "FBgn0000251".  Note that we stored the result in a new variable, named `gene251`, so we can easily get this gene back whenever we want it.

```{r filter-by-name-example, echo=TRUE}
temp <- results(midgut, contrast = c("condition", "a1", "p1") )
a1_vs_p1 <- formatDESeq2Results(temp)

gene251 <- filter(a1_vs_p1, GeneID=="FBgn0000251")
gene251
```

```{r filter-by-name-quiz}
quiz(caption = "Looking at a specific gene",
   question("Is this gene expressed more highly in region a1 or p1?",
   answer("higher in a1"),
   answer("higher in p1", correct=TRUE),
   allow_retry=TRUE
   ),
  question("Is this difference significant at a padj value of 0.05?  In other words, is this difference 'real' or is it probably a fluke?",
   answer("real", correct=TRUE),
   answer("probably a fluke"),
   allow_retry=TRUE
   )
)
```

Use the box below to look at FBgn0000251 and compare its expression between the a1 and Cu regions.

```{r filter-by-name-practice, exercise=TRUE}
```

```{r filter-by-name-quiz-2}
quiz(caption = "Looking at a specific gene",
   question("Is this gene expressed more highly in region a1 or Cu?",
   answer("higher in a1", correct=TRUE),
   answer("higher in Cu"),
   allow_retry=TRUE
   ),
  question("Is this difference significant at a padj value of 0.05?  In other words, is this difference 'real' or is it probably a fluke?",
   answer("real"),
   answer("probably a fluke", correct=TRUE),
   allow_retry=TRUE
   )
)
```

## Lists of Interesting Genes

While it's useful to be able to look up particular genes we're interested in, it's also valuable to get a list of all the genes that are different between two samples.  This lets us discover genes that we never would have thought to check by name.

### Get a list of genes with good padj scores

The first thing we need to do is to pick out all the genes that the computer thinks are "really" different between the two samples - those with a padj score smaller than our cutoff.  Remember, a good padj score is a small number.  This means that there is a very small chance that the differences we see between samples are just an accident.

A cutoff of 0.05 or 0.10 is often a good place to start.  We can get a list of these genes using the `filter()` command, similar to the way we pulled out single genes by name.  The following code finds only the genes that have a padj score of less than or equal to 0.05, stores the results in a new dataframe named sig_genes.  (This is an abbreviation for "significantly different genes", you could name it whatever you like).  

```{r filter-by-padj-example, echo=TRUE}
temp <- results(midgut, contrast = c("condition", "a1", "p1") )
a1_vs_p1 <- formatDESeq2Results(temp)

sig_genes <- filter(a1_vs_p1, padj <= 0.05)
sig_genes
```

Since R by default only shows us the first 1000 rows, we can use the `dim` command to find out how big our dataframe of significant genes is:

```{r, echo=TRUE}
dim(sig_genes)
```

```{r significantly-different-genes-quiz}
quiz(caption = "Looking at a significantly different genes",
  question("What percentage of all the fly genes made it onto this list?  Remember, we started out with 17559 genes.",
    answer("Less than 1%"),
    answer("About 2%"),
    answer("About 14%", correct=TRUE),
    answer("About 27%"),
    allow_retry=TRUE
   )
)
```

##  Choosing a padj cutoff

Often, a couple hundred to a couple thousand genes is a pretty comfortable number of genes to work with, it will let us look for patterns without overwhelming us.  If instead we got a very small number, we might want to loosen our cutoff (raise our minimum padj value).  Or, if we got thousands of genes, we might want to make our cutoff more strict (lower our padj value).  It depends on what we want to do with our list of genes.  If we're examining them by hand we might want a smaller, more certain list.  If we're looking for broad patterns, we might want a bigger list with more errors.

When you pick a padj cutoff, it's important to think about what it means.  In DESeq2, the padj value is the fraction of genes on the list that are probably not "really" different ("False Positives").  If we pick a padj value of 0.05, and we get a list of 100 genes, we expect 95 of those genes to be "real" and (0.05 X 100 = 5) 5 of those genes to be accidents.  If we instead pick a higher padj value of 0.10, then we expect 90 genes to be "real" and (0.10 X 100 = 10) 10 to be accidents.  Likewise, if we picked a very low padj cutoff of 0.001 then we would expect all 100 of the genes to be real (0.001 X 100 = 0.1, so less that 1 gene is expected to be an accident).

We would love to be able to have only "real" genes on the list (by picking a very low padj cutoff), but often we don't have good enough data to be able to do this (if we pick a low enough number that we expect all genes to be real, then very few genes (or maybe no genes) pass the cutoff).  And as we make the cutoff stricter, we end up throwing out a lot of real genes along with the false ones.  As data analysts, we have to decide when it's better to have a smaller list of differentially expressed genes that are more likely to be real, and when it's better to have a bigger list of genes but know that several of them aren't "really" different.

In this case, for a1 vs p1 we actually have quite good data with many significantly different genes.  From our 2492 genes, we expect 0.05 X 2492 ~ 125 of them to be false positives.  Since we have so many genes, it might be a good idea to pick a lower p-value and see how many genes we're left with.

If we try a padj cutoff of 0.01:
```{r, echo=TRUE}
sig_01_genes <- filter(a1_vs_p1, padj <= 0.01)
dim(sig_01_genes)
```
In this case, we get 1799 genes, and we expect 0.01 X 1799 = 18 of them to be false positives.  That's a pretty strong list of genes.  We got rid of 125-18 = 107 false genes (yay!).  On the other hand, we also threw out about 600 real genes (boo!)

If we try a stricter cutoff of 0.001:
```{r, echo=TRUE}
sig_001_genes <- filter(a1_vs_p1, padj <= 0.001)
dim(sig_001_genes)
```
Now we get 1278 genes and we expect 0.001 X 1278 ~ 1 to be a false positive.  So we've basically gotten rid of all false genes, but we've also thrown out ~half of the real genes as compared to using the 0.05 cutoff.

With this particular dataset, I'd probably pick a cutoff of either 0.05 or 0.01 for further computational analyis.  I think 0.001 throws away too many real genes.

Use the box below to find genes that are significantly different between the a1 and Cu regions.

```{r filter-by-padj-practice, exercise=TRUE}
```

```{r filter-by-padj-pracice-quiz}
quiz(caption = "Looking at a significantly different genes",
  question("How many genes are significantly different between the a1 and Cu regions at a padj cutoff of 0.05?",
    answer("6,631"),
    answer("3,804"),
    answer("3,257", correct=TRUE),
    answer("2,421"),
    answer("1,803"),
    allow_retry=TRUE
   ),
  question("How many genes are significantly different between the a1 and Cu regions at a padj cutoff of 0.01?",
    answer("6,631"),
    answer("3,804"),
    answer("3,257"),
    answer("2,421", correct=TRUE),
    answer("1,803"),
    allow_retry=TRUE
  )
)
```

## Gene Set Analysis with clusterProfiler

Now that we have a list of interesting genes, we can use `clusterProfiler` program to figure out what these genes do.  Like `DESeq2`, we have to do a bit of reformating to organize our data for `clusterProfiler`, and we've made a command to make that easy: `runClusterProfiler`:

1. Use `runClusterProfiler` on a list of interesting genes (in this case, genes that were significantly different between regions of the midgut).

+ `clusterProfiler` looks at a couple of different international databases to find out what each of the genes does.  It then compares your list of interesting genes to the list of all genes in the fly genome, and checks what types of genes are over-represented in your list.
+ Make sure to save the results in a new variable.

2. Use `dotPlot` to look at the results.

+ You can adjust how many results appear on the plot using `showCategory`.  It's a good idea to find out how many gene categories you have (using the `dim` command on the results from `clusterProfiler`) and then set `dotplot` to show all of them.  If this is too crowded, you can pick a smaller number for the `dotplot` command.
+ You can add a title to the plot using `title`

For this analysis, we're using a padj cutoff of 0.01.

```{r, echo=TRUE, message=FALSE, warning=FALSE, fig.height=6}
temp <- results(midgut, contrast = c("condition", "a1", "p1") )
a1_vs_p1 <- formatDESeq2Results(temp)
sig_genes <- filter(a1_vs_p1, padj <= 0.01)

a1_vs_p1_clusters <- runClusterProfiler(sig_genes)
dim(a1_vs_p1_clusters)
dotplot(a1_vs_p1_clusters, showCategory=33, title="a1 vs p1")
```

```{r clusterProfiler-quiz}
quiz(caption = "Categories of genes with differential expression",
  question("Which category has the largest number of differentially expressed genes?",
    answer("Drug metabolism - other enzymes", correct=TRUE),
    answer("Metabolism of xenobiotics by cytochrome P450"),
    answer("Cysteine and methionine metabolism"),
    answer("Arginine biosynthesis"),
    allow_retry=TRUE,
    random_answer_order = TRUE
    ),
 question("Which categories have more than 20 differentially expressed genes?  Choose more than one answer.",
    answer("Drug metabolism - other enzymes", correct=TRUE),
    answer("Metabolism of xenobiotics by cytochrome P450", correct=TRUE),
    answer("Alanine, aspartate and glutamate metabolism"),
    answer("Biosynthesis of unsaturated fatty acids"),
    allow_retry=TRUE,
    random_answer_order = TRUE
    )
)
```

In the box below, use the `runClusterProfiler` command to find what groups of genes are significantly different between the a1 and Cu regions.

* You will first need to make a list of significantly different genes.
* Look at the code above and adjust it to examine the a1 and Cu regions instead of a1 and p1.
* Use the `dim` command to find out how many groups of genes there are, and change `showCategory` in `dotplot` to match
* You may see an error message telling you that it "returned 1:many mapping between keys and columns" or some small percentage of your genes couldn't be found in the databases ("fail to map").  That's fine, it shouldn't affect results too much as long as it's a small percentage, and we always lose some genes when we compare across different databases since databases don't all have exactly the same genes.

```{r clusterProfiler-practice, exercise=TRUE, fig.height=6}
```


### Find out what genes are in a clusterProfiler group

Once you've identified an interesting category of genes using `clusterProfiler`, it may be useful to find out what those genes are.  This is a bit tricky to do with R, so we've again created a function for you, named `getClusterProfilerGenes()` that will extract the gene symbols for you.  The `getClusterProfilerGenes()` command needs two pieces of information:

1. The `clusterProfiler` results
2. The name of the group of genes you want to list. This must be in quotes and must be spelled **exactly** as it is in the `clusterProfiler` results.

For example, if we wanted to get a list of the genes in the "Fatty acid degradation" group for a1 vs p1, we would use the commands you've already learned to run clusterProfiler, then look at the dot use the `getClusterProfiler()` command:

```{r, echo=TRUE, message=FALSE, warning=FALSE}
# Run clusterProfiler
temp <- results(midgut, contrast = c("condition", "a1", "p1") )
a1_vs_p1 <- formatDESeq2Results(temp)
sig_genes <- filter(a1_vs_p1, padj <= 0.01)
a1_vs_p1_clusters <- runClusterProfiler(sig_genes)

# Look at the dotplot to see the groups of genes
dim(a1_vs_p1_clusters)
dotplot(a1_vs_p1_clusters, showCategory=33, title="a1 vs p1")

# Get the gene symbols for a group of genes
getClusterProfilerGenes(a1_vs_p1_clusters, "Fatty acid degradation")
```

This gives us a list of gene symbols.  We can then look these genes up on FlyBase to find out more about them.

Use the box below to find groups of genes that are significantly different between the a1 and Cu regions with the `runClusterProfiler` command.  Then use the `getClusterProfilerGenes` command to find out what genes are in those groups.

```{r getClusterProfilerGenes-practice, exercise=TRUE}
```

```{r clusterProfilerGenes-quiz}
quiz(caption = "Genes in clusterProfiler groups",
  question("Which of the following genes are involved in Carbon metabolism?  HINT: There are two - you must select both to get the correct answer.",
    answer("Hex-A", correct=TRUE),
    answer("Pgk", correct=TRUE),
    answer("lab"),
    answer("bgm"),
    answer("Ace"),
    allow_retry=TRUE,
    random_answer_order = TRUE
    )
)
```


## Plot a single gene across all regions

Plotting gene expression out across all regions again takes some reformatting of the data, so we've created a command to take care of that for you, named `plotAcrossRegions()`:


```{r, echo=TRUE}
plotAcrossRegions("FBgn0000003")
```

This allows us to see patterns of gene expression across all regions.  In this case, FBgn0000003 is only expressed in the a1 and Fe regions.

How do you decide what gene to look at?

1. Pick genes that were mentioned in a scientific paper that you think would be relevant, or look at genes involved in the processes identified by `clusterProfiler`.  Once you know of a gene you're interested in, use the FlyBase website to look up the FlyBase ID for a gene so you can plot the data.

2. Sort the genes by p-value (to get the most significantly different genes) or log2FoldChange (to get the genes with the biggest differences in expression).  This uses DESeq results data, so you will have to pick a particular pair of regions to compare.  The first few steps should look familiar by now.  The last command, `arrange`, sorts the genes by a particular column.  This command sorts smallest to largest; if you want to see the biggest values, you can use `tail()` to see the last few rows.  Then use the FlyBase ID to plot the data for the top ranked genes across all regions.

```{r, echo=TRUE}
temp <- results(midgut, contrast = c("condition", "a1", "p1") )
a1_vs_p1 <- formatDESeq2Results(temp)
a1_vs_p1_sorted <- arrange(a1_vs_p1, padj)
head(a1_vs_p1_sorted)
```

In this case, FBgn0035665 has the lowest padj score when comparing regions a1 and p1.  Let's plot it across all regions:

```{r, echo=TRUE}
plotAcrossRegions("FBgn0035665")
```

Now we can see that it's generally more highly expressed towards the anterior and is expressed at a low level in posterior regions, but there is also high expression in the Fe region.

## Try it Out!

Now that you know how to find the categories of genes that are differentially expressed betweens samples, there are a few different ways you can explore scientific questions about nutrient metabolism in the Drosophila midgut.

1. Repeat the `runClusterProfiler` analysis using different regions, to get a high level picture of which types of genes are differentially expressed across different regions.  Are you seeing similar results to the paper, or are you seeing something different?
2. Plot out expression of a single gene across *all* regions.  Does it show a pattern you'd expect from what you knew already, or not?  Do this for a few different genes from the same category (e.g. sugar metabolism genes) to get a feel for how they're behaving.
3. Find out more about these categories of genes.  Google and the FlyBase website are great places to start!  Find out what genes are in a particular category - what genes are "galactose metabolism" genes?  How do they work together to metabolize galactose?

Use the codeblocks below to explore the differential expression data.  There are several blocks for your convenience, use as many as you like.

```{r freeplayd-exercise, exercise=TRUE, exercise.eval=FALSE}

```

```{r freeplayd2-exercise, exercise=TRUE, exercise.eval=FALSE}

```

```{r freeplayd3-exercise, exercise=TRUE, exercise.eval=FALSE}

```

```{r freeplayd4-exercise, exercise=TRUE, exercise.eval=FALSE}

```

```{r freeplayd5-exercise, exercise=TRUE, exercise.eval=FALSE}

```

```{r freeplayd6-exercise, exercise=TRUE, exercise.eval=FALSE}

```