---
title: "Tutorial -- Comparing means under controlled conditions"
author: "Jacques van Helden"
date: "`r Sys.Date()`"
output:
  html_document:
    code_folding: hide
    fig_caption: yes
    highlight: zenburn
    self_contained: no
    theme: cerulean
    toc: yes
    toc_depth: 3
    toc_float: yes
  beamer_presentation:
    colortheme: dolphin
    fig_caption: yes
    fig_height: 6
    fig_width: 7
    fonttheme: structurebold
    highlight: tango
    incremental: no
    keep_tex: no
    slide_level: 2
    theme: Montpellier
    toc: yes
  ioslides_presentation:
    colortheme: dolphin
    fig_caption: yes
    fig_height: 6
    fig_width: 7
    fonttheme: structurebold
    highlight: tango
    smaller: yes
    toc: yes
    widescreen: yes
  word_document:
    toc: yes
    toc_depth: '3'
  pdf_document:
    fig_caption: yes
    highlight: zenburn
    toc: yes
    toc_depth: 3
  slidy_presentation:
    fig_caption: yes
    fig_height: 6
    fig_width: 7
    highlight: tango
    incremental: no
    keep_md: yes
    smaller: yes
    theme: cerulean
    toc: yes
    widescreen: yes
font-import: http://fonts.googleapis.com/css?family=Risque
subtitle: Diplôme Universitaire en Bioinformatique Intégrative (DUBii)
font-family: Garamond
transition: linear
---

```{r include=FALSE, echo=FALSE, eval=TRUE}
library(knitr)

options(width = 300)
knitr::opts_chunk$set(
  fig.width = 7, fig.height = 5, 
  fig.path = 'figures/mean-compa-tests_',
  fig.align = "center", 
  size = "tiny", 
  echo = TRUE, eval = TRUE, 
  warning = FALSE, message = FALSE, 
  results = TRUE, comment = "")
# knitr::asis_output("\\footnotesize")

```


## Required libraries

Load the following libraries or install them if required. 

```{r load_libraries}
require(knitr)
```


## Introduction

This tutorial aims at providing an empirical introduction to the application of mean comparison tests to omics data.

The goals include

- revisiting the **basic underlying concepts** (sampling, estimation, hypothesis testing, risks...);
- perceiving the problems that arise when a test of hypothesis is applied on several thousand of features (**multiple testing**);
- introducing some methods to circumvent these problems (**multiple testing corrections**);
- using graphical representations in order to grasp the results of several thousand tests in a winkle of an eye: 

    - p-value histogram
    - MA plot
    - volcano plot
  

The whole tutorial will rely on artificial data generated by drawing random numbers that follow a given distribution  of probabilities (in this case, the normal distribution, but other choices could be made afterwards).


The tutorial will proceed progressively: 

1. Generate a multivariate table (with *individuals* in columns and *features* in rows) and fill it with random data following a given distribution of probability.

2. Measure different descriptive parameters on the sampled data.

3. Use different graphical representations to visualise the data distribution. 

4. Run a test of hypothesis on a given feature.

5. Run the same test of hypothesis on all the features. 

6. Use different graphical representations to summarize the results of all the tests.

7. Apply different corrections for multiple testing (Bonferroni, Benjamini-Hochberg, Storey-Tibshirani q-value). 

8. Compare the performances of the test depending on the chosen multiple testing correction. 

## Experimental setting

Well, by "experimental" we mean here that we will perform *in silico* experiments. 

Let us define the parameters of our analysis. We will generate data tables of artificial data following normal distributions, with either different means (tests under $H_1$) or equal means (tests under $H_0$). 

We will do this for a number of features  $m_0=10,000$ (number of rows in the "$H_0$" data table), which could be considered as replicates to study the impact of sampling fluctuations. 

In a second time (not seen here) we could refine the script by running a sampling with a different mean for each feature, in order to mimic the behaviour of omics datasets (where genes have different levels of expression, proteins and metabolite different concentrations). 


### Parameters

| Parameter | Value | Description | Value |
|---------|-------------|---------|
| $n_1$  |  ${2, 3, 4, 8, 16, 32, 64}$ | size of the sample from the first population. individual choice. Each participant will choose a given sample size |
| $n_2$  |  $= n_1$ | size of the sample from the second population |
| $\mu_1$ | 10 or 7 | mean of the first population. each participant will chose one value |
| $\mu_2$ | 10 | mean of the second population |
| $\sigma_1$ | 2 | Standard deviation of the first population |
| $\sigma_2$ | 3 | Standard deviation of the second population |
| $m_0$$ | $10,000$ | number of features under null hypothesis |


### Sample sizes

Each participant will choose a different sample size among the following values: $n  \in {2, 3, 4, 5, 8, 16, 64}$. Noteowrthy, many omics studies are led with a very small number of replicates (frequently 3), so that it will be relevant to evaluate thee impact of the statistical sample size (number of replicates) on the sensibility (proportion of cases declared positive under $H_1$).  


## Performances of the tests

We will measure the performances of a test by running $r = 10,000$ times under $H_0$, and $r = 10,000$ times under $H_1$. 

- count the number of $FP$, $TP$, $FN$, $TN$
- compute the derived statistics: $FPR$, $FDR$ and $Sn$

$$FPR = \frac{FP}{m_0} = \frac{FP}{FP + TN} $$

$$FDR = \frac{FP}{\text{Positive}} = \frac{FP}{FP + TP} $$

$$Sn = \frac{TP}{m_1} = \frac{TP}{TP + FN} $$
$$PPV = \frac{TP}{\text{Positive}} = \frac{TP}{TP + FP}$$

## Recommendations


### Coding recommendations

1. Choose a consistent coding style, consistent with a reference  style guide (e.g. [Google R Syle Guide](https://google.github.io/styleguide/Rguide.html)). In particular: 

    - For variable names, use the camel back notation with a leading lowercase (e.g. `myDataTable`) rather than the dot separator (`my.data.table`)
    - For variable names, use the camel back notation with a leading uppercase (e.g. `MyMeanCompaTest`). 

2. Define your variables with explicit names (sigma, mu rather than a, b, c, ...).

3. Comment your code

    - indicate what each variable represents
    - before each segment of code, explain what it will do

4. Ensure consistency between the code and the report $\rightarrow$ inject the actual values of the R variables in the markdown. 

### Scientific recommendations

1. Explicitly formulate the statistical hypotheses before running a test.

2. Discuss the assumptions underlying the test: are they all fulfilled? If not explain why (e.g. because we want to test the impact of this parameter, ...) .


## Tutorial


### Part 1: generating random datasets

#### Define your parameters

Write a block of code to define the parameters specified aboce. 

Note that each participant will have a different value for the sample sizes ($n_1, n_2$). 

```{r parameters}
#### Defining the parameters ####

## Sample sizes. 
## This parameter should be defined individually for each participant
n1 <- 16 # sample size for the first group
n2 <- 16 # ssample size for second group

## First data table
m <- 1000 # Number of features
mu1 <- 7 # mean of the first population
mu2 <- 10 # mean of the second population

## Standard deviations
sigma1 <- 2 # standard deviation of the first population
sigma2 <- 3 # standard deviation of the second population

## Significance threshold. 
## Note: will be applied successively on the different p-values
## (nominal, corrected) to evaluate the impact
alpha <- 0.05

```

The table below lists the actual values for my parameters (your values for $n_i$ will be different). 

| Parameter | Value | Description | 
|-----------|-------|-------------------|
| $\mu_1$ | `r mu1` | Mean of the first population | 
| $\mu_2$ | `r mu2` | Mean of the second population | 
| $\sigma_1$ | `r sigma1` | Standard deviation of the first population | 
| $\sigma_2$ | `r sigma1` | Standard deviation of the second population | 
| $n_1$ | `r n1` | Sample size for the first group | 
| $n_2$ | `r n2` | Sample size for the second group | 


#### Generate random data sets

Exercise: 

- Generate an data frame named `group1` which with $m_0$ rows (the number of features under $H_0$, defined above) and $n_1$ columns (sample size for the first population), filled with random numbers drawn from the first population. 

- Name the columns with labels indicating the group and sample number: `g1s1`, , `g1s2` ... with indices from 1 $n_1$.

- Name the rows to denote the feature numbers: `ft1`, `ft2`, ... with indices from 1 to $m_0$. 

- Check the dimensions of `group1`.

- Print its first and last rows to check its content and the row and column names.

- Generate a second data frame named `group2` for the samples drawn from the second population with the appropriate size, and name the columns and rows consistently. 


**Useful functions**

- `rnorm()`
- `matrix()`
- `data.frame()`
- `paste()`
- `paste0()`
- `colnames()`
- `rownames()`
- `dim()`

**Trick**: 

- the funciton `matrix()` enables us to define the number of columns and the number of rows, 
- the function `data.frames()` does not enable this, but it can take as input a matrix, from which it will inherit the dimensions

**Solution** (click on the "code" button to check the solution). 

```{r random_sampling}
#### Generate two tables (one per population) with the random samples ####

## Info: for didactiv purpose we will use a progressive approach to generate 
## the data the first group,  and a more compact formulation (in one command) 
## for the second group.

## Dataset under H0, with a progressive approach

## Generate a vector of size m*n1 with all the random values 
## for each feature and each infividual
normVector <- rnorm(n = m * n1, mean = mu1, sd = sigma1)

## Wrap the vector in an m x n1 matrix
normMatrix <- matrix(data = normVector,
                     nrow = m, 
                     ncol = n1)

## Create a data frame with the content of this matrix
group1 <- data.frame(normMatrix)

## Set the column and row names
colnames(group1) <- paste(sep = "", "g1s", 1:n1)
rownames(group1) <- paste(sep = "", "ft", 1:m)

## Dataset under H1: use a more direct approach, compact the 3 steps in a single command
group2 <- data.frame(
  matrix(data = rnorm(n = m * n2, mean = mu2, sd = sigma2),
         nrow = m, 
         ncol = n2))
colnames(group2) <- paste(sep = "", "g2s", 1:n2)
rownames(group2) <- paste(sep = "", "ft", 1:m)

```

Check the content of the data table from the first group. 

```{r}
dim(group1)
kable(head(group1))
kable(tail(group1))
```

Check the content of the data table from the second group. 

```{r}
dim(group2)
kable(head(group2))
kable(tail(group2))
```


### Checking the properties of the data tables

Check the properties of the columns (individuals, e.g. biological samples) and rows (features, e.g. genes or proteins or metabolites) of the data tables. 

- Use the `summary()` funciton to quickly inspect the column-wise properties (statistics per individual).

- Use `apply()`,  `mean()` and `sd()` to generate a data frame that collects

    - the column-wise parameters (statistics per feature)
    - the row-wise  parameters (statistics per feature).  

```{r}

## Column-wise summaries
summary(group1)
summary(group2)

## Columns-wise statistics
colStats <- data.frame(
    m1 = apply(group1, MARGIN = 2, mean),
    m2 = apply(group2, MARGIN = 2, mean),
    s1 = apply(group1, MARGIN = 2, sd),
    s2 = apply(group2, MARGIN = 2, sd)
  )

## Row-wise statistics
rowStats <- data.frame(
    m1 = apply(group1, MARGIN = 1, mean),
    m2 = apply(group2, MARGIN = 1, mean),
    s1 = apply(group1, MARGIN = 1, sd),
    s2 = apply(group2, MARGIN = 1, sd)
  )

```

#### Add a column with the difference between sample means for each feature.


**Tips: ** this can be done in a single operation. 


```{r}
rowStats$diff <- rowStats$m1 - rowStats$m2 
```

## Merge the two groups in a single data frame

In omics data analysis, we typically obtain 

- a single data table with all the individuals (biological samples) of all the groups
- a table containing the metadata, i.e. the information about each individual (biological sample)

Two methods could be envisaged a priori:

- `cbind()`, which simply binds the columns of two input tables. This can however be somewhat dangerous, because it assumes that these two tables have the same number of rows (features) and that these rows contain information about the same features *in the same order*. However, in some cases we dispose of data tables coming from different sources (or software tools), where the features (genes, proteins, metabolites) might have a partial overlap rather than an exact 1-to-1 correspondence, and, even when the feature sets are the same, they might be presented in different orderes.

- A much safer approach is thus to use `merge()`, and to explicitly indicate one or several colummns on which the features from the two table will be unified 

In our case, the two data tables only contain numeric data, and the identification will be done on the row names (which contain the feature identifiers ft1, ft2, ... that we defined above). In some cases, we will have to merge data table containing different informations, including a column with identifiers (or maybe additional columns e.g. the genotype, conditions, ...) and we will use internal columns of the table to unify their rows.  

We will create such a structure for further analysis/

```{r merge_groups}

## Read the help of the merge() function
# ?merge

## Create a data frame with the merged values
dataTable <- merge(x = group1, y = group2, by = "row.names")
# dim(dataTable) # NOTE: the merged table contains n1 + n2 columns + one additional column Row.names

## Use the Row.names column as names for the merged table
row.names(dataTable) <- dataTable$Row.names
dataTable <- dataTable[, -1] ## Suppress the first column which contained the row names
# dim(dataTable)

## Generate a metadata table
metaData <- data.frame(
  sampleNames = colnames(dataTable),
  sampleGroup = c(rep("g1", length.out = n1), rep("g2", length.out = n2)),
  sampleColor = c(rep("#DDEEFF", length.out = n1), rep("#FFEEDD", length.out = n2))
)

kable(metaData, caption = "Metadata table")

```

## Visualisation of the data


### Box plot of the sampled values

```{r sample_boxplot, fig.width=8, fig.height=12, out.width="60%", fig.cap="Box plot of the sampled values"}
#### Boxplot of the values per sample ####
boxplot(dataTable, 
        col = as.vector(metaData$sampleColor),
        horizontal = TRUE, 
        las = 1, 
        main = "Sampled values", 
        xlab = "X")

```

### Violin plot of the sampled values

```{r sample_vioplot, fig.width=8, fig.height=12, out.width="60%", fig.cap="Violin plot of the sampled values"}
#### Boxplot of the values per sample ####
vioplot::vioplot(dataTable, 
                 col = as.vector(metaData$sampleColor),
                 horizontal = TRUE, 
                 las = 1, 
                 main = "Sampled values", 
                 xlab = "X")

```


### Histogram of the sampled values

- Draw two histograms with all the values group 1 and group2, respectively. 

**Tip: ** use `mfrow()` to display the histogram above each other, and set the limits of the abscissa (x axis) to the same value. 

```{r hist_g1_g2, fig.width = 8, fig.heigh = 8, out.width="60%"}

xlim  <- range(append(group1, group2))

## Compute mean and sd for all the samples of group1 and group2, resp
m1 <- mean(unlist(group1))
m2 <- mean(unlist(group2))
s1 <- sd(unlist(group1))
s2 <- sd(unlist(group2))

par(mfrow = c(2,1))
## Plot histogram of Group1 values, and get the values in a variable named h1
h1 <- hist(x = unlist(group1), breaks = 100, col = "#DDEEFF", border = "blue",
     main = "Group 1", xlab = "X", las = 1, xlim = xlim)
abline(v = m1, col = "blue", lwd = 2) ## mark the mean of all samples
arrows(x0 = m1, x1 = m1 + s1, 
       y0 = max(h1$counts)/2, y1 = max(h1$counts)/2,
       length = 0.07, angle = 20, col = "blue", lwd = 2, code = 2)
legend("topright", 
       legend = c(paste0("m1 = ", round(digits = 2, m1)), 
                  paste0("s1 = ", round(digits = 2, s1))))

## Plot histogram of Group2 values, and get the values in a variable named h2
h2 <- hist(x = unlist(group2), breaks = 100, col = "#FFEEDD", border = "brown", 
     main = "Group 2", xlab = "X", las = 1, xlim = xlim)
abline(v = mean(unlist(group2)), col = "brown", lwd = 2)
arrows(x0 = m2, x1 = m2 + s2, 
       y0 = max(h2$counts)/2, y1 = max(h2$counts)/2,
       length = 0.07, angle = 20, col = "brown", lwd = 2, code = 2)
legend("topright", 
       legend = c(paste0("m2 = ", round(digits = 2, m2)),
                  paste0("s2 = ", round(digits = 2, s2))))

par(mfrow = c(1,1))
```


### Sampling distribution of the means

Draw histogram with the sampling distribution of the means in the respective groups. 

```{r sampling_distrib_mean, fig.width = 8, fig.heigh = 8, out.width="60%", fig.cap="Sampling distribution of the mean"}

# xlim  <- range(append(colStats$m1, colStats$m2))

## Compute the parameters (mean, sd) of the sampling distributions of the means
mm1 <- mean(rowStats$m1)
mm2 <- mean(rowStats$m2)
se1 <- sd(rowStats$m1)
se2 <- sd(rowStats$m2)

par(mfrow = c(2,1))
## Plot histogram of Group1 values, and get the values in a variable named h1
h1 <- hist(x = rowStats$m1, breaks = 100, col = "#AACCFF", border = "#AACCFF", 
     main = "Group 1", xlab = "X", las = 1, xlim = xlim)
abline(v = mm1, col = "blue", lwd = 2) ## mark the mean of all samples
arrows(x0 = mm1, x1 = mm1 + se1, 
       y0 = max(h1$counts)/2, y1 = max(h1$counts)/2,
       length = 0.07, angle = 20, col = "blue", lwd = 2, code = 2)
legend("topright", 
       legend = c(paste0("mm1 = ", round(digits = 2, mm1)),
                  paste0("se1 = ", round(digits = 2, se1))))

## Plot histogram of Group2 values, and get the values in a variable named h2
h2 <- hist(x = rowStats$m2, breaks = 100, col = "#FFCCAA", border = "#FFCCAA", 
     main = "Group 2", xlab = "X", las = 1, xlim = xlim)
abline(v = mean(unlist(group2)), col = "brown", lwd = 2)
arrows(x0 = mm2, x1 = mm2 + se2, 
       y0 = max(h2$counts)/2, y1 = max(h2$counts)/2,
       length = 0.07, angle = 20, col = "brown", lwd = 2, code = 2)
legend("topright", 
       legend = c(paste0("mm2 = ", round(digits = 2, mm2)),
                  paste0("se2 = ", round(digits = 2, se2))))

par(mfrow = c(1,1))
```


- Compare the standard deviations measures in the sampled values, and in the feature means. Do they differ ? Explain why. 

**Answer: ** the standard deviation of the sample means corresponds to the **standard error**.  


## Part 2: hypothesis testing

### Run Student test on a given feature

Since we are interested by differences in either directions, we run a two-tailed test. 

Hypotheses: 

$$H_0: \mu_1 = \mu2$$

$$H_1: \mu_1 \ne \mu2$$
**Exercise:** pick up a given feature (e.g. the $267^{th}$) and run a mean comparison test. Choose the parameters according to your experimental setting. 

**Tips:**

- `t.test()`
- you need to choose the test depending on whether the two populations have equal variance (Student) or not (Welch). Since we defined different values for the populations standard deviations ($\sigma_1$, $\sigma_2$), the choise is obvious. 

```{r one_t_test}
i <- 267 ## Pick up a given feature, arbitrarily

## Select the values for this feature in the group 1 and group 2, resp. 
## Tip: I use unlist() to convert a single-row data.frame into a vector
x1 <- unlist(group1[i, ]) 
x2 <- unlist(group2[i, ])

## Run Sudent t test on one pair of samples
t.result <- t.test(
  x = x1, y = x2, 
  alternative = "two.sided", var.equal = FALSE)

## Print the result of the t test
print(t.result)

## Compute some additional statistics about the samples
mean1 <- mean(x1) ## Mean of sample 1
mean2 <- mean(x2) ## Mean of sample 2
d <- mean2 - mean1 ## Difference between sample means

```


###  Interpret the result

The difference between sample means was $d = `r round(digits = 2, d)`$. 


The $t$ test computed the $t$ statistics, which standardizes this observed distance between sample means relative to the estimated variance of the population, and to the sample sizes. With the random numbers generated above, the value is $t_{obs} = `r round(digits = 4, t.result$statistic)`$. 

The corresponding p-value is computed as the sum of the area of the left and right tails of the Student distribution, with $\nu = n_1 + n_2 - 2 = `r t.result$parameter`$ degrees of freedom. It  indicates the probability of obtaining by chance -- ***under the null hypothesis*** -- a result at least as extreme as the one we observed.


In our case, we obtain $p = P(T > |t_{obs}| = P(T > `r round(digits=4, abs(t.result$statistic))`) = `r signif(digits = 3, t.result$p.value)`$. This is higher than our threshold $alpha = `r alpha`$. We thus accept the null hypothesis. 


### Replicating the test for each feature

In R, loops are quite inefficient, and it is generally recommended to directly run the computations on whole vectors (R has been designed to be efficient for this), or to use specific functions in order to apply a given function each row / column of a table, or to each element of a list. 

For the sake of simplicity, we will first show how to implement a simple but inefficient code with a loop. In the advanced course (STATS2) will see how to optimize the speed with the `apply()` function. 


```{r t_test_loop}

## Define the statistics we want to collect
resultColumns <- c("i",       # iteration number
                    "m1",      # first sample mean
                    "m2",      # second sample mean
                    "s1",      # sd estimation for the first population
                    "s2",      # sd estimation for the second population
                    "d",       # difference between sample means
                    "t",       # test statistics
                    "df",      # degrees of freedom
                    "p.value"  # nominal p-value
                    )

## Instantiate a result table to store the results
resultTable <- data.frame(matrix(nrow = m, ncol = length(resultColumns)))
colnames(resultTable) <- resultColumns # set the column names
# View(resultTable) ## Check the table: it contians NA values
  
## Iterate random number sampling followed by t-tests
## Use the funciton system.time() to measure the elapsed time
## This function is particular: you can also use it with curly brackets in order to enclose a block of several lines
time.iteration <- system.time(
  for (i in 1:m) {
    ## Generate two vectors containing the values for sample 1 and sample 2, resp.
    x1 <- unlist(group1[i, ]) ## sample 1 values
    x2 <- unlist(group2[i, ]) ## sample 2 values
    
    ## Run the t test
    t.result <- t.test(
      x = x1, y = x2, 
      alternative = "two.sided", var.equal = FALSE)
    # names(t.result)
    
    ## Collect the selected statistics in the result table
    resultTable[i, "i"] <- i
    resultTable[i, "t"] <- t.result$statistic
    resultTable[i, "df"] <- t.result$parameter
    resultTable[i, "p.value"] <- t.result$p.value
    
    ## Compute some additional statistics about the samples
    resultTable[i, "m1"] <- mean(x1) ## Mean of sample 1
    resultTable[i, "m2"] <- mean(x2) ## Mean of sample 2
    resultTable[i, "s1"] <- sd(x1) ## Standard dev estimation for population 1
    resultTable[i, "s2"] <- sd(x2) ## Standard dev estimation for population 1
    resultTable[i, "d"] <- resultTable[i, "m1"] - resultTable[i, "m2"]  ## Difference between sample means
  }
  #}
  ## View(resultTable)
)  

print(time.iteration)
```
  
### Distribution of the observed differences for the $`r m`$ iterations of the test

```{r diff_histo, fig.width=8, fig.height=8, out.width="50%"}
par(mfrow = c(2, 1))
#head(resultTable)

## Compute the maximal abslute value of difference to get a centered abcsissa.
## This enables to highlight whether the differences are positive or negative. 
max.diff <- max(abs(resultTable$d))

## Draw an histogram of the observed differences
hist(resultTable$d, 
     breaks = 100, 
     col = "#DDFFEE", 
     border = "#44DD88", 
     las = 1, 
     xlim = c(-max.diff, max.diff), ## Make sure that the graph is centered on 0
     main = "Differences between means",
     xlab = "Difference between means",
     ylab = "Number of tests")
abline(v = 0)

max.t <- max(abs(resultTable$t))
hist(resultTable$t, 
     breaks = 100, 
     col = "#DDFFEE", 
     border = "#44DD88", 
          las = 1,
     xlim = c(-max.t, max.t), ## Make sure that the graph is centered on 0
     main = "T statistics",
     xlab = "T-test statistic",
     ylab = "Number of tests")
abline(v = 0)

par(mfrow = c(1, 1))

```


### P-value histogram

```{r pval_histo, fig.width=7, fig.height=5, out.width="50%"}

## Choose a color depending on whether we are under H0 (grey) or H1 (pink)
if (mu1 == mu2) {
  histColor <- "#DDDDDD"
  histBorder <- "#888888"
} else {
  histColor <- "#FFCCCC"
  histBorder <- "#DD8888"
}

## Draw an histogram of p-values with 20 bins
hist(resultTable$p.value, 
     breaks = 20,
     col = histColor, 
     border = histBorder, 
     las = 1,
     main = "P-value histogram",
     xlab = "T-test P-value",
     ylab = "Number of tests")

## Draw a horizontal line indicating the number of tests per bin that would be expected under null hypothesis
abline(h = m / 20, col = "darkgreen", lwd = 2)
```

## Creating a function to reuse the same code with different parameters

Depending on the selected task in the assignments above, we will run different tests with different parameters and compare the results. The most rudimentary way to do this is top copy-paste the chunk of code above for each test and set of parameters required for the assigned tasks. 

However, having several copies of an almost identical block of code is a very bad pracrice in programming, for several reasons

- lack of readability: the code rapidly becomes very heavy;
- difficulty to maintain: any modification has to be done on each copy of the chunk of code;
- risk for consistency: this is a source of inconsistency, because at some moment we will modify one copy and forget another one.

A better practice is to define a **function** that encapsulates the code, and enables to modify the parameters by passing them as **arguments*. Hereafter we define a function that 

- takes the parameters of the analysis as arguments

    - population means $\mu_1$ and $\mu_2$, 
    - population standard deviations $\sigma_1$ and $\sigma_2$, 
    - sample sizes $n_1$ and $n_2$,
    - number of iterations $r$.
    
- runs $r$ iterations of the t-test with 2 random samples, 
- returns the results in a table with one row per iteration, and one column per resulting statistics (observed $t$ score, p-value, difference between means, ...);


```{r iterate_t_test_function}
#### Define a function that runs r iterations of the t-test ####

#' @title Repeat a T test with random numbers drawn from a normal distribution
#' @param mu1 mean of the first population
#' @param mu2 mean of the second population
#' @param sigma1 standard deviation of the first population
#' @param sigma2 standard deviation of the second population
#' @param n1 first sample size
#' @param n2 second sample size
#' @param m number of repetitions of the tests
#' @param ... additional parameters are passed to t.test(). This enables to set var.equal, alternative, ...
#' @return  a table with one row per repetition of the test, and one column per statistics
IterateTtest <- function(mu1,    
                         mu2,
                         sigma1,
                         sigma2, 
                         n1,     
                         n2,     
                         m,    
                         ...) {
  
  
  ## Define the statistics we want to collect
  resultColumns <- c("i",       # iteration number
                      "m1",      # first sample mean
                      "m2",      # second sample mean
                      "s1",      # sd estimation for the first population
                      "s2",      # sd estimation for the second population
                      "d",       # difference between sample means
                      "t",       # test statistics
                      "df",      # degrees of freedom
                      "p.value"  # nominal p-value
  )
  
  ## Instantiate a result table to store the results
  resultTable <- data.frame(matrix(nrow = m, ncol = length(resultColumns)))
  colnames(resultTable) <- resultColumns # set the column names
  # View(resultTable) ## Check the table: it contians NA values
  
  ## Iterate random number sampling followed by t-tests
  for (i in 1:m) {
    ## Generate two vectors containing the values for sample 1 and sample 2, resp.
    x1 <- rnorm(n = n1, mean = mu1, sd = sigma1) ## sample 1 values
    x2 <- rnorm(n = n2, mean = mu2, sd = sigma2) ## sample 2 values
    
    ## Run the t test
    t.result <- t.test(
      x = x1, y = x2, 
      alternative = "two.sided", var.equal = TRUE)
    # names(t.result)
    
    ## Collect the selected statistics in the result table
    resultTable[i, "i"] <- i
    resultTable[i, "t"] <- t.result$statistic
    resultTable[i, "df"] <- t.result$parameter
    resultTable[i, "p.value"] <- t.result$p.value
    
    ## Compute some additional statistics about the samples
    resultTable[i, "m1"] <- mean(x1) ## Mean of sample 1
    resultTable[i, "m2"] <- mean(x2) ## Mean of sample 2
    resultTable[i, "s1"] <- sd(x1) ## sd estimate for population 1
    resultTable[i, "s2"] <- sd(x2) ## sd estimate for population 2
    resultTable[i, "d"] <- resultTable[i, "m1"] - resultTable[i, "m2"] ## Difference between sample means
  }
  
  #### Compute multiple testing corrections ####

  ## E-value
  ## Expected number of FP
  resultTable$e.value <- resultTable$p.value * nrow(resultTable)
  
  ## Family-Wise Error Rate (FWER)
  ## Probability to have at least one FP given the p-value
  resultTable$FWER <- pbinom(q = 0, 
                             size = nrow(resultTable), 
                             prob = resultTable$p.value, 
                             lower.tail = FALSE)

  
  ## Multiple testing corrections
  for (method in p.adjust.methods) {
    resultTable[, method] <- p.adjust(resultTable$p.value, method = method)
  }
  
  # ## Storey - Tibshirani q-value
  # qvalResult <- qvalue::qvalue(p = unlist(resultTable$p.value))
  # resultTable$q.value <- 
  
  return(resultTable) ## This function returns the result table
}

```


We can now use this function to iterate the $t$ test with the parameters we want.
Let us measure the running time

```{r}

## Some tests under H1
system.time(
  tTestTableH1 <- IterateTtest(mu1 = 7, mu2 = 10, sigma1 = 2, sigma2 = 3, n1 = 16, n2 = 16, m = 1000)
)

## Some tests under H0
system.time(
  tTestTableH0 <- IterateTtest(mu1 = 10, mu2 = 10, sigma1 = 2, sigma2 = 3, n1 = 16, n2 = 16, m = 1000)
)
```


This function can then be used several times, with different values of the parameters. 

```{r itrerate_t_test_runs}

## What happens when  the two means are equal (under the null hypothesis)
testH0 <- IterateTtest(mu1 = 10, mu2 = 10, sigma1 = sigma1, sigma2 = sigma2, n1 = n1, n2 = n2, m = m, var.equal = FALSE, alternative = "two.sided")


## Test increasing values of the difference between population means (delta)
delta0.1 <- IterateTtest(mu1 = mu1, mu2 = mu1 + 0.1, sigma1 = sigma1, sigma2 = sigma2, n1 = n1, n2 = n2, m = m, var.equal = FALSE, alternative = "two.sided")

delta0.5 <- IterateTtest(mu1 = mu1, mu2 = mu1 + 0.5, sigma1 = sigma1, sigma2 = sigma2, n1 = n1, n2 = n2, m = m, var.equal = FALSE, alternative = "two.sided")

delta1 <- IterateTtest(mu1 = mu1, mu2 = mu1 + 1, sigma1 = sigma1, sigma2 = sigma2, n1 = n1, n2 = n2, m = m, var.equal = FALSE, alternative = "two.sided")

delta2 <- IterateTtest(mu1 = mu1, mu2 = mu1 + 2, sigma1 = sigma1, sigma2 = sigma2, n1 = n1, n2 = n2, m = m, var.equal = FALSE, alternative = "two.sided")

delta3 <- IterateTtest(mu1 = mu1, mu2 = mu1 + 3, sigma1 = sigma1, sigma2 = sigma2, n1 = n1, n2 = n2, m = m, var.equal = FALSE, alternative = "two.sided")


```


## P-value histograms

```{r pval_histogram_function}
## Define a function that rdraws the p-value histogram 
## based on the result table of t-test iterations
## as produced by the iterate.t.test() function.
pvalHistogram <- function(
  resultTable, ## required input (no default value): the result table from iterate.t.test()
  main = "P-value histogram",  ## main title (with default value)
  alpha = 0.05, ## Significance threshold
  ... ## Additional parameters, which will be passed to hist()
  ) {
  
  ## Plot the histogram
  hist(resultTable$p.value, 
       breaks = seq(from = 0, to = 1, by = 0.05),
       las = 1, 
       xlim = c(0,1),
       main = main,
       xlab = "T-test P-value",
       ylab = "Number of tests", 
       ...)
  
  ## Draw a horizontal line indicating the number of tests per bin that would be expected under null hypothesis
  abline(h = m / 20, col = "darkgreen", lwd = 2)
  abline(v = alpha, col = "red", lwd = 2)
  
  ## Compute the percent of positive and negative results``
  nb.pos <- sum(resultTable$p.value < alpha)
  nb.neg <- m - nb.pos
  percent.pos <- 100 * nb.pos / m
  percent.neg <- 100 * nb.neg / m
  
  ## Add a legend indicating the percent of iterations declaed positive and negative, resp.
  legend("topright", 
         bty = "o", 
         bg = "white",
         cex = 0.7,
         legend = c(
           paste("m = ", nrow(resultTable)),
           paste("N(+) = ", nb.pos),
           paste("N(-) =", nb.neg),
           paste("pc(+) = ", round(digits = 2, percent.pos)),
           paste("pc(-) =", round(digits = 2, percent.neg))
         ))
}

```

```{r pval_histograms, fig.width=8, fig.height=8, out.width="90%", fig.cap="P-value historams of the multiple test results with increasing effect sizes."}
par(mfrow = c(3, 2)) ## Prepare 2 x 2 panels figure
pvalHistogram(testH0, main = "under H0 (mu1 = mu2)", col = "#DDDDDD", border = "#DDDDDD")
pvalHistogram(delta0.1, main = "mu1 - mu2 = 0.1", col = "#FFDDBB")
pvalHistogram(delta0.5, main = "mu1 - mu2 = 0.5", col = "#FFDDBB")
pvalHistogram(delta1, main = "mu1 - mu2 = 1", col = "#FFDDBB")
pvalHistogram(delta2, main = "mu1 - mu2 = 2", col = "#FFDDBB")
pvalHistogram(delta3, main = "mu1 - mu2 = 3", col = "#FFDDBB")
par(mfrow = c(1, 1)) ## Restore single-panel layout for next figures

```


## Volcano plots

```{r volcanoplot_function}
#### Define VolcanoPlot() function ####
VolcanoPlot <- function(resultTable, 
                        alpha = 0.05, 
                        ...) {

  
  ## Collect effect size values for the X axis  
  effectSize <- resultTable$d
  maxXabs <- max(abs(effectSize))
  xmax <- signif(digits = 1, maxXabs)

  ## Collect p-values for the Y axis
  log10Pval <- -log10(resultTable$p.value)
    
  plot(effectSize, 
       log10Pval,
       xlim = c(-xmax, xmax),
       lax = 1,
       xlab = "Effect size",
       ylab = "-log10(Pval)",
       ...
       )
  abline(v = 0) 
  abline(h = 0)
  abline(h = -log10(alpha), col = "red")
  
  ## Compute the percent of positive and negative results``
  nb.pos <- sum(resultTable$p.value < alpha)
  nb.neg <- m - nb.pos
  percent.pos <- 100 * nb.pos / m
  percent.neg <- 100 * nb.neg / m
  

  ## Add a legend indicating the percent of iterations declaed positive and negative, resp.
  legend("topright", 
         bty = "o", 
         bg = "white",
         cex = 0.7,
         legend = c(
           paste("m = ", nrow(resultTable)),
           paste("N(+) = ", nb.pos),
           paste("N(-) =", nb.neg),
           paste("pc(+) = ", round(digits = 2, percent.pos)),
           paste("pc(-) =", round(digits = 2, percent.neg))
         ))

}

```

```{r volcano_plots, fig.width=8, fig.height=8, out.width="90%", fig.cap="Volcano plots of the multiple test results with increasing effect sizes."}
par(mfrow = c(3,2))

VolcanoPlot(testH0, main = "under H0 (mu1 = mu2)", col = "#DDDDDD")
VolcanoPlot(delta0.1, main = "mu1 - mu2 = 0.1", col = "#FFDDBB")
VolcanoPlot(delta0.5, main = "mu1 - mu2 = 0.5", col = "#FFDDBB")
VolcanoPlot(delta1, main = "mu1 - mu2 = 1", col = "#FFDDBB")
VolcanoPlot(delta2, main = "mu1 - mu2 = 2", col = "#FFDDBB")
VolcanoPlot(delta3, main = "mu1 - mu2 = 3", col = "#FFDDBB")

par(mfrow = c(1,1))
```

## Multiple testing corrections

```{r multi_test_corrections, fig.width=8, fig.height=8, out.width="90%", fig.cap="Multiple testing corrections."}
par(mfrow = c(3,2))

MultiCorrectionsPlot <- function(resultTable,
                                 main = "Multiple testing corrections") {
  #### Compare p-value corrections under H0 ####
  plot(resultTable$p.value, 
       resultTable$p.value,
       main = main,
       col = "grey",
       log = "xy",
       xlab = "Nominal P-value",
       ylab = "Corrected P-value",
       panel.first = grid())
  abline(v = c(0,1))
  abline(h = c(0,1))
  abline(h  = alpha, col = "red")
  abline(v  = alpha, col = "red")
  
  ## FWER
  points(resultTable$p.value, 
         resultTable$FWER, col = "blue")
  
  ## Bonferroni correction
  points(resultTable$p.value, 
         resultTable$bonferroni, col = "orange")
  
  ## Benjamini-Hochberg
  points(resultTable$p.value, 
         resultTable$BH, col = "darkgreen")
}  


MultiCorrectionsPlot(testH0, main = "under H0 (mu1 = mu2)")
MultiCorrectionsPlot(delta0.1, main = "mu1 - mu2 = 0.1")
MultiCorrectionsPlot(delta0.5, main = "mu1 - mu2 = 0.5")
MultiCorrectionsPlot(delta1, main = "mu1 - mu2 = 1")
MultiCorrectionsPlot(delta2, main = "mu1 - mu2 = 2")
MultiCorrectionsPlot(delta3, main = "mu1 - mu2 = 3")

par(mfrow = c(1,1))
```


```


## Interpretation of the results

We should now write a report of interpretation, which will address the following questions. 

- Based on the experiments under $H_0$, compute the number of false positives and estimate the **false positive rate** (**FPR**). Compare these values with the **E-value** (expected number of false positives) for the `r m` tests,  and with your $alpha$ trheshold. 

- Based on the experiments under $H_1$, estimate the **sensitivity** (**Sn**) of the test for the different mean differences tested here.

- Interpret the histograms of P-values obtained with the different parameters ?

- Draw a **power curve** (i.e. the sensitivity as a function of the actual difference between population means)

- Discuss about the adequation between the test and the conditions of our simulations. 

- Do these observations correspond to what would be expected from the theory?