---
title: "Mean comparison tests"
author: "Jacques van Helden"
date: '`r Sys.Date()`'
output:
  slidy_presentation: 
    fig_caption: yes
    fig_height: 6
    fig_width: 7
    highlight: tango
    self_contained: no
    smaller: yes
    theme: cerulean
    toc: yes
    widescreen: yes
  beamer_presentation:
    colortheme: dolphin
    fig_caption: yes
    fig_height: 6
    fig_width: 7
    fonttheme: structurebold
    highlight: tango
    incremental: no
    keep_tex: no
    slide_level: 2
    theme: Montpellier
    toc: yes
  pdf_document:
    fig_caption: yes
    highlight: zenburn
    toc: yes
    toc_depth: 3
  html_document:
    self_contained: no
    fig_caption: yes
    highlight: zenburn
    theme: cerulean
    toc: yes
    toc_depth: 3
    toc_float: yes
  ioslides_presentation:
    self_contained: no
    css: slides.css
    fig_caption: yes
    fig_height: 6
    fig_width: 7
    highlight: tango
    smaller: yes
    toc: yes
    widescreen: yes
  word_document:
    toc: yes
    toc_depth: 3
font-import: http://fonts.googleapis.com/css?family=Risque
subtitle: Probabilités et statistique pour la biologie (STAT1)
font-family: Garamond
transition: linear
---

```{r include=FALSE, echo=FALSE, eval=TRUE}
library(knitr)
library(diagram) ## For flowcharts


options(width = 300)
knitr::opts_chunk$set(
  fig.width = 7, fig.height = 5, 
  fig.align = "center", 
  fig.path = "figures/sampling-estimation_",
  size = "tiny", 
  echo = FALSE, eval = TRUE, 
  warning = FALSE, message = FALSE, 
  results = TRUE, comment = "")
# knitr::asis_output("\\footnotesize")

dir.main <- "~/stat1"
dir.slides <- file.path(dir.main, "slides")
setwd(dir.slides)
```


## Contents

We present here one of the most popular applications of statistics: the mean comparison test. 

This test is used in a wide variety of contexts, and we will apply it here to two data types: 


1. **Artificial data** generated by drawing samples in two populations following normal distributions, and whose means will be either identical or different, depending on the case. The goal will be to understand how to run the test and how to unterpret the results, in conditions where we control all the parameters (we know whether or not we created populations with equal or different means). 

2. **Transcriptome data** obtained with microarrays. We will test whether a give gene is expressed differentially between two groups of biological samples (e.g. patients suffering from two different types of cancers). 

    **Note:** the microarray technology has been replaced by NGS, and RNA-seq has now been widely adopted for transcriptome studies. However, differential analysis with RNA-seq data relies on more advanced concepts, which will be introduvced in other courses.


## The hypotheses

**General principle:** 

- We observe a difference between sample means, and we would like to know whether this difference results from sampling effects or from an actual difference between the population means. 
- We oppone the ***null hypothesis*** ($H_0$) according to which there is no difference between the two populations, and the ***alternative hypothesis*** ($H_1$), which states that there is a difference. 
- We evaluate the **P-value**, i.e. the probability that a random sampling under the null hypothesis would return a difference at least as high as what we observe.
- If this P-value is very weak (below a given threshold $\alpha$), we reject the null hypothesis ($RH_0$), else we (temporarily) accept it ($AH_0$).


## Two-tailed test

In the ***two-tailed test***, we are a priori interested by a difference in either direction ($\mu_1 > \mu_2$ or $\mu_2 > \mu_2$).  

$$\begin{aligned}
H_0: \mu_1 = \mu_2 \\
H_1: \mu_1 \neq \mu_2
\end{aligned}$$


## One-tailed test

In the ***one--tailed test***, we are specifically interested by detecting differences with a given sign.
The null hypothesis thus covers both the equality and the differences with the opposite sign. 


Positive difference (*right-tailed test*): 

$$\begin{aligned}
H_0: \mu_1 \le \mu_2 \\
H_1: \mu_1 > \mu_2
\end{aligned}$$


Negative difference (*left-tailed test*):

$$\begin{aligned}
H_0: \mu_1 \ge \mu_2 \\
H_1: \mu_1 < \mu_2
\end{aligned}$$


## Assumptions

- There are several possible methods to test the mean equality.

- The choice of a method depends on the nature of the data.

- Before running the test, it is crucial to answer some questions in order to choose  the appropriate method.

- This amounts to  **check the assumptions**, i.e. a series of conditions of applicability for the selected method. 

## Normality assumption

**Do the two populations to be compared follow normal distributions?**

- If so, we can run **parametric tests** (which rely on a normality assumption). 
- Else, we must use non-parametric tests.


    **Why?** Because the tables used to evaluate the probability of risk are derived from on mathematical models relying on a normality assumption. 
    

- In case of non-normality, **do we dispose of large-sized samples?** If so, we can use parametric tests despite the non-normality. 

    **Why?** Because, by virtue of the **Central Limit Theorem**, the sample means tend towards a normal distribution even though the original populations are not normal.  


## Homoscedasticity assumption (equality of population variances)

For parametric tests, the second question is: **do the two population have the same variance?**


- Yes $\rightarrow$ we can use Student test 
- No $\rightarrow$ we must use Welch test 


    **Why?** Student probability distribution was computed based on a homoscedasticity hypothesis. Welch test is a variation on Student test, which corrects the probabilities in case of **heteroscedasticity (unequal variances)** by modifying the number of degrees of freedom as a function of the difference between variances. 

    
## Flowchart for the choice of a mean comparison test

```{r mean_compa_flowchart, fig.width=7, fig.height=6}
# creates an empty plot
openplotmat(main = "Choice of a mean comparison test")

# create the coordinates
pos <- coordinates(c(1, 2, 2, 3))
pos[2,1] <- (pos[6,1] + pos[7, 1])/2
pos[4,1] <- (pos[6,1] + pos[7, 1])/2
pos[3,1] <- pos[8,1]
pos[5,1] <- pos[8,1]
pos.labels <- c("normality?", 
               "Parametric", "Large samples?", 
               "Equal var?", "Non-parametric",
               "Student", "Welch", "Wilcoxon\nMann-Whitney")
nb.nodes <- nrow(pos)
question.nodes <- c(1, 3, 4)
  
fromto <- matrix(ncol = 2, byrow = TRUE,
                 data = c(1, 2, # 1
                          1, 3, # 2
                          3, 2, # 3
                          2, 4, # 4
                          4, 6, # 5
                          4, 7, # 6
                          3, 5, # 7
                          5, 8  # 8
                          ))
yes.arrows <- c(1,3,5)
no.arrows <- c(2,6,7)


nb.arrows     <- nrow(fromto)
arrpos <- matrix(ncol = 2, nrow = nb.arrows)
for (i in 1:nb.arrows) {
  if (i %in% yes.arrows) {
    arrow.color <- "#00BB00"
  } else if (i %in% no.arrows) {
    arrow.color <- "red"
  } else {
    arrow.color <- "black"
  }
 arrpos[i, ] <- straightarrow(
   from = pos[fromto[i, 1], ],
   to = pos[fromto[i, 2], ],
   lwd = 2, arr.pos = 0.5,
   arr.length = 0.25, lcol = arrow.color)
  if (i %in% yes.arrows) {
    text(arrpos[i, 1] + 0.02, arrpos[i, 2] - 0.03, "yes", col = "#00BB00")
  } else if (i %in% no.arrows) {
    text(arrpos[i, 1] + 0.02, arrpos[i, 2] - 0.03, "no", col = "red")
  }
}

for (i in 1:nb.nodes) {
  if (i %in% question.nodes) {
    ## Draw diamons for choices
    textdiamond(
      mid = pos[i,], radx = 0.2, rady = 0.05, 
      lab = pos.labels[i], cex = 1, 
      lcol = "orange", lwd = 2)
    
  } else {
    ## Draw ellipses for operations 
    textellipse(
      mid = pos[i,], radx=0.15, rady=0.05,
      lab = pos.labels[i])
  }
}
# text(arrpos[2, 1] + 0.05, arrpos[2, 2], "no", col = "red")
# text(arrpos[4, 1] - 0.05, arrpos[4, 2], "yes", col = "#00BB00")
# text(arrpos[5, 1] + 0.05, arrpos[5, 2], "no", col = "red")
# text(arrpos[4, 1] - 0.05, arrpos[4, 2], "yes", col = "#00BB00")
# text(arrpos[5, 1] + 0.05, arrpos[5, 2], "no", col = "red")


```


## Student test

Assumptions: **normalité** (or large samples), **homoscedasticity**. 

Student's statistics: 

$$t_{S} = \frac{\hat{\delta}}{\hat{\sigma}_\delta} =  \frac{\bar{x}_{2} - \bar{x}_{1}}{\sqrt{\frac{n_1 s_{1}^2 + n_2 s_{2}^2}{n_1+n_2-2} \left(\frac{1}{n_1}+ \frac{1}{n_2}\right)}}$$


## Population and sample parameters

For a finite population, the ***population parameters*** are computed as follows.

| Parameter | Formula |
|-----------|---------|
| Population mean | $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$ |
| Population variance | $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$ |
| Population standard deviation | $\sigma = \sqrt{\sigma^2}$ |

Where $x_i$ is the $i^{th}$ measurement, and $N$ the population size.

However, in practice we are generally not in state to measure $x_i$ for all individuals of the population. We thus we have to ***estimate*** the population parameters ($\mu$, $\sigma^2$) from a ***sample***. 

***Sample parameters*** are computed with the same formulas, restricted to a subset of the population. 

| Parameter | Formula |
|-----------|---------|
| Sample mean | $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ |
| Sample variance | $s^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$ |
| Sample standard deviation | $s = \sqrt{s^2}$ |

where $n$ is the sample size (number of sampled individuals), and $\bar{x}$ the sample mean. 

## Sampling issues: why *n-1* and not simply *n*?

The sample mean is an ***unbiased estimator*** of the population mean. Each time we select a sample we should expect  some random fluctuations, but if we perform an infinite number of sampling trials, and compute the mean of each of these samples, the mean of all these sample means tends towards the actual mean of the population. 

On the contrary, the sample variance is a ***biased estimator*** of the population variance. On the average, the sample variance under-estimates the population variance, but this can be fixed by multiplying it by a simple correcting factor: $n / (n - 1)$. 

| Parameter | Formula |
|-----------|---------|
| Sample-based estimate of the population mean | $\hat{\mu} = \bar{x}$ |
| Sample-based estimate of the population variance | $\hat{\sigma}^2 = \frac{n}{n-1}s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$ |
| Sample-based estimate of the population standard deviation | $\hat{\sigma} = \sqrt{\hat{s}^2}$ |

Greek symbols ($\mu$, $\sigma$: population parameters; Roman symbols ($\bar{x}$, $s$): sample parameters; the "hat" symbol $\hat{ }$ reads "***estimation of***". 

## R `var()` and `sd()` functions

**Beware**: the **R** functions `var()` and `sd()` directly compute an estimate of the population variance ($\hat{\sigma}^2$) and standard deviation ($\hat{\sigma}$), respectively, instead of computing the variance ($\bar{x}$) and standard deviation ($s$) of the input data. 

```{r echo=TRUE, var_sd_functions}
x <- c(1, 5)

n <- length(x) # Gives 2
sample.mean <- mean(x) # Gives 3
sample.var <- sum((x - sample.mean)^2) / n # Gives 4
pop.var.est <- sum((x - sample.mean)^2) / (n - 1) # Gives 8
r.var <- var(x) # Gives 8

kable(data.frame(sample.var = sample.var, pop.var.est = pop.var.est, r.var = r.var))
```


## P-value

This statistics can be used to compute an ***P-value*** ($P$), which measures the probability to obtain, ***under the null hypothesis***, a $t_S$ statistics at least as extreme as the one observed. Extreme refers here to the tail(s) of the distribution depending on the orientation of the test. 

In the case of hypothesis testing, the P-value can be interpreted as an evaluation of the risk of **first kind error**, which corresponds to the risk of rejecting the null hypothesis whereas it is true. 
## Degrees of freedom for Student tst

The shape of the Student distribution depends on a parameter named ***degrees of freedom*** ($\nu$), which represents the number of independent variables used in the equation. 

In a two-sample t-test (as in our case), the degrees of freedom are computed as the total number of elements in the respective samples ($n_1$, $n_2$) minus the number of means estimated from these elements (we estimated the means of group 1 and group 2, respectively). Thus: 

$$\nu_S = n_1 + n_2 - 2$$

In classical texbooks of statistics, the p-value can be found in [Student's $t$ table](../supports/t-table.pdf). 

With R, the p-value of a $t$ statistics can be computed wiht the function $pt()$.

## Student distribution

```{r student_distribution, fig.width=10, fig.height=8, out.width="60%", fig.cap="Student density (left) and CDF (right) on linear (top) or logarithmic (bottom) scale. The log scale highlights the rather low convergence in the tails. "}
par(mfrow = c(2,2))

freedom <- c(30,10,5,2,1)
x <- seq(from = -10, to=10, by=0.01)
# lty <- c("solid", "longdash", "twodash", "dashed", "dotdash", "dotted") ## Line types for the different degrees of freedom
lty <- rep("solid", length(freedom)+1)
#col <- rep("black", length(freedom)+1)
#col <- c("black", "black", "black", "#888888", "#888888", "#888888")
# col <- rainbow(n = 6)
col <- c("black", "brown", "red", "orange", "#00BB00", "blue")
lwd <- 1

########################################################################
# Plot the distributions first with linear scales, then with a logarithmic ordinate
for (log in c("", "y")) {
  
  ########################################################################
  ## Draw a series of Student probability mass functions with different degrees of freedom
  i <- 1 ## Counter for the curves
  ## Plot the normal distribution to highlight asymptotic properties
  plot(x, dnorm(x), las = 1,
       type="l",lwd = lwd, lty=lty[1], col = col[1], log=log,
       main = "Student density", panel.first = grid(),
       xlab = "x", ylab = "Density")
  
  for (n in freedom) {
    i <- i + 1
    lines(x,dt(x,n), type="l", lwd = lwd, lty=lty[i], col = col[i])
  }
  if (log == "") {
    legend.corner <- "topright"
  } else {
    legend.corner <- xy.coords(-3.5,1e-14)
  }
  legend(legend.corner,
         legend=c("normal",paste("student, df=",freedom)),
         cex = 0.9, lty=lty, col = col,
         lwd = lwd, bty="o", bg="white")
  
  
  ########################################################################
  ## Draw a series of Student Cumulative Distribution Function (CDF)
  i <- 1 ## Counter for the curves
  ## Plot the normal distribution to highlight asymptotic properties
  plot(x, pnorm(x), las = 1,
       type="l",lwd = lwd, lty=lty[1], col = col[1], log=log,
       main = "Student CDF", panel.first = grid(),
       xlab = "x", ylab = "CDF")
  for (n in freedom) {
    i <- i + 1
    lines(x,pt(x,n), type="l", lwd = lwd, lty=lty[i], col = col[i])
  }
  legend("bottomright",
         legend=c("normal",paste("student, df=",freedom)),
         cex = 1, lty=lty, col = col,
         lwd = lwd, bty="o", bg="white")
}

```


## Decision

- Classically, one choses *a priori* a threshold on p-value, denoted $\alpha$ (e.g. $\alpha = 0.05$).
- If $P < \alpha$, the **null hypothesis is rejected** ($RH_0$) and the test is declared **positive**.
- If $P \ge \alpha$, the **null hypothesis is accepted** ($AH_0$) and the test is declared **negative**.


| Orientation of the test   | Decision criterion |
|----------------------------|--------------------------------------|
| Right-tailed | $RH_0 \quad \text{if} \quad t_S > t_{1-\alpha}^{n_1 + n_2 -2}$ |
| Left-tailed | $RH_0 \quad \text{if} \quad t_S < t_{alpha}^{n_1 + n_2 -2} = - t_{1-\alpha}^{n_1 + n_2 -2}$ |
| Two-tailed | $RH_0 \quad \text{if} \quad \lvert t_S \rvert > t_{1-\frac{\alpha}{2}}^{n_1 + n_2 -2}$ |

The two-tailed test splits the risk symmetrically on the left and right tails ($\frac{\alpha}{2}$ on each side). 

## Welch *t* statistics

Welch's t-test defines the $t$ statistic by the following formula.

$$t_{W}=\frac{\bar{x}_{2} - \bar{x}_{1}}{\sqrt{\frac {s^2_{1}}{n_1} + \frac{s^2_{2}}{n_2}}}$$

Where:

* the indices $1$ and $2$ denote the respective populations
* $\bar{x_i}$ is the mean of sample $i$,
* $s^2_{i}$ the sample variance,
* $n_{i}$ the sample size.


## Welch degrees of freedom

The Welch test corrects the impact of heteroscedacity by adapting the degrees of freedom ($\nu$) of the Student distribution according to the respective sample variances. 

$$\nu = \frac{\left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{s_1^4}{n_1^2(n_1-1)} + \frac{s_2^4}{n_2^2(n_2-1)}}$$

- You generally don't have to compute this by yourself, it is done automatically in R.

- Under homoscedasticity, Student and Welch degrees of freedom are equal. 

- Under homoscedasticity, the ***power of the test*** (capacity to reject $H_0$ under $H_1$) is higher for Student than Welch.  


## Summary table: notations

Greek symbols ($\mu$, $\sigma$) denote population-wide statistics, and roman symbols ($\bar{x}$, $s$) sample-based statistics.

| Symbol       | Description   |
|-------|------------------------------------------------|
| $\mu_{1},  \mu_{2}$ | Population means |
| $\delta = \mu_{2} - \mu_{1}$ | Difference beyween population means |
| $\sigma_{1}, \sigma_{2}$ | Population standard deviations |
| $n_1$, $n_2$ | Sample sizes |
| $\bar{x}_{1}, \bar{x}_{2}$ | Sample means |
| $d = \bar{x}_{2} - \bar{x}_{2}$ | Effect size, i.e. difference between sample means |
| $s^2_{1}, s^2_{2}$ | Sample variances |
| $s_{1}, s_{2}$ | Sample standard deviations |

## Summary table: formulas

The "hat" ($\hat{ }$) symbol is used to denote sample-based estimates of population parameters.

| Symbol       | Description   |
|-------|------------------------------------------------|
| $d = \hat{\delta} = \hat{\mu}_2 - \hat{\mu}_1  = \bar{x}_2 - \bar{x}_1$ | $d$ = Effect size (difference between sample means), estimator of the difference between population means $\delta$. |
| $\hat{\sigma}_p = \sqrt{\frac{n_1 s_1^2 + n_2 s_2^2}{n_1+n_2-2}}$ | Estimate of the pooled standard deviation under variance equality assumption |
| $\hat{\sigma}_\delta = \hat{\sigma}_p \sqrt{\left(\frac{1}{n_1}+ \frac{1}{n_2}\right)}$  | Standard error about the difference between means of two groups whose variances are assumed equal (Student). |
| $t_{S} = \frac{\hat{\delta}}{\hat{\sigma}_\delta} =  \frac{\bar{x}_{2} - \bar{x}_{1}}{\sqrt{\frac{n_1 s_{1}^2 + n_2 s_{2}^2}{n_1+n_2-2} \left(\frac{1}{n_1}+ \frac{1}{n_2}\right)}}$ | Student $t$ statistics |
| $\nu_S = n_1 + n_2 - 2$ | degrees of freedom for Student test|
| $t_{W}=\frac{\bar{x}_{1} - \bar{x}_{2}}{\sqrt{\frac {s^2_{1}}{n_1} + \frac{s^2_{2}}{n_2}}}$ | Welch $t$ statistics |
| $\nu_W   = TOBEADDED$ | degrees of freedom for Welch test|


## Exercise 1

A researcher analysed the level of expression of a gene of intrerest in 50 patients ($n_p = 50$)

Un chercheur a analysé, à l’aide de biopuces, le niveau d’expression de l’ensemble des gènes à partir d’échantillons sanguins prélevés chez 50 patients ($n_p=50$) et chez 50 sujets sains ("contrôles", $n_c=50$). He obtains the following results. 

-	for the patients, a mean expression level of $m_p= 21$
-	for the controls, a mean expression level of $m_t= 10$
-	the sample standard deviations are identical for the two groups:  $s_p = s_c = s = 15$

In order to test whether the observed difference between the means is significant, the researcher decides to run a Student test. 


a.	Is the choice of this test appropriate? Justify. Which alternatives would have been conceivable?

b.	Since we have no prior idea about the sense of a potential effect of the disease on this gene, would you recommend a two-tailed or single-tailed test?

c.	Formulate the null hypothesis and explain it in one sentence. 

d.	Compute (manually) the $t$ statistics.

e.	Based on the [Student's $t$ table](../supports/t-table.pdf), evaluate the p-value of the test.  

f.	Interpret this p-value, and help the researcher to draw a conclusion from the study. 


## Exercise 2

A research team detected an association between bilharziose and a high concentration of IgE in the blood.? Another team attempted to replicate this result in an independent population exposed to bilharziose, and obtained the following results. 

-	For resistant subjects ($n_r=32$), the mean  is $m_r=10$.
-	For susceptible subjects ($n_s=32$), the mean is $m_s=7$.
-	The sample standard deviations are identical for the two groups: $s_r = s_s = s = 2.8$.

a. Which method would you recommend to test mean equality? Justify. 

b. Which are the assumptions of this test?

b. Assuming that these conditions are fulfilled, formulate the null hypothesis and compute the $t$ statistics. 

c. Evaluate the corresponding P-value basd on the [Student's $t$ table](../supports/t-table.pdf). 

d. Based on these results, which decision would you take? Justify your answer