--- title: "Mean comparison tests" author: "Jacques van Helden" date: '`r Sys.Date()`' output: slidy_presentation: fig_caption: yes fig_height: 6 fig_width: 7 highlight: tango self_contained: no smaller: yes theme: cerulean toc: yes widescreen: yes beamer_presentation: colortheme: dolphin fig_caption: yes fig_height: 6 fig_width: 7 fonttheme: structurebold highlight: tango incremental: no keep_tex: no slide_level: 2 theme: Montpellier toc: yes pdf_document: fig_caption: yes highlight: zenburn toc: yes toc_depth: 3 html_document: self_contained: no fig_caption: yes highlight: zenburn theme: cerulean toc: yes toc_depth: 3 toc_float: yes ioslides_presentation: self_contained: no css: slides.css fig_caption: yes fig_height: 6 fig_width: 7 highlight: tango smaller: yes toc: yes widescreen: yes word_document: toc: yes toc_depth: 3 font-import: http://fonts.googleapis.com/css?family=Risque subtitle: Probabilités et statistique pour la biologie (STAT1) font-family: Garamond transition: linear --- ```{r include=FALSE, echo=FALSE, eval=TRUE} library(knitr) library(diagram) ## For flowcharts options(width = 300) knitr::opts_chunk$set( fig.width = 7, fig.height = 5, fig.align = "center", fig.path = "figures/sampling-estimation_", size = "tiny", echo = FALSE, eval = TRUE, warning = FALSE, message = FALSE, results = TRUE, comment = "") # knitr::asis_output("\\footnotesize") dir.main <- "~/stat1" dir.slides <- file.path(dir.main, "slides") setwd(dir.slides) ``` ## Contents We present here one of the most popular applications of statistics: the mean comparison test. This test is used in a wide variety of contexts, and we will apply it here to two data types: 1. **Artificial data** generated by drawing samples in two populations following normal distributions, and whose means will be either identical or different, depending on the case. The goal will be to understand how to run the test and how to unterpret the results, in conditions where we control all the parameters (we know whether or not we created populations with equal or different means). 2. **Transcriptome data** obtained with microarrays. We will test whether a give gene is expressed differentially between two groups of biological samples (e.g. patients suffering from two different types of cancers). **Note:** the microarray technology has been replaced by NGS, and RNA-seq has now been widely adopted for transcriptome studies. However, differential analysis with RNA-seq data relies on more advanced concepts, which will be introduvced in other courses. ## The hypotheses **General principle:** - We observe a difference between sample means, and we would like to know whether this difference results from sampling effects or from an actual difference between the population means. - We oppone the ***null hypothesis*** ($H_0$) according to which there is no difference between the two populations, and the ***alternative hypothesis*** ($H_1$), which states that there is a difference. - We evaluate the **P-value**, i.e. the probability that a random sampling under the null hypothesis would return a difference at least as high as what we observe. - If this P-value is very weak (below a given threshold $\alpha$), we reject the null hypothesis ($RH_0$), else we (temporarily) accept it ($AH_0$). ## Two-tailed test In the ***two-tailed test***, we are a priori interested by a difference in either direction ($\mu_1 > \mu_2$ or $\mu_2 > \mu_2$). 
## Two-tailed test

In the ***two-tailed test***, we are a priori interested in a difference in either direction ($\mu_1 > \mu_2$ or $\mu_1 < \mu_2$).

$$\begin{aligned}
H_0: \mu_1 = \mu_2 \\
H_1: \mu_1 \neq \mu_2
\end{aligned}$$

## One-tailed test

In the ***one-tailed test***, we are specifically interested in detecting differences with a given sign. The null hypothesis thus covers both the equality and the differences of the opposite sign.

Positive difference (*right-tailed test*):

$$\begin{aligned}
H_0: \mu_1 \le \mu_2 \\
H_1: \mu_1 > \mu_2
\end{aligned}$$

Negative difference (*left-tailed test*):

$$\begin{aligned}
H_0: \mu_1 \ge \mu_2 \\
H_1: \mu_1 < \mu_2
\end{aligned}$$

## Assumptions

- There are several possible methods to test the equality of means.
- The choice of a method depends on the nature of the data.
- Before running the test, it is crucial to answer some questions in order to choose the appropriate method.
- This amounts to **checking the assumptions**, i.e. a series of conditions of applicability for the selected method.

## Normality assumption

**Do the two populations to be compared follow normal distributions?**

- If so, we can run **parametric tests** (which rely on a normality assumption).
- Otherwise, we must use non-parametric tests.

**Why?** Because the tables used to evaluate the probability of risk are derived from mathematical models relying on a normality assumption.

- In case of non-normality, **do we have large samples?** If so, we can use parametric tests despite the non-normality.

**Why?** Because, by virtue of the **Central Limit Theorem**, the sample means tend towards a normal distribution even when the original populations are not normal.

## Homoscedasticity assumption (equality of population variances)

For parametric tests, the second question is: **do the two populations have the same variance?**

- Yes $\rightarrow$ we can use Student's test
- No $\rightarrow$ we must use Welch's test

**Why?** Student's probability distribution was computed under a homoscedasticity hypothesis. Welch's test is a variation on Student's test, which corrects the probabilities in case of **heteroscedasticity (unequal variances)** by modifying the number of degrees of freedom as a function of the difference between variances.
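## Checking the assumptions in R

As a minimal sketch on simulated data, the chunk below shows one common way to examine the normality and homoscedasticity assumptions before choosing between the Student, Welch and non-parametric tests.

```{r assumption_checks_sketch, echo=TRUE}
## Simulated data: two groups with unequal variances
set.seed(123)
x1 <- rnorm(30, mean = 10, sd = 2)   # group 1
x2 <- rnorm(30, mean = 12, sd = 4)   # group 2, larger variance

## Normality (Shapiro-Wilk test; H0: the data are normally distributed)
## and homoscedasticity (F test; H0: the population variances are equal)
c(shapiro.group1 = shapiro.test(x1)$p.value,
  shapiro.group2 = shapiro.test(x2)$p.value,
  var.equality   = var.test(x1, x2)$p.value)

## Depending on the answers, one of the following tests is chosen
t.test(x1, x2, var.equal = TRUE)$p.value   # Student (equal variances assumed)
t.test(x1, x2)$p.value                     # Welch (default in R)
wilcox.test(x1, x2)$p.value                # Wilcoxon-Mann-Whitney (non-parametric)
```

Note that such pre-tests have limited power on small samples; the flowchart on the next slide summarises this decision process.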
## Flowchart for the choice of a mean comparison test

```{r mean_compa_flowchart, fig.width=7, fig.height=6}
# Create an empty plot
openplotmat(main = "Choice of a mean comparison test")

# Create the coordinates of the nodes
pos <- coordinates(c(1, 2, 2, 3))
pos[2, 1] <- (pos[6, 1] + pos[7, 1]) / 2
pos[4, 1] <- (pos[6, 1] + pos[7, 1]) / 2
pos[3, 1] <- pos[8, 1]
pos[5, 1] <- pos[8, 1]

pos.labels <- c(
  "normality?",
  "Parametric",
  "Large samples?",
  "Equal var?",
  "Non-parametric",
  "Student",
  "Welch",
  "Wilcoxon\nMann-Whitney")
nb.nodes <- nrow(pos)
question.nodes <- c(1, 3, 4)

fromto <- matrix(ncol = 2, byrow = TRUE, data = c(
  1, 2, # 1
  1, 3, # 2
  3, 2, # 3
  2, 4, # 4
  4, 6, # 5
  4, 7, # 6
  3, 5, # 7
  5, 8  # 8
))
yes.arrows <- c(1, 3, 5)
no.arrows <- c(2, 6, 7)
nb.arrows <- nrow(fromto)

## Draw the arrows, coloured according to the yes/no answers
arrpos <- matrix(ncol = 2, nrow = nb.arrows)
for (i in 1:nb.arrows) {
  if (i %in% yes.arrows) {
    arrow.color <- "#00BB00"
  } else if (i %in% no.arrows) {
    arrow.color <- "red"
  } else {
    arrow.color <- "black"
  }
  arrpos[i, ] <- straightarrow(
    from = pos[fromto[i, 1], ],
    to = pos[fromto[i, 2], ],
    lwd = 2, arr.pos = 0.5, arr.length = 0.25,
    lcol = arrow.color)
  if (i %in% yes.arrows) {
    text(arrpos[i, 1] + 0.02, arrpos[i, 2] - 0.03, "yes", col = "#00BB00")
  } else if (i %in% no.arrows) {
    text(arrpos[i, 1] + 0.02, arrpos[i, 2] - 0.03, "no", col = "red")
  }
}

## Draw the nodes: diamonds for questions, ellipses for the tests
for (i in 1:nb.nodes) {
  if (i %in% question.nodes) {
    textdiamond(
      mid = pos[i, ], radx = 0.2, rady = 0.05,
      lab = pos.labels[i], cex = 1, lcol = "orange", lwd = 2)
  } else {
    textellipse(
      mid = pos[i, ], radx = 0.15, rady = 0.05,
      lab = pos.labels[i])
  }
}
```

## Student test

Assumptions: **normality** (or large samples), **homoscedasticity**.

Student's statistic:

$$t_{S} = \frac{\hat{\delta}}{\hat{\sigma}_\delta} = \frac{\bar{x}_{2} - \bar{x}_{1}}{\sqrt{\frac{n_1 s_{1}^2 + n_2 s_{2}^2}{n_1+n_2-2} \left(\frac{1}{n_1}+ \frac{1}{n_2}\right)}}$$

## Population and sample parameters

For a finite population, the ***population parameters*** are computed as follows.

| Parameter | Formula |
|-----------|---------|
| Population mean | $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$ |
| Population variance | $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$ |
| Population standard deviation | $\sigma = \sqrt{\sigma^2}$ |

where $x_i$ is the $i^{th}$ measurement, and $N$ the population size.

However, in practice we are generally not in a position to measure $x_i$ for all the individuals of the population. We thus have to ***estimate*** the population parameters ($\mu$, $\sigma^2$) from a ***sample***.

***Sample parameters*** are computed with the same formulas, restricted to a subset of the population.

| Parameter | Formula |
|-----------|---------|
| Sample mean | $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ |
| Sample variance | $s^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$ |
| Sample standard deviation | $s = \sqrt{s^2}$ |

where $n$ is the sample size (number of sampled individuals), and $\bar{x}$ the sample mean.
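## Population vs sample parameters in R

The chunk below is a small numerical illustration of the formulas above, on a toy "population" of $N = 6$ arbitrary values from which a sample of $n = 3$ individuals is drawn.

```{r population_vs_sample_sketch, echo=TRUE}
## Toy population of N = 6 values (arbitrary numbers)
pop <- c(2, 4, 4, 4, 5, 7)
N <- length(pop)
pop.mean <- sum(pop) / N                    # population mean (mu)
pop.var  <- sum((pop - pop.mean)^2) / N     # population variance (sigma^2), divide by N

## A sample of n = 3 individuals drawn without replacement from this population
set.seed(1)
samp <- sample(pop, size = 3)
n <- length(samp)
samp.mean <- sum(samp) / n                  # sample mean (x-bar)
samp.var  <- sum((samp - samp.mean)^2) / n  # sample variance (s^2), divide by n

c(pop.mean = pop.mean, pop.var = pop.var,
  samp.mean = samp.mean, samp.var = samp.var)
```

The next slide explains why this sample variance systematically under-estimates the population variance, and why dividing by $n - 1$ rather than $n$ corrects the bias.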
## Sampling issues: why *n-1* and not simply *n*?

The sample mean is an ***unbiased estimator*** of the population mean. Each time we select a sample we should expect some random fluctuation, but if we performed an infinite number of sampling trials and computed the mean of each of these samples, the mean of all these sample means would tend towards the actual mean of the population.

On the contrary, the sample variance is a ***biased estimator*** of the population variance. On average, the sample variance under-estimates the population variance, but this can be fixed by multiplying it by a simple correcting factor: $n / (n - 1)$.

| Parameter | Formula |
|-----------|---------|
| Sample-based estimate of the population mean | $\hat{\mu} = \bar{x}$ |
| Sample-based estimate of the population variance | $\hat{\sigma}^2 = \frac{n}{n-1}s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$ |
| Sample-based estimate of the population standard deviation | $\hat{\sigma} = \sqrt{\hat{\sigma}^2}$ |

Greek symbols ($\mu$, $\sigma$): population parameters; Roman symbols ($\bar{x}$, $s$): sample parameters; the "hat" symbol ($\hat{ }$) reads "***estimation of***".

## R `var()` and `sd()` functions

**Beware**: the **R** functions `var()` and `sd()` directly compute an estimate of the population variance ($\hat{\sigma}^2$) and standard deviation ($\hat{\sigma}$), respectively, rather than the variance ($s^2$) and standard deviation ($s$) of the input data.

```{r var_sd_functions, echo=TRUE}
x <- c(1, 5)
n <- length(x)                                      # gives 2
sample.mean <- mean(x)                              # gives 3
sample.var <- sum((x - sample.mean)^2) / n          # gives 4
pop.var.est <- sum((x - sample.mean)^2) / (n - 1)   # gives 8
r.var <- var(x)                                     # gives 8

kable(data.frame(sample.var = sample.var,
                 pop.var.est = pop.var.est,
                 r.var = r.var))
```

## P-value

This statistic can be used to compute a ***P-value*** ($P$), which measures the probability of obtaining, ***under the null hypothesis***, a $t_S$ statistic at least as extreme as the one observed. "Extreme" refers here to the tail(s) of the distribution, depending on the orientation of the test.

In the case of hypothesis testing, the P-value can be interpreted as an evaluation of the risk of **first kind error**, i.e. the risk of rejecting the null hypothesis whereas it is true.

## Degrees of freedom for the Student test

The shape of the Student distribution depends on a parameter named ***degrees of freedom*** ($\nu$), which represents the number of independent variables used in the equation.

In a two-sample $t$ test (as in our case), the degrees of freedom are computed as the total number of elements in the respective samples ($n_1$, $n_2$) minus the number of means estimated from these elements (we estimated the means of group 1 and group 2, respectively). Thus:

$$\nu_S = n_1 + n_2 - 2$$

In classical statistics textbooks, the P-value can be found in [Student's $t$ table](../supports/t-table.pdf). With R, the P-value of a $t$ statistic can be computed with the function `pt()`.
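## Computing the P-value with `pt()`

As a minimal sketch, the chunk below uses `pt()` to convert an arbitrary observed $t$ statistic into a P-value for the three orientations of the test, with the Student degrees of freedom defined above.

```{r pvalue_with_pt_sketch, echo=TRUE}
## Arbitrary values chosen for illustration
t.obs <- 2.3            # observed t statistic
n1 <- 12                # size of sample 1
n2 <- 15                # size of sample 2
df <- n1 + n2 - 2       # Student degrees of freedom

## Right-tailed test: P(T >= t.obs)
pt(t.obs, df = df, lower.tail = FALSE)

## Left-tailed test: P(T <= t.obs)
pt(t.obs, df = df, lower.tail = TRUE)

## Two-tailed test: the risk is split between the two tails
2 * pt(abs(t.obs), df = df, lower.tail = FALSE)
```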
"} par(mfrow = c(2,2)) freedom <- c(30,10,5,2,1) x <- seq(from = -10, to=10, by=0.01) # lty <- c("solid", "longdash", "twodash", "dashed", "dotdash", "dotted") ## Line types for the different degrees of freedom lty <- rep("solid", length(freedom)+1) #col <- rep("black", length(freedom)+1) #col <- c("black", "black", "black", "#888888", "#888888", "#888888") # col <- rainbow(n = 6) col <- c("black", "brown", "red", "orange", "#00BB00", "blue") lwd <- 1 ######################################################################## # Plot the distributions first with linear scales, then with a logarithmic ordinate for (log in c("", "y")) { ######################################################################## ## Draw a series of Student probability mass functions with different degrees of freedom i <- 1 ## Counter for the curves ## Plot the normal distribution to highlight asymptotic properties plot(x, dnorm(x), las = 1, type="l",lwd = lwd, lty=lty[1], col = col[1], log=log, main = "Student density", panel.first = grid(), xlab = "x", ylab = "Density") for (n in freedom) { i <- i + 1 lines(x,dt(x,n), type="l", lwd = lwd, lty=lty[i], col = col[i]) } if (log == "") { legend.corner <- "topright" } else { legend.corner <- xy.coords(-3.5,1e-14) } legend(legend.corner, legend=c("normal",paste("student, df=",freedom)), cex = 0.9, lty=lty, col = col, lwd = lwd, bty="o", bg="white") ######################################################################## ## Draw a series of Student Cumulative Distribution Function (CDF) i <- 1 ## Counter for the curves ## Plot the normal distribution to highlight asymptotic properties plot(x, pnorm(x), las = 1, type="l",lwd = lwd, lty=lty[1], col = col[1], log=log, main = "Student CDF", panel.first = grid(), xlab = "x", ylab = "CDF") for (n in freedom) { i <- i + 1 lines(x,pt(x,n), type="l", lwd = lwd, lty=lty[i], col = col[i]) } legend("bottomright", legend=c("normal",paste("student, df=",freedom)), cex = 1, lty=lty, col = col, lwd = lwd, bty="o", bg="white") } ``` ## Decision - Classically, one choses *a priori* a threshold on p-value, denoted $\alpha$ (e.g. $\alpha = 0.05$). - If $P < \alpha$, the **null hypothesis is rejected** ($RH_0$) and the test is declared **positive**. - If $P \ge \alpha$, the **null hypothesis is accepted** ($AH_0$) and the test is declared **negative**. | Orientation of the test | Decision criterion | |----------------------------|--------------------------------------| | Right-tailed | $RH_0 \quad \text{if} \quad t_S > t_{1-\alpha}^{n_1 + n_2 -2}$ | | Left-tailed | $RH_0 \quad \text{if} \quad t_S < t_{alpha}^{n_1 + n_2 -2} = - t_{1-\alpha}^{n_1 + n_2 -2}$ | | Two-tailed | $RH_0 \quad \text{if} \quad \lvert t_S \rvert > t_{1-\frac{\alpha}{2}}^{n_1 + n_2 -2}$ | The two-tailed test splits the risk symmetrically on the left and right tails ($\frac{\alpha}{2}$ on each side). ## Welch *t* statistics Welch's t-test defines the $t$ statistic by the following formula. $$t_{W}=\frac{\bar{x}_{2} - \bar{x}_{1}}{\sqrt{\frac {s^2_{1}}{n_1} + \frac{s^2_{2}}{n_2}}}$$ Where: * the indices $1$ and $2$ denote the respective populations * $\bar{x_i}$ is the mean of sample $i$, * $s^2_{i}$ the sample variance, * $n_{i}$ the sample size. ## Welch degrees of freedom The Welch test corrects the impact of heteroscedacity by adapting the degrees of freedom ($\nu$) of the Student distribution according to the respective sample variances. 
## Welch degrees of freedom

The Welch test corrects the impact of heteroscedasticity by adapting the degrees of freedom ($\nu$) of the Student distribution according to the respective sample variances.

$$\nu_W = \frac{\left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{s_1^4}{n_1^2(n_1-1)} + \frac{s_2^4}{n_2^2(n_2-1)}}$$

- You generally don't have to compute this yourself; it is done automatically in R.
- Under homoscedasticity, the Student and Welch degrees of freedom are equal.
- Under homoscedasticity, the ***power of the test*** (capacity to reject $H_0$ under $H_1$) is higher for the Student test than for the Welch test.

## Summary table: notations

Greek symbols ($\mu$, $\sigma$) denote population-wide statistics, and Roman symbols ($\bar{x}$, $s$) sample-based statistics.

| Symbol | Description |
|-------|------------------------------------------------|
| $\mu_{1}, \mu_{2}$ | Population means |
| $\delta = \mu_{2} - \mu_{1}$ | Difference between population means |
| $\sigma_{1}, \sigma_{2}$ | Population standard deviations |
| $n_1$, $n_2$ | Sample sizes |
| $\bar{x}_{1}, \bar{x}_{2}$ | Sample means |
| $d = \bar{x}_{2} - \bar{x}_{1}$ | Effect size, i.e. difference between sample means |
| $s^2_{1}, s^2_{2}$ | Sample variances |
| $s_{1}, s_{2}$ | Sample standard deviations |

## Summary table: formulas

The "hat" ($\hat{ }$) symbol is used to denote sample-based estimates of population parameters.

| Symbol | Description |
|-------|------------------------------------------------|
| $d = \hat{\delta} = \hat{\mu}_2 - \hat{\mu}_1 = \bar{x}_2 - \bar{x}_1$ | Effect size (difference between sample means), estimator of the difference between population means $\delta$ |
| $\hat{\sigma}_p = \sqrt{\frac{n_1 s_1^2 + n_2 s_2^2}{n_1+n_2-2}}$ | Estimate of the pooled standard deviation under the variance equality assumption |
| $\hat{\sigma}_\delta = \hat{\sigma}_p \sqrt{\left(\frac{1}{n_1}+ \frac{1}{n_2}\right)}$ | Standard error on the difference between the means of two groups whose variances are assumed equal (Student) |
| $t_{S} = \frac{\hat{\delta}}{\hat{\sigma}_\delta} = \frac{\bar{x}_{2} - \bar{x}_{1}}{\sqrt{\frac{n_1 s_{1}^2 + n_2 s_{2}^2}{n_1+n_2-2} \left(\frac{1}{n_1}+ \frac{1}{n_2}\right)}}$ | Student $t$ statistic |
| $\nu_S = n_1 + n_2 - 2$ | Degrees of freedom for the Student test |
| $t_{W}=\frac{\bar{x}_{2} - \bar{x}_{1}}{\sqrt{\frac {s^2_{1}}{n_1} + \frac{s^2_{2}}{n_2}}}$ | Welch $t$ statistic |
| $\nu_W = \frac{\left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{s_1^4}{n_1^2(n_1-1)} + \frac{s_2^4}{n_2^2(n_2-1)}}$ | Degrees of freedom for the Welch test |
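## From summary statistics to the test

The exercises below only provide summary statistics (sample sizes, means and standard deviations). As a sketch, the chunk below applies the formulas of the summary table to such values in R; the numbers used here are arbitrary and differ from those of the exercises.

```{r summary_stats_sketch, echo=TRUE}
## Arbitrary summary statistics for two groups
n1 <- 30; m1 <- 9.5;  s1 <- 3   # group 1: size, mean, standard deviation
n2 <- 30; m2 <- 11.2; s2 <- 3   # group 2: size, mean, standard deviation

## Student statistic and degrees of freedom (formulas of the summary table)
sigma.p <- sqrt((n1 * s1^2 + n2 * s2^2) / (n1 + n2 - 2))   # pooled standard deviation
t.S  <- (m2 - m1) / (sigma.p * sqrt(1 / n1 + 1 / n2))      # Student t statistic
nu.S <- n1 + n2 - 2                                        # degrees of freedom

## Two-tailed P-value from the Student distribution
p.value <- 2 * pt(abs(t.S), df = nu.S, lower.tail = FALSE)
c(t.S = t.S, nu.S = nu.S, p.value = p.value)
```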
## Exercise 1

Using microarrays, a researcher analysed genome-wide expression levels from blood samples collected from 50 patients ($n_p = 50$) and 50 healthy subjects ("controls", $n_c = 50$). For a gene of interest, he obtained the following results:

- for the patients, a mean expression level of $m_p = 21$;
- for the controls, a mean expression level of $m_c = 10$;
- the sample standard deviations are identical for the two groups: $s_p = s_c = s = 15$.

In order to test whether the observed difference between the means is significant, the researcher decides to run a Student test.

a. Is the choice of this test appropriate? Justify. Which alternatives would have been conceivable?
b. Since we have no prior idea about the direction of a potential effect of the disease on this gene, would you recommend a two-tailed or a one-tailed test?
c. Formulate the null hypothesis and explain it in one sentence.
d. Compute (manually) the $t$ statistic.
e. Based on [Student's $t$ table](../supports/t-table.pdf), evaluate the P-value of the test.
f. Interpret this P-value, and help the researcher to draw a conclusion from the study.

## Exercise 2

A research team detected an association between bilharzia (schistosomiasis) and a high concentration of IgE in the blood. Another team attempted to replicate this result in an independent population exposed to bilharzia, and obtained the following results.

- For resistant subjects ($n_r = 32$), the mean is $m_r = 10$.
- For susceptible subjects ($n_s = 32$), the mean is $m_s = 7$.
- The sample standard deviations are identical for the two groups: $s_r = s_s = s = 2.8$.

a. Which method would you recommend to test the equality of the means? Justify.
b. What are the assumptions of this test?
c. Assuming that these conditions are fulfilled, formulate the null hypothesis and compute the $t$ statistic.
d. Evaluate the corresponding P-value based on [Student's $t$ table](../supports/t-table.pdf).
e. Based on these results, which decision would you take? Justify your answer.