--- title: "Personal work -- Differential analysis and multiple testing" author: "Jacques van Helden" date: "`r Sys.Date()`" output: html_document: code_folding: hide fig_caption: yes highlight: zenburn self_contained: no theme: cerulean toc: yes toc_depth: 3 toc_float: yes beamer_presentation: colortheme: dolphin fig_caption: yes fig_height: 6 fig_width: 7 fonttheme: structurebold highlight: tango incremental: no keep_tex: no slide_level: 2 theme: Montpellier toc: yes ioslides_presentation: colortheme: dolphin fig_caption: yes fig_height: 6 fig_width: 7 fonttheme: structurebold highlight: tango smaller: yes toc: yes widescreen: yes word_document: toc: yes toc_depth: '3' pdf_document: fig_caption: yes highlight: zenburn toc: yes toc_depth: 3 slidy_presentation: fig_caption: yes fig_height: 6 fig_width: 7 highlight: tango incremental: no keep_md: yes smaller: yes theme: cerulean toc: yes widescreen: yes font-import: http://fonts.googleapis.com/css?family=Risque subtitle: Probabilities and statistics for modelling 1 (STAT1) font-family: Garamond transition: linear --- ```{r include=FALSE, echo=FALSE, eval=TRUE} library(knitr) options(width = 300) knitr::opts_chunk$set( fig.width = 7, fig.height = 5, fig.path = 'figures/02_combinatorix_', fig.align = "center", size = "tiny", echo = TRUE, eval = TRUE, warning = FALSE, message = FALSE, results = TRUE, comment = "") # knitr::asis_output("\\footnotesize") ``` ## Introduction The personal work will consist in developing a workflow to run an experimental benchmark of differential analysis applied to thousands of features. The goal will be to acquire an empirical feeling of the impact of different parameters on the performances of differential analysis tests. To this purpose, we will work with artificial data produced by generating random numbers - following different distribiutions of probabilities (normal, Poisson), - with different differences between the population means - with equal or distinct variances We will then run different tests and compare their performances by measuring the power and false positive rates. ## Performances of the tests We will measure the performances of a test by running $r = 10,000$ times under $H_0$, and $r = 10,000$ times under $H_1$. - count the number of $FP$, $TP$, $FN$, $TN$ - summarise the result in a confusion table - compute the derived statistics: $FPR$, $FDR$ and $Sn$ ## Questions ### 1. Impact of the difference between the means We test the impact of the actual difference between population means on the performances of the Student test. For this, generate random datasets compliant with the assumptions (normality, homoscedasticity) but with a set of increasing differences between sample means. In all cases, we will set the standard deviation to $\sigma = 1$. | $\mu_1$ | $\mu_2$ | $\delta$ | Comment | |---------|---------|---------|---------| | 0 | 0 | 0 | $H_0$ | | -0.05 | 0.05 | 0.1| $H_1$ | | -0.1 | 0.1 | 0.2 | $H_1$ | | -0.25 | 0.25 | 0.5 | $H_1$ | | -0.5 | 0.5 | 1 | $H_1$ | | -1 | 1 | 2 | $H_1$ | Maybe other values if you have time and if it brings additional insight. ### 2. Impact of sample size Generate random numbers following a normal distribution, with equal variances, but with a series of different sample sizes ($n = {2, 4, 16, 64}$). Run Student test and evaluate the impact of sample size on the rate of false positive and power of the test. ### 3. Impact of variance Generate random numbers following a normal distribution, with equal sample sizes ($n_1 = n_2 = 10$), but with a series of different standard deviations ($\sigma = {0.1, 1, 10, 100}$). Run Student test and evaluate the impact of dispersion standard deviation on the rate of false positive and power of the test. ### 4. Impact of non-normality on Student test Generate random numbers following either a normal or a Poisson distribution. 1a. Small samples: $n_1 = n_2 = 4$ 1b. Large samples: $n_1 = n_2 = 50$ ### 5. Impact of heteroscedasticity on Student test Generate random numbers following a normal distribution with either equal or unequal variances (test different values). Run Student test on both data types, and measure the sensitivity and power of the respective tests. ### 6. Impact of homoscedasticity on Welch test Generate random numbers following a normal distribution with either equal or unequal variances (test different values). Run Welch test on both data types, and measure the sensitivity and power of the respective tests. ### 7. Parametric versus non-parametric tests Generate random numbers following a normal and some non-normal distribution of your choice (e.g. uniform). Run a parametric and a non-parametric test and compare the performances in terms fo power and false positive rate. ## Assignment of the tasks Each student will develop an analysis to test one particular question. | Student | Question | |---------|-----------------------------| | JvH + you | 1. Impact of the difference between the means | | | | ## Expected results - Descriptive statistics (explor your data): - graphical representations of the data (use whatever you like: histograms, boxplots, violin plots, dot plots, ...) - descriptive parameters of your samples (distribution of sample means, sample sd, ...) - Histogram of the p-values (20 bins) for all the cases under $H_0$, and under $H_1$, resp. - Volcano plot for all the cases under $H_0$, and under $H_1$, resp. - Indicators: $FPR$, $FDR$, $Sn$, $E-value$ for threshold values $alpha = 0.05$, $alpha = 0.01$, , $alpha = 0.001$ ## Report format - R markdown (must be able to run on my machine) - either a pdf or a self-contained HTML (preferred because easier to browse tables are better formatted) ## Structure of the report - Introduction: explain the question addressed in your specific report, and the expected outcome of what you are testing (do you expect an increase of FP ? a loss in sensitivity ? ...). - Setting of the experiment: write explicitly the configuration of your tests (distributions, parameters) - Results and interpretation: summarize your results with figures and tables, document the figures and tables with detailed legends, and add an interpretation of what we see in the results section. - Summary and conclusion: summarize the results in 2-3 sentences and conclude about the effect you wanted to test. ## Recommendations ### Coding recommendations 1. Choose a consistent coding style, consistent with a reference style guide (e.g. [Google R Syle Guide](https://google.github.io/styleguide/Rguide.html)) 1. Name each chunk of R code 2. Define your variables with explicit names (sigma, mu rather than a, b, c, ...) 3. Comment your code - indicate what each variable represents - before each segment of code, explain what it will do 4. Ensure consistency between the code and the report $\rightarrow$ inject the actual values of the R variables in the markdown. ### Scientific recommendations 1. Explicitly formulate the statistical hypotheses before running a test. 2. Discuss the assumptions underlying the test: are they all fulfilled? If not explain why (e.g. because we want to test the impact of this parameter, ...) ## Tutorial ### Define your parameters We draw samples from random data generated following normal distributions with the following parameters. For this we use R `rnorm()` function. ```{r parameters} ## Define parameters mu1 <- 0 # mean of the first population mu2 <- 0 # mean of the second population sigma1 <- 1 # standard deviation of the first population sigma2 <- 1 # standard deviation of the second population n1 <- 10 # sample size for the first group n2 <- 10 # ssample size for second group alpha <- 0.05 # Probability of the first kind error, used as threshold on p-value r <- 10000 # Number of replicates of the test ## A trick for this report: let us set an arbitrary seed in order to ensure consistency between the interpretations across subsequent runs of the markdown ## ## I take 42, Deep Thought's Answer to the Ultimate Question of Life, the Universe, and Everything (42) set.seed(seed = 42) ``` | Parameter | Value | Description | |-----------|-------|-------------------| | $\mu_1$ | `r mu1` | Mean of the first population | | $\mu_2$ | `r mu2` | Mean of the second population | | $\sigma_1$ | `r sigma1` | Standard deviation of the first population | | $\sigma_2$ | `r sigma1` | Standard deviation of the second population | | $n_1$ | `r n1` | Sample size for the first group | | $n_2$ | `r n2` | Sample size for the second group | ### Generate one test set (1 sample from each population) The table below shows the values of the two samples (rounded to two decimals for the sake of readability). ```{r random_data_for_one_test} ## Generate two vectors containing the values for sample 1 and sample 2, resp. x1 <- rnorm(n = n1, mean = mu1, sd = sigma1) ## sample 1 values x2 <- rnorm(n = n2, mean = mu2, sd = sigma2) ## sample 2 values kable(t(data.frame(x1 = round(digits = 2, x1), x2 = round(digits = 2, x2))), col.names = 1:n1 ) ``` ### Run Student test Since we are interested by differences in either directions, we run a two-tailed test. Hypotheses: $$H_0: \mu_1 = \mu2$$ $$H_1: \mu_1 \ne \mu2$$ ```{r one_t_test} ## Run Sudent t test on one pair of samples t.result <- t.test( x = x1, y = x2, alternative = "two.sided", var.equal = TRUE) ## Print the result of the t test print(t.result) ## Compute some additional statistics about the samples mean1 <- mean(x1) ## Mean of sample 1 mean2 <- mean(x2) ## Mean of sample 2 d <- mean2 - mean1 ## Difference between sample means ``` ### Interpret the result The difference between sample means was $d = `r round(digits = 2, d)`$. The $t$ test computed the $t$ statistics, which standardizes this observed distance between sample means relative to the estimated variance of the population, and to the sample sizes. With the random numbers generated above, the value is $t_{obs} = `r round(digits = 4, t.result$statistic)`$. The corresponding p-value is computed as the sum of the area of the left and right tails of the Student distribution, with $\nu = n_1 + n_2 - 2 = `r t.result$parameter`$ degrees of freedom. It indicates the probability of obtaining by chance -- under th null hypothesis -- a result at least as extreme as the one we observed. In our case, we obtain $p = P(T > |t_{obs}| = P(T > `r round(digits=4, abs(t.result$statistic))`) = `r signif(digits = 3, t.result$p.value)`$. This is higher than our threshold $alpha = `r alpha`$. We thus accept the null hypothesis. ### Replicating the test 10,000 times with a loop In R, loops are quite inefficient, and it is generally recommended to directly run the computations on whole vectors (R has been designed to be efficient for this), or to use specific functions in order to apply a given function each row / column of a table, or to each element of a list. For the sake of simplicity, we will first show how to implement a simple but inefficient code with a loop. In the advanced course (STATS2) will see how to optimize the speed with the `apply()` function. ```{r t_test_loop} ## Define the statistics we want to collect result.columns <- c("i", "m1", "m2", "diff", "statistic", "p.value") ## Instantiate a result table to store the results result.table <- data.frame(matrix(nrow = r, ncol = length(result.columns))) colnames(result.table) <- result.columns # set the column names # View(result.table) ## Check the table: it contians NA values ## Iterate random number sampling followed by t-tests for (i in 1:r) { ## Generate two vectors containing the values for sample 1 and sample 2, resp. x1 <- rnorm(n = n1, mean = mu1, sd = sigma1) ## sample 1 values x2 <- rnorm(n = n2, mean = mu2, sd = sigma2) ## sample 2 values ## Run the t test t.result <- t.test( x = x1, y = x2, alternative = "two.sided", var.equal = TRUE) # names(t.result) ## Collect the selected statistics in the result table result.table[i, "i"] <- i result.table[i, "statistic"] <- t.result["statistic"] result.table[i, "p.value"] <- t.result["p.value"] ## Compute some additional statistics about the samples result.table[i, "m1"] <- mean(x1) ## Mean of sample 1 result.table[i, "m2"] <- mean(x2) ## Mean of sample 2 result.table[i, "diff"] <- mean(x2) - mean(x1) ## Difference between sample means } ## View(result.table) ``` ### Distribution of the observed differences for the $`r r`$ iterations of the test ```{r diff_histo, fig.width=6, fig.height=8, out.width="50%"} par(mfrow = c(2, 1)) ## Draw an histogram of the observed differences max.diff <- max(abs(result.table$diff)) hist(result.table$diff, breaks = 100, col = "#BBFFDD", las = 1, xlim = c(-max.diff, max.diff), ## Make sure that the graph is centered on 0 main = "Differences between means", xlab = "Difference between means", ylab = "Number of tests") abline(v = 0) max.t <- max(abs(result.table$statistic)) hist(result.table$statistic, breaks = 100, col = "#BBFFDD", las = 1, xlim = c(-max.t, max.t), ## Make sure that the graph is centered on 0 main = "T statistics", xlab = "T-test statistic", ylab = "Number of tests") par(mfrow = c(1, 1)) abline(v = 0) ``` ### P-value histogram ```{r pval_histo, fig.width=7, fig.height=5, out.width="50%"} ## Draw an histogram of p-values with 20 bins hist(result.table$p.value, breaks = 20, col = "#BBFFDD", las = 1, main = "P-value histogram", xlab = "T-test P-value", ylab = "Number of tests") ## Draw a horizontal line indicating the number of tests per bin that would be expected under null hypothesis abline(h = r / 20, col = "darkgreen", lwd = 2) ``` ## Creating a function to reuse the same code with different parameters Depending on the selected task in the assignments above, we will run different tests with different parameters and compare the results. The most rudimentary way to do this is top copy-paste the chunk of code above for each test and set of parameters required for the assigned tasks. However, having several copies of an almost identical block of code is a very bad pracrice in programming, for several reasons - lack of readability: the code rapidly becomes very heavy; - difficulty to maintain: any modification has to be done on each copy of the chunk of code; - risk for consistency: this is a source of inconsistency, because at some moment we will modify one copy and forget another one. A better practice is to define a **function** that encapsulates the code, and enables to modify the parameters by passing them as **arguments*. Hereafter we define a function that - takes the parameters of the analysis as arguments - population means $\mu_1$ and $\mu_2$, - population standard deviations $\sigma_1$ and $\sigma_2$, - sample sizes $n_1$ and $n_2$, - number of iterations $r$. - runs $r$ iterations of the t-test with 2 random samples, - returns the results in a table with one row per iteration, and one column per resulting statistics (observed $t$ score, p-value, difference between means, ...); ```{r itrerate_t_test_function} ## Define a function that runs r iterations of the t-test. iterate.t.test <- function(mu1 = -0.5, ## Mean of the first population mu2 = 0.5, ## Mean of the second population sigma1 = 1, ## Standard deviation of the first population sigma2 = 1, ## Standard deviation of the second population n1 = 10, ## First sample size n2 = 10, ## Second sample size r = 10000 ## Iterations ) { ## Define the statistics we want to collect result.columns <- c("i", "m1", "m2", "diff", "statistic", "p.value") ## Instantiate a result table to store the results result.table <- data.frame(matrix(nrow = r, ncol = length(result.columns))) colnames(result.table) <- result.columns # set the column names # View(result.table) ## Check the table: it contians NA values ## Iterate random number sampling followed by t-tests for (i in 1:r) { ## Generate two vectors containing the values for sample 1 and sample 2, resp. x1 <- rnorm(n = n1, mean = mu1, sd = sigma1) ## sample 1 values x2 <- rnorm(n = n2, mean = mu2, sd = sigma2) ## sample 2 values ## Run the t test t.result <- t.test( x = x1, y = x2, alternative = "two.sided", var.equal = TRUE) # names(t.result) ## Collect the selected statistics in the result table result.table[i, "i"] <- i result.table[i, "statistic"] <- t.result["statistic"] result.table[i, "p.value"] <- t.result["p.value"] ## Compute some additional statistics about the samples result.table[i, "m1"] <- mean(x1) ## Mean of sample 1 result.table[i, "m2"] <- mean(x2) ## Mean of sample 2 result.table[i, "diff"] <- mean(x2) - mean(x1) ## Difference between sample means } return(result.table) ## This function returns the result table } ``` This function can then be used several times, with different values of the parameters. ```{r itrerate_t_test_runs} ## What happens when the two means are equal (under the null hypothesis) d0 <- iterate.t.test(mu1 = 0, mu2 = 0) ## Iterations ## Test increasing values of the difference between means d0.1 <- iterate.t.test(mu1 = -0.05, mu2 = 0.05) ## Iterations d0.2 <- iterate.t.test(mu1 = -0.1, mu2 = 0.1) ## Iterations d0.5 <- iterate.t.test(mu1 = -0.25, mu2 = 0.25) ## Iterations d1 <- iterate.t.test(mu1 = -0.5, mu2 = 0.5) ## Iterations d2 <- iterate.t.test(mu1 = -1, mu2 = 1) ## Iterations ``` ```{r pval_histogram_function} ## Define a function that rdraws the p-value histogram ## based on the result table of t-test iterations ## as produced by the iterate.t.test() function. pvalHistogram <- function( result.table, ## required input (no default value): the result table from iterate.t.test() main = "P-value histogram", ## main title (with default value) alpha = 0.05, ## Significance threshold ... ## Additional parameters, which will be passed to hist() ) { ## Plot the histogram hist(result.table$p.value, breaks = seq(from = 0, to = 1, by = 0.05), las = 1, xlim = c(0,1), main = main, xlab = "T-test P-value", ylab = "Number of tests", ...) ## Draw a horizontal line indicating the number of tests per bin that would be expected under null hypothesis abline(h = r / 20, col = "darkgreen", lwd = 2) abline(v = alpha, col = "red", lwd = 2) ## Compute the percent of positive and negative results`` nb.pos <- sum(result.table$p.value < alpha) nb.neg <- r - nb.pos percent.pos <- 100 * nb.pos / r percent.neg <- 100 * nb.neg / r ## Add a legend indicating the percent of iterations declaed positive and negative, resp. legend("topright", bty = "o", bg = "white", legend = c( paste("% pos =", round(digits = 2, percent.pos)), paste("% neg =", round(digits = 2, percent.neg)) )) } ``` ```{r pval_histograms, fig.width=8, fig.height=8, out.width="90%"} par(mfrow = c(3, 2)) ## Prepare 2 x 2 panels figure pvalHistogram(d0, main = "under H0 (mu1 = mu2)", col = "#BBFFDD") pvalHistogram(d0.1, main = "mu1 - mu2 = 0.1", col = "#FFDDBB") pvalHistogram(d0.2, main = "mu1 - mu2 = 0.2", col = "#FFBB88") pvalHistogram(d0.5, main = "mu1 - mu2 = 0.5", col = "#FF8866") pvalHistogram(d1, main = "mu1 - mu2 = 1", col = "#FF6633") pvalHistogram(d2, main = "mu1 - mu2 = 2", col = "#FF4422") par(mfrow = c(1, 1)) ## Restore single-panel layout for next figures ``` ## Interpretation of the results We should now write a report of interpretation, which will address the following questions. - Based on the experiments under $H_0$, compute the number of false positives and estimate the **false positive rate** (**FPR**). Compare these values with the **E-value** (expected number of false positives) for the `r r` tests, and with your $alpha$ trheshold. - Based on the experiments under $H_1$, estimate the **sensitivity** (**Sn**) of the test for the different mean differences tested here. - Interpret the histograms of P-values obtained with the different parameters ? - Draw a **power curve** (i.e. the sensitivity as a function of the actual difference between population means) - Discuss about the adequation between the test and the conditions of our simulations. - Do these observations correspond to what would be expected? The same kind of questions will be asked for the 6 other questions above (impact of sample size, variance, non-normality, heteroscadicity, parametric vs non-parametric test).