---
title: "Statistical tests"
author: Abhijit Dasgupta
date: BIOF 339
---

```{r setup, include=FALSE, child = here::here('slides/templates/setup.Rmd')}
```

---
class: middle, center

# Comparing two groups

```{r, echo=F, results='hide'}
library(rio)
brca1 <- import('../data/clinical_data_breast_cancer_hw.csv')
brca2 <- import('../data/BreastCancer_Expression.csv')
brca <- left_join(brca1, brca2, by=c('Complete.TCGA.ID' = 'TCGA_ID')) %>%
  mutate(Age.at.Initial.Pathologic.Diagnosis = as.numeric(Age.at.Initial.Pathologic.Diagnosis)) %>%
  mutate(ER.Status = ifelse(ER.Status %in% c('Positive','Negative'), ER.Status, NA))
```

---

## The t-test

The t-test tests whether the mean of a variable differs between two groups.

It assumes that the data are normally distributed, but it is robust to deviations from normality.

Do **not** test for normality before doing the t-test. It isn't necessary, and choosing your test based on a preliminary normality check distorts your error rates.

---

## The t-test

In R, there is a convenient function `t.test`

```{r 08-Summaries-24}
t.test(NP_958782 ~ ER.Status, data = brca)
```

Read the code as "Do a t-test to see if (the mean of) `NP_958782` differs by `ER.Status`, where both are taken from the data set `brca`"

You can read the `~` as "by", as in "t-test of NP_958782 by ER.Status"

---

## The t-test

The package `broom` provides a function `tidy` that makes the results of these statistical tests tidy.


```{r 08-Summaries-25}
t.test(NP_958782 ~ ER.Status, data=brca) %>%
  broom::tidy()
```

--

```{r 08-Summaries-26, echo=F}
t.test(NP_958782 ~ ER.Status, data=brca)
```

---

```{r through, include=FALSE, message=F, warning=F}
brca %>%
  select(ER.Status, starts_with('NP')) %>%
  pivot_longer(names_to = 'protein', values_to = 'expression', cols = c(-ER.Status)) %>%
  split(.$protein) %>%
  map(~broom::tidy(t.test(expression ~ ER.Status, data=.))) %>%
  bind_rows(.id = 'Protein') %>%
  select(Protein, estimate, p.value, conf.low, conf.high)
```

`r chunk_reveal('through', title = '## Using broom The fact that broom::tidy makes the results of tests into tibbles is in fact extremely useful in high-throughput work', widths = c(60,40))`

---
class: center, middle

# Back to testing

---

## Wilcoxon test, a nonparametric alternative to the t-test

```{r 08-Summaries-32}
wilcox.test(NP_958782 ~ ER.Status, data=brca) %>%
  broom::tidy()
```

--

```{r 08-Summaries-33, echo=F}
wilcox.test(NP_958782 ~ ER.Status, data=brca)
```

---

## Wilcoxon test

.pull-left[
```{r test3, eval = F, echo = T}
brca %>%
  select(ER.Status, starts_with('NP')) %>%
  tidyr::gather(protein, expression, -ER.Status) %>%
  split(.$protein) %>%
  map(~broom::tidy(wilcox.test(expression ~ ER.Status, data=.))) %>%
  bind_rows(.id='Protein') %>%
  select(Protein, p.value)
```
]
.pull-right[
```{r 08-Summaries-34, eval=T, echo = F, ref.label="test3"}
```
]

---

## Using `tableone`

```{r table4, eval = F, echo = T}
CreateTableOne(
  data = brca %>% filter(!is.na(ER.Status)),
  vars = brca %>% select(starts_with('NP')) %>% names(),
  strata = 'ER.Status',
  test = T,
  testNormal = t.test
)
```

```{r 08-Summaries-35, eval=T, echo = F, ref.label="table4"}
```

--

These are not quite the same results as before: by default, `tableone` passes `var.equal = TRUE` to the test, whereas `t.test` on its own defaults to the unequal-variance (Welch) test.

---

## Using `tableone`

```{r table4a, eval = F, echo = T}
CreateTableOne(
  data = brca %>% filter(!is.na(ER.Status)),
  vars = brca %>% select(starts_with('NP')) %>% names(),
  strata = 'ER.Status',
  test = T,
  testNormal = t.test,
  argsNormal = list(var.equal=F) #<<
)
```

```{r 08-Summaries-36, eval=T, echo = F, ref.label="table4a"}
```

---

## Tests for discrete data

Testing whether the distribution of a categorical variable differs by levels of another categorical variable can be done using either the Chi-square test (`chisq.test`) or Fisher's exact test (`fisher.test`).

Both take a contingency table, which you can create first with `table`.

```{r 08-Summaries-37}
fisher.test(table(brca$Tumor, brca$ER.Status))
```

---

## Tests for discrete data

Testing whether the distribution of a categorical variable differs by levels of another categorical variable can be done using either the Chi-square test (`chisq.test`) or Fisher's exact test (`fisher.test`).

Both take a contingency table, which you can create first with `table`.

```{r 08-Summaries-38}
chisq.test(table(brca$Tumor, brca$ER.Status))
```

---

## Tests for discrete data

We can use `broom::tidy` for either of these

```{r 08-Summaries-39}
chisq.test(table(brca$Tumor, brca$ER.Status)) %>%
  broom::tidy()
```

---

## Using `tableone`

```{r 08-Summaries-40}
CreateCatTable(vars = c('Tumor','Node','Metastasis'),
               data = filter(brca, !is.na(ER.Status)),
               strata = 'ER.Status',
               test = T) # chisq.test
```

---

## Using `tableone`

```{r 08-Summaries-41}
c1 <- CreateCatTable(vars = c('Tumor','Node','Metastasis'),
                     data = filter(brca, !is.na(ER.Status)),
                     strata = 'ER.Status',
                     test = T)
print(c1, exact = c('Tumor','Node','Metastasis')) # fisher.test
```
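
---

## Chi-square or Fisher?

The slides above switch between `chisq.test` and `fisher.test`. As a rough sketch of when that choice can matter, the block below runs both tests on a small 2x2 table using only base R. The counts and the labels `Group`, `A` and `B` are made up for illustration and do not come from `brca`. The "expected counts below 5" check is a common rule of thumb, not a hard rule.

```r
# Hypothetical counts, invented for illustration (not taken from brca)
tab <- matrix(c(8, 2,
                1, 5),
              nrow = 2, byrow = TRUE,
              dimnames = list(Group = c('A', 'B'),
                              Status = c('Negative', 'Positive')))

# Chi-square test; suppressWarnings() hides R's warning that the
# approximation may be incorrect when expected counts are small
chi <- suppressWarnings(chisq.test(tab))

# Fisher's exact test on the same table
fis <- fisher.test(tab)

# Expected cell counts under independence; small values suggest
# that the exact test may be the safer choice
chi$expected

# Put the two p-values side by side
c(chisq = chi$p.value, fisher = fis$p.value)
```

With larger tables and counts the two tests usually agree closely, which is one reason `tableone` uses the chi-square test unless you ask for `exact`.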