---
title: "Statistical inference: one and two-sample t-tests"
editor: 
  markdown: 
    wrap: 72
---

## Statistical Inference and Science

-   Previously: descriptive statistics. "Here are data; what do they say?"
-   May need to take some action based on information in data.
-   Or want to generalize beyond data (sample) to larger world (population).
-   Science: first guess about how world works.
-   Then collect data, by sampling.
-   Is guess correct (based on data) for whole world, or not?

## Sample data are imperfect

-   Sample data never entirely represent what you're observing.
-   There is always random error present.
-   Thus you can never be entirely certain about your conclusions.
-   The Toronto Blue Jays' average home attendance in part of the 2015 season was 25,070 (up to May 27, 2015, from baseball-reference.com).
-   Does that mean the attendance at every game was exactly 25,070? Certainly not. Actual attendance depends on many things, e.g.:
    -   how well the Jays are playing
    -   the opposition
    -   day of week
    -   weather
    -   random chance

## Packages for this section

```{r inference-1-R-1}
library(tidyverse)
```

## Reading the attendances

... as a `.csv` file:

```{r inference-1-R-2}
my_url <- "http://ritsokiguess.site/datafiles/jays15-home.csv"
jays <- read_csv(my_url)
jays
```

## Another way

-   This is a big data set: only 25 observations, but a lot of *variables*.
-   To see the first few values in all the variables, can also use `glimpse`:

```{r inference-1-R-5}
glimpse(jays)
```

## Attendance histogram

```{r inference-1-R-6, fig.height=3.8}
ggplot(jays, aes(x = attendance)) + geom_histogram(bins = 6)
```

## Comments

-   Attendances have substantial variability, ranging from just over 10,000 to around 50,000.
-   Distribution somewhat skewed to right (but no outliers).
-   These are a sample of "all possible games" (or maybe "all possible games played in April and May"). What can we say about mean attendance in all possible games, based on this evidence?
-   Think about:
    -   Confidence interval
    -   Hypothesis test.
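## The CI formula, by hand

The confidence interval that `t.test` produces comes from the formula $\bar{x} \pm t^* s/\sqrt{n}$, with $t^*$ a quantile of the $t$ distribution on $n - 1$ df. As a sketch, here it is computed by hand on a small made-up sample (the numbers are illustrative, not the Jays attendances):

```{r}
# Made-up toy data, *not* the attendance figures
x <- c(12, 15, 9, 22, 17, 14, 19, 11)
n <- length(x)
xbar <- mean(x)
s <- sd(x)
# 95% CI: xbar +/- t* s/sqrt(n), t* = 0.975 quantile of t on n-1 df
tstar <- qt(0.975, df = n - 1)
lo <- xbar - tstar * s / sqrt(n)
hi <- xbar + tstar * s / sqrt(n)
c(lo, hi)
# should agree with the interval t.test reports:
t.test(x)$conf.int
```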
## Getting CI for mean attendance

-   `t.test` function does CI and test. Look at CI first:

```{r inference-1-R-7}
t.test(jays$attendance)
```

-   From 20,500 to 29,600.

## Or, 90% CI

-   by including a value for `conf.level`:

```{r inference-1-R-8}
t.test(jays$attendance, conf.level = 0.90)
```

-   From 21,300 to 28,800. (Shorter, as it should be.)

## Comments

-   Need to say "column `attendance` within data frame `jays`" using `$`.
-   95% CI from about 20,000 to about 30,000.
-   Not estimating mean attendance well at all!
-   Generally want confidence interval to be shorter, which happens if:
    -   SD smaller
    -   sample size bigger
    -   confidence level smaller.
-   Last one is a cheat, really, since reducing confidence level increases chance that interval won't contain pop. mean at all!

## Another way to access data frame columns

```{r inference-1-R-9}
with(jays, t.test(attendance))
```

## Hypothesis test

-   CI answers question "what is the mean?"
-   Might have a value $\mu$ in mind for the mean, and question "Is the mean equal to $\mu$, or not?"
-   For example, 2014 average attendance was 29,327.
-   "Is the mean this?" answered by **hypothesis test**.
-   Value being assessed goes in **null hypothesis**: here, $H_0: \mu = 29327$.
-   **Alternative hypothesis** says how null might be wrong, e.g. $H_a: \mu \ne 29327$.
-   Assess evidence against null. If that evidence strong enough, *reject null hypothesis*; if not, *fail to reject null hypothesis* (sometimes *retain null*).
-   Note asymmetry between null and alternative, and utter absence of word "accept".

## $\alpha$ and errors

-   Hypothesis test ends with decision:
    -   reject null hypothesis
    -   do not reject null hypothesis.
-   but decision may be wrong:

|                | Decision          |                 |
|----------------|-------------------|-----------------|
| **Truth**      | **Do not reject** | **Reject null** |
| **Null true**  | Correct           | Type I error    |
| **Null false** | Type II error     | Correct         |

-   Either type of error is bad, but for now focus on controlling Type I error: write $\alpha$ = P(type I error), and devise test so that $\alpha$ small, typically 0.05.
-   That is, **if null hypothesis true**, have only small chance to reject it (which would be a mistake).
-   Worry about Type II errors later (when we consider power of test).

## Why 0.05? This man.

::: columns
::: {.column width="40%"}
![](fisher.png)
:::

::: {.column width="60%"}
-   analysis of variance
-   Fisher information
-   linear discriminant analysis
-   Fisher's $z$-transformation
-   Fisher-Yates shuffle
-   Behrens-Fisher problem

Sir Ronald A. Fisher, 1890--1962.
:::
:::

## Why 0.05? (2)

-   From *The Arrangement of Field Experiments* (1926):

![](fisher1.png){width="200%"}

-   and

![](fisher2.png){width="200%"}

## Three steps:

-   from data to test statistic
    -   how far are data from null hypothesis?
-   from test statistic to P-value
    -   how likely are you to see "data like this" **if the null hypothesis is true**?
-   from P-value to decision
    -   reject null hypothesis if P-value small enough; fail to reject it otherwise.

## Using `t.test`:

```{r inference-1-R-10}
t.test(jays$attendance, mu = 29327)
```

-   See test statistic $-1.93$, P-value 0.065.
-   Do not reject null at $\alpha = 0.05$: no evidence that mean attendance has changed.

## Assumptions

-   Theory for $t$-test: assumes normally-distributed data.
-   What actually matters is sampling distribution of sample mean: if this is approximately normal, $t$-test is OK, even if data distribution is not normal.
-   Central Limit Theorem: if sample size large, sampling distribution approx. normal even if data distribution somewhat non-normal.
-   So look at shape of data distribution, and make a call about whether it is normal enough, given the sample size.
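## The three steps, by hand

The three steps (test statistic, P-value, decision) can be carried out by hand and checked against `t.test`. A sketch on a small made-up sample; the hypothesized mean `mu0 = 18` is chosen arbitrarily for illustration:

```{r}
# Made-up toy data and an arbitrary null-hypothesis mean
x <- c(12, 15, 9, 22, 17, 14, 19, 11)
mu0 <- 18
n <- length(x)
# Step 1: test statistic -- how far is xbar from mu0, in SE units?
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))
# Step 2: two-sided P-value from the t distribution on n-1 df
p_value <- 2 * pt(-abs(t_stat), df = n - 1)
t_stat
p_value
# Step 3: decision -- compare p_value with alpha = 0.05.
# t.test(x, mu = mu0) reports the same statistic and P-value.
```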
## Blue Jays attendances again:

-   You might say that this is not normal enough for a sample size of $n = 25$, in which case you don't trust the $t$-test result:

```{r inference-1-R-11, fig.height=3}
ggplot(jays, aes(x = attendance)) + geom_histogram(bins = 6)
```

## Another example: learning to read

-   You devised new method for teaching children to read.
-   Guess it will be more effective than current methods.
-   To support this guess, collect data.
-   Want to generalize to "all children in Canada".
-   So take random sample of all children in Canada.
-   Or, argue that sample you actually have is "typical" of all children in Canada.
-   Randomization (1): whether or not a child is in the sample has nothing to do with anything else about that child.
-   Randomization (2): randomly choose whether each child gets new reading method (`t`) or standard one (`c`).

## Reading in data

-   File at <http://ritsokiguess.site/datafiles/drp.txt>.
-   Proper reading-in function is `read_delim` (check file to see).
-   Read in thus:

```{r inference-1-R-12}
my_url <- "http://ritsokiguess.site/datafiles/drp.txt"
kids <- read_delim(my_url, " ")
```

## The data

```{r inference-1-R-13}
kids
```

In `group`, `t` is "treatment" (the new reading method) and `c` is "control" (the old one).

## Boxplots

```{r inference-1-R-14, fig.height=3.7}
ggplot(kids, aes(x = group, y = score)) + geom_boxplot()
```

## Two kinds of two-sample t-test

-   pooled (derived in B57):
    $$ t = {\bar{x}_1 - \bar{x}_2 \over s_p \sqrt{(1 / n_1) + (1 / n_2)}},$$
    where
    $$ s_p^2 = {(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2 \over n_1 + n_2 - 2}$$
-   Welch-Satterthwaite:
    $$ t = {\bar{x}_1 - \bar{x}_2 \over \sqrt{{s_1^2 / n_1} + {s_2^2 / n_2}}}$$
-   this $t$ does not have exact $t$-distribution, but is approx $t$ with non-integer df.

## Two kinds of two-sample t-test

-   Do the two groups have same spread (SD, variance)?
-   If yes (shaky assumption here), can use pooled t-test.
-   If not, use Welch-Satterthwaite t-test (safe).
-   Pooled test derived in STAB57 (easier to derive).
-   Welch-Satterthwaite is test used in STAB22, and is generally safe.
-   Assess (approx) equality of spreads using boxplot.

## The (Welch-Satterthwaite) t-test

-   `c` (control) before `t` (treatment) alphabetically, so proper alternative is "less".
-   R does Welch-Satterthwaite test by default.
-   Answer to "does the new reading program really help?"
-   (in a moment) how to get R to do pooled test?

## Welch-Satterthwaite

```{r inference-1-R-15}
t.test(score ~ group, data = kids, alternative = "less")
```

## The pooled t-test

```{r inference-1-R-16}
t.test(score ~ group, data = kids, alternative = "less", var.equal = TRUE)
```

## Two-sided test; CI

-   To do 2-sided test, leave out `alternative`:

```{r inference-1-R-17}
t.test(score ~ group, data = kids)
```

## Comments:

-   P-values for pooled and Welch-Satterthwaite tests very similar (even though the pooled test seemed inferior): 0.013 vs. 0.014.
-   Two-sided test also gives CI: new reading program increases average scores by somewhere between about 1 and 19 points.
-   Confidence intervals inherently two-sided, so do 2-sided test to get them.

## Jargon for testing

-   Alternative hypothesis: what we are trying to prove (new reading program is effective).
-   Null hypothesis: "there is no difference" (new reading program no better than current program). Must contain "equals".
-   One-sided alternative: trying to prove better (as with reading program).
-   Two-sided alternative: trying to prove different.
-   Test statistic: something expressing difference between data and null (e.g. difference in sample means, $t$ statistic).
-   P-value: probability of observing test statistic value as extreme or more extreme, if null is true.
-   Decision: either reject null hypothesis or do not reject null hypothesis. **Never "accept"**.

## Logic of testing

-   Work out what would happen if null hypothesis were true.
-   Compare to what actually did happen.
-   If these are too far apart, conclude that null hypothesis is not true after all. (Be guided by P-value.)
-   As applied to our reading programs:
    -   If reading programs equally good, expect to see a difference in means close to 0.
    -   Mean reading score was 10 higher for new program.
    -   Difference of 10 was unusually big (P-value small from t-test). So conclude that new reading program is effective.
-   Nothing here about what happens if null hypothesis is false. This is power and type II error probability.
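## The pooled formula, by hand

The pooled formula from earlier can be checked by hand against R's `var.equal = TRUE` output. A sketch with made-up group data (illustrative numbers, not the reading scores):

```{r}
# Made-up toy data for two groups
x1 <- c(52, 60, 55, 48, 66, 59)
x2 <- c(45, 50, 41, 54, 47, 49)
n1 <- length(x1)
n2 <- length(x2)
# pooled variance: weighted average of the two sample variances
sp2 <- ((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2)
# pooled t statistic, on n1 + n2 - 2 df
t_pooled <- (mean(x1) - mean(x2)) / sqrt(sp2 * (1 / n1 + 1 / n2))
t_pooled
# should agree with the statistic from the pooled t.test:
t.test(x1, x2, var.equal = TRUE)$statistic
```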