---
title: "Statistical inference: one and two-sample t-tests"
editor: 
  markdown: 
    wrap: 72
---

## Statistical Inference and Science

-   Previously: descriptive statistics. "Here are data; what do they
    say?".
-   May need to take some action based on information in data.
-   Or want to generalize beyond data (sample) to larger world
    (population).
-   Science: first guess about how world works.
-   Then collect data, by sampling.
-   Is guess correct (based on data) for whole world, or not?

## Sample data are imperfect

-   Sample data never entirely represent what you're observing.
-   There is always random error present.
-   Thus you can never be entirely certain about your conclusions.
-   The Toronto Blue Jays' average home attendance in part of 2015
    season was 25,070 (up to May 27 2015, from baseball-reference.com).
-   Does that mean the attendance at every game was exactly 25,070?
    Certainly not. Actual attendance depends on many things, eg.:
    -   how well the Jays are playing
    -   the opposition
    -   day of week
    -   weather
    -   random chance

## Packages for this section

```{r inference-1-R-1}
library(tidyverse)
```

## Reading the attendances

...as a `.csv` file:

```{r inference-1-R-2}
my_url <- "http://ritsokiguess.site/datafiles/jays15-home.csv"
jays <- read_csv(my_url) 
jays
```

## Another way

-   This is a "big" data set: only 25 observations, but a lot of
    *variables*.

-   To see the first few values in all the variables, can also use
    `glimpse`:

```{r inference-1-R-5}
glimpse(jays)
```

## Attendance histogram

```{r inference-1-R-6, fig.height=3.8}
ggplot(jays, aes(x = attendance)) + geom_histogram(bins = 6)
```

## Comments

-   Attendances have substantial variability, ranging from just over
    10,000 to around 50,000.
-   Distribution somewhat skewed to right (but no outliers).
-   These are a sample of "all possible games" (or maybe "all possible
    games played in April and May"). What can we say about mean
    attendance in all possible games based on this evidence?
-   Think about:
    -   Confidence interval
    -   Hypothesis test.

## Getting CI for mean attendance

-   `t.test` function does CI and test. Look at CI first:

```{r inference-1-R-7}
t.test(jays$attendance)
```

-   From 20,500 to 29,600.

## Or, 90% CI

-   by including a value for conf.level:

```{r inference-1-R-8}
t.test(jays$attendance, conf.level = 0.90)
```

-   From 21,300 to 28,800. (Shorter, as it should be.)

## Comments

-   Need to say "column attendance within data frame `jays`" using \$.
-   95% CI from about 20,000 to about 30,000.
-   Not estimating mean attendance well at all!
-   Generally want confidence interval to be shorter, which happens if:
    -   SD smaller
    -   sample size bigger
    -   confidence level smaller
-   Last one is a cheat, really, since reducing confidence level
    increases chance that interval won't contain pop. mean at all!

## Another way to access data frame columns

```{r inference-1-R-9}
with(jays, t.test(attendance))
```

## Hypothesis test

-   CI answers question "what is the mean?"
-   Might have a value $\mu$ in mind for the mean, and question "Is the
    mean equal to $\mu$, or not?"
-   For example, 2014 average attendance was 29,327.
-   "Is the mean this?" answered by **hypothesis test**.
-   Value being assessed goes in **null hypothesis**: here,
    $H_0 : \mu = 29327$.
-   **Alternative hypothesis** says how null might be wrong, eg.
    $H_a : \mu \ne 29327$.
-   Assess evidence against null. If that evidence strong enough,
    *reject null hypothesis;* if not, *fail to reject null hypothesis*
    (sometimes *retain null*).
-   Note asymmetry between null and alternative, and utter absence of
    word "accept".

## $\alpha$ and errors

-   Hypothesis test ends with decision:
    -   reject null hypothesis
    -   do not reject null hypothesis.
-   but decision may be wrong:

|                | Decision          |                 |
|----------------|-------------------|-----------------|
| **Truth**      | **Do not reject** | **reject null** |
| **Null true**  | Correct           | Type I error    |
| **Null false** | Type II error     | Correct         |

-   Either type of error is bad, but for now focus on controlling Type I
    error: write $\alpha$ = P(type I error), and devise test so that
    $\alpha$ small, typically 0.05.
-   That is, **if null hypothesis true**, have only small chance to
    reject it (which would be a mistake).
-   Worry about type II errors later (when we consider power of test).

## Why 0.05? This man.

::: columns
::: {.column width="40%"}
![](fisher.png)
:::

::: {.column width="60%"}
-   analysis of variance
-   Fisher information
-   Linear discriminant analysis
-   Fisher's $z$-transformation
-   Fisher-Yates shuffle
-   Behrens-Fisher problem

Sir Ronald A. Fisher, 1890--1962.
:::
:::

## Why 0.05? (2)

-   From The Arrangement of Field Experiments (1926):

![](fisher1.png){width="200%"}

-   and

![](fisher2.png){width="200%"}

## Three steps:

-   from data to test statistic
    -   how far are data from null hypothesis
-   from test statistic to P-value
    -   how likely are you to see "data like this" **if the null
        hypothesis is true**
-   from P-value to decision
    -   reject null hypothesis if P-value small enough, fail to reject
        it otherwise

## Using `t.test`:

```{r inference-1-R-10}
t.test(jays$attendance, mu=29327)
```

-   See test statistic $-1.93$, P-value 0.065.
-   Do not reject null at $\alpha=0.05$: no evidence that mean
    attendance has changed.

## Assumptions

-   Theory for $t$-test: assumes normally-distributed data.
-   What actually matters is sampling distribution of sample mean: if
    this is approximately normal, $t$-test is OK, even if data
    distribution is not normal.
-   Central limit theorem: if sample size large, sampling distribution
    approx. normal even if data distribution somewhat non-normal.
-   So look at shape of data distribution, and make a call about whether
    it is normal enough, given the sample size.

## Blue Jays attendances again:

-   You might say that this is not normal enough for a sample size of
    $n = 25$, in which case you don't trust the $t$-test result:

```{r inference-1-R-11, fig.height=6}
ggplot(jays, aes(x = attendance)) + geom_histogram(bins = 6)
```

## Another example: learning to read

-   You devised new method for teaching children to read.
-   Guess it will be more effective than current methods.
-   To support this guess, collect data.
-   Want to generalize to "all children in Canada".
-   So take random sample of all children in Canada.
-   Or, argue that sample you actually have is "typical" of all children
    in Canada.
-   Randomization (1): whether or not a child in sample or not has
    nothing to do with anything else about that child.
-   Randomization (2): randomly choose whether each child gets new
    reading method (t) or standard one (c).

## Reading in data

-   File at <http://ritsokiguess.site/datafiles/drp.txt>.
-   Proper reading-in function is `read_delim` (check file to see)
-   Read in thus:

```{r inference-1-R-12}
my_url <- "http://ritsokiguess.site/datafiles/drp.txt"
kids <- read_delim(my_url," ")
```

## The data

```{r inference-1-R-13}
kids
```

In `group`, `t` is "treatment" (the new reading method) and `c` is
"control" (the old one).

## Boxplots

```{r inference-1-R-14, fig.height=6}
ggplot(kids, aes(x = group, y = score)) + geom_boxplot()
```

## Two kinds of two-sample t-test

-   pooled (derived in B57):
    $t = { \bar{x}_1 - \bar{x}_2 \over s_p \sqrt{(1 / n_1) + (1 / n_2)}}$,
    -   where
        $s_p^2 = {(n_1 - 1) s_1^2 + (n_2 - 1)s_2^2 \over n_1 + n_2 -2}$
-   Welch-Satterthwaite:
    $t = {\bar{x}_1 - \bar{x}_2 \over \sqrt {{s_1^2 / n_1} + {s_2^2 / n_2}}}$
    -   this $t$ does not have exact $t$-distribution, but is approx $t$
        with non-integer df.

## Two kinds of two-sample t-test

-   Do the two groups have same spread (SD, variance)?
    -   If yes (shaky assumption here), can use pooled t-test.
    -   If not, use Welch-Satterthwaite t-test (safe).
-   Pooled test derived in STAB57 (easier to derive).
-   Welch-Satterthwaite is test used in STAB22 and is generally safe.
-   Assess (approx) equality of spreads using boxplot.

## The (Welch-Satterthwaite) t-test

-   `c` (control) before `t` (treatment) alphabetically, so proper
    alternative is "less".
-   R does Welch-Satterthwaite test by default
-   Answer to "does the new reading program really help?"
-   (in a moment) how to get R to do pooled test?

## Welch-Satterthwaite

```{r inference-1-R-15}
t.test(score ~ group, data = kids, alternative = "less")
```

## The pooled t-test

```{r inference-1-R-16}
t.test(score ~ group, data = kids, 
       alternative = "less", var.equal = TRUE)
```

## Two-sided test; CI

-   To do 2-sided test, leave out `alternative`:

```{r inference-1-R-17}
t.test(score ~ group, data = kids)
```

## Comments:

-   P-values for pooled and Welch-Satterthwaite tests very similar (even
    though the pooled test seemed inferior): 0.013 vs. 0.014.
-   Two-sided test also gives CI: new reading program increases average
    scores by somewhere between about 1 and 19 points.
-   Confidence intervals inherently two-sided, so do 2-sided test to get
    them.

## Jargon for testing

-   Alternative hypothesis: what we are trying to prove (new reading
    program is effective).
-   Null hypothesis: "there is no difference" (new reading program no
    better than current program). Must contain "equals".
-   One-sided alternative: trying to prove better (as with reading
    program).
-   Two-sided alternative: trying to prove different.
-   Test statistic: something expressing difference between data and
    null (eg. difference in sample means, $t$ statistic).
-   P-value: probability of observing test statistic value as extreme or
    more extreme, if null is true.
-   Decision: either reject null hypothesis or do not reject null
    hypothesis. **Never "accept"**.

## Logic of testing

-   Work out what would happen if null hypothesis were true.
-   Compare to what actually did happen.
-   If these are too far apart, conclude that null hypothesis is not
    true after all. (Be guided by P-value.)
-   As applied to our reading programs:
    -   If reading programs equally good, expect to see a difference in
        means close to 0.
    -   Mean reading score was 10 higher for new program.
    -   Difference of 10 was unusually big (P-value small from t-test).
        So conclude that new reading program is effective.
-   Nothing here about what happens if null hypothesis is false. This is
    power and type II error probability.