---
title: "Basic statistical inference"
---

## Packages for this section

```{r inference-1-R-1}
library(tidyverse)
```


## Inference for means (from STAB57)

Three kinds of inference for means of normally-distributed data:

- **One-sample $t$**: a single sample from a population, estimate that population's mean
- **Two-sample $t$**: one sample from each of 2 populations, estimate difference in population means
- **Matched pairs $t$**: two paired measurements on same (or matched) individuals, estimate population mean difference

Two forms of inference for a population parameter:
 
- **Confidence interval**: "what is the population parameter?"
- **Hypothesis test**: "could the population parameter be equal to this value?"


## Examples:

- Blue jays attendances (one-sample)
- Kids learning to read (two-sample)
- Pain relief (matched pairs)

## Confidence interval

- You have a sample from some population
- Imagine repeated sampling from that population
- Procedure that gives an interval containing the true parameter in 95% (or 90% or 99%) of all possible samples

## Hypothesis test

- Null hypothesis gives value for population parameter
- Alternative hypothesis says how you are trying to prove the null hypothesis wrong (not equal, greater, less).
- Test statistic measures "distance" between data and null hypothesis
- P-value gives probability of observing test statistic *as extreme or more extreme*, **if the null hypothesis is true**.
- Reject null hypothesis if P-value small enough (eg smaller than 0.05).

## Why 0.05? This man.

::: columns
::: {.column width="40%"}
![](fisher.png)
:::

::: {.column width="60%"}
-   analysis of variance
-   Fisher information
-   Linear discriminant analysis
-   Fisher's $z$-transformation
-   Fisher-Yates shuffle
-   Behrens-Fisher problem

Sir Ronald A. Fisher, 1890--1962.
:::
:::

## Why 0.05? (2)

-   From The Arrangement of Field Experiments (1926):

![](fisher1.png){width="100%"}

-   and

![](fisher2.png){width="100%"}

## $\alpha$ and errors

-   Hypothesis test ends with decision:
    -   reject null hypothesis
    -   do not reject null hypothesis.
-   but decision may be wrong:

|                | Decision          |                 |
|----------------|-------------------|-----------------|
| **Truth**      | **Do not reject** | **reject null** |
| **Null true**  | Correct           | Type I error    |
| **Null false** | Type II error     | Correct         |

-   Either type of error is bad, but for now focus on controlling Type I
    error: write $\alpha$ = P(type I error), and devise test so that
    $\alpha$ small, typically 0.05.
-   That is, **if null hypothesis true**, have only small chance to
    reject it (which would be a mistake).
-   Worry about type II errors later (when we consider power of test).


## One sample: the Blue Jays attendances


-   The Toronto Blue Jays' average home attendance in part of 2015
    season was 25,070 (up to May 27 2015, from baseball-reference.com).
-   Does that mean the attendance at every game was exactly 25,070?
    Certainly not. Actual attendance depends on many things, eg.:
    -   how well the Jays are playing
    -   the opposition
    -   day of week
    -   weather
    -   random chance


## Reading the attendances

...as a `.csv` file:

\footnotesize

```{r inference-1-R-2}
my_url <- "http://ritsokiguess.site/datafiles/jays15-home.csv"
jays <- read_csv(my_url) 
jays
```

\normalsize

## Another way

-   This is a "big" data set: only 25 observations, but a lot of
    *variables*.

-   To see the first few values in all the variables, can also use
    `glimpse`:

\scriptsize

```{r inference-1-R-5}
glimpse(jays)
```

\normalsize

## Attendance histogram

```{r inference-1-R-6, fig.height=3.8}
ggplot(jays, aes(x = attendance)) + geom_histogram(bins = 6)
```

## Comments

-   Attendances have substantial variability, ranging from just over
    10,000 to around 50,000.
-   Distribution somewhat skewed to right (but no outliers).
-   These are a sample of "all possible games" (or maybe "all possible
    games played in April and May"). What can we say about mean
    attendance in all possible games based on this evidence?

## CI for mean attendance

-   `t.test` function does CI and test. Look at CI first:

```{r inference-1-R-7}
t.test(jays$attendance)
```

-   From 20,500 to 29,600.

## Or, 90% CI

-   by including a value for conf.level:

```{r inference-1-R-8}
t.test(jays$attendance, conf.level = 0.90)
```

-   From 21,300 to 28,800. (Shorter, as it should be.)

## Comments

-   Need to say "column attendance within data frame `jays`" using `$`.
-   95% CI from about 20,000 to about 30,000.
-   Not estimating mean attendance well at all!
-   Generally want confidence interval to be shorter, which happens if:
    -   SD smaller
    -   sample size bigger
    -   confidence level smaller
-   Last one is a cheat, really, since reducing confidence level
    increases chance that interval won't contain pop. mean at all!

## Another way to access data frame columns

```{r inference-1-R-9}
with(jays, t.test(attendance))
```

## Hypothesis testing for Blue Jays attendances


- Previous year's mean attendance was 29,327, so test to see whether the mean is different from that in any way (two-sided test):


```{r inference-1-R-10}
t.test(jays$attendance, mu = 29327)
```

-   See test statistic $-1.93$, P-value 0.065.
-   Do not reject null at $\alpha=0.05$: no evidence that mean
    attendance has changed.

## Another example: learning to read

- You devised new method for teaching children to read.
- Guess it will be more effective than current methods.
- To support this guess, collect data.
- Want to generalize to “all children in Canada”.
- So take random sample of all children in Canada.
- Or, argue that sample you actually have is “typical” of all children in
Canada.
- Randomization (1): whether or not a child in sample or not has
nothing to do with anything else about that child.
- Randomization (2): randomly choose whether each child gets new
reading method (t) or standard one (c).

## Reading in data 
- File at <http://ritsokiguess.site/datafiles/drp.txt>.
- Proper reading-in function is `read_delim` (check file to see)
- Read in thus:

```{r inference-1-R-12}
my_url <- "http://ritsokiguess.site/datafiles/drp.txt"
kids <- read_delim(my_url," ")
```

## The data (some) 

```{r inference-1-R-13}
kids
```

## Boxplots

```{r inference-1-R-14, fig.height=3.7}
ggplot(kids, aes(x = group, y = score)) + geom_boxplot()
```

## Two kinds of two-sample t-test

- Do the two groups have same spread (SD, variance)?
    - If yes (shaky assumption here), can use pooled t-test.
    - If not, use Welch-Satterthwaite t-test (safe).
- Pooled test derived in STAB57 (easier to derive, but assumes equal variances).
- Welch-Satterthwaite does not assume equality of variances.
- Assess (approx) equality of spreads using boxplot.

## The (Welch-Satterthwaite) t-test
- `c` (control) before `t` (treatment) alphabetically, so proper alternative
is “less”.
- R does Welch-Satterthwaite test by default 
- Answer to "does the new reading program really help?"
- (in a moment) how to get R to do pooled test?

## Welch-Satterthwaite 

```{r inference-1-R-15}
t.test(score ~ group, data = kids, alternative = "less")
```

## The pooled t-test 

```{r inference-1-R-16}
t.test(score ~ group, data = kids, 
       alternative = "less", var.equal = TRUE)
```

## Two-sided test; CI
- To do 2-sided test, leave out `alternative`:

```{r inference-1-R-17}
t.test(score ~ group, data = kids)
```


## Comments:

- P-values for pooled and Welch-Satterthwaite tests very similar (even though the pooled test seemed inferior): 0.013 vs.\ 0.014.
- Two-sided test also gives CI: new reading program increases average scores by
somewhere between about 1 and 19 points.
- Confidence intervals inherently two-sided, so do 2-sided test to get
them.

## Pain relief

Some data: 

\centering{
  \includegraphics[height=0.7\textheight]{Screenshot_2019-04-26_13-41-29}
}


## Matched pairs data

- Data are comparison of 2 drugs for effectiveness at reducing pain.

     - 12 subjects (cases) were arthritis sufferers
     - Response is #hours of pain relief from each drug.
      
- In reading example, each child tried only one reading method.
- But here, each subject tried out both drugs, giving us two
measurements.

    - Possible because, if you wait long enough, one drug has no influence
over effect of other.
    - Advantage: focused comparison of drugs. Compare one drug with
another on same person, removes a lot of variability due to differences between people. 
    - Matched pairs, requires different analysis.
      
- Design: randomly choose 6 of 12 subjects to get drug A first, other 6
get drug B first.

## Paired t test: reading the data
Values aligned in columns:  

```{r inference-4-R-1}
my_url <- "http://ritsokiguess.site/datafiles/analgesic.txt"
pain <- read_table(my_url)
```

## The data

```{r inference-4-R-2}
pain
```

## Paired *t*-test 

\small

```{r inference-4-R-3}
with(pain, t.test(druga, drugb, paired = T))
```

\normalsize

- P-value is 0.053. 
- Not quite evidence of difference between drugs.

## t-testing the differences

- Likewise, you can calculate the differences yourself and
do a 1-sample t-test on them.
- First calculate a column of differences:

\footnotesize

```{r inference-4-R-4}
(pain %>% mutate(diff=druga-drugb) -> pain)
```

\normalsize

## t-test on the differences
- then throw them into t.test, testing that the mean is zero, with
same result as before:
```{r inference-4-R-5}
with(pain, t.test(diff, mu=0))
```