---
title: "Alternative tests"
---

## When there isn't sufficient normality

- When your samples are not normal enough, cannot use $t$ procedures
- Sometimes transforming the data (eg taking logs of all the values) will help
- Or, use test with no assumptions about normality:
  - for one sample, use *sign test* for median
  - for two samples, use *Mood's median test*
  - for matched pairs, use *sign test* on differences.
  
  
##  One-sample: the IRS data

-   The IRS ("Internal Revenue Service") is the US authority that deals
    with taxes (like Revenue Canada).
-   One of their forms is supposed to take no more than 160 minutes to
    complete. A citizen's organization claims that it takes people
    longer than that on average.
-   Sample of 30 people; time to complete form recorded.
-   Read in data, and do $t$-test of $H_0 : \mu = 160$ vs.
    $H_a : \mu > 160$.
-   Only one column, so pretend it is
    delimited by something.

## Packages

```{r}
library(tidyverse)
library(smmr) 
```

- installation instructions for `smmr` later

## Read in data

\footnotesize

```{r inference-3-R-6, message=FALSE}
my_url <- "http://ritsokiguess.site/datafiles/irs.txt"
irs <- read_csv(my_url)
irs
```

\normalsize

## Test whether mean is 160 or greater

```{r inference-3-R-7}
with(irs, t.test(Time, mu = 160, 
                 alternative = "greater"))
```

Reject null; mean (for all people to complete form) greater than 160.

## But, look at a graph

```{r inference-3-R-8, fig.height=5}
ggplot(irs, aes(x = Time)) + geom_histogram(bins = 6)
```

## Comments

-   Skewed to right.
-   Should look at *median*, not mean.

## The sign test

-   But how to test whether the median is greater than 160?
-   Idea: if the median really is 160 ($H_0$ true), the sampled values
    from the population are equally likely to be above or below 160.
-   If the population median is greater than 160, there will be a lot of
    sample values greater than 160, not so many less. Idea: test
    statistic is number of sample values greater than hypothesized
    median.

## Getting a P-value for sign test 1/3

-   How to decide whether "unusually many" sample values are greater
    than 160? Need a sampling distribution.
-   If $H_0$ true, pop. median is 160, then each sample value
    independently equally likely to be above or below 160.
-   So number of observed values above 160 has binomial distribution
    with $n = 30$ (number of data values) and $p = 0.5$ (160 is
    hypothesized to be *median*).

## Getting P-value for sign test 2/3

-   Count values above/below 160:

```{r inference-3-R-9}
irs %>% count(Time > 160)
```

-   17 above, 13 below. How unusual is that? Need a *binomial table*.

## Getting P-value for sign test 3/3

-   R function `dbinom` gives the probability of eg. exactly 17
    successes in a binomial with $n = 30$ and $p = 0.5$:

```{r inference-3-R-10}
dbinom(17, 30, 0.5)
```

-   but we want probability of 17 *or more*, so get all of those, find
    probability of each, and add them up:

```{r inference-3-R-11}
tibble(x=17:30) %>% 
  mutate(prob=dbinom(x, 30, 0.5)) %>% 
  summarize(total=sum(prob))
```

## ... or

use cumulative distribution

```{r}
pbinom(17, 30, 0.5) # prob of <= 17
```

and hence (note first input):

```{r}
pbinom(16, 30, 0.5, lower.tail = FALSE)
```

This last is $P(X \ge 17) = P(X > 16)$.

## Using my package `smmr`

-   I wrote a package `smmr` to do the sign test (and some other
    things). Installation is non-standard:

    
```{r}
#| eval: false
install.packages("smmr", repos = "nxskok.r-universe.dev")
```
    

-   Then load it:

```{r inference-3-R-13, eval=FALSE}
library(smmr)
```

## `smmr` for sign test

-   `smmr`'s function `sign_test` needs three inputs: a data frame, a
    column and a null median:

```{r inference-3-R-14}
sign_test(irs, Time, 160)
```

## Comments (1/4)

-   Testing whether population median *greater than* 160, so want
    *upper-tail* P-value 0.2923. Same as before.
-   Also get table of values above and below; this too as we got.

## Comments (2/4)

-   P-values are:

| Test | P-value |
|:-----|--------:|
| $t$  |  0.0392 |
| Sign |  0.2923 |

-   These are very different: we reject a mean of 160 (in favour of the
    mean being bigger), but clearly *fail* to reject a median of 160 in
    favour of a bigger one.
    
## Comments (3/4)
    
-   Why is that? Obtain mean and median:

```{r inference-3-R-15}
irs %>% summarize(mean_time = mean(Time), 
                  median_time = median(Time))
```

## Comments (4/4) {.smaller}

-   The mean is pulled a long way up by the right skew, and is a fair
    bit bigger than 160.
-   The median is quite close to 160.
-   We ought to be trusting the sign test and not the t-test here
    (median and not mean), and therefore there is no evidence that the
    "typical" time to complete the form is longer than 160 minutes.
-   Having said that, there are clearly some people who take a lot
    longer than 160 minutes to complete the form, and the IRS could
    focus on simplifying its form for these people.
-   In this example, looking at any kind of average is not really
    helpful; a better question might be "do an unacceptably large
    fraction of people take longer than (say) 300 minutes to complete
    the form?": that is, thinking about worst-case rather than
    average-case.

## CI for median 1/2

-   The sign test does not naturally come with a confidence interval for
    the median.
-   So we use the "duality" between test and confidence interval to say:
    the (95%) confidence interval for the median contains exactly those
    values of the null median that would not be rejected by the
    two-sided sign test (at $\alpha = 0.05$).
    
## CI for median 2/2

- Precisely: Let $C = 100(1 - \alpha)$, so C% gives
corresponding CI to level-$\alpha$ test. Then following always true.
(Symbol $\iff$ means "if and only if".)

| Test decision                         |        | Confidence interval                   |
|:--------------------------------------|:------:|:--------------------------------------|
| Reject $H_0$ at level $\alpha$        | $\iff$ | $C\%$ CI does not contain $H_0$ value |
| Do not reject $H_0$ at level $\alpha$ | $\iff$ | $C\%$ CI contains $H_0$ value         |

- Idea: 
  - "Plausible" parameter value inside CI, not rejected;         
  - "Implausible" parameter value outside CI, rejected.

## The value of this

-   If you have a test procedure but no corresponding CI:
-   you make a CI by including all the parameter values that would not
    be rejected by your test.
-   Use:
    -   $\alpha = 0.01$ for a 99% CI,
    -   $\alpha = 0.05$ for a 95% CI,
    -   $\alpha = 0.10$ for a 90% CI, and so on.

## For our data

-   The procedure is to try some values for the null median and see
    which ones are inside and which outside our CI.
-   smmr has pval_sign that gets just the 2-sided P-value:

```{r inference-3-R-16}
pval_sign(160, irs, Time)
```

-   Try a couple of null medians:

```{r inference-3-R-17}
pval_sign(200, irs, Time)
pval_sign(300, irs, Time)
```

-   So 200 inside the 95% CI and 300 outside.

## Doing a whole bunch

```{r inference-3-R-18}
(d <- tibble(null_median=seq(100,300,20)))
```

## ... and then

"for each null median, run the function `pval_sign` for that null median
and get the P-value":

```{r inference-3-R-19}
#| eval: false
d %>% rowwise() %>% 
  mutate(p_value = pval_sign(null_median, irs, Time))
```

## Results

```{r inference-3-R-19a}
#| echo: false
d %>% rowwise() %>% 
  mutate(p_value = pval_sign(null_median, irs, Time))
```


## Make it easier for ourselves

\small

```{r inference-3-R-20}
d %>% rowwise() %>% 
  mutate(p_value = pval_sign(null_median, irs, Time)) %>% 
  mutate(in_out = ifelse(p_value > 0.05, "inside", "outside"))
```

\normalsize

## Confidence interval for median?

-   95% CI to this accuracy from 120 to 200.
-   Can get it more accurately by looking more closely in intervals from
    100 to 120, and from 200 to 220.

## A more efficient way: bisection

-   Know that top end of CI between 200 and 220:

```{r inference-3-R-21}
lo <- 200 
hi <- 220
```

-   Try the value halfway between: is it inside or outside?

```{r inference-3-R-22}
try <- (lo + hi) / 2
try
pval_sign(try,irs,Time)
```

-   Inside, so upper end is between 210 and 220. Repeat (over):

## ... bisection continued

```{r inference-3-R-23}
lo <- try
try <- (lo + hi) / 2
try
pval_sign(try, irs, Time)
```

-   215 is inside too, so upper end between 215 and 220.
-   Continue until have as accurate a result as you want.

## Bisection automatically

-   A loop, but not a `for` since we don't know how many times we're
    going around. Keep going `while` a condition is true:

```{r inference-3-R-24, eval=F}
lo = 200
hi = 220
while (hi - lo > 1) { # replace 1 by desired accuracy
  try = (hi + lo) / 2
  ptry = pval_sign(try, irs, Time)
  print(c(try, ptry))
  if (ptry <= 0.05)
    hi = try
  else
    lo = try
}
```

## The output from this loop

```{r inference-3-R-25, echo=F}
lo = 200
hi = 220
while (hi - lo > 1) {
  try = (hi + lo) / 2
  ptry = pval_sign(try, irs, Time)
  print(c(try, ptry))
  if (ptry <= 0.05)
    hi = try
  else
    lo = try
}
```

-   215 inside, 215.625 outside. Upper end of interval to this accuracy
    is 215.

## Using smmr

-   `smmr` has function `ci_median` that does this (by default 95% CI):

```{r inference-3-R-26}
ci_median(irs, Time)
```

-   Uses a more accurate bisection than we did.
-   Or get, say, 90% CI for median:

```{r inference-3-R-27}
ci_median(irs, Time, conf.level=0.90)
```

-   90% CI is shorter, as it should be.

## Bootstrap

-   but, was the sample size (30) big enough to overcome the skewness?
-   Bootstrap, again:

```{r inference-3-R-28, echo=FALSE}
set.seed(457299)
```

```{r inference-3-R-29i}
tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(irs$Time, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(sample = my_mean)) + 
    stat_qq() + stat_qq_line() -> g
```

## The sampling distribution

```{r inference-3-R-30}
#| fig-height: 5
g
```

## Comments

-   A little skewed to right, but not nearly as much as I was expecting.
-   The $t$-test for the mean might actually be OK for these data, *if
    the mean is what you want*.
-   In actual data, mean and median very different; we chose to make
    inference about the median.
-   Thus for us it was right to use the sign test.

## Two samples: Mood's median test

-   If normality fails (for one or both of the groups), what do we do
    then?
-   Again, can compare medians: use the thought process of the sign
    test, which does not depend on normality and is not damaged by
    outliers.
-   A suitable test called Mood's median test.
-   Before we get to that, a diversion.

## The chi-squared test for independence

Suppose we want to know whether people are in favour of having daylight
savings time all year round. We ask 20 males and 20 females whether they
each agree with having DST all year round ("yes") or not ("no"). 

## Some of the data:

\small

```{r inference-5-R-1, message=F}
my_url <- "http://ritsokiguess.site/datafiles/dst.txt"
dst <- read_delim(my_url," ")
dst %>% slice_sample(n = 10)
```

\normalsize

## ... continued

Count up individuals in each category combination, and arrange in
contingency table:

```{r inference-5-R-2}
tab <- with(dst, table(gender, agree))
tab
```

-   Most of the males say "yes", but the females are about evenly split.
-   Looks like males more likely to say "yes", ie. an association
    between gender and agreement.
  
## ... continued
    
-   Test an $H_0$ of "no association" ("independence") vs. alternative
    that there is really some association.
-   Done with `chisq.test`.

## ...And finally

```{r inference-5-R-3}
chisq.test(tab, correct=FALSE)
```

-   Reject null hypothesis of no association (P-value 0.008)
-   therefore there is a difference in rates of agreement between (all)
    males and females (or that gender and agreement are associated).
-   Same answers as by hand.
    (Omitting `correct = FALSE` uses "Yates correction".

## Mood's median test

-   Earlier: compare medians of two groups.
-   Sign test: count number of values above and below something (there,
    hypothesized median).
-   Mood's median test:
    -   Find "grand median" of all the data, regardless of group
    -   Count data values in each group above/below grand median.
    -   Make contingency table of group vs. above/below.
    -   Test for association.
    
## Why it works    
    
-   If group medians equal, each group should have about half its
    observations above/below grand median. If not, one group will be
    mostly above grand median and other below.

## Mood's median test for reading data

```{r inference-5-R-4, echo=FALSE, message=FALSE}
my_url <- "http://ritsokiguess.site/datafiles/drp.txt"
kids <- read_delim(my_url," ")
```

-   Find overall median score:

```{r inference-5-R-5}
kids %>% summarize(med=median(score)) %>% pull(med) -> m
m
```

-   Make table of above/below vs. group:

```{r inference-5-R-6}
tab <- with(kids, table(group, score > m))
tab
```

-   Treatment group scores mostly above median, control group scores
    mostly below, as expected.

## The test

-   Do chi-squared test:

```{r inference-5-R-7}
chisq.test(tab, correct=FALSE)
```

-   Two-sided (tests for any association).
-   Here, is reading method *better*? (one-sided).
-   Most of treatment children above overall median, so halve P-value to get 0.017.
-   Again, children learn to read better using new
    method.

## Or by smmr

-   `median_test` does the whole thing:

\small

```{r inference-5-R-8}
median_test(kids,score,group)
```

-   P-value again two-sided.

\normalsize

## Comments 1/2

-   P-value 0.013 for (1-sided) t-test, 0.017 for (1-sided) Mood median
    test.
-   Like the sign test, Mood's median test doesn't use the data very
    efficiently (only, is each value above or below grand median).
-   Thus, if we can justify doing *t*-test, we should do it. This is the
    case here.

## Comments 2/2

-   The *t*-test will usually give smaller P-value because it uses the
    data more efficiently.
-   The time to use Mood's median test is if we are definitely unhappy
    with the normality assumption (and thus the t-test P-value is not to
    be trusted).
-   There is no obvious way to get a confidence interval for the difference between the two medians.


## Matched pairs: the pain relief data

Values aligned in columns:

```{r inference-4b-R-1}
my_url <- 
  "http://ritsokiguess.site/datafiles/analgesic.txt"
pain <- read_table(my_url)
pain %>% mutate(diff = druga - drugb) -> pain
glimpse(pain)
```


## Assessing normality

-   Matched pairs analyses assume (theoretically) that differences
    normally distributed.
-   How to assess normality? A normal quantile plot.

## The normal quantile plot (of differences)

```{r inference-4b-R-6, fig.height=4}
ggplot(pain, aes(sample = diff)) + stat_qq() + stat_qq_line()
```

-   Points should follow the straight line. Bottom left one way off, so
    normality questionable here: outlier.

## What to do instead?

-   Matched pairs $t$-test based on one sample of differences
-   the differences not normal (enough)
-   so do *sign test* on differences, null median 0:

## ... continued

```{r inference-4b-R-7}
sign_test(pain, diff, 0)
```

- No evidence of any difference between the drugs (that the median difference is not zero).

## Did we really need to worry about that outlier?

Bootstrap sampling distribution of sample mean differences:

```{r}
tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(pain$diff, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line() -> g
```

## The normal quantile plot

```{r}
#| fig-height: 5
g
```


Yes we did need to worry; this is clearly skewed left and not normal.