---
title: "The sign test"
editor: 
  markdown: 
    wrap: 72
---

## Packages

```{r}
library(tidyverse)
library(smmr)
```

`smmr` is new. See later how to install it.

## Duality between confidence intervals and hypothesis tests

- Tests and CIs really do the same thing, if you look at them the right way. They are both telling you something about a parameter, and they use the same information from the data.
- To illustrate, some data (two groups):

```{r inference-3-R-1}
my_url <- "http://ritsokiguess.site/datafiles/duality.txt"
twogroups <- read_delim(my_url, " ")
```

## The data

```{r inference-3-R-2}
twogroups
```

## 95% CI (default)

For the difference in means, group 1 minus group 2:

```{r inference-3-R-3}
t.test(y ~ group, data = twogroups)
```

## 90% CI

```{r inference-3-R-4}
t.test(y ~ group, data = twogroups, conf.level = 0.90)
```

## Hypothesis test

Null is that the difference in means is zero:

```{r inference-3-R-5}
t.test(y ~ group, mu = 0, data = twogroups)
```

## Comparing results

Recall that the null here is $H_0: \mu_1 - \mu_2 = 0$. P-value 0.0668.

- 95% CI from $-5.6$ to $0.2$, contains $0$.
- 90% CI from $-5.0$ to $-0.3$, does not contain $0$.
- At $\alpha = 0.05$, would not reject $H_0$ since P-value $> 0.05$.
- At $\alpha = 0.10$, *would* reject $H_0$ since P-value $< 0.10$.

## Test and CI

Not just a coincidence. Let $C = 100(1 - \alpha)$, so that the $C\%$ CI corresponds to the level-$\alpha$ test. Then the following is always true. (The symbol $\iff$ means "if and only if".)

| Test decision | | Confidence interval |
|:--------------------------|:---------------:|:--------------------------|
| Reject $H_0$ at level $\alpha$ | $\iff$ | $C\%$ CI does not contain $H_0$ value |
| Do not reject $H_0$ at level $\alpha$ | $\iff$ | $C\%$ CI contains $H_0$ value |

Idea: a "plausible" parameter value is inside the CI and is not rejected; an "implausible" parameter value is outside the CI and is rejected.
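## Seeing the duality numerically

To see this duality in action, here is a small check on some *made-up* two-group data (simulated only for illustration): at each of several levels, the two-sided $t$-test rejects a difference of zero exactly when the corresponding CI misses zero.

```{r}
# Made-up two-group data, just to illustrate the duality
set.seed(1)
g1 <- rnorm(20, mean = 10, sd = 3)
g2 <- rnorm(20, mean = 12, sd = 3)
for (alpha in c(0.01, 0.05, 0.10)) {
  tt <- t.test(g1, g2, conf.level = 1 - alpha)
  reject <- tt$p.value < alpha
  ci_misses_zero <- tt$conf.int[1] > 0 | tt$conf.int[2] < 0
  print(c(alpha, reject, ci_misses_zero)) # last two always agree
}
```

The agreement is guaranteed because the P-value and the CI both come from the same $t$ statistic and the same $t$ distribution.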
## The value of this

- If you have a test procedure but no corresponding CI:
  - you make a CI by including all the parameter values that would not be rejected by your test.
- Use:
  - $\alpha = 0.01$ for a 99% CI,
  - $\alpha = 0.05$ for a 95% CI,
  - $\alpha = 0.10$ for a 90% CI, and so on.

## Testing for non-normal data

- The IRS ("Internal Revenue Service") is the US authority that deals with taxes (like Revenue Canada).
- One of their forms is supposed to take no more than 160 minutes to complete. A citizens' organization claims that it takes people longer than that on average.
- Sample of 30 people; time to complete the form recorded.
- Read in the data, and do a $t$-test of $H_0: \mu = 160$ vs. $H_a: \mu > 160$.
- For reading in, there is only one column, so we can pretend it is delimited by anything.

## Read in data

```{r inference-3-R-6, message=FALSE}
my_url <- "http://ritsokiguess.site/datafiles/irs.txt"
irs <- read_csv(my_url)
irs
```

## Test whether mean is 160 or greater

```{r inference-3-R-7}
with(irs, t.test(Time, mu = 160, alternative = "greater"))
```

Reject the null; the mean time (for all people to complete the form) is greater than 160.

## But, look at a graph

```{r inference-3-R-8, fig.height=3.5}
ggplot(irs, aes(x = Time)) + geom_histogram(bins = 6)
```

## Comments

- Skewed to the right.
- Should look at the *median*, not the mean.

## The sign test

- But how to test whether the median is greater than 160?
- Idea: if the median really is 160 ($H_0$ true), the sampled values from the population are equally likely to be above or below 160.
- If the population median is greater than 160, there will be a lot of sample values greater than 160, and not so many less. Idea: the test statistic is the number of sample values greater than the hypothesized median.

## Getting a P-value for sign test 1/3

- How to decide whether "unusually many" sample values are greater than 160? Need a sampling distribution.
- If $H_0$ is true and the population median is 160, then each sample value is independently equally likely to be above or below 160.
- So the number of observed values above 160 has a binomial distribution with $n = 30$ (the number of data values) and $p = 0.5$ (160 is hypothesized to be the *median*).

## Getting P-value for sign test 2/3

- Count values above/below 160:

```{r inference-3-R-9}
irs %>% count(Time > 160)
```

- 17 above, 13 below. How unusual is that? Need a *binomial table*.

## Getting P-value for sign test 3/3

- R function `dbinom` gives the probability of, e.g., exactly 17 successes in a binomial with $n = 30$ and $p = 0.5$:

```{r inference-3-R-10}
dbinom(17, 30, 0.5)
```

- but we want the probability of 17 *or more*, so get all of those, find the probability of each, and add them up:

```{r inference-3-R-11}
tibble(x = 17:30) %>% 
  mutate(prob = dbinom(x, 30, 0.5)) %>% 
  summarize(total = sum(prob))
```

or use `pbinom`, which gives "less than or equal to" probabilities:

```{r}
pbinom(17, 30, 0.5) # prob of <= 17, not what we want
```

so instead take the upper tail, noting the first input (16, not 17):

```{r}
pbinom(16, 30, 0.5, lower.tail = FALSE)
```

This last is $P(X \ge 17) = P(X > 16)$.

## Using my package `smmr`

- I wrote a package `smmr` to do the sign test (and some other things). Installation is a bit fiddly:
- Install `devtools` (once) with:

```{r}
#| eval = FALSE
install.packages("devtools")
```

- then install `smmr` using `devtools` (once):

```{r inference-3-R-12}
#| eval = FALSE
library(devtools)
install_github("nxskok/smmr")
```

- Then load it:

```{r inference-3-R-13, eval=FALSE}
library(smmr)
```

## `smmr` for sign test

- `smmr`'s function `sign_test` needs three inputs: a data frame, a column, and a null median:

```{r inference-3-R-14}
sign_test(irs, Time, 160)
```

## Comments (1/3)

- Testing whether the population median is *greater than* 160, so we want the *upper-tail* P-value, 0.2923. Same as before.
- We also get a table of the number of values above and below 160; this too matches what we got before.

## Comments (2/3)

- P-values are:

| Test | P-value |
|:-----|--------:|
| $t$  | 0.0392  |
| Sign | 0.2923  |

- These are very different: we reject a mean of 160 (in favour of the mean being bigger), but clearly *fail* to reject a median of 160 in favour of a bigger one.
- Why is that? Obtain the mean and median:

```{r inference-3-R-15}
irs %>% summarize(mean_time = mean(Time), 
                  median_time = median(Time))
```

## Comments (3/3) {.smaller}

- The mean is pulled a long way up by the right skew, and is a fair bit bigger than 160.
- The median is quite close to 160.
- We ought to be trusting the sign test and not the $t$-test here (median and not mean), and therefore there is no evidence that the "typical" time to complete the form is longer than 160 minutes.
- Having said that, there are clearly some people who take a lot longer than 160 minutes to complete the form, and the IRS could focus on simplifying its form for these people.
- In this example, looking at any kind of average is not really helpful; a better question might be "do an unacceptably large fraction of people take longer than (say) 300 minutes to complete the form?": that is, thinking about the worst case rather than the average case.

## Confidence interval for the median

- The sign test does not naturally come with a confidence interval for the median.
- So we use the "duality" between test and confidence interval to say: the (95%) confidence interval for the median contains exactly those values of the null median that would not be rejected by the two-sided sign test (at $\alpha = 0.05$).

## For our data

- The procedure is to try some values for the null median and see which ones are inside and which outside our CI.
- `smmr` has `pval_sign` that gets just the two-sided P-value:

```{r inference-3-R-16}
pval_sign(160, irs, Time)
```

- Try a couple of null medians:

```{r inference-3-R-17}
pval_sign(200, irs, Time)
pval_sign(300, irs, Time)
```

- So 200 is inside the 95% CI and 300 is outside.

## Doing a whole bunch

- Choose our null medians first:

```{r inference-3-R-18}
(d <- tibble(null_median = seq(100, 300, 20)))
```

## ...
and then "for each null median, run the function `pval_sign` for that null median and get the P-value":

```{r inference-3-R-19}
d %>% 
  rowwise() %>% 
  mutate(p_value = pval_sign(null_median, irs, Time))
```

## Make it easier for ourselves

```{r inference-3-R-20}
d %>% 
  rowwise() %>% 
  mutate(p_value = pval_sign(null_median, irs, Time)) %>% 
  mutate(in_out = ifelse(p_value > 0.05, "inside", "outside"))
```

## Confidence interval for median?

- The 95% CI, to this accuracy, goes from 120 to 200.
- Can get it more accurately by looking more closely in the intervals from 100 to 120, and from 200 to 220.

## A more efficient way: bisection

- Know that the top end of the CI is between 200 and 220:

```{r inference-3-R-21}
lo <- 200
hi <- 220
```

- Try the value halfway between: is it inside or outside?

```{r inference-3-R-22}
try <- (lo + hi) / 2
try
pval_sign(try, irs, Time)
```

- Inside, so the upper end is between 210 and 220. Repeat (over):

## ... bisection continued

```{r inference-3-R-23}
lo <- try
try <- (lo + hi) / 2
try
pval_sign(try, irs, Time)
```

- 215 is inside too, so the upper end is between 215 and 220.
- Continue until you have as accurate a result as you want.

## Bisection automatically

- A loop, but not a `for`, since we don't know how many times we're going around. Keep going `while` a condition is true:

```{r inference-3-R-24, eval=F}
lo <- 200
hi <- 220
while (hi - lo > 1) {
  try <- (hi + lo) / 2
  ptry <- pval_sign(try, irs, Time)
  print(c(try, ptry))
  if (ptry <= 0.05) hi <- try else lo <- try
}
```

## The output from this loop

```{r inference-3-R-25, echo=F}
lo <- 200
hi <- 220
while (hi - lo > 1) {
  try <- (hi + lo) / 2
  ptry <- pval_sign(try, irs, Time)
  print(c(try, ptry))
  if (ptry <= 0.05) hi <- try else lo <- try
}
```

- 215 is inside, 215.625 outside. The upper end of the interval, to this accuracy, is 215.

## Using `smmr`

- `smmr` has a function `ci_median` that does this (by default a 95% CI):

```{r inference-3-R-26}
ci_median(irs, Time)
```

- Uses a more accurate bisection than we did.
- Or get, say, a 90% CI for the median:

```{r inference-3-R-27}
ci_median(irs, Time, conf.level = 0.90)
```

- The 90% CI is shorter, as it should be.

## Bootstrap

- But, was the sample size (30) big enough to overcome the skewness?
- Bootstrap, again:

```{r inference-3-R-28, echo=FALSE}
set.seed(457299)
```

```{r inference-3-R-29}
tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(irs$Time, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(x = my_mean)) + geom_histogram(bins = 10) -> g
```

## The sampling distribution

```{r inference-3-R-30}
g
```

## Comments

- A little skewed to the right, but not nearly as much as I was expecting.
- The $t$-test for the mean might actually be OK for these data, *if the mean is what you want*.
- In the actual data, the mean and median are very different; we chose to make inference about the median.
- Thus for us it was right to use the sign test.
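## Aside: the sign test in base R

The two-sided sign-test P-value that `pval_sign` returns can also be reproduced with base R's `binom.test`. Here is a minimal sketch; the function name `pval_sign_base` is mine, and the tie-handling (dropping values equal to the null median) is my assumption about how ties should be treated:

```{r}
# Minimal base-R sketch of the two-sided sign-test P-value
# (assumption: values equal to the null median are dropped)
pval_sign_base <- function(null_median, x) {
  x <- x[x != null_median]      # drop ties with the null median
  above <- sum(x > null_median) # test statistic: number of values above
  binom.test(above, length(x), p = 0.5)$p.value
}
# With a 17-above, 13-below split around 160, as in the irs data,
# this is twice the one-tailed 0.2923 we calculated earlier:
pval_sign_base(160, c(rep(150, 13), rep(170, 17))) # about 0.585
```

Because $p = 0.5$ makes the binomial distribution symmetric, `binom.test`'s two-sided P-value is exactly double the smaller tail probability here.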