---
title: "Normal quantile plots"
---

## The normal quantile plot

- we have seen that having normally distributed data (or data that is normal enough) is important
- the only tools we have so far to assess this are histograms and maybe boxplots
- a better tool is the **normal quantile plot**:
  - plot the data against what you would expect if the data were actually normal
  - look for the points to follow a straight line, at least approximately
- `ggplot` code: `sample` inside `aes`, then `stat_qq` (the points) and `stat_qq_line` (the line)

## Packages

The usual:

```{r}
library(tidyverse)
```

## Kids learning to read

```{r inference-4a-R-1, echo=FALSE, message=FALSE}
my_url <- "http://ritsokiguess.site/datafiles/drp.txt"
kids <- read_delim(my_url, " ")
glimpse(kids)
```

```{r inference-4a-R-2}
ggplot(kids, aes(x = group, y = score)) + geom_boxplot()
```

Each group looks more or less normal, or at least close to symmetric.

## Get the groups separately

```{r inference-4a-R-3}
kids %>% filter(group == "t") -> treatment
kids %>% filter(group == "c") -> control
```

To check:

```{r inference-4a-R-4}
treatment %>% count(group)
control %>% count(group)
```

## The treatment group

```{r inference-4a-R-5, fig.height=4.5}
ggplot(treatment, aes(sample = score)) +
  stat_qq() + stat_qq_line()
```

The only problem here is that the lowest value is a little too low (a mild outlier).

## Control group

```{r inference-4a-R-6, fig.height=4}
ggplot(control, aes(sample = score)) +
  stat_qq() + stat_qq_line()
```

This time, the highest value is a little too high, but again, there is no real problem with normality.
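
## What the plot is plotting

To make "plot the data against what you would expect if the data were actually normal" concrete, here is a minimal sketch in base R. It assumes simulated values (`scores` is hypothetical, not the kids data) and uses `ppoints` and `qnorm` to get the standard normal quantiles expected at each rank; as far as I know this matches the defaults of `stat_qq`, but treat it as an illustration rather than the exact internals.

```r
# Hypothetical illustration data: 20 simulated scores, not the kids data.
set.seed(1)
scores <- rnorm(20, mean = 50, sd = 10)
n <- length(scores)

# Vertical axis: the observed values, smallest to largest.
observed <- sort(scores)

# Horizontal axis: standard normal quantiles expected at the same ranks.
expected <- qnorm(ppoints(n))

# One row per point on the plot.
qq <- data.frame(expected, observed)
head(qq, 3)

# Base R's qqnorm() works out the same horizontal coordinates.
check <- qqnorm(scores, plot.it = FALSE)
all.equal(sort(check$x), expected)
```

If the data are normal, `observed` is roughly a straight-line function of `expected`, which is why a straight pattern of points is the thing to look for.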
## Facetting more than one sample

Use the whole data set and facet by group:

```{r inference-4a-R-7, fig.height=4.5}
ggplot(kids, aes(sample = score)) +
  stat_qq() + stat_qq_line() +
  facet_wrap(~group)
```

## Blue Jays attendances, skewed to right

```{r inference-4a-R-8, echo=FALSE, message=FALSE}
jays <- read_csv("jays15-home.csv")
```

```{r inference-4a-R-9}
ggplot(jays, aes(x = attendance)) + geom_histogram(bins = 6)
```

## On a normal quantile plot

```{r inference-4a-R-10, fig.height=3.5}
ggplot(jays, aes(sample = attendance)) +
  stat_qq() + stat_qq_line()
```

- Attendances at the low end are too bunched up: skewed to the right.
- Right-skewness can also show up as the highest values being too high, or as a curved pattern in the points.

## More normal quantile plots

- How straight does a normal quantile plot have to be?
- There is randomness in real data, so even a normal quantile plot from normal data won't look perfectly straight.
- With a small sample, the plot can look not very straight even when the data are normal.
- We are looking for a systematic departure from a straight line; random wiggles ought not to concern us.
- Look at some examples where we know the answer, so that we can see what to expect.

## Normal data, large sample

```{r set-seed, echo=FALSE}
set.seed(457299)
```

```{r inference-4a-R-11, fig.height=4.5}
d <- tibble(x = rnorm(200))
ggplot(d, aes(x = x)) + geom_histogram(bins = 10)
```

## The normal quantile plot

```{r inference-4a-R-12, fig.height=4.5}
ggplot(d, aes(sample = x)) + stat_qq() + stat_qq_line()
```

## Normal data, small sample

```{r inference-4a-R-13, echo=FALSE}
set.seed(457299)
```

- Not so convincingly normal, but not obviously skewed:

```{r normal-small, fig.height=4.5}
d <- tibble(x = rnorm(20))
ggplot(d, aes(x = x)) + geom_histogram(bins = 5)
```

## The normal quantile plot

Good, apart from the highest and lowest points being slightly off.
I'd call this good:

```{r inference-4a-R-14, fig.height=4.5}
ggplot(d, aes(sample = x)) + stat_qq() + stat_qq_line()
```

## Chi-squared data, df = 10

Somewhat skewed to right:

```{r inference-4a-R-15, fig.height=4.5}
d <- tibble(x = rchisq(100, 10))
ggplot(d, aes(x = x)) + geom_histogram(bins = 10)
```

## The normal quantile plot

Somewhat opening-up curve:

```{r inference-4a-R-16, fig.height=4.5}
ggplot(d, aes(sample = x)) + stat_qq() + stat_qq_line()
```

## Chi-squared data, df = 3

Definitely skewed to right:

```{r chisq-small-df, fig.height=4.5}
d <- tibble(x = rchisq(100, 3))
ggplot(d, aes(x = x)) + geom_histogram(bins = 10)
```

## The normal quantile plot

Clear upward-opening curve:

```{r inference-4a-R-17, fig.height=4.5}
ggplot(d, aes(sample = x)) + stat_qq() + stat_qq_line()
```

## t-distributed data, df = 3

Long tails (or a very sharp peak):

```{r t-small, fig.height=4.5}
d <- tibble(x = rt(300, 3))
ggplot(d, aes(x = x)) + geom_histogram(bins = 15)
```

## The normal quantile plot

Low values too low and high values too high for normal.

```{r inference-4a-R-18, fig.height=4.5}
ggplot(d, aes(sample = x)) + stat_qq() + stat_qq_line()
```

## Summary

On a normal quantile plot:

- points following the line (with some small wiggles): normal.
- the kind of deviation from a straight line indicates the kind of non-normality:
  - one or a few of the highest points too high and/or lowest points too low: outliers
  - otherwise, see how the points at each end are off the line:

|                | High points |              |
|----------------|-------------|--------------|
| **Low points** | **Too low** | **Too high** |
| **Too low**    | Skewed left | Long tails   |
| **Too high**   | Short tails | Skewed right |

- a short-tailed distribution is OK for $t$ (mean still good), but the others are problematic (depending on sample size).
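
## Checking the table on simulated data

The table can be checked against simulated data where we know the shape. This is a rough sketch in base R: `end_residuals` is a hypothetical helper (not part of `ggplot2`) that measures how far the lowest and highest points sit from the line through the quartiles, which I believe is the line `stat_qq_line` draws by default. The sample size of 1000 is an assumption chosen to keep the signs stable; one point at each end is a crude stand-in for actually looking at the plot.

```r
# Hypothetical helper: distance of the lowest and highest points from the
# line through the first and third quartiles.
# Positive = point above the line ("too high"), negative = "too low".
end_residuals <- function(x) {
  n <- length(x)
  observed <- sort(x)
  expected <- qnorm(ppoints(n))
  probs <- c(0.25, 0.75)
  y_q <- unname(quantile(x, probs)) # sample quartiles
  x_q <- qnorm(probs)               # standard normal quartiles
  slope <- diff(y_q) / diff(x_q)
  intercept <- y_q[1] - slope * x_q[1]
  resid <- observed - (intercept + slope * expected)
  c(low = resid[1], high = resid[n])
}

set.seed(2)
right_skewed <- rchisq(1000, df = 3)
long_tailed <- rt(1000, df = 3)

end_residuals(right_skewed) # expect both positive: both ends too high
end_residuals(long_tailed)  # expect low negative, high positive
```

Both ends too high corresponds to the "Skewed right" cell of the table, and low end too low with high end too high corresponds to "Long tails".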