--- title: "Regression revisited" editor: markdown: wrap: 72 --- ## Regression - Use regression when one variable is an outcome (*response*, $y$). - See if/how response depends on other variable(s), *explanatory*, $x_1, x_2,\ldots$. - Can have *one* or *more than one* explanatory variable, but always one response. - Assumes a *straight-line* relationship between response and explanatory. - Ask: - *is there* a relationship between $y$ and $x$'s, and if so, which ones? - what does the relationship look like? ## Packages ```{r bRegression-1, results='hide'} library(MASS, exclude = "select") # for Box-Cox, later library(tidyverse) library(broom) library(marginaleffects) library(conflicted) conflict_prefer("select", "dplyr") ``` ## A regression with one $x$ 13 children, measure average total sleep time (ATST, mins) and age (years) for each. See if ATST depends on age. Data in `sleep.txt`, ATST then age. Read in data: ```{r bRegression-2 } my_url <- "http://ritsokiguess.site/datafiles/sleep.txt" sleep <- read_delim(my_url, " ") ``` ## Check data ```{r bRegression-3, size="footnotesize"} summary(sleep) ``` ```{r} sleep ``` Make scatter plot of ATST (response) vs. age (explanatory) using code overleaf: ## The scatterplot ```{r suggo, fig.height=6} ggplot(sleep, aes(x = age, y = atst)) + geom_point() ``` ## Correlation - Measures how well a straight line fits the data: ```{r bRegression-4 } with(sleep, cor(atst, age)) ``` - $1$ is perfect upward trend, $-1$ is perfect downward trend, 0 is no trend. - This one close to perfect downward trend. - Can do correlations of all pairs of variables: ```{r bRegression-5 } cor(sleep) ``` ## Lowess curve - Sometimes nice to guide the eye: is the trend straight, or not? - Idea: *lowess curve*. "Locally weighted least squares", not affected by outliers, not constrained to be linear. - Lowess is a *guide*: even if straight line appropriate, may wiggle/bend a little. Looking for *serious* problems with linearity. - Add lowess curve to plot using `geom_smooth`: ## Plot with lowess curve ```{r icko,fig.height=6} ggplot(sleep, aes(x = age, y = atst)) + geom_point() + geom_smooth() ``` ## The regression Scatterplot shows no obvious curve, and a pretty clear downward trend. So we can run the regression: ```{r bRegression-6, echo=-1} options(width = 60) sleep.1 <- lm(atst ~ age, data = sleep) ``` ## The output ```{r bRegression-7} summary(sleep.1) ``` ## Conclusions - The relationship appears to be a straight line, with a downward trend. - $F$-tests for model as a whole and $t$-test for slope (same) both confirm this (P-value $5.7\times 10^{-7}=0.00000057$). - Slope is $-14$, so a 1-year increase in age goes with a 14-minute decrease in ATST on average. - R-squared is correlation squared (when one $x$ anyway), between 0 and 1 (1 good, 0 bad). - Here R-squared is 0.9054, pleasantly high. ## Doing things with the regression output - Output from regression (and eg. $t$-test) is all right to look at, but hard to extract and re-use information from. - Package `broom` extracts info from model output in way that can be used in pipe (later): ```{r bRegression-8, size="footnotesize"} tidy(sleep.1) ``` ## also one-line summary of model: ```{r bRegression-9} glance(sleep.1) ``` ## Broom part 2 ```{r bRegression-10,warning=F} sleep.1 %>% augment(sleep) ``` Useful for plotting residuals against an $x$-variable. 
## CI for mean response and prediction intervals

Once a useful regression exists, use it for prediction:

- To get a single number for the prediction at a given $x$, substitute into the regression equation, e.g. age 10: predicted ATST is $646.48-14.04(10)=506$ minutes.
- To express the uncertainty of this prediction:
    - *CI for mean response* expresses uncertainty about the mean ATST for all children aged 10, based on data.
    - *Prediction interval* expresses uncertainty about the predicted ATST for a new child aged 10 whose ATST is not known. More uncertain.
- Also do the above for a child aged 5.

## The `marginaleffects` package 1/2

To get predictions for specific values, set up a dataframe with those values first:

```{r}
new <- datagrid(model = sleep.1, age = c(10, 5))
new
```

Any variables in the dataframe that you don't specify are set to their mean values (quantitative) or most common category (categorical).

## The `marginaleffects` package 2/2

Then feed into `newdata` in `predictions`. The output contains a lot of columns, so you probably want to display only the ones you care about:

```{r}
cbind(predictions(sleep.1, newdata = new)) %>% 
  select(estimate, conf.low, conf.high, age)
```

The confidence limits are a 95% confidence interval for the mean response at that `age`.

## Prediction intervals

These are obtained (instead) with `predict` as below. Use the same dataframe `new` as before:

```{r bRegression-12}
pp <- predict(sleep.1, new, interval = "p")
pp
cbind(new, pp)
```

## Plotting the confidence intervals for mean response again

```{r}
plot_predictions(sleep.1, condition = "age")
```

## Comments

- Age 10 is closer to the centre of the data, so both intervals are narrower than those for age 5.
- Prediction intervals are bigger than the CI for the mean (additional uncertainty).
- Technical note: output from `predict` is an R `matrix`, not a data frame, so tidyverse `bind_cols` does not work. Use base R `cbind`.

## That grey envelope

Marks the confidence interval for the mean for all $x$:

```{r bRegression-15, fig.height=6.0}
ggplot(sleep, aes(x = age, y = atst)) + geom_point() +
  geom_smooth(method = "lm") +
  scale_y_continuous(breaks = seq(420, 600, 20))
```

## Diagnostics

How to tell whether a straight-line regression is appropriate?

- Before: check the scatterplot for a straight trend.
- After: plot *residuals* (observed minus predicted response) against predicted values. Aim: a plot with no pattern.

## Residual plot

Not much pattern here --- regression appropriate.

```{r akjhkadjfhjahnkkk, fig.height=3.4}
ggplot(sleep.1, aes(x = .fitted, y = .resid)) + geom_point()
```

## An inappropriate regression

Different data:

```{r curvy}
my_url <- "http://ritsokiguess.site/datafiles/curvy.txt"
curvy <- read_delim(my_url, " ")
```

## Scatterplot

```{r bRegression-16, fig.height=3.8}
ggplot(curvy, aes(x = xx, y = yy)) + geom_point()
```

## Regression line, anyway

\scriptsize

```{r bRegression-17}
curvy.1 <- lm(yy ~ xx, data = curvy)
summary(curvy.1)
```

## Residual plot

```{r altoadige, fig.height=3.8}
ggplot(curvy.1, aes(x = .fitted, y = .resid)) + geom_point()
```

## No good: fixing it up

- Residual plot has a *curve*: middle residuals positive, high and low ones negative. Bad.
- Fitting a curve would be better. Try this:

```{r bRegression-18}
curvy.2 <- lm(yy ~ xx + I(xx^2), data = curvy)
```

- Adding an `xx`-squared term, to allow for a curve.
- Another way to do the same thing: specify how the model *changes*:

```{r bRegression-19}
curvy.2a <- update(curvy.1, . ~ . + I(xx^2))
```
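A quick check (a sketch) that the two formulations describe the same fitted model: the coefficients should agree.

```{r}
# sketch: curvy.2 (refit from scratch) and curvy.2a (built with update) should match
coef(curvy.2)
coef(curvy.2a)
```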
## Regression 2

```{r bRegression-20}
tidy(curvy.2)
glance(curvy.2)
```

## Comments

- `xx`-squared term definitely significant (P-value 0.000182), so we need this curve to describe the relationship.
- Adding the squared term has made R-squared go up from 0.5848 to 0.9502: a great improvement.
- This is a definite curve!

## The residual plot now

No problems any more:

```{r bRegression-21, size="small", fig.height=3.4}
ggplot(curvy.2, aes(x = .fitted, y = .resid)) + geom_point()
```

## Another way to handle curves

- Above, we saw that changing $x$ (adding $x^2$) was a way of handling curved relationships.
- Another way: change $y$ (transformation).
- Can guess how to change $y$, or there might be theory:
    - example: relationship $y=ae^{bx}$ (exponential growth):
    - take logs to get $\ln y=\ln a + bx$.
    - Taking logs has made the relationship linear ($\ln y$ as response).
- Or, *estimate* the transformation, using the Box-Cox method.

## Box-Cox

- Install package `MASS` via `install.packages("MASS")` (only need to do this *once*).
- In every R session where you want to use something in `MASS`, type `library(MASS)`.

## Some made-up data

```{r bRegression-22, message=F}
my_url <- "http://ritsokiguess.site/datafiles/madeup2.csv"
madeup <- read_csv(my_url)
madeup
```

Seems to be faster-than-linear growth, maybe exponential growth.

## Scatterplot: faster than linear growth

```{r dsljhsdjlhf, fig.height=3.2}
ggplot(madeup, aes(x = x, y = y)) + geom_point() + geom_smooth()
```

## Running Box-Cox

- `library(MASS)` first.
- Feed `boxcox` a model formula with a squiggle in it, such as you would use for `lm`.
- Output: a graph (next page):

```{r bRegression-23, eval=F}
boxcox(y ~ x, data = madeup)
```

## The Box-Cox output

```{r trento, echo=F, fig.height=4}
boxcox(y ~ x, data = madeup)
```

## Comments

- $\lambda$ (lambda) is the power by which you should transform $y$ to get the relationship straight (straighter). Power 0 is "take logs".
- The middle dotted line marks the best single value of $\lambda$ (here about 0.1).
- The outer dotted lines mark a 95% CI for $\lambda$, here $-0.3$ to 0.7, approximately. (Rather uncertain about the best transformation.)
- Any power transformation within the CI is supported by the data. In this case, log ($\lambda=0$) and square root ($\lambda=0.5$) are good, but no transformation ($\lambda=1$) is not.
- Pick a "round-number" value of $\lambda$ like $2, 1, 0.5, 0, -0.5, -1$. Here 0 and 0.5 are good values to pick.

## Did transformation straighten things?

- Plot the transformed $y$ against $x$. Here, log:

```{r bRegression-24, message=FALSE, fig.height=3.2}
ggplot(madeup, aes(x = x, y = log(y))) + geom_point() + geom_smooth()
```

Looks much straighter.

## Regression with transformed $y$

```{r bRegression-25}
madeup.1 <- lm(log(y) ~ x, data = madeup)
glance(madeup.1)
tidy(madeup.1)
```

R-squared now decently high.

## Multiple regression

- What if more than one $x$? Extra issues:
    - Now one intercept and a slope for each $x$: how to interpret?
    - Which $x$-variables actually help to predict $y$?
    - Different interpretations of the "global" $F$-test and the individual $t$-tests.
    - R-squared no longer correlation squared, but still interpreted as "higher is better".
- In the `lm` line, add the extra $x$s after `~` (see the sketch below).
- Interpretation not so easy (and other problems can occur).
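A tiny sketch of the formula syntax only, with made-up numbers (not the course data):

```{r}
# sketch with invented data, just to show two x's joined by + after the squiggle
d <- tibble(
  x1 = c(1, 2, 3, 4, 5),
  x2 = c(2, 1, 4, 3, 5),
  y  = c(3, 4, 8, 9, 12)
)
d.1 <- lm(y ~ x1 + x2, data = d)
tidy(d.1)
```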
## Multiple regression example

Study of women and visits to health professionals, and how the number of visits might be related to other variables:

```{=tex}
\begin{description}
\item[timedrs:] number of visits to health professionals (over course of study)
\item[phyheal:] number of physical health problems
\item[menheal:] number of mental health problems
\item[stress:] result of questionnaire about number and type of life changes
\end{description}
```

`timedrs` response, others explanatory.

## The data

```{r bRegression-26}
my_url <- "http://ritsokiguess.site/datafiles/regressx.txt"
visits <- read_delim(my_url, " ")
```

## Check data

```{r bRegression-27}
visits
```

## Fit multiple regression

```{r bRegression-28}
visits.1 <- lm(timedrs ~ phyheal + menheal + stress, data = visits)
summary(visits.1)
```

## The slopes

- Model as a whole strongly significant even though R-sq not very big (lots of data). At least one of the $x$'s predicts `timedrs`.
- The physical health and stress variables definitely help to predict the number of visits, but *with those in the model* we don't need `menheal`. However, look at prediction of `timedrs` from `menheal` by itself:

## Just `menheal`

```{r bRegression-30}
visits.2 <- lm(timedrs ~ menheal, data = visits)
summary(visits.2)
```

## `menheal` by itself

- `menheal` by itself *does* significantly help to predict `timedrs`.
- But the R-sq is much less (6.5% vs. 22%).
- So the other two variables do a better job of prediction.
- With those variables in the regression (`phyheal` and `stress`), we don't need `menheal` *as well*.

## Investigating via correlation

Leave out the first column (`subjno`):

```{r bRegression-31}
visits %>% 
  select(-subjno) %>% 
  cor()
```

- `phyheal` most strongly correlated with `timedrs`.
- Not much to choose between the other two.
- But `menheal` has a higher correlation with `phyheal`, so it has not as much to *add* to the prediction as `stress`.
- Goes to show that things are more complicated in multiple regression.

## Residual plot (from `timedrs` on all)

```{r iffy8, fig.height=3.5}
ggplot(visits.1, aes(x = .fitted, y = .resid)) + geom_point()
```

Apparently random. But...

## Normal quantile plot of residuals

```{r bRegression-32, fig.height=3.5}
ggplot(visits.1, aes(sample = .resid)) + stat_qq() + stat_qq_line()
```

Not normal at all; the upper tail is way too long.

## Absolute residuals

Is there a trend in the *size* of the residuals (fan-out)? Plot the *absolute value* of the residuals against the fitted values:

```{r bRegression-33, fig.height=2.8, message=FALSE}
ggplot(visits.1, aes(x = .fitted, y = abs(.resid))) + geom_point() + geom_smooth()
```

## Comments

- On the normal quantile plot:
    - highest (most positive) residuals are *way* too high
    - distribution of residuals skewed to the right (not normal at all)
- On the plot of absolute residuals:
    - size of residuals getting bigger as fitted values increase
    - predictions getting more variable as fitted values increase
    - that is, predictions getting *less accurate* as fitted values increase, but predictions should be equally accurate all the way along.
- Both indicate problems with the regression, of the kind that a transformation of the response often fixes: that is, predict a *function* of the response `timedrs` instead of `timedrs` itself.

## Box-Cox transformations

- Taking the log of `timedrs` and having it work: lucky guess. How to find a good transformation?
- Box-Cox again.
- Extra problem: some of the `timedrs` values are 0, but Box-Cox expects all values to be positive (sketch below).
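A one-line illustration of why zeros are a problem, and the usual fix of shifting the response up by 1 (used below):

```{r}
# sketch: log of zero is -Inf, so shift the response by 1 before transforming
log(0)
log(0 + 1)
```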
Note the response for `boxcox`:

```{r bRegression-35, eval=F}
boxcox(timedrs + 1 ~ phyheal + menheal + stress, data = visits)
```

## Try 1

```{r bRegression-36, echo=F, fig.height=4.5}
boxcox(timedrs + 1 ~ phyheal + menheal + stress, data = visits)
```

## Comments on try 1

- Best: $\lambda$ just less than zero.
- Hard to see the scale.
- Focus on $\lambda$ in $(-0.3, 0.1)$:

```{r bRegression-37, size="footnotesize"}
my.lambda <- seq(-0.3, 0.1, 0.01)
my.lambda
```

## Try 2

```{r bRegression-38, fig.height=3.5}
boxcox(timedrs + 1 ~ phyheal + menheal + stress,
  lambda = my.lambda, data = visits
)
```

## Comments

- Best: $\lambda$ just about $-0.07$.
- CI for $\lambda$ about $(-0.14, 0.01)$.
- Only nearby round number: $\lambda=0$, the log transformation.

## Fixing the problems

- Try the regression again, with the transformed response instead of the original one.
- Then check the residual plot to see that it is OK now.

```{r bRegression-39}
visits.3 <- lm(log(timedrs + 1) ~ phyheal + menheal + stress,
  data = visits
)
```

- `timedrs + 1` because some `timedrs` values are 0, and we can't take the log of 0.
- Won't usually need to worry about this, but when the response could be zero or negative, fix that before transforming.

## Output

\scriptsize

```{r bRegression-40}
summary(visits.3)
```

## Comments

- Model as a whole strongly significant again.
- R-sq higher than before (37% vs. 22%), suggesting things are more linear now.
- Same conclusion re `menheal`: can take it out of the regression.
- Should look at residual plots (next pages). Have we fixed the problems?

## Residuals against fitted values

```{r bRegression-41, fig.height=3.5}
ggplot(visits.3, aes(x = .fitted, y = .resid)) + geom_point()
```

## Normal quantile plot of residuals

```{r bRegression-42, fig.height=3.8}
ggplot(visits.3, aes(sample = .resid)) + stat_qq() + stat_qq_line()
```

## Absolute residuals against fitted

```{r bRegression-43, fig.height=3.5, message=F}
ggplot(visits.3, aes(x = .fitted, y = abs(.resid))) + geom_point() + geom_smooth()
```

## Comments

- Residuals vs. fitted looks a lot more random.
- Normal quantile plot looks a lot more normal (though still a little right-skewed).
- Absolute residuals: not so much trend (though still some).
- Not perfect, but much improved.

## Testing more than one $x$ at once

- The $t$-tests only test whether one variable could be taken out of the regression you're looking at.
- To test the significance of more than one variable at once, fit the model with and without those variables,
- then use `anova` to compare the fit of the two models:

```{r bRegression-44}
visits.5 <- lm(log(timedrs + 1) ~ phyheal + menheal + stress, data = visits)
visits.6 <- lm(log(timedrs + 1) ~ stress, data = visits)
```

## Results of tests

```{r bRegression-45}
anova(visits.6, visits.5)
```

- The models don't fit equally well, so the bigger one fits better.
- Or: "taking both variables out makes the fit worse, so don't do it".
- Taking out those $x$'s is a mistake. Or: putting them in is a good idea.
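An equivalent way to set up the same comparison (a sketch, using `update` as earlier): build the smaller model by removing the two variables from the bigger one, then compare with `anova` as before.

```{r}
# sketch: drop phyheal and menheal from visits.5, then compare the two fits
visits.6a <- update(visits.5, . ~ . - phyheal - menheal)
anova(visits.6a, visits.5)
```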
## The punting data

Data set `punting.txt` contains 4 variables for 13 right-footed football kickers (punters): left-leg and right-leg strength (lbs), distance punted (ft), and another variable called "fred". Predict punting distance from the other variables:

\scriptsize

```
left right   punt fred
 170   170 162.50  171
 130   140 144.0   136
 170   180 174.50  174
 160   160 163.50  161
 150   170 192.0   159
 150   150 171.75  151
 180   170 162.0   174
 110   110 104.83  111
 110   120 105.67  114
 120   130 117.58  126
 140   120 140.25  129
 130   140 150.17  136
 150   160 165.17  154
```

## Reading in

- Separated by *multiple spaces* with *columns lined up*:

```{r bRegression-46}
my_url <- "http://ritsokiguess.site/datafiles/punting.txt"
punting <- read_table(my_url)
```

## The data

```{r bRegression-47}
punting
```

## Regression and output

```{r bRegression-48}
punting.1 <- lm(punt ~ left + right + fred, data = punting)
glance(punting.1)
tidy(punting.1)
summary(punting.1)
```

## Comments

- Overall regression strongly significant, R-sq high.
- None of the $x$'s significant! Why?
- $t$-tests only say that you could take any one of the $x$'s out without damaging the fit; it doesn't matter which one.
- Explanation: look at the *correlations*.

## The correlations

```{r bRegression-49}
cor(punting)
```

- *All* correlations are high: $x$'s with `punt` (good) and with each other (bad, or at least confusing).
- What to do? Probably do just as well to pick one variable, say `right`, since the kickers are right-footed.

## Just `right`

```{r bRegression-50}
punting.2 <- lm(punt ~ right, data = punting)
summary(punting.2)
anova(punting.2, punting.1)
punting.3 <- lm(punt ~ left, data = punting)
summary(punting.3)
```

No significant loss by dropping the other two variables.

## Comparing R-squareds

```{r bRegression-51}
summary(punting.1)$r.squared
summary(punting.2)$r.squared
```

Basically no difference. In the regression (over), `right` is significant:

## Regression results

```{r bRegression-52}
tidy(punting.2)
```

## But\ldots

- Maybe we got the *form* of the relationship with `left` wrong.
- Check: plot the *residuals* from the previous regression (without `left`) against `left`.
- Residuals here are "punting distance adjusted for right-leg strength".
- If there is some kind of relationship with `left`, we should include it in the model.
- Plot of residuals against an original variable: `augment` from `broom`.

## Augmenting `punting.2`

```{r bRegression-53}
punting.2 %>% augment(punting) -> punting.2.aug
punting.2.aug
```

## Residuals against `left`

```{r basingstoke, fig.height=3.5}
ggplot(punting.2.aug, aes(x = left, y = .resid)) + geom_point()
```

## Comments

- There is a *curved* relationship with `left`.
- We should add `left`-squared to the regression (and therefore put `left` back in when we do that):

```{r bRegression-54}
punting.3 <- lm(punt ~ left + I(left^2) + right,
  data = punting
)
```

## Regression with `left`-squared

\scriptsize

```{r bRegression-55}
summary(punting.3)
```

## Comments

- This was definitely a good idea (R-squared has clearly increased).
- We would never have seen it without plotting the residuals from `punting.2` (without `left`) against `left`.
- The negative slope on the `left`-squared term means that increased left-leg strength only increases punting distance up to a point: beyond that, it decreases again.
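To see that fitted curved effect, a sketch using `plot_predictions` (as earlier), which holds the other variable `right` at its mean:

```{r}
# sketch: predicted punting distance as left-leg strength varies,
# with right-leg strength held at its mean value
plot_predictions(punting.3, condition = "left")
```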