--- title: "Regression revisited" editor: markdown: wrap: 72 --- ## Regression - Use regression when one variable is an outcome (*response*, $y$). - See if/how response depends on other variable(s), *explanatory*, $x_1, x_2,\ldots$. - Can have *one* or *more than one* explanatory variable, but always one response. - Assumes a *straight-line* relationship between response and explanatory. - Ask: - *is there* a relationship between $y$ and $x$'s, and if so, which ones? - what does the relationship look like? ## Packages ```{r bRegression-1, results='hide'} library(MASS, exclude = "select") # for Box-Cox, later library(tidyverse) library(broom) library(marginaleffects) library(conflicted) conflict_prefer("select", "dplyr") ``` ## A regression with one $x$ 13 children, measure average total sleep time (ATST, mins) and age (years) for each. See if ATST depends on age. Data in `sleep.txt`, ATST then age. Read in data: ```{r bRegression-2 } my_url <- "http://ritsokiguess.site/datafiles/sleep.txt" sleep <- read_delim(my_url, " ") ``` ## Check data ```{r bRegression-3, size="footnotesize"} summary(sleep) ``` ```{r} sleep ``` Make scatter plot of ATST (response) vs. age (explanatory) using code overleaf: ## The scatterplot ```{r suggo, fig.height=6} ggplot(sleep, aes(x = age, y = atst)) + geom_point() ``` ## Correlation - Measures how well a straight line fits the data: ```{r bRegression-4 } with(sleep, cor(atst, age)) ``` - $1$ is perfect upward trend, $-1$ is perfect downward trend, 0 is no trend. - This one close to perfect downward trend. - Can do correlations of all pairs of variables: ```{r bRegression-5 } cor(sleep) ``` ## Lowess curve - Sometimes nice to guide the eye: is the trend straight, or not? - Idea: *lowess curve*. "Locally weighted least squares", not affected by outliers, not constrained to be linear. - Lowess is a *guide*: even if straight line appropriate, may wiggle/bend a little. Looking for *serious* problems with linearity. - Add lowess curve to plot using `geom_smooth`: ## Plot with lowess curve ```{r icko,fig.height=6} ggplot(sleep, aes(x = age, y = atst)) + geom_point() + geom_smooth() ``` ## The regression Scatterplot shows no obvious curve, and a pretty clear downward trend. So we can run the regression: ```{r bRegression-6, echo=-1} options(width = 60) sleep.1 <- lm(atst ~ age, data = sleep) ``` ## The output ```{r bRegression-7} summary(sleep.1) ``` ## Conclusions - The relationship appears to be a straight line, with a downward trend. - $F$-tests for model as a whole and $t$-test for slope (same) both confirm this (P-value $5.7\times 10^{-7}=0.00000057$). - Slope is $-14$, so a 1-year increase in age goes with a 14-minute decrease in ATST on average. - R-squared is correlation squared (when one $x$ anyway), between 0 and 1 (1 good, 0 bad). - Here R-squared is 0.9054, pleasantly high. ## Doing things with the regression output - Output from regression (and eg. $t$-test) is all right to look at, but hard to extract and re-use information from. - Package `broom` extracts info from model output in way that can be used in pipe (later): ```{r bRegression-8, size="footnotesize"} tidy(sleep.1) ``` ## also one-line summary of model: ```{r bRegression-9} glance(sleep.1) ``` ## Broom part 2 ```{r bRegression-10,warning=F} sleep.1 %>% augment(sleep) ``` Useful for plotting residuals against an $x$-variable. 
## CI for mean response and prediction intervals

Once a useful regression exists, use it for prediction:

- To get a single number for the prediction at a given $x$, substitute into the regression equation, e.g. age 10: predicted ATST is $646.48-14.04(10)=506$ minutes.
- To express the uncertainty of this prediction:
    - *CI for mean response* expresses uncertainty about the mean ATST for all children aged 10, based on data.
    - *Prediction interval* expresses uncertainty about the predicted ATST for a new child aged 10 whose ATST is not known. More uncertain.
- Also do the above for a child aged 5.

## The `marginaleffects` package 1/2

To get predictions for specific values, set up a dataframe with those values first:

```{r}
new <- datagrid(model = sleep.1, age = c(10, 5))
new
```

Any variables in the dataframe that you don't specify are set to their mean values (quantitative) or most common category (categorical).

## The `marginaleffects` package 2/2

Then feed into `newdata` in `predictions`. The output contains a lot of columns, so you probably want to display only the ones you care about:

```{r}
cbind(predictions(sleep.1, newdata = new)) %>% 
  select(estimate, conf.low, conf.high, age)
```

The confidence limits are a 95% confidence interval for the mean response at that `age`.

## Prediction intervals

These are obtained (instead) with `predict` as below. Use the same dataframe `new` as before:

```{r bRegression-12}
pp <- predict(sleep.1, new, interval = "p")
pp
cbind(new, pp)
```

## Plotting the confidence intervals for mean response again

```{r}
plot_predictions(sleep.1, condition = "age")
```

## Comments

- Age 10 is closer to the centre of the data, so both intervals are narrower than those for age 5.
- Prediction intervals are bigger than the CI for the mean (additional uncertainty).
- Technical note: output from `predict` is an R `matrix`, not a data frame, so tidyverse `bind_cols` does not work. Use base R `cbind`.

## That grey envelope

Marks the confidence interval for the mean for all $x$:

```{r bRegression-15, fig.height=6.0}
ggplot(sleep, aes(x = age, y = atst)) + geom_point() +
  geom_smooth(method = "lm") +
  scale_y_continuous(breaks = seq(420, 600, 20))
```

## Diagnostics

How to tell whether a straight-line regression is appropriate?

- Before: check the scatterplot for a straight trend.
- After: plot *residuals* (observed minus predicted response) against predicted values. Aim: a plot with no pattern.

## Residual plot

Not much pattern here --- regression appropriate.

```{r akjhkadjfhjahnkkk, fig.height=3.4}
ggplot(sleep.1, aes(x = .fitted, y = .resid)) + geom_point()
```

## An inappropriate regression

Different data:

```{r curvy}
my_url <- "http://ritsokiguess.site/datafiles/curvy.txt"
curvy <- read_delim(my_url, " ")
```

## Scatterplot

```{r bRegression-16, fig.height=3.8}
ggplot(curvy, aes(x = xx, y = yy)) + geom_point()
```

## Regression line, anyway

\scriptsize

```{r bRegression-17}
curvy.1 <- lm(yy ~ xx, data = curvy)
summary(curvy.1)
```

## Residual plot

```{r altoadige, fig.height=3.8}
ggplot(curvy.1, aes(x = .fitted, y = .resid)) + geom_point()
```

## No good: fixing it up

- Residual plot has a *curve*: middle residuals positive, high and low ones negative. Bad.
- Fitting a curve would be better. Try this:

```{r bRegression-18}
curvy.2 <- lm(yy ~ xx + I(xx^2), data = curvy)
```

- Adding an `xx`-squared term, to allow for a curve.
- Another way to do the same thing: specify how the model *changes*:

```{r bRegression-19}
curvy.2a <- update(curvy.1, . ~ . + I(xx^2))
```
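A quick check (a sketch) that the two formulations describe the same fitted model: the coefficients should agree.

```{r}
# sketch: curvy.2 (refit from scratch) and curvy.2a (built with update) should match
coef(curvy.2)
coef(curvy.2a)
```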
## Regression 2

```{r bRegression-20}
tidy(curvy.2)
glance(curvy.2)
```

## Comments

- `xx`-squared term definitely significant (P-value 0.000182), so we need this curve to describe the relationship.
- Adding the squared term has made R-squared go up from 0.5848 to 0.9502: a great improvement.
- This is a definite curve!

## The residual plot now

No problems any more:

```{r bRegression-21, size="small", fig.height=3.4}
ggplot(curvy.2, aes(x = .fitted, y = .resid)) + geom_point()
```

## Another way to handle curves

- Above, we saw that changing $x$ (adding $x^2$) was a way of handling curved relationships.
- Another way: change $y$ (transformation).
- Can guess how to change $y$, or there might be theory:
    - example: relationship $y=ae^{bx}$ (exponential growth):
    - take logs to get $\ln y=\ln a + bx$.
    - Taking logs has made the relationship linear ($\ln y$ as response).
- Or, *estimate* the transformation, using the Box-Cox method.

## Box-Cox

- Install package `MASS` via `install.packages("MASS")` (only need to do this *once*).
- In every R session where you want to use something in `MASS`, type `library(MASS)`.

## Some made-up data

```{r bRegression-22, message=F}
my_url <- "http://ritsokiguess.site/datafiles/madeup2.csv"
madeup <- read_csv(my_url)
madeup
```

Seems to be faster-than-linear growth, maybe exponential growth.

## Scatterplot: faster than linear growth

```{r dsljhsdjlhf, fig.height=3.2}
ggplot(madeup, aes(x = x, y = y)) + geom_point() + geom_smooth()
```

## Running Box-Cox

- `library(MASS)` first.
- Feed `boxcox` a model formula with a squiggle in it, such as you would use for `lm`.
- Output: a graph (next page):

```{r bRegression-23, eval=F}
boxcox(y ~ x, data = madeup)
```

## The Box-Cox output

```{r trento, echo=F, fig.height=4}
boxcox(y ~ x, data = madeup)
```

## Comments

- $\lambda$ (lambda) is the power by which you should transform $y$ to get the relationship straight (straighter). Power 0 is "take logs".
- The middle dotted line marks the best single value of $\lambda$ (here about 0.1).
- The outer dotted lines mark a 95% CI for $\lambda$, here $-0.3$ to 0.7, approximately. (Rather uncertain about the best transformation.)
- Any power transformation within the CI is supported by the data. In this case, log ($\lambda=0$) and square root ($\lambda=0.5$) are good, but no transformation ($\lambda=1$) is not.
- Pick a "round-number" value of $\lambda$ like $2, 1, 0.5, 0, -0.5, -1$. Here 0 and 0.5 are good values to pick.

## Did transformation straighten things?

- Plot the transformed $y$ against $x$. Here, log:

```{r bRegression-24, message=FALSE, fig.height=3.2}
ggplot(madeup, aes(x = x, y = log(y))) + geom_point() + geom_smooth()
```

Looks much straighter.

## Regression with transformed $y$

```{r bRegression-25}
madeup.1 <- lm(log(y) ~ x, data = madeup)
glance(madeup.1)
tidy(madeup.1)
```

R-squared now decently high.

## Multiple regression

- What if more than one $x$? Extra issues:
    - Now one intercept and a slope for each $x$: how to interpret?
    - Which $x$-variables actually help to predict $y$?
    - Different interpretations of the "global" $F$-test and the individual $t$-tests.
    - R-squared no longer correlation squared, but still interpreted as "higher is better".
- In the `lm` line, add the extra $x$s after `~` (see the sketch below).
- Interpretation not so easy (and other problems can occur).
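A tiny sketch of the formula syntax only, with made-up numbers (not the course data):

```{r}
# sketch with invented data, just to show two x's joined by + after the squiggle
d <- tibble(
  x1 = c(1, 2, 3, 4, 5),
  x2 = c(2, 1, 4, 3, 5),
  y  = c(3, 4, 8, 9, 12)
)
d.1 <- lm(y ~ x1 + x2, data = d)
tidy(d.1)
```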
## Multiple regression example

Study of women and visits to health professionals, and how the number of visits might be related to other variables:

```{=tex}
\begin{description}
\item[timedrs:] number of visits to health professionals (over course of study)
\item[phyheal:] number of physical health problems
\item[menheal:] number of mental health problems
\item[stress:] result of questionnaire about number and type of life changes
\end{description}
```

`timedrs` response, others explanatory.

## The data

```{r bRegression-26}
my_url <- "http://ritsokiguess.site/datafiles/regressx.txt"
visits <- read_delim(my_url, " ")
```

## Check data

```{r bRegression-27}
visits
```

## Fit multiple regression

```{r bRegression-28}
visits.1 <- lm(timedrs ~ phyheal + menheal + stress, data = visits)
summary(visits.1)
```

## The slopes

- Model as a whole strongly significant even though R-sq not very big (lots of data). At least one of the $x$'s predicts `timedrs`.
- The physical health and stress variables definitely help to predict the number of visits, but *with those in the model* we don't need `menheal`. However, look at prediction of `timedrs` from `menheal` by itself:

## Just `menheal`

```{r bRegression-30}
visits.2 <- lm(timedrs ~ menheal, data = visits)
summary(visits.2)
```

## `menheal` by itself

- `menheal` by itself *does* significantly help to predict `timedrs`.
- But the R-sq is much less (6.5% vs. 22%).
- So the other two variables do a better job of prediction.
- With those variables in the regression (`phyheal` and `stress`), we don't need `menheal` *as well*.

## Investigating via correlation

Leave out the first column (`subjno`):

```{r bRegression-31}
visits %>% 
  select(-subjno) %>% 
  cor()
```

- `phyheal` most strongly correlated with `timedrs`.
- Not much to choose between the other two.
- But `menheal` has a higher correlation with `phyheal`, so it has not as much to *add* to the prediction as `stress`.
- Goes to show that things are more complicated in multiple regression.

## Residual plot (from `timedrs` on all)

```{r iffy8, fig.height=3.5}
ggplot(visits.1, aes(x = .fitted, y = .resid)) + geom_point()
```

Apparently random. But...

## Normal quantile plot of residuals

```{r bRegression-32, fig.height=3.5}
ggplot(visits.1, aes(sample = .resid)) + stat_qq() + stat_qq_line()
```

Not normal at all; the upper tail is way too long.

## Absolute residuals

Is there a trend in the *size* of the residuals (fan-out)? Plot the *absolute value* of the residuals against the fitted values:

```{r bRegression-33, fig.height=2.8, message=FALSE}
ggplot(visits.1, aes(x = .fitted, y = abs(.resid))) + geom_point() + geom_smooth()
```

## Comments

- On the normal quantile plot:
    - highest (most positive) residuals are *way* too high
    - distribution of residuals skewed to the right (not normal at all)
- On the plot of absolute residuals:
    - size of residuals getting bigger as fitted values increase
    - predictions getting more variable as fitted values increase
    - that is, predictions getting *less accurate* as fitted values increase, but predictions should be equally accurate all the way along.
- Both indicate problems with the regression, of the kind that a transformation of the response often fixes: that is, predict a *function* of the response `timedrs` instead of `timedrs` itself.

## Box-Cox transformations

- Taking the log of `timedrs` and having it work: lucky guess. How to find a good transformation?
- Box-Cox again.
- Extra problem: some of the `timedrs` values are 0, but Box-Cox expects all values to be positive (sketch below).
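A one-line illustration of why zeros are a problem, and the usual fix of shifting the response up by 1 (used below):

```{r}
# sketch: log of zero is -Inf, so shift the response by 1 before transforming
log(0)
log(0 + 1)
```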
Note the response for `boxcox`:

```{r bRegression-35, eval=F}
boxcox(timedrs + 1 ~ phyheal + menheal + stress, data = visits)
```

## Try 1

```{r bRegression-36, echo=F, fig.height=4.5}
boxcox(timedrs + 1 ~ phyheal + menheal + stress, data = visits)
```

## Comments on try 1

- Best: $\lambda$ just less than zero.
- Hard to see the scale.
- Focus on $\lambda$ in $(-0.3, 0.1)$:

```{r bRegression-37, size="footnotesize"}
my.lambda <- seq(-0.3, 0.1, 0.01)
my.lambda
```

## Try 2

```{r bRegression-38, fig.height=3.5}
boxcox(timedrs + 1 ~ phyheal + menheal + stress,
  lambda = my.lambda, data = visits
)
```

## Comments

- Best: $\lambda$ just about $-0.07$.
- CI for $\lambda$ about $(-0.14, 0.01)$.
- Only nearby round number: $\lambda=0$, the log transformation.

## Fixing the problems

- Try the regression again, with the transformed response instead of the original one.
- Then check the residual plot to see that it is OK now.

```{r bRegression-39}
visits.3 <- lm(log(timedrs + 1) ~ phyheal + menheal + stress,
  data = visits
)
```

- `timedrs + 1` because some `timedrs` values are 0, and we can't take the log of 0.
- Won't usually need to worry about this, but when the response could be zero or negative, fix that before transforming.

## Output

\scriptsize

```{r bRegression-40}
summary(visits.3)
```

## Comments

- Model as a whole strongly significant again.
- R-sq higher than before (37% vs. 22%), suggesting things are more linear now.
- Same conclusion re `menheal`: can take it out of the regression.
- Should look at residual plots (next pages). Have we fixed the problems?

## Residuals against fitted values

```{r bRegression-41, fig.height=3.5}
ggplot(visits.3, aes(x = .fitted, y = .resid)) + geom_point()
```

## Normal quantile plot of residuals

```{r bRegression-42, fig.height=3.8}
ggplot(visits.3, aes(sample = .resid)) + stat_qq() + stat_qq_line()
```

## Absolute residuals against fitted

```{r bRegression-43, fig.height=3.5, message=F}
ggplot(visits.3, aes(x = .fitted, y = abs(.resid))) + geom_point() + geom_smooth()
```

## Comments

- Residuals vs. fitted looks a lot more random.
- Normal quantile plot looks a lot more normal (though still a little right-skewed).
- Absolute residuals: not so much trend (though still some).
- Not perfect, but much improved.

## Testing more than one $x$ at once

- The $t$-tests only test whether one variable could be taken out of the regression you're looking at.
- To test the significance of more than one variable at once, fit the model with and without those variables,
- then use `anova` to compare the fit of the two models:

```{r bRegression-44}
visits.5 <- lm(log(timedrs + 1) ~ phyheal + menheal + stress, data = visits)
visits.6 <- lm(log(timedrs + 1) ~ stress, data = visits)
```

## Results of tests

```{r bRegression-45}
anova(visits.6, visits.5)
```

- The models don't fit equally well, so the bigger one fits better.
- Or: "taking both variables out makes the fit worse, so don't do it".
- Taking out those $x$'s is a mistake. Or: putting them in is a good idea.
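An equivalent way to set up the same comparison (a sketch, using `update` as earlier): build the smaller model by removing the two variables from the bigger one, then compare with `anova` as before.

```{r}
# sketch: drop phyheal and menheal from visits.5, then compare the two fits
visits.6a <- update(visits.5, . ~ . - phyheal - menheal)
anova(visits.6a, visits.5)
```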
## The punting data

Data set `punting.txt` contains 4 variables for 13 right-footed football kickers (punters): left-leg and right-leg strength (lbs), distance punted (ft), and another variable called "fred". Predict punting distance from the other variables:

\scriptsize

```
left right   punt fred
 170   170 162.50  171
 130   140 144.0   136
 170   180 174.50  174
 160   160 163.50  161
 150   170 192.0   159
 150   150 171.75  151
 180   170 162.0   174
 110   110 104.83  111
 110   120 105.67  114
 120   130 117.58  126
 140   120 140.25  129
 130   140 150.17  136
 150   160 165.17  154
```

## Reading in

- Separated by *multiple spaces* with *columns lined up*:

```{r bRegression-46}
my_url <- "http://ritsokiguess.site/datafiles/punting.txt"
punting <- read_table(my_url)
```

## The data

```{r bRegression-47}
punting
```

## Regression and output

```{r bRegression-48}
punting.1 <- lm(punt ~ left + right + fred, data = punting)
glance(punting.1)
tidy(punting.1)
summary(punting.1)
```

## Comments

- Overall regression strongly significant, R-sq high.
- None of the $x$'s significant! Why?
- $t$-tests only say that you could take any one of the $x$'s out without damaging the fit; it doesn't matter which one.
- Explanation: look at the *correlations*.

## The correlations

```{r bRegression-49}
cor(punting)
```

- *All* correlations are high: $x$'s with `punt` (good) and with each other (bad, or at least confusing).
- What to do? Probably do just as well to pick one variable, say `right`, since the kickers are right-footed.

## Just `right`

```{r bRegression-50}
punting.2 <- lm(punt ~ right, data = punting)
summary(punting.2)
anova(punting.2, punting.1)
punting.3 <- lm(punt ~ left, data = punting)
summary(punting.3)
```

No significant loss by dropping the other two variables.

## Comparing R-squareds

```{r bRegression-51}
summary(punting.1)$r.squared
summary(punting.2)$r.squared
```

Basically no difference. In the regression (over), `right` is significant:

## Regression results

```{r bRegression-52}
tidy(punting.2)
```

## But\ldots

- Maybe we got the *form* of the relationship with `left` wrong.
- Check: plot the *residuals* from the previous regression (without `left`) against `left`.
- Residuals here are "punting distance adjusted for right-leg strength".
- If there is some kind of relationship with `left`, we should include it in the model.
- Plot of residuals against an original variable: `augment` from `broom`.

## Augmenting `punting.2`

```{r bRegression-53}
punting.2 %>% augment(punting) -> punting.2.aug
punting.2.aug
```

## Residuals against `left`

```{r basingstoke, fig.height=3.5}
ggplot(punting.2.aug, aes(x = left, y = .resid)) + geom_point()
```

## Comments

- There is a *curved* relationship with `left`.
- We should add `left`-squared to the regression (and therefore put `left` back in when we do that):

```{r bRegression-54}
punting.3 <- lm(punt ~ left + I(left^2) + right,
  data = punting
)
```

## Regression with `left`-squared

\scriptsize

```{r bRegression-55}
summary(punting.3)
```

## Comments

- This was definitely a good idea (R-squared has clearly increased).
- We would never have seen it without plotting the residuals from `punting.2` (without `left`) against `left`.
- The negative slope on the `left`-squared term means that increased left-leg strength only increases punting distance up to a point: beyond that, it decreases again.
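To see that fitted curved effect, a sketch using `plot_predictions` (as earlier), which holds the other variable `right` at its mean:

```{r}
# sketch: predicted punting distance as left-leg strength varies,
# with right-leg strength held at its mean value
plot_predictions(punting.3, condition = "left")
```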