---
title: "Case study: asphalt"
editor: 
  markdown: 
    wrap: 72
---

## The asphalt data

-   31 asphalt pavements prepared under different conditions. How does
    quality of pavement depend on these?
-   Variables:
    -   `pct.a.surf` Percentage of asphalt in surface layer
    -   `pct.a.base` Percentage of asphalt in base layer
    -   `fines` Percentage of fines in surface layer
    -   `voids` Percentage of voids in surface layer
    -   `rut.depth` Change in rut depth per million vehicle passes
    -   `viscosity` Viscosity of asphalt
    -   `run` 2 data collection periods: 1 for run 1, 0 for run 2.
-   `rut.depth` response. Depends on other variables, how?

## Packages for this section

```{r asphalt-1}
library(MASS)
library(tidyverse)
library(broom)
library(leaps)
```

Make sure to load `MASS` before `tidyverse` (for annoying technical
reasons).

## Getting set up

```{r asphalt-2}
my_url <- "http://ritsokiguess.site/datafiles/asphalt.txt"
asphalt <- read_delim(my_url, " ")
```

-   Quantitative variables with one response: multiple regression.
-   Some issues here that don't come up in "simple" regression; handle
    as we go. (STAB27/STAC67 ideas.)

## The data (some)

```{r asphalt-3}
asphalt
```

## Plotting response "rut depth" against everything else

Same idea as for plotting separate predictions on one plot:

```{r asphalt-4}
asphalt %>%
  pivot_longer(
    -rut.depth,
    names_to="xname", values_to="x"
  ) %>%
  ggplot(aes(x = x, y = rut.depth)) + geom_point() +
  facet_wrap(~xname, scales = "free") -> g
```

"collect all the x-variables together into one column called x, with
another column xname saying which x they were, then plot these x's
against rut.depth, a separate facet for each x-variable."

I saved this graph to plot later (on the next page).

## The plot

```{r asphalt-5}
g
```

## Interpreting the plots

-   One plot of rut depth against each of the six other variables.
-   Get rough idea of what's going on.
-   Trends mostly weak.
-   `viscosity` has strong but non-linear trend.
-   `run` has effect but variability bigger when run is 1.
-   Weak but downward trend for `voids`.
-   Non-linearity of `rut.depth`-`viscosity` relationship should concern
    us.

## Log of `viscosity`: more nearly linear?

-   Take this back to asphalt engineer: suggests log of `viscosity`:

```{r logvisplot, fig.keep="none", warning=F, message=F}
ggplot(asphalt, aes(y = rut.depth, x = log(viscosity))) +
  geom_point() + geom_smooth(se = F) -> g
```

(plot overleaf)

## Rut depth against log-viscosity

```{r asphalt-6, ref.label="logvisplot", echo=FALSE, message=FALSE}
g
```

## Comments and next steps

-   Not very linear, but better than before.
-   In multiple regression, hard to guess which x's affect response. So
    typically start by predicting from everything else.
-   Model formula has response on left, squiggle, explanatories on right
    joined by plusses:

```{r asphalt-7}
rut.1 <- lm(rut.depth ~ pct.a.surf + pct.a.base + fines +
  voids + log(viscosity) + run, data = asphalt)
summary(rut.1)
```

## Regression output: `summary(rut.1)` or:

\footnotesize

```{r asphalt-8}
glance(rut.1)
tidy(rut.1)
```

\normalsize

## Comments

-   R-squared 81%, not so bad.

-   P-value in `glance` asserts that something helping to predict
    rut.depth.

-   Table of coefficients says `log(viscosity)`.

-   But confused by clearly non-significant variables: remove those to
    get clearer picture of what is helpful.

-   

    ## Before we do anything, look at residual plots:

    ```         
    (a) of residuals against fitted values (as usual)
    ```

    -   

        (b) of residuals against each explanatory.

-   Problem fixes:

    -   with (a): fix response variable;
    -   with some plots in (b): fix those explanatory variables.

## Plot fitted values against residuals

```{r asphalt-9}
ggplot(rut.1, aes(x = .fitted, y = .resid)) + geom_point()
```

## Normal quantile plot of residuals

```{r}
ggplot(rut.1, aes(sample = .resid)) + stat_qq() + stat_qq_line()
```

## Plotting residuals against $x$ variables

-   Problem here is that residuals are in the fitted model, and the
    observed $x$-values are in the original data frame `asphalt`.
-   Package broom contains a function `augment` that combines these two
    together so that they can later be plotted: start with a model
    first, and then augment with a data frame:

```{r asphalt-10}
rut.1 %>% augment(asphalt) -> rut.1a
```

## What does rut.1a contain?

```{r asphalt-11, echo=FALSE}
#options(width = 70)
```

```{r asphalt-12}
names(rut.1a)
```

-   all the stuff in original data frame, plus:
-   quantities from regression (starting with a dot)

## Plotting residuals against $x$-variables

```{r asphalt-13}
rut.1a %>%
  mutate(log_vis=log(viscosity)) %>% 
  pivot_longer(
    c(pct.a.surf:voids, run, log_vis),
    names_to="xname", values_to="x"
  ) %>%
  ggplot(aes(x = x, y = .resid)) +
  geom_point() + facet_wrap(~xname, scales = "free") -> g
```

## The plot

```{r asphalt-14}
g
```

## Comments

-   There is serious curve in plot of residuals vs. fitted values.
    Suggests a transformation of $y$.
-   The residuals-vs-$x$'s plots don't show any serious trends. Worst
    probably that potential curve against log-viscosity.
-   Also, large positive residual, 10, that shows up on all plots.
    Perhaps transformation of $y$ will help with this too.
-   If residual-fitted plot OK, but some residual-$x$ plots not, try
    transforming those $x$'s, eg. by adding $x^2$ to help with curve.

## Which transformation?

-   Best way: consult with person who brought you the data.
-   Can't do that here!
-   No idea what transformation would be good.
-   Let data choose: "Box-Cox transformation".
-   Scale is that of "ladder of powers": power transformation, but 0 is
    log.

## Running Box-Cox

From package `MASS`:

```{r asphalt-15}
boxcox(rut.depth ~ pct.a.surf + pct.a.base + fines + voids +
  log(viscosity) + run, data = asphalt)
```

## Comments on Box-Cox plot

-   $\lambda$ represents power to transform $y$ with.
-   Best single choice of transformation parameter $\lambda$ is peak of
    curve, close to 0.
-   Vertical dotted lines give CI for $\lambda$, about (−0.05, 0.2).
-   $\lambda = 0$ means "log".
-   Narrowness of confidence interval mean that these not supported by
    data:
    -   No transformation ($\lambda = 1$)
    -   Square root ($\lambda = 0.5$)
    -   Reciprocal ($\lambda = −1$).

## Relationships with explanatories

-   As before: plot response (now `log(rut.depth)`) against other
    explanatory variables, all in one shot:

```{r asphalt-16}
asphalt %>%
  mutate(log_vis=log(viscosity)) %>% 
  pivot_longer(
    c(pct.a.surf:voids, run, log_vis),
    names_to="xname", values_to="x"
  ) %>%
  ggplot(aes(y = log(rut.depth), x = x)) + geom_point() +
  facet_wrap(~xname, scales = "free") -> g3
```

## The new plots

```{r asphalt-17}
g3
```

## Modelling with transformed response

-   These trends look pretty straight, especially with `log.viscosity`.
-   Values of `log.rut.depth` for each `run` have same spread.
-   Other trends weak, but are straight if they exist.
-   Start modelling from the beginning again.
-   Model `log.rut.depth` in terms of everything else, see what can be
    removed:

```{r asphalt-18}
rut.2 <- lm(log(rut.depth) ~ pct.a.surf + pct.a.base +
  fines + voids + log(viscosity) + run, data = asphalt)
```

-   use `tidy` from `broom` to display just the coefficients.

## Output

```{r asphalt-19}
tidy(rut.2)
summary(rut.2)
```

## Taking out everything non-significant

-   Try: remove everything but pct.a.surf and log.viscosity:

\footnotesize

```{r asphalt-20}
rut.3 <- lm(log(rut.depth) ~ pct.a.surf + log(viscosity), data = asphalt)
summary(rut.3)
```

\normalsize

\footnotesize

-   Check that removing all those variables wasn't too much:

```{r asphalt-21}
anova(rut.3, rut.2)
```

\normalsize

-   $H_0$ : two models equally good; $H_a$ : bigger model better.
-   Null not rejected here; small model as good as the big one, so
    prefer simpler smaller model `rut.3`.

## Find the largest P-value by eye:

```{r asphalt-22}
tidy(rut.2)
```

-   Largest P-value is 0.78 for `pct.a.base`, not significant.
-   So remove this first, re-fit and re-assess.
-   Or, as over.

## Get the computer to find the largest P-value for you

-   Output from `tidy` is itself a data frame, thus:

```{r asphalt-23}
tidy(rut.2) %>% arrange(p.value)
```

-   Largest P-value at the bottom.

## Take out `pct.a.base`

-   Copy and paste the `lm` code and remove what you're removing:

\small

```{r asphalt-24}
rut.4 <- lm(log(rut.depth) ~ pct.a.surf + fines + voids + 
              log(viscosity) + run, data = asphalt)
tidy(rut.4) %>% arrange(p.value) %>% dplyr::select(term, p.value)
```

\normalsize

-   `fines` is next to go, P-value 0.32.

## "Update"

Another way to do the same thing:

```{r asphalt-25}
rut.4 <- update(rut.2, . ~ . - pct.a.base)
tidy(rut.4) %>% arrange(p.value)
```

-   Again, `fines` is the one to go. (Output identical as it should be.)

## Take out fines:

```{r asphalt-26}
rut.5 <- update(rut.4, . ~ . - fines)
tidy(rut.5) %>% arrange(p.value) %>% dplyr::select(term, p.value)
```

Can't take out intercept, so `run`, with P-value 0.36, goes next.

## Take out run:

```{r asphalt-27}
rut.6 <- update(rut.5, . ~ . - run)
tidy(rut.6) %>% arrange(p.value) %>% dplyr::select(term, p.value)
```

Again, can't take out intercept, so largest P-value is for `voids`,
0.044. But this is significant, so we shouldn't remove `voids`.

## Comments

-   Here we stop: `pct.a.surf`, `voids` and `log.viscosity` would all
    make fit significantly worse if removed. So they stay.
-   Different final result from taking things out one at a time (top),
    than by taking out 4 at once (bottom):

```{r asphalt-28}
summary(rut.6)
coef(rut.6)
coef(rut.3)
```

-   Point: Can make difference which way we go.

## Comments on variable selection

-   Best way to decide which $x$'s belong: expert knowledge: which of
    them should be important.
-   Best automatic method: what we did, "backward selection".
-   Do not learn about "stepwise regression"! [**eg.
    here**](https://towardsdatascience.com/stopping-stepwise-why-stepwise-selection-is-bad-and-what-you-should-use-instead-90818b3f52df)
-   R has function `step` that does backward selection, like this:

```{r asphalt-29, eval=F}
step(rut.2, direction = "backward", test = "F")
```

Gets same answer as we did (by removing least significant x).

-   Removing non-significant $x$'s may remove interesting ones whose
    P-values happened not to reach 0.05. Consider using less stringent
    cutoff like 0.20 or even bigger.
-   Can also fit all possible regressions, as over (may need to do
    `install.packages("leaps")` first).

## All possible regressions (output over)

Uses package `leaps`:

```{r asphalt-30}
leaps <- regsubsets(log(rut.depth) ~ pct.a.surf + 
                      pct.a.base + fines + voids + 
                      log(viscosity) + run, 
                    data = asphalt, nbest = 2)
s <- summary(leaps)
with(s, data.frame(rsq, outmat)) -> d
```

## The output

```{r asphalt-31, echo=F}
wid=getOption("width")
options(width=80)
```

\scriptsize

```{r asphalt-32}
d %>% rownames_to_column("model") %>% arrange(desc(rsq))
```

\normalsize

```{r asphalt-33, echo=F}
options(width=wid)
```

## Comments

-   Problem: even adding a worthless x increases R-squared. So try for
    line where R-squared stops increasing "too much", eg. top line (just
    log.viscosity), first 3-variable line (backwards-elimination model).
    Hard to judge.
-   One solution (STAC67): adjusted R-squared, where adding worthless
    variable makes it go down.
-   `data.frame` rather than `tibble` because there are several columns
    in `outmat`.

## All possible regressions, adjusted R-squared

```{r asphalt-34, echo=F}
wid=getOption("width")
options(width=80)
```

\scriptsize

```{r asphalt-35}
with(s, data.frame(adjr2, outmat)) %>% 
  rownames_to_column("model") %>% 
  arrange(desc(adjr2))
```

\normalsize

```{r asphalt-36, echo=F}
options(width=wid)
```

## Revisiting the best model

-   Best model was our rut.6:

```{r asphalt-37}
tidy(rut.6)
```

## Revisiting (2)

-   Regression slopes say that rut depth increases as log-viscosity
    decreases, `pct.a.surf` increases and `voids` increases. This more
    or less checks out with out scatterplots against `log.viscosity`.
-   We should check residual plots again, though previous scatterplots
    say it's unlikely that there will be a problem:

```{r asphalt-38}
g <- ggplot(rut.6, aes(y = .resid, x = .fitted)) + 
geom_point()
```

## Residuals against fitted values

```{r asphalt-39}
g
```


```{r}
ggplot(rut.6, aes(sample = .resid)) + stat_qq() + stat_qq_line()
```

## Plotting residuals against x's

-   Do our trick again to put them all on one plot:

```{r asphalt-40}
augment(rut.6, asphalt) %>%
  mutate(log_vis=log(viscosity)) %>% 
  pivot_longer(
    c(pct.a.surf:voids, run, log_vis),
    names_to="xname", values_to="x",
  ) %>%
  ggplot(aes(y = .resid, x = x)) + geom_point() +
  facet_wrap(~xname, scales = "free") -> g2
```

## Residuals against the x's

```{r asphalt-41}
g2
```

## Comments

-   None of the plots show any sort of pattern. The points all look
    random on each plot.
-   On the plot of fitted values (and on the one of log.viscosity), the
    points seem to form a "left half" and a "right half" with a gap in
    the middle. This is not a concern.
-   One of the pct.a.surf values is low outlier (4), shows up top left
    of that plot.
-   Only two possible values of run; the points in each group look
    randomly scattered around 0, with equal spreads.
-   Residuals seem to go above zero further than below, suggesting a
    mild non-normality, but not enough to be a problem.

## Variable-selection strategies

-   Expert knowledge.
-   Backward elimination.
-   All possible regressions.
-   Taking a variety of models to experts and asking their opinion.
-   Use a looser cutoff to eliminate variables in backward elimination
    (eg. only if P-value greater than 0.20).
-   If goal is prediction, eliminating worthless variables less
    important.
-   If goal is understanding, want to eliminate worthless variables
    where possible.
-   Results of variable selection not always reproducible, so caution
    advised.