--- title: "Case study: asphalt" editor: markdown: wrap: 72 --- ## The asphalt data - 31 asphalt pavements prepared under different conditions. How does quality of pavement depend on these? - Variables: - pct.a.surf Percentage of asphalt in surface layer - pct.a.base Percentage of asphalt in base layer - fines Percentage of fines in surface layer - voids Percentage of voids in surface layer - rut.depth Change in rut depth per million vehicle passes - viscosity Viscosity of asphalt - run 2 data collection periods: 1 for run 1, 0 for run 2. - rut.depth response. Depends on other variables, how? ## Packages for this section {r asphalt-1} library(MASS) library(tidyverse) library(broom) library(leaps)  Make sure to load MASS before tidyverse (for annoying technical reasons). ## Getting set up {r asphalt-2} my_url <- "http://ritsokiguess.site/datafiles/asphalt.txt" asphalt <- read_delim(my_url, " ")  - Quantitative variables with one response: multiple regression. - Some issues here that don't come up in "simple" regression; handle as we go. (STAB27/STAC67 ideas.) ## The data (some) {r asphalt-3} asphalt  ## Plotting response "rut depth" against everything else Same idea as for plotting separate predictions on one plot: {r asphalt-4} asphalt %>% pivot_longer( -rut.depth, names_to="xname", values_to="x" ) %>% ggplot(aes(x = x, y = rut.depth)) + geom_point() + facet_wrap(~xname, scales = "free") -> g  "collect all the x-variables together into one column called x, with another column xname saying which x they were, then plot these x's against rut.depth, a separate facet for each x-variable." I saved this graph to plot later (on the next page). ## The plot {r asphalt-5} g  ## Interpreting the plots - One plot of rut depth against each of the six other variables. - Get rough idea of what's going on. - Trends mostly weak. - viscosity has strong but non-linear trend. - run has effect but variability bigger when run is 1. - Weak but downward trend for voids. - Non-linearity of rut.depth-viscosity relationship should concern us. ## Log of viscosity: more nearly linear? - Take this back to asphalt engineer: suggests log of viscosity: {r logvisplot, fig.keep="none", warning=F, message=F} ggplot(asphalt, aes(y = rut.depth, x = log(viscosity))) + geom_point() + geom_smooth(se = F) -> g  (plot overleaf) ## Rut depth against log-viscosity {r asphalt-6, ref.label="logvisplot", echo=FALSE, message=FALSE} g  ## Comments and next steps - Not very linear, but better than before. - In multiple regression, hard to guess which x's affect response. So typically start by predicting from everything else. - Model formula has response on left, squiggle, explanatories on right joined by plusses: {r asphalt-7} rut.1 <- lm(rut.depth ~ pct.a.surf + pct.a.base + fines + voids + log(viscosity) + run, data = asphalt) summary(rut.1)  ## Regression output: summary(rut.1) or: \footnotesize {r asphalt-8} glance(rut.1) tidy(rut.1)  \normalsize ## Comments - R-squared 81%, not so bad. - P-value in glance asserts that something helping to predict rut.depth. - Table of coefficients says log(viscosity). - But confused by clearly non-significant variables: remove those to get clearer picture of what is helpful. - ## Before we do anything, look at residual plots:  (a) of residuals against fitted values (as usual)  - (b) of residuals against each explanatory. - Problem fixes: - with (a): fix response variable; - with some plots in (b): fix those explanatory variables. ## Plot fitted values against residuals {r asphalt-9} ggplot(rut.1, aes(x = .fitted, y = .resid)) + geom_point()  ## Normal quantile plot of residuals {r} ggplot(rut.1, aes(sample = .resid)) + stat_qq() + stat_qq_line()  ## Plotting residuals against $x$ variables - Problem here is that residuals are in the fitted model, and the observed $x$-values are in the original data frame asphalt. - Package broom contains a function augment that combines these two together so that they can later be plotted: start with a model first, and then augment with a data frame: {r asphalt-10} rut.1 %>% augment(asphalt) -> rut.1a  ## What does rut.1a contain? {r asphalt-11, echo=FALSE} #options(width = 70)  {r asphalt-12} names(rut.1a)  - all the stuff in original data frame, plus: - quantities from regression (starting with a dot) ## Plotting residuals against $x$-variables {r asphalt-13} rut.1a %>% mutate(log_vis=log(viscosity)) %>% pivot_longer( c(pct.a.surf:voids, run, log_vis), names_to="xname", values_to="x" ) %>% ggplot(aes(x = x, y = .resid)) + geom_point() + facet_wrap(~xname, scales = "free") -> g  ## The plot {r asphalt-14} g  ## Comments - There is serious curve in plot of residuals vs. fitted values. Suggests a transformation of $y$. - The residuals-vs-$x$'s plots don't show any serious trends. Worst probably that potential curve against log-viscosity. - Also, large positive residual, 10, that shows up on all plots. Perhaps transformation of $y$ will help with this too. - If residual-fitted plot OK, but some residual-$x$ plots not, try transforming those $x$'s, eg. by adding $x^2$ to help with curve. ## Which transformation? - Best way: consult with person who brought you the data. - Can't do that here! - No idea what transformation would be good. - Let data choose: "Box-Cox transformation". - Scale is that of "ladder of powers": power transformation, but 0 is log. ## Running Box-Cox From package MASS: {r asphalt-15} boxcox(rut.depth ~ pct.a.surf + pct.a.base + fines + voids + log(viscosity) + run, data = asphalt)  ## Comments on Box-Cox plot - $\lambda$ represents power to transform $y$ with. - Best single choice of transformation parameter $\lambda$ is peak of curve, close to 0. - Vertical dotted lines give CI for $\lambda$, about (−0.05, 0.2). - $\lambda = 0$ means "log". - Narrowness of confidence interval mean that these not supported by data: - No transformation ($\lambda = 1$) - Square root ($\lambda = 0.5$) - Reciprocal ($\lambda = −1$). ## Relationships with explanatories - As before: plot response (now log(rut.depth)) against other explanatory variables, all in one shot: {r asphalt-16} asphalt %>% mutate(log_vis=log(viscosity)) %>% pivot_longer( c(pct.a.surf:voids, run, log_vis), names_to="xname", values_to="x" ) %>% ggplot(aes(y = log(rut.depth), x = x)) + geom_point() + facet_wrap(~xname, scales = "free") -> g3  ## The new plots {r asphalt-17} g3  ## Modelling with transformed response - These trends look pretty straight, especially with log.viscosity. - Values of log.rut.depth for each run have same spread. - Other trends weak, but are straight if they exist. - Start modelling from the beginning again. - Model log.rut.depth in terms of everything else, see what can be removed: {r asphalt-18} rut.2 <- lm(log(rut.depth) ~ pct.a.surf + pct.a.base + fines + voids + log(viscosity) + run, data = asphalt)  - use tidy from broom to display just the coefficients. ## Output {r asphalt-19} tidy(rut.2) summary(rut.2)  ## Taking out everything non-significant - Try: remove everything but pct.a.surf and log.viscosity: \footnotesize {r asphalt-20} rut.3 <- lm(log(rut.depth) ~ pct.a.surf + log(viscosity), data = asphalt) summary(rut.3)  \normalsize \footnotesize - Check that removing all those variables wasn't too much: {r asphalt-21} anova(rut.3, rut.2)  \normalsize - $H_0$ : two models equally good; $H_a$ : bigger model better. - Null not rejected here; small model as good as the big one, so prefer simpler smaller model rut.3. ## Find the largest P-value by eye: {r asphalt-22} tidy(rut.2)  - Largest P-value is 0.78 for pct.a.base, not significant. - So remove this first, re-fit and re-assess. - Or, as over. ## Get the computer to find the largest P-value for you - Output from tidy is itself a data frame, thus: {r asphalt-23} tidy(rut.2) %>% arrange(p.value)  - Largest P-value at the bottom. ## Take out pct.a.base - Copy and paste the lm code and remove what you're removing: \small {r asphalt-24} rut.4 <- lm(log(rut.depth) ~ pct.a.surf + fines + voids + log(viscosity) + run, data = asphalt) tidy(rut.4) %>% arrange(p.value) %>% dplyr::select(term, p.value)  \normalsize - fines is next to go, P-value 0.32. ## "Update" Another way to do the same thing: {r asphalt-25} rut.4 <- update(rut.2, . ~ . - pct.a.base) tidy(rut.4) %>% arrange(p.value)  - Again, fines is the one to go. (Output identical as it should be.) ## Take out fines: {r asphalt-26} rut.5 <- update(rut.4, . ~ . - fines) tidy(rut.5) %>% arrange(p.value) %>% dplyr::select(term, p.value)  Can't take out intercept, so run, with P-value 0.36, goes next. ## Take out run: {r asphalt-27} rut.6 <- update(rut.5, . ~ . - run) tidy(rut.6) %>% arrange(p.value) %>% dplyr::select(term, p.value)  Again, can't take out intercept, so largest P-value is for voids, 0.044. But this is significant, so we shouldn't remove voids. ## Comments - Here we stop: pct.a.surf, voids and log.viscosity would all make fit significantly worse if removed. So they stay. - Different final result from taking things out one at a time (top), than by taking out 4 at once (bottom): {r asphalt-28} summary(rut.6) coef(rut.6) coef(rut.3)  - Point: Can make difference which way we go. ## Comments on variable selection - Best way to decide which $x$'s belong: expert knowledge: which of them should be important. - Best automatic method: what we did, "backward selection". - Do not learn about "stepwise regression"! [**eg. here**](https://towardsdatascience.com/stopping-stepwise-why-stepwise-selection-is-bad-and-what-you-should-use-instead-90818b3f52df) - R has function step that does backward selection, like this: {r asphalt-29, eval=F} step(rut.2, direction = "backward", test = "F")  Gets same answer as we did (by removing least significant x). - Removing non-significant $x$'s may remove interesting ones whose P-values happened not to reach 0.05. Consider using less stringent cutoff like 0.20 or even bigger. - Can also fit all possible regressions, as over (may need to do install.packages("leaps") first). ## All possible regressions (output over) Uses package leaps: {r asphalt-30} leaps <- regsubsets(log(rut.depth) ~ pct.a.surf + pct.a.base + fines + voids + log(viscosity) + run, data = asphalt, nbest = 2) s <- summary(leaps) with(s, data.frame(rsq, outmat)) -> d  ## The output {r asphalt-31, echo=F} wid=getOption("width") options(width=80)  \scriptsize {r asphalt-32} d %>% rownames_to_column("model") %>% arrange(desc(rsq))  \normalsize {r asphalt-33, echo=F} options(width=wid)  ## Comments - Problem: even adding a worthless x increases R-squared. So try for line where R-squared stops increasing "too much", eg. top line (just log.viscosity), first 3-variable line (backwards-elimination model). Hard to judge. - One solution (STAC67): adjusted R-squared, where adding worthless variable makes it go down. - data.frame rather than tibble because there are several columns in outmat. ## All possible regressions, adjusted R-squared {r asphalt-34, echo=F} wid=getOption("width") options(width=80)  \scriptsize {r asphalt-35} with(s, data.frame(adjr2, outmat)) %>% rownames_to_column("model") %>% arrange(desc(adjr2))  \normalsize {r asphalt-36, echo=F} options(width=wid)  ## Revisiting the best model - Best model was our rut.6: {r asphalt-37} tidy(rut.6)  ## Revisiting (2) - Regression slopes say that rut depth increases as log-viscosity decreases, pct.a.surf increases and voids increases. This more or less checks out with out scatterplots against log.viscosity. - We should check residual plots again, though previous scatterplots say it's unlikely that there will be a problem: {r asphalt-38} g <- ggplot(rut.6, aes(y = .resid, x = .fitted)) + geom_point()  ## Residuals against fitted values {r asphalt-39} g  {r} ggplot(rut.6, aes(sample = .resid)) + stat_qq() + stat_qq_line()  ## Plotting residuals against x's - Do our trick again to put them all on one plot: {r asphalt-40} augment(rut.6, asphalt) %>% mutate(log_vis=log(viscosity)) %>% pivot_longer( c(pct.a.surf:voids, run, log_vis), names_to="xname", values_to="x", ) %>% ggplot(aes(y = .resid, x = x)) + geom_point() + facet_wrap(~xname, scales = "free") -> g2  ## Residuals against the x's {r asphalt-41} g2  ## Comments - None of the plots show any sort of pattern. The points all look random on each plot. - On the plot of fitted values (and on the one of log.viscosity), the points seem to form a "left half" and a "right half" with a gap in the middle. This is not a concern. - One of the pct.a.surf values is low outlier (4), shows up top left of that plot. - Only two possible values of run; the points in each group look randomly scattered around 0, with equal spreads. - Residuals seem to go above zero further than below, suggesting a mild non-normality, but not enough to be a problem. ## Variable-selection strategies - Expert knowledge. - Backward elimination. - All possible regressions. - Taking a variety of models to experts and asking their opinion. - Use a looser cutoff to eliminate variables in backward elimination (eg. only if P-value greater than 0.20). - If goal is prediction, eliminating worthless variables less important. - If goal is understanding, want to eliminate worthless variables where possible. - Results of variable selection not always reproducible, so caution advised.