---
title: "Case study: windmill"
editor:
  markdown:
    wrap: 72
---
## The windmill data
- Engineer: does amount of electricity generated by windmill depend on
how strongly wind blowing?
- Measurements of wind speed and DC current generated at various
times.
- Assume the "various times" to be randomly selected --- aim to
generalize to "this windmill at all times".
- Research questions:
    - Relationship between wind speed and current generated?
    - If so, what kind of relationship?
    - Can we model relationship to do predictions?
## Packages for this section
```{r windmill-1}
library(tidyverse)
library(broom)
```
## Reading in the data
```{r windmill-2}
my_url <-
  "http://ritsokiguess.site/datafiles/windmill.csv"
windmill <- read_csv(my_url)
windmill
```
## Strategy
- Two quantitative variables, looking for relationship: regression
methods.
- Start with picture (scatterplot).
- Fit models and do model checking, fixing up things as necessary.
- Scatterplot:
    - 2 variables, `DC_output` and `wind_velocity`.
    - First is output/response, other is input/explanatory.
    - Put `DC_output` on vertical scale.
    - Add trend, but don't want to assume linear:
```{r windmill-4, eval=FALSE}
ggplot(windmill, aes(y = DC_output, x = wind_velocity)) +
  geom_point() + geom_smooth()
```
## Scatterplot
```{r windmill-5, echo=FALSE, message=FALSE}
ggplot(windmill, aes(y = DC_output, x = wind_velocity)) +
  geom_point() + geom_smooth(se = FALSE)
```
## Comments
- Definitely a relationship: as wind velocity increases, so does DC
output. (As you'd expect.)
- Is relationship linear? To help judge, `geom_smooth` smooths
scatterplot trend. (Trend called "loess": locally weighted
regression, fitted using nearby points. Not constrained to be
straight.)
- Trend more or less linear for a while, then curves downwards
(levelling off?). Straight line not so good here.
## Fit a straight line (and see what happens)
```{r windmill-7}
DC.1 <- lm(DC_output ~ wind_velocity, data = windmill)
summary(DC.1)
```
## Another way of looking at the output
- The standard output tends to go off the bottom of the page rather
easily. Package `broom` has these:
\scriptsize
```{r windmill-9}
glance(DC.1)
```
\normalsize
showing that the R-squared is 87%, and
\footnotesize
```{r windmill-10}
tidy(DC.1)
```
\normalsize
showing the intercept and slope and their significance.
## Comments
- Strategy: `lm` actually fits the regression. Store results in a
variable. Then look at the results, e.g. via `summary` or
`glance`/`tidy`.
- My strategy for model names: base on response variable (or data
frame name) and a number. Allows me to fit several models to same
data and keep track of which is which.
- Results actually pretty good: `wind_velocity` strongly significant,
R-squared (87%) high.
- How to check whether regression is appropriate? Look at the
residuals, observed minus predicted, plotted against fitted
(predicted).
- Plot using the regression object as "data frame" (in a couple of
slides).
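To make "observed minus predicted" concrete, here is a minimal sketch on simulated data (made up for illustration, not the windmill measurements; the names `x`, `y`, `fit` and `by_hand` are hypothetical). It checks that `resid` returns exactly the observed response minus the fitted values.

```{r}
# Simulated data, for illustration only (not the windmill measurements)
set.seed(1)
x <- runif(20, 2, 10)
y <- 1 + 0.25 * x + rnorm(20, sd = 0.2)
fit <- lm(y ~ x)

# Residual = observed response minus fitted (predicted) response
by_hand <- y - fitted(fit)

# TRUE if the hand calculation matches what lm stored
all.equal(unname(by_hand), unname(resid(fit)))
```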
## Scatterplot, but with line
```{r windmill-11}
#| message = FALSE
ggplot(windmill, aes(y = DC_output, x = wind_velocity)) +
  geom_point() + geom_smooth(method="lm", se = FALSE)
```
## Plot of residuals against fitted values
```{r windmill-13}
ggplot(DC.1, aes(y = .resid, x = .fitted)) + geom_point()
```
## Comments on residual plot
- Residual plot should be a random scatter of points.
- Should be no pattern "left over" after fitting the regression.
- Smooth trend should be more or less straight across at 0.
- Here, have a curved trend on residual plot.
- This means original relationship must have been a curve (as we saw
on original scatterplot).
- Possible ways to fit a curve:
    - Add a squared term in explanatory variable.
    - Transform response variable (doesn't work well here).
    - See what science tells you about mathematical form of
      relationship, and try to apply.
## Normal quantile plot of residuals
```{r}
ggplot(DC.1, aes(sample = .resid)) + stat_qq() + stat_qq_line()
```
## Parabolas and fitting parabola model
- A parabola has equation $$y = ax^2 + bx + c$$ with coefficients
$a, b, c$. About the simplest function that is not a straight line.
- Fit one using `lm` by adding $x^2$ to right side of model formula
with +:
```{r windmill-14}
DC.2 <- lm(DC_output ~ wind_velocity + I(wind_velocity^2),
  data = windmill
)
```
- The `I()` necessary because `^` in model formula otherwise means
something different (to do with interactions in ANOVA).
- Call it *parabola model*.
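A small sketch of why `I()` matters, using a made-up data frame `d` (hypothetical, for illustration only). `model.matrix` shows which columns a formula actually produces: without `I()` the squared term silently disappears.

```{r}
# Made-up predictor values, for illustration only
d <- data.frame(x = c(1, 2, 3, 4))

# Without I(): in a formula, x^2 means "x crossed with itself",
# which collapses to just x, so no squared column appears
without_I <- model.matrix(~ x + x^2, data = d)
colnames(without_I)

# With I(): the arithmetic square is computed and added as a column
with_I <- model.matrix(~ x + I(x^2), data = d)
colnames(with_I)
```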
## Parabola model output
```{r windmill-16}
summary(DC.2)
# tidy(DC.2)
```
\scriptsize
```{r windmill-17}
glance(DC.2)
```
\normalsize
## Comments on output
- R-squared has gone up a lot, from 87% (line) to 97% (parabola).
- Coefficient of squared term strongly significant (P-value
$6.59 \times 10^{-8}$).
- Adding squared term has definitely improved fit of model.
- Parabola model better than linear one.
- But...need to check residuals again.
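Another way to ask "has the squared term improved the fit?" is an F-test comparing the two nested models with `anova`. This is a sketch on simulated data (made up for illustration; `line_fit`, `parabola_fit` and `comparison` are hypothetical names), not output from the windmill fits. With only one term added, the F statistic is the square of that term's t statistic, so the two tests agree.

```{r}
# Simulated curved data, for illustration only
set.seed(2)
x <- seq(2, 10, length.out = 25)
y <- -1 + 0.7 * x - 0.04 * x^2 + rnorm(25, sd = 0.1)

line_fit <- lm(y ~ x)
parabola_fit <- lm(y ~ x + I(x^2))

# F-test: does adding the squared term improve the fit?
comparison <- anova(line_fit, parabola_fit)
comparison

# With one added term, F equals the squared t statistic of that term
t_squared <- coef(summary(parabola_fit))["I(x^2)", "t value"]^2
all.equal(comparison$F[2], t_squared)
```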
## Residual plot from parabola model
```{r windmill-18}
ggplot(DC.2, aes(y = .resid, x = .fitted)) +
  geom_point()
```
## Normal quantile plot of residuals
```{r}
ggplot(DC.2, aes(sample = .resid)) + stat_qq() + stat_qq_line()
```
This distribution has long tails, which should worry us at least some.
## Make scatterplot with fitted line and curve
- Residual plot basically random. Good.
- Scatterplot with fitted line and curve like this:
```{r fitcurve, eval=F}
ggplot(windmill, aes(y = DC_output, x = wind_velocity)) +
  geom_point() + geom_smooth(method = "lm", se = F) +
  geom_line(data = DC.2, aes(y = .fitted))
```
## Comments
- This plots:
    - scatterplot (`geom_point`);
    - straight line (via tweak to `geom_smooth`, which draws
      best-fitting line);
    - fitted curve, using the predicted `DC_output` values, joined by
      lines (with points not shown).
- Trick in the `geom_line` is to use the predictions as the `y`-points
to join by lines (from `DC.2`), instead of the original data points.
Without the `data` and `aes` in the `geom_line`, original data
points would be joined by lines.
## Scatterplot with fitted line and curve
```{r windmill-19, ref.label="fitcurve", echo=F}
```
Curve clearly fits better than line.
## Another approach to a curve
- There is a problem with parabolas, which we'll see later.
- Ask engineer, "what should happen as wind velocity increases?":
    - Upper limit on electricity generated, but otherwise, the larger
      the wind velocity, the more electricity generated.
- Mathematically, *asymptote*. Straight lines and parabolas don't have
them, but e.g. $y = 1/x$ does: as $x$ gets bigger, $y$ approaches
zero without reaching it.
- What happens to $y = a + b(1/x)$ as $x$ gets large?
    - $y$ gets closer and closer to $a$: that is, $a$ is asymptote.
- Fit this, call it asymptote model.
- Fitting the model here because we have math to justify it.
- Alternative, $y = a + be^{-x}$, approaches asymptote faster.
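A quick numerical check of the asymptote idea. The coefficients `a` and `b` below are hypothetical round numbers chosen only for illustration, not fitted values. Both curves head towards `a` as `x` grows, and the exponential one gets there faster.

```{r}
# Hypothetical coefficients, chosen only for illustration
a <- 3
b <- -7
x <- c(1, 10, 100, 1000)

reciprocal <- a + b / x         # y = a + b(1/x)
exponential <- a + b * exp(-x)  # y = a + b e^{-x}

# Both columns approach a = 3 as x increases
data.frame(x, reciprocal, exponential)
```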
## How to fit asymptote model?
- Define new explanatory variable to be $1/x$, and predict $y$ from
it.
- $x$ is velocity, distance over time.
- So $1/x$ is time over distance. In walking world, if you walk 5
km/h, take 12 minutes to walk 1 km, called your pace. So 1 over
`wind_velocity` we call `wind_pace`.
- Make a scatterplot first to check for straightness (next page).
```{r straightness, fig.keep="none"}
windmill %>% mutate(wind_pace = 1 / wind_velocity) -> windmill
ggplot(windmill, aes(y = DC_output, x = wind_pace)) +
  geom_point() + geom_smooth(se = F)
```
- and run regression like this (output page after):
```{r asyreg}
DC.3 <- lm(DC_output ~ wind_pace, data = windmill)
summary(DC.3)
```
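An alternative sketch, not the approach used in these notes: the reciprocal can be written directly in the model formula with `I()`, so that predictions need only `wind_velocity`. The data frame `sim` is simulated for illustration (not the windmill data), and `fit_column` and `fit_inline` are hypothetical names.

```{r}
# Simulated data with a built-in asymptote, for illustration only
set.seed(3)
sim <- data.frame(wind_velocity = runif(25, 2.5, 10))
sim$DC_output <- 3 - 7 / sim$wind_velocity + rnorm(25, sd = 0.1)
sim$wind_pace <- 1 / sim$wind_velocity

# Same model fitted two ways: new column vs. reciprocal in the formula
fit_column <- lm(DC_output ~ wind_pace, data = sim)
fit_inline <- lm(DC_output ~ I(1 / wind_velocity), data = sim)

# The estimated intercept and slope are identical
all.equal(unname(coef(fit_column)), unname(coef(fit_inline)))

# The inline version predicts from wind_velocity directly
pred_inline <- predict(fit_inline, data.frame(wind_velocity = c(5, 10)))
pred_inline
```

Defining `wind_pace` as its own column, as above in the notes, has the advantage that it can be plotted first to check for straightness.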
## Scatterplot for wind_pace
Pretty straight. Blue trend is actually a smooth curve, not a line:
```{r windmill-20, echo=F}
ggplot(windmill, aes(y = DC_output, x = wind_pace)) +
  geom_point() + geom_smooth(se = F)
```
## Regression output
\scriptsize
```{r windmill-21}
glance(DC.3)
```
\normalsize
```{r windmill-22}
tidy(DC.3)
```
## Comments
- R-squared, 98%, even higher than for parabola model (97%).
- Simpler model, only one explanatory variable (`wind_pace`) vs. 2 for
parabola model (`wind_velocity` and its square).
- `wind_pace` (unsurprisingly) strongly significant.
- Looks good, but check residual plot (over).
## Residual plot for asymptote model
```{r resida}
ggplot(DC.3, aes(y = .resid, x = .fitted)) + geom_point()
```
## Normal quantile plot of residuals
```{r}
ggplot(DC.3, aes(sample = .resid)) + stat_qq() + stat_qq_line()
```
This is skewed (left), but is not bad (and definitely better than the
one for the parabola model).
## Plotting trends on scatterplot
- Residual plot not bad. But residuals go up to 0.10 and down to
−0.20, suggesting possible skewness (not normal). I think it's not
perfect, but OK overall.
- Next: plot scatterplot with all three fitted lines/curves on it (for
comparison), with legend saying which is which.
- First make data frame containing what we need, taken from the right
places:
```{r windmill-23}
w2 <- tibble(
  wind_velocity = windmill$wind_velocity,
  DC_output = windmill$DC_output,
  linear = fitted(DC.1),
  parabola = fitted(DC.2),
  asymptote = fitted(DC.3)
)
```
## What's in `w2`
```{r windmill-24}
w2
```
## Making the plot
- `ggplot` likes to have one column of $x$'s to plot, and one column
of $y$'s, with another column for distinguishing things.
- But we have three columns of fitted values, that need to be combined
into one.
- `pivot_longer`, then plot:
```{r allcurves, eval=F}
w2 %>%
  pivot_longer(linear:asymptote, names_to="model",
               values_to="fit") %>%
  ggplot(aes(x = wind_velocity, y = DC_output)) +
  geom_point() +
  geom_line(aes(y = fit, colour = model))
```
## Scatterplot with fitted curves
```{r windmill-25, ref.label= "allcurves", echo=F}
```
## Comments
- Predictions from curves are very similar.
- Predictions from asymptote model as good, and from simpler model
(one $x$ not two), so prefer those.
- Go back to asymptote model summary.
## Asymptote model summary
```{r windmill-26}
tidy(DC.3)
```
## Comments
- Intercept in this model about 3.
- Intercept of asymptote model is the asymptote (upper limit of
`DC_output`).
- Not close to asymptote yet.
- Therefore, from this model, wind could get stronger and would
generate appreciably more electricity.
- This is extrapolation! Would like more data from times when
`wind_velocity` higher.
- Slope −7. Why negative?
    - As `wind_velocity` increases, `wind_pace` goes down, and
      `DC_output` goes up. Check.
- Actual slope number hard to interpret.
## Checking back in with research questions
- Is there a relationship between wind speed and current generated?
    - Yes.
- If so, what kind of relationship is it?
    - One with an asymptote.
- Can we model the relationship, in such a way that we can do
predictions?
    - Yes, see model `DC.3` and plot of fitted curve.
- Good. Job done.
## Job done, kinda
- Just because the parabola model and asymptote model agree over the
range of the data, doesn't necessarily mean they agree everywhere.
- Extend range of `wind_velocity` to 1 to 16 (steps of 0.5), and
predict `DC_output` according to the two models:
```{r}
#| echo = FALSE
options(width = 72)
```
```{r windmill-27}
wv <- seq(1, 16, 0.5)
wv
```
- R has `predict`, which requires what to predict for, as data frame.
The data frame has to contain values, with matching names, for all
explanatory variables in regression(s).
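A minimal sketch of how `predict` uses the new data frame, on simulated data (made up for illustration; `sim`, `fit`, `new_velocities`, `pred` and `by_hand` are hypothetical names). The column in the new data frame has the same name as the explanatory variable in the model, and the predictions match intercept plus slope times the new values.

```{r}
# Simulated data, for illustration only (not the windmill measurements)
set.seed(4)
sim <- data.frame(wind_velocity = runif(20, 2.5, 10))
sim$DC_output <- 0.1 + 0.24 * sim$wind_velocity + rnorm(20, sd = 0.2)
fit <- lm(DC_output ~ wind_velocity, data = sim)

# New data frame: column name matches the explanatory variable
new_velocities <- data.frame(wind_velocity = c(4, 8))
pred <- predict(fit, new_velocities)

# Same thing by hand: intercept + slope * new value
by_hand <- coef(fit)[1] + coef(fit)[2] * new_velocities$wind_velocity
all.equal(unname(pred), unname(by_hand))
```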
## Setting up data frame to predict from
- Linear model had just `wind_velocity`.
- Parabola model had that as well (squared one will be calculated).
- Asymptote model had just `wind_pace` (reciprocal of velocity).
- So create data frame called `wv_new` with those in:
```{r windmill-28}
wv_new <- tibble(wind_velocity = wv, wind_pace = 1 / wv)
```
## `wv_new`
```{r windmill-29}
wv_new
```
## Doing predictions, one for each model
- Use same names as before:
```{r windmill-30}
linear <- predict(DC.1, wv_new)
parabola <- predict(DC.2, wv_new)
asymptote <- predict(DC.3, wv_new)
```
- Put it all into a data frame for plotting, along with the wind
velocities predicted for:
```{r windmill-31}
my_fits <- tibble(
  wind_velocity = wv_new$wind_velocity,
  linear, parabola, asymptote
)
```
## `my_fits`
```{r windmill-32}
my_fits
```
## Making a plot 1/2
- To make a plot, we use the same trick as last time to get all three
predictions on a plot with a legend (saving result to add to later):
```{r windmill-33}
my_fits %>%
  pivot_longer(
    linear:asymptote,
    names_to="model",
    values_to="fit"
  ) %>%
  ggplot(aes(
    y = fit, x = wind_velocity,
    colour = model
  )) + geom_line() -> g
```
## Making a plot 2/2
- The observed wind velocities were in this range:
```{r windmill-34}
(vels <- range(windmill$wind_velocity))
```
- `DC_output` between 0 and 3 from asymptote model. Add rectangle to
graph around where the data were:
```{r rectangle, eval=F}
g + geom_rect(
  xmin = vels[1], xmax = vels[2], ymin = 0, ymax = 3,
  alpha=0, colour = "black"
)
```
## The plot
```{r windmill-35, ref.label="rectangle", echo=F}
```
## Comments (1)
- Over range of data, two models agree with each other well.
- Outside range of data, they disagree violently!
- For larger `wind_velocity`, asymptote model behaves reasonably,
parabola model does not.
- What happens as `wind_velocity` goes to zero? Should find
`DC_output` goes to zero as well. Does it?
## Comments (2)
- For parabola model:
```{r windmill-36}
tidy(DC.2)
```
- Nope, goes to −1.16 (intercept), actually significantly different
from zero.
## Comments (3): asymptote model
\small
```{r windmill-37}
tidy(DC.3)
```
\normalsize
- As `wind_velocity` heads to 0, `wind_pace` heads to $+\infty$, so
`DC_output` heads to $-\infty$!
- Also need more data for small `wind_velocity` to understand
relationship. (Is there a lower asymptote?)
- Best we can do now is to predict `DC_output` to be zero for small
`wind_velocity`.
- Assumes a "threshold" wind velocity below which no electricity
generated at all.
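One way the threshold idea might be sketched: replace any negative prediction by zero using `pmax`. The coefficients `a` and `b` are hypothetical rounded values for illustration, not the fitted ones, and `velocity`, `raw`, `thresholded` and `threshold` are made-up names. Under these assumed values, the curve $a + b/x$ crosses zero at $x = -b/a$.

```{r}
# Hypothetical coefficients, rounded for illustration only
a <- 3
b <- -7
velocity <- c(0.5, 1, 2, 3, 5, 10)

raw <- a + b / velocity      # asymptote-model prediction, can go negative
thresholded <- pmax(raw, 0)  # predict zero output instead of negative
threshold <- -b / a          # velocity at which a + b/x equals zero

data.frame(velocity, raw, thresholded)
threshold
```

Below the assumed threshold velocity the predicted output is zero; above it, predictions are unchanged.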
## Summary
- Often, in data analysis, there is no completely satisfactory
conclusion, as here.
- Have to settle for model that works OK, with restrictions.
- Always something else you can try.
- At some point you have to say "I stop."