---
title: "Case study: windmill"
editor: 
  markdown: 
    wrap: 72
---

## The windmill data

-   Engineer: does amount of electricity generated by windmill depend on
    how strongly wind blowing?
-   Measurements of wind speed and DC current generated at various
    times.
-   Assume the "various times" to be randomly selected --- aim to
    generalize to "this windmill at all times".
-   Research questions:
    -   Relationship between wind speed and current generated?
    -   If so, what kind of relationship?
    -   Can we model relationship to do predictions?

## Packages for this section

```{r windmill-1}
library(tidyverse)
library(broom)
```

## Reading in the data

```{r windmill-2}
my_url <- 
  "http://ritsokiguess.site/datafiles/windmill.csv"
windmill <- read_csv(my_url)
windmill
```

## Strategy

-   Two quantitative variables, looking for relationship: regression
    methods.
-   Start with picture (scatterplot).
-   Fit models and do model checking, fixing up things as necessary.
-   Scatterplot:
    -   2 variables, `DC_output` and `wind_velocity`.
    -   First is output/response, other is input/explanatory.
    -   Put `DC_output` on vertical scale.
-   Add trend, but don't want to assume linear:

```{r windmill-4, eval=FALSE}
ggplot(windmill, aes(y = DC_output, x = wind_velocity)) +
  geom_point() + geom_smooth() 
```

## Scatterplot

```{r windmill-5, echo=FALSE, message=FALSE}
ggplot(windmill, aes(y = DC_output, x = wind_velocity)) +
  geom_point() + geom_smooth(se = FALSE) 
```

## Comments

-   Definitely a relationship: as wind velocity increases, so does DC
    output. (As you'd expect.)
-   Is relationship linear? To help judge, `geom_smooth` smooths
    scatterplot trend. (Trend called "loess", "Locally weighted least
    squares" which downweights outliers. Not constrained to be
    straight.)
-   Trend more or less linear for while, then curves downwards
    (levelling off?). Straight line not so good here.

## Fit a straight line (and see what happens)

```{r windmill-7}
DC.1 <- lm(DC_output ~ wind_velocity, data = windmill)
summary(DC.1)
```

## Another way of looking at the output

-   The standard output tends to go off the bottom of the page rather
    easily. Package `broom` has these:

\scriptsize

```{r windmill-9}
glance(DC.1)
```

\normalsize

showing that the R-squared is 87%, and

\footnotesize

```{r windmill-10}
tidy(DC.1)
```

\normalsize

showing the intercept and slope and their significance.

## Comments

-   Strategy: `lm` actually fits the regression. Store results in a
    variable. Then look at the results, eg. via `summary` or
    `glance`/`tidy`.
-   My strategy for model names: base on response variable (or data
    frame name) and a number. Allows me to fit several models to same
    data and keep track of which is which.
-   Results actually pretty good: `wind.velocity` strongly significant,
    R-squared (87%) high.
-   How to check whether regression is appropriate? Look at the
    residuals, observed minus predicted, plotted against fitted
    (predicted).
-   Plot using the regression object as "data frame" (in a couple of
    slides).

## Scatterplot, but with line

```{r windmill-11}
#| message = FALSE
ggplot(windmill, aes(y = DC_output, x = wind_velocity)) +
  geom_point() + geom_smooth(method="lm", se = FALSE)
```

## Plot of residuals against fitted values

```{r windmill-13}
ggplot(DC.1, aes(y = .resid, x = .fitted)) + geom_point()
```

## Comments on residual plot

-   Residual plot should be a random scatter of points.
-   Should be no pattern "left over" after fitting the regression.
-   Smooth trend should be more or less straight across at 0.
-   Here, have a curved trend on residual plot.
-   This means original relationship must have been a curve (as we saw
    on original scatterplot).
-   Possible ways to fit a curve:
    -   Add a squared term in explanatory variable.
    -   Transform response variable (doesn't work well here).
    -   See what science tells you about mathematical form of
        relationship, and try to apply.

## normal quantile plot of residuals

```{r}
ggplot(DC.1, aes(sample = .resid)) + stat_qq() + stat_qq_line()
```

## Parabolas and fitting parabola model

-   A parabola has equation $$y = ax^2 + bx + c$$ with coefficients
    $a, b, c$. About the simplest function that is not a straight line.
-   Fit one using `lm` by adding $x^2$ to right side of model formula
    with +:

```{r windmill-14}
DC.2 <- lm(DC_output ~ wind_velocity + I(wind_velocity^2),
  data = windmill
)
```

-   The `I()` necessary because `^` in model formula otherwise means
    something different (to do with interactions in ANOVA).
-   Call it *parabola model*.

## Parabola model output

```{r windmill-16}
summary(DC.2)
# tidy(DC.2)
```

```{r}
summary(DC.2)
```

\scriptsize

```{r windmill-17}
glance(DC.2)
```

\normalsize

## Comments on output

-   R-squared has gone up a lot, from 87% (line) to 97% (parabola).
-   Coefficient of squared term strongly significant (P-value
    $6.59 \times 10^{−8}$).
-   Adding squared term has definitely improved fit of model.
-   Parabola model better than linear one.
-   But...need to check residuals again.

## Residual plot from parabola model

```{r windmill-18}
ggplot(DC.2, aes(y = .resid, x =  .fitted)) +
  geom_point()
```

## normal quantile plot of residuals

```{r}
ggplot(DC.2, aes(sample = .resid)) + stat_qq() + stat_qq_line()
```

This distribution has long tails, which should worry us at least some.

## Make scatterplot with fitted line and curve

-   Residual plot basically random. Good.
-   Scatterplot with fitted line and curve like this:

```{r fitcurve, eval=F}
ggplot(windmill, aes(y = DC_output, x = wind_velocity)) +
  geom_point() + geom_smooth(method = "lm", se = F) +
  geom_line(data = DC.2, aes(y = .fitted))
```

## Comments

-   This plots:
    -   scatterplot (`geom_point`);
    -   straight line (via tweak to `geom_smooth`, which draws
        best-fitting line);
    -   fitted curve, using the predicted `DC_output` values, joined by
        lines (with points not shown).
-   Trick in the `geom_line` is use the predictions as the `y`-points to
    join by lines (from `DC.2`), instead of the original data points.
    Without the `data` and `aes` in the `geom_line`, original data
    points would be joined by lines.

## Scatterplot with fitted line and curve

```{r windmill-19, ref.label="fitcurve", echo=F}
```

Curve clearly fits better than line.

## Another approach to a curve

-   There is a problem with parabolas, which we'll see later.

-   Ask engineer, "what should happen as wind velocity increases?":

    -   Upper limit on electricity generated, but otherwise, the larger
        the wind velocity, the more electricity generated.

-   Mathematically, *asymptote*. Straight lines and parabolas don't have
    them, but eg. $y = 1/x$ does: as $x$ gets bigger, $y$ approaches
    zero without reaching it.

-   What happens to $y = a + b(1/x)$ as $x$ gets large?

    -   $y$ gets closer and closer to $a$: that is, $a$ is asymptote.

-   Fit this, call it asymptote model.

-   Fitting the model here because we have math to justify it.

    -   Alternative, $y = a + be^{−x}$ , approaches asymptote faster.

## How to fit asymptote model?

-   Define new explanatory variable to be $1/x$, and predict $y$ from
    it.
-   $x$ is velocity, distance over time.
-   So $1/x$ is time over distance. In walking world, if you walk 5
    km/h, take 12 minutes to walk 1 km, called your pace. So 1 over
    `wind_velocity` we call `wind_pace`.
-   Make a scatterplot first to check for straightness (next page).

```{r straightness, fig.keep="none"}
windmill %>% mutate(wind_pace = 1 / wind_velocity) -> windmill
ggplot(windmill, aes(y = DC_output, x = wind_pace)) +
  geom_point() + geom_smooth(se = F)
```

-   and run regression like this (output page after):

```{r asyreg}
DC.3 <- lm(DC_output ~ wind_pace, data = windmill)
summary(DC.3)
```

## Scatterplot for wind_pace

Pretty straight. Blue actually smooth curve not line:

```{r windmill-20, ref.label="straightness", echo=F}
ggplot(windmill, aes(y = DC_output, x = wind_pace)) +
  geom_point() + geom_smooth(se = F)
```

## Regression output

\scriptsize

```{r windmill-21}
glance(DC.3)
```

\normalsize

```{r windmill-22}
tidy(DC.3)
```

## Comments

-   R-squared, 98%, even higher than for parabola model (97%).
-   Simpler model, only one explanatory variable (`wind.pace`) vs. 2 for
    parabola model (`wind.velocity` and its square).
-   `wind.pace` (unsurprisingly) strongly significant.
-   Looks good, but check residual plot (over).

## Residual plot for asymptote model

```{r resida}
ggplot(DC.3, aes(y = .resid, x = .fitted)) + geom_point()
```

## normal quantile plot of residuals

```{r}
ggplot(DC.3, aes(sample = .resid)) + stat_qq() + stat_qq_line()
```

This is skewed (left), but is not bad (and definitely better than the
one for the parabola model).

## Plotting trends on scatterplot

-   Residual plot not bad. But residuals go up to 0.10 and down to
    −0.20, suggesting possible skewness (not normal). I think it's not
    perfect, but OK overall.
-   Next: plot scatterplot with all three fitted lines/curves on it (for
    comparison), with legend saying which is which.
-   First make data frame containing what we need, taken from the right
    places:

```{r windmill-23}
w2 <- tibble(
  wind_velocity = windmill$wind_velocity,
  DC_output = windmill$DC_output,
  linear = fitted(DC.1),
  parabola = fitted(DC.2),
  asymptote = fitted(DC.3)
)
```

## What's in `w2`

```{r windmill-24}
w2
```

## Making the plot

-   `ggplot` likes to have one column of $x$'s to plot, and one column
    of $y$'s, with another column for distinguishing things.
-   But we have three columns of fitted values, that need to be combined
    into one.
-   `pivot_longer`, then plot:

```{r allcurves, eval=F}
w2 %>%
  pivot_longer(linear:asymptote, names_to="model", 
               values_to="fit") %>%
  ggplot(aes(x = wind_velocity, y = DC_output)) +
  geom_point() +
  geom_line(aes(y = fit, colour = model)) 
```

## Scatterplot with fitted curves

```{r windmill-25, ref.label= "allcurves", echo=F}

```

## Comments

-   Predictions from curves are very similar.
-   Predictions from asymptote model as good, and from simpler model
    (one $x$ not two), so prefer those.
-   Go back to asymptote model summary.

## Asymptote model summary

```{r windmill-26}
tidy(DC.3)
```

## Comments

-   Intercept in this model about 3.
-   Intercept of asymptote model is the asymptote (upper limit of
    `DC.output`).
-   Not close to asymptote yet.
-   Therefore, from this model, wind could get stronger and would
    generate appreciably more electricity.
-   This is extrapolation! Would like more data from times when
    `wind.velocity` higher.
-   Slope −7. Why negative?
    -   As wind.velocity increases, wind.pace goes down, and DC.output
        goes up. Check.
-   Actual slope number hard to interpret.

## Checking back in with research questions

-   Is there a relationship between wind speed and current generated?
    -   Yes.
-   If so, what kind of relationship is it?
    -   One with an asymptote.
-   Can we model the relationship, in such a way that we can do
    predictions?
    -   Yes, see model DC.3 and plot of fitted curve.
-   Good. Job done.

## Job done, kinda

-   Just because the parabola model and asymptote model agree over the
    range of the data, doesn't necessarily mean they agree everywhere.
-   Extend range of wind.velocity to 1 to 16 (steps of 0.5), and predict
    DC.output according to the two models:

```{r}
#| echo = FALSE 
options(width = 72)
```

```{r windmill-27}
wv <- seq(1, 16, 0.5)
wv
```

-   R has `predict`, which requires what to predict for, as data frame.
    The data frame has to contain values, with matching names, for all
    explanatory variables in regression(s).

## Setting up data frame to predict from

-   Linear model had just `wind_velocity`.
-   Parabola model had that as well (squared one will be calculated)
-   Asymptote model had just `wind_pace` (reciprocal of velocity).
-   So create data frame called `wv_new` with those in:

```{r windmill-28}
wv_new <- tibble(wind_velocity = wv, wind_pace = 1 / wv)
```

## `wv_new`

```{r windmill-29}
wv_new
```

## Doing predictions, one for each model

-   Use same names as before:

```{r windmill-30}
linear <- predict(DC.1, wv_new)
parabola <- predict(DC.2, wv_new)
asymptote <- predict(DC.3, wv_new)
```

-   Put it all into a data frame for plotting, along with original data:

```{r windmill-31}
my_fits <- tibble(
  wind_velocity = wv_new$wind_velocity,
  linear, parabola, asymptote
)
```

## `my_fits`

```{r windmill-32}
my_fits
```

## Making a plot 1/2

-   To make a plot, we use the same trick as last time to get all three
    predictions on a plot with a legend (saving result to add to later):

```{r windmill-33}
my_fits %>%
    pivot_longer(
    linear:asymptote,
    names_to="model", 
    values_to="fit"
  ) %>%
  ggplot(aes(
    y = fit, x = wind_velocity,
    colour = model
  )) + geom_line() -> g
```

## Making a plot 2/2

-   The observed wind velocities were in this range:

```{r windmill-34}
(vels <- range(windmill$wind_velocity))
```

-   `DC.output` between 0 and 3 from asymptote model. Add rectangle to
    graph around where the data were:

```{r rectangle, eval=F}
g + geom_rect(
  xmin = vels[1], xmax = vels[2], ymin = 0, ymax = 3,
  alpha=0, colour = "black"
)
```

## The plot

```{r windmill-35, ref.label="rectangle", echo=F}
```

## Comments (1)

-   Over range of data, two models agree with each other well.
-   Outside range of data, they disagree violently!
-   For larger `wind.velocity`, asymptote model behaves reasonably,
    parabola model does not.
-   What happens as `wind.velocity` goes to zero? Should find
    `DC.output` goes to zero as well. Does it?

## Comments (2)

-   For parabola model:

```{r windmill-36}
tidy(DC.2)
```

-   Nope, goes to −1.16 (intercept), actually significantly different
    from zero.

## Comments (3): asymptote model

\small

```{r windmill-37}
tidy(DC.3)
```

\normalsize

-   As `wind.velocity` heads to 0, wind.pace heads to $+\infty$, so
    DC.output heads to $−\infty$!
-   Also need more data for small `wind.velocity` to understand
    relationship. (Is there a lower asymptote?)
-   Best we can do now is to predict `DC.output` to be zero for small
    `wind.velocity`.
-   Assumes a "threshold" wind velocity below which no electricity
    generated at all.

## Summary

-   Often, in data analysis, there is no completely satisfactory
    conclusion, as here.
-   Have to settle for model that works OK, with restrictions.
-   Always something else you can try.
-   At some point you have to say "I stop."