--- title: "Case study: windmill" editor: markdown: wrap: 72 --- ## The windmill data - Engineer: does amount of electricity generated by windmill depend on how strongly wind blowing? - Measurements of wind speed and DC current generated at various times. - Assume the "various times" to be randomly selected --- aim to generalize to "this windmill at all times". - Research questions: - Relationship between wind speed and current generated? - If so, what kind of relationship? - Can we model relationship to do predictions? ## Packages for this section {r windmill-1} library(tidyverse) library(broom)  ## Reading in the data {r windmill-2} my_url <- "http://ritsokiguess.site/datafiles/windmill.csv" windmill <- read_csv(my_url) windmill  ## Strategy - Two quantitative variables, looking for relationship: regression methods. - Start with picture (scatterplot). - Fit models and do model checking, fixing up things as necessary. - Scatterplot: - 2 variables, DC_output and wind_velocity. - First is output/response, other is input/explanatory. - Put DC_output on vertical scale. - Add trend, but don't want to assume linear: {r windmill-4, eval=FALSE} ggplot(windmill, aes(y = DC_output, x = wind_velocity)) + geom_point() + geom_smooth()  ## Scatterplot {r windmill-5, echo=FALSE, message=FALSE} ggplot(windmill, aes(y = DC_output, x = wind_velocity)) + geom_point() + geom_smooth(se = FALSE)  ## Comments - Definitely a relationship: as wind velocity increases, so does DC output. (As you'd expect.) - Is relationship linear? To help judge, geom_smooth smooths scatterplot trend. (Trend called "loess", "Locally weighted least squares" which downweights outliers. Not constrained to be straight.) - Trend more or less linear for while, then curves downwards (levelling off?). Straight line not so good here. ## Fit a straight line (and see what happens) {r windmill-7} DC.1 <- lm(DC_output ~ wind_velocity, data = windmill) summary(DC.1)  ## Another way of looking at the output - The standard output tends to go off the bottom of the page rather easily. Package broom has these: \scriptsize {r windmill-9} glance(DC.1)  \normalsize showing that the R-squared is 87%, and \footnotesize {r windmill-10} tidy(DC.1)  \normalsize showing the intercept and slope and their significance. ## Comments - Strategy: lm actually fits the regression. Store results in a variable. Then look at the results, eg. via summary or glance/tidy. - My strategy for model names: base on response variable (or data frame name) and a number. Allows me to fit several models to same data and keep track of which is which. - Results actually pretty good: wind.velocity strongly significant, R-squared (87%) high. - How to check whether regression is appropriate? Look at the residuals, observed minus predicted, plotted against fitted (predicted). - Plot using the regression object as "data frame" (in a couple of slides). ## Scatterplot, but with line {r windmill-11} #| message = FALSE ggplot(windmill, aes(y = DC_output, x = wind_velocity)) + geom_point() + geom_smooth(method="lm", se = FALSE)  ## Plot of residuals against fitted values {r windmill-13} ggplot(DC.1, aes(y = .resid, x = .fitted)) + geom_point()  ## Comments on residual plot - Residual plot should be a random scatter of points. - Should be no pattern "left over" after fitting the regression. - Smooth trend should be more or less straight across at 0. - Here, have a curved trend on residual plot. - This means original relationship must have been a curve (as we saw on original scatterplot). - Possible ways to fit a curve: - Add a squared term in explanatory variable. - Transform response variable (doesn't work well here). - See what science tells you about mathematical form of relationship, and try to apply. ## normal quantile plot of residuals {r} ggplot(DC.1, aes(sample = .resid)) + stat_qq() + stat_qq_line()  ## Parabolas and fitting parabola model - A parabola has equation $$y = ax^2 + bx + c$$ with coefficients $a, b, c$. About the simplest function that is not a straight line. - Fit one using lm by adding $x^2$ to right side of model formula with +: {r windmill-14} DC.2 <- lm(DC_output ~ wind_velocity + I(wind_velocity^2), data = windmill )  - The I() necessary because ^ in model formula otherwise means something different (to do with interactions in ANOVA). - Call it *parabola model*. ## Parabola model output {r windmill-16} summary(DC.2) # tidy(DC.2)  {r} summary(DC.2)  \scriptsize {r windmill-17} glance(DC.2)  \normalsize ## Comments on output - R-squared has gone up a lot, from 87% (line) to 97% (parabola). - Coefficient of squared term strongly significant (P-value $6.59 \times 10^{−8}$). - Adding squared term has definitely improved fit of model. - Parabola model better than linear one. - But...need to check residuals again. ## Residual plot from parabola model {r windmill-18} ggplot(DC.2, aes(y = .resid, x = .fitted)) + geom_point()  ## normal quantile plot of residuals {r} ggplot(DC.2, aes(sample = .resid)) + stat_qq() + stat_qq_line()  This distribution has long tails, which should worry us at least some. ## Make scatterplot with fitted line and curve - Residual plot basically random. Good. - Scatterplot with fitted line and curve like this: {r fitcurve, eval=F} ggplot(windmill, aes(y = DC_output, x = wind_velocity)) + geom_point() + geom_smooth(method = "lm", se = F) + geom_line(data = DC.2, aes(y = .fitted))  ## Comments - This plots: - scatterplot (geom_point); - straight line (via tweak to geom_smooth, which draws best-fitting line); - fitted curve, using the predicted DC_output values, joined by lines (with points not shown). - Trick in the geom_line is use the predictions as the y-points to join by lines (from DC.2), instead of the original data points. Without the data and aes in the geom_line, original data points would be joined by lines. ## Scatterplot with fitted line and curve {r windmill-19, ref.label="fitcurve", echo=F}  Curve clearly fits better than line. ## Another approach to a curve - There is a problem with parabolas, which we'll see later. - Ask engineer, "what should happen as wind velocity increases?": - Upper limit on electricity generated, but otherwise, the larger the wind velocity, the more electricity generated. - Mathematically, *asymptote*. Straight lines and parabolas don't have them, but eg. $y = 1/x$ does: as $x$ gets bigger, $y$ approaches zero without reaching it. - What happens to $y = a + b(1/x)$ as $x$ gets large? - $y$ gets closer and closer to $a$: that is, $a$ is asymptote. - Fit this, call it asymptote model. - Fitting the model here because we have math to justify it. - Alternative, $y = a + be^{−x}$ , approaches asymptote faster. ## How to fit asymptote model? - Define new explanatory variable to be $1/x$, and predict $y$ from it. - $x$ is velocity, distance over time. - So $1/x$ is time over distance. In walking world, if you walk 5 km/h, take 12 minutes to walk 1 km, called your pace. So 1 over wind_velocity we call wind_pace. - Make a scatterplot first to check for straightness (next page). {r straightness, fig.keep="none"} windmill %>% mutate(wind_pace = 1 / wind_velocity) -> windmill ggplot(windmill, aes(y = DC_output, x = wind_pace)) + geom_point() + geom_smooth(se = F)  - and run regression like this (output page after): {r asyreg} DC.3 <- lm(DC_output ~ wind_pace, data = windmill) summary(DC.3)  ## Scatterplot for wind_pace Pretty straight. Blue actually smooth curve not line: {r windmill-20, ref.label="straightness", echo=F} ggplot(windmill, aes(y = DC_output, x = wind_pace)) + geom_point() + geom_smooth(se = F)  ## Regression output \scriptsize {r windmill-21} glance(DC.3)  \normalsize {r windmill-22} tidy(DC.3)  ## Comments - R-squared, 98%, even higher than for parabola model (97%). - Simpler model, only one explanatory variable (wind.pace) vs. 2 for parabola model (wind.velocity and its square). - wind.pace (unsurprisingly) strongly significant. - Looks good, but check residual plot (over). ## Residual plot for asymptote model {r resida} ggplot(DC.3, aes(y = .resid, x = .fitted)) + geom_point()  ## normal quantile plot of residuals {r} ggplot(DC.3, aes(sample = .resid)) + stat_qq() + stat_qq_line()  This is skewed (left), but is not bad (and definitely better than the one for the parabola model). ## Plotting trends on scatterplot - Residual plot not bad. But residuals go up to 0.10 and down to −0.20, suggesting possible skewness (not normal). I think it's not perfect, but OK overall. - Next: plot scatterplot with all three fitted lines/curves on it (for comparison), with legend saying which is which. - First make data frame containing what we need, taken from the right places: {r windmill-23} w2 <- tibble( wind_velocity = windmill$wind_velocity, DC_output = windmill$DC_output, linear = fitted(DC.1), parabola = fitted(DC.2), asymptote = fitted(DC.3) )  ## What's in w2 {r windmill-24} w2  ## Making the plot - ggplot likes to have one column of $x$'s to plot, and one column of $y$'s, with another column for distinguishing things. - But we have three columns of fitted values, that need to be combined into one. - pivot_longer, then plot: {r allcurves, eval=F} w2 %>% pivot_longer(linear:asymptote, names_to="model", values_to="fit") %>% ggplot(aes(x = wind_velocity, y = DC_output)) + geom_point() + geom_line(aes(y = fit, colour = model))  ## Scatterplot with fitted curves {r windmill-25, ref.label= "allcurves", echo=F}  ## Comments - Predictions from curves are very similar. - Predictions from asymptote model as good, and from simpler model (one $x$ not two), so prefer those. - Go back to asymptote model summary. ## Asymptote model summary {r windmill-26} tidy(DC.3)  ## Comments - Intercept in this model about 3. - Intercept of asymptote model is the asymptote (upper limit of DC.output). - Not close to asymptote yet. - Therefore, from this model, wind could get stronger and would generate appreciably more electricity. - This is extrapolation! Would like more data from times when wind.velocity higher. - Slope −7. Why negative? - As wind.velocity increases, wind.pace goes down, and DC.output goes up. Check. - Actual slope number hard to interpret. ## Checking back in with research questions - Is there a relationship between wind speed and current generated? - Yes. - If so, what kind of relationship is it? - One with an asymptote. - Can we model the relationship, in such a way that we can do predictions? - Yes, see model DC.3 and plot of fitted curve. - Good. Job done. ## Job done, kinda - Just because the parabola model and asymptote model agree over the range of the data, doesn't necessarily mean they agree everywhere. - Extend range of wind.velocity to 1 to 16 (steps of 0.5), and predict DC.output according to the two models: {r} #| echo = FALSE options(width = 72)  {r windmill-27} wv <- seq(1, 16, 0.5) wv  - R has predict, which requires what to predict for, as data frame. The data frame has to contain values, with matching names, for all explanatory variables in regression(s). ## Setting up data frame to predict from - Linear model had just wind_velocity. - Parabola model had that as well (squared one will be calculated) - Asymptote model had just wind_pace (reciprocal of velocity). - So create data frame called wv_new with those in: {r windmill-28} wv_new <- tibble(wind_velocity = wv, wind_pace = 1 / wv)  ## wv_new {r windmill-29} wv_new  ## Doing predictions, one for each model - Use same names as before: {r windmill-30} linear <- predict(DC.1, wv_new) parabola <- predict(DC.2, wv_new) asymptote <- predict(DC.3, wv_new)  - Put it all into a data frame for plotting, along with original data: {r windmill-31} my_fits <- tibble( wind_velocity = wv_new$wind_velocity, linear, parabola, asymptote )  ## my_fits {r windmill-32} my_fits  ## Making a plot 1/2 - To make a plot, we use the same trick as last time to get all three predictions on a plot with a legend (saving result to add to later): {r windmill-33} my_fits %>% pivot_longer( linear:asymptote, names_to="model", values_to="fit" ) %>% ggplot(aes( y = fit, x = wind_velocity, colour = model )) + geom_line() -> g  ## Making a plot 2/2 - The observed wind velocities were in this range: {r windmill-34} (vels <- range(windmill$wind_velocity))  - DC.output between 0 and 3 from asymptote model. Add rectangle to graph around where the data were: {r rectangle, eval=F} g + geom_rect( xmin = vels[1], xmax = vels[2], ymin = 0, ymax = 3, alpha=0, colour = "black" )  ## The plot {r windmill-35, ref.label="rectangle", echo=F}  ## Comments (1) - Over range of data, two models agree with each other well. - Outside range of data, they disagree violently! - For larger wind.velocity, asymptote model behaves reasonably, parabola model does not. - What happens as wind.velocity goes to zero? Should find DC.output goes to zero as well. Does it? ## Comments (2) - For parabola model: {r windmill-36} tidy(DC.2)  - Nope, goes to −1.16 (intercept), actually significantly different from zero. ## Comments (3): asymptote model \small {r windmill-37} tidy(DC.3)  \normalsize - As wind.velocity heads to 0, wind.pace heads to $+\infty$, so DC.output heads to $−\infty$! - Also need more data for small wind.velocity to understand relationship. (Is there a lower asymptote?) - Best we can do now is to predict DC.output to be zero for small wind.velocity. - Assumes a "threshold" wind velocity below which no electricity generated at all. ## Summary - Often, in data analysis, there is no completely satisfactory conclusion, as here. - Have to settle for model that works OK, with restrictions. - Always something else you can try. - At some point you have to say "I stop."