---
title: "Introduction to linear regression"
subtitle: "Data Science for Biologists, Spring 2020"
author: "YOUR NAME GOES HERE"
output:
html_document:
highlight: tango
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(broom)
```
## Instructions
Standard grading criteria apply, except there is no "answer style" - just write out answers normally! Note the following:
+ When adding regression lines as plot subtitles, don't worry about writing $\epsilon$
+ Do not remove insignificant predictors from your regression line formula when reporting it (this is just a question I get a lot)
This assignment will use an external dataset from a field experiment studying the diversity of [Chinese Rowan](https://en.wikipedia.org/wiki/Rowan) trees. Researchers randomly sampled and recorded characteristics of leaves from three different species in the *Sorbus* genus. They recorded the following measurements for each tree (don't worry about units)
1. `species`: the species of tree
2. `altitude`: the altitude of the tree
3. `respiratory_rate`: average respiratory rate across a random sample of leaves from that tree
4. `leaf_len`: average leaf length across a random sample of leaves from that tree
5. `birds_nesting`: whether birds were actively nesting in the tree
For this assignment, you will examine how various predictors may explain variation in *respiratory rate.*
```{r}
# rowan trees, no relation :)
rowan <- read_csv("https://raw.githubusercontent.com/sjspielman/datascience_for_biologists/master/data/rowan_trees.csv")
dplyr::glimpse(rowan)
```
### Question 1
> Background for this completing question is under the header "Simple linear regression: Single numeric predictor" in the linear regression tutorial
Construct a linear model that examines how *altitude* might explain variation in *respiratory rate*. Take the following steps to do so (code as you go through the steps!)
Make a quick scatterplot to make sure the "linear relationship" assumption is met for this data. Be sure to have your predictor and response on the correct axis!:
```{r}
### figure to check linearity goes here.
### no need for fancy, can keep labels as is, etc.
```
Now that you have confirmed the relationship is linear (hint: it is linear), build your linear model. *BEFORE you examine its output*, evaluate whether the model residuals were normally distributed:
```{r}
### build model and check normal residuals
### do not look at the model summary yet!
```
Now that you have confirmed the residuals are roughly normally distributed (hint: they are), examine the output from your linear model. In the space below the code chunk, discuss in *bullet form* (1-2 sentences each): a) Provide an interpretation of the intercept, b) Provide an interpretation of the altitude coefficient, c) Provide an interpretation of the $R^2$ value (those dollar signs signify "math mode" - see the HTML output!), and finally d) Conclude whether altitude is a strong or weak predictor of respiratory rate, consider "biological significance" (effect size!) as well as statistical significance.
```{r}
## examine model output here
```
+ Intercept interpretation
+ altitude coefficient interpretation
+ $R^2$ interpretation
+ Model conclusion
Finally, make a stylish scatterplot of your findings. Your scatterplot should:
+ Use your favorite ggplot theme and colors (it's allowed to like the default!)
+ Clearly show the regression line and its 95% confidence interval
+ Include a meaningful title, along with a subtitle that is the fitted model itself, as well as other nice labels
+ Include a well-placed annotation that gives the model's $R^2$
```{r}
### stylish plot goes here
```
### Question 2
> Background for this completing question is under the header "Simple ANOVA: Single categorical predictor" in the linear regression tutorial
Construct a linear model that examines how *species* might explain variation in *respiratory rate*. Take the following steps to do so (code as you go through the steps!)
Make a quick plot (use `geom_point()`, seriously, not even a jitter!!) to make sure the "equal variance" assumption is met for this data:
```{r}
### figure to check assumption goes here.
### no need for fancy, can keep labels as is, etc.
```
Now that you have confirmed the variance is equal across groups (hint: it is), build your linear model. *BEFORE you examine its output*, evaluate whether the model residuals were normally distributed:
```{r}
### build model and check normal residuals
### do not look at the model summary yet!
```
Now that you have confirmed the residuals are roughly normally distributed (hint: they are), examine the output from your linear model. In the space below the code chunk, discuss in *bullet form* (1-2 sentences each): a) Provide an interpretation of the intercept, b) Provide an interpretation of the species coefficient, c) Provide an interpretation of the $R^2$ value, and finally d) Conclude whether species is a strong or weak predictor of respiratory rate, consider "biological significance" (effect size!) as well as statistical significance.
```{r}
## examine model output here
```
+ Intercept interpretation
+ species coefficient interpretation
+ $R^2$ interpretation
+ Model conclusion
Finally, make a stylish figure of your findings, choosing your own geom!
+ Use your favorite ggplot theme and colors (it's allowed to like the default!)
+ If your geom does not already show the center of each group (i.e. like a boxplot), be sure to add the means in with `stat_summary()`
+ Include a meaningful title, along with a subtitle that is the fitted model itself, as well as other nice labels
+ Include a well-placed annotation that gives the model's $R^2$
```{r}
### stylish plot goes here
```
### Question 3
> Background for this completing question is under the header "LM with numeric and categorical predictors" in the linear regression tutorial
Construct a linear model that examines how BOTH *species* and *leaf_len* as independent effects might explain variation in *respiratory rate*. Again, take the following steps one by one:
Since we already checked assumptions for `species` in the last question, make an appropriate plot to check the linearity assumption for `leaf_len`:
```{r}
### figure to check assumption goes here.
### no need for fancy, can keep labels as is, etc.
```
Build your linear model, and evaluate whether the model residuals were normally distributed:
```{r}
### build model and check normal residuals
### do not look at the model summary yet!
```
Now that you have confirmed the residuals are roughly normally distributed (hint: they are), examine the output from your linear model. In the space below the code chunk, discuss in *bullet form* (1-2 sentences each): a) Provide an interpretation of the intercept, b) Provide an interpretation of the `species` coefficient, c) Provide an interpretation of the `leaf_len` coefficient d) Provide an interpretation of the $R^2$ value, and finally e) Conclude whether species is a strong or weak predictor of respiratory rate, consider "biological significance" (effect size!) as well as statistical significance.
```{r}
## examine model output here
```
+ Intercept interpretation
+ species coefficient interpretation
+ leaf_len coefficient interpretation
+ $R^2$ interpretation
+ Model conclusion
Finally, make a stylish scatterplot of your findings:
+ Use your favorite ggplot theme and colors (it's allowed to like the default!)
+ Make sure to show a regression lines for EACH species. **NOTICE in theses lines** how they are consistent with your conclusions about species being a significant predictor. You do not have to write anything, just notice!
+ Include a meaningful title, along with a subtitle that is the fitted model itself, as well as other nice labels
+ Include a well-placed annotation that gives the model's $R^2$
```{r}
### stylish plot goes here
```