---
title: "Lab 20: Linear Regression"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(warning = FALSE)
knitr::opts_chunk$set(message = FALSE)
knitr::opts_chunk$set(fig.height = 5)
knitr::opts_chunk$set(fig.width = 8.5)
knitr::opts_chunk$set(out.width = "100%")
knitr::opts_chunk$set(dpi = 300)
library(readr)
library(ggplot2)
library(dplyr)
library(viridis)
library(forcats)
library(smodels)
theme_set(theme_minimal())
```
## Instructions
Below you will find several empty R code scripts and answer prompts. Your task
is to fill in the required code snippets and answer the corresponding
questions.
## Diamonds Data - Simple Regression
Today, we will start by looking at a dataset of diamond prices:
```{r}
diamonds <- read_csv("https://statsmaths.github.io/stat_data/diamonds.csv")
```
Fit a linear regression predicting the diamond price as a function of
its weight (carats). Print out the regression table without any confidence
intervals.
```{r}
```
How would you interpret the slope coefficient in this model?
**Answer**:
Next, show the same regression as a scatter plot with a smoothing line:
```{r}
```
Now, print the same model as in the previous question but add a 95%
confidence interval.
```{r}
```
What range of values does the confidence interval give for the intercept
in this model?
**Answer**:
What range of values does the confidence interval give for the slope
in this model?
**Answer**:
Would you be surprised if we collected another set of diamonds, in exactly the
same way as before, and found a slope estimate equal to 5500?
**Answer**:
Would you be surprised if we collected another set of diamonds, in exactly the
same way as before, and found a slope estimate equal to 7500?
**Answer**:
The biggest use of confidence intervals, at least for us, is to
determine whether the confidence interval for a slope contains only values
that are all negative, all positive, or a mix of both.
Why might this be important? Let's look at our diamond example. Describe the
meaning of the slope in this linear regression:
**Answer**:
Therefore, if the confidence interval has only positive values, we can
fairly confidently say is that there is a positive relationship between
diamond weight and its price.
Now, fit a regression model on the diamonds dataset, predicting the price as
a function of depth. Compute the regression table using level equal to .95:
```{r}
```
What does the model predict will be the price of a diamond with a depth of
60?
**Answer**:
Describe in words the estimate of the slope in terms of the dataset.
**Answer**:
Using the regression table from the previous question, does the
confidence interval contain only values that are positive, only values that
are negative, or values that are both positive and negative?
**Answer**:
Interpret this in words (your words, not just the formal definition).
**Answer**:
## Diamonds Data - Multiple Linear Regression
While we lose the nice graphical summary, it is possible to build linear
models with more than one variable and an intercept. In the diamond example,
for instance, we can fit a model that has both depth and carat in it:
```{r}
model <- lm_basic(price ~ 1 + carat + depth, data = diamonds)
reg_table(model, level = 0.95)
```
Notice that switching the order of the inputs does not change the estimates
or confidence intervals; only the order in the output is modified, but this
is inconsequential:
```{r}
model <- lm_basic(price ~ 1 + depth + carat, data = diamonds)
reg_table(model, level = 0.95)
```
Using the model in the previous question, what do you expect will be the
price of a diamond with a depth of 50 that weights 2 carats?
**Answer**:
Are both of the slopes in the model statistically significant? What are
their signs?
**Answer**:
Use the `add_prediction` function to add predictions from this model
back into the diamonds dataset.
```{r}
```
What is the largest positive residual (you can either use the `max` function
or sort the data in the data viewer)?
**Answer**:
Interpreting the meaning of the slopes in a regression model with
multiple variables changes slightly from the version with a single slope.
Each slope measures the change in the response give a change in its
corresponding value, when holding the other variables in the model fixed.
This last part is actually important and we will see some examples that at
first seem counterintuitive that make sense once we remember this caveat.
Describe the meaning the slopes, in your own words, in the regression model
above.
**Answer**:
Notice that we can do some comparisons quickly with the slope estimates.
For example, if we have two diamonds with the same depth where one is 1
carat in weight and the other is 2 carats in weight, how much more do you
expect the heavier diamond to cost?
**Answer**: