---
title: "Lab 09: Linear Regression"
output: html_document
---
I have produced some fixes to the **tmodels** package and you need to
re-install it for today's lab to work. Do that by running the following
before you do anything else today:
```{r}
devtools::install_github("statsmaths/tmodels")
```
Then, read in all of the R packages that we will need for today:
```{r, message=FALSE}
library(dplyr)
library(ggplot2)
library(tmodels)
library(readr)
library(readxl)
```
## Diamonds Dataset
For the first part of this lab, I want you to read into R a dataset of diamond
prices, along with various characteristics of the diamonds:
```{r, message=FALSE}
diamonds <- read_csv("https://raw.githubusercontent.com/statsmaths/stat_data/gh-pages/diamonds_small.csv")
```
We are going to run a number of regression analyses to understand how various
features affect the price of a diamond. Unless otherwise noted, all of the models
use `price` as the response variable.
1. Start by running a regression model that predicts price as a function of
carat (the weight of the diamond). Take note of whether the result is
significant or not.
```{r}
```
2. Repeat the same analysis, but this time use the Pearson Correlation
Test. Verify that the T-statistic is the same value as for linear regression.
```{r}
```
Are the point estimates and confidence intervals the same? Why or why not?
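If you want to see why the two T-statistics should match before checking it on the diamonds data, here is a minimal sketch on simulated data. It uses the base R functions `lm` and `cor.test` as stand-ins rather than the **tmodels** functions, and the data frame `sim` and its columns are made up for illustration:

```{r}
# Simulated data (hypothetical; this is not the diamonds dataset)
set.seed(1)
sim <- data.frame(x = rnorm(200))
sim$y <- 2 + 3 * sim$x + rnorm(200, sd = 2)

# Base R linear regression: T-statistic for the slope on x
fit <- lm(y ~ x, data = sim)
t_reg <- summary(fit)$coefficients["x", "t value"]

# Base R Pearson correlation test: T-statistic for the correlation
t_cor <- unname(cor.test(sim$y, sim$x)$statistic)

# Compare the two statistics (up to floating point error)
all.equal(t_reg, t_cor)
```

The two tests report different point estimates (a slope versus a correlation), so only the test statistic is compared here.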
3. Now, build a linear regression that predicts price as a function of
the "4 C's": carat, color, clarity and cut. Treat carat as the IV and
the other three as nuisance variables (as a reminder, the order of the
variables only matters in terms of which one goes first after the `~`;
otherwise there is no difference in the output).
```{r}
```
How does the estimate compare to the result without controlling for
the other variables? Is the effect stronger or weaker? Does this make
sense to you?
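The reminder about variable order can be checked on simulated data. This is only a sketch with base R's `lm`, which reports a coefficient for every term rather than singling out the first one as the IV; the data frame `sim_ord` and its columns are hypothetical:

```{r}
# Simulated data (hypothetical): one variable of interest and two nuisance variables
set.seed(4)
sim_ord <- data.frame(x = rnorm(100), a = rnorm(100), b = rnorm(100))
sim_ord$y <- 1 + 2 * sim_ord$x - sim_ord$a + 0.5 * sim_ord$b + rnorm(100)

# Same model, with the nuisance variables listed in a different order
fit_ab <- lm(y ~ x + a + b, data = sim_ord)
fit_ba <- lm(y ~ x + b + a, data = sim_ord)

# The estimate for x does not depend on the order of a and b
all.equal(coef(fit_ab)["x"], coef(fit_ba)["x"])
```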
4. Now, run a regression model that predicts price as just a function
of clarity.
```{r}
```
Is the result significant at the 0.05 level? Notice how the output looks
different than the first result because we now have a categorical variable
as the input variable.
5. Run a One-way ANOVA test predicting price as a function of clarity.
```{r}
```
You should find that this has the exact same F-statistic and P-value
(the numbers in the F-statistic are different, but lead to the same
result).
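The same agreement shows up on simulated data. The sketch below uses base R's `lm` and `aov` in place of the **tmodels** functions, and the data frame `sim_aov` is invented for illustration:

```{r}
# Simulated data (hypothetical): a response measured in three groups
set.seed(2)
sim_aov <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 50)),
  y = rnorm(150, mean = rep(c(10, 11, 12), each = 50))
)

# Overall F-statistic from a regression on the categorical variable
f_reg <- unname(summary(lm(y ~ group, data = sim_aov))$fstatistic["value"])

# F-statistic from a one-way ANOVA on the same data
f_aov <- summary(aov(y ~ group, data = sim_aov))[[1]][["F value"]][1]

# Compare the two statistics (up to floating point error)
all.equal(f_reg, f_aov)
```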
6. Run a linear regression that predicts price as a function of clarity
with cut as a nuisance variable.
```{r}
```
You should see that the test is now no longer significant. What does that
mean in practical terms?
7. Run the price as a function of the 4 C's again, but this time use clarity
as the input variable.
```{r}
```
You should see that this is once again significant. Try to summarize the
results from questions 5-7 and take note of how tricky and difficult
linear regression can be to use!
8. There are no variables in the diamonds dataset that are categorical
with only two categories, so we need to make one for you to see how this
works with linear regression. Run the following to create a variable
named `color_good` that breaks the colors into two distinct categories:
```{r}
diamonds <- mutate(diamonds, color_good = if_else(color %in% c("D", "E"), "good", "poor"))
```
Run the code and see the new variable created in the data-viewer.
Then, run a linear regression predicting `price` as a function of
`color_good`:
```{r}
```
9. Run a two-sample T-test predicting price as a function of `color_good`:
```{r}
```
How do the results compare to the regression analysis? Note that,
for one thing, the defaults are switched so that the terms here
are all negative. You should also see that the point estimate is
the same, but the other values are slightly different.
What is going on? There is a slightly different assumption that the
linear regression model uses compared to the two-sample T-test. The
regression model assumes that the variation of price is the
same in both groups and the two-sample T-test doesn't. That is where
the discrepancy arises. This isn't anything to worry about, but I
just wanted you to have seen it.
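To see the equal-variance assumption at work, here is a small sketch on simulated data using base R's `lm` and `t.test` rather than the **tmodels** functions. The data frame `sim_t` is hypothetical, and I have deliberately given the two groups very different spreads:

```{r}
# Simulated data (hypothetical): two groups with unequal sizes and variances
set.seed(3)
sim_t <- data.frame(
  group = rep(c("good", "poor"), times = c(40, 160)),
  y = c(rnorm(40, mean = 5, sd = 1), rnorm(160, mean = 6, sd = 4))
)

# Regression on the two-level variable: T-statistic for the group difference
fit <- lm(y ~ group, data = sim_t)
t_reg <- summary(fit)$coefficients["grouppoor", "t value"]

# Two-sample T-test that assumes equal variances in both groups
t_pooled <- unname(t.test(y ~ group, data = sim_t, var.equal = TRUE)$statistic)

# Two-sample T-test without that assumption (the default in base R)
t_welch <- unname(t.test(y ~ group, data = sim_t)$statistic)

# The regression matches the equal-variance test, apart from the sign
all.equal(abs(t_reg), abs(t_pooled))

# The default test gives a different statistic
c(regression = t_reg, pooled = t_pooled, welch = t_welch)
```

The sign flips because `lm` reports "poor" minus "good" while `t.test` reports "good" minus "poor".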
10. Finally, run a regression for `price` as a function of `color_good`
(IV) and the nuisance variables `carat`, `cut`, and `clarity`:
```{r}
```
Notice that the output is different than in question 7 where there are
multiple categories.
## Regression in scientific literature
Open the following article from the British Medical Journal on "Geographical
variation in glaucoma prescribing trends in England 2008–2012: an observational
ecological study":
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4874115/
Read the Abstract and Methods sections. Also look at Tables 2-4. Then,
try to answer the following questions (you may need to search through the rest
of the paper for some of these):
1. Is this an observational or experimental study?
2. What statistical tests were used in this analysis? Are you familiar with
all of them from the course? If not, what tests or statistical terms are you
unfamiliar with?
3. Table 3 shows a multivariate linear regression model. What is the response
variable, the input/independent variable, and the nuisance variables (look at
the abstract for a hint, and remember that this is a matter of perspective)?
Which p-value is the important one? What are the null and alternative hypotheses
in this test?