library(tidyverse)
library(modelr)
library(broom)
library(rcfss)
set.seed(1234)
theme_set(theme_minimal())
ggplot2
Linear models are the simplest to understand. They adopt a generic form
\[y = \beta_0 + \beta_1 * x\]
where \(y\) is the outcome of interest, \(x\) is the explanatory or predictor variable, and \(\beta_0\) and \(\beta_1\) are parameters that vary to capture different patterns. Given the empirical values you have for \(x\) and \(y\), you generate a fitted model that finds the values for the parameters that best fit the data.
ggplot(sim1, aes(x, y)) +
geom_point()
This looks like a linear relationship. We could randomly generate parameters for the formula \(y = \beta_0 + \beta_1 * x\) to try and explain or predict the relationship between \(x\) and \(y\):
models <- tibble(
a1 = runif(250, -20, 40),
a2 = runif(250, -5, 5)
)
ggplot(sim1, aes(x, y)) +
geom_abline(aes(intercept = a1, slope = a2), data = models, alpha = 1/4) +
geom_point()
But obviously some parameters are better than others. We need a definition that can be used to differentiate good parameters from bad parameters. One approach widely used is called least squares - it means that the overall solution minimizes the sum of the squares of the errors made in the results of every single equation. The errors are simply the vertical difference between the actual values for \(y\) and the predicted values for \(y\).
R for Data Science walks you through the steps to perform all these calculations manually by writing your own functions. I encourage you to read through and practice some of this code, especially if you have no experience with linear models.
However for our purposes here I will assume you at least get the basics of this process. You can use ggplot2
to draw the best-fit line:
ggplot(sim1, aes(x, y)) +
geom_point() +
geom_smooth(method = "lm")
The line in blue is the best-fit line, with 95% confidence intervals in grey indicating a range of values so defined that there is a 95% probability that the true value of a parameter lies within it. If you want to learn more about the precise definition of confidence intervals and the debate over how useful they actually are, you should take a statistics class.
lm()
But drawing a picture is not always good enough. What if you want to know the actual values of the estimated parameters? To do that, we use the lm()
function:
sim1_mod <- lm(y ~ x, data = sim1)
coef(sim1_mod)
## (Intercept) x
## 4.220822 2.051533
The lm()
function takes two parameters. The first is a formula specifying the equation to be estimated (lm()
translates y ~ x
into \(y = \beta_0 + \beta_1 * x\)). The second is of course the data frame containing the variables.
Note that we have now begun to leave the tidyverse
universe. lm()
is part of the base R program, and the result of lm()
is decidedly not tidy.
str(sim1_mod)
## List of 12
## $ coefficients : Named num [1:2] 4.22 2.05
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "x"
## $ residuals : Named num [1:30] -2.072 1.238 -4.147 0.665 1.919 ...
## ..- attr(*, "names")= chr [1:30] "1" "2" "3" "4" ...
## $ effects : Named num [1:30] -84.92 32.275 -4.13 0.761 2.015 ...
## ..- attr(*, "names")= chr [1:30] "(Intercept)" "x" "" "" ...
## $ rank : int 2
## $ fitted.values: Named num [1:30] 6.27 6.27 6.27 8.32 8.32 ...
## ..- attr(*, "names")= chr [1:30] "1" "2" "3" "4" ...
## $ assign : int [1:2] 0 1
## $ qr :List of 5
## ..$ qr : num [1:30, 1:2] -5.477 0.183 0.183 0.183 0.183 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:30] "1" "2" "3" "4" ...
## .. .. ..$ : chr [1:2] "(Intercept)" "x"
## .. ..- attr(*, "assign")= int [1:2] 0 1
## ..$ qraux: num [1:2] 1.18 1.24
## ..$ pivot: int [1:2] 1 2
## ..$ tol : num 1e-07
## ..$ rank : int 2
## ..- attr(*, "class")= chr "qr"
## $ df.residual : int 28
## $ xlevels : Named list()
## $ call : language lm(formula = y ~ x, data = sim1)
## $ terms :Classes 'terms', 'formula' language y ~ x
## .. ..- attr(*, "variables")= language list(y, x)
## .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:2] "y" "x"
## .. .. .. ..$ : chr "x"
## .. ..- attr(*, "term.labels")= chr "x"
## .. ..- attr(*, "order")= int 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(y, x)
## .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:2] "y" "x"
## $ model :'data.frame': 30 obs. of 2 variables:
## ..$ y: num [1:30] 4.2 7.51 2.13 8.99 10.24 ...
## ..$ x: int [1:30] 1 1 1 2 2 2 3 3 3 4 ...
## ..- attr(*, "terms")=Classes 'terms', 'formula' language y ~ x
## .. .. ..- attr(*, "variables")= language list(y, x)
## .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. ..$ : chr [1:2] "y" "x"
## .. .. .. .. ..$ : chr "x"
## .. .. ..- attr(*, "term.labels")= chr "x"
## .. .. ..- attr(*, "order")= int 1
## .. .. ..- attr(*, "intercept")= int 1
## .. .. ..- attr(*, "response")= int 1
## .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. .. ..- attr(*, "predvars")= language list(y, x)
## .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. .. ..- attr(*, "names")= chr [1:2] "y" "x"
## - attr(*, "class")= chr "lm"
The result is stored in a complex list that contains a lot of important information, some of which you may recognize but most of it you do not. Here I will show you tools for extracting useful information from lm()
.
We can use sim1_mod
to generate predicted values, or the expected value for \(Y\) given our knowledge of hypothetical observations with values for \(X\), based on the estimated parameters using the data_grid()
and add_predictions()
functions from the modelr
package. data_grid()
generates an evenly spaced grid of data points covering the region where observed data lies. The first argument is a data frame, and subsequent arguments identify unique columns and generates all possible combinations.
grid <- sim1 %>%
data_grid(x)
grid
## # A tibble: 10 × 1
## x
## <int>
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
## 7 7
## 8 8
## 9 9
## 10 10
add_predictions()
takes a data frame and a model, and uses the model to generate predictions for each observation in the data frame.
grid <- grid %>%
add_predictions(sim1_mod)
grid
## # A tibble: 10 × 2
## x pred
## <int> <dbl>
## 1 1 6.272355
## 2 2 8.323888
## 3 3 10.375421
## 4 4 12.426954
## 5 5 14.478487
## 6 6 16.530020
## 7 7 18.581553
## 8 8 20.633087
## 9 9 22.684620
## 10 10 24.736153
Using this information, we can draw the best-fit line without using geom_smooth()
, and instead build it directly from the predicted values.
ggplot(sim1, aes(x)) +
geom_point(aes(y = y)) +
geom_line(aes(y = pred), data = grid, color = "red", size = 1) +
geom_point(aes(y = pred), data = grid, color = "blue", size = 3)
This looks like the line from before, but without the confidence interval. This is a bit more involved of a process, but it can work with any type of model you create - not just very basic, linear models.
We can also calculate the residuals, or that distance between the actual and predicted values of \(y\).
sim1 <- sim1 %>%
add_residuals(sim1_mod)
sim1
## # A tibble: 30 × 3
## x y resid
## <int> <dbl> <dbl>
## 1 1 4.199913 -2.072442018
## 2 1 7.510634 1.238279125
## 3 1 2.125473 -4.146882207
## 4 2 8.988857 0.664969362
## 5 2 10.243105 1.919217378
## 6 2 11.296823 2.972935148
## 7 3 7.356365 -3.019056466
## 8 3 10.505349 0.129928252
## 9 3 10.511601 0.136179642
## 10 4 12.434589 0.007634878
## # ... with 20 more rows
ggplot(sim1, aes(resid)) +
geom_freqpoly(binwidth = 0.5)
Reviewing your residuals can be helpful. Sometimes your model is better at predicting some types of observations better than others. This could help you isolate further patterns and improve the predictive accuracy of your model.
gapminder
Recall the gapminder
dataset, which includes measures of life expectancy over time for all countries in the world.
library(gapminder)
gapminder
## # A tibble: 1,704 × 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453
## 2 Afghanistan Asia 1957 30.332 9240934 820.8530
## 3 Afghanistan Asia 1962 31.997 10267083 853.1007
## 4 Afghanistan Asia 1967 34.020 11537966 836.1971
## 5 Afghanistan Asia 1972 36.088 13079460 739.9811
## 6 Afghanistan Asia 1977 38.438 14880372 786.1134
## 7 Afghanistan Asia 1982 39.854 12881816 978.0114
## 8 Afghanistan Asia 1987 40.822 13867957 852.3959
## 9 Afghanistan Asia 1992 41.674 16317921 649.3414
## 10 Afghanistan Asia 1997 41.763 22227415 635.3414
## # ... with 1,694 more rows
Let’s say we want to try and understand how life expectancy changes over time. We could visualize the data using a line graph:
gapminder %>%
ggplot(aes(year, lifeExp, group = country)) +
geom_line(alpha = 1/3)
But this is incredibly noise. Why not estimate a simple linear model that summarizes this trend?
gapminder_mod <- lm(lifeExp ~ year, data = gapminder)
summary(gapminder_mod)
##
## Call:
## lm(formula = lifeExp ~ year, data = gapminder)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.949 -9.651 1.697 10.335 22.158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -585.65219 32.31396 -18.12 <2e-16 ***
## year 0.32590 0.01632 19.96 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.63 on 1702 degrees of freedom
## Multiple R-squared: 0.1898, Adjusted R-squared: 0.1893
## F-statistic: 398.6 on 1 and 1702 DF, p-value: < 2.2e-16
grid <- gapminder %>%
data_grid(year, country)
grid
## # A tibble: 1,704 × 2
## year country
## <int> <fctr>
## 1 1952 Afghanistan
## 2 1952 Albania
## 3 1952 Algeria
## 4 1952 Angola
## 5 1952 Argentina
## 6 1952 Australia
## 7 1952 Austria
## 8 1952 Bahrain
## 9 1952 Bangladesh
## 10 1952 Belgium
## # ... with 1,694 more rows
grid <- grid %>%
add_predictions(gapminder_mod)
grid
## # A tibble: 1,704 × 3
## year country pred
## <int> <fctr> <dbl>
## 1 1952 Afghanistan 50.51208
## 2 1952 Albania 50.51208
## 3 1952 Algeria 50.51208
## 4 1952 Angola 50.51208
## 5 1952 Argentina 50.51208
## 6 1952 Australia 50.51208
## 7 1952 Austria 50.51208
## 8 1952 Bahrain 50.51208
## 9 1952 Bangladesh 50.51208
## 10 1952 Belgium 50.51208
## # ... with 1,694 more rows
ggplot(gapminder, aes(year, group = country)) +
geom_line(aes(y = lifeExp), alpha = .2) +
geom_line(aes(y = pred), data = grid, color = "red", size = 1)
So it appears that there is a positive trend - that is, over time life expectancy is rising. But we can also see a lot of variation in that trend - some countries are doing much better than others. We’ll come back to that in a bit.
Model objects are not very pretty in R. Recall the complicated data structure I mentioned above:
str(gapminder_mod)
## List of 12
## $ coefficients : Named num [1:2] -585.652 0.326
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "year"
## $ residuals : Named num [1:1704] -21.7 -21.8 -21.8 -21.4 -20.9 ...
## ..- attr(*, "names")= chr [1:1704] "1" "2" "3" "4" ...
## $ effects : Named num [1:1704] -2455.1 232.2 -20.8 -20.5 -20.2 ...
## ..- attr(*, "names")= chr [1:1704] "(Intercept)" "year" "" "" ...
## $ rank : int 2
## $ fitted.values: Named num [1:1704] 50.5 52.1 53.8 55.4 57 ...
## ..- attr(*, "names")= chr [1:1704] "1" "2" "3" "4" ...
## $ assign : int [1:2] 0 1
## $ qr :List of 5
## ..$ qr : num [1:1704, 1:2] -41.2795 0.0242 0.0242 0.0242 0.0242 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:1704] "1" "2" "3" "4" ...
## .. .. ..$ : chr [1:2] "(Intercept)" "year"
## .. ..- attr(*, "assign")= int [1:2] 0 1
## ..$ qraux: num [1:2] 1.02 1.03
## ..$ pivot: int [1:2] 1 2
## ..$ tol : num 1e-07
## ..$ rank : int 2
## ..- attr(*, "class")= chr "qr"
## $ df.residual : int 1702
## $ xlevels : Named list()
## $ call : language lm(formula = lifeExp ~ year, data = gapminder)
## $ terms :Classes 'terms', 'formula' language lifeExp ~ year
## .. ..- attr(*, "variables")= language list(lifeExp, year)
## .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:2] "lifeExp" "year"
## .. .. .. ..$ : chr "year"
## .. ..- attr(*, "term.labels")= chr "year"
## .. ..- attr(*, "order")= int 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(lifeExp, year)
## .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:2] "lifeExp" "year"
## $ model :'data.frame': 1704 obs. of 2 variables:
## ..$ lifeExp: num [1:1704] 28.8 30.3 32 34 36.1 ...
## ..$ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## ..- attr(*, "terms")=Classes 'terms', 'formula' language lifeExp ~ year
## .. .. ..- attr(*, "variables")= language list(lifeExp, year)
## .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. ..$ : chr [1:2] "lifeExp" "year"
## .. .. .. .. ..$ : chr "year"
## .. .. ..- attr(*, "term.labels")= chr "year"
## .. .. ..- attr(*, "order")= int 1
## .. .. ..- attr(*, "intercept")= int 1
## .. .. ..- attr(*, "response")= int 1
## .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. .. ..- attr(*, "predvars")= language list(lifeExp, year)
## .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. .. ..- attr(*, "names")= chr [1:2] "lifeExp" "year"
## - attr(*, "class")= chr "lm"
In order to extract model statistics and use them in a tidy manner, we can use a set of functions from the broom
package. For these functions, the input is always the model object itself, not the original data frame.
tidy()
tidy()
constructs a data frame that summarizes the model’s statistical findings. This includes coefficients and p-values for each parameter in a regression model. Note that depending on the statistical learning method employed, the statistics stored in the columns may vary.
tidy(gapminder_mod)
## term estimate std.error statistic p.value
## 1 (Intercept) -585.6521874 32.31396452 -18.12381 2.897807e-67
## 2 year 0.3259038 0.01632369 19.96509 7.546795e-80
tidy(gapminder_mod) %>%
str()
## 'data.frame': 2 obs. of 5 variables:
## $ term : chr "(Intercept)" "year"
## $ estimate : num -585.652 0.326
## $ std.error: num 32.314 0.0163
## $ statistic: num -18.1 20
## $ p.value : num 2.90e-67 7.55e-80
Notice that the structure of the resulting object is a tidy data frame. Every row contains a single parameter, every column contains a single statistic, and every cell contains exactly one value.
augment()
augment()
adds columns to the original data that was modeled. This could include predictions, residuals, and other observation-level statistics.
augment(gapminder_mod) %>%
as_tibble()
## # A tibble: 1,704 x 9
## lifeExp year .fitted .se.fit .resid .hat .sigma
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 28.801 1952 50.51208 0.5299963 -21.71108 0.0020765619 11.62203
## 2 30.332 1957 52.14160 0.4629044 -21.80960 0.0015840967 11.62193
## 3 31.997 1962 53.77112 0.4012330 -21.77412 0.0011901244 11.62197
## 4 34.020 1967 55.40064 0.3478771 -21.38064 0.0008946453 11.62241
## 5 36.088 1972 57.03016 0.3072006 -20.94216 0.0006976591 11.62288
## 6 38.438 1977 58.65968 0.2846912 -20.22168 0.0005991661 11.62363
## 7 39.854 1982 60.28920 0.2846912 -20.43520 0.0005991661 11.62341
## 8 40.822 1987 61.91872 0.3072006 -21.09672 0.0006976591 11.62271
## 9 41.674 1992 63.54824 0.3478771 -21.87424 0.0008946453 11.62187
## 10 41.763 1997 65.17776 0.4012330 -23.41476 0.0011901244 11.62010
## # ... with 1,694 more rows, and 2 more variables: .cooksd <dbl>,
## # .std.resid <dbl>
augment()
will only return statistics to the original data used to estimate the model, so it cannot be used like add_predictions()
to generate predictions for new data.
glance()
glance()
constructs a concise one-row summary of the model. This typically contains values such as \(R^2\), adjusted \(R^2\), and residual standard error that are computed once for the entire model.
glance(gapminder_mod)
## r.squared adj.r.squared sigma statistic p.value df logLik
## 1 0.1897571 0.1892811 11.63055 398.6047 7.546795e-80 2 -6597.866
## AIC BIC deviance df.residual
## 1 13201.73 13218.05 230229.2 1702
While broom
may not work with every model in R, it is compatible with a wide range of common statistical models. A full list of models with which broom
is compatible can be found on the GitHub page for the package.
What if instead we wanted to fit a separate model for the United States? We can filter gapminder
for that country and perform the analysis only on U.S. observations.
gapminder %>%
filter(country == "United States") %>%
ggplot(aes(year, lifeExp)) +
geom_line() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "United States")
map()
and nested data framesWhat if we want to estimate separate models for every country? We could do this manually, creating a new data frame for each country. But this is tedious and repetitive. We learned a couple of weeks ago how to iterate using for
loops. We could do this using a for
loop, but this will take a bunch of code. Instead, let’s use the map()
functions we already learned, but add an additional component on top of that.
Instead of repeating an action for each column (variable), we want to repeat an action for each country, a subset of rows. To do that, we need a new data structure: the nested data frame. To create a nested data frame we start with a grouped data frame, and “nest” it:
by_country <- gapminder %>%
group_by(country, continent) %>%
nest()
by_country
## # A tibble: 142 × 3
## country continent data
## <fctr> <fctr> <list>
## 1 Afghanistan Asia <tibble [12 × 4]>
## 2 Albania Europe <tibble [12 × 4]>
## 3 Algeria Africa <tibble [12 × 4]>
## 4 Angola Africa <tibble [12 × 4]>
## 5 Argentina Americas <tibble [12 × 4]>
## 6 Australia Oceania <tibble [12 × 4]>
## 7 Austria Europe <tibble [12 × 4]>
## 8 Bahrain Asia <tibble [12 × 4]>
## 9 Bangladesh Asia <tibble [12 × 4]>
## 10 Belgium Europe <tibble [12 × 4]>
## # ... with 132 more rows
This looks very different from what you’ve seen in data frames before. Typically every cell in a data frame is a single value. But here, each element in the data
column is actually another data frame. This demonstrates the benefits of lists - they can be used recursively to store other lists, which is exactly what data frames are.
Now we have one row per country, with the variables associated with each country stored in their own column. All the original data is still in this nested data frame, just stored in a different way. Note that to see the values of the variables in data
, we use the special notation we learned previously:
by_country$data[[1]]
## # A tibble: 12 × 4
## year lifeExp pop gdpPercap
## <int> <dbl> <int> <dbl>
## 1 1952 28.801 8425333 779.4453
## 2 1957 30.332 9240934 820.8530
## 3 1962 31.997 10267083 853.1007
## 4 1967 34.020 11537966 836.1971
## 5 1972 36.088 13079460 739.9811
## 6 1977 38.438 14880372 786.1134
## 7 1982 39.854 12881816 978.0114
## 8 1987 40.822 13867957 852.3959
## 9 1992 41.674 16317921 649.3414
## 10 1997 41.763 22227415 635.3414
## 11 2002 42.129 25268405 726.7341
## 12 2007 43.828 31889923 974.5803
It’s hard to see the overall structure, but it’s easy to use the map()
functions to access this data and analyze it. We create a model fitting function:
country_model <- function(df) {
lm(lifeExp ~ year, data = df)
}
And we want to apply it to each country. That is exactly what map()
is designed for.
models <- map(by_country$data, country_model)
And because models
is a list and we just saw how to create list-columns, we could store the models as a new column in by_country
to keep all the data and analysis together.
by_country <- by_country %>%
mutate(model = map(data, country_model))
by_country
## # A tibble: 142 × 4
## country continent data model
## <fctr> <fctr> <list> <list>
## 1 Afghanistan Asia <tibble [12 × 4]> <S3: lm>
## 2 Albania Europe <tibble [12 × 4]> <S3: lm>
## 3 Algeria Africa <tibble [12 × 4]> <S3: lm>
## 4 Angola Africa <tibble [12 × 4]> <S3: lm>
## 5 Argentina Americas <tibble [12 × 4]> <S3: lm>
## 6 Australia Oceania <tibble [12 × 4]> <S3: lm>
## 7 Austria Europe <tibble [12 × 4]> <S3: lm>
## 8 Bahrain Asia <tibble [12 × 4]> <S3: lm>
## 9 Bangladesh Asia <tibble [12 × 4]> <S3: lm>
## 10 Belgium Europe <tibble [12 × 4]> <S3: lm>
## # ... with 132 more rows
Now if we filter or change the order of the observations, models
also changes order.
by_country %>%
filter(continent == "Europe")
## # A tibble: 30 × 4
## country continent data model
## <fctr> <fctr> <list> <list>
## 1 Albania Europe <tibble [12 × 4]> <S3: lm>
## 2 Austria Europe <tibble [12 × 4]> <S3: lm>
## 3 Belgium Europe <tibble [12 × 4]> <S3: lm>
## 4 Bosnia and Herzegovina Europe <tibble [12 × 4]> <S3: lm>
## 5 Bulgaria Europe <tibble [12 × 4]> <S3: lm>
## 6 Croatia Europe <tibble [12 × 4]> <S3: lm>
## 7 Czech Republic Europe <tibble [12 × 4]> <S3: lm>
## 8 Denmark Europe <tibble [12 × 4]> <S3: lm>
## 9 Finland Europe <tibble [12 × 4]> <S3: lm>
## 10 France Europe <tibble [12 × 4]> <S3: lm>
## # ... with 20 more rows
What if we want to compute residuals for 142 countries and 142 models? We still use the add_residuals()
function, but we have to combine it with a map()
function call. Because add_residuals()
requires two arguments (data
and model
), we use the map2()
function. map2()
allows us to map()
over two sets of inputs rather than the normal one:
by_country <- by_country %>%
mutate(
resids = map2(data, model, add_residuals)
)
by_country
## # A tibble: 142 × 5
## country continent data model resids
## <fctr> <fctr> <list> <list> <list>
## 1 Afghanistan Asia <tibble [12 × 4]> <S3: lm> <tibble [12 × 5]>
## 2 Albania Europe <tibble [12 × 4]> <S3: lm> <tibble [12 × 5]>
## 3 Algeria Africa <tibble [12 × 4]> <S3: lm> <tibble [12 × 5]>
## 4 Angola Africa <tibble [12 × 4]> <S3: lm> <tibble [12 × 5]>
## 5 Argentina Americas <tibble [12 × 4]> <S3: lm> <tibble [12 × 5]>
## 6 Australia Oceania <tibble [12 × 4]> <S3: lm> <tibble [12 × 5]>
## 7 Austria Europe <tibble [12 × 4]> <S3: lm> <tibble [12 × 5]>
## 8 Bahrain Asia <tibble [12 × 4]> <S3: lm> <tibble [12 × 5]>
## 9 Bangladesh Asia <tibble [12 × 4]> <S3: lm> <tibble [12 × 5]>
## 10 Belgium Europe <tibble [12 × 4]> <S3: lm> <tibble [12 × 5]>
## # ... with 132 more rows
What if you want to plot the residuals? We need to unnest the residuals. unnest()
makes each element of the list its own row:
resids <- unnest(by_country, resids)
resids
## # A tibble: 1,704 × 7
## country continent year lifeExp pop gdpPercap resid
## <fctr> <fctr> <int> <dbl> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453 -1.10629487
## 2 Afghanistan Asia 1957 30.332 9240934 820.8530 -0.95193823
## 3 Afghanistan Asia 1962 31.997 10267083 853.1007 -0.66358159
## 4 Afghanistan Asia 1967 34.020 11537966 836.1971 -0.01722494
## 5 Afghanistan Asia 1972 36.088 13079460 739.9811 0.67413170
## 6 Afghanistan Asia 1977 38.438 14880372 786.1134 1.64748834
## 7 Afghanistan Asia 1982 39.854 12881816 978.0114 1.68684499
## 8 Afghanistan Asia 1987 40.822 13867957 852.3959 1.27820163
## 9 Afghanistan Asia 1992 41.674 16317921 649.3414 0.75355828
## 10 Afghanistan Asia 1997 41.763 22227415 635.3414 -0.53408508
## # ... with 1,694 more rows
resids %>%
ggplot(aes(year, resid)) +
geom_line(aes(group = country), alpha = 1 / 3) +
geom_smooth(se = FALSE)
scorecard
Recall the scorecard
data set which contains information on U.S. institutions of higher learning.
library(rcfss)
scorecard
## # A tibble: 1,849 × 12
## unitid name state type
## <int> <chr> <chr> <chr>
## 1 450234 ITT Technical Institute-Wichita KS Private, for-profit
## 2 448479 ITT Technical Institute-Swartz Creek MI Private, for-profit
## 3 456427 ITT Technical Institute-Concord CA Private, for-profit
## 4 459596 ITT Technical Institute-Tallahassee FL Private, for-profit
## 5 459851 Herzing University-Brookfield WI Private, for-profit
## 6 482477 DeVry University-Illinois IL Private, for-profit
## 7 482547 DeVry University-Nevada NV Private, for-profit
## 8 482592 DeVry University-Oregon OR Private, for-profit
## 9 482617 DeVry University-Tennessee TN Private, for-profit
## 10 482662 DeVry University-Washington WA Private, for-profit
## # ... with 1,839 more rows, and 8 more variables: cost <int>,
## # admrate <dbl>, satavg <dbl>, avgfacsal <dbl>, pctpell <dbl>,
## # comprate <dbl>, firstgen <dbl>, debt <dbl>
Answer the following questions using the statistical modeling tools you have learned.
What is the relationship between admission rate and cost? Report this relationship using a scatterplot and a linear best-fit line.
ggplot(scorecard, aes(admrate, cost)) +
geom_point() +
geom_smooth(method = "lm")
Estimate a linear regression of the relationship between admission rate and cost, and report your results in a tidy table.
scorecard_mod <- lm(cost ~ admrate, data = scorecard)
tidy(scorecard_mod)
## term estimate std.error statistic p.value
## 1 (Intercept) 43607.1844 1001.38991 43.54666 1.036472e-284
## 2 admrate -181.6923 14.38405 -12.63151 3.980318e-35
Estimate separate linear regression models of the relationship between admission rate and cost for each type of college. Report the estimated parameters and standard errors in a tidy data frame.
# model-building function
type_model <- function(df) {
lm(cost ~ admrate, data = df)
}
# nest the data frame
by_type <- scorecard %>%
group_by(type) %>%
nest()
by_type
## # A tibble: 3 x 2
## type data
## <chr> <list>
## 1 Private, for-profit <tibble [216 x 11]>
## 2 Private, nonprofit <tibble [1,092 x 11]>
## 3 Public <tibble [541 x 11]>
# estimate the models
by_type <- by_type %>%
mutate(model = map(data, type_model))
by_type
## # A tibble: 3 x 3
## type data model
## <chr> <list> <list>
## 1 Private, for-profit <tibble [216 x 11]> <S3: lm>
## 2 Private, nonprofit <tibble [1,092 x 11]> <S3: lm>
## 3 Public <tibble [541 x 11]> <S3: lm>
# extract the parameters and print a tidy data frame
by_type %>%
mutate(results = map(model, tidy)) %>%
unnest(results)
## # A tibble: 6 x 6
## type term estimate std.error statistic
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Private, for-profit (Intercept) 33148.964036 1679.37739 19.7388414
## 2 Private, for-profit admrate -69.012413 21.21030 -3.2537210
## 3 Private, nonprofit (Intercept) 50797.468138 1112.33157 45.6675593
## 4 Private, nonprofit admrate -198.228160 16.49201 -12.0196449
## 5 Public (Intercept) 20192.509481 718.82795 28.0908798
## 6 Public admrate -7.202504 10.27522 -0.7009587
## # ... with 1 more variables: p.value <dbl>
The same approach by using an anonymous function with the one-sided formula format:
by_type %>%
mutate(model = map(data, ~lm(cost ~ admrate, data = .)),
results = map(model, tidy)) %>%
unnest(results)
devtools::session_info()
## setting value
## version R version 3.4.3 (2017-11-30)
## system x86_64, darwin15.6.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## tz America/Chicago
## date 2018-04-24
##
## package * version date source
## assertthat 0.2.0 2017-04-11 CRAN (R 3.4.0)
## backports 1.1.2 2017-12-13 CRAN (R 3.4.3)
## base * 3.4.3 2017-12-07 local
## bindr 0.1.1 2018-03-13 CRAN (R 3.4.3)
## bindrcpp 0.2.2.9000 2018-04-08 Github (krlmlr/bindrcpp@bd5ae73)
## broom * 0.4.4 2018-03-29 CRAN (R 3.4.3)
## cellranger 1.1.0 2016-07-27 CRAN (R 3.4.0)
## cli 1.0.0 2017-11-05 CRAN (R 3.4.2)
## colorspace 1.3-2 2016-12-14 CRAN (R 3.4.0)
## compiler 3.4.3 2017-12-07 local
## crayon 1.3.4 2017-10-03 Github (gaborcsardi/crayon@b5221ab)
## datasets * 3.4.3 2017-12-07 local
## devtools 1.13.5 2018-02-18 CRAN (R 3.4.3)
## digest 0.6.15 2018-01-28 CRAN (R 3.4.3)
## dplyr * 0.7.4.9003 2018-04-08 Github (tidyverse/dplyr@b7aaa95)
## evaluate 0.10.1 2017-06-24 CRAN (R 3.4.1)
## forcats * 0.3.0 2018-02-19 CRAN (R 3.4.3)
## foreign 0.8-69 2017-06-22 CRAN (R 3.4.3)
## ggplot2 * 2.2.1.9000 2018-04-24 Github (tidyverse/ggplot2@3c9c504)
## glue 1.2.0 2017-10-29 CRAN (R 3.4.2)
## graphics * 3.4.3 2017-12-07 local
## grDevices * 3.4.3 2017-12-07 local
## grid 3.4.3 2017-12-07 local
## gtable 0.2.0 2016-02-26 CRAN (R 3.4.0)
## haven 1.1.1 2018-01-18 CRAN (R 3.4.3)
## hms 0.4.2 2018-03-10 CRAN (R 3.4.3)
## htmltools 0.3.6 2017-04-28 CRAN (R 3.4.0)
## httr 1.3.1 2017-08-20 CRAN (R 3.4.1)
## jsonlite 1.5 2017-06-01 CRAN (R 3.4.0)
## knitr 1.20 2018-02-20 CRAN (R 3.4.3)
## lattice 0.20-35 2017-03-25 CRAN (R 3.4.3)
## lazyeval 0.2.1 2017-10-29 CRAN (R 3.4.2)
## lubridate 1.7.4 2018-04-11 CRAN (R 3.4.3)
## magrittr 1.5 2014-11-22 CRAN (R 3.4.0)
## memoise 1.1.0 2017-04-21 CRAN (R 3.4.0)
## methods * 3.4.3 2017-12-07 local
## mnormt 1.5-5 2016-10-15 CRAN (R 3.4.0)
## modelr * 0.1.1 2017-08-10 local
## munsell 0.4.3 2016-02-13 CRAN (R 3.4.0)
## nlme 3.1-137 2018-04-07 CRAN (R 3.4.4)
## parallel 3.4.3 2017-12-07 local
## pillar 1.2.1 2018-02-27 CRAN (R 3.4.3)
## pkgconfig 2.0.1 2017-03-21 CRAN (R 3.4.0)
## plyr 1.8.4 2016-06-08 CRAN (R 3.4.0)
## psych 1.8.3.3 2018-03-30 CRAN (R 3.4.4)
## purrr * 0.2.4 2017-10-18 CRAN (R 3.4.2)
## R6 2.2.2 2017-06-17 CRAN (R 3.4.0)
## rcfss * 0.1.5 2018-04-08 Github (uc-cfss/rcfss@8d2b2c0)
## Rcpp 0.12.16 2018-03-13 CRAN (R 3.4.4)
## readr * 1.1.1 2017-05-16 CRAN (R 3.4.0)
## readxl 1.0.0 2017-04-18 CRAN (R 3.4.0)
## reshape2 1.4.3 2017-12-11 CRAN (R 3.4.3)
## rlang 0.2.0.9001 2018-04-24 Github (r-lib/rlang@82b2727)
## rmarkdown 1.9 2018-03-01 CRAN (R 3.4.3)
## rprojroot 1.3-2 2018-01-03 CRAN (R 3.4.3)
## rstudioapi 0.7 2017-09-07 CRAN (R 3.4.1)
## rvest 0.3.2 2016-06-17 CRAN (R 3.4.0)
## scales 0.5.0.9000 2018-04-24 Github (hadley/scales@d767915)
## stats * 3.4.3 2017-12-07 local
## stringi 1.1.7 2018-03-12 CRAN (R 3.4.3)
## stringr * 1.3.0 2018-02-19 CRAN (R 3.4.3)
## tibble * 1.4.2 2018-01-22 CRAN (R 3.4.3)
## tidyr * 0.8.0 2018-01-29 CRAN (R 3.4.3)
## tidyselect 0.2.4 2018-02-26 CRAN (R 3.4.3)
## tidyverse * 1.2.1 2017-11-14 CRAN (R 3.4.2)
## tools 3.4.3 2017-12-07 local
## utils * 3.4.3 2017-12-07 local
## withr 2.1.2 2018-04-24 Github (jimhester/withr@79d7b0d)
## xml2 1.2.0 2018-01-24 CRAN (R 3.4.3)
## yaml 2.1.18 2018-03-08 CRAN (R 3.4.4)
This work is licensed under the CC BY-NC 4.0 Creative Commons License.