library(tidyverse)

The Department of Education collects annual statistics on colleges and universities in the United States. I have included a subset of these data from 2013 in the rcfss package, which is hosted on GitHub. To install the package, run devtools::install_github("uc-cfss/rcfss") in the console.

If you don’t already have the devtools package installed, this command will fail with an error. Install devtools first using install.packages("devtools"), then run devtools::install_github("uc-cfss/rcfss") again.
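
The two installation steps can also be combined so that each package is installed only when it is missing. This is a minimal sketch, assuming an internet connection and a configured CRAN mirror; requireNamespace() is a base R function that checks whether a package is available without attaching it.

```r
# Install devtools only if it is not already available
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# Install rcfss from GitHub only if it is not already available
if (!requireNamespace("rcfss", quietly = TRUE)) {
  devtools::install_github("uc-cfss/rcfss")
}
```

Running this more than once is harmless, because both checks are skipped once the packages are present.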

library(rcfss)
data("scorecard")
str(scorecard)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1849 obs. of  12 variables:
##  $ unitid   : int  450234 448479 456427 459596 459851 482477 482547 482592 482617 482662 ...
##  $ name     : chr  "ITT Technical Institute-Wichita" "ITT Technical Institute-Swartz Creek" "ITT Technical Institute-Concord" "ITT Technical Institute-Tallahassee" ...
##  $ state    : chr  "KS" "MI" "CA" "FL" ...
##  $ type     : chr  "Private, for-profit" "Private, for-profit" "Private, for-profit" "Private, for-profit" ...
##  $ cost     : int  28306 26994 26353 28894 23928 25625 24265 NA 20983 21999 ...
##  $ admrate  : num  81.3 98.3 89.3 58.4 68.8 ...
##  $ satavg   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avgfacsal: num  45054 52857 NA 47196 55089 ...
##  $ pctpell  : num  0.803 0.773 0.704 0.778 0.61 ...
##  $ comprate : num  0.6 0.336 NA NA NA ...
##  $ firstgen : num  0.506 0.506 0.506 0.506 0.517 ...
##  $ debt     : num  13000 13000 13000 13000 9500 ...

Type ?scorecard in the console to open up the help file for this data set. This includes the documentation for all the variables. Use your knowledge of dplyr and ggplot2 functions to answer the following questions.

Which type of college has the highest average SAT score?

NOTE: This time, use a graph to visualize your answer, not a table.

Solution

We could use a boxplot to visualize the distribution of SAT scores.

ggplot(scorecard, aes(type, satavg)) +
  geom_boxplot()
## Warning: Removed 471 rows containing non-finite values (stat_boxplot).

According to this graph, private, nonprofit schools have the highest typical SAT score, followed by public and then private, for-profit schools. Note that the center line of each box marks the median of the schools' average SAT scores, not the mean. But this doesn’t reveal the entire picture. What happens if we plot a histogram or frequency polygon?

ggplot(scorecard, aes(satavg)) +
  geom_histogram() +
  facet_wrap(~ type)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 471 rows containing non-finite values (stat_bin).

ggplot(scorecard, aes(satavg, color = type)) +
  geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 471 rows containing non-finite values (stat_bin).
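
The `stat_bin()` message appears because no bin width was given, so ggplot2 falls back to 30 bins. The sketch below sets `binwidth` explicitly and filters out schools without a reported SAT average before plotting, which avoids the "non-finite values" warning. The width of 50 SAT points and the names `scorecard_sat` and `sat_hist` are illustrative assumptions, not part of the original solution; it is worth trying several widths.

```r
library(tidyverse)
library(rcfss)
data("scorecard")

# Drop schools with no reported SAT average up front so that
# ggplot2 does not have to discard them (and warn about it)
scorecard_sat <- filter(scorecard, !is.na(satavg))

# binwidth = 50 is an arbitrary illustrative choice (50 SAT points per bin)
sat_hist <- ggplot(scorecard_sat, aes(satavg)) +
  geom_histogram(binwidth = 50) +
  facet_wrap(~ type)

sat_hist
```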

Now we can see the averages for each college type are based on widely varying sample sizes.

ggplot(scorecard, aes(type)) +
  geom_bar()

There are far fewer private, for-profit colleges than colleges in the other categories. A boxplot alone would not reveal this detail, which could be important in future analysis.
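
The bar chart counts every school of each type, including those that report no SAT average. As a numerical complement to the graphs (not a replacement for them), the sketch below uses dplyr to count how many schools of each type exist, how many actually report an SAT average, and the median that the boxplot displays. The column names `n_schools`, `n_with_sat`, and `median_sat` are chosen here for illustration.

```r
library(tidyverse)
library(rcfss)
data("scorecard")

sat_by_type <- scorecard %>%
  group_by(type) %>%
  summarize(
    n_schools  = n(),                          # all schools of this type
    n_with_sat = sum(!is.na(satavg)),          # schools reporting an SAT average
    median_sat = median(satavg, na.rm = TRUE)  # the center line of each boxplot
  )

sat_by_type
```

Comparing `n_schools` with `n_with_sat` shows how much of each category the SAT plots are actually based on.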

What is the relationship between college attendance cost and faculty salaries? How does this relationship differ across types of colleges?

Solution

# geom_point with a smoothing line
ggplot(scorecard, aes(cost, avgfacsal)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 42 rows containing non-finite values (stat_smooth).
## Warning: Removed 42 rows containing missing values (geom_point).

# geom_point with alpha transparency to reveal dense clusters
ggplot(scorecard, aes(cost, avgfacsal)) +
  geom_point(alpha = .2) +
  geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 42 rows containing non-finite values (stat_smooth).
## Warning: Removed 42 rows containing missing values (geom_point).

# geom_hex (requires the hexbin package to be installed)
ggplot(scorecard, aes(cost, avgfacsal)) +
  geom_hex() +
  geom_smooth()
## Warning: Removed 42 rows containing non-finite values (stat_binhex).
## `geom_smooth()` using method = 'gam'
## Warning: Removed 42 rows containing non-finite values (stat_smooth).

# geom_point with smoothing lines for each type
ggplot(scorecard, aes(cost, avgfacsal, color = type)) +
  geom_point(alpha = .2) +
  geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 42 rows containing non-finite values (stat_smooth).
## Warning: Removed 42 rows containing missing values (geom_point).

# geom_point with facets for each type
ggplot(scorecard, aes(cost, avgfacsal, color = type)) +
  geom_point(alpha = .2) +
  geom_smooth() +
  facet_grid(. ~ type)
## `geom_smooth()` using method = 'gam'
## Warning: Removed 42 rows containing non-finite values (stat_smooth).

## Warning: Removed 42 rows containing missing values (geom_point).
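
The plots above are left for the reader to interpret. As a numerical complement, and only as a sketch, the strength of the linear association between cost and faculty salary within each college type could be summarized with a Pearson correlation. This goes beyond the original solution: `cost_salary_cor`, `n`, and `r` are illustrative names, and a correlation captures only the linear part of the relationships drawn by geom_smooth().

```r
library(tidyverse)
library(rcfss)
data("scorecard")

cost_salary_cor <- scorecard %>%
  # keep only schools where both variables are observed
  filter(!is.na(cost), !is.na(avgfacsal)) %>%
  group_by(type) %>%
  summarize(
    n = n(),                  # schools contributing to the estimate
    r = cor(cost, avgfacsal)  # Pearson correlation within this type
  )

cost_salary_cor
```

Reading `r` alongside `n` matters here, since an estimate based on few schools is less stable than one based on many.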

Session Info

devtools::session_info()
## Session info -------------------------------------------------------------
##  setting  value                       
##  version  R version 3.4.3 (2017-11-30)
##  system   x86_64, darwin15.6.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/Chicago             
##  date     2018-03-13
## Packages -----------------------------------------------------------------
##  package    * version    date       source                             
##  assertthat   0.2.0      2017-04-11 CRAN (R 3.4.0)                     
##  backports    1.1.2      2017-12-13 CRAN (R 3.4.3)                     
##  base       * 3.4.3      2017-12-07 local                              
##  bindr        0.1.1      2018-03-13 CRAN (R 3.4.3)                     
##  bindrcpp     0.2        2017-06-17 CRAN (R 3.4.0)                     
##  broom        0.4.3      2017-11-20 CRAN (R 3.4.1)                     
##  cellranger   1.1.0      2016-07-27 CRAN (R 3.4.0)                     
##  cli          1.0.0      2017-11-05 CRAN (R 3.4.2)                     
##  colorspace   1.3-2      2016-12-14 CRAN (R 3.4.0)                     
##  compiler     3.4.3      2017-12-07 local                              
##  crayon       1.3.4      2017-10-03 Github (gaborcsardi/crayon@b5221ab)
##  datasets   * 3.4.3      2017-12-07 local                              
##  devtools     1.13.5     2018-02-18 CRAN (R 3.4.3)                     
##  digest       0.6.15     2018-01-28 CRAN (R 3.4.3)                     
##  dplyr      * 0.7.4.9000 2017-10-03 Github (tidyverse/dplyr@1a0730a)   
##  evaluate     0.10.1     2017-06-24 CRAN (R 3.4.1)                     
##  forcats    * 0.3.0      2018-02-19 CRAN (R 3.4.3)                     
##  foreign      0.8-69     2017-06-22 CRAN (R 3.4.3)                     
##  ggplot2    * 2.2.1      2016-12-30 CRAN (R 3.4.0)                     
##  glue         1.2.0      2017-10-29 CRAN (R 3.4.2)                     
##  graphics   * 3.4.3      2017-12-07 local                              
##  grDevices  * 3.4.3      2017-12-07 local                              
##  grid         3.4.3      2017-12-07 local                              
##  gtable       0.2.0      2016-02-26 CRAN (R 3.4.0)                     
##  haven        1.1.1      2018-01-18 CRAN (R 3.4.3)                     
##  hms          0.4.2      2018-03-10 CRAN (R 3.4.3)                     
##  htmltools    0.3.6      2017-04-28 CRAN (R 3.4.0)                     
##  httr         1.3.1      2017-08-20 CRAN (R 3.4.1)                     
##  jsonlite     1.5        2017-06-01 CRAN (R 3.4.0)                     
##  knitr        1.20       2018-02-20 CRAN (R 3.4.3)                     
##  lattice      0.20-35    2017-03-25 CRAN (R 3.4.3)                     
##  lazyeval     0.2.1      2017-10-29 CRAN (R 3.4.2)                     
##  lubridate    1.7.3      2018-02-27 CRAN (R 3.4.3)                     
##  magrittr     1.5        2014-11-22 CRAN (R 3.4.0)                     
##  memoise      1.1.0      2017-04-21 CRAN (R 3.4.0)                     
##  methods    * 3.4.3      2017-12-07 local                              
##  mnormt       1.5-5      2016-10-15 CRAN (R 3.4.0)                     
##  modelr       0.1.1      2017-08-10 local                              
##  munsell      0.4.3      2016-02-13 CRAN (R 3.4.0)                     
##  nlme         3.1-131.1  2018-02-16 CRAN (R 3.4.3)                     
##  parallel     3.4.3      2017-12-07 local                              
##  pillar       1.2.1      2018-02-27 CRAN (R 3.4.3)                     
##  pkgconfig    2.0.1      2017-03-21 CRAN (R 3.4.0)                     
##  plyr         1.8.4      2016-06-08 CRAN (R 3.4.0)                     
##  psych        1.7.8      2017-09-09 CRAN (R 3.4.1)                     
##  purrr      * 0.2.4      2017-10-18 CRAN (R 3.4.2)                     
##  R6           2.2.2      2017-06-17 CRAN (R 3.4.0)                     
##  Rcpp         0.12.15    2018-01-20 CRAN (R 3.4.3)                     
##  readr      * 1.1.1      2017-05-16 CRAN (R 3.4.0)                     
##  readxl       1.0.0      2017-04-18 CRAN (R 3.4.0)                     
##  reshape2     1.4.3      2017-12-11 CRAN (R 3.4.3)                     
##  rlang        0.2.0      2018-02-20 cran (@0.2.0)                      
##  rmarkdown    1.9        2018-03-01 CRAN (R 3.4.3)                     
##  rprojroot    1.3-2      2018-01-03 CRAN (R 3.4.3)                     
##  rstudioapi   0.7        2017-09-07 CRAN (R 3.4.1)                     
##  rvest        0.3.2      2016-06-17 CRAN (R 3.4.0)                     
##  scales       0.5.0      2017-08-24 cran (@0.5.0)                      
##  stats      * 3.4.3      2017-12-07 local                              
##  stringi      1.1.7      2018-03-12 CRAN (R 3.4.3)                     
##  stringr    * 1.3.0      2018-02-19 CRAN (R 3.4.3)                     
##  tibble     * 1.4.2      2018-01-22 CRAN (R 3.4.3)                     
##  tidyr      * 0.8.0      2018-01-29 CRAN (R 3.4.3)                     
##  tidyverse  * 1.2.1      2017-11-14 CRAN (R 3.4.2)                     
##  tools        3.4.3      2017-12-07 local                              
##  utils      * 3.4.3      2017-12-07 local                              
##  withr        2.1.1      2017-12-19 CRAN (R 3.4.3)                     
##  xml2         1.2.0      2018-01-24 CRAN (R 3.4.3)                     
##  yaml         2.1.18     2018-03-08 CRAN (R 3.4.4)

This work is licensed under the CC BY-NC 4.0 Creative Commons License.