library(tidyverse)

The Department of Education collects annual statistics on colleges and universities in the United States. I have included a subset of these data from 2013 in the rcfss package, which is hosted on GitHub. To install the package, run devtools::install_github("uc-cfss/rcfss") in the console.

If you don’t already have the devtools package installed, this command will fail with an error. Install devtools first using install.packages("devtools"), then run devtools::install_github("uc-cfss/rcfss") again.

library(rcfss)
data("scorecard")
glimpse(scorecard)
## Observations: 1,849
## Variables: 12
## $ unitid    <int> 450234, 448479, 456427, 459596, 459851, 482477, 4825...
## $ name      <chr> "ITT Technical Institute-Wichita", "ITT Technical In...
## $ state     <chr> "KS", "MI", "CA", "FL", "WI", "IL", "NV", "OR", "TN"...
## $ type      <chr> "Private, for-profit", "Private, for-profit", "Priva...
## $ cost      <int> 28306, 26994, 26353, 28894, 23928, 25625, 24265, NA,...
## $ admrate   <dbl> 81.31, 98.31, 89.26, 58.37, 68.75, 70.40, 80.00, 50....
## $ satavg    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ avgfacsal <dbl> 45054, 52857, NA, 47196, 55089, 62793, 47556, 60003,...
## $ pctpell   <dbl> 0.8030, 0.7735, 0.7038, 0.7781, 0.6098, 0.6411, 0.63...
## $ comprate  <dbl> 0.6000, 0.3359, NA, NA, NA, 0.2939, 0.6364, 0.0000, ...
## $ firstgen  <dbl> 0.5057590, 0.5057590, 0.5057590, 0.5057590, 0.517160...
## $ debt      <dbl> 13000, 13000, 13000, 13000, 9500, 14250, 14250, 1425...

glimpse() is part of the tibble package and is a transposed version of print(): columns run down the page, and data runs across. When a data frame has many columns, there is sometimes not enough horizontal space on the screen to print each one. By transposing the data frame, we can see every column along with the values recorded for the first few rows.

Type ?scorecard in the console to open up the help file for this data set. This includes the documentation for all the variables. Use your knowledge of the dplyr functions to perform the following tasks.

Generate a data frame of schools with a greater than 40% share of first-generation students

filter(scorecard, firstgen > .40)
## # A tibble: 578 x 12
##    unitid                                 name state                type
##     <int>                                <chr> <chr>               <chr>
##  1 450234      ITT Technical Institute-Wichita    KS Private, for-profit
##  2 448479 ITT Technical Institute-Swartz Creek    MI Private, for-profit
##  3 456427      ITT Technical Institute-Concord    CA Private, for-profit
##  4 459596  ITT Technical Institute-Tallahassee    FL Private, for-profit
##  5 459851        Herzing University-Brookfield    WI Private, for-profit
##  6 482477            DeVry University-Illinois    IL Private, for-profit
##  7 482547              DeVry University-Nevada    NV Private, for-profit
##  8 482592              DeVry University-Oregon    OR Private, for-profit
##  9 482617           DeVry University-Tennessee    TN Private, for-profit
## 10 482662          DeVry University-Washington    WA Private, for-profit
## # ... with 568 more rows, and 8 more variables: cost <int>, admrate <dbl>,
## #   satavg <dbl>, avgfacsal <dbl>, pctpell <dbl>, comprate <dbl>,
## #   firstgen <dbl>, debt <dbl>
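
One detail worth keeping in mind: filter() keeps only the rows where the condition evaluates to TRUE, so any school with a missing firstgen value is dropped rather than kept. The sketch below illustrates this on a small made-up tibble. The `toy` data frame and its values are hypothetical and are not taken from scorecard.

```r
library(dplyr)

# Hypothetical toy data standing in for scorecard
toy <- tibble(
  name = c("School A", "School B", "School C", "School D"),
  firstgen = c(0.55, 0.30, NA, 0.41)
)

# Rows where the comparison is FALSE (School B) or NA (School C) are dropped
first_gen_heavy <- filter(toy, firstgen > .40)
first_gen_heavy$name
```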

Generate a data frame with the 10 most expensive colleges in 2013

We could use a combination of arrange() and slice() to sort the data frame from most to least expensive, then keep the first 10 rows:

arrange(scorecard, desc(cost)) %>%
  slice(1:10)
## # A tibble: 10 x 12
##    unitid                                        name state
##     <int>                                       <chr> <chr>
##  1 195304                      Sarah Lawrence College    NY
##  2 179867           Washington University in St Louis    MO
##  3 144050                       University of Chicago    IL
##  4 190150 Columbia University in the City of New York    NY
##  5 182670                           Dartmouth College    NH
##  6 130697                         Wesleyan University    CT
##  7 147767                     Northwestern University    IL
##  8 120254                          Occidental College    CA
##  9 115409                         Harvey Mudd College    CA
## 10 230816                          Bennington College    VT
## # ... with 9 more variables: type <chr>, cost <int>, admrate <dbl>,
## #   satavg <dbl>, avgfacsal <dbl>, pctpell <dbl>, comprate <dbl>,
## #   firstgen <dbl>, debt <dbl>

We can also use the top_n() function in dplyr to accomplish the same thing in one line of code.

top_n(scorecard, n = 10, wt = cost)
## # A tibble: 10 x 12
##    unitid                                        name state
##     <int>                                       <chr> <chr>
##  1 120254                          Occidental College    CA
##  2 195304                      Sarah Lawrence College    NY
##  3 115409                         Harvey Mudd College    CA
##  4 130697                         Wesleyan University    CT
##  5 147767                     Northwestern University    IL
##  6 144050                       University of Chicago    IL
##  7 230816                          Bennington College    VT
##  8 182670                           Dartmouth College    NH
##  9 179867           Washington University in St Louis    MO
## 10 190150 Columbia University in the City of New York    NY
## # ... with 9 more variables: type <chr>, cost <int>, admrate <dbl>,
## #   satavg <dbl>, avgfacsal <dbl>, pctpell <dbl>, comprate <dbl>,
## #   firstgen <dbl>, debt <dbl>

Notice that the resulting data frame is not sorted from most to least expensive. Instead, the rows keep their original order from the data frame, but the result still contains only the 10 most expensive schools based on cost.
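
If a sorted result is wanted, one option is to add arrange() after top_n(). In dplyr 1.0.0 and later, top_n() is superseded by slice_max(), which returns the selected rows ordered from largest to smallest. The sketch below assumes such a newer dplyr version and uses a made-up tibble with hypothetical school names and costs.

```r
library(dplyr)

# Hypothetical toy data standing in for scorecard
toy <- tibble(
  name = c("School A", "School B", "School C", "School D"),
  cost = c(30000, 62000, 45000, 58000)
)

# Option 1: top_n() keeps the original row order, so sort explicitly afterwards
top2_sorted <- toy %>%
  top_n(n = 2, wt = cost) %>%
  arrange(desc(cost))

# Option 2 (assumes dplyr >= 1.0.0): slice_max() selects and orders in one step
top2_slice <- slice_max(toy, order_by = cost, n = 2)
```

Both approaches select the same two schools here. Which one to use is mostly a question of which dplyr version is installed.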

Generate a data frame with the average SAT score for each type of college

scorecard %>%
  group_by(type) %>%
  summarize(mean_sat = mean(satavg, na.rm = TRUE))
## # A tibble: 3 x 2
##                  type mean_sat
##                 <chr>    <dbl>
## 1 Private, for-profit 1002.500
## 2  Private, nonprofit 1075.287
## 3              Public 1037.410
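
The na.rm = TRUE argument matters here because many schools have no recorded satavg, and mean() returns NA as soon as one input is missing. A minimal sketch with made-up values (the `toy` tibble is hypothetical) shows the difference:

```r
library(dplyr)

# Hypothetical toy data: one public school is missing its SAT average
toy <- tibble(
  type = c("Public", "Public", "Private, nonprofit", "Private, nonprofit"),
  satavg = c(1000, NA, 1100, 1200)
)

sat_by_type <- toy %>%
  group_by(type) %>%
  summarize(
    mean_keep_na = mean(satavg),               # NA for any group containing an NA
    mean_drop_na = mean(satavg, na.rm = TRUE)  # mean of the non-missing values only
  )
```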

Calculate for each school how many students it takes to pay the average faculty member’s salary and generate a data frame with the school’s name and the calculated value

scorecard %>%
  mutate(ratio = avgfacsal / cost) %>%
  select(name, ratio)
## # A tibble: 1,849 x 2
##                                    name    ratio
##                                   <chr>    <dbl>
##  1      ITT Technical Institute-Wichita 1.591677
##  2 ITT Technical Institute-Swartz Creek 1.958102
##  3      ITT Technical Institute-Concord       NA
##  4  ITT Technical Institute-Tallahassee 1.633419
##  5        Herzing University-Brookfield 2.302282
##  6            DeVry University-Illinois 2.450459
##  7              DeVry University-Nevada 1.959860
##  8              DeVry University-Oregon       NA
##  9           DeVry University-Tennessee 2.461993
## 10          DeVry University-Washington 2.552843
## # ... with 1,839 more rows
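
To read the ratio, take the first row: with the cost (28306) and avgfacsal (45054) values shown in the glimpse() output above, roughly 1.6 students paying the full cost would cover one average faculty salary. The ratio is NA wherever either input is missing. A quick check of that arithmetic, using a two-row tibble typed in by hand from the printed values:

```r
library(dplyr)

# Values copied from the first and third schools in the glimpse() output
check <- tibble(
  name = c("ITT Technical Institute-Wichita", "ITT Technical Institute-Concord"),
  cost = c(28306L, 26353L),
  avgfacsal = c(45054, NA)
) %>%
  mutate(ratio = avgfacsal / cost)  # NA in avgfacsal propagates to ratio

check
```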

Calculate how many private, nonprofit schools have a smaller cost than the University of Chicago

Hint: the result should be a data frame with one row for the University of Chicago, and a column containing the requested value.

Report the number as the total number of schools

scorecard %>%
  filter(type == "Private, nonprofit") %>%
  arrange(cost) %>%
  mutate(school_cheaper = row_number()) %>%
  filter(name == "University of Chicago") %>%
  glimpse()
## Observations: 1
## Variables: 13
## $ unitid         <int> 144050
## $ name           <chr> "University of Chicago"
## $ state          <chr> "IL"
## $ type           <chr> "Private, nonprofit"
## $ cost           <int> 62425
## $ admrate        <dbl> 8.81
## $ satavg         <dbl> 1504
## $ avgfacsal      <dbl> 153738
## $ pctpell        <dbl> 0.1419
## $ comprate       <dbl> 0.9268
## $ firstgen       <dbl> 0.1185808
## $ debt           <dbl> 16350
## $ school_cheaper <int> 1078
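
Note that after arrange(cost), row_number() gives the school's position in the sorted list, so the number of strictly cheaper schools is one less than that position when no other school shares its cost. An alternative that avoids this adjustment is to count the cheaper schools directly. The sketch below is only an illustration on a made-up tibble, with a hypothetical school named "Target U" standing in for the University of Chicago.

```r
library(dplyr)

# Hypothetical toy data standing in for scorecard
toy <- tibble(
  name = c("School A", "School B", "Target U", "School C", "School D"),
  type = c("Private, nonprofit", "Private, nonprofit", "Private, nonprofit",
           "Public", "Private, nonprofit"),
  cost = c(20000, 35000, 50000, 10000, NA)
)

cheaper <- toy %>%
  filter(type == "Private, nonprofit") %>%
  # Compare each school's cost with the target's cost; missing costs are ignored
  mutate(school_cheaper = sum(cost < cost[name == "Target U"], na.rm = TRUE)) %>%
  filter(name == "Target U")

cheaper$school_cheaper
```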

Report the number as the percentage of schools

scorecard %>%
  filter(type == "Private, nonprofit") %>%
  mutate(cost_rank = percent_rank(cost)) %>%
  filter(name == "University of Chicago") %>%
  glimpse()
## Observations: 1
## Variables: 13
## $ unitid    <int> 144050
## $ name      <chr> "University of Chicago"
## $ state     <chr> "IL"
## $ type      <chr> "Private, nonprofit"
## $ cost      <int> 62425
## $ admrate   <dbl> 8.81
## $ satavg    <dbl> 1504
## $ avgfacsal <dbl> 153738
## $ pctpell   <dbl> 0.1419
## $ comprate  <dbl> 0.9268
## $ firstgen  <dbl> 0.1185808
## $ debt      <dbl> 16350
## $ cost_rank <dbl> 0.9981464
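
percent_rank() rescales min_rank() to the range from 0 to 1, computed as (rank - 1) / (n - 1) over the non-missing values. A cost_rank of about 0.998 therefore means that roughly 99.8% of the other private, nonprofit schools with a recorded cost are cheaper. A minimal sketch on made-up costs (the `cost` vector here is hypothetical) compares the function with the manual calculation:

```r
library(dplyr)

# Hypothetical costs, including one missing value
cost <- c(20000, 35000, 50000, 42000, NA)

# Built-in percentile rank
pr <- percent_rank(cost)

# Same quantity by hand: (rank - 1) / (number of non-missing values - 1)
manual <- (min_rank(cost) - 1) / (sum(!is.na(cost)) - 1)

pr
```

The cheapest school gets 0, the most expensive gets 1, and a missing cost stays NA.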

Session Info

devtools::session_info()
## Session info -------------------------------------------------------------
##  setting  value                       
##  version  R version 3.4.3 (2017-11-30)
##  system   x86_64, darwin15.6.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/Chicago             
##  date     2018-04-02
## Packages -----------------------------------------------------------------
##  package    * version    date       source                             
##  assertthat   0.2.0      2017-04-11 CRAN (R 3.4.0)                     
##  backports    1.1.2      2017-12-13 CRAN (R 3.4.3)                     
##  base       * 3.4.3      2017-12-07 local                              
##  bindr        0.1.1      2018-03-13 CRAN (R 3.4.3)                     
##  bindrcpp   * 0.2        2017-06-17 CRAN (R 3.4.0)                     
##  broom        0.4.4      2018-03-29 CRAN (R 3.4.3)                     
##  cellranger   1.1.0      2016-07-27 CRAN (R 3.4.0)                     
##  cli          1.0.0      2017-11-05 CRAN (R 3.4.2)                     
##  codetools    0.2-15     2016-10-05 CRAN (R 3.4.3)                     
##  colorspace   1.3-2      2016-12-14 CRAN (R 3.4.0)                     
##  compiler     3.4.3      2017-12-07 local                              
##  crayon       1.3.4      2017-10-03 Github (gaborcsardi/crayon@b5221ab)
##  datasets   * 3.4.3      2017-12-07 local                              
##  devtools     1.13.5     2018-02-18 CRAN (R 3.4.3)                     
##  digest       0.6.15     2018-01-28 CRAN (R 3.4.3)                     
##  dplyr      * 0.7.4.9000 2017-10-03 Github (tidyverse/dplyr@1a0730a)   
##  evaluate     0.10.1     2017-06-24 CRAN (R 3.4.1)                     
##  forcats    * 0.3.0      2018-02-19 CRAN (R 3.4.3)                     
##  foreign      0.8-69     2017-06-22 CRAN (R 3.4.3)                     
##  ggplot2    * 2.2.1      2016-12-30 CRAN (R 3.4.0)                     
##  glue         1.2.0      2017-10-29 CRAN (R 3.4.2)                     
##  graphics   * 3.4.3      2017-12-07 local                              
##  grDevices  * 3.4.3      2017-12-07 local                              
##  grid         3.4.3      2017-12-07 local                              
##  gtable       0.2.0      2016-02-26 CRAN (R 3.4.0)                     
##  haven        1.1.1      2018-01-18 CRAN (R 3.4.3)                     
##  hms          0.4.2      2018-03-10 CRAN (R 3.4.3)                     
##  htmltools    0.3.6      2017-04-28 CRAN (R 3.4.0)                     
##  httr         1.3.1      2017-08-20 CRAN (R 3.4.1)                     
##  jsonlite     1.5        2017-06-01 CRAN (R 3.4.0)                     
##  knitr        1.20       2018-02-20 CRAN (R 3.4.3)                     
##  lattice      0.20-35    2017-03-25 CRAN (R 3.4.3)                     
##  lazyeval     0.2.1      2017-10-29 CRAN (R 3.4.2)                     
##  lubridate    1.7.3      2018-02-27 CRAN (R 3.4.3)                     
##  magrittr     1.5        2014-11-22 CRAN (R 3.4.0)                     
##  memoise      1.1.0      2017-04-21 CRAN (R 3.4.0)                     
##  methods    * 3.4.3      2017-12-07 local                              
##  mnormt       1.5-5      2016-10-15 CRAN (R 3.4.0)                     
##  modelr       0.1.1      2017-08-10 local                              
##  munsell      0.4.3      2016-02-13 CRAN (R 3.4.0)                     
##  nlme         3.1-131.1  2018-02-16 CRAN (R 3.4.3)                     
##  parallel     3.4.3      2017-12-07 local                              
##  pillar       1.2.1      2018-02-27 CRAN (R 3.4.3)                     
##  pkgconfig    2.0.1      2017-03-21 CRAN (R 3.4.0)                     
##  plyr         1.8.4      2016-06-08 CRAN (R 3.4.0)                     
##  psych        1.7.8      2017-09-09 CRAN (R 3.4.1)                     
##  purrr      * 0.2.4      2017-10-18 CRAN (R 3.4.2)                     
##  R6           2.2.2      2017-06-17 CRAN (R 3.4.0)                     
##  rcfss      * 0.1.5      2017-07-31 local                              
##  Rcpp         0.12.16    2018-03-13 CRAN (R 3.4.4)                     
##  readr      * 1.1.1      2017-05-16 CRAN (R 3.4.0)                     
##  readxl       1.0.0      2017-04-18 CRAN (R 3.4.0)                     
##  reshape2     1.4.3      2017-12-11 CRAN (R 3.4.3)                     
##  rlang        0.2.0      2018-02-20 cran (@0.2.0)                      
##  rmarkdown    1.9        2018-03-01 CRAN (R 3.4.3)                     
##  rprojroot    1.3-2      2018-01-03 CRAN (R 3.4.3)                     
##  rstudioapi   0.7        2017-09-07 CRAN (R 3.4.1)                     
##  rvest        0.3.2      2016-06-17 CRAN (R 3.4.0)                     
##  scales       0.5.0      2017-08-24 cran (@0.5.0)                      
##  stats      * 3.4.3      2017-12-07 local                              
##  stringi      1.1.7      2018-03-12 CRAN (R 3.4.3)                     
##  stringr    * 1.3.0      2018-02-19 CRAN (R 3.4.3)                     
##  tibble     * 1.4.2      2018-01-22 CRAN (R 3.4.3)                     
##  tidyr      * 0.8.0      2018-01-29 CRAN (R 3.4.3)                     
##  tidyverse  * 1.2.1      2017-11-14 CRAN (R 3.4.2)                     
##  tools        3.4.3      2017-12-07 local                              
##  utils      * 3.4.3      2017-12-07 local                              
##  withr        2.1.2      2018-03-15 CRAN (R 3.4.4)                     
##  xml2         1.2.0      2018-01-24 CRAN (R 3.4.3)                     
##  yaml         2.1.18     2018-03-08 CRAN (R 3.4.4)

This work is licensed under the CC BY-NC 4.0 Creative Commons License.