library(tidyverse)
The Department of Education collects annual statistics on colleges and universities in the United States. I have included a subset of this data from 2013 in the rcfss library from GitHub. To install the package, run the command devtools::install_github("uc-cfss/rcfss") in the console.
If you don’t already have the devtools library installed, you will get an error. Go back and install this first using install.packages("devtools"), then run devtools::install_github("uc-cfss/rcfss").
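The install order described above can be guarded so that nothing is reinstalled unnecessarily. The following is a minimal sketch; missing_packages() is a hypothetical helper introduced for illustration and is not part of devtools or rcfss.

```r
# Hypothetical helper: report which of the requested packages are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Usage (run in the console; left commented out because it needs network access):
# if (length(missing_packages("devtools")) > 0) install.packages("devtools")
# if (length(missing_packages("rcfss")) > 0) devtools::install_github("uc-cfss/rcfss")
```

Checking with requireNamespace() first avoids the error described above, since devtools is installed from CRAN before install_github() is ever called.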
library(rcfss)
data("scorecard")
glimpse(scorecard)
## Observations: 1,849
## Variables: 12
## $ unitid <int> 450234, 448479, 456427, 459596, 459851, 482477, 4825...
## $ name <chr> "ITT Technical Institute-Wichita", "ITT Technical In...
## $ state <chr> "KS", "MI", "CA", "FL", "WI", "IL", "NV", "OR", "TN"...
## $ type <chr> "Private, for-profit", "Private, for-profit", "Priva...
## $ cost <int> 28306, 26994, 26353, 28894, 23928, 25625, 24265, NA,...
## $ admrate <dbl> 81.31, 98.31, 89.26, 58.37, 68.75, 70.40, 80.00, 50....
## $ satavg <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ avgfacsal <dbl> 45054, 52857, NA, 47196, 55089, 62793, 47556, 60003,...
## $ pctpell <dbl> 0.8030, 0.7735, 0.7038, 0.7781, 0.6098, 0.6411, 0.63...
## $ comprate <dbl> 0.6000, 0.3359, NA, NA, NA, 0.2939, 0.6364, 0.0000, ...
## $ firstgen <dbl> 0.5057590, 0.5057590, 0.5057590, 0.5057590, 0.517160...
## $ debt <dbl> 13000, 13000, 13000, 13000, 9500, 14250, 14250, 1425...
glimpse() is part of the tibble package and is a transposed version of print(): columns run down the page, and data runs across. When a data frame has many columns, there is sometimes not enough horizontal space on the screen to print each one. By transposing the data frame, we can see all the columns and the values recorded for the initial rows.
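To see the difference between the two views, here is a small sketch on a made-up tibble. The schools object and its values are assumptions for illustration, not the real scorecard data.

```r
library(dplyr)

# Hypothetical toy data, not the real scorecard values.
schools <- tibble(
  name = c("College A", "College B", "College C"),
  type = c("Public", "Private, nonprofit", "Public"),
  cost = c(15000L, 42000L, NA)
)

print(schools)    # one row per observation, columns run across the page
glimpse(schools)  # one line per column, values run across the page

# glimpse() returns its input invisibly, so it can sit in the middle of a pipe.
same <- schools %>%
  glimpse() %>%
  filter(!is.na(cost))
```

Because glimpse() passes the data through unchanged, it can be dropped into a longer pipeline to inspect an intermediate result.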
Type ?scorecard in the console to open up the help file for this data set. This includes the documentation for all the variables. Use your knowledge of the dplyr functions to perform the following tasks.
We could use a combination of arrange() and slice() to sort the data frame from most to least expensive, then keep the first 10 rows:
arrange(scorecard, desc(cost)) %>%
  slice(1:10)
## # A tibble: 10 x 12
##    unitid name                                        state
##     <int> <chr>                                       <chr>
##  1 195304 Sarah Lawrence College                      NY
##  2 179867 Washington University in St Louis           MO
##  3 144050 University of Chicago                       IL
##  4 190150 Columbia University in the City of New York NY
##  5 182670 Dartmouth College                           NH
##  6 130697 Wesleyan University                         CT
##  7 147767 Northwestern University                     IL
##  8 120254 Occidental College                          CA
##  9 115409 Harvey Mudd College                         CA
## 10 230816 Bennington College                          VT
## # ... with 9 more variables: type <chr>, cost <int>, admrate <dbl>,
## #   satavg <dbl>, avgfacsal <dbl>, pctpell <dbl>, comprate <dbl>,
## #   firstgen <dbl>, debt <dbl>
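The scorecard data contains missing costs, so it may help to see how the arrange() and slice() approach treats them. The sketch below uses a made-up tibble (names and costs are assumptions); arrange() always places NA values last, even inside desc(), so schools with no reported cost do not appear at the top.

```r
library(dplyr)

# Hypothetical toy data with one missing cost.
schools <- tibble(
  name = c("A", "B", "C", "D", "E"),
  cost = c(30000, NA, 62000, 45000, 12000)
)

# Sort from most to least expensive; the NA row sorts to the end.
sorted <- arrange(schools, desc(cost))

# Keep the three most expensive schools.
top3 <- slice(sorted, 1:3)
top3$name
```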
We can also use the top_n() function in dplyr to accomplish the same thing in one line of code.
top_n(scorecard, n = 10, wt = cost)
## # A tibble: 10 x 12
##    unitid name                                        state
##     <int> <chr>                                       <chr>
##  1 120254 Occidental College                          CA
##  2 195304 Sarah Lawrence College                      NY
##  3 115409 Harvey Mudd College                         CA
##  4 130697 Wesleyan University                         CT
##  5 147767 Northwestern University                     IL
##  6 144050 University of Chicago                       IL
##  7 230816 Bennington College                          VT
##  8 182670 Dartmouth College                           NH
##  9 179867 Washington University in St Louis           MO
## 10 190150 Columbia University in the City of New York NY
## # ... with 9 more variables: type <chr>, cost <int>, admrate <dbl>,
## #   satavg <dbl>, avgfacsal <dbl>, pctpell <dbl>, comprate <dbl>,
## #   firstgen <dbl>, debt <dbl>
Notice that the resulting data frame is not sorted from most to least expensive. Instead, it keeps the original row order of the data frame, but still contains only the 10 most expensive schools based on cost.
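A small sketch makes the ordering behavior concrete. The tibble below is made up for illustration; it also includes a tie, because top_n() keeps every tied row and can therefore return more than n rows.

```r
library(dplyr)

# Hypothetical toy data; note the tie between C and E.
schools <- tibble(
  name = c("A", "B", "C", "D", "E"),
  cost = c(30000, 62000, 45000, 12000, 45000)
)

# top_n() keeps rows in their original order and keeps all tied rows.
top2 <- top_n(schools, n = 2, wt = cost)
top2$name

# Add arrange() when a sorted result is needed.
top2_sorted <- arrange(top2, desc(cost))
```

In dplyr 1.0.0 and later, top_n() is marked as superseded in favor of slice_max() and slice_min(), though it remains available.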
scorecard %>%
  group_by(type) %>%
  summarize(mean_sat = mean(satavg, na.rm = TRUE))
## # A tibble: 3 x 2
##   type                mean_sat
##   <chr>                  <dbl>
## 1 Private, for-profit 1002.500
## 2 Private, nonprofit  1075.287
## 3 Public              1037.410
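The na.rm = TRUE argument matters here because many schools do not report an average SAT score. As a rough illustration on made-up data (the values and the n_reporting column are assumptions), the sketch below compares the mean with and without missing values removed and counts how many schools contribute to each average.

```r
library(dplyr)

# Hypothetical toy data with one missing SAT average.
schools <- tibble(
  type = c("Public", "Public", "Private, nonprofit", "Private, nonprofit"),
  satavg = c(1000, NA, 1100, 1200)
)

by_type <- schools %>%
  group_by(type) %>%
  summarize(
    mean_with_na = mean(satavg),               # NA if any value in the group is missing
    mean_sat = mean(satavg, na.rm = TRUE),     # missing values dropped first
    n_reporting = sum(!is.na(satavg))          # schools that actually reported a score
  )
by_type
```

Reporting the count alongside the mean shows how much data each group average actually rests on.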
scorecard %>%
  mutate(ratio = avgfacsal / cost) %>%
  select(name, ratio)
## # A tibble: 1,849 x 2
##    name                                    ratio
##    <chr>                                   <dbl>
##  1 ITT Technical Institute-Wichita      1.591677
##  2 ITT Technical Institute-Swartz Creek 1.958102
##  3 ITT Technical Institute-Concord            NA
##  4 ITT Technical Institute-Tallahassee  1.633419
##  5 Herzing University-Brookfield        2.302282
##  6 DeVry University-Illinois            2.450459
##  7 DeVry University-Nevada              1.959860
##  8 DeVry University-Oregon                    NA
##  9 DeVry University-Tennessee           2.461993
## 10 DeVry University-Washington          2.552843
## # ... with 1,839 more rows
Hint: the result should be a data frame with one row for the University of Chicago, and a column containing the requested value.
scorecard %>%
  filter(type == "Private, nonprofit") %>%
  arrange(cost) %>%
  mutate(school_cheaper = row_number()) %>%
  filter(name == "University of Chicago") %>%
  glimpse()
## Observations: 1
## Variables: 13
## $ unitid <int> 144050
## $ name <chr> "University of Chicago"
## $ state <chr> "IL"
## $ type <chr> "Private, nonprofit"
## $ cost <int> 62425
## $ admrate <dbl> 8.81
## $ satavg <dbl> 1504
## $ avgfacsal <dbl> 153738
## $ pctpell <dbl> 0.1419
## $ comprate <dbl> 0.9268
## $ firstgen <dbl> 0.1185808
## $ debt <dbl> 16350
## $ school_cheaper <int> 1078
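In the pipeline above, row_number() after arrange(cost) gives the school's position in the sorted list, which counts the school itself. If the goal is the number of schools that are strictly cheaper, one possible variant uses min_rank(cost) - 1, which also handles ties. The sketch below uses made-up data; "Target U" and the n_cheaper column are hypothetical names.

```r
library(dplyr)

# Hypothetical toy data with a tie and a missing cost.
schools <- tibble(
  name = c("A", "B", "C", "Target U", "E"),
  cost = c(20000, 35000, 35000, 50000, NA)
)

ranked <- schools %>%
  arrange(cost) %>%
  mutate(
    position = row_number(),        # 1-based position after sorting
    n_cheaper = min_rank(cost) - 1  # schools with a strictly lower cost
  )

target <- filter(ranked, name == "Target U")
target
```

Here the target school sits in position 4, while three schools are strictly cheaper. The two numbers differ by one when the target's cost is unique.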
scorecard %>%
  filter(type == "Private, nonprofit") %>%
  mutate(cost_rank = percent_rank(cost)) %>%
  filter(name == "University of Chicago") %>%
  glimpse()
## Observations: 1
## Variables: 13
## $ unitid <int> 144050
## $ name <chr> "University of Chicago"
## $ state <chr> "IL"
## $ type <chr> "Private, nonprofit"
## $ cost <int> 62425
## $ admrate <dbl> 8.81
## $ satavg <dbl> 1504
## $ avgfacsal <dbl> 153738
## $ pctpell <dbl> 0.1419
## $ comprate <dbl> 0.9268
## $ firstgen <dbl> 0.1185808
## $ debt <dbl> 16350
## $ cost_rank <dbl> 0.9981464
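percent_rank() rescales min_rank() to the interval from 0 to 1: it takes the number of values strictly smaller than each element and divides by the number of non-missing values minus one. The sketch below checks this on a made-up vector. The value 0.9981464 shown above would be consistent with 1077 cheaper schools out of 1080 with a reported cost (1077 / 1079), although that breakdown is inferred rather than stated in the output.

```r
library(dplyr)

# Hypothetical costs with a tie and a missing value.
cost <- c(20000, 35000, 35000, 50000, NA)

# Manual version of the percent rank: (rank - 1) / (non-missing count - 1).
manual <- (min_rank(cost) - 1) / (sum(!is.na(cost)) - 1)

percent_rank(cost)
all.equal(percent_rank(cost), manual)
```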
devtools::session_info()
## Session info -------------------------------------------------------------
## setting value
## version R version 3.4.3 (2017-11-30)
## system x86_64, darwin15.6.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## tz America/Chicago
## date 2018-04-02
## Packages -----------------------------------------------------------------
## package * version date source
## assertthat 0.2.0 2017-04-11 CRAN (R 3.4.0)
## backports 1.1.2 2017-12-13 CRAN (R 3.4.3)
## base * 3.4.3 2017-12-07 local
## bindr 0.1.1 2018-03-13 CRAN (R 3.4.3)
## bindrcpp * 0.2 2017-06-17 CRAN (R 3.4.0)
## broom 0.4.4 2018-03-29 CRAN (R 3.4.3)
## cellranger 1.1.0 2016-07-27 CRAN (R 3.4.0)
## cli 1.0.0 2017-11-05 CRAN (R 3.4.2)
## codetools 0.2-15 2016-10-05 CRAN (R 3.4.3)
## colorspace 1.3-2 2016-12-14 CRAN (R 3.4.0)
## compiler 3.4.3 2017-12-07 local
## crayon 1.3.4 2017-10-03 Github (gaborcsardi/crayon@b5221ab)
## datasets * 3.4.3 2017-12-07 local
## devtools 1.13.5 2018-02-18 CRAN (R 3.4.3)
## digest 0.6.15 2018-01-28 CRAN (R 3.4.3)
## dplyr * 0.7.4.9000 2017-10-03 Github (tidyverse/dplyr@1a0730a)
## evaluate 0.10.1 2017-06-24 CRAN (R 3.4.1)
## forcats * 0.3.0 2018-02-19 CRAN (R 3.4.3)
## foreign 0.8-69 2017-06-22 CRAN (R 3.4.3)
## ggplot2 * 2.2.1 2016-12-30 CRAN (R 3.4.0)
## glue 1.2.0 2017-10-29 CRAN (R 3.4.2)
## graphics * 3.4.3 2017-12-07 local
## grDevices * 3.4.3 2017-12-07 local
## grid 3.4.3 2017-12-07 local
## gtable 0.2.0 2016-02-26 CRAN (R 3.4.0)
## haven 1.1.1 2018-01-18 CRAN (R 3.4.3)
## hms 0.4.2 2018-03-10 CRAN (R 3.4.3)
## htmltools 0.3.6 2017-04-28 CRAN (R 3.4.0)
## httr 1.3.1 2017-08-20 CRAN (R 3.4.1)
## jsonlite 1.5 2017-06-01 CRAN (R 3.4.0)
## knitr 1.20 2018-02-20 CRAN (R 3.4.3)
## lattice 0.20-35 2017-03-25 CRAN (R 3.4.3)
## lazyeval 0.2.1 2017-10-29 CRAN (R 3.4.2)
## lubridate 1.7.3 2018-02-27 CRAN (R 3.4.3)
## magrittr 1.5 2014-11-22 CRAN (R 3.4.0)
## memoise 1.1.0 2017-04-21 CRAN (R 3.4.0)
## methods * 3.4.3 2017-12-07 local
## mnormt 1.5-5 2016-10-15 CRAN (R 3.4.0)
## modelr 0.1.1 2017-08-10 local
## munsell 0.4.3 2016-02-13 CRAN (R 3.4.0)
## nlme 3.1-131.1 2018-02-16 CRAN (R 3.4.3)
## parallel 3.4.3 2017-12-07 local
## pillar 1.2.1 2018-02-27 CRAN (R 3.4.3)
## pkgconfig 2.0.1 2017-03-21 CRAN (R 3.4.0)
## plyr 1.8.4 2016-06-08 CRAN (R 3.4.0)
## psych 1.7.8 2017-09-09 CRAN (R 3.4.1)
## purrr * 0.2.4 2017-10-18 CRAN (R 3.4.2)
## R6 2.2.2 2017-06-17 CRAN (R 3.4.0)
## rcfss * 0.1.5 2017-07-31 local
## Rcpp 0.12.16 2018-03-13 CRAN (R 3.4.4)
## readr * 1.1.1 2017-05-16 CRAN (R 3.4.0)
## readxl 1.0.0 2017-04-18 CRAN (R 3.4.0)
## reshape2 1.4.3 2017-12-11 CRAN (R 3.4.3)
## rlang 0.2.0 2018-02-20 cran (@0.2.0)
## rmarkdown 1.9 2018-03-01 CRAN (R 3.4.3)
## rprojroot 1.3-2 2018-01-03 CRAN (R 3.4.3)
## rstudioapi 0.7 2017-09-07 CRAN (R 3.4.1)
## rvest 0.3.2 2016-06-17 CRAN (R 3.4.0)
## scales 0.5.0 2017-08-24 cran (@0.5.0)
## stats * 3.4.3 2017-12-07 local
## stringi 1.1.7 2018-03-12 CRAN (R 3.4.3)
## stringr * 1.3.0 2018-02-19 CRAN (R 3.4.3)
## tibble * 1.4.2 2018-01-22 CRAN (R 3.4.3)
## tidyr * 0.8.0 2018-01-29 CRAN (R 3.4.3)
## tidyverse * 1.2.1 2017-11-14 CRAN (R 3.4.2)
## tools 3.4.3 2017-12-07 local
## utils * 3.4.3 2017-12-07 local
## withr 2.1.2 2018-03-15 CRAN (R 3.4.4)
## xml2 1.2.0 2018-01-24 CRAN (R 3.4.3)
## yaml 2.1.18 2018-03-08 CRAN (R 3.4.4)
This work is licensed under the CC BY-NC 4.0 Creative Commons License.