library(tidyverse)
The Department of Education collects annual statistics on colleges and universities in the United States. I have included a subset of this data from 2013 in the rcfss library from GitHub. To install the package, run the command devtools::install_github("uc-cfss/rcfss") in the console.
If you don’t already have the devtools library installed, you will get an error. Go back and install this first using install.packages("devtools"), then run devtools::install_github("uc-cfss/rcfss").
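The install steps can also be written so that each package is installed only when it is missing. This is a minimal sketch, not part of the original exercise: install_if_missing is a hypothetical helper introduced here for illustration, and the interactive() guard is an assumption that keeps the installs from running while a script is sourced or knitted.

```r
# Hypothetical helper: call `installer` only when `pkg` is not already installed.
# Returns TRUE (invisibly) if an install was attempted, FALSE otherwise.
install_if_missing <- function(pkg, installer) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(invisible(FALSE))
  }
  installer()
  invisible(TRUE)
}

# Only attempt installs from the console, never from a non-interactive run
if (interactive()) {
  install_if_missing("devtools", function() install.packages("devtools"))
  install_if_missing("rcfss", function() devtools::install_github("uc-cfss/rcfss"))
}
```

Run this way, devtools is checked first, so the install_github() call should not fail for lack of it.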
library(rcfss)
data("scorecard")
str(scorecard)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1849 obs. of 12 variables:
## $ unitid : int 450234 448479 456427 459596 459851 482477 482547 482592 482617 482662 ...
## $ name : chr "ITT Technical Institute-Wichita" "ITT Technical Institute-Swartz Creek" "ITT Technical Institute-Concord" "ITT Technical Institute-Tallahassee" ...
## $ state : chr "KS" "MI" "CA" "FL" ...
## $ type : chr "Private, for-profit" "Private, for-profit" "Private, for-profit" "Private, for-profit" ...
## $ cost : int 28306 26994 26353 28894 23928 25625 24265 NA 20983 21999 ...
## $ admrate : num 81.3 98.3 89.3 58.4 68.8 ...
## $ satavg : num NA NA NA NA NA NA NA NA NA NA ...
## $ avgfacsal: num 45054 52857 NA 47196 55089 ...
## $ pctpell : num 0.803 0.773 0.704 0.778 0.61 ...
## $ comprate : num 0.6 0.336 NA NA NA ...
## $ firstgen : num 0.506 0.506 0.506 0.506 0.517 ...
## $ debt : num 13000 13000 13000 13000 9500 ...
Type ?scorecard in the console to open up the help file for this data set. This includes the documentation for all the variables. Use your knowledge of dplyr and ggplot2 functions to answer the following questions.
NOTE: This time, use a graph to visualize your answer, not a table.
We could use a boxplot to visualize the distribution of SAT scores.
ggplot(scorecard, aes(type, satavg)) +
  geom_boxplot()
## Warning: Removed 471 rows containing non-finite values (stat_boxplot).
According to this graph, private, nonprofit schools have the highest typical SAT score (the center line of each box is the median of that type's school-level averages), followed by public and then private, for-profit schools. But this doesn’t reveal the entire picture. What happens if we plot a histogram or frequency polygon?
ggplot(scorecard, aes(satavg)) +
  geom_histogram() +
  facet_wrap(~ type)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 471 rows containing non-finite values (stat_bin).
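The stat_bin() message appears because no bin width was chosen, so ggplot2 falls back to 30 bins. One way to address it is to pass binwidth explicitly. The sketch here uses a simulated stand-in (toy) with the same column names as scorecard, and the width of 25 SAT points is an arbitrary choice for illustration, not a recommendation from the original exercise.

```r
library(ggplot2)

# Hypothetical stand-in for scorecard: simulated values, same column names
set.seed(1)
toy <- data.frame(
  type   = rep(c("Public", "Private, nonprofit"), each = 100),
  satavg = c(rnorm(100, mean = 1030, sd = 100),
             rnorm(100, mean = 1080, sd = 130))
)

# binwidth is in the units of the x variable, so each bar spans 25 SAT points
p <- ggplot(toy, aes(satavg)) +
  geom_histogram(binwidth = 25) +
  facet_wrap(~ type)
p
```

Swapping toy for scorecard should produce the same faceted histogram without the bins = 30 message.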
ggplot(scorecard, aes(satavg, color = type)) +
  geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 471 rows containing non-finite values (stat_bin).
Now we can see the averages for each college type are based on widely varying sample sizes.
ggplot(scorecard, aes(type)) +
  geom_bar()
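The bar chart above counts every school of each type, including those with no SAT average on record. Since the SAT plots drop rows with a missing satavg, a closer match to the sample sizes behind them might be a bar chart restricted to schools that report a value. A minimal sketch, using a small made-up data frame (toy) in place of scorecard:

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for scorecard: made-up rows, same column names
toy <- tibble(
  type = c(rep("Public", 4),
           rep("Private, nonprofit", 2),
           rep("Private, for-profit", 2)),
  satavg = c(1050, 990, 1010, NA, 1120, 1080, NA, NA)
)

# Keep only the schools that actually report an SAT average
reporters <- toy %>%
  filter(!is.na(satavg))

# One bar per type, counting only the reporting schools
p <- ggplot(reporters, aes(type)) +
  geom_bar()
p
```

A type with no reporting schools (private, for-profit in this toy data) drops out of the chart entirely, which is itself informative.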
# geom_point
ggplot(scorecard, aes(cost, avgfacsal)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 42 rows containing non-finite values (stat_smooth).
## Warning: Removed 42 rows containing missing values (geom_point).
# geom_point with alpha transparency to reveal dense clusters
ggplot(scorecard, aes(cost, avgfacsal)) +
  geom_point(alpha = .2) +
  geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 42 rows containing non-finite values (stat_smooth).
## Warning: Removed 42 rows containing missing values (geom_point).
# geom_hex
ggplot(scorecard, aes(cost, avgfacsal)) +
  geom_hex() +
  geom_smooth()
## Warning: Removed 42 rows containing non-finite values (stat_binhex).
## `geom_smooth()` using method = 'gam'
## Warning: Removed 42 rows containing non-finite values (stat_smooth).
# geom_point with smoothing lines for each type
ggplot(scorecard, aes(cost, avgfacsal, color = type)) +
  geom_point(alpha = .2) +
  geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 42 rows containing non-finite values (stat_smooth).
## Warning: Removed 42 rows containing missing values (geom_point).
# geom_point with facets for each type
ggplot(scorecard, aes(cost, avgfacsal, color = type)) +
  geom_point(alpha = .2) +
  geom_smooth() +
  facet_grid(. ~ type)
## `geom_smooth()` using method = 'gam'
## Warning: Removed 42 rows containing non-finite values (stat_smooth).
## Warning: Removed 42 rows containing missing values (geom_point).
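Each plot above repeats the same warnings because rows with a missing cost or avgfacsal are dropped silently at plotting time. One option is to drop those rows explicitly first, so the filtering is visible in the code. The sketch here uses a simulated stand-in (toy) rather than scorecard, and it swaps the default smoother for a linear fit (method = "lm") purely for illustration; the plots above rely on the default, which ggplot2 reports as 'gam'.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for scorecard: simulated values, same column names
set.seed(2)
toy <- tibble(cost = runif(52, 10000, 60000)) %>%
  mutate(avgfacsal = 40000 + 0.5 * cost + rnorm(52, sd = 5000))
toy$cost[1] <- NA          # introduce a few missing values on purpose
toy$avgfacsal[2:3] <- NA

# Drop incomplete rows up front instead of letting ggplot2 do it with a warning
complete <- toy %>%
  filter(!is.na(cost), !is.na(avgfacsal))

# Name the smoothing method and formula explicitly
p <- ggplot(complete, aes(cost, avgfacsal)) +
  geom_point(alpha = .2) +
  geom_smooth(method = "lm", formula = y ~ x)
p
```

Comparing nrow(toy) with nrow(complete) shows how many rows were removed, which is the same information the warnings report.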
devtools::session_info()
## Session info -------------------------------------------------------------
## setting value
## version R version 3.4.3 (2017-11-30)
## system x86_64, darwin15.6.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## tz America/Chicago
## date 2018-03-13
## Packages -----------------------------------------------------------------
## package * version date source
## assertthat 0.2.0 2017-04-11 CRAN (R 3.4.0)
## backports 1.1.2 2017-12-13 CRAN (R 3.4.3)
## base * 3.4.3 2017-12-07 local
## bindr 0.1.1 2018-03-13 CRAN (R 3.4.3)
## bindrcpp 0.2 2017-06-17 CRAN (R 3.4.0)
## broom 0.4.3 2017-11-20 CRAN (R 3.4.1)
## cellranger 1.1.0 2016-07-27 CRAN (R 3.4.0)
## cli 1.0.0 2017-11-05 CRAN (R 3.4.2)
## colorspace 1.3-2 2016-12-14 CRAN (R 3.4.0)
## compiler 3.4.3 2017-12-07 local
## crayon 1.3.4 2017-10-03 Github (gaborcsardi/crayon@b5221ab)
## datasets * 3.4.3 2017-12-07 local
## devtools 1.13.5 2018-02-18 CRAN (R 3.4.3)
## digest 0.6.15 2018-01-28 CRAN (R 3.4.3)
## dplyr * 0.7.4.9000 2017-10-03 Github (tidyverse/dplyr@1a0730a)
## evaluate 0.10.1 2017-06-24 CRAN (R 3.4.1)
## forcats * 0.3.0 2018-02-19 CRAN (R 3.4.3)
## foreign 0.8-69 2017-06-22 CRAN (R 3.4.3)
## ggplot2 * 2.2.1 2016-12-30 CRAN (R 3.4.0)
## glue 1.2.0 2017-10-29 CRAN (R 3.4.2)
## graphics * 3.4.3 2017-12-07 local
## grDevices * 3.4.3 2017-12-07 local
## grid 3.4.3 2017-12-07 local
## gtable 0.2.0 2016-02-26 CRAN (R 3.4.0)
## haven 1.1.1 2018-01-18 CRAN (R 3.4.3)
## hms 0.4.2 2018-03-10 CRAN (R 3.4.3)
## htmltools 0.3.6 2017-04-28 CRAN (R 3.4.0)
## httr 1.3.1 2017-08-20 CRAN (R 3.4.1)
## jsonlite 1.5 2017-06-01 CRAN (R 3.4.0)
## knitr 1.20 2018-02-20 CRAN (R 3.4.3)
## lattice 0.20-35 2017-03-25 CRAN (R 3.4.3)
## lazyeval 0.2.1 2017-10-29 CRAN (R 3.4.2)
## lubridate 1.7.3 2018-02-27 CRAN (R 3.4.3)
## magrittr 1.5 2014-11-22 CRAN (R 3.4.0)
## memoise 1.1.0 2017-04-21 CRAN (R 3.4.0)
## methods * 3.4.3 2017-12-07 local
## mnormt 1.5-5 2016-10-15 CRAN (R 3.4.0)
## modelr 0.1.1 2017-08-10 local
## munsell 0.4.3 2016-02-13 CRAN (R 3.4.0)
## nlme 3.1-131.1 2018-02-16 CRAN (R 3.4.3)
## parallel 3.4.3 2017-12-07 local
## pillar 1.2.1 2018-02-27 CRAN (R 3.4.3)
## pkgconfig 2.0.1 2017-03-21 CRAN (R 3.4.0)
## plyr 1.8.4 2016-06-08 CRAN (R 3.4.0)
## psych 1.7.8 2017-09-09 CRAN (R 3.4.1)
## purrr * 0.2.4 2017-10-18 CRAN (R 3.4.2)
## R6 2.2.2 2017-06-17 CRAN (R 3.4.0)
## Rcpp 0.12.15 2018-01-20 CRAN (R 3.4.3)
## readr * 1.1.1 2017-05-16 CRAN (R 3.4.0)
## readxl 1.0.0 2017-04-18 CRAN (R 3.4.0)
## reshape2 1.4.3 2017-12-11 CRAN (R 3.4.3)
## rlang 0.2.0 2018-02-20 cran (@0.2.0)
## rmarkdown 1.9 2018-03-01 CRAN (R 3.4.3)
## rprojroot 1.3-2 2018-01-03 CRAN (R 3.4.3)
## rstudioapi 0.7 2017-09-07 CRAN (R 3.4.1)
## rvest 0.3.2 2016-06-17 CRAN (R 3.4.0)
## scales 0.5.0 2017-08-24 cran (@0.5.0)
## stats * 3.4.3 2017-12-07 local
## stringi 1.1.7 2018-03-12 CRAN (R 3.4.3)
## stringr * 1.3.0 2018-02-19 CRAN (R 3.4.3)
## tibble * 1.4.2 2018-01-22 CRAN (R 3.4.3)
## tidyr * 0.8.0 2018-01-29 CRAN (R 3.4.3)
## tidyverse * 1.2.1 2017-11-14 CRAN (R 3.4.2)
## tools 3.4.3 2017-12-07 local
## utils * 3.4.3 2017-12-07 local
## withr 2.1.1 2017-12-19 CRAN (R 3.4.3)
## xml2 1.2.0 2018-01-24 CRAN (R 3.4.3)
## yaml 2.1.18 2018-03-08 CRAN (R 3.4.4)
This work is licensed under the CC BY-NC 4.0 Creative Commons License.