library(tidyverse)
Data science workflow

Rarely will your data arrive in exactly the form you need for analysis, so transforming data is a routine part of the data science workflow. Just as we established a syntax for generating graphics (the layered grammar of graphics), so too will we have a syntax for data transformation.

From the author of ggplot2, I give you dplyr! This package contains useful functions for transforming and manipulating data frames, the bread-and-butter format for data in R. These functions can be thought of as verbs. The noun is the data, and the verb is acting on the noun. All of the dplyr verbs (and in fact all the verbs in the wider tidyverse) work similarly:

  1. The first argument is a data frame
  2. Subsequent arguments describe what to do with the data frame
  3. The result is a new data frame
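
As a minimal sketch of this shared pattern, here is one verb applied to a small made-up data frame. The pets data and its column names are hypothetical, invented only for illustration:

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
pets <- tibble(
  name = c("Rex", "Tom", "Ada"),
  age = c(3, 7, 5)
)

# 1. The first argument is a data frame
# 2. The later arguments describe what to do with it
older <- filter(pets, age > 4)

# 3. The result is a new data frame; the input is left unchanged
nrow(older)  # 2
nrow(pets)   # 3
```

The other verbs follow the same shape, which is what later lets us chain them together.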

Key functions in dplyr

Function      Action performed
filter()      Subsets observations based on their values
arrange()     Changes the order of observations based on their values
select()      Selects a subset of columns from the data frame
rename()      Changes the name of columns in the data frame
mutate()      Creates new columns (or variables)
group_by()    Changes the unit of analysis from the complete dataset to individual groups
summarize()   Collapses the data frame to a smaller number of rows which summarize the larger data

These are the basic verbs you will use to transform your data. By combining them together, you can perform powerful data manipulation tasks.
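
To give a feel for combining verbs, here is a rough sketch that strings four of them together one step at a time. The sales data is hypothetical and the column names are my own; treat it as an illustration rather than a recipe:

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
sales <- tibble(
  store = c("A", "A", "B", "B"),
  units = c(10, 20, 15, 25),
  price = c(2, 2, 3, 3)
)

# mutate(): add a revenue column
sales_rev <- mutate(sales, revenue = units * price)

# group_by() + summarize(): collapse to one row per store
by_store <- group_by(sales_rev, store)
totals <- summarize(by_store, total_revenue = sum(revenue))

# arrange(): sort stores from highest to lowest total revenue
totals <- arrange(totals, desc(total_revenue))
totals
```

Each step takes the data frame produced by the previous one, so the verbs compose naturally.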

American vs. British English

Hadley Wickham is from New Zealand. As such he (and base R) favours British spellings.

While British spelling is perhaps the norm, let us recall a past statement by the president:

Fortunately many R functions can be written using American or British variants:

  • summarize() = summarise()
  • scale_color_manual() = scale_colour_manual() (and likewise the color/colour argument in ggplot2)

Therefore in this class I will generally stick to American spelling.
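
If you want to convince yourself that the two spellings are interchangeable, a quick check along these lines should do it (the df object is a throwaway example):

```r
library(dplyr)

# Throwaway example data
df <- tibble(x = c(1, 2, 3))

# Same call, American and British spelling
american <- summarize(df, mean_x = mean(x))
british <- summarise(df, mean_x = mean(x))

# Both spellings produce the same value
identical(american$mean_x, british$mean_x)  # TRUE
```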

Saving transformed data

dplyr never overwrites existing data: each verb returns a new data frame and leaves its input untouched. If you want a copy of the transformed data for later use in the program, you need to explicitly save it. You can do this by using the assignment operator <-:

filter(diamonds, cut == "Ideal")  # printed, but not saved
## # A tibble: 21,551 x 10
##    carat   cut color clarity depth table price     x     y     z
##    <dbl> <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23 Ideal     E     SI2  61.5    55   326  3.95  3.98  2.43
##  2  0.23 Ideal     J     VS1  62.8    56   340  3.93  3.90  2.46
##  3  0.31 Ideal     J     SI2  62.2    54   344  4.35  4.37  2.71
##  4  0.30 Ideal     I     SI2  62.0    54   348  4.31  4.34  2.68
##  5  0.33 Ideal     I     SI2  61.8    55   403  4.49  4.51  2.78
##  6  0.33 Ideal     I     SI2  61.2    56   403  4.49  4.50  2.75
##  7  0.33 Ideal     J     SI1  61.1    56   403  4.49  4.55  2.76
##  8  0.23 Ideal     G     VS1  61.9    54   404  3.93  3.95  2.44
##  9  0.32 Ideal     I     SI1  60.9    55   404  4.45  4.48  2.72
## 10  0.30 Ideal     I     SI2  61.0    59   405  4.30  4.33  2.63
## # ... with 21,541 more rows
diamonds_ideal <- filter(diamonds, cut == "Ideal")  # saved, but not printed
(diamonds_ideal <- filter(diamonds, cut == "Ideal"))  # saved and printed
## # A tibble: 21,551 x 10
##    carat   cut color clarity depth table price     x     y     z
##    <dbl> <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23 Ideal     E     SI2  61.5    55   326  3.95  3.98  2.43
##  2  0.23 Ideal     J     VS1  62.8    56   340  3.93  3.90  2.46
##  3  0.31 Ideal     J     SI2  62.2    54   344  4.35  4.37  2.71
##  4  0.30 Ideal     I     SI2  62.0    54   348  4.31  4.34  2.68
##  5  0.33 Ideal     I     SI2  61.8    55   403  4.49  4.51  2.78
##  6  0.33 Ideal     I     SI2  61.2    56   403  4.49  4.50  2.75
##  7  0.33 Ideal     J     SI1  61.1    56   403  4.49  4.55  2.76
##  8  0.23 Ideal     G     VS1  61.9    54   404  3.93  3.95  2.44
##  9  0.32 Ideal     I     SI1  60.9    55   404  4.45  4.48  2.72
## 10  0.30 Ideal     I     SI2  61.0    59   405  4.30  4.33  2.63
## # ... with 21,541 more rows

Do not use = to assign objects. Read this for more information on the difference between <- and =.
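
One place where the difference shows up is inside a function call. The sketch below uses only base R and assumes a fresh session with no existing objects named x or y:

```r
# Inside a function call, `=` names an argument; it does not assign
mean(x = c(1, 2, 3))
exists("x")  # FALSE in a fresh session: no object called x was created

# `<-` inside a call assigns as a side effect, then passes the value along
mean(y <- c(1, 2, 3))
exists("y")  # TRUE
```

Reserving = for function arguments and <- for assignment keeps the two roles visually distinct.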

Missing values

NA represents an unknown value. Missing values are contagious: almost any operation involving an unknown value returns an unknown value.

NA > 5
## [1] NA
10 == NA
## [1] NA
NA + 10
## [1] NA

To determine if a value is missing, use the is.na() function.
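
A small base R sketch of why is.na() is needed rather than a comparison with == (the vector x here is just an example):

```r
x <- c(1, NA, 3)

# Comparing with == does not work: every result is itself missing
x == NA        # NA NA NA

# is.na() returns TRUE exactly where the value is missing
is.na(x)       # FALSE TRUE FALSE

# Summing the logical vector counts the missing values
sum(is.na(x))  # 1
```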

When filtering, you must explicitly call for missing values to be returned.

df <- tibble(x = c(1, NA, 3))
df
## # A tibble: 3 x 1
##       x
##   <dbl>
## 1     1
## 2    NA
## 3     3
filter(df, x > 1)
## # A tibble: 1 x 1
##       x
##   <dbl>
## 1     3
filter(df, is.na(x) | x > 1)
## # A tibble: 2 x 1
##       x
##   <dbl>
## 1    NA
## 2     3

Likewise, when calculating summary statistics, you need to explicitly ignore missing values.

df <- tibble(
  x = c(1, 2, 3, 5, NA)
)
df
## # A tibble: 5 x 1
##       x
##   <dbl>
## 1     1
## 2     2
## 3     3
## 4     5
## 5    NA
summarize(df, meanx = mean(x))
## # A tibble: 1 x 1
##   meanx
##   <dbl>
## 1    NA
summarize(df, meanx = mean(x, na.rm = TRUE))
## # A tibble: 1 x 1
##   meanx
##   <dbl>
## 1  2.75

Piping

As we discussed, you frequently need to perform a series of intermediate steps to transform data for analysis. If you write each step as a discrete command and store its result as a new object, your code can become convoluted.

Drawing on this example from R for Data Science, let’s explore the relationship between the distance and average delay for each destination. At this point, we would write it something like this:

by_dest <- group_by(flights, dest)
delay <- summarise(by_dest,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE)
)
delay <- filter(delay, count > 20, dest != "HNL")

ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
  geom_point(aes(size = count), alpha = 1/3) +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'

Decomposing the problem, there are three basic steps:

  1. Group flights by destination.
  2. Summarize to compute distance, average delay, and number of flights.
  3. Filter to remove noisy points and the Honolulu airport, which is almost twice as far away as the next closest airport.

The code as written is cumbersome because we have to name and store each intermediate data frame, even though we don’t care about them. Each extra name is also another opportunity for typos and errors.

Because all dplyr verbs follow the same syntax (data first, then options for the function), we can use the pipe operator %>% to chain a series of functions together in one command:

delays <- flights %>% 
  group_by(dest) %>% 
  summarize(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>% 
  filter(count > 20, dest != "HNL")

Now we don’t have to name each intermediate step and store it as a data frame. We store only a single data frame (delays), which contains the final version of the transformed data. We could read this code as: use the flights data, then group by destination, then summarize for each destination the number of flights, the average distance, and the average delay, then keep only the destinations with more than 20 flights and exclude Honolulu.
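
If it helps to see what the pipe is doing, x %>% f(y) is evaluated as f(x, y). The sketch below compares a nested call with its piped equivalent on a tiny hypothetical data frame (df, g, and v are made-up names):

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
df <- tibble(g = c("a", "a", "b"), v = c(1, 3, 10))

# Nested calls: you have to read from the inside out
nested <- summarize(group_by(df, g), mean_v = mean(v))

# Piped: the left-hand side becomes the first argument of the next function,
# so the same steps read top to bottom
piped <- df %>%
  group_by(g) %>%
  summarize(mean_v = mean(v))

# Both versions compute the same group means
identical(nested$mean_v, piped$mean_v)  # TRUE
```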

Things not to do with piping

Remember that with pipes, we don’t have to save all of our intermediate steps. We only use one assignment, like this:

delays <- flights %>% 
  group_by(dest) %>% 
  summarize(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>% 
  filter(count > 20, dest != "HNL")

Do not do this:

delays <- flights %>% 
  by_dest <- group_by(dest) %>% 
  delay <- summarize(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>% 
  delay <- filter(count > 20, dest != "HNL")
Error: bad assignment: 
     summarize(count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, 
         na.rm = TRUE)) %>% delay <- filter(count > 20, dest != "HNL")

Or this:

delays <- flights %>% 
  group_by(flights, dest) %>% 
  summarize(flights,
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>% 
  filter(flights, count > 20, dest != "HNL")
## Error in grouped_df_impl(data, unname(vars), drop): Column `flights` is unknown

If you use pipes, you don’t have to reference the data frame with each function - just the first time at the beginning of the pipe sequence.

Session Info

devtools::session_info()
## Session info -------------------------------------------------------------
##  setting  value                       
##  version  R version 3.4.3 (2017-11-30)
##  system   x86_64, darwin15.6.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/Chicago             
##  date     2018-03-13
## Packages -----------------------------------------------------------------
##  package    * version    date       source                             
##  assertthat   0.2.0      2017-04-11 CRAN (R 3.4.0)                     
##  backports    1.1.2      2017-12-13 CRAN (R 3.4.3)                     
##  base       * 3.4.3      2017-12-07 local                              
##  bindr        0.1        2016-11-13 CRAN (R 3.4.0)                     
##  bindrcpp     0.2        2017-06-17 CRAN (R 3.4.0)                     
##  broom        0.4.3      2017-11-20 CRAN (R 3.4.1)                     
##  cellranger   1.1.0      2016-07-27 CRAN (R 3.4.0)                     
##  cli          1.0.0      2017-11-05 CRAN (R 3.4.2)                     
##  colorspace   1.3-2      2016-12-14 CRAN (R 3.4.0)                     
##  compiler     3.4.3      2017-12-07 local                              
##  crayon       1.3.4      2017-10-03 Github (gaborcsardi/crayon@b5221ab)
##  datasets   * 3.4.3      2017-12-07 local                              
##  devtools     1.13.5     2018-02-18 CRAN (R 3.4.3)                     
##  digest       0.6.15     2018-01-28 CRAN (R 3.4.3)                     
##  dplyr      * 0.7.4.9000 2017-10-03 Github (tidyverse/dplyr@1a0730a)   
##  evaluate     0.10.1     2017-06-24 CRAN (R 3.4.1)                     
##  forcats    * 0.3.0      2018-02-19 CRAN (R 3.4.3)                     
##  foreign      0.8-69     2017-06-22 CRAN (R 3.4.3)                     
##  ggplot2    * 2.2.1      2016-12-30 CRAN (R 3.4.0)                     
##  glue         1.2.0      2017-10-29 CRAN (R 3.4.2)                     
##  graphics   * 3.4.3      2017-12-07 local                              
##  grDevices  * 3.4.3      2017-12-07 local                              
##  grid         3.4.3      2017-12-07 local                              
##  gtable       0.2.0      2016-02-26 CRAN (R 3.4.0)                     
##  haven        1.1.1      2018-01-18 CRAN (R 3.4.3)                     
##  hms          0.4.1      2018-01-24 CRAN (R 3.4.3)                     
##  htmltools    0.3.6      2017-04-28 CRAN (R 3.4.0)                     
##  httr         1.3.1      2017-08-20 CRAN (R 3.4.1)                     
##  jsonlite     1.5        2017-06-01 CRAN (R 3.4.0)                     
##  knitr        1.20       2018-02-20 CRAN (R 3.4.3)                     
##  lattice      0.20-35    2017-03-25 CRAN (R 3.4.3)                     
##  lazyeval     0.2.1      2017-10-29 CRAN (R 3.4.2)                     
##  lubridate    1.7.2      2018-02-06 CRAN (R 3.4.3)                     
##  magrittr     1.5        2014-11-22 CRAN (R 3.4.0)                     
##  memoise      1.1.0      2017-04-21 CRAN (R 3.4.0)                     
##  methods    * 3.4.3      2017-12-07 local                              
##  mnormt       1.5-5      2016-10-15 CRAN (R 3.4.0)                     
##  modelr       0.1.1      2017-08-10 local                              
##  munsell      0.4.3      2016-02-13 CRAN (R 3.4.0)                     
##  nlme         3.1-131.1  2018-02-16 CRAN (R 3.4.3)                     
##  parallel     3.4.3      2017-12-07 local                              
##  pillar       1.1.0      2018-01-14 CRAN (R 3.4.3)                     
##  pkgconfig    2.0.1      2017-03-21 CRAN (R 3.4.0)                     
##  plyr         1.8.4      2016-06-08 CRAN (R 3.4.0)                     
##  psych        1.7.8      2017-09-09 CRAN (R 3.4.1)                     
##  purrr      * 0.2.4      2017-10-18 CRAN (R 3.4.2)                     
##  R6           2.2.2      2017-06-17 CRAN (R 3.4.0)                     
##  Rcpp         0.12.15    2018-01-20 CRAN (R 3.4.3)                     
##  readr      * 1.1.1      2017-05-16 CRAN (R 3.4.0)                     
##  readxl       1.0.0      2017-04-18 CRAN (R 3.4.0)                     
##  reshape2     1.4.3      2017-12-11 CRAN (R 3.4.3)                     
##  rlang        0.2.0      2018-02-20 cran (@0.2.0)                      
##  rmarkdown    1.8        2017-11-17 CRAN (R 3.4.2)                     
##  rprojroot    1.3-2      2018-01-03 CRAN (R 3.4.3)                     
##  rstudioapi   0.7        2017-09-07 CRAN (R 3.4.1)                     
##  rvest        0.3.2      2016-06-17 CRAN (R 3.4.0)                     
##  scales       0.5.0      2017-08-24 cran (@0.5.0)                      
##  stats      * 3.4.3      2017-12-07 local                              
##  stringi      1.1.6      2017-11-17 CRAN (R 3.4.2)                     
##  stringr    * 1.3.0      2018-02-19 CRAN (R 3.4.3)                     
##  tibble     * 1.4.2      2018-01-22 CRAN (R 3.4.3)                     
##  tidyr      * 0.8.0      2018-01-29 CRAN (R 3.4.3)                     
##  tidyverse  * 1.2.1      2017-11-14 CRAN (R 3.4.2)                     
##  tools        3.4.3      2017-12-07 local                              
##  utils      * 3.4.3      2017-12-07 local                              
##  withr        2.1.1      2017-12-19 CRAN (R 3.4.3)                     
##  xml2         1.2.0      2018-01-24 CRAN (R 3.4.3)                     
##  yaml         2.1.16     2017-12-12 CRAN (R 3.4.3)

This work is licensed under the CC BY-NC 4.0 Creative Commons License.