Pipes are an extremely useful tool from the magrittr package[1] that allow you to express a sequence of multiple operations. They can greatly simplify your code and make your operations more intuitive. However, they are not the only way to write your code and combine multiple operations. In fact, for many years the pipe did not exist in R. How else did people write their code?
Suppose we have the following assignment:
Using the diamonds dataset, calculate the average price for each cut of “I” colored diamonds.
Okay, first let’s load our libraries and check out the data frame.
library(tidyverse)
diamonds
## # A tibble: 53,940 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4.00 4.05 2.39
## # ... with 53,930 more rows
We can decompose the problem into a series of discrete steps:
1. Filter diamonds to only keep observations where the color is rated as “I”
2. Group the filtered diamonds data frame by cut
3. Summarize the grouped and filtered diamonds data frame by calculating the average price
But how do we implement the code?
One option is to save each step as a new object:
diamonds_1 <- filter(diamonds, color == "I")
diamonds_2 <- group_by(diamonds_1, cut)
(diamonds_3 <- summarize(diamonds_2, price = mean(price)))
## # A tibble: 5 x 2
## cut price
## <ord> <dbl>
## 1 Fair 4685.446
## 2 Good 5078.533
## 3 Very Good 5255.880
## 4 Premium 5946.181
## 5 Ideal 4451.970
Why do we not like doing this? We have to name each intermediate object. Here I just append a number to the end, but this is not good self-documentation. What should we expect to find in diamonds_2? It would be nicer to have an informative name, but there isn’t a natural one. Then we have to remember what the data looks like at each intermediate step and remember to reference the correct one. What happens if we misidentify the data frame?
diamonds_1 <- filter(diamonds, color == "I")
diamonds_2 <- group_by(diamonds_1, cut)
(diamonds_3 <- summarize(diamonds_1, price = mean(price)))
## # A tibble: 1 x 1
## price
## <dbl>
## 1 5091.875
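We don’t get the correct answer. Because diamonds_1 is not grouped, summarize() collapses everything into a single overall average instead of one average per cut. Worse, we don’t get an explicit error message because the code, as written, works. R can execute this command for us and doesn’t know to warn us that we used diamonds_1 instead of diamonds_2.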
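One way to guard against this kind of silent mistake is to check the grouping before summarizing. The sketch below is an illustration rather than part of the original workflow. It assumes dplyr's group_vars() and is_grouped_df() helpers and reuses the diamonds_1 and diamonds_2 names from above.

```r
library(tidyverse)

diamonds_1 <- filter(diamonds, color == "I")
diamonds_2 <- group_by(diamonds_1, cut)

# group_vars() reports which columns a data frame is grouped by
group_vars(diamonds_1) # no grouping variables: summarize() would return one row
group_vars(diamonds_2) # grouped by cut: summarize() returns one row per cut

# Fail loudly if the data frame we are about to summarize is not grouped
stopifnot(is_grouped_df(diamonds_2))
diamonds_3 <- summarize(diamonds_2, price = mean(price))
```

A check like this turns the wrong-object mistake into an explicit error, at the cost of yet another line to maintain.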
Instead of creating intermediate objects, let’s just replace the original data frame with the modified form.
# copy diamonds to diamonds_t just for demonstration purposes
diamonds_t <- diamonds
diamonds_t <- filter(diamonds_t, color == "I")
diamonds_t <- group_by(diamonds_t, cut)
(diamonds_t <- summarize(diamonds_t, price = mean(price)))
## # A tibble: 5 x 2
## cut price
## <ord> <dbl>
## 1 Fair 4685.446
## 2 Good 5078.533
## 3 Very Good 5255.880
## 4 Premium 5946.181
## 5 Ideal 4451.970
This works, but still has a couple of problems. What happens if I make an error in the middle of the operation? I need to rerun the entire operation from the beginning. With your own data sources, this means having to read in the .csv file all over again to restore a fresh copy.
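To make the risk concrete, here is a small hypothetical sketch. The mistaken filter on color "J" is invented for illustration: once diamonds_t has been overwritten, the dropped rows are gone from that object, and the only way back is a fresh copy of the source.

```r
library(tidyverse)

diamonds_t <- diamonds

# Suppose we filter on the wrong color by mistake
diamonds_t <- filter(diamonds_t, color == "J")

# The "I" diamonds are no longer in diamonds_t, so filtering again finds nothing
nrow(filter(diamonds_t, color == "I")) # 0

# The only fix is to start over from a fresh copy of the source data
diamonds_t <- diamonds
diamonds_t <- filter(diamonds_t, color == "I")
```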
We could nest all the function calls inside one another in a single expression and skip the intermediate assignments entirely.
summarize(
group_by(
filter(diamonds, color == "I"),
cut
),
price = mean(price)
)
## # A tibble: 5 x 2
## cut price
## <ord> <dbl>
## 1 Fair 4685.446
## 2 Good 5078.533
## 3 Very Good 5255.880
## 4 Premium 5946.181
## 5 Ideal 4451.970
But now we have to read the function from the inside out. This is not intuitive for humans. Again, the computer will handle it just fine, but if you make a mistake, debugging it will be a pain. Instead, we can write the same steps in order using the pipe:
diamonds %>%
filter(color == "I") %>%
group_by(cut) %>%
summarize(price = mean(price))
## # A tibble: 5 x 2
## cut price
## <ord> <dbl>
## 1 Fair 4685.446
## 2 Good 5078.533
## 3 Very Good 5255.880
## 4 Premium 5946.181
## 5 Ideal 4451.970
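As a sanity check, the sketch below stores the nested and piped versions under two illustrative names (nested and piped, not part of the original code) and compares them. It also shows where the assignment goes when you want to keep the result of a piped operation: at the very start.

```r
library(tidyverse)

# Nested version: read from the inside out
nested <- summarize(
  group_by(filter(diamonds, color == "I"), cut),
  price = mean(price)
)

# Piped version: read from top to bottom; the assignment sits at the start
piped <- diamonds %>%
  filter(color == "I") %>%
  group_by(cut) %>%
  summarize(price = mean(price))

# Both approaches run the same calls, so the results should match
identical(nested, piped)
```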
Piping is the clearest syntax to implement, as it focuses on actions, not objects. Or as Hadley would say:
[I]t focuses on verbs, not nouns.
magrittr automatically passes the output from one line into the next line as its first argument. This is why tidyverse data manipulation functions accept a data frame as the first argument.
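Put differently, x %>% f(y) is evaluated as f(x, y). A minimal sketch of that equivalence, using the filter step from the example (the names with_pipe and without_pipe are only for illustration):

```r
library(tidyverse)

# The left-hand side becomes the first argument of the call on the right
with_pipe <- diamonds %>% filter(color == "I")
without_pipe <- filter(diamonds, color == "I")
identical(with_pipe, without_pipe)

# The same holds for any function, not only data frame verbs
c(1, 4, 9) %>% sqrt() # same as sqrt(c(1, 4, 9))
```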
- Do not use <- inside the piped operation. Only use it at the beginning if you want to save the output.
- Remember to add %>% at the end of each line involved in the piped operation, except the last. A good rule of thumb: RStudio will automatically indent lines of code that are part of a piped operation. If the line isn’t indented, it probably hasn’t been added to the pipe. If you have an error in a piped operation, always check to make sure the pipe is connected as you expect.
devtools::session_info()
## Session info -------------------------------------------------------------
## setting value
## version R version 3.4.3 (2017-11-30)
## system x86_64, darwin15.6.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## tz America/Chicago
## date 2018-04-04
## Packages -----------------------------------------------------------------
## package * version date source
## backports 1.1.2 2017-12-13 CRAN (R 3.4.3)
## base * 3.4.3 2017-12-07 local
## compiler 3.4.3 2017-12-07 local
## datasets * 3.4.3 2017-12-07 local
## devtools 1.13.5 2018-02-18 CRAN (R 3.4.3)
## digest 0.6.15 2018-01-28 CRAN (R 3.4.3)
## evaluate 0.10.1 2017-06-24 CRAN (R 3.4.1)
## graphics * 3.4.3 2017-12-07 local
## grDevices * 3.4.3 2017-12-07 local
## htmltools 0.3.6 2017-04-28 CRAN (R 3.4.0)
## knitr 1.20 2018-02-20 CRAN (R 3.4.3)
## magrittr 1.5 2014-11-22 CRAN (R 3.4.0)
## memoise 1.1.0 2017-04-21 CRAN (R 3.4.0)
## methods * 3.4.3 2017-12-07 local
## Rcpp 0.12.16 2018-03-13 CRAN (R 3.4.4)
## rmarkdown 1.9 2018-03-01 CRAN (R 3.4.3)
## rprojroot 1.3-2 2018-01-03 CRAN (R 3.4.3)
## stats * 3.4.3 2017-12-07 local
## stringi 1.1.7 2018-03-12 CRAN (R 3.4.3)
## stringr 1.3.0 2018-02-19 CRAN (R 3.4.3)
## tools 3.4.3 2017-12-07 local
## utils * 3.4.3 2017-12-07 local
## withr 2.1.2 2018-03-15 CRAN (R 3.4.4)
## yaml 2.1.18 2018-03-08 CRAN (R 3.4.4)
[1] The basic %>% pipe is automatically imported as part of the tidyverse library. If you wish to use any of the extra tools from magrittr as demonstrated in R for Data Science, you need to explicitly load magrittr.
This work is licensed under the CC BY-NC 4.0 Creative Commons License.