This is a general-purpose complement to the specialised manipulation
functions filter(), select(), mutate(),
summarise() and arrange(). You can use do()
to perform arbitrary computation, returning either a data frame or
arbitrary objects which will be stored in a list. This is particularly
useful when working with models: you can fit models per group with
do() and then flexibly extract components with either another
do() or summarise().
do(.data, ...)
| Argument | Description |
|---|---|
| .data | a tbl |
| ... | Expressions to apply to each group. If named, results will be stored in a new column. If unnamed, should return a data frame. You can use `.` to refer to the current group. |
do() always returns a data frame. The first columns in the data frame
will be the labels, the others will be computed from .... Named
arguments become list-columns, with one element for each group; unnamed
elements must be data frames and labels will be duplicated accordingly.
Groups are preserved for a single unnamed input. This is different to
summarise() because do() generally does not reduce the
complexity of the data; it just expresses it in a special way. For
multiple named inputs, the output is grouped by row with
rowwise(). This allows other verbs to work in an intuitive
way.
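
The two return shapes described above can be checked directly. The following is a minimal sketch, assuming dplyr is installed and using the built-in mtcars data; the names `by_cyl`, `top2` and `fits` are chosen here for illustration only.

```r
library(dplyr)

by_cyl <- group_by(mtcars, cyl)

# Single unnamed expression: it must return a data frame. The rows from
# every group are combined and the original grouping is preserved.
top2 <- do(by_cyl, head(., 2))
group_vars(top2)   # grouping variable is still "cyl"
nrow(top2)         # two rows for each of the three groups

# Named expression: each result is stored in a list-column, one element
# per group, and the output is grouped by row.
fits <- do(by_cyl, mod = lm(mpg ~ disp, data = .))
names(fits)                    # label column followed by the list-column
is.list(fits$mod)              # the models live in a list-column
inherits(fits, "rowwise_df")   # grouped by row
```

Because the named result is grouped by row, a later summarise() or do() sees one model at a time.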
For an empty data frame, the expressions will be evaluated once, even in the presence of a grouping. This makes sure that the format of the resulting data frame is the same for both empty and non-empty input.
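
As a rough illustration of the empty-input behaviour, the sketch below compares the result for a zero-row grouped data frame with the result for the full data. It assumes dplyr is installed and covers only the single unnamed case; `empty`, `full`, `out_empty` and `out_full` are illustrative names.

```r
library(dplyr)

# A grouped data frame with no rows, and the same grouping on the full data.
empty <- group_by(mtcars[0, ], cyl)
full  <- group_by(mtcars, cyl)

# The same unnamed expression applied to both inputs.
out_empty <- do(empty, data.frame(n = nrow(.)))
out_full  <- do(full,  data.frame(n = nrow(.)))

names(out_empty)   # same columns as names(out_full)
nrow(out_empty)    # no rows, because there are no groups
```

Having the same columns in both cases means downstream code does not need a special branch for empty input.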
If you're familiar with plyr, do() with named arguments is basically
equivalent to plyr::dlply(), and do() with a single unnamed argument
is basically equivalent to plyr::ldply(). However, instead of storing
labels in a separate attribute, the result is always a data frame. This
means that summarise() applied to the result of do() can
act like ldply().
    by_cyl <- group_by(mtcars, cyl)
    do(by_cyl, head(., 2))
    #> # A tibble: 6 × 11
    #> # Groups: cyl [3]
    #>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
    #>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    #> 1  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
    #> 2  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
    #> 3  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
    #> 4  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
    #> 5  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
    #> 6  14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4

    models <- by_cyl %>% do(mod = lm(mpg ~ disp, data = .))
    models
    #> Source: local data frame [3 x 2]
    #> Groups: <by row>
    #>
    #> # A tibble: 3 × 2
    #>     cyl      mod
    #> * <dbl>   <list>
    #> 1     4 <S3: lm>
    #> 2     6 <S3: lm>
    #> 3     8 <S3: lm>

    summarise(models, rsq = summary(mod)$r.squared)
    #> # A tibble: 3 × 1
    #>          rsq
    #>        <dbl>
    #> 1 0.64840514
    #> 2 0.01062604
    #> 3 0.27015777

    models %>% do(data.frame(coef = coef(.$mod)))
    #> Source: local data frame [6 x 1]
    #> Groups: <by row>
    #>
    #> # A tibble: 6 × 1
    #>           coef
    #> *        <dbl>
    #> 1 40.871955322
    #> 2 -0.135141815
    #> 3 19.081987419
    #> 4  0.003605119
    #> 5 22.032798914
    #> 6 -0.019634095

    models %>% do(data.frame(
      var = names(coef(.$mod)),
      coef(summary(.$mod)))
    )
    #> Source: local data frame [6 x 5]
    #> Groups: <by row>
    #>
    #> # A tibble: 6 × 5
    #>           var     Estimate  Std..Error    t.value     Pr...t..
    #> *      <fctr>        <dbl>       <dbl>      <dbl>        <dbl>
    #> 1 (Intercept) 40.871955322 3.589605400 11.3861973 1.202715e-06
    #> 2        disp -0.135141815 0.033171608 -4.0740206 2.782827e-03
    #> 3 (Intercept) 19.081987419 2.913992892  6.5483988 1.243968e-03
    #> 4        disp  0.003605119 0.015557115  0.2317344 8.259297e-01
    #> 5 (Intercept) 22.032798914 3.345241115  6.5863112 2.588765e-05
    #> 6        disp -0.019634095 0.009315926 -2.1075838 5.677488e-02

    models <- by_cyl %>% do(
      mod_linear = lm(mpg ~ disp, data = .),
      mod_quad = lm(mpg ~ poly(disp, 2), data = .)
    )
    models
    #> Source: local data frame [3 x 3]
    #> Groups: <by row>
    #>
    #> # A tibble: 3 × 3
    #>     cyl mod_linear mod_quad
    #> * <dbl>     <list>   <list>
    #> 1     4   <S3: lm> <S3: lm>
    #> 2     6   <S3: lm> <S3: lm>
    #> 3     8   <S3: lm> <S3: lm>

    compare <- models %>% do(aov = anova(.$mod_linear, .$mod_quad))
    # compare %>% summarise(p.value = aov$`Pr(>F)`)

    if (require("nycflights13")) {
      # You can use it to do any arbitrary computation, like fitting a linear
      # model. Let's explore how carrier departure delays vary over time
      carriers <- group_by(flights, carrier)
      group_size(carriers)

      mods <- do(carriers, mod = lm(arr_delay ~ dep_time, data = .))
      mods %>% do(as.data.frame(coef(.$mod)))
      mods %>% summarise(rsq = summary(mod)$r.squared)

      # Not run:
      if (FALSE) {
        # This longer example shows the progress bar in action
        by_dest <- flights %>% group_by(dest) %>% filter(n() > 100)
        library(mgcv)
        by_dest %>%
          do(smooth = gam(arr_delay ~ s(dep_time) + month, data = .))
      }
    }