library(tidyverse)
library(nycflights13)
theme_set(theme_minimal())

For each exercise, use your knowledge of relational data and joining operations to compute a table or graph that answers the question. All questions use data frames from the nycflights13 package (if you have not previously installed it, do so using install.packages("nycflights13")).

Review the database structure before you begin the exercises.

Is there a relationship between the age of a plane and its departure delays?

Hint: all the data is from 2013.

Click for the solution

The first step is to calculate the age of each plane. To do that, use planes and the age variable:

plane_ages <- planes %>%
  mutate(age = 2013 - year) %>%
  select(tailnum, age)

The best approach to answering this question is a visualization. There are several different types of visualizations you could implement (e.g. scatterplot with smoothing line, line graph of average delay by age). The important thing is that we need to combine flights with plane_ages to determine for each flight the age of the plane. This is another mutating join. The best choice is inner_join() as this will automatically remove any rows in flights where we don’t have age data on the plane.

# smoothing line
flights %>%
  inner_join(plane_ages) %>%
  ggplot(aes(age, dep_delay)) +
  geom_smooth()
## Joining, by = "tailnum"
## `geom_smooth()` using method = 'gam'
## Warning: Removed 9374 rows containing non-finite values (stat_smooth).

# line graph of average delay by age
flights %>%
  inner_join(plane_ages) %>%
  group_by(age) %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(age, delay)) +
  geom_point() +
  geom_line()
## Joining, by = "tailnum"
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_path).

In this situation, left_join() could also be used because ggplot() and mean(na.rm = TRUE) drop missing values (remember that left_join() keeps all rows from flights, even if we don’t have information on the plane).

flights %>%
  left_join(plane_ages) %>%
  ggplot(aes(age, dep_delay)) +
  geom_smooth()
## Joining, by = "tailnum"
## `geom_smooth()` using method = 'gam'
## Warning: Removed 61980 rows containing non-finite values (stat_smooth).

flights %>%
  left_join(plane_ages) %>%
  group_by(age) %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(age, delay)) +
  geom_point() +
  geom_line()
## Joining, by = "tailnum"
## Warning: Removed 2 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing missing values (geom_path).

The important takeaway is that departure delays do not appear to increase with plane age – in fact they seem to decrease slightly (though with an expanding confidence interval). Care to think of a reason why this may be so?

Add the location of the origin and destination (i.e. the lat and lon) to flights.

Click for the solution

This is a mutating join, and the basic function you need to use here is left_join(). We have to perform the joining operation twice since we want to create new variables based on both the destination airport and the origin airport. And because the name of the key variable differs between the data frames, we need to explicitly define how to join the data frames using the by argument:

flights %>%
  left_join(airports, by = c(dest = "faa")) %>%
  left_join(airports, by = c(origin = "faa"))
## # A tibble: 336,776 x 33
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 26 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, name.x <chr>, lat.x <dbl>,
## #   lon.x <dbl>, alt.x <int>, tz.x <dbl>, dst.x <chr>, tzone.x <chr>,
## #   name.y <chr>, lat.y <dbl>, lon.y <dbl>, alt.y <int>, tz.y <dbl>,
## #   dst.y <chr>, tzone.y <chr>

Notice that with this approach, we are joining all of the columns in airports. The instructions just asked for latitude and longitude, so we can create a copy of airports that only includes the necessary variables (lat and lon, plus the primary key variable faa) and join flights to that data frame:

airports_lite <- airports %>%
  select(faa, lat, lon)

flights %>%
  left_join(airports_lite, by = c(dest = "faa")) %>%
  left_join(airports_lite, by = c(origin = "faa"))
## # A tibble: 336,776 x 23
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 16 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, lat.x <dbl>, lon.x <dbl>, lat.y <dbl>,
## #   lon.y <dbl>

This is better, but now we have two sets of latitude and longitude variables in the data frame: one for the destination airport, and one for the origin airport. When we perform the second left_join() operation, to avoid duplicate variable names the function automatically adds generic .x and .y suffixes to the output to disambiguate them. This is nice, but we might want something more intuitive to explicitly identify which variables are associated with the destination vs. the origin. To do that, we override the default suffix argument with custom suffixes:

airports_lite <- airports %>%
  select(faa, lat, lon)

flights %>%
  left_join(airports_lite, by = c(dest = "faa")) %>%
  left_join(airports_lite, by = c(origin = "faa"), suffix = c(".dest", ".origin"))
## # A tibble: 336,776 x 23
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 16 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, lat.dest <dbl>, lon.dest <dbl>,
## #   lat.origin <dbl>, lon.origin <dbl>

Acknowledgements

Session Info

devtools::session_info()
## Session info -------------------------------------------------------------
##  setting  value                       
##  version  R version 3.4.3 (2017-11-30)
##  system   x86_64, darwin15.6.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/Chicago             
##  date     2018-03-19
## Packages -----------------------------------------------------------------
##  package      * version    date       source                             
##  assertthat     0.2.0      2017-04-11 CRAN (R 3.4.0)                     
##  backports      1.1.2      2017-12-13 CRAN (R 3.4.3)                     
##  base         * 3.4.3      2017-12-07 local                              
##  bindr          0.1.1      2018-03-13 CRAN (R 3.4.3)                     
##  bindrcpp       0.2        2017-06-17 CRAN (R 3.4.0)                     
##  broom          0.4.3      2017-11-20 CRAN (R 3.4.1)                     
##  cellranger     1.1.0      2016-07-27 CRAN (R 3.4.0)                     
##  cli            1.0.0      2017-11-05 CRAN (R 3.4.2)                     
##  colorspace     1.3-2      2016-12-14 CRAN (R 3.4.0)                     
##  compiler       3.4.3      2017-12-07 local                              
##  crayon         1.3.4      2017-10-03 Github (gaborcsardi/crayon@b5221ab)
##  datasets     * 3.4.3      2017-12-07 local                              
##  devtools       1.13.5     2018-02-18 CRAN (R 3.4.3)                     
##  digest         0.6.15     2018-01-28 CRAN (R 3.4.3)                     
##  dplyr        * 0.7.4.9000 2017-10-03 Github (tidyverse/dplyr@1a0730a)   
##  evaluate       0.10.1     2017-06-24 CRAN (R 3.4.1)                     
##  forcats      * 0.3.0      2018-02-19 CRAN (R 3.4.3)                     
##  foreign        0.8-69     2017-06-22 CRAN (R 3.4.3)                     
##  ggplot2      * 2.2.1      2016-12-30 CRAN (R 3.4.0)                     
##  glue           1.2.0      2017-10-29 CRAN (R 3.4.2)                     
##  graphics     * 3.4.3      2017-12-07 local                              
##  grDevices    * 3.4.3      2017-12-07 local                              
##  grid           3.4.3      2017-12-07 local                              
##  gtable         0.2.0      2016-02-26 CRAN (R 3.4.0)                     
##  haven          1.1.1      2018-01-18 CRAN (R 3.4.3)                     
##  hms            0.4.2      2018-03-10 CRAN (R 3.4.3)                     
##  htmltools      0.3.6      2017-04-28 CRAN (R 3.4.0)                     
##  httr           1.3.1      2017-08-20 CRAN (R 3.4.1)                     
##  jsonlite       1.5        2017-06-01 CRAN (R 3.4.0)                     
##  knitr          1.20       2018-02-20 CRAN (R 3.4.3)                     
##  lattice        0.20-35    2017-03-25 CRAN (R 3.4.3)                     
##  lazyeval       0.2.1      2017-10-29 CRAN (R 3.4.2)                     
##  lubridate      1.7.3      2018-02-27 CRAN (R 3.4.3)                     
##  magrittr       1.5        2014-11-22 CRAN (R 3.4.0)                     
##  memoise        1.1.0      2017-04-21 CRAN (R 3.4.0)                     
##  methods      * 3.4.3      2017-12-07 local                              
##  mnormt         1.5-5      2016-10-15 CRAN (R 3.4.0)                     
##  modelr         0.1.1      2017-08-10 local                              
##  munsell        0.4.3      2016-02-13 CRAN (R 3.4.0)                     
##  nlme           3.1-131.1  2018-02-16 CRAN (R 3.4.3)                     
##  nycflights13 * 0.2.2      2017-01-27 CRAN (R 3.4.0)                     
##  parallel       3.4.3      2017-12-07 local                              
##  pillar         1.2.1      2018-02-27 CRAN (R 3.4.3)                     
##  pkgconfig      2.0.1      2017-03-21 CRAN (R 3.4.0)                     
##  plyr           1.8.4      2016-06-08 CRAN (R 3.4.0)                     
##  psych          1.7.8      2017-09-09 CRAN (R 3.4.1)                     
##  purrr        * 0.2.4      2017-10-18 CRAN (R 3.4.2)                     
##  R6             2.2.2      2017-06-17 CRAN (R 3.4.0)                     
##  Rcpp           0.12.15    2018-01-20 CRAN (R 3.4.3)                     
##  readr        * 1.1.1      2017-05-16 CRAN (R 3.4.0)                     
##  readxl         1.0.0      2017-04-18 CRAN (R 3.4.0)                     
##  reshape2       1.4.3      2017-12-11 CRAN (R 3.4.3)                     
##  rlang          0.2.0      2018-02-20 cran (@0.2.0)                      
##  rmarkdown      1.9        2018-03-01 CRAN (R 3.4.3)                     
##  rprojroot      1.3-2      2018-01-03 CRAN (R 3.4.3)                     
##  rstudioapi     0.7        2017-09-07 CRAN (R 3.4.1)                     
##  rvest          0.3.2      2016-06-17 CRAN (R 3.4.0)                     
##  scales         0.5.0      2017-08-24 cran (@0.5.0)                      
##  stats        * 3.4.3      2017-12-07 local                              
##  stringi        1.1.7      2018-03-12 CRAN (R 3.4.3)                     
##  stringr      * 1.3.0      2018-02-19 CRAN (R 3.4.3)                     
##  tibble       * 1.4.2      2018-01-22 CRAN (R 3.4.3)                     
##  tidyr        * 0.8.0      2018-01-29 CRAN (R 3.4.3)                     
##  tidyverse    * 1.2.1      2017-11-14 CRAN (R 3.4.2)                     
##  tools          3.4.3      2017-12-07 local                              
##  utils        * 3.4.3      2017-12-07 local                              
##  withr          2.1.1      2017-12-19 CRAN (R 3.4.3)                     
##  xml2           1.2.0      2018-01-24 CRAN (R 3.4.3)                     
##  yaml           2.1.18     2018-03-08 CRAN (R 3.4.4)

This work is licensed under the CC BY-NC 4.0 Creative Commons License.