---
title: "NYC Flights - data wrangling II"
author: "YOUR NAME HERE"
date: "`r Sys.Date()`"
format:
  html:
    self-contained: true
    toc: true
    toc_float: true
    number_section: false
    highlight: "tango"
    theme: "cosmo"
editor: visual
editor_options:
  chunk_output_type: console
---

We will again work with data from the `nycflights13` package.

```{r}
#| label: load-packages
#| message: false
library(nycflights13)
library(tidyverse)
```

## Exercise 1

Examine the documentation for the datasets `airports`, `flights`, and `planes`. What are the dimensions of each? How are these datasets related?

> YOUR ANSWER HERE

## EXAMPLE

Suppose you wanted to make a map of the route of every flight. What variables would you need, and from which datasets?

You need the geographic location of the airports (from `airports`) and the path of each flight, i.e., which airports were involved (from `flights`). We therefore want to join `flights` to `airports`.

Note that these two datasets have no variable names in common, so we have to specify the variables to join on using `by =`. Here the `dest` code in `flights` is matched to the `faa` code in `airports`. Check out the documentation for more information.

```{r}
flights |>
  left_join(airports, by = c("dest" = "faa"))
```

## Exercise 2

Which airports are in `flights` but not in `airports`? Google to find out what these airports are.

```{r}

```

> YOUR ANSWER HERE

Which airports are in `airports` but not in `flights`? What does this tell us about these airports (at least in 2013)?

```{r}

```

> YOUR ANSWER HERE

## Exercise 3

Starting with the `flights` dataset, create a new dataset `dest_delays` with the median arrival delay for each destination.

*Note: this question does not require you to use joins. Make sure to add `na.rm = TRUE` when computing the median. Check: `dest_delays` should have dimensions 105 x 2.*

```{r}

```

## Exercise 4

Join the columns in `airports` to `dest_delays` (preserving all rows in `dest_delays`) and store the result in `delays_by_airport`.

*Check: `delays_by_airport` should have dimensions 105 x 9.*

```{r}

```

Based on your answer to Exercise 2, how many rows in `delays_by_airport` do you expect to be missing latitude and longitude information?

> YOUR ANSWER HERE

## Exercise 5

***Is there a relationship between the age of a plane and its delays?***

The plane tail number is given in the `tailnum` variable in the `flights` dataset. The year the plane was manufactured is given in the `year` variable in the `planes` dataset.

Start by finding the median arrival delay for each plane and store the resulting dataset in `plane_delays`.

*Check: `plane_delays` should have dimensions 4044 x 2.*

```{r}

```

Join `plane_delays` to the `planes` data using an appropriate join, and then use `mutate()` to create an `age` variable. Note that this data is from 2013.

```{r}

```

Finally, create an effective visualization of the data to investigate whether there is a relationship between the age of a plane and its delays. Comment on your conclusions.

```{r}

```

## (FOR LATER) Extra mapping fun

Try re-creating the visualization below and/or exploring other ways to visualize the flights data spatially.

*Note: to view the image in your html, download the "flights_map.png" file and place it in the images subfolder of your STAT_7500 folder.*

![](images/flights_map.png)

Some starter code is provided below for creating a basic US map.

```{r}
#| warning: false
#| message: false
library(maps)
library(mapdata)

# Polygon outlines for the contiguous US states
state <- map_data("state")

ggplot() +
  geom_polygon(
    data = state,
    aes(x = long, y = lat, group = group),
    fill = "lightblue",
    color = "white"
  ) +
  theme_void() +
  coord_fixed(1.3)
```

Adding the flight trajectories will require some creative data wrangling. You can use a `geom_curve()` layer to create the flight curves; the plot above uses `curvature = 0.2`. Note that you can use different datasets and aesthetic mappings in different `geom` layers (see the `geom_polygon()` layer above).

```{r}

```
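As one possible starting point for the wrangling step, the sketch below builds a table with one row per origin-destination pair and attaches coordinates for both ends by joining to `airports` twice. The names `routes`, `origin_lat`, `origin_lon`, `dest_lat`, and `dest_lon` are illustrative and not part of the assignment. The longitude cutoff of -130 is an assumption used only to drop destinations in Alaska and Hawaii, which the `"state"` map does not draw. It is not meant to reproduce the reference image exactly.

```{r}
#| warning: false
#| message: false
library(nycflights13)
library(tidyverse)
library(maps)

# One row per distinct origin-destination pair flown in 2013
routes <- flights |>
  distinct(origin, dest) |>
  # Attach the coordinates of the origin airport
  inner_join(
    airports |> select(faa, origin_lat = lat, origin_lon = lon),
    by = c("origin" = "faa")
  ) |>
  # Attach the coordinates of the destination airport;
  # inner_join() drops destinations that are missing from `airports`
  inner_join(
    airports |> select(faa, dest_lat = lat, dest_lon = lon),
    by = c("dest" = "faa")
  ) |>
  # Assumed cutoff: keep only destinations that fall on the contiguous US map
  filter(dest_lon > -130)

state <- map_data("state")

ggplot() +
  # Layer 1: state polygons, drawn from the `state` data
  geom_polygon(
    data = state,
    aes(x = long, y = lat, group = group),
    fill = "lightblue",
    color = "white"
  ) +
  # Layer 2: one curve per route, drawn from the `routes` data
  geom_curve(
    data = routes,
    aes(x = origin_lon, y = origin_lat, xend = dest_lon, yend = dest_lat),
    curvature = 0.2,
    alpha = 0.3
  ) +
  theme_void() +
  coord_fixed(1.3)
```

Each layer receives its own `data` argument, so the polygons and the curves never need to live in the same data frame. Counting flights per route before the joins would give a variable that could be mapped to `alpha` or to line width.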