--- title: "NYC flights - data wrangling" author: "YOUR NAME HERE" date: "`r Sys.Date()`" format: html: self-contained: true toc: true toc_float: true number_section: false highlight: "tango" theme: "cosmo" editor: visual editor_options: chunk_output_type: console --- ```{r} #| label: load-packages #| message: false library(tidyverse) library(nycflights13) ``` To demonstrate data wrangling we will use a dataset of characteristics of all flights departing from New York City (JFK, LGA, EWR) in 2013. We will use the `flights` data frame that comes from the `nycflights13` package. We first explore the data a bit. Examine the documentation as well. ```{r} glimpse(flights) ``` # Exercises Practice using `dplyr` verbs to perform each of the following tasks. Unless explicitly stated, you should not save the output into an object. # `select()` 1. Select only the departure delay and arrival delay columns. *Check: dimensions should be 336776 x 2* ```{r} ``` 2. Select all variables in `flights`, excluding departure delays. *Check: dimensions should be 336776 x 18* ```{r} ``` 3. Select just the first three variables in `flights` . *Check: dimensions should be 336776 x 3* ```{r} ``` # `arrange()` 4. Arrange the `flights` data by descending departure delays, with large departure delays on top. *Check: top row have `dep_delay`* *of 1301* ```{r} ``` # `slice()` 5. Select (`slice`) the first 5 rows of `flights` . *Check: dimensions should be 5 x 19.* ```{r} ``` 6. Combine `arrange` and `slice` to select the 5 flights with the longest departure delays. *Check: dimensions should be 5 x 19, 5th row should have `dep_delay`* *of 1005.* ```{r} ``` # `filter()` 7. Select all rows where the destination airport is "LAX". *Check: dimensions should be 16174 x 19* ```{r} ``` 8. Select all rows where the destination airport is "LAX" and the arrival delay is less than 0. *Check: dimensions should be 9790 x 19* ```{r} ``` 9. Select all rows where the destination airport is "LAX", "BUR", or "LGB" AND the departure or the arrival delay is negative. *Check: dimensions should be 12492 x 19* ```{r} ``` # `mutate()` 10. `air_time` is provided in minutes. Create a new variable called `hours` that coverts `air_time` to hours. Make sure to save this new variable by overwriting the `flights` dataframe. *Check: dimensions of `flights` should now be 336776 x 20. First value of `hours`* *should be* 3.7833333. ```{r} ``` 11. Create a new variable called `mph` that computes the miles traveled (`distance`) per hour. Make sure to save this variable. *Check: dimensions of `flights` should now be 336776 x 21. First value of `mph`* *should be* 370.0441 ```{r} ``` 12. Create a new variable called `on_time` that is `TRUE` if the flight is on time (or early) and `FALSE` if it is not. Make sure to save this variable. *Check: dimensions of `flights` should now be 336776 x 22. First value of `on_time`* should be FALSE. ```{r} ``` # `summarise()` and `group_by()` 13. Calculate the average departure delay. *Check: answer should be 12.6* ```{r} ``` 14. Calculate the average departure delay for each month. *Check: dimensions should be 12 x 2* ```{r} ``` 15. Count the number of on-time flights for each month. *Check: count should be 15248 in January.* ```{r} ``` 16. Compute the proportion of flights that are on-time for each month. *Check: proportion should be 0.565 for January.* ```{r} ``` 17. In addition to the proportion on time for each month, also find the maximum delay, median delay, and the count of flights in each month. *Check: dimensions should be 12 x 5.* ```{r} ``` 18. What is the proportion on time for flights traveling to "LAX", "BUR", and "LGB"? *Check: Proportion on time for Burbank is 0.55.* ```{r} ``` # More (optional) practice 19. Create a new dataset that only contains flights that do not have a missing departure time. Include the columns `year`,`month`, `day`, `dep_time`, `dep_delay`, and `dep_delay_hours` (the departure delay in hours). Note you may need to use `mutate()` to make one or more of these variables. *Check: dimensions should be 328521 x 6.* ```{r} ``` 20. For each airplane (uniquely identified by `tailnum`), use a `group_by()` paired with `summarize()` to find the sample size, mean, and standard deviation of flight distances. Then, pick off the top 5 and bottom 5 airplanes in terms of mean distance traveled per flight. *Check: dimensions should be 10 x 4. Largest mean is 173, smallest is 4983.* ```{r} ```