--- title: "Week 3 HW Hall of Fame" author: "Rebecca Ferrell" date: "`r format(Sys.Date(), '%B %d, %Y')`" output: html_document: toc: true toc_float: true number_sections: true --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` Here are some excerpts of the reports students prepared for the Week 3 homework on `dplyr` practice using the `nycflights13` data. All have been edited slightly for formatting or to consolidate code since in many cases there was exploratory data analysis that preceded these results. ```{r load_libraries, warning=FALSE, message=FALSE} library(ggplot2) library(dplyr) library(nycflights13) library(pander) ``` # Weather and departure delays {.tabset} Many students explored analyses related to the precipitation patterns we looked at in class. Here are a couple of nice examples of those. ## Visibility This analysis involves changing the departure delay variable into units of hours, merging the `weather` data onto the `flights` to get visibility information (matching airport, day, and hour), and then looking at a non-parametric estimate of the mean departure delay as a function of visibility. This reveals an interesting pattern at NYC area airports: generally, better visibility is associated with shorter departure delays on average, but delays in lowest visibility conditions (under 1 miles, so "fog" or "mist" by [international standards](https://en.wikipedia.org/wiki/Visibility#Fog.2C_mist.2C_and_haze)) are typically shorter than with slightly better visibility. This pattern could be worth looking into in more detail. ```{r kennedy_visibility} flights %>% mutate(dep_delay_hr = dep_delay/60) %>% select(origin, year, month, day, hour, dep_delay_hr) %>% inner_join(weather, by = c("origin", "year", "month", "day", "hour")) %>% select(dep_delay_hr, visib) %>% filter(!is.na(dep_delay_hr) & !is.na(visib)) %>% ggplot(aes(x = visib, y = dep_delay_hr)) + geom_smooth() + theme_bw(base_size = 16) + xlab("Visibility (miles)") + ylab("Average departure delay (hours)") + ggtitle("Trend in mean departure delay by visibility") + theme(plot.title=element_text(size=12)) ``` ## Pressure This student noted that high air pressure is often associated with nice weather, and we do see that as pressure increases, departure delays are typically shorter in the `nycflights13` data. ```{r riddhi_pressure} flights %>% select(origin, dest, year, month, day, hour, dep_delay, arr_delay) %>% inner_join(weather, by = c("origin", "year", "month", "day", "hour")) %>% select(dep_delay, pressure) %>% filter(!is.na(dep_delay) & !is.na(pressure)) %>% ggplot(aes(x = pressure, y = dep_delay)) + geom_smooth() + theme_bw(base_size = 10) + xlab("Sea-level Pressure at departure hour (millibars)") + ylab("Average departure delay (minutes)") + ggtitle("Impact of Sea-level Pressure on Departure Delays") ``` # NYC-SAN speeds by carrier This analysis looks at the airlines that fly to San Diego from New York, calculates how fast these flights were, and compares them across carriers. There is more variation in flight speeds than one might expect on ths common route, and some airlines appear to be faster than others. ```{r yusri_sandiego_speed, warning=FALSE} flights %>% filter(dest == 'SAN') %>% left_join(planes, by = "tailnum") %>% left_join(airlines, by = "carrier") %>% mutate(milesperminute = distance / air_time, milesperhour = milesperminute * 60) %>% ggplot(aes(x = carrier, y = milesperhour, fill = carrier)) + geom_boxplot() + guides(fill=FALSE) + ggtitle("New York-San Diego: Speeds by Carrier") + ylab("Miles Per Hour") + coord_flip() + theme(axis.title.y = element_blank(), panel.grid.minor=element_blank(), panel.grid.major=element_blank()) + scale_x_discrete(breaks=c("AA", "B6", "DL", "UA"), labels=c("American Airlines", "JetBlue Airways", "Delta Air Lines", "United Airlines")) ``` # United's delays by NYC airport This table looks at United Airline's track record for departure and arrival delays leaving from the three NYC airports. It turns out that almost half of all United Airlines flights passing through EWR depart late, which appears to be substantially higher than JFK and LGA. ```{r jordan_united_delays, warning=FALSE} flights %>% rename(destination = dest, departure_delay = dep_delay, arrival_delay = arr_delay, time_in_air = air_time) %>% select(-year, -dep_time, -arr_time, -tailnum, -flight, -hour, -minute, -day) %>% mutate(ind_delayed_dep = ifelse(departure_delay > 0, 1, 0), ind_delayed_arr = ifelse(arrival_delay > 0, 1, 0)) %>% left_join(airlines, by = "carrier") %>% group_by(origin, name) %>% filter(carrier == "UA") %>% summarize(n_obs = n(), per_delayed_dep = round(sum(ind_delayed_dep, na.rm=TRUE) / n(),2), per_delayed_arr = round(sum(ind_delayed_arr, na.rm=TRUE) / n(),2)) %>% rename(Airport = origin, `Carrier Name` = name, `Number of Flights` = n_obs, `Proportion with Departure Delays` = per_delayed_dep, `Proportion with Arrival Delays` = per_delayed_arr) %>% pander(style = "rmarkdown", split.tables = 200) ``` # Departure delays by month using base functions {.tabset} It makes sense that delays might vary by the time of year. This analysis uses base R functions to examine this. I'm showing the original way the student did this using `tapply` (which is a base R equivalent of `group_by` and `summarize`) with `barplot` and the delightful `rainbow` color-making function!, but I'm also presenting an alternative version using `dplyr` and `ggplot2` so that you can see the similarities and differences in the approaches. ## tapply/base ```{r kevin_rainbow_delays_orig} # Get rid of NAs flights_2 <- flights %>% filter(!is.na(dep_delay) & !is.na(month)) # Make string for months months_Str <- c("JAN","FEB","MAR","APR","MAY","JUN", "JUL","AUG","SEP","OCT","NOV","DEC") #Pull mean departure delays by month mean_dep_delay_month <- tapply(flights_2$dep_delay, flights_2$month, mean) color <- rainbow(12) barplot(mean_dep_delay_month, names.arg = months_Str, main= "Average delayed departure time by month", col=color, xlab="Month", ylab="Time (Minutes)") ``` ## dplyr/ggplot2 ```{r kevin_rainbow_delays_rev} months_Str <- c("JAN","FEB","MAR","APR","MAY","JUN", "JUL","AUG","SEP","OCT","NOV","DEC") flights %>% filter(!is.na(dep_delay) & !is.na(month)) %>% group_by(month) %>% summarize(mean_dep_delay = mean(dep_delay)) %>% # make month categorical by wrapping factor() around it # and telling it the underlying labels ggplot(aes(x = factor(month, labels = months_Str), y = mean_dep_delay)) + # rainbow colors on the bars, but not related to the values # so not inside an aes(). color = "black" does the outline geom_bar(stat = "identity", fill = rainbow(12), color = "black") + xlab("Month") + ylab("Time (minutes)") + ggtitle("Average delayed departure time by month") ``` # How many seats are on a plane? This analysis investigated the years aircraft were manufactured (`year` on the `planes` dataset, not to be confused with `year` on the `flights` data!) and how many seats were on the planes. It doesn't look like newer planes are holding more people, and you can see from the horizontal "stripes" in the plot that the airlines were probably making the same models in the same sizes year after year. It is also slightly terrifying that a plane made before 1970 would be flown in 2013 --- perhaps this is a data error? One change I made is to use `distinct` in subsetting the data for plotting, as we only need one row per year per maker per seat count observation. Without it, we get the same result, but the plot takes considerably longer to render because the `flights` dataset has hundreds of thousands of points. ```{r katie_seats_per_plane, warning=FALSE} data_manufacturer <- flights %>% select(carrier, tailnum) %>% left_join(airlines, by = "carrier") %>% inner_join(planes, by = "tailnum") %>% #let's group Airbus and Airbus Industries into one category mutate(makers = ifelse((manufacturer == "AIRBUS" | manufacturer == "AIRBUS INDUSTRIE"), "AIRBUS", as.character(manufacturer))) # set up a vector of the top 3 within a list list <- c("AIRBUS", "BOEING", "EMBRAER") # how has the number of seats per plane changed over time? ggplot(data = data_manufacturer %>% # filter to just the top 3 using the "in" function filter(makers %in% list) %>% select(makers, year, seats) %>% # just need one row per maker-year-seat # (many duplicates on the data!) distinct(makers, year, seats), aes(x = year, y=seats, color = makers)) + geom_point() + theme_light() + xlab("years") + ylab("number of seats") + ggtitle("How has number of seats per plane changed over time?") ``` # STL-bound delays by time of day This student looked at flights going to St. Louis from NYC on the largest carriers serving that route and which times of day had higher departure delays. As we might have guessed, flights leaving later in the day are more delayed. Southwest passengers in particular seem to have really suffered in taking afternoon and evening flights from NYC! ```{r tricia_stlouis_delays, warning=FALSE, message=FALSE} flights %>% filter (dest == "STL") %>% left_join(airlines) %>% mutate(dep_delay_hours = (dep_delay/60)) %>% filter(dep_delay_hours < 10 & carrier %in% c("AA", "EV", "WN", "MQ")) %>% ggplot(aes(x = hour, y = dep_delay_hours, color = name))+ facet_wrap( ~ name) + geom_point () + xlab ("Hour of the Day") + ylab ("Departure Delay (Hours)") + geom_line(stat = "smooth", method = "loess" , aes(group = name, color = name)) + theme_bw() + scale_color_manual(name = "Airline", values = c("American Airlines Inc." = "royalblue4", "DeltaAir Lines Inc." = "blue4", "Envoy Air" = "springgreen4", "ExpressJet Airlines Inc." = "yellow", "Southwest Airlines Co." = "orange", "United Air Lines Inc." = "skyblue")) + guides(color=FALSE) + ggtitle("Delays in flights to St. Louis by time of day") ```