--- title: "Data Wrangling II" output: revealjs::revealjs_presentation: theme: white center: true transition: none incremental: true --- ```{r setup, include=FALSE} library(ggplot2) library(dplyr) library(knitr) library(readr) library(dplyr) knitr::opts_chunk$set(echo = TRUE) ``` # Baltimore Towing Data ## An Overview This dataset contains information on vehicles towed in Baltimore, MD: - A larger version of this dataset along with additional descriptions can be found at: [https://data.baltimorecity.gov/Transportation/DOT-Towing/k78j-azhn](https://data.baltimorecity.gov/Transportation/DOT-Towing/k78j-azhn). - The full version of the dataset contains 61,000 rows and 36 columns, where each row corresponds to a vehicle and the columns are information pertaining to the vehicle. - We will be working with a smaller dataset with approximately 30,000 rows and 5 columns. ## The dataset First read in the data set which is available at: [http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/BaltimoreTowing.csv](http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/BaltimoreTowing.csv). ```{r, message = F} baltimore.tow <- read_csv('http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/BaltimoreTowing.csv') str(baltimore.tow) ``` ## Information for a few vehicles ```{r, echo = F} kable(head(baltimore.tow, 8)) ``` ## Data Wrangling Concepts - Dealing with strings - Dealing with date/time objects - Dealing with factors # Goal 1. Average Towing Cost by Month ## Motivating Exercise: group_by() Now compute the average towing cost grouped by month. ## Not Solution: group_by() Now compute the average towing cost grouped by month. ```{r, eval = F} baltimore.tow %>% group_by(month) %>% summarize(mean.cost = mean(totalPaid)) ``` ## Motivating Exercise: group_by() Now compute the average towing cost grouped by month. ```{r} kable(head(baltimore.tow, 4)) ``` ## `substr()` function Consider adding a column for year to the data set. This can be done using `substr().` *Usage*: substr(x, start, stop) *Arguments*: - x, text a character vector. - start, first integer. The first element to be extracted - stop, last integer. The last element to be extracted ## Exercise: Using the substr() function Use the substr() function to extract month and create a new variable in R. ## Solution: Using the substr() function Use the substr() function to extract month and create a new variable in R. ```{r} baltimore.tow$month <- substr(baltimore.tow$receivingDateTime,0,2) head(baltimore.tow$month) ``` ## Motivating Exercise: group_by() Now compute the average towing cost grouped by month. ```{r, warning = F} baltimore.tow %>% group_by(month) %>% summarize(mean.cost = mean(totalPaid)) ``` ## strsplit() function In many situations, the year could be in a different position so the `substr()` might not work. For example month the date could be coded `4/1/2015` rather than `04/01/2015` So consider, using `strsplit()` instead. *Usage*: strsplit(x, split) *Arguments*: - x: character vector, each element of which is to be split. Other inputs, including a factor, will give an error. - split: character vector (or object which can be coerced to such) containing regular expression(s) (unless fixed = TRUE) to use for splitting. ## Exercise: Using the strsplit() function Use the strsplit() function to remove the dollar sign from the cost. ## Solution: Using the strsplit() function Use the strsplit() function to remove the dollar sign from the cost. ```{r} ## example for one row strsplit(baltimore.tow$totalPaid[1],'$', fixed = T)[[1]][2] ``` ## lubridate package lubridate is a tidyverse package for manipulating date objects. There is a nice [website](https://lubridate.tidyverse.org). ## Date objects for Baltimore tow ```{r, message=F} library(lubridate) class(baltimore.tow$receivingDateTime) baltimore.tow <- baltimore.tow %>% mutate(date_time = mdy_hms(receivingDateTime)) class(baltimore.tow$date_time) ``` ## Date objects for Baltimore tow ```{r} head(month(baltimore.tow$date_time)) head(year(baltimore.tow$date_time)) ``` ## Stringr Package The `stringr` package ([cheat sheet](https://github.com/rstudio/cheatsheets/raw/master/strings.pdf)) provides a nice set of tools. There is also an [information page](https://stringr.tidyverse.org/articles/from-base.html). ## Exercise: Stringr approach Use the `stringr` package to remove (replace) the dollar sign. Note that a dollar sign is a special character, so you'll need to use `\\$`. ## Solution: Stringr approach Use the `stringr` package to remove (replace) the dollar sign ```{r} library(stringr) baltimore.tow$cost <- as.numeric(str_replace(baltimore.tow$totalPaid, '\\$','')) ``` ## Motivating Exercise: group_by() Now compute the average towing cost grouped by month. ```{r} baltimore.tow %>% group_by(month) %>% summarize(mean.cost = mean(cost), .groups = 'keep') ``` # Goal 2. Type of Vehicles Towed by Month ## Goal 2. Type of Vehicles Towed by Month Next we wish to compute how many vehicles were towed for each vehicle type However, we want to take a close look at the vehicle types in the data set and perhaps create more useful groups. ## Unique function - how to group vehicles First examine the unique types of vehicles in this data set. ```{r} unique(baltimore.tow$vehicleType) ``` ## Grouping First consider reasonable groups for vehicle types. 1. Cars - (Car, convertible) 2. Large Cars - (SUV, Station Wagon, Sport Utility Vehicle, Van, Taxi) 3. Trucks - (Pick-up Truck, Pickup Truck) 4. Large Trucks - (Truck, Tractor Trailer, Tow Truck, Tractor, Construction Equipment, Commercial Truck) 5. Bikes - (Motor Cycle (Street Bike), Dirt Bike, All terrain - 4 wheel bike, Mini-Bike) 6. Misc (delete) - (Boat, Golf Cart, Trailer) ## Messy Data: Grouping Next examine values in some of these groups, we will just look at the vehicle type of 'Truck'. ```{r, mysize=TRUE, size='\\scriptsize', tidy=T} unique(baltimore.tow$vehicleMake[baltimore.tow$vehicleType == 'Truck']) ``` \normalsize Note that there are several spelling errors in this data set. How do we combine them? ## Messy Data: Data Cleaning Spelling errors can be addressed, by reassigning vehicles to the correct spelling. ```{r} baltimore.tow$vehicleMake[baltimore.tow$vehicleMake == 'Peterbelt'] <- 'Peterbilt' baltimore.tow$vehicleMake[baltimore.tow$vehicleMake == 'Internantional'] <- 'International' baltimore.tow$vehicleMake <- str_replace(baltimore.tow$vehicleMake,'Izuzu','Isuzu') baltimore.tow$vehicleMake <- str_replace(baltimore.tow$vehicleMake,'Frightliner','Freightliner') ``` Also note that many of the groupings have mis-classified vehicles, but we will not focus on that yet. ## Exercise: Delete Misc. Type Vehicles First we will delete golf carts, boats, and trailers. There are several ways to do this, consider making a new data frame called balt.tow.small that does not include golf carts, boats, and trailers. ## Solution: Delete Misc. Type Vehicles First we will delete golf carts, boats, and trailers. ```{r} balt.tow.small <- baltimore.tow %>% filter(!vehicleType %in% c("Golf Cart",'Boat', 'Trailer')) ``` ## Exercise: Create Additional Groups Now we need to create a variable for the additional groups below. 1. Cars - (Car, convertible) 2. Large Cars - (SUV, Station Wagon, Sport Utility Vehicle, Van, Taxi) 3. Trucks - (Pick-up Truck, Pickup Truck) 4. Large Trucks - (Truck, Tractor Trailer, Tow Truck, Tractor, Construction Equipment, Commercial Truck) 5. Bikes - (Motor Cycle (Street Bike), Dirt Bike, All terrain - 4 wheel bike, Mini-Bike) ## Solution: Create Additional Groups One way to create groups is by creating a new variable ```{r} balt.tow.small$Group <- '' balt.tow.small$Group[balt.tow.small$vehicleType %in% c('Car','Convertible')] <- 'Cars' balt.tow.small$Group[balt.tow.small$vehicleType %in% c('SUV', 'Station Wagon','Sport Utility Vehicle','Van','Taxi')] <- 'Large Cars' balt.tow.small$Group[balt.tow.small$vehicleType %in% c('Pick-up Truck','Pickup Truck')] <- 'Trucks' balt.tow.small$Group[balt.tow.small$vehicleType %in% c('Truck','Tractor Trailer','Tow Truck','Tractor', 'Construction Equipment','Commercial Truck')] <- 'Large Trucks' balt.tow.small$Group[balt.tow.small$vehicleType %in% c('Motor Cycle (Street Bike)','Dirt Bike','Mini-Bike', 'All terrain - 4 wheel bike')] <- 'Bikes' ``` ## Solution: Next we wish to compute how many vehicles were towed for each vehicle type ```{r} balt.tow.small %>% group_by(Group) %>% tally() ``` # Factors ## Factors... Factors are a specific way to store categorical data. Using factors results in a more efficient data storage process, but can be cumbersome. Factors can be necessary for making plots and fitting models in R. ## forcats Package The `forcats` package, [website](https://forcats.tidyverse.org), is a tidyverse package designed for dealing with categorical factors. ## Character Values ```{r} favorite_day <- c('Friday', 'Saturday', 'Sunday', 'Tuesday', 'Saturday', 'Saturday') class(favorite_day) ``` ## Creating Factors ```{r} day_factor <- as.factor(favorite_day) class(day_factor) sort(day_factor) ``` ## Reordering Factors ```{r} library(forcats) day_factor <- fct_relevel(day_factor,c('Sunday','Tuesday','Friday','Saturday')) sort(day_factor) ``` ## Creating Factors Rather than coercing a class variable to be a factor, the factor can be created directly. ```{r} day_factor2 <- factor(c('Friday', 'Saturday', 'Sunday', 'Monday'), levels = c('Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday')) sort(day_factor2) ``` ## Collapsing Factors Factors can also easily be collapsed with `forcats` ```{r} balt.tow.small %>% mutate(Group2 = fct_collapse(vehicleType, Cars = c('Car','Convertible'), Large_Cars = c('SUV','Station Wagon','Sport Utility Vehicle','Van','Taxi'), Trucks = c('Pick-up Truck','Pickup Truck'), Large_Trucks = c('Truck','Tractor Trailer','Tow Truck','Tractor','Construction Equipment','Commercial Truck'), Bikes = c('Motor Cycle (Street Bike)','Dirt Bike','Mini-Bike','All terrain - 4 wheel bike'))) %>% mutate(Group2 = fct_infreq(Group2)) %>% group_by(Group2) %>% summarize(ave_cost = mean(cost), .groups = 'drop') ```