---
title: "Data Wrangling II"
output:
  revealjs::revealjs_presentation:
    theme: white
    center: true
    transition: none
    incremental: true
---

```{r setup, include=FALSE}
library(ggplot2)
library(dplyr)
library(knitr)
library(readr)
library(dplyr)
knitr::opts_chunk$set(echo = TRUE)
```

# Baltimore Towing Data

## An Overview
This dataset contains information on vehicles towed in Baltimore, MD:

- A larger version of this dataset along with additional descriptions can be found at:  [https://data.baltimorecity.gov/Transportation/DOT-Towing/k78j-azhn](https://data.baltimorecity.gov/Transportation/DOT-Towing/k78j-azhn). 

- The full version of the dataset contains 61,000 rows and 36 columns, where each row corresponds to a vehicle and the columns are information pertaining to the vehicle.

- We will be working with a smaller dataset with approximately 30,000 rows and 5 columns.

## The dataset
First read in the data set which is available at: [http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/BaltimoreTowing.csv](http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/BaltimoreTowing.csv).
```{r, message = F}
baltimore.tow <- read_csv('http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/BaltimoreTowing.csv')
str(baltimore.tow)
```

## Information for a few vehicles

```{r, echo = F}
kable(head(baltimore.tow, 8))
```

## Data Wrangling Concepts

- Dealing with strings
- Dealing with date/time objects
- Dealing with factors

# Goal 1. Average Towing Cost by Month

## Motivating Exercise: group_by()

Now compute the average towing cost grouped by month.

## Not Solution: group_by()

Now compute the average towing cost grouped by month.

```{r, eval = F}
baltimore.tow %>% 
  group_by(month) %>% 
  summarize(mean.cost = mean(totalPaid))
```


## Motivating Exercise: group_by()

Now compute the average towing cost grouped by month.


```{r}
kable(head(baltimore.tow, 4))
```


##  `substr()` function
Consider adding a column for year to the data set. This can be done using `substr().`

*Usage*: substr(x, start, stop)

*Arguments*: 

- x, text	a character vector.
- start, first integer. The first element to be extracted
- stop, last integer. The last element to be extracted

## Exercise: Using the substr() function
Use the substr() function to extract month and create a new variable in R. 

## Solution: Using the substr() function
Use the substr() function to extract month and create a new variable in R. 

```{r}
baltimore.tow$month <- substr(baltimore.tow$receivingDateTime,0,2)
head(baltimore.tow$month)
```

## Motivating Exercise: group_by()

Now compute the average towing cost grouped by month.

```{r, warning = F}
baltimore.tow %>% 
  group_by(month) %>% 
  summarize(mean.cost = mean(totalPaid))
```

## strsplit() function
In many situations, the year could be in a different position so the `substr()` might not work. For example month the date could be coded `4/1/2015` rather than `04/01/2015` So consider, using `strsplit()` instead.

*Usage*: strsplit(x, split)

*Arguments*:

- x: character vector, each element of which is to be split. Other inputs, including a factor, will give an error.
- split: character vector (or object which can be coerced to such) containing regular expression(s) (unless fixed = TRUE) to use for splitting. 

## Exercise: Using the strsplit() function
Use the strsplit() function to remove the dollar sign from the cost.

## Solution: Using the strsplit() function
Use the strsplit() function to remove the dollar sign from the cost.

```{r}
## example for one row
strsplit(baltimore.tow$totalPaid[1],'$', fixed = T)[[1]][2]
```

## lubridate package

lubridate is a tidyverse package for manipulating date objects. There is a nice [website](https://lubridate.tidyverse.org).

## Date objects for Baltimore tow

```{r, message=F}
library(lubridate)
class(baltimore.tow$receivingDateTime)
baltimore.tow <- baltimore.tow %>% 
  mutate(date_time = mdy_hms(receivingDateTime))
class(baltimore.tow$date_time)
```


## Date objects for Baltimore tow

```{r}
head(month(baltimore.tow$date_time))
head(year(baltimore.tow$date_time))
```


## Stringr Package
The `stringr` package ([cheat sheet](https://github.com/rstudio/cheatsheets/raw/master/strings.pdf)) provides a nice set of tools. There is also an [information page](https://stringr.tidyverse.org/articles/from-base.html).


## Exercise: Stringr approach
Use the `stringr` package to remove (replace) the dollar sign. Note that a dollar sign is a special character, so you'll need to use `\\$`.

## Solution: Stringr approach
Use the `stringr` package to remove (replace) the dollar sign
```{r}
library(stringr)
baltimore.tow$cost <- as.numeric(str_replace(baltimore.tow$totalPaid, '\\$',''))
```

## Motivating Exercise: group_by()

Now compute the average towing cost grouped by month.

```{r}
baltimore.tow %>% 
  group_by(month) %>% 
  summarize(mean.cost = mean(cost), .groups = 'keep')
```


# Goal 2. Type of Vehicles Towed by Month


## Goal 2. Type of Vehicles Towed by Month
Next we wish to compute how many vehicles were towed for each vehicle type

However, we want to take a close look at the vehicle types in the data set and perhaps create more useful groups.

## Unique function - how to group vehicles
First examine the unique types of vehicles in this data set.

```{r}
unique(baltimore.tow$vehicleType)
```

## Grouping
First consider reasonable groups for vehicle types.

1. Cars - (Car, convertible)
2. Large Cars - (SUV, Station Wagon, Sport Utility Vehicle, Van, Taxi)
3. Trucks - (Pick-up Truck, Pickup Truck)
4. Large Trucks - (Truck, Tractor Trailer, Tow Truck, Tractor, Construction Equipment, Commercial Truck)
5. Bikes - (Motor Cycle (Street Bike), Dirt Bike, All terrain - 4 wheel bike, Mini-Bike)
6. Misc (delete) - (Boat, Golf Cart, Trailer)

## Messy Data: Grouping
Next examine values in some of these groups, we will just look at the vehicle type of 'Truck'.

```{r, mysize=TRUE, size='\\scriptsize', tidy=T}
unique(baltimore.tow$vehicleMake[baltimore.tow$vehicleType == 'Truck'])
```
\normalsize
Note that there are several spelling errors in this data set. How do we combine them?


## Messy Data: Data Cleaning
Spelling errors can be addressed, by reassigning vehicles to the correct spelling.
```{r}
baltimore.tow$vehicleMake[baltimore.tow$vehicleMake == 
                            'Peterbelt'] <- 'Peterbilt'
baltimore.tow$vehicleMake[baltimore.tow$vehicleMake == 
                            'Internantional'] <- 'International'
baltimore.tow$vehicleMake <-
  str_replace(baltimore.tow$vehicleMake,'Izuzu','Isuzu')
baltimore.tow$vehicleMake <-
  str_replace(baltimore.tow$vehicleMake,'Frightliner','Freightliner')
```
Also note that many of the groupings have mis-classified vehicles, but we will not focus on that yet.

## Exercise: Delete Misc. Type Vehicles

First we will delete golf carts, boats, and trailers. There are several ways to do this, consider making a new data frame called balt.tow.small that does not include golf carts, boats, and trailers.

## Solution: Delete Misc. Type Vehicles

First we will delete golf carts, boats, and trailers.

```{r}
balt.tow.small <- baltimore.tow %>% 
  filter(!vehicleType %in% c("Golf Cart",'Boat', 'Trailer'))
```


## Exercise: Create Additional Groups

Now we need to create a variable for the additional groups below.

1. Cars - (Car, convertible)
2. Large Cars - (SUV, Station Wagon, Sport Utility Vehicle, Van, Taxi)
3. Trucks - (Pick-up Truck, Pickup Truck)
4. Large Trucks - (Truck, Tractor Trailer, Tow Truck, Tractor, Construction Equipment, Commercial Truck)
5. Bikes - (Motor Cycle (Street Bike), Dirt Bike, All terrain - 4 wheel bike, Mini-Bike)

## Solution: Create Additional Groups

One way to create groups is by creating a new variable

```{r}
balt.tow.small$Group <- ''

balt.tow.small$Group[balt.tow.small$vehicleType %in% 
                       c('Car','Convertible')] <- 'Cars'

balt.tow.small$Group[balt.tow.small$vehicleType %in% c('SUV',
 'Station Wagon','Sport Utility Vehicle','Van','Taxi')] <- 'Large Cars'

balt.tow.small$Group[balt.tow.small$vehicleType %in% 
                       c('Pick-up Truck','Pickup Truck')] <- 'Trucks'

balt.tow.small$Group[balt.tow.small$vehicleType %in% 
    c('Truck','Tractor Trailer','Tow Truck','Tractor',
      'Construction Equipment','Commercial Truck')] <- 'Large Trucks'

balt.tow.small$Group[balt.tow.small$vehicleType %in% 
    c('Motor Cycle (Street Bike)','Dirt Bike','Mini-Bike',
      'All terrain - 4 wheel bike')] <- 'Bikes'
```


## Solution: 
Next we wish to compute how many vehicles were towed for each vehicle type

```{r}
balt.tow.small %>% group_by(Group) %>% tally()
```

# Factors

## Factors...
Factors are a specific way to store categorical data. Using factors results in a more efficient data storage process, but can be cumbersome.

Factors can be necessary for making plots and fitting models in R.


## forcats Package
The `forcats` package, [website](https://forcats.tidyverse.org), is a tidyverse package designed for dealing with categorical factors.

## Character Values

```{r}
favorite_day <- c('Friday', 'Saturday', 'Sunday', 'Tuesday', 'Saturday', 'Saturday')
class(favorite_day)
```

## Creating Factors

```{r}
day_factor <- as.factor(favorite_day)
class(day_factor)
sort(day_factor)
```

## Reordering Factors

```{r}
library(forcats)
day_factor <- fct_relevel(day_factor,c('Sunday','Tuesday','Friday','Saturday'))
sort(day_factor)
```


## Creating Factors
Rather than coercing a class variable to be a factor, the factor can be created directly.
```{r}
day_factor2 <- factor(c('Friday', 'Saturday', 'Sunday', 'Monday'), 
levels = c('Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday'))
sort(day_factor2)
```

## Collapsing Factors
Factors can also easily be collapsed with `forcats`

```{r}
balt.tow.small %>% 
  mutate(Group2 = fct_collapse(vehicleType, 
                              Cars = c('Car','Convertible'),
                              Large_Cars = c('SUV','Station Wagon','Sport Utility Vehicle','Van','Taxi'),
 Trucks = c('Pick-up Truck','Pickup Truck'),
 Large_Trucks = c('Truck','Tractor Trailer','Tow Truck','Tractor','Construction Equipment','Commercial Truck'),
 Bikes = c('Motor Cycle (Street Bike)','Dirt Bike','Mini-Bike','All terrain - 4 wheel bike'))) %>% 
  mutate(Group2 = fct_infreq(Group2)) %>%
  group_by(Group2) %>% summarize(ave_cost = mean(cost), .groups = 'drop')
```