---
title: "Lecture 7: Exercises"
date: October 18th, 2018
output: 
  html_notebook:
    toc: true
    toc_float: true
---

```{r}
library(tidyverse)
```


# Exercise 1: Hypothesis testing

Similarly to dataset `mtcars`, the dataset `mpg` from `ggplot` package 
includes data on automobiles. However, `mpg` includes data for newer
cars from year 1999 and 2008. The variables measured for each car is slighly 
different. Here we are interested in the variable, `hwy`, the highway miles per 
gallon.


```{r}
# We create a new culumn with manual/automatic data only
mpg <- mpg %>% 
  mutate(
    transmission = factor(
        gsub("\\((.*)", "", trans), levels = c("auto", "manual"))
  )
mpg
```

## Part 1: One-sample test

a. Subset the `mpg` dataset to inlude only cars from  year 2008.


b. Test whether cars from 2008 have mean the highway miles per gallon, `hwy`, 
equal to 30 mpg.


c. Test whether cars from 2008 with 4 cylinders have mean `hwy` equal to 30 mpg.


## Part 2: Two-sample test

a. Test if the mean `hwy` for automatic is **less than** that for manual cars
**in 2008**. Generate a boxplot with jittered points for `hwy` for each 
transmission group.


b. Test if the mean `hwy` for cars from 1999 and is greater than that for
cars from 2008. Generate a boxplot with jittered points
for `hwy` for each year group.


# Exercise 2: Logistic Regression

In this you will use a dataset `Default`, on customer default records for 
a credit card company, which is included in [ISL book](www.statlearning.com). 
To obtain the data you will need to install a package `ISLR`.

```{r}
# install.packages("ISLR")
library(ISLR)
(Default <- tbl_df(Default))  # convert data.frame to tibble
```

a. First, divide your dataset into a train and test set. Randomly sample
6000 observations and include them in the train set, and the remaining use
as a test set.


b. Fit a logistic regression including all the features to predict
whether a customer defaulted or not.


c. Note if any variables seem not significant. Then, adjust your model
accordingly (by removing them).


d. Compute the predicted probabilities of 'default' for the observations
in the test set. Then evaluate the model accuracy.

d. For the test set, generate a scatterplot of 'balance' vs 'default' 
with points colored by 'student' factor. Then, overlay a line plot 
of the predicted probability of default as computed in the previous question.
You should plot two lines for student and non student separately by setting 
the 'color = student'.


# Exercise 3: Random Forest

In this exercise we will build a random forest model based
on the data used to create the visualization [here](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/).

```{r, message=FALSE}
# Skip first 2 lines since they were comments
url <- paste0("https://raw.githubusercontent.com/jadeyee/r2d3-part-1-data/",
              "master/part_1_data.csv")
houses <- read_csv(url, skip = 2)
houses <- tbl_df(houses)
houses <- houses %>%
    mutate(city = factor(in_sf, levels = c(1, 0), labels = c("SF", "NYC")))
houses 
```

a. Using `pairs()` function plot the relationship between every variable pairs.
You can color the points by the city the observation corresponds to; set the color argument 
in `pairs()` as follows: `col = houses$in_sf + 3L` 


b. Split the data into (70%-30%) train and test set.
How many observations are in your train and test sets?


c. Train a random forest on the train set, using all the variables in the model,
to classify houses into the ones from San Francisco and from New York.
Remember to remove 'in_sf', as it is the same variable as 'city'. 


d. Compute predictions and print out 
[the confusion (error) matrix](https://en.wikipedia.org/wiki/Confusion_matrix)
for the test set to asses the model accuracy. Also, compute the model 
accuracy.


e. Which features were the most predictive for classifying houses into SF vs NYC groups?
Use importance measures to answer the question.