---
title: "Homework"
author: 'Assigned: October 6, 2018'
date: 'Due: October 19, 2018 at 11:59pm'
output:
  pdf_document:
    number_sections: no
  html_notebook:
    df_print: paged
    toc: yes
    toc_float: yes
subtitle: CME/STAT 195, Fall 2018
---

```{r setup, include=FALSE}
rm(list = ls())
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
```

# Instructions

This assignment is due on Friday October 19, 2018 at 11:59pm. You must submit 
your homework as a report that includes a write-up, all relevant code, and 
the output it generated. You should learn how to use R Markdown to
generate such a report and render it as an HTML, PDF, or Word document.
You can refer to chapter 3 of 
[R Markdown: The Definitive Guide](https://bookdown.org/yihui/rmarkdown/)
to review details on generating reports using R Markdown. If you generate
your report with R Markdown, submit both the ".Rmd" file 
AND a rendered document.

Remarks:

* This homework covers the material of lectures 1-6, so you are encouraged 
to start working on it early and continue gradually as we advance into the 
class. 

* If you generate random numbers, set a seed and report it, 
so that all your work can be perfectly reproduced. 

* Do not print an entire `data.frame` or a whole vector with more than 20
entries in your homework, since doing so produces an unnecessarily large 
file when you render your document.

* Clearly label your final answer to the questions asked instead of only
producing output of your code. Your answers should be as concise as possible 
and your code should be easily understandable. 

* You are welcome to work with other students, but you must write your own code
and prepare your own write-up separately. If you choose to collaborate,
indicate who you worked with on your submission. 

* You are free to use the web or any books to look up the documentation of 
relevant R functions and debug errors, but you cannot look for the solution
to the specific homework problems. 

* If you have any questions regarding the homework, please ask on [Canvas](https://canvas.stanford.edu/) or at office hours.


# Exercise 1: R Basics [20 pt]

**Material: Lecture 1.**

## a. Arithmetic operations [5pt] 

Compute the following using R:

* $4.84 \log_{10}(51!)$, where $x!$ is the factorial of $x$
* $4.02 \sqrt[3]{7^2 + e^7}$
* $20\cos( 2\pi + 0.25) + 32 \sin \left({3 \pi \over 4} \right)$
* $\left \lfloor {4.011 \pi \over 3}  {5 \choose 2} \right \rfloor$, where
$\lfloor x \rfloor$ denotes the largest integer not greater than $x$, and
${x \choose y}$ is the notation for a [combination](https://en.wikipedia.org/wiki/Combination)
* $8.1 \sum_{i = 1}^{100} {1 \over i}$


## b. Matrix operations [5pt] 

Generate a matrix $A$ with 15 rows and 5 columns whose entries 
are random uniform numbers on the interval $[0, 1]$.
Then generate a matrix $B$ with 5 rows and 7 columns whose entries are
drawn from a Gaussian distribution with mean 0 and variance 10.
Use the `set.seed()` function with a chosen seed (record the seed)
for reproducibility. Type `?set.seed` in the R console to learn
more about the function. With the two matrices, compute:

* $AB$ (a matrix product)
* the elementwise product of the 3rd row of $A$ and the 4th column of $B$, and
the sum of the entries in the resulting vector; then check that it agrees with
the corresponding entry of the matrix product you computed in the previous
part.
* the vector obtained by multiplying matrix $A$ by the 4th column of $B$.
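
As a rough illustration of the mechanics involved (not a solution), the sketch below uses smaller matrices with hypothetical names `M` and `N`, an arbitrary seed of 42, and a variance of 4 chosen only for this example. Note that `rnorm()` is parameterized by the standard deviation, so a variance has to be converted with `sqrt()`.

```r
# Illustrative only: dimensions, seed, and variance are assumptions for this sketch.
set.seed(42)                                   # arbitrary seed, fixed for reproducibility

M <- matrix(runif(3 * 2), nrow = 3, ncol = 2)  # uniform entries on [0, 1]
N <- matrix(rnorm(2 * 4, mean = 0, sd = sqrt(4)),  # variance 4 => sd = sqrt(4)
            nrow = 2, ncol = 4)

P <- M %*% N                                   # matrix product, 3 x 4

# Entry (2, 3) of the product is the sum of the elementwise product
# of row 2 of M and column 3 of N.
entry <- sum(M[2, ] * N[, 3])
all.equal(entry, P[2, 3])

# Matrix times a single column gives a 3 x 1 result.
v <- M %*% N[, 3]
```

The `*` operator multiplies elementwise, while `%*%` performs matrix multiplication; mixing them up is a common source of dimension errors.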


## c. Factors [5pt] 

In this part of the exercise we use a built-in dataset, `sleep`, 
known as Student's sleep data. It stores information on the 
increase in hours of sleep, the type of drug administered, and the patient ID,
in respective columns. You can learn more about this dataset from the help page,
accessed by calling `?sleep`. 

The column `ID` for patient identity is a factor with labels from 1 to 10.
Rename the ID labels to letters of the alphabet in the reverse order,
with label "A" assigned to patient 10, "B" to patient 9, "C" to patient 8, 
..., and "J" to patient 1.
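
One possible way to relabel a factor is to assign to `levels()`. The sketch below uses a small made-up factor `f` with three levels rather than the `sleep` data, so it only demonstrates the mechanism.

```r
# Toy factor (hypothetical data, not the sleep dataset)
f <- factor(c("1", "2", "3", "2"))
levels(f)                        # "1" "2" "3"

# Assigning to levels() relabels every level in place:
# "1" -> "C", "2" -> "B", "3" -> "A"
levels(f) <- rev(LETTERS[1:3])
as.character(f)
```

The replacement vector is matched to the existing levels by position, so it is worth checking `levels()` before relabeling.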


## d. Tibbles [5pt] 

Create a tibble, 'birthdays', which stores information on 
the birthdays of 5 people, either real or fictional. 
The data table should have columns:

1. 'first': first name
2. 'last': last name
3. 'birthday': the person's birthday in format YYYY-MM-DD ("%Y-%m-%d")
4. 'city': city where the person lives

Convert the birthdays to date objects using the `as.Date()` function.
Compute the difference (in days) between your birthday and the birthday 
of each of the people, and append that information as a new column, 
'bday_diff', of the data-frame.
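
For reference, a minimal sketch of date arithmetic in base R is shown below. The dates and the variable names `d` and `ref` are made up for illustration.

```r
# Hypothetical dates, parsed from strings in YYYY-MM-DD format
d   <- as.Date(c("2001-03-15", "1999-12-31"), format = "%Y-%m-%d")
ref <- as.Date("2000-01-01")

# Subtracting Date objects gives a difftime; as.numeric() extracts the days
diff_days <- as.numeric(d - ref)
diff_days
```

Subtraction is vectorized, so a single reference date is recycled against a whole column of dates.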


# Exercise 2: Programming [20pt]

**Material: Lecture 2.**

## a. Parametric function [5pt]

* Write a function in R that evaluates the following:

\begin{align*}
f(\theta) &= 7 - 0.5\sin(\theta) + 2.5\sin(3\theta) + 2\sin(5\theta) - 1.7\sin(7\theta) + \\
& \quad  +  3\cos(2\theta) - 2\cos(4\theta) - 0.4\cos(16\theta)
\end{align*}

* Generate a vector, `theta`, equal to a sequence from $0$ to $2\pi$ 
with increments of $0.01$
* Compute the vectors $x = f(\theta) \cdot \cos(\theta)$ and 
$y = f(\theta) \cdot \sin(\theta)$ for the $\theta$ you just created.
* Make a scatter plot of (x, y) from the two vectors you computed.
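
The same recipe can be sketched with a much simpler radius function. The function `r()` below (a cardioid) and the coarser step of 0.1 are assumptions made for this illustration, not the function from the exercise.

```r
# Hypothetical radius function: r(theta) = 1 + cos(theta), a cardioid
r <- function(theta) 1 + cos(theta)

theta <- seq(0, 2 * pi, by = 0.1)   # coarser grid than in the exercise

# Convert from polar (r, theta) to Cartesian (x, y) coordinates
x <- r(theta) * cos(theta)
y <- r(theta) * sin(theta)

plot(x, y, asp = 1)                 # asp = 1 keeps the shape undistorted
```

Because `sin()` and `cos()` are vectorized, the function can be applied to the whole `theta` vector at once without a loop.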


## b. Multiple arguments [10pt]

Write a function `time_diff()` that takes two dates as inputs and returns
the difference between them in units of "hours", "days", "weeks", "months",
or "years", as specified by an optional argument 'units', set by default
to "days". Use the function to compute the time left until your next birthday
separately in units of months, days, and hours.
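
As a starting point, base R's `difftime()` already handles some of these units. The sketch below uses two made-up dates, `start` and `end`; note that `difftime()` accepts "secs", "mins", "hours", "days", and "weeks", but not "months" or "years", so those would need a conversion of your own choosing.

```r
# Hypothetical dates for illustration
start <- as.Date("2018-10-06")
end   <- as.Date("2018-10-19")

in_days  <- difftime(end, start, units = "days")
in_hours <- difftime(end, start, units = "hours")

as.numeric(in_days)    # 13
as.numeric(in_hours)   # 312
```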
 

## c. Control flow: Fibonacci numbers [5pt]

The Fibonacci sequence starts with the numbers 1 and 2, and each subsequent term 
is generated by adding the previous two terms. The first 10 terms of 
the Fibonacci sequence are thus: 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, ... .
Find the sum of the **even-valued terms** of the Fibonacci sequence 
that do not exceed one million.
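
The control-flow pattern (generate terms until a bound is exceeded, and accumulate only those that pass a test) can be sketched on a different sequence. The example below sums the odd triangular numbers not exceeding 100; the sequence, the bound, and the variable names are chosen only for illustration.

```r
# Triangular numbers: 1, 3, 6, 10, 15, ... (term n is 1 + 2 + ... + n)
total <- 0
n     <- 1
term  <- 1

while (term <= 100) {
  if (term %% 2 == 1) {       # keep only odd-valued terms
    total <- total + term
  }
  n    <- n + 1
  term <- term + n            # next triangular number
}

total   # 1 + 3 + 15 + 21 + 45 + 55 + 91 = 231
```

A `while` loop suits this kind of problem because the number of terms below the bound is not known in advance.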

# Exercise 3: Data Import/Export and transformation [20pt]

**Material: Lectures 3 and 5.**

## a. Import data [5pt]

Visit the following URL:
https://raw.githubusercontent.com/cme195/cme195.github.io/master/assets/data/share-of-people-who-say-they-are-happy.txt

Observe the structure and format of the data. Then, use a function from
the `readr` package to read the data at the URL into R.
Finally, find the country with the highest share of happy people in 2014.

## b. Export data [5pt]

Filter observations from the data set on happiness that correspond to years 
after 2000. Export the subset of the data as a tab-delimited text file 
to a chosen location on your computer.


## c. `dplyr` functions [5pt]

In this exercise we use the `nycflights13` package, 
which stores datasets on flights and airports in the city of New York in 2013.
Install the package with `install.packages("nycflights13")` if you
have not done so already, then load it with `library(nycflights13)`. 

The dataset 'flights' is a tibble with 336,776 observations! To learn more 
about this dataset, type `?flights` in your R console. 

Use `dplyr` functions (and the `%>%` operator) to compute, for each 
combination of departure airport, 'origin', destination airport, 'dest',
and 'carrier': the average 'dep_delay', the average 'air_time',
and the average ratio 'dep_delay'/'air_time'.

For each 'carrier', report the route ('origin'-'dest') with
the highest mean ratio of departure delay over air time. Now you know
which flights not to take with a given carrier.


**Note**: Since the dataset contains missing values, remember to exclude 
them when computing the averages (use 'na.rm = TRUE' in the
`mean()` function).
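
The grouping pattern can be sketched on a toy table. The tibble `toy` and its columns (`shop`, `item`, `cost`, `qty`) are hypothetical and unrelated to `flights`; the sketch assumes `dplyr` version 1.0 or later for `slice_max()` and the `.groups` argument.

```r
library(dplyr)

# Hypothetical data with one missing value
toy <- tibble(
  shop = c("a", "a", "b", "b", "b"),
  item = c("x", "x", "x", "y", "y"),
  cost = c(2, NA, 4, 6, 10),
  qty  = c(1, 2, 2, 3, 5)
)

# Per-group averages, ignoring missing values
by_group <- toy %>%
  group_by(shop, item) %>%
  summarise(
    mean_cost  = mean(cost, na.rm = TRUE),
    mean_ratio = mean(cost / qty, na.rm = TRUE),
    .groups = "drop"
  )

# Within each shop, keep the item with the highest mean cost
best <- by_group %>%
  group_by(shop) %>%
  slice_max(mean_cost, n = 1) %>%
  ungroup()
```

Without `na.rm = TRUE`, any group containing a missing value would get `NA` as its mean.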


## d. Joining/merging datasets [5pt]

The `nycflights13` package also includes datasets other than `flights`.
In this exercise you will combine data tables together.

* First, drop the columns of `flights` that contain delay information 
(ending with '_delay') and scheduling information (starting with 'sched_'), 
and save the result as a new tibble `flights2`.

* Add a column with the full name of carriers operating the flights to 
`flights2` by merging it with the `airlines` tibble. Which column(s) is (are) 
used for joining?

* Another dataset available in the package is `weather`, storing data on 
weather at different airports on specific days and times. Use this dataset to 
merge the weather information into the data from the previous step. 
Which column(s) is (are) used for joining?

* There is also a data-frame `planes` included in the package.
`planes` shares the columns 'year' and 'tailnum' with the `flights2`
data-frame, but the column 'year' in `planes` means a different thing 
(year produced) than in `flights2` (year of flight). Use only the column 'tailnum'
to merge `flights2` and `planes`. Note that the 'suffix' argument can be
used to set different names to distinguish between the year of production and 
the year of the flight.

* Now, use the data-frame `airports` to merge `flights2`
with the information **on the origin airport**. Note that the column 
'faa' is the airport identifier in the dataset `airports`.
You must use the `by = ` argument in the join function
and specify which columns from `flights2` you are matching
to which column in  `airports`.
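
The main tools mentioned in the steps above can be sketched on small made-up tables. All tibbles and column names below (`trips`, `vehicles`, `places`, and so on) are hypothetical; only the `dplyr` functions and their `by` and `suffix` arguments are real.

```r
library(dplyr)

# Dropping columns by name pattern with select helpers
wide <- tibble(a_tmp = 1, tmp_b = 2, keep = 3)
narrow <- wide %>% select(-ends_with("_tmp"), -starts_with("tmp_"))

trips    <- tibble(id = 1:3, year = 2013L,
                   from = c("AAA", "BBB", "AAA"),
                   vehicle = c("v1", "v2", "v1"))
vehicles <- tibble(vehicle = c("v1", "v2"), year = c(1999L, 2005L))
places   <- tibble(code = c("AAA", "BBB"), name = c("Alpha", "Beta"))

# Join on one shared column only; 'suffix' renames the clashing 'year' columns
j1 <- left_join(trips, vehicles, by = "vehicle",
                suffix = c("_trip", "_built"))

# Join when the key has different names in the two tables
j2 <- left_join(trips, places, by = c("from" = "code"))
```

When `by` is omitted, `dplyr` joins on all columns with matching names and prints a message listing them, which is one way to see which columns were used.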


# Exercise 4: Data Visualization [20pt]

**Material: Lectures 3-4.**

The following URL contains data on fossil fuel emissions for different 
countries between 1751 and 2014: "http://cdiac.ess-dive.lbl.gov/ftp/ndp030/CSV-FILES/nation.1751_2014.csv"

## a. Import data with `readr` [5pt]

This dataset is messy, and you will need to fix the warning messages
`read_csv()` returns.

* Rows 1-3 contain information on the dataset itself, and not the variables,
so after reading the data in, we need to delete these rows. 
* The dataset contains the character "." which needs to be replaced
with `NA` for missing data.
* Rename column `Total CO2 emissions from fossil-fuels and cement production (thousand metric tons of C)` to something shorter, e.g. 'Total_CO2'
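
These cleaning steps can be sketched on a tiny inline CSV. The text in `raw`, its two metadata rows, and the column names are invented for this example and do not reflect the real file's layout.

```r
library(readr)

# Hypothetical CSV text: a header, two metadata rows, then data with "." for missing
raw <- "Nation,Year,Some long column name\nmeta line 1,,\nmeta line 2,,\nA,2000,10\nB,2000,.\n"

# Treat "." as missing; read everything as character to avoid type-guessing warnings
toy <- read_csv(raw, na = c("", "NA", "."),
                col_types = cols(.default = col_character()))

toy <- toy[-(1:2), ]                                   # drop the metadata rows
names(toy)[names(toy) == "Some long column name"] <- "value"   # shorter name
toy$Year  <- as.integer(toy$Year)                      # convert types after cleaning
toy$value <- as.numeric(toy$value)
```

Reading all columns as character first and converting afterwards is one option; another is to remove the offending rows and let `readr::type_convert()` guess the types.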

## b. Summarize data [5pt]

Use `dplyr` functions to compute the total yearly $CO_2$ emissions (column `Total.CO2.emissions.from.fossil.fuels.and.cement.production..thousand.metric.tons.of.C.`, 
or the shorter name you gave it in part a) summed over all countries 
(the world total $CO_2$ emission). 
Use the resulting dataset to plot the world's yearly $CO_2$ emission in Gt.


## c. Line plots [5pt]

Find the 10 countries with the highest emissions from the year 2000 onward 
(including 2000).

Plot the yearly total $CO_2$ emissions of these top 10 countries
with a different color for each country. Use units of billion tonnes (Gt), 
i.e. divide the total emissions by $10^6$.


## d. Stacked plots [5pt]

Use `geom_area()` to generate a plot similar to the one you generated
above, but with emission levels stacked on top
of each other (summing to the total for the ten countries)
and with areas colored by country.
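
A minimal sketch of a stacked area plot on made-up data is shown below. The data frame `toy` and its columns are hypothetical; `geom_area()` stacks groups by default when a `fill` aesthetic is mapped.

```r
library(ggplot2)

# Hypothetical data: two groups observed over four years
toy <- data.frame(
  year  = rep(2000:2003, times = 2),
  group = rep(c("A", "B"), each = 4),
  value = c(1, 2, 3, 4, 2, 2, 3, 3)
)

# Areas are stacked on top of each other and colored by group
p <- ggplot(toy, aes(x = year, y = value, fill = group)) +
  geom_area()
p
```

Stacking works cleanly when every group has a value for every year; gaps in the data can produce odd-looking areas.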


# Exercise 5: Linear Models [20pt]

**Material: Lectures 4 and 6.**

In this exercise we will use a dataset containing information on sales of
a product and the amount spent on advertising using different media channels.
The data are available from: http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv.

## a. Import and plot the data [5pt]

Read the data from "http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv".

Then, generate a scatterplot of sales against the amount of TV advertising. 
Color the points by the amount of 'radio' advertising. Then, add a linear 
fit line.
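
For orientation, the same kind of plot can be sketched with the built-in `mtcars` data, which stands in for the advertising data here. The choice of `wt`, `mpg`, and `hp` is arbitrary.

```r
library(ggplot2)

# Scatterplot colored by a third variable, with a linear fit overlaid
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = hp)) +
  geom_smooth(method = "lm", formula = y ~ x)
p
```

Mapping `color` inside `geom_point()` rather than in the top-level `aes()` keeps the fit line a single line instead of being affected by the color mapping.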


## b. Simple Linear Regression [5pt]

The dataset has 200 rows. Divide it into a train set with 150 observations
and a test set with 50 observations, i.e. use `sample()` without
replacement to randomly choose row indices of the advertising dataset 
to include in the train set. The remaining indices should be used for 
the test set. 

Fit a linear model to the training set, where the sales values are
predicted by the amount of TV advertising. Print the summary of the fitted model.
Then, predict the sales values for the test set and evaluate the test model 
accuracy in terms of root mean squared error (RMSE), which measures 
the average level of error between the prediction and the true response.
$$RMSE = \sqrt{\frac{1}{n} \sum\limits_{i = 1}^n(\hat y_i - y_i)^2}$$
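
The overall workflow (split, fit, predict, evaluate) can be sketched on the built-in `cars` dataset, which has 50 rows. The seed, the 40/10 split, and the variable names are assumptions for this illustration only.

```r
set.seed(1)                               # arbitrary seed for reproducibility

n         <- nrow(cars)                   # 50 observations
train_idx <- sample(n, size = 40)         # sample() draws without replacement by default
train     <- cars[train_idx, ]
test      <- cars[-train_idx, ]           # the remaining 10 rows

fit <- lm(dist ~ speed, data = train)     # simple linear regression
summary(fit)

pred <- predict(fit, newdata = test)      # predictions on held-out data
rmse <- sqrt(mean((pred - test$dist)^2))  # root mean squared error
rmse
```

Negative indexing with `-train_idx` selects exactly the rows that were not sampled, so the train and test sets do not overlap.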


## c. Multiple linear regression [5pt]

Fit a multiple linear regression model including all the variables 'TV',
'radio', 'newspaper' to model the 'sales' in the training set. Then, compute 
the predicted sales for the test set with the new model and evaluate the RMSE.  
Did the error decrease from the one corresponding to the previous model?


## d. Evaluate the model [5pt]

Look at the summary output for the multiple regression model and note which 
of the coefficients in the model are significant. Are all of them significant?
If not, refit the model including only the features found significant.
Which of the models should you choose?