---
title: "Lecture 3: Exercises"
date: October 4th, 2018
output: 
  html_notebook:
    toc: true
    toc_float: true
    df_print: paged
---

```{r global_options, echo = FALSE, include = FALSE}
options(width = 80)
knitr::opts_chunk$set(
  warning = FALSE, message = FALSE,
  cache = FALSE, tidy = FALSE, size = "small")
```

```{r, warning=FALSE, message=FALSE}
library(tidyverse)
```


# Exercise 1: Parsing data
 
## Part 1: Parsing dates

Generate the correct format string to parse each of the following dates and times:
```{r}
d1 <- "January 1, 2010"
d2 <- "2015-Mar-07"
d3 <- "06-Jun-2017"
d4 <- c("August 19 (2015) - 3:04PM", "July 1 (2015) - 4:04PM")
d5 <- "12/30/14" # Dec 30, 2014
t1 <- "1705"     # 5:05PM
```


## Part 2: Read a CSV file
 
Download this NCHS dataset on leading Causes of death in the United States, 
from 1999 to 2015: https://data.cdc.gov/api/views/bi63-dtpu/rows.csv.

Then, import it into R. Are some of the colums the wrong type?
If not is there any column that could be a factor instead of character
type?

# Execise 2: Data manipulation

## Part 1: `dplyr` verbs

Load in the dataset `movies.csv` used in the lecture:

```{r}
url <- "https://raw.githubusercontent.com/Juanets/movie-stats/master/movies.csv"
movies <- read_csv(url)
```


a. Find a subset of the movies produced after 2010. 
Save the subset in 'movies.sub' variable.

b. Keep columns 'name', 'director', 'year', 'country', 'genre', 'budget', 'gross', 'score'
in the 'movies.sub'.

c. Find the profit for each movie in 'movies.sub' as a fraction of its budget.
Convert 'budget' and 'gross' columns million dollar units founded to the first decimal point.
Use `round()` to round numbers.

d. Count the number of movies in 'movies.sub' produced by each genre,
and order them in the descending count order.

e. Now group movies in 'movies.sub' by countries and genre.
Then, count the number of movies in each group and the corresponding 
median fractional profit, the mean and standard deviation of 
the movie score for each group.


## Part 2: Chaining

Using chaining and pipes, for each genre find the three directors the
top mean movie scores received for the movies produced after 2000,
after filtering out the directors with fewer than 5 movies in total.
Hint: Use `top_n()` function to select top n from each group.

Find the movies in your favourite genre by the 5 directors
you just found. These could serve as suggestions for your next movie night!