#' CSSS 508, Week 7
#' ===
#' author: Rebecca Ferrell
#' date: May 11, 2016
#' transition: rotate
#' width: 1100
#' height: 750
#' 
#' 
#' Vectorization
#' ===
#' type: section
#' 
#' 
#' Non-vectorized example
#' ===
#' incremental: true
#' 
#' We have a vector of numbers, and we want to add 1 to each entry.
## ------------------------------------------------------------------------
my_vector <- rnorm(1000000)

#' 
#' A `for` loop works but is relatively slow:
## ------------------------------------------------------------------------
for_start <- proc.time() # start the clock
new_vector <- rep(NA, length(my_vector))
for(position in 1:length(my_vector)) {
    new_vector[position] <- my_vector[position] + 1
}
(for_time <- proc.time() - for_start) # time elapsed

#' 
#' 
#' Vectorization wins
#' ===
#' incremental: true
#' 
#' Recognize that we can instead use R's vector addition (with recycling):
## ------------------------------------------------------------------------
vec_start <- proc.time()
new_vector <- my_vector + 1
(vec_time <- proc.time() - vec_start)
for_time / vec_time

#' 
#' Vector/matrix arithmetic is implemented using fast, optimized functions that a `for` loop can't compete with.
#' 
#' 
#' Vectorization examples
#' ===
#' incremental: true
#' 
#' * `rowSums`, `colSums`, `rowMeans`, `colMeans` give sums or averages over rows or columns of matrices/data frames
#' 
## ------------------------------------------------------------------------
(a_matrix <- matrix(1:12, nrow = 3, ncol = 4))
rowSums(a_matrix)

#' 
#' 
#' Back to Goofus example from last week
#' ===
#' incremental: true
#' 
#' Remember when Goofus tried find the mean for each variable in the `swiss` data? The most Gallant solution is to just use `colMeans` without even thinking about pre-allocation or `for` loops:
## ------------------------------------------------------------------------
colMeans(swiss)

#' 
#' 
#' More examples of vectorized functions
#' ===
#' incremental: true
#' 
#' * `cumsum`, `cumprod`, `cummin`, `cummax` give back a vector with cumulative quantities (e.g. running totals)
## ------------------------------------------------------------------------
cumsum(1:7)

#' 
#' * `pmax` and `pmin` take a matrix or set of vectors, output the min or max for each **p**osition (after recycling):
## ------------------------------------------------------------------------
pmax(c(0, 2, 4), c(1, 1, 1), c(2, 2, 2))

#' 
#' 
#' Lab exercise: running totals
#' ===
#' 
#' 1. Use `rnorm` to randomly generate a vector of length 10 million
#' 
#' 2. Using a `for` loop and timing your code, calculate another vector of length 10 million that gives the running total
#' 
#' 3. Repeat using `cumsum` instead of a loop
#' 
#' Did you get the same result both ways? What's the difference in speed? Which is easier to read?
#' 
#' 
#' Writing your own functions
#' ===
#' type: section
#' 
#' 
#' Examples of existing functions
#' ===
#' incremental: true
#' 
#' * `mean`:
#'     + Input: a vector
#'     + Output: a single number
#' * `dplyr::filter`:
#'     + Input: a data frame, logical conditions
#'     + Output: a data frame with rows removed using those conditions
#' * `readr::read_csv`:
#'     + Input: file path, optionally variable names or types
#'     + Output: data frame containing info read in from file
#' 
#' 
#' Why write your own functions?
#' ===
#' incremental: true
#' 
#' Functions can encapsulate repeated actions such as:
#' 
#' * Given a vector, compute some special summary stats
#' * Given a vector and definition of "gross" values, replace with `NA`
#' * Templates for favorite `ggplot`s used in reports
#' 
#' Advanced uses for functions (not covered in this class):
#' 
#' * Parallel processing
#' * Making custom packages containing your functions
#' 
#' 
#' Simple homebrewed function
#' ===
#' incremental: true
#' 
#' Let's look at a function that takes a vector as input and outputs a named vector of the first and last elements:
#' 
## ------------------------------------------------------------------------
first_and_last <- function(x) {
    first <- x[1]
    last <- x[length(x)]
    return(c("first" = first, "last" = last))
}

#' 
#' Test it out:
#' 
## ------------------------------------------------------------------------
first_and_last(c(4, 3, 1, 8))

#' 
#' 
#' More testing of simple function
#' ===
#' incremental: true
#' 
#' What if I give `first_and_last` a vector of length 1?
#' 
## ------------------------------------------------------------------------
first_and_last(7)

#' 
#' Of length 0?
#' 
## ------------------------------------------------------------------------
first_and_last(numeric(0))

#' 
#' Maybe we want it to be a little smarter.
#' 
#' 
#' Checking inputs
#' ===
#' incremental: true
#' 
#' Let's make sure we get an error message is the vector is too small:
#' 
## ------------------------------------------------------------------------
smarter_first_and_last <- function(x) {
    if(length(x) == 0L) { # specify integers with L
        stop("The input has no length!")
    } else {
        first <- x[1]
        last <- x[length(x)]
        return(c("first" = first, "last" = last))        
    }
}

#' 
#' 
#' Testing the smarter function
#' ===
#' incremental: true
#' 
## ---- error=TRUE---------------------------------------------------------
smarter_first_and_last(numeric(0))
smarter_first_and_last(c(4, 3, 1, 8))

#' 
#' 
#' Cracking open functions
#' ===
#' incremental: true
#' 
#' If you type the function name without any parentheses or arguments, you can see its guts:
#' 
## ------------------------------------------------------------------------
smarter_first_and_last

#' 
#' 
#' Anatomy of a function
#' ===
#' incremental: true
#' 
#' * Name: what you assign the function to so you can use it later
#'     + Can have "anonymous" (no-name) functions
#' * Arguments (aka inputs, parameters): things the user passes to the function that affects how it works
#'     + e.g. `x` or `na.rm` in `my_new_func <- function(x, na.rm = FALSE) {...}`
#'     + `na.rm = FALSE` is example of setting a default value: if user doesn't say what `na.rm` is, it'll be `FALSE`
#'     + `x`, `na.rm` values won't exist in R outside of the function
#' * Body: the guts!
#' * Return value: the output thing inside `return()`. Could be a vector, list, data frame, another function, or even nothing
#'     + If unspecified, will be the last thing calculated (maybe not what you want?)
#'     
#' 
#' Example: reporting quantiles
#' ===
#' incremental: true
#' 
#' Maybe you want to know more detailed quantile information than `summary` gives you with interpretable names. Here's a starting point:
#' 
## ------------------------------------------------------------------------
quantile_report <- function(x, na.rm = FALSE) {
    quants <- quantile(x,
                       probs = c(0.01, 0.05, 0.10, 0.25, 0.5,
                                 0.75, 0.90, 0.95, 0.99),
                       na.rm = na.rm)
    names(quants) <- c("Bottom 1%", "Bottom 5%", "Bottom 10%",
                       "Bottom 25%", "Median", "Top 25%", "Top 10%",
                       "Top 5%", "Top 1%")
    return(quants)
}
quantile_report(rnorm(10000))

#' 
#' 
#' lapply: list + applying functions
#' ===
#' 
#' `lapply` **apply**s a function over a **l**ist of any kind (e.g. a data frame), and returns a list. This is a lot easier than preparing a `for` loop!
#' 
## ------------------------------------------------------------------------
lapply(swiss, FUN = quantile_report)

#' 
#' 
#' More usable lapply output with sapply
#' ===
#' 
#' A downside to `lapply` is that lists are not always natural to work with. `sapply` will **s**implify the list output by converting each list element to a column in a matrix:
#' 
## ------------------------------------------------------------------------
sapply(swiss, FUN = quantile_report)

#' 
#' 
#' apply
#' ===
#' incremental: true
#' 
#' There's also a function just called `apply` that works over matrices or data frames. You tell it whether to apply the function to each row (`MARGIN = 1`) or column (`MARGIN = 2`).
#' 
## ------------------------------------------------------------------------
apply(swiss, MARGIN = 2, FUN = quantile_report)

#' 
#' 
#' Example: discretizing continuous data
#' ===
#' type: incremental
#' 
#' Maybe you often want to bucket variables in your data into groups based on quantiles:
#' 
#' | Person | Income | Income Bucket |
#' |:------:|-------:|--------------:|
#' |    1   |   8000 |             1 |
#' |    2   | 103000 |             3 |
#' |    3   |  12000 |             1 |
#' |    4   |  52000 |             2 |
#' |    5   | 150000 |             3 |
#' |    6   |  45000 |             2 |
#' 
#' 
#' Bucketing function
#' ===
#' incremental: true
#' 
#' There's already a function in R called `cut` that does this, but you need to tell it cutpoints or the number of buckets. Let's make a convenience function that calls `cut` using quantiles for splitting and returns an integer:
#' 
## ------------------------------------------------------------------------
bucket <- function(x, quants = c(0.333, 0.667)) {
    # set low extreme, quantile points, high extreme
    new_breaks <- c(min(x)-1,
                    quantile(x, probs = quants),
                    max(x)+1)
    # labels = FALSE will return integer codes instead of ranges
    return(cut(x, breaks = new_breaks, labels = FALSE))
}

#' 
#' Trying out the bucket
#' ===
#' 
## ---- fig.width = 10, fig.height = 3, dpi=300, out.width="1100px", out.height="330px"----
rando_data <- rnorm(100)
bucketed_rando_data <- bucket(rando_data,
                              quants = c(0.05, 0.25, 0.5, 0.75, 0.95))
plot(x = bucketed_rando_data, y = rando_data,
     main = "Buckets and values")

#' 
#' 
#' Example: removing bad data values
#' ===
#' type: incremental
#' 
#' Let's say we have data where impossible values occur:
#' 
## ------------------------------------------------------------------------
(school_data <- data.frame(school = letters[1:10],
                           pct_passing_exam = c(0.78, 0.55, 0.91, -1, 0.88, 0.81, 0.90, 0.76, 999, 999),
                           pct_free_lunch = c(0.33, 999, 0.25, 0.05, 0.12, 0.09, 0.22, -13, 0.21, 999)))

#' 
#' 
#' Function to remove extreme values
#' ===
#' incremental: true
#' 
#' * Input: a vector `x`, cutoff for `low`, cutoff for `high`
#' * Output: a vector with `NA` in the extreme places
#' 
## ------------------------------------------------------------------------
remove_extremes <- function(x, low, high) {
    x_no_low <- ifelse(x < low, NA, x)
    x_no_low_no_high <- ifelse(x_no_low > high, NA, x)
    return(x_no_low_no_high)
}
remove_extremes(school_data$pct_passing_exam, low = 0, high = 1)

#' 
#' 
#' dplyr::mutate_each, summarize_each
#' ===
#' 
#' `dplyr` functions `summarize_each` and `mutate_each` take an argument `funs()`. This will apply our function to every variable (besides `school`) to update the columns in `school_data`:
#' 
## ------------------------------------------------------------------------
library(dplyr)
school_data %>%
    mutate_each(funs(remove_extremes(x = ., low = 0, high = 1)),
                -school)

#' 
#' 
#' mutate_each in data downloading demo
#' ===
#' 
#' In the data downloading demo, I had this block of code to fix ID values that were mangled like `"12345678.000000"`:
#' 
## ---- eval=FALSE---------------------------------------------------------
## CA_OSHPD_util <- CA_OSHPD_util %>%
##     # translating the regular expression pattern:
##     # \\. matches the location of the period.
##     # 0+ matches at least one zero and possibly more following that period.
##     # replacement for period + 0s is nothing (empty string)
##     mutate_each(funs(gsub(pattern = "\\.0+",
##                           x = .,
##                           replacement = "")),
##                 # variables to fix
##                 FAC_NO, HSA, HFPA)

#' 
#' 
#' Standard and non-standard evaluation
#' ===
#' 
#' `dplyr` uses what is called **non-standard evaluation** that lets you refer to "naked" variables (no quotes around them) like `FAC_NO, HSA, HFPA`. There are **standard evaluation** versions of `dplyr` functions that use the quoted versions instead, which can sometimes be more convenient. These end in an underscore (`_`).
#' 
#' Example converting character data to dates from the data downloading demo:
## ---- eval=FALSE---------------------------------------------------------
## yearly_frames[[i]] <- yearly_frames[[i]] %>%
##     # cnames is a character vector of var names
##     # 4th and 5th variables are strings to become dates
##     mutate_each_(funs(mdy), cnames[4:5])

#' 
#' Anonymous functions in dplyr
#' ===
#' 
#' You can skip naming your function in `dplyr` if you won't use it again. Code below will return the mean divided by the standard deviation for each variable in `swiss`:
#' 
## ------------------------------------------------------------------------
swiss %>%
    summarize_each(funs( mean(., na.rm = TRUE) / sd(., na.rm = TRUE) ))

#' 
#' 
#' Anonymous functions in lapply
#' ===
#' 
#' Like with `dplyr`, you can use anonymous functions in `lapply`, but a difference is you'll need to have the `function` part at the beginning:
#' 
## ------------------------------------------------------------------------
lapply(swiss,
       function(x) mean(x, na.rm = TRUE) / sd(x, na.rm = TRUE))

#' 
#' 
#' Example: ggplot templates
#' ===
#' 
#' Let's say you have a particular way you like your charts:
#' 
## ---- warning=FALSE, message=FALSE, eval=FALSE---------------------------
## library(gapminder); library(ggplot2)
## ggplot(gapminder %>% filter(country == "Afghanistan"),
##        aes(x = year, y = pop / 1000000)) +
##     geom_line(color = "firebrick") +
##     xlab(NULL) + ylab("Population (millions)") +
##     ggtitle("Population of Afghanistan since 1952") +
##     theme_minimal() + theme(text = element_text(family = "Times"), plot.title = element_text(hjust = 0, size = 20))

#' 
#' * How could we make this flexible for any country?
#' * How could we make this flexible for any `gapminder` variable?
#' 
#' Example of desired chart
#' ===
#' 
## ---- warning=FALSE, message=FALSE, echo=FALSE, fig.width = 10, fig.height = 6, dpi=300, out.width="1100px", out.height="660px"----
library(gapminder); library(ggplot2)
ggplot(gapminder %>%
           filter(country == "Afghanistan"),
       aes(x = year, y = pop / 1000000)) +
    geom_line(color = "firebrick") +
    xlab(NULL) +
    ylab("Population (millions)") +
    ggtitle("Population of Afghanistan since 1952") +
    theme_minimal() +
    theme(text = element_text(family = "Times"),
          plot.title = element_text(hjust = 0, size = 20))

#' 
#' Another example of desired chart
#' ===
#' 
## ---- warning=FALSE, message=FALSE, echo=FALSE, fig.width = 10, fig.height = 6, dpi=300, out.width="1100px", out.height="660px"----
ggplot(gapminder %>%
           filter(country == "Peru"),
       aes(x = year, y = lifeExp)) +
    geom_line(color = "firebrick") +
    xlab(NULL) +
    ylab("Life expectancy") +
    ggtitle("Life expectancy in Peru since 1952") +
    theme_minimal() +
    theme(text = element_text(family = "Times"),
          plot.title = element_text(hjust = 0, size = 20))

#' 
#' 
#' Making country flexible
#' ===
#' 
#' We can have the user input a character string for `cntry` as an argument to the function to get subsetting and the title right:
## ------------------------------------------------------------------------
gapminder_lifeplot <- function(cntry) {
    ggplot(gapminder %>%
               filter(country == cntry),
           aes(x = year, y = lifeExp)) +
        geom_line(color = "firebrick") +
        xlab(NULL) +
        ylab("Life expectancy") +
        ggtitle(paste0("Life expectancy in ", cntry, " since 1952")) +
        theme_minimal() +
        theme(text = element_text(family = "Times"),
              plot.title = element_text(hjust = 0, size = 20))
}

#'  
#' 
#' Testing out life expectancy plot function
#' ===
#' 
## ---- warning=FALSE, message=FALSE, fig.width = 10, fig.height = 4, dpi=300, out.width="1100px", out.height="440px"----
gapminder_lifeplot(cntry = "Turkey")

#' 
#' Making y value flexible
#' ===
#' 
#' Now let's allow the user to say which variable they want plotted on the y-axis. First, let's think about how we can get the right labels for the axis and title. A named character vector to serve as a "lookup table" inside the function would work:
#' 
## ------------------------------------------------------------------------
y_axis_label <- c("lifeExp" = "Life expectancy",
                  "pop" = "Population (millions)",
                  "gdpPercap" = "GDP per capita, USD")
title_text <- c("lifeExp" = "Life expectancy in ",
                "pop" = "Population of ",
                "gdpPercap" = "GDP per capita in ")
# example use:
y_axis_label["pop"]

#' 
#' aes_string
#' ===
#' 
#' `ggplot` is usually looking for "naked" variables, but we can tell it to take them as quoted strings using `aes_string` instead of `aes`, which is handy when making functions:
## ------------------------------------------------------------------------
gapminder_plot <- function(cntry, yvar) {
    y_axis_label <- c("lifeExp" = "Life expectancy",
                      "pop" = "Population (millions)",
                      "gdpPercap" = "GDP per capita, USD")[yvar]
    title_text <- c("lifeExp" = "Life expectancy in ",
                    "pop" = "Population of ",
                    "gdpPercap" = "GDP per capita in ")[yvar]
    ggplot(gapminder %>%
               filter(country == cntry) %>%
               mutate(pop = pop / 1000000),
           aes_string(x = "year", y = yvar)) +
        geom_line(color = "firebrick") +
        xlab(NULL) +
        ylab(y_axis_label) +
        ggtitle(paste0(title_text, cntry, " since 1952")) +
        theme_minimal() +
        theme(text = element_text(family = "Times"),
              plot.title = element_text(hjust = 0, size = 20))
}

#' 
#' 
#' Testing out the gapminder plot function
#' ===
#' 
## ---- warning=FALSE, message=FALSE, fig.width = 10, fig.height = 4, dpi=300, out.width="1100px", out.height="440px"----
gapminder_plot(cntry = "Turkey", yvar = "pop")

#' 
#' 
#' debugging
#' ===
#' 
#' Something not working as hoped? Try using `debug` on a function, which will show you the world as perceived from inside the function:
#' 
## ---- eval=FALSE---------------------------------------------------------
debug(gapminder_plot)
gapminder_plot(cntry = "Turkey", yvar = "pop")
#' 
#' Then when you've fixed your problem, use `undebug` so that you won't go into debug mode every time you run it:
#' 
## ---- eval=FALSE---------------------------------------------------------
undebug(gapminder_plot)

#' 
#' 
#' 
#' Homework
#' ===
#' type: section
#' 
#' 
#' Homework
#' ===
#' 
#' DUE IN TWO WEEKS: download and analyze data from the first year of Seattle's Pronto! bike sharing program.
#' 
#' You'll be writing a loop to read in the data, functions to clean it up, and another function to visualize ridership over the first year.
#' 
#' There is some string processing needed, much of which you have already seen or can probably Google, but some will come in next week's lecture.