--- title: "Functions" editor: markdown: wrap: 72 --- ## Packages for this section ```{r functions-1} library(tidyverse) library(broom) # some regression stuff later ``` ## Don’t repeat yourself - See this: ```{r functions-2} a <- 50 b <- 11 d <- 3 as <- sqrt(a - 1) as bs <- sqrt(b - 1) bs ds <- sqrt(d - 1) ds ``` ## What's the problem? - Same calculation done three different times, by copying, pasting and editing. - Dangerous: what if you forget to change something after you pasted? - Programming principle: "don't repeat yourself". - Hadley Wickham: don't copy-paste more than twice. - Instead: *write a function*. ## Anatomy of function - Header line with function name and input value(s). - Body with calculation of values to output/return. - Return value: the output from function. In our case: ```{r functions-3} sqrt_minus_1 <- function(x) { ans <- sqrt(x - 1) return(ans) } ``` or more simply ("the R way", better style) ```{r functions-4} sqrt_minus_1 <- function(x) { sqrt(x - 1) } ``` If last line of function calculates value without saving it, that value is returned. ## About the input; testing - The input to a function can be called anything. Here we called it `x`. This is the name used inside the function. - The function is a “machine” for calculating square-root-minus-1. It doesn’t do anything until you call it: ```{r functions-5} sqrt_minus_1(50) sqrt_minus_1(11) sqrt_minus_1(3) ``` ```{r} #| error: true q <- 17 sqrt_minus_1(q) sqrt_minus_1("text") ``` - It works! ## Vectorization 1/2 - We conceived our function to work on numbers: ```{r functions-6} sqrt_minus_1(3.25) ``` - but it actually works on vectors too, as a free bonus of R: ```{r functions-7} sqrt_minus_1(c(50, 11, 3)) ``` - or... (over) ## Vectorization 2/2 - or even data frames: ```{r functions-8} d <- data.frame(x = 1:2, y = 3:4) d sqrt_minus_1(d) ``` ## More than one input - Allow the value to be subtracted, before taking square root, to be input to function as well, thus: ```{r functions-9} sqrt_minus_value <- function(x, d) { sqrt(x - d) } ``` - Call the function with the x and d inputs in the right order: ```{r functions-10} sqrt_minus_value(51, 2) ``` - or give the inputs names, in which case they can be in *any order*: ```{r functions-11} sqrt_minus_value(d = 2, x = 51) ``` ```{r} lm(y ~ x, data = d) ``` ## Defaults 1/2 - Many R functions have values that you can change if you want to, but usually you don’t want to, for example: ```{r functions-12} x <- c(3, 4, 5, NA, 6, 7) mean(x) mean(x, na.rm = TRUE) ``` - By default, the mean of data with a missing value is missing, but if you specify `na.rm=TRUE`, the missing values are removed before the mean is calculated. - That is, `na.rm` has a default value of `FALSE`: that’s what it will be unless you change it. ## Defaults 2/2 - In our function, set a default value for d like this: ```{r functions-13} sqrt_minus_value <- function(x, d = 1) { sqrt(x - d) } ``` - If you specify a value for d, it will be used. If you don't, 1 will be used instead: ```{r functions-14} sqrt_minus_value(51, 2) sqrt_minus_value(51) ``` ## Catching errors before they happen - What happened here? ```{r functions-15, error=TRUE, warning=TRUE} sqrt_minus_value(6, 8) ``` - Message not helpful. Actually, function tried to take square root of negative number. - In fact, not even error, just warning. - Check that the square root will be OK first. Here’s how: ```{r functions-16} sqrt_minus_value <- function(x, d = 1) { stopifnot(x - d >= 0) sqrt(x - d) } ``` ## What happens with `stopifnot` - This should be good, and is: ```{r functions-17, error=TRUE} sqrt_minus_value(8, 6) ``` - This should fail, and see how it does: ```{r functions-18, error=TRUE} sqrt_minus_value(6, 8) ``` - Where the function fails, we get informative error, but if everything good, the `stopifnot` does nothing. - `stopifnot` contains one or more logical conditions, and all of them have to be true for function to work. So put in everything that you want to be true. ## Using R’s built-ins - When you write a function, you can use anything built-in to R, or even any functions that you defined before. - For example, if you will be calculating a lot of regression-line slopes, you don’t have to do this from scratch: you can use R’s regression calculations, like this: ```{r functions-19} my_df <- data.frame(x = 1:4, y = c(10, 11, 10, 14)) my_df my_df.1 <- lm(y ~ x, data = my_df) summary(my_df.1) tidy(my_df.1) ``` ## Pulling out just the slope Use `pluck`: ```{r functions-20} tidy(my_df.1) %>% pluck("estimate", 2) ``` ## Making this into a function - First step: make sure you have it working without a function (we do) - Inputs: two, an `x` and a `y`. - Output: just the slope, a number. Thus: ```{r functions-21} slope <- function(xx, yy) { y.1 <- lm(yy ~ xx) tidy(y.1) %>% pluck("estimate", 2) } ``` - Check using our data from before: correct: ```{r functions-22} with(my_df, slope(x, y)) ``` ## Passing things on - `lm` has a lot of options, with defaults, that we might want to change. Instead of intercepting all the possibilities and passing them on, we can do this: ```{r functions-23} slope <- function(xx, yy, ...) { y.1 <- lm(yy ~ xx, ...) tidy(y.1) %>% pluck("estimate", 2) } ``` - The `...` in the header line means “accept any other input”, and the `...` in the `lm` line means “pass anything other than `x` and `y` straight on to `lm`”. ## Using `...` - One of the things `lm` will accept is a vector called `subset` containing the list of observations to include in the regression. - So we should be able to do this: ```{r functions-24} with(my_df, slope(x, y, subset = 3:4)) ``` - Just uses the last two observations in `x` and `y`: ```{r functions-25} my_df %>% slice(3:4) ``` - so the slope should be $(14 − 10)/(4 − 3) = 4$ and is. ## Running a function for each of several inputs - Suppose we have a data frame containing several different `x`’s to use in regressions, along with the `y` we had before: ```{r functions-26} (d <- tibble(x1 = 1:4, x2 = c(8, 7, 6, 5), x3 = c(2, 4, 6, 9))) ``` - Want to use these as different x’s for a regression with `y` from `my_df` as the response, and collect together the three different slopes. - Python-like way: a `for` loop. - R-like way: `map_dbl`: less coding, but more thinking. ## The loop way - “Pull out” column `i` of data frame `d` as `d %>% pull(i)`. - Create empty vector `slopes` to store the slopes. - Looping variable `i` goes from 1 to 3 (3 columns, thus 3 slopes): ```{r functions-27} slopes <- numeric(3) for (i in 1:3) { d %>% pull(i) -> xx slopes[i] <- slope(xx, my_df$y) } slopes ``` - Check this by doing the three `lm`s, one at a time. ## The `map_dbl` way - In words: for each of these (columns of `d`), run function (`slope`) with inputs "it" and `y`), and collect together the answers. - Since slope returns a decimal number (a `dbl`), appropriate function-running function is `map_dbl`: ```{r functions-28} map_dbl(d, \(d) slope(d, my_df$y)) ``` - Same as loop, with a lot less coding. ## Square roots - “Find the square roots of each of the numbers 1 through 10”: ```{r functions-29} x <- 1:10 map_dbl(x, \(x) sqrt(x)) ``` ## Summarizing all columns of a data frame, two ways - use my `d` from above: ```{r functions-30} map_dbl(d, \(d) mean(d)) d %>% summarize(across(everything(), \(x) mean(x))) ``` The mean of each column, with the columns labelled. ## What if summary returns more than one thing? - For example, finding quartiles: ```{r functions-31} quartiles <- function(x) { quantile(x, c(0.25, 0.75)) } quartiles(1:5) ``` - When function returns more than one thing, `map` (or `map_df`) instead of `map_dbl`. ## map results - Try: ```{r functions-32} map(d, \(d) quartiles(d)) ``` - A list. ## Or - Better: pretend output from quartiles is one-column data frame: ```{r functions-33} map_df(d, \(d) quartiles(d)) ``` ## Or even ```{r functions-34} d %>% map_df(\(d) quartiles(d)) ``` ## Comments - This works because the implicit first thing in map is (the columns of) the data frame that came out of the previous step. - These are 1st and 3rd quartiles of each column of `d`, according to R’s default definition (see help for `quantile`). ## `Map` in data frames with `mutate` - `map` can also be used within data frames to calculate new columns. Let’s do the square roots of 1 through 10 again: ```{r functions-35} d <- tibble(x = 1:10) d %>% mutate(root = map_dbl(x, \(x) sqrt(x))) ``` ## Write a function first and then map it - If the “for each” part is simple, go ahead and use `map_`-whatever. - If not, write a function to do the complicated thing first. - Example: “half or triple plus one”: if the input is an even number, halve it; if it is an odd number, multiply it by three and add one. - This is hard to do as a one-liner: first we have to figure out whether the input is odd or even, and then we have to do the right thing with it. ## Odd or even? - Odd or even? Work out the remainder when dividing by 2: ```{r functions-36} 6 %% 2 5 %% 2 ``` - 5 has remainder 1 so it is odd. ## Write the function - First test for integerness, then test for odd or even, and then do the appropriate calculation: ```{r functions-37, error=TRUE} hotpo <- function(x) { stopifnot(round(x) == x) # passes if input an integer remainder <- x %% 2 if (remainder == 1) { # odd number ans <- 3 * x + 1 } else { # even number ans <- x %/% 2 # integer division } ans } ``` ```{r} x <- 4 ifelse((x %% 2) == 1, 3 * x + 1, x %/% 2) ``` ## Test it ```{r functions-38, error=T} hotpo(3) hotpo(12) hotpo(4.5) ``` ## One through ten - Use a data frame of numbers 1 through 10 again: ```{r functions-39} tibble(x = 1:10) %>% mutate(y = map_int(x, \(x) hotpo(x))) ``` ## Until I get to 1 (if I ever do) {.smaller} - If I start from a number, find `hotpo` of it, then find `hotpo` of that, and keep going, what happens? - If I get to 4, 2, 1, 4, 2, 1 I’ll repeat for ever, so let’s stop when we get to 1: ```{r functions-40} hotpo_seq <- function(x) { ans <- x while (x != 1) { x <- hotpo(x) ans <- c(ans, x) } ans } ``` - Strategy: keep looping “while `x` is not 1”. - Each new `x`: add to the end of `ans`. When I hit 1, I break out of the `while` and return the whole `ans`. ## Trying it 1/2 - Start at 6: ```{r functions-41} hotpo_seq(6) ``` ## Trying it 2/2 - Start at 27: ```{r} #| echo = FALSE w <- getOption("width") options(width = 60) ``` \footnotesize ```{r functions-42} hotpo_seq(27) ``` \normalsize ```{r} #| echo = FALSE options(width = w) ``` ## Which starting points have the longest sequences? - The `length` of the vector returned from `hotpo_seq` says how long it took to get to 1. - Out of the starting points 1 to 100, which one has the longest sequence? ## Top 10 longest sequences ```{r functions-43} tibble(start = 1:100) %>% mutate(seq_length = map_int( start, \(start) length(hotpo_seq(start)))) %>% slice_max(seq_length, n = 10) ``` - 27 is an unusually low starting point to have such a long sequence. ## What happens if we save the entire sequence? ```{r functions-44} tibble(start = 1:7) %>% mutate(sequence = map(start, \(start) hotpo_seq(start))) ``` - Each entry in `sequence` is itself a vector. `sequence` is a “list-column”. ## Using the whole sequence to find its length and its max ```{r functions-45} tibble(start = 1:7) %>% mutate(sequence = map(start, \(start) hotpo_seq(start))) %>% mutate( seq_length = map_int(sequence, \(sequence) length(sequence)), seq_max = map_int(sequence, \(sequence) max(sequence)) ) ``` ## Does it work with `rowwise`? ```{r functions-46} tibble(start=1:7) %>% rowwise() %>% mutate(sequence = list(hotpo_seq(start))) %>% mutate(seq_length = length(sequence)) %>% mutate(seq_max = max(sequence)) ``` It does. ## Final thoughts on this - Called the **Collatz conjecture**. - Nobody knows whether the sequence always gets to 1. - Nobody has found an $n$ for which it doesn’t. - A [tree](https://www.jasondavies.com/collatz-graph/).