---
title: "Functions"
editor: 
  markdown: 
    wrap: 72
---

## Packages for this section

```{r functions-1}
library(tidyverse)
library(broom) # some regression stuff later
```

## Don’t repeat yourself

-   See this:

```{r functions-2}
a <- 50
b <- 11
d <- 3
as <- sqrt(a - 1)
as
bs <- sqrt(b - 1)
bs
ds <- sqrt(d - 1)
ds
```

## What's the problem?

-   Same calculation done three different times, by copying, pasting and
    editing.

-   Dangerous: what if you forget to change something after you pasted?

-   Programming principle: "don't repeat yourself".

-   Hadley Wickham: don't copy-paste more than twice.

-   Instead: *write a function*.

## Anatomy of function

-   Header line with function name and input value(s).
-   Body with calculation of values to output/return.
-   Return value: the output from function. In our case:

```{r functions-3}
sqrt_minus_1 <- function(x) {
  ans <- sqrt(x - 1)
  return(ans)
}
```

or more simply ("the R way", better style)

```{r functions-4}
sqrt_minus_1 <- function(x) {
  sqrt(x - 1)
}
```

If last line of function calculates value without saving it, that value
is returned.

## About the input; testing

-   The input to a function can be called anything. Here we called it
    `x`. This is the name used inside the function.
-   The function is a “machine” for calculating square-root-minus-1. It
    doesn’t do anything until you call it:

```{r functions-5}
sqrt_minus_1(50)
sqrt_minus_1(11)
sqrt_minus_1(3)
```

```{r}
#| error: true
q <- 17
sqrt_minus_1(q)
sqrt_minus_1("text")
```

-   It works!

## Vectorization 1/2

-   We conceived our function to work on numbers:

```{r functions-6}
sqrt_minus_1(3.25)
```

-   but it actually works on vectors too, as a free bonus of R:

```{r functions-7}
sqrt_minus_1(c(50, 11, 3))
```

-   or... (over)

## Vectorization 2/2

-   or even data frames:

```{r functions-8}
d <- data.frame(x = 1:2, y = 3:4)
d
sqrt_minus_1(d)
```

## More than one input

-   Allow the value to be subtracted, before taking square root, to be
    input to function as well, thus:

```{r functions-9}
sqrt_minus_value <- function(x, d) {
  sqrt(x - d)
}
```

-   Call the function with the x and d inputs in the right order:

```{r functions-10}
sqrt_minus_value(51, 2)
```

-   or give the inputs names, in which case they can be in *any order*:

```{r functions-11}
sqrt_minus_value(d = 2, x = 51)
```

```{r}
lm(y ~ x, data = d)
```

## Defaults 1/2

-   Many R functions have values that you can change if you want to, but
    usually you don’t want to, for example:

```{r functions-12}
x <- c(3, 4, 5, NA, 6, 7)
mean(x)
mean(x, na.rm = TRUE)
```

-   By default, the mean of data with a missing value is missing, but if
    you specify `na.rm=TRUE`, the missing values are removed before the
    mean is calculated.

-   That is, `na.rm` has a default value of `FALSE`: that’s what it will
    be unless you change it.

## Defaults 2/2

-   In our function, set a default value for d like this:

```{r functions-13}
sqrt_minus_value <- function(x, d = 1) {
  sqrt(x - d)
}
```

-   If you specify a value for d, it will be used. If you don't, 1 will
    be used instead:

```{r functions-14}
sqrt_minus_value(51, 2)
sqrt_minus_value(51)
```

## Catching errors before they happen

-   What happened here?

```{r functions-15, error=TRUE, warning=TRUE}
sqrt_minus_value(6, 8)
```

-   Message not helpful. Actually, function tried to take square root of
    negative number.
-   In fact, not even error, just warning.
-   Check that the square root will be OK first. Here’s how:

```{r functions-16}
sqrt_minus_value <- function(x, d = 1) {
  stopifnot(x - d >= 0)
  sqrt(x - d)
}
```

## What happens with `stopifnot`

-   This should be good, and is:

```{r functions-17, error=TRUE}
sqrt_minus_value(8, 6)
```

-   This should fail, and see how it does:

```{r functions-18, error=TRUE}
sqrt_minus_value(6, 8)
```

-   Where the function fails, we get informative error, but if
    everything good, the `stopifnot` does nothing.
-   `stopifnot` contains one or more logical conditions, and all of them
    have to be true for function to work. So put in everything that you
    want to be true.

## Using R’s built-ins

-   When you write a function, you can use anything built-in to R, or
    even any functions that you defined before.
-   For example, if you will be calculating a lot of regression-line
    slopes, you don’t have to do this from scratch: you can use R’s
    regression calculations, like this:

```{r functions-19}
my_df <- data.frame(x = 1:4, y = c(10, 11, 10, 14))
my_df
my_df.1 <- lm(y ~ x, data = my_df)
summary(my_df.1)
tidy(my_df.1)
```

## Pulling out just the slope

Use `pluck`:

```{r functions-20}
tidy(my_df.1) %>% pluck("estimate", 2)
```

## Making this into a function

-   First step: make sure you have it working without a function (we do)
-   Inputs: two, an `x` and a `y`.
-   Output: just the slope, a number. Thus:

```{r functions-21}
slope <- function(xx, yy) {
  y.1 <- lm(yy ~ xx)
  tidy(y.1) %>% pluck("estimate", 2)
}
```

-   Check using our data from before: correct:

```{r functions-22}
with(my_df, slope(x, y))
```

## Passing things on

-   `lm` has a lot of options, with defaults, that we might want to
    change. Instead of intercepting all the possibilities and passing
    them on, we can do this:

```{r functions-23}
slope <- function(xx, yy, ...) {
  y.1 <- lm(yy ~ xx, ...)
  tidy(y.1) %>% pluck("estimate", 2)
}
```

-   The `...` in the header line means “accept any other input”, and the
    `...` in the `lm` line means “pass anything other than `x` and `y`
    straight on to `lm`”.

## Using `...`

-   One of the things `lm` will accept is a vector called `subset`
    containing the list of observations to include in the regression.
-   So we should be able to do this:

```{r functions-24}
with(my_df, slope(x, y, subset = 3:4))
```

-   Just uses the last two observations in `x` and `y`:

```{r functions-25}
my_df %>% slice(3:4)
```

-   so the slope should be $(14 − 10)/(4 − 3) = 4$ and is.

## Running a function for each of several inputs

-   Suppose we have a data frame containing several different `x`’s to
    use in regressions, along with the `y` we had before:

```{r functions-26}
(d <- tibble(x1 = 1:4, x2 = c(8, 7, 6, 5), x3 = c(2, 4, 6, 9)))
```

-   Want to use these as different x’s for a regression with `y` from
    `my_df` as the response, and collect together the three different
    slopes.
-   Python-like way: a `for` loop.
-   R-like way: `map_dbl`: less coding, but more thinking.

## The loop way

-   “Pull out” column `i` of data frame `d` as `d %>% pull(i)`.
-   Create empty vector `slopes` to store the slopes.
-   Looping variable `i` goes from 1 to 3 (3 columns, thus 3 slopes):

```{r functions-27}
slopes <- numeric(3)
for (i in 1:3) {
  d %>% pull(i) -> xx
  slopes[i] <- slope(xx, my_df$y)
}
slopes
```

-   Check this by doing the three `lm`s, one at a time.

## The `map_dbl` way

-   In words: for each of these (columns of `d`), run function (`slope`)
    with inputs "it" and `y`), and collect together the answers.
-   Since slope returns a decimal number (a `dbl`), appropriate
    function-running function is `map_dbl`:

```{r functions-28}
map_dbl(d, \(d) slope(d, my_df$y))
```

-   Same as loop, with a lot less coding.

## Square roots

-   “Find the square roots of each of the numbers 1 through 10”:

```{r functions-29}
x <- 1:10
map_dbl(x, \(x) sqrt(x))
```

## Summarizing all columns of a data frame, two ways

-   use my `d` from above:

```{r functions-30}
map_dbl(d, \(d) mean(d))
d %>% summarize(across(everything(), \(x) mean(x)))
```

The mean of each column, with the columns labelled.

## What if summary returns more than one thing?

-   For example, finding quartiles:

```{r functions-31}
quartiles <- function(x) {
  quantile(x, c(0.25, 0.75))
}
quartiles(1:5)
```

-   When function returns more than one thing, `map` (or `map_df`)
    instead of `map_dbl`.

## map results

-   Try:

```{r functions-32}
map(d, \(d) quartiles(d))
```

-   A list.

## Or

-   Better: pretend output from quartiles is one-column data frame:

```{r functions-33}
map_df(d, \(d) quartiles(d))
```

## Or even

```{r functions-34}
d %>% map_df(\(d) quartiles(d))
```

## Comments

-   This works because the implicit first thing in map is (the columns
    of) the data frame that came out of the previous step.
-   These are 1st and 3rd quartiles of each column of `d`, according to
    R’s default definition (see help for `quantile`).

## `Map` in data frames with `mutate`

-   `map` can also be used within data frames to calculate new columns.
    Let’s do the square roots of 1 through 10 again:

```{r functions-35}
d <- tibble(x = 1:10)
d %>% mutate(root = map_dbl(x, \(x) sqrt(x)))
```

## Write a function first and then map it

-   If the “for each” part is simple, go ahead and use `map_`-whatever.
-   If not, write a function to do the complicated thing first.
-   Example: “half or triple plus one”: if the input is an even number,
    halve it; if it is an odd number, multiply it by three and add one.
-   This is hard to do as a one-liner: first we have to figure out
    whether the input is odd or even, and then we have to do the right
    thing with it.

## Odd or even?

-   Odd or even? Work out the remainder when dividing by 2:

```{r functions-36}
6 %% 2
5 %% 2
```

-   5 has remainder 1 so it is odd.

## Write the function

-   First test for integerness, then test for odd or even, and then do
    the appropriate calculation:

```{r functions-37, error=TRUE}
hotpo <- function(x) {
  stopifnot(round(x) == x) # passes if input an integer
  remainder <- x %% 2
  if (remainder == 1) { # odd number
    ans <- 3 * x + 1
  }
  else { # even number
    ans <- x %/% 2 # integer division
  }
  ans
}
```

```{r}
x <- 4
ifelse((x %% 2) == 1, 3 * x + 1, x %/% 2)
```

## Test it

```{r functions-38, error=T}
hotpo(3)
hotpo(12)
hotpo(4.5)
```

## One through ten

-   Use a data frame of numbers 1 through 10 again:

```{r functions-39}
tibble(x = 1:10) %>% mutate(y = map_int(x, \(x) hotpo(x)))
```

## Until I get to 1 (if I ever do) {.smaller}

-   If I start from a number, find `hotpo` of it, then find `hotpo` of
    that, and keep going, what happens?
-   If I get to 4, 2, 1, 4, 2, 1 I’ll repeat for ever, so let’s stop
    when we get to 1:

```{r functions-40}
hotpo_seq <- function(x) {
  ans <- x
  while (x != 1) {
    x <- hotpo(x)
    ans <- c(ans, x)
  }
  ans
}
```

-   Strategy: keep looping “while `x` is not 1”.
-   Each new `x`: add to the end of `ans`. When I hit 1, I break out of
    the `while` and return the whole `ans`.

## Trying it 1/2

-   Start at 6:

```{r functions-41}
hotpo_seq(6)
```

## Trying it 2/2

-   Start at 27:

```{r}
#| echo = FALSE
w <- getOption("width")
options(width = 60)
```

\footnotesize

```{r functions-42}
hotpo_seq(27)
```

\normalsize

```{r}
#| echo = FALSE
options(width = w)
```

## Which starting points have the longest sequences?

-   The `length` of the vector returned from `hotpo_seq` says how long
    it took to get to 1.
-   Out of the starting points 1 to 100, which one has the longest
    sequence?

## Top 10 longest sequences

```{r functions-43}
tibble(start = 1:100) %>%
  mutate(seq_length = map_int(
    start, \(start) length(hotpo_seq(start)))) %>%
  slice_max(seq_length, n = 10)
```

-   27 is an unusually low starting point to have such a long sequence.

## What happens if we save the entire sequence?

```{r functions-44}
tibble(start = 1:7) %>%
  mutate(sequence = map(start, \(start) hotpo_seq(start)))
```

-   Each entry in `sequence` is itself a vector. `sequence` is a
    “list-column”.

## Using the whole sequence to find its length and its max

```{r functions-45}
tibble(start = 1:7) %>%
  mutate(sequence = map(start, \(start) hotpo_seq(start))) %>%
  mutate(
    seq_length = map_int(sequence, \(sequence) length(sequence)),
    seq_max = map_int(sequence, \(sequence) max(sequence))
  )
```

## Does it work with `rowwise`?

```{r functions-46}
tibble(start=1:7) %>% 
  rowwise() %>% 
  mutate(sequence = list(hotpo_seq(start))) %>% 
  mutate(seq_length = length(sequence)) %>% 
  mutate(seq_max = max(sequence))
```

It does.

## Final thoughts on this

-   Called the **Collatz conjecture**.
-   Nobody knows whether the sequence always gets to 1.
-   Nobody has found an $n$ for which it doesn’t.
-   A [tree](https://www.jasondavies.com/collatz-graph/).