---
execute:
  echo: true
  message: false
  warning: false
fig-format: "svg"
format:
  revealjs:
    theme: lecture_styles.scss
    highlight-style: a11y-dark
    reference-location: margin
    slide-number: true
    code-link: true
    chalkboard: true
    incremental: false
    smaller: true
    preview-links: true
    code-line-numbers: true
    history: false
    progress: true
    link-external-icon: true
    code-annotations: hover
    pointer:
      color: "#b18eb1"
revealjs-plugins:
  - pointer
---

```{r}
#| echo: false
#| cache: false
require(downlit)
require(xml2)
require(tidyverse)
knitr::opts_chunk$set(comment = ">")
```

## {#title-slide data-menu-title="Writing Functions" background="#1e4655" background-image="../../images/csss-logo.png" background-position="center top 5%" background-size="50%"}

[Writing Functions]{.custom-title}

[CS&SS 508 • Lecture 8]{.custom-subtitle}

[{{< var lectures.eight >}}]{.custom-subtitle2}

[Victoria Sass]{.custom-subtitle3}

# Roadmap {.section-title background-color="#99a486"}

------------------------------------------------------------------------

::: columns
::: {.column width="50%"}
### Last time, we learned:

- Types of Data
- Strings
- Pattern Matching & Regular Expressions
:::

::: {.column width="50%"}
::: fragment
### Today, we will cover:

- Function Basics
- Types of Functions
    - Vector Functions
    - Data Frame Functions
    - Plot Functions
- Function Style Guide
:::
:::
:::

# Function Basics {.section-title background-color="#99a486"}

## Why Functions?
R (as well as mathematics in general) is full of functions!

. . .
We use functions to:

- Compute summary statistics (`mean()`, `sd()`, `min()`)
- Fit models to data (`lm(Fertility ~ Agriculture, data = swiss)`)
- Read in data (`read_csv()`)
- Create visualizations (`ggplot()`)
- And a lot more!!

## Examples of Existing Functions

::: incremental
- `mean()`:
    - Input: a vector
    - Output: a single number
- `dplyr::filter()`:
    - Input: a data frame, logical conditions
    - Output: a data frame containing only the rows that meet those conditions
- `readr::read_csv()`:
    - Input: a file path, optionally variable names or types
    - Output: a data frame containing info read in from file
:::

. . .

Each function requires **inputs** and returns **outputs**.

## Why Write Your Own Functions?

::: {.incremental}
* Functions allow you to automate common tasks in a more powerful and general way than copying and pasting.
* As requirements change, you only need to update code in one place, instead of many.
* You eliminate the incidental mistakes that creep in when you copy and paste (e.g. updating a variable name in one place, but not in another).
* Functions make it easier to reuse work from project to project, increasing your productivity over time.
* If well named, your function can make your overall code easier to understand.
:::

## Plan your Function before Writing
Before you can write effective code, you need to know *exactly* what you want:

::: incremental
- **Goal:** Do I want a single value? vector? one observation per person? per year?
- **Current State:** What do I currently have? data frame, vector? long or wide format?
- **Translate:** How can I take what I have and turn it into my goal?
    - Sketch out the steps!
    - Break it down into little operations
:::

. . .

**As we become more advanced coders, this concept is key!!**

**Remember:** *When you're stuck, try searching your problem on Google!!*

## Simple, Motivating Example

:::: {.columns}
::: {.column width="62%"}
```{r}
#| echo: false
set.seed(5000)
```

```{r}
#| eval: false
df <- tibble(
  a = rnorm(5),
  b = rnorm(5),
  c = rnorm(5),
  d = rnorm(5)
)
df

df |> mutate(
  a = (a - min(a, na.rm = TRUE)) /
    (max(a, na.rm = TRUE) - min(a, na.rm = TRUE)),
  b = (b - min(b, na.rm = TRUE)) /
    (max(b, na.rm = TRUE) - min(a, na.rm = TRUE)),
  c = (c - min(c, na.rm = TRUE)) /
    (max(c, na.rm = TRUE) - min(c, na.rm = TRUE)),
  d = (d - min(d, na.rm = TRUE)) /
    (max(d, na.rm = TRUE) - min(d, na.rm = TRUE))
)
```

::: {.incremental .fragment fragment-index=3}
* What do you think this code does?
* Are there any typos?
* Could we write this more efficiently as a function?
:::
:::

::: {.column width="38%"}
::: {.fragment fragment-index=1}
```{r}
#| echo: false
set.seed(5000)
df <- tibble(
  a = rnorm(5),
  b = rnorm(5),
  c = rnorm(5),
  d = rnorm(5)
)
df
```
:::

::: {.fragment fragment-index=2}
```{r}
#| echo: false
df |> mutate(
  a = (a - min(a, na.rm = TRUE)) /
    (max(a, na.rm = TRUE) - min(a, na.rm = TRUE)),
  b = (b - min(b, na.rm = TRUE)) /
    (max(b, na.rm = TRUE) - min(a, na.rm = TRUE)),
  c = (c - min(c, na.rm = TRUE)) /
    (max(c, na.rm = TRUE) - min(c, na.rm = TRUE)),
  d = (d - min(d, na.rm = TRUE)) /
    (max(d, na.rm = TRUE) - min(d, na.rm = TRUE))
)
```
:::
:::
::::

## Writing a Function

To write a function you first need to analyse your repeated code to figure out which parts are constant and which parts vary.

. . .
Let's look at the contents of the mutate from the last slide again.

. . .

```{r}
#| eval: false
(a - min(a, na.rm = TRUE)) / (max(a, na.rm = TRUE) - min(a, na.rm = TRUE))
(b - min(b, na.rm = TRUE)) / (max(b, na.rm = TRUE) - min(b, na.rm = TRUE))
(c - min(c, na.rm = TRUE)) / (max(c, na.rm = TRUE) - min(c, na.rm = TRUE))
(d - min(d, na.rm = TRUE)) / (max(d, na.rm = TRUE) - min(d, na.rm = TRUE))
```

. . .

There's quite a bit of repetition here and only a few elements that change.

. . .
We can see how concise our code can be if we replace the varying part with 🟪:

```{r}
#| eval: false
(🟪 - min(🟪, na.rm = TRUE)) / (max(🟪, na.rm = TRUE) - min(🟪, na.rm = TRUE))
```

## Anatomy of a Function

To turn our code into a function we need three things:

::: incremental
- **Name**: What you call the function so you can use it later. The more explanatory this is, the easier your code will be to understand.
- **Argument(s)** (aka input(s), parameter(s)): What the user passes to the function that affects how it works. This is what [varies]{.custom-red} across calls.
- **Body**: The code that’s [repeated]{.custom-red} across all the calls.
:::

. . .

**Function Template**

```{r}
#| eval: false
NAME <- function(ARGUMENT1, ARGUMENT2 = DEFAULT){ # <1>
  BODY
}
```

1. In this example, the values of `ARGUMENT1` and `ARGUMENT2` won't exist outside of the function. `ARGUMENT2` is an optional argument because it has been given a default value to use if the user does not specify one.

. . .

For our current example, this would be:

```{r}
rescale01 <- function(x) { # <2>
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
```

2. You can name the placeholder value(s) whatever you want, but `x` is the conventional name for a numeric vector so we'll use `x` here.

## Testing Your Function

It's good practice to test a few simple inputs to make sure your function works as expected.

. . .

```{r}
rescale01(c(-10, 0, 10))
rescale01(c(1, 2, 3, NA, 5))
```

. . .

Now we can rewrite our original code in a much simpler way!^[We'll see how we can simplify this even further next week!]

. . .

```{r}
df |>
  mutate(a = rescale01(a),
         b = rescale01(b),
         c = rescale01(c),
         d = rescale01(d))
```

## Improving Your Function

Writing a function is often an iterative process: you'll write the core of the function and then notice ways it can be made more efficient, or that it needs additional code to handle a specific use case.

:::: {.columns}
::: {.column width="42%"}
::: {.fragment}
For instance, you might observe that our function does some unnecessary computational repetition by evaluating `min()` twice and `max()` once when both can be computed once with `range()`.
:::

::: {.fragment}
```{r}
#| code-line-numbers: false
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
```
:::
:::

::: {.column width="58%"}
::: {.fragment}
Or you might find out through trial and error that our function doesn't handle infinite values well.

```{r}
#| code-line-numbers: false
x <- c(1:10, Inf)
rescale01(x)
```
:::

::: {.fragment}
Updating it to exclude infinite values makes it more general as it accounts for more use cases.

```{r}
#| code-line-numbers: false
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE, finite = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
```
:::
:::
::::

# Vector Functions {.section-title background-color="#99a486"}

## What are Vector Functions?

The function we just created is a vector function!

. . .

Vector functions are simply functions that take one or more vectors as input and return a vector as output.

. . .

There are two types of vector functions: mutate functions and summary functions.
::: {.fragment}
#### Mutate Functions

* Return an output the same length as the input
* Therefore, these functions work well within `mutate()` and `filter()`
:::
::: {.fragment}
#### Summary Functions

* Return a single value
* Therefore well suited for use in `summarize()`
:::

## Examples of Mutate Functions

. . .

```{r}
z_score <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE) # <1>
}
ages <- c(25, 82, 73, 44, 5)
z_score(ages)
```

1. Rescales a vector to have a mean of zero and a standard deviation of one.

. . .

```{r}
clamp <- function(x, min, max) {
  case_when( # <2>
    x < min ~ min, # <2>
    x > max ~ max, # <2>
    .default = x # <2>
  )
}
clamp(1:10, min = 3, max = 7)
```

2. Ensures all values of a vector lie between a minimum and a maximum.

. . .

```{r}
first_upper <- function(x) {
  str_sub(x, 1, 1) <- str_to_upper(str_sub(x, 1, 1)) # <3>
  x # <3>
}
first_upper("hi there, how's your day going?")
```

3. Makes the first character upper case.

## Examples of Summary Functions

. . .

```{r}
cv <- function(x, na.rm = FALSE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm) # <4>
}
cv(runif(100, min = 0, max = 50))
```

4. Calculates the coefficient of variation, which divides the standard deviation by the mean.

. . .

```{r}
n_missing <- function(x) {
  sum(is.na(x)) # <5>
}
var <- sample(c(seq(1, 20, 1), NA, NA), size = 100, replace = TRUE) # <6>
n_missing(var)
```

5. Calculates the number of missing values ([Source](https://twitter.com/gbganalyst/status/1571619641390252033)).
6. Creates a random sample of 100 values with a mix of integers from 1 to 20 and `NA` values.

. . .

```{r}
mape <- function(actual, predicted) {
  sum(abs((actual - predicted) / actual)) / length(actual) # <7>
}
model1 <- lm(dist ~ speed, data = cars)
mape(cars$dist, model1$fitted.values) # <8>
```

7. Calculates the mean absolute percentage error, which measures the average magnitude of error produced by a model, or how far off predictions are on average.
8. This tells us that the average absolute percentage difference between the predicted values and the actual values is ~ 38%.
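## Vector Functions Inside dplyr Verbs

To make the mutate/summary distinction concrete, here is a minimal sketch that drops `z_score()` and `cv()` from the previous slides into `mutate()` and `summarize()`. It is an illustration rather than part of the original examples: it assumes `dplyr` is installed, uses the built-in `mtcars` data purely for convenience, and the names `mtcars_z`, `mtcars_cv`, `mpg_z`, and `mpg_cv` are hypothetical.

```r
library(dplyr)

# Mutate function: output has the same length as its input
z_score <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# Summary function: output is a single value
cv <- function(x, na.rm = FALSE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

# z_score() returns one value per row, so it fits inside mutate()
mtcars_z <- mtcars |>
  mutate(mpg_z = z_score(mpg))

# cv() collapses each group to one value, so it fits inside summarize()
mtcars_cv <- mtcars |>
  group_by(cyl) |>
  summarize(mpg_cv = cv(mpg))

nrow(mtcars_z)   # same number of rows as mtcars
nrow(mtcars_cv)  # one row per distinct value of cyl
```

Swapping the two (for example, calling `cv()` inside `mutate()`) would still run, but the single value would be recycled down the whole column, which is rarely what you want.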
# Data Frame Functions {.section-title background-color="#99a486"}

## What are Data Frame Functions?

Vector functions are useful for pulling out code that’s repeated within a dplyr verb.

. . .

But if you are building a long pipeline that is used repeatedly, you'll want to write a data frame function.

. . .

Data frame functions work like dplyr verbs: they take a data frame as the first argument, some extra arguments that say what to do with it, and return a data frame or a vector.

#### Example

```{r}
#| error: true
#| output-location: fragment
grouped_mean <- function(df, group_var, mean_var) {
  df |>
    group_by(group_var) |> # <1>
    summarize(mean(mean_var)) # <1>
}

diamonds |> grouped_mean(cut, carat)
```

1. The goal of this function is to compute the mean of `mean_var` grouped by `group_var`.
::: {.fragment}
Uh oh, what happened?
:::

## Tidy Evaluation

Tidy evaluation is what allows us to refer to the names of variables inside a data frame without any special treatment.

. . .

This is the reason we don't have to use the `$` operator and can just call the variables directly and `tidyverse` functions know what we're referring to.

. . .

::: {.panel-tabset}

### Base `R`

```{r}
#| eval: false
diamonds[diamonds$cut == "Ideal" & diamonds$price < 1000, ]
```

### `tidyverse`

```{r}
#| eval: false
diamonds |> filter(cut == "Ideal" & price < 1000)
```

:::

. . .
Most of the time tidy evaluation does exactly what we want it to do.

. . .

The downside of tidy evaluation comes when we want to wrap up repeated tidyverse code into a function.

. . .

Here we need some way to tell the functions within *our function* not to treat our argument names as the names of the variables, but instead to *look inside them* for the variable we actually want to use.

## Embracing

The tidy evaluation solution to this issue is called embracing, which means wrapping variable names in two sets of curly braces (i.e. `var` becomes `{{ var }}`).

. . .

Embracing a variable tells `dplyr` to use the value stored inside the argument, not the argument as the literal variable name.

. . .

```{r}
#| output-location: fragment
grouped_mean <- function(df, group_var, mean_var) {
  df |>
    group_by({{ group_var }}) |>
    summarize(mean({{ mean_var }}))
}

diamonds |> grouped_mean(cut, carat)
```

## When to Embrace? [{{< fa scroll >}}]{style="color:#99a486"} {.scrollable}

Look up the documentation of the function!

. . .

The two most common sub-types of tidy evaluation are **data-masking**^[Used in functions like `arrange()`, `filter()`, and `summarize()` that compute with variables.] and **tidy-selection**^[Used in functions like `select()`, `relocate()`, and `rename()` that select variables.].

![](mutate_help.png){fig-align="center"}

## Data Frame Function Examples

```{r}
#| output-location: fragment
summary6 <- function(data, var) {
  data |> summarize( # <2>
    min = min({{ var }}, na.rm = TRUE), # <2>
    mean = mean({{ var }}, na.rm = TRUE), # <2>
    median = median({{ var }}, na.rm = TRUE), # <2>
    max = max({{ var }}, na.rm = TRUE), # <2>
    n = n(), # <2>
    n_miss = sum(is.na({{ var }})), # <2>
    .groups = "drop" # <3>
  )
}

diamonds |> summary6(carat)
```

2. The goal of this function is to compute six common summary statistics for a specified variable of a dataset.
3. Whenever you wrap `summarize()` in a helper function it’s good practice to set `.groups = "drop"` to both avoid the message and leave the data in an ungrouped state.

## Data Frame Function Examples

```{r}
#| output-location: fragment
count_prop <- function(df, var, sort = FALSE) {
  df |> # <4>
    count({{ var }}, sort = sort) |> # <4>
    mutate(prop = n / sum(n)) # <4>
}

diamonds |> count_prop(clarity)
```

4. This function is a variation of `count()` which also calculates the proportion ([Source](https://twitter.com/Diabb6/status/1571635146658402309)).

# Plot Functions {.section-title background-color="#99a486"}

## What are Plot Functions?

:::: {.columns}
::: {.column width="60%"}
What if you have a lot of similar plots to create? You can use a function to eliminate redundancy.
::: {.fragment}
The same technique can be used if you want to write a function that returns a plot since `aes()` is a data-masking function.
:::
::: {.fragment}
Simply use embracing within the `aes()` call to `ggplot()`!
:::
::: {.fragment}
```{r}
#| output: false
histogram <- function(df, var, binwidth = NULL) {
  df |>
    ggplot(aes(x = {{ var }})) + # <1>
    geom_histogram(binwidth = binwidth) # <1>
}

diamonds |> histogram(carat, 0.1) # <2>
```

1. This is a useful function for quickly getting histograms of a specified binwidth from a dataset.
2. Note that `histogram()` returns a ggplot2 plot, meaning you can still add on additional components if you want. Just remember to switch from `|>` to `+`.
:::
:::

::: {.column width="40%"}
```{r}
#| echo: false
ggsave("histogram.png", width = 6, height = 8, units = "in")
```

::: {.fragment}
![](histogram.png){.absolute right=5 top=25}
:::
:::
::::

## Data Manipulation & Plotting

You might want to create a function that has a bit of data manipulation **_and_** returns a plot.

. . .

```{r}
#| output-location: fragment
#| fig-align: center
sorted_bars <- function(df, var) { # <3>
  df |>
    mutate({{ var }} := fct_rev(fct_infreq({{ var }}))) |> # <4>
    ggplot(aes(y = {{ var }})) +
    geom_bar()
}

diamonds |> sorted_bars(clarity)
```

3. This function creates a vertical bar chart where you automatically sort the bars in frequency order using `fct_infreq()`.
4. `:=` (commonly referred to as the “walrus operator”) is used here because we are generating the variable name based on user-supplied data. `R`’s syntax doesn’t allow anything to the left of `=` except for a single, literal name. To work around this problem, we use the special operator `:=` which tidy evaluation treats in exactly the same way as `=`.

## Functions that Label

:::: {.columns}
::: {.column width="60%"}
What if we want to add labels using our function?
::: {.fragment fragment-index=1}
For that we need to use the low-level package `rlang` that’s used by just about every other package in the tidyverse because it implements tidy evaluation (as well as many other useful tools).
:::
:::

::: {.column width="40%"}
:::
::::

. . .

Let's take our histogram example from before:

```{r}
#| output: false
#| echo: true
histogram <- function(df, var, binwidth) {
  label <- rlang::englue("A histogram of {{ var }} with binwidth {binwidth}") # <5>

  df |>
    ggplot(aes(x = {{ var }})) +
    geom_histogram(binwidth = binwidth) +
    labs(title = label)
}

diamonds |> histogram(carat, 0.1)
```

5. `rlang::englue()` works similarly to `str_glue()`, so any value wrapped in `{ }` will be inserted into the string. But it also understands `{{ }}`, which automatically inserts the appropriate variable name.

. . .

```{r}
#| echo: false
#| output: false
histogram <- function(df, var, binwidth) {
  label <- rlang::englue("A histogram of {{var}} with binwidth {binwidth}")

  df |>
    ggplot(aes(x = {{ var }})) +
    geom_histogram(binwidth = binwidth) +
    labs(title = label)
}

diamonds |> histogram(carat, 0.1)
ggsave("histogram_label.png", width = 6, height = 8, units = "in")
```

![](histogram_label.png){.absolute right=50 top=0 width="325" height="400"}

# Function Style Guide {.section-title background-color="#99a486"}

## Best Practices

::: {.incremental}
* Make function names descriptive; again, longer is better due to RStudio's auto-complete feature.
* Generally, function names should be verbs, and arguments should be nouns.
    - Some exceptions: computation of a well-known noun (e.g. `mean()`), accessing a property of an object (e.g. `coef()`)
* `function()` should always be followed by squiggly brackets (`{}`), and the contents should be indented by an additional two spaces^[This makes it easier to see the hierarchy in your code by skimming the left-hand margin.].
* You should put extra spaces inside of `{{ }}`. This makes it very obvious that something unusual is happening.
:::

::: aside
You can read the official tidyverse style guide for functions [here](https://style.tidyverse.org/functions).
:::

# Lab {.section-title background-color="#99a486"}

## Writing Functions

1. Practice turning the following code snippets into functions. Think about what each function does. What would you call it? How many arguments does it need?

. . .

```{r}
#| eval: false
mean(is.na(x))
mean(is.na(y))
mean(is.na(z))

x / sum(x, na.rm = TRUE)
y / sum(y, na.rm = TRUE)
z / sum(z, na.rm = TRUE)

round(x / sum(x, na.rm = TRUE) * 100, 1)
round(y / sum(y, na.rm = TRUE) * 100, 1)
round(z / sum(z, na.rm = TRUE) * 100, 1)
```

. . .
2. Bonus: Write a function that takes a name as an input (i.e. a character string) and returns a greeting based on the current time of day. **Hint**: use a time argument that defaults to `lubridate::now()`. That will make it easier to test your function.

## Answers

1. Practice turning the following code snippets into functions. Think about what each function does. What would you call it? How many arguments does it need?

```{r}
#| eval: false
mean(is.na(x))
mean(is.na(y))
mean(is.na(z))
```

. . .
```{r}
#| output-location: fragment
prop_na <- function(x){
  mean(is.na(x))
}

set.seed(50) # <1>
values <- sample(c(seq(1, 10, 1), NA), 5, replace = TRUE)
values
```

1. `set.seed()` is a function that can be used to create reproducible results when writing code that involves creating variables that take on random values.

```{r}
#| output-location: fragment
prop_na(values)
```

This code calculates the proportion of `NA` values in a vector. I would call it `prop_na()`, which would take a single argument, `x`, and return a single numeric value between 0 and 1.

## Answers

1. Practice turning the following code snippets into functions. Think about what each function does. What would you call it? How many arguments does it need?

```{r}
#| eval: false
x / sum(x, na.rm = TRUE)
y / sum(y, na.rm = TRUE)
z / sum(z, na.rm = TRUE)
```

. . .

```{r}
#| output-location: fragment
sums_to_one <- function(x, na.rm = FALSE) {
  x / sum(x, na.rm = na.rm)
}

sums_to_one(values)
```

```{r}
#| output-location: fragment
sums_to_one(values, na.rm = TRUE)
```

This code standardizes a vector so that it sums to one. It takes a numeric vector and an optional specification for removing `NA`s. While the original code had `na.rm = TRUE`, it's best to set the default to `FALSE`, which will alert the user if `NA`s are present by returning `NA`.

## Answers

1. Practice turning the following code snippets into functions. Think about what each function does. What would you call it? How many arguments does it need?

```{r}
#| eval: false
round(x / sum(x, na.rm = TRUE) * 100, 1)
round(y / sum(y, na.rm = TRUE) * 100, 1)
round(z / sum(z, na.rm = TRUE) * 100, 1)
```

. . .

```{r}
#| output-location: fragment
pct_vec <- function(x, na.rm = FALSE){
  round(x / sum(x, na.rm = na.rm) * 100, 1)
}

pct_vec(values, na.rm = TRUE)
```

This code takes a numeric vector, finds what each value represents as a percentage of the sum of the entire vector, and rounds it to the first decimal place. There is also an optional `na.rm` argument set to `FALSE` by default.

## Answers

2. Bonus: Write a function that takes a name as an input (i.e. a character string) and returns a greeting based on the current time of day. **Hint 1**: use a time argument that defaults to `lubridate::now()`. That will make it easier to test your function. **Hint 2**: Use `rlang::englue()` to combine your greetings with the name input.

. . .

```{r}
#| output-location: fragment
greet <- function(name, time = now()){ # <2>
  hr <- hour(time)
  greeting <- case_when(hr < 12 & hr >= 5 ~ rlang::englue("Good morning {name}."), # <3>
                        hr < 17 & hr >= 12 ~ rlang::englue("Good afternoon {name}."), # <3>
                        hr >= 17 ~ rlang::englue("Good evening {name}."), # <3>
                        .default = rlang::englue("Why are you awake rn, {name}???")) # <3>
  return(greeting) # <4>
}

greet("Vic") # <5>
```

2. By default this function will take the current time to determine the specific greeting.
3. Using `englue()` allows you to include user-specified values with `{ }`.
4. The assignment to `greeting` returns its value invisibly, so ending the function with `return(greeting)` (or simply `greeting`) is what makes the greeting print when the function is called.
5. The last time this lecture (and therefore this code) was rendered was at `r lubridate::now()`

```{r}
#| output-location: fragment
greet("Vic", time = ymd_h("2024-05-14 2am"))
```

# Homework {.section-title background-color="#1e4655"}

## {data-menu-title="Homework 8" background-iframe="https://vsass.github.io/CSSS508/Homework/HW8/homework8.html" background-interactive=TRUE}