--- title: "Writing functions in R" author: "ENS-215" date: "26-Feb-2025" output: html_document: df_print: paged theme: journal toc: yes toc_float: yes ---
```{r message=FALSE, warning=FALSE} library(tidyverse) library(patchwork) # we'll use this for some figures in this class ```
You have been using **functions** in R since the very first day of this class. These functions have been functions from base R or from one of many packages that we've used. Functions are great as they allow you to automate tasks that you commonly need to do. Think about one of the most basic functions `mean()` -- this allows you to take the mean of a dataset without having to compute the sum and then divide that sum by the number of values in the dataset. More complex functions such as `lm()`, which computes a linear model, allow you to perform operations that take many lines of code in just a single function call. While, base R and the thousands of available R packages provide a wide variety of functions, many times you will find yourself needing a function that does not exist. You can write your own R functions to automate tasks that you frequently find yourself performing. Writing a function is a much, much better approach than simply copying and pasting code to repeat tasks. The benefits of function writing over the copy-and-paste approach is nicely summarized in the R4DS (excerpt below): > 1. You can give a function an evocative name that makes your code easier to understand. > 2. As requirements change, you only need to update code in one place, instead of many. > 3. You eliminate the chance of making incidental mistakes when you copy and paste (i.e. updating a variable name in one place, but not in another). So if you find yourself copying and pasting code to perform the same task repeatedly, you should stop and write a function.
## Writing functions ### Function basics Before we move ahead, let's have a brief refresher on some concepts and terminology related to functions. Functions, are passed **arguments** (*i.e.* inputs), the **body** of the function then perform operation(s) on these **arguments** and then the function returns **outputs**. When you use a function it is referred to as **calling** the function. A function is written in R as follows ```{r eval = FALSE} function_name <- function(argument_1, argument_2) { # code performing functions operation(s) # code performing functions operation(s) # code performing functions operation(s)... } ``` You can see that on the first line of the code above, the function `function_name` is defined. By assigning `function(argument_1, argument_2)` to `function_name` we are declaring a new function called `function_name`, which takes two arguments `argument_1` and `argument_2`. Note that you can write functions that have as many (of as few) arguments as you want/need. Inside the curly brackets `{}` is the **body** of the function -- this is the code that is executed when the function is called. In R, a function will **return**(*i.e.* output) the result of the last line in the body of the function.
To illustrate how we create a function in R, let's create a very simple function that takes a number as input and then returns the cube of that number. It is very good practice to include a description of the function, its inputs, and its outputs. ```{r} cube_it <- function(cube_me){ # Function computes the cube of the input # inputs: cube_me is the value to cube # output: the cube of cube_me cube_me^3 # take the cube of the input cube_me } ``` Notice how we gave the function a descriptive name, in this case we called it `cube_it`. We also gave the argument (input) an informative name, `cube_me`. While we can technically name our function and its arguments whatever we like, it is smart to give them informative names that will help us (and others) understand the purpose of the function. Let's give our function a quick test to confirm that it works as we expect. ```{r} cube_it(5) ```
Arguments are matched by name or position in the argument list. In the above example with the function `cube_it()`, we didn't refer to the argument name when calling the function. The argument is thus matched by position, which in the `cube_it()` example does not leave any room for confusion since there is only one argument. Note however, that we can call the function as follows ```{r} cube_it(cube_me = 5) ```
Referring to the arguments by name when you have only one argument is not really necessary, however when you have mulitiple arguments this can be very helpful. Oftentimes you may not recall the position of the arguments in the function definition and thus you do not want to rely on positional matching as you may mismatch arguments. Let's highlight this point with a basic example where we create a function that takes one number and raises it to a specified power: ```{r} pow_it <- function(base_val, exp_val){ # Raises an input value to a specified power # inputs: base_val is the value to be raised to a power; exp_val is the exponent value # output: base_val raised to the exp_val base_val^exp_val } ``` Recall that unnamed arguments are matched in the order in which they are defined in the function declaration. If I do not use the argument names in my function call, then they will be matched by position -- this could lead us to generate a result that doesn't match our intentions. To prevent this from occuring we can specify the names of the arguments when we calll the function.
Let's raise two to the third power $2^3$ ```{r} pow_it(base_val = 2, exp_val = 3) ``` Since we are naming the arguments, the position does not matter -- the name takes precedence. ```{r} pow_it(exp_val = 3, base_val = 2) ```
We could have also called the function and relied on positional matching ```{r} pow_it(2, 3) ``` Though this latter function call is less clear (from a human perspective) and we could have easily forgotten and called `pow(3, 2)` which would have squared three $3^2$, which would not have been our intention.
#### Exercise 1. Create a function that converts temperatures from degrees Fahrenheit to Celsius. Make sure to give descriptive names to your function and the function argument(s). i. Once you've created your function, test it out to make sure it is working properly. ii. Try passing a vector of values to your function. FYI, $Celsius = (Fahrenheit-32)*\frac{5}{9}$ 2. Create a simple function of your own. ```{r echo = FALSE} F_to_C <- function(deg_F){ (deg_F - 32)*(5/9) } ``` ```{r echo=FALSE, eval=FALSE} F_to_C(75) # pass a single value to our function F_to_C(c(10,20,30,40)) # pass a vector to our function ```
### Variables and scoping #### Argument defaults When calling a function, if you do not pass a value to an argument that is used in the body of the function, an error will result. Let's illustrate this point using our `pow_it()` function. Recall that `pow_it()` takes two arguments `base_val` and `exp_val`. Run the following code and see what happens. ```{r eval=FALSE} pow_it(base_val = 3) ``` You've now seen that the code above throws the following error, telling us that we are missing an argument > `Error in pow_it(3) : argument "exp_val" is missing, with no default`
While in many cases we want the code to "break" when it is used incorrectly, there are nonetheless situations where we would like the code to be "smart" and supply reasonable default arguments that prevent an error from occuring. We can specify default values for arguments when we declare a function. Let's redeclare `pow_it()` -- this time supplying a reasonable default for the `exp_val` argument ```{r} pow_it <- function(base_val, exp_val = 1){ base_val^exp_val } ``` In the function declaration above, we've supplied `exp_val` a default value of 1. If the user supplies a value for `exp_val` in their function call, then the user supplied value will be used, otherwise the default value will be applied. Let's illustrate this below. ```{r} pow_it(base_val = 3) ``` You can see that a default of 1 was used for `exp_val` ```{r} pow_it(base_val = 3, exp_val = 3) ``` However, when we supply a value for `exp_val` in our function call, this value is used.
#### Variable scoping In programming, **scoping** refers to the rules used to look up values associated with a variable. A function first looks within the function itself (*i.e.* its local environment) to identify the values associated with the variables in use. If it finds the values there, then it stops looking, otherwise it expands its search to the next level of the environment. If a value is never found, then an error is thrown. Let's illustrate this with an example. ```{r} add_something <- function(my_number){ add_number <- 1 my_number + add_number } ``` Notice that the variable `add_number` does not appear in our **Global Environment** (see the Environment pane in the top right of your RStudio window). This is because, variables defined within functions are **locally** defined and thus are only accessible within that function. R will first look for the variable `add_number` within the function -- since it finds a value there, it stops looking and uses that value for `add_number`. ```{r} add_something(my_number = 5) ```
To illustrate that R first looks inside the function and then stops if it finds the variable, let's create a variable `add_number` outside of our function. This variable will like in our **Global Environment** (see the Environment pane in the top right of your RStudio window). ```{r} add_number <- 4 add_something(my_number = 5) ``` You see that `add_something()` still uses the locally defined value for `add_number` when executing the function. However, if we hadn't assigned a value to `add_number` within our function, R would look outside of the function for this value. Let's illustrate this with an example. First, we will redefine the `add_something()` function, though this time we will not supply a value for `add_number` within the function ```{r} add_something <- function(my_number){ my_number + add_number } ``` Let's assign `add_number` a value outside of our function. You'll see that this variable is found in the **Global Environment** ```{r} add_number <- 10 ``` Now, let's run `add_something()` ```{r} add_something(my_number = 5) ``` First R looked for a value for `add_number` locally (*i.e.* within the function), since it did not find a value there it then looked outside of the function and found a value in the global environment, which it then used in the function evaluation.
#### Returning multiple outputs Functions in R return the last evaluated expression. In many cases we would like to return more than one value from a function. To return multiple outputs we can have the function return a vector `c()` or list `list()` that contains the set of values/objects we want to return. Let's illustrate how we do this with a simple example. We will create a function that computes the minimum, maximum, and mean of a dataset. ```{r} get_stats_bad <- function(input_data){ min(input_data) max(input_data) mean(input_data) } ``` Now, let's test this function on an example dataset ```{r} x = runif(1000, min = 0, max = 50) #generate a vector of 1000 random values between 0 and 50 ``` ```{r} get_stats_bad(x) ``` Since the last statement evaluated in `get_stats_bad()` is the mean, then that is the only value that is returned by our function. However, we would like to have the min, max, and mean returned.
Let's define a new function `get_stats_good()` that returns all of the computed statistics ```{r} get_stats_good <- function(input_data){ # Computes basic statistics on a univariate dataset # inputs: input_data is a vector of values # outputs: list containing the min, max, and mean of the input_data stat_min <- min(input_data) stat_max <- max(input_data) stat_mean <- mean(input_data) c(stat_min, stat_max, stat_mean) } ``` You can see in the function declaration above, the last evaluated statement is a vector containing all of the values we would like to return. Let's test out this function to demonstrate how it works. ```{r} get_stats_good(x) ``` You can see that the function performed as desired -- returning all three of the computed statistics (min, max, mean) in a single vector.
When you have outputs of the same type, then a vector works great for returning multiple values, however, when you want to return dissimilar data objects (e.g. a figure object and a data frame) a vector will not work. In these cases you can use a `list()` object, which can contain different data structures within it. Let's illustrate this with an updated version of our `get_stats_good()` function. We will compute the same statistics as before, though now we will also generate a histogram to plot the datasets distribution. The `list()` at the end of the function returns the statistics along with the `fig_hist` figure object. ```{r} get_stats_good <- function(input_data){ stat_min <- min(input_data) stat_max <- max(input_data) stat_mean <- mean(input_data) fig_hist <- ggplot() + geom_histogram(aes(x = input_data)) + theme_classic() list(stat_min, stat_max, stat_mean, fig_hist) } ``` Let's run the function to demonstrate how it works. ```{r} get_stats_good(x) ```
#### Exercise Let's create a function to efficiently download earthquake data from the USGS Earthquake Hazards Program real-time database of global earthquakes. The USGS has a great [webpage](https://earthquake.usgs.gov/earthquakes/search/) where you can input search parameters, query the earthquake real-time database, and then download the data. While this is really useful it nonetheless requires you to navigate to their website, input search parameters into several text boxes, download the data, and then save the data to your computer. All of these steps must be done before you can even load the data into R (which itself would be an additional step). You can imagine if you wanted to perform many queries of this database it would become very cumbersome and require that you repeat the above steps every time you wanted to do so. In order to avoid having to do this we can write and R function that will allow us to specify our search criteria and then it would automate the data acquisition process. In order to write an R function to automate this process, we first need to understand how the USGS database works. To do this let's do the following: 1) Go to the USGS Earthquake Hazards [search page](https://earthquake.usgs.gov/earthquakes/search/) 2) Keep the default search parameters and scroll down to the bottom of the page and click "Search" 3) A new window/tab will open. On the left-hand side of that page, scroll down to the very bottom and you will see a "DOWNLOAD" button. 4) Click the "DOWNLOAD" button and a box with download options will appear. Right-click the "CSV" link and and then click "Copy Link Address" 5) Paste this link in your notebook. You should get something that have the same structure/formatting as below > https://earthquake.usgs.gov/fdsnws/event/1/query.csv?starttime=2025-02-18%2000:00:00&endtime=2025-02-25%2023:59:59&minmagnitude=2.5&orderby=time When we inspect this URL we can see that it has a logical structure when the query parameters are part of the URL. Given the formulaic approach to building the URL from the users' query parameters, we can see that this is amenable to being turned into an R function. Let's go ahead and write a function, where the user can specify the start date, end date, and minimum earthquake magnitude and then the function will download that data directly into R as a data frame that you can immediately start to work with and analyze. Go ahead and try to write a function called `get_usgs_earthquakes()` to do this. Hint, you will want to construct the URL where you `paste()` together the different URL sections, making sure to put the user specified search parameters in their correct portion of the URL. Once you have constructed the URL, you can use the `read.csv()` function to download the data.
Try working with your neighbors to write this function. If you get stuck you can click below to reveal a solution, however, you should first give a concerted effort to solve the challenge before looking at the answer.
Click to expand
**Create the function ** ```{r} get_usgs_earthquakes <- function(start_date, end_date, min_magnitude = 2.5, start_time = "00:00:00", end_time = "23:59:59"){ ## Function info # start_date: start date for your query. Entered as a text string in the format "YYYY-MM-DD" # end_date: end date for your query. Entered as a text string in the format "YYYY-MM-DD" # min_magnitude: ## Function code base_url <- "https://earthquake.usgs.gov/fdsnws/event/1/query.csv?" full_url <- paste0(base_url,"starttime=", start_date, "%20", start_time, "&", "endtime=", end_date, "%20", end_time, "&", "minmagnitude=", min_magnitude, "&", "orderby=time") df_earthquakes <- read.csv(full_url) } ```
**Test the function** ```{r} df_earthquakes <- get_usgs_earthquakes(start_date = "2025-02-01", end_date = "2025-02-25") ``` ```{r} head(df_earthquakes) ```

#### Exercise 1. Create a function that take in NOAA state-wide temperature data and generates for a single, user specified state the following i. Table of monthly average temperatures (*i.e.* table with 12 rows, one for each month) ii. Table of annual average temperatures (*i.e.* table with a row for each year in the data) iii. Figure with a plot of time-series plot of the annual average temperatures. The data for this exercise is here ```{r message = FALSE} state_temps <- read_csv("https://stahlm.github.io/ENS_215/Data/noaa_cag_state_temperatures.csv") ``` ```{r} state_temps <- state_temps %>% rename(Avg_Temp_F = Value) %>% filter(YEAR < 2025) ``` You can call your function `state_climate_summary(state_temps, state_to_select)`. Where the argument `state_temps` is the data frame with the temperature data and the argument `state_to_select` is the name of the state of interest. An example function call would look like `state_climate_summary(state_temps, "New York")` ```{r echo=FALSE} state_climate_summary <- function(climate_data, state_to_select){ table_months <- climate_data %>% filter(STATE == state_to_select) %>% group_by(MONTH) %>% summarise(month_mean = mean(Avg_Temp_F), month_max = max(Avg_Temp_F), month_min = min(Avg_Temp_F) ) table_annual <- climate_data %>% filter(STATE == state_to_select) %>% group_by(YEAR) %>% summarise(annual_mean = mean(Avg_Temp_F)) fig_annual <- table_annual %>% ggplot(aes(x = YEAR, y = annual_mean)) + geom_line() + geom_point() + geom_smooth(method = "lm") + labs(title = paste("Annual mean temperature:", state_to_select), x = "", y = "Annual Mean Temp (F)", caption = "Data source: NOAA") + theme_classic() list(table_months, table_annual, fig_annual) } ``` Below is an example of what your output should look like ```{r} state_climate_summary(state_temps, "New York") ```
### Extra-credit exercise As an interesting exercise let's examine some meteorological data and write a function to estimate the wet-bulb temperature. The wet-bulb temperature is the temperature measured by a thermometer that is covered in a water-saturated cloth over which ambient air is allowed to pass. The wet-bulb temperature provides a measure of the temperature that would be reach under ambient conditions of evaporative cooling. From the standpoint of human and organism survival/comfort, the wet-bulb temperature is very important as it indicates the temperature that the organism could reach under the ambient conditions by evaporative cooling (e.g., sweating in humans, panting in dogs). In a hot but very dry environment, evaporative cooling is very effective, while in a hot and humid environment evaporation is minimal and thus evaporative cooling potential is very limited. This is why you can feel relatively comfortable in a hot and dry place (e.g., Arizona) while often feeling much less comfortable in a less hot but humid environment (e.g., New York in the summer). Thus, the wet-bulb temperature is always less than or equal to the air temperature (wet bulb and air temperature are exactly equal when the relative humidity is 100%). When wet-bulb temperatures reach the high 20's (in degrees C), human health can become endangered and heat stress can occur. For a fit, acclimated person who is in the shade with a strong breeze and is fully hydrated, physical labor becomes difficult or impossible at a wet-bulb temperature of 31 degrees C or higher. For these same conditions a wet-bulb temperature of 35 degrees C or higher for more than 6 hours can lead to death [(Buzan and Huber, 2020)](https://www.annualreviews.org/doi/10.1146/annurev-earth-053018-060100). These wet-bulb temperature thresholds for physical labor and risk of death are even lower for individuals who are not acclimated, have health conditions, and/or where access to shaded and breezy conditions are not available. For instance, high levels of fatalities occurred during the 2003 European and 2010 Russian heat waves, both of which had wet-bulb temperatures that did not exceed 28 degrees C [(Raymond et al., 2020)](https://www.science.org/doi/10.1126/sciadv.aaw1838). Notably, some areas in southwest Asia are expected to have wet-bulb temperatures that frequently exceed the threshold for human adaptability by 2100 [(Pal and Eltahir 2016)](http://www.nature.com/articles/nclimate2833). While most meteorological stations measure air temperature and relative humidity, the wet-bulb temperature is much less commonly measured. However, it is possible to estimate the wet-bulb temperature from measurements of air temperature and relative humidity. We will use the approach of [Stull (2011)](https://journals.ametsoc.org/view/journals/apme/50/11/jamc-d-11-0143.1.xml) to do this. $$\begin{aligned} T_{wet} = T_{air}*atan(0.151977*(RH + 8.313659)^{1/2}) + atan(T_{air} +RH) - atan(RH - 1.676331) + \\ (0.00391838*(RH)^{3/2}) * atan(0.023101*RH) - 4.686035 \end{aligned} $$ Where $T_{wet}$ is the estimated wet-bulb temperature; $T_{air}$ is the measured air temperature in degrees C and $RH$ is the measured relative humidity in percent. Note that the relative humidity would in the equation above is entered as a numeric value for the percent (e.g., 85% relative humidity would be entered as 85). 1. Write a function called `calc_wetbulb()` that takes air temperature and relative humidity as inputs and returns the wet-bulb temperature. You can check if your function is working correctly by entering an air temperature of 20 deg C and a relative humidity of 15 percent, which should give you a wet-bulb temperature of 6.96 deg C. ```{r echo = F, message= F} calc_wetbulb <- function(T_air, rh_percent){ # T_air is the air temperature in degrees C # rh_percent is the relative humidity (e.g., 85% relative humidity would be entered as 85) T_wetbulb <- T_air*atan(0.151977*(rh_percent + 8.313659)^0.5) + atan(T_air + rh_percent) - atan(rh_percent - 1.676331) + (0.00391838*(rh_percent)^(3/2)) * atan(0.023101*rh_percent) - 4.686035 } ``` 2. Compute the hourly wet-bulb temperatures for Chicago in 1995 and for Samawah, Iraq in 2015. There was a significant heat wave in [Chicago in 1995](https://en.wikipedia.org/wiki/1995_Chicago_heat_wave), which led to \>700 deaths and there was significant heat wave across many areas of the [Middle East in 2015](https://www.theguardian.com/world/2015/aug/04/middle-east-swelters-in-heatwave-as-temperatures-top-50c). To get hourly meteorological data we will use the `worldmet` package (FYI, we used this package [earlier in the term](https://stahlm.github.io/ENS_215/Winter_2025/Lectures/2025_02_07.html)) ```{r} library(worldmet) ``` Let's get data for 1995 for a met station in Chicago and for 2015 for a met station in Samawah, Iraq. ```{r message = F, warning = F, eval = F} worldmet_data_chicago <- importNOAA(code = "725300-94846", year = 1995) worldmet_data_chicago <- worldmet_data_chicago %>% filter(!is.na(RH), !is.na(air_temp)) # remove observations that are missing RH and/or air temp data ``` ```{r message = F, warning = F, eval = F} worldmet_data_samwawah <- importNOAA(code = "406740-99999", year = 2015) worldmet_data_samwawah <- worldmet_data_samwawah %>% filter(!is.na(RH), !is.na(air_temp)) # remove observations that are missing RH and/or air temp data ``` Now you should compute the wet-bulb temperatures for Chicago and add this data as a column to your Chicago data frame. Do the same for Samawah, Iraq. Note that you can do this by putting your `calc_wetbulb()` function inside of a `mutate()`. See an example below. ```{r eval = F} worldmet_data_chicago <- worldmet_data_chicago %>% mutate(wet_bulb = calc_wetbulb(air_temp, RH)) worldmet_data_samwawah <- worldmet_data_samwawah %>% mutate(wet_bulb = calc_wetbulb(air_temp, RH)) ``` Once, you've computed the wet-bulb temperatures, explore this data. You might try plotting time-series of the wet-bulb temperature, computing statistics on the wet-bulb temperatures (e.g., mean, max, 95th percentile,...) or any other analysis that you think would be interesting. ```{r eval = F, echo = F, message= F, warning= F} library(patchwork) fig_1 <- worldmet_data_chicago %>% ggplot(aes(x = date, y = wet_bulb)) + geom_line() + labs(x = "", y = "Wet-bulb temperature (C)") + theme_bw() ``` ```{r eval = F, echo = F, message= F, warning= F} fig_2 <- worldmet_data_samwawah %>% ggplot(aes(x = date, y = wet_bulb)) + geom_line() + labs(x = "", y = "Wet-bulb temperature (C)") + theme_bw() ``` ```{r eval = F, echo = F, message= F, warning= F} fig_1 / fig_2 ``` ```{r eval = F, echo = F, message= F, warning= F} all_met <- bind_rows(worldmet_data_chicago, worldmet_data_samwawah) ``` ```{r eval = F, echo = F, message= F, warning= F} all_met %>% group_by(station) %>% summarise(wetbulb_max = max(wet_bulb), wetbul_mean = mean(wet_bulb), wet_bulb_90th = quantile(wet_bulb, 0.90), wet_bulb_95th = quantile(wet_bulb, 0.95), wet_bulb_99th = quantile(wet_bulb, 0.99) ) ```