library(tidyverse)
Functions are an important tool in the computational social scientist’s toolkit. They enable you to avoid repetition and copy-and-paste and greatly increase the efficiency of your code writing.
In fact, you have used functions the entire time you have programmed in R. The only difference is that the functions were written for you. tidyr
, dplyr
, ggplot2
, all of these libraries contain major functions for tidying, transforming, and visualizing data. You have the power to write your own functions. Well, if you don’t already you soon will.
Good rule of thumb - if you have copied and pasted a block of code more than twice, it is time to convert it to a function.
Functions have three key components:
function()
.{}
that immediately follows function(...)
, and is the code that you developed to perform the action described in the name using the arguments you provide.rescale
functionHere is a user-generated function from R for Data Science. Analyze it and identify the three key components.
rescale01 <- function(x) {
rng <- range(x, na.rm = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
}
rescale01(c(0, 5, 10))
## [1] 0.0 0.5 1.0
rescale01(c(-10, 0, 10))
## [1] 0.0 0.5 1.0
rescale01(c(1, 2, 3, NA, 5))
## [1] 0.00 0.25 0.50 NA 1.00
rescale01
x
- the variable to be transformedx
is a conventional namex
, y
, z
, etc., or take on informative names such as data
, formula
, na.rm
, etc.rng
.This function can easily be reused for any numeric variable. Rather than writing out the contents of the function every time, we just use the function itself.
Analyze the following function.
a = 3
and b = 4
, what should we expect the output to be?pythagorean <- function(a, b){
hypotenuse <- sqrt(a^2 + b^2)
return(hypotenuse)
}
pythagorean
function
a
- length of one side of a right triangleb
- length of another side of a right triangle{}
that immediately follows function(...)
hypotenuse
which is the square root of the sum of squares of the two sides of the right triangle (also called the hypotenuse)return
hypotenuse
as the output of the function. I could also have written the function as:pythagorean <- function(a, b){
hypotenuse <- sqrt(a^2 + b^2)
}
or even:
pythagorean <- function(a, b){
sqrt(a^2 + b^2)
}
But I wanted to explicitly identify each step of the code for others to review. Early on in your function writing career, you will want to be more explicit so future you can interpret your own code. As you practice and become more comfortable writing functions, you can be more relaxed in your coding style and documentation.
When using functions, by default the returned object is merely printed to the screen.
pythagorean(a = 3, b = 4)
## [1] 5
If you want it saved, you need to assign it to an object.
(tri_c <- pythagorean(a = 3, b = 4))
## [1] 5
pythagorean(a = 3, b = 4)
## [1] 5
hypotenuse
## Error in eval(expr, envir, enclos): object 'hypotenuse' not found
Why does this generate an error? Why can we not see the results of hypotenuse
? After all, it was generated by pythagorean
, right?
Objects created inside a function exist within their own environment. Typically you are working in the global environment. You can see all objects that exist in that environment in the top-right panel.
Objects created within a function are destroyed once the function completes its execution, unless you return
the object as part of the output. This is why you do not see hypotenuse
listed in the environment - it has already been destroyed.
Write a function that calculates the sum of the squared value of two numbers. For instance, it should generate the following output:
my_function(3, 4)
## [1] 25
sum_of_squares <- function(x, y){
x^2 + y^2
}
sum_of_squares(3, 4)
## [1] 25
sum_of_squares
x
- one numbery
- a second numberx
and y
independently and then adds the results togetherCool fact - this function also works with vectors of numbers
x <- c(2, 4, 6)
y <- c(1, 3, 5)
sum_of_squares(x, y)
## [1] 5 25 61
Sometimes you only want to execute code if a condition is met. To do that, use an if-else statement.
if (condition) {
# code executed when condition is TRUE
} else {
# code executed when condition is FALSE
}
condition
must always evaluate to either TRUE
or FALSE
.1 This is similar to filter()
, except condition
can only be a single value (i.e. a vector of length 1), whereas filter()
works for entire vectors (or columns).
You can chain conditional statements together:
if (this) {
# do that
} else if (that) {
# do something else
} else {
# do something completely different
}
But this can get tedious if you need to consider many conditions. There are alternatives in R for some of these long conditional statements. For instance, if you want to convert a continuous (or numeric) variable to categories, use cut()
:
diamonds %>%
select(carat) %>%
mutate(carat_autobin = cut(carat, breaks = 5),
carat_manbin = cut(carat,
breaks = c(0, 1, 2, 3, 6),
labels = c("Small", "Medium", "Large", "Huge")))
## # A tibble: 53,940 x 3
## carat carat_autobin carat_manbin
## <dbl> <fctr> <fctr>
## 1 0.23 (0.195,1.16] Small
## 2 0.21 (0.195,1.16] Small
## 3 0.23 (0.195,1.16] Small
## 4 0.29 (0.195,1.16] Small
## 5 0.31 (0.195,1.16] Small
## 6 0.24 (0.195,1.16] Small
## 7 0.24 (0.195,1.16] Small
## 8 0.26 (0.195,1.16] Small
## 9 0.22 (0.195,1.16] Small
## 10 0.23 (0.195,1.16] Small
## # ... with 53,930 more rows
if
versus if_else()
Because if-else conditional statements like the ones outlined above must always resolve to a single TRUE
or FALSE
, they cannot be used for vector operations. Vector operations are where you make multiple comparisons simultaneously for each value stored inside a vector. Consider the gun_deaths
data and imagine you wanted to create a new column identifying whether or not an individual had at least a high school education.
library(rcfss)
data("gun_deaths")
(educ <- select(gun_deaths, education))
## # A tibble: 100,798 x 1
## education
## <fctr>
## 1 BA+
## 2 Some college
## 3 BA+
## 4 BA+
## 5 HS/GED
## 6 Less than HS
## 7 HS/GED
## 8 HS/GED
## 9 Some college
## 10 NA
## # ... with 100,788 more rows
This sounds like a classic if-else operation. For each individual, if education
equals “Less than HS”, then the value in the new column should be “Less than HS”. Otherwise, it should be “HS+”. But what happens if we try to implement this using an if-else operation like above?
(educ_if <- educ %>%
mutate(hsPlus = if(education == "Less than HS"){
"Less than HS"
} else {
"HS+"
}))
## Warning in if (education == "Less than HS") {: the condition has length > 1
## and only the first element will be used
## # A tibble: 100,798 x 2
## education hsPlus
## <fctr> <chr>
## 1 BA+ HS+
## 2 Some college HS+
## 3 BA+ HS+
## 4 BA+ HS+
## 5 HS/GED HS+
## 6 Less than HS HS+
## 7 HS/GED HS+
## 8 HS/GED HS+
## 9 Some college HS+
## 10 NA HS+
## # ... with 100,788 more rows
This did not work correctly. Because if()
can only handle a single TRUE
/FALSE
value, it only checked the first row of the data frame. That row contained “BA+” for the individual, so it generated a vector of length 100798 with each value being “HS+”.
count(educ_if, hsPlus)
## # A tibble: 1 x 2
## hsPlus n
## <chr> <int>
## 1 HS+ 100798
Because we in fact want to make this if-else comparison 100798 times, we should instead use if_else()
. This vectorizes the if-else comparison and makes a separate comparison for each row of the data frame. This allows us to correctly generate this new column.2
(educ_ifelse <- educ %>%
mutate(hsPlus = if_else(education == "Less than HS", "Less than HS", "HS+")))
## # A tibble: 100,798 x 2
## education hsPlus
## <fctr> <chr>
## 1 BA+ HS+
## 2 Some college HS+
## 3 BA+ HS+
## 4 BA+ HS+
## 5 HS/GED HS+
## 6 Less than HS Less than HS
## 7 HS/GED HS+
## 8 HS/GED HS+
## 9 Some college HS+
## 10 <NA> <NA>
## # ... with 100,788 more rows
count(educ_ifelse, hsPlus)
## # A tibble: 3 x 2
## hsPlus n
## <chr> <int>
## 1 HS+ 77553
## 2 Less than HS 21823
## 3 <NA> 1422
fizzbuzz
functionFizz buzz is a children’s game that teaches about division. Players take turns counting incrementally, replacing any number divisible by three with the word “fizz” and any number divisible by five with the word “buzz”.
Likewise, a fizzbuzz
function takes a single number as input. If the number is divisible by three, it returns “fizz”. If it’s divisible by five it returns “buzz”. If it’s divisible by three and five, it returns “fizzbuzz”. Otherwise, it returns the number.
The output of your function should look like this:
my_function(3)
## [1] "fizz"
my_function(5)
## [1] "buzz"
my_function(15)
## [1] "fizzbuzz"
my_function(4)
## [1] 4
%%
is modular division. It returns the remainder left over after the division, rather than a floating point number.
5 / 3
## [1] 1.666667
5 %% 3
## [1] 2
fizzbuzz <- function(x){
if(x %% 3 == 0 && x %% 5 == 0){
return("fizzbuzz")
} else if(x %% 3 == 0){
return("fizz")
} else if(x %% 5 == 0){
return("buzz")
} else{
return(x)
}
}
fizzbuzz(3)
## [1] "fizz"
fizzbuzz(5)
## [1] "buzz"
fizzbuzz(15)
## [1] "fizzbuzz"
fizzbuzz(4)
## [1] 4
fizzbuzz
x
- a numberx
is evenly divisible with 3 and/or 5.x
is a “fizzbuzz” (evenly divisible by 3 and 5). This should be the first comparison because it needs to return “fizzbuzz”. If we had this at the end of the comparison chain, the function would prematurely return on “fizz” or “buzz”.
TRUE
, then print “fizzbuzz”x
is a “fizz” (divisible by 3).
TRUE
, then print “fizz”x
is a “buzz” (divisible by 5).
TRUE
, then print “buzz”FALSE
, then print the original number x
.
devtools::session_info()
## Session info -------------------------------------------------------------
## setting value
## version R version 3.4.3 (2017-11-30)
## system x86_64, darwin15.6.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## tz America/Chicago
## date 2018-04-04
## Packages -----------------------------------------------------------------
## package * version date source
## assertthat 0.2.0 2017-04-11 CRAN (R 3.4.0)
## backports 1.1.2 2017-12-13 CRAN (R 3.4.3)
## base * 3.4.3 2017-12-07 local
## bindr 0.1.1 2018-03-13 CRAN (R 3.4.3)
## bindrcpp 0.2 2017-06-17 CRAN (R 3.4.0)
## broom 0.4.4 2018-03-29 CRAN (R 3.4.3)
## cellranger 1.1.0 2016-07-27 CRAN (R 3.4.0)
## cli 1.0.0 2017-11-05 CRAN (R 3.4.2)
## colorspace 1.3-2 2016-12-14 CRAN (R 3.4.0)
## compiler 3.4.3 2017-12-07 local
## crayon 1.3.4 2017-10-03 Github (gaborcsardi/crayon@b5221ab)
## datasets * 3.4.3 2017-12-07 local
## devtools 1.13.5 2018-02-18 CRAN (R 3.4.3)
## digest 0.6.15 2018-01-28 CRAN (R 3.4.3)
## dplyr * 0.7.4.9000 2017-10-03 Github (tidyverse/dplyr@1a0730a)
## evaluate 0.10.1 2017-06-24 CRAN (R 3.4.1)
## forcats * 0.3.0 2018-02-19 CRAN (R 3.4.3)
## foreign 0.8-69 2017-06-22 CRAN (R 3.4.3)
## ggplot2 * 2.2.1 2016-12-30 CRAN (R 3.4.0)
## glue 1.2.0 2017-10-29 CRAN (R 3.4.2)
## graphics * 3.4.3 2017-12-07 local
## grDevices * 3.4.3 2017-12-07 local
## grid 3.4.3 2017-12-07 local
## gtable 0.2.0 2016-02-26 CRAN (R 3.4.0)
## haven 1.1.1 2018-01-18 CRAN (R 3.4.3)
## hms 0.4.2 2018-03-10 CRAN (R 3.4.3)
## htmltools 0.3.6 2017-04-28 CRAN (R 3.4.0)
## httr 1.3.1 2017-08-20 CRAN (R 3.4.1)
## jsonlite 1.5 2017-06-01 CRAN (R 3.4.0)
## knitr 1.20 2018-02-20 CRAN (R 3.4.3)
## lattice 0.20-35 2017-03-25 CRAN (R 3.4.3)
## lazyeval 0.2.1 2017-10-29 CRAN (R 3.4.2)
## lubridate 1.7.3 2018-02-27 CRAN (R 3.4.3)
## magrittr 1.5 2014-11-22 CRAN (R 3.4.0)
## memoise 1.1.0 2017-04-21 CRAN (R 3.4.0)
## methods * 3.4.3 2017-12-07 local
## mnormt 1.5-5 2016-10-15 CRAN (R 3.4.0)
## modelr 0.1.1 2017-08-10 local
## munsell 0.4.3 2016-02-13 CRAN (R 3.4.0)
## nlme 3.1-131.1 2018-02-16 CRAN (R 3.4.3)
## parallel 3.4.3 2017-12-07 local
## pillar 1.2.1 2018-02-27 CRAN (R 3.4.3)
## pkgconfig 2.0.1 2017-03-21 CRAN (R 3.4.0)
## plyr 1.8.4 2016-06-08 CRAN (R 3.4.0)
## psych 1.7.8 2017-09-09 CRAN (R 3.4.1)
## purrr * 0.2.4 2017-10-18 CRAN (R 3.4.2)
## R6 2.2.2 2017-06-17 CRAN (R 3.4.0)
## Rcpp 0.12.16 2018-03-13 CRAN (R 3.4.4)
## readr * 1.1.1 2017-05-16 CRAN (R 3.4.0)
## readxl 1.0.0 2017-04-18 CRAN (R 3.4.0)
## reshape2 1.4.3 2017-12-11 CRAN (R 3.4.3)
## rlang 0.2.0 2018-02-20 cran (@0.2.0)
## rmarkdown 1.9 2018-03-01 CRAN (R 3.4.3)
## rprojroot 1.3-2 2018-01-03 CRAN (R 3.4.3)
## rstudioapi 0.7 2017-09-07 CRAN (R 3.4.1)
## rvest 0.3.2 2016-06-17 CRAN (R 3.4.0)
## scales 0.5.0 2017-08-24 cran (@0.5.0)
## stats * 3.4.3 2017-12-07 local
## stringi 1.1.7 2018-03-12 CRAN (R 3.4.3)
## stringr * 1.3.0 2018-02-19 CRAN (R 3.4.3)
## tibble * 1.4.2 2018-01-22 CRAN (R 3.4.3)
## tidyr * 0.8.0 2018-01-29 CRAN (R 3.4.3)
## tidyverse * 1.2.1 2017-11-14 CRAN (R 3.4.2)
## tools 3.4.3 2017-12-07 local
## utils * 3.4.3 2017-12-07 local
## withr 2.1.2 2018-03-15 CRAN (R 3.4.4)
## xml2 1.2.0 2018-01-24 CRAN (R 3.4.3)
## yaml 2.1.18 2018-03-08 CRAN (R 3.4.4)
These are Boolean logical values - we used them to make comparisons and will talk more next class about logical vectors.↩
Notice that is also preserves missing values in the new column. Remember, any operation performed on a missing value will itself become a missing value.↩
This work is licensed under the CC BY-NC 4.0 Creative Commons License.