library(tidyverse)
set.seed(1234)
A software bug is “an error, flaw, failure or fault in a computer program or system that causes it to produce an incorrect or unexpected result, or to behave in unintended ways.”1 In an ideal world, the computer will warn you when it encounters a bug. R can do this in some situations (see our discussion below of errors, warnings, and messages). However, bugs also arise when you expect the program to do one thing but write code that allows it to do something different.
As I have repeatedly emphasized in class, computers are powerful tools that are incredibly stupid. They will do exactly what you tell them to, nothing more and nothing less. If you write your code in a way that allows it to behave in an unintended way, this is your fault. The first goal of debugging should be to prevent unintended behaviors before they strike. However, when such bugs occur we need the tools and knowledge to track down these unintended behaviors and correct them in our code.
The most important step in debugging is to prevent bugs in the first place. There are several methods we can employ to do that. Some are simple, such as styling our code so that we follow consistent practices when writing scripts and programs. Consistency helps prevent silly and minor mistakes such as typos. Good style also makes our code more readable for the human eye and allows us to isolate and detect errors merely by looking at the screen. Others are more advanced and focus on the concept of failing fast - as soon as something goes wrong, stop executing the program and announce an error.
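To make failing fast concrete, here is a minimal sketch. The function `inches_to_feet()` is a hypothetical helper invented for this illustration, not something from the course materials. It checks its input up front and calls `stop()` as soon as the input is unusable, rather than letting a confusing error surface later.

```r
# Hypothetical helper (an assumption for illustration): convert inches to feet,
# failing fast on input that cannot be converted.
inches_to_feet <- function(inches) {
  # stop immediately with an informative error instead of continuing
  if (!is.numeric(inches)) {
    stop("`inches` must be numeric, not ", class(inches)[1], call. = FALSE)
  }
  inches / 12
}

# valid input works as expected
inches_to_feet(c(12, 30))

# capture the error so we can inspect the message without halting the script
result <- tryCatch(
  inches_to_feet("thirty"),
  error = function(e) conditionMessage(e)
)
result
```

The error message names the offending argument and what was received, which is usually enough to locate the mistake without further digging.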
Think back to the analogy of programming languages to human languages. Programming languages adhere to a specific grammar and syntax, they contain a vocabulary, etymology, cultural conventions, word roots (prefixes and suffixes), just like English or any other written or spoken language. We can therefore equate different components of a program to their language counterparts:
Programming | Language |
---|---|
Scripts | Essays |
Sections | Paragraphs |
Line Breaks | Sentences |
Parentheses | Punctuation |
Functions | Verbs |
Variables | Nouns |
Now think about how you write a document in English. In 1986, the Challenger space shuttle exploded just 73 seconds after takeoff. The deaths of seven crewmembers were seen live by millions of American schoolchildren watching around the country. A few hours after the tragedy, President Ronald Reagan gave a national address. Here is an excerpt of that address:
weve grown used to wonders in this century its hard to dazzle us but for 25 years the united states space program has been doing just that weve grown used to the idea of space and perhaps we forget that weve only just begun were still pioneers they the members of the Challenger crew were pioneers and i want to say something to the school children of America who were watching the live coverage of the shuttles takeoff i know it is hard to understand but sometimes painful things like this happen its all part of the process of exploration and discovery its all part of taking a chance and expanding mans horizons the future doesnt belong to the fainthearted it belongs to the brave the challenger crew was pulling us into the future and well continue to follow them the crew of the space shuttle challenger honored us by the manner in which they lived their lives we will never forget them nor the last time we saw them this morning as they prepared for the journey and waved goodbye and slipped the surly bonds of earth to touch the face of god
Wait a minute, this doesn’t look right. What happened to the punctuation? The capitalization? Where are all the sentences and paragraph breaks? Isn’t this hard to read and understand? Do you feel any of the emotions of the moment? Probably not, because the normal rules of grammar and syntax have been destroyed. Here’s the same excerpt, but properly styled:
We’ve grown used to wonders in this century. It’s hard to dazzle us. But for 25 years the United States space program has been doing just that. We’ve grown used to the idea of space, and perhaps we forget that we’ve only just begun. We’re still pioneers. They, the members of the Challenger crew, were pioneers.
And I want to say something to the school children of America who were watching the live coverage of the shuttle’s takeoff. I know it is hard to understand, but sometimes painful things like this happen. It’s all part of the process of exploration and discovery. It’s all part of taking a chance and expanding man’s horizons. The future doesn’t belong to the fainthearted; it belongs to the brave. The Challenger crew was pulling us into the future, and we’ll continue to follow them….
The crew of the space shuttle Challenger honoured us by the manner in which they lived their lives. We will never forget them, nor the last time we saw them, this morning, as they prepared for the journey and waved goodbye and ‘slipped the surly bonds of earth’ to ‘touch the face of God.’
That makes much more sense. Adhering to standard rules of style makes the text more legible and interpretable. This is what we should aim for when writing programs in R.2
Here are some common rules you should adopt when writing code in R, adapted from Hadley Wickham’s style guide.
Files should have intuitive and meaningful names. Avoid spaces or non-standard characters in your file names. R scripts should always end in .R; R Markdown documents should always end in .Rmd.
# Good
fit-models.R
utility-functions.R
gun-deaths.Rmd
# Bad
foo.r
stuff.r
gun deaths.rmd
Variables refer to data objects such as vectors, lists, or data frames. Variable and function names should be lowercase. Use an underscore (_) to separate words within a name. Avoid using periods (.).3 Variable names should generally be nouns and function names should be verbs. Try to pick names that are concise and meaningful.
# Good
day_one
day_1
# Bad
first_day_of_the_month
DayOne
dayone
djm1
Where possible, avoid using names of existing functions and variables. Doing so will cause confusion for the readers of your code, not to mention make it difficult to access the existing functions and variables.
# Bad
T <- FALSE
c <- 10
For instance, what would happen if I created a new mean() function?
x <- 1:10
mean(x)
## [1] 5.5
# create new mean function
mean <- function(x) sum(x)
mean(x)
## [1] 55
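If you do mask a function by accident, the original is not lost. The sketch below repeats the masking and then shows two ways to get back to the base version: calling it through its namespace with `base::mean()`, or deleting your own copy with `rm()`. The variable names `masked`, `original`, and `restored` are only there for illustration.

```r
x <- 1:10

# mask base::mean() with a function that actually computes a sum
mean <- function(x) sum(x)
masked <- mean(x)

# the original is still reachable through its namespace
original <- base::mean(x)

# removing our copy from the workspace unmasks the base function again
rm(mean)
restored <- mean(x)

c(masked = masked, original = original, restored = restored)
```

Being able to recover does not make the masking harmless; anyone reading `mean(x)` in the middle of the script would still assume it computes an average.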
Place spaces around all infix operators (=, +, -, <-, etc.). The same rule applies when using = in function calls.
Always put a space after a comma, and never before (just like in regular English).
# Good
average <- mean(feet / 12 + inches, na.rm = TRUE)
# Bad
average<-mean(feet/12+inches,na.rm=TRUE)
Place a space before left parentheses, except in a function call.
Note: I’m terrible at remembering to do this for if-else or for loops. I typically never place a space before left parentheses, but it is supposed to be good practice. Just remember to be consistent whatever approach you choose.
# Good
if (debug) do(x)
plot(x, y)
# Bad
if(debug)do(x)
plot (x, y)
Do not place spaces around code in parentheses or square brackets (unless there’s a comma, in which case see above).
# Good
if (debug) do(x)
diamonds[5, ]
# Bad
if ( debug ) do(x) # No spaces around debug
x[1,] # Needs a space after the comma
x[1 ,] # Space goes after comma not before
An opening curly brace should never go on its own line and should always be followed by a new line. A closing curly brace should always go on its own line, unless it’s followed by else.
Always indent the code inside curly braces.
# Good
if (y < 0 && debug) {
  message("Y is negative")
}
if (y == 0) {
  log(x)
} else {
  y ^ x
}
# Bad
if (y < 0 && debug)
message("Y is negative")
if (y == 0) {
  log(x)
}
else {
  y ^ x
}
It’s ok to leave very short statements on the same line:
if (y < 0 && debug) message("Y is negative")
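In R, the rule about `else` is more than cosmetic. At the top level of a script the parser treats the `if` statement as finished once it reaches the end of the line with the closing brace, so an `else` that starts the next line is a syntax error. Inside a function body or another pair of braces the same layout does parse. The sketch below checks both layouts with `parse()`, which only inspects syntax and never evaluates `x` or `y`. The helper `parses()` is a hypothetical name introduced for this illustration.

```r
# "Bad" layout: `else` starts its own line at the top level
bad <- "if (y == 0) {\n  log(x)\n}\nelse {\n  y ^ x\n}"

# "Good" layout: `else` follows the closing brace on the same line
good <- "if (y == 0) {\n  log(x)\n} else {\n  y ^ x\n}"

# Hypothetical helper: TRUE if the text is syntactically valid R, FALSE otherwise
parses <- function(code) {
  tryCatch({
    parse(text = code)
    TRUE
  }, error = function(e) FALSE)
}

parses(good)
parses(bad)
```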
Strive to limit your code to 80 characters per line. This fits comfortably on a printed page with a reasonably sized font. For instance, if I wanted to convert the chief column to a factor for building a faceted graph:
# Good
scdbv <- mutate(scdbv,
                chief = factor(chief, levels = c("Jay", "Rutledge", "Ellsworth",
                                                 "Marshall", "Taney", "Chase",
                                                 "Waite", "Fuller", "White",
                                                 "Taft", "Hughes", "Stone",
                                                 "Vinson", "Warren", "Burger",
                                                 "Rehnquist", "Roberts")))
# Bad
scdbv <- mutate(scdbv, chief = factor(chief, levels = c("Jay", "Rutledge", "Ellsworth", "Marshall", "Taney", "Chase", "Waite", "Fuller", "White", "Taft", "Hughes", "Stone", "Vinson", "Warren", "Burger", "Rehnquist", "Roberts")))
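If you want to check line lengths without eyeballing them, base R's `nchar()` is enough. The sketch below is one possible approach, not an official tool: `find_long_lines()` is a hypothetical helper, and the two-element `code` vector stands in for the lines of a real script, which you could read with `readLines()`.

```r
# Hypothetical helper: return the positions of lines longer than `limit`
find_long_lines <- function(lines, limit = 80) {
  which(nchar(lines) > limit)
}

# two lines of made-up code: one short, one deliberately too long
code <- c(
  "x <- 5",
  paste0("y <- c(", paste(rep("1", 40), collapse = ", "), ")")
)

nchar(code)
find_long_lines(code)
```

In this example only the second line is flagged.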
When indenting your code, use two spaces. Never use tabs or mix tabs and spaces.
By default, RStudio automatically converts tabs to two spaces in your code. So if you use the Tab key in RStudio, you’re good to go.
The only exception is if a function definition runs over multiple lines. In that case, indent the second line to where the definition starts:
# pure function
long_function_name <- function(a = "a long argument",
                               b = "another argument",
                               c = "another long argument") {
  # As usual code is indented by two spaces.
}
# in a mutate() function
scdbv <- scdbv %>%
  mutate(majority = majority - 1,
         chief = factor(chief, levels = c("Jay", "Rutledge", "Ellsworth",
                                          "Marshall", "Taney", "Chase",
                                          "Waite", "Fuller", "White",
                                          "Taft", "Hughes", "Stone",
                                          "Vinson", "Warren", "Burger",
                                          "Rehnquist", "Roberts")))
Use <-, not =, for assignment. Why? Because I said so. Or read more here.
# Good
x <- 5
# Bad
x = 5
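One commonly cited practical difference between the two operators shows up inside function calls, where `=` names an argument while `<-` performs an assignment. The sketch below illustrates that behavior; it is not necessarily the argument made in the linked reading. The bookkeeping variables `after_equals` and `after_arrow` are introduced only for this illustration.

```r
# start from a clean slate so the checks below are meaningful
if (exists("x", inherits = FALSE)) rm(x)

# inside a call, `=` only names the argument; no variable x is created
median(x = 1:10)
after_equals <- exists("x", inherits = FALSE)

# `<-` inside a call assigns x in the calling environment, then passes it on
median(x <- 1:10)
after_arrow <- exists("x", inherits = FALSE)

c(after_equals = after_equals, after_arrow = after_arrow)
```

Using `<-` for assignment and reserving `=` for arguments keeps the two meanings visually distinct.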
There are two built-in methods of using RStudio to automatically format and clean up your code. They are not perfect, but can help in some circumstances.
Code > Reformat Code (Shift + Cmd/Ctrl + A)
# comments are retained
1+1
if(TRUE){
x=1 # inline comments
}else{
x=2;print('Oh no... ask the right bracket to go away!')}
1*3 # one space before this comment will become two!
2+2+2 # only 'single quotes' are allowed in comments
diamonds %>%
filter(color == "I") %>%
group_by(cut) %>%
summarize(price = mean(price))
lm(y~x1+x2, data=data.frame(y=rnorm(100),x1=rnorm(100),x2=rnorm(100))) ### a linear model
1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1 ## comments after a long line
## here is a long long long long long long long long long long long long long long long long long long long comment
# comments are retained
1 + 1
if (TRUE) {
x = 1 # inline comments
} else{
x = 2
print('Oh no... ask the right bracket to go away!')
}
1 * 3 # one space before this comment will become two!
2 + 2 + 2 # only 'single quotes' are allowed in comments
diamonds %>%
filter(color == "I") %>%
group_by(cut) %>%
summarize(price = mean(price))
lm(y ~ x1 + x2, data = data.frame(
y = rnorm(100),
x1 = rnorm(100),
x2 = rnorm(100)
)) ### a linear model
1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 +
1 + 1 + 1 + 1 + 1 ## comments after a long line
## here is a long long long long long long long long long long long long long long long long long long long comment
Reformat Code attempts to adjust the source code formatting to adhere to the style guide specified above. The result doesn’t look perfect, but it is more readable than the original. We should still clean up some of this manually, such as the comment on the last line that overflows.
Code > Reindent Lines (Cmd/Ctrl + I)
# comments are retained
1 + 1
if (TRUE) {
x = 1 # inline comments
} else{
x = 2
print('Oh no... ask the right bracket to go away!')
}
1 * 3 # one space before this comment will become two!
2 + 2 + 2 # only 'single quotes' are allowed in comments
diamonds %>%
filter(color == "I") %>%
group_by(cut) %>%
summarize(price = mean(price))
lm(y ~ x1 + x2, data = data.frame(
y = rnorm(100),
x1 = rnorm(100),
x2 = rnorm(100)
)) ### a linear model
1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 +
1 + 1 + 1 + 1 + 1 ## comments after a long line
## here is a long long long long long long long long long long long long long long long long long long long comment
# comments are retained
1 + 1
if (TRUE) {
  x = 1 # inline comments
} else{
  x = 2
  print('Oh no... ask the right bracket to go away!')
}
1 * 3 # one space before this comment will become two!
2 + 2 + 2 # only 'single quotes' are allowed in comments
diamonds %>%
  filter(color == "I") %>%
  group_by(cut) %>%
  summarize(price = mean(price))
lm(y ~ x1 + x2, data = data.frame(
  y = rnorm(100),
  x1 = rnorm(100),
  x2 = rnorm(100)
)) ### a linear model
1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 +
  1 + 1 + 1 + 1 + 1 ## comments after a long line
## here is a long long long long long long long long long long long long long long long long long long long comment
Reindent Lines adds indentation to conditional expression blocks, multi-line functions, expressions which run over multiple lines, and piped operations. Again, it is not perfect, but it does some of the formatting work for us.
Here’s a chunk of code from an exercise from a different class. It is formatted terribly, but as you can see it does work - the computer can interpret it. Use the style guide to clean it up and make it readable.
library(tidyverse)
library(modelr)
library(broom)
library(gam)
College <- as_tibble(ISLR::College)%>%mutate(Outstate =Outstate/1000,Room.Board=Room.Board/ 1000)# rescale Outstate in thousands of dollars
crossv_kfold(College,k=10)%>%mutate(linear=map(train,~glm(Outstate~PhD, data=.)),log= map(train,~glm(Outstate ~log(PhD), data=.)),spline=map(train,~ glm(Outstate ~bs(PhD, df=5), data=.)))%>%gather(type,model,linear:spline)%>%mutate(mse=map2_dbl(model,test,mse))%>%group_by(type)%>%summarize(mse = mean(mse))# k-fold cv of three model types
## Warning in bs(PhD, degree = 3L, knots = structure(c(67.6666666666667, 81:
## some 'x' values beyond boundary knots may cause ill-conditioned bases
## Warning in bs(PhD, degree = 3L, knots = structure(c(68, 82), .Names =
## c("33.33333%", : some 'x' values beyond boundary knots may cause ill-
## conditioned bases
## # A tibble: 3 x 2
## type mse
## <chr> <dbl>
## 1 linear 13.89460
## 2 log 14.78268
## 3 spline 12.61390
college_phd_spline<-gam(Outstate~bs(PhD,df=5),data=College)# spline has the best model fit
college_phd_terms<-preplot(college_phd_spline,se=TRUE,rug=FALSE)# get first difference for age
#age plot
data_frame(x=college_phd_terms$`bs(PhD, df = 5)`$x,
y=college_phd_terms$`bs(PhD, df = 5)`$y,
se.fit = college_phd_terms$`bs(PhD, df = 5)`$se.y)%>%
mutate(y_low = y - 1.96 * se.fit,y_high = y+1.96 * se.fit) %>%ggplot(aes(x, y))+geom_line()+
geom_line(aes(y = y_low), linetype = 2)+
geom_line(aes(y = y_high), linetype = 2)+
labs(title = "Cubic spline of out-of-state tuition",subtitle = "Knots = 2",x = "Percent of faculty with PhDs",y=expression(f[1](PhD)))
library(tidyverse)
library(modelr)
library(broom)
library(gam)
College <- as_tibble(ISLR::College) %>%
  # rescale Outstate in thousands of dollars
  mutate(Outstate = Outstate / 1000,
         Room.Board = Room.Board / 1000)
# k-fold cv of three model types
crossv_kfold(College, k = 10) %>%
  mutate(linear = map(train, ~ glm(Outstate ~ PhD, data = .)),
         log = map(train, ~ glm(Outstate ~ log(PhD), data = .)),
         spline = map(train, ~ glm(Outstate ~ bs(PhD, df = 5), data = .))) %>%
  gather(type, model, linear:spline) %>%
  mutate(mse = map2_dbl(model, test, mse)) %>%
  group_by(type) %>%
  summarize(mse = mean(mse))
# spline has the best model fit
college_phd_spline <- gam(Outstate ~ bs(PhD, df = 5), data = College)
# get first difference for age
college_phd_terms <- preplot(college_phd_spline, se = TRUE, rug = FALSE)
# age plot
data_frame(x = college_phd_terms$`bs(PhD, df = 5)`$x,
           y = college_phd_terms$`bs(PhD, df = 5)`$y,
           se.fit = college_phd_terms$`bs(PhD, df = 5)`$se.y) %>%
  mutate(y_low = y - 1.96 * se.fit,
         y_high = y + 1.96 * se.fit) %>%
  ggplot(aes(x, y)) +
  geom_line() +
  geom_line(aes(y = y_low), linetype = 2) +
  geom_line(aes(y = y_high), linetype = 2) +
  labs(title = "Cubic spline of out-of-state tuition",
       subtitle = "Knots = 2",
       x = "Percent of faculty with PhDs",
       y = expression(f[1](PhD)))
devtools::session_info()
## Session info -------------------------------------------------------------
## setting value
## version R version 3.4.1 (2017-06-30)
## system x86_64, darwin15.6.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## tz America/Chicago
## date 2017-10-19
## Packages -----------------------------------------------------------------
## package * version date source
## assertthat 0.2.0 2017-04-11 CRAN (R 3.4.0)
## backports 1.1.0 2017-05-22 CRAN (R 3.4.0)
## base * 3.4.1 2017-07-07 local
## bindr 0.1 2016-11-13 CRAN (R 3.4.0)
## bindrcpp 0.2 2017-06-17 CRAN (R 3.4.0)
## boxes 0.0.0.9000 2017-07-19 Github (r-pkgs/boxes@03098dc)
## broom 0.4.2 2017-08-09 local
## cellranger 1.1.0 2016-07-27 CRAN (R 3.4.0)
## clisymbols 1.2.0 2017-05-21 cran (@1.2.0)
## colorspace 1.3-2 2016-12-14 CRAN (R 3.4.0)
## compiler 3.4.1 2017-07-07 local
## crayon 1.3.4 2017-10-03 Github (gaborcsardi/crayon@b5221ab)
## datasets * 3.4.1 2017-07-07 local
## devtools 1.13.3 2017-08-02 CRAN (R 3.4.1)
## digest 0.6.12 2017-01-27 CRAN (R 3.4.0)
## dplyr * 0.7.4.9000 2017-10-03 Github (tidyverse/dplyr@1a0730a)
## evaluate 0.10.1 2017-06-24 CRAN (R 3.4.1)
## forcats * 0.2.0 2017-01-23 CRAN (R 3.4.0)
## foreign 0.8-69 2017-06-22 CRAN (R 3.4.1)
## ggplot2 * 2.2.1 2016-12-30 CRAN (R 3.4.0)
## glue 1.1.1 2017-06-21 CRAN (R 3.4.1)
## graphics * 3.4.1 2017-07-07 local
## grDevices * 3.4.1 2017-07-07 local
## grid 3.4.1 2017-07-07 local
## gtable 0.2.0 2016-02-26 CRAN (R 3.4.0)
## haven 1.1.0 2017-07-09 CRAN (R 3.4.1)
## hms 0.3 2016-11-22 CRAN (R 3.4.0)
## htmltools 0.3.6 2017-04-28 CRAN (R 3.4.0)
## httr 1.3.1 2017-08-20 CRAN (R 3.4.1)
## jsonlite 1.5 2017-06-01 CRAN (R 3.4.0)
## knitr 1.17 2017-08-10 cran (@1.17)
## lattice 0.20-35 2017-03-25 CRAN (R 3.4.1)
## lazyeval 0.2.0 2016-06-12 CRAN (R 3.4.0)
## lubridate 1.6.0 2016-09-13 CRAN (R 3.4.0)
## magrittr 1.5 2014-11-22 CRAN (R 3.4.0)
## memoise 1.1.0 2017-04-21 CRAN (R 3.4.0)
## methods * 3.4.1 2017-07-07 local
## mnormt 1.5-5 2016-10-15 CRAN (R 3.4.0)
## modelr 0.1.1 2017-08-10 local
## munsell 0.4.3 2016-02-13 CRAN (R 3.4.0)
## nlme 3.1-131 2017-02-06 CRAN (R 3.4.1)
## parallel 3.4.1 2017-07-07 local
## pkgconfig 2.0.1 2017-03-21 CRAN (R 3.4.0)
## plyr 1.8.4 2016-06-08 CRAN (R 3.4.0)
## psych 1.7.5 2017-05-03 CRAN (R 3.4.1)
## purrr * 0.2.3 2017-08-02 CRAN (R 3.4.1)
## R6 2.2.2 2017-06-17 CRAN (R 3.4.0)
## Rcpp 0.12.13 2017-09-28 cran (@0.12.13)
## readr * 1.1.1 2017-05-16 CRAN (R 3.4.0)
## readxl 1.0.0 2017-04-18 CRAN (R 3.4.0)
## reshape2 1.4.2 2016-10-22 CRAN (R 3.4.0)
## rlang 0.1.2 2017-08-09 CRAN (R 3.4.1)
## rmarkdown 1.6 2017-06-15 CRAN (R 3.4.0)
## rprojroot 1.2 2017-01-16 CRAN (R 3.4.0)
## rstudioapi 0.6 2016-06-27 CRAN (R 3.4.0)
## rvest 0.3.2 2016-06-17 CRAN (R 3.4.0)
## scales 0.4.1 2016-11-09 CRAN (R 3.4.0)
## stats * 3.4.1 2017-07-07 local
## stringi 1.1.5 2017-04-07 CRAN (R 3.4.0)
## stringr * 1.2.0 2017-02-18 CRAN (R 3.4.0)
## tibble * 1.3.4 2017-08-22 CRAN (R 3.4.1)
## tidyr * 0.7.0 2017-08-16 CRAN (R 3.4.1)
## tidyverse * 1.1.1.9000 2017-07-19 Github (tidyverse/tidyverse@a028619)
## tools 3.4.1 2017-07-07 local
## utils * 3.4.1 2017-07-07 local
## withr 2.0.0 2017-07-28 CRAN (R 3.4.1)
## xml2 1.1.1 2017-01-24 CRAN (R 3.4.0)
## yaml 2.1.14 2016-11-12 CRAN (R 3.4.0)
And for that matter, in any other programming language as well. Note however that these style rules are specific to R; other languages by necessity may use different rules and conventions.
These are useful for writing functions for generic methods.
This work is licensed under the CC BY-NC 4.0 Creative Commons License.
Comments
Comment your code. Each line of a comment should begin with the comment symbol and a single space: #. Comments should explain the why, not the what. To take advantage of RStudio’s code folding feature, add at least four trailing dashes (-), equal signs (=), or pound signs (#) after the comment text.
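As a small sketch of these conventions, the chunk below uses two section-header comments with trailing dashes and equal signs, plus ordinary comments that explain why a step is taken. The analysis itself is made up for illustration: it uses the built-in `mtcars` data, and the metric audience is an assumption.

```r
# Load data ----------------------------------------------------------------
# use the built-in mtcars data so the example is self-contained
cars <- mtcars

# Summarize ================================================================
# convert to kilometers per liter because the (assumed) audience uses metric
cars$kpl <- cars$mpg * 0.425144
mean_kpl <- mean(cars$kpl)
mean_kpl
```

In RStudio, each header line becomes a foldable section, which makes long scripts easier to navigate.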