library(tidyverse)
set.seed(1234)
Not the kind of bugs we’re looking for

Not the kind of bugs we’re looking for

A software bug is “an error, flaw, failure or fault in a computer program or system that causes it to produce an incorrect or unexpected result, or to behave in unintended ways.”1 In an ideal world, the computer will warn you when it encounters a bug. R has the ability to do this in some situations (see our discussion below of errors, warnings, and messages). However bugs also arise because you expect the program to do one thing but provide it the ability to perform different from expectations.

As I have repeatedly emphasized in class, computers are powerful tools that are incredibly stupid. They will do exactly what you tell them to, nothing more and nothing less. If you write your code in a way that allows it to behave in an unintended way, this is your fault. The first goal of debugging should be to prevent unintended behaviors before they strike. However, when such bugs occur we need the tools and knowledge to track down these unintended behaviors and correct them in our code.

The most important step to debugging is to prevent bugs in the first place. There are several methods we can employ to do that. Some of them are simple such as styling our code so that we follow consistent practices when writing scripts and programs. Consistency will prevent silly and minor mistakes such as typos. Good styles also make our code more readable for the human eye and allow us to isolate and detect errors merely by looking at the screen. Others are more advanced and focus on the concept of failing fast - as soon as something goes wrong, stop executing the program and announce an error.

Writing code

Think back to the analogy of programming languages to human languages. Programming languages adhere to a specific grammar and syntax, they contain a vocabulary, etymology, cultural conventions, word roots (prefixes and suffixes), just like English or any other written or spoken language. We can therefore equate different components of a program to their language counterparts:

Programming Language
Scripts Essays
Sections Paragraphs
Lines Breaks Sentences
Parentheses Punctuation
Functions Verbs
Variables Nouns

Now think about how you write a document in English. In 1987, the Challenger space shuttle exploded just 73 seconds after takeoff. The deaths of seven crewmembers were seen live by millions of American schoolchildren watching around the country. A few hours after the tragedy, President Ronald Reagan gave a national address. Here is an excerpt of that address:

weve grown used to wonders in this century its hard to dazzle us but for 25 years the united states space program has been doing just that weve grown used to the idea of space and perhaps we forget that weve only just begun were still pioneers they the members of the Challenger crew were pioneers and i want to say something to the school children of America who were watching the live coverage of the shuttles takeoff i know it is hard to understand but sometimes painful things like this happen its all part of the process of exploration and discovery its all part of taking a chance and expanding mans horizons the future doesnt belong to the fainthearted it belongs to the brave the challenger crew was pulling us into the future and well continue to follow them the crew of the space shuttle challenger honored us by the manner in which they lived their lives we will never forget them nor the last time we saw them this morning as they prepared for the journey and waved goodbye and slipped the surly bonds of earth to touch the face of god

Wait a minute, this doesn’t look right. What happened to the punctuation? The capitalization? Where are all the sentences and paragraph breaks? Isn’t this hard to read and understand? Do you feel any of the emotions of the moment? Probably not, because the normal rules of grammar and syntax have been destroyed. Here’s the same excerpt, but properly styled:

We’ve grown used to wonders in this century. It’s hard to dazzle us. But for 25 years the United States space program has been doing just that. We’ve grown used to the idea of space, and perhaps we forget that we’ve only just begun. We’re still pioneers. They, the members of the Challenger crew, were pioneers.

And I want to say something to the school children of America who were watching the live coverage of the shuttle’s takeoff. I know it is hard to understand, but sometimes painful things like this happen. It’s all part of the process of exploration and discovery. It’s all part of taking a chance and expanding man’s horizons. The future doesn’t belong to the fainthearted; it belongs to the brave. The Challenger crew was pulling us into the future, and we’ll continue to follow them….

The crew of the space shuttle Challenger honoured us by the manner in which they lived their lives. We will never forget them, nor the last time we saw them, this morning, as they prepared for the journey and waved goodbye and ‘slipped the surly bonds of earth’ to ‘touch the face of God.’

That makes much more sense. Adhering to standard rules of style make the text more legible and interpretable. This is what we should aim for when writing programs in R.2

Style guide

Here are some common rules you should adopt when writing code in R, adapted from Hadley Wickham’s style guide.

Notation and naming

File names

Files should have intuitive and meaningful names. Avoid spaces or non-standard characters in your file names. R scripts should always end in .R; R Markdown documents should always end in .Rmd.

# Good
fit-models.R
utility-functions.R
gun-deaths.Rmd

# Bad
foo.r
stuff.r
gun deaths.rmd

Object names

Variables refer to data objects such as vectors, lists, or data frames. Variable and function names should be lowercase. Use an underscore (_) to separate words within a name. Avoid using periods (.).3 Variable names should generally be nouns and function names should be verbs. Try to pick names that are concise and meaningful.

# Good
day_one
day_1

# Bad
first_day_of_the_month
DayOne
dayone
djm1

Where possible, avoid using names of existing functions and variables. Doing so will cause confusion for the readers of your code, not to mention make it difficult to access the existing functions and variables.

# Bad
T <- FALSE
c <- 10

For instance, what would happen if I created a new mean() function?

x <- 1:10
mean(x)
## [1] 5.5
# create new mean function
mean <- function(x) sum(x)
mean(x)
[1] 55

Syntax

Spacing

Place spaces around all infix operators (=, +, -, <-, etc.). The same rule applies when using = in function calls.

Always put a space after a comma, and never before (just like in regular English).

# Good
average <- mean(feet / 12 + inches, na.rm = TRUE)

# Bad
average<-mean(feet/12+inches,na.rm=TRUE)

Place a space before left parentheses, except in a function call.

Note: I’m terrible at remembering to do this for if-else or for loops. I typically never place a space before left parentheses, but it is supposed to be good practice. Just remember to be consistent whatever approach you choose.

# Good
if (debug) do(x)
plot(x, y)

# Bad
if(debug)do(x)
plot (x, y)

Do not place spaces around code in parentheses or square brackets (unless there’s a comma, in which case see above).

# Good
if (debug) do(x)
diamonds[5, ]

# Bad
if ( debug ) do(x)  # No spaces around debug
x[1,]   # Needs a space after the comma
x[1 ,]  # Space goes after comma not before

Curly braces

An opening curly brace should never go on its own line and should always be followed by a new line. A closing curly brace should always go on its own line, unless it’s followed by else.

Always indent the code inside curly braces.

# Good

if (y < 0 && debug) {
  message("Y is negative")
}

if (y == 0) {
  log(x)
} else {
  y ^ x
}

# Bad

if (y < 0 && debug)
message("Y is negative")

if (y == 0) {
  log(x)
} 
else {
  y ^ x
}

It’s ok to leave very short statements on the same line:

if (y < 0 && debug) message("Y is negative")

Line length

Strive to limit your code to 80 characters per line. This fits comfortably on a printed page with a reasonably sized font. For instance, if I wanted to convert the chief column to a factor for building a faceted graph:

# Good
scdbv <- mutate(scdbv,
                chief = factor(chief, levels = c("Jay", "Rutledge", "Ellsworth",
                                                 "Marshall", "Taney", "Chase",
                                                 "Waite", "Fuller", "White",
                                                 "Taft", "Hughes", "Stone",
                                                 "Vinson", "Warren", "Burger",
                                                 "Rehnquist", "Roberts")))

# Bad
scdbv <- mutate(scdbv, chief = factor(chief, levels = c("Jay", "Rutledge", "Ellsworth", "Marshall", "Taney", "Chase", "Waite", "Fuller", "White", "Taft", "Hughes", "Stone", "Vinson", "Warren", "Burger", "Rehnquist", "Roberts")))

Indentation

When indenting your code, use two spaces. Never use tabs or mix tabs and spaces.

By default, RStudio automatically converts tabs to two spaces in your code. So if you use the tab button in R Studio, you’re good to go.

Insert spaces for tab setting in RStudio

“Insert spaces for tab” setting in RStudio

The only exception is if a function definition runs over multiple lines. In that case, indent the second line to where the definition starts:

# pure function
long_function_name <- function(a = "a long argument", 
                               b = "another argument",
                               c = "another long argument") {
  # As usual code is indented by two spaces.
}

# in a mutate() function
scdbv <- scdbv %>%
  mutate(majority = majority - 1,
         chief = factor(chief, levels = c("Jay", "Rutledge", "Ellsworth",
                                          "Marshall", "Taney", "Chase",
                                          "Waite", "Fuller", "White",
                                          "Taft", "Hughes", "Stone",
                                          "Vinson", "Warren", "Burger",
                                          "Rehnquist", "Roberts")))

Assignment

Use <-, not =, for assignment. Why? Because I said so. Or read more here.

# Good
x <- 5
# Bad
x = 5

Comments

Comment your code. Each line of a comment should begin with the comment symbol and a single space: #. Comments should explain the why, not the what.

To take advantage of RStudio’s code folding feature, add at least four trailing dashes (-), equal signs (=), or pound signs (#) after the comment text

# Section One ---------------------------------
 
# Section Two =================================
 
### Section Three #############################

Auto-formatting in RStudio

There are two built-in methods of using RStudio to automatically format and clean up your code. They are not perfect, but can help in some circumstances.

Format code

Code > Reformat Code (Shift + Cmd/Ctrl + A)

Bad code
# comments are retained
1+1

if(TRUE){
x=1  # inline comments
}else{
x=2;print('Oh no... ask the right bracket to go away!')}
1*3 # one space before this comment will become two!
2+2+2    # only 'single quotes' are allowed in comments

diamonds %>%
filter(color == "I") %>%
group_by(cut) %>%
summarize(price = mean(price))

lm(y~x1+x2, data=data.frame(y=rnorm(100),x1=rnorm(100),x2=rnorm(100)))  ### a linear model

1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1  ## comments after a long line
## here is a long long long long long long long long long long long long long long long long long long long comment
Better code
# comments are retained
1 + 1

if (TRUE) {
x = 1  # inline comments
} else{
x = 2
print('Oh no... ask the right bracket to go away!')
}
1 * 3 # one space before this comment will become two!
2 + 2 + 2    # only 'single quotes' are allowed in comments

diamonds %>%
filter(color == "I") %>%
group_by(cut) %>%
summarize(price = mean(price))

lm(y ~ x1 + x2, data = data.frame(
y = rnorm(100),
x1 = rnorm(100),
x2 = rnorm(100)
))  ### a linear model

1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 +
1 + 1 + 1 + 1 + 1  ## comments after a long line
## here is a long long long long long long long long long long long long long long long long long long long comment

Format code will attempt to adjust the source code formatting to adhere to the style guide specified above. It doesn’t look perfect, but is more readable than the original. We should still clean up some of this manually, such as the comment on the last line that flows over.

Reindent lines

Code > Reindent Lines (Cmd/Ctrl + I)

Bad code
# comments are retained
1 + 1

if (TRUE) {
x = 1  # inline comments
} else{
x = 2
print('Oh no... ask the right bracket to go away!')
}
1 * 3 # one space before this comment will become two!
2 + 2 + 2    # only 'single quotes' are allowed in comments

diamonds %>%
filter(color == "I") %>%
group_by(cut) %>%
summarize(price = mean(price))

lm(y ~ x1 + x2, data = data.frame(
y = rnorm(100),
x1 = rnorm(100),
x2 = rnorm(100)
))  ### a linear model

1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 +
1 + 1 + 1 + 1 + 1  ## comments after a long line
## here is a long long long long long long long long long long long long long long long long long long long comment
Better code
# comments are retained
1 + 1

if (TRUE) {
  x = 1  # inline comments
} else{
  x = 2
  print('Oh no... ask the right bracket to go away!')
}
1 * 3 # one space before this comment will become two!
2 + 2 + 2    # only 'single quotes' are allowed in comments

diamonds %>%
  filter(color == "I") %>%
  group_by(cut) %>%
  summarize(price = mean(price))

lm(y ~ x1 + x2, data = data.frame(
  y = rnorm(100),
  x1 = rnorm(100),
  x2 = rnorm(100)
))  ### a linear model

1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 +
  1 + 1 + 1 + 1 + 1  ## comments after a long line
## here is a long long long long long long long long long long long long long long long long long long long comment

Reindent lines will add spacing to conditional expression blocks, multi-line functions, expressions which run over multiple lines, and piped operations. Again, it is not perfect but it does some of the formatting work for us.

Exercise: style this code

Here’s a chunk of code from an exercise from a different class. It is formatted terribly, but as you can see it does work - the computer can interpret it. Use the style guide to clean it up and make it readable.

library(tidyverse)
library(modelr)
library(broom)
library(gam)
College <- as_tibble(ISLR::College)%>%mutate(Outstate =Outstate/1000,Room.Board=Room.Board/ 1000)# rescale Outstate in thousands of dollars
crossv_kfold(College,k=10)%>%mutate(linear=map(train,~glm(Outstate~PhD, data=.)),log= map(train,~glm(Outstate ~log(PhD), data=.)),spline=map(train,~ glm(Outstate ~bs(PhD, df=5),   data=.)))%>%gather(type,model,linear:spline)%>%mutate(mse=map2_dbl(model,test,mse))%>%group_by(type)%>%summarize(mse = mean(mse))# k-fold cv of three model types
## Warning in bs(PhD, degree = 3L, knots = structure(c(67.6666666666667, 81:
## some 'x' values beyond boundary knots may cause ill-conditioned bases
## Warning in bs(PhD, degree = 3L, knots = structure(c(68, 82), .Names =
## c("33.33333%", : some 'x' values beyond boundary knots may cause ill-
## conditioned bases
## # A tibble: 3 x 2
##     type      mse
##    <chr>    <dbl>
## 1 linear 13.89460
## 2    log 14.78268
## 3 spline 12.61390
college_phd_spline<-gam(Outstate~bs(PhD,df=5),data=College)# spline has the best model fit
college_phd_terms<-preplot(college_phd_spline,se=TRUE,rug=FALSE)# get first difference for age
#age plot
data_frame(x=college_phd_terms$`bs(PhD, df = 5)`$x,
y=college_phd_terms$`bs(PhD, df = 5)`$y,
se.fit = college_phd_terms$`bs(PhD, df = 5)`$se.y)%>%
mutate(y_low = y - 1.96 * se.fit,y_high = y+1.96 * se.fit) %>%ggplot(aes(x, y))+geom_line()+
geom_line(aes(y = y_low), linetype = 2)+
geom_line(aes(y = y_high), linetype = 2)+
labs(title = "Cubic spline of out-of-state tuition",subtitle = "Knots = 2",x = "Percent of faculty with PhDs",y=expression(f[1](PhD)))

Click for the solution

library(tidyverse)
library(modelr)
library(broom)
library(gam)

College <- as_tibble(ISLR::College) %>%
  # rescale Outstate in thousands of dollars
  mutate(Outstate = Outstate / 1000,
         Room.Board = Room.Board / 1000)

# k-fold cv of three model types
crossv_kfold(College, k = 10) %>%
  mutate(linear = map(train, ~ glm(Outstate ~ PhD, data = .)),
         log = map(train, ~ glm(Outstate ~ log(PhD), data = .)),
         spline = map(train, ~ glm(Outstate ~ bs(PhD, df = 5), data = .))) %>%
  gather(type, model, linear:spline) %>%
  mutate(mse = map2_dbl(model, test, mse)) %>%
  group_by(type) %>%
  summarize(mse = mean(mse))

# spline has the best model fit
college_phd_spline <- gam(Outstate ~ bs(PhD, df = 5), data = College)

# get first difference for age
college_phd_terms <- preplot(college_phd_spline, se = TRUE, rug = FALSE)

# age plot
data_frame(x = college_phd_terms$`bs(PhD, df = 5)`$x,
           y = college_phd_terms$`bs(PhD, df = 5)`$y,
           se.fit = college_phd_terms$`bs(PhD, df = 5)`$se.y) %>%
  mutate(y_low = y - 1.96 * se.fit,
         y_high = y + 1.96 * se.fit) %>%
  ggplot(aes(x, y)) +
  geom_line() +
  geom_line(aes(y = y_low), linetype = 2) +
  geom_line(aes(y = y_high), linetype = 2) +
  labs(title = "Cubic spline of out-of-state tuition",
       subtitle = "Knots = 2",
       x = "Percent of faculty with PhDs",
       y = expression(f[1](PhD)))

Session Info

devtools::session_info()
## Session info -------------------------------------------------------------
##  setting  value                       
##  version  R version 3.4.1 (2017-06-30)
##  system   x86_64, darwin15.6.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/Chicago             
##  date     2017-10-19
## Packages -----------------------------------------------------------------
##  package    * version    date       source                              
##  assertthat   0.2.0      2017-04-11 CRAN (R 3.4.0)                      
##  backports    1.1.0      2017-05-22 CRAN (R 3.4.0)                      
##  base       * 3.4.1      2017-07-07 local                               
##  bindr        0.1        2016-11-13 CRAN (R 3.4.0)                      
##  bindrcpp     0.2        2017-06-17 CRAN (R 3.4.0)                      
##  boxes        0.0.0.9000 2017-07-19 Github (r-pkgs/boxes@03098dc)       
##  broom        0.4.2      2017-08-09 local                               
##  cellranger   1.1.0      2016-07-27 CRAN (R 3.4.0)                      
##  clisymbols   1.2.0      2017-05-21 cran (@1.2.0)                       
##  colorspace   1.3-2      2016-12-14 CRAN (R 3.4.0)                      
##  compiler     3.4.1      2017-07-07 local                               
##  crayon       1.3.4      2017-10-03 Github (gaborcsardi/crayon@b5221ab) 
##  datasets   * 3.4.1      2017-07-07 local                               
##  devtools     1.13.3     2017-08-02 CRAN (R 3.4.1)                      
##  digest       0.6.12     2017-01-27 CRAN (R 3.4.0)                      
##  dplyr      * 0.7.4.9000 2017-10-03 Github (tidyverse/dplyr@1a0730a)    
##  evaluate     0.10.1     2017-06-24 CRAN (R 3.4.1)                      
##  forcats    * 0.2.0      2017-01-23 CRAN (R 3.4.0)                      
##  foreign      0.8-69     2017-06-22 CRAN (R 3.4.1)                      
##  ggplot2    * 2.2.1      2016-12-30 CRAN (R 3.4.0)                      
##  glue         1.1.1      2017-06-21 CRAN (R 3.4.1)                      
##  graphics   * 3.4.1      2017-07-07 local                               
##  grDevices  * 3.4.1      2017-07-07 local                               
##  grid         3.4.1      2017-07-07 local                               
##  gtable       0.2.0      2016-02-26 CRAN (R 3.4.0)                      
##  haven        1.1.0      2017-07-09 CRAN (R 3.4.1)                      
##  hms          0.3        2016-11-22 CRAN (R 3.4.0)                      
##  htmltools    0.3.6      2017-04-28 CRAN (R 3.4.0)                      
##  httr         1.3.1      2017-08-20 CRAN (R 3.4.1)                      
##  jsonlite     1.5        2017-06-01 CRAN (R 3.4.0)                      
##  knitr        1.17       2017-08-10 cran (@1.17)                        
##  lattice      0.20-35    2017-03-25 CRAN (R 3.4.1)                      
##  lazyeval     0.2.0      2016-06-12 CRAN (R 3.4.0)                      
##  lubridate    1.6.0      2016-09-13 CRAN (R 3.4.0)                      
##  magrittr     1.5        2014-11-22 CRAN (R 3.4.0)                      
##  memoise      1.1.0      2017-04-21 CRAN (R 3.4.0)                      
##  methods    * 3.4.1      2017-07-07 local                               
##  mnormt       1.5-5      2016-10-15 CRAN (R 3.4.0)                      
##  modelr       0.1.1      2017-08-10 local                               
##  munsell      0.4.3      2016-02-13 CRAN (R 3.4.0)                      
##  nlme         3.1-131    2017-02-06 CRAN (R 3.4.1)                      
##  parallel     3.4.1      2017-07-07 local                               
##  pkgconfig    2.0.1      2017-03-21 CRAN (R 3.4.0)                      
##  plyr         1.8.4      2016-06-08 CRAN (R 3.4.0)                      
##  psych        1.7.5      2017-05-03 CRAN (R 3.4.1)                      
##  purrr      * 0.2.3      2017-08-02 CRAN (R 3.4.1)                      
##  R6           2.2.2      2017-06-17 CRAN (R 3.4.0)                      
##  Rcpp         0.12.13    2017-09-28 cran (@0.12.13)                     
##  readr      * 1.1.1      2017-05-16 CRAN (R 3.4.0)                      
##  readxl       1.0.0      2017-04-18 CRAN (R 3.4.0)                      
##  reshape2     1.4.2      2016-10-22 CRAN (R 3.4.0)                      
##  rlang        0.1.2      2017-08-09 CRAN (R 3.4.1)                      
##  rmarkdown    1.6        2017-06-15 CRAN (R 3.4.0)                      
##  rprojroot    1.2        2017-01-16 CRAN (R 3.4.0)                      
##  rstudioapi   0.6        2016-06-27 CRAN (R 3.4.0)                      
##  rvest        0.3.2      2016-06-17 CRAN (R 3.4.0)                      
##  scales       0.4.1      2016-11-09 CRAN (R 3.4.0)                      
##  stats      * 3.4.1      2017-07-07 local                               
##  stringi      1.1.5      2017-04-07 CRAN (R 3.4.0)                      
##  stringr    * 1.2.0      2017-02-18 CRAN (R 3.4.0)                      
##  tibble     * 1.3.4      2017-08-22 CRAN (R 3.4.1)                      
##  tidyr      * 0.7.0      2017-08-16 CRAN (R 3.4.1)                      
##  tidyverse  * 1.1.1.9000 2017-07-19 Github (tidyverse/tidyverse@a028619)
##  tools        3.4.1      2017-07-07 local                               
##  utils      * 3.4.1      2017-07-07 local                               
##  withr        2.0.0      2017-07-28 CRAN (R 3.4.1)                      
##  xml2         1.1.1      2017-01-24 CRAN (R 3.4.0)                      
##  yaml         2.1.14     2016-11-12 CRAN (R 3.4.0)

  1. Source: Wikipedia

  2. And for that matter, in any other programming language as well. Note however that these style rules are specific to R; other languages by necessity may use different rules and conventions.

  3. These are useful for writing functions for generic methods.

This work is licensed under the CC BY-NC 4.0 Creative Commons License.