1.3 Basic summary statistics, histograms, and boxplots using R

For the following material, you will need to install and load the mosaic package (Pruim, Kaplan, and Horton 2019).

> library(mosaic)

It provides a suite of enhanced functions to aid our initial explorations. With RStudio running, the mosaic package loaded, a place to write and save code, and the treadmill data set loaded, we can (finally!) start to summarize the results of the study. The treadmill object is what R calls a tibble¹⁰ and contains columns corresponding to each variable in the spreadsheet. Every function in R will involve specifying the variable(s) of interest and how you want to use them. To access a particular variable (column) in a tibble, you can use a $ between the name of the tibble and the name of the variable of interest, generically as tibblename$variablename. You can think of this as tibblename’s variablename where the ’s is replaced by the dollar sign. To identify the RunTime variable here it would be treadmill$RunTime. In the command line it would look like:

> treadmill$RunTime
[1]  8.63  8.17  8.92  8.65 10.33  9.93 10.13 10.08  9.22  8.95 10.85  9.40 11.50 10.50
[15] 10.60 10.25 10.00 11.17 10.47 11.95  9.63 10.07 11.08 11.63 11.12 11.37 10.95 13.08
[29] 12.63 12.88 14.03
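As a minimal sketch of the $ notation, the code below builds a small stand-in data frame from the first five run times printed above. The name treadmill_demo and the Subject column are made up for illustration and are not part of the real data set; a base R data frame is used so that no extra packages are needed, and $ works the same way on a tibble.

```r
# Stand-in for the real treadmill tibble: only the first five run times
# printed above, plus a made-up Subject column for illustration.
treadmill_demo <- data.frame(
  Subject = 1:5,
  RunTime = c(8.63, 8.17, 8.92, 8.65, 10.33)
)

# tibblename$variablename pulls one column out as a plain vector
run_time <- treadmill_demo$RunTime
run_time

# Double square brackets with the quoted column name return the same column
identical(run_time, treadmill_demo[["RunTime"]])

# Functions such as length() then work on the extracted vector
length(run_time)
```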

Just as in the previous section, we can generate summary statistics using functions like mean and sd by running them on a specific variable:

> mean(treadmill$RunTime)
[1] 10.58613
> sd(treadmill$RunTime)
[1] 1.387414

And now we know that the average running time for 1.5 miles for the subjects in the study was 10.6 minutes with a standard deviation (SD) of 1.39 minutes. But you should remember that the mean and SD are only appropriate summaries if the distribution is roughly symmetric (both sides of the distribution are approximately the same shape and length). The mosaic package provides a useful function called favstats that provides the mean and SD as well as the 5 number summary: the minimum (min), the first quartile (Q1, the 25th percentile), the median (50th percentile), the third quartile (Q3, the 75th percentile), and the maximum (max). It also provides the number of observations (n), which was 31, as noted above, and a count of missing values (missing), which was 0 here since all subjects had measurements available on this variable.

> favstats(treadmill$RunTime)
  min   Q1 median    Q3   max     mean       sd  n missing
 8.17 9.78  10.47 11.27 14.03 10.58613 1.387414 31       0
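If the mosaic package is not available, the same pieces can be assembled from base R functions. The sketch below retypes the 31 run times from the treadmill$RunTime output shown earlier into a vector (the name run_time is just for illustration), so it runs without the data file. Its results should agree with the favstats output above.

```r
# Run times (minutes) copied from the treadmill$RunTime output shown earlier
run_time <- c( 8.63,  8.17,  8.92,  8.65, 10.33,  9.93, 10.13, 10.08,  9.22,  8.95, 10.85,
               9.40, 11.50, 10.50, 10.60, 10.25, 10.00, 11.17, 10.47, 11.95,  9.63, 10.07,
              11.08, 11.63, 11.12, 11.37, 10.95, 13.08, 12.63, 12.88, 14.03)

run_mean <- mean(run_time)        # mean
run_sd   <- sd(run_time)          # standard deviation
five_num <- quantile(run_time)    # min, Q1, median, Q3, max (R's default quantile method)
n_obs    <- length(run_time)      # number of observations
n_miss   <- sum(is.na(run_time))  # count of missing values

run_mean
run_sd
five_num
n_obs
n_miss
```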

We are starting to get somewhere with understanding that the runners were somewhat fit, with the worst runner covering 1.5 miles in 14 minutes (the equivalent of a 9.3 minute mile) and the best running at a 5.4 minute mile pace. The limited variation in the results suggests that the sample was obtained from a restricted group with somewhat common characteristics. When you explore the ages and weights of the subjects in the Practice Problems in Section 1.6, you will get even more information about how similar all the subjects in this study were. Researchers often publish numerical summaries of this sort of demographic information to help readers understand the subjects that they studied and to whom their results might apply.
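The pace conversions quoted above come from dividing each time by the 1.5 mile distance. A short sketch of that arithmetic, using the minimum and maximum from the favstats output (the variable names are only for illustration):

```r
distance_miles <- 1.5

# Slowest and fastest run times (minutes), taken from the favstats output
slowest <- 14.03
fastest <- 8.17

# Pace in minutes per mile = time / distance
pace_slowest <- slowest / distance_miles
pace_fastest <- fastest / distance_miles

round(pace_slowest, 1)
round(pace_fastest, 1)
```

The text rounds the slowest time to 14 minutes before dividing, so its pace for the slowest runner can differ in the last digit from the unrounded calculation here.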

A graphical display of these results will help us to assess the shape of the distribution of run times – including considering the potential for the presence of a skew (whether the right or left tail of the distribution is noticeably more spread out, with left skew meaning that the left tail is more spread out than the right tail) and outliers (unusual observations). A histogram is a good place to start. Histograms display connected bars with counts of observations defining the height of bars based on a set of bins of values of the quantitative variable. We will apply the hist function to the RunTime variable, which produces Figure 1.5.

> hist(treadmill$RunTime)

Figure 1.5: Histogram of Run Times (minutes) of $n$=31 subjects in Treadmill study, bar heights are counts.

You can save this plot by clicking on the Export button found above the plot, followed by Copy to Clipboard and clicking on the Copy Plot button. Then if you open your favorite word-processing program, you should be able to paste it into a document for writing reports that include the figures. You can see the first parts of this process in the screen grab in Figure 1.6. You can also directly save the figures as separate files using Save as Image or Save as PDF and then insert them into your word processing documents.

Figure 1.6: RStudio while in the process of copying the histogram.
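Figures can also be written to a file from code instead of through the Export menu. The following is one possible approach using base R's pdf graphics device, not the workflow the text prescribes. The file name and temporary folder are placeholders, and the run times are retyped from the output shown earlier so the sketch runs on its own.

```r
# Run times (minutes) copied from the treadmill$RunTime output shown earlier
run_time <- c( 8.63,  8.17,  8.92,  8.65, 10.33,  9.93, 10.13, 10.08,  9.22,  8.95, 10.85,
               9.40, 11.50, 10.50, 10.60, 10.25, 10.00, 11.17, 10.47, 11.95,  9.63, 10.07,
              11.08, 11.63, 11.12, 11.37, 10.95, 13.08, 12.63, 12.88, 14.03)

# Placeholder output location; swap in any path you like
out_file <- file.path(tempdir(), "runtime_hist.pdf")

pdf(out_file, width = 6, height = 4)  # open a PDF device (size in inches)
hist(run_time)                        # the plot is drawn into the file
dev.off()                             # close the device so the file is written

file.exists(out_file)
```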

The hist function defaults to providing a histogram on the frequency (count) scale. Most R functions have default options that are used if we don’t make any specific choices, but we can override the defaults if we desire. One option we can modify here is to add labels to the bars to be able to see exactly how many observations fell into each bar. Specifically, we can turn the labels option “on” by making it true (“T”), adding labels=T to the previous call to the hist function, separated by a comma. Note that we will use the = sign only for changing options within functions.

> hist(treadmill$RunTime, labels=T)

Figure 1.7: Histogram of Run Times with counts in bars labeled.
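The counts that the labels option prints on the bars can also be retrieved as numbers. As a small sketch, setting plot = FALSE asks hist to return the bin edges and counts instead of drawing the figure. The run times are retyped from the output shown earlier so the code runs on its own.

```r
# Run times (minutes) copied from the treadmill$RunTime output shown earlier
run_time <- c( 8.63,  8.17,  8.92,  8.65, 10.33,  9.93, 10.13, 10.08,  9.22,  8.95, 10.85,
               9.40, 11.50, 10.50, 10.60, 10.25, 10.00, 11.17, 10.47, 11.95,  9.63, 10.07,
              11.08, 11.63, 11.12, 11.37, 10.95, 13.08, 12.63, 12.88, 14.03)

# plot = FALSE returns the histogram's ingredients instead of drawing it
h <- hist(run_time, plot = FALSE)

h$breaks  # bin edges chosen by the default rule
h$counts  # number of observations in each bin

# Every observation lands in exactly one bin
sum(h$counts) == length(run_time)
```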

Based on this histogram, it does not appear that there are any outliers in the responses since there are no bars that are separated from the other observations. However, the distribution does not look symmetric and there might be a skew to the distribution. Specifically, it appears to be skewed right (the right tail is longer than the left). But histograms can sometimes mask features of the data set by binning observations and it is hard to find the percentiles accurately from the plot.

When assessing outliers and skew, the boxplot (or Box and Whiskers plot) can also be helpful (Figure 1.8) to describe the shape of the distribution as it displays the 5-number summary and will also indicate observations that are “far” from the middle of the observations. R’s boxplot function uses the standard rule to indicate an observation as a potential outlier if it falls more than 1.5 times the IQR (Inter-Quartile Range, calculated as Q3 – Q1) below Q1 or above Q3. The potential outliers are plotted with circles and the Whiskers (lines that extend from Q1 and Q3 typically to the minimum and maximum) are shortened to only go as far as observations that are within $1.5*$IQR of the upper and lower quartiles. The box part of the boxplot is a box that goes from Q1 to Q3 and the median is displayed as a line somewhere inside the box.¹¹ Looking back at the summary statistics above, Q1=9.78 and Q3=11.27, providing an IQR of:

> IQR <- 11.27 - 9.78
> IQR
[1] 1.49

One observation (the maximum value of 14.03) is indicated as a potential outlier based on this result by being larger than Q3 $+1.5*$IQR, which was 13.505:

> 11.27 + 1.5*IQR
[1] 13.505
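The same rule can be applied to the data directly rather than to the rounded summary values. The sketch below retypes the run times from the output shown earlier, computes both fences, and lists the observations outside them. All variable names are illustrative. Note that boxplot computes its quartiles in a slightly different way (as “hinges”); they coincide with Q1 and Q3 for these data but can differ a little for other data sets.

```r
# Run times (minutes) copied from the treadmill$RunTime output shown earlier
run_time <- c( 8.63,  8.17,  8.92,  8.65, 10.33,  9.93, 10.13, 10.08,  9.22,  8.95, 10.85,
               9.40, 11.50, 10.50, 10.60, 10.25, 10.00, 11.17, 10.47, 11.95,  9.63, 10.07,
              11.08, 11.63, 11.12, 11.37, 10.95, 13.08, 12.63, 12.88, 14.03)

q1  <- unname(quantile(run_time, 0.25))
q3  <- unname(quantile(run_time, 0.75))
iqr <- q3 - q1  # lowercase name to avoid confusion with R's built-in IQR() function

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Observations outside the fences are the potential outliers
outliers <- run_time[run_time < lower_fence | run_time > upper_fence]
outliers
```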

The boxplot also shows a slight indication of a right skew (skew towards larger values) with the distance from the minimum to the median being smaller than the distance from the median to the maximum. Additionally, the distance from Q1 to the median is smaller than the distance from the median to Q3. It is a modest skew, but worth noting.

Figure 1.8: Boxplot of 1.5 mile Run Times.

> boxplot(treadmill$RunTime)

While the default boxplot is fine, it fails to provide good graphical labels, especially on the y-axis. Additionally, there is no title on the plot. The following code provides some enhancements to the plot by using the ylab and main options in the call to boxplot, with the results displayed in Figure 1.9. When we add text to plots, it will be contained within quotes and be assigned into the options ylab (for y-axis) or main (for the title) here to put it into those locations.

Figure 1.9: Boxplot of Run Times with improved labels.

> boxplot(treadmill$RunTime, ylab="1.5 Mile Run Time (minutes)", 
          main="Boxplot of the Run Times of n=31 participants")

Throughout the book, we will often use extra options to make figures that are easier for you to understand. There are often simpler versions of the functions that will suffice but the extra work to get better labeled figures is often worth it. I guess the point is that “a picture is worth a thousand words” but in data visualization, that is only true if the reader can understand what is being displayed. It is also important to think about the quality of the information that is being displayed, regardless of how pretty the graphic might be. So maybe it is better to say “a picture can be worth a thousand words” if it is well-labeled?

All the previous results were created by running the R code and then copying either the output from the console or the figure and pasting it into the typesetting program. There is another way to use RStudio where you can have it compile the results (both output and figures) directly into a document together with the code that generated it, using what is called R Markdown (http://shiny.rstudio.com/articles/rmarkdown.html). It is basically what we used to prepare this book and what you should learn to use to do your work. From here forward, you will see a change in formatting of the R code and output as you will no longer see the command prompt (“>”) with the code. The output will be flagged by having two “##”’s before it. For example, the summary statistics for the RunTime variable from the favstats function would look like this when run using R Markdown:

favstats(treadmill$RunTime)

##   min   Q1 median    Q3   max     mean       sd  n missing
##  8.17 9.78  10.47 11.27 14.03 10.58613 1.387414 31       0

Statisticians (and other scientists) are starting to use R Markdown and similar methods because they provide what is called “Reproducible research” (Gandrud 2015) where all the code and the output it produced are available in a single place. This allows different researchers to run and verify results (so “reproducible results”) or the original researchers to revisit their earlier work at a later date and recreate all their results exactly. Scientific publications are currently encouraging researchers to work in this way and may someday require it. The term reproducible can also be related to whether repeated studies get the same result (also called replication); these terms and their implications for scientific research are discussed further in Chapter XX.

In order to get some practice using R Markdown, there are two options. First, try to create a sample document in this format using File -> New File -> R Markdown… Choose a title for your file and select the “Word” option. This will create a new file in the XXX window. Save that file to your computer. Then you can use the “Knit” button to have RStudio run the code and create a Word document with the results. R Markdown documents contain basically two components: “code chunks” that contain your code, and places where you can write descriptions and interpretations of those results. The code chunks can be inserted using the “Insert” button and selecting the “R” option. Then write your code between the line that opens the chunk (three backticks followed by {r}) and the line of three backticks that closes it. Once you do this, you can test your code using the triangle on the right side of the code chunk to run that chunk. Keep your write-up outside of these code chunks. Once you think your code and writing are done, you can use the “Knit” button to try to compile the file. As you are learning, you may find this challenging, so start with trying to review the sample document and

Finally, when you are done with your work and attempt to exit out of RStudio, it will ask you to save your workspace. DO NOT DO THIS! It will just create a cluttered workspace and could even cause you to get incorrect results. If you save your R code (and edit it to only contain the parts of it that worked) via the script window, you can re-create any results by simply re-running that code. If you find that you have lots of “stuff” in your workspace because you accidentally saved your workspace, just run rm(list = ls()). It will delete all the objects (data sets included) from your workspace.
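A quick way to see what rm(list = ls()) does is to create a couple of throwaway objects and check the workspace before and after. The object names below are invented for the demonstration. Loaded packages are not affected, only the objects that ls() lists.

```r
# Two throwaway objects standing in for leftover workspace clutter
old_result <- 42
old_data <- c(1, 2, 3)
ls()             # lists the names of the objects in the workspace

rm(list = ls())  # remove every object that ls() lists
ls()             # nothing left
```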

References

Gandrud, Christopher. 2015. Reproducible Research with R and R Studio, Second Edition. Chapman & Hall/CRC.

Pruim, Randall, Daniel T. Kaplan, and Nicholas J. Horton. 2019. Mosaic: Project Mosaic Statistics and Mathematics Teaching Utilities. https://CRAN.R-project.org/package=mosaic.

¹⁰ Tibbles are R objects that can contain both categorical and quantitative variables on your $n$ subjects with a name for each variable that is also the name of each column in a matrix. Each subject is a row of the data set. The name (supposedly) is due to the way table sounds in the accent of a particularly influential developer at RStudio who is from New Zealand.
¹¹ The median, quartiles and whiskers sometimes occur at the same values when there are many tied observations. If you can’t see all the components of the boxplot, produce the numerical summary to help you understand what happened.