This page was last updated on October 28, 2018.


Getting started

It is recommended that you start a new RStudio project and / or R Markdown document and document all your efforts as you work through this tutorial.

As we learned in the Visualizing and describing a single variable tutorial, the first task after data have been collected is to visualize and describe the data. In that tutorial you learned which types of graph and descriptive statistics to use, depending upon the type of data being explored (e.g. categorical or numerical).

Here you will learn how to produce tables of descriptive statistics within R Markdown.

Ideally we would construct tables whose format conforms to the Biology guidelines for data presentation.

However, depending on the structure of the data you wish to describe, we may only be able to approximate the desired format within R Markdown.

If you wish to skip directly to learning the commands involved in creating tables, go to the Creating and formatting a table section.


Install and load required packages

We are going to use the kableExtra package, described more at this website.

install.packages("kableExtra")
install.packages("magick")

NOTE: If you get the following message:

There is a binary version available but the source version is later:
        binary source needs_compilation
openssl  0.9.6  0.9.7              TRUE

Do you want to install from sources the package which needs compilation?

You should click on the command console window (below markdown window) and respond “n” to the question provided there. Your preference should always be for installing a “binary” version.

You also need to load the knitr package, which is probably already installed.

Now that you’ve installed the packages in the command console, in an R chunk in your markdown document, type:

library(knitr)
library(kableExtra)
library(magick)

We will again make use of the tigerstats package.

library(tigerstats)

Import data

We will use the locust data we used previously: these data describe serotonin levels in the central nervous system of desert locusts that were experimentally crowded for 0 (the control group), 1, and 2 hours.

We use the read.csv function in combination with the url function for importing data from a website:

locust <- read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02f1_2locustSerotonin.csv"))
inspect(locust)
## 
## quantitative variables:  
##             name   class min    Q1 median     Q3  max     mean        sd
## 1 serotoninLevel numeric 3.2 4.675    5.9 11.475 21.3 8.406667 5.2119183
## 2  treatmentTime integer 0.0 0.000    1.0  2.000  2.0 1.000000 0.8304548
##    n missing
## 1 30       0
## 2 30       0

Using the inspect function, we see that there are 30 cases (records) in the data frame, with no missing values. We also see that the serotoninLevel variable is a numeric variable and the treatmentTime variable is an integer type variable. Although this is classified as an integer variable in the data frame, in practice treatmentTime can be treated as an ordinal categorical variable.

Changing the “class” of a variable

Currently the variable treatmentTime is of class integer. Let’s change it to a variable of class factor, which is what categorical variables are generally supposed to be. To do this, we use the function as.factor, as follows:

locust$treatmentTime <- as.factor(locust$treatmentTime)

TIP: If the term factor doesn’t ring a bell, please go back and re-do the DataCamp tutorial on factors.
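
To see what this conversion does, here is a minimal sketch on a small toy data frame. The object `demo` and its values are made up for illustration only; they are not the locust data. The sketch records the class before and after the conversion and lists the resulting levels.

```r
# Hypothetical toy data frame (illustration only; not the locust data)
demo <- data.frame(hours = c(0L, 0L, 1L, 1L, 2L, 2L))

class_before <- class(demo$hours)   # "integer"

# Convert the integer codes to a categorical variable
demo$hours <- as.factor(demo$hours)

class_after <- class(demo$hours)    # "factor"
hour_levels <- levels(demo$hours)   # "0" "1" "2"

# str shows the new class and the number of levels
str(demo)
```

Note that the levels of a factor are stored as text labels, so "0", "1", and "2" are now category names rather than numbers.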

Let’s repeat the inspection of the data frame now:

inspect(locust)
## 
## categorical variables:  
##            name  class levels  n missing
## 1 treatmentTime factor      3 30       0
##                                    distribution
## 1 0 (33.3%), 1 (33.3%), 2 (33.3%)              
## 
## quantitative variables:  
##             name   class min    Q1 median     Q3  max     mean       sd  n
## 1 serotoninLevel numeric 3.2 4.675    5.9 11.475 21.3 8.406667 5.211918 30
##   missing
## 1       0

We see now that treatmentTime is recognized as a categorical variable (technically of class factor - as you’d see if you use the str function), and it has 3 levels: 0, 1, and 2.


Visualize the data with a strip chart

Let’s visualize the data with a strip chart:

stripchart(serotoninLevel ~ treatmentTime, data = locust, 
           ylab = "Serotonin (pmoles)",
           xlab = "Treatment time (hours)",
           method = "jitter",  # jitters the symbols
           pch = 1,  # pch changes the symbol type
           col = "firebrick",
           vertical = TRUE,
           las = 1)  # orients y-axis tick labels properly
Figure 1: Stripchart of serotonin levels in locusts (n = 10 per group)


You can see that the observations in the 0 and 1 groups are distributed a bit strangely: many locusts exhibited serotonin levels near 5 pmoles, while a few exhibited much higher levels. In other words, there is a bit of a skew in the frequency distribution of serotonin levels within groups 0 and 1. This might raise a flag when it comes to deciding which descriptive statistics to use. However, for reasons that will be explained in a later tutorial, we can go ahead and use the mean and the standard deviation to describe the centre and spread of serotonin levels in each treatment group.

Recall from the Visualizing associations between two variables tutorial that you can use the handy favstats function to calculate descriptive statistics for a numerical variable in relation to the categories of a categorical variable.


Calculate descriptive statistics

Here, we want to describe the serotonin levels observed in each of the 3 treatment categories, as you learned in the Visualizing associations between two variables tutorial:

favstats(serotoninLevel ~ treatmentTime, data = locust)
##   treatmentTime min    Q1 median     Q3  max  mean       sd  n missing
## 1             0 3.3 3.825   4.40  5.125 18.0  6.36 4.819912 10       0
## 2             1 3.2 5.500   5.80  8.825 18.7  8.04 4.963690 10       0
## 3             2 5.7 6.750   9.05 13.300 21.3 10.82 5.327664 10       0

We don’t need all of these descriptors, so let’s first assign the output to a data frame object called locust.stats, then we’ll get rid of extraneous descriptors:

locust.stats <- favstats(serotoninLevel ~ treatmentTime, data = locust)

Now let’s verify the column names in the data frame:

names(locust.stats)
##  [1] "treatmentTime" "min"           "Q1"            "median"       
##  [5] "Q3"            "max"           "mean"          "sd"           
##  [9] "n"             "missing"

And then keep only those columns that we want:

locust.stats <- locust.stats[, c("treatmentTime", "n", "mean", "sd")] 
locust.stats
##   treatmentTime  n  mean       sd
## 1             0 10  6.36 4.819912
## 2             1 10  8.04 4.963690
## 3             2 10 10.82 5.327664

TIP: If you don’t recognize the syntax used here, please go back and re-do the DataCamp tutorial on data frames.

Now we have a table with the columns we need, but the formatting needs adjusting.
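
If the tigerstats package is not available, a table with the same layout can be approximated with base R alone. The sketch below is an alternative to favstats, not the method used in this tutorial. It assumes the built-in PlantGrowth data frame (30 plant weights in 3 groups) as a stand-in for the locust data, and the object name `plant.stats` is hypothetical.

```r
# PlantGrowth ships with R (datasets package): columns weight and group
data(PlantGrowth)

# Per-group sample size, mean, and standard deviation using base R only
group_n    <- tapply(PlantGrowth$weight, PlantGrowth$group, length)
group_mean <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)
group_sd   <- tapply(PlantGrowth$weight, PlantGrowth$group, sd)

# Assemble a data frame with the same columns as locust.stats
plant.stats <- data.frame(
  group = names(group_n),
  n     = as.vector(group_n),
  mean  = as.vector(group_mean),
  sd    = as.vector(group_sd)
)
plant.stats
```

The result has one row per group and only the n, mean, and sd columns, so no column selection step is needed afterwards.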


Creating and formatting a table

The kable function will produce a table using any data frame as an input. For instance, here we create a simple table out of the first five rows of the iris data frame, which is included in the datasets package that ships with R:

kable(iris[1:5, ], format = "html")
Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
         5.1          3.5           1.4          0.2  setosa
         4.9          3.0           1.4          0.2  setosa
         4.7          3.2           1.3          0.2  setosa
         4.6          3.1           1.5          0.2  setosa
         5.0          3.6           1.4          0.2  setosa

Above, we created a data frame locust.stats that includes the information we want to show in a table:

locust.stats
##   treatmentTime  n  mean       sd
## 1             0 10  6.36 4.819912
## 2             1 10  8.04 4.963690
## 3             2 10 10.82 5.327664

Let’s rename the variables the way we want them to show in the final table:

names(locust.stats) <- c("Treatment time (h)", "n", "Mean", "SD")
locust.stats
##   Treatment time (h)  n  Mean       SD
## 1                  0 10  6.36 4.819912
## 2                  1 10  8.04 4.963690
## 3                  2 10 10.82 5.327664

Now we’re ready to format the table and ensure that the correct number of decimal places is shown.

NOTE: The code differs slightly depending on whether you’re knitting to HTML or to Word/PDF. For now, only the HTML version knits successfully; the Word/PDF option is not available.

For reference only, here is the code for rendering to Word or PDF (note that knitting directly to PDF may result in too large a table). Disregard this code for now:

kable(locust.stats, format = "latex",  
      caption = "Descriptive statistics for the locust data",
      digits = c(0, 0, 2, 3), align = "ccrr") %>%
  kable_styling(full_width = FALSE, position = "left") %>%
  add_header_above(c(" " = 1, " " = 1, 
                     "Serotonin level (pmoles)" = 2)) %>%
  kable_as_image()

And below is the code that works for knitting to HTML.

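You can see there’s a lot going on here, including some new syntax: %>%

This is the “pipe” operator. It passes the result of the code on its left to the function on its right, as that function’s first argument. Because each line ends with %>%, R also knows that the command is not finished and continues onto the next line before evaluating the code.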
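
To make the pipe concrete: `x %>% f(y)` is evaluated as `f(x, y)`. The sketch below compares a nested call with the equivalent piped call. It assumes the magrittr package, where `%>%` originates, is installed; the vector `values` simply holds the three standard deviations from the locust.stats output above.

```r
library(magrittr)  # assumption: magrittr is installed; it defines %>%

# The three standard deviations reported in locust.stats
values <- c(4.819912, 4.963690, 5.327664)

# Nested call: read from the inside out
nested <- round(mean(values), 2)

# Piped call: read from top to bottom; each result feeds the next function
piped <- values %>%
  mean() %>%
  round(2)

# Both forms give exactly the same result
identical(nested, piped)
```

The piped form is why the table code reads as a sequence of steps: build the table, style it, then add a header.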

The other key items you should learn more about are:

  • the add_header_above function
  • the caption, digits, and align arguments to the kable function

See more information at this webpage.

kable(locust.stats, format = "html",  
      caption = "Table 1: Descriptive statistics for the locust data",
      digits = c(0, 0, 2, 3), align = "ccrr") %>%
  kable_styling(full_width = FALSE, position = "left") %>%
  add_header_above(c(" " = 1, " " = 1, "Serotonin level (pmoles)" = 2))
Table 1: Descriptive statistics for the locust data

                        Serotonin level (pmoles)
Treatment time (h)   n        Mean        SD
        0           10        6.36     4.820
        1           10        8.04     4.964
        2           10       10.82     5.328

If you compare the formatting of this table to the one displayed in the Biology guidelines for data presentation, you’ll note some differences, including where borders appear. This is OK! We won’t have time to learn how to produce perfectly formatted tables within R Markdown, but we’ll get close enough. If you are keen, you could always do this procedure in R Markdown (to provide for reproducibility), and then ALSO create a perfectly formatted table in Word - a table that would of course have all the same numbers as your table in R Markdown.

You’ll notice that the standard deviation is reported to one more digit than the mean, and the 2 columns containing values with decimal places (mean, sd) are right-justified (using the align argument to the kable function), whereas the treatment column and “n” column are centred.
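
The effect of the digits and align arguments can be tried in isolation, using only the knitr package. In this sketch the data frame `stats` is typed in by hand from the locust.stats output above, and the object names `stats` and `tab` are hypothetical.

```r
library(knitr)

# Summary values copied from the locust.stats output
stats <- data.frame(
  treatment = c("0", "1", "2"),
  n         = c(10, 10, 10),
  mean      = c(6.36, 8.04, 10.82),
  sd        = c(4.819912, 4.963690, 5.327664)
)

# digits: one entry per column (only applied to numeric columns)
# align:  one letter per column: c = centre, r = right, l = left
tab <- kable(stats, format = "html",
             digits = c(0, 0, 2, 3),
             align  = "ccrr")
tab
```

Changing a single entry in digits, or a single letter in align, and re-knitting is a quick way to see which column each entry controls.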

  1. Try making a similarly formatted table of descriptive statistics using the iris data, using Sepal.Length as the numerical variable and Species as the categorical variable.

Extra help

You can format tables in many different ways with the kableExtra package. Further instructions and examples are provided at this webpage.

Other packages that can be used to produce tables in R Markdown are:

  • xtable
  • pander

List of functions (and the source packages) used in tutorial

Getting started:

  • read.csv
  • url
  • library

Data frame structure:

  • names
  • inspect (tigerstats / mosaic packages)

Graphs:

  • stripchart (graphics package, included with R)

Descriptive stats:

  • favstats (tigerstats and mosaic packages)

Tables (from the kableExtra package unless noted):

  • kable (knitr package)
  • kable_styling
  • add_header_above
  • kable_as_image