This page was last updated on October 28, 2018.
It is recommended that you start a new RStudio project and/or R Markdown document and document all your efforts as you work through this tutorial.
As we learned in the Visualizing and describing a single variable tutorial, the first task after data have been collected is to visualize and describe the data. In that tutorial you learned which types of graph and descriptive statistics to use, depending upon the type of data being explored (e.g. categorical or numerical).
Here you will learn how to produce tables of descriptive statistics within R Markdown.
Ideally we would construct tables whose format conforms to the Biology guidelines for data presentation.
However, depending on the structure of the data you wish to describe, we may only be able to approximate the desired format within R Markdown.
If you wish to skip directly to learning the commands involved in creating tables, go to the Creating and formatting a table section.
We are going to use the kableExtra package, described more at this website.
install.packages("kableExtra")
install.packages("magick")
NOTE: If you get the following message:
There is a binary version available but the source version is later:
binary source needs_compilation
openssl 0.9.6 0.9.7 TRUE
Do you want to install from sources the package which needs compilation?
You should click on the command console window (below markdown window) and respond “n” to the question provided there. Your preference should always be for installing a “binary” version.
You also need to load the knitr package, which is probably already installed.
Now that you’ve installed the packages in the command console, in an R chunk in your markdown document, type:
library(knitr)
library(kableExtra)
library(magick)
We will again make use of the tigerstats package.
library(tigerstats)
We will use the locust data we used previously: these data describe serotonin levels in the central nervous system of desert locusts that were experimentally crowded for 0 (the control group), 1, and 2 hours.
We use the read.csv function in combination with the url function for importing data from a website:
locust <- read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02f1_2locustSerotonin.csv"))
inspect(locust)
##
## quantitative variables:
## name class min Q1 median Q3 max mean sd
## 1 serotoninLevel numeric 3.2 4.675 5.9 11.475 21.3 8.406667 5.2119183
## 2 treatmentTime integer 0.0 0.000 1.0 2.000 2.0 1.000000 0.8304548
## n missing
## 1 30 0
## 2 30 0
Using the inspect function, we see that there are 30 cases (records) in the data frame, with no missing values. We also see that the serotoninLevel variable is a numeric variable and the treatmentTime variable is an integer type variable. Although this is classified as an integer variable in the data frame, in practice treatmentTime can be treated as an ordinal categorical variable.
Currently the variable treatmentTime is of class integer. Let’s change it to a variable of class factor, which is what categorical variables are generally supposed to be. To do this, we use the function as.factor, as follows:
locust$treatmentTime <- as.factor(locust$treatmentTime)
TIP: If the term factor doesn’t ring a bell, please go back and re-do the DataCamp tutorial on factors.
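If you’d like to see the conversion in isolation, here is a minimal sketch. It uses a small made-up data frame called toy (an assumption for illustration, not the locust data) and checks the class of the variable before and after applying as.factor.

```r
# Hypothetical toy data (NOT the locust data): treatment times stored as integers
toy <- data.frame(treatmentTime = c(0L, 1L, 2L, 0L, 1L, 2L),
                  response = c(4.1, 5.8, 9.0, 3.9, 6.2, 10.4))

class(toy$treatmentTime)   # "integer" before conversion

# Convert to a factor so that R treats the values as categories
toy$treatmentTime <- as.factor(toy$treatmentTime)

class(toy$treatmentTime)   # "factor" after conversion
levels(toy$treatmentTime)  # the categories, in sorted order: "0" "1" "2"
table(toy$treatmentTime)   # number of observations in each category
```

Note that the levels of a factor are stored as text labels, so "0", "1", and "2" are now category names rather than numbers.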
Let’s repeat the inspection of the data frame now:
inspect(locust)
##
## categorical variables:
## name class levels n missing
## 1 treatmentTime factor 3 30 0
## distribution
## 1 0 (33.3%), 1 (33.3%), 2 (33.3%)
##
## quantitative variables:
## name class min Q1 median Q3 max mean sd n
## 1 serotoninLevel numeric 3.2 4.675 5.9 11.475 21.3 8.406667 5.211918 30
## missing
## 1 0
We see now that treatmentTime is recognized as a categorical variable (technically of class factor, as you’d see if you use the str function), and it has 3 levels: 0, 1, and 2.
Let’s visualize the data with a strip chart:
stripchart(serotoninLevel ~ treatmentTime, data = locust,
ylab = "Serotonin (pmoles)",
xlab = "Treatment time (hours)",
method = "jitter", # jitters the symbols
pch = 1, # pch changes the symbol type
col = "firebrick",
vertical = TRUE,
las = 1) # orients y-axis tick labels properly
Figure 1: Stripchart of serotonin levels in locusts (n = 10 per group)
You can see that the observations in the 0 and 1 groups are distributed a bit strangely: many locusts exhibited serotonin levels near 5 pmoles, while a few exhibited much higher levels. In other words, there is a bit of a skew in the frequency distribution of serotonin levels within groups 0 and 1. This might raise a flag when it comes to deciding which descriptive statistics to use. However, for reasons that will be explained in a later tutorial, we can go ahead and use the mean and the standard deviation to describe the centre and spread of serotonin levels in each treatment group.
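To see why skew can raise a flag, consider this small sketch. The values in skewed are made up for illustration (they are not the locust measurements): most sit near 5, and two are much larger.

```r
# Made-up, right-skewed values (NOT the locust data)
skewed <- c(4, 4, 5, 5, 5, 5, 6, 6, 12, 18)

mean(skewed)    # 7: pulled upward by the two large values
median(skewed)  # 5: stays with the bulk of the observations

# Two measures of spread: sd is sensitive to the large values, IQR less so
sd(skewed)
IQR(skewed)
```

When the mean and median disagree like this, it is worth pausing to think about which descriptors best represent the data.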
Recall from the Visualizing associations between two variables tutorial that you can use the handy favstats function to calculate descriptive statistics for a numerical variable in relation to the categories of a categorical variable.
Here, we want to describe the serotonin levels observed in each of the 3 treatment categories:
favstats(serotoninLevel ~ treatmentTime, data = locust)
## treatmentTime min Q1 median Q3 max mean sd n missing
## 1 0 3.3 3.825 4.40 5.125 18.0 6.36 4.819912 10 0
## 2 1 3.2 5.500 5.80 8.825 18.7 8.04 4.963690 10 0
## 3 2 5.7 6.750 9.05 13.300 21.3 10.82 5.327664 10 0
We don’t need all of these descriptors, so let’s first assign the output to a data frame object called locust.stats, then we’ll get rid of extraneous descriptors:
locust.stats <- favstats(serotoninLevel ~ treatmentTime, data = locust)
Now let’s verify the column names in the data frame:
names(locust.stats)
## [1] "treatmentTime" "min" "Q1" "median"
## [5] "Q3" "max" "mean" "sd"
## [9] "n" "missing"
And then keep only those columns that we want:
locust.stats <- locust.stats[, c("treatmentTime", "n", "mean", "sd")]
locust.stats
## treatmentTime n mean sd
## 1 0 10 6.36 4.819912
## 2 1 10 8.04 4.963690
## 3 2 10 10.82 5.327664
TIP: If you don’t recognize the syntax used here, please go back and re-do the DataCamp tutorial on data frames.
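As a cross-check on the approach above, the same three descriptors (n, mean, and sd) can also be computed with base R alone. The sketch below is one way to do it, not the method used in this tutorial; the data frame toy and its columns group and value are made up for illustration.

```r
# Hypothetical toy data: a categorical variable and a numerical variable
toy <- data.frame(
  group = factor(c("a", "a", "a", "b", "b", "b")),
  value = c(1, 2, 3, 4, 6, 8)
)

# tapply applies a function to `value` within each level of `group`
toy.stats <- data.frame(
  group = levels(toy$group),
  n     = as.vector(tapply(toy$value, toy$group, length)),
  mean  = as.vector(tapply(toy$value, toy$group, mean)),
  sd    = as.vector(tapply(toy$value, toy$group, sd))
)

toy.stats

# Same column-selection syntax as above: all rows, only the named columns
toy.stats[, c("group", "mean")]
```

The result has one row per category, just like the trimmed favstats output.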
Now we have a table with the columns we need, but the formatting needs adjusting.
The kable function will produce a table using any data frame as an input. For instance, here we create a simple table out of the first five rows of the iris data frame, which is included with base R (in the datasets package):
kable(iris[1:5, ], format = "html")
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
4.6 | 3.1 | 1.5 | 0.2 | setosa |
5.0 | 3.6 | 1.4 | 0.2 | setosa |
Above, we created a data frame locust.stats that includes the information we want to show in a table:
locust.stats
## treatmentTime n mean sd
## 1 0 10 6.36 4.819912
## 2 1 10 8.04 4.963690
## 3 2 10 10.82 5.327664
Let’s rename the variables the way we want them to show in the final table:
names(locust.stats) <- c("Treatment time (h)", "n", "Mean", "SD")
locust.stats
## Treatment time (h) n Mean SD
## 1 0 10 6.36 4.819912
## 2 1 10 8.04 4.963690
## 3 2 10 10.82 5.327664
Now we’re ready to format the table and ensure that the correct number of decimal places is shown.
NOTE: You need to use slightly different code depending on whether you’re knitting to HTML or to Word/PDF.
NOTE: For now, we can only create tables that will knit successfully to HTML. The PDF/Word option is not available.
The code for rendering to Word or PDF is shown here for reference only (note that knitting directly to PDF may result in too large a table). Disregard this code for now:
kable(locust.stats, format = "latex",
caption = "Descriptive statistics for the locust data",
digits = c(0, 0, 2, 3), align = "ccrr") %>%
kable_styling(full_width = FALSE, position = "left") %>%
add_header_above(c(" " = 1, " " = 1,
"Serotonin level (pmoles)" = 2)) %>%
kable_as_image()
And below is the code that works for knitting to HTML.
You can see there’s a lot going on here, including some new syntax: %>%
This is the pipe operator, which becomes available when you load the kableExtra package. It passes the result of the code on its left as the first argument to the function on its right. Ending a line with %>% also tells R: “Wait, I’m not finished with all my operations, so go to the next line before evaluating my code.”
The other key items you should learn more about are:
the add_header_above function
the caption, digits, and align arguments to the kable function
See more information at this webpage.
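To see what %>% does on its own, here is a minimal sketch that compares a nested call with its piped equivalent. It loads the magrittr package directly, which is assumed to be installed (it normally arrives as a dependency of kableExtra); the vector x is just an example.

```r
library(magrittr)  # provides %>%

x <- c(4.819912, 4.963690, 5.327664)

# Nested form: read from the inside out
nested <- round(mean(x), 2)

# Piped form: read from top to bottom.
# x is handed to mean(), and that result is handed to round() as its first argument
piped <- x %>%
  mean() %>%
  round(2)

identical(nested, piped)  # TRUE
```

The table code for knitting to HTML, shown next, follows the same pattern: the output of kable is handed to kable_styling, whose output is handed to add_header_above.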
kable(locust.stats, format = "html",
caption = "Table 1: Descriptive statistics for the locust data",
digits = c(0, 0, 2, 3), align = "ccrr") %>%
kable_styling(full_width = FALSE, position = "left") %>%
add_header_above(c(" " = 1, " " = 1, "Serotonin level (pmoles)" = 2))
Treatment time (h) | n | Mean | SD |
---|---|---|---|
0 | 10 | 6.36 | 4.820 |
1 | 10 | 8.04 | 4.964 |
2 | 10 | 10.82 | 5.328 |
If you compare the formatting of this table to the one displayed in the Biology guidelines for data presentation, you’ll note some differences, including where borders appear. This is OK! We won’t have time to learn how to produce perfectly formatted tables within R Markdown, but we’ll get close enough. If you are keen, you could always do this procedure in R Markdown (to provide for reproducibility), and then ALSO create a perfectly formatted table in Word, a table that would of course have all the same numbers as your table in R Markdown.
You’ll notice that the standard deviation is reported to one more digit than the mean, and the 2 columns containing values with decimal places (mean, sd) are right-justified (using the align argument to the kable function), whereas the treatment column and “n” column are centred.
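The digits and align arguments can be tried out on their own, without any kableExtra styling. The sketch below uses a small made-up data frame called toy (an assumption for illustration) and supplies one digits value and one align letter per column.

```r
library(knitr)

# Hypothetical summary table with four columns
toy <- data.frame(group = c("a", "b"),
                  n     = c(10L, 10L),
                  mean  = c(6.36, 8.04),
                  sd    = c(4.819912, 4.963690))

# digits: number of decimal places for each column, in column order
# align:  one letter per column ("l" = left, "c" = centre, "r" = right)
out <- kable(toy, format = "html",
             digits = c(0, 0, 2, 3),
             align = "ccrr")
out
```

Try changing the third digits value from 2 to 1, or "ccrr" to "llrr", and compare the result.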
Practice: Create a table of descriptive statistics for the iris data, using Sepal.Length as the numerical variable and Species as the categorical variable.
You can format tables in many different ways with the kableExtra package. Further instructions and examples are provided at this webpage.
Other packages that can be used to produce tables in R Markdown are:
xtable
pander
Getting started:
read.csv
url
library
Data frame structure:
names
inspect (tigerstats / mosaic packages)
Graphs:
stripchart (graphics package, included with base R)
Descriptive stats:
favstats (tigerstats and mosaic packages)
Tables:
kable (knitr package)
kable_styling (kableExtra package)
add_header_above (kableExtra package)
kable_as_image (kableExtra package)