This page was last updated on September 11, 2019.
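The type of graph that is most suitable for visualizing an association between two variables depends upon the type of data being visualized: contingency tables, grouped bar graphs, and mosaic plots for two categorical variables; scatterplots for two numerical variables; and stripcharts or boxplots for one numerical and one categorical variable.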
In this tutorial you’ll learn to construct and interpret each of these types of visualization. In later tutorials you’ll learn how to conduct statistical analyses of these associations.
Import the students.csv data set and also a data set concerning locusts:
students <- read.csv(url("https://people.ok.ubc.ca/jpither/datasets/students.csv"), header = TRUE)
locust <- read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02f1_2locustSerotonin.csv"))
The students object that we created in your workspace is a data frame, with each row representing a case and each column representing a variable. Data frames can store a mixture of data types (numeric variables, categorical variables, logical variables, etc.), all in the same data frame as separate columns. This isn’t the case with other object types (e.g. matrices).
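To make the contrast with matrices concrete, here is a minimal sketch using a tiny made-up data set. The object names demo_df and demo_matrix and their values are hypothetical and are not part of the students data.

```r
# A small, made-up data frame: each column keeps its own type
demo_df <- data.frame(
  height_cm   = c(160, 180, 192),            # numeric
  sex         = factor(c("f", "f", "m")),    # categorical (factor)
  left_handed = c(FALSE, FALSE, TRUE)        # logical
)
column_classes <- sapply(demo_df, class)
print(column_classes)

# Combining numbers and text into a matrix forces one common type
demo_matrix <- cbind(height_cm = c(160, 180, 192),
                     sex       = c("f", "f", "m"))
print(class(demo_matrix[, "height_cm"]))     # the heights are now character strings
```

Because the matrix can hold only one type, the heights are silently converted to text, which is why a data frame is the better container for a mixed data set like this one.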
Use the str command to get an overview of the dataset:
## 'data.frame': 154 obs. of 7 variables:
## $ height_cm : num 160 180 192 165 182 ...
## $ head_circum_cm : num 55 56 56 61 61 53 53.5 54 54 54 ...
## $ sex : Factor w/ 2 levels "f","m": 1 1 2 1 2 1 1 1 1 1 ...
## $ number_siblings: int 1 1 5 2 1 1 1 0 3 1 ...
## $ dominant_hand : Factor w/ 2 levels "l","r": 2 2 2 2 2 2 2 2 2 2 ...
## $ dominant_foot : Factor w/ 2 levels "l","r": 2 2 2 2 1 2 2 2 2 2 ...
## $ dominant_eye : Factor w/ 2 levels "l","r": 1 2 2 2 2 1 2 2 2 2 ...
Even better, the mosaic package, which is loaded along with the tigerstats package, includes a function inspect that really provides a nice overview of a data frame, along with some descriptive statistics (we’ll come back to these later):
##
## categorical variables:
## name class levels n missing
## 1 sex factor 2 154 0
## 2 dominant_hand factor 2 154 0
## 3 dominant_foot factor 2 154 0
## 4 dominant_eye factor 2 154 0
## distribution
## 1 f (58.4%), m (41.6%)
## 2 r (90.3%), l (9.7%)
## 3 r (89%), l (11%)
## 4 r (68.8%), l (31.2%)
##
## quantitative variables:
## name class min Q1 median Q3 max mean sd
## 1 height_cm numeric 150 165 171.475 180.0 210.8 171.971234 10.027280
## 2 head_circum_cm numeric 53 56 57.000 58.5 63.0 57.185065 1.884848
## 3 number_siblings integer 0 1 2.000 2.0 6.0 1.707792 1.053629
## n missing
## 1 154 0
## 2 154 0
## 3 154 0
You see that the class of each variable is listed (e.g. factor, numeric, or integer), along with the number of levels (categories) of each categorical variable.
Now that we have explored the basic structure of our dataset, we’re ready to start visualizing the data graphically!
If two categorical variables are associated, the relative frequencies for one variable will differ among categories of the other variable. To visualize such an association, we can construct a contingency table, a grouped bar graph, or a mosaic plot.
We learned earlier that we can use the xtabs command (part of the stats package that ships with R) to produce a frequency table for a single categorical variable:
## sex
## f m
## 90 64
This shows the frequency of observations (cases) in each category (‘f’ and ‘m’) of the single categorical variable (sex). There are 90 females and 64 males.
Let’s check whether there is an association between handedness (using the variable dominant_hand) and gender (using the sex variable).
To construct a contingency table involving 2 variables, we include both variable names after the ~, separated by a + sign:
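The command that produced the table below is not shown on this page. A minimal sketch of what it might look like follows. To keep it self-contained, it rebuilds a stand-in data frame (the hypothetical students_demo) from the four counts in that table rather than downloading the real file; with the real data you would presumably pass data = students instead.

```r
# Stand-in data rebuilt from the counts in the table (4, 86, 11, 53);
# 'students_demo' is a hypothetical name, not the tutorial's 'students' object
students_demo <- data.frame(
  sex = factor(rep(c("f", "m"), times = c(90, 64))),
  dominant_hand = factor(c(rep(c("l", "r"), times = c(4, 86)),
                           rep(c("l", "r"), times = c(11, 53))))
)

# Both variable names go after the ~, separated by +
hand_by_sex <- xtabs(~ sex + dominant_hand, data = students_demo)
print(hand_by_sex)
```

The first variable named in the formula (sex) defines the rows and the second (dominant_hand) defines the columns.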
## dominant_hand
## sex l r
## f 4 86
## m 11 53
We can see that 4 out of 90 female students are left-handed, while 11 out of 64 male students are left-handed. At first glance, it appears that the frequency (count) of left-handed males is high, but it’s easier to interpret relative frequencies or percentages. We’ll do this below. But first:
TIP: Contingency tables should always show the row and column totals, as well as the grand total (= total count of observations).
This can be achieved with the addmargins function, wrapped around the xtabs function:
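As a sketch of that wrapping, the block below enters the four counts by hand so that it runs on its own (hand_by_sex and with_totals are hypothetical object names). With the data frame loaded, the inner object would presumably be the xtabs call itself, i.e. addmargins(xtabs(~ sex + dominant_hand, data = students)).

```r
# Counts typed in from the contingency table above, filled column by column
hand_by_sex <- as.table(matrix(c(4, 11, 86, 53), nrow = 2,
                               dimnames = list(sex = c("f", "m"),
                                               dominant_hand = c("l", "r"))))

# addmargins() appends the row totals, column totals, and grand total
with_totals <- addmargins(hand_by_sex)
print(with_totals)
```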
## dominant_hand
## sex l r Sum
## f 4 86 90
## m 11 53 64
## Sum 15 139 154
Voila - you now have a contingency table showing an association between two categorical variables.
This particular table has dimensions 2 x 2, because each of our 2 categorical variables happens to have 2 categories. If one of our variables instead had, say, 3 categories, we would have a 2 x 3 dimension table.
Displaying the row and column totals, and the grand total, allows the reader to easily see how many observations (here, students) were in each category of each variable. For example, we see from the table above that there were 90 female and 64 male students in the dataset. We also see that there were 15 left-handed students and 139 right-handed students in the dataset.
Often it is easier to interpret categorical data if they are presented as relative frequencies, or proportions.
To do this, we need to decide which of the “margins”, i.e. rows or columns, we’re going to use as the denominator when calculating the relative frequencies.
For instance, if we wanted to answer the question “What proportion of males in the class is right handed?”, then we’d need to divide the frequencies in the table by the row totals. We can use the prop.table command to do this, making sure to include the argument margin = 1 so that it expresses the proportions based on the row totals:
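A minimal sketch of the row-wise calculation follows, again with the counts typed in by hand so that the block is self-contained (hand_by_sex and row_props are hypothetical names). With the real data, the first argument would presumably be xtabs(~ sex + dominant_hand, data = students).

```r
# Counts typed in from the contingency table, filled column by column
hand_by_sex <- as.table(matrix(c(4, 11, 86, 53), nrow = 2,
                               dimnames = list(sex = c("f", "m"),
                                               dominant_hand = c("l", "r"))))

# margin = 1 divides every cell by its row total
row_props <- prop.table(hand_by_sex, margin = 1)
print(row_props)

# Each row of proportions should therefore sum to 1
print(rowSums(row_props))
```

For example, the proportion of males that are left-handed is 11 / 64, i.e. the count in that cell divided by the total for the male row.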
## dominant_hand
## sex l r
## f 0.04444444 0.95555556
## m 0.17187500 0.82812500
You’ll see that if you sum the values in the same row, they add up to 1.
And if you’d rather use percentages, simply multiply by 100:
## dominant_hand
## sex l r
## f 4.444444 95.555556
## m 17.187500 82.812500
Now you see that 17.2 percent of males were left-handed, while only 4.4 percent of females were left-handed.
If instead we wanted to answer the question “What proportion of right-handed students in the dataset is male?”, then we’d need to divide the frequencies in the table by the column totals.
Thus, change the margin argument value to 2:
## dominant_hand
## sex l r
## f 0.2666667 0.6187050
## m 0.7333333 0.3812950
We see that of the right-handed students in the dataset, 38% (corresponding to a proportion of 0.38) are male.
Other functions that are handy when evaluating contingency tables are rowPerc and colPerc:
## dominant_hand
## sex l r Total
## f 4.44 95.56 100.00
## m 17.19 82.81 100.00
## dominant_hand
## sex l r
## f 26.67 61.87
## m 73.33 38.13
## Total 100.00 100.00
Although the contingency tables were informative, graphs can be much more effective at showing associations.
We can use the barchartGC function to visualize associations between two categorical variables, creating a grouped bar chart:
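The barchartGC calls themselves are not shown on this page. With tigerstats loaded, the first one would presumably look something like barchartGC(~ sex + dominant_hand, data = students); treat that as an assumption rather than the author's exact code. As a package-free cross-check, the sketch below draws a comparable grouped bar chart with base R's barplot instead, using counts typed in from the contingency table (hand_by_sex and bar_midpoints are hypothetical names).

```r
# Counts typed in from the contingency table, filled column by column
hand_by_sex <- as.table(matrix(c(4, 11, 86, 53), nrow = 2,
                               dimnames = list(sex = c("f", "m"),
                                               dominant_hand = c("l", "r"))))

# barplot() forms one group of bars per column, so transpose the table
# to put sex on the axis and handedness in the legend
bar_midpoints <- barplot(t(hand_by_sex), beside = TRUE,
                         col = c("firebrick", "goldenrod1"),
                         legend.text = TRUE,       # legend built from the row names (l, r)
                         xlab = "Gender",
                         ylab = "Frequency",
                         las = 1)                  # orients y-axis tick labels properly
```

This is a swapped-in base R technique, not the lattice-based chart that barchartGC produces, so colours and layout will differ from the figures on this page.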
Figure X. Grouped bar graph of the association between handedness (‘l’ denotes left-handed, ‘r’ denotes right-handed) and gender in a class of 90 female (‘f’) and 64 male (‘m’) students.
Note the additional argument type, which specifies whether we want to show the raw counts (“freq”) or percentages, as in the following:
Figure X. Grouped bar graph of the association between handedness (‘l’ denotes left-handed, ‘r’ denotes right-handed) and gender in a class of 90 female (‘f’) and 64 male (‘m’) students.
Sometimes it’s better to display the graph horizontally. We can do this thanks to the horizontal argument in the barchartGC function:
Figure X. Grouped bar graph of the association between handedness (‘l’ denotes left-handed, ‘r’ denotes right-handed) and gender in a class of 90 female (‘f’) and 64 male (‘m’) students.
In the figures above, we used sex (gender) as the variable shown on the graph axis, with dominant_hand differentiated using colours and a legend.
We can switch this around by changing which variable appears immediately after the ~ in the barchartGC command:
Figure X. Grouped bar graph of the association between gender and handedness (‘l’ denotes left-handed, ‘r’ denotes right-handed) in a class of 90 female (‘f’) and 64 male (‘m’) students.
Note the different figure caption in this figure compared to the preceding ones: here it says “association between gender and handedness”… we have “gender” appearing first, and the categories “m” and “f” are differentiated by colours, as indicated in the legend. It is typical to use this convention, i.e. the second variable in the statement is the one put on the axis of a grouped bar chart.
You should decide on which variable appears on the graph axis, and which in the legend, depending on what you wish to convey.
If you wish to describe the association between handedness and gender (more logical in this example), the original layout is best (sex on the axis). If you wished to describe the association between gender and handedness (less logical in this example), then the last graph is best.
An alternative but less common way to visualize the association between two categorical variables is a mosaic plot.
We can use the mosaicplot function, which expects a contingency table as its input; here the table is provided by the xtabs function:
mosaicplot(xtabs(~ sex + dominant_hand, data = students),
col = c("firebrick", "goldenrod1"),
xlab = "Gender",
ylab = "Dominant hand",
main = "") # don't show main title because we have a proper figure caption
Figure X. Mosaic plot of the association between handedness (‘l’ denotes left-handed, ‘r’ denotes right-handed) and gender in a class of 90 female (‘f’) and 64 male (‘m’) students.
Let’s re-plot the grouped bar chart and the mosaic plot first (we won’t show the code, as it is identical to above):
Figure X. Grouped bar graph of the association between handedness (‘l’ denotes left-handed, ‘r’ denotes right-handed) and gender in a class of 90 female (‘f’) and 64 male (‘m’) students.
Figure X. Mosaic plot of the association between handedness (‘l’ denotes left-handed, ‘r’ denotes right-handed) and gender in a class of 90 female (‘f’) and 64 male (‘m’) students.
Both figures show that the percent of students of the same gender that are left-handed is greater among males than among females. However, this pattern is arguably easier to discern in the mosaic plot.
Note also that, by default, the mosaic plot provides an indication of sample sizes within the two categories (male versus female): the relative width of the bars indicates this. Recall that there were 64 males and 90 females in the class, and accordingly, the bars on the right (males) are narrower than those on the left.
We use a scatterplot to show the association between two numerical variables.
The xyplot function does the trick.
TIP: Notice the syntax: here we have our response (y) variable on the left of the ~ symbol, and the explanatory (x) variable on the right:
xyplot(head_circum_cm ~ height_cm, data = students,
col = "black", # change symbol colour to black
xlab = "Height (cm)",
ylab = "Head circumference (cm)")
Figure 1: Scatterplot of the association between head circumference and height among 154 students.
TIP: Note the added arguments to the xyplot function. We have changed the symbol colour, and added better x- and y-axis labels.
To see how to format and create a good figure caption for a scatterplot, consult the guidelines for data presentation here.
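Things to report when describing a scatterplot include the direction of the association (positive or negative), its strength, and any observations that are inconsistent with the general trend.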
In the scatterplot above, head circumference is positively associated with height, and the association is moderately strong. There are no observations that are obviously inconsistent with the general trend.
To visualize the association between a numerical variable and a categorical variable, we can construct a stripchart or a boxplot.
Use a stripchart when there are relatively few observations (e.g. fewer than 20) within each category of the categorical variable. Use a boxplot otherwise.
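That rule of thumb can be written down as a small helper. This is only a sketch: the function name suggest_plot and its threshold argument are hypothetical, and the cutoff of 20 is the rough guideline from the text rather than a hard rule. The group sizes used below (10 locusts per treatment; 90 female and 64 male students) are the ones reported in this tutorial.

```r
# Suggest a plot type from the number of observations in each category
suggest_plot <- function(groups, threshold = 20) {
  counts <- table(groups)                  # observations per category
  if (all(counts < threshold)) "stripchart" else "boxplot"
}

locust_groups  <- rep(c(0, 1, 2), each = 10)            # 10 locusts per treatment
student_groups <- rep(c("f", "m"), times = c(90, 64))   # 90 females, 64 males

print(suggest_plot(locust_groups))
print(suggest_plot(student_groups))
```

The helper only counts observations per group; whether a stripchart or boxplot communicates the pattern better is still a judgment call.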
Let’s have a look at the locust serotonin data set using the str function:
## 'data.frame': 30 obs. of 2 variables:
## $ serotoninLevel: num 5.3 4.6 4.5 4.3 4.2 3.6 3.7 3.3 12.1 18 ...
## $ treatmentTime : int 0 0 0 0 0 0 0 0 0 0 ...
These data describe serotonin levels in the central nervous system of desert locusts that were experimentally crowded for 0 (the control group), 1, and 2 hours. The treatmentTime variable is an integer type variable, in that it describes the number of hours (0, 1, or 2) the locusts were experimentally crowded. However, we could simply consider this variable as a categorical variable denoting which “treatment group” the locusts belonged to. That’s what we’ll do here: pretend the treatmentTime variable is a categorical variable.
Let’s see how many observations there are in each treatment group. We can use the xtabs function we learned earlier for this:
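The counting command is not shown on this page; it was presumably xtabs(~ treatmentTime, data = locust). The sketch below runs the same kind of call on a stand-in data frame (the hypothetical locust_demo) that contains only the treatment column, with 10 rows per group as in the output below. It also shows one way to make the "treat it as categorical" step explicit with factor().

```r
# Stand-in for the treatment column only; 'locust_demo' is a hypothetical name
locust_demo <- data.frame(treatmentTime = rep(c(0L, 1L, 2L), each = 10))

# Number of observations in each treatment group
group_counts <- xtabs(~ treatmentTime, data = locust_demo)
print(group_counts)

# Optional: store the treatment as a factor so R treats it as categorical
locust_demo$treatment_group <- factor(locust_demo$treatmentTime)
print(levels(locust_demo$treatment_group))
```

The formula interfaces of stripchart and boxplot group by the variable on the right of the ~ either way, so the explicit conversion is optional here.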
## treatmentTime
## 0 1 2
## 10 10 10
Given that there are only 10 observations per group, we should use a stripchart to visualize how serotonin levels associate with treatment.
We use the stripchart function for this, and we’ll add some arguments to improve the quality of the graph:
stripchart(serotoninLevel ~ treatmentTime, data = locust,
ylab = "Serotonin (pmoles)",
xlab = "Treatment group (number of hours)",
method = "jitter", # jitters the symbols
pch = 1, # pch changes the symbol type
col = "firebrick",
vertical = TRUE,
las = 1) # orients y-axis tick labels properly
Figure 1: Stripchart of the association between serotonin levels and experimental treatment (N = 10 locusts per group).
To see how to format and create a good figure caption for a stripchart, consult the guidelines for data presentation here
We’ll go back to the students dataset for this, and evaluate how height is associated with gender.
Let’s see how many observations there are in each group. We can use the xtabs function we learned earlier for this:
## sex
## f m
## 90 64
So clearly a boxplot is useful here, because the number of observations in each group is well above 20.
We use the boxplot function to create a boxplot.
Take note of the various arguments that we set:
boxplot(height_cm ~ sex, data = students,
ylab = "Height (cm)",
xlab = "Gender",
las = 1) # orients y-axis tick labels properly
Figure X. Boxplot of the association between height and gender among 154 students.
TIP: Notice the consistent syntax again, with the response (y) variable on the left of the ~ symbol and the explanatory (x) variable to the right.
To see how to format and create a good figure caption for a boxplot, consult the guidelines for data presentation here
We will wait until the tutorial Analysis of Variance to learn about interpreting stripcharts, because there we learn how to add more information to the graphs, such as group means and measures of uncertainty. For now, let’s look at the stripchart from above and see what we can glean:
stripchart(serotoninLevel ~ treatmentTime, data = locust,
ylab = "Serotonin (pmoles)",
xlab = "Treatment group (number of hours)",
method = "jitter", # jitters the symbols
pch = 1, # pch changes the symbol type
col = "firebrick",
vertical = TRUE,
las = 1) # orients y-axis tick labels properly
Figure 1: Stripchart of the association between serotonin levels and experimental treatment (N = 10 locusts per group).
We see that at the 0 and 1 hr treatment times, most of the serotonin levels appear to be clustered around 5 pmoles, but a few observations are much higher. This pattern is less clear in the 2 hr group. There does appear to be a slight increase, in general, in serotonin levels in the 2 hr group compared to the others, but we’ll learn more about this later.
Now for the boxplot:
boxplot(height_cm ~ sex, data = students,
ylab = "Height (cm)",
xlab = "Gender",
las = 1) # orients y-axis tick labels properly
Figure 2. Boxplot of the association between height and gender among 154 students.
This boxplot clearly shows that male students are typically around 15 cm taller than females in the class. Median height (denoted by the thick horizontal lines) among females is around 166 cm, and is around 180 cm among males. A few students are especially tall compared to their classmates: one female is around 186 cm tall, and one male is over 210 cm tall!
You can find extra tips and instructions about graphing in R at this webpage. Note that some of the graphing functions you’ll learn about in this tutorial (e.g. xyplot) are from the lattice package, which is loaded as part of the tigerstats package, while others (e.g. boxplot, stripchart, and mosaicplot) come with base R.
We will eventually learn how to make graphs using the ggplot2 package.
Getting started: read.csv, url, library, getwd
Data frame structure: names, head, str, inspect (tigerstats / mosaic packages)
Contingency tables: xtabs, addmargins, prop.table, rowPerc (tigerstats), colPerc (tigerstats)
Graphs: barchartGC (tigerstats / lattice packages), mosaicplot, stripchart, xyplot (lattice package), boxplot