This page was last updated on September 11, 2019.


Background

The type of graph that is most suitable for visualizing an association between two variables depends upon the type of data being visualized:

  • If both variables are categorical, we can visualize the association in a table called a contingency table, or we can visualize the association graphically using a grouped bar chart or a mosaic plot
  • If both variables are numeric, we visualize the association graphically using a scatterplot
  • If one variable is categorical and the other numeric, we visualize the association graphically using a strip chart or a boxplot

In this tutorial you’ll learn to construct and interpret each of these types of visualization. In later tutorials you’ll learn how to conduct statistical analyses of these associations.


Getting started

Load the required packages

Load the required tigerstats package:


Get an overview of the data

The students object that we created in your workspace is a data frame, with each row representing a case and each column representing a variable. Data frames can store a mixture of data types: numeric variables, categorical variables, logical variables etc… all in the same data frame (as separate columns). This isn’t the case with other object types (e.g. matrices).

Use the str command to get an overview of the dataset:

## 'data.frame':    154 obs. of  7 variables:
##  $ height_cm      : num  160 180 192 165 182 ...
##  $ head_circum_cm : num  55 56 56 61 61 53 53.5 54 54 54 ...
##  $ sex            : Factor w/ 2 levels "f","m": 1 1 2 1 2 1 1 1 1 1 ...
##  $ number_siblings: int  1 1 5 2 1 1 1 0 3 1 ...
##  $ dominant_hand  : Factor w/ 2 levels "l","r": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dominant_foot  : Factor w/ 2 levels "l","r": 2 2 2 2 1 2 2 2 2 2 ...
##  $ dominant_eye   : Factor w/ 2 levels "l","r": 1 2 2 2 2 1 2 2 2 2 ...
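
Output like the above is what a call such as str(students) prints. The students data frame itself is not reproduced on this page, so the following is only a minimal sketch: it uses a small stand-in object (the name students_demo is hypothetical) holding the first five values of three columns as displayed in the str output.

```r
# Stand-in for the tutorial's `students` data frame (hypothetical name):
# the first five values of three columns, copied from the str() output
students_demo <- data.frame(
  height_cm       = c(160, 180, 192, 165, 182),
  sex             = factor(c("f", "f", "m", "f", "m")),
  number_siblings = c(1L, 1L, 5L, 2L, 1L)
)

# str() lists each column with its type and its first few values
str(students_demo)

# sapply(..., class) reports just the class of each column
sapply(students_demo, class)
```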

Even better, the mosaic package, which is loaded along with the tigerstats package, includes a function, inspect, that provides a concise overview of a data frame along with some descriptive statistics (we’ll come back to these later):

## 
## categorical variables:  
##            name  class levels   n missing
## 1           sex factor      2 154       0
## 2 dominant_hand factor      2 154       0
## 3 dominant_foot factor      2 154       0
## 4  dominant_eye factor      2 154       0
##                                    distribution
## 1 f (58.4%), m (41.6%)                         
## 2 r (90.3%), l (9.7%)                          
## 3 r (89%), l (11%)                             
## 4 r (68.8%), l (31.2%)                         
## 
## quantitative variables:  
##              name   class min  Q1  median    Q3   max       mean        sd
## 1       height_cm numeric 150 165 171.475 180.0 210.8 171.971234 10.027280
## 2  head_circum_cm numeric  53  56  57.000  58.5  63.0  57.185065  1.884848
## 3 number_siblings integer   0   1   2.000   2.0   6.0   1.707792  1.053629
##     n missing
## 1 154       0
## 2 154       0
## 3 154       0

The output shows that:

  • the dataset contains both categorical and quantitative (numerical) variables
  • no variable has missing values
  • the class of each variable is listed (e.g. factor, numeric, or integer)
  • each of the 4 categorical variables happens to have 2 levels (categories)
  • there are 2 numeric variables (height and head circumference)
  • there is 1 integer variable (number of siblings)
  • a variety of descriptive statistics accompanies both variable types (categorical and numerical)

Now that we have explored the basic structure of our dataset, we’re ready to start visualizing the data graphically!


Visualizing association between two categorical variables

If two categorical variables are associated, the relative frequencies for one variable will differ among categories of the other variable. To visualize such an association, we can construct:

  • a contingency table
  • a grouped bar graph
  • a mosaic plot

Constructing a contingency table

We learned earlier that we can use the xtabs command (from the stats package that ships with R) to produce a frequency table for a single categorical variable:

## sex
##  f  m 
## 90 64

This shows the frequency of observations (cases) in each category (‘f’ and ‘m’) of the single categorical variable (sex). There are 90 females and 64 males.

Let’s check whether there is an association between handedness (using the variable dominant_hand) and gender (using the sex variable).

To construct a contingency table involving 2 variables, we include both variable names after the ~, separated by a + sign:

##    dominant_hand
## sex  l  r
##   f  4 86
##   m 11 53
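
As a minimal sketch of the two xtabs calls, the code below rebuilds the sex and dominant_hand columns from the counts shown in the tables (it does not load the original data file), then tabulates them. The object name hand_by_sex is an assumption made for illustration.

```r
# Rebuild the two columns from the reported counts:
# 90 females (4 left, 86 right) and 64 males (11 left, 53 right)
students <- data.frame(
  sex = factor(rep(c("f", "m"), times = c(90, 64))),
  dominant_hand = factor(c(rep(c("l", "r"), times = c(4, 86)),
                           rep(c("l", "r"), times = c(11, 53))))
)

# One variable after the ~ gives a frequency table
xtabs(~ sex, data = students)

# Two variables separated by + give a contingency table
hand_by_sex <- xtabs(~ sex + dominant_hand, data = students)
hand_by_sex
```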

We can see that 4 out of 90 female students are left-handed, while 11 out of 64 male students are left-handed. At first glance, it appears that the frequency (count) of left-handed males is high, but it’s easier to interpret relative frequencies or percentages. We’ll do this below. But first:

TIP: Contingency tables should always show the row and column totals, as well as the grand total (= total count of observations).

This can be achieved with the addmargins function, wrapped around the xtabs function:

##      dominant_hand
## sex     l   r Sum
##   f     4  86  90
##   m    11  53  64
##   Sum  15 139 154
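
A sketch of the addmargins step, assuming the contingency table is typed in directly from the counts above rather than computed with xtabs (in the tutorial, addmargins is wrapped around the xtabs call). The names hand_by_sex and with_totals are illustrative.

```r
# Contingency table entered from the counts shown above
# (matrix() fills column by column: 'l' column first, then 'r')
hand_by_sex <- as.table(matrix(
  c(4, 11, 86, 53), nrow = 2,
  dimnames = list(sex = c("f", "m"), dominant_hand = c("l", "r"))
))

# addmargins() appends a "Sum" row and a "Sum" column
with_totals <- addmargins(hand_by_sex)
with_totals
```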

Voila - you now have a contingency table showing an association between two categorical variables.

This particular table has dimensions 2 x 2, because each of our 2 categorical variables happens to have 2 categories. If one of our variables instead had, say, 3 categories, we would have a 2 x 3 dimension table.

Displaying the row and column totals, and the grand total, allows the reader to easily see how many observations (here, students) were in each category of each variable. For example, we see from the table above that there were 90 female and 64 male students in the dataset. We also see that there were 15 left-handed students and 139 right-handed students in the dataset.

Relative frequencies

Often it is easier to interpret categorical data if they are presented as relative frequencies, or proportions.

To do this, we need to decide which of the “margins”, i.e. rows or columns, we’re going to use as the denominator when calculating the relative frequencies.

For instance, if we wanted to answer the question “What proportion of males in the class is right-handed?”, then we’d need to divide the frequencies in the table by the row totals. We can use the prop.table command to do this, making sure to include the argument margin = 1 so that it expresses the proportions based on the row totals:

##    dominant_hand
## sex          l          r
##   f 0.04444444 0.95555556
##   m 0.17187500 0.82812500

You’ll see that if you sum the values in the same row, they add up to 1.

And if you’d rather use percentages, simply multiply by 100:

##    dominant_hand
## sex         l         r
##   f  4.444444 95.555556
##   m 17.187500 82.812500
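
The row-wise calculation can be sketched as follows, assuming the contingency table is entered directly from the counts shown earlier (the names hand_by_sex and row_props are illustrative).

```r
# Contingency table entered from the counts shown earlier
hand_by_sex <- as.table(matrix(
  c(4, 11, 86, 53), nrow = 2,
  dimnames = list(sex = c("f", "m"), dominant_hand = c("l", "r"))
))

# margin = 1: divide each cell by its row total
row_props <- prop.table(hand_by_sex, margin = 1)
row_props

# Each row of proportions sums to 1
rowSums(row_props)

# Multiply by 100 to express the proportions as percentages
row_props * 100
```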

Now you see that 17.2 percent of males were left-handed, while only 4.4 percent of females were left-handed.

If instead we wanted to answer the question “What proportion of right-handed students in the dataset is male?”, then we’d need to divide the frequencies in the table by the column totals.

Thus, change the margin argument value to 2:

##    dominant_hand
## sex         l         r
##   f 0.2666667 0.6187050
##   m 0.7333333 0.3812950

We see that of the right-handed students in the dataset, 38% (corresponding to a proportion of 0.38) are male.
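
To make explicit what margin = 2 does, here is a small sketch that computes the column proportions with prop.table and again by hand with sweep, dividing each column by its total. The table is entered from the counts shown earlier; the object names are illustrative.

```r
# Contingency table entered from the counts shown earlier
hand_by_sex <- as.table(matrix(
  c(4, 11, 86, 53), nrow = 2,
  dimnames = list(sex = c("f", "m"), dominant_hand = c("l", "r"))
))

# margin = 2: divide each cell by its column total
col_props <- prop.table(hand_by_sex, margin = 2)
col_props

# The same calculation by hand: divide every column by its sum
by_hand <- sweep(hand_by_sex, 2, colSums(hand_by_sex), FUN = "/")

# Proportion of right-handed students who are male (53 of 139)
col_props["m", "r"]
```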

Other functions that are handy when evaluating contingency tables are rowPerc and colPerc:

##    dominant_hand
## sex      l      r  Total
##   f   4.44  95.56 100.00
##   m  17.19  82.81 100.00
##        dominant_hand
## sex          l      r
##   f      26.67  61.87
##   m      73.33  38.13
##   Total 100.00 100.00
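
A hedged sketch of how rowPerc and colPerc might be called, assuming they accept the table produced by xtabs. The tigerstats calls only run if that package is installed; the last lines show a base R calculation (a stand-in, not the tigerstats implementation) that gives the same rounded row percentages without the Total column.

```r
# Rebuild the two columns from the counts reported earlier
students <- data.frame(
  sex = factor(rep(c("f", "m"), times = c(90, 64))),
  dominant_hand = factor(c(rep(c("l", "r"), times = c(4, 86)),
                           rep(c("l", "r"), times = c(11, 53))))
)
hand_by_sex <- xtabs(~ sex + dominant_hand, data = students)

# tigerstats helpers (skipped when the package is not installed)
if (requireNamespace("tigerstats", quietly = TRUE)) {
  library(tigerstats)
  print(rowPerc(hand_by_sex))  # percentages within each row
  print(colPerc(hand_by_sex))  # percentages within each column
}

# Base R stand-in for the row percentages, rounded to 2 decimals
row_pct <- round(100 * prop.table(hand_by_sex, margin = 1), 2)
row_pct
```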

Constructing a grouped bar chart

Although the contingency tables were informative, graphs can be much more effective at showing associations.

We can use the barchartGC function to visualize associations between two categorical variables, creating a grouped bar chart:

Figure X. Grouped bar graph of the association between handedness ('l' denotes left-handed, 'r' denotes right-handed) and gender in a class of 90 female ('f') and 64 male ('m') students.
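
A sketch of barchartGC calls of this kind, assuming the formula interface used for xtabs and the type values "frequency" and "percent". The data frame is rebuilt from the counts in the contingency table, and the plotting calls only run if tigerstats is installed.

```r
# Rebuild the two columns from the counts reported earlier
students <- data.frame(
  sex = factor(rep(c("f", "m"), times = c(90, 64))),
  dominant_hand = factor(c(rep(c("l", "r"), times = c(4, 86)),
                           rep(c("l", "r"), times = c(11, 53))))
)

if (requireNamespace("tigerstats", quietly = TRUE)) {
  library(tigerstats)

  # Raw counts: sex on the axis, dominant_hand shown by colour
  print(barchartGC(~ sex + dominant_hand, data = students,
                   type = "frequency"))

  # Percentages within each sex
  print(barchartGC(~ sex + dominant_hand, data = students,
                   type = "percent"))
}
```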

Note the additional argument type, which specifies whether we want to show the raw counts (“freq”) or percentages, as in the following:

Figure X. Grouped bar graph of the association between handedness ('l' denotes left-handed, 'r' denotes right-handed) and gender in a class of 90 female ('f') and 64 male ('m') students.

Sometimes it’s better to display the graph horizontally. We can do this thanks to the horizontal argument in the barchartGC function:

Figure X. Grouped bar graph of the association between handedness ('l' denotes left-handed, 'r' denotes right-handed) and gender in a class of 90 female ('f') and 64 male ('m') students.

In the figures above, we used sex (gender) as the variable shown on the graph axis, and dominant_hand differentiated using colours and a legend.

We can switch this around by changing which variable appears immediately after the ~ in the barchartGC command:

Figure X. Grouped bar graph of the association between gender and handedness ('l'  denotes left-handed, 'r' denotes right-handed) in a class of 90 female ('f') and 64 male ('m') students.
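
The two layout changes described in this section, horizontal bars and swapping which variable follows the ~, might look like the sketch below. It assumes the same formula interface as before, rebuilds the data from the reported counts, and only draws the charts if tigerstats is installed.

```r
# Rebuild the two columns from the counts reported earlier
students <- data.frame(
  sex = factor(rep(c("f", "m"), times = c(90, 64))),
  dominant_hand = factor(c(rep(c("l", "r"), times = c(4, 86)),
                           rep(c("l", "r"), times = c(11, 53))))
)

if (requireNamespace("tigerstats", quietly = TRUE)) {
  library(tigerstats)

  # Same chart as before, drawn with horizontal bars
  print(barchartGC(~ sex + dominant_hand, data = students,
                   type = "percent", horizontal = TRUE))

  # Variables swapped: dominant_hand on the axis, sex shown by colour
  print(barchartGC(~ dominant_hand + sex, data = students,
                   type = "percent"))
}
```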

Note the different figure caption in this figure compared to the preceding ones: here it says “association between gender and handedness”… we have “gender” appearing first, and the categories “m” and “f” are differentiated by colours, as indicated in the legend. It is typical to use this convention, i.e. the second variable in the statement is the one put on the axis of a grouped bar chart.

You should decide on which variable appears on the graph axis, and which in the legend, depending on what you wish to convey.

If you wish to describe the association between handedness and gender (more logical in this example), the original layout is best (sex on the axis). If you wished to describe the association between gender and handedness (less logical in this example), then the last graph is best.


Constructing a mosaic plot

An alternative but less common way to visualize the association between two categorical variables is a mosaic plot.

The mosaicplot function expects a contingency table as its input, which we can create with the xtabs function:

Figure X. Mosaic plot of the association between handedness ('l' denotes left-handed, 'r' denotes right-handed) and gender in a class of 90 female ('f') and 64 male ('m') students.
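
A minimal sketch of such a call, with the data rebuilt from the reported counts. The colours and axis labels are illustrative choices, not necessarily those used for the figure.

```r
# Rebuild the two columns from the counts reported earlier
students <- data.frame(
  sex = factor(rep(c("f", "m"), times = c(90, 64))),
  dominant_hand = factor(c(rep(c("l", "r"), times = c(4, 86)),
                           rep(c("l", "r"), times = c(11, 53))))
)

# mosaicplot() takes the contingency table produced by xtabs()
hand_by_sex <- xtabs(~ sex + dominant_hand, data = students)
mosaicplot(hand_by_sex,
           main = "",
           xlab = "Sex", ylab = "Dominant hand",
           col = c("firebrick", "goldenrod1"))

# Share of students of each sex (first variable in the table)
sex_share <- prop.table(margin.table(hand_by_sex, 1))
sex_share
```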


Interpreting grouped bar charts and mosaic plots

Let’s re-plot the grouped bar chart and the mosaic plot first (we won’t show the code, as it is identical to above):

Figure X. Grouped bar graph of the association between handedness ('l'  denotes left-handed, 'r' denotes right-handed) and gender in a class of 90 female ('f') and 64 male ('m') students.

Figure X. Mosaic plot of the association between handedness ('l' denotes left-handed, 'r' denotes right-handed) and gender in a class of 90 female ('f') and 64 male ('m') students.

Both figures show that the percentage of left-handed students is greater among males than among females. However, this pattern is arguably easier to discern in the mosaic plot.

Note also that, by default, the mosaic plot provides an indication of sample sizes within the two categories (male versus female): the relative width of the bars indicates this. Recall that there were 64 males and 90 females in the class, and accordingly, the bars on the right (males) are narrower than those on the left.


Visualizing association between two numeric variables

Creating a scatterplot

We use a scatterplot to show association between two numerical variables.

The xyplot function does the trick.

TIP: Notice the syntax: here we have our response (y) variable on the left of the ~ symbol, and the explanatory (x) variable on the right:

Figure 1: Scatterplot of the association between head circumference and height among 154 students.
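
A sketch of an xyplot call with the response on the left of the ~. The individual student measurements are not listed on this page, so the data below are simulated placeholders (the data frame name demo and all of its values are hypothetical); the colour and axis labels are likewise illustrative.

```r
# Simulated placeholder data, NOT the students' measurements
set.seed(1)
demo <- data.frame(height_cm = round(rnorm(30, mean = 172, sd = 10), 1))
demo$head_circum_cm <- round(40 + 0.1 * demo$height_cm +
                               rnorm(30, sd = 1.5), 1)

# xyplot() is in the lattice package; response (y) ~ explanatory (x)
if (requireNamespace("lattice", quietly = TRUE)) {
  print(lattice::xyplot(head_circum_cm ~ height_cm, data = demo,
                        col = "firebrick",
                        xlab = "Height (cm)",
                        ylab = "Head circumference (cm)"))
}
```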

TIP: Note the added arguments to the xyplot function. We have changed the symbol colour, and added better x- and y-axis labels.

To see how to format and create a good figure caption for a scatterplot, consult the guidelines for data presentation here.


Interpreting and describing a scatterplot

Things to report when describing a scatterplot:

  • is there an association? A “shotgun blast” pattern indicates no. If there is an association, is it positive or negative?
  • is the association linear or not?
  • are there any outlier observations that lie far from the general trend?

In the scatterplot above, head circumference is positively associated with height, and the association is moderately strong. There are no observations that are obviously inconsistent with the general trend.


Visualizing association between a numeric and a categorical variable

To visualize association between a numerical variable and a categorical variable, we can construct:

  • a stripchart
  • a boxplot

Use a stripchart when there are relatively few observations (e.g. fewer than 20) within each category of the categorical variable. Use a boxplot otherwise.

Let’s have a look at the locust serotonin data set using the str function:

## 'data.frame':    30 obs. of  2 variables:
##  $ serotoninLevel: num  5.3 4.6 4.5 4.3 4.2 3.6 3.7 3.3 12.1 18 ...
##  $ treatmentTime : int  0 0 0 0 0 0 0 0 0 0 ...

These data describe serotonin levels in the central nervous system of desert locusts that were experimentally crowded for 0 (the control group), 1, and 2 hours. The treatmentTime variable is an integer type variable, in that it describes the number of hours (0, 1, or 2) the locusts were experimentally crowded. However, we could simply consider this variable as a categorical variable denoting which “treatment group” the locusts belonged to. That’s what we’ll do here: pretend the treatmentTime variable is a categorical variable.

Let’s see how many observations there are in each treatment group. We can use the xtabs function we learned earlier for this:

## treatmentTime
##  0  1  2 
## 10 10 10

Given that there are only 10 observations per group, we should use a stripchart to visualize how serotonin levels associate with treatment.


Create a strip chart

We use the stripchart function for this, and we’ll add some arguments to improve the quality of the graph:

Figure 1: Stripchart of the association between serotonin levels and experimental treatment (N = 10 locusts per group).
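
A sketch of a stripchart call using the base R formula interface. The full locust measurements are not listed on this page, so the values below are simulated placeholders (locust_demo is a hypothetical name), and the plotting arguments are illustrative rather than the ones used for the figure.

```r
# Simulated placeholder data, NOT the locust measurements:
# 10 observations in each of the 0, 1 and 2 hour groups
set.seed(42)
locust_demo <- data.frame(
  serotoninLevel = round(rlnorm(30, meanlog = log(6), sdlog = 0.5), 1),
  treatmentTime  = rep(0:2, each = 10)
)

# Observations per treatment group
group_sizes <- xtabs(~ treatmentTime, data = locust_demo)
group_sizes

# Jittered points, one column of points per treatment group
stripchart(serotoninLevel ~ treatmentTime, data = locust_demo,
           method = "jitter", vertical = TRUE,
           pch = 1, col = "firebrick",
           xlab = "Treatment time (hours)",
           ylab = "Serotonin (pmoles)")
```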

To see how to format and create a good figure caption for a stripchart, consult the guidelines for data presentation here.


Create a boxplot

We’ll go back to the students dataset for this, and evaluate how height is associated with gender.

Let’s see how many observations there are in each group. We can use the xtabs function we learned earlier for this:

## sex
##  f  m 
## 90 64

A boxplot is clearly appropriate here, because the number of observations in each group is well above 20.

We use the boxplot function to create a boxplot.

Take note of the various arguments that we set:

Figure X. Boxplot of the association between height and gender among 154 students.
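
A sketch of a boxplot call with the response on the left of the ~. The group sizes (90 and 64) come from the table above, but the heights are simulated placeholders (heights_demo is a hypothetical name), and the arguments shown are illustrative choices.

```r
# Group sizes from the table above; heights are simulated
# placeholders, NOT the students' measurements
set.seed(7)
heights_demo <- data.frame(
  sex = factor(rep(c("f", "m"), times = c(90, 64))),
  height_cm = round(c(rnorm(90, mean = 166, sd = 7),
                      rnorm(64, mean = 180, sd = 8)), 1)
)

# Response (height) ~ explanatory (sex); boxplot() also returns
# the summary statistics it draws, one column per group
bp <- boxplot(height_cm ~ sex, data = heights_demo,
              col = "lightgrey", las = 1,
              xlab = "Sex", ylab = "Height (cm)")
bp$n  # number of observations in each box
```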

TIP: Notice the consistent syntax again, with the response (y) variable on the left of the ~ symbol and the explanatory (x) variable to the right.

To see how to format and create a good figure caption for a boxplot, consult the guidelines for data presentation here.


Interpreting stripcharts and boxplots

We will wait until the tutorial Analysis of Variance to learn about interpreting stripcharts, because there we learn how to add more information to the graphs, such as group means and measures of uncertainty. For now, let’s look at the stripchart from above and see what we can glean:

Figure 1: Stripchart of the association between serotonin levels and experimental treatment (N = 10 locusts per group).

We see that at the 0 and 1 hr treatment times, most of the serotonin levels appear to be clustered around 5 pmoles, but a few observations are much higher. This pattern is less clear in the 2 hr group. There does appear to be a slight increase, in general, in serotonin levels in the 2 hr group compared to the others, but we’ll learn more about this later.

Now for the boxplot:

Figure 2. Boxplot of the association between height and gender among 154 students.

This boxplot clearly shows that male students are, on average, around 15 cm taller than females in the class. Median height (denoted by the thick horizontal lines) is around 166 cm among females and around 180 cm among males. A few students are especially tall compared to their classmates: one female is around 186 cm tall, and one male is over 210 cm tall!


Extra help

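You can find extra tips and instructions about graphing in R at this webpage. Note that some of the graphing functions you’ll learn about in this tutorial (xyplot, and the plotting behind barchartGC) are from the lattice package, which is loaded as part of the tigerstats package, while mosaicplot, stripchart and boxplot are part of the graphics package that ships with R.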

We will eventually learn how to make graphs using the ggplot2 package.


List of functions (and the source packages) used in tutorial

Getting started:

  • read.csv
  • url
  • library
  • getwd

Data frame structure:

  • names
  • head
  • str
  • inspect (tigerstats / mosaic packages)

Contingency tables:

  • xtabs
  • addmargins
  • prop.table
  • rowPerc (tigerstats)
  • colPerc (tigerstats)

Graphs:

  • barchartGC (tigerstats / lattice packages)
  • mosaicplot
  • stripchart
  • xyplot (lattice package)
  • boxplot