This page was last updated on September 11, 2019.
When visualizing and describing a single variable, we typically wish to describe a frequency distribution. We visualize a frequency distribution in different ways depending on the type of variable we’re dealing with: categorical or numeric.
In this tutorial you’ll learn to construct and interpret each of these types of visualization. You’ll also learn to calculate some descriptive statistics.
We will use the following datasets in this tutorial:
- The students.csv file contains anonymous physical data about BIOL202 students from a few years ago
- The birds.csv file contains counts of different categories of bird from a marsh habitat
Read in the data:
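# stringsAsFactors = TRUE keeps text columns as factors, matching the output shown below (R >= 4.0 reads them as character by default)
students <- read.csv(url("https://people.ok.ubc.ca/jpither/datasets/students.csv"), header = TRUE, stringsAsFactors = TRUE)
birds <- read.csv(url("https://people.ok.ubc.ca/jpither/datasets/birds.csv"), header = TRUE, stringsAsFactors = TRUE)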
TIP: Notice the two sets of parentheses in each line of code above. The RStudio text editor will highlight errors such as unmatched parentheses for you.
In this tutorial we will make use of the following R package:
tigerstats
If you haven’t already installed the tigerstats package, click your pointer within the command console, then issue the following command:
Alternatively, you can click the “packages” tab in the bottom-right panel in RStudio, then the “Install” tab, then follow the instructions.
Once installed, load the package as follows:
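The install and load commands referred to above did not survive in this copy of the page. A minimal sketch of what they presumably were; the check around the install step is an addition here, so the package is only downloaded when it is missing:

```r
# Install tigerstats from CRAN only if it is not already installed
if (!requireNamespace("tigerstats", quietly = TRUE)) {
  install.packages("tigerstats")
}

# Load the package for the current R session
library(tigerstats)
```

Installing is needed only once per computer, whereas library must be run in every new R session.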
The tigerstats package automatically loads other packages such as mosaic, which include very handy functions for descriptive and elementary statistics.
The students object that we created in your workspace is a data frame, with each row representing a case and each column representing a variable. Data frames can store a mixture of data types (numeric variables, categorical variables, logical variables, etc.), all in the same data frame (as separate columns). This isn’t the case with other object types (e.g. matrices).
To view the names of the variables in the data frame, use the names command as follows:
## [1] "height_cm" "head_circum_cm" "sex" "number_siblings"
## [5] "dominant_hand" "dominant_foot" "dominant_eye"
This returns the names height_cm, head_circum_cm, sex, number_siblings, dominant_hand, dominant_foot, and dominant_eye.
We can get a glimpse of the first handful of cases (rows) of our data with the head command:
## height_cm head_circum_cm sex number_siblings dominant_hand dominant_foot
## 1 160.02 55 f 1 r r
## 2 180.00 56 f 1 r r
## 3 192.00 56 m 5 r r
## 4 165.00 61 f 2 r r
## 5 182.00 61 m 1 r l
## 6 165.00 53 f 1 r r
## dominant_eye
## 1 l
## 2 r
## 3 r
## 4 r
## 5 r
## 6 l
You could also look at all of the data frame at once by typing its name into the console and pressing return, but that is not advisable, as you could get reams of output thrown at you if you’re dealing with a large dataset! It’s better to take a small peek at the data with head.
We should now get an idea of how many cases there are in this data set, how many variables it contains, and what type of data each variable holds (e.g. numeric, integer, character, factor, logical, etc.).
Use the str command to do this:
## 'data.frame': 154 obs. of 7 variables:
## $ height_cm : num 160 180 192 165 182 ...
## $ head_circum_cm : num 55 56 56 61 61 53 53.5 54 54 54 ...
## $ sex : Factor w/ 2 levels "f","m": 1 1 2 1 2 1 1 1 1 1 ...
## $ number_siblings: int 1 1 5 2 1 1 1 0 3 1 ...
## $ dominant_hand : Factor w/ 2 levels "l","r": 2 2 2 2 2 2 2 2 2 2 ...
## $ dominant_foot : Factor w/ 2 levels "l","r": 2 2 2 2 1 2 2 2 2 2 ...
## $ dominant_eye : Factor w/ 2 levels "l","r": 1 2 2 2 2 1 2 2 2 2 ...
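The names, head and str calls that produced the three outputs above are missing from this copy of the page; they were presumably names(students), head(students) and str(students). The sketch below runs the same calls on students.demo, a stand-in built from the six rows printed by head above (the name students.demo is made up here), so it works without downloading the file:

```r
# Stand-in data frame: the six rows shown in the head() output above
students.demo <- data.frame(
  height_cm       = c(160.02, 180, 192, 165, 182, 165),
  head_circum_cm  = c(55, 56, 56, 61, 61, 53),
  sex             = factor(c("f", "f", "m", "f", "m", "f")),
  number_siblings = c(1L, 1L, 5L, 2L, 1L, 1L),
  dominant_hand   = factor(c("r", "r", "r", "r", "r", "r"), levels = c("l", "r")),
  dominant_foot   = factor(c("r", "r", "r", "r", "l", "r"), levels = c("l", "r")),
  dominant_eye    = factor(c("l", "r", "r", "r", "r", "l"), levels = c("l", "r"))
)

names(students.demo)  # variable (column) names
head(students.demo)   # first six rows
str(students.demo)    # number of cases, number of variables, type of each variable
```

Because the stand-in holds only six rows, str reports 6 observations rather than the 154 in the full dataset.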
Even better, the mosaic package, which is loaded along with the tigerstats package, includes a function inspect that provides a nice overview of a data frame, along with some descriptive statistics (we’ll come back to these later):
##
## categorical variables:
## name class levels n missing
## 1 sex factor 2 154 0
## 2 dominant_hand factor 2 154 0
## 3 dominant_foot factor 2 154 0
## 4 dominant_eye factor 2 154 0
## distribution
## 1 f (58.4%), m (41.6%)
## 2 r (90.3%), l (9.7%)
## 3 r (89%), l (11%)
## 4 r (68.8%), l (31.2%)
##
## quantitative variables:
## name class min Q1 median Q3 max mean sd
## 1 height_cm numeric 150 165 171.475 180.0 210.8 171.971234 10.027280
## 2 head_circum_cm numeric 53 56 57.000 58.5 63.0 57.185065 1.884848
## 3 number_siblings integer 0 1 2.000 2.0 6.0 1.707792 1.053629
## n missing
## 1 154 0
## 2 154 0
## 3 154 0
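The inspect call behind the output above is missing from this copy; it was presumably inspect(students). A hedged sketch on a small stand-in (students.demo, a made-up name holding two columns from the first six rows shown earlier), with a base R fallback that is an addition here for when mosaic is not installed:

```r
# Stand-in: heights and sex for the first six students shown earlier
students.demo <- data.frame(
  height_cm = c(160.02, 180, 192, 165, 182, 165),
  sex       = factor(c("f", "f", "m", "f", "m", "f"))
)

if (requireNamespace("mosaic", quietly = TRUE)) {
  # inspect() summarises categorical and quantitative variables separately
  print(mosaic::inspect(students.demo))
} else {
  # Base R alternative: less detailed, but needs no extra package
  print(summary(students.demo))
}
```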
You see that:
- the class of each variable is listed (e.g. factor or numeric or integer)
- the number of levels (categories) is listed for each categorical variable
Apply the same commands to the birds data frame, so you know what it looks like.
Now that we have explored the basic structure of our dataset, we’re ready to start visualizing the data graphically!
Once you have a dataframe to work with, and have explored its structure and contents (above), the next order of business is always to visualize and summarize your data using graphs and tables.
We start by examining the frequency distribution of the variable(s) of interest. A frequency distribution displays the number of occurrences (cases) of all values in the data. How we display this information depends on the type of data at hand: are they numerical or categorical?
We also typically report the relative frequency distribution for the variable(s), which describes the fraction of occurrences of each value of a variable.
Frequency distributions can be displayed in a table and graphically.
Use the xtabs command (from the stats package, which is loaded whenever R starts) to produce a frequency table, which shows the frequency distribution for a categorical variable in tabular format.
Here we show the frequency of observations in each of the two categories contained in the sex variable within the students dataframe:
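The command that produced the table below is missing from this copy; it was presumably xtabs(~ sex, data = students). As a self-contained sketch, the stand-in students.demo (a made-up name) is rebuilt from the counts reported on this page, 90 "f" and 64 "m":

```r
# Stand-in for the sex column: 90 "f" and 64 "m"
students.demo <- data.frame(sex = rep(c("f", "m"), times = c(90, 64)))

# One-sided formula: tabulate the variable named to the right of ~
sex.table <- xtabs(~ sex, data = students.demo)
sex.table
```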
## sex
## f m
## 90 64
TIP: Take note of the syntax in the arguments provided to the xtabs function. We’ll return to this later.
We see that there are 2 categories (or “levels”) “f” and “m” representing female and male; this was also shown by the str function earlier. There are 90 and 64 observations (students) in those respective categories.
To show the relative frequency distribution for the sex variable, use the prop.table function and wrap it around the xtabs function using a second set of parentheses:
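The nested call is missing from this copy; it was presumably prop.table(xtabs(~ sex, data = students)). A sketch using the same made-up stand-in, students.demo, built from the 90 and 64 counts above:

```r
students.demo <- data.frame(sex = rep(c("f", "m"), times = c(90, 64)))

# The inner call builds the frequency table;
# the outer call divides each count by the total
sex.prop <- prop.table(xtabs(~ sex, data = students.demo))
sex.prop
```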
## sex
## f m
## 0.5844156 0.4155844
You can convert these relative frequencies to percentages by multiplying them by 100:
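The original line is not preserved here. One way to write the multiplication, again on a made-up stand-in (students.demo) with 90 "f" and 64 "m":

```r
students.demo <- data.frame(sex = rep(c("f", "m"), times = c(90, 64)))

# Relative frequencies times 100 give percentages
sex.pct <- 100 * prop.table(xtabs(~ sex, data = students.demo))
sex.pct
```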
## sex
## f m
## 58.44156 41.55844
Or you can use the handy rowPerc function to convert the raw frequencies to percentages:
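The rowPerc call is missing from this copy; a plausible form is rowPerc(xtabs(~ sex, data = students)). The sketch below uses the made-up stand-in students.demo and, as an addition here, falls back to base R when tigerstats is not installed:

```r
students.demo <- data.frame(sex = rep(c("f", "m"), times = c(90, 64)))
sex.table <- xtabs(~ sex, data = students.demo)

if (requireNamespace("tigerstats", quietly = TRUE)) {
  library(tigerstats)
  # rowPerc() converts the counts to percentages and adds a Total column
  rowPerc(sex.table)
} else {
  # Base R alternative: rounded percentages, without the Total column
  round(100 * prop.table(sex.table), 2)
}
```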
##
## sex f m Total
## 58.44 41.56 100
When there are more than 2 categories in the variable of interest, you will need to sort the frequencies in decreasing order.
There are several steps to this, which we’ll demonstrate using the birds dataset.
- Use the xtabs function to create a frequency table
- Use the sort function to sort the resulting frequencies across categories
- Use the data.frame function to create a data frame that stores the properly sorted frequencies
bird.table <- xtabs(~ type, data = birds) # create frequency table
bird.table.sort <- sort(bird.table, decreasing = TRUE) # sort the frequencies in decreasing order
bird.df <- data.frame(bird.table.sort) # create a data frame
names(bird.df) <- c("Birdtype", "Frequency") # rename variables
bird.df # show the final table
## Birdtype Frequency
## 1 Waterfowl 43
## 2 Predatory 29
## 3 Shorebird 8
## 4 Songbird 6
There we go! A frequency table that is appropriately sorted. We’ll use the bird.table.sort object later for graphing too.
We use a bar chart to visualize the frequency distribution for a single categorical variable.
The tigerstats package has a handy function called barchartGC that can do this:
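The plotting command is missing from this copy; given the TIP that follows, it was presumably barchartGC(~ type, data = birds). A sketch on birds.demo, a made-up stand-in rebuilt from the counts in the frequency table above, with a base R barplot as a fallback added here for when tigerstats is not installed:

```r
# Stand-in for the birds data, rebuilt from the counts in the frequency table above
birds.demo <- data.frame(
  type = rep(c("Waterfowl", "Predatory", "Shorebird", "Songbird"),
             times = c(43, 29, 8, 6))
)

if (requireNamespace("tigerstats", quietly = TRUE)) {
  library(tigerstats)
  # Same formula syntax as xtabs(): ~ variable, data = data frame
  barchartGC(~ type, data = birds.demo)
} else {
  # Base R alternative: bar plot of the (unsorted) frequency table
  barplot(xtabs(~ type, data = birds.demo), ylab = "Frequency")
}
```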
TIP: Note that the syntax of the arguments provided to the barchartGC function is identical to the syntax used for the xtabs function.
Problem: Note how the categories are not sorted in order of decreasing frequency, as they should be. Thus, we need to use our sorted data from above to create a proper bar chart.
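The corrected chart's code is not preserved in this copy. Passing the sorted table to barchartGC, as the later relative frequency example on this page does, is one way to get it. The stand-in birds.demo and the barplot fallback are assumptions added here:

```r
birds.demo <- data.frame(
  type = rep(c("Waterfowl", "Predatory", "Shorebird", "Songbird"),
             times = c(43, 29, 8, 6))
)

# Sort the frequency table in decreasing order before plotting
bird.table.sort <- sort(xtabs(~ type, data = birds.demo), decreasing = TRUE)

if (requireNamespace("tigerstats", quietly = TRUE)) {
  library(tigerstats)
  barchartGC(bird.table.sort)   # a table can be passed in place of a formula
} else {
  barplot(bird.table.sort, ylab = "Frequency")
}
```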
Sometimes it’s better to display the bar chart horizontally, especially if your category labels are long.
You should change the width and height settings accordingly:
Problem: Here the bars are sorted in the reverse order; the longest bar should be at the top of the graph. To fix this, simply use the sort function again within the barchartGC function, as follows:
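Neither horizontal bar chart command survives in this copy. The sketch below rests on two assumptions: that barchartGC takes a horizontal argument, as its help page lists, and that the fix sorts in increasing order so the largest category is drawn last, at the top. The stand-in birds.demo and the barplot fallback are also additions:

```r
birds.demo <- data.frame(
  type = rep(c("Waterfowl", "Predatory", "Shorebird", "Songbird"),
             times = c(43, 29, 8, 6))
)
bird.table <- xtabs(~ type, data = birds.demo)

# Increasing order: horizontal bars are drawn bottom-up,
# so the largest category ends up at the top
bird.table.up <- sort(bird.table)

if (requireNamespace("tigerstats", quietly = TRUE)) {
  library(tigerstats)
  barchartGC(bird.table.up, horizontal = TRUE)
} else {
  barplot(bird.table.up, horiz = TRUE, las = 1, xlab = "Frequency")
}
```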
We can use the same command to visualize the relative frequency distribution.
First let’s use the prop.table function to create an object that holds the relative frequencies, then we’ll plot those with the barchartGC function, using the sorted frequency table again:
bird.table.rel <- prop.table(bird.table.sort)
barchartGC(bird.table.rel, ylab = "Relative frequency")
Two things to note in the preceding code:
- We provided the bird.table.rel object as the main argument to the barchartGC function. This was necessary to display the relative frequencies, which were calculated with the prop.table function from the sorted table created using the xtabs function
- We added the argument ylab, which specifies the text to use as the label for the y-axis. This was required because the default y-axis label is “frequency”, and we are showing “Relative frequency” here.
The proportion is the most important descriptive statistic for a categorical variable. It measures the fraction of observations in a given category within a categorical variable. For example: what proportion of the BIOL202 class is female?
The proportion of students that are female is the same as the relative frequency of females in the class. Earlier, using the students dataset, we learned how to show the relative frequencies of males and females in the class using the prop.table function:
## sex
## f m
## 0.5844156 0.4155844
As shown above, the proportion of students that are female is 0.584. Proportions always fall between 0 and 1.
We start by examining the frequency distribution of the variable of interest, which is the number of occurrences (cases) of all values in the data. In this case, the data are numerical.
With numerical variables, such as the height_cm variable in the students dataset, it typically does not make sense to tabulate each unique value in the data (as we do with categorical variables) because there may be many, many values with only a single occurrence. Instead, intervals are created, and the number of occurrences of values within each interval is tallied.
It is relatively uncommon in practice to report a frequency table for a numeric variable. Much more common is to proceed directly to graphically displaying the frequency distribution using a histogram.
A histogram uses the area of rectangular bars to display the frequency distribution (or relative frequency distribution) of a numerical variable.
Let’s construct a simple histogram of head circumference from the students dataset using the histogram function (which the tigerstats package actually borrows from the lattice and mosaic packages).
We need to add one argument to the otherwise familiar syntax. The histogram function can produce three types of histograms: “density”, “percent”, or “count”. We want to display the raw counts (= frequencies), so we use the argument type = "count":
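The histogram command is missing from this copy; it was presumably histogram(~ head_circum_cm, data = students, type = "count"). A sketch that calls lattice directly on students.demo, a made-up stand-in holding the first ten head circumferences printed by str earlier:

```r
# Stand-in data: the first ten head circumferences printed by str() earlier
students.demo <- data.frame(
  head_circum_cm = c(55, 56, 56, 61, 61, 53, 53.5, 54, 54, 54)
)

# lattice provides histogram(); tigerstats makes it available when loaded
library(lattice)
head.hist <- histogram(~ head_circum_cm, data = students.demo, type = "count")
print(head.hist)   # lattice plots are drawn when printed
```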
TIP: Note again the same general syntax used here and with the xtabs and barchartGC functions: nice and consistent!
The histogram function chooses the intervals or “bin widths” for you. However, the choice of bin width can dramatically affect the look of your histogram. For BIOL202 students, consult p. 37 of the text for information about this.
You can change the number of bins by using the argument nint, which stands for “number of intervals”:
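The example that followed is missing from this copy. A sketch of how nint can be passed; the value 5 and the stand-in students.demo (the first ten head circumferences printed earlier) are illustrative choices, not the original ones:

```r
students.demo <- data.frame(
  head_circum_cm = c(55, 56, 56, 61, 61, 53, 53.5, 54, 54, 54)
)

library(lattice)
# nint sets the number of intervals (bins); 5 is only an illustration
head.hist5 <- histogram(~ head_circum_cm, data = students.demo,
                        type = "count", nint = 5)
print(head.hist5)
```

Trying a few values of nint and comparing the resulting shapes is a quick way to see how much the bin choice matters.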
hist function
You’ll note in the histograms above that the bars don’t necessarily line up with the x-axis tick marks.
This is an annoying feature of the histogram function from the lattice package (which is loaded along with the tigerstats package).
As with most tasks in R, there are many ways to create a histogram. Here we’ll use the hist command (included in base R) instead of the histogram command.
?hist
Note the added arguments, to make the figure look good:
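The hist call itself is missing from this copy, so the arguments below are plausible choices rather than the original ones. The stand-in students.demo (a made-up name) holds the first ten head circumferences printed earlier:

```r
students.demo <- data.frame(
  head_circum_cm = c(55, 56, 56, 61, 61, 53, 53.5, 54, 54, 54)
)

head.h <- hist(students.demo$head_circum_cm,
               main = "",                         # no title above the plot
               xlab = "Head circumference (cm)",  # x-axis label
               ylab = "Frequency",                # y-axis label
               las  = 1,                          # horizontal tick labels
               col  = "grey")                     # bar fill colour
head.h$counts   # number of observations in each bin
```

Unlike the formula interface used so far, hist takes the numeric vector itself, extracted from the data frame with $.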
If you’re interested in creating fancier histograms, consult the following datacamp tutorial.
Frequency distributions for numerical variables can take on a variety of shapes, as shown in the following display of histograms:
Use the image above as a guide on how to describe a histogram. Note that the asymmetric distribution displayed above is skewed left.
Things to note in your description:
Typically your histogram and its description would be accompanied by descriptive statistics (see below).
When describing a numeric variable, calculate and report the mean and standard deviation as measures of centre and spread, respectively
If the frequency distribution is roughly symmetric and does not have any obvious outliers, the mean and the standard deviation are the preferred measures of centre and spread
If the frequency distribution is asymmetric and / or has outliers, the median and the inter-quartile range (IQR) are the preferred measures of centre and spread, and in this case, one often sees these reported in addition to the mean and standard deviation
The tigerstats package we already installed gives access to a handy function favstats (from the mosaic package it loads) that can be used to calculate descriptive statistics for numeric variables.
Here we’ll calculate these descriptive statistics on the height data from the students dataset:
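The favstats call is missing from this copy; it was presumably favstats(~ height_cm, data = students). The sketch below uses a made-up stand-in, students.demo, holding only the six heights printed by head earlier, so its numbers differ from the full-data output that follows. The base R fallback is an addition here:

```r
# Stand-in data: the six heights shown in the head() output earlier
students.demo <- data.frame(height_cm = c(160.02, 180, 192, 165, 182, 165))

if (requireNamespace("mosaic", quietly = TRUE)) {
  # Same formula syntax as xtabs() and histogram()
  print(mosaic::favstats(~ height_cm, data = students.demo))
} else {
  # Base R alternative: five-number summary plus the mean
  print(summary(students.demo$height_cm))
}
```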
## min Q1 median Q3 max mean sd n missing
## 150 165 171.475 180 210.8 171.9712 10.02728 154 0
TIP: Note again the syntax!
You can also compute descriptive statistics one by one. For instance, to calculate the mean, median, variance, standard deviation, and IQR of height_cm, type:
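The five commands are missing from this copy. Judging by the TIP below, the page probably used the formula form that tigerstats enables, e.g. mean(~ height_cm, data = students). The sketch here uses plain base R on a made-up six-row stand-in (students.demo), so it runs without any package and its values differ from the full-data output that follows:

```r
students.demo <- data.frame(height_cm = c(160.02, 180, 192, 165, 182, 165))
h <- students.demo$height_cm   # base R: extract the column with $

mean(h)     # mean
median(h)   # median
var(h)      # variance
sd(h)       # standard deviation (square root of the variance)
IQR(h)      # inter-quartile range (Q3 minus Q1)
```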
## [1] 171.9712
## [1] 171.475
## [1] 100.5463
## [1] 10.02728
## [1] 15
TIP: Sometimes packages include functions that conflict in name with functions in the “base” R package (the default set of functions available to you when you start R). For example, the tigerstats package includes the functions mean, sd, median and others that share a name with functions in the base R package. This is problematic if (i) you attempt to use the syntax of, say, the mean function from tigerstats, as shown in the preceding chunks, and (ii) you have not loaded the tigerstats package. R will think you’re trying to call its base mean function, which does not work the same way as the tigerstats version.
Getting started:
- read.csv
- url
- library
- getwd
Data frame structure:
- names
- head
- str
- inspect (tigerstats / mosaic packages)
Frequency tables:
- xtabs
- prop.table
- rowPerc (tigerstats)
- colPerc (tigerstats)
Graphs:
- barchartGC (tigerstats / lattice packages)
- histogram (tigerstats / lattice packages)
Descriptive stats:
- favstats (tigerstats and mosaic packages)
- mean
- median
- var
- sd
- IQR