Most statistical tests, such as the t-test, ANOVA, Pearson correlation, and least-squares regression, have assumptions that must be met. For example, the one-sample t-test requires that the variable is normally distributed in the population, and least-squares regression requires that the residuals from the regression be normally distributed. In this tutorial we’ll learn ways to check the assumption that the variable is normally distributed in the population.
We’ll also learn how transforming a variable can sometimes help satisfy assumptions, in which case the analysis is conducted on the transformed variable.
We’ll need the following packages: tigerstats, car, and arm.
If you don’t already have these installed, type this in the console:
install.packages("tigerstats", dependencies = T)
install.packages("car", dependencies = T)
install.packages("arm", dependencies = T)
Load the packages:
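library(tigerstats)
library(car)
library(arm)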
The “marine.csv” dataset is discussed in Example 13.1 of the textbook. The “flowers.csv” dataset is described below. The “students.csv” dataset includes data about BIOL202 students from a few years ago.
marine <- read.csv(url("https://people.ok.ubc.ca/jpither/datasets/marine.csv"), header = TRUE)
flowers <- read.csv(url("https://people.ok.ubc.ca/jpither/datasets/flowers.csv"), header = TRUE)
students <- read.csv(url("https://people.ok.ubc.ca/jpither/datasets/students.csv"), header = TRUE)
Look at the first handful of rows of each data frame using the head function, and summarize the quantitative variables using the inspect function from the tigerstats package.
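A minimal sketch of the calls (only the inspect summaries for the marine and flowers data frames are reproduced below):

head(marine) # first six rows of each data frame (output not shown)
head(flowers)
head(students)
inspect(marine) # summaries of the quantitative variables
inspect(flowers)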
##
## quantitative variables:
## name class min Q1 median Q3 max mean sd n
## 1 biomassRatio numeric 0.83 1.2675 1.49 1.85 4.25 1.734375 0.7483312 32
## missing
## 1 0
##
## quantitative variables:
## name class min Q1 median Q3 max
## 1 propFertile numeric 0.005894604 0.1257379 0.4866791 0.7457393 0.9874335
## mean sd n missing
## 1 0.4734709 0.3316103 30 0
We can’t know for certain whether a variable is normally distributed in the population, but given a proper random sample from the population, of sufficient sample size, the frequency distribution of our sample data should, to a reasonable degree, reflect the frequency distribution of the variable in the population.
The most straightforward way to check the normality assumption is to visualize the data using a histogram, in combination with something called a normal quantile plot.
The R function to produce a normal quantile plot is qqnorm:
?qqnorm
For details about what Normal Quantile Plots are, and how they’re constructed, consult this informative link.
And we also typically want to add a line to the normal quantile plot, using the qqline function, as seen below (TIP: note the par function, which allows one to align figures side by side):
par(mfrow = c(1,2)) # align figures side by side
hist(marine$biomassRatio, main = "",
las = 1, col = "grey",
xlab = "Biomass ratio",
ylab = "Frequency")
qqnorm(marine$biomassRatio, main = "",
las = 1,
ylab = "Biomass ratio",
xlab = "Normal quantile")
qqline(marine$biomassRatio)
Figure 1: The frequency distribution of the ‘biomass ratio’ of 32 marine reserves (left) and the corresponding normal quantile plot (right)
The histogram of biomass ratios shows a right-skewed frequency distribution, and the corresponding normal quantile plot shows points deviating substantially from the straight line. If the frequency distribution were normal, the points would fall close to the straight line in the quantile plot.
The frequency distribution of this variable obviously does not conform to a normal distribution.
Although graphical assessments are usually sufficient for checking the normality assumption, one can conduct a formal statistical test of the null hypothesis that the data are sampled from a population having a normal distribution. The test is called the Shapiro-Wilk test.
The Shapiro-Wilk test is a type of goodness-of-fit test.
The null and alternative hypotheses are as follows:
H0: The data are sampled from a population having a normal distribution.
HA: The data are sampled from a population having a non-normal distribution.
Although we’ve already determined, using graphs, that the biomassRatio variable in the marine dataset is not normally distributed, let’s conduct a Shapiro-Wilk test anyway, using the shapiro.test function:
?shapiro.test
Here we go:
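shapiro.test(marine$biomassRatio)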
##
## Shapiro-Wilk normality test
##
## data: marine$biomassRatio
## W = 0.81751, p-value = 8.851e-05
It returns a test statistic “W”, and an associated P-value.
We see here that the P-value is very small (< 0.001), so we reject the null hypothesis and conclude that the biomass ratio data are drawn from a non-normal population (Shapiro-Wilk test; W = 0.82, P < 0.001).
Some statistical tests, such as the 2-sample t-test and ANOVA, assume equal variance among groups.
For this, we use Levene’s test, which we implement using the leveneTest function from the car package:
?leveneTest
This tests the null hypothesis that the variances are equal among the groups.
The null and alternative hypotheses are as follows:
H0: The two populations have equal variance (\(\sigma_1^2 = \sigma_2^2\)).
HA: The two populations do not have equal variance (\(\sigma_1^2 \neq \sigma_2^2\)).
We’ll use the “students” data, and check whether height_cm exhibits equal variance among males and females.
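A sketch of the call that produces the output below, assuming the grouping variable in the students data frame is named sex (substitute the actual column name):

leveneTest(height_cm ~ sex, data = students) # 'sex' is an assumed column name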
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 0 0.999
## 152
Levene’s test uses a test statistic “F”, and we see here that the P-value associated with the test statistic is almost 1, so clearly not significant.
We state “A Levene’s test showed no evidence against the assumption of equal variance (F = 0; P-value = 0.999).”
Here we learn how to transform numeric variables using two common methods: the log transformation and the logit transformation.
There are many other types of transformations that can be performed, some of which are described in Chapter 13 of the course textbook.
More discussion about transformations can be found at this useful but somewhat dated website.
Note that it is often better to use the “logit” transformation rather than the “arcsin square-root” transformation for proportion or percentage data, as described in this article by Warton and Hui (2011).
When one observes a right-skewed frequency distribution, as seen in the marine biomass ratio data above, a log-transformation often helps.
To log-transform the data, simply create a new variable in the data frame, say logbiomass, and use the log function like so:
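marine$logbiomass <- log(marine$biomassRatio) # natural log of each value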
Let’s look at the histogram of the transformed data:
par(mfrow = c(1,2)) # align figures side by side
hist(marine$logbiomass, main = "",
las = 1, col = "grey",
xlab = "Biomass ratio (log-transformed)",
ylab = "Frequency")
qqnorm(marine$logbiomass, main = "",
las = 1,
ylab = "Biomass ratio (log)",
xlab = "Normal quantile")
qqline(marine$logbiomass)
Figure 2: The frequency distribution of the ‘biomass ratio’ (log-transformed) of 32 marine reserves (left) and the corresponding normal quantile plot (right)
The log-transform definitely helped, but the distribution still looks a bit wonky.
Just to be sure, let’s conduct a Shapiro-Wilk test, using an \(\alpha\) level of 0.05:
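shapiro.test(marine$logbiomass)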
##
## Shapiro-Wilk normality test
##
## data: marine$logbiomass
## W = 0.93795, p-value = 0.06551
The P-value is greater than 0.05, so we’d conclude that there’s not sufficient evidence to reject the null hypothesis that these data fit a normal distribution.
Importantly, with decent sample sizes (here, 32), tests such as the t-test are robust to slight non-normality.
If you try to log-transform a value of zero, R will return a -Inf value.
In this case, you’ll need to add a constant to each observation; the convention is to simply add 1 to each value prior to log-transforming.
In fact, you can add any constant that makes the data conform best to the assumptions once log-transformed. The key is that you must add the same constant to every value in the variable.
You then conduct the analyses using these newly transformed data, remembering that after back-transformation (see below) you need to subtract the same constant to get back to the original scale.
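For example, here’s a minimal sketch using made-up values that include a zero:

x <- c(0, 1, 9) # hypothetical data containing a zero
log_x <- log(x + 1) # add the constant 1 before log-transforming
exp(log_x) - 1 # back-transform, then subtract the same constant; returns 0 1 9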
The log function calculates the natural logarithm (base e), but related functions permit any base:
?log
For instance, log10 uses log base 10:
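log10(c(1, 10, 100)) # returns 0 1 2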
In order to back-transform data that were transformed using the natural logarithm (log), you make use of the exp function:
?exp
Let’s try it, and plot the original untransformed data against the back-transformed data; the points should fall along the straight 1:1 line of equality in a scatterplot.
First, back-transform the data and store the results in a new variable within the data frame:
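marine$biomass_back_transformed <- exp(marine$logbiomass) # e raised to each log value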
Now plot these against the original data frame variable:
plot(biomass_back_transformed ~ biomassRatio, data = marine,
ylab = "Back-transformed data",
xlab = "Original untransformed data",
las = 1)
Figure 3. The original, raw biomass data (x-axis) plotted against back-transformed data (y-axis)
Yup, it worked!
If you had used the log base 10 transformation, creating (say) a new variable log10biomass as follows:
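marine$log10biomass <- log10(marine$biomassRatio) # 'log10biomass' is an assumed name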
then this is how you back-transform:
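10^marine$log10biomass # 10 raised to each value recovers the original data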
## [1] 1.34 1.96 2.49 1.27 1.19 1.15 1.29 1.05 1.10 1.21 1.31 1.26 1.38 1.49
## [15] 1.84 1.84 3.06 2.65 4.25 3.35 2.55 1.72 1.52 1.49 1.67 1.78 1.71 1.88
## [29] 0.83 1.16 1.31 1.40
The ^ symbol stands for “exponent”. So here we’re calculating 10 to the exponent x, where x is each value in the dataset.
For data that are proportions or percentages, it is recommended that they be logit-transformed.
The arm package includes both the logit function and the invlogit function, the latter for back-transforming.
However, the logit function in the car package is better, because it accommodates the possibility that your dataset includes a zero and / or a one (equivalently, zero or 100 percent), and has a mechanism to deal with this properly.
The logit function in the arm package does not deal with this possibility for you.
However, the car package does not have a function that will back-transform logit-transformed data.
This is why we’ll use the logit function from the car package, and the invlogit function from the arm package!
Let’s see how it works with the flowers dataset, which includes a variable propFertile that describes the proportion of seeds produced by individual plants that were fertilized.
Let’s visualize the data:
par(mfrow = c(1,2)) # align figures side by side
hist(flowers$propFertile, main = "",
las = 1, col = "grey",
xlab = "Proportion of seeds that were fertilized",
ylab = "Frequency")
qqnorm(flowers$propFertile, main = "",
las = 1,
ylab = "Proportion of seeds that were fertilized",
xlab = "Normal quantile")
qqline(flowers$propFertile)
Figure 4: The frequency distribution of the proportion of seeds fertilized on 30 plants (left) and the corresponding normal quantile plot (right)
Now let’s logit-transform the data.
To ensure that we’re using the correct logit function, i.e. the one from the car package and NOT from the arm package, we can use the :: syntax, with the package name preceding the double colon; this tells R the correct package to use, as follows:
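flowers$logitfertile <- car::logit(flowers$propFertile) # car's logit adjusts for any 0s or 1s in the data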
Now let’s visualize the transformed data:
par(mfrow = c(1,2)) # align figures side by side
hist(flowers$logitfertile, main = "",
las = 1, col = "grey",
xlab = "Proportion of seeds that were fertilized (logit)",
ylab = "Frequency")
qqnorm(flowers$logitfertile, main = "",
las = 1,
ylab = "Proportion of seeds that were fertilized (logit)",
xlab = "Normal quantile")
qqline(flowers$logitfertile)
Figure 5: The frequency distribution of the proportion of seeds fertilized (logit transformed) on 30 plants (left) and the corresponding normal quantile plot (right)
That’s much better!
We’ll use the invlogit function from the arm package:
?invlogit
First do the back-transform, storing the results in a new object:
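flower_backtransformed <- invlogit(flowers$logitfertile) # invlogit from the arm package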
Let’s plot the back-transformed data against the original data, and see if they match:
plot(flower_backtransformed ~ propFertile, data = flowers,
ylab = "Back-transformed data",
xlab = "Original untransformed data",
las = 1)
Figure 6. The original, raw flower data (x-axis) plotted against back-transformed data (y-axis)
Yup, it worked!
You should back-transform your data when it makes sense to communicate findings on the original measurement scale.
The most common example is reporting confidence intervals for a mean or difference in means.
If you conduct a least-squares regression analysis on transformed data, you do not generally need to back-transform any of the results, but you should make clear that the data are transformed (and what transformation you used). In some instances, such as when reporting specific predicted values from the regression, it may make sense to back-transform.
Getting started: read.csv, url, library
Testing assumptions: head, hist, qqnorm, qqline, par, plot, shapiro.test, leveneTest (from the car package), inspect (from the tigerstats package)
Transformations: log, log10, logit (from the car package), invlogit (from the arm package), exp