This page was last updated on September 25, 2018.
In this tutorial you will learn how to:
If you want a tutorial that really cuts to the chase, check this one out.
This tutorial assumes that software such as Microsoft Excel is used to initially enter and store data.
To be completed
In the meantime, it is recommended that you read about “tidy data” Hadley Wickham’s “R for Datascience” online book, linked here.
The following are good rules of thumb to follow when setting up your spreadsheet to store your data:
To be completed
In the meantime, here are some useful resources:
A short tutorial and less on why “long” format is preferred over “wide” format.
Read about “tidy data” Hadley Wickham’s “R for Datascience” online book, linked here.
You can tidy your data using the dplyr
package. Its associated cheatsheet is available here.
First, download an example Excel file here called “example_data.xlsx”. It includes data typical of a concentration-response exercise.
Then, open the file in Excel to have a look. You’ll see three sets of data, one for each substance being analyzed (NaCl, Sucrose, and CaCl2), some calculated statistics associated with the measured values (mean and standard error of the mean), and finally some response curves that have been fit. Here is a screenshot of what you’ll see:
We need to re-format these data (i.e. make them “tidy”) in order to import into R.
Typically it is easier to enter data in “wide” format, so we’ll assume this is the case here. This is how the above spreadsheet should be organized for import into R:
Note that we don’t include any derived values like the mean and standard error… these will be calculated in R.
You can find additional help on importing different types of files here.
Download the CSV file from here to your working directory.
NOTE: If your spreadsheet includes blank cells (i.e. missing values), R will automatically covert these to “NA” values.
Once it is saved in your working directory, we use the read.csv
function to import the data. We’ll import it into a data frame called “example.data”:
example.data <- read.csv(file = "example_data.csv",
header = TRUE)
Now we can have a look at the data frame, first exploring its structure using the str
function, then looking at the first few rows with the head
function:
str(example.data)
## 'data.frame': 21 obs. of 7 variables:
## $ substance: Factor w/ 3 levels "CaCl2","NaCl",..: 2 2 2 2 2 2 2 3 3 3 ...
## $ conc : int 300 250 200 150 100 50 0 300 250 200 ...
## $ rep1 : num 97.4 97.3 96.2 94.8 10 0.7 0.1 97.2 97.3 97.4 ...
## $ rep2 : num 97.3 97.9 93.8 93.4 29.5 1.1 0.5 98.5 96.1 54 ...
## $ rep3 : num 98.4 96.9 97.3 95.4 11.1 5.9 1.3 97.6 90.7 76.7 ...
## $ rep4 : num 93 92.1 91.7 90.6 17.7 6.2 1.2 96.5 84.4 62.9 ...
## $ rep5 : num 91.2 95.8 93.9 95.1 16.9 1.6 0.6 98.3 98 83.4 ...
and:
head(example.data)
## substance conc rep1 rep2 rep3 rep4 rep5
## 1 NaCl 300 97.4 97.3 98.4 93.0 91.2
## 2 NaCl 250 97.3 97.9 96.9 92.1 95.8
## 3 NaCl 200 96.2 93.8 97.3 91.7 93.9
## 4 NaCl 150 94.8 93.4 95.4 90.6 95.1
## 5 NaCl 100 10.0 29.5 11.1 17.7 16.9
## 6 NaCl 50 0.7 1.1 5.9 6.2 1.6
We first need to load the tidyr
package (install it if you haven’t already):
library(tidyr)
Now we’ll use the gather
function for this:
?gather
Here’s the appropriate code for our current example, and we’ll create a new data frame called “long.data”:
long.data <- gather(example.data, replicate, value, rep1:rep5, factor_key = TRUE)
Now let’s look at the resulting data frame:
head(long.data)
## substance conc replicate value
## 1 NaCl 300 rep1 97.4
## 2 NaCl 250 rep1 97.3
## 3 NaCl 200 rep1 96.2
## 4 NaCl 150 rep1 94.8
## 5 NaCl 100 rep1 10.0
## 6 NaCl 50 rep1 0.7
We could now do some analyses on these data, as shown in the next section.
As we learned in another tutorial, we can calculate descriptive statistics (e.g. mean, standard deviation) for a numeric variable grouped by categories. In the present example, we may wish to calculate the mean and standard deviation for each combination of concentration and substance.
To do this, load the tigerstats
library (install it if you haven’t already).
library(tigerstats)
Now we’ll use the favstats
function to calculate some descriptive statistics for the response variable value
, grouped by each combination of conc
(concentration) and substance
.
favstats(value ~ conc + substance, data = long.data)
## conc.substance min Q1 median Q3 max mean sd n missing
## 1 0.CaCl2 0.3 0.5 0.5 0.5 1.0 0.56 0.2607681 5 0
## 2 50.CaCl2 0.3 0.7 0.8 0.9 3.0 1.14 1.0644247 5 0
## 3 100.CaCl2 91.1 91.2 91.3 94.3 94.3 92.44 1.6994117 5 0
## 4 150.CaCl2 92.1 93.0 93.2 94.4 94.4 93.42 0.9859006 5 0
## 5 200.CaCl2 85.8 90.6 94.6 95.8 96.5 92.66 4.4629587 5 0
## 6 250.CaCl2 94.3 96.7 97.1 98.5 98.6 97.04 1.7458522 5 0
## 7 300.CaCl2 92.9 96.2 97.7 98.2 99.3 96.86 2.4785076 5 0
## 8 0.NaCl 0.1 0.5 0.6 1.2 1.3 0.74 0.5029911 5 0
## 9 50.NaCl 0.7 1.1 1.6 5.9 6.2 3.10 2.7138533 5 0
## 10 100.NaCl 10.0 11.1 16.9 17.7 29.5 17.04 7.7548694 5 0
## 11 150.NaCl 90.6 93.4 94.8 95.1 95.4 93.86 1.9768662 5 0
## 12 200.NaCl 91.7 93.8 93.9 96.2 97.3 94.58 2.2015903 5 0
## 13 250.NaCl 92.1 95.8 96.9 97.3 97.9 96.00 2.3108440 5 0
## 14 300.NaCl 91.2 93.0 97.3 97.4 98.4 95.46 3.1619614 5 0
## 15 0.Sucrose 0.2 0.3 0.4 0.5 0.8 0.44 0.2302173 5 0
## 16 50.Sucrose 0.4 0.5 0.7 0.7 1.0 0.66 0.2302173 5 0
## 17 100.Sucrose 0.5 0.7 0.7 0.9 1.5 0.86 0.3847077 5 0
## 18 150.Sucrose 7.6 8.6 8.7 9.0 11.3 9.04 1.3685759 5 0
## 19 200.Sucrose 54.0 62.9 76.7 83.4 97.4 74.88 17.0480791 5 0
## 20 250.Sucrose 84.4 90.7 96.1 97.3 98.0 93.30 5.7423863 5 0
## 21 300.Sucrose 96.5 97.2 97.6 98.3 98.5 97.62 0.8167007 5 0
Or, we could use the much better approach, using the dplyr
package:
library(dplyr)
Here we provide the mean and standard deviation of value
grouped according to conc
and substance
:
long.data %>%
group_by(conc, substance) %>%
summarize(mean_val = mean(value, na.rm = TRUE), sd_val = sd(value, na.rm = T))
## # A tibble: 21 x 4
## # Groups: conc [?]
## conc substance mean_val sd_val
## <int> <fct> <dbl> <dbl>
## 1 0 CaCl2 0.560 0.261
## 2 0 NaCl 0.740 0.503
## 3 0 Sucrose 0.440 0.230
## 4 50 CaCl2 1.14 1.06
## 5 50 NaCl 3.10 2.71
## 6 50 Sucrose 0.660 0.230
## 7 100 CaCl2 92.4 1.70
## 8 100 NaCl 17.0 7.75
## 9 100 Sucrose 0.860 0.385
## 10 150 CaCl2 93.4 0.986
## # ... with 11 more rows
The output is a tibble
(see R for Data Science book).
To calculate the standard error of value
, grouped by each combination of conc
(concentration) and substance
, we need to do a bit more work. We first our own standard error function, then use the dplyr
package to apply the function across the group combinations.
First, create a function for the standard error (consult this tutorial for a refresher):
standard.error <- function(x){sd(na.omit(x))/sqrt(length(na.omit(x)))}
Now we can apply this function to the group combinations using this code:
long.data %>%
group_by(conc, substance) %>%
summarize(mean_trans = mean(value, na.rm = TRUE), st_err = standard.error(value))
## # A tibble: 21 x 4
## # Groups: conc [?]
## conc substance mean_trans st_err
## <int> <fct> <dbl> <dbl>
## 1 0 CaCl2 0.560 0.117
## 2 0 NaCl 0.740 0.225
## 3 0 Sucrose 0.440 0.103
## 4 50 CaCl2 1.14 0.476
## 5 50 NaCl 3.10 1.21
## 6 50 Sucrose 0.660 0.103
## 7 100 CaCl2 92.4 0.760
## 8 100 NaCl 17.0 3.47
## 9 100 Sucrose 0.860 0.172
## 10 150 CaCl2 93.4 0.441
## # ... with 11 more rows
To be completed
In the meantime, consult this page where it uses tidyr
for this purpose.
TIP You can easily create a long-format data frame in R using the expand.grid
function:
?expand.grid
For the above example dataset, we could have prepared a data frame in R with the appropriate long format as follows:
long.data2 <- as.data.frame(expand.grid(conc = seq(from = 300, to = 0, by = -50), substance = c("NaCl", "Sucrose", "CaCl2"), replicate = c("rep1", "rep2", "rep3", "rep4", "rep5"), value = NA))
The result:
head(long.data2)
## conc substance replicate value
## 1 300 NaCl rep1 NA
## 2 250 NaCl rep1 NA
## 3 200 NaCl rep1 NA
## 4 150 NaCl rep1 NA
## 5 100 NaCl rep1 NA
## 6 50 NaCl rep1 NA
We can re-order the columns to look exactly like the original data frame:
long.data2 <- long.data2[,c(2,1,3,4)]
The result:
head(long.data2)
## conc substance replicate value
## 1 300 NaCl rep1 NA
## 2 250 NaCl rep1 NA
## 3 200 NaCl rep1 NA
## 4 150 NaCl rep1 NA
## 5 100 NaCl rep1 NA
## 6 50 NaCl rep1 NA
If you wished to use the long format data frame created above say, for example, to enter data, you could export it to CSV first, open it in Excel, then enter your measurements under the “value” column.
Expore the data frame with the write.csv
command:
?write.csv
write.csv(long.data2, file = "longdata.csv", row.names = FALSE)