This page was last updated on September 25, 2018.


Getting started

In this tutorial you will learn how to:

  • format your data for importing into R
  • Messy versus Tidy data
  • Naming variables
  • Wide versus Long format
  • Saving your file in CSV (comma-separated values) format
  • Import your CSV file into a data frame in R

If you want a tutorial that really cuts to the chase, check this one out.


Formatting data

This tutorial assumes that software such as Microsoft Excel is used to initially enter and store data.

Messy versus Tidy data

To be completed

In the meantime, it is recommended that you read about “tidy data” Hadley Wickham’s “R for Datascience” online book, linked here.

Naming variables

The following are good rules of thumb to follow when setting up your spreadsheet to store your data:

  • the first row should contain the names of the variables
  • each variable should have its own column
  • below the “header” row that contains the variable names, each row should contain one observation
  • do not mix data types (e.g. numerical, categorical) within a single column;
  • the variable names should not include spaces or special characters; in lieu of a space, use an underscore “_"
  • use easily-interpreted variable names when possible, otherwise provide a separate “text” file that describes what each variable is
  • the file should be saved in “CSV” format
  • do not include spaces or special characters in the file name

Wide versus Long format

To be completed

In the meantime, here are some useful resources:

  • A short tutorial and less on why “long” format is preferred over “wide” format.

  • Read about “tidy data” Hadley Wickham’s “R for Datascience” online book, linked here.

You can tidy your data using the dplyr package. Its associated cheatsheet is available here.


Preparing the data for import

First, download an example Excel file here called “example_data.xlsx”. It includes data typical of a concentration-response exercise.

Then, open the file in Excel to have a look. You’ll see three sets of data, one for each substance being analyzed (NaCl, Sucrose, and CaCl2), some calculated statistics associated with the measured values (mean and standard error of the mean), and finally some response curves that have been fit. Here is a screenshot of what you’ll see:


We need to re-format these data (i.e. make them “tidy”) in order to import into R.

Typically it is easier to enter data in “wide” format, so we’ll assume this is the case here. This is how the above spreadsheet should be organized for import into R:


  1. Reformat the Excel spreadsheet to look like the picture above, then save it as a CSV file.

Note that we don’t include any derived values like the mean and standard error… these will be calculated in R.


Importing data from a CSV file

You can find additional help on importing different types of files here.

Download the CSV file from here to your working directory.

NOTE: If your spreadsheet includes blank cells (i.e. missing values), R will automatically covert these to “NA” values.

Once it is saved in your working directory, we use the read.csv function to import the data. We’ll import it into a data frame called “example.data”:

example.data <- read.csv(file = "example_data.csv", 
                         header = TRUE)

Now we can have a look at the data frame, first exploring its structure using the str function, then looking at the first few rows with the head function:

str(example.data)
## 'data.frame':    21 obs. of  7 variables:
##  $ substance: Factor w/ 3 levels "CaCl2","NaCl",..: 2 2 2 2 2 2 2 3 3 3 ...
##  $ conc     : int  300 250 200 150 100 50 0 300 250 200 ...
##  $ rep1     : num  97.4 97.3 96.2 94.8 10 0.7 0.1 97.2 97.3 97.4 ...
##  $ rep2     : num  97.3 97.9 93.8 93.4 29.5 1.1 0.5 98.5 96.1 54 ...
##  $ rep3     : num  98.4 96.9 97.3 95.4 11.1 5.9 1.3 97.6 90.7 76.7 ...
##  $ rep4     : num  93 92.1 91.7 90.6 17.7 6.2 1.2 96.5 84.4 62.9 ...
##  $ rep5     : num  91.2 95.8 93.9 95.1 16.9 1.6 0.6 98.3 98 83.4 ...

and:

head(example.data)
##   substance conc rep1 rep2 rep3 rep4 rep5
## 1      NaCl  300 97.4 97.3 98.4 93.0 91.2
## 2      NaCl  250 97.3 97.9 96.9 92.1 95.8
## 3      NaCl  200 96.2 93.8 97.3 91.7 93.9
## 4      NaCl  150 94.8 93.4 95.4 90.6 95.1
## 5      NaCl  100 10.0 29.5 11.1 17.7 16.9
## 6      NaCl   50  0.7  1.1  5.9  6.2  1.6

Converting from wide to long format

We first need to load the tidyr package (install it if you haven’t already):

library(tidyr)

Now we’ll use the gather function for this:

?gather

Here’s the appropriate code for our current example, and we’ll create a new data frame called “long.data”:

long.data <- gather(example.data, replicate, value, rep1:rep5, factor_key = TRUE)

Now let’s look at the resulting data frame:

head(long.data)
##   substance conc replicate value
## 1      NaCl  300      rep1  97.4
## 2      NaCl  250      rep1  97.3
## 3      NaCl  200      rep1  96.2
## 4      NaCl  150      rep1  94.8
## 5      NaCl  100      rep1  10.0
## 6      NaCl   50      rep1   0.7

We could now do some analyses on these data, as shown in the next section.


Example analyses on long format data

As we learned in another tutorial, we can calculate descriptive statistics (e.g. mean, standard deviation) for a numeric variable grouped by categories. In the present example, we may wish to calculate the mean and standard deviation for each combination of concentration and substance.

To do this, load the tigerstats library (install it if you haven’t already).

library(tigerstats)

Now we’ll use the favstats function to calculate some descriptive statistics for the response variable value, grouped by each combination of conc (concentration) and substance.

favstats(value ~ conc + substance, data = long.data)
##    conc.substance  min   Q1 median   Q3  max  mean         sd n missing
## 1         0.CaCl2  0.3  0.5    0.5  0.5  1.0  0.56  0.2607681 5       0
## 2        50.CaCl2  0.3  0.7    0.8  0.9  3.0  1.14  1.0644247 5       0
## 3       100.CaCl2 91.1 91.2   91.3 94.3 94.3 92.44  1.6994117 5       0
## 4       150.CaCl2 92.1 93.0   93.2 94.4 94.4 93.42  0.9859006 5       0
## 5       200.CaCl2 85.8 90.6   94.6 95.8 96.5 92.66  4.4629587 5       0
## 6       250.CaCl2 94.3 96.7   97.1 98.5 98.6 97.04  1.7458522 5       0
## 7       300.CaCl2 92.9 96.2   97.7 98.2 99.3 96.86  2.4785076 5       0
## 8          0.NaCl  0.1  0.5    0.6  1.2  1.3  0.74  0.5029911 5       0
## 9         50.NaCl  0.7  1.1    1.6  5.9  6.2  3.10  2.7138533 5       0
## 10       100.NaCl 10.0 11.1   16.9 17.7 29.5 17.04  7.7548694 5       0
## 11       150.NaCl 90.6 93.4   94.8 95.1 95.4 93.86  1.9768662 5       0
## 12       200.NaCl 91.7 93.8   93.9 96.2 97.3 94.58  2.2015903 5       0
## 13       250.NaCl 92.1 95.8   96.9 97.3 97.9 96.00  2.3108440 5       0
## 14       300.NaCl 91.2 93.0   97.3 97.4 98.4 95.46  3.1619614 5       0
## 15      0.Sucrose  0.2  0.3    0.4  0.5  0.8  0.44  0.2302173 5       0
## 16     50.Sucrose  0.4  0.5    0.7  0.7  1.0  0.66  0.2302173 5       0
## 17    100.Sucrose  0.5  0.7    0.7  0.9  1.5  0.86  0.3847077 5       0
## 18    150.Sucrose  7.6  8.6    8.7  9.0 11.3  9.04  1.3685759 5       0
## 19    200.Sucrose 54.0 62.9   76.7 83.4 97.4 74.88 17.0480791 5       0
## 20    250.Sucrose 84.4 90.7   96.1 97.3 98.0 93.30  5.7423863 5       0
## 21    300.Sucrose 96.5 97.2   97.6 98.3 98.5 97.62  0.8167007 5       0

Or, we could use the much better approach, using the dplyr package:

library(dplyr)

Here we provide the mean and standard deviation of value grouped according to conc and substance:

long.data %>%
  group_by(conc, substance) %>%
  summarize(mean_val = mean(value, na.rm = TRUE), sd_val = sd(value, na.rm = T))
## # A tibble: 21 x 4
## # Groups:   conc [?]
##     conc substance mean_val sd_val
##    <int> <fct>        <dbl>  <dbl>
##  1     0 CaCl2        0.560  0.261
##  2     0 NaCl         0.740  0.503
##  3     0 Sucrose      0.440  0.230
##  4    50 CaCl2        1.14   1.06 
##  5    50 NaCl         3.10   2.71 
##  6    50 Sucrose      0.660  0.230
##  7   100 CaCl2       92.4    1.70 
##  8   100 NaCl        17.0    7.75 
##  9   100 Sucrose      0.860  0.385
## 10   150 CaCl2       93.4    0.986
## # ... with 11 more rows

The output is a tibble (see R for Data Science book).

Standard error by group

To calculate the standard error of value, grouped by each combination of conc (concentration) and substance, we need to do a bit more work. We first our own standard error function, then use the dplyr package to apply the function across the group combinations.

First, create a function for the standard error (consult this tutorial for a refresher):

standard.error <- function(x){sd(na.omit(x))/sqrt(length(na.omit(x)))}

Now we can apply this function to the group combinations using this code:

long.data %>%
  group_by(conc, substance) %>%
  summarize(mean_trans = mean(value, na.rm = TRUE), st_err = standard.error(value))
## # A tibble: 21 x 4
## # Groups:   conc [?]
##     conc substance mean_trans st_err
##    <int> <fct>          <dbl>  <dbl>
##  1     0 CaCl2          0.560  0.117
##  2     0 NaCl           0.740  0.225
##  3     0 Sucrose        0.440  0.103
##  4    50 CaCl2          1.14   0.476
##  5    50 NaCl           3.10   1.21 
##  6    50 Sucrose        0.660  0.103
##  7   100 CaCl2         92.4    0.760
##  8   100 NaCl          17.0    3.47 
##  9   100 Sucrose        0.860  0.172
## 10   150 CaCl2         93.4    0.441
## # ... with 11 more rows

Converting from wide to long format

To be completed

In the meantime, consult this page where it uses tidyr for this purpose.


The “expand.grid” function

TIP You can easily create a long-format data frame in R using the expand.grid function:

?expand.grid

For the above example dataset, we could have prepared a data frame in R with the appropriate long format as follows:

long.data2 <- as.data.frame(expand.grid(conc = seq(from = 300, to = 0, by = -50), substance = c("NaCl", "Sucrose", "CaCl2"), replicate = c("rep1", "rep2", "rep3", "rep4", "rep5"), value = NA))

The result:

head(long.data2)
##   conc substance replicate value
## 1  300      NaCl      rep1    NA
## 2  250      NaCl      rep1    NA
## 3  200      NaCl      rep1    NA
## 4  150      NaCl      rep1    NA
## 5  100      NaCl      rep1    NA
## 6   50      NaCl      rep1    NA

We can re-order the columns to look exactly like the original data frame:

long.data2 <- long.data2[,c(2,1,3,4)]

The result:

head(long.data2)
##   conc substance replicate value
## 1  300      NaCl      rep1    NA
## 2  250      NaCl      rep1    NA
## 3  200      NaCl      rep1    NA
## 4  150      NaCl      rep1    NA
## 5  100      NaCl      rep1    NA
## 6   50      NaCl      rep1    NA

Exporting to CSV file

If you wished to use the long format data frame created above say, for example, to enter data, you could export it to CSV first, open it in Excel, then enter your measurements under the “value” column.

Expore the data frame with the write.csv command:

?write.csv
write.csv(long.data2, file = "longdata.csv", row.names = FALSE)