library(tidyverse)
theme_set(theme_minimal())
# set seed for reproducibility
set.seed(1234)
readr vs. base R

One of the main advantages of readr functions over base R functions is that they are typically much faster. For example, let's import a randomly generated CSV file with 50,000 rows and ncol(data_small) columns. How long does it take read.csv() to import this file vs. readr::read_csv()? To assess the differences, we use the microbenchmark package to run each function 100 times and compare the distributions of the time it takes to complete the data import:
library(microbenchmark)
results_small <- microbenchmark(
  read.csv = read.csv("data/sim-data-small.csv"),
  read_csv = read_csv("data/sim-data-small.csv")
)
autoplot(results_small) +
  scale_y_log10(breaks = c(25, 50, 100)) +
  labs(y = "Time [milliseconds], logged")
read_csv() is over 5 times faster than read.csv(). Of course, with relatively small data files this isn't a large difference in absolute terms (a difference of 100 milliseconds is only 0.1 seconds). However, as the data file increases in size, the performance savings will be much larger. Consider the same test with a CSV file with 5,000,000 rows:
results_large <- microbenchmark(
  read.csv = read.csv("data/sim-data-large.csv"),
  read_csv = read_csv("data/sim-data-large.csv")
)
autoplot(results_large) +
  scale_y_log10(breaks = c(2.5, 5, 10, 20)) +
  labs(y = "Time [seconds], logged")
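The benchmarks above read pre-generated files from data/. If you want to reproduce them, files of roughly this shape can be created with a sketch like the following (the column count and file paths are my assumptions, not necessarily what was used to build the original files):

```r
library(tidyverse)

set.seed(1234)

# simulate a tibble of random normal draws; 10 columns is an assumption
sim_data <- function(n_rows, n_cols = 10) {
  as_tibble(as.data.frame(replicate(n_cols, rnorm(n_rows))))
}

# the small file used in the first benchmark
write_csv(sim_data(5e4), "data/sim-data-small.csv")

# the large file used in the second benchmark (several hundred MB in memory)
write_csv(sim_data(5e6), "data/sim-data-large.csv")
```

Note that the large file takes a while to generate and write; for a quick sanity check, start with the small one.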
Here read_csv() is far superior to read.csv().

CSV files, while common, are not the only data storage format you will encounter in the wild. Here is a quick primer on other file formats you may encounter and how to import/export them using R. We'll use the challenge dataset in readr to demonstrate some of these formats.
challenge <- read_csv(
  readr_example("challenge.csv"),
  col_types = cols(
    x = col_double(),
    y = col_date()
  )
)
challenge
## # A tibble: 2,000 x 2
## x y
## <dbl> <date>
## 1 404 NA
## 2 4172 NA
## 3 3004 NA
## 4 787 NA
## 5 37 NA
## 6 2332 NA
## 7 2489 NA
## 8 1449 NA
## 9 3665 NA
## 10 3863 NA
## # ... with 1,990 more rows
rds

RDS is a custom binary format used exclusively by R to store data objects.
# write to csv
write_csv(challenge, "data/challenge.csv")
# write to/read from rds
write_rds(challenge, "data/challenge.rds")
read_rds("data/challenge.rds")
## # A tibble: 2,000 x 2
## x y
## <dbl> <date>
## 1 404 NA
## 2 4172 NA
## 3 3004 NA
## 4 787 NA
## 5 37 NA
## 6 2332 NA
## 7 2489 NA
## 8 1449 NA
## 9 3665 NA
## 10 3863 NA
## # ... with 1,990 more rows
# compare file size
file.info("data/challenge.csv")$size %>%
utils:::format.object_size("auto")
## [1] "37.1 Kb"
file.info("data/challenge.rds")$size %>%
utils:::format.object_size("auto")
## [1] "31.9 Kb"
# compare read speeds
microbenchmark(
  read_csv = read_csv(
    readr_example("challenge.csv"),
    col_types = cols(
      x = col_double(),
      y = col_date()
    )
  ),
  read_rds = read_rds("data/challenge.rds")
) %>%
  autoplot() +
  labs(y = "Time [microseconds], logged")
By default, write_rds() does not compress the .rds file; use the compress argument to apply one of several different compression algorithms. read_rds() is noticeably faster than read_csv(), and also has the benefit of preserving column types. The downside is that RDS is only implemented by R; it is not used by any other program, so if you need to import/export data files into other languages like Python (or open them in Excel), RDS is not a good storage format.
feather
The feather package implements a binary file format that is cross-compatible with many different programming languages:
library(feather)
write_feather(challenge, "data/challenge.feather")
read_feather("data/challenge.feather")
## # A tibble: 2,000 x 2
## x y
## <dbl> <date>
## 1 404. NA
## 2 4172. NA
## 3 3004. NA
## 4 787. NA
## 5 37. NA
## 6 2332. NA
## 7 2489. NA
## 8 1449. NA
## 9 3665. NA
## 10 3863. NA
## # ... with 1,990 more rows
# compare read speeds
microbenchmark(
  read_csv = read_csv(
    readr_example("challenge.csv"),
    col_types = cols(
      x = col_double(),
      y = col_date()
    )
  ),
  read_rds = read_rds("data/challenge.rds"),
  read_feather = read_feather("data/challenge.feather")
) %>%
  autoplot() +
  scale_y_continuous(labels = scales::comma) +
  labs(y = "Time [microseconds], logged")
feather is generally faster than RDS and read_csv(). Furthermore, it has native support for Python, R, and Julia, so if you develop an analytics pipeline that switches between R and Python, you can import/export data files in .feather without any loss of information.
readxl

readxl enables you to read (but not write) Excel files (.xls and .xlsx).
library(readxl)
xlsx_example <- readxl_example("datasets.xlsx")
read_excel(xlsx_example)
## # A tibble: 150 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ... with 140 more rows
The nice thing about readxl is that you can access multiple sheets from the workbook. List the sheet names with excel_sheets():
excel_sheets(xlsx_example)
## [1] "iris" "mtcars" "chickwts" "quakes"
Then specify which worksheet you want by name or number:
read_excel(xlsx_example, sheet = "chickwts")
## # A tibble: 71 x 2
## weight feed
## <dbl> <chr>
## 1 179 horsebean
## 2 160 horsebean
## 3 136 horsebean
## 4 227 horsebean
## 5 217 horsebean
## 6 168 horsebean
## 7 108 horsebean
## 8 124 horsebean
## 9 143 horsebean
## 10 140 horsebean
## # ... with 61 more rows
read_excel(xlsx_example, sheet = 2)
## # A tibble: 32 x 11
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## # ... with 22 more rows
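You can also import just part of a worksheet: read_excel()'s range and n_max arguments accept a cell range or a row cap, respectively. A quick sketch:

```r
library(readxl)

xlsx_example <- readxl_example("datasets.xlsx")

# read only cells A1:C6 of the mtcars sheet (header row + five data rows)
mtcars_small <- read_excel(xlsx_example, sheet = "mtcars", range = "A1:C6")

# or cap the number of data rows imported from a sheet
quakes_head <- read_excel(xlsx_example, sheet = "quakes", n_max = 10)
```

This is handy for spreadsheets where the actual data table is surrounded by titles, notes, or summary rows.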
haven
haven allows you to read and write data from other statistical packages such as SAS (.sas7bdat + .sas7bcat), SPSS (.sav + .por), and Stata (.dta).
library(haven)
# SAS
read_sas(system.file("examples", "iris.sas7bdat", package = "haven"))
## # A tibble: 150 x 5
## Sepal_Length Sepal_Width Petal_Length Petal_Width Species
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ... with 140 more rows
write_sas(mtcars, "data/mtcars.sas7bdat")
# SPSS
read_sav(system.file("examples", "iris.sav", package = "haven"))
## # A tibble: 150 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <dbl+lbl>
## 1 5.1 3.5 1.4 0.2 1
## 2 4.9 3.0 1.4 0.2 1
## 3 4.7 3.2 1.3 0.2 1
## 4 4.6 3.1 1.5 0.2 1
## 5 5.0 3.6 1.4 0.2 1
## 6 5.4 3.9 1.7 0.4 1
## 7 4.6 3.4 1.4 0.3 1
## 8 5.0 3.4 1.5 0.2 1
## 9 4.4 2.9 1.4 0.2 1
## 10 4.9 3.1 1.5 0.1 1
## # ... with 140 more rows
write_sav(mtcars, "data/mtcars.sav")
# Stata
read_dta(system.file("examples", "iris.dta", package = "haven"))
## # A tibble: 150 x 5
## sepallength sepalwidth petallength petalwidth species
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ... with 140 more rows
write_dta(mtcars, "data/mtcars.dta")
That said, if you can obtain your data file in a plain .csv or .tsv file format, I strongly recommend it. SAS, SPSS, and Stata files represent labeled data and missing values differently from R. haven attempts to bridge the gap and preserve as much information as possible, but I frequently find myself stripping out all the label information and rebuilding it using dplyr functions and the codebook for the data file.

Need to import a SAS, SPSS, or Stata data file? Read the documentation to learn how to best handle value labels and missing values.
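As one example of that cleanup, the Species column in the SPSS file above comes in as a labelled vector of numeric codes; haven::as_factor() converts the value labels into an ordinary R factor, which is often the first step:

```r
library(haven)
library(dplyr)

iris_spss <- read_sav(system.file("examples", "iris.sav", package = "haven"))

# replace the labelled numeric codes with a factor built from the value labels
iris_clean <- iris_spss %>%
  mutate(Species = as_factor(Species))

class(iris_clean$Species)
```

For more control (e.g. whether to use labels, values, or both as the factor levels), see the levels argument of as_factor().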
devtools::session_info()
## Session info -------------------------------------------------------------
## setting value
## version R version 3.4.3 (2017-11-30)
## system x86_64, darwin15.6.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## tz America/Chicago
## date 2018-04-09
## Packages -----------------------------------------------------------------
## package        * version    date       source
## assertthat       0.2.0      2017-04-11 CRAN (R 3.4.0)
## backports        1.1.2      2017-12-13 CRAN (R 3.4.3)
## base           * 3.4.3      2017-12-07 local
## bindr            0.1.1      2018-03-13 CRAN (R 3.4.3)
## bindrcpp         0.2.2.9000 2018-04-08 Github (krlmlr/bindrcpp@bd5ae73)
## broom            0.4.4      2018-03-29 CRAN (R 3.4.3)
## cellranger       1.1.0      2016-07-27 CRAN (R 3.4.0)
## cli              1.0.0      2017-11-05 CRAN (R 3.4.2)
## codetools        0.2-15     2016-10-05 CRAN (R 3.4.3)
## colorspace       1.3-2      2016-12-14 CRAN (R 3.4.0)
## compiler         3.4.3      2017-12-07 local
## crayon           1.3.4      2017-10-03 Github (gaborcsardi/crayon@b5221ab)
## datasets       * 3.4.3      2017-12-07 local
## devtools         1.13.5     2018-02-18 CRAN (R 3.4.3)
## digest           0.6.15     2018-01-28 CRAN (R 3.4.3)
## dplyr          * 0.7.4.9003 2018-04-08 Github (tidyverse/dplyr@b7aaa95)
## evaluate         0.10.1     2017-06-24 CRAN (R 3.4.1)
## feather        * 0.3.1      2016-11-09 CRAN (R 3.4.0)
## forcats        * 0.3.0      2018-02-19 CRAN (R 3.4.3)
## foreign          0.8-69     2017-06-22 CRAN (R 3.4.3)
## ggplot2        * 2.2.1      2016-12-30 CRAN (R 3.4.0)
## glue             1.2.0      2017-10-29 CRAN (R 3.4.2)
## graphics       * 3.4.3      2017-12-07 local
## grDevices      * 3.4.3      2017-12-07 local
## grid             3.4.3      2017-12-07 local
## gtable           0.2.0      2016-02-26 CRAN (R 3.4.0)
## haven          * 1.1.1      2018-01-18 CRAN (R 3.4.3)
## hms              0.4.2      2018-03-10 CRAN (R 3.4.3)
## htmltools        0.3.6      2017-04-28 CRAN (R 3.4.0)
## httr             1.3.1      2017-08-20 CRAN (R 3.4.1)
## jsonlite         1.5        2017-06-01 CRAN (R 3.4.0)
## knitr            1.20       2018-02-20 CRAN (R 3.4.3)
## labeling         0.3        2014-08-23 CRAN (R 3.4.0)
## lattice          0.20-35    2017-03-25 CRAN (R 3.4.3)
## lazyeval         0.2.1      2017-10-29 CRAN (R 3.4.2)
## lubridate        1.7.3      2018-02-27 CRAN (R 3.4.3)
## magrittr         1.5        2014-11-22 CRAN (R 3.4.0)
## memoise          1.1.0      2017-04-21 CRAN (R 3.4.0)
## methods        * 3.4.3      2017-12-07 local
## microbenchmark * 1.4-4      2018-01-24 CRAN (R 3.4.3)
## mnormt           1.5-5      2016-10-15 CRAN (R 3.4.0)
## modelr           0.1.1      2017-08-10 local
## munsell          0.4.3      2016-02-13 CRAN (R 3.4.0)
## nlme             3.1-131.1  2018-02-16 CRAN (R 3.4.3)
## parallel         3.4.3      2017-12-07 local
## pillar           1.2.1      2018-02-27 CRAN (R 3.4.3)
## pkgconfig        2.0.1      2017-03-21 CRAN (R 3.4.0)
## plyr             1.8.4      2016-06-08 CRAN (R 3.4.0)
## psych            1.7.8      2017-09-09 CRAN (R 3.4.1)
## purrr          * 0.2.4      2017-10-18 CRAN (R 3.4.2)
## R6               2.2.2      2017-06-17 CRAN (R 3.4.0)
## Rcpp             0.12.16    2018-03-13 CRAN (R 3.4.4)
## readr          * 1.1.1      2017-05-16 CRAN (R 3.4.0)
## readxl         * 1.0.0      2017-04-18 CRAN (R 3.4.0)
## reshape2         1.4.3      2017-12-11 CRAN (R 3.4.3)
## rlang            0.2.0      2018-02-20 cran (@0.2.0)
## rmarkdown        1.9        2018-03-01 CRAN (R 3.4.3)
## rprojroot        1.3-2      2018-01-03 CRAN (R 3.4.3)
## rstudioapi       0.7        2017-09-07 CRAN (R 3.4.1)
## rvest            0.3.2      2016-06-17 CRAN (R 3.4.0)
## scales           0.5.0      2017-08-24 cran (@0.5.0)
## stats          * 3.4.3      2017-12-07 local
## stringi          1.1.7      2018-03-12 CRAN (R 3.4.3)
## stringr        * 1.3.0      2018-02-19 CRAN (R 3.4.3)
## tibble         * 1.4.2      2018-01-22 CRAN (R 3.4.3)
## tidyr          * 0.8.0      2018-01-29 CRAN (R 3.4.3)
## tidyselect       0.2.4      2018-02-26 CRAN (R 3.4.3)
## tidyverse      * 1.2.1      2017-11-14 CRAN (R 3.4.2)
## tools            3.4.3      2017-12-07 local
## utf8             1.1.3      2018-01-03 CRAN (R 3.4.3)
## utils          * 3.4.3      2017-12-07 local
## withr            2.1.2      2018-03-15 CRAN (R 3.4.4)
## xml2             1.2.0      2018-01-24 CRAN (R 3.4.3)
## yaml             2.1.18     2018-03-08 CRAN (R 3.4.4)
This work is licensed under the CC BY-NC 4.0 Creative Commons License.