--- title: "R data.table Tutorial" author: 'Vinh "MBALearnsToCode" Luong' output: pdf_document fontsize: 12 geometry: margin=1.0in --- *(This tutorial is adapted from a technical presentation by Vinh's fellow data scientist Brian Silva at Uptake, LLC, a Chicago-based analytics firm focusing on heavy industry and the industrial internet-of-things)* This tutorial presents [**`data.table`**](http://cran.r-project.org/web/packages/data.table/index.html), an R package by Matt Dowle, which is an extension on R's default `data.frame` used for fast aggregation of large data (e.g. 100GB in RAM). We will compare its speed and syntax with that of data.frame. ## Installation First, make sure the **`data.table`** package is installed by running: ```{r eval=F, results='hide'} install.packages('data.table') ``` ```{r echo=FALSE, results='hide'} library(data.table) # load package ``` ```{r echo=FALSE} # define a convenient function to present the head of a database and the time R takes to do so HeadAndTime <- function(x) { system.time(head(x)) } ``` ## Data importing: **`fread`** vs. **`read.csv`** The **`fread`** function, which has almost identical arguments as those of `read.csv`, works many times faster than `read.csv`. For example, a large (10GB, 100M rows x 10 cols) data set will take `read.csv` hours to read, while it will take `fread` only about 4-5 minutes. By default `fread` will read a .csv file into a data.table, but you can also read it in as a data.frame. ```{r, eval=F} dt <- fread('dataset.csv') # returns data.table df <- fread('dataset.csv', data.table=F) # returns data.frame ``` ## Syntax: **`data.table`** vs. **`data.frame`** Let's start by making a large amount of data in both **`data.frame`** and **`data.table`** formats: ```{r} set.seed(35753) numRows <- 10000000 # Create data with assets and dateTimes assets <- paste('Asset', 1:1000, sep='_') dateTimes <- seq.POSIXt(from=as.POSIXct('2014-01-01 00:00:00'), to=as.POSIXct('2015-01-01 00:00:00'), length.out=50000) # Randomly select assets dateTimes and generate signals DF <- data.frame(name=1:numRows, assetId=sample(assets, numRows, replace=T), dateTime=sample(dateTimes, numRows, replace=T), signal1=runif(numRows), signal2=rexp(numRows), signal3=sample(c('ON', 'OFF'), numRows, replace=T), stringsAsFactors=F) DT <- as.data.table(DF) ``` ### Viewing the data First of all, data.table inherits from `data.frame`, which means that it can be passed to any package that only accepts data.frame. ```{r} class(DT) ``` Let's take a look at the data.frame and data.table we created. ```{r} system.time(h <- head(DF)) # see how much time it takes to present head of data.frame h system.time(h <- head(DT)) # see how much time it takes to present head of data.table h ``` The output looks pretty similar with the only exception being the colon after the row number. One thing to take away, though, is that it takes less time to read a `data.table` as it does a `data.frame`. This is because `data.table` only makes references to the underlying data, whereas `data.frame` copies data into a new object. This will be a recurring theme when comparing `data.table` and `data.frame`. Another nice thing is that when you print out a large `data.table` object, you are only shown a summary and not the whole things. You wouldn't want to do the following with a huge data.frame: ```{r} DT ``` ### Referencing rows Let's now look at an example of conditionally selecting certain rows in both `data.frame` and `data.table`. ### Select observations of `Asset_100` where `signal3` is 'ON.': ```{r} system.time(h <- head(DF[DF$assetId == 'Asset_100' & DF$signal3 == 'ON', ])) h system.time(h <- head(DT[assetId == 'Asset_100' & signal3 == 'ON', ])) h ``` Notice that within our `data.table`, we don't have to say `DT$assetId == 'Asset_100' & DT$signal3 == 'ON'`. This is because within `data.table`'s square brackets, we can reference column names directly as variables. While in the above code `data.table` is faster than `data.frame`, it is still not the best way to subset a `data.table`. Another method involves setting a key for the `data.table` and then subsetting. ```{r} setkey(DT, assetId, signal3) # set key to use binary search instead of linear scan system.time(h <- DT[list('Asset_100', 'ON'), ]) h ``` While setting up the key initially can take some time, all of the later subsetting is much faster. This is because `data.table` rearranges itself to allow binary search instead of linear scan. This means that instead of checking every row for these conditions, `data.table` can immediately eliminate many rows. For comparison, the computational complexity of vector scan is **O**(*n*), while that of binary search is **O**(log *n*). Additionally, `data.table` performs computations by reference instead of making a copy and performing calculations on these. This is much more performant and memory efficient. ### Referencing columns Referencing columns within `data.table` is something that can seem a little confusing at first -- especially when one is used to data.frame syntax. Take the following as an example: ```{r} head(DF[, 'signal1']) head(DT[, 'signal1']) ``` When we apply the same syntax from `data.frame` to `data.table`, we get something quite different. To some this may seem like a bug, but it is actually made this way by design. The second argument within `data.table`, which in `data.frame` references columns, can be an **expression** and not simply column names or indexes. So when you want to return the data from `signal1`, you can do either of the following: ```{r} head(DT[, signal1]) head(DT[['signal1']]) head(DT[, 'signal1', with=F]) ``` In the first example we just referenced the column name directly since column names are treated as variables within `data.table`. In the second example, we are essentially treating `DT` as a list.In the last example, we had to say `with=F`. This is because we want to pass in a string directly to reference the column name. (*don't ask us why the argument is named "with="... this probably has come from a lengthy history*) ### Why use data.table over data.frame? Right now we have seen that data.table can do the same things that data.frame can do. And we have seen that it can do them a bit faster too. But the syntax seems weird and confusing at first. Is data.table really worth the extra effort? Let's first look at the structure of data.table's arguments and then look at some examples of where this structure is incredibly useful. ### data.table's arguments data.table's inputs - often denoted as `DT[i, j, by]` - allow the following: - `i` allows you to evaluate conditional arguments (i.e. `signal1 > 0.5`) - `j` allows you to select or perform expressions on columns - `by` allows you to perform evaluations by group `data.table`'s syntax is (in many ways) analogous to SQL. For example, you can think of the inputs to data.table as the following: ```{r eval=F} DT[where, select|update, group by][order by][...] ... [...] ``` ### Cool stuff in data.table So we know that we can pass expressions to data.table's `j` argument. Here are a couple examples of where this could be useful: ### Calculate the mean of `signal1`: ```{r} DT[, mean(signal1)] ``` ### Calculate the mean and standard deviation of `signal1`: ```{r} DT[, list(avg=mean(signal1), sd=sd(signal1))] ``` ### Create a new column called `sigDif`, which is the difference between `signal2` and `signal1`: ```{r} head(DT[, sigDif := signal2 - signal1]) ``` Notice we use `:=` to assign calculations to this new column. We can also use `data.table`'s `by` argument to perform these calculations by group: ### Calculate the mean and standard deviation of `signal1` by `assetId`: ```{r} DT[, list(avg=mean(signal1), sd=sd(signal1)), by=assetId] ``` ### Helpful links [Introduction to the data.table package in R](http://datatable.r-forge.r-project.org/datatable-intro.pdf) [FAQs about the data.table package in R](http://cran.r-project.org/web/packages/data.table/vignettes/datatable-faq.pdf) [Matt Dowle's "data.table" talk at useR 2014](https://www.youtube.com/watch?v=qLrdYhizEMg)