cm013 - May 7, 2018

Overview

Discuss the need for distributed computing
Illustrate the split-apply-combine analytical pattern
Define parallel processing
Define SQL
Demonstrate how to access local and remote SQL databases
Introduce Hadoop and Spark as distributed computing platforms
Introduce the sparklyr package
Demonstrate how to use sparklyr for machine learning using the Titanic data set
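
As a rough illustration of how the split-apply-combine and SQL objectives above relate, here is a minimal sketch. It uses Python's standard library and an in-memory SQLite database; the class itself works in R with dplyr and dbplyr. The table name `passengers`, its columns, and the toy values are hypothetical, not taken from the Titanic data set.

```python
import sqlite3
from collections import defaultdict

# Hypothetical toy data: (passenger class, fare paid)
rows = [(1, 80.0), (1, 120.0), (2, 30.0), (3, 8.0), (3, 12.0)]

# Split: group the fares by passenger class.
groups = defaultdict(list)
for pclass, fare in rows:
    groups[pclass].append(fare)

# Apply and combine: compute the mean of each group and
# collect the per-group results into a single dictionary.
manual = {pclass: sum(fares) / len(fares) for pclass, fares in groups.items()}

# The same pattern in SQL: GROUP BY splits, AVG applies,
# and the result set is the combined output.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE passengers (pclass INTEGER, fare REAL)")
conn.executemany("INSERT INTO passengers VALUES (?, ?)", rows)
via_sql = dict(
    conn.execute("SELECT pclass, AVG(fare) FROM passengers GROUP BY pclass")
)
conn.close()

print(manual)
print(via_sql)
```

Both approaches should produce the same per-class means. In the SQL version the database does the grouping, which is the main reason to push this kind of work to a database when the data are too large to load into memory.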

Slides
Accessing databases using dbplyr
Split-apply-combine and parallel computing
Spark and sparklyr
The split-apply-combine strategy for data analysis - paper by Hadley Wickham giving a general overview of split-apply-combine problems. Note that the plyr package is now retired in favor of dplyr and the other tidyverse packages.
Accessing databases using dplyr
Taxi dataset
bigrquery - instructions for setting up an account to access Google BigQuery databases
sparklyr - introduction to the sparklyr interface for Spark