cm013 - May 7, 2018

Overview

  • Discuss the need for distributed computing
  • Illustrate the split-apply-combine analytical pattern
  • Define parallel processing
  • Define SQL
  • Demonstrate how to access local and remote SQL databases
  • Introduce Hadoop and Spark as distributed computing platforms
  • Introduce the sparklyr package
  • Demonstrate how to use sparklyr for machine learning using the Titanic data set

Before class

  • Install sparklyr, rsparkling (the R interface to H2O's Sparkling Water), and Spark on your local computer. Run the code below to install the necessary packages and set the required options.

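    # Install the sparklyr and rsparkling packages
    install.packages(c("sparklyr", "rsparkling"))
    # Tell rsparkling which Sparkling Water version to use
    options(rsparkling.sparklingwater.version = "2.1.0")

    # Download and install a local copy of Spark 2.1.0
    library(sparklyr)
    spark_install(version = "2.1.0")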

    Last year, 70% of students installed these packages without problems; the rest ran into errors. Attempt the installation before class so that we can debug any errors before you need to use the packages.
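
One way to confirm the installation worked is to open and close a local Spark connection. The following is a minimal sketch rather than part of the official setup; it assumes a working Java installation on your computer and that Spark 2.1.0 was installed with the code above. The variable names `sc` and `running_version` are just illustrative.

```r
library(sparklyr)

# List the Spark versions sparklyr has installed locally;
# 2.1.0 should appear if spark_install() succeeded.
print(spark_installed_versions())

# Open a local connection (assumption: Java is installed and on the PATH).
sc <- spark_connect(master = "local", version = "2.1.0")

# Record and print the version of the running Spark instance.
running_version <- spark_version(sc)
print(running_version)

# Close the connection when finished.
spark_disconnect(sc)
```

If any of these steps fails, note the error message so it can be debugged before class.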

What you need to do

This work is licensed under the CC BY-NC 4.0 Creative Commons License.