Overview

Due before class May 14th.

Fork the hw07 repository

Go here to fork the repo for homework 07.

Part 1: Statistical learning in Spark

Last week you estimated statistical learning models predicting survival and death on the Titanic. Take the model specification for your best-performing model from the Titanic problem, and estimate it using at least three machine learning algorithms in sparklyr (either MLlib or H2O). Calculate the accuracy and AUC metrics for each model. Which algorithm performs the best?

Part 2: Generating geospatial visualization

Build a map.

Is that really all the help I get?

Yes.

Arrrrrrrrgh but that is so vague

I know. But at this point you should be able to rise to the occasion.

But where do I start?

Think of data you’ve seen in the past that you think would make for a good geospatial visualization. Which means it needs to include both a geographic component plus some additional data to overlay on top of the geography.

As for drawing the geographic boundaries, that depends on what you want to map. Remember the maps package has some basic boundaries for major geographic units. Run help(package = "maps") to get a list of geographic databases contained in maps. Also remember the fiftystater package is useful for mapping state-level variables within the United States. If you need to obtain shapefiles for more specific regions, Google is a great starting point. If you need help finding a relevant shapefile, feel free to post on the issues page to get help from the instructional staff/peers.

Once you have your geographic boundaries data (either from an R package or imported from a shapefile), combine this with your substantive data you wish to visualize. Make sure to make the graph presentable - that is, make it look like a nice map. Things to consider include (but are not limited to):

  • A map projection system
  • Appropriate legends, titles, labels, etc.
  • Color palette

Along with the map itself, write a brief description (250-500 words) of the map. Summarize the information being depicted and explain any major visual design choices (e.g. why this color palette, why split the continuous variable into XYZ intervals rather than ABC intervals).

Remember to make your assignment reproducible. If you get a shapefile from the internet, either include it in your repo or make sure your R Markdown document/R script includes a function to download it from the internet.

Submit the assignment

Your assignment should be submitted as one or more R Markdown documents, data files, figures, etc. Follow instructions on homework workflow. As part of the pull request, you’re encouraged to reflect on what was hard/easy, problems you solved, helpful tutorials you read, etc.

Rubric

Check minus: Cannot get code to run or is poorly documented. No documentation in the README file. Severe misinterpretations of the results. Overall a shoddy or incomplete assignment. Map looks amateurish or hard to interpret.

Check: Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output). Nothing spectacular, either bad or good.

Check plus: Interpretation is clear and in-depth. Accurately interprets the results, with appropriate caveats for what the technique can and cannot do. Code is reproducible. Writes a user-friendly README file. Graph looks crisp, easy-to-read, and communicates information honestly and accurately.

This work is licensed under the CC BY-NC 4.0 Creative Commons License.