--- title: 'Week 4: Tidying up Data' author: "Ben Best" date: '`r format(Sys.time(), "%Y-%m-%d %H:%M")`' output: html_document --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, eval=FALSE) ``` ## Precursors ### Final Project The instructions for your class group's [Final Project](./#final-project) have been updated. ### Schedule The class for week 9 on March 4th conflicts with the [Bren Group Project](http://www.bren.ucsb.edu/research/gp_submit.html) presentations, so we'll be extending the classes before and after by 1 hour. ### Invitations - **organizations**: invite @bbest and @naomitague to your `github.com/` so we can directly _push_ to your site (vs _fork_ & _pull request_) - **auditors**: email bbest@gmail.com to ensure you get announcements via GauchoSpace ### `setwd()` As mentioned in [wk03_dplyr](./wk03_dplyr.html#where-am-i?-getting-around-in-the-command-line), the working directory when knitting your Rmarkdown file is always the folder in which it is contained, eg for `env-info/students/bbest.Rmd` the working directory is `env-info/students`. This may be different from your R Console in RStudio which defaults to the working directory to the top level folder of your project, ie `env-info`. To get the two to be the same as you test code in the Console before knitting the Rmarkdown to HTML, I **REQUIRE** you to insert the following R chunk near the beginning (but after the Rmarkdown front matter surrounded by the rows `---`) of your `env-info/students/.Rmd`: ```{r setwd students} # set working directory if has students directory and at R Console (vs knitting) if ('students' %in% list.files() & interactive()){ setwd('students' ) } # ensure working directory is students if (basename(getwd()) != 'students'){ stop(sprintf("WHOAH! Your working directory is not in 'students'!\n getwd(): %s", getwd())) } ``` This then ensures that "relative" paths will work the same in your R Console as when knitting your Rmarkdown to HTML. For instance: ```{r} # absolute: /Users/bbest/github/env-info/students/data/bbest_ports-bc.csv d = read.csv('./data/bbest_ports-bc.csv') # ./data is child of students # absolute: /Users/bbest/github/env-info/data/r-ecology/surveys.csv d = read.csv('../data/r-ecology/surveys.csv') # ../data is sibling of students ``` The first path uses this folder `.` since that `data` folder is a "child" of the `students` folder, whereas the second path backs up a folder `..` before descending into the other `data` folder that is a "sibling" of the `students` folder. ### Assignment (Individual) For the data wrangling portion of today, append the header `## 4. Tidying up Data` to your `env-info/students/.Rmd` and include R chunks to run the demo below and give yourself the opportunity to try out possibilities with the code. Please set aside another header below this section `## 4. Answers and Tasks` where you answer questions and perform tasks on applying the functions to the CO₂ dataset as R chunks. Please include the question or task above the R chunk or answer. You'll find it easiest to copy and paste the demo portion from [`env-info/wk04_tidyr.Rmd`](https://raw.githubusercontent.com/ucsb-bren/env-info/gh-pages/wk04_tidyr.Rmd) but will need to understand this material enough to apply to the questions and tasks. You will want to synchronize with `ucsb-bren/env-info` (ie _pull request_ `ucsb-bren/env-info` to `/env-info`, _merge_ the pull request in `/env-info`, and _pull_ to update your local machine), in order to successfully knit your `env-info/students/.Rmd`. The Rmarkdown below expects `env-info/wk04_tidyr/img/data-wrangling-cheatsheet_tidyr.png` and `env-info/data/co2_europa.xls` which are in the updated `ucsb-bren/env-info`. ## data The R chunks explaining the `dplyr` and `tidyr` functions below are pulled from the excellent [**wrangling-webinar.pdf**](wk03_dplyr/wrangling-webinar.pdf) presentation, which you should consult as you execute (see shortcuts in [rstudio-IDE-cheatsheet.pdf](refs/cheatsheets/rstudio-IDE-cheatsheet.pdf)). ### EDAWR ```{r EDAWR} # install.packages("devtools") # devtools::install_github("rstudio/EDAWR") library(EDAWR) help(package='EDAWR') ?storms # wind speed data for 6 hurricanes ?cases # subset of WHO tuberculosis ?pollution # pollution data from WHO Ambient Air Pollution, 2014 ?tb # tuberculosis data View(storms) View(cases) View(pollution) ``` ### slicing ```{r traditional R slicing} # storms storms$storm storms$wind storms$pressure storms$date # cases cases$country names(cases)[-1] unlist(cases[1:3, 2:4]) # pollution pollution$city[c(1,3,5)] pollution$amount[c(1,3,5)] pollution$amount[c(2,4,6)] # ratio storms$pressure / storms$wind ``` ```{r dplyr on storms} # better yet library(dplyr) pollution %>% filter(city != 'New York') %>% mutate( ratio = pressure / wind) ``` ## tidyr Two main functions: gather() and spread() ```{r tidyr} # install.packages("tidyr") library(tidyr) ?gather # gather to long ?spread # spread to wide ``` ### `gather` ```{r gather} cases gather(cases, "year", "n", 2:4) gather(cases, "year", "n", -1) ``` ### `spread` ```{r spread} pollution spread(pollution, size, amount) ``` Other functions to extract and combine columns... ### `separate` ```{r separate} storms storms2 = separate(storms, date, c("year", "month", "day"), sep = "-") ``` ### `unite` ```{r unite} storms2 unite(storms2, "date", year, month, day, sep = "-") ``` **Recap: tidyr**: - A package that reshapes the layout of data sets. - Make observations from variables with `gather()` Make variables from observations with `spread()` - Split and merge columns with `unite()` and `separate()` From the [data-wrangling-cheatsheet.pdf](./refs/cheatsheets/data-wrangling-cheatsheet.pdf): ![](wk04_tidyr/img/data-wrangling-cheatsheet_tidyr.png) ### tidy CO₂ emissions _**Task**. Convert the following table [CO₂ emissions per country since 1970](http://edgar.jrc.ec.europa.eu/overview.php?v=CO2ts1990-2014&sort=des9) from wide to long format and output the first few rows into your Rmarkdown. I recommend consulting `?gather` and you should have 3 columns in your output._ ```{r read co2} library(dplyr) library(readxl) # install.packages('readxl') # xls downloaded from http://edgar.jrc.ec.europa.eu/news_docs/CO2_1970-2014_dataset_of_CO2_report_2015.xls xls = '../data/co2_europa.xls' print(getwd()) co2 = read_excel(xls, skip=12) co2 ``` _**Question**. Why use `skip=12` argument in `read_excel()`?_ ## dplyr A package that helps transform tabular data ```{r dplyr} # install.packages("dplyr") library(dplyr) ?select ?filter ?arrange ?mutate ?group_by ?summarise ``` See sections in the [data-wrangling-cheatsheet.pdf](./refs/cheatsheets/data-wrangling-cheatsheet.pdf): - Subset Variables (Columns), eg `select()` - Subset Observations (Rows), eg `filter()` - Reshaping Data - Change the layout of a data set, eg `arrange()` - Make New Variables, eg `mutate()` - Group Data, eg `group_by()` and `summarise()` ### `select` ```{r select} storms select(storms, storm, pressure) storms %>% select(storm, pressure) ``` ### `filter` ```{r filter} storms filter(storms, wind >= 50) storms %>% filter(wind >= 50) storms %>% filter(wind >= 50) %>% select(storm, pressure) ``` ### `mutate` ```{r mutate} storms %>% mutate(ratio = pressure / wind) %>% select(storm, ratio) ``` ### `group_by` ```{r group_by} pollution pollution %>% group_by(city) ``` ### `summarise` ```{r summarise} # by city pollution %>% group_by(city) %>% summarise( mean = mean(amount), sum = sum(amount), n = n()) # by size pollution %>% group_by(size) %>% summarise( mean = mean(amount), sum = sum(amount), n = n()) ``` note that `summarize` synonymously works ### `ungroup` ```{r ungroup} pollution %>% group_by(size) pollution %>% group_by(size) %>% ungroup() ``` ### multiple groups ```{r multiple groups} tb %>% group_by(country, year) %>% summarise(cases = sum(cases)) summarise(cases = sum(cases)) ``` **Recap: dplyr**: - Extract columns with `select()` and rows with `filter()` - Sort rows by column with `arrange()` - Make new columns with `mutate()` - Group rows by column with `group_by()` and `summarise()` See sections in the [data-wrangling-cheatsheet.pdf](./refs/cheatsheets/data-wrangling-cheatsheet.pdf): - Subset Variables (Columns), eg `select()` - Subset Observations (Rows), eg `filter()` - Reshaping Data - Change the layout of a data set, eg `arrange()` - Make New Variables, eg `mutate()` - Group Data, eg `group_by()` and `summarise()` ### summarize CO₂ emissions _**Task**. Report the top 5 emitting countries (not World or EU28) for **2014** using your long format table. (You may need to convert your year column from factor to numeric, eg `mutate(year = as.numeric(as.character(year)))`. As with most analyses, there are multiple ways to do this. I used the following additional functions: `filter`, `arrange`, `desc`, `head`)_. _**Task**. Summarize the **total** emissions by country (not World or EU28) across years from your long format table and return the top 5 emitting countries. (As with most analyses, there are multiple ways to do this. I used the following functions: `filter`, `group_by`, `summarize`, `sum`, `arrange`, `desc`, `head`)_. ## joining data Next week, we'll do a bit on joining data. ## References ### Data Wrangling in R - [Data Wrangling (dplyr, tidyr) cheat sheet]({{ site.baseurl }}/refs/cheatsheets/data-wrangling-cheatsheet.pdf) - [wrangling-webinar.pdf](wrangling-webinar.pdf)