--- title: "joining" author: "Ben Best" date: '`r format(Sys.time(), "%Y-%m-%d %H:%M")`' output: html_document --- ```{r setup, echo=F} # NOTE: This chunk sets the default to eval=F for all following chunks. # Am just using this for instructions. Please exclude in your assignment to see all results. knitr::opts_chunk$set(eval=F) ``` ## Precursors ### RStudio Heroes Last weekend at 1st annual [shinydevcon](https://www.rstudio.com/shinydevcon/). Left to right, @github (affiliation): - Winston Chang @wch (RStudio) - Hadley Wickham @hadley (RStudio) - Julie Lowndes @jules32 (OHI) - Jamie Afflerbach @jafflerbach (OHI) - Joe Cheng @jcheng5 (RStudio) - Ben Best @bbest (OHI) - Herman Strop @FrissAnalytics (FrissAnalytics) ![](./wk05_ggplot/img/rstudio_shinydevcon_wch-hadley-jcheng5-FrissAnalytics.JPG) Brought some of these stickers for ya: [![](./wk05_ggplot/img/rstudio_stickers.png)](https://www.rstudio.com/about/gear/) ### RStudio Shortcuts - [rstudio-IDE-cheatsheet.pdf](./refs/cheatsheets/rstudio-IDE-cheatsheet.pdf) - Show Keyboard Shortcuts `?`: Alt+Shift+**K** (Win), Option+Shift+**K** (Mac) - (Un)Comment lines `# `: Ctrl+Shift+**C** (Win), Cmd+Shift+**C** (Mac) - Insert `%>%`: Ctrl+Shift+**M** (Win), Cmd+Shift+**M** (Mac) - Attempt completion: **Tab** or Ctrl+Space (Win), **Tab** or Cmd+Space (Mac), - Knit Document: Ctrl+Shift+**K** (Win), Cmd+Shift+**K** (Mac) - Copy Lines Up/Down: Shift+Alt+**$\uparrow$**/**$\downarrow$** (Win), Cmd+Option+**$\uparrow$**/**$\downarrow$** (Mac) ### New Assignment Github Workflow #### OLD So far to turn in individual assignments, you've been editing `env-info/students/.Rmd` and knitting to `env-info/students/.html`. Then you push to your repo `/env-info` before submitting to the course repo `ucsb-bren/env-info` with a push, and pull request. To update your `/env-info` with the latest from `ucsb-bren/env-info`, you also need to do an additional pull request, merge, and pull. repo location | `` permission | initialize | edit | update -----------|:-----------:|:-----------:|:-----------:|:-----------: `github.com/ucsb-bren/env-info` | read only | | [**merge**](https://help.github.com/articles/merging-a-pull-request) [BB] | `github.com//env-info` | read + write | [**fork**](https://help.github.com/articles/fork-a-repo) | [**pull request**](https://help.github.com/articles/creating-a-pull-request/) | [**pull request**](https://help.github.com/articles/creating-a-pull-request/) , [**merge**](https://help.github.com/articles/merging-a-pull-request) `~/localdir/.github.io` | read + write | [**clone**](https://help.github.com/articles/fetching-a-remote) | [**commit**](http://git-scm.com/docs/git-commit) , [**push**](https://help.github.com/articles/pushing-to-a-remote/) | [**pull**](https://help.github.com/articles/fetching-a-remote/#pull) #### NEW For future assignments, you'll use a much simpler Github workflow with less possibilities for [merge conflicts](http://r-pkgs.had.co.nz/git.html#git-pull): repo location | `` permission | initialize | edit | update -----------|:-----------:|:-----------:|:-----------:|:-----------: `github.com//.github.io` | read + write | [**create**](https://help.github.com/articles/create-a-repo/) | | `~/localdir/.github.io` | read + write | [**clone**](https://help.github.com/articles/fetching-a-remote) | [**commit**](http://git-scm.com/docs/git-commit) , [**push**](https://help.github.com/articles/pushing-to-a-remote/) | [**pull**](https://help.github.com/articles/fetching-a-remote/#pull) Similar to your class group project repo `/.github.io` which serves a web site at `http://.github.io`, you want to initialize: 1. [Create a repository](https://help.github.com/articles/create-a-repo/) `/.github.io` where you'll replace `` with your Github username (eg I created a repository `bbest.github.io` in which `bbest` is the owner). If you already have this repository created, you can simply move on. Otherwise, please also tick the box to initialize this repository with a README. 1. Go to Settings, Collaborators and add `bbest` and `naomitague` as collaborators. This will give us write permissions on your repository, enabling us to more easily help you fix code. 1. Run the equivalent of `git clone http://github.com//.github.io` from RStudio menu File, New Project…, Version Control, Git. 1. Create a new folder `env-info_hw` inside your local repository `.github.io`. It is important that you use the exact spelling for all of the above. Otherwise, we won't be able to find your homework. ### Individual Assignment `wk05_ggplot` Now, for every week's assignment you'll create a new Rmarkdown file inside the `env-info_hw` folder of your `.github.io` repo. This week create one called `wk05_ggplot.Rmd` with File, New File..., R Markdown... We'll be able to grade assignments, because we have a roster of students by Github username, and so presume we'll find your assignment by visiting `http://.github.io/env-info_hw/wk05_ggplot.html`, eg `http://bbest.github.io/env-info_hw/wk05_ggplot.html`. Again, it is important that you use the exact spelling above. Otherwise, we won't be able to find your homework. ## joining data For this portion of the individual assignment, similar to last week, you'll find it easiest to copy and paste from `## joining data` onwards in [`env-info/wk05_joining.Rmd`](https://raw.githubusercontent.com/ucsb-bren/env-info/gh-pages/wk05_joining.Rmd) to your `.github.io/env-info_hw/wk05_ggplot.Rmd`. Then you can play with different chunks of the code. Please be sure to answer all tasks and questions at the bottom. The R chunks explaining the `dplyr` join functions below are pulled from the excellent [**wrangling-webinar.pdf**](wk03_dplyr/wrangling-webinar.pdf) presentation, which you should consult as you execute (see shortcuts in [rstudio-IDE-cheatsheet.pdf](refs/cheatsheets/rstudio-IDE-cheatsheet.pdf)). ### setup Ensure that you're in the same working directory `env-info_hw` when you Knit HTML as when you test code in the Console. ```{r setwd students} wd = 'env-info_hw' # set working directory for Console (vs Rmd) if (wd %in% list.files() & interactive()){ setwd(wd) } # ensure working directory if (basename(getwd()) != wd){ stop(sprintf("WHOAH! Your working directory is not in '%s'!\n getwd(): %s", wd, getwd())) } ``` ### `bind_cols` bind columns between data frames ```{r bind_cols} library(readr) library(dplyr) y = read_csv( 'x1,x2 A,1 B,2 C,3') z = read_csv( 'x1,x2 B,2 C,3 D,4') y z bind_cols(y, z) ``` ### `bind_rows` bind rows of data frames ```{r bind_rows} y z bind_rows(y, z) ``` ### `union` set operation on data frames, returning all unique rows ```{r union} y z union(y, z) ``` ### `intersect` set operation on data frames, returning those rows in common ```{r intersect} y z intersect(y, z) ``` ### `setdiff` set operation on data frames, returning all mismatched rows ```{r setdiff} y z setdiff(y, z) ``` ### `left_join` return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. ```{r left_join} artists = read_csv( 'name,plays George,sitar John,guitar Paul,bass Ringo,drums') songs = read_csv( 'song,name Across the Universe,John Come Together,John "Hello, Goodbye",Paul Peggy Sue,Buddy') artists songs left_join(songs, artists, by='name') ``` ### `inner_join` return all rows from x where there are matching values in y, and all columns from x and y ```{r inner_join} inner_join(songs, artists, by = "name") ``` ### `semi_join` return all rows from x where there are matching values in y, keeping just columns from x ```{r semi_join} semi_join(songs, artists, by = "name") ``` ### `anti_join` return all rows from x where there are not matching values in y, keeping just columns from x ```{r anti_join} anti_join(songs, artists, by = "name") ``` ### per capita CO2 emissions You'll join the [population](http://data.okfn.org/data/core/population) dataset to calculate per capita CO2 emissions. _**Task**. Calculate the per capita emissions by country (not World or EU28) and return the top 5 emitting countries for **2014**._ _**Task**. Summarize the per capita emissions by country (not World or EU28) as the **mean** (ie average) value across all years and return the top 5 emitting countries._ ```{r bbest, echo=F, eval=F} # setwd('~/github/bbest.github.io/env-info_hw') ``` ```{r read co2 & popn, eval=T} suppressPackageStartupMessages({ library(readr) library(readxl) # install.packages('readxl') library(dplyr) library(tidyr) library(httr) # install.packages('httr') }) # co2 Excel file url = 'http://edgar.jrc.ec.europa.eu/news_docs/CO2_1970-2014_dataset_of_CO2_report_2015.xls' xls = 'co2_europa.xls' if (!file.exists(xls)) writeBin(content(GET(url), 'raw'), xls) # read in carbon dioxide emissions file co2 = read_excel(xls, skip=12) %>% gather(year, ktons_co2, -Country) %>% filter(!Country %in% c('World','EU28') & !is.na(Country)) %>% mutate(year = as.integer(as.character(year))) %>% rename(country=Country) co2 # read in population popn = read_csv('https://raw.githubusercontent.com/datasets/population/master/data/population.csv') %>% select(country=`Country Name`, year=Year, popn=Value) popn ``` In order to calculate per capita emissions, you'll need to join the `popn` data frame with the `co2` data frame before calculating something like `ktons_co2 / popn`. One of the trickiest parts about joining data is ensuring that values in the key columns are matching. The column `year` is a simple 4 digit integer in both tables (after converting to `as.integer()` for the gathered `co2`). But what about `country`? Let's look at just `year==2014` to see what values of `country` in table `co2` are not found in table `popn`, ie `anti_join()`: ```{r co2$popn not in popn$country, eval=T} # co2$popn not in popn$country co2 %>% filter(year==2014) %>% anti_join( popn %>% filter(year==2014), by='country') %>% arrange(desc(ktons_co2)) ``` Whoah, we're missing 47 "countries", including the "United States of America"! So what is "United States of America" called in `popn$country`? ```{r popn$country not in co2$popn, eval=T} # popn$country not in co2$popn popn %>% filter(year==2014) %>% anti_join( co2 %>% filter(year==2014), by='country') %>% arrange(desc(country)) ``` Ok, so in `popn$country` we have "United States" which doesn't get matched with "United States of America" in `co2$country`? How to fix this? We need a lookup table for translating country values between tables. I went ahead and made one for you by copying the two anti_join tables above and matching rows where possible. ```{r cntry, eval=T} # get lookup table to translate popn$country to co2$country cntry = read_csv('https://raw.githubusercontent.com/ucsb-bren/env-info/gh-pages/data/co2_country_to_popn.csv') cntry ``` So before we join the tables `co2` and `popn`, first translate country in one to match the other. ```{r update co2$country to popn$country, eval=T} # update co2$country to popn$country co2_m = co2 %>% left_join( cntry, # note use of named vector to columns with different names by = c('country'='country_co2')) %>% mutate( # note use of ifelse to upated country with popn$country_popn country = ifelse(!is.na(country_popn), country_popn, country)) # check that "United States of America" changed to "United States" co2_m %>% filter(year==2014) %>% arrange(desc(country)) %>% head(15) ``` Ok, so `co2_m$country` got updated with values to match `popn$country`, eg "United States of America" to "United States". Please use this matching `co2_m` to join with `popn` and calculate per capita emissions.