---
title: "Reproducible Data Cleaning in RMarkdown"
subtitle: EPID 7500 - Introduction to Coding in R for Public Health & the Life Sciences
author: Steve Bellan
date: August 28, 2017
output: slidy_presentation 

---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Load the necessary packages

Note that to enter a code segment we can do `Ctrl-Alt-i` (Windows) or  `Cmd-Alt-i` (Mac)
```{r message=F}
require(tidyverse)
```

## Knitting

* Use keyboard shortcuts to knit whenever you want
* `Ctrl-Shift-k` (Windows) or `Cmd-Shift-k` (Mac)

## Loading in a data set from the internet

* [WHO TB notification data provided by countries](http://www.who.int/tb/country/data/download/en/)
* Right click on `Download case notifications [1.9Mb]` and save it to your hard drive
* Cannot load directly from internet since no URL for the CSV file available
* [WHO Data Dictionary](https://extranet.who.int/tme/generateCSV.asp?ds=dictionary)

```{r message=F, results='hide'}
notif <- read_csv('~/Downloads/TB_notifications_2017-08-27.csv')
```
```{r message=F}
head(colnames(notif),20)
notif[1:4,1:5]
```

## Workflow for running R code within Rmd

* I prefer to be able to use Console
* Set R output to console (not inline) when running lines [see here](https://support.rstudio.com/hc/en-us/community/posts/115000797947-Global-Changes-show-chunk-output-in-console-not-execute-all-lines)

## Select just a few variables of interest

```{r selectCols}
notif2 <- select(notif, iso3, country, g_whoregion, year, c_newinc) %>% 
  rename(whoreg = g_whoregion, inc = c_newinc)
notif2
head(distinct(notif2, country) %>% .$country)
```

## Missing values

```{r incNA}
indNA <- which(is.na(notif2$inc))
head(indNA)
slice(notif2, indNA)
```

* Excercise: Repeat this but look at rows with 0 values
```{r incZ, echo=F}
ind0 <- which(notif2$inc==0)
head(ind0)
slice(notif2, ind0)
```

## Identify countries without missing rows

* `sum(is.na(x))` tells you how many NAs are in x
* `sum(is.na(x))==0` tells you there are no NAs in x

```{r}
sum(is.na(c(0,NA,3,NA,5)))
```

```{r fullcountry}
fullcountry <- 
  group_by(notif2, country) %>%
  summarize(noNAs = sum(is.na(inc))==0) %>%
  filter(noNAs==T) %>%
  .$country
head(fullcountry)
```

## Subsetting

* Select the data set for which there are no NAs

```{r fullcountryDat}
notif3 <- filter(notif2, country %in% fullcountry)
range(notif3$inc)
sum(is.na(notif3$inc))
glimpse(notif3)
summary(notif3)
```

## Excercises

1. Write code that spreads `notif3` so that each country has it's own column
1. Look at a boxplot of TB notifications by country for the 10 countries with the greatest # of notifications overall.
    + Hint: `group_by()` countries, then sum all notifications, then sort 
    in descending order with `?arrange`

## Answers 1

1. Write code that spreads `notif3` so that each country has it's own column
```{r spread, echo=FALSE}
args(spread)
head(notif3)
notifWide <- select(notif3, country, year, inc) %>%
  spread(country, inc)
notifWide[1:5,1:5]
filter(notif3, country=='Argentina')
```

## Answers 2

2. Look at a boxplot of TB notifications by country for the 10 countries with the greatest # of cumulative notifications across years.

```{r bigC}
big_countries <- 
  group_by(notif3, country) %>%
  summarize(cum_inc= sum(inc)) %>%
  arrange(desc(cum_inc)) %>% 
  slice(1:10) %>%
  .$country
big_countries
```

## Answers 2

2. Look at a boxplot of TB notifications by country for the 10 countries with the greatest # of cumulative notifications across years.

```{r boxplot, echo=FALSE}
big_countries
boxplot(inc ~ iso3, filter(notif3, country %in% big_countries) , las = 2)
```

## Answers 2 (with ggplot)

2. Look at a boxplot of TB notifications by country for the 10 countries with the greatest # of cumulative notifications across years.

```{r boxplotGG, echo=FALSE}
big_countries
filter(notif3, country %in% big_countries) %>% 
  ggplot(., aes(iso3, inc)) + geom_boxplot() + 
  ## the next line rotates the country labels 90 degrees
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) 
```

## Excercise 3

* [WHO Measles & Rubella Page](http://www.who.int/immunization/monitoring_surveillance/burden/vpd/surveillance_type/active/measles_monthlydata/en/)
* Download the [Distribution of measles cases by country and by month, 2011-2017](http://www.who.int/entity/immunization/monitoring_surveillance/burden/vpd/Table_407_Web_Number_Measles_Cases_Month_2017-08-09.xlsx?ua=1)
* Read it into R with the  `xlsx` library (note you have to read 2nd sheet!)
* Reformat the data so it is in long with date in YYYY-MM format

## Answer 3

```{r measlesDat1}
## load the necessary libraries
require(readxl); require(tidyverse); require(lubridate); require(stringr)
getwd()
setwd("~/Documents/R Repos/EPID7500/Live Coding")
meas <- read_xlsx('Table_407_Web_Number_Measles_Cases_Month_2017-08-09.xlsx', sheet = 2)
head(meas)
?gather
## gather everything but iso3, countyr, year and make value of column name
meas2 <- gather(meas, month, cases, -c(ISO3, Country, Year) )
## change class of month to a factor, and use month.name as ordered.level
month.name
meas2$month <- factor(meas2$month, ordered = T, levels = month.name)
meas2$monthNum <- as.numeric(meas2$month)
meas2$monthNum <- str_pad(meas2$monthNum, width = 2, side = 'left', pad=0)
meas2
```

```{r measlesDat2}
## double check that the number months match the name months
distinct(meas2, month, monthNum) ## look at every distinct entry by monthXmonthNum
filter(meas2, monthNum==5) ## look at a random month num

meas3 <- ## unite year & month columns
  unite(meas2, ym, Year, monthNum, sep='-') %>%
  mutate(date = paste0(ym, '-01')) %>% ## add day
  mutate(date = ymd(date)) %>%
  select(-month)
meas3
```

## Excercise 4

* Plot measles incidence time series by country for the 5 highest incidence countries in 2015


## Export Rmd to R

* You can extract the inside of code chunks using `purl()`

```{r}
require(knitr)
purl('Day3.Rmd')
```