--- title: "Import and Tidy Data" author: "Dr. Hua Zhou" date: "Jan 30, 2018" subtitle: Biostat M280 output: ioslides_presentation: smaller: true --- ```{r setup, include=FALSE} knitr::opts_chunk$set(fig.width = 5, fig.height = 3.5, fig.align = 'center', cache = TRUE) ``` ## Workflow A typical data science project:
# Tibble | r4ds chapter 10 ## Tibbles - Tibbles extend data frames in R and form the core of tidyverse. ```{r} library("tidyverse") ``` ## Create tibbles - Convert a regular data frame to tibble: ```{r} # a regular data frame head(iris) ``` ---- - ```{r} as_tibble(iris) ``` - Convert a tibble to data frame: ```{r, eval = FALSE} as.data.frame(tb) ``` ---- - Create tibble from individual vectors: ```{r} tibble( x = 1:5, y = 1, z = x ^ 2 + y ) ``` ---- - Transposed tibbles: ```{r} tribble( ~x, ~y, ~z, #--|--|---- "a", 2, 3.6, "b", 1, 8.5 ) ``` ## Printing of a tibble - By default, tibble prints the first 10 rows and all columns that fit on screen. ```{r} tibble( a = lubridate::now() + runif(1e3) * 86400, b = lubridate::today() + runif(1e3) * 30, c = 1:1e3, d = runif(1e3), e = sample(letters, 1e3, replace = TRUE) ) ``` ---- - To change number of rows and columns to display: ```{r} nycflights13::flights %>% print(n = 10, width = Inf) ``` ---- - To change the default print setting: - `options(tibble.print_max = n, tibble.print_min = m)`: if more than `m` rows, print only `n` rows. - `options(dplyr.print_min = Inf)`: print all row. - `options(tibble.width = Inf)`: print all columns. ## Subsetting - ```{r} df <- tibble( x = runif(5), y = rnorm(5) ) ``` - Extract by name: ```{r} df$x df[["x"]] ``` ---- - Extract by position: ```{r} df[[1]] ``` - Pipe: ```{r} df %>% .$x df %>% .[["x"]] ``` # Data import | r4ds chapter 11 ## - readr package implements functions that turn flat files into tibbles. - `read_csv()`, `read_csv2()` (semicolon seperated files), `read_tsv()`, `read_delim()`. - `read_fwf()` (fixed width files), `read_table()`. - `read_log()` (Apache style log files). - An example file [heights.csv](http://hua-zhou.github.io/teaching/biostatm280-2018winter/slides/08-tidy/heights.csv): ```{bash} head heights.csv ``` ---- - Read from a local file [heights.csv](http://hua-zhou.github.io/teaching/biostatm280-2018winter/slides/08-tidy/heights.csv): ```{r} (heights <- read_csv("heights.csv")) ``` ---- - Read from inline csv file: ```{r} read_csv("a,b,c 1,2,3 4,5,6") ``` - Skip first `n` lines: ```{r} read_csv("The first line of metadata The second line of metadata x,y,z 1,2,3", skip = 2) ``` ---- - Skip comment lines: ```{r} read_csv("# A comment I want to skip x,y,z 1,2,3", comment = "#") ``` - No header line: ```{r} read_csv("1,2,3\n4,5,6", col_names = FALSE) ``` ---- - No header line and specify colnames: ```{r} read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z")) ``` - Specify the symbol representing missing values: ```{r} read_csv("a,b,c\n1,2,.", na = ".") ``` ## Writing to a file - Write to csv: ```{r, eval = FALSE} write_csv(challenge, "challenge.csv") ``` - Write (and read) RDS files: ```{r, eval = FALSE} write_rds(challenge, "challenge.rds") read_rds("challenge.rds") ``` ## Other types of data - **haven** reads SPSS, Stata, and SAS files. - **readxl** reads excel files (both .xls and .xlsx). - **DBI**, along with a database specific backend (e.g. **RMySQL**, **RSQLite**, **RPostgreSQL** etc) allows you to run SQL queries against a database and return a data frame. - **jsonlite** reads json files. - **xml2** reads XML files. # Tidy data | r4ds chapter 12 ## Tidy data There are three interrelated rules which make a dataset tidy: - Each variable must have its own column. - Each observation must have its own row. - Each value must have its own cell.
---- - Example table1 ```{r} table1 ``` is tidy. ---- - Example table2 ```{r} table2 ``` is not tidy. ---- - Example table3 ```{r} table3 ``` is not tidy. ---- - Example table4a ```{r} table4a ``` is not tidy. - Example table4b ```{r} table4b ``` is not tidy. ## Gathering
- `gather` columns into a new pair of variables. ```{r} table4a %>% gather(`1999`, `2000`, key = "year", value = "cases") ``` ---- - We can gather table4b too and then join them ```{r} tidy4a <- table4a %>% gather(`1999`, `2000`, key = "year", value = "cases") tidy4b <- table4b %>% gather(`1999`, `2000`, key = "year", value = "population") left_join(tidy4a, tidy4b) ``` ## Spreading
- Spreading is the opposite of gathering. ```{r} spread(table2, key = type, value = count) ``` ## Separating
- ```{r} table3 %>% separate(rate, into = c("cases", "population")) ``` ---- - Seperate into numeric values: ```{r} table3 %>% separate(rate, into = c("cases", "population"), convert = TRUE) ``` ---- - Separate at a fixed position: ```{r} table3 %>% separate(year, into = c("century", "year"), sep = 2) ``` ## Unite
- ```{r} table5 ``` ---- - `unite()` is the inverse of `separate()`. ```{r} table5 %>% unite(new, century, year, sep = "") ```