--- title: "Import and Tidy Data" author: "Dr. Hua Zhou @ UCLA" date: "Feb 5, 2019" subtitle: Biostat M280 output: # ioslides_presentation: default html_document: toc: true toc_depth: 4 --- ```{r setup, include=FALSE} knitr::opts_chunk$set(fig.align = 'center', cache = TRUE) ``` A typical data science project:

# Tibble | r4ds chapter 10 ## Tibbles - Tibbles extend data frames in R and form the core of tidyverse. ```{r} library("tidyverse") ``` ## Create tibbles - Convert a regular data frame to tibble: ```{r} # a regular data frame head(iris) ``` ---- - ```{r} as_tibble(iris) ``` - Convert a tibble to data frame: ```{r, eval = FALSE} as.data.frame(tb) ``` ---- - Create tibble from individual vectors. Note values for y are recycled: ```{r} tibble( x = 1:5, y = 1, z = x ^ 2 + y ) ``` ---- - Transposed tibbles: ```{r} tribble( ~x, ~y, ~z, #--|--|---- "a", 2, 3.6, "b", 1, 8.5 ) ``` ## Printing of a tibble - By default, tibble prints the first 10 rows and all columns that fit on screen. ```{r} tibble( a = lubridate::now() + runif(1e3) * 86400, b = lubridate::today() + runif(1e3) * 30, c = 1:1e3, d = runif(1e3), e = sample(letters, 1e3, replace = TRUE) ) ``` ---- - To change number of rows and columns to display: ```{r} nycflights13::flights %>% print(n = 10, width = Inf) ``` ---- - To change the default print setting: - `options(tibble.print_max = n, tibble.print_min = m)`: if more than `m` rows, print only `n` rows. - `options(dplyr.print_min = Inf)`: print all row. - `options(tibble.width = Inf)`: print all columns. ## Subsetting - ```{r} df <- tibble( x = runif(5), y = rnorm(5) ) ``` - Extract by name: ```{r} df$x df[["x"]] ``` ---- - Extract by position： ```{r} df[[1]] ``` - Pipe: ```{r} df %>% .$x df %>% .[["x"]] ``` # Data import | r4ds chapter 11 ## readr - readr package implements functions that turn flat files into tibbles. - `read_csv()`, `read_csv2()` (semicolon seperated files), `read_tsv()`, `read_delim()`. - `read_fwf()` (fixed width files), `read_table()`. - `read_log()` (Apache style log files). - An example file [heights.csv](http://hua-zhou.github.io/teaching/biostatm280-2019winter/slides/08-tidy/heights.csv): ```{bash} head heights.csv ``` ---- - Read from a local file [heights.csv](http://hua-zhou.github.io/teaching/biostatm280-2019winter/slides/08-tidy/heights.csv): ```{r} (heights <- read_csv("heights.csv")) ``` ---- - I'm curious about relation between `earn` and `height` and `sex` ```{r} ggplot(data = heights) + geom_point(mapping = aes(x = height, y = earn, color = sex)) ``` ---- - Read from inline csv file: ```{r} read_csv("a,b,c 1,2,3 4,5,6") ``` - Skip first `n` lines: ```{r} read_csv("The first line of metadata The second line of metadata x,y,z 1,2,3", skip = 2) ``` ---- - Skip comment lines: ```{r} read_csv("# A comment I want to skip x,y,z 1,2,3", comment = "#") ``` - No header line: ```{r} read_csv("1,2,3\n4,5,6", col_names = FALSE) ``` ---- - No header line and specify colnames: ```{r} read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z")) ``` - Specify the symbol representing missing values: ```{r} read_csv("a,b,c\n1,2,.", na = ".") ``` ## Writing to a file - Write to csv: ```{r, eval = FALSE} write_csv(challenge, "challenge.csv") ``` - Write (and read) RDS files: ```{r, eval = FALSE} write_rds(challenge, "challenge.rds") read_rds("challenge.rds") ``` ## Excel files

- readxl package (part of tidyverse) reads both xls and xlsx files: ```{r} library(readxl) # xls file read_excel("datasets.xls") ``` ```{r} # xls file read_excel("datasets.xlsx") ``` - List the sheet name: ```{r} excel_sheets("datasets.xlsx") ``` - Read in a specific sheet by name or number: ```{r} read_excel("datasets.xlsx", sheet = "mtcars") ``` ```{r} read_excel("datasets.xlsx", sheet = 4) ``` - Control subset of cells to read: ```{r} # first 3 rows read_excel("datasets.xlsx", n_max = 3) ``` Excel type range ```{r} read_excel("datasets.xlsx", range = "C1:E4") ``` ```{r} # first 4 rows read_excel("datasets.xlsx", range = cell_rows(1:4)) ``` ```{r} # columns B-D read_excel("datasets.xlsx", range = cell_cols("B:D")) ``` ```{r} # sheet read_excel("datasets.xlsx", range = "mtcars!B1:D5") ``` - Specify `NA`s: ```{r} read_excel("datasets.xlsx", na = "setosa") ``` - Writing Excel files: `openxlsx` and `writexl` packages. ## Other types of data - **haven** reads SPSS, Stata, and SAS files. - **DBI**, along with a database specific backend (e.g. **RMySQL**, **RSQLite**, **RPostgreSQL** etc) allows you to run SQL queries against a database and return a data frame. - **jsonlite** reads json files. - **xml2** reads XML files. - **tidyxl** reads non-tabular data from Excel. # Tidy data | r4ds chapter 12 ## Tidy data There are three interrelated rules which make a dataset tidy: - Each variable must have its own column. - Each observation must have its own row. - Each value must have its own cell.

---- - Example table1 ```{r} table1 ``` is tidy. ---- - Example table2 ```{r} table2 ``` is not tidy. ---- - Example table3 ```{r} table3 ``` is not tidy. ---- - Example table4a ```{r} table4a ``` is not tidy. - Example table4b ```{r} table4b ``` is not tidy. ## Gathering

- `gather` columns into a new pair of variables. ```{r} table4a %>% gather(`1999`, `2000`, key = "year", value = "cases") ``` ---- - We can gather table4b too and then join them ```{r} tidy4a <- table4a %>% gather(`1999`, `2000`, key = "year", value = "cases") tidy4b <- table4b %>% gather(`1999`, `2000`, key = "year", value = "population") left_join(tidy4a, tidy4b) ``` ## Spreading

- Spreading is the opposite of gathering. ```{r} spread(table2, key = type, value = count) ``` ## Separating

- ```{r} table3 %>% separate(rate, into = c("cases", "population")) ``` ---- - Seperate into numeric values: ```{r} table3 %>% separate(rate, into = c("cases", "population"), convert = TRUE) ``` ---- - Separate at a fixed position: ```{r} table3 %>% separate(year, into = c("century", "year"), sep = 2) ``` ## Unite

- ```{r} table5 ``` ---- - `unite()` is the inverse of `separate()`. ```{r} table5 %>% unite(new, century, year, sep = "") ```