---
title: "Import and Tidy Data"
author: "Dr. Hua Zhou @ UCLA"
date: "Jan 18, 2022"
subtitle: Biostat 203B
output:
# ioslides_presentation: default
html_document:
toc: true
toc_depth: 4
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(fig.align = 'center', cache = TRUE)
```
## Outline
We will spend next couple lectures studying some R packages for typical data science projects.
* Data wrangling (import, tidy, visualization, transformation).
* [R for Data Science](http://r4ds.had.co.nz) by Garrett Grolemund and Hadley Wickham.
* [_R Graphics Cookbook_](https://r-graphics.org) by Winston Chang.
* Interactive data visualization by Shiny.
* Interface with databases, eg., SQL and Apache Spark.
* String and text data manipulation.
* Webscraping.
A typical data science project:
## [Tidyverse](https://www.tidyverse.org)
- `tidyverse` is a collection of R packages that make data wrangling easy.
- Install `tidyverse` from RStudio menu `Tools -> Install Packages...` or
```{r, eval = FALSE}
install.packages("tidyverse")
```
- After installation, load `tidyverse` by
```{r}
library("tidyverse")
```
# Tibble | r4ds chapter 10
## Tibbles
- Tibbles extend data frames in R and form the core of tidyverse.
## Create tibbles
- `iris` is a dataframe available in base R:
```{r}
# a regular data frame
iris
```
- Convert a regular data frame to tibble:
```{r}
as_tibble(iris)
```
We see by default tibble only displays the first 10 rows of data.
- Convert a tibble to data frame:
```{r, eval = FALSE}
as.data.frame(tb)
```
----
- Create tibble from individual vectors. Note values for y are recycled:
```{r}
tibble(
x = 1:5,
y = 1,
z = x ^ 2 + y
)
```
----
- Transposed tibbles:
```{r}
tribble(
~x, ~y, ~z,
#--|--|----
"a", 2, 3.6,
"b", 1, 8.5
)
```
## Printing of a tibble
- By default, tibble prints the first 10 rows and all columns that fit on screen.
```{r}
nycflights13::flights
```
----
- To change number of rows and columns to display:
```{r}
nycflights13::flights %>%
print(n = 10, width = Inf)
```
Here we see the **pipe operator** `%>%` pipes the output from previous command to the (first) agument of the next command.
----
- To change the default print setting:
- `options(tibble.print_max = n, tibble.print_min = m)`: if more than `m` rows, print only `n` rows.
- `options(dplyr.print_min = Inf)`: print all row.
- `options(tibble.width = Inf)`: print all columns.
## Subsetting
-
```{r}
df <- tibble(
x = runif(5),
y = rnorm(5)
)
df
```
- Extract by name:
```{r}
df$x
df[["x"]]
```
----
- Extract by position:
```{r}
df[[1]]
```
- Pipe:
```{r}
df %>% .$x
df %>% .[["x"]]
```
# Data import | r4ds chapter 11
## readr
- readr package implements functions that turn flat files into tibbles.
- `read_csv()`, `read_csv2()` (semicolon seperated files), `read_tsv()`, `read_delim()`.
- `read_fwf()` (fixed width files), `read_table()`.
- `read_log()` (Apache style log files).
- An example file [heights.csv](https://raw.githubusercontent.com/ucla-biostat203b-2021winter/ucla-biostat203b-2021winter.github.io/master/slides/05-tidy/heights.csv):
```{bash}
head heights.csv
```
----
- Read from a local file [heights.csv](https://raw.githubusercontent.com/ucla-biostat203b-2021winter/ucla-biostat203b-2021winter.github.io/master/slides/05-tidy/heights.csv):
```{r}
heights <- read_csv("heights.csv")
heights
```
----
- I'm curious about relation between `earn` and `height` and `sex`
```{r}
ggplot(data = heights) +
geom_point(mapping = aes(x = height, y = earn, color = sex))
```
----
- Read from inline csv file:
```{r}
read_csv("a,b,c
1,2,3
4,5,6")
```
- Skip first `n` lines:
```{r}
read_csv("The first line of metadata
The second line of metadata
x,y,z
1,2,3", skip = 2)
```
----
- Skip comment lines:
```{r}
read_csv("# A comment I want to skip
x,y,z
1,2,3", comment = "#")
```
- No header line:
```{r}
read_csv("1,2,3\n4,5,6", col_names = FALSE)
```
----
- No header line and specify colnames:
```{r}
read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
```
- Specify the symbol representing missing values:
```{r}
read_csv("a,b,c\n1,2,.", na = ".")
```
## Writing to a file
- Write to csv:
```{r, eval = FALSE}
write_csv(challenge, "challenge.csv")
```
- Write (and read) RDS files:
```{r, eval = FALSE}
write_rds(challenge, "challenge.rds")
read_rds("challenge.rds")
```
## Excel files
- readxl package (part of tidyverse) reads both xls and xlsx files:
```{r}
library(readxl)
# xls file
read_excel("datasets.xls")
```
```{r}
# xlsx file
read_excel("datasets.xlsx")
```
- List the sheet name:
```{r}
excel_sheets("datasets.xlsx")
```
- Read in a specific sheet by name or number:
```{r}
read_excel("datasets.xlsx", sheet = "mtcars")
```
```{r}
read_excel("datasets.xlsx", sheet = 4)
```
- Control subset of cells to read:
```{r}
# first 3 rows
read_excel("datasets.xlsx", n_max = 3)
```
Excel type range
```{r}
read_excel("datasets.xlsx", range = "C1:E4")
```
```{r}
# first 4 rows
read_excel("datasets.xlsx", range = cell_rows(1:4))
```
```{r}
# columns B-D
read_excel("datasets.xlsx", range = cell_cols("B:D"))
```
```{r}
# sheet
read_excel("datasets.xlsx", range = "mtcars!B1:D5")
```
- Specify `NA`s:
```{r}
read_excel("datasets.xlsx", na = "setosa")
```
- Writing Excel files: `openxlsx` and `writexl` packages.
## Other types of data
- **haven** reads SPSS, Stata, and SAS files.
- **DBI**, along with a database specific backend (e.g. **RMySQL**, **RSQLite**, **RPostgreSQL** etc) allows us to run SQL queries against a database and return a data frame. Later we will use DBI to work with databases.
- **jsonlite** reads json files.
- **xml2** reads XML files.
- **tidyxl** reads non-tabular data from Excel.
# Tidy data | r4ds chapter 12
## Tidy data
There are three interrelated rules which make a dataset tidy:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
----
- Example table1
```{r}
table1
```
is tidy.
----
- Example table2
```{r}
table2
```
is not tidy.
----
- Example table3
```{r}
table3
```
is not tidy.
----
- Example table4a
```{r}
table4a
```
is not tidy.
- Example table4b
```{r}
table4b
```
is not tidy.
## Gathering (pivot_longer)
- `gather` columns into a new pair of variables.
```{r}
table4a %>%
gather(`1999`, `2000`, key = "year", value = "cases")
```
- `gather` function has been superseded by `pivot_longer`
```{r}
table4a %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
```
----
- We can gather table4b too and then join them
```{r}
tidy4a <- table4a %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
# gather(`1999`, `2000`, key = "year", value = "cases")
tidy4b <- table4b %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")
# gather(`1999`, `2000`, key = "year", value = "population")
left_join(tidy4a, tidy4b)
```
## Spreading
- Spreading is the opposite of gathering.
```{r}
table2 %>%
spread(key = type, value = count)
```
- `spread` function has been superseded by `pivot_wider`
```{r}
table2 %>%
pivot_wider(names_from = type, values_from = count)
```
## Separating
-
```{r}
table3 %>%
separate(rate, into = c("cases", "population"))
```
----
- Seperate into numeric values:
```{r}
table3 %>%
separate(rate, into = c("cases", "population"), convert = TRUE)
```
----
- Separate at a fixed position:
```{r}
table3 %>%
separate(year, into = c("century", "year"), sep = 2)
```
## Unite
-
```{r}
table5
```
----
- `unite()` is the inverse of `separate()`.
```{r}
table5 %>%
unite(new, century, year, sep = "")
```