--- title: "Exercise 5: Data Wrangling Pt 1" output: html_notebook: toc: yes toc_float: yes --- # Importing Messy Excel Data Although your data may be in a spreadsheet or CSV file, it may not be in an ideal format. See this registration list: ![](./images/reg_list_excel_1500x600.png) \ Here we see a bunch of stuff before the data, the column "names" are unwieldy, there are several extra columns we don't need, missing values, etc. You can tackle many of these issues when you import the data. First, we construct the file name and verify it exists: ```{r chunk01} reg_list_fn = "./data/SurveyResults_36100.xlsx" file.exists(reg_list_fn) ``` \ To import Excel files, we'll go to `read_xlsx()` from the *readxl* package. This function has a lot of options, so it's worth reading the help page: ```{r chunk02} library(readxl) reg_list_tbl = read_xlsx(path = reg_list_fn, sheet = "Survey Output", skip = 3, col_names = c("user", "reg_date", "fname", "lname", "email", "title", "org", "cosponsor", "allergies"), col_types = c("text", "text", "text", "text", "text", "text", "numeric", "text", "text")) reg_list_tbl ``` \ Likewise, the [readr](https://readr.tidyverse.org/) package provides super-charged functions for importing and exporting rectangular data in text formats. \ **Take Home Messags** - Data wrangling starts when you import - You're not limited to just using `read.csv()` \ # `dplyr` methods ```{r chunk03} library(dplyr) ``` \ We begin by looking at the Palmer Penguins data: ```{r chunk04} library(palmerpenguins) head(penguins) ``` \ dplyr provides `glimpse()` which is an alternative way preview a tibble: ```{r chunk05} glimpse(penguins) ``` \ ## Subset Columns Suppose we want to compare how bill length and flipper length look for each species. We could start by selecting just those columns: ```{r chunk06} library(dplyr) penguins %>% select(species, bill_length_mm, flipper_length_mm) %>% head() ``` \ You can also **rename** columns with `select()` (dplyr also has a `rename()` function): ```{r chunk07} penguins %>% select(species, bill = bill_length_mm, flipper = flipper_length_mm) %>% head() ``` \ If you want to extract the values from a single column and get the results back as a vector, use `pull()` (similar to the `$` operator): ```{r chunk08} penguins %>% pull(species) %>% table() ``` \ ## Adding/Modifying Columns To **add a column**, we can use `mutate()`. Let's create a new column with the ratio of the bill to flipper length: ```{r chunk09} peng_billflip_tbl <- penguins %>% select(species, bill_length_mm, flipper_length_mm) %>% mutate(bill_flip_ratio = bill_length_mm / flipper_length_mm) head(peng_billflip_tbl) ``` \ `mutate()` can also be used to modify existing columns, for example we could round bill_flip_ratio to 3 decimal places. ```{r chunk10} peng_billflip_tbl <- peng_billflip_tbl %>% mutate(bill_flip_ratio = round(bill_flip_ratio, 3)) head(peng_billflip_tbl) ``` \ **CHALLENGE #1**: Add a column called `bill_flip_sum` which represents the sum of the bill and flipper length. Hint: use `mutate()` ```{r chunk11} ## Your answer here ``` \ # Managing Rows ## Sorting You can **sort** a table using `arrange()`, passing the column(s) you want to sort by. The default sort order is ascending, but you can change that to descending by wrapping the column name in `desc()`. Sort by flipper size (smallest to largest): ```{r chunk12} peng_billflip_tbl %>% arrange(flipper_length_mm) %>% head() ``` \ Largest to smallest: ```{r chunk13} peng_billflip_tbl %>% arrange(desc(flipper_length_mm)) %>% head() ``` \ ## Subsetting The two functions most commonly used to subset rows include `filter()` and `slice()`. - use `slice()` to subset based on indices - use `filter()` to subset based on a logical expressions \ Logical expressions you can use in `filter()` include: ![](./images/logical-expressions-740x280x256.png) \ Suppose we only wanted to work with observations of *Gentoo* penguins: ```{r chunk14} gentoo_billflip_tbl <- peng_billflip_tbl %>% filter(species == "Gentoo") glimpse(gentoo_billflip_tbl) ``` \ If you want to filter on two criteria, add a second expression to `filter()`: ```{r chunk15} peng_billflip_tbl %>% filter(species == "Gentoo", flipper_length_mm > 220) %>% select(species, flipper_length_mm) %>% glimpse() ``` \ Suppose we wanted to identify the top 5 Gentoo observations with the largest bill_flip_ratio value. We can do that using a combination of `arrange()` and `slice()`: ```{r chunk16} gentoo_billflip_tbl %>% arrange(desc(bill_flip_ratio)) %>% slice(1:5) ``` \ We could also use `top_n()`: ```{r chunk17} gentoo_billflip_tbl %>% top_n(5, bill_flip_ratio) %>% arrange(desc(bill_flip_ratio)) ``` \ ## Challenge Questions Mix and match the following functions for the challenge questions. `select()`, `mutate()`, `pull()`, `arrange()`, `filter()`, `slice()` \ **CHALLENGE #2**: Write an expression that pulls out just the males on Torgersen Island. HOw many are there? ```{r chunk18} ## Your answer here ``` Ans. 23 \ **CHALLENGE #3**: COmpute the ratio of body mass to bill length and add it as a new column. Return just the species, sex, and this ratio. ```{r chunk19} ## Your answer here ``` \ # Missing Values In general you have three options for missing values: 1. Remove any row that has missing values from the data 2. Leave all rows but ignore NA values when you compute summaries (like mean) 3. Estimate or impute them \ Suppose we wanted to keep only observations that have no missing values for flipper_length_mm, body_mass_g, and sex. ```{r chunk20} penguins %>% select(species, flipper_length_mm, body_mass_g, sex) %>% filter(!is.na(flipper_length_mm), !is.na(body_mass_g), !is.na(sex)) %>% head() ``` \ # Table Summaries The main dplyr function for creating summaries of data is `summarise()`. The syntax is similar to `mutate()`. Summary columns can use any aggregate function from base R or dplyr (see the [dplyr cheatsheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-transformation.pdf) for a list of functions that are designed to use with `summarise()`). \ ## Summarise an entire tibble Compute the mean of several columns (using the `na.rm = TRUE` to ignore NA values): ```{r chunk21} penguins %>% summarise(bill_len_mean = mean(bill_length_mm, na.rm = TRUE), bill_depth_mean = mean(bill_depth_mm, na.rm = TRUE), body_mass_mean = mean(body_mass_g, na.rm = TRUE)) ``` \ ## Summarise groups of rows Often you want to create summaries for groups of rows (i.e., a different unit of analysis). If we wanted to compute summary stats for each species, you would first group the rows by species with `group_by()`, and then use `summarise()`. ```{r chunk22} penguins %>% group_by(species) %>% summarise(bill_len_mean = mean(bill_length_mm, na.rm = TRUE), bill_depth_mean = mean(bill_depth_mm, na.rm = TRUE), body_mass_mean = mean(body_mass_g, na.rm = TRUE)) ``` \ **How many rows per island?** For this question, we can use the `n()` function in `summarise()`: ```{r chunk23} penguins %>% group_by(island) %>% summarise(num_records = n()) ``` \ **CHALLENGE #4**: Compute the number of observations per year ```{r chunk24} ## Your answer here ``` \ **CHALLENGE #5**: Compute the number of observations per year and species? Hint: `group_by()` can take more than one column ```{r chunk25} ## Your answer here ``` \ # END That's it!