--- title: "Lab: data wrangling" format: html --- ## Reading and previewing data Our data are usually stored as a `.csv` file and after loading a `.csv` file into RStudio, we will have a "data frame". A data frame can be considered a special case of matrix where each column represents a measurement or variable of interest for each observation which correspond to the rows of the dataset. After loading the `tidyverse` suite of packages, we use the [`read_csv()`](https://readr.tidyverse.org/reference/read_delim.html) function to load the `heart_disease` dataset from yesterday: ```{r} library(tidyverse) heart_disease <- read_csv("https://raw.githubusercontent.com/36-SURE/2024/main/data/heart_disease.csv") ``` By default, `read_csv()` reads in the dataset as a `tbl` (aka `tibble`) object instead of a `data.frame` object. You can read about the differences [here](https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html), but it's not that meaningful for purposes. We can use the functions `slice_head()` and `slice_tail()` to view a sample of the data. Use the `slice_head()` function to view the first 6 rows, then use the `slice_tail()` function to view the last 3 rows: ```{r} # INSERT CODE HERE ``` View the dimensions of the data with `dim()`: ```{r} # INSERT CODE HERE ``` Quickly view summary statistics for all variables with the `summary()` function: ```{r} # Uncomment the following code by deleting the # at the front # summary(heart_disease) ``` View the data structure types with `str()`: ```{r} # str(heart_disease) ``` What's the difference between the output from the two functions? You can find a description of the dataset [here](https://www.stat.cmu.edu/cmsac/sure/2022/materials/data/health/intro_r/heart_disease_README.txt). ## Data manipulation with `dplyr` An easier way to manipulate the data frame is through the `dplyr` package, which is in the `tidyverse` suite of packages. The operations we can do include: selecting specific columns, filtering for rows, re-ordering rows, adding new columns and summarizing data. The "split-apply-combine" concept can be achieved by `dplyr`. ### Selecting columns with `select()` The function `select()` can be use to select certain column with the column names. First create a new table called `heart_disease_ad` that only contains the `Age` and `Drugs` columns: ```{r} # INSERT CODE HERE ``` To select all columns except a specific column, use the `-` (subtraction) operator. For example, view the output from uncommenting the following line of code: ```{r} # select(heart_disease, -Interventions) ``` To select a range of columns by name (that are in consecutive order), use the `:` (colon) operator. For example, view the output from uncommenting the following line of code: ```{r} # select(heart_disease, Drugs:Duration) ``` To select all columns that start with certain character strings, use the function `starts_with()`. Other matching options are: 1. `ends_with()`: select columns that end with a character string 2. `contains()`: select columns that contain a character string 3. `matches()`: select columns that match a regular expression 4. `one_of()`: select columns names that are from a group of names ```{r} # Uncomment the following lines of code # select(heart_disease, starts_with("Com")) # select(heart_disease, contains("er")) ``` ### Extracting rows using `filter()` We can also extract the rows/observations that satisfy certain criteria. Try extracting the rows with `Cost` of more than 400: ```{r} # INSERT CODE HERE ``` We can also filter on multiple criteria. Subset the rows with `Age` above 60 and the `gender` is "Male": ```{r} # INSERT CODE HERE ``` ### Arranging rows using `arrange()` To arrange the data frame by a specific order we need to use the function `arrange()`. The default is by increasing order and the `desc()` will provide the decreasing order. First arrange the `heart_disease` table by `Duration` in ascending order: ```{r} # INSERT CODE HERE ``` Next by descending order: ```{r} # INSERT CODE HERE ``` Try combining a pipeline of `select()`, `filter()`, and `arrange()` steps together with the `|>` operator by: 1. Selecting the `Age`, `Cost`, `ERVisit`, and `Duration` columns, 2. Filter to select only rows with `Age` above 60, 3. Sort by `Duration` in descending order ```{r} # INSERT CODE HERE ``` ### Creating new columns using `mutate()` Sometimes the data does not include the variable that we are interested in and we need to manipulate the current variables to add new variables into the data frame. Create a new column `cost_per_day` by taking the `Cost` and dividing by `Duration` (reassign this output to the `heart_disease` table following the commented code chunk so this column is added to the table): ```{r} # heart_disease <- heart_disease |> # mutate(INSERT CODE HERE) ``` ### Creating summaries with `summarize()` To create summary statistics for a given column in the data frame, we can use `summarize()` function. Compute the `mean`, `min`, and `max` amount of `Cost`: ```{r} # INSERT CODE HERE ``` The advantage of `summarize()` is more obvious if we combine it with `group_by()`, the group operators. Try to `group_by()` the `Gender` column first and then compute the same summary statistics for `Cost`: ```{r} # INSERT CODE HERE ```