---
title: "Lab: data wrangling"
format: html
---
## Reading and previewing data
Our data are usually stored as a `.csv` file and after loading a `.csv` file into RStudio, we will have a "data frame". A data frame can be considered a special case of matrix where each column represents a measurement or variable of interest for each observation which correspond to the rows of the dataset. After loading the `tidyverse` suite of packages, we use the [`read_csv()`](https://readr.tidyverse.org/reference/read_delim.html) function to load the `heart_disease` dataset from yesterday:
```{r}
library(tidyverse)
heart_disease <- read_csv("https://raw.githubusercontent.com/36-SURE/2024/main/data/heart_disease.csv")
```
By default, `read_csv()` reads in the dataset as a `tbl` (aka `tibble`) object instead of a `data.frame` object. You can read about the differences [here](https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html), but it's not that meaningful for purposes.
We can use the functions `slice_head()` and `slice_tail()` to view a sample of the data. Use the `slice_head()` function to view the first 6 rows, then use the `slice_tail()` function to view the last 3 rows:
```{r}
# INSERT CODE HERE
```
View the dimensions of the data with `dim()`:
```{r}
# INSERT CODE HERE
```
Quickly view summary statistics for all variables with the `summary()` function:
```{r}
# Uncomment the following code by deleting the # at the front
# summary(heart_disease)
```
View the data structure types with `str()`:
```{r}
# str(heart_disease)
```
What's the difference between the output from the two functions?
You can find a description of the dataset [here](https://www.stat.cmu.edu/cmsac/sure/2022/materials/data/health/intro_r/heart_disease_README.txt).
## Data manipulation with `dplyr`
An easier way to manipulate the data frame is through the `dplyr` package, which is in the `tidyverse` suite of packages. The operations we can do include: selecting specific columns, filtering for rows, re-ordering rows, adding new columns and summarizing data. The "split-apply-combine" concept can be achieved by `dplyr`.
### Selecting columns with `select()`
The function `select()` can be use to select certain column with the column names. First create a new table called `heart_disease_ad` that only contains the `Age` and `Drugs` columns:
```{r}
# INSERT CODE HERE
```
To select all columns except a specific column, use the `-` (subtraction) operator. For example, view the output from uncommenting the following line of code:
```{r}
# select(heart_disease, -Interventions)
```
To select a range of columns by name (that are in consecutive order), use the `:` (colon) operator. For example, view the output from uncommenting the following line of code:
```{r}
# select(heart_disease, Drugs:Duration)
```
To select all columns that start with certain character strings, use the function `starts_with()`. Other matching options are:
1. `ends_with()`: select columns that end with a character string
2. `contains()`: select columns that contain a character string
3. `matches()`: select columns that match a regular expression
4. `one_of()`: select columns names that are from a group of names
```{r}
# Uncomment the following lines of code
# select(heart_disease, starts_with("Com"))
# select(heart_disease, contains("er"))
```
### Extracting rows using `filter()`
We can also extract the rows/observations that satisfy certain criteria. Try extracting the rows with `Cost` of more than 400:
```{r}
# INSERT CODE HERE
```
We can also filter on multiple criteria. Subset the rows with `Age` above 60 and the `gender` is "Male":
```{r}
# INSERT CODE HERE
```
### Arranging rows using `arrange()`
To arrange the data frame by a specific order we need to use the function `arrange()`. The default is by increasing order and the `desc()` will provide the decreasing order. First arrange the `heart_disease` table by `Duration` in ascending order:
```{r}
# INSERT CODE HERE
```
Next by descending order:
```{r}
# INSERT CODE HERE
```
Try combining a pipeline of `select()`, `filter()`, and `arrange()` steps together with the `|>` operator by:
1. Selecting the `Age`, `Cost`, `ERVisit`, and `Duration` columns,
2. Filter to select only rows with `Age` above 60,
3. Sort by `Duration` in descending order
```{r}
# INSERT CODE HERE
```
### Creating new columns using `mutate()`
Sometimes the data does not include the variable that we are interested in and we need to manipulate the current variables to add new variables into the data frame. Create a new column `cost_per_day` by taking the `Cost` and dividing by `Duration` (reassign this output to the `heart_disease` table following the commented code chunk so this column is added to the table):
```{r}
# heart_disease <- heart_disease |>
# mutate(INSERT CODE HERE)
```
### Creating summaries with `summarize()`
To create summary statistics for a given column in the data frame, we can use `summarize()` function. Compute the `mean`, `min`, and `max` amount of `Cost`:
```{r}
# INSERT CODE HERE
```
The advantage of `summarize()` is more obvious if we combine it with `group_by()`, the group operators. Try to `group_by()` the `Gender` column first and then compute the same summary statistics for `Cost`:
```{r}
# INSERT CODE HERE
```