---
title: "Lab: data wrangling"
format: html
---

## Reading and previewing data

Our data are usually stored as a `.csv` file and after loading a `.csv` file into RStudio, we will have a "data frame". A data frame can be considered a special case of matrix where each column represents a measurement or variable of interest for each observation which correspond to the rows of the dataset. After loading the `tidyverse` suite of packages, we use the [`read_csv()`](https://readr.tidyverse.org/reference/read_delim.html) function to load the `heart_disease` dataset from yesterday:

```{r}
library(tidyverse)
heart_disease <- read_csv("https://raw.githubusercontent.com/36-SURE/2024/main/data/heart_disease.csv")
```

By default, `read_csv()` reads in the dataset as a `tbl` (aka `tibble`) object instead of a `data.frame` object. You can read about the differences [here](https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html), but it's not that meaningful for purposes.

We can use the functions `slice_head()` and `slice_tail()` to view a sample of the data. Use the `slice_head()` function to view the first 6 rows, then use the `slice_tail()` function to view the last 3 rows:

```{r}
# INSERT CODE HERE
```

View the dimensions of the data with `dim()`:

```{r}
# INSERT CODE HERE
```

Quickly view summary statistics for all variables with the `summary()` function:

```{r}
# Uncomment the following code by deleting the # at the front
# summary(heart_disease)
```

View the data structure types with `str()`:

```{r}
# str(heart_disease)
```

What's the difference between the output from the two functions?

You can find a description of the dataset [here](https://www.stat.cmu.edu/cmsac/sure/2022/materials/data/health/intro_r/heart_disease_README.txt).

## Data manipulation with `dplyr`

An easier way to manipulate the data frame is through the `dplyr` package, which is in the `tidyverse` suite of packages. The operations we can do include: selecting specific columns, filtering for rows, re-ordering rows, adding new columns and summarizing data. The "split-apply-combine" concept can be achieved by `dplyr`.

### Selecting columns with `select()`

The function `select()` can be use to select certain column with the column names. First create a new table called `heart_disease_ad` that only contains the `Age` and `Drugs` columns:

```{r}
# INSERT CODE HERE
```

To select all columns except a specific column, use the `-` (subtraction) operator. For example, view the output from uncommenting the following line of code:

```{r}
# select(heart_disease, -Interventions)
```

To select a range of columns by name (that are in consecutive order), use the `:` (colon) operator. For example, view the output from uncommenting the following line of code:

```{r}
# select(heart_disease, Drugs:Duration)
```

To select all columns that start with certain character strings, use the function `starts_with()`. Other matching options are:

1.  `ends_with()`: select columns that end with a character string

2.  `contains()`: select columns that contain a character string

3.  `matches()`: select columns that match a regular expression

4.  `one_of()`: select columns names that are from a group of names

```{r}
# Uncomment the following lines of code
# select(heart_disease, starts_with("Com"))
# select(heart_disease, contains("er"))
```

### Extracting rows using `filter()`

We can also extract the rows/observations that satisfy certain criteria. Try extracting the rows with `Cost` of more than 400:

```{r}
# INSERT CODE HERE
```

We can also filter on multiple criteria. Subset the rows with `Age` above 60 and the `gender` is "Male":

```{r}
# INSERT CODE HERE
```

### Arranging rows using `arrange()`

To arrange the data frame by a specific order we need to use the function `arrange()`. The default is by increasing order and the `desc()` will provide the decreasing order. First arrange the `heart_disease` table by `Duration` in ascending order:

```{r}
# INSERT CODE HERE
```

Next by descending order:

```{r}
# INSERT CODE HERE
```

Try combining a pipeline of `select()`, `filter()`, and `arrange()` steps together with the `|>` operator by:

1.  Selecting the `Age`, `Cost`, `ERVisit`, and `Duration` columns,
2.  Filter to select only rows with `Age` above 60,
3.  Sort by `Duration` in descending order

```{r}
# INSERT CODE HERE
```

### Creating new columns using `mutate()`

Sometimes the data does not include the variable that we are interested in and we need to manipulate the current variables to add new variables into the data frame. Create a new column `cost_per_day` by taking the `Cost` and dividing by `Duration` (reassign this output to the `heart_disease` table following the commented code chunk so this column is added to the table):

```{r}
# heart_disease <- heart_disease |>
#   mutate(INSERT CODE HERE)
```

### Creating summaries with `summarize()`

To create summary statistics for a given column in the data frame, we can use `summarize()` function. Compute the `mean`, `min`, and `max` amount of `Cost`:

```{r}
# INSERT CODE HERE
```

The advantage of `summarize()` is more obvious if we combine it with `group_by()`, the group operators. Try to `group_by()` the `Gender` column first and then compute the same summary statistics for `Cost`:

```{r}
# INSERT CODE HERE
```