---
title: "Lab: data wrangling"
format: html
---

## Reading and previewing data

Our data are usually stored as a `.csv` file and after loading a `.csv` file into RStudio, we will have a "data frame". A data frame can be considered a special case of matrix where each column represents a measurement or variable of interest for each observation which correspond to the rows of the dataset. After loading the `tidyverse` suite of packages, we use the [`read_csv()`](https://readr.tidyverse.org/reference/read_delim.html) function to load the 2024 NBA regular season stats dataset from yesterday:

```{r}
library(tidyverse)
nba_stats <- read_csv("https://raw.githubusercontent.com/36-SURE/2024/main/data/nba_stats.csv")
```

By default, `read_csv()` reads in the dataset as a `tbl` (aka `tibble`) object instead of a `data.frame` object. You can read about the differences [here](https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html), but it's not that meaningful for purposes.

We can use the functions `slice_head()` and `slice_tail()` to view a sample of the data. Use the `slice_head()` function to view the first 6 rows, then use the `slice_tail()` function to view the last 3 rows:

```{r}
# INSERT CODE HERE
```

View the dimensions of the data with `dim()`:

```{r}
# INSERT CODE HERE
```

Quickly view summary statistics for all variables with the `summary()` function:

```{r}
# Uncomment the following code by deleting the # at the front
# summary(nba_stats)
```

View the data structure types with `str()`:

```{r}
# str(nba_stats)
```

What's the difference between the output from the two functions?

## Data manipulation with `dplyr`

An easier way to manipulate the data frame is through the `dplyr` package, which is in the `tidyverse` suite of packages. The operations we can do include: selecting specific columns, filtering for rows, re-ordering rows, adding new columns and summarizing data. The "split-apply-combine" concept can be achieved by `dplyr`.

### Selecting columns with `select()`

The function `select()` can be use to select certain column with the column names. First create a new table called `nba_stats_pg` that only contains the `player` and `games` columns:

```{r}
# INSERT CODE HERE
```

To select all columns except a specific column, use the `-` (subtraction) operator. For example, view the output from uncommenting the following line of code:

```{r}
# select(nba_stats, -player)
```

To select a range of columns by name (that are in consecutive order), use the `:` (colon) operator. For example, view the output from uncommenting the following line of code:

```{r}
# select(nba_stats, player:games)
```

To select all columns that start with certain character strings, use the function `starts_with()`. Other matching options are:

1.  `ends_with()`: select columns that end with a character string

2.  `contains()`: select columns that contain a character string

3.  `matches()`: select columns that match a regular expression

4.  `one_of()`: select columns names that are from a group of names

```{r}
# Uncomment the following lines of code
# select(nba_stats, starts_with("three"))
# select(nba_stats, contains("throw"))
```

### Extracting rows using `filter()`

We can also extract the rows/observations that satisfy certain criteria. Try extracting the rows with more than 500 assists:

```{r}
# INSERT CODE HERE
```

We can also filter on multiple criteria. Subset the rows with `age` above 30 and the team is either "HOU" or "GSW":

```{r}
# INSERT CODE HERE
```

### Arranging rows using `arrange()`

To arrange the data frame by a specific order we need to use the function `arrange()`. The default is by increasing order and the `desc()` function will provide the decreasing order. First arrange the `nba_stats` table by `personal_fouls` in ascending order:

```{r}
# INSERT CODE HERE
```

Next by descending order:

```{r}
# INSERT CODE HERE
```

Try combining a pipeline of `select()`, `filter()`, and `arrange()` steps together with the `|>` operator by:

1.  Selecting the `player`, `team`, `age`, and `games` columns,
2.  Filter to select only rows with `games` above 50,
3.  Sort by `age` in descending order

```{r}
# INSERT CODE HERE
```

### Creating new columns using `mutate()`

Sometimes the data does not include the variable that we are interested in and we need to manipulate the current variables to add new variables into the data frame. Create a new column `fouls_per_game` by taking the `personal_fouls` and dividing by `games` (reassign this output to the `nba_stats` table following the commented code chunk so this column is added to the table):

```{r}
# nba_stats <- nba_stats |>
#   mutate(INSERT CODE HERE)
```

### Creating summaries with `summarize()`

To create summary statistics for a given column in the data frame, we can use `summarize()` function. Compute the `mean`, `min`, and `max` number of `assists`:

```{r}
# INSERT CODE HERE
```

The advantage of `summarize()` is more obvious if we combine it with `group_by()`, the group operators. Since players at the different position tend to have very different statistics, first `group_by()` position and then compute the same summary statistics for `assists`:

```{r}
# INSERT CODE HERE
```