---
title: 'Unit 4 (Classification): Fit RDA Models'
author: "Your Name"
date: "`r lubridate::today()`"
format:
  html:
    embed-resources: true
    toc: true
    toc-depth: 4
editor_options:
  chunk_output_type: console
---

## General Instructions

**NOTE:** If you have not yet worked through **`hw_unit_04_eda_cleaning.qmd`**, you are in the wrong place. Start there!

To begin fitting models, you will use the following files:

* Two Quarto (.qmd) files: **`hw_unit_04_fit_knn.qmd`** & **`hw_unit_04_fit_rda.qmd`**
* **`titanic_test_cln_no_label.csv`** (already-cleaned held-out test data with outcome labels removed)
* **`titanic_data_dictionary.png`**

-----

The data for this week's assignment (across all scripts) is the Titanic dataset, which contains survival outcomes and passenger characteristics for the Titanic's passengers. You can find information about the dataset in the data dictionary (available for download from the course website).

This is a popular dataset that has been used for machine learning competitions and is noted for being a fun and easy introduction to data science. There is a great deal of information online about how others have worked with this dataset, and you are encouraged to incorporate knowledge that others have gained if you want! However, only code from the class web book is needed for full credit on this assignment.

In this script, you will fit a series of `rda` models in your training data and use your validation data to select the best model. This script starts with reading in the `titanic_train.csv` & `titanic_val.csv` files that you created in the cleaning EDA script.

Although this assignment does not have an explicit script for modeling EDA, we expect that you will have done some modeling EDA on your own to be able to do good feature engineering and fit good models. You should now have multiple examples of good modeling EDA (the web book and two modeling EDA homework keys). You do not need to turn in any code for the modeling EDA that you complete.

At the end of this assignment, you will select your best performing model across all `knn` and `rda` models that you fit. You will use the `titanic_test_cln_no_label.csv` file **only once** at the end of the assignment to generate your best model's predictions of the survived outcome in the held-out test set. We will assess everyone's predictions on the held-out set to determine the best fitting model in the class. **The winner gets a free lunch!**

## Setup

### Handle conflicts

```{r}
#| message: false

```

### Load required packages and source

Add packages (only what you need!)

```{r}

```

Source any function scripts (John's are already sourced for you, but you can change these or add your own function scripts too).

```{r}
#| message: false

devtools::source_url("https://github.com/jjcurtin/lab_support/blob/main/fun_plots.R?raw=true")
devtools::source_url("https://github.com/jjcurtin/lab_support/blob/main/fun_eda.R?raw=true")
```

### Specify other global settings

```{r}
#| include: false

theme_set(theme_classic())
options(tibble.width = Inf, dplyr.print_max = Inf)
```

In the chunk below, set `path_data` to the location of your cleaned data files.

```{r}
path_data <- "homework/unit_04"
```

### Read in data

Read in the data files for your cleaned training, validation, and test data (generated in `hw_unit_04_eda_cleaning.qmd`). **Please load your data using the provided names.**

Use `here::here()` and a relative path for your data.
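As a minimal sketch (not a required solution), the read-in might look something like this, assuming you loaded `tidyverse` (or at least `readr`) above and used the file names described in the instructions (adjust if your file names differ):

```{r}
#| eval: false

# sketch only -- adjust to match the file names you saved from the cleaning script
data_trn <- read_csv(here::here(path_data, "titanic_train.csv"),
                     show_col_types = FALSE)
data_val <- read_csv(here::here(path_data, "titanic_val.csv"),
                     show_col_types = FALSE)
data_test <- read_csv(here::here(path_data, "titanic_test_cln_no_label.csv"),
                      show_col_types = FALSE)
```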
Make sure your iaml project is open.

```{r}
data_trn <- 

data_val <- 

data_test <- 
```

### Class variables

Set appropriate variable classes (remember that character variables should be factors).

```{r}

```

### Create tracking tibble

Create an empty tracking tibble to track the validation error across the various model configurations you will fit.

```{r}

```

## RDA model building instructions

You may consider as many RDA model configurations as you wish, but there are a few rules:

* All models must use `survived` as the outcome variable
* You must build at least 3 models (if you build 3 models, that is sufficient to receive full credit on this component of the assignment)
* At least one model must include a variable with missing data (you will need to handle that missingness before using the variable)
* At least one model must include a transformation of a numeric variable
* At least one model must use some other feature engineering technique that you believe may improve model performance
* All models you fit should use the hyperparameter values associated with a QDA because you don't know how to tune hyperparameters yet. You can get more details on this in the web book, but the code needed for this appears in the model-fitting chunks below.

Your models may include as many other predictors as you would like (e.g., a model that satisfies one of the requirements above may also include additional numeric variables, categorical variables, etc.). Be as creative as you'd like!

## RDA model configuration 1

Indicate which required model components you are including for your first model (delete the ones that do not apply):

*Variable with missingness*

*Transformed numeric*

*Interaction(s)*

### Set up recipe

Informed by your modeling EDA, create a recipe for your first model.

```{r}

```

### Training feature matrix

Use the `prep()` function to prep your recipe.

```{r}

```

Use the `bake()` function to generate the feature matrix of the training data.

```{r}

```

### Fit your model

Fit an `rda` model predicting `survived` from your training feature matrix.

```{r}
# use these hyperparameter values:
# discrim_regularized(frac_common_cov = 0, frac_identity = 0)

```

### Validation feature matrix

Use `bake()` to generate the feature matrix of the validation data that we will use to assess your model.

```{r}

```

### Assess your model

Use `accuracy_vec()` to calculate the validation error (accuracy) of your model.

Add this value to your validation error tracking tibble.

```{r}

```

## RDA model configuration 2

Indicate which required model components you are including for your second model (delete the ones that do not apply):

*Variable with missingness*

*Transformed numeric*

*Interaction(s)*

### Set up recipe

Informed by your modeling EDA, create a recipe for your second model.

```{r}

```

### Training feature matrix

Use the `prep()` function to prep your recipe.

```{r}

```

Use the `bake()` function to generate the feature matrix of the training data.

```{r}

```

### Fit your model

Fit an `rda` model predicting `survived` from your training feature matrix.

```{r}
# use these hyperparameter values:
# discrim_regularized(frac_common_cov = 0, frac_identity = 0)

```

### Validation feature matrix

Use `bake()` to generate the feature matrix of the validation data that we will use to assess your model.

```{r}

```

### Assess your model

Use `accuracy_vec()` to calculate the validation error (accuracy) of your model.
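If it helps, here is a minimal sketch of this assessment step. The object names `fit_rda_2` (your fitted model from above) and `feat_val` (your baked validation feature matrix) are placeholders for whatever names you actually used:

```{r}
#| eval: false

# sketch only -- placeholder object names
accuracy_vec(truth = feat_val$survived,
             estimate = predict(fit_rda_2, feat_val)$.pred_class)
```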
Add this value to your validation error tracking tibble.

```{r}

```

## RDA model configuration 3

Indicate which required model components you are including for your third model (delete the ones that do not apply):

*Variable with missingness*

*Transformed numeric*

*Interaction(s)*

### Set up recipe

Informed by your modeling EDA, create a recipe for your third model.

```{r}

```

### Training feature matrix

Use the `prep()` function to prep your recipe.

```{r}

```

Use the `bake()` function to generate the feature matrix of the training data.

```{r}

```

### Fit your model

Fit an `rda` model predicting `survived` from your training feature matrix.

```{r}
# use these hyperparameter values:
# discrim_regularized(frac_common_cov = 0, frac_identity = 0)

```

### Validation feature matrix

Use `bake()` to generate the feature matrix of the validation data that we will use to assess your model.

```{r}

```

### Assess your model

Use `accuracy_vec()` to calculate the validation error (accuracy) of your model.

Add this value to your validation error tracking tibble.

```{r}

```

## Additional model configurations (Optional)

Create as many code chunks as you would like below to test additional RDA model configurations. Record each model's performance in the validation data (accuracy) in your validation error tracking tibble.

```{r}

```

## Predictions

This section is for generating predictions from your best model in the held-out test set. You should only generate predictions for **one model** out of all the KNN and RDA models you fit across both model-fitting files. We will use these predictions to generate your ONE estimate of model performance in the held-out data.

If your **best** model is an RDA (from this script), follow the steps below. If your best model is a KNN (from your other script), skip the next section and **Save & Render** at the end of the document.

### Assign best model

If your **best** model is an `rda` from this script, edit the following objects in the code chunk below:

* Assign the model object from your best model to `best_model`
* Copy and paste the recipe for your best model (created from the training data) to assign to `best_rec` below
* Type your last name between the quotation marks of `last_name` (e.g., "marji") for file naming

If your best model is a `knn`, don't edit anything in the next two code chunks and skip to the final section to save, render, etc.

```{r}
best_model <- NULL

best_rec <- NULL # best recipe

last_name <- ""
```

### Generate test predictions

Run the code chunk below to save out your best model's predictions of `survived` in the held-out test set. Look over your predictions to confirm your model generated valid predictions for each test observation. (If you left `best_model` as `NULL` above, no file will be saved out.)

ADVANCED NOTE: You *could* improve the code below to train an even better model. If you know what could be improved, do it. If not, do not worry. The code will work well enough (but maybe not for a free lunch!!).
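One possible (entirely optional) improvement, sketched below with placeholder object names: refit your best recipe and model configuration on the combined training and validation data so the final model learns from all labeled observations, then use that refit model (and a recipe prepped on the combined data) when generating the test predictions in the chunk that follows. The sketch assumes the QDA hyperparameter values given earlier:

```{r}
#| eval: false

# optional sketch: refit the best configuration on training + validation data
data_full <- bind_rows(data_trn, data_val)

feat_full <- best_rec |> 
  prep(training = data_full) |> 
  bake(new_data = data_full)

best_model_full <- discrim_regularized(frac_common_cov = 0, frac_identity = 0) |> 
  set_engine("klaR") |> 
  fit(survived ~ ., data = feat_full)
```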
```{r}
if (!is.null(best_model)) {
  
  # Make the test set features with the best recipe
  rec_prep <- best_rec |> 
    prep(training = data_trn)
  
  feat_test <- rec_prep |> 
    bake(new_data = data_test)
  
  # Generate predictions made by the best model
  test_preds <- data_test |> 
    mutate(survived = predict(best_model, feat_test)$.pred_class) |> 
    select(passenger_id, survived) # passenger_id is the id variable used to match predictions
  
  # Save predictions as a csv file with your last name in the file name
  write.csv(test_preds,
            here::here(path_data, str_c("test_preds_", last_name, ".csv")),
            row.names = FALSE) # drop row numbers so the file contains only passenger_id & survived
}
```

## Save, Render, Upload

### Save

Save this `.qmd` file with your last name at the end (e.g., `hw_unit_04_fit_rda_marji.qmd`). Make sure you changed "Your Name" at the top of the file to your own name.

### Render

Render the file to html and upload your rendered html file to Canvas.

### Upload

Upload your saved `test_preds_(lastname).csv` file (if you created it in this script) to Canvas. We will assess everyone's predictions in the test set. **The winner gets a free lunch!!!**

Don't forget to complete the KNN portion of this assignment (`hw_unit_04_fit_knn.qmd`) if you haven't already and submit it to Canvas.

**Way to go!!**