---
title: 'Unit 4 (Classification): Fit RDA Models'
author: "Your Name"
date: "`r lubridate::today()`"
format:
  html:
    embed-resources: true
    toc: true
    toc-depth: 4
editor_options:
  chunk_output_type: console
---

## General Instructions

**NOTE:** If you have not yet worked through **`hw_unit_04_eda_cleaning.qmd`**, you are in the wrong place. Start there!

To begin fitting models, you will use the following files:

* Two Quarto (.qmd) files: **`hw_unit_04_fit_knn.qmd`** & **`hw_unit_04_fit_rda.qmd`**
* **`titanic_test_cln_no_label.csv`** (already-cleaned held-out test data with outcome labels removed)
* **`titanic_data_dictionary.png`**

-----

The data for this week's assignment (across all scripts) is the Titanic dataset, which contains survival outcomes and passenger characteristics for the Titanic's passengers. You can find information about the dataset in the data dictionary (available for download from the course website).

This is a popular dataset that has been used for machine learning competitions and is noted for being a fun and easy introduction to data science. There is a great deal of information online about how others have worked with this dataset, and you are encouraged to incorporate knowledge that others have gained if you want! However, only code from the class web book is needed for full credit on this assignment.

In this script, you will fit a series of `rda` models in your training data and use your validation data to select the best model. This script starts with reading in the `titanic_train.csv` & `titanic_val.csv` files that you created in the cleaning EDA script.

Although this assignment does not have an explicit script for modeling EDA, we expect that you will have done some modeling EDA on your own to be able to do good feature engineering and fit good models. You should now have multiple examples of good modeling EDA (the web book and two modeling EDA homework keys). You do not need to turn in any code for the modeling EDA that you complete.

At the end of this assignment, you will select your best performing model across all `knn` and `rda` models that you fit. You will use the `titanic_test_cln_no_label.csv` file **only once** at the end of the assignment to generate your best model's predictions of the survived outcome in the held-out test set. We will assess everyone's predictions on the held-out set to determine the best fitting model in the class. **The winner gets a free lunch!**

## Setup

### Handle conflicts

```{r}
#| message: false

```

### Load required packages and source

Add packages (only what you need!)

```{r}

```

Source any function scripts (John's are already sourced for you, but you can change these or add your own function scripts too).

```{r}
#| message: false

devtools::source_url("https://github.com/jjcurtin/lab_support/blob/main/fun_plots.R?raw=true")
devtools::source_url("https://github.com/jjcurtin/lab_support/blob/main/fun_eda.R?raw=true")
```

### Specify other global settings

```{r}
#| include: false

theme_set(theme_classic())
options(tibble.width = Inf, dplyr.print_max = Inf)
```

In the chunk below, set `path_data` to the location of your cleaned data files.

```{r}
path_data <- "homework/unit_04"
```

### Read in data

Read in the data files for your cleaned training, validation, and test data (generated in `hw_unit_04_eda_cleaning.qmd`). **Please load your data using the provided names.**

Use `here::here()` and a relative path for your data.
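As a minimal sketch (not a required solution), the read-in might look something like this, assuming you loaded `tidyverse` (or at least `readr`) above and used the file names described in the instructions (adjust if your file names differ):

```{r}
#| eval: false

# sketch only -- adjust to match the file names you saved from the cleaning script
data_trn <- read_csv(here::here(path_data, "titanic_train.csv"),
                     show_col_types = FALSE)
data_val <- read_csv(here::here(path_data, "titanic_val.csv"),
                     show_col_types = FALSE)
data_test <- read_csv(here::here(path_data, "titanic_test_cln_no_label.csv"),
                      show_col_types = FALSE)
```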
Make sure your iaml project is open.

```{r}
data_trn <- 

data_val <- 

data_test <- 
```

### Class variables

Set appropriate variable classes (remember that character variables should be factors).

```{r}

```

### Create tracking tibble

Create an empty tracking tibble to track the validation error across the various model configurations you will fit.

```{r}

```

## RDA model building instructions

You may consider as many RDA model configurations as you wish, but there are a few rules:

* All models must use `survived` as the outcome variable
* You must build at least 3 models (if you build 3 models, that is sufficient to receive full credit on this component of the assignment)
* At least one model must include a variable with missing data (you will need to handle that missingness before using the variable)
* At least one model must include a transformation of a numeric variable
* At least one model must use some other feature engineering technique that you believe may improve model performance
* All models you fit should use the hyperparameter values associated with a QDA because you don't know how to tune hyperparameters yet. You can get more details on this in the web book, but the code needed for this appears in the model-fitting chunks below.

Your models may include as many other predictors as you would like (e.g., a model that satisfies one of the requirements above may also include additional numeric variables, categorical variables, etc.). Be as creative as you'd like!

## RDA model configuration 1

Indicate which required model components you are including for your first model (delete the ones that do not apply):

*Variable with missingness*

*Transformed numeric*

*Interaction(s)*

### Set up recipe

Informed by your modeling EDA, create a recipe for your first model.

```{r}

```

### Training feature matrix

Use the `prep()` function to prep your recipe.

```{r}

```

Use the `bake()` function to generate the feature matrix of the training data.

```{r}

```

### Fit your model

Fit an `rda` model predicting `survived` from your training feature matrix.

```{r}
# use these hyperparameter values:
# discrim_regularized(frac_common_cov = 0, frac_identity = 0)

```

### Validation feature matrix

Use `bake()` to generate the feature matrix of the validation data that we will use to assess your model.

```{r}

```

### Assess your model

Use `accuracy_vec()` to calculate the validation error (accuracy) of your model.

Add this value to your validation error tracking tibble.

```{r}

```

## RDA model configuration 2

Indicate which required model components you are including for your second model (delete the ones that do not apply):

*Variable with missingness*

*Transformed numeric*

*Interaction(s)*

### Set up recipe

Informed by your modeling EDA, create a recipe for your second model.

```{r}

```

### Training feature matrix

Use the `prep()` function to prep your recipe.

```{r}

```

Use the `bake()` function to generate the feature matrix of the training data.

```{r}

```

### Fit your model

Fit an `rda` model predicting `survived` from your training feature matrix.

```{r}
# use these hyperparameter values:
# discrim_regularized(frac_common_cov = 0, frac_identity = 0)

```

### Validation feature matrix

Use `bake()` to generate the feature matrix of the validation data that we will use to assess your model.

```{r}

```

### Assess your model

Use `accuracy_vec()` to calculate the validation error (accuracy) of your model.
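If it helps, here is a minimal sketch of this assessment step. The object names `fit_rda_2` (your fitted model from above) and `feat_val` (your baked validation feature matrix) are placeholders for whatever names you actually used:

```{r}
#| eval: false

# sketch only -- placeholder object names
accuracy_vec(truth = feat_val$survived,
             estimate = predict(fit_rda_2, feat_val)$.pred_class)
```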
Add this value to your validation error tracking tibble.

```{r}

```

## RDA model configuration 3

Indicate which required model components you are including for your third model (delete the ones that do not apply):

*Variable with missingness*

*Transformed numeric*

*Interaction(s)*

### Set up recipe

Informed by your modeling EDA, create a recipe for your third model.

```{r}

```

### Training feature matrix

Use the `prep()` function to prep your recipe.

```{r}

```

Use the `bake()` function to generate the feature matrix of the training data.

```{r}

```

### Fit your model

Fit an `rda` model predicting `survived` from your training feature matrix.

```{r}
# use these hyperparameter values:
# discrim_regularized(frac_common_cov = 0, frac_identity = 0)

```

### Validation feature matrix

Use `bake()` to generate the feature matrix of the validation data that we will use to assess your model.

```{r}

```

### Assess your model

Use `accuracy_vec()` to calculate the validation error (accuracy) of your model.

Add this value to your validation error tracking tibble.

```{r}

```

## Additional model configurations (Optional)

Create as many code chunks as you would like below to test additional RDA model configurations. Record each model's performance in the validation data (accuracy) in your validation error tracking tibble.

```{r}

```

## Predictions

This section is for generating predictions from your best model in the held-out test set. You should only generate predictions for **one model** out of all the KNN and RDA models you fit across both model-fitting files. We will use these predictions to generate your ONE estimate of model performance in the held-out data.

If your **best** model is an RDA (from this script), follow the steps below. If your best model is a KNN (from your other script), skip the next section and **Save & Render** at the end of the document.

### Assign best model

If your **best** model is an `rda` from this script, edit the following objects in the code chunk below:

* Assign the model object from your best model to `best_model`
* Copy and paste the recipe for your best model (created from the training data) to assign to `best_rec` below
* Type your last name between the quotation marks of `last_name` (e.g., "marji") for file naming

If your best model is a `knn`, don't edit anything in the next two code chunks and skip to the final section to save, render, etc.

```{r}
best_model <- NULL

best_rec <- NULL # best recipe

last_name <- ""
```

### Generate test predictions

Run the code chunk below to save out your best model's predictions of `survived` in the held-out test set. Look over your predictions to confirm your model generated valid predictions for each test observation. (If you left `best_model` as `NULL` above, no file will be saved out.)

ADVANCED NOTE: You *could* improve the code below to train an even better model. If you know what could be improved, do it. If not, do not worry. The code will work well enough (but maybe not for a free lunch!!).
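One possible (entirely optional) improvement, sketched below with placeholder object names: refit your best recipe and model configuration on the combined training and validation data so the final model learns from all labeled observations, then use that refit model (and a recipe prepped on the combined data) when generating the test predictions in the chunk that follows. The sketch assumes the QDA hyperparameter values given earlier:

```{r}
#| eval: false

# optional sketch: refit the best configuration on training + validation data
data_full <- bind_rows(data_trn, data_val)

feat_full <- best_rec |> 
  prep(training = data_full) |> 
  bake(new_data = data_full)

best_model_full <- discrim_regularized(frac_common_cov = 0, frac_identity = 0) |> 
  set_engine("klaR") |> 
  fit(survived ~ ., data = feat_full)
```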
```{r}
if (!is.null(best_model)) {
  
  # Make the test set features with the best recipe
  rec_prep <- best_rec |> 
    prep(training = data_trn)
  
  feat_test <- rec_prep |> 
    bake(new_data = data_test)
  
  # Generate predictions made by the best model
  test_preds <- data_test |> 
    mutate(survived = predict(best_model, feat_test)$.pred_class) |> 
    select(passenger_id, survived) # passenger_id is the id variable used to match predictions
  
  # Save predictions as a csv file with your last name in the file name
  write.csv(test_preds,
            here::here(path_data, str_c("test_preds_", last_name, ".csv")),
            row.names = FALSE) # drop row numbers so the file contains only passenger_id & survived
}
```

## Save, Render, Upload

### Save

Save this `.qmd` file with your last name at the end (e.g., `hw_unit_04_fit_rda_marji.qmd`). Make sure you changed "Your Name" at the top of the file to your own name.

### Render

Render the file to html and upload your rendered html file to Canvas.

### Upload

Upload your saved `test_preds_(lastname).csv` file (if you created it in this script) to Canvas. We will assess everyone's predictions in the test set. **The winner gets a free lunch!!!**

Don't forget to complete the KNN portion of this assignment (`hw_unit_04_fit_knn.qmd`) if you haven't already and submit it to Canvas.

**Way to go!!**