---
title: "Palmer Penguins — multiclass imbalance and upsampling"
author: "Aparna Pandey and Stephan Peischl"
format:
  html:
    toc: true
    code-tools: true
engine: knitr
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
suppressPackageStartupMessages({
  library(tidymodels)
  library(themis)
  library(dplyr)
  library(ggplot2)
})
source("../R/slide-viz-helpers.R")
```

# Overview

Minimal walkthrough for [Day 4 Part E](../slides/day-04-thursday.html#/part-e-metrics): three-species classification with a **rare Chinstrap** class, one model, and **`step_upsample()`** in the recipe.

See also [three-species notebook](penguins-species-multiclass.Rmd) and the [Palmer Penguins data card](../data/cards/palmer-penguins.qmd).

## Load imbalanced data

We keep Adelie and Gentoo abundant and retain **15 Chinstrap** rows (deterministic rule: closest to the Adelie morphometric cloud).

```{r}
peng_imb3 <- prep_penguins_multiclass_imbalance(
  minority_class = "Chinstrap",
  hard_to_class = "Adelie",
  n_minority = 15L
)

peng_imb3 |>
  count(y3) |>
  knitr::kable(col.names = c("Species", "n"))
```

```{r fig.width=6, fig.height=3.5}
peng_imb3 |>
  count(y3) |>
  ggplot(aes(y3, n, fill = y3)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = n), vjust = -0.25, size = 4) +
  theme_minimal() +
  labs(title = "Class counts (rare Chinstrap)", x = NULL, y = "n")
```

## One model, two recipes

Same **`decision_tree`** spec; only the recipe differs (with vs without upsampling).

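
To make the difference between the two recipes concrete, here is a minimal base-R sketch of what upsampling does to class counts. The `toy` data frame and the `upsample_sketch()` helper are hypothetical and invented for illustration; they are not part of themis or the course helpers, and `step_upsample()` has its own options that this sketch does not reproduce.

```{r}
# Toy stand-in for the imbalanced data; these counts are made up for illustration.
toy <- data.frame(
  y3 = factor(rep(c("Adelie", "Chinstrap", "Gentoo"), times = c(8, 2, 6))),
  x  = seq_len(16)
)

# Hypothetical helper: resample each class with replacement
# until it has as many rows as the largest class.
upsample_sketch <- function(data, class_col) {
  groups <- split(seq_len(nrow(data)), data[[class_col]])
  target <- max(lengths(groups))
  idx <- lapply(groups, function(rows) {
    extra <- target - length(rows)
    # Keep every original row, then draw the shortfall from the same class.
    c(rows, rows[sample.int(length(rows), extra, replace = TRUE)])
  })
  data[unlist(idx, use.names = FALSE), , drop = FALSE]
}

set.seed(1)
toy_up <- upsample_sketch(toy, "y3")

table(toy$y3)     # class counts before upsampling
table(toy_up$y3)  # class counts after upsampling
```

The extra rows are duplicates of existing minority rows, so upsampling adds weight to the rare class but no new information. In the workflow, the upsampling step is applied only when the recipe is trained, so holdout rows are left as they are.
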
```{r}
mc_tree_spec <- decision_tree(tree_depth = 4, min_n = 20) |>
  set_engine("rpart") |>
  set_mode("classification")

# Baseline recipe: no resampling.
rec_no <- recipe(y3 ~ ., data = peng_imb3) |>
  step_zv(all_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors())

# Same preprocessing, with the outcome classes upsampled first.
rec_up <- recipe(y3 ~ ., data = peng_imb3) |>
  step_upsample(y3) |>
  step_zv(all_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors())

wf_no <- workflow() |>
  add_recipe(rec_no) |>
  add_model(mc_tree_spec)

wf_up <- workflow() |>
  add_recipe(rec_up) |>
  add_model(mc_tree_spec)
```

## Train / test split and fit

```{r}
set.seed(13)
split <- initial_split(peng_imb3, prop = 0.8, strata = y3)
train <- training(split)
test <- testing(split)

fit_no <- fit(wf_no, train)
fit_up <- fit(wf_up, train)

pred_no <- augment(fit_no, test)
pred_up <- augment(fit_up, test)
```

## Confusion matrices (holdout)

Rows = predicted species; columns = true species (the layout that `conf_mat()` prints).

**No upsampling:**

```{r}
pred_no |>
  conf_mat(truth = y3, estimate = .pred_class)
```

**With `step_upsample(y3)`:**

```{r}
pred_up |>
  conf_mat(truth = y3, estimate = .pred_class)
```

## Per-class recall (what improves)

```{r}
bind_rows(
  multiclass_recall_table(pred_no, truth_col = "y3") |>
    mutate(model = "No upsample"),
  multiclass_recall_table(pred_up, truth_col = "y3") |>
    mutate(model = "Upsample")
) |>
  mutate(recall = round(recall, 3)) |>
  select(model, truth, recall) |>
  tidyr::pivot_wider(names_from = model, values_from = recall) |>
  knitr::kable(caption = "Holdout recall by species")
```

Focus on the **Chinstrap** row: upsampling is useful when it raises minority recall without masking poor performance on the other classes, so check the Adelie and Gentoo rows as well.
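
For readers who want to see what per-class recall means without the helper, here is a small base-R sketch. The labels in `truth_toy` and `pred_toy` are invented for illustration and are not results from the models above; the sketch also does not claim to match how `multiclass_recall_table()` is implemented, only the standard definition of recall (correct predictions for a class divided by the number of true members of that class).

```{r}
# Hypothetical holdout labels, invented for illustration only.
lv <- c("Adelie", "Chinstrap", "Gentoo")
truth_toy <- factor(
  c("Adelie", "Adelie", "Adelie", "Chinstrap", "Chinstrap", "Gentoo", "Gentoo"),
  levels = lv
)
pred_toy <- factor(
  c("Adelie", "Adelie", "Chinstrap", "Adelie", "Chinstrap", "Gentoo", "Gentoo"),
  levels = lv
)

# Cross-tabulate with predictions in rows and truth in columns.
cm_toy <- table(Prediction = pred_toy, Truth = truth_toy)

# Recall per class: diagonal (correct) counts over column (true class) totals.
recall_toy <- diag(cm_toy) / colSums(cm_toy)

cm_toy
round(recall_toy, 3)
```

Because each class is divided by its own true count, a rare class with few holdout rows moves in large steps: with only a handful of Chinstrap rows in the test set, one extra correct prediction changes its recall substantially.
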