--- title: "Palmer Penguins — three-species classification" author: "Aparna Pandey and Stephan Peischl" format: html: toc: true code-tools: true engine: knitr --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE) library(palmerpenguins) library(dplyr) library(ggplot2) library(GGally) library(nnet) library(rpart) library(rpart.plot) library(tidymodels) library(tidyr) theme_set(theme_classic()) ``` # Overview Here we treat **`species`** as a **three-level** outcome (Adelie, Chinstrap, Gentoo) using **multinomial logistic regression** (`nnet::multinom`) and a **multiclass classification tree** (`rpart`). This complements the **binary** Adelie-vs-Gentoo notebook (`penguins-classification.Rmd`). For **metrics and multiclass intuition** on the site, see [Module 06](../modules/module-06-evaluation-and-interpretability.qmd). See **[Palmer Penguins data card](../data/cards/palmer-penguins.qmd)**. ## Prepare data ```{r} data("penguins", package = "palmerpenguins") pg <- penguins |> tidyr::drop_na(species, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, island, sex, year) |> mutate( species = droplevels(species), year = as.numeric(year) ) table(pg$species) nrow(pg) ``` ## Pair plot (measurements + island, coloured by species) ```{r fig.width=8.5, fig.height=5.5} GGally::ggpairs( pg, columns = c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g", "island"), aes(color = species, alpha = 0.25) ) + theme_minimal() ``` ## Train / test split (stratified on `species`) ```{r} set.seed(24) split <- initial_split(pg, prop = 0.75, strata = species) train <- training(split) test <- testing(split) ``` ## Multinomial logistic regression ```{r} set.seed(1) multi_fit <- nnet::multinom( species ~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g + island + sex + year, data = train, trace = FALSE, MaxNWts = 5000 ) summary(multi_fit) ``` ```{r} pred_multi <- predict(multi_fit, newdata = test) tibble(truth = test$species, .pred_class = pred_multi) |> conf_mat(truth = truth, estimate = .pred_class) ``` ```{r fig.width=5.5, fig.height=4.5} cm_obj <- conf_mat( tibble(truth = test$species, .pred_class = pred_multi), truth = truth, estimate = .pred_class ) cm <- as.data.frame.table(cm_obj$table, stringsAsFactors = FALSE) |> dplyr::rename(Reference = Truth) ggplot(cm, aes(Reference, Prediction, fill = Freq)) + geom_tile(color = "gray80") + geom_text(aes(label = Freq), color = "gray15") + scale_fill_gradient(low = "white", high = "steelblue") + theme_minimal() + labs( title = "Multinomial logit — test confusion (counts)", x = "True species", y = "Predicted species" ) ``` ## Multiclass tree ```{r fig.width=9, fig.height=6} tree_fit <- rpart( species ~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g + island + sex + year, data = train, method = "class" ) rpart.plot(tree_fit, type = 4, extra = 104, box.palette = "GnYlRd", main = "Three species (rpart, train)") ``` ```{r} pred_t <- predict(tree_fit, test, type = "class") |> factor(levels = levels(test$species)) tibble(truth = test$species, .pred_class = pred_t) |> conf_mat(truth = truth, estimate = .pred_class) ``` ## Takeaways - **Chinstrap** is often the hardest class (smaller *n*, overlap in measurement space) — inspect **per-class** metrics, not only overall accuracy. - Multiclass **ROC** and **one-vs-rest** calibration are natural Thursday extensions; here we stay with **confusion matrices** + trees for clarity. - Compare with the **binary** pipeline in `_includes/day02-tidymodels-walkthrough.qmd` (Adelie vs Gentoo slice).