---
title: "Palmer Penguins — predicting sex (binary)"
author: "Aparna Pandey and Stephan Peischl"
format:
  html:
    toc: true
    code-tools: true
engine: knitr
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(palmerpenguins)
library(dplyr)
library(ggplot2)
library(GGally)
library(rpart)
library(rpart.plot)
library(tidymodels)
library(tidyr)
library(rlang)
theme_set(theme_classic())
```

# Overview

This notebook predicts **`sex`** (female vs male) from **morphometrics and island / year only**. We **deliberately omit `species`** as a predictor so the task is not trivially solved by species–sex composition differences alone (in exercises you can add `species` and compare accuracy — then discuss **leakage** and what you are willing to claim scientifically).

**Models:** `glm` and `rpart`. **Train/test split and confusion tables:** `tidymodels` (`initial_split`, `yardstick::conf_mat`). For a full Adelie-vs-Gentoo pipeline, see [Module 04](../modules/module-04-pipeline.qmd#train-test-last-fit).

See **[Palmer Penguins data card](../data/cards/palmer-penguins.qmd)** and the related notebooks (mass, species, multiclass). For **SHAP** on this task (without `species`), see [Module 06](../modules/module-06-evaluation-and-interpretability.qmd#shap-shapley-values-on-penguins-sex) and [Day 4 slides](../slides/day-04-thursday.html).

## Prepare data (complete cases on `sex`)

```{r}
data("penguins", package = "palmerpenguins")
pg <- penguins |>
  tidyr::drop_na(sex, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, island, year) |>
  mutate(
    sex = droplevels(sex),
    year = as.numeric(year)
  ) |>
  dplyr::select(-species)

table(pg$sex)
nrow(pg)
```

## Pair plot (measurements coloured by sex)

```{r fig.width=8, fig.height=5}
GGally::ggpairs(
  pg,
  columns = c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"),
  aes(color = sex, alpha = 0.35)
) +
  theme_minimal()
```

## Train / test split (stratified on `sex`)

```{r}
set.seed(11)
split <- initial_split(pg, prop = 0.75, strata = sex)
train <- training(split)
test <- testing(split)
```

## Logistic regression (no `species` in formula)

```{r}
log_fit <- glm(
  sex ~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g + island + year,
  data = train,
  family = binomial()
)
summary(log_fit)
```

```{r}
p_test <- predict(log_fit, newdata = test, type = "response")
pred_sex <- factor(ifelse(p_test > 0.5, levels(train$sex)[2], levels(train$sex)[1]), levels = levels(train$sex))
tibble(truth = test$sex, .pred_class = pred_sex) |>
  conf_mat(truth = truth, estimate = .pred_class)
```

## Classification tree

```{r fig.width=8, fig.height=5.5}
tree_fit <- rpart(
  sex ~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g + island + year,
  data = train,
  method = "class"
)
rpart.plot(tree_fit, type = 4, extra = 104, main = "Sex from morphology (training data)")
```

```{r}
pred_t <- predict(tree_fit, test, type = "class") |> factor(levels = levels(test$sex))
tibble(truth = test$sex, .pred_class = pred_t) |>
  conf_mat(truth = truth, estimate = .pred_class)
```

## 2D view (bill length vs flipper; other predictors at training medians / modes)

```{r fig.width=7, fig.height=4.5}
plot_boundary <- function(model, data, f1, f2) {
  r1 <- range(data[[f1]])
  r2 <- range(data[[f2]])
  grid <- expand.grid(
    seq(r1[1], r1[2], length.out = 100),
    seq(r2[1], r2[2], length.out = 100)
  )
  names(grid) <- c(f1, f2)
  for (nm in setdiff(names(data), c(f1, f2, "sex"))) {
    v <- data[[nm]]
    grid[[nm]] <- if (is.numeric(v)) stats::median(v, na.rm = TRUE) else {
      tab <- table(v)
      names(tab)[which.max(tab)]
    }
  }
  grid$p_male <- predict(model, newdata = grid, type = "response")
  lv <- levels(data$sex)
  grid$cls <- factor(ifelse(grid$p_male > 0.5, lv[2], lv[1]), levels = lv)
  ggplot(data, aes(!!sym(f1), !!sym(f2))) +
    geom_raster(data = grid, aes(!!sym(f1), !!sym(f2), fill = cls), alpha = 0.22, inherit.aes = FALSE) +
    geom_point(aes(shape = sex, fill = sex), size = 2.2, color = "gray25") +
    scale_fill_brewer(palette = "Set1") +
    theme_minimal() +
    labs(
      title = "Logistic regions for sex (bill vs flipper)",
      subtitle = "Other predictors fixed; species not used as input",
      fill = "Region", shape = "Sex"
    )
}
print(plot_boundary(log_fit, train, "bill_length_mm", "flipper_length_mm"))
```

## Takeaways

- **Missing `sex`** shrinks usable *n* — discuss NA reporting and whether imputation is defensible.
- Omitting **`species`** makes errors more interesting; adding it back is a one-line **ablation** for discussion.
- Compare difficulty here to **Adelie vs Gentoo** in `penguins-classification.Rmd`.