{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0",
   "metadata": {},
   "source": [
    "---\n",
    "title: \"Part 12: Presenting Data with Great Tables\"\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1",
   "metadata": {},
   "source": [
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sambaiga/ds-mlops-path/blob/main/tutorials/01-python-basics/12-great-tables.ipynb) [![Download Notebook](https://img.shields.io/badge/Download-Notebook-blue.svg?logo=jupyter&logoColor=white)](https://raw.githubusercontent.com/sambaiga/ds-mlops-path/main/tutorials/01-python-basics/12-great-tables.ipynb)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2",
   "metadata": {},
   "source": [
    "**DS-MLOps Data Analysis**\n",
    "\n",
    "**Python 3.12+ | Author: Anthony Faustine**\n",
    "\n",
    "## Before you begin\n",
    "\n",
    "This notebook assumes you have completed Parts 8-11 (the Data Analysis section). Every example builds on the `university_analytics.csv` dataset and the `ark.plot.gt_style` module introduced alongside the plotting chapters.\n",
    "\n",
    "A plain `df.head()` output serves you in a notebook. It does not serve a stakeholder, a report, or a slide deck. Great Tables (`great_tables`) is the Python library that bridges that gap: it wraps a pandas DataFrame in a fluent API and produces publication-ready HTML tables: precise column formatting, readable labels, conditional highlighting, and summary rows: with no CSS knowledge required.\n",
    "\n",
    "> Callout markers used throughout this notebook are explained on the [book cover page](../../index.qmd#callout-guide).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3",
   "metadata": {},
   "source": [
    "::: {.callout-note collapse=\"true\" icon=false}\n",
    "## Learning Objectives\n",
    "\n",
    "By the end of Part 12 you will be able to:\n",
    "\n",
    "| # | Skill | Covered in |\n",
    "|---|---|---|\n",
    "| 1 | Explain when a table is the right choice over a chart | Sec. 1 |\n",
    "| 2 | Wrap a DataFrame with `GT()` and apply the project brand with `themed_gt()` | Sec. 2 |\n",
    "| 3 | Format numbers, percentages, and missing values with `fmt_*` methods | Sec. 3 |\n",
    "| 4 | Add readable column labels with `cols_label` and group related columns with `tab_spanner` | Sec. 4 |\n",
    "| 5 | Target cells with `loc` and apply styling with `tab_style` | Sec. 5 |\n",
    "| 6 | Add grand summary rows with `grand_summary_rows` | Sec. 6 |\n",
    "| 7 | Build a model comparison table with `metrics_report()` | Sec. 7 |\n",
    ":::\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4",
   "metadata": {},
   "outputs": [],
   "source": [
    "from great_tables import GT, loc, md as gt_md, style\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "from ark.plot.gt_style import metrics_report, themed_gt\n",
    "from ark.plot.tokens import PRIMARY, SUCCESS, SURFACE_MUTED\n",
    "\n",
    "df = pd.read_csv(\"data/university_analytics.csv\")\n",
    "df[\"average_marks\"] = (df[\"midterm_score\"] + df[\"final_score\"] + df[\"project_score\"]) / 3\n",
    "df.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5",
   "metadata": {},
   "source": [
    "## 0. The Last Mile of a Data Story\n",
    "\n",
    "You have run the analysis. You have the numbers: pass rates by school, score distributions by program, trend lines across semesters. Now your manager asks for a report — something to put in front of a stakeholder, not a developer. You open a notebook, call `df.head()`, and stare at a grey monospace grid with no hierarchy, no colour, no units, and no sense of which numbers matter.\n",
    "\n",
    "`df.head()` serves you in a notebook. It does not serve anyone else.\n",
    "\n",
    "The gap between \"I have the result\" and \"I can communicate the result\" is called the last mile of a data story. It is where a lot of analysis work quietly disappears: correct findings, buried in formatting nobody wanted to read. **Great Tables** ([posit-dev.github.io/great-tables](https://posit-dev.github.io/great-tables/)) is the Python library that closes that gap. It wraps a pandas DataFrame in a fluent API — one that mirrors R's `{gt}` package — and produces publication-ready HTML tables with column spanners, colour scales, embedded sparklines, and controlled footnotes.\n",
    "\n",
    "### How it compares to other table tools\n",
    "\n",
    "| Tool | Output | Strengths | When to use instead |\n",
    "| --- | --- | --- | --- |\n",
    "| **Great Tables** ([posit-dev.github.io/great-tables](https://posit-dev.github.io/great-tables/)) | HTML | Full layout control, colour scales, sparklines, `{gt}`-compatible API | Reports, notebooks, any HTML output |\n",
    "| **pandas Styler** ([pandas docs](https://pandas.pydata.org/docs/user_guide/style.html)) | HTML | Built-in, no extra install, fast for simple highlighting | Quick colouring when GT is overkill |\n",
    "| **tabulate** ([tabulate on PyPI](https://pypi.org/project/tabulate/)) | Text, Markdown, HTML, LaTeX | Lightweight, great for terminal or Markdown output | CLI output, `.md` files |\n",
    "| **rich** ([rich.readthedocs.io](https://rich.readthedocs.io)) | Terminal | Beautiful terminal tables, progress bars | Terminal-only display |\n",
    "| **itables** ([mwouts.github.io/itables](https://mwouts.github.io/itables/)) | Interactive HTML | Sort, filter, search in notebook | Exploratory analysis, large tables |\n",
    "\n",
    "### Already in your environment\n",
    "\n",
    "```bash\n",
    "uv add great-tables          # for a standalone project\n",
    "```\n",
    "\n",
    "Official docs: [posit-dev.github.io/great-tables/articles/intro](https://posit-dev.github.io/great-tables/articles/intro.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6",
   "metadata": {},
   "source": [
    "## 1. When Tables Beat Charts"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7",
   "metadata": {},
   "source": [
    "A chart compresses a distribution into shape: it shows a trend, a cluster, or an outlier at a glance. A table preserves exact values so a reader can answer a precise question: *which course has the highest midterm average?* or *by how many points does one program outperform another?*\n",
    "\n",
    "Use a table when:\n",
    "- Readers will look up a specific row or compare two exact values\n",
    "- The differences between groups are small and a chart would compress them into noise\n",
    "- A report or stakeholder document needs a citable number, not an impression\n",
    "\n",
    "Use a chart when:\n",
    "- You want to show a trend, a distribution, or a relationship across many data points\n",
    "- The pattern matters more than the individual values\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8",
   "metadata": {},
   "source": [
    "<div style='background:#EAF3FA;border-left:5px solid #0369A1;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#0369A1;font-weight:bold'><i class=\"bi bi-info-circle-fill\"></i> Key Concept: Tables serve lookup; charts serve pattern recognition</span><br><br>\n",
    "Neither replaces the other. A data storytelling section (Part 7) shows a trend with a chart. A summary report shows the underlying numbers in a table. The combination answers both the <em>what happened</em> and the <em>by exactly how much</em>.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9",
   "metadata": {},
   "source": [
    "## 2. Your First Styled Table"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "10",
   "metadata": {},
   "source": [
    "Every Great Tables workflow starts with `GT(df)`: wrapping a pandas DataFrame in the Great Tables object. From there you chain methods to add structure and styling. On its own, `GT(df)` renders a minimal unstyled table. `themed_gt()` applies the project's brand: column header background, font, border colours, and alternating row stripes: one call at the end of the chain.\n",
    "\n",
    "The first example is a summary of mean scores by gender:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "11",
   "metadata": {},
   "outputs": [],
   "source": [
    "summary = (\n",
    "    df.groupby(\"gender\")\n",
    "    .agg(\n",
    "        n_students=(\"student_id\", \"count\"),\n",
    "        midterm=(\"midterm_score\", \"mean\"),\n",
    "        final=(\"final_score\", \"mean\"),\n",
    "        project=(\"project_score\", \"mean\"),\n",
    "        fail_rate=(\"passed\", lambda x: (~x).mean()),\n",
    "    )\n",
    "    .reset_index()\n",
    "    .round(2)\n",
    ")\n",
    "summary"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "12",
   "metadata": {},
   "source": [
    "`GT(df)` alone already renders a table, but column names are raw and values have no formatting. Wrapping it in `themed_gt()` applies the brand while `.tab_header()` adds a title and subtitle:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "13",
   "metadata": {},
   "outputs": [],
   "source": [
    "table = (\n",
    "    GT(summary)\n",
    "    .tab_header(\n",
    "        title=gt_md(\"**Mean Exam Scores by Gender**\"),\n",
    "        subtitle=\"Students with complete score records across all three components\",\n",
    "    )\n",
    "    .tab_source_note(\"Source: DS-MLOps university analytics dataset · 2,400 rows\")\n",
    ")\n",
    "themed_gt(table, n_rows=len(summary))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "14",
   "metadata": {},
   "source": [
    "<div style='background:#EAF3FA;border-left:5px solid #0369A1;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#0369A1;font-weight:bold'><i class=\"bi bi-info-circle-fill\"></i> Key Concept: The chain always ends with <code>themed_gt()</code></span><br><br>\n",
    "<code>themed_gt()</code> applies brand-wide options (<code>tab_options</code>) and text styling. Call it <em>last</em>, after all structural methods (<code>tab_header</code>, <code>cols_label</code>, <code>tab_spanner</code>, etc.) so it can apply consistently across everything you have added.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "15",
   "metadata": {},
   "source": [
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 1 - First Styled Table</span><br><br>\n",
    "<b>Goal:</b> Group <code>df</code> by <code>program</code> instead of <code>gender</code>, compute the same five aggregates, then wrap with <code>GT</code> and <code>themed_gt</code>. Add a title that identifies the program.\n",
    "<pre style='background:#FFF8E1;padding:10px;border-radius:4px;font-size:0.9em'>program_summary = df.groupby(\"program\").agg(...).reset_index().round(2)\n",
    "GT(program_summary).tab_header(title=gt_md(\"**...**\"), subtitle=\"...\").pipe(themed_gt, n_rows=len(program_summary))</pre>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "16",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: build program_summary and display with GT + themed_gt\n",
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "17",
   "metadata": {},
   "source": [
    "## 3. Formatting Values and Labelling Columns"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "18",
   "metadata": {},
   "source": [
    "Raw floats in a table communicate false precision: a pass rate of `0.87654` signals noise, not information. Great Tables `fmt_*` methods format each column's values to the right precision for its type, and `cols_label` replaces machine-readable column names with reader-facing ones.\n",
    "\n",
    "The four formatting methods used most in DS tables:\n",
    "- `fmt_number(columns, decimals)`: round to `decimals` places\n",
    "- `fmt_integer(columns)`: strip decimal point, add thousands separator\n",
    "- `fmt_percent(columns, decimals)`: multiply by 100 and append `%`\n",
    "- `fmt_missing(columns, missing_text)`: replace `NaN` with a readable label\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19",
   "metadata": {},
   "source": [
    "<div style='background:#EAF7F0;border-left:5px solid #059669;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#059669;font-weight:bold'><i class=\"bi bi-journal-code\"></i> Example: fmt_percent turns 0.913 into 91.3%</span><br><br>\n",
    "Without formatting, <code>fail_rate=0.04</code> reads as a raw proportion. With <code>fmt_percent(columns='fail_rate', decimals=1)</code>, the same cell displays as <code>4.0%</code>: the reader does not need to mentally multiply by 100.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "20",
   "metadata": {},
   "outputs": [],
   "source": [
    "formatted = (\n",
    "    GT(summary)\n",
    "    .tab_header(\n",
    "        title=gt_md(\"**Mean Exam Scores by Gender**\"),\n",
    "        subtitle=\"All figures rounded to one decimal place\",\n",
    "    )\n",
    "    .cols_label(\n",
    "        gender=\"Gender\",\n",
    "        n_students=\"Students\",\n",
    "        midterm=\"Midterm\",\n",
    "        final=\"Final\",\n",
    "        project=\"Project\",\n",
    "        fail_rate=\"Fail Rate\",  # noqa: S106\n",
    "    )\n",
    "    .fmt_integer(columns=\"n_students\")\n",
    "    .fmt_number(columns=[\"midterm\", \"final\", \"project\"], decimals=1)\n",
    "    .fmt_percent(columns=\"fail_rate\", decimals=1)  # noqa: S106\n",
    "    .tab_source_note(\"Source: DS-MLOps university analytics dataset · 2,400 rows\")\n",
    ")\n",
    "themed_gt(formatted, n_rows=len(summary))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "21",
   "metadata": {},
   "source": [
    "<div style='background:#F5F3FF;border-left:5px solid #7C3AED;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#5B21B6;font-weight:bold'><i class=\"bi bi-lightbulb-fill\"></i> Pro Tip: fmt_missing catches the NaN before the reader sees it</span><br><br>\n",
    "Any column that can contain <code>NaN</code>: a score column with ~3% missing, an optional field: should have <code>fmt_missing(columns=..., missing_text=\":\")</code> added to the chain. A blank cell in a published table is ambiguous: did the student not sit the exam, or did the pipeline drop the value?\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "22",
   "metadata": {},
   "source": [
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 2 - Format the Program Table</span><br><br>\n",
    "<b>Goal:</b> Take the <code>program_summary</code> from Activity 1 and add <code>cols_label</code>, <code>fmt_integer</code>, <code>fmt_number</code>, and <code>fmt_percent</code> to match the <code>formatted</code> table above.\n",
    "<pre style='background:#FFF8E1;padding:10px;border-radius:4px;font-size:0.9em'>formatted_programs = (\n",
    "    GT(program_summary)\n",
    "    .cols_label(program=\"Program\", n_students=\"Students\", ...)\n",
    "    .fmt_integer(columns=\"n_students\")\n",
    "    .fmt_number(columns=[...], decimals=1)\n",
    "    .fmt_percent(columns=\"fail_rate\", decimals=1)\n",
    ")\n",
    "themed_gt(formatted_programs, n_rows=len(program_summary))</pre>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "23",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: add cols_label and fmt_* to your program_summary table\n",
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "24",
   "metadata": {},
   "source": [
    "## 4. Column Spanners"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "25",
   "metadata": {},
   "source": [
    "When a table has several columns that belong to a natural group, for example three score columns or multiple model metrics, a column spanner adds a shared header label above the group. This reduces cognitive load: the reader understands the table structure before reading the individual values.\n",
    "\n",
    "`tab_spanner(label, columns)` draws the label above the specified columns. It does not move or reorder columns; it only adds a visual grouping above them.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "26",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Course performance table: one row per course\n",
    "course_detail = (\n",
    "    df.groupby(\"course\")\n",
    "    .agg(\n",
    "        students=(\"student_id\", \"count\"),\n",
    "        midterm=(\"midterm_score\", \"mean\"),\n",
    "        final=(\"final_score\", \"mean\"),\n",
    "        project=(\"project_score\", \"mean\"),\n",
    "        pass_rate=(\"passed\", \"mean\"),\n",
    "    )\n",
    "    .reset_index()\n",
    "    .round(2)\n",
    ")\n",
    "course_detail"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "27",
   "metadata": {},
   "outputs": [],
   "source": [
    "with_spanner = (\n",
    "    GT(course_detail)\n",
    "    .tab_header(title=gt_md(\"**Performance by Course**\"))\n",
    "    .cols_label(\n",
    "        course=\"Course\",\n",
    "        students=\"Students\",\n",
    "        midterm=\"Midterm\",\n",
    "        final=\"Final\",\n",
    "        project=\"Project\",\n",
    "        pass_rate=\"Pass Rate\",  # noqa: S106\n",
    "    )\n",
    "    .tab_spanner(label=\"Score (0-100)\", columns=[\"midterm\", \"final\", \"project\"])\n",
    "    .fmt_integer(columns=\"students\")\n",
    "    .fmt_number(columns=[\"midterm\", \"final\", \"project\"], decimals=1)\n",
    "    .fmt_percent(columns=\"pass_rate\", decimals=1)  # noqa: S106\n",
    "    .tab_source_note(\"Source: DS-MLOps university analytics dataset · 2,400 rows\")\n",
    ")\n",
    "themed_gt(with_spanner, n_rows=len(course_detail))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "28",
   "metadata": {},
   "source": [
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 3 - Add a Spanner</span><br><br>\n",
    "<b>Goal:</b> Take the <code>formatted_programs</code> table from Activity 2 and add a <code>tab_spanner</code> labelled <code>\"Scores (0-100)\"</code> over the three score columns.\n",
    "<pre style='background:#FFF8E1;padding:10px;border-radius:4px;font-size:0.9em'>GT(program_summary)\n",
    "    ...\n",
    "    .tab_spanner(label=\"Scores (0-100)\", columns=[\"midterm\", \"final\", \"project\"])\n",
    "    ...</pre>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "29",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: add a tab_spanner to the program table\n",
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "30",
   "metadata": {},
   "source": [
    "## 5. Conditional Styling"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "31",
   "metadata": {},
   "source": [
    "Conditional styling directs the reader's eye to the cells that matter: the highest pass rate, the lowest score, an outlier. `tab_style` applies a visual property and `loc` specifies exactly where it applies. `style` is the what, `loc` is the where.\n",
    "\n",
    "The most common locations:\n",
    "- `loc.body(columns, rows)`: specific cells in the data area\n",
    "- `loc.column_labels()`: the column header row\n",
    "- `loc.title()` / `loc.subtitle()`: the table header text\n",
    "\n",
    "`rows` inside `loc.body()` accepts an integer index, a list of indices, or a lambda that receives the DataFrame and returns a boolean Series.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "32",
   "metadata": {},
   "outputs": [],
   "source": [
    "highlighted = (\n",
    "    GT(course_detail)\n",
    "    .tab_header(title=gt_md(\"**Course Performance: Best and Worst Pass Rate**\"))\n",
    "    .cols_label(\n",
    "        course=\"Course\",\n",
    "        students=\"Students\",\n",
    "        midterm=\"Midterm\",\n",
    "        final=\"Final\",\n",
    "        project=\"Project\",\n",
    "        pass_rate=\"Pass Rate\",  # noqa: S106\n",
    "    )\n",
    "    .tab_spanner(label=\"Score (0-100)\", columns=[\"midterm\", \"final\", \"project\"])\n",
    "    .fmt_integer(columns=\"students\")\n",
    "    .fmt_number(columns=[\"midterm\", \"final\", \"project\"], decimals=1)\n",
    "    .fmt_percent(columns=\"pass_rate\", decimals=1)  # noqa: S106\n",
    "    .tab_style(\n",
    "        style=style.fill(color=SUCCESS),\n",
    "        locations=loc.body(\n",
    "            columns=\"pass_rate\",\n",
    "            rows=lambda df_gt: df_gt[\"pass_rate\"] == df_gt[\"pass_rate\"].max(),\n",
    "        ),\n",
    "    )\n",
    "    .tab_style(\n",
    "        style=style.fill(color=\"#FEF2F2\"),\n",
    "        locations=loc.body(\n",
    "            columns=\"pass_rate\",\n",
    "            rows=lambda df_gt: df_gt[\"pass_rate\"] == df_gt[\"pass_rate\"].min(),\n",
    "        ),\n",
    "    )\n",
    ")\n",
    "themed_gt(highlighted, n_rows=len(course_detail))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "33",
   "metadata": {},
   "source": [
    "<div style='background:#EAF3FA;border-left:5px solid #0369A1;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#0369A1;font-weight:bold'><i class=\"bi bi-info-circle-fill\"></i> Key Concept: <code>loc</code> is a targeting system, not a filter</span><br><br>\n",
    "<code>loc.body(rows=lambda df: df['pass_rate'] == df['pass_rate'].max())</code> does not subset the table: it identifies which rows receive the styling. The underlying data is unchanged. You can chain multiple <code>tab_style</code> calls; later ones add to earlier ones without overwriting.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "34",
   "metadata": {},
   "source": [
    "<div style='background:#FEF2F2;border-left:5px solid #DC2626;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#991B1B;font-weight:bold'><i class=\"bi bi-bug-fill\"></i> Common Mistake: Passing a boolean mask directly to <code>rows</code></span><br><br>\n",
    "<code>loc.body(rows=course_detail['pass_rate'] == course_detail['pass_rate'].max())</code> fails because <code>rows</code> inside <code>loc.body()</code> needs a callable that receives the <em>rendered</em> DataFrame, not the original one. Always use a lambda: <code>rows=lambda df: df['pass_rate'] == df['pass_rate'].max()</code>.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "35",
   "metadata": {},
   "source": [
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 4 - Highlight the Best Midterm Score</span><br><br>\n",
    "<b>Goal:</b> Take the <code>highlighted</code> table and add a third <code>tab_style</code> call that highlights the <code>midterm</code> cell with the highest value in a light blue (<code>#EAF3FA</code>). Use a lambda for the row selection.\n",
    "<pre style='background:#FFF8E1;padding:10px;border-radius:4px;font-size:0.9em'>.tab_style(\n",
    "    style=style.fill(color=\"#EAF3FA\"),\n",
    "    locations=loc.body(\n",
    "        columns=\"midterm\",\n",
    "        rows=lambda df: df[\"midterm\"] == df[\"midterm\"].max(),\n",
    "    ),\n",
    ")</pre>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "36",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: add a third tab_style call for the highest midterm value\n",
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "37",
   "metadata": {},
   "source": [
    "## 6. Summary Rows"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "38",
   "metadata": {},
   "source": [
    "A summary row aggregates the entire table into one footer row: a grand mean, a column total, or a count. The reader no longer needs to mentally compute the aggregate, and the table and its summary stay in the same visual unit.\n",
    "\n",
    "`grand_summary_rows(fns)` adds these rows. `fns` is a dict mapping a display label to an aggregation function. In version 0.20, it aggregates all numeric columns in the table, so the DataFrame passed to `GT` should contain only the columns you want summarised:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "39",
   "metadata": {},
   "outputs": [],
   "source": [
    "from great_tables import vals as gt_vals  # noqa: F401\n",
    "\n",
    "# Use only the score + pass_rate columns so the summary row is meaningful\n",
    "course_scores = course_detail.drop(columns=[\"students\"])\n",
    "\n",
    "with_summary = (\n",
    "    GT(course_scores)\n",
    "    .tab_header(title=gt_md(\"**Course Summary with Grand Mean**\"))\n",
    "    .cols_label(\n",
    "        course=\"Course\",\n",
    "        midterm=\"Midterm\",\n",
    "        final=\"Final\",\n",
    "        project=\"Project\",\n",
    "        pass_rate=\"Pass Rate\",  # noqa: S106\n",
    "    )\n",
    "    .tab_spanner(label=\"Score (0-100)\", columns=[\"midterm\", \"final\", \"project\"])\n",
    "    .fmt_number(columns=[\"midterm\", \"final\", \"project\"], decimals=1)\n",
    "    .fmt_percent(columns=\"pass_rate\", decimals=1)  # noqa: S106\n",
    "    .grand_summary_rows(\n",
    "        fns={\"Mean\": lambda x: x.mean(numeric_only=True)},\n",
    "    )\n",
    ")\n",
    "themed_gt(with_summary, n_rows=len(course_scores))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "40",
   "metadata": {},
   "source": [
    "<div style='background:#F5F3FF;border-left:5px solid #7C3AED;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#5B21B6;font-weight:bold'><i class=\"bi bi-lightbulb-fill\"></i> Pro Tip: Shape the DataFrame before passing it to GT</span><br><br>\n",
    "<code>grand_summary_rows</code> aggregates every numeric column in the table. If a count column like <code>students</code> would produce a meaningless mean, drop it before calling <code>GT()</code>: <code>df.drop(columns=[\"students\"])</code>. If the table still includes a string column like the row label, pass <code>numeric_only=True</code> to the aggregation: <code>lambda x: x.mean(numeric_only=True)</code>.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "41",
   "metadata": {},
   "source": [
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 5 - Add a Min and Max Row</span><br><br>\n",
    "<b>Goal:</b> Extend <code>with_summary</code> to show three summary rows: <code>Min</code>, <code>Max</code>, and <code>Mean</code> across all score columns. Pass a dict with three keys to <code>fns</code>.\n",
    "<pre style='background:#FFF8E1;padding:10px;border-radius:4px;font-size:0.9em'>fns={\"Min\": lambda x: x.min(numeric_only=True), \"Max\": lambda x: x.max(numeric_only=True), \"Mean\": lambda x: x.mean(numeric_only=True)}</pre>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "42",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: add Min/Max/Mean grand summary rows\n",
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "43",
   "metadata": {},
   "source": [
    "## 7. Model Comparison with `metrics_report()`"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "44",
   "metadata": {},
   "source": [
    "The `ark.plot.gt_style` module ships `metrics_report()`: a one-call wrapper that produces a publication-ready model comparison table. It handles formatting, brand styling, and conditional highlighting in a single call.\n",
    "\n",
    "`metrics_report(df, metrics, minimize_cols, maximize_cols)` highlights the best value in each metric column: green for minimize metrics (lower is better: MAE, RMSE), green for maximize metrics (higher is better: R², accuracy). The caller decides which direction is better for each metric; the function does not guess.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "45",
   "metadata": {},
   "outputs": [],
   "source": [
    "comparison = pd.DataFrame(\n",
    "    {\n",
    "        \"Model\": [\"Linear Regression\", \"Ridge (α=0.1)\", \"Ridge (α=1.0)\", \"Random Forest\"],\n",
    "        \"MAE\": [8.21, 8.09, 7.98, 7.43],\n",
    "        \"RMSE\": [10.42, 10.31, 10.19, 9.61],\n",
    "        \"R2\": [0.781, 0.784, 0.788, 0.810],\n",
    "    }\n",
    ")\n",
    "\n",
    "metrics_report(\n",
    "    comparison,\n",
    "    metrics=[\"MAE\", \"RMSE\", \"R2\"],\n",
    "    minimize_cols=[\"MAE\", \"RMSE\"],\n",
    "    maximize_cols=[\"R2\"],\n",
    "    title=\"Grade Prediction: Model Comparison\",\n",
    "    subtitle=\"Predicting average_marks from study_hours, attendance_pct, and program\",\n",
    "    source_note=\"university_analytics.csv · 5-fold CV · held-out 20% test set\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "46",
   "metadata": {},
   "source": [
    "<div style='background:#EAF3FA;border-left:5px solid #0369A1;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#0369A1;font-weight:bold'><i class=\"bi bi-info-circle-fill\"></i> Key Concept: metrics_report highlights by direction, not by rank</span><br><br>\n",
    "<code>minimize_cols</code> highlights the row with the <em>lowest</em> value: better for error metrics. <code>maximize_cols</code> highlights the row with the <em>highest</em> value: better for performance metrics. A column can appear in at most one list. If a column appears in neither, it is formatted but not highlighted.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "47",
   "metadata": {},
   "source": [
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 6 - Add a Gradient Boosting Row</span><br><br>\n",
    "<b>Goal:</b> Add a fifth row to <code>comparison</code>: <code>\"Gradient Boosting\"</code> with MAE=6.91, RMSE=8.84, R2=0.843: and re-run <code>metrics_report()</code>. Confirm the highlighted row updates automatically.\n",
    "<pre style='background:#FFF8E1;padding:10px;border-radius:4px;font-size:0.9em'>comparison = pd.concat([comparison, pd.DataFrame([{\"Model\": \"Gradient Boosting\", \"MAE\": 6.91, \"RMSE\": 8.84, \"R2\": 0.843}])], ignore_index=True)\n",
    "metrics_report(comparison, metrics=[...], minimize_cols=[...], maximize_cols=[...], ...)</pre>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "48",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: add Gradient Boosting row and re-run metrics_report\n",
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "49",
   "metadata": {},
   "source": [
    "## Capstone: Course Performance Report\n",
    "\n",
    "Combine every technique from this notebook into one complete report table. The report should give a department head a single table they can paste into a slide deck.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "50",
   "metadata": {},
   "source": [
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Capstone Exercise - Course Performance Report</span><br><br>\n",
    "\n",
    "<b>Goal:</b>\n",
    "<ol>\n",
    "<li>Build a <code>report</code> DataFrame grouped by <code>course</code> with columns: students, midterm mean, final mean, project mean, pass rate, and average_marks mean</li>\n",
    "<li>Wrap with <code>GT</code>. Add a descriptive title and source note</li>\n",
    "<li>Apply <code>cols_label</code> and the appropriate <code>fmt_*</code> for each column</li>\n",
    "<li>Add a <code>tab_spanner</code> over the three score columns</li>\n",
    "<li>Highlight the course with the highest pass rate (green) and lowest pass rate (light red)</li>\n",
    "<li>Add a grand <code>Mean</code> summary row across all numeric columns</li>\n",
    "<li>Call <code>themed_gt()</code> last</li>\n",
    "</ol>\n",
    "<pre style='background:#FFF8E1;padding:10px;border-radius:4px;font-size:0.9em'># Build report DataFrame first, then chain all GT methods in one expression</pre>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "51",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: build the complete course performance report\n",
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "52",
   "metadata": {},
   "source": [
    "## Further Reading\n",
    "\n",
    "| Resource | Why it matters |\n",
    "|---|---|\n",
    "| [Great Tables documentation](https://posit-dev.github.io/great-tables/) | Complete API reference with rendered examples for every method |\n",
    "| [Great Tables: `loc` reference](https://posit-dev.github.io/great-tables/reference/#targeting-cells) | Full list of location helpers: `loc.body`, `loc.column_labels`, `loc.spanner_labels`, `loc.grand_summary` |\n",
    "| [Great Tables blog: Python tables](https://posit-dev.github.io/great-tables/blog/) | Worked examples including financial reports and ML comparison tables |\n",
    "| Knaflic, C.N. (2015). *Storytelling with Data*. Wiley. | Chapter 2 covers when tables serve communication better than charts |\n",
    "| [pandas `GroupBy.agg` reference](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.agg.html) | Named aggregations (`col=(src, func)`) used throughout this notebook |\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "53",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "| GT method | What it does |\n",
    "|---|---|\n",
    "| `GT(df)` | Wrap a DataFrame and begin the method chain |\n",
    "| `themed_gt(table, n_rows=n)` | Apply project brand: header colors, font, striped rows. Call last. |\n",
    "| `tab_header(title, subtitle)` | Add a title row above the column headers |\n",
    "| `tab_source_note(text)` | Add an attribution line below the table |\n",
    "| `cols_label(**kwargs)` | Replace column names with reader-facing labels |\n",
    "| `fmt_number(columns, decimals)` | Round floats to `decimals` places |\n",
    "| `fmt_integer(columns)` | Remove decimal, add thousands separator |\n",
    "| `fmt_percent(columns, decimals)` | Multiply by 100 and append `%` |\n",
    "| `fmt_missing(columns, missing_text)` | Replace `NaN` with a readable placeholder |\n",
    "| `tab_spanner(label, columns)` | Group related columns under a shared header label |\n",
    "| `tab_style(style, locations)` | Apply a visual property (`fill`, `text`) to a location (`loc.body`, `loc.column_labels`) |\n",
    "| `loc.body(columns, rows)` | Target specific cells; `rows` takes an index or a lambda |\n",
    "| `grand_summary_rows(fns)` | Add one summary row per key in `fns`; aggregates all numeric columns |\n",
    "| `metrics_report(df, metrics, ...)` | One-call ML comparison table with directional highlighting |\n",
    "\n",
    "**Next:** Part 3: Dev Tools covers the professional toolchain: uv, ruff, type annotations, git, pytest, and pre-commit.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}