{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "---\n", "title: \"Part 12: Presenting Data with Great Tables\"\n", "---" ] }, { "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sambaiga/ds-mlops-path/blob/main/tutorials/01-python-basics/12-great-tables.ipynb) [![Download Notebook](https://img.shields.io/badge/Download-Notebook-blue.svg?logo=jupyter&logoColor=white)](https://raw.githubusercontent.com/sambaiga/ds-mlops-path/main/tutorials/01-python-basics/12-great-tables.ipynb)\n" ] }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "**DS-MLOps Data Analysis**\n", "\n", "**Python 3.12+ | Author: Anthony Faustine**\n", "\n", "## Before you begin\n", "\n", "This notebook assumes you have completed Parts 8-11 (the Data Analysis section). Every example builds on the `university_analytics.csv` dataset and the `ark.plot.gt_style` module introduced alongside the plotting chapters.\n", "\n", "A plain `df.head()` output serves you in a notebook. It does not serve a stakeholder, a report, or a slide deck. Great Tables (`great_tables`) is the Python library that bridges that gap: it wraps a pandas DataFrame in a fluent API and produces publication-ready HTML tables with precise column formatting, readable labels, conditional highlighting, and summary rows, all without any CSS knowledge.\n", "\n", "> Callout markers used throughout this notebook are explained on the [book cover page](../../index.qmd#callout-guide).\n" ] }, { "cell_type": "markdown", "id": "3", "metadata": {}, "source": [ "::: {.callout-note collapse=\"true\" icon=false}\n", "## Learning Objectives\n", "\n", "By the end of Part 12 you will be able to:\n", "\n", "| # | Skill | Covered in |\n", "|---|---|---|\n", "| 1 | Explain when a table is the right choice over a chart | Sec. 
1 |\n", "| 2 | Wrap a DataFrame with `GT()` and apply the project brand with `themed_gt()` | Sec. 2 |\n", "| 3 | Format numbers, percentages, and missing values with `fmt_*` methods | Sec. 3 |\n", "| 4 | Add readable column labels with `cols_label` and group related columns with `tab_spanner` | Sec. 4 |\n", "| 5 | Target cells with `loc` and apply styling with `tab_style` | Sec. 5 |\n", "| 6 | Add grand summary rows with `grand_summary_rows` | Sec. 6 |\n", "| 7 | Build a model comparison table with `metrics_report()` | Sec. 7 |\n", ":::\n" ] }, { "cell_type": "code", "execution_count": null, "id": "4", "metadata": {}, "outputs": [], "source": [ "from great_tables import GT, loc, md as gt_md, style\n", "import numpy as np\n", "import pandas as pd\n", "\n", "from ark.plot.gt_style import metrics_report, themed_gt\n", "from ark.plot.tokens import PRIMARY, SUCCESS, SURFACE_MUTED\n", "\n", "df = pd.read_csv(\"data/university_analytics.csv\")\n", "df[\"average_marks\"] = (df[\"midterm_score\"] + df[\"final_score\"] + df[\"project_score\"]) / 3\n", "df.head(3)" ] }, { "cell_type": "markdown", "id": "5", "metadata": {}, "source": [ "## 0. The Last Mile of a Data Story\n", "\n", "You have run the analysis. You have the numbers: pass rates by school, score distributions by program, trend lines across semesters. Now your manager asks for a report — something to put in front of a stakeholder, not a developer. You open a notebook, call `df.head()`, and stare at a grey monospace grid with no hierarchy, no colour, no units, and no sense of which numbers matter.\n", "\n", "`df.head()` serves you in a notebook. It does not serve anyone else.\n", "\n", "The gap between \"I have the result\" and \"I can communicate the result\" is called the last mile of a data story. It is where a lot of analysis work quietly disappears: correct findings, buried in formatting nobody wanted to read. 
**Great Tables** ([posit-dev.github.io/great-tables](https://posit-dev.github.io/great-tables/)) is the Python library that closes that gap. It wraps a pandas DataFrame in a fluent API — one that mirrors R's `{gt}` package — and produces publication-ready HTML tables with column spanners, colour scales, embedded sparklines, and controlled footnotes.\n", "\n", "### How it compares to other table tools\n", "\n", "| Tool | Output | Strengths | When to use instead |\n", "| --- | --- | --- | --- |\n", "| **Great Tables** ([posit-dev.github.io/great-tables](https://posit-dev.github.io/great-tables/)) | HTML | Full layout control, colour scales, sparklines, `{gt}`-compatible API | Reports, notebooks, any HTML output |\n", "| **pandas Styler** ([pandas docs](https://pandas.pydata.org/docs/user_guide/style.html)) | HTML | Built-in, no extra install, fast for simple highlighting | Quick colouring when GT is overkill |\n", "| **tabulate** ([tabulate on PyPI](https://pypi.org/project/tabulate/)) | Text, Markdown, HTML, LaTeX | Lightweight, great for terminal or Markdown output | CLI output, `.md` files |\n", "| **rich** ([rich.readthedocs.io](https://rich.readthedocs.io)) | Terminal | Beautiful terminal tables, progress bars | Terminal-only display |\n", "| **itables** ([mwouts.github.io/itables](https://mwouts.github.io/itables/)) | Interactive HTML | Sort, filter, search in notebook | Exploratory analysis, large tables |\n", "\n", "### Already in your environment\n", "\n", "```bash\n", "uv add great-tables # for a standalone project\n", "```\n", "\n", "Official docs: [posit-dev.github.io/great-tables/articles/intro](https://posit-dev.github.io/great-tables/articles/intro.html)" ] }, { "cell_type": "markdown", "id": "6", "metadata": {}, "source": [ "## 1. When Tables Beat Charts" ] }, { "cell_type": "markdown", "id": "7", "metadata": {}, "source": [ "A chart compresses a distribution into shape: it shows a trend, a cluster, or an outlier at a glance. 
A table preserves exact values so a reader can answer a precise question: *which course has the highest midterm average?* or *by how many points does one program outperform another?*\n", "\n", "Use a table when:\n", "- Readers will look up a specific row or compare two exact values\n", "- The differences between groups are small and a chart would compress them into noise\n", "- A report or stakeholder document needs a citable number, not an impression\n", "\n", "Use a chart when:\n", "- You want to show a trend, a distribution, or a relationship across many data points\n", "- The pattern matters more than the individual values\n" ] }, { "cell_type": "markdown", "id": "8", "metadata": {}, "source": [ "
\n", " Key Concept: Tables serve lookup; charts serve pattern recognition

\n", "Neither replaces the other. A data storytelling section (Part 7) shows a trend with a chart. A summary report shows the underlying numbers in a table. Together they answer both what happened and by exactly how much.\n", "
" ] }, { "cell_type": "markdown", "id": "9", "metadata": {}, "source": [ "## 2. Your First Styled Table" ] }, { "cell_type": "markdown", "id": "10", "metadata": {}, "source": [ "Every Great Tables workflow starts with `GT(df)`: wrapping a pandas DataFrame in the Great Tables object. From there you chain methods to add structure and styling. On its own, `GT(df)` renders a minimal unstyled table. `themed_gt()` applies the project's brand: column header background, font, border colours, and alternating row stripes: one call at the end of the chain.\n", "\n", "The first example is a summary of mean scores by gender:\n" ] }, { "cell_type": "code", "execution_count": null, "id": "11", "metadata": {}, "outputs": [], "source": [ "summary = (\n", " df.groupby(\"gender\")\n", " .agg(\n", " n_students=(\"student_id\", \"count\"),\n", " midterm=(\"midterm_score\", \"mean\"),\n", " final=(\"final_score\", \"mean\"),\n", " project=(\"project_score\", \"mean\"),\n", " fail_rate=(\"passed\", lambda x: (~x).mean()),\n", " )\n", " .reset_index()\n", " .round(2)\n", ")\n", "summary" ] }, { "cell_type": "markdown", "id": "12", "metadata": {}, "source": [ "`GT(df)` alone already renders a table, but column names are raw and values have no formatting. Wrapping it in `themed_gt()` applies the brand while `.tab_header()` adds a title and subtitle:\n" ] }, { "cell_type": "code", "execution_count": null, "id": "13", "metadata": {}, "outputs": [], "source": [ "table = (\n", " GT(summary)\n", " .tab_header(\n", " title=gt_md(\"**Mean Exam Scores by Gender**\"),\n", " subtitle=\"Students with complete score records across all three components\",\n", " )\n", " .tab_source_note(\"Source: DS-MLOps university analytics dataset · 2,400 rows\")\n", ")\n", "themed_gt(table, n_rows=len(summary))" ] }, { "cell_type": "markdown", "id": "14", "metadata": {}, "source": [ "
\n", " Key Concept: The chain always ends with themed_gt()

\n", "themed_gt() applies brand-wide options (tab_options) and text styling. Call it last, after all structural methods (tab_header, cols_label, tab_spanner, etc.) so it can apply consistently across everything you have added.\n", "
" ] }, { "cell_type": "markdown", "id": "15", "metadata": {}, "source": [ "
\n", " Activity 1 - First Styled Table

\n", "Goal: Group df by program instead of gender, compute the same five aggregates, then wrap with GT and themed_gt. Add a title that identifies the program.\n", "
program_summary = df.groupby(\"program\").agg(...).reset_index().round(2)\n",
    "GT(program_summary).tab_header(title=gt_md(\"**...**\"), subtitle=\"...\").pipe(themed_gt, n_rows=len(program_summary))
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "16", "metadata": {}, "outputs": [], "source": [ "# TODO: build program_summary and display with GT + themed_gt\n", "..." ] }, { "cell_type": "markdown", "id": "17", "metadata": {}, "source": [ "## 3. Formatting Values and Labelling Columns" ] }, { "cell_type": "markdown", "id": "18", "metadata": {}, "source": [ "Raw floats in a table communicate false precision: a pass rate of `0.87654` signals noise, not information. Great Tables `fmt_*` methods format each column's values to the right precision for its type, and `cols_label` replaces machine-readable column names with reader-facing ones.\n", "\n", "The four formatting methods used most in DS tables:\n", "- `fmt_number(columns, decimals)`: round to `decimals` places\n", "- `fmt_integer(columns)`: strip decimal point, add thousands separator\n", "- `fmt_percent(columns, decimals)`: multiply by 100 and append `%`\n", "- `fmt_missing(columns, missing_text)`: replace `NaN` with a readable label\n" ] }, { "cell_type": "markdown", "id": "19", "metadata": {}, "source": [ "
\n", " Example: fmt_percent turns 0.913 into 91.3%

\n", "Without formatting, fail_rate=0.04 reads as a raw proportion. With fmt_percent(columns='fail_rate', decimals=1), the same cell displays as 4.0%, so the reader does not need to mentally multiply by 100.\n", "
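As a rough illustration of what happens to each value, the conversion can be sketched with Python's built-in format specification, independent of Great Tables. The `.1%` spec is standard Python; treating it as equivalent to `fmt_percent(decimals=1)` is an assumption that holds for simple values like these, not a statement about the library's internals.

```python
# Plain-Python sketch of the proportion-to-percentage conversion.
# Assumption: this mirrors fmt_percent(decimals=1) for simple values;
# it does not call Great Tables.
raw_values = [0.04, 0.913]

# The '%' format type multiplies by 100 and appends a percent sign.
formatted = [format(value, '.1%') for value in raw_values]

for raw, text in zip(raw_values, formatted):
    print(raw, '->', text)
```

The stored values stay as proportions; only the displayed text changes, which is the same separation of data and presentation that the `fmt_*` methods rely on.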
" ] }, { "cell_type": "code", "execution_count": null, "id": "20", "metadata": {}, "outputs": [], "source": [ "formatted = (\n", " GT(summary)\n", " .tab_header(\n", " title=gt_md(\"**Mean Exam Scores by Gender**\"),\n", " subtitle=\"All figures rounded to one decimal place\",\n", " )\n", " .cols_label(\n", " gender=\"Gender\",\n", " n_students=\"Students\",\n", " midterm=\"Midterm\",\n", " final=\"Final\",\n", " project=\"Project\",\n", " fail_rate=\"Fail Rate\", # noqa: S106\n", " )\n", " .fmt_integer(columns=\"n_students\")\n", " .fmt_number(columns=[\"midterm\", \"final\", \"project\"], decimals=1)\n", " .fmt_percent(columns=\"fail_rate\", decimals=1) # noqa: S106\n", " .tab_source_note(\"Source: DS-MLOps university analytics dataset · 2,400 rows\")\n", ")\n", "themed_gt(formatted, n_rows=len(summary))" ] }, { "cell_type": "markdown", "id": "21", "metadata": {}, "source": [ "
\n", " Pro Tip: sub_missing catches the NaN before the reader sees it

\n", "Any column that can contain NaN (a score column with ~3% missing, an optional field) should have sub_missing(columns=..., missing_text=\":\") added to the chain. A blank cell in a published table is ambiguous: did the student not sit the exam, or did the pipeline drop the value?\n", "
" ] }, { "cell_type": "markdown", "id": "22", "metadata": {}, "source": [ "
\n", " Activity 2 - Format the Program Table

\n", "Goal: Take the program_summary from Activity 1 and add cols_label, fmt_integer, fmt_number, and fmt_percent to match the formatted table above.\n", "
formatted_programs = (\n",
    "    GT(program_summary)\n",
    "    .cols_label(program=\"Program\", n_students=\"Students\", ...)\n",
    "    .fmt_integer(columns=\"n_students\")\n",
    "    .fmt_number(columns=[...], decimals=1)\n",
    "    .fmt_percent(columns=\"fail_rate\", decimals=1)\n",
    ")\n",
    "themed_gt(formatted_programs, n_rows=len(program_summary))
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "23", "metadata": {}, "outputs": [], "source": [ "# TODO: add cols_label and fmt_* to your program_summary table\n", "..." ] }, { "cell_type": "markdown", "id": "24", "metadata": {}, "source": [ "## 4. Column Spanners" ] }, { "cell_type": "markdown", "id": "25", "metadata": {}, "source": [ "When a table has several columns that belong to a natural group, for example three score columns or multiple model metrics, a column spanner adds a shared header label above the group. This reduces cognitive load: the reader understands the table structure before reading the individual values.\n", "\n", "`tab_spanner(label, columns)` draws the label above the specified columns. It does not move or reorder columns; it only adds a visual grouping above them.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "26", "metadata": {}, "outputs": [], "source": [ "# Course performance table: one row per course\n", "course_detail = (\n", " df.groupby(\"course\")\n", " .agg(\n", " students=(\"student_id\", \"count\"),\n", " midterm=(\"midterm_score\", \"mean\"),\n", " final=(\"final_score\", \"mean\"),\n", " project=(\"project_score\", \"mean\"),\n", " pass_rate=(\"passed\", \"mean\"),\n", " )\n", " .reset_index()\n", " .round(2)\n", ")\n", "course_detail" ] }, { "cell_type": "code", "execution_count": null, "id": "27", "metadata": {}, "outputs": [], "source": [ "with_spanner = (\n", " GT(course_detail)\n", " .tab_header(title=gt_md(\"**Performance by Course**\"))\n", " .cols_label(\n", " course=\"Course\",\n", " students=\"Students\",\n", " midterm=\"Midterm\",\n", " final=\"Final\",\n", " project=\"Project\",\n", " pass_rate=\"Pass Rate\", # noqa: S106\n", " )\n", " .tab_spanner(label=\"Score (0-100)\", columns=[\"midterm\", \"final\", \"project\"])\n", " .fmt_integer(columns=\"students\")\n", " .fmt_number(columns=[\"midterm\", \"final\", \"project\"], decimals=1)\n", " .fmt_percent(columns=\"pass_rate\", 
decimals=1) # noqa: S106\n", " .tab_source_note(\"Source: DS-MLOps university analytics dataset · 2,400 rows\")\n", ")\n", "themed_gt(with_spanner, n_rows=len(course_detail))" ] }, { "cell_type": "markdown", "id": "28", "metadata": {}, "source": [ "
\n", " Activity 3 - Add a Spanner

\n", "Goal: Take the formatted_programs table from Activity 2 and add a tab_spanner labelled \"Scores (0-100)\" over the three score columns.\n", "
GT(program_summary)\n",
    "    ...\n",
    "    .tab_spanner(label=\"Scores (0-100)\", columns=[\"midterm\", \"final\", \"project\"])\n",
    "    ...
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "29", "metadata": {}, "outputs": [], "source": [ "# TODO: add a tab_spanner to the program table\n", "..." ] }, { "cell_type": "markdown", "id": "30", "metadata": {}, "source": [ "## 5. Conditional Styling" ] }, { "cell_type": "markdown", "id": "31", "metadata": {}, "source": [ "Conditional styling directs the reader's eye to the cells that matter: the highest pass rate, the lowest score, an outlier. `tab_style` applies a visual property and `loc` specifies exactly where it applies. `style` is the what, `loc` is the where.\n", "\n", "The most common locations:\n", "- `loc.body(columns, rows)`: specific cells in the data area\n", "- `loc.column_labels()`: the column header row\n", "- `loc.title()` / `loc.subtitle()`: the table header text\n", "\n", "`rows` inside `loc.body()` accepts an integer index, a list of indices, or a lambda that receives the DataFrame and returns a boolean Series.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "32", "metadata": {}, "outputs": [], "source": [ "highlighted = (\n", " GT(course_detail)\n", " .tab_header(title=gt_md(\"**Course Performance: Best and Worst Pass Rate**\"))\n", " .cols_label(\n", " course=\"Course\",\n", " students=\"Students\",\n", " midterm=\"Midterm\",\n", " final=\"Final\",\n", " project=\"Project\",\n", " pass_rate=\"Pass Rate\", # noqa: S106\n", " )\n", " .tab_spanner(label=\"Score (0-100)\", columns=[\"midterm\", \"final\", \"project\"])\n", " .fmt_integer(columns=\"students\")\n", " .fmt_number(columns=[\"midterm\", \"final\", \"project\"], decimals=1)\n", " .fmt_percent(columns=\"pass_rate\", decimals=1) # noqa: S106\n", " .tab_style(\n", " style=style.fill(color=SUCCESS),\n", " locations=loc.body(\n", " columns=\"pass_rate\",\n", " rows=lambda df_gt: df_gt[\"pass_rate\"] == df_gt[\"pass_rate\"].max(),\n", " ),\n", " )\n", " .tab_style(\n", " style=style.fill(color=\"#FEF2F2\"),\n", " locations=loc.body(\n", " 
columns=\"pass_rate\",\n", " rows=lambda df_gt: df_gt[\"pass_rate\"] == df_gt[\"pass_rate\"].min(),\n", " ),\n", " )\n", ")\n", "themed_gt(highlighted, n_rows=len(course_detail))" ] }, { "cell_type": "markdown", "id": "33", "metadata": {}, "source": [ "
\n", " Key Concept: loc is a targeting system, not a filter

\n", "loc.body(rows=lambda df: df['pass_rate'] == df['pass_rate'].max()) does not subset the table: it identifies which rows receive the styling. The underlying data is unchanged. You can chain multiple tab_style calls; later ones add to earlier ones without overwriting.\n", "
" ] }, { "cell_type": "markdown", "id": "34", "metadata": {}, "source": [ "
\n", " Common Mistake: Passing a boolean mask directly to rows

\n", "loc.body(rows=course_detail['pass_rate'] == course_detail['pass_rate'].max()) fails because rows does not accept a precomputed boolean Series. It expects an integer index, a list of indices, or a callable that receives the table's DataFrame and returns a boolean Series. Always wrap the comparison in a lambda: rows=lambda df: df['pass_rate'] == df['pass_rate'].max().\n", "
" ] }, { "cell_type": "markdown", "id": "35", "metadata": {}, "source": [ "
\n", " Activity 4 - Highlight the Best Midterm Score

\n", "Goal: Take the highlighted table and add a third tab_style call that highlights the midterm cell with the highest value in a light blue (#EAF3FA). Use a lambda for the row selection.\n", "
.tab_style(\n",
    "    style=style.fill(color=\"#EAF3FA\"),\n",
    "    locations=loc.body(\n",
    "        columns=\"midterm\",\n",
    "        rows=lambda df: df[\"midterm\"] == df[\"midterm\"].max(),\n",
    "    ),\n",
    ")
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "36", "metadata": {}, "outputs": [], "source": [ "# TODO: add a third tab_style call for the highest midterm value\n", "..." ] }, { "cell_type": "markdown", "id": "37", "metadata": {}, "source": [ "## 6. Summary Rows" ] }, { "cell_type": "markdown", "id": "38", "metadata": {}, "source": [ "A summary row aggregates the entire table into one footer row: a grand mean, a column total, or a count. The reader no longer needs to mentally compute the aggregate, and the table and its summary stay in the same visual unit.\n", "\n", "`grand_summary_rows(fns)` adds these rows. `fns` is a dict mapping a display label to an aggregation function. In version 0.20, it aggregates all numeric columns in the table, so the DataFrame passed to `GT` should contain only the columns you want summarised:\n" ] }, { "cell_type": "code", "execution_count": null, "id": "39", "metadata": {}, "outputs": [], "source": [ "from great_tables import vals as gt_vals # noqa: F401\n", "\n", "# Use only the score + pass_rate columns so the summary row is meaningful\n", "course_scores = course_detail.drop(columns=[\"students\"])\n", "\n", "with_summary = (\n", " GT(course_scores)\n", " .tab_header(title=gt_md(\"**Course Summary with Grand Mean**\"))\n", " .cols_label(\n", " course=\"Course\",\n", " midterm=\"Midterm\",\n", " final=\"Final\",\n", " project=\"Project\",\n", " pass_rate=\"Pass Rate\", # noqa: S106\n", " )\n", " .tab_spanner(label=\"Score (0-100)\", columns=[\"midterm\", \"final\", \"project\"])\n", " .fmt_number(columns=[\"midterm\", \"final\", \"project\"], decimals=1)\n", " .fmt_percent(columns=\"pass_rate\", decimals=1) # noqa: S106\n", " .grand_summary_rows(\n", " fns={\"Mean\": lambda x: x.mean(numeric_only=True)},\n", " )\n", ")\n", "themed_gt(with_summary, n_rows=len(course_scores))" ] }, { "cell_type": "markdown", "id": "40", "metadata": {}, "source": [ "
\n", " Pro Tip: Shape the DataFrame before passing it to GT

\n", "grand_summary_rows aggregates every numeric column in the table. If a count column like students would produce a meaningless mean, drop it before calling GT(): df.drop(columns=[\"students\"]). If the table still includes a string column like the row label, pass numeric_only=True to the aggregation: lambda x: x.mean(numeric_only=True).\n", "
" ] }, { "cell_type": "markdown", "id": "41", "metadata": {}, "source": [ "
\n", " Activity 5 - Add a Min and Max Row

\n", "Goal: Extend with_summary to show three summary rows: Min, Max, and Mean across all score columns. Pass a dict with three keys to fns.\n", "
fns={\"Min\": lambda x: x.min(numeric_only=True), \"Max\": lambda x: x.max(numeric_only=True), \"Mean\": lambda x: x.mean(numeric_only=True)}
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "42", "metadata": {}, "outputs": [], "source": [ "# TODO: add Min/Max/Mean grand summary rows\n", "..." ] }, { "cell_type": "markdown", "id": "43", "metadata": {}, "source": [ "## 7. Model Comparison with `metrics_report()`" ] }, { "cell_type": "markdown", "id": "44", "metadata": {}, "source": [ "The `ark.plot.gt_style` module ships `metrics_report()`: a one-call wrapper that produces a publication-ready model comparison table. It handles formatting, brand styling, and conditional highlighting in a single call.\n", "\n", "`metrics_report(df, metrics, minimize_cols, maximize_cols)` highlights the best value in each metric column: green for minimize metrics (lower is better: MAE, RMSE), green for maximize metrics (higher is better: R², accuracy). The caller decides which direction is better for each metric; the function does not guess.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "45", "metadata": {}, "outputs": [], "source": [ "comparison = pd.DataFrame(\n", " {\n", " \"Model\": [\"Linear Regression\", \"Ridge (α=0.1)\", \"Ridge (α=1.0)\", \"Random Forest\"],\n", " \"MAE\": [8.21, 8.09, 7.98, 7.43],\n", " \"RMSE\": [10.42, 10.31, 10.19, 9.61],\n", " \"R2\": [0.781, 0.784, 0.788, 0.810],\n", " }\n", ")\n", "\n", "metrics_report(\n", " comparison,\n", " metrics=[\"MAE\", \"RMSE\", \"R2\"],\n", " minimize_cols=[\"MAE\", \"RMSE\"],\n", " maximize_cols=[\"R2\"],\n", " title=\"Grade Prediction: Model Comparison\",\n", " subtitle=\"Predicting average_marks from study_hours, attendance_pct, and program\",\n", " source_note=\"university_analytics.csv · 5-fold CV · held-out 20% test set\",\n", ")" ] }, { "cell_type": "markdown", "id": "46", "metadata": {}, "source": [ "
\n", " Key Concept: metrics_report highlights by direction, not by rank

\n", "minimize_cols highlights the row with the lowest value, which suits error metrics. maximize_cols highlights the row with the highest value, which suits performance metrics. A column can appear in at most one list. If a column appears in neither, it is formatted but not highlighted.\n", "
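The selection rule described here can be sketched in plain Python using the comparison values from this section. This is only an illustration of direction-aware selection; the helper name `best_models` is hypothetical and is not the `ark.plot.gt_style` implementation.

```python
# Sketch of direction-aware best-row selection (hypothetical helper,
# not the metrics_report implementation).
rows = [
    {'Model': 'Linear Regression', 'MAE': 8.21, 'RMSE': 10.42, 'R2': 0.781},
    {'Model': 'Ridge (α=0.1)', 'MAE': 8.09, 'RMSE': 10.31, 'R2': 0.784},
    {'Model': 'Ridge (α=1.0)', 'MAE': 7.98, 'RMSE': 10.19, 'R2': 0.788},
    {'Model': 'Random Forest', 'MAE': 7.43, 'RMSE': 9.61, 'R2': 0.810},
]


def best_models(rows, minimize_cols, maximize_cols):
    """Return the best model name per metric, given the caller's direction."""
    # A metric can only have one direction.
    overlap = set(minimize_cols) & set(maximize_cols)
    if overlap:
        raise ValueError(f'columns listed in both directions: {sorted(overlap)}')

    best = {}
    for col in minimize_cols:
        # Lower is better: pick the row with the smallest value.
        best[col] = min(rows, key=lambda row: row[col])['Model']
    for col in maximize_cols:
        # Higher is better: pick the row with the largest value.
        best[col] = max(rows, key=lambda row: row[col])['Model']
    return best


best = best_models(rows, minimize_cols=['MAE', 'RMSE'], maximize_cols=['R2'])
print(best)
```

Keeping the direction as an explicit argument means the caller states which way is better for each metric, so nothing has to be inferred from the column name.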
" ] }, { "cell_type": "markdown", "id": "47", "metadata": {}, "source": [ "
\n", " Activity 6 - Add a Gradient Boosting Row

\n", "Goal: Add a fifth row to comparison (\"Gradient Boosting\" with MAE=6.91, RMSE=8.84, R2=0.843) and re-run metrics_report(). Confirm the highlighted row updates automatically.\n", "
comparison = pd.concat([comparison, pd.DataFrame([{\"Model\": \"Gradient Boosting\", \"MAE\": 6.91, \"RMSE\": 8.84, \"R2\": 0.843}])], ignore_index=True)\n",
    "metrics_report(comparison, metrics=[...], minimize_cols=[...], maximize_cols=[...], ...)
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "48", "metadata": {}, "outputs": [], "source": [ "# TODO: add Gradient Boosting row and re-run metrics_report\n", "..." ] }, { "cell_type": "markdown", "id": "49", "metadata": {}, "source": [ "## Capstone: Course Performance Report\n", "\n", "Combine every technique from this notebook into one complete report table. The report should give a department head a single table they can paste into a slide deck.\n" ] }, { "cell_type": "markdown", "id": "50", "metadata": {}, "source": [ "
\n", " Capstone Exercise - Course Performance Report

\n", "\n", "Goal:\n", "
    \n", "
  1. Build a report DataFrame grouped by course with columns: students, midterm mean, final mean, project mean, pass rate, and average_marks mean\n", "
  2. Wrap with GT. Add a descriptive title and source note\n", "
  3. Apply cols_label and the appropriate fmt_* for each column\n", "
  4. Add a tab_spanner over the three score columns\n", "
  5. Highlight the course with the highest pass rate (green) and lowest pass rate (light red)\n", "
  6. Add a grand Mean summary row across all numeric columns\n", "
  7. Call themed_gt() last\n", "
\n", "
# Build report DataFrame first, then chain all GT methods in one expression
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "51", "metadata": {}, "outputs": [], "source": [ "# TODO: build the complete course performance report\n", "..." ] }, { "cell_type": "markdown", "id": "52", "metadata": {}, "source": [ "## Further Reading\n", "\n", "| Resource | Why it matters |\n", "|---|---|\n", "| [Great Tables documentation](https://posit-dev.github.io/great-tables/) | Complete API reference with rendered examples for every method |\n", "| [Great Tables: `loc` reference](https://posit-dev.github.io/great-tables/reference/#targeting-cells) | Full list of location helpers: `loc.body`, `loc.column_labels`, `loc.spanner_labels`, `loc.grand_summary` |\n", "| [Great Tables blog: Python tables](https://posit-dev.github.io/great-tables/blog/) | Worked examples including financial reports and ML comparison tables |\n", "| Knaflic, C.N. (2015). *Storytelling with Data*. Wiley. | Chapter 2 covers when tables serve communication better than charts |\n", "| [pandas `GroupBy.agg` reference](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.agg.html) | Named aggregations (`col=(src, func)`) used throughout this notebook |\n" ] }, { "cell_type": "markdown", "id": "53", "metadata": {}, "source": [ "## Summary\n", "\n", "| GT method | What it does |\n", "|---|---|\n", "| `GT(df)` | Wrap a DataFrame and begin the method chain |\n", "| `themed_gt(table, n_rows=n)` | Apply project brand: header colors, font, striped rows. Call last. 
|\n", "| `tab_header(title, subtitle)` | Add a title row above the column headers |\n", "| `tab_source_note(text)` | Add an attribution line below the table |\n", "| `cols_label(**kwargs)` | Replace column names with reader-facing labels |\n", "| `fmt_number(columns, decimals)` | Round floats to `decimals` places |\n", "| `fmt_integer(columns)` | Round to whole numbers, add thousands separator |\n", "| `fmt_percent(columns, decimals)` | Multiply by 100 and append `%` |\n", "| `sub_missing(columns, missing_text)` | Replace `NaN` with a readable placeholder |\n", "| `tab_spanner(label, columns)` | Group related columns under a shared header label |\n", "| `tab_style(style, locations)` | Apply a visual property (`fill`, `text`) to a location (`loc.body`, `loc.column_labels`) |\n", "| `loc.body(columns, rows)` | Target specific cells; `rows` takes an index, a list of indices, or a lambda |\n", "| `grand_summary_rows(fns)` | Add one summary row per key in `fns`; aggregates all numeric columns |\n", "| `metrics_report(df, metrics, ...)` | One-call ML comparison table with directional highlighting |\n", "\n", "**Next:** Part 3: Dev Tools covers the professional toolchain: uv, ruff, type annotations, git, pytest, and pre-commit.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.12" } }, "nbformat": 4, "nbformat_minor": 5 }