{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "---\n", "title: \"Part 12: Presenting Data with Great Tables\"\n", "---" ] }, { "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "[](https://colab.research.google.com/github/sambaiga/ds-mlops-path/blob/main/tutorials/01-python-basics/12-great-tables.ipynb) [](https://raw.githubusercontent.com/sambaiga/ds-mlops-path/main/tutorials/01-python-basics/12-great-tables.ipynb)\n" ] }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "**DS-MLOps Data Analysis**\n", "\n", "**Python 3.12+ | Author: Anthony Faustine**\n", "\n", "## Before you begin\n", "\n", "This notebook assumes you have completed Parts 8-11 (the Data Analysis section). Every example builds on the `university_analytics.csv` dataset and the `ark.plot.gt_style` module introduced alongside the plotting chapters.\n", "\n", "A plain `df.head()` output serves you in a notebook. It does not serve a stakeholder, a report, or a slide deck. Great Tables (`great_tables`) is the Python library that bridges that gap: it wraps a pandas DataFrame in a fluent API and produces publication-ready HTML tables (precise column formatting, readable labels, conditional highlighting, and summary rows) with no CSS knowledge required.\n", "\n", "> Callout markers used throughout this notebook are explained on the [book cover page](../../index.qmd#callout-guide).\n" ] }, { "cell_type": "markdown", "id": "3", "metadata": {}, "source": [ "::: {.callout-note collapse=\"true\" icon=false}\n", "## Learning Objectives\n", "\n", "By the end of Part 12 you will be able to:\n", "\n", "| # | Skill | Covered in |\n", "|---|---|---|\n", "| 1 | Explain when a table is the right choice over a chart | Sec. 1 |\n", "| 2 | Wrap a DataFrame with `GT()` and apply the project brand with `themed_gt()` | Sec. 2 |\n", "| 3 | Format numbers, percentages, and missing values with `fmt_*` methods | Sec.
3 |\n", "| 4 | Add readable column labels with `cols_label` and group related columns with `tab_spanner` | Sec. 4 |\n", "| 5 | Target cells with `loc` and apply styling with `tab_style` | Sec. 5 |\n", "| 6 | Add grand summary rows with `grand_summary_rows` | Sec. 6 |\n", "| 7 | Build a model comparison table with `metrics_report()` | Sec. 7 |\n", ":::\n" ] }, { "cell_type": "code", "execution_count": null, "id": "4", "metadata": {}, "outputs": [], "source": [ "from great_tables import GT, loc, md as gt_md, style\n", "import numpy as np\n", "import pandas as pd\n", "\n", "from ark.plot.gt_style import metrics_report, themed_gt\n", "from ark.plot.tokens import PRIMARY, SUCCESS, SURFACE_MUTED\n", "\n", "df = pd.read_csv(\"data/university_analytics.csv\")\n", "df[\"average_marks\"] = (df[\"midterm_score\"] + df[\"final_score\"] + df[\"project_score\"]) / 3\n", "df.head(3)" ] }, { "cell_type": "markdown", "id": "5", "metadata": {}, "source": [ "## 0. The Last Mile of a Data Story\n", "\n", "You have run the analysis. You have the numbers: pass rates by school, score distributions by program, trend lines across semesters. Now your manager asks for a report — something to put in front of a stakeholder, not a developer. You open a notebook, call `df.head()`, and stare at a grey monospace grid with no hierarchy, no colour, no units, and no sense of which numbers matter.\n", "\n", "`df.head()` serves you in a notebook. It does not serve anyone else.\n", "\n", "The gap between \"I have the result\" and \"I can communicate the result\" is called the last mile of a data story. It is where a lot of analysis work quietly disappears: correct findings, buried in formatting nobody wanted to read. **Great Tables** ([posit-dev.github.io/great-tables](https://posit-dev.github.io/great-tables/)) is the Python library that closes that gap. 
It wraps a pandas DataFrame in a fluent API — one that mirrors R's `{gt}` package — and produces publication-ready HTML tables with column spanners, colour scales, embedded sparklines, and controlled footnotes.\n", "\n", "### How it compares to other table tools\n", "\n", "| Tool | Output | Strengths | When to use instead |\n", "| --- | --- | --- | --- |\n", "| **Great Tables** ([posit-dev.github.io/great-tables](https://posit-dev.github.io/great-tables/)) | HTML | Full layout control, colour scales, sparklines, `{gt}`-compatible API | Reports, notebooks, any HTML output |\n", "| **pandas Styler** ([pandas docs](https://pandas.pydata.org/docs/user_guide/style.html)) | HTML | Built-in, no extra install, fast for simple highlighting | Quick colouring when GT is overkill |\n", "| **tabulate** ([tabulate on PyPI](https://pypi.org/project/tabulate/)) | Text, Markdown, HTML, LaTeX | Lightweight, great for terminal or Markdown output | CLI output, `.md` files |\n", "| **rich** ([rich.readthedocs.io](https://rich.readthedocs.io)) | Terminal | Beautiful terminal tables, progress bars | Terminal-only display |\n", "| **itables** ([mwouts.github.io/itables](https://mwouts.github.io/itables/)) | Interactive HTML | Sort, filter, search in notebook | Exploratory analysis, large tables |\n", "\n", "### Already in your environment\n", "\n", "```bash\n", "uv add great-tables # for a standalone project\n", "```\n", "\n", "Official docs: [posit-dev.github.io/great-tables/articles/intro](https://posit-dev.github.io/great-tables/articles/intro.html)" ] }, { "cell_type": "markdown", "id": "6", "metadata": {}, "source": [ "## 1. When Tables Beat Charts" ] }, { "cell_type": "markdown", "id": "7", "metadata": {}, "source": [ "A chart compresses a distribution into shape: it shows a trend, a cluster, or an outlier at a glance. 
A table preserves exact values so a reader can answer a precise question: *which course has the highest midterm average?* or *by how many points does one program outperform another?*\n", "\n", "Use a table when:\n", "- Readers will look up a specific row or compare two exact values\n", "- The differences between groups are small and a chart would compress them into noise\n", "- A report or stakeholder document needs a citable number, not an impression\n", "\n", "Use a chart when:\n", "- You want to show a trend, a distribution, or a relationship across many data points\n", "- The pattern matters more than the individual values\n" ] }, { "cell_type": "markdown", "id": "8", "metadata": {}, "source": [ "
**`themed_gt()`:** `themed_gt()` applies brand-wide options (`tab_options`) and text styling. Call it last, after all structural methods (`tab_header`, `cols_label`, `tab_spanner`, etc.), so it can apply consistently across everything you have added.\n",
"df by program instead of gender, compute the same five aggregates, then wrap with GT and themed_gt. Add a title that identifies the program.\n",
"program_summary = df.groupby(\"program\").agg(...).reset_index().round(2)\n",
"GT(program_summary).tab_header(title=gt_md(\"**...**\"), subtitle=\"...\").pipe(themed_gt, n_rows=len(program_summary))\n",
"fail_rate=0.04 reads as a raw proportion. With fmt_percent(columns='fail_rate', decimals=1), the same cell displays as 4.0%: the reader does not need to mentally multiply by 100.\n",
"NaN: a score column with ~3% missing, an optional field: should have fmt_missing(columns=..., missing_text=\":\") added to the chain. A blank cell in a published table is ambiguous: did the student not sit the exam, or did the pipeline drop the value?\n",
"program_summary from Activity 1 and add cols_label, fmt_integer, fmt_number, and fmt_percent to match the formatted table above.\n",
"formatted_programs = (\n",
" GT(program_summary)\n",
" .cols_label(program=\"Program\", n_students=\"Students\", ...)\n",
" .fmt_integer(columns=\"n_students\")\n",
" .fmt_number(columns=[...], decimals=1)\n",
" .fmt_percent(columns=\"fail_rate\", decimals=1)\n",
")\n",
"themed_gt(formatted_programs, n_rows=len(program_summary))\n",
"formatted_programs table from Activity 2 and add a tab_spanner labelled \"Scores (0-100)\" over the three score columns.\n",
"GT(program_summary)\n",
" ...\n",
" .tab_spanner(label=\"Scores (0-100)\", columns=[\"midterm\", \"final\", \"project\"])\n",
" ...\n",
"loc is a targeting system, not a filterloc.body(rows=lambda df: df['pass_rate'] == df['pass_rate'].max()) does not subset the table: it identifies which rows receive the styling. The underlying data is unchanged. You can chain multiple tab_style calls; later ones add to earlier ones without overwriting.\n",
"rowsloc.body(rows=course_detail['pass_rate'] == course_detail['pass_rate'].max()) fails because rows inside loc.body() needs a callable that receives the rendered DataFrame, not the original one. Always use a lambda: rows=lambda df: df['pass_rate'] == df['pass_rate'].max().\n",
"highlighted table and add a third tab_style call that highlights the midterm cell with the highest value in a light blue (#EAF3FA). Use a lambda for the row selection.\n",
".tab_style(\n",
" style=style.fill(color=\"#EAF3FA\"),\n",
" locations=loc.body(\n",
" columns=\"midterm\",\n",
" rows=lambda df: df[\"midterm\"] == df[\"midterm\"].max(),\n",
" ),\n",
")\n",
"grand_summary_rows aggregates every numeric column in the table. If a count column like students would produce a meaningless mean, drop it before calling GT(): df.drop(columns=[\"students\"]). If the table still includes a string column like the row label, pass numeric_only=True to the aggregation: lambda x: x.mean(numeric_only=True).\n",
"with_summary to show three summary rows: Min, Max, and Mean across all score columns. Pass a dict with three keys to fns.\n",
"fns={\"Min\": lambda x: x.min(numeric_only=True), \"Max\": lambda x: x.max(numeric_only=True), \"Mean\": lambda x: x.mean(numeric_only=True)}\n",
"minimize_cols highlights the row with the lowest value: better for error metrics. maximize_cols highlights the row with the highest value: better for performance metrics. A column can appear in at most one list. If a column appears in neither, it is formatted but not highlighted.\n",
"comparison: \"Gradient Boosting\" with MAE=6.91, RMSE=8.84, R2=0.843: and re-run metrics_report(). Confirm the highlighted row updates automatically.\n",
"comparison = pd.concat([comparison, pd.DataFrame([{\"Model\": \"Gradient Boosting\", \"MAE\": 6.91, \"RMSE\": 8.84, \"R2\": 0.843}])], ignore_index=True)\n",
"metrics_report(comparison, metrics=[...], minimize_cols=[...], maximize_cols=[...], ...)\n",
"report DataFrame grouped by course with columns: students, midterm mean, final mean, project mean, pass rate, and average_marks meanGT. Add a descriptive title and source notecols_label and the appropriate fmt_* for each columntab_spanner over the three score columnsMean summary row across all numeric columnsthemed_gt() last# Build report DataFrame first, then chain all GT methods in one expression\n", "