{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "---\n", "title: \"Part 7: Data Storytelling and House Style\"\n", "---" ] }, { "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sambaiga/ds-mlops-path/blob/main/tutorials/01-python-basics/07-data-storytelling.ipynb) [![Download Notebook](https://img.shields.io/badge/Download-Notebook-blue.svg?logo=jupyter&logoColor=white)](https://raw.githubusercontent.com/sambaiga/ds-mlops-path/main/tutorials/01-python-basics/07-data-storytelling.ipynb)" ] }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "**DS-MLOps Python Foundations**\n", "\n", "**Python 3.12+ | Author: Anthony Faustine**\n", "\n", "## Before you begin\n", "\n", "This notebook assumes you have completed Part 5 (`05-matplotlib.ipynb`) and Part 6 (`06-lets-plot.ipynb`). Parts 5 and 6 taught you how to make a chart at all. This part is about taste: which chart to make, what to leave out of it, and how to brand it once it is ready to show someone other than yourself, using this project's own `ark.plot` module.\n", "\n", "::: {.callout-note collapse=\"true\" icon=false}\n", "## Topics covered\n", "\n", "| Topic | Why it matters |\n", "|---|---|\n", "| **Chart selection** | The question decides the chart, not the data type |\n", "| **Data-ink ratio** | Every pixel that is not data is a tax on the reader's attention |\n", "| **Common chart lies** | The same data, drawn to mislead instead of inform |\n", "| **Tables vs. charts** | Exact lookup wants a table; pattern recognition wants a chart |\n", "| **House style** | Replacing per-chart styling with one reusable theme |\n", ":::\n", "\n", "> Callout markers used throughout this notebook are explained on the [book cover page](../../index.qmd#callout-guide)." 
] }, { "cell_type": "markdown", "id": "3", "metadata": {}, "source": [ "::: {.callout-note collapse=\"true\" icon=false}\n", "## Learning Objectives\n", "\n", "By the end of Part 7 you will be able to:\n", "\n", "| # | Skill | Covered in |\n", "|---|---|---|\n", "| 1 | Choose a chart type based on the question, not habit | Sec. 1 |\n", "| 2 | Recognise and avoid the most common ways charts mislead | Sec. 2 |\n", "| 3 | Decide when a table communicates better than a chart | Sec. 3 |\n", "| 4 | Apply `ark.plot`'s house theme to matplotlib and lets-plot charts | Sec. 4 |\n", "| 5 | Identify preattentive attributes and use them deliberately | Sec. 5 |\n", "| 6 | Add annotations that carry the chart's narrative | Sec. 6 |\n", "| 7 | Explain why summary statistics alone are never sufficient (Datasaurus) | Sec. 7 |\n", "| 8 | Produce one polished, captioned chart from a messy dataset | Capstone |\n", ":::\n" ] }, { "cell_type": "markdown", "id": "4", "metadata": {}, "source": [ "## 1. Picking the Right Chart, and Cutting What Does Not Help\n", "\n", "The most common visualisation mistake is not a styling problem, it is a **selection** problem: building a chart type out of habit instead of asking what question it needs to answer. A pie chart cannot answer \"which course improved the most,\" because comparing angles is something people are measurably worse at than comparing positions along a shared axis. That is not opinion, it is the finding of a well-known study by Cleveland and McGill (1984) ranking how accurately people read different visual encodings: position along a common scale first, then length, then angle, then area, with colour saturation last." 
] }, { "cell_type": "code", "execution_count": null, "id": "5", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "# Load the shared university dataset used across all Part 1 and Part 2 notebooks\n", "df7 = pd.read_csv(\"data/university_analytics.csv\")\n", "\n", "# `results` keeps the course/semester/exam_score structure that cells below depend on\n", "results = df7[[\"course\", \"semester\", \"final_score\"]].copy()\n", "results = results.rename(columns={\"final_score\": \"exam_score\"})\n", "results.head()" ] }, { "cell_type": "markdown", "id": "6", "metadata": {}, "source": [ "**Data-ink ratio**, a term from Edward Tufte's *The Visual Display of Quantitative Information*, is the proportion of a chart's ink that represents actual data, as opposed to gridlines, borders, redundant legends, and other decoration. Maximising it does not mean a bare chart, it means every mark earns its place. Compare matplotlib's defaults to a version with the same data and nothing extra:" ] }, { "cell_type": "code", "execution_count": null, "id": "7", "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "course_means = results.groupby(\"course\")[\"exam_score\"].mean()\n", "\n", "fig, axes = plt.subplots(1, 2, figsize=(10, 3.5))\n", "\n", "# Left: matplotlib defaults. Box on all four sides, both gridlines, and\n", "# per-bar colours that add nothing, since each bar already has its\n", "# own label directly underneath it.\n", "axes[0].bar(course_means.index, course_means.values, color=[\"#4477AA\", \"#EE6677\", \"#228833\"])\n", "axes[0].set_title(\"Before: matplotlib defaults\")\n", "axes[0].grid(True)\n", "\n", "# Right: same data, decluttered. 
No top/right spines, no gridlines, the\n", "# value printed directly on each bar instead of relying on the y-axis.\n", "axes[1].bar(course_means.index, course_means.values, color=\"#4477AA\")\n", "axes[1].spines[\"top\"].set_visible(False)\n", "axes[1].spines[\"right\"].set_visible(False)\n", "axes[1].set_yticks([])\n", "for i, value in enumerate(course_means.values):\n", " axes[1].text(i, value + 1, f\"{value:.0f}\", ha=\"center\")\n", "axes[1].set_title(\"After: higher data-ink ratio\")\n", "\n", "fig.tight_layout()" ] }, { "cell_type": "markdown", "id": "8", "metadata": {}, "source": [ "
\n", " Key Concept: Data-ink ratio

\n", "Every gridline, border, and redundant legend entry competes with your data for the reader's attention. The right-hand chart above removes the y-axis ticks and tick labels (the value printed on each bar makes them redundant) along with the top and right spines, and prints each value once, directly on its bar, instead of forcing the reader to trace a line back to an axis. Decluttering is not about making a chart sparse for its own sake, it is about removing anything that does not carry information.\n", "
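The decluttering steps described above can be collected into a small helper so they are applied the same way every time. The sketch below is one possible version and is not part of `ark.plot`: the function name `declutter`, the course labels, and the scores are all assumptions made for illustration.

```python
import matplotlib.pyplot as plt


def declutter(ax):
    # Hypothetical helper: remove the non-data ink discussed above.
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.grid(False)
    ax.set_yticks([])  # values are printed on the bars instead
    return ax


# Illustrative values only, not taken from the course dataset.
labels = ['Course A', 'Course B', 'Course C']
scores = [68.0, 71.0, 74.0]

fig, ax = plt.subplots(figsize=(5, 3))
ax.bar(labels, scores, color='#4477AA')
declutter(ax)
for i, value in enumerate(scores):
    # One label per bar replaces the y-axis ticks removed above.
    ax.text(i, value + 1, f'{value:.0f}', ha='center')
fig.tight_layout()
```

Keeping the steps in one function means a later decision (for example, bringing back light horizontal gridlines) is made once rather than chart by chart.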
" ] }, { "cell_type": "markdown", "id": "9", "metadata": {}, "source": [ "**One message per chart** is the last principle worth internalising before the chart-lie gallery in Section 2: a chart that tries to answer three questions at once usually answers none of them clearly. If you find yourself adding a third colour dimension, a secondary y-axis, and a trend line to the same chart, that is a sign you need two charts, not one crowded one." ] }, { "cell_type": "markdown", "id": "10", "metadata": {}, "source": [ "## 2. Common Chart Lies\n", "\n", "None of the charts below contain incorrect numbers. Each one is drawn in a way that makes the reader's brain compute the wrong conclusion from correct data. Recognising these on sight, in your own draft charts, is the actual skill." ] }, { "cell_type": "code", "execution_count": null, "id": "11", "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(1, 2, figsize=(9, 3.5))\n", "\n", "axes[0].bar(course_means.index, course_means.values, color=\"#4477AA\")\n", "axes[0].set_ylim(0, 100)\n", "axes[0].set_title(\"Honest: y-axis starts at 0\")\n", "\n", "axes[1].bar(course_means.index, course_means.values, color=\"#EE6677\")\n", "axes[1].set_ylim(65, 78)\n", "axes[1].set_title(\"Misleading: y-axis starts at 65\")\n", "\n", "fig.tight_layout()" ] }, { "cell_type": "markdown", "id": "12", "metadata": {}, "source": [ "
\n", " Common Mistake: Truncating the y-axis

\n", "The two charts above show the exact same three numbers. The right-hand one starts its y-axis at 65 instead of 0, which makes a 6-point difference between courses look like one course scores three times higher than another. Bar charts in particular must start at zero, because the reader judges bar height (and therefore a ratio) automatically. Line charts can sometimes justify a non-zero baseline if it is labelled clearly, bars almost never can.\n", "
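The size of the distortion can be worked out with plain arithmetic: a bar's drawn height is its value minus the axis baseline, so the ratio the eye sees depends on where the axis starts. The sketch below uses two hypothetical course means (68 and 74, chosen only to give a 6-point gap); `bar_height_ratio` is an illustrative name, not a library function.

```python
def bar_height_ratio(low, high, baseline=0.0):
    # Ratio of the two drawn bar heights when the axis starts at `baseline`.
    if baseline >= low:
        raise ValueError('baseline must sit below the smaller value')
    return (high - baseline) / (low - baseline)


# Hypothetical course means, chosen only for illustration.
low_mean, high_mean = 68.0, 74.0

honest = bar_height_ratio(low_mean, high_mean, baseline=0.0)
truncated = bar_height_ratio(low_mean, high_mean, baseline=65.0)

print(f'axis from 0:  taller bar is {honest:.2f}x the shorter one')
print(f'axis from 65: taller bar is {truncated:.2f}x the shorter one')
```

With these numbers the true ratio is about 1.09, while the truncated axis draws one bar three times as tall as the other.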
" ] }, { "cell_type": "code", "execution_count": null, "id": "13", "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(6, 3.5))\n", "\n", "cohort_size = np.array([320, 340, 410, 580])\n", "avg_score = np.array([71, 72, 70, 74])\n", "years = [\"2022\", \"2023\", \"2024\", \"2025\"]\n", "\n", "ax.plot(years, cohort_size, color=\"#4477AA\", marker=\"o\", label=\"Cohort size\")\n", "ax.set_ylabel(\"Cohort size\", color=\"#4477AA\")\n", "\n", "ax2 = ax.twinx()\n", "ax2.plot(years, avg_score, color=\"#EE6677\", marker=\"o\", label=\"Average score\")\n", "ax2.set_ylabel(\"Average score\", color=\"#EE6677\")\n", "\n", "ax.set_title(\"Two independently scaled axes invite a false correlation\");" ] }, { "cell_type": "markdown", "id": "14", "metadata": {}, "source": [ "
\n", " Common Mistake: Dual axes with independent scales

\n", "Cohort size and average score both happen to rise together in this chart, but each axis was scaled independently to make the two lines fit nicely, which is exactly what makes dual axes dangerous: you can almost always pick scales that make two unrelated series appear to move together. If two series genuinely need comparing, plot their percent change from a common baseline on one shared axis instead, or use two separate charts stacked vertically with the same x-axis.\n", "
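The percent-change alternative mentioned above is a one-line transformation. This sketch reuses the cohort and score values from the chart cell; the helper name `pct_change_from_first` is an assumption for illustration.

```python
def pct_change_from_first(values):
    # Express every value as percent change relative to the first one.
    first = values[0]
    return [100.0 * (v - first) / first for v in values]


years = ['2022', '2023', '2024', '2025']
cohort_size = [320, 340, 410, 580]
avg_score = [71, 72, 70, 74]

cohort_pct = pct_change_from_first(cohort_size)
score_pct = pct_change_from_first(avg_score)

for year, c, s in zip(years, cohort_pct, score_pct):
    print(f'{year}: cohort {c:+.1f}%, score {s:+.1f}%')
```

On a shared percent axis the two series no longer look alike: cohort size grows by roughly 81% over the period, while the average score moves by only a few percent.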
" ] }, { "cell_type": "code", "execution_count": null, "id": "15", "metadata": {}, "outputs": [], "source": [ "enrollment = pd.Series(\n", " {\"ML\": 145, \"Data Structures\": 132, \"Statistics\": 98, \"Databases\": 41, \"Networks\": 28, \"Ethics\": 15}\n", ")\n", "\n", "fig, axes = plt.subplots(1, 2, figsize=(10, 4))\n", "\n", "axes[0].pie(enrollment.values, labels=enrollment.index, autopct=\"%1.0f%%\")\n", "axes[0].set_title(\"Pie: hard to rank six similar slices\")\n", "\n", "axes[1].barh(enrollment.index[::-1], enrollment.values[::-1], color=\"#4477AA\")\n", "axes[1].set_title(\"Bar: ranking is immediate\")\n", "\n", "fig.tight_layout()" ] }, { "cell_type": "markdown", "id": "16", "metadata": {}, "source": [ "
\n", " Common Mistake: Too many pie slices

\n", "Six categories is already past where a pie chart works: Ethics and Networks are visibly different bars, but as pie slices they are nearly impossible to rank by eye, exactly the angle-judgment weakness Cleveland and McGill measured. A horizontal bar chart, sorted by value, answers \"which course has the most enrollment\" instantly. Reserve pie charts for two or three categories where one slice is clearly dominant, and prefer a bar chart almost everywhere else.\n", "
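The guideline above can be written down as a rough check. This is only a sketch of a rule of thumb: `pie_is_reasonable` is a hypothetical helper, and the 0.5 threshold for a dominant slice is an assumption rather than an established standard. The enrollment figures are the ones from the chart cell.

```python
def pie_is_reasonable(values, max_slices=3, dominant_share=0.5):
    # Hypothetical rule of thumb: few slices, and one of them clearly dominant.
    # The 0.5 cut-off for a dominant slice is an assumption, not a standard.
    total = sum(values)
    if total <= 0:
        return False
    largest_share = max(values) / total
    return len(values) <= max_slices and largest_share >= dominant_share


enrollment = {'ML': 145, 'Data Structures': 132, 'Statistics': 98,
              'Databases': 41, 'Networks': 28, 'Ethics': 15}

# Sorting by value is what makes the bar chart readable as a ranking.
ranked = sorted(enrollment.items(), key=lambda item: item[1], reverse=True)
total = sum(enrollment.values())
for course, count in ranked:
    print(f'{course:<16} {count:>4}  {100 * count / total:4.1f}%')

print('pie reasonable?', pie_is_reasonable(list(enrollment.values())))
```

For the six-course data the check comes back negative, which matches the advice to use a sorted bar chart here.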
" ] }, { "cell_type": "markdown", "id": "17", "metadata": {}, "source": [ "## 3. When a Table Beats a Chart\n", "\n", "A chart is for spotting a pattern at a glance. A table is for looking up an exact number, or comparing many numbers a reader needs to cite precisely, such as a results appendix someone will quote in a report. This project's `ark.plot.gt_style` module wraps `great_tables` with the same brand colours used everywhere else, so a results table looks like it belongs in the same document as the chart next to it:" ] }, { "cell_type": "code", "execution_count": null, "id": "18", "metadata": {}, "outputs": [], "source": [ "from great_tables import GT, md as gt_md\n", "\n", "from ark.plot.gt_style import themed_gt\n", "\n", "summary = results.groupby(\"course\")[\"exam_score\"].agg(mean=\"mean\", std=\"std\", n=\"count\").round(1).reset_index()\n", "\n", "table = themed_gt(\n", " GT(summary)\n", " .tab_header(title=gt_md(\"**Exam Score Summary**\"), subtitle=\"Mean, spread, and sample size per course\")\n", " .cols_label(course=\"Course\", mean=\"Mean\", std=\"Std. Dev.\", n=\"N\"),\n", " n_rows=len(summary),\n", ")\n", "table" ] }, { "cell_type": "markdown", "id": "19", "metadata": {}, "source": [ "
\n", " Pro Tip: Pair a chart with a table, do not choose between them

\n", "A results section in a report often wants both: the histogram from Part 5 to show the shape of the distribution, and a table like the one above to give the exact numbers someone will want to quote. Neither replaces the other.\n", "
" ] }, { "cell_type": "markdown", "id": "20", "metadata": {}, "source": [ "## 4. From Default to Branded\n", "\n", "Parts 5 and 6 used each library's own defaults on purpose, so the charts in those notebooks teach the mechanics without a styling layer in the way. This project's `ark.plot` module exists precisely so you do not have to restate the same colours, fonts, and spacing on every single chart by hand. Compare a lets-plot chart with and without it:" ] }, { "cell_type": "code", "execution_count": null, "id": "21", "metadata": {}, "outputs": [], "source": [ "from lets_plot import LetsPlot, aes, geom_histogram, gggrid, ggplot, ggsize\n", "\n", "LetsPlot.setup_html()\n", "\n", "default_chart = ggplot(results, aes(x=\"exam_score\", fill=\"course\")) + geom_histogram(bins=20, alpha=0.6)\n", "default_chart + ggsize(450, 300)" ] }, { "cell_type": "code", "execution_count": null, "id": "22", "metadata": {}, "outputs": [], "source": [ "from lets_plot import labs, scale_fill_manual\n", "\n", "from ark.plot.theme import modern_theme, pro_colors\n", "\n", "branded_chart = (\n", " ggplot(results, aes(x=\"exam_score\", fill=\"course\"))\n", " + geom_histogram(bins=20, alpha=0.85)\n", " + scale_fill_manual(values=pro_colors)\n", " + labs(\n", " title=\"Exam score distribution by course\",\n", " x=\"Exam score (0-100)\",\n", " y=\"Number of students\",\n", " fill=\"Course\",\n", " )\n", " + modern_theme(grid=True)\n", ")\n", "branded_chart + ggsize(450, 300)" ] }, { "cell_type": "markdown", "id": "23", "metadata": {}, "source": [ "`modern_theme()` is one more `+` layer, exactly like every `geom_*()` in Part 6. `scale_fill_manual(values=pro_colors)` pins each course to the brand palette, and `labs()` replaces the auto-generated column names with reader-facing text. 
`gggrid()` combines both charts into one output (stacked vertically here, because `ncol=1`), which is the cleanest way to show a before/after comparison in a single cell:" ] }, { "cell_type": "code", "execution_count": null, "id": "24", "metadata": {}, "outputs": [], "source": [ "gggrid([default_chart + ggsize(580, 380), branded_chart + ggsize(580, 380)], ncol=1)" ] }, { "cell_type": "markdown", "id": "25", "metadata": {}, "source": [ "The matplotlib equivalent, `ark.plot.matplot_theme.configure_matplotlib_style()`, works differently: it updates matplotlib's global `rcParams`, so it applies to every chart drawn after you call it, not just one. That is the right trade-off for a notebook or script that draws many charts you want to look consistent, at the cost of not being able to mix styled and unstyled charts in the same session unless you restore `rcParams` in between:" ] }, { "cell_type": "code", "execution_count": null, "id": "26", "metadata": {}, "outputs": [], "source": [ "from ark.plot.matplot_theme import configure_matplotlib_style\n", "\n", "configure_matplotlib_style(font_size=10, fig_width=5)\n", "\n", "fig, ax = plt.subplots()\n", "ax.hist(results[\"exam_score\"], bins=20)\n", "ax.set_xlabel(\"Exam score\")\n", "ax.set_ylabel(\"Number of students\")\n", "ax.set_title(\"Same chart as Part 5, after configure_matplotlib_style()\");" ] }, { "cell_type": "markdown", "id": "27", "metadata": {}, "source": [ "
\n", " Activity 1 - Brand a Boxplot

\n", "\n", "Goal: Rebuild Part 5's seaborn boxplot of exam_score by course, but after calling configure_matplotlib_style() from this section. No other code changes: the same sns.boxplot() call should now pick up the brand colour cycle automatically.\n", "
import seaborn as sns\n",
    "\n",
    "fig, ax = plt.subplots()\n",
    "sns.boxplot(data=results, x=\"course\", y=\"exam_score\", ax=ax)\n",
    "ax.set_title(...)
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "28", "metadata": {}, "outputs": [], "source": [ "import seaborn as sns\n", "\n", "# TODO: boxplot of exam_score by course, styled by the already-active house theme\n", "..." ] }, { "cell_type": "markdown", "id": "29", "metadata": {}, "source": [ "
\n", " Pro Tip: A house theme is a contract, not a one-off setting

\n", "The value of ark.plot is not any single colour choice, it is that every chart in this project, in any notebook, in any report, reaches for the same module instead of redefining its own palette. When the brand colours change, they change in one file.\n", "
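One way to picture a house theme is as a single dictionary of settings that every chart reads from. The sketch below is not the `ark.plot` implementation, which this notebook does not show; `HOUSE_RC` and its values are hypothetical. It uses matplotlib's `plt.rc_context()`, which applies the settings only inside the `with` block, an option worth knowing when a styled and an unstyled chart need to coexist.

```python
import matplotlib.pyplot as plt
from cycler import cycler

# Hypothetical house settings, kept in one place so a rebrand is a one-file change.
HOUSE_RC = {
    'axes.spines.top': False,
    'axes.spines.right': False,
    'axes.prop_cycle': cycler(color=['#0369A1', '#009E73', '#D97706']),
    'font.size': 10,
}

default_top_spine = plt.rcParams['axes.spines.top']

with plt.rc_context(HOUSE_RC):
    # Charts created inside the block pick up the house settings.
    inside_top_spine = plt.rcParams['axes.spines.top']
    fig, ax = plt.subplots(figsize=(5, 3))
    ax.plot([1, 2, 3], [2, 4, 3])

# Outside the block the previous rcParams are back in force.
restored_top_spine = plt.rcParams['axes.spines.top']
```

A global call such as `configure_matplotlib_style()` suits a notebook where everything should match; a scoped block like this one is a reasonable alternative for a one-off before/after comparison.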
" ] }, { "cell_type": "markdown", "id": "30", "metadata": {}, "source": [ "## 5. Preattentive Attributes: What the Brain Sees First\n", "\n", "Before you consciously read a chart, your visual system has already processed certain features. These are called **preattentive attributes**: colour hue, luminance, shape, orientation, size, position, and enclosure. They register in under 250 milliseconds, before attention is directed.\n", "\n", "The design implication is simple but easy to violate: **use at most one preattentive attribute for emphasis per chart**. Two or more compete for attention and cancel each other out." ] }, { "cell_type": "markdown", "id": "31", "metadata": {}, "source": [ "
\n", " Key Concept: Preattentive Attributes

\n", "Visual properties processed before conscious attention: colour hue, luminance, shape, orientation, size, position, enclosure, and (in animated displays) motion. The practical rule: encode at most one categorical dimension preattentively, and make sure it carries information, not decoration.\n", "

\n", "Reference: Ware, C. (2012). Information Visualization: Perception for Design, 3rd ed.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "32", "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "\n", "df7 = pd.read_csv(\"data/university_analytics.csv\")\n", "course_means = df7.groupby(\"course\")[\"final_score\"].mean().sort_values()\n", "\n", "# Left: every bar the same colour — no preattentive signal\n", "# Right: one bar highlighted with colour — attention goes there instantly\n", "TARGET = \"Machine Learning\"\n", "\n", "fig, axes = plt.subplots(1, 2, figsize=(11, 3.8))\n", "\n", "for ax, use_highlight in zip(axes, [False, True], strict=False):\n", " colors = (\n", " [\"#009E73\" if c == TARGET else \"#ADB5BD\" for c in course_means.index]\n", " if use_highlight\n", " else [\"#ADB5BD\"] * len(course_means)\n", " )\n", " ax.barh(course_means.index, course_means.values, color=colors)\n", " ax.set_xlabel(\"Mean final score\")\n", " ax.spines[[\"top\", \"right\"]].set_visible(False)\n", " ax.set_xlim(0, 100)\n", " title = \"No preattentive signal\" if not use_highlight else f\"Colour highlights {TARGET!r}\"\n", " ax.set_title(title, fontsize=11, fontweight=\"bold\", color=\"#1E293B\")\n", "\n", "fig.suptitle(\"Same data — preattentive colour changes where the eye goes first\", fontsize=10, color=\"#6B7280\")\n", "fig.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "33", "metadata": {}, "source": [ "
\n", " Activity 5 - Highlight by Region

\n", "\n", "Goal: Build a horizontal bar chart of mean midterm_score by region.\n", "Highlight the region with the highest mean in #0369A1 and colour every other bar #ADB5BD.\n", "Add a concise title stating which region leads.\n", "
region_means = df7.groupby(\"region\")[\"midterm_score\"].mean().sort_values()\n",
    "# ... build the chart with one highlighted bar ...
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "34", "metadata": {}, "outputs": [], "source": [ "# TODO: highlight the top-scoring region in the midterm score chart\n", "..." ] }, { "cell_type": "markdown", "id": "35", "metadata": {}, "source": [ "## 6. Annotation as Narrative\n", "\n", "A chart without annotation forces viewers to form their own interpretation. Annotation is where the **\"so what?\"** lives — the one sentence that turns a picture into an argument.\n", "\n", "Effective annotation does three things:\n", "1. Points at the specific data feature that supports the message (arrow or callout).\n", "2. States that feature in plain language, not chart jargon.\n", "3. Uses restraint: one annotation per message, not one per data point." ] }, { "cell_type": "markdown", "id": "36", "metadata": {}, "source": [ "
\n", " Key Concept: Annotation as the \"So What?\"

\n", "Knaflic's rule (2015): \"the most important single thing you can do to improve your graph is to add a descriptive title.\"\n", "Add arrows and labels that direct attention to the exact point you want the viewer to remember.\n", "A good annotation answers: what should I notice here, and why does it matter?\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "37", "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(1, 2, figsize=(12, 4.2))\n", "\n", "# ── Left: unannotated — what's the takeaway? ──────────────────────────────────\n", "program_means = df7.groupby(\"program\")[\"final_score\"].mean().sort_values()\n", "\n", "for ax in axes:\n", " bars = ax.bar(program_means.index, program_means.values, color=[\"#ADB5BD\"] * len(program_means))\n", " ax.set_ylabel(\"Mean final score\")\n", " ax.spines[[\"top\", \"right\"]].set_visible(False)\n", " ax.set_ylim(0, 85)\n", " ax.tick_params(axis=\"x\", rotation=15)\n", "\n", "axes[0].set_title(\"Without annotation\\n(viewer guesses the message)\", fontsize=10, fontweight=\"bold\")\n", "\n", "# ── Right: annotated — narrative is explicit ──────────────────────────────────\n", "best_idx = list(program_means.index).index(program_means.idxmax())\n", "axes[1].patches[best_idx].set_facecolor(\"#0369A1\")\n", "\n", "axes[1].annotate(\n", " f\"Data Science leads\\nat {program_means.max():.1f}\",\n", " xy=(best_idx, program_means.max()),\n", " xytext=(best_idx + 0.6, program_means.max() - 4),\n", " fontsize=9,\n", " color=\"#0369A1\",\n", " fontweight=\"bold\",\n", " arrowprops={\"arrowstyle\": \"->\", \"color\": \"#0369A1\", \"lw\": 1.4},\n", " ha=\"left\",\n", ")\n", "axes[1].set_title(\"With annotation\\n(message is explicit)\", fontsize=10, fontweight=\"bold\")\n", "\n", "fig.suptitle(\"Annotation turns a picture into an argument\", fontsize=10, color=\"#6B7280\")\n", "fig.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "38", "metadata": {}, "source": [ "
\n", " Activity 6 - Annotate a Trend

\n", "\n", "Goal: Plot mean final_score by semester as a line chart.\n", "Add ax.annotate() to flag the semester with the highest mean and write a one-sentence insight as the chart title.\n", "
sem_means = df7.groupby(\"semester\")[\"final_score\"].mean()\n",
    "# ... line chart + annotation ...
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "39", "metadata": {}, "outputs": [], "source": [ "# TODO: line chart of final_score by semester with an annotation on the peak semester\n", "..." ] }, { "cell_type": "markdown", "id": "40", "metadata": {}, "source": [ "## 7. The Datasaurus Dozen: Always Visualize\n", "\n", "In 1973 Francis Anscombe constructed four datasets with nearly identical summary statistics — same mean, variance, and correlation — but radically different visual shapes. In 2017, Matejka and Fitzmaurice extended this to thirteen datasets, including a dinosaur, which gave the collection its name: the **Datasaurus Dozen**.\n", "\n", "The lesson is not subtle: **summary statistics alone hide the shape of your data**. A mean and a standard deviation cannot distinguish a normal distribution from a bimodal one, a dinosaur, or a circle." ] }, { "cell_type": "markdown", "id": "41", "metadata": {}, "source": [ "
\n", " Key Concept: Datasaurus Dozen

\n", "Matejka, J. & Fitzmaurice, G. (2017). Same stats, different graphs. Proceedings of CHI 2017.\n", "The original Anscombe's Quartet dates from 1973. Both show the same principle: visualize before summarizing.\n", "A modern corollary for ML: never report only RMSE without looking at a residual plot.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "42", "metadata": {}, "outputs": [], "source": [ "# Demonstrate \"same stats, different shapes\" with two student subsets\n", "# that have matching means and standard deviations but different distributions.\n", "\n", "rng = np.random.default_rng(7)\n", "\n", "# Group A: normally distributed scores (typical class)\n", "n = 200\n", "group_a = pd.DataFrame(\n", " {\n", " \"group\": \"Group A (normal)\",\n", " \"midterm\": rng.normal(68, 12, n).clip(20, 100),\n", " \"final\": rng.normal(70, 12, n).clip(20, 100),\n", " }\n", ")\n", "\n", "# Group B: bimodal — two subpopulations merged (e.g. two campuses with different resources)\n", "group_b = pd.DataFrame(\n", " {\n", " \"group\": \"Group B (bimodal)\",\n", " \"midterm\": np.concatenate([rng.normal(55, 6, n // 2), rng.normal(82, 6, n // 2)]),\n", " \"final\": np.concatenate([rng.normal(57, 6, n // 2), rng.normal(84, 6, n // 2)]),\n", " }\n", ")\n", "\n", "combined = pd.concat([group_a, group_b], ignore_index=True)\n", "\n", "# Summary stats — almost identical\n", "print(\"Summary statistics:\")\n", "print(combined.groupby(\"group\")[[\"midterm\", \"final\"]].describe().round(1).to_string())\n", "print()\n", "\n", "fig, axes = plt.subplots(1, 2, figsize=(11, 4), sharey=False)\n", "\n", "for ax, (name, grp) in zip(axes, combined.groupby(\"group\"), strict=False):\n", " ax.scatter(grp[\"midterm\"], grp[\"final\"], alpha=0.4, s=18, color=\"#0369A1\")\n", " ax.set_xlabel(\"Midterm score\")\n", " ax.set_ylabel(\"Final score\")\n", " ax.set_title(name, fontweight=\"bold\")\n", " ax.spines[[\"top\", \"right\"]].set_visible(False)\n", " # Annotate means\n", " ax.axvline(grp[\"midterm\"].mean(), color=\"#D97706\", lw=1.2, ls=\"--\", label=f\"mean={grp['midterm'].mean():.1f}\")\n", " ax.axhline(grp[\"final\"].mean(), color=\"#D97706\", lw=1.2, ls=\"--\")\n", " ax.legend(fontsize=8)\n", "\n", "fig.suptitle(\"Same means and standard deviations — completely different 
structures\", fontsize=10, color=\"#6B7280\")\n", "fig.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "43", "metadata": {}, "source": [ "
\n", " Common Mistake: Reporting only mean ± std for a bimodal distribution

\n", "Group B above has nearly the same mean as Group A and a standard deviation of similar size (roughly 15 versus 12). A table of means and standard deviations alone would give little hint that the two groups differ.\n", "The scatter plot reveals that Group B is actually two separate subpopulations, a finding with direct implications for intervention.\n", "

\n", "Rule: for any continuous variable you care about, always plot a distribution (histogram, KDE, strip plot, or box plot) before reporting a single number.\n", "
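The rule can be demonstrated without any plotting library. The sketch below builds two small constructed samples (illustrative values, not drawn from the course dataset) whose mean and population standard deviation match exactly, then prints a text histogram of each; `text_histogram` is a hypothetical helper.

```python
from collections import Counter
from statistics import mean, pstdev

# Two constructed samples (illustrative, not from the course dataset).
# Both have mean 70 and population standard deviation 10.
unimodal = [50] * 5 + [60] * 20 + [70] * 30 + [80] * 20 + [90] * 5
bimodal = [60] * 40 + [80] * 40


def text_histogram(values):
    # One row per distinct value, one '#' per two observations.
    counts = Counter(values)
    return [f'{value:>3} ' + '#' * (counts[value] // 2) for value in sorted(counts)]


for name, sample in [('unimodal', unimodal), ('bimodal', bimodal)]:
    print(f'{name}: mean={mean(sample):.1f}, std={pstdev(sample):.1f}')
    for row in text_histogram(sample):
        print('  ' + row)
```

The two summary lines are identical, yet the histograms show one sample peaking in the middle and the other with nothing in the middle at all.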
" ] }, { "cell_type": "markdown", "id": "44", "metadata": {}, "source": [ "
\n", " Activity 7 - Unmask the Hidden Structure

\n", "\n", "Goal: From university_analytics.csv, compare the distribution of final_score for two programs side-by-side using overlapping KDE plots (or box plots).\n", "Before plotting, print the mean and standard deviation for each program. Then reflect: would the statistics alone have revealed the difference?\n", "
import seaborn as sns\n",
    "# ... print stats, then plot KDE for each program ...
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "45", "metadata": {}, "outputs": [], "source": [ "# TODO: compare final_score distributions for two programs: print stats, then KDE plot\n", "..." ] }, { "cell_type": "markdown", "id": "46", "metadata": {}, "source": [ "## Capstone: Rescue a Misleading Chart\n", "\n", "Below is a chart with three problems from this notebook stacked together: a truncated y-axis, no axis labels, and matplotlib's unbranded defaults. Fix all three, then add one caption sentence stating the chart's single message." ] }, { "cell_type": "code", "execution_count": null, "id": "47", "metadata": {}, "outputs": [], "source": [ "# The chart to rescue. Run this first to see what is wrong with it.\n", "fig, ax = plt.subplots(figsize=(5, 3))\n", "ax.bar(course_means.index, course_means.values, color=[\"#4477AA\", \"#EE6677\", \"#228833\"])\n", "ax.set_ylim(65, 78)\n", "fig.tight_layout()" ] }, { "cell_type": "markdown", "id": "48", "metadata": {}, "source": [ "
\n", " Capstone Exercise - Fix the Chart

\n", "\n", "Goal:\n", "
    \n", "
  1. Fix the y-axis to start at 0\n", "
  2. Add x-axis and y-axis labels\n", "
  3. Apply the house style with configure_matplotlib_style()\n", "
  4. Add one print statement with a one-sentence caption stating the chart's single message\n", "
\n", "Hint: configure_matplotlib_style() only affects charts drawn after it runs, so call it before plt.subplots(), not after.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "49", "metadata": {}, "outputs": [], "source": [ "# TODO: rescue the chart\n", "...\n", "\n", "print(\"Caption: ...\")" ] }, { "cell_type": "markdown", "id": "50", "metadata": {}, "source": [ "## Further Reading\n", "\n", "| Resource | Why it matters |\n", "|---|---|\n", "| Knaflic, C.N. (2015). *Storytelling with Data*. Wiley. | The most practical data visualisation book for analysts; the annotated chart examples in Chapter 5 are worth the price alone |\n", "| Wilke, C.O. (2019). *Fundamentals of Data Visualization*. O'Reilly. | Free at [clauswilke.com/dataviz](https://clauswilke.com/dataviz) — Chapter 17 on \"proportional ink\" and Chapter 18 on chart junk directly support Sections 1–2 of this notebook |\n", "| Schwabish, J. (2021). *Better Data Visualizations*. Columbia University Press. | Covers annotation, layout, and the \"sorta spaghetti\" problem; strong on the gap between what analysts produce and what executives read |\n", "| Cleveland, W.S. & McGill, R. (1984). [Graphical perception](https://doi.org/10.2307/2288400). *Journal of the American Statistical Association* 79(387), 531–554. | The empirical study that established the encoding hierarchy used in Section 1 |\n", "| Matejka, J. & Fitzmaurice, G. (2017). [Same stats, different graphs](https://dl.acm.org/doi/10.1145/3025453.3025912). *CHI 2017*. 
| The Datasaurus Dozen paper — free PDF on the ACM DL |\n" ] }, { "cell_type": "markdown", "id": "51", "metadata": {}, "source": [ "## Summary\n", "\n", "| Concept | Key rule |\n", "|---|---|\n", "| Chart selection | Let the question pick the chart; position beats angle beats area for accurate reading |\n", "| Data-ink ratio | Every gridline, border, and redundant legend competes with your data |\n", "| One message per chart | A third colour dimension or a second y-axis doubles the cognitive load |\n", "| Table vs chart | Use a table when the reader needs exact numbers; use a chart when the pattern matters |\n", "| House style | Apply the house theme once and leave it — consistency is itself a signal of quality |\n", "| Preattentive attributes | Encode at most one categorical dimension preattentively; two or more cancel out |\n", "| Annotation as narrative | The title and annotation carry the \"so what?\" — write them before you design the chart |\n", "| Datasaurus | Always visualise before reporting summary statistics — same stats, different shapes |" ] } ], "metadata": { "kernelspec": { "display_name": "ark (3.12.12.final.0)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.12" } }, "nbformat": 4, "nbformat_minor": 5 }