{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "---\n", "title: \"Part 7: Data Storytelling and House Style\"\n", "---" ] }, { "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "[Open in Colab](https://colab.research.google.com/github/sambaiga/ds-mlops-path/blob/main/tutorials/01-python-basics/07-data-storytelling.ipynb) [Download notebook](https://raw.githubusercontent.com/sambaiga/ds-mlops-path/main/tutorials/01-python-basics/07-data-storytelling.ipynb)" ] }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "**DS-MLOps Python Foundations**\n", "\n", "**Python 3.12+ | Author: Anthony Faustine**\n", "\n", "## Before you begin\n", "\n", "This notebook assumes you have completed Part 5 (`05-matplotlib.ipynb`) and Part 6 (`06-lets-plot.ipynb`). Parts 5 and 6 taught you how to make a chart at all. This part is about taste: which chart to make, what to leave out of it, and how to brand it once it is ready to show someone other than yourself, using this project's own `ark.plot` module.\n", "\n", "::: {.callout-note collapse=\"true\" icon=false}\n", "## Topics covered\n", "\n", "| Topic | Why it matters |\n", "|---|---|\n", "| **Chart selection** | The question decides the chart, not the data type |\n", "| **Data-ink ratio** | Every pixel that is not data is a tax on the reader's attention |\n", "| **Common chart lies** | The same data, drawn to mislead instead of inform |\n", "| **Tables vs. charts** | Exact lookup wants a table; pattern recognition wants a chart |\n", "| **House style** | Replacing per-chart styling with one reusable theme |\n", ":::\n", "\n", "> Callout markers used throughout this notebook are explained on the [book cover page](../../index.qmd#callout-guide)." 
] }, { "cell_type": "markdown", "id": "3", "metadata": {}, "source": [ "::: {.callout-note collapse=\"true\" icon=false}\n", "## Learning Objectives\n", "\n", "By the end of Part 7 you will be able to:\n", "\n", "| # | Skill | Covered in |\n", "|---|---|---|\n", "| 1 | Choose a chart type based on the question, not habit | Sec. 1 |\n", "| 2 | Recognise and avoid the most common ways charts mislead | Sec. 2 |\n", "| 3 | Decide when a table communicates better than a chart | Sec. 3 |\n", "| 4 | Apply `ark.plot`'s house theme to matplotlib and lets-plot charts | Sec. 4 |\n", "| 5 | Identify preattentive attributes and use them deliberately | Sec. 5 |\n", "| 6 | Add annotations that carry the chart's narrative | Sec. 6 |\n", "| 7 | Explain why summary statistics alone are never sufficient (Datasaurus) | Sec. 7 |\n", "| 8 | Produce one polished, captioned chart from a messy dataset | Capstone |\n", ":::\n" ] }, { "cell_type": "markdown", "id": "4", "metadata": {}, "source": [ "## 1. Picking the Right Chart, and Cutting What Does Not Help\n", "\n", "The most common visualisation mistake is not a styling problem, it is a **selection** problem: building a chart type out of habit instead of asking what question it needs to answer. A pie chart cannot answer \"which course improved the most,\" because comparing angles is something people are measurably worse at than comparing positions along a shared axis. That is not opinion, it is the finding of a well-known study by Cleveland and McGill (1984) ranking how accurately people read different visual encodings: position along a common scale first, then length, then angle, then area, with colour saturation last." 
] }, { "cell_type": "code", "execution_count": null, "id": "5", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "# Load the shared university dataset used across all Part 1 and Part 2 notebooks\n", "df7 = pd.read_csv(\"data/university_analytics.csv\")\n", "\n", "# `results` keeps the course/semester/exam_score structure that cells below depend on\n", "results = df7[[\"course\", \"semester\", \"final_score\"]].copy()\n", "results = results.rename(columns={\"final_score\": \"exam_score\"})\n", "results.head()" ] }, { "cell_type": "markdown", "id": "6", "metadata": {}, "source": [ "**Data-ink ratio**, a term from Edward Tufte's *The Visual Display of Quantitative Information*, is the proportion of a chart's ink that represents actual data, as opposed to gridlines, borders, redundant legends, and other decoration. Maximising it does not mean a bare chart, it means every mark earns its place. Compare matplotlib's defaults to a version with the same data and nothing extra:" ] }, { "cell_type": "code", "execution_count": null, "id": "7", "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "course_means = results.groupby(\"course\")[\"exam_score\"].mean()\n", "\n", "fig, axes = plt.subplots(1, 2, figsize=(10, 3.5))\n", "\n", "# Left: matplotlib defaults. Box on all four sides, both gridlines, and\n", "# a separate colour per bar, which is redundant since each bar has its\n", "# own label directly underneath it.\n", "axes[0].bar(course_means.index, course_means.values, color=[\"#4477AA\", \"#EE6677\", \"#228833\"])\n", "axes[0].set_title(\"Before: matplotlib defaults\")\n", "axes[0].grid(True)\n", "\n", "# Right: same data, decluttered. 
No top/right spines, no gridlines, the\n", "# value printed directly on each bar instead of relying on the y-axis.\n", "axes[1].bar(course_means.index, course_means.values, color=\"#4477AA\")\n", "axes[1].spines[\"top\"].set_visible(False)\n", "axes[1].spines[\"right\"].set_visible(False)\n", "axes[1].set_yticks([])\n", "for i, value in enumerate(course_means.values):\n", " axes[1].text(i, value + 1, f\"{value:.0f}\", ha=\"center\")\n", "axes[1].set_title(\"After: higher data-ink ratio\")\n", "\n", "fig.tight_layout()" ] }, { "cell_type": "markdown", "id": "8", "metadata": {}, "source": [ "
exam_score by course, but after calling configure_matplotlib_style() from this section. No other code changes: the same sns.boxplot() call should now pick up the brand colour cycle automatically.\n",
"import seaborn as sns\n",
"\n",
"fig, ax = plt.subplots()\n",
"sns.boxplot(data=results, x=\"course\", y=\"exam_score\", ax=ax)\n",
"ax.set_title(...)\n",
"ark.plot is not any single colour choice, it is that every chart in this project, in any notebook, in any report, reaches for the same module instead of redefining its own palette. When the brand colours change, they change in one file.\n",
"midterm_score by region.\n",
"Highlight the region with the highest mean in #0369A1 and colour every other bar #ADB5BD.\n",
"Add a concise title stating which region leads.\n",
"region_means = df7.groupby(\"region\")[\"midterm_score\"].mean().sort_values()\n",
"# ... build the chart with one highlighted bar ...\n",
"final_score by semester as a line chart.\n",
"Add ax.annotate() to flag the semester with the highest mean and write a one-sentence insight as the chart title.\n",
"sem_means = df7.groupby(\"semester\")[\"final_score\"].mean()\n",
"# ... line chart + annotation ...\n",
"university_analytics.csv, compare the distribution of final_score for two programs side-by-side using overlapping KDE plots (or box plots).\n",
"Before plotting, print the mean and standard deviation for each program. Then reflect: would the statistics alone have revealed the difference?\n",
"import seaborn as sns\n",
"# ... print stats, then plot KDE for each program ...\n",
configure_matplotlib_style() only affects charts drawn after it runs, so call it before plt.subplots(), not after.\n",
"