{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "---\n", "title: \"Part 5: Matplotlib and Its Ecosystem\"\n", "---" ] }, { "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sambaiga/ds-mlops-path/blob/main/tutorials/01-python-basics/05-matplotlib.ipynb) [![Download Notebook](https://img.shields.io/badge/Download-Notebook-blue.svg?logo=jupyter&logoColor=white)](https://raw.githubusercontent.com/sambaiga/ds-mlops-path/main/tutorials/01-python-basics/05-matplotlib.ipynb)" ] }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "**DS-MLOps Python Foundations**\n", "\n", "**Python 3.12+ | Author: Anthony Faustine**\n", "\n", "## Before you begin\n", "\n", "This notebook assumes you have completed Parts 1-4 (`01-python-core.ipynb`, `02-control-flow.ipynb`, `03-python-patterns.ipynb`, `04-numpy.ipynb`). If you have not, start there.\n", "\n", "Part 5 covers the two layers of Python's most established plotting stack: **matplotlib**, the library almost everything else in the ecosystem is built on, and **seaborn**, which sits on top of it for statistical graphics. Both work on the same **university analytics platform** scenario from earlier parts, extended here to student exam results across several courses and two semesters, since a believable visualisation needs more than one number to plot.\n", "\n", "Part 6 (`06-lets-plot.ipynb`) introduces a different way of thinking about plots entirely: the grammar of graphics. 
Part 7 (`07-data-storytelling.ipynb`) covers what makes a chart good and applies this project's own house style to both libraries.\n", "\n", "::: {.callout-note collapse=\"true\" icon=false}\n", "## Topics covered\n", "\n", "| Topic | Why it matters |\n", "|---|---|\n", "| **Figure and Axes** | The object model every matplotlib call eventually goes through |\n", "| **Core chart types** | Line, scatter, bar, histogram: the four you will use constantly |\n", "| **Multiple Axes** | Comparing several views of the same data side by side |\n", "| **Saving figures** | Resolution and format choices that matter for reports and papers |\n", "| **Seaborn** | One line of statistical graphics, still a matplotlib `Axes` underneath |\n", ":::\n", "\n", "> Callout markers used throughout this notebook are explained on the [book cover page](../../index.qmd#callout-guide)." ] }, { "cell_type": "markdown", "id": "3", "metadata": {}, "source": [ "::: {.callout-note collapse=\"true\" icon=false}\n", "## Learning Objectives\n", "\n", "By the end of Part 5 you will be able to:\n", "\n", "| # | Skill | Covered in |\n", "|---|---|---|\n", "| 1 | Explain the Figure/Axes object model and why it matters | Sec. 1 |\n", "| 2 | Build line, scatter, bar, and histogram charts with the object-oriented API | Sec. 2 |\n", "| 3 | Lay out and compare multiple Axes in one Figure | Sec. 3 |\n", "| 4 | Save a figure at the right resolution and format for its destination | Sec. 3 |\n", "| 5 | Use seaborn for one-line statistical graphics, then keep customising with matplotlib | Sec. 4 |\n", ":::\n" ] }, { "cell_type": "markdown", "id": "4", "metadata": {}, "source": [ "## 0. Python's Plotting Landscape\n", "\n", "Picture this: you have just finished cleaning `university_analytics.csv`. The head looks right, the dtypes are correct, the nulls are gone. Then your manager asks: *\"What does the score distribution actually look like?\"* You could print quartiles. You could sort and read through 2,400 rows. 
Or you could produce one chart that answers the question before the sentence is finished.\n", "\n", "That chart needs a library. Python has several, and they make different trade-offs:\n", "\n", "| Library | Style | Strengths | Best for |\n", "| --- | --- | --- | --- |\n", "| **Matplotlib** ([matplotlib.org](https://matplotlib.org)) | Imperative (OO API) | Total control, publication quality, runs everywhere | Custom figures, fine-grained layout, saving to PNG/PDF/SVG |\n", "| **Seaborn** ([seaborn.pydata.org](https://seaborn.pydata.org)) | High-level on top of Matplotlib | Statistical plots in one line, beautiful defaults | Distribution plots, pair plots, heatmaps |\n", "| **Lets-Plot** ([lets-plot.org](https://lets-plot.org)) | Declarative, Grammar of Graphics | Expressive, ggplot2-compatible, interactive-ready | Layered charts, Part 6 of this book |\n", "| **Plotly** ([plotly.com/python](https://plotly.com/python)) | Declarative, interactive | Hover, zoom, Dash integration | Dashboards, interactive reports |\n", "| **Bokeh** ([docs.bokeh.org](https://docs.bokeh.org)) | Declarative, interactive | Streaming, large data | Real-time visualisations |\n", "\n", "This chapter focuses on Matplotlib and Seaborn. Seaborn is built directly on Matplotlib, so everything you learn about one applies to the other. Lets-Plot, Plotly, and Bokeh render with their own engines, but the concepts Matplotlib makes explicit (a canvas, axes, scales, legends) carry over to all of them, which is why Matplotlib is the best place to start.\n", "\n", "### Already in your environment\n", "\n", "Both libraries are in `pyproject.toml`. For a standalone project:\n", "\n", "```bash\n", "uv add matplotlib seaborn\n", "```" ] }, { "cell_type": "markdown", "id": "5", "metadata": {}, "source": [ "## 1. Why Visualise? The Figure and Axes\n", "\n", "A table of a thousand exam scores tells you nothing at a glance. A histogram of the same thousand scores tells you the shape of the distribution in about half a second. 
That is the entire case for visualisation: it trades a small amount of precision for a large amount of immediate understanding, which is exactly what you want before you have decided which question to ask next." ] }, { "cell_type": "code", "execution_count": null, "id": "6", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "rng = np.random.default_rng(7)\n", "exam_scores = rng.normal(loc=72, scale=12, size=1000).clip(0, 100)\n", "\n", "print(f\"mean : {exam_scores.mean():.1f}\")\n", "print(f\"median : {np.median(exam_scores):.1f}\")\n", "print(f\"std : {exam_scores.std():.1f}\")\n", "# The numbers alone do not tell you whether the distribution is symmetric,\n", "# has a long tail, or is bimodal. A histogram answers that in one look." ] }, { "cell_type": "markdown", "id": "7", "metadata": {}, "source": [ "matplotlib has two APIs for building the same chart. The older one, `pyplot` (`plt.plot(...)`), is a state machine: it always draws onto \"whichever figure was most recently touched,\" which is fine for a single quick chart and confusing the moment you need two charts side by side. The **object-oriented API** is explicit instead: you ask for a `Figure` (the whole canvas) and one or more `Axes` (an individual plot inside it), then call methods directly on the `Axes` you want to draw on." ] }, { "cell_type": "code", "execution_count": null, "id": "8", "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "# The object-oriented pattern you will use for almost every chart in this\n", "# notebook: ask for a Figure and an Axes, then call methods on the Axes.\n", "fig, ax = plt.subplots(figsize=(5, 3))\n", "ax.hist(exam_scores, bins=20, color=\"#4477AA\", edgecolor=\"white\")\n", "ax.set_xlabel(\"Exam score\")\n", "ax.set_ylabel(\"Number of students\")\n", "ax.set_title(\"Exam score distribution\");" ] }, { "cell_type": "markdown", "id": "9", "metadata": {}, "source": [ "
\n", " Key Concept: Figure vs. Axes

\n", "A Figure is the whole canvas: the window or page a chart is drawn on, and the thing you save to a file. An Axes is one plot inside that canvas, with its own x-axis, y-axis, title, and data. fig, ax = plt.subplots() gives you one of each. Every method that actually draws data (.plot(), .scatter(), .bar(), .hist()) lives on the Axes, not the Figure.\n", "
" ] }, { "cell_type": "markdown", "id": "10", "metadata": {}, "source": [ "The dashed boundary below is the one thing in this diagram that has no real line of code behind it: it is just there to make the Figure's own edge visible, since otherwise it is easy to forget it exists at all." ] }, { "cell_type": "code", "execution_count": null, "id": "11", "metadata": {}, "outputs": [], "source": [ "from ark.plot.diagrams import figure_axes_diagram\n", "\n", "figure_axes_diagram();" ] }, { "cell_type": "markdown", "id": "12", "metadata": {}, "source": [ "
\n", " Common Mistake: Mixing plt.plot() and ax.plot()

\n", "plt.title(\"x\") sets the title of whichever Axes pyplot thinks is \"current\", which silently changes after you create a new subplot. The moment you have more than one Axes, calling plt.xlabel() instead of ax.set_xlabel() is a common way to label the wrong chart. Once you have an ax object, call methods on it directly and skip plt.* entirely.\n", "
" ] }, { "cell_type": "markdown", "id": "13", "metadata": {}, "source": [ "## 2. Core Chart Types\n", "\n", "Four chart types cover most of what you will plot day to day. Each answers a different question, and picking the wrong one is the fastest way to make a chart that looks fine but says nothing useful." ] }, { "cell_type": "code", "execution_count": null, "id": "14", "metadata": {}, "outputs": [], "source": [ "# Build the running dataset: exam results across three courses and two\n", "# semesters for the university analytics platform.\n", "rng = np.random.default_rng(42)\n", "\n", "courses = np.array([\"Machine Learning\", \"Data Structures\", \"Statistics\"])\n", "semesters = np.array([\"Fall 2024\", \"Spring 2025\"])\n", "\n", "n_per_group = 60\n", "course_col = np.repeat(courses, n_per_group * len(semesters))\n", "semester_col = np.tile(np.repeat(semesters, n_per_group), len(courses))\n", "\n", "# Each course has a slightly different difficulty and improves slightly\n", "# from Fall to Spring, to give the line chart in this section something\n", "# real to show.\n", "course_base = {\"Machine Learning\": 68, \"Data Structures\": 74, \"Statistics\": 71}\n", "semester_bump = {\"Fall 2024\": 0, \"Spring 2025\": 4}\n", "\n", "exam_score = np.array(\n", " [rng.normal(course_base[c] + semester_bump[s], 10) for c, s in zip(course_col, semester_col, strict=True)]\n", ").clip(0, 100)\n", "study_hours = rng.uniform(0, 25, size=len(course_col))\n", "attendance_pct = rng.uniform(50, 100, size=len(course_col))\n", "\n", "print(f\"rows: {len(course_col)}\")\n", "print(f\"courses: {courses}\")" ] }, { "cell_type": "markdown", "id": "15", "metadata": {}, "source": [ "**Line chart**, for a trend across an ordered sequence. 
Here, average score per course from Fall to Spring:" ] }, { "cell_type": "code", "execution_count": null, "id": "16", "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(5, 3))\n", "\n", "for course in courses:\n", " course_mask = course_col == course\n", " means = [exam_score[course_mask & (semester_col == s)].mean() for s in semesters]\n", " ax.plot(semesters, means, marker=\"o\", label=course)\n", "\n", "ax.set_ylabel(\"Average exam score\")\n", "ax.set_title(\"Average score by course, Fall to Spring\")\n", "ax.legend();" ] }, { "cell_type": "markdown", "id": "17", "metadata": {}, "source": [ "**Scatter plot**, for the relationship between two continuous variables. Here, study hours against exam score:" ] }, { "cell_type": "code", "execution_count": null, "id": "18", "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(5, 3))\n", "ax.scatter(study_hours, exam_score, alpha=0.4, color=\"#4477AA\")\n", "ax.set_xlabel(\"Study hours\")\n", "ax.set_ylabel(\"Exam score\")\n", "ax.set_title(\"Study hours vs. exam score\");" ] }, { "cell_type": "markdown", "id": "19", "metadata": {}, "source": [ "**Bar chart**, for comparing a single number across categories. Here, average score per course:" ] }, { "cell_type": "code", "execution_count": null, "id": "20", "metadata": {}, "outputs": [], "source": [ "course_means = [exam_score[course_col == c].mean() for c in courses]\n", "\n", "fig, ax = plt.subplots(figsize=(5, 3))\n", "ax.bar(courses, course_means, color=[\"#4477AA\", \"#EE6677\", \"#228833\"])\n", "ax.set_ylabel(\"Average exam score\")\n", "ax.set_title(\"Average score by course\")\n", "ax.tick_params(axis=\"x\", labelrotation=15);" ] }, { "cell_type": "markdown", "id": "21", "metadata": {}, "source": [ "
\n", " Pro Tip: Use ax.bar_label() to annotate bar values automatically

\n", "Adding the numeric value above each bar used to mean a manual loop calling ax.text() for each rectangle. Since matplotlib 3.4, ax.bar_label(container) does it in one line. ax.bar() returns a BarContainer; pass it to bar_label and optionally format the numbers with fmt:\n", "\n", "
bars = ax.bar(courses, course_means, color=[\"#4477AA\", \"#EE6677\", \"#228833\"])\n",
    "ax.bar_label(bars, fmt=\"{:.1f}\", padding=3)
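\n", "\n", "fmt takes a format string that is applied to each value; the {}-style string shown here requires matplotlib 3.7 or newer, while versions 3.4 to 3.6 accept only %-style strings such as %.1f. To supply the text yourself, pass an explicit list through the separate labels keyword instead. padding pushes the text a few points away from the bar top.\n", "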
" ] }, { "cell_type": "markdown", "id": "22", "metadata": {}, "source": [ "
\n", " Activity 1 - Attendance Histogram

\n", "\n", "Goal: Plot a histogram of attendance_pct with 15 bins, label both axes, and give it a title. Use the object-oriented pattern: fig, ax = plt.subplots(), then call methods on ax.\n", "
fig, ax = plt.subplots(figsize=(5, 3))\n",
    "ax.hist(attendance_pct, bins=15, ...)\n",
    "# expect a roughly uniform spread between 50 and 100, since\n",
    "# attendance_pct was generated with rng.uniform(50, 100, ...)
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "23", "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(5, 3))\n", "# TODO: plot the histogram, then set xlabel, ylabel, and title\n", "...\n", "\n", "fig" ] }, { "cell_type": "markdown", "id": "24", "metadata": {}, "source": [ "
\n", " Pro Tip: A trailing bare fig displays the figure in Jupyter

\n", "Jupyter displays the last expression in a cell automatically, the same way it prints a bare x on its own line. Ending a plotting cell with fig (or letting ax.hist(...) be the last call) shows the chart without an explicit plt.show(), which you only need outside a notebook.\n", "
" ] }, { "cell_type": "markdown", "id": "25", "metadata": {}, "source": [ "## 3. Multiple Axes and Saving Figures\n", "\n", "Real analysis rarely stops at one chart. `plt.subplots(nrows, ncols)` returns a whole grid of `Axes` at once, as a NumPy array, so you can loop over it the same way you looped over any other array in Part 4." ] }, { "cell_type": "code", "execution_count": null, "id": "26", "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(1, 3, figsize=(11, 3), sharey=True)\n", "\n", "for ax, course in zip(axes.flat, courses, strict=True):\n", " course_scores = exam_score[course_col == course]\n", " ax.hist(course_scores, bins=15, color=\"#4477AA\", edgecolor=\"white\")\n", " ax.set_title(course, fontsize=10)\n", " ax.set_xlabel(\"Exam score\")\n", "\n", "axes[0].set_ylabel(\"Number of students\")\n", "fig.suptitle(\"Score distribution per course\");" ] }, { "cell_type": "markdown", "id": "27", "metadata": {}, "source": [ "`axes.flat` works regardless of the grid shape: a `2x2` grid of `Axes` is a 2D array, but `.flat` always gives you a flat iterator over all of them, in row-major order. `sharey=True` forces every Axes in the grid to use the same y-axis range, which is what makes the three histograms above honestly comparable instead of each rescaled to its own data." ] }, { "cell_type": "markdown", "id": "28", "metadata": {}, "source": [ "
\n", " Common Mistake: Comparing histograms with different y-axis scales

\n", "Without sharey=True, matplotlib autoscales each Axes to its own data. Three histograms that look like they have the same number of students can actually have wildly different counts, because each y-axis silently uses a different scale. Whenever you put similar charts side by side for comparison, force a shared scale.\n", "
" ] }, { "cell_type": "markdown", "id": "29", "metadata": {}, "source": [ "Saving a figure has two choices that matter: resolution (`dpi`, dots per inch) and file format. A raster format (PNG) at low `dpi` looks blurry when scaled up; a vector format (SVG or PDF) stays sharp at any size because it stores shapes, not pixels. Call `fig.tight_layout()` before saving to close any gaps between subplots and prevent axis labels from being clipped:" ] }, { "cell_type": "code", "execution_count": null, "id": "30", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "\n", "output_dir = Path(\"tmp_plots\")\n", "output_dir.mkdir(exist_ok=True)\n", "\n", "fig, ax = plt.subplots(figsize=(5, 3))\n", "ax.hist(exam_scores, bins=20, color=\"#4477AA\", edgecolor=\"white\")\n", "ax.set_title(\"Exam score distribution\")\n", "fig.tight_layout()\n", "\n", "# PNG: fine for slides and notebooks, blurry if you zoom in or print large\n", "fig.savefig(output_dir / \"scores.png\", dpi=150, bbox_inches=\"tight\")\n", "# SVG: vector format, stays crisp at any size, the right choice for reports\n", "fig.savefig(output_dir / \"scores.svg\", bbox_inches=\"tight\")\n", "\n", "print(sorted(p.name for p in output_dir.iterdir()))\n", "\n", "import shutil\n", "\n", "shutil.rmtree(output_dir)" ] }, { "cell_type": "markdown", "id": "31", "metadata": {}, "source": [ "
\n", " Pro Tip: Default to vector formats for anything that gets printed or zoomed

\n", "PNG and JPEG store a fixed grid of pixels: stretch them and they blur. SVG and PDF store the actual shapes and text, so they render sharp at any zoom level or print size. Use PNG for quick previews, notebooks, and web thumbnails; use SVG or PDF for anything that will be printed or enlarged, such as a report or paper.\n", "
" ] }, { "cell_type": "markdown", "id": "32", "metadata": {}, "source": [ "## 4. Seaborn: Statistical Graphics in One Line\n", "\n", "Seaborn is built directly on matplotlib. It does not replace anything from Sections 1-3: it adds a layer of functions that know how to take a whole DataFrame, split it by a category, color each group, and draw a legend, all in a single call. Reaching for seaborn first whenever your chart needs grouping or a statistical summary saves a genuine amount of code." ] }, { "cell_type": "markdown", "id": "33", "metadata": {}, "source": [ "Seaborn expects **tidy data**: one row per observation, one column per variable. A full pandas introduction comes later in the data analysis tutorials, but building a DataFrame from arrays you already have is one line:" ] }, { "cell_type": "code", "execution_count": null, "id": "34", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "results = pd.DataFrame(\n", " {\n", " \"course\": course_col,\n", " \"semester\": semester_col,\n", " \"exam_score\": exam_score,\n", " \"study_hours\": study_hours,\n", " \"attendance_pct\": attendance_pct,\n", " }\n", ")\n", "results.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "35", "metadata": {}, "outputs": [], "source": [ "import seaborn as sns\n", "\n", "fig, ax = plt.subplots(figsize=(6, 3.5))\n", "sns.histplot(data=results, x=\"exam_score\", hue=\"course\", kde=True, ax=ax)\n", "ax.set_title(\"Exam score distribution by course\");" ] }, { "cell_type": "markdown", "id": "36", "metadata": {}, "source": [ "
\n", " Key Concept: Seaborn returns a matplotlib Axes

\n", "sns.histplot(..., ax=ax) draws onto the Axes you pass it and returns that same Axes. Nothing from Sections 1-3 is wasted: every ax.set_title(), ax.set_xlabel(), or fig.savefig() you already know still works on a seaborn chart. Seaborn only replaces the part where you would otherwise have looped over groups and called ax.hist() once per group yourself.\n", "
" ] }, { "cell_type": "markdown", "id": "37", "metadata": {}, "source": [ "`hue` is seaborn's primary grouping parameter: pass a column name and seaborn splits the data by that column, assigns each group a colour from its default palette, and draws a legend automatically. `palette` overrides those colours, accepting a named seaborn palette (`\"tab10\"`, `\"Set2\"`) or a list of hex codes. `style` (available in `sns.lineplot` and `sns.scatterplot`) adds a second visual channel by varying the marker shape or line style per group, which helps when a chart may be viewed in greyscale." ] }, { "cell_type": "markdown", "id": "38", "metadata": {}, "source": [ "`sns.boxplot` summarises a whole distribution (median, quartiles, outliers) per category in one call, the kind of comparison that would take a loop and several `ax.hist()` calls in raw matplotlib:" ] }, { "cell_type": "code", "execution_count": null, "id": "39", "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(6, 3.5))\n", "sns.boxplot(data=results, x=\"course\", y=\"exam_score\", hue=\"semester\", ax=ax)\n", "ax.set_title(\"Exam score by course and semester\");" ] }, { "cell_type": "markdown", "id": "40", "metadata": {}, "source": [ "`sns.violinplot()` shows the full shape of the distribution on both sides of a central axis, not just the five-number summary a boxplot gives. Use it when you suspect a distribution is skewed or has more than one peak in a way the box would hide:" ] }, { "cell_type": "code", "execution_count": null, "id": "41", "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(6, 3.5))\n", "sns.violinplot(data=results, x=\"course\", y=\"exam_score\", hue=\"semester\", ax=ax)\n", "ax.set_title(\"Exam score distribution by course and semester\");" ] }, { "cell_type": "markdown", "id": "42", "metadata": {}, "source": [ "`sns.heatmap` is the standard way to visualise a correlation matrix. 
Pass it `results[numeric_cols].corr()`, a small DataFrame that seaborn turns into a colour grid. `annot=True` prints each coefficient in its cell, and `center=0` pins the neutral colour to zero correlation. Because `study_hours` and `attendance_pct` were generated independently of `exam_score`, expect every off-diagonal value to be close to zero:" ] }, { "cell_type": "code", "execution_count": null, "id": "43", "metadata": {}, "outputs": [], "source": [ "numeric_cols = [\"exam_score\", \"study_hours\", \"attendance_pct\"]\n", "corr = results[numeric_cols].corr()\n", "\n", "fig, ax = plt.subplots(figsize=(4, 3.5))\n", "sns.heatmap(corr, annot=True, fmt=\".2f\", cmap=\"coolwarm\", center=0, ax=ax)\n", "ax.set_title(\"Feature correlation\");" ] }, { "cell_type": "markdown", "id": "44", "metadata": {}, "source": [ "
\n", " Activity 2 - Compare Study Habits Across Courses

\n", "\n", "Goal: Use sns.boxplot to compare study_hours across the three courses (x-axis: course, y-axis: study_hours), then add a title with ax.set_title().\n", "
fig, ax = plt.subplots(figsize=(6, 3.5))\n",
    "sns.boxplot(data=results, x=\"course\", y=\"study_hours\", ax=ax)\n",
    "ax.set_title(...)
\n", "Hint: This is almost identical to the boxplot above, just with a different y-axis and no hue.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "45", "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(6, 3.5))\n", "# TODO: boxplot of study_hours by course, plus a title\n", "...\n", "\n", "fig" ] }, { "cell_type": "markdown", "id": "46", "metadata": {}, "source": [ "
\n", " Pro Tip: Exploratory charts and presentation charts have different goals

\n", "Seaborn's defaults are optimised for exploratory use: fast, readable charts that help you understand the data before you have decided what question to ask. A presentation chart, one headed for a report or a slide deck, needs deliberate title wording, axis labels in the reader's language, and a colour palette that matches the project's house style. Part 7 covers that transition.\n", "
" ] }, { "cell_type": "markdown", "id": "47", "metadata": {}, "source": [ "
\n", " Pro Tip: seaborn 0.12+ ships a declarative objects API alongside the classic functions

\n", "Seaborn 0.12 introduced seaborn.objects (so.Plot()), a fully composable layer built on the same grammar-of-graphics ideas as Part 6's Lets-Plot. It's worth knowing even if you do not switch immediately:\n", "\n", "
import seaborn.objects as so\n",
    "\n",
    "(\n",
    "    so.Plot(results, x=\"study_hours\", y=\"exam_score\", color=\"course\")\n",
    "    .add(so.Dot(alpha=0.4))\n",
    "    .label(title=\"Study hours vs. exam score\")\n",
    ")
\n", "\n", "so.Plot() is lazy (nothing renders until you call .show() or display it), composable (chain .add() calls to layer marks), and consistent with the Lets-Plot mental model from Part 6. For exploratory work the classic sns.* functions are still faster to type; so.Plot() pays off when a chart needs several layers or custom marks.\n", "
" ] }, { "cell_type": "markdown", "id": "48", "metadata": {}, "source": [ "## Capstone: A Three-Panel Course Report\n", "\n", "Combine everything from this notebook into one Figure with three Axes side by side: a histogram of scores for one course, a scatter of study hours against score for the same course, and a bar chart comparing average scores across all three courses. This is the shape of report you would actually hand to a department head." ] }, { "cell_type": "markdown", "id": "49", "metadata": {}, "source": [ "
\n", " Capstone Exercise - Three-Panel Report

\n", "\n", "Goal: Build a (1, 3) grid of Axes:\n", "
    \n", "
  1. Axes 0: histogram of exam_score for \"Machine Learning\" only
  2. \n", "
  3. Axes 1: scatter of study_hours vs. exam_score, same course only
  4. \n", "
  5. Axes 2: bar chart of average exam_score per course (all courses)
  6. \n", "
\n", "Give the Figure an overall title with fig.suptitle() and each Axes its own ax.set_title(). Hint: Filter with ml_mask = results[\"course\"] == \"Machine Learning\", then index results[ml_mask] for the first two panels.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "50", "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(1, 3, figsize=(13, 3.5))\n", "\n", "ml_mask = results[\"course\"] == \"Machine Learning\"\n", "ml_results = results[ml_mask]\n", "\n", "# TODO panel 0: histogram of ml_results[\"exam_score\"]\n", "# TODO panel 1: scatter of ml_results[\"study_hours\"] vs ml_results[\"exam_score\"]\n", "# TODO panel 2: bar chart of average exam_score per course (all courses)\n", "...\n", "\n", "fig.suptitle(\"Course Report: Machine Learning\")\n", "fig" ] }, { "cell_type": "markdown", "id": "51", "metadata": {}, "source": [ "## Further Reading\n", "\n", "| Resource | Why it matters |\n", "|---|---|\n", "| Hunter, J.D. (2007). [Matplotlib: A 2D graphics environment](https://doi.org/10.1109/MCSE.2007.55). *Computing in Science & Engineering* 9(3), 90–95. | The original paper; understanding the Figure/Axes object model it describes makes every API decision predictable |\n", "| VanderPlas, J. (2016). *Python Data Science Handbook*, Ch. 4. O'Reilly. | Free online — covers customisation, subplots, and the stateful vs object-oriented API |\n", "| [Matplotlib documentation — Axes API](https://matplotlib.org/stable/api/axes_api.html) | Every method you will use in practice lives here; use `Ctrl+F` instead of guessing |\n", "| Wilke, C.O. (2019). *Fundamentals of Data Visualization*. O'Reilly. | Free at [clauswilke.com/dataviz](https://clauswilke.com/dataviz) — the chapter on figure design explains *why* certain defaults are problematic |\n" ] }, { "cell_type": "markdown", "id": "52", "metadata": {}, "source": [ "## Summary\n", "\n", "| Concept | Key rule |\n", "|---|---|\n", "| Figure vs. 
Axes | Figure is the canvas, Axes is one plot; draw by calling methods on `ax`, not `plt` |\n", "| `plt.subplots()` | Returns `(fig, ax)` for one plot, `(fig, axes)` for a grid |\n", "| Line chart | Trend across an ordered sequence: `ax.plot()` |\n", "| Scatter plot | Relationship between two continuous variables: `ax.scatter()` |\n", "| Bar chart | Comparing one number across categories: `ax.bar()` |\n", "| Histogram | Shape of one variable's distribution: `ax.hist()` |\n", "| `axes.flat` | Flat iterator over any subplot grid, regardless of its shape |\n", "| `sharey=True` | Forces a fair visual comparison across subplots |\n", "| `fig.tight_layout()` | Closes gaps and prevents clipped labels before saving |\n", "| Saving figures | PNG for previews, SVG/PDF for anything printed or zoomed; pass `bbox_inches=\"tight\"` |\n", "| Seaborn | One-line statistical graphics on tidy DataFrames; returns a matplotlib `Axes` |\n", "| `hue` | Splits data by a column, assigns colours automatically, and draws a legend |\n", "| `sns.violinplot` | Shows the full distribution shape per group, where a boxplot shows only a five-number summary |\n", "\n", "**Next:** `06-lets-plot.ipynb`, introducing the grammar of graphics: the same charts from this notebook, built declaratively instead of imperatively." ] } ], "metadata": { "kernelspec": { "display_name": "ark (3.12.12.final.0)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.12" } }, "nbformat": 4, "nbformat_minor": 5 }