{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "---\n", "title: \"Part 5: Matplotlib and Its Ecosystem\"\n", "---" ] }, { "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "[Open in Colab](https://colab.research.google.com/github/sambaiga/ds-mlops-path/blob/main/tutorials/01-python-basics/05-matplotlib.ipynb) [Download notebook](https://raw.githubusercontent.com/sambaiga/ds-mlops-path/main/tutorials/01-python-basics/05-matplotlib.ipynb)" ] }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "**DS-MLOps Python Foundations**\n", "\n", "**Python 3.12+ | Author: Anthony Faustine**\n", "\n", "## Before you begin\n", "\n", "This notebook assumes you have completed Parts 1-4 (`01-python-core.ipynb`, `02-control-flow.ipynb`, `03-python-patterns.ipynb`, `04-numpy.ipynb`). If you have not, start there.\n", "\n", "Part 5 covers the two layers of Python's most established plotting stack: **matplotlib**, the library almost everything else in the ecosystem is built on, and **seaborn**, which sits on top of it for statistical graphics. Both work on the same **university analytics platform** scenario from earlier parts, extended here to student exam results across several courses and two semesters, since a believable visualisation needs more than one number to plot.\n", "\n", "Part 6 (`06-lets-plot.ipynb`) introduces a different way of thinking about plots entirely: the grammar of graphics. 
Part 7 (`07-data-storytelling.ipynb`) covers what makes a chart good and applies this project's own house style to both libraries.\n", "\n", "::: {.callout-note collapse=\"true\" icon=false}\n", "## Topics covered\n", "\n", "| Topic | Why it matters |\n", "|---|---|\n", "| **Figure and Axes** | The object model every matplotlib call eventually goes through |\n", "| **Core chart types** | Line, scatter, bar, histogram: the four you will use constantly |\n", "| **Multiple Axes** | Comparing several views of the same data side by side |\n", "| **Saving figures** | Resolution and format choices that matter for reports and papers |\n", "| **Seaborn** | One line of statistical graphics, still a matplotlib `Axes` underneath |\n", ":::\n", "\n", "> Callout markers used throughout this notebook are explained on the [book cover page](../../index.qmd#callout-guide)." ] }, { "cell_type": "markdown", "id": "3", "metadata": {}, "source": [ "::: {.callout-note collapse=\"true\" icon=false}\n", "## Learning Objectives\n", "\n", "By the end of Part 5 you will be able to:\n", "\n", "| # | Skill | Covered in |\n", "|---|---|---|\n", "| 1 | Explain the Figure/Axes object model and why it matters | Sec. 1 |\n", "| 2 | Build line, scatter, bar, and histogram charts with the object-oriented API | Sec. 2 |\n", "| 3 | Lay out and compare multiple Axes in one Figure | Sec. 3 |\n", "| 4 | Save a figure at the right resolution and format for its destination | Sec. 3 |\n", "| 5 | Use seaborn for one-line statistical graphics, then keep customising with matplotlib | Sec. 4 |\n", ":::\n" ] }, { "cell_type": "markdown", "id": "4", "metadata": {}, "source": [ "## 0. Python's Plotting Landscape\n", "\n", "Picture this: you have just finished cleaning `university_analytics.csv`. The head looks right, the dtypes are correct, the nulls are gone. Then your manager asks: *\"What does the score distribution actually look like?\"* You could print quartiles. You could sort and read through 2,400 rows. 
Or you could produce one chart that answers the question before the sentence is finished.\n", "\n", "That chart needs a library. Python has several, and they make different trade-offs:\n", "\n", "| Library | Style | Strengths | Best for |\n", "| --- | --- | --- | --- |\n", "| **Matplotlib** ([matplotlib.org](https://matplotlib.org)) | Imperative (OO API) | Total control, publication quality, runs everywhere | Custom figures, fine-grained layout, saving to PNG/PDF/SVG |\n", "| **Seaborn** ([seaborn.pydata.org](https://seaborn.pydata.org)) | High-level on top of Matplotlib | Statistical plots in one line, beautiful defaults | Distribution plots, pair plots, heatmaps |\n", "| **Lets-Plot** ([lets-plot.org](https://lets-plot.org)) | Declarative, Grammar of Graphics | Expressive, ggplot2-compatible, interactive-ready | Layered charts, Part 6 of this book |\n", "| **Plotly** ([plotly.com/python](https://plotly.com/python)) | Declarative, interactive | Hover, zoom, Dash integration | Dashboards, interactive reports |\n", "| **Bokeh** ([bokeh.org](https://docs.bokeh.org)) | Declarative, interactive | Streaming, large data | Real-time visualisations |\n", "\n", "This chapter focuses on Matplotlib and Seaborn. Seaborn is built directly on Matplotlib; Lets-Plot, Plotly, and Bokeh render independently, but the vocabulary you learn here (figures, axes, scales, legends) carries over to all of them. Matplotlib is the bedrock: learn it once and every other plotting library is far easier to pick up, because its ideas map onto something you already know.\n", "\n", "### Already in your environment\n", "\n", "Both libraries are already listed in this project's `pyproject.toml`. For a standalone project:\n", "\n", "```bash\n", "uv add matplotlib seaborn\n", "```" ] }, { "cell_type": "markdown", "id": "5", "metadata": {}, "source": [ "## 1. Why Visualise? The Figure and Axes\n", "\n", "A table of a thousand exam scores tells you nothing at a glance. A histogram of the same thousand scores tells you the shape of the distribution in about half a second. 
That is the entire case for visualisation: it trades a small amount of precision for a large amount of immediate understanding, which is exactly what you want before you have decided which question to ask next." ] }, { "cell_type": "code", "execution_count": null, "id": "6", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "rng = np.random.default_rng(7)\n", "exam_scores = rng.normal(loc=72, scale=12, size=1000).clip(0, 100)\n", "\n", "print(f\"mean : {exam_scores.mean():.1f}\")\n", "print(f\"median : {np.median(exam_scores):.1f}\")\n", "print(f\"std : {exam_scores.std():.1f}\")\n", "# The numbers alone do not tell you whether the distribution is symmetric,\n", "# has a long tail, or is bimodal. A histogram answers that in one look." ] }, { "cell_type": "markdown", "id": "7", "metadata": {}, "source": [ "matplotlib has two APIs for building the same chart. The older one, `pyplot` (`plt.plot(...)`), is a state machine: it always draws onto \"whichever figure was most recently touched,\" which is fine for a single quick chart and confusing the moment you need two charts side by side. The **object-oriented API** is explicit instead: you ask for a `Figure` (the whole canvas) and one or more `Axes` (an individual plot inside it), then call methods directly on the `Axes` you want to draw on." ] }, { "cell_type": "code", "execution_count": null, "id": "8", "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "# The object-oriented pattern you will use for almost every chart in this\n", "# notebook: ask for a Figure and an Axes, then call methods on the Axes.\n", "fig, ax = plt.subplots(figsize=(5, 3))\n", "ax.hist(exam_scores, bins=20, color=\"#4477AA\", edgecolor=\"white\")\n", "ax.set_xlabel(\"Exam score\")\n", "ax.set_ylabel(\"Number of students\")\n", "ax.set_title(\"Exam score distribution\");" ] }, { "cell_type": "markdown", "id": "9", "metadata": {}, "source": [ "
A `Figure` is the whole canvas: the window or page a chart is drawn on, and the thing you save to a file. An `Axes` is one plot inside that canvas, with its own x-axis, y-axis, title, and data. `fig, ax = plt.subplots()` gives you one of each. Every method that actually draws data (`.plot()`, `.scatter()`, `.bar()`, `.hist()`) lives on the `Axes`, not the `Figure`.\n",
"plt.plot() and ax.plot()plt.title(\"x\") sets the title of whichever Axes pyplot thinks is \"current\", which silently changes after you create a new subplot. The moment you have more than one Axes, calling plt.xlabel() instead of ax.set_xlabel() is a common way to label the wrong chart. Once you have an ax object, call methods on it directly and skip plt.* entirely.\n",
"ax.bar_label() to annotate bar values automaticallyax.text() for each rectangle. Since matplotlib 3.4, ax.bar_label(container) does it in one line. ax.bar() returns a BarContainer; pass it to bar_label and optionally format the numbers with fmt:\n",
"\n",
"bars = ax.bar(courses, course_means, color=[\"#4477AA\", \"#EE6677\", \"#228833\"])\n",
"ax.bar_label(bars, fmt=\"{:.1f}\", padding=3)\n",
"\n",
"fmt accepts a format string (applied to each value) or a labels keyword with an explicit list. padding pushes the text a few points above the bar top.\n",
"attendance_pct with 15 bins, label both axes, and give it a title. Use the object-oriented pattern: fig, ax = plt.subplots(), then call methods on ax.\n",
"fig, ax = plt.subplots(figsize=(5, 3))\n",
"ax.hist(attendance_pct, bins=15, ...)\n",
"# expect a roughly uniform spread between 50 and 100, since\n",
"# attendance_pct was generated with rng.uniform(50, 100, ...)\n",
"fig displays the figure in Jupyterx on its own line. Ending a plotting cell with fig (or letting ax.hist(...) be the last call) shows the chart without an explicit plt.show(), which you only need outside a notebook.\n",
"sharey=True, matplotlib autoscales each Axes to its own data. Three histograms that look like they have the same number of students can actually have wildly different counts, because each y-axis silently uses a different scale. Whenever you put similar charts side by side for comparison, force a shared scale.\n",
"sns.histplot(..., ax=ax) draws onto the Axes you pass it and returns that same Axes. Nothing from Sections 1-3 is wasted: every ax.set_title(), ax.set_xlabel(), or fig.savefig() you already know still works on a seaborn chart. Seaborn only replaces the part where you would otherwise have looped over groups and called ax.hist() once per group yourself.\n",
"sns.boxplot to compare study_hours across the three courses (x-axis: course, y-axis: study_hours), then add a title with ax.set_title().\n",
"fig, ax = plt.subplots(figsize=(6, 3.5))\n",
"sns.boxplot(data=results, x=\"course\", y=\"study_hours\", ax=ax)\n",
"ax.set_title(...)\n",
"Hint: This is almost identical to the boxplot above, just with a different y-axis and no hue.\n",
"seaborn.objects (so.Plot()), a fully composable layer built on the same grammar-of-graphics ideas as Part 6's Lets-Plot. It's worth knowing even if you do not switch immediately:\n",
"\n",
"import seaborn.objects as so\n",
"\n",
"(\n",
" so.Plot(results, x=\"study_hours\", y=\"exam_score\", color=\"course\")\n",
" .add(so.Dot(alpha=0.4))\n",
" .label(title=\"Study hours vs. exam score\")\n",
")\n",
"\n",
"so.Plot() is lazy (nothing renders until you call .show() or display it), composable (chain .add() calls to layer marks), and consistent with the Lets-Plot mental model from Part 6. For exploratory work the classic sns.* functions are still faster to type; so.Plot() pays off when a chart needs several layers or custom marks.\n",
"(1, 3) grid of Axes:\n",
"exam_score for \"Machine Learning\" onlystudy_hours vs. exam_score, same course onlyexam_score per course (all courses)fig.suptitle() and each Axes its own ax.set_title(). Hint: Filter with ml_mask = results[\"course\"] == \"Machine Learning\", then index results[ml_mask] for the first two panels.\n",
"