{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "---\n", "title: \"Part 6: Grammar of Graphics with Lets-Plot\"\n", "---" ] }, { "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "[](https://colab.research.google.com/github/sambaiga/ds-mlops-path/blob/main/tutorials/01-python-basics/06-lets-plot.ipynb) [](https://raw.githubusercontent.com/sambaiga/ds-mlops-path/main/tutorials/01-python-basics/06-lets-plot.ipynb)" ] }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "**DS-MLOps Python Foundations**\n", "\n", "**Python 3.12+ | Author: Anthony Faustine**\n", "\n", "## Before you begin\n", "\n", "This notebook assumes you have completed Part 5 (`05-matplotlib.ipynb`). If you have not, start there: this notebook rebuilds several of its charts on purpose, so the contrast between the two ways of thinking about a plot is concrete rather than abstract.\n", "\n", "Part 5 was **imperative**: you told matplotlib exactly which method to call, in which order, for every piece of the chart. This part is **declarative**: you describe the data and the mapping you want, and the library works out how to draw it. This style is called the **grammar of graphics**, and **lets-plot** implements it in Python the way ggplot2 implements it in R. Part 7 (`07-data-storytelling.ipynb`) covers what makes a chart good and applies this project's house style to both libraries.\n", "\n", "::: {.callout-note collapse=\"true\" icon=false}\n", "## Topics covered\n", "\n", "| Topic | Why it matters |\n", "|---|---|\n", "| **Declarative vs. imperative** | The mental model shift that makes the rest of this notebook click |\n", "| **`ggplot` + `aes` + `geom`** | The three pieces every lets-plot chart is built from |\n", "| **Mapping vs. 
setting** | The single most common mistake in any grammar of graphics |\n", "| **Layering** | Adding a statistical summary on top of raw data, declaratively |\n", "| **Faceting** | One line to replace a whole manual subplot loop |\n", "| **Titles, labels, and scales** | Communicating clearly: naming axes, renaming legends, controlling colours |\n", ":::\n", "\n", "> Callout markers used throughout this notebook are explained on the [book cover page](../../index.qmd#callout-guide)." ] }, { "cell_type": "markdown", "id": "3", "metadata": {}, "source": [ "::: {.callout-note collapse=\"true\" icon=false}\n", "## Learning Objectives\n", "\n", "By the end of Part 6 you will be able to:\n", "\n", "| # | Skill | Covered in |\n", "|---|---|---|\n", "| 1 | Explain the difference between declarative and imperative plotting | Sec. 1 |\n", "| 2 | Build a chart from `ggplot()`, `aes()`, and a `geom_*()` | Sec. 2 |\n", "| 3 | Distinguish mapping a variable from setting a fixed value | Sec. 3 |\n", "| 4 | Layer a statistical summary on top of raw data | Sec. 4 |\n", "| 5 | Facet one chart into many panels instead of looping over subplots | Sec. 5 |\n", "| 6 | Add titles, axis labels, and control colours with `labs()` and scale functions | Sec. 6 |\n", ":::\n" ] }, { "cell_type": "markdown", "id": "4", "metadata": {}, "source": [ "## 0. Why Grammar of Graphics?\n", "\n", "You have already drawn scatter plots with Matplotlib. You called `ax.scatter(x, y, c=color, s=size)`. If you wanted a different geom, you called a different method. If you wanted a trend line, you called yet another. The plot grew by accumulating function calls, each one doing something slightly different.\n", "\n", "There is another way to think about it. A chart is not a list of drawing commands: it is a **mapping** from data columns to visual channels (position, colour, shape, size). State that mapping once, and the library figures out how to draw it. 
Add a layer (a trend line, a rug, a label) and you extend the mapping rather than calling a new drawing function. This is the Grammar of Graphics, introduced by Leland Wilkinson in 1999 and popularised by Hadley Wickham's **ggplot2** for R.\n", "\n", "**Lets-Plot** ([lets-plot.org](https://lets-plot.org)) is JetBrains' Python implementation of the same grammar. Its API mirrors ggplot2 so closely that R users can read Lets-Plot code without a translation guide. It renders to HTML in Jupyter, to PNG for reports, and (in its Pro edition) to interactive Datalore dashboards.\n", "\n", "### Alternatives that use the same grammar\n", "\n", "| Library | Language | Notes |\n", "| --- | --- | --- |\n", "| **ggplot2** ([ggplot2.tidyverse.org](https://ggplot2.tidyverse.org)) | R | The original; Lets-Plot mirrors it deliberately |\n", "| **plotnine** ([plotnine.org](https://plotnine.org)) | Python | ggplot2 port; similar API, Matplotlib backend |\n", "| **Lets-Plot** ([lets-plot.org](https://lets-plot.org)) | Python / Kotlin | HTML-first output, fast, maintained by JetBrains |\n", "| **Vega-Altair** ([altair-viz.github.io](https://altair-viz.github.io)) | Python | Different grammar (Vega-Lite), interactive |\n", "\n", "### Already in your environment\n", "\n", "```bash\n", "uv add lets-plot # for a standalone project\n", "```" ] }, { "cell_type": "markdown", "id": "5", "metadata": {}, "source": [ "## 1. Declarative vs. Imperative\n", "\n", "In Part 5, building a scatter plot meant calling `ax.scatter()` directly: you were the one deciding which function draws which shape. The grammar of graphics flips this around. You describe **what the data means** (this column is the x position, this column is the colour) and the library decides how to draw it. The same description works whether you add one point or one million, and stays valid even if you later change the chart type entirely." 
] }, { "cell_type": "code", "execution_count": null, "id": "6", "metadata": {}, "outputs": [], "source": [ "from lets_plot import LetsPlot\n", "\n", "LetsPlot.setup_html()" ] }, { "cell_type": "markdown", "id": "7", "metadata": {}, "source": [ "Rebuild the same dataset from Part 5: exam results across three courses and two semesters." ] }, { "cell_type": "code", "execution_count": null, "id": "8", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "rng = np.random.default_rng(42)\n", "\n", "courses = np.array([\"Machine Learning\", \"Data Structures\", \"Statistics\"])\n", "semesters = np.array([\"Fall 2024\", \"Spring 2025\"])\n", "\n", "n_per_group = 60\n", "course_col = np.repeat(courses, n_per_group * len(semesters))\n", "semester_col = np.tile(np.repeat(semesters, n_per_group), len(courses))\n", "\n", "course_base = {\"Machine Learning\": 68, \"Data Structures\": 74, \"Statistics\": 71}\n", "semester_bump = {\"Fall 2024\": 0, \"Spring 2025\": 4}\n", "\n", "exam_score = np.array(\n", " [rng.normal(course_base[c] + semester_bump[s], 10) for c, s in zip(course_col, semester_col, strict=True)]\n", ").clip(0, 100)\n", "study_hours = rng.uniform(0, 25, size=len(course_col))\n", "\n", "results = pd.DataFrame(\n", " {\n", " \"course\": course_col,\n", " \"semester\": semester_col,\n", " \"exam_score\": exam_score,\n", " \"study_hours\": study_hours,\n", " }\n", ")\n", "results.head()" ] }, { "cell_type": "markdown", "id": "9", "metadata": {}, "source": [ "Here is the Part 5 scatter plot (study hours vs. exam score) again, this time declaratively:" ] }, { "cell_type": "code", "execution_count": null, "id": "10", "metadata": {}, "outputs": [], "source": [ "from lets_plot import aes, geom_point, ggplot\n", "\n", "ggplot(results, aes(x=\"study_hours\", y=\"exam_score\")) + geom_point(alpha=0.4)" ] }, { "cell_type": "markdown", "id": "11", "metadata": {}, "source": [ "
ggplot(data, aes(...)) declares the dataset and which columns map to which visual property, and one or more geom_*() layers say what shape to draw with that mapping. Change the geom_*() and the same aes() mapping produces a completely different chart type, often without touching anything else.\n",
"exam_score using geom_histogram(). Map x=\"exam_score\" in aes(), and pass bins=20 to geom_histogram().\n",
"ggplot(results, aes(x=\"exam_score\")) + geom_histogram(bins=20, fill=\"#4477AA\")\n", "
aes(color=\"#4477AA\") tries to map the literal string \"#4477AA\" as if it were a column name. Lets-plot will either error or, worse, silently treat it as a constant category and draw a one-entry legend that says #4477AA. A fixed value is a setting and belongs as a plain keyword argument to the geom_*(), outside aes(). A column name that should vary per row is a mapping and belongs inside aes().\n",
"method='lm' when you need a straight-line fitgeom_smooth() uses a LOESS (locally weighted) smoother by default, which follows local curves in the data. When you specifically want a linear regression fit, pass method='lm'. The confidence band and se=True/False control both methods the same way:\n",
"\n",
"# Default LOESS — follows curves\n",
"geom_smooth(se=True)\n",
"\n",
"# Linear (OLS) — forces a straight line, shows the parametric fit\n",
"geom_smooth(method='lm', se=True)\n",
"\n",
"The LOESS smoother needs statsmodels installed; the linear one does not. Both display the 95% confidence band when se=True.\n",
"geom_boxplot() for geom_violin() when shape mattersaes() mapping stays identical: just change the geom_*().\n",
"geom_point() + geom_smooth() draws points first, then the trend line on top. Reversing the order draws the line first and the points on top of it. This matters most when a layer has a solid fill that would otherwise hide whatever was drawn before it.\n",
"ggmarginal()ggmarginal() wraps any scatter plot with histograms or density curves along the x and y margins, letting you see both the bivariate relationship and the individual distributions without a separate figure:\n",
"\n",
"from lets_plot import ggmarginal\n",
"\n",
"p = (\n",
" ggplot(results, aes(x=\"study_hours\", y=\"exam_score\", color=\"course\"))\n",
" + geom_point(alpha=0.4)\n",
" + geom_smooth(method=\"lm\", se=False)\n",
")\n",
"ggmarginal(p, type=\"density\")\n",
"\n",
"type accepts \"density\" (KDE curves), \"histogram\", or \"boxplot\". The marginals automatically inherit the color mapping so each course gets its own density curve.\n",
"facet_wrap(facets=\"course\") splits the data by the course column and draws one panel per group, using the exact same aes() and geom_*() for every panel. There is no loop to write and no risk of accidentally giving one panel a different scale than the others, the bug from Part 5's Common Mistake callout. Add a second facet variable with facet_grid() for a full grid of panels.\n",
"study_hours vs. exam_score, coloured by course, faceted into one panel per semester.\n",
"ggplot(results, aes(x=\"study_hours\", y=\"exam_score\", color=\"course\")) \\\n",
" + geom_point(alpha=0.5) \\\n",
" + facet_wrap(facets=\"semester\")\n",
"labs() renames anything auto-generated from a column namelabs(title=..., x=..., y=..., color=...) is one more + layer. The named argument matches the aesthetic: use color= when you have aes(color=\"course\") and fill= when you have aes(fill=\"course\"). pro_colors from ark.plot.theme is the project's brand palette; passing it to scale_color_manual() connects this chart to the same colour system that modern_theme() applies globally in Part 7.\n",
"geom_density() chart from Section 4 and add a proper title, axis labels, and legend title with labs().\n",
"ggplot(results, aes(x=\"exam_score\", fill=\"course\")) + geom_density(alpha=0.4) \\\n",
" + labs(title=..., x=..., y=..., fill=...)\n",
"exam_score, faceted by course, with geom_vline() marking the overall mean score so each course's panel can be compared against it.\n",
"overall_mean = results[\"exam_score\"].mean()\n",
"\n",
"ggplot(results, aes(x=\"exam_score\")) \\\n",
" + geom_histogram(bins=15, fill=\"#4477AA\") \\\n",
" + geom_vline(xintercept=overall_mean, color=\"#EE6677\", linetype=\"dashed\") \\\n",
" + facet_wrap(facets=\"course\") \\\n",
" + ggsize(700, 250)\n",
"Hint: geom_vline() takes a fixed xintercept, a setting, not a mapping, so it goes outside aes() just like the fixed colours in Sec. 3.\n",
"