{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0",
   "metadata": {},
   "source": [
    "---\n",
    "title: \"Part 6: Grammar of Graphics with Lets-Plot\"\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1",
   "metadata": {},
   "source": [
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sambaiga/ds-mlops-path/blob/main/tutorials/01-python-basics/06-lets-plot.ipynb) [![Download Notebook](https://img.shields.io/badge/Download-Notebook-blue.svg?logo=jupyter&logoColor=white)](https://raw.githubusercontent.com/sambaiga/ds-mlops-path/main/tutorials/01-python-basics/06-lets-plot.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2",
   "metadata": {},
   "source": [
    "**DS-MLOps Python Foundations**\n",
    "\n",
    "**Python 3.12+ | Author: Anthony Faustine**\n",
    "\n",
    "## Before you begin\n",
    "\n",
    "This notebook assumes you have completed Part 5 (`05-matplotlib.ipynb`). If you have not, start there: this notebook rebuilds several of its charts on purpose, so the contrast between the two ways of thinking about a plot is concrete rather than abstract.\n",
    "\n",
    "Part 5 was **imperative**: you told matplotlib exactly which method to call, in which order, for every piece of the chart. This part is **declarative**: you describe the data and the mapping you want, and the library works out how to draw it. This style is called the **grammar of graphics**, and **lets-plot** implements it in Python the way ggplot2 implements it in R. Part 7 (`07-data-storytelling.ipynb`) covers what makes a chart good and applies this project's house style to both libraries.\n",
    "\n",
    "::: {.callout-note collapse=\"true\" icon=false}\n",
    "## Topics covered\n",
    "\n",
    "| Topic | Why it matters |\n",
    "|---|---|\n",
    "| **Declarative vs. imperative** | The mental model shift that makes the rest of this notebook click |\n",
    "| **`ggplot` + `aes` + `geom`** | The three pieces every lets-plot chart is built from |\n",
    "| **Mapping vs. setting** | The single most common mistake in any grammar of graphics |\n",
    "| **Layering** | Adding a statistical summary on top of raw data, declaratively |\n",
    "| **Faceting** | One line to replace a whole manual subplot loop |\n",
    "| **Titles, labels, and scales** | Communicating clearly: naming axes, renaming legends, controlling colours |\n",
    ":::\n",
    "\n",
    "> Callout markers used throughout this notebook are explained on the [book cover page](../../index.qmd#callout-guide)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3",
   "metadata": {},
   "source": [
    "::: {.callout-note collapse=\"true\" icon=false}\n",
    "## Learning Objectives\n",
    "\n",
    "By the end of Part 6 you will be able to:\n",
    "\n",
    "| # | Skill | Covered in |\n",
    "|---|---|---|\n",
    "| 1 | Explain the difference between declarative and imperative plotting | Sec. 1 |\n",
    "| 2 | Build a chart from `ggplot()`, `aes()`, and a `geom_*()` | Sec. 2 |\n",
    "| 3 | Distinguish mapping a variable from setting a fixed value | Sec. 3 |\n",
    "| 4 | Layer a statistical summary on top of raw data | Sec. 4 |\n",
    "| 5 | Facet one chart into many panels instead of looping over subplots | Sec. 5 |\n",
    "| 6 | Add titles, axis labels, and control colours with `labs()` and scale functions | Sec. 6 |\n",
    ":::\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4",
   "metadata": {},
   "source": [
    "## 0. Why Grammar of Graphics?\n",
    "\n",
    "You have already drawn scatter plots with Matplotlib. You called `ax.scatter(x, y, c=color, s=size)`. If you wanted a different geom, you called a different method. If you wanted a trend line, you called yet another. The plot grew by accumulating function calls, each one doing something slightly different.\n",
    "\n",
    "There is another way to think about it. A chart is not a list of drawing commands: it is a **mapping** from data columns to visual channels (position, colour, shape, size). State that mapping once, and the library figures out how to draw it. Add a layer (a trend line, a rug, a label) and you extend the mapping rather than calling a new drawing function. This is the Grammar of Graphics, introduced by Leland Wilkinson in 1999 and popularised by Hadley Wickham's **ggplot2** for R.\n",
    "\n",
    "**Lets-Plot** ([lets-plot.org](https://lets-plot.org)) is JetBrains' Python implementation of the same grammar. Its API mirrors ggplot2 so closely that R users can read Lets-Plot code without a translation guide. It renders to HTML in Jupyter, to PNG for reports, and (in its Pro edition) to interactive Datalore dashboards.\n",
    "\n",
    "### Alternatives that use the same grammar\n",
    "\n",
    "| Library | Language | Notes |\n",
    "| --- | --- | --- |\n",
    "| **ggplot2** ([ggplot2.tidyverse.org](https://ggplot2.tidyverse.org)) | R | The original; Lets-Plot mirrors it deliberately |\n",
    "| **plotnine** ([plotnine.org](https://plotnine.org)) | Python | ggplot2 port; similar API, Matplotlib backend |\n",
    "| **Lets-Plot** ([lets-plot.org](https://lets-plot.org)) | Python / Kotlin | HTML-first output, fast, maintained by JetBrains |\n",
    "| **Vega-Altair** ([altair-viz.github.io](https://altair-viz.github.io)) | Python | Different grammar (Vega-Lite), interactive |\n",
    "\n",
    "### Already in your environment\n",
    "\n",
    "```bash\n",
    "uv add lets-plot          # for a standalone project\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5",
   "metadata": {},
   "source": [
    "## 1. Declarative vs. Imperative\n",
    "\n",
    "In Part 5, building a scatter plot meant calling `ax.scatter()` directly: you were the one deciding which function draws which shape. The grammar of graphics flips this around. You describe **what the data means** (this column is the x position, this column is the colour) and the library decides how to draw it. The same description works whether you add one point or one million, and stays valid even if you later change the chart type entirely."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6",
   "metadata": {},
   "outputs": [],
   "source": [
    "from lets_plot import LetsPlot\n",
    "\n",
    "LetsPlot.setup_html()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7",
   "metadata": {},
   "source": [
    "Rebuild the same dataset from Part 5: exam results across three courses and two semesters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "rng = np.random.default_rng(42)\n",
    "\n",
    "courses = np.array([\"Machine Learning\", \"Data Structures\", \"Statistics\"])\n",
    "semesters = np.array([\"Fall 2024\", \"Spring 2025\"])\n",
    "\n",
    "n_per_group = 60\n",
    "course_col = np.repeat(courses, n_per_group * len(semesters))\n",
    "semester_col = np.tile(np.repeat(semesters, n_per_group), len(courses))\n",
    "\n",
    "course_base = {\"Machine Learning\": 68, \"Data Structures\": 74, \"Statistics\": 71}\n",
    "semester_bump = {\"Fall 2024\": 0, \"Spring 2025\": 4}\n",
    "\n",
    "exam_score = np.array(\n",
    "    [rng.normal(course_base[c] + semester_bump[s], 10) for c, s in zip(course_col, semester_col, strict=True)]\n",
    ").clip(0, 100)\n",
    "study_hours = rng.uniform(0, 25, size=len(course_col))\n",
    "\n",
    "results = pd.DataFrame(\n",
    "    {\n",
    "        \"course\": course_col,\n",
    "        \"semester\": semester_col,\n",
    "        \"exam_score\": exam_score,\n",
    "        \"study_hours\": study_hours,\n",
    "    }\n",
    ")\n",
    "results.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9",
   "metadata": {},
   "source": [
    "Here is the Part 5 scatter plot (study hours vs. exam score) again, this time declaratively:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "10",
   "metadata": {},
   "outputs": [],
   "source": [
    "from lets_plot import aes, geom_point, ggplot\n",
    "\n",
    "ggplot(results, aes(x=\"study_hours\", y=\"exam_score\")) + geom_point(alpha=0.4)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "11",
   "metadata": {},
   "source": [
    "<div style='background:#EAF3FA;border-left:5px solid #0369A1;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#0369A1;font-weight:bold'><i class=\"bi bi-info-circle-fill\"></i> Key Concept: The Grammar of Graphics</span><br><br>\n",
    "Every lets-plot chart is built from the same three pieces, combined with <code>+</code>: <code>ggplot(data, aes(...))</code> declares the dataset and which columns map to which visual property, and one or more <code>geom_*()</code> layers say what shape to draw with that mapping. Change the <code>geom_*()</code> and the same <code>aes()</code> mapping produces a completely different chart type, often without touching anything else.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "12",
   "metadata": {},
   "source": [
    "## 2. The Grammar: `ggplot`, `aes`, `geom`\n",
    "\n",
    "`aes()` (short for \"aesthetics\") maps DataFrame columns to visual properties: `x`, `y`, `color`, `fill`, `size`. The `geom_*()` you add decides what shape represents each row. Swapping `geom_point()` for `geom_line()` or `geom_bar()` is often the only change needed to turn one chart type into another:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "13",
   "metadata": {},
   "outputs": [],
   "source": [
    "from lets_plot import geom_bar\n",
    "\n",
    "course_means = results.groupby(\"course\", as_index=False)[\"exam_score\"].mean()\n",
    "\n",
    "ggplot(course_means, aes(x=\"course\", y=\"exam_score\")) + geom_bar(stat=\"identity\", fill=\"#4477AA\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "14",
   "metadata": {},
   "source": [
    "`stat=\"identity\"` tells `geom_bar()` to plot the `exam_score` column exactly as given, rather than its default behaviour of counting rows per category. Compare this to Part 5's bar chart: the data preparation (`groupby().mean()`) is identical, only the drawing step changed shape."
   ]
  },
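  {
   "cell_type": "markdown",
   "id": "14a",
   "metadata": {},
   "source": [
    "For contrast, here is a minimal sketch of the default behaviour: with no `stat` argument, `geom_bar()` counts the rows in each category itself, so no `groupby()` is needed. The `demo` dictionary is a hypothetical stand-in for `results`, kept tiny so the counts are easy to check by eye.\n",
    "\n",
    "```python\n",
    "from lets_plot import aes, geom_bar, ggplot\n",
    "\n",
    "# Hypothetical toy data: three ML rows and two Statistics rows\n",
    "demo = {\"course\": [\"ML\", \"ML\", \"ML\", \"Statistics\", \"Statistics\"]}\n",
    "\n",
    "# No y mapping and no stat argument: bar height is the row count per course\n",
    "count_chart = ggplot(demo, aes(x=\"course\")) + geom_bar(fill=\"#4477AA\")\n",
    "count_chart\n",
    "```\n",
    "\n",
    "The ML bar should reach 3 and the Statistics bar 2. Use `stat=\"identity\"` only when the heights are already computed, as in the `course_means` chart above."
   ]
  },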
  {
   "cell_type": "markdown",
   "id": "15",
   "metadata": {},
   "source": [
    "`position=\"dodge\"` places bars for each group side by side instead of stacking them, useful when you want to compare values across two grouping variables at once. The data needs a category column on `x` and a group column on `fill`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "16",
   "metadata": {},
   "outputs": [],
   "source": [
    "course_semester_means = results.groupby([\"course\", \"semester\"], as_index=False)[\"exam_score\"].mean()\n",
    "\n",
    "(\n",
    "    ggplot(course_semester_means, aes(x=\"course\", y=\"exam_score\", fill=\"semester\"))\n",
    "    + geom_bar(stat=\"identity\", position=\"dodge\")\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "17",
   "metadata": {},
   "source": [
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 1 - Histogram, Declaratively</span><br><br>\n",
    "\n",
    "<b>Goal:</b> Rebuild the Part 5 histogram of <code>exam_score</code> using <code>geom_histogram()</code>. Map <code>x=\"exam_score\"</code> in <code>aes()</code>, and pass <code>bins=20</code> to <code>geom_histogram()</code>.\n",
    "<pre style='background:#FFF8E1;padding:10px;border-radius:4px;font-size:0.9em'>ggplot(results, aes(x=\"exam_score\")) + geom_histogram(bins=20, fill=\"#4477AA\")</pre>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "18",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: build the histogram described above\n",
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19",
   "metadata": {},
   "source": [
    "## 3. Mapping vs. Setting\n",
    "\n",
    "`aes(color=\"course\")` **maps** the `course` column to colour: each course gets its own colour, chosen automatically, with a legend. `color=\"#4477AA\"` **sets** every point to the exact same fixed colour, with no legend, because there is nothing left to distinguish. Confusing the two is the single most common mistake when learning any grammar of graphics, in Python or R:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "20",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Mapping: course determines colour, one colour per course, with a legend\n",
    "ggplot(results, aes(x=\"study_hours\", y=\"exam_score\", color=\"course\")) + geom_point(alpha=0.6)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "21",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Setting: every point is the same fixed colour, no legend needed\n",
    "ggplot(results, aes(x=\"study_hours\", y=\"exam_score\")) + geom_point(color=\"#4477AA\", alpha=0.6)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "22",
   "metadata": {},
   "source": [
    "<div style='background:#FEF2F2;border-left:5px solid #DC2626;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#991B1B;font-weight:bold'><i class=\"bi bi-bug-fill\"></i> Common Mistake: Putting a fixed value inside <code>aes()</code></span><br><br>\n",
    "<code>aes(color=\"#4477AA\")</code> tries to map the literal string <code>\"#4477AA\"</code> as if it were a column name. Lets-plot will either error or, worse, silently treat it as a constant category and draw a one-entry legend that says <code>#4477AA</code>. A fixed value is a <b>setting</b> and belongs as a plain keyword argument to the <code>geom_*()</code>, outside <code>aes()</code>. A column name that should vary per row is a <b>mapping</b> and belongs inside <code>aes()</code>.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "23",
   "metadata": {},
   "source": [
    "## 4. Layering: Raw Data and a Statistical Summary Together\n",
    "\n",
    "Because every `geom_*()` is its own layer, you can stack a raw-data layer and a computed-summary layer on the same `aes()` mapping. `geom_smooth()` fits a trend line to the data it is given, with a shaded confidence band, entirely declaratively:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24",
   "metadata": {},
   "outputs": [],
   "source": [
    "from lets_plot import geom_smooth\n",
    "\n",
    "(\n",
    "    ggplot(results, aes(x=\"study_hours\", y=\"exam_score\"))\n",
    "    + geom_point(alpha=0.3, color=\"#4477AA\")\n",
    "    + geom_smooth(color=\"#EE6677\", se=True)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "25",
   "metadata": {},
   "source": [
    "<div style='background:#F5F3FF;border-left:5px solid #7C3AED;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#5B21B6;font-weight:bold'><i class=\"bi bi-lightbulb-fill\"></i> Pro Tip: Use <code>method='lm'</code> when you need a straight-line fit</span><br><br>\n",
    "<code>geom_smooth()</code> uses a LOESS (locally weighted) smoother by default, which follows local curves in the data. When you specifically want a linear regression fit, pass <code>method='lm'</code>. The confidence band and <code>se=True/False</code> control both methods the same way:\n",
    "\n",
    "<pre style='background:#F4F5F6;padding:10px;border-radius:4px;font-size:0.9em'># Default LOESS — follows curves\n",
    "geom_smooth(se=True)\n",
    "\n",
    "# Linear (OLS) — forces a straight line, shows the parametric fit\n",
    "geom_smooth(method='lm', se=True)</pre>\n",
    "\n",
    "The LOESS smoother needs <code>statsmodels</code> installed; the linear one does not. Both display the 95% confidence band when <code>se=True</code>.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "26",
   "metadata": {},
   "source": [
    "`geom_density()` is the smooth-curve equivalent of a histogram, useful when comparing several distributions that would otherwise overlap into an unreadable stack of bars:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "27",
   "metadata": {},
   "outputs": [],
   "source": [
    "from lets_plot import geom_density\n",
    "\n",
    "ggplot(results, aes(x=\"exam_score\", fill=\"course\")) + geom_density(alpha=0.4)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "28",
   "metadata": {},
   "source": [
    "`geom_boxplot()` summarises a distribution as median, quartiles, and outliers per group, the declarative equivalent of Part 5's `sns.boxplot()`. Because it is just another `geom_*()`, it composes with `aes()` and faceting exactly like any other layer:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "29",
   "metadata": {},
   "outputs": [],
   "source": [
    "from lets_plot import geom_boxplot\n",
    "\n",
    "ggplot(results, aes(x=\"course\", y=\"exam_score\", fill=\"course\")) + geom_boxplot(alpha=0.7)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "30",
   "metadata": {},
   "source": [
    "<div style='background:#F5F3FF;border-left:5px solid #7C3AED;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#5B21B6;font-weight:bold'><i class=\"bi bi-lightbulb-fill\"></i> Pro Tip: Swap <code>geom_boxplot()</code> for <code>geom_violin()</code> when shape matters</span><br><br>\n",
    "A boxplot gives you five numbers: minimum, first quartile, median, third quartile, and maximum. A violin gives you the full density shape on both sides of the axis, which reveals skew or multiple peaks that the box hides. The <code>aes()</code> mapping stays identical: just change the <code>geom_*()</code>.\n",
    "</div>"
   ]
  },
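  {
   "cell_type": "markdown",
   "id": "30a",
   "metadata": {},
   "source": [
    "As a sketch of that swap, the block below declares one mapping and attaches two different geoms to it. `demo` is a hypothetical toy dataset standing in for `results`, and `alpha=0.7` simply mirrors the boxplot cell above.\n",
    "\n",
    "```python\n",
    "from lets_plot import aes, geom_boxplot, geom_violin, ggplot\n",
    "\n",
    "# Hypothetical toy data standing in for the `results` DataFrame\n",
    "demo = {\n",
    "    \"course\": [\"ML\"] * 6 + [\"Statistics\"] * 6,\n",
    "    \"exam_score\": [55, 60, 62, 70, 88, 90, 64, 66, 70, 71, 73, 75],\n",
    "}\n",
    "\n",
    "# The mapping is declared once and reused unchanged\n",
    "base = ggplot(demo, aes(x=\"course\", y=\"exam_score\", fill=\"course\"))\n",
    "\n",
    "box_chart = base + geom_boxplot(alpha=0.7)\n",
    "violin_chart = base + geom_violin(alpha=0.7)\n",
    "violin_chart\n",
    "```\n",
    "\n",
    "In the toy data the ML scores cluster at both ends of the range, a shape the violin can show and the box cannot."
   ]
  },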
  {
   "cell_type": "markdown",
   "id": "31",
   "metadata": {},
   "source": [
    "<div style='background:#F5F3FF;border-left:5px solid #7C3AED;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#5B21B6;font-weight:bold'><i class=\"bi bi-lightbulb-fill\"></i> Pro Tip: Layers compose in the order you add them</span><br><br>\n",
    "<code>geom_point() + geom_smooth()</code> draws points first, then the trend line on top. Reversing the order draws the line first and the points on top of it. This matters most when a layer has a solid fill that would otherwise hide whatever was drawn before it.\n",
    "</div>"
   ]
  },
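  {
   "cell_type": "markdown",
   "id": "31a",
   "metadata": {},
   "source": [
    "A minimal sketch of the ordering rule on a hypothetical `demo` dataset: both charts contain the same two layers, and only the order of the `+` terms differs. The large `size=6` points are an illustrative choice that makes the overlap easy to see.\n",
    "\n",
    "```python\n",
    "from lets_plot import aes, geom_point, geom_smooth, ggplot\n",
    "\n",
    "# Hypothetical toy data\n",
    "demo = {\n",
    "    \"study_hours\": [2, 5, 8, 12, 15, 20],\n",
    "    \"exam_score\": [58, 63, 66, 72, 74, 83],\n",
    "}\n",
    "base = ggplot(demo, aes(x=\"study_hours\", y=\"exam_score\"))\n",
    "\n",
    "# Points first, so the trend line is drawn on top of them\n",
    "line_on_top = base + geom_point(size=6) + geom_smooth(method=\"lm\", color=\"#EE6677\")\n",
    "\n",
    "# Trend line and band first, so the points are drawn on top\n",
    "points_on_top = base + geom_smooth(method=\"lm\", color=\"#EE6677\") + geom_point(size=6)\n",
    "points_on_top\n",
    "```"
   ]
  },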
  {
   "cell_type": "markdown",
   "id": "32",
   "metadata": {},
   "source": [
    "<div style='background:#F5F3FF;border-left:5px solid #7C3AED;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#5B21B6;font-weight:bold'><i class=\"bi bi-lightbulb-fill\"></i> Pro Tip: Add marginal distribution plots with <code>ggmarginal()</code></span><br><br>\n",
    "Lets-Plot's <code>ggmarginal()</code> wraps any scatter plot with histograms or density curves along the x and y margins, letting you see both the bivariate relationship and the individual distributions without a separate figure:\n",
    "\n",
    "<pre style='background:#F4F5F6;padding:10px;border-radius:4px;font-size:0.9em'>from lets_plot import ggmarginal\n",
    "\n",
    "p = (\n",
    "    ggplot(results, aes(x=\"study_hours\", y=\"exam_score\", color=\"course\"))\n",
    "    + geom_point(alpha=0.4)\n",
    "    + geom_smooth(method=\"lm\", se=False)\n",
    ")\n",
    "ggmarginal(p, type=\"density\")</pre>\n",
    "\n",
    "<code>type</code> accepts <code>\"density\"</code> (KDE curves), <code>\"histogram\"</code>, or <code>\"boxplot\"</code>. The marginals automatically inherit the <code>color</code> mapping so each course gets its own density curve.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "33",
   "metadata": {},
   "source": [
    "## 5. Faceting: One Plot, Many Panels\n",
    "\n",
    "Part 5's three-panel histogram needed a manual loop over `axes.flat`, plus `sharey=True` to keep the comparison fair. `facet_wrap()` does both in one line, and **shares scales across panels by default**, the opposite of matplotlib's default and exactly what you want for a fair comparison:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "34",
   "metadata": {},
   "outputs": [],
   "source": [
    "from lets_plot import facet_wrap, geom_histogram, ggsize\n",
    "\n",
    "(\n",
    "    ggplot(results, aes(x=\"exam_score\"))\n",
    "    + geom_histogram(bins=15, fill=\"#4477AA\")\n",
    "    + facet_wrap(facets=\"course\")\n",
    "    + ggsize(700, 250)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "35",
   "metadata": {},
   "source": [
    "<div style='background:#EAF3FA;border-left:5px solid #0369A1;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#0369A1;font-weight:bold'><i class=\"bi bi-info-circle-fill\"></i> Key Concept: Faceting replaces a loop with a declaration</span><br><br>\n",
    "<code>facet_wrap(facets=\"course\")</code> splits the data by the <code>course</code> column and draws one panel per group, using the exact same <code>aes()</code> and <code>geom_*()</code> for every panel. There is no loop to write and no risk of accidentally giving one panel a different scale than the others, the bug from Part 5's Common Mistake callout. Add a second facet variable with <code>facet_grid()</code> for a full grid of panels.\n",
    "</div>"
   ]
  },
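  {
   "cell_type": "markdown",
   "id": "35a",
   "metadata": {},
   "source": [
    "Shared scales are a default, not a restriction. The sketch below opts out with the `scales` argument of `facet_wrap()`, where `\"free_y\"` lets each panel pick its own y range. `demo` is a hypothetical toy dataset with deliberately unequal group sizes.\n",
    "\n",
    "```python\n",
    "from lets_plot import aes, facet_wrap, geom_histogram, ggplot\n",
    "\n",
    "# Hypothetical toy data: ML has many more rows than Statistics\n",
    "demo = {\n",
    "    \"course\": [\"ML\"] * 8 + [\"Statistics\"] * 3,\n",
    "    \"exam_score\": [52, 58, 61, 64, 66, 70, 75, 81, 68, 72, 77],\n",
    "}\n",
    "base = ggplot(demo, aes(x=\"exam_score\")) + geom_histogram(bins=5, fill=\"#4477AA\")\n",
    "\n",
    "# Default: one y scale shared by every panel\n",
    "shared = base + facet_wrap(facets=\"course\")\n",
    "\n",
    "# Opt out: each panel scales its own y axis\n",
    "free_y = base + facet_wrap(facets=\"course\", scales=\"free_y\")\n",
    "free_y\n",
    "```\n",
    "\n",
    "Free scales make small groups easier to read but remove the visual guarantee that equal bar heights mean equal counts, so keep the shared default whenever the point of the chart is comparison."
   ]
  },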
  {
   "cell_type": "markdown",
   "id": "36",
   "metadata": {},
   "source": [
    "When you have two grouping variables, `facet_grid()` builds a full grid of panels: one row per level of one variable and one column per level of the other. For this dataset, rows are semesters and columns are courses:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "37",
   "metadata": {},
   "outputs": [],
   "source": [
    "from lets_plot import facet_grid\n",
    "\n",
    "(\n",
    "    ggplot(results, aes(x=\"exam_score\"))\n",
    "    + geom_histogram(bins=12, fill=\"#4477AA\")\n",
    "    + facet_grid(y=\"semester\", x=\"course\")\n",
    "    + ggsize(850, 450)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "38",
   "metadata": {},
   "source": [
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 2 - Facet by Semester</span><br><br>\n",
    "\n",
    "<b>Goal:</b> Build a scatter plot of <code>study_hours</code> vs. <code>exam_score</code>, coloured by <code>course</code>, faceted into one panel per <code>semester</code>.\n",
    "<pre style='background:#FFF8E1;padding:10px;border-radius:4px;font-size:0.9em'>ggplot(results, aes(x=\"study_hours\", y=\"exam_score\", color=\"course\")) \\\n",
    "    + geom_point(alpha=0.5) \\\n",
    "    + facet_wrap(facets=\"semester\")</pre>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "39",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: scatter plot, coloured by course, faceted by semester\n",
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "40",
   "metadata": {},
   "source": [
    "## 6. Titles, Labels, and Scales\n",
    "\n",
    "Every chart so far has used whatever axis labels and legend titles lets-plot generates by default: column names from the DataFrame, which are fine for exploration but not for sharing. `labs()` replaces all of them in one layer. `scale_color_manual()` and `scale_fill_manual()` give you exact control over which colour maps to which group. Passing `pro_colors` from `ark.plot.theme` uses the project's brand palette, the same one `modern_theme()` applies globally in Part 7:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "41",
   "metadata": {},
   "outputs": [],
   "source": [
    "from lets_plot import labs, scale_color_manual\n",
    "\n",
    "from ark.plot.theme import pro_colors\n",
    "\n",
    "(\n",
    "    ggplot(results, aes(x=\"study_hours\", y=\"exam_score\", color=\"course\"))\n",
    "    + geom_point(alpha=0.5)\n",
    "    + geom_smooth(se=False)\n",
    "    + labs(\n",
    "        title=\"Study hours versus exam score\",\n",
    "        subtitle=\"Each point is one student; lines show the per-course trend\",\n",
    "        x=\"Weekly study hours\",\n",
    "        y=\"Exam score (0-100)\",\n",
    "        color=\"Course\",\n",
    "    )\n",
    "    + scale_color_manual(values=pro_colors)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "42",
   "metadata": {},
   "source": [
    "<div style='background:#EAF3FA;border-left:5px solid #0369A1;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#0369A1;font-weight:bold'><i class=\"bi bi-info-circle-fill\"></i> Key Concept: <code>labs()</code> renames anything auto-generated from a column name</span><br><br>\n",
    "<code>labs(title=..., x=..., y=..., color=...)</code> is one more <code>+</code> layer. The named argument matches the aesthetic: use <code>color=</code> when you have <code>aes(color=\"course\")</code> and <code>fill=</code> when you have <code>aes(fill=\"course\")</code>. <code>pro_colors</code> from <code>ark.plot.theme</code> is the project's brand palette; passing it to <code>scale_color_manual()</code> connects this chart to the same colour system that <code>modern_theme()</code> applies globally in Part 7.\n",
    "</div>"
   ]
  },
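  {
   "cell_type": "markdown",
   "id": "42a",
   "metadata": {},
   "source": [
    "The same pattern applies to `fill`. This sketch uses a hypothetical `demo` dataset and two hex codes chosen purely for illustration (they are not the project's `pro_colors`):\n",
    "\n",
    "```python\n",
    "from lets_plot import aes, geom_bar, ggplot, labs, scale_fill_manual\n",
    "\n",
    "# Hypothetical pre-aggregated data: one row per course and semester\n",
    "demo = {\n",
    "    \"course\": [\"ML\", \"ML\", \"Statistics\", \"Statistics\"],\n",
    "    \"semester\": [\"Fall 2024\", \"Spring 2025\", \"Fall 2024\", \"Spring 2025\"],\n",
    "    \"exam_score\": [68.0, 72.0, 71.0, 75.0],\n",
    "}\n",
    "\n",
    "labelled = (\n",
    "    ggplot(demo, aes(x=\"course\", y=\"exam_score\", fill=\"semester\"))\n",
    "    + geom_bar(stat=\"identity\", position=\"dodge\")\n",
    "    # fill= renames the legend because the mapping is aes(fill=...)\n",
    "    + labs(title=\"Mean exam score by course\", x=\"Course\", y=\"Mean exam score\", fill=\"Semester\")\n",
    "    # One colour per semester, in the order the semesters appear\n",
    "    + scale_fill_manual(values=[\"#4477AA\", \"#EE6677\"])\n",
    ")\n",
    "labelled\n",
    "```"
   ]
  },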
  {
   "cell_type": "markdown",
   "id": "43",
   "metadata": {},
   "source": [
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 3 - Label the Density Chart</span><br><br>\n",
    "\n",
    "<b>Goal:</b> Take the <code>geom_density()</code> chart from Section 4 and add a proper title, axis labels, and legend title with <code>labs()</code>.\n",
    "<pre style='background:#FFF8E1;padding:10px;border-radius:4px;font-size:0.9em'>ggplot(results, aes(x=\"exam_score\", fill=\"course\")) + geom_density(alpha=0.4) \\\n",
    "    + labs(title=..., x=..., y=..., fill=...)</pre>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "44",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: add labs() to the density chart from Section 4\n",
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "45",
   "metadata": {},
   "source": [
    "## Capstone: The Three-Panel Report, Declaratively\n",
    "\n",
    "Part 5's capstone built a three-panel course report with a manual `(1, 3)` grid of Axes. Here, build the same idea (one histogram per course) as a **single faceted chart** instead of three separate panels assembled by hand."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "46",
   "metadata": {},
   "source": [
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Capstone Exercise - Faceted Course Report</span><br><br>\n",
    "\n",
    "<b>Goal:</b> Build one chart: a histogram of <code>exam_score</code>, faceted by <code>course</code>, with <code>geom_vline()</code> marking the overall mean score so each course's panel can be compared against it.\n",
    "<pre style='background:#FFF8E1;padding:10px;border-radius:4px;font-size:0.9em'>overall_mean = results[\"exam_score\"].mean()\n",
    "\n",
    "ggplot(results, aes(x=\"exam_score\")) \\\n",
    "    + geom_histogram(bins=15, fill=\"#4477AA\") \\\n",
    "    + geom_vline(xintercept=overall_mean, color=\"#EE6677\", linetype=\"dashed\") \\\n",
    "    + facet_wrap(facets=\"course\") \\\n",
    "    + ggsize(700, 250)</pre>\n",
    "<b>Hint:</b> <code>geom_vline()</code> takes a fixed <code>xintercept</code>, a setting, not a mapping, so it goes outside <code>aes()</code> just like the fixed colours in Sec. 3.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "47",
   "metadata": {},
   "outputs": [],
   "source": [
    "from lets_plot import geom_vline\n",
    "\n",
    "overall_mean = results[\"exam_score\"].mean()\n",
    "\n",
    "# TODO: faceted histogram with a reference line at overall_mean\n",
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "48",
   "metadata": {},
   "source": [
    "## Further Reading\n",
    "\n",
    "| Resource | Why it matters |\n",
    "|---|---|\n",
    "| Wilkinson, L. (2005). *The Grammar of Graphics*, 2nd ed. Springer. | The theory behind the layered grammar; Lets-Plot, ggplot2, and Vega-Altair are all implementations of this framework |\n",
    "| Wickham, H. (2010). [A layered grammar of graphics](https://doi.org/10.1198/jcgs.2009.07098). *Journal of Computational and Graphical Statistics* 19(1), 3–28. | The ggplot2 paper; most directly explains `geom_*`, `aes()`, and `stat_*` concepts that Lets-Plot mirrors |\n",
    "| [Lets-Plot documentation](https://lets-plot.org) | The primary API reference; the gallery is the fastest way to find the right `geom_*` |\n",
    "| [ggplot2 book](https://ggplot2-book.org) (Wickham, 2016) | Free online — Lets-Plot's API maps closely to ggplot2, so this book is directly useful |\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "49",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "| Concept | Key rule |\n",
    "|---|---|\n",
    "| Declarative vs. imperative | Describe the mapping, let the library decide how to draw it |\n",
    "| `ggplot(data, aes(...))` | Declares the dataset and which columns map to which visual property |\n",
    "| `geom_*()` | Decides what shape represents each row; swap it to change chart type |\n",
    "| Mapping | `aes(color=\"course\")`, a column that varies per row, gets a legend |\n",
    "| Setting | `color=\"#4477AA\"`, a fixed value, no legend |\n",
    "| `position=\"dodge\"` | Places bars for each group side by side instead of stacking them |\n",
    "| Layering | Stack `geom_*()` calls with `+`; later layers draw on top of earlier ones |\n",
    "| `geom_smooth()` / `geom_density()` | Statistical summaries, computed declaratively from raw data |\n",
    "| `geom_boxplot()` / `geom_violin()` | Distribution per group: box for five-number summary, violin for full shape |\n",
    "| `facet_wrap()` | One declaration replaces a manual subplot loop, with shared scales by default |\n",
    "| `facet_grid(rows=..., cols=...)` | Full grid of panels for two grouping variables |\n",
    "| `labs()` | Replaces any auto-generated axis label or legend title in one layer |\n",
    "| `scale_color_manual()` | Maps groups to specific colours, overriding the default palette |\n",
    "\n",
    "**Next:** `07-data-storytelling.ipynb`, covering what makes a chart good and applying this project's house style to both matplotlib and lets-plot."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "ark (3.12.12.final.0)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}