{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0",
   "metadata": {},
   "source": [
    "---\n",
    "title: \"Part 9: Pandas Operations\"\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1",
   "metadata": {},
   "source": [
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sambaiga/ds-mlops-path/blob/main/tutorials/01-python-basics/09-pandas-operations.ipynb) [![Download Notebook](https://img.shields.io/badge/Download-Notebook-blue.svg?logo=jupyter&logoColor=white)](https://raw.githubusercontent.com/sambaiga/ds-mlops-path/main/tutorials/01-python-basics/09-pandas-operations.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2",
   "metadata": {},
   "source": [
    "**DS-MLOps Data Analysis**\n",
    "\n",
    "**Python 3.12+ | Author: Anthony Faustine**\n",
    "\n",
    "## Before you begin\n",
    "\n",
    "This notebook continues directly from Part 8 (`08-pandas-core.ipynb`). It assumes you can read a CSV, select rows and columns with `.loc`/`.iloc`, and filter with a boolean mask. We reload the same dataset from scratch below, so you can run this notebook on its own.\n",
    "\n",
    "Part 8 covered the dataset's shape and structure. Part 9 covers the operations that turn raw columns into derived ones: applying your own functions row-by-row, counting and filtering categorical values, cleaning text columns, summarizing a dataset, and computing window statistics.\n",
    "\n",
    "Part 10 (`10-combining-reshaping.ipynb`) continues with combining datasets, `groupby`, pivoting, and time series.\n",
    "\n",
    "::: {.callout-note collapse=\"true\" icon=false}\n",
    "## Topics covered\n",
    "\n",
    "| Topic | Why it matters |\n",
    "|---|---|\n",
    "| **Pandas 3: `str` dtype and Copy-on-Write** | Two behaviour changes that affect every example in this notebook and the ones before it |\n",
    "| **`map` and `apply`** | Run your own function over a Series or DataFrame when there is no built-in vectorized operation |\n",
    "| **Value counts, membership, and encoding** | Understand a categorical column, filter on a set of values, and prepare it for a model |\n",
    "| **The `.str` accessor** | String methods on a Series need `.str`; calling them directly is the most common error here |\n",
    "| **Descriptive statistics** | `.describe()` and `.agg()` summarize a dataset in one call instead of many |\n",
    "| **Rolling and expanding windows** | Compute moving averages and cumulative statistics; the foundation of time-series feature engineering |\n",
    "| **`df.query()`** | Express multi-condition filters as a readable string; faster with numexpr on large DataFrames |\n",
    ":::\n",
    "\n",
    "> Callout markers used throughout this notebook are explained on the [book cover page](../../index.qmd#callout-guide)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3",
   "metadata": {},
   "source": [
    "::: {.callout-note collapse=\"true\" icon=false}\n",
    "## Learning Objectives\n",
    "\n",
    "By the end of Part 9 you will be able to:\n",
    "\n",
    "| # | Skill | Covered in |\n",
    "|---|---|---|\n",
    "| 1 | Recognise pandas 3's `str` dtype and Copy-on-Write behaviour | Sec. 0 |\n",
    "| 2 | Apply your own function to a Series or DataFrame with `map` and `apply` | Sec. 1 |\n",
    "| 3 | Count, filter, and encode categorical columns | Sec. 2 |\n",
    "| 4 | Clean and query text columns with the `.str` accessor | Sec. 3 |\n",
    "| 5 | Summarize a dataset with descriptive statistics in one call | Sec. 4 |\n",
    "| 6 | Compute rolling and expanding window statistics | Sec. 5a |\n",
    "| 7 | Write readable multi-condition filters with `df.query()` | Sec. 5b |\n",
    ":::\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "df = pd.read_csv(\"data/university_analytics.csv\")\n",
    "df[\"average_marks\"] = (df[\"midterm_score\"] + df[\"final_score\"] + df[\"project_score\"]) / 3\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5",
   "metadata": {},
   "source": [
    "## 0. What's New in Pandas 3\n",
    "\n",
    "This series is written against pandas 3, released in 2025, and two of its changes are worth understanding before going further, because they change what you see in everyday output."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6",
   "metadata": {},
   "source": [
    "Run `df.dtypes` and look at the text columns. In pandas 2 they showed up as `object`, the catch-all dtype for anything that is not a fixed-width number. Pandas 3 gives strings a dedicated `str` dtype instead, backed by PyArrow when it is installed:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8",
   "metadata": {},
   "source": [
    "<div style='background:#EAF3FA;border-left:5px solid #0369A1;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#0369A1;font-weight:bold'><i class=\"bi bi-info-circle-fill\"></i> Key Concept: A dedicated <code>str</code> dtype, not <code>object</code></span><br><br>\n",
    "<code>object</code> dtype could hold strings, but it could just as easily hold a mix of strings, numbers, and Python objects, with no guarantee which. The new <code>str</code> dtype can only hold strings or missing values, so a column typed <code>str</code> is a stronger guarantee than <code>object</code> ever was, and the PyArrow backing makes string operations faster. This is the same dtype the <code>.str</code> accessor in Sec. 3 operates on.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9",
   "metadata": {},
   "source": [
    "The second change does not show up in any output, it changes how assignment behaves. Pandas 3 makes Copy-on-Write the only behaviour: selecting a subset of a DataFrame never lets you accidentally modify the original through it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "10",
   "metadata": {},
   "outputs": [],
   "source": [
    "subset = df[[\"student_id\", \"average_marks\"]]\n",
    "subset.loc[0, \"average_marks\"] = 0.0\n",
    "print(f\"original unaffected: {df.loc[0, 'average_marks']}\")\n",
    "print(f\"subset changed     : {subset.loc[0, 'average_marks']}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "11",
   "metadata": {},
   "source": [
    "<div style='background:#EAF3FA;border-left:5px solid #0369A1;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#0369A1;font-weight:bold'><i class=\"bi bi-info-circle-fill\"></i> Key Concept: Copy-on-Write removes the <code>SettingWithCopyWarning</code> guessing game</span><br><br>\n",
    "Before pandas 3, whether <code>subset</code> above was a view into <code>df</code> or an independent copy depended on details of how <code>subset</code> was created, the source of the infamous <code>SettingWithCopyWarning</code>. Pandas 3 makes every such selection behave as an independent copy: modifying <code>subset</code> never touches <code>df</code>. The warning is gone because the ambiguity it warned about is gone.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "12",
   "metadata": {},
   "source": [
    "<div style='background:#F5F3FF;border-left:5px solid #7C3AED;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#5B21B6;font-weight:bold'><i class=\"bi bi-lightbulb-fill\"></i> Pro Tip: <code>pd.col()</code> previews the expression style Polars uses</span><br><br>\n",
    "Pandas 3 adds <code>pd.col(\"name\")</code> as an alternative to a <code>lambda</code> inside <code>.assign()</code>: <code>df.assign(c=pd.col(\"a\") + pd.col(\"b\"))</code> instead of <code>df.assign(c=lambda d: d[\"a\"] + d[\"b\"])</code>. It reads closer to the column-expression style used throughout Polars, covered in Part 11 of this series, and is worth recognising even though this notebook keeps using <code>apply</code> and <code>lambda</code>, the more widely used style today.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13",
   "metadata": {},
   "source": [
    "## 1. Function Application with `map` and `apply`\n",
    "\n",
    "NumPy's vectorized operations cover arithmetic and comparisons, but sometimes the transformation you need is your own logic: a lookup, a multi-branch rule, a calculation that combines several columns per row. `map` and `apply` are how pandas runs that logic."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "14",
   "metadata": {},
   "source": [
    "`Series.map()` is for simple one-to-one substitution: pass it a dict (or a function) and it replaces every value with its mapped equivalent. Recoding the `gender` column's short codes into full labels is a `map` problem, not an `apply` one:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "15",
   "metadata": {},
   "outputs": [],
   "source": [
    "gender_labels = {\"M\": \"Male\", \"F\": \"Female\"}\n",
    "df[\"gender_label\"] = df[\"gender\"].map(gender_labels)\n",
    "df[[\"gender\", \"gender_label\"]].head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "16",
   "metadata": {},
   "source": [
    "<div style='background:#EAF3FA;border-left:5px solid #0369A1;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#0369A1;font-weight:bold'><i class=\"bi bi-info-circle-fill\"></i> Key Concept: <code>map</code> is for substitution, <code>apply</code> is for logic</span><br><br>\n",
    "<code>map</code> expects a dict or a simple function and only ever looks at one value at a time. <code>apply</code> accepts any function, including one with branching logic, and can run over a Series (one value at a time) or a DataFrame (one row or one column at a time, depending on <code>axis</code>). Reach for <code>map</code> first; reach for <code>apply</code> when the rule does not fit in a lookup table.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "17",
   "metadata": {},
   "source": [
    "Converting a normalized mark (0 to 1) into a letter grade needs a multi-branch rule, a job for `apply` with a plain Python function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "18",
   "metadata": {},
   "outputs": [],
   "source": [
    "def letter_grade(mark: float) -> str:\n",
    "    if mark >= 80:\n",
    "        return \"A\"\n",
    "    elif mark >= 60:\n",
    "        return \"B\"\n",
    "    elif mark >= 40:\n",
    "        return \"C\"\n",
    "    else:\n",
    "        return \"D\"\n",
    "\n",
    "\n",
    "df[\"grade\"] = df[\"average_marks\"].apply(letter_grade)\n",
    "df[[\"student_id\", \"average_marks\", \"grade\"]].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19",
   "metadata": {},
   "source": [
    "`DataFrame.apply(..., axis=1)` runs a function once per row, with the whole row available as a Series. Use it when the rule needs more than one column at a time, for example flagging students who are both low-scoring and without internet access:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "20",
   "metadata": {},
   "outputs": [],
   "source": [
    "def at_risk(row: pd.Series) -> bool:\n",
    "    return row[\"average_marks\"] < 40 and not row[\"has_internet\"]\n",
    "\n",
    "\n",
    "df[\"at_risk\"] = df.apply(at_risk, axis=1)\n",
    "df[\"at_risk\"].sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "21",
   "metadata": {},
   "source": [
    "<div style='background:#FEF2F2;border-left:5px solid #DC2626;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#991B1B;font-weight:bold'><i class=\"bi bi-bug-fill\"></i> Common Mistake: Reaching for <code>apply(axis=1)</code> when a vectorized operation already does the job</span><br><br>\n",
    "<code>df.apply(at_risk, axis=1)</code> calls a Python function once per row, 17,190 times here, which is far slower than an equivalent boolean mask: <code>(df[\"average_marks\"] &lt; 0.4) &amp; (df[\"has_internet\"] == 0)</code> computes the same result with NumPy operating on whole columns at once. Use <code>apply</code> when the rule cannot be written with column-wise operations and comparisons; reach for masking first.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "22",
   "metadata": {},
   "source": [
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 1 - Grade Distribution</span><br><br>\n",
    "\n",
    "<b>Goal:</b> Write a function that returns <code>True</code> for marks of 0.6 or higher and <code>False</code> otherwise, apply it to <code>average_marks</code> to create a new column <code>passed</code>, then print how many students passed.\n",
    "<pre style='background:#FFF8E1;padding:10px;border-radius:4px;font-size:0.9em'>def passed_threshold(mark: float) -> bool:\n",
    "    ...\n",
    "\n",
    "\n",
    "df[\"passed\"] = df[\"average_marks\"].apply(passed_threshold)\n",
    "df[\"passed\"].sum()\n",
    "# -> 6011</pre>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "23",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: write passed_threshold, apply it, and print the count of students who passed\n",
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "24",
   "metadata": {},
   "source": [
    "## 2. Unique Values, Value Counts, and Membership\n",
    "\n",
    "`.unique()`, `.value_counts()`, and `.isin()` are usually the first three calls worth making on any categorical column, before any filtering or grouping."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "25",
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"program\"].unique()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "26",
   "metadata": {},
   "source": [
    "`.value_counts()` counts how many rows fall into each category, sorted from most to least common. Passing `normalize=True` turns the counts into proportions, which is what you want for a question like \"what fraction of students dropped out?\":"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "27",
   "metadata": {},
   "source": [
    "<div style='background:#EAF7F0;border-left:5px solid #059669;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#059669;font-weight:bold'><i class=\"bi bi-journal-code\"></i> Example: Pass rate from <code>value_counts</code></span><br><br>\n",
    "<code>df[\"passed\"].value_counts(normalize=True)</code> answers the pass-rate question in a single line, no manual division required.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "28",
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"passed\"].value_counts(normalize=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "29",
   "metadata": {},
   "source": [
    "`.isin()` filters rows whose value is in a given list, the categorical equivalent of a boolean comparison. Filtering to the two largest programs combines `.value_counts()` and `.isin()` directly:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "30",
   "metadata": {},
   "outputs": [],
   "source": [
    "top_programs = df[\"program\"].value_counts().head(2).index\n",
    "df_top_programs = df[df[\"program\"].isin(top_programs)]\n",
    "df_top_programs[\"program\"].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "31",
   "metadata": {},
   "source": [
    "<div style='background:#F5F3FF;border-left:5px solid #7C3AED;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#5B21B6;font-weight:bold'><i class=\"bi bi-lightbulb-fill\"></i> Pro Tip: <code>.nunique()</code> before <code>.unique()</code> on a column you have not seen yet</span><br><br>\n",
    "<code>.nunique()</code> returns just the count of distinct values. Running it before <code>.unique()</code> tells you whether printing every unique value is even a reasonable idea, useful for a column that turns out to have 4 categories versus one that turns out to have 4,000.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "32",
   "metadata": {},
   "source": [
    "A column with a small, fixed set of values is a candidate for the `category` dtype: pandas stores each value once and keeps a compact integer code per row instead of repeating the full string, which is most of `gender`, `program`, and `guardian` here:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "33",
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"program\"] = df[\"program\"].astype(\"category\")\n",
    "df[\"program\"].dtype"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "34",
   "metadata": {},
   "source": [
    "Most ML models need numbers, not category labels. `pd.get_dummies()` one-hot encodes a categorical column into one binary column per category, the standard first step before fitting a model on top of this data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "35",
   "metadata": {},
   "outputs": [],
   "source": [
    "program_dummies = pd.get_dummies(df[\"program\"], prefix=\"program\")\n",
    "program_dummies.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "36",
   "metadata": {},
   "source": [
    "<div style='background:#EAF3FA;border-left:5px solid #0369A1;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#0369A1;font-weight:bold'><i class=\"bi bi-info-circle-fill\"></i> Key Concept: <code>category</code> dtype for memory, one-hot encoding for models</span><br><br>\n",
    "<code>.astype(\"category\")</code> is about storage and speed: it does not change what a column means, only how compactly pandas stores it. <code>pd.get_dummies()</code> is about preparing data for a model that expects numeric input: it turns one categorical column into several binary columns. The two are often used together, category dtype while exploring, dummy columns right before training.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "37",
   "metadata": {},
   "source": [
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 2 - Guardian Breakdown</span><br><br>\n",
    "\n",
    "<b>Goal:</b> Print the value counts for the <code>guardian</code> column, then filter the DataFrame to students whose guardian is <code>\"mother\"</code> or <code>\"father\"</code> using <code>.isin()</code>, and print how many rows remain.\n",
    "<pre style='background:#FFF8E1;padding:10px;border-radius:4px;font-size:0.9em'>df[\"guardian\"].value_counts()\n",
    "\n",
    "parents = df[df[\"guardian\"].isin([\"mother\", \"father\"])]\n",
    "len(parents)</pre>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "38",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: print value counts for guardian, then filter to mother/father and print row count\n",
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "39",
   "metadata": {},
   "source": [
    "## 3. Working with Text Columns: the `.str` Accessor\n",
    "\n",
    "`student_id` is a string column, and pandas keeps every string method behind a `.str` accessor rather than directly on the Series. The accessor exists because a Series can hold any dtype, `.str` is what tells pandas you specifically want the string-handling behaviour."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "40",
   "metadata": {},
   "source": [
    "<div style='background:#FEF2F2;border-left:5px solid #DC2626;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#991B1B;font-weight:bold'><i class=\"bi bi-bug-fill\"></i> Common Mistake: Calling a string method directly on a Series</span><br><br>\n",
    "<code>df[\"student_id\"].upper()</code> raises an <code>AttributeError</code>: <code>Series</code> has no <code>upper</code> method. The string methods live on <code>df[\"student_id\"].str</code>, not on the Series itself, because the Series itself is a general-purpose container that happens to hold strings here.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "41",
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"student_id\"].str.upper().head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "42",
   "metadata": {},
   "source": [
    "`.str.len()` gives the length of every string at once, useful for spotting malformed values before they cause problems downstream:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "43",
   "metadata": {},
   "outputs": [],
   "source": [
    "id_lengths = df[\"student_id\"].str.len()\n",
    "id_lengths.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "44",
   "metadata": {},
   "source": [
    "`.str.extract()` pulls a regex capture group out of every value, the cleanest way to turn a structured string column into a usable numeric one. Every `student_id` here is the letter `S` followed by four digits (`S0001`–`S0400`), so extracting just the digits gives a numeric ID:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "45",
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"student_id\"].str.match(r\"^S\\d{4}$\").all()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "46",
   "metadata": {},
   "outputs": [],
   "source": [
    "numeric_id = df[\"student_id\"].str.extract(r\"S(\\d+)\")[0].astype(\"Int64\")\n",
    "df[\"numeric_id\"] = numeric_id\n",
    "df[[\"student_id\", \"numeric_id\"]].head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "47",
   "metadata": {},
   "source": [
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 3 - Validate and Filter by ID</span><br><br>\n",
    "\n",
    "<b>Goal:</b> Use <code>.str.startswith(\"S\")</code> to confirm every <code>student_id</code> starts with the letter <code>\"S\"</code>, then use <code>.str.contains()</code> to count how many contain the digit <code>\"0\"</code> anywhere in the ID.\n",
    "<pre style='background:#FFF8E1;padding:10px;border-radius:4px;font-size:0.9em'>df[\"student_id\"].str.startswith(\"S\").all()\n",
    "df[\"student_id\"].str.contains(\"0\").sum()</pre>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "48",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: confirm every student_id starts with \"s\", then count IDs containing \"0\"\n",
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "49",
   "metadata": {},
   "source": [
    "## 4. Descriptive Statistics and Summarization\n",
    "\n",
    "`.describe()` from Part 1 summarizes every numeric column with one call. `.agg()` goes a step further: it computes a chosen list of statistics for a chosen set of columns, in whatever combination you ask for."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "50",
   "metadata": {},
   "outputs": [],
   "source": [
    "df[[\"midterm_score\", \"final_score\", \"project_score\"]].describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "51",
   "metadata": {},
   "source": [
    "`.agg()` accepts a list of function names and applies every one of them to every selected column, returning a single small table instead of one Series per statistic:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "52",
   "metadata": {},
   "outputs": [],
   "source": [
    "df[[\"midterm_score\", \"final_score\", \"project_score\"]].agg([\"mean\", \"std\", \"min\", \"max\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "53",
   "metadata": {},
   "source": [
    "`.corr()` computes the pairwise correlation between numeric columns, a quick way to see whether strong performance in one subject tends to come with strong performance in another:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "54",
   "metadata": {},
   "outputs": [],
   "source": [
    "df[[\"midterm_score\", \"final_score\", \"project_score\"]].corr()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "55",
   "metadata": {},
   "source": [
    "<div style='background:#F5F3FF;border-left:5px solid #7C3AED;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#5B21B6;font-weight:bold'><i class=\"bi bi-lightbulb-fill\"></i> Pro Tip: <code>.agg()</code> over chaining several single-statistic calls</span><br><br>\n",
    "Five separate lines like <code>df[\"midterm_score\"].mean()</code>, <code>.std()</code>, <code>.min()</code> read fine in isolation but scatter related numbers across separate outputs. One <code>.agg([...])</code> call keeps them in a single table that is easier to read and to compare across columns.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "56",
   "metadata": {},
   "source": [
    "<div style='background:#F5F3FF;border-left:5px solid #7C3AED;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#5B21B6;font-weight:bold'><i class=\"bi bi-lightbulb-fill\"></i> Pro Tip: Use <code>groupby().transform()</code> to add a group statistic back to the original rows</span><br><br>\n",
    "<code>groupby().agg()</code> reduces a DataFrame to one row per group. <code>groupby().transform()</code> does the opposite: it computes the same group statistic but returns a Series the exact same length as the original DataFrame, aligned to the original index, so you can assign it as a new column without a merge:\n",
    "\n",
    "<pre style='background:#F4F5F6;padding:10px;border-radius:4px;font-size:0.9em'># agg → one row per program\n",
    "df.groupby(\"program\")[\"average_marks\"].agg(\"mean\")  # shape (4,)\n",
    "\n",
    "# transform → one row per student, aligned to original index\n",
    "df[\"caste_mean_marks\"] = df.groupby(\"program\")[\"average_marks\"].transform(\"mean\")\n",
    "# every student's row now has their caste's mean, ready for feature engineering</pre>\n",
    "\n",
    "Common use: <b>normalising within groups</b>. <code>(df[\"average_marks\"] - df[\"caste_mean_marks\"]) / df.groupby(\"program\")[\"average_marks\"].transform(\"std\")</code> z-scores each student <em>relative to their caste group</em>, not across the whole dataset, in two lines without any merge.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "57",
   "metadata": {},
   "source": [
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 4 - Subject Spread</span><br><br>\n",
    "\n",
    "<b>Goal:</b> Use <code>.agg()</code> to compute the mean and standard deviation of <code>midterm_score</code>, <code>final_score</code>, and <code>project_score</code> for students with <code>has_internet == True</code> only.\n",
    "<pre style='background:#FFF8E1;padding:10px;border-radius:4px;font-size:0.9em'>with_internet = df[df[\"has_internet\"]]\n",
    "with_internet[[\"midterm_score\", \"final_score\", \"project_score\"]].agg([\"mean\", \"std\"])</pre>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "58",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: agg mean and std for the three mark columns, restricted to internet == 1\n",
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "59",
   "metadata": {},
   "source": [
    "## 5. Window Operations and Expressive Queries\n",
    "\n",
    "Two operations that come up constantly in DS pipelines but do not fit neatly under \"string methods\" or \"statistics\": rolling windows for time-aware aggregation, and `df.query()` for readable multi-condition filters."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "60",
   "metadata": {},
   "source": [
    "### 5a. Rolling and Expanding Windows\n",
    "\n",
    "`rolling(n)` groups each row with the `n-1` rows before it and applies an aggregation. It is the standard tool for smoothing noisy signals, computing moving averages, or creating features that capture recent trend for a time-ordered dataset.\n",
    "\n",
    "`expanding()` is the cumulative version: each row aggregates everything from the start of the series up to that point. The first row is its own mean; the second is the mean of rows 1-2; and so on."
   ]
  },
  {
   "cell_type": "raw",
   "id": "61",
   "metadata": {
    "raw_mimetype": "text/markdown"
   },
   "source": [
    "> **Rolling window (size=3): slides one step at a time**\n",
    "\n",
    "```{mermaid}\n",
    "flowchart LR\n",
    "    subgraph W1 [\"window 1 (size=3)\"]\n",
    "        A1[\"day 1: 70\"] --> A2[\"day 2: 80\"] --> A3[\"day 3: 90\"]\n",
    "    end\n",
    "    subgraph W2 [\"window 2 (size=3)\"]\n",
    "        B1[\"day 2: 80\"] --> B2[\"day 3: 90\"] --> B3[\"day 4: 60\"]\n",
    "    end\n",
    "    subgraph W3 [\"window 3 (size=3)\"]\n",
    "        C1[\"day 3: 90\"] --> C2[\"day 4: 60\"] --> C3[\"day 5: 75\"]\n",
    "    end\n",
    "    W1 -->|mean = 80.0| R1[\"result[2]\"]\n",
    "    W2 -->|mean = 76.7| R2[\"result[3]\"]\n",
    "    W3 -->|mean = 75.0| R3[\"result[4]\"]\n",
    "\n",
    "    style R1 fill:#EBF5F0,stroke:#059669,color:#065F46\n",
    "    style R2 fill:#EAF3FA,stroke:#0369A1,color:#0C4A6E\n",
    "    style R3 fill:#F5F3FF,stroke:#7C3AED,color:#3B0764\n",
    "```\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "62",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sort by average_marks to give the rolling window a meaningful order here\n",
    "df_sorted = df.sort_values(\"average_marks\").reset_index(drop=True)\n",
    "\n",
    "# 5-row rolling mean: smooth out noise\n",
    "df_sorted[\"marks_rolling5\"] = df_sorted[\"average_marks\"].rolling(window=5).mean()\n",
    "\n",
    "# expanding mean: cumulative average as we move through sorted students\n",
    "df_sorted[\"marks_cumulative\"] = df_sorted[\"average_marks\"].expanding().mean()\n",
    "\n",
    "df_sorted[[\"student_id\", \"average_marks\", \"marks_rolling5\", \"marks_cumulative\"]].head(8)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "63",
   "metadata": {},
   "source": [
    "<div style='background:#EAF3FA;border-left:5px solid #0369A1;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#0369A1;font-weight:bold'><i class=\"bi bi-info-circle-fill\"></i> Key Concept: Rolling windows require ordered data to be meaningful</span><br><br>\n",
    "<code>rolling(5)</code> looks at the 5 rows immediately before each row in whatever order they sit in the DataFrame. On unsorted data the window picks up random rows and produces meaningless averages. Always sort by the relevant axis (time, score, sequence number) before calling <code>rolling()</code>. The NaN values in the first <code>n-1</code> rows are correct: there is not enough history to fill those windows yet.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "64",
   "metadata": {},
   "source": [
    "In time-series work, the pattern is almost always: sort by date, group by entity, then compute a rolling stat per group. The `groupby().transform()` tip from Section 4 applies here too:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "65",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Per-program 10-student rolling mean of average_marks\n",
    "# transform keeps the result aligned to the original index\n",
    "df[\"marks_prog_rolling10\"] = (\n",
    "    df.sort_values(\"average_marks\")\n",
    "    .groupby(\"program\")[\"average_marks\"]\n",
    "    .transform(lambda s: s.rolling(10, min_periods=3).mean())\n",
    ")\n",
    "df[[\"program\", \"average_marks\", \"marks_prog_rolling10\"]].dropna().head(6)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "66",
   "metadata": {},
   "source": [
    "### 5b. Expressive Filtering with `df.query()`\n",
    "\n",
    "Boolean masks work well for one or two conditions. For complex multi-condition filters, `df.query()` expresses the same logic as a readable string, closer to how you would say it out loud.\n",
    "\n",
    "Reference an external variable inside a query string with the `@` prefix:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "67",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Boolean mask version -- correct but harder to read\n",
    "mask = (df[\"midterm_score\"] > 70) & (df[\"final_score\"] > 70) & (df[\"program\"].isin([\"Engineering\", \"Sciences\"]))\n",
    "print(f\"mask result  : {mask.sum()} rows\")\n",
    "\n",
    "# Same filter as a query string -- reads like a sentence\n",
    "threshold = 70\n",
    "result = df.query(\"midterm_score > @threshold and final_score > @threshold and program in ['Engineering', 'Sciences']\")\n",
    "print(f\"query result : {len(result)} rows\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "68",
   "metadata": {},
   "source": [
    "<div style='background:#F5F3FF;border-left:5px solid #7C3AED;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#5B21B6;font-weight:bold'><i class=\"bi bi-lightbulb-fill\"></i> Pro Tip: Use <code>query()</code> for readability, masks for dynamic conditions</span><br><br>\n",
    "<code>df.query()</code> is evaluated with <code>numexpr</code> when the library is installed, which makes it faster than the equivalent boolean mask on large DataFrames. The downside: the query string is harder to build programmatically. Use <code>query()</code> when you are writing a fixed filter that a reader should understand at a glance; use a mask when the conditions are assembled at runtime from user input or a config.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "69",
   "metadata": {},
   "source": [
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Activity 5 - Smoothed Grade Distribution</span><br><br>\n",
    "\n",
    "<b>Goal:</b> Sort the full DataFrame by <code>average_marks</code> ascending. Compute a 20-student rolling mean of <code>average_marks</code>. Then use <code>df.query()</code> to find students with a rolling mean above 75 who are in the <code>\"Sciences\"</code> program. Print how many rows match.\n",
    "<pre style='background:#FFF8E1;padding:10px;border-radius:4px;font-size:0.9em'>df_q = df.sort_values(\"average_marks\").reset_index(drop=True)\n",
    "df_q[\"rolling_mean\"] = df_q[\"average_marks\"].rolling(20).mean()\n",
    "result = df_q.query(\"rolling_mean > 75 and program == 'Sciences'\")\n",
    "print(len(result))</pre>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "70",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: rolling mean, then query-filter to Sciences students above rolling threshold\n",
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "71",
   "metadata": {},
   "source": [
    "## Capstone: Risk and Performance Report\n",
    "\n",
    "Combine everything from this notebook into one short report: a derived grade column, a categorical breakdown, a text-column check, and a summary statistic, the same operations used in any first pass over a new dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "72",
   "metadata": {},
   "source": [
    "<div style='background:#EBF5F0;border-left:5px solid #009E73;padding:14px 18px;border-radius:6px;margin:16px 0'>\n",
    "<span style='color:#065F46;font-weight:bold'><i class=\"bi bi-puzzle-fill\"></i> Capstone Exercise - Risk and Performance Report</span><br><br>\n",
    "\n",
    "<b>Goal:</b>\n",
    "<ol>\n",
    "<li>Confirm every <code>student_id</code> matches the pattern <code>S</code> followed by 4 digits, using <code>.str.match()</code> (Sec. 3)</li>\n",
    "<li>Print the <code>final_grade</code> distribution as proportions with <code>.value_counts(normalize=True)</code> (Sec. 1, Sec. 2)</li>\n",
    "<li>Use <code>.agg()</code> to compute the mean <code>average_marks</code> for passed vs failed students separately (Sec. 4)</li>\n",
    "</ol>\n",
    "<pre style='background:#FFF8E1;padding:10px;border-radius:4px;font-size:0.9em'>valid_ids = df[\"student_id\"].str.match(r\"^S\\d{4}$\").all()\n",
    "\n",
    "grade_distribution = df[\"final_grade\"].value_counts(normalize=True)\n",
    "\n",
    "passing = df[df[\"passed\"] == True][\"average_marks\"].agg([\"mean\"])  # noqa: E712\n",
    "failing = df[df[\"passed\"] == False][\"average_marks\"].agg([\"mean\"])  # noqa: E712</pre>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "73",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: build the risk and performance report described above\n",
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "74",
   "metadata": {},
   "source": [
    "## Further Reading\n",
    "\n",
    "| Resource | Why it matters |\n",
    "|---|---|\n",
    "| McKinney, W. (2022). *Python for Data Analysis*, 3rd ed. O'Reilly. | Chapter 7 (data cleaning) and Chapter 10 (aggregation) are the canonical references for the operations in this notebook |\n",
    "| [pandas documentation — Indexing and selecting data](https://pandas.pydata.org/docs/user_guide/indexing.html) | Covers `.loc`, `.iloc`, boolean indexing, and `MultiIndex` in exhaustive detail |\n",
    "| [pandas documentation — Group by: split-apply-combine](https://pandas.pydata.org/docs/user_guide/groupby.html) | Official guide to `groupby`, `transform`, and `apply`; the examples use the same column types as this notebook |\n",
    "| Wickham, H. (2014). [Tidy data](https://doi.org/10.18637/jss.v059.i10). *Journal of Statistical Software* 59(10). | The conceptual framework behind `.melt()` and `.pivot_table()` introduced later in Part 10 |\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "75",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "| Concept | Key rule |\n",
    "|---|---|\n",
    "| `str` dtype | Pandas 3's default for text columns, backed by PyArrow, replacing `object` |\n",
    "| Copy-on-Write | Selecting a subset always behaves as an independent copy; the original is never modified through it |\n",
    "| `Series.map()` | One-to-one substitution with a dict or simple function |\n",
    "| `Series.apply()` | Run any function, including multi-branch logic, one value at a time |\n",
    "| `DataFrame.apply(axis=1)` | Run a function once per row, with the whole row available |\n",
    "| Vectorized ops vs `apply` | Prefer a boolean mask or arithmetic when one exists; `apply` is the fallback, not the default |\n",
    "| `.value_counts(normalize=True)` | Category proportions in one call |\n",
    "| `.isin()` | Filter rows whose value is in a given list |\n",
    "| `.astype(\"category\")` | Compact storage for a column with a small, fixed set of values |\n",
    "| `pd.get_dummies()` | One-hot encode a categorical column before fitting a model |\n",
    "| `.str` accessor | Required for any string method on a Series; calling the method directly raises `AttributeError` |\n",
    "| `.str.extract()` | Pull a regex capture group into a new column |\n",
    "| `.agg([...])` | Compute several statistics for several columns in one call, instead of chaining single-statistic calls |\n",
    "| `rolling(n)` | Compute a statistic over a sliding window of n rows; sort first so the window is meaningful |\n",
    "| `expanding()` | Cumulative statistic from the start of the series to the current row |\n",
    "| `groupby().transform(lambda s: s.rolling(...))` | Per-group rolling stat, result aligned to original index |\n",
    "| `df.query(\"expr\")` | Readable multi-condition filter; use `@var` to reference external variables |\n",
    "\n",
    "**Next:** `10-combining-reshaping.ipynb`, covering concatenation, merging, `groupby`, pivot tables, and time series."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}