{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "---\n", "title: \"Part 9: Pandas Operations\"\n", "---" ] }, { "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sambaiga/ds-mlops-path/blob/main/tutorials/01-python-basics/09-pandas-operations.ipynb) [![Download Notebook](https://img.shields.io/badge/Download-Notebook-blue.svg?logo=jupyter&logoColor=white)](https://raw.githubusercontent.com/sambaiga/ds-mlops-path/main/tutorials/01-python-basics/09-pandas-operations.ipynb)" ] }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "**DS-MLOps Data Analysis**\n", "\n", "**Python 3.12+ | Author: Anthony Faustine**\n", "\n", "## Before you begin\n", "\n", "This notebook continues directly from Part 8 (`08-pandas-core.ipynb`). It assumes you can read a CSV, select rows and columns with `.loc`/`.iloc`, and filter with a boolean mask. We reload the same dataset from scratch below, so you can run this notebook on its own.\n", "\n", "Part 8 covered the dataset's shape and structure. Part 9 covers the operations that turn raw columns into derived ones: applying your own functions row-by-row, counting and filtering categorical values, cleaning text columns, summarizing a dataset, and computing window statistics.\n", "\n", "Part 10 (`10-combining-reshaping.ipynb`) continues with combining datasets, `groupby`, pivoting, and time series.\n", "\n", "::: {.callout-note collapse=\"true\" icon=false}\n", "## Topics covered\n", "\n", "| Topic | Why it matters |\n", "|---|---|\n", "| **Pandas 3: `str` dtype and Copy-on-Write** | Two behaviour changes that affect every example in this notebook and the ones before it |\n", "| **`map` and `apply`** | Run your own function over a Series or DataFrame when there is no built-in vectorized operation |\n", "| **Value counts, membership, and encoding** | Understand a categorical column, filter on a set of values, and prepare it for a model |\n", "| **The `.str` accessor** | String methods on a Series need `.str`; calling them directly is the most common error here |\n", "| **Descriptive statistics** | `.describe()` and `.agg()` summarize a dataset in one call instead of many |\n", "| **Rolling and expanding windows** | Compute moving averages and cumulative statistics; the foundation of time-series feature engineering |\n", "| **`df.query()`** | Express multi-condition filters as a readable string; faster with numexpr on large DataFrames |\n", ":::\n", "\n", "> Callout markers used throughout this notebook are explained on the [book cover page](../../index.qmd#callout-guide)." ] }, { "cell_type": "markdown", "id": "3", "metadata": {}, "source": [ "::: {.callout-note collapse=\"true\" icon=false}\n", "## Learning Objectives\n", "\n", "By the end of Part 9 you will be able to:\n", "\n", "| # | Skill | Covered in |\n", "|---|---|---|\n", "| 1 | Recognise pandas 3's `str` dtype and Copy-on-Write behaviour | Sec. 0 |\n", "| 2 | Apply your own function to a Series or DataFrame with `map` and `apply` | Sec. 1 |\n", "| 3 | Count, filter, and encode categorical columns | Sec. 2 |\n", "| 4 | Clean and query text columns with the `.str` accessor | Sec. 3 |\n", "| 5 | Summarize a dataset with descriptive statistics in one call | Sec. 4 |\n", "| 6 | Compute rolling and expanding window statistics | Sec. 5a |\n", "| 7 | Write readable multi-condition filters with `df.query()` | Sec. 5b |\n", ":::\n" ] }, { "cell_type": "code", "execution_count": null, "id": "4", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "df = pd.read_csv(\"data/university_analytics.csv\")\n", "df[\"average_marks\"] = (df[\"midterm_score\"] + df[\"final_score\"] + df[\"project_score\"]) / 3\n", "df.head()" ] }, { "cell_type": "markdown", "id": "5", "metadata": {}, "source": [ "## 0. What's New in Pandas 3\n", "\n", "This series is written against pandas 3, released in 2025, and two of its changes are worth understanding before going further, because they change what you see in everyday output." ] }, { "cell_type": "markdown", "id": "6", "metadata": {}, "source": [ "Run `df.dtypes` and look at the text columns. In pandas 2 they showed up as `object`, the catch-all dtype for anything that is not a fixed-width number. Pandas 3 gives strings a dedicated `str` dtype instead, backed by PyArrow when it is installed:" ] }, { "cell_type": "code", "execution_count": null, "id": "7", "metadata": {}, "outputs": [], "source": [ "df.dtypes" ] }, { "cell_type": "markdown", "id": "8", "metadata": {}, "source": [ "
\n", " Key Concept: A dedicated str dtype, not object

\n", "object dtype could hold strings, but it could just as easily hold a mix of strings, numbers, and Python objects, with no guarantee which. The new str dtype can only hold strings or missing values, so a column typed str is a stronger guarantee than object ever was, and the PyArrow backing makes string operations faster. This is the same dtype the .str accessor in Sec. 3 operates on.\n", "
" ] }, { "cell_type": "markdown", "id": "9", "metadata": {}, "source": [ "The second change does not show up in any output, it changes how assignment behaves. Pandas 3 makes Copy-on-Write the only behaviour: selecting a subset of a DataFrame never lets you accidentally modify the original through it." ] }, { "cell_type": "code", "execution_count": null, "id": "10", "metadata": {}, "outputs": [], "source": [ "subset = df[[\"student_id\", \"average_marks\"]]\n", "subset.loc[0, \"average_marks\"] = 0.0\n", "print(f\"original unaffected: {df.loc[0, 'average_marks']}\")\n", "print(f\"subset changed : {subset.loc[0, 'average_marks']}\")" ] }, { "cell_type": "markdown", "id": "11", "metadata": {}, "source": [ "
\n", " Key Concept: Copy-on-Write removes the SettingWithCopyWarning guessing game

\n", "Before pandas 3, whether subset above was a view into df or an independent copy depended on details of how subset was created, the source of the infamous SettingWithCopyWarning. Pandas 3 makes every such selection behave as an independent copy: modifying subset never touches df. The warning is gone because the ambiguity it warned about is gone.\n", "
" ] }, { "cell_type": "markdown", "id": "12", "metadata": {}, "source": [ "
\n", " Pro Tip: pd.col() previews the expression style Polars uses

\n", "Pandas 3 adds pd.col(\"name\") as an alternative to a lambda inside .assign(): df.assign(c=pd.col(\"a\") + pd.col(\"b\")) instead of df.assign(c=lambda d: d[\"a\"] + d[\"b\"]). It reads closer to the column-expression style used throughout Polars, covered in Part 11 of this series, and is worth recognising even though this notebook keeps using apply and lambda, the more widely used style today.\n", "
" ] }, { "cell_type": "markdown", "id": "13", "metadata": {}, "source": [ "## 1. Function Application with `map` and `apply`\n", "\n", "NumPy's vectorized operations cover arithmetic and comparisons, but sometimes the transformation you need is your own logic: a lookup, a multi-branch rule, a calculation that combines several columns per row. `map` and `apply` are how pandas runs that logic." ] }, { "cell_type": "markdown", "id": "14", "metadata": {}, "source": [ "`Series.map()` is for simple one-to-one substitution: pass it a dict (or a function) and it replaces every value with its mapped equivalent. Recoding the `gender` column's short codes into full labels is a `map` problem, not an `apply` one:" ] }, { "cell_type": "code", "execution_count": null, "id": "15", "metadata": {}, "outputs": [], "source": [ "gender_labels = {\"M\": \"Male\", \"F\": \"Female\"}\n", "df[\"gender_label\"] = df[\"gender\"].map(gender_labels)\n", "df[[\"gender\", \"gender_label\"]].head(3)" ] }, { "cell_type": "markdown", "id": "16", "metadata": {}, "source": [ "
\n", " Key Concept: map is for substitution, apply is for logic

\n", "map expects a dict or a simple function and only ever looks at one value at a time. apply accepts any function, including one with branching logic, and can run over a Series (one value at a time) or a DataFrame (one row or one column at a time, depending on axis). Reach for map first; reach for apply when the rule does not fit in a lookup table.\n", "
" ] }, { "cell_type": "markdown", "id": "17", "metadata": {}, "source": [ "Converting a normalized mark (0 to 1) into a letter grade needs a multi-branch rule, a job for `apply` with a plain Python function:" ] }, { "cell_type": "code", "execution_count": null, "id": "18", "metadata": {}, "outputs": [], "source": [ "def letter_grade(mark: float) -> str:\n", " if mark >= 80:\n", " return \"A\"\n", " elif mark >= 60:\n", " return \"B\"\n", " elif mark >= 40:\n", " return \"C\"\n", " else:\n", " return \"D\"\n", "\n", "\n", "df[\"grade\"] = df[\"average_marks\"].apply(letter_grade)\n", "df[[\"student_id\", \"average_marks\", \"grade\"]].head()" ] }, { "cell_type": "markdown", "id": "19", "metadata": {}, "source": [ "`DataFrame.apply(..., axis=1)` runs a function once per row, with the whole row available as a Series. Use it when the rule needs more than one column at a time, for example flagging students who are both low-scoring and without internet access:" ] }, { "cell_type": "code", "execution_count": null, "id": "20", "metadata": {}, "outputs": [], "source": [ "def at_risk(row: pd.Series) -> bool:\n", " return row[\"average_marks\"] < 40 and not row[\"has_internet\"]\n", "\n", "\n", "df[\"at_risk\"] = df.apply(at_risk, axis=1)\n", "df[\"at_risk\"].sum()" ] }, { "cell_type": "markdown", "id": "21", "metadata": {}, "source": [ "
\n", " Common Mistake: Reaching for apply(axis=1) when a vectorized operation already does the job

\n", "df.apply(at_risk, axis=1) calls a Python function once per row, 17,190 times here, which is far slower than an equivalent boolean mask: (df[\"average_marks\"] < 0.4) & (df[\"has_internet\"] == 0) computes the same result with NumPy operating on whole columns at once. Use apply when the rule cannot be written with column-wise operations and comparisons; reach for masking first.\n", "
" ] }, { "cell_type": "markdown", "id": "22", "metadata": {}, "source": [ "
\n", " Activity 1 - Grade Distribution

\n", "\n", "Goal: Write a function that returns True for marks of 0.6 or higher and False otherwise, apply it to average_marks to create a new column passed, then print how many students passed.\n", "
def passed_threshold(mark: float) -> bool:\n",
    "    ...\n",
    "\n",
    "\n",
    "df[\"passed\"] = df[\"average_marks\"].apply(passed_threshold)\n",
    "df[\"passed\"].sum()\n",
    "# -> 6011
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "23", "metadata": {}, "outputs": [], "source": [ "# TODO: write passed_threshold, apply it, and print the count of students who passed\n", "..." ] }, { "cell_type": "markdown", "id": "24", "metadata": {}, "source": [ "## 2. Unique Values, Value Counts, and Membership\n", "\n", "`.unique()`, `.value_counts()`, and `.isin()` are usually the first three calls worth making on any categorical column, before any filtering or grouping." ] }, { "cell_type": "code", "execution_count": null, "id": "25", "metadata": {}, "outputs": [], "source": [ "df[\"program\"].unique()" ] }, { "cell_type": "markdown", "id": "26", "metadata": {}, "source": [ "`.value_counts()` counts how many rows fall into each category, sorted from most to least common. Passing `normalize=True` turns the counts into proportions, which is what you want for a question like \"what fraction of students dropped out?\":" ] }, { "cell_type": "markdown", "id": "27", "metadata": {}, "source": [ "
\n", " Example: Pass rate from value_counts

\n", "df[\"passed\"].value_counts(normalize=True) answers the pass-rate question in a single line, no manual division required.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "28", "metadata": {}, "outputs": [], "source": [ "df[\"passed\"].value_counts(normalize=True)" ] }, { "cell_type": "markdown", "id": "29", "metadata": {}, "source": [ "`.isin()` filters rows whose value is in a given list, the categorical equivalent of a boolean comparison. Filtering to the two largest programs combines `.value_counts()` and `.isin()` directly:" ] }, { "cell_type": "code", "execution_count": null, "id": "30", "metadata": {}, "outputs": [], "source": [ "top_programs = df[\"program\"].value_counts().head(2).index\n", "df_top_programs = df[df[\"program\"].isin(top_programs)]\n", "df_top_programs[\"program\"].value_counts()" ] }, { "cell_type": "markdown", "id": "31", "metadata": {}, "source": [ "
\n", " Pro Tip: .nunique() before .unique() on a column you have not seen yet

\n", ".nunique() returns just the count of distinct values. Running it before .unique() tells you whether printing every unique value is even a reasonable idea, useful for a column that turns out to have 4 categories versus one that turns out to have 4,000.\n", "
" ] }, { "cell_type": "markdown", "id": "32", "metadata": {}, "source": [ "A column with a small, fixed set of values is a candidate for the `category` dtype: pandas stores each value once and keeps a compact integer code per row instead of repeating the full string, which is most of `gender`, `program`, and `guardian` here:" ] }, { "cell_type": "code", "execution_count": null, "id": "33", "metadata": {}, "outputs": [], "source": [ "df[\"program\"] = df[\"program\"].astype(\"category\")\n", "df[\"program\"].dtype" ] }, { "cell_type": "markdown", "id": "34", "metadata": {}, "source": [ "Most ML models need numbers, not category labels. `pd.get_dummies()` one-hot encodes a categorical column into one binary column per category, the standard first step before fitting a model on top of this data:" ] }, { "cell_type": "code", "execution_count": null, "id": "35", "metadata": {}, "outputs": [], "source": [ "program_dummies = pd.get_dummies(df[\"program\"], prefix=\"program\")\n", "program_dummies.head()" ] }, { "cell_type": "markdown", "id": "36", "metadata": {}, "source": [ "
\n", " Key Concept: category dtype for memory, one-hot encoding for models

\n", ".astype(\"category\") is about storage and speed: it does not change what a column means, only how compactly pandas stores it. pd.get_dummies() is about preparing data for a model that expects numeric input: it turns one categorical column into several binary columns. The two are often used together, category dtype while exploring, dummy columns right before training.\n", "
" ] }, { "cell_type": "markdown", "id": "37", "metadata": {}, "source": [ "
\n", " Activity 2 - Guardian Breakdown

\n", "\n", "Goal: Print the value counts for the guardian column, then filter the DataFrame to students whose guardian is \"mother\" or \"father\" using .isin(), and print how many rows remain.\n", "
df[\"guardian\"].value_counts()\n",
    "\n",
    "parents = df[df[\"guardian\"].isin([\"mother\", \"father\"])]\n",
    "len(parents)
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "38", "metadata": {}, "outputs": [], "source": [ "# TODO: print value counts for guardian, then filter to mother/father and print row count\n", "..." ] }, { "cell_type": "markdown", "id": "39", "metadata": {}, "source": [ "## 3. Working with Text Columns: the `.str` Accessor\n", "\n", "`student_id` is a string column, and pandas keeps every string method behind a `.str` accessor rather than directly on the Series. The accessor exists because a Series can hold any dtype, `.str` is what tells pandas you specifically want the string-handling behaviour." ] }, { "cell_type": "markdown", "id": "40", "metadata": {}, "source": [ "
\n", " Common Mistake: Calling a string method directly on a Series

\n", "df[\"student_id\"].upper() raises an AttributeError: Series has no upper method. The string methods live on df[\"student_id\"].str, not on the Series itself, because the Series itself is a general-purpose container that happens to hold strings here.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "41", "metadata": {}, "outputs": [], "source": [ "df[\"student_id\"].str.upper().head(3)" ] }, { "cell_type": "markdown", "id": "42", "metadata": {}, "source": [ "`.str.len()` gives the length of every string at once, useful for spotting malformed values before they cause problems downstream:" ] }, { "cell_type": "code", "execution_count": null, "id": "43", "metadata": {}, "outputs": [], "source": [ "id_lengths = df[\"student_id\"].str.len()\n", "id_lengths.value_counts()" ] }, { "cell_type": "markdown", "id": "44", "metadata": {}, "source": [ "`.str.extract()` pulls a regex capture group out of every value, the cleanest way to turn a structured string column into a usable numeric one. Every `student_id` here is the letter `S` followed by four digits (`S0001`–`S0400`), so extracting just the digits gives a numeric ID:" ] }, { "cell_type": "code", "execution_count": null, "id": "45", "metadata": {}, "outputs": [], "source": [ "df[\"student_id\"].str.match(r\"^S\\d{4}$\").all()" ] }, { "cell_type": "code", "execution_count": null, "id": "46", "metadata": {}, "outputs": [], "source": [ "numeric_id = df[\"student_id\"].str.extract(r\"S(\\d+)\")[0].astype(\"Int64\")\n", "df[\"numeric_id\"] = numeric_id\n", "df[[\"student_id\", \"numeric_id\"]].head(3)" ] }, { "cell_type": "markdown", "id": "47", "metadata": {}, "source": [ "
\n", " Activity 3 - Validate and Filter by ID

\n", "\n", "Goal: Use .str.startswith(\"S\") to confirm every student_id starts with the letter \"S\", then use .str.contains() to count how many contain the digit \"0\" anywhere in the ID.\n", "
df[\"student_id\"].str.startswith(\"S\").all()\n",
    "df[\"student_id\"].str.contains(\"0\").sum()
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "48", "metadata": {}, "outputs": [], "source": [ "# TODO: confirm every student_id starts with \"s\", then count IDs containing \"0\"\n", "..." ] }, { "cell_type": "markdown", "id": "49", "metadata": {}, "source": [ "## 4. Descriptive Statistics and Summarization\n", "\n", "`.describe()` from Part 1 summarizes every numeric column with one call. `.agg()` goes a step further: it computes a chosen list of statistics for a chosen set of columns, in whatever combination you ask for." ] }, { "cell_type": "code", "execution_count": null, "id": "50", "metadata": {}, "outputs": [], "source": [ "df[[\"midterm_score\", \"final_score\", \"project_score\"]].describe()" ] }, { "cell_type": "markdown", "id": "51", "metadata": {}, "source": [ "`.agg()` accepts a list of function names and applies every one of them to every selected column, returning a single small table instead of one Series per statistic:" ] }, { "cell_type": "code", "execution_count": null, "id": "52", "metadata": {}, "outputs": [], "source": [ "df[[\"midterm_score\", \"final_score\", \"project_score\"]].agg([\"mean\", \"std\", \"min\", \"max\"])" ] }, { "cell_type": "markdown", "id": "53", "metadata": {}, "source": [ "`.corr()` computes the pairwise correlation between numeric columns, a quick way to see whether strong performance in one subject tends to come with strong performance in another:" ] }, { "cell_type": "code", "execution_count": null, "id": "54", "metadata": {}, "outputs": [], "source": [ "df[[\"midterm_score\", \"final_score\", \"project_score\"]].corr()" ] }, { "cell_type": "markdown", "id": "55", "metadata": {}, "source": [ "
\n", " Pro Tip: .agg() over chaining several single-statistic calls

\n", "Five separate lines like df[\"midterm_score\"].mean(), .std(), .min() read fine in isolation but scatter related numbers across separate outputs. One .agg([...]) call keeps them in a single table that is easier to read and to compare across columns.\n", "
" ] }, { "cell_type": "markdown", "id": "56", "metadata": {}, "source": [ "
\n", " Pro Tip: Use groupby().transform() to add a group statistic back to the original rows

\n", "groupby().agg() reduces a DataFrame to one row per group. groupby().transform() does the opposite: it computes the same group statistic but returns a Series the exact same length as the original DataFrame, aligned to the original index, so you can assign it as a new column without a merge:\n", "\n", "
# agg → one row per program\n",
    "df.groupby(\"program\")[\"average_marks\"].agg(\"mean\")  # shape (4,)\n",
    "\n",
    "# transform → one row per student, aligned to original index\n",
    "df[\"caste_mean_marks\"] = df.groupby(\"program\")[\"average_marks\"].transform(\"mean\")\n",
    "# every student's row now has their caste's mean, ready for feature engineering
\n", "\n", "Common use: normalising within groups. (df[\"average_marks\"] - df[\"caste_mean_marks\"]) / df.groupby(\"program\")[\"average_marks\"].transform(\"std\") z-scores each student relative to their caste group, not across the whole dataset, in two lines without any merge.\n", "
" ] }, { "cell_type": "markdown", "id": "57", "metadata": {}, "source": [ "
\n", " Activity 4 - Subject Spread

\n", "\n", "Goal: Use .agg() to compute the mean and standard deviation of midterm_score, final_score, and project_score for students with has_internet == True only.\n", "
with_internet = df[df[\"has_internet\"]]\n",
    "with_internet[[\"midterm_score\", \"final_score\", \"project_score\"]].agg([\"mean\", \"std\"])
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "58", "metadata": {}, "outputs": [], "source": [ "# TODO: agg mean and std for the three mark columns, restricted to internet == 1\n", "..." ] }, { "cell_type": "markdown", "id": "59", "metadata": {}, "source": [ "## 5. Window Operations and Expressive Queries\n", "\n", "Two operations that come up constantly in DS pipelines but do not fit neatly under \"string methods\" or \"statistics\": rolling windows for time-aware aggregation, and `df.query()` for readable multi-condition filters." ] }, { "cell_type": "markdown", "id": "60", "metadata": {}, "source": [ "### 5a. Rolling and Expanding Windows\n", "\n", "`rolling(n)` groups each row with the `n-1` rows before it and applies an aggregation. It is the standard tool for smoothing noisy signals, computing moving averages, or creating features that capture recent trend for a time-ordered dataset.\n", "\n", "`expanding()` is the cumulative version: each row aggregates everything from the start of the series up to that point. The first row is its own mean; the second is the mean of rows 1-2; and so on." ] }, { "cell_type": "raw", "id": "61", "metadata": { "raw_mimetype": "text/markdown" }, "source": [ "> **Rolling window (size=3): slides one step at a time**\n", "\n", "```{mermaid}\n", "flowchart LR\n", " subgraph W1 [\"window 1 (size=3)\"]\n", " A1[\"day 1: 70\"] --> A2[\"day 2: 80\"] --> A3[\"day 3: 90\"]\n", " end\n", " subgraph W2 [\"window 2 (size=3)\"]\n", " B1[\"day 2: 80\"] --> B2[\"day 3: 90\"] --> B3[\"day 4: 60\"]\n", " end\n", " subgraph W3 [\"window 3 (size=3)\"]\n", " C1[\"day 3: 90\"] --> C2[\"day 4: 60\"] --> C3[\"day 5: 75\"]\n", " end\n", " W1 -->|mean = 80.0| R1[\"result[2]\"]\n", " W2 -->|mean = 76.7| R2[\"result[3]\"]\n", " W3 -->|mean = 75.0| R3[\"result[4]\"]\n", "\n", " style R1 fill:#EBF5F0,stroke:#059669,color:#065F46\n", " style R2 fill:#EAF3FA,stroke:#0369A1,color:#0C4A6E\n", " style R3 fill:#F5F3FF,stroke:#7C3AED,color:#3B0764\n", "```\n" ] }, { "cell_type": "code", "execution_count": null, "id": "62", "metadata": {}, "outputs": [], "source": [ "# Sort by average_marks to give the rolling window a meaningful order here\n", "df_sorted = df.sort_values(\"average_marks\").reset_index(drop=True)\n", "\n", "# 5-row rolling mean: smooth out noise\n", "df_sorted[\"marks_rolling5\"] = df_sorted[\"average_marks\"].rolling(window=5).mean()\n", "\n", "# expanding mean: cumulative average as we move through sorted students\n", "df_sorted[\"marks_cumulative\"] = df_sorted[\"average_marks\"].expanding().mean()\n", "\n", "df_sorted[[\"student_id\", \"average_marks\", \"marks_rolling5\", \"marks_cumulative\"]].head(8)" ] }, { "cell_type": "markdown", "id": "63", "metadata": {}, "source": [ "
\n", " Key Concept: Rolling windows require ordered data to be meaningful

\n", "rolling(5) looks at the 5 rows immediately before each row in whatever order they sit in the DataFrame. On unsorted data the window picks up random rows and produces meaningless averages. Always sort by the relevant axis (time, score, sequence number) before calling rolling(). The NaN values in the first n-1 rows are correct: there is not enough history to fill those windows yet.\n", "
" ] }, { "cell_type": "markdown", "id": "64", "metadata": {}, "source": [ "In time-series work, the pattern is almost always: sort by date, group by entity, then compute a rolling stat per group. The `groupby().transform()` tip from Section 4 applies here too:" ] }, { "cell_type": "code", "execution_count": null, "id": "65", "metadata": {}, "outputs": [], "source": [ "# Per-program 10-student rolling mean of average_marks\n", "# transform keeps the result aligned to the original index\n", "df[\"marks_prog_rolling10\"] = (\n", " df.sort_values(\"average_marks\")\n", " .groupby(\"program\")[\"average_marks\"]\n", " .transform(lambda s: s.rolling(10, min_periods=3).mean())\n", ")\n", "df[[\"program\", \"average_marks\", \"marks_prog_rolling10\"]].dropna().head(6)" ] }, { "cell_type": "markdown", "id": "66", "metadata": {}, "source": [ "### 5b. Expressive Filtering with `df.query()`\n", "\n", "Boolean masks work well for one or two conditions. For complex multi-condition filters, `df.query()` expresses the same logic as a readable string, closer to how you would say it out loud.\n", "\n", "Reference an external variable inside a query string with the `@` prefix:" ] }, { "cell_type": "code", "execution_count": null, "id": "67", "metadata": {}, "outputs": [], "source": [ "# Boolean mask version -- correct but harder to read\n", "mask = (df[\"midterm_score\"] > 70) & (df[\"final_score\"] > 70) & (df[\"program\"].isin([\"Engineering\", \"Sciences\"]))\n", "print(f\"mask result : {mask.sum()} rows\")\n", "\n", "# Same filter as a query string -- reads like a sentence\n", "threshold = 70\n", "result = df.query(\"midterm_score > @threshold and final_score > @threshold and program in ['Engineering', 'Sciences']\")\n", "print(f\"query result : {len(result)} rows\")" ] }, { "cell_type": "markdown", "id": "68", "metadata": {}, "source": [ "
\n", " Pro Tip: Use query() for readability, masks for dynamic conditions

\n", "df.query() is evaluated with numexpr when the library is installed, which makes it faster than the equivalent boolean mask on large DataFrames. The downside: the query string is harder to build programmatically. Use query() when you are writing a fixed filter that a reader should understand at a glance; use a mask when the conditions are assembled at runtime from user input or a config.\n", "
" ] }, { "cell_type": "markdown", "id": "69", "metadata": {}, "source": [ "
\n", " Activity 5 - Smoothed Grade Distribution

\n", "\n", "Goal: Sort the full DataFrame by average_marks ascending. Compute a 20-student rolling mean of average_marks. Then use df.query() to find students with a rolling mean above 75 who are in the \"Sciences\" program. Print how many rows match.\n", "
df_q = df.sort_values(\"average_marks\").reset_index(drop=True)\n",
    "df_q[\"rolling_mean\"] = df_q[\"average_marks\"].rolling(20).mean()\n",
    "result = df_q.query(\"rolling_mean > 75 and program == 'Sciences'\")\n",
    "print(len(result))
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "70", "metadata": {}, "outputs": [], "source": [ "# TODO: rolling mean, then query-filter to Sciences students above rolling threshold\n", "..." ] }, { "cell_type": "markdown", "id": "71", "metadata": {}, "source": [ "## Capstone: Risk and Performance Report\n", "\n", "Combine everything from this notebook into one short report: a derived grade column, a categorical breakdown, a text-column check, and a summary statistic, the same operations used in any first pass over a new dataset." ] }, { "cell_type": "markdown", "id": "72", "metadata": {}, "source": [ "
\n", " Capstone Exercise - Risk and Performance Report

\n", "\n", "Goal:\n", "
    \n", "
  1. Confirm every student_id matches the pattern S followed by 4 digits, using .str.match() (Sec. 3)
  2. \n", "
  3. Print the final_grade distribution as proportions with .value_counts(normalize=True) (Sec. 1, Sec. 2)
  4. \n", "
  5. Use .agg() to compute the mean average_marks for passed vs failed students separately (Sec. 4)
  6. \n", "
\n", "
valid_ids = df[\"student_id\"].str.match(r\"^S\\d{4}$\").all()\n",
    "\n",
    "grade_distribution = df[\"final_grade\"].value_counts(normalize=True)\n",
    "\n",
    "passing = df[df[\"passed\"] == True][\"average_marks\"].agg([\"mean\"])  # noqa: E712\n",
    "failing = df[df[\"passed\"] == False][\"average_marks\"].agg([\"mean\"])  # noqa: E712
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "73", "metadata": {}, "outputs": [], "source": [ "# TODO: build the risk and performance report described above\n", "..." ] }, { "cell_type": "markdown", "id": "74", "metadata": {}, "source": [ "## Further Reading\n", "\n", "| Resource | Why it matters |\n", "|---|---|\n", "| McKinney, W. (2022). *Python for Data Analysis*, 3rd ed. O'Reilly. | Chapter 7 (data cleaning) and Chapter 10 (aggregation) are the canonical references for the operations in this notebook |\n", "| [pandas documentation — Indexing and selecting data](https://pandas.pydata.org/docs/user_guide/indexing.html) | Covers `.loc`, `.iloc`, boolean indexing, and `MultiIndex` in exhaustive detail |\n", "| [pandas documentation — Group by: split-apply-combine](https://pandas.pydata.org/docs/user_guide/groupby.html) | Official guide to `groupby`, `transform`, and `apply`; the examples use the same column types as this notebook |\n", "| Wickham, H. (2014). [Tidy data](https://doi.org/10.18637/jss.v059.i10). *Journal of Statistical Software* 59(10). | The conceptual framework behind `.melt()` and `.pivot_table()` introduced later in Part 10 |\n" ] }, { "cell_type": "markdown", "id": "75", "metadata": {}, "source": [ "## Summary\n", "\n", "| Concept | Key rule |\n", "|---|---|\n", "| `str` dtype | Pandas 3's default for text columns, backed by PyArrow, replacing `object` |\n", "| Copy-on-Write | Selecting a subset always behaves as an independent copy; the original is never modified through it |\n", "| `Series.map()` | One-to-one substitution with a dict or simple function |\n", "| `Series.apply()` | Run any function, including multi-branch logic, one value at a time |\n", "| `DataFrame.apply(axis=1)` | Run a function once per row, with the whole row available |\n", "| Vectorized ops vs `apply` | Prefer a boolean mask or arithmetic when one exists; `apply` is the fallback, not the default |\n", "| `.value_counts(normalize=True)` | Category proportions in one call |\n", "| `.isin()` | Filter rows whose value is in a given list |\n", "| `.astype(\"category\")` | Compact storage for a column with a small, fixed set of values |\n", "| `pd.get_dummies()` | One-hot encode a categorical column before fitting a model |\n", "| `.str` accessor | Required for any string method on a Series; calling the method directly raises `AttributeError` |\n", "| `.str.extract()` | Pull a regex capture group into a new column |\n", "| `.agg([...])` | Compute several statistics for several columns in one call, instead of chaining single-statistic calls |\n", "| `rolling(n)` | Compute a statistic over a sliding window of n rows; sort first so the window is meaningful |\n", "| `expanding()` | Cumulative statistic from the start of the series to the current row |\n", "| `groupby().transform(lambda s: s.rolling(...))` | Per-group rolling stat, result aligned to original index |\n", "| `df.query(\"expr\")` | Readable multi-condition filter; use `@var` to reference external variables |\n", "\n", "**Next:** `10-combining-reshaping.ipynb`, covering concatenation, merging, `groupby`, pivot tables, and time series." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.12" } }, "nbformat": 4, "nbformat_minor": 5 }