{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "---\n", "title: \"Part 9: Pandas Operations\"\n", "---" ] }, { "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "[](https://colab.research.google.com/github/sambaiga/ds-mlops-path/blob/main/tutorials/01-python-basics/09-pandas-operations.ipynb) [](https://raw.githubusercontent.com/sambaiga/ds-mlops-path/main/tutorials/01-python-basics/09-pandas-operations.ipynb)" ] }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "**DS-MLOps Data Analysis**\n", "\n", "**Python 3.12+ | Author: Anthony Faustine**\n", "\n", "## Before you begin\n", "\n", "This notebook continues directly from Part 8 (`08-pandas-core.ipynb`). It assumes you can read a CSV, select rows and columns with `.loc`/`.iloc`, and filter with a boolean mask. We reload the same dataset from scratch below, so you can run this notebook on its own.\n", "\n", "Part 8 covered the dataset's shape and structure. Part 9 covers the operations that turn raw columns into derived ones: applying your own functions row-by-row, counting and filtering categorical values, cleaning text columns, summarizing a dataset, and computing window statistics.\n", "\n", "Part 10 (`10-combining-reshaping.ipynb`) continues with combining datasets, `groupby`, pivoting, and time series.\n", "\n", "::: {.callout-note collapse=\"true\" icon=false}\n", "## Topics covered\n", "\n", "| Topic | Why it matters |\n", "|---|---|\n", "| **Pandas 3: `str` dtype and Copy-on-Write** | Two behaviour changes that affect every example in this notebook and the ones before it |\n", "| **`map` and `apply`** | Run your own function over a Series or DataFrame when there is no built-in vectorized operation |\n", "| **Value counts, membership, and encoding** | Understand a categorical column, filter on a set of values, and prepare it for a model |\n", "| **The `.str` accessor** | String methods on a Series need `.str`; calling them directly is the most common error here |\n", "| **Descriptive statistics** | `.describe()` and `.agg()` summarize a dataset in one call instead of many |\n", "| **Rolling and expanding windows** | Compute moving averages and cumulative statistics; the foundation of time-series feature engineering |\n", "| **`df.query()`** | Express multi-condition filters as a readable string; faster with numexpr on large DataFrames |\n", ":::\n", "\n", "> Callout markers used throughout this notebook are explained on the [book cover page](../../index.qmd#callout-guide)." ] }, { "cell_type": "markdown", "id": "3", "metadata": {}, "source": [ "::: {.callout-note collapse=\"true\" icon=false}\n", "## Learning Objectives\n", "\n", "By the end of Part 9 you will be able to:\n", "\n", "| # | Skill | Covered in |\n", "|---|---|---|\n", "| 1 | Recognise pandas 3's `str` dtype and Copy-on-Write behaviour | Sec. 0 |\n", "| 2 | Apply your own function to a Series or DataFrame with `map` and `apply` | Sec. 1 |\n", "| 3 | Count, filter, and encode categorical columns | Sec. 2 |\n", "| 4 | Clean and query text columns with the `.str` accessor | Sec. 3 |\n", "| 5 | Summarize a dataset with descriptive statistics in one call | Sec. 4 |\n", "| 6 | Compute rolling and expanding window statistics | Sec. 5a |\n", "| 7 | Write readable multi-condition filters with `df.query()` | Sec. 5b |\n", ":::\n" ] }, { "cell_type": "code", "execution_count": null, "id": "4", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "df = pd.read_csv(\"data/university_analytics.csv\")\n", "df[\"average_marks\"] = (df[\"midterm_score\"] + df[\"final_score\"] + df[\"project_score\"]) / 3\n", "df.head()" ] }, { "cell_type": "markdown", "id": "5", "metadata": {}, "source": [ "## 0. What's New in Pandas 3\n", "\n", "This series is written against pandas 3, released in 2025, and two of its changes are worth understanding before going further, because they change what you see in everyday output." ] }, { "cell_type": "markdown", "id": "6", "metadata": {}, "source": [ "Run `df.dtypes` and look at the text columns. In pandas 2 they showed up as `object`, the catch-all dtype for anything that is not a fixed-width number. Pandas 3 gives strings a dedicated `str` dtype instead, backed by PyArrow when it is installed:" ] }, { "cell_type": "code", "execution_count": null, "id": "7", "metadata": {}, "outputs": [], "source": [ "df.dtypes" ] }, { "cell_type": "markdown", "id": "8", "metadata": {}, "source": [ "
str dtype, not objectobject dtype could hold strings, but it could just as easily hold a mix of strings, numbers, and Python objects, with no guarantee which. The new str dtype can only hold strings or missing values, so a column typed str is a stronger guarantee than object ever was, and the PyArrow backing makes string operations faster. This is the same dtype the .str accessor in Sec. 3 operates on.\n",
"SettingWithCopyWarning guessing gamesubset above was a view into df or an independent copy depended on details of how subset was created, the source of the infamous SettingWithCopyWarning. Pandas 3 makes every such selection behave as an independent copy: modifying subset never touches df. The warning is gone because the ambiguity it warned about is gone.\n",
"pd.col() previews the expression style Polars usespd.col(\"name\") as an alternative to a lambda inside .assign(): df.assign(c=pd.col(\"a\") + pd.col(\"b\")) instead of df.assign(c=lambda d: d[\"a\"] + d[\"b\"]). It reads closer to the column-expression style used throughout Polars, covered in Part 11 of this series, and is worth recognising even though this notebook keeps using apply and lambda, the more widely used style today.\n",
"map is for substitution, apply is for logicmap expects a dict or a simple function and only ever looks at one value at a time. apply accepts any function, including one with branching logic, and can run over a Series (one value at a time) or a DataFrame (one row or one column at a time, depending on axis). Reach for map first; reach for apply when the rule does not fit in a lookup table.\n",
"apply(axis=1) when a vectorized operation already does the jobdf.apply(at_risk, axis=1) calls a Python function once per row, 17,190 times here, which is far slower than an equivalent boolean mask: (df[\"average_marks\"] < 0.4) & (df[\"has_internet\"] == 0) computes the same result with NumPy operating on whole columns at once. Use apply when the rule cannot be written with column-wise operations and comparisons; reach for masking first.\n",
"True for marks of 0.6 or higher and False otherwise, apply it to average_marks to create a new column passed, then print how many students passed.\n",
"def passed_threshold(mark: float) -> bool:\n",
" ...\n",
"\n",
"\n",
"df[\"passed\"] = df[\"average_marks\"].apply(passed_threshold)\n",
"df[\"passed\"].sum()\n",
"# -> 6011\n",
"value_countsdf[\"passed\"].value_counts(normalize=True) answers the pass-rate question in a single line, no manual division required.\n",
".nunique() before .unique() on a column you have not seen yet.nunique() returns just the count of distinct values. Running it before .unique() tells you whether printing every unique value is even a reasonable idea, useful for a column that turns out to have 4 categories versus one that turns out to have 4,000.\n",
"category dtype for memory, one-hot encoding for models.astype(\"category\") is about storage and speed: it does not change what a column means, only how compactly pandas stores it. pd.get_dummies() is about preparing data for a model that expects numeric input: it turns one categorical column into several binary columns. The two are often used together, category dtype while exploring, dummy columns right before training.\n",
"guardian column, then filter the DataFrame to students whose guardian is \"mother\" or \"father\" using .isin(), and print how many rows remain.\n",
"df[\"guardian\"].value_counts()\n",
"\n",
"parents = df[df[\"guardian\"].isin([\"mother\", \"father\"])]\n",
"len(parents)\n",
"df[\"student_id\"].upper() raises an AttributeError: Series has no upper method. The string methods live on df[\"student_id\"].str, not on the Series itself, because the Series itself is a general-purpose container that happens to hold strings here.\n",
".str.startswith(\"S\") to confirm every student_id starts with the letter \"S\", then use .str.contains() to count how many contain the digit \"0\" anywhere in the ID.\n",
"df[\"student_id\"].str.startswith(\"S\").all()\n",
"df[\"student_id\"].str.contains(\"0\").sum()\n",
".agg() over chaining several single-statistic callsdf[\"midterm_score\"].mean(), .std(), .min() read fine in isolation but scatter related numbers across separate outputs. One .agg([...]) call keeps them in a single table that is easier to read and to compare across columns.\n",
"groupby().transform() to add a group statistic back to the original rowsgroupby().agg() reduces a DataFrame to one row per group. groupby().transform() does the opposite: it computes the same group statistic but returns a Series the exact same length as the original DataFrame, aligned to the original index, so you can assign it as a new column without a merge:\n",
"\n",
"# agg → one row per program\n",
"df.groupby(\"program\")[\"average_marks\"].agg(\"mean\") # shape (4,)\n",
"\n",
"# transform → one row per student, aligned to original index\n",
"df[\"caste_mean_marks\"] = df.groupby(\"program\")[\"average_marks\"].transform(\"mean\")\n",
"# every student's row now has their caste's mean, ready for feature engineering\n",
"\n",
"Common use: normalising within groups. (df[\"average_marks\"] - df[\"caste_mean_marks\"]) / df.groupby(\"program\")[\"average_marks\"].transform(\"std\") z-scores each student relative to their caste group, not across the whole dataset, in two lines without any merge.\n",
".agg() to compute the mean and standard deviation of midterm_score, final_score, and project_score for students with has_internet == True only.\n",
"with_internet = df[df[\"has_internet\"]]\n",
"with_internet[[\"midterm_score\", \"final_score\", \"project_score\"]].agg([\"mean\", \"std\"])\n",
"rolling(5) looks at the 5 rows immediately before each row in whatever order they sit in the DataFrame. On unsorted data the window picks up random rows and produces meaningless averages. Always sort by the relevant axis (time, score, sequence number) before calling rolling(). The NaN values in the first n-1 rows are correct: there is not enough history to fill those windows yet.\n",
"query() for readability, masks for dynamic conditionsdf.query() is evaluated with numexpr when the library is installed, which makes it faster than the equivalent boolean mask on large DataFrames. The downside: the query string is harder to build programmatically. Use query() when you are writing a fixed filter that a reader should understand at a glance; use a mask when the conditions are assembled at runtime from user input or a config.\n",
"average_marks ascending. Compute a 20-student rolling mean of average_marks. Then use df.query() to find students with a rolling mean above 75 who are in the \"Sciences\" program. Print how many rows match.\n",
"df_q = df.sort_values(\"average_marks\").reset_index(drop=True)\n",
"df_q[\"rolling_mean\"] = df_q[\"average_marks\"].rolling(20).mean()\n",
"result = df_q.query(\"rolling_mean > 75 and program == 'Sciences'\")\n",
"print(len(result))\n",
"student_id matches the pattern S followed by 4 digits, using .str.match() (Sec. 3)final_grade distribution as proportions with .value_counts(normalize=True) (Sec. 1, Sec. 2).agg() to compute the mean average_marks for passed vs failed students separately (Sec. 4)valid_ids = df[\"student_id\"].str.match(r\"^S\\d{4}$\").all()\n",
"\n",
"grade_distribution = df[\"final_grade\"].value_counts(normalize=True)\n",
"\n",
"passing = df[df[\"passed\"] == True][\"average_marks\"].agg([\"mean\"]) # noqa: E712\n",
"failing = df[df[\"passed\"] == False][\"average_marks\"].agg([\"mean\"]) # noqa: E712\n",
"