{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "---\n", "title: \"Part 11: Polars & the Expression API\"\n", "---" ] }, { "attachments": {}, "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sambaiga/ds-mlops-path/blob/main/tutorials/01-python-basics/11-polars.ipynb) [![Download Notebook](https://img.shields.io/badge/Download-Notebook-blue.svg?logo=jupyter&logoColor=white)](https://raw.githubusercontent.com/sambaiga/ds-mlops-path/main/tutorials/01-python-basics/11-polars.ipynb)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "**DS-MLOps Data Analysis**\n", "\n", "**Python 3.12+ | Author: Anthony Faustine**\n", "\n", "## Before you begin\n", "\n", "This notebook continues from Part 10 (`10-combining-reshaping.ipynb`). You should be comfortable with pandas DataFrames, `groupby`, merging, and time-indexed data before starting here.\n", "\n", "Polars is a modern DataFrame library written in Rust. It uses an expression-based API and a lazy evaluation model that lets it optimize queries before running them. The same two datasets from Parts 8-10 are used here so you can compare pandas and Polars syntax on familiar ground.\n", "\n", "::: {.callout-note collapse=\"true\" icon=false}\n", "## Topics covered\n", "\n", "| Topic | Why it matters |\n", "|---|---|\n", "| **DataFrame construction** | Polars and pandas share the concept; the API differs in key ways |\n", "| **Expression API** | `pl.col()` expressions replace bracket indexing and make queries composable |\n", "| **GroupBy and aggregation** | Same split-apply-combine logic, cleaner syntax |\n", "| **LazyFrame** | Defer execution and let Polars optimize the query plan before running |\n", "| **Date/time** | Polars parses and handles dates natively, no `.dt` accessor for most operations |\n", "| **Migration patterns** | Side-by-side pandas-to-Polars translations for the operations you already know |\n", "| **`over()` and streaming** | Group-aware window stats without reducing rows; process files larger than RAM in chunks |\n", ":::\n", "\n", "> Callout markers used throughout this notebook are explained on the [book cover page](../../index.qmd#callout-guide)." ] }, { "attachments": {}, "cell_type": "markdown", "id": "3", "metadata": {}, "source": [ "::: {.callout-note collapse=\"true\" icon=false}\n", "## Learning Objectives\n", "\n", "By the end of Part 11 (Polars) you will be able to:\n", "\n", "| # | Skill | Covered in |\n", "|---|---|---|\n", "| 1 | Build and inspect Polars DataFrames and compare them to pandas | Sec. 1 |\n", "| 2 | Use `pl.col()` expressions to select, filter, and derive columns | Sec. 2 |\n", "| 3 | Aggregate data with `group_by` and the expression API | Sec. 3 |\n", "| 4 | Understand eager vs lazy evaluation and use `LazyFrame` | Sec. 4 |\n", "| 5 | Parse and manipulate dates with Polars's built-in datetime support | Sec. 5 |\n", "| 6 | Translate common pandas operations to Polars equivalents | Sec. 6 |\n", "| 7 | Compute per-group window stats with `over()` and process large files with `scan_csv` | Sec. 7 |\n", ":::\n" ] }, { "cell_type": "code", "execution_count": null, "id": "4", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import polars as pl\n", "\n", "# Polars reads CSV directly; no separate parse step needed for standard columns\n", "df = pl.read_csv(\"data/university_analytics.csv\")\n", "print(f\"shape : {df.shape}\")\n", "df.head()" ] }, { "cell_type": "markdown", "id": "5", "metadata": {}, "source": [ "## 0. When pandas Is Not Enough\n", "\n", "Your pandas pipeline works perfectly at 50,000 rows. The `groupby` runs in a second, the merge finishes before you blink, and everything fits in memory with room to spare. Then someone drops a 10-million-row export on your desk. The same pipeline now takes three minutes, swaps to disk halfway through, and your laptop fan starts spinning like it is trying to take off.\n", "\n", "This is not a pandas bug. It is a design trade-off. pandas processes one column at a time on a single CPU thread, using a Python object model that was designed for flexibility rather than throughput. For most interactive data science that trade-off is exactly right. But when data gets large enough to feel slow, you need a different tool.\n", "\n", "**Polars** ([pola.rs](https://pola.rs)) was written from scratch in Rust by Ritchie Vink in 2020. It stores data in Apache Arrow format, executes operations across all CPU cores in parallel, and evaluates expressions *lazily* — building a query plan and optimising it before touching a single row. The same operations that take minutes in pandas can take seconds in Polars, without changing what the code looks like.\n", "\n", "### How it compares\n", "\n", "| | pandas | Polars |\n", "| --- | --- | --- |\n", "| Core language | Python / C | Rust |\n", "| Execution | Single-threaded | Multi-threaded |\n", "| Memory model | NumPy / Python objects | Apache Arrow |\n", "| Evaluation | Eager (runs immediately) | Lazy by default (`scan_*` → `.collect()`) |\n", "| API style | Index-based, many methods | Expression-based, composable |\n", "| Ecosystem maturity | Very mature (2008) | Fast-growing (2020) |\n", "| When to use | Up to ~5M rows; default choice | Large files, performance-critical pipelines |\n", "\n", "You do not have to choose one forever. Many production pipelines read and clean with Polars for speed, then convert to pandas for the parts of the ML ecosystem that expect it.\n", "\n", "### Already in your environment\n", "\n", "```bash\n", "uv add polars # for a standalone project\n", "```\n", "\n", "Official docs and user guide: [docs.pola.rs](https://docs.pola.rs/)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "6", "metadata": {}, "source": [ "## 1. DataFrame Construction" ] }, { "attachments": {}, "cell_type": "markdown", "id": "7", "metadata": {}, "source": [ "Building a Polars DataFrame from a Python dict looks almost identical to pandas. The difference shows up in the schema: Polars always knows the dtype of every column, and it tells you explicitly:" ] }, { "cell_type": "code", "execution_count": null, "id": "8", "metadata": {}, "outputs": [], "source": [ "import polars as pl\n", "\n", "sample = pl.DataFrame(\n", " {\n", " \"student_id\": [\"s001\", \"s002\", \"s003\"],\n", " \"midterm_score\": [0.62, 0.78, 0.91],\n", " \"gender\": [\"F\", \"M\", \"F\"],\n", " }\n", ")\n", "print(sample.schema)\n", "sample" ] }, { "attachments": {}, "cell_type": "markdown", "id": "9", "metadata": {}, "source": [ "
\n", " Key Concept: Polars enforces strict dtypes; pandas 3 infers them

\n", "Every Polars column has a fixed dtype from creation: pl.Utf8 for strings, pl.Float64 for floats, pl.Int64 for integers. Polars raises an error if you try to mix types within a column. Pandas 3's str dtype is conceptually similar but more permissive: it silently allows None values in a string column. Polars is stricter, which makes bugs surface earlier.\n", "
" ] }, { "attachments": {}, "cell_type": "markdown", "id": "10", "metadata": {}, "source": [ "The first checks after loading data are the same in Polars as in pandas, but the method names differ slightly:" ] }, { "cell_type": "code", "execution_count": null, "id": "11", "metadata": {}, "outputs": [], "source": [ "print(f\"shape : {df.shape}\")\n", "print(f\"dtypes :\\n{df.dtypes}\")\n", "df.describe()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "12", "metadata": {}, "source": [ "
\n", " Pro Tip: Use df.schema to see column names and dtypes in one dict-like view

\n", "df.schema returns a Schema object, a mapping from column name to Polars dtype. It is the fastest way to confirm that read_csv inferred the right types before any analysis starts.\n", "
" ] }, { "attachments": {}, "cell_type": "markdown", "id": "13", "metadata": {}, "source": [ "
\n", " Activity 1 - First Look at the Polars DataFrame

\n", "Goal: Print the schema of df, then use df.describe() to get summary statistics. Compare the dtype labels to the pandas output from Part 8.\n", "
print(df.schema)\n",
    "df.describe()
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "14", "metadata": {}, "outputs": [], "source": [ "# TODO: print schema, then call df.describe()\n", "..." ] }, { "attachments": {}, "cell_type": "markdown", "id": "15", "metadata": {}, "source": [ "## 2. The Expression API: `select`, `filter`, `with_columns`" ] }, { "attachments": {}, "cell_type": "markdown", "id": "16", "metadata": {}, "source": [ "The biggest difference from pandas is that Polars does not use bracket indexing for column selection or boolean filtering. Instead, every operation goes through `pl.col()` expressions, which describe what to compute without computing it immediately." ] }, { "cell_type": "code", "execution_count": null, "id": "17", "metadata": {}, "outputs": [], "source": [ "# Select specific columns -- equivalent to df[[\"col_a\", \"col_b\"]] in pandas\n", "df.select([\"student_id\", \"midterm_score\", \"final_score\"]).head(3)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "18", "metadata": {}, "source": [ "
\n", " Key Concept: pl.col() is an expression, not a value

\n", "pl.col(\"midterm_score\") describes a column reference, not the column itself. Expressions compose: pl.col(\"a\") + pl.col(\"b\") describes an addition without performing it. Polars evaluates all expressions in a .select() or .with_columns() call together, which is what lets it parallelize across columns automatically.\n", "
" ] }, { "attachments": {}, "cell_type": "markdown", "id": "19", "metadata": {}, "source": [ "`filter` is the Polars equivalent of boolean indexing:" ] }, { "cell_type": "code", "execution_count": null, "id": "20", "metadata": {}, "outputs": [], "source": [ "# filter -- equivalent to df[df[\"midterm_score\"] > 90] in pandas\n", "high_scorers = df.filter(pl.col(\"midterm_score\") > 90)\n", "print(f\"high scorers: {len(high_scorers)} of {len(df)} students\")\n", "high_scorers.head()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "21", "metadata": {}, "source": [ "Combine conditions with `&` and `|`, same as pandas, but directly inside `filter` with `pl.col()` expressions:" ] }, { "cell_type": "code", "execution_count": null, "id": "22", "metadata": {}, "outputs": [], "source": [ "# combine conditions: failed AND no internet access\n", "failed_no_internet = df.filter(\n", " (pl.col(\"final_grade\") == \"F\") & (pl.col(\"has_internet\") == False) # noqa: E712\n", ")\n", "print(f\"failed with no internet: {len(failed_no_internet)}\")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "23", "metadata": {}, "source": [ "`with_columns` adds or replaces columns. The result is always a new DataFrame; the original is never modified:" ] }, { "cell_type": "code", "execution_count": null, "id": "24", "metadata": {}, "outputs": [], "source": [ "# Add average_marks -- equivalent to df[\"average_marks\"] = ... in pandas\n", "df = df.with_columns(average_marks=((pl.col(\"midterm_score\") + pl.col(\"final_score\") + pl.col(\"project_score\")) / 3))\n", "df.select([\"student_id\", \"average_marks\"]).head()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "25", "metadata": {}, "source": [ "
\n", " Common Mistake: Using bracket assignment instead of with_columns

\n", "df[\"average_marks\"] = ... raises an error in Polars, unlike pandas. Polars DataFrames are immutable by design: the only way to add or modify a column is to call .with_columns() and reassign the result. This enforces the copy-on-write discipline explicitly, rather than relying on the runtime to enforce it as pandas 3 does.\n", "
" ] }, { "attachments": {}, "cell_type": "markdown", "id": "26", "metadata": {}, "source": [ "
\n", " Activity 2 - Filter and Derive

\n", "Goal: Add a passed column that is True when average_marks >= 0.5, using .with_columns() and pl.col(\"average_marks\") >= 0.5. Then filter to passing students and print how many there are.\n", "
df = df.with_columns(\n",
    "    passed=(pl.col(\"average_marks\") >= 0.5)\n",
    ")\n",
    "df.filter(pl.col(\"passed\")).shape[0]
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "27", "metadata": {}, "outputs": [], "source": [ "# TODO: add \"passed\" column, filter to passing students, print count\n", "..." ] }, { "cell_type": "markdown", "id": "28", "metadata": {}, "source": [ "
\n", " Pro Tip: Use polars.selectors to select columns by type, not by name

\n", "polars.selectors (aliased as cs) provides type-based column selectors, so you can write expressions like \"all numeric columns\" or \"all string columns\" without listing names:\n", "\n", "
import polars.selectors as cs\n",
    "\n",
    "# Normalise every numeric column in one expression\n",
    "df.with_columns(\n",
    "    (cs.numeric() - cs.numeric().mean()) / cs.numeric().std()\n",
    ")\n",
    "\n",
    "# Select only string columns\n",
    "df.select(cs.string())
\n", "\n", "Common selectors: cs.numeric(), cs.string(), cs.boolean(), cs.temporal(), cs.by_dtype(pl.Float64). Combine them with | (union) or & (intersection): cs.numeric() | cs.boolean(). This is especially useful in feature engineering pipelines where the set of columns may change between runs.\n", "
" ] }, { "attachments": {}, "cell_type": "markdown", "id": "29", "metadata": {}, "source": [ "## 3. GroupBy and Aggregation" ] }, { "attachments": {}, "cell_type": "markdown", "id": "30", "metadata": {}, "source": [ "Polars `group_by` uses the same split-apply-combine concept as pandas, but the aggregation is specified as a list of expressions inside `.agg()`, not as a method chained after `.mean()` or `.sum()`:" ] }, { "cell_type": "code", "execution_count": null, "id": "31", "metadata": {}, "outputs": [], "source": [ "df.group_by(\"program\").agg(\n", " pl.col(\"average_marks\").mean().alias(\"mean_marks\"),\n", " pl.col(\"average_marks\").std().alias(\"std_marks\"),\n", " pl.len().alias(\"count\"),\n", ").sort(\"program\")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "32", "metadata": {}, "source": [ "
\n", " Key Concept: .agg() takes a list of expressions, each producing one output column

\n", "In pandas, groupby(\"program\")[\"average_marks\"].agg([\"mean\", \"std\", \"count\"]) returns a DataFrame with the stats as columns. In Polars the same result comes from passing named expressions to .agg(): pl.col(\"average_marks\").mean().alias(\"mean_marks\"). The explicit .alias() is required when the default auto-generated name would be ambiguous.\n", "
" ] }, { "attachments": {}, "cell_type": "markdown", "id": "33", "metadata": {}, "source": [ "Grouping by more than one column works the same way:" ] }, { "cell_type": "code", "execution_count": null, "id": "34", "metadata": {}, "outputs": [], "source": [ "df.group_by([\"program\", \"gender\"]).agg(\n", " pl.col(\"average_marks\").mean().alias(\"mean_marks\"),\n", ").sort([\"program\", \"gender\"])" ] }, { "attachments": {}, "cell_type": "markdown", "id": "35", "metadata": {}, "source": [ "
\n", " Pro Tip: Sort the result explicitly: group_by order is not guaranteed

\n", "Polars group_by does not preserve or sort the output by the grouping key. Always chain .sort(\"key_col\") after .agg() if the output order matters. The pandas equivalent sorted by default in older versions but no longer guarantees it either.\n", "
" ] }, { "attachments": {}, "cell_type": "markdown", "id": "36", "metadata": {}, "source": [ "
\n", " Activity 3 - Pass Rate by Program

\n", "Goal: Group df by program and compute, for each group, the fraction of students whose continue_drop is False. In Polars, (pl.col(\"continue_drop\") == False).mean() gives the fraction directly.\n", "
df.group_by(\"program\").agg(\n",
    "    drop_fraction=(pl.col(\"continue_drop\") == False).mean()\n",
    ").sort(\"program\")
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "37", "metadata": {}, "outputs": [], "source": [ "# TODO: group by program, compute pass fraction per program, sort by program\n", "..." ] }, { "attachments": {}, "cell_type": "markdown", "id": "38", "metadata": {}, "source": [ "## 4. LazyFrame and Lazy Evaluation" ] }, { "attachments": {}, "cell_type": "markdown", "id": "39", "metadata": {}, "source": [ "Every Polars operation above ran eagerly: it computed the result immediately and returned a new DataFrame. `df.lazy()` switches to lazy mode. In lazy mode, operations build a *query plan* instead of executing immediately. `.collect()` runs the optimized plan." ] }, { "cell_type": "raw", "id": "40", "metadata": { "raw_mimetype": "text/markdown" }, "source": [ "> **Lazy evaluation: plan builds up, data reads only on .collect()**\n", "\n", "```{mermaid}\n", "flowchart LR\n", " A[\"scan_csv('file.csv')\n", "no data read yet\"] --> B[\"filter(col > 60)\n", "adds predicate\"]\n", " B --> C[\"group_by('program')\n", "adds aggregation\"]\n", " C --> D[\"select(['program','mean'])\n", "projects columns\"]\n", " D -->|\".collect()\"| E[\"query optimised:\n", "1. predicate pushdown\n", "2. projection pruning\n", "3. execute in chunks\"]\n", " E --> F[\"DataFrame\n", "only now in memory\"]\n", "\n", " style A fill:#F5F3FF,stroke:#7C3AED,color:#3B0764\n", " style E fill:#EAF3FA,stroke:#0369A1,color:#0C4A6E\n", " style F fill:#EBF5F0,stroke:#059669,color:#065F46\n", "```\n" ] }, { "cell_type": "code", "execution_count": null, "id": "41", "metadata": {}, "outputs": [], "source": [ "# Build a query plan -- nothing executes yet\n", "lazy_query = (\n", " df.lazy()\n", " .filter(pl.col(\"has_internet\") == 1)\n", " .group_by(\"program\")\n", " .agg(pl.col(\"average_marks\").mean().alias(\"mean_marks\"))\n", " .sort(\"mean_marks\", descending=True)\n", ")\n", "\n", "# Inspect the query plan before running\n", "print(lazy_query.explain())" ] }, { "cell_type": "code", "execution_count": null, "id": "42", "metadata": {}, "outputs": [], "source": [ "# Execute the plan and get the result\n", "lazy_query.collect()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "43", "metadata": {}, "source": [ "
\n", " Key Concept: Lazy evaluation lets Polars optimize the query before running it

\n", "When you call .collect(), Polars analyses the whole query plan: it pushes filters as early as possible (predicate pushdown), drops columns that are not needed (projection pushdown), and parallelizes independent operations. The same query in eager mode runs each step sequentially with no cross-step optimization. For large datasets the difference is significant; for small ones it is negligible. The API is identical apart from the .lazy() / .collect() bookends.\n", "
" ] }, { "attachments": {}, "cell_type": "markdown", "id": "44", "metadata": {}, "source": [ "
\n", " Common Mistake: Forgetting .collect() and treating a LazyFrame as a result

\n", "df.lazy().filter(...).group_by(...).agg(...) returns a LazyFrame, not a DataFrame. Printing it shows the query plan, not the data. Call .collect() to get a real DataFrame. If you need the shape, columns, or to iterate over rows, you must collect first.\n", "
" ] }, { "attachments": {}, "cell_type": "markdown", "id": "45", "metadata": {}, "source": [ "
\n", " Activity 4 - Lazy Top Schools

\n", "Goal: Rewrite the top-schools capstone from Part 8 using a lazy query: filter to students with internet == 1, group by school_id, take the mean average_marks, sort descending, and .collect() the top 5.\n", "
top_schools = (\n",
    "    df.lazy()\n",
    "    .filter(pl.col(\"has_internet\") == 1)\n",
    "    .group_by(\"school_id\")\n",
    "    .agg(pl.col(\"average_marks\").mean().alias(\"mean_marks\"))\n",
    "    .sort(\"mean_marks\", descending=True)\n",
    "    .limit(5)\n",
    "    .collect()\n",
    ")\n",
    "top_schools
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "46", "metadata": {}, "outputs": [], "source": [ "# TODO: lazy top-schools query\n", "..." ] }, { "cell_type": "markdown", "id": "47", "metadata": {}, "source": [ "## 4.5 SQL Interface\n", "\n", "Polars 1.x ships a built-in SQL interface. `pl.SQLContext` registers one or more DataFrames under names and lets you query them with standard SQL. This is useful for teams comfortable with SQL, for migrating existing SQL queries to Polars, and for ad-hoc exploration without learning the full expression API:\n", "\n", "
\n", " Key Concept: SQL queries run on lazy DataFrames and benefit from the same query optimization

\n", "pl.SQLContext compiles SQL into Polars' lazy query plan. The optimizer applies the same predicate pushdown and projection pushdown as a hand-written .lazy().filter().select() chain. You get SQL's readability and Polars' performance for free.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "48", "metadata": {}, "outputs": [], "source": [ "import polars as pl\n", "\n", "# Register DataFrames under SQL table names\n", "ctx = pl.SQLContext(students=df, eager=True)\n", "\n", "# Query 1: top-performing programs — pure SQL\n", "result = ctx.execute(\"\"\"\n", " SELECT\n", " program,\n", " AVG(final_score) AS mean_score,\n", " COUNT(*) AS n_students\n", " FROM students\n", " GROUP BY program\n", " ORDER BY mean_score DESC\n", "\"\"\")\n", "print(result)\n", "\n", "# Query 2: filter + select — same as df.filter().select() but SQL syntax\n", "high_scorers_sql = ctx.execute(\"\"\"\n", " SELECT student_id, final_score\n", " FROM students\n", " WHERE final_score > 0.9\n", " ORDER BY final_score DESC\n", " LIMIT 5\n", "\"\"\")\n", "print(high_scorers_sql)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "49", "metadata": {}, "source": [ "## 5. Date/Time in Polars" ] }, { "attachments": {}, "cell_type": "markdown", "id": "50", "metadata": {}, "source": [ "Polars can parse dates directly at read time with `try_parse_dates=True`. There is no separate `to_datetime` step for standard ISO-format date strings:" ] }, { "cell_type": "code", "execution_count": null, "id": "51", "metadata": {}, "outputs": [], "source": [ "attendance = pl.read_csv(\"data/daily_attendance.csv\", try_parse_dates=True)\n", "print(attendance.dtypes)\n", "attendance.head()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "52", "metadata": {}, "source": [ "
\n", " Key Concept: Polars parses dates at load time; pandas requires an explicit conversion step

\n", "In pandas, date columns read as str dtype until you call pd.to_datetime(). Polars with try_parse_dates=True infers and parses ISO-format dates automatically during read_csv, producing a pl.Date or pl.Datetime column. This means filtering and arithmetic work on dates immediately, without a conversion step.\n", "
" ] }, { "attachments": {}, "cell_type": "markdown", "id": "53", "metadata": {}, "source": [ "Date-component extraction uses the `.dt` namespace, similar to pandas:" ] }, { "cell_type": "code", "execution_count": null, "id": "54", "metadata": {}, "outputs": [], "source": [ "# Extract year, month, and weekday from the date column\n", "attendance.select(\n", " [\n", " \"date\",\n", " pl.col(\"date\").dt.year().alias(\"year\"),\n", " pl.col(\"date\").dt.month().alias(\"month\"),\n", " pl.col(\"date\").dt.weekday().alias(\"weekday\"), # Mon=1 ... Sun=7\n", " ]\n", ").head()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "55", "metadata": {}, "source": [ "Resampling to weekly or monthly means uses `group_by_dynamic` instead of pandas' `resample`:" ] }, { "cell_type": "code", "execution_count": null, "id": "56", "metadata": {}, "outputs": [], "source": [ "# Monthly mean attendance rate per school using group_by_dynamic\n", "school_300 = attendance.filter(pl.col(\"school_id\") == 300)\n", "\n", "school_300.sort(\"date\").group_by_dynamic(\"date\", every=\"1mo\", group_by=\"school_id\").agg(\n", " pl.col(\"attendance_rate\").mean().alias(\"monthly_mean\")\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "57", "metadata": {}, "source": [ "
\n", " Pro Tip: group_by_dynamic replaces resample in Polars

\n", "series.resample(\"M\").mean() in pandas becomes df.sort(\"date\").group_by_dynamic(\"date\", every=\"1mo\").agg(...) in Polars. The sort before group_by_dynamic is required: Polars raises an error if the time column is not sorted.\n", "
" ] }, { "attachments": {}, "cell_type": "markdown", "id": "58", "metadata": {}, "source": [ "
\n", " Activity 5 - Monthly Attendance by School

\n", "Goal: Use group_by_dynamic to compute the mean attendance_rate per school per month across all schools in the attendance table. Sort by date before grouping.\n", "
attendance.sort(\"date\").group_by_dynamic(\n",
    "    \"date\", every=\"1mo\", group_by=\"school_id\"\n",
    ").agg(\n",
    "    pl.col(\"attendance_rate\").mean().alias(\"monthly_mean\")\n",
    ").sort([\"school_id\", \"date\"])
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "59", "metadata": {}, "outputs": [], "source": [ "# TODO: monthly mean attendance per school using group_by_dynamic\n", "..." ] }, { "attachments": {}, "cell_type": "markdown", "id": "60", "metadata": {}, "source": [ "## 6. Migration Patterns: pandas to Polars" ] }, { "attachments": {}, "cell_type": "markdown", "id": "61", "metadata": {}, "source": [ "The table below maps the most common pandas operations from Parts 8-10 to their Polars equivalents. The concepts are the same; the API surface is different:" ] }, { "attachments": {}, "cell_type": "markdown", "id": "62", "metadata": {}, "source": [ "| Task | pandas | Polars |\n", "|---|---|---|\n", "| Load CSV | `pd.read_csv(path)` | `pl.read_csv(path, try_parse_dates=True)` |\n", "| Select columns | `df[[\"a\", \"b\"]]` | `df.select([\"a\", \"b\"])` |\n", "| Filter rows | `df[df[\"col\"] > 0.5]` | `df.filter(pl.col(\"col\") > 0.5)` |\n", "| Add column | `df[\"c\"] = df[\"a\"] + df[\"b\"]` | `df.with_columns(c=pl.col(\"a\") + pl.col(\"b\"))` |\n", "| Rename column | `df.rename(columns={\"old\": \"new\"})` | `df.rename({\"old\": \"new\"})` |\n", "| GroupBy + agg | `df.groupby(\"k\")[\"v\"].mean()` | `df.group_by(\"k\").agg(pl.col(\"v\").mean())` |\n", "| Sort | `df.sort_values(\"col\", ascending=False)` | `df.sort(\"col\", descending=True)` |\n", "| Missing values | `df.isna().sum()` | `df.null_count()` |\n", "| Fill missing | `df[\"col\"].fillna(val)` | `df.with_columns(pl.col(\"col\").fill_null(val))` |\n", "| One-hot encode | `pd.get_dummies(df[\"col\"])` | `df.to_dummies(\"col\")` |\n", "| Value counts | `df[\"col\"].value_counts()` | `df[\"col\"].value_counts()` |\n", "| Lazy query | _(eager only)_ | `df.lazy()` ... `.collect()` |\n", "| Resample | `series.resample(\"M\").mean()` | `df.sort(\"date\").group_by_dynamic(\"date\", every=\"1mo\").agg(...)` |" ] }, { "attachments": {}, "cell_type": "markdown", "id": "63", "metadata": {}, "source": [ "
\n", " Key Concept: Polars is strict; pandas is permissive

\n", "Polars raises errors early: wrong dtype in a column, missing .collect() on a LazyFrame, unsorted data before group_by_dynamic. Pandas absorbs most of these quietly, sometimes silently producing a wrong result. The Polars errors feel strict at first but save time in larger pipelines where a wrong-but-valid pandas result would surface much later.\n", "
" ] }, { "cell_type": "markdown", "id": "64", "metadata": {}, "source": [ "## 7. Window Functions with `over()` and Streaming Large Files\n", "\n", "Two Polars strengths that have no direct pandas equivalent: `over()` for grouped window calculations without reducing rows, and `scan_csv`/`scan_parquet` for processing files that do not fit in memory." ] }, { "cell_type": "markdown", "id": "65", "metadata": {}, "source": [ "### 7a. The `over()` Expression\n", "\n", "In pandas, adding a per-group statistic back to the original rows requires `groupby().transform()`. In Polars, `pl.col(\"x\").mean().over(\"group\")` does the same thing inside a single `with_columns()` call. It computes the aggregation per group and broadcasts the result back to every row, aligned to the original index automatically." ] }, { "cell_type": "code", "execution_count": null, "id": "66", "metadata": {}, "outputs": [], "source": [ "# Add the per-program mean and std to every student row, in one expression\n", "df = df.with_columns(\n", " program_mean=pl.col(\"average_marks\").mean().over(\"program\"),\n", " program_std=pl.col(\"average_marks\").std().over(\"program\"),\n", ")\n", "\n", "# Z-score: how many standard deviations from the program mean?\n", "df = df.with_columns(marks_zscore=(pl.col(\"average_marks\") - pl.col(\"program_mean\")) / pl.col(\"program_std\"))\n", "\n", "df.select([\"student_id\", \"program\", \"average_marks\", \"program_mean\", \"marks_zscore\"]).head(6)" ] }, { "cell_type": "markdown", "id": "67", "metadata": {}, "source": [ "
\n", " Key Concept: over() computes group stats without reducing rows

\n", "group_by().agg() produces one row per group: the result has as many rows as there are unique group values. .over(\"group\") inside with_columns() keeps all original rows and attaches the group stat to each one. This is the right tool for feature engineering: every student in the \"Engineering\" program gets the engineering mean attached to their row, ready for modelling without a join.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "68", "metadata": {}, "outputs": [], "source": [ "# Rolling mean per school using over() -- group-aware rolling window\n", "import polars as pl\n", "\n", "attendance = pl.read_csv(\"data/daily_attendance.csv\", try_parse_dates=True)\n", "\n", "attendance_with_rolling = attendance.sort([\"school_id\", \"date\"]).with_columns(\n", " rolling_avg=pl.col(\"attendance_rate\").rolling_mean(window_size=5, min_periods=3).over(\"school_id\")\n", ")\n", "\n", "attendance_with_rolling.filter(pl.col(\"school_id\") == 300).head(8)" ] }, { "cell_type": "markdown", "id": "69", "metadata": {}, "source": [ "
\n", " Activity 6 - Per-Program Z-Score

\n", "\n", "Goal: Using over(), add a column gender_mean_marks that is the mean average_marks within each gender group. Then add gender_zscore = (average_marks - gender_mean_marks) / gender_std_marks. Print the top 3 students by z-score in each gender.\n", "
df_z = df.with_columns(\n",
    "    gender_mean = pl.col(\"average_marks\").mean().over(\"gender\"),\n",
    "    gender_std  = pl.col(\"average_marks\").std().over(\"gender\"),\n",
    ").with_columns(\n",
    "    gender_zscore = (pl.col(\"average_marks\") - pl.col(\"gender_mean\")) / pl.col(\"gender_std\")\n",
    ")\n",
    "df_z.sort(\"gender_zscore\", descending=True).select([\"student_id\", \"gender\", \"average_marks\", \"gender_zscore\"]).head(6)
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "70", "metadata": {}, "outputs": [], "source": [ "# TODO: per-gender z-score using over()\n", "..." ] }, { "cell_type": "markdown", "id": "71", "metadata": {}, "source": [ "### 7b. Streaming: Processing Files Larger Than Memory\n", "\n", "`pl.scan_csv()` and `pl.scan_parquet()` read a file lazily: no data loads until `.collect()` is called. Combined with Polars' streaming engine, filters and projections execute in chunks, so the working memory never exceeds what the query actually needs, even if the source file is larger than RAM.\n", "\n", "The API is identical to eager reads; the difference is `scan_` instead of `read_`." ] }, { "cell_type": "code", "execution_count": null, "id": "72", "metadata": {}, "outputs": [], "source": [ "# scan_csv: no data loads here, only a query plan is built\n", "lazy_scan = (\n", " pl.scan_csv(\"data/university_analytics.csv\")\n", " .filter(pl.col(\"final_score\") > 80)\n", " .select([\"student_id\", \"program\", \"final_score\"])\n", " .sort(\"final_score\", descending=True)\n", ")\n", "\n", "# .collect() executes with streaming (chunks through the file)\n", "result = lazy_scan.collect(streaming=True)\n", "print(f\"Rows matching filter: {len(result)}\")\n", "result.head()" ] }, { "cell_type": "markdown", "id": "73", "metadata": {}, "source": [ "
\n", " Pro Tip: Use Parquet instead of CSV for large files

\n", "CSV is plain text: Polars must read every byte to find the columns you asked for. Parquet is columnar: scan_parquet() reads only the columns referenced in your query, skipping all others. A file with 50 columns where you only need 3 loads roughly 3/50th of the data. For production pipelines with large files, converting a CSV to Parquet once and scanning Parquet from then on is a common and effective optimisation.\n", "
# Convert once\n",
    "df.write_parquet(\"data/university_analytics.parquet\")\n",
    "\n",
    "# All future reads: columnar, faster, smaller\n",
    "pl.scan_parquet(\"data/university_analytics.parquet\")   .select([\"student_id\", \"final_score\"])   .filter(pl.col(\"final_score\") > 80)   .collect(streaming=True)
\n", "
" ] }, { "cell_type": "markdown", "id": "74", "metadata": {}, "source": [ "
\n", " Common Mistake: Not sorting before a rolling window in Polars

\n", "pl.col(\"rate\").rolling_mean(5).over(\"school_id\") operates on whatever order the rows are in. Without .sort([\"school_id\", \"date\"]) first, the 5-row window picks up rows in an arbitrary order and produces meaningless values. Polars does not sort automatically; it raises an error only for group_by_dynamic. For rolling_mean().over(), the wrong-order mistake is silent.\n", "
" ] }, { "attachments": {}, "cell_type": "markdown", "id": "75", "metadata": {}, "source": [ "## Capstone: Reproduce the pandas Capstone Exercises in Polars" ] }, { "attachments": {}, "cell_type": "markdown", "id": "76", "metadata": {}, "source": [ "
\n", " Capstone Exercise - Top Schools, Risk Report, and Regional Summary

\n", "Goal: Reproduce three capstones from earlier parts using Polars:\n", "
    \n", "
  1. Top schools (Part 8): Filter to students with internet == 1, group by school_id, compute mean average_marks, sort descending, show top 5.
  2. \n", "
  3. Risk report (Part 9): Add a grade column based on average_marks using pl.when().then().otherwise(). Print grade distribution as proportions.
  4. \n", "
  5. Regional summary (Part 10): Create a schools DataFrame mapping school_id to region, join it onto df, and compute mean marks per region.
  6. \n", "
\n", "
# 1. Top schools\n",
    "top_schools = (\n",
    "    df.lazy()\n",
    "    .filter(pl.col(\"has_internet\") == 1)\n",
    "    .group_by(\"school_id\")\n",
    "    .agg(pl.col(\"average_marks\").mean().alias(\"mean_marks\"))\n",
    "    .sort(\"mean_marks\", descending=True)\n",
    "    .limit(5)\n",
    "    .collect()\n",
    ")\n",
    "\n",
    "# 2. Risk report -- pl.when().then().otherwise() replaces df.apply(letter_grade)\n",
    "df_graded = df.with_columns(\n",
    "    grade=pl.when(pl.col(\"average_marks\") >= 0.8).then(pl.lit(\"A\"))\n",
    "           .when(pl.col(\"average_marks\") >= 0.6).then(pl.lit(\"B\"))\n",
    "           .when(pl.col(\"average_marks\") >= 0.4).then(pl.lit(\"C\"))\n",
    "           .otherwise(pl.lit(\"D\"))\n",
    ")\n",
    "grade_distribution = df_graded[\"grade\"].value_counts(normalize=True).sort(\"grade\")\n",
    "\n",
    "# 3. Regional summary\n",
    "import polars as pl\n",
    "school_ids = df[\"school_id\"].unique().sort()\n",
    "regions = [\"North\", \"South\", \"East\", \"West\"]\n",
    "schools = pl.DataFrame({\n",
    "    \"school_id\": school_ids,\n",
    "    \"region\": [regions[i % 4] for i in range(len(school_ids))],\n",
    "})\n",
    "merged = df.join(schools, on=\"school_id\", how=\"left\")\n",
    "regional_summary = merged.group_by(\"region\").agg(\n",
    "    pl.col(\"average_marks\").mean().alias(\"mean_marks\")\n",
    ").sort(\"region\")
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "77", "metadata": {}, "outputs": [], "source": [ "# TODO: reproduce top schools, risk report, and regional summary in Polars\n", "..." ] }, { "cell_type": "markdown", "id": "78", "metadata": {}, "source": [ "## Further Reading\n", "\n", "| Resource | Why it matters |\n", "|---|---|\n", "| [Polars documentation: Getting started](https://docs.pola.rs) | The primary reference; the user guide covers lazy evaluation, the expression API, and performance tips |\n", "| [Polars migration guide from pandas](https://docs.pola.rs/user-guide/migration/pandas/) | A direct comparison of common pandas idioms rewritten in Polars; read this alongside this notebook |\n", "| Vink, R. (2023). [Modern Polars](https://kevinheavey.github.io/modern-polars/) | Free online book that mirrors the official pandas docs structure, useful for side-by-side comparison |\n", "| [Polars expression cheat sheet](https://docs.pola.rs/user-guide/expressions/) | `over()`, `rolling_*`, selectors, and string ops in one place |\n", "| Ritchie Vink, creator of Polars, explains the design in the [Polars blog](https://pola.rs/posts/) | The posts on the expression API and lazy evaluation explain why certain pandas patterns do not translate |" ] }, { "attachments": {}, "cell_type": "markdown", "id": "79", "metadata": {}, "source": [ "## Summary\n", "\n", "| Concept | Key rule |\n", "|---|---|\n", "| `pl.read_csv(try_parse_dates=True)` | Load data with automatic date inference |\n", "| `df.schema` | Dict-like view of column names and Polars dtypes |\n", "| `df.select([...])` | Select columns; equivalent to `df[[\"a\", \"b\"]]` in pandas |\n", "| `pl.col(\"name\")` | Column expression; compose with arithmetic, comparisons, and string ops |\n", "| `df.filter(expr)` | Keep rows where the expression is True |\n", "| `df.with_columns(alias=expr)` | Add or overwrite columns; returns a new DataFrame |\n", "| `df.group_by(key).agg([exprs])` | Split-apply-combine with explicit expression list |\n", "| `.alias(\"name\")` | Rename an expression's output column |\n", "| `df.lazy()` ... `.collect()` | Defer execution so Polars can optimize the query plan |\n", "| `pl.when().then().otherwise()` | Conditional column derivation; replaces `apply(lambda ...)` |\n", "| `df.join(other, on=key, how=...)` | SQL-style join; same `how` values as pandas `merge` |\n", "| `.dt.year()`, `.dt.month()` | Date-component extraction in the `.dt` namespace |\n", "| `group_by_dynamic(\"date\", every=\"1mo\")` | Time-window grouping; requires sorted input |\n", "| `df.null_count()` | Count missing values per column; replaces `df.isna().sum()` |\n", "| `.over(\"group\")` | Broadcast a group aggregation back to every row; no join needed |\n", "| `pl.col(\"x\").rolling_mean(n).over(\"g\")` | Per-group rolling window; sort by group and time first |\n", "| `pl.scan_csv()` / `pl.scan_parquet()` | Lazy file scan; push-down filters execute without loading full file |\n", "| `.collect(streaming=True)` | Execute in chunks; keeps peak memory low for files larger than RAM |" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.12" } }, "nbformat": 4, "nbformat_minor": 5 }