{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "---\n", "title: \"Part 11: Polars & the Expression API\"\n", "---" ] }, { "attachments": {}, "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "[](https://colab.research.google.com/github/sambaiga/ds-mlops-path/blob/main/tutorials/01-python-basics/11-polars.ipynb) [](https://raw.githubusercontent.com/sambaiga/ds-mlops-path/main/tutorials/01-python-basics/11-polars.ipynb)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "**DS-MLOps Data Analysis**\n", "\n", "**Python 3.12+ | Author: Anthony Faustine**\n", "\n", "## Before you begin\n", "\n", "This notebook continues from Part 10 (`10-combining-reshaping.ipynb`). You should be comfortable with pandas DataFrames, `groupby`, merging, and time-indexed data before starting here.\n", "\n", "Polars is a modern DataFrame library written in Rust. It uses an expression-based API and a lazy evaluation model that lets it optimize queries before running them. The same two datasets from Parts 8-10 are used here so you can compare pandas and Polars syntax on familiar ground.\n", "\n", "::: {.callout-note collapse=\"true\" icon=false}\n", "## Topics covered\n", "\n", "| Topic | Why it matters |\n", "|---|---|\n", "| **DataFrame construction** | Polars and pandas share the concept; the API differs in key ways |\n", "| **Expression API** | `pl.col()` expressions replace bracket indexing and make queries composable |\n", "| **GroupBy and aggregation** | Same split-apply-combine logic, cleaner syntax |\n", "| **LazyFrame** | Defer execution and let Polars optimize the query plan before running |\n", "| **Date/time** | Polars parses and handles dates natively, no `.dt` accessor for most operations |\n", "| **Migration patterns** | Side-by-side pandas-to-Polars translations for the operations you already know |\n", "| **`over()` and streaming** | Group-aware window stats without reducing rows; process files larger than RAM in chunks |\n", ":::\n", "\n", "> Callout markers used throughout this notebook are explained on the [book cover page](../../index.qmd#callout-guide)." ] }, { "attachments": {}, "cell_type": "markdown", "id": "3", "metadata": {}, "source": [ "::: {.callout-note collapse=\"true\" icon=false}\n", "## Learning Objectives\n", "\n", "By the end of Part 11 (Polars) you will be able to:\n", "\n", "| # | Skill | Covered in |\n", "|---|---|---|\n", "| 1 | Build and inspect Polars DataFrames and compare them to pandas | Sec. 1 |\n", "| 2 | Use `pl.col()` expressions to select, filter, and derive columns | Sec. 2 |\n", "| 3 | Aggregate data with `group_by` and the expression API | Sec. 3 |\n", "| 4 | Understand eager vs lazy evaluation and use `LazyFrame` | Sec. 4 |\n", "| 5 | Parse and manipulate dates with Polars's built-in datetime support | Sec. 5 |\n", "| 6 | Translate common pandas operations to Polars equivalents | Sec. 6 |\n", "| 7 | Compute per-group window stats with `over()` and process large files with `scan_csv` | Sec. 7 |\n", ":::\n" ] }, { "cell_type": "code", "execution_count": null, "id": "4", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import polars as pl\n", "\n", "# Polars reads CSV directly; no separate parse step needed for standard columns\n", "df = pl.read_csv(\"data/university_analytics.csv\")\n", "print(f\"shape : {df.shape}\")\n", "df.head()" ] }, { "cell_type": "markdown", "id": "5", "metadata": {}, "source": [ "## 0. When pandas Is Not Enough\n", "\n", "Your pandas pipeline works perfectly at 50,000 rows. The `groupby` runs in a second, the merge finishes before you blink, and everything fits in memory with room to spare. Then someone drops a 10-million-row export on your desk. The same pipeline now takes three minutes, swaps to disk halfway through, and your laptop fan starts spinning like it is trying to take off.\n", "\n", "This is not a pandas bug. It is a design trade-off. pandas processes one column at a time on a single CPU thread, using a Python object model that was designed for flexibility rather than throughput. For most interactive data science that trade-off is exactly right. But when data gets large enough to feel slow, you need a different tool.\n", "\n", "**Polars** ([pola.rs](https://pola.rs)) was written from scratch in Rust by Ritchie Vink in 2020. It stores data in Apache Arrow format, executes operations across all CPU cores in parallel, and evaluates expressions *lazily* — building a query plan and optimising it before touching a single row. The same operations that take minutes in pandas can take seconds in Polars, without changing what the code looks like.\n", "\n", "### How it compares\n", "\n", "| | pandas | Polars |\n", "| --- | --- | --- |\n", "| Core language | Python / C | Rust |\n", "| Execution | Single-threaded | Multi-threaded |\n", "| Memory model | NumPy / Python objects | Apache Arrow |\n", "| Evaluation | Eager (runs immediately) | Lazy by default (`scan_*` → `.collect()`) |\n", "| API style | Index-based, many methods | Expression-based, composable |\n", "| Ecosystem maturity | Very mature (2008) | Fast-growing (2020) |\n", "| When to use | Up to ~5M rows; default choice | Large files, performance-critical pipelines |\n", "\n", "You do not have to choose one forever. Many production pipelines read and clean with Polars for speed, then convert to pandas for the parts of the ML ecosystem that expect it.\n", "\n", "### Already in your environment\n", "\n", "```bash\n", "uv add polars # for a standalone project\n", "```\n", "\n", "Official docs and user guide: [docs.pola.rs](https://docs.pola.rs/)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "6", "metadata": {}, "source": [ "## 1. DataFrame Construction" ] }, { "attachments": {}, "cell_type": "markdown", "id": "7", "metadata": {}, "source": [ "Building a Polars DataFrame from a Python dict looks almost identical to pandas. The difference shows up in the schema: Polars always knows the dtype of every column, and it tells you explicitly:" ] }, { "cell_type": "code", "execution_count": null, "id": "8", "metadata": {}, "outputs": [], "source": [ "import polars as pl\n", "\n", "sample = pl.DataFrame(\n", " {\n", " \"student_id\": [\"s001\", \"s002\", \"s003\"],\n", " \"midterm_score\": [0.62, 0.78, 0.91],\n", " \"gender\": [\"F\", \"M\", \"F\"],\n", " }\n", ")\n", "print(sample.schema)\n", "sample" ] }, { "attachments": {}, "cell_type": "markdown", "id": "9", "metadata": {}, "source": [ "
pl.Utf8 for strings, pl.Float64 for floats, pl.Int64 for integers. Polars raises an error if you try to mix types within a column. Pandas 3's str dtype is conceptually similar but more permissive: it silently allows None values in a string column. Polars is stricter, which makes bugs surface earlier.\n",
"df.schema to see column names and dtypes in one dict-like viewdf.schema returns a Schema object, a mapping from column name to Polars dtype. It is the fastest way to confirm that read_csv inferred the right types before any analysis starts.\n",
"df, then use df.describe() to get summary statistics. Compare the dtype labels to the pandas output from Part 8.\n",
"print(df.schema)\n",
"df.describe()\n",
"pl.col() is an expression, not a valuepl.col(\"midterm_score\") describes a column reference, not the column itself. Expressions compose: pl.col(\"a\") + pl.col(\"b\") describes an addition without performing it. Polars evaluates all expressions in a .select() or .with_columns() call together, which is what lets it parallelize across columns automatically.\n",
"with_columnsdf[\"average_marks\"] = ... raises an error in Polars, unlike pandas. Polars DataFrames are immutable by design: the only way to add or modify a column is to call .with_columns() and reassign the result. This enforces the copy-on-write discipline explicitly, rather than relying on the runtime to enforce it as pandas 3 does.\n",
"passed column that is True when average_marks >= 0.5, using .with_columns() and pl.col(\"average_marks\") >= 0.5. Then filter to passing students and print how many there are.\n",
"df = df.with_columns(\n",
" passed=(pl.col(\"average_marks\") >= 0.5)\n",
")\n",
"df.filter(pl.col(\"passed\")).shape[0]\n",
"polars.selectors to select columns by type, not by namepolars.selectors (aliased as cs) provides type-based column selectors, so you can write expressions like \"all numeric columns\" or \"all string columns\" without listing names:\n",
"\n",
"import polars.selectors as cs\n",
"\n",
"# Normalise every numeric column in one expression\n",
"df.with_columns(\n",
" (cs.numeric() - cs.numeric().mean()) / cs.numeric().std()\n",
")\n",
"\n",
"# Select only string columns\n",
"df.select(cs.string())\n",
"\n",
"Common selectors: cs.numeric(), cs.string(), cs.boolean(), cs.temporal(), cs.by_dtype(pl.Float64). Combine them with | (union) or & (intersection): cs.numeric() | cs.boolean(). This is especially useful in feature engineering pipelines where the set of columns may change between runs.\n",
".agg() takes a list of expressions, each producing one output columngroupby(\"program\")[\"average_marks\"].agg([\"mean\", \"std\", \"count\"]) returns a DataFrame with the stats as columns. In Polars the same result comes from passing named expressions to .agg(): pl.col(\"average_marks\").mean().alias(\"mean_marks\"). The explicit .alias() is required when the default auto-generated name would be ambiguous.\n",
"group_by order is not guaranteedgroup_by does not preserve or sort the output by the grouping key. Always chain .sort(\"key_col\") after .agg() if the output order matters. The pandas equivalent sorted by default in older versions but no longer guarantees it either.\n",
"df by program and compute, for each group, the fraction of students whose continue_drop is False. In Polars, (pl.col(\"continue_drop\") == False).mean() gives the fraction directly.\n",
"df.group_by(\"program\").agg(\n",
" drop_fraction=(pl.col(\"continue_drop\") == False).mean()\n",
").sort(\"program\")\n",
".collect(), Polars analyses the whole query plan: it pushes filters as early as possible (predicate pushdown), drops columns that are not needed (projection pushdown), and parallelizes independent operations. The same query in eager mode runs each step sequentially with no cross-step optimization. For large datasets the difference is significant; for small ones it is negligible. The API is identical apart from the .lazy() / .collect() bookends.\n",
".collect() and treating a LazyFrame as a resultdf.lazy().filter(...).group_by(...).agg(...) returns a LazyFrame, not a DataFrame. Printing it shows the query plan, not the data. Call .collect() to get a real DataFrame. If you need the shape, columns, or to iterate over rows, you must collect first.\n",
"internet == 1, group by school_id, take the mean average_marks, sort descending, and .collect() the top 5.\n",
"top_schools = (\n",
" df.lazy()\n",
" .filter(pl.col(\"has_internet\") == 1)\n",
" .group_by(\"school_id\")\n",
" .agg(pl.col(\"average_marks\").mean().alias(\"mean_marks\"))\n",
" .sort(\"mean_marks\", descending=True)\n",
" .limit(5)\n",
" .collect()\n",
")\n",
"top_schools\n",
"pl.SQLContext compiles SQL into Polars' lazy query plan. The optimizer applies the same predicate pushdown and projection pushdown as a hand-written .lazy().filter().select() chain. You get SQL's readability and Polars' performance for free.\n",
"str dtype until you call pd.to_datetime(). Polars with try_parse_dates=True infers and parses ISO-format dates automatically during read_csv, producing a pl.Date or pl.Datetime column. This means filtering and arithmetic work on dates immediately, without a conversion step.\n",
"group_by_dynamic replaces resample in Polarsseries.resample(\"M\").mean() in pandas becomes df.sort(\"date\").group_by_dynamic(\"date\", every=\"1mo\").agg(...) in Polars. The sort before group_by_dynamic is required: Polars raises an error if the time column is not sorted.\n",
"group_by_dynamic to compute the mean attendance_rate per school per month across all schools in the attendance table. Sort by date before grouping.\n",
"attendance.sort(\"date\").group_by_dynamic(\n",
" \"date\", every=\"1mo\", group_by=\"school_id\"\n",
").agg(\n",
" pl.col(\"attendance_rate\").mean().alias(\"monthly_mean\")\n",
").sort([\"school_id\", \"date\"])\n",
".collect() on a LazyFrame, unsorted data before group_by_dynamic. Pandas absorbs most of these quietly, sometimes silently producing a wrong result. The Polars errors feel strict at first but save time in larger pipelines where a wrong-but-valid pandas result would surface much later.\n",
"over() computes group stats without reducing rowsgroup_by().agg() produces one row per group: the result has as many rows as there are unique group values. .over(\"group\") inside with_columns() keeps all original rows and attaches the group stat to each one. This is the right tool for feature engineering: every student in the \"Engineering\" program gets the engineering mean attached to their row, ready for modelling without a join.\n",
"over(), add a column gender_mean_marks that is the mean average_marks within each gender group. Then add gender_zscore = (average_marks - gender_mean_marks) / gender_std_marks. Print the top 3 students by z-score in each gender.\n",
"df_z = df.with_columns(\n",
" gender_mean = pl.col(\"average_marks\").mean().over(\"gender\"),\n",
" gender_std = pl.col(\"average_marks\").std().over(\"gender\"),\n",
").with_columns(\n",
" gender_zscore = (pl.col(\"average_marks\") - pl.col(\"gender_mean\")) / pl.col(\"gender_std\")\n",
")\n",
"df_z.sort(\"gender_zscore\", descending=True).select([\"student_id\", \"gender\", \"average_marks\", \"gender_zscore\"]).head(6)\n",
"scan_parquet() reads only the columns referenced in your query, skipping all others. A file with 50 columns where you only need 3 loads roughly 3/50th of the data. For production pipelines with large files, converting a CSV to Parquet once and scanning Parquet from then on is a common and effective optimisation.\n",
"# Convert once\n",
"df.write_parquet(\"data/university_analytics.parquet\")\n",
"\n",
"# All future reads: columnar, faster, smaller\n",
"pl.scan_parquet(\"data/university_analytics.parquet\") .select([\"student_id\", \"final_score\"]) .filter(pl.col(\"final_score\") > 80) .collect(streaming=True)\n",
"pl.col(\"rate\").rolling_mean(5).over(\"school_id\") operates on whatever order the rows are in. Without .sort([\"school_id\", \"date\"]) first, the 5-row window picks up rows in an arbitrary order and produces meaningless values. Polars does not sort automatically; it raises an error only for group_by_dynamic. For rolling_mean().over(), the wrong-order mistake is silent.\n",
"internet == 1, group by school_id, compute mean average_marks, sort descending, show top 5.grade column based on average_marks using pl.when().then().otherwise(). Print grade distribution as proportions.schools DataFrame mapping school_id to region, join it onto df, and compute mean marks per region.# 1. Top schools\n",
"top_schools = (\n",
" df.lazy()\n",
" .filter(pl.col(\"has_internet\") == 1)\n",
" .group_by(\"school_id\")\n",
" .agg(pl.col(\"average_marks\").mean().alias(\"mean_marks\"))\n",
" .sort(\"mean_marks\", descending=True)\n",
" .limit(5)\n",
" .collect()\n",
")\n",
"\n",
"# 2. Risk report -- pl.when().then().otherwise() replaces df.apply(letter_grade)\n",
"df_graded = df.with_columns(\n",
" grade=pl.when(pl.col(\"average_marks\") >= 0.8).then(pl.lit(\"A\"))\n",
" .when(pl.col(\"average_marks\") >= 0.6).then(pl.lit(\"B\"))\n",
" .when(pl.col(\"average_marks\") >= 0.4).then(pl.lit(\"C\"))\n",
" .otherwise(pl.lit(\"D\"))\n",
")\n",
"grade_distribution = df_graded[\"grade\"].value_counts(normalize=True).sort(\"grade\")\n",
"\n",
"# 3. Regional summary\n",
"import polars as pl\n",
"school_ids = df[\"school_id\"].unique().sort()\n",
"regions = [\"North\", \"South\", \"East\", \"West\"]\n",
"schools = pl.DataFrame({\n",
" \"school_id\": school_ids,\n",
" \"region\": [regions[i % 4] for i in range(len(school_ids))],\n",
"})\n",
"merged = df.join(schools, on=\"school_id\", how=\"left\")\n",
"regional_summary = merged.group_by(\"region\").agg(\n",
" pl.col(\"average_marks\").mean().alias(\"mean_marks\")\n",
").sort(\"region\")\n",
"