{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "---\n", "title: \"Data Schema Validation with Pandera\"\n", "---\n" ] }, { "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "[](https://colab.research.google.com/github/sambaiga/ds-mlops-path/blob/main/tutorials/02-dev-tools/08-pandera-schema-validation.ipynb) [](https://raw.githubusercontent.com/sambaiga/ds-mlops-path/main/tutorials/02-dev-tools/08-pandera-schema-validation.ipynb)" ] }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "**DS-MLOps Dev Tools**\n", "\n", "**Python 3.12+ | Author: Anthony Faustine**\n", "\n", "## Before you begin\n", "\n", "This notebook assumes you have completed [Part 19: Data Validation with Pydantic](07-pydantic-validation.ipynb). Pandera extends the same \"validate at the boundary\" principle to DataFrames: where Pydantic validates individual records, Pandera validates the schema of an entire table.\n", "\n", "The `grade-predictor` project continues here: a Pandera schema replaces the implicit assumptions about `university_analytics.csv` with explicit, testable contracts.\n", "\n", "> Callout markers used throughout this notebook are explained on the [book cover page](../../index.qmd#callout-guide)." ] }, { "cell_type": "markdown", "id": "3", "metadata": {}, "source": [ "::: {.callout-note collapse=\"true\" icon=false}\n", "## Learning Objectives\n", "\n", "By the end of Part 20 you will be able to:\n", "\n", "| # | Skill | Covered in |\n", "| --- | --- | --- |\n", "| 1 | Explain the difference between row-level (Pydantic) and schema-level (Pandera) validation | Sec. 1 |\n", "| 2 | Define a Pandera `DataFrameSchema` with column types and constraints | Sec. 2 |\n", "| 3 | Use the class-based API with `pa.DataFrameModel` for typed schemas | Sec. 3 |\n", "| 4 | Write custom element-wise and series-level checks | Sec. 4 |\n", "| 5 | Validate DataFrames in a pipeline and collect errors without stopping | Sec. 
5 |\n", "| 6 | Use Pandera schemas as pytest fixtures to document data contracts | Sec. 6 |\n", ":::\n" ] }, { "cell_type": "markdown", "id": "4", "metadata": {}, "source": [ "## 0. Pydantic Validated the Row. Who Validates the Table?\n", "\n", "You have a `StudentRecord` Pydantic model. It validates `midterm_score` is between 0 and 100, that `student_id` matches `S\\d{4}`, and that `program` is a non-empty string. Pydantic runs when a *single record* enters the system.\n", "\n", "Now you load `university_analytics.csv` into a DataFrame. There are 2,400 rows. You could loop through them with `StudentRecord.model_validate`, but that tells you nothing about the *table*: whether `student_id` is unique across all rows, whether the distribution of `program` values matches what you expect, whether the proportion of missing values in `has_internet` is within an acceptable bound. A row validator answers \"is this row correct?\". A schema validator answers \"is this table correct?\".\n", "\n", "**Pandera** ([pandera.readthedocs.io](https://pandera.readthedocs.io/)) is a statistical data validation library for DataFrames. You define a schema: column types, constraints, uniqueness, allowed values, statistical properties, and Pandera checks the whole DataFrame against it in one call. It supports pandas and Polars, integrates with pytest, and can generate synthetic data for testing.\n", "\n", "### Install\n", "\n", "```bash\n", "uv add pandera # or: pip install pandera\n", "```" ] }, { "cell_type": "raw", "id": "5", "metadata": { "raw_mimetype": "text/markdown" }, "source": [ "## 1. 
Row Validation vs Schema Validation\n", "\n", "```{mermaid}\n", "flowchart LR\n", " CSV[\"university_analytics.csv\"] --> DF[\"pandas DataFrame\\n2,400 rows × 10 cols\"]\n", " DF --> PA[\"Pandera schema\\nColumn types, constraints,\\nuniqueness, statistics\"]\n", " PA -->|\"schema violation\"| ERR[\"SchemaError\\ncell location + rule\"]\n", " PA -->|\"all checks pass\"| OK[\"validated DataFrame\\nsafe to use downstream\"]\n", "\n", " style OK fill:#EBF5F0,stroke:#059669,color:#065F46\n", " style ERR fill:#FEF2F2,stroke:#DC2626,color:#991B1B\n", " style PA fill:#EAF3FA,stroke:#0369A1,color:#0C4A6E\n", "```" ] }, { "cell_type": "markdown", "id": "6", "metadata": {}, "source": [ "Pydantic and Pandera answer different questions:\n", "\n", "| | Pydantic `BaseModel` | Pandera `DataFrameSchema` |\n", "| --- | --- | --- |\n", "| Unit | One record (row) | Whole DataFrame |\n", "| Checks | Type coercion, field constraints, cross-field | Column dtype, nullability, uniqueness, value ranges, statistical bounds |\n", "| Returns | Validated model instance | Validated DataFrame |\n", "| Error info | Field path + message | Row index + column + failed check |\n", "| Best for | API inputs, config objects | CSVs, pipeline data, feature tables |\n", "\n", "
Create a copy of `sample_df` with a duplicate `student_id`. Run `schema.validate()` on it and confirm it raises a `SchemaError` with a message about uniqueness.\n",
"dup_df = sample_df.copy()\n",
"dup_df.loc[1, \"student_id\"] = \"S0001\" # duplicate\n",
"schema.validate(dup_df) # should raise\n",
"DataFrameModel for reuse, DataFrameSchema for quick scriptsDataFrameModel is easier to subclass, document, and test: it reads like a dataclass and fits naturally alongside Pydantic models. DataFrameSchema is useful when you want to build a schema programmatically at runtime, e.g., from a config file or database metadata.\n",
"StudentDataSchema and add a semester column constrained to [\"Fall\", \"Spring\", \"Summer\"]. Validate a DataFrame that includes the column with valid values, then one that has an invalid value. Confirm only the second raises a SchemaError.\n",
"class ExtendedSchema(StudentDataSchema):\n",
" semester: Series[str] = pa.Field(isin=[\"Fall\", \"Spring\", \"Summer\"])\n",
"@pa.dataframe_check to GradeSchema that verifies midterm_score and final_score are not both 0 for the same student (a student with both scores at 0 is almost certainly an error, not a result). Confirm it passes on valid data and fails when you introduce a row with both at 0.\n",
"@pa.dataframe_check\n",
"def not_both_zero(cls, df: pd.DataFrame) -> pd.Series:\n",
" return ~((df[\"midterm_score\"] == 0) & (df[\"final_score\"] == 0))\n",
"schema.validate(df, lazy=False) (the default) raises a SchemaError on the first failure: fast and clear for development. schema.validate(df, lazy=True) collects every failure and raises a SchemaErrors (note the plural) at the end, better for production, where you want a full error report rather than a partial run.\n",
"sample_df: one with midterm_score=150.0, one with an invalid program, and one with a duplicate student_id. Call StudentDataSchema.validate(bad_df, lazy=True). Catch the SchemaErrors exception and print the failure_cases DataFrame showing all three failures at once.\n",
"pa.DataFrameModel.example() to generate test data automaticallyStudentDataSchema.example(size=50) generates 50 valid rows matching all constraints. This removes the need to hand-craft test fixtures for every new schema.\n",
"StudentDataSchema.example(size=20) to generate 20 synthetic rows. Confirm that the generated DataFrame passes StudentDataSchema.validate() without errors. Then confirm that if you corrupt one cell (set a score to 200), validation fails.\n",
"synthetic = StudentDataSchema.example(size=20)\n",
"StudentDataSchema.validate(synthetic) # should pass\n",
- Define `StudentDataSchema` in `grade_predictor/schemas.py` covering all columns of `university_analytics.csv`
- Update `load_and_validate` (from Part 19) to run Pandera schema validation after row-level Pydantic validation
- Add a `@pa.dataframe_check` that verifies the computed weighted average (using the weights from `PipelineConfig`) falls in [0, 100] for every row
- Write three tests, including one that expects a `SchemaError` and one that uses `example()` to generate synthetic data and confirms it passes
- Run `uv run pytest -v` and confirm all three pass